Chinese Word Segmentation Based On Direct Maximum Entropy 
Model
Wu-Guang Shi 
Peking University, Beijing, 100871, China 
shiwuguang@pku.edu.cn
Abstract
Chinese word segmentation is a fundamental and important issue in
Chinese information processing. In order to find a unified approach
to Chinese word segmentation, the author developed PCWS, a Chinese
lexical analyzer that uses a direct maximum entropy model. This paper
gives a general description of PCWS, together with the results and an
analysis of its performance at the Second International Chinese Word
Segmentation Bakeoff.
1 Introduction
Och and Ney (2002) present a framework, based on a direct maximum
entropy model, for building a machine translation system. The model
treats knowledge sources as feature functions and allows the system to
be extended easily by adding new feature functions. We believe the
model can provide a unified approach to Chinese word segmentation.
PCWS is a system built on this idea.
2 System Description 
PCWS consists of four components: word generation, disambiguation,
selection of the best word sequence, and output of the result. They
are described below.
2.1 Word Generation
Word generation involves two steps: (1) generation of the common words
listed in the dictionary, and (2) generation of the unknown words. The
unknown words handled by the system are numeric expressions, time
expressions, personal names, location names and organization names.
PCWS can recognize abbreviations of person names, but it fails to find
abbreviations of location names and organization names.
PCWS constructs an integrated segmentation graph. A node in the graph
is a minimal segmentation unit that cannot be split in any later
stage. A unit is a Chinese character, a punctuation mark, an Arabic
numeral string or an English character string. Every generated word is
an edge in the graph.
The integrated segmentation graph is built to avoid blind spots in
segmentation, but it makes the graph more complex and slows the system
down.
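The graph construction can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the paper: the helper names `atomize` and `build_graph` are hypothetical, the lexicon is a toy set, edges connect the positions between units (one common encoding of such a graph), and every single unit is kept as an edge so that at least one path always exists.

```python
import re

def atomize(sentence):
    """Split into minimal units: digit runs, ASCII letter runs, or single characters."""
    return re.findall(r"[0-9]+|[A-Za-z]+|\S", sentence)

def build_graph(units, lexicon, max_len=4):
    """Return edges (start, end, word); the word covers units[start:end]."""
    edges = []
    n = len(units)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = "".join(units[start:end])
            # Assumption: single units are always edges, so the graph stays connected.
            if end - start == 1 or word in lexicon:
                edges.append((start, end, word))
    return edges

# Toy input and toy lexicon (hypothetical, not taken from the paper).
units = atomize("北京大学2005年")
edges = build_graph(units, {"北京", "大学", "北京大学"})
```

In this sketch the numeral string `2005` stays one unit, and the overlapping edges for 北京, 大学 and 北京大学 are all kept, which is what makes the later path selection necessary.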
Every word will belong to a class. 
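Given a word $w_i$, its class $c_i$ is defined in Figure 1.

$$
c_i =
\begin{cases}
w_i & \text{iff } w_i \text{ is listed in the segmentation lexicon} \\
\mathrm{PER} & \text{iff } w_i \text{ is a person name} \\
\mathrm{LOC} & \text{iff } w_i \text{ is a location name} \\
\mathrm{ORG} & \text{iff } w_i \text{ is an organization name} \\
\mathrm{NUM} & \text{iff } w_i \text{ is a numeral expression} \\
\mathrm{TIME} & \text{iff } w_i \text{ is a time expression}
\end{cases}
$$

Figure 1: Class definition of word $w_i$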
2.2 Disambiguation
While constructing the graph, PCWS detects segmentation ambiguities
and classifies them into two classes: false ambiguities and true
ambiguities. The former are resolved simply by looking them up in a
table. For true ambiguities, PCWS collects the segmentation
information around them and gives an estimate for each possible
segmentation mode. These estimates are used when selecting the best
path.
2.3 Select the Best Word Sequence
This part consists of two processes: generating the candidate word
sequences and finding the best word sequence.
Let $s$ be a Chinese sentence, i.e. a character sequence, and let $W$
be the set of all possible word sequences given $s$. Let
$w = \{w_1, w_2, \ldots, w_N\}$ be a word sequence and
$c = \{c_1, c_2, \ldots, c_N\}$ the corresponding class sequence of
$w$. We use the Viterbi algorithm to generate the candidate paths. To
control the search space, all paths are ranked by a class mode score,
and the number of candidate paths generated at each node of the graph
cannot exceed a threshold that we set.
The class mode score we use can be written as

$$
\mathrm{Score}(w) = P_{\mathrm{generate}}(w \mid c)\, P_{\mathrm{context}}(c)
$$

$P_{\mathrm{context}}(c)$ and $P_{\mathrm{generate}}(w \mid c)$ are
similar to the models defined by Gao et al. (2003).
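The thresholded candidate generation can be sketched as follows. This is only an illustration under stated assumptions: the function name `n_best_paths` is hypothetical, the cost of a path is simplified to a sum of per-word costs (standing in for the negative logarithm of the class mode score, whose real form depends on the class sequence), and the toy costs are invented.

```python
import heapq

def n_best_paths(n_units, edges, cost, k=2):
    """Keep at most k lowest-cost partial paths at each position 0..n_units."""
    by_start = {}
    for start, end, word in edges:
        by_start.setdefault(start, []).append((end, word))
    best = {0: [(0.0, [])]}
    for pos in range(n_units + 1):
        # All edges run forward, so every path into `pos` is known by now; prune to k.
        best[pos] = heapq.nsmallest(k, best.get(pos, []), key=lambda item: item[0])
        for total, path in best[pos]:
            for end, word in by_start.get(pos, []):
                best.setdefault(end, []).append((total + cost(word), path + [word]))
    return best[n_units]

# Toy graph over four units and invented costs (assumptions, not the paper's data).
toy_edges = [(0, 1, "北"), (1, 2, "京"), (2, 3, "大"), (3, 4, "学"),
             (0, 2, "北京"), (2, 4, "大学"), (0, 4, "北京大学")]
toy_cost = {"北京大学": 2.0, "北京": 1.5, "大学": 1.5}
candidates = n_best_paths(4, toy_edges, lambda w: toy_cost.get(w, 4.0), k=2)
```

The parameter `k` plays the role of the threshold: a larger value keeps more candidates for the later re-ranking step at the price of a larger search space.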
We use the direct maximum entropy model to find the best word
sequence. Let $\hat{w}$ be the best path we need and $W^*$ the
candidate set. Given the direct maximum entropy model and neglecting
its renormalization, we obtain the following decision rule:

$$
\hat{w} = \arg\max_{w \in W^*} \sum_{m=1}^{M} \lambda_m h_m(w, s)
$$

Here $h_m(w, s)$ is a feature function of the word sequence, and the
parameter $\lambda_m$ is the weight of that feature.
In PCWS, we define five feature functions:
1) Context feature

$$
h_1(w, s) = -\log P_{\mathrm{context}}(c)
$$

2) Candidate person name feature

$$
h_2(w, s) = \sum_{i=1}^{N} \mathrm{genper}(w_i \mid c_i)
$$

Here

$$
\mathrm{genper}(w_i \mid c_i) =
\begin{cases}
-\log P_{\mathrm{generate}}(w_i \mid c_i) & \text{iff } c_i = \mathrm{PER} \\
0 & \text{else}
\end{cases}
$$

3) Candidate location name feature

$$
h_3(w, s) = \sum_{i=1}^{N} \mathrm{genloc}(w_i \mid c_i)
$$

Here

$$
\mathrm{genloc}(w_i \mid c_i) =
\begin{cases}
-\log P_{\mathrm{generate}}(w_i \mid c_i) & \text{iff } c_i = \mathrm{LOC} \\
0 & \text{else}
\end{cases}
$$

4) Candidate organization name feature

$$
h_4(w, s) = \sum_{i=1}^{N} \mathrm{genorg}(w_i \mid c_i)
$$

Here

$$
\mathrm{genorg}(w_i \mid c_i) =
\begin{cases}
-\log P_{\mathrm{generate}}(w_i \mid c_i) & \text{iff } c_i = \mathrm{ORG} \\
0 & \text{else}
\end{cases}
$$

5) The length of the path

$$
h_5(w, s) = \mathrm{Length}(c)
$$

We implemented a GIS algorithm that can handle any type of real-valued
feature to train the values of $\lambda_1^5$.
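To make the decision rule concrete, here is a minimal Python sketch of the five features and the weighted argmax. Everything numeric in it is an assumption: the probability functions are toy stand-ins, the function names are hypothetical, and the weights are invented (the paper does not report the trained values). The toy weights are negative because the features are costs while the rule takes a maximum.

```python
def features(words, classes, log_p_context, log_p_generate):
    """Return [h1, ..., h5] for one candidate path (words with their classes)."""
    def gen(target):
        # Sum of -log P_generate(w_i | c_i) over words whose class is `target`.
        return sum(-log_p_generate(w, c) for w, c in zip(words, classes) if c == target)
    return [
        -log_p_context(classes),  # h1: context feature
        gen("PER"),               # h2: candidate person names
        gen("LOC"),               # h3: candidate location names
        gen("ORG"),               # h4: candidate organization names
        float(len(classes)),      # h5: length of the path
    ]

def best_path(candidates, lambdas, log_p_context, log_p_generate):
    """Pick the candidate that maximizes sum_m lambda_m * h_m(w, s)."""
    def score(candidate):
        words, classes = candidate
        h = features(words, classes, log_p_context, log_p_generate)
        return sum(lam * value for lam, value in zip(lambdas, h))
    return max(candidates, key=score)

# Toy models and weights (assumptions for illustration only).
toy_context = lambda classes: -1.0 * len(classes)
toy_generate = lambda word, cls: -2.0
toy_lambdas = [-1.0, -0.5, -0.5, -0.5, -1.0]

split = (["王", "小明", "来", "了"], ["王", "小明", "来", "了"])
named = (["王小明", "来", "了"], ["PER", "来", "了"])
chosen = best_path([split, named], toy_lambdas, toy_context, toy_generate)
```

Adding a new knowledge source in this design only means appending one entry to the feature list and one weight to `toy_lambdas`.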
2.4 Output the Result
This component outputs the best word sequence and adjusts the form of
the result to the standard of the test corpus. In PCWS, we only adjust
the form of the unknown words that the system recognizes.
For example, in “����H���”, “����H” and “���” are recognized
independently as unknown words. Based on the standard of the MSR
corpora, we combine consecutive words that belong to the class TIME.
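The merging of consecutive TIME words can be sketched in a few lines of Python. The function name `merge_time` and the sample input are hypothetical, and the sketch assumes that each word already carries its class label.

```python
def merge_time(words, classes):
    """Join runs of adjacent words whose class is TIME into a single word."""
    merged_words, merged_classes = [], []
    for word, cls in zip(words, classes):
        if cls == "TIME" and merged_classes and merged_classes[-1] == "TIME":
            merged_words[-1] += word  # extend the previous TIME word
        else:
            merged_words.append(word)
            merged_classes.append(cls)
    return merged_words, merged_classes

# Hypothetical input: two adjacent time expressions followed by a lexicon word.
out_words, out_classes = merge_time(["2005年", "10月", "开会"], ["TIME", "TIME", "开会"])
```

Words of other classes pass through unchanged, so the adjustment only affects the unknown words recognized by the system.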
2.5 System Overview
The overall architecture of our word segmentation system is presented
in Figure 2.
[Figure 2 components: Dictionary with Feature Label; Generate Common
Words; Generate Unknown Words; Disambiguation; Generating Candidate
Word Sequences; Find the Best Word Sequence; Output the Result]
Figure 2: Overall architecture of PCWS
Track     Total true word count  Total test word count  P      R      F      OOV    Roov   Riv
Msr_Open  106873                 106624                 0.913  0.915  0.914  0.026  0.725  0.918
Table 1: Test result on the Msr_open track
3 Evaluation
Because of the bakeoff's rules and limited time, we took part only in
the Msr_open track. Table 1 shows the result of PCWS in this bakeoff.
From the table, we can see that the system performs remarkably well in
OOV recognition.
Because only a small number of named entity words are included in
PCWS's dictionary, most named entity words are generated by the
system. However, the system's overall performance does not match its
good Roov. We notice that the Riv of the result is low. The main cause
of the low Riv is the difference between the segmentation standard of
the PKU training corpora and that of the MSR test corpora. The
difference is distinct even in the segmentation of common words. For
instance:
Example 1 
[Correct Result] “	���$�|"34�#"

9,XG�?U4��F���
E��?��,��
EW��”
[PCWS Result] “	���$�|"34�#"

9,XG�?U4��F���
E��?��,��
EW��”
The result’s True Words Recall = 0.875, Test 
Words Precision = 0.778
Example 2 
[Correct Result] “�!8!,X�C	�	�4�
4�;> `!��”
[PCWS Result] “�!8!,X�C	�	�4�4�
;> `!��”
The result’s True Words Recall = 0.700, Test 
Words Precision = 0.700
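Word-level recall and precision of this kind can be computed by comparing character spans, as in the following Python sketch. The function names and the toy data are assumptions: the toy gold and predicted sequences are not the sentences of the examples, they are only chosen so that 7 words are correct out of 8 gold words and 9 predicted words, which gives the same ratios as Example 1.

```python
def spans(words):
    """Map a word sequence to the set of (start, end) character spans it covers."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result

def evaluate(gold, pred):
    """Return (precision, recall, F) of a predicted segmentation against the gold one."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f

# Toy data: the prediction wrongly splits the gold word "bc" into "b" and "c".
gold = ["a", "bc", "d", "e", "fg", "h", "i", "jk"]
pred = ["a", "b", "c", "d", "e", "fg", "h", "i", "jk"]
precision, recall, f = evaluate(gold, pred)
```

A single wrong split costs one gold word in recall and two predicted words in precision, which is why differences in segmentation standard lower both figures at once.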
Before we got the result, we had neglected this problem. We focused on
named entity recognition, which needs training corpora with label
information, so we used six months of PKU corpora to build our system
and did not make good use of the MSR training corpora in the bakeoff.
To measure the influence of this problem, we tested our system on the
PKU test corpora. Table 2 shows the result of PCWS in this test.
Track  P  R  F
Pku_Open 0.973887 0.978405 0.976141
Table 2: Test Result on Pku_open
This is an excellent result and strongly supports our hypothesis.
We hope to make our system more adaptable to different standards in
the near future.
4 Conclusion
We have presented our Chinese word segmentation system PCWS and its
result on the Msr_open track. We are glad to see its good performance
in OOV recognition. In the course of the bakeoff, we found some
problems in PCWS. In future work we will try to add more useful
feature functions to the existing segmentation model. We are confident
that the system's performance will improve greatly next time.
5 Acknowledgement
The author thanks Prof. Hou-Feng Wang for his attention and help. The
author is especially grateful to his family, whose encouragement and
support helped him persevere.
References
Franz J. Och and Hermann Ney. 2002. Discriminative training and
maximum entropy models for statistical machine translation. In
Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), pages 295-302, Philadelphia, PA,
July.
Jianfeng Gao, Mu Li and Chang-Ning Huang. 2003. Improved
source-channel model for Chinese word segmentation. In Proceedings of
ACL 2003.
