Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 201–204,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Chinese Word Segmentation based on an Approach of Maximum Entropy
Modeling
Yan Song1 Jiaqing Guo1 Dongfeng Cai2
Natural Language Processing Lab
Shenyang Institute of Aeronautical Engineering
Shenyang, 110034, China
1.{mattsure,guojiaqing}@gmail.com
2.cdf@ge-soft.com
Abstract
In this paper, we described our Chinese
word segmentation system for the 3rd
SIGHAN Chinese Language Processing
Bakeoff Word Segmentation Task. Our
system deal with the Chinese character se-
quence by using the Maximum Entropy
model, which is fully automatically gen-
erated from the training data by analyz-
ing the character sequences from the train-
ing corpus. We analyze its performance
on both closed and open tracks on Mi-
crosoft Research (MSRA) and University
of Pennsylvania and University of Col-
orado (UPUC) corpus. It is shown that we
can get the results just acceptable without
using dictionary. The conclusion is also
presented.
1 Introduction
In the 3rd SIGHAN Chinese Language Process-
ing Bakeoff Word Segmentation Task, we partici-
pated in both closed and open tracks on Microsoft
Research corpus (MSRA for short) and University
of Pennsylvania and University of Colorado cor-
pus (UPUC for short). The following sections de-
scribed how our system works and presented the
results and analysis. Finally, the conclusion is pre-
sented with discussions of the system.
2 System Overview
Using Maximum Entropy approach for Chinese
Word Segmentation is not a fresh idea, some pre-
vious works (Xue and Shen, 2003; Low, Ng and
Guo, 2005) have got good performance in this
field. But what we consider in the process of
Segmentation is another way. We treat the input
text which need to be segmented as a sequence of
the Chinese characters, The segment process is, in
fact, to find where we should split the character se-
quence. Thepointistogetthesegmentprobability
between 2 Chinese characters, which is different
from dealing with the character itself.
In this section, training and segmentation pro-
cess of the system is described to show how our
system works.
2.1 Pre-Process of Training
For the first step we find the Minimal Segment
Unit (MSU for short) of a text fragment in the
training corpus. A MSU is a character or a string
which is the minimal unit in a text fragment that
cannot be segmented any more. According to the
corpus, all of the MSUs can be divided into 5
type classes: “C” - Chinese Character (such as
a47a92a48 and a47a208a48), “AB” - alphabetic string
(such as “SIGHAN”), “EN” - digit string (such
as “1234567”), “CN” - Chinese number string
(such as a47a152a122a19a155a48) and “P” - punctua-
tion (a47a167a48,a47a34a48,a47a182a48, etc). Besides the
classes above, we define a tag ”NL” as a special
MSU, which refers to the beginning or ending of a
text fragment. So, any MSU u can be described
as: u∈C∪AB∪EN∪CN∪P∪{NL}. In order
to check the capability of the pure Maximum En-
tropy model, in closed tracks, we didn’t have any
type of classes, the MSU here is every character
of the text fragment, u∈Cprime∪{NL}. For instance,
a47a183a130a235a92a10SIGHAN2006a169a99a140a109a34a48
is segmented into these MSUs: a47a183/a130/a235/a92/
a10/S/I/G/H/A/N/2/0/0/6/a169/a99/a140/a109/a34a48.
Once we get all the MSUs of a text fragment,
we can get the value of the Nexus Coefficient (NC
for short) of any 2 adjacent MSUs according to
the training corpus. The set of NC value can be
201
described as: NC ∈ {0,1}, where 0 means those
2 MSUs are segmented and 1 means they are not
segmented (Roughly, we appoint r = 0 if either
one of the 2 adjacent MSUs is NL). For example,
the NC value of these 2 MSUsa47a92a48anda47a208a48
in the text fragment a47a92a208a48 is 0 since these 2
characters is segmented according to the training
corpus.
2.2 Training
Since the segmentation is to obtain NC value of
any 2 adjacent MSUs (here we call the interspace
of the 2 adjacent MSUs a check point, illustrated
below),
...U−3 U−2 U−1 U+1 U+2 U+3...
a54
Check Point of U−1 and U+1
we built a tool to extract the feature as follows:
(α) U−3,U−2,U−1,U+1,U+2,U+3
(β) U−1U+1
(γ) r−2r−1
(δ) U−3r−2,U−2r−1
(epsilon1) r−2U−2,r−1U−1
In these features above, U+n (U−n) refers to
the following (previous) n MSU of the check
point with the information of relative position
(Intuitively, We consider the same MSU has
different effect on the NC value of the check point
when its relative position is different to check
point). And U−1U+1 is the 2 adjacent MSUs of
the check point. r−2r−1 is the NC value of the
previous 2 check points. Similarly, the (δ) and (epsilon1)
features represent the MSUs with their adjacent
r. For instance, in the sentencea183a180a152a135a165a73a60,
we can extract these features for the check point
between the MSUa183anda180:
(α) NL−3,NL−2,a183−1,a180+1,a152+2,a135+3,
(β)a183−1a180+1
(γ) 00 (because a183 is the boundary of the sen-
tence)
(δ) NL−30,NL−20
(epsilon1) 0NL−2,0a183−1
and also these features for the check point be-
tween the MSUa135anda165:
(α)a180−3,a152−2,a135−1,a165+1,a73+2,a60+3
(β)a135−1a165+1
Figure 1: MSRA training curve
Figure 2: UPUC training curve
(γ) 01 (for UPUC corpus, here the value is 00
since a152a135 is segmented into 2 characters, but in
MSRA corpus,a152a135is treated as a word)
(δ)a180−30,a152−21
(epsilon1) 0a152−2,1a135−1
After the extraction of the features, we use the
ZhangLe’s Maximum Entropy Toolkit1 to train the
model with a feature cutoff of 1. In order to get
the best number of iteration, 9/10 of the training
data is used to train the model, and the other 1/10
portion of the training data is used to evaluate the
model. Figure 1 and 2 show the results of the eval-
uation on MSRA and UPUC corpus.
From the figures we can see the best iteration
number range from 555 to 575 for MSRA corpus,
and 360 to 375 for UPUC corpus. So we decide
the iteration for 560 rounds for MSRA tracks and
365 rounds for UPUC tracks, respectively.
2.3 Segmentation
As we mentioned in the beginning of this section,
the segmentation is the process to obtain the value
1Download from http://maxent.sourceforge.net
202
of every NC in a text fragment. This process is
similar to the training process. Firstly, We scan
the text fragment from start to end to get all of
the MSUs. Then we can extract all of the features
from the text fragment and decide which check
point we should tag as r = 0 by this equation:
p(r|c) = 1Z
Kproductdisplay
j=1
αfj(r|c)j (1)
where K is the number of features, Z is the nor-
malization constant used to ensure that a probabil-
ity distribution results, and c represents the con-
text of the check point. αj is the weight for fea-
ture fj, here {α1α2 ...αK} is generated by the
training data. We then compute P(r = 0|c) and
P(r = 1|c) by the equation (1).
After one check point is treated with value of
r, the system shifts backward to the next check
point until all of the check point in the whole text
fragment are treated. And by calculating:
P =
n−1productdisplay
i=1
p(ri|ci) =
n−1productdisplay
i=1
1
Z
Kproductdisplay
j=1
αfk(ri|ci)j (2)
toget anr sequencewhich canmaximizeP. From
thisprocesswecanseethatthesequenceis,infact,
a second-order Markov Model. Thus it is easily to
think about more tags prior to the check point (as
an nth-order Markov Model) to get more accuracy,
but in this paper we only use the previous 2 tags
from the check point.
2.4 Identification of New words
We perform the new word(s) identification as a
post-process by check the word formation power
(WFP) of characters. The WFP of a character is
defined as: WFP(c) = Nwc/Nc, where Nwc is
the number of times that the character c appears
in a word of at least 2 characters in the training
corpus, Nc is the number of times the character c
occursinthetrainingcorpus. Afteratextfragment
is segmented by our system, we extract all consec-
utive single characters. If at least 2 consecutive
characters have the WFP larger than our threshold
of 0.88, we polymerize them together as a word.
For example,a47a178a214a151a48is a new word which is
segmented as a47a178/a214/a151a48 by our system, WFP
of these 3 characters is 0.9517,0.9818 and 1.0 re-
spectively, then they are polymerized as one word.
Besides the WFP, during the experiments, we
find that the Maximum Entropy model can poly-
merizesomeMSUsasanewword(Wecallitpoly-
merization characteristic of the model), such asa170
a186a13a196 in the training corpus, we can extract a170
a186a13as the previous context feature of the check
point aftera13, in another stringa83a44a13a142, we can
extract the backward contexta142of the check point
aftera13with r = 1. Then in the test, a new word
a170a186a13a142 is recognized by the model since a170
a186a13 and a142 are polymerized if a13a142 appears to-
gether a large number of times in the training cor-
pus.
3 Performance analysis
Here Table 1 illustrates the results of all 4 tracks
we participate. The first column is the track name,
and the 2nd column presents the Recall (R), the
3rd column the Precision (P), the 4th column is
F-measure (F). The Roov refers to the recall of the
out-of-vocabulary words and the Riv refers to the
recall of the words in training corpus.
Track R P F Roov Riv
MSRA Closed 0.923 0.929 0.926 0.554 0.936
MSRA Open 0.938 0.946 0.942 0.706 0.946
UPUC Closed 0.902 0.887 0.895 0.568 0.934
UPUC Open 0.926 0.906 0.917 0.660 0.954
Table 1: Results of our system in 4 tracks.
3.1 Closed tracks
For all of the closed tracks, we perform the seg-
mentation as we mentioned in the section above,
without any class defined. Every MSU we extract
from the training data is a character, which may be
a Chinese character, an English letter or a single
digit. We extract the features based on this kind of
MSUs to generate the models. The results show
these models are not precise.
For the UPUC closed track, the official released
training data is rather small. Then the capability
of the model is limited, this is the most reasonable
negative effect on our F-measure 0.895.
3.2 Open tracks
The primary change between open tracks and
closed tracks is that we have classified 5 classes
(“C”,“AB”,“EN”,“CN” and “P”) to MSUs in or-
der to improve the accuracy of the model. The
classification really works and affects the perfor-
mance of the system in a great deal. As this text
fragment 1998a99 can be recognized as (EN)(C),
which can also presents 1644a99, thus 1644a99can
203
be easily recognized though there is no 1664a99in
the training data.
The training corpus we used in UPUC open
track is the same as in UPUC closed track. With
those 5 classes, it is easily seen that the F-measure
increased by 2.2% in the open tracks.
For the MSRA open track, we adjust the class
“P” by removing the punctuation “a33” from the
class, because in the MSRA corpus, “a33” can be
a part of a organization name, such as “a33” in
a47a165a2a108a208a33a218a178a134a117a208a148a10a172a48. Besides,
we add the Microsoft Research training data of
SIGHAN bakeoff 2005 as extended training cor-
pus. The larger training data cooperate with the
classification method, the F-measure of the open
track increased to 0.942 as comparison with 0.926
of closed track.
3.3 Discussion of the tracks
Through the tracks, we tested the performance by
using the pure Maximum Entropy model in closed
tracks and run with the improved model with clas-
sified MSUs in open tracks. It is shown that the
pure model without any additional methods can
hardly make us satisfied, for the open tracks, the
model with classes are just acceptable in segmen-
tation.
In both closed and open tracks, we use the
same new word identification process, and with
the polymerization characteristic of the model, we
find the Roov is better than we expected.
Ontheotherhand,inoursystem,thereisnodic-
tionary used as we described in the sections above,
the Riv of each track shows that affects the system
performance.
Another factor affects our system in the UPUC
tracks is the wrongly written characters. Consider
that our system is based on the sequence of char-
acters, this kind of mistake is fatal. For example,
in the sentencea166a130a195a35a117a129a140a79a27a60a27a123a146,
wherea123a153is written asa123a146. The model cannot
recognize it since a123a146 didn’t occur in the train-
ing corpus. In the step of new word identification,
the WFPs of the 2 characters a123a167a146 are 0.8917
and 0.8310, thus they are wrongly segmented into
2 single characters while they are treated as a word
in the gold standard corpus. Therefore, we believe
the results can increase if there are no such mis-
takes in the test data.
4 Conclusion
We propose an approach to Chinese word seg-
mentation by using Maximum Entropy model,
which focuses on the nexus relationship of any
2 adjacent MSUs in a text fragment. We tested
our system with pure Maximum Entropy models
and models with simplex classification method.
Compare with the pure models, the models with
classified MSUs show us better performances.
However, the Maximum Entropy models of our
system still need improvement if we want to
achieve higher performance. In future works,
we will consider using more training data and
add some hybrid methods with pre- and post-
processes to improve the system.
Acknowledgements
We would like to thank all the colleagues of our
Lab. Without their encouragement and help, this
work cannot be accomplished in time.
This is our first time to participate such an in-
ternational bakeoff. There are a lot of things we
haven’t experienced ever before, but with the en-
thusiastic help from the organizers, we can come
through the task. Especially, We wish to thank
Gina-Anne Levow for her patience and immediate
reply for any of our questions, and we also thank
Olivia Kwong for the advice of paper submission.

References
Nianwen Xue and Libin Shen. 2003. Chinese Word
Segmentation as LMR tagging. In Proceedings of
the Second SIGHAN Workshop on Chinese Lan-
guage Processing, p176-179.
Maosong Sun, Ming Xiao, B K Tsou. 2004. Chinese
Word Segmentation without Using Dictionary Based
on Unsupervised Learning Strategy. Chinese Jour-
nal of Computers, Vol.27, #6, p736-742.
Jin Kiat Low, Hwee Tou Ng and Wenyuan Guo.
2005. A Maximum Entropy Approach to Chinese
Word Segmentation. In Proceedings of the Fourth
SIGHAN Workshop on Chinese Language Process-
ing, p161-164.
