A Maximum Entropy Approach to Chinese Word Segmentation
Jin Kiat Low (1), Hwee Tou Ng (1, 2), and Wenyuan Guo (2)
1. Department of Computer Science, National University of Singapore,
3 Science Drive 2, Singapore 117543
2. Singapore-MIT Alliance, E4-04-10, 4 Engineering Drive 3, Singapore 117576
{lowjinki, nght, guowy}@comp.nus.edu.sg
Abstract
We participated in the Second International Chinese Word Segmentation
Bakeoff. Specifically, we evaluated our Chinese word segmenter in the
open track, on all four corpora, namely Academia Sinica (AS), City
University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking
University (PKU). Based on a maximum entropy approach, our word
segmenter achieved the highest F measure for AS, CITYU, and PKU, and
the second highest for MSR. We found that the use of an external
dictionary and additional training corpora of different segmentation
standards helped to further improve segmentation accuracy.
1 Chinese Word Segmenter
The Chinese word segmenter we built is similar to the maximum entropy
word segmenter we employed in our previous work (Ng and Low, 2004).
Our word segmenter uses a maximum entropy framework (Ratnaparkhi,
1998; Xue and Shen, 2003) and is trained on manually segmented
sentences. It classifies each Chinese character given the features
derived from its surrounding context. Each Chinese character can be
assigned one of four possible boundary tags: s for a character that
occurs as a single-character word, b for a character that begins a
multi-character (i.e., two or more characters) word, e for a character
that ends a multi-character word, and m for a character that is
neither the first nor last in a multi-character word. Our
implementation used the opennlp maximum entropy package v2.1.0 from
SourceForge (see footnote 1).
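The boundary tag scheme described above can be illustrated with a short Python sketch that converts between a list of words and one tag per character. The function names are hypothetical and chosen only for this example; they are not part of the opennlp package or of our implementation.

```python
def words_to_tags(words):
    """Map a list of words to one boundary tag (s, b, m, or e) per character."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("s")  # single-character word
        else:
            # first character b, last character e, everything between m
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from characters and a valid boundary tag sequence."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("s", "e"):  # s and e both close a word
            words.append(current)
            current = ""
    return words
```

For instance, the three words "AB", "C", and "DEFG" map to the tag sequence b e s b m m e, and applying tags_to_words to the seven characters with those tags recovers the same three words.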
1.1 Basic Features
The basic features of our word segmenter are similar to those used in
our previous work (Ng and Low, 2004):
(a) Cn (n = −2, −1, 0, 1, 2)
(b) CnCn+1 (n = −2, −1, 0, 1)
(c) C−1C1
(d) Pu(C0)
(e) T(C−2)T(C−1)T(C0)T(C1)T(C2)
In the above feature templates, C refers to a Chinese character.
Templates (a) – (c) refer to a context of five characters (the current
character and two characters to its left and right). C0 denotes the
current character, and Cn (C−n) denotes the character n positions to
the right (left) of the current character. For example, given the
character sequence “_d_3249_d_1509_d_4840_d_1487_d_1048”, when
considering the character C0 “_d_4840”, C−2 denotes “_d_3249”, C1C2
denotes “_d_1487_d_1048”, etc. The punctuation feature, Pu(C0), checks
whether C0 is a punctuation symbol (such as “?”, “–”, “,”). For the
type feature (e), four type classes are defined: numbers represent
class 1, dates (“_d_3267”, “_d_3362”, “_d_2543”, the Chinese
characters for “day”, “month”, “year”, respectively) represent class
2, English letters represent class 3, and other characters represent
class 4. For example, when considering the character “_d_2543” in the
character sequence “_d_1010_d_7235_d_2543_d_1082R”, the feature
T(C−2)...T(C2) = 11243 will be set to 1 (“_d_1010” is the Chinese
character for “9” and “_d_7235” is the Chinese character for “0”).

Footnote 1: http://maxent.sourceforge.net/
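As an illustration of templates (a) to (e), the following Python sketch generates feature strings for one character position. The character sets used for punctuation, numbers, and dates, the padding symbol PAD for positions outside the sentence, the feature string format, and the sample input in the note below are all assumptions made for this example; the text above does not specify them.

```python
import string

# Illustrative character classes (assumptions; the real lists may differ).
PUNCTUATION = set("?!,.;:-–") | set("？！，。；：、")
NUMBERS = set("0123456789") | set("〇零一二三四五六七八九十百千万亿")
DATES = set("日月年")
LETTERS = set(string.ascii_letters)
PAD = "#"  # hypothetical symbol for positions outside the sentence


def char_type(ch):
    """Type classes of template (e): 1 number, 2 date, 3 letter, 4 other."""
    if ch in NUMBERS:
        return "1"
    if ch in DATES:
        return "2"
    if ch in LETTERS:
        return "3"
    return "4"


def basic_features(chars, i):
    """Return feature strings for templates (a) to (e) at position i."""
    def c(n):
        j = i + n
        return chars[j] if 0 <= j < len(chars) else PAD

    feats = []
    for n in (-2, -1, 0, 1, 2):            # (a) single characters
        feats.append(f"C{n}={c(n)}")
    for n in (-2, -1, 0, 1):               # (b) adjacent character pairs
        feats.append(f"C{n}C{n + 1}={c(n)}{c(n + 1)}")
    feats.append(f"C-1C1={c(-1)}{c(1)}")   # (c) characters around C0
    if c(0) in PUNCTUATION:                # (d) punctuation indicator
        feats.append("Pu(C0)")
    # (e) concatenated type classes of the five-character window
    feats.append("T=" + "".join(char_type(c(n)) for n in (-2, -1, 0, 1, 2)))
    return feats
```

With the hypothetical input "九零年代R" and position 2 (the character for "year"), the sketch emits the type feature "T=11243", which mirrors the pattern of the worked example above.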
Besides these basic features, we also made use of character
normalization. We note that characters like punctuation symbols and
Arabic digits have different character codes in the ASCII, GB, and
BIG5 encoding standards, although they mean the same thing. For
example, comma “,” is represented as the hexadecimal value 0x2c in
ASCII, but as the hexadecimal value 0xa3ac in GB. In our segmenter,
these different character codes are normalized and replaced by the
corresponding character code in ASCII. Also, all Arabic digits are
replaced by the ASCII digit “0” to denote any digit. Incorporating
character normalization enables our segmenter to be more robust
against the use of different encodings to represent the same
character.
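A minimal Python sketch of this normalization is given below. It assumes the text has already been decoded from GB or BIG5 into Unicode, where the double-byte variants of ASCII symbols appear as the full-width forms U+FF01 to U+FF5E; our segmenter works on the original character codes, so this is an approximation of the idea rather than the actual procedure. The function names are hypothetical.

```python
def normalize_char(ch):
    """Map a full-width ASCII variant to ASCII, and any Arabic digit to '0'."""
    code = ord(ch)
    if 0xFF01 <= code <= 0xFF5E:
        # Full-width '!'..'~' sit at a fixed offset from ASCII '!'..'~'.
        ch = chr(code - 0xFEE0)
    elif code == 0x3000:
        ch = " "  # ideographic space
    if ch in "0123456789":
        ch = "0"  # a single symbol stands for any digit
    return ch


def normalize(text):
    """Normalize every character of a decoded string."""
    return "".join(normalize_char(ch) for ch in text)
```

For example, normalize("\uff11\uff19\uff19\uff18\uff0c\uff41") (the full-width string "1998,a") returns "0000,a". The full-width comma U+FF0C is the character encoded as 0xa3ac in GB.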
For all the experiments that we conducted, training was done with a
feature cutoff of 2 and 100 iterations, except for the AS corpus, for
which the feature cutoff was 3.
A major difficulty faced by a Chinese word segmenter is the presence
of out-of-vocabulary (OOV) words. Segmenting a text with many OOV
words tends to result in lower accuracy. We address the problem of OOV
words in two ways: using an external dictionary containing a list of
predefined words, and using additional training corpora which are not
segmented according to the same segmentation standard.
1.2 External Dictionary
If a sequence of characters in a sentence matches a word in an
existing dictionary, it may be a clue that the sequence of characters
should be segmented as one word. We used an online dictionary from
Peking University downloadable from the Internet (see footnote 2),
consisting of about 108,000 words of length one to four characters. If
there is some sequence of neighboring characters around C0 in the
sentence that matches a word in this dictionary, then we greedily
choose the longest such matching word W in the dictionary. Let t0 be
the boundary tag of C0 in W, L the number of characters in W, and C1
(C−1) be the character immediately following (preceding) C0 in the
sentence. We then add the following features derived from the
dictionary:
(f) Lt0
(g) Cnt0 (n = −1, 0, 1)

Footnote 2: http://ccl.pku.edu.cn/doubtfire/Course/Chinese%20Information%20Processing/Source_Code/Chapter_8/Lexicon_full_2000.zip
For example, consider the sentence
“_d_3249_d_1509_d_4840_d_1487_d_1048...”. When processing the current
character C0 “_d_1509”, we will attempt to match the following
candidate sequences “_d_1509”, “_d_3249_d_1509”, “_d_1509_d_4840”,
“_d_3249_d_1509_d_4840”, “_d_1509_d_4840_d_1487”,
“_d_3249_d_1509_d_4840_d_1487”, and “_d_1509_d_4840_d_1487_d_1048”
against existing words in our dictionary. Suppose both
“_d_1509_d_4840” and “_d_3249_d_1509_d_4840” are found in the
dictionary. Then the longest matching word W chosen is
“_d_3249_d_1509_d_4840”, t0 is m, L is 3, C−1 is “_d_3249”, and C1 is
“_d_4840”.
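The dictionary lookup and templates (f) and (g) can be sketched in Python as follows. The function names, the feature string format, the padding symbol PAD, and the tie-breaking rule (the leftmost match wins among matches of equal length) are assumptions for illustration; the text above only states that the longest match is chosen. Latin letters stand in for Chinese characters.

```python
PAD = "#"  # hypothetical symbol for positions outside the sentence


def tag_in_word(pos, length):
    """Boundary tag of the character at offset pos in a word of given length."""
    if length == 1:
        return "s"
    if pos == 0:
        return "b"
    if pos == length - 1:
        return "e"
    return "m"


def longest_match(chars, i, lexicon, max_len=4):
    """Return (start, length) of the longest lexicon word covering position i."""
    for length in range(max_len, 0, -1):            # longest candidates first
        for start in range(i - length + 1, i + 1):  # leftmost start first
            if start < 0 or start + length > len(chars):
                continue
            if chars[start:start + length] in lexicon:
                return start, length
    return None


def dictionary_features(chars, i, lexicon, max_len=4):
    """Return feature strings for templates (f) and (g) at position i."""
    match = longest_match(chars, i, lexicon, max_len)
    if match is None:
        return []  # no dictionary word covers C0
    start, length = match
    t0 = tag_in_word(i - start, length)
    feats = [f"L={length},t0={t0}"]                 # (f) Lt0
    for n in (-1, 0, 1):                            # (g) Cnt0
        j = i + n
        c = chars[j] if 0 <= j < len(chars) else PAD
        feats.append(f"C{n}={c},t0={t0}")
    return feats
```

With the sentence "ABCDE", the current character "B" (position 1), and a lexicon containing "BC" and "ABC", the longest match is "ABC", so t0 is m and L is 3, which parallels the worked example above.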
1.3 Additional Training Corpora
The presence of different standards in Chinese word segmentation
limits the amount of training data available to the community, because
different organizations prepare training corpora according to their
own standards. Indeed, if one uniform segmentation standard were
adopted, more training data would be available, and the OOV problem
could be significantly reduced.
We observed that although different segmentation standards exist, the
differences are limited, and many words are still segmented in the
same way across two different segmentation standards. As such, in our
work, we attempt to incorporate corpora from other segmentation
standards as additional training data, to help reduce the OOV problem.
Specifically, the steps taken are:
1. Perform training with maximum entropy modeling using the original
   training corpus D0 annotated in a given segmentation standard.
2. Use the trained word segmenter to segment another corpus D
   annotated in a different segmentation standard.
3. Suppose a Chinese character C in D is assigned a boundary tag t by
   the word segmenter with probability p. If t is identical to the
   boundary tag of C in the gold-standard annotated corpus D, and p is
   less than some threshold θ, then C (with its surrounding context in
   D) is used as additional training data.
4. Add all such characters C as additional training data to the
   original training corpus D0, and train a new word segmenter using
   the enlarged training data.
5. Evaluate the accuracy of the new word segmenter on the same test
   data annotated in the original segmentation standard of D0.
For the current bakeoff, when training a word segmenter on a
particular training corpus, the additional training corpora are all
the three corpora in the other segmentation standards. For example,
when training a word segmenter for the AS corpus, the additional
training corpora are CITYU, MSR, and PKU. The necessary character
encoding conversion between GB and BIG5 is performed, and the
probability threshold θ is set to 0.8. We found from our experiments
that setting θ to a higher value did not further improve segmentation
accuracy, but would instead increase the training set size and incur
longer training time.
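The selection rule of step 3 can be sketched in Python as follows. The function name and the input representation (one gold tag and one predicted (tag, probability) pair per character of D) are assumptions made for this example.

```python
def select_additional_examples(gold_tags, predictions, theta=0.8):
    """Return the positions in D to add to the training data.

    gold_tags:   gold-standard boundary tag of each character in D
    predictions: (tag, probability) pair assigned to each character by the
                 segmenter trained on D0
    theta:       probability threshold (0.8 in the bakeoff setting)
    """
    selected = []
    for idx, (gold, (tag, prob)) in enumerate(zip(gold_tags, predictions)):
        # Keep characters whose predicted tag agrees with the annotation of D
        # but which the current model predicts with probability below theta.
        if tag == gold and prob < theta:
            selected.append(idx)
    return selected
```

Raising theta can only add positions to the selected set, which is consistent with the observation above that a higher threshold increases the training set size.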
2 Testing
During testing, the probability of a boundary tag sequence assignment
t1 ... tn given a character sequence C1 ... Cn is determined by using
the maximum entropy classifier to compute the probability that a
boundary tag ti is assigned to each individual character Ci. If we
were to just assign each character the boundary tag with the highest
probability, it is possible that the classifier produces a sequence of
invalid tags (e.g., m followed by s). To eliminate such possibilities,
we implemented a dynamic programming algorithm which considers only
valid boundary tag sequences given an input character sequence. At
each character position i, the algorithm considers each last word
candidate ending at position i and consisting of K characters in
length (K = 1,...,20 in our experiments). To determine the boundary
tag assignment to the last word W with K characters, the first
character of W is assigned boundary tag b, the last character of W is
assigned tag e, and the intervening characters are assigned tag m. (If
W is a single-character word, then the single character is assigned
tag s.) In this way, the dynamic programming algorithm only considers
valid tag sequences.

Corpus   R      P      F      ROOV   RIV
AS       0.962  0.950  0.956  0.684  0.975
CITYU    0.967  0.956  0.962  0.806  0.980
MSR      0.969  0.968  0.968  0.736  0.975
PKU      0.968  0.969  0.969  0.838  0.976

Table 1: Our official SIGHAN bakeoff results
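The dynamic programming search over last-word candidates can be sketched in Python as follows. The sketch assumes that a tag sequence is scored by the product of the per-character tag probabilities (computed here as a sum of logarithms) and that probabilities are floored at a small constant to avoid taking the logarithm of zero; the function name and input format are hypothetical.

```python
import math


def best_segmentation(tag_probs, max_word_len=20):
    """Return the most probable valid tag sequence.

    tag_probs[i] maps each of the tags s, b, m, e to P(tag | C_i).
    """
    n = len(tag_probs)
    logp = [{t: math.log(max(p, 1e-12)) for t, p in d.items()}
            for d in tag_probs]

    def word_score(start, end):
        """Log score of treating characters start..end-1 as one word."""
        if end - start == 1:
            return logp[start]["s"]
        score = logp[start]["b"] + logp[end - 1]["e"]
        for j in range(start + 1, end - 1):
            score += logp[j]["m"]
        return score

    best = [float("-inf")] * (n + 1)  # best[i]: best score of first i chars
    back = [0] * (n + 1)              # back[i]: start of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for k in range(1, min(max_word_len, end) + 1):  # last word length K
            start = end - k
            score = best[start] + word_score(start, end)
            if score > best[end]:
                best[end], back[end] = score, start

    # Follow the back pointers to recover word boundaries, then emit tags.
    bounds, end = [], n
    while end > 0:
        bounds.append((back[end], end))
        end = back[end]
    tags = []
    for start, end in reversed(bounds):
        k = end - start
        tags.extend(["s"] if k == 1 else ["b"] + ["m"] * (k - 2) + ["e"])
    return tags
```

Because every candidate is a whole word tagged as s or as b m ... m e, an invalid sequence such as m followed by s can never be produced, even when those tags are individually the most probable.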
After word segmentation is done by the maximum entropy classifier, a
post-processing step is applied to correct inconsistently segmented
words made up of 3 or more characters. A word W is defined to be
inconsistently segmented if the concatenation of 2 to 6 consecutive
words elsewhere in the segmented output document matches W. In the
post-processing step, the segmentation of the characters of these
consecutive words is changed so that they are segmented as a single
word. To illustrate, if the concatenation of 2 consecutive words
“_d_7403_d_3496” and “_d_6868_d_2975” in the segmented output document
matches another word “_d_7403_d_3496_d_6868_d_2975”, then the 2
consecutive words “_d_7403_d_3496” and “_d_6868_d_2975” will be
re-segmented as a single word “_d_7403_d_3496_d_6868_d_2975”.
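A minimal Python sketch of this post-processing step is shown below. It treats the segmented output document as one flat list of words, scans it from left to right, and tries the longest run of consecutive words first; the scan order, the preference for longer runs, and the function name are assumptions, since the description above does not fix them.

```python
def merge_inconsistent(words, min_chars=3, max_run=6):
    """Merge runs of 2..max_run consecutive words whose concatenation
    equals a word of at least min_chars characters found in the output."""
    targets = {w for w in words if len(w) >= min_chars}
    merged, i = [], 0
    while i < len(words):
        for run in range(min(max_run, len(words) - i), 1, -1):
            candidate = "".join(words[i:i + run])
            if candidate in targets:
                merged.append(candidate)  # re-segment the run as one word
                i += run
                break
        else:
            merged.append(words[i])       # no run starting here matches
            i += 1
    return merged
```

For the word list ["AB", "CD", "X", "ABCD"], the run "AB" "CD" concatenates to the word "ABCD" seen elsewhere in the output, so the result is ["ABCD", "X", "ABCD"].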
3 Evaluation Results
We evaluated our Chinese word segmenter in the open track, on all 4
corpora, namely Academia Sinica (AS), City University of Hong Kong
(CITYU), Microsoft Research (MSR), and Peking University (PKU). Table
1 shows our official SIGHAN bakeoff results. The columns R, P, and F
show the recall, precision, and F measure, respectively. The columns
ROOV and RIV show the recall on out-of-vocabulary words and
in-vocabulary words, respectively. Our Chinese word segmenter which
participated in the bakeoff was trained with the basic features
(Section 1.1), and made use of the external dictionary (Section 1.2)
and additional training corpora (Section 1.3). Our word segmenter
achieved the highest F measure for AS, CITYU, and PKU, and the second
highest for MSR.
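For readers unfamiliar with the R, P, and F columns, the following Python sketch computes them under the standard word-level definitions: a predicted word counts as correct when both of its boundaries agree with the gold-standard segmentation, and F is the harmonic mean of precision and recall. It is not the official bakeoff scorer, it does not compute ROOV or RIV, and the function names are hypothetical.

```python
def word_spans(words):
    """Return the set of (start, end) character spans of a word sequence."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def recall_precision_f(gold_words, predicted_words):
    """Word-level recall, precision, and F measure for one segmentation."""
    gold, pred = word_spans(gold_words), word_spans(predicted_words)
    correct = len(gold & pred)  # words with both boundaries right
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return recall, precision, f
```

For the gold segmentation "AB" "C" "DE" and the prediction "AB" "CD" "E", only "AB" is correct, so recall, precision, and F are all 1/3. Since the values in Table 1 are rounded to three decimals, recomputing F from the rounded R and P may differ from the reported F in the last digit.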
After the release of the official bakeoff results, we ran a series of
experiments to determine the contribution of each component of our
word segmenter, using the official scorer and test sets with
gold-standard segmentations. Version V1 used only the basic features
(Section 1.1); Version V2 used the basic features and additional
features derived from our external dictionary (Section 1.2); Version
V3 used the basic features but with additional training corpora
(Section 1.3); and Version V4 is our official submitted version
combining basic features, external dictionary, and additional training
corpora. Table 2 shows the word segmentation accuracy (F measure) of
the different versions of our word segmenter, when tested on the
official test sets of the four corpora. The results indicate that the
use of the external dictionary increases segmentation accuracy on all
four corpora (V2 versus V1), with the largest gain on PKU (0.948 to
0.965). Similarly, the use of additional training corpora of different
segmentation standards also increases segmentation accuracy on all
four corpora (V3 versus V1).

Corpus   V1     V2     V3     V4
AS       0.953  0.955  0.956  0.956
CITYU    0.950  0.960  0.961  0.962
MSR      0.960  0.968  0.963  0.968
PKU      0.948  0.965  0.956  0.969

Table 2: Word segmentation accuracy (F measure) of different versions
of our word segmenter
4 Conclusion
Using a maximum entropy approach, our Chinese word segmenter achieves
state-of-the-art accuracy, when evaluated on all four corpora in the
open track of the Second International Chinese Word Segmentation
Bakeoff. The use of an external dictionary and additional training
corpora of different segmentation standards helps to further improve
segmentation accuracy.
Acknowledgements
This research is partially supported by a research grant
R252-000-125-112 from the National University of Singapore Academic
Research Fund, as well as the Singapore-MIT Alliance.
References
Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging:
One-at-a-time or all-at-once? Word-based or character-based? In
Proceedings of the 2004 Conference on Empirical Methods in Natural
Language Processing (EMNLP 2004), pages 277–284.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language
Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

Nianwen Xue and Libin Shen. 2003. Chinese word segmentation as LMR
tagging. In Proceedings of the Second SIGHAN Workshop on Chinese
Language Processing, pages 176–179.
