Chinese Word Segmentation Using Minimal Linguistic Knowledge
Aitao Chen
School of Information Management and Systems
University of California at Berkeley
Berkeley, CA 94720, USA
aitao@sims.berkeley.edu
Abstract
This paper presents a primarily data-driven Chi-
nese word segmentation system and its perfor-
mances on the closed track using two corpora at
the first international Chinese word segmentation
bakeoff. The system consists of a new words rec-
ognizer, a base segmentation algorithm, and pro-
cedures for combining single characters, suffixes,
and checking segmentation consistencies.
1 Introduction
At the first Chinese word segmentation bakeoff, we partici-
pated in the closed track using the Academia Sinica corpus
(a0a2a1 for short) and the Beijing University corpus (a3a5a4 for
short). We will refer to the segmented texts in the training
corpus as the training data, and to both the unsegmented
testing texts and the segmented texts (the reference texts)
as the testing data. For details on the word segmentation
bakeoff, see (Sproat and Emerson, 2003).
2 Word segmentation
New texts are segmented in four steps which are described
in this section. New words are automatically extracted from
the unsegmented testing texts and added to the base dictio-
nary consisting of words from the training data before the
testing texts are segmented, line by line.
2.1 Base segmentation algorithm
Given a dictionary and a sentence, our base segmenta-
tion algorithm finds all possible segmentations of the sen-
tence with respect to the dictionary, computes the prob-
ability of each segmentation, and chooses the segmenta-
tion with the highest probability. If a sentence of a6 char-
acters, a1a8a7a10a9a12a11a13a9a13a14a16a15a12a15a17a15a18a9a17a19 , has a segmentation of a20 words,
a1a21a7a23a22a24a11a13a22a25a14a26a15a17a15a12a15a27a22a29a28 , then the probability of the segmentation
is estimated as a30a32a31 a1a34a33a18a35a2a36a37a7 a30a38a31a22a24a11a39a22a25a14a26a15a17a15a12a15a40a22a29a28a41a36a29a42a44a43
a28a45
a46
a11
a30a32a31
a22
a45
a36 ,
where a35 denotes a segmentation of a sentence. The prob-
ability of a word is estimated from the training corpus as
a30a38a31
a22a2a36a37a42a8a47a2a48a50a49a52a51
a47
, where a53a54a31a22a2a36 is the number of times that the
word a22 occurs in the training corpus, and a53 is the num-
ber of words in the training corpus. When a word is not
in the dictionary, a frequency of 0.5 is assigned to the new
word. The dynamic programming technique is applied to
find the segmentation of the highest probability of a sen-
tence without first enumerating all possible segmentations
of the sentence with respect to the dictionary. Consider the
text fragmenta55a44a56a44a57a44a58a60a59 with respect to a dictionary con-
taining the wordsa55a61a56a62a59a63a55a61a56a61a57a62a59a63a57a61a58a62a59a63a57 anda58a62a59 it
has three segmentations: (1) a55a64a56 / a57a65a58a62a66 (2) a55a65a56a64a57 /
a58a60a66 and (3)a55a61a56 /a57 /a58a60a67 The probabilities of the three
segmentations are computed as: (1) p(a55a64a56 )*p(a57a64a58 ); (2)
p(a55a68a56a68a57 )*p(a58 ); (3) p(a55a68a56 )*p(a57 )*p(a58 ). The proba-
bility of a word is estimated by its relative frequency in the
training data. Assume the first segmentation has the highest
probability, then the text fragment will be segmented into
a55a61a56 / a57a61a58a62a67
2.2 Combining single characters
New words are usually two or more characters long and are
often segmented into single characters. For example, the
word a69a64a70 is segmented into a69 / a70 when it is not in the
dictionary. After a sentence is segmented using the base al-
gorithm, the consecutive single Hanzi characters are com-
bined into a word if the in-word probabilities of the single
characters are over a threshold which is empirically deter-
mined from the training data. The in-word probability of
a character is the probability that the character occurs in a
word of two or more characters.
Some Hanzi characters, such as a71 and a72a73a59 occur as
words on their own in segmented texts much more fre-
quently than in words of two or more characters. For exam-
ple, in the PK training corpus, the character a72 occurs as a
word on its own 11,559 times, but in a word only 875 times.
On the other hand, some Hanzi characters usually do not
occur alone as words, instead they occur as part of a word.
As an example, the character a74 occurs in a word 17,108
times, but as a word alone only 794 times in the PK training
data. For each character in the training data, we compute its
in-word probability as follow: a30a38a31a76a75
a45
a19
a49a78a77a18a79a81a80
a36a82a7 a47a2a48a50a83a85a84a87a86a12a88a90a89a92a91a94a93a81a51
a47a2a48a50a83a38a51
,
where a53a54a31a92a75 a36 is the number of times that character a75 occurs
in the training data, and a53a54a31a92a75
a45
a19
a49a78a77a18a79a95a80
a36 is the number of times
that character a75 is in a word of two or more characters.
We do not want to combine the single characters that oc-
cur as words alone more often than not. For both the PK
training data and the AS training data, we divided the train-
ing data into two parts, two thirds for training, and one third
for system development. We found that setting the thresh-
old of the in-word probability to 0.85 or around works best
on the development data. After the initial segmentation
of a sentence, the consecutive single-characters are com-
bined into one word if their in-word probabilities are over
the threshold of 0.85. The text fragmenta96a44a96a8a97a73a98a44a99a44a100a44a101
contains a new worda99a44a100a44a101 which is not in the PK training
data. After the initial segmentation, the text is segmented
into a96a68a96 / a97a102a98 / a99 / a100 / a101 /, which is subsequently
changed into a96a68a96 / a97a103a98 / a99a68a100a68a101 after combining the
three consecutive characters. The in-word probabilities for
the three characters a99a44a59a60a100a44a59 and a101 are 0.94, 0.98, and
0.99, respectively.
2.3 Combining suffixes
A small set of characters , such as a104a44a59a60a105 and a106a44a59 fre-
quently occur as the last character in words. We selected
145 such characters from the PK training corpus, and 113
from the AS corpus. After combining single characters, we
combine a suffix character with the word preceding it if the
preceding word is at least two-character long.
2.4 Consistency check
The last step is to perform consistency checks. A seg-
mented sentence, after combining single characters and suf-
fixes, is checked against the training data to make sure
that a text fragment in a testing sentence is segmented in
the same way as in the training data if it also occurs in
the training data. From the PK training corpus, we cre-
ated a phrase segmentation table consisting of word quad-
grams, trigrams, bigrams, and unigrams, together with their
segmentations and frequencies. Our phrase table created
from the AS corpus does not include word quad-grams to
reduce the size of the phrase table. For example, from
the training text a107a44a108 / a109 / a110 / a111a8a112a68a59 we create
the following entries (only some are listed to save space):
text fragment freq segmentation
a107a73a108a64a109a61a110a64a111a61a112 1 a107a73a108 /a109 /a110 /a111a64a112
a107a73a108a64a109a61a110 1 a107a73a108 /a109 /a110
a109a61a110a64a111a61a112 1 a109 /a110 /a111a61a112
a109a61a110 1 a109 /a110
a111a61a112 1 a111a61a112
After a new sentence is processed by the first three steps,
we look up every word quad-grams of the segmented sen-
tence in the phrase segmentation table. When a word quad-
gram is found in the phrase segmentation table with a differ-
ent segmentation, we replace the segmentation of the word
quad-gram in the segmented sentence by its segmentation
found in the phrase table. This process is continued to word
trigrams, word bigrams, and word unigrams. The idea is
that if a text fragment in a new sentence is found in the
training data, then it should be segmented in the same way
as in the training data. As an example, in the PK testing
data, the sentence a107a113a108a61a109a64a110a61a114a61a115a64a116a61a117a118a71a73a119a61a117a64a120a61a121a62a122 is
segmented into a107a113a108 /a109a61a110 /a114a61a115 /a116 /a117 / a71 /a119a61a117a64a120
a121 /a122 after the first three steps (the two characters a116 and
a117 are not, but should be, combined because the in-word
probability of character a116a61a59 which is 0.71, is below the
pre-defined threshold of 0.85). The word bigram a107a102a108a68a109
a110 is found in the phrase segmentation table with a differ-
ent segmentation, a107a102a108 / a109 / a110a61a67 So the segmentation
a107a102a108 / a109a68a110 is changed to the segmentation a107a102a108 / a109 /
a110 in the final segmented sentence. In essence, when a text
fragment has two or more segmentations, its surrounding
context, which can be the preceding word, the following
word, or both, is utilized to choose the most appropriate
segmentation. When a text fragment in a testing sentence
never occurred in the same context in the training data, then
the most frequent segmentation found in the training data is
chosen. Consider the text a109a68a110 again, in the testing data,
a59a123a109a44a110a61a124a44a125 is segmented into a59 /a109a61a110 /a124a44a125 by our base
algorithm. In this case, a109a64a110 never occurred in the context
of a59a126a109a64a110a61a124a61a125a62a59a127a59a126a109a64a110 ora109a61a110a64a124a61a125a62a67 The consistency
check step changes a59 /a109a61a110 / a124a61a125 into a59 /a109 /a110 / a124
a125 since a109a64a110 is segmented into a109 / a110 515 times, but is
treated as one word a109a61a110 105 times in the training data.
3 New words recognition
We developed a few procedures to identify new words in
the testing data. Our first procedure is designed to recog-
nize numbers, dates, percent, time, foreign words, etc. We
defined a set of characters consisting of characters such as
the digits ‘0’ to ‘9’ (in ASCII and GB), the letters ‘a’ to
’z’, ‘A’ to ‘Z’ (in ASCII and GB), ‘a128a68a121a68a129a68a130a132a131a103a133a68a134a68a135
a136a68a137a68a138a65a139a68a140a68a141a68a142
a67a144a143a102a145a68a146a65a147a68a117a149a148 ’, and the like. Any
consecutive sequence of the characters that are in this pre-
defined set of characters is extracted and post-processed.
A set of rules is implemented in the post-processor. One
such rule is that if an extracted text fragments ends with
the character a117 and contains any character in a138a65a139a65a140a64a141
a142
a147a62a59 then remove the ending character a117 and keep the
remaining fragment as a word. For example, our recognizer
will extract the text fragment a131 a139a150a136a150a138 a117 and a147 a137 a117
since all the characters are in the pre-defined set of charac-
ters. The post-processor will strip off the trailing character
a117a44a59 and return a131
a139a65a136a65a138 and
a147
a137 as words. For per-
sonal names, we developed a program to extract the names
preceding texts such as a151a153a152a118a154a103a155a157a156 and a151a153a158a159a156a27a59 a pro-
gram to detect and extract names in a sequence of names
separated by the Chinese punctuation “a160 ”, such as a161a61a162a61a163
a164
a110a166a165a21a167a118a168a23a160a60a169a171a170a23a160a172a165a21a173a64a174a61a160a175a59 a program to extract
steps dict R P F a176a26a177a40a177a40a178 a176a123a179a87a178
1 1 pkd1 0.919 0.838 0.877 0.050 0.984
2 1 pkd2 0.940 0.892 0.915 0.347 0.984
3 1 pkd3 0.949 0.920 0.934 0.507 0.982
4 1-2 pkd3 0.950 0.935 0.942 0.610 0.975
5 1-3 pkd3 0.951 0.940 0.945 0.655 0.972
6 1-4 pkd3 0.955 0.938 0.946 0.647 0.977
Table 1: Results for the closed track using the PK corpus.
personal names (Chinese or foreign) following title or pro-
fession names, such asa180a64a181a183a182 in the texta184a64a185a61a186a61a187a118a188a73a180
a181a189a182a190a59 and a program to extract Chinese personal names
based on the preceding word and the following word. For
example, the string a191a171a192a64a107 in a193a64a163a171a71a113a191a171a192a65a107a113a194 is most
likely a personal name (in this case, it is) sincea191 is a Chi-
nese family name, the string is three-character long (a typ-
ical Chinese personal name is either three or two-character
long). Furthermore, the preceding word a71 and the fol-
lowing word a194 are highly unlikely to appear in a Chinese
personal name. For the personal names extracted from the
PK testing data, if the name is two or three-character long,
and if the first character or two is a Chinese family name,
then the family name is separated from the given name. The
family names are not separated from the given names for the
personal names extracted from the AS testing data. In some
cases, we find it difficult to decide whether or not the first
character should be removed from a personal name. Con-
sider the personal name a195a23a196a44a197 which looks like a Chinese
personal name since the first character is a Chinese family
name, and the name is three-character long. If it is a trans-
lated foreign name (in this case, it is), then the name should
not be split into family name and given name. But if it is
the name of a Chinese personal name, then the family name
a195 should be separated from the given name. For place
names, we developed a simple program to extract names of
cities, counties, towns, villages, streets, etc, by extracting
the strings of up to three characters appearing between two
place name designators. For example, from the text a198a64a199
a200a65a201a113a202a65a203a118a204a103a205a65a203a118a204a103a206
a59 our program will extract
a203a171a204
a205 and a203a207a204a113a206
a67
4 Results
The last row (in boldface) in Table 1 gives our official re-
sults for the PK closed track. Other rows in the table present
the results under different experimental conditions. The
column labeled steps refers to the executed steps of our
Chinese word segmentation algorithm. Step 1 segments a
text using the base segmentation algorithm, step 2 combines
single characters, step 3 attaches suffixes to the preceding
words, and step 4 performs consistency checks. The four
steps are described in details in section 2. The column la-
beled dict gives the dictionary used in each experiment. The
pkd1 consists of only the words from the PK training cor-
steps dict R P F a176a37a177a27a177a40a178 a176a37a179a50a178
1 1 asd1 0.950 0.936 0.943 0.000 0.970
2 1 asd2 0.950 0.943 0.947 0.132 0.968
3 1-2 asd2 0.951 0.952 0.951 0.337 0.964
4 1-3 asd2 0.949 0.952 0.951 0.372 0.961
5 1-4 asd2 0.966 0.956 0.961 0.364 0.980
Table 2: Results for the closed track using the AS corpus.
corpus dict R P F a176a37a177a27a177a40a178 a176a37a179a50a178
AS asd1 0.917 0.912 0.915 0.000 0.938
PK pkd1 0.909 0.829 0.867 0.050 0.972
Table 3: Performances of the maximum matching (forward)
using words from the training data.
pus, pkd2 consists of the words in pkd1 and the words con-
verted from pkd1 by changing the GB encoding to ASCII
encoding for the numeric digits and the English letters, and
pkd3 consists of the words in pkd2 and the words automat-
ically extracted from the PK testing texts using the proce-
dures described in section 3. The columns labeled R, P and
F give the recall, precision, and F score, respectively. The
columns labeled a208 a77a81a77a40a209 and a208
a45
a209 show the recall on out-of-
vocabulary words and the recall on in-vocabulary words,
respectively. All evaluation scores reported in this paper
are computed using the score program written by Richard
Sproat. We refer readers to (Sproat and Emerson, 2003) for
details on the evaluation measures. For example, row 4 in
table 1 gives the results using pkd3 dictionary when a sen-
tence is segmented by the base algorithm, and then the sin-
gle characters in the initial segmentation are combined, but
suffixes are not attached and consistency check is not per-
formed. The last row in table 2 presents our official results
for the closed track using the AS corpus. The asd1 dictio-
nary contains only the words from the AS training corpus,
while the asd2 consists of the words in asd1 and the new
words automatically extracted from the AS testing texts us-
ing the new words recognition described in section 3. The
results show that new words recognition and joining single
characters contributed the most to the increase in precision,
while the consistency check contributed the most to the in-
crease in recall. Table 3 gives the results of the maximum
matching using only the words in the training data. While
the difference between the F-scores of the maximum match-
ing and the base algorithm is small for the PK corpus, the
F-score difference for the AS corpus is much larger. Our
base algorithm performed substantially better than the max-
imum matching for the AS corpus. The performances of our
base algorithm on the testing data using the words from the
training data are presented in row 1 in table 1 for the a3a5a4
corpus, and row 1 in table 2 for the a0a2a1 corpus.
5 Discussions
In this section we will examine in some details the problem
of segmentation inconsistencies within the training data,
within the testing data, and between training data and test-
ing data. Due to space limit, we will only report our find-
ings in the PK corpus though the same kinds of inconsis-
tencies also occur in the AS corpus. We understand that
it is difficult, or even impossible, to completely eliminate
segmentation inconsistencies. However, perhaps we could
learn more about the impact of segmentation inconsisten-
cies on a system’s performance by taking a close look at the
problem.
We wrote a program that takes as input a segmented cor-
pus and prints out the shortest text fragments in the corpus
that have two or more segmentations. For each text frag-
ment, the program also prints out how the text fragment is
segmented, and how many times it is segmented in a partic-
ular way. While some of the text fragments, such as a210a171a211
anda212a61a213a214a59 truly have two different segmentations, depend-
ing on the contexts in which they occur or the meanings
of the text fragments, others are segmented inconsistently.
We ran this program on the PK testing data and found 21
unique shortest text fragments, which occur 87 times in to-
tal, that have two different segmentations. Some of the text
fragments, such asa215a44a216a44a217a60a59 are inconsistently segmented.
The fragment a215a64a216a65a217 occurs twice in the testing data and
is segmented into a215a61a216 / a217 in one case, but treated as one
word in the other case. We found 1,500 unique shortest text
fragments in the PK training data that have two or more seg-
mentations, and 97 unique shortest text fragments that are
segmented differently in the training data and in the test-
ing data. For example, the text a218a61a219a61a220a118a221 is treated as one
word in the training data, but is segmented into a218 / a219 /
a220 / a221 in the testing data. We found 11,136 unique short-
est text fragments that have two or more segmentations in
the AS training data, 21 unique shortest text fragments that
have two or more segmentations in the AS testing data, and
38 unique shortest text fragments that have different seg-
mentations in the AS training data and in the AS testing
data.
Segmentation inconsistencies not only exists within
training and testing data, but also between training and test-
ing data. For example, the text fragmenta222a61a119a61a223 occurs 35
times in the PK training data and is consistently segmented
into ” a222a68a119 / a223a61a59 but the same text fragment, occurring
twice in the testing data, is segmented into a222 / a119 / a223 in
both cases. The text a224a65a225 occurs 67 times in the training
data and is treated as one word a224a68a225 in all 67 cases, but
the same text, occurring 4 times in the testing data, is seg-
mented intoa224 / a225 in all 4 cases. The text a226a61a227a113a228 occurs
16 times in the training data, and is treated as one word in
all cases, but in the testing data, it is treated as one word in
three cases and segmented into a226a61a227 / a228 in one case. The
text a229a65a230 is segmented into a229 / a230 in 8 cases, but treated
as one word in one case in the training data. A couple of
text fragments seem to be incorrectly segmented. The text
a231a61a232a64a233a207a234
a155a61a96 in the testing data is segmented into
a231a61a232
a233a207a234 /
a155a64a96a60a59 and the texta235a61a236a64a237a61a238 segmented intoa235 /
a236a61a237a64a238a60a67
Our segmented texts of the PK testing data differ from
the reference segmented texts for 580 text fragments (427
unique). Out of these 580 text fragments, 126 text frag-
ments are among the shortest text fragments that have one
segmentation in the training data, but another in the test-
ing data. This implies that up to 21.7% of the mistakes
committed by our system may have been impacted by the
segmentation inconsistencies between the PK training data
and the PK testing data. Since there are only 38 unique
shortest text fragments found in the AS corpus that are seg-
mented differently in the training data and the testing data ,
the inconsistency problem probably had less impact on our
AS results. Out of the same 580 text fragments, 359 text
fragments (62%) are new words in the PK testing data. For
example, the proper name a239a241a240a64a242a60a59 which is a new word,
is incorrectly segmented into a239 /a240a64a242 by our system. An-
other example is the new word a243a113a244a171a243a113a245 which is treated
as one word in the testing data, but is segmented into a243 /
a244 / a243 / a245 by our system. Some of the longer text frag-
ments that are incorrectly segmented may also involve new
words, so at least 62%, but under 80%, of the incorrectly
segmented text fragments are either new words or involve
new words.
6 Conclusion
We have presented our word segmentation system and the
results for the closed track using the a0a2a1 corpus and the a3a5a4
corpus. The new words recognition, combining single char-
acters, and checking consistencies contributed the most to
the increase in precision and recall over the performance of
the base segmentation algorithm, which works better than
maximum matching. For the closed track experiment using
the a3a5a4 corpus, we found that 62% of the text fragments
that are incorrectly segmented by our system are actually
new words, which clearly shows that to further improve the
performance of our system, a better new words recognition
algorithm is necessary. Our failure analysis also indicates
that up to 21.7% of the mistakes made by our system for
the PK closed track may have been impacted by the seg-
mentation inconsistencies between the training and testing
data.

References
Richard Sproat and Tom Emerson. 2003. The First Interna-
tional Chinese Word Segmentation Bakeoff. In proceed-
ings of the Second SIGHAN Workshop on Chinese Lan-
guage Processing, July 11-12, 2003, Sapporo, Japan.
