Chinese Word Segmentation as LMR Tagging
Nianwen Xue
Inst. for Research in Cognitive Science
University of Pennsylvania
Philadelphia, PA 19104, USA
xueniwen@linc.cis.upenn.edu
Libin Shen
Dept. of Computer and Info. Science
University of Pennsylvania
Philadelphia, PA 19104, USA
libin@linc.cis.upenn.edu
Abstract
In this paper we present Chinese word
segmentation algorithms based on the so-
called LMR tagging. Our LMR taggers
are implemented with the Maximum En-
tropy Markov Model and we then use
Transformation-Based Learning to com-
bine the results of the two LMR taggers
that scan the input in opposite directions.
Our system achieves F-scores of a0a2a1a4a3a5a0a7a6
and a0a9a8a10a3a5a11a7a6 on the Academia Sinica corpus
and the Hong Kong City University corpus
respectively.
1 Segmentation as Tagging
Unlike English text in which sentences are se-
quences of words delimited by white spaces, in Chi-
nese text, sentences are represented as strings of
Chinese characters or hanzi without similar natural
delimiters. Therefore, the first step in a Chinese lan-
guage processing task is to identify the sequence of
words in a sentence and mark boundaries in appro-
priate places. This may sound simple enough but in
reality identifying words in Chinese is a non-trivial
problem that has drawn a large body of research in
the Chinese language processing community (Fan
and Tsai, 1988; Gan et al., 1996; Sproat et al., 1996;
Wu, 2003; Xue, 2003).
The key to accurate automatic word identification
in Chinese lies in the successful resolution of ambi-
guities and a proper way to handle out-of-vocabulary
words. The ambiguities in Chinese word segmenta-
tion is due to the fact that a hanzi can occur in differ-
ent word-internal positions (Xue, 2003). Given the
proper context, generally provided by the sentence
in which it occurs, the position of a hanzi can be de-
termined. In this paper, we model the Chinese word
segmentation as a hanzi tagging problem and use a
machine-learning algorithm to determine the appro-
priate position for a hanzi. There are several reasons
why we may expect this approach to work. First,
Chinese words generally have fewer than four char-
acters. As a result, the number of positions is small.
Second, although each hanzi can in principle occur
in all possible positions, not all hanzi behave this
way. A substantial number of hanzi are distributed
in a constrained manner. For example, , the plu-
ral marker, almost always occurs in the word-final
position. Finally, although Chinese words cannot be
exhaustively listed and new words are bound to oc-
cur in naturally occurring text, the same is not true
for hanzi. The number of hanzi stays fairly constant
and we do not generally expect to see new hanzi.
We represent the positions of a hanzi with four
different tags (Table 1): LM for a hanzi that oc-
curs on the left periphery of a word, followed by
other hanzi, MM for a hanzi that occurs in the mid-
dle of a word, MR for a hanzi that occurs on the
right periphery of word, preceded by other hanzi,
and LR for hanzi that is a word by itself. We call
this LMR tagging. With this approach, word seg-
mentation is a process where each hanzi is assigned
an LMR tag and sequences of hanzi are then con-
verted into sequences of words based on the LMR
tags. The use of four tags is linguistically intuitive
in that LM tags morphemes that are prefixes or stems
in the absence of prefixes, MR tags morphemes that
are suffixes or stems in the absence of suffixes, MM
tags stems with affixes and LR tags stems without
affixes. Representing the distributions of hanzi with
LMR tags also makes it easy to use machine learn-
ing algorithms which has been successfully applied
to other tagging problems, such as POS-tagging and
IOB tagging used in text chunking.
Right Boundary (R) Not Right Boundary (M)
Left Boundary (L) LR LM
Not Left Boundary (M) MR MM
Table 1: LMR Tagging
2 Tagging Algorithms
Our algorithm consists of two parts. We first imple-
ment two Maximum Entropy taggers, one of which
scans the input from left to right and the other scans
the input from right to left. Then we implement a
Transformation Based Algorithm to combine the re-
sults of the two taggers.
2.1 The Maximum Entropy Tagger
The Maximum Entropy Markov Model (MEMM)
has been successfully used in some tagging prob-
lems. MEMM models are capable of utilizing a
large set of features that generative models cannot
use. On the other hand, MEMM approaches scan
the input incrementally as generative models do.
The Maximum Entropy Markov Model used in
POS-tagging is described in detail in (Ratnaparkhi,
1996) and the LMR tagger here uses the same prob-
ability model. The probability model is defined over
a12a14a13a16a15 , where a12 is the set of possible contexts or
”histories” and a15 is the set of possible tags. The
model’s joint probability of a history a17 and a tag a18 is
defined as
a19a21a20
a17a23a22a24a18a24a25a27a26a29a28a31a30
a32
a33
a34a36a35a38a37a40a39
a41a43a42a5a44a46a45a47a49a48
a50
a34 (1)
where a28 is a normalization constant, a51a52a30a53a22
a39
a37
a22
a3a49a3a49a3
a22
a39
a32a4a54
are the model parameters and a51a56a55 a37 a22 a3a49a3a49a3 a22a57a55 a32a40a54 are known
as features, where a55 a34 a20 a17a58a22a24a18a59a25a61a60a62a51a64a63a9a22 a8 a54 . Each fea-
ture a55 a34 has a corresponding parameter
a39
a34 , that ef-
fectively serves as a ”weight” of this feature. In
the training process, given a sequence of characters
a51a64a65
a37
a22 a22a59a65a46a66
a54 and their LMR tags
a51a52a18
a37
a22
a3a49a3a49a3
a22a24a18a67a66
a54 as train-
ing data, the purpose is to determine the parameters
a51a52a30a53a22
a39
a37
a22
a3a49a3a49a3
a22
a39
a32a4a54 that maximize the likelihood of the
training data using a19 :
a68
a20a70a69
a25a27a26
a66
a33
a71a49a35a38a37
a69a72a20
a17
a71
a22a24a18
a71
a25a27a26
a66
a33
a71a73a35a38a37
a28a31a30
a32
a33
a34a46a35a38a37a40a39
a41 a42a74a44a59a75a76a45a47a77a75a78a48
a50
a34 (2)
The success of the model in tagging depends to
a large extent on the selection of suitable features.
Given a20 a17a23a22a24a18a24a25 , a feature must encode information that
helps to predict a18 . The features we used in our ex-
periments are instantiations of the feature templates
in (1). Feature templates (b) to (e) represent charac-
ter features while (f) represents tag features. In the
following list, a79a81a80a83a82 a3a49a3a49a3 a79a84a82 are characters and a15 a80a83a82 a3a49a3a49a3a15 a82
are LMR tags.
(1) Feature templates
(a) Default feature
(b) The current character (a79a84a85 )
(c) The previous (next) two characters
(a79a81a80a83a86 , a79a81a80 a37 , a79 a37 , a79a84a86 )
(d) The previous (next) character and the current
character (a79 a80 a37 a79 a85 , a79 a85 a79 a37 ),
the previous two characters (a79a81a80a83a86a64a79a81a80 a37 ), and
the next two characters (a79 a37 a79a84a86 )
(e) The previous and the next character (a79a81a80 a37 a79 a37 )
(f) The tag of the previous character (a15 a80 a37 ), and
the tag of the character two before the current
character (a15 a80a83a86 )
2.2 Transformation-Based Learning
One potential problem with the MEMM is that it
can only scan the input in one direction, from left
to right or from right to left. It is noted in (Lafferty
et al., 2001) that non-generative finite-state models,
MEMM models included, share a weakness which
they call the Label Bias Problem (LBP): a transition
leaving a given state compete only against all other
transitions in the model. They proposed Conditional
Random Fields (CRFs) as a solution to address this
problem.
A partial solution to the LBP is to compute the
probability of transitions in both directions. This
way we can use two MEMM taggers, one of which
scans the input from left to right and the other scans
the input from right to left. This strategy has been
successfully used in (Shen and Joshi, 2003). In that
paper, pairwise voting (van Halteren et al., 1998) has
been used to combine the results of two supertaggers
that scan the input in the opposite directions.
The pairwise voting is not suitable in this appli-
cation because we must make sure that the LMR
tags assigned to consecutive words are compatible.
For example, an LM tag cannot immediately follow
an MM. Pairwise voting does not use any contex-
tual information, so it cannot prevent incompatible
tags from occurring. Therefore, in our experiments
described here, we use the Transformation-Based
Learning (Brill, 1995) to combine the results of two
MEMM taggers. The feature set used in the TBL al-
gorithm is similar to those used in the NP Chunking
task in (Ngai and Florian, 2001).
3 Experiments
We conducted closed track experiments on three
data sources: the Academia Sinica (AS) corpus,
the Beijing University (PKU) corpus and the Hong
Kong City University (CityU) corpus. We first split
the training data from each of the three sources into
two portions. a0a7a87a4a8 a63 of the official training data is
used to train the MEMM taggers, and the other a8a64a87a4a8 a63
is held out as the development test data (the devel-
opment set). The development set is used to esti-
mate the optimal number of iterations in the MEMM
training. Figure (1), (2) and (3) show the curves of
F-scores on the development set with respect to the
number of iterations in MEMM training.
0.959
0.95905
0.9591
0.95915
0.9592
0.95925
0.9593
0.95935
0.9594
0.95945
0.9595
200 300 400 500 600 700 800
F-score
a88
iteration
AS
Figure 1: Learning curves on the development
dataset of the Academia Sinica corpus. X-axis
stands for the number of iteration in training. Y-axis
stands for the a89 -score.
Experiments show that the MEMM models
0.9126
0.9128
0.913
0.9132
0.9134
0.9136
0.9138
0.914
0.9142
0.9144
0.9146
0.9148
100 150 200 250 300
F-score
a88
iteration
HK
Figure 2: Learning curves on the development
dataset of the HK City Univ. corpus.
0.9381
0.9382
0.9383
0.9384
0.9385
0.9386
0.9387
0.9388
0.9389
0.939
0.9391
200 300 400 500 600 700 800
F-score
a88
iteration
PK
Figure 3: Learning curves on the development
dataset of the Beijing Univ. corpus.
achieve the best results after 500 and 400 rounds (it-
erations) of training on the AS data and the PKU
data respectively. However, the results on the CityU
data is not very clear. From Round 100 through 200,
the F-score on the development set almost stays un-
changed. We think this is because the CityU data
is from three different sources, which differ in the
optimal number of iterations. We decided to train
the MEMM taggers for 160 iterations the HK City
University data.
We implemented two MEMM taggers, one scans
the input from left to right and one from right to
left. We then used these two MEMM taggers to tag
both the training and the development data. We use
the LMR tagging output to train a Transformation-
Based learner, using fast TBL (Ngai and Florian,
2001). The middle in Table 2 shows the F-score
on the development set achieved by the MEMM tag-
ger that scans the input from left to right and the
last column is the results after the Transformation-
Based Learner is applied. The results show that us-
ing Transformation-Based learning only give rise to
slight improvements. It seems that the bidirectional
approach does not help much for the LMR tagging.
Therefore, we only submitted the results of our left-
to-right MEMM tagger, retrained on the entire train-
ing sets, as our official results.
F-score MEMM MEMM+TBL
AS 0.9595 0.9603
HK 0.9143 N/A
PK 0.9391 0.9398
Table 2: F-score on development data
The results on the official test data is similar to
what we have got on our development set, except
that the F-score on the Beijing Univ. corpus is over
2a6 lower in absolute accuracy than what we ex-
pected. The reason is that in the training data of
Beijing University corpus, all the numbers are en-
coded in GBK, while in the test data many numbers
are encoded in ASCII, which are unknown to our
tagger. With this problem fixed, the results of the
official test data are compatible with the results on
our development set. However, we have withdrawn
our segmentation results on the Beijing University
corpus.
corpus R P F a90a92a91a23a91a23a93 a90a95a94a57a93
AS 0.961 0.958 0.959 0.729 0.966
HK 0.917 0.915 0.916 0.670 0.936
Table 3: Official Bakeoff Outcome
4 Conclusions and Future Work
Our closed track experiments on the first Sighan
Bakeoff data show that the LMR algorithm pro-
duces promising results. Our system ranks the sec-
ond when tested on the Academia Sinica corpus and
third on the Hong Kong City University corpus. In
the future, we will try to incorporate a large word list
into our tagger to test its performance on open track
experiments. Its high accuracy on a90a92a91a23a91a23a93 makes it a
good candidate as a general purpose segmenter.

References
E. Brill. 1995. Transformation-based error-driven learn-
ing and natural language processing: A case study
in part-of-speech tagging. Computational Linguistics,
21(4):543–565.
C. K. Fan and W. H. Tsai. 1988. Automatic word iden-
tification in chinese sentences by the relaxation tech-
nique. Computer Processing of Chinese and Oriental
Languages, 4(1):33–56.
Kok-Wee Gan, Martha Palmer, and Kim-Teng Lua. 1996.
A statistically emergent approach for language pro-
cessing: Application to modeling context effects in
ambiguous chinese word boundary perception. Com-
putational Linguistics, 22(4):531–53.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Condi-
tional random fields: Probabilistic models for stgmen-
tation and labeling sequence data. In Proceedings of
ICML 2001.
G. Ngai and R. Florian. 2001. Transformation-based
learning in the fast lane. In Proceedings of NAACL-
2001, pages 40–47.
Adwait Ratnaparkhi. 1996. A maximum entropy part-of-
speech tagger. In Proceedings of the Empirical Meth-
ods in Natural Language Processing Conference, Uni-
versity of Pennsylvania.
L. Shen and A. K. Joshi. 2003. A SNoW based supertag-
ger with application to NP chunking. In Proceedings
of ACL 2003.
R. Sproat, Chilin Shih, William Gale, and Nancy Chang.
1996. A stochastic finite-state word-segmentation
algorithm for chinese. Computational Linguistics,
22(3):377–404.
H. van Halteren, J. Zavrel, and W. Daelmans. 1998. Im-
proving data driven wordclass tagging by system com-
bination. In Proceedings of COLING-ACL 98.
Andi Wu. 2003. Customizable segmentation of mor-
phologically derived words in chinese. Computational
Linguistics and Chinese Language Processing.
Nianwen Xue. 2003. Chinese word segmentation as
character tagging. Computational Linguistics and
Chinese Language Processing.
