Integrating Ngram Model and Case-based Learning
For Chinese Word Segmentation
Chunyu Kit Zhiming Xu Jonathan J. Webster
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Ave., Kowloon, Hong Kong
{ctckit, ctxuzm, ctjjw}@cityu.edu.hk
Abstract
This paper presents our recent work
for participation in the First Interna-
tional Chinese Word Segmentation Bake-
off (ICWSB-1). It is based on a general-
purpose ngram model for word segmen-
tation and a case-based learning approach
to disambiguation. This system excels
in identifying in-vocabulary (IV) words,
achieving a recall of around 96-98%.
Here we present our strategies for lan-
guage model training and disambiguation
rule learning, analyze the system’s perfor-
mance, and discuss areas for further im-
provement, e.g., out-of-vocabulary (OOV)
word discovery.
1 Introduction
After about two decades of studies of Chinese word
segmentation, ICWSB-1 (henceforth, the bakeoff)
is the first effort to put different approaches and
systems to the test and comparison on common
datasets. We participated in the bakeoff with a
segmentation system that is designed to integrate a
general-purpose ngram model for probabilistic seg-
mentation and a case- or example-based learning
approach (Kit et al., 2002) for disambiguation.
The ngram model, with words extracted from
training corpora, is trained with the EM algorithm
(Dempster et al., 1977) using unsegmented train-
ing corpora. Originally it was developed to en-
hance word segmentation accuracy so as to facili-
tate Chinese-English word alignment for our ongo-
ing EBMT project, where only unsegmented texts
are available for training. It is expected to be ro-
bust enough to handle novel texts, independent of
any segmented texts for training. To simplify the
EM training, we used the uni-gram model for the
bakeoff and relied on the Viterbi algorithm (Viterbi,
1967) for the most probable segmentation, instead of
attempting to exhaust all possible segmentations of
each sentence for a complicated full version of EM
training.
The case-based learning works in a straightforward
way. It first extracts case-based knowledge, as a set
of context-dependent transformation rules, from the
segmented training corpus, and then applies them to
ambiguous strings in a test corpus according to the
similarity of their contexts. The similarity is
computed empirically as the length of the relevant
common affixes of the context strings.
The effectiveness of this integrated approach is
verified by its performance on IV word identification.
Its IV recall rate, ranging from 96% to 98%, ranks
first or second in all closed tests in which we
participated. Unfortunately, its overall performance
does not reach the same level, due to the lack of a
module for OOV word detection.
This paper is intended to present the implementa-
tion of the system and analyze its performance and
problems, aiming at exploration of directions for fur-
ther improvement. The remaining sections are or-
ganized as follows. Section 2 presents the ngram
model and its training with the EM algorithm, and
Section 3 presents the case-based learning for dis-
ambiguation. The overall architecture of our system
is given in Section 4, and its performance and prob-
lems are analyzed in Section 5. Section 6 concludes
the paper and previews future work.
2 Ngram model and training
An ngram model can be utilized to find the most
probable segmentation of a sentence. Given a Chinese
sentence $s = c_1 c_2 \cdots c_m$ (also denoted as $c_1^m$),
its probabilistic segmentation into a word sequence
$w_1 w_2 \cdots w_k$ (also denoted as $w_1^k$) with the aid of an
ngram model can be formulated as
$$\mathrm{seg}(s) = \arg\max_{s = w_1 \circ w_2 \circ \cdots \circ w_k} \prod_{i=1}^{k} p(w_i \mid w_{i-n+1}^{i-1}) \qquad (1)$$
where $\circ$ denotes string concatenation, $w_{i-n+1}^{i-1}$ the
context (or history) of $w_i$, and $n$ is the order of the
ngram model in use. We have opted for uni-gram for
the sake of simplicity. Accordingly, $p(w_i \mid w_{i-n+1}^{i-1})$
in (1) becomes $p(w_i)$, which is commonly estimated
as follows, given a corpus $C$ for training.
$$p(w_i) := \frac{f(w_i)}{\sum_{w \in C} f(w)} \qquad (2)$$
In order to estimate a reliable $p(w_i)$, the ngram
model needs to be trained with the EM algorithm
using the available training corpus. Each EM iteration
aims at approaching a more reliable $f(w)$ for
estimating $p(w)$, as follows:
$$f^{k+1}(w) = \sum_{s \in C} \sum_{s' \in S(s)} p^k(s')\, f^k(w \in s') \qquad (3)$$
where $k$ denotes the current iteration, $S(s)$ the set of
all possible segmentations for $s$, and $f^k(w \in s')$ the
number of occurrences of $w$ in a particular segmentation $s'$.
However, assuming that every sentence always
has a segmentation, the following equation should hold:
$$\sum_{s' \in S(s)} p^k(s') = 1 \qquad (4)$$
Accordingly, we can adjust (3) as (5) with a normalization
factor $\alpha = \sum_{s' \in S(s)} p^k(s')$, to avoid favoring
words in shorter sentences too much. In general,
shorter sentences have higher probabilities.
$$f^{k+1}(w) = \sum_{s \in C} \sum_{s' \in S(s)} \frac{p^k(s')}{\alpha}\, f^k(w \in s') \qquad (5)$$
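To make (5) concrete, the following Python sketch performs one full EM iteration on a toy corpus by enumerating every segmentation of each sentence. It is an illustration under stated assumptions, not the code used for the bakeoff: the function names `segmentations` and `em_step`, the `max_len` cap on word length, and the brute-force enumeration are all assumptions made for brevity, and the normalization factor of (5) appears as `alpha`.

```python
from collections import defaultdict
from math import prod


def segmentations(sentence, vocab, max_len=4):
    """Yield every segmentation of `sentence` into words found in `vocab`."""
    if not sentence:
        yield []
        return
    for i in range(1, min(max_len, len(sentence)) + 1):
        head = sentence[:i]
        if head in vocab:
            for rest in segmentations(sentence[i:], vocab, max_len):
                yield [head] + rest


def em_step(corpus, freq, max_len=4):
    """Return the counts f^{k+1} computed from f^k, following (2) and (5)."""
    total = sum(freq.values())
    p = {w: c / total for w, c in freq.items()}            # equation (2)
    new_freq = defaultdict(float)
    for s in corpus:
        segs = list(segmentations(s, p, max_len))
        scores = [prod(p[w] for w in seg) for seg in segs]  # p^k(s')
        alpha = sum(scores)                                 # normalization factor
        if alpha == 0:
            continue                                        # no usable segmentation
        for seg, score in zip(segs, scores):
            for w in seg:
                # every occurrence of w in s' contributes p^k(s') / alpha
                new_freq[w] += score / alpha
    return dict(new_freq)
```

For the one-sentence corpus `["ab"]` with counts a = 1, b = 1 and ab = 2, the two segmentations have probabilities 0.0625 and 0.5, so `em_step` returns 1/9 for each of `a` and `b` and 8/9 for `ab`. The enumeration is exponential in sentence length, which is one reason to prefer the Viterbi approximation.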
Following the conventional idea to speed up the
EM training, we turned to the Viterbi algorithm. The
underlying philosophy is to distribute more probability
to more probable events. The Viterbi segmentation,
by utilizing dynamic programming techniques to go
through the word trellis of a sentence efficiently,
finds the most probable segmentation under the current
parameter estimation of the language model, fulfilling
(1). Accordingly, (3) becomes
$$f^{k+1}(w) = \sum_{s \in C} p^k(\mathrm{seg}(s))\, f^k(w \in \mathrm{seg}(s)) \qquad (6)$$
and (5) becomes
$$f^{k+1}(w) = \sum_{s \in C} f^k(w \in \mathrm{seg}(s)) \qquad (7)$$
where the normalization factor is skipped, because
only the Viterbi segmentation is used for EM
re-estimation. Equation (7) makes the EM training
with the Viterbi algorithm very simple for the uni-gram
model: iterate word segmentation, as in (1), and
word count updating, via (7), sentence by sentence
through the training corpus until convergence.
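The training loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the names `viterbi_segment` and `viterbi_em`, the `max_len` cap on word length, and the choice to update counts once per pass over the corpus (rather than after every sentence) are assumptions made for the sketch.

```python
import math
from collections import Counter


def viterbi_segment(sentence, prob, max_len=4):
    """Most probable uni-gram segmentation of `sentence`, or None if none exists."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability of sentence[:i]
    back = [0] * (n + 1)           # back[i]: start of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in prob and best[start] > -math.inf:
                score = best[start] + math.log(prob[word])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        return None
    words, end = [], n
    while end > 0:                 # follow back-pointers from the sentence end
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]


def viterbi_em(corpus, freq, iterations=6, max_len=4):
    """Alternate segmentation with (1) and count updating with (7)."""
    freq = dict(freq)
    for _ in range(iterations):
        total = sum(freq.values())
        prob = {w: c / total for w, c in freq.items() if c > 0}   # equation (2)
        new_freq = Counter()
        for s in corpus:
            words = viterbi_segment(s, prob, max_len)
            if words is not None:
                new_freq.update(words)    # occurrences of w in seg(s)
        new_freq = dict(new_freq)
        if new_freq == freq:              # counts unchanged: converged
            break
        freq = new_freq
    return freq
```

On the toy corpus `["abab", "abc", "cab"]`, starting from a count of 1 for each of `a`, `b`, `c`, `ab`, `ba`, `bc`, `ca` and `max_len=2`, the counts settle at `{"ab": 4, "c": 2}` within three passes. Words that never appear in a Viterbi segmentation drop out of the model, which is why the quality of the initial counts matters.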
Since the EM algorithm converges only to a local
maximum, it is critical to start the training with an
initial $f^0(w)$ for each word not too far away from its
"true" value. Our strategy for initializing $f^0(w)$ is
to assume all possible words in the training corpus
to be equiprobable and count each of them as 1;
$p^0(w)$ is then derived using (2). This strategy is
supposed to have a weaker bias towards longer words
than maximal matching segmentation.
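The initialization can be sketched in Python as below. What counts as a "possible word" is not spelled out here, so the sketch makes an explicit assumption: every substring of up to `max_len` characters is a candidate, optionally filtered by a `lexicon` of words extracted beforehand. The names `init_counts`, `to_probs`, `max_len` and `lexicon` are hypothetical.

```python
def init_counts(corpus, max_len=4, lexicon=None):
    """Assign the initial count f^0(w) = 1 to every candidate word in the corpus."""
    freq = {}
    for s in corpus:
        for start in range(len(s)):
            for end in range(start + 1, min(start + max_len, len(s)) + 1):
                w = s[start:end]
                if lexicon is None or w in lexicon:
                    freq[w] = 1        # counted once, however often it occurs
    return freq


def to_probs(freq):
    """Derive p^0(w) from f^0(w) with equation (2)."""
    total = sum(freq.values())
    return {w: c / total for w, c in freq.items()}
```

With all words equiprobable, the probability of a segmentation depends only on its number of words, so the first Viterbi pass prefers segmentations with fewer words without committing to the longest match at each position.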
For the bakeoff, the ngram model is trained with
the unsegmented training corpora together with the
test sets. It is a kind of unsupervised training.
Adding the test sets to the training data is reasonable,
as it allows the model to adapt to the test sets.
Experiments show that the training converges very fast,
and the segmentation performance improves significantly
from iteration to iteration. For the bakeoff experiments,
we carried out the training for six iterations, because
further iterations were not observed to bring any
significant improvement in segmentation accuracy on the
training sets.
3 Case-based learning for disambiguation
No matter how well the language model is trained,
probabilistic segmentation cannot avoid mistakes on
ambiguous strings, although it resolves most ambi-
guities by virtue of probability. For the remaining
unresolved ambiguities, however, we have to resort
to other strategies and/or resources. Our recent study
(Kit et al., 2002) shows that case-based learning is
an effective approach to disambiguation.
The basic idea behind the case-based learning is
to utilize existing resolutions for known ambiguous
strings to do disambiguation if similar ambiguities
occur again. This learning strategy can be imple-
mented in two straightforward steps:
1. Collection of correct answers from the train-
ing corpus for ambiguous strings together with
their contexts, resulting in a set of context-
dependent transformation rules;
2. Application of appropriate rules to ambiguous
strings.
A transformation rule of this type is actually an
example of segmentation, indicating how an ambiguous
string is segmented within a particular context.
It has the following general form:
$$C_l \; \sigma \; C_r : \sigma \rightarrow w_1 w_2 \cdots w_k$$
where $\sigma$ is the ambiguous string, $C_l$ and $C_r$ its left
and right contexts, respectively, and $w_1 w_2 \cdots w_k$
the correct segmentation of $\sigma$ given the contexts.
In our implementation, we set the context length on
each side to two words.
For a particular ambiguity, the example with the
most similar context in the example (or rule) base
is applied. The similarity is measured by the sum
of the lengths of the common suffix of the left
contexts and the common prefix of the right contexts.
The details of computing this similarity can be found
in (Kit et al., 2002). If no rule is applicable, the
probabilistic segmentation is retained.
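The rule selection described above can be sketched in Python as follows. The exact similarity computation is given in (Kit et al., 2002); this sketch only assumes a character-level comparison of context strings and treats any rule for the same ambiguous string as applicable. The names `Rule`, `similarity` and `disambiguate` are hypothetical, and the strings in the usage note are ASCII placeholders rather than real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    left: str            # left context, as a plain string
    target: str          # the ambiguous string
    right: str           # right context, as a plain string
    segmentation: tuple  # correct segmentation w_1 ... w_k of the target


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def similarity(left, right, rule):
    """Common suffix of the left contexts plus common prefix of the right contexts."""
    suffix = common_prefix_len(left[::-1], rule.left[::-1])
    prefix = common_prefix_len(right, rule.right)
    return suffix + prefix


def disambiguate(left, target, right, rules, fallback):
    """Apply the rule with the most similar context, else keep `fallback`."""
    candidates = [r for r in rules if r.target == target]
    if not candidates:
        return fallback                  # retain the probabilistic segmentation
    best = max(candidates, key=lambda r: similarity(left, right, r))
    return list(best.segmentation)
```

Given the rules `Rule("xy", "abc", "z", ("ab", "c"))` and `Rule("qy", "abc", "zz", ("a", "bc"))`, the query with left context `"wqy"` and right context `"zzt"` scores 2 against the first rule and 4 against the second, so the second rule is applied. Selecting with `max` over all candidates, rather than taking the first candidate found, is what makes the choice depend on context.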
For the bakeoff, we have based our approach to
ambiguity detection and disambiguation rule extrac-
tion on the assumption that only ambiguous strings
cause mistakes: we detect the discrepancies between our
probabilistic segmentation and the standard segmentation
of the training corpus, and turn them into
transformation rules. An advantage of this approach
is that the rules so derived carry out not only disam-
biguation but also error correction. This links our
disambiguation strategy to the application of Brill’s
(1993) transformation-based error-driven learning to
Chinese word segmentation (Palmer, 1997; Hocken-
maier and Brew, 1998).
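The extraction of rules from discrepancies can be sketched in Python as follows. It is a sketch under assumptions, not the system's actual procedure: a discrepancy is taken to be a minimal span bounded by word boundaries that both segmentations share, and the context is taken from up to two neighbouring words of the probabilistic segmentation. The names `aligned_chunks` and `extract_rules` and the tuple layout of a rule are hypothetical.

```python
def aligned_chunks(system, gold):
    """Yield (system_words, gold_words, i, j) for minimal spans with shared boundaries.

    `i` and `j` delimit the span as indices into `system`.
    """
    i = g = 0
    while i < len(system) and g < len(gold):
        si, gi = i + 1, g + 1
        len_s, len_g = len(system[i]), len(gold[g])
        while len_s != len_g:            # extend the shorter side until they meet
            if len_s < len_g:
                len_s += len(system[si])
                si += 1
            else:
                len_g += len(gold[gi])
                gi += 1
        yield system[i:si], gold[g:gi], i, si
        i, g = si, gi


def extract_rules(system, gold, context=2):
    """Turn each discrepancy into (left, ambiguous_string, right, correct_segmentation)."""
    assert "".join(system) == "".join(gold), "both must segment the same sentence"
    rules = []
    for sys_words, gold_words, start, end in aligned_chunks(system, gold):
        if sys_words != gold_words:
            left = "".join(system[max(0, start - context):start])
            right = "".join(system[end:end + context])
            rules.append((left, "".join(sys_words), right, tuple(gold_words)))
    return rules
```

For the system output `["x", "ab", "c", "y"]` and the standard segmentation `["x", "a", "bc", "y"]`, the only discrepancy is the span `abc`, which yields the rule `("x", "abc", "y", ("a", "bc"))`. Because every discrepancy becomes a rule, the same mechanism covers both genuine ambiguities and plain errors.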
4 System architecture
The overall architecture of our word segmentation
system is presented in Figure 1.
Figure 1: Overall architecture of the system
5 Performance and analysis
The performance of our system in the bakeoff is presented
in Table 1 in terms of precision (P), recall
(R) and F score in percentages, where the suffix "c" denotes
closed tests. Its IV word identification performance
is remarkable, with an IV recall (R_iv) of 95.9% to 98.0%.
However, its overall performance is not in balance
with this, due to the lack of a module for OOV
word discovery. It only gets a small number of OOV
words correct by chance. The higher the OOV proportion
in the test set, the lower the F score. The relatively
high R_oov for the PKc track is mostly the result
of number recognition with regular expressions.
Test    P      R      F      OOV    R_oov  R_iv
SAc     95.2   93.1   94.2    2.2    4.3   97.2
CTBc    80.0   67.4   73.2   18.1    7.6   95.9
PKc     92.3   86.7   89.4    6.9   15.9   98.0
Table 1: System performance, in percentages (%)
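For readers unfamiliar with the measures in Table 1, the Python sketch below computes them for a single sentence using their usual definitions: a word counts as correct when both of its boundaries match the standard segmentation, and F is the harmonic mean of P and R. This is an assumption about the scoring, not a copy of the bakeoff scoring script; the names `spans` and `evaluate` and the `vocab` argument (the set of words seen in training) are hypothetical, and both segmentations are assumed to be non-empty.

```python
def spans(words):
    """Map a segmentation to the list of (start, end) character spans of its words."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out


def evaluate(system, gold, vocab):
    """Return P, R, F and the recall on in-vocabulary and out-of-vocabulary words."""
    sys_spans = set(spans(system))
    hits = [(w, sp in sys_spans) for w, sp in zip(gold, spans(gold))]
    n_correct = sum(ok for _, ok in hits)
    p = n_correct / len(system)
    r = n_correct / len(gold)
    f = 2 * p * r / (p + r) if n_correct else 0.0
    iv = [ok for w, ok in hits if w in vocab]        # gold words seen in training
    oov = [ok for w, ok in hits if w not in vocab]   # gold words unseen in training
    return {
        "P": p,
        "R": r,
        "F": f,
        "R_iv": sum(iv) / len(iv) if iv else None,
        "R_oov": sum(oov) / len(oov) if oov else None,
    }
```

With system output `["ab", "c", "d", "e"]`, standard segmentation `["ab", "cd", "e"]` and `vocab = {"ab", "e"}`, two words are correct, giving P = 1/2, R = 2/3 and F = 4/7, with R_iv = 1 and R_oov = 0. The example mirrors the pattern in Table 1: a missed OOV word lowers both P and R while leaving R_iv untouched.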
5.1 Error analysis
Most errors on IV words are due to the side-effect
of the context-dependent transformation rules. The
rules resolve most remaining ambiguities and cor-
rect many errors, but at the same time they also cor-
rupt some proper segmentations. This side-effect is
most likely to occur when there is inadequate con-
text information to decide which rules to apply.
There are two strategies to remedy, or at least
alleviate, this side-effect: (1) retrain probabilistic
segmentation, a conservative strategy; or (2) incorporate
Brill’s error-driven learning with several rounds
of transformation rule extraction and application,
allowing mistakes caused by some rules in previous
rounds to be corrected by other rules in later rounds.
However, even worse than the above side-effect is
a bug in our disambiguation module: it always applies
the first available rule, rather than the one with the
most similar context, leading to many unexpected
errors, each of which may result in more than
one erroneous word. For instance, among 430 errors
made by the system in the SA closed test, some
70 are due to this bug. A number of representative
examples of these errors are presented in Table 2,
together with some false errors resulting from the
inconsistency in the standard segmentation.
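Errors    Standard    False errors    Standard
[Six erroneous segmentations, with parenthesized counts of 8, 7, 7, 5, 4 and 4, and six false errors, each paired with its standard segmentation; the Chinese example strings are not recoverable from the source text.]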
Table 2: Errors and false errors
6 Conclusion and future work
We have presented our recent work for partici-
pation in ICWSB-1 based on a general-purpose
ngram model for probabilistic word segmentation
and a case-based learning strategy for disambigua-
tion. The ngram model is trained using available
unsegmented texts with the EM algorithm with the
aid of Viterbi segmentation. The learning strategy
acquires a set of context-dependent transformation
rules to correct mistakes in the probabilistic segmentation
of ambiguous substrings. This integrated approach
proves effective, as demonstrated by its outstanding
performance on IV word identification. With elimination
of the bug and false errors, its
performance could be significantly better.
6.1 Future work
The above problem analysis points to two main di-
rections for improvement in our future work: (1)
OOV word detection; (2) a better strategy for learn-
ing and applying transformation rules to reduce the
side-effect. In addition, we are also interested in
studying the effectiveness of higher-order ngram
models and variants of EM training for Chinese
word segmentation.
Acknowledgements
The work is part of the CERG project "EBMT for
HK Legal Texts", funded by HK UGC under the
grant #9040482, with Jonathan J. Webster as the
principal investigator and Chunyu Kit, Caesar S.
Lun, Haihua Pan, King Kuai Sin and Vincent Wong
as investigators. The authors wish to thank all team
members for their contribution to this paper.

References
E. Brill. 1993. A Corpus-Based Approach to Language
Learning. Ph.D. thesis, University of Pennsylvania,
Philadelphia, PA.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977.
Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society,
Series B, 39:1–38.
J. Hockenmaier and C. Brew. 1998. Error-driven learning
of Chinese word segmentation. In PACLIC-12,
pages 218–229, Singapore. Chinese and Oriental
Languages Processing Society.
C. Kit, H. Pan, and H. Chen. 2002. Learning case-based
knowledge for disambiguating Chinese word segmentation:
A preliminary study. In COLING2002 workshop:
SIGHAN-1, pages 33–39, Taipei.
D. Palmer. 1997. A trainable rule-based algorithm
for word segmentation. In ACL-97, pages 321–328,
Madrid.
A. J. Viterbi. 1967. Error bounds for convolutional codes
and an asymptotically optimum decoding algorithm.
IEEE Transactions on Information Theory, IT-13:260–267.
