An Example-Based Chinese Word Segmentation System for CWSB-2
Chunyu Kit Xiaoyue Liu
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Ave., Kowloon, Hong Kong
{ctckit, xyliu0}@cityu.edu.hk
Abstract
This paper reports the example-based
segmentation system for our participa-
tion in the second Chinese Word Seg-
mentation Bakeoff (CWSB-2), present-
ing its basic ideas, technical details and
evaluation. It is a preliminary implemen-
tation. CWSB-2 evaluation shows
that it performs very well in identify-
ing known words. Its unknown word
detection module also shows great
potential. However, proper facilities for
identifying time expressions, numbers
and other types of unknown words are
needed for improvement.
1 Introduction
Word segmentation is the task of identifying lexical
items, especially individual word forms, in a text. It
involves two fundamental tasks, both aiming at
minimizing segmentation errors: one is to in-
fer out-of-vocabulary (OOV) words, also known
as unknown (or unseen) word detection, and the
other to identify in-vocabulary (IV) words, with
an emphasis on disambiguation. OOV words and
ambiguities are the two major causes of segmen-
tation errors.
Accordinglyword segmentation approaches
can be divided into the categories summarized in
Table 1 in terms of the resources in use to tackle
these two causes. The closed and open tracks in
CWSB correspond, respectively, to the last two
categories, both involving inferring OOV words
Category     Resource in use        Major task
             Lexicon   Tr. Corpus   OOV   Disamb.
WD^a           -         (-)^b       +
WS/CL^c        +          -          -       +
WS/IL^d        +          -          +       +
WS/TC^e       (+)^f       +          +       +
WS/TC+L^g      +          +          +       +

a. Word discovery, or unsupervised lexical acquisition
b. Input data is used for unsupervised training
c. Word segmentation with a complete lexicon
d. Word segmentation with an incomplete lexicon
e. Word segmentation with a pre-segmented training corpus
f. It can be extracted from the given training corpus.
g. Word segmentation with a pre-segmented training corpus
and an extra lexicon
Table 1: Categories of segmentation approach
beyond disambiguating IV words. Word discov-
ery and OOV word detection pursue a similar tar-
get, i.e., inferring new words. The continuum
connecting them is the size of the lexicon in use:
the former assumes that few words are known, the lat-
ter an existing lexicon of some scale. Inferring
new words is an essential task in word segmen-
tation, for a complete lexicon is rarely a realistic
assumption in practice.
This paper presents our segmentation system
for participation in CWSB-2. It takes an example-
based approach to recognize IV words and fol-
lows description length gain (DLG) to infer OOV
words in terms of their text compression effect.
Sections 2 and 3 below introduce the example-
based and DLG-based segmentation respectively.
Section 4 presents a strategy to combine their
strengths and Section 5 reports our system’s per-
formance in CWSB-2. Following error analysis
in Section 6, Section 7 concludes the paper.
2 Example-based segmentation
How to utilize as much information as possi-
ble from the training corpus to adapt a segmen-
tation system towards a segmentation standard
has been a critical issue. Kit et al. (2002) and
Kit et al. (2003) attempt to integrate case-based
learning with statistical models (e.g., n-gram) by
extracting transformation rules from the train-
ing corpus for disambiguation via error correc-
tion; Gao et al. (2004) adopt a similar strategy
for adaptive segmentation, with transformation
templates (instead of case-based rules) to modify
word boundaries (instead of individual words).
The basic idea of example-based segmentation
is very simple: existing pre-segmented strings in
the training corpus provide reliable examples for seg-
menting similar strings in input texts. In contrast
to dictionary checking for locating possible words
in an input sentence to facilitate later segmenta-
tion operations, pre-segmented examples give ex-
act segmentation to copy.
The example-based segmentation can be im-
plemented in the following steps.
1. Find all exemplar pre-segmented fragments,
with regard to a training corpus, and all
possible words, with regard to a lexicon,
starting from each character in an input sentence;
2. Identify the optimal sequence, among all
possibilities, of the above items over the sen-
tence following some optimization criterion.
If we adopt the minimal number of fragments or
words in a sequence as the optimization criterion, we
have a maximal matching approach to word seg-
mentation. However, it differs remarkably from
the previous maximal matching approaches: it
matches pre-segmented fragments, instead of dic-
tionary words, against an input sentence. It can be
carried out by a best-first strategy: repeatedly se-
lect the next longest example or word until the en-
tire sentence is properly covered. Unfortunately,
the best-first approach is not guaranteed to give
the best answer. For CWSB-2, we implemented
a program following the Viterbi algorithm to per-
form a complete search in terms of the number of
fragments, and then words, in a sequence.
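The complete search described above can be sketched as a simple dynamic program. The following is a minimal illustration in Python; the data structures (`examples`, `lexicon`) and the fallback to single characters are our assumptions, not the authors' actual implementation:

```python
def segment(sentence, examples, lexicon, max_len=20):
    """Viterbi-style search for the cover of `sentence` with the fewest
    matched items.  `examples` maps a raw string to its pre-segmented
    form (a tuple of words); `lexicon` is a set of words.  Unmatched
    single characters are kept as fallback items."""
    n = len(sentence)
    # best[i] = (number of items used, word list) for sentence[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(n):
        if best[i] is None:
            continue
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = sentence[i:j]
            if piece in examples:
                words = examples[piece]   # copy the exemplar's segmentation
            elif piece in lexicon or len(piece) == 1:
                words = (piece,)          # dictionary word, or fallback character
            else:
                continue
            cand = (best[i][0] + 1, best[i][1] + list(words))
            if best[j] is None or cand[0] < best[j][0]:
                best[j] = cand
    return best[n][1]
```

Note that an exemplar fragment counts as one item in the search but contributes its whole internal segmentation to the output, which is what distinguishes this scheme from plain dictionary-based maximal matching.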
However, a serious problem with this example-
based approach is the sparse data problem. Long
exemplar fragments are more reliable but small
in number, whereas short ones are large in num-
ber but less reliable. When no exemplar
fragment is available for an input sentence, this ap-
proach falls back to maximal matching segmen-
tation with a dictionary. How to incorporate sta-
tistical inference into example-based segmenta-
tion to infer more reliable optimal segmentation
beyond string matching remains a critical issue
for us to tackle.
3 DLG-based segmentation
DLG is formulated in Kit and Wilks (1999) and
Kit (2000) as an empirical measure for the com-
pression effect of extracting a substring from a
given corpus as a lexical item. DLG optimization
is applied to detect OOV words for our participa-
tion in CWSB-2. It works as follows in two steps.
1. Calculate the DLG for all known words
and all new word candidates (i.e., substrings
with frequency ≥ 2, preferably, in the test
corpus), based on frequency information in
the training and the test corpora;
2. Find the optimal sequence of such items over
an input sentence with the greatest sum of
DLG.
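For reference, the DLG measure of Kit and Wilks (1999) can be stated as follows (restated here from the cited work; the exact notation is ours):

```latex
\mathrm{DLG}(s) \;=\; \mathrm{DL}(X) \;-\; \mathrm{DL}(X[r \rightarrow s] \oplus s),
\qquad
\mathrm{DL}(X) \;\doteq\; -\sum_{x \in V} c(x) \log_2 \frac{c(x)}{|X|}
```

where X is the corpus, V its token vocabulary, c(x) the count of token x in X, X[r → s] the corpus with every occurrence of s replaced by a new symbol r, and ⊕ concatenation (with a delimiter). A candidate with a positive DLG compresses the corpus and is thus preferred as a lexical item.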
Step 2 above in our system re-implements only
the first round of DLG-based lexical learning in
Kit (2000). It is implemented by the same algo-
rithm as the one for example-based segmentation,
with DLG as optimization criterion. Evaluation
results show that this learning-via-compression
approach discovers many OOV words success-
fully, in particular, person names.
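As an illustration, the compression effect of a single candidate can be computed as below. This is a simplified sketch of the measure on a token list; in particular, the treatment of the appended copy of the candidate as plain concatenation is our simplification:

```python
import math
from collections import Counter

def description_length(counts):
    """Empirical description length: DL = -sum_x c(x) * log2(c(x)/N)."""
    n = sum(counts.values())
    return -sum(c * math.log2(c / n) for c in counts.values())

def dlg(tokens, candidate):
    """Description length gain of extracting `candidate` (a tuple of
    tokens) as one new item: DL(X) - DL(X[r->s] + s), with the appended
    copy of the candidate approximated by simple concatenation."""
    before = description_length(Counter(tokens))
    k = len(candidate)
    replaced, i = [], 0
    while i <= len(tokens) - k:
        if tuple(tokens[i:i + k]) == candidate:
            replaced.append("<r>")     # new symbol for the extracted string
            i += k
        else:
            replaced.append(tokens[i])
            i += 1
    replaced.extend(tokens[i:])
    after = description_length(Counter(replaced + list(candidate)))
    return before - after
```

A frequent substring yields a positive gain while a one-off substring yields a negative one, which is why only candidates with frequency ≥ 2 are worth considering, as in Step 1 above.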
4 Integration
The example-based segmentation is good at iden-
tifying IV words but incapable of recognizing any
new words. In contrast, the DLG-based segmen-
tation performs slightly worse but has potential to
detect new words. It is expected that the strength
of the two could be combined together for perfor-
mance enhancement.
However, because of time constraints we had
to take a shortcut to meet the CWSB-2
deadline: DLG segmentation is only applied
to recognize new words among the sequences of
mono-character items in the example-based seg-
mentation output.
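This shortcut can be pictured as follows. In this sketch, `dlg_segment` stands in for the DLG-based segmenter and is a hypothetical callable, not part of the authors' code:

```python
def integrate(example_output, dlg_segment):
    """Re-segment each maximal run of single-character items in the
    example-based output with the DLG-based segmenter; multi-character
    items (IV words) are kept as they are."""
    result, run = [], []
    for item in example_output:
        if len(item) == 1:
            run.append(item)        # collect a run of mono-character items
        else:
            if run:
                result.extend(dlg_segment("".join(run)))
                run = []
            result.append(item)     # keep the example-based IV word
    if run:
        result.extend(dlg_segment("".join(run)))
    return result
```

Only the character runs that the example-based pass could not group into words are exposed to DLG-based re-segmentation, so the IV output is never disturbed.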
Track     P      R      F     OOV   ROOV   RIV
ASc     .944   .902   .923   .043   .234   .976
PKUc    .929   .904   .916   .058   .252   .971
MSRc    .965   .935   .950   .026   .189   .986
Table 2: System performance in CWSB-2
5 Performance
Our group took part in three closed tracks in
CWSB-2, namely, ASc, PKUc and MSRc, with a
preliminary implementation of the example-based
word segmentation presented above. Our sys-
tem’s performance in terms of CWSB-2’s offi-
cial scores is presented in Table 2. Its ROOV
scores are unsatisfactory, showing that applying
the first round of DLG-based segmentation to se-
quences of mono-character items is inadequate
for the OOV word discovery task. Nevertheless,
its RIV scores are, in general, quite close to the
top systems in CWSB-2, although it does not have
a disambiguation module to polish its maximal
matching output.
However, this is not to say that the DLG-based
segmentation deserves no credit in unknown word
detection. It does recognize many OOV words,
as shown in Table 3. The low ROOV rate has to
do with our system’s inability to handle time
expressions, numbers, and foreign words.
6 Error analysis
Most errors made by our system are due to
the following causes: (1) no knowledge, overt
or implicit, in use for recognizing time expres-
sions, numbers and foreign words, as restricted by
CWSB-2 rules, (2) a premature module for OOV
word detection, (3) no further disambiguation be-
sides example application, and (4) significant in-
consistency in the training and test data.
The inconsistency exists not only between the
training and test corpora for each track but, more
surprisingly, also within individual training cor-
pora. Some suspected cases are illustrated in Ta-
bles 4, 5 and 6. Such cases are observed in large
numbers in the CWSB-2 corpora. Scoring against a
gold standard involving so many of them ap-
pears to be problematic, for it penalizes the sys-
tems for handling such cases right and rewards
the others for producing “correct” answers. What
ASc:l�(106)P!,�(45) (31)	�(29) j�(21)� �(20)"u
V(18)�+�(17)� �9(16)
+!-M(15)"��\(13) 	�(12)U�
 �	�(11)	5 }(11)��
	�(11)~ �}(10)Io(9)S!�r(8)R
o(8)n
��(7)!�_(7)�!�D(7)"u�
�(6)
�B��(6)"�
"1(5)�	XM(5)��\(5)!S�9(5)~�
*(5)�
�!~(5)\�(5)�
��(5)R� 1(4)��(4)��(4)�� B?(4) b 
�(3)"u
=(3)
��9Q0(3)~x
%(2) b	�(2)��	���(2)�(�(2) · · · · · ·
PKUc:�
�(38)
Wb(23)�
�(21)	�
�c:�(20)W

�(19)Z��(17)	+)(16)F,(15)
��h(12)��(11)l�
h(10)
�
�(10)s�~(9)
#�(9)	o�:(8)�
�(8)��	�(8)�

�(7)��(7)��(6)�
�
�(6)E
}�(6)P	Z(6)��(6)�b

g(5)	O:	X+(5)�?(5)
�

�(5)
�v
�(5) (5)
�C(5)J
�(5)W
�r(4)�
#�(4)�E(4)��(4)
�
g�(4)
�C(4)�

�(4)>�(4)
(4)
"�E(4)���(4)�*(4)E
��(3)
@
��(3)
C�iE(3)�B(3)��
(3)��(3)+J(3)�

�(3)=�(3)
�~h(3) · · · · · ·
MSRc:p(26)
*�(19)���(19)�)�(17)U(15)�
_(14)�](14)
>�(13)�(13)NSYJWSJY(12)�
�(12)
�(11)fy�(10)��(10)2��:(10)�==	(10)Z
(10)��	�(9)%
oE(9)v1h(8)!�(8)f	*(8)�
F�(7)�
	�
��(7)	��(7)i�(6)UE�(6)��(6)��(6)�(6)k
(5)	��Y(5)49(5)v1(5)�d)(5)
�
.(5)�aQ(4)
�(4)
��(4)
db(4)�
o�(4)U�I(4)1l0(3)=��(3)
�
�(3)	���(3)~l�(3)���(3)���(3)��(3)��

(3)
d
��(3)��E(3)��
�(3)�
@(3)��
C(3)�3^o(2) · · · · · ·
Table 3: Illustration of new words successfully
detected, with frequency in parentheses
is even more worth noting is that (1) an inconsis-
tent case involves more than one word, and
(2) while the difference between a correct and an
erroneous judgment of a word is, in a sense, only
one point, the gap between a system that loses
the point for doing right and one that earns it by
doing wrong is surely greater.
7 Conclusions
In the above sections we have reported the
example-based word segmentation system for
our participation in CWSB-2, including its ba-
sic ideas, technical details and evaluation results.
It has demonstrated excellent performance in IV
word identification and promising potential in OOV
word discovery. However, its weakness in han-
dling time expressions, numbers and other types
of unknown words has hindered it from perform-
ing better. We expect to implement a full-
fledged version of the system for further improvement.
Acknowledgements
The work described in this paper was supported
by the RGC of HKSAR, China, through the
CERG grant 9040861. We wish to thank Alex
Fang and Robert Neather for their help.
[Table 4 content: inconsistent word forms in the AS corpus, each shown as segmented in the training data and our answer with frequencies (fT/fA) versus the gold standard with frequencies (fT/fA); the Chinese characters were lost in text extraction.]
Table 4: Some inconsistent cases in AS corpus
[Table 5 content: inconsistent word forms in the PKU corpus, each shown as segmented in the training data and our answer with frequencies (fT/fA) versus the gold standard with frequencies (fT/fA); the Chinese characters were lost in text extraction.]
Table 5: Some inconsistent cases in PKU corpus
[Table 6 content: inconsistent word forms in the MSR corpus, each shown as segmented in the training data and our answer with frequencies (fT/fA) versus the gold standard with frequencies (fT/fA); the Chinese characters were lost in text extraction.]
Table 6: Some inconsistent cases in MSR corpus
References
E. Brill. 1993. A Corpus-Based Approach to Lan-
guage Learning. PhD thesis, University of Pennsyl-
vania, Philadelphia.
J. Gao, A. Wu, M. Li, C. Huang, H. Li, X. Xia and H.
Qin. 2004. Adaptive Chinese word segmentation.
In ACL-04. Barcelona, July 21-26.
C. Kit and Y. Wilks. 1999. Unsupervised learning of
word boundary with description length gain. In M.
Osborne and E. T. K. Sang (eds.), CoNLL-99, pp.1-
6. Bergen, Norway, June 12.
C. Kit. 2000. Unsupervised Lexical Learning as
Inductive Inference. PhD thesis, University of
Sheffield.
C. Kit, H. Pan and H. Chen. 2002. Learning case-
based knowledge for disambiguating Chinese word
segmentation: A preliminary study. In SIGHAN-1,
pp.33-39. Taipei, Sept. 1, 2002.
C. Kit, Z. Xu and J. J. Webster. 2003. Integrating
n-gram model and case-based learning for Chinese
word segmentation. In Q. Ma and F. Xia (eds.),
SIGHAN-2, pp.160-163. Sapporo, 11 July, 2003.
D. Palmer. 1997. A trainable rule-based algorithm for word
segmentation. In ACL-97, pp.321-328. Madrid.
