   
A two-stage statistical word segmentation system for Chinese 
Guohong Fu 
Dept of Linguistics 
 The University of Hong Kong 
Pokfulam Road, Hong Kong 
ghfu@hkucc.hku.hk 
K.K. Luke 
Dept of Linguistics 
The University of Hong Kong 
Pokfulam Road, Hong Kong 
kkluke@hkusua.hku.hk 
 
Abstract 
In this paper we present a two-stage 
statistical word segmentation system for 
Chinese based on word bigram and word-
formation models. This system was 
evaluated on Peking University corpora at 
the First International Chinese Word 
Segmentation Bakeoff. We also give 
results and discussions on this evaluation. 
1 Introduction 
Word segmentation, which aims to recognize the
implicit word boundaries in Chinese text, is very
important for Chinese language processing. During
the past decades, great success has been achieved in
Chinese word segmentation (Nie et al., 1995; Yao,
1997; Fu and Wang, 1999; Wang et al., 2000;
Zhang et al., 2002). However, two difficult problems
still remain in developing a practical segmentation
system for large open applications, i.e. ambiguity
resolution and unknown word (so-called OOV word)
identification.
In this paper, we present a two-stage statistical
word segmentation system for Chinese. In the first
stage, we employ a word bigram model to segment
known words (viz. the words included in the system
dictionary) in the input. In the second stage, we
develop a hybrid algorithm that performs unknown word
identification by incorporating word contextual
information, word-formation patterns and a word
juncture model.
The rest of this paper is organized as follows: 
Section 2 presents a word bigram solution for 
known word segmentation. Section 3 describes a 
hybrid approach for unknown word identification. 
In section 4, we report the results of our system at 
the SIGHAN evaluation program, and in the final 
section we give our conclusions on this work.  
2 The first stage: Segmentation of known 
words 
In a sense, known word segmentation is a process
of disambiguation. In our system, we use word
bigram language models and the Viterbi algorithm
(Viterbi, 1967) to resolve word boundary ambiguities
in known word segmentation.
For a particular input Chinese character string
$C = c_1 c_2 \cdots c_n$, there is usually more than one
possible segmentation $W = w_1 w_2 \cdots w_m$ according to
the given system dictionary. Word bigram segmentation
aims to find the most appropriate segmentation
$\hat{W} = w_1 w_2 \cdots w_m$ that maximizes the probability
$\prod_{i=1}^{m} P_r(w_i \mid w_{i-1})$, i.e.

$$\hat{W} = \arg\max_{W} P_r(W \mid C) \approx \arg\max_{W} \prod_{i=1}^{m} P_r(w_i \mid w_{i-1}) \qquad (1)$$
where $P_r(w_i \mid w_{i-1})$ is the probability that word $w_i$
will occur given the previous word $w_{i-1}$, which can be
easily estimated from a segmented corpus using
maximum likelihood estimation (MLE), i.e.

$$P_r(w_i \mid w_{i-1}) \approx \frac{\mathrm{Count}(w_{i-1} w_i)}{\mathrm{Count}(w_{i-1})} \qquad (2)$$
To avoid the problem of data sparseness in MLE, 
here we apply the linear interpolation technique 
(Jelinek and Mercer, 1980) to smooth the estimated 
word bigram probabilities.  
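As an illustration of Equations (1) and (2), the following is a minimal sketch of bigram segmentation with Viterbi search. It is not the system's actual implementation: the toy corpus, the sentence-start token `BOS`, the fixed interpolation weight `LAMBDA`, and the function names `prob` and `segment` are all assumptions made for the example, and the smoothing shown is a simple two-way interpolation of the bigram and unigram MLE estimates.

```python
import math
from collections import Counter

# Toy segmented corpus (illustrative only); "BOS" marks the sentence start.
corpus = [["BOS", "研究", "生命", "起源"],
          ["BOS", "研究生", "学习"],
          ["BOS", "研究", "生命"],
          ["BOS", "拼", "命"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
total = sum(unigram.values())
dictionary = {w for w in unigram if w != "BOS"}
MAX_LEN = max(len(w) for w in dictionary)
LAMBDA = 0.9  # assumed interpolation weight; the paper does not state one


def prob(prev, word):
    """Equation (2), smoothed by interpolation with the unigram MLE."""
    p_bi = bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    p_uni = unigram[word] / total
    return LAMBDA * p_bi + (1 - LAMBDA) * p_uni


def segment(text):
    """Viterbi search for the segmentation that maximizes Equation (1)."""
    # best[(end, word)] = (log-probability, back-pointer) of the best path
    # whose last word is `word` and which ends at character offset `end`.
    best = {(0, "BOS"): (0.0, None)}
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = text[start:end]
            if word not in dictionary:
                continue
            for (pos, prev), (score, _) in list(best.items()):
                if pos != start:
                    continue
                cand = score + math.log(prob(prev, word))
                if (end, word) not in best or cand > best[(end, word)][0]:
                    best[(end, word)] = (cand, (pos, prev))
    finals = [(s, key) for key, (s, _) in best.items() if key[0] == len(text)]
    if not finals:
        return None  # the dictionary cannot cover the input
    _, key = max(finals)
    words = []
    while key != (0, "BOS"):
        words.append(key[1])
        key = best[key][1]
    return words[::-1]


print(segment("研究生命起源"))
```

With these toy counts, the ambiguous string 研究生命起源 admits both 研究/生命/起源 and 研究生/命/起源, and the bigram scores favour the former.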
3 The second stage: Unknown word 
identification 
The second stage mainly concerns the segmentation of
unknown words, which remains unresolved in the first
stage. This section describes a hybrid algorithm for
unknown word identification, which incorporates a
word juncture model, word-formation patterns and
contextual information. To avoid the complicated
normalization of probabilities of different
dimensions, the simple superposition principle (i.e.
summing the individual model scores) is used in
merging these models.
3.1 Word juncture model 
The word juncture model scores an unknown word by
assigning word juncture types. Obviously, most
unknown words appear as a string of known words
after segmentation in the first stage. Therefore,
unknown word identification can be viewed as a
process of re-assigning the correct word juncture type
to each known word pair in the input. Given a known
word string $W = w_1 w_2 \cdots w_n$, between each word pair
$w_i w_{i+1}$ $(1 \le i \le n-1)$ is a word juncture. In general,
there are two types of junctures in unknown word
identification, namely word boundary (denoted by
$t_B$) and non-word boundary (denoted by $t_N$).
Let $t(w_i w_{i+1})$ denote the type of a word juncture
$w_i w_{i+1}$, and $P_r(t(w_i w_{i+1}))$ denote the relevant
conditional probability, then

$$P_r(t(w_i w_{i+1})) \stackrel{\mathrm{def}}{=} \frac{\mathrm{Count}(t(w_i w_{i+1}))}{\mathrm{Count}(w_i w_{i+1})} \qquad (3)$$
Thus, the word juncture probability $P_{CJM}(w_U)$ of a
particular unknown word $w_U = w_i w_{i+1} \cdots w_j$
$(1 \le i \le j \le n)$ can be calculated by

$$P_{CJM}(w_U) = P_r(t_B(w_{i-1} w_i)) \times P_r(t_B(w_j w_{j+1})) \times \prod_{l=i}^{j-1} P_r(t_N(w_l w_{l+1})) \qquad (4)$$
In a sense, the word juncture model mirrors the affinity
of known word pairs in forming an unknown word.
For a word juncture $(w_i, w_{i+1})$, the larger the
probability $P_r(t_N(w_i w_{i+1}))$, the more likely the two
words are to be merged together into one new word.
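The computation in Equations (3) and (4) can be sketched as follows. This is only an illustration under stated assumptions: the juncture observations are invented toy counts, the labels "B" and "N" stand for $t_B$ and $t_N$, and the helper names `p_juncture` and `p_cjm` are hypothetical.

```python
from collections import Counter

# Hypothetical juncture observations (left word, right word, juncture type);
# "B" = word boundary, "N" = non-word boundary. Toy counts, not corpus data.
observations = [
    ("的", "可读", "B"), ("的", "可读", "B"),
    ("可读", "性", "N"), ("可读", "性", "N"), ("可读", "性", "B"),
    ("性", "很", "B"), ("性", "很", "B"), ("性", "很", "N"),
]

pair_count = Counter((left, right) for left, right, _ in observations)
type_count = Counter(observations)


def p_juncture(left, right, jtype):
    """Equation (3): Count(t(w_i w_{i+1})) / Count(w_i w_{i+1})."""
    n = pair_count[(left, right)]
    return type_count[(left, right, jtype)] / n if n else 0.0


def p_cjm(words, i, j):
    """Equation (4) for the candidate words[i..j] (0-based, inclusive).

    Assumes the candidate has a neighbouring word on both sides.
    """
    score = p_juncture(words[i - 1], words[i], "B")   # left edge is a boundary
    score *= p_juncture(words[j], words[j + 1], "B")  # right edge is a boundary
    for l in range(i, j):                             # inner junctures are non-boundaries
        score *= p_juncture(words[l], words[l + 1], "N")
    return score


words = ["的", "可读", "性", "很"]
print(p_cjm(words, 1, 2))  # score of merging 可读 + 性 into one candidate word
```

Here the candidate 可读性 receives the product of two boundary probabilities at its edges and one non-boundary probability at its single inner juncture.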
3.2 Word-formation patterns 
The word-formation pattern model scores an unknown
word according to the probability of how each
internal known word contributes to its formation. In
general, a known word $w$ may take one of the
following four patterns while forming a word: (1)
$w$ itself is a word. (2) $w$ is at the beginning of an
unknown word. (3) $w$ is in the middle of an
unknown word. (4) $w$ appears at the end of an
unknown word. For convenience, we use $S$, $B$,
$M$ and $E$ to denote the four patterns respectively.
Let $pttn(w)$ denote a particular pattern of $w$ in an
unknown word and $P_r(pttn(w))$ denote the relevant
probability, then

$$P_r(pttn(w)) \stackrel{\mathrm{def}}{=} \frac{\mathrm{Count}(pttn(w))}{\mathrm{Count}(w)} \qquad (5)$$
Obviously, $\sum_{pttn} P_r(pttn(w)) = 1$, and $1 - P_r(S(w))$ is the
word-formation power of the known word $w$.
Let $P_{pttn}(w_U)$ be the overall word-formation
pattern probability of a certain unknown word
$w_U = w_1 w_2 \cdots w_l$, then

$$P_{pttn}(w_U) = \prod_{w_i \in w_U} P_r(pttn(w_i)) \qquad (6)$$
Theoretically speaking, a known word can take any
pattern while forming an unknown word. However, the
probabilities are not uniform across different known
words and different patterns. For example, the word 性
(xing4, nature) is more likely to act as a suffix of
words. According to our investigation of the
training corpus, the character 性 appears at the end
of a multiword in more than 93% of cases.
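Equations (5) and (6) can be illustrated with a short sketch. The training data below is invented for the example (each gold word is given as the sequence of known words it is made of), and the names `p_pattern`, `formation_power` and `p_pttn` are hypothetical; the sketch only shows how the pattern counts turn into probabilities.

```python
from collections import Counter
from math import prod

# Hypothetical training data: each gold word is listed as the sequence of
# known words that compose it. Toy examples, not taken from the PK corpus.
gold_words = [["可读", "性"], ["性"], ["可靠", "性"], ["性", "格"],
              ["可读"], ["重要", "性"], ["可", "操作", "性"]]

pattern_count = Counter()
word_count = Counter()
for pieces in gold_words:
    for k, piece in enumerate(pieces):
        if len(pieces) == 1:
            tag = "S"   # the known word is a word by itself
        elif k == 0:
            tag = "B"   # beginning of a longer word
        elif k == len(pieces) - 1:
            tag = "E"   # end of a longer word
        else:
            tag = "M"   # middle of a longer word
        pattern_count[(piece, tag)] += 1
        word_count[piece] += 1


def p_pattern(word, tag):
    """Equation (5): Count(pttn(w)) / Count(w)."""
    n = word_count[word]
    return pattern_count[(word, tag)] / n if n else 0.0


def formation_power(word):
    """Word-formation power 1 - P_r(S(w))."""
    return 1.0 - p_pattern(word, "S")


def p_pttn(pieces):
    """Equation (6) for a candidate made of two or more known words."""
    tags = ["B"] + ["M"] * (len(pieces) - 2) + ["E"]
    return prod(p_pattern(w, t) for w, t in zip(pieces, tags))


print(p_pattern("性", "E"), formation_power("性"), p_pttn(["可读", "性"]))
```

In this toy data 性 ends a longer word in four of its six occurrences, so it has a high end-pattern probability and a high word-formation power, which mirrors the suffix behaviour described above.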
3.3 Hybrid algorithm for unknown word 
identification 
The current algorithm for unknown word identification
consists of three major components: (1) An
unknown word extractor first extracts, from the
output of the first stage, a fragment of known words
$w_1 w_2 \cdots w_n$ that may contain unknown words, based on
the related word-formation power and word juncture
probability, together with its left and right
contextual words $w_L$ and $w_R$. (2) A candidate word
constructor then generates a lattice of all possible
new segmentations $\{W_U \mid W_U = x_1 x_2 \cdots x_m\}$ that may
involve unknown words from the extracted
fragment. (3) A decoder finally incorporates the word
juncture model $P_{CJM}(W_U)$, word-formation
patterns $P_{pttn}(W_U)$ and word bigram probability
$P_{bigram}(W_U)$ to score these candidates, and then
applies the Viterbi algorithm again to find the best
new segmentation $\hat{W}_U = x_1 x_2 \cdots x_m$ that has the
maximum score:

$$\begin{aligned}
\hat{W}_U &= \arg\max_{W_U} \{ P_{pttn}(W_U) + P_{CJM}(W_U) + P_{bigram}(W_U) \} \\
          &= \arg\max_{W_U} \Big\{ \sum_{i=1,\cdots,n} \big( P_{pttn}(x_i) + P_{CJM}(x_i) + P_r(x_i \mid x_{i-1}) \big) \Big\}
\end{aligned} \qquad (7)$$
where $x_0 = w_L$ and $x_{n+1} = w_R$. Let $w_U$ denote any
unknown word in the training corpus. If $x_i$ is an
unknown word, then

$$P_r(x_i \mid x_{i-1}) = \frac{\sum_{w_U} \mathrm{Count}(x_{i-1} w_U)}{\mathrm{Count}(x_{i-1})}.$$
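The candidate construction and additive scoring of Equation (7) can be sketched as below. Several simplifications are assumed: candidates are enumerated exhaustively rather than stored in a lattice and decoded with the Viterbi algorithm, the three scoring functions are passed in as hypothetical callables (stubbed here with made-up constants), and the transition into the right context word $w_R$ is added to the sum, which is our reading of the definition $x_{n+1} = w_R$ rather than something the equation states explicitly.

```python
from itertools import product


def candidates(fragment):
    """Yield every way of merging adjacent known words in `fragment`.

    Each candidate is a list of (start, end) index spans, end exclusive.
    """
    n = len(fragment)
    for cuts in product([False, True], repeat=n - 1):
        spans, start = [], 0
        for k, cut in enumerate(cuts, start=1):
            if cut:
                spans.append((start, k))
                start = k
        spans.append((start, n))
        yield spans


def decode(fragment, w_left, w_right, p_pttn, p_cjm, p_bigram):
    """Return the candidate segmentation with the largest additive score."""
    def score(spans):
        words = ["".join(fragment[a:b]) for a, b in spans]
        context = [w_left] + words + [w_right]
        total = 0.0
        for k, (a, b) in enumerate(spans):
            total += p_pttn(fragment[a:b]) + p_cjm(fragment, a, b)
            total += p_bigram(context[k], context[k + 1])
        # Assumed extra term: transition from the last word into w_right.
        total += p_bigram(context[-2], context[-1])
        return total

    best = max(candidates(fragment), key=score)
    return ["".join(fragment[a:b]) for a, b in best]


# Stub scoring functions with made-up values, for demonstration only.
def stub_pttn(pieces):
    return 0.5 if pieces == ["可读", "性"] else 0.1


def stub_cjm(fragment, a, b):
    return 0.4 if (a, b) == (0, 2) else 0.1


def stub_bigram(prev, word):
    return 0.1


print(decode(["可读", "性", "强"], "的", "。", stub_pttn, stub_cjm, stub_bigram))
```

Exhaustive enumeration grows as $2^{n-1}$ in the fragment length, which is acceptable for short fragments in a sketch; a lattice with Viterbi decoding, as described above, avoids this growth.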
4 Experiments 
Our system participated in both closed and open 
tests on Peking University corpora at the First 
International Chinese Word Segmentation Bakeoff.  
This section reports the results and discussions on 
its evaluation. 
4.1 Measures 
In the evaluation program of the First International
Chinese Word Segmentation Bakeoff, six measures
are employed to score the performance of a word
segmentation system, namely test recall (R), test
precision (P), the balanced F-measure
(F), the out-of-vocabulary (OOV) rate for the test
corpus, the recall on OOV words ($R_{OOV}$), and the
recall on in-vocabulary words ($R_{iv}$). OOV is defined
as the set of words in the test corpus not occurring
in the training corpus in the closed test, and the set
of words in the test corpus not occurring in the
lexicon used in the open test.
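For concreteness, the six measures can be computed as in the following sketch. It is not the official Bakeoff scoring script: the function names `spans`, `f_measure` and `evaluate` are hypothetical, words are matched by their character spans, and the balanced F-measure is taken to be the usual harmonic mean $F = 2PR/(P+R)$.

```python
def spans(words):
    """Convert a word list into a list of (start, end) character spans."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out


def f_measure(precision, recall):
    """Balanced F-measure, i.e. the harmonic mean of P and R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, test, vocabulary):
    """Score a test segmentation against the gold segmentation."""
    test_spans = set(spans(test))
    hits = [span in test_spans for span in spans(gold)]
    recall = sum(hits) / len(gold)
    precision = sum(hits) / len(test)
    oov = [h for w, h in zip(gold, hits) if w not in vocabulary]
    iv = [h for w, h in zip(gold, hits) if w in vocabulary]
    return {
        "R": recall,
        "P": precision,
        "F": f_measure(precision, recall),
        "OOV": len(oov) / len(gold),
        "R_OOV": sum(oov) / len(oov) if oov else 0.0,
        "R_iv": sum(iv) / len(iv) if iv else 0.0,
    }


# Toy example: 生命 is treated as out-of-vocabulary.
gold = ["研究", "生命", "起源"]
test = ["研究生", "命", "起源"]
print(evaluate(gold, test, vocabulary={"研究", "起源", "研究生", "命"}))
```

As a consistency check, `f_measure(0.942, 0.936)` rounds to 0.939 and `f_measure(0.941, 0.933)` rounds to 0.937, which agrees with the F, R and P values reported in Table 2.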
4.2 Experimental lexicons and corpora 
As shown in Table 1, we used only the training data
from the Peking University corpus to train our system
in both the open and closed tests. As for the
dictionary, we compiled a dictionary for the closed
test from the training corpus, which contained
55,226 words, and used a dictionary in the open test
that contained about 65,269 words.
 
Items    # words in lexicon   # train. words   # test. words
Closed   55,226               1,121,017        17,194
Open     65,269               1,121,017        17,194
Table 1: Experimental lexicons and corpora
4.3 Experimental results and discussion 
 
Items    F       R       P       OOV     R_OOV   R_iv
Closed   0.939   0.936   0.942   0.069   0.675   0.955
Open     0.937   0.933   0.941   0.094   0.762   0.950
Table 2: Test results on PK corpus
 
Segmentation speed: The test corpus contains about
28,458 characters in all. Our system takes about 3.21
and 3.07 seconds to perform
full segmentation (including known word
segmentation and unknown word identification) on
the closed and open test corpora respectively,
running on an ACER notebook (TM632XC-P4M).
This indicates that our system is able to process
about 531,925 to 556,182 characters per minute.
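The throughput figures follow directly from the corpus size and the running times; the short check below reproduces the arithmetic (the function name `chars_per_minute` is introduced only for this illustration).

```python
def chars_per_minute(n_chars, seconds):
    """Throughput in characters per minute."""
    return n_chars / seconds * 60


TEST_CHARS = 28_458  # characters in the test corpus
for name, seconds in [("closed", 3.21), ("open", 3.07)]:
    print(f"{name}: {chars_per_minute(TEST_CHARS, seconds):,.0f} characters per minute")
```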
Results and discussions: The results for the closed
and open tests are presented in Table 2. We can
draw some conclusions from these results.
Firstly, the overall performance of our system is
very stable across the closed and open tests. As
shown in Table 2, the out-of-vocabulary (OOV)
rate is 6.9% in the closed test and 9.4% in the open
test. However, the overall test F-measure drops by
only 0.2 percentage points (from 0.939 to 0.937) in
the open test, compared with the closed test.
Secondly, our approach can handle most unknown
words in the input. As can be seen from Table 2,
the recall on OOV words is 67.5% in the closed test
and 76.2% in the open test. Wang et al. (2000) and
Yao (1997) have proposed a character juncture
model and word-formation patterns for Chinese
unknown word identification. However, being
character-based methods, their approaches can only
work for unknown words that are made up purely of
monosyllabic characters. To address
this problem, we introduce both a word juncture
model and word-based word-formation patterns into
our system. As a result, our system can deal with
different unknown words that consist of different
known words, including monosyllabic characters
and multiwords.
Although our system is effective for most
ambiguities and unknown words in the input, it has
its inherent deficiencies. Firstly, to avoid data
sparseness, we do not differentiate known words
and unknown words while estimating word juncture
models and word-formation patterns from the
training corpus. This simplification may introduce
some noise into these models for identifying
unknown words. Our further investigations show
that the precision on OOV words is still very low,
i.e. 67.1% for the closed test and 72.5% for the open
test. As a result, our system may yield a number of
mistaken unknown words in processing.
Secondly, we regard known word segmentation and
unknown word identification as two independent
stages in our system. This strategy is obviously
simple and more easily applicable. However, it does
not work when the input contains a mixture of
ambiguities and unknown words. For example,
there was a sentence 中行长葛支行注重健身 in
the test corpus, where the string 中行长葛 is a
fragment mixed with ambiguity and unknown
words. The correct segmentation should be 中行/长葛/,
where 中行 (Zhonghang, the Bank of China) is
an abbreviation of an organization name, and 长葛
(Changge) is a place name. Actually, this fragment
is segmented as 中/行长/葛/ in the first stage of our
system. However, the unknown word identification
stage does not have a mechanism to split the word
行长 (Hangzhang, president), which finally results in
a wrong segmentation.
5 Conclusions 
This paper presents a two-stage statistical word
segmentation system for Chinese. In the first stage,
a word bigram model and the Viterbi algorithm are
applied to perform known word segmentation on
input plain text, and then a hybrid approach is
employed in the second stage to incorporate word
bigram probabilities, a word juncture model and
word-based word-formation patterns to detect OOV
words. The experiments on Peking University
corpora have shown that the present system, based
on fairly simple word bigram and word-formation
models, can achieve an F-score of 93.7% or above. In
future work, we hope to improve our strategies for
estimating the word juncture model and word-formation
patterns, and to develop an integrated segmentation
technique that can perform known word
segmentation and unknown word identification at
the same time.
Acknowledgments 
We would like to thank all colleagues of the First 
International Chinese Word Segmentation Bakeoff 
for their evaluations of the results and the Institute 
of Computational Linguistics, Peking University for 
providing the experimental corpora. 

References 
Fu, Guohong and Xiaolong Wang. 1999. Unsupervised 
Chinese word segmentation and unknown word 
identification. In: Proceedings of NLPRS’99, 
Beijing, China, 32-37. 
Jelinek, Frederick and Robert L. Mercer. 1980.
Interpolated estimation of Markov source parameters
from sparse data. In: Proceedings of Workshop on
Pattern Recognition in Practice, Amsterdam, 381-397.
Nie, Jian-Yuan, M.-L. Hannan and W.-Y. Jin. 1995.
Unknown word detection and segmentation of
Chinese using statistical and heuristic knowledge.
Communications of COLIPS, 5(1&2): 47-57.
Viterbi, A.J. 1967. Error bounds for convolutional
codes and an asymptotically optimum decoding
algorithm. IEEE Transactions on Information
Theory, IT-13(2): 260-269.
Wang, Xiaolong, Fu Guohong, Danial S. Yeung, James
N.K. Liu, and Robert Luk. 2000. Models and
algorithms of Chinese word segmentation. In:
Proceedings of the International Conference on
Artificial Intelligence (IC-AI’2000), Las Vegas,
Nevada, USA, 1279-1284.
Yao, Yuan. 1997. Statistics Based approaches towards 
Chinese language processing. Ph.D. thesis. National 
University of Singapore. 
Zhang, Hua-Ping, Qun Liu, Hao Zhang, and Xue-Qi 
Cheng. 2002. Automatic recognition of Chinese 
unknown words based on roles tagging. In: 
Proceedings of The First SIGHAN Workshop on 
Chinese Language Processing, Taiwan, 71-77. 
