Towards a Hybrid Model for Chinese Word Segmentation 
Xiaofei Lu 
Department of Linguistics 
The Ohio State University 
Columbus, OH 43210, USA 
xflu@ling.osu.edu 
Abstract 
This paper describes a hybrid Chinese 
word segmenter that is being developed 
as part of a larger Chinese unknown 
word resolution system. The segmenter 
consists of two components: a tagging 
component that uses the transforma-
tion-based learning algorithm to tag 
each character with its position in a 
word, and a merging component that 
transforms a tagged character sequence 
into a word-segmented sentence. In ad-
dition to the position-of-character tags 
assigned to the characters, the merging 
component makes use of a number of 
heuristics to handle non-Chinese char-
acters, numeric type compounds, and 
long words. The segmenter achieved a 
92.8% F-score and a 72.8% recall for 
OOV words in the closed track of the 
Peking University Corpus in the Sec-
ond International Chinese Word Seg-
mentation Bakeoff. 
1 Introduction 
This paper describes a hybrid Chinese word 
segmenter that participated in the closed track of 
the Peking University Corpus in the Second
International Chinese Word Segmentation Bakeoff.
The segmenter is at an early stage of development
and forms part of a
larger Chinese unknown word resolution system 
that performs the identification, part of speech 
guessing, and sense guessing of Chinese un-
known words (Lu, 2005).  
  The segmenter consists of two major compo-
nents. First, a tagging component tags each indi-
vidual character in a sentence with a position-of-
character (POC) tag that indicates the position of 
the character in a word. There are four
possibilities: the character is either a monosyllabic
word or occupies a word-initial, word-middle, or
word-final position. This component
is based on the transformation-based learning 
(TBL) algorithm (Brill, 1995), where a simple 
first-order HMM tagger (Charniak et al., 1993) 
is used to produce an initial tagging of a charac-
ter sequence. Second, a merging component 
transforms the output of the tagging component, 
i.e., a POC-tagged character sequence, into a 
word-segmented sentence. Whereas this process 
relies largely on the POC tags assigned to the 
individual characters, it also takes advantage of 
a number of heuristics generalized from the 
training data to handle non-Chinese characters, 
numeric type compounds, and long words. 
The approach adopted here is reminiscent of 
the line of research that employs the idea of 
character-based tagging for Chinese word seg-
mentation and/or unknown word identification 
(Goh et al., 2003; Xue, 2003; Zhang et al., 
2002). The notion of character-based tagging 
allows us to model the tendency for individual 
characters to combine with other characters to 
form words in different contexts. This property 
gives the model a good potential for improving 
the performance of Chinese unknown word 
identification, a major concern of the Chinese 
unknown word resolution system that the seg-
menter is a part of.  
The rest of the paper is organized as follows. 
Section two describes the system architecture. 
Section three reports the results of the system in 
the bakeoff. Section four concludes the paper.  
2 System Description 
The overall architecture of the segmenter is de-
scribed in Figure 1. An input sentence is first 
segmented into a character sequence, with a 
space inserted after each character. The seg-
mented character sequence is then processed by 
the tagging component, where it is initially 
tagged by an HMM tagger, and then by a TBL 
tagger. Finally, the tagged character sequence is 
transformed into a word-segmented sentence by 
the merging component.  
  
Figure 1: System Architecture. 
2.1 The Tagging Component 
The tagset used by the tagging component consists
of four tags, L, M, R, and W, which indicate that the
character is in a word-initial, word-middle, or
word-final position, or is a monosyllabic word,
respectively. The
transformation-based error-driven learning algo-
rithm is adopted as the backbone of the tagging 
component over other promising machine learn-
ing algorithms because, as Brill (1995) argued, it 
captures linguistic knowledge in a more explicit 
and direct fashion without compromising per-
formance. This algorithm requires a gold stan-
dard, some initial tagging of the training corpus, 
and a set of rule templates. It then learns a set of 
rules that are ranked in terms of the number of 
tagging error reductions they can achieve.    
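
As an illustration of how a gold-standard POC tagging might be derived from a word-segmented training corpus, the following minimal Python sketch maps each word to L/M/R/W tags. The function name `words_to_poc_tags` and the example sentence are hypothetical and are not taken from the system described here.

```python
def words_to_poc_tags(words):
    """Map a list of words to (character, POC tag) pairs using the L/M/R/W tagset."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "W"))      # monosyllabic word
            continue
        for i, ch in enumerate(word):
            if i == 0:
                tag = "L"                   # word-initial position
            elif i == len(word) - 1:
                tag = "R"                   # word-final position
            else:
                tag = "M"                   # word-middle position
            tagged.append((ch, tag))
    return tagged


# Hypothetical segmented sentence: one 2-character word, one 3-character word,
# and one single-character word.
example = words_to_poc_tags(["北京", "大学生", "来"])
```

Here `example` pairs the six characters with the tags L, R, L, M, R, and W, in that order.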
A number of different initial tagging schemes 
can be used, e.g., tagging each character as a 
monosyllabic word or with its most probable 
POC tag. We used a simple first-order HMM 
tagger to produce an initial tagging. Specifically,
we calculate 
$$\operatorname*{arg\,max}_{t_1 \ldots t_n} \prod_{i=1}^{n} p(t_i \mid t_{i-1})\, p(t_i \mid w_i) \qquad (1)$$
where $t_i$ denotes the $i$th tag in the tag sequence
and $w_i$ denotes the $i$th character in the character
sequence. The transition probabilities and lexi-
cal probabilities are estimated from the training 
data. The lexical probability for an unknown 
character, i.e., a character that is not found in the 
training data, is by default uniformly distributed 
among the four POC tags defined in the tagset. 
The Viterbi algorithm (Rabiner, 1989) is used to 
tag new texts.   
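
The initial tagging step can be sketched in Python as follows. This is a minimal illustration of first-order Viterbi decoding under the scoring in equation (1), not the tagger actually used. The sentence-start pseudo-tag `START`, the small additive `smoothing` constant, and all function names are assumptions introduced for the sketch. With this smoothing, a character unseen in training receives a uniform distribution over the four POC tags, as described above.

```python
import math
from collections import Counter, defaultdict

TAGS = ("L", "M", "R", "W")
START = "<s>"  # assumed sentence-start pseudo-tag (not from the paper)


def train(tagged_sentences):
    """Count tag transitions and per-character tag frequencies."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    lex = defaultdict(Counter)    # lex[char][tag]
    for sentence in tagged_sentences:
        prev = START
        for ch, tag in sentence:
            trans[prev][tag] += 1
            lex[ch][tag] += 1
            prev = tag
    return trans, lex


def log_prob(counter, tag, smoothing=1e-6):
    """Smoothed log p(tag | context); an empty counter yields a uniform 1/4."""
    total = sum(counter.values())
    return math.log((counter[tag] + smoothing) / (total + smoothing * len(TAGS)))


def viterbi(chars, trans, lex):
    """Return the tag sequence maximizing prod p(t_i | t_{i-1}) p(t_i | w_i)."""
    if not chars:
        return []
    # Each column maps a tag to (best log score ending in that tag, backpointer).
    best = [{t: (log_prob(trans[START], t) + log_prob(lex[chars[0]], t), None)
             for t in TAGS}]
    for ch in chars[1:]:
        column = {}
        for t in TAGS:
            score, prev = max((best[-1][p][0] + log_prob(trans[p], t), p)
                              for p in TAGS)
            column[t] = (score + log_prob(lex[ch], t), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

Working in log space avoids numerical underflow on long sentences; the choice of smoothing is illustrative only.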
The transformation-based tagger was implemented
using fnTBL (Ngai and Florian, 2001).
The rule templates used are the same as the con-
textual rule templates Brill (1995) defined for 
the POS tagging task. These templates basically 
transform the current tag into some other tag 
based on the current character/tag and the char-
acter/tag one to three positions before/after the 
current character. An example rule template is 
given below: 
(1) Change tag a to tag b if the preceding char-
acter is tagged z.   
The training process is iterative. At each itera-
tion, the algorithm picks the instantiation of a 
rule template that achieves the greatest number 
of tagging error reductions. This rule is applied 
to the text, and the learning process repeats until
no more rules reduce errors beyond a pre-
defined threshold. The learned rules can then be 
applied to new texts that are tagged by the initial
HMM tagger.  
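
The greedy learning loop can be illustrated with a minimal Python sketch restricted to the single template in (1). This is not fnTBL and does not reproduce its interface; the function names, the treatment of the corpus as one flat tag sequence, and the default `threshold` of one error are assumptions made for the example. The sketch also assumes that a rule's context is read from the tagging as it stood before the rule was applied.

```python
from itertools import product

TAGS = ("L", "M", "R", "W")


def apply_rule(tags, rule):
    """Change tag a to tag b wherever the preceding character is tagged z."""
    a, b, z = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == z:  # context read from the input tagging
            out[i] = b
    return out


def errors(tags, gold):
    """Number of positions where the current tagging disagrees with the gold standard."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(current, gold, threshold=1):
    """Repeatedly pick the rule with the largest error reduction until none reaches threshold."""
    candidates = [(a, b, z) for a, b, z in product(TAGS, repeat=3) if a != b]
    rules = []
    while True:
        base = errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in candidates:
            gain = base - errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < threshold:
            break
        rules.append(best_rule)
        current = apply_rule(current, best_rule)
    return rules, current
```

The returned rule list is ordered, so applying the rules in sequence to a new initial tagging reproduces the learned transformations.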
2.2 The Merging Component 
The merging component transforms a POC-
tagged character sequence into a word-
segmented sentence. In general, the characters in 
a sequence are concatenated, and a space is in-
serted after each character tagged R (word-final 
position) or W (monosyllabic word).  
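
The basic merging step, without the heuristics, can be sketched as follows. The function name `merge` is hypothetical, and the handling of a trailing unclosed word (a sequence ending in L or M) is an assumption, since the text does not say how such cases are treated.

```python
def merge(chars, tags):
    """Concatenate characters, closing a word after every R or W tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("R", "W"):
            words.append(current)
            current = ""
    if current:  # assumed: emit trailing L/M characters as a final word
        words.append(current)
    return " ".join(words)


# Hypothetical tagged sequence.
sentence = merge(list("北京大学生来"), ["L", "R", "L", "M", "R", "W"])
```

For this input, `sentence` is the string "北京 大学生 来".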
[Figure 1 flow: Unsegmented Chinese sentence -> Character segmenter -> Segmented character sequence -> HMM POC tagger -> Initial POC-tagged character sequence -> TBL POC tagger -> Final POC-tagged character sequence -> Merging component -> Word-segmented sentence]
  In addition, two sets of heuristics are used in 
this process. One set (H1) is used to handle non-
Chinese characters and numeric type compounds, 
e.g., numbers, time expressions, etc. A few pat-
terns of non-Chinese characters and numeric 
type compounds are generalized from the train-
ing data. If the merging algorithm detects such a 
pattern in the character sequence, it groups the 
characters that are part of the pattern accord-
ingly.  
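
A pattern-based grouping of this kind might look like the Python sketch below. The actual patterns generalized from the training data are not listed in this paper, so the two regular expressions in `PATTERNS` are purely hypothetical, as is the function name `apply_h1`. The sketch assumes each element of `chars` is a single character and overrides the POC-derived word boundaries wherever a pattern matches.

```python
import re

# Hypothetical patterns; the ones actually used by the segmenter are not given.
PATTERNS = [
    # numbers, optionally with a decimal part and a date or percent suffix
    re.compile(r"[0-9０-９]+(?:[.．][0-9０-９]+)?[年月日%％]?"),
    # runs of Latin letters (ASCII or full-width)
    re.compile(r"[A-Za-zＡ-Ｚａ-ｚ]+"),
]


def apply_h1(chars, tags):
    """Return words, keeping every pattern match together as a single word."""
    if not chars:
        return []
    text = "".join(chars)
    # cuts holds each index i such that a word ends after character i.
    cuts = {i for i, t in enumerate(tags) if t in ("R", "W")}
    cuts.add(len(chars) - 1)
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            start, end = m.span()
            cuts -= set(range(start, end - 1))  # no break inside the match
            cuts.add(end - 1)                   # break right after the match
            if start > 0:
                cuts.add(start - 1)             # break right before the match
    words, prev = [], 0
    for i in sorted(cuts):
        words.append(text[prev:i + 1])
        prev = i + 1
    return words
```

Representing the segmentation as a set of cut positions makes it straightforward to override the tagger's boundaries without having to repair neighbouring tags.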
The second set of heuristics (H2) is used to 
handle words that are three or more characters long.
Our hypothesis is that long words tend to have 
less fluidity than shorter words and their behav-
ior is more predictable (Lu, 2005). We extracted 
a wordlist from the training data. Based on our 
hypothesis, if the merging algorithm detects that 
a group of characters form a long word found in 
the wordlist, it groups these characters into one 
word.  
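
One way such a wordlist lookup could be combined with the POC tags is sketched below. The paper does not specify the matching strategy, so the greedy left-to-right longest match, the `max_len` cap, and the name `apply_h2` are all assumptions made for illustration.

```python
def apply_h2(chars, tags, wordlist, min_len=3, max_len=8):
    """Segment by POC tags, but keep long wordlist entries together as single words."""
    long_words = {w for w in wordlist if len(w) >= min_len}
    text = "".join(chars)
    words, current, i = [], "", 0
    while i < len(chars):
        match = None
        # Try the longest candidate first (assumed strategy).
        for n in range(min(max_len, len(chars) - i), min_len - 1, -1):
            if text[i:i + n] in long_words:
                match = text[i:i + n]
                break
        if match:
            if current:           # close any word left open before the match
                words.append(current)
                current = ""
            words.append(match)
            i += len(match)
            continue
        current += chars[i]
        if tags[i] in ("R", "W"):  # default POC-based boundary
            words.append(current)
            current = ""
        i += 1
    if current:
        words.append(current)
    return words
```

In this sketch, wordlist entries shorter than `min_len` are ignored, so short words are still segmented by their POC tags alone.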
3 Results 
The segmenter was evaluated on the closed track 
of the Peking University Corpus in the bakeoff. 
In the development stage, we partitioned the 
official training data into two portions: the train-
ing set consists of 90% of the data, and the de-
velopment set consists of the other 10%. The 
POC tagging accuracy on the development set is 
summarized in Table 1. The results indicate that
the TBL tagger substantially improves the initial
tagging produced by the HMM tagger, raising
accuracy from 0.814 to 0.936.
             Accuracy
HMM tagger   0.814
TBL tagger   0.936
Table 1: Tagging Results on the Development Set.
  The performance of the merging algorithm on 
the development set is summarized in Table 2. 
To understand whether and how much the heu-
ristics contribute to improving segmentation, we 
evaluated four versions of the merging algo-
rithm. The set of heuristics used to handle non-
Chinese characters and numeric type compounds 
did not seem to improve segmentation results on 
the development set, suggesting that these characters
are handled well by the tagging component.
However, the second set of heuristics raised the
F-score from 0.927 to 0.948.
This seems to confirm our hypothesis that longer 
words tend to behave more stably.   
Resources used   R      P      F
POC Tags only    0.928  0.926  0.927
+ H1             0.929  0.925  0.927
+ H2             0.938  0.959  0.948
+ H1 & H2        0.940  0.960  0.950
Table 2: Segmentation Results on the Development
Set. H1 stands for the set of heuristics used to handle
non-Chinese characters and numeric type compounds.
H2 stands for the set of heuristics used to
handle long words.
Corpus  R      P      F      R_OOV  R_IV
PKU     0.922  0.934  0.928  0.728  0.934
Table 3: Official Results in the Closed-Track of the
Peking University Corpus.
The official results of the segmenter in the 
closed-track of the Peking University Corpus are 
summarized in Table 3. It is somewhat unexpected
that the F-score on the official test data dropped by
over two percentage points compared with the
development set (0.950 vs. 0.928). Compared with
the other systems, the segmenter performed rela-
tively well on OOV words.  
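
For reference, recall, precision, and the balanced F-score for segmentation can be computed by comparing word spans, as in the minimal Python sketch below. This is not the official bakeoff scoring script; the function names and the example segmentations are hypothetical.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def prf(gold_words, predicted_words):
    """Return (recall, precision, F) where a word counts only if both boundaries match."""
    gold, pred = spans(gold_words), spans(predicted_words)
    correct = len(gold & pred)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(pred) if pred else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return recall, precision, f


# Hypothetical example: the prediction splits one gold word in two.
r, p, f = prf(["我", "爱", "计算机", "科学"], ["我", "爱", "计算", "机", "科学"])
```

In this example three of the four gold words are recovered, giving a recall of 0.75, and three of the five predicted words are correct, giving a precision of 0.6. The same harmonic-mean formula applied to the R and P values in Table 3 yields the reported F of 0.928.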
Our preliminary error analysis indicates that 
this discrepancy in performance is partially
attributable to two kinds of inconsistencies
between the training and test datasets. First,
there are many ASCII numbers in the test set,
but none in the training set. These numbers be-
came unknown characters to the tagger and af-
fected tagging accuracy. It is possible that this 
inconsistency affected our system more than 
other systems. Second, there are also a number 
of segmentation inconsistencies between the 
training and test sets, but these should have af-
fected all systems more or less equally. The er-
ror analysis also indicates that the current 
segmenter performed poorly on transliterations 
of foreign names.  
4 Conclusions 
We described a hybrid Chinese word segmenter 
that combines the transformation-based learning 
algorithm for character-based tagging and lin-
guistic heuristics for transforming tagged char-
acter sequences into word-segmented sentences. 
As the segmenter is in its first stage of develop-
ment and is far from mature, the bakeoff pro-
vided an especially valuable opportunity for 
evaluating its performance. The results suggest 
that:  
1. Despite the lack of a separate mecha-
nism for unknown word recognition, the 
segmenter performed relatively well on 
OOV words. This confirms our hy-
pothesis that character-based tagging 
has a good potential for improving Chi-
nese unknown word identification. 
2. Using linguistic heuristics at the merg-
ing stage can help improve segmenta-
tion results.  
3. There is much room for improvement
in both the tagging algorithm and the
merging algorithm; work on both is
under way.
References 
Eric Brill. 1995. Transformation-based error-
driven learning and natural language process-
ing: A case study in part-of-speech tagging. 
Computational Linguistics, 21(4):543-565. 
Eugene Charniak, Curtis Hendrickson, Neil Ja-
cobson, and Mike Perkowitz. 1993. Equations 
for part-of-speech tagging. In Proceedings of 
AAAI-1993, pp. 784-789.  
Chooi Ling Goh, Masayuki Asahara, and Yuji 
Matsumoto. 2003. Chinese unknown word 
identification using character-based tagging 
and chunking. In Proceedings of ACL-2003 
Interactive Posters and Demonstrations, pp. 
197-200. 
Xiaofei Lu. 2005. Hybrid methods for POS 
guessing of Chinese unknown words. In Pro-
ceedings of ACL-2005 Student Research 
Workshop, pp. 1-6.  
Grace Ngai and Radu Florian. 2001. Transfor-
mation-based learning in the fast lane. In Pro-
ceedings of NAACL-2001, pp. 40-47.  
Lawrence R. Rabiner. 1989. A tutorial on hidden
Markov models and selected applications in
speech recognition. Proceedings of the IEEE,
77(2):257-286.
Nianwen Xue. 2003. Chinese word segmenta-
tion as character tagging. International Jour-
nal of Computational Linguistics and Chinese 
Language Processing, 8(1):29-48.  
Kevin Zhang, Qin Liu, Hao Zhang, and Xue-Qi 
Cheng. 2002. Automatic recognition of Chi-
nese unknown words based on roles tagging. 
In Proceedings of the 1st SIGHAN Workshop 
on Chinese Language Processing, pp. 71-78. 
