An Enhanced Model  
for Chinese Word Segmentation and Part-of-speech Tagging 
Jiang Feng 
Department of  
Computer Science and Engineering  
Shanghai Jiao Tong University 
Shanghai, China, 200030  
f_jiang@sjtu.edu.cn 
 
Liu Hui 
Department of  
Computer Science and Engineering  
Shanghai Jiao Tong University 
Shanghai, China, 200030 
lh_Charles@sjtu.edu.cn 
 
Chen Yuquan 
Department of  
Computer Science and Engineering  
Shanghai Jiao Tong University 
Shanghai, China, 200030 
yqchen@mail.sjtu.edu.cn 
Lu Ruzhan 
Department of  
Computer Science and Engineering  
Shanghai Jiao Tong University 
Shanghai, China, 200030 
rzlu@mail.sjtu.edu.cn 
 
Abstract 
This paper presents an enhanced probabilistic model for Chinese word segmentation and part-of-speech (POS) tagging. The model introduces Chinese word length as one of its features to reach a more accurate result, and it also integrates segmentation and POS tagging into a single step. After presenting the model, the paper briefly discusses how to handle the statistical problems (data sparseness) and how to further integrate Chinese Named Entity Recognition into the model. Finally, experimental results and comparisons are reported, showing a word segmentation accuracy of 97.09% and a POS tagging accuracy of 98.77%.
1 Introduction 
Generally, Chinese Lexical Analysis consists of two phases: one is word segmentation and the other is part-of-speech (POS) tagging. Rule-based and statistics-based approaches are the two dominant ways in natural language processing, as well as in Chinese Lexical Analysis. This paper focuses only on the latter; hence, our model is called a probabilistic model.
Surveying previous research in this field, we found two points at which the performance of a Chinese word segmentation and POS tagging system could be improved. One concerns the system architecture, and the other comes from Machine Learning theory.
First, the traditional way of Chinese Lexical Analysis simply regards word segmentation and POS tagging as two separate phases, each with its own algorithms and models. Dividing the whole process into two independent parts lowers the design complexity of the system, but it also decreases performance, because the two tasks are fully integrated when a human processes a sentence. Fortunately, many researchers have noticed this, and recent projects pay more attention to the integration of word segmentation and POS tagging, such as [Gao Shan, Zhang Yan. 2001]'s pseudo trigram integrated model, [Fu Guohong et al. 2001]'s analyzer incorporating backward Dynamic Programming and the A* algorithm, [Sun Maosong, et al. 2003]'s 'Divide and Conquer' integration, and [Zhang Huaping, et al. 2003]'s hierarchical hidden Markov model. The experiments reported in these papers also show the great potential of integrated models.
Besides the system architecture, another point should be noticed. A probabilistic model of word segmentation and POS tagging can be regarded as an instance of Machine Learning, in which feature extraction is the most important aspect, far more important than the learning algorithm. In today's models, the features used for Chinese Lexical Analysis seem a little too simple: most take tag sequences or word frequencies as the distinguishing features and ignore other useful information that Chinese itself provides.
In this paper, we present an enhanced, yet not overly complex, model for word segmentation and POS tagging, which not only inherits the merits of an integrated model but also takes a new feature, word length, into account.
Section 2 of this paper describes the model, including its input, output, and assumptions. Section 3 briefly discusses issues such as data sparseness and Named Entity Recognition. Section 4 reports the results of our experiments.
2 The Model 
The first step to establish the model is to make a 
formal description for its input and output. Here, a 
Chinese word segmentation and POS tagging 
system is viewed as taking the input

$C_1, C_2, \ldots, C_n$

where $C_i$ is the i-th Chinese character of the input sentence, and producing the output pairs ($m \le n$)

$\binom{L_1}{T_1}, \binom{L_2}{T_2}, \ldots, \binom{L_m}{T_m}$

where $L_i$ is the word length of the i-th word in the segmented word sequence, $T_i$ is its word tag, each $(L_i, T_i)$ pair corresponds to one segmented and tagged word, and $\sum_{i=1}^{m} L_i = n$.
 
It is easy to see that the distinction between this model and others is that this one introduces word length. In fact, word length really works and affects the performance of the system a great deal, as our experiments later confirm.
The motivation for introducing word length into our model initially comes from classical Chinese poems. When we read these poems, we spontaneously obey certain laws about where to pause. For example, in most cases a 7-character-lined Jueju (a kind of poem format) is read as **/**/***, and the pauses in a sentence are closely related to the lengths of words or chunks. Even in modern Chinese, word length plays a part: sometimes we prefer a disyllabic word to a monosyllabic one, though both are grammatically correct. For example, in our daily lives we always say “ /n /v /n” or “ /n /v /n”, but seldom hear “ /n /v /n”, where “ ”, “ ” and “ ” have the same meaning. So it is reasonable to assume that word lengths obey some unwritten laws when humans write or speak, and introducing word length into the word segmentation and POS tagging model may accord with the needs of processing Chinese.
Another main characteristic of the model is that it is an integrated model: there is only one pass from the input sentence to the output word-tag sequence.
 
The following text introduces how the model works. We also inherit the n-gram assumption in our model.
Our goal is to find the sequence of $(L_i, T_i)$ pairs that maximizes the probability $P((\vec{L}, \vec{T}) \mid \vec{C})$, i.e.

$(\vec{L}, \vec{T})_R = \arg\max_{(\vec{L}, \vec{T})} P((\vec{L}, \vec{T}) \mid \vec{C})$ ............ (2.1)

(For convenience, we use $\vec{A}$ to represent a sequence $A_1, A_2, A_3, \ldots$)
And by Bayes' rule,

$P((\vec{L}, \vec{T}) \mid \vec{C}) = \dfrac{P(\vec{C} \mid (\vec{L}, \vec{T})) \cdot P(\vec{L}, \vec{T})}{P(\vec{C})}$ ............ (2.2)

Since $P(\vec{C})$ is a constant for a given $\vec{C}$, we only need to consider $P(\vec{C} \mid (\vec{L}, \vec{T}))$ and $P(\vec{L}, \vec{T})$.
First consider $P(\vec{C} \mid (\vec{L}, \vec{T}))$. Suppose $\vec{W}$ is the vector of words that $(\vec{L}, \vec{T})$ represents (i.e., the segmented word sequence), and the dependency assumption is that of the following Bayes network:

Figure 2.1: Dependency assumption among length-tag pair, word and character

So we have

$P(\vec{C} \mid (\vec{L}, \vec{T})) = P(\vec{C} \mid \vec{W}) \cdot P(\vec{W} \mid (\vec{L}, \vec{T}))$ ............ (2.3)
Because $\vec{W}$ is the segmentation of $\vec{C}$, $P(\vec{C} \mid \vec{W})$ is always 1, and by the further assumption that the occurrences of words are independent of each other,

$P(\vec{W} \mid (\vec{L}, \vec{T})) = \prod_{i=1}^{m} P(W_i \mid (L_i, T_i))$ ............ (2.4)

where $P(W_i \mid (L_i, T_i))$ is the conditional probability of $W_i$ given $L_i$ and $T_i$. For example, P(“ ” | 2, v) is the conditional probability of the word “ ” given a 2-character verb, which may be computed as (the number of times “ ” appears as a verb) / (the number of all 2-character verbs). With 2.3 and 2.4, $P(\vec{C} \mid (\vec{L}, \vec{T}))$ is ready.
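As a concrete illustration of this estimate, the sketch below computes P(w | l, t) by relative frequency in Python. It is our own minimal reconstruction, assuming the training data is available as (word, tag) pairs; the function and variable names are illustrative, not from the paper.

    from collections import Counter

    def estimate_lexical_probs(tagged_corpus):
        # tagged_corpus: an iterable of (word, tag) pairs (assumed format).
        # Returns P(w | len(w), t) = count(w tagged t) divided by the
        # count of all words of the same length with the same tag.
        word_count = Counter()
        class_count = Counter()
        for word, tag in tagged_corpus:
            word_count[(word, tag)] += 1
            class_count[(len(word), tag)] += 1
        return {(word, tag): n / class_count[(len(word), tag)]
                for (word, tag), n in word_count.items()}

For instance, a 2-character verb's probability is its own count divided by the count of all 2-character verbs, exactly as in the example above.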
 
Then consider $P(\vec{L}, \vec{T})$, which is easy to compute once we apply the n-gram assumption. Suppose n is 2, which means that $(L_i, T_i)$ depends only on $(L_{i-1}, T_{i-1})$:

$P(\vec{L}, \vec{T}) = \prod_{i=1}^{m} P((L_i, T_i) \mid (L_{i-1}, T_{i-1}))$ ............ (2.5)

Here $P((L_i, T_i) \mid (L_{i-1}, T_{i-1}))$ is the probability of tag $T_i$ with length $L_i$ appearing next to tag $T_{i-1}$ with length $L_{i-1}$, which may be computed as (the number of times $(L_{i-1}, T_{i-1})(L_i, T_i)$ appears in the corpus) / (the number of times $(L_{i-1}, T_{i-1})$ appears in the corpus). So $P(\vec{L}, \vec{T})$ is also ready.
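The transition probabilities of 2.5 come from counts in the same way. A minimal sketch for the bigram case, again with assumed data structures (sentences as lists of (word, tag) pairs):

    from collections import Counter

    def estimate_transition_probs(sentences):
        # Bigram case (n = 2): P((Li, Ti) | (Li-1, Ti-1)) estimated as
        # count((Li-1, Ti-1)(Li, Ti)) / count((Li-1, Ti-1)).
        pair_count = Counter()
        prev_count = Counter()
        for sent in sentences:
            classes = [(len(w), t) for w, t in sent]
            for prev, cur in zip(classes, classes[1:]):
                pair_count[(prev, cur)] += 1
                prev_count[prev] += 1
        return {(prev, cur): n / prev_count[prev]
                for (prev, cur), n in pair_count.items()}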
 
Combining formulas 2.1, 2.2, 2.3, 2.4 and 2.5, we have

$(\vec{L}, \vec{T})_R = \arg\max_{(\vec{L}, \vec{T})} \prod_{i=1}^{m} P\left(W_i \mid \binom{L_i}{T_i}\right) \cdot P\left(\binom{L_i}{T_i} \mid \binom{L_{i-1}}{T_{i-1}}\right)$ ............ (2.6)
 
Now the enhanced model is complete with 2.6. In establishing it, we have made several assumptions:
1. the dependency assumption among length-tag pairs, words and characters, as in the Bayes network of Figure 2.1;
2. the occurrences of words are independent of each other;
3. the n-gram assumption on (L, T) pairs.
The validity of these assumptions remains somewhat in doubt, but they decrease the computational complexity of the model.
The resources required by this model are a word list with the probabilities $P\left(W_i \mid \binom{L_i}{T_i}\right)$, and an n-gram transition network with the probabilities $P\left(\binom{L_i}{T_i} \mid \binom{L_{i-n+1}}{T_{i-n+1}}, \ldots, \binom{L_{i-1}}{T_{i-1}}\right)$.
The algorithm implementing this model is also rather simple: using Dynamic Programming, it runs in O(c·n) time, where n is the length of the input sentence and c is a constant related to the maximum ambiguity at any position.
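To make the algorithm concrete, here is a sketch of a bigram (n = 2) decoder for formula 2.6. It is our illustration, not the authors' implementation: lexicon maps a word to its possible tags, lex_prob holds P(w | len(w), t), trans_prob holds transitions between (length, tag) classes, and unseen transitions get a small floor value in place of a real smoothing method. It assumes every single character appears in the lexicon as a fallback word, so a path always exists.

    import math

    def decode(sentence, lexicon, lex_prob, trans_prob, max_len=6):
        # best[i] maps the (length, tag) class of a word ending at
        # character position i to (log probability, backpointer).
        START = (0, "<s>")
        n = len(sentence)
        best = [dict() for _ in range(n + 1)]
        best[0][START] = (0.0, None)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                word = sentence[i:j]
                for tag in lexicon.get(word, ()):
                    cur = (j - i, tag)
                    emit = math.log(lex_prob[(word, tag)])
                    for prev, (score, _) in best[i].items():
                        trans = math.log(trans_prob.get((prev, cur), 1e-12))
                        cand = score + emit + trans
                        if cur not in best[j] or cand > best[j][cur][0]:
                            best[j][cur] = (cand, (i, prev, word))
        # Follow the backpointers from the best class at position n.
        cur = max(best[n], key=lambda c: best[n][c][0])
        result, i = [], n
        while i > 0:
            _, (back_i, prev, word) = best[i][cur]
            result.append((word, cur[1]))
            i, cur = back_i, prev
        return list(reversed(result))

Each character position is visited once, with a bounded number of candidate words and classes per position, which matches the O(c·n) bound above.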
 
3 Discussion 
Though the model itself is not difficult to implement, as presented in the last section, there are still some problems we will probably encounter in practice. The first is data sparseness in the statistics; the second is how to further integrate Chinese Named Entity Recognition into the new, word-length-enhanced model.
 
3.1 Data Sparseness 
Data sparseness arises when we calculate $P\left(\binom{L_i}{T_i} \mid \binom{L_{i-n+1}}{T_{i-n+1}}, \ldots, \binom{L_{i-1}}{T_{i-1}}\right)$. Once word length is introduced, the need for a large corpus grows greatly. Suppose we use a tri-gram assumption on length-tag pairs, the number of tags is 28 (as in our system), and the maximum word length is 6; then the number of patterns we must count is

28 * 6 * 28 * 6 * 28 * 6 = 4,741,632.

To obtain reasonable statistics, the corpus should be at least several times larger than that number of patterns. Such a large corpus is rarely available, and so we meet the problem called data sparseness.
One way to deal with the problem is to find a good smoothing method; another is to make a further independence assumption between word length and word tag, treating the word length sequence and the word tag sequence as independent. That means

$P\left(\binom{L_i}{T_i} \mid \binom{L_{i-n+1}}{T_{i-n+1}}, \ldots, \binom{L_{i-1}}{T_{i-1}}\right) = P(L_i \mid L_{i-n+1}, \ldots, L_{i-1}) \cdot P(T_i \mid T_{i-n+1}, \ldots, T_{i-1})$ ............ (3.1)

Now the patterns to count are only as many as those of a traditional n-gram model that assumes dependency among tags alone.
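Under 3.1, the joint length-tag trigram table is replaced by two far smaller tables: with 28 tags and a maximum length of 6, roughly 28^3 + 6^3 = 22,168 patterns instead of 4,741,632. A minimal sketch of the factored lookup, with table names of our own choosing:

    def factored_prob(hist, cur, len_trigrams, tag_trigrams):
        # hist: the two preceding (length, tag) pairs; cur: the current pair.
        # Assumption 3.1: the conditional probability factors into a pure
        # length trigram times a pure tag trigram.
        (l2, t2), (l1, t1) = hist
        l, t = cur
        return len_trigrams[((l2, l1), l)] * tag_trigrams[((t2, t1), t)]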
 
3.2 Named Entity Recognition Integration 
Named Entity Recognition is one of the most important parts of a word segmentation and POS tagging system, for the word list is limited while language seems infinite. New words keep appearing in human language, among which human names, place names and organization names are the most common and the most valuable to recognize. The performance of Named Entity Recognition therefore has a deep impact on the performance of the whole word segmentation and POS tagging system. Research on Named Entity Recognition has been going on for many years. We will not discuss here whether its current performance is ideal; instead, we will just show how to integrate existing Named Entity Recognition methods into the new model.
During the integration, attention should be paid to structural and probabilistic consistency. Structural consistency means the original system structure does not need modifying when a new Named Entity Recognition method is applied. Probabilistic consistency means the probabilities output by Named Entity Recognition must be compatible with the probabilities of the words in the original word list.
Here, we take Human Name Recognition as an example to show how to do the integration.
[Zheng Jiahen, et al. 2000] presented a probabilistic method for Chinese Human Name Recognition that is easy to understand and suitable to borrow as a demonstration. That paper defined the probability of a Chinese human name as:

$P(ns \mid ik) = F(i) \cdot E(k)$ ............ (3.2)

$P(np \mid ijk) = F(i) \cdot M(j) \cdot E(k)$ ............ (3.3)
where each of “i”, “j”, “k” represents a single Chinese character; “ik” and “ijk” are strings that may be a human name; “ns” denotes a single name (when “j” is empty) and “np” a plural name (when “j” is not empty); F(i) is the probability of “i” being a family name; M(j) is the probability of “j” being the middle character of a human name; and E(k) is the probability of “k” being the tail character of a human name. P(ns | ik) is thus the probability of “ik” being a single name, and P(np | ijk) the probability of “ijk” being a plural name. F(i), M(j) and E(k) are easily retrieved from a corpus, so P(ns | ik) and P(np | ijk) can be computed.
However, P(ns | ik) and P(np | ijk) do not satisfy the requirements of the word-length-enhanced model, which needs probabilities of the form $P(w \mid (l, t))$, where w is a word, t is a word tag, and l is the word length. Therefore P(ns | ik) needs to be converted into P(ik | nh, 2), since “ik” is always a 2-character word, and likewise P(np | ijk) into P(ijk | nh, 3), where “nh” is the word tag for human names in our system.
P(ns | ik) is equivalent to P(nh, 2 | ik), and P(np | ijk) is equivalent to P(nh, 3 | ijk). P(ns | ik) can be converted into P(ik | nh, 2) as follows:

$P(ik \mid nh, 2) = \dfrac{P(nh, 2 \mid ik) \cdot P(ik)}{P(nh, 2)} = \dfrac{P(nh, 2 \mid ik) \cdot P(i) \cdot P(k)}{P(nh, 2)}$ ............ (3.4)
where “i” and “k” have the same meaning as in 3.2 and 3.3, and nh is the tag for human names. In this formula, “i” and “k” are assumed to be independent. P(nh, 2), P(i) and P(k) are easy to retrieve; they represent the probability of a 2-character human name, the probability of character “i” and the probability of character “k”. P(nh, 2 | ik) is computed from 3.2. Thus the conversion of P(ns | ik) into P(ik | nh, 2) is done.
In the same way, P(np | ijk) can be converted into P(ijk | nh, 3) by:

$P(ijk \mid nh, 3) = \dfrac{P(nh, 3 \mid ijk) \cdot P(ijk)}{P(nh, 3)} = \dfrac{P(nh, 3 \mid ijk) \cdot P(i) \cdot P(j) \cdot P(k)}{P(nh, 3)}$ ............ (3.5)
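In code, conversions 3.4 and 3.5 are a single Bayes inversion. The sketch below, with argument names of our own choosing, covers both the 2-character and the 3-character case:

    def convert_name_prob(p_class_given_chars, char_probs, p_class):
        # P(chars | nh, L) = P(nh, L | chars) * P(chars) / P(nh, L),
        # where P(chars) is approximated by the product of the single-
        # character probabilities (the independence assumption above).
        p_chars = 1.0
        for p in char_probs:
            p_chars *= p
        return p_class_given_chars * p_chars / p_class

    # e.g. P(ik | nh, 2) = convert_name_prob(P_nh2_ik, [P_i, P_k], P_nh2)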
Finally, the Human Name Recognition module is integrated into the whole system. The input string C1, C2, ..., Cn first goes through the Human Name Recognition module, which outputs a temporary word list: a column of words that are probably human names and a column of corresponding probabilities computed by 3.4 and 3.5. The system then merges the temporary word list and the original word list into a new word list, and applies the new word list in segmenting and tagging C1, C2, ..., Cn, as the sketch below illustrates.
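Since the converted probabilities have the same form P(w | (l, t)) as the entries of the original word list, the merge itself is trivial. A sketch, assuming both lists are dictionaries keyed by (word, tag):

    def merge_word_lists(original, temporary):
        # temporary holds candidate human names tagged "nh", with
        # probabilities converted by 3.4 / 3.5; probabilistic consistency
        # lets us simply take the union of the two dictionaries.
        merged = dict(original)
        merged.update(temporary)
        return merged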
 
4 Conclusion & Experiments 
This paper has presented an enhanced probabilistic model of Chinese Lexical Analysis, which introduces word length as one of its features and integrates word segmentation, Named Entity Recognition and POS tagging.
Finally, we briefly give the results of our experiments. In earlier experiments, we compared many simple probabilistic models for Chinese word segmentation and POS tagging, and found that the system using maximum word frequency as the segmentation strategy and a forward tri-gram Markov model as the POS tagging model (MWF + FTMM) reached the best performance. Our comparisons are therefore made between MWF + FTMM and the enhanced model with the tri-gram assumption. The training corpus is 40MB of annotated Chinese text from the People's Daily; the testing data, about 1MB in size, is also from the People's Daily.
 
 MWF+FTMM New Model 
WSA 95.24% 97.09% 
PTA 97.12% 98.77% 
Total 92.50% 95.90% 
Table 4.1: The accuracy by word,  
with named entity not considered 
 
 MWF+FTMM New Model 
WSA 93.86% 95.68% 
PTA 93.89% 95.72% 
Total 88.13% 91.59% 
Table 4.2: The accuracy by word, 
with named entity considered 
 
 MWF+FTMM New Model 
WSA 69.46% 82.63% 
PTA 72.58% 80.33% 
Total 50.42% 66.38% 
Table 4.3: The accuracy by sentence, 
with named entity not considered 
 
 MWF+FTMM New Model 
WSA 63.86% 74.78% 
PTA 61.40% 67.41% 
Total 39.21% 50.42% 
Table 4.4: The accuracy by sentence, 
with named entity considered 
 
NOTES: 
MWF: Maximum Word Frequency, a very simple 
strategy in word segmentation disambiguation, 
which chooses the word sequence with max 
probability as its result.  
FTMM: Forward Tri-gram Markov Model, a 
popular model in POS tagging. 
MWF+FTMM: a strategy that chooses as its result the output balancing the MWF and FTMM criteria.
WSA (by word): Word Segmentation Accuracy, measured by recall, i.e. the number of correctly segmented words divided by the number of words in the corpus. (In a problem like word segmentation, precision is commonly close to recall.)
PTA (by word): POS Tagging Accuracy given correct segmentation, i.e. the number of words that are correctly segmented and tagged divided by the number of words that are correctly segmented.
Total (by word): the total accuracy of the system, measured by recall, i.e. the number of words that are correctly segmented and tagged divided by the number of words in the corpus, or simply WSA * PTA. (A small computation sketch of these by-word measures follows these notes.)
WSA (by sentence): the number of correctly 
segmented sentences divided by the number of 
sentences in corpus. A correctly segmented 
sentence is a sentence whose words are all 
correctly segmented.  
PTA (by sentence): the number of correctly tagged 
sentences divided by the number of correctly 
segmented sentences in corpus. A correctly 
tagged sentence is a sentence whose words are 
all correctly segmented and tagged. 
Total (by sentence): WSA * PTA. 
Named entity considered or not: when named entities are not considered, all unknown words in the corpus are deleted before evaluation; otherwise, the corpus is left untouched.
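For reference, the by-word measures defined above can be computed as follows. The sketch assumes gold and predicted analyses of one text as lists of (word, tag) pairs; it is our reconstruction of the definitions, not the authors' evaluation script.

    def word_accuracy(gold, pred):
        # Map each analysis to character spans, so that a word counts as
        # correctly segmented when its span occurs in both analyses.
        def spans(sent):
            out, pos = {}, 0
            for w, t in sent:
                out[(pos, pos + len(w))] = t
                pos += len(w)
            return out
        g, p = spans(gold), spans(pred)
        seg = [s for s in g if s in p]             # correctly segmented
        tag = [s for s in seg if g[s] == p[s]]     # ...and correctly tagged
        wsa = len(seg) / len(g)                    # WSA (by word)
        pta = len(tag) / len(seg) if seg else 0.0  # PTA (by word)
        return wsa, pta, wsa * pta                 # Total = WSA * PTA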
 
According to the results above (Tables 4.1 through 4.4), the new enhanced model does better than MWF + FTMM on every metric. Introducing word length into a Chinese word segmentation and POS tagging system appears to be effective.
This paper has focused on a purely probabilistic model for word segmentation and POS tagging. It can be expected that, with more disambiguation strategies, such as rule-based approaches, implemented on top of the new model to form a multi-engine system, the performance will improve further.
 
5 Acknowledgements 
We thank Fang Hua and Kong Xianglong, who have recently graduated, for their previous work.

References  
Sun Maosong, Xu Dongliang, Benjamin K. Tsou. 2003. Integrated Chinese word segmentation and part-of-speech tagging based on the divide-and-conquer strategy. International Conference on Natural Language Processing and Knowledge Engineering Proceedings, Beijing.
Zhang Huaping, Liu Qun, et al. 2003. Chinese lexical analysis using hierarchical hidden Markov model. 2nd SIGHAN Workshop affiliated with the 41st ACL, Sapporo, Japan.
Fu Guohong, Wang Ping, Wang Xiaolong. 2001. Research on the approach of integrating Chinese word segmentation with part-of-speech tagging. Application Research of Computers. (In Chinese)
Gao Shan, Zhang Yan. 2001. The research on integrated Chinese word segmentation and labeling based on a trigram statistical model. Natural Language Understanding & Machine Translation (JSCL-2001), Taiyuan. (In Chinese)
Zheng Jiahen, Li Xin, et al. 2000. The research of Chinese name recognition methods based on corpus. Journal of Chinese Information Processing. (In Chinese)
