Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 205–208,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Using Part-of-Speech Reranking to Improve Chinese Word Segmentation
Mengqiu Wang Yanxin Shi
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{mengqiu,yanxins}@cs.cmu.edu
Abstract
Chinese word segmentation and Part-of-
Speech (POS) tagging have been com-
monly considered as two separated tasks.
In this paper, we present a system that
performs Chinese word segmentation and
POS tagging simultaneously. We train a
segmenter and a tagger model separately
based on linear-chain Conditional Ran-
dom Fields (CRF), using lexical, morpho-
logical and semantic features. We propose
an approximated joint decoding method
by reranking the N-best segmenter out-
put, based POS tagging information. Ex-
perimental results on SIGHAN Bakeoff
dataset and Penn Chinese Treebank show
that our reranking method significantly
improve both segmentation and POS tag-
ging accuracies.
1 Introduction
Word segmentation and Part-of-speeching (POS)
tagging are the most fundamental tasks in Chinese
natural language processing (NLP). Traditionally,
these two tasks were treated as separate and in-
dependent processing steps chained together in a
pipeline. In such pipeline systems, errors intro-
duced at the early stage cannot be easily recov-
ered in later steps, causing a cascade of errors
and eventually harm overall performance. Intu-
itively, a correct segmentation of the input sen-
tence is more likely to give rise to a correct POS
tagging sequence than an incorrect segmentation.
Hinging on this idea, one way to avoid error prop-
agation in chaining subtasks such as segmentation
and POS tagging is to exploit the learning trans-
fer (Sutton and McCallum, 2005) among sub-
tasks, typically through joint inference. Sutton et
al. (2004) presented dynamic conditional random
fields (DCRF), a generalization of the traditional
linear-chain CRF that allow representation of in-
teraction among labels. They used loopy belief
propagation for inference approximation. Their
empirical results on the joint task of POS tagging
and NP-chunking suggested that DCRF gave supe-
rior performance over cascaded linear-chain CRF.
Ng and Low (2004) and Luo (2003) also trained
single joint models over the Chinese segmentation
and POS tagging subtasks. In their work, they
brought the two subtasks together by treating it as
a single tagging problem, for which they trained a
maximum entropy classifier to assign a combined
word boundary and POS tag to each character.
A major challenge, however, exists in doing
joint inference for complex and large-scale NLP
application. Sutton and McCallum (Sutton and
McCallum, 2005) suggested that in many cases ex-
act inference can be too expensive and thus formi-
dable. They presented an alternative approach in
which a linear-chain CRF is trained separately for
each subtask at training time, but at decoding time
they combined the learned weights from the CRF
cascade into a single grid-shaped factorial CRF
to perform joint decoding and make predictions
for all subtasks. Similar to (Sutton and McCal-
lum, 2005), in our system we also train a cas-
cade of linear-chain CRF for the subtasks. But
at decoding time, we experiment with an alterna-
tive approximation method to joint decoding, by
taking the n-best hypotheses from the segmenta-
tion model and use the POS tagging model for
reranking. We evaluated our system on the open
tracks of SIGHAN Bakeoff 2006 dataset. Fur-
thermore, to evaluate our reranking method’s im-
pact on the POS tagging task, we also performed
10-fold cross-validation tests on the 250k Penn
205
Chinese Treebank (CTB) (Xue et al., 2002). Re-
sults from both evaluations suggest that our simple
reranking method is very effective. We achieved
a consistent performance gain on both segmenta-
tion and POS tagging tasks over linearly-cascaded
CRF. Our official F-scores on the 2006 Bakeoff
open tracks are 0.935 (UPUC), 0.964 (CityU),
0.952 (MSRA) and 0.949 (CKIP).
2 Algorithm
Given an observed Chinese character sequence
X = {C1,C2,...,Cn}, let S and T denote a seg-
mentation sequence and a POS tagging sequence
over X. Our goal is to find a segmentation se-
quence ˆS and a POS tagging sequence ˆT that max-
imize the posterior probability :
P(S,T|X = {C1,C2,...,Cn}) (1)
Applying chain rule, we can further derive from
Equation 1 the following:
< ˆS, ˆT >
= argmax
S,T
P(T|S,X = {C1,C2,...,Cn})
×P(S|X = {C1,C2,...,Cn}) (2)
Since we have factorized the joint probability
in Equation 1 into two terms, we can now model
these two components using conditional random
fields (Lafferty et al., 2001). Linear-chain CRF
models define conditional probability, P(Z|X), by
linear-chain Markov random fields. In our case, X
is the sequence of characters or words, and Z is
the segmentation labels for characters (START or
NON-START, used to indicate word boundaries)
or the POS tagging for words (NN, VV, JJ, etc.).
The conditional probability is defined as:
P(Z|X) = 1N(X) exp(
Tsummationdisplay
t=1
Ksummationdisplay
k=1
λkfk(Z,X,t))
(3)
where N(X) is a normalization term to guaran-
tee that the summation of the probability of all
label sequences is one. fk(Z,X,t) is the kth
localfeaturefunction at sequence position t. It
maps a pair of X and Z and an index t to {0,1}.
(λ1,...,λK) is a weight vector to be learned from
training set. A large positive value of λi means
that the ith feature function’s value is frequent to
be 1, whereas a negative value of λi means the ith
feature function’s value is unlikely to be 1.
At decoding time, we are interested in finding
the segmentation sequence ˆS and POS tagging se-
quence ˆT that maximizes the probability defined
in Equation 2. Instead of exhaustively searching
the whole space of all possible segmentations, we
restrict our searching to S = {S1,S2,...,SN},
where S is the restricted search space consisting
of N-best decoded segmentation sequences. This
N-best list of segmentation sequences, S, can be
obtained using modified Viterbi algorithm and A*
search (Schwartz and Chow, 1990).
3 Features
3.1 Features for Segmentation
We adopted the basic segmentation features used
in (Ng and Low, 2004). These features are summa-
rized in Table 1 ((1.1)-(1.7)). In these templates,
C0 refers to the current character, and C−n, Cn re-
fer to the characters n positions to the left and right
of the current character, respectively. Pu(C0) in-
dicates whether C0 is a punctuation. T(Cn) clas-
sifies the character Cn into four classes: num-
bers, dates (year, month, date), English letters and
all other characters. LBegin(C0), LEnd(C0) and
LMid(C0) represent the maximum length of words
found in a lexicon1 that contain the current char-
acter as either the first, last or middle character, re-
spectively. Single(C0) indicates whether the cur-
rent character can be found as a single word in the
lexicon.
Besides the adopted basic features mentioned
above, we also experimented with additional se-
mantic features (Table 1 (1.8)). For (1.8), Sem0
refers to the semantic class of current character,
and Sem−1, Sem1 represent the semantic class
of characters one position to the left and right of
the current character, respectively. We obtained
a character’s semantic class from HowNet (Dong
and Dong, 2006). Since many characters have
multiple semantic classes defined by HowNet, it
is a non-trivial task to choose among the differ-
ent semantic classes. We performed contextual
disambiguation of characters’ semantic classes by
calculating semantic class similarities. For ex-
ample, let us assume the current character is
_d_4663(look,read) in a word context of _d_4663_d_2958(read
1We compiled our lexicon from three external re-
sources. HowNet: www.keenage.com; On-Line Chinese
Tools: www.mandarintools.com; Online Dictionary from
Peking University: http://ccl.pku.edu.cn/doubtfire/Course/
Chinese%20Information%20Processing/Source Code/
Chapter 8/Lexicon full 2000.zip
206
newspaper). The character _d_4663(look) has two se-
mantic classes in HowNet, i.e. _d_6321(read) and _d_1499
_d_3826(doctor). To determine which class is more
appropriate, we check the example words illus-
trating the meanings of the two semantic classes,
given by HowNet. For _d_6321(read), the exam-
ple word is _d_4663_d_1015(read book); for _d_1499_d_3826(doctor),
the example word is _d_4663_d_4542(see a doctor). We
then calculated the semantic class similarity
scores between _d_2958(newspaper) and _d_1015(book), and
_d_2958(newspaper) and _d_4542(illness), using HowNet’s
built-in similarity measure function. Since
_d_2958(newspaper) and _d_1015(book) both have seman-
tic class _d_3230_d_1015(document), their maximum simi-
larity score is 0.95, where the maximum similar-
ity score between _d_2958(newspaper) and _d_4542(illness)
is 0.03478. Therefore, Sem0Sem1 =_d_6321(read),_d_3230
_d_1015(document). Similarly, we can figure out
Sem−1Sem0. For Sem0, we simply picked the
top four semantic classes ranked by HowNet, and
used ”‘NONE”’ for absent values.
Segmentation features
(1.1) Cn,n ∈ [−2,2]
(1.2) CnCn+1,n ∈ [−2,1]
(1.3) C−1C1
(1.4) Pu(C0)
(1.5) T(C−2)T(C−1)T(C0)T(C1)T(C2)
(1.6) LBegin(C0), LEnd(C0)
(1.7) Single(C0)
(1.8) Sem0,SemnSemn+1,n ∈−1,0
POS tagging features
(2.1) Wn,n ∈ [−2,2]
(2.2) WnWn+1,n ∈ [−2,1]
(2.3) W−1W1
(2.4) Wn−1WnWn+1,n ∈ [−1,1]
(2.5) Cn(W0),n ∈ [−2,2]
(2.6) Len(W0)
(2.7) Other morphological features
Table 1: Feature templates list
3.2 Features for POS Tagging
The bottom half of Table 1 summarizes the feature
templates we employed for POS tagging. W0 de-
notes the current word. W−n and Wn refer to the
words n positions to the left and right of the cur-
rent word, respectively. Cn(W0) is the nth char-
acter in current word. If the number of characters
in the word is less than 5, we use ”NONE” for ab-
sent characters. Len(W0) is the number of char-
acters in the current word. We also used a group
of binary features for each word, which are used to
represent the morphological properties of current
word, e.g. whether the current word is punctua-
tion, number, foreign name, etc.
4 Experimental Results
We evaluated our system’s segmentation results on
the SIGHAN Bakeoff 2006 dataset. To evaluate
our reranking method’s impact on the POS tagging
part, we also performed 10-fold cross-validation
tests on the 250k Penn Chinese Treebank (CTB
250k). The CRF model for POS tagging is trained
on CTB 250k in all the experiments. We report re-
call (R), precision (P), and F1-score (F) for both
word segmentation and POS tagging tasks. N
value is chosen to be 20 for the N-best list rerank-
ing, based on cross validation. For CRF learning
and decoding, we use the CRF++ toolkit2.
4.1 Results on Bakeoff 2006 Dataset
R P F Roov Riv
UPUC 0.942 0.928 0.935 0.711 0.964
CityU 0.964 0.964 0.964 0.787 0.971
MSRA 0.949 0.954 0.952 0.692 0.958
CKIP 0.953 0.946 0.949 0.679 0.965
Table 2: Performance of our system on open tracks
of SIGHAN Bakeoff 2006.
We participated in the open tracks of the
SIGHAN Bakeoff 2006, and we achieved F-scores
of 0.935 (UPUC), 0.964 (CityU), 0.952 (MSRA)
and 0.949 (CKIP). More detailed performances
statistics including in-vocabulary recall (Riv) and
out-of-vocabulary recall (Roov) are shown in Table
2.
More interesting to us is how much the N-best
list reranking method using POS tagging helped
to increase segmentation performance. For com-
parison, we ran a linear-cascade of segmentation
and POS tagging CRFs without reranking as the
baseline system, and the results are shown in Table
3. We can see that our reranking method consis-
tently improved segmentation scores. In particu-
lar, there is a greater improvement gained in recall
than precision across all four tracks. We observed
the greatest improvement from the UPUC track.
We think it is because our POS tagging model is
trained on CTB 250k, which could be drawn from
the same corpus as the UPUC training data, and
therefore there is a closer mapping between seg-
mentation standard of the POS tagging training
data and the segmentation training data (at this
2http://chasen.org/ taku/software/CRF++/
207
1 2 3 4 5 6 7 8 9 1093
94
95
96
97
10 cross−fold validation test
F−Measure(%)
CTB Segmentation Results
baseline
final system
1 2 3 4 5 6 7 8 9 1087
88
89
90
91
92
93
94
10 cross−fold validation test
F−Measure(%)
CTB POS Tagging Results
baseline
final system
Figure 1: Segmentation and POS tagging results
on CTB corpus.
point we are not sure if there exists any overlap
between the UPUC test data and CTB 250k).
Baseline system Final system
R P F R P F
UPUC 0.910 0.924 0.917 0.942 0.928 0.935
CityU 0.954 0.963 0.958 0.964 0.964 0.964
MSRA 0.935 0.953 0.944 0.949 0.954 0.952
CKIP 0.932 0.942 0.937 0.953 0.946 0.949
Table 3: Comparison of the baseline system (with-
out POS reranking) and our final system.
4.2 Results on CTB Corpus
To evaluate our reranking method’s impact on the
POS tagging task, we also tested our systems on
CTB 250k corpus using 10-fold cross-validation.
Figure 1 summarizes the results of segmentation
and POS tagging tasks on CTB 250k corpus. From
figure 1 we can see that our reranking method im-
proved both the segmentation and tagging accu-
racies across all 10 tests. We conducted pairwise
t-tests and our reranking model was found to be
statistically significantly better than the baseline
model under significance level of 5.0−4 (p-value
for segmentation) and 3.3−5 (p-value for POS tag-
ging).
5 Conclusion
Our system uses conditional random fields for per-
forming Chinese word segmentation and POS tag-
ging tasks simultaneously. In particular, we pro-
posed an approximated joint decoding method by
reranking the N-best segmenter output, based POS
tagging information. Our experimental results on
both SIGHAN Bakeoff 2006 datasets and Chinese
Penn Treebank showed that our reranking method
consistently increased both segmentation and POS
tagging accuracies. It is worth noting that our
reranking method can be applied not only to Chi-
nese segmentation and POS tagging tasks, but also
to many other sequential tasks that can benefit
from learning transfer, such as POS tagging and
NP-chunking.
Acknowledgment
This work was supported in part by ARDA’s
AQUAINT Program.

References
Zhengdong Dong and Qiang Dong. 2006. HowNet
And The Computation Of Meaning. World Scien-
tific.
John Lafferty, Andrew McCallum, and Fernando
Pereira. 2001. Conditional random fields: Prob-
abilistic models for segmenting and labeling se-
quence data. In Proceedings of ICML ’01.
Xiaoqiang Luo. 2003. A maximum entropy Chinese
character-based parser. In Proceedings of EMNLP
’03.
Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-
of-speech tagging: One-at-a-time or all-at-once?
word-based or character-based? In Proceedings of
EMNLP ’04.
Richard Schwartz and Yen-Lu Chow. 1990. The n-
best algorithm: An efficient and exact procedure for
finding the n most likely sentence hypotheses. In
Proceedings of ICASSP ’90.
Charles Sutton and Andrew McCallum. 2005. Compo-
sition of conditional random fields for transfer learn-
ing. In Proceedings of HLT/EMNLP ’05.
Charles Sutton, Khashayar Rohanimanesh, and An-
drew McCallum. 2004. Dynamic conditional ran-
dom fields: Factorized probabilistic models for la-
beling and segmenting sequence data. In Proceed-
ings of ICML ’04.
Nianwen Xue, Fu-Dong Chiou, and Martha Stone
Palmer. 2002. Building a large-scale annotated Chi-
nese corpus. In Proceedings of COLING ’02.
