A Classification-based Algorithm for Consistency Check of
Part-of-Speech Tagging for Chinese Corpora
Hu Zhang Jia-heng Zheng
School of Computer
& Information Technology
Shanxi University
Taiyuan, Shanxi 030006, China
Ying Zhao*
Department of Computer Science
University of Minnesota
Minneapolis, MN 55455, USA
yzhao@cs.umn.edu
Abstract
Ensuring consistency of Part-of-Speech
(POS) tagging plays an important role
in constructing high-quality Chinese
corpora. After analyzing the POS tag-
ging of multi-category words in large-
scale corpora, we propose a novel con-
sistency check method of POS tagging
in this paper. Our method builds a vector model of the context of multi-category words, and uses the k-NN algorithm to classify context vectors constructed from POS tagging sequences and judge their consistency. The experimental results indicate that the proposed method is feasible and effective.
1 Introduction
Constructing high-quality and large-scale corpora
has always been a fundamental research area in
the field of Chinese natural language process-
ing. In recent years, the rapid development in the fields of machine translation (MT), phonetic recognition (PR), information retrieval (IR), web text mining, etc., has been demanding Chinese corpora of higher quality and larger scale. En-
suring consistency of Part-of-Speech (POS) tag-
ging plays an important role in constructing high-
quality Chinese corpora. In particular, we focus on consistency check of the POS tagging of multi-category words, which consist of the same Chinese characters and are near-synonymous but have different grammatical functions. No matter how many different POS tags a multi-category word may take, ensuring consistency of POS tagging means assigning the word the same POS tag whenever it appears in similar contexts.

* To whom correspondence should be addressed.
Novel approaches and techniques have been
proposed for automatic rule-based and statistics-
based POS tagging, and the “state-of-the-art” ap-
proaches achieve a tagging precision of 89% and
96%, respectively. A great portion of the words
appearing in Chinese corpora are multi-category
words. We have studied the text data from the
2M-word Chinese corpus published by Peking
University, and statistics show that multi-category
words cover 11% of the words, while the percent-
age of the occurrence of multi-category words is
as high as 47%. When checking POS tags, human experts may disagree or make mistakes in some cases. After analyzing 1,042 sentences containing the word “P�”, extracted from the 2M-word Chinese corpus of Peking University, we found 15 incorrect tags for the word “P�”, which accounts for around 1.3%.
So far in the field of POS tagging, most of the work has focused on novel algorithms or techniques for POS tagging itself; only a limited number of studies have focused on consistency check of POS tagging. Xing (1999) analyzed the inconsistency phenomena of word segmentation (WS) and POS tagging. Qu and Chen (2003) improved corpus quality by obtaining POS tagging knowledge from processed corpora, preprocessing, and checking consistency with methods based on rules and statistics. Qian and Zheng (2003; 2004) introduced a rule-based consistency check method that obtains POS tagging knowledge automatically from processed corpora by machine learning (ML) and rough set (RS) methods. For real corpora, Du and Zheng (2001) proposed a rule-based consistency check method and strategy to identify the inconsistency phenomena of POS tagging. However, the algorithms and techniques for automatic consistency check of POS tagging proposed in these studies (Qu and Chen, 2003; Qian and Zheng, 2003; Qian and Zheng, 2004; Du and Zheng, 2001) still have some insufficiencies. For example, inconsistently tagged words that are not covered by the instance set must still be assigned POS tags manually.
In this paper, we propose a novel classification-
based method to check the consistency of POS
tagging. Compared to Zhang et al. (2004), the proposed method fully considers the mutual relations of the POS tags in a POS tagging sequence: it adopts transition and emission probabilities to describe these dependencies and uses the k-NN algorithm to measure similarity. We evaluated our proposed algorithm on our 1.5M-word corpus. In the open test, our method achieved a precision of 85.24% and a recall of 85.84%.
The rest of the paper is organized as follows.
Section 2 introduces the context vector model of
POS tagging sequences. Section 3 describes the
proposed classification-based consistency check
algorithm. Section 4 discusses the experimental
results. Finally, the concluding remarks are given
in Section 5.
2 Describing the Context of
Multi-category Words
The basic idea of our approach is to use the
context information of multi-category words to
judge whether they are tagged consistently or
not. In other words, if a multi-category word ap-
pears in two locations and the surrounding words
in those two locations are tagged similarly, the
multi-category word should be assigned with the
same POS tag in those two locations as well.
Hence, our approach is based on the context of
multi-category words and we model the context
by looking at a window around a multi-category
word and the tagging sequence of this window. In
the rest of this section, we describe our vector rep-
resentation of the context of multi-category words
and how to determine various parameters in our
vector representations.
2.1 Vector Representation of the Context of
Multi-category Words
Our vector representation of context consists of
three key components: the POS tags of each word
in a context window (POS attribute), the impor-
tance of each word to the center multi-category
word based on distance (position attribute), and
the dependency between the POS tags of the center multi-category word and its surrounding words (dependency attribute).
Given a multi-category word and its context window of size $n$, we represent the words in sequential order as $(w_1, w_2, \ldots, w_n)$ and the POS tags of each word as $(t_1, t_2, \ldots, t_n)$. We also refer to the latter vector as the POS tagging sequence. In practice, we choose a proper value of $n$ so that the context window contains a sufficient number of words while the complexity of our algorithm remains relatively low. We will discuss this matter in detail later. In this study, we set the value of $n$ to 7.
2.1.1 POS Attribute
The POS tagging sequence contains information about the POS of each preceding (following) word as well as the position of each POS tag. The POS tags of surrounding words may have different effects on determining the POS of the multi-category word, which we refer to as the POS attribute and represent using a matrix as follows.

Suppose we have a tag set of size $m$, $(c_1, c_2, \ldots, c_m)$. Given a multi-category word with a context window of size $n$, $(w_1, w_2, \ldots, w_n)$, and its POS tagging sequence, the POS attribute matrix $M$ is an $n \times m$ matrix, where the rows correspond to the POS tags of the preceding words, the multi-category word, and the following words in the context window, while the columns correspond to the tags in the tag set. $M_{i,j} = 1$ iff the POS tag of $w_i$ is $c_j$, and $M_{i,j} = 0$ otherwise.
For example, consider the POS attribute matrix of “P�” in the following sentence:

P�4�/v $�/a =&C/n ,X/u P�/a (/n ,�/d �/v �0N/n ,/w

As we let $n = 7$, we look at the word “P�” and its 3 preceding and 3 following words. Hence, the POS tagging sequence is (a, n, u, a, n, d, v). In our study, we used a standard tag set that consists of 25 tags. Suppose the tag set is (n, v, a, d, u, p, r, m, q, c, w, I, f, s, t, b, z, e, o, l, j, h, k, g, y); then the POS attribute matrix of “P�” in this example is:
$$M = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & \cdots \\
1 & 0 & 0 & 0 & 0 & \cdots \\
0 & 0 & 0 & 0 & 1 & \cdots \\
0 & 0 & 1 & 0 & 0 & \cdots \\
1 & 0 & 0 & 0 & 0 & \cdots \\
0 & 0 & 0 & 1 & 0 & \cdots \\
0 & 1 & 0 & 0 & 0 & \cdots
\end{bmatrix}$$
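The mapping from a POS tagging sequence to the matrix $M$ can be sketched as follows (a minimal illustration; the tag order follows the 25-tag set listed above):

```python
# Sketch of building the POS attribute matrix M (n x m):
# M[i][j] = 1 iff the POS tag of the i-th window word is the j-th tag.
TAG_SET = ["n", "v", "a", "d", "u", "p", "r", "m", "q", "c", "w", "I",
           "f", "s", "t", "b", "z", "e", "o", "l", "j", "h", "k", "g", "y"]

def pos_attribute_matrix(tag_sequence, tag_set=TAG_SET):
    index = {tag: j for j, tag in enumerate(tag_set)}
    matrix = [[0] * len(tag_set) for _ in tag_sequence]
    for i, tag in enumerate(tag_sequence):
        matrix[i][index[tag]] = 1
    return matrix

# POS tagging sequence of the example sentence, window size n = 7
M = pos_attribute_matrix(["a", "n", "u", "a", "n", "d", "v"])
```

Each row is a one-hot encoding of one window position's tag, so the example reproduces the matrix shown above.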
2.1.2 Position Attribute
Due to their different distances from the multi-category word, the POS tags of the words before (after) the multi-category word in a POS tagging sequence may have different influences on the POS tagging of the multi-category word, which we refer to as the position attribute.
Given a multi-category word with a context window of size $n$, suppose the number of preceding (following) words is $p$ (i.e., $n = 2p + 1$). The position attribute vector $V_R$ of the multi-category word is given by $V_R = (d_1, \ldots, d_p, d_{p+1}, d_{p+2}, \ldots, d_n)$, where $d_{p+1}$ is the value of the position attribute of the multi-category word and $d_{p+1-i}$ ($d_{p+1+i}$) is the value of the position attribute of the $i$th preceding (following) word. We further require that $\forall i,\; d_{p+1-i} = d_{p+1+i}$ and $d_{p+1} + \sum_{i=1}^{p} (d_{p+1-i} + d_{p+1+i}) = 1$.

We choose a proper position attribute vector so that the multi-category word itself has the highest weight, and the closer a surrounding word is, the higher its weight. If we consider a context window of size 7, based on our preliminary experiments, we chose the following position attribute values: $d_1 = d_7 = 1/22$; $d_2 = d_6 = 1/11$; $d_3 = d_5 = 2/11$; and $d_4 = 4/11$. Hence, the final position attribute vector used in our study can be written as follows:

$$V_R = \left(\tfrac{1}{22}, \tfrac{1}{11}, \tfrac{2}{11}, \tfrac{4}{11}, \tfrac{2}{11}, \tfrac{1}{11}, \tfrac{1}{22}\right).$$
Note that if a POS tag in the POS tagging sequence is incorrect, the position attribute value at the corresponding position should be negated, so that when the incorrect POS tag appears in a POS tagging sequence, this attribute correctly reflects its negative effect on the final context vector.
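A position vector satisfying the symmetry and normalization constraints above can be written down directly; a minimal sketch with the concrete weights chosen in this study:

```python
from fractions import Fraction

# Position attribute vector for a window of size n = 7 (p = 3):
# symmetric around the center word, peaked at the center, summing to 1.
V_R = [Fraction(1, 22), Fraction(1, 11), Fraction(2, 11), Fraction(4, 11),
       Fraction(2, 11), Fraction(1, 11), Fraction(1, 22)]

assert sum(V_R) == 1        # normalization constraint
assert V_R == V_R[::-1]     # symmetry: d_{p+1-i} == d_{p+1+i}
```

Exact rationals make the two constraints checkable without floating-point tolerance.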
2.1.3 Dependency Attribute
The last attribute we focus on is the dependency attribute, which captures the fact that the appearances of the POS tags in a POS tagging sequence are mutually dependent. In particular, we use the transition probability and emission probability of a Hidden Markov Model (HMM) (Leek, 1997) to capture this dependency.
Given a tag set of size $m$, $(c_1, c_2, \ldots, c_m)$, the transition probability table $T$ is an $m \times m$ matrix given by:

$$T_{i,j} = P_T(c_i, c_j) = \frac{f(c_i, c_j)}{f(c_i)},$$

where $f(c_i, c_j)$ is the frequency with which the POS tag $c_j$ appears after the POS tag $c_i$ in the entire corpus, $f(c_i)$ is the frequency of the POS tag $c_i$ in the entire corpus, and $P_T$ is the transition probability.
Given the same tag set of size $m$, $(c_1, c_2, \ldots, c_m)$, the emission probability table $E$ is an $m \times m$ matrix given by:

$$E_{i,j} = P_E(c_i, c_j) = \frac{f(c_i, c_j)}{f(c_j)},$$

where $f(c_i, c_j)$ is the frequency with which the POS tag $c_i$ appears before the POS tag $c_j$ in the entire corpus, $f(c_j)$ is the frequency of the POS tag $c_j$ in the entire corpus, and $P_E$ is the emission probability.
Note that both $T$ and $E$ are constructed from the entire corpus, so we can simply look up these two tables when we consider the POS tags appearing in POS tagging sequences.
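Both tables can be estimated by counting tag unigrams and bigrams over the tagged corpus; a minimal sketch (the three tagging sequences below are toy stand-ins for a real corpus):

```python
from collections import Counter

def build_tables(tag_sequences):
    """Estimate transition table T[(ci, cj)] = f(ci, cj) / f(ci) and
    emission table E[(ci, cj)] = f(ci, cj) / f(cj) from tagged sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))   # f(ci, cj): cj right after ci
    T = {(a, b): n / unigrams[a] for (a, b), n in bigrams.items()}
    E = {(a, b): n / unigrams[b] for (a, b), n in bigrams.items()}
    return T, E

# Toy corpus of POS tagging sequences (hypothetical data)
T, E = build_tables([["a", "n", "u"], ["a", "n", "d", "v"], ["n", "v"]])
```

The two tables share the bigram counts and differ only in which unigram frequency normalizes them, exactly as in the formulas above.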
Now, when we look at a context window of size 7, $(w_1, w_2, \ldots, w_7)$, and its POS tagging sequence $(t_1, t_2, \ldots, t_7)$, there are three types of probabilities we need to take into account.

The first one is the probability of the appearance of the POS tag $t_4$ of the multi-category word, which we can write as follows:

$$P_w(t_4) = f(w_4 \text{ is tagged as } t_4)/f(w_4), \quad (1)$$

where $f(w_4)$ is the frequency of the appearance of the multi-category word $w_4$ in the entire corpus and $f(w_4 \text{ is tagged as } t_4)$ is the frequency with which the word $w_4$ is tagged as $t_4$ in the entire corpus.
The second one is the transition probability, i.e., the probability that the POS tag $t_{i+1}$ appears at position $i+1$ after the POS tag $t_i$ at position $i$, as shown in Eqn. 2:

$$P_T(i, i+1) = P_T(t_i, t_{i+1}) = f(t_i, t_{i+1})/f(t_i). \quad (2)$$
The last one is the emission probability, i.e., the probability that the POS tag $t_{i-1}$ appears at position $i-1$ before the POS tag $t_i$ at position $i$, as shown in Eqn. 3:

$$P_E(i-1, i) = P_E(t_{i-1}, t_i) = f(t_{i-1}, t_i)/f(t_i). \quad (3)$$
According to the above three probability formulas, we can build a seven-dimensional vector, where each dimension corresponds to one POS tag in the POS tagging sequence.
Given a multi-category word with a context window of size 7 and its POS tagging sequence, the dependency attribute vector $V_P$ of the multi-category word is defined as follows:

$$V_P = (P_1, P_2, P_3, P_4, P_5, P_6, P_7),$$

where

$$P_1 = P_T(1,2) \cdot P_2 = P_T(1,2) \cdot P_T(2,3) \cdot P_T(3,4) \cdot P_w(t_4),$$
$$P_2 = P_T(2,3) \cdot P_3 = P_T(2,3) \cdot P_T(3,4) \cdot P_w(t_4),$$
$$P_3 = P_T(3,4) \cdot P_4 = P_T(3,4) \cdot P_w(t_4),$$
$$P_4 = P_w(t_4),$$
$$P_5 = P_E(4,5) \cdot P_4 = P_E(4,5) \cdot P_w(t_4),$$
$$P_6 = P_E(5,6) \cdot P_5 = P_E(5,6) \cdot P_E(4,5) \cdot P_w(t_4),$$
$$P_7 = P_E(6,7) \cdot P_6 = P_E(6,7) \cdot P_E(5,6) \cdot P_E(4,5) \cdot P_w(t_4).$$
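Each component of $V_P$ chains probabilities from the window edge toward the center tag $t_4$: transition probabilities on the preceding side, emission probabilities on the following side. A sketch (the probability tables here are hypothetical lookups):

```python
def dependency_vector(tags, T, E, p_w_t4):
    """Dependency attribute vector V_P for a 7-tag sequence (center index 3).
    T[(a, b)]: transition probability of b right after a;
    E[(a, b)]: emission probability of a right before b;
    p_w_t4: probability Pw(t4) that the center word carries tag t4."""
    P = [0.0] * 7
    P[3] = p_w_t4                                    # P4 = Pw(t4)
    for i in (2, 1, 0):                              # preceding side: chain with T
        P[i] = T.get((tags[i], tags[i + 1]), 0.0) * P[i + 1]
    for i in (4, 5, 6):                              # following side: chain with E
        P[i] = E.get((tags[i - 1], tags[i]), 0.0) * P[i - 1]
    return P

# Hypothetical tables for the example sequence (a, n, u, a, n, d, v)
T = {("a", "n"): 0.5, ("n", "u"): 0.2, ("u", "a"): 0.4}
E = {("a", "n"): 0.3, ("n", "d"): 0.1, ("d", "v"): 0.6}
V_P = dependency_vector(["a", "n", "u", "a", "n", "d", "v"], T, E, 0.8)
```

The recursion makes the chaining explicit: $P_i$ on each side is the adjacent pair probability times the already-computed neighbor closer to the center.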
2.1.4 Context Vector of Multi-category
Words
Now we are ready to define the context vector
of multi-category words.
Given a multi-category word with a context window of size $n$, its POS attribute matrix $M$, position attribute vector $V_R$, and dependency attribute vector $V_P$, the context vector $V_S$ of the multi-category word is defined as follows:

$$V_S = (\alpha V_R + \beta V_P) \times M, \quad (4)$$

where $\alpha$ and $\beta$ are the weights of the position attribute and the dependency attribute, respectively. Note that we require $\alpha + \beta = 1$, and their optimal values are determined by experiments in our study.
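Eqn. 4 is a row-vector-by-matrix product: the weighted combination of the two length-$n$ attribute vectors is multiplied by the $n \times m$ POS attribute matrix, yielding an $m$-dimensional context vector. A minimal sketch (the window of 3 words, 2-tag set, and all input values are illustrative):

```python
def context_vector(V_R, V_P, M, alpha=0.4, beta=0.6):
    """V_S = (alpha * V_R + beta * V_P) x M, as in Eqn. 4.
    V_R, V_P: length-n vectors; M: n x m POS attribute matrix."""
    combined = [alpha * r + beta * p for r, p in zip(V_R, V_P)]
    m = len(M[0])
    return [sum(combined[i] * M[i][j] for i in range(len(M)))
            for j in range(m)]

# Tiny illustration: window of 3 words, tag set of 2 tags
M = [[1, 0], [0, 1], [1, 0]]    # one-hot tags of the 3 window words
V_S = context_vector([0.25, 0.5, 0.25], [0.1, 0.8, 0.1], M)
```

With one-hot rows in $M$, each component of $V_S$ accumulates the combined weights of the window positions carrying that tag.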
2.2 Experiment on the Size of the Context
Window
Context vectors can be extended by using 4 to 7 preceding (following) words instead of 3 in the context windows and POS tagging sequences. We conducted experiments with 3 to 7 preceding (following) words on our sampled 1M-word training corpus and performed a closed test. The experimental results are evaluated in terms of both the precision of consistency check and algorithm complexity.
We plot the effect on precision in Figure 1.
[Figure 1 plots precision (y-axis, 0.87–0.882) against the number of preceding (following) words (x-axis, 3–7).]
Figure 1: Effect on precision of the number of preceding (following) words.
As shown in Figure 1, the precision of consistency check increases as we include more preceding (following) words; in particular, precision improves by 1% when we use 7 preceding (following) words. However, the increase in complexity is much higher than the gain in precision, because the dimensionality of the position attribute vector, POS attribute matrix, and dependency attribute vector doubles. Hence, we chose 3 as the number of preceding (following) words to form context windows and calculate context vectors.
2.3 Effect on Consistency Check Precision of α and β
When using our sampled 1M-word training corpus to conduct the closed test, we found that the consistency check precision changes significantly with different values of α and β. Figure 2 shows the trend as α varies from 0.1 to 0.9. We used α = 0.4 and β = 0.6 in our experiments.
[Figure 2 plots precision (y-axis, 0.5–0.9) against α (x-axis, 0.1–0.9).]
Figure 2: Effect on consistency check precision of α and β.
3 Consistency Check of POS Tagging
Our consistency check algorithm is based on clas-
sification of context vectors of multi-category
words. In particular, we first classify context
vectors of each multi-category word in the train-
ing corpus, and then we conduct the consistency
check of POS tagging based on classification re-
sults.
3.1 Similarity between Context Vectors of
Multi-category Words
After constructing context vectors for all multi-
category words from their context windows and
POS tagging sequences, the similarity of two con-
text vectors is defined as the Euclidean Distance
between the two vectors.
$$d(x, y) = \|x - y\| = \left[\sum_{i=1}^{n}(x_i - y_i)^2\right]^{1/2}, \quad (5)$$

where $x$ and $y$ are two arbitrary context vectors of $n$ dimensions.
3.2 k-NN Classification Algorithm
Classification is the process of assigning objects to one of a set of known classes. In this paper, we used a popular classification method: the k-NN algorithm.

Suppose we have $c$ classes, and each class $\omega_i$ ($i = 1, 2, \ldots, c$) has $N_i$ samples $x_j^{(i)}$ ($j = 1, 2, \ldots, N_i$). The idea of the k-NN algorithm is that, for each unlabeled object $x$, we compute the distances between $x$ and all samples whose classes are known, and select the $k$ samples ($k$ nearest neighbors) with the smallest distances. The object $x$ is then assigned to the class that contains the most samples among the $k$ nearest neighbors.

We now formally define the discriminant function and discriminant rule. Suppose $k_1, k_2, \ldots, k_c$ are the numbers of samples among the $k$ nearest neighbors of the object $x$ that belong to the classes $\omega_1, \omega_2, \ldots, \omega_c$, respectively. Define the discriminant function of the class $\omega_i$ as $d_i(x) = k_i$, $i = 1, 2, \ldots, c$. Then, the discriminant rule for determining the class of the object $x$ can be defined as follows:

$$d_m(x) = \max_{i=1,2,\ldots,c} d_i(x) \;\Longrightarrow\; x \in \omega_m.$$
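The discriminant rule amounts to a majority vote among the $k$ nearest samples under the distance of Eqn. 5; a minimal sketch with hypothetical two-dimensional context vectors:

```python
import math
from collections import Counter

def euclidean(x, y):
    # d(x, y) = [sum_i (x_i - y_i)^2]^(1/2), as in Eqn. 5
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(x, samples, k=6):
    """samples: list of (vector, class_label) pairs. Assign x to the
    class holding the majority among its k nearest neighbors."""
    nearest = sorted(samples, key=lambda s: euclidean(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical context vectors labeled with the center word's POS tag
samples = [([0.9, 0.1], "a"), ([0.8, 0.2], "a"), ([0.7, 0.2], "a"),
           ([0.1, 0.9], "n"), ([0.2, 0.8], "n")]
label = knn_classify([0.85, 0.15], samples, k=3)  # -> "a"
```

Ties are broken arbitrarily by `Counter.most_common`; the paper's choice of $k$ is discussed in Section 3.4.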
3.3 Consistency Check Algorithm
In this section, we describe the steps of our
classification-based consistency check algorithm
in detail.
Step 1: Randomly sample sentences containing multi-category words and check their POS tagging manually. For each multi-category word, classify the context vectors of the sampled POS tagging sequences, so that context vectors that share the same POS for the multi-category word belong to the same class.
Step 2: Given a context vector x of a multi-category word c, calculate the distances between x and all the context vectors that contain the multi-category word c in the training corpus, and select the k context vectors with the smallest distances.
Step 3: According to the k-NN algorithm, check the classes of the k nearest context vectors and classify the vector x.
Step 4: Compare the POS of the multi-category word c in the class that the k-NN algorithm assigns x to with the POS tag of c. If they are the same, the POS tagging of the multi-category word c is considered consistent; otherwise it is inconsistent.
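Steps 2–4 can be sketched end-to-end as follows; `vectorize` is a hypothetical stand-in for the context vector model of Section 2:

```python
import math
from collections import Counter

def check_consistency(tagged_instance, labeled_instances, vectorize, k=6):
    """Steps 2-4 for one occurrence of a multi-category word.
    tagged_instance: (pos_tagging_sequence, center_tag);
    labeled_instances: manually checked (sequence, tag) pairs (Step 1).
    Returns True if the tagging is judged consistent."""
    seq, tag = tagged_instance
    x = vectorize(seq)
    def dist(inst):
        y = vectorize(inst[0])
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    nearest = sorted(labeled_instances, key=dist)[:k]                # Step 2
    predicted = Counter(t for _, t in nearest).most_common(1)[0][0]  # Step 3
    return predicted == tag                                          # Step 4

# Toy illustration with a trivial tag-count vectorizer (hypothetical)
vectorize = lambda seq: [seq.count(t) for t in ("a", "n", "v")]
labeled = [(["a", "n", "a"], "a"), (["a", "a", "n"], "a"),
           (["n", "v", "n"], "n")]
consistent = check_consistency((["a", "n", "a"], "a"), labeled,
                               vectorize, k=2)
```

A real run would use the full context vector of Eqn. 4 in place of the toy `vectorize`.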
The major disadvantage of this algorithm is the difficulty of selecting the value of k. If k is too small, the classification result is unstable. On the other hand, if k is too large, the classification deviation increases.
3.4 Selecting k in the Classification Algorithm
Figure 3 shows the consistency check precision values obtained with various values of k in the k-NN algorithm. The precision values are closed test results on our 1M-word training corpus, and were obtained by using α = 0.4 and β = 0.6 in the context vector model.

Table 1: Experimental Results

Test corpora | Test type | Number of multi-category words | Number of true inconsistencies | Number of identified inconsistencies | Recall (%) | Precision (%)
1M-word      | closed    | 127,210                        | 1,147                          | 1,219 (156)                          | 92.67      | 87.20
500K-word    | open      | 64,467                         | 579                            | 583 (86)                             | 85.84      | 85.24
[Figure 3 plots average precision (y-axis, 0.5–0.85) against the number of nearest neighbors k (x-axis, 1–10).]
Figure 3: Effect on precision of k in the k-NN algorithm.
As shown in Figure 3, as k increases beyond 6, the precision remains the same; when k reaches 9, the precision starts declining. Our experiments with other α and β values show similar trends. Hence, we chose k = 6 in this paper.
4 Experimental Results
We evaluated our consistency check algorithm on our 1.5M-word corpus (including the 1M-word training corpus) and conducted open and closed tests. The results are shown in Table 1.
The experimental results show two interesting trends. First, the precision and recall of our consistency check algorithm are 87.20% and 92.67%, respectively, in the closed test, and 85.24% and 85.84%, respectively, in the open test. Compared to Zhang et al. (2004), the precision of consistency check is improved by 2–3%, and the recall is improved by 10%. These results indicate that our context vector model is a substantial improvement over the one used in Zhang et al. (2004). Second, thanks to the great improvement in recall, our consistency check algorithm, to some extent, prevents low-probability POS tagging events from slipping through.
5 Conclusion and Future Research
In this paper, we proposed a new classification-based method to check the consistency of POS tagging, and evaluated it on our 1.5M-word corpus (including the 1M-word training corpus) with both open and closed tests.
In the future, we plan to investigate more types
of word attributes and incorporate linguistic and
mathematical knowledge to develop better con-
sistency check models, which ultimately provide
a better means of building high-quality Chinese
corpora.
Acknowledgements
This research was partially supported by the Na-
tional Natural Science Foundation of China No.
60473139 and the Natural Science Foundation of
Shanxi Province No. 20051034.
References
Y. Du and J. Zheng. 2001. The proofreading method study on consistence of segment and part-of-speech. Computer Development and Application, 14(10):16–18.

T. R. Leek. 1997. Information extraction using hidden Markov models. Master's thesis, UC San Diego.

Y. Qian and J. Zheng. 2003. Research on the method of automatic correction of Chinese POS tagging. Journal of Chinese Information Processing, 18(2):30–35.

Y. Qian and J. Zheng. 2004. An approach to improving the quality of part-of-speech tagging of Chinese text. In Proceedings of the 2004 IEEE International Conference on Information Technology: Coding and Computing (ITCC 2004).

W. Qu and X. Chen. 2003. Analysing on the words classified hard in POS tagging. In Proceedings of the 9th National Computational Linguistics (JSCL'03).

H. Xing. 1999. Analysing on the words classified hard in POS tagging. In Proceedings of the 5th National Computational Linguistics (JSCL'99).

H. Zhang, J. Zheng, and J. Liu. 2004. The inspecting method study on consistency of POS tagging of corpus. Journal of Chinese Information Processing, 18(5):11–16.