BLENDING SEGMENTATION WITH TAGGING 
IN CHINESE LANGUAGE CORPUS PROCESSING *
Zhou Qiang, Yu Shiwen 
Institute of Computational Linguistics
Peking University, Beijing 100871, P.R.China
ABSTRACT 
This paper proposes a new method for Chinese
language corpus processing. Unlike previous research,
our approach has the following characteristics: it blends
segmentation with tagging, and it integrates a rule-based
approach with a statistics-based one in grammatical
disambiguation. The principal ideas presented in the paper
are incorporated in the development of a Chinese corpus
processing system. Experimental results show that the
overall accuracy is 97.68% for segmentation and 94.55%
for tagging on a corpus of about 400,000 Chinese
characters.
1. Introduction 
Processing a Mandarin Chinese corpus involves
several stages. Starting from the initial text corpus and
proceeding through word segmentation, grammatical category
tagging, syntactic analysis (bracketing), and semantic and
pragmatic analysis, one obtains corpora with different
tags, such as segmentation tags, word categories,
phrase categories and so on. In this paper, we
focus on the first two stages, i.e. word segmentation
and category (i.e. part-of-speech) tagging.
Word segmentation is essential in Chinese
information processing because there are no obvious
delimiting markers between Chinese words except for
some punctuation marks. Matching input characters
against the lexical entries in a large dictionary is
helpful in identifying the embedded words. However,
ambiguous segmentation strings (ASSs) and
unregistered words (i.e. words that are not registered
in the dictionary) in the text lower the
segmentation accuracy. To resolve these problems,
various knowledge sources may have to be consulted.
In the past decade, two different methodologies were
used for word segmentation: some approaches are
rule-based ([1--5]), while others are statistics-based ([6--8]).
Many automatic word segmentation systems adopting
the above models have been developed and significant
results have been achieved. But these systems were
developed only at the word level. They did not take
large-scale corpus category tagging into account and
lacked an objective evaluation of segmentation accuracy
at the category level. So the development of these
automatic segmentation systems is restricted.
Grammatical category tagging for the Chinese language
is very difficult, because Chinese words are frequently
ambiguous. One Chinese word can represent lexical
items of different categories. Apart from this, unlike
English and other Indo-European languages, Chinese
has no inflections, and therefore there are no obvious
morphological variations in Chinese text that
help to distinguish one grammatical category from
the others.
In some Chinese category tagging systems,
statistics-based algorithms were used ([10--12]). The basic
processing procedure of these systems is as follows. First, a
tagged corpus is produced through editing. Then, a dictionary
containing category tagging entries and a matrix of
category collocational probabilities are derived from
the tagged corpus. Using these parameters, a probability
model is built and category tagging is completed
automatically. Up to now, there have been no reports
of a rule-based approach to Chinese language category
tagging.
Compared with the above research on
segmentation and tagging, our method has the following
new characteristics:
First, it blends segmentation with tagging. We use a
segmentation dictionary, in which every word is marked
with its word category, to complete segmentation and
initial tagging simultaneously. The category becomes a
bridge linking segmentation and tagging.
Second, it integrates the rule-based approach with the
statistics-based approach in category tagging. It therefore
inherits the advantages of the two approaches and
overcomes their respective disadvantages.
The following sections discuss this method in
detail.
* The project is supported by the National Science Foundation of China.
2. Corpus processing blending segmentation 
with tagging 
In the practice of segmenting many Chinese sentences, we
find that it is helpful to make use of word categories in
automatic segmentation. In general, there are
three advantages:
1). Using the category collocational relations of the different
words in ASSs and the contextual word categories, one can
resolve most segmentation ambiguities.
As we know, there are two types of ASS: intersecting
ASS (IASS) and combining ASS (CASS).
An IASS S=ABC has two possible segmentations:
AB+C and A+BC. It thus yields two category
combinations: $C_{AB} + C_C$ and $C_A + C_{BC}$. But the
probability of each appearing in a given context is not the same.
Depending on the context and the difference between the
two category collocational probabilities ($P(C_{AB} \mid C_C)$ and
$P(C_A \mid C_{BC})$), we can select the correct segmentation.
Sometimes a CASS S=AB can be segmented into two
words, A+B, but occasionally it is only one word, S. Since
the CASS itself cannot provide the information needed
for correct segmentation, it is necessary to take the relation
between it and the preceding or following word into
consideration. In this sense, the categories of the words in
the CASS and of those beside the CASS play a very
important role.
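As a minimal sketch of the IASS case above, the following Python
fragment chooses between AB+C and A+BC by comparing the two category
collocational probabilities. The probability table, the category
values in it and the function name choose_iass_split are illustrative
assumptions, not data from our system; the use of the wider context
is omitted.

```python
# Hypothetical category collocational probabilities, keyed as
# (left category, right category) -> P(left | right).
# The numbers are invented for illustration only.
COLLOCATION_PROB = {
    ("n", "v"): 0.30,
    ("v", "n"): 0.12,
}

def choose_iass_split(cat_ab, cat_c, cat_a, cat_bc, prob=COLLOCATION_PROB):
    """Return 'AB+C' or 'A+BC' for an intersecting ambiguous string ABC."""
    p_ab_c = prob.get((cat_ab, cat_c), 0.0)   # P(C_AB | C_C)
    p_a_bc = prob.get((cat_a, cat_bc), 0.0)   # P(C_A | C_BC)
    # Ties are resolved in favour of AB+C; this is an arbitrary choice.
    return "AB+C" if p_ab_c >= p_a_bc else "A+BC"

print(choose_iass_split("n", "v", "v", "n"))
```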
2). Helping to compound new words by using Chinese
word-formation theory
In Chinese, a word is composed of morphemes. The
combination of morphemes has its own rules. These
rules tell us which morphemes, and what kinds of morphemes,
can be combined into a word. Using these rules, we can find
some unregistered words and segment them correctly from
a sentence. For example, typical word-compounding cases
of nouns are:
A). mono-syllabic noun + mono-syllabic noun
ma(horse) + che(car) -->
mache(carriage)
B). mono-syllabic noun + bi-syllabic noun
shou(hand) + zhijia(nail) -->
shouzhijia(finger nail)
C). bi-syllabic noun + mono-syllabic noun
dianliu(current) + biao(table) -->
dianliubiao(galvanometer)
D). bi-syllabic verb + mono-syllabic noun
zhengming(prove) + xin(letter) -->
zhengmingxin(testimonial)
From such word-compounding cases, we can sum up
many useful word-formation rules that are based on
category combination. Therefore, we can achieve a better
segmentation result even with a smaller
segmentation dictionary.
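The four noun-compounding cases above can be written down as a small
rule table. The sketch below is only an illustration: the tuple layout,
the rule table name and the function compound are assumptions, and the
syllable counts are supplied by hand (in Chinese text they would be the
number of characters).

```python
# Rule key: (category 1, syllables 1, category 2, syllables 2).
# Rule value: category of the compound. Covers cases A) to D) only.
NOUN_FORMATION_RULES = {
    ("n", 1, "n", 1): "n",   # A) ma + che -> mache
    ("n", 1, "n", 2): "n",   # B) shou + zhijia -> shouzhijia
    ("n", 2, "n", 1): "n",   # C) dianliu + biao -> dianliubiao
    ("v", 2, "n", 1): "n",   # D) zhengming + xin -> zhengmingxin
}

def compound(first, second, rules=NOUN_FORMATION_RULES):
    """first and second are (pinyin, category, syllable_count) tuples.

    Return (compound_pinyin, category), or None if no rule applies.
    """
    (w1, c1, s1), (w2, c2, s2) = first, second
    category = rules.get((c1, s1, c2, s2))
    return (w1 + w2, category) if category else None

print(compound(("ma", "n", 1), ("che", "n", 1)))
print(compound(("zhengming", "v", 2), ("xin", "n", 1)))
```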
3). Helping to discover some segmentation errors
In Chinese sentences, the frequency of some category
collocations is very low, such as d+n+$, v+u+d+$ and so
on, where d is adverb, n is noun, v is verb, u is auxiliary, and $
is the ending mark of a sentence. Therefore, if
such a category combination occurs in a segmented sentence,
we can be almost certain that the segmentation is
wrong. The following examples contain such errors:
i). mai/v le/u yitou/d niu/n ./w
(buy -ed head cow
correct result : bought a cow )
ii). ta/r qiu/n da/v de/u zuihao/d ./w
(he ball play Prt had better
correct result : He plays basketball best.)
Here, we can see that the category information provides
a powerful means of checking segmentation errors
automatically.
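A check of this kind can be sketched in a few lines of Python. The
pattern list contains only the two collocations named above; the
function name, the treatment of the final punctuation tag w as the
sentence end, and the tags m and q in the usage example are
assumptions made for illustration.

```python
# Category sequences that are rare immediately before the sentence end ($).
SUSPICIOUS_ENDINGS = [("d", "n"), ("v", "u", "d")]

def has_suspicious_ending(tagged, patterns=SUSPICIOUS_ENDINGS):
    """tagged is a list of (word, category) pairs for one sentence."""
    cats = [cat for _, cat in tagged]
    # Drop sentence-final punctuation so the pattern abuts the sentence end.
    while cats and cats[-1] == "w":
        cats.pop()
    return any(
        len(cats) >= len(p) and tuple(cats[-len(p):]) == p
        for p in patterns
    )

wrong = [("mai", "v"), ("le", "u"), ("yitou", "d"), ("niu", "n"), (".", "w")]
fixed = [("mai", "v"), ("le", "u"), ("yi", "m"), ("tou", "q"), ("niu", "n"), (".", "w")]
print(has_suspicious_ending(wrong), has_suspicious_ending(fixed))
```

A sentence flagged in this way would be sent back to the segmentation
step rather than corrected in place.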
Based on the above observations, we propose a
method combining segmentation with tagging and have used it
in the segmentation and category tagging of a
large-scale Chinese language corpus. The basic processing
procedure is:
First, complete automatic segmentation by using a
segmentation dictionary with word categories. At the
same time, assign an initial tag (all possible categories for
a word) to every segmentation unit.
Second, carry out some basic word compounding,
such as combining stems with affixes, combining
reduplicated morphemes, integrating Chinese numeral
words and so on.
Third, implement automatic category tagging through
grammatical category disambiguation and assign a single
category tag to every word.
Fourth, find and combine unregistered words that
accord with Chinese word-formation rules and assign a
suitable category to the combined new words.
Fifth, check the category combinations in the segmented
sentences, find possible errors and then go back to
the segmentation process.
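The first step of this procedure can be illustrated as follows. The
sketch assumes forward maximum matching as the dictionary lookup
strategy, which is a common choice but is not stated above; the tiny
dictionary, its category sets and the function name segment_and_tag
are invented. Each pinyin syllable stands for one Chinese character.

```python
# Hypothetical segmentation dictionary: word (tuple of syllables) -> categories.
SEG_DICT = {
    ("ta",): {"r"},
    ("mai",): {"v"},
    ("le",): {"u", "y"},
    ("ma",): {"n"},
    ("che",): {"n"},
    ("ma", "che"): {"n"},
}
MAX_WORD_LEN = 2  # longest dictionary entry, in characters

def segment_and_tag(syllables, seg_dict=SEG_DICT, max_len=MAX_WORD_LEN):
    """Segment by forward maximum matching and attach all possible tags."""
    result, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(syllables) - i), 0, -1):
            unit = tuple(syllables[i:i + length])
            if unit in seg_dict:
                result.append(("".join(unit), sorted(seg_dict[unit])))
                break
        else:
            # Unregistered character: keep it with no category yet.
            length = 1
            result.append((syllables[i], []))
        i += length
    return result

print(segment_and_tag(["ta", "mai", "le", "ma", "che"]))
```

Units that come back with more than one category are the input to the
disambiguation step, and units with an empty list are candidates for
unregistered-word handling.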
3. The design strategy of category tagging
Compared with many past automatic category tagging
systems ([10--12]), our current processing has some new
properties. The basic ideas can be briefly summarized as
follows:
1). Be based on a dictionary with word categories
In the current process, the initial category tagging is made
by looking up the segmentation dictionary with word
categories during segmentation. The categories are derived
from the "Grammar Knowledge Base for Chinese Words"
(GKBCW), which has been developed by the Institute of
Computational Linguistics of Peking University over the past
five years [13]. Since the information in the dictionary was
provided by linguists who follow the standard
of classification based on the distribution of grammatical
functions [14], it is of high accuracy. Therefore, by applying
this information to initial category tagging, the consistency
and reliability of the tagging results can be ensured.
This lays a good foundation for the subsequent
disambiguation processing.
2). Use a small tag set 
In our current system, category tagging is restricted to
the basic category descriptions, i.e. 26 categories.
Meanwhile, in order to keep the new information that was
found during manual proofreading, such as proper
names, proper addresses, and so on, we add several
subcategories: ng (proper noun), ngp (proper noun for a
person), Ng (noun morpheme), Ag (adjective
morpheme) and Vg (verb morpheme). All these categories
and subcategories form a tag set of 31 tags.
A small tag set helps us concentrate on the
ambiguous words that appear most frequently in
sentences. Therefore, the processing complexity can be
reduced and the tagging accuracy improved.
3). Form a stereo knowledge base by combining tagged
results with the information in the dictionary
Although our tag set is small, we can easily expand the
tag set for different applications by linking with the
GKBCW, because in the GKBCW each category has
many features, which were proposed by linguists. These
features help to describe the grammatical functions and
distributions of every category completely. For example,
the verb category has about forty features, and the noun
category has twenty-five features ([13]). In general, these
grammatical features are also a kind of information for
classification.
If we use a word and its basic category in the tagged
corpus as a keyword to look up the GKBCW, we can get the
detailed grammatical features of each word. Therefore,
taking all tagged words as a plane, and the grammatical
features of every word as depth, we obtain a stereo
knowledge base. According to different needs, we can assign
different grammatical categories or subcategories to the
words in the corpus by using the grammatical features in the
knowledge base. In addition, using the stereo knowledge
base, we can also analyse the phrase structure of sentences
in the corpus.
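The lookup that adds this depth to the tagged plane can be pictured
with a small Python sketch. The dictionary FEATURE_KB only stands in
for the GKBCW: its feature names and values are invented for
illustration and are not taken from the knowledge base, and the
function name add_depth is likewise hypothetical.

```python
# Hypothetical excerpt of a feature dictionary keyed by (word, category).
# Feature names and values are invented, not real GKBCW features.
FEATURE_KB = {
    ("zhengming", "v"): {"takes_object": True},
    ("xin", "n"): {"countable": True},
}

def add_depth(tagged_words, kb=FEATURE_KB):
    """Attach the feature record of every (word, category) pair.

    Pairs that are missing from the knowledge base get an empty record.
    """
    return [(word, cat, kb.get((word, cat), {})) for word, cat in tagged_words]

print(add_depth([("zhengming", "v"), ("xin", "n"), ("mache", "n")]))
```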
4). Integrate rule-based approach with statistics-based 
approach in disambiguation 
Because the rule-based approach and the statistics-based
approach have their respective advantages, we tried to
integrate them in our category tagging system. Our method
is as follows. First, through statistical analysis (manual or
automatic) of a large-scale corpus, find the most
frequent ambiguous phenomena, study their contexts, and
extract contextual frame rules to eliminate the
most frequent and comparatively simple
ambiguities. Then, using the parameters trained on a
correctly tagged corpus, build a probability model to
disambiguate the ambiguous category combinations of
lower frequency and to deduce the categories of
unregistered words.
In actual processing, however, we place different
emphasis on these two approaches at different stages. At
first, because there was no large-scale corpus tagged
with correct categories, a small-scale corpus had to be
tagged using the rule-based approach, and its remaining
ambiguities and tagging errors were corrected
manually. After statistical analysis of the correctly tagged
corpus, the rule base was adjusted and some trained
parameters were obtained. Then some new sentences were
added to the old corpus to form a new middle-scale corpus.
Using the adjusted rules and trained parameters, the
new corpus was tagged through both the rule-based approach
and the statistics-based approach. In this way, the scale of the
corpus was increased gradually like a snowball. As the
corpus grew, the rule descriptions became
more and more accurate and the statistical information
became more and more comprehensive. Therefore the
manual proofreading work decreases drastically. As a
result, a good integration of these two approaches was
achieved.
4. Disambiguation in automatic category
tagging
4.1. rule-based approach 
The basic strategy of the rule-based approach is to determine
one category for a categorically ambiguous word based on
its syntactic or semantic context. In our system, in order to
improve the tagging result, the task is divided into three
stages:
1). disambiguation against special words
In Chinese running text, some multi-tag words appear
frequently, especially mono-syllabic words such as
"yi", "zhe", "le", "guo", "ba", "lai", "hao", "jiu", and so on.
For these words, we set special disambiguation rules,
which describe the different contexts in which these words
take different categories. Therefore, the category of such
words in a sentence can be determined easily. This is a
word-oriented disambiguation.
2). disambiguation against special multi-tag 
According to statistical analysis, some multi-tag
combinations, such as v-q, p-v, v-n, q-n, v-d, a-v and so on,
appear frequently in the corpus. In order to construct the
disambiguation rules for these multi-tag combinations, the
probability that one particular tag is selected from a
multi-tag set in different contexts is counted. At the same
time, the grammatical function features of each category,
especially the distribution information that distinguishes
one category from the others, are summed up and
extracted. Then the ambiguities can be eliminated by these
rules. This is a multitag-oriented disambiguation.
3). disambiguate by context constraint 
The approach applies a set of context frame rules. Each
rule, when its context is satisfied, has the effect of deleting
one or more candidates from the list of possible tags for
one word. If the number of candidates is reduced to
one, disambiguation is considered successful. This is a
frame-oriented disambiguation.
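The deleting behaviour of context frame rules can be sketched as
follows. The rule format (left tag, right tag, tags to delete), the
single example rule and the function name are assumptions made for
illustration; real frame rules may look at wider contexts. The guard
that never deletes the last remaining candidate is a design choice of
this sketch.

```python
# Hypothetical frame rule: after an adverb (d), a v-n word is not a noun.
# None in a context slot matches any tag.
FRAME_RULES = [("d", None, {"n"})]

def apply_frame_rules(candidates, left_tag, right_tag, rules=FRAME_RULES):
    """Return the candidate tags that survive all matching rules."""
    remaining = set(candidates)
    for left, right, to_delete in rules:
        if left in (None, left_tag) and right in (None, right_tag):
            pruned = remaining - set(to_delete)
            if pruned:                 # never delete the last candidate
                remaining = pruned
    return remaining

tags = apply_frame_rules({"v", "n"}, left_tag="d", right_tag="w")
print(tags, "resolved" if len(tags) == 1 else "still ambiguous")
```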
4.2. statistics-based approach
Formally, the statistical scheme can be described as
follows.
Let $W = W_1 \ldots W_n$ be a span of ambiguous words in a
sentence, where $W_1$ and $W_n$ are unambiguous, and let
$C = C_1 \ldots C_n$ be a possible tag sequence for the span,
where $C_i$ is a category of $W_i$. $P(C \mid W)$ is the
conditional probability of $C$ given $W$. The goal of
disambiguation is therefore to find the category sequence
$C'$ with the largest score $P(C' \mid W)$, i.e.
$$P(C' \mid W) = \max_{C} P(C \mid W)$$
Computing the above formula with a bi-gram model, we
get:
$$P(C \mid W) \approx \prod_{i=2}^{n} P(C_i \mid C_{i-1}) \, P(W_i \mid C_i)$$
where $P(C_i \mid C_{i-1})$ are the contextual probabilities and
$P(W_i \mid C_i)$ are the lexical probabilities. Approximations
of the two probabilities can be calculated from the trained
parameters.
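One simple way to obtain these two probabilities from a correctly
tagged corpus is relative-frequency counting, sketched below. This
estimator is an assumption made for illustration (no smoothing is
applied), and the function names and the two-sentence toy corpus are
invented.

```python
from collections import Counter

def train(tagged_sentences):
    """Estimate P(cur | prev) and P(word | tag) by relative frequency."""
    tag_count, word_tag_count = Counter(), Counter()
    bigram_count, prev_count = Counter(), Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for word, tag in sentence:
            tag_count[tag] += 1
            word_tag_count[(word, tag)] += 1
        for prev, cur in zip(tags, tags[1:]):
            bigram_count[(prev, cur)] += 1
            prev_count[prev] += 1

    def contextual(cur, prev):
        """P(cur | prev); 0.0 if prev was never seen as a left neighbour."""
        return bigram_count[(prev, cur)] / prev_count[prev] if prev_count[prev] else 0.0

    def lexical(word, tag):
        """P(word | tag); 0.0 if the tag never occurred."""
        return word_tag_count[(word, tag)] / tag_count[tag] if tag_count[tag] else 0.0

    return contextual, lexical

corpus = [
    [("ta", "r"), ("mai", "v"), ("niu", "n")],
    [("ta", "r"), ("da", "v"), ("qiu", "n")],
]
contextual, lexical = train(corpus)
print(contextual("v", "r"), lexical("mai", "v"))
```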
In actual processing, the category of an unregistered
word is deduced first. Let $C_u$ be the set of possible tags for
the unregistered word, $C_l$ the tag of the word on its left and
$C_r$ the tag of the word on its right, and let $T$ be the set of
all tags in the corpus. Then $C_u = \{C_1, C_2\}$, where:
$$C_1 = \arg\max_{C_i \in T} P(C_i \mid C_l)$$
$$C_2 = \arg\max_{C_j \in T} P(C_r \mid C_j)$$
So the unregistered word problem is converted into a
categorical ambiguity problem.
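The deduction of the candidate tags of an unregistered word can be
sketched directly from the two argmax expressions. The contextual
probability table below is invented, and the function names are
illustrative assumptions; the tag set is passed as a list so that
ties are broken by list order.

```python
# Hypothetical contextual probabilities, keyed as (prev, cur) -> P(cur | prev).
CONTEXT_PROB = {
    ("u", "n"): 0.5, ("u", "v"): 0.2,
    ("n", "w"): 0.3, ("v", "w"): 0.7, ("a", "w"): 0.1,
}

def contextual(cur, prev, table=CONTEXT_PROB):
    """Return P(cur | prev), or 0.0 for unseen tag pairs."""
    return table.get((prev, cur), 0.0)

def deduce_unregistered_tags(left_tag, right_tag, tagset):
    """Return the candidate tag set for an unregistered word."""
    # Tag that best follows the left neighbour: argmax P(Ci | Cl).
    c1 = max(tagset, key=lambda c: contextual(c, left_tag))
    # Tag that best precedes the right neighbour: argmax P(Cr | Cj).
    c2 = max(tagset, key=lambda c: contextual(right_tag, c))
    return {c1, c2}

print(deduce_unregistered_tags("u", "w", ["n", "v", "a"]))
```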
For a span of ambiguous words (bounded by
unambiguous words), if we arrange the different tags of
every word vertically and the different words horizontally,
we form a directed graph whose nodes are labelled with
$P(W_i \mid C_i)$ and whose arcs are labelled with
$P(C_i \mid C_j)$. Using the VOLSUNGA algorithm ([9]) to find
the best path in the directed graph, we complete the
automatic category disambiguation.
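The best-path search can be illustrated with a Viterbi-style dynamic
programming sketch. It is written in the spirit of VOLSUNGA but is not
a reproduction of that algorithm, and the probability tables and the
v-q example span are invented for illustration. Plain products are
used for readability; for long spans, sums of log probabilities would
avoid numerical underflow.

```python
# Hypothetical probabilities for a short span with one v-q ambiguity.
CONTEXT_PROB = {("r", "v"): 0.6, ("r", "q"): 0.05,
                ("v", "n"): 0.4, ("q", "n"): 0.7}   # (prev, cur) -> P(cur | prev)
LEXICAL_PROB = {("ta", "r"): 0.2, ("da", "v"): 0.1,
                ("da", "q"): 0.1, ("qiu", "n"): 0.05}  # (word, tag) -> P(word | tag)

def contextual(cur, prev):
    return CONTEXT_PROB.get((prev, cur), 0.0)

def lexical(word, tag):
    return LEXICAL_PROB.get((word, tag), 0.0)

def best_tag_path(words, candidates):
    """words: the span; candidates[i]: possible tags of words[i]."""
    # best[tag] = (score of the best path ending in tag, that path)
    best = {t: (lexical(words[0], t), [t]) for t in candidates[0]}
    for word, tags in zip(words[1:], candidates[1:]):
        new_best = {}
        for t in tags:
            # Extend every surviving path by t and keep only the best one.
            new_best[t] = max(
                ((score * contextual(t, prev) * lexical(word, t), path + [t])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

print(best_tag_path(["ta", "da", "qiu"], [["r"], ["v", "q"], ["n"]]))
```

Because only the best path into each tag is kept at every word, the
work grows linearly with the length of the span rather than with the
number of possible tag sequences.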
5. Experimental results and future work 
A segmentation and tagging system was built based on
the method described above. The programs of the system are
written in the C language. Using this system, a verb usage
corpus of about 400,000 Chinese characters, or 300,000
Chinese words, was segmented and tagged. The test results
are: segmentation accuracy --- 97.68%, tagging accuracy
--- 94.55%.
Some of the better results of previous segmentation
systems and tagging systems are: about 99%
segmentation accuracy on 150,000 Chinese characters
([5]) and 94.82% tagging accuracy in a closed test on
150,000 words of tagged corpus ([12]). Compared with
these systems, the result of our system is promising.
In our future research, we will try to further
improve our method and add some new functions
to our segmentation and tagging system, such as
unregistered word deduction during segmentation, identity
management in the knowledge base, and analysis of the degree
of belief in the tagging results. Then we will extend the scale
of our corpus to about five million words.
In addition, we will pursue research on Chinese phrase
structure analysis and try to tag phrase categories in the corpus.
We hope the work will be helpful for the study of
Mandarin Chinese grammar.
References 
[1]. Liang N.Y., (1987). An Automatic Word
Segmentation System of Written Chinese---CDWS.
Journal of Chinese Information Processing (JCIP), Vol 2.
[2]. Li G.C., Liu K.Y. & Zhang Y.K., (1988).
Segmenting Chinese Word and Processing Different
Meaning Structures. JCIP, Vol 3.
[3]. Huang Y.X., (1989). A 'Produce-Test' Approach to
Automatic Segmentation of Written Chinese. JCIP, Vol 4.
[4]. Yao T.S., Zhang G.P. & Wu Y., (1990). A
Rule-Based Chinese Word Segmentation System. JCIP,
Vol 1.
[5]. He K.K., Xu H. & Sun B., (1991). The Implement
of Automatic Segmentation Expert System of Written
Chinese. JCIP, Vol 3.
[6]. Li B.Y. et al., (1992). A MM Automatic
Segmentation Algorithm using Corpus tag to
Disambiguation. Proc. of ROCLING IV, P14%165.
[7]. Zhang J.S., Chen Z.D. & Shen S.D., (1992). A
Method of Word Identification for Chinese by Constraint
Satisfaction and Statistical Optimization Techniques.
Proc. of ROCLING IV, P147-165.
[8]. Sun M.S., Lai T.B.Y., Lun S.C. & Sun C.F., (1992).
Some Issues on the Statistical Approach to Chinese Word
Identification. ICCIP92, Vol 1, P246-253.
[9]. Steven J. DeRose, (1989). Grammatical Category
Disambiguation by Statistical Optimization. Computational
Linguistics, Vol 14, P31-39.
[10]. Liu K.Y., Zhen J.H. & Zhao J., (1992). A Research
on Several Algorithms for the Assignment Parts of Speech
to Words in Corpus. Advances on Research of Machine
Translation, P378-386.
[11]. Bai S.H., Xia Y. & Huang C.N., (1992). The
Methodic Research of Grammatical Tagging Chinese
Corpus. Advances on Research of Machine Translation,
P408-418.
[12]. Bai Shuanhu & Xia Ying, (1991). A Scheme for
Tagging Chinese Running Text. NLPRS'91, P345-349.
[13]. Yu S.W., Zhu X.F. & Guo L., (1992). Outline of the
Grammar Knowledge Base for Chinese Words and its
Developing Approaches. ICCIP92, P186-191.
[14]. Zhu Dexi, (1979). Yufa Jiangyi (Lectures on
Grammar). Business Press.
