Example-Based Sense Tagging of Running Chinese Text 
Xiang Tong 
Chang-ning Huang 
Cheng-ming Guo 
Computer Science Department 
Tsinghua University 
Beijing, CHINA 
Phone: +86-1-2594895 
Fax: +86-1-2562768 
ABSTRACT
This paper describes a technique for the automatic sense tagging of
running Chinese text. The system takes running Chinese text as input and
outputs sense-disambiguated text. Whereas previous work (Yarowsky, 1992;
Gale, et al., 1992, 1993) relies heavily on statistics, the present system
makes use of Machine Readable/Tractable Dictionaries (Wilks, et al., 1990;
Guo, in press) and an example-based reasoning technique (Nagao, 1984;
Sumita, et al., 1990) to treat novel words, compound words, and phrases
found in the input text.
Key words: sense tagging 
1. Introduction 
If the 1980s were characterized by a surge of effort on Machine
Readable/Tractable Dictionary (MRD/MTD) research, the 1990s will be a time
of massive effort on constructing annotated text corpora. Properly annotated
text corpora could form, at least, the basis for the following:
a. the core of commercial information systems; 
b. the kernel engine of 'Cognitive Agents';
c. the essentials of systems vital to national security. 
Sense tagging of large text corpora has been on the back-burner for too 
long. The preparation of large annotated text corpora, especially those with
word senses disambiguated, has always been brushed aside in favor of some
piteous 'smart' approaches. However, it is just this kind of hopeless
cleverness that has handicapped the speedy growth of the language
enterprise. Fortunately, more and more researchers have come to realize the
importance, as well as the necessity, of being earnest in annotating large
text corpora of all major languages.
This paper presents a system for the automatic sense tagging of running
Chinese text, a necessary mechanism for the construction of annotated
'Monitor Corpora' (Sinclair, 1991) that do not degrade over time. The
system takes running Chinese text as input and outputs sense-disambiguated
text. Whereas previous work (Yarowsky, 1992; Gale, et al., 1992, 1993)
relies heavily on statistics, the present system makes use of Machine
Readable/Tractable Dictionaries (Wilks, et al., 1990; Guo, in press) and an
example-based reasoning technique (Nagao, 1984; Sumita, et al., 1990) to
treat novel words, compound words, and phrases found in the input text. The
focus of this discussion is on the example-based reasoning technique. The
examples that support the tagging operation come from the system MTD.
The sense tagging system assigns a sense number to every Chinese character
occurring in the text. In most cases, the senses tagged are word senses,
because most Chinese characters are words. For example, '打' (beat) has 26
senses and '鼓' (drum) has 6. The phrase '打鼓' (beat drums) becomes
'打_B02 鼓_A01' after sense tagging. However, not all Chinese characters
are words; sometimes they are bound morphemes. In these cases, the senses
tagged are the meanings of the morphemes as given in the dictionary. For
example, a character used as a prefix, i.e., a bound morpheme, is tagged
'A01', which is the number given in the MTD for that character when it is
used as a prefix [the Chinese example is illegible in this copy].
2. Overview of the Sense-Tagging System
The sense-tagger under discussion represents partial results of some three 
years of continued efforts on the part of Tsinghua University, Beijing, China to 
build systems for the processing of general, unrestricted running Chinese texts. 
The system was implemented in C and currently runs on a Sun workstation
at the National AI Laboratory of the University.
2.1. Resources
The sense-tagging module uses two MRDs and one MTD. The first MRD,
referred to here as MRD-1, is a Chinese dictionary (Fu, 1987) [Chinese
title illegible in this copy]. It contains about 6,000 one-syllable words,
e.g., '打' (beat) and '鼓' (drum), and 43,000 compound words and phrases,
e.g., '打鼓' (beating drums). Each word has one or more word senses. For
example, '打' (beat) has 26 senses and '鼓' (drum) has 6. In a sense tag,
the capital letter indicates the homograph, and the Arabic number indicates
the sense number under that homograph. The entry for the word '打' (beat)
is as follows:
[Dictionary entry for '打': one sense under homograph A (打_A01) and senses
numbered up to 打_B25 under homograph B. The Chinese definitions are
illegible in this copy.]
The second MRD, referred to here as MRD-2, is the Chinese thesaurus
'同义词词林' (Mei, 1983), with about 70,000 entries. It has a three-level
categorization system. At Level 1, the dictionary has 12 major categories.
At Level 2, the 12 major categories split into 94 subcategories. At the
lowest level, Level 3, the dictionary has altogether 1,428 subcategories.
Under the current numbering system, the capital letter indicates the major
category, the lower-case letter the subcategory, and the Arabic number the
position under the two superordinate categories. For example, 'Bp13' is one
of the categories that the word '鼓' (drum) falls into: B is a first-level
category, p is a second-level subcategory, and 13 is the number of the
subcategory under Bp. A partial list of words and their category numbers is
given as follows:
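[Table of words and their MRD-2 category numbers. The Chinese headwords are
illegible in this copy. The category numbers listed are: Bp13, Fa01, Fa30,
Ic05, Hc26, Be04, Bh05, Bh07, Fa23, Hd16, Dm05, Bo13, Bp10, Id03, Ie14.]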
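
The three-level numbering described above can be sketched in Python as
follows. This is only an illustration, not part of the system: the names
Category and parse_category are hypothetical, and the sketch assumes that
every category number consists of exactly one capital letter, one lower-case
letter and a two-digit number, as in 'Bp13'.

```python
import re
from typing import NamedTuple


class Category(NamedTuple):
    major: str   # Level 1: capital letter, e.g. "B"
    sub: str     # Level 2: lower-case letter, e.g. "p"
    number: int  # Level 3: number under the two levels above, e.g. 13


# Assumed shape of a category number: capital, lower-case letter, two digits.
_CODE = re.compile(r"([A-Z])([a-z])(\d{2})")


def parse_category(code: str) -> Category:
    """Split a thesaurus category number such as 'Bp13' into its three levels."""
    match = _CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a three-level category number: {code!r}")
    major, sub, number = match.groups()
    return Category(major, sub, int(number))
```

Under these assumptions, parse_category("Bp13") yields the major category
"B", the subcategory "p" and the number 13.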
The MTD was constructed from MRD-1. It has 43,000 annotated compound
words and phrases. Phrases like '打鼓' (beating drums) are disambiguated in
the MTD, with word sense numbers tagged to both '打' (beat) and '鼓'
(drum), e.g., '打_B02 鼓_A01'. The numbers tagged are based on the
numbering system used in MRD-1. For compounds with a component whose
meaning is not related to the resultant compound, the Arabic number in that
component's tag is '00' (the source gives one example tagged '_A00 _A01'
and one tagged '_A00 _A00'; the characters are illegible in this copy). Much
of the work in constructing the MTD was done by machine, supplemented by
hand coding. The following is a partial list of the contents of the MTD:
打_B01 □_A03    打_B01 □_A02    □_B01 打_B01
打_B02 □_A02    打_B02 □_A01    打_B02 □_A01
打_B03 □_A01    打_B03 □_A01    打_B05 □_A02
(□ marks a character that is illegible in this copy.)
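
The tag format of an MTD entry can be illustrated with a short Python
sketch. It is an illustration under stated assumptions, not the system's own
code (the system was written in C): the function names are hypothetical, an
entry is assumed to be a space-separated string of 'word_tag' tokens, and a
tag is assumed to be one capital letter followed by a two-digit sense number.

```python
from typing import List, Tuple


def parse_entry(entry: str) -> List[Tuple[str, str, int]]:
    """Turn '打_B02 鼓_A01' into [(word, homograph letter, sense number), ...]."""
    parts = []
    for token in entry.split():
        word, tag = token.rsplit("_", 1)
        parts.append((word, tag[0], int(tag[1:])))
    return parts


def unrelated_components(entry: str) -> List[str]:
    """Return components tagged '00', i.e. unrelated to the compound's meaning."""
    return [word for word, _, sense in parse_entry(entry) if sense == 0]
```

For the entry '打_B02 鼓_A01', parse_entry returns the two components with
homographs B and A and sense numbers 2 and 1, and unrelated_components
returns an empty list because neither component is tagged '00'.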
2.2. Three-step Sense-tagging Procedure 
Step 1: Segmenting the input text into words, compound words and phrases
The word segmentation module is a much simplified version of a more
complicated segmentation program developed at the Laboratory. It scans
forward through each sentence for the maximum match of character strings
recorded in the MTD. Most known phrases are tagged with the help of the MTD;
'打鼓' is an example. The operation involved is simple, i.e., 'match to
access': when an input segment matches an entry in the MTD, the tagged form
of the matched segment replaces the input segment in the sentence.
Step 2: Example-based sense tagging of one-syllable words
The system uses an example-based sense-tagging algorithm to disambiguate
one-syllable words, which are not listed in the system MTD. The details of
the algorithm are described in Section 3.
Step 3: Default sense tagging of one-syllable words left untagged by Step 2
A default sense number is assigned to every one-syllable word left untagged
by Step 2. The default sense numbers are determined on the basis of
frequency-of-occurrence data.
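
Steps 1 and 3 above can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the system's implementation: the
function names, the dictionary layout of the MTD (untagged string mapped to
tagged form) and the default sense 'A01' for '他' are all hypothetical. Step 2
is omitted here because it is the subject of Section 3.

```python
from typing import Dict, List


def segment_and_match(sentence: str, mtd: Dict[str, str]) -> List[str]:
    """Step 1: forward maximum matching; matched segments become tagged forms."""
    longest = max([1] + [len(key) for key in mtd])
    segments = []
    i = 0
    while i < len(sentence):
        size = min(longest, len(sentence) - i)
        # Shrink the window until it matches an MTD entry or is one character.
        while size > 1 and sentence[i:i + size] not in mtd:
            size -= 1
        piece = sentence[i:i + size]
        segments.append(mtd.get(piece, piece))
        i += size
    return segments


def apply_defaults(segments: List[str], default_sense: Dict[str, str]) -> List[str]:
    """Step 3: give still-untagged segments their default sense number."""
    tagged = []
    for seg in segments:
        if "_" not in seg and seg in default_sense:
            seg = f"{seg}_{default_sense[seg]}"
        tagged.append(seg)
    return tagged


mtd = {"打鼓": "打_B02 鼓_A01"}   # entry taken from the text
default_sense = {"他": "A01"}     # hypothetical default, for illustration only
result = apply_defaults(segment_and_match("他打鼓", mtd), default_sense)
```

With this toy data, the sentence '他打鼓' is segmented into '他' and '打鼓',
the second segment is replaced by its tagged form from the MTD, and the first
receives the assumed default tag.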
3. Example-Based Sense-Tagging 
Chinese words combine to form compound words. In 94.7% of cases, the
meaning of the resultant compound is related to the meanings of its
component words (Zhang, 1986, p. 87). The compound words and phrases in the
MTD therefore contain implicit syntactic information that supports
example-based reasoning about the senses of Chinese words in context.
For example, suppose that '打 锣鼓' (beat gongs and drums) is in the input
text and the sense of '打' (beat) cannot be determined. In order to
disambiguate '打' (beat), the system looks through the MTD for every
compound word and phrase beginning with '打' (beat) and decides that the
phrase '打_B02 鼓_A01' (beat drums) is an appropriate example for reasoning
about the word '打' (beat) as found in '打 锣鼓' (beat gongs and drums),
since '鼓' (drums) and '锣鼓' (gongs and drums) are in the same lowest
category, 'Bp13', in MRD-2. The system then assigns the tag 'B02', which
belongs to '打' (beat) in '打_B02 鼓_A01' (beat drums), to '打' (beat) in
'打 锣鼓' (beat gongs and drums).
Formally, let $S_1 S_2 \cdots S_n$ represent the input segments from 1 to
$n$, let $W$ represent an untagged segment, and let the immediate context of
$W$ be represented by
$L_{\mathit{range}} \cdots L_2 L_1\ W\ R_1 R_2 \cdots R_{\mathit{range}}$,
where $L$ stands for 'Left', $R$ stands for 'Right', and $\mathit{range}$
equals 5. We have the following:

\[ S_1\ S_2\ \cdots\ S_n \tag{a} \]

where $S_k$ ($k = 1, \ldots, n$) is a word, compound word or phrase, and

\[ L_{\mathit{range}}\ \cdots\ L_2\ L_1\ \ W\ \ R_1\ R_2\ \cdots\ R_{\mathit{range}} \tag{b} \]

where $L_i$, $R_i$ ($i = 1, \ldots, \mathit{range}$) is a word, compound
word or phrase.
In the forward reasoning process, assume that $(W\ R_i)$ is a possible
compound word or phrase. For every entry in the MTD that begins with $W$ and
has the form $(W\_\mathrm{tag}\ \ \mathrm{Item})$, the system computes the
relatedness of the two words or phrases $(W\ R_i)$ and
$(W\_\mathrm{tag}\ \ \mathrm{Item})$, where 'Item' may be an annotated word,
compound word, phrase, or just a meaningless Chinese character string. The
concept distance between $R_i$ and Item is computed to determine the
relatedness of the two compound words/phrases. Hence,
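\[
\mathrm{Concept\_Distance}(R_i, \mathrm{Item}) =
\begin{cases}
0 & \text{if } R_i \text{ and Item are in the same lowest category in MRD-2} \\
1 & \text{if } R_i \text{ and Item are in adjacent categories in MRD-2} \\
100 & \text{in all other cases}
\end{cases}
\]

\[
\mathrm{Relatedness}\big((W\ R_i),\ (W\_\mathrm{tag}\ \ \mathrm{Item})\big) =
\begin{cases}
2 & \text{if } \mathrm{Concept\_Distance}(R_i, \mathrm{Item}) = 0 \\
1 & \text{if } \mathrm{Concept\_Distance}(R_i, \mathrm{Item}) = 1 \\
0 & \text{if } \mathrm{Concept\_Distance}(R_i, \mathrm{Item}) = 100
\end{cases}
\]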
For every pair of $(W\ R_i)$ ($i = 1, \ldots, \mathit{range}$) and
$(W\_\mathrm{tag}\ \ \mathrm{Item})$ in the MTD, the pair with the greatest
non-zero relatedness measure is selected, and the $W$ in (b) above is
replaced by the $W\_\mathrm{tag}$ of the selected pair.
The reasoning process works similarly in both directions from $W$, i.e.,
forward to $R_{\mathit{range}}$ and backward to $L_{\mathit{range}}$. When
the process proceeds forward, the system looks for entries beginning with
$W$. When the process works backward, to the left of $W$, the system looks
for annotated entries in the MTD ending with $W$.
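
The reasoning process described in this section can be sketched in Python as
follows. The sketch is hedged in several ways: all names are hypothetical,
MTD entries are assumed to have exactly two components, two categories are
assumed to be 'adjacent' when they share the same two letters and their
numbers differ by 1, and when several pairs tie for the greatest relatedness
the first one found is kept. None of these details is stated in the text.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed entry layout: (first word, its tag, second word, its tag).
Entry = Tuple[str, str, str, str]
RELATEDNESS = {0: 2, 1: 1, 100: 0}


def concept_distance(a: str, b: str, thesaurus: Dict[str, Set[str]]) -> int:
    """0 for a shared lowest category, 1 for adjacent categories, else 100."""
    best = 100
    for code_a in thesaurus.get(a, ()):
        for code_b in thesaurus.get(b, ()):
            if code_a == code_b:
                return 0
            same_parent = code_a[:2] == code_b[:2]
            if same_parent and abs(int(code_a[2:]) - int(code_b[2:])) == 1:
                best = 1
    return best


def tag_word(w: str, left: List[str], right: List[str], mtd: List[Entry],
             thesaurus: Dict[str, Set[str]], span: int = 5) -> Optional[str]:
    """Return the tag of w from the most related MTD example, or None."""
    best_tag, best_score = None, 0
    for first, first_tag, second, second_tag in mtd:
        if first == w:   # forward: entries beginning with w, context R_1..R_span
            for r in right[:span]:
                score = RELATEDNESS[concept_distance(r, second, thesaurus)]
                if score > best_score:
                    best_tag, best_score = first_tag, score
        if second == w:  # backward: entries ending with w, context L_1..L_span
            for l in left[-span:]:
                score = RELATEDNESS[concept_distance(l, first, thesaurus)]
                if score > best_score:
                    best_tag, best_score = second_tag, score
    return best_tag


mtd = [("打", "B02", "鼓", "A01")]
thesaurus = {"鼓": {"Bp13"}, "锣鼓": {"Bp13"}}
tag = tag_word("打", [], ["锣鼓"], mtd, thesaurus)
```

With the toy data above, taken from the '打 锣鼓' example, '鼓' and '锣鼓'
share the category 'Bp13', so the concept distance is 0, the relatedness is
2, and '打' receives the tag 'B02'. A word with no related example is left
untagged (None) for the default tagging step.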
Examples are given as follows:
(1) [Segmented Chinese example sentence, with the word under discussion
marked by asterisks; illegible in this copy.]
The word '新' (new) has six senses. An annotated phrase in which '新' is
tagged 'A01' is found in the MTD. The system calculates the conceptual
distance between a context word in the sentence and the other component of
the annotated phrase, among others [both words are illegible in this copy].
Since the two are found to be in the same lowest subcategory, 'Dd06', the
conceptual distance between them is 0. The system then assigns the tag
'A01', which belongs to '新' in the annotated phrase, to '新' in the above
sentence.
The word '受' (receive, suffer) has six senses. An annotated phrase in
which '受' is tagged 'A02' is found in the MTD. The system calculates the
conceptual distance between a context word in the sentence and the other
component of the annotated phrase, among others [both words are illegible in
this copy]. Since the two are found to be in adjacent lowest subcategories,
i.e., 'Hc18' and 'Hc19' respectively, the conceptual distance between them
is 1. The system then assigns the tag 'A02', which belongs to '受' in the
annotated phrase, to '受' in the above sentence.
(3) [Segmented Chinese example sentence, with the word under discussion
marked by asterisks; illegible in this copy.]
The word '权' (right, power) has seven senses. An annotated phrase in
which '权' is tagged 'A01' is found in the MTD. The system calculates the
conceptual distance between a context word in the sentence and the other
component of the annotated phrase, among others [both words are illegible in
this copy]. Since the two are found to be in the same lowest subcategory,
'Dj03', the conceptual distance between them is 0. The system then assigns
the tag 'A01', which belongs to '权' in the annotated phrase, to '权' in the
above sentence.
A second word [illegible in this copy] (each other) has four senses. An
annotated phrase in which this word is tagged 'A01' is found in the MTD. The
system calculates the conceptual distance between a context word in the
sentence and the other component of the annotated phrase, among others.
Since the two are found to be in the same lowest subcategory, 'Jc01', the
conceptual distance between them is 0. The system then assigns the tag
'A01', which belongs to the word in the annotated phrase, to the same word
in the above sentence.
4. Evaluation 
The input Chinese texts that the present system works on are news release 
texts from the official Chinese Xinhua News Agency. No preprocessing of these 
news release texts is required. 
The performance of the present sense tagger is encouraging. The hit rate of
correct sense tagging can run as high as 95%. The lowest hit rate ever
recorded was 70%. The appendix gives a sample text output by our system; the
hit rate of correct sense tagging for this sample is 93.79%. Essentially,
the hit rate of correct sense tagging performed by the system is a function
of the coverage of the system MTD and MRDs.
5. Limitations and Future Work 
a. The system makes errors when the segmentation of the input text is less
than correct. The performance of the current sense tagger could be improved
if a more sophisticated segmentation method were adopted.
b. Although the reasoning process takes advantage of collocational
information within the phrase of which the untagged segment is a part, there
is no guarantee that the phrase does not have multiple meanings. When such
cases occur, the result of the reasoning is subject to chance.
c. The example-based sense tagging method works quite well with content
words, but for function words it often makes faulty guesses. This is partly
because function words are less sensitive to context. The current system
assigns a default sense number to most function words. However, for words
that can be both a function word and a content word, the system often makes
errors. Such errors decrease when the system preprocesses the input text
with a stochastic Chinese grammatical tagger like the one developed at
Tsinghua University (Bai, et al., 1992).
6. Conclusion
In this paper we presented a relatively simple but effective method for the
sense tagging of running Chinese text. The system takes advantage of the
collocational information within the annotated compound words and phrases in
the system MTD. Considering that annotated Chinese texts constitute very
useful resources for Chinese language processing, especially for generating
frequency of occurrence/co-occurrence data, general and special purpose
concordances, and the data for deriving a natural set of semantic primitives
for the Chinese language, the current sense-tagging system looks promising.
Room for progress is to be found in further improvement of the system
resources and refinement of the reasoning algorithm.

References 
Bai, S-H., Xia, Y., Huang, C-L. (1992) Research on Chinese grammatical
tagging method for Chinese corpus. In: Chen, Z-X. (Ed.), Development in
machine translation. Dianzi Gongye Publishing House: Beijing. pp. 408-418.
Fu, X-L. (1987) [Chinese title illegible in this copy]. Waiyu Jiaoxue Yu
Yanjiu Publishing House: Beijing.
Gale, W. A., Church, K. W., and Yarowsky, D. (1992) Work on statistical
methods for word sense disambiguation. Working Notes for AAAI Fall
Symposium on Probabilistic Approaches to Natural Language. pp. 54-60.
Gale, W. A., Church, K. W., and Yarowsky, D. (1993) A method for
disambiguating word senses in a large corpus. To appear in: Computers and
the Humanities.
Guo, C-M. (in press) Machine Tractable Dictionaries: Design and
Construction. Ablex: Norwood, NJ.
Mei, J-Zh. (1983) '同义词词林'. Shanghai Cishu Publishing House: Shanghai.
Nagao, M. (1984) A framework of a mechanical translation between Japanese
and English by analogy principle. In: A. Elithorn, R. Banerji (Eds.),
Artificial and Human Intelligence. Elsevier: Amsterdam.
Sinclair, J. (1991) Monitor corpora. Corpus, Concordance, Collocation.
Oxford University Press. pp. 24-26.
Sumita, E., Iida, H. and Kohyama, H. (1990) Translating with examples: a
new approach to machine translation. The Third International Conference
on Theoretical and Methodological Issues in Machine Translation of Natural
Language. Austin, Texas.
Wilks, Y., Fass, D., Guo, C-M., McDonald, J., Plate, T., and Slator, B. M.
(1990) Providing machine tractable dictionary tools. Journal of Machine
Translation. 5, 2, pp. 99-151.
Yarowsky, D. (1992) Word-sense disambiguation using statistical models of
Roget's categories trained on large corpora. COLING-92.
Zhang, W. (1986) Character Meanings and Word Meanings. China Wu Zi
Publishing House: Beijing.
