The Research of Word Sense Disambiguation Method Based on 
Co-occurrence Frequency of Hownet* 
Erhong Yang, Guoqing Zhang, and Yongkui Zhang 
Dept of Computer Science, Shanxi University, 
Taiyuan 030006, P. R. China 
Email: zyk@sxu.edu.cn 
Abstract 
Word sense disambiguation (WSD) is a difficult 
problem in natural language processing. In this 
paper, a WSD method based on sememe 
co-occurrence frequency is introduced. In this 
method, Hownet is used as the information 
source, and a co-occurrence frequency 
database of sememes is constructed and then 
used for WSD. In our experiment the method 
reached an accuracy of 75% in a closed test 
and 71% in an open test. 
Keywords 
word sense disambiguation, Hownet, sememe, 
co-occurrence 
1. Introduction 
Word sense disambiguation (WSD) is one of 
the most difficult problems in NLP. It is helpful, 
and in some instances required, for such 
applications as machine translation, information 
retrieval, content and thematic analysis, 
hypertext navigation and so on. The problem of 
WSD was first put forward in 1949. In the 
following decades researchers adopted 
many methods to solve the problem of 
automatic word sense disambiguation, 
including: 1) AI-based methods, 2) knowledge- 
based methods and 3) corpus-based methods [1]. 
Although some useful results have been obtained, 
the problem of word sense disambiguation is far 
from being solved. 
The difficulties of WSD are as follows: 1) 
Evaluation of word sense disambiguation 
systems is not yet standardized. 2) The potential 
for WSD varies by task. 3) Adequately large 
sense-tagged data sets are difficult to obtain. 4) 
The field has narrowed down approaches, but 
only a little [2]. 
In this paper, we use a statistics-based method 
to solve the problem of automatic word sense 
disambiguation [3]. In this method, a new 
knowledge base, Hownet [4,5], is used as the 
knowledge resource. Instead of words, the 
sememes defined in Hownet are used to collect 
the statistics. By doing this, the problem of data 
sparseness is solved to a large degree. 
2. A Brief Introduction Of Hownet 
Hownet is a knowledge base which was 
released recently on the Internet. In Hownet, the 
concepts represented by Chinese or 
English words are described, and the relations 
between concepts and the attributes of concepts 
are revealed. In this paper, we use the Chinese 
knowledge base, which is an important part of 
Hownet, as the resource for our disambiguation. 
The format of this file is as follows: 
W_X = word 
E_X = some examples of this word 
G_X = the part of speech of this word 
DEF = the definition of this word 
* This research project is supported by a grant from the Shanxi Natural Science Foundation of China. 
An important concept used in Hownet that 
we must introduce is the sememe. In Hownet, 
sememes are the basic units of sense. 
They are used to describe all the entries in 
Hownet, and there are more than 1,500 sememes 
altogether. 
3. Sense Co-occurrence Frequency 
Database 
It is well known that some words tend to 
co-occur more frequently with some words than 
with others [6]. Similarly, some word meanings 
tend to co-occur more often with some word 
meanings than with others. If we can measure 
the relations between word meanings 
quantitatively, it will help word sense 
disambiguation. In Hownet, all words are 
defined with a limited set of sememes, and the 
combination of sememes is fixed. If we collect 
statistics on the co-occurrence frequency of 
sememes so as to reflect the co-occurrence of 
words, the problem of data sparseness is 
solved to a large degree. Based on this 
idea, we built a sense co-occurrence 
frequency database to disambiguate word 
senses. 
3.1 The Preprocessing Of Hownet 
The Hownet we downloaded from the Internet is 
in the form of plain text. This is not convenient 
for a computer to use, so it must be converted 
into a database. In the database, each lexical 
entry is converted into a record. The formal 
description of the records is as follows: 
<lexical entry> ::= <NO.><morphology> 
<part-of-speech><definition> 
where NO. is the number of 
this lexical entry in Hownet. The definition 
is composed of several sememes (SU for short) 
separated by commas. In addition, we 
have deleted the English sememes in order to 
save space and speed up the processing. 
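Here are some examples after preprocessing: 
NO.     Morphology    Part-of-speech   Definition 
21424   (illegible)   ADJ              (illegible) 
18888   (illegible)   ADJ              (illegible) 
18889   (illegible)   V                (illegible) 
18887   (illegible)   V                (illegible) 
18890   (illegible)   N                (illegible) 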
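The conversion described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the program used in the paper: the key names NO., W_C and G_C, the blank line between entries, and the "english|chinese" pairing inside DEF are all assumed here, and the sample entries are invented. 

```python
from typing import NamedTuple, Tuple


class Record(NamedTuple):
    no: int                      # entry number (NO.)
    morphology: str              # the word itself
    pos: str                     # part of speech
    definition: Tuple[str, ...]  # sememes, English part removed


def parse_entries(text: str) -> list:
    """Convert plain-text entries into records.

    Assumed layout (hypothetical): one KEY=value pair per line,
    entries separated by a blank line.
    """
    records = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                fields[key.strip()] = value.strip()
        # Keep only the part after "|" (assumed to be the Chinese sememe).
        sememes = tuple(
            part.split("|")[-1].strip()
            for part in fields["DEF"].split(",")
            if part.strip()
        )
        records.append(
            Record(int(fields["NO."]), fields["W_C"], fields["G_C"], sememes)
        )
    return records


# Invented sample data, for illustration only.
SAMPLE = """NO.=1
W_C=医生
G_C=N
DEF=human|人,occupation|职位

NO.=2
W_C=打
G_C=V
DEF=beat|打
"""

records = parse_entries(SAMPLE)
```

Each record then carries exactly the four fields of the formal description above, with the definition already split into its sememes. 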
3.2 The Creation Of Sememe Co-occurrence 
Frequency Database 
The sememe co-occurrence frequency database 
is the basis of sense disambiguation. We 
now introduce it briefly. 
The sememe co-occurrence frequency 
database is a two-dimensional table. Each cell 
holds the co-occurrence frequency 
of a pair of sememes. 
Before introducing the sememe 
co-occurrence frequency database, we give the 
following definition: 
Definition: suppose word W has m sense 
items in Hownet, and the corresponding 
definitions of the sense items are 
$y_{11}, y_{12}, \ldots, y_{1(n_1)}$; 
$y_{21}, y_{22}, \ldots, y_{2(n_2)}$; ...; 
$y_{m1}, y_{m2}, \ldots, y_{m(n_m)}$ 
respectively. We call 
$\{y_{i1}, y_{i2}, \ldots, y_{i(n_i)}\}$ a 
sememe set of W (SS for short), and call 
$\{\{y_{11}, y_{12}, \ldots, y_{1(n_1)}\}, \{y_{21}, y_{22}, \ldots, y_{2(n_2)}\}, \ldots, \{y_{m1}, y_{m2}, \ldots, y_{m(n_m)}\}\}$ 
the sememe expansion of W (SE for short). 
For example, in the example above, the word 
"(illegible)" has only one sense 
item. The corresponding sememe set of this 
sense item is {(illegible)} and the 
sememe expansion of "(illegible)" is 
{{(illegible)}}. The word "(illegible)" has four 
sense items, and the corresponding sememe sets 
of these items are {(illegible)}, {(illegible)}, 
{(illegible)} and {(illegible)} respectively. The 
sememe expansion of the word "(illegible)" is 
{{(illegible)}, {(illegible)}, {(illegible)}, 
{(illegible)}}. 
When building the sememe co-occurrence 
frequency database, the corpus is segmented 
first and each word is tagged with its sememe 
expansion in Hownet. Then, for each unique 
pair of words co-occurring in a sentence (here a 
sentence is a string of characters delimited by 
punctuation), the co-occurrence data of the 
sememes that belong to the definitions of the 
two words are collected. When 
collecting co-occurrence data, we adopt the 
principle that every pair of words which 
co-occur in a sentence should make an equal 
contribution to the sememe co-occurrence data, 
regardless of the number of sense items of each 
word and the length of its definitions. Moreover, 
the contribution of a word should be evenly 
distributed among all the senses of the word, and 
the contribution of a sense should be evenly 
distributed among all the sememes in the sense. 
The algorithm is as follows: 
1. Initialize each cell in the sememe 
co-occurrence frequency database (SCFD for 
short) with 0. 
2. For each sentence S in the training corpus, do 
3-7. 
3. For each word in sentence S, tag it with its 
sememe expansion. 
4. For each unique pair of sememe 
expansions $(SE_i, SE_j)$, do 5-7. 
5. For each sememe $SU_{imp}$ in each sememe 
set $SS_{im}$ in $SE_i$, do 6-7. 
6. For each sememe $SU_{jnq}$ in each sememe 
set $SS_{jn}$ in $SE_j$, do 7. 
7. Increase the values of cells 
$SCFD(SU_{imp}, SU_{jnq})$ and 
$SCFD(SU_{jnq}, SU_{imp})$ by the product 
of $w(SU_{imp})$ and $w(SU_{jnq})$, where 
$w(SU_{xyz})$ is the weight of $SU_{xyz}$ given by 
$$w(SU_{xyz}) = \frac{1}{|SE_x| \times |SS_{xy}|}$$ 
It can be concluded from the above 
algorithm that the SCFD is symmetric. In 
order to save space and speed up the 
processing, we store only those cells 
$(SU_i, SU_j)$ that satisfy $SU_i \le SU_j$. 
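The seven steps and the half-table storage can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names and the toy lexicon are hypothetical, words missing from the lexicon are skipped, and repeated words in a sentence are counted once. Because only cells with $SU_i \le SU_j$ are kept, a diagonal cell (the same sememe on both sides) receives the increment twice, which mirrors step 7 updating both SCFD(a, b) and SCFD(b, a). 

```python
from collections import defaultdict
from itertools import combinations


def build_scfd(sentences, lexicon):
    """Build the sememe co-occurrence frequency table.

    sentences: list of sentences, each a list of segmented words.
    lexicon:   word -> sememe expansion, i.e. a list of sememe sets,
               one set per sense item (hypothetical input format).
    Returns a dict keyed by (su_a, su_b) with su_a <= su_b.
    """
    scfd = defaultdict(float)                      # step 1: cells start at 0
    for sentence in sentences:                     # step 2
        # Step 3: keep each distinct word that has a sememe expansion.
        words = sorted({w for w in sentence if w in lexicon})
        for wi, wj in combinations(words, 2):      # step 4: unique pairs
            se_i, se_j = lexicon[wi], lexicon[wj]
            for ss_i in se_i:                      # step 5
                w_i = 1.0 / (len(se_i) * len(ss_i))
                for su_i in ss_i:
                    for ss_j in se_j:              # step 6
                        w_j = 1.0 / (len(se_j) * len(ss_j))
                        for su_j in ss_j:
                            # Step 7, storing only the cell with a <= b.
                            if su_i == su_j:
                                scfd[(su_i, su_j)] += 2 * w_i * w_j
                            else:
                                key = (min(su_i, su_j), max(su_i, su_j))
                                scfd[key] += w_i * w_j
    return scfd


def cooccurrence(scfd, su_a, su_b):
    """Look up f(su_a, su_b) in the half table; absent cells are 0."""
    return scfd.get((min(su_a, su_b), max(su_a, su_b)), 0.0)


# Toy lexicon with placeholder sememes, for illustration only.
toy_lexicon = {"a": [{"x", "y"}], "b": [{"z"}, {"x"}]}
toy_scfd = build_scfd([["a", "b"]], toy_lexicon)
```

In the toy example every sememe has weight 1/2, so each off-diagonal cell receives 1/4, and the whole symmetric table grows by 2 for the single word pair, in line with the equal-contribution principle. 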
3.3 The Sememe Co-occurrence Frequency 
Database Based Disambiguation Method 
3.3.1 The Sememe Co-occurrence Frequency 
Based Scoring Method 
When disambiguating a polysemous word, we 
use the following equation to score a 
sense item of the polysemous word against the 
context containing this polysemous word. The 
context of the word is the sentence containing 
this word. 
$$\mathrm{score}(S, C) = \mathrm{score}(SS, C') - \mathrm{score}(SS, GlobalSS) \tag{1}$$ 
where S is a sense item of the polysemous 
word W, C is the context containing W, SS is 
the sememe set corresponding to S, C' is the set 
of sememe expansions of the words in C, and 
GlobalSS is the sememe set containing all 
of the sememes defined in Hownet. 
$$\mathrm{score}(SS, C') = \sum_{SE' \in C'} \mathrm{score}(SS, SE') \,/\, |C'| \tag{2}$$ 
for any sememe set SS and sememe 
expansion set C'. 
$$\mathrm{score}(SS, SE') = \max_{SS' \in SE'} \mathrm{score}(SS, SS') \tag{3}$$ 
for any sememe set SS and sememe 
expansion SE'. 
$$\mathrm{score}(SS, SS') = \sum_{SU' \in SS'} \mathrm{score}(SS, SU') \,/\, |SS'| \tag{4}$$ 
for any sememe sets SS and SS'. 
$$\mathrm{score}(SS, SU') = \sum_{SU \in SS} \mathrm{score}(SU, SU') \,/\, |SS| \tag{5}$$ 
for any sememe set SS and sememe SU'. 
$$\mathrm{score}(SU, SU') = I(SU, SU') \tag{6}$$ 
for any sememes SU and SU'. 
$$I(SU, SU') = \log_2 \frac{f(SU, SU') \cdot N^2}{g(SU) \cdot g(SU')} \tag{7}$$ 
where f(SU, SU') is the co-occurrence 
frequency of the sememe pair (SU, 
SU') in the SCFD. For g(SU) and N, we have 
the following equations: 
$$g(SU) = \sum_{SU'} f(SU, SU') \tag{8}$$ 
$$N = \frac{1}{2} \sum_{SU} \sum_{SU'} f(SU, SU') \tag{9}$$ 
In equation (7), the mutual-information-like 
measure deviates from the standard 
mutual-information measure by an extra 
multiplicative factor N. This is because the 
corpus is not large enough: the 
mutual information of some sememe pairs 
would be negative if it were not scaled by the 
extra multiplicative factor N. In equation (9), 
the sum of f(SU, SU') is divided by 2 
because each pair of sememes is counted twice 
in the sum of f(SU, SU') over all SU and SU', 
once as (SU, SU') and once as (SU', SU). 
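Equations (7) to (9) can be made concrete with a short Python sketch. It is an illustration, not the paper's code: the function name and the toy frequencies are invented, the table is assumed to be stored in full symmetric form (both (a, b) and (b, a) present), and equation (7) is read with the factor $N^2$, that is, the standard measure times the extra factor N. 

```python
import math
from collections import defaultdict


def mutual_information_table(f):
    """Compute I(SU, SU') for every pair with non-zero frequency.

    f: dict mapping ordered sememe pairs (su, su2) to their
       co-occurrence frequency, assumed symmetric.
    """
    g = defaultdict(float)
    for (su, _), freq in f.items():
        g[su] += freq                          # equation (8)
    n = sum(f.values()) / 2.0                  # equation (9)
    return {
        (su, su2): math.log2(freq * n * n / (g[su] * g[su2]))  # equation (7)
        for (su, su2), freq in f.items()
        if freq > 0
    }


# Invented toy frequencies over placeholder sememes x, y, z.
toy_f = {
    ("x", "z"): 1.0, ("z", "x"): 1.0,
    ("x", "y"): 1.0, ("y", "x"): 1.0,
    ("y", "z"): 1.0, ("z", "y"): 1.0,
    ("x", "x"): 2.0,
}
toy_mi = mutual_information_table(toy_f)
```

For the toy table, g(x) = 4, g(y) = g(z) = 2 and N = 4, so I(x, z) = log2(1 · 16 / (4 · 2)) = 1 and I(y, z) = log2(1 · 16 / (2 · 2)) = 2. 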
When disambiguating, we tag the polysemous 
word W with the sense item T that satisfies the 
following equation: 
$$T = \arg\max_{S} \mathrm{score}(S, C) \tag{10}$$ 
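The scoring chain of equations (1) to (6) and the selection rule (10) can be sketched as below. This is a sketch under assumptions, not the authors' implementation: the function names and the toy data are hypothetical, sememe pairs absent from the mutual information table are scored as 0 (the paper does not say how they are handled), the context is taken to hold the sememe expansions of the other words in the sentence, and equation (4) is read as an average over the sememes of SS'. 

```python
def score_set_sememe(ss, su2, mi):
    """Equation (5): average of I(SU, SU') over SU in SS."""
    return sum(mi.get((su, su2), 0.0) for su in ss) / len(ss)


def score_set_set(ss, ss2, mi):
    """Equation (4): average of score(SS, SU') over SU' in SS'."""
    return sum(score_set_sememe(ss, su2, mi) for su2 in ss2) / len(ss2)


def score_set_expansion(ss, se2, mi):
    """Equation (3): best-matching sememe set of the other word."""
    return max(score_set_set(ss, ss2, mi) for ss2 in se2)


def score_set_context(ss, context, mi):
    """Equation (2): average over the sememe expansions in C'."""
    return sum(score_set_expansion(ss, se2, mi) for se2 in context) / len(context)


def disambiguate(senses, context, global_ss, mi):
    """Equations (1) and (10): index of the best-scoring sense item."""
    def score(ss):
        return score_set_context(ss, context, mi) - score_set_set(ss, global_ss, mi)
    return max(range(len(senses)), key=lambda k: score(senses[k]))


# Invented toy data with English placeholder sememes.
toy_mi = {
    ("money", "bank"): 5.0, ("bank", "money"): 5.0,
    ("river", "water"): 4.0, ("water", "river"): 4.0,
}
toy_senses = [{"money"}, {"river"}]        # two sense items of W
toy_context = [[{"water"}]]                # one context word, one sense
toy_global = {"money", "bank", "river", "water"}
best = disambiguate(toy_senses, toy_context, toy_global, toy_mi)
```

In the toy case the first sense scores 0 - 5/4 = -1.25 and the second scores 4 - 4/4 = 3, so the second sense item (index 1) is selected. Subtracting score(SS, GlobalSS) removes the part of the score that a sense would receive from any context at all. 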
3.3.2 The Creation Of Mutual Information 
Database 
We have created a mutual information database 
according to (7), (8) and (9). Here are some 
examples. 
The examples in Table 1 have high mutual 
information, and the sememe pairs in this table 
have certain semantic relations. The 
examples in Table 2 have low mutual 
information, and the sememe pairs in this table 
have no obvious semantic relations. 
Table 1: Examples of sememe pairs which have high mutual information 
Sememe 1      Sememe 2      Mutual information 
(illegible)   (illegible)   33.811057 
(illegible)   (illegible)   29.441937 
(illegible)   (illegible)   28.024560 
(illegible)   (illegible)   28.023521 
(illegible)   (illegible)   27.571478 
(illegible)   (illegible)   27.418417 
(illegible)   (illegible)   27.234630 
(illegible)   (illegible)   27.093292 
(illegible)   (illegible)   26.984521 
(illegible)   (illegible)   26.710478 
Table 2: Examples of sememe pairs which have low mutual information 
Sememe 1      Sememe 2      Mutual information 
(illegible)   (illegible)   8.693242 
(illegible)   (illegible)   8.754611 
(illegible)   (illegible)   8.793914 
(illegible)   (illegible)   9.121846 
(illegible)   (illegible)   9.150412 
(illegible)   (illegible)   9.171023 
(illegible)   (illegible)   9.357734 
(illegible)   (illegible)   9.448947 
(illegible)   (illegible)   9.528801 
(illegible)   (illegible)   9.599495 
It can be concluded from Table 1 and Table 2 that the mutual information reflects 
the tightness of semantic relations. 
4. Experiment And Analysis 
We ran the experiment on a corpus of 10,000 
characters from People's Daily. 
First, the corpus was segmented, and then 
the sememe co-occurrence frequency database 
and the mutual information database were 
created. The mutual information database 
contains 709,496 data items corresponding to 
different sememe pairs. In order to speed up the 
processing, the mutual information database 
was sorted and indexed according to the first 
two bytes of each sememe pair. Finally, a 
disambiguation experiment on some 
polysemous words was carried out. Here are two 
examples: 
Example 1: (Chinese sentence illegible in source) 
Example 2: (Chinese sentence illegible in source) 
We use the following equation to assess the 
accuracy ratio of disambiguation: 
$$\text{accuracy ratio} = \frac{\text{the number of correctly tagged examples}}{\text{the total number of examples in the testing set}} \tag{11}$$ 
The experimental result is shown in Table 4. 
Table 3: Two examples of disambiguation using the sememe co-occurrence frequency database 
Definition of the     Score of sense item and     Score of sense item and 
word (illegible)      context in example 1        context in example 2 
(illegible)           14.459068                   8.659968 
(illegible)           9.817648                    10.817648 
(illegible)           7.415986                    12.415986 
(illegible)           -0.134779                   -0.134779 
(illegible)           -0.818518                   -0.818518 
(illegible)           14.459068                   12.415986 
Table 4: The experimental result 
             Total number of       Number of correctly    Accuracy 
             testing examples      tagged examples        ratio 
Closed test  100                   75                     75% 
Open test    100                   71                     71% 
The disambiguation method introduced 
above has the following characteristics: 
(1) The problem of data sparseness is 
solved to a large degree. 
(2) The method avoids 
the laborious hand tagging of a training corpus. 
(3) The method can easily be applied 
to other kinds of corpora. 

References
[1]. Nancy Ide, Jean Veronis, Introduction to 
the Special Issue on Word Sense 
Disambiguation: The State of the Art, 
Computational Linguistics, 1998, Volume 
24, Number 1, pp. 1-40 
[2]. Philip Resnik, David Yarowsky, A 
Perspective on Word Sense 
Disambiguation Methods and their 
Evaluation, 
http://www.cs.jhu.edu/~yarowsky/pubs.html 
[3]. Alpha K. Luk, Statistical Sense 
Disambiguation with Relatively Small 
Corpus Using Dictionary Definitions, 33rd 
Annual Meeting of the Association for 
Computational Linguistics, 26-30 June 
1995, Massachusetts Institute of 
Technology, Cambridge, Massachusetts, 
USA, pp. 181-188 
[4]. (Chinese-language reference; author, title 
and journal illegible in source), 1998, 
No. 3, (illegible) 27, pp. 76-82 
[5]. (Chinese-language author names illegible 
in source), http://www.how-net.com. 
[6]. Kenneth Ward Church, Word Association 
Norms, Mutual Information, and 
Lexicography, Computational Linguistics, 
1990, Volume 16, Number 1, pp. 22-29 
