Redefining similarity in a thesaurus by using corpora 
Hiroyuki Shinnou 
Ibaraki University
Dept. of Systems Engineering
Nakanarusawa, 4-12-1
Hitachi, Ibaraki, 316, Japan
shinnou@lily.dse.ibaraki.ac.jp
1 Introduction 
The aim of this paper is to automatically define the similarity between two nouns which are generally used in various domains. With these similarities, we can construct a large and general thesaurus.
In applications of natural language processing, it is necessary to appropriately measure the similarity between two nouns. The similarity is usually calculated from a thesaurus. Since a handmade thesaurus is not suitable for machine use and is expensive to compile, automatic construction of a thesaurus from corpora has been attempted (Hindle, 1990). However, a thesaurus constructed in this way does not contain many nouns, and these nouns are specific to the corpus used. In other words, we cannot construct a general thesaurus from a corpus alone. This can be regarded as a data sparseness problem: few nouns appear in the corpus.
To overcome data sparseness, methods have been proposed that estimate the distribution of unseen cooccurrences from the distribution of similar words in the seen cooccurrences. Brown et al. proposed a class-based n-gram model, which generalizes the n-gram model, to predict a word from previous words in a text (Brown et al., 1992). They tackled data sparseness by generalizing the word to the class which contains the word. Pereira et al. basically used the same approach, but they proposed a soft clustering scheme, in which membership of a word in a class is probabilistic (Pereira et al., 1993). Brown et al. and Pereira et al. each provide a clustering algorithm that assigns words to proper classes, based on their own models. Dagan et al. proposed a similarity-based model in which each word is generalized, not to its own specific class, but to a set of words which are most similar to it (Dagan et al., 1993). Using this model, they successfully predicted which unobserved cooccurrences were more likely than others, and estimated the probability of the cooccurrences (Dagan et al., 1994). However, because these schemes look for similar words in the corpus, the number of similarities which we can define is rather small in comparison with the number of all possible pairs. Looking for similar words in the corpus is itself already affected by data sparseness.
In this paper, we propose a method distinct from the above methods: it uses a handmade thesaurus to find similar words. The proposed method avoids data sparseness by estimating undefined similarities from the similarity in the thesaurus and the similarities defined by the corpus. Thus, the obtained similarities are the same in number as the similarities in the thesaurus, and they reflect the particularity of the domain to which the corpus belongs. Using a thesaurus lets us identify similar words independently of the corpus, and has the advantage that some ambiguities in analyzing the corpus are resolved.
We conducted experiments using Bunrui-goi-hyou (Bunrui-goi-hyou, 1994), a Japanese handmade thesaurus, and a corpus consisting of five years of Japanese economic newspaper articles with about 7.85 M sentences. We evaluate the appropriateness of the obtained similarities.
2 Defining the similarity 
We can easily judge the similarity of two nouns if they are very similar. However, the more different they are, the more difficult it is to define their similarity. Thus, we can trust that nouns in the class corresponding to a "leaf" of Bunrui-goi-hyou are similar to one another, and that this is not affected by the domain. In this paper, we refer to the class corresponding to a leaf of Bunrui-goi-hyou as the primitive class. Therefore, the similarity we have to define is the similarity between these classes.
This method consists of 4 steps. 
Step 1 Gather the cooccurrence data from the 
corpus. 
Step 2 Generalize the noun in the cooccurrence 
data to the primitive class. 
Step 3 Measure the similarity between two prim- 
itive classes by using the cooccurrence data 
obtained in step 2. 
Step 4 Estimate undefined similarities. 
We describe each step in detail in the following subsections.
2.1 Gathering cooccurrence data (step 1)
In order to carry out our method, it is necessary to first gather the cooccurrence data from the corpus. If a noun (N), a postpositional particle (P), and a verb (V) appear in a sentence in this order, we pick out the cooccurrence data [N, P, V]. In this study, we gathered cooccurrence data only for the postpositional particle "wo", because "wo" is the most effective postpositional particle for classifying nouns.
As a corpus, we used five years of Japanese economic newspaper articles. The corpus has about 7.85 M sentences, and the average number of characters in one sentence was about 49. From the corpus, we gathered about 4.41 M bits (occurrences) of cooccurrence data (about 1.48 M types) whose postpositional particle was "wo". From these, we removed the cooccurrence data whose frequency was 1 or whose verb appeared 20 times or fewer. In all, we obtained about 3.26 M bits of cooccurrence data, which consisted of about 0.36 M types. These cooccurrence data are used in the next step.
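The two filters of step 1 can be illustrated with a minimal Python sketch. It is not the program used in the experiments: the function name `filter_cooccurrences`, its parameters, and the toy romanized data are hypothetical, and it assumes that the verb frequency is counted over the "wo" cooccurrence data themselves.

```python
from collections import Counter

def filter_cooccurrences(triples, particle="wo", verb_threshold=20):
    """Count [N, P, V] triples and apply the two step-1 filters.

    triples: iterable of (noun, particle, verb) tuples, one per occurrence.
    Returns a Counter mapping each kept triple type to its frequency.
    """
    # Keep only the triples with the chosen postpositional particle.
    kept_particle = [t for t in triples if t[1] == particle]
    type_freq = Counter(kept_particle)
    # Assumption: verb frequency is counted over the same "wo" triples.
    verb_freq = Counter(verb for _, _, verb in kept_particle)
    # Drop types seen only once and types whose verb is too rare.
    return Counter({
        triple: freq
        for triple, freq in type_freq.items()
        if freq > 1 and verb_freq[triple[2]] > verb_threshold
    })

# Toy data (made up); a small threshold is used so that something survives.
toy = ([("hon", "wo", "yomu")] * 3
       + [("zasshi", "wo", "yomu")]
       + [("hon", "ga", "aru")] * 5)
print(filter_cooccurrences(toy, verb_threshold=3))
```

With the default `verb_threshold=20`, a type is kept only if its verb appears more than 20 times, which matches the condition stated above.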
2.2 Generalizing the word to the class 
(step 2) 
In step 2, we generalize the noun in cooccurrence 
data gathered in step 1 to the primitive class to 
which this noun belongs. 
First, we briefly describe Bunrui-goi-hyou.
Bunrui-goi-hyou is a kind of thesaurus with a tree- 
like structure that has a maximum depth of level 
6. Class IDs are assigned to each "leaf" of the 
"tree". Each noun has a class ID corresponding 
to the meaning of the noun. The class ID cor- 
responds to the primitive class. Bunrui-goi-hyou 
has 3,582 primitive classes. 
Because many nouns, such as compound nouns, are not in Bunrui-goi-hyou, we cannot always generalize all nouns to primitive classes. 86.0% of the nouns in the cooccurrence data gathered in step 1 could be generalized to primitive classes.
In this generalization, the problem of polysemy arises. A noun usually belongs to several primitive classes because of polysemy. We resolve some of the polysemy by using the distribution of nouns in the cooccurrence data which have the same verb. This cannot be discussed here for lack of space. We only report that the cooccurrence data gathered in step 1 contain 572,529 bits of polysemy, which consisted of 27,918 types, and that 472,273 bits of polysemy (18,534 types) were resolved.
In all, we obtained 2,708,135 bits of general- 
ized cooccurrence data, which consisted of 115,330 
types. 
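The generalization of step 2 can be sketched as follows. This is only an illustration under stated assumptions: `noun_classes` is a hypothetical stand-in for a Bunrui-goi-hyou lookup, the class IDs "C1" to "C3" are invented, coverage is measured by frequency, and polysemous nouns are simply set aside because the disambiguation procedure is not described in this paper.

```python
from collections import Counter

def generalize(cooc, noun_classes):
    """Replace each noun by its primitive class ID.

    cooc: Counter of (noun, particle, verb) -> frequency.
    noun_classes: dict noun -> list of primitive class IDs (hypothetical).
    Returns (generalized Counter, ratio of occurrences whose noun is listed).
    """
    generalized = Counter()
    covered = 0
    total = sum(cooc.values())
    for (noun, particle, verb), freq in cooc.items():
        classes = noun_classes.get(noun, [])
        if classes:
            covered += freq
        # Only unambiguous nouns are generalized in this sketch;
        # polysemous nouns would need the disambiguation step.
        if len(classes) == 1:
            generalized[(classes[0], particle, verb)] += freq
    ratio = covered / total if total else 0.0
    return generalized, ratio

# Toy data (made up).
toy_cooc = Counter({
    ("hon", "wo", "yomu"): 3,
    ("shinbun", "wo", "yomu"): 2,
    ("unknown-compound", "wo", "yomu"): 1,
    ("kai", "wo", "hiraku"): 4,
})
toy_classes = {"hon": ["C1"], "shinbun": ["C1"], "kai": ["C2", "C3"]}
toy_generalized, toy_ratio = generalize(toy_cooc, toy_classes)
```

Here two different nouns of the same class collapse into one generalized type, which is why the number of types drops after generalization.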
2.3 Measuring the similarity between 
classes (step 3) 
In step 3, we measure the similarity between 
two primitive classes by using the method given 
by Hindle (Hindle, 1990). 
First, we define the mutual information MI of a verb v and a primitive class C as follows.

$$\mathrm{MI}(v, C) = \log_2 \frac{f(v, C)/N}{\bigl(f(v)/N\bigr)\bigl(f(C)/N\bigr)} \qquad \text{(eq.1)}$$

In the above equation, N is the total number of cooccurrence data bits, f(v) and f(C) are the frequencies of v and C in the whole cooccurrence data set respectively, and f(v, C) is the frequency of the cooccurrence data [C, wo, v]. Next, the similarity sim of classes Ci and Cj for a verb v is defined as follows.

$$\mathrm{sim}(v, C_i, C_j) = \begin{cases} \min\bigl(|\mathrm{MI}(v, C_i)|,\ |\mathrm{MI}(v, C_j)|\bigr) & \text{if } \mathrm{MI}(v, C_i) \cdot \mathrm{MI}(v, C_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

Finally, the similarity of Ci and Cj is measured as follows.

$$\mathrm{SIM}(C_i, C_j) = \sum_{v} \mathrm{sim}(v, C_i, C_j)$$

In equation (eq.1), f(v) > 0 because v is a verb in some cooccurrence data obtained in step 2. However, f(C) may be equal to 0 because the primitive class C ranges over all primitive classes. If f(C) = 0, then MI(v, C) cannot be defined. So, if f(Ci) = 0 or f(Cj) = 0, the mutual information is undefined for every verb v, and SIM(Ci, Cj) is undefined.
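The definitions of step 3 can be sketched in Python as follows. This is a minimal illustration, not the program used in the experiments; the function names and the toy data are hypothetical. It assumes that a verb contributes nothing when it was never observed with one of the two classes (the logarithm would not be defined there), and it returns `None` for an undefined similarity.

```python
import math
from collections import Counter

def build_similarity(cooc):
    """cooc: Counter of (class_id, verb) -> frequency of [C, wo, v]."""
    n = sum(cooc.values())
    f_v, f_c = Counter(), Counter()
    for (c, v), freq in cooc.items():
        f_c[c] += freq
        f_v[v] += freq

    def mi(v, c):
        """Mutual information of (eq.1); None if the pair was never seen."""
        joint = cooc.get((c, v), 0)
        if joint == 0:
            return None  # assumption: unseen pairs are skipped
        return math.log2((joint / n) / ((f_v[v] / n) * (f_c[c] / n)))

    def sim_total(ci, cj):
        """SIM(Ci, Cj); None (undefined) if f(Ci) = 0 or f(Cj) = 0."""
        if f_c[ci] == 0 or f_c[cj] == 0:
            return None
        total = 0.0
        for v in f_v:
            mi_i, mi_j = mi(v, ci), mi(v, cj)
            # sim(v, Ci, Cj) is non-zero only when both MI values
            # have the same sign.
            if mi_i is not None and mi_j is not None and mi_i * mi_j > 0:
                total += min(abs(mi_i), abs(mi_j))
        return total

    return mi, sim_total

# Toy data (made up): classes A and B share the verb "eat".
toy_counts = Counter({("A", "eat"): 2, ("B", "eat"): 2, ("C", "read"): 4})
toy_mi, toy_sim = build_similarity(toy_counts)
print(toy_mi("eat", "A"), toy_sim("A", "B"), toy_sim("A", "Z"))
```

In the toy data, MI(eat, A) = log2((2/8) / ((4/8)(2/8))) = 1, so the shared verb contributes 1 to SIM(A, B), while the class Z, which never appears, yields an undefined similarity.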
2.4 Estimating the undefined similarity 
(step 4) 
There are 3,582 primitive classes, so $_{3582}C_2 = 6{,}413{,}571$ similarities must be defined. Through step 3, 2,049,566 similarities had been defined. This is 32.0 % of the whole.
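The counts above can be checked with a few lines of Python (the variable names are ours):

```python
import math

num_classes = 3582
# Number of unordered pairs of distinct primitive classes.
total_pairs = math.comb(num_classes, 2)
defined_pairs = 2_049_566
ratio = defined_pairs / total_pairs

print(total_pairs)     # 6413571
print(f"{ratio:.1%}")  # 32.0%
```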
In step 4, we estimate undefined similarities from the thesaurus and the defined similarities. Let us estimate the undefined similarity between the classes $C_a$ and $C_b$. First, we pick out the set of primitive classes $\{C_{a_1}, C_{a_2}, \ldots, C_{a_i}\}$ such that each class has the same parent node as class $C_a$ in Bunrui-goi-hyou, that is, each class $C_{a_k}$ is a sibling node of class $C_a$. By the same process, we pick out the set of primitive classes $\{C_{b_1}, C_{b_2}, \ldots, C_{b_j}\}$ for class $C_b$. A similarity in Bunrui-goi-hyou is reliable if its value is large. Thus, we can expect the defined $\mathrm{SIM}(C_{a_k}, C_b)$ and the defined $\mathrm{SIM}(C_a, C_{b_m})$ to be close to the undefined $\mathrm{SIM}(C_a, C_b)$. Therefore, we define $\mathrm{SIM}(C_a, C_b)$ as the average of $\mathrm{SIM}(C_{a_k}, C_b)$ and $\mathrm{SIM}(C_a, C_{b_m})$. This corresponds to filling the slot in Fig.1(a) with the average of the values in the shaded part of the figure. If undefined pairs are left after the above estimation, they are estimated by the average of $\mathrm{SIM}(C_{a_k}, C_{b_m})$. This corresponds to filling the slot in Fig.1(b) with the average of the values in the shaded part of the figure. If undefined pairs still remain, we pick out the set of primitive classes such that the grandparent node of each class is the same as that of $C_a$, and likewise for $C_b$, and we repeat the above processes (cf. Fig.1(c)).

Figure 1: Estimation of $\mathrm{SIM}(C_a, C_b)$: (a) 1st estimation, (b) 2nd estimation, (c) 3rd estimation
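The estimation procedure can be sketched in Python as follows. This is one possible reading, not the program used in the experiments. All names are hypothetical. The sketch assumes that the thesaurus is given as a child-to-parent dictionary, that all primitive classes lie at the same depth, and that only similarities defined from the corpus (never previously estimated ones) are averaged.

```python
from statistics import mean

def ancestor(node, parent, up):
    """Walk `up` levels toward the root (stays at the root if reached)."""
    for _ in range(up):
        node = parent.get(node, node)
    return node

def kin(cls, leaves, parent, up):
    """Primitive classes sharing the ancestor `up` levels above `cls`."""
    target = ancestor(cls, parent, up)
    return [c for c in leaves
            if c != cls and ancestor(c, parent, up) == target]

def lookup(sim, x, y):
    """Corpus-defined SIM for an unordered pair, or None if undefined."""
    return sim.get((x, y), sim.get((y, x)))

def estimate(ca, cb, leaves, parent, sim, max_up=2):
    """Estimate an undefined SIM(ca, cb); None if nothing is available."""
    for up in range(1, max_up + 1):  # 1: siblings, 2: same grandparent
        kin_a = kin(ca, leaves, parent, up)
        kin_b = kin(cb, leaves, parent, up)
        # One side replaced by a relative (cf. Fig.1(a)).
        one_side = ([lookup(sim, a, cb) for a in kin_a]
                    + [lookup(sim, ca, b) for b in kin_b])
        one_side = [s for s in one_side if s is not None]
        if one_side:
            return mean(one_side)
        # Both sides replaced by relatives (cf. Fig.1(b)).
        both = [lookup(sim, a, b) for a in kin_a for b in kin_b]
        both = [s for s in both if s is not None]
        if both:
            return mean(both)
    return None

# Toy thesaurus (made up): a1, a2, a3 under P; b1, b2 under Q.
toy_parent = {"a1": "P", "a2": "P", "a3": "P",
              "b1": "Q", "b2": "Q", "P": "R", "Q": "R"}
toy_leaves = ["a1", "a2", "a3", "b1", "b2"]
toy_defined = {("a2", "b1"): 2.0, ("a3", "b1"): 4.0, ("a1", "b2"): 6.0}
print(estimate("a1", "b1", toy_leaves, toy_parent, toy_defined))
```

In the toy example, SIM(a1, b1) is filled with the average of the three defined values 2.0, 4.0, and 6.0. Because the sketch only averages corpus-defined values, trying the estimations in order for each pair gives the same result as running each estimation over all remaining pairs in turn.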
Fig.2 shows the ratio of the number of similar- 
ities defined in each process. 
Figure 2: Ratio of the number of similarities defined in each process (corpus, 1st estimation, 2nd estimation, 3rd estimation)
3 Evaluations 
First, we evaluate the obtained similarities by comparing them with the similarities in Bunrui-goi-hyou. The similarity in Bunrui-goi-hyou is defined by the level of the common parent node of two classes. Tab.2 shows the average of the obtained similarities between two classes whose common parent node is at level x in Bunrui-goi-hyou.
Tab.2 shows that the larger the similarity in Bunrui-goi-hyou is, the larger the obtained similarity is. It follows that the obtained similarity is roughly consistent with the similarity in Bunrui-goi-hyou.
Next, we evaluate the appropriateness of the first estimation. The average of the "coefficient of variation"¹ for the similarities used in each first estimation is 0.384, and the coefficient of variation for all similarities measured from the corpus is 2.125. It follows that the similarities used in each first estimation are close to each other.

¹ The coefficient of variation is the standard deviation divided by the mean.

level of the common parent node    average of obtained similarities
1                                   2.160
2                                   3.690
3                                   6.519
4                                  10.090
5                                  14.815
6                                   ∞
Table 2: Tendency of obtained similarities
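For reference, the coefficient of variation mentioned above can be computed as in the following sketch. The function name and the toy values are hypothetical, and the population standard deviation is assumed because the paper does not say which variant was used.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population std assumed)."""
    return pstdev(values) / mean(values)

# Toy values (made up): mean 5, population standard deviation 2.
toy_values = [2, 4, 4, 4, 5, 5, 7, 9]
print(coefficient_of_variation(toy_values))
```

A small value means that the averaged similarities are tightly grouped around their mean, which is the property the first estimation relies on.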
Finally, we evaluate the appropriateness of the obtained similarity by selecting a verbal meaning. In this experiment, we compare the similarity in Bunrui-goi-hyou with the similarity obtained by our method. Because the similarity in Bunrui-goi-hyou is coarse, multiple answers may arise. In evaluating the similarity in Bunrui-goi-hyou, we give a ○ if the answer is unique and right, a △ if the answers contain the right answer, and a × if the answers do not contain the right answer. In evaluating our similarities, we give a ○ if the largest similarity is right, a △ if the 1st or 2nd largest similarity is the right answer, and a × if neither the 1st nor the 2nd largest similarity is the right answer. Tab.1 shows the results of the evaluations. This table shows that the similarity obtained by our method is a little better than the similarity in Bunrui-goi-hyou.
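The two marking rules described above can be written down as a short sketch. The function names are hypothetical, the three marks are returned as the strings "circle", "triangle", and "cross", and the way candidate meanings are produced (a set of tied answers for Bunrui-goi-hyou, a ranking by similarity for our method) is assumed rather than taken from the paper.

```python
def grade_thesaurus(answers, correct):
    """Marking rule for Bunrui-goi-hyou.

    answers: set of meanings tied for the best similarity.
    """
    if answers == {correct}:
        return "circle"    # unique and right
    if correct in answers:
        return "triangle"  # right answer among several
    return "cross"

def grade_ranked(ranked, correct):
    """Marking rule for our similarities.

    ranked: meanings sorted by decreasing similarity.
    """
    if ranked[:1] == [correct]:
        return "circle"    # largest similarity is right
    if correct in ranked[:2]:
        return "triangle"  # 2nd largest similarity is right
    return "cross"

print(grade_thesaurus({"m1", "m2"}, "m1"), grade_ranked(["m2", "m1", "m3"], "m1"))
```

From the totals in Tab.1, the proportion of circles is 116/184 (about 63.0 %) for Bunrui-goi-hyou and 124/184 (about 67.4 %) for our method.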
          nouns      Bunrui-goi-hyou      Our method
pattern   for test    ○     △     ×       ○     △     ×
1            24      14     2     8      17     0     7
2             4       0     2     2       1     1     2
3            16       8     1     7       9     1     6
4             3       1     1     1       1     0     2
5            16      14     0     2      13     2     1
6             8       7     0     1       7     0     1
7            13      12     0     1      10     1     2
8             9       3     1     5       3     1     5
9            19       7     4     8       9     3     7
10           18      16     0     2      14     2     2
11            8       7     0     1       8     0     0
12           18      14     1     3      16     0     2
13           28      13     6     9      16     3     9
Total       184     116    18    50     124    14    46
Table 1: Result of test of verbal meaning selection (patterns are numbered in their original order; the Japanese verb patterns, their numbers of meanings, and the example nouns are not legible in this copy)
4 Remarks
It is difficult to extract all knowledge from only a corpus because of incomplete analysis and data sparseness. In order to avoid these difficulties, the approach of using resources other than the corpus is promising. Constructing a thesaurus from a dictionary (Turumaru et al., 1991) and making example data from usable knowledge (Kaneda et al., 1995) can be regarded as examples of this approach. The proposed method uses the handmade thesaurus as the resource other than the corpus. In addition, the statistical data from the corpus are weighted. However, it will be important in future research to investigate how much weight should be given to each bit of data.
It is difficult to build knowledge for each domain from scratch. So it is important to extend and modify existing knowledge according to the purpose of use. In this method, relatively few bits of cooccurrence data are used because many nouns in the cooccurrence data are not in Bunrui-goi-hyou. If we extend Bunrui-goi-hyou, these unused cooccurrence data may be useful. Moreover, by using the obtained similarities, we can modify Bunrui-goi-hyou. Since our method constructs a thesaurus from the handmade thesaurus by using the corpus, it can be considered a method to refine the handmade thesaurus so that it suits the domain of the corpus used.
5 Conclusions 
In this paper, we proposed a method to define 
similarities between general nouns used in vari- 
ous domains. The proposed method redefines the 
similarity in a handmade thesaurus by using cor- 
pora. The method avoids data sparseness by esti- 
mating undefined similarities from the similarity 
in the thesaurus and similarities defined by cor- 
pora. The obtained similarities are obviously the 
same in number as the original similarities, and 
are more appropriate than the original similari- 
ties in the thesaurus. 
By using Bunrui-goi-hyou as the handmade thesaurus and newspaper articles with about 7.85 M sentences as a corpus, we confirmed the appropriateness of this method.
In the future, we will extend and modify Bunrui-goi-hyou by using the cooccurrence data and the similarities obtained in this study, and will try to classify multiple senses of verbs.
Acknowledgment 
The corpus used in our experiment was extracted from CD-ROMs ('90 - '94) sold by the Nihon Keizai Shinbun company. We are deeply grateful to the Nihon Keizai Shinbun company for permitting the use of this corpus, and to the many people who negotiated with the company about its use.
References 
Brown,P.F., Della Pietra,V.J., deSouza,P.V., Lai,J.C. and Mercer,R.L. : 1992. Class-Based n-gram Models of Natural Language, Computational Linguistics, Vol.18, No.4, pp.467-479.
Dagan,I., Marcus,S., and Markovitch,S. : 1993. Contextual Word Similarity and Estimation from Sparse Data, In 31st Annual Meeting of the Association for Computational Linguistics, pp.164-171.
Dagan,I., Pereira,F., and Lee,L. : 1994. Similarity-Based Estimation of Word Cooccurrence Probabilities, In 32nd Annual Meeting of the Association for Computational Linguistics, pp.272-278.
Hindle,D. : 1990. Noun classification from predicate-argument structures, In 28th Annual Meeting of the Association for Computational Linguistics, pp.268-275.
Kaneda,S., Akiba,Y., and Ishii,M. : 1995. Jireini motozuku eigodousi sentakuruuru no syuuseigata gakusyuuhou (in Japanese), In Proceedings of the first annual meeting of the Association for Natural Language Processing, pp.333-336.
Pereira,F., Tishby,N., and Lee,L. : 1993. Distributional Clustering of English Words, In 31st Annual Meeting of the Association for Computational Linguistics, pp.183-190.
The National Language Research Institute : 1994. 
Bunrui-goi-hyou (in Japanese), Shuuei Publishing. 
Turumaru,H., Takesita,K., Itami,K., Yanagawa,T. and Yoshida,S. : 1991. An Approach to Thesaurus Construction from Japanese Language Dictionary (in Japanese), IPS Japan NL-83-16, Vol.91, No.37, 91-NL-83, pp.121-128.
