 
Word Sense Acquisition from Bilingual Comparable Corpora 
 
Hiroyuki Kaji
 
Central Research Laboratory, Hitachi, Ltd. 
1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan 
kaji@crl.hitachi.co.jp 
 
 
 
 
 
Abstract 
Manually constructing an inventory of word 
senses has suffered from problems including 
high cost, arbitrary assignment of meaning to 
words, and mismatch to domains. To over-
come these problems, we propose a method 
to assign word meaning from a bilingual 
comparable corpus and a bilingual dictionary. 
It clusters second-language translation 
equivalents of a first-language target word on 
the basis of their translingually aligned dis-
tribution patterns. Thus it produces a hierar-
chy of corpus-relevant meanings of the target 
word, each of which is defined with a set of 
translation equivalents. The effectiveness of 
the method has been demonstrated through an 
experiment using a comparable corpus con-
sisting of Wall Street Journal and Nihon Kei-
zai Shimbun corpora together with the EDR 
bilingual dictionary. 
1  Introduction 
Word Sense Disambiguation (WSD) is an important 
subtask that is necessary for accomplishing most natu-
ral language processing tasks including machine 
translation and information retrieval. A great deal of 
research on WSD has been done over the past decade 
(Ide and Veronis, 1998). In contrast, word sense acqui-
sition has been a human activity; inventories of word 
senses have been constructed by lexicographers based 
on their intuition. Manually constructing an inventory 
of word senses has suffered from problems such as 
high cost, arbitrary division of word senses, and mis-
match to application domains. 
We address the problem of word sense acquisition 
along the lines of the WSD where word senses are 
defined with sets of translation equivalents in another 
language. Bilingual corpora or second-language cor-
pora enable unsupervised WSD (Brown, et al., 1991; 
Dagan and Itai, 1994). However, the correspondence 
between senses of a word and its translations is not 
one-to-one, and therefore we need to prepare an in-
ventory of word senses, each of which is defined with 
a set of synonymous translation equivalents. Although 
conventional bilingual dictionaries usually group 
translations according to their senses, the grouping 
differs by dictionary. In addition, senses specific to a 
domain are often missing while many senses irrelevant 
to the domain or rare senses are included. To over-
come these problems, we propose a method for pro-
ducing a hierarchy of clusters of translation equiva-
lents from a bilingual corpus and a bilingual diction-
ary. 
To the best of our knowledge, there are two pre-
ceding research papers on word sense acquisition (Fu-
kumoto and Tsujii, 1994; Pantel and Lin, 2002). Both 
proposed distributional word clustering algorithms that 
are characterized by their capabilities to produce 
overlapping clusters. According to their algorithms, a 
polysemous word is assigned to multiple clusters, each 
of which represents one of its senses. These and our 
approach differ in how to define the word sense, i.e., a 
set of synonyms in the same language versus a set of 
translation equivalents in another language. Schuetze 
(1998) proposed a method for dividing occurrences of 
a word into classes, each of which consists of contex-
tually similar occurrences. However, it does not pro-
duce definitions of senses such as sets of synonyms 
and sets of translation equivalents. 
2  Basic Idea 
2.1  Clustering of translation equivalents 
Most work on automatic extraction of synonyms from 
text corpora rests on the idea that synonyms have 
                                                               Edmonton, May-June 2003
                                                               Main Papers , pp. 32-39
                                                         Proceedings of HLT-NAACL 2003
 
similar distribution patterns (Hindle, 1990; Pereira, et 
al., 1993; Grefenstette, 1994). This idea is also useful 
for our task, i.e., extracting sets of synonymous trans-
lation equivalents, and we adopt the approach to dis-
tributional word clustering. 
Two characteristics of our task make the problem easier.
First, we do not have to cluster all words of a language;
we only have to cluster, separately for each target word
whose senses are to be extracted, a small number of
translation equivalents. As a result, computational
efficiency is less of a concern. Second, even if a
translation equivalent is itself polysemous, it is not
necessary to consider senses that are irrelevant to the
target word. A translation equivalent usually represents
one and only one sense of the target word, at least when
the two languages have different origins, as English and
Japanese do. Therefore, a non-overlapping clustering
algorithm, which is far simpler than overlapping
clustering algorithms, is sufficient.
2.2  Translingual distributional word clustering 
In conventional distributional word clustering, a word 
is characterized by a vector or weighted set consisting 
of words in the same language as that of the word it-
self. In contrast, we propose a translingual distribu-
tional word clustering method, whereby a word is 
characterized by a vector or weighted set consisting of 
words in another language. It is based on the 
sense-vs.-clue correlation matrix calculation method 
we originally developed for unsupervised WSD (Kaji 
and Morimoto, 2002). That method presupposes that 
each sense of a target word x is defined with a syno-
nym set consisting of the target word itself and one or 
more translation equivalents which represent the sense. 
It calculates correlations between the senses of x and 
the words statistically related to x, which act as clues 
for determining the sense of x, on the basis of 
translingual alignment of pairs of related words. Rows 
of the resultant correlation matrix are regarded as 
translingual distribution patterns characterizing trans-
lation equivalents.  
Sense-vs.-clue correlation matrix calculation method*

(* A description of the wild-card pair of related words, which
plays an essential role in recovering alignment failure, has
been omitted for simplicity.)

1) Alignment of pairs of related words
Let X(x) be the set of clues for determining the sense of a
first-language target word x. That is,

\[ X(x) = \{\, x' \mid (x, x') \in R_X \,\}, \]

where R_X denotes the collection of pairs of related words
extracted from a corpus of the first language. Henceforth,
the j-th clue for determining the sense of x will be denoted
as x'(j). Furthermore, let Y(x, x'(j)) be the set consisting
of all second-language counterparts of a first-language pair
of related words x and x'(j). That is,

\[ Y(x, x'(j)) = \{\, (y, y') \mid (y, y') \in R_Y,\ (x, y) \in D,\ (x'(j), y') \in D \,\}, \]

where R_Y denotes the collection of pairs of related words
extracted from a corpus of the second language, and D denotes
a bilingual dictionary, i.e., a collection of pairs consisting
of a first-language word and a second-language word that are
translations of one another.
Then, for each alignment, i.e., pair of (x, x'(j)) and
(y, y') (∈ Y(x, x'(j))), a weighted set of common related
words Z((x, x'(j)), (y, y')) is constructed as follows:

\[ Z((x, x'(j)), (y, y')) = \{\, x'' / w(x'') \mid (x, x'') \in R_X,\ (x'(j), x'') \in R_X \,\}. \]

The weight of x'', denoted as w(x''), is determined as
follows:
- w(x'') = 1 + α·MI(y, y') when ∃y'' (x'', y'') ∈ D,
  (y, y'') ∈ R_Y, and (y', y'') ∈ R_Y.
- w(x'') = 1 otherwise.
This is where MI(y, y') is the mutual information of y
and y'. The coefficient α was set to 5 in the experiment
described in Section 4.
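
The alignment step can be illustrated with a minimal Python sketch. It is not the implementation used in the experiment: the function names, the symmetric treatment of related pairs, the use of MI(y, y') in the weight, and the romanized toy data (including the mutual information value) are all assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def related(pairs: Set[Pair], a: str, b: str) -> bool:
    """Relatedness is treated as symmetric here (an assumption of this sketch)."""
    return (a, b) in pairs or (b, a) in pairs


def clues(word: str, r: Set[Pair]) -> Set[str]:
    """X(x): every word that forms a related pair with `word`."""
    return {b for a, b in r if a == word} | {a for a, b in r if b == word}


def translations(word: str, d: Set[Pair]) -> Set[str]:
    """Second-language translations of a first-language word according to D."""
    return {y for w, y in d if w == word}


def counterparts(x: str, clue: str, r_y: Set[Pair], d: Set[Pair]) -> Set[Pair]:
    """Y(x, x'(j)): related pairs (y, y') with (x, y) in D and (x'(j), y') in D."""
    return {(y, yp)
            for y in translations(x, d)
            for yp in translations(clue, d)
            if related(r_y, y, yp)}


def common_related(x, clue, y, yp, r_x, r_y, d, mi_y, alpha=5.0) -> Dict[str, float]:
    """Z((x, x'(j)), (y, y')): common related words x'' with weights w(x'')."""
    weights = {}
    for cand in clues(x, r_x) & clues(clue, r_x):
        # x'' is supported when one of its translations is related to both y and y'.
        supported = any(related(r_y, y, t) and related(r_y, yp, t)
                        for t in translations(cand, d))
        weights[cand] = 1.0 + alpha * mi_y[(y, yp)] if supported else 1.0
    return weights


# Hypothetical toy data: romanized Japanese words and an invented MI value.
R_X = {("race", "horse"), ("race", "jockey"), ("horse", "jockey"),
       ("race", "track"), ("horse", "track")}
R_Y = {("keiba", "uma"), ("keiba", "kishu"), ("uma", "kishu")}
D = {("race", "keiba"), ("race", "jinshu"), ("horse", "uma"), ("jockey", "kishu")}
MI_Y = {("keiba", "uma"): 2.0}

aligned = counterparts("race", "horse", R_Y, D)
z = common_related("race", "horse", "keiba", "uma", R_X, R_Y, D, MI_Y)
print(aligned)
print(z)
```

In this toy setting the pair (race, horse) aligns only with (keiba, uma), and the common related word "jockey" receives the larger weight because its translation is related to both second-language words, whereas "track" has no dictionary entry and keeps the default weight of 1.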
2) Calculation of correlation between senses and clues
The correlation between the i-th sense S(x, i) and
the j-th clue x'(j) is defined as:

\[
C(S(x,i), x'(j)) = \mathrm{MI}(x, x'(j)) \cdot
\frac{\displaystyle \max_{\substack{(y,y') \in Y(x,x'(j)) \\ y \in S(x,i)}}
      A\bigl((x,x'(j)), (y,y'), S(x,i)\bigr)}
     {\displaystyle \max_{k} \max_{\substack{(y,y') \in Y(x,x'(j)) \\ y \in S(x,k)}}
      A\bigl((x,x'(j)), (y,y'), S(x,k)\bigr)}.
\]

This is where MI(x, x'(j)) is the mutual information of
x and x'(j), and A((x, x'(j)), (y, y'), S(x, i)), the
plausibility of alignment of (x, x'(j)) with (y, y')
suggesting S(x, i), is defined as the weighted sum of the
correlations between the sense and the common related words,
i.e.,

\[
A\bigl((x,x'(j)), (y,y'), S(x,i)\bigr) =
\sum_{x'' \in Z((x,x'(j)),(y,y'))} w(x'') \cdot C(S(x,i), x'').
\]

The correlations between senses and clues are calculated
iteratively with the following initial values:
C_0(S(x, i), x'(j)) = MI(x, x'(j)). The number of iterations
was set to 6 in the experiment. Figure 1 shows how the
correlation values converge.
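
The iterative calculation can be sketched in Python as follows. This is a simplified illustration rather than the original implementation: the data layout (one dictionary per sense, alignments stored as lists of a translation equivalent and its weighted set Z), the function name, and the toy values are assumptions. Each pass recomputes every correlation from the values of the previous pass.

```python
def iterate_correlations(senses, mi, alignments, n_iter=6):
    """Return corr[i][clue], the correlation of sense i with each clue.

    senses:     list of sets of translation equivalents (one set per sense)
    mi:         clue -> MI(x, clue)
    alignments: clue -> list of (y, weights); y is the translation equivalent
                of x in an aligned pair, weights maps each x'' in Z to w(x'')
    """
    # Initial values: C_0(S(x, i), x'(j)) = MI(x, x'(j)) for every sense i.
    corr = [dict(mi) for _ in senses]
    for _ in range(n_iter):
        updated = [{} for _ in senses]
        for clue, mi_value in mi.items():
            # Best alignment plausibility A(...) per sense, from the previous C.
            best = []
            for k, sense in enumerate(senses):
                scores = [sum(w * corr[k].get(z, 0.0) for z, w in weights.items())
                          for y, weights in alignments.get(clue, [])
                          if y in sense]
                best.append(max(scores, default=0.0))
            top = max(best)
            for i in range(len(senses)):
                updated[i][clue] = mi_value * best[i] / top if top > 0 else 0.0
        corr = updated
    return corr


# Hypothetical toy input: two senses of "promotion" and five clues.
senses = [{"senden"}, {"shoushin"}]
mi = {"brand": 1.0, "ad": 1.0, "job": 1.0, "career": 1.0, "company": 1.0}
alignments = {
    "brand": [("senden", {"ad": 1.0})],
    "ad": [("senden", {"brand": 1.0})],
    "job": [("shoushin", {"career": 1.0})],
    "career": [("shoushin", {"job": 1.0})],
    "company": [("senden", {"brand": 1.0, "ad": 1.0}),
                ("shoushin", {"job": 1.0})],
}
corr = iterate_correlations(senses, mi, alignments)
print(corr[0]["company"], corr[1]["company"])
```

Because the denominator is the maximum plausibility over all senses, the sense best supported by a clue keeps the full mutual information value while the other senses are scaled down, which is how a clue comes to favor a single sense.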
Advantages of using translingually aligned 
distribution patterns 
Translingual distributional word clustering has advantages
over conventional monolingual distributional word
clustering, when they are used to cluster translation
equivalents of a target word. First, it avoids clusters
being degraded by polysemous translation equivalents. Let
“race” be the target word. One of its translation
equivalents, “レース<REESU>”, is a polysemous word
representing “lace” as well as “race”. According to
monolingual distributional word clustering,
“レース<REESU>” is characterized by a mixture of the
distribution pattern for “レース<REESU>” representing
“race” and that for “レース<REESU>” representing “lace”,
which often results in degraded clusters. In contrast,
according to translingual distributional word clustering,
“レース<REESU>” is characterized by the distribution
pattern for the sense of “race” that means “competition”.
Second, translingual distributional word clustering can
exclude from the clusters translation equivalents
irrelevant to the corpus. For example, a bilingual
dictionary renders “特徴<TOKUCHOU>” (“feature”) as a
translation of “race”, but that sense of “race” is used
infrequently. If it is the case in a given domain,
“特徴<TOKUCHOU>” has low correlation with most words
related to “race”, and can therefore be excluded from
any clusters.
We should also mention the data-sparseness prob-
lem that hampers distributional word clustering. Gen-
erally speaking, the problem becomes more difficult in 
translingual distributional word clustering, since the 
sparseness of data in two languages is multiplied. 
However, the sense-vs.-clue correlation matrix calcu-
lation method overcomes this difficulty; it calculates 
the correlations between senses and clues iteratively to 
smooth out the sparse data. 
Translingual distributional word clustering can also 
be implemented on the basis of word-for-word align-
ment of a parallel corpus. However, availability of 
large parallel corpora is extremely limited. In contrast, 
the sense-vs.-clue correlation calculation method ac-
cepts comparable corpora which are available in many 
domains. 
2.3  Similarity based on subordinate distribu-
tion pattern 
Naive translingual distributional word clustering based 
on the sense-vs.-clue correlation matrix calculation 
method is outlined in the following steps: 
1) Define the sense of a target word by using each 
translation equivalent. 
2) Calculate the sense-vs.-clue correlation matrix for 
the set of senses resulting from step 1). 
3) Calculate similarities between senses on the basis 
of distribution patterns shown by the sense-vs.-clue 
correlation matrix. 
4) Cluster senses by using a hierarchical agglomera-
tive clustering method, e.g., the group-average 
method. 
However, this naive method is not effective be-
cause some senses usually have duplicated definitions 
in step 1) despite the fact that the sense-vs.-clue corre-
lation matrix calculation algorithm presupposes a set 
of senses without duplicated definitions. The algorithm
is based on the “one sense per collocation” hypothesis,
and it results in each clue having a high correlation
with one and only one sense. A clue can never
have high correlations with two or more senses, even 
when they are actually the same sense. Consequently, 
synonymous translation equivalents do not necessarily 
have high similarity. 
[Figure 1. Convergence of correlation between senses and
clues. Horizontal axis: iteration (0 to 10); vertical axis:
correlation (0.0 to 2.5). Plotted series: C(S1, brand),
C(S2, brand), C(S3, brand), C(S1, woman), C(S2, woman),
C(S3, woman), where
S1={promotion, 宣伝<SENDEN>, プロモーション<PUROMOUSHON>, 売り込み<URIKOMI>, …}
 (“an activity intended to help sell a product”)
S2={promotion, 昇格<SHOUKAKU>, 昇進<SHOUSHIN>, 登用<TOUYOU>, …}
 (“advancement in rank or position”)
S3={promotion, 奨励<SHOUREI>, 振興<SHINKOU>, 助長<JOCHOU>, …}
 (“action to help something develop or succeed”)]

Figure 2(a) shows parts of distribution patterns for
{promotion, 宣伝<SENDEN>}, {promotion, プロモーション<PUROMOUSHON>},
and {promotion, 売り込み<URIKOMI>}, all of which define the
“sales activity” sense of “promotion”. We see that most clues
for selecting that sense have higher correlation with
{promotion, 宣伝<SENDEN>} than with
{promotion, プロモーション<PUROMOUSHON>} and
{promotion, 売り込み<URIKOMI>}. This is because “宣伝<SENDEN>”
is the most dominant translation equivalent of “promotion”
in the corpus.
To resolve the above problem, we calculated the
sense-vs.-clue correlation matrix not only for the full
set of senses but also for the set of senses excluding
one of these senses. Excluding the definition of a sense
that includes the most dominant translation equivalent
allows most clues for selecting the sense to have their
highest correlations with another definition of the same
sense, namely the one that includes the second most
dominant translation equivalent. Figure 2(b) shows parts
of distribution patterns for
{promotion, プロモーション<PUROMOUSHON>} and
{promotion, 売り込み<URIKOMI>} shown by the sense-vs.-clue
correlation matrix for the set of senses excluding
{promotion, 宣伝<SENDEN>}. We see that most clues for
selecting the “sales activity” sense have higher
correlations with {promotion, プロモーション<PUROMOUSHON>}
than with {promotion, 売り込み<URIKOMI>}. This is because
“プロモーション<PUROMOUSHON>” is the second most dominant
translation equivalent in the corpus. We also see that the
distribution pattern for
{promotion, プロモーション<PUROMOUSHON>} in Fig. 2(b) is
more similar to that for {promotion, 宣伝<SENDEN>} in
Fig. 2(a) than is that for
{promotion, プロモーション<PUROMOUSHON>} in Fig. 2(a).
We call the distribution pattern for sense S_2, resulting
from the sense-vs.-clue correlation matrix for the set of
senses excluding sense S_1, the distribution pattern for
S_2 subordinate to S_1, while we call the distribution
pattern for sense S_2, resulting from the sense-vs.-clue
correlation matrix for the full set of senses, simply the
distribution pattern for S_2. We define the similarity of
S_2 to S_1 as the similarity of the distribution pattern
for S_2 subordinate to S_1 to the distribution pattern
for S_1.
Calculating the sense-vs.-clue correlation matrix 
for a set of senses excluding one sense is of course 
insufficient since three or more translation equivalents 
may represent the same sense of the target word. We 
should calculate the sense-vs.-clue correlation matrices 
both for the full set of senses and for the set of senses 
excluding one of these senses again, after merging 
similar senses into one. Repeating these procedures 
enables corpus-relevant but less dominant translation 
equivalents to be drawn up, while corpus-irrelevant 
ones are never drawn up. Thus, a hierarchy of cor-
pus-relevant senses or clusters of corpus-relevant 
translation equivalents is produced. 
3  Proposed Method 
3.1  Outline 
As shown in Fig. 3, our method repeats the following 
three steps: 
1) Calculate sense-vs.-clue correlation matrices both 
for the full set of senses and for a set of senses ex-
cluding each of these senses. 
2) Calculate similarities between senses on the basis 
of distribution patterns and subordinate distribution 
patterns. 
3) Merge each pair of senses with high similarity 
into one. 
The initial set of senses is given as
Σ(x)={{x, y_1}, {x, y_2}, …, {x, y_N}} where x is a target
word in the first language, and y_1, y_2, …, and y_N are
translation equivalents of x in the second language.
Translation equivalents that occur less frequently in the
second-language corpus can be excluded from the initial
set to shorten the processing time. The details of the
steps are described in the following sections.

[Figure 2. Distribution Patterns for Some Senses of
“promotion”. Panel (a): distribution patterns. Panel (b):
distribution patterns subordinate to
{promotion, 宣伝<SENDEN>}. Horizontal axis: clue; vertical
axis: correlation (0 to 7). Plotted series:
{promotion, 宣伝<SENDEN>},
{promotion, プロモーション<PUROMOUSHON>},
{promotion, 売り込み<URIKOMI>}. Clues on the horizontal axis:
Acclaim, ad campaign, affirmative, analyst say, Audi,
Batman, brand, Burger King, career, cereal, Coca-Cola,
Conrail, Coors Light, Cyrk, discrimination, employee, film,
General, Hispanic, industry, job, label, last year,
management.]
3.2  Calculation of sense-vs.-clue correlation 
matrices 
First, a sense-vs.-clue correlation matrix is calculated
for the full set of senses. The resulting correlation
matrix is denoted as C. That is, C(i, j) is the correlation
between the i-th sense S(x, i) of a target word x and its
j-th clue x'(j).
Then a set of active senses, Σ_A(x), is determined. A
sense is regarded as active if and only if the ratio of
clues with which it has the highest correlation exceeds a
predetermined threshold θ (in the experiment in Section 4,
θ was set to 0.05). That is,

\[ \Sigma_A(x) = \{\, S(x,i) \mid R(S(x,i)) > \theta \,\}, \]

where R(S(x, i)) denotes the ratio of clues having the
highest correlation with S(x, i), i.e.,

\[ R(S(x,i)) = \frac{\bigl|\{\, x'(j) \mid C(i,j) = \max_k C(k,j) \,\}\bigr|}{\bigl|\{\, x'(j) \,\}\bigr|}. \]

Thus Σ_A(x) consists of senses of the target word x that
are relevant to the corpus.
Finally, a sense-vs.-clue correlation matrix is calculated
for the set of senses excluding each of the active senses.
The correlation matrix calculated for the set of senses
excluding the k-th sense is denoted as C_{-k}. That is,
C_{-k}(i, j) (i≠k) is the correlation between the i-th
sense and the j-th clue that is calculated excluding the
k-th sense. C_{-k}(k, j) (j=1, 2, ...) are set to zero.
This redundant k-th row is included to maintain the same
correspondence between rows and senses as in C.
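
As an illustration of the two bookkeeping operations in this section, the following Python sketch determines the active senses from a correlation matrix and pads a matrix computed without the k-th sense with a zero row. The list-of-lists representation, the function names, and the toy matrix are assumptions made for this example.

```python
def active_senses(corr, theta=0.05):
    """Indices i of senses with R(S(x, i)) > theta.

    corr[i][j] is the correlation between sense i and clue j. A clue counts
    for every sense that attains the column maximum, as in the definition of R.
    """
    n_clues = len(corr[0])
    wins = [0] * len(corr)
    for j in range(n_clues):
        column = [row[j] for row in corr]
        top = max(column)
        for i, value in enumerate(column):
            if value == top:
                wins[i] += 1
    return [i for i, w in enumerate(wins) if w / n_clues > theta]


def pad_excluded_row(corr_reduced, k):
    """Insert a zero row at index k so that rows of C_{-k} line up with rows of C."""
    n_clues = len(corr_reduced[0])
    return corr_reduced[:k] + [[0.0] * n_clues] + corr_reduced[k:]


# Hypothetical correlation matrix: three senses (rows) and four clues (columns).
C = [[2.0, 1.5, 0.1, 0.0],
     [0.2, 0.3, 1.8, 1.1],
     [0.1, 0.0, 0.0, 0.2]]
active = active_senses(C)
padded = pad_excluded_row([[1.0, 2.0], [3.0, 4.0]], 1)
print(active)
print(padded)
```

In the toy matrix the third sense never has the highest correlation with any clue, so its ratio is zero and it is treated as irrelevant to the corpus.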
3.3  Calculation of sense similarity matrix 
Similarity of the i-th sense S(x, i) to the j-th sense S(x, 
j), Sim(S(x, i), S(x, j)), is defined as the similarity of 
the distribution pattern for S(x, i) subordinate to S(x, j) 
to the distribution pattern of S(x, j). Note that this 
similarity is asymmetric and reflects which sense is 
more dominant in the corpus. It is probable that 
Sim(S(x, i), S(x, j)) is large but Sim(S(x, j), S(x, i)) is 
not when S(x, j) is more dominant than S(x, i). 
According to the sense-vs.-clue correlation matrix, 
each sense is characterized by a weighted set of clues. 
Therefore, we used the weighted Jaccard coefficient as 
the similarity measure. That is, 
\[
Sim(S(x,i), S(x,j)) =
\begin{cases}
\dfrac{\sum_k \min\{ C_{-j}(i,k),\ C(j,k) \}}{\sum_k \max\{ C_{-j}(i,k),\ C(j,k) \}} & \text{when } S(x,j) \in \Sigma_A(x), \\[2ex]
0 & \text{otherwise.}
\end{cases}
\]
It should be noted that a sense is characterized by
different weighted sets of clues depending on the sense
to which its similarity is calculated. Note also that
inactive senses are neglected because they are not reliable.
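
The asymmetric similarity can be sketched in Python as below. The representation of the matrices (C as a list of rows, and a dictionary mapping each excluded sense j to its matrix C_{-j}) and the toy values are assumptions for illustration.

```python
def weighted_jaccard(u, v):
    """Weighted Jaccard coefficient of two non-negative weight vectors."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den > 0 else 0.0


def sense_similarity(i, j, corr_full, corr_excl, active):
    """Sim(S(x, i), S(x, j)): row i of C_{-j} compared with row j of C.

    corr_excl[j] is the matrix calculated with sense j excluded.
    The similarity is zero unless sense j is active.
    """
    if j not in active:
        return 0.0
    return weighted_jaccard(corr_excl[j][i], corr_full[j])


# Hypothetical matrices: three senses, three clues; sense 0 is the dominant one.
C = [[4.0, 2.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 3.0]]
C_excl = {0: [[0.0, 0.0, 0.0],
              [3.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]]}
active = {0, 2}
sim_1_to_0 = sense_similarity(1, 0, C, C_excl, active)
sim_0_to_1 = sense_similarity(0, 1, C, C_excl, active)
print(sim_1_to_0, sim_0_to_1)
```

Once the dominant sense 0 is excluded, sense 1 takes over its clues, so its subordinate distribution pattern closely matches the pattern of sense 0 (5/6 here), while the similarity in the opposite direction is zero because sense 1 is not active.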
3.4  Merging similar senses 
The set of senses is updated by merging every pair of 
mutually most-similar senses into one. That is, 
\[ \Sigma(x) \leftarrow \Sigma(x) - \{ S(x,i), S(x,j) \} + \{ S(x,i) \cup S(x,j) \} \]

if

\[ Sim(S(x,i), S(x,j)) = \max_{j'} \max\{ Sim(S(x,i), S(x,j')),\ Sim(S(x,j'), S(x,i)) \}, \]
\[ Sim(S(x,i), S(x,j)) = \max_{i'} \max\{ Sim(S(x,i'), S(x,j)),\ Sim(S(x,j), S(x,i')) \}, \]

and

\[ Sim(S(x,i), S(x,j)) > \sigma. \]

Here σ is a predetermined threshold for similarity,
which is introduced to avoid noisy pairs of senses
being merged. In the experiment in Section 4, σ was set
to 0.25.
If at least one pair of senses is merged, the whole
procedure, i.e., from the calculation of sense-vs.-clue
correlation matrices through the merging of similar
senses, is repeated for the updated set of senses.
Otherwise, the clustering procedure terminates.
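
One merging cycle can be sketched in Python as follows. This is an illustrative reading of the merging condition, not the original code: the function name, the use of frozensets for senses, and the toy similarity matrix are assumptions.

```python
def merge_step(senses, sim, sigma=0.25):
    """Merge every pair of mutually most-similar senses.

    senses: list of frozensets of words; sim[i][j] = Sim(S(x, i), S(x, j)).
    Returns (updated list of senses, True if at least one pair was merged).
    """
    n = len(senses)

    def pair_score(a, b):
        # Larger of the two directed similarities between senses a and b.
        return max(sim[a][b], sim[b][a])

    merged, used = [], set()
    for i in range(n):
        for j in range(n):
            if i == j or i in used or j in used:
                continue
            s = sim[i][j]
            if s <= sigma:
                continue
            best_for_i = max(pair_score(i, b) for b in range(n) if b != i)
            best_for_j = max(pair_score(a, j) for a in range(n) if a != j)
            if s == best_for_i and s == best_for_j:
                merged.append(senses[i] | senses[j])
                used.update((i, j))
    rest = [s for idx, s in enumerate(senses) if idx not in used]
    return rest + merged, bool(merged)


# Hypothetical similarity matrix for three single-word senses of "promotion".
senses = [frozenset({"senden"}), frozenset({"puromoushon"}), frozenset({"shoushin"})]
sim = [[0.0, 0.1, 0.0],
       [0.6, 0.0, 0.0],
       [0.05, 0.0, 0.0]]
updated, changed = merge_step(senses, sim)
print(updated, changed)
```

The caller would repeat the whole cycle (correlation matrices, similarity matrix, merging) on the updated set until `changed` is false.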
Agglomerative clustering methods usually suffer from the
problem of when to terminate merging. In our method
described above, the similarity of senses that are merged
into one does not necessarily decrease monotonically,
which makes the problem more difficult. At present, we are
forced to output a dendrogram that represents the history
of mergers and leave the final decision to humans. The
dendrogram consists of translation equivalents that are
included in active senses in the final cycle. Other
translation equivalents are rejected as they are
irrelevant to the corpus.

[Figure 3. Flow Diagram of Proposed Method. Inputs:
comparable corpus and bilingual dictionary. Flow: initial
set of senses → calculate correlations between senses and
clues → sense-vs.-clue correlation matrices → calculate
similarities between distribution patterns → sense
similarity matrix → merge similar senses → updated set of
senses, which is fed back into the calculation of
correlations.]
4  Experimental Evaluation 
4.1  Experimental settings 
Our method was evaluated through an experiment us-
ing a Wall Street Journal corpus (189 Mbytes) and a 
Nihon Keizai Shimbun corpus (275 Mbytes). 
First, pairs of related words, which we restricted to
nouns and unknown words, were collected from each corpus
by extracting pairs of words co-occurring in a window,
calculating the mutual information of each pair of words,
and selecting pairs with mutual information larger than a
threshold. The size of the window was 25 words excluding
function words, and the threshold for mutual information
was set to zero. Second, a bilingual dictionary was prepared by
collecting pairs of nouns that were translations of one 
another from the Japan Electronic Dictionary Research 
Institute (EDR) English-to-Japanese and Japa-
nese-to-English dictionaries. The resulting dictionary 
includes 633,000 pairs of 269,000 English nouns and 
276,000 Japanese nouns. 
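
The extraction of related word pairs can be sketched in Python as below. The paper specifies only the window size and the threshold; the way probabilities are estimated here (relative frequencies of words and of co-occurring pairs), the forward-looking window, the function name, and the toy documents are assumptions of this sketch. The input is assumed to be already restricted to content words.

```python
import math
from collections import Counter


def related_pairs(docs, window=25, threshold=0.0):
    """Return {(a, b): MI} for word pairs whose mutual information exceeds threshold.

    docs: list of token lists. Two words co-occur when they fall within a span
    of `window` consecutive tokens. Pairs are stored in alphabetical order.
    """
    word_count, pair_count = Counter(), Counter()
    for tokens in docs:
        word_count.update(tokens)
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + window]:
                if a != b:
                    pair_count[tuple(sorted((a, b)))] += 1
    n_words = sum(word_count.values())
    n_pairs = sum(pair_count.values())
    result = {}
    for (a, b), c in pair_count.items():
        p_ab = c / n_pairs
        p_a, p_b = word_count[a] / n_words, word_count[b] / n_words
        mi = math.log2(p_ab / (p_a * p_b))
        if mi > threshold:
            result[(a, b)] = mi
    return result


# Hypothetical miniature corpus; window=2 pairs each word with its right neighbor.
docs = [["stock", "market", "price"], ["stock", "market"],
        ["film", "actor"], ["film", "actor", "price"]]
all_pairs = related_pairs(docs, window=2)
strong_pairs = related_pairs(docs, window=2, threshold=2.5)
print(sorted(strong_pairs))
```

Raising the threshold keeps only the pairs that co-occur more often than the rest, which is the effect the mutual information filter is meant to have on a real corpus.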
Evaluating the performance of word sense acquisi-
tion methods is not a trivial task. First, we do not have 
a gold-standard sense inventory. Even if we have one, 
we have difficulty mapping acquired senses onto those 
in it. Second, there is no way to establish the complete 
set of senses appearing in a large corpus. Therefore, 
we evaluated our method on a limited number of target 
words as follows. 
We prepared a standard sense inventory by select-
ing 60 English target words and defining an average of 
3.4 senses per target word manually. The senses were 
rather coarse-grained; i.e., they nearly corresponded to 
groups of translation equivalents within the entries of 
everyday English-Japanese dictionaries. We then sam-
pled 100 instances per target word from the Wall 
Street Journal corpus, and we sense-tagged them 
manually. Thus, we estimated the ratios of the senses 
in the training corpus for each target word. 
We defined two evaluative measures, recall of 
senses and accuracy of sense definitions. The recall of 
senses is the proportion of senses with ratios not less 
than a threshold that are successfully extracted, and it 
varies with change of the threshold. We judged that a 
sense was extracted, when it shared at least one trans-
lation equivalent with some active sense in the final 
cycle. 
To evaluate the accuracy of sense definitions while 
avoiding mapping acquired senses onto those in the 
standard sense inventory, we regard a set of senses as a 
set of pairs of synonymous translation equivalents. Let
T_S be a set consisting of pairs of translation
equivalents belonging to the same sense in the standard
sense inventory. Likewise, let T(k) be a set consisting of
pairs of translation equivalents belonging to the same
active sense in the k-th cycle. Further, let U be a set of
pairs of translation equivalents that are included in
active senses in the final cycle. Recall and precision of
pairs of synonymous translation equivalents in the k-th
cycle are defined as:

\[ R(k) = \frac{|T_S \cap T(k)|}{|T_S \cap U|}, \qquad
   P(k) = \frac{|T_S \cap T(k)|}{|T(k)|}. \]

Further, F-measure of pairs of synonymous translation
equivalents in the k-th cycle is defined as:

\[ F(k) = \frac{2 \cdot R(k) \cdot P(k)}{R(k) + P(k)}. \]
The F-measure indicates how well the set of active
senses coincides with the set of sense definitions in the
standard sense inventory. Although the current
method cannot determine the optimum cycle, humans 
can identify the set of appropriate senses from a hier-
archy of senses at a glance. Therefore, we define the 
accuracy of sense definitions as the maximum 
F-measure in all cycles. 
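
The pair-based measures can be illustrated with a short Python sketch. The function names and the toy inventories are hypothetical, and U is interpreted here as all pairs that can be formed from the translation equivalents surviving in the final cycle, which is an assumption about the definition.

```python
from itertools import combinations


def synonym_pairs(senses):
    """All unordered pairs of translation equivalents that share a sense."""
    return {frozenset(p) for s in senses for p in combinations(sorted(s), 2)}


def pair_scores(standard, acquired_k, final_active):
    """Return (R(k), P(k), F(k)) for the senses acquired in cycle k."""
    t_s = synonym_pairs(standard)
    t_k = synonym_pairs(acquired_k)
    # U: pairs of translation equivalents included in active senses in the final cycle.
    survivors = set().union(*final_active)
    u = {frozenset(p) for p in combinations(sorted(survivors), 2)}
    hits = len(t_s & t_k)
    recall = hits / len(t_s & u) if t_s & u else 0.0
    precision = hits / len(t_k) if t_k else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


# Hypothetical inventories with placeholder translation equivalents a to e.
standard = [{"a", "b", "c"}, {"d", "e"}]
acquired_k = [{"a", "b"}, {"c", "d"}, {"e"}]
final_active = [{"a", "b", "c", "d"}]  # "e" was rejected as corpus-irrelevant
r, p, f = pair_scores(standard, acquired_k, final_active)
print(r, p, f)
```

The accuracy of sense definitions for a target word would then be the maximum of F(k) over all cycles k.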
4.2  Experimental results 
To simplify the evaluation procedure, we clustered 
translation equivalents that were used to define the 
senses of each target word in the standard sense in-
ventory, rather than clustering translation equivalents 
rendered by the EDR bilingual dictionary. The recall 
of senses for totally 201 senses of the 60 target words 
was: 
96% for senses with ratios not less than 25%, 
87% for senses with ratios not less than 5%, and 
78% for senses with ratios not less than 1%. 
The accuracy of sense definitions, averaged over the 
60 target words, was 77%. 
The computational efficiency of our method 
proved to be acceptable. It took 13 minutes per target 
word on a HP9000 C200 workstation (CPU clock: 200 
MHz, memory: 32 MB) to produce a hierarchy of 
clusters of translation equivalents. 
 
Some clustering results are shown in Fig. 4. These
demonstrate that our proposed method shows a great deal
of promise. At the same time, evaluating the results
revealed its deficiencies. The first of these lies in the
crucial role of the bilingual dictionary. A sense is
obviously never extracted if the translation equivalents
representing it are not included in the dictionary. An
exhaustive bilingual dictionary is therefore required.
From this point of view, the EDR bilingual dictionary is
fairly good. The second deficiency is that the method
performs badly for low-frequency or non-topical senses.
For example, the sense of “bar” as the “legal profession”
was clearly extracted, but its sense as a “piece of solid
material” was not extracted.
We also compared our method with two alterna-
tives: monolingual distributional clustering mentioned 
in Section 2.2 and naive translingual clustering men-
tioned in Section 2.3. Figures 5(a), (b), and (c) show 
respective examples of clustering obtained by our 
method, the monolingual method, and the naive 
translingual method. Comparing (a) with (b) reveals 
the superiority of the translingual approach to the 
monolingual approach, and comparing (a) with (c) 
reveals the effectiveness of the subordinate distribu-
tion pattern introduced in Section 2.3. Note that delet-
ing the corpus-irrelevant translation equivalents from 
the dendrograms in both (b) and (c) would not result in 
appropriate ones. 
5  Discussion 
Our method has several practical advantages. One of 
these is that it produces a corpus-dependent inventory 
of word senses. That is, the resulting inventory covers 
most senses relevant to a domain, while it excludes 
senses irrelevant to the domain. 
Second, our method unifies word sense acquisition 
with word sense disambiguation. The sense-vs.-clue 
correlation matrix is originally used for word sense 
disambiguation. Therefore, our method guarantees that 
acquired senses can be distinguished by machines, and 
further it demonstrates the possibility of automatically 
optimizing the granularity of word senses. 
[Figure 4. Examples of Clustering. Each entry gives the
[target word] followed by the resulting dendrogram; the
English equivalent (other than the target word) of each
translation equivalent is shown in parentheses.]

[association]
┌────関係<KANKEI> (relation)
┌──┴────交際<KOUSAI> (friendship)
┤          ┌─提携<TEIKEI> (cooperation)
│┌────┴─関連<KANREN> (relation)
└┤┌─────共同<KYOUDOU> (cooperation)
└┤┌────連合<RENGOU> (federation)
└┤┌───組合<KUMIAI> (society)
└┤┌──協会<KYOUKAI> (society)
└┤┌──会<KAI> (society)
└┴─団体<DANTAI> (organization)

[bar]
┌────売り場<URIBA> (shop)
┌─┤┌─カウンター<KAUNTAA> (counter)
│  └┴────バー<BAA> (saloon)
┤    ┌────障害<SHOUHEKI> (obstacle)
│┌─┴────格子<KOUSHI> (lattice)
└┤┌─────法曹<HOUSOU> (legal profession)
└┤┌───弁護士<BENGOSHI> (lawyer)
└┴────法廷<HOUTEI> (law court)

[discipline]
┌─訓練<KUNREN> (training)
┌─┴─学科<GAKKA> (subject of study)
┌┤  ┌─学問<GAKUMON> (learning)
┤└─┴─教科<KYOUKA> (subject of study)
│┌───秩序<CHITSUJO> (order)
└┤  ┌─規制<KISEI> (regulation)
│┌┴─懲罰<CHOUBATSU> (punishment)
└┤┌─統制<TOUSEI> (control)
└┴─規律<KIRITSU> (order)

[measure]
┌──尺度<SHAKUDO> (gauge)
┌───┤┌──量<RYOU> (quantity)
│      └┴─指数<SHISUU> (index)
┤  ┌────手段<SHUDAN> (means)
│┌┴────対策<TAISAKU> (counter plan)
└┤  ┌───基準<KIJUN> (standard)
└─┤┌──法令<HOUREI> (law)
└┤┌─議案<GIAN> (bill)
└┴─法案<HOUAN> (bill)

[promotion]
┌─────登用<TOUYOU> (elevation)
┌─┴─────昇進<SHOUSHIN> (advancement)
┤┌────売り込み<URIKOMI> (sale)
└┤┌─プロモーション<PUROMOUSHON> (advertising campaign)
└┴─────宣伝<SENDEN> (advertisement)

[traffic]
┌──商業<SHOUGYOU> (commerce)
┌┤┌─取引<TORIHIKI> (trade)
┤└┴─売買<BAIBAI> (bargain)
│┌──通行<TSUUKOU> (passage)
└┤┌─交通<KOUTSUU> (transport)
└┴─運輸<UNYU> (transport)

Some limitations of the present method are discussed in
the following with possible future extensions. First, our
method produces a hierarchy of clusters but cannot produce
a set of disjoint clusters. It is very important to
terminate merging senses autonomously during an
appropriate cycle. Comparing distribution patterns (not
subordinate ones) may be useful to terminate merging;
senses characterized by complementary distribution
patterns should not be merged.
Second, the present method assumes that each translation
equivalent represents one and only one sense of the target
word, but this is not always the case. A Japanese Katakana
word resulting from transliteration of an English word
sometimes represents multiple senses of the English word.
It is necessary to detect and split translation
equivalents representing more than one sense of the
target word.
Third, not only are acquired senses rather 
coarse-grained but also generic senses are difficult to 
acquire. One of the reasons for this may be that we 
rely on co-occurrence in the window. The fact that 
most distributional word clustering methods use syn-
tactic co-occurrence suggests that it is the most effec-
tive tool for extracting pairs of related words. 
6  Conclusion 
We presented a translingual distributional word
clustering method that enables word senses, or more
precisely a hierarchy of clusters of translation
equivalents, to be acquired from a comparable corpus and a
bilingual dictionary. Its effectiveness was demonstrated through an 
experiment using Wall Street Journal and Nihon Kei-
zai Shimbun corpora and the EDR bilingual dictionary. 
The recall of senses was 87% for senses whose ratios 
in the corpus were not less than 5%, and the accuracy 
of sense definitions was 77%. 
Acknowledgments: This research was supported by 
the New Energy and Industrial Technology Develop-
ment Organization of Japan. 
References 
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della 
Pietra, and Robert L. Mercer. 1991. Word-sense disam-
biguation using statistical methods. In Proceedings of 
the 29th Annual Meeting of the Association for Com-
putational Linguistics, pages 264-270.  
Dagan, Ido and Alon Itai. 1994. Word sense disambigua-
tion using a second language monolingual corpus. 
Computational Linguistics, 20(4): 563-596. 
Fukumoto, Fumiyo and Junichi Tsujii. 1994. Automatic 
recognition of verbal polysemy. In Proceedings of the 
15th International Conference on Computational Lin-
guistics, pages 762-768. 
Grefenstette, Gregory. 1994. Explorations in Automatic 
Thesaurus Discovery. Kluwer Academic Publishers, 
Boston. 
Hindle, Donald. 1990. Noun classification from predi-
cate-argument structures. In Proceedings of the 28th 
Annual Meeting of the Association for Computational 
Linguistics, pages 268-275. 
Ide, Nancy and Jean Veronis. 1998. Introduction to the 
special issue on word sense disambiguation: The state 
of the art. Computational Linguistics, 24(1): 1-40. 
Kaji, Hiroyuki and Yasutsugu Morimoto. 2002. Unsuper-
vised word sense disambiguation using bilingual com-
parable corpora. In Proceedings of the 19th Interna-
tional Conference on Computational Linguistics, pages 
411-417. 
Pantel, Patrick and Dekang Lin. 2002. Discovering word 
senses from text. In Proceedings of the 8th ACM 
SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pages 613-619. 
Pereira, Fernando, Naftali Tishby, and Lillian Lee. 1993. 
Distributional clustering of English words. In Proceed-
ings of the 31st Annual Meeting of the Association for 
Computational Linguistics, pages 183-190. 
Schuetze, Hinrich. 1998. Automatic word sense dis-
crimination. Computational Linguistics, 24(1): 97-124. 
[Figure 5. Comparison with Alternatives. Dendrograms for
the target word “race”; the English equivalent of each
translation equivalent is shown in parentheses.]

(a) Proposed method
[race]
   ┌───競輪<KEIRIN> (cycle race)
 ┌┤┌──競馬<KEIBA> (horse race)
 ┤└┴─レース<REESU> (competition)
 │┌───国民<KOKUMIN> (nation)
 └┤┌──民族<MINZOKU> (ethnic)
   └┴──人種<JINSHU> (human race)

(b) Monolingual distributional clustering
[race]
┌─────────<HI> (match)
┌────┴────────<KYOUSOU> (competition)
┌─┤                    ┌─レース<REESU> (competition)
┌─┤  └──────────┴──競馬<KEIBA> (horse race)
│  │          ┌─────────人種<JINSHU> (human race)
│  └─────┤    ┌──────疾走<SHISSOU> (scamper)
│              └──┴────競り合い<SERIAI> (competition)
┤                  ┌───────国民<KOKUMIN> (nation)
│          ┌───┴───────品格<HINKAKU> (dignity)
│┌────┴───────────競輪<KEIRIN> (cycle race)
││                        ┌───特徴<TOKUCHOU> (feature)
└┤          ┌──────┴───特性<TOKUSEI> (character)
│      ┌─┴──────────品種<HINSHU> (kind)
│  ┌─┤            ┌─────風味<FUUMI> (flavor)
└─┤  └──────┴─────水路<SUIRO> (waterway)
│                  ┌────民族<MINZOKU> (ethnic)
└─────────┴────用水<YOUSUI> (water for irrigation)

(c) Naive translingual distributional clustering
[race]
┌──────────────競馬<KEIBA> (horse race)
┌─┴─────────────レース<REESU> (competition)
┌┤      ┌────────────人種<JINSHU> (human race)
│└───┴────────────民族<MINZOKU> (ethnic)
│                          ┌─競り合い<SERIAI> (competition)
│                        ┌┤┌──疾走<SHISSOU> (scamper)
│                ┌───┤└┴──品格<HINKAKU> (dignity)
│                │      └─────<HI> (match)
┤          ┌──┤    ┌─────水路<SUIRO> (waterway)
│          │    │┌─┴─────特性<TOKUSEI> (character)
│      ┌─┤    └┤┌──────<KYOUSOU> (competition)
│      │  │      └┴──────特徴<TOKUCHOU> (feature)
│  ┌─┤  └───────────風味<FUUMI> (flavor)
│  │  │      ┌─────────競輪<KEIRIN> (cycle race)
└─┤  │  ┌─┴─────────品種<HINSHU> (kind)
│  └─┴───────────用水<YOUSUI> (water for irrigation)
│
└───────────────国民<KOKUMIN> (nation)
