Resolving Translation Ambiguity 
using Non-parallel Bilingual Corpora 
Genichiro KIKUI 
NTT Cyberspace Laboratories 
1-1 Hikarinooka, Yokosuka-shi, Kanagawa, 239-0847, JAPAN 
e-mail: kikui~isl.ntt.co.jp 
Abstract 
This paper presents an unsupervised method for 
choosing the correct translation of a word in con- 
text. It learns disambiguation information from non- 
parallel bilinguM corpora (preferably in the same do- 
main) free from tagging. 
Our method combines two existing unsupervised 
disambiguation algorithms: a word sense disam- 
biguation algorithm based on distributional cluster- 
ing and a translation disambiguation algorithm us- 
ing target language corpora. 
For the given word in context, the former algo- 
rithm identifies its meaning as one of a number of 
predefined usage classes derived by clustering a large 
amount of usages in the source language corpus. The 
latter algorithm is responsible for associating each 
usage class (i.e., cluster) with a target word that is 
most relevant to the usage. 
This paper also shows preliminary results of trans- 
lation experiments. 
1 Introduction 
Choosing the correct translation of a content word 
in context, referred to as "translation disambigua- 
tion (of content word)", is a key task in machine 
translation. It is also crucial in cross-language text 
processing including cross-language information re- 
trieval and abstraction. 
Due to the recent availability of large text corpora, 
various statistical approaches have been tried includ- 
ing using 1) parallel corpora (Brown et al., 1990), 
(Brown et al., 1991), (Brown, 1997), 2) non-parallel 
bilingual corpora tagged with topic area (Yamabana 
et al., 1998) and 3) un-tagged mono-language cor- 
pora in the target language (Dagan and Itai, 1994), 
(Tanaka and Iwasaki, 1996), (Kikui, 1998). 
A problem with the first two approaches is that 
it is not easy to obtain sufficiently large parallel or 
manually tagged corpora for the pair of languages 
targeted. 
Although the third approach eases the problem of 
preparing corpora, it suffers from a lack of useful 
information in the source language. For example, 
suppose the proper name, "Dodgers", provides good 
context to identify the usage of "hit" in the training 
corpus in English. If the translation of "Dodgers" 
rarely occurs in the target language corpora, it does 
not contribute to target word selection. 
The method presented in this paper solves this 
problem by choosing the target word that corre- 
sponds to the usage identified in the source language 
corpora. This method is totally unsupervised in 
the sense that it acquires disambiguation informa- 
tion from non-parallel bilingual corpora (preferably 
in the same domain) free from tagging. 
It combines two unsupervised disambiguation al- 
gorithms: one is the word sense disambiguation al- 
gorithm based on distributional clustering(Schuetze, 
1997) and the other is the translation disambigua- 
tion algorithm using target language corpora(Kikui, 
1998). For the given word in context, the former 
algorithm identifies its usage as one of several pre- 
defined usage classes derived by clustering a large 
amount of usages in the source language corpus. The 
latter algorithm is responsible for associating each 
usage class (i.e., cluster) with a target word that 
best expresses the usage. 
The following sections are organized as follows. 
In Section 2, we overview the entire method. The 
following two sections (i.e., Section 3 and 4) then 
introduce the two major components of the method 
including the two unsupervised disambiguation algo- 
rithms. Section 5 and 6 are devoted respectively to 
a preliminary evaluation and discussions on related 
research. 
2 Overview of the method 
Figure 1 shows an overview of the entire method. 
Components inside the dotted line on the left 
represent word-sense disambiguation (WSD) in the 
source language. There are two sub-processes: dis- 
tributional clustering and categorizing. The former 
automatically identifies different usages (or senses) 
of the given source word (shown at top center) in the 
31 
\[Source Language i so.ur.cl,),_.w,or,,d iBlllngual ~ FTarget Language~ ........ ~orils~.._ ...... ~'..iu~ L~J~Corpus J 
Unsupervised I Distributional \] Sense.Translation Linker I word-sense Clustering 
dlsambi'guatlOnin the source Vl I .... T ermllst-!rans?atlon .... JI 
language \[ ................................................... ~" : Unsupervised ~11 i ~ ~ "'" ~r~P'1~""1" ' translation disambiguatlon\]l 
t ........................ Jl ~._2..~_8_e.'2 
.'2."2 ~_~2 ......... .'?222.:."_.J sense-tra ns~latll on ta ble 
~., Categorization I 
...................... .............. Input 
word / sense-2 in context 
(e.g., "suit ... ") "'..,i..,....~ accuses .... able 
Sense Translation \] 
sense-1 ~--'Y(¢|othes) 
sense-2 m~J(lewsult) 
/ Look up ~ M~lJ(iawsult) / 
Figure 1: Overview of the disambiguation method 
source language corpus and creates a profile, referred 
to as the "sense profile" for each class. The catego- 
rization process chooses the profile most relevant to 
the input word whose sense is implicitly given by its 
surrounding context. 
Located to the right is what we call the sense.. 
translation linking process. It is responsible for as-- 
sociating each semantic profile with the most likely 
translation of the source word (for which the seman-. 
tic profile is derived). The result of this process is 
registered in the sense-$ranslation table. 
The table look-up process, bottom center, simply 
retrieves the translation associated with the sense 
identified by the categorization process. 
3 Unsupervised Word Sense 
Disambiguation 
We adopted the unsupervised word-sense disam- 
biguation (WSD) algorithm based on distributional 
clustering (Schuetze, 1997). The underlying idea is 
that the sense of a word is determined by its co- 
occurring words. For example, the word "suit" co- 
occurring with "jacket" and "pants" tends to mean 
a set of clothes, whereas the same word co-occurring 
with "file" and "court" means "lawsuit". 
As stated in Section 2, the WSD algorithm com- 
prises two parts: distributional clustering and cate- 
gorization. 
The former learns the relation between sense and 
co-occurring words in the following steps: 
1. Collecting contexts of the word in the corpus, 
then 
2. Clustering them into small coherent groups 
(clusters). 
Table 1 shows sample contexts surrounding "suit" 
as extracted by the first step from actual news ar- 
ticles. These contexts are expected to be clustered 
into two sets: (1,4) and (2,3) by the second step. 
Since each cluster corresponds to a particular 
sense (or usage) of the word, it is called a "sense 
profile". 
The latter part of the WSD algorithm is responsi- 
ble for choosing the cluster "closest" or relevant to a 
new context of the same word (in this case, "suit"). 
The selected cluster is the "sense" in the new con- 
text. 
3.1 Distributional Clustering 
The above idea is implemented using the multi- 
dimensional vector space derived from word co- 
occurrence statistics in the source language corpus. 
We first map each word, w, in the corpus onto a 
vector, g(w), referred to as the "word vector". A 
word vector is created from a large corpus(Schuetze, 
1997)by the following procedure: 
1. Choose a set of content bearing words (typically 
1,000 most frequent content words). 
32 
Table 1: Contexts surrounding "suit" 
No Sample occurrences of "suit" in context 
1 ..without fear of a libel suit A California court recently .. 
2 ..fitted jacket of the Chanel suit Referring to the push-up bras .. 
3 ..double-breasted dark blue suit Robinson was asked if he ... 
4 .. the remaining suit accuses the hospital of turning... 
. Make a co-occurrence matrix whose (i, j) ele- 
ment corresponds to the occurrence count of 
the j-th content bearing word in the context 
of every occurrence of word-i 1 in the corpus. 
For simplicity we employ the sliding window ap- 
proach where neighboring n words are judged to 
be the context. 
3. Apply singular value decomposition (SVD) to 
the matrix to reduce its dimensionality. 
4. The vector representation of word-/is the i-th 
row vector of the reduced matrix. 
Second, the context of each occurrence of the word 
is also mapped to a multi-dimensional vector, called 
the "context vector". The context vector is the 
sum of every word vector within the context (again, 
neighboring n words) weighted by its idf score. For- 
mally, the context vector cxt of word set W is defined 
as follows: 
c~t(W) = ~ idf~og(w) (1) 
toEW 
idy  = lOg( ) (2) 
N = the total number of (3) 
documents in the collection 
N~ = the number of documents (4) 
containing w. 2 
Finally, derived context vectors are clustered by 
applying a clustering algorithm. We used the group- 
average agglomerative clustering algorithm called 
Buckshot(Cutting et al., 1992). In this algorithm, 
the proximity, proz, between two context vectors, 
~, b is measured by cosine of the angle between these 
two vectors: 
pro,( , g) = b')/(I II b'l) (5) 
IEvery word (type) is assigned a sequential id number. 
~The document unit may be a paragraph, a text, an article 
etc. 
Since we hypothesize that each translation alter- 
native corresponds to at least one usage of the source 
word, the number of clusters is determined to be 
the number of the translation alternatives plus some 
fixed number (e.g., 3). 
3.2 Categorization 
The task of this step is to determine which seman- 
tic profile (i.e., context cluster) is "closest" to the 
word in a new context. In this step, the "closeness" 
between a semantic profile and the context is mea- 
sured by the proximity, defined by (5), between the 
former vector and the representative vector of the 
latter called the sense vector. The sense vector of a 
sense profile is the centroid of all the context vectors 
in the cluster. 
Unlike the original algorithm, we used only a por- 
tion (e.g., 70%) of the context vectors closest to the 
centroid for computing the sense vector since these 
central vectors contain less noise in terms of repre- 
senting the cluster(Cutting et al., 1992). 
4 Linking Sense to Translation 
The WSD algorithm introduced in the previous sec- 
tion represents the sense of a given word, w, as a 
cluster of contexts (i.e., co-occurring words) in the 
source language. If each cluster is associated with 
one translation, then the result of the WSD can di- 
rectly be maped to the translation. 
Our method for associating each cluster with a 
translation consists of the following two major steps: 
1. Extracting characteristic words from the clus- 
ter, then 
2. Applying the termlist translation (disambigua- 
tion) algorithm(Kikui, 1998) to the list of words 
consisting of these characteristic words and the 
given word, w. 
The termlist translation algorithm employed in 
the second step chooses the translation, from pos- 
sible translation alternatives, that is most relevant 
to the context formed by the entire input (i.e., word 
33 
Table 2: Characteristics words of sense profiles for 
"suit" 
sense id characteristic words 
1 wearing(34.6), blue(28.5), designer(21.6), 
white(21.0), dark(20.7), shoes(20.6), 
hat(18.5), shirt(17.6), ... 
fees(51.3), defendants(47.6), filed(45.6), 
court(43.8), District Court(35.8), 
fund(33.2) .... 
list). Thus, the second step is expected to trans- 
late the given source word w into the target word 
relevant to the sense represented by the cluster. 
4.1 Extracting Characteristics Words 
We applied IR 3 techniques to extract characteristic 
words as follows. 
1. Let S be a sense profile of a source word w . 
2. Extract central elements (i.e., contexts or con- 
text vectors) of S in the same way as described 
in Section 3.2. 
3. Calculate the tf-idf score for each word in the 
extracted contexts, where tf-idf score of a word 
w is the frequency of w (term frequency) multi- 
plied by the idf value defined by (3) in Section 
3.1. 
4. Choose the topmost m words (m is typically 1.0 
to 20). 
Table 2 shows the output of the above procedure 
applied to two sense profiles for "suit" using the 
training data shown in Section 5.1. 
The extracted words for each cluster are combined 
with the source word to form a term-list. The terra- 
list of length 8 for the first sense in Table 2 is : 
(suit, wearing, blue, designer, white, dark, 
shoes, hat). 
Each term-list is then sent to the term-list transla- 
tion module. The resulting translation of the source 
word is associated with the cluster and stored in the 
sense-translation table, shown in Figure 1. 
4.2 Term-list Translation using Target 
Language Corpus 
The termlist translation algorithm(Kikui, 1998) 
aims at translating a list of words that character- 
ize a consistent text or a concept. It is an unsuper- 
vised algorithm in the sense that it relies only on a 
mono-lingual corpus free from manual tagging. 
3 IR = Information Retrieval 
The algorithm first retrieves all translation al- 
ternatives of each word from a bilingual dictionary 
(Dictionary Lookup), then tries to find the most co- 
herent (or semantically relevant) combination of the 
translation alternatives in the target language cor- 
pus (Disambiguation), detailed as follows: 
1. Dictionary Lookup: 
For each word in the given term-list, all the al- 
ternative translations are retrieved from a bilin- 
gual dictionary. A combination of one transla- 
tion for each input word is calied a translation 
candidate. For example, if the input is (book, 
library), then a translation candidate in French 
is (livre, biblioth~que). 
2. Disambiguation: 
In this step, all possible translation candidates 
are ranked by the 'similarity' score of a candi- 
date. The top ranked candidate is the output 
of the entire algorithm. 
The similarity score of a translation candidate 
(i.e., a set of target words) is defined again by us- 
ing the multi-dimensional vector space introduced 
in Section 3.1. Each target word in the transla- 
tion candidates is, first, mapped to a word vector 
derived from the target language corpus. The simi- 
larity score sim of a set of (target) words W, is the 
average distance of word vectors from the centroid g 
of them as shown below. 
5 
1 sire(W) 
- I W l ~ prox(g(w),g(W)) (6) 
wEW 
= (71 
wEW 
IWI = the number of words inW (8) 
Evaluation and Discussion 
We conducted English-to-Japanese translation ex- 
periments using newspaper articles. The results of 
the proposed algorithm were compared against those 
of the previous algorithm which relies solely on tar- 
get language corpora(Kikui, 1998). 
5.1 Experimental Data 
The bilingual dictionary, from English to Japanese, 
was an inversion of the EDICT(Breen, 1995), a free 
Japanese-to-English dictionary. 
The co-occurrence statistics were extracted from 
the 1994 New York Times (420MB) for English 
and 1994 Mainichi Shinbun (Japanese newspaper) 
(90MB) for Japanese. Note that 100 articles were 
34 
TAble 3: Result of Translation for Test-NYT 
Method success/ambiguous (rate) 
previous 91/120 (75.8%) 
(Target Corpus Only) 
proposed 95/120 (79.1%) 
randomly separated from the former corpus as the 
test set described below. 
Although these two kinds of newspaper articles 
were both written in 1994, topics and contents 
greatly differ. This is because each newspaper pub- 
lishing company edited its paper primarily for do- 
mestic readers. Note that the domains of these texts 
range from business to sports. 
The initial size of each co-occurrence matrix was 
50000-by-1000, where rows and columns correspond 
to the 50,000 and 1000 most frequent words in the 
corpus 4. Each initial matrix was then reduced by 
using SVD into a matrix of 50000-by-100 using SVD- 
PACKC(Berry et al., 1993). 
Test data, a set of word-lists, were automatically 
generated from the 120 articles from the New York 
Times separated from the training set. A word-list 
was extracted from an article by choosing the top- 
most n words ranked by their tf-idfscores, given in 
Section 5. In the following experiments, we set n to 
6 since it gave the "best" result for this corpus. 
5.2 Results 
In order to calculate success rates, translation out- 
puts were compared against the "correct data" 
which were manually created by removing incorrect 
alternatives from all possible alternatives. If all the 
translation alternatives in the bilingual dictionary 
were judged to be correct, we then excluded them in 
calculating the success-rate. 
The success rates of the proposed method and the 
previous algorithm are shown in Table 3. 
5.3 Discussion 
Although our method produced higher accuracy 
than the previous method, we cannot tell whether or 
not the difference is quantitatively significant. Fur- 
ther experiments with more data might be required. 
From a qualitative view point, the proposed 
method successfully learned useful knowledge for 
choosing the correct target word. An example is 
shown in Table 4. 
One advantage of the proposed method is that 
it is applicable to interactive disambiguation. The 
acquired disambiguation knowledge gives clues for 
4 Stopwords are ignored. 
Table 4: An example of acquired disambiguation 
knowledge 
Significant Word Translation 
in Cluster 
train, suburban, tracks, man, 
suicide, Glendale, brakes, .. 
at-bats, game, homers, 
home, runs, ... 
tsuitotsu 
(collision) 
hitto 
(hit in baseball) 
graceful, Lion King, comedy, hitto 
Jackson, idea (becoming popular) 
choosing a target word in terms of the source lan- 
guage. For example, Table 4 enables English speak- 
ers to choose their preferred translation. 
Another contribution of this research is that it 
gives one criteria for determining the number of clus- 
ters. 
6 Related Work 
Since we have referred to previous work in the area of 
statistical target word selection in Section 1 ((Brown 
et hi., 1990), (Brown et al., 1991), (Brown, 1997), 
(Yamabana et hi., 1998), (Dagan and Itai, 1994), 
(Tanaka and Iwasaki, 1996), (Kikui, 1998)), this sec- 
tion focuses on other related research. 
Fung et al. ((Fung and K., 1997),(Fung and Yee, 
1998)) presented interesting results on bilingual lexi- 
con construction from "comparable corpora", which 
is non-parallel bilingual corpora in the same domain. 
Since their algorithm does not resolve word-sense 
ambiguity in the source language, it would be inter- 
esting to combine unsupervised disambiguation in 
the same way as we did. 
Although we employed the distributional cluster- 
ing algorithm for resolving word sense ambiguity, 
different algorithms are also applicable. Among 
them, the unsupervised algorithm using decision- 
trees (Yarowsky, 1995) has achieved promising per- 
formance. An interesting approach is to use the out- 
put of our sense-translation linking process as the 
"seeds" required by that algorithm. 
7 Concluding Remarks 
This paper presented an unsupervised method for 
choosing the correct translation of a source word 
in context. Preliminary experiments have shown 
that this method achieved 79% success rate. The 
method generates associations between word usages 
and their corresponding translations, which are use- 
ful for interactive machine translation. 
Future directions include extending the proposed 
method with the decision-tree based word-sense dis- 
35 
ambiguation algorithm, and applying it to situations 
in which a reliable bilingual dictionary is not avail- 
able. 
8 Acknowledgment 
A part of this work was done when the author was 
at Center for the Study of Language and Informa- 
tion(CSLI), Stanford University. The author ac- 
knowledges the fruitful discussions with the staff of 
the Computational Semantics Laboratory at CSLI. 
Comments from anonymous reviewers helped to im- 
prove the paper. 

References 

M.W. Berry, T. Do, G. O'Brien, V. Krishna, 
and S. Varadhan. 1993. SVDPACKC USER'S 
GUIDE. Tech. Rep. CS-93-194, University ofTen- 
nessee, Knoxville, TN,. 

J.W. Breen. 1995. EDICT, Freeware, Japanese-to- 
English Dictionary. 

P. Brown, J. Cooke, V. Della Pietra, F. Jelinek, R.L. 
Mercer, and P. C. Roosin. 1990. A statistical 
approach to language translation. Computational 
Linguistics, 16(2):79-85. 

P. Brown, V. Della Pietra, and R.L. Mercer. 1991. 
Word sense disambiguation using statisical meth- 
ods. In Proceedings of ACL-91, pages 264-270. 

R. D. Brown. 1997. Automated dictionary extrac- 
tion for "knowledge-free" example-based transla- 
tion. In Proceedings of Theoretical and Method- 
ological Issues in Machine Translation. 

D. R. Cutting, D. R. Karger, J. O. Pedersen, and 
J.W. Tukey. 1992. Scatter/gather: A cluster- 
based approach to browsing large document col- 
lections. In Proceedings of A CM SIGIR-92. 

I. Dagan and A. Itai. 1994. Word sense disambigua- 
tion using a second language monolingual corpus. 
Computational Linguistics, 20(4):564-596. 

P. Fung and McKeown K. 1997. Finding termi- 
nology translations from non-parallel corpora. In 
Proceedings of the 5th Annual Workshop on Ver~ 
Large Corpora. 

P. Fung and L. Y. Yee. 1998. An ir approach for 
translating new words from nonparallel, compa- 
rable texts. In Proceedings of COLING-A CL-98. 

G. Kikui. 1998. Term-list translation using mono- 
lingual co-occurence vectors. In Proceedings of 
COLING-A CL- 98. 

H. Schuetze. 1997. Ambiguity Resolution in Lan- 
guage Learning. CSLI. 

K. Tanaka and H. Iwasaki. 1996. Extraction of lexi- 
cal translations from non-aligned corpora. In Pro- 
ceedings of COLING-96. 

K. Yamabana, K. Muraki, S. Doi, and S. Kamei. 
1998. A language conversion front-end for cross- 
language information retrieval. In G. Grefen- 
stette, editor, Cross-langauge Information Re- 
trieval, pages 93-104. Kluwer Academic Publish- 
ers. 

D. Yarowsky. 1995. Unsupervised word sense dis- 
ambiguation rivaling supervised methods. In Pro- 
ceedings of ACL-95, pages 189-195. 
