Unsupervised Sense Disambiguation Using Bilingual Probabilistic Models
Indrajit Bhattacharya
Dept. of Computer Science
University of Maryland
College Park, MD,
USA
indrajit@cs.umd.edu
Lise Getoor
Dept. of Computer Science
University of Maryland
College Park, MD,
USA
getoor@cs.umd.edu
Yoshua Bengio
Dept. IRO
Université de Montréal
Montréal, Québec,
Canada
bengioy@IRO.UMontreal.CA
Abstract
We describe two probabilistic models for unsupervised word-sense disambiguation using parallel corpora. The first model, which we call the Sense model, builds on the work of Diab and Resnik (2002) that uses both parallel text and a sense inventory for the target language, and recasts their approach in a probabilistic framework. The second model, which we call the Concept model, is a hierarchical model that uses a concept latent variable to relate different language-specific sense labels. We show that both models improve performance on the word sense disambiguation task over previous unsupervised approaches, with the Concept model showing the largest improvement. Furthermore, in learning the Concept model, as a by-product, we learn a sense inventory for the parallel language.
1 Introduction
Word sense disambiguation (WSD) has been a central question in the computational linguistics community since its inception. WSD is fundamental to natural language understanding and is a useful intermediate step for many other language processing tasks (Ide and Veronis, 1998). Many recent approaches make use of ideas from statistical machine learning; the availability of shared sense definitions (e.g. WordNet (Fellbaum, 1998)) and recent international competitions (Kilgarriff and Rosenzweig, 2000) have enabled researchers to compare their results. Supervised approaches, which make use of a small hand-labeled training set (Bruce and Wiebe, 1994; Yarowsky, 1993), typically outperform unsupervised approaches (Agirre et al., 2000; Litkowski, 2000; Lin, 2000; Resnik, 1997; Yarowsky, 1992; Yarowsky, 1995), but tend to be tuned to a specific corpus and are constrained by the scarcity of labeled data.
In an effort to overcome the difficulty of finding sense-labeled training data, researchers have begun investigating unsupervised approaches to word-sense disambiguation. For example, the use of parallel corpora for sense tagging can help with word sense disambiguation (Brown et al., 1991; Dagan, 1991; Dagan and Itai, 1994; Ide, 2000; Resnik and Yarowsky, 1999). As an illustration of sense disambiguation from translation data, when the English word bank is translated to Spanish as orilla, it is clear that we are referring to the shore sense of bank, rather than the financial institution sense.
The main inspiration for our work is Diab and
Resnik (2002), who use translations and linguistic
knowledge for disambiguation and automatic sense
tagging. Bengio and Kermorvant (2003) present
a graphical model that is an attempt to formalize
probabilistically the main ideas in Diab and Resnik
(2002). They assume the same semantic hierarchy
(in particular, WordNet) for both the languages and
assign English words as well as their translations
to WordNet synsets. Here we present two variants
of the graphical model in Bengio and Kermorvant
(2003), along with a method to discover a cluster
structure for the Spanish senses. We also present
empirical word sense disambiguation results which
demonstrate the gain brought by this probabilistic
approach, even while only using the translated word
to provide disambiguation information.
Our first generative model, the Sense Model, groups semantically related words from the two languages into senses, and translations are generated by probabilistically choosing a sense and then words from the sense. We show that this improves on the results of Diab and Resnik (2002).
Our next model, which we call the Concept
Model, aims to improve on the above sense struc-
ture by modeling the senses of the two languages
separately and relating senses from both languages
through a higher-level, semantically less precise
concept. The intuition here is that not all of the
senses that are possible for a word will be relevant
for a concept. In other words, the distribution over
the senses of a word given a concept can be expected
to have a lower entropy than the distribution over
the senses of the word in the language as a whole.
In this paper, we look at translation data as a resource for identification of semantic concepts. Note
that actual translated word pairs are not always good
matches semantically, because the translation pro-
cess is not on a word by word basis. This intro-
duces a kind of noise in the translation, and an addi-
tional hidden variable to represent the shared mean-
ing helps to take it into account. Improved perfor-
mance over the Sense Model validates the use of
concepts in modeling translations.
An interesting by-product of the Concept Model
is a semantic structure for the secondary language.
This is automatically constructed using background
knowledge of the structure for the primary language
and the observed translation pairs. In the model,
words sharing the same sense are synonyms while
senses under the same concept are semantically re-
lated in the corpus. An investigation of the model
trained over real data reveals that it can indeed
group related words together.
It may be noted that predicting senses from trans-
lations need not necessarily be an end result in it-
self. As we have already mentioned, lack of labeled
data is a severe hindrance for supervised approaches
to word sense disambiguation. At the same time,
there is an abundance of bilingual documents and
many more can potentially be mined from the web.
It should be possible using our approach to (noisily)
assign sense tags to words in such documents, thus
providing huge resources of labeled data for super-
vised approaches to make use of.
For the rest of this paper, for simplicity we will
refer to the primary language of the parallel docu-
ment as English and to the secondary as Spanish.
The paper is organized as follows. We begin by for-
mally describing the models in Section 2. We de-
scribe our approach for constructing the senses and
concepts in Section 3. Our algorithm for learning
the model parameters is described in Section 4. We
present experimental results in Section 5 and our
analysis in Section 6. We conclude in Section 7.
2 Probabilistic Models for Parallel Corpora
We motivate the use of a probabilistic model by illustrating that disambiguation using translations is possible even when a word has a unique translation. For example, according to WordNet, the word prevention has two senses in English, which may be abbreviated as hindrance (the act of hindering or obstruction) and control (by prevention, e.g. the control of a disease). It has a single translation in our corpus, that being prevención. The first English sense, hindrance, also has other words like bar that occur in the corpus, and all of these other words are observed to be translated in Spanish as the word obstrucción. In addition, none of these other words translate to prevención. So it is not unreasonable to suppose that the intended sense for prevention when translated as prevención is different from that of bar. Therefore, the intended sense is most likely to be control. At the very heart of this reasoning are probabilistic analysis and independence assumptions. We are assuming that senses and words have certain occurrence probabilities and that the choice of the word can be made independently once the sense has been decided. This is the flavor that we look to add to modeling parallel documents for sense disambiguation. We formally describe the two generative models that use these ideas in Subsections 2.2 and 2.3.
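Figure 1: Graphical representations of (a) the Sense Model, in which a sense node T generates the word nodes We and Ws, and (b) the Concept Model, in which a concept node C generates the sense nodes Te and Ts, which in turn generate the word nodes We and Ws.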
2.1 Notation
Throughout, we use uppercase letters to denote random variables and lowercase letters to denote specific instances of the random variables. A translation pair is $(W_e, W_s)$, where the subscripts $e$ and $s$ indicate the primary language (English) and the secondary language (Spanish). $W_e \in \{w_{e_1}, \ldots, w_{e_n}\}$ and $W_s \in \{w_{s_1}, \ldots, w_{s_m}\}$. We use the shorthand $P(w_e)$ for $P(W_e = w_e)$.
2.2 The Sense Model
The Sense Model makes the assumption, inspired by ideas in Diab and Resnik (2002) and Bengio and Kermorvant (2003), that the English word $W_e$ and the Spanish word $W_s$ in a translation pair share the same precise sense. In other words, the set of sense labels for the words in the two languages is the same and may be collapsed into one set of senses that is responsible for both English and Spanish words, and the single latent variable in the model is the sense label $T \in \{t_1, \ldots, t_k\}$ for both words $W_e$ and $W_s$. We also make the assumption that the words in both languages are conditionally independent given the sense label. The generative parameters $\theta$ for the model are the prior probability $P(t)$ of each sense $t$ and the conditional probabilities $P(w_e \mid t)$ and $P(w_s \mid t)$ of each word $w_e$ and $w_s$ in the two languages given the sense. The generation of a translation pair by this model may be viewed as a two-step process that first selects a sense according to the priors on the senses and then selects a word from each language using the conditional probabilities for that sense. This may be imagined as a factoring of the joint distribution: $P(W_e, W_s, T) = P(T)\, P(W_e \mid T)\, P(W_s \mid T)$. Note that in the absence of labeled training data, two of the random variables, $W_e$ and $W_s$, are observed, while the sense variable $T$ is not. However, we can derive the possible values for our sense labels from WordNet, which gives us the possible senses for each English word $W_e$. The Sense model is shown in Figure 1(a).
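As an illustration of this factorization, the following minimal Python sketch computes the posterior over senses for one translation pair by normalizing the joint probability of sense, English word and Spanish word. The function name, the dictionary layout and the toy probabilities are assumptions made for this example only; they are not the parameters learned in the paper.

```python
def sense_posterior(prior, p_we, p_ws, w_e, w_s):
    """Return P(t | w_e, w_s) under the Sense Model factorization.

    prior: {t: P(t)}
    p_we:  {(w_e, t): P(w_e | t)}   (missing keys are structural zeros)
    p_ws:  {(w_s, t): P(w_s | t)}
    """
    # Joint P(t) * P(w_e | t) * P(w_s | t) for every sense t.
    joint = {
        t: p_t * p_we.get((w_e, t), 0.0) * p_ws.get((w_s, t), 0.0)
        for t, p_t in prior.items()
    }
    z = sum(joint.values())
    if z == 0.0:
        return {}  # no sense can generate this pair
    return {t: j / z for t, j in joint.items() if j > 0.0}


# Hypothetical toy parameters for the prevention/bar example of Section 2.
prior = {"hindrance": 0.5, "control": 0.5}
p_we = {
    ("bar", "hindrance"): 0.5,
    ("prevention", "hindrance"): 0.5,
    ("prevention", "control"): 1.0,
}
p_ws = {
    ("obstrucción", "hindrance"): 0.5,
    ("prevención", "hindrance"): 0.5,
    ("prevención", "control"): 1.0,
}
posterior = sense_posterior(prior, p_we, p_ws, "prevention", "prevención")
```

With these made-up numbers the pair (prevention, prevención) puts most of its posterior mass on the control sense, because that sense spreads its word probabilities over fewer words.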
2.3 The Concept Model
The assumption of a one-to-one association between sense labels made in the Sense Model may be too simplistic to hold for arbitrary languages. In particular, it does not take into account that translation is from sentence to sentence (with a shared meaning), while the data we are modeling are aligned single-word translations $(W_e, W_s)$, in which the intended meaning of $W_e$ does not always match perfectly with the intended meaning of $W_s$. Generally, a set of $n$ related senses in one language may be translated by one of $m$ related senses in the other. This many-to-many mapping is captured in our alternative model using a second-level hidden variable called a concept. Thus we have three hidden variables in the Concept Model: the English sense $T_e$, the Spanish sense $T_s$ and the concept $C$, where $T_e \in \{t_{e_1}, \ldots, t_{e_p}\}$, $T_s \in \{t_{s_1}, \ldots, t_{s_q}\}$ and $C \in \{c_1, \ldots, c_r\}$.
We make the assumption that the senses $T_e$ and $T_s$ are independent of each other given the shared concept $C$. The generative parameters $\theta$ in the model are the prior probabilities $P(c)$ over the concepts, the conditional probabilities $P(t_e \mid c)$ and $P(t_s \mid c)$ for the English and Spanish senses given the concept, and the conditional probabilities $P(w_e \mid t_e)$ and $P(w_s \mid t_s)$ for the words $w_e$ and $w_s$ in each language given their senses. We can now imagine the generative process of a translation pair by the Concept Model as first selecting a concept according to the priors, then a sense for each language given the concept, and finally a word for each sense using the conditional probabilities of the words. As in Bengio and Kermorvant (2003), this generative procedure may be captured by factoring the joint distribution using the conditional independence assumptions as $P(W_e, W_s, T_e, T_s, C) = P(C)\, P(T_e \mid C)\, P(W_e \mid T_e)\, P(T_s \mid C)\, P(W_s \mid T_s)$. The Concept model is shown in Figure 1(b).
3 Constructing the Senses and Concepts
Building the structure of the model is crucial for
our task. Choosing the dimensionality of the hidden
variables by selecting the number of senses and con-
cepts, as well as taking advantage of prior knowl-
edge to impose constraints, are very important as-
pects of building the structure.
If certain words are not possible for a given sense, or certain senses are not possible for a given concept, their corresponding parameters should be 0. For instance, for all words $w_e$ that do not belong to a sense $t_e$, the corresponding parameter $\theta_{w_e \mid t_e}$ would be permanently set to 0. Only the remaining parameters need to be modeled explicitly.
While model selection is an extremely difficult problem in general, an important and interesting option is the use of world knowledge. Semantic hierarchies for some languages have been built. We should be able to make use of these known taxonomies in constructing our model. We make heavy use of the WordNet ontology to assign structure to both our models, as we discuss in the following subsections. There are two major tasks in building the structure: determining the possible sense labels for each word, both English and Spanish, and constructing the concepts, which involves choosing the number of concepts and the probable senses for each concept.
3.1 Building the Sense Model
Each word in WordNet can belong to multiple
synsets in the hierarchy, which are its possible
senses. In both of our models, we directly use the
WordNet senses as the English sense labels. All
WordNet senses for which a word has been ob-
served in the corpus form our set of English sense
labels. The Sense Model holds that the sense labels for the two domains are the same. So we must use the same WordNet labels for the Spanish words as well. We include a Spanish word $w_s$ for a sense $t$ if $w_s$ is the translation of any English word $w_e$ in $t$.
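The construction described above can be sketched in a few lines of Python. The helper name `build_sense_structure`, the `synsets_of` lookup (standing in for a WordNet query) and the toy data are assumptions for illustration; the paper does not specify an implementation.

```python
def build_sense_structure(synsets_of, translation_pairs):
    """Return ({sense: English words}, {sense: Spanish words}).

    synsets_of: {English word: iterable of sense labels}, a stand-in for WordNet.
    translation_pairs: iterable of (English word, Spanish word) observed in the corpus.
    """
    english_words = {}
    spanish_words = {}
    for w_e, w_s in translation_pairs:
        for t in synsets_of.get(w_e, ()):
            # A sense is kept only if one of its English words is observed.
            english_words.setdefault(t, set()).add(w_e)
            # A Spanish word joins every sense of each English word it translates.
            spanish_words.setdefault(t, set()).add(w_s)
    return english_words, spanish_words


# Hypothetical toy data based on the prevention/bar example.
synsets_of = {"bar": ["hindrance"], "prevention": ["hindrance", "control"]}
pairs = [("bar", "obstrucción"), ("prevention", "prevención")]
english_words, spanish_words = build_sense_structure(synsets_of, pairs)
```

Any (word, sense) combination absent from these sets corresponds to a parameter that is fixed at zero.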
3.2 Building the Concept Model
Unlike the Sense Model, the Concept Model does
not constrain the Spanish senses to be the same as
the English ones. So the two major tasks in build-
ing the Concept Model are constructing the Spanish
senses and then clustering the English and Spanish
senses to build the concepts.
Figure 2: The Sense and Concept models for prevention, bar, prevención and obstrucción. (Sense Model panel: senses te1 and te2. Concept Model panel: concepts c20 and c6118, English senses te1 and te2, Spanish senses ts1 and ts2.)
For each Spanish word $w_s$, we have its set of English translations $\{w_{e_1}, \ldots, w_{e_k}\}$. One possibility is to group Spanish words by looking at their translations. However, a more robust approach is to consider the relevant English senses for $w_s$. Each English translation $w_{e_i}$ of $w_s$ has its set of English sense labels $\mathbf{t}_{w_{e_i}}$ drawn from WordNet. So the relevant English sense labels for $w_s$ may be defined as $\mathbf{t}_{w_s} = \bigcup_i \mathbf{t}_{w_{e_i}}$. We call this the English sense map or $\mathit{smap}$ for $w_s$. We use the $\mathit{smap}$s to define the Spanish senses. We may imagine each Spanish word to come from one or more Spanish senses. If each word has a single sense, then we add a Spanish sense $t_s$ for each $\mathit{smap}$, and all Spanish words that share that $\mathit{smap}$ belong to that sense. Otherwise, the $\mathit{smap}$s have to be split into frequently occurring subgroups. Frequently co-occurring subsets of $\mathit{smap}$s can define more refined Spanish senses. We identify these subsets by looking at pairs of $\mathit{smap}$s and computing their intersections. An intersection is considered to be a Spanish sense if it occurs for a significant number of pairs of $\mathit{smap}$s. We consider both ways of building Spanish senses. In either case, a constructed Spanish sense $t_s$ comes with its relevant set $\{t_{e_i}\}$ of English senses, which we denote as $\mathit{smap}(t_s)$.
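The single-sense variant of this construction can be sketched as follows. The code builds the English sense map of each Spanish word (called `smap` in this sketch) and then groups Spanish words that share the same map. The function names, the `synsets_of` lookup and the toy data (including the Spanish word barrera) are hypothetical; the multi-sense variant based on intersections is not covered here.

```python
def build_smaps(synsets_of, translation_pairs):
    """Map each Spanish word to the union of the senses of its English translations."""
    smaps = {}
    for w_e, w_s in translation_pairs:
        smaps.setdefault(w_s, set()).update(synsets_of.get(w_e, ()))
    # Freeze the sets so that identical maps can be used as dictionary keys.
    return {w_s: frozenset(senses) for w_s, senses in smaps.items()}


def single_sense_spanish_senses(smaps):
    """Create one Spanish sense per distinct sense map; words sharing a map share the sense."""
    senses = {}
    for w_s, smap in smaps.items():
        senses.setdefault(smap, set()).add(w_s)
    return senses


# Hypothetical toy data.
synsets_of = {"bar": {"hindrance"}, "prevention": {"hindrance", "control"}}
pairs = [("bar", "obstrucción"), ("bar", "barrera"), ("prevention", "prevención")]
smaps = build_smaps(synsets_of, pairs)
spanish_senses = single_sense_spanish_senses(smaps)
```

Keying each Spanish sense by its sense map keeps the map available for the later similarity and concept-building steps.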
Once we have the Spanish senses, we cluster them to form concepts. We use the $\mathit{smap}$ corresponding to each Spanish sense to define a measure of similarity for a pair of Spanish senses. There are many options to choose from here. We use a simple measure that counts the number of common items in the two $\mathit{smap}$s.1 The similarity measure is now used to cluster the Spanish senses $t_s$. Since this measure is not transitive, it does not directly define equivalence classes over $t_s$. Instead, we get a similarity graph where the vertices are the Spanish senses and we add an edge between two senses if their similarity is above a threshold. We now pick each connected component from this graph as a cluster of similar Spanish senses.
1Another option would be to use a measure of similarity for
English senses, proposed in Resnik (1995) for two synsets in
a concept hierarchy like WordNet. Our initial results with this
measure were not favorable.
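The clustering step can be sketched as a similarity graph followed by a connected-components search. In the sketch below, similarity is the number of English senses two sense maps have in common, and an edge is added when it exceeds `threshold`. The function name, the default threshold of 0 and the toy sense identifiers are assumptions; the paper does not report the threshold it used.

```python
def cluster_spanish_senses(smap_of_sense, threshold=0):
    """Group Spanish senses into connected components of the similarity graph.

    smap_of_sense: {Spanish sense id: set of English senses (its sense map)}
    """
    ids = list(smap_of_sense)
    neighbours = {s: set() for s in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Similarity = number of English senses shared by the two maps.
            if len(smap_of_sense[a] & smap_of_sense[b]) > threshold:
                neighbours[a].add(b)
                neighbours[b].add(a)

    clusters, seen = [], set()
    for start in ids:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters


# Hypothetical toy data: s1 and s3 share nothing directly but are linked through s2.
toy = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"c", "d"}, "s4": {"x"}}
clusters = cluster_spanish_senses(toy)
```

The toy data illustrates why connected components are used: `s1` and `s3` have no sense in common, yet they end up in the same cluster because both are similar to `s2`.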
Now we build the concepts from the Spanish sense clusters. We recall that a concept is defined by a set of English senses and a set of Spanish senses that are related. Each cluster represents a concept. A particular concept is formed by the set of Spanish senses in the cluster and the English senses relevant for them. The relevant English senses for any Spanish sense are given by its $\mathit{smap}$. Therefore, the union of the $\mathit{smap}$s of all the Spanish senses in the cluster forms the set of English senses for each concept.
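A short sketch of this last step: each cluster of Spanish senses becomes one concept whose English senses are the union of the sense maps of its members. The function name, the dictionary layout of a concept and the toy identifiers are assumptions for illustration.

```python
def build_concepts(clusters, smap_of_sense):
    """Turn each cluster of Spanish senses into a concept.

    clusters: iterable of sets of Spanish sense ids
    smap_of_sense: {Spanish sense id: set of English senses}
    """
    concepts = []
    for cluster in clusters:
        english = set()
        for t_s in cluster:
            english |= set(smap_of_sense[t_s])  # union of the members' sense maps
        concepts.append({"spanish_senses": set(cluster), "english_senses": english})
    return concepts


# Hypothetical toy input.
smap_of_sense = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s4": {"x"}}
concepts = build_concepts([{"s1", "s2"}, {"s4"}], smap_of_sense)
```

Because the English senses are collected by union, an English sense shared by Spanish senses in different clusters will belong to more than one concept.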
4 Learning the Model Parameters
Once the model is built, we use the popular EM algorithm (Dempster et al., 1977) for hidden variables to learn the parameters for both models. The algorithm repeatedly iterates over two steps. The first step computes the posterior distribution over the hidden variables for each observed translation pair, and hence the expected log-likelihood of the joint probability of the data, under the current parameter settings $\theta$. The next step then re-estimates the values of the parameters of the model so as to maximize this expected log-likelihood. Below we summarize the re-estimation steps for each model.
4.1 EM for the Sense Model
$$P(T = t) = \frac{1}{N} \sum_{i=1}^{N} P(T_i = t \mid w_{e_i}, w_{s_i}, \theta)$$

$$P(W_e = w_e \mid T = t) = \frac{\sum_{i:\, w_{e_i} = w_e} P(T_i = t \mid w_{e_i}, w_{s_i}, \theta)}{\sum_{w'_e} \sum_{i:\, w_{e_i} = w'_e} P(T_i = t \mid w_{e_i}, w_{s_i}, \theta)}$$

Here the sums over $i$ run over the $N$ observed translation pairs $(w_{e_i}, w_{s_i})$. $P(W_s = w_s \mid T = t)$ follows similarly.
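The re-estimation steps for the Sense Model can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function name, the dictionary layout and the toy parameters are invented for the example, pairs that no sense can generate are simply skipped, and no convergence test is included.

```python
from collections import defaultdict


def em_sense_model(pairs, prior, p_we, p_ws, iterations=1):
    """Run EM updates for P(t), P(w_e | t) and P(w_s | t) on a list of (w_e, w_s) pairs."""
    for _ in range(iterations):
        resp_t = defaultdict(float)   # sum over pairs of P(T_i = t | pair)
        resp_we = defaultdict(float)  # the same sum restricted to pairs containing w_e
        resp_ws = defaultdict(float)  # the same sum restricted to pairs containing w_s
        used = 0
        for w_e, w_s in pairs:
            # E-step: posterior over senses for this pair.
            joint = {
                t: p * p_we.get((w_e, t), 0.0) * p_ws.get((w_s, t), 0.0)
                for t, p in prior.items()
            }
            z = sum(joint.values())
            if z == 0.0:
                continue  # assumption of this sketch: unexplained pairs are ignored
            used += 1
            for t, j in joint.items():
                if j > 0.0:
                    r = j / z
                    resp_t[t] += r
                    resp_we[(w_e, t)] += r
                    resp_ws[(w_s, t)] += r
        if used == 0:
            break
        # M-step: normalized expected counts.
        prior = {t: resp_t[t] / used for t in prior}
        p_we = {(w, t): v / resp_t[t] for (w, t), v in resp_we.items()}
        p_ws = {(w, t): v / resp_t[t] for (w, t), v in resp_ws.items()}
    return prior, p_we, p_ws


# Hypothetical toy corpus and initial parameters.
pairs = [("bar", "obstrucción"), ("prevention", "prevención")]
prior0 = {"hindrance": 0.5, "control": 0.5}
p_we0 = {("bar", "hindrance"): 0.5, ("prevention", "hindrance"): 0.5,
         ("prevention", "control"): 1.0}
p_ws0 = {("obstrucción", "hindrance"): 0.5, ("prevención", "hindrance"): 0.5,
         ("prevención", "control"): 1.0}
prior1, p_we1, p_ws1 = em_sense_model(pairs, prior0, p_we0, p_ws0, iterations=1)
```

Parameters that start at zero stay at zero under these updates, so the structural constraints imposed when building the model are preserved throughout learning.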
4.2 EM for the Concept Model
$$P(C = c) = \frac{1}{N} \sum_{i=1}^{N} P(C_i = c \mid w_{e_i}, w_{s_i}, \theta)$$

$$P(T_e = t_e \mid C = c) = \frac{\sum_{i=1}^{N} P(C_i = c, T_{e_i} = t_e \mid w_{e_i}, w_{s_i}, \theta)}{\sum_{i=1}^{N} P(C_i = c \mid w_{e_i}, w_{s_i}, \theta)}$$

$$P(W_e = w_e \mid T_e = t_e) = \frac{\sum_{i:\, w_{e_i} = w_e} P(T_{e_i} = t_e \mid w_{e_i}, w_{s_i}, \theta)}{\sum_{w'_e} \sum_{i:\, w_{e_i} = w'_e} P(T_{e_i} = t_e \mid w_{e_i}, w_{s_i}, \theta)}$$

$P(T_s = t_s \mid C = c)$ and $P(W_s = w_s \mid T_s = t_s)$ follow similarly.
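The posteriors that appear on the right-hand sides of these updates are not written out in the text. Under the factorization of Section 2.3 they can be obtained by normalizing the joint over all hidden variables and then marginalizing; the following is a sketch of that derivation, using lowercase symbols for instances and primes for summation variables (a notational assumption of this note).

```latex
\begin{align*}
P(c, t_e, t_s \mid w_e, w_s, \theta)
  &= \frac{P(c)\,P(t_e \mid c)\,P(w_e \mid t_e)\,P(t_s \mid c)\,P(w_s \mid t_s)}
          {\sum_{c'} \sum_{t'_e} \sum_{t'_s}
           P(c')\,P(t'_e \mid c')\,P(w_e \mid t'_e)\,P(t'_s \mid c')\,P(w_s \mid t'_s)} \\
P(c \mid w_e, w_s, \theta)
  &= \sum_{t_e} \sum_{t_s} P(c, t_e, t_s \mid w_e, w_s, \theta) \\
P(c, t_e \mid w_e, w_s, \theta)
  &= \sum_{t_s} P(c, t_e, t_s \mid w_e, w_s, \theta) \\
P(t_e \mid w_e, w_s, \theta)
  &= \sum_{c} \sum_{t_s} P(c, t_e, t_s \mid w_e, w_s, \theta)
\end{align*}
```

The last line is also the quantity used to predict the sense of an English word given its translation.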
4.3 Initialization of Model Probabilities
Since the EM algorithm is a local hill-climbing procedure that iteratively improves the log-likelihood, it is prone to getting caught in local maxima, and selection of the initial conditions is crucial for the learning procedure. Instead of opting for a uniform or random initialization of the probabilities, we make use of prior knowledge about the English words and senses available from WordNet. WordNet provides occurrence frequencies for each synset in the SemCor Corpus that may be normalized to derive probabilities $P_{wn}(t_e)$ for each English sense $t_e$. For the Sense Model, these probabilities form the initial priors over the senses, while all English (and Spanish) words belonging to a sense are initially assumed to be equally likely. However, initialization of the Concept Model using the same knowledge is trickier. We would like each English sense $t_e$ to have $P_{init}(t_e) = P_{wn}(t_e)$. But the fact that each sense belongs to multiple concepts and the constraint $\sum_{t_e \in c} P(t_e \mid c) = 1$ makes the solution non-trivial. Instead, we settle for a compromise. We set $P_{init}(t_e \mid c) = P_{wn}(t_e)$ and $P(c) = \sum_{t_e \in c} P_{wn}(t_e)$. Subsequent normalization takes care of the sum constraints. For a Spanish sense, we set $P(t_s) = \sum_{t_e \in \mathit{smap}(t_s)} P_{wn}(t_e)$. Once we have the Spanish sense probabilities, we follow the same procedure for setting $P(t_s \mid c)$ for each concept. All the Spanish and English words for a sense are set to be equally likely, as in the Sense Model. It turned out in our experiments on real data that this initialization makes a significant difference in model performance.
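The compromise initialization for the English side of the Concept Model can be sketched as follows. The function name, the input layout and the toy frequencies are assumptions for this example; in particular, the sketch assumes every concept contains at least one English sense with a positive WordNet probability.

```python
def init_concept_model(p_wn, english_senses_of):
    """Initialize P(c) and P(t_e | c) from WordNet-derived sense probabilities.

    p_wn: {English sense: P_wn(t_e)}
    english_senses_of: {concept: set of English senses in that concept}
    """
    # Unnormalized concept mass: sum of P_wn over the senses in the concept.
    raw_c = {c: sum(p_wn[t] for t in senses) for c, senses in english_senses_of.items()}
    total = sum(raw_c.values())
    p_c = {c: v / total for c, v in raw_c.items()}
    # P_init(t_e | c) starts at P_wn(t_e) and is normalized within each concept.
    p_te_c = {
        (t, c): p_wn[t] / raw_c[c]
        for c, senses in english_senses_of.items()
        for t in senses
    }
    return p_c, p_te_c


# Hypothetical toy values; sense "b" belongs to both concepts.
p_wn = {"a": 0.5, "b": 0.3, "c": 0.2}
english_senses_of = {"c1": {"a", "b"}, "c2": {"b", "c"}}
p_c, p_te_c = init_concept_model(p_wn, english_senses_of)
```

The normalization inside each concept is what the text refers to as taking care of the sum constraints: a sense that belongs to several concepts receives a different conditional in each one.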
5 Experimental Evaluation
Both the models are generative probabilistic models learned from parallel corpora and are expected to fit the training and subsequent test data. A good fit should be reflected in good prediction accuracy over a test set. The prediction task of interest is the sense of an English word when its translation is provided. We estimate the prediction accuracy and recall of our models on Senseval data.2 In addition, the Concept Model learns a sense structure for the Spanish language. While it is hard to objectively evaluate the quality of such a structure, we present some interesting concepts that are learned as an indication of the potential of our approach.
2Accuracy is the ratio of the number of correct predictions to the number of attempted predictions. Recall is the ratio of the number of correct predictions to the size of the test set.
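The two evaluation measures defined in the footnote can be computed as in the sketch below. The function name and the convention that an unattempted instance is either missing from `predictions` or mapped to `None` are assumptions of this example.

```python
def accuracy_and_recall(predictions, gold):
    """Return (accuracy, recall) as defined in the footnote.

    predictions: {instance id: predicted sense or None}
    gold: {instance id: correct sense}
    """
    attempted = [k for k in gold if predictions.get(k) is not None]
    correct = sum(1 for k in attempted if predictions[k] == gold[k])
    accuracy = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return accuracy, recall


# Hypothetical toy evaluation: four test instances, three attempted, two correct.
gold = {1: "control", 2: "hindrance", 3: "control", 4: "hindrance"}
predictions = {1: "control", 2: "control", 3: "control", 4: None}
accuracy, recall = accuracy_and_recall(predictions, gold)
```

Recall can never exceed accuracy under these definitions, since both share the same numerator and the test set is at least as large as the set of attempted instances.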
5.1 Evaluation with Senseval Data
In our experiments with real data, we make use of the parallel corpora constructed by Diab and Resnik (2002) for evaluation purposes. We chose to work on these corpora in order to permit a direct comparison with their results. The sense-tagged portion of the English corpus comprises the English "all words" section of the SENSEVAL-2 test data. The remainder of this corpus is constructed by adding the Brown Corpus, the SENSEVAL-1 corpus, the SENSEVAL-2 English Lexical Sample test, trial and training corpora and the Wall Street Journal sections 18-24 from the Penn Treebank. This English corpus is translated into Spanish using two commercially available MT systems: Globalink Pro 6.4 and Systran Professional Premium. The GIZA++ implementation of the IBM statistical MT models was used to derive the most-likely word-level alignments, and these define the English/Spanish word co-occurrences. To take into account variability of translation, we combine the translations from the two systems for each English word, following in the footsteps of Diab and Resnik (2002). For our experiments, we focus only on nouns, of which there are 875 occurrences in our tagged data. The sense tags for the English domain are derived from the WordNet 1.7 inventory. After pruning stopwords, we end up with 16,186 English words, 31,862 Spanish words and 2,385,574 instances of 41,850 distinct translation pairs. The English words come from 20,361 WordNet senses.
Table 1: Comparison with Diab's Model
Model        Accuracy   Recall   Parameters
Diab         0.618      0.572    -
Sense M.     0.624      0.616    154,947
Concept M.   0.672      0.651    120,268
As can be seen from Table 1, both our models clearly outperform Diab (2003), which is an improvement over Diab and Resnik (2002), in both accuracy and recall, while the Concept Model does significantly better than the Sense Model with fewer parameters. The comparison is restricted to the same subset of the test data. For our best results, the Sense Model has 20,361 senses, while the Concept Model has 20,361 English senses, 11,961 Spanish senses and 7,366 concepts. The Concept Model results are for the version that allows multiple senses for a Spanish word. Results for the single-sense model are similar.
Figure 3: Comparison with Senseval2 Systems (axes: Recall and Accuracy; legend: unsup., sup., diab, concept model, sense model)
In Figure 3, we compare the prediction accuracy
and recall against those of the 21 Senseval-2 English
All Words participants and that of Diab (2003),
when restricted to the same set of noun instances
from the gold standard. It can be seen that our mod-
els outperform all the unsupervised approaches in
recall and many supervised ones as well. No un-
supervised approach is better in both accuracy and
recall. It needs to be kept in mind that we take into
account only bilingual data for our predictions, and
not monolingual features like context of the word as
most other WSD approaches do.
5.2 Semantic Grouping of Spanish Senses
Table 2 shows some interesting examples of different Spanish senses for discovered concepts.3 The context of most concepts, like the ones shown, can be easily understood. For example, the first concept is about government actions and the second deals with murder and accidental deaths. The penultimate concept is interesting because it deals with different kinds of association and involves three different senses containing the word conexión. The other words in two of these senses suggest that they are about union and relation respectively. The third probably involves the link sense of connection. Conciseness of the concepts depends on the similarity threshold that is selected. Some may bring together loosely-related topics, which can be separated by a higher threshold.
6 Model Analysis
In this section, we back up our experimental results
with an in-depth analysis of the performance of our
two models.
Our Sense Model was motivated by Diab and Resnik (2002), but the flavors of the two are quite different. The most important distinction is that the Sense Model is a probabilistic generative model for parallel corpora, where interaction between different words stemming from the same sense comes into play, even if the words are not related through translations, and this interdependence of the senses through common words plays a role in sense disambiguation.
3Some English words are found to occur in the Spanish Senses. This is because the machine translation system used to create the Spanish document left certain words untranslated.
We started off our discussion of semantic ambiguity with the intuition that identification of semantic concepts in the corpus that relate multiple senses should help disambiguate senses. The Sense Model falls short of this target since it only brings together a single sense from each language. We will now revisit the motivating example from Section 2 and see how concepts help in disambiguation by grouping multiple related senses together. For the Sense Model, $P(\mathit{prevention} \mid t_{e_2}) > P(\mathit{prevention} \mid t_{e_1})$ since it is the only word that $t_{e_2}$ can generate. However, this difference is compensated for by the higher prior probability $P(t_{e_1})$, which is strengthened by both the translation pairs. Since the probability of joint occurrence is given by the product $P(t)\, P(w_e \mid t)\, P(w_s \mid t)$ for any sense $t$, the model does not develop a clear preference for either of the two senses.
The critical difference in the Concept Model can be appreciated directly from the corresponding joint probability $P(c)\, P(t_e \mid c)\, P(w_e \mid t_e)\, P(t_s \mid c)\, P(w_s \mid t_s)$, where $c$ is the relevant concept in the model. The preference for a particular instantiation in the model is dependent not on the prior $P(t_e)$ over a sense, but on the sense conditional $P(t_e \mid c)$. In our example, since $\langle$bar, obstrucción$\rangle$ can be generated only through concept $c_{20}$, $P(t_{e_1} \mid c_{20})$ is the only English sense conditional boosted by it. $\langle$prevention, prevención$\rangle$ is generated through a different concept $c_{6118}$, where the higher conditional $P(\mathit{prevention} \mid t_{e_2})$ gradually strengthens one of the possible instantiations for it, and the other one becomes increasingly unlikely as the iterations progress. The inference is that only one sense of prevention is possible in the context of the parallel corpus. The key factor in this disambiguation was that the two senses of prevention separated out into two different concepts.
The other significant difference between the models
is in the constraints on the parameters and the
effect that they have on sense disambiguation. In
the Sense Model, $\sum_{T} P(T) = 1$, while in the
Concept Model, $\sum_{T_e} P(T_e \mid c) = 1$ separately for
each concept $c$. Now for two relevant senses for an
English word, a slight difference in their priors will
tend to get ironed out when normalized over the entire
set of senses for the corpus. In contrast, if these
two senses belong to the same concept in the Concept
Model, the difference in the sense conditionals
will be highlighted since the normalization occurs
over a very small set of senses: the senses for
only that concept, which in the best possible scenario
will contain only the two contending senses,
as in the concept from our example.
Table 2: Example Spanish Senses in a Concept. For each concept, each row is a separate sense. Dictionary
senses of Spanish words are provided in English in parentheses where necessary.
actos accidente accidentes
supremas muertes(deaths)
decisión decisiones casualty
gobernando gobernante matar(to kill) matanzas(slaughter) muertes-le
gubernamentales slaying
gobernación gobierno-proporciona derramamiento-de-sangre (spilling-of-blood)
prohibir prohibiendo prohibitivo prohibitiva cachiporra(bludgeon) obligar(force) obligando(forcing)
gubernamental gobiernos asesinato(murder) asesinatos
linterna-eléctrica linterna(lantern) manía craze
faros-automóvil(headlight) culto(cult) cultos proto-senility
linternas-portuarias(harbor-light) delirio delirium
antorcha(torch) antorchas antorchas-pino-nudo rabias(fury) rabia farfulla(do hastily)
oportunidad oportunidades diferenciación
ocasión ocasiones distinción distinciones
riesgo(risk) riesgos peligro(danger) especialización
destino sino(fate) maestría (mastery)
fortuna suerte(fate) peculiaridades particularidades peculiaridades-inglesas
probabilidad probabilidades especialidad especialidades
diablo(devil) diablos modelo parangón
dickens ideal ideales
heller santo(saint) santos san
lucifer satan satanás idol idols ídolo
deslumbra(dazzle) dios god dioses
cromo(chromium) divinidad divinity
meteoro meteoros meteor meteoros-blue inmortal(immortal) inmortales
meteorito meteoritos teología teolog
pedregosos(rocky) deidad deity deidades
variación variaciones minutos minuto
discordancia desacuerdo(discord) discordancias momento momentos un-momento
desviación(deviation) desviaciones desviaciones-normales minutos momentos momento segundos
discrepancia discrepancias fugaces(fleeting) variación diferencia instante momento
disensión pestañeo(blink) guiña(wink) pestañean
adhesión adherencia ataduras(tying) pasillo(corridor)
enlace(connection) ataduras aisle
atadura ataduras pasarela(footbridge)
conexión conexiones hall vestíbulos
conexión une(to unite) pasaje(passage)
relación conexión callejón(alley) callejas-ciegas (blind alley) callejones-ocultos
implicación (complicity) envolvimiento
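The normalization argument made above for the two models can be illustrated numerically. This is a rough sketch with invented weights: two contending senses with slightly different unnormalized weights are normalized once over a large, hypothetical corpus-wide sense set and once within a concept that holds only those two senses. The names `sense_a` and `sense_b` and all numbers are assumptions.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def probability_gap(weights, first, second):
    """Difference in normalized probability between two named senses."""
    probs = normalize(weights)
    return probs[first] - probs[second]


# Two contending senses with slightly different (invented) weights.
contenders = {"sense_a": 1.2, "sense_b": 1.0}

# Corpus-wide normalization: the contenders share mass with 998 other senses.
corpus_wide = {f"other_{i}": 1.0 for i in range(998)}
corpus_wide.update(contenders)

# Per-concept normalization: the concept holds only the two contenders.
within_concept = dict(contenders)

corpus_gap = probability_gap(corpus_wide, "sense_a", "sense_b")
concept_gap = probability_gap(within_concept, "sense_a", "sense_b")

if __name__ == "__main__":
    print(f"gap when normalized over all senses: {corpus_gap:.6f}")
    print(f"gap when normalized within concept:  {concept_gap:.6f}")
```

The ratio between the two senses is the same under both normalizations. What changes is the absolute probability gap, which is far larger when the two senses are the only members of the concept.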
As can be seen from Table 1, the Concept Model
not only outperforms the Sense Model, it does so
with significantly fewer parameters. This may be
counter-intuitive since the Concept Model involves an
extra concept variable. However, the dissociation of
Spanish and English senses can significantly reduce
the parameter space. Imagine two Spanish words
that are associated with ten English senses and ac-
cordingly each of them has a probability for belong-
ing to each of these ten senses. Aided with a con-
cept variable, it is possible to model the same re-
lationship by creating a separate Spanish sense that
contains these two words and relating this Spanish
sense with the ten English senses through a concept
variable. Thus these words now need to belong to
only one sense as opposed to ten. Of course, now
there are new transition probabilities for each of the
eleven senses from the new concept node. The exact
reduction in the parameter space will depend on the
frequent subsets discovered for the English sense sets of the
Spanish words. Longer and more frequent subsets
will lead to larger reductions. It must also be borne
in mind that this reduction comes with the indepen-
dence assumptions made in the Concept Model.
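The arithmetic of the example above (two Spanish words, ten English senses) can be written out as a small sketch. It counts probability-table entries as described in the text, not free parameters after normalization, and the function names are hypothetical.

```python
def sense_model_entries(n_spanish_words, n_english_senses):
    """Each Spanish word carries one probability per associated English sense."""
    return n_spanish_words * n_english_senses


def concept_model_entries(n_spanish_words, n_english_senses, n_spanish_senses=1):
    """Word-to-Spanish-sense memberships plus concept-to-sense transitions."""
    memberships = n_spanish_words * n_spanish_senses
    transitions = n_english_senses + n_spanish_senses
    return memberships + transitions


# The example from the text: two Spanish words tied to ten English senses.
without_concept = sense_model_entries(2, 10)
with_concept = concept_model_entries(2, 10)

if __name__ == "__main__":
    print(f"without a concept variable: {without_concept} entries")
    print(f"with a concept variable:    {with_concept} entries")
```

Under these assumptions the count drops from 20 to 13, and the saving grows with the number of Spanish words that share the same subset of English senses.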
7 Conclusions and Future Work
We have presented two novel probabilistic models
for unsupervised word sense disambiguation using
parallel corpora and have shown that both models
outperform existing unsupervised approaches. In
addition, we have shown that our second model,
the Concept model, can be used to learn a sense
inventory for the secondary language. An advan-
tage of the probabilistic models is that they can eas-
ily incorporate additional information, such as con-
text information. In future work, we plan to investi-
gate the use of additional monolingual context. We
would also like to perform additional validation of
the learned secondary language sense inventory.
8 Acknowledgments
The authors would like to thank Mona Diab and
Philip Resnik for many helpful discussions and in-
sightful comments for improving the paper and also
for making their data available for our experiments.
This study was supported by NSF Grant 0308030.
References
E. Agirre, J. Atserias, L. Padró, and G. Rigau. 2000.
Combining supervised and unsupervised lexical
knowledge methods for word sense disambigua-
tion. In Computers and the Humanities, Special
Double Issue on SensEval. Eds. Martha Palmer
and Adam Kilgarriff. 34:1,2.
Yoshua Bengio and Christopher Kermorvant. 2003.
Extracting hidden sense probabilities from bitexts.
Technical report, TR 1231, Département
d'informatique et recherche opérationnelle,
Université de Montréal.
Peter F. Brown, Stephen Della Pietra, Vin-
cent J. Della Pietra, and Robert L. Mercer.
1991. Word-sense disambiguation using statisti-
cal methods. In Meeting of the Association for
Computational Linguistics, pages 264-270.
Rebecca Bruce and Janyce Wiebe. 1994. A new
approach to sense identification. In ARPA
Workshop on Human Language Technology.
Ido Dagan and Alon Itai. 1994. Word sense disam-
biguation using a second language monolingual
corpus. Computational Linguistics, 20(4):563-596.
Ido Dagan. 1991. Lexical disambiguation: Sources
of information and their statistical realization. In
Meeting of the Association for Computational
Linguistics, pages 341-342.
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977.
Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal
Statistical Society, B 39:1-38.
Mona Diab and Philip Resnik. 2002. An unsuper-
vised method for word sense tagging using paral-
lel corpora. In Proceedings of the 40th Anniver-
sary Meeting of the Association for Computa-
tional Linguistics (ACL-02).
Mona Diab. 2003. Word Sense Disambiguation
Within a Multilingual Framework. Ph.D. thesis,
University of Maryland, College Park.
Christiane Fellbaum. 1998. WordNet: An Elec-
tronic Lexical Database. MIT Press.
Nancy Ide and Jean Véronis. 1998. Word sense
disambiguation: The state of the art. Computational
Linguistics, 28(1):1-40.
Nancy Ide. 2000. Cross-lingual sense determina-
tion: Can it work? In Computers and the Hu-
manities: Special Issue on Senseval, 34:147-152.
Adam Kilgarriff and Joseph Rosenzweig. 2000.
Framework and results for English SENSEVAL.
Computers and the Humanities, 34(1):15-48.
Dekang Lin. 2000. Word sense disambiguation
with a similarity based smoothed library. In
Computers and the Humanities: Special Issue on
Senseval, 34:147-152.
K. C. Litkowski. 2000. SENSEVAL: The CL Research
experience. In Computers and the Humanities,
34(1-2), pp. 153-8.
Philip Resnik and David Yarowsky. 1999. Distin-
guishing systems and distinguishing senses: new
evaluation methods for word sense disambigua-
tion. Natural Language Engineering, 5(2).
Philip Resnik. 1995. Using information content to
evaluate semantic similarity in a taxonomy. In
Proceedings of the International Joint Conference
on Artificial Intelligence, pages 448-453.
Philip Resnik. 1997. Selectional preference and
sense disambiguation. In Proceedings of ACL
Siglex Workshop on Tagging Text with Lexical
Semantics, Why, What and How?, Washington,
April 4-5.
David Yarowsky. 1992. Word-sense disambigua-
tion using statistical models of Roget’s cate-
gories trained on large corpora. In Proceedings
of COLING-92, pages 454-460, Nantes, France,
July.
David Yarowsky. 1993. One sense per collocation.
In Proceedings, ARPA Human Language Tech-
nology Workshop, Princeton.
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods. In
Meeting of the Association for Computational
Linguistics, pages 189-196.
