The University of Maryland SENSEVAL-3 System Descriptions
Clara Cabezas, Indrajit Bhattacharya, and Philip Resnik
University of Maryland, College Park, MD 20742 USA
clarac@umiacs.umd.edu, indrajit@cs.umd.edu, resnik@umd.edu
Abstract
For SENSEVAL-3, the University of Maryland
(UMD) team focused on two primary issues: the
portability of sense disambiguation across languages,
and the exploitation of real-world bilingual text as a
resource for unsupervised sense tagging. We vali-
dated the portability of our supervised disambigua-
tion approach by applying it in seven tasks (En-
glish, Basque, Catalan, Chinese, Romanian, Span-
ish, and “multilingual” lexical samples), and we ex-
perimented with a new unsupervised algorithm for
sense modeling using parallel corpora.
1 Supervised Sense Tagging for Lexical
Samples
1.1 Tagging Framework
For the English, Basque, Catalan, Chinese, Roma-
nian, Spanish, and “multilingual” lexical samples,
we employed the UMD-SST system developed for
SENSEVAL-2 (Cabezas et al., 2001); we refer the
reader to that paper for a detailed system descrip-
tion. Briefly, UMD-SST takes a supervised learning
approach, treating each word in a task’s vocabulary
as an independent problem of classification into that
word’s sense inventory. Each training and test item
is represented as a weighted feature vector, with di-
mensions corresponding to properties of the con-
text. As in SENSEVAL-2, our system supported the
following kinds of features:
• Local context. For each i = 1, 2, and 3, and for each word w in the vocabulary, there is a feature L(i, w) representing the presence of word w at a distance of i words to the left of the word being disambiguated; there is a corresponding set of features R(i, w) for the local context to the right of the word.
• Wide context. Each word w in the training set vocabulary has a corresponding feature indicating its presence. For SENSEVAL-3, wide context features were taken from the entire training or test instance. In other settings, one might make further distinctions, e.g. between words in the same paragraph and words in the document. (A sketch of both feature types appears below.)
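The following sketch illustrates how these two kinds of features might be extracted; the feature-name scheme (L1:, R2:, WIDE:) is illustrative, not the system's actual identifiers.

    # Sketch of the local- and wide-context features described above. Feature
    # names like "L2:new" are illustrative, not the system's actual identifiers.
    def context_features(tokens, target_index, vocabulary):
        """tokens: the tokenized instance; target_index: position of the target."""
        features = {}
        # Local context: words at distances 1-3 to the left (L) and right (R).
        for i in (1, 2, 3):
            if target_index - i >= 0:
                features[f"L{i}:{tokens[target_index - i]}"] = 1.0
            if target_index + i < len(tokens):
                features[f"R{i}:{tokens[target_index + i]}"] = 1.0
        # Wide context: presence of each vocabulary word anywhere in the instance.
        for w in tokens:
            if w in vocabulary:
                features[f"WIDE:{w}"] = 1.0
        return features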
We also experimented with the following additional
kinds of features for English:
• Grammatical context. We use a syntactic de-
pendency parser (Lin, 1998) to produce, for
each word to be disambiguated, features iden-
tifying relevant syntactic relationships in the
sentence where it occurs. For example, in the
sentence The U.S. government announced a
new visa waiver policy, the word government
would have syntactic features like DET:THE,
MOD:U.S., and SUBJ-OF:ANNOUNCED.
• Expanded context. In information retrieval, we and other researchers have found that it can be useful to expand the representation of a document to include informative words from similar documents (Levow et al., 2001). In a similar spirit, we create a set of expanded-context features EXT:w by (a) treating the WSD context as a bag of words, (b) issuing it as a query to a standard information retrieval system that has indexed a large collection of documents, and (c) including the non-stopword vocabulary of the top N documents returned. So, for example, in a context containing the sentence The U.S. government announced a new visa waiver policy, the query might retrieve news articles like “US to Extend Fingerprinting to Europeans, Japanese” (Bloomberg.com, April 2, 2004), leading to the addition of features like EXT:EUROPEAN, EXT:JAPANESE, EXT:FINGERPRINTING, EXT:VISITORS, EXT:TOURISM, and so forth.
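The document-expansion step can be sketched as follows; search() is a hypothetical stand-in for the indexed retrieval system, and the stopword list is a placeholder.

    # Sketch of expanded-context features. search(query, k) is a hypothetical
    # stand-in for an IR system over a large indexed collection, returning the
    # top-k documents as token lists; STOPWORDS is likewise a placeholder.
    STOPWORDS = {"the", "a", "an", "to", "of", "and"}

    def expansion_features(context_tokens, search, k=5):
        query = " ".join(context_tokens)        # (a) the context as a bag of words
        top_docs = search(query, k)             # (b) query the indexed collection
        features = {}
        for doc_tokens in top_docs:             # (c) non-stopword vocabulary
            for w in doc_tokens:
                if w.lower() not in STOPWORDS:
                    features[f"EXT:{w.upper()}"] = 1.0
        return features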
Lexical Sample Coarse (prec/rec) Fine (prec/rec)
UMD-SST 0.643/0.643 0.568/0.568
UMD-SST-gram 0.600/0.600 0.576/0.576
UMD-SST-docexp 0.541/0.542 0.516/0.491
Table 1: UMD-SST variations on the SENSEVAL-2
English lexical sample task
As described by Cabezas et al. (2001), we have
adopted the framework of support vector machines
(SVMs) in order to perform supervised classifica-
tion. Because we used a version of SVM learn-
ing designed for binary classification tasks, rather
than the multi-way classification needed for disam-
biguating among k senses, we constructed a family of SVM classifiers for each word w — one for each of the word's k_w senses. All positive training examples for a sense s_i of w were treated as negative training examples for all the other senses s_j, j ≠ i.
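As a rough illustration of this one-classifier-per-sense construction, here is a sketch using scikit-learn's LinearSVC as a stand-in for the binary SVM learner actually used (which is not named here).

    # Sketch of the one-binary-SVM-per-sense scheme described above, using
    # scikit-learn's LinearSVC as a stand-in learner; assumes each word has
    # training examples for at least two senses.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    def train_sense_classifiers(instances):
        """instances: list of (feature_dict, sense_label) pairs for one word."""
        vectorizer = DictVectorizer()
        X = vectorizer.fit_transform([feats for feats, _ in instances])
        labels = [sense for _, sense in instances]
        classifiers = {}
        for sense in set(labels):
            # Positives for every other sense become negatives for this one.
            y = [1 if lab == sense else 0 for lab in labels]
            classifiers[sense] = LinearSVC().fit(X, y)
        return vectorizer, classifiers

    def disambiguate(vectorizer, classifiers, feature_dict):
        x = vectorizer.transform([feature_dict])
        # Choose the sense whose binary classifier is most confident.
        return max(classifiers, key=lambda s: classifiers[s].decision_function(x)[0])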
Table 1 shows the performance of our approach
on the English lexical sample task from the previ-
ous SENSEVAL exercise (SENSEVAL-2), includ-
ing the basic system (UMD-SST), the basic sys-
tem with grammatical features added (UMD-SST-
gram), and the basic system with document expan-
sion features added (UMD-SST-docexp). (We have
not done a run with both sets of features added.) The
results suggested a potential benefit from using grammatical features under fine-grained scoring. However, we deemed the benefit too small to
rely upon, and submitted our official SENSEVAL-3
runs using UMD-SST without the grammatical or
document-expansion features.
1.2 SENSEVAL-3 Lexical Sample Tasks
For SENSEVAL-3, the modularity of our system
made it very easy to participate in the many lex-
ical sample tasks, including the multilingual lexi-
cal sample, where the “sense inventory” consisted
of vocabulary items from a second language.[1] In-
deed, we participated in several tasks without hav-
ing anyone on the team who could read the lan-
guage. (Whether or not this was a good idea remains
to be seen.) For Basque, Catalan, and Spanish, we
used the lemmatized word forms provided by the
task organizers; for the other languages, including
English, we used only simple tokenization.
[1] Data format problems prevented us from participating in the Italian lexical sample task.
Lexical Sample Precision (%) Recall (%)
Basque 65.6 58.7
Catalan 81.5 80.3
Chinese 51.3 51.2
English 66.0 66.0
Romanian 70.7 70.7
Spanish 82.5 82.5
Multilingual 58.8 58.8
Table 2: UMD-SST results (fine-grained) on
SENSEVAL-3 lexical sample tasks
Lexical Sample Coarse (prec/rec) Fine (prec/rec)
UMD-SST 0.709/0.709 0.660/0.660
UMD-SST-gram 0.703/0.703 0.655/0.655
UMD-SST-docexp 0.691/0.680 0.637/0.627
Table 3: UMD-SST variations on the SENSEVAL-3
English lexical sample task
Table 2 shows the UMD-SST system’s official SENSEVAL-3 performance on the lexical sample runs in which we participated, using fine-grained scores.
In unofficial runs, we also experimented with
the grammatical and document-expansion features.
Table 3 shows the results, which indicate that on this
task the additional features did not help and may
have hurt performance slightly. Although we have
not yet reached any firm conclusions, we conjecture that any value added by these features may have been offset by the expansion in the size of the
feature space; in future work we plan to explore fea-
ture selection and alternative learning frameworks.
2 Unsupervised Sense Tagging using
Bilingual Text
2.1 Probabilistic Sense Model
For the past several years, the University of Mary-
land group has been exploring unsupervised ap-
proaches to word sense disambiguation that take ad-
vantage of parallel corpora (Diab and Resnik, 2002;
Diab, 2003). Recently, Bhattacharya et al. (2004)
(in a UMD/Montreal collaboration) have developed
a variation on this bilingual approach that is in-
spired by the central insight of Diab’s work, but re-
casts it in a probabilistic framework. A generative
model, it is a variant of the graphical model of Ben-
gio and Kermorvant (2003), which groups seman-
tically related words from the two languages into
“senses”; translations are generated by probabilis-
tically choosing a sense and then words from the
sense.
Briefly, the model of Bhattacharya et al. rests on explicit independence assumptions: senses and words have certain occurrence probabilities, and once a sense has been chosen, the words expressing it can be chosen independently. Interaction between different words arising from the same sense therefore comes into play even when the words are not related through translations, and this interdependence of senses through shared words plays a role in sense disambiguation.
The model takes as its starting point the idea of a “translation pair” — a pair of words w_e and w_f that are aligned in two sentences (here “English” and “non-English”) that are translations of each other. For example, in the English-Spanish sentence pair Me gusta la ciudad/I like the city, one would find the translation pairs (me, I), (gusta, like), (la, the), and (ciudad, city).[2] Those familiar with statistical
machine translation (MT) models will note that a
translation pair is equivalent to a link in a word-level
alignment, and in fact we obtain translation pairs
from sentence-aligned parallel text by training a sta-
tistical MT model (using GIZA++, (Och and Ney,
2003)) and using the word-level alignments that re-
sult.
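To illustrate, here is a sketch of turning word-level alignment links into translation pairs; it assumes the aligner's output has already been parsed into index pairs, since the file-format details are omitted here.

    # Sketch: derive translation pairs from a word-aligned sentence pair. It is
    # assumed that alignment links have already been parsed out of the aligner's
    # output into (english_index, foreign_index) tuples.
    def translation_pairs(english_tokens, foreign_tokens, links):
        """links: iterable of (e_i, f_j) word-alignment link indices."""
        return [(english_tokens[e], foreign_tokens[f]) for e, f in links]

    english = "I like the city".split()
    spanish = "Me gusta la ciudad".split()
    links = [(0, 0), (1, 1), (2, 2), (3, 3)]    # hypothetical alignment links
    print(translation_pairs(english, spanish, links))
    # [('I', 'Me'), ('like', 'gusta'), ('the', 'la'), ('city', 'ciudad')]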
The probabilistic sense model makes the assumption that the English word w_e and the non-English word w_f in a translation pair share the same precise sense, or, in other words, that the set of sense labels for the words in the two languages is the same and may be collapsed into one set of senses that is responsible for both English and non-English words. Thus the one latent variable in the model is the sense label T generating both words, which are represented by variables W_e and W_f. The model also makes the assumption that words in both languages are conditionally independent given the sense label. The generative parameters θ for the model are the prior probability P(t) of each sense t and the conditional probabilities P(w_e|t) and P(w_f|t) of each word w_e and w_f in the two languages given the sense. The generation of a translation pair by this model may be viewed as a two-step process that first selects a sense according to the priors on the senses, and then selects a word from each language using the conditional probabilities for that sense. This corresponds to a factoring of the joint distribution: P(W_e, W_f, T) = P(T) P(W_e|T) P(W_f|T).

[2] Speakers of both Spanish and English will observe that translation pairs may well include words that are not exactly translations of each other.
Given WordNet as a sense inventory, the set of English senses is as defined in the WordNet database. Since the model assumes the sense labels for the two languages are the same, it must use the same WordNet labels for the non-English words as well. Rather than considering all possible words in the non-English vocabulary for each WordNet sense t, we permit an association between non-English word w_f and WordNet sense t if w_f is the translation of any English word w_e in t’s synonym set.
We use the popular EM algorithm (Dempster et al.,
1977) to estimate the model’s parameters from a set
of translation pairs derived from a parallel corpus.
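To make the estimation concrete, the following is a minimal EM sketch under the independence assumptions above; the initialization and smoothing choices are placeholders, not the authors' implementation.

    # Minimal EM sketch for the bilingual sense model described above. The data
    # structures, initialization, and smoothing constant are placeholder choices.
    from collections import defaultdict

    def run_em(pairs, candidate_senses, iterations=20):
        """pairs: list of (w_e, w_f) translation pairs.
        candidate_senses: maps (w_e, w_f) to the WordNet senses permitted for it."""
        senses = {t for cands in candidate_senses.values() for t in cands}
        p_t = {t: 1.0 / len(senses) for t in senses}   # prior P(t)
        p_e = defaultdict(lambda: 1e-3)                # P(w_e | t), smoothed default
        p_f = defaultdict(lambda: 1e-3)                # P(w_f | t), smoothed default
        for _ in range(iterations):
            c_t = defaultdict(float)
            c_e = defaultdict(float)
            c_f = defaultdict(float)
            for we, wf in pairs:
                # E-step: posterior P(t | w_e, w_f) is proportional to
                # P(t) P(w_e|t) P(w_f|t), restricted to the candidate senses.
                post = {t: p_t[t] * p_e[we, t] * p_f[wf, t]
                        for t in candidate_senses[we, wf]}
                z = sum(post.values())
                if z == 0.0:
                    continue
                for t, v in post.items():
                    gamma = v / z
                    c_t[t] += gamma
                    c_e[we, t] += gamma
                    c_f[wf, t] += gamma
            # M-step: renormalize expected counts into probabilities.
            total = sum(c_t.values())
            p_t = {t: c_t[t] / total for t in senses}
            p_e = defaultdict(lambda: 1e-3,
                              {(w, t): c / c_t[t] for (w, t), c in c_e.items()})
            p_f = defaultdict(lambda: 1e-3,
                              {(w, t): c / c_t[t] for (w, t), c in c_f.items()})
        return p_t, p_e, p_f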
3 Using the Model for WSD
Like Diab’s system, the sense model of Bhat-
tacharya et al. requires a parallel corpus in order
to estimate its parameters, and as currently imple-
mented it can only assign sense tags to words in par-
allel text. In order to perform WSD experiments on
English test items, therefore, two steps are neces-
sary: obtaining translation pairs from an English-F
parallel corpus in order to create a model, and trans-
lating test items from English into F in order to ob-
tain word-level alignments that can be used as the
basis for disambiguation.
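Concretely, once the parameters have been estimated, tagging an aligned test word reduces to maximizing the sense posterior for its translation pair; a one-line sketch, using the parameter dictionaries from the EM sketch above:

    # Sketch: tag an aligned (w_e, w_f) pair by maximizing the sense posterior
    # under the factored model; p_t, p_e, p_f follow the EM sketch in Section 2.1.
    def tag_pair(we, wf, candidates, p_t, p_e, p_f):
        """Return the argmax over t of P(t) P(w_e|t) P(w_f|t)."""
        return max(candidates, key=lambda t: p_t[t] * p_e[we, t] * p_f[wf, t])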
In order to accomplish these steps, Bhattacharya
et al. (2004) used the pseudo-translation ap-
proach of Diab and Resnik (2002): they created
the model using an English-Spanish parallel corpus
constructed by using Systran to translate a large col-
lection of English text, and they obtained parallel
Spanish text for the test items in the same fashion.
On the nouns (only) in the SENSEVAL-2 English
all-words task, they obtained precision and recall of
0.624 and 0.616, respectively, improving on Diab’s
precision and recall of 0.618 and 0.572.[3]

[3] Their paper includes a refinement of the probabilistic model that improves performance further, to precision and recall of 0.672 and 0.651.
Our goal for SENSEVAL-3 was to investigate the
use of human-translated text, rather than pseudo-
translated text, in creating the model. To that end,
we used three sources of sentence-aligned parallel
text:
• Spanish-English: a set of 107,222 sentence pairs sampled from modern Bible translations, United Nations proceedings, and newswire translations from FBIS (the Foreign Broadcast Information Service).
• French-English: a set of 1,008,591 sentence pairs from the Europarl corpus (Koehn, 2003).
• Chinese-English: a set of 440,223 sentence pairs from FBIS.

Language pair Precision Recall
English-Chinese 0.445 0.445
English-Spanish 0.444 0.444
English-French 0.445 0.445
Table 4: Unsupervised probabilistic model results (fine-grained) on the SENSEVAL-2 English all-words task
In order to tag new test sentences, we used ma-
chine translation from English test items into each
of Spanish, French, and Chinese. We used Systran
for Spanish and French, and for Chinese we used
an implementation of the alignment template frame-
work for statistical MT (Kumar and Byrne, 2003).
Having obtained translations for the test sentences, we used GIZA++ to create word-level alignments within which translation pairs could be identified. We used the probabilistic model only for
WSD of nouns, where nouns were identified using
an automatic part-of-speech tagger. For other parts
of speech, we used the first-listed WordNet sense.
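As an illustration of that fallback, the first-listed sense can be read directly off WordNet; this sketch uses NLTK's WordNet interface as a modern stand-in for whatever lookup the system used.

    # Sketch of the first-listed-WordNet-sense fallback used for non-nouns.
    from nltk.corpus import wordnet as wn

    def first_sense(word, pos):
        """Return the first-listed WordNet synset for word/pos, or None."""
        synsets = wn.synsets(word, pos=pos)  # pos is e.g. wn.VERB, wn.ADJ, wn.ADV
        return synsets[0] if synsets else None

    print(first_sense("announce", wn.VERB))  # e.g. Synset('announce.v.01')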
Time constraints prevented us from completing SENSEVAL-3 runs before this writing. Table 4 shows the performance of the system on the
SENSEVAL-2 English all-words task. This perfor-
mance level places the approach in the middle group
of performers on this task at the time of SENSEVAL-
2 (with scores roughly in the range of 0.44-0.46), as
contrasted to the top group (with scores in the range
of 0.56-0.69). In the near future, we hope to repeat
the experiment with the SENSEVAL-3 English all-
words and lexical sample test data, and also to ex-
plore evidence combination from multiple language
pairs.
4 Conclusions
Our precision and recall in the SENSEVAL-3 lexical sample tasks were in all cases very close to the median reported by the task organizers, demonstrating an ability to obtain credible performance with a simple, robust approach. We also explored additional kinds of features, and we ex-
perimented with applying a new unsupervised prob-
abilistic model using human-translated rather than
pseudo-translated parallel text, with equivocal re-
sults for the various extensions beyond the basic
system. In the future we plan to focus our attention
more heavily on the learning paradigm and prob-
abilistic modeling, with the particular aim of more
effectively exploiting local and document-level con-
text for both sense disambiguation and lexical selec-
tion.
Acknowledgements
The work described in this paper was supported
in part by ONR MURI Contract FCPO.810548265,
National Science Foundation grant EIA0130422,
and Department of Defense contract RD-02-5700.
The authors gratefully acknowledge the assistance
of Mona Diab, Aaron Elkiss, Adam Lopez, and
Grazia Russo-Lassner in data preparation details,
and the kind help of Shankar Kumar and Yonggang
Deng in using their framework for statistical ma-
chine translation.

References
Yoshua Bengio and Christopher Kermorvant. 2003. Extracting hidden sense probabilities from bitexts. Technical Report TR 1231, Département d’informatique et de recherche opérationnelle, Université de Montréal.
Indrajit Bhattacharya, Lise Getoor, and Yoshua
Bengio. 2004. Unsupervised sense disambigua-
tion using bilingual probabilistic models. In
Meeting of the Association for Computational
Linguistics.
Clara Cabezas, Philip Resnik, and Jessica Stevens.
2001. Supervised sense tagging using support
vector machines. In Proceedings of the Sec-
ond International Workshop on Evaluating Word
Sense Disambiguation Systems (SENSEVAL-2),
Toulouse, France, July.
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977.
Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statis-
tical Society, B 39:1–38.
Mona Diab and Philip Resnik. 2002. An unsuper-
vised method for word sense tagging using paral-
lel corpora. In 40th Anniversary Meeting of the
Association for Computational Linguistics (ACL-
02), Philadelphia, July.
Mona Diab. 2003. Word Sense Disambiguation
within a Multilingual Framework. Ph.D. thesis,
University of Maryland.
Philipp Koehn. 2003. Europarl: A multilingual cor-
pus for evaluation of machine translation. Unpublished manuscript.
S. Kumar and W. Byrne. 2003. A weighted finite
state transducer implementation of the alignment
template model for statistical machine transla-
tion. In Proceedings of HLT-NAACL, Edmonton,
Canada, May.
Gina Levow, Douglas Oard, and Philip Resnik.
2001. Rapidly retargetable interactive translin-
gual retrieval. In Human Language Technology
Conference (HLT-2001), San Diego, CA, March.
Dekang Lin. 1998. Dependency-based evaluation
of MINIPAR. In Workshop on the Evaluation of
Parsing Systems, Granada, Spain, May.
Franz Josef Och and Hermann Ney. 2003. A
systematic comparison of various statistical
alignment models. Computational Linguistics,
29(1):19–51.
