Mapping Lexical Entries in a Verbs Database
to WordNet Senses
Rebecca Green, Lisa Pearl, Bonnie J. Dorr, and Philip Resnik
Department of Computer Science and
Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742 USA
{rgreen,llsp,bonnie,resnik}@umiacs.umd.edu
Abstract
This paper describes automatic tech-
niques for mapping 9611 entries in a
database of English verbs to Word-
Net senses. The verbs were initially
grouped into 491 classes based on
syntactic features. Mapping these
verbs into WordNet senses provides a
resource that supports disambiguation
in multilingual applications such as
machine translation and cross-language
information retrieval. Our techniques
make use of (1) a training set of 1791
disambiguated entries, representing
1442 verb entries from 167 classes;
(2) word sense probabilities, from
frequency counts in a tagged corpus;
(3) semantic similarity of WordNet
senses for verbs within the same class;
(4) probabilistic correlations between
WordNet data and attributes of the
verb classes. The best results achieved
72% precision and 58% recall, versus a
lower bound of 62% precision and 38%
recall for assigning the most frequently
occurring WordNet sense, and an upper
bound of 87% precision and 75% recall
for human judgment.
1 Introduction
Our goal is to map entries in a lexical database
of 4076 English verbs automatically to Word-
Net senses (Miller and Fellbaum, 1991), (Fell-
baum, 1998) to support such applications as ma-
chine translation and cross-language information
retrieval. For example, the verb drop is multi-
ply ambiguous, with many potential translations
in Spanish: bajar, caerse, dejar caer, derribar,
disminuir, echar, hundir, soltar, etc. The database
specifies a set of interpretations for drop, depend-
ing on its context in the source-language (SL). In-
clusion of WordNet senses in the database enables
the selection of an appropriate verb in the target
language (TL). Final selection is based on a fre-
quency count of WordNet senses across all classes
to which the verb belongs—e.g., disminuir is se-
lected when the WordNet sense corresponds to the
meaning of drop in Prices dropped.
Our task differs from standard word sense dis-
ambiguation (WSD) in several ways. First, the
words to be disambiguated are entries in a lexical
database, not tokens in a text corpus. Second, we
take an “all-words” rather than a “lexical-sample”
approach (Kilgarriff and Rosenzweig, 2000): All
words in the lexical database “text” are disam-
biguated, not just a small number for which de-
tailed knowledge is available. Third, we replace
the contextual data typically used for WSD with
information about verb senses encoded in terms
of thematic grids and lexical-semantic representa-
tions from (Olsen et al., 1997). Fourth, whereas a
single word sense for each token in a text corpus
is often assumed, the absence of sentential context
leads to a situation where several WordNet senses
may be equally appropriate for a database entry.
Indeed, as distinctions between WordNet senses
can be fine-grained (Palmer, 2000), it may be un-
clear, even in context, which sense is meant.
The verb database contains mostly syntactic in-
formation about its entries, much of which ap-
plies at the class level within the database. Word-
Net, on the other hand, is a significant source for
information about semantic relationships, much
of which applies at the “synset” level (“synsets”
are WordNet’s groupings of synonymous word
senses). Mapping entries in the database to their
corresponding WordNet senses greatly extends
the semantic potential of the database.
2 Lexical Resources
We use an existing classification of 4076 English
verbs, based initially on English Verbs Classes
and Alternations (Levin, 1993) and extended
through the splitting of some classes into sub-
classes and the addition of new classes. The re-
sulting 491 classes (e.g., “Roll Verbs, Group I”,
which includes drift, drop, glide, roll, swing) are
referred to here as Levin+ classes. As verbs may
be assigned to multiple Levin+ classes, the actual
number of entries in the database is larger, 9611.
Following the model of (Dorr and Olsen, 1997),
each Levin+ class is associated with a thematic
grid (henceforth abbreviated θ-grid), which sum-
marizes a verb’s syntactic behavior by specify-
ing its predicate argument structure. For exam-
ple, the Levin+ class “Roll Verbs, Group I” is as-
sociated with the θ-grid [th goal], in which a
theme and a goal are used (e.g., The ball dropped
to the ground).1 Each θ-grid specification corre-
sponds to a Grid class. There are 48 Grid classes,
with a one-to-many relationship between Grid and
Levin+ classes.
WordNet, the lexical resource to which we are
mapping entries from the lexical database, groups
synonymous word senses into “synsets” and struc-
tures the synsets into part-of-speech hierarchies.
Our mapping operation uses several other data el-
ements pertaining to WordNet: semantic relation-
ships between synsets, frequency data, and syn-
tactic information.
Seven semantic relationship types exist be-
tween synsets, including, for example, antonymy,
hyperonymy, and entailment. Synsets are often
related to a half dozen or more other synsets; they
1There is also a Levin+ class “Roll Verbs, Group II”
which is associated with the θ-grid [th particle(down)], in
which a theme and a particle ‘down’ are used (e.g., The ball
dropped down).
may be related to multiple synsets through a single
relationship or may be related to a single synset
through multiple relationship types.
Our frequency data for WordNet senses is de-
rived from SEMCOR—a semantic concordance in-
corporating tagging of the Brown corpus with
WordNet senses.2
Syntactic patterns (“frames”) are associated
with each synset, e.g., Somebody ----s something;
Something ----s; Somebody ----s somebody into
V-ing something. There are 35 such verb frames
in WordNet and a synset may have only one or as
many as a half dozen or so frames assigned to it.
Our mapping of verbs in Levin+ classes to
WordNet senses relies in part on the relation be-
tween thematic roles in Levin+ and verb frames in
WordNet. Both reflect how many and what kinds
of arguments a verb may take. However, con-
structing a direct mapping between θ-grids and
WordNet frames is not possible, as the underly-
ing classifications differ in significant ways. The
correlations between the two sets of data are better
viewed probabilistically.
Table 1 illustrates the relation between Levin+
classes and WordNet for the verb drop. In our
multilingual applications (e.g., lexical selection in
machine translation), the Grid information pro-
vides a context-based means of associating a verb
with a Levin+ class according to its usage in the
SL sentence. The WordNet sense possibilities are
thus pared down during SL analysis, but not suffi-
ciently for the final selection of a TL verb. For ex-
ample, Levin+ class 9.4 has three possible Word-
Net senses for drop. However, the WordNet sense
8 is not associated with any of the other classes;
thus, it is considered to have a higher “information
content” than the others. The upshot is that the
lexical-selection routine prefers dejar caer over
other translations such as derribar and bajar.3
The other classes are similarly associated with ap-
2For further information see the WordNet manuals, sec-
tion 7, SEMCOR at http://www.cogsci.princeton.edu.
3This lexical-selection approach is an adaptation of the
notion of reduction in entropy, measured by information
gain (Mitchell, 1997). Using information content to quan-
tify the “value” of a node in the WordNet hierarchy has
also been used for measuring semantic similarity in a tax-
onomy (Resnik, 1999b). More recently, context-based mod-
els of disambiguation have been shown to represent signif-
icant improvements over the baseline (Bangalore and Ram-
bow, 2000), (Ratnaparkhi, 2000).
Levin+            Grid/Example                  WN Sense                      Spanish Verb(s)
9.4               [ag th mod-loc src goal]      1. move, displace             1. derribar, echar
Directional       I dropped the stone           2. descend, fall, go down     2. bajar, caerse
Put                                             8. drop, set down, put down   8. dejar caer, echar, soltar
45.6              [th]                          1. move, displace             1. derribar, echar
Calibratable      Prices dropped                3. decline, go down, wane     3. disminuir
Change of
State
47.7              [th src goal]                 2. descend, fall, go down     2. bajar, caerse
Meander           The river dropped from        4. sink, drop, drop down      4. hundir, caer
                  the lake to the sea
51.3.1            [th goal]                     2. descend, fall, go down     2. bajar, caerse
Roll I            The ball dropped to the
                  ground
51.3.1            [th particle(down)]           2. descend, fall, go down     2. bajar, caerse
Roll II           The ball dropped down

Table 1: Relation Between Levin+ and WN Senses for ‘drop’
propriate TL verbs during lexical selection: dis-
minuir (class 45.6), hundir (class 47.7), and bajar
(class 51.3.1).4
3 Training Data
We began with the lexical database of (Dorr and
Jones, 1996), which contains a significant number
of WordNet-tagged verb entries. Some of the as-
signments were in doubt, since class splitting had
occurred subsequent to those assignments, with
all old WordNet senses carried over to new sub-
classes. New classes had also been added since
the manual tagging. It was determined that the
tagging for only 1791 entries—including 1442
verbs in 167 classes—could be considered stable;
for these entries, 2756 assignments of WordNet
senses had been made. Data for these entries,
taken from both WordNet and the verb lexicon,
constitute the training data for this study.
The following probabilities were generated
from the training data:
• P(Grid-relatedness) = |{r_t : G_i = G_j}| / |{r_t}|,
where r_t is a relation (of relationship type t,
e.g., synonymy) between two synsets, s_i and s_j,
where s_i is mapped to by a verb in Grid class G_i
and s_j is mapped to by a verb in Grid class G_j.
4The full set of Spanish translations is selected from
WordNet associations developed in the EuroWordNet effort
(Dorr et al., 1997).
This is the probability that if one synset is related
to another through a particular relationship type,
then a verb mapped to the first synset will belong
to the same Grid class as a verb mapped to the
second synset. Computed values generally range
between .3 and .35.
• P(Levin+-relatedness) = |{r_t : L+_i = L+_j}| / |{r_t}|,
where r_t is as above, except that s_i is mapped to
by a verb in Levin+ class L+_i and s_j is mapped
to by a verb in Levin+ class L+_j. This is the
probability that if one synset is related to another
through a particular relationship type, then a
verb mapped to the first synset will belong to
the same Levin+ class as a verb mapped to the
second synset. Computed values generally range
between .25 and .3.
• P(Full-frame) = |{θ_Gv : cf_Fv}| / |{θ_Gv}|,
where θ_Gv is the occurrence of the entire θ-grid G
for verb entry v and cf_Fv is the occurrence of the
entire frame sequence F for a WordNet sense to
which verb entry v is mapped. This is the prob-
ability that a verb in a Levin+ class is mapped to
a WordNet verb sense with some specific combi-
nation of frames. Values average only .11, but in
some cases the probability is 1.0.
• P(Indv-frame) = |{θ_gv : cf_fv}| / |{θ_gv}|,
where θ_gv is the occurrence of the single θ-grid
component g for verb entry v and cf_fv is the occur-
rence of the single frame f for a WordNet sense to
which verb entry v is mapped. This is the proba-
bility that a verb in a Levin+ class with a partic-
ular θ-grid component (possibly among others) is
mapped to a WordNet verb sense assigned a spe-
cific frame (possibly among others). Values aver-
age .20, but in some cases the probability is 1.0.
• P(Prior) = |{t_s}| / |{t_v}|, where
t_s is an occurrence of tag s (for a particular synset)
in SEMCOR and t_v is an occurrence of any of a set
of tags for verb v in SEMCOR, with s being one
of the senses possible for verb v. This probability
is the prior probability of specific WordNet verb
senses. Values average .11, but in some cases the
probability is 1.0.
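The counting behind these probabilities can be sketched in a few lines. The sketch below is illustrative only, not the original implementation: the relation triples, synset-to-class map, and tag counts are hypothetical, and only the first (Grid-relatedness) and last (prior) measures are shown; the others follow the same counting pattern.

```python
from collections import defaultdict

def grid_relatedness(relations, grid_classes):
    """P(Grid-relatedness): for each relationship type, the fraction
    of relations of that type whose two synsets are mapped to by
    verbs sharing a Grid class.

    relations    -- iterable of (type, synset_i, synset_j) triples
    grid_classes -- dict: synset id -> set of Grid classes of the
                    verbs mapped to that synset
    """
    total, shared = defaultdict(int), defaultdict(int)
    for rel_type, si, sj in relations:
        total[rel_type] += 1
        if grid_classes.get(si, set()) & grid_classes.get(sj, set()):
            shared[rel_type] += 1
    return {t: shared[t] / total[t] for t in total}

def sense_priors(tag_counts):
    """P(Prior): occurrences of a sense's tag divided by occurrences
    of any tag for the verb, as counted in a SEMCOR-like corpus."""
    total = sum(tag_counts.values())
    return {sense: n / total for sense, n in tag_counts.items()}

# Toy data with hypothetical synset ids, Grid classes, and counts:
rels = [("hypernymy", "s1", "s2"), ("hypernymy", "s1", "s3")]
grids = {"s1": {"G1"}, "s2": {"G1", "G2"}, "s3": {"G3"}}
rel_probs = grid_relatedness(rels, grids)
# rel_probs["hypernymy"] == 0.5: one of the two links shares a class
priors = sense_priors({"drop_1": 6, "drop_2": 3, "drop_8": 1})
```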
In addition to the foregoing data elements,
based on the training set, we also made use of
a semantic similarity measure, which reflects the
confidence with which a verb, given the total set
of verbs assigned to its Levin+ class, is mapped
to a specific WordNet sense. This represents an
implementation of a class disambiguation algo-
rithm (Resnik, 1999a), modified to run against the
WordNet verb hierarchy.5
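A minimal sketch of the information-content idea behind this measure follows; the hierarchy fragment and concept probabilities are invented for the example, and the actual algorithm additionally spreads credit over the whole class of verbs rather than scoring a single pair.

```python
import math

def resnik_similarity(ancestors_a, ancestors_b, concept_prob):
    """Information-content similarity in the spirit of Resnik
    (1999b): the similarity of two senses is the information
    content, -log p(c), of the most informative concept that
    subsumes them both.

    ancestors_a/b -- sets of hypernym ancestors of each sense
                     (including the sense itself)
    concept_prob  -- dict: concept -> corpus-estimated probability
    """
    common = ancestors_a & ancestors_b
    if not common:
        return 0.0
    return max(-math.log(concept_prob[c]) for c in common)

# Hypothetical hierarchy fragment: both senses fall under 'move',
# which is rarer (hence more informative) than the root concept.
prob = {"root": 1.0, "move": 0.25, "descend": 0.05}
sim = resnik_similarity({"descend", "move", "root"},
                        {"move", "root"}, prob)
# sim == -log(0.25), contributed by the subsumer 'move'
```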
We also made a powerful “same-synset as-
sumption”: If (1) two verbs are assigned to the
same Levin+ class, (2) one of the verbs v_i has
been mapped to a specific WordNet sense s_i, and
(3) the other verb v_j has a WordNet sense s_j syn-
onymous with s_i, then v_j should be mapped to s_j.
Since WordNet groups synonymous word senses
into “synsets,” s_i and s_j would correspond to
the same synset. Since Levin+ verbs are mapped
to WordNet senses via their corresponding synset
identifiers, when the set of conditions enumer-
ated above are met, the two verb entries would be
mapped to the same WordNet synset.
As an example, the two verbs tag and mark
have been assigned to the same Levin+ class. In
WordNet, each occurs in five synsets, only one
of which contains them both. If tag has a WordNet
synset assigned to it for the Levin+ class it shares
with mark, and it is the synset that covers senses
5The assumption underlying this measure is that the ap-
propriate word senses for a group of semantically related
words should themselves be semantically related. Given
WordNet’s hierarchical structure, the semantic similarity be-
tween two WordNet senses corresponds to the degree of in-
formativeness of the most specific concept that subsumes
them both.
of both tag and mark, we can safely assume that
that synset is also appropriate for mark, since in
that context, the two verb senses are synonymous.
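The propagation this assumption licenses can be sketched as follows; the class label and synset identifiers are hypothetical placeholders.

```python
def propagate_same_synset(class_members, senses, assigned):
    """Enforce the same-synset assumption: if a verb in a Levin+
    class is mapped to a synset, every other verb of that class
    having a sense in the same synset is mapped to it as well.

    class_members -- dict: Levin+ class -> set of member verbs
    senses        -- dict: verb -> set of synset ids it can express
    assigned      -- dict: (class, verb) -> set of assigned synsets
    Returns the newly generated (class, verb, synset) triples.
    """
    new = set()
    for cls, verbs in class_members.items():
        # All synsets already assigned to some verb of this class:
        cls_synsets = set()
        for v in verbs:
            cls_synsets |= assigned.get((cls, v), set())
        for v in verbs:
            for syn in cls_synsets & senses.get(v, set()):
                if syn not in assigned.get((cls, v), set()):
                    new.add((cls, v, syn))
    return new

# The tag/mark case: the two verbs share one synset ('syn_label',
# a made-up id); tag is already mapped to it, so the mapping
# propagates to mark.
new = propagate_same_synset(
    {"C1": {"tag", "mark"}},
    {"tag": {"syn_label"}, "mark": {"syn_label", "syn_grade"}},
    {("C1", "tag"): {"syn_label"}})
```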
4 Evaluation
Subsequent to the culling of the training set, sev-
eral processes were undertaken that resulted in
full mapping of entries in the lexical database to
WordNet senses. Much, but not all, of this map-
ping was accomplished manually.
Each entry whose WordNet senses were as-
signed manually was considered by at least two
coders, one coder who was involved in the entire
manual assignment process and the other drawn
from a handful of coders working independently
on different subsets of the verb lexicon. In the
manual tagging, if a WordNet sense was consid-
ered appropriate for a lexical entry by any one of
the coders, it was assigned. Overall, 13452 Word-
Net sense assignments were made. Of these, 51%
were agreed upon by multiple coders. The kappa
coefficient (κ) of intercoder agreement was .47
for a first round of manual tagging and (only) .24
for a second round of more problematic cases.6
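For the unadjusted two-coder case, the kappa computation (defined in footnote 6) amounts to the following sketch; the judgments are invented, and the paper's actual computation adjusts for multiple senses per verb.

```python
def kappa(coder1, coder2):
    """Cohen's kappa, (P(A) - P(E)) / (1 - P(E)), for two coders
    making binary assign/don't-assign judgments on the same items
    (a simplification of the adjusted computation the paper uses).
    """
    n = len(coder1)
    # Observed agreement:
    p_a = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Chance agreement from each coder's marginal frequencies:
    p1, p2 = sum(coder1) / n, sum(coder2) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_a - p_e) / (1 - p_e)

# Toy judgments over eight verb-sense pairs (1 = sense assigned):
k = kappa([1, 1, 0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 0, 1, 1])
# k == 0.5 here: 75% observed vs. 50% chance agreement
```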
While the full tagging of the lexical database
may make the automatic tagging task appear su-
perfluous, the low rate of agreement between
coders and the automatic nature of some of the
tagging suggest there is still room for adjust-
ment of WordNet sense assignments in the verb
database. On the one hand, even the higher of
the kappa coefficients mentioned above is signifi-
cantly lower than the standard suggested for good
reliability (κ > .8) or even the level where ten-
tative conclusions may be drawn (.67 < κ <
.8) (Carletta, 1996), (Krippendorff, 1980). On
the other hand, if the automatic assignments agree
with human coding at levels comparable to the de-
gree of agreement among humans, it may be used
to identify current assignments that need review
6The kappa statistic measures the degree to which pair-
wise agreement of coders on a classification task surpasses
what would be expected by chance; the standard definition of
this coefficient is: κ = (P(A) − P(E)) / (1 − P(E)),
where P(A) is the actual percentage of agreement and P(E)
is the expected percentage of agreement, averaged over all
pairs of assignments. Several adjustments in the computation
of the kappa coefficient were made necessary by the possible
assignment of multiple senses for each verb in a Levin+ class,
since without prior knowledge of how many senses are to be
assigned, there is no basis on which to compute P(E).
and to suggest new assignments for consideration.
In addition, consistency checking is done more
easily by machine than by hand. For example, the
same-synset assumption is more easily enforced
automatically than manually. When this assump-
tion is implemented for the 2756 senses in the
training set, another 967 sense assignments are
generated, only 131 of which were actually as-
signed manually. Similarly, when this premise is
enforced on the entirety of the lexical database
of 13452 assignments, another 5059 sense assign-
ments are generated. If the same-synset assump-
tion is valid and if the senses assigned in the
database are accurate, then the human tagging has
a recall of no more than 73%.
Because a word sense was assigned even if only
one coder judged it to apply, human coding has
been treated as having a precision of 100%. How-
ever, some of the solo judgments are likely to have
been in error. To determine what proportion of
such judgments were in reality precision failures,
a random sample of 50 WordNet senses selected
by only one of the two original coders was in-
vestigated further by a team of three judges. In
this round, judges rated WordNet senses assigned
to verb entries as falling into one of three cate-
gories: definitely correct, definitely incorrect, and
arguable whether correct. As it turned out, if any
one of the judges rated a sense definitely correct,
another judge independently judged it definitely
correct; this accounts for 31 instances. In 13 in-
stances the assignments were judged definitely in-
correct by at least two of the judges. No con-
sensus was reached on the remaining 6 instances.
Extrapolating from this sample to the full set of
solo judgments in the database leads to an estimate
that approximately 1725 (26% of 6636 solo judg-
ments) of those senses are incorrect. This suggests
that the precision of the human coding is approx-
imately 87%.
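The extrapolation can be reproduced with a few lines of arithmetic, using the figures stated in the text:

```python
# Figures from the text: 13 of the 50 sampled solo judgments were
# definitely incorrect; 6636 of the 13452 total assignments were
# solo judgments.
incorrect_rate = 13 / 50                      # 26%
est_incorrect = round(6636 * incorrect_rate)  # about 1725 senses
precision = (13452 - est_incorrect) / 13452   # about 87%
```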
The upper bound for this task, as set by human
performance, is thus 73% recall and 87% preci-
sion. The lower bound, based on assigning the
WordNet sense with the greatest prior probability,
is 38% recall and 62% precision.
5 Mapping Strategies
Recent work (Van Halteren et al., 1998) has
demonstrated improvement in part-of-speech tag-
ging when the outputs of multiple taggers are
combined. When the errors of multiple classi-
fiers are not significantly correlated, the result of
combining votes from a set of individual classi-
fiers often outperforms the best result from any
single classifier. Using a voting strategy seems es-
pecially appropriate here: The measures outlined
in Section 3 average only 41% recall on the train-
ing set, but the senses picked out by their highest
values vary significantly.
The investigations undertaken used both sim-
ple and aggregate voters, combined using var-
ious voting strategies. The simple voters were
the 7 measures previously introduced.7 In addi-
tion, three aggregate voters were generated: (1)
the product of the simple measures (smoothed so
that zero values wouldn’t offset all other mea-
sures); (2) the weighted sum of the simple mea-
sures, with weights representing the percentage of
the training set assignments correctly identified by
the highest score of the simple probabilities; and
(3) the maximum score of the simple measures.
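These three aggregate voters can be sketched as below; the smoothing constant is a hypothetical choice (the paper does not state one), and the scores and weights are invented.

```python
import math

def aggregate_voters(scores, weights, eps=1e-3):
    """The three aggregate voters for one candidate sense, built
    from its simple-measure scores: a smoothed product, a weighted
    sum, and the maximum.  eps is a hypothetical smoothing constant
    keeping a zero score from offsetting the product."""
    product = math.prod(max(s, eps) for s in scores)
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    return product, weighted_sum, max(scores)

# Toy scores for one sense under three simple measures, with
# hypothetical per-measure weights:
prod, wsum, mx = aggregate_voters([0.3, 0.0, 0.9], [0.5, 0.2, 0.3])
# the zero score is smoothed to eps in the product; mx == 0.9
```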
Using these data, two different types of vot-
ing schemes were investigated. The schemes dif-
fer most significantly on the circumstances un-
der which a voter casts its vote for a WordNet
sense, the size of the vote cast by each voter, and
the circumstances under which a WordNet sense
was selected. We will refer to these two schemes
as Majority Voting Scheme and Threshold Voting
Scheme.
5.1 Majority Voting Scheme
Although we do not know in advance how many
WordNet senses should be assigned to an entry in
the lexical database, we assume that, in general,
there is at least one. In line with this intuition, one
strategy we investigated was to have both simple
and aggregate measures cast a vote for whichever
sense(s) of a verb in a Levin+ class received the
highest (non-zero) value for that measure. Ten
variations are given here:
• PriorProb: Prior Probability of WordNet
senses
• SemSim: Semantic Similarity
7Only 6 measures (including the semantic similarity mea-
sure) were set out in the earlier section; the measures total 7
because Indv frame probability is used in two different ways.
• SimpleProd: Product of all simple measures
• SimpleWtdSum: Weighted sum of all sim-
ple measures
• MajSimpleSgl: Majority vote of all (7) sim-
ple voters
• MajSimplePair: Majority vote of all (21)
pairs of simple voters8
• MajAggr: Majority vote of SimpleProd and
SimpleWtdSum
• Maj3Best: Majority vote of SemSim, Sim-
pleProd, and SimpleWtdSum
• MajSgl+Aggr: Majority vote of MajSim-
pleSgl and MajAggr
• MajPair+Aggr: Majority vote of MajSim-
plePair and MajAggr
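The underlying voting rule can be sketched as follows; the sense labels and scores are invented for the example.

```python
def majority_vote(sense_scores):
    """Majority voting: each measure votes for the sense(s) with
    its highest non-zero score; a sense is selected when more than
    half of the measures vote for it.

    sense_scores -- dict: sense -> list of per-measure scores,
                    in the same measure order for every sense
    """
    n_measures = len(next(iter(sense_scores.values())))
    votes = {s: 0 for s in sense_scores}
    for m in range(n_measures):
        top = max(scores[m] for scores in sense_scores.values())
        if top > 0:
            for s, scores in sense_scores.items():
                if scores[m] == top:
                    votes[s] += 1
    return {s for s, v in votes.items() if v > n_measures / 2}

# Toy example with two senses and three measures; the tie on the
# second measure gives a vote to both senses, but sense1 still
# collects a majority.
picked = majority_vote({"sense1": [0.6, 0.2, 0.9],
                        "sense2": [0.4, 0.2, 0.1]})
```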
Table 2 gives recall and precision measures for
all variations of this voting scheme, both with
and without enforcement of the same-synset as-
sumption. If we use the harmonic mean of recall
and precision as a criterion for comparing results,
the best voting scheme is MajAggr, with 58% re-
call and 72% precision without enforcement of the
same-synset assumption. Note that if the same-
synset assumption is correct, the drop in precision
that accompanies its enforcement mostly reflects
inconsistencies in human judgments in the train-
ing set; the true precision value for MajAggr after
enforcing the same-synset assumption is probably
close to 67%.
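The comparison criterion is simply the harmonic mean (the F-measure) of recall and precision; applied to two rows of Table 2:

```python
def harmonic_mean(recall, precision):
    """Harmonic mean of recall and precision (the F-measure)."""
    return 2 * recall * precision / (recall + precision)

# MajAggr vs. SemSim from Table 2, without same-synset enforcement:
f_majaggr = harmonic_mean(0.58, 0.72)
f_semsim = harmonic_mean(0.56, 0.71)
# f_majaggr (about .64) edges out f_semsim (about .63)
```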
Of the simple voters, only PriorProb and Sem-
Sim are individually strong enough to warrant dis-
cussion. Although PriorProb was used to estab-
lish our lower bound, SemSim proves to be the
stronger voter, bested only by MajAggr (the ma-
jority vote of SimpleProd and SimpleWtdSum) in
voting that enforces the same-synset assumption.
Both PriorProb and SemSim provide better results
than the majority vote of all 7 simple voters (Ma-
jSimpleSgl) and the majority vote of all 21 pairs
of simple voters (MajSimplePair). Moreover, the
inclusion of MajSimpleSgl and MajSimplePair in
a majority vote with MajAggr (in MajSgl+Aggr
8A pair cast a vote for a sense if, among all the senses of a
verb, a specific sense had the highest value for both measures.
Variation        W/O SS       W/ SS
                 R     P      R     P
PriorProb        38%   62%    45%   46%
SemSim           56%   71%    60%   55%
SimpleProd       51%   74%    57%   55%
SimpleWtdSum     53%   77%    58%   56%
MajSimpleSgl     23%   71%    30%   48%
MajSimplePair    38%   60%    45%   43%
MajAggr          58%   72%    63%   53%
Maj3Best         52%   78%    57%   57%
MajSgl+Aggr      44%   74%    50%   54%
MajPair+Aggr     49%   77%    55%   57%
Table 2: Recall (R) and Precision (P) for Majority
Voting Scheme, Before (W/O) and After (W/) En-
forcement of the Same-Synset (SS) Assumption
Variation R P
AutoMap+ 61% 54%
AutoMap- 61% 54%
Triples 63% 52%
Combo 53% 44%
Combo&Auto 59% 45%
Table 3: Recall (R) and Precision (P) for Thresh-
old Voting Scheme
and MajPair+Aggr, respectively) turns in poorer
results than MajAggr alone.
The poor performance of MajSimpleSgl and
MajSimplePair does not point, however, to a gen-
eral failure of the principle that multiple voters
are better than individual voters. SimpleProd, the
product of all simple measures, and SimpleWtd-
Sum, the weighted sum of all simple measures,
provide reasonably strong results, and a majority
vote of both of them (MajAggr) gives the best
results of all. When they are joined by SemSim in
Maj3Best, they continue to provide good results.
The bottom line is that SemSim makes the most
significant contribution of any single simple voter,
while the product and weighted sums of all simple
voters, in concert with each other, provide the best
results of all with this voting scheme.
5.2 Threshold Voting Scheme
The second voting strategy first identified, for
each simple and aggregate measure, the threshold
value at which the product of recall and precision
scores in the training set has the highest value if
that threshold is used to select WordNet senses.
During the voting, if a WordNet sense has a higher
score for a measure than its threshold, the measure
votes for the sense; otherwise, it votes against it.
The weight of the measure’s vote is the precision-
recall product at the threshold. This voting strat-
egy has the advantage of taking into account each
individual attribute’s strength of prediction.
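A sketch of this scheme follows; the scores, thresholds, and weights are invented, not values from the training data.

```python
def threshold_vote(sense_scores, thresholds, vote_weights):
    """Threshold voting: each measure votes for a sense whose score
    exceeds the measure's threshold and against it otherwise, the
    vote weighted by the measure's recall-precision product at that
    threshold (all values here are hypothetical).  Returns each
    sense's vote total; senses above a variation-specific total
    would then be selected."""
    totals = {}
    for sense, scores in sense_scores.items():
        total = 0.0
        for score, thr, w in zip(scores, thresholds, vote_weights):
            total += w if score > thr else -w
        totals[sense] = total
    return totals

totals = threshold_vote({"sense1": [0.7, 0.4], "sense2": [0.1, 0.5]},
                        thresholds=[0.5, 0.3],
                        vote_weights=[0.30, 0.12])
# sense1 clears both thresholds (+0.42); sense2 only the second
```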
Five variations on this basic voting scheme
were investigated. In each, senses were selected
if their vote total exceeded a variation-specific
threshold. Table 3 summarizes recall and pre-
cision for these variations at their optimal vote
thresholds.
In the AutoMap+ variation, Grid and Levin+
probabilities abstain from voting when their val-
ues are zero (a common occurrence, because
of data sparsity in the training set); the same-
synset assumption is automatically implemented.
AutoMap- differs in that it disregards the Grid
and Levin+ probabilities completely. The Triples
variation places the simple and composite mea-
sures into three groups, the three with the high-
est weights, the three with the lowest weights,
and the middle or remaining three. Voting first
occurs within the group, and the group’s vote is
brought forward with a weight equaling the sum
of the group members’ weights. This variation
also adds to the vote total if the sense was as-
signed in the training data. The Combo variation
is like Triples, but rather than using the weights
and thresholds calculated for the single measures
from the training data, this variation calculates
weights and thresholds for combinations of two,
three, four, five, six, and seven measures. Finally,
the Combo&Auto variation adds the same-synset
assumption to the previous variation.
Although not evident in Table 3 because of
rounding, AutoMap- has slightly higher values for
both recall and precision than does AutoMap+,
giving it the highest recall-precision product of the
threshold voting schemes. This suggests that the
Grid and Levin+ probabilities could profitably be
dropped from further use.
Of the more exotic voting variations, Triples
voting achieved results nearly as good as the Au-
toMap voting schemes, but the Combo schemes
fell short, indicating that weights and thresholds
are better based on single measures than combi-
nations of measures.
6 Conclusions and Future Work
The voting schemes still leave room for improve-
ment, as the best results (58% recall and 72% pre-
cision, or, optimistically, 63% recall and 67% pre-
cision) fall shy of the upper bound of 73% re-
call and 87% precision for human coding.9 At
the same time, these results are far better than the
lower bound of 38% recall and 62% precision for
the most frequent WordNet sense.
As has been true in many other evaluation stud-
ies, the best results come from combining classi-
fiers (MajAggr): not only does this variation use
a majority voting scheme, but more importantly,
the two voters take into account all of the sim-
ple voters, in different ways. The next-best re-
sults come from Maj3Best, in which the three best
single measures vote. We should note, however,
that the single best measure, the semantic similar-
ity measure from SemSim, lags only slightly be-
hind the two best voting schemes.
This research demonstrates that credible word
sense disambiguation results can be achieved
without recourse to contextual data. Lexical re-
sources enriched with, for example, syntactic in-
formation, in which some portion of the resource
is hand-mapped to another lexical resource may
be rich enough to support such a task. The de-
gree of success achieved here also owes much to
the confluence of WordNet’s hierarchical struc-
ture and SEMCOR tagging, as used in the compu-
tation of the semantic similarity measure, on the
one hand, and the classified structure of the verb
lexicon, which provided the underlying groupings
used in that measure, on the other hand. Even
where one measure yields good results, several
data sources needed to be combined to enable its
success.
Acknowledgments
The authors are supported, in part, by
PFF/PECASE Award IRI-9629108, DOD
9The criteria for the majority voting schemes preclude
their assigning more than 2 senses to any single database en-
try. Controlled relaxation of these criteria may achieve some-
what better results.
Contract MDA904-96-C-1250, DARPA/ITO
Contracts N66001-97-C-8540 and N66001-
00-28910, and a National Science Foundation
Graduate Research Fellowship.
References
Srinivas Bangalore and Owen Rambow. 2000.
Corpus-Based Lexical Choice in Natural Language
Generation. In Proceedings of the ACL, Hong
Kong.
Olivier Bodenreider and Carol A. Bean. 2001. Re-
lationships among Knowledge Structures: Vocabu-
lary Integration within a Subject Domain. In C.A.
Bean and R. Green, editors, Relationships in the
Organization of Knowledge, pages 81–98. Kluwer,
Dordrecht.
Jean Carletta. 1996. Assessing Agreement on Classi-
fication Tasks: The Kappa Statistic. Computational
Linguistics, 22(2):249–254, June.
Bonnie J. Dorr and Douglas Jones. 1996. Robust Lex-
ical Acquisition: Word Sense Disambiguation to In-
crease Recall and Precision. Technical report, Uni-
versity of Maryland, College Park, MD.
Bonnie J. Dorr and Mari Broman Olsen. 1997. De-
riving Verbal and Compositional Lexical Aspect
for NLP Applications. In Proceedings of the
35th Annual Meeting of the Association for Com-
putational Linguistics (ACL-97), pages 151–158,
Madrid, Spain, July 7-12.
Bonnie J. Dorr, M. Antonia Martí, and Irene Castellón.
1997. Spanish EuroWordNet and LCS-Based In-
terlingual MT. In Proceedings of the Workshop on
Interlinguas in MT, MT Summit, New Mexico State
University Technical Report MCCS-97-314, pages
19–32, San Diego, CA, October.
Christiane Fellbaum. 1998. WordNet: An Electronic
Lexical Database. MIT Press, Cambridge, MA.
Eduard Hovy. In press. Comparing Sets of Semantic
Relations in Ontologies. In R. Green, C.A. Bean,
and S. Myaeng, editors, The Semantics of Rela-
tionships: An Interdisciplinary Perspective. Book
manuscript submitted for review.
A. Kilgarriff and J. Rosenzweig. 2000. Framework
and Results for English SENSEVAL. Computers
and the Humanities, 34:15–48.
Klaus Krippendorff. 1980. Content Analysis: An In-
troduction to Its Methodology. Sage, Beverly Hills.
Beth Levin. 1993. English Verb Classes and Alter-
nations: A Preliminary Investigation. University of
Chicago Press, Chicago, IL.
George A. Miller and Christiane Fellbaum. 1991. Se-
mantic Networks of English. In Beth Levin and
Steven Pinker, editors, Lexical and Conceptual Se-
mantics, pages 197–229. Elsevier Science Publish-
ers, B.V., Amsterdam, The Netherlands.
Tom Mitchell. 1997. Machine Learning. McGraw
Hill.
Mari Broman Olsen, Bonnie J. Dorr, and David J.
Clark. 1997. Using WordNet to Posit Hierarchical
Structure in Levin’s Verb Classes. In Proceedings
of the Workshop on Interlinguas in MT, MT Sum-
mit, New Mexico State University Technical Report
MCCS-97-314, pages 99–110, San Diego, CA, Oc-
tober.
Martha Palmer. 2000. Consistent Criteria for
Sense Distinctions. Computers and the Humanities,
34:217–222.
Adwait Ratnaparkhi. 2000. Trainable methods for sur-
face natural language generation. In Proceedings of
the ANLP-NAACL, Seattle, WA.
Philip Resnik. 1999a. Disambiguating noun group-
ings with respect to wordnet senses. In S. Arm-
strong, K. Church, P. Isabelle, E. Tzoukermann
S. Manzi, and D. Yarowsky, editors, Natural Lan-
guage Processing Using Very Large Corpora, pages
77–98. Kluwer Academic, Dordrecht.
Philip Resnik. 1999b. Semantic similarity in a taxon-
omy: An information-based measure and its appli-
cation to problems of ambiguity in natural language.
In Journal of Artificial Intelligence Research, num-
ber 11, pages 95–130.
Hans Van Halteren, Jakub Zavrel, and Walter Daele-
mans. 1998. Improving data-driven wordclass tag-
ging by system combination. In Proceedings of the
36th Annual Meeting of the Association for Compu-
tational Linguistics and the 17th International Con-
ference on Computational Linguistics, pages 491–
497.
