Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 25–32, Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
New Experiments in Distributional Representations of Synonymy
Dayne Freitag, Matthias Blume, John Byrnes, Edmond Chow,
Sadik Kapadia, Richard Rohwer, Zhiqiang Wang
HNC Software, LLC
3661 Valley Centre Drive
San Diego, CA 92130, USA
{DayneFreitag,MatthiasBlume,JohnByrnes,EdChow,
SadikKapadia,RichardRohwer,ZhiqiangWang}@fairisaac.com
Abstract
Recent work on the problem of detect-
ing synonymy through corpus analysis has
used the Test of English as a Foreign Lan-
guage (TOEFL) as a benchmark. How-
ever, this test involves as few as 80 ques-
tions, prompting questions regarding the
statistical significance of reported results.
We overcome this limitation by generating
a TOEFL-like test using WordNet, con-
taining thousands of questions and com-
posed only of words occurring with suf-
ficient corpus frequency to support sound
distributional comparisons. Experiments
with this test lead us to a similarity mea-
sure which significantly outperforms the
best proposed to date. Analysis suggests
that a strength of this measure is its rela-
tive robustness against polysemy.
1 Introduction
Many text applications are predicated on the idea
that shallow lexical semantics can be acquired
through corpus analysis. Harris articulated the ex-
pectation that words with similar meanings would be
used in similar contexts (Harris, 1968), and recent
empirical work involving large corpora has borne
this out. In particular, by associating each word with
a distribution over the words observed in its context,
we can distinguish synonyms from non-synonyms
with fair reliability. This capability may be ex-
ploited to generate corpus-based thesauri automat-
ically (Lin, 1998), or used in any other application
of text that might benefit from a measure of lexi-
cal semantic similarity. And synonymy is a logical
first step in a broader research program that seeks to
account for natural language semantics through dis-
tributional means.
Previous research into corpus-analytic approaches
to synonymy has used the Test of English as a For-
eign Language (TOEFL). The TOEFL consists of
300 multiple-choice questions, each question involving
five words: the problem or target word and four
response words, one of which is a synonym of the
target. The objective is to identify the synonym (call
this the answer word, and call the other response
words decoys). In the context of research into lexi-
cal semantics, we seek a distance function which as
reliably as possible orders the answer word in front
of the decoys.
Landauer and Dumais first proposed the TOEFL
as a test of lexical semantic similarity and reported
a score of 64.4% on an 80-question version of the
TOEFL, a score nearly identical to the average score
of human test takers (Landauer and Dumais, 1997).
Subsequently, Sahlgren reported a score of 72.0%
on the same test using “random indexing” and a dif-
ferent training corpus (Sahlgren, 2001). By analyz-
ing a much larger corpus, Ehlert was able to score
82% on a 300-question version of the TOEFL, using
a simple distribution over contextual words (Ehlert,
2003).
While success on the TOEFL does not immediately
guarantee success in real-world applications
requiring lexical similarity judgments, the scores
have an intuitive appeal. They are easily inter-
pretable, and the expected performance of a random
guesser (25%) and typical human performance are
both known. Nevertheless, the TOEFL is problem-
atic in at least two ways. On the one hand, because it
involves so few questions, conclusions based on the
TOEFL regarding closely competing approaches are
suspect. Even on the 300-question TOEFL, a score
of 82% is accurate only to within plus or minus 4%
at the 95% confidence level. The other shortcoming
is a potential mis-match between the test vocabulary
and the corpus vocabulary. Typically, a substantial
number of questions include words observed too in-
frequently in the training corpus for a semantic judg-
ment to be made with any confidence.
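The margin quoted above can be checked with the usual normal approximation to a binomial proportion. The sketch below is only an illustration: the function name `margin_of_error` and the choice of z = 1.96 for the 95% level are assumptions, not something the original analysis specifies.

```python
import math

def margin_of_error(accuracy, n_questions, z=1.96):
    """Half-width of a normal-approximation confidence interval
    for an accuracy measured on n_questions independent questions."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# A score of 82% on 300 questions, and the same score on 80 questions.
for n in (300, 80):
    print(n, round(100 * margin_of_error(0.82, n), 1))
```

Under these assumptions the half-width is a little over 4 points for 300 questions and roughly twice that for 80 questions, which is why small differences between systems are hard to interpret on the TOEFL.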
We seek to overcome these difficulties by gener-
ating TOEFL-like tests automatically from Word-
Net (Fellbaum, 1998). While WordNet has been
used before to evaluate corpus-analytic approaches
to lexical similarity (Lin, 1998), the metric proposed
in that study, while useful for comparative purposes,
lacks an intuitive interpretation. In contrast, we
emulate the TOEFL using WordNet and inherit the
TOEFL’s easy interpretability.
Given a corpus, we first derive a list of words oc-
curring with sufficient marginal frequency to sup-
port a distributional comparison. We then use Word-
Net to generate a large set of questions identical in
format to those in the TOEFL. For a vocabulary of
reasonable size, this yields questions numbering in
the thousands. While the resulting questions differ
in some interesting ways from those in the TOEFL
(see below), their sheer number supports more con-
fident conclusions. Beyond this, we can partition
them by part of speech or degree of polysemy, en-
abling some analyses not supported by the original
TOEFL.
2 The Test
To generate a TOEFL-like test from WordNet, we
perform the following procedure once each for
nouns, verbs, adjectives and adverbs. Given a list of
candidate words, we produce one test question for
every ordered pair of words appearing together in
any synset in the respective WordNet part-of-speech
database. Decoy words are chosen at random from
among other words in the database that do not have
a synonymy relation with either word in the pair.
For convenience, we will call the resulting test the
WordNet-based synonymy test (WBST).

technology:  A. engineering  B. difference  C. department  D. west
stadium:     A. miss         B. hockey      C. wife        D. bowl
string:      A. giant        B. ballet      C. chain       D. hat
trial:       A. run          B. one-third   C. drove       D. form

Table 1: Four questions chosen at random from the
noun test. Answers are A, D, C, and A.
We take a few additional steps in order to in-
crease the resemblance between the WBST and the
TOEFL. First, we remove from consideration any
stop words or inflected forms. Note that whether
a particular wordform is inflected is a function of
its presumed part of speech. The word “indicted”
is either an inflected verb (so would not be used as a
word in a question involving verbs) or an uninflected
adjective. Second, we rule out pairs of words that
are too similar under the string edit distance. Mor-
phological variants often share a synset in WordNet.
For example, “group” and “grouping” share a nom-
inal sense. Questions using such pairs appear trivial
to human test takers and allow stemming shortcuts.
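The generation procedure can be sketched on toy data as follows. This is not the authors' code: synsets are passed in as plain sets of words rather than read from WordNet, the edit-distance threshold `min_edit=4` is a hypothetical value (the text gives none), and one question is emitted per distinct ordered pair.

```python
import random
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def make_questions(synsets, candidates, min_edit=4, n_decoys=3, seed=0):
    """Return (target, responses, answer) triples in TOEFL format."""
    rng = random.Random(seed)
    candidates = set(candidates)
    # synonyms[w] = candidate words sharing at least one synset with w
    synonyms = {w: set() for w in candidates}
    for synset in synsets:
        members = [w for w in synset if w in candidates]
        for a, b in permutations(members, 2):
            synonyms[a].add(b)
    questions = []
    for target in sorted(candidates):
        for answer in sorted(synonyms[target]):
            if edit_distance(target, answer) < min_edit:
                continue  # too similar as strings, e.g. group / grouping
            # decoys must not be synonyms of either word in the pair
            pool = sorted(candidates - {target, answer}
                          - synonyms[target] - synonyms[answer])
            if len(pool) < n_decoys:
                continue
            responses = rng.sample(pool, n_decoys) + [answer]
            rng.shuffle(responses)
            questions.append((target, responses, answer))
    return questions
```

Removal of stop words and inflected forms is assumed to have happened before `candidates` is built.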
In the experiments reported in this paper, we used
WordNet 1.7.1. Our experimental corpus is the
North American News corpus, which is also used
by Ehlert (2003). We include as a candidate test
word any word occurring at least 1000 times in the
corpus (about 15,000 words when restricted to those
appearing in WordNet). Table 1 shows four sample
questions generated from this list out of the noun
database. In total, this procedure yields 9887 noun,
7398 verb, 5824 adjective, and 461 adverb ques-
tions, a total of 23,570 questions.1
This procedure yields questions that differ in
some interesting ways from those in the TOEFL.
Most notable is a bias in favor of polysemous terms.
The number of times a word appears as either the tar-
get or the answer word is proportional to the number
of synonyms it has in the candidate list. In contrast,
decoy words are chosen at random, so are less
polysemous on average.
1 This test is available as http://www.cs.cmu.edu/~dayne/wbst-nanews.tar.gz.
3 The Space of Solutions
Given that we have a large number of test ques-
tions composed of words with high corpus frequen-
cies, we now seek to optimize performance on the
WBST. The solutions we consider all start with a
word-conditional context frequency vector, usually
normalized to form a probability distribution. We
answer a question by comparing the target term vec-
tor and each of the response term vectors, choosing
the “closest.”
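Answering a question then reduces to an argmin over the response words. The sketch below assumes a dictionary `vectors` mapping each word to its normalized context vector and uses the Manhattan distance purely as a placeholder; all names and the toy numbers are illustrative.

```python
def manhattan(p, q):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def answer_question(target, responses, vectors, distance=manhattan):
    """Pick the response whose vector is closest to the target's vector."""
    return min(responses, key=lambda r: distance(vectors[target], vectors[r]))

def accuracy(questions, vectors, distance=manhattan):
    """Fraction of (target, responses, answer) questions answered correctly."""
    hits = sum(answer_question(t, rs, vectors, distance) == a
               for t, rs, a in questions)
    return hits / len(questions)

# Toy vectors over three made-up contexts.
vectors = {
    "stadium": [0.5, 0.5, 0.0],
    "bowl":    [0.4, 0.6, 0.0],
    "miss":    [0.0, 0.1, 0.9],
    "hockey":  [0.1, 0.2, 0.7],
    "wife":    [0.0, 0.0, 1.0],
}
print(answer_question("stadium", ["miss", "hockey", "wife", "bowl"], vectors))
```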
This problem definition excludes a common class
of solutions to this problem, in which the closeness
of a pair of terms is a statistic of the co-occurrence
patterns of the specific terms in question. It has
been shown that measures based on the pointwise
mutual information (PMI) between question words
yield good results on the TOEFL (Turney, 2001;
Terra and Clarke, 2003). However, Ehlert (2003)
shows convincingly that, for a fixed amount of data,
the distributional model performs better than what
we might call the pointwise co-occurrence model.
Terra and Clarke (2003) report a top score of 81.3%
on an 80-question version of the TOEFL, which
compares favorably with Ehlert’s best of 82% on a
300-question version, but their corpus is approximately 200
times as large as Ehlert’s.
Note that these two approaches are complemen-
tary and can be combined in a supervised setting,
along with static resources, to yield truly strong per-
formance (97.5%) on the TOEFL (Turney et al.,
2003). While impressive, this work raises an important
question: Where do we obtain the training
data when moving to a less commonly taught lan-
guage, to say nothing of the comprehensive thesauri
and Web resources? In this paper, we focus on
shallow methods that use only the text corpus. We
are interested less in optimizing performance on the
TOEFL than in investigating the validity and limits
of the distributional hypothesis, and in illuminating
the barriers to automated human-level lexical simi-
larity judgments.
3.1 Definitions of Context
As in previous work, we form our context distribu-
tions by recording word-conditional counts of fea-
ture occurrences within some fixed window of a ref-
erence token. In this study, features are just unnor-
malized tokens, possibly augmented with direction
and distance information. In other words, we do not
investigate the utility of stemming. Similarly, except
where noted, we do not remove stop words.
All context definitions involve a window size,
which specifies the number of tokens to consider on
either side of an occurrence of a reference term. It
is always symmetric. Thus, a window size of one
indicates that only the immediately adjacent tokens
on either side should be considered. By default,
we bracket a token sequence with pseudo-tokens
“<bos>” and “<eos>”.2
Contextual tokens in the window may be either
observed or disregarded, and the policy governing
which to admit is one of the dimensions we ex-
plore here. The decision whether or not to observe
a particular contextual token is made before count-
ing commences, and is not sensitive to the circum-
stances of a particular occurrence (e.g., its partici-
pation in some syntactic relation (Lin, 1997; Lee,
1999)). When a contextual token is observed, it
is always counted as a single occurrence. Thus,
in contrast with earlier approaches (Sahlgren, 2001;
Ehlert, 2003), we do not use a weighting scheme that
is a function of distance from the reference token.
Once we have chosen to observe a contextual to-
ken, additional parameters govern whether counting
should be sensitive to the side of the reference token
on which it occurs and how distant from the refer-
ence token it is. If the strict direction parameter is
true, a left occurrence is distinguished from a right
occurrence. If strict distance is true, occurrences at
distinct removes (in number of tokens) are recorded
as distinct event types.
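A minimal sketch of this counting scheme follows, assuming pre-tokenized sequences. The function name, the tuple encoding of features, and the decision to count contexts for the pseudo-tokens as well are illustrative choices, not details taken from the paper.

```python
from collections import Counter, defaultdict

def count_contexts(sequences, window=2, strict_direction=False,
                   strict_distance=False, bracket=True):
    """Map each reference word to a Counter over context features."""
    counts = defaultdict(Counter)
    for tokens in sequences:
        if bracket:
            tokens = ["<bos>"] + list(tokens) + ["<eos>"]
        for i, word in enumerate(tokens):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or j < 0 or j >= len(tokens):
                    continue
                feature = [tokens[j]]
                if strict_direction:
                    feature.append("L" if offset < 0 else "R")
                if strict_distance:
                    feature.append(abs(offset))
                # Every observed token counts once; no distance weighting.
                counts[word][tuple(feature)] += 1
    return counts

counts = count_contexts([["the", "cat", "sat"]], window=1, strict_direction=True)
print(dict(counts["cat"]))
```

A policy for disregarding particular contextual tokens could be added as a filter on `tokens[j]` fixed before counting starts.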
3.2 Distance Measures
The product of a particular context policy is a
co-occurrence matrix $M$, where the content of a cell
$M_{w,c}$ is the number of times context $c$ is observed to
occur with word $w$. A row of this matrix ($M_w$) is
therefore a word-conditional context frequency
vector. In comparing two of these vectors, we typically
normalize counts so that all cells in a row sum to
one, yielding a word-conditional distribution over
contexts $P(c \mid w)$ (but see the Cosine measure
below).
2 In this paper, a sequence is a North American News
segment delimited by the <p> tag. Nominally paragraphs, most of
these segments are single sentences.
We investigate some of the distance measures
commonly employed in comparing term vectors.
These include:
Manhattan: $\sum_i \left| P(c_i \mid w_1) - P(c_i \mid w_2) \right|$
Euclidean: $\sqrt{\sum_i \left( P(c_i \mid w_1) - P(c_i \mid w_2) \right)^2}$
Cosine: $\dfrac{\sum_c M_{w_1,c} \, M_{w_2,c}}{\lVert M_{w_1} \rVert \cdot \lVert M_{w_2} \rVert}$
Note that whereas we use probabilities in calculating
the Manhattan and Euclidean distances, in order to
avoid magnitude effects, the Cosine, which defines
a different kind of normalization, is applied to raw
number counts.
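The three measures can be written in a few lines of plain Python. This is a sketch on dense lists with illustrative function names; real context vectors would be sparse. It also shows why normalization matters for Manhattan and Euclidean but not for the Cosine.

```python
import math

def normalize(counts):
    """Turn a row of raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine(u, v):
    """Cosine of the angle between two raw count vectors (a similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

rare, frequent = [2, 0, 2], [20, 0, 20]   # same profile, different magnitude
print(manhattan(rare, frequent))                        # dominated by magnitude
print(manhattan(normalize(rare), normalize(frequent)))  # magnitude removed
print(cosine(rare, frequent))                           # scale-invariant
```

Note that the Cosine grows with similarity, so the "closest" response is the one with the largest value, whereas for the two distances it is the smallest.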
We also avail ourselves of measures suggested
by probability theory. For $\alpha \in (0, 1)$ and
word-conditional context distributions $p$ and $q$, we
have the so-called $\alpha$-divergences (Zhu and Rohwer,
1998):

$$D_\alpha(p, q) = \frac{1 - \sum_i p_i^{\alpha} q_i^{1-\alpha}}{\alpha (1 - \alpha)} \tag{1}$$

Divergences $D_0$ and $D_1$ are defined as limits as
$\alpha \to 0$ and $\alpha \to 1$:

$$D_1(p, q) = D_0(q, p) = \sum_i p_i \log \frac{p_i}{q_i}$$

In other words, $D_1(p, q)$ is the KL-divergence of $p$
from $q$. Members of this divergence family are in
some sense preferred by theory to alternative
measures. It can be shown that the $\alpha$-divergences (or
divergences defined by combinations of them, such
as the Jensen-Shannon or “skew” divergences (Lee,
1999)) are the only ones that are robust to redundant
contexts (i.e., only divergences in this family are
invariant) (Csiszár, 1975).
Several notions of lexical similarity have been
based on the KL-divergence. Note that if any
$q_i = 0$, then $D_1(p, q)$ is infinite; in general, the
KL-divergence is very sensitive to small probabilities,
and careful attention must be paid to smoothing if
it is to be used with text co-occurrence data. The
Jensen-Shannon divergence, an average of the
divergences of $p$ and $q$ from their mean distribution,
does not share this sensitivity and has previously
been used in tests of lexical similarity (Lee, 1999).
Furthermore, unlike the KL-divergence, it is
symmetric, presumably a desirable property in this
setting, since synonymy is a symmetric relation, and
our test design exploits this symmetry.
However, $D_{1/2}(p, q)$, the Hellinger distance,3 is
also symmetric and robust to small or zero
estimates. To our knowledge, the Hellinger distance
has not previously been assessed as a measure of
lexical similarity. We experimented with both the
Hellinger distance and Jensen-Shannon (JS)
divergence, and obtained close scores across a wide range
of parameter settings, with the Hellinger yielding a
slightly better top score. We report results only for
the Hellinger distance below. As will be seen,
neither the Hellinger nor the JS divergence is optimal
for this task.
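The relationships among these divergences are easy to verify numerically. The sketch below assumes the convention in which the squared Hellinger distance equals one minus the sum of $\sqrt{p_i q_i}$; function names and the toy distributions are illustrative.

```python
import math

def alpha_divergence(p, q, alpha):
    """Alpha-divergence for 0 < alpha < 1 (Equation 1)."""
    s = sum(a ** alpha * b ** (1.0 - alpha) for a, b in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))

def kl_divergence(p, q):
    """KL-divergence of p from q; raises ZeroDivisionError if q_i = 0 < p_i."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def hellinger(p, q):
    """Hellinger distance, normalized to lie in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def jensen_shannon(p, q):
    """Average KL-divergence of p and q from their mean distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p, q = [0.5, 0.3, 0.2], [0.1, 0.6, 0.3]
print(alpha_divergence(p, q, 0.5), 4 * hellinger(p, q) ** 2)  # these agree
print(alpha_divergence(p, q, 1 - 1e-6), kl_divergence(p, q))  # limit alpha -> 1
print(jensen_shannon(p, [0.0, 0.7, 0.3]))                     # finite despite a zero
```

The last line illustrates the robustness discussed above: a zero estimate makes the KL-divergence undefined, while the JS divergence and the Hellinger distance remain finite.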
In pursuit of synonymy, Ehlert (2003) derives a
formula for the probability of the target word given
a response word:

$$P(w_1 \mid w_2) = \sum_c \frac{P(w_1 \mid c) \, P(w_2 \mid c) \, P(c)}{P(w_2)} \tag{2}$$

$$\phantom{P(w_1 \mid w_2)} = P(w_1) \sum_c \frac{P(c \mid w_1) \, P(c \mid w_2)}{P(c)} \tag{3}$$

The second line, which fits more conveniently into
our framework, follows from the first (Ehlert’s
expression) through an application of Bayes’
theorem. While this measure falls outside the class of
$\alpha$-divergences, our experiments confirm its relative
strength on synonymy tests.
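The equivalence of the two forms can be checked on a small count matrix. The sketch below estimates every probability by maximum likelihood from a toy matrix whose rows are words and whose columns are contexts; the function names and the matrix are illustrative.

```python
def column_sums(M):
    return [sum(row[c] for row in M) for c in range(len(M[0]))]

def ehlert_eq2(M, w1, w2):
    """Sum over c of P(w1|c) P(w2|c) P(c), divided by P(w2)."""
    total = sum(sum(row) for row in M)
    cols = column_sums(M)
    s = 0.0
    for c, n_c in enumerate(cols):
        if n_c == 0:
            continue  # context never observed
        s += (M[w1][c] / n_c) * (M[w2][c] / n_c) * (n_c / total)
    return s / (sum(M[w2]) / total)

def ehlert_eq3(M, w1, w2):
    """P(w1) times the sum over c of P(c|w1) P(c|w2) / P(c)."""
    total = sum(sum(row) for row in M)
    cols = column_sums(M)
    r1, r2 = sum(M[w1]), sum(M[w2])
    s = sum((M[w1][c] / r1) * (M[w2][c] / r2) / (cols[c] / total)
            for c in range(len(cols)) if cols[c] > 0)
    return (r1 / total) * s

M = [[4, 1, 0],
     [2, 2, 1],
     [0, 1, 5]]
print(ehlert_eq2(M, 0, 1), ehlert_eq3(M, 0, 1))  # the two forms agree
```

Unlike the distances above, this quantity is a similarity: the preferred response is the one that maximizes it.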
It is possible to unify the $\alpha$-divergences with
Ehlert’s expression by defining a broader class of
measures:

$$D_{\alpha,\beta,\gamma}(p, q) = 1 - \sum_i c_i^{-\gamma} \, p_i^{\alpha} \, q_i^{\beta} \tag{4}$$

where $c_i$ is the marginal probability of a single
context, and $p_i$ and $q_i$ are its respective word-conditional
probabilities. Since, in the context of a given
question, $P(w_1)$ does not change, maximizing the
expression in Equation 3 is the same as minimizing
$D_{1,1,1}$. $D_{\alpha,(1-\alpha),0}$ recovers the
$\alpha$-divergences up to
a constant multiple, and $D_{1,1,0}$ provides the
complement of the familiar inner-product measure.
3 Actually, $D_{1/2}(p, q)$ is four times the square of the
Hellinger distance.
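The three-parameter family is compact enough to state directly in code. In the sketch below the exponents are called `alpha`, `beta`, and `gamma`, with `gamma` applied to the marginal context probability; these names, like the function name, are assumptions made for illustration.

```python
import math

def general_divergence(p, q, c, alpha, beta, gamma):
    """One minus the sum over contexts of c_i^-gamma * p_i^alpha * q_i^beta.

    p, q: word-conditional context distributions
    c:    marginal context probabilities
    """
    total = 0.0
    for p_i, q_i, c_i in zip(p, q, c):
        if p_i > 0 and q_i > 0:  # zero terms contribute nothing
            total += c_i ** (-gamma) * p_i ** alpha * q_i ** beta
    return 1.0 - total

p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]
c = [0.3, 0.4, 0.3]

# alpha = beta = 1/2, gamma = 0: the alpha-divergence with alpha = 1/2,
# up to the constant factor alpha * (1 - alpha) = 1/4.
print(general_divergence(p, q, c, 0.5, 0.5, 0.0))
# alpha = beta = gamma = 1: complement of the sum in Equation 3.
print(general_divergence(p, q, c, 1.0, 1.0, 1.0))
# alpha = beta = 1, gamma = 0: complement of the inner product.
print(general_divergence(p, q, c, 1.0, 1.0, 0.0))
```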
4 Evaluation
We experimented with various distance measures
and context policies using the full North American
News corpus. We count approximately one billion
words in this corpus, which is roughly four times
the size of the largest corpus considered by Ehlert.
Except where noted, the numbers reported here
are the result of taking the full WBST, a total of
23,570 test questions. Given this number of ques-
tions, scores where most of the results fall are accu-
rate to within plus or minus 0.6% at the 95% confi-
dence level.
4.1 Performance Bounds
In order to provide a point of comparison, the pa-
per’s authors each answered the same random sam-
ple of 100 questions from each part of speech. Aver-
age performance over this sample was 88.4%. The
one non-native speaker scored 80.3%. As will be
seen, this is better than the best automated result.
The expected score, in the absence of any seman-
tic information, is 25%. However, as noted, target
and answer words are more polysemous than decoy
words on average, and this can be exploited to es-
tablish a higher baseline. Since the frequency of
a word is correlated with its polysemy, a strategy
which always selects the most frequent word among
the response words yields 39.2%, 34.5%, 29.1%,
and 38.0% on nouns, verbs, adjectives, and adverbs,
respectively, for an average score of 35.2%.
4.2 An Initial Comparison
Table 2 displays a basic comparison of the distance
measures and context definitions enumerated so far.
For each distance measure (Manhattan, Euclidean,
Cosine, Hellinger, and Ehlert), results are shown for
window sizes 1 to 4 (columns). Results are further
sub-divided according to whether strict direction and
distance are false (None), only strict direction is true
(Dir), or both strict direction and strict distance are
true (Dir+Dist). In bold is the best score, along with
any scores indistinguishable from it at the 95% con-
fidence level.
                    Window Size
                    1     2     3     4
Manh  None         54.2  58.8  60.4  60.6
      Dir          54.3  58.5  60.3  60.8
      Dir+Dist      –    57.3  58.8  58.9
Euc   None         42.9  45.3  46.6  47.6
      Dir          43.2  45.7  46.8  47.6
      Dir+Dist      –    44.9  45.3  45.6
Cos   None         44.9  46.7  47.6  48.3
      Dir          46.2  48.0  48.6  49.2
      Dir+Dist      –    48.0  48.4  48.5
Hell  None         57.9  62.3  62.2  61.0
      Dir          57.2  62.6  63.3  61.8
      Dir+Dist      –    61.2  61.7  61.1
Ehl   None         64.0  66.2  66.2  65.7
      Dir          63.9  66.9  67.6  67.1
      Dir+Dist      –    66.4  67.2  67.5

Table 2: Accuracy (%) on the WBST: an initial
comparison of distance measures and context definitions.

Notable in Table 2 are the somewhat depressed
scores, compared with those reported for the
TOEFL. Ehlert reports a best score on the TOEFL
of 82%, whereas the best we are able to achieve on
the WBST is 67.6%. Although there are differences
in some of the experimental details (Ehlert employs
a triangular window weighting and experiments with
stemming), these probably do not account for the
discrepancy. Rather, this appears to be a harder test
than the TOEFL, despite the fact that all words
involved are seen with high frequency.
It is hard to escape the conclusion that, in pursuit
of high scores, choice of distance measure is more
critical than the specific definition of context. All
scores returned by the Ehlert metric are significantly
higher than any returned by other distance measures.
Among the Ehlert scores, there is a surprising lack of
sensitivity to context policy, given a window of size
2 or larger.
Although the Hellinger distance yields scores
only in the middle of the pack, it might be that other
divergences from the $\alpha$-divergence family, such as
the KL-divergence, would yield better scores. We
experimented with various settings of $\alpha$ in
Equation 1. In all cases, we observed bell-shaped curves
with peaks approximately at $\alpha = 0.5$ and locally
worst performance with values at or near 0 or 1. This
held true when we used maximum likelihood
estimates, or under a simple smoothing regime in which
all cells of the co-occurrence matrix were initialized
with various fixed values. It is possible that
numerical issues are nevertheless partly responsible for the
poor showing of the KL-divergence. However, given
the symmetry of the synonymy relation, it would be
surprising if some value of $\alpha$ far from 0.5 was
ultimately shown to be best.
4.3 The Importance of Weighting
The Ehlert measure and the cosine are closely
related—both involve an inner product between
vectors—yet they return very different scores in Ta-
ble 2. There are two differences between these meth-
ods, normalization and vector element weighting.
We presume that normalization does not account for
the large score difference, and attribute the discrep-
ancy, and the general strength of the Ehlert measure,
to importance weighting.
In information retrieval, it is common to take the
cosine between vectors where vector elements are
not raw frequency counts, but counts weighted using
some version of the “inverse document frequency”
(IDF). We ran the cosine experiment again, this
time weighting the count of context $c$ by $\log(N / n_c)$,
where $N$ is the number of rows in the count matrix
$M$ and $n_c$ is the number of rows containing a
non-zero count for context $c$. The results confirmed our
expectation. The performance of “CosineIDF” for a
window size of 3 with strict direction was 64.0%,
which is better than Hellinger but worse than the
Ehlert measure. This was the best result returned
for “CosineIDF.”
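A sketch of the IDF-weighted cosine on a toy count matrix follows. Each column (context) is weighted by the logarithm of the number of rows divided by the number of rows with a non-zero count in that column. The function names are illustrative, and returning 0.0 for an all-zero weighted vector is an assumption added to keep the example well defined.

```python
import math

def idf_weights(M):
    """One weight per column: log(#rows / #rows with a non-zero count)."""
    n_rows = len(M)
    weights = []
    for c in range(len(M[0])):
        rows_with_c = sum(1 for row in M if row[c] > 0)
        weights.append(math.log(n_rows / rows_with_c) if rows_with_c else 0.0)
    return weights

def cosine_idf(M, w1, w2):
    """Cosine between rows w1 and w2 after IDF weighting of the counts."""
    weights = idf_weights(M)
    u = [x * w for x, w in zip(M[w1], weights)]
    v = [x * w for x, w in zip(M[w2], weights)]
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0  # assumption: an all-zero weighted vector matches nothing
    return sum(a * b for a, b in zip(u, v)) / norm

# Column 0 occurs with every word, so its weight is log(1) = 0.
M = [[5, 1, 0],
     [5, 0, 2],
     [5, 1, 0]]
print(cosine_idf(M, 0, 1), cosine_idf(M, 0, 2))
```

In this toy matrix rows 0 and 1 share only the ubiquitous context, so weighting removes all of their apparent similarity, which is the effect importance weighting is meant to have.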
4.4 Optimizing Distance Measures
Both the Hellinger distance and the Ehlert measure
are members of the family of measures defined by
Equation 4. Although there are theoretical reasons
to prefer each to neighboring members of the same
family (see the discussion following Equation 1),
we undertook to validate this preference empirically.
We conducted parameter sweeps of $\gamma$, $\alpha$, and $\beta$, first
exploring members of the family $\alpha = \beta$, of which
both Hellinger and Ehlert are members. Specifically,
we explored the space between $\alpha = \beta = 0.5$ and
$\alpha = \beta = 1$, first in increments of 0.1, then in
increments of 0.01 around the approximate maximum, in
all cases varying $\gamma$ widely.
This experiment clearly favored a region midway
between the Hellinger and Ehlert measures. We
identified $\alpha = \beta = 0.75$, with $\gamma = 1.1$, as the
approximate midpoint of this optimal region. We next
varied $\alpha$ and $\beta$ independently around this point. This
resulted in no improvement to the score, confirming
our expectation that some point along $\alpha = \beta$ would
be best. For the sake of brevity, we will refer to this
best point ($D_{0.75,0.75,1.1}$) as the “Optimal” measure.
As Table 3 indicates, this measure is significantly
better than the Ehlert measure, or any other measure
investigated here.

          Noun  Verb  Adj   Adv   All
Ehlert    71.6  57.2  73.4  72.5  67.6
Optimal   75.8  63.8  76.4  76.6  72.2

Table 3: Comparison between the Ehlert measure
and the “optimal” point in the space of measures
defined by Equation 4 ($\alpha = \beta = 0.75$, $\gamma = 1.1$), by
part of speech. Context policy is window size 3 with
strict direction.
This clear separation between Ehlert and Opti-
mal does not hold for the original TOEFL. Using
the same context policy, we applied these measures
to 298 of the 300 questions used by Ehlert (all
questions except those involving multi-word terms,
which our framework does not currently support).
Optimal returns 84.2%, while Ehlert’s measure re-
turns 83.6%, which is slightly better than the 82%
reported by Ehlert. The two results are not distin-
guishable with any statistical significance.
Interesting in Table 3 is the range of scores seen
across parts of speech. The variation is even wider
under other measures, the usual ordering among
parts of speech being (from highest to lowest) ad-
verb, adjective, noun, verb. In Section 4.6, we at-
tempt to shed some light on both this ordering and
the close outcome we observe on the TOEFL.
4.5 Optimizing Context Policy
It is certain that not every contextual token seen
within the co-occurrence window is equally impor-
tant to the detection of synonymy, and probable that
some such tokens are useless or even detrimental.
On the one hand, the many low-frequency events in
the tails of the context distributions consume a lot
of space, perhaps without contributing much information.
On the other, very-high-frequency terms are
typically closed-class and stop words, possibly too
common to be useful in making semantic distinc-
tions. We investigated excluding words at both ends
of the frequency spectrum.
We experimented with two kinds of exclusion
policies: one excluding the $k$ most frequent terms,
for $k$ ranging between 10 and 200; and one
excluding terms occurring fewer than $k$ times, for $k$
ranging from 3 up to 100. Both Ehlert and Optimal
were largely invariant across all settings; no
tistically significant improvements or degradations
were observed. Optimal returned scores ranging
from 72.0%, when contexts with marginal frequency
fewer than 100 were ignored, up to 72.6%, when the
200 most frequent terms were excluded.
Note there is a large qualitative difference be-
tween the two exclusion procedures. Whereas
we exclude only at most 200 words in the high-
frequency experiment, the number of terms ex-
cluded in the low-frequency experiment ranges
from 939,496 (less than minimum frequency 3) to
1,534,427 (minimum frequency 100), out of a vo-
cabulary containing about 1.6 million terms. Thus, it
is possible to reduce the expense of corpus analysis
substantially without sacrificing semantic fidelity.
4.6 Polysemy
We hypothesized that the variation in scores across
part of speech has to do with the average number of
senses seen in a test set. Common verbs, for exam-
ple, tend to be much more polysemous (and syntac-
tically ambiguous) than common adverbs. WordNet
allows us to test this hypothesis.
We define the polysemy level of a question as the
sum of the number of senses in WordNet of its tar-
get and answer words. Polysemy levels in our ques-
tion set range from 2 up to 116. Calculating the
average polysemy level for questions in the various
parts of speech—5.1, 6.7, 7.5, and 10.4, for adverbs,
adjectives, nouns, and verbs, respectively—provides
support for our hypothesis, inasmuch as this order-
ing aligns with test scores. By contrast, the average
polysemy level in the TOEFL, which spans all four
parts of speech, is 4.6.
Plotting performance against polysemy level
helps explain why Ehlert and Optimal return roughly
equivalent performance on the original TOEFL.
Figure 1 plots the Ehlert and Optimal measures as a
function of the polysemy level of the questions. To
produce this plot, we grouped questions according
to polysemy level, creating many smaller tests, and
scored each measure on each test separately.

[Figure 1: Score as a function of polysemy level.
Horizontal axis: Polysemy (0 to 30); vertical axis:
Score (0.5 to 1); curves: Optimal, Ehlert.]
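The grouping behind Figure 1 amounts to bucketing per-question outcomes by polysemy level. The sketch below assumes each outcome is available as a (polysemy level, correct) pair for one measure; the function name and the toy data are illustrative.

```python
from collections import defaultdict

def accuracy_by_polysemy(results):
    """results: iterable of (polysemy_level, is_correct) pairs.

    Returns {level: accuracy}, treating the questions at each
    level as a separate small test.
    """
    buckets = defaultdict(lambda: [0, 0])  # level -> [hits, total]
    for level, correct in results:
        buckets[level][0] += int(correct)
        buckets[level][1] += 1
    return {level: hits / total
            for level, (hits, total) in sorted(buckets.items())}

# Made-up outcomes: (polysemy level, answered correctly?)
toy = [(2, True), (2, True), (2, False), (7, True), (7, False), (30, False)]
print(accuracy_by_polysemy(toy))
```

Since the number of questions at high polysemy levels is small, the per-level scores there carry wide error bars.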
At low polysemy levels, the Ehlert and Optimal
measures perform equally well. The advantage of
Optimal over Ehlert appears to lie specifically in its
relative strength in handling polysemous terms.
5 Discussion
Specific conclusions regarding the “Optimal” mea-
sure are problematic. We do not know whether
or to what extent this particular parameter setting
is universally best, best only for English, best for
newswire English, or best only for the specific test
we have devised. We have restricted our attention
to a relatively small space of similarity measures,
excluding many previously proposed measures of
lexical affinity (but see Weeds et al. (2004) and
Lee (1999) for some empirical comparisons). Lee
observed that measures from the space of invari-
ant divergences (particularly the JS and skew diver-
gences) perform at least as well as any of a wide
variety of alternatives. As noted, we experimented
with the JS divergence and observed accuracies that
tracked those of the Hellinger closely. This provides
a point of comparison with the measures investi-
gated by Lee, and recommends both Ehlert’s mea-
sure and what we have called “Optimal” as credible,
perhaps superior alternatives. More generally, our
results argue for some form of feature importance
weighting.
Empirically, the strength of Optimal on the
WBST is a feature of its robustness in the presence
of polysemy. Both Ehlert and Optimal are expressed
as a sum of ratios, in which the numerator is a prod-
uct of some function of conditional context prob-
abilities, and the denominator is some function of
the marginal probability. The Optimal exponents on
both the numerator and denominator have the effect
of advantaging lower-probability events, relative to
Ehlert. In our test, WordNet senses are sampled uni-
formly at random. Perhaps its emphasis on lower
probability events allows Optimal to sacrifice some
fidelity on high-frequency senses in exchange for in-
creased sensitivity to low-frequency ones.
It is clear, however, that polysemy is a critical
hurdle confronting distributional approaches to lex-
ical semantics. Figure 1 shows that, in the absence
of polysemy, distributional comparisons detect syn-
onymy quite well. Much of the human advantage
over machines on this task may be attributed to an
awareness of polysemy. In order to achieve perfor-
mance comparable to that of humans, therefore, it
is probably not enough to optimize context policies
or to rely on larger collections of text. Instead, we
require strategies for detecting and resolving latent
word senses.
Pantel and Lin (2002) propose one such method,
evaluated by finding the degree of overlap between
sense clusters and synsets in WordNet. The above
considerations suggest that a possibly more perti-
nent test of such approaches is to evaluate their util-
ity in the detection of semantic similarity between
specific polysemous terms. We expect to undertake
such an evaluation in future work.
Acknowledgments. This material is based on
work funded in whole or in part by the U.S. Govern-
ment. Any opinions, findings, conclusions, or rec-
ommendations expressed in this material are those
of the authors, and do not necessarily reflect the
views of the U.S. Government.

References
I. Csiszár. 1975. I-divergence geometry of probability
distributions and minimization problems. Annals of
Probability, 3:146–158.
B. Ehlert. 2003. Making accurate lexical semantic sim-
ilarity judgments using word-context co-occurrence
statistics. Master’s thesis, University of California,
San Diego.
C. Fellbaum. 1998. WordNet: An Electronic Lexical
Database. The MIT Press.
Z. Harris. 1968. Mathematical Structures of Language.
Interscience Publishers, New York.
T.K. Landauer and S.T. Dumais. 1997. A solution to
Plato’s problem: The latent semantic analysis theory
of acquisition, induction and representation of knowl-
edge. Psychological Review, 104(2):211–240.
L. Lee. 1999. Measures of distributional similarity. In
Proceedings of the 37th ACL.
D. Lin. 1997. Using syntactic dependency as local con-
text to resolve word sense ambiguity. In Proceedings
of ACL-97, Madrid, Spain.
D. Lin. 1998. Automatic retrieval and clustering of sim-
ilar words. In Proceedings of COLING-ACL98, Mon-
treal, Canada.
P. Pantel and D. Lin. 2002. Discovering word senses
from text. In Proceedings of KDD-02, Edmonton,
Canada.
M. Sahlgren. 2001. Vector-based semantic analysis: rep-
resenting word meanings based on random labels. In
Semantic Knowledge Acquisition and Categorisation
Workshop, ESSLLI 2001, Helsinki, Finland.
E. Terra and C.L.A. Clarke. 2003. Frequency estimates
for statistical word similarity measures. In Proceed-
ings of HLT/NAACL 2003, Edmonton, Canada.
P.D. Turney, M.L. Littman, J. Bigham, and V. Schnay-
der. 2003. Combining independent modules to solve
multiple-choice synonym and analogy problems. In
Proceedings of the International Conference on Recent
Advances in Natural Language Processing.
P.D. Turney. 2001. Mining the web for synonyms: PMI-
IR versus LSA on TOEFL. In Proceedings of the 12th
European Conference on Machine Learning (ECML-
01).
J. Weeds, D. Weir, and D. McCarthy. 2004. Character-
ising measures of lexical distributional similarity. In
Proceedings of CoLing 2004, Geneva, Switzerland.
H. Zhu and R. Rohwer. 1998. Information geometry,
Bayesian inference, ideal estimates, and error decom-
position. Technical Report 98-06-045, Santa Fe Insti-
tute.
