Characterising Measures of Lexical Distributional Similarity
Julie Weeds, David Weir and Diana McCarthy
Department of Informatics
University of Sussex
Brighton, BN1 9QH, UK
fjuliewe, davidw,dianamg@sussex.ac.uk
Abstract
This work investigates the variation in a word’s dis-
tributionally nearest neighbours with respect to the
similarity measure used. We identify one type of
variation as being the relative frequency of the neigh-
bour words with respect to the frequency of the tar-
get word. We then demonstrate a three-way connec-
tion between relative frequency of similar words, a
concept of distributional gnerality and the seman-
tic relation of hyponymy. Finally, we consider the
impact that this has on one application of distribu-
tional similarity methods (judging the composition-
ality of collocations).
1 Introduction
Over recent years, many Natural Language Pro-
cessing (NLP) techniques have been developed
that might bene t from knowledge of distribu-
tionally similar words, i.e., words that occur in
similar contexts. For example, the sparse data
problem can make it di cult to construct lan-
guage models which predict combinations of lex-
ical events. Similarity-based smoothing (Brown
et al., 1992; Dagan et al., 1999) is an intuitively
appealing approach to this problem where prob-
abilities of unseen co-occurrences are estimated
from probabilities of seen co-occurrences of dis-
tributionally similar events.
Other potential applications apply the hy-
pothesised relationship (Harris, 1968) between
distributional similarity and semantic similar-
ity; i.e., similarity in the meaning of words can
be predicted from their distributional similarity.
One advantage of automatically generated the-
sauruses (Grefenstette, 1994; Lin, 1998; Curran
and Moens, 2002) over large-scale manually cre-
ated thesauruses such as WordNet (Fellbaum,
1998) is that they might be tailored to a partic-
ular genre or domain.
However, due to the lack of a tight de ni-
tion for the concept of distributional similarity
and the broad range of potential applications, a
large number of measures of distributional sim-
ilarity have been proposed or adopted (see Sec-
tion 2). Previous work on the evaluation of dis-
tributional similarity methods tends to either
compare sets of distributionally similar words
to a manually created semantic resource (Lin,
1998; Curran and Moens, 2002) or be oriented
towards a particular task such as language mod-
elling (Dagan et al., 1999; Lee, 1999). The  rst
approach is not ideal since it assumes that the
goal of distributional similarity methods is to
predict semantic similarity and that the seman-
tic resource used is a valid gold standard. Fur-
ther, the second approach is clearly advanta-
geous when one wishes to apply distributional
similarity methods in a particular application
area. However, it is not at all obvious that one
universally best measure exists for all applica-
tions (Weeds and Weir, 2003). Thus, applying a
distributional similarity technique to a new ap-
plication necessitates evaluating a large number
of distributional similarity measures in addition
to evaluating the new model or algorithm.
We propose a shift in focus from attempting
to discover the overall best distributional sim-
ilarity measure to analysing the statistical and
linguistic properties of sets of distributionally
similar words returned by di erent measures.
This will make it possible to predict in advance
of any experimental evaluation which distribu-
tional similarity measures might be most appro-
priate for a particular application.
Further, we explore a problem faced by
the automatic thesaurus generation community,
which is that distributional similarity methods
do not seem to o er any obvious way to dis-
tinguish between the semantic relations of syn-
onymy, antonymy and hyponymy. Previous
work on this problem (Caraballo, 1999; Lin et
al., 2003) involves identifying speci c phrasal
patterns within text e.g., \Xs and other Ys" is
used as evidence that X is a hyponym of Y. Our
work explores the connection between relative
frequency, distributional generality and seman-
tic generality with promising results.
The rest of this paper is organised as follows.
In Section 2, we present ten distributional simi-
larity measures that have been proposed for use
in NLP. In Section 3, we analyse the variation in
neighbour sets returned by these measures. In
Section 4, we take one fundamental statistical
property (word frequency) and analyse correla-
tion between this and the nearest neighbour sets
generated. In Section 5, we relate relative fre-
quency to a concept of distributional generality
and the semantic relation of hyponymy. In Sec-
tion 6, we consider the e ects that this has on a
potential application of distributional similarity
techniques, which is judging compositionality of
collocations.
2 Distributional similarity measures
In this section, we introduce some basic con-
cepts and then discuss the ten distributional
similarity measures used in this study.
The co-occurrence types of a target word are
the contexts, c, in which it occurs and these
have associated frequencies which may be used
to form probability estimates. In our work, the
co-occurrence types are always grammatical de-
pendency relations. For example, in Sections 3
to 5, similarity between nouns is derived from
their co-occurrences with verbs in the direct-
object position. In Section 6, similarity between
verbs is derived from their subjects and objects.
The k nearest neighbours of a target word w
are the k words for which similarity with w is
greatest. Our use of the term similarity measure
encompasses measures which should strictly be
referred to as distance, divergence or dissimilar-
ity measures. An increase in distance correlates
with a decrease in similarity. However, either
type of measure can be used to  nd the k near-
est neighbours of a target word.
Table 1 lists ten distributional similarity mea-
sures. The cosine measure (Salton and McGill,
1983) returns the cosine of the angle between
two vectors.
The Jensen-Shannon (JS) divergence measure
(Rao, 1983) and the  -skew divergence measure
(Lee, 1999) are based on the Kullback-Leibler
(KL) divergence measure. The KL divergence,
or relative entropy, D(pjjq), between two prob-
ability distribution functions p and q is de ned
(Cover and Thomas, 1991) as the \ine ciency
of assuming that the distribution is q when the
true distribution is p": D(pjjq) = Pcplog pq.
However, D(pjjq) = 1 if there are any con-
texts c for which p(c) > 0 and q(c) = 0. Thus,
this measure cannot be used directly on maxi-
mum likelihood estimate (MLE) probabilities.
One possible solution is to use the JS diver-
gence measure, which measures the cost of using
the average distribution in place of each individ-
ual distribution. Another is the  -skew diver-
gence measure, which uses the p distribution to
smooth the q distribution. The value of the pa-
rameter  controls the extent to which the KL
divergence is approximated. We use  = 0:99
since this provides a close approximation to the
KL divergence and has been shown to provide
good results in previous research (Lee, 2001).
The confusion probability (Sugawara et al.,
1985) is an estimate of the probability that one
word can be substituted for another. Words
w1 and w2 are completely confusable if we are
equally as likely to see w2 in a given context as
we are to see w1 in that context.
Jaccard’s coe cient (Salton and McGill,
1983) calculates the proportion of features be-
longing to either word that are shared by both
words. In the simplest case, the features of a
word are de ned as the contexts in which it has
been seen to occur. simja+mi is a variant (Lin,
1998) in which the features of a word are those
contexts for which the pointwise mutual infor-
mation (MI) between the word and the context
is positive, where MI can be calculated using
I(c;w) = log P(cjw)P(c) . The related Dice Coe -
cient (Frakes and Baeza-Yates, 1992) is omitted
here since it has been shown (van Rijsbergen,
1979) that Dice and Jaccard’s Coe cients are
monotonic in each other.
Lin’s Measure (Lin, 1998) is based on his
information-theoretic similarity theorem, which
states, \the similarity between A and B is mea-
sured by the ratio between the amount of in-
formation needed to state the commonality of
A and B and the information needed to fully
describe what A and B are."
The  nal three measures are settings in
the additive MI-based Co-occurrence Retrieval
Model (AMCRM) (Weeds and Weir, 2003;
Weeds, 2003). We can measure the precision
and the recall of a potential neighbour’s re-
trieval of the co-occurrences of the target word,
where the sets of required and retrieved co-
occurrences (F(w1) and F(w2) respectively) are
those co-occurrences for which MI is positive.
Neighbours with both high precision and high
recall retrieval can be obtained by computing
Measure Function
cosine simcm(w2;w1) =
P
cP(cjw1):P(cjw2)pP
cP(cjw1)
2P
cP(cjw2)
2
Jens.-Shan. distjs(w2;w1) = 12
 
D
 
pjjp+q2
 
+D
 
qjjp+q2
  
where p = P(cjw1) and q = P(cjw2)
 -skew dist (w2;w1) = D(pjj( :q + (1  ):p)) where p = P(cjw1) and q = P(cjw2)
conf. prob. simcp(w2jw1) = Pc P(w1jc):P(w2jc):P(c)P(w1)
Jaccard’s simja(w2;w1) = jF(w1)\F(w2)jjF(w1)[F(w2)jwhere F(w) =fc : P(cjv) > 0g
Jacc.+MI simja+mi(w2;W1) = jF(w1)\F(w2)jjF(w1)[F(w2)j where F(w) =fc : I(c;w) > 0g
Lin’s simlin(w2;w1) =
P
F(w1)\F(w2)(I(c;w1)+I(c;w2))P
F(w1) I(c;w1)+
P
F(w2) I(c;w2)
where F(w) =fc : I(c;w) > 0g
precision simP(w2;w1) =
P
F(w1)\F(w2) I(c;w2)P
F(w2) I(c;w2)
where F(w) =fc : I(c;w) > 0g
recall simR(w2;w1) =
P
F(w1)\F(w2) I(c;w1)P
F(w1) I(c;w1)
where F(w) =fc : I(c;w) > 0g
harm. mean simhm(w2;w1) = 2:simP(w2;w1):simR(w2;w1)simP(w2;w1)+simR(w2;w1) where F(w) =fc : I(c;w) > 0g
Table 1: Ten distributional similarity measures
their harmonic mean (or F-score).
3 Overlap of neighbour sets
We have described a number of ways of calcu-
lating distributional similarity. We now con-
sider whether there is substantial variation in
a word’s distributionally nearest neighbours ac-
cording to the chosen measure. We do this by
calculating the overlap between neighbour sets
for 2000 nouns generated using di erent mea-
sures from direct-object data extracted from the
British National Corpus (BNC).
3.1 Experimental set-up
The data from which sets of nearest neighbours
are derived is direct-object data for 2000 nouns
extracted from the BNC using a robust accurate
statistical parser (RASP) (Briscoe and Carroll,
2002). For reasons of computational e ciency,
we limit ourselves to 2000 nouns and direct-
object relation data. Given the goal of compar-
ing neighbour sets generated by di erent mea-
sures, we would not expect these restrictions to
a ect our  ndings. The complete set of 2000
nouns (WScomp) is the union of two sets WShigh
and WSlow for which nouns were selected on the
basis of frequency: WShigh contains the 1000
most frequently occurring nouns (frequency >
500), and WSlow contains the nouns ranked
3001-4000 (frequency  100). By excluding
mid-frequency nouns, we obtain a clear sepa-
ration between high and low frequency nouns.
The complete data-set consists of 1,596,798 co-
occurrence tokens distributed over 331,079 co-
occurrence types. From this data, we computed
the similarity between every pair of nouns ac-
cording to each distributional similarity mea-
sure. We then generated ranked sets of nearest
neighbours (of size k = 200 and where a word
is excluded from being a neighbour of itself) for
each word and each measure.
For a given word, we compute the overlap be-
tween neighbour sets using a comparison tech-
nique adapted from Lin (1998). Given a word
w, each word w0 in WScomp is assigned a rank
score of k rank if it is one of the k near-
est neighbours of w using measure m and zero
otherwise. If NS(w;m) is the vector of such
scores for word w and measure m, then the
overlap, C(NS(w;m1);NS(w;m2)), of two neigh-
bour sets is the cosine between the two vectors:
C(NS(w;m1);NS(w;m2)) =
P
w0rm1(w0;w) rm2(w0;w)P
k
i=1i2
The overlap score indicates the extent to which
sets share members and the extent to which
they are in the same order. To achieve an over-
lap score of 1, the sets must contain exactly
the same items in exactly the same order. An
overlap score of 0 is obtained if the sets do not
contain any common items. If two sets share
roughly half their items and these shared items
are dispersed throughout the sets in a roughly
similar order, we would expect the overlap be-
tween sets to be around 0.5.
cm js  cp ja ja+mi lin
cm 1.0(0.0) 0.69(0.12) 0.53(0.15) 0.33(0.09) 0.26(0.12) 0.28(0.15) 0.32(0.15)
js 0.69(0.12) 1.0(0.0) 0.81(0.10) 0.46(0.31) 0.48(0.18) 0.49(0.20) 0.55(0.16)
 0.53(0.15) 0.81(0.10) 1.0(0.0) 0.61(0.08) 0.4(0.27) 0.39(0.25) 0.48(0.19)
cp 0.33(0.09) 0.46(0.31) 0.61(0.08) 1.0(0.0) 0.24(0.24) 0.20(0.18) 0.29(0.15)
ja 0.26(0.12) 0.48(0.18) 0.4(0.27) 0.24(0.24) 1.0(0.0) 0.81(0.08) 0.69(0.09)
ja+mi 0.28(0.15) 0.49(0.20) 0.39(0.25) 0.20(0.18) 0.81(0.08) 1.0(0.0) 0.81(0.10)
lin 0.32(0.15) 0.55(0.16) 0.48(0.19) 0.29(0.15) 0.69(0.09) 0.81(0.10) 1.0(0.0)
Table 2: Cross-comparison of  rst seven similarity measures in terms of mean overlap of neighbour
sets and corresponding standard deviations.
P R hm
cm 0.18(0.10) 0.31(0.13) 0.30(0.14)
js 0.19(0.12) 0.55(0.18) 0.51(0.18)
 0.08(0.08) 0.74(0.14) 0.41(0.23)
cp 0.03(0.04) 0.57(0.10) 0.25(0.18)
ja 0.36(0.30) 0.38(0.30) 0.74(0.14)
ja+mi 0.42(0.30) 0.40(0.31) 0.86(0.07)
lin 0.46(0.25) 0.52(0.22) 0.95(0.039)
Table 3: Mean overlap scores for seven simi-
larity measures with precision, recall and the
harmonic mean in the AMCRM.
3.2 Results
Table 2 shows the mean overlap score between
every pair of the  rst seven measures in Table 1
calculated over WScomp. Table 3 shows the mean
overlap score between each of these measures
and precision, recall and the harmonic mean in
the AMCRM. In both tables, standard devia-
tions are given in brackets and boldface denotes
the highest levels of overlap for each measure.
For compactness, each measure is denoted by
its subscript from Table 1.
Although overlap between most pairs of
measures is greater than expected if sets of
200 neighbours were generated randomly from
WScomp (in this case, average overlap would be
0.08 and only the overlap between the pairs
( ,P) and (cp,P) is not signi cantly greater
than this at the 1% level), there are substan-
tial di erences between the neighbour sets gen-
erated by di erent measures. For example, for
many pairs, neighbour sets do not appear to
have even half their members in common.
4 Frequency analysis
We have seen that there is a large variation in
neighbours selected by di erent similarity mea-
sures. In this section, we analyse how neighbour
sets vary with respect to one fundamental statis-
tical property | word frequency. To do this, we
measure the bias in neighbour sets towards high
frequency nouns and consider how this varies
depending on whether the target noun is itself
a high frequency noun or low frequency noun.
4.1 Measuring bias
If a measure is biased towards selecting high fre-
quency words as neighbours, then we would ex-
pect that neighbour sets for this measure would
be made up mainly of words from WShigh. Fur-
ther, the more biased the measure is, the more
highly ranked these high frequency words will
tend to be. In other words, there will be high
overlap between neighbour sets generated con-
sidering all 2000 nouns as potential neighbours
and neighbour sets generated considering just
the nouns in WShigh as potential neighbours. In
the extreme case, where all of a noun’s k nearest
neighbours are high frequency nouns, the over-
lap with the high frequency noun neighbour set
will be 1 and the overlap with the low frequency
noun neighbour set will be 0. The inverse is, of
course, true if a measure is biased towards se-
lecting low frequency words as neighbours.
If NSwordset is the vector of neighbours (and
associated rank scores) for a given word, w, and
similarity measure, m, and generated consider-
ing just the words in wordset as potential neigh-
bours, then the overlap between two neighbour
sets can be computed using a cosine (as be-
fore). If Chigh = C(NScomp;NShigh) and Clow =
C(NScomp;NSlow), then we compute the bias to-
wards high frequency neighbours for word w us-
ing measure m as: biashighm(w) = ChighChigh+Clow
The value of this normalised score lies in the
range [0,1] where 1 indicates a neighbour set
completely made up of high frequency words, 0
indicates a neighbour set completely made up of
low frequency words and 0.5 indicates a neigh-
bour set with no biases towards high or low fre-
quency words. This score is more informative
than simply calculating the proportion of high
high freq. low freq.
target nouns target nouns
cm 0.90 0.87
js 0.94 0.70
 0.98 0.90
cp 1.00 0.99
ja 0.99 0.21
ja+mi 0.95 0.14
lin 0.85 0.38
P 0.12 0.04
R 0.99 0.98
hm 0.92 0.28
Table 4: Mean value of biashigh according to
measure and frequency of target noun.
and low frequency words in each neighbour set
because it weights the importance of neighbours
by their rank in the set. Thus, a large number
of high frequency words in the positions clos-
est to the target word is considered more biased
than a large number of high frequency words
distributed throughout the neighbour set.
4.2 Results
Table 4 shows the mean value of the biashigh
score for every measure calculated over the set
of high frequency nouns and over the set of low
frequency nouns. The standard deviations (not
shown) all lie in the range [0,0.2]. Any deviation
from 0.5 of greater than 0.0234 is signi cant at
the 1% level.
For all measures and both sets of target
nouns, there appear to be strong tendencies to
select neighbours of particular frequencies. Fur-
ther, there appears to be three classes of mea-
sures: those that select high frequency nouns
as neighbours regardless of the frequency of the
target noun (cm, js,  , cpandR); those that se-
lect low frequency nouns as neighbours regard-
less of the frequency of the target noun (P); and
those that select nouns of a similar frequency to
the target noun (ja, ja+mi, lin and hm).
This can also be considered in terms of distri-
butional generality. By de nition, recall prefers
words that have occurred in more of the con-
texts that the target noun has, regardless of
whether it occurs in other contexts as well i.e.,
it prefers distributionally more general words.
The probability of this being the case increases
as the frequency of the potential neighbour in-
creases and so, recall tends to select high fre-
quency words. In contrast, precision prefers
words that have occurred in very few contexts
that the target word has not i.e., it prefers dis-
tributionally more speci c words. The prob-
ability of this being the case increases as the
frequency of the potential neighbour decreases
and so, precision tends to select low frequency
words. The harmonic mean of precision and re-
call prefers words that have both high precision
and high recall. The probability of this being
the case is highest when the words are of sim-
ilar frequency and so, the harmonic mean will
tend to select words of a similar frequency.
5 Relative frequency and hyponymy
In this section, we consider the observed fre-
quency e ects from a semantic perspective.
The concept of distributional generality in-
troduced in the previous section has parallels
with the linguistic relation of hyponymy, where
a hypernym is a semantically more general term
and a hyponym is a semantically more speci c
term. For example, animal is an (indirect1) hy-
pernym of dog and conversely dog is an (indi-
rect) hyponym of animal. Although one can
obviously think of counter-examples, we would
generally expect that the more speci c term dog
can only be used in contexts where animal can
be used and that the more general term animal
might be used in all of the contexts where dog
is used and possibly others. Thus, we might ex-
pect that distributional generality is correlated
with semantic generality | a word has high
recall/low precision retrieval of its hyponyms’
co-occurrences and high precision/low recall re-
trieval of its hypernyms’ co-occurrences.
Thus, ifn1 andn2 are related andP(n2;n1) >
R(n2;n1), we might expect that n2 is a hy-
ponym of n1 and vice versa. However, having
discussed a connection between frequency and
distributional generality, we might also expect
to  nd that the frequency of the hypernymic
term is greater than that of the hyponymic
term. In order to test these hypotheses, we ex-
tracted all of the possible hyponym-hypernym
pairs (20, 415 pairs in total) from our list of 2000
nouns (using WordNet 1.6). We then calculated
the proportion for which the direction of the hy-
ponymy relation could be accurately predicted
by the relative values of precision and recall and
the proportion for which the direction of the hy-
ponymy relation could be accurately predicted
by relative frequency. We found that the direc-
tion of the hyponymy relation is correlated in
the predicted direction with the precision-recall
1There may be other concepts in the hypernym chain
between dog and animal e.g. carnivore and mammal.
values in 71% of cases and correlated in the pre-
dicted direction with relative frequency in 70%
of cases. This supports the idea of a three-way
linking between distributional generality, rela-
tive frequency and semantic generality. We now
consider the impact that this has on a potential
application of distributional similarity methods.
6 Compositionality of collocations
In its most general sense, a collocation is a ha-
bitual or lexicalised word combination. How-
ever, some collocations such as strong tea are
compositional, i.e., their meaning can be de-
termined from their constituents, whereas oth-
ers such as hot dog are not. Both types are
important in language generation since a sys-
tem must choose between alternatives but only
non-compositional ones are of interest in lan-
guage understanding since only these colloca-
tions need to be listed in the dictionary.
Baldwin et al. (2003) explore empirical
models of compositionality for noun-noun com-
pounds and verb-particle constructions. Based
on the observation (Haspelmath, 2002) that
compositional collocations tend to be hyponyms
of their head constituent, they propose a model
which considers the semantic similarity between
a collocation and its constituent words.
McCarthy et al. (2003) also investigate sev-
eral tests for compositionality including one
(simplexscore) based on the observation that
compositional collocations tend to be similar in
meaning to their constituent parts. They ex-
tract co-occurrence data for 111 phrasal verbs
(e.g. rip o ) and their simplex constituents
(e.g. rip) from the BNC using RASP and cal-
culate the value of simlin between each phrasal
verb and its simplex constituent. The test
simplexscore is used to rank the phrasal verbs
according to their similarity with their simplex
constituent. This ranking is correlated with hu-
man judgements of the compositionality of the
phrasal verbs using Spearman’s rank correlation
coe cient. The value obtained (0.0525) is dis-
appointing since it is not statistically signi cant
(the probability of this value under the null hy-
pothesis of \no correlation" is 0.3).2
However, Haspelmath (2002) notes that a
compositional collocation is not just similar to
one of its constituents | it can be considered to
be a hyponym of its head constituent. For ex-
ample, \strong tea" is a type of \tea" and \to
2Other tests for compositionality investigated by Mc-
Carthy et al. (2003) do much better.
Measure rs P(rs) under H0
simlin 0.0525 0.2946
precision -0.160 0.0475
recall 0.219 0.0110
harmonic mean 0.011 0.4562
Table 5: Correlation with compositionality for
di erent similarity measures
rip up" is a way of \ripping".
Thus, we hypothesised that a distributional
measure which tends to select more general
terms as neighbours of the phrasal verb (e.g. re-
call) would do better than measures that tend
to select more speci c terms (e.g. precision) or
measures that tend to select terms of a similar
speci city (e.g simlin or the harmonic mean of
precision and recall).
Table 5 shows the results of using di erent
similarity measures with the simplexscore test
and data of McCarthy et al. (2003). We now see
signi cant correlation between compositionality
judgements and distributional similarity of the
phrasal verb and its head constituent. The cor-
relation using the recall measure is signi cant
at the 5% level; thus we can conclude that if
the simplex verb has high recall retrieval of the
phrasal verb’s co-occurrences, then the phrasal
is likely to be compositional. The correlation
score using the precision measure is negative
since we would not expect the simplex verb to
be a hyponym of the phrasal verb and thus, if
the simplex verb does have high precision re-
trieval of the phrasal verb’s co-occurrences, it is
less likely to be compositional.
Finally, we obtained a very similar result
(0.217) by ranking phrasals according to their
inverse relative frequency with their simplex
constituent (i.e., freq(simplex)freq(phrasal) ). Thus, it would
seem that the three-way connection between
distributional generality, hyponymy and rela-
tive frequency exists for verbs as well as nouns.
7 Conclusions and further work
We have presented an analysis of a set of dis-
tributional similarity measures. We have seen
that there is a large amount of variation in the
neighbours selected by di erent measures and
therefore the choice of measure in a given appli-
cation is likely to be important.
We also identi ed one of the major axes of
variation in neighbour sets as being the fre-
quency of the neighbours selected relative to the
frequency of the target word. There are three
major classes of distributional similarity mea-
sures which can be characterised as 1) higher
frequency selecting or high recall measures; 2)
lower frequency selecting or high precision mea-
sures; and 3) similar frequency selecting or high
precision and recall measures.
A word tends to have high recall similarity
with its hyponyms and high precision similarity
with its hypernyms. Further, in the majority of
cases, it tends to be more frequent than its hy-
ponyms and less frequent than its hypernyms.
Thus, there would seem to a three way corre-
lation between word frequency, distributional
generality and semantic generality.
We have considered the impact of these ob-
servations on a technique which uses a distribu-
tional similarity measure to determine composi-
tionality of collocations. We saw that in this ap-
plication we achieve signi cantly better results
using a measure that tends to select higher fre-
quency words as neighbours rather than a mea-
sure that tends to select neighbours of a similar
frequency to the target word.
There are a variety of ways in which this work
might be extended. First, we could use the ob-
servations about distributional generality and
relative frequency to aid the process of organ-
ising distributionally similar words into hierar-
chies. Second, we could consider the impact of
frequency characteristics in other applications.
Third, for the general application of distribu-
tional similarity measures, it would be useful
to  nd other characteristics by which distribu-
tional similarity measures might be classi ed.
Acknowledgements
This work was funded by a UK EPSRC stu-
dentship to the  rst author, UK EPSRC project
GR/S26408/01 (NatHab) and UK EPSRC
project GR/N36494/01 (RASP). We would like
to thank Adam Kilgarri and Bill Keller for use-
ful discussions.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka,
and Dominic Widdows. 2003. An empirical model
of multiword expression decomposability. In Pro-
ceedings of the ACL-2003 Workshop on Multiword
Expressions, pages 89{96, Sapporo, Japan.

Edward Briscoe and John Carroll. 2002. Robust ac-
curate statistical annotation of general text. In
Proceedings of LREC-2002, pages 1499{1504.

P.F. Brown, V.J. DellaPietra, P.V deSouza, J.C. Lai,
and R.L. Mercer. 1992. Class-based n-gram mod-
els of natural language. Computational Linguis-
tics, 18(4):467{479.

Sharon Caraballo. 1999. Automatic construction of
a hypernym-labelled noun hierarchy from text. In
Proceedings of ACL-99, pages 120{126.

T.M. Cover and J.A. Thomas. 1991. Elements of
Information Theory. Wiley, New York.

James R. Curran and Marc Moens. 2002. Im-
provements in automatic thesaurus extraction. In
ACL-SIGLEX Workshop on Unsupervised Lexical
Acquisition, Philadelphia.

Ido Dagan, Lillian Lee, and Fernando Pereira. 1999.
Similarity-based models of word cooccurrence
probabilities. Machine Learning Journal, 34.

Christiane Fellbaum, editor. 1998. WordNet: An
Electronic Lexical Database. MIT Press.

W.B. Frakes and R. Baeza-Yates, editors. 1992. In-
formation Retrieval, Data Structures and Algo-
rithms. Prentice Hall.

Gregory Grefenstette. 1994. Corpus-derived  rst-,
second- and third-order word a nities. In Pro-
ceedings of Euralex, pages 279{290, Amsterdam.

Zelig S. Harris. 1968. Mathematical Structures of
Language. Wiley, New York.

Martin Haspelmath. 2002. Understanding Morphol-
ogy. Arnold Publishers.

Lillian Lee. 1999. Measures of distributional simi-
larity. In Proceedings of ACL-1999, pages 23{32.

Lillian Lee. 2001. On the e ectiveness of the skew
divergence for statistical language analysis. Arti-
cial Intelligence and Statistics, pages 65{72.

Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming
Zhou. 2003. Identifying synonyms among dis-
tributionally similar words. In Proceedings of
IJCAI-03, pages 1492{1493.

Dekang Lin. 1998. Automatic retrieval and cluster-
ing of similar words. In Proceedings of COLING-
ACL ’98, pages 768{774, Montreal.

Diana McCarthy, Bill Keller, and John Carroll.
2003. Detecting a continuum of compositionality
in phrasal verbs. In Proceedings of the ACL-2003
Workshop on Multiword Expressions, pages 73{80, Sapporo, Japan.

C. Radhakrishna Rao. 1983. Diversity: Its measure-
ment, decomposition, apportionment and analy-
sis. Sankyha: The Indian Journal of Statistics,
44(A):1{22.

G. Salton and M.J. McGill. 1983. Introduction to
Modern Information Retrieval. McGraw-Hill.

K.M. Sugawara, K. Nishimura, K. Toshioka,
M. Okachi, and T. Kaneko. 1985. Isolated word
recognition using hidden markov models. In Pro-
ceedings of the ICASSP-1985, pages 1{4.

C.J. van Rijsbergen. 1979. Information Retrieval.
Butterworths, second edition.

Julie Weeds and David Weir. 2003. A general frame-
work for distributional similarity. In Proceedings
of EMNLP-2003, pages 81{88, Sapporo, Japan.

Julie Weeds. 2003. Measures and Applications of
Lexical Distributional Similarity. Ph.D. thesis,
Department of Informatics, University of Sussex.
