Frequency Estimates for Statistical Word Similarity Measures
Egidio Terra
School of Computer Science
University of Waterloo
elterra@math.uwaterloo.ca
C. L. A. Clarke
School of Computer Science
University of Waterloo
claclark@plg2.uwaterloo.ca
Abstract
Statistical measures of word similarity have ap-
plication in many areas of natural language pro-
cessing, such as language modeling and in-
formation retrieval. We report a comparative
study of two methods for estimating word co-
occurrence frequencies required by word sim-
ilarity measures. Our frequency estimates are
generated from a terabyte-sized corpus of Web
data, and we study the impact of corpus size
on the effectiveness of the measures. We base
the evaluation on one TOEFL question set and
two practice questions sets, each consisting of
a number of multiple choice questions seek-
ing the best synonym for a given target word.
For two question sets, a context for the target
word is provided, and we examine a number of
word similarity measures that exploit this con-
text. Our best combination of similarity mea-
sure and frequency estimation method answers
6-8% more questions than the best results pre-
viously reported for the same question sets.
1 Introduction
Many different statistical tests have been proposed to
measure the strength of word similarity or word associ-
ation in natural language texts (Dunning, 1993; Church
and Hanks, 1990; Dagan et al., 1999). These tests attempt
to measure dependence between words by using statistics
taken from a large corpus. In this context, a key assump-
tion is that similarity between words is a consequence of
word co-occurrence, or that the closeness of the words
in text is indicative of some kind of relationship between
them, such as synonymy or antonymy.
Although word sequences in natural language are un-
likely to be independent, these statistical tests provide
quantitative information that can be used to compare
pairs of co-occurring words. Also, despite the fact that
word co-occurrence is a simple idea, there are a vari-
ety of ways to estimate word co-occurrence frequencies
from text. Two words can appear close to each other
in the same document, passage, paragraph, sentence or
fixed-size window. The boundaries for determining co-
occurrence will affect the estimates and as a consequence
the word similarity measures.
Statistical word similarity measures play an impor-
tant role in information retrieval and in many other natu-
ral language applications, such as the automatic creation
of thesauri (Grefenstette, 1993; Li and Abe, 1998; Lin,
1998) and word sense disambiguation (Yarowsky, 1992;
Li and Abe, 1998). Pantel and Lin (2002) use word sim-
ilarity to create groups of related words, in order to dis-
cover word senses directly from text. Recently, Tan et
al. (2002) provide an analysis on different measures of
independence in the context of association rules.
Word similarity is also used in language modeling ap-
plications. Rosenfeld (1996) uses word similarity as a
constraint in a maximum entropy model which reduces
the perplexity on a test set by 23%. Brown et al. (1992)
use a word similarity measure for language modeling
in an interpolated model, grouping similar words into
classes. Dagan et al. (1999) use word similarity to assign
probabilities to unseen bigrams by using similar bigrams,
which reduces perplexity up to 20% in held out data.
In information retrieval, word similarity can be used to
identify terms for pseudo-relevance feedback (Harman,
1992; Buckley et al., 1995; Xu and Croft, 2000; Vechto-
mova and Robertson, 2000). Xu and Croft (2000) expand
queries under a pseudo-relevance feedback model by us-
ing similar words from documents retrieved and improve
effectiveness by more than 20% on an 11-point average
precision.
Landauer and Dumais (1997) applied word similarity
measures to answer TOEFL (Test Of English as a For-
eign Language) synonym questions using Latent Seman-
tic Analysis. Turney (2001) performed an evaluation of a
specific word similarity measure using the same TOEFL
questions and compared the results with those obtained
                                                               Edmonton, May-June 2003
                                                             Main Papers , pp. 165-172
                                                         Proceedings of HLT-NAACL 2003
a0 = “The results of the test were quite [unambiguous].”
a1a3a2 = ‘unambiguous’
a4 =
a5 ‘clear’,‘doubtful’,‘surprising’, ‘illegal’a6
Figure 1: Finding the best synonym option in presence of context
a1a3a2 = ‘boast’
a4 =
a5 ‘brag’,‘yell’,‘complain’,‘explain’a6
Figure 2: Finding the best synonym
by Landauer and Dumais.
In our investigation of frequency estimates for word
similarity measures, we compare the results of sev-
eral different measures and frequency estimates to solve
human-oriented language tests. Our investigation is
based in part on the questions used by Landauer and Du-
mais, and by Turney. An example of such tests is the
determination of the best synonym in a set of alternatives
a4a8a7
a5
a4a10a9a12a11a13a4a3a14a15a11a13a4a3a16a17a11a18a4a20a19
a6 for a specific target word
a1a3a2 in a
context a0 a7 a5a22a21a3a23a9 a11 a21a3a23a14 a11a25a24a26a24a27a11 a21a28a23a29 a6a31a30 a1a3a2 , as shown in figure 1.
Ideally, the context can provide support to choose best al-
ternative for each question. We also investigate questions
where no context is available, as shown in figure 2. These
questions provides an easy way to assess the performance
of measures and the co-occurrence frequency estimation
methods used to compute them.
Although word similarity has been used in many dif-
ferent applications, to the best of our knowledge, ours is
the first comparative investigation of the impact of co-
occurrence frequency estimation on the performance of
word similarity measures. In this paper, we provide a
comprehensive study of some of the most widely used
similarity measures with frequency estimates taken from
a terabyte-sized corpus of Web data, both in the presence
of context and not. In addition, we investigate frequency
estimates for co-occurrence that are based both on docu-
ments and on a variety of different window sizes, and ex-
amine the impact of the corpus size on the frequency es-
timates. In questions where context is available, we also
investigate the effect of adding more words from context.
The remainder of this paper is organized as follows:
In section 2 we briefly introduce some of the most
commonly used methods for measuring word similarity.
In section 3 we present methods to assess word co-
occurrence frequencies. Section 4 presents our experi-
mental evaluation, which is followed by a discussion of
the results in section 5.
2 Measuring Word Similarity
The notion for co-occurrence of two words can depicted
by a contingency table, as shown in table 1. Each dimen-
sion represents a random discrete variable a2a33a32 with range
a34a35a7
a5a36a21
a32 a11a38a37
a21
a32
a6 (presence or absence of word a39 in a given
text window or document). Each cell in the table repre-
sent the joint frequency a40a15a41a43a42a45a44a41a47a46 a7a8a48a50a49a52a51a54a53a28a55a57a56a59a58 a39 a11a61a60a63a62 , where
a48a50a49a52a51a54a53 is the maximum number of co-occurrences. Un-
der an independence assumption, the values of the cells
in the contingency table are calculated using the prob-
abilities in table 2. The methods described below per-
form different measures of how distant observed values
are from expected values under an independence assump-
tion. Tan et al. (2002) indicate that the difference between
the methods arise from non-uniform marginals and how
the methods react to this non-uniformity.
a21
a9 a37
a21
a9
a21
a14
a40 a41a65a64a18a44a41a43a66 a40a15a67 a41a65a64a18a44a41a43a66 a40 a41a43a66
a37
a21
a14
a40a12a41 a64 a44a67 a41 a66 a40 a67 a41 a64 a44a67 a41 a66 a40 a67 a41 a66
a40 a41a65a64 a40a15a67 a41a65a64
a48
Table 1: Contingency table
a56a59a58
a21
a9a15a11
a21
a14a22a62a68a7a69a56a59a58
a21
a9a22a62a70a55a57a56a59a58
a21
a14a36a62
a56a59a58a71a37
a21
a9 a11
a21
a14 a62a72a7a73a56a59a58a61a37
a21
a9 a62a74a55a57a56a59a58
a21
a14 a62
a56a59a58
a21
a9 a11a18a37
a21
a14 a62a72a7a73a56a59a58
a21
a9 a62a70a55a57a56a59a58a71a37
a21
a14 a62
a56a59a58a71a37
a21
a9 a11a18a37
a21
a14 a62a72a7a73a56a59a58a61a37
a21
a9 a62a70a55a57a56a59a58a61a37
a21
a14 a62
Table 2: Probabilities under independence
Occasionally, a context a0 is available and can pro-
vide support for the co-occurrence and alternative meth-
ods can be used to exploit this context. The procedures to
estimate a56a59a58 a21 a9a12a11 a21 a14a75a62 , as well a56a59a58 a21 a32 a62 , will be described in
section 3.
2.1 Similarity between two words
We first present methods to measure the similarity be-
tween two words a21 a9 and a21 a14 when no context is available.
2.1.1 Pointwise Mutual Information
This measure for word similarity was first used in this
context by Church and Hanks (1990). The measure is
given by equation 1 and is called Pointwise Mutual Infor-
mation. It is a straightforward transformation of the inde-
pendence assumption (on a specific point), a56a59a58 a21 a9 a11 a21 a14 a62a76a7
a56a59a58
a21
a9 a62a47a55a72a56a59a58
a21
a14 a62 , into a ratio. Positive values indicate that
words occur together more than would be expected under
an independence assumption. Negative values indicate
that one word tends to appear only when the other does
not. Values close to zero indicate independence.
a77a20a78a80a79a82a81a84a83a86a85a88a87a33a89a90a85a54a91a92a83a94a93a95a87a33a89a76a93a54a96a70a87a80a97a26a98a100a99a75a93
a77a101a81a102a89 a85 a91a103a89 a93 a96
a77a101a81a102a89a90a85a92a96a45a77a101a81a102a89a76a93a38a96 (1)
2.1.2 a104
a14
-test
This test is directly derived from observed and ex-
pected values in the contingency tables.
a105
a93
a87a107a106
a108a36a109a12a110
a64
a106
a111a22a109a75a110
a66
a81a84a112
a108a36a113a111a57a114a116a115a68a108a36a113a111
a96
a93
a115 a108a36a113a111
(2)
The a104
a14
statistic determines a specific way to calculate
the difference between values expected under indepen-
dence and observed ones, as depicted in equation 2. The
values a40 a53 a44a117 correspond to the observed frequency esti-
mates.
2.1.3 Likelihood ratio
The likelihood ratio test provides an alternative to
check two simple hypotheses based on parameters of a
distribution. Dunning (1993) used a likelihood ratio to
test word similarity under the assumption that the words
in text have a binomial distribution.
Two hypotheses used are: H1:a56a59a58 a21 a14a119a118a21 a9a22a62 a7
a56a59a58
a21
a14 a118a37
a21
a9 a62 (i.e. they occur independently); and
H2: a56a59a58 a21 a14 a118a21 a9 a62a121a120a7a122a56a59a58 a21 a14 a118a37 a21 a9 a62 (i.e. not independent).
These two conditionals are used as sample in the like-
lihood function a123 a58a84a56a59a58 a21 a14 a118a21 a9 a62a54a11a18a56a59a58 a21 a14 a118a37 a21 a9 a62a100a124a92a125a119a62 , where
a125 in this particular case represents the parameter of
the binomial distribution a126 a58a128a127a95a11a18a129a65a124a92a125a119a62 . Under hypothe-
sis H1, a56a59a58 a21 a14a130a118a21 a9a25a62a73a7a131a56a59a58 a21 a14a130a118a37 a21 a9a25a62a73a7a133a132 , and for H2,
a56a59a58
a21
a14a63a118
a21
a9a22a62a72a7a80a132a65a9a75a11a18a56a59a58
a21
a14a119a118a37
a21
a9a25a62a76a7a80a132a47a14 .
a134 a87a136a135
a81a128a77a101a81a102a89a76a93a36a137a89a90a85a13a96a13a138a102a139a82a96a43a140
a135
a81a128a77a101a81a102a89a76a93a75a137a141a65a89a90a85a13a96a13a138a142a139a31a96
a135
a81a128a77a101a81a102a89 a93 a137a89 a85 a96a13a138a102a139 a85 a96a43a140
a135
a81a128a77a101a81a102a89 a93 a137a141a65a89 a85 a96a13a138a142a139 a93 a96 (3)
Equation 3 represents the likelihood ratio. Asymptoti-
cally, a143a28a144a17a145a84a146a36a147a31a148 is a104
a14
distributed.
2.1.4 Average Mutual Information
This measure corresponds to the expected value of two
random variables using the same equation as PMI. Av-
erage mutual information was used as a word similarity
measure by Rosenfeld (1996) and is given by equation 4.
a78a149a79a82a81a84a83 a85 a138a13a83 a93 a96a70a87 a106
a108a36a109a75a110
a64
a106
a111a25a109a12a110
a66
a77a101a81a102a150a43a91a61a151a130a96a45a97a27a98a25a99
a77a101a81a102a150a43a91a45a151a119a96
a77a101a81a102a150a82a96a45a77a101a81a102a151a130a96 (4)
2.2 Context supported similarity
Similarity between two words can also be in-
ferred from a context (if given). Given a context
a0 a7
a5a22a21a3a23
a9
a11
a21a28a23
a14
a11a22a24a27a24a26a24a26a11
a21a28a23a29 a6 , a21
a9 and
a21
a14 are related if their
co-occurrence with words in context are similar.
2.2.1 Cosine of Pointwise Mutual Information
The PMI between each context word a21a10a23 and a21 a32 form a
vector. The elements in the vector represents the similar-
ity weights of a21a28a23 and a21 a32 . The cosine value between the
two vectors corresponding to a21
a9 and
a21
a14 represents the
similarity between the two words in the specified context,
as depicted in equation 5.
a152 a77a101a81a102a89a90a85a18a138a153a89a76a93a54a96a74a87 a154
a155a82a156a27a157a27a158a36a159
a77a20a78a80a79a82a81a102a89a76a160a84a91a153a89 a85 a96a45a77a20a78a149a79a82a81a102a89a76a160a84a91a45a89 a93 a96
a161
a154
a155a82a156
a77a20a78a80a79a82a81a102a89
a160
a91a103a89 a85 a96
a93 a161
a154
a155a82a156
a77a20a78a80a79a82a81a102a89
a160
a91a153a89 a93 a96
a93
(5)
Values closer to one indicate more similarity whereas
values close to zero represent less similarity. Lesk (1969)
was one of the first to apply the cosine measure to word
similarity, but did not use pointwise mutual information
to compute the weights. Pantel (2002) used the cosine
of pointwise mutual information to uncover word sense
from text.
2.2.2 a123 a9 norm
In this method the conditional probability of each word
a21a28a23
a32 in
a0 given
a21
a9 (and
a21
a14 ) is computed. The accumu-
lated distance between the conditionals for all words in
context represents the similarity between the two words,
as shown in equation 6. This method was proposed as an
alternative word similarity measure in language modeling
to overcome zero-frequency problems of bigrams (Dagan
et al., 1999).
a135
a81a102a89 a85 a138a103a89 a93 a96a74a87 a106
a155 a156 a157a162a158a75a159
a137a77a101a81a102a89 a160 a137a89a28a163a54a96
a114
a77a101a81a102a89 a160 a137a89a57a164a36a96a38a137 (6)
In this measure, a smaller value indicates a greater sim-
ilarity.
2.2.3 Contextual Average Mutual Information
The conditional probabilities between each word in
the context and the two words a21 a9 and a21 a14 are used
to calculate the mutual information of the conditionals
(equation 7). This method was also used in Dagan et.
al. (1999).a165
a78a80a79 a152 a81a102a89 a85 a138a153a89 a93 a96a88a87a149a106
a155 a156
a77a101a81a102a89 a160 a137a89a20a163a100a96a45a97a27a98a25a99
a77a101a81a102a89 a160 a137a89a20a163a100a96
a77a101a81a102a89
a160
a137a89a57a164a36a96 (7)
2.2.4 Contextual Jensen-Shannon Divergence
This is an alternative to the Mutual Information for-
mula (equation 8). It helps to avoid zero frequency prob-
lem by averaging the two distributions and also provides
a symmetric measure (AMIC is not symmetric). This
method was also used in Dagan et. al. (1999).
a166
a135
a81a167a139a65a168a92a169a75a96a70a87a149a106a170a139a3a97a26a98a25a99
a139
a169
a165a90a171a28a172
a77a173a87
a77a101a81a102a89 a160 a137a89a90a85a13a96a175a174a176a77a101a81a102a89 a160 a137a89a76a93a54a96
a164
a79a15a177
a165a68a178
a81a102a89 a85 a138a103a89 a93 a96a74a87a80a166
a135
a81a128a77a101a81a102a89 a160 a137a89a20a163a100a96a38a168
a165a90a171a28a172
a77a20a96
a174a90a166
a135
a81a128a77a101a81a102a89 a160 a137a89a57a164a36a96a38a168
a165a90a171a28a172
a77a20a96
(8)
2.2.5 Pointwise Mutual Information of Multiple
words
Turney (2001) proposes a different formula for Point-
wise Mutual Information when context is available, as de-
picted in equation 9. The context is represented by a0 a23 ,
which is any subset of the context a0 . In fact, Turney ar-
gued that bigger a0 a23 sets are worse because they narrow
the estimate and as consequence can be affected by noise.
As a consequence, Turney used only one word a179 a32 from
the context, discarding the remaining words. The chosen
word was the one that has biggest pointwise information
with a21 a9 . Moreover, a21 a9 (a1a3a2 ) is fixed when the method
is used to find the best a4 a32 for a1a3a2 , so a56a59a58 a21 a9 a11 a0 a23 a62 is also
fixed and can be ignored, which transforms the equation
into the conditional a56a59a58 a21 a9 a118a21 a14 a11 a0 a62 .
It is interesting to note that the equation a56a59a58 a21 a9a17a118a21 a14a17a11 a0 a62
is not the traditional n-gram model since no ordering is
imposed on the words and also due to the fact that the
words in this formula can be separated from one another
by other words.
a77a20a78a80a79 a152 a81a102a89a90a85a38a91a153a89a76a93a100a138 a152 a160 a96a74a87
a77a101a81a102a89 a85 a91a153a89 a93 a91 a152 a160a142a96
a77a101a81a102a89a76a93a25a91 a152
a160
a96a45a77a101a81a102a89a90a85a38a91 a152
a160
a96 (9)
2.2.6 Other measures of word similarities
Many other measures for word similarities exists. Tan
et al. (2002) present a comparative study with 21 different
measures. Lillian (2001) proposes a new word similarity
measure in the context of language modeling, performing
an comparative evaluation with other 7 similarity mea-
sures.
3 Co-occurrence Estimates
We now discuss some alternatives to estimate word co-
occurrence frequencies from an available corpus. All
probabilities mentioned in previous section can be es-
timated from these frequencies. We describe two dif-
ferent approaches: a window-oriented approach and a
document-oriented approach.
3.1 Window-oriented approach
Let a40 a41 a42 be the frequency of a21 a32 and the co-occurrence fre-
quency of a21 a9 and a21 a14 be denoted by a40 a41a180a64a38a44a41a47a66 . Let a48 be
the size of the corpus in words. In the window-oriented
approach, individual word frequencies are the corpus fre-
quencies. The maximum likelihood estimate (MLE) for
a21
a32 in the corpus is a56a59a58
a21
a32 a62a72a7
a40a12a41a43a42a92a181
a48 .
The joint frequency a40 a41a65a64a18a44a41a43a66 is estimated by the number
of windows where the two words co-occur. The window
size may vary, Church and Hanks (1990) used windows
of size 2 and 5. Brown et al. (1992) used windows con-
taining 1001 words. Dunning (1993) also used windows
of size 2, which corresponds to word bigrams. Let the
number of windows of size a182 in the corpus be a48 a41a180a183 . Recall
that a48a184a49a90a51a100a53 is the maximum number of co-occurrences,
i.e. a48a50a49a52a51a54a53a185a7a107a48 a41a180a183 in the windows-oriented approach.
The MLE of the co-occurrence probability is given by
a56a59a58
a21
a9a15a11
a21
a14a22a62a68a7
a40 a41a65a64a18a44a41a43a66 a181
a48
a41a180a183 .
In most common case, windows are overlapping, and
in this case a48 a41a180a183 a7a69a48 a143a186a182a12a187a86a188 . The total frequency of win-
dows for co-occurrence should be adjusted to reflect the
multiple counts of the same co-occurrence. One method
to account for overlap is to divide the total count of win-
dows by a21a20a39 a127a74a189 a146a75a21 a190a22a39a45a191a130a192a20a143a193a188 . This method also reinforces
closer co-occurrences by assigning them a larger weight.
Smoothing techniques can be applied to address the
zero-frequency problem, or alternatively, the window size
can be increased, which also increases the chance of co-
occurrence. To avoid inconsistency, windows do not to
cross document boundaries.
3.2 Document-oriented approach
In information retrieval, one commonly uses document
statistics rather than individual word statistics. In an
document-oriented approach, the frequency of a word a21 a32
is denoted by a189 a40 a41 a42 and corresponds to the number of doc-
uments in which the word appears, regardless of how fre-
quently it occurs in each document. The number of docu-
ments is denoted by a194 . The MLE for an individual word
in document oriented approach is a56a59a58 a21 a32 a62a72a7a8a189 a40 a41 a42a92a181a12a194 .
The co-occurrence frequency of two words a21 a9 and
a21
a14 , denoted by a189
a40a15a41 a64 a44a41 a66 , is the number of documents
where the words co-occur. If we require only that the
words co-occur in the same document, no distinction is
made between distantly occurring words and adjacent
words. This distortion can be reduced by imposing a
maximal distance for co-occurrence, (i.e. a fixed-sized
window), but the frequency will still be the number of
documents where the two words co-occur within this dis-
tance. The MLE for the co-occurrence in this approach
is a56a59a58 a21 a9a12a11 a21 a14a22a62a33a7a195a189 a40 a41a65a64a54a44a41a47a66 a181a15a194 , since a48 a49a52a51a54a53 a7 a194 in the
document-oriented approach.
3.3 Syntax based approach
An alternative to the Window and Document-oriented ap-
proach is to use syntactical information (Grefenstette,
1993). For this purpose, a Parser or Part-Of-Speech tag-
ger must be applied to the text and only the interesting
pairs of words in correct syntactical categories used. In
this case, the fixed window can be superseded by the re-
sult of the syntax analysis or tagging process and the fre-
quency of the pairs can be used directly. Alternatively,
the number of documents that contain the pair can also
be used. However, the nature of the language tests in this
work make it impractical to be applied. First, the alter-
natives are not in a context, and as such can have more
than one part-of-speech tag. Occasionally, it is possible
to infer that the syntactic category of the alternatives from
context of the target word a1a3a2 , if there is such a context
. When the alternatives, or the target word a1a3a2 , are mul-
tiwords then the problem is harder, as depicted in the first
example of figure 7. Also, both parsers and POS tagger
make mistakes, thus introducing error. Finally, the size
of the corpus used and its nature intensify the parser/POS
taggers problems.
Figure 3: Results for TOEFL test set Figure 4: Impact of corpus size on TOEFL test set
Figure 5: Results for TS1 and no context Figure 6: Results for TS1 and context
4 Experiments
We evaluate the methods and frequency estimates using 3
test sets. The first test set is a set of TOEFL questions first
used by Landauer and Dumais (1997) and also by Tur-
ney (2001). This test set contains 80 synonym questions
and for each question one a1a3a2 and four alternative op-
tions (a118a4a184a118a31a7a121a196 ) are given. The other two test sets, which
we will refer to as TS1 and TS2, are practice questions
for the TOEFL. These two test sets also contain four al-
ternatives options, a118a4a186a118a43a7a197a196 , and a1a10a2 is given in context
a0 (within a sentence). TS1 has 50 questions and was also
used by Turney (2001). TS2 has 60 questions extracted
from a TOEFL practice guide (King and Stanley, 1989).
For all test sets the answer to each question is known and
unique. For comparison purposes, we also use TS1 and
TS2 with no context.
For the three test sets, TOEFL, TS1 and TS2 without
context, we applied the word and document-oriented fre-
quency estimates presented. We investigated a variety of
window sizes, varying the window size from 2 to 256 by
powers of 2.
The labels used in figures 3, 5, 6, 8, 9, 10, 12 are com-
posed from a keyword indicating the frequency estimate
used (W-window oriented; and DR-document retrieval
oriented) and a keyword indicating the word similarity
measure. For no-context measures the keywords are:
PMI-Pointwise Mutual Information; CHI-Chi-Squared;
MI-Average mutual information; and LL-Log-likelihood.
For the measures with context: CP-Cosine pointwise mu-
tual information; L1-L1 norm; AMIC-Average Mutual
Information in the presence of context; IRAD-Jensen-
Shannon Divergence; and PMIC-a127 - Pointwise Mutual
Information with a127 words of context.
For TS1 and TS2 with context, we also investigate Tur-
ney’s hypothesis that the outcome of adding more words
from a0 is negative, using DR-PMIC. The result of this
experiment is shown in figures 10 and 12 for TS1 and
TS2 respectively.
It is important to note that in some of the questions,
a1a3a2 or one or more of the a4 a32 ’s are multi-word strings.
For these questions, we assume that the strings may be
treated as collocations and use them “as is”, adjusting the
size of the windows by the collocation size when appli-
cable.
The corpus used for the experiments is a terabyte of
Web data crawled from the general web in 2001. In order
to balance the contents of the corpus, a breadth-first order
search was used from a initial seed set of URLs represent-
ing the home page of 2392 universities and other educa-
tional organizations (Clarke et al., 2002). No duplicate
pages are included in the collection and the crawler also
did not allow a large number of pages from the same site
to be downloaded simultaneously. Overall, the collection
contains 53 billion words and 77 million documents.
A key characteristic of this corpus is that it consists of
HTML files. These files have a focus on the presentation,
and not necessarily on the style of writing. Parsing or
tagging these files can be a hard process and prone to in-
troduction of error in rates bigger than traditional corpora
used in NLP or Information Retrieval.
We also investigate the impact of the collection size on
a0 = “The country is plagued by [turmoil].”
a4 =
a5 ‘constant change’,‘utter confusion’,‘bad weather’,‘fuel shortages’a6
a0 = “[For] all their protestations, they heeded the judge’s ruling.”
a4 =
a5 ‘In spite of’,‘Because of’,‘On behalf of’,‘without’a6
Figure 7: Examples of harder questions in TS2
Figure 8: Results for TS2 and no context Figure 9: Results for TS2 and context
the results, as depicted in figures 4, 11 and 13 for TOEFL,
TS1 and TS2 test sets, respectively.
5 Results and Discussion
The results for the TOEFL questions are presented in
figure 3. The best performance found is 81.25% of the
questions correctly answered. That result used DR-PMI
with a window size of 16-32 words. This is an im-
provement over the results presented by Landauer and
Dumais (1997) using Latent Semantic Analysis, where
64.5% of the questions were answered correctly, and Tur-
ney (2001), using pointwise mutual information and doc-
ument retrieval, where the best result was 73.75%.
Although we use a similar method (DR-PMI), the dif-
ference between the results presented here and Turney’s
results may be due to differences in the corpora and dif-
ferences in the queries. Turney uses Altavista and we
used our own crawl of web data. We can not compare
the collections since we do not know how Altavista col-
lection is created. As for the queries, we have more con-
trol over the queries since we can precisely specify the
window size and we also do not know how queries are
evaluated in Altavista.
PMI performs best overall, regardless of estimates used
(DR or W). W-CHI performs up to 80% when using win-
dow estimates, outperforming DR-CHI. MI and LL yield
exactly the same results (and the same ranking of the al-
ternatives), which suggests that the binomial distribution
is a good approximation for word occurrence in text.
The results for MI and PMI indicate that, for the two
discrete random variables a2 a9 and a2 a14 (and range a34a198a7
a5a22a21
a32 a11a18a37
a21
a32
a6 ), no further gain is achieved by calculating the
expectation in the divergence. Recall that the divergence
formula has an embedded expectation to be calculated be-
tween the joint probability of these two random variables
and their independence. The peak of information is ex-
actly where both words co-occur, i.e. when a2 a9a199a7 a21 a9
and a2 a14a200a7 a21 a14 , and not any of the other three possible
combinations.
Similar trends are seen when using TS1 and no con-
text, as depicted in figure 5. PMI is best overall, and DR-
PMI and W-PMI outperform each other with different
windows sizes. W-CHI has good performance in small
windows sizes. MI and LL yield identical (poor) results,
being worst than chance for some window sizes. Tur-
ney (2001) also uses this test set without context, achiev-
ing 66% peak performance compared with our best per-
formance of 72% (DR-PMI).
In the test set TS2 with no context, the trend seen be-
tween TOEFL and TS1 is repeated, as shown in figure 8.
PMI is best overall but W-CHI performs better than PMI
in three cases. DR-CHI performs poorly for small win-
dows sizes. MI and LL also perform poorly in compari-
son with PMI. The peak performance is 75%, using DR-
PMI with a window size of 64.
The result are not what we expected when context is
used in TS1 and TS2. In TS1, figure 6, only one of
the measures, DR-PMIC-1, outperforms the results from
non-context measures, having a peak of 80% correct an-
swers. The condition for the best result (one word from
context and a window size of 8) is similar to the one
used for the best score reported by Turney. L1, AMIC
and IRAD perform poorly, worst than chance for some
window sizes. One difference in the results is that for
DR-PMIC-1 only the best word from context was used,
while the other methods used all words but stopwords.
We examine the context and discovered that using more
words degrades the performance of DR-PMIC in all dif-
ferent windows sizes but, even using all words except
stopwords, the result from DR-PMIC is better than any
other contextual measure - 76% correct answers in TS1
(with DR-PMIC and a window size of 8).
For TS2, no measure using context was able to perform
Figure 10: Influence from the context on TS1 Figure 11: Impact of corpus size on TS1
Figure 12: Influence from the context on TS2 Figure 13: Impact of corpus size on TS2
better than the non-contextual measures. DR-PMIC-1
performs better overall but has worse performance than
DR-CP with a window size of 8. In this test set, the per-
formance of DR-CP is better than W-CP. L1 performs
better than AMIC but both have poor results, IRAD is
never better than chance. The context in TS2 has more
words than TS1 but the questions seem to be harder, as
shown in figure 7. In some of the TS2 questions, the tar-
get word or one of the alternatives uses functional words.
We also investigate the influence of more words from
context in TS2, as depicted in figure 12, where the trends
seen with TS1 are repeated.
The results in TS1 and TS2 suggest that the available
context is not very useful or that it is not being used prop-
erly.
Finally, we selected the method that yields the best
performance for each test set to analyze the impact
of the corpus size on performance, as shown in fig-
ures 4, 11 and 13. For TS1 we use W-PMI with a win-
dow size of 2 (W-PMI2) when no context is used and
DR-PMIC-1 with a window size of 8 (DR-PMIC8-1)
when context is used. For those measures, very little im-
provement is noticed after 500 GBytes for DR-PMIC8-1,
roughly half of the collection size. No apparent improve-
ment is achieved after 300-400 GBytes for W-PMI2. For
TS2 we use DR-PMI with a window size of 64 (DR-
PMI64) when no context is used, and DR-PMIC-1 with
a windows size of 64 (DR-PMIC64-1) when context is
used. It is clear that for TS2 no substantial improve-
ment in DR-PMI64 and DR-PMIC64-1 is achieved by
increasing the corpus size to values bigger than 300-400
GBytes. The most interesting impact of corpus size was
on TOEFL test set using DR-PMI with a window size of
16 (DR-PMI16). Using the full corpus is no better than
using 5% of the corpus, and the best result, 82.5% correct
answers, is achieved when using 85-95% of corpus size.
6 Conclusion
Using a large corpus and human-oriented tests we de-
scribe a comprehensive study of word similarity mea-
sures and co-occurrence estimates, including variants on
corpus size. Without any parameter training, we were
able to correctly answer at least 75% questions in all test
sets. From all combinations of estimates and measures,
document retrieval with a maximum window of 16 words
and pointwise mutual information performs best on aver-
age in the three test sets used. However, both document or
windows-oriented approach for frequency estimates pro-
duce similar results in average. The impact of the corpus
size is not very conclusive, it suggests that the increase
in the corpus size normally reaches an asymptote, but the
points where this occurs is distinct among different mea-
sures and frequency estimates.
Our results outperform the previously reported results
on test sets when no context is used, being able to cor-
rectly answer 81.25% of TOEFL synonym questions,
compared with a previous best result of 73.5%. A hu-
man average score on the same type of questions is
64.5% (Landauer and Dumais, 1997). We also perform
better than previous work on another test set used as prac-
tice questions for TOEFL, obtaining 80% correct answers
compared to a best result of 74% from previous work.
Acknowledgments
This work was made possible also in part by PUC/RS and
Ministry of Education of Brazil through CAPES agency.
References
P. F. Brown, P. V. deSouza, R. L. Mercer, T. J. Watson,
V. J. Della Pietra, and J. C. Lai. 1992. Class-based n-
gram models of natural language. Computational Lin-
guistics, 18:467–479.
C. Buckley, G. Salton, J. Allan, and A. Singhal. 1995.
Automatic query expansion using smart: Trec 3. In
The third Text REtrieval Conference, Gaithersburg,
MD.
K.W. Church and P. Hanks. 1990. Word association
norms, mutual information, and lexicography. Com-
putational Linguistics, 16(1):22–29.
C.L.A. Clarke, G.V. Cormack, M. Laszlo, T.R. Lynam,
and E.L. Terra. 2002. The impact of corpus size on
question answering performance. In Proceedings of
2002 SIGIR conference, Tampere, Finland.
I. Dagan, L. Lee, and F. C. N. Pereira. 1999. Similarity-
based models of word cooccurrence probabilities. Ma-
chine Learning, 34(1-3):43–69.
T. Dunning. 1993. Accurate methods for the statistics of
surprise and coincidence. Computational Linguistics,
19:61–74.
G. Grefenstette. 1993. Automatic theasurus generation
from raw text using knowledge-poor techniques. In
Making sense of Words. 9th Annual Conference of the
UW Centre for the New OED and text Research.
D. Harman. 1992. Relevance feedback revisited. In
Proceedings of 1992 SIGIR conference, Copenhagen,
Denmark.
C. King and N. Stanley. 1989. Building Skills for the
TOEFL. Thomas Nelson and Sons Ltd, second edition.
T. K. Landauer and S. T. Dumais. 1997. A solution
to plato’s problem: The latent semantic analysis the-
ory of the acquisition, induction, and representation of
knowledge. Psychological Review, 104(2):211–240.
Lillian Lee. 2001. On the effectiveness of the skew di-
vergence for statistical language analysis. In Artificial
Intelligence and Statistics 2001, pages 65–72.
M. E. Lesk. 1969. Word-word associations in doc-
ument retrieval systems. American Documentation,
20(1):27–38, January.
Hang Li and Naoki Abe. 1998. Word clustering and dis-
ambiguation based on co-occurence data. In COLING-
ACL, pages 749–755.
Dekang Lin. 1998. Automatic retrieval and clustering of
similar words. In COLING-ACL, pages 768–774.
P. Pantel and D. Lin. 2002. Discovering word senses
from text. In Proceedings of ACM SIGKDD Confer-
ence on Knowledge Discovery and Data Mining, pages
613–619.
R. Rosenfeld. 1996. A maximum entropy approach
to adaptive statistical language modeling. computer
speech and language. Computer Speech and Lan-
guage, 10:187–228.
P.-N. Tan, V. Kumar, and J. Srivastava. 2002. Selecting
the right interestingness measure for association pat-
terns. In Proceedings of ACM SIGKDD Conference on
Knowledge Discovery and Data Mining, pages 32–41.
P. D. Turney. 2001. Mining the Web for synonyms:
PMI–IR versus LSA on TOEFL. In Proceedings of
the Twelfth European Conference on Machine Learn-
ing (ECML-2001), pages 491–502.
O. Vechtomova and S. Robertson. 2000. Integration
of collocation statistics into the probabilistic retrieval
model. In 22nd Annual Colloquium on Information
Retrieval Research, Cambridge, England.
J. Xu and B. Croft. 2000. Improving the effectiveness of
information retrieval. ACM Transactions on Informa-
tion Systems, 18(1):79–112.
David Yarowsky. 1992. Word-sense disambiguation us-
ing statistical models of Roget’s categories trained on
large corpora. In Proceedings of COLING-92, pages
454–460, Nantes, France, July.
