Fast Computation of Lexical Affinity Models
Egidio Terra Charles L.A. Clarke
School of Computer Science
University of Waterloo
Canada
{elterra,claclark}@plg2.uwaterloo.ca
Abstract
We present a framework for the fast compu-
tation of lexical affinity models. The frame-
work is composed of a novel algorithm to effi-
ciently compute the co-occurrence distribution
between pairs of terms, an independence model,
and a parametric affinity model. In compari-
son with previous models, which either use ar-
bitrary windows to compute similarity between
words or use lexical affinity to create sequential
models, in this paper we focus on models in-
tended to capture the co-occurrence patterns of
any pair of words or phrases at any distance in
the corpus. The framework is flexible, allowing
fast adaptation to applications, and it is scalable.
We apply it in combination with a terabyte cor-
pus to answer natural language tests, achieving
encouraging results.
1 Introduction
Modeling term co-occurrence is important for many
natural language applications, such as topic seg-
mentation (Ferret, 2002), query expansion (Vech-
tomova et al., 2003), machine translation (Tanaka,
2002), language modeling (Dagan et al., 1999;
Yuret, 1998), and term weighting (Hisamitsu and
Niwa, 2002). For these applications, we are in-
terested in terms that co-occur in close proxim-
ity more often than expected by chance, for exam-
ple, (“NEW”, “YORK”), (“ACCURATE”, “EXACT”)
and (“GASOLINE”, “CRUDE”). These pairs of terms
represent distinct lexical-semantic phenomena, and
as consequence the terms have an affinity for each
other. Examples of such affinities include syn-
onyms (Terra and Clarke, 2003), verb similari-
ties (Resnik and Diab, 2000) and word associa-
tions (Rapp, 2002).
Ideally, a language model would capture the pat-
terns of co-occurrences representing the affinity be-
tween terms. Unfortunately, statistical models used
to capture language characteristics often do not take
contextual information into account. Many models
incorporating contextual information use only a se-
lect group of content words and the end product is a
model for sequences of adjacent words (Rosenfeld,
1996; Beeferman et al., 1997; Niesler and Wood-
land, 1997).
Practical problems exist when modeling text sta-
tistically, since we require a reasonably sized cor-
pus in order to overcome sparseness problems, but
at the same time we face the difficulty of scal-
ing our algorithms to larger corpora (Rosenfeld,
2000). Attempts to scale language models to large
corpora, in particular to the Web, have often used
general-purpose search engines to generate term
statistics (Berger and Miller, 1998; Zhu and Rosen-
feld, 2001). However, many researchers are rec-
ognizing the limitations of relying on the statistics
provided by commercial search engines (Zhu and
Rosenfeld, 2001; Keller and Lapata, 2003). ACL
2004 features a workshop devoted to the problem
of scaling human language technologies to terabyte-
scale corpora.
Another approach to capturing lexical affinity is
through the use of similarity measures (Lee, 2001;
Terra and Clarke, 2003). Turney (2001) used statis-
tics supplied by the Altavista search engine to com-
pute word similarity measures, solving a set of syn-
onym questions taken from a series of practice ex-
ams for TOEFL (Test of English as a Foreign Lan-
guage). While demonstrating the value of Web data
for this application, that work was limited by the
types of queries that the search engine supported.
Terra and Clarke (2003) extended Turney’s work,
computing different similarity measures over a lo-
cal collection of Web data using a custom search
system. By gaining better control over search se-
mantics, they were able to vary the techniques
used to estimate term co-occurrence frequencies
and achieved improved performance on the same
question set in a smaller corpus. The choice of the
term co-occurrence frequency estimates had a big-
ger impact on the results than the actual choice of
similarity measure. For example, in the case of the
pointwise mutual information measure (PMI), val-
ues for the joint probability P(a, b) are best esti-
mated by counting the number of times the terms a
and b appear together within 10-30 words. This ex-
perience suggests that the empirical distribution of
distances between adjacent terms may represent a
valuable tool for assessing term affinity. In this pa-
per, we present a novel algorithm for computing these distributions
over large corpora and compare them with the ex-
pected distribution under an independence assump-
tion.
In section 2, we present an independence model
and a parametric affinity model, used to capture
term co-occurrence with support for distance infor-
mation. In section 3 we describe our algorithm for
computing lexical affinity over large corpora. Using
this algorithm, affinity may be computed between
terms consisting of individual words or phrases. Ex-
periments and examples in the paper were generated
by applying this algorithm to a terabyte of Web data.
We discuss practical applications of our framework
in section 4, which also provides validation of the
approach.
2 Models for Word Co-occurrence
There are two types of models for the co-occurrence
of word pairs: functional models and distance mod-
els. Distance models use only positional informa-
tion to measure co-occurrence frequency (Beefer-
man et al., 1997; Yuret, 1998; Rosenfeld, 1996).
A special case of the distance model is the n-gram
model, where the only distance allowed between
pairs of words in the model is one. Any pair of
words represents a parameter in distance models. There-
fore, these models have to deal with combinato-
rial explosion problems, especially when longer se-
quences are considered. Functional models use the
underlying syntactic function of words to measure
co-occurrence frequency (Weeds and Weir, 2003;
Niesler and Woodland, 1997; Grefenstette, 1993).
The need for parsing affects the scalability of these
models.
Note that both distance and functional models
rely only on pairs of terms consisting of a single
word. Consider the pair of terms “NEW YORK” and
“TERRORISM”, or any pair where one of the two
items is itself a collocation. To the best of our
knowledge, no model tries to estimate composite
terms of the form f(ab, c) or f(ab, cd), where a, b,
c, d are words in the vocabulary, without regard to
the distribution function f.
In this work, we use models based on distance in-
formation. The first is an independence model that
is used as baseline to determine the strength of the
affinity between a pair of terms. The second is in-
tended to fit the empirical term distribution, reflect-
ing the actual affinity between the terms.
Notation. Let W be a random variable whose range
comprises all the words in the vocabulary. Also,
let us assume that W has multinomial probability
distribution function f_W. For any pair of terms a
and b, let D_{a,b} be a random variable with the dis-
tance distribution for the co-occurrence of terms a
and b. Let the probability distribution function of
the random variable D_{a,b} be f_D(a, b) and the corre-
sponding cumulative distribution be F_D(a, b).
2.1 Independence Model
Let a and b be two terms, with occurrence proba-
bilities f_W(a) and f_W(b). The chance, under inde-
pendence, of the pair a and b co-occurring within a
specific distance k, f_D(a, b; k), is given by a geomet-
ric distribution with parameter p, D ~ Geom(k; p).
This is straightforward, since if a and b are indepen-
dent then f_W(a|b) = f_W(a) and similarly f_W(b|a) =
f_W(b). If we fix a position for an occurrence of a,
then, under independence, the next b will occur with
probability f_W(b)·(1 − f_W(b))^(k−1) at distance k
from a. The expected distance is the mean of the
geometric distribution with parameter p.
The estimate of p is obtained using the Maximum
Likelihood Estimator for the geometric distribution.
Let c_k be the number of co-occurrences with dis-
tance k, and n be the sample size:

    p̂ = 1/x̄ = n / Σ_{k≥1} k·c_k    (1)
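Concretely, the estimator in equation 1 just inverts the sample mean distance. A minimal Python sketch (the distance histogram below is an invented toy example, not corpus data):

```python
def geometric_mle(counts):
    """MLE of the geometric parameter p from a histogram
    mapping distance k -> number of co-occurrences c_k."""
    n = sum(counts.values())                           # sample size
    mean = sum(k * c for k, c in counts.items()) / n   # x-bar, mean distance
    return 1.0 / mean

def geometric_pmf(k, p):
    """Probability, under independence, that the next occurrence
    falls at distance k: p * (1 - p)^(k - 1)."""
    return p * (1.0 - p) ** (k - 1)

# toy histogram: 50 pairs at distance 1, 30 at distance 2, 20 at distance 3
p_hat = geometric_mle({1: 50, 2: 30, 3: 20})
```

For the toy histogram the mean distance is 1.7, giving p ≈ 0.588.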
We make the assumption that multiple occur-
rences of a do not increase the chances of seeing
b, and vice-versa. This assumption implies a dif-
ferent estimation procedure, since we explicitly dis-
card what Beeferman et al. and Niesler call self-
triggers (Beeferman et al., 1997; Niesler and Wood-
land, 1997). We consider only those pairs in which
the terms are adjacent, with no intervening occur-
rences of a or b, although other terms may appear
between them.
Figure 1 shows that the geometric distribution fits
the observed distances between the independent
words DEMOCRACY and WATERMELON well.
When a dependency exists, the geometric model
does not fit the data well, as can be seen in Figure 2.
Since the geometric and exponential distributions
represent related ideas in discrete and continuous
spaces, it is expected that both give similar results,
especially when p ≪ 1.
2.2 Affinity Model
The model of affinity follows an exponential-like
distribution, as in the independence model. Other
researchers have also used exponential models for affin-
Figure 1: F_D(watermelon, democracy). Cumulative
probability vs. distance; curves: observed, indepen-
dence.
Figure 2: F_D(watermelon, fruits). Cumulative prob-
ability vs. distance; curves: observed, indepen-
dence, fitted.
ity (Beeferman et al., 1997; Niesler and Woodland,
1997). We use the gamma distribution, a generali-
zation of the exponential distribution, to fit the ob-
served data. Pairs of terms have a skewed distribu-
tion, especially when they have affinity for one an-
other, and the gamma distribution is a good choice
to model this phenomenon.
    Gamma(D = k; α, β) = k^(α−1) · e^(−k/β) / (β^α · Γ(α))    (2)

where Γ(α) is the complete gamma function. The
exponential distribution is a special case with α = 1.
Given a set of co-occurrence pairs, estimates for α
and β can be calculated using the Maximum Likeli-
hood Estimators given by:
    α·β = (1/n) Σ_{k≥1} k·c_k    (3)

and by:

    Γ′(α)/Γ(α) − log α = (1/n) Σ_{k≥1} c_k·log k − log((1/n) Σ_{k≥1} k·c_k)    (4)
Figure 2 shows the fit of the gamma distribution to
the word pair FRUITS and WATERMELON (fitted
α < 1).
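Equations 3 and 4 can be solved numerically: equation 4 involves α alone, so a one-dimensional root search suffices, and equation 3 then yields β. The sketch below is one possible implementation, approximating the digamma function Γ′(α)/Γ(α) by a central difference of math.lgamma; the counts are toy values, not corpus statistics:

```python
import math

def digamma(x, h=1e-5):
    # central-difference approximation to d/dx log Gamma(x)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def gamma_mle(counts):
    """counts: distance k -> number of co-occurrences c_k.
    Returns (alpha, beta) maximizing the gamma likelihood."""
    n = sum(counts.values())
    mean = sum(k * c for k, c in counts.items()) / n
    # right-hand side of equation 4: E[log k] - log(mean)
    s = sum(c * math.log(k) for k, c in counts.items()) / n - math.log(mean)
    # psi(alpha) - log(alpha) is increasing in alpha, so bisect for the root
    lo, hi = 1e-3, 1e6
    for _ in range(100):
        mid = (lo + hi) / 2
        if digamma(mid) - math.log(mid) < s:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    beta = mean / alpha          # equation 3: alpha * beta = mean distance
    return alpha, beta
```

The bisection exploits the monotonicity of ψ(α) − log α, which rises from −∞ toward 0 as α grows.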
3 Computing the Empirical Distribution
The independence and affinity models depend on a
good approximation to the sample mean x̄. We try
to reduce the bias of the estimator by using a large
corpus. Therefore, we want to scan the whole cor-
pus efficiently in order to make this framework us-
able.
3.1 Corpus
The corpus used in our experiments comprises a ter-
abyte of Web data crawled from the general web
in 2001 (Clarke et al., 2002; Terra and Clarke,
2003). The crawl was conducted using a breadth-
first search from an initial seed set of URLs rep-
resenting the home pages of 2392 universities and
other educational organizations. Pages with dupli-
cate content were eliminated. Overall, the collec-
tion contains 53 billion words and 77 million docu-
ments.
3.2 Computing Affinity
Given two terms, a and b, we wish to determine
the affinity between them by efficiently examining
all the locations in a large corpus where they co-
occur. We treat the corpus as a sequence of terms
T = t_1, t_2, ..., t_N, where N is the size of the cor-
pus. This sequence is generated by concatenating
together all the documents in the collection. Docu-
ment boundaries are then ignored.
While we are primarily interested in within-
document term affinity, ignoring the boundaries
simplifies both the algorithm and the model. Docu-
ment information need not be maintained and ma-
nipulated by the algorithm, and document length
normalization need not be considered. The order
of the documents within the sequence is not of ma-
jor importance. If the order is random, then our
independence assumption holds when a document
boundary is crossed and only the within-document
affinity can be measured. If the order is determined
by other factors, for example if Web pages from
a single site are grouped together in the sequence,
then affinity can be measured across these groups of
pages.
We are specifically interested in identifying all
the locations where a and b co-occur. Consider a
particular occurrence of a at position i in the se-
quence (t_i = a). Assume that the next occurrence
of a in the sequence is t_u and that the next occur-
rence of b is t_v (ignoring for now the exceptional
case where t_i is close to the end of the sequence
and is not followed by another a and b). If u > v,
then no a or b occurs between t_i and t_v, and the
interval can be counted for this pair. Otherwise, if
u < v, let t_w be the last occurrence of a before t_v.
No a or b occurs between t_w and t_v, and once again
the interval containing the terms can be considered.
Our algorithm efficiently computes all locations
in a large term sequence where a and b co-occur
with no intervening occurrences of either a or b.
Two versions of the algorithm are given, an asym-
metric version that treats terms in a specific order,
and a symmetric version that allows either term to
appear before the other.
The algorithm depends on two access functions,
τ and ρ, that return positions in the term sequence
t_1, ..., t_N. Both take a term t and a position i in the
term sequence as arguments and return results as
follows:

    τ(t, i) = v, the smallest position v ≥ i with t_v = t;
              N + 1 if t does not occur at or after i

    ρ(t, i) = w, the largest position w ≤ i with t_w = t;
              0 if t does not occur at or before i

Informally, the access function τ(t, i) returns the
position of the first occurrence of the term t located
at or after position i in the term sequence. If there
is no occurrence of t at or after position i, then
τ(t, i) returns N + 1. Similarly, the access function
ρ(t, i) returns the position of the last occurrence of
the term t located at or before position i in the term
sequence. If there is no occurrence of t at or before
position i, then ρ(t, i) returns 0.
These access functions may be efficiently imple-
mented using variants of the standard inverted list
data structure. A very simple approach, suitable for
a small corpus, stores all index information in mem-
ory. For a terma122 , a binary search over a sorted list of
the positions where a122 occurs computes the result of
a call to a141 a5a143a122a172a18a20a126a64a12 or a142a70a5a143a122a172a18a20a126a64a12 in a179a15a5a17a104a107a106a10a108a109a75a10a180a181a12a151a156a113a179a15a5a17a104a107a106a10a108a59a125a101a12
time. Our own implementation uses a two-level in-
dex, split between memory and disk, and imple-
ments different strategies depending on the relative
frequency of a term in the corpus, minimizing disk
traffic and skipping portions of the index where no
co-occurrence will be found. A cache and other data
structures maintain information from call to call.
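The binary-search variant described above can be sketched with Python's bisect module, where positions is the sorted posting list of a term (a simplified stand-in for the paper's two-level index):

```python
import bisect

def tau(positions, i, N):
    """First position of the term at or after i; N + 1 if none."""
    j = bisect.bisect_left(positions, i)
    return positions[j] if j < len(positions) else N + 1

def rho(positions, i):
    """Last position of the term at or before i; 0 if none."""
    j = bisect.bisect_right(positions, i)
    return positions[j - 1] if j > 0 else 0
```

Each call is a single binary search over the posting list, logarithmic in the term's frequency.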
The asymmetric version of the algorithm is given
below. Each iteration of the while loop makes three
calls to access functions to generate a co-occurrence
pair (w, v), representing the interval in the corpus
from t_w to t_v, where a and b are the start and end
of the interval. The first call (u ← τ(a, i)) finds
the first occurrence of a at or after i, and the second
(v ← τ(b, u + 1)) finds the first occurrence of b
after that, skipping any occurrences of b between i
and u. The third call (w ← ρ(a, v − 1)) essentially
indexes “backwards” in the corpus to locate the last
occurrence of a before v, skipping occurrences of a
between u and w. Since each iteration generates a
co-occurrence pair, the time complexity of the al-
gorithm depends on η, the number of such pairs,
rather than the number of times a and b appear in-
dividually in the corpus. Including the time required
by calls to access functions, the algorithm generates
all co-occurrence pairs in O(η log N) time.
    i ← 1;
    while i ≤ N do
        u ← τ(a, i);
        v ← τ(b, u + 1);
        w ← ρ(a, v − 1);
        if v ≤ N then
            Generate (w, v);
        end if;
        i ← w + 1;
    end while;
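A runnable sketch of the asymmetric scan, with sorted posting lists standing in for the access functions and None-style checks replacing the end-of-sequence sentinels (the positions are toy values):

```python
import bisect

def cooccurrences_asym(pos_a, pos_b):
    """Yield (w, v): an occurrence of term a at position w followed by
    an occurrence of term b at v, with no intervening a or b."""
    i = 0
    while True:
        j = bisect.bisect_left(pos_a, i)       # u: first a at or after i
        if j == len(pos_a):
            return                             # no further a
        u = pos_a[j]
        j = bisect.bisect_left(pos_b, u + 1)   # v: first b after u
        if j == len(pos_b):
            return                             # no b after the last a
        v = pos_b[j]
        j = bisect.bisect_right(pos_a, v - 1)  # w: last a before v
        w = pos_a[j - 1]
        yield (w, v)
        i = w + 1                              # resume just past w

# a at positions 1, 3, 7 and b at 5, 9 yield the pairs (3, 5) and (7, 9)
```

Each loop iteration emits exactly one pair, matching the O(η log N) bound.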
The symmetric version of the algorithm is given
next. It generates all locations in the term sequence
where a and b co-occur with no intervening occur-
rences of either a or b, regardless of order. Its oper-
ation is similar to that of the asymmetric version.
    i ← 1;
    while i ≤ N do
        v ← max(τ(a, i), τ(b, i));
        w ← min(ρ(a, v), ρ(b, v));
        if v ≤ N then
            Generate (w, v);
        end if;
        i ← w + 1;
    end while;
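The symmetric scan admits the same treatment; the later of the two next occurrences closes each interval, and this sketch stops explicitly once either term is exhausted (replacing the sentinel bookkeeping):

```python
import bisect

def cooccurrences_sym(pos_a, pos_b):
    """Yield (w, v) where the two terms co-occur, in either order,
    with no intervening occurrence of either term."""
    i = 0
    while True:
        ja = bisect.bisect_left(pos_a, i)
        jb = bisect.bisect_left(pos_b, i)
        if ja == len(pos_a) or jb == len(pos_b):
            return                            # no complete pair remains
        v = max(pos_a[ja], pos_b[jb])         # later of the two next occurrences
        last_a = pos_a[bisect.bisect_right(pos_a, v) - 1]
        last_b = pos_b[bisect.bisect_right(pos_b, v) - 1]
        w = min(last_a, last_b)               # the partner just before v
        yield (w, v)
        i = w + 1
```

With a at positions 1, 3, 7 and b at 5, 9 this produces both orders: (3, 5), (5, 7), (7, 9).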
To demonstrate the performance of the algorithm,
we apply it to the 99 word pairs described in Sec-
tion 4.2 on the corpus described in Section 3.1,
distributed over a 17-node cluster-of-workstations.
The terms in the corpus were indexed without stem-
ming. Table 1 presents the time required to scan all
co-occurrences of given pairs of terms. We report
the time for all hosts to return their results.
Figure 3: Log-likelihood – WATERMELON (series:
watermelon,seeds; watermelon,fruits)
Figure 4: Log-likelihood – UNITED (series: amer-
ica..united; united..america; united..states;
states..united)
Time
Fastest   1 ms
Average   310.32 ms
Slowest   744.1 ms
Table 1: Scanning performance on 99 word pairs of
the Minnesota Word Association Norms
4 Evaluation
We use the empirical and the parametric affinity dis-
tributions in two applications. In both, the indepen-
dence model is used as a baseline.
4.1 Log-Likelihood Ratio
The co-occurrence distributions assign probabilities
to each pair at every distance. We can compare
point estimates from the two distributions, and as-
sess how unlikely they are, by means of a log-
likelihood ratio test:

    log λ = log [ L(f_D(a, b); p_E) / L(f_D(a, b); p_I) ]    (5)

where p_E and p_I are the parameters for f_D(a, b)
under the empirical distribution and independence
models, respectively. It is also possible to use the
cumulative F_D instead of f_D. Figure 3 shows log-
likelihood ratios using the asymmetric empirical
distribution, and Figure 4 depicts log-likelihood ra-
tios using the symmetric distribution.
A set of fill-in-the-blanks questions taken from
GRE general tests were answered using the log-
likelihood ratio. For each question a sentence with
one or two blanks along with a set of options O was
given, as shown in Figure 5.
The correct alternative maximizes the likelihood of
the complete sentence S:

    log λ = log [ Π_{a∈S} Π_{b∈S, b≠a} L(f_D(a, b; k_{a,b}); p_E) /
                  Π_{a∈S} Π_{b∈S, b≠a} L(f_D(a, b; k_{a,b}); p_I) ]    (6)
where k_{a,b} is the distance between a and b in the
sentence. Since only the blanks change from one al-
ternative to another, the remaining pairs are treated
as constants and can be ignored for the purpose of
ranking:

    log λ_a = log [ Π_{b∈S, b≠a} L(f_D(a, b; k_{a,b}); p_E) /
                    Π_{b∈S, b≠a} L(f_D(a, b; k_{a,b}); p_I) ]    (7)

for every a ∈ O.
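The ranking in equation 7 reduces to summing per-pair log-likelihood ratios over the pairs involving the candidate word. A toy sketch, under my simplifying assumption that both sides are summarized by geometric parameters (p_e fitted, p_i under independence), which is not the paper's exact estimator:

```python
import math

def geom_logpmf(k, p):
    # log of p * (1 - p)^(k - 1)
    return math.log(p) + (k - 1) * math.log(1.0 - p)

def score_alternative(pairs):
    """pairs: (k, p_e, p_i) tuples, one per (candidate word, sentence
    word) pair at sentence distance k, with empirical and independence
    parameters. A higher score means a better alternative."""
    return sum(geom_logpmf(k, p_e) - geom_logpmf(k, p_i)
               for k, p_e, p_i in pairs)

# a candidate whose short-distance co-occurrence is far more likely under
# the empirical model scores above one that matches independence exactly
strong = score_alternative([(2, 0.4, 0.05)])
weak = score_alternative([(2, 0.05, 0.05)])
```

A pair whose empirical and independence parameters coincide contributes zero, so such pairs drop out of the ranking, mirroring the constant terms ignored above.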
It is not necessary to compute the likelihood for
all pairs in the whole sentence; instead, a cut-off for
the maximum distance can be specified. If the cut-
off is two, then the resulting behavior will be sim-
ilar to a word bigram language model (with differ-
ent estimates). An increase in the cut-off has two
immediate implications. First, it incorporates the
surroundings of the word as context. Second, it
causes an indirect smoothing effect, since we use
cumulative probabilities to compute the likelihood.
As with any distance model, this approach has the
drawback of allowing constructions that are not
syntactically valid.
The tests used are from GRE practice tests ex-
tracted from the websites: gre.org (9 ques-
tions), PrincetonReview.com (11 questions),
Syvum.com (15 questions) and Microedu.com
(28 questions). Table 2 shows the results for a cut-
off of seven words. Every question has five op-
tions, and thus selecting the answer at random gives
an expected score of 20%. Our framework answers
55% of the questions.
The science of seismology has grown just
enough so that the first overly bold theories have
been ____.
a) magnetic. . . accepted
b) predictive . . . protected
c) fledgling. . . refuted
d) exploratory . . . recalled
e) tentative. . . analyzed
Figure 5: Example of fill-in-the-blanks question
Source Correct Answers
ETS.org 67%
Princeton Review 54%
Syvum.com 67%
Microedu.com 46%
Overall 55%
Table 2: Fill-in-the-blanks results
4.2 Skew
Our second evaluation uses the parametric affinity
model. We use the skew of the fitted model to evalu-
ate the degree of affinity of two terms. We validated
our hypothesis that a greater positive skew corre-
sponds to more affinity. A list of pairs from word as-
sociation norms and a list of randomly picked pairs
are used. Word association is a common test in psy-
chology (Nelson et al., 2000), and it consists of a
person providing an answer to a stimulus word by
giving an associated one in response. The set of
words used in the test are called “norms”. Many
word association norms are available in the psychology
literature; we chose the Minnesota word association
norms for our experiments (Jenkings, 1970). It is
composed of 100 stimulus words and the most fre-
quent answer given by 1000 individuals who took
the test. We also use 100 word pairs generated by
randomly choosing words from a small dictionary.
The skew of the gamma distribution is γ = 2/√α,
and Table 3 shows the normalized skew for the asso-
ciation and the random pair sets. Note that the set
of 100 random pairs includes some non-independent
ones.
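The skew comparison itself is a one-line computation; α = 1 (the exponential, i.e. independence case) gives a skew of exactly 2:

```python
def gamma_skew(alpha):
    """Skewness of a gamma distribution with shape alpha: 2 / sqrt(alpha).
    A value of 2.0 corresponds to the exponential case (independence);
    larger values indicate stronger affinity."""
    return 2.0 / alpha ** 0.5
```

Inverting, the Minnesota norms' average skew of 3.1425 implies a fitted shape of α = (2/3.1425)² ≈ 0.405.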
The value of the skew was then tested on a set of
TOEFL synonym questions. Each question in this
synonym test set is composed of one target word
and a set of four alternatives. This TOEFL syn-
onym test set has been used by several other re-
searchers. It was first used in the context of La-
tent Semantic Analysis (LSA) (Landauer and Du-
mais, 1997), where 64.4% of the questions were an-
swered correctly. Turney (Turney, 2001) and Terra
et al. (Terra and Clarke, 2003) used different sim-
Pair Sets                      γ
Minnesota association norm     3.1425
Random set                     2.1630

Table 3: Skewness; γ = 2.0 indicates independence
ilarity measures and statistical estimates to answer
the questions, achieving 73.75% and 81.25% cor-
rect answers respectively. Jarmasz (Jarmasz and
Szpakowicz, 2003) used a thesaurus to compute
the distance between the alternatives and the target
word, answering 78.75% correctly. Turney (Turney
et al., 2003) trained a system to answer the ques-
tions with an approach based on combined compo-
nents, including a module for LSA, PMI, thesaurus
and some heuristics based on the patterns of syn-
onyms. This combined approach answered 97.50%
of the questions correctly after being trained over
351 examples. With the exception of (Turney et al.,
2003), none of these approaches was designed ex-
clusively for the task of answering TOEFL synonym
questions.
In order to estimate α and β, we compute the em-
pirical distribution. This distribution provides us
with the right-hand side of equation 4, and we can
solve for α numerically. The calculation of β is then
straightforward. Using only skew, we were able to
answer 78.75% of the TOEFL questions correctly.
Since skew represents the degree of asymmetry of
the affinity model, this result suggests that skew and
synonymy are strongly related.
We also used log-likelihood to solve the TOEFL
synonym questions. For each target-alternative pair,
we calculated the log-likelihood for every distance
in the range four to 750. The initial cut-off dis-
carded the affinity caused by phrases containing
both target and alternative words. The upper cut-off
of 750 represents the average document size in the
collection. The cumulative log-likelihood was then
used as the score for each alternative, and we con-
sidered the best alternative to be the one with the
highest accumulated log-likelihood. With this ap-
proach, we are able to answer 86.25% of the ques-
tions correctly, a substantial improvement over sim-
ilar methods that do not require training data.
5 Conclusion
We presented a framework for the fast and effec-
tive computation of lexical affinity models. Instead
of using arbitrary windows to compute word simi-
larity measures, we model lexical affinity using the
complete observed distance distribution along with
independence and parametric models for this distri-
bution. Our results show that, with minimal ef-
fort to adapt the models, we achieve good results
by applying this framework to simple natural lan-
guage tasks, such as TOEFL synonym questions and
GRE fill-in-the-blanks tests. This framework allows
the use of terabyte-scale corpora by providing a fast
algorithm to extract pairs of co-occurrence for the
models, thus enabling the use of more precise esti-
mators.
Acknowledgments
This work was made possible in part by
PUC/RS and the Ministry of Education of Brazil
through the CAPES agency.

References

D. Beeferman, A. Berger, and J. Lafferty. 1997. A
model of lexical attraction and repulsion. In Pro-
ceedings of the 35th Annual Meeting of the ACL
and 8th Conference of the EACL, pages 373–380.

A. Berger and R. Miller. 1998. Just-in-time
language modelling. In Proceedings of IEEE
ICASSP, volume 2, pages 705–708, Seattle,
Washington.

C.L.A. Clarke, G.V. Cormack, M. Laszlo, T.R. Ly-
nam, and E.L. Terra. 2002. The impact of cor-
pus size on question answering performance. In
Proceedings of 2002 SIGIR conference, Tampere,
Finland.

I. Dagan, L. Lee, and F. C. N. Pereira. 1999.
Similarity-based models of word cooccurrence
probabilities. Machine Learning, 34(1-3):43–69.

O. Ferret. 2002. Using collocations for topic seg-
mentation and link detection. In Proceedings of
the 19th COLING.

G. Grefenstette. 1993. Automatic thesaurus gener-
ation from raw text using knowledge-poor tech-
niques. In Making sense of Words. 9th Annual
Conference of the UW Centre for the New OED
and text Research.

T. Hisamitsu and Y. Niwa. 2002. A measure of
term representativeness based on the number of
co-occurring salient words. In Proceedings of the
19th COLING.

M. Jarmasz and S. Szpakowicz. 2003. Roget’s the-
saurus and semantic similarity. In Proceedings of
RANLP-03, Borovets, Bulgaria.

J.J. Jenkings. 1970. The 1952 Minnesota word as-
sociation norms. In L. Postman and G. Keppel,
editors, Norms of Word Association, pages 1–38.
Academic Press, New York.

F. Keller and M. Lapata. 2003. Using the web to
obtain frequencies for unseen bigrams. Compu-
tational Linguistics, 29(3):459–484.

T. K. Landauer and S. T. Dumais. 1997. A solu-
tion to Plato's problem: The latent semantic anal-
ysis theory of the acquisition, induction, and rep-
resentation of knowledge. Psychological Review,
104(2):211–240.

L. Lee. 2001. On the effectiveness of the skew di-
vergence for statistical language analysis. In Ar-
tificial Intelligence and Statistics 2001, pages 65–72.

D. Nelson, C. McEvoy, and S. Dennis. 2000. What
is and what does free association measure? Mem-
ory & Cognition, 28(6):887–899.

T. Niesler and P. Woodland. 1997. Modelling
word-pair relations in a category-based language
model. In Proc. ICASSP ’97, pages 795–798,
Munich, Germany.

R. Rapp. 2002. The computation of word associa-
tions: Comparing syntagmatic and paradigmatic
approaches. In Proceedings of the 19th COLING.

P. Resnik and M. Diab. 2000. Measuring verb
similarity. In 22nd Annual Meeting of the Cog-
nitive Science Society (COGSCI2000), Philadel-
phia, August.

R. Rosenfeld. 1996. A maximum entropy approach
to adaptive statistical language modeling. Com-
puter Speech and Language, 10:187–228.

R. Rosenfeld. 2000. Two decades of statistical lan-
guage modeling: Where do we go from here? In
Proceedings of the IEEE, volume 88.

T. Tanaka. 2002. Measuring the similarity between
compound nouns in different languages using
non-parallel corpora. In Proceedings of the 19th
COLING.

E. Terra and C. L. A. Clarke. 2003. Frequency es-
timates for statistical word similarity measures.
In Proceedings of HLT–NAACL 2003, pages 244–
251, Edmonton, Alberta.

P.D. Turney, M.L. Littman, J. Bigham, and
V. Shnayder. 2003. Combining independent
modules to solve multiple-choice synonym and
analogy problems. In Proceedings of RANLP-03,
Borovets, Bulgaria.

P. D. Turney. 2001. Mining the Web for synonyms:
PMI–IR versus LSA on TOEFL. In Proceedings
of ECML-2001, pages 491–502.

O. Vechtomova, S. Robertson, and S. Jones. 2003.
Query expansion with long-span collocates. In-
formation Retrieval, 6(2):251–273.

J. Weeds and D. Weir. 2003. A general framework
for distributional similarity. In Proceedings of the
2003 Conference on Empirical Methods in Natu-
ral Language Processing.

D. Yuret. 1998. Discovery of linguistic relations us-
ing lexical attraction. Ph.D. thesis, Department
of Computer Science and Electrical Engineering,
MIT, May.

X. Zhu and R. Rosenfeld. 2001. Improving trigram
language modeling with the world wide web. In
Proceedings of IEEE ICASSP, volume 1, pages
533–536.
