Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 708–715, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Using Sketches to Estimate Associations
Ping Li
Department of Statistics
Stanford University
Stanford, California 94305
pingli@stat.stanford.edu
Kenneth W. Church
Microsoft Research
One Microsoft Way
Redmond, Washington 98052
church@microsoft.com
Abstract
We should not have to look at the en-
tire corpus (e.g., the Web) to know if two
words are associated or not.1 A powerful
sampling technique called Sketches was
originally introduced to remove duplicate
Web pages. We generalize sketches to
estimate contingency tables and associa-
tions, using a maximum likelihood esti-
mator to find the most likely contingency
table given the sample, the margins (doc-
ument frequencies) and the size of the
collection. Not surprisingly, computational
work and statistical accuracy (variance
or errors) depend on sampling rate,
as will be shown both theoretically and
empirically. Sampling methods become
more and more important with larger and
larger collections. At Web scale, sampling
rates as low as 10^-4 may suffice.
1 Introduction
Word associations (co-occurrences) have a wide
range of applications including: Speech Recogni-
tion, Optical Character Recognition and Information
Retrieval (IR) (Church and Hanks, 1991; Dunning,
1993; Manning and Schutze, 1999). It is easy to
compute association scores for a small corpus, but
more challenging to compute lots of scores for lots
of data (e.g. the Web), with billions of web pages
(D) and millions of word types (V ). For a small
corpus, one could compute pair-wise associations by
multiplying the (0/1) term-by-document matrix with
its transpose (Deerwester et al., 1999). But this is
probably infeasible at Web scale.
1 This work was conducted at Microsoft while the first author
was an intern. The authors thank Chris Meek, David Heckerman,
Robert Moore, Jonathan Goldstein, Trevor Hastie, David
Siegmund, Art Owen, Robert Tibshirani and Andrew Ng.
Approximations are often good enough. We
should not have to look at every document to de-
termine that two words are strongly associated. A
number of sampling-based randomized algorithms
have been implemented at Web scale (Broder, 1997;
Charikar, 2002; Ravichandran et al., 2005).2
A conventional random sample is constructed by
selecting Ds documents from a corpus of D documents.
The (corpus) sampling rate is Ds/D. Of
course, word distributions have long tails. There
are a few high frequency words and many low fre-
quency words. It would be convenient if the sam-
pling rate could vary from word to word, unlike con-
ventional sampling where the sampling rate is fixed
across the vocabulary. In particular, in our experi-
ments, we will impose a floor to make sure that the
sample contains at least 20 documents for each term.
(When working at Web scale, one might raise the
floor somewhat to perhaps 10^4.)
Sampling is obviously helpful at the top of the
frequency range, but not necessarily at the bottom
(especially if frequencies fall below the floor). The
question is: how about “ordinary” words? To answer
this question, we randomly picked 15 pages from
a Learners’ dictionary (Hornby, 1989), and selected
the first entry on each page. According to Google,
there are 10 million pages/word (median value, aggregated
over the 15 words), nowhere near the floor.
Sampling can make it possible to work in mem-
ory, avoiding disk. At Web scale (D ≈ 10 billion
pages), inverted indexes are large (1500 GBs/billion
pages)3, probably too large for memory. But a sam-
ple is more manageable; the inverted index for a
10^-4 sample of the entire web could fit in memory
on a single PC (1.5 GB).
2 http://labs.google.com/sets produces fascinating sets, although
we don’t know how it works. Given the seeds, “Amer-
ica” and “China,” http://labs.google.com/sets returns: “Amer-
ica, China, Japan, India, Italy, Spain, Brazil, Persia, Europe,
Australia, France, Asia, Canada.”
3 This estimate is extrapolated from Brin and Page (1998),
who report an inverted index of 37.2 GBs for 24 million pages.
Table 1: The number of intermediate results after the
first join can be reduced from 504,000 to 120,000,
by starting with “Schwarzenegger & Austria” rather
than the baseline (“Schwarzenegger & Terminator”).
The standard practice of starting with the two least
frequent terms is a good rule of thumb, but one can
do better, given (estimates of) joint frequencies.
Query                           Hits (Google)
Austria                            88,200,000
Governor                           37,300,000
Schwarzenegger                      4,030,000
Terminator                          3,480,000
Governor & Schwarzenegger           1,220,000
Governor & Austria                    708,000
Schwarzenegger & Terminator           504,000
Terminator & Austria                  171,000
Governor & Terminator                 132,000
Schwarzenegger & Austria              120,000
1.1 An Application: The Governator
Google returns the top k hits, plus an estimate of
how many hits there are. Table 1 shows the number
of hits for four words and their pair-wise combina-
tions. Accurate estimates of associations would have
applications in Database query planning (Garcia-
Molina et al., 2002). Query optimizers construct a
plan to minimize a cost function (e.g., intermediate
writes). The optimizer could do better if it could
estimate a table like Table 1. But efficiency is im-
portant. We certainly don’t want to spend more time
optimizing the plan than executing it.
Suppose the optimizer wanted to construct a plan
for the query: “Governor Schwarzenegger Termi-
nator Austria.” The standard solution starts with
the two least frequent terms: “Schwarzenegger” and
“Terminator.” That plan generates 504,000 interme-
diate writes after the first join. An improvement
starts with “Schwarzenegger” with “Austria,” reduc-
ing the 504,000 down to 120,000.
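The choice of the first join can be illustrated with a minimal Python sketch over the hit counts of Table 1. The function names (`baseline_first_join`, `best_first_join`) are hypothetical and only illustrate the two strategies; they are not taken from any real query optimizer.

```python
from itertools import combinations

# Single-term and pairwise hit counts copied from Table 1.
TERM_HITS = {
    "Austria": 88_200_000,
    "Governor": 37_300_000,
    "Schwarzenegger": 4_030_000,
    "Terminator": 3_480_000,
}
PAIR_HITS = {
    frozenset(("Governor", "Schwarzenegger")): 1_220_000,
    frozenset(("Governor", "Austria")): 708_000,
    frozenset(("Schwarzenegger", "Terminator")): 504_000,
    frozenset(("Terminator", "Austria")): 171_000,
    frozenset(("Governor", "Terminator")): 132_000,
    frozenset(("Schwarzenegger", "Austria")): 120_000,
}


def baseline_first_join(terms):
    """Standard rule of thumb: join the two least frequent terms first."""
    return frozenset(sorted(terms, key=TERM_HITS.get)[:2])


def best_first_join(terms):
    """Use (estimated) joint frequencies: pick the pair with the fewest hits."""
    pairs = (frozenset(p) for p in combinations(terms, 2))
    return min(pairs, key=PAIR_HITS.get)


terms = list(TERM_HITS)
baseline = baseline_first_join(terms)
best = best_first_join(terms)
print(sorted(baseline), PAIR_HITS[baseline])
print(sorted(best), PAIR_HITS[best])
```

With the Table 1 counts, the baseline selects “Schwarzenegger” and “Terminator,” while the joint-frequency rule selects “Schwarzenegger” and “Austria.”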
In addition to counting hits, Table 1 could also
help find the top k pages. When joining the first pair
of terms, we’d like to know how far down the rank-
ing we should go. Accurate estimates of associations
would help the optimizer make such decisions.
It is desirable that estimates be consistent, as well
as accurate. Google, for example, reports 6 million
hits for “America, China, Britain,” and 23 million for
“America, China, Britain, Japan.” These two counts are
inconsistent, because joint frequencies must decrease
monotonically as terms are added: s ⊂ S ⟹ hits(s) ≥ hits(S).
(a)
             y          ~y
   x         a          b          fx = a + b
  ~x         c          d
         fy = a + c                D = a + b + c + d

(b)
             y          ~y
   x         as         bs         nx = as + bs
  ~x         cs         ds
         ny = as + cs              Ds = as + bs + cs + ds
Figure 1: (a): A contingency table for word x and
word y. Cell a is the number of documents that con-
tain both x and y, b is the number that contain x but
not y, c is the number that contain y but not x, and
d is the number that contain neither x nor y. The
margins, fx = a + b and fy = a + c are known as
document frequencies in IR. D is the total number
of documents in the collection. (b): A sample con-
tingency table, with “s” indicating the sample space.
1.2 Sampling and Estimation
Two-way associations are often represented as two-
way contingency tables (Figure 1(a)). Our task is to
construct a sample contingency table (Figure 1(b)),
and estimate 1(a) from 1(b). We will use a max-
imum likelihood estimator (MLE) to find the most
likely contingency table, given the sample and vari-
ous other constraints. We will propose a sampling
procedure that bridges two popular choices: (A)
sampling over documents and (B) sampling over
postings. The estimation task is straightforward and
well-understood for (A). As we consider more flexi-
ble sampling procedures such as (B), the estimation
task becomes more challenging.
Flexible sampling procedures are desirable. Many
studies focus on rare words (Dunning, 1993; Moore,
2004); butterflies are more interesting than moths.
The sampling rate can be adjusted on a word-by-
word basis with (B), but not with (A). The sampling
rate determines the trade-off between computational
work and statistical accuracy.
We assume a standard inverted index. For each
word x, there is a set of postings, X. X contains a
set of document IDs, one for each document contain-
ing x. The size of postings, fx = |X|, corresponds
to the margins of the contingency tables in Figure
1(a), also known as document frequencies in IR.
The postings lists are approximated by sketches,
skX, first introduced by Broder (1997) for remov-
ing duplicate web pages. Assuming that document
IDs are random (e.g., achieved by a random permu-
tation), we can compute skX, a random sample of
X, by simply selecting the first few elements of X.
In Section 3, we will propose using sketches
to construct sample contingency tables. With this
novel construction, the contingency table (and sum-
mary statistics based on the table) can be estimated
using conventional statistical methods such as MLE.
2 Broder’s Sketch Algorithm
One could randomly sample two postings and inter-
sect the samples to estimate associations. The sketch
technique introduced by Broder (1997) is a signifi-
cant improvement, as demonstrated in Figure 2.
Assume that each document in the corpus of size
D is assigned a unique random ID between 1 and D.
The postings for word x is a sorted list of fx doc IDs.
The sketch, skX, is the first (smallest) sx doc IDs in
X. Broder used MINs(Z) to denote the s smallest
elements in the set, Z. Thus, skX = MINsx(X).
Similarly, Y denotes the postings for word y, and
skY denotes its sketch, MINsy(Y ). Broder assumed
sx = sy = s.
Broder defined resemblance (R) and sample re-
semblance (Rs) to be:
$$R = \frac{a}{a + b + c}, \qquad R_s = \frac{|\mathrm{MIN}_s(sk_X \cup sk_Y) \cap sk_X \cap sk_Y|}{|\mathrm{MIN}_s(sk_X \cup sk_Y)|}.$$
Broder (1997) proved that Rs is an unbiased esti-
mator of R. One could use Rs to estimate a but he
didn’t do that, and it is not recommended.4
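The definitions of the sketch, R and Rs can be made concrete with a short Python sketch. The helper names (`sketch`, `resemblance`, `sample_resemblance`) are our own illustrative choices, and the postings are the two lists shown in Figure 3; exact fractions are used so that the results are not blurred by rounding.

```python
from fractions import Fraction


def sketch(postings, s):
    """MIN_s: the s smallest doc IDs of a postings list."""
    return sorted(postings)[:s]


def resemblance(X, Y):
    """R = a / (a + b + c) = |X ∩ Y| / |X ∪ Y|."""
    X, Y = set(X), set(Y)
    return Fraction(len(X & Y), len(X | Y))


def sample_resemblance(sk_x, sk_y, s):
    """R_s computed from two sketches of equal size s."""
    union_min = set(sorted(set(sk_x) | set(sk_y))[:s])
    common = union_min & set(sk_x) & set(sk_y)
    return Fraction(len(common), len(union_min))


# Postings from Figure 3 (doc IDs are assumed to be randomly permuted).
X = [3, 4, 7, 9, 10, 15, 18, 19, 24, 25, 28]
Y = [2, 4, 5, 8, 15, 19, 21, 24, 27, 28, 31]
s = 7
print(resemblance(X, Y))                                  # 5/17
print(sample_resemblance(sketch(X, s), sketch(Y, s), s))  # 1/7
```

A single small sample gives a noisy value of Rs; the unbiasedness result concerns the average over random permutations of the doc IDs.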
Sketches were designed to improve the coverage
of a, as illustrated by Monte Carlo simulation in Figure 2.
The figure plots E(as/a), the percentage of intersections,
as a function of the (postings) sampling rate,
s/f, where fx = fy = f and sx = sy = s. The solid lines
(sketches), E(as/a) ≈ s/f, are above the dashed curve
(random sampling), E(as/a) = s^2/f^2. The difference is
particularly important at low sampling rates.
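The gap between the two sampling schemes can be checked with a small Monte Carlo sketch in Python. The settings below (D = 1000, f = 200, a = 100, s = 40, 500 trials, seed 0) are arbitrary assumptions chosen to run quickly; they are not the settings used for Figure 2, and the function name `coverage` is ours.

```python
import random


def coverage(D, f, a, s, trials, rng):
    """Estimate E(a_s / a) for sketches and for independent random sampling.

    Both postings lists have f doc IDs and share a of them.
    """
    sketch_hits = 0
    random_hits = 0
    for _ in range(trials):
        # Draw distinct random doc IDs for the union of the two postings lists.
        ids = rng.sample(range(1, D + 1), 2 * f - a)
        common = ids[:a]
        X = common + ids[a:f]   # f IDs in total
        Y = common + ids[f:]    # f IDs in total
        # Sketches: the s smallest IDs of each postings list.
        sk_x = set(sorted(X)[:s])
        sk_y = set(sorted(Y)[:s])
        sketch_hits += len(sk_x & sk_y)
        # Conventional sampling: s IDs drawn independently from each list.
        random_hits += len(set(rng.sample(X, s)) & set(rng.sample(Y, s)))
    return sketch_hits / (trials * a), random_hits / (trials * a)


sk_cov, rnd_cov = coverage(D=1000, f=200, a=100, s=40, trials=500,
                           rng=random.Random(0))
print(f"sketch: {sk_cov:.3f}  random: {rnd_cov:.3f}  "
      f"s/f = {40 / 200:.3f}  (s/f)^2 = {(40 / 200) ** 2:.3f}")
```

Under these assumptions the sketch coverage should land near s/f and the random-sampling coverage near (s/f)^2, which is the qualitative picture described above.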
3 Generalizing Sketches: R→ Tables
Sketches were first proposed for estimating resem-
blance (R). This section generalizes the method to
construct sample contingency tables, from which we
can estimate associations: R, LLR, cosine, etc.
4There are at least three problems with estimating a from
Rs. First, the estimate is biased. Secondly, this estimate uses
just s of the 2 × s samples; larger samples → smaller errors.
Thirdly, we would rather not impose the restriction: sx = sy.
[Figure 2 plot: percentage of intersections versus sampling rates; curves: Sketch (solid) and Random sampling (dashed).]
Figure 2: Sketches (solid curves) dominate random
sampling (dashed curve). a = 0.22, 0.38, 0.65, 0.80,
0.85f, f = 0.2D, D = 10^5. There is only one dashed
curve across all values of a. There are different but
indistinguishable solid curves depending on a.
Recall that the doc IDs span the integers from 1
to D with no gaps. When we compare two sketches,
skX and skY , we have effectively looked at Ds =
min{skX(sx),skY(sy)} documents, where skX(j) is
the jth smallest element in skX. The following
construction generates the sample contingency ta-
ble, as,bs,cs,ds (as in Figure 1(b)). The example
shown in Figure 3 may help explain the procedure.
$$D_s = \min\{sk_X(s_x),\, sk_Y(s_y)\}, \qquad a_s = |sk_X \cap sk_Y|,$$
$$n_x = s_x - |\{j : sk_X(j) > D_s\}|,$$
$$n_y = s_y - |\{j : sk_Y(j) > D_s\}|,$$
$$b_s = n_x - a_s, \qquad c_s = n_y - a_s, \qquad d_s = D_s - a_s - b_s - c_s.$$
Given the sample contingency table, we are now
ready to estimate the contingency table. It is suffi-
cient to estimate a, since the rest of the table can be
determined from fx, fy and D. For practical appli-
cations, we recommend the convenient closed-form
approximation (8) in Section 5.1.
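The construction of the sample contingency table can be written as a short Python sketch. The function name `sample_table` is an illustrative assumption; the input sketches are those of the Figure 3 example (sx = sy = 7).

```python
def sample_table(sk_x, sk_y):
    """Build (Ds, as, bs, cs, ds) from two sketches of doc IDs in 1..D."""
    sk_x, sk_y = sorted(sk_x), sorted(sk_y)
    # Ds: the largest doc ID up to which both sketches are complete.
    Ds = min(sk_x[-1], sk_y[-1])
    # Any doc in both sketches has an ID <= Ds, so this is the sample a.
    a_s = len(set(sk_x) & set(sk_y))
    # nx, ny: sketch elements that fall inside the sample space 1..Ds.
    n_x = sum(1 for doc in sk_x if doc <= Ds)
    n_y = sum(1 for doc in sk_y if doc <= Ds)
    b_s = n_x - a_s
    c_s = n_y - a_s
    d_s = Ds - a_s - b_s - c_s
    return Ds, a_s, b_s, c_s, d_s


sk_x = [3, 4, 7, 9, 10, 15, 18]   # first 7 doc IDs of X in Figure 3
sk_y = [2, 4, 5, 8, 15, 19, 21]   # first 7 doc IDs of Y in Figure 3
print(sample_table(sk_x, sk_y))   # (18, 2, 5, 3, 8)
```

The printed values agree with the sample contingency table reported in the caption of Figure 3.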
4 Margin-Free (MF) Baseline
Before considering the proposed MLE method, we
introduce a baseline estimator that will not work as
well because it does not take advantage of the mar-
gins. The baseline is the multivariate hypergeomet-
ric model, usually simplified as a multinomial by as-
suming “sample-with-replacement.”
The sample expectations are (Siegrist, 1997),
$$E(a_s) = \frac{D_s}{D}\, a, \quad E(b_s) = \frac{D_s}{D}\, b, \quad E(c_s) = \frac{D_s}{D}\, c, \quad E(d_s) = \frac{D_s}{D}\, d. \qquad (1)$$
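(a)
Y:  2   4   5   8   15   19   21  |  24   27   28   31
X:  3   4   7   9   10   15   18  |  19   24   25   28
(the sketches skY and skX are the first sy = sx = 7 doc IDs, to the left of the bar)
fx = 11, fy = 11, a = 5, D = 36
sx = 7, sy = 7, nx = 7, ny = 5, Ds = 18
as = 2, bs = 5, cs = 3, ds = 8
(b)
[Enumeration of the doc IDs 1, 2, ..., D = 36, marking which of the first Ds = 18 documents contain x and which contain y.]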
Figure 3: (a): The two sketches, skX and skY
(larger shaded box), are used to construct a sam-
ple contingency table: as,bs,cs,ds. skX consists
of the first sx = 7 doc IDs in X, the postings for
word x. Similarly, skY consists of the first sy = 7
doc IDs in Y, the postings for word y. X and Y each
contain 11 doc IDs, and a = 5 doc IDs are in
the intersection: {4, 15, 19, 24, 28}. (a) shows that
Ds = min(18,21) = 18. Doc IDs 19 and 21 are
excluded because we cannot determine if they are in
the intersection or not, without looking outside the
box. As it turns out, 19 is in the intersection and
21 is not. (b) enumerates the Ds = 18 documents,
showing which documents contain x (small circles)
and which contain y (small squares). Both proce-
dures, (a) and (b), produce the same sample contin-
gency table: as = 2, bs = 5, cs = 3 and ds = 8.
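The margin-free estimator and its variance are
$$\hat{a}_{MF} = \frac{D}{D_s}\, a_s, \qquad \mathrm{Var}(\hat{a}_{MF}) = \frac{D}{D_s}\, \frac{1}{\frac{1}{a} + \frac{1}{D-a}}\, \frac{D - D_s}{D - 1}. \qquad (2)$$
For the multinomial simplification, we have
$$\hat{a}_{MF,r} = \frac{D}{D_s}\, a_s, \qquad \mathrm{Var}(\hat{a}_{MF,r}) = \frac{D}{D_s}\, \frac{1}{\frac{1}{a} + \frac{1}{D-a}}, \qquad (3)$$
where “r” indicates “sample-with-replacement.”
The term (D − Ds)/(D − 1) ≈ (D − Ds)/D is often called the
“finite-sample correction factor” (Siegrist, 1997).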
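Equations (2) and (3) translate directly into code. The following Python sketch uses hypothetical function names (`mf_estimate`, `mf_variance`) and, as an example, the Figure 3 values as = 2, a = 5, D = 36 and Ds = 18.

```python
def mf_estimate(a_s, D, Ds):
    """Margin-free estimate of a: scale the sample count by D / Ds."""
    return D / Ds * a_s


def mf_variance(a, D, Ds, with_replacement=False):
    """Variance of the margin-free estimator, (2), or (3) if with_replacement."""
    var = (D / Ds) / (1 / a + 1 / (D - a))
    if not with_replacement:
        # Finite-sample correction factor for sampling without replacement.
        var *= (D - Ds) / (D - 1)
    return var


print(mf_estimate(a_s=2, D=36, Ds=18))                      # 4.0
print(mf_variance(a=5, D=36, Ds=18))                        # equation (2)
print(mf_variance(a=5, D=36, Ds=18, with_replacement=True)) # equation (3)
```

In this toy example the true variance depends on the unknown a; in practice one would plug in an estimate of a.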
5 The Proposed MLE Method
The task is to estimate the contingency table from
the samples, the margins and D. We would like to
use a maximum likelihood estimator for the most
probable a, which maximizes the (full) likelihood
(probability mass function, PMF) P(as,bs,cs,ds;a).
Unfortunately, we do not know the exact expres-
sion for P(as,bs,cs,ds;a), but we do know the con-
ditional probability P(as,bs,cs,ds|Ds;a). Since the
doc IDs are uniformly random, sampling the first
Ds contiguous documents is statistically equivalent
to randomly sampling Ds documents from the cor-
pus. Based on this key observation and Figure 3,
conditional on Ds, P(as,bs,cs,ds|Ds;a) is the PMF
of a two-way sample contingency table.
We factor the full likelihood into:
P(as,bs,cs,ds;a) = P(as,bs,cs,ds|Ds;a)×P(Ds;a).
P(Ds;a) is difficult. However, since we do not ex-
pect a strong dependency of Ds on a, we maxi-
mize the partial likelihood instead, and assume that
is good enough. An example of partial likelihood is
the Cox proportional hazards model in survival analysis
(Venables and Ripley, 2002, Section 13.3).
Our partial likelihood is
$$P(a_s,b_s,c_s,d_s \mid D_s; a) = \frac{\binom{a}{a_s}\binom{f_x-a}{b_s}\binom{f_y-a}{c_s}\binom{D-f_x-f_y+a}{d_s}}{\binom{D}{D_s}}$$
$$\propto \prod_{i=0}^{a_s-1}(a-i) \times \prod_{i=0}^{b_s-1}(f_x-a-i) \times \prod_{i=0}^{c_s-1}(f_y-a-i) \times \prod_{i=0}^{d_s-1}(D-f_x-f_y+a-i), \qquad (4)$$
where $\binom{n}{m} = \frac{n!}{m!(n-m)!}$ and “∝” means “proportional to.”
We now derive an MLE for (4), a result that was
not previously known, to the best of our knowledge.
Let âMLE maximize log P(as,bs,cs,ds|Ds;a), which equals, up to an additive constant,
$$\sum_{i=0}^{a_s-1}\log(a-i) + \sum_{i=0}^{b_s-1}\log(f_x-a-i) + \sum_{i=0}^{c_s-1}\log(f_y-a-i) + \sum_{i=0}^{d_s-1}\log(D-f_x-f_y+a-i),$$
whose first derivative, ∂ log P(as,bs,cs,ds|Ds;a)/∂a, is
$$\sum_{i=0}^{a_s-1}\frac{1}{a-i} - \sum_{i=0}^{b_s-1}\frac{1}{f_x-a-i} - \sum_{i=0}^{c_s-1}\frac{1}{f_y-a-i} + \sum_{i=0}^{d_s-1}\frac{1}{D-f_x-f_y+a-i}. \qquad (5)$$
Since the second derivative, ∂² log P(as,bs,cs,ds|Ds;a)/∂a²,
is negative, the log likelihood function is concave,
hence has a unique maximum. One could numerically
solve (5) for ∂ log P(as,bs,cs,ds|Ds;a)/∂a = 0. However,
we derive the exact solution using the following
updating formula from (4):
$$P(a_s,b_s,c_s,d_s \mid D_s; a) = P(a_s,b_s,c_s,d_s \mid D_s; a-1) \times g(a),$$
$$g(a) = \frac{f_x-a+1-b_s}{f_x-a+1} \cdot \frac{f_y-a+1-c_s}{f_y-a+1} \cdot \frac{D-f_x-f_y+a}{D-f_x-f_y+a-d_s} \cdot \frac{a}{a-a_s}. \qquad (6)$$
Since our MLE is unique, it suffices to find a from
g(a) = 1, which is a cubic equation in a.
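For integer a, the updating formula suggests a simple search: the likelihood keeps increasing as long as g(a) ≥ 1. The Python sketch below takes that route instead of solving the cubic in closed form, which is a substitution on our part; the function names are hypothetical, and a linear scan is used only for clarity (a bisection over a would be preferable for large margins). The brute-force `likelihood` helper is included solely as a cross-check against (4).

```python
from math import comb


def g_terms(a, a_s, b_s, c_s, d_s, fx, fy, D):
    """Integer numerator and denominator of g(a) in (6)."""
    num = (fx - a + 1 - b_s) * (fy - a + 1 - c_s) * (D - fx - fy + a) * a
    den = (fx - a + 1) * (fy - a + 1) * (D - fx - fy + a - d_s) * (a - a_s)
    return num, den


def exact_mle(a_s, b_s, c_s, d_s, fx, fy, D):
    """Integer a maximizing the partial likelihood (4), found via g(a)."""
    lo = max(a_s, d_s - (D - fx - fy))   # smallest a with nonzero likelihood
    hi = min(fx - b_s, fy - c_s)         # largest a with nonzero likelihood
    a = lo
    while a < hi:
        num, den = g_terms(a + 1, a_s, b_s, c_s, d_s, fx, fy, D)
        if num < den:                    # g(a + 1) < 1: likelihood decreases
            break
        a += 1
    return a


def likelihood(a, a_s, b_s, c_s, d_s, fx, fy, D):
    """Numerator of (4); the denominator does not depend on a."""
    return (comb(a, a_s) * comb(fx - a, b_s) * comb(fy - a, c_s)
            * comb(D - fx - fy + a, d_s))


# Figure 3 example: as=2, bs=5, cs=3, ds=8, fx=fy=11, D=36.
args = (2, 5, 3, 8, 11, 11, 36)
print(exact_mle(*args))
print(max(range(2, 7), key=lambda a: likelihood(a, *args)))
```

Both lines print the same value for this example. With such a tiny sample the estimate is naturally far from precise; the example only illustrates the mechanics.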
5.1 A Convenient Practical Approximation
Rather than solving the cubic equation for the ex-
act MLE, the following approximation may be more
convenient. Assume we sample nx = as + bs from
X and obtain as co-occurrences without knowledge
of the samples from Y. Further assuming “sample-with-replacement,”
as is then binomially distributed,
as ∼ Binom(nx, a/fx). Similarly, assume as ∼
Binom(ny, a/fy). Under these assumptions, the PMF
of as is a product of two binomial PMFs:
$$\binom{n_x}{a_s}\left(\frac{a}{f_x}\right)^{a_s}\left(\frac{f_x-a}{f_x}\right)^{b_s} \binom{n_y}{a_s}\left(\frac{a}{f_y}\right)^{a_s}\left(\frac{f_y-a}{f_y}\right)^{c_s} \propto a^{2a_s}(f_x-a)^{b_s}(f_y-a)^{c_s}. \qquad (7)$$
Setting the first derivative of the logarithm of (7) to
zero, we obtain 2as/a − bs/(fx − a) − cs/(fy − a) = 0, which is
quadratic in a and has a solution:
$$\hat{a}_{MLE,a} = \frac{f_x(2a_s+c_s) + f_y(2a_s+b_s) - \sqrt{\left(f_x(2a_s+c_s) - f_y(2a_s+b_s)\right)^2 + 4 f_x f_y b_s c_s}}{2(2a_s+b_s+c_s)}. \qquad (8)$$
Section 6 shows that ˆaMLE,a is very close to ˆaMLE.
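The closed form (8) is a one-line computation. A minimal Python sketch follows; the function name `approx_mle` is our own, and the example again uses the Figure 3 values (as = 2, bs = 5, cs = 3, fx = fy = 11).

```python
from math import sqrt


def approx_mle(a_s, b_s, c_s, fx, fy):
    """Closed-form approximation (8) to the MLE of a.

    Assumes 2*a_s + b_s + c_s > 0, i.e. the sample is not empty.
    """
    p = fx * (2 * a_s + c_s)
    q = fy * (2 * a_s + b_s)
    disc = (p - q) ** 2 + 4 * fx * fy * b_s * c_s
    return (p + q - sqrt(disc)) / (2 * (2 * a_s + b_s + c_s))


a_hat = approx_mle(a_s=2, b_s=5, c_s=3, fx=11, fy=11)
print(a_hat)  # 11/3, about 3.667

# Sanity check: the score equation 2as/a - bs/(fx-a) - cs/(fy-a) vanishes.
score = 2 * 2 / a_hat - 5 / (11 - a_hat) - 3 / (11 - a_hat)
print(abs(score) < 1e-9)
```

The smaller root of the quadratic is taken because it is the one that satisfies a ≤ min(fx, fy).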
5.2 Theoretical Evaluation: Bias and Variance
How good are the estimates? A popular metric
is mean square error (MSE): MSE(â) = E(â − a)^2 =
Var(â) + Bias^2(â). If â is unbiased, MSE(â) = Var(â) =
SE^2(â), where SE is the standard error. Here all
expectations are conditional on Ds.
Large sample theory (Lehmann and Casella,
1998, Chapter 6) says that, under “sample-with-replacement,”
âMLE is asymptotically unbiased and
converges to Normal with mean a and variance 1/I(a),
where I(a), the Fisher Information, is
$$I(a) = -E\left(\frac{\partial^2}{\partial a^2}\log P(a_s,b_s,c_s,d_s \mid D_s; a, r)\right). \qquad (9)$$
Under “sample-with-replacement,” we have
$$P(a_s,b_s,c_s,d_s \mid D_s; a, r) \propto \left(\frac{a}{D}\right)^{a_s} \times \left(\frac{f_x-a}{D}\right)^{b_s} \times \left(\frac{f_y-a}{D}\right)^{c_s} \times \left(\frac{D-f_x-f_y+a}{D}\right)^{d_s}. \qquad (10)$$
Therefore, the Fisher Information, I(a), is
$$\frac{E(a_s)}{a^2} + \frac{E(b_s)}{(f_x-a)^2} + \frac{E(c_s)}{(f_y-a)^2} + \frac{E(d_s)}{(D-f_x-f_y+a)^2}. \qquad (11)$$
We plug (1) from the margin-free model into (11)
as an approximation, to obtain
$$\mathrm{Var}(\hat{a}_{MLE}) \approx \frac{\frac{D}{D_s} - 1}{\frac{1}{a} + \frac{1}{f_x-a} + \frac{1}{f_y-a} + \frac{1}{D-f_x-f_y+a}}, \qquad (12)$$
which is 1/I(a) multiplied by (D − Ds)/D, the “finite-sample
correction factor,” to account for “sample-without-replacement.”
We can see that Var (ˆaMLE) is less than
Var(ˆaMF) in (2). In addition, ˆaMLE is asymptoti-
cally unbiased while ˆaMF is no longer unbiased un-
der margin constraints. Therefore, we expect ˆaMLE
has smaller MSE than ˆaMF. In other words, the pro-
posed MLE method is more accurate than the MF
baseline, in terms of variance, bias and mean square
error. If we know the margins, we ought to use them.
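The comparison between (12) and (2) can be checked numerically. The Python sketch below uses the THIS/HAVE pair from Table 2 with D = 2^16; the sample size Ds = 655 (about 1% of D) is an arbitrary assumption for illustration, and the function names are ours.

```python
def mle_variance(a, fx, fy, D, Ds):
    """Approximate conditional variance of the MLE, equation (12)."""
    recip_sum = (1 / a + 1 / (fx - a) + 1 / (fy - a)
                 + 1 / (D - fx - fy + a))
    return (D / Ds - 1) / recip_sum


def mf_variance(a, D, Ds):
    """Variance of the margin-free estimator, equation (2)."""
    return (D / Ds) * (D - Ds) / (D - 1) / (1 / a + 1 / (D - a))


# THIS / HAVE from Table 2; Ds is a hypothetical sample size.
a, fx, fy, D, Ds = 13517, 27633, 17396, 2 ** 16, 655
v_mle = mle_variance(a, fx, fy, D, Ds)
v_mf = mf_variance(a, D, Ds)
print(f"SE(MLE)/a = {v_mle ** 0.5 / a:.4f}")
print(f"SE(MF)/a  = {v_mf ** 0.5 / a:.4f}")
print(v_mle < v_mf)
```

The MLE denominator contains the extra terms 1/(fx − a) and 1/(fy − a), each at least as large as the 1/(D − a) term of the baseline, which is why its variance is smaller.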
5.3 Unconditional Bias and Variance
ˆaMLE is also unconditionally unbiased:
E (ˆaMLE −a) = E (E (ˆaMLE −a|Ds)) ≈ E(0) = 0. (13)
The unconditional variance is useful because often
we would like to estimate the errors before knowing
Ds (e.g., for choosing sample sizes).
To compute the unconditional variance of âMLE,
we should replace D/Ds with E(D/Ds) in (12). We
resort to an approximation for E(D/Ds). Note that
skX(sx) is an order statistic of a discrete random
variable (Siegrist, 1997) with expectation
$$E\left(sk_X(s_x)\right) = \frac{s_x(D+1)}{f_x+1} \approx \frac{s_x}{f_x}\, D. \qquad (14)$$
By Jensen’s inequality, we know that
$$E\left(\frac{D_s}{D}\right) \le \min\left(\frac{E\left(sk_X(s_x)\right)}{D},\, \frac{E\left(sk_Y(s_y)\right)}{D}\right) = \min\left(\frac{s_x}{f_x},\, \frac{s_y}{f_y}\right), \qquad (15)$$
$$E\left(\frac{D}{D_s}\right) \ge \frac{1}{E\left(\frac{D_s}{D}\right)} \ge \max\left(\frac{f_x}{s_x},\, \frac{f_y}{s_y}\right). \qquad (16)$$
Table 2: Gold standard joint frequencies, a. Docu-
ment frequencies are shown in parentheses. These
words are frequent, suitable for evaluating our algo-
rithms at very low sampling rates.
                   THIS     HAVE     HELP     PROGRAM
THIS (27633)       —        13517    7221     3682
HAVE (17396)       13517    —        5781     3029
HELP (10791)       7221     5781     —        1949
PROGRAM (5327)     3682     3029     1949     —
Replacing the inequalities with equalities underes-
timates the variance, but only slightly.
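Replacing E(D/Ds) by the bound max(fx/sx, fy/sy) from (16) gives a variance that can be computed before any sample is drawn. A minimal Python sketch, with a hypothetical function name and sketch sizes chosen as roughly 1% and 2% of the THIS and HAVE postings from Table 2:

```python
def unconditional_mle_variance(a, fx, fy, D, sx, sy):
    """Equation (12) with D/Ds replaced by max(fx/sx, fy/sy).

    This treats the inequality (16) as an equality, so it slightly
    underestimates the variance.
    """
    inv_rate = max(fx / sx, fy / sy)
    recip_sum = (1 / a + 1 / (fx - a) + 1 / (fy - a)
                 + 1 / (D - fx - fy + a))
    return (inv_rate - 1) / recip_sum


a, fx, fy, D = 13517, 27633, 17396, 2 ** 16
for sx, sy in [(276, 174), (552, 348)]:   # assumed sketch sizes
    var = unconditional_mle_variance(a, fx, fy, D, sx, sy)
    print(sx, sy, f"SE/a = {var ** 0.5 / a:.4f}")
```

Doubling both sketch sizes roughly halves the variance, which is the kind of trade-off one would consult when choosing sample sizes.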
5.4 Smoothing
Although not a major emphasis here, our evalua-
tions will show that ˆaMLE+S, a smoothed version
of the proposed MLE method, is effective, espe-
cially at low sampling rates. ˆaMLE+S uses “add-
one” smoothing. Given that such a simple method
is as effective as it is, it would be worth considering
more sophisticated methods such as Good-Turing.
5.5 How Many Samples Are Sufficient?
The answer depends on the trade-off between computation
and estimation errors. One simple rule is
to sample “2%.” (12) implies that the standard error
is proportional to sqrt(D/Ds − 1). Figure 4(a) plots
sqrt(D/Ds − 1) as a function of sampling rate, Ds/D,
indicating an “elbow” at about 2%. However, 2% is too
large for high frequency words.
A more reasonable metric is the “coefficient of
variation,” cv = SE(â)/a. At Web scale (10 billion
pages), we expect that a very small sampling rate
such as 10^-4 or 10^-5 will suffice to achieve a
reasonable cv (e.g., 0.5). See Figure 4(b).
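Solving cv = SE(â)/a for the sampling rate using (12) gives Ds/D = 1 / (1 + cv^2 a^2 S), where S is the sum of reciprocals in the denominator of (12). The Python sketch below evaluates this for the settings listed in the caption of Figure 4(b) (fy = fx/10, a = fy/20, D = 10^10); the function name `critical_sampling_rate` is our own.

```python
def critical_sampling_rate(a, fx, fy, D, cv=0.5):
    """Smallest Ds/D for which (12) gives SE/a equal to the target cv."""
    recip_sum = (1 / a + 1 / (fx - a) + 1 / (fy - a)
                 + 1 / (D - fx - fy + a))
    # From (12): (D/Ds - 1) / recip_sum = (cv * a)^2.
    return 1.0 / (1.0 + (cv * a) ** 2 * recip_sum)


D = 10 ** 10
for divisor in (100, 1000, 10_000):
    fx = D // divisor
    fy = fx // 10
    a = fy // 20
    rate = critical_sampling_rate(a, fx, fy, D)
    print(f"fx = D/{divisor}: critical sampling rate = {rate:.2e}")
```

Under these assumptions the critical rates fall between roughly 10^-5 and 10^-3, with more frequent words tolerating lower rates.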
6 Evaluation
Two sets of experiments were run on a collection of
D = 2^16 web pages, provided by MSN. The first
experiment considered 4 English words shown in
Table 2, and the second experiment considered 968
English words with mean df = 2135 and median df =
1135. They form 468,028 word pairs, with mean
co-occurrences = 188 and median = 74.
6.1 Small Dataset Monte Carlo Experiment
Figure 5 evaluates the various estimation methods by
MSE over a wide range of sampling rates. Doc IDs
were randomly permuted 10^5 times. For each permutation
we constructed sketches from the inverted
index at a series of sampling rates. The figure shows
that the proposed method, âMLE, is considerably
better (by 20% to 40%) than the margin-free baseline,
âMF. Smoothing is effective at low sampling
rates. The recommended approximation, âMLE,a, is
remarkably close to the exact solution.
[Figure 4 plots. (a): relative SE versus sampling rate. (b): critical sampling rate versus corpus size D, with curves for fx = 0.01×D, 0.001×D and 0.0001×D, where fy = 0.1 × fx and a = 0.05 × fy.]
Figure 4: How large should the sampling rate be?
(a): We can sample up to the “elbow point” (2%),
but after that there are diminishing returns. (b): An
analysis based on cv = SE/a = 0.5 suggests that we can
get away with much lower sampling rates. The three
curves plot the critical value for the sampling rate,
Ds/D, as a function of corpus size, D. At Web scale,
D ≈ 10^10, sampling rates above 10^-3 to 10^-5 satisfy
cv < 0.5, at least for these settings of fx, fy
and a. The settings were chosen to simulate “ordinary”
words. The three curves correspond to three
choices of fx: D/100, D/1000, and D/10,000.
fy = fx/10, a = fy/20. SE is based on (12).
Figure 6 shows agreement between the theoretical
and empirical unconditional variances. Smoothing
reduces variances at low sampling rates. We
used the empirical E(D/Ds) to compute the theoretical
variances. The approximation, max(fx/sx, fy/sy), is
> 0.95 E(D/Ds) at sampling rates > 0.01.
Figure 7 verifies that the proposed MLE is unbi-
ased, unlike the margin-free baselines.
6.2 Large Dataset Experiment
The large experiment considers 968 English words
(468,028 pairs) over a range of sampling rates. A
floor of 20 was imposed on sample sizes.
As reported in Figure 8, the large experiment confirms
once again that the proposed method, âMLE, is
considerably better than the margin-free baseline (by
15% to 30%). The recommended approximation,
âMLE,a, is close to âMLE. Smoothing, âMLE+S,
helps at low sampling rates.
[Figure 5 plots: normalized MSE^0.5 versus sampling rate for six word pairs (THIS − HAVE, THIS − HELP, THIS − PROGRAM, HAVE − HELP, HAVE − PROGRAM, HELP − PROGRAM); curves: IND, MF, MLE, MLE,a, MLE+S.]
Figure 5: The proposed method, âMLE, outperforms
the margin-free baseline, âMF, in terms of MSE^0.5/a.
The recommended approximation, âMLE,a, is close
to âMLE. Smoothing, âMLE+S, is effective at low
sampling rates. All methods are better than assuming
independence (IND).
6.3 Rank Retrieval: Top k Associated Pairs
We computed a gold standard cosine similarity ranking
of the 468,028 pairs using a 100% sample: cos =
a/sqrt(fx fy). We then compared the gold standard to
rankings based on smaller samples. Figure 9(a) compares
the two lists in terms of agreement in the top k.
For 3 ≤ k ≤ 200, with a sampling rate of 0.005, the
agreement is consistently 70% or higher. Increasing
the sampling rate increases agreement.
The same comparisons are evaluated in terms of
precision and recall in Figure 9(b), by fixing the top
1% of the gold standard list but varying the top
percentages of the sample list. Again, increasing the
sampling rate increases agreement.
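The top-k agreement measure can be sketched in Python on the six pairs of Table 2. The gold counts are taken from Table 2; the “estimated” counts in `EST_A` are made-up values for illustration only and are not results from the experiments. All function names are our own.

```python
from math import sqrt

DF = {"THIS": 27633, "HAVE": 17396, "HELP": 10791, "PROGRAM": 5327}
GOLD_A = {  # joint frequencies from Table 2
    ("THIS", "HAVE"): 13517, ("THIS", "HELP"): 7221,
    ("THIS", "PROGRAM"): 3682, ("HAVE", "HELP"): 5781,
    ("HAVE", "PROGRAM"): 3029, ("HELP", "PROGRAM"): 1949,
}
EST_A = {   # hypothetical estimates, NOT from the paper
    ("THIS", "HAVE"): 13000, ("THIS", "HELP"): 7600,
    ("THIS", "PROGRAM"): 3500, ("HAVE", "HELP"): 5500,
    ("HAVE", "PROGRAM"): 3100, ("HELP", "PROGRAM"): 2000,
}


def cosine_scores(joint):
    """cos = a / sqrt(fx * fy) for every word pair."""
    return {(x, y): a / sqrt(DF[x] * DF[y]) for (x, y), a in joint.items()}


def top_k(scores, k):
    """The k pairs with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def agreement(gold, est, k):
    """Fraction of the gold top-k list that also appears in the estimated one."""
    return len(set(top_k(gold, k)) & set(top_k(est, k))) / k


gold, est = cosine_scores(GOLD_A), cosine_scores(EST_A)
print(top_k(gold, 3))
print(agreement(gold, est, 2), agreement(gold, est, 3))
```

With these made-up estimates the top-3 lists contain the same pairs, while the top-2 lists differ in one pair, which shows how small estimation errors can reorder closely scored pairs.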
[Figure 6 plots: normalized standard error versus sampling rate for HAVE − PROGRAM and HELP − PROGRAM; curves: MLE, MLE+S, Theore.]
Figure 6: The theoretical and empirical variances
show remarkable agreement, in terms of SE(â)/a.
Smoothing reduces variances at low sampling rates.
[Figure 7 plots: normalized absolute bias versus sampling rate for HAVE − PROGRAM and HELP − PROGRAM; curves: MF, MLE, MLE+S.]
Figure 7: Biases in terms of |E(â) − a|/a. âMLE is
practically unbiased, unlike âMF. Smoothing increases
bias slightly.
7 Conclusion
We proposed a novel sketch-based procedure for
constructing sample contingency tables. The
method bridges two popular choices: (A) sam-
pling over documents and (B) sampling over post-
ings. Well-understood maximum likelihood estima-
tion (MLE) techniques can be applied to sketches
(or to traditional samples) to estimate word associa-
tions. We derived an exact cubic solution, ˆaMLE, as
well as a quadratic approximation, ˆaMLE,a. The ap-
proximation is recommended because it is close to
the exact solution, and easy to compute.
The proposed MLE methods were compared em-
pirically and theoretically to a margin-free (MF)
baseline, finding large improvements. When we
know the margins, we ought to use them.
Sample-based methods (MLE & MF) are often
better than sample-free methods. Associations are
often estimated without samples. It is popular to
assume independence (Garcia-Molina et al., 2002,
Chapter 16.4), i.e., â ≈ fx fy / D. Independence led to
large errors in our experiments.
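The size of the independence error is easy to see on a single pair. The Python sketch below applies â ≈ fx fy / D to THIS and HAVE from Table 2 with D = 2^16; the function name is an illustrative assumption.

```python
def independence_estimate(fx, fy, D):
    """Sample-free estimate of a under the independence assumption."""
    return fx * fy / D


fx, fy, D = 27633, 17396, 2 ** 16   # THIS, HAVE, collection size
gold_a = 13517                       # gold standard from Table 2
a_ind = independence_estimate(fx, fy, D)
print(f"independence: {a_ind:.0f}, gold: {gold_a}, "
      f"relative error: {abs(a_ind - gold_a) / gold_a:.2f}")
```

For this pair the independence estimate is a little more than half of the gold standard count, because the two words co-occur far more often than chance would predict.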
Not surprisingly, there is a trade-off between
computational work (space and time) and statistical
accuracy (variance or errors); reducing the sampling
rate saves work, but costs accuracy. We derived
formulas for variance, showing precisely how accuracy
depends on sampling rate. Sampling methods
become more and more important with larger and
larger collections. At Web scale, sampling rates as
low as 10^-4 may suffice for “ordinary” words.
[Figure 8 plot: relative average absolute error versus sampling rate; curves: IND, MF, MLE, MLE,a, MLE+S.]
Figure 8: We report the (normalized) mean absolute
errors (divided by the mean co-occurrences, 188).
All curves are averaged over three permutations.
The proposed MLE and the recommended approximation
are very close and both are significantly better
than the margin-free (MF) baseline. Smoothing,
âMLE+S, helps at low sampling rates. All estimators
do better than assuming independence.
We have recently generalized the sampling algo-
rithm and estimation method to multi-way associa-
tions; see (Li and Church, 2005).
[Figure 9 plots. (a): percentage of agreement (%) versus top k, at sampling rates 0.005 and 0.5. (b): precision versus recall for the top 1%, at sampling rates 0.005, 0.01, 0.02, 0.03 and 0.5.]
Figure 9: (a): Percentage of agreement between the gold
standard and reconstructed (from samples) top 3 to
200 lists. (b): Precision-recall curves in retrieving the
top 1% gold standard pairs, at different sampling
rates. For example, 60% recall and 70% precision
is achieved at sampling rate = 0.02.
References
S. Brin and L. Page. 1998. The anatomy of a large-scale
hypertextual web search engine. In Proceedings
of the Seventh International World Wide Web Conference,
pages 107–117, Brisbane, Australia.
A. Broder. 1997. On the resemblance and containment
of documents. In Proceedings of the Compression and
Complexity of Sequences, pages 21–29, Positano, Italy.
M. S. Charikar. 2002. Similarity estimation techniques
from rounding algorithms. In Proceedings of the thirty-fourth
annual ACM symposium on Theory of computing,
pages 380–388, Montreal, Quebec, Canada.
K. Church and P. Hanks. 1991. Word association norms,
mutual information and lexicography. Computational
Linguistics, 16(1):22–29.
S. Deerwester, S. T. Dumais, G. W. Furnas, and T. K.
Landauer. 1999. Indexing by latent semantic analysis.
Journal of the American Society for Information
Science, 41(6):391–407.
T. Dunning. 1993. Accurate methods for the statistics of
surprise and coincidence. Computational Linguistics,
19(1):61–74.
H. Garcia-Molina, J. D. Ullman, and J. D. Widom. 2002.
Database Systems: the Complete Book. Prentice Hall,
New York, NY.
A. S. Hornby, editor. 1989. Oxford Advanced Learner’s
Dictionary. Oxford University Press, Oxford, UK.
E. L. Lehmann and G. Casella. 1998. Theory of Point
Estimation. Springer, New York, NY, second edition.
P. Li and K. W. Church. 2005. Using sketches to esti-
mate two-way and multi-way associations. Technical
report, Microsoft Research, Redmond, WA.
C. D. Manning and H. Schutze. 1999. Foundations of
Statistical Natural Language Processing. The MIT
Press, Cambridge, MA.
R. C. Moore. 2004. On log-likelihood-ratios and the
significance of rare events. In Proceedings of EMNLP
2004, pages 333–340, Barcelona, Spain.
D. Ravichandran, P. Pantel, and E. Hovy. 2005. Ran-
domized algorithms and NLP: Using locality sensitive
hash function for high speed noun clustering. In Pro-
ceedings of ACL, pages 622–629, Ann Arbor.
K. Siegrist. 1997. Finite Sampling Models,
http://www.ds.unifi.it/VL/VL EN/urn/index.html. Vir-
tual Laboratories in Probability and Statistics.
W. N. Venables and B. D. Ripley. 2002. Modern Ap-
plied Statistics with S. Springer-Verlag, New York,
NY, fourth edition.
