Empirical Estimates of Adaptation: 
The chance of Two Noriegas is closer to p/2 than p 2 
Kenneth W. Church 
AT&T Labs-Research, 180 Park Ave., Florham Park, NJ., USA 
kwc@research.att.com 
Abstract 
Repetition is very common. Adaptive  models, which allow probabilities to change or adapt 
after seeing just a few words of a text, were introduced in speech recognition to account for text cohesion. 
Suppose a document mentions Noriega once. What is the chance that he will be mentioned again? if the 
first instance has probability p, then under standard (bag-of words) independence assumptions, two in- 
stances ought to have probability p2, but we find the probability is actually closer to p/2. The first men- 
tion of a word obviously depends on frequency, but surprisingly, the second does not. Adaptation de- 
pends more on lexical content than fl'equency; there is more adaptation for content words (proper nouns, 
technical terminology and good keywords for information retrieval), and less adaptation for function 
words, cliches and ordinary first names. 
1. Introduction 
Adaptive  models were introduced in the 
Speech Recognition literature to model repetition. 
Jelinek (1997, p. 254) describes cache-based 
models which combine two estimates of word 
(ngram) probabilities, Prl., a local estimate based 
on a relatively small cache of recently seen 
words, and PrG, a global estimate based on a 
large training corpus. 
1. Additive: 
Pr A (w)= XPrL (w) +(1 - X) PrG (w) 
2. Case-based: 
JX I Pr/.(w) if w~ cache 
" Pr G (w) otherwise PF c (W) = 1~ 2 
Intuitively, if a word has been mentioned re- 
cently, then (a) the probability of that word (and 
related words) should go way up, and (b) many 
other words should go down a little. We will re- 
fer to (a) as positive adaptation and (b) as 
negative adaptation. Our empirical experiments 
confirm the intuition that positive adaptation, 
Pr(+adapt), is typically much larger than neg- 
ative adaptation, Pr( - adapt). That is, 
Pr( +adapt) >> Pr(prior) > Pr(-adapt). Two 
methods, Pr( + adapt t ) and Pr( + adapt2), will 
be introduced for estimating positive adaptation. 
1. Pr( +adapt 1)=PrOve test\[w~ history) 
2. Pr(+adapt2)=Pr(k'>_2lk>_l )=d.f2/dfl 
The two methods produce similar results, usually 
well within a factor of two of one another. The 
first lnethod splits each document into two equal 
pieces, a hislory portion and a test portion. The 
adapted probabilities are modeled as the chance 
that a word will appeal" in the test portion, given 
that it appeared in the history. The second 
method, suggested by Church and Gale (1995), 
models adaptation as the chance of a second lnen- 
tion (probability that a word will appear two or 
inore times, given that it appeared one or more 
times). Pr(+adapt2) is approximated by 
dJ2/dfl, where c./\['k is the number of documents 
that contain the word/ngram k or more times. 
(dfa. is a generalization of document .frequeno,, 
d.f~ a standard term in information Retrieval.) 
Both inethods are non-parametric (unlike cache 
lnodels). Parametric assumptions, when appro- 
priate, can be very powerful (better estimates 
from less training data), but errors resulting from 
inappropriate assumptions can outweigh tile 
benefits. In this elnpirical investigation of the 
magnitude and shape o1' adaptation we decided to 
use conservative non-parametric methods to 
hedge against the risk of inappropriate parametric 
assumptions. 
The two plots (below) illustrate some of the 
reasons for being concerned about standard 
parametric assumptions. The first plot shows the 
number of times that tile word "said" appears ill 
each of the 500 documents ill the Brown Corpus 
(Francis & Kucera, 1982). Note that there are 
quite a few documents with more than 15 in- 
stances of "said," especially in Press and Fic- 
tion. There are also quite a few documents with 
hardly any instances of "said," especially in the 
Learned genre. We have found a similar pattern 
in other collections; "said" is more common in 
newswire (Associated Press and Wall Street 
Journal) than technical writing (Department of 
Energy abstracts). 
180 
The second plot (below) conlpares these f:hown 
Corpus o\[-)sorvations to a Poisson. Tile circles 
indicate the nulnber of docun-ierits that have .r in- 
stances of "said." As mentioned above, Press 
and Fiction docunlents can lilentioil "said" 15 
times or lllore, while doculllOlltS in the Learned 
genre might not mention the word at all. The line 
shows what woukt be expected under a Poisson. 
Clearly the line does not fit the circles very well. 
The probability of "said" depends on many 
factors (e.g, genre, topic, style, author) that make 
the distributions broader than chance (Poisson). 
We lind especially broad distributions for words 
that adapt a lot. 
"said" in Brown Corpus 
Re Press 
,bbi ,,s ion 
Bc 
_ore 
I i 
;i q 
i \[,i!0! 
! 
ili111'I,;,!~,1 ¸ ,~ ,:,iil 
Learned i o (4 O\ '| 
Ile-Lettt.'s FictiohHU n r 
, i 
'1' 
i I,~ i 
0 100 200 300 400 500 
document number 
Poisson Doesn't Fit 
g 
ix) 
O 
O 
;O o 
' O0oO O 
Ooo o 
O ° 0 
0 5 10 
oo 0 0 o 
o o 
o 0 0 o o oo 
15 20 25 30 
freq 
We will show that adaptation is huge. 
Pr(+ adapt) is ot'ten several orders of magnitude 
larger than Pr(prior). In addition, we lind that 
Pr(+adapt) has a very different shape fiom 
Pr(prior). By construction, Pr(prior) wu'ies 
over many orders o1' magnitude depending on the 
frequency of the word. Interestingly, though, we 
find that Pr(+adapt) has ahnost no dependence 
on word frequency, although there is a strong 
lexical dependence. Some words adapt more than 
others. The result is quite robust. Words that ad- 
apt more in one cortms also tend to adapt more in 
another corpus of similar material. Both the 
magnitude and especially the shape (lack of de- 
pendence on fiequency as well as dependence on 
content) are hard to capture ill an additive-based 
cache model. 
Later in the paper, we will study neighbmw, 
words that do not appear in the history but do 
appear in documents near the history using an in- 
formation retrieval notion of near. We find that 
neighbors adapt more than non-neighbors, but not 
as much as the history. The shape is in between 
as well. Neighbors have a modest dependency on 
fiequency, more than the history, but not as much 
as the prior. 
Neighbors are an extension of Florian & Yarow- 
sky (1999), who used topic clustering to build a 
 model for contexts such as: "It is at 
least on the Serb side a real setback to lhe x." 
Their work was motivated by speech recognition 
apl)lications where it would be desirable for the 
hmguage model to l'avor x = "peace" over x = 
"piece." Obviously, acoustic evidence is not 
very hell~l'tfl in this case. Trigrams are also not 
very helpful because the strongest clues (e.g., 
"Serb," "side" and "setback") are beyond the 
window of three words. Florian & Yarowsky 
cluster documents into about 102 topics, and 
compute a separate trigram  model for 
each topic. Neighbors are similar in spirit, but 
support more topics. 
2. Estimates of Adaptation: Method 1 
Method 1 splits each document into two equal 
pieces. The first hall' of each document is re- 
ferred to as the histoo, portion of the document 
and the second half of each document is referred 
to as the test portion of the documenl. The task is 
to predict the test portion of the document given 
the histm3,. We star! by computing a contingency 
table for each word, as ilhlstrated below: 
l)ocuments containing "hostages" in 1990 AP 
test test 
history a =638 b =505 
histoo, c =557 d =76787 
This table indicates that there are (a) 638 doc- 
uments with hostages in both the first half 
(history) and the second half (test), (b) 505 doc- 
uments with "hostages" in just the first half, (c) 
557 documents with "hostages" in just the 
second halt', and (d) 76,787 documents with 
"hostages" in neither half. Positive and negative 
adaptation are detined in terms a, b, c and d. 
181 
Pr( + adapt I ) = Pr(w E test I w e histoo,) a a+b 
Pr(-adapt l )=Pr(we test\]-~we histoo,)= c c+d 
Adapted probabilities will be compared to: 
Pr (prior) = Pr( w e test) = ( a + c )/D 
whereD =a+b +c+d. 
Positive adaptation tends to be much large, than 
the prior, which is just a little larger than negative 
adaptation, as illustrated in the table below for the 
word "hostages" in four years of the Associated 
Press (AP) newswire. We find remarkably con- 
sistent results when we compare one yea," of the 
AP news to another (though topics do come and 
go over time). Generally, the differences of 
interest are huge (orders of magnitude) compared 
to the differences among various control condi- 
tions (at most factors of two or three). Note that 
values are more similar within colmnns than 
across columns. 
Pr(+adapt) >> Pr(prior) > Pr(-adapt) 
prior +adapt -adapt source w 
0.014 0.56 0.0069 AP87 
0.015 0.56 0.0072 AP90 
0.013 0.59 0.0057 AP91 
0.0044 0.39 0.0030 AP93 
hostages 
3. Adaptation is Lexical 
We find that some words adapt more than others, 
and that words that adapt more in one year of the 
AP also tend to adapt more in another year of the 
AP. In general, words that adapt a lot tend to 
have more content (e.g., good keywords for infor- 
mation retrieval (IR)) and words that adapt less 
have less content (e.g., function words). 
It is often assumed that word fi'equency is a good 
(inverse) con'elate of content. In the psycholin- 
guistic literature, the term "high frequency" is 
often used syrlouymously with "function 
words," and "low frequency" with "content 
words." In IR, inverse document fiequency 
(IDF) is commonly used for weighting keywords. 
The table below is interesting because it ques- 
tions this very basic assumption. We compare 
two words, "Kennedy" and "except," that are 
about equally frequent (similar priors). 
Intuitively, "Kennedy" is a content word and 
"except" is not. This intuition is supported by 
the adaptation statistics: the adaptation ratio, 
Pr(+adapt)/Pr(prior), is nmch larger for 
"Kennedy" than for "except." A similar pattern 
holds for negative adaptation, but in the reverse 
direction. That is, Pr(-adapt)/Pr(prior) is 
lnuch slnaller for "Kennedy" than for "except." 
Kenneclv adapts more than except 
prior +adapt -adapt source w 
0.012 0.27 0.0091 AP90 
0.015 0.40 0.0084 AP91 
0.014 0.32 0.0094 AP93 
Kennedy 
0.016 0.049 0.016 AP90 
0.014 0.047 0.014 AP91 
0.012 0.048 0.012 AP93 
except 
In general, we expect more adaptation for better 
keywords (e.g., "Kennedy") and less adaptatiou 
for less good keywords (e.g., fnnction words such 
as "except"). This observation runs counter to 
the standard practice of weighting keywords 
solely on the basis of frequency, without con- 
sidering adaptation. In a related paper, Umemura 
and Church (submitted), we describe a term 
weighting method that makes use of adaptation 
(sometimes referred to as burstiness). 
Distinctive surnames adapt more 
than ordinary first names 
prior +adapt -adapt source w 
0.0079 0.71 0.0026 AP90 Noriega 
0.0038 0.80 0.0009 AP91 
0.0006 0.90 0.0002 AP90 Aristide 
0.0035 0.77 0.0009 AP91 
0.0011 0.47 0.0006 AP90 Escobar 
0.0014 0.74 0.0006 AP91 
0.068 0.18 0.059 AP90 John 
0.066 0.16 0.057 AP91 
0.025 0.11 0.022 AP90 George 
0.025 0.13 0.022 AP91 
0.029 0.15 0.025 AP90 Paul 
0.028 0.13 0.025 AP91 
The table above compares surnames with first 
names. These surnames are excellent keywords 
unlike the first names, which are nearly as useless 
for IR as function words. The adaptation ratio, 
Pr(+adapt)/Pr(prior), is much larger for the 
surnames than for the first names. 
What is the probability of seeing two Noriegas in 
a document? The chance of the first one is 
p=0.006. According to the table above, the 
chance of two is about 0.75p, closer to p/2 than 
1 )2. Finding a rare word like Noriega in a doc- 
ument is like lighming. We might not expect 
182 
lightning to strike twice, but it hapt)ens all the 
time, especially for good keywords. 
4. Smoothing (for low frequency words) 
Thus fitr, we have seen that adaptation can be 
large, but to delnonstrate tile shape property (lack 
of dependence on frequency), tile counts in the 
contingency table need to be smoothed. The 
problem is that the estimates of a, b, c, d, and es- 
pecially estimates of the ratios of these quantities, 
become unstable when the counts are small. The 
standard methods of smoothing in tile speech re- 
cognition literature are Good-Turing (GT) and 
tteld-Out (He), described in sections 15.3 & 15.4 
of Jelinek (1997). In both cases, we let r be an 
observed count of an object (e.g., the fi'equency 
of a word and/or ngram), and r* be our best 
estimate of r in another COl'pUS of the same size 
(all other things being equal). 
4.1 Standard Held-Out (He) 
He splits the training corpus into two halves. 
The first half is used to count r for all objects of 
intercst (e.g., the frequency of all words in vocal> 
ulary). These counts are then used to group 
objects into bins. The r m bin contains all (and 
only) tile words with count r. For each bin, we 
colnpute N r, tile number of words in the r m bin. 
The second half of the training corpus is then 
used to compute Cr, tile a,,,,re,,'~m~ ~,.~ frequency of 
all the words in the r ~h bin. The final result is 
simply: r*=Cr./N,, ll' the two halves o1' tile 
trail)ing corpora or the lest corpora have dilTercnt 
sizes, then r* should be scaled appropriately. 
We chose He in this work because it makes few 
assumptions. There is no parametric model. All 
that is assumed is that tile two halves of tile 
training corpus are similar, and that both are 
similar to the testing corpus. Even this assulnp- 
tion is a matter of some concern, since major 
stories come and go over time. 
4.2 Application of He to Contingency Tables 
As above, the training corpus is split into two 
halves. We used two different years of AP news. 
The first hall' is used to count document 
frequency rl/: (Document frequency will be used 
instead of standard (term) frequency.) Words are 
binned by df and by their cell in the coutingency 
table. The first half of tile corpus is used to 
compute the number of words in each bin: Nd, ., 
N4fj,, N41.(: and Ndl.,,t; the second half of the 
corpus is used to compute the aggregate doc- 
ument flequency for the words in each bin: C,!f, a, 
C41.,l), Cdl:,,c and C4f,d. The final result is stro- 
p!y: 
c~}.=C,/.~/N~l.r and d~i=C4f,,//N4f,~/. We • ./ .I,: ,/,' ,I 
compute tile probabilities as before, but replace a, 
b, c, d with a *, b *, c *, d*, respectively. 
h o 
5 
n 
2 
? 
1 
History (h) >> Neighborhood (n) >> Prior (p) 
n " n 
P 
P 
5 IO Jro 1oo 5oo lOOO 
Oocunmnt Frequent}, (d\[) 
With these smoothed estimates, we arc able to 
show that Pr(+adcq~t), labeled h in tile plot 
above, is larger and less dependent on fi'equency 
than l)r(prior), labeled p. The plot shows a third 
group, labeled n for neighbors, which will be 
described later. Note that Ihe ns fall between tile 
ps and tile hs. 
Thus far, we have seen that adaptation can be 
huge: Pr(+a&q)l)>> Pr(prior), often by two or 
three orders of magnitude. Perhaps even more 
surprisingly, although Ihe first mention depends 
strongly on frequency (d./), the second does not. 
Some words adapt more (e.g., Noriega, Aristide, 
Escobar) and some words adapt less (e.g., John, 
George, Paul). Tile results are robust. Words 
that adapt more in one year of AP news tend to 
adapt more in another year, and vice versa. 
5. Method 2: l'r( + adapt2 ) 
So far, we have limited our attention to the rel- 
atively simple case where the history and the test 
arc tile same size. In practice, this won't be the 
case. We were concerned that tile observations 
above might be artil'acts somehow caused by this 
limitation. 
We exl~erimented with two approaches for under- 
standing the effect of this limitation and found 
that the size of the history doesn't change 
Pr(+adal)t ) very much. The first approach split 
the history and the test at wlrious points ranging 
from 5% to 95%. Generally, Pr(+adaptl ) in- 
creases as the size of the test portion grows rel- 
ative to the size of the history, but the effect is 
183 
relatively small (more like a factor of two than an 
order of magnitude). 
We were even more convinced by the second 
approach, which uses Pr(+adapt2 ), a completely 
different argument for estimating adaptation and 
doesn't depend on the relative size of the history 
and the test. The two methods produce 
remarkably silnilar results, usually well within a 
factor of two of one another (even when adapted 
probabilities are orders of magnitude larger than 
the prior). 
Pr(+adapt2) makes use of d./)(w), a generaliz- 
ation of document frequency, d,/)(w) is the 
number of documents with .j or more instances of 
w; (dfl is the standard notion of dJ). 
Pr( + adapt 2 ) = Pr(k>_2 \[k>_ 1 ) = df2/(!/" 1 
Method 2 has some advantages and some disad- 
vantages in comparison with method 1. On the 
positive side, method 2 can be generalized to 
compute the chance of a third instance: 
Pr(k>_31k>_2 ). But unfortunately, we do not 
know how to use method 2 to estimate negative 
adaptation; we leave that as an open question. 
2 
Adaptation is huge (and hardly dependent on frequency) 
ptIk >= 3 \] k >= 2) = d\[3* / dr2" 
...... 
2 2 2222 
2 
Pr(k >= 2 I k >= 1) = dr2' / dr1" 
, 
111111 
1 1 
1 
I 
10 
Pr(k >= 1)=dfl*/D 
I I : 
1 100 1000 10030 100000 
Document Froquoncy (all} 
The plot (above) is similar to the plot in section 
4.2 which showed that adapted probabilities 
(labeled h) are larger and less dependent on 
frequency than the prior (labeled p). So too, the 
plot (above) shows that the second and third men- 
tions of a word (labeled 2 and 3, respectively) are 
larger and less dependent on frequency than the 
first mention (labeled 1). The plot in section 4.2 
used method 1 whereas the plot (above) uses 
method 2. Both plots use the He smoothing, so 
there is only one point per bin (df value), rather 
than one per word. 
6. Neighborhoods (Near) 
Florian and Yarowsky's example, "It is at least 
on the Serb side a real setback to the x," provides 
a nice motivation for neighborhoods. Suppose 
the context (history) mentions a number of words 
related to a peace process, but doesn't mention 
the word "peace." Intuitively, there should still 
be some adaptation. That is, the probability of 
"peace" should go up quite a bit (positive adap- 
tation), and the probability of many other words 
such as "piece" should go down a little (negative 
adaptation). 
We start by partitioniug the vocabulary into three 
exhaustive and mutually exclusive sets: hist, near 
and other (abbreviations for history, neighbor- 
hood and otherwise, respectively). The first set, 
hist, contains the words that appear in the first 
half of the document, as before. Other is a 
catchall for the words that are in neither of the 
first two sets. 
The interesting set is near. It is generated by 
query expansion. The history is treated as a 
query in an information retrieval document- 
ranking engine. (We implemented our own 
ranking engine using simple IDF weighting.) 
The neighborhood is the set of words that appear 
in the k= 10 or k = 100 top documents returned by 
the retrieval engine. To ensure that the three sets 
partition the vocabulary, we exclude the history 
fiom the neighborhood: 
near = words in query expansion of hist - hist 
The adaptation probabilities are estimated using a 
contingency table like before, but we now have a 
three-way partition (hist, near and other) of the 
vocabulary instead of the two-way partition, as il- 
lustrated below. 
Documents containing 
I test 
histo O, a =2125 
c = 1963 
"peace" in 1991 AP 
b =2t60 
d =74573 
hist 
ileal" 
other 
test 
a =2125 b =2160 
e =1479 f=22516 
g =484 h =52057 
In estilnating adaptation probabilities, we con- 
tinue to use a, b, c and d as before, but four new 
variables are introduced: e, f, g and h, where 
c=e+g andd=f+h. 
184 
Pr(wc test) =(a +c)/D 
p;( w c tes; l wc hist ) =al( a + b ) 
l'r(we test lwc near) =e/(e +f) 
l'r( w E test \[ w c other) =g/(g + h) 
prior 
hist 
neat 
other 
The table below shows that "Kennedy" adapts 
more than "except" and that "peace" adapts 
more than "piece." That is, "Kennedy" has a 
larger spread than "except" between tile history 
and tile otherwise case. 
.l)Jior hist near other src w 
0.026 0.40 0.022 0.0050 AP91 Kennedy 
0.020 0.32 0.025 0.0038 AP93 
0.026 0.05 0.018 0.0122 AP91 except 
0.019 0.05 0.014 0.0081 AP93 
0.077 0.50 0.062 0.0092 AP91 peace 
0.074 0.49 0.066 0.0069 AP93 
0.015 0.10 0.014 0.0066 AP91 piccc 
0.013 0.08 0.015 0.0046 AP93 
When (.\]/' is small (d/'< 100), I\]O smoothing is 
used to group words into bins by {!/\] Adaptation 
prol)abilities are computed for each bill, rather 
than for each word. Since these probabilities are 
implicitly conditional on ,qJ; they have ah'eady 
been weighted by (!fin some sense, and therefore, 
it is unnecessary to introduce an additional 
explicit weighting scheme based on (!/'or a simple 
transl'orm thereof such as II)1:. 
The experiments below split tile neighborhood 
into four chisses, ranging fronl belier nei.g, hbors 
to wmwe neighbotw, del)ending oil expansion 
frequency, e/\] el'(1) is a number between 1 and k, 
indicating how many of the k top scoring doc- 
uments contain I. (Better neighbors appear in 
more of the top scoring documents, and worse 
neighbors appear in fewer.) All the neighborhood 
classes fall between hist and other, with better 
neighbors adapting tllore than ~,OlSe neighbors. 
7. Experimental Results 
Recall that the task is to predict the test portion 
(the second half) of a document given the histoo, 
(the first half). The following table shows a 
selection of words (sorted by the third cohunn) 
from the test portion of one of the test doculnents. 
The table is separated into thirds by horizontal 
lines. The words in the top third receive nmch 
higher scores by the proposed method (S) than by 
a baseline (B). These words are such good 
keywords that one can faMy contidently guess 
what the story is about. Most of these words re- 
ceive a high score because they were mentioned 
in the history portion of the document, but 
"laid-off" receives a high score by the neighl)or- 
hood mechanism. Although "hiid-off" is not 
mentioned explicitly in the history, it is obviously 
closely related to a number of words that were, 
especially "layoffs," but also "notices" and 
"cuts." It is reassuring to see tile neighborhood 
mechanism doing what it was designed to do. 
The middle third shows words whose scores are 
about the same as the baseline. These words tend 
to be function words and other low content words 
that give tts little sense of what the document is 
about. The bottoln third contains words whose 
scores are much lower than the baseline. These 
words tend to be high in content, but misleading. 
The word ' al us," for example, might suggest 
that story is about a military conflict. 
S l?, Iog2(S/B) Set Ternl 
0.19 0.00 I 1.06 hist Binder 
0.22 0.00 7.45 hist layoff 
0.06 0.00 5.71 hist notices 
0.36 0.01 5.66 hist 13oeing 
0.02 0.00 5.11 near3 laid-off 
0.25 0.02 3.79 hist cuts 
0.01 0.01 0.18 near3 projects 
0.89 0.81 0.15 hist said 
0.06 0.05 0. l 1 near4 announced 
0.06 0.06 0.09 near4 As 
0.00 0.00 0.09 ncat+l employed 
0.00 0.00 -0.61 other 714 
0.00 0.01 -0.77 other managed 
0.01 0.02 -1.05 near2 additional 
0.00 0.01 - 1.56 other wave 
0.00 0.03 -3.41 other arms 
The proposed score, S, shown in colunln 1, is: 
Pr(wlhist) if wc hist 
Pr(wlneari) if wc nearj 
Pr(w\[near 2) if wc near a 
Prs(w)= Pr(winear 3) if w6near 3 
l'r(wJnear4) if we near4 
Pr(wiother) otherwise 
where near I through near 4 are four neighbor- 
l\]oods (k=100). Words in near4 are the best 
neighbors (e\[>10) and words in near I are the 
worst neighbors (e/'= 1). Tile baseline, B, shown 
in column 2, is: Prl~(w)=df/D. Colun\]i\] 3 con\]- 
pares the first two cohnnns. 
We applied this procedure to a year of the AP 
news and found a sizable gain in information on 
185 
average: 0.75 bits per word type per doculnent. 
In addition, there were many more big winners 
(20% of the documents gained 1 bit/type) than 
big losers (0% lost 1 bit/type). The largest 
winners include lists of major cities and their 
temperatures, lists of major currencies and their 
prices, and lists of commodities and their prices. 
Neighborhoods are quite successful in guessing 
the second half of such lists. 
On the other hand, there were a few big losers, 
e.g., articles that summarize the major stories of 
the clay, week and year. The second half of a 
summary article is almost never about the same 
subject its the first half. There were also a few 
end-of-document delimiters that were garbled in 
translnission causing two different documents to 
be treated as if they were one. These garbled 
documents tended to cause trouble for the 
proposed method; in such cases, the history 
comes fi'om one document and the test comes 
from another. 
In general, the proposed adaptation method per- 
formed well when the history is helpful for pre- 
dicting the test portion of the document, and it 
performed poorly when the history is misleading. 
This suggests that we ought to measure topic 
shifts using methods suggested by Hearst (1994) 
and Florian & Yarowsky (1999). We should not 
use the history when we believe that there has 
been a major topic shift. 
8. Conclusions 
Adaptive  models were introduced to 
account for repetition. It is well known that the 
second instance of a word (or ngram) is nmch 
more likely than the first. But what we find 
surprising is just how large the effect is. The 
chance of two Noriegas is closer to p/2 than p 2. 
in addition to the magnitude of adaptation, we 
were also surprised by the shape: while the first 
instance of a word depends very strongly on 
frequency, the second does not. Adaptation de- 
pends more on content than flequency; adaptation 
is stronger for content words such as proper 
nouns, technical terminology and good keywords 
for information retrieval, and weaker for functioll 
words, cliches and first nalnes. 
The shape and magnitude of adaptation has im- 
plications for psycholinguistics, information 
retrieval and  modeling. Psycholinguis- 
tics has tended to equate word frequency with 
content, but our results suggest that two words 
with similar frequency (e.g., "Kennedy" and 
"except") can be distinguished on the basis of 
their adaptation. Information retrieval has tended 
to use frequency in a similar way, weighting 
terms by IDF (inverse document frequency), with 
little attention paid to adaptation. We propose a 
term weighting method that makes use of adapta- 
tion (burstiness) and expansion frequency in a 
related paper (Umelnura and Church, submitted). 
Two estimation methods were introduced to 
demonstrate tile magnitude and shape of adapta- 
tion. Both methods produce similar results. 
• Pr(+ adapt I ) = Pr(test\] hist) 
• Pr(+adapt2)=Pr(k>2\]k>_l ) 
Neighborhoods were then introduced for words 
such as "laid-off" that were not in the history but 
were close ("laid-off" is related to "layoff," 
which was in the history). Neighborhoods were 
defined in terms of query expansion. The history 
is treated as a query in an information retriewd 
document-ranking system. Words in the k top- 
ranking documents (but not in the history) are 
called neighbors. Neighbors adapt more dmn 
other terms, but not as much as words that actual- 
ly appeared in the history. Better neighbors 
(larger et) adapt more than worse neighbors 
(slnaller el). 

References 

Church, K. and Gale, W. (1995) "Poisson 
Mixtures," Journal of Natural Language Engi- 
neering, 1:2, pp. 163-190. 

Florian, R. and Yarowsky, D. (1999) "Dynamic 
Nonlocal Language Modeling via Hierarchical 
Topic-Based Adaptation," ACL, pp. 167-174. 

Francis, W., and Kucera, H. (1982) Frequency 
Analysis of English Usage, Houghton Mifflin 
Colnpany, Boston, MA. 

Hearst, M. (1994) Context and Structure in 
Automated Full-Text Information Access, PhD 
Thesis, Berkeley, available via www.sims.ber- 
keley.edu/~hearst. 

Jelinek, F. (1997) Statistical Methods.for Speech 
Recognition, MIT Press, Cambridge, MA, USA. 

Umemura, K. and Church, K. (submitted) 
"Empirical Term Weighting: A Framework for 
Studying Limits, Stop Lists, Burstiness and 
Query Expansion. 
