A Model of Lexical Attraction and Repulsion* 
Doug Beeferman Adam Berger John Lafferty 
School of Computer Science 
Carnegie Mellon University 
Pittsburgh, PA 15213 USA 
{dougb, aberger, lafferty}@cs.cmu.edu
Abstract 
This paper introduces new methods based 
on exponential families for modeling the 
correlations between words in text and 
speech. While previous work assumed the 
effects of word co-occurrence statistics to 
be constant over a window of several hun- 
dred words, we show that their influence 
is nonstationary on a much smaller time 
scale. Empirical data drawn from En- 
glish and Japanese text, as well as conver- 
sational speech, reveals that the "attrac- 
tion" between words decays exponentially, 
while stylistic and syntactic constraints cre- 
ate a "repulsion" between words that dis- 
courages close co-occurrence. We show 
that these characteristics are well described 
by simple mixture models based on two- 
stage exponential distributions which can 
be trained using the EM algorithm. The 
resulting distance distributions can then be 
incorporated as penalizing features in an 
exponential language model. 
1 Introduction 
One of the fundamental characteristics of language, 
viewed as a stochastic process, is that it is highly 
nonstationary. Throughout a written document 
and during the course of spoken conversation, the 
topic evolves, affecting local statistics on word oc- 
currences. The standard trigram model disregards 
this nonstationarity, as does any stochastic grammar 
which assigns probabilities to sentences in a context- 
independent fashion. 
*Research supported in part by NSF grant IRI- 
9314969, DARPA AASERT award DAAH04-95-1-0475, 
and the ATR Interpreting Telecommunications Research 
Laboratories. 
Stationary models are used to describe such a dy- 
namic source for at least two reasons. The first is 
convenience: stationary models require a relatively 
small amount of computation to train and to apply. 
The second is ignorance: we know so little about 
how to model effectively the nonstationary charac- 
teristics of language that we have for the most part 
completely neglected the problem. From a theoreti- 
cal standpoint, we appeal to the Shannon-McMillan- 
Breiman theorem (Cover and Thomas, 1991) when- 
ever computing perplexities on test data; yet this 
result only rigorously applies to stationary and er- 
godic sources. 
To allow a language model to adapt to its recent 
context, some researchers have used techniques to 
update trigram statistics in a dynamic fashion by 
creating a cache of the most recently seen n-grams 
which is smoothed together (typically by linear in- 
terpolation) with the static model; see for example 
(Jelinek et al., 1991; Kuhn and de Mori, 1990). An- 
other approach, using maximum entropy methods 
similar to those that we present here, introduces a 
parameter for trigger pairs of mutually informative 
words, so that the occurrence of certain words in re- 
cent context boosts the probability of the words that 
they trigger (Rosenfeld, 1996). Triggers have also 
been incorporated through different methods (Kuhn 
and de Mori, 1990; Ney, Essen, and Kneser, 1994). 
All of these techniques treat the recent context as a 
"bag of words," so that a word that appears, say, five 
positions back makes the same contribution to pre- 
diction as words at distances of 50 or 500 positions 
back in the history. 
In this paper we introduce new modeling tech- 
niques based on exponential families for captur- 
ing the long-range correlations between occurrences 
of words in text and speech. We show how for 
both written text and conversational speech, the 
empirical distribution of the distance between trig- 
ger words exhibits a striking behavior in which the "attraction" between words decays exponentially, 
while stylistic and syntactic constraints create a "re- 
pulsion" between words that discourages close co- 
occurrence. 
We have discovered that this observed behavior 
is well described by simple mixture models based on 
two-stage exponential distributions. Though in com- 
mon use in queueing theory, such distributions have 
not, to our knowledge, been previously exploited 
in speech and language processing. It is remark- 
able that the behavior of a highly complex stochas- 
tic process such as the separation between word co- 
occurrences is well modeled by such a simple para- 
metric family, just as it is surprising that Zipf's law 
can so simply capture the distribution of word fre- 
quencies in most languages. 
In the following section we present examples of the 
empirical evidence for the effects of distance. In Sec- 
tion 3 we outline the class of statistical models that 
we propose to model this data. After completing 
this work we learned of a related paper (Niesler and 
Woodland, 1997) which constructs similar models. 
In Section 4 we present a parameter estimation algo- 
rithm, based on the EM algorithm, for determining 
the maximum likelihood estimates within the class. 
In Section 5 we explain how distance models can be 
incorporated into an exponential language model, 
and present sample perplexity results we have ob- 
tained using this class of models. 
2 The Empirical Evidence 
The work described in this paper began with the 
goal of building a statistical language model using 
a static trigram model as a "prior," or default dis- 
tribution, and adding certain features to a family of 
conditional exponential models to capture some of 
the nonstationary features of text. The features we 
used were simple "trigger pairs" of words that were 
chosen on the basis of mutual information. Figure 1 
provides a small sample of the 41,263 (s,t) trigger 
pairs used in most of the experiments we will de- 
scribe. 
In earlier work, for example (Rosenfeld, 1996), the 
distance between the words of a trigger pair (s,t) 
plays no role in the model, meaning that the "boost" 
in probability which t receives following its trigger s 
is independent of how long ago s occurred, so long 
as s appeared somewhere in the history H, a fixed- 
length window of words preceding t. It is reasonable 
to expect, however, that the relevance of a word s to 
the identity of the next word should decay as s falls 
s              t
Ms.            her
changes        revisions
energy         gas
committee      representative
board          board
lieutenant     colonel
AIDS           AIDS
Soviet         missiles
underwater     diving
patients       drugs
television     airwaves
Voyager        Neptune
medical        surgical
I              me
Gulf           Gulf

Figure 1: A sample of the 41,263 trigger pairs extracted from the 38 million word Wall Street Journal corpus.
s              t
UN             Security Council
electricity    kilowatt
election       small electoral district
silk           cocoon
court          imprisonment
Hungary        Bulgaria
Japan Air      to fly cargo
sentence       proposed punishment
transplant     organ
forest         wastepaper
computer       host

Figure 2: A sample of triggers extracted from the 33 million word Nikkei corpus.
further and further back into the context. Indeed, 
there are tables in (Rosenfeld, 1996) which suggest 
that this is so, and distance-dependent "memory 
weights" are proposed in (Ney, Essen, and Kneser, 
1994). We decided to investigate the effect of dis- 
tance in more detail, and were surprised by what 
we found. 
374 
Figure 3: The observed distance distributions--collected from five million words of the Wall Street Journal corpus--for one of the non-self trigger groups (left) and one of the self trigger groups (right). For a given distance 0 < k < 400 on the x-axis, the value on the y-axis is the empirical probability that two trigger words within the group are separated by exactly k + 2 words, conditional on the event that they co-occur within a 400 word window. (We exclude separations of one or two words because of our use of distance models to improve upon trigrams.)
The set of 41,263 trigger pairs was partitioned 
into 20 groups of non-self triggers (s, t), s ≠ t, such 
as (Soviet, Kremlin's), and 20 groups of self trig- 
gers (s, s), such as (business, business). Figure 3 
displays the empirical probability that a word t ap- 
pears for the first time k words after the appearance 
of its mate s in a trigger pair (s,t), for two repre- 
sentative groups. 
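
To make the data-collection step concrete, the following is a minimal sketch (in Python; the function and variable names are hypothetical, and this is not the code used for the experiments) of how the empirical separation histogram for a single trigger pair can be gathered from a tokenized corpus. The min_dist cutoff mirrors the exclusion of one- and two-word separations noted in Figure 3.

from collections import Counter

def distance_histogram(tokens, s, t, max_dist=400, min_dist=3):
    """Empirical distribution of the separation between a trigger word s and
    the first following occurrence of its mate t, within a max_dist window."""
    counts = Counter()
    last_s = None
    for i, w in enumerate(tokens):
        if w == t and last_s is not None:
            k = i - last_s
            if min_dist <= k <= max_dist:
                counts[k] += 1
            last_s = None  # count only the first occurrence of t after s
        if w == s:
            last_s = i
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}
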
The curves are striking in both their similarities 
and their differences. Both curves seem to have more 
or less flattened out by N = 400, which allows us to 
make the approximating assumption (of great prac- 
tical importance) that word-triggering effects may 
be neglected after several hundred words. The most 
prominent distinction between the two curves is the 
peak near k = 25 in the self trigger plots; the non- 
self trigger plots suggest a monotonic decay. The 
shape of the self trigger curve, in particular the rise 
between k = 1 and k ≈ 25, reflects the stylistic and 
syntactic injunctions against repeating a word too 
soon. This effect, which we term the lexical exclu- 
sion principle, does not appear for non-self triggers. 
In general, the lexical exclusion principle seems to 
be more in effect for uncommon words, and thus the 
peak for such words is shifted further to the right. 
While the details of the curves vary depending on 
the particular triggers, this behavior appears to be 
universal. For triggers that appear too few times in 
the data for this behavior to exhibit itself, the curves 
emerge when the counts are pooled with those from 
a collection of other rare words. An example of this 
law of large numbers is shown in Figure 4. 
These empirical phenomena are not restricted to 
the Wall Street Journal corpus. In fact, we have ob- 
served similar behavior in conversational speech and 
Japanese text. The corresponding data for self trig- 
gers in the Switchboard data (Godfrey, Holliman, 
and McDaniel, 1992), for instance, exhibits the same 
bump in p(k) for small k, though the peak is closer 
to zero. The lexical exclusion principle, then, seems 
to be less applicable when two people are convers- 
ing, perhaps because the stylistic concerns of written 
communication are not as important in conversation. 
Several examples from the Switchboard and Nikkei 
corpora are shown in Figure 5. 
3 Exponential Models of Distance 
The empirical data presented in the previous section 
exhibits three salient characteristics. First is the de- 
cay of the probability of a word t as the distance 
k from the most recent occurrence of its mate s in- 
creases. The most important (continuous-time) dis- 
tribution with this property is the single-parameter 
exponential family 
$$p_\mu(x) = \mu e^{-\mu x}.$$
(We'll begin by showing the continuous analogues 
of the discrete formulas we actually use, since they 
are simpler in appearance.) This family is uniquely 
characterized by the memoryless property that the probability of waiting an additional length of time $\Delta t$ is independent of the time elapsed so far, and 
Figure 4: The law of large numbers emerging for distance distributions. Each plot shows the empirical 
distance curve for a collection of self triggers, each of which appears fewer than 100 times in the entire 38 
million word Wall Street Journal corpus. The plots include statistics for 10, 50, 500, and all 2779 of the self 
triggers which occurred no more than 100 times each. 
" o.m4 . 
~ ~ ~ ~ ~ ~ ~ 11 
a~ 
a~ 
\ 
a~ 
cu~ 
\ ~CIOI 
o.o~d 
o~ 
@ 
w IOO ~lo ~3o 21o ~oo 
o,~ 
01 
Figure 5: Empirical distance distributions of triggers in the Japanese Nikkei corpus, and the Switchboard corpus of conversational speech. Upper row: all non-self (left) and self triggers (middle) appearing fewer than 100 times in the Nikkei corpus, and the curve for the possessive particle の (right). Bottom row: the self trigger UH (left), YOU-KNOW (middle), and all self triggers appearing fewer than 100 times in the entire Switchboard corpus (right).
the distribution $p_\mu$ has mean $1/\mu$ and variance $1/\mu^2$. 
This distribution is a good candidate for modeling 
non-self triggers. 
Figure 6: A two-stage queue 
The second characteristic is the bump between 0 
and 25 words for self triggers. This behavior appears 
when two exponential distributions are arranged in 
serial, and such distributions are an important tool 
in the "method of stages" in queueing theory (Klein- 
rock, 1975). The time it takes to travel through two 
service facilities arranged in serial, where the first 
provides exponential service with rate $\mu_1$ and the second provides exponential service with rate $\mu_2$, is simply the convolution of the two exponentials:
$$p_{\mu_1,\mu_2}(x) = \int_0^x \mu_1\mu_2\, e^{-\mu_1 t}\, e^{-\mu_2 (x-t)}\, dt = \frac{\mu_1\mu_2}{\mu_2 - \mu_1}\left(e^{-\mu_1 x} - e^{-\mu_2 x}\right), \qquad \mu_1 \neq \mu_2.$$
The mean and variance of the two-stage exponential $p_{\mu_1,\mu_2}$ are $1/\mu_1 + 1/\mu_2$ and $1/\mu_1^2 + 1/\mu_2^2$ respectively. As $\mu_1$ (or, by symmetry, $\mu_2$) gets large, the peak shifts towards zero and the distribution approaches the single-parameter exponential $p_{\mu_2}$ (by symmetry, $p_{\mu_1}$). A sequence of two-stage models is shown in Figure 7.
Figure 7: A sequence of two-stage exponential models $p_{\mu_1,\mu_2}(x)$ with $\mu_1 = 0.01, 0.02, 0.06, 0.2, \infty$ and $\mu_2 = 0.01$.
The two-stage exponential is a good candidate for 
distance modeling because of its mathematical prop- 
erties, but it is also well-motivated for linguistic rea- 
sons. The first queue in the two-stage model rep- 
resents the stylistic and syntactic constraints that 
prevent a word from being repeated too soon. After 
this waiting period, the distribution falls off expo- 
nentially, with the memoryless property. For non- 
self triggers, the first queue has a waiting time of 
zero, corresponding to the absence of linguistic con- 
straints against using t soon after s when the words 
s and t are different. Thus, we are directly model- 
ing the "lexical exclusion" effect and long-distance 
decay that have been observed empirically. 
The third artifact of the empirical data is the ten- 
dency of the curves to approach a constant, positive 
value for large distances. While the exponential dis- 
tribution quickly approaches zero, the empirical data 
settles down to a nonzero steady-state value. 
Together these three features suggest modeling 
distance with a three-parameter family of distribu- 
tions:
$$p_{\mu_1,\mu_2,c}(k) = \gamma\left(p_{\mu_1,\mu_2}(k) + c\right)$$
where $c > 0$ and $\gamma$ is a normalizing constant.
Rather than a continuous-time exponential, we use 
the discrete-time analogue 
$$p_\mu(k) = (1 - e^{-\mu})\, e^{-\mu k}.$$
In this case, the two-stage model becomes the discrete-time convolution
$$p_{\mu_1,\mu_2}(k) = \sum_{j=0}^{k} p_{\mu_1}(j)\, p_{\mu_2}(k - j).$$
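
For concreteness, a small sketch (with hypothetical helper names) that evaluates these discrete densities over a window of N possible distances:

import math

def one_stage(mu, N):
    """Discrete exponential p_mu(k) = (1 - e^{-mu}) e^{-mu k}, k = 0..N-1."""
    return [(1.0 - math.exp(-mu)) * math.exp(-mu * k) for k in range(N)]

def two_stage(mu1, mu2, N):
    """Discrete convolution p_{mu1,mu2}(k) = sum_j p_{mu1}(j) p_{mu2}(k - j)."""
    p1, p2 = one_stage(mu1, N), one_stage(mu2, N)
    return [sum(p1[j] * p2[k - j] for j in range(k + 1)) for k in range(N)]
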
Remark. It should be pointed out that there is 
another parametric family that is an excellent can- 
didate for distance models, based on the first two 
features noted above: this is the Gamma distribution
$$p_{\alpha,\mu}(x) = \frac{\mu^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\mu x}.$$
This distribution has mean $\alpha/\mu$ and variance $\alpha/\mu^2$, 
and thus can afford greater flexibility in fitting the 
empirical data. For Bayesian analysis, this distribu- 
tion is appropriate as the conjugate prior for the ex- 
ponential parameter $\mu$ (Gelman et al., 1995). Using 
this family, however, sacrifices the linguistic inter- 
pretation of the two-stage model. 
4 Estimating the Parameters 
In this section we present a solution to the problem 
of estimating the parameters of the distance models 
introduced in the previous section. We use the max- 
imum likelihood criterion to fit the curves. Thus, if 
$\theta \in \Theta$ represents the parameters of our model, and $\tilde p(k)$ is the empirical probability that two triggers appear a distance of $k$ words apart, then we seek to maximize the log-likelihood
$$\mathcal{L}(\theta) = \sum_{k \ge 0} \tilde p(k) \log p_\theta(k).$$
First suppose that $\{p_\theta\}_{\theta \in \Theta}$ is the family of continuous one-stage exponential models $p_\mu(k) = \mu e^{-\mu k}$. In this case the maximum likelihood problem is straightforward: the mean is the sufficient statistic for this exponential family, and its maximum likelihood estimate is determined by
$$\hat\mu = \frac{1}{\sum_{k > 0} k\, \tilde p(k)} = \frac{1}{E_{\tilde p}[k]}.$$
In the case where we instead use the discrete model $p_\mu(k) = (1 - e^{-\mu})\, e^{-\mu k}$, a little algebra shows that the maximum likelihood estimate is then
$$\hat\mu = \log\left(1 + \frac{1}{E_{\tilde p}[k]}\right).$$
Now suppose that our parametric family $\{p_\theta\}_{\theta \in \Theta}$ is the collection of two-stage exponential models; the log-likelihood in this case becomes
$$\mathcal{L}(\mu_1,\mu_2) = \sum_{k \ge 0} \tilde p(k) \log\left(\sum_{j=0}^{k} p_{\mu_1}(j)\, p_{\mu_2}(k - j)\right).$$
Here it is not obvious how to proceed to obtain the 
maximum likelihood estimates. The difficulty is that 
there is a sum inside the logarithm, and direct dif- 
ferentiation results in coupled equations for $\mu_1$ and $\mu_2$. Our solution to this problem is to view the con- 
volving index j as a hidden variable and apply the 
EM algorithm (Dempster, Laird, and Rubin, 1977). 
Recall that the interpretation of j is the time used 
to pass through the first queue; that is, the number 
of words used to satisfy the linguistic constraints of 
lexical exclusion. This value is hidden given only the 
total time k required to pass through both queues. 
Applying the standard EM argument, the difference in log-likelihood for two parameter pairs $(\mu_1',\mu_2')$ and $(\mu_1,\mu_2)$ can be bounded from below as
$$\mathcal{L}(\mu') - \mathcal{L}(\mu) \;\ge\; \sum_{k \ge 0} \tilde p(k) \sum_{j=0}^{k} p_{\mu_1,\mu_2}(j \mid k) \log \frac{p_{\mu_1',\mu_2'}(k, j)}{p_{\mu_1,\mu_2}(k, j)} \;\equiv\; \mathcal{A}(\mu',\mu)$$
where
$$p_{\mu_1,\mu_2}(k, j) = p_{\mu_1}(j)\, p_{\mu_2}(k - j)$$
and
$$p_{\mu_1,\mu_2}(j \mid k) = \frac{p_{\mu_1,\mu_2}(k, j)}{p_{\mu_1,\mu_2}(k)}.$$
Thus, the auxiliary function $\mathcal{A}$ can be written as
$$\mathcal{A}(\mu',\mu) = \log(1 - e^{-\mu_1'}) + \log(1 - e^{-\mu_2'}) - \mu_1' \sum_{k \ge 0} \tilde p(k) \sum_{j=0}^{k} j\, p_{\mu_1,\mu_2}(j \mid k) - \mu_2' \sum_{k \ge 0} \tilde p(k) \sum_{j=0}^{k} (k - j)\, p_{\mu_1,\mu_2}(j \mid k) + \mathrm{constant}(\mu).$$
Differentiating $\mathcal{A}(\mu',\mu)$ with respect to $\mu_1'$ and $\mu_2'$, we get the EM updates
$$\mu_1' = \log\left(1 + \frac{1}{\sum_{k \ge 0} \tilde p(k) \sum_{j=0}^{k} j\, p_{\mu_1,\mu_2}(j \mid k)}\right)$$
$$\mu_2' = \log\left(1 + \frac{1}{\sum_{k \ge 0} \tilde p(k) \sum_{j=0}^{k} (k - j)\, p_{\mu_1,\mu_2}(j \mid k)}\right).$$
Remark. It appears that the above updates require $O(N^2)$ operations if a window of $N$ words is maintained in the history. However, using formulas for the geometric series, such as $\sum_{k=0}^{\infty} k x^k = x/(1-x)^2$, we can write the expectation $\sum_{j=0}^{k} j\, p_{\mu_1,\mu_2}(j \mid k)$ in closed form. Thus, the updates can be calculated in linear time.
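
A minimal sketch of the resulting EM loop, written directly from the updates above (function names are hypothetical); for clarity it computes the posterior expectations by explicit summation rather than with the closed-form geometric-series speedup just described:

import math

def fit_two_stage(p_emp, iters=50):
    """EM for the discrete two-stage exponential.
    p_emp: dict mapping distance k >= 0 to empirical probability."""
    mu1, mu2 = 0.5, 0.01                       # arbitrary initial guesses
    geom = lambda mu, k: (1.0 - math.exp(-mu)) * math.exp(-mu * k)
    for _ in range(iters):
        e_j, e_kj = 0.0, 0.0                   # expected hidden j and k - j
        for k, pk in p_emp.items():
            joint = [geom(mu1, j) * geom(mu2, k - j) for j in range(k + 1)]
            z = sum(joint)
            if z == 0.0:
                continue
            post = [w / z for w in joint]      # p(j | k) under current parameters
            e_j += pk * sum(j * post[j] for j in range(k + 1))
            e_kj += pk * sum((k - j) * post[j] for j in range(k + 1))
        if e_j > 0.0:
            mu1 = math.log(1.0 + 1.0 / e_j)
        if e_kj > 0.0:
            mu2 = math.log(1.0 + 1.0 / e_kj)
    return mu1, mu2
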
Finally, suppose that our parametric family $\{p_\theta\}_{\theta \in \Theta}$ is the three-parameter collection of two-stage exponential models together with an additive constant:
$$p_{\mu_1,\mu_2,c}(k) = \gamma\left(p_{\mu_1,\mu_2}(k) + c\right).$$
Here again, the maximum likelihood problem can be solved by introducing a hidden variable. In particular, by setting $\alpha = \gamma c N$, where $N$ is the length of the distance window, we can express this model as a mixture of a two-stage exponential and a uniform distribution:
$$p_{\mu_1,\mu_2,\alpha}(k) = (1 - \alpha)\, p_{\mu_1,\mu_2}(k) + \alpha\, \frac{1}{N}.$$
Thus, we can again apply the EM algorithm to de- 
termine the mixing parameter $\alpha$. This is a standard 
application of the EM algorithm, and the details are 
omitted. 
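
For completeness, a sketch of that standard mixture-weight update, assuming the two-stage component has already been fit and distances range over a window of N values (names hypothetical):

def fit_mixture_weight(p_emp, p_two_stage, N, iters=50, alpha=0.1):
    """EM for alpha in p(k) = (1 - alpha) p_two_stage(k) + alpha / N.
    p_emp and p_two_stage: dicts mapping k to probabilities."""
    for _ in range(iters):
        post_sum = 0.0
        for k, pk in p_emp.items():
            unif = alpha / N
            denom = (1.0 - alpha) * p_two_stage.get(k, 0.0) + unif
            if denom > 0.0:
                post_sum += pk * (unif / denom)   # posterior of the uniform branch
        alpha = post_sum                           # expected fraction of uniform mass
    return alpha
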
In summary, we have shown how the EM algo- 
rithm can be applied to determine maximum like- 
lihood estimates of the three-parameter family of distance models $\{p_{\mu_1,\mu_2,\alpha}\}$. In 
Figure 8 we display typical examples of this training 
algorithm at work. 
5 A Nonstationary Language Model 
To incorporate triggers and distance models into 
a long-distance language model, we begin by 
constructing a standard, static backoff trigram 
model (Katz, 1987), which we will denote as $q(w_0 \mid w_{-1}, w_{-2})$. For the purposes of building a 
model for the Wall Street Journal data, this trigram 
model is quickly trained on the entire 38-million 
word corpus. We then build a family of conditional 
exponential models of the general form 
$$p(w \mid H) = \frac{1}{Z(H)} \exp\left(\sum_i \lambda_i f_i(H, w)\right) q(w \mid w_{-1}, w_{-2})$$
where $H = w_{-1}, w_{-2}, \ldots, w_{-N}$ is the word history, and $Z(H)$ is the normalization constant
$$Z(H) = \sum_{w} \exp\left(\sum_i \lambda_i f_i(H, w)\right) q(w \mid w_{-1}, w_{-2}).$$
The functions $f_i$, which depend both on the word history $H$ and the word being predicted, are called features, and each feature $f_i$ is assigned a weight $\lambda_i$. In the models that we built, feature $f_i$ is an indicator function, testing for the occurrence of a trigger pair $(s_i, t_i)$:
$$f_i(H, w) = \begin{cases} 1 & \text{if } s_i \in H \text{ and } w = t_i \\ 0 & \text{otherwise.} \end{cases}$$
The use of the trigram model as a default dis- 
tribution (Csiszár, 1996) in this manner is new in 
language modeling. (One might also use the term 
prior, although $q(w \mid H)$ is not a prior in the strict 
Bayesian sense.) Previous work using maximum en- 
tropy methods incorporated trigram constraints as 
Figure 8: The same empirical distance distributions of Figure 3 fit to the three-parameter mixture model $p_{\mu_1,\mu_2,\alpha}$ using the EM algorithm. The dashed line is the fitted curve. For the non-self trigger plot $\mu_1 = 7$, $\mu_2 = 0.0148$, and $\alpha = 0.253$. For the self trigger plot $\mu_1 = 0.29$, $\mu_2 = 0.0168$, and $\alpha = 0.224$.
explicit features (Rosenfeld, 1996), using the uni- 
form distribution as the default model. There are 
several advantages to incorporating trigrams in this 
way. The trigram component can be efficiently con- 
structed over a large volume of data, using standard 
software or including the various sophisticated tech- 
niques for smoothing that have been developed. Fur- 
thermore, the normalization Z(H) can be computed 
more efficiently when trigrams appear in the default 
distribution. For example, in the case of trigger fea- 
tures, since 
$$Z(H) = 1 + \sum_i \delta(s_i \in H)\,(e^{\lambda_i} - 1)\, q(t_i \mid w_{-1}, w_{-2})$$
the normalization involves only a sum over those 
words that are actively triggered. Finally, assuming 
robust estimates for the parameters $\lambda_i$, the resulting 
model is essentially guaranteed to be superior to the 
trigram model. The training algorithm we use for 
estimating the parameters is the Improved Iterative 
Scaling (IIS) algorithm introduced in (Della Pietra, 
Della Pietra, and Lafferty, 1997). 
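
As an illustration of how prediction under such a model might be organized (a sketch under assumed data structures, not the implementation used here): q is a callable backoff trigram q(w, w1, w2), triggers maps each potential source word s to a list of (feature index, target word) pairs, and lam holds the feature weights. The normalization follows the efficient form of Z(H) above, since the trigram default already sums to one.

import math
from collections import defaultdict

def predict(w, history, q, triggers, lam):
    """p(w | H) proportional to exp(sum_i lam_i f_i(H, w)) q(w | w_-1, w_-2)."""
    w1, w2 = history[-1], history[-2]
    # total log-weight received by each actively triggered target word
    boosts = defaultdict(float)
    for s in set(history):
        for i, t in triggers.get(s, []):
            boosts[t] += lam[i]
    # Z(H): the trigram prior sums to 1, so only triggered words perturb it
    z = 1.0 + sum((math.exp(b) - 1.0) * q(t, w1, w2) for t, b in boosts.items())
    return math.exp(boosts.get(w, 0.0)) * q(w, w1, w2) / z
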
To include distance models in the word predic- 
tions, we treat the distribution on the separation $k$ between $s_i$ and $t_i$ in a trigger pair $(s_i, t_i)$ as a prior. 
Suppose first that our distance model is a simple 
one-parameter exponential, $p(k \mid s_i \in H, w = t_i) = \mu_i e^{-\mu_i k}$. Using Bayes' theorem, we can then write
$$p(w = t_i \mid s_i \in H,\, s_i = w_{-k}) = \frac{p(w = t_i \mid s_i \in H)\; p(k \mid s_i \in H, w = t_i)}{p(k \mid s_i \in H)} \;\propto\; e^{\lambda_i - \mu_i k}\, q(t_i \mid w_{-1}, w_{-2}).$$
Thus, the distance dependence is incorporated as a 
penalizing feature, the effect of which is to discour- 
age a large separation between $s_i$ and $t_i$. A simi- 
lar interpretation holds when the two-stage mixture models $p_{\mu_1,\mu_2,\alpha}$ are used to model distance, but the formulas are more complicated.
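
Under the simple one-parameter distance model this amounts to replacing each active trigger's contribution lambda_i by lambda_i - mu_i k, with k the distance back to its source word; a sketch (hypothetical names, with each trigger firing at most once per context, as in the experiments reported below):

def distance_penalized_boost(w, history, triggers, lam, dist_mu):
    """Log-boost for word w: each active trigger pair (s_i, t_i) contributes
    lam_i - mu_i * k, where k is the distance to the nearest occurrence of s_i."""
    total, fired = 0.0, set()
    for k, s in enumerate(reversed(history), start=1):   # k = 1 is the previous word
        for i, t in triggers.get(s, []):
            if t == w and i not in fired:
                fired.add(i)
                total += lam[i] - dist_mu[i] * k
    return total
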
In this fashion, we first trained distance models 
using the algorithm outlined in Section 4. We then 
incorporated the distance models as penalizing fea- 
tures, whose parameters remained fixed, and pro- 
ceeded to train the trigger parameters $\lambda_i$ using the 
IIS algorithm. Sample perplexity results are tabu- 
lated in Figure 9. 
One important aspect of these results is that be- 
cause a smoothed trigram model is used as a de- 
fault distribution, we are able to bucket the trigger 
features and estimate their parameters on a modest 
amount of data. The resulting calculation takes only 
several hours on a standard workstation, in com- 
parison to the machine-months of computation that 
previous language models of this type required. 
The use of distance penalties gives only a small 
improvement, in terms of perplexity, over the base- 
line trigger model. However, we have found that 
the benefits of distance modeling can be sensitive to 
configuration of the trigger model. For example, in 
the results reported in Figure 9, a trigger is only al- 
lowed to be active once in any given context. By 
instead allowing multiple occurrences of a trigger s 
to contribute to the prediction of its mate t, both 
the perplexity reduction over the baseline trigram 
and the relative improvements due to distance mod- 
eling are increased. 
Experiment                                   Perplexity   Reduction
Baseline: trigrams trained on 5M words          170           --
Trigram prior + 41,263 triggers                 145          14.7%
Same as above + distance modeling               142          16.5%
Baseline: trigrams trained on 38M words         107           --
Trigram prior + 41,263 triggers                  92          14.0%
Same as above + distance modeling                90          15.9%

Figure 9: Models constructed using trigram priors. Training the larger model required about 10 hours on a DEC Alpha workstation.
6 Conclusions 
We have presented empirical evidence showing that 
the distribution of the distance between word pairs 
that have high mutual information exhibits a strik- 
ing behavior that is well modeled by a three- 
parameter family of exponential models. The prop- 
erties of these co-occurrence statistics appear to be 
exhibited universally in both text and conversational 
speech. We presented a training algorithm for this 
class of distance models based on a novel applica- 
tion of the EM algorithm. Using a standard backoff 
trigram model as a default distribution, we built a 
class of exponential language models which use non- 
stationary features based on trigger words to allow 
the model to adapt to the recent context, and then 
incorporated the distance models as penalizing fea- 
tures. The use of distance modeling results in an 
improvement over the baseline trigger model. 
Acknowledgement 
We are grateful to Fujitsu Laboratories, and in par- 
ticular to Akira Ushioda, for providing access to the 
Nikkei corpus within Fujitsu Laboratories, and as- 
sistance in extracting Japanese trigger pairs. 

References 
Berger, A., S. Della Pietra, and V. Della Pietra. 1996. A 
maximum entropy approach to natural language pro- 
cessing. Computational Linguistics, 22(1):39-71. 
Cover, T.M. and J.A. Thomas. 1991. Elements of Information Theory. John Wiley.
Csiszár, I. 1996. Maxent, mathematics, and information theory. In K. Hanson and R. Silver, editors, Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers.
Della Pietra, S., V. Della Pietra, and J. Lafferty. 1997. 
Inducing features of random fields. IEEE Trans. 
on Pattern Analysis and Machine Intelligence, 19(3), 
March. 
Dempster, A.P., N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.
Gelman, A., J. Carlin, H. Stern, and D. Rubin. 1995. Bayesian Data Analysis. Chapman & Hall, London.
Godfrey, J., E. Holliman, and J. McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proc. ICASSP-92.
Jelinek, F., B. Merialdo, S. Roukos, and M. Strauss. 1991. A dynamic language model for speech recognition. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 293-295, February.
Katz, S. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400-401, March.
Kleinrock, L. 1975. Queueing Systems. Volume I: The- 
ory. Wiley, New York. 
Kuhn, R. and R. de Mori. 1990. A cache-based nat- 
ural language model for speech recognition. IEEE 
Trans. on Pattern Analysis and Machine Intelligence, 
12:570-583. 
Ney, H., U. Essen, and R. Kneser. 1994. On structur- 
ing probabilistic dependencies in stochastic language 
modeling. Computer Speech and Language, 8:1-38. 
Niesler, T. and P. Woodland. 1997. Modelling word- 
pair relations in a category-based language model. In 
Proceedings of ICASSP-97, Munich, Germany, April.
Rosenfeld, R. 1996. A maximum entropy approach 
to adaptive statistical language modeling. Computer 
Speech and Language, 10:187-228. 
