ADAPTIVE LANGUAGE MODELING USING MINIMUM 
DISCRIMINANT ESTIMATION* 
S. Della Pietra, V. Della Pietra, R. L. Mercer, S. Roukos 
Continuous Speech Recognition Group, 
Thomas J. Watson Research Center 
P. O. Box 704, Yorktown Heights, NY 10598 
ABSTRACT 
We present an algorithm to adapt a n-gram 
language model to a document as it is dictated. 
The observed partial document is used to esti- 
mate a unigram distribution for the words that 
already occurred. Then, we find the closest n- 
gram distribution to the static n.gram distribu- 
tion (using the discrimination information dis- 
tance measure) and that satisfies the marginal 
constraints derived from the document. The 
resulting minimum discrimination information 
model results in a perplexity of 208 instead of 
290 for the static trigram model on a document 
of 321 words. 
1 INTRODUCTION 
Statistical n-gram language models are useful for speech 
recognition and language translation systems because 
they provide an a-priori probability of a word sequence; 
these language models improve the accuracy of recog- 
nition or translation by a significant amount. In the 
case of trigram (n = 3) and blgram (n = 2) language 
models, the probability of the next word conditioned 
on the previous words is estimated from a large corpus 
of text. The resulting static language models (SLM) 
have fixed probabilities that are independent of the 
document being predicted. 
To improve the language model (LM), one can adapt 
the probabilities of the language model to match the 
current document more closely. The partially dictated 
(in the case of speech recognition) document provides 
significant clues about what words are more llke\]y to 
be used next. One expects many words to be bursty. 
For example, if in the early part of a document the 
word lax has been used then the probability that it 
wilt be used again in the same document is significantly 
*PAPER SUBMITTED TO THE ICASSP92 
PROCEEDINGS. 
higher than if it had not occurred yet. In addition, if 
the words in the early part of a document suggest that 
the current document is from a particular subdomain, 
then we may expect other related words to occur at 
a higher rate than the static model may suggest. For 
example, the words "inspection, fire, and insurance" 
suggest an insurance report domain, and therefore in- 
creased probabilities to words such as "stairwell and 
electricar' . 
Assume that from a partial document, denoted by 
h, we have an estimate of the unigram distribution 
pa(w I h) that a word w may be used in the remain- 
ing part of the document. We will denote pa(w I h) 
by d(w), keeping in mind that this dynamic unigram 
distribution is continuously updated as the document 
is dictated and is estimated for a subset of R words of 
the vocabulary. (Typically R is on the order of a doc- 
ument size of about a few hundred words as compared 
to a vocabulary size of 20,000 words.) In general, the 
dynamic unigram distribution will be different from 
the static marginal unigram distribution, denoted by 
p~(w). In this paper, we propose a method for adapt- 
\]ng the language model so that its marginal unigram 
distribution matches the desired dynamic unigram dis- 
tribution d(w). 
The proposed approach consists of finding the model 
that requires the least pertubation of the static model 
and satisfies the set of constraints that have been de- 
rived from the partially observed document. By least 
pertubation we mean that the new model is closest to 
the static model, p~, using the non-symmetric Kullback- 
Liebler distortion measure (also known as dlscrimina- 
tion information, relative entropy, etc.). The minimum 
discrimination information (MDI) p* distribution min- 
imizes: 
i 
over all p that satisfy a set of R linear constrailtts. 
In this paper, we consider marginal constraints of the 103 
form 
p(i) = d, 
iECr 
where we are summing over all events i in the set C~ 
that correspond to the r-th constraint and d, is the de- 
sired value (for r = 1, 2, ..., R). In our case, the events i 
correspond to bigrams, (nq, w2), and the desired value 
for the r-th constraint, d,, is the marginal unigram 
probability, d(w,), for a word w,. 
The idea of using a window of the previous N words, 
called a cache, to estimate dynamic frequencies for a 
word was proposed in \[5\] for the case of a tri-part-of- 
speech model and in \[6\] for a bigram model. In \[4\] a 
trigram language was estimated from the cache and in- 
terpolated with the static trlgram model to yield about 
20% lower perplexity and from 5% to 25% lower recog- 
nition error rate on documents ranging in length from 
100 to 800 words. 
3 ALTERNATING MINIMIZATION 
Starting with an initial estimate of the factors, the 
following iterative algorithm is guaranteed to converge 
to the optimum distribution. At each iteration j, pick 
a constraint rj and adjust the corresponding factor so 
that the constraint is satisfied. In the case of marginal 
constraints, the update is: 
new = fold dr, 
frj ~rj pj-l(Crj ) 
where pJ-~(C,j) is the marginal of the previous estl- 
mate and d~j is the desired marginal. This iteratlve al- 
gorithm cycles through the constraints repeatadly un- 
til convergence hence the name alternating (thru the 
constraints) minimization. It was proposed by Darroch 
and Ratcliff in \[3\]. A proof of convergence for linear 
constraints is given in \[2\] . 
2 MINIMUM DISCRIMINATION 
INFORMATION 
The discrimination information can be written as: 
D(p, ps) = -- EPlogps + EPlogp (1) 
= Itp,(p)- H(p) k 0 (2) 
where Rps(p ) is the bit rate in transmitting source p 
with model Ps and H(p) is the entropy of source p. The 
MDI distribution p* satisfies the following Pythagorean 
inequality: 
D(p,p,) > D(p,p*) + D(p*,ps) 
for all distributions p in the set PR of distributions that 
satisfy the R constraints. So if we have an accurate 
estimate of the constraints then using the MDI distri- 
bution will result in a lower error by at least D(p*, p~). 
The'. MDI distribution is the Maximum Entropy 
(ME) distribution if the static model is the uniform 
distribution. 
Using Lagrange multipliers and differentiating with 
respect to pi the probability of the i-th event, we find 
that the optimum must have the form 
p~ = p,ifi~f~2...fln 
where the factors fir are 1 if event i is not in the con- 
straint set C, or some other value .f, if event i belongs 
to constraint set C,. So the MDI distribution is speci- 
fied by the 17 factors .f,, r = 1, 2, ..., R, that correspond 
to the R constraints, in addition to the original static 
model. 
104 
4 ME CACHE 
We have applied the above approach to adapting a bi- 
gram model; we call the resulting model the ME cache. 
Using a cache window of the previous N words, we es- 
timate the desired unigram probability of all/:/words 
that have occurred in the cache by: 
d(w) = .\de(w) 
where Ac is an adjustment factor taken to be the prob.- 
ability that the next word is already in the cache and 
.fc is the observed frequency of a word in the cache. 
Since any event (wl, wu) participates in 2 constraints 
one for the left marginal d(wl) and the other for the 
right marginal d(w~) there are 2//-,t-1 constraint, a left 
and right marginal for each word in the cache and the 
overall normalization, the ME bigram cache model is 
given by: 
We require the left and right marginals to be equal to 
get a st~ttionary model. (Since all events participate in 
the normalization that factor is absorbed in the other 
two.) 
The iterations fall into two groups: those in which 
a left marginal is adjusted and those in which a right 
marginal is adjusted. In each of these iterations, we 
adjust two factors simultaneously: one for the desired 
unigram probability d(w) and the other so that the 
resulting ME model is a normalized distribution. The 
update for left marginals is 
pJ(wl, W2) : pJ-l(wl, W2)aj.~j 
where aj and sj are adjustments given by: 
1 - d(wj) 
8j = 1 -- pj--l(wj, .) 
d(w¢) 
aj = sjpj_l(wj, .) 
where pJ-l(wj, .) denotes the left marginal of the (j - 
1)-th estimate of the ME distribution and wj is the 
word that corresponds to the selected constraint at the 
j-th iteration. Similar equations can be derived for the 
updates for the right marginals. The process is started 
with p0(w,, = 
Note that the marginM pJ(w,.) can be computed 
by using R additions and multiplications. The algo- 
rithm requires order//2 operation to cycle thru all con- 
straints once. R is typically few hundred compared to 
the vocabulary size V which is 20,000 in our case. We 
have found that about 3 to 5 iterations are sufficient 
to achieve convergence. 
5 EXPERIMENTAL RESULTS 
Using a cache window size of about 700 words, we 
estimated a desired unigram distribution and a cor- 
responding ME bigram distribution with an MDI of 
about 2.2 bits (or 1.1 bits/word). Since the unigram 
distribution may not be exact, we do not expect to re- 
duce our perplexity on the next sentence by a factor 
larger than 2.1 = 2 El. The actuM reduction was a 
factor of 1.5 = 2 o.62 on the next 93 words of the docu- 
ment. For a smMler cache size the discrepancy between 
the MDI and actual perplexity reduction is larger. 
To evaluate the ME cache model we compared it to 
the trigram cache model and the static trigram model. 
In all models we use linear interpolation between the 
dynamic and static components as: 
p(w3lwl, w2) = Acp~(w31w~, wu)-I-(1-A~)p,(w31,vl, w2) 
where A~ = 0.2. The static and cache trigram prob- 
abilities use the usual interpolation between unigram, 
bigram, and trigram frequencies \[t\]. The cache trlgram 
probability p~ is given by: 
pc( w3\]zol, w2 ) = ,\lfcl ( w3 )-t- )~2/c2( w31w2 )T A3f c3( w3lwl 
where fci are frequencies estimated from the cache win- 
dow. The interpolating weights are A1 = 0.4, ),2 = 0.5, 
~nd A3 = 0.1. For the ME cache we replace the dy- 
namic unigram frequency f~l(w3) by the ME condi- 
tional bigram probability pme(W3lW2) given by:" 
o~, ( w3 )p,( w31w2 ) 
PmdW3lw~) = E~ o~,(w)p,(wlw2 ) 
, w2) 
105 
Note that the sum in the denominator is order R since 
the factors are unity for the words that are not in the 
cache. 
In 'Fable 1, we compare the static, the ME cache, 
and the trigram cache models on three documents. 
Both cache models improve on the static. The ME 
and trigram cache are fairly close as would be expected 
since they both have the same dynamic unigram dis- 
tribution. The second experiment illustrates how they 
are different. 
Document Words Static ME \]¥igram 
Cache Cache 
T1 321 290 208 218 
T3 426 434 291 300 
E1 814 294 175 182 
Table 1. Perplexity on three documents. 
We compared the ME cache and the trigram cache 
on 2 non-senslcal sentences made up from words that 
have occurred in the first sentence of a document. The 
2 sentences are: 
• SI: the letter fire to to to 
• S2: building building building building 
Table 2 shows the perplexity of each sentence at 2 
points in the document history: one after the first sen- 
tence (of length 33 words) is in the cache and the sec- 
ond after 10 sentences (203 words) are in the cache. 
We can see that the trigram cache can make some rare 
bigrams (wl, w~) more likely if both wx and w2 have 
already occurred due to a term of the form d(wt)d(w2) 
whereas the ME cache still has the factor p,(wl, w2) 
which will tend to keep a rare bigram somewhat \]ess 
probM)\]e. This is particular\]y pronounced for $2, where 
we expect d(building) to be quite accurate after 10 sen- 
tences, the ME cache penalizes the unlikely bigram by 
a factor of about 13 over the trigram cache. 
Sentence 
Sl 
Sl 
S2 
S2 
Cache 
Size 
33 
203 
33 
203 
Trigram ME 
Cache Cache 
213 268 
417 672 
245 665 
212 2963 
Table 2. Trigram and ME cache perplexity. 
6 CONCLUSION 
The MDI approach to adapting a language model can 
result in significant perplexity reduction without aleak- 
age in the bigram probability model. We expect this 
fact to be important in adapting to a new domain 
where the unigram distribution d(w) can be estimated 
from possibly tens of documents. We are currently 
pursuing such experiments. 
REFERENCES 
[11 Bahl, L., Jelinek, F., and Mercer, R.,A Statisti- 
cal Approach to Continuous Speech Recognition, 
IEEE Trans. on PAMI, 1983. 
[2] Csiszar, I., and bongo, G., In]ormation Geometry 
and Alternating Minimization Procedures, Statis- 
tics and Decisions, Supplement Issue 1:205-237, 
1984. 
[3] Darroch, J.N., Ratcliff, D. Generalized lterative 
Scaling for Log-Linear Models, The Annals of 
Mathematical Statistics, Vol. 43, pp. 1470-1480, 
1972. 
[4] Jelinek, F., Merialdo, B., Roukos, S., and Strauss, 
M., A Dynamic Language Model for Speech Recog- 
nition, Proceedings of Speech and Natural Lan- 
guage DARPA Workshop, pp. 293-295, Feb. 1991. 
[5] Kuhn, R., Speech Recognition and the Frequency 
of Recently Used Words: a Modified Markov 
Model for Natural Language, Proceedings of COL- 
ING Budapest, Vol. 1, pp. 348-350, 1988. Vol. 1 
July 1988 
[6] Kupiec, J., Probabilistic Models of Short and 
Long Distance Word Dependencies in Running 
Text, Proceedings of Speech and Natural Lan- 
guage DARPA Workshop, pp. 290-295, Feb. 1989. 
106 
