A Dynamic Language Model for Speech Recognition 
F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss¹
IBM Research Division, Thomas J. Watson Research Center, 
Yorktown Heights, NY 10598 
¹ Now at Rutgers University, NJ; work performed while visiting IBM.
ABSTRACT 
In the case of a trigram language model, the probability of the next word conditioned on the previous two words is estimated from a large corpus of text. The resulting static trigram language model (STLM) has fixed probabilities that are independent of the document being dictated. To improve the language model (LM), one can adapt the probabilities of the trigram language model to match the current document more closely. The partially dictated document provides significant clues about what words are more likely to be used next. Of the many methods that can be used to adapt the LM, we describe in this paper a simple model based on the trigram frequencies estimated from the partially dictated document. We call this model a cache trigram language model (CTLM) since we are caching the recent history of words. We have found that the CTLM reduces the perplexity of a dictated document by 23%. The error rate of a 20,000-word isolated word recognizer decreases by about 5% at the beginning of a document and by about 24% after a few hundred words.
INTRODUCTION 
A language model is used in speech recognition systems and automatic translation systems to improve the performance of such systems. A trigram language model [1], whose parameters are estimated from a large corpus of text (greater than a few million words), has been used successfully in both applications. The trigram language model has a probability distribution for the next word conditioned on the previous two words. This static distribution is obtained as an average over many documents. Yet we know that several words are bursty by nature, i.e., one expects the word "language" to occur in this paper at a significantly higher rate than the average frequency estimated from a large collection of text. To capture the "dynamic" nature of the trigram probabilities in a particular document, we present a "cache" trigram language model (CTLM) that uses a window of the n most recent words to determine the probability distribution of the next word.
The idea of using a window of the recent history to adjust the LM probabilities was proposed in [2, 3]. In [2] the dynamic component adjusted the conditional probability, p_c(w_{n+1} | g_{n+1}), of the next word, w_{n+1}, given a predicted part-of-speech (POS), g_{n+1}, in a tri-part-of-speech language model. Each POS had a separate cache, where the frequencies of all the words that occurred with that POS determine the dynamic component of the language model. As a word is observed it is tagged and the appropriate POS cache is updated. At least 5 occurrences per cache are required before activating it. Preliminary results for a couple of POS caches indicate that with appropriate smoothing the perplexity, for example, on NN-words is decreased by a factor of 2.5 with a cache-based conditional word probability given the POS category instead of a static probability model.
In [3], the dynamic model uses two bigram language models, p_c(w_{n+1} | w_n, D), where D=1 for words w_{n+1} that have occurred in the cache window and D=0 for words that have not occurred in the cache window. For cache sizes from 128 to 4096, the reported results indicate an improvement in the average rank of the correct word predicted by the model by 7% to 17% over the static model, assuming one knows whether the next word is in the cache or not.

Test set   Static Perplexity   Dynamic Perplexity   λ_c
A          91                  75                   0.07
B          53                  49                   0.12
C          262                 202                  0.07

Table 1: Perplexity of static and dynamic language models.

Perplexity   262   217

Table 2: Perplexity as a function of cache size on test set C.

Static   Unigram Cache   Trigram Cache
262      230             202

Table 3: Perplexity of unigram and trigram caches.
In this paper, we will present a new cache lan- 
guage model and compare its performance to a tri- 
gram language model. In Section 2, we present our 
proposed dynamic component and some results com- 
paring static and dynamic trigram language models 
using perplexity. In Section 3, we present our method 
for incorporating the dynamic language model in an 
isolated 20,000-word speech recognizer and its effect
on recognition performance. 
CACHE LANGUAGE MODEL 
Using a window of the n most recent words, we can estimate a unigram frequency distribution f_n(w_{n+1}), a bigram frequency distribution f_n(w_{n+1} | w_n), and a trigram frequency distribution f_n(w_{n+1} | w_n, w_{n-1}). The resulting 3 dynamic estimators are linearly smoothed together to obtain a dynamic trigram model denoted by p_c(w_{n+1} | w_n, w_{n-1}). The dynamic trigram model assigns a non-zero probability to the words that have occurred in the window of the previous n words. Since the next word may not be in the cache and since the cache contains very few trigrams, we interpolate linearly the dynamic model with the static trigram language model:

p(w_{n+1} | w_n, w_{n-1}) = λ_c p_c(w_{n+1} | w_n, w_{n-1}) + (1 - λ_c) p_s(w_{n+1} | w_n, w_{n-1})    (1)

where p_s(...) is the usual static trigram language model. We use the forward-backward algorithm to estimate the interpolation parameter λ_c [1]. This parameter varies between 0.07 and 0.28 depending on the particular static trigram language model (we used trigram language models estimated from different size corpora) and the cache size (varying from 200 to 1000 words).
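To make equation (1) concrete, the following is a minimal sketch, not the system described here, of a cache trigram component linearly interpolated with a static model; the class name CacheTrigramLM, the static_prob callable, and the default weights are illustrative assumptions.

    from collections import Counter, deque

    class CacheTrigramLM:
        """Sketch of a cache trigram LM interpolated with a static trigram LM (eq. 1).

        static_prob(w, w1, w2) is assumed to return p_s(w | previous word w1,
        word before that w2); all names and defaults here are illustrative."""

        def __init__(self, static_prob, cache_size=1000, lambda_c=0.07,
                     weights=(0.25, 0.25, 0.5)):   # unigram, bigram, trigram weights
            self.static_prob = static_prob
            self.cache = deque(maxlen=cache_size)  # window of the n most recent words
            self.lambda_c = lambda_c
            self.w_uni, self.w_bi, self.w_tri = weights

        def update(self, word):
            """Append one decoded word to the cache window."""
            self.cache.append(word)

        def _cache_prob(self, w, w1, w2):
            """Linearly smoothed unigram/bigram/trigram frequencies from the window."""
            words = list(self.cache)
            if not words:
                return 0.0
            uni = Counter(words)
            bi = Counter(zip(words, words[1:]))
            tri = Counter(zip(words, words[1:], words[2:]))
            f_uni = uni[w] / len(words)
            f_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
            f_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
            return self.w_uni * f_uni + self.w_bi * f_bi + self.w_tri * f_tri

        def prob(self, w, w1, w2):
            """Equation (1): lambda_c * p_cache + (1 - lambda_c) * p_static."""
            return (self.lambda_c * self._cache_prob(w, w1, w2)
                    + (1 - self.lambda_c) * self.static_prob(w, w1, w2))

In use, update() would be called with each newly decoded word and prob() would score candidate next words; a practical implementation would maintain the n-gram counts incrementally rather than recounting the window on every query.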
We have evaluated this cache language model by 
computing the perplexity on three test sets: 
• Test sets A and B are each about 100k words 
of text that were excised from a corpus of docu- 
ments from an insurance company that was used 
for building the static trigram language model 
for a 20,000-word vocabulary. 
• Test set C consists of 7 documents (about 4,000 words) that were dictated in a field trial in the same insurance company on TANGORA (the 20,000-word isolated word recognizer developed at IBM).
Table 1 shows the perplexity of the static and dy-
namic language models for the three test sets. The 
cache size was 1000 words and the cache was updated word-
synchronously. The static language model was esti-
mated from about 1.2 million words of insurance doc- 
uments. The dynamic language model yields from 8% 
to 23% reduction in perplexity, with the larger reduc- 
tion occurring with the test sets with larger perplex- 
ity. The interpolation weight λ_c was estimated using
set B when testing on sets A and C and set A when
testing on set B. Table 2 shows the effect of cache size 
on perplexity where it appears that a larger cache is 
more useful. These results were on test set C, where the rate that the next word is in the cache ranges from 75% for a cache window of 200 words to 83% for a window of 1000 words. Table 3 compares a unigram-only cache with a full trigram cache (for the trigram cache, the weights for the unigram, bigram, and trigram frequencies were 0.25, 0.25, and 0.5, respectively, and were selected by hand). A second set of weights (0.25, 0.5, 0.25) produced a perplexity of 190 for the trigram cache. In all the above experiments,
the cache was not flushed between documents. In the 
next section, we compare the different models in an 
isolated speech recognition experiment. 
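The perplexity and cache hit-rate figures quoted above can be computed with a routine along the following lines; this is a sketch that assumes an lm object exposing the prob, update, and cache members of the earlier CacheTrigramLM sketch, and the "<s>" padding token is an assumption.

    import math

    def evaluate(lm, words):
        """Compute perplexity of lm over a word list and the rate at which the
        next word is already present in the cache window (cache hit rate)."""
        log_prob_sum = 0.0
        hits = 0
        for i, w in enumerate(words):
            w1 = words[i - 1] if i >= 1 else "<s>"   # boundary padding (assumption)
            w2 = words[i - 2] if i >= 2 else "<s>"
            if w in lm.cache:
                hits += 1
            log_prob_sum += math.log2(max(lm.prob(w, w1, w2), 1e-12))  # floor avoids log(0)
            lm.update(w)                             # word-synchronous cache update
        perplexity = 2 ** (-log_prob_sum / len(words))
        hit_rate = hits / len(words)
        return perplexity, hit_rate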
Text length (words)        0-100   100-200   200-300   300-400   400-500   500-800
% Reduction in error rate  6.1%    5.3%      4.7%      10.5%     16.3%     23.8%

Table 4: Percentage reduction in error rate with the trigram cache.

We have tried using a fancier interpolation scheme where the reliance on the cache depends on the current word w_n, with the expectation that some words will tend to be followed by bursty words whereas other words will tend to be followed by non-bursty words. We typically used about 50 buckets (or weighting parameters). However, we have found the perplexity on independent data to be no better than with the single-parameter interpolation.
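The single interpolation weight λ_c (or, in the bucketed variant, one weight per bucket) can be estimated on held-out text with a simple EM iteration of the kind performed by the forward-backward procedure cited above; the following sketch assumes the cache and static probabilities assigned to each held-out word have already been collected into two lists.

    def estimate_lambda(p_cache_list, p_static_list, lam=0.5, iterations=20):
        """EM estimate of the mixture weight lambda_c for a two-component
        interpolation; component probabilities are assumed to be positive."""
        for _ in range(iterations):
            posterior_sum = 0.0
            for pc, ps in zip(p_cache_list, p_static_list):
                mixture = lam * pc + (1 - lam) * ps
                posterior_sum += (lam * pc) / mixture   # E-step: P(cache component | word)
            lam = posterior_sum / len(p_cache_list)     # M-step: re-estimate the weight
        return lam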
ISOLATED SPEECH RECOGNITION 
We incorporated the cache language model into 
the TANGORA isolated speech recognition system. 
We evaluated two cache update strategies. In the first 
one, the cache is updated at the end of every utter- 
ance, i.e., when the speaker turns off the microphone. 
An utterance may be a partial sentence or a complete 
sentence or several sentences depending on how the 
speaker dictated the document. In the second strat- 
egy, the cache is updated as soon as the recognizer 
makes a decision about what was spoken. This typ- 
ically corresponds to a delay of about 3 words. The 
cache is updated with the correct text which requires 
that the speaker correct any errors that may occur. 
This may be unduly difficult with the second update 
strategy. But in the context of our experiments, we have found using the simpler (and more realistic) update strategy, i.e., updating after an utterance is completed, to be as effective as the more elaborate update strategy.
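A minimal sketch of the first (utterance-end) update strategy follows; the recognize and get_corrected_text callables are hypothetical stand-ins for the decoder and for the speaker's corrections, and lm is assumed to be the cache model sketched earlier.

    def dictate_document(utterances, recognize, lm, get_corrected_text):
        """Decode a document utterance by utterance, updating the cache only
        after each utterance ends (i.e., when the microphone is turned off)."""
        document = []
        for audio in utterances:
            hypothesis = recognize(audio, lm)       # decode with the current cache state
            text = get_corrected_text(hypothesis)   # speaker corrects any errors
            for word in text:
                lm.update(word)                     # cache is updated with the correct text
            document.extend(text)
        return document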
The TANGORA system uses a 20,000-word office 
correspondence vocabulary with a trigram language 
model estimated from a few hundred million words 
from several sources. The cache language model was 
tested on a set of 14 documents dictated by 5 speak- 
ers with an internal telephone system (private branch exchange). The speakers were from the speech group, typically dictating electronic mail messages or internal memoranda. The size of a document ranged from about 120 words to 800 words. The total test corpus was about 5,000 words. The maximum cache size (4,000 words) was larger than any of the documents. In these tests, the cache is flushed at the beginning of each document.
In these experiments, the weights for interpolating the dynamic unigram, bigram, and trigram frequencies were 0.4, 0.5, and 0.1, respectively. The weight of the cache probability, λ_c, relative to the static trigram probability was 0.2. Small changes in this weight do not seem to affect recognition performance. The potential benefit of a cache depends on the amount of
text that has been observed. Table 4 shows the per- 
centage reduction in error rate as a function of the 
length of the observed text. We divided the docu- 
ments into 100-word bins and computed the error rate 
in each bin. For the static language model, the error 
rate should be constant except for statistical fluctua- 
tions, whereas one expects the error rate of the cache to decrease with longer documents. As can be
seen from Table 4, the cache reduces the error rate by 
about 5% for shorter documents and up to 24% for 
longer documents. The trigram cache results in an 
average reduction in error rate of 10% for these doc- 
uments, whose average size is about 360 words. The trigram cache is very slightly better than a unigram cache, even though the earlier results using perplexity
as a measure of performance indicated a bigger differ- 
ence between the two caches. 
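The binned analysis of Table 4 can be reproduced with a short routine such as the sketch below, which assumes each document is available as a pair of word-aligned reference and recognized word lists (a reasonable assumption for isolated-word recognition).

    def error_rate_by_position(documents, bin_size=100):
        """documents: list of (reference_words, recognized_words) pairs, aligned
        word for word. Returns the error rate for each 100-word position bin."""
        errors, totals = {}, {}
        for reference, recognized in documents:
            for i, (ref, hyp) in enumerate(zip(reference, recognized)):
                b = i // bin_size                    # bins 0-100, 100-200, ...
                totals[b] = totals.get(b, 0) + 1
                errors[b] = errors.get(b, 0) + (ref != hyp)
        return {b: errors[b] / totals[b] for b in sorted(totals)}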
REFERENCES 
[1] Bahl, L., Jelinek, F., and Mercer, R., A Statistical Approach to Continuous Speech Recognition, IEEE Trans. on PAMI, 1983.

[2] Kuhn, R., Speech Recognition and the Frequency of Recently Used Words: a Modified Markov Model for Natural Language, Proceedings of COLING, Budapest, Vol. 1, pp. 348-350, July 1988.

[3] Kupiec, J., Probabilistic Models of Short and Long Distance Word Dependencies in Running Text, Proceedings of the Speech and Natural Language DARPA Workshop, pp. 290-295, Feb. 1989.
