SPEECH RECOGNITION AND THE FREQUENCY OF RECENTLY USED WORDS 
A MODIFIED MARKOV MODEL FOR NATURAL LANGUAGE 
Roland Kuhn 
School of Computer Science, McGill University 
805 Sherbrooke St. West, Montreal 
Abstract 
Speech recognition systems incorporate a language model 
which, at each stage of the recognition task, assigns a probabil- 
ity of occurrence to each word in the vocabulary. A class of Mar- 
kov language models identified by Jelinek has achieved consider- 
able success in this domain. A modification of the Markov 
approach, which assigns higher probabilities to recently used 
words, is proposed and tested against a pure Markov model. 
Parameter calculation and comparison of the two models both 
involve use of the LOB Corpus of tagged modern English. 
1 Introduction 
Speech recognition systems consist of two components. An 
acoustic component matches the most recent acoustic input to 
words in its vocabulary, producing a list of the most plausible 
word candidates together with a probability for each. The second 
component, which incorporates a language model, utilizes the 
string of previously identified words to estimate for each word in 
the vocabulary the probability that it will occur next. Each word 
candidate originally selected by the acoustic component is thus 
associated with two probabilities, the first based on its 
resemblance to the observed signal and the second based on the 
linguistic plausibility of that word occurring immediately after 
the previously recognized words. Multiplication of these two 
probabilities produces an overall probability for each word 
candidate. 
Our work focuses on the language model incorporated in 
the second component. The language model we use is based on a 
class of Markov models identified by Jelinek, the "n-gram" and 
"3g-gram" models [Jelinek 1985, 1983]. These models, whose 
parameters are calculated from a large training text, produce a 
reasonable non-zero probability for every word in the vocabulary 
during every stage of the speech recognition task. Our model 
incorporates both a Markov 3g-gram component and an added 
"cache" component which tracks short-term fluctuations in word 
frequency. 
We adopted the hypothesis that a word used in the recent 
past is much more likely to be used soon than either its overall 
frequency in the language or a Markov model would suggest. 
The cache component of our model estimates the probability of a 
word from its recent frequency of use. The overall model uses a 
weighted average of the Markov and cache components in 
calculating word probabilities, where the relative weights 
assigned to each component depend on the part of speech (POS). 
For each POS, the overall model may therefore place more 
reliance on the cache component than on the Markov 
component, or vice versa; the relative weights are obtained 
empirically for each POS from a training text. This dependence 
on POS arises from the hypothesis that a content word, such as 
a particular noun or verb, will occur in bursts. Function words, 
on the other hand, would be spread more evenly across a text or 
a conversation; their short-term frequencies of use would vary 
less dramatically from their long-term frequencies. One of the 
aims of our research was to assess this hypothesis empirically. If 
it is correct, the relative weight calculated from the training text 
for the cache component for most content POSs will be higher 
than the cache weighting for most function POSs. 
We intend to compare the performance of a standard 3g- 
gram Markov model with that of our model (containing the 
same Markov model along with a cache component) in 
calculating the probability of 100 texts, each approximately 2000 
words long. The texts are taken from the Lancaster-Oslo/Bergen 
(LOB) Corpus of modern English [Johansson et al. 1986, 1982]; 
the rest of the corpus is utilized as a training text which 
determines the parameters of both models. Comparison of the 
two sets of probabilities will allow one to assess the extent of 
improvement over the pure Markov model achieved by adding a 
cache component. Furthermore, the relative weights calculated 
from the training text for the two components of the combined 
model indicate those POSs for which short-term frequencies of 
word use differ drastically from long-term frequencies, and those 
for which word frequencies stay nearly constant over time. 
2 A Natural Language Model with Markov and 
Cache Components 
The "trigram " Markov language model for speech 
recognition developed by F. Jelinek and his colleagues uses the 
context provided by the two preceding words to estimate the 
probability that the word W i occurring at time i is a given 
vocabulary item W. Assume rccursivcly that at time i we have 
just recognized the word sequence W 0 ." " ,Wi_ 2 Wi__ 1. The 
trigram model approximatss P (Wi : W \] Wo, • " • , Wi_2, W~_I) 
by f (W~= W \[ W~_2, W~-I) "whets the frequencies f are 
calculated from a huge "training text" before the recogaition 
task takes place. 
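To make the estimation concrete, here is a minimal sketch of how such trigram frequencies might be computed from a training text; the function and variable names are our own illustration, not Jelinek's system: 

    # Sketch: relative-frequency trigram estimates from a training text.
    from collections import defaultdict

    def train_trigrams(words):
        tri = defaultdict(int)    # counts of (w_{i-2}, w_{i-1}, w_i)
        bi = defaultdict(int)     # counts of (w_{i-2}, w_{i-1})
        for i in range(2, len(words)):
            ctx = (words[i-2], words[i-1])
            tri[ctx + (words[i],)] += 1
            bi[ctx] += 1
        return tri, bi

    def f_trigram(tri, bi, w2, w1, w):
        """f(W_i = w | w2, w1): relative frequency in the training text."""
        n = bi[(w2, w1)]
        return tri[(w2, w1, w)] / n if n else 0.0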
One adaptation of the trigram model employs trigrams of 
POSs to predict the POS of $W_i$, and frequency of words within 
each POS to predict $W_i$ itself. Thus, this "3g-gram" model gives 
$$P(W_i = W \mid W_0, \cdots, W_{i-2}, W_{i-1}) \approx \sum_{g_j \in G} P(W_i = W \mid g(W_i) = g_j)\, P(g(W_i) = g_j \mid g(W_{i-2}), g(W_{i-1}))$$ 
where we let 
$$P(W_i = W \mid g(W_i) = g_j) = f(W_i = W \mid g(W_i) = g_j),$$ 
$$P(g(W_i) = g_j \mid g(W_{i-2}), g(W_{i-1})) \approx f(g(W_i) = g_j \mid g(W_{i-2}), g(W_{i-1})).$$ 
Here $G$ denotes the set of all parts of speech, $g_j$ denotes a 
particular part of speech, and $g(W_i)$ denotes the part of speech 
category to which word $W_i$ belongs (abbreviated to $g_i$ from now 
on); $f$ denotes a frequency calculated from the training text. 
This "3g-gram" model was used by Derouault and Merialdo for 
French language modeling [Derouault and Merialdo 1986, 1984], 
and forms the Markov component of our own model. In practice 
many POS triplets will never appear in the training text but will 
appear during the recognition task, so Derouault and Merialdo 
use a weighted average of triplet and doublet POS frequencies 
plus a low arbitrary constant to prevent zero estimates for the 
probability of occurrence of a given POS: 
$$P(g_i = g_j \mid g_{i-2}, g_{i-1}) \approx l_1 \cdot f(g_i = g_j \mid g_{i-2}, g_{i-1}) + l_2 \cdot f(g_i = g_j \mid g_{i-1}) + 10^{-4}.$$ 
The parameters $l_1$, $l_2$ are not constant but can be made to 
depend on the count of occurrences of the sequence $g_{i-2}, g_{i-1}$, or 
on the POS of the preceding word, $g_{i-1}$. In either case these 
parameters must sum to 0.9999 and can be optimized iteratively; 
Derouault and Merialdo found that the two weighting methods 
performed equally well. 
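A minimal sketch of this weighted average, assuming the weights are stored per preceding POS (one of the two variants just described; the names are illustrative): 

    # Sketch: smoothed POS-trigram estimate with weights keyed on g_{i-1}.
    def smoothed_pos_prob(g2, g1, gj, f3, f2, l1, l2, floor=1e-4):
        """P(g_i = gj | g2, g1), approximated as
        l1(g1)*f(gj | g2, g1) + l2(g1)*f(gj | g1) + 10^-4,
        where l1[g1] + l2[g1] = 0.9999 for every POS g1."""
        return (l1[g1] * f3.get((g2, g1, gj), 0.0)
                + l2[g1] * f2.get((g1, gj), 0.0)
                + floor)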
The 3g-gram component of our model is almost identical 
to that of Derouault and Merialdo, although the 153 POSs we 
use are those of the LOB Corpus. We let $l_1$ and $l_2$ depend on the 
preceding POS $g_{i-1}$. The cache component keeps track of the 
recent frequencies of words within each POS; it assigns high 
probabilities to recently used words. Now, let $C_j(W, i)$ denote 
the cache-based probability of word $W$ at time $i$ for POS $g_j$. If 
$g(W) \neq g_j$ then $C_j(W, i) = 0$ at all times $i$, i.e. if $W$ does not 
belong to POS $g_j$, its cache-based probability for that POS is 
always 0. Similarly, let $M_j(W)$ denote the within-POS Markov 
probability of $W$ from the pure 3g-gram Markov model. 
This is approximated by $M_j(W) \approx f(W_i = W \mid g(W_i) = g_j)$, 
i.e. the frequency of word $W$ among all words with POS $g_j$ 
in the training text. 
The final, combined model is then 
$$P(W_i = W) = \sum_{g_j \in G} P(g_j \mid g(W_{i-2}), g(W_{i-1})) \times [\,k_{M,j} \times M_j(W) + k_{C,j} \times C_j(W, i)\,]$$ 
Here $k_{M,j} + k_{C,j} = 1$; $k_{M,j}$ denotes the weighting given to the 
"frequency within POS" component and $k_{C,j}$ the weighting of 
the "cache-based probability" component of POS $g_j$. One would 
expect relatively "insensitive" POSs, whose constituent words do 
not vary much in frequency over time, to have high values of 
$k_{M,j}$ and low values of $k_{C,j}$; the reverse should be true for 
"sensitive" POSs. As is described in the next section, 
approximate values $k_{C,j}$ and $k_{M,j}$ were determined empirically 
for two POSs $g_j$ to see if these expectations were correct. 
The cache-based probabilities $C_j(W, i)$ were calculated 
as follows. For each POS, a "cache" (just a buffer) with room 
for 200 words is maintained. Each new word is assigned to a 
single POS $g_j$ and pushed into the corresponding buffer. As 
soon as there are 5 words in a cache, it begins to output 
probabilities which correspond to the relative proportions of 
words it contains. The lower limit of 5 on the size of the cache 
before it starts producing probabilities, and the upper size limit 
of 200, are arbitrary; there are many possible heuristics for 
producing cache-based probabilities. 
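One possible realization of such a cache, and of the combined estimate built on it, is sketched below; the 5-word and 200-word limits are those given above, while the class and function names are hypothetical: 

    # Sketch: per-POS cache and the combined Markov-plus-cache estimate.
    from collections import Counter, deque

    class POSCache:
        """Buffer of the most recent words assigned to one POS."""
        def __init__(self, min_size=5, max_size=200):
            self.buf = deque(maxlen=max_size)  # oldest word drops out at 200
            self.min_size = min_size

        def push(self, word):
            self.buf.append(word)

        def prob(self, word):
            """C_j(W, i): relative proportion of `word` in the cache;
            0 until the cache holds at least min_size words."""
            if len(self.buf) < self.min_size:
                return 0.0
            return Counter(self.buf)[word] / len(self.buf)

    def combined_prob(word, g2, g1, pos_set, pos_prob, markov, caches, kM, kC):
        """P(W_i = W): sum over POSs g_j of
        P(g_j | g2, g1) * [kM[g_j]*M_j(W) + kC[g_j]*C_j(W, i)]."""
        return sum(pos_prob(g2, g1, gj)
                   * (kM[gj] * markov(word, gj)
                      + kC[gj] * caches[gj].prob(word))
                   for gj in pos_set)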
3 Implementation and Testing of the Combined 
Model 
3.1 The LOB Corpus 
The Lancaster-Oslo/Bergen Corpus of British English 
consists of 500 samples of about 2000 words each; each word in 
the corpus is tagged with exactly one of 153 POSs. The samples 
were extracted from texts published in Britain in 1961, and have 
been grouped by the LOB researchers into 15 categories spanning 
a wide range of English prose [Johansson et al. 1986, 1982]. We 
split the tagged LOB Corpus into two unequal parts, one of 
which served as a training text for our models and the other of 
which was used to test and compare them. The 
comprehensiveness of the LOB Corpus made it an ideal training 
text and a tough test of the robustness of the language model. 
Furthermore, the fact that it has been tagged by an expert team 
of grammarians and lexicographers freed us from having to 
devise our own tagging procedure. 
3.2 Parameter Calculation 
400 sample texts form the training text used for parameter 
calculation; the remaining 100 samples form a testing text used 
for testing and comparison of the pure 3g-gram model with the 
combined model. Samples were allocated to the training text and 
the testing text in a manner that ensured that each had similar 
proportions of samples belonging to the 15 categories identified 
by the LOB researchers. All parameters for both the pure 3g- 
gram model and the combined model were calculated from the 
400-sample training text. 
The two models share a POS prediction component which 
is estimated by the Derouault-Merialdo method. Triplet and 
doublet POS frequencies were obtained from 75% (300 of the 400 
samples) of the training text; the remaining 25% (100 samples) 
gave the weights, $l_1(g_{i-1})$ and $l_2(g_{i-1})$, needed for smoothing 
between these two frequencies. These were computed iteratively 
using the Forward-Backward algorithm (Derouault and 
Merialdo [1986], Rabiner and Juang [1986]). 
Now the pure 3g-gram model is complete; it remains to 
find $k_{M,j}$ and $k_{C,j}$ for the combined model. These can be 
calculated by means of the Forward-Backward method from the 
400 samples. 
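In outline, the Forward-Backward estimation of such interpolation weights can be read as an EM-style re-estimation in which each component is credited with its share of the probability of held-out data; the following is a rough sketch under that reading, not Derouault and Merialdo's actual procedure: 

    # Sketch: EM-style re-estimation of interpolation weights on held-out data.
    def reestimate_weights(components, heldout, n_iter=20, total=0.9999):
        """components: list of functions, each giving a probability for an event.
        Returns one weight per component, constrained to sum to `total`."""
        k = len(components)
        w = [total / k] * k                      # start from uniform weights
        for _ in range(n_iter):
            credit = [0.0] * k
            for event in heldout:
                parts = [w[j] * components[j](event) for j in range(k)]
                s = sum(parts)
                if s > 0:
                    for j in range(k):
                        credit[j] += parts[j] / s   # fractional responsibility
            z = sum(credit) or 1.0
            w = [total * c / z for c in credit]
        return w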
3.3 Testing the Combined Model 
As described in 3.2, 80% of the LOB Corpus is used to find 
the best-fit parameters for (a) the pure 3g-gram model and (b) the 
combined model, made up of the 3g-gram model plus a cache 
component. These two models will then be tested on the 
remaining 20% of the LOB Corpus as follows. Each is given this 
portion of the LOB Corpus word by word, calculating the 
probability of each word as it goes along. The probability of this 
sequence of about 200,000 words as estimated by either model is 
simply the product of the individual word probabilities; the 
increase achieved by the latter over the former is the measure of 
the improvement due to the addition of the cache component. 
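In practice a product over roughly 200,000 probabilities underflows floating-point arithmetic, so such a comparison would accumulate log probabilities; a sketch, with a hypothetical model interface: 

    # Sketch: total log probability of a test text under a language model.
    import math

    def log_prob_of_text(model, words):
        """Sum of log P(W_i = w | preceding words); the model updates its
        own internal state (POS guesses, caches) as each word is consumed."""
        total = 0.0
        for i, w in enumerate(words):
            p = model.prob(w, words[:i])
            total += math.log(p)   # p > 0 by construction (the 10^-4 floor)
        return total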
Note that in order to calculate word probabilities, both 
models must have guessed the POSs of the two preceding words. 
Thus every word encountered must be assigned a POS. There are 
three cases : 
a). the word did not occur in the tagged training text and 
therefore is not in the vocabulary; 
b). the word was in the training text, and had the same 
tag wherever it occurred; 
c). the word was in the training text, and had more than 
one tag (e.g. the word "light" might have been tagged as a noun, 
verb, and adjective). 
The heuristics employed to assign tags were as follows (a 
sketch appears after this list): 
a). in this case, the two previous POSs are substituted into 
the Derouault-Merialdo weighted-average formula and the 
program tries all 153 possible tags to find the one that 
maximizes the probability given by the formula. 
b). in this case, there is no choice; the tag chosen is the 
unique tag associated with the word in the training text. 
c). when the word has two or more possible tags, the tag 
chosen is the one which makes the largest contribution to the 
word's probability (i.e. which gives rise to the largest component 
in the 3g-gram summation given in Section 2). 
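The three-way heuristic might be sketched as follows; the lexicon lookup and the two scoring functions are hypothetical names for the structures described above: 

    # Sketch: POS assignment for the word just recognized.
    def assign_tag(word, g2, g1, lexicon, pos_prob, word_given_pos, all_tags):
        """g2, g1 are the POSs guessed for the two preceding words;
        lexicon maps each training-text word to its set of observed tags."""
        tags = lexicon.get(word)
        if tags is None:                    # case (a): word not in vocabulary
            return max(all_tags, key=lambda g: pos_prob(g2, g1, g))
        if len(tags) == 1:                  # case (b): a single possible tag
            return next(iter(tags))
        # case (c): the tag contributing the largest term to P(W_i = W)
        return max(tags,
                   key=lambda g: pos_prob(g2, g1, g) * word_given_pos(word, g))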
Thus, although the portion of the LOB Corpus used for 
testing is tagged, these tags were not employed in the 
implementation of either model; in both cases the heuristics 
given above guessed POSs. A separate part of the program 
compared actual tags with guessed ones in order to collect 
statistics on the performance of these heuristics. 
4 Preliminary Results 
1. The first results of our calculations are the values 
$l_1(g_{i-1})$ and $l_2(g_{i-1})$ obtained iteratively to optimize the 
weighting between the POS triplet frequency $f(g_i \mid g_{i-2}, g_{i-1})$ 
and the POS doublet frequency $f(g_i \mid g_{i-1})$ in the estimation of 
$P(g_i = g_j \mid g_{i-2}, g_{i-1})$. As one might expect, $l_1(g_{i-1})$ tends to be 
high relative to $l_2(g_{i-1})$ when $g_{i-1}$ occurs often, because the 
triplet frequency is quite reliable in this case. For instance, the 
most frequent tag in the LOB Corpus is "NN", singular common 
noun; we have $l_1(\mathrm{NN}) = 0.61$. The tag "HVG", attached only 
to the word "having", is fairly rare; we have $l_1(\mathrm{HVG}) = 0.13$. 
However, there are other factors to consider. Derouault 
and Merialdo state that for $g_{i-1}$ equal to an article, $l_1$ was 
relatively low because we need not know the POS $g_{i-2}$ to predict 
that $g_i$ is a noun or adjective. Thus doublet frequencies alone 
were quite reliable in this case. On the other hand, when $g_{i-1}$ is 
a negation, knowing $g_{i-2}$ was very important in making a 
prediction of $g_i$, because of French phrases like "il ne veut" and 
"je ne veux". 
Our results from English texts show somewhat different 
patterns. The tag "AT" for singular articles had an $l_1$ that was 
neither high nor low, 0.47. The tag "CC" for coordinating 
conjunctions, including "but", had a high $l_1$ value, 0.80. 
Adjectives ("JJ") and adverbs ("RB") had $l_1$ values even higher 
than one would expect on the basis of their high frequencies of 
occurrence: 0.90 and 0.86 respectively. 
2. We collected statistics on the success rate of the pure 
Markov component in guessing the POS of the latest word 
(using the tag actually assigned the word in the LOB Corpus as 
the criterion). This rate has a powerful impact on the 
performance of both models, especially the one with a cache 
component; each incorrectly guessed POS leads to looking in the 
wrong cache and thus to a cache-based probability of 0. We are 
particularly interested in forming an idea of how fast this success 
rate will increase as we increase the size of the training text. 
Of the words that had occurred at least once in the 
training text, 83.9% had tags that were guessed correctly (16.1% 
incorrectly). Words that never occurred in the training text 
were assigned the correct tag only 22% of the time (78% 
incorrect). Apparently the information contained in the counts of 
POS triplets, doublets, and singlets is a good POS predictor 
when combined with some knowledge of the possible tags a word 
may have, but not nearly as good on its own. 
Among the words that appeared at least once in the 
training text, a surprisingly high proportion - 42.8% - had more 
than one possible POS. Of these, 66.7% had POSs that were 
guessed correctly. Thus it might appear that performance is 
degraded when the program must make a choice between 
possible tags. This analysis is faulty: a given word might have 
many POSs, and perhaps the correct one was not found in the 
training text at all. The most important statistic, therefore, is 
the proportion of words in the testing text whose tag was 
guessed correctly among the words that had also appeared with 
the correct tag in the training text. This proportion is 94.0%. 
It seems reasonable to regard this as being an indication of the 
upper limit for the success rate of POS prediction with training 
texts of manageable size; it provides an estimate of the success 
rate when the two main sources of error (words found in the 
testing text but not the training text, and words found in both texts 
which are tagged in the testing text with a POS not attached to 
them anywhere in the training text) are eliminated. 
3. We have not yet tested the full combined model (with a 
cache component and a Markov component) against the 3g- 
gram Markov model. However, we have examined the effect on 
the predictive power of the Markov model of including cache 
components for two POSs: singular common noun (label "NN" 
in the LOB Corpus) and preposition (label "IN" in the LOB 
Corpus). These two were chosen because they occur with high 
frequency in the Corpus, in which there are 148,759 occurrences 
of "NN" and 123,440 occurrences of "IN", and because "NN" is a 
content word category and "IN" a function word category. Thus 
they provide a means of testing the hypothesis outlined in the 
Introduction, that a cache component will increase predictive 
power for content POSs but not make much difference for 
function POSs. 
For both POSs, the expectation that the 200-word cache 
will often contain the current word was abundantly fulfilled. On 
average, if the current word was an NN-word, it was stored in 
the NN cache 25.8% of the time; if it was an IN-word, it was 
stored in the IN cache 64.7% of the time. The latter is no 
surprise - there are relatively few different prepositions - but the 
former figure is remarkably high, given the large number of 
different nouns. Note that the figure would be higher if we 
counted plurals as variants of the singular word (as we may do 
in future implementations). 
We have not yet obtained the best-fit weighting for the 
combined model. However, we tried 3 different combinations for 
the NN-words and the IN-words. If "a" is the weight for the 
cache component and "b" the weight for the Markov component, 
the 3 combinations (a, b) are (0.2, 0.8), (0.5, 0.5), and (0.9, 0.1); 
the pure Markov model corresponds to the weighting (0.0, 1.0). 
To assess the performance of each combination for NN-words 
and IN-words, we calculated (i) the log product of the estimated 
probabilities for NN-words only under each of the 4 formulas, and (ii) 
the log product of the estimated probabilities for IN-words only 
under each of the 4 formulas. It is then straightforward to 
calculate the improvement per word obtained by using a cache 
instead of the pure Markov model. 
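Concretely, given the two log products over the N words of a category, the improvement per word is the geometric mean of the probability ratios; a two-line sketch: 

    # Sketch: average multiplicative gain per word from the two log products.
    import math

    def per_word_improvement(log_p_cache, log_p_markov, n_words):
        """exp((L_cache - L_markov) / N): the per-word probability multiple."""
        return math.exp((log_p_cache - log_p_markov) / n_words)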
For NN-words, the (0.2, 0.8) weighting yielded an average 
multiple of 2.3 in the estimated probability of a word in the 
testing text over the probability as calculated by the pure 
Markov model; the (0.5, 0.5) weighting yielded a multiple of 2.0 
per word, and the (0.9, 0.1) actually decreased the probability by 
a factor of 1.5 per word. 
For IN-words, the (0.2, 0.8) weighting gave an average 
multiple of 5.1, the (0.5, 0.5) weighting a multiple of 7.5, and the 
(0.9, 0.1) weighting a multiple of 6.2. 
5 Conclusions 
The preliminary results listed above seem to confirm our 
hypothesis that recently-used words have a higher probability of 
occurrence than the 3g-gram model would predict. Surprisingly, 
if the above comparison of the POS categories "NN" and "IN" is 
a reliable guide, this increased probability is more dramatic in 
the case of function-word categories. Perhaps the smaller number 
of different prepositions makes the cache-based probabilities 
more reliable in this case. 
Since the cost of maintaining a 200-word cache, in terms of 
memory and time, is modest, and the increase in predictive 
power can be great, the approach outlined above should be 
considered as a simple way of improving on the performance of a 
3g-gram language model for speech recognition. If memory is 
limited, one would be wise to create caches only for POSs that 
occur with high frequency and ignore other POSs. 
Our immediate goal is to build caches for a larger number 
of POSs, and to obtain the best-fit weighting for each of them, 
in order to test the full power of the combined model. 
Eventually, we may explore the possibility of ignoring variations 
in the exact form of a word, merging the singular form of a noun 
with its plural, and different tenses and persons of a verb. 
This line of research has more general implications. The 
results above seem to suggest that at a given time, a human 
being works with only a small fraction of his vocabulary. 
Perhaps if we followed an individual's written or spoken use of 
language through the course of a day, it would consist largely of 
time spent in language "islands" or sublanguages, with brief 
periods of time during which he is in transition between islands. 
One might attempt to chart these "islands" by identifying groups 
of words which often occur together in the language. If this 
work is ever carried out on a large scale, it could lead to 
pseudo-semantic language models for speech recognition, since 
the occurrence of several words characteristic of an "island" 
makes the appearance of all words in that island more probable. 

Bibliography 

1. R. Camps, L. Fissore, A. Martelli, G. Micca, and G. 
Volpi, "Probabilistic Models of the Italian Language for 
Speech Recognition", Recent Advances and Applications of 
Speech Recognition (international workshop), pp. 49-56, 
Rome, May 1986. 

2. A.M. Derouault and B. Merialdo, "Natural Language 
Modeling for Phoneme-to-Text Transcription", IEEE 
Trans. Pattern Anal. Machine Intell., Vol. PAMI-8, pp. 
742-749, Nov. 1986. 

3. A.M. Derouault and B. Merialdo, "Language Modeling 
at the Syntactic Level", 7th Int. Conf. Pattern 
Recognition, Vol. II, pp. 1373-1375, Montreal, Aug. 1984. 

4. W.N. Francis, "A Tagged Corpus - Problems and 
Prospects", in Studies in English Linguistics for Randolph 
Quirk, S. Greenbaum, G. Leech, and J. Svartvik, Eds. 
London: Longman, 1980, pp. 193-209. 

5. F. Jelinek, "The Development of an Experimental 
Discrete Dictation Recognizer", Proc. IEEE, Vol. 73, 
No. 11, pp. 1616-1624, Nov. 1985. 

6. F. Jelinek, R.L. Mercer, and L.R. Bahl, "A Maximum 
Likelihood Approach to Continuous Speech Recognition", 
IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-5, 
pp. 179-190, Mar. 1983. 

7. F. Jelinek, "Markov Source Modeling of Text 
Generation", personal communication. 

8. F. Jelinek, "Self-Organized Language Modeling for 
Speech Recognition", personal communication. 

9. S. Johansson, E. Atwell, R. Garside, and G. Leech, The 
Tagged LOB Corpus Users Manual. Norwegian Computing 
Centre for the Humanities, Bergen, 1986. 

10. S. Johansson, ed., Computer Corpora in English 
Language Research. Norwegian Computing Centre for the 
Humanities, Bergen, 1982. 

11. S.E. Levinson, L.R. Rabiner, and M.M. Sondhi, "An 
Introduction to the Application of Probabilistic Functions 
of a Markov Process to Automatic Speech Recognition", 
The Bell System Technical Journal, Vol. 62, No. 4, pp. 
1035-1074, Apr. 1983. 

12. I. Marshall, "Choice of Grammatical Word-Class 
Without Global Syntactic Analysis: Tagging Words in the 
LOB Corpus", Computers and the Humanities, Vol. 17, 
No. 3, pp. 139-150, Sept. 1983. 

13. A. Martelli, "Probability Estimation of Unseen Events 
for Language Modeling", personal communication. 

14. E.M. Muckstein, "A Natural Language Parser with 
Statistical Applications", IBM Research Report RC7516 
(#38450), Mar. 1981. 

15. A. Nadas, "Estimation of Probabilities in the 
Language Model of the IBM Speech Recognition System", 
IEEE Trans. Acoust., Speech, Signal Processing, Vol. 32, 
pp. 859-861, Aug. 1984. 
