A Dynamic Language Model 
Based on Individual Word Domains 
E. I. Sicilia-Garcia, Ji Ming, F. J. Smith 
School of Computer Science 
Queen's University of Belfast 
Belfast BT7 1NN, Northern Ireland 
e.sicilia@qub.ac.uk 
Abstract 
We present a new statistical language
model based on a combination of
individual word language models. Each
word model is built from an individual
corpus which is formed by extracting
those subsets of the entire training corpus
which contain that significant word. We
also present a novel way of combining
language models called the "union
model", based on a logical union of
intersections, and use this to combine the
language models obtained for the
significant words from a cache. The
initial results with the new model provide
a 20% reduction in language model
perplexity over the standard 3-gram
approach.
Introduction 
Statistical language models are based on
information obtained from the analysis of large
samples of a target language. Such models
estimate the conditional probability of a word
given a sequence of preceding words. The
conditional probability can be further used to
determine the likelihood of a sentence through
the product of the individual word probabilities.
A popular type of statistical language model is
the dynamic language model, which dynamically
modifies conditional probabilities depending on
the recent word history. For example, cache-based
natural language models (Kuhn R. & De
Mori R., 1990) incorporate a cache component
into the model, which estimates the probability
of a word depending upon its recent usage.
Trigger-based models go a step further by
triggering associated words for each content word
in a cache, giving each associated word a higher
probability (Lau et al., 1993).
Our statistical language model, based
upon individual word domains, extends these
ideas by creating a new language model for each
significant word in the cache. A significant word
is hard to define; it is any word that significantly
contributes to the content of the text. We define
it as any word which is not a stop word, i.e.
articles, prepositions and some of the most
frequently used words in the language such as
"will", "now", "very", etc. Our model combines
individual word language models with a standard
global n-gram language model.
A training corpus for each significant
word is formed from the amalgamation of the
text fragments taken from the global training
corpus in which that word appears. As such, these
corpora are smaller and closely constrained;
hence the individual language models are more
precise than the global language model and
thereby should offer performance gains. One
aspect of the performance of this joint model is
how the global language model is to be
combined with the individual word language
models. This is explored later.
This paper is organised as follows.
Section 1 explains the basis for this model. The
mathematical background and how the models
are combined are explained in section 2. In the
third section, a novel method of combining the
word models, the probabilistic-union model, is
explained. Finally, results are presented and
conclusions are drawn.
1 Dynamic Language Model based on 
Word Language Models 
Our dynamic language model builds a
language model for each individual word. In
order to do this we need to select which words
are to be classified as significant and furthermore
create a language model for them. We excluded
all the stop words ('is', 'to', 'all', 'some') due to
their high frequency within the text and their
limited contribution to the theme of the text. A
list of stop words was obtained by merging
together the lists used by various www search
engines, for example AltaVista.
Secondly, we need to create a dictionary
that contains the frequency of each word in the
corpus. This is needed because we want to
exclude those non-stop words which appear too
often in the training corpus, for example words
like 'dollars', 'point', etc. A hash file is
constructed to store large amounts of information
so that it can be retrieved quickly.
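As a rough illustration (not the authors' implementation), the selection of significant words can be sketched as follows. The stop-word list shown, the whitespace tokenisation and the MAX_FREQ ceiling are illustrative assumptions; in practice the frequency dictionary would be kept in a hash file for fast retrieval.

from collections import Counter

STOP_WORDS = {"is", "to", "all", "some", "will", "now", "very"}  # merged from search-engine lists
MAX_FREQ = 50_000   # hypothetical ceiling for non-stop words that occur "too often"

def significant_words(corpus_tokens):
    """Return the words for which individual word language models are built."""
    freq = Counter(corpus_tokens)            # frequency dictionary over the corpus
    return {w for w, f in freq.items()
            if w not in STOP_WORDS and f <= MAX_FREQ}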
The next step is to create the global
language model by obtaining the text phrases and
their probabilities. Frequencies of words and
phrases are derived from a large text corpus and
the conditional probability of a word given a
sequence of preceding words is estimated. These
conditional probabilities are combined to
produce an overall language model probability
for any given word sequence. The probability of
a sequence of words is:
$$P(w_1 \cdots w_n) = P(w_1^n) = P(w_1) \times P(w_2 \mid w_1) \times \cdots \times P(w_n \mid w_1^{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) \quad (1)$$

where $w_1^n = \{w_1, w_2, w_3, \ldots, w_n\}$ is a sentence or sequence of words. The individual conditional probabilities are approximated by the maximum likelihoods:

$$P_{ML}(w_i \mid w_1^{i-1}) = \frac{freq(w_1^i)}{freq(w_1^{i-1})} = \frac{freq(w_1 \cdots w_{i-1} w_i)}{freq(w_1 \cdots w_{i-1})} \quad (2)$$

where $freq(X)$ is the frequency of the phrase X in the text.
In equation (2), there are often unknown
sequences of words, i.e. phrases which are not in
the dictionary. The maximum likelihood
probability is then zero. In order to improve this
prediction of an unseen event, and hence the
language model, a number of techniques have
been explored, for example the Good-Turing
estimate (Good I. J., 1953), the backing-off
method (Katz S. M., 1987), deleted interpolation
(Jelinek F. and Mercer R. L., 1984) and the
weighted average n-gram model (O'Boyle P.,
Owens M. and Smith F. J., 1994). We use the
weighted average n-gram technique (WA), which
combines n-gram [1] phrase distributions of several
orders using a series of weighting functions. The
WA n-gram model has been shown to exhibit
similar predictive powers to other n-gram
techniques whilst enjoying several benefits.
Firstly, an algorithm for a WA model is relatively
straightforward to implement in computer
software; secondly, it is a variable n-gram model
with the length depending on the context; and
finally it facilitates easy model extension [2]. The
weighted average probability of a word given the
preceding words is
$$P_{WA}(w \mid w_1 \cdots w_m) = \frac{\lambda_0 P_{ML}(w) + \sum_{i=1}^{m} \lambda_i P_{ML}(w \mid w_{m+1-i} \cdots w_m)}{\sum_{i=0}^{m} \lambda_i} \quad (3)$$

where the weighting functions are:

$$\lambda_0 = \ln(N), \qquad \lambda_i = \left[\ln\left(freq(w_{m+1-i} \cdots w_m)\right)\right]^2 \quad (4)$$

N is the number of tokens in the corpus and $freq(w_{m+1-i} \cdots w_m)$ is the frequency of the phrase $w_{m+1-i} \cdots w_m$ in the text.
The maximum likelihood probability of a word is:

$$P_{ML}(w) = \frac{freq(w)}{N} \quad (5)$$

where $freq(w)$ is the frequency of the word w in the
text. This language model (defined by equations
(3) and (5)) is what we term a standard n-gram
language model or global language model.

[1] An n-gram model contains the conditional probability
of a word dependent on the previous n-1 words (Jelinek
F., Mercer R. L. and Bahl L. R., 1983).
[2] The "ease of extension" refers to the fact that
additional training data can be incorporated into an
existing WA model without the need to re-estimate
smoothing parameters.
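The following sketch computes the weighted average probability using the weights of equation (4) as reconstructed above. The dictionary-of-tuples frequency store and the function signature are illustrative assumptions, not the authors' code.

import math

def p_wa(word, history, freq, n_tokens):
    """Weighted-average probability of equations (3)-(5).

    freq maps word tuples to corpus counts and n_tokens is N.
    Weights: lambda_0 = ln(N), lambda_i = [ln(freq(context_i))]^2."""
    weights = [math.log(n_tokens)]
    probs = [freq.get((word,), 0) / n_tokens]      # P_ML(w), equation (5)
    for i in range(1, len(history) + 1):
        context = tuple(history[-i:])              # last i words of the history
        count = freq.get(context, 0)
        if count == 0:
            continue                               # unseen contexts contribute nothing
        weights.append(math.log(count) ** 2)
        probs.append(freq.get(context + (word,), 0) / count)
    return sum(l * p for l, p in zip(weights, probs)) / sum(weights)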
Finally, the last step is the creation of a
language model for each significant word, which
is formed in the same manner as the global
language model. The word-training
corpus to be used is the amalgamation of the text
fragments taken from the global training corpus
in which the significant word appears. A number
of choices can be made as to how the word-
training corpus for each significant word can be
selected. We initially construct what we termed
the "paragraph context model", entailing that the
global training corpus is scanned for a particular
word and each time the word is found the
paragraph containing that word is extracted. The
paragraphs of text extracted for a particular word
are joined together to form an individual word-
training corpus, from which an individual word
language model is built. Alternative methods
include storing only the sentences where the
word appears or extracting a piece of the text M-
words before and M+ words after the search
word.
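A small sketch of the paragraph-context selection described above; the plain whitespace tokenisation is an assumption.

def build_word_corpus(paragraphs, significant_word):
    """Amalgamate every paragraph of the global training corpus that
    contains the significant word into one word-training corpus."""
    return "\n".join(p for p in paragraphs if significant_word in p.split())

def build_word_corpora(paragraphs, significant_words):
    """One word-training corpus (and hence one word language model) per significant word."""
    return {w: build_word_corpus(paragraphs, w) for w in significant_words}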
Additionally, some restrictions on the
number of words were imposed. This was done
due to the high frequency of certain words. Such
words were omitted since the additional
information that they provide is minimal
(conversely, language models for "rare" words
are desirable as they provide significant
additional information to that contained within
the global language model). Once individual
language models have been formed for each
significant word (trained using the standard n-
gram approach as used for the global model),
there remains the problem of how the individual
word language models will be combined together
with the global language model.
2 Combining the Models 
We need to combine the probabilities
obtained from each word language model and
from the global language model, in order to
obtain a conditional probability for a word given
a sequence of words. The first model to be tested
is an arithmetic combination of the global
language model and the word language models.
All the word language models and the global
language model are weighted equally. We
believe that words which appear far away in the
previous word history do not have as much
importance as the ones closest to the word.
Therefore we need to place a restriction on the
number of language models. First, the
conditional probabilities obtained from the word
language models and the global language
model can be combined in a linear
interpolated model as follows:
$$P(w \mid w_1^n) = \lambda_G P_{Global}(w \mid w_1^n) + \sum_{i=1}^{m} \lambda_i P_i(w \mid w_1^n) \quad (6)$$

where

$$\lambda_G + \sum_{i=1}^{m} \lambda_i = 1 \quad (7)$$

and $P_i(w \mid w_1^n)$ is the conditional probability in
the word language model for the significant word
$w_i$, $\lambda_i$ are the corresponding weights and m is
the maximum number of word models that we
are including.
If the same weight is given to all the
word language models but not to the global
language model, and if a restriction on the
number of word language models to be included
is enforced, the weighted model is defined as:

$$P(w \mid w_1^n) = \alpha P_{Global}(w \mid w_1^n) + \frac{1-\alpha}{m} \sum_{i=1}^{m} P_i(w \mid w_1^n) \quad (8)$$

where $\alpha$ is a parameter which is chosen to optimise
the model.
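A sketch of this equally-weighted combination, following the form of equation (8) as reconstructed from the surrounding description (the function signature and default alpha are illustrative):

def p_weighted(p_global, word_model_probs, alpha=0.6):
    """The global model receives weight alpha and the remaining 1 - alpha
    is shared equally among the included word models."""
    if not word_model_probs:
        return p_global
    share = (1.0 - alpha) / len(word_model_probs)
    return alpha * p_global + share * sum(word_model_probs)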
Furthermore, a method was used based
on an exponential decay of the word-model
probabilities with distance. This stands to reason,
as a word appearing several words previously
will generally be less relevant than more recent
words. Given a sequence of words, for example,
"We had happy times in America..." 
We    had    happy    times    in    America
 5     4       3        2      1
where 5, 4, 3, 2, 1 represent the distance of each
word from the word America. Happy and Times
are significant words for which we have
individual word language models. The
exponential decay model for the word w, where
in this case w represents the significant word
America, is as follows:

$$P(w \mid w_1^n) = \frac{P_{Global}(w \mid w_1^n) + P_{Happy}(w \mid w_1^n)\, e^{-3/d} + P_{Times}(w \mid w_1^n)\, e^{-2/d}}{1 + e^{-3/d} + e^{-2/d}} \quad (9)$$
where $P_{Global}(w \mid w_1^n)$ is the conditional
probability of the word w following the phrase
$w_1 \cdots w_n$ in the global language model, and
$P_{Happy}(w \mid w_1^n)$ is the conditional probability of the
word w following the phrase $w_1 \cdots w_n$ in the word
language model for the significant word Happy.
The same definition applies for the word model
Times. d is the exponential decay distance, with
d = 5, 10, 15, etc. The decaying factor $e^{-l/d}$
introduces a cut-off:

if $l > d$ then $e^{-l/d}$ is set to 0,

where l is the word-model-to-word distance and
d is the decay distance.
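A sketch of the decayed combination of equation (9), assuming the per-word-model probabilities and their distances are already available. For the example above, the pairs would be (P_Happy, 3) and (P_Times, 2).

import math

def p_exp_decay(p_global, word_probs_with_distance, d=7):
    """word_probs_with_distance holds (P_i(w | history), distance l) pairs for
    the significant words in the cache; contributions with l > d are cut off."""
    numerator, denominator = p_global, 1.0
    for p, l in word_probs_with_distance:
        if l > d:
            continue                     # cut-off: treat exp(-l/d) as zero
        weight = math.exp(-l / d)
        numerator += p * weight
        denominator += weight
    return numerator / denominator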
So far, only the combination methods
outlined above have been explored
experimentally. However, they offer a relatively
simple means of combining the individual and
global language models. More sophisticated
models are likely to offer further performance
gains.
3 The Probabilistic-Union Model 
The next method is the probabilistic-
union model. This model is based on the logical
concept of a disjunction of conjunctions, which is
implemented as a sum of products. The union
model has previously been applied to problems
of noisy speech recognition (Ming J. et al.,
1999). Noisy conditions during speech
recognition can have a serious effect on the
likelihood of some features, which are normally
combined using the geometric mean. This noise
has a zeroing effect upon the overall likelihood
produced for that particular speech frame. The
use of the probabilistic union reduces the overall
effect that each feature has in the combination,
therefore loosening any zeroing effect.
For the word language models, some of the
conditional probabilities are zero or very small
due to the small size of some of the word-model
corpora. For these word models, many of the
words in the global training corpus are not in the
word-model training-corpus dictionary, and so
the conditional probability will in many cases be
zero or near zero, reducing the overall
probability. As in noisy speech recognition, we
wish to reduce the effect of this zeroing in the
combined model. The probabilistic-union model
is one possible solution to the zeroing
problem when combining language models.
The union model is best illustrated with
an example where the number of word models to
be included is m = 4 and the word-model
probabilities are assumed to be independent:
$$P^{(1)}_{union}(w) \propto P_1 P_2 P_3 P_4 \quad (10)$$

$$P^{(2)}_{union}(w) \propto P_1 P_2 P_3 \oplus P_1 P_2 P_4 \oplus P_1 P_3 P_4 \oplus P_2 P_3 P_4 \quad (11)$$

$$P^{(3)}_{union}(w) \propto P_1 P_2 \oplus P_1 P_3 \oplus P_1 P_4 \oplus P_2 P_3 \oplus P_2 P_4 \oplus P_3 P_4 \quad (12)$$

$$P^{(4)}_{union}(w) \propto P_1 \oplus P_2 \oplus P_3 \oplus P_4 \quad (13)$$

where $P^{(k)}_{union}(w) = P^{(k)}_{union}(w \mid w_1^n)$ is the union
model of order k, $P_i = P_i(w \mid w_1^n)$ is the
conditional probability for the significant word
$w_i$, and each expression is scaled by a normalising
constant. The symbol $\oplus$ is a probabilistic sum, i.e. its
equivalent for $P_1$ and $P_2$ is:

$$P_1 \oplus P_2 = P_1 + P_2 - P_1 P_2 \quad (14)$$
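The sketch below computes the union model of a given order, assuming (consistently with equations (10)-(13) and with the later remark that order 5 with 6 word models corresponds to "sums of the products of pairs") that order k takes the probabilistic sum of products over all subsets of size m - k + 1. The code and names are illustrative, not the authors' implementation.

from functools import reduce
from itertools import combinations
import math

def prob_sum(a, b):
    """Probabilistic sum of equation (14): a (+) b = a + b - a*b."""
    return a + b - a * b

def p_union(probs, order):
    """Unnormalised probabilistic-union model of a given order for the
    word-model probabilities `probs` (m = len(probs))."""
    m = len(probs)
    size = m - order + 1               # order 1 -> full product, order m -> sum of singles
    terms = [math.prod(subset) for subset in combinations(probs, size)]
    return reduce(prob_sum, terms)

For example, p_union([p1, p2, p3, p4], order=3) returns the probabilistic sum of the pairwise products, as in equation (12).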
The combination of the global language
model with the probabilistic-union model is
defined as follows:

$$P(w \mid w_1^n) = \alpha P_{Global}(w \mid w_1^n) + (1-\alpha) P_{union}(w \mid w_1^n) \quad (15)$$
Results 
To evaluate the behaviour of one language model
with respect to others we use perplexity. It
measures the average branching factor (per
word) of the sequence at every new word, with
respect to some source model. The lower the
branching factor, the lower the model error rate;
therefore, the lower the branching factor
(perplexity), the better the model. Let $w_i$ be a word in the
language model and $w_1^n = \{w_1, w_2, w_3, \ldots, w_n\}$
a sentence or sequence of words. The perplexity
of this sequence of words is:

$$Perplexity(w_1 w_2 \cdots w_n) = PP(w_1^n) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \ln P(w_i \mid w_1^{i-1})\right) \quad (16)$$
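A minimal sketch of equation (16), assuming every conditional probability of the test sequence is available (and non-zero after smoothing):

import math

def perplexity(word_probs):
    """Perplexity: the geometric mean of the inverse conditional
    probabilities P(w_i | w_1^{i-1}) over the test sequence."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)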
The Wall Street Journal corpus (WSJ0,
CSR-I Sennheiser, published by the LDC, ISBN
1-58563-007-1) contains about 38 million words and a
dictionary of approximately 65,000 words. We
select one quarter of the articles in the global
training corpus as our training corpus (since the
global training corpus is large and the
normalisation process takes time). To test the
new language model we use a subset of the test
file given by WSJ0, selected at random. The
training corpus that we are using contains
172,796 paragraphs, 376,589 sentences and
9,526,187 tokens. The test file contains 150
paragraphs, 486 sentences, 8,824 tokens and 1,908
word types. Although the size of this test file is
small, further experiments with bigger training
corpora and test files are planned.
Although in our first experiments we used
5-grams in the calculation of the word models,
the size of the n-gram has been reduced to 3-
grams because the process of normalisation is
slow in these experiments.
The model based on a simple weighted
combination offers improved results, up to 10%
when α = 0.6 in Eq. (8) and a combination of a
maximum of 10 word models is used. Better results were
found when the word models were weighted
depending on their distance from the current
word, that is, for the exponential decay model in
Eq. (9) with d = 7 and the number of word
models selected by the exponential cut-off
(Table 1). For this model improvements of over
17% have been found.
              Decay d
         5        6        7        8
4d    15.53%   16.31%   16.46%   16.44%
5d    15.90%   16.42%   16.52%   16.43%
6d    15.92%   16.45%   16.53%   16.41%
7d    16.02%   16.48%   16.51%   16.40%
8d    16.02%   16.46%   16.51%   16.39%
9d    15.97%   16.45%   16.51%   16.39%

Table 1. Improvement in perplexity of the
exponential decay models with respect to the
global language model (basic 3-gram model).
For the probabilistic-union model, we
have as many models as the number of word
language models. For example, if we wish to
include m = 4 word language models, the four
union models are those with orders 1 to 4
(equations (10) to (13)). The results for the
probabilistic-union model when the number of
word models is m = 5 and m = 6 are shown in the
tables below.
(m = 5)             Union Model Order
  α        5       4       3       2       1
 0.3     13%     15%     -2%    -15%    -25%
 0.4     13%     18%      6%     -3%    -10%
 0.5     12%     19%     11%      4%     -1%
 0.6     12%     19%     13%      9%      5%
 0.7     11%     18%     14%     11%      8%
 0.8      9%     15%     13%     11%      9%
 0.9      6%     10%      9%      8%      8%
(m = 6)                 Union Model Order
  α        6       5       4       3       2       1
 0.3     13%     15%     -2%    -13%    -22%    -30%
 0.4     13%     18%      6%     -2%     -8%    -13%
 0.5     13%     20%     11%      5%      1%     -3%
 0.6     12%     20%     14%      9%      6%      3%
 0.7     11%     18%     14%     11%      9%      7%
 0.8      9%     16%     13%     11%     10%      9%
 0.9      6%     11%     10%      9%      8%      7%

Table 2. Improvement in perplexity of the
probabilistic-union model with respect to the
global language model over the basic 3-gram
model.
The best result obtained so far is an
improvement of 20% when a maximum of 6 word
models is used and the order is 5, i.e. sums of the
products of pairs (Table 2). The value of alpha is
0.6.
Conclusion 
In this paper we have introduced the
concept of individual word language models to
improve language model performance.
Individual word language models permit an
accurate capture of the domains in which
significant words occur and hence improve the
language model performance. We also describe a
new method of combining models called the
probabilistic-union model, which has yet to be
fully explored, but the first results show good
performance. Even though the results are
preliminary, they indicate that individual word
models combined with the union model offer a
promising means of reducing perplexity.
Weighted, Eq. (8)              10%
Exponential decay, Eq. (9)     17%
Union model, 5 words           19%
Union model, 6 words           20%
Union model, 7 words           19%

Table 3. Improvement in perplexity for different
combinations of word models.
Acknowledgements 
Our thanks go to Dr. Phil Hanna for his
collaboration in this research.

References 

Good I. J. (1953) "The Population Frequencies of
Species and the Estimation of Population
Parameters". Biometrika, Vol. 40, pp. 237-254.

Jelinek F., Mercer R. L. and Bahl L. R. (1983) "A 
Maximum Likelihood Approach to Continuous 
Speech Recognition". IEEE Transactions on 
Pattern Analysis and Machine Intelligence. Vol. 5, 
pp. 179-190. 

Jelinek F. and Mercer R. L. (1984) "Interpolated 
estimation of Markov Source Parameters from 
Sparse Data". Pattern Recognition in Practice. 
Gelsema E., Kanal L. eds. Amsterdam: North-
Holland Publishing Co.

Katz S. M. (1987) "Estimation of Probabilities from 
Sparse Data for the Language Model Component of 
a Speech Recogniser". IEEE Transactions On 
Acoustic Speech and Signal Processing. Vol. 35(3), 
pp. 400-401. 

Kuhn R. and De Mori R. (1990) "A Cache-Based 
Natural Language Model for Speech Recognition". 
IEEE Transactions on Pattern Analysis and 
Machine Intelligence. Vol. 12 (6), pp. 570-583. 

Lau R., Rosenfeld R., Roukos S. (1993). "Trigger- 
based Language models: A Maximum entropy 
approach". IEEE ICASSP 93 Vo12, pp 45-48, 
Minneapolis, MN, U.S.A., April. 

Ming J., Stewart D., Hanna P. and Smith F. J. (1999) 
"A probabilistic Union Model Jbr Partial and 
temporal Corruption ql~ Speech ''. Automatic Speech 
Recognition and Understanding Workshop. 
Keystone, Colorado, U. S. A., December. 

O'Boyle P., Owens M. and Smith F. J. (1994) 
"Average n-gram Model of Natural Language". 
Computer Speech and Language. Vol. 8 pp 337- 
349. 
