Word Triggers and the EM Algorithm 
Christoph Tillmann and Hermann Ney 
Lehrstuhl für Informatik VI
Aachen - University of Technology 
D-52074 Aachen, Germany 
{tillmann,ney}@informatik.rwth-aachen.de
Abstract 
In this paper, we study the use of so-called word trigger pairs to improve an existing language model, which is typically a trigram model in combination with a cache component. A word trigger pair is defined as a long-distance word pair. We present two methods to select the most significant single word trigger pairs. The selected trigger pairs are used in a combined model where the interpolation parameters and trigger interaction parameters are trained by the EM algorithm.
1 Introduction 
In this paper, we study the use of so-called word trigger pairs (for short: word triggers) (Bahl et al., 1984; Lau and Rosenfeld, 1993; Tillmann and Ney, 1996) to improve an existing language model, which is typically a trigram model in combination with a cache component (Ney and Essen, 1994).
We use a reference model p(w|h), i.e. the conditional probability of observing the word w for a given history h. For a trigram model, this history h includes the two predecessor words of the word under consideration, but in general it can be the whole sequence of the last M predecessor words. The criterion for measuring the quality of a language model p(w|h) is the so-called log-likelihood criterion (Ney and Essen, 1994), which for a corpus w_1, ..., w_n, ..., w_N is defined by:
$$F := \sum_{n=1}^{N} \log p(w_n|h_n)$$
According to this definition, the log-likelihood criterion measures for each position n how well the language model can predict the next word given the knowledge about the preceding words, and computes an average over all word positions n. In the context of language modeling, the log-likelihood criterion F is converted to perplexity PP, defined by $PP := e^{-F/N}$.
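To make the two definitions concrete, here is a minimal Python sketch that computes F and PP for a toy model; the helper name and the uniform toy model are purely illustrative, not part of the original experiments:

```python
import math

def log_likelihood_and_perplexity(corpus, model):
    """F = sum_n log p(w_n | h_n); PP = exp(-F / N), with natural logarithms."""
    F = sum(math.log(model(corpus[n], corpus[:n])) for n in range(len(corpus)))
    return F, math.exp(-F / len(corpus))

# A uniform model over a 4-word vocabulary has perplexity 4: PP can be read
# as the average number of word choices the model leaves open per position.
uniform = lambda w, h: 0.25
F, PP = log_likelihood_and_perplexity(["a", "b", "c", "d"], uniform)
print(F, PP)  # about -5.545 and 4.0
```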
For applications where the topic dependence of the language model is important, e.g. text dictation, the history h may reach back several sentences, so that the history length M covers several hundred words, say M = 400, as is the case for the cache model.
To illustrate what is meant by word triggers, we give a few examples:

airline → flights
concerto → orchestra
asks → replies
neither → nor
we → ourselves
Thus word trigger pairs can be viewed as long-distance word bigrams. In this view, we are faced with the problem of finding suitable word trigger pairs. This will be achieved by analysing a large text corpus (i.e. several million running words) and learning those trigger pairs that are able to improve the baseline language model. A related approach to capturing long-distance dependencies is based on stochastic variants of link grammars (Della Pietra et al., 1994).
In several papers (Bahl et al., 1984, Lau and 
Rosenfeld, 1993, Tillmann and Ney, 1996), selection 
criteria for single word trigger pairs were studied. In 
this paper, this work is extended as follows: 
• Single-Trigger Model: We consider the definition of a single word trigger pair. There are two models we consider, namely a backing-off model and a linear interpolation model. For the case of the backing-off model, there is a closed-form solution for estimating the trigger parameter by maximum likelihood. For the linear interpolation model, there is no explicit solution
anymore, but this model is better suited for the extension towards a large number of simultaneous trigger pairs.
• Multi-Trigger Model: In practice, we have to take into account the interaction of many trigger pairs. Here, we introduce a model for this purpose. To really use the word triggers for a language model, they must be combined with an existing language model. This is achieved by using linear interpolation between the existing language model and a model for the multi-trigger effects. The parameters of the resulting model, namely the trigger parameters and one interpolation parameter, are trained by the EM algorithm.
• We present experimental results on the Wall Street Journal corpus. Both the single-trigger approach and the multi-trigger approach are used to improve the perplexity of a baseline language model. We give examples of selected trigger pairs with and without using the EM algorithm.
2 Single-Trigger Model 
In this section, we review the basic model definition for single word trigger pairs as introduced in (Tillmann and Ney, 1996).
We fix one trigger word pair (a → b) and define an extended model p_ab(w|h) with a trigger interaction parameter q(b|a). To pave the way for the following extensions, we consider the asymmetric model rather than the symmetric model as originally described in (Tillmann and Ney, 1996).
Backing-Off 
As indicated by the results of several groups (Lau and Rosenfeld, 1993; Rosenfeld, 1994; Tillmann and Ney, 1996), the word trigger pairs do not help much to predict the next word if there is already a good model based on specific contexts like trigram, bigram or cache.
Therefore, we allow the trigger interaction a → b only if the probability p(b|h) of the reference model is not sufficiently high, i.e. if p(b|h) < p_0 for a certain threshold p_0 (note that, by setting p_0 := 1.0, the trigger effect is used in all cases). Thus, we use the trigger effect only for the following subset of histories:

$$H_{ab} := \{h : a \in h \,\wedge\, p(b|h) < p_0\}$$

In the experiments, we used p_0 := 1.5/W, where W = 20000 is the vocabulary size. We define the model p_ab(w|h) as an extension of the reference model p(w|h) by a backing-off technique (Katz, 1987):

$$p_{ab}(w|h) = \begin{cases} q(b|a) & \text{if } h \in H_{ab},\ w = b \\[4pt] [1 - q(b|a)] \cdot \dfrac{p(w|h)}{\sum_{w' \neq b} p(w'|h)} & \text{if } h \in H_{ab},\ w \neq b \\[4pt] p(w|h) & \text{if } h \notin H_{ab} \end{cases}$$
For a training corpus w_1 ... w_N, we consider the log-likelihood functions of both the extended model and the reference model p(w_n|h_n), where we define the history h_n:

$$h_n := w_{n-M}^{n-1} = w_{n-M} \ldots w_{n-2} w_{n-1}$$
For the difference F_ab - F_0 in the log-likelihoods of the extended language model p_ab(w|h) and the reference model p(w|h), we obtain:

$$\begin{aligned}
F_{ab} - F_0 &= \sum_{n=1}^{N} \log \frac{p_{ab}(w_n|h_n)}{p(w_n|h_n)} \\
&= \sum_{n:\, h_n \in H_{ab}} \log \frac{p_{ab}(w_n|h_n)}{p(w_n|h_n)} \\
&= \sum_{h \in H_{ab}} \sum_{w} N(h,w) \log \frac{p_{ab}(w|h)}{p(w|h)} \\
&= \sum_{h \in H_{ab}} \left[ N(h,b) \log \frac{q(b|a)}{p(b|h)} + N(h,\bar{b}) \log \frac{1 - q(b|a)}{1 - p(b|h)} \right] \\
&= N(a;b) \log q(b|a) + \bar{N}(a;b) \log[1 - q(b|a)] \\
&\quad - \sum_{h \in H_{ab}} \Big[ N(h,b) \log p(b|h) + N(h,\bar{b}) \log[1 - p(b|h)] \Big]
\end{aligned}$$

where we have used the usual counts N(h,w), with N(h,\bar{b}) := \sum_{w \neq b} N(h,w):

$$N(h,w) := \sum_{n:\, h_n = h,\, w_n = w} 1$$

and two additional counts N(a;b) and \bar{N}(a;b) defined particularly for word trigger modeling:

$$N(a;b) := \sum_{h \in H_{ab}} N(h,b) = \sum_{n:\, h_n \in H_{ab},\, w_n = b} 1$$

$$\bar{N}(a;b) := \sum_{h \in H_{ab}} N(h,\bar{b}) = \sum_{n:\, h_n \in H_{ab},\, w_n \neq b} 1$$
Note that, for the counts N(a;b) and \bar{N}(a;b), it does not matter how often the triggering word a actually occurred in the history h ∈ H_ab.

The unknown trigger parameter q(b|a) is estimated using maximum likelihood estimation. By taking the derivative and setting it to zero, we obtain the estimate:

$$q(b|a) = \frac{N(a;b)}{N(a;b) + \bar{N}(a;b)}$$

which can be interpreted as the relative frequency of the occurrence of the word trigger (a → b).
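As an illustration, the counts and the closed-form estimate for one pair (a → b) can be computed with a short Python sketch; the reference-model callable p_ref and the default values for M and p0 are stand-ins for the quantities defined above:

```python
def estimate_q(corpus, a, b, p_ref, M=400, p0=1.5 / 20000):
    """Closed-form estimate q(b|a) = N(a;b) / (N(a;b) + Nbar(a;b)).

    A position n belongs to H_ab iff a occurs among the last M words
    and the reference model assigns p(b|h_n) < p0.
    """
    N_ab, Nbar_ab = 0, 0  # N(a;b) and Nbar(a;b)
    for n, w in enumerate(corpus):
        h = corpus[max(0, n - M):n]
        if a in h and p_ref(b, h) < p0:  # h_n in H_ab
            if w == b:
                N_ab += 1
            else:
                Nbar_ab += 1
    total = N_ab + Nbar_ab
    return N_ab / total if total > 0 else 0.0
```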
Linear Interpolation 
Although the backing-off method presented above results in a closed-form solution for the trigger parameter q(b|a), the disadvantage is that we have to use an explicit probability threshold p_0 to decide whether or not the trigger effect applies. Furthermore, the ultimate goal is to combine several word trigger pairs into a single model, and it is not clear how this could be done with the backing-off model.
Therefore, we replace the backing-off model by the corresponding model for linear interpolation:

$$p_{ab}(w|h) = \begin{cases} [1 - q(b|a)]\, p(w|h) + \delta(w,b)\, q(b|a) & \text{if } a \in h \\ p(w|h) & \text{if } a \notin h \end{cases}$$

$$\phantom{p_{ab}(w|h)} = \begin{cases} [1 - q(b|a)]\, p(b|h) + q(b|a) & \text{if } a \in h,\ w = b \\ [1 - q(b|a)]\, p(w|h) & \text{if } a \in h,\ w \neq b \\ p(w|h) & \text{if } a \notin h \end{cases}$$

where δ(w,v) = 1 if and only if v = w, and δ(w,v) = 0 otherwise. Note that this interpolation model allows a smooth transition from no trigger effect (q(b|a) → 0) to a strong trigger effect (q(b|a) → 1).
For a corpus w_1 ... w_n ... w_N, we have the log-likelihood difference:

$$\begin{aligned}
\Delta F_{ab} &= \sum_{n:\, a \in h_n,\, b \neq w_n} \log[1 - q(b|a)] + \sum_{n:\, a \in h_n,\, b = w_n} \log\left(1 - q(b|a) + \frac{q(b|a)}{p(b|h_n)}\right) \\
&= [M(a) - N(a;b)] \cdot \log[1 - q(b|a)] + \sum_{n:\, a \in h_n,\, b = w_n} \log\left(1 - q(b|a) + \frac{q(b|a)}{p(b|h_n)}\right)
\end{aligned}$$

with the count definition M(a):

$$M(a) := \sum_{n:\, a \in h_n} 1$$

Thus M(a) counts how many of the positions n (n = 1, ..., N) have a history h_n containing the word a, and is therefore different from the unigram count N(a).
To apply maximum likelihood estimation, we take the derivative with respect to q(b|a) and obtain the following implicit equation for q(b|a) after some elementary manipulations:

$$M(a) = \sum_{n:\, a \in h_n,\, b = w_n} \frac{1}{[1 - q(b|a)] \cdot p(b|h_n) + q(b|a)}$$

No explicit solution is possible. However, we can give bounds for the exact solution (proof omitted):

$$\frac{\frac{N(a;b)}{M(a)} - \langle p(b) \rangle}{1 - \langle p(b) \rangle} \;\leq\; q(b|a) \;\leq\; \frac{N(a;b)}{M(a)}$$

with the definition of the average value ⟨p(b)⟩:

$$\langle p(b) \rangle := \frac{1}{N(a;b)} \sum_{n:\, a \in h_n,\, b = w_n} p(b|h_n)$$

and an additional count N(a;b):

$$N(a;b) := \sum_{n:\, a \in h_n,\, b = w_n} 1$$
An improved estimate can be obtained by the EM algorithm (Dempster et al., 1977):

$$\hat{q}(b|a) = \frac{1}{M(a)} \sum_{n:\, a \in h_n,\, b = w_n} \frac{q(b|a)}{[1 - q(b|a)] \cdot p(b|h_n) + q(b|a)}$$
An example of the full derivation of the iteration formula for the EM algorithm will be given in the next section for the more general case of a multi-trigger language model.
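For concreteness, the fixed-point iteration above can be written down in a few lines of Python; this sketch assumes that the probabilities p(b|h_n) at the positions with a ∈ h_n and w_n = b, as well as the count M(a), have been precomputed:

```python
def em_single_trigger(p_b_given_h, M_a, iterations=10, q=0.1):
    """EM iteration q <- (1/M(a)) * sum_n q / ((1-q)*p(b|h_n) + q),
    summing over the positions n with a in h_n and w_n = b."""
    for _ in range(iterations):
        q = sum(q / ((1.0 - q) * p + q) for p in p_b_given_h) / M_a
    return q

# Toy data: b follows a at 20 of the M(a) = 1000 positions whose history
# contains a, each time with reference probability p(b|h_n) = 0.001.
print(em_single_trigger([0.001] * 20, 1000))  # about 0.019, inside the bounds above
```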
3 Multi-Trigger Model 
The trigger pairs are used in combination with a conventional baseline model p(w_n|h_n) (e.g. m-gram) to define a trigger model p_T(w_n|h_n):

$$p_T(w_n|h_n) = (1 - \lambda) \cdot p(w_n|h_n) + \frac{\lambda}{M_n} \sum_{m} \alpha(w_n|w_{n-m})$$

with the trigger parameters α(w|v) that must be normalized for each v:

$$\sum_{w} \alpha(w|v) = 1$$
To simplify the notation, we have used the convention:

$$\sum_{m} := \sum_{m \in \mathcal{M}_n}$$

with

• $\mathcal{M}_n$: the set of triggering words for position n
• $M_n = |\mathcal{M}_n|$: the number of triggering words for position n
Unfortunately, no method is known that produces closed-form solutions for the maximum-likelihood estimates. Therefore, we resort to the EM algorithm in order to obtain the maximum-likelihood estimates.

The framework of the EM algorithm is based on the so-called Q(μ; μ̂) function, where μ̂ is the new estimate obtained from the previous estimate μ (Baum, 1972; Dempster et al., 1977). The symbol μ stands for the whole set of parameters to be estimated. The Q(μ; μ̂) function is an extension of the usual log-likelihood function and is for our model:

$$\begin{aligned}
Q(\mu;\hat{\mu}) &= Q(\{\lambda, \alpha(w|v)\}; \{\hat{\lambda}, \hat{\alpha}(w|v)\}) \\
&= \sum_{n=1}^{N} \Bigg[ \frac{(1-\lambda)\, p(w_n|h_n)}{p_T(w_n|h_n)} \log\big[(1-\hat{\lambda})\, p(w_n|h_n)\big] \\
&\qquad + \sum_{m} \frac{\frac{\lambda}{M_n}\, \alpha(w_n|w_{n-m})}{p_T(w_n|h_n)} \log\Big[\frac{\hat{\lambda}}{M_n}\, \hat{\alpha}(w_n|w_{n-m})\Big] \Bigg]
\end{aligned}$$
Taking the partial derivatives and solving for λ̂, we obtain:

$$\hat{\lambda} = \frac{1}{N} \sum_{n=1}^{N} \frac{\frac{\lambda}{M_n} \sum_m \alpha(w_n|w_{n-m})}{(1-\lambda)\, p(w_n|h_n) + \frac{\lambda}{M_n} \sum_m \alpha(w_n|w_{n-m})}$$

When taking the partial derivatives with respect to α̂(w|v), we use the method of Lagrangian multipliers for the normalization constraints and obtain:

$$\hat{\alpha}(w|v) = \frac{A(w,v)}{\sum_{w'} A(w',v)}$$

with

$$A(w,v) = \alpha(w|v) \cdot \sum_{n=1}^{N} \delta(w, w_n) \cdot \frac{\frac{\lambda}{M_n} \sum_m \delta(v, w_{n-m})}{(1-\lambda)\, p(w_n|h_n) + \frac{\lambda}{M_n} \sum_m \alpha(w_n|w_{n-m})}$$
Note how the interaction of word triggers is taken into account by a local weighting effect: for a fixed position n with w_n = w, the contribution of a particular observed distant word pair (v ... w) to α̂(w|v) depends on the interaction parameters of all other word pairs (v' ... w) with v' ∈ {w_{n-m}} and on the baseline probability p(w|h).

Note that the local convergence property still holds when the length M_n of the history is dependent on the word position n, e.g. if the history reaches back only to the beginning of the current paragraph.
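The two re-estimation formulas translate directly into code. Below is a minimal Python sketch of one EM iteration for λ and α(w|v); it assumes, purely for illustration, that the triggering set $\mathcal{M}_n$ is the window of the last M_hist words and that p_base is some given baseline-model callable (both names are hypothetical):

```python
from collections import defaultdict

def em_iteration(corpus, p_base, alpha, lam, M_hist=400):
    """One EM update for p_T(w|h) = (1-lam)*p(w|h) + (lam/M_n)*sum_m alpha(w|w_{n-m})."""
    lam_acc, positions = 0.0, 0
    A = defaultdict(float)  # accumulators A(w, v)
    for n, w in enumerate(corpus):
        h = corpus[max(0, n - M_hist):n]
        if not h:
            continue
        positions += 1
        M_n = len(h)
        trig = (lam / M_n) * sum(alpha.get((w, v), 0.0) for v in h)
        p_T = (1.0 - lam) * p_base(w, h) + trig
        lam_acc += trig / p_T  # posterior mass of the trigger component
        for v in h:            # repeated history words accumulate, as in sum_m delta(v, w_{n-m})
            A[(w, v)] += (lam / M_n) * alpha.get((w, v), 0.0) / p_T
    new_lam = lam_acc / positions
    norm = defaultdict(float)  # Lagrangian normalization per triggering word v
    for (w, v), val in A.items():
        norm[v] += val
    new_alpha = {(w, v): val / norm[v] for (w, v), val in A.items() if norm[v] > 0.0}
    return new_lam, new_alpha
```

Iterating this function, with suitable initial values for λ and α(w|v), corresponds to the joint training procedure used in the experiments below.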
A remark about the functional form of the multi-trigger model is in order. The form chosen in this paper is a sort of linear combination of the trigger pairs. A different approach is to combine the various trigger pairs in a multiplicative way, which results from a maximum-entropy approach (Lau and Rosenfeld, 1993).
4 Experimental results 
Language Model Training and Corpus 
We first describe the details of the language model used and of its training. The trigger pairs were selected as described in Section 2 and were used to extend a baseline language model. As in many other systems, the baseline language model used here consists of two parts, an m-gram model (here: trigram/bigram/unigram) and a cache part (Ney and Essen, 1994). Since the cache effect is equivalent to self-trigger pairs (a → a), we can expect that there is some trade-off between the word triggers and the cache, which was confirmed in some initial informal experiments.
For this reason, it is suitable to consider the simultaneous interpolation of these three language model parts to define the refined language model. Thus we have the following equation for the refined language model p(w_n|h_n):

$$p(w_n|h_n) = \lambda_M \cdot p_M(w_n|h_n) + \lambda_C \cdot p_C(w_n|h_n) + \lambda_T \cdot p_T(w_n|h_n)$$

where p_M(w_n|h_n) is the m-gram model, p_C(w_n|h_n) is the cache model and p_T(w_n|h_n) is the trigger model. The three interpolation parameters must be normalized:

$$\lambda_M + \lambda_C + \lambda_T = 1$$

The details of the m-gram model are similar to those given in (Ney and Generet, 1995). The cache model p_C(w_n|w_{n-M}^{n-1}) is defined as:

$$p_C(w_n|w_{n-M}^{n-1}) = \frac{1}{M} \sum_{m=1}^{M} \delta(w_n, w_{n-m})$$
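As a summary of this subsection, here is a minimal Python sketch of the combined model; p_m is an assumed m-gram callable, alpha holds the trigger parameters, and the default weights are the EM-trained values from Table 1b for the 39-million word corpus:

```python
def combined_prob(w, h, p_m, alpha, lam_M=0.86, lam_C=0.07, lam_T=0.07):
    """p(w|h) = lam_M*p_M(w|h) + lam_C*p_C(w|h) + lam_T*p_T(w|h),
    with lam_M + lam_C + lam_T = 1 and a non-empty history h."""
    M = len(h)
    p_cache = sum(1.0 for v in h if v == w) / M            # cache: averaged Kronecker deltas
    p_trig = sum(alpha.get((w, v), 0.0) for v in h) / M    # trigger: averaged alpha parameters
    return lam_M * p_m(w, h) + lam_C * p_cache + lam_T * p_trig
```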
Table 1: Effect of word triggers on test-set perplexity (a) and interpolation parameters λ_M/λ_C/λ_T (b).

(a) language model        |  1 Mio | 5 Mio | 39 Mio
    trigram with no cache |  255.1 | 168.4 |  104.9
    trigram with cache    |  200.0 | 138.9 |   92.1
    + triggers: no EM     |  183.2 | 129.8 |   88.5
    + triggers: with EM   |  179.0 | 127.2 |   87.4

(b) + triggers: no EM     | .83/.11/.06 | .86/.09/.05 | .89/.08/.04
    + triggers: with EM   | .82/.10/.09 | .85/.09/.07 | .86/.07/.07
where δ(w,v) = 1 if and only if v = w. The trigger model p_T(w_n|h_n) is defined as:

$$p_T(w_n|w_{n-M}^{n-1}) = \frac{1}{M} \sum_{m=1}^{M} \alpha(w_n|w_{n-m})$$
There were two methods used to compute the trigger parameters:
• method 'no EM': The trigger parameters α(w|v) are obtained by renormalization from the single trigger parameters q(w|v) (see the sketch after this list):

$$\alpha(w|v) = \frac{q(w|v)}{\sum_{w'} q(w'|v)}$$

The backing-off method described in Section 2 was used to select the top-K most significant single trigger pairs. In the experiments, we used K = 1.5 million trigger pairs.
• method 'with EM': The trigger parameters α(w|v) are initialized by the 'no EM' values and re-estimated using the EM algorithm as described in Section 3. The typical number of iterations is 10.
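The 'no EM' renormalization amounts to only a few lines; a sketch, assuming q is a dict mapping the selected pairs (w, v) to q(w|v):

```python
from collections import defaultdict

def renormalize(q):
    """'no EM' initialization: alpha(w|v) = q(w|v) / sum_w' q(w'|v)."""
    totals = defaultdict(float)
    for (w, v), val in q.items():
        totals[v] += val
    return {(w, v): val / totals[v] for (w, v), val in q.items()}
```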
The experimental tests were performed on the Wall Street Journal (WSJ) task (Paul and Baker, 1992) for a vocabulary size of 20000 words. To train the m-gram language model and the interpolation parameters, we used three training corpora with sizes of 1, 5 and 39 million running words. However, the word trigger pairs were always selected and trained from the 39-million word training corpus. In the experiments, the history h was defined to start with the most recent article delimiter.
The interpolation parameters are trained by using the EM algorithm. In the case of the 'EM triggers', this is done jointly with the re-estimation of the trigger parameters α(w|v). To avoid overfitting of the interpolation parameters on the training corpus, which was used to train both the m-gram language model and the interpolation parameters, we applied the leaving-one-out technique.
Examples of Trigger Pairs 
In Table 2 and Table 3 we present examples of selected trigger pairs for the two methods no EM and with EM. For a fixed triggering word v, we show the most significant triggered words w along with the trigger interaction parameter α(w|v) for both methods. There are 8 triggering words v, for each of which we show the 15 triggered words w with the highest trigger parameter α(w|v). The triggered words w are sorted by the α(w|v) parameter. From the table it can be seen that for the no EM trigger pairs the trigger parameter α(w|v) varies only slightly over the triggered words w. This is different for the EM triggers, where the trigger parameters α(w|v) have a much larger variation. In addition, the probability mass of the EM-trained trigger pairs is much more concentrated on the first 15 triggered words.
Perplexity Results 
The perplexity was computed on a test corpus of 325 000 words from the WSJ task. The results are shown in Table 1 for each of the three training corpora (1, 5 and 39 million words). For comparison purposes, the perplexities of the trigram model with and without cache are included. As can be seen from this table, the trigger model is able to improve the perplexities in all conditions, and the EM triggers are consistently (although sometimes only slightly) better than the no EM triggers. There is an effect of the training corpus size: if the trigram model is already well trained, the trigger model does not help as much as for a less well trained trigram model. This observation is confirmed by part (b) of Table 1, which shows the EM-trained interpolation parameters. As the size of the training corpus decreases, the relative weight of the cache and trigger components increases. Furthermore, in the last row of Table 1 it can be seen that the relative weight of the trigger component increases after the EM training, which indicates that the parameters of our trigger model are successfully trained by this EM approach.
Table 2: Triggered words w along with α(w|v) for triggering word v.

v = "added"
  no EM:   declining 0.011, adding 0.010, Bayerische 0.010, positive 0.009, speculation 0.009, concerns 0.009, finished 0.008, remaining 0.008, reporting 0.008, confusion 0.008, excess 0.007, falling 0.007, disappointing 0.007, eased 0.007, equities 0.007
  with EM: declined 0.106, asked 0.080, estimated 0.070, asserted 0.055, dropped 0.049, concerns 0.036, conceded 0.033, adding 0.029, recommended 0.028, contended 0.023, confusion 0.023, reporting 0.020, adequate 0.017, referring 0.016, contributed 0.016

v = "airlines"
  no EM:   passengers 0.015, careers 0.013, passenger 0.013, United's 0.013, Trans 0.012, Continental's 0.011, Eastern's 0.010, flights 0.010, fare 0.009, airline 0.009, American's 0.009, pilots' 0.008, airlines' 0.008, travel 0.008, planes 0.008
  with EM: airline 0.296, air 0.064, Continental 0.056, carrier 0.049, carriers 0.046, passengers 0.037, flight 0.035, United 0.032, flights 0.029, Delta 0.026, fares 0.024, Eastern 0.023, carrier's 0.020, frequent 0.018, passenger 0.018

v = "business"
  no EM:   competitors 0.004, changing 0.004, creative 0.004, simply 0.004, deals 0.004, competing 0.004, hiring 0.004, Armonk 0.004, personnel 0.004, businesses 0.003, faster 0.003, offices 0.003, inventory 0.003, successful 0.003, color 0.003
  with EM: corporate 0.146, businesses 0.102, marketing 0.056, customers 0.047, computer 0.026, executives 0.024, working 0.023, competitive 0.022, manufacturing 0.019, product 0.018, profits 0.017, corporations 0.016, started 0.015, businessmen 0.014, offices 0.011

v = "buy"
  no EM:   purchases 0.005, acquiring 0.005, privately 0.005, deals 0.004, speculative 0.004, partly 0.004, financing 0.004, huge 0.004, immediately 0.004, aggressive 0.004, declining 0.004, borrowing 0.004, cheap 0.004, cyclical 0.004, investing 0.004
  with EM: purchase, buying 0.092, purchases 0.051, well 0.050, bought 0.042, cash 0.030, deal 0.028, potential 0.026, future 0.025, couldn't 0.024, giving 0.022, buys 0.019, together 0.018, bid 0.018, buyers 0.017
Table 3: Triggered words w along with α(w|v) for triggering word v.

v = "company"
  no EM:   adding 0.002, acquiring 0.001, publicly 0.001, depressed 0.001, financially 0.001, roughly 0.001, prior 0.001, reduced 0.001, overseas 0.001, remaining 0.001, competitors 0.001, substantially 0.001, rival 0.001, partly 0.001, privately
  with EM: management 0.092, including 0.037, top 0.028, employees 0.027, will 0.024, plans 0.018, unit 0.017, couldn't 0.017, hasn't 0.016, subsidiary 0.014, previously 0.014, now 0.013, since 0.011, won't 0.011, executives 0.011

v = "Ford"
  no EM:   Ford's 0.039, Dearborn 0.020, Chrysler's 0.014, Chevrolet 0.013, Lincoln 0.013, truck 0.012, Mazda 0.011, vehicle 0.010, Dodge 0.009, incentive 0.009, Buick 0.009, dealer 0.008, vans 0.008, car's 0.008, Honda 0.008
  with EM: Ford's 0.651, auto 0.063, Dearborn 0.056, Chrysler 0.028, Mercury 0.022, Taurus 0.021, Mustang 0.013, Escort 0.011, Lincoln 0.010, Tempo 0.009, parts 0.007, car 0.006, pattern 0.006, Henry 0.006, Jaguar 0.006

v = "love"
  no EM:   characters 0.006, physical 0.005, turns 0.005, beautiful 0.005, comic 0.005, playing 0.005, fun 0.005, herself 0.005, rock 0.005, stuff 0.004, dance 0.004, evil 0.004, God 0.004, pain 0.004, passion 0.004
  with EM: human 0.051, lovers 0.044, passion 0.039, turns 0.031, beautiful 0.030, spirit 0.029, marriage 0.020, phil 0.019, lounge 0.017, dresses 0.017, stereotype 0.016, wonder 0.015, songs 0.015, beautifully 0.014, muscular 0.014

v = "says"
  no EM:   deep 0.002, changing 0.002, starting 0.002, simply 0.002, tough 0.002, dozens 0.002, driving 0.002, twice 0.002, experts 0.002, cheap 0.002, winning 0.002, minor 0.002, critics 0.002, nearby 0.002, living 0.002
  with EM: adds 0.090, low 0.053, suggests 0.031, concedes 0.024, explains 0.019, contends 0.017, notes 0.016, agrees 0.016, thinks 0.015, insists 0.015, get 0.014, hot 0.013, early 0.013, sees 0.012, consultant 0.012
5 Conclusions 
We have presented a model and an algorithm for training a multi-word trigger model along with some experimental evaluations. The results can be summarized as follows:

• The trigger parameters for all word triggers are jointly trained using the EM algorithm. This leads to a systematic (although small) improvement over the condition that each trigger parameter is trained separately.

• The word-trigger model is used in combination with a full language model (m-gram/cache). Thus the perplexity is reduced from 138.9 to 127.2 for the 5-million word training corpus and from 92.1 to 87.4 for the 39-million word corpus.

References 
L. E. Baum. 1972. "An Inequality and Associated 
Maximization Technique in Statistical Estima- 
tion of a Markov Process", Inequalities, Vol. 3, 
No. 1, pp. 1-8. 
L. R. Bahl, F. Jelinek, R. L. Mercer, A. Nadas. 1984. 
"Next Word Statistical Predictor", IBM Tech. 
Disclosure Bulletin, Vol. 27, No. 7A, pp. 3941- 
42, December. 
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, 
R. L. Mercer. 1993. "Mathematics of Sta- 
tistical Machine Translation: Parameter Es- 
timation", Computational Linguistics, Vol. 19, 
No. 2, pp. 263-311, June. 
A. P. Dempster, N. M. Laird, D. B. Rubin. 1977. 
"Maximum Likelihood from Incomplete Data 
via the EM Algorithm", J. Royal Statist. Soc. 
Ser. B (methodological), Vol. 39, pp. 1-38. 
S. M. Katz. 1987. "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 35, pp. 400-401, March.
R. Lau, R. Rosenfeld, S. Roukos. 1993. "Trigger- 
Based Language Models: A Maximum En- 
tropy Approach", in Proc. IEEE Inter. Conf. 
on Acoustics, Speech and Signal Processing, 
Minneapolis, MN, Vol. II, pp. 45-48, April. 
H. Ney, U. Essen, R. Kneser. 1994. "On Structuring 
Probabilistic Dependencies in Language Mod- 
eling", Computer Speech and Language, Vol. 8, 
pp. 1-38. 
H. Ney, M. Generet, F. Wessel. 1995. "Extensions 
of Absolute Discounting for Language Mod- 
eling", in Proc. Fourth European Conference 
on Speech Communication and Technology, 
Madrid, pp. 1245-1248, September. 
D. B. Paul and J. M. Baker. 1992. "The Design for the Wall Street Journal-based CSR Corpus", in Proc. of the DARPA SLS Workshop, pp. 357-361, February.
S. Della Pietra, V. Della Pietra, J. Gillett, J. Laf- 
ferty, H. Printz and L. Ures. 1994. "Infer- 
ence and Estimation of a Long-Range Tri- 
gram Model", in Lecture Notes in Artificial 
Intelligence, Grammatical Inference and Ap- 
plications, ICGI-94, Alicante, Spain, Springer- 
Verlag, pp. 78-92, September. 
R. Rosenfeld. 1994. "Adaptive Statistical Lan- 
guage Modeling: A Maximum Entropy Ap- 
proach", Ph.D. thesis, School of Computer Sci- 
ence, Carnegie Mellon University, Pittsburgh, 
PA, CMU-CS-94-138. 
C. Tillmann and H. Ney. 1996. "Selection Criteria for Word Triggers in Language Modeling", in Lecture Notes in Artificial Intelligence, Int. Colloquium on Grammatical Inference, Montpellier, France, Springer-Verlag, pp. 95-106, September.
