Exploiting Syntactic Structure for Language Modeling 
Ciprian Chelba and Frederick Jelinek 
Center for Language and Speech Processing 
The Johns Hopkins University, Barton Hall 320 
3400 N. Charles St., Baltimore, MD-21218, USA 
{chelba,jelinek}@jhu.edu 
Abstract 
The paper presents a language model that devel- 
ops syntactic structure and uses it to extract mean- 
ingful information from the word history, thus en- 
abling the use of long distance dependencies. The 
model assigns a probability to every joint sequence 
of words and binary parse structure with headword 
annotation, and operates in a left-to-right manner; 
it is therefore usable for automatic speech recognition. 
The model, its probabilistic parameterization, and a 
set of experiments meant to evaluate its predictive 
power are presented; an improvement over standard 
trigram modeling is achieved. 
1 Introduction 
The main goal of the present work is to develop a lan- 
guage model that uses syntactic structure to model 
long-distance dependencies. During the summer96 
DoD Workshop a similar attempt was made by the 
dependency modeling group. The model we present 
is closely related to the one investigated in (Chelba 
et al., 1997), however different in a few important 
aspects: 
• our model operates in a left-to-right manner, al- 
lowing the decoding of word lattices, as opposed to 
the one referred to previously, where only whole sen- 
tences could be processed, thus reducing its applica- 
bility to n-best list re-scoring; the syntactic structure 
is developed as a model component; 
• our model is a factored version of the one 
in (Chelba et al., 1997), thus enabling the calculation 
of the joint probability of words and parse structure; 
this was not possible in the previous case due to the 
huge computational complexity of the model. 
Our model develops syntactic structure incremen- 
tally while traversing the sentence from left to right. 
This is the main difference between our approach 
and other approaches to statistical natural language 
parsing. Our parsing strategy is similar to the in- 
cremental syntax ones proposed relatively recently 
in the linguistic community (Philips, 1996). The 
probabilistic model, its parameterization and a few 
experiments that are meant to evaluate its potential 
for speech recognition are presented. 
Figure 1: Partial parse 
2 The Basic Idea and Terminology 
Consider predicting the word after in the sentence: 
the contract ended with a loss of 7 cents 
after trading as low as 89 cents. 
A 3-gram approach would predict after from 
(7, cents) whereas it is intuitively clear that the 
strongest predictor would be ended which is outside 
the reach of even 7-grams. Our assumption is that 
what enables humans to make a good prediction of 
after is the syntactic structure in the past. The 
linguistically correct partial parse of the word his- 
tory when predicting after is shown in Figure 1. 
The word ended is called the headword of the con- 
stituent (ended (with (...) )) and ended is an ex- 
posed headword when predicting after -- topmost 
headword in the largest constituent that contains it. 
The syntactic structure in the past filters out irrel- 
evant words and points to the important ones, thus 
enabling the use of long distance information when 
predicting the next word. 
Our model will attempt to build the syntactic 
structure incrementally while traversing the sen- 
tence left-to-right. The model will assign a probabil- 
ity P(W, T) to every sentence W with every possible 
POStag assignment, binary branching parse, non- 
terminal label and headword annotation for every 
constituent of T. 
Let W be a sentence of length n words to which 
we have prepended <s> and appended </s> so 
that w_0 = <s> and w_{n+1} = </s>. Let W_k be the 
word k-prefix w_0 ... w_k of the sentence and W_kT_k 
the word-parse k-prefix. 

Figure 2: A word-parse k-prefix 

Figure 3: Complete parse 

Figure 4: Before an adjoin operation 

To stress this point, a 
word-parse k-prefix contains -- for a given parse 
-- only those binary subtrees whose span is com- 
pletely included in the word k-prefix, excluding 
w_0 = <s>. Single words along with their POStag 
can be regarded as root-only trees. Figure 2 shows 
a word-parse k-prefix; h_0 .. h_{-m} are the ex- 
posed heads, each head being a pair (headword, non- 
terminal label), or (word, POStag) in the case of a 
root-only tree. A complete parse -- Figure 3 -- is 
any binary parse of the 
(w_1, t_1) ... (w_n, t_n) (</s>, SE) sequence with the 
restriction that (</s>, TOP') is the only allowed 
head. Note that ((w_1, t_1) ... (w_n, t_n)) needn't be a 
constituent, but for the parses where it is, there is 
no restriction on which of its words is the headword 
or what is the non-terminal label that accompanies 
the headword. 
The model will operate by means of three mod- 
ules: 
• WORD-PREDICTOR predicts the next word 
wk+l given the word-parse k-prefix and then passes 
control to the TAGGER; 
• TAGGER predicts the POStag of the next word 
tk+l given the word-parse k-prefix and the newly 
predicted word and then passes control to the 
PARSER; 
• PARSER grows the already existing binary 
branching structure by repeatedly generating the 
transitions: 
(unary, NTlabel), (adjoin-left, NTlabel) or 
(adjoin-right, NTlabel) until it passes control 
to the PREDICTOR by taking a null transition. 
NTlabel is the non-terminal label assigned to the 
newly built constituent and {left, right} specifies 
where the new headword is inherited from. 
The operations performed by the PARSER are 
illustrated in Figures 4-6 and they ensure that all 
possible binary branching parses with all possible 
Figure 5: Result of adjoin-left under NTlabel 
headword and non-terminal label assignments for 
the wl ... wk word sequence can be generated. The 
following algorithm formalizes the above description 
of the sequential generation of a sentence with a 
complete parse. 
Transition t; // a PARSER transition
predict (<s>, SB);
do{
  //WORD-PREDICTOR and TAGGER
  predict (next_word, POStag);
  //PARSER
  do{
    if(h_{-1}.word != <s>){
      if(h_0.word == </s>)
        t = (adjoin-right, TOP');
      else{
        if(h_0.tag == NTlabel)
          t = [(adjoin-{left,right}, NTlabel), null];
        else
          t = [(unary, NTlabel), (adjoin-{left,right}, NTlabel), null];
      }
    }
    else{
      if(h_0.tag == NTlabel)
        t = null;
      else
        t = [(unary, NTlabel), null];
    }
  }while(t != null) //done PARSER
}while(!(h_0.word == </s> && h_{-1}.word == <s>))
t = (adjoin-right, TOP); //adjoin <s>_SB; DONE
The unary transition is allowed only when the 
most recent exposed head is a leaf of the tree -- 
a regular word along with its POStag -- hence it 
can be taken at most once at a given position in the 
input word string. The second subtree in Figure 2 
provides an example of a unary transition followed 
by a null transition. 

Figure 6: Result of adjoin-right under NTlabel 
It is easy to see that any given word sequence 
with a possible parse and headword annotation is 
generated by a unique sequence of model actions. 
This will prove very useful in initializing our model 
parameters from a treebank -- see section 3.5. 
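The effect of the three PARSER transitions on the stack of exposed heads can be sketched as follows. This is an illustration of ours, not the authors' implementation; the class and method names are hypothetical:

```python
class HeadStack:
    """Stack of exposed heads; the last list element is h_0."""
    def __init__(self):
        self.heads = []

    def shift(self, word, postag):
        # WORD-PREDICTOR + TAGGER: a new word enters as a
        # root-only tree with head (word, POStag)
        self.heads.append((word, postag))

    def unary(self, ntlabel):
        # (unary, NTlabel): relabel the leaf h_0 with a non-terminal
        word, _ = self.heads[-1]
        self.heads[-1] = (word, ntlabel)

    def adjoin_left(self, ntlabel):
        # (adjoin-left, NTlabel): merge h_{-1} and h_0; the new
        # constituent inherits its headword from h_{-1}
        w = self.heads[-2][0]
        self.heads[-2:] = [(w, ntlabel)]

    def adjoin_right(self, ntlabel):
        # (adjoin-right, NTlabel): merge h_{-1} and h_0; the new
        # constituent inherits its headword from h_0
        w = self.heads[-1][0]
        self.heads[-2:] = [(w, ntlabel)]

s = HeadStack()
s.shift("the", "DT")
s.shift("contract", "NN")
s.adjoin_right("NP")   # (the contract), headed by "contract"
s.shift("ended", "VBD")
s.adjoin_right("S")    # ((the contract) ended), headed by "ended"
```

After these operations the single exposed head is ("ended", "S"), which is exactly the long-distance predictor the example in section 2 relies on.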
3 Probabilistic Model 
The probability P(W, T) of a word sequence W and 
a complete parse T can be broken into: 
P(W, T) = 
∏_{k=1}^{n+1} [ P(w_k/W_{k-1}T_{k-1}) · P(t_k/W_{k-1}T_{k-1}, w_k) · 
∏_{i=1}^{N_k} P(p_i^k/W_{k-1}T_{k-1}, w_k, t_k, p_1^k ... p_{i-1}^k) ] (1) 
where: 
• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix 
• w_k is the word predicted by WORD-PREDICTOR 
• t_k is the tag assigned to w_k by the TAGGER 
• N_k - 1 is the number of operations the PARSER 
executes before passing control to the WORD- 
PREDICTOR (the N_k-th operation at position k is 
the null transition); N_k is a function of T 
• p_i^k denotes the i-th PARSER operation carried out 
at position k in the word string; 
p_1^k ∈ {(unary, NTlabel), 
(adjoin-left, NTlabel), 
(adjoin-right, NTlabel), null}, 
p_i^k ∈ {(adjoin-left, NTlabel), 
(adjoin-right, NTlabel)}, 1 < i < N_k , 
p_i^k = null, i = N_k 
Our model is based on three probabilities: 
P(w_k/W_{k-1}T_{k-1}) (2) 
P(t_k/w_k, W_{k-1}T_{k-1}) (3) 
P(p_i^k/w_k, t_k, W_{k-1}T_{k-1}, p_1^k ... p_{i-1}^k) (4) 
As can be seen, (w_k, t_k, W_{k-1}T_{k-1}, p_1^k ... p_{i-1}^k) is one 
of the N_k word-parse k-prefixes W_kT_k at position k 
in the sentence, i = 1, N_k. 
To ensure a proper probabilistic model (1) we 
have to make sure that (2), (3) and (4) are well de- 
fined conditional probabilities and that the model 
halts with probability one. Consequently, certain 
PARSER and WORD-PREDICTOR probabilities 
must be given specific values: 
• P(null/W_kT_k) = 1, if h_{-1}.word = <s> and 
h_0 ≠ (</s>, TOP') -- that is, before predicting 
</s> -- ensures that (<s>, SB) is adjoined in the 
last step of the parsing process; 
• P((adjoin-right, TOP)/W_kT_k) = 1, if 
h_0 = (</s>, TOP') and h_{-1}.word = <s> 
and 
P((adjoin-right, TOP')/W_kT_k) = 1, if 
h_0 = (</s>, TOP') and h_{-1}.word ≠ <s> 
ensure that the parse generated by our model is con- 
sistent with the definition of a complete parse; 
• P((unary, NTlabel)/W_kT_k) = 0, if h_0.tag ≠ 
POStag ensures correct treatment of unary produc- 
tions; 
• ∃ε > 0, ∀W_{k-1}T_{k-1}: P(w_k = </s>/W_{k-1}T_{k-1}) ≥ ε 
ensures that the model halts with probability one. 
The word-predictor model (2) predicts the next 
word based on the preceding 2 exposed heads, thus 
making the following equivalence classification: 
P(w_k/W_{k-1}T_{k-1}) = P(w_k/h_0, h_{-1}) 
After experimenting with several equivalence clas- 
sifications of the word-parse prefix for the tagger 
model, the conditioning part of model (3) was re- 
duced to using the word to be tagged and the tags 
of the two most recent exposed heads: 
P(t_k/w_k, W_{k-1}T_{k-1}) = P(t_k/w_k, h_0.tag, h_{-1}.tag) 
Model (4) assigns probability to different parses of 
the word k-prefix by chaining the elementary oper- 
ations described above. The workings of the parser 
module are similar to those of Spatter (Jelinek et al., 
1994). The equivalence classification of the WkTk 
word-parse we used for the parser model (4) was the 
same as the one used in (Collins, 1996): 
P(p_i^k/W_kT_k) = P(p_i^k/h_0, h_{-1}) 
It is worth noting that if the binary branching 
structure developed by the parser were always right- 
branching and we mapped the POStag and non- 
terminal label vocabularies to a single type then our 
model would be equivalent to a trigram language 
model. 
3.1 Modeling Tools 
All model components -- WORD-PREDICTOR, 
TAGGER, PARSER -- are conditional probabilis- 
tic models of the type P(y/x_1, x_2, ..., x_n) where 
y, x_1, x_2, ..., x_n belong to a mixed bag of words, 
POStags, non-terminal labels and parser operations 
(y only). For simplicity, the modeling method we 
chose was deleted interpolation among relative fre- 
quency estimates of different orders fn(') using a 
recursive mixing scheme: 
P(y/x_1, ..., x_n) = 
λ(x_1, ..., x_n) · P(y/x_1, ..., x_{n-1}) + 
(1 - λ(x_1, ..., x_n)) · f_n(y/x_1, ..., x_n), (5) 
f_{-1}(y) = uniform(vocabulary(y)) (6) 
As can be seen, the context mixing scheme dis- 
cards items in the context in right-to-left order. The 
λ coefficients are tied based on the range of the 
count C(x_1, ..., x_n). The approach is a standard 
one which doesn't require an extensive description 
given the literature available on it (Jelinek and Mer- 
cer, 1980). 
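The recursion in eq. (5)-(6) can be sketched as below. This is an illustration of ours, with one simplification: a single coefficient lam replaces the coefficients tied by count range.

```python
# Sketch of deleted interpolation, eq. (5)-(6): the order-n estimate
# mixes the relative frequency f_n with the lower-order model, dropping
# context items in right-to-left order, down to the uniform f_{-1}.

def rel_freq(joint, marginal, y, ctx):
    """Relative frequency f_n(y/ctx) from joint and context counts."""
    total = marginal.get(ctx, 0)
    return joint.get((ctx, y), 0) / total if total else 0.0

def interp_prob(y, ctx, joint, marginal, lam, vocab_size):
    """P(y/x_1..x_n) by recursive mixing; ctx = (x_1, ..., x_n)."""
    if len(ctx) == 0:
        lower = 1.0 / vocab_size            # f_{-1}(y): uniform
    else:
        # discard the rightmost context item x_n first
        lower = interp_prob(y, ctx[:-1], joint, marginal, lam, vocab_size)
    return lam * lower + (1.0 - lam) * rel_freq(joint, marginal, y, ctx)

# Tiny corpus "a a b": unigram and bigram counts over vocabulary {a, b}
joint = {((), "a"): 2, ((), "b"): 1,
         (("a",), "a"): 1, (("a",), "b"): 1}
marginal = {(): 3, ("a",): 2}
```

Because every level mixes two proper distributions, the result sums to one over the vocabulary for any fixed context.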
3.2 Search Strategy 
Since the number of parses for a given word prefix 
W_k grows exponentially with k, |{T_k}| ~ O(2^k), the 
state space of our model is huge even for relatively 
short sentences so we had to use a search strategy 
that prunes it. Our choice was a synchronous multi- 
stack search algorithm which is very similar to a 
beam search. 
Each stack contains hypotheses -- partial parses 
-- that have been constructed by the same number of 
predictor and the same number of parser operations. 
The hypotheses in each stack are ranked according 
to the ln(P(W, T)) score, highest on top. The width 
of the search is controlled by two parameters: 
• the maximum stack depth -- the maximum num- 
ber of hypotheses the stack can contain at any given 
state; 
• log-probability threshold -- the difference between 
the log-probability score of the top-most hypothesis 
and the bottom-most hypothesis at any given state 
of the stack cannot be larger than a given threshold. 
Figure 7 shows schematically the operations asso- 
ciated with the scanning of a new word w_{k+1}. The 
above pruning strategy proved to be insufficient so 
we chose to also discard all hypotheses whose score 
is more than the log-probability threshold below the 
score of the topmost hypothesis. This additional 
pruning step is performed after all hypotheses in 
stage k' have been extended with the null parser 
transition and thus prepared for scanning a new 
word. 
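The two pruning criteria can be sketched as one function applied to a stack of scored hypotheses. In this illustration of ours, hypotheses are reduced to their ln P(W, T) scores; real hypotheses would carry full word-parse prefixes.

```python
# Sketch of the per-stack pruning: maximum stack depth plus a
# log-probability threshold relative to the top-most hypothesis.

def prune_stack(scores, max_depth, logp_threshold):
    """Rank hypotheses by ln P(W, T), keep at most max_depth of them,
    then drop any that fall more than logp_threshold below the best."""
    ranked = sorted(scores, reverse=True)[:max_depth]
    best = ranked[0]
    return [s for s in ranked if best - s <= logp_threshold]

kept = prune_stack([-10.0, -11.2, -13.5, -20.0, -25.1],
                   max_depth=4, logp_threshold=6.91)   # 6.91 ~ ln(1000)
```

With these settings the depth limit removes the fifth hypothesis and the threshold removes the one 10 nats below the top.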
3.3 Word Level Perplexity 
The conditional perplexity calculated by assigning 
to a whole sentence the probability: 
P(W/T*) = ∏_{k=0}^{n} P(w_{k+1}/W_kT_k*), (7) 

where T* = argmax_T P(W, T), is not valid because 
it is not causal: when predicting w_{k+1} we use T* 
which was determined by looking at the entire sen- 
tence. To be able to compare the perplexity of our 
model with that resulting from the standard tri- 
gram approach, we need to factor in the entropy of 
guessing the correct parse T_k* before predicting w_{k+1}, 
based solely on the word prefix W_k. 

Figure 7: One search extension cycle 
The probability assignment for the word at posi- 
tion k + 1 in the input sentence is made using: 
P(w_{k+1}/W_k) = 
Σ_{T_k ∈ S_k} P(w_{k+1}/W_kT_k) · p(W_k, T_k), (8) 
p(W_k, T_k) = P(W_kT_k) / Σ_{T_k ∈ S_k} P(W_kT_k) (9) 
which ensures a proper probability over strings W*, 
where S_k is the set of all parses present in our stacks 
at the current stage k. 
Another possibility for evaluating the word level 
perplexity of our model is to approximate the prob- 
ability of a whole sentence: 
P(W) = Σ_{k=1}^{N} P(W, T^{(k)}) (10) 
where T^{(k)} is one of the "N-best" -- in the sense 
defined by our search -- parses for W. This is a 
deficient probability assignment, however useful for 
justifying the model parameter re-estimation. 
The two estimates (8) and (10) are both consistent 
in the sense that if the sums are carried over all 
possible parses we get the correct value for the word 
level perplexity of our model. 
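The left-to-right assignment of eq. (8)-(9) can be sketched as follows. This is an illustration of ours with a hypothetical interface: each stack entry is reduced to a log-probability and a callable standing in for P(w_{k+1}/W_k T_k).

```python
import math

def next_word_prob(word, stack_hyps):
    """Eq. (8)-(9): stack_hyps is a list of (ln P(W_k T_k), predictor)
    pairs, one per parse T_k in S_k; predictor(word) stands in for
    P(w_{k+1}/W_k T_k)."""
    logps = [lp for lp, _ in stack_hyps]
    m = max(logps)                               # shift for stability
    weights = [math.exp(lp - m) for lp in logps]
    z = sum(weights)                             # normalizer of eq. (9)
    return sum((w / z) * pred(word)
               for w, (_, pred) in zip(weights, stack_hyps))

stack = [(-1.0, lambda w: 0.2),   # parse 1: P(after/W_k T_1) = 0.2
         (-1.0, lambda w: 0.4)]   # parse 2: P(after/W_k T_2) = 0.4
p = next_word_prob("after", stack)
```

Since the weights are normalized over the parses in the stacks, the result is a proper distribution whenever each per-parse predictor is.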
3.4 Parameter Re-estimation 
The major problem we face when trying to reesti- 
mate the model parameters is the huge state space of 
the model and the fact that dynamic programming 
techniques similar to those used in HMM parame- 
ter re-estimation cannot be used with our model. 
Our solution is inspired by an HMM re-estimation 
technique that works on pruned -- N-best -- trel- 
lises (Byrne et al., 1998). 
Let (W, T^{(k)}), k = 1 ... N be the set of hypothe- 
ses that survived our pruning strategy until the end 
of the parsing process for sentence W. Each of 
them was produced by a sequence of model actions, 
chained together as described in section 2; let us call 
the sequence of model actions that produced a given 
(W, T) the derivation(W, T). 
Let an elementary event in the derivation(W, T) 
be (y_l^{(m_l)}, x_l^{(m_l)}) where: 
• l is the index of the current model action; 
• m_l is the model component -- WORD- 
PREDICTOR, TAGGER, PARSER -- that takes 
action number l in the derivation(W, T); 
• y_l^{(m_l)} is the action taken at position l in the deriva- 
tion: 
if m_l = WORD-PREDICTOR, then y_l^{(m_l)} is a word; 
if m_l = TAGGER, then y_l^{(m_l)} is a POStag; 
if m_l = PARSER, then y_l^{(m_l)} is a parser-action; 
• x_l^{(m_l)} is the context in which the above action was 
taken: 
if m_l = WORD-PREDICTOR or PARSER, then 
x_l^{(m_l)} = (h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word); 
if m_l = TAGGER, then 
x_l^{(m_l)} = (word-to-tag, h_0.tag, h_{-1}.tag). 
The probability associated with each model ac- 
tion is determined as described in section 3.1, based 
on counts C^{(m)}(y^{(m)}, x^{(m)}), one set for each model 
component. 
Assuming that the deleted interpolation coeffi- 
cients and the count ranges used for tying them stay 
fixed, these counts are the only parameters to be 
re-estimated in an eventual re-estimation procedure; 
indeed, once a set of counts C^{(m)}(y^{(m)}, x^{(m)}) is spec- 
ified for a given model m, we can easily calculate: 
• the relative frequency estimates 
f_n^{(m)}(y^{(m)}/x_n^{(m)}) for all context orders 
n = 0 ... maximum-order(model(m)); 
• the count C^{(m)}(x^{(m)}) used for determining the 
λ(x^{(m)}) value to be used with the order-n context 
x^{(m)}. 
This is all we need for calculating the probability of 
an elementary event and then the probability of an 
entire derivation. 
One training iteration of the re-estimation proce- 
dure we propose is described by the following algo- 
rithm: 
N-best parse development data; // counts.E_i
// prepare counts.E_{i+1}
for each model component c {
  gather_counts development model_c;
}
In the parsing stage we retain for each "N-best" hy- 
pothesis (W, T^{(k)}), k = 1 ... N, only the quantity 
φ(W, T^{(k)}) = P(W, T^{(k)}) / Σ_{k=1}^{N} P(W, T^{(k)}) 
and its derivation(W, T^{(k)}). We then scan all 
the derivations in the "development set" and, for 
each occurrence of the elementary event (y^{(m)}, x^{(m)}) 
in derivation(W, T^{(k)}) we accumulate the value 
φ(W, T^{(k)}) in the C^{(m)}(y^{(m)}, x^{(m)}) counter to be 
used in the next iteration. 
The intuition behind this procedure is that 
φ(W, T^{(k)}) is an approximation to the P(T^{(k)}/W) 
probability which places all its mass on the parses 
that survived the parsing process; the above proce- 
dure simply accumulates the expected values of the 
counts C^{(m)}(y^{(m)}, x^{(m)}) under the φ(W, T^{(k)}) con- 
ditional distribution. As explained previously, the 
C^{(m)}(y^{(m)}, x^{(m)}) counts are the parameters defining 
our model, making our procedure similar to a rigor- 
ous EM approach (Dempster et al., 1977). 
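The count-gathering pass can be sketched as below. This is an illustration of ours: derivations are reduced to lists of elementary events (model, y, x), and the function name is hypothetical.

```python
from collections import defaultdict

def reestimate_counts(nbest):
    """One count-gathering pass of the N-best re-estimation (3.4).
    nbest: per sentence, a list of (P(W, T_k), derivation) pairs,
    a derivation being a list of elementary events (model, y, x).
    Returns the new count tables as one dict keyed by event."""
    counts = defaultdict(float)
    for hyps in nbest:
        total = sum(p for p, _ in hyps)
        for p, derivation in hyps:
            phi = p / total                  # phi(W, T_k)
            for event in derivation:
                counts[event] += phi         # expected count under phi
    return counts

# One sentence with two surviving parses, P = 0.3 and 0.1
nbest = [[(0.3, [("PARSER", "null", "ctx1"),
                 ("WORD-PREDICTOR", "after", "ctx2")]),
          (0.1, [("PARSER", "null", "ctx1")])]]
c = reestimate_counts(nbest)
```

An event shared by both derivations accumulates the full mass (0.75 + 0.25 = 1.0), while an event present in only one accumulates just that parse's phi.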
A particular -- and very interesting -- case is that 
of events which had count zero but get a non-zero 
count in the next iteration, caused by the "N-best" 
nature of the re-estimation process. Consider a given 
sentence in our "development" set. The "N-best" 
derivations for this sentence are trajectories through 
the state space of our model. They will change 
from one iteration to the other due to the smooth- 
ing involved in the probability estimation and the 
change of the parameters -- event counts -- defin- 
ing our model, thus allowing new events to appear 
and discarding others through purging low probabil- 
ity events from the stacks. The higher the number 
of trajectories per sentence, the more dynamic this 
change is expected to be. 
The results we obtained are presented in the ex- 
periments section. All the perplexity evaluations 
were done using the left-to-right formula (8) (L2R- 
PPL) for which the perplexity on the "development 
set" is not guaranteed to decrease from one itera- 
tion to another. However, we believe that our re- 
estimation method should not increase the approxi- 
mation to perplexity based on (10) (SUM-PPL) -- 
again, on the "development set"; we rely on the con- 
sistency property outlined at the end of section 3.3 
to correlate the desired decrease in L2R-PPL with 
that in SUM-PPL. No claim can be made about 
the change in either L2R-PPL or SUM-PPL on test 
data. 
Figure 8: Binarization schemes 
3.5 Initial Parameters 
Each model component -- WORD-PREDICTOR, 
TAGGER, PARSER -- is trained initially from a 
set of parsed sentences, after each parse tree (W, T) 
undergoes: 
• headword percolation and binarization -- see sec- 
tion 4; 
• decomposition into its derivation(W, T). 
Then, separately for each m model component, we: 
• gather joint counts C^{(m)}(y^{(m)}, x^{(m)}) from the 
derivations that make up the "development data" 
using φ(W, T) = 1; 
• estimate the deleted interpolation coefficients on 
joint counts gathered from "check data" using the 
EM algorithm. 
These are the initial parameters used with the re- 
estimation procedure described in the previous sec- 
tion. 
4 Headword Percolation and 
Binarization 
In order to get initial statistics for our model com- 
ponents we needed to binarize the UPenn Tree- 
bank (Marcus et al., 1995) parse trees and perco- 
late headwords. The procedure we used was to first 
percolate headwords using a context-free (CF) rule- 
based approach and then binarize the parses by us- 
ing a rule-based approach again. 
The headword of a phrase is the word that best 
represents the phrase, all the other words in the 
phrase being modifiers of the headword. Statisti- 
cally speaking, we were satisfied with the output 
of an enhanced version of the procedure described 
in (Collins, 1996) -- also known under the name 
"Magerman & Black Headword Percolation Rules". 
Once the position of the headword within a con- 
stituent -- equivalent with a CF production of the 
type Z → Y_1 ... Y_n , where Z, Y_1, ..., Y_n are non- 
terminal labels or POStags (only for Y_i) -- is iden- 
tified to be k, we binarize the constituent as follows: 
depending on the Z identity, a fixed rule is used 
to decide which of the two binarization schemes in 
Figure 8 to apply. The intermediate nodes created 
by the above binarization schemes receive the non- 
terminal label Z'. 
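The head-driven binarization can be sketched as below. This is an illustration of ours: the left_first flag stands in for the rule-based choice between the two schemes of Figure 8, and the rule itself is not reproduced from the paper.

```python
# Sketch of binarizing Z -> Y_1 ... Y_n around headword position k
# (0-based here). Intermediate nodes receive the label Z'.

def binarize(label, children, head_pos, left_first=True):
    """Return a binary tree of nested (label, left, right) tuples."""
    tree = children[head_pos]
    left = list(reversed(children[:head_pos]))   # nearest modifier first
    right = children[head_pos + 1:]
    if left_first:
        seq = [(c, True) for c in left] + [(c, False) for c in right]
    else:
        seq = [(c, False) for c in right] + [(c, True) for c in left]
    for i, (child, from_left) in enumerate(seq):
        # the topmost node keeps the original label Z
        lab = label if i == len(seq) - 1 else label + "'"
        tree = (lab, child, tree) if from_left else (lab, tree, child)
    return tree

t = binarize("Z", ["Y1", "Y2", "Y3", "Y4"], head_pos=1)
```

Every intermediate node dominates the head child, so percolating the headword up the new binary structure is trivial.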
5 Experiments 
Due to the low speed of the parser -- 200 wds/min 
for stack depth 10 and log-probability threshold 
6.91 nats (1/1000) -- we could carry out the re- 
estimation technique described in section 3.4 on only 
1 Mwds of training data. For convenience we chose 
to work on the UPenn Treebank corpus. The vocab- 
ulary sizes were: 
• word vocabulary: 10k, open -- all words outside 
the vocabulary are mapped to the <unk> token; 
• POS tag vocabulary: 40, closed; 
• non-terminal tag vocabulary: 52, closed; 
• parser operation vocabulary: 107, closed. 
The training data was split into "development" set 
-- 929,564wds (sections 00-20) -- and "check set" 
-- 73,760wds (sections 21-22); the test set size was 
82,430wds (sections 23-24). The "check" set has 
been used for estimating the interpolation weights 
and tuning the search parameters; the "develop- 
ment" set has been used for gathering/estimating 
counts; the test set has been used strictly for evalu- 
ating model performance. 
Table 1 shows the results of the re-estimation tech- 
nique presented in section 3.4. We achieved a reduc- 
tion in test-data perplexity bringing an improvement 
over a deleted interpolation trigram model whose 
perplexity was 167.14 on the same training-test data; 
the reduction is statistically significant according to 
a sign test. 
iteration DEV set TEST set 
number L2R-PPL L2R-PPL 
E0 24.70 167.47 
E1 22.34 160.76 
E2 21.69 158.97 
E3 21.26 158.28 
3-gram 21.20 167.14 
Table 1: Parameter re-estimation results 
Simple linear interpolation between our model and 
the trigram model: 
Q(w_{k+1}/W_k) = 
λ · P(w_{k+1}/w_{k-1}, w_k) + (1 - λ) · P(w_{k+1}/W_k) 
yielded a further improvement in PPL, as shown in 
Table 2. The interpolation weight was estimated on 
check data to be λ = 0.36. 
An overall relative reduction of 11% over the trigram 
model has been achieved. 
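The mixture above, and the step from per-word probabilities to word-level perplexity, can be sketched as follows. This is a minimal illustration of ours; the function names are hypothetical.

```python
import math

def interpolate(p_model, p_trigram, lam=0.36):
    """Q(w_{k+1}/W_k) = lam * P_3gram + (1 - lam) * P_model,
    with lam estimated on check data (0.36 in this experiment)."""
    return lam * p_trigram + (1.0 - lam) * p_model

def perplexity(word_probs):
    """Word-level perplexity from the per-word probabilities Q."""
    return math.exp(-sum(math.log(p) for p in word_probs)
                    / len(word_probs))
```

Since both component models assign proper per-word distributions, so does their convex combination, and the perplexities of the two tables are directly comparable.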
6 Conclusions and Future Directions 
The large difference between the perplexity of our 
model calculated on the "development" set -- used 
iteration   TEST set   TEST set 
number      L2R-PPL    3-gram interpolated PPL 
E0          167.47     152.25 
E3          158.28     148.90 
3-gram      167.14     167.14 
Table 2: Interpolation with trigram results 
for model parameter estimation -- and "test" set -- 
unseen data -- shows that the initial point we chose 
for the parameter values has already captured a lot 
of information from the training data. The same 
problem is encountered in standard n-gram language 
modeling; however, our approach has more flexibility 
in dealing with it due to the possibility of reestimat- 
ing the model parameters. 
We believe that the above experiments show the 
potential of our approach for improved language 
models. Our future plans include: 
• experiment with other parameterizations than the 
two most recent exposed heads in the word predictor 
model and parser; 
• estimate a separate word predictor for left-to- 
right language modeling. Note that the correspond- 
ing model predictor was obtained via re-estimation 
aimed at increasing the probability of the "N-best" 
parses of the entire sentence; 
• reduce vocabulary of parser operations; extreme 
case: no non-terminal labels/POS tags, word only 
model; this will increase the speed of the parser 
thus rendering it usable on larger amounts of train- 
ing data and allowing the use of deeper stacks -- 
resulting in more "N-best" derivations per sentence 
during re-estimation; 
• relax -- flatten -- the initial statistics in the re- 
estimation of model parameters; this would allow the 
model parameters to converge to a different point 
that might yield a lower word-level perplexity; 
• evaluate model performance on n-best sentences 
output by an automatic speech recognizer. 
7 Acknowledgments 
This research has been funded by the NSF 
IRI-19618874 grant (STIMULATE). 
The authors would like to thank Sanjeev Khu- 
danpur for his insightful suggestions. Also to Harry 
Printz, Eric Ristad, Andreas Stolcke, Dekai Wu and 
all the other members of the dependency model- 
ing group at the summer96 DoD Workshop for use- 
ful comments on the model, programming support 
and an extremely creative environment. Also thanks 
to Eric Brill, Sanjeev Khudanpur, David Yarowsky, 
Radu Florian, Lidia Mangu and Jun Wu for useful 
input during the meetings of the people working on 
our STIMULATE grant. 

References 
W. Byrne, A. Gunawardana, and S. Khudanpur. 
1998. Information geometry and EM variants. 
Technical Report CLSP Research Note 17, De- 
partment of Electrical and Computer Engineering, 
The Johns Hopkins University, Baltimore, MD. 
C. Chelba, D. Engle, F. Jelinek, V. Jimenez, S. Khu- 
danpur, L. Mangu, H. Printz, E. S. Ristad, 
R. Rosenfeld, A. Stolcke, and D. Wu. 1997. Struc- 
ture and performance of a dependency language 
model. In Proceedings of Eurospeech, volume 5, 
pages 2775-2778. Rhodes, Greece. 
Michael John Collins. 1996. A new statistical parser 
based on bigram lexical dependencies. In Proceed- 
ings of the 34th Annual Meeting of the Associ- 
ation for Computational Linguistics, pages 184- 
191. Santa Cruz, CA. 
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. 
Maximum likelihood from incomplete data via the 
EM algorithm. In Journal of the Royal Statistical 
Society, volume 39 of B, pages 1-38. 
Frederick Jelinek and Robert Mercer. 1980. Inter- 
polated estimation of markov source parameters 
from sparse data. In E. Gelsema and L. Kanal, ed- 
itors, Pattern Recognition in Practice, pages 381- 
397. 
F. Jelinek, J. Lafferty, D. M. Magerman, R. Mercer, 
A. Ratnaparkhi, and S. Roukos. 1994. Decision 
tree parsing using a hidden derivational model. 
In ARPA, editor, Proceedings of the Human Lan- 
guage Technology Workshop, pages 272-277. 
M. Marcus, B. Santorini, and M. Marcinkiewicz. 
1995. Building a large annotated corpus of En- 
glish: the Penn Treebank. Computational Lin- 
guistics, 19(2):313-330. 
Colin Philips. 1996. Order and Structure. Ph.D. 
thesis, MIT. Distributed by MITWPL. 
