A Structured Language Model 
Ciprian Chelba 
The Johns Hopkins University 
CLSP, Barton Hall 320 
3400 N. Charles Street, Baltimore, MD 21218
chelba@jhu.edu
Abstract 
The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long distance dependencies. The model assigns probability to every joint sequence of words-binary-parse-structure with headword annotation. The model, its probabilistic parametrization, and a set of experiments meant to evaluate its predictive power are presented.
[Figure: partial parse of "the dog I heard yesterday" when predicting barked, with dog as the exposed headword]
Figure 1: Partial parse 
'¢"~.( ~ I h_{-=*l ) ~_{-I \[ h_O 
~ w_l ... w..p ........ w q...w~r w_lr+ll ...w_k w_lk+l} ..... w_n </s> 
Figure 2: A word-parse k-prefix 
1 Introduction 
The main goal of the proposed project is to develop a language model (LM) that uses syntactic structure. The principles that guided this proposal were:
• the model will develop syntactic knowledge as a built-in feature; it will assign a probability to every joint sequence of words-binary-parse-structure;
• the model should operate in a left-to-right manner so that it would be possible to decode word lattices provided by an automatic speech recognizer.
The model consists of two modules: a next-word predictor and a parser that develops the syntactic structure the predictor uses. The operations of these two modules are intertwined.
2 The Basic Idea and Terminology 
Consider predicting the word barked in the sentence:
the dog I heard yesterday barked again.
A 3-gram approach would predict barked from (heard, yesterday), whereas it is clear that the predictor should use the word dog, which is outside the reach of even 4-grams. Our assumption is that what enables us to make a good prediction of barked is the syntactic structure in the past. The correct partial parse of the word history when predicting barked is shown in Figure 1.
The word dog is called the headword of the constituent (the (dog (...))) and dog is an exposed headword when predicting barked -- the topmost headword in the largest constituent that contains it. The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long distance information when predicting the next word. Our model will assign a probability P(W, T) to every sentence W with every possible binary branching parse T and every possible headword annotation for every constituent of T. Let W be a sentence of length l words to which we have prepended <s> and appended </s> so that w_0 = <s> and w_{l+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and W_k T_k the word-parse k-prefix. To stress this point, a word-parse k-prefix contains only those binary trees whose span is completely included in the word k-prefix, excluding w_0 = <s>. Single words can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed headwords. A complete parse -- Figure 3 -- is any binary parse of the w_1 ... w_l </s> sequence with the restriction that </s> is the only allowed headword.
[Figure: complete parse of <s> w_1 ... w_l </s> with </s> as the root headword]
Figure 3: Complete parse 
Note that (w_1 ... w_l) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword.
The model will operate by means of two modules:
• PREDICTOR predicts the next word w_{k+1} given the word-parse k-prefix and then passes control to the PARSER;
• PARSER grows the already existing binary branching structure by repeatedly generating the transitions adjoin-left or adjoin-right until it passes control to the PREDICTOR by taking a null transition.
The operations performed by the PARSER ensure that all possible binary branching parses with all possible headword assignments for the w_1 ... w_k word sequence can be generated. They are illustrated by Figures 4-6. The following algorithm describes how the model generates a word sequence with a complete parse (see Figures 3-6 for notation):
Transition t; // a PARSER transition
generate <s>;
do {
  predict next_word;                        // PREDICTOR
  do {                                      // PARSER
    if (T_{-1} != <s>)
      if (h_0 == </s>) t = adjoin-right;
      else t = {adjoin-{left,right}, null}; // sample a transition
    else t = null;
  } while (t != null)
} while (!(h_0 == </s> && T_{-1} == <s>))
t = adjoin-right; // adjoin <s>; DONE
It is easy to see that any given word sequence with a 
possible parse and headword annotation is generated 
by a unique sequence of model actions. 
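For concreteness, here is a minimal Python sketch of this generative loop; the functions sample_word, sample_transition, and generate are illustrative assumptions, with the two sampling functions standing in for the model's actual word and transition distributions (made precise in Section 3). The exposed headwords are kept on a stack whose top is h_0:

import random

def sample_word(stack):
    # stand-in for the PREDICTOR's word distribution; an assumption,
    # not the paper's model
    return random.choice(["dog", "barked", "</s>"])

def sample_transition(stack):
    # stand-in for the PARSER's transition distribution; respects the
    # forced move when h_0 == </s>
    if stack[-1] == "</s>":
        return "adjoin-right"
    return random.choice(["adjoin-left", "adjoin-right", "null"])

def generate():
    stack = ["<s>"]                          # exposed headwords; h_0 = stack[-1]
    words = []
    while not (len(stack) == 2 and stack[-1] == "</s>"):
        w = sample_word(stack)               # PREDICTOR step
        words.append(w)
        stack.append(w)                      # a word starts as a root-only tree
        while True:                          # PARSER steps
            if len(stack) == 2:              # T_{-1} == <s>: null forced
                t = "null"
            else:
                t = sample_transition(stack)
            if t == "null":
                break
            right, left = stack.pop(), stack.pop()
            # headword percolation as in Figures 5 and 6
            stack.append(left if t == "adjoin-left" else right)
    return words                             # final adjoin-right attaches <s>

print(generate())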
3 Probabilistic Model 
The probability P(W, T) can be broken into:

P(W,T) = \prod_{k=1}^{l+1} \Big[ P(w_k / W_{k-1}T_{k-1}) \cdot \prod_{i=1}^{N_k} P(t_i^k / w_k, W_{k-1}T_{k-1}, t_1^k \ldots t_{i-1}^k) \Big]

where:
• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix
• w_k is the word predicted by PREDICTOR
• N_k - 1 is the number of adjoin operations the PARSER executes before passing control to the PREDICTOR (the N_k-th operation at position k is the null transition); N_k is a function of T
[Figure: topmost constituents with exposed headwords h_{-2}, h_{-1}, h_0]
Figure 4: Before an adjoin operation 
[Figure: after adjoin-left the exposed headwords become h'_{-1} = h_{-2}, h'_0 = h_{-1}]
Figure 5: Result of adjoin-left 
[Figure: after adjoin-right the exposed headwords become h'_{-1} = h_{-2}, h'_0 = h_0]
Figure 6: Result of adjoin-right 
• t_i^k denotes the i-th PARSER operation carried out at position k in the word string;
t_i^k \in {adjoin-left, adjoin-right}, i < N_k;
t_i^k = null, i = N_k
Our model is based on two probabilities:

P(w_k / W_{k-1}T_{k-1})                                   (1)
P(t_i^k / w_k, W_{k-1}T_{k-1}, t_1^k \ldots t_{i-1}^k)    (2)

As can be seen, (w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k) is one of the N_k word-parse k-prefixes of W_k T_k, i = 1 ... N_k, at position k in the sentence.
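To make the decomposition concrete, the following Python sketch scores a (W, T) pair from its unique model action sequence; log_prob is an illustrative name, and p_word and p_transition are hypothetical callables standing in for models (1) and (2):

import math

def log_prob(actions, p_word, p_transition):
    # actions: the unique model action sequence for (W, T), e.g.
    #   [("predict", "dog"), ("parse", "null"), ("predict", "barked"), ...]
    total, history = 0.0, []
    for kind, value in actions:
        p = p_word(value, history) if kind == "predict" \
            else p_transition(value, history)
        total += math.log(p)                 # chain rule over all actions
        history.append((kind, value))
    return total                             # log P(W, T)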
To ensure a proper probabilistic model we have to make sure that (1) and (2) are well defined conditional probabilities and that the model halts with probability one. A few provisions need to be taken:
• P(null / W_k T_k) = 1, if T_{-1} == <s>, ensures that <s> is adjoined in the last step of the parsing process;
• P(adjoin-right / W_k T_k) = 1, if h_0 == </s>, ensures that the headword of a complete parse is </s>;
• \exists \epsilon > 0 s.t. P(w_k = </s> / W_{k-1}T_{k-1}) \geq \epsilon, \forall W_{k-1}T_{k-1}, ensures that the model halts with probability one.
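One way the first two provisions might be enforced when normalizing the PARSER's transition distribution (a sketch under the same assumptions as above; transition_distribution and the scores dict of unnormalized model scores are illustrative, not the paper's implementation):

def transition_distribution(scores, h_0, T_minus_1):
    # forced moves first, mirroring the provisions above
    if T_minus_1 == "<s>":
        return {"null": 1.0}                 # <s> is adjoined only at the end
    if h_0 == "</s>":
        return {"adjoin-right": 1.0}         # </s> must head the complete parse
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

print(transition_distribution(
    {"adjoin-left": 2.0, "adjoin-right": 1.0, "null": 1.0}, "dog", "heard"))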
3.1 The first model 
The first term (1) can be reduced to an n-gram LM,

P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1} ... w_{k-n+1}).

A simple alternative to this degenerate approach would be to build a model which predicts the next word based on the preceding p-1 exposed headwords and n-1 words in the history, thus making the following equivalence classification:

[W_k T_k] = {h_0 .. h_{-p+2}, w_{k-1} .. w_{k-n+1}}.
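A minimal Python sketch of this equivalence classification (the list representations and the function name equivalence_class are illustrative assumptions; p, n >= 2 is assumed):

def equivalence_class(headwords, words, p, n):
    # headwords: exposed headwords [..., h_{-1}, h_0]
    # words:     word history     [..., w_{k-2}, w_{k-1}]
    # [W_k T_k] = {h_0 .. h_{-p+2}, w_{k-1} .. w_{k-n+1}}; assumes p, n >= 2
    return tuple(headwords[-(p - 1):]) + tuple(words[-(n - 1):])

# with p = 2, n = 3: condition on h_0 and the last two words
print(equivalence_class(["<s>", "dog"], ["the", "dog", "I", "heard"], 2, 3))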
The approach is similar to the trigger LM (Lau93), the difference being that in the present work triggers are identified using the syntactic structure.
3.2 The second model 
Model (2) assigns probability to different binary parses of the word k-prefix by chaining the elementary operations described above. The workings of the PARSER are very similar to those of Spatter (Jelinek94). It can be brought to the full power of Spatter by changing the action of the adjoin operation so that it takes into account the terminal/nonterminal labels of the constituent proposed by adjoin and it also predicts the nonterminal label of the newly created constituent; the PREDICTOR will now predict the next word along with its POS tag. The best equivalence classification of the W_k T_k word-parse k-prefix is yet to be determined. The Collins parser (Collins96) shows that dependency-grammar-like bigram constraints may be the most adequate, so the equivalence classification [W_k T_k] should contain at least {h_0, h_{-1}}.
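Under that suggestion, the conditioning context for model (2) could be built in the same illustrative style as the sketch above (parser_context is a hypothetical name):

def parser_context(headwords):
    # minimal equivalence classification [W_k T_k] for model (2):
    # the two topmost exposed headwords, (h_{-1}, h_0)
    return (headwords[-2], headwords[-1])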
4 Preliminary Experiments 
Assuming that the correct partial parse is a function of the word prefix, it makes sense to compare the word level perplexity (PP) of a standard n-gram LM with that of the P(w_k / W_{k-1}T_{k-1}) model. We developed and evaluated four LMs:
• 2 bigram LMs P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1}), referred to as W and w, respectively; w_{k-1} is the previous (word, POStag) pair;
• 2 P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0) models, referred to as H and h, respectively; h_0 is the previous exposed (headword, POS/non-term tag) pair; the parses used in this model were those assigned manually in the Penn Treebank (Marcus95) after undergoing headword percolation and binarization.
All four LMs predict a word w_k and they were implemented using the Maximum Entropy Modeling Toolkit^1 (Ristad97). The constraint templates in the {W,H} models were:

4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>;
2 <= <?>_<?> <?>; 8 <= <*>_<?> <?>;

and in the {w,h} models they were:

4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>;
<*> denotes a don't care position, <?>_<?> a (word, tag) pair; for example, 4 <= <?>_<*> <?> will trigger on all ((word, any tag), predicted-word) pairs that occur more than 3 times in the training data.
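For illustration, a hedged Python sketch of how such thresholded templates can be instantiated from counts (the function instantiate and the event representation are assumptions, not part of the toolkit):

from collections import Counter

def instantiate(events, threshold):
    # keep every (history-pattern, predicted-word) event seen at least
    # `threshold` times, e.g. threshold 4 for "4 <= <?>_<*> <?>"
    counts = Counter(events)
    return {e for e, c in counts.items() if c >= threshold}

# the <?>_<*> <?> template: ((previous word, any tag), predicted word)
events = [(("dog", "*"), "barked")] * 4 + [(("heard", "*"), "barked")]
print(instantiate(events, 4))    # only (("dog", "*"), "barked") survives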
The sentence boundary is not included in the PP calculation. Table 1 shows the PP results along with the number of parameters for each of the 4 models described.

^1 ftp://ftp.cs.princeton.edu/pub/packages/memt
LM  PP   param   | LM  PP   param
H   312  206540  | h   410  102437

Table 1: Perplexity results
5 Acknowledgements 
The author thanks Frederick Jelinek, Sanjeev Khudanpur, Eric Ristad and all the other members of the Dependency Modeling Group (Stolcke97), WS96 DoD Workshop at the Johns Hopkins University.

References 
Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 184-191, Santa Cruz, CA.

Frederick Jelinek. 1997. Information extraction from speech and text -- course notes. The Johns Hopkins University, Baltimore, MD.

Frederick Jelinek, John Lafferty, David M. Magerman, Robert Mercer, Adwait Ratnaparkhi, Salim Roukos. 1994. Decision tree parsing using a hidden derivational model. In Proceedings of the Human Language Technology Workshop, 272-277. ARPA.

Raymond Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Trigger-based language models: a maximum entropy approach. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, volume 2, 45-48, Minneapolis.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. 1995. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Eric Sven Ristad. 1997. Maximum entropy modeling toolkit. Technical report, Department of Computer Science, Princeton University, Princeton, NJ, January 1997, v. 1.4 Beta.

Andreas Stolcke, Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Sven Ristad, Roni Rosenfeld, Dekai Wu. 1997. Structure and performance of a dependency language model. In Proceedings of Eurospeech'97, Rhodes, Greece. To appear.
