Bayesian Grammar Induction for Language Modeling 
Stanley F. Chen 
Aiken Computation Laboratory 
Division of Applied Sciences 
Harvard University 
Cambridge, MA 02138 
sfc@das.harvard.edu
Abstract 
We describe a corpus-based induction algo- 
rithm for probabilistic context-free gram- 
mars. The algorithm employs a greedy 
heuristic search within a Bayesian frame- 
work, and a post-pass using the Inside- 
Outside algorithm. We compare the per- 
formance of our algorithm to n-gram mo- 
dels and the Inside-Outside algorithm in 
three language modeling tasks. In two of 
the tasks, the training data is generated by 
a probabilistic context-free grammar and in 
both tasks our algorithm outperforms the 
other techniques. The third task involves 
naturally-occurring data, and in this task 
our algorithm does not perform as well as 
n-gram models but vastly outperforms the 
Inside-Outside algorithm. 
1 Introduction 
In applications such as speech recognition,
handwriting recognition, and spelling correction,
performance is limited by the quality of the language
model utilized (?; ?; ?; ?). However, static language
modeling performance has remained basically
unchanged since the advent of n-gram language
models forty years ago (?). Yet n-gram language
models can only capture dependencies within an
n-word window, where currently the largest practical
n for natural language is three, and many
dependencies in natural language occur beyond a
three-word window. In addition, n-gram models are
extremely large, which makes them difficult to
implement efficiently in memory-constrained applications.
An appealing alternative is grammar-based lan- 
guage models. Language models expressed as a pro- 
babilistic grammar tend to be more compact than 
n-gram language models, and have the ability to model
long-distance dependencies (?; ?; ?). However,
to date there has been little success in constructing 
grammar-based language models competitive with 
n-gram models in problems of any magnitude. 
In this paper, we describe a corpus-based induction
algorithm for probabilistic context-free grammars
that outperforms n-gram models and the
Inside-Outside algorithm (?) in medium-sized domains.
This result marks the first time a grammar-based
language model has surpassed n-gram modeling
in a task of at least moderate size. The algorithm
employs a greedy heuristic search within
a Bayesian framework, and a post-pass using the
Inside-Outside algorithm.
2 Grammar Induction as Search 
Grammar induction can be framed as a search pro- 
blem, and has been framed as such almost without 
exception in past research (?). The search space is
taken to be some class of grammars; for example, in 
our work we search within the space of probabilistic 
context-free grammars. The objective function is ta- 
ken to be some measure dependent on the training 
data; one generally wants to find a grammar that in 
some sense accurately models the training data. 
Most work in language modeling, including n- 
gram models and the Inside-Outside algorithm, falls 
under the maximum-likelihood paradigm, where one 
takes the objective function to be the likelihood of 
the training data given the grammar. However, the 
optimal grammar under this objective function is 
one which generates only strings in the training data 
and no other strings. Such grammars are poor lan- 
guage models, as they overfit the training data and 
do not model the language at large. In n-gram mo- 
dels and the Inside-Outside algorithm, this issue is 
evaded by bounding the size and form of the gram- 
mars considered, so that the "optimal" grammar 
cannot be expressed. However, in our work we do 
not wish to limit the size of the grammars conside- 
red. 
The basic shortcoming of the maximum-likelihood 
objective function is that it does not encompass the 
compelling intuition behind Occam's Razor, that 
simpler (or smaller) grammars are preferable over 
complex (or larger) grammars. A factor in the objective
function that favors smaller grammars over large
can prevent the objective function from preferring
grammars that overfit the training data. Solomonoff
(1964) presents a Bayesian grammar induction framework
that includes such a factor in a motivated manner.

$S \to S\,X$     ($1 - \epsilon$)
$S \to X$        ($\epsilon$)
$X \to A$        ($p(A)$)    $\forall A \in N - \{S, X\}$
$A_a \to a$      ($1$)       $\forall a \in T$
$N$ = the set of all nonterminal symbols
$T$ = the set of all terminal symbols
Probabilities for each rule are in parentheses.
Table 1: Initial hypothesis grammar

The goal of grammar induction is taken to be finding
the grammar with the largest a posteriori probability
given the training data, that is, finding the
grammar $G'$ where
$$G' = \arg\max_G p(G \mid O)$$
and where we denote the training data as $O$, for
observations. As it is unclear how to estimate $p(G \mid O)$
directly, we apply Bayes' Rule and get
$$G' = \arg\max_G \frac{p(O \mid G)\,p(G)}{p(O)} = \arg\max_G p(O \mid G)\,p(G)$$
Hence, we can frame the search for $G'$ as a search
with the objective function $p(O \mid G)\,p(G)$, the
likelihood of the training data multiplied by the prior
probability of the grammar.
We satisfy the goal of favoring smaller grammars 
by choosing a prior that assigns higher probabilities 
to such grammars. In particular, Solomonoff proposes
the use of the universal a priori probability (?),
which is closely related to the minimum description
length principle later proposed by Rissanen (1978). In the case
of grammatical language modeling, this corresponds 
to taking 
$$p(G) = 2^{-l(G)}$$
where l(G) is the length of the description of the 
grammar in bits. The universal a priori probabi- 
lity has many elegant properties, the most salient 
of which is that it dominates all other enumerable 
probability distributions multiplicatively.¹
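In log form, the objective is the log-likelihood of the training data minus the description length of the grammar in bits. The following is a minimal sketch of that arithmetic only; the function name `log2_posterior_score` and the numbers are hypothetical, and the grammar encoding that yields l(G) is not specified in this paper.

```python
def log2_posterior_score(log2_likelihood: float, description_bits: float) -> float:
    """Return log2 of p(O|G) * p(G), using the prior p(G) = 2^(-l(G))."""
    return log2_likelihood - description_bits

# Hypothetical numbers: the larger grammar fits the data better,
# but its extra description length outweighs the gain in likelihood.
small = log2_posterior_score(log2_likelihood=-1000.0, description_bits=200.0)
large = log2_posterior_score(log2_likelihood=-950.0, description_bits=300.0)
preferred = "small" if small > large else "large"
```

Under these made-up values the smaller grammar wins, which is the Occam's Razor effect the prior is meant to provide.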
3 Search Algorithm 
As described above, we take grammar induction to 
be the search for the grammar $G'$ that optimizes the
objective function $p(O \mid G)\,p(G)$. While this framework
does not restrict us to a particular grammar
formalism, in our work we consider only probabili- 
stic context-free grammars. 
¹A very thorough discussion of the universal a priori
probability is given by Li and Vitányi (1993).
We assume a simple greedy search strategy. We 
maintain a single hypothesis grammar which is in- 
itialized to a small, trivial grammar. We then try to 
find a modification to the hypothesis grammar, such 
as the addition of a grammar rule, that results in a 
grammar with a higher score on the objective func- 
tion. When we find a superior grammar, we make 
this the new hypothesis grammar. We repeat this 
process until we can no longer find a modification 
that improves the current hypothesis grammar. 
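The greedy strategy just described can be sketched as a generic hill-climbing loop. This is an illustration under assumptions, not the implementation used in the experiments: `greedy_search`, `moves`, and `score` are hypothetical names, and accepting the first improving move (rather than the best one) is an assumption.

```python
from typing import Callable, Iterable, TypeVar

G = TypeVar("G")

def greedy_search(initial: G,
                  moves: Callable[[G], Iterable[G]],
                  score: Callable[[G], float]) -> G:
    """Apply improving moves until no move raises the objective."""
    current, current_score = initial, score(initial)
    improved = True
    while improved:
        improved = False
        for candidate in moves(current):
            candidate_score = score(candidate)
            if candidate_score > current_score:
                # Found a superior hypothesis: adopt it and restart the scan.
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current

# Toy usage: "grammars" are integers, moves step by +/-1, the score peaks at 7.
best = greedy_search(0, lambda g: (g - 1, g + 1), lambda g: -(g - 7) ** 2)
```

In the toy run the search climbs from 0 to 7 and stops there, since neither neighbor scores higher.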
For our initial grammar, we choose a grammar 
that can generate any string, to assure that the 
grammar can cover the training data. The initial 
grammar is listed in Table 1. The sentential symbol
S expands to a sequence of X's, where X expands 
to every other nonterminal symbol in the grammar. 
Initially, the set of nonterminal symbols consists of 
a different nonterminal symbol expanding to each 
terminal symbol. 
Notice that this grammar models a sentence as 
a sequence of independently generated nonterminal 
symbols. We maintain this property throughout the 
search process, that is, for every symbol $A'$ that we
add to the grammar, we also add a rule $X \to A'$.
This assures that the sentential symbol can expand 
to every symbol; otherwise, adding a symbol will not 
affect the probabilities that the grammar assigns to 
strings. 
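A minimal sketch of how the initial grammar of Table 1 could be built from a vocabulary follows. The representation of rules as (left-hand side, right-hand side, probability) triples, the name `initial_grammar`, the value 0.01 for the small constant, and the uniform starting values for p(A) are all assumptions made for illustration.

```python
def initial_grammar(terminals, epsilon=0.01):
    """Build the rules of the initial hypothesis grammar as (lhs, rhs, prob) triples."""
    terminals = sorted(set(terminals))
    rules = [("S", ("S", "X"), 1.0 - epsilon),   # S -> S X
             ("S", ("X",), epsilon)]             # S -> X
    for a in terminals:
        nonterminal = f"A_{a}"
        # X -> A_a, with an assumed uniform starting value for p(A_a)
        rules.append(("X", (nonterminal,), 1.0 / len(terminals)))
        # A_a -> a, the only expansion of A_a
        rules.append((nonterminal, (a,), 1.0))
    return rules

rules = initial_grammar(["Bob", "Mary", "talks", "slowly"])
x_mass = sum(prob for lhs, _, prob in rules if lhs == "X")
```

With four terminals this yields ten rules, and the expansions of X sum to one.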
We use the term move set to describe the set of 
modifications we consider to the current hypothesis 
grammar to hopefully produce a superior grammar. 
Our move set includes the following moves: 
Move 1: Create a rule of the form $A \to B\,C$
Move 2: Create a rule of the form $A \to B \mid C$
For any context-free grammar, it is possible to express
a weakly equivalent grammar using only rules
of these forms. As mentioned before, with each new
symbol $A$ we also create a rule $X \to A$.
3.1 Evaluating the Objective Function 
Consider the task of calculating the objective function
$p(O \mid G)\,p(G)$ for some grammar $G$. Calculating
$p(G) = 2^{-l(G)}$ is inexpensive²; however, calculating
$p(O \mid G)$ requires a parsing of the entire training data.
We cannot afford to parse the training data for each
grammar considered; indeed, to ever be practical for
data sets of millions of words, it seems likely that we
can only afford to parse the data once.

[Figure 1: Initial Viterbi Parse. Parse trees of "Mary talks slowly" and "Bob talks slowly" under the initial hypothesis grammar.]
[Figure 2: Predicted Viterbi Parse. The same two sentences, with adjacent $A_{talks}$ and $A_{slowly}$ grouped under a single symbol $B$.]

To achieve this goal, we employ several approxi- 
mations. First, notice that we do not ever need to 
calculate the actual value of the objective function; 
we need only to be able to distinguish when a move 
applied to the current hypothesis grammar produces 
a grammar that has a higher score on the objective 
function, that is, we need only to be able to calcu- 
late the difference in the objective function resulting 
from a move. This can be done efficiently if we can 
quickly approximate how the probability of the trai- 
ning data changes when a move is applied. 
To make this possible, we approximate the proba- 
bility of the training data p(OIG ) by the probability 
of the single most probable parse, or Viterbi parse, 
of the training data. Furthermore, instead of recal- 
culating the Viterbi parse of the training data from 
scratch when a move is applied, we use heuristics to 
predict how a move will change the Viterbi parse. 
For example, consider the case where the training 
data consists of the two sentences 
O = {Bob talks slowly, Mary talks slowly} 
²Due to space limitations, we do not specify our method
for encoding grammars, i.e., how we calculate l(G)
for a given G. However, this will be described in the 
author's forthcoming Ph.D. dissertation. 
In Figure 1, we display the Viterbi parse of this
data under the initial hypothesis grammar used in
our algorithm.
Now, let us consider the move of adding the rule
$$B \to A_{talks}\,A_{slowly}$$
to the initial grammar (as well as the concomitant
rule $X \to B$). A reasonable heuristic for predicting
how the Viterbi parse will change is to replace
adjacent $X$'s that expand to $A_{talks}$ and $A_{slowly}$
respectively with a single $X$ that expands to $B$, as
displayed in Figure 2. This is the actual heuristic
we use for moves of the form $A \to B\,C$, and we have
analogous heuristics for each move in our move set.
By predicting the differences in the Viterbi parse re- 
sulting from a move, we can quickly estimate the 
change in the probability of the training data. 
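The prediction heuristic for the example above can be sketched as follows. The sketch represents a Viterbi parse only by its sequence of top-level symbols, and scores it under the rules of Table 1; the function names, the parameter values, and the choice of 0.5 for the constant (picked only so the logarithms come out round) are assumptions.

```python
import math

def predict_parse(symbols, left, right, new_symbol):
    """Replace each adjacent (left, right) pair, scanning left to right, by new_symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def top_level_log2_prob(symbols, p, epsilon):
    """log2 probability of a parse S -> S X ... -> X ... X with X -> A at each position."""
    m = len(symbols)
    return ((m - 1) * math.log2(1 - epsilon) + math.log2(epsilon)
            + sum(math.log2(p[s]) for s in symbols))

# Hypothetical parameter values; B -> A_talks A_slowly is B's only expansion (probability 1).
p = {"A_Mary": 0.25, "A_talks": 0.25, "A_slowly": 0.25, "B": 0.25}
parse = ["A_Mary", "A_talks", "A_slowly"]
predicted = predict_parse(parse, "A_talks", "A_slowly", "B")
delta = top_level_log2_prob(predicted, p, 0.5) - top_level_log2_prob(parse, p, 0.5)
```

Here the predicted parse is three bits more probable than the original one; in the full objective this gain would be weighed against the added description length of the new rules.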
Notice that our predicted Viterbi parse can stray 
a great deal from the actual Viterbi parse, as errors 
can accumulate as move after move is applied. To 
minimize these effects, we process the training data 
incrementally. Using our initial hypothesis gram- 
mar, we parse the first sentence of the training data 
and search for the optimal grammar over just that 
one sentence using the described search framework. 
We use the resulting grammar to parse the second 
sentence, and then search for the optimal grammar 
over the first two sentences using the last grammar 
as the starting point. We repeat this process, par- 
sing the next sentence using the best grammar found 
on the previous sentences and then searching for the 
best grammar taking into account this new sentence, 
until the entire training corpus is covered. 
Delaying the parsing of a sentence until all of the 
previous sentences are processed should yield more 
accurate Viterbi parses during the search process 
than if we simply parse the whole corpus with the 
initial hypothesis grammar. In addition, we still 
achieve the goal of parsing each sentence but once. 
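The incremental schedule can be sketched as a simple loop in which each sentence is parsed exactly once, with the best grammar found so far. The callables `parse` and `search` stand in for the Viterbi parser and the greedy search; they and the toy stand-ins below are hypothetical.

```python
def induce_incrementally(sentences, grammar, parse, search):
    """Parse each sentence once with the current grammar, then re-run the search."""
    parses = []
    for sentence in sentences:
        parses.append(parse(grammar, sentence))   # one parse per sentence
        grammar = search(grammar, parses)          # search over all data seen so far
    return grammar, parses

# Toy stand-ins: the "grammar" is just a counter of completed search rounds.
parsed_sentences = []

def toy_parse(grammar, sentence):
    parsed_sentences.append(sentence)
    return sentence.split()

def toy_search(grammar, parses):
    return grammar + 1

final_grammar, parses = induce_incrementally(
    ["Bob talks slowly", "Mary talks slowly"], 0, toy_parse, toy_search)
```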
3.2 Parameter Training 
In this section, we describe how the parameters of 
our grammar, the probabilities associated with each 
grammar rule, are set. Ideally, in evaluating the ob- 
jective function for a particular grammar we should 
use its optimal parameter settings given the training 
data, as this is the full score that the given gram- 
mar can achieve. However, searching for optimal 
parameter values is extremely expensive computa- 
tionally. Instead, we grossly approximate the opti- 
mal values by deterministically setting parameters 
based on the Viterbi parse of the training data par- 
sed so far. We rely on the post-pass, described later, 
to refine parameter values. 
Referring to the rules in Table 1, the parameter
$\epsilon$ is set to an arbitrary small constant. The values
of the parameters $p(A)$ are set to the (smoothed)
frequency of the $X \to A$ reduction in the Viterbi
parse of the data seen so far. The remaining symbols
are set to expand uniformly among their possible
expansions.
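As an illustration of setting p(A) from the Viterbi parse, the sketch below counts how often each symbol appears directly under X and smooths the relative frequencies. The text does not say which smoothing is used; additive (add-alpha) smoothing is assumed here purely for concreteness, and `estimate_p` is a hypothetical name.

```python
from collections import Counter

def estimate_p(top_level_symbols, all_symbols, alpha=1.0):
    """Additively smoothed frequency of each X -> A reduction (smoothing scheme assumed)."""
    counts = Counter(top_level_symbols)
    total = len(top_level_symbols) + alpha * len(all_symbols)
    return {s: (counts[s] + alpha) / total for s in all_symbols}

# Top-level symbols from the predicted parses of "Bob talks slowly" and "Mary talks slowly".
observed = ["A_Bob", "B", "A_Mary", "B"]
symbols = ["A_Bob", "A_Mary", "A_talks", "A_slowly", "B"]
p = estimate_p(observed, symbols)
```

With these counts, B receives probability 1/3 and the unseen symbols each receive 1/9.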
3.3 Constraining Moves 
Consider the move of creating a rule of the form 
$A \to B\,C$. This corresponds to $k^3$ different specific
rules that might be created, where k is the current 
number of symbols in the grammar. As it is too 
computationally expensive to consider each of these 
rules at every point in the search, we use heuristics 
to constrain which moves are appraised. 
For the left-hand side of a rule, we always create 
a new symbol. This heuristic selects the optimal 
choice the vast majority of the time; however, under 
this constraint the moves described earlier in this 
section cannot yield arbitrary context-free langua- 
ges. To partially address this, we add the move 
Move 3: Create a rule of the form $A \to A\,B \mid B$
With this iteration move, we can construct gram- 
mars that generate arbitrary regular languages. As 
yet, we have not implemented moves that enable 
the construction of arbitrary context-free grammars; 
this belongs to future work. 
To constrain the symbols we consider on the
right-hand side of a new rule, we use what we call
triggers.³ A trigger is a phenomenon in the Viterbi
parse of a sentence that is indicative that a particular
move might lead to a better grammar. For example,
in Figure 1 the fact that the symbols $A_{talks}$ and
$A_{slowly}$ occur adjacently is indicative that it could
be profitable to create a rule $B \to A_{talks}\,A_{slowly}$. We
have developed a set of triggers for each move in our
move set, and only consider a specific move if it is
triggered in the sentence currently being parsed in
the incremental processing.
³This is not to be confused with the use of the term
triggers in dynamic language modeling.
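For the concatenation move, the adjacency example suggests one simple trigger: every pair of symbols that occur next to each other in the Viterbi parse of the current sentence. The sketch below assumes exactly that and nothing more; the full set of triggers is not given in the text, and `adjacent_pair_triggers` is a hypothetical name.

```python
def adjacent_pair_triggers(top_level_symbols):
    """Return the set of adjacent symbol pairs, each a candidate right-hand side B C."""
    return set(zip(top_level_symbols, top_level_symbols[1:]))

triggers = adjacent_pair_triggers(["A_Bob", "A_talks", "A_slowly"])
```

Only the two pairs found here would be appraised as right-hand sides for a new rule, instead of every pair of symbols in the grammar.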
3.4 Post-Pass 
A conspicuous shortcoming in our search framework 
is that the grammars in our search space are fairly 
unexpressive. Firstly, recall that our grammars mo- 
del a sentence as a sequence of independently gene- 
rated symbols; however, in language there is a large 
dependence between adjacent constituents. Further- 
more, the only free parameters in our search are the 
parameters p(A); all other symbols (except S) are 
fixed to expand uniformly. These choices were ne- 
cessary to make the search tractable. 
To address this issue, we use an Inside-Outside al- 
gorithm post-pass. Our methodology is derived from 
that described by Lari and Young. We create $n$ new
nonterminal symbols $\{X_1, \ldots, X_n\}$, and create all
rules of the form:
$$X_i \to X_j\,X_k \qquad i, j, k \in \{1, \ldots, n\}$$
$$X_i \to A \qquad i \in \{1, \ldots, n\},\ A \in N_{old} - \{S, X\}$$
$N_{old}$ denotes the set of nonterminal symbols acquired
in the initial grammar induction phase, and $X_1$
is taken to be the new sentential symbol. These
new rules replace the first three rules listed in
Table 1. The parameters of these rules are initialized
randomly. Using this grammar as the starting point,
we run the Inside-Outside algorithm on the training 
data until convergence. 
In other words, instead of using the naive
$S \to S\,X \mid X$ rule to attach symbols together in parsing
data, we now use the $X_i$ rules and depend on the
Inside-Outside algorithm to train these randomly 
initialized rules intelligently. This post-pass allows 
us to express dependencies between adjacent sym- 
bols. In addition, it allows us to train parameters 
that were fixed during the initial grammar induc- 
tion phase. 
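The rule set created for the post-pass can be enumerated directly. The sketch below generates only the rule skeletons (random initialization and Inside-Outside training are omitted); the tuple representation and the name `post_pass_rules` are assumptions.

```python
from itertools import product

def post_pass_rules(n, old_nonterminals):
    """All rules X_i -> X_j X_k and X_i -> A for A in N_old - {S, X}."""
    xs = [f"X{i}" for i in range(1, n + 1)]
    binary = [(xi, (xj, xk)) for xi, xj, xk in product(xs, repeat=3)]
    unary = [(xi, (a,)) for xi in xs
             for a in old_nonterminals if a not in ("S", "X")]
    return binary + unary

rules = post_pass_rules(3, ["S", "X", "A_Bob", "A_talks", "B"])
```

For n = 3 and three induced symbols this gives 27 binary rules plus 9 unary rules, so the number of parameters grows as n cubed plus n times the number of induced symbols.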
4 Previous Work 
As mentioned, this work employs the Bayesian gram- 
mar induction framework described by Solomonoff 
(1960; 1964). However, Solomonoff does not specify a
concrete search algorithm and only makes suggestions
as to its nature. 
Similar research includes work by Cook et al. 
(1976) and Stolcke and Omohundro (1994). This 
work also employs a heuristic search within a Baye- 
sian framework. However, a different prior proba- 
bility on grammars is used, and the algorithms are 
only efficient enough to be applied to small data sets. 
The grammar induction algorithms most successful
in language modeling include the Inside-Outside
algorithm (?; ?; ?), a special case of the
Expectation-Maximization algorithm (Dempster et al., 1977),
and work by McCandless and Glass (1993). In the latter
work, McCandless uses a heuristic search procedure
similar to ours, but a very different search criterion.
To our knowledge, neither
algorithm has surpassed the performance of n-gram
models in a language modeling task of substantial
scale.
5 Results 
To evaluate our algorithm, we compare the perfor- 
mance of our algorithm to that of n-gram models 
and the Inside-Outside algorithm. 
For n-gram models, we tried $n = 1, \ldots, 10$ for
each domain. For smoothing a particular n-gram
model, we took a linear combination of all lower order
n-gram models. In particular, we follow standard
practice (?; ?; ?) and take the smoothed i-gram
probability to be a linear combination of the
i-gram frequency in the training data and the smoothed
(i - 1)-gram probability, that is,
$$p(w_0 \mid W = w_{1-i} \cdots w_{-1}) = \lambda_{i,c(W)} \frac{c(W w_0)}{c(W)} + (1 - \lambda_{i,c(W)})\, p(w_0 \mid w_{2-i} \cdots w_{-1})$$
where $c(W)$ denotes the count of the word sequence
$W$ in the training data. The smoothing parameters
$\lambda_{i,c}$ are trained through the Forward-Backward
algorithm (?) on held-out data. Parameters $\lambda_{i,c}$ are
tied together for similar $c$ to prevent data sparsity.
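The recursive interpolation can be sketched as below. This is a simplification under stated assumptions: a single fixed weight `lam` replaces the count-dependent weights trained on held-out data, the recursion is assumed to bottom out in a uniform distribution over the vocabulary, and the names `count_ngrams` and `interp_prob` are hypothetical.

```python
from collections import Counter

def count_ngrams(tokens, max_order):
    """Count k-grams for k = 1..max_order, and how often each context is followed by a word."""
    grams, contexts = Counter(), Counter()
    for k in range(1, max_order + 1):
        for i in range(len(tokens) - k + 1):
            gram = tuple(tokens[i:i + k])
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return grams, contexts

def interp_prob(word, context, grams, contexts, vocab_size, lam=0.5):
    """Interpolate the relative frequency with the next lower order model."""
    if context:
        lower = interp_prob(word, context[1:], grams, contexts, vocab_size, lam)
    else:
        lower = 1.0 / vocab_size          # assumed base case: uniform distribution
    if contexts[context] == 0:
        return lower                      # unseen context: rely on the lower order
    return lam * grams[context + (word,)] / contexts[context] + (1 - lam) * lower

tokens = "a b a b a c".split()
grams, contexts = count_ngrams(tokens, max_order=2)
p_b_given_a = interp_prob("b", ("a",), grams, contexts, vocab_size=3)
total = sum(interp_prob(w, ("a",), grams, contexts, vocab_size=3) for w in "abc")
```

On this toy sequence the smoothed bigram probability of "b" after "a" is 0.5, and the probabilities over the three-word vocabulary sum to one.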
For the Inside-Outside algorithm, we follow the 
methodology described by Lari and Young. For a 
given n, we create a probabilistic context-free gram- 
mar consisting of all Chomsky normal form rules 
over the $n$ nonterminal symbols $\{X_1, \ldots, X_n\}$ and the
given terminal symbols, that is, all rules
$$X_i \to X_j\,X_k \qquad i, j, k \in \{1, \ldots, n\}$$
$$X_i \to a \qquad i \in \{1, \ldots, n\},\ a \in T$$
where T denotes the set of terminal symbols in the 
domain. All parameters are initialized randomly. 
From this starting point, the Inside-Outside algo- 
rithm is run until convergence. 
For smoothing, we combine the expansion distri- 
bution of each symbol with a uniform distribution, 
that is, we take the smoothed parameter $p_s(A \to \alpha)$
to be
$$p_s(A \to \alpha) = (1 - \lambda)\, p_u(A \to \alpha) + \lambda \frac{1}{n^3 + n|T|}$$
where $p_u(A \to \alpha)$ denotes the unsmoothed parameter.
The value $n^3 + n|T|$ is the number of different
ways a symbol expands under the Lari and Young
methodology. The parameter $\lambda$ is trained through
the Inside-Outside algorithm on held-out data. This
smoothing is also performed on the Inside-Outside
post-pass of our algorithm. For each domain, we
tried $n = 3, \ldots, 10$.
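As a small worked instance of this smoothing formula, the sketch below evaluates it for hypothetical values: n = 3 nonterminals, 20 terminal symbols, and a smoothing weight of 0.1 (in the experiments the weight is trained on held-out data rather than fixed). The function name `smooth_rule_prob` is an assumption.

```python
def smooth_rule_prob(p_unsmoothed, lam, n, num_terminals):
    """Mix an unsmoothed rule probability with the constant 1 / (n^3 + n|T|)."""
    return (1 - lam) * p_unsmoothed + lam / (n ** 3 + n * num_terminals)

# With n = 3 and |T| = 20 the denominator is 27 + 60 = 87.
unseen = smooth_rule_prob(0.0, lam=0.1, n=3, num_terminals=20)
frequent = smooth_rule_prob(0.5, lam=0.1, n=3, num_terminals=20)
```

A rule that was never used after training still receives a small nonzero probability, which keeps the test-set entropy finite.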
Because of the computational demands of our al- 
gorithm, it is currently impractical to apply it to 
large vocabulary or large training set problems. Ho- 
wever, we present the results of our algorithm in 
three medium-sized domains. In each case, we use 
4500 sentences for training, with 500 of these sent- 
ences held out for smoothing. We test on 500 sent- 
ences, and measure performance by the entropy of 
the test data. 
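The entropy measure can be computed from the probabilities a model assigns to the test sentences. The sketch below assumes the per-word normalization implied by the bits/word figures reported in the tables; the function name and the example probabilities are hypothetical.

```python
import math

def entropy_bits_per_word(sentence_probs, num_words):
    """Negative log2 probability of the test data, divided by the number of words."""
    return -sum(math.log2(p) for p in sentence_probs) / num_words

# Two hypothetical test sentences with four words in total.
h = entropy_bits_per_word([1 / 8, 1 / 32], num_words=4)
```

Here the model spends 3 + 5 = 8 bits on 4 words, giving 2 bits/word; lower values indicate a better language model.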
In the first two domains, we created the training 
and test data artificially so as to have an ideal gram- 
mar in hand to benchmark results. In particular, we 
used a probabilistic grammar to generate the data. 
In the first domain, we created this grammar by 
hand; the grammar was a small English-like probabi- 
listic context-free grammar consisting of roughly 10 
nonterminal symbols, 20 terminal symbols, and 30 
rules. In the second domain, we derived the gram- 
mar from manually parsed text. From a million 
words of parsed Wall Street Journal data from the 
Penn treebank, we extracted the 20 most frequently 
occurring symbols, and the 10 most frequently oc- 
curring rules expanding each of these symbols. For 
each symbol that occurs on the right-hand side of a 
rule but which was not one of the most frequent 20 
symbols, we create a rule that expands that symbol 
to a unique terminal symbol. After removing unre- 
achable rules, this yields a grammar of roughly 30 
nonterminals, 120 terminals, and 160 rules. Para- 
meters are set to reflect the frequency of the corre- 
sponding rule in the parsed corpus. 
For the third domain, we took English text and 
reduced the size of the vocabulary by mapping each 
word to its part-of-speech tag. We used tagged Wall 
Street Journal text from the Penn treebank, which 
has a tag set size of about fifty. 
In Tables 2-4, we summarize our results. The
ideal grammar denotes the grammar used to gene- 
rate the training and test data. For each algorithm, 
we list the best performance achieved over all n tried, 
and the best n column states which value realized 
this performance. 
We achieve a moderate but significant improve- 
ment in performance over n-gram models and the 
Inside-Outside algorithm in the first two domains, 
while in the part-of-speech domain we are outper- 
formed by n-gram models but we vastly outperform 
the Inside-Outside algorithm. 
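The relative figures reported alongside the entropies are percentage changes with respect to the n-gram entropy. The following sketch recomputes them for the English-like artificial grammar from the entropies in Table 2; the function name `relative_change` is an assumption.

```python
def relative_change(entropy, baseline):
    """Percentage change of an entropy relative to the n-gram baseline."""
    return 100.0 * (entropy - baseline) / baseline

ngram_entropy = 2.46  # bits/word, n-gram model in Table 2
table2 = {"ideal grammar": 2.30, "our algorithm": 2.37, "Inside-Outside": 2.60}
relative = {name: round(relative_change(h, ngram_entropy), 1)
            for name, h in table2.items()}
```

The recomputed values agree with the percentages listed in the table.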
In Table 5, we display a sample of the number
of parameters and execution time (on a Decstation 
5000/33) associated with each algorithm. We choose 
n to yield approximately equivalent performance for 
each algorithm. The first pass row refers to the main 
grammar induction phase of our algorithm, and the 
post-pass row refers to the Inside-Outside post-pass. 
                 best n   entropy       entropy relative
                          (bits/word)   to n-gram
ideal grammar             2.30          -6.5%
our algorithm    7        2.37          -3.7%
n-gram model     4        2.46
Inside-Outside   9        2.60          +5.7%
Table 2: English-like artificial grammar

                 best n   entropy       entropy relative
                          (bits/word)   to n-gram
ideal grammar             4.13          -10.4%
our algorithm    9        4.44          -3.7%
n-gram model     4        4.61
Inside-Outside   9        4.64          +0.7%
Table 3: Wall Street Journal-like artificial grammar

Notice that our algorithm produces a significantly 
more compact model than the n-gram model, while 
running significantly faster than the Inside-Outside 
algorithm even though we use an Inside-Outside 
post-pass. Part of this discrepancy is due to the fact 
that we require a smaller number of new nonterminal 
symbols to achieve equivalent performance, but we 
have also found that our post-pass converges more 
quickly even given the same number of nonterminal 
symbols. 
6 Discussion 
Our algorithm consistently outperformed the Inside- 
Outside algorithm in these experiments. While we 
partially attribute this difference to using a Bayesian 
instead of maximum-likelihood objective function, 
we believe that part of this difference results from a 
more effective search strategy. In particular, though 
both algorithms employ a greedy hill-climbing strat- 
egy, our algorithm gains an advantage by being able 
to add new rules to the grammar. 
In the Inside-Outside algorithm, the gradient des- 
cent search discovers the "nearest" local minimum in 
the search landscape to the initial grammar. If there 
are k rules in the grammar and thus k parameters, 
then the search takes place in a fixed k-dimensional 
space $\mathbb{R}^k$. In our algorithm, it is possible to expand
the hypothesis grammar, thus increasing the
dimensionality of the parameter space that is being
searched. An apparent local minimum in the space
$\mathbb{R}^k$ may no longer be a local minimum in the space
$\mathbb{R}^{k+1}$; the extra dimension may provide a pathway
for further improvement of the hypothesis grammar. 
Hence, our algorithm should be less prone to sub- 
optimal local minima than the Inside-Outside algo- 
rithm. 
Outperforming n-gram models in the first two do- 
mains demonstrates that our algorithm is able to 
take advantage of the grammatical structure present 
in data. However, the superiority of n-gram models 
in the part-of-speech domain indicates that to be 
competitive in modeling naturally-occurring data, it 
is necessary to model collocational information ac- 
curately. We need to modify our algorithm to more 
aggressively model n-gram information. 
7 Conclusion 
This research represents a step forward in the quest 
for developing grammar-based language models for 
natural language. We induce models that, while 
being substantially more compact, outperform n- 
gram language models in medium-sized domains. 
The algorithm runs essentially in time and space li- 
near in the size of the training data, so larger do- 
mains are within our reach. 
However, we feel the largest contribution of this 
work does not lie in the actual algorithm specified, 
but rather in its indication of the potential of the
induction framework described by Solomonoff in 1964.
We have implemented only a subset of the moves 
that we have developed, and inspection of our re- 
sults gives reason to believe that these additional 
moves may significantly improve the performance of 
our algorithm. 
Solomonoff's induction framework is not restric- 
ted to probabilistic context-free grammars. After 
completing the implementation of our move set, we 
plan to explore the modeling of context-sensitive 
phenomena. This work demonstrates that Solomo- 
noff's elegant framework deserves much further con- 
sideration. 
                 best n   entropy       entropy relative
                          (bits/word)   to n-gram
n-gram model     6        3.01
our algorithm    7        3.15          +4.7%
Inside-Outside   7        3.93          +30.6%
Table 4: English sentence part-of-speech sequences

WSJ artif.    n    entropy       no.      time
                   (bits/word)   params   (sec)
n-gram        3    4.61          15000    50
IO            9    4.64          2000     30000
first pass                       800      1000
post-pass     5    4.60          4000     5000
Table 5: Parameters and Training Time

Acknowledgements
We are indebted to Stuart Shieber for his suggestions
and guidance, as well as his invaluable comments on
earlier drafts of this paper. This material is based
on work supported by the National Science Foundation
under Grant Number IRI-9350192 to Stuart M.
Shieber.

References 
D. Angluin and C.H. Smith. 1983. Inductive in- 
ference: theory and methods. ACM Computing 
Surveys, 15:237-269. 
L.R. Bahl, J.K. Baker, P.S. Cohen, F. Jelinek, B.L. 
Lewis, and R.L. Mercer. 1978. Recognition of a 
continuously read natural corpus. In Proceedings 
of the IEEE International Conference on Acou- 
stics, Speech and Signal Processing, pages 422- 
424, Tulsa, Oklahoma, April. 
Lalit R. Bahl, Frederick Jelinek, and Robert L. Mer- 
cer. 1983. A maximum likelihood approach to 
continuous speech recognition. IEEE Transac- 
tions on Pattern Analysis and Machine Intelli- 
gence, PAMI-5(2):179-190, March. 
J.K. Baker. 1975. The DRAGON system - an over- 
view. IEEE Transactions on Acoustics, Speech 
and Signal Processing, 23:24-29, February. 
J.K. Baker. 1979. Trainable grammars for speech 
recognition. In Proceedings of the Spring Confe- 
rence of the Acoustical Society of America, pages 
547-550, Boston, MA, June. 
L.E. Baum and J.A. Eagon. 1967. An inequality 
with application to statistical estimation for pro- 
babilistic functions of Markov processes and to a 
model for ecology. Bulletin of the American Mathematical
Society, 73:360-363.
Peter F. Brown, Vincent J. DellaPietra, Peter V. 
deSouza, Jennifer C. Lai, and Robert L. Mercer. 
1992. Class-based n-gram models of natural lan- 
guage. Computational Linguistics, 18(4):467-479, 
December. 
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. 
Maximum likelihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical 
Society, 39(B):1-38. 
Frederick Jelinek and Robert L. Mercer. 1980. Inter- 
polated estimation of Markov source parameters 
from sparse data. In Proceedings of the Workshop 
on Pattern Recognition in Practice, Amsterdam, 
The Netherlands: North-Holland, May. 
M.D. Kernighan, K.W. Church, and W.A. Gale. 
1990. A spelling correction program based on a 
noisy channel model. In Proceedings of the Thir- 
teenth International Conference on Computatio- 
nal Linguistics, pages 205-210. 
K. Lari and S.J. Young. 1990. The estimation of 
stochastic context-free grammars using the inside- 
outside algorithm. Computer Speech and Lan- 
guage, 4:35-56. 
K. Lari and S.J. Young. 1991. Applications of sto- 
chastic context-free grammars using the inside- 
outside algorithm. Computer Speech and Lan- 
guage, 5:237-257. 
Ming Li and Paul Vitányi. 1993. An Introduction
to Kolmogorov Complexity and its Applications. 
Springer-Verlag. 
Michael K. McCandless and James R. Glass. 1993. 
Empirical acquisition of word and phrase classes 
in the ATIS domain. In Third European Confe- 
rence on Speech Communication and Technology, 
Berlin, Germany, September. 
Fernando Pereira and Yves Schabes. 1992. Inside-outside
reestimation from partially bracketed corpora.
In Proceedings of the 30th Annual Meeting
of the ACL, pages 128-135, Newark, Delaware. 
P. Resnik. 1992. Probabilistic tree-adjoining grammar
as a framework for statistical natural language
processing. In Proceedings of the 14th International
Conference on Computational Linguistics.
J. Rissanen. 1978. Modeling by the shortest data 
description. Automatica, 14:465-471. 
Y. Schabes. 1992. Stochastic lexicalized tree-adjoining
grammars. In Proceedings of the 14th
International Conference on Computational Lin- 
guistics. 
C.E. Shannon. 1951. Prediction and entropy of 
printed English. Bell System Technical Journal,
30:50-64, January. 
R.J. Solomonoff. 1960. A preliminary report on
a general theory of inductive inference. Techni- 
cal Report ZTB-138, Zator Company, Cambridge, 
MA, November. 
R.J. Solomonoff. 1964. A formal theory of inductive 
inference. Information and Control, 7:1-22,224- 
254, March, June. 
Rohini Srihari and Charlotte Baltus. 1992. Combining
statistical and syntactic methods in recognizing
handwritten sentences. In AAAI Symposium:
Probabilistic Approaches to Natural Language, pages
121-127.
