An Efficient Probabilistic Context-Free 
Parsing Algorithm that Computes Prefix 
Probabilities 
Andreas Stolcke* 
University of California at Berkeley 
and 
International Computer Science Institute 
We describe an extension of Earley's parser for stochastic context-free grammars that computes the 
following quantities given a stochastic context-free grammar and an input string: a) probabilities 
of successive prefixes being generated by the grammar; b) probabilities of substrings being gen- 
erated by the nonterminals, including the entire string being generated by the grammar; c) most 
likely (Viterbi) parse of the string; d) posterior expected number of applications of each grammar 
production, as required for reestimating rule probabilities. Probabilities (a) and (b) are computed 
incrementally in a single left-to-right pass over the input. Our algorithm compares favorably to 
standard bottom-up parsing methods for SCFGs in that it works efficiently on sparse grammars 
by making use of Earley's top-down control structure. It can process any context-free rule format 
without conversion to some normal form, and combines computations for (a) through (d) in a 
single algorithm. Finally, the algorithm has simple extensions for processing partially bracketed 
inputs, and for finding partial parses and their likelihoods on ungrammatical inputs. 
1. Introduction 
Context-free grammars are widely used as models of natural language syntax. In 
their probabilistic version, which defines a language as a probability distribution over 
strings, they have been used in a variety of applications: for the selection of parses 
for ambiguous inputs (Fujisaki et al. 1991); to guide the rule choice efficiently during 
parsing (Jones and Eisner 1992); to compute island probabilities for non-linear parsing 
(Corazza et al. 1991). In speech recognition, probabilistic context-free grammars play 
a central role in integrating low-level word models with higher-level language mod- 
els (Ney 1992), as well as in non-finite-state acoustic and phonotactic modeling (Lari 
and Young 1991). In some work, context-free grammars are combined with scoring 
functions that are not strictly probabilistic (Nakagawa 1987), or they are used with 
context-sensitive and/or semantic probabilities (Magerman and Marcus 1991; Mager- 
man and Weir 1992; Jones and Eisner 1992; Briscoe and Carroll 1993). 
Although clearly not a perfect model of natural language, stochastic context-free 
grammars (SCFGs) are superior to nonprobabilistic CFGs, with probability theory pro- 
viding a sound theoretical basis for ranking and pruning of parses, as well as for 
integration with models for nonsyntactic aspects of language. All of the applications 
listed above involve (or could potentially make use of) one or more of the following 
* Speech Technology and Research Laboratory, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025. E-mail: stolcke@speech.sri.com. 
© 1995 Association for Computational Linguistics 
Computational Linguistics Volume 21, Number 2 
standard tasks, compiled by Jelinek and Lafferty (1991). 1
1. What is the probability that a given string x is generated by a grammar G?
2. What is the single most likely parse (or derivation) for x?
3. What is the probability that x occurs as a prefix of some string generated by G (the prefix probability of x)?
4. How should the parameters (e.g., rule probabilities) in G be chosen to maximize the probability over a training set of strings?
The algorithm described in this article can compute solutions to all four of these 
problems in a single framework, with a number of additional advantages over previ- 
ously presented isolated solutions. 
Most probabilistic parsers are based on a generalization of bottom-up chart pars- 
ing, such as the CYK algorithm. Partial parses are assembled just as in nonprobabilistic 
parsing (modulo possible pruning based on probabilities), while substring probabili- 
ties (also known as "inside" probabilities) can be computed in a straightforward way. 
Thus, the CYK chart parser underlies the standard solutions to problems (1) and (4) 
(Baker 1979), as well as (2) (Jelinek 1985). While the Jelinek and Lafferty (1991) solu- 
tion to problem (3) is not a direct extension of CYK parsing, the authors nevertheless 
present their algorithm in terms of its similarities to the computation of inside proba- 
bilities. 
In our algorithm, computations for tasks (1) and (3) proceed incrementally, as the 
parser scans its input from left to right; in particular, prefix probabilities are available 
as soon as the prefix has been seen, and are updated incrementally as it is extended. 
Tasks (2) and (4) require one more (reverse) pass over the chart constructed from the 
input. 
Incremental, left-to-right computation of prefix probabilities is particularly important
since that is a necessary condition for using SCFGs as a replacement for finite-state
language models in many applications, such as speech decoding. As pointed out by
Jelinek and Lafferty (1991), knowing probabilities $P(x_0 \ldots x_i)$ for arbitrary prefixes $x_0 \ldots x_i$
enables probabilistic prediction of possible follow-words $x_{i+1}$, as
$P(x_{i+1} \mid x_0 \ldots x_i) = P(x_0 \ldots x_i x_{i+1}) / P(x_0 \ldots x_i)$.
These conditional probabilities can then be used as word
transition probabilities in a Viterbi-style decoder or to incrementally compute the cost
function for a stack decoder (Bahl, Jelinek, and Mercer 1983).
Another application in which prefix probabilities play a central role is the extrac- 
tion of n-gram probabilities from SCFGs (Stolcke and Segal 1994). Here, too, efficient 
incremental computation saves time, since the work for common prefix strings can be 
shared. 
The key to most of the features of our algorithm is that it is based on the top- 
down parsing method for nonprobabilistic CFGs developed by Earley (1970). Earley's 
algorithm is appealing because it runs with best-known complexity on a number of 
special classes of grammars. In particular, Earley parsing is more efficient than the 
bottom-up methods in cases where top-down prediction can rule out potential parses 
of substrings. The worst-case computational expense of the algorithm (either for the 
complete input, or incrementally for each new word) is as good as that of the other 
1 Their paper phrases these problems in terms of context-free probabilistic grammars, but they generalize
in obvious ways to other classes of models.
known specialized algorithms, but can be substantially better on well-known grammar 
classes. 
Earley's parser (and hence ours) also deals with any context-free rule format in 
a seamless way, without requiring conversions to Chomsky Normal Form (CNF), as 
is often assumed. Another advantage is that our probabilistic Earley parser has been 
extended to take advantage of partially bracketed input, and to return partial parses 
on ungrammatical input. The latter extension removes one of the common objections 
against top-down, predictive (as opposed to bottom-up) parsing approaches (Mager- 
man and Weir 1992). 
2. Overview 
The remainder of the article proceeds as follows. Section 3 briefly reviews the workings 
of an Earley parser without regard to probabilities. Section 4 describes how the parser 
needs to be extended to compute sentence and prefix probabilities. Section 5 deals with 
further modifications for solving the Viterbi and training tasks, for processing partially 
bracketed inputs, and for finding partial parses. Section 6 discusses miscellaneous 
issues and relates our work to the literature on the subject. In Section 7 we summarize 
and draw some conclusions. 
To get an overall idea of probabilistic Earley parsing it should be sufficient to read 
Sections 3, 4.2, and 4.4. Section 4.5 deals with a crucial technicality, and later sections 
mostly fill in details and add optional features. 
We assume the reader is familiar with the basics of context-free grammar the- 
ory, such as given in Aho and Ullman (1972, Chapter 2). Some prior familiarity with 
probabilistic context-free grammars will also be helpful. Jelinek, Lafferty, and Mercer 
(1992) provide a tutorial introduction covering the standard algorithms for the four 
tasks mentioned in the introduction. 
Notation. The input string is denoted by x. $|x|$ is the length of x. Individual input
symbols are identified by indices starting at 0: $x_0, x_1, \ldots, x_{|x|-1}$. The input alphabet
is denoted by $\Sigma$. Substrings are identified by beginning and end positions $x_{i \ldots j}$. The
variables i, j, k are reserved for integers referring to positions in input strings. Latin
capital letters X, Y, Z denote nonterminal symbols. Latin lowercase letters a, b, ... are
used for terminal symbols. Strings of mixed nonterminal and terminal symbols are
written using lowercase Greek letters $\lambda$, $\mu$, $\nu$. The empty string is denoted by $\epsilon$.
3. Earley Parsing 
An Earley parser is essentially a generator that builds left-most derivations of strings, 
using a given set of context-free productions. The parsing functionality arises because 
the generator keeps track of all possible derivations that are consistent with the input 
string up to a certain point. As more and more of the input is revealed, the set of 
possible derivations (each of which corresponds to a parse) can either expand as new 
choices are introduced, or shrink as a result of resolved ambiguities. In describing the 
parser it is thus appropriate and convenient to use generation terminology. 
The parser keeps a set of states for each position in the input, describing all
pending derivations. 2 These state sets together form the Earley chart. A state is of the
2 Earley states are also known as items in LR parsing; see Aho and Ullman (1972, Section 5.2) and 
Section 6.2. 
form
$i : {}_k X \rightarrow \lambda.\mu$,
where X is a nonterminal of the grammar, $\lambda$ and $\mu$ are strings of nonterminals and/or
terminals, and i and k are indices into the input string. States are derived from
productions in the grammar. The above state is derived from a corresponding production
$X \rightarrow \lambda\mu$
with the following semantics: 
• The current position in the input is i, i.e., $x_0 \ldots x_{i-1}$ have been processed
so far. 3 The states describing the parser state at position i are collectively
called state set i. Note that there is one more state set than input
symbols: set 0 describes the parser state before any input is processed,
while set $|x|$ contains the states after all input symbols have been
processed.
• Nonterminal X was expanded starting at position k in the input, i.e., X
generates some substring starting at position k.
• The expansion of X proceeded using the production $X \rightarrow \lambda\mu$, and has
expanded the right-hand side (RHS) $\lambda\mu$ up to the position indicated by
the dot. The dot thus refers to the current position i.
A state with the dot to the right of the entire RHS is called a complete state, since 
it indicates that the left-hand side (LHS) nonterminal has been fully expanded. 
Our description of Earley parsing omits an optional feature of Earley states, the 
lookahead string. Earley's algorithm allows for an adjustable amount of lookahead 
during parsing, in order to process LR(k) grammars deterministically (and obtain the 
same computational complexity as specialized LR(k) parsers where possible). The ad- 
dition of lookahead is orthogonal to our extension to probabilistic grammars, so we 
will not include it here. 
The operation of the parser is defined in terms of three operations that consult the 
current set of states and the current input symbol, and add new states to the chart. This 
is strongly suggestive of state transitions in finite-state models of language, parsing, 
etc. This analogy will be explored further in the probabilistic formulation later on. 
The three types of transitions operate as follows. 
Prediction. For each state
$i : {}_k X \rightarrow \lambda.Y\mu$,
where Y is a nonterminal anywhere in the RHS, and for all rules $Y \rightarrow \nu$ expanding Y,
add states
$i : {}_i Y \rightarrow .\nu$.
A state produced by prediction is called a predicted state. Each prediction corresponds
to a potential expansion of a nonterminal in a left-most derivation.
3 This index is implicit in Earley (1970). We include it here for clarity. 
Scanning. For each state
$i : {}_k X \rightarrow \lambda.a\mu$,
where a is a terminal symbol that matches the current input $x_i$, add the state
$i+1 : {}_k X \rightarrow \lambda a.\mu$
(move the dot over the current symbol). A state produced by scanning is called a
scanned state. Scanning ensures that the terminals produced in a derivation match
the input string.
Completion. For each complete state
$i : {}_j Y \rightarrow \nu.$
and each state in set j, j < i, that has Y to the right of the dot,
$j : {}_k X \rightarrow \lambda.Y\mu$,
add the state
$i : {}_k X \rightarrow \lambda Y.\mu$
(move the dot over the current nonterminal). A state produced by completion is called
a completed state. 4 Each completion corresponds to the end of a nonterminal
expansion started by a matching prediction step.
For each input symbol and corresponding state set, an Earley parser performs 
all three operations exhaustively, i.e., until no new states are generated. One crucial 
insight into the working of the algorithm is that, although both prediction and com- 
pletion feed themselves, there are only a finite number of states that can possibly 
be produced. Therefore recursive prediction and completion at each position have to 
terminate eventually, and the parser can proceed to the next input via scanning. 
To complete the description we need only specify the initial and final states. The 
parser starts out with 
$0 : {}_0 \rightarrow .S$,
where S is the sentence nonterminal (note the empty left-hand side). After processing
the last symbol, the parser verifies that
$l : {}_0 \rightarrow S.$
has been produced (among possibly others), where l is the length of the input x. If at
any intermediate stage a state set remains empty (because no states from the previous 
stage permit scanning), the parse can be aborted because an impossible prefix has been 
detected. 
States with empty LHS such as those above are useful in other contexts, as will 
be shown in Section 5.4. We will refer to them collectively as dummy states. Dummy 
states enter the chart only as a result of initialization, as opposed to being derived 
from grammar productions. 
4 Note the difference between "complete" and "completed" states: complete states (those with the dot to 
the right of the entire RHS) are the result of a completion or scanning step, but completion also 
produces states that are not yet complete. 
Table 1
(a) Example grammar for a tiny fragment of English. (b) Earley parser processing the sentence
a circle touches a triangle. Each state is written with its start index as a prefix.
(a)
S → NP VP          Det → a
NP → Det N         N → circle | square | triangle
VP → VT NP         VT → touches
VP → VI PP         VI → is
PP → P NP          P → above | below
(b)
State set 0 (before any input)
  initial:    0 → .S
  predicted:  0S → .NP VP;  0NP → .Det N;  0Det → .a
State set 1 (after scanning "a")
  scanned:    0Det → a.
  completed:  0NP → Det.N
  predicted:  1N → .circle;  1N → .square;  1N → .triangle
State set 2 (after scanning "circle")
  scanned:    1N → circle.
  completed:  0NP → Det N.;  0S → NP.VP
  predicted:  2VP → .VT NP;  2VP → .VI PP;  2VT → .touches;  2VI → .is
State set 3 (after scanning "touches")
  scanned:    2VT → touches.
  completed:  2VP → VT.NP
  predicted:  3NP → .Det N;  3Det → .a
State set 4 (after scanning "a")
  scanned:    3Det → a.
  completed:  3NP → Det.N
  predicted:  4N → .circle;  4N → .square;  4N → .triangle
State set 5 (after scanning "triangle")
  scanned:    4N → triangle.
  completed:  3NP → Det N.;  2VP → VT NP.;  0S → NP VP.;  0 → S.
It is easy to see that Earley parser operations are correct, in the sense that each 
chain of transitions (predictions, scanning steps, completions) corresponds to a pos- 
sible (partial) derivation. Intuitively, it is also true that a parser that performs these 
transitions exhaustively is complete, i.e., it finds all possible derivations. Formal proofs 
of these properties are given in the literature; e.g., Aho and Ullman (1972). The rela- 
tionship between Earley transitions and derivations will be stated more formally in 
the next section. 
The parse trees for sentences can be reconstructed from the chart contents. We will 
illustrate this in Section 5 when discussing Viterbi parses. 
Table 1 shows a simple grammar and a trace of Earley parser operation on a 
sample sentence. 
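The three operations can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation discussed in this article: the names `CFG`, `earley_chart`, and `recognizes` are hypothetical, states are represented as tuples `(k, lhs, rhs, dot)`, and the sketch assumes a grammar without null productions (as the remainder of this section does).

```python
# Grammar of Table 1(a); symbols that are not keys are treated as terminals.
CFG = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("VT", "NP"), ("VI", "PP")],
    "PP": [("P", "NP")],
    "Det": [("a",)],
    "N": [("circle",), ("square",), ("triangle",)],
    "VT": [("touches",)],
    "VI": [("is",)],
    "P": [("above",), ("below",)],
}


def earley_chart(words, grammar=CFG, start="S"):
    """Build the Earley chart; a state is the tuple (k, lhs, rhs, dot)."""
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add((0, "", (start,), 0))  # dummy initial state 0 -> .S
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:  # predict and complete until no new states appear
            k, lhs, rhs, dot = agenda.pop()
            if dot == len(rhs):  # completion: advance states waiting in set k
                new = [(k2, l2, r2, d2 + 1) for (k2, l2, r2, d2) in chart[k]
                       if d2 < len(r2) and r2[d2] == lhs]
            elif rhs[dot] in grammar:  # prediction: expand the nonterminal
                new = [(i, rhs[dot], r, 0) for r in grammar[rhs[dot]]]
            else:  # terminal after the dot: left to the scanning step
                new = []
            for state in new:
                if state not in chart[i]:
                    chart[i].add(state)
                    agenda.append(state)
        if i < len(words):  # scanning: move the dot over the matching word
            for (k, lhs, rhs, dot) in chart[i]:
                if dot < len(rhs) and rhs[dot] == words[i]:
                    chart[i + 1].add((k, lhs, rhs, dot + 1))
    return chart


def recognizes(words, grammar=CFG, start="S"):
    """True if the final dummy state 0 -> S. is in the last state set."""
    chart = earley_chart(words, grammar, start)
    return (0, "", (start,), 1) in chart[len(words)]
```

Calling `recognizes("a circle touches a triangle".split())` should accept the sentence, and the sizes of the state sets returned by `earley_chart` can be compared against the columns of Table 1(b).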
Earley's parser can deal with any type of context-free rule format, even with 
null or $\epsilon$-productions, i.e., those that replace a nonterminal with the empty string.
Such productions do, however, require special attention, and make the algorithm and 
its description more complicated than otherwise necessary. In the following sections 
we assume that no null productions have to be dealt with, and then summarize the 
necessary changes in Section 4.7. One might choose to simply preprocess the grammar 
to eliminate null productions, a process which is also described. 
4. Probabilistic Earley Parsing 
4.1 Stochastic Context-Free Grammars 
A stochastic context-free grammar (SCFG) extends the standard context-free formalism
by adding probabilities to each production:
$X \rightarrow \lambda \quad [p]$,
where the rule probability p is usually written as $P(X \rightarrow \lambda)$. This notation to some
extent hides the fact that p is a conditional probability, of production $X \rightarrow \lambda$ being
chosen, given that X is up for expansion. The probabilities of all rules with the same 
nonterminal X on the LHS must therefore sum to unity. Context-freeness in a prob- 
abilistic setting translates into conditional independence of rule choices. As a result, 
complete derivations have joint probabilities that are simply the products of the rule 
probabilities involved. 
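The two properties just stated (rule probabilities summing to unity per left-hand side, and derivation probabilities being products of rule probabilities) can be checked with a short sketch. The rule probabilities below are hypothetical values attached to the grammar of Table 1(a) purely for illustration, and the helper names `is_proper` and `derivation_probability` are assumptions of this sketch.

```python
import math
from collections import defaultdict

# Hypothetical rule probabilities for the grammar of Table 1(a).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("VT", "NP")): 0.7,
    ("VP", ("VI", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
    ("Det", ("a",)): 1.0,
    ("N", ("circle",)): 0.5,
    ("N", ("square",)): 0.3,
    ("N", ("triangle",)): 0.2,
    ("VT", ("touches",)): 1.0,
    ("VI", ("is",)): 1.0,
    ("P", ("above",)): 0.6,
    ("P", ("below",)): 0.4,
}


def is_proper(rules, tol=1e-9):
    """Check that the probabilities of all rules sharing an LHS sum to one."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return all(abs(total - 1.0) <= tol for total in totals.values())


def derivation_probability(steps, rules):
    """Product of the probabilities of the rules applied in a derivation."""
    return math.prod(rules[step] for step in steps)


# Left-most derivation of "a circle touches a triangle".
DERIVATION = [
    ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("Det", ("a",)),
    ("N", ("circle",)), ("VP", ("VT", "NP")), ("VT", ("touches",)),
    ("NP", ("Det", "N")), ("Det", ("a",)), ("N", ("triangle",)),
]
```

Under these assumed probabilities, `derivation_probability(DERIVATION, RULES)` evaluates to approximately 0.5 × 0.7 × 0.2 = 0.07, since all other rules used have probability one.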
The probabilities of interest mentioned in Section 1 can now be defined formally. 
Definition 1
The following quantities are defined relative to a SCFG G, a nonterminal X, and a
string x over the alphabet $\Sigma$ of G.
a) The probability of a (partial) derivation $\nu_1 \Rightarrow \nu_2 \Rightarrow \cdots \Rightarrow \nu_k$ is inductively
defined by
1) $P(\nu_1) = 1$
2) $P(\nu_1 \Rightarrow \cdots \Rightarrow \nu_k) = P(X \rightarrow \lambda) \, P(\nu_2 \Rightarrow \cdots \Rightarrow \nu_k)$,
where $\nu_1, \nu_2, \ldots, \nu_k$ are strings of terminals and nonterminals, $X \rightarrow \lambda$ is a
production of G, and $\nu_2$ is derived from $\nu_1$ by replacing one occurrence
of X with $\lambda$.
b) The string probability $P(X \stackrel{*}{\Rightarrow} x)$ (of x given X) is the sum of the
probabilities of all left-most derivations $X \Rightarrow \cdots \Rightarrow x$ producing x
from X. 5
c) The sentence probability $P(S \stackrel{*}{\Rightarrow} x)$ (of x given G) is the string
probability given the start symbol S of G. By definition, this is also the
probability $P(x \mid G)$ assigned to x by the grammar G.
d) The prefix probability $P(S \stackrel{*}{\Rightarrow}_L x)$ (of x given G) is the sum of the
probabilities of all sentence strings having x as a prefix,
$P(S \stackrel{*}{\Rightarrow}_L x) = \sum_{y \in \Sigma^*} P(S \stackrel{*}{\Rightarrow} xy)$
(In particular, $P(S \stackrel{*}{\Rightarrow}_L \epsilon) = 1$).
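As a small worked illustration of parts (b) through (d), consider a hypothetical grammar (not taken from this article) with the two productions $S \rightarrow a\,S \ [q]$ and $S \rightarrow b \ [1-q]$, where $0 < q < 1$. Every sentence has the form $a^n b$ and has exactly one derivation, so the sentence and prefix probabilities can be written in closed form; the last line applies the conditional follow-word formula from Section 1.

```latex
\begin{align*}
P(S \stackrel{*}{\Rightarrow} a^n b) &= q^n (1-q), \\
P(S \stackrel{*}{\Rightarrow}_L a^m) &= \sum_{n \ge m} q^n (1-q) = q^m, \\
P(a \mid a^m) &= \frac{P(S \stackrel{*}{\Rightarrow}_L a^{m+1})}{P(S \stackrel{*}{\Rightarrow}_L a^m)} = \frac{q^{m+1}}{q^m} = q .
\end{align*}
```

Setting $m = 0$ gives a prefix probability of 1 for the empty string, in agreement with the remark closing part (d).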
In the following, we assume that the probabilities in a SCFG are proper and con- 
sistent as defined in Booth and Thompson (1973), and that the grammar contains no 
useless nonterminals (ones that can never appear in a derivation). These restrictions 
ensure that all nonterminals define probability measures over strings; i.e., $P(X \stackrel{*}{\Rightarrow} x)$ is
a proper distribution over x for all X. Formal definitions of these conditions are given 
in Appendix A. 
5 In a left-most derivation each step replaces the nonterminal furthest to the left in the partially 
expanded string. The order of expansion is actually irrelevant for this definition, because of the 
multiplicative combination of production probabilities. We restrict summation to left-most derivations 
to avoid counting duplicates, and because left-most derivations will play an important role later. 
4.2 Earley Paths and Their Probabilities 
In order to define the probabilities associated with parser operation on a SCFG, we 
need the concept of a path, or partial derivation, executed by the Earley parser. 
Definition 2
a) An (unconstrained) Earley path, or simply path, is a sequence of Earley
states linked by prediction, scanning, or completion. For the purpose of
this definition, we allow scanning to operate in "generation mode," i.e.,
all states with terminals to the right of the dot can be scanned, not just
those matching the input. (For completed states, the predecessor state is
defined to be the complete state from the same state set contributing to
the completion.)
b) A path is said to be constrained by, or to generate, a string x if the
terminals immediately to the left of the dot in all scanned states, in
sequence, form the string x.
c) A path is complete if the last state on it matches the first, except that the
dot has moved to the end of the RHS.
d) We say that a path starts with nonterminal X if the first state on it is a
predicted state with X on the LHS.
e) The length of a path is defined as the number of scanned states on it.
Note that the definition of path length is somewhat counterintuitive, but is moti- 
vated by the fact that only scanned states correspond directly to input symbols. Thus 
the length of a path is always the same as the length of the input string it generates. 
A constrained path starting with the initial state contains a sequence of states 
from state set 0 derived by repeated prediction, followed by a single state from set 1 
produced by scanning the first symbol, followed by a sequence of states produced by 
completion, followed by a sequence of predicted states, followed by a state scanning 
the second symbol, and so on. The significance of Earley paths is that they are in a 
one-to-one correspondence with left-most derivations. This will allow us to talk about 
probabilities of derivations, strings, and prefixes in terms of the actions performed by 
Earley's parser. From now on, we will use "derivation" to imply a left-most derivation. 
Lemma 1
a) An Earley parser generates state
$i : {}_k X \rightarrow \lambda.\mu$,
if and only if there is a partial derivation
$S \stackrel{*}{\Rightarrow} x_{0 \ldots k-1} X \nu \Rightarrow x_{0 \ldots k-1} \lambda\mu\nu \stackrel{*}{\Rightarrow} x_{0 \ldots k-1} x_{k \ldots i-1} \mu\nu$
deriving a prefix $x_{0 \ldots i-1}$ of the input.
b) There is a one-to-one mapping between partial derivations and Earley
paths, such that each production $X \rightarrow \nu$ applied in a derivation
corresponds to a predicted Earley state $X \rightarrow .\nu$.
(a) is the invariant underlying the correctness and completeness of Earley's algo- 
rithm; it can be proved by induction on the length of a derivation (Aho and Ullman 
1972, Theorem 4.9). The slightly stronger form (b) follows from (a) and the way pos- 
sible prediction steps are defined. 
Since we have established that paths correspond to derivations, it is convenient 
to associate derivation probabilities directly with paths. The uniqueness condition (b) 
above, which is irrelevant to the correctness of a standard Earley parser, justifies (prob- 
abilistic) counting of paths in lieu of derivations. 
Definition 3
The probability $P(\mathcal{P})$ of a path $\mathcal{P}$ is the product of the probabilities of all rules used
in the predicted states occurring in $\mathcal{P}$.
Lemma 2
a) For all paths $\mathcal{P}$ starting with a nonterminal X, $P(\mathcal{P})$ gives the probability
of the (partial) derivation represented by $\mathcal{P}$. In particular, the string
probability $P(X \stackrel{*}{\Rightarrow} x)$ is the sum of the probabilities of all paths starting
with X that are complete and constrained by x.
b) The sentence probability $P(S \stackrel{*}{\Rightarrow} x)$ is the sum of the probabilities of all
complete paths starting with the initial state, constrained by x.
c) The prefix probability $P(S \stackrel{*}{\Rightarrow}_L x)$ is the sum of the probabilities of all
paths $\mathcal{P}$ starting with the initial state, constrained by x, that end in a
scanned state.
Note that when summing over all paths "starting with the initial state," summa- 
tion is actually over all paths starting with S, by definition of the initial state ${}_0 \rightarrow .S$.
(a) follows directly from our definitions of derivation probability, string probability, 
path probability, and the one-to-one correspondence between paths and derivations 
established by Lemma 1. (b) follows from (a) by using S as the start nonterminal. To 
obtain the prefix probability in (c), we need to sum the probabilities of all complete 
derivations that generate x as a prefix. The constrained paths ending in scanned states 
represent exactly the beginnings of all such derivations. Since the grammar is assumed 
to be consistent and without useless nonterminals, all partial derivations can be com- 
pleted with probability one. Hence the sum over the constrained incomplete paths is 
the sought-after sum over all complete derivations generating the prefix. 
4.3 Forward and Inner Probabilities 
Since string and prefix probabilities are the result of summing derivation probabilities, 
the goal is to compute these sums efficiently by taking advantage of the Earley control 
structure. This can be accomplished by attaching two probabilistic quantities to each 
Earley state, as follows. The terminology is derived from analogous or similar quan- 
tities commonly used in the literature on Hidden Markov Models (HMMs) (Rabiner 
and Juang 1986) and in Baker (1979). 
Definition 4 
The following definitions are relative to an implied input string x. 
a) The forward probability $\alpha_i({}_k X \rightarrow \lambda.\mu)$ is the sum of the probabilities of
all constrained paths of length i that end in state ${}_k X \rightarrow \lambda.\mu$.
b) The inner probability $\gamma_i({}_k X \rightarrow \lambda.\mu)$ is the sum of the probabilities of all
paths of length i - k that start in state $k : {}_k X \rightarrow .\lambda\mu$ and end in
$i : {}_k X \rightarrow \lambda.\mu$, and generate the input symbols $x_k \ldots x_{i-1}$.
It helps to interpret these quantities in terms of an unconstrained Earley parser that 
operates as a generator emitting--rather than recognizing--strings. Instead of tracking 
all possible derivations, the generator traces along a single Earley path randomly 
determined by always choosing among prediction steps according to the associated 
rule probabilities. Notice that the scanning and completion steps are deterministic once 
the rules have been chosen. 
Intuitively, the forward probability $\alpha_i({}_k X \rightarrow \lambda.\mu)$ is the probability of an Earley
generator producing the prefix of the input up to position i - 1 while passing through
state ${}_k X \rightarrow \lambda.\mu$ at position i. However, due to left-recursion in productions the same
state may appear several times on a path, and each occurrence is counted toward
the total $\alpha_i$. Thus, $\alpha_i$ is really the expected number of occurrences of the given state
in state set i. Having said that, we will refer to $\alpha$ simply as a probability, both for
the sake of brevity, and to keep the analogy to the HMM terminology of which this
is a generalization. 6 Note that for scanned states, $\alpha$ is always a probability, since by
definition a scanned state can occur only once along a path.
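To see why a forward "probability" can exceed one, consider a small left-recursive grammar chosen here for illustration, with productions $S \rightarrow a \ [p]$ and $S \rightarrow S\,b \ [q]$, $p + q = 1$. By Definition 3, a path that reaches the predicted state ${}_0 S \rightarrow .S\,b$ after $n$ successive predictions of $S \rightarrow S\,b$ has probability $q^n$, and each such occurrence counts toward the forward probability in state set 0:

```latex
\begin{align*}
\alpha_0({}_0 S \rightarrow .S\,b) &= q + q^2 + q^3 + \cdots = \frac{q}{1-q} = \frac{q}{p}, \\
\alpha_0({}_0 S \rightarrow .a)   &= p \,(1 + q + q^2 + \cdots) = \frac{p}{1-q} = 1 .
\end{align*}
```

For instance, with the assumed value $q = 0.6$ the first quantity is $1.5$, which is only interpretable as an expected number of occurrences of the state, not as a probability.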
The inner probabilities, on the other hand, represent the probability of generating 
a substring of the input from a given nonterminal, using a particular production. 
Inner probabilities are thus conditional on the presence of a given nonterminal X with 
expansion starting at position k, unlike the forward probabilities, which include the 
generation history starting with the initial state. The inner probabilities as defined here 
correspond closely to the quantities of the same name in Baker (1979). The sum of $\gamma$
of all states with a given LHS X is exactly Baker's inner probability for X.
The following is essentially a restatement of Lemma 2 in terms of forward and 
inner probabilities. It shows how to obtain the sentence and string probabilities we 
are interested in, provided that forward and inner probabilities can be computed ef- 
fectively. 
Lemma 3
The following assumes an Earley chart constructed by the parser on an input string x
with $|x| = l$.
a) Provided that $S \stackrel{*}{\Rightarrow}_L x_{0 \ldots k-1} X \nu$ is a possible left-most derivation of the
grammar (for some $\nu$), the probability that a nonterminal X generates the
substring $x_k \ldots x_{i-1}$ can be computed as the sum
$P(X \stackrel{*}{\Rightarrow} x_{k \ldots i-1}) = \sum_{i :\, {}_k X \rightarrow \lambda.} \gamma_i({}_k X \rightarrow \lambda.)$
(sum of inner probabilities over all complete states with LHS X and start
index k).
6 The same technical complication was noticed by Wright (1990) in the computation of probabilistic LR 
parser tables. The relation to LR parsing will be discussed in Section 6.2. Incidentally, a similar 
interpretation of forward "probabilities" is required for HMMs with non-emitting states. 
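b) In particular, the string probability $P(S \stackrel{*}{\Rightarrow} x)$ can be computed as 7
$P(S \stackrel{*}{\Rightarrow} x) = \gamma_l({}_0 \rightarrow S.) = \alpha_l({}_0 \rightarrow S.)$
c) The prefix probability $P(S \stackrel{*}{\Rightarrow}_L x)$, with $|x| = l$, can be computed as
$P(S \stackrel{*}{\Rightarrow}_L x) = \sum_{{}_k X \rightarrow \lambda x_{l-1}.\mu} \alpha_l({}_k X \rightarrow \lambda x_{l-1}.\mu)$
(sum of forward probabilities over all scanned states).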
The restriction in (a) that X be preceded by a possible prefix is necessary, since 
the Earley parser at position i will only pursue derivations that are consistent with 
the input up to position i. This constitutes the main distinguishing feature of Earley 
parsing compared to the strict bottom-up computation used in the standard inside 
probability computation (Baker 1979). There, inside probabilities for all positions and 
nonterminals are computed, regardless of possible prefixes. 
4.4 Computing Forward and Inner Probabilities 
Forward and inner probabilities not only subsume the prefix and string probabilities, 
they are also straightforward to compute during a run of Earley's algorithm. In fact, if 
it weren't for left-recursive and unit productions their computation would be trivial. 
For the purpose of exposition we will therefore ignore the technical complications 
introduced by these productions for a moment, and then return to them once the 
overall picture has become clear. 
During a run of the parser both forward and inner probabilities will be attached 
to each state, and updated incrementally as new states are created through one of the 
three types of transitions. Both probabilities are set to unity for the initial state ${}_0 \rightarrow .S$.
This is consistent with the interpretation that the initial state is derived from a dummy
production $\rightarrow S$ for which no alternatives exist.
Parsing then proceeds as usual, with the probabilistic computations detailed below. 
The probabilities associated with new states will be computed as sums of various 
combinations of old probabilities. As new states are generated by prediction, scanning, 
and completion, certain probabilities have to be accumulated, corresponding to the 
multiple paths leading to a state. That is, if the same state is generated multiple 
times, the previous probability associated with it has to be incremented by the new 
contribution just computed. States and probability contributions can be generated in 
any order, as long as the summation for one state is finished before its probability 
enters into the computation of some successor state. Appendix B.2 suggests a way to 
implement this incremental summation. 
Notation. A few intuitive abbreviations are used from here on to describe Earley
transitions succinctly. (1) To avoid unwieldy $\sum$ notation we adopt the following
convention. The expression x += y means that x is computed incrementally as a sum of
various y terms, which are computed in some order and accumulated to finally yield
the value of x. 8 (2) Transitions are denoted by $\Longrightarrow$, with predecessor states on the left
7 The definitions of forward and inner probabilities coincide for the final state. 
8 This notation suggests a simple implementation, being obviously borrowed from the programming 
language C. 
and successor states on the right. (3) The forward and inner probabilities of states are
notated in brackets after each state, e.g.,
$i : {}_k X \rightarrow \lambda.Y\mu \quad [\alpha, \gamma]$
is shorthand for $\alpha = \alpha_i({}_k X \rightarrow \lambda.Y\mu)$, $\gamma = \gamma_i({}_k X \rightarrow \lambda.Y\mu)$.
Prediction (probabilistic).
$$ i: {}_kX \to \lambda.Y\mu \quad [\alpha, \gamma] \;\Longrightarrow\; i: {}_iY \to .\nu \quad [\alpha', \gamma'] $$
for all productions $Y \to \nu$. The new probabilities can be computed as
$$ \alpha' \mathrel{+}= \alpha \cdot P(Y \to \nu) $$
$$ \gamma' = P(Y \to \nu) $$
Note that only the forward probability is accumulated; $\gamma$ is not used in this step.
Rationale. $\alpha'$ is the sum of all path probabilities leading up to ${}_kX \to \lambda.Y\mu$, times
the probability of choosing production $Y \to \nu$. The value $\gamma'$ is just a special case of the
definition.
Scanning (probabilistic).
$$ i: {}_kX \to \lambda.a\mu \quad [\alpha, \gamma] \;\Longrightarrow\; i+1: {}_kX \to \lambda a.\mu \quad [\alpha', \gamma'] $$
for all states with terminal $a$ matching input at position $i$. Then
$$ \alpha' = \alpha $$
$$ \gamma' = \gamma $$
Rationale. Scanning does not involve any new choices, since the terminal was already
selected as part of the production during prediction.$^{9}$
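Completion (probabilistic).
$$ \left. \begin{array}{ll} i: {}_jY \to \nu. & [\alpha'', \gamma''] \\ j: {}_kX \to \lambda.Y\mu & [\alpha, \gamma] \end{array} \right\} \;\Longrightarrow\; i: {}_kX \to \lambda Y.\mu \quad [\alpha', \gamma'] $$
Then
$$ \alpha' \mathrel{+}= \alpha \cdot \gamma'' \qquad (11) $$
$$ \gamma' \mathrel{+}= \gamma \cdot \gamma'' \qquad (12) $$
Note that $\alpha''$ is not used.
Rationale. To update the old forward/inner probabilities $\alpha$ and $\gamma$ to $\alpha'$ and $\gamma'$,
respectively, the probabilities of all paths expanding $Y \to \nu$ have to be factored in.
These are exactly the paths summarized by the inner probability $\gamma''$.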
9 In different parsing scenarios the scanning step may well modify probabilities. For example, if the
input symbols themselves have attached likelihoods, these can be integrated by multiplying them onto
$\alpha$ and $\gamma$ when a symbol is scanned. That way it is possible to perform efficient Earley parsing with
integrated joint probability computation directly on weighted lattices describing ambiguous inputs.
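The three probabilistic transitions can be traced by hand on a small grammar. The following is a minimal sketch, not the implementation described in the appendices: the function names and the toy grammar (S -> A b [1], A -> a [3/10], A -> a a [7/10]) are assumptions made purely for illustration, and because every state here has a single derivation, the `+=` accumulation reduces to plain assignment.

```python
from fractions import Fraction

def predict(alpha, rule_prob):
    """One prediction step: returns the contribution to alpha' and the value gamma'."""
    return alpha * rule_prob, rule_prob

def scan(alpha, gamma):
    """Scanning copies both probabilities unchanged."""
    return alpha, gamma

def complete(alpha, gamma, gamma_completed):
    """One completion step, equations (11) and (12)."""
    return alpha * gamma_completed, gamma * gamma_completed

# Hypothetical toy grammar: S -> A b [1], A -> a [3/10], A -> a a [7/10]; input "a b".
one = Fraction(1)
p_short, p_long = Fraction(3, 10), Fraction(7, 10)

# State set 0: initial state followed by predictions.
a_init, g_init = one, one                    # 0: 0 -> .S
a_S0, g_S0 = predict(a_init, one)            # 0: 0S -> .A b
a_A0, g_A0 = predict(a_S0, p_short)          # 0: 0A -> .a
a_AA0, g_AA0 = predict(a_S0, p_long)         # 0: 0A -> .a a

# State set 1: scan "a" in both A states, then complete A inside S -> .A b.
a_A1, g_A1 = scan(a_A0, g_A0)                # 1: 0A -> a.
a_AA1, g_AA1 = scan(a_AA0, g_AA0)            # 1: 0A -> a.a
a_S1, g_S1 = complete(a_S0, g_S0, g_A1)      # 1: 0S -> A.b

# The prefix probability of "a" is the sum of alpha over the scanned states in set 1.
prefix_a = a_A1 + a_AA1

# State set 2: scan "b", then complete S inside the dummy production.
a_S2, g_S2 = scan(a_S1, g_S1)                # 2: 0S -> A b.
a_final, g_final = complete(a_init, g_init, g_S2)  # 2: 0 -> S.

print(prefix_a, a_final, g_final)
```

In this sketch the forward and inner probabilities of the final state coincide and equal the string probability, as footnote 7 states.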
4.5 Coping with Recursion 
The standard Earley algorithm, together with the probability computations described 
in the previous section, would be sufficient if it weren't for the problem of recursion 
in the prediction and completion steps. 
The nonprobabilistic Earley algorithm can stop recursing as soon as all predic- 
tions/completions yield states already contained in the current state set. For the com- 
putation of probabilities, however, this would mean truncating the probabilities re- 
sulting from the repeated summing of contributions. 
4.5.1 Prediction loops. As an example, consider the following simple left-recursive
SCFG.
$$ S \to a \quad [p] $$
$$ S \to Sb \quad [q], $$
where $q = 1 - p$. Nonprobabilistically, the prediction loop at position 0 would stop
after producing the states
$$ {}_0 \to .S $$
$$ {}_0S \to .a $$
$$ {}_0S \to .Sb. $$
This would leave the forward probabilities at
$$ \alpha_0({}_0S \to .a) = p $$
$$ \alpha_0({}_0S \to .Sb) = q, $$
corresponding to just two out of an infinity of possible paths. The correct forward
probabilities are obtained as a sum of infinitely many terms, accounting for all possible
paths of length 1.
$$ \alpha_0({}_0S \to .a) = p + qp + q^2p + \cdots = p(1 - q)^{-1} = 1 $$
$$ \alpha_0({}_0S \to .Sb) = q + q^2 + q^3 + \cdots = q(1 - q)^{-1} = p^{-1}q $$
In these sums each p corresponds to a choice of the first production, each q to a choice 
of the second production. If we didn't care about finite computation the resulting 
geometric series could be computed by letting the prediction loop (and hence the 
summation) continue indefinitely. 
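The convergence of these two series is easy to check numerically. The short sketch below (the function name `truncated_forward` is an illustrative assumption) sums the first few terms of each series, as a prediction loop cut off after a fixed number of rounds would.

```python
def truncated_forward(p, n_terms):
    """Partial sums of the two forward-probability series for S -> a [p], S -> Sb [q]."""
    q = 1 - p
    alpha_a = sum(q**k * p for k in range(n_terms))      # p + qp + q^2 p + ...
    alpha_Sb = sum(q**(k + 1) for k in range(n_terms))   # q + q^2 + q^3 + ...
    return alpha_a, alpha_Sb

p = 0.25
for n in (1, 5, 50, 200):
    print(n, truncated_forward(p, n))
```

With a single term the sums equal the truncated values $p$ and $q$; as more terms are included they approach the closed forms $1$ and $p^{-1}q$.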
Fortunately, all repeated prediction steps, including those due to left-recursion 
in the productions, can be collapsed into a single, modified prediction step, and the 
corresponding sums computed in closed form. For this purpose we need a probabilistic 
version of the well-known parsing concept of a left corner, which is also at the heart
of the prefix probability algorithm of Jelinek and Lafferty (1991). 
Definition 5
The following definitions are relative to a given SCFG $G$.
a) Two nonterminals $X$ and $Y$ are said to be in a left-corner relation
$X \to_L Y$ iff there exists a production for $X$ that has a RHS starting with $Y$,
$X \to Y\lambda$.
b) The probabilistic left-corner relation$^{10}$ $P_L = P_L(G)$ is the matrix of
probabilities $P(X \to_L Y)$, defined as the total probability of choosing a
production for $X$ that has $Y$ as a left corner:
$$ P(X \to_L Y) = \sum_{X \to Y\lambda \,\in\, G} P(X \to Y\lambda). $$
c) The relation $X \stackrel{*}{\Rightarrow}_L Y$ is defined as the reflexive, transitive closure of
$X \to_L Y$, i.e., $X \stackrel{*}{\Rightarrow}_L Y$ iff $X = Y$ or there is a nonterminal $Z$ such that
$X \to_L Z$ and $Z \stackrel{*}{\Rightarrow}_L Y$.
d) The probabilistic reflexive, transitive left-corner relation $R_L = R_L(G)$ is
a matrix of probability sums $R(X \stackrel{*}{\Rightarrow}_L Y)$. Each $R(X \stackrel{*}{\Rightarrow}_L Y)$ is defined as a
series
$$
\begin{aligned}
R(X \stackrel{*}{\Rightarrow}_L Y) ={} & P(X = Y) \\
 & + P(X \to_L Y) \\
 & + \sum_{Z_1} P(X \to_L Z_1) P(Z_1 \to_L Y) \\
 & + \sum_{Z_1, Z_2} P(X \to_L Z_1) P(Z_1 \to_L Z_2) P(Z_2 \to_L Y) \\
 & + \cdots
\end{aligned}
$$
Alternatively, $R_L$ is defined by the recurrence relation
$$ R(X \stackrel{*}{\Rightarrow}_L Y) = \delta(X, Y) + \sum_{Z} P(X \to_L Z) R(Z \stackrel{*}{\Rightarrow}_L Y), $$
where we use the delta function, defined as $\delta(X, Y) = 1$ if $X = Y$, and
$\delta(X, Y) = 0$ if $X \neq Y$.
The recurrence for $R_L$ can be conveniently written in matrix notation
$$ R_L = I + P_L R_L, $$
from which the closed-form solution is derived:
$$ R_L = (I - P_L)^{-1}. $$
An existence proof for RL is given in Appendix A. Appendix B.3.1 shows how to speed 
up the computation of RL by inverting only a reduced version of the matrix I - PL. 
The significance of the matrix $R_L$ for the Earley algorithm is that its elements are
the sums of the probabilities of the potentially infinitely many prediction paths leading
from a state ${}_kX \to \lambda.Z\mu$ to a predicted state ${}_iY \to .\nu$, via any number of intermediate
states.
RL can be computed once for each grammar, and used for table-lookup in the 
following, modified prediction step. 
10 If a probabilistic relation $R$ is replaced by its set-theoretic version $R'$, i.e., $(x, y) \in R'$ iff $R(x, y) \neq 0$,
then the closure operations used here reduce to their traditional discrete counterparts; hence the choice
of terminology.
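Prediction (probabilistic, transitive).
$$ i: {}_kX \to \lambda.Z\mu \quad [\alpha, \gamma] \;\Longrightarrow\; i: {}_iY \to .\nu \quad [\alpha', \gamma'] $$
for all productions $Y \to \nu$ such that $R(Z \stackrel{*}{\Rightarrow}_L Y)$ is nonzero. Then
$$ \alpha' \mathrel{+}= \alpha \cdot R(Z \stackrel{*}{\Rightarrow}_L Y) P(Y \to \nu) \qquad (13) $$
$$ \gamma' = P(Y \to \nu) \qquad (14) $$
The new $R(Z \stackrel{*}{\Rightarrow}_L Y)$ factor in the updated forward probability accounts for the
sum of all path probabilities linking $Z$ to $Y$. For $Z = Y$ this covers the case of a single
step of prediction; $R(Y \stackrel{*}{\Rightarrow}_L Y) \geq 1$ always, since $R_L$ is defined as a reflexive closure.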
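The closed form $R_L = (I - P_L)^{-1}$ can be computed directly for small grammars. The sketch below is one possible way to do so, using exact rational arithmetic and a plain Gauss-Jordan inversion; the rule encoding `(lhs, rhs, prob)` and the function names are assumptions for illustration, and the speed-ups of Appendix B.3.1 are not attempted.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def left_corner_closure(rules, nonterminals):
    """Return R_L = (I - P_L)^{-1}, indexed in the order given by `nonterminals`."""
    idx = {X: i for i, X in enumerate(nonterminals)}
    n = len(nonterminals)
    P = [[Fraction(0)] * n for _ in range(n)]
    for lhs, rhs, prob in rules:
        if rhs and rhs[0] in idx:          # RHS starts with a nonterminal
            P[idx[lhs]][idx[rhs[0]]] += prob
    i_minus_p = [[Fraction(int(i == j)) - P[i][j] for j in range(n)]
                 for i in range(n)]
    return invert(i_minus_p)

# The left-recursive grammar of Section 4.5.1 with p = 1/4, q = 3/4.
p = Fraction(1, 4)
rules = [("S", ("a",), p), ("S", ("S", "b"), 1 - p)]
R = left_corner_closure(rules, ["S"])

# Equation (13) applied to the initial state (alpha = 1, Z = Y = S).
alpha_a = 1 * R[0][0] * p          # forward probability of 0S -> .a
alpha_Sb = 1 * R[0][0] * (1 - p)   # forward probability of 0S -> .Sb
print(R, alpha_a, alpha_Sb)
```

For this grammar the single entry of $R_L$ is $p^{-1}$, so the two forward probabilities come out as $1$ and $p^{-1}q$, matching the geometric sums of Section 4.5.1.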
4.5.2 Completion loops. As in prediction, the completion step in the Earley algorithm 
may imply an infinite summation, and could lead to an infinite loop if computed 
naively. However, only unit productions$^{11}$ can give rise to cyclic completions.
The problem is best explained by studying an example. Consider the grammar
$$ S \to a \quad [p] $$
$$ S \to T \quad [q] $$
$$ T \to S \quad [1], $$
where $q = 1 - p$. Presented with the input $a$ (the only string the grammar generates),
after one cycle of prediction, the Earley chart contains the following states.
$$
\begin{array}{lll}
0: {}_0 \to .S & \alpha = 1, & \gamma = 1 \\
0: {}_0S \to .T & \alpha = p^{-1}q, & \gamma = q \\
0: {}_0T \to .S & \alpha = p^{-1}q, & \gamma = 1 \\
0: {}_0S \to .a & \alpha = p^{-1}p = 1, & \gamma = p.
\end{array}
$$
The $p^{-1}$ factors are a result of the left-corner sum $1 + q + q^2 + \cdots = (1 - q)^{-1}$.
After scanning ${}_0S \to .a$, completion without truncation would enter an infinite
loop. First ${}_0T \to .S$ is completed, yielding a complete state ${}_0T \to S.$, which allows
${}_0S \to .T$ to be completed, leading to another complete state for $S$, etc. The nonprobabilistic
Earley parser can just stop here, but as in prediction, this would lead to truncated
probabilities. The sum of probabilities that needs to be computed to arrive at the correct
result contains infinitely many terms, one for each possible loop through the $T \to S$
production. Each such loop adds a factor of $q$ to the forward and inner probabilities.
The summations for all completed states turn out as
$$
\begin{array}{lll}
1: {}_0S \to a. & \alpha = 1, & \gamma = p \\
1: {}_0T \to S. & \alpha = p^{-1}q(p + pq + pq^2 + \cdots) = p^{-1}q, & \gamma = p + pq + pq^2 + \cdots = 1 \\
1: {}_0 \to S. & \alpha = p + pq + pq^2 + \cdots = 1, & \gamma = p + pq + pq^2 + \cdots = 1 \\
1: {}_0S \to T. & \alpha = p^{-1}q(p + pq + pq^2 + \cdots) = p^{-1}q, & \gamma = q(p + pq + pq^2 + \cdots) = q
\end{array}
$$
11 Unit productions are also called "chain productions" or "single productions" in the literature. 
The approach taken here to compute exact probabilities in cyclic completions is 
mostly analogous to that for left-recursive predictions. The main difference is that unit 
productions, rather than left-corners, form the underlying transitive relation. Before 
proceeding we can convince ourselves that this is indeed the only case we have to 
worry about. 
Lemma 4
Let
$$ {}_{k_1}X_1 \to \lambda_1 X_2. \;\Longrightarrow\; {}_{k_2}X_2 \to \lambda_2 X_3. \;\Longrightarrow\; \cdots \;\Longrightarrow\; {}_{k_c}X_c \to \lambda_c X_{c+1}. $$
be a completion cycle, i.e., $k_1 = k_c$, $X_1 = X_c$, $\lambda_1 = \lambda_c$, $X_2 = X_{c+1}$. Then it must be the
case that $\lambda_1 = \lambda_2 = \cdots = \lambda_c = \epsilon$, i.e., all productions involved are unit productions
$X_1 \to X_2, \ldots, X_c \to X_{c+1}$.
Proof
For all completion chains it is true that the start indices of the states are monotonically
increasing, $k_1 \leq k_2 \leq \cdots$ (a state can only complete an expansion that started at
the same or a previous position). From $k_1 = k_c$, it follows that $k_1 = k_2 = \cdots = k_c$.
Because the current position (dot) also refers to the same input index in all states,
all nonterminals $X_1, X_2, \ldots, X_c$ have been expanded into the same substring of the
input between $k_1$ and the current position. By assumption the grammar contains no
nonterminals that generate $\epsilon$,$^{12}$ therefore we must have $\lambda_1 = \lambda_2 = \cdots = \lambda_c = \epsilon$, q.e.d.
$\Box$
We now formally define the relation between nonterminals mediated by unit pro- 
ductions, analogous to the left-corner relation. 
Definition 6
The following definitions are relative to a given SCFG $G$.
a) Two nonterminals $X$ and $Y$ are said to be in a unit-production relation
$X \to Y$ iff there exists a production for $X$ that has $Y$ as its RHS.
b) The probabilistic unit-production relation $P_U = P_U(G)$ is the matrix of
probabilities $P(X \to Y)$.
c) The relation $X \stackrel{*}{\Rightarrow} Y$ is defined as the reflexive, transitive closure of
$X \to Y$, i.e., $X \stackrel{*}{\Rightarrow} Y$ iff $X = Y$ or there is a nonterminal $Z$ such that $X \to Z$
and $Z \stackrel{*}{\Rightarrow} Y$.
d) The probabilistic reflexive, transitive unit-production relation
$R_U = R_U(G)$ is the matrix of probability sums $R(X \stackrel{*}{\Rightarrow} Y)$. Each $R(X \stackrel{*}{\Rightarrow} Y)$
is defined as a series
$$
\begin{aligned}
R(X \stackrel{*}{\Rightarrow} Y) ={} & P(X = Y) \\
 & + P(X \to Y) \\
 & + \sum_{Z_1} P(X \to Z_1) P(Z_1 \to Y) \\
 & + \sum_{Z_1, Z_2} P(X \to Z_1) P(Z_1 \to Z_2) P(Z_2 \to Y) \\
 & + \cdots \\
 ={} & \delta(X, Y) + \sum_{Z} P(X \to Z) R(Z \stackrel{*}{\Rightarrow} Y).
\end{aligned}
$$
As before, a matrix inversion can compute the relation $R_U$ in closed form:
$$ R_U = (I - P_U)^{-1}. $$
The existence of $R_U$ is shown in Appendix A.
12 Even with null productions, these would not be used for Earley transitions; see Section 4.7.
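The modified completion loop in the probabilistic Earley parser can now use the
$R_U$ matrix to collapse all unit completions into a single step. Note that we still have
to do iterative completion on non-unit productions.
Completion (probabilistic, transitive).
$$ \left. \begin{array}{ll} i: {}_jY \to \nu. & [\alpha'', \gamma''] \\ j: {}_kX \to \lambda.Z\mu & [\alpha, \gamma] \end{array} \right\} \;\Longrightarrow\; i: {}_kX \to \lambda Z.\mu \quad [\alpha', \gamma'] $$
for all $Y, Z$ such that $R(Z \stackrel{*}{\Rightarrow} Y)$ is nonzero, and $Y \to \nu$ is not a unit production ($|\nu| > 1$
or $\nu \in \Sigma$). Then
$$ \alpha' \mathrel{+}= \alpha \cdot \gamma'' R(Z \stackrel{*}{\Rightarrow} Y) $$
$$ \gamma' \mathrel{+}= \gamma \cdot \gamma'' R(Z \stackrel{*}{\Rightarrow} Y) $$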
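As a consistency check, the transitive completion step can be applied to the grammar of Section 4.5.2 ($S \to a\ [p]$, $S \to T\ [q]$, $T \to S\ [1]$). The derivation below is a worked sketch; the ordering of nonterminals as $(S, T)$ in the matrices is an arbitrary choice made here.

$$
P_U = \begin{pmatrix} 0 & q \\ 1 & 0 \end{pmatrix}, \qquad
I - P_U = \begin{pmatrix} 1 & -q \\ -1 & 1 \end{pmatrix}, \qquad
R_U = (I - P_U)^{-1} = \frac{1}{1 - q} \begin{pmatrix} 1 & q \\ 1 & 1 \end{pmatrix}
    = p^{-1} \begin{pmatrix} 1 & q \\ 1 & 1 \end{pmatrix}.
$$

The only completed state with a non-unit production is $1: {}_0S \to a.$, with $\gamma'' = p$. Combining it with the three predicted states that have a nonterminal after the dot gives

$$
\begin{array}{llll}
{}_0 \to .S & (Z = S): & \alpha' = 1 \cdot p \cdot p^{-1} = 1, & \gamma' = 1 \cdot p \cdot p^{-1} = 1 \\
{}_0T \to .S & (Z = S): & \alpha' = p^{-1}q \cdot p \cdot p^{-1} = p^{-1}q, & \gamma' = 1 \cdot p \cdot p^{-1} = 1 \\
{}_0S \to .T & (Z = T): & \alpha' = p^{-1}q \cdot p \cdot p^{-1} = p^{-1}q, & \gamma' = q \cdot p \cdot p^{-1} = q
\end{array}
$$

using $R(S \stackrel{*}{\Rightarrow} S) = R(T \stackrel{*}{\Rightarrow} S) = p^{-1}$. These agree with the infinite sums listed for the completed states in Section 4.5.2.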
4.6 An Example
Consider the grammar
$$ S \to a \quad [p] $$
$$ S \to SS \quad [q] $$
where $q = 1 - p$. This highly ambiguous grammar generates strings of any number of
$a$'s, using all possible binary parse trees over the given number of terminals. The states
involved in parsing the string $aaa$ are listed in Table 2, along with their forward and
inner probabilities. The example illustrates how the parser deals with left-recursion
and the merging of alternative sub-parses during completion.
Since the grammar has only a single nonterminal, the left-corner matrix $P_L$ has
rank 1:
$$ P_L = [q]. $$
Its transitive closure is
$$ R_L = (I - P_L)^{-1} = [p]^{-1} = [p^{-1}]. $$
Consequently, the example trace shows the factor $p^{-1}$ being introduced into the
forward probability terms in the prediction steps.
The sample string can be parsed as either $(a(aa))$ or $((aa)a)$, each parse having a
probability of $p^3q^2$. The total string probability is thus $2p^3q^2$, the computed $\alpha$ and $\gamma$
values for the final state. The $\alpha$ values for the scanned states in sets 1, 2, and 3 are
the prefix probabilities for $a$, $aa$, and $aaa$, respectively: $P(S \stackrel{*}{\Rightarrow}_L a) = 1$, $P(S \stackrel{*}{\Rightarrow}_L aa) = q$,
$P(S \stackrel{*}{\Rightarrow}_L aaa) = (1 + p)q^2$.
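These values can be cross-checked independently of the Earley chart. The sketch below does not implement the Earley parser; it swaps in a plain inside recursion over span lengths for the string probabilities, and obtains prefix probabilities by subtracting the mass of all shorter strings from 1. The latter step assumes that the grammar is consistent (all probability mass falls on finite strings), which for this grammar holds when $p \geq 1/2$. Function names are illustrative.

```python
from fractions import Fraction

def string_probs(n, p):
    """Return [P(S =>* a^m) for m = 0..n] for the grammar S -> a [p], S -> SS [1-p]."""
    q = 1 - p
    inside = [Fraction(0)] * (n + 1)
    if n >= 1:
        inside[1] = p
    for length in range(2, n + 1):
        # S -> SS splits a span of `length` symbols into two non-empty parts.
        inside[length] = q * sum(inside[k] * inside[length - k]
                                 for k in range(1, length))
    return inside

def prefix_prob(n, p):
    """Probability that a^n is a prefix, assuming a consistent grammar (p >= 1/2)."""
    return 1 - sum(string_probs(n - 1, p)[1:])

p = Fraction(3, 4)
q = 1 - p
print(string_probs(3, p)[3], [prefix_prob(n, p) for n in (1, 2, 3)])
```

For $p = 3/4$ the string probability of $aaa$ equals $2p^3q^2$ and the three prefix probabilities equal $1$, $q$, and $(1 + p)q^2$, in agreement with the values read off the chart.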
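Table 2
Earley chart as constructed during the parse of $aaa$ with the grammar in (a). The two columns
to the right in (b) list the forward and inner probabilities, respectively, for each state. In both $\alpha$
and $\gamma$ columns, the $\cdot$ separates old factors from new ones (as per equations 11, 12 and 13).
Addition indicates multiple derivations of the same state.
(a)
$$ S \to a \quad [p] $$
$$ S \to SS \quad [q] $$
(b)
$$
\begin{array}{lll}
\text{State} & \alpha & \gamma \\
\hline
\text{State set 0} & & \\
{}_0 \to .S & 1 & 1 \\
\text{predicted} & & \\
{}_0S \to .a & 1 \cdot p^{-1}p = 1 & p \\
{}_0S \to .SS & 1 \cdot p^{-1}q = p^{-1}q & q \\
\hline
\text{State set 1} & & \\
\text{scanned} & & \\
{}_0S \to a. & p^{-1}p = 1 & p \\
\text{completed} & & \\
{}_0S \to S.S & p^{-1}q \cdot p = q & q \cdot p = pq \\
\text{predicted} & & \\
{}_1S \to .a & q \cdot p^{-1}p = q & p \\
{}_1S \to .SS & q \cdot p^{-1}q = p^{-1}q^2 & q \\
\hline
\text{State set 2} & & \\
\text{scanned} & & \\
{}_1S \to a. & q & p \\
\text{completed} & & \\
{}_1S \to S.S & p^{-1}q^2 \cdot p = q^2 & q \cdot p = pq \\
{}_0S \to SS. & q \cdot p = pq & pq \cdot p = p^2q \\
{}_0S \to S.S & p^{-1}q \cdot p^2q = pq^2 & q \cdot p^2q = p^2q^2 \\
{}_0 \to S. & 1 \cdot p^2q = p^2q & 1 \cdot p^2q = p^2q \\
\text{predicted} & & \\
{}_2S \to .a & (q^2 + pq^2) \cdot p^{-1}p = (1 + p)q^2 & p \\
{}_2S \to .SS & (q^2 + pq^2) \cdot p^{-1}q = (1 + p^{-1})q^3 & q \\
\hline
\text{State set 3} & & \\
\text{scanned} & & \\
{}_2S \to a. & (1 + p)q^2 & p \\
\text{completed} & & \\
{}_2S \to S.S & (1 + p^{-1})q^3 \cdot p = (1 + p)q^3 & q \cdot p = pq \\
{}_1S \to SS. & q^2 \cdot p = pq^2 & pq \cdot p = p^2q \\
{}_1S \to S.S & p^{-1}q^2 \cdot p^2q = pq^3 & q \cdot p^2q = p^2q^2 \\
{}_0S \to SS. & pq^2 \cdot p + q \cdot p^2q = 2p^2q^2 & p^2q^2 \cdot p + pq \cdot p^2q = 2p^3q^2 \\
{}_0S \to S.S & p^{-1}q \cdot 2p^3q^2 = 2p^2q^3 & q \cdot 2p^3q^2 = 2p^3q^3 \\
{}_0 \to S. & 1 \cdot 2p^3q^2 = 2p^3q^2 & 1 \cdot 2p^3q^2 = 2p^3q^2
\end{array}
$$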
4.7 Null Productions 
Null productions $X \to \epsilon$ introduce some complications into the relatively
straightforward parser operation described so far, some of which are due specifically to the
probabilistic aspects of parsing. This section summarizes the necessary modifications 
to process null productions correctly, using the previous description as a baseline. Our 
treatment of null productions follows the (nonprobabilistic) formulation of Graham, 
Harrison, and Ruzzo (1980), rather than the original one in Earley (1970). 
4.7.1 Computing $\epsilon$-expansion probabilities. The main problem with null productions
is that they allow multiple prediction-completion cycles in between scanning steps 
(since null productions do not have to be matched against one or more input symbols). 
Our strategy will be to collapse all predictions and completions due to chains of null 
productions into the regular prediction and completion steps, not unlike the way 
recursive predictions/completions were handled in Section 4.5. 
A prerequisite for this approach is to precompute, for all nonterminals X, the prob- 
ability that X expands to the empty string. Note that this is another recursive problem, 
since X itself may not have a null production, but expand to some nonterminal Y that 
does. 
Computation of $P(X \stackrel{*}{\Rightarrow} \epsilon)$ for all $X$ can be cast as a system of non-linear equations,
as follows. For each $X$, let $e_X$ be an abbreviation for $P(X \stackrel{*}{\Rightarrow} \epsilon)$. For example, let $X$ have
productions
$$
\begin{array}{lll}
X & \to \epsilon & [p_1] \\
  & \to Y_1 Y_2 & [p_2] \\
  & \to Y_3 Y_4 Y_5 & [p_3]
\end{array}
$$
The semantics of context-free rules imply that $X$ can only expand to $\epsilon$ if all the RHS
nonterminals in one of $X$'s productions expand to $\epsilon$. Translating to probabilities, we
obtain the equation
$$ e_X = p_1 + p_2 e_{Y_1} e_{Y_2} + p_3 e_{Y_3} e_{Y_4} e_{Y_5} + \cdots $$
In other words, each production contributes a term in which the rule probability is
multiplied by the product of the $e$ variables corresponding to the RHS nonterminals,
unless the RHS contains a terminal (in which case the production contributes nothing
to $e_X$ because it cannot possibly lead to $\epsilon$).
The resulting nonlinear system can be solved by iterative approximation. Each
variable $e_X$ is initialized to $P(X \to \epsilon)$, and then repeatedly updated by substituting in
the equation right-hand sides, until the desired level of accuracy is attained.
Convergence is guaranteed, since the $e_X$ values are monotonically increasing and bounded
above by the true values $P(X \stackrel{*}{\Rightarrow} \epsilon) \leq 1$. For grammars without cyclic
dependencies among $\epsilon$-producing nonterminals, this procedure degenerates to simple backward
substitution. Obviously the system has to be solved only once for each grammar.
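The iterative approximation can be sketched as a simple fixed-point loop. In the code below, the rule encoding `(lhs, rhs, prob)` with an empty tuple for a null production, the function name, and the stopping tolerance are all illustrative assumptions; every variable is updated simultaneously from the previous iterate.

```python
def epsilon_probs(rules, nonterminals, tol=1e-12, max_iter=10_000):
    """Approximate e_X = P(X =>* epsilon) for every nonterminal X."""
    # Initialize each e_X to the probability of the null production X -> epsilon.
    e = {X: 0.0 for X in nonterminals}
    for lhs, rhs, prob in rules:
        if not rhs:
            e[lhs] += prob
    for _ in range(max_iter):
        new = {X: 0.0 for X in nonterminals}
        for lhs, rhs, prob in rules:
            term = prob
            for sym in rhs:
                term *= e.get(sym, 0.0)   # a terminal contributes a factor of 0
            new[lhs] += term
        if max(abs(new[X] - e[X]) for X in nonterminals) < tol:
            return new
        e = new
    return e

# Hypothetical grammar with a cyclic dependency: X -> epsilon [0.5], X -> X X [0.3], X -> a [0.2].
rules = [("X", (), 0.5), ("X", ("X", "X"), 0.3), ("X", ("a",), 0.2)]
e = epsilon_probs(rules, ["X"])
print(e["X"])
```

For this example the equation is $e_X = 0.5 + 0.3\,e_X^2$, and the iteration approaches its smaller root $(1 - \sqrt{0.4})/0.6$ from below.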
The probability $e_X$ can be seen as the precomputed inner probability of an
expansion of $X$ to the empty string; i.e., it sums the probabilities of all Earley paths that
derive $\epsilon$ from $X$. This is the justification for the way these probabilities can be used in
modified prediction and completion steps, described next.
4.7.2 Prediction with null productions. Prediction is mediated by the left-corner
relation. For each $X$ occurring to the right of a dot, we generate states for all $Y$ that
are reachable from $X$ by way of the $X \to_L Y$ relation. This reachability criterion has
to be extended in the presence of null productions. Specifically, if $X$ has a production
$X \to Y_1 \ldots Y_{i-1} Y_i \lambda$ then $Y_i$ is a left corner of $X$ iff $Y_1, \ldots, Y_{i-1}$ all have a nonzero
probability of expanding to $\epsilon$. The contribution of such a production to the left-corner
probability $P(X \to_L Y_i)$ is
$$ P(X \to Y_1 \ldots Y_{i-1} Y_i \lambda) \prod_{k=1}^{i-1} e_{Y_k} $$
The old prediction procedure can now be modified in two steps. First, replace
the old $P_L$ relation by the one that takes into account null productions, as sketched
above. From the resulting $P_L$ compute the reflexive transitive closure $R_L$, and use it to
generate predictions as before.
Second, when predicting a left corner $Y$ with a production $Y \to Y_1 \ldots Y_{i-1} Y_i \lambda$, add
states for all dot positions up to the first RHS nonterminal that cannot expand to $\epsilon$,
say from $Y \to .Y_1 \ldots Y_{i-1} Y_i \lambda$ through $Y \to Y_1 \ldots Y_{i-1}.Y_i \lambda$. We will call this procedure
"spontaneous dot shifting." It accounts precisely for those derivations that expand the
RHS prefix $Y_1 \ldots Y_{i-1}$ without consuming any of the input symbols.
The forward and inner probabilities of the states thus created are those of the
first state $Y \to .Y_1 \ldots Y_{i-1} Y_i \lambda$, multiplied by factors that account for the implied
$\epsilon$-expansions. This factor is just the product $\prod_{k=1}^{j} e_{Y_k}$, where $j$ is the dot position.
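The revised left-corner probabilities can be accumulated in a single pass over the rules once the $\epsilon$-expansion probabilities are known. The following sketch assumes the same hypothetical rule encoding `(lhs, rhs, prob)` as a list of tuples and takes the $e$ values as a precomputed dictionary; it returns the matrix $P_L$ as a nested dictionary.

```python
def left_corner_matrix(rules, nonterminals, e):
    """Left-corner probabilities P(X ->_L Y) in the presence of null productions."""
    P = {X: {Y: 0.0 for Y in nonterminals} for X in nonterminals}
    for lhs, rhs, prob in rules:
        weight = prob                  # rule probability times e-values of skipped symbols
        for sym in rhs:
            if sym not in P:           # a terminal blocks any further left corners
                break
            P[lhs][sym] += weight
            weight *= e[sym]           # later symbols count only if sym can vanish
            if weight == 0.0:
                break
    return P

# Hypothetical grammar: X -> Y Z a [1], Y -> epsilon [0.4], Y -> b [0.6], Z -> c [1].
rules = [("X", ("Y", "Z", "a"), 1.0), ("Y", (), 0.4),
         ("Y", ("b",), 0.6), ("Z", ("c",), 1.0)]
e = {"X": 0.0, "Y": 0.4, "Z": 0.0}
P_L = left_corner_matrix(rules, ["X", "Y", "Z"], e)
print(P_L["X"])
```

Here $Y$ is a left corner of $X$ with the full rule probability, while $Z$ becomes a left corner with weight $e_Y = 0.4$ because $Y$ may expand to the empty string.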
4.7.3 Completion with null productions. Modification of the completion step follows
a similar pattern. First, the unit production relation has to be extended to allow for
unit production chains due to null productions. A rule $X \to Y_1 \ldots Y_{i-1} Y_i Y_{i+1} \ldots Y_j$ can
effectively act as a unit production that links $X$ and $Y_i$ if all other nonterminals on the
RHS can expand to $\epsilon$. Its contribution to the unit production relation $P(X \to Y_i)$ will
then be
$$ P(X \to Y_1 \ldots Y_{i-1} Y_i Y_{i+1} \ldots Y_j) \prod_{k \neq i} e_{Y_k} $$
From the resulting revised $P_U$ matrix we compute the closure $R_U$ as usual.
The second modification is another instance of spontaneous dot shifting. When
completing a state $X \to \lambda.Y\mu$ and moving the dot to get $X \to \lambda Y.\mu$, additional states
have to be added, obtained by moving the dot further over any nonterminals in $\mu$ that
have nonzero $\epsilon$-expansion probability. As in prediction, forward and inner probabilities
are multiplied by the corresponding $\epsilon$-expansion probabilities.
4.7.4 Eliminating null productions. Given these added complications one might
consider simply eliminating all $\epsilon$-productions in a preprocessing step. This is mostly
straightforward and analogous to the corresponding procedure for nonprobabilistic
CFGs (Aho and Ullman 1972, Algorithm 2.10). The main difference is the updating of
rule probabilities, for which the $\epsilon$-expansion probabilities are again needed.
1. Delete all null productions, except on the start symbol (in case the
grammar as a whole produces $\epsilon$ with nonzero probability). Scale the
remaining production probabilities to sum to unity.
2. For each original rule $X \to \lambda Y \mu$ that contains a nonterminal $Y$ such that
$Y \stackrel{*}{\Rightarrow} \epsilon$:
(a) Create a variant rule $X \to \lambda\mu$.
(b) Set the rule probability of the new rule to $e_Y P(X \to \lambda Y \mu)$. If the
rule $X \to \lambda\mu$ already exists, sum the probabilities.
(c) Decrement the old rule probability by the same amount.
Iterate these steps for all RHS occurrences of a null-able nonterminal.
The crucial step in this procedure is the addition of variants of the original produc- 
tions that simulate the null productions by deleting the corresponding nonterminals 
from the RHS. The spontaneous dot shifting described in the previous sections effec- 
tively performs the same operation on the fly as the rules are used in prediction and 
completion. 
4.8 Complexity Issues 
The probabilistic extension of Earley's parser preserves the original control structure 
in most aspects, the major exception being the collapsing of cyclic predictions and unit 
completions, which can only make these steps more efficient. Therefore the complexity 
analysis from Earley (1970) applies, and we only summarize the most important results 
here. 
The worst-case complexity for Earley's parser is dominated by the completion step,
which takes $O(l^2)$ for each input position, $l$ being the length of the current prefix. The
total time is therefore $O(l^3)$ for an input of length $l$, which is also the complexity of the
standard Inside/Outside (Baker 1979) and LRI (Jelinek and Lafferty 1991) algorithms.
For grammars of bounded ambiguity, the incremental per-word cost reduces to $O(l)$,
$O(l^2)$ total. For deterministic CFGs the incremental cost is constant, $O(l)$ total. Because
of the possible start indices each state set can contain $O(l)$ Earley states, giving $O(l^2)$
worst-case space complexity overall.
Apart from input length, complexity is also determined by grammar size. We 
will not try to give a precise characterization in the case of sparse grammars (Ap- 
pendix B.3 gives some hints on how to implement the algorithm efficiently for such 
grammars). However, for fully parameterized grammars in CNF we can verify the
scaling of the algorithm in terms of the number of nonterminals $n$, and confirm that it
has the same $O(n^3)$ time and space requirements as the Inside/Outside (I/O) and LRI
algorithms.
The completion step again dominates the computation, which has to compute
probabilities for at most $O(n^3)$ states. By organizing summations (11) and (12) so that
$\gamma''$ are first summed by LHS nonterminals, the entire completion operation can be
accomplished in $O(n^3)$. The one-time cost for the matrix inversions to compute the
left-corner and unit production relation matrices is also $O(n^3)$.
5. Extensions 
This section discusses extensions to the Earley algorithm that go beyond simple parsing 
and the computation of prefix and string probabilities. These extensions are all quite 
straightforward and well supported by the original Earley chart structure, which leads 
us to view them as part of a single, unified algorithm for solving the tasks mentioned 
in the introduction. 
5.1 Viterbi Parses 
Definition 7 
A Viterbi parse for a string x, in a grammar G, is a left-most derivation that assigns 
maximal probability to x, among all possible derivations for x. 
Both the definition of Viterbi parse and its computation are straightforward gener- 
alizations of the corresponding notion for Hidden Markov Models (Rabiner and Juang 
1986), where one computes the Viterbi path (state sequence) through an HMM. Pre- 
cisely the same approach can be used in the Earley parser, using the fact that each 
derivation corresponds to a path. 
The standard computational technique for Viterbi parses is applicable here. Wher- 
ever the original parsing procedure sums probabilities that correspond to alternative 
derivations of a grammatical entity, the summation is replaced by a maximization. 
Thus, during the forward pass each state must keep track of the maximal path prob- 
ability leading to it, as well as the predecessor states associated with that maximum 
probability path. Once the final state is reached, the maximum probability parse can 
be recovered by tracing back the path of "best" predecessor states. 
The following modifications to the probabilistic Earley parser implement the for- 
ward phase of the Viterbi computation. 
• Each state computes an additional probability, its Viterbi probability $v$.
• Viterbi probabilities are propagated in the same way as inner
probabilities, except that during completion the summation is replaced
by maximization: $v_i({}_kX \to \lambda Y.\mu)$ is the maximum of all products
$v_i({}_jY \to \nu.)\, v_j({}_kX \to \lambda.Y\mu)$ that contribute to the completed state
${}_kX \to \lambda Y.\mu$. The same-position predecessor ${}_jY \to \nu.$ associated with the
maximum is recorded as the Viterbi path predecessor of ${}_kX \to \lambda Y.\mu$ (the
other predecessor state ${}_kX \to \lambda.Y\mu$ can be inferred).
• The completion step uses the original recursion without collapsing unit 
production loops. Loops are simply avoided, since they can only lower a 
path's probability. Collapsing unit production completions has to be 
avoided to maintain a continuous chain of predecessors for later 
backtracing and parse construction. 
• The prediction step does not need to be modified for the Viterbi 
computation. 
Once the final state is reached, a recursive procedure can recover the parse tree
associated with the Viterbi parse. This procedure takes an Earley state $i: {}_kX \to \lambda.\mu$
as input and produces the Viterbi parse for the substring between $k$ and $i$ as output.
(If the input state is not complete ($\mu \neq \epsilon$), the result will be a partial parse tree with
children missing from the root node.)
Viterbi-parse($i: {}_kX \to \lambda.\mu$):
1. If $\lambda = \epsilon$, return a parse tree with root labeled $X$ and no children.
2. Otherwise, if $\lambda$ ends in a terminal $a$, let $\lambda' a = \lambda$, and call this procedure
recursively to obtain the parse tree
$$ T = \text{Viterbi-parse}(i - 1: {}_kX \to \lambda'.a\mu) $$
Adjoin a leaf node labeled $a$ as the right-most child to the root of $T$ and
return $T$.
3. Otherwise, if $\lambda$ ends in a nonterminal $Y$, let $\lambda' Y = \lambda$. Find the Viterbi
predecessor state ${}_jY \to \nu.$ for the current state. Call this procedure
recursively to compute
$$ T = \text{Viterbi-parse}(j: {}_kX \to \lambda'.Y\mu) $$
as well as
$$ T' = \text{Viterbi-parse}(i: {}_jY \to \nu.) $$
Adjoin $T'$ to $T$ as the right-most child at the root, and return $T$.
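The recursive procedure translates almost line by line into code. The sketch below is an illustration under several assumptions that are not part of the original description: states are encoded as tuples `(i, k, lhs, rhs, dot)`, trees are nested lists `[label, child, ...]`, and the Viterbi predecessors are supplied in a hand-built dictionary `pred` that maps a state to the pair `(j, rhs_of_Y)` identifying the completed state $i: {}_jY \to \nu.$ recorded during the forward pass.

```python
def viterbi_parse(state, pred, nonterminals):
    """Recover the Viterbi parse tree for an Earley state (i, k, lhs, rhs, dot)."""
    i, k, lhs, rhs, dot = state
    if dot == 0:
        return [lhs]                              # step 1: root with no children
    last = rhs[dot - 1]
    if last not in nonterminals:
        # Step 2: the dot follows a terminal, so undo the scanning step.
        tree = viterbi_parse((i - 1, k, lhs, rhs, dot - 1), pred, nonterminals)
        tree.append(last)
        return tree
    # Step 3: the dot follows a nonterminal; look up the recorded predecessor.
    j, y_rhs = pred[state]
    tree = viterbi_parse((j, k, lhs, rhs, dot - 1), pred, nonterminals)
    subtree = viterbi_parse((i, j, last, y_rhs, len(y_rhs)), pred, nonterminals)
    tree.append(subtree)
    return tree

# Hand-built predecessor table for the input "aa" under S -> a, S -> SS.
# The dummy start production is encoded with an empty LHS label.
pred = {
    (2, 0, "", ("S",), 1): (0, ("S", "S")),       # completed by 2: 0S -> SS.
    (2, 0, "S", ("S", "S"), 2): (1, ("a",)),      # completed by 2: 1S -> a.
    (1, 0, "S", ("S", "S"), 1): (0, ("a",)),      # completed by 1: 0S -> a.
}
tree = viterbi_parse((2, 0, "", ("S",), 1), pred, {"S"})
print(tree)
```

Only the same-position predecessor needs to be stored for each state, since the other predecessor is obtained by moving the dot back one symbol and switching to state set $j$.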
5.2 Rule Probability Estimation 
The rule probabilities in a SCFG can be iteratively estimated using the EM (Expectation- 
Maximization) algorithm (Dempster et al. 1977). Given a sample corpus $D$, the
estimation procedure finds a set of parameters that represent a local maximum of the
grammar likelihood function $P(D \mid G)$, which is given by the product of the string
probabilities
$$ P(D \mid G) = \prod_{x \in D} P(S \stackrel{*}{\Rightarrow} x), $$
i.e., the samples are assumed to be distributed identically and independently.
The two steps of this algorithm can be briefly characterized as follows. 
E-step: Compute expectations for how often each grammar rule is used, given the 
corpus D and the current grammar parameters (rule probabilities). 
M-step: Reset the parameters so as to maximize the likelihood relative to the 
expected rule counts found in the E-step. 
This procedure is iterated until the parameter values (as well as the likelihood) con- 
verge. It can be shown that each round in the algorithm produces a likelihood that is 
at least as high as the previous one; the EM algorithm is therefore guaranteed to find 
at least a local maximum of the likelihood function. 
EM is a generalization of the well-known Baum-Welch algorithm for HMM esti- 
mation (Baum et al. 1970); the original formulation for the case of SCFGs is attributable 
to Baker (1979). For SCFGs, the E-step involves computing the expected number of 
times each production is applied in generating the training corpus. After that, the M- 
step consists of a simple normalization of these counts to yield the new production 
probabilities. 
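The M-step normalization amounts to dividing each expected count by the total count of all productions sharing the same left-hand side. A minimal sketch, assuming expected counts are stored in a dictionary keyed by hypothetical `(lhs, rhs)` pairs:

```python
from collections import defaultdict

def m_step(expected_counts):
    """Normalize expected rule counts into rule probabilities, per LHS nonterminal."""
    totals = defaultdict(float)
    for (lhs, _rhs), count in expected_counts.items():
        totals[lhs] += count
    return {rule: count / totals[rule[0]]
            for rule, count in expected_counts.items()
            if totals[rule[0]] > 0}

# Illustrative counts, e.g. accumulated over a corpus in the E-step.
counts = {("S", ("a",)): 3.0, ("S", ("S", "S")): 1.0}
probs = m_step(counts)
print(probs)
```

Productions whose left-hand side received no counts at all are omitted from the result in this sketch; how to treat them (for example, by keeping their old probabilities) is a design choice left open here.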
In this section we examine the computation of production count expectations re- 
quired for the E-step. The crucial notion introduced by Baker (1979) for this purpose 
is the "outer probability" of a nonterminal, or the joint probability that the nonter- 
minal is generated with a given prefix and suffix of terminals. Essentially the same 
method can be used in the Earley framework, after extending the definition of outer 
probabilities to apply to arbitrary Earley states. 
Definition 8
Given a string $x$, $|x| = l$, the outer probability $\beta_i({}_kX \to \lambda.\mu)$ of an Earley state is the
sum of the probabilities of all paths that
1. start with the initial state,
2. generate the prefix $x_0 \ldots x_{k-1}$,
3. pass through ${}_kX \to .\nu\mu$, for some $\nu$,
4. generate the suffix $x_i \ldots x_{l-1}$ starting with state ${}_kX \to \nu.\mu$,
5. end in the final state.
Outer probabilities complement inner probabilities in that they refer precisely to
those parts of complete paths generating $x$ not covered by the corresponding inner
probability $\gamma_i({}_kX \to \lambda.\mu)$. Therefore, the choice of the production $X \to \lambda\mu$ is not part
of the outer probability associated with a state ${}_kX \to \lambda.\mu$. In fact, the definition makes
no reference to the first part $\lambda$ of the RHS: all states sharing the same $k$, $X$, and $\mu$ will
have identical outer probabilities.
Intuitively, $\beta_i({}_kX \to \lambda.\mu)$ is the probability that an Earley parser operating as a
string generator yields the prefix $x_{0 \ldots k-1}$ and the suffix $x_{i \ldots l-1}$, while passing through
state ${}_kX \to \lambda.\mu$ at position $i$ (which is independent of $\lambda$). As was the case for forward
probabilities, $\beta$ is actually an expectation of the number of such states in the path, as
unit production cycles can result in multiple occurrences for a single state. Again, we
gloss over this technicality in our terminology. The name is motivated by the fact that
$\beta$ reduces to the "outer probability" of $X$, as defined in Baker (1979), if the dot is in
final position.
5.2.1 Computing expected production counts. Before going into the details of com- 
puting outer probabilities, we describe their use in obtaining the expected rule counts 
needed for the E-step in grammar estimation. 
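Let $c(X \to \lambda \mid x)$ denote the expected number of uses of production $X \to \lambda$ in the
derivation of string $x$. Alternatively, $c(X \to \lambda \mid x)$ is the expected number of times that
$X \to \lambda$ is used for prediction in a complete Earley path generating $x$. Let $c(X \to \lambda \mid \mathcal{P})$
be the number of occurrences of predicted states based on production $X \to \lambda$ along a
path $\mathcal{P}$.
$$
\begin{aligned}
c(X \to \lambda \mid x) &= \sum_{\mathcal{P} \text{ derives } x} P(\mathcal{P} \mid S \stackrel{*}{\Rightarrow} x)\, c(X \to \lambda \mid \mathcal{P}) \\
 &= \frac{1}{P(S \stackrel{*}{\Rightarrow} x)} \sum_{\mathcal{P} \text{ derives } x} P(\mathcal{P}, S \stackrel{*}{\Rightarrow} x)\, c(X \to \lambda \mid \mathcal{P}) \\
 &= \frac{1}{P(S \stackrel{*}{\Rightarrow} x)} \sum_{i:\, {}_iX \to .\lambda} P(S \stackrel{*}{\Rightarrow} x_{0 \ldots i-1} X \nu \stackrel{*}{\Rightarrow} x).
\end{aligned}
$$
The last summation is over all predicted states based on production $X \to \lambda$. The
quantity $P(S \stackrel{*}{\Rightarrow} x_{0 \ldots i-1} X \nu \stackrel{*}{\Rightarrow} x)$ is the sum of the probabilities of all paths passing
through $i: {}_iX \to .\lambda$. Inner and outer probabilities have been defined such that this
quantity is obtained precisely as the product of the corresponding $\gamma_i$ and $\beta_i$. Thus,
the expected usage count for a rule can be computed as
$$ c(X \to \lambda \mid x) = \frac{1}{P(S \stackrel{*}{\Rightarrow} x)} \sum_{i:\, {}_iX \to .\lambda} \beta_i({}_iX \to .\lambda)\, \gamma_i({}_iX \to .\lambda). $$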
The sum can be computed after completing both forward and backward passes (or 
during the backward pass itself) by scanning the chart for predicted states. 
5.2.2 Computing outer probabilities. The outer probabilities are computed by tracing 
the complete paths from the final state to the start state, in a single backward pass over 
the Earley chart. Only completion and scanning steps need to be traced back. Reverse 
scanning leaves outer probabilities unchanged, so the only operation of concern is 
reverse completion. 
We describe reverse transitions using the same notation as for their forward coun- 
terparts, annotating each state with its outer and inner probabilities. 
Reverse completion. 
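\[
i : {}_kX \rightarrow \lambda Y.\mu \;\; [\beta, \gamma] \quad \Longrightarrow \quad
\begin{cases}
i : {}_jY \rightarrow \nu. & [\beta'', \gamma''] \\
j : {}_kX \rightarrow \lambda.Y\mu & [\beta', \gamma']
\end{cases}
\]
for all pairs of states ${}_jY \rightarrow \nu.$ and ${}_kX \rightarrow \lambda.Y\mu$ in the chart. Then
\[
\begin{aligned}
\beta' &\mathrel{+}= \gamma'' \cdot \beta \\
\beta'' &\mathrel{+}= \gamma' \cdot \beta
\end{aligned}
\]
The inner probability $\gamma$ is not used.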
Rationale. Relative to $\beta$, $\beta'$ is missing the probability of expanding $Y$, which is
filled in from $\gamma''$. The probability of the surrounding of $Y$ ($\beta''$) is the probability of the
surrounding of $X$ ($\beta$), plus the choice of the rule of production for $X$ and the expansion
of the partial LHS $\lambda$, which are together given by $\gamma'$.
Note that the computation makes use of the inner probabilities computed in the 
forward pass. The particular way in which $\gamma$ and $\beta$ were defined turns out to be
convenient here, as no reference to the production probabilities themselves needs to 
be made in the computation. 
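As an illustration of the simple (non-transitive) reverse completion update, the following sketch applies the two increments to a hand-built toy chart. The `State` class, the function name `reverse_complete`, and the example grammar S -> a S [0.4], S -> a [0.6] with input "a a" are assumptions made for this example only; the inner probabilities are entered by hand rather than produced by a forward pass.

```python
from dataclasses import dataclass


@dataclass
class State:
    label: str          # human-readable dotted rule, for illustration only
    gamma: float        # inner probability from the forward pass
    beta: float = 0.0   # outer probability, accumulated in the backward pass


def reverse_complete(parent, predecessor, child):
    """Undo one completion step.

    parent:      i: _kX -> lambda Y . mu   [beta,   gamma]
    predecessor: j: _kX -> lambda . Y mu   [beta',  gamma']
    child:       i: _jY -> nu .            [beta'', gamma'']
    """
    predecessor.beta += child.gamma * parent.beta   # beta'  += gamma'' * beta
    child.beta += predecessor.gamma * parent.beta   # beta'' += gamma'  * beta


# Toy chart for "a a"; the final state is initialized with beta = 1.
final = State("2: 0 -> S.", gamma=0.24, beta=1.0)
start = State("0: 0 -> .S", gamma=1.0)
outer_s = State("2: 0S -> a S.", gamma=0.24)
mid = State("1: 0S -> a.S", gamma=0.4)
inner_s = State("2: 1S -> a.", gamma=0.6)

reverse_complete(final, start, outer_s)   # trace back the top-level completion
reverse_complete(outer_s, mid, inner_s)   # trace back the completion of the inner S
```

Note that the parent's own inner probability is never read, and no rule probability appears explicitly: everything needed is already folded into the stored $\gamma$ values.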
As in the forward pass, simple reverse completion would not terminate in the pres- 
ence of cyclic unit productions. A version that collapses all such chains of productions 
is given below. 
Reverse completion (transitive). 
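\[
i : {}_kX \rightarrow \lambda Z.\mu \;\; [\beta, \gamma] \quad \Longrightarrow \quad
\begin{cases}
i : {}_jY \rightarrow \nu. & [\beta'', \gamma''] \\
j : {}_kX \rightarrow \lambda.Z\mu & [\beta', \gamma']
\end{cases}
\]
for all pairs of states ${}_jY \rightarrow \nu.$ and ${}_kX \rightarrow \lambda.Z\mu$ in the chart, such that the unit production
relation $R(Z \stackrel{*}{\Rightarrow} Y)$ is nonzero. Then
\[
\begin{aligned}
\beta' &\mathrel{+}= \gamma'' \cdot \beta \\
\beta'' &\mathrel{+}= \gamma' \cdot \beta \, R(Z \stackrel{*}{\Rightarrow} Y)
\end{aligned}
\]
The first summation is carried out once for each state $j : {}_kX \rightarrow \lambda.Z\mu$, whereas the
second summation is applied for each choice of $Z$, but only if $X \rightarrow \lambda Z\mu$ is not itself a
unit production, i.e., $\lambda\mu \neq \epsilon$.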
Rationale. This increments $\beta''$ the equivalent of $R(Z \stackrel{*}{\Rightarrow} Y)$ times, accounting for
the infinity of surroundings in which $Y$ can occur if it can be derived through cyclic
productions. Note that the computation of $\beta'$ is unchanged, since $\gamma''$ already includes
an infinity of cyclically generated subtrees for $Y$, where appropriate.
5.3 Parsing Bracketed Inputs 
The estimation procedure described above (like EM-based estimators in general) is
only guaranteed to find locally optimal parameter estimates. Unfortunately, it seems
that in the case of unconstrained SCFG estimation local maxima present a very real
problem, and make success dependent on chance and initial conditions (Lari and
Young 1990). Pereira and Schabes (1992) showed that partially bracketed input samples
can alleviate the problem in certain cases. The bracketing information constrains the
parse of the inputs, and therefore the parameter estimates, steering them clear of some
of the suboptimal solutions that could otherwise be found.
An Earley parser can be minimally modified to take advantage of bracketed strings 
by invoking itself recursively when a left parenthesis is encountered. The recursive in- 
stance of the parser is passed any predicted states at that position, processes the input 
up to the matching right parenthesis, and hands complete states back to the invoking 
instance. This technique is efficient, as it never explicitly rejects parses not consistent 
with the bracketing. It is also convenient, as it leaves the basic parser operations, 
including the left-to-right processing and the probabilistic computations, unchanged. 
For example, prefix probabilities conditioned on partial bracketings could be computed 
easily this way. 
Parsing bracketed inputs is described in more detail in Stolcke (1993), where it is 
also shown that bracketing gives the expected improved efficiency. For example, the 
modified Earley parser processes fully bracketed inputs in linear time. 
5.4 Robust Parsing 
In many applications ungrammatical input has to be dealt with in some way. Tra- 
ditionally it has been seen as a drawback of top-down parsing algorithms such as 
Earley's that they sacrifice "robustness," i.e., the ability to find partial parses in an 
ungrammatical input, for the efficiency gained from top-down prediction (Magerman 
and Weir 1992). 
One approach to the problem is to build robustness into the grammar itself. In the 
simplest case one could add top-level productions 
\[ S \rightarrow XS \]
where X can expand to any nonterminal, including an "unknown word" category. This 
grammar will cause the Earley parser to find all partial parses of substrings, effectively 
behaving like a bottom-up parser constructing the chart in left-to-right fashion. More 
refined variations are possible: the top-level productions could be used to model which 
phrasal categories (sentence fragments) can likely follow each other. This probabilistic 
information can then be used in a pruning version of the Earley parser (Section 6.1) 
to arrive at a compromise between robust and expectation-driven parsing. 
An alternative method for making Earley parsing more robust is to modify the 
parser itself so as to accept arbitrary input and find all or a chosen subset of pos- 
sible substring parses. In the case of Earley's parser there is a simple extension to 
accomplish just that, based on the notion of a wildcard state
\[ i : {}_k \rightarrow \lambda.?, \]
where the wildcard ? stands for an arbitrary continuation of the RHS.
During prediction, a wildcard to the right of the dot causes the chart to be seeded
with dummy states $\rightarrow .X$ for each phrasal category $X$ of interest. Conversely, a minimal
modification to the standard completion step allows the wildcard states to collect all 
abutting substring parses: 
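\[
\left.
\begin{array}{l}
i : {}_jY \rightarrow \mu. \\
j : {}_k \rightarrow \lambda.?
\end{array}
\right\}
\quad \Longrightarrow \quad i : {}_k \rightarrow \lambda Y.?
\]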
for all Y. This way each partial parse will be represented by exactly one wildcard state 
in the final chart position. 
A detailed account of this technique is given in Stolcke (1993). One advantage 
over the grammar-modifying approach is that it can be tailored to use various criteria 
at runtime to decide which partial parses to follow. 
6. Discussion 
6.1 Online Pruning 
In finite-state parsing (especially speech decoding) one often makes use of the forward 
probabilities for pruning partial parses before having seen the entire input. Pruning 
is formally straightforward in Earley parsers: in each state set, rank states according 
to their $\alpha$ values, then remove those states with small probabilities compared to the
current best candidate, or simply those whose rank exceeds a given limit. Notice this 
will not only omit certain parses, but will also underestimate the forward and inner 
probabilities of the derivations that remain. Pruning procedures have to be evaluated 
empirically since they invariably sacrifice completeness and, in the case of the Viterbi 
algorithm, optimality of the result. 
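The two pruning criteria mentioned above (a probability threshold relative to the best candidate, and a rank limit) might be sketched as follows. The function name `prune_state_set`, the parameter names `beam` and `max_states`, and the representation of a state set as (state, alpha) pairs are assumptions made for illustration; they do not describe an evaluated pruning strategy.

```python
def prune_state_set(states, beam=1e-3, max_states=None):
    """Return the states that survive pruning, best first.

    states:     list of (state, alpha) pairs for one chart position.
    beam:       keep states whose alpha is at least beam * (best alpha).
    max_states: optional cap on the number of surviving states (rank limit).
    """
    ranked = sorted(states, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return []
    threshold = beam * ranked[0][1]
    kept = [pair for pair in ranked if pair[1] >= threshold]
    return kept if max_states is None else kept[:max_states]


# Hypothetical state set with made-up forward probabilities.
state_set = [("A", 0.5), ("B", 0.004), ("C", 0.0001), ("D", 0.2)]
survivors = prune_state_set(state_set, beam=0.01)
top_one = prune_state_set(state_set, beam=0.01, max_states=1)
```

Any state dropped here no longer contributes to later forward and inner sums, which is exactly the underestimation effect noted in the text.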
While Earley-based on-line pruning awaits further study, there is reason to be- 
lieve the Earley framework has inherent advantages over strategies based only on 
bottom-up information (including so-called "over-the-top" parsers). Context-free forward
probabilities include all probabilistic information (subject to assumptions
implicit in the SCFG formalism) available from an input prefix, whereas the
usual inside probabilities do not take into account the nonterminal prior probabilities
that result from the top-down relation to the start state. Using top-down constraints
does not necessarily mean sacrificing robustness, as discussed in Section 5.4. On the 
contrary, by using Earley-style parsing with a set of carefully designed and estimated 
"fault-tolerant" top-level productions, it should be possible to use probabilities to bet- 
ter advantage in robust parsing. This approach is a subject of ongoing work, in the 
context of tight-coupling SCFGs with speech decoders (Jurafsky, Wooters, Segal, Stolcke,
Fosler, Tajchman, and Morgan 1995).
6.2 Relation to Probabilistic LR Parsing 
One of the major alternative context-free parsing paradigms besides Earley's algo- 
rithm is LR parsing (Aho and Ullman 1972). A comparison of the two approaches, 
both in their probabilistic and nonprobabilistic aspects, is interesting and provides 
useful insights. The following remarks assume familiarity with both approaches. We 
sketch the fundamental relations, as well as the important tradeoffs between the two 
frameworks. 13 
Like an Earley parser, LR parsing uses dotted productions, called items, to keep 
track of the progress of derivations as the input is processed. The start indices are not 
part of LR items: we may therefore use the term "item" to refer to both LR items and 
Earley states without start indices. An Earley parser constructs sets of possible items 
on the fly, by following all possible partial derivations. An LR parser, on the other 
hand, has access to a complete list of sets of possible items computed beforehand, and at 
runtime simply follows transitions between these sets. The item sets are known as the
"states" of the LR parser. 14 A grammar is suitable for LR parsing if these transitions can
be performed deterministically by considering only the next input and the contents 
of a shift-reduce stack. Generalized LR parsing is an extension that allows parallel 
tracking of multiple state transitions and stack actions by using a graph-structured 
stack (Tomita 1986). 
Probabilistic LR parsing (Wright 1990) is based on LR items augmented with 
certain conditional probabilities. Specifically, the probability $p$ associated with an LR
item $X \rightarrow \lambda.\mu$ is, in our terminology, a normalized forward probability:
\[
p = \frac{\alpha_i(X \rightarrow \lambda.\mu)}{P(S \stackrel{*}{\Rightarrow}_L x_{0 \ldots i-1})},
\]
where the denominator is the probability of the current prefix. 15 LR item probabilities
are thus conditioned forward probabilities, and can be used to compute conditional
probabilities of next words: $P(x_i \mid x_{0 \ldots i-1})$ is the sum of the $p$'s of all items having $x_i$
to the right of the dot (extra work is required if the item corresponds to a "reduce"
state, i.e., if the dot is in final position).
Notice that the definition of p is independent of i as well as the start index of 
the corresponding Earley state. Therefore, to ensure that item probabilities are correct 
independent of input position, item sets would have to be constructed so that their 
probabilities are unique within each set. However, this may be impossible given that 
the probabilities can take on infinitely many values and in general depend on the history
of the parse. The solution used by Wright (1990) is to collapse items whose probabilities
are within a small tolerance $\epsilon$ and are otherwise identical. The same threshold
is used to simplify a number of other technical problems, e.g., left-corner probabilities
are computed by iterated prediction, until the resulting changes in probabilities are
smaller than $\epsilon$. Subject to these approximations, then, a probabilistic LR parser can
compute prefix probabilities by multiplying successive conditional probabilities for 
the words it sees. 16 
As an alternative to the computation of LR transition probabilities from a given 
SCFG, one might instead estimate such probabilities directly from traces of parses 
13 Like Earley parsers, LR parsers can be built using various amounts of lookahead to make the operation 
of the parser (more) deterministic, and hence more efficient. Only the case of zero-lookahead, LR(0), is 
considered here; the correspondence between LR(k) parsers and k-lookahead Earley parsers is 
discussed in the literature (Earley 1970; Aho and Ullman 1972). 
14 Once again, it is helpful to compare this to a closely related finite-state concept: the states of the LR 
parser correspond to sets of Earley states, similar to the way the states of a deterministic FSA 
correspond to sets of states of an equivalent nondeterministic FSA under the standard subset 
construction. 
15 The identity of this expression with the item probabilities of Wright (1990) can be proved by induction 
on the steps performed to compute the p's, as shown in Stolcke (1993). 
16 It is not clear what the numerical properties of this approximation are, e.g., how the errors will 
accumulate over longer parses. 
on a training corpus. Because of the imprecise relationship between LR probabilities 
and SCFG probabilities, it is not clear if the model thus estimated corresponds to any 
particular SCFG in the usual sense. 
Briscoe and Carroll (1993) turn this incongruity into an advantage by using the 
LR parser as a probabilistic model in its own right, and show how LR probabilities 
can be extended to capture non-context-free contingencies. The problem of capturing
more complex distributional constraints in natural language is clearly important, but 
well beyond the scope of this article. We simply remark that it should be possible to 
define "interesting" nonstandard probabilities in terms of Earley parser actions so as 
to better model non-context-free phenomena. 
Apart from such considerations, the choice between LR methods and Earley pars- 
ing is a typical space-time tradeoff. Even though an Earley parser runs with the same 
linear time and space complexity as an LR parser on grammars of the appropriate LR 
class, the constant factors involved will be much in favor of the LR parser, as almost all 
the work has already been compiled into its transition and action table. However, the 
size of LR parser tables can be exponential in the size of the grammar (because of the 
number of potential item subsets). Furthermore, if the generalized LR method is used 
for dealing with nondeterministic grammars (Tomita 1986) the runtime on arbitrary 
inputs may also grow exponentially. The bottom line is that each application's needs 
have to be evaluated against the pros and cons of both approaches to find the best 
solution. From a theoretical point of view, the Earley approach has the inherent appeal 
of being the more general (and exact) solution to the computation of the various SCFG 
probabilities. 
6.3 Other Related Work 
The literature on Earley-based probabilistic parsers is sparse, presumably because of 
the precedent set by the Inside/Outside algorithm, which is more naturally formulated 
as a bottom-up algorithm. 
Both Nakagawa (1987) and Päseler (1988) use a nonprobabilistic Earley parser augmented
with "word match" scoring. Though not truly probabilistic, these algorithms
are similar to the Viterbi version described here, in that they find a parse that optimizes 
the accumulated matching scores (without regard to rule probabilities). Prediction and 
completion loops do not come into play since no precise inner or forward probabilities 
are computed. 
Magerman and Marcus (1991) are interested primarily in scoring functions to guide 
a parser efficiently to the most promising parses. Earley-style top-down prediction is 
used only to suggest worthwhile parses, not to compute precise probabilities, which 
they argue would be an inappropriate metric for natural language parsing. 
Casacuberta and Vidal (1988) exhibit an Earley parser that processes weighted (not 
necessarily probabilistic) CFGs and performs a computation that is isomorphic to that 
of inside probabilities shown here. Schabes (1991) adds both inner and outer proba- 
bilities to Earley's algorithm, with the purpose of obtaining a generalized estimation 
algorithm for SCFGs. Both of these approaches are restricted to grammars without 
unbounded ambiguities, which can arise from unit or null productions. 
Dan Jurafsky (personal communication) wrote an Earley parser for the Berke- 
ley Restaurant Project (BeRP) speech understanding system that originally computed 
forward probabilities for restricted grammars (without left-corner or unit production 
recursion). The parser now uses the method described here to provide exact SCFG pre- 
fix and next-word probabilities to a tightly coupled speech decoder (Jurafsky, Wooters, 
Segal, Stolcke, Fosler, Tajchman, and Morgan 1995). 
An essential idea in the probabilistic formulation of Earley's algorithm is the 
collapsing of recursive predictions and unit completion chains, replacing both with 
lookups in precomputed matrices. This idea arises in our formulation out of the need 
to compute probability sums given as infinite series. Graham, Harrison, and Ruzzo 
(1980) use a nonprobabilistic version of the same technique to create a highly opti- 
mized Earley-like parser for general CFGs that implements prediction and completion 
by operations on Boolean matrices. 17
The matrix inversion method for dealing with left-recursive prediction is borrowed 
from the LRI algorithm of Jelinek and Lafferty (1991) for computing prefix probabilities 
for SCFGs in CNF. 18 We then use that idea a second time to deal with the similar
recursion arising from unit productions in the completion step. We suspect, but have 
not proved, that the Earley computation of forward probabilities when applied to a 
CNF grammar performs a computation that is isomorphic to that of the LRI algorithm. 
In any case, we believe that the parser-oriented view afforded by the Earley framework 
makes for a very intuitive solution to the prefix probability problem, with the added 
advantage that it is not restricted to CNF grammars. 
Algorithms for probabilistic CFGs can be broadly characterized along several di- 
mensions. One such dimension is whether the quantities entered into the parser chart 
are defined in a bottom-up (CYK) fashion, or whether left-to-right constraints are an 
inherent part of their definition. 19 The probabilistic Earley parser shares the inherent
left-to-right character of the LRI algorithm, and contrasts with the bottom-up I/O 
algorithm. 
Probabilistic parsing algorithms may also be classified as to whether they are for- 
mulated for fully parameterized CNF grammars or arbitrary context-free rules (typ- 
ically taking advantage of grammar sparseness). In this respect the Earley approach 
contrasts with both the CNF-oriented I/O and LRI algorithms. Another approach to 
avoiding the CNF constraint is a formulation based on probabilistic Recursive Transition
Networks (RTNs) (Kupiec 1992). The similarity goes further, as both Kupiec's
approach and ours are based on state transitions, and dotted productions (Earley states)
turn out to be equivalent to RTN states if the RTN is constructed from a CFG.
7. Conclusions 
We have presented an Earley-based parser for stochastic context-free grammars that 
is appealing for its combination of advantages over existing methods. Earley's control 
structure lets the algorithm run with best-known complexity on a number of gram- 
mar subclasses, and no worse than standard bottom-up probabilistic chart parsers on 
general SCFGs and fully parameterized CNF grammars. 
Unlike bottom-up parsers it also computes accurate prefix probabilities incremen- 
tally while scanning its input, along with the usual substring (inside) probabilities. The 
chart constructed during parsing supports both Viterbi parse extraction and Baum- 
Welch type rule probability estimation by way of a backward pass over the parser 
chart. If the input comes with (partial) bracketing to indicate phrase structure, this 
17 This connection to the GHR algorithm was pointed out by Fernando Pereira. Exploration of this link 
then led to the extension of our algorithm to handle $\epsilon$-productions, as described in Section 4.7.
18 Their method uses the transitive (but not reflexive) closure over the left-corner relation PL, for which 
they chose the symbol QL. We chose the symbol RL in this article to point to this difference. 
19 Of course a CYK-style parser can operate left-to-right, right-to-left, or otherwise by reordering the 
computation of chart entries. 
information can be easily incorporated to restrict the allowable parses. A simple ex- 
tension of the Earley chart allows finding partial parses of ungrammatical input. 
The computation of probabilities is conceptually simple and directly follows Earley's
parsing framework, while drawing heavily on the analogy to finite-state language
models. It does not require rewriting the grammar into normal form. Thus, the present
algorithm fills a gap in the existing array of algorithms for SCFGs, efficiently combin- 
ing the functionalities and advantages of several previous approaches. 
Appendix A: Existence of RL and Ru 
In Section 4.5 we defined the probabilistic left-corner and unit-production matrices RL 
and Ru, respectively, to collapse recursions in the prediction and completion steps. It 
was shown how these matrices could be obtained as the result of matrix inversions. 
In this appendix we give a proof that the existence of these inverses is assured if the 
grammar is well-defined in the following three senses. The terminology used here is 
taken from Booth and Thompson (1973). 
Definition 9 
For an SCFG $G$ over an alphabet $\Sigma$, with start symbol $S$, we say that 20
a) $G$ is proper iff for all nonterminals $X$ the rule probabilities sum to unity,
i.e.,
\[ \sum_{\lambda : (X \rightarrow \lambda) \in G} P(X \rightarrow \lambda) = 1. \]
b) $G$ is consistent iff it defines a probability distribution over finite strings,
i.e.,
\[ \sum_{x \in \Sigma^*} P(S \stackrel{*}{\Rightarrow} x) = 1, \]
where $P(S \stackrel{*}{\Rightarrow} x)$ is induced by the rule probabilities according to
Definition 1(a).
c) $G$ has no useless nonterminals iff all nonterminals $X$ appear in at least
one derivation of some string $x \in \Sigma^*$ with nonzero probability, i.e.,
\[ P(S \stackrel{*}{\Rightarrow} \lambda X \mu \stackrel{*}{\Rightarrow} x) > 0. \]
It is useful to translate consistency into "process" terms. We can view an SCFG as 
a stochastic string-rewriting process, in which each step consists of simultaneously re- 
placing all nonterminals in a sentential form with the right-hand sides of productions, 
randomly drawn according to the rule probabilities. Booth and Thompson (1973) show 
that the grammar is consistent if and only if the probability that stochastic rewriting 
of the start symbol S leaves nonterminals remaining after n steps, goes to 0 as $n \rightarrow \infty$.
More loosely speaking, rewriting S has to terminate after a finite number of steps with 
probability 1, or else the grammar is inconsistent. 
20 Unfortunately, the terminology used in the literature is not uniform. For example, Jelinek and Lafferty 
(1991) use the term "proper" to mean (c), and "well-defined" for (b). They also state mistakenly that (a) 
and (c) together are a sufficient condition for (b). Booth and Thompson (1973) show that one can write 
a SCFG that satisfies (a) and (c) but generates derivations that do not terminate with probability 1, and 
give necessary and sufficient conditions for (b). 
We observe that the same property holds not only for S, but for all nonterminals,
if the grammar has no useless nonterminals. If any nonterminal X admitted infinite
derivations with nonzero probability, then S itself would have such derivations, since 
by assumption X is reachable from S with nonzero probability. 
To prove the existence of RL and Ru, it is sufficient to show that the corresponding 
geometric series converge: 
\[
\begin{aligned}
R_L &= I + P_L + P_L^2 + \cdots = (I - P_L)^{-1} \\
R_U &= I + P_U + P_U^2 + \cdots = (I - P_U)^{-1}.
\end{aligned}
\]
Lemma 5 
If G is a proper, consistent SCFG without useless nonterminals, then the powers $P_L^n$
of the left-corner relation, and $P_U^n$ of the unit production relation, converge to zero as
$n \rightarrow \infty$.
Proof 
Entry $(X, Y)$ in the left-corner matrix $P_L$ is the probability of generating $Y$ as the
immediately succeeding left-corner below $X$. Similarly, entry $(X, Y)$ in the $n$th power
$P_L^n$ is the probability of generating $Y$ as the left-corner of $X$ with $n - 1$ intermediate
nonterminals. Certainly $P_L^n(X, Y)$ is bounded above by the probability that the entire
derivation starting at $X$ has not terminated after $n$ steps, since a derivation cannot terminate
without expanding the left-most symbol to a terminal (as opposed to a nonterminal).
But that probability tends to 0 as $n \rightarrow \infty$, and hence so must each entry in $P_L^n$.
For the unit production matrix Pu a similar argument applies, since the length of 
a derivation is at least as long as it takes to terminate any initial unit production chain. 
Lemma 6 
If G is a proper, consistent SCFG without useless nonterminals, then the series for $R_L$
and $R_U$ as defined above converge to finite, non-negative values.
Proof 
$P_L^n$ converging to 0 implies that the magnitude of $P_L$'s largest eigenvalue (its spectral
radius) is $< 1$, which in turn implies that the series $\sum_{i=0}^{\infty} P_L^i$ converges (similarly for
$P_U$). The elements of $R_L$ and $R_U$ are non-negative since they are the result of adding
and multiplying among the non-negative elements of $P_L$ and $P_U$, respectively.
Interestingly, a SCFG may be inconsistent and still have converging left-corner
and/or unit production matrices, i.e., consistency is a stronger constraint. For example
\[
\begin{array}{ll}
S \rightarrow a & [p] \\
S \rightarrow SS & [q]
\end{array}
\]
is inconsistent for any choice of $q > 1/2$, but the left-corner relation (a single number
in this case) is well defined for all $q < 1$, namely $(1 - q)^{-1} = p^{-1}$. In this case the left
fringe of the derivation is guaranteed to result in a terminal after finitely many steps,
but the derivation as a whole may never terminate.
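The contrast between the two quantities in this example can be checked numerically. The sketch below assumes the "process" view of consistency described earlier in this appendix: if $t_n$ is the probability that rewriting S has terminated within $n$ steps, then $t_{n+1} = p + q\,t_n^2$ for the grammar S -> a [p], S -> SS [q]. The function names and the iteration count are illustrative choices, not part of the original analysis.

```python
def termination_probability(q, iterations=10_000):
    """Limit of t <- p + q * t**2 started at t = 0, with p = 1 - q.

    t after n iterations is the probability that a derivation from S has
    terminated within n simultaneous rewriting steps.
    """
    p = 1.0 - q
    t = 0.0
    for _ in range(iterations):
        t = p + q * t * t
    return t


def left_corner_closure(q):
    """R_L for the same grammar: the 1x1 geometric series 1 + q + q**2 + ..."""
    return 1.0 / (1.0 - q)


consistent = termination_probability(0.4)     # approaches 1
inconsistent = termination_probability(0.6)   # approaches p / q = 2/3
closure = left_corner_closure(0.6)            # finite: 1 / p = 2.5
```

With $q = 0.6$ the derivation terminates with probability only about 2/3, so the grammar is inconsistent, yet the left-corner series still converges to $1/p = 2.5$.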
Appendix B: Implementation Notes 
This appendix discusses some of the experiences gained from implementing the prob- 
abilistic Earley parser. 
B.1 Prediction 
Because of the collapse of transitive predictions, this step can be implemented in a very 
efficient and straightforward manner. As explained in Section 4.5, one has to perform 
a single pass over the current state set, identifying all nonterminals Z occurring to the 
right of dots, and add states corresponding to all productions $Y \rightarrow \nu$ that are reachable
through the left-corner relation $Z \stackrel{*}{\Rightarrow}_L Y$. As indicated in equation (13), contributions
to the forward probabilities of new states have to be summed when several paths lead
to the same state. However, the summation in equation (13) can be optimized if the
$\alpha$ values for all old states with the same nonterminal $Z$ are summed first, and then
multiplied by $R(Z \stackrel{*}{\Rightarrow}_L Y)$. These quantities are then summed over all nonterminals $Z$,
and the result is multiplied once by the rule probability $P(Y \rightarrow \nu)$ to give the forward
probability for the predicted state.
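The order of operations described in this paragraph can be sketched as follows. The data layout is an assumption made for illustration: the state set is reduced to (symbol after the dot, alpha) pairs, `r_left` holds the precomputed left-corner closure as nested dictionaries, and `rules` maps each nonterminal to its productions; the toy grammar is hypothetical.

```python
from collections import defaultdict


def predict(state_set, r_left, rules):
    """Collapsed prediction step with alpha values summed per nonterminal first.

    state_set: list of (symbol_after_dot, alpha) pairs for the current position.
    r_left:    Z -> {Y: R(Z =>L Y)}, the left-corner closure (assumed precomputed).
    rules:     Y -> list of (rhs, probability).
    Returns {(Y, rhs): (alpha, gamma)} for the predicted states.
    """
    alpha_by_z = defaultdict(float)
    for symbol, alpha in state_set:
        if symbol in r_left:               # only nonterminals trigger prediction
            alpha_by_z[symbol] += alpha
    mass_for_y = defaultdict(float)
    for z, alpha_sum in alpha_by_z.items():
        for y, r in r_left[z].items():
            mass_for_y[y] += alpha_sum * r
    predicted = {}
    for y, mass in mass_for_y.items():
        for rhs, prob in rules.get(y, []):
            predicted[(y, rhs)] = (mass * prob, prob)   # gamma is the rule probability
    return predicted


# Hypothetical toy grammar: S -> NP VP [1.0], NP -> det n [0.7] | n [0.3], VP -> v [1.0]
rules = {"S": [(("NP", "VP"), 1.0)],
         "NP": [(("det", "n"), 0.7), (("n",), 0.3)],
         "VP": [(("v",), 1.0)]}
r_left = {"S": {"S": 1.0, "NP": 1.0}, "NP": {"NP": 1.0}, "VP": {"VP": 1.0}}

initial = predict([("S", 1.0)], r_left, rules)
merged = predict([("NP", 0.2), ("NP", 0.3), ("det", 0.1)], r_left, rules)
```

In the second call the two states waiting for NP are merged into a single mass of 0.5 before any multiplication with $R$ or a rule probability takes place.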
B.2 Completion 
Unlike prediction, the completion step still involves iteration. Each complete state de- 
rived by completion can potentially feed other completions. An important detail here 
is to ensure that all contributions to a state's $\alpha$ and $\gamma$ are summed before proceeding
with using that state as input to further completion steps. 
One approach to this problem is to insert complete states into a prioritized queue. 
The queue orders states by their start indices, highest first. This is because states 
corresponding to later expansions always have to be completed first before they can 
lead to the completion of expansions earlier on in the derivation. For each start index, 
the entries are managed as a first-in, first-out queue, ensuring that the dependency 
graph formed by the states is traversed in breadth-first order. 
The completion pass can now be implemented as follows. Initially, all complete 
states from the previous scanning step are inserted in the queue. States are then re- 
moved from the front of the queue and used to complete other states. Among the new 
states thus produced, complete ones are again added to the queue. The process iterates 
until no more states remain in the queue. Because the computation of probabilities al- 
ready includes chains of unit productions, states derived from such productions need 
not be queued, which also ensures that the iteration terminates. 
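One way to realize the queue discipline described above is a binary heap keyed on the negated start index, with an insertion counter to keep first-in, first-out order among states that share a start index. This is only a sketch of the ordering; the class name `CompletionQueue` and the string placeholders for states are assumptions, and the actual completion logic is omitted.

```python
import heapq
import itertools


class CompletionQueue:
    """Priority queue for complete states: highest start index first, FIFO within ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # insertion order breaks ties

    def push(self, start_index, state):
        # Negate the start index because heapq always pops the smallest key.
        heapq.heappush(self._heap, (-start_index, next(self._counter), state))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __bool__(self):
        return bool(self._heap)


queue = CompletionQueue()
for start, name in [(0, "a"), (2, "b"), (1, "c"), (2, "d")]:
    queue.push(start, name)

order = []
while queue:
    order.append(queue.pop())
```

For the reverse completion pass mentioned in the following paragraph of the text, the same structure could be reused with the key left unnegated.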
A similar queuing scheme, with the start index order reversed, can be used for the 
reverse completion step needed in the computation of outer probabilities (Section 5.2). 
B.3 Efficient Parsing with Large Sparse Grammars 
During work with a moderate-sized, application-specific natural language grammar 
taken from the BeRP speech system (Jurafsky, Wooters, Tajchman, Segal, Stolcke, Fosler,
and Morgan 1994) we had an opportunity to optimize our implementation of the 
algorithm. Below we relate some of the lessons learned in the process. 
B.3.1 Speeding up matrix inversions. Both prediction and completion steps make use 
of a matrix R defined as a geometric series derived from a matrix P, 
\[ R = I + P + P^2 + \cdots = (I - P)^{-1}. \]
Both P and R are indexed by the nonterminals in the grammar. The matrix P is de- 
rived from the SCFG rules and probabilities (either the left-corner relation or the unit 
production relation). 
For an application using a fixed grammar the time taken by the precomputation 
of left-corner and unit production matrices may not be crucial, since it occurs off- 
line. There are cases, however, when that cost should be minimized, e.g., when rule 
probabilities are iteratively reestimated. 
Even if the matrix P is sparse, the matrix inversion can be prohibitive for large 
numbers of nonterminals n. Empirically, matrices of rank n with a bounded number 
p of nonzero entries in each row (i.e., p is independent of n) can be inverted in time 
$O(n^2)$, whereas a full matrix of size $n \times n$ would require time $O(n^3)$.
In many cases the grammar has a relatively small number of nonterminals that 
have productions involving other nonterminals in a left-corner (or the RHS of a unit 
production). Only those nonterminals can have nonzero contributions to the higher 
powers of the matrix P. This fact can be used to substantially reduce the cost of the 
matrix inversion needed to compute R. 
Let P' be a subset of the entries of P, namely, only those elements indexed by non- 
terminals that have a nonempty row in P. For example, for the left-corner computation, 
P' is obtained from P by deleting all rows and columns indexed by nonterminals that 
do not have productions starting with nonterminals. Let I' be the identity matrix over 
the same set of nonterminals as P'. Then R can be computed as 
\[
\begin{aligned}
R &= I + (I + P + P^2 + \cdots) P \\
  &= I + (I' + P' + P'^2 + \cdots) * P \\
  &= I + (I' - P')^{-1} * P \\
  &= I + R' * P.
\end{aligned}
\]
Here $R'$ is the inverse of $I' - P'$, and $*$ denotes a matrix multiplication in which the left
operand is first augmented with zero elements to match the dimensions of the right
operand, $P$.
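The identity can be checked on a small example. The sketch below uses plain Python lists and a textbook Gauss-Jordan inversion instead of a sparse matrix package, so it illustrates the algebra rather than the speedup; all function names and the 4 x 4 matrix are assumptions made for this example.

```python
def invert(m):
    """Gauss-Jordan inversion with partial pivoting (dense, for small matrices)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def closure_full(p):
    """R = (I - P)^-1 computed on the full matrix."""
    n = len(p)
    return invert([[(1.0 if i == j else 0.0) - p[i][j] for j in range(n)]
                   for i in range(n)])


def closure_reduced(p):
    """R = I + R' * P, inverting only the rows/columns with a nonempty row in P."""
    n = len(p)
    active = [i for i in range(n) if any(p[i])]
    sub = [[(1.0 if a == b else 0.0) - p[a][b] for b in active] for a in active]
    r_sub = invert(sub) if active else []
    r = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, i in enumerate(active):          # zero-padded R' times P, added to I
        for j in range(n):
            r[i][j] += sum(r_sub[a][b] * p[k][j] for b, k in enumerate(active))
    return r


# Hypothetical left-corner matrix: only nonterminals 0 and 2 have nonempty rows.
P = [[0.2, 0.3, 0.1, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.4, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
full, reduced = closure_full(P), closure_reduced(P)
```

Here only a 2 x 2 system is inverted in the reduced version, while the result agrees with the full 4 x 4 inversion entry by entry.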
The speedups obtained with this technique can be substantial. For a grammar
with 789 nonterminals, of which only 132 have nonterminal productions, the left-corner
matrix was computed in 12 seconds (including the final multiply with $P$ and
addition of $I$). Inversion of the full matrix $I - P$ took 4 minutes, 28 seconds. 21
B.3.2 Linking and bottom-up filtering. As discussed in Section 4.8, the worst-case 
run-time on fully parameterized CNF grammars is dominated by the completion step. 
However, this is not necessarily true of sparse grammars. Our experiments showed that 
the computation is dominated by the generation of Earley states during the prediction 
steps. 
It is therefore worthwhile to minimize the total number of predicted states gen- 
erated by the parser. Since predicted states only affect the derivation if they lead to 
subsequent scanning, we can use the next input symbol to constrain the relevant pre- 
dictions. To this end, we compute the extended left-corner relation RLT, indicating 
which terminals can appear as left corners of which nonterminals. RLT is a Boolean 
21 These figures are not very meaningful for their absolute values. All measurements were obtained on a 
Sun SPARCstation 2 with a CommonLisp/CLOS implementation of generic sparse matrices that was not particularly optimized for this task. 
matrix with rows indexed by nonterminals and columns indexed by terminals. It can 
be computed as the product 
\[ R_{LT} = R_L P_{LT} \]
where PLT has a nonzero entry at (X, a) iff there is a production for nonterminal X that 
starts with terminal a. RL is the old left-corner relation. 
During the prediction step we can ignore incoming states whose RHS nonterminal 
following the dot cannot have the current input as a left-corner, and then eliminate 
from the remaining predictions all those whose LHS cannot produce the current input 
as a left-corner. These filtering steps are very fast, as they involve only table lookup. 
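The two filtering steps can be illustrated with a minimal sketch. It is not the implementation used for the measurements reported here (which was written in CommonLisp); the toy grammar, the function names left_corner_tables and filtered_predictions, and the use of Python sets as Boolean matrix rows are all assumptions made for illustration. The sketch also assumes a grammar without null productions, so that the first RHS symbol of every production exists.

```python
# Hypothetical toy grammar: nonterminal -> list of right-hand sides.
# Any symbol that is not a key of the dict is treated as a terminal.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("john",)],
    "VP": [("V", "NP"), ("barks",)],
    "Det": [("the",)],
    "N": [("dog",)],
    "V": [("sees",)],
}


def left_corner_tables(grammar):
    """Return (R_L, R_LT); each maps a nonterminal to a set (one Boolean row)."""
    nonterminals = set(grammar)
    p_l = {x: set() for x in nonterminals}   # X -> Y ... with Y a nonterminal
    p_lt = {x: set() for x in nonterminals}  # X -> a ... with a a terminal
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            first = rhs[0]  # assumes no null productions
            (p_l if first in nonterminals else p_lt)[lhs].add(first)
    # R_L: reflexive, transitive closure of the direct left-corner relation.
    r_l = {}
    for x in nonterminals:
        reached, stack = {x}, [x]
        while stack:
            for y in p_l[stack.pop()]:
                if y not in reached:
                    reached.add(y)
                    stack.append(y)
        r_l[x] = reached
    # Boolean matrix product R_LT = R_L P_LT, one row per nonterminal.
    r_lt = {x: set().union(*(p_lt[y] for y in r_l[x])) for x in nonterminals}
    return r_l, r_lt


def filtered_predictions(grammar, r_l, r_lt, after_dot, next_word):
    """Productions to predict, given the nonterminals following the dot in
    the incoming states and the next input terminal."""
    predicted, seen = [], set()
    for z in after_dot:
        if next_word not in r_lt[z]:
            continue  # step 1: ignore incoming states that cannot lead to a scan
        for y in sorted(r_l[z]):
            if y in seen or next_word not in r_lt[y]:
                continue  # step 2: drop LHSs that cannot start with next_word
            seen.add(y)
            predicted.extend((y, rhs) for rhs in grammar[y])
    return predicted


R_L, R_LT = left_corner_tables(GRAMMAR)
```

With this grammar, calling filtered_predictions(GRAMMAR, R_L, R_LT, ["S"], "john") omits the Det production that unfiltered prediction would generate, and a state with VP after the dot yields no predictions at all when the next word is "the". Both tests are set-membership lookups in the precomputed tables.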
This technique for speeding up Earley prediction is the exact converse of the 
"linking" method described by Pereira and Shieber (1987, chapter 6) for improving 
the efficiency of bottom-up parsers. There, the extended left-corner relation is used 
for top-down filtering the bottom-up application of grammar rules. In our case, we use 
linking to provide bottom-up filtering for top-down application of productions. 
On a test corpus this technique cut the number of generated predictions to 
almost one-fourth and sped up parsing by a factor of 3.3. The corpus consisted of 
1,143 sentences with an average length of 4.65 words. Top-down prediction alone 
generated 991,781 states and parsed at a rate of 590 milliseconds per sentence. With 
bottom-up filtered prediction only 262,287 states were generated, resulting in 180 
milliseconds per sentence. 
Acknowledgments 
Thanks are due Dan Jurafsky and Steve 
Omohundro for extensive discussions on 
the topics in this paper, and Fernando 
Pereira for helpful advice and pointers. 
Jerry Feldman, Terry Regier, Jonathan Segal, 
Kevin Thompson, and the anonymous 
reviewers provided valuable comments for 
improving content and presentation. 
References 
Aho, Alfred V., and Ullman, Jeffrey D. 
(1972). The Theory of Parsing, Translation, 
and Compiling, Volume 1: Parsing. 
Prentice-Hall. 
Bahl, Lalit R.; Jelinek, Frederick; and Mercer, 
Robert L. (1983). "A maximum likelihood 
approach to continuous speech 
recognition." IEEE Transactions on Pattern 
Analysis and Machine Intelligence, 5(2), 
179-190. 
Baker, James K. (1979). "Trainable grammars 
for speech recognition." In Speech 
Communication Papers for the 97th Meeting of 
the Acoustical Society of America, edited by 
Jared J. Wolf and Dennis H. Klatt, 
547-550. 
Baum, Leonard E.; Petrie, Ted; Soules, 
George; and Weiss, Norman (1970). "A 
maximization technique occurring in the 
statistical analysis of probabilistic 
functions of Markov chains." The Annals of 
Mathematical Statistics, 41(1), 164-171. 
Booth, Taylor L., and Thompson, Richard A. 
(1973). "Applying probability measures to 
abstract languages." IEEE Transactions on 
Computers, C-22(5), 442-450. 
Briscoe, Ted, and Carroll, John (1993). 
"Generalized probabilistic LR parsing of 
natural language (corpora) with 
unification-based grammars." 
Computational Linguistics, 19(1), 25-59. 
Casacuberta, F., and Vidal, E. (1988). "A 
parsing algorithm for weighted grammars 
and substring recognition." In Syntactic 
and Structural Pattern Recognition, Volume 
F45, NATO ASI Series, edited by Gabriel 
Ferraté, Theo Pavlidis, Alberto Sanfeliu, 
and Horst Bunke, 51-67. Springer Verlag. 
Corazza, Anna; De Mori, Renato; Gretter, 
Roberto; and Satta, Giorgio (1991). 
"Computation of probabilities for an 
island-driven parser." IEEE Transactions on 
Pattern Analysis and Machine Intelligence, 
13(9), 936-950. 
Dempster, A. P.; Laird, N. M.; and Rubin, 
D. B. (1977). "Maximum likelihood from 
incomplete data via the EM algorithm." 
Journal of the Royal Statistical Society, Series 
B, 39(1), 1-38. 
Earley, Jay (1970). "An efficient context-free 
parsing algorithm." Communications of the 
ACM, 13(2), 94-102. 
Fujisaki, T.; Jelinek, F.; Cocke, J.; Black, E.; 
and Nishino, T. (1991). "A probabilistic 
parsing method for sentence 
disambiguation." In Current Issues in 
Parsing Technology, edited by Masaru 
Tomita, 139-152. Kluwer Academic 
Publishers. 
Graham, Susan L.; Harrison, Michael A.; 
and Ruzzo, Walter L. (1980). "An 
improved context-free recognizer." ACM 
Transactions on Programming Languages and 
Systems, 2(3), 415-462. 
Jelinek, Frederick (1985). "Markov source 
modeling of text generation." In The 
Impact of Processing Techniques on 
Communications, Volume E91, NATO ASI 
Series, edited by J. K. Skwirzynski, 
569-598. Nijhoff. 
Jelinek, Frederick, and Lafferty, John D. 
(1991). "Computation of the probability of 
initial substring generation by stochastic 
context-free grammars." Computational 
Linguistics, 17(3), 315-323. 
Jelinek, Frederick; Lafferty, John D.; and 
Mercer, Robert L. (1992). "Basic methods 
of probabilistic context free grammars." 
In Speech Recognition and Understanding. 
Recent Advances, Trends, and Applications, 
Volume F75, NATO ASI Series, edited by 
Pietro Laface and Renato De Mori, 
345-360. Springer Verlag. 
Jones, Mark A., and Eisner, Jason M. (1992). 
"A probabilistic parser and its 
applications." In AAAI Workshop on 
Statistically-Based NLP Techniques, 20-27. 
Jurafsky, Daniel; Wooters, Chuck; Segal, 
Jonathan; Stolcke, Andreas; Fosler, Eric; 
Tajchman, Gary; and Morgan, Nelson 
(1995). "Using a stochastic context-free 
grammar as a language model for speech 
recognition." In Proceedings, IEEE 
Conference on Acoustics, Speech and Signal 
Processing, 189-192. Detroit, Michigan. 
Jurafsky, Daniel; Wooters, Chuck; Tajchman, 
Gary; Segal, Jonathan; Stolcke, Andreas; 
Fosler, Eric; and Morgan, Nelson (1994). 
"The Berkeley restaurant project." In 
Proceedings, International Conference on 
Spoken Language Processing, 4, 2139-2142. 
Yokohama, Japan. 
Kupiec, Julian (1992). "Hidden Markov 
estimation for unrestricted stochastic 
context-free grammars." In Proceedings, 
IEEE Conference on Acoustics, Speech and 
Signal Processing, 1, 177-180, San 
Francisco, California. 
Lari, K., and Young, S. J. (1990). "The 
estimation of stochastic context-free 
grammars using the Inside-Outside 
algorithm." Computer Speech and Language, 
4, 35-56. 
Lari, K., and Young, S. J. (1991). 
"Applications of stochastic context-free 
grammars using the Inside-Outside 
algorithm." Computer Speech and Language, 
5, 237-257. 
Magerman, David M., and Marcus, 
Mitchell P. (1991). "Pearl: A probabilistic 
chart parser." In Proceedings, Second 
International Workshop on Parsing 
Technologies. Cancun, Mexico, 193-199. 
Magerman, David M., and Weir, Carl (1992). 
"Efficiency, robustness and accuracy in 
Picky chart parsing." In Proceedings, 30th 
Annual Meeting of the Association for 
Computational Linguistics. Newark, 
Delaware, 40-47. 
Nakagawa, Sei-ichi (1987). "Spoken 
sentence recognition by time-synchronous 
parsing algorithm of context-free 
grammar." In Proceedings, IEEE Conference 
on Acoustics, Speech and Signal Processing, 2, 
829-832. Dallas, Texas. 
Ney, Hermann (1992). "Stochastic grammars 
and pattern recognition." In Speech 
Recognition and Understanding. Recent 
Advances, Trends, and Applications, Volume 
F75, NATO ASI Series, edited by Pietro 
Laface and Renato De Mori, 319-344. 
Springer Verlag. 
Päseler, Annedore (1988). "Modification of 
Earley's algorithm for speech 
recognition." In Recent Advances in Speech 
Understanding and Dialog Systems, Volume 
F46, NATO ASI Series, edited by 
H. Niemann, M. Lang, and G. Sagerer, 
466-472. Springer Verlag. 
Pereira, Fernando C. N., and Schabes, Yves 
(1992). "Inside-outside reestimation from 
partially bracketed corpora." In 
Proceedings, 30th Annual Meeting of the 
Association for Computational Linguistics. 
Newark, Delaware, 128-135. 
Pereira, Fernando C. N., and Shieber, 
Stuart M. (1987). Prolog and 
Natural-Language Analysis. CSLI Lecture 
Notes Series, Number 10. Center for the 
Study of Language and Information, 
Stanford, California. 
Rabiner, L. R., and Juang, B. H. (1986). "An 
introduction to hidden Markov models." 
IEEE ASSP Magazine, 3(1), 4-16. 
Schabes, Yves (1991). "An inside-outside 
algorithm for estimating the parameters 
of a hidden stochastic context-free 
grammar based on Earley's algorithm." 
Paper presented at the Second Workshop 
on Mathematics of Language, Tarrytown, 
New York. 
Stolcke, Andreas, and Segal, Jonathan 
(1994). "Precise n-gram probabilities from 
stochastic context-free grammars." In 
Proceedings, 32nd Annual Meeting of the 
Association for Computational Linguistics. 
Las Cruces, New Mexico, 74-79. 
Stolcke, Andreas (1993). "An efficient 
probabilistic context-free parsing 
algorithm that computes prefix 
probabilities." Technical Report 
TR-93-065, International Computer 
Science Institute, Berkeley, CA. Revised 
1994. 
Tomita, Masaru (1986). Efficient Parsing for Natural Language. 
Kluwer Academic 
Publishers. 
Wright, J. H. (1990). "LR parsing of 
probabilistic grammars with input 
uncertainty for speech recognition." 
Computer Speech and Language, 4, 297-323. 

