File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/95/j95-2002_metho.xml

Size: 75,645 bytes

Last Modified: 2025-10-06 14:13:59

<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-2002">
  <Title>An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities</Title>
  <Section position="5" start_page="169" end_page="173" type="metho">
    <SectionTitle>
4. Probabilistic Earley Parsing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="169" end_page="170" type="sub_section">
      <SectionTitle>
4.1 Stochastic Context-Free Grammars
</SectionTitle>
      <Paragraph position="0"> A stochastic context-free grammar (SCFG) extends the standard context-free formalism by adding probabilities to each production: \[p\],</Paragraph>
    </Section>
    <Section position="2" start_page="170" end_page="171" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> where the rule probability p is usually written as P(X --+ ~). This notation to some extent hides the fact that p is a conditional probability, of production X --+ ,~ being chosen, given that X is up for expansion. The probabilities of all rules with the same nonterminal X on the LHS must therefore sum to unity. Context-freeness in a probabilistic setting translates into conditional independence of rule choices. As a result, complete derivations have joint probabilities that are simply the products of the rule probabilities involved.</Paragraph>
      <Paragraph position="1"> The probabilities of interest mentioned in Section 1 can now be defined formally.</Paragraph>
      <Paragraph position="2"> Definition 1 The following quantities are defined relative to a SCFG G, a nonterminal X, and a string x over the alphabet y~ of G.</Paragraph>
      <Paragraph position="3"> a) The probability of a (partial) derivation/71 ~/72 ~ &amp;quot;&amp;quot;/Tk is inductively defined by</Paragraph>
      <Paragraph position="5"> where /71,/72 .... ,/Tk are strings of terminals and nonterminals, X ~ A is a production of G, and u2 is derived from/71 by replacing one occurrence of X with &amp;.</Paragraph>
      <Paragraph position="6"> The string probability P(X =g x) (of x given X) is the sum of the probabilities of all left-most derivations X =&gt; ... =&gt; x producing x from X. s The sentence probability P(S ~ x) (of x given G) is the string probability given the start symbol S of G. By definition, this is also the probability P(x I G) assigned to x by the grammar G.</Paragraph>
      <Paragraph position="7"> The prefix probability P(S g&gt;L X) (of X given G) is the sum of the probabilities of all sentence strings having x as a prefix,</Paragraph>
      <Paragraph position="9"> In the following, we assume that the probabilities in a SCFG are proper and consistent as defined in Booth and Thompson (1973), and that the grammar contains no useless nonterminals (ones that can never appear in a derivation). These restrictions ensure that all nonterminals define probability measures over strings; i.e., P(X ~ x) is a proper distribution over x for all X. Formal definitions of these conditions are given in Appendix A.</Paragraph>
      <Paragraph position="10"> 5 In a left-most derivation each step replaces the nonterminal furthest to the left in the partially expanded string. The order of expansion is actually irrelevant for this definition, because of the multiplicative combination of production probabilities. We restrict summation to left-most derivations to avoid counting duplicates, and because left-most derivations will play an important role later.</Paragraph>
    </Section>
    <Section position="3" start_page="171" end_page="172" type="sub_section">
      <SectionTitle>
4.2 Earley Paths and Their Probabilities
</SectionTitle>
      <Paragraph position="0"> In order to define the probabilities associated with parser operation on a SCFG, we need the concept of a path, or partial derivation, executed by the Earley parser.</Paragraph>
      <Paragraph position="2"> An (unconstrained) Earley path, or simply path, is a sequence of Earley states linked by prediction, scanning, or completion. For the purpose of this definition, we allow scanning to operate in &amp;quot;generation mode,&amp;quot; i.e., all states with terminals to the right of the dot can be scanned, not just those matching the input. (For completed states, the predecessor state is defined to be the complete state from the same state set contributing to the completion.) A path is said to be constrained by, or to generate a string x if the terminals immediately to the left of the dot in all scanned states, in sequence, form the string x.</Paragraph>
      <Paragraph position="3"> A path is complete if the last state on it matches the first, except that the dot has moved to the end of the RHS.</Paragraph>
      <Paragraph position="4"> We say that a path starts with nonterminal X if the first state on it is a predicted state with X on the LHS.</Paragraph>
      <Paragraph position="5"> The length of a path is defined as the number of scanned states on it. Note that the definition of path length is somewhat counterintuitive, but is motivated by the fact that only scanned states correspond directly to input symbols. Thus the length of a path is always the same as the length of the input string it generates. A constrained path starting with the initial state contains a sequence of states from state set 0 derived by repeated prediction, followed by a single state from set 1 produced by scanning the first symbol, followed by a sequence of states produced by completion, followed by a sequence of predicted states, followed by a state scanning the second symbol, and so on. The significance of Earley paths is that they are in a one-to-one correspondence with left-most derivations. This will allow us to talk about probabilities of derivations, strings, and prefixes in terms of the actions performed by Earley's parser. From now on, we will use &amp;quot;derivation&amp;quot; to imply a left-most derivation.</Paragraph>
      <Paragraph position="7"> An Earley parser generates state i: kX--+ A.#, if and only if there is a partial derivation S ~ Xo...k_lXly =~ Xo...k_l/~#I/ G Xo...k_lXk...i_l#l/ deriving a prefix Xo...i-1 of the input. There is a one-to-one mapping between partial derivations and Earley paths, such that each production X --~ ~, applied in a derivation corresponds to a predicted Earley state X --~ .L,.</Paragraph>
    </Section>
    <Section position="4" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> (a) is the invariant underlying the correctness and completeness of Earley's algorithm; it can be proved by induction on the length of a derivation (Aho and Ullman 1972, Theorem 4.9). The slightly stronger form (b) follows from (a) and the way possible prediction steps are defined.</Paragraph>
      <Paragraph position="1"> Since we have established that paths correspond to derivations, it is convenient to associate derivation probabilities directly with paths. The uniqueness condition (b) above, which is irrelevant to the correctness of a standard Earley parser, justifies (probabilistic) counting of paths in lieu of derivations.</Paragraph>
      <Paragraph position="2"> Definition 3 The probability P(7 ~) of a path ~ is the product of the probabilities of all rules used in the predicted states occurring in ~v.</Paragraph>
      <Paragraph position="3"> Lemma 2 a) For all paths 7 ~ starting with a nonterminal X, P(7 ~) gives the probability of the (partial) derivation represented by ~v. In particular, the string probability P(X ~ x) is the sum of the probabilities of all paths starting with X that are complete and constrained by x.</Paragraph>
      <Paragraph position="4"> b) The sentence probability P(S ~ x) is the sum of the probabilities of all complete paths starting with the initial state, constrained by x.</Paragraph>
      <Paragraph position="5"> c) The prefix probability P(S ~L X) is the sum of the probabilities of all paths 7 ~ starting with the initial state, constrained by x, that end in a scanned state.</Paragraph>
      <Paragraph position="6"> Note that when summing over all paths &amp;quot;starting with the initial state,&amp;quot; summation is actually over all paths starting with S, by definition of the initial state 0 --* .S. (a) follows directly from our definitions of derivation probability, string probability, path probability, and the one-to-one correspondence between paths and derivations established by Lemma 1. (b) follows from (a) by using S as the start nonterminal. To obtain the prefix probability in (c), we need to sum the probabilities of all complete derivations that generate x as a prefix. The constrained paths ending in scanned states represent exactly the beginnings of all such derivations. Since the grammar is assumed to be consistent and without useless nonterminals, all partial derivations can be completed with probability one. Hence the sum over the constrained incomplete paths is the sought-after sum over all complete derivations generating the prefix.</Paragraph>
    </Section>
    <Section position="5" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
4.3 Forward and Inner Probabilities
</SectionTitle>
      <Paragraph position="0"> Since string and prefix probabilities are the result of summing derivation probabilities, the goal is to compute these sums efficiently by taking advantage of the Earley control structure. This can be accomplished by attaching two probabilistic quantities to each Earley state, as follows. The terminology is derived from analogous or similar quantities commonly used in the literature on Hidden Markov Models (HMMs) (Rabiner and Juang 1986) and in Baker (1979).</Paragraph>
      <Paragraph position="1"> Definition 4 The following definitions are relative to an implied input string x.</Paragraph>
      <Paragraph position="2"> a) The forward probability Oq(kX--4 A.\[d,) is the sum of the probabilities of all constrained paths of length i that end in state kX ---* .~.#.</Paragraph>
      <Paragraph position="3">  Computational Linguistics Volume 21, Number 2 b) The inner probability ~i(k x ---+ /~.\]1,) is the sum of the probabilities of all paths of length i - k that start in state k : kX -* .)~# and end in</Paragraph>
      <Paragraph position="5"> It helps to interpret these quantities in terms of an unconstrained Earley parser that operates as a generator emitting--rather than recognizing--strings. Instead of tracking all possible derivations, the generator traces along a single Earley path randomly determined by always choosing among prediction steps according to the associated rule probabilities. Notice that the scanning and completion steps are deterministic once the rules have been chosen.</Paragraph>
      <Paragraph position="6"> Intuitively, the forward probability Oq(kX &amp;quot;-+ ,,~.~) is the probability of an Earley generator producing the prefix of the input up to position i - 1 while passing through state kX --* ~.# at position i. However, due to left-recursion in productions the same state may appear several times on a path, and each occurrence is counted toward the total ~i. Thus, ~i is really the expected number of occurrences of the given state in state set i. Having said that, we will refer to o~ simply as a probability, both for the sake of brevity, and to keep the analogy to the HMM terminology of which this is a generalization. 6 Note that for scanned states, ~ is always a probability, since by definition a scanned state can occur only once along a path.</Paragraph>
      <Paragraph position="7"> The inner probabilities, on the other hand, represent the probability of generating a substring of the input from a given nonterminal, using a particular production.</Paragraph>
      <Paragraph position="8"> Inner probabilities are thus conditional on the presence of a given nonterminal X with expansion starting at position k, unlike the forward probabilities, which include the generation history starting with the initial state. The inner probabilities as defined here correspond closely to the quantities of the same name in Baker (1979). The sum of &amp;quot;y of all states with a given LHS X is exactly Baker's inner probability for X.</Paragraph>
      <Paragraph position="9"> The following is essentially a restatement of Lemma 2 in terms of forward and inner probabilities. It shows how to obtain the sentence and string probabilities we are interested in, provided that forward and inner probabilities can be computed effectively. null Lemma 3 The following assumes an Earley chart constructed by the parser on an input string x  with Ixl = l.</Paragraph>
      <Paragraph position="10"> a) Provided that S :GL Xo...k-lXV is a possible left-most derivation of the grammar (for some v), the probability that a nonterminal X generates the substring Xk... xi-1 can be computed as the sum</Paragraph>
      <Paragraph position="12"> (sum of inner probabilities over all complete states with LHS X and start index k).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="173" end_page="184" type="metho">
    <SectionTitle>
6 The same technical complication was noticed by Wright (1990) in the computation of probabilistic LR
</SectionTitle>
    <Paragraph position="0"> parser tables. The relation to LR parsing will be discussed in Section 6.2. Incidentally, a similar interpretation of forward &amp;quot;probabilities&amp;quot; is required for HMMs with non-emitting states.</Paragraph>
    <Paragraph position="1">  In particular, the string probability P(S G x) can be computed as 7</Paragraph>
    <Paragraph position="3"> (sum of forward probabilities over all scanned states).</Paragraph>
    <Paragraph position="4"> The restriction in (a) that X be preceded by a possible prefix is necessary, since the Earley parser at position i will only pursue derivations that are consistent with the input up to position i. This constitutes the main distinguishing feature of Earley parsing compared to the strict bottom-up computation used in the standard inside probability computation (Baker 1979). There, inside probabilities for all positions and nonterminals are computed, regardless of possible prefixes.</Paragraph>
    <Section position="1" start_page="174" end_page="176" type="sub_section">
      <SectionTitle>
4.4 Computing Forward and Inner Probabilities
</SectionTitle>
      <Paragraph position="0"> Forward and inner probabilities not only subsume the prefix and string probabilities, they are also straightforward to compute during a run of Earley's algorithm. In fact, if it weren't for left-recursive and unit productions their computation would be trivial.</Paragraph>
      <Paragraph position="1"> For the purpose of exposition we will therefore ignore the technical complications introduced by these productions for a moment, and then return to them once the overall picture has become clear.</Paragraph>
      <Paragraph position="2"> During a run of the parser both forward and inner probabilities will be attached to each state, and updated incrementally as new states are created through one of the three types of transitions. Both probabilities are set to unity for the initial state 0 --* .S. This is consistent with the interpretation that the initial state is derived from a dummy production ~ S for which no alternatives exist.</Paragraph>
      <Paragraph position="3"> Parsing then proceeds as usual, with the probabilistic computations detailed below.</Paragraph>
      <Paragraph position="4"> The probabilities associated with new states will be computed as sums of various combinations of old probabilities. As new states are generated by prediction, scanning, and completion, certain probabilities have to be accumulated, corresponding to the multiple paths leading to a state. That is, if the same state is generated multiple times, the previous probability associated with it has to be incremented by the new contribution just computed. States and probability contributions can be generated in any order, as long as the summation for one state is finished before its probability enters into the computation of some successor state. Appendix B.2 suggests a way to implement this incremental summation.</Paragraph>
      <Paragraph position="5"> Notation. A few intuitive abbreviations are used from here on to describe Earley transitions succinctly. (1) To avoid unwieldy y\]~ notation we adopt the following convention. The expression x += y means that x is computed incrementally as a sum of various y terms, which are computed in some order and accumulated to finally yield the value of x. 8 (2) Transitions are denoted by ~, with predecessor states on the left  Computational Linguistics Volume 21, Number 2 and successor states on the right. (3) The forward and inner probabilities of states are notated in brackets after each state, e.g., i: kX --+ ,~.Y# \[a, 7\] is shorthand for a = ai(kX --+ A.Y#), 7 = &amp;quot;fi(k X -+ ,~.Y#).</Paragraph>
      <Paragraph position="6"> Prediction (probabilistic).</Paragraph>
      <Paragraph position="7"> i: kX--+ A.Y# \[a, 7\] ~ i: iY--+., \[a',7'\] for all productions Y ~ u. The new probabilities can be computed as</Paragraph>
      <Paragraph position="9"> Note that only the forward probability is accumulated; 7 is not used in this step.</Paragraph>
      <Paragraph position="10"> Rationale. a' is the sum of all path probabilities leading up to kX --* )~.Y#, times the probability of choosing production Y --+ u. The value &amp;quot;7' is just a special case of the definition.</Paragraph>
      <Paragraph position="11"> Scanning (probabilistic).</Paragraph>
      <Paragraph position="13"> for all states with terminal a matching input at position i. Then &amp;quot;7' = 7 Rationale. Scanning does not involve any new choices, since the terminal was already selected as part of the production during prediction. 9 Completion (probabilistic).</Paragraph>
      <Paragraph position="14"> i: jY --+ u. \[a',7&amp;quot;\] j: kX--+,k.Yit \[a, 71 f ~ i: kX--+,kY.it \[a',7'\]</Paragraph>
      <Paragraph position="16"> Note that ~&amp;quot; is not used.</Paragraph>
      <Paragraph position="17"> Rationale. To update the old forward/inner probabilities a and &amp;quot;7 to cd and &amp;quot;7', respectively, the probabilities of all paths expanding Y --+ t, have to be factored in. These are exactly the paths summarized by the inner probability &amp;quot;7&amp;quot;.</Paragraph>
      <Paragraph position="18"> 9 In different parsing scenarios the scanning step may well modify probabilities. For example, if the input symbols themselves have attached likelihoods, these can be integrated by multiplying them onto a and &amp;quot;/when a symbol is scanned. That way it is possible to perform efficient Earley parsing with integrated joint probability computation directly on weighted lattices describing ambiguous inputs.</Paragraph>
    </Section>
    <Section position="2" start_page="176" end_page="184" type="sub_section">
      <SectionTitle>
4.5 Coping with Recursion
</SectionTitle>
      <Paragraph position="0"> The standard Earley algorithm, together with the probability computations described in the previous section, would be sufficient if it weren't for the problem of recursion in the prediction and completion steps.</Paragraph>
      <Paragraph position="1"> The nonprobabilistic Earley algorithm can stop recursing as soon as all predictions/completions yield states already contained in the current state set. For the computation of probabilities, however, this would mean truncating the probabilities resulting from the repeated summing of contributions.</Paragraph>
      <Paragraph position="2"> 4.5.1 Prediction loops. As an example, consider the following simple left-recursive SCFG.</Paragraph>
      <Paragraph position="4"> where q = 1 - p. Nonprobabilistically, the prediction loop at position 0 would stop after producing the states</Paragraph>
      <Paragraph position="6"> This would leave the forward probabilities at</Paragraph>
      <Paragraph position="8"> corresponding to just two out of an infinity of possible paths. The correct forward probabilities are obtained as a sum of infinitely many terms, accounting for all possible paths of length 1.</Paragraph>
      <Paragraph position="10"> In these sums each p corresponds to a choice of the first production, each q to a choice of the second production. If we didn't care about finite computation the resulting geometric series could be computed by letting the prediction loop (and hence the summation) continue indefinitely.</Paragraph>
      <Paragraph position="11"> Fortunately, all repeated prediction steps, including those due to left-recursion in the productions, can be collapsed into a single, modified prediction step, and the corresponding sums computed in closed form. For this purpose we need a probabilistic version of the well-known parsing concept of a left comer, which is also at the heart of the prefix probability algorithm of Jelinek and Lafferty (1991).</Paragraph>
      <Paragraph position="12"> Definition 5 The following definitions are relative to a given SCFG G.</Paragraph>
      <Paragraph position="13"> a) Two nonterminals X and Y are said to be in a left-comer relation  X --*L Y iff there exists a production for X that has a RHS starting with Y, X--* YA.</Paragraph>
      <Paragraph position="14">  Computational Linguistics Volume 21, Number 2 b) The probabilistic left-corner relation 1deg PL = PL(G) is the matrix of probabilities P(X --+L Y), defined as the total probability of choosing a production for X that has Y as a left corner:</Paragraph>
      <Paragraph position="16"> The probabilistic reflexive, transitive left-corner relation RL = RL(G) is a matrix of probability sums R(X :GL Y). Each R(X ~L Y) is defined as a</Paragraph>
      <Paragraph position="18"> where we use the delta function, defined as 6(X, Y) = 1 if X = Y, and</Paragraph>
      <Paragraph position="20"> from which the closed-form solution is derived:</Paragraph>
      <Paragraph position="22"> An existence proof for RL is given in Appendix A. Appendix B.3.1 shows how to speed up the computation of RL by inverting only a reduced version of the matrix I - PL.</Paragraph>
      <Paragraph position="23"> The significance of the matrix RL for the Earley algorithm is that its elements are the sums of the probabilities of the potentially infinitely many prediction paths leading from a state kX --+ A.Z# to a predicted state iY --~ .~', via any number of intermediate states.</Paragraph>
      <Paragraph position="24"> RL can be computed once for each grammar, and used for table-lookup in the following, modified prediction step.</Paragraph>
      <Paragraph position="25"> 10 If a probabilistic relation R is replaced by its set-theoretic version R I, i.e., (x,y) E R' iff R(x,y) ~ 0, then the closure operations used here reduce to their traditional discrete counterparts; hence the choice of terminology.</Paragraph>
      <Paragraph position="26">  i: kX --+ A.Z# \[c~,'7\] ~ i: iW ~ .11 \[oJ, &amp;quot;7'\] for all productions Y --+ u such that R(Z GL Y) is nonzero. Then</Paragraph>
      <Paragraph position="28"> The new R(Z GL Y) factor in the updated forward probability accounts for the sum of all path probabilities linking Z to Y. For Z = Y this covers the case of a single step of prediction; R(Y ~L Y) _&gt; 1 always, since RL is defined as a reflexive closure.</Paragraph>
      <Paragraph position="29">  may imply an infinite summation, and could lead to an infinite loop if computed naively. However, only unit productions 11 can give rise to cyclic completions. The problem is best explained by studying an example. Consider the grammar</Paragraph>
      <Paragraph position="31"> where q = 1 - p. Presented with the input a (the only string the grammar generates), after one cycle of prediction, the Earley chart contains the following states.</Paragraph>
      <Paragraph position="33"> The p-1 factors are a result of the left-corner sum 1 + q + q2 q_ .... (1 - q)-l.</Paragraph>
      <Paragraph position="34"> After scanning oS --* .a, completion without truncation would enter an infinite loop. First 0T ~ .S is completed, yielding a complete state 0T --* S., which allows 0S ---* .T to be completed, leading to another complete state for S, etc. The nonprobabilistic Earley parser can just stop here, but as in prediction, this would lead to truncated probabilities. The sum of probabilities that needs to be computed to arrive at the correct result contains infinitely many terms, one for each possible loop through the T --+ S production. Each such loop adds a factor of q to the forward and inner probabilities.</Paragraph>
      <Paragraph position="35"> The summations for all completed states turn out as</Paragraph>
      <Paragraph position="37"> Computational Linguistics Volume 21, Number 2 The approach taken here to compute exact probabilities in cyclic completions is mostly analogous to that for left-recursive predictions. The main difference is that unit productions, rather than left-corners, form the underlying transitive relation. Before proceeding we can convince ourselves that this is indeed the only case we have to worry about.</Paragraph>
      <Paragraph position="39"> be a completion cycle, i.e., kl = kC/, X1 = Xc, ~1 --- )~c, X2 = Xc+I. Then it must be the case that ~1 = /~2 ..... &amp;quot;~c = C, i.e., all productions involved are unit productions</Paragraph>
      <Paragraph position="41"> For all completion chains it is true that the start indices of the states are monotonically increasing, kl ~ k2 ~ ... (a state can only complete an expansion that started at the same or a previous position). From kl ~- kc, it follows that kl = k2 ..... kc.</Paragraph>
      <Paragraph position="42"> Because the current position (dot) also refers to the same input index in all states, all nonterminals X~,X2,...,Xc have been expanded into the same substring of the input between kl and the current position. By assumption the grammar contains no nonterminals that generate C/,12 therefore we must have )~1 : ~2 ..... )~c = e, q.e.d.</Paragraph>
      <Paragraph position="43"> \[\] We now formally define the relation between nonterminals mediated by unit productions, analogous to the left-corner relation.</Paragraph>
      <Paragraph position="44"> Definition 6 The following definitions are relative to a given SCFG G.</Paragraph>
      <Paragraph position="46"> where q = 1 - p. This highly ambiguous grammar generates strings of any number of a's, using all possible binary parse trees over the given number of terminals. The states involved in parsing the string aaa are listed in Table 2, along with their forward and inner probabilities. The example illustrates how the parser deals with left-recursion and the merging of alternative sub-parses during completion.</Paragraph>
      <Paragraph position="47"> Since the grammar has only a single nonterminal, the left-corner matrix PL has</Paragraph>
      <Paragraph position="49"> Consequently, the example trace shows the factor p-1 being introduced into the forward probability terms in the prediction steps.</Paragraph>
      <Paragraph position="50"> The sample string can be parsed as either (a(aa)) or ((aa)a), each parse having a probability of p3q2. The total string probability is thus 2p3q 2, the computed c~ and 7 values for the final state. The oe values for the scanned states in sets 1, 2, and 3 are the prefix probabilities for a, aa, and aaa, respectively: P(S GL a) = 1, P(S :GL aa) = q,</Paragraph>
      <Paragraph position="52"> Earley chart as constructed during the parse of aaa with the grammar in (a). The two columns to the right in (b) list the forward and inner probabilities, respectively, for each state. In both c~ and 3' columns, the * separates old factors from new ones (as per equations 11, 12 and 13).</Paragraph>
      <Paragraph position="53"> Addition indicates multiple derivations of the same state.</Paragraph>
      <Paragraph position="54">  forward parser operation described so far, some of which are due specifically to the probabilistic aspects of parsing. This section summarizes the necessary modifications to process null productions correctly, using the previous description as a baseline. Our treatment of null productions follows the (nonprobabilistic) formulation of Graham, Harrison, and Ruzzo (1980), rather than the original one in Earley (1970).</Paragraph>
      <Paragraph position="55"> 4.7.1 Computing c-expansion probabilities. The main problem with null productions is that they allow multiple prediction-completion cycles in between scanning steps (since null productions do not have to be matched against one or more input symbols).</Paragraph>
      <Paragraph position="56"> Our strategy will be to collapse all predictions and completions due to chains of null productions into the regular prediction and completion steps, not unlike the way recursive predictions/completions were handled in Section 4.5.</Paragraph>
      <Paragraph position="57"> A prerequisite for this approach is to precompute, for all nonterminals X, the probability that X expands to the empty string. Note that this is another recursive problem, since X itself may not have a null production, but expand to some nonterminal Y that does.</Paragraph>
      <Paragraph position="58"> Computation of P(X :~ c) for all X can be cast as a system of non-linear equations, as follows. For each X, let ex be an abbreviation for P(X G c). For example, let X have  The semantics of context-free rules imply that X can only expand to c if all the RHS nonterminals in one of X's productions expand to e. Translating to probabilities, we obtain the equation ex -- Pl + p2eyley2 + p3ey3eY4eY5 + &amp;quot;&amp;quot; * In other words, each production contributes a term in which the rule probability is multiplied by the product of the e variables corresponding to the RHS nonterminals, unless the RHS contains a terminal (in which case the production contributes nothing to ex because it cannot possibly lead to e).</Paragraph>
      <Paragraph position="59"> The resulting nonlinear system can be solved by iterative approximation. Each variable ex is initialized to P(X ~ e), and then repeatedly updated by substituting in the equation right-hand sides, until the desired level of accuracy is attained. Convergence is guaranteed, since the ex values are monotonically increasing and bounded above by the true values P(X ~ e) ( 1. For grammars without cyclic dependencies among e-producing nonterminals, this procedure degenerates to simple backward substitution. Obviously the system has to be solved only once for each grammar.</Paragraph>
      <Paragraph position="60"> The probability ex can be seen as the precomputed inner probability of an expansion of X to the empty string; i.e., it sums the probabilities of all Earley paths that derive c from X. This is the justification for the way these probabilities can be used in modified prediction and completion steps, described next.</Paragraph>
      <Paragraph position="61">  lation. For each X occurring to the right of a dot, we generate states for all Y that  Computational Linguistics Volume 21, Number 2 are reachable from X by way of the X --*L Y relation. This reachability criterion has to be extended in the presence of null productions. Specifically, if X has a production</Paragraph>
      <Paragraph position="63"> probability of expanding to e. The contribution of such a production to the left-corner</Paragraph>
      <Paragraph position="65"> The old prediction procedure can now be modified in two steps. First, replace the old PL relation by the one that takes into account null productions, as sketched above. From the resulting PL compute the reflexive transitive closure RL, and use it to generate predictions as before.</Paragraph>
      <Paragraph position="66"> Second, when predicting a left corner Y with a production Y --* Y1 ... Yi-IYi)~, add states for all dot positions up to the first RHS nonterminal that cannot expand to e, say from X --* .Y1 ... Yi-I Yi )~ through X --* Y1 ... Yi-l.Yi .X. We will call this procedure &amp;quot;spontaneous dot shifting.&amp;quot; It accounts precisely for those derivations that expand the RHS prefix Y1 ... Wi-1 without consuming any of the input symbols.</Paragraph>
      <Paragraph position="67"> The forward and inner probabilities of the states thus created are those of the first state X --* .Y1... Yi-lYi/~, multiplied by factors that account for the implied eexpansions. This factor is just the product 1-I~=1 eYk, where j is the dot position.</Paragraph>
      <Paragraph position="68">  a similar pattern. First, the unit production relation has to be extended to allow for unit production chains due to null productions. A rule X ~ Y1 ... Yi-lYiYi+l ... Yj can effectively act as a unit production that links X and Yi if all other nonterminals on the RHS can expand to e. Its contribution to the unit production relation P(X ~ Yi) will then be P(X ~ Y1... Yi-lYiYi+l... Yj) IIeYk From the resulting revised Pu matrix we compute the closure Ru as usual.</Paragraph>
      <Paragraph position="69"> The second modification is another instance of spontaneous dot shifting. When completing a state X --+ )~.Y# and moving the dot to get X ~ )~Y.#, additional states have to be added, obtained by moving the dot further over any nonterminals in # that have nonzero e-expansion probability. As in prediction, forward and inner probabilities are multiplied by the corresponding e-expansion probabilities.</Paragraph>
      <Paragraph position="70"> 4.7.4 Eliminating null productions. Given these added complications one might consider simply eliminating all c-productions in a preprocessing step. This is mostly straightforward and analogous to the corresponding procedure for nonprobabilistic CFGs (Aho and Ullman 1972, Algorithm 2.10). The main difference is the updating of rule probabilities, for which the e-expansion probabilities are again needed.</Paragraph>
      <Paragraph position="71"> .</Paragraph>
      <Paragraph position="72"> .</Paragraph>
      <Paragraph position="73"> Delete all null productions, except on the start symbol (in case the grammar as a whole produces c with nonzero probability). Scale the remaining production probabilities to sum to unity.</Paragraph>
      <Paragraph position="74"> For each original rule X ~ ,~Y# that contains a nonterminal Y such that Y~E:  Create a variant rule X --* &amp;# Set the rule probability of the new rule to eyP(X --, &amp;Y#). If the rule X ~ ~# already exists, sum the probabilities.</Paragraph>
      <Paragraph position="75"> Decrement the old rule probability by the same amount.</Paragraph>
      <Paragraph position="76"> Iterate these steps for all RHS occurrences of a null-able nonterminal.</Paragraph>
      <Paragraph position="77"> The crucial step in this procedure is the addition of variants of the original productions that simulate the null productions by deleting the corresponding nonterminals from the RHS. The spontaneous dot shifting described in the previous sections effectively performs the same operation on the fly as the rules are used in prediction and completion.</Paragraph>
    </Section>
    <Section position="3" start_page="184" end_page="184" type="sub_section">
      <SectionTitle>
4.8 Complexity Issues
</SectionTitle>
      <Paragraph position="0"> The probabilistic extension of Earley's parser preserves the original control structure in most aspects, the major exception being the collapsing of cyclic predictions and unit completions, which can only make these steps more efficient. Therefore the complexity analysis from Earley (1970) applies, and we only summarize the most important results here.</Paragraph>
      <Paragraph position="1"> The worst-case complexity for Earley's parser is dominated by the completion step, which takes O(/2) for each input position, I being the length of the current prefix. The total time is therefore O(/3) for an input of length l, which is also the complexity of the standard Inside/Outside (Baker 1979) and LRI (Jelinek and Lafferty 1991) algorithms.</Paragraph>
      <Paragraph position="2"> For grammars of bounded ambiguity, the incremental per-word cost reduces to O(l), 0(/2) total. For deterministic CFGs the incremental cost is constant, 0(l) total. Because of the possible start indices each state set can contain 0(l) Earley states, giving O(/2) worst-case space complexity overall.</Paragraph>
      <Paragraph position="3"> Apart from input length, complexity is also determined by grammar size. We will not try to give a precise characterization in the case of sparse grammars (Appendix B.3 gives some hints on how to implement the algorithm efficiently for such grammars). However, for fully parameterized grammars in CNF we can verify the scaling of the algorithm in terms of the number of nonterminals n, and verify that it has the same O(n 3) time and space requirements as the Inside/Outside (I/O) and LRI algorithms.</Paragraph>
      <Paragraph position="4"> The completion step again dominates the computation, which has to compute probabilities for at most O(n 3) states. By organizing summations (11) and (12) so that 3'&amp;quot; are first summed by LHS nonterminals, the entire completion operation can be accomplished in 0(//3). The one-time cost for the matrix inversions to compute the left-corner and unit production relation matrices is also O(r/3).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="184" end_page="190" type="metho">
    <SectionTitle>
5. Extensions
</SectionTitle>
    <Paragraph position="0"> This section discusses extensions to the Earley algorithm that go beyond simple parsing and the computation of prefix and string probabilities. These extensions are all quite straightforward and well supported by the original Earley chart structure, which leads us to view them as part of a single, unified algorithm for solving the tasks mentioned in the introduction.</Paragraph>
    <Section position="1" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
5.1 Viterbi Parses
Definition 7
</SectionTitle>
      <Paragraph position="0"> A Viterbi parse for a string x, in a grammar G, is a left-most derivation that assigns maximal probability to x, among all possible derivations for x.</Paragraph>
      <Paragraph position="1"> Both the definition of Viterbi parse and its computation are straightforward generalizations of the corresponding notion for Hidden Markov Models (Rabiner and Juang 1986), where one computes the Viterbi path (state sequence) through an HMM. Precisely the same approach can be used in the Earley parser, using the fact that each derivation corresponds to a path.</Paragraph>
      <Paragraph position="2"> The standard computational technique for Viterbi parses is applicable here. Wherever the original parsing procedure sums probabilities that correspond to alternative derivations of a grammatical entity, the summation is replaced by a maximization.</Paragraph>
      <Paragraph position="3"> Thus, during the forward pass each state must keep track of the maximal path probability leading to it, as well as the predecessor states associated with that maximum probability path. Once the final state is reached, the maximum probability parse can be recovered by tracing back the path of &amp;quot;best&amp;quot; predecessor states.</Paragraph>
      <Paragraph position="4"> The following modifications to the probabilistic Earley parser implement the for- null ward phase of the Viterbi computation.</Paragraph>
      <Paragraph position="5"> * Each state computes an additional probability, its Viterbi probability v.</Paragraph>
      <Paragraph position="6"> * Viterbi probabilities are propagated in the same way as inner probabilities, except that during completion the summation is replaced by maximization: Vi(kX --* ,~Y.#) is the maximum of all products vi(jW --+ 17.)Vj(kX --+ ,~.Y#) that contribute to the completed state kX --* )~Y.#. The same-position predecessor jY -~ ~,. associated with the maximum is recorded as the Viterbi path predecessor of kX --* ,~Y.# (the other predecessor state kX --* ,~.Y# can be inferred).</Paragraph>
      <Paragraph position="7"> * The completion step uses the original recursion without collapsing unit  production loops. Loops are simply avoided, since they can only lower a path's probability. Collapsing unit production completions has to be avoided to maintain a continuous chain of predecessors for later backtracing and parse construction.</Paragraph>
      <Paragraph position="8"> * The prediction step does not need to be modified for the Viterbi computation.</Paragraph>
      <Paragraph position="9"> Once the final state is reached, a recursive procedure can recover the parse tree associated with the Viterbi parse. This procedure takes an Earley state i : kX --* &amp;.# as input and produces the Viterbi parse for the substring between k and i as output. (If the input state is not complete (# ~ C/), the result will be a partial parse tree with children missing from the root node.) Viterbi parse (i : kX --* -~.#):  1. If )~ = C/, return a parse tree with root labeled X and no children.</Paragraph>
      <Paragraph position="10">  Andreas Stolcke Efficient Probabilistic Context-Free Parsing . Otherwise, if ~ ends in a terminal a, let A~a = ~, and call this procedure recursively to obtain the parse tree</Paragraph>
      <Paragraph position="12"> Adjoin a leaf node labeled a as the right-most child to the root of T and return T.</Paragraph>
      <Paragraph position="13"> Otherwise, if A ends in a nonterminal Y, let A'Y = A. Find the Viterbi predecessor state jW ---+ t~. for the current state. Call this procedure recursively to compute</Paragraph>
      <Paragraph position="15"> as well as</Paragraph>
      <Paragraph position="17"> Adjoin T ~ to T as the right-most child at the root, and return T.</Paragraph>
    </Section>
    <Section position="2" start_page="186" end_page="188" type="sub_section">
      <SectionTitle>
5.2 Rule Probability Estimation
</SectionTitle>
      <Paragraph position="0"> The rule probabilities in a SCFG can be iteratively estimated using the EM (Expectation-Maximization) algorithm (Dempster et al. 1977). Given a sample corpus D, the estimation procedure finds a set of parameters that represent a local maximum of the grammar likelihood function P(D I G), which is given by the product of the string</Paragraph>
      <Paragraph position="2"> i.e., the samples are assumed to be distributed identically and independently.</Paragraph>
      <Paragraph position="3"> The two steps of this algorithm can be briefly characterized as follows.</Paragraph>
      <Paragraph position="4"> E-step: Compute expectations for how often each grammar rule is used, given the corpus D and the current grammar parameters (rule probabilities).</Paragraph>
      <Paragraph position="5"> M-step: Reset the parameters so as to maximize the likelihood relative to the expected rule counts found in the E-step.</Paragraph>
      <Paragraph position="6"> This procedure is iterated until the parameter values (as well as the likelihood) converge. It can be shown that each round in the algorithm produces a likelihood that is at least as high as the previous one; the EM algorithm is therefore guaranteed to find at least a local maximum of the likelihood function.</Paragraph>
      <Paragraph position="7"> EM is a generalization of the well-known Baum-Welch algorithm for HMM estimation (Baum et al. 1970); the original formulation for the case of SCFGs is attributable to Baker (1979). For SCFGs, the E-step involves computing the expected number of times each production is applied in generating the training corpus. After that, the M-step consists of a simple normalization of these counts to yield the new production probabilities.</Paragraph>
      <Paragraph position="8"> In this section we examine the computation of production count expectations required for the E-step. The crucial notion introduced by Baker (1979) for this purpose is the &amp;quot;outer probability&amp;quot; of a nonterminal, or the joint probability that the nonterminal is generated with a given prefix and suffix of terminals. Essentially the same method can be used in the Earley framework, after extending the definition of outer probabilities to apply to arbitrary Earley states.</Paragraph>
      <Paragraph position="9">  start with the initial state, generate the prefix Xo... Xk-1, pass through k x ---4 .17#, for some u, generate the suffix xi. * * x1-1 starting with state kX --* u.# , end in the final state.</Paragraph>
      <Paragraph position="10"> Outer probabilities complement inner probabilities in that they refer precisely to those parts of complete paths generating x not covered by the corresponding inner probability 7i(kX --* A.#). Therefore, the choice of the production X --* A# is not part of the outer probability associated with a state kX ~ A.#. In fact, the definition makes no reference to the first part A of the RHS: all states sharing the same k, X, and # will have identical outer probabilities.</Paragraph>
      <Paragraph position="11"> Intuitively, fli(kX --* A.#) is the probability that an Earley parser operating as a string generator yields the prefix Xo...k-1 and the suffix xi...l_l, while passing through state kX --* A.# at position i (which is independent of A). As was the case for forward probabilities, fl is actually an expectation of the number of such states in the path, as unit production cycles can result in multiple occurrences for a single state. Again, we gloss over this technicality in our terminology. The name is motivated by the fact that fl reduces to the &amp;quot;outer probability&amp;quot; of X, as defined in Baker (1979), if the dot is in final position.</Paragraph>
      <Paragraph position="12"> 5.2.1 Computing expected production counts. Before going into the details of computing outer probabilities, we describe their use in obtaining the expected rule counts needed for the E-step in grammar estimation.</Paragraph>
      <Paragraph position="13"> Let c(X --* A \] x) denote the expected number of uses of production X --* A in the derivation of string x. Alternatively, c(X --* A \] x) is the expected number of times that</Paragraph>
      <Paragraph position="15"> The last summation is over all predicted states based on production X --* A. The quantity P(S Xo...i_lXt, :~ x) is the sum of the probabilities of all paths passing through i : iX --* .A. Inner and outer probabilities have been defined such that this quantity is obtained precisely as the product of the corresponding of &amp;quot;Yi and fli. Thus,</Paragraph>
    </Section>
    <Section position="3" start_page="188" end_page="189" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> the expected usage count for a rule can be computed as</Paragraph>
      <Paragraph position="2"> The sum can be computed after completing both forward and backward passes (or during the backward pass itself) by scanning the chart for predicted states.</Paragraph>
      <Paragraph position="3"> 5.2.2 Computing outer probabilities. The outer probabilities are computed by tracing the complete paths from the final state to the start state, in a single backward pass over the Earley chart. Only completion and scanning steps need to be traced back. Reverse scanning leaves outer probabilities unchanged, so the only operation of concern is reverse completion.</Paragraph>
      <Paragraph position="4"> We describe reverse transitions using the same notation as for their forward counterparts, annotating each state with its outer and inner probabilities.</Paragraph>
      <Paragraph position="5"> Reverse completion.</Paragraph>
      <Paragraph position="6"> i: jY --~ 1,1. \[fl&amp;quot;,'y&amp;quot;\] i: kX-* AY.# \[fl,&amp;quot;/\] ~ j: kX--* A.Y# \[fl',7'\] for all pairs of states jY --+ t,. and kX --* A.Y# in the chart. Then</Paragraph>
      <Paragraph position="8"> The inner probability 7 is not used.</Paragraph>
      <Paragraph position="9"> Rationale. Relative to fl, fl' is missing the probability of expanding Y, which is filled in from ~,&amp;quot;. The probability of the surrounding of Y(fl&amp;quot;) is the probability of the surrounding of X(fl), plus the choice of the rule of production for X and the expansion of the partial LHS A, which are together given by ~,'.</Paragraph>
      <Paragraph position="10"> Note that the computation makes use of the inner probabilities computed in the forward pass. The particular way in which 3' and fl were defined turns out to be convenient here, as no reference to the production probabilities themselves needs to be made in the computation.</Paragraph>
      <Paragraph position="11"> As in the forward pass, simple reverse completion would not terminate in the presence of cyclic unit productions. A version that collapses all such chains of productions is given below.</Paragraph>
      <Paragraph position="12"> Reverse completion (transitive).</Paragraph>
      <Paragraph position="13"> i: jY--* ~,. \[fl&amp;quot;,7&amp;quot;\] i: k x --* AZ.# \[fl, 3'\] ~ j kX ~ A.Z# \[fl', 7'\] for all pairs of states jY ---+ v. and kX --* A.Z# in the chart, such that the unit production relation R(Z Y) is nonzero. Then fl' += q/'.fl fl&amp;quot; += -~'. flR(Z G Y) The first summation is carried out once for each state j : kX --* MZ#, whereas the second summation is applied for each choice of Z, but only if X --* AZ# is not itself a unit production, i.e., A# ~ E.</Paragraph>
      <Paragraph position="14">  Computational Linguistics Volume 21, Number 2 Rationale. This increments fl&amp;quot; the equivalent of R(Z ~ Y) times, accounting for the infinity of surroundings in which Y can occur if it can be derived through cyclic productions. Note that the computation of tip is unchanged, since &amp;quot;y&amp;quot; already includes an infinity of cyclically generated subtrees for Y, where appropriate.</Paragraph>
    </Section>
    <Section position="4" start_page="189" end_page="189" type="sub_section">
      <SectionTitle>
5.3 Parsing Bracketed Inputs
</SectionTitle>
      <Paragraph position="0"> The estimation procedure described above (and EM-based estimators in general) are only guaranteed to find locally optimal parameter estimates. Unfortunately, it seems that in the case of unconstrained SCFG estimation local maxima present a very real problem, and make success dependent on chance and initial conditions (Lari and Young 1990). Pereira and Schabes (1992) showed that partially bracketed input samples can alleviate the problem in certain cases. The bracketing information constrains the parse of the inputs, and therefore the parameter estimates, steering it clear from some of the suboptimal solutions that could otherwise be found.</Paragraph>
      <Paragraph position="1"> An Earley parser can be minimally modified to take advantage of bracketed strings by invoking itself recursively when a left parenthesis is encountered. The recursive instance of the parser is passed any predicted states at that position, processes the input up to the matching right parenthesis, and hands complete states back to the invoking instance. This technique is efficient, as it never explicitly rejects parses not consistent with the bracketing. It is also convenient, as it leaves the basic parser operations, including the left-to-right processing and the probabilistic computations, unchanged.</Paragraph>
      <Paragraph position="2"> For example, prefix probabilities conditioned on partial bracketings could be computed easily this way.</Paragraph>
      <Paragraph position="3"> Parsing bracketed inputs is described in more detail in Stolcke (1993), where it is also shown that bracketing gives the expected improved efficiency. For example, the modified Earley parser processes fully bracketed inputs in linear time.</Paragraph>
    </Section>
    <Section position="5" start_page="189" end_page="190" type="sub_section">
      <SectionTitle>
5.4 Robust Parsing
</SectionTitle>
      <Paragraph position="0"> In many applications ungrammatical input has to be dealt with in some way. Traditionally it has been seen as a drawback of top-down parsing algorithms such as Earley's that they sacrifice &amp;quot;robustness,&amp;quot; i.e., the ability to find partial parses in an ungrammatical input, for the efficiency gained from top-down prediction (Magerman and Weir 1992).</Paragraph>
      <Paragraph position="1"> One approach to the problem is to build robustness into the grammar itself. In the simplest case one could add top-level productions S --* XS where X can expand to any nonterminal, including an &amp;quot;unknown word&amp;quot; category. This grammar will cause the Earley parser to find all partial parses of substrings, effectively behaving like a bottom-up parser constructing the chart in left-to-right fashion. More refined variations are possible: the top-level productions could be used to model which phrasal categories (sentence fragments) can likely follow each other. This probabilistic information can then be used in a pruning version of the Earley parser (Section 6.1) to arrive at a compromise between robust and expectation-driven parsing.</Paragraph>
      <Paragraph position="2"> An alternative method for making Earley parsing more robust is to modify the parser itself so as to accept arbitrary input and find all or a chosen subset of possible substring parses. In the case of Earley's parser there is a simple extension to</Paragraph>
    </Section>
    <Section position="6" start_page="190" end_page="190" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> accomplish just that, based on the notion of a wildcard state</Paragraph>
      <Paragraph position="2"> where the wildcard ? stands for an arbitrary continuation of the RHS.</Paragraph>
      <Paragraph position="3"> During prediction, a wildcard to the left of the dot causes the chart to be seeded with dummy states --* .X for each phrasal category X of interest. Conversely, a minimal modification to the standard completion step allows the wildcard states to collect all abutting substring parses: i: jY--+ #. } j: k ~+ ,'~. ? ~ i: k --* )~ Y. ? for all Y. This way each partial parse will be represented by exactly one wildcard state in the final chart position.</Paragraph>
      <Paragraph position="4"> A detailed account of this technique is given in Stolcke (1993). One advantage over the grammar-modifying approach is that it can be tailored to use various criteria at runtime to decide which partial parses to follow.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="190" end_page="193" type="metho">
    <SectionTitle>
6. Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="190" end_page="190" type="sub_section">
      <SectionTitle>
6.1 Online Pruning
</SectionTitle>
      <Paragraph position="0"> In finite-state parsing (especially speech decoding) one often makes use of the forward probabilities for pruning partial parses before having seen the entire input. Pruning is formally straightforward in Earley parsers: in each state set, rank states according to their ~ values, then remove those states with small probabilities compared to the current best candidate, or simply those whose rank exceeds a given limit. Notice this will not only omit certain parses, but will also underestimate the forward and inner probabilities of the derivations that remain. Pruning procedures have to be evaluated empirically since they invariably sacrifice completeness and, in the case of the Viterbi algorithm, optimality of the result.</Paragraph>
      <Paragraph position="1"> While Earley-based on-line pruning awaits further study, there is reason to believe the Earley framework has inherent advantages over strategies based only on bottom-up information (including so-called &amp;quot;over-the-top&amp;quot; parsers). Context-free forward probabilities include all available probabilistic information (subject to assumptions implicit in the SCFG formalism) available from an input prefix, whereas the usual inside probabilities do not take into account the nonterminal prior probabilities that result from the top-down relation to the start state. Using top-down constraints does not necessarily mean sacrificing robustness, as discussed in Section 5.4. On the contrary, by using Earley-style parsing with a set of carefully designed and estimated &amp;quot;fault-tolerant&amp;quot; top-level productions, it should be possible to use probabilities to better advantage in robust parsing. This approach is a subject of ongoing work, in the context of tight-coupling SCFGs with speech decoders (Jurafsky, Wooters, Segal, Stolcke, Fosler, Tajchman, and Morgan 1995).</Paragraph>
    </Section>
    <Section position="2" start_page="190" end_page="192" type="sub_section">
      <SectionTitle>
6.2 Relation to Probabilistic LR Parsing
</SectionTitle>
      <Paragraph position="0"> One of the major alternative context-free parsing paradigms besides Earley's algorithm is LR parsing (Aho and Ullman 1972). A comparison of the two approaches, both in their probabilistic and nonprobabilistic aspects, is interesting and provides useful insights. The following remarks assume familiarity with both approaches. We sketch the fundamental relations, as well as the important tradeoffs between the two frameworks. 13 Like an Earley parser, LR parsing uses dotted productions, called items, to keep track of the progress of derivations as the input is processed. The start indices are not part of LR items: we may therefore use the term "item" to refer to both LR items and Earley states without start indices. An Earley parser constructs sets of possible items on the fly, by following all possible partial derivations. An LR parser, on the other hand, has access to a complete list of sets of possible items computed beforehand, and at runtime simply follows transitions between these sets. The item sets are known as the "states" of the LR parser. 14 A grammar is suitable for LR parsing if these transitions can be performed deterministically by considering only the next input and the contents of a shift-reduce stack. Generalized LR parsing is an extension that allows parallel tracking of multiple state transitions and stack actions by using a graph-structured stack (Tomita 1986).</Paragraph>
      <Paragraph position="1"> Probabilistic LR parsing (Wright 1990) is based on LR items augmented with certain conditional probabilities. Specifically, the probability p associated with an LR item X → λ.μ is, in our terminology,</Paragraph>
      <Paragraph position="2"> p = Σ_k α_i(kX → λ.μ) / P(S ⇒L x0 ... xi-1) ,</Paragraph>
      <Paragraph position="3"> where the denominator is the probability of the current prefix. LR item probabilities are thus conditioned forward probabilities, and can be used to compute conditional probabilities of next words: P(xi | x0...i-1) is the sum of the p's of all items having xi to the right of the dot (extra work is required if the item corresponds to a "reduce" state, i.e., if the dot is in final position). Notice that the definition of p is independent of i as well as the start index of the corresponding Earley state. Therefore, to ensure that item probabilities are correct independent of input position, item sets would have to be constructed so that their probabilities are unique within each set. However, this may be impossible given that the probabilities can take on infinitely many values and in general depend on the history of the parse. The solution used by Wright (1990) is to collapse items whose probabilities are within a small tolerance ε and are otherwise identical. The same threshold is used to simplify a number of other technical problems, e.g., left-corner probabilities are computed by iterated prediction, until the resulting changes in probabilities are smaller than ε. Subject to these approximations, then, a probabilistic LR parser can compute prefix probabilities by multiplying successive conditional probabilities for the words it sees. 16 As an alternative to the computation of LR transition probabilities from a given SCFG, one might instead estimate such probabilities directly from traces of parses</Paragraph>
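      <Paragraph> Stated in Earley terms rather than in terms of LR items, the next-word computation mentioned above amounts to summing forward probabilities of states that predict the word and normalizing by the prefix probability. The Python sketch below is an illustration under that reading; the state fields, the prefix probability value, and the numbers are invented for the example, and the extra handling needed for completed ("reduce") states is omitted.
          def next_word_probability(current_states, prefix_prob, word):
              """Approximate P(word | x0...i-1): sum the forward probabilities (alpha) of
              states with `word` immediately right of the dot, normalized by the prefix
              probability of the input seen so far."""
              numerator = sum(s["alpha"] for s in current_states if s["after_dot"] == word)
              return numerator / prefix_prob

          states_i = [
              {"item": "VP -> v . NP",    "after_dot": "NP",  "alpha": 4e-4},
              {"item": "NP -> . det n",   "after_dot": "det", "alpha": 4e-4},
              {"item": "NP -> . 'the' n", "after_dot": "the", "alpha": 2e-4},
          ]
          print(next_word_probability(states_i, prefix_prob=6e-4, word="the"))  # 1/3
      </Paragraph>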
      <Paragraph position="4"> Notice that the definition of p is independent of i as well as the start index of the corresponding Earley state. Therefore, to ensure that item probabilities are correct independent of input position, item sets would have to be constructed so that their probabilities are unique within each set. However, this may be impossible given that the probabilities can take on infinitely many values and in general depend on the history of the parse. The solution used by Wright (1990) is to collapse items whose probabilities are within a small tolerance ~ and are otherwise identical. The same threshold is used to simplify a number of other technical problems, e.g., left-corner probabilities are computed by iterated prediction, until the resulting changes in probabilities are smaller than e. Subject to these approximations, then, a probabilistic LR parser can compute prefix probabilities by multiplying successive conditional probabilities for the words it sees. 16 As an alternative to the computation of LR transition probabilities from a given SCFG, one might instead estimate such probabilities directly from traces of parses 13 Like Earley parsers, LR parsers can be built using various amounts of lookahead to make the operation of the parser (more) deterministic, and hence more efficient. Only the case of zero-lookahead, LR(0), is considered here; the correspondence between LR(k) parsers and k-lookahead Earley parsers is discussed in the literature (Earley 1970; Aho and Ullman 1972).</Paragraph>
      <Paragraph position="5"> 14 Once again, it is helpful to compare this to a closely related finite-state concept: the states of the LR parser correspond to sets of Earley states, similar to the way the states of a deterministic FSA correspond to sets of states of an equivalent nondeterministic FSA under the standard subset construction.</Paragraph>
    </Section>
    <Section position="3" start_page="192" end_page="192" type="sub_section">
      <SectionTitle>
Andreas Stolcke Efficient Probabilistic Context-Free Parsing
</SectionTitle>
      <Paragraph position="0"> on a training corpus. Because of the imprecise relationship between LR probabilities and SCFG probabilities, it is not clear if the model thus estimated corresponds to any particular SCFG in the usual sense.</Paragraph>
      <Paragraph position="1"> Briscoe and Carroll (1993) turn this incongruity into an advantage by using the LR parser as a probabilistic model in its own right, and show how LR probabilities can be extended to capture non-context-free contingencies. The problem of capturing more complex distributional constraints in natural language is clearly important, but well beyond the scope of this article. We simply remark that it should be possible to define "interesting" nonstandard probabilities in terms of Earley parser actions so as to better model non-context-free phenomena.</Paragraph>
      <Paragraph position="2"> Apart from such considerations, the choice between LR methods and Earley parsing is a typical space-time tradeoff. Even though an Earley parser runs with the same linear time and space complexity as an LR parser on grammars of the appropriate LR class, the constant factors involved will be much in favor of the LR parser, as almost all the work has already been compiled into its transition and action table. However, the size of LR parser tables can be exponential in the size of the grammar (because of the number of potential item subsets). Furthermore, if the generalized LR method is used for dealing with nondeterministic grammars (Tomita 1986) the runtime on arbitrary inputs may also grow exponentially. The bottom line is that each application's needs have to be evaluated against the pros and cons of both approaches to find the best solution. From a theoretical point of view, the Earley approach has the inherent appeal of being the more general (and exact) solution to the computation of the various SCFG probabilities.</Paragraph>
    </Section>
    <Section position="4" start_page="192" end_page="193" type="sub_section">
      <SectionTitle>
6.3 Other Related Work
</SectionTitle>
      <Paragraph position="0"> The literature on Earley-based probabilistic parsers is sparse, presumably because of the precedent set by the Inside/Outside algorithm, which is more naturally formulated as a bottom-up algorithm.</Paragraph>
      <Paragraph position="1"> Both Nakagawa (1987) and Päseler (1988) use a nonprobabilistic Earley parser augmented with "word match" scoring. Though not truly probabilistic, these algorithms are similar to the Viterbi version described here, in that they find a parse that optimizes the accumulated matching scores (without regard to rule probabilities). Prediction and completion loops do not come into play since no precise inner or forward probabilities are computed.</Paragraph>
      <Paragraph position="2"> Magerman and Marcus (1991) are interested primarily in scoring functions to guide a parser efficiently to the most promising parses. Earley-style top-down prediction is used only to suggest worthwhile parses, not to compute precise probabilities, which they argue would be an inappropriate metric for natural language parsing.</Paragraph>
      <Paragraph position="3"> Casacuberta and Vidal (1988) exhibit an Earley parser that processes weighted (not necessarily probabilistic) CFGs and performs a computation that is isomorphic to that of inside probabilities shown here. Schabes (1991) adds both inner and outer probabilities to Earley's algorithm, with the purpose of obtaining a generalized estimation algorithm for SCFGs. Both of these approaches are restricted to grammars without unbounded ambiguities, which can arise from unit or null productions.</Paragraph>
      <Paragraph position="4"> Dan Jurafsky (personal communication) wrote an Earley parser for the Berkeley Restaurant Project (BeRP) speech understanding system that originally computed forward probabilities for restricted grammars (without left-corner or unit production recursion). The parser now uses the method described here to provide exact SCFG prefix and next-word probabilities to a tightly coupled speech decoder (Jurafsky, Wooters, Segal, Stolcke, Fosler, Tajchman, and Morgan 1995).</Paragraph>
      <Paragraph position="5"> An essential idea in the probabilistic formulation of Earley's algorithm is the collapsing of recursive predictions and unit completion chains, replacing both with lookups in precomputed matrices. This idea arises in our formulation out of the need to compute probability sums given as infinite series. Graham, Harrison, and Ruzzo (1980) use a nonprobabilistic version of the same technique to create a highly optimized Earley-like parser for general CFGs that implements prediction and completion by operations on Boolean matrices. 17 The matrix inversion method for dealing with left-recursive prediction is borrowed from the LRI algorithm of Jelinek and Lafferty (1991) for computing prefix probabilities for SCFGs in CNF. 18 We then use that idea a second time to deal with the similar recursion arising from unit productions in the completion step. We suspect, but have not proved, that the Earley computation of forward probabilities when applied to a CNF grammar performs a computation that is isomorphic to that of the LRI algorithm.</Paragraph>
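      <Paragraph> Concretely, the collapsing described here amounts to replacing the infinite sums over left-corner and unit-production chains by the matrix inverses R_L = (I - P_L)^-1 and R_U = (I - P_U)^-1. The numpy sketch below shows this precomputation on a toy grammar; both the grammar and the code are illustrative assumptions, not material from the article.
          import numpy as np

          nonterminals = ["S", "NP", "VP"]
          idx = {nt: i for i, nt in enumerate(nonterminals)}

          # rules: (lhs, rhs, probability); rhs symbols not listed as nonterminals
          # are treated as terminals here.
          rules = [
              ("S",  ["NP", "VP"],  1.0),
              ("NP", ["NP", "pp"],  0.3),  # left-recursive
              ("NP", ["det", "n"],  0.6),
              ("NP", ["VP"],        0.1),  # unit production
              ("VP", ["v", "NP"],   1.0),
          ]

          n = len(nonterminals)
          P_L = np.zeros((n, n))  # P_L[X, Y] = probability mass of rules X -> Y ...
          P_U = np.zeros((n, n))  # P_U[X, Y] = probability mass of unit rules X -> Y
          for lhs, rhs, p in rules:
              if rhs[0] in idx:
                  P_L[idx[lhs], idx[rhs[0]]] += p
                  if len(rhs) == 1:
                      P_U[idx[lhs], idx[rhs[0]]] += p

          I = np.eye(n)
          R_L = np.linalg.inv(I - P_L)  # sums the series I + P_L + P_L^2 + ...
          R_U = np.linalg.inv(I - P_U)
          print(R_L[idx["NP"], idx["NP"]])  # greater than 1 because NP is left-recursive
      </Paragraph>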
      <Paragraph position="6"> In any case, we believe that the parser-oriented view afforded by the Earley framework makes for a very intuitive solution to the prefix probability problem, with the added advantage that it is not restricted to CNF grammars.</Paragraph>
      <Paragraph position="7"> Algorithms for probabilistic CFGs can be broadly characterized along several dimensions. One such dimension is whether the quantities entered into the parser chart are defined in a bottom-up (CYK) fashion, or whether left-to-right constraints are an inherent part of their definition. 19 The probabilistic Earley parser shares the inherent left-to-right character of the LRI algorithm, and contrasts with the bottom-up I/O algorithm.</Paragraph>
      <Paragraph position="8"> Probabilistic parsing algorithms may also be classified as to whether they are formulated for fully parameterized CNF grammars or arbitrary context-free rules (typically taking advantage of grammar sparseness). In this respect the Earley approach contrasts with both the CNF-oriented I/O and LRI algorithms. Another approach to avoiding the CNF constraint is a formulation based on probabilistic Recursive Transition Networks (RTNs) (Kupiec 1992). The similarity goes further, as both Kupiec's approach and ours are based on state transitions, and dotted productions (Earley states) turn out to be equivalent to RTN states if the RTN is constructed from a CFG.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="193" end_page="196" type="metho">
    <SectionTitle>
7. Conclusions
</SectionTitle>
    <Paragraph position="0"> We have presented an Earley-based parser for stochastic context-free grammars that is appealing for its combination of advantages over existing methods. Earley's control structure lets the algorithm run with best-known complexity on a number of grammar subclasses, and no worse than standard bottom-up probabilistic chart parsers on general SCFGs and fully parameterized CNF grammars.</Paragraph>
    <Paragraph position="1"> Unlike bottom-up parsers it also computes accurate prefix probabilities incrementally while scanning its input, along with the usual substring (inside) probabilities. The chart constructed during parsing supports both Viterbi parse extraction and Baum-Welch type rule probability estimation by way of a backward pass over the parser chart. If the input comes with (partial) bracketing to indicate phrase structure, this information can be easily incorporated to restrict the allowable parses. A simple extension of the Earley chart allows finding partial parses of ungrammatical input.</Paragraph>
    <Paragraph position="2"> 17 This connection to the GHR algorithm was pointed out by Fernando Pereira. Exploration of this link then led to the extension of our algorithm to handle ε-productions, as described in Section 4.7. 18 Their method uses the transitive (but not reflexive) closure over the left-corner relation PL, for which they chose the symbol QL. We chose the symbol RL in this article to point to this difference.</Paragraph>
    <Paragraph position="3"> 19 Of course a CYK-style parser can operate left-to-right, right-to-left, or otherwise by reordering the computation of chart entries.</Paragraph>
    <Paragraph position="4"> The computation of probabilities is conceptually simple, and follows directly Earley's parsing framework, while drawing heavily on the analogy to finite-state language models. It does not require rewriting the grammar into normal form. Thus, the present algorithm fills a gap in the existing array of algorithms for SCFGs, efficiently combining the functionalities and advantages of several previous approaches.</Paragraph>
    <Paragraph position="5"> Appendix A: Existence of RL and Ru In Section 4.5 we defined the probabilistic left-corner and unit-production matrices RL and Ru, respectively, to collapse recursions in the prediction and completion steps. It was shown how these matrices could be obtained as the result of matrix inversions.</Paragraph>
    <Paragraph position="6"> In this appendix we give a proof that the existence of these inverses is assured if the grammar is well-defined in the following three senses. The terminology used here is taken from Booth and Thompson (1973).</Paragraph>
    <Paragraph position="7"> a) G is proper iff, for all nonterminals X, the probabilities of all productions with X on the left-hand side sum to unity, i.e., Σ_λ P(X → λ) = 1. b) G is consistent iff it defines a probability distribution over finite strings, i.e., Σ_x P(S ⇒ x) = 1, where P(S ⇒ x) is induced by the rule probabilities according to Definition 1(a).</Paragraph>
    <Paragraph position="8"> c) G has no useless nonterminals iff all nonterminals X appear in at least one derivation of some string x ∈ Σ* with nonzero probability, i.e.,</Paragraph>
    <Paragraph position="9"> P(S ⇒ λXμ ⇒ x) > 0 for some strings λ, μ.</Paragraph>
    <Paragraph position="10"> It is useful to translate consistency into "process" terms. We can view an SCFG as a stochastic string-rewriting process, in which each step consists of simultaneously replacing all nonterminals in a sentential form with the right-hand sides of productions, randomly drawn according to the rule probabilities. Booth and Thompson (1973) show that the grammar is consistent if and only if the probability that stochastic rewriting of the start symbol S leaves nonterminals remaining after n steps goes to 0 as n → ∞.</Paragraph>
    <Paragraph position="11"> More loosely speaking, rewriting S has to terminate after a finite number of steps with probability 1, or else the grammar is inconsistent.</Paragraph>
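    <Paragraph> Following the branching-process view of Booth and Thompson referenced above, consistency can be checked numerically from the matrix of expected nonterminal counts per rewriting step. The Python sketch below is an illustration under the assumptions stated in its comments; it is not code from the article, and the boundary case of spectral radius exactly 1 needs a more careful analysis than shown here.
        import numpy as np

        def expectation_matrix(nonterminals, rules):
            """M[X, Y] = expected number of Y's produced when X is rewritten once."""
            idx = {nt: i for i, nt in enumerate(nonterminals)}
            M = np.zeros((len(nonterminals), len(nonterminals)))
            for lhs, rhs, p in rules:
                for sym in rhs:
                    if sym in idx:
                        M[idx[lhs], idx[sym]] += p
            return M

        def seems_consistent(nonterminals, rules):
            """For a proper grammar without useless nonterminals: spectral radius < 1
            implies consistency, > 1 implies inconsistency (exactly 1 is the boundary case)."""
            M = expectation_matrix(nonterminals, rules)
            return bool(max(abs(np.linalg.eigvals(M))) < 1)

        # Toy grammar of the form S -> a [p], S -> S S [q]:
        print(seems_consistent(["S"], [("S", ["a"], 0.6), ("S", ["S", "S"], 0.4)]))  # True
        print(seems_consistent(["S"], [("S", ["a"], 0.4), ("S", ["S", "S"], 0.6)]))  # False
    </Paragraph>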
    <Paragraph position="12"> 20 Unfortunately, the terminology used in the literature is not uniform. For example, Jelinek and Lafferty (1991) use the term &amp;quot;proper&amp;quot; to mean (c), and &amp;quot;well-defined&amp;quot; for (b). They also state mistakenly that (a) and (c) together are a sufficient condition for (b). Booth and Thompson (1973) show that one can write a SCFG that satisfies (a) and (c) but generates derivations that do not terminate with probability 1, and give necessary and sufficient conditions for (b).</Paragraph>
    <Paragraph position="13"> We observe that the same property holds not only for S, but for all nonterminals, if the grammar has no useless nonterminals. If any nonterminal X admitted infinite derivations with nonzero probability, then S itself would have such derivations, since by assumption X is reachable from S with nonzero probability.</Paragraph>
    <Paragraph position="14"> To prove the existence of RL and Ru, it is sufficient to show that the corresponding geometric series converge:</Paragraph>
    <Paragraph position="15"> RL = Σ_{n≥0} PL^n ,   Ru = Σ_{n≥0} Pu^n .</Paragraph>
    <Paragraph position="16"> Lemma 5 If G is a proper, consistent SCFG without useless nonterminals, then the powers PL^n of the left-corner relation, and Pu^n of the unit production relation, converge to zero as n → ∞.</Paragraph>
    <Paragraph position="17"> Proof Entry (X, Y) in the left-corner matrix PL is the probability of generating Y as the immediately succeeding left-corner below X. Similarly, entry (X, Y) in the nth power PL^n is the probability of generating Y as the left-corner of X with n - 1 intermediate nonterminals. Certainly PL^n(X, Y) is bounded above by the probability that the entire derivation starting at X takes at least n steps to terminate, since a derivation cannot terminate without expanding the left-most symbol to a terminal (as opposed to a nonterminal). But that probability tends to 0 as n → ∞, and hence so must each entry in PL^n. For the unit production matrix Pu a similar argument applies, since the length of a derivation is at least as long as it takes to terminate any initial unit production chain. Lemma 6 If G is a proper, consistent SCFG without useless nonterminals, then the series for RL and Ru as defined above converge to finite, non-negative values.</Paragraph>
    <Paragraph position="18"> Proof PL^n converging to 0 implies that the magnitude of PL's largest eigenvalue (its spectral radius) is < 1, which in turn implies that the series Σ_{n≥0} PL^n converges (similarly for Pu). The elements of RL and Ru are non-negative since they are the result of adding and multiplying among the non-negative elements of PL and Pu, respectively. Interestingly, a SCFG may be inconsistent and still have converging left-corner and/or unit production matrices, i.e., consistency is a stronger constraint. For example, the grammar S → a [p], S → S S [q] is inconsistent for any choice of q > 1/2, but the left-corner relation (a single number in this case) is well defined for all q < 1, namely (1 - q)^-1 = p^-1. In this case the left fringe of the derivation is guaranteed to result in a terminal after finitely many steps, but the derivation as a whole may never terminate.</Paragraph>
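    <Paragraph> A quick numeric check of this example (illustrative, not from the article): the left-corner relation of the grammar S → a [p], S → S S [q] is the single number q, so its geometric series is 1/(1-q) = 1/p, which converges for any q < 1 even when q > 1/2 makes the grammar inconsistent.
        p, q = 0.4, 0.6  # q > 1/2: the grammar is inconsistent
        series = sum(q ** n for n in range(200))  # truncated geometric series sum of q^n
        print(series, 1 / (1 - q), 1 / p)  # all approximately 2.5
    </Paragraph>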
    <Paragraph position="19"> Because of the collapse of transitive predictions, this step can be implemented in a very efficient and straightforward manner. As explained in Section 4.5, one has to perform a single pass over the current state set, identifying all nonterminals Z occurring to the right of dots, and add states corresponding to all productions Y → ν that are reachable through the left-corner relation Z ⇒L Y. As indicated in equation (13), contributions to the forward probabilities of new states have to be summed when several paths lead to the same state. However, the summation in equation (13) can be optimized if the α values of all old states with the same nonterminal Z are summed first, and then multiplied by R(Z ⇒L Y). These quantities are then summed over all nonterminals Z, and the result is multiplied once by the rule probability P(Y → ν) to give the forward probability for the predicted state.</Paragraph>
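    <Paragraph> The following Python sketch illustrates the optimization just described; the data structures (dictionaries for states, a dictionary R_L for the left-corner matrix) are assumptions made for the example, not the article's implementation.
        from collections import defaultdict

        def predict(current_states, nonterminals, R_L, rules_by_lhs):
            """Optimized prediction: sum alpha per nonterminal Z right of a dot,
            multiply by R(Z =>_L Y), sum over Z, then multiply once by P(Y -> nu)."""
            alpha_by_Z = defaultdict(float)
            for s in current_states:
                if s["after_dot"] in nonterminals:
                    alpha_by_Z[s["after_dot"]] += s["alpha"]

            predicted = []
            for Y, expansions in rules_by_lhs.items():
                mass = sum(a * R_L.get((Z, Y), 0.0) for Z, a in alpha_by_Z.items())
                if mass > 0.0:
                    for rhs, p in expansions:
                        # forward probability of the predicted state Y -> . rhs
                        predicted.append({"item": (Y, tuple(rhs), 0), "alpha": mass * p})
            return predicted
    </Paragraph>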
  </Section>
</Paper>