<?xml version="1.0" standalone="yes"?> <Paper uid="J95-2002"> <Title>An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities</Title> <Section position="4" start_page="166" end_page="169" type="intro"> <SectionTitle> 3. Earley Parsing </SectionTitle> <Paragraph position="0"> An Earley parser is essentially a generator that builds left-most derivations of strings, using a given set of context-free productions. The parsing functionality arises because the generator keeps track of all possible derivations that are consistent with the input string up to a certain point. As more and more of the input is revealed, the set of possible derivations (each of which corresponds to a parse) can either expand as new choices are introduced, or shrink as a result of resolved ambiguities. In describing the parser it is thus appropriate and convenient to use generation terminology.</Paragraph> <Paragraph position="1"> The parser keeps a set of states for each position in the input, describing all pending derivations. These state sets together form the Earley chart. Footnote 2: Earley states are also known as items in LR parsing; see Aho and Ullman (1972, Section 5.2) and Section 6.2.</Paragraph> <Paragraph position="2"> Computational Linguistics Volume 21, Number 2. A state is of the form $i:\ {}_kX \rightarrow \lambda.\mu$, where X is a nonterminal of the grammar, $\lambda$ and $\mu$ are strings of nonterminals and/or terminals, and i and k are indices into the input string. States are derived from productions in the grammar. The above state is derived from a corresponding production $X \rightarrow \lambda\mu$ with the following semantics: * The current position in the input is i, i.e., $x_0 \ldots x_{i-1}$ have been processed so far. Footnote 3: The states describing the parser state at position i are collectively called state set i.
Note that there is one more state set than input symbols: set 0 describes the parser state before any input is processed, while set $|x|$ contains the states after all input symbols have been processed.</Paragraph> <Paragraph position="3"> * Nonterminal X was expanded starting at position k in the input, i.e., X generates some substring starting at position k.</Paragraph> <Paragraph position="4"> * The expansion of X proceeded using the production $X \rightarrow \lambda\mu$, and has expanded the right-hand side (RHS) $\lambda\mu$ up to the position indicated by the dot. The dot thus refers to the current position i.</Paragraph> <Paragraph position="5"> A state with the dot to the right of the entire RHS is called a complete state, since it indicates that the left-hand side (LHS) nonterminal has been fully expanded. Our description of Earley parsing omits an optional feature of Earley states, the lookahead string. Earley's algorithm allows for an adjustable amount of lookahead during parsing, in order to process LR(k) grammars deterministically (and obtain the same computational complexity as specialized LR(k) parsers where possible). The addition of lookahead is orthogonal to our extension to probabilistic grammars, so we will not include it here.</Paragraph> <Paragraph position="6"> The parser is defined in terms of three operations that consult the current set of states and the current input symbol, and add new states to the chart. This is strongly suggestive of state transitions in finite-state models of language, parsing, etc. This analogy will be explored further in the probabilistic formulation later on. The three types of transitions operate as follows.</Paragraph> <Paragraph position="7"> Prediction. For each state $i:\ {}_kX \rightarrow \lambda.Y\mu$, where Y is a nonterminal anywhere in the RHS, and for all rules $Y \rightarrow \nu$ expanding Y, add states $i:\ {}_iY \rightarrow .\nu$.</Paragraph> <Paragraph position="9"> A state produced by prediction is called a predicted state.
Each prediction corresponds to a potential expansion of a nonterminal in a left-most derivation.</Paragraph> <Section position="1" start_page="168" end_page="169" type="sub_section"> <SectionTitle> Andreas Stolcke Efficient Probabilistic Context-Free Parsing </SectionTitle> <Paragraph position="0"> Scanning. For each state $i:\ {}_kX \rightarrow \lambda.a\mu$, where a is a terminal symbol that matches the current input $x_i$, add the state $i+1:\ {}_kX \rightarrow \lambda a.\mu$ (move the dot over the current symbol). A state produced by scanning is called a scanned state. Scanning ensures that the terminals produced in a derivation match the input string.</Paragraph> <Paragraph position="1"> Completion. For each complete state $i:\ {}_jY \rightarrow \nu.$</Paragraph> <Paragraph position="2"> and each state in set j, j &lt; i, that has Y to the right of the dot, $j:\ {}_kX \rightarrow \lambda.Y\mu$,</Paragraph> <Paragraph position="4"> add the state $i:\ {}_kX \rightarrow \lambda Y.\mu$ (move the dot over the current nonterminal). A state produced by completion is called a completed state (see footnote 4). Each completion corresponds to the end of a nonterminal expansion started by a matching prediction step.</Paragraph> <Paragraph position="5"> For each input symbol and corresponding state set, an Earley parser performs all three operations exhaustively, i.e., until no new states are generated. One crucial insight into the working of the algorithm is that, although both prediction and completion feed themselves, there are only a finite number of states that can possibly be produced. Therefore recursive prediction and completion at each position have to terminate eventually, and the parser can proceed to the next input via scanning. To complete the description we need only specify the initial and final states. The parser starts out with $0:\ {}_0 \rightarrow .S$,</Paragraph> <Paragraph position="7"> where S is the sentence nonterminal (note the empty left-hand side).
After processing the last symbol, the parser verifies that $l:\ {}_0 \rightarrow S.$</Paragraph> <Paragraph position="8"> has been produced (among possibly others), where l is the length of the input x. If at any intermediate stage a state set remains empty (because no states from the previous stage permit scanning), the parse can be aborted because an impossible prefix has been detected.</Paragraph> <Paragraph position="9"> States with empty LHS such as those above are useful in other contexts, as will be shown in Section 5.4. We will refer to them collectively as dummy states. Dummy states enter the chart only as a result of initialization, as opposed to being derived from grammar productions.</Paragraph> <Paragraph position="10"> Footnote 4: Note the difference between &quot;complete&quot; and &quot;completed&quot; states: complete states (those with the dot to the right of the entire RHS) are the result of a completion or scanning step, but completion also produces states that are not yet complete.</Paragraph> <Paragraph position="11"> Table 1 (trace of Earley parser operation; surviving entries, listed by state set). State set 1: scanned ${}_0\mathrm{Det} \rightarrow \mathrm{a}.$; completed ${}_0\mathrm{NP} \rightarrow \mathrm{Det}.\mathrm{N}$. State set 2: scanned ${}_1\mathrm{N} \rightarrow \mathrm{circle}.$; completed ${}_0\mathrm{NP} \rightarrow \mathrm{Det}\ \mathrm{N}.$ and ${}_0\mathrm{S} \rightarrow \mathrm{NP}.\mathrm{VP}$. State set 3: scanned ${}_2\mathrm{VT} \rightarrow \mathrm{touches}.$; completed ${}_2\mathrm{VP} \rightarrow \mathrm{VT}.\mathrm{NP}$. State set 4: scanned ${}_3\mathrm{Det} \rightarrow \mathrm{a}.$; completed ${}_3\mathrm{NP} \rightarrow \mathrm{Det}.\mathrm{N}$. State set 5: scanned ${}_4\mathrm{N} \rightarrow \mathrm{triangle}.$; completed ${}_3\mathrm{NP} \rightarrow \mathrm{Det}\ \mathrm{N}.$ and ${}_2\mathrm{VP} \rightarrow \mathrm{VT}\ \mathrm{NP}.$</Paragraph> <Paragraph position="12"> The grammar, the predicted states, and any remaining entries of Table 1 did not survive extraction.</Paragraph> <Paragraph position="13"> It is easy to see that Earley parser operations are correct, in the sense that each chain of transitions (predictions, scanning steps, completions) corresponds to a possible (partial) derivation. Intuitively, it is also true that a parser that performs these transitions exhaustively is complete, i.e., it finds all possible derivations. Formal proofs of these properties are given in the literature, e.g., Aho and Ullman (1972).
The relationship between Earley transitions and derivations will be stated more formally in the next section.</Paragraph> <Paragraph position="14"> The parse trees for sentences can be reconstructed from the chart contents. We will illustrate this in Section 5 when discussing Viterbi parses.</Paragraph> <Paragraph position="15"> Table 1 shows a simple grammar and a trace of Earley parser operation on a sample sentence.</Paragraph> <Paragraph position="16"> Earley's parser can deal with any type of context-free rule format, even with null or $\epsilon$-productions, i.e., those that replace a nonterminal with the empty string. Such productions do, however, require special attention, and make the algorithm and its description more complicated than otherwise necessary. In the following sections we assume that no null productions have to be dealt with, and then summarize the necessary changes in Section 4.7. One might choose to simply preprocess the grammar to eliminate null productions, a process which is also described.</Paragraph> </Section> </Section> </Paper>
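<!-- The prediction, scanning, and completion steps described in this section can be sketched as a small non-probabilistic recognizer. This is only an illustrative sketch, not the paper's implementation: the names State, earley_recognize, and GRAMMAR are hypothetical, the grammar is an assumed fragment assembled from the rules visible in the Table 1 trace, and null productions are assumed absent, as in the text.

```python
from typing import NamedTuple


class State(NamedTuple):
    start: int   # k: position where the LHS nonterminal was expanded
    lhs: str     # X; the empty string marks the dummy state with empty LHS
    rhs: tuple   # right-hand side symbols
    dot: int     # number of RHS symbols already expanded

    def next_symbol(self):
        """Symbol right of the dot, or None for a complete state."""
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None


def earley_recognize(grammar, start, words):
    """Return True if `words` can be derived from `start`.

    `grammar` maps each nonterminal to a list of RHS tuples; any symbol that
    is not a key of `grammar` is treated as a terminal. Assumes that the
    grammar has no null productions.
    """
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add(State(0, "", (start,), 0))        # initial state 0: _0 -> .S
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:        # predict and complete until no new states appear
            st = agenda.pop()
            sym = st.next_symbol()
            if sym is None:                        # completion
                new = [State(p.start, p.lhs, p.rhs, p.dot + 1)
                       for p in chart[st.start] if p.next_symbol() == st.lhs]
            elif sym in grammar:                   # prediction
                new = [State(i, sym, rhs, 0) for rhs in grammar[sym]]
            else:                                  # terminal: left to scanning
                new = []
            for n in new:
                if n not in chart[i]:
                    chart[i].add(n)
                    agenda.append(n)
        if i < len(words):                         # scanning into set i + 1
            for st in chart[i]:
                if st.next_symbol() == words[i]:
                    chart[i + 1].add(State(st.start, st.lhs, st.rhs, st.dot + 1))
            if not chart[i + 1]:
                return False                       # impossible prefix detected
    return State(0, "", (start,), 1) in chart[len(words)]  # final state l: _0 -> S.


# Assumed grammar fragment, built from the rules that appear in the trace.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("VT", "NP")],
    "Det": [("a",)],
    "N": [("circle",), ("triangle",)],
    "VT": [("touches",)],
}

if __name__ == "__main__":
    print(earley_recognize(GRAMMAR, "S", "a circle touches a triangle".split()))
```

Each chart entry is a set, so a state is added at most once per position; this mirrors the finiteness argument for why prediction and completion terminate. -->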