<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1059">
  <Title>Efficient Parsing for Bilexical Context-Free Grammars and Head Automaton Grammars*</Title>
  <Section position="3" start_page="0" end_page="457" type="metho">
    <SectionTitle>
2 Notation for context-free grammars
</SectionTitle>
    <Paragraph position="0"> grammars The reader is assumed to be familiar with context-free grammars. Our notation fol1Other relevant parsers simultaneously consider two or more words that are not necessarily in a dependency relationship (Lafferty et al., 1992; Magerman, 1995; Collins and Brooks, 1995; Chelba and Jelinek, 1998).  lows (Harrison, 1978; Hopcroft and Ullman, 1979). A context-free grammar (CFG) is a tuple G = (VN, VT, P, S), where VN and VT are finite, disjoint sets of nonterminal and terminal symbols, respectively, and S E VN is the start symbol. Set P is a finite set of productions having the form A --+ a, where A E VN, a E (VN U VT)*. If every production in P has the form A -+ BC or A --+ a, for A,B,C E VN,a E VT, then the grammar is said to be in Chomsky Normal Form (CNF). 2 Every language that can be generated by a CFG can also be generated by a CFG in CNF.</Paragraph>
    <Paragraph position="1"> In this paper we adopt the following conventions: a, b, c, d denote symbols in VT, w, x, y denote strings in V~, and a, ~,... denote strings in (VN t_J VT)*. The input to the parser will be a CFG G together with a string of terminal symbols to be parsed, w = did2.., dn. Also h,i,j,k denote positive integers, which are assumed to be ~ n when we are treating them as indices into w. We write wi,j for the input substring di'.&amp;quot; dj (and put wi,j = e for i &gt; j).</Paragraph>
    <Paragraph position="2"> A &amp;quot;derives&amp;quot; relation, written =~, is associated with a CFG as usual. We also use the reflexive and transitive closure of o, written ~*, and define L(G) accordingly. We write a fl 5 =~* a75 for a derivation in which only fl is rewritten.</Paragraph>
  </Section>
  <Section position="4" start_page="457" end_page="458" type="metho">
    <SectionTitle>
3 Bilexical context-free grammars
</SectionTitle>
    <Paragraph position="0"> We introduce next a grammar formalism that captures lexical dependencies among pairs of words in VT. This formalism closely resembles stochastic grammatical formalisms that are used in several existing natural language processing systems (see SS1). We will specify a non-stochastic version, noting that probabilities or other weights may be attached to the rewrite rules exactly as in stochastic CFG (Gonzales and Thomason, 1978; Wetherell, 1980). (See SS4 for brief discussion.) Suppose G = (VN, VT, P,T\[$\]) is a CFG in CNF. 3 We say that G is bilexical iff there exists a set of &amp;quot;delexicalized nonterminals&amp;quot; VD such that VN = {A\[a\] : A E VD,a E VT} and every production in P has one of the following forms: 2Production S --~ e is also allowed in a CNF grammar if S never appears on the right side of any production. However, S --+ e is not allowed in our bilexical CFGs. ,awe have a more general definition that drops the restriction to CNF, but do not give it here.</Paragraph>
    <Paragraph position="2"> Thus every nonterminal is lexicalized at some terminal a. A constituent of nonterminal type A\[a\] is said to have terminal symbol a as its lexical head, &amp;quot;inherited&amp;quot; from the constituent's head child in the parse tree (e.g., C\[a\]).</Paragraph>
    <Paragraph position="3"> Notice that the start symbol is necessarily a lexicalized nonterminal, T\[$\]. Hence $ appears in every string of L(G); it is usually convenient to define G so that the language of interest is actually L'(G) = {x: x$ E L(G)}.</Paragraph>
    <Paragraph position="4"> Such a grammar can encode lexically specific preferences. For example, P might contain the  in order to allow the derivation VP\[solve\] ~* solve two puzzles, but meanwhile omit the sim-</Paragraph>
    <Paragraph position="6"> since puzzles are not edible, a goat is not solvable, &amp;quot;sleep&amp;quot; is intransitive, and &amp;quot;goat&amp;quot; cannot take plural determiners. (A stochastic version of the grammar could implement &amp;quot;soft preferences&amp;quot; by allowing the rules in the second group but assigning them various low probabilities.) The cost of this expressiveness is a very large grammar. Standard context-free parsing algorithms are inefficient in such a case. The CKY algorithm (Younger, 1967; Aho and Ullman, 1972) is time O(n 3. IPI), where in the worst case IPI = \[VNI 3 (one ignores unary productions).</Paragraph>
    <Paragraph position="7"> For a bilexical grammar, the worst case is IPI = I VD 13. I VT 12, which is large for a large vocabulary VT. We may improve the analysis somewhat by observing that when parsing dl ... dn, the CKY algorithm only considers nonterminals of the form A\[di\]; by restricting to the relevant productions we obtain O(n 3. IVDI 3. min(n, IVTI)2).</Paragraph>
    <Paragraph position="8">  We observe that in practical applications we always have n &lt;&lt; IVTI. Let us then restrict our analysis to the (infinite) set of input instances of the parsing problem that satisfy relation n &lt; IVTI. With this assumption, the asymptotic time complexity of the CKY algorithm becomes O(n 5. IVDt3). In other words, it is a factor of n 2 slower than a comparable non-lexicalized CFG.</Paragraph>
  </Section>
  <Section position="5" start_page="458" end_page="458" type="metho">
    <SectionTitle>
4 Bilexical CFG in time O(n⁴)
</SectionTitle>
    <Paragraph position="0"> In this section we give a recognition algorithm for bilexical CNF context-free grammars, which runs in time O(n 4. max(p, IVDI2)) = O(n 4.</Paragraph>
    <Paragraph position="1"> IVDI3). Here p is the maximum number of productions sharing the same pair of terminal symbols (e.g., the pair (b, a) in production (1)). The new algorithm is asymptotically more efficient than the CKY algorithm, when restricted to input instances satisfying the relation n &lt; IVTI.</Paragraph>
    <Paragraph position="2"> Where CKY recognizes only constituent sub-strings of the input, the new algorithm can recognize three types of subderivations, shown and described in Figure l(a). A declarative specification of the algorithm is given in Figure l(b). The derivability conditions of (a) are guaranteed by (b), by induction, and the correctness of the acceptance condition (see caption) follows.</Paragraph>
    <Paragraph position="3"> This declarative specification, like CKY, may be implemented by bottom-up dynamic programming. We sketch one such method. For each possible item, as shown in (a), we maintain a bit (indexed by the parameters of the item) that records whether the item has been derived yet. All these bits are initially zero. The algorithm makes a single pass through the possible items, setting the bit for each if it can be derived using any rule in (b) from items whose bits are already set. At the end of this pass it is straight-forward to test whether to accept w (see caption). The pass considers the items in increasing order of width, where the width of an item in (a) is defined as max{h,i,j} -min{h,i,j}.</Paragraph>
    <Paragraph position="4"> Among items of the same width, those of type A should be considered last.</Paragraph>
    <Paragraph position="5"> The algorithm requires space proportional to the number of possible items, which is at most na\]VDI 2. Each of the five rule templates can instantiate its free variables in at most n4p or (for COMPLETE rules) n41VDI 2 different ways, each of which is tested once and in constant time; so the runtime is O(n 4 max(p, IVDI2)).</Paragraph>
    <Paragraph position="6"> By comparison, the CKY algorithm uses only the first type of item, and relies on rules whose B C inputs are pairs .~.~ . z~::~ . Such rules can be instantiated in O(n 5) different ways for a fixed grammar, yielding O(n 5) time complexity.</Paragraph>
    <Paragraph position="7"> The new algorithm saves a factor of n by combining those two constituents in two steps, one of which is insensitive to k and abstracts over its possible values, the other of which is insensitive to h ~ and abstracts over its possible values.</Paragraph>
    <Paragraph position="8"> It is straightforward to turn the new O(n 4) recognition algorithm into a parser for stochastic bilexical CFGs (or other weighted bilexical CFGs). In a stochastic CFG, each nonterminal A\[a\] is accompanied by a probability distribution over productions of the form A\[a\] --+ ~. A T is just a derivation (proof tree) of lZ~n ,.o parse and its probability--like that of any derivation we find--is defined as the product of the probabilities of all productions used to condition inference rules in the proof tree. The highest-probability derivation for any item can be reconstructed recursively at the end of the parse, provided that each item maintains not only a bit indicating whether it can be derived, but also the probability and instantiated root rule of its highest-probability derivation tree.</Paragraph>
  </Section>
  <Section position="6" start_page="458" end_page="460" type="metho">
    <SectionTitle>
5 A more efficient variant
</SectionTitle>
    <Paragraph position="0"> We now give a variant of the algorithm of SS4; the variant has the same asymptotic complexity but will often be faster in practice.</Paragraph>
    <Paragraph position="1"> Notice that the ATTACH-LEFT rule of Figure l(b) tries to combine the nonterminal label B\[dh,\] of a previously derived constituent with every possible nonterminal label of the form C\[dh\]. The improved version, shown in Figure 2, restricts C\[dh\] to be the label of a previously derived adjacent constituent. This improves speed if there are not many such constituents and we can enumerate them in O(1) time apiece (using a sparse parse table to store the derived items).</Paragraph>
    <Paragraph position="2"> It is necessary to use an agenda data structure (Kay, 1986) when implementing the declarative algorithm of Figure 2. Deriving narrower items before wider ones as before will not work here because the rule HALVE derives narrow items from wide ones.</Paragraph>
  </Section>
  <Section position="7" start_page="460" end_page="461" type="metho">
    <SectionTitle>
6 Multiple word senses
</SectionTitle>
    <Paragraph position="0"> Rather than parsing an input string directly, it is often desirable to parse another string related by a (possibly stochastic) transduction. Let T be a finite-state transducer that maps a morpheme sequence w E V~ to its orthographic realization, a grapheme sequence v~. T may realize arbitrary morphological processes, including affixation, local clitic movement, deletion of phonological nulls, forbidden or dispreferred k-grams, typographical errors, and mapping of multiple senses onto the same grapheme. Given grammar G and an input @, we ask whether E T(L(G)). We have extended all the algorithms in this paper to this case: the items simply keep track of the transducer state as well.</Paragraph>
    <Paragraph position="1"> Due to space constraints, we sketch only the special case of multiple senses. Suppose that the input is ~ =dl ... dn, and each di has up to * g possible senses. Each item now needs to track its head's sense along with its head's position in @. Wherever an item formerly recorded a head position h (similarly h~), it must now record a pair (h, dh) , where dh E VT is a specific sense of d-h. No rule in Figures 1-2 (or Figure 3 below) will mention more than two such pairs. So the time complexity increases by a factor of O(g2).</Paragraph>
    <Paragraph position="2"> 7 Head automaton grammars in time O(n 4) In this section we show that a length-n string generated by a head automaton grammar (A1shawi, 1996) can be parsed in time O(n4). We do this by providing a translation from head automaton grammars to bilexical CFGs. 4 This result improves on the head-automaton parsing algorithm given by Alshawi, which is analogous to the CKY algorithm on bilexical CFGs and is likewise O(n 5) in practice (see SS3).</Paragraph>
    <Paragraph position="3"> A head automaton grammar (HAG) is a function H : a ~ Ha that defines a head automaton (HA) for each element of its (finite) domain. Let VT =- domain(H) and D = {~, +--}. A special symbol $ E VT plays the role of start symbol. For each a E VT, Ha is a tuple (Qa, VT, (~a, In, Fa), where  VT x D to 2 Qa, the power set of Qa.</Paragraph>
    <Paragraph position="4"> A single head automaton is an acceptor for a language of string pairs (z~, Zr) E V~ x V~. Informally, if b is the leftmost symbol of Zr and q~ E 5a(q, b, -~), then Ha can move from state q to state q~, matching symbol b and removing it from the left end of Zr. Symmetrically, if b is the rightmost symbol of zl and ql E 5a(q, b, ~---) then from q Ha can move to q~, matching symbol b and removing it from the right end of zl.5 More formally, we associate with the head automaton Ha a &amp;quot;derives&amp;quot; relation F-a, defined as a binary relation on Qa x V~ x V~. For every q E Q, x,y E V~, b E VT, d E D, and q' E ~a(q, b, d), we specify that (q, xb, y) ~-a (q',x,Y) if d =+-; (q, x, by) ~-a (q', x, y) if d =--+.</Paragraph>
    <Paragraph position="5"> The reflexive and transitive closure of F-a is written ~-~. The language generated by Ha is the set</Paragraph>
    <Paragraph position="7"> qEIa, rEFa}.</Paragraph>
    <Paragraph position="8"> We may now define the language generated by the entire grammar H. To generate, we expand the start word $ E VT into xSy for some (x, y) E L(H$), and then recursively expand the words in strings x and y. More formally, given H, we simultaneously define La for all a E VT to be minimal such that if (x,y) E L(Ha), x r E Lx, yl ELy, then x~ay ~ E La, where Lal...ak stands for the concatenation language Lal &amp;quot;'&amp;quot; La k. Then H generates language L$. We next present a simple construction that transforms a HAG H into a bilexical CFG G generating the same language. The construction also preserves derivation ambiguity. This means that for each string w, there is a linear-time 1-to-1 mapping between (appropriately de~Alshawi (1996) describes HAs as accepting (or equivalently, generating) zl and z~ from the outside in. To make Figure 3 easier to follow, we have defined HAs as accepting symbols in the opposite order, from the inside out. This amounts to the same thing if transitions are reversed, Is is exchanged with Fa, and any transition probabilities are replaced by those of the reversed Markov chain.</Paragraph>
    <Paragraph position="9">  fined) canonical derivations of w by H and canonical derivations of w by G.</Paragraph>
    <Paragraph position="10"> We adopt the notation above for H and the components of its head automata. Let VD be an arbitrary set of size t = max{\[Qa\[ : a * VT}, and for each a, define an arbitrary injection fa : Qa --+ YD. We define G -- (VN, VT, P,T\[$\]), where  I$ is a singleton set {q}.</Paragraph>
    <Paragraph position="11"> We omit the formal proof that G and H admit isomorphic derivations and hence generate the same languages, observing only that if (x,y) = (bib2... bj, bj+l.., bk) E L(Ha)-a condition used in defining La above--then g\[a\] 3&amp;quot; BI\[bl\]&amp;quot;&amp;quot; Bj\[bj\]aBj+l\[bj+l\]... Bk\[bk\], for any A, B1,... Bk that map to initial states in Ha, Hbl,... Hb~ respectively.</Paragraph>
    <Paragraph position="12"> In general, G has p = O(IVDI 3) = O(t3). The construction therefore implies that we can parse a length-n sentence under H in time O(n4t3). If the HAs in H happen to be deterministic, then in each binary production given by (ii) above, symbol A is fully determined by a, b, and C. In this case p = O(t2), so the parser will operate in time O(n4t2).</Paragraph>
    <Paragraph position="13"> We note that this construction can be straightforwardly extended to convert stochastic HAGs as in (Alshawi, 1996) into stochastic CFGs. Probabilities that Ha assigns to state q's various transition and halt actions are copied onto the corresponding productions A\[a\] --~ c~ of G, where A = fa(q).</Paragraph>
  </Section>
  <Section position="8" start_page="461" end_page="462" type="metho">
    <SectionTitle>
8 Split head automaton grammars in time O(n³)
</SectionTitle>
    <Paragraph position="0"> in time O(n 3) For many bilexical CFGs or HAGs of practical significance, just as for the bilexical version of link grammars (Lafferty et al., 1992), it is possible to parse length-n inputs even faster, in time O(n 3) (Eisner, 1997). In this section we describe and discuss this special case, and give a new O(n 3) algorithm that has a smaller grammar constant than previously reported.</Paragraph>
    <Paragraph position="1"> A head automaton Ha is called split if it has no states that can be entered on a +-- transition and exited on a ~ transition. Such an automaton can accept (x, y) only by reading all of y--immediately after which it is said to be in a flip state--and then reading all of x. Formally, a flip state is one that allows entry on a --+ transition and that either allows exit on a e-transition or is a final state.</Paragraph>
    <Paragraph position="2"> We are concerned here with head automaton grammars H such that every Ha is split.</Paragraph>
    <Paragraph position="3"> These correspond to bilexical CFGs in which any derivation A\[a\] 3&amp;quot; xay has the form A\[a\] 3&amp;quot; xB\[a\] =~* xay. That is, a word's left dependents are more oblique than its right dependents and c-command them.</Paragraph>
    <Paragraph position="4"> Such grammars are broadly applicable. Even if Ha is not split, there usually exists a split head automaton H~ recognizing the same language.</Paragraph>
    <Paragraph position="5"> H a' exists iff {x#y : {x,y) e L(Ha)} is regular (where # C/ VT). In particular, H~a must exist unless Ha has a cycle that includes both +-- and --+ transitions. Such cycles would be necessary for Ha itself to accept a formal language such as {(b n, c n) : n &gt; 0}, where word a takes 2n dependents, but we know of no natural-language motivation for ever using them in a HAG.</Paragraph>
    <Paragraph position="6"> One more definition will help us bound the complexity. A split head automaton Ha is said to be g-split if its set of flip states, denoted Qa C_ Qa, has size &lt; g. The languages that can be recognized by g-split HAs are those that can g be written as \[Ji=l Li x Ri, where the Li and Ri are regular languages over VT. Eisner (1997) actually defined (g-split) bilexical grammars in terms of the latter property. 6 6That paper associated a product language Li x Ri, or equivalently a 1-split HA, with each of g senses of a word (see SS6). One could do the same without penalty in our present approach: confining to l-split automata would remove the g2 complexity factor, and then allowing g  We now present our result: Figure 3 specifies an O(n3g2t 2) recognition algorithm for a head automaton grammar H in which every Ha is g-split. For deterministic automata, the run-time is O(n3g2t)--a considerable improvement on the O(n3g3t 2) result of (Eisner, 1997), which also assumes deterministic automata. As in SS4, a simple bottom-up implementation will suffice.</Paragraph>
    <Paragraph position="7"> s For a practical speedup, add . \[&amp;quot;'. as an an- h j tecedent to the MID rule (and fill in the parse table from right to left).</Paragraph>
    <Paragraph position="8"> Like our previous algorithms, this one takes two steps (ATTACH, COMPLETE) to attach a child constituent to a parent constituent. But instead of full constituents--strings xd~y E Ld~--it uses only half-constituents like xdi and diy. Where CKY combines z~ i h jj+ln we save two degrees of freedom i, k (so improving O(n 5) to O(n3)) and combine, ,~:~...~J; n 2J~1 n The other halves of these constituents can be attached later, because to find an accepting path for (zl, Zr) in a split head automaton, one can separately find the half-path before the flip state (which accepts zr) and the half-path after the flip state (which accepts zt). These two halfpaths can subsequently be joined into an accepting path if they have the same flip state s, i.e., one path starts where the other ends. Annotating our left half-constituents with s makes this check possible.</Paragraph>
  </Section>
</Paper>