<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2130">
  <Title>Formal aspects and parsing issues of dependency theory</Title>
  <Section position="3" start_page="788" end_page="792" type="metho">
    <SectionTitle>
3. Parsing issues
</SectionTitle>
    <Paragraph position="0"> In this section we describe an Earley-style parser for the formalism of section 2. The parser is an off-line algorithm: the first step scans the input sentence and selects the appropriate dependency rules from the grammar by matching the head of each rule against the words of the sentence. The second step applies Earley's phases to the selected dependency rules, together with the treatment of u-indices and u-triples. This off-line technique is not uncommon in lexicalized grammars: because dependency rules do not abstract over categories, running each Earley prediction over the whole grammar would waste much computation time (a grammar factor) in the body of the algorithm (cf. (Schabes 1990)).</Paragraph>
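    <Paragraph> The off-line selection step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the Rule class, its fields and the select_rules function are hypothetical names, the d-quad sequence is left opaque, and the toy grammar is invented for the illustration.
<![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A lexicalized dependency rule (hypothetical representation)."""
    head_word: str   # lexical head of the rule, e.g. "know"
    category: str    # category of the head, e.g. "V+EX"
    dquads: tuple    # d-quad sequence, left opaque in this sketch


def select_rules(grammar, sentence):
    """Keep only the rules whose lexical head occurs in the sentence."""
    words = set(sentence)
    return [rule for rule in grammar if rule.head_word in words]


# Toy grammar: "sleeps" does not occur in the input and is filtered out.
grammar = [
    Rule("know", "V+EX", ()),
    Rule("likes", "V", ()),
    Rule("beans", "N", ()),
    Rule("sleeps", "V", ()),
]
selected = select_rules(grammar, ["beans", "I", "know", "John", "likes"])
```
]]>
    After this filter, the Earley phases only iterate over rules headed by words that occur in the input, so no prediction step has to search the whole grammar.</Paragraph>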
    <Paragraph position="1"> In order to recognize a sentence of n words, n+1 sets Si of items are built. An item represents a subtree of the total dependency tree associated with the sentence. An item is a 5-tuple &lt;Dotted-rule, Position, μ-index, ν-index, T-stack&gt;. Dotted-rule is a dependency rule with a dot between two d-quads of the d-quad sequence. Position is the input position where the parsing of the subtree represented by this item began (the leftmost position of the spanned string). μ-index and ν-index are two integers that correspond to the indices of a word object in a derivation. T-stack is a stack of sets of u-triples yet to be satisfied: the sets of u-triples (including empty sets, when applicable) provided by the various items are stacked in order to verify that a u-triple is consumed in the appropriate subtree (cf. the notion of derivation above). Each time the parser predicts a dependent, the set of u-triples associated with it is pushed onto the stack. For an item to enter the completer phase, the top of T-stack must be the empty set, which means that all the u-triples associated with that item have been satisfied. The sets of u-triples are always inserted at the top of T-stack, after checking that each u-triple is not already present in T-stack (the check ignores the u-index). If a u-triple is already present, the deeper copy is deleted, so that T-stack contains only the u-triple in the top set (see the derivation relation above). When satisfying a u-triple, T-stack is treated as a set data structure, since the formalism places no constraint on the order in which the u-triples are consumed.</Paragraph>
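    <Paragraph> One possible encoding of the item 5-tuple is sketched below in Python. It is only an illustration under stated assumptions: the Item class, its field names and the tuple-of-frozensets encoding of T-stack (last element = top) are choices made for this sketch, and the dotted rule is split into a symbol sequence plus a dot offset.
<![CDATA[
```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# A u-triple (u-index, relation, category), e.g. (1, "OBJ", "N").
UTriple = Tuple[int, str, str]


@dataclass(frozen=True)
class Item:
    """<Dotted-rule, Position, mu-index, nu-index, T-stack> (hypothetical encoding)."""
    head: str        # lexical head of the dotted rule
    category: str    # category of the head
    symbols: Tuple   # d-quad sequence, including the special symbol "#"
    dot: int         # number of symbols to the left of the dot
    position: int    # leftmost input position spanned by the subtree
    mu_index: int
    nu_index: int
    t_stack: Tuple[FrozenSet[UTriple], ...]  # last element is the top

    def dot_at_end(self) -> bool:
        return self.dot == len(self.symbols)

    def ready_for_completer(self) -> bool:
        # The top of T-stack must be the empty set: every u-triple
        # associated with this item has been satisfied.
        top_is_empty = bool(self.t_stack) and self.t_stack[-1] == frozenset()
        return self.dot_at_end() and top_is_empty


scanned = Item("beans", "N", ("#",), 1, 0, 0, 0, (frozenset(),))
pending = Item("likes", "V", ("#",), 1, 3, 0, 0,
               (frozenset({(1, "OBJ", "N")}),))
```
]]>
    Here scanned may enter the completer phase, while pending may not, because a u-triple is still waiting on top of its T-stack.</Paragraph>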
    <Paragraph position="2"> Following Earley's style, the general idea is to move the dot rightward while predicting the dependents of the head of the rule. The dot can advance across a d-quad &lt;rj, Yj, uj, τj&gt; or across the special symbol #. The d-quad immediately following the dot can be indexed uj. This is acknowledged by predicting an item (representing the subtree rooted by that dependent) and inserting a new progressive integer in the fourth component of the item (ν-index). τj is pushed onto T-stack: the substructure rooted by a node of category Yj must contain the trace nodes of the type licensed by the u-triples. The prediction of a trace occurs as in case 2) of the derivation process. When an item P contains a dotted rule with the dot at its end and a T-stack with the empty set ∅ as the top symbol, the parser looks for the items that can advance the dot, given the completion of the dotted dependency rule in P. Here is the algorithm.</Paragraph>
    <Paragraph position="3"> Sentence: w0 w1 ... wn-1</Paragraph>
    <Paragraph position="5"> for each set Si (0 ≤ i ≤ n) do for each P = &lt;y: Y(η • δ), j, μ, ν, T-stack&gt; in Si → completer (including pseudocompleter).</Paragraph>
    <Paragraph position="6"> if δ is the empty sequence and TOP(T-stack) = ∅ for each &lt;x: X(λ • &lt;R, Y, ux, τx&gt; ζ), j', μ', ν', T-stack'&gt; in Sj</Paragraph>
    <Paragraph position="8"> INSERT &lt;ε: Zδ(• #), i, u, uδ, T-stack'&gt; into Si;</Paragraph>
    <Paragraph position="10"> if &lt;x: Q(α •), 0, μ, ν, []&gt; ∈ Sn, where Q ∈ S(G), then accept else reject endif.</Paragraph>
    <Paragraph position="11"> At the beginning (initialization), the parser initializes the set S0 by inserting all the dotted rules (x: Q(δ) ∈ H(G)) whose head is of a root category (Q ∈ S(G)). The dot precedes the whole d-quad sequence (δ). Each u-index of the rule is replaced by a progressive integer, in both the u and the τ components of the d-quads. Both the μ-index and the ν-index are null (0), and T-stack is empty ([]). The body consists of an external loop on the sets Si (0 ≤ i ≤ n) and an inner loop on the single items of the set Si. Let P = &lt;y: Y(η • δ), j, μ, ν, T-stack&gt;</Paragraph>
    <Paragraph position="13"> be a generic item. Following Earley's schema, the parser invokes three phases on the item P: completer, predictor and scanner. Because traces (ε) are derived from the u-triples in T-stack, we need to add some code (the so-called pseudo-phases) that deals with the completion, prediction and scanning of these entities.</Paragraph>
    <Paragraph position="14"> Completer: When δ is an empty sequence (the whole d-quad sequence has been traversed) and the top of T-stack is the empty set ∅ (all the u-triples concerning this item have been satisfied), the dotted rule has been completely analyzed. The completer looks for the items in Sj which were waiting for completion (return items; j is the return Position of the item P). The return items must contain a dotted rule where the dot immediately precedes a d-quad &lt;R, Y, ux, τx&gt;, where Y is the head category of the dotted rule in the item P. Their generic form is &lt;x: X(λ • &lt;R, Y, ux, τx&gt; ζ), j', μ', ν', T-stack'&gt;. These items from Sj are inserted into Si after having advanced the dot to the right of &lt;R, Y, ux, τx&gt;.</Paragraph>
    <Paragraph position="15"> Before inserting the items, we need to update the T-stack component, because some u-triples could have been satisfied (and then deleted from the T-stack). The new T-stack'' is the T-stack of the completed item after popping the top element ∅.</Paragraph>
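    <Paragraph> The completer step can be sketched as follows. This is a simplified Python illustration, not the authors' code: the Item class, its field names and the complete function are hypothetical, d-quads are encoded as (relation, category, u-index, tau) tuples, T-stack is a tuple whose last element is the top, and the pseudocompleter for traces is not modelled.
<![CDATA[
```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Item:
    category: str    # category of the head of the dotted rule
    symbols: Tuple   # each symbol is "#" or (relation, category, u, tau)
    dot: int         # number of symbols to the left of the dot
    position: int    # return position j
    mu: int
    nu: int
    t_stack: Tuple[FrozenSet, ...]  # last element is the top


def complete(p, sets):
    """Return the return items of p with the dot advanced over p's category."""
    if p.dot != len(p.symbols):
        return []                      # dotted rule not fully traversed
    if not p.t_stack or p.t_stack[-1] != frozenset():
        return []                      # some u-triple is still unsatisfied
    new_stack = p.t_stack[:-1]         # T-stack'': pop the empty top set
    advanced = []
    for r in sets[p.position]:         # return items live in S_j
        if r.dot < len(r.symbols):
            nxt = r.symbols[r.dot]
            if nxt != "#" and nxt[1] == p.category:
                advanced.append(replace(r, dot=r.dot + 1, t_stack=new_stack))
    return advanced


# "know" waits in S0 for a VISITOR of category N; "beans" has been scanned.
know = Item("V+EX",
            (("VISITOR", "N", 1, frozenset()),
             ("SUBJ", "N", 0, frozenset()),
             "#",
             ("SCOMP", "V", 0, frozenset({(1, "OBJ", "N")}))),
            0, 0, 0, 0, ())
beans = Item("N", ("#",), 1, 0, 0, 0, (frozenset(),))
advanced = complete(beans, [[know]])
```
]]>
    The advanced item keeps the Position and the indices of the return item; only the dot and the T-stack component change.</Paragraph>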
    <Paragraph position="16"> Predictor: If the dotted rule of P has a d-quad &lt;Rδ, Zδ, uδ, τδ&gt; immediately after the dot, then the parser is expecting a subtree headed by a word of category Zδ. This expectation is encoded by inserting a new item (predicted item) in the set for each rule associated with Zδ (of the form z: Zδ(θ)). Again, each u-index of the new item (d-quad sequence θ) is replaced by a progressive integer. The ν-index component of the predicted item is set to uδ. The parser then prepares the new T-stack' by pushing the new u-triples introduced by τδ, which are to be satisfied by the items predicted after the current item. This operation is accomplished by the primitive PUSH-UNION, which also ensures that u-triples are not repeated in T-stack. As stated in the derivation relation through the UNION operation, there cannot be two u-triples with the same relation and syntactic category in T-stack.</Paragraph>
    <Paragraph position="17"> If a u-triple is repeated, PUSH-UNION deletes the old u-triple and inserts the new one (with the same u-index) in the topmost set. Finally, INSERT adds the new item to the set Si.</Paragraph>
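    <Paragraph> The behaviour of PUSH-UNION described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' primitive: the function name push_union, the (u-index, relation, category) tuple encoding and the tuple-of-frozensets stack (last element = top) are choices made for this sketch, and the new u-triple is stored exactly as given, so any renumbering of u-indices is not modelled.
<![CDATA[
```python
def push_union(new_triples, t_stack):
    """Push a set of u-triples on top of t_stack without repetitions.

    Two u-triples count as the same when relation and category coincide;
    the u-index is ignored by the check. A deeper copy is deleted, so the
    u-triple survives only in the topmost set.
    """
    new_keys = {(rel, cat) for (_u, rel, cat) in new_triples}
    pruned = tuple(
        frozenset(t for t in level if (t[1], t[2]) not in new_keys)
        for level in t_stack
    )
    return pruned + (frozenset(new_triples),)


stack = (frozenset({(1, "OBJ", "N")}),)

# Predicting a dependent that introduces no u-triples pushes the empty set.
plain = push_union(set(), stack)

# Predicting a dependent that reintroduces <OBJ, N> removes the deeper copy.
repeated = push_union({(2, "OBJ", "N")}, stack)
```
]]>
    The first call yields the stack written [{&lt;1, OBJ, N&gt;} | ∅] in the notation of the example in section 3.1; the second leaves an empty set at the bottom and the new u-triple on top.</Paragraph>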
    <Paragraph position="18"> The pseudopredictor accounts for the satisfaction of the u-triples when the appropriate conditions hold. The current d-quad in P, &lt;Rδ, Zδ, uδ, τδ&gt;, can be the dependent which satisfies the u-triple &lt;u, Rδ, Zδ&gt; in T-stack (the UNION operation gathers all the u-triples scattered through the T-stack): in addition to updating T-stack (PUSH(∅, T-stack)) and inserting the u-index uδ in the ν component as usual, the parser also inserts the u-index u in the μ component to coindex the appropriate distant element. Then it inserts an item (trace item) with a fictitious dotted dependency rule for the trace.</Paragraph>
    <Paragraph position="19"> Scanner: When the dot precedes the symbol #, the parser can either scan the current input word wi (if y, the head of the item P, is equal to it) or pseudoscan a trace item. The result is the insertion of a new item in the subsequent set (Si+1) in the former case, or in the same set (Si) in the latter.</Paragraph>
    <Paragraph position="20"> At the end of the external loop (termination), the sentence is accepted if the set Sn contains an item of a root category Q whose dotted rule is completely recognized, which spans the whole sentence (Position = 0) and has an empty T-stack.</Paragraph>
    <Section position="1" start_page="790" end_page="792" type="sub_section">
      <SectionTitle>
3.1. An example
</SectionTitle>
      <Paragraph position="0"> In this section, we trace the parsing of the sentence &quot;Beans I know John likes&quot;. In this example we neglect the problem of subject-verb agreement, which can be coded by inserting the AGR features in the category label (in a similar way to the +EX feature in the grammar G1). The comments on the right help to follow the events, and the separator symbol | helps to keep track of the sets in the stack. Finally, we have left in plain text the d-quad sequence of the dotted rules; the other components of the items appear in boldface.</Paragraph>
      <Paragraph position="1">
&lt;beans: N (# •), 0, 0, 0, [∅]&gt; (scanner)
&lt;know: V+EX (&lt;VISITOR, N, 1, ∅&gt; • &lt;SUBJ, N, 0, ∅&gt; # &lt;SCOMP, V, 0, &lt;1, OBJ, N&gt;&gt;), 0, 0, 0, []&gt; (completer &quot;beans&quot;)
&lt;likes: V (&lt;SUBJ, N, 0, ∅&gt; • # &lt;OBJ, N, 0, ∅&gt;), 0, 0, 0, []&gt; (completer &quot;beans&quot;)
&lt;beans: N (• #), 1, 0, 0, [∅]&gt; (predictor &quot;know&quot;)
&lt;I: N (• #), 1, 0, 0, [∅]&gt; (predictor &quot;know&quot;)
&lt;John: N (• #), 1, 0, 0, [∅]&gt; (predictor &quot;know&quot;)
S2 [I]
&lt;I: N (# •), 1, 0, 0, [∅]&gt; (scanner)
&lt;know: V+EX (&lt;VISITOR, N, 1, ∅&gt; &lt;SUBJ, N, 0, ∅&gt; • # &lt;SCOMP, V, 0, &lt;1, OBJ, N&gt;&gt;), 0, 0, 0, []&gt; (completer &quot;I&quot;)
S3 [know]
&lt;know: V+EX (&lt;VISITOR, N, 1, ∅&gt; &lt;SUBJ, N, 0, ∅&gt; # • &lt;SCOMP, V, 0, &lt;1, OBJ, N&gt;&gt;), 0, 0, 0, []&gt; (scanner)
&lt;likes: V (• &lt;SUBJ, N, 0, ∅&gt; # &lt;OBJ, N, 0, ∅&gt;), 3, 0, 0, [{&lt;1, OBJ, N&gt;}]&gt; (predictor &quot;know&quot;)
• &lt;SUBJ, N, 0, ∅&gt; # &lt;SCOMP, V, 0, &lt;2, OBJ, N&gt;&gt;), 3, 0, 0, [{&lt;1, OBJ, N&gt;}]&gt;
&lt;likes: V (&lt;SUBJ, N, 0, ∅&gt; • # &lt;OBJ, N, 0, ∅&gt;), 3, 0, 0, [{&lt;1, OBJ, N&gt;}]&gt; (completer &quot;John&quot;)
S5 [likes]
&lt;likes: V (&lt;SUBJ, N, 0, ∅&gt; # • &lt;OBJ, N, 0, ∅&gt;), 3, 0, 0, [{&lt;1, OBJ, N&gt;}]&gt; (scanner)
&lt;beans: N (• #), 5, 0, 0, [{&lt;1, OBJ, N&gt;} | ∅]&gt;
&lt;I: N (• #), 5, 0, 0, [{&lt;1, OBJ, N&gt;} | ∅]&gt;
&lt;John: N (• #), 5, 0, 0, [{&lt;1, OBJ, N&gt;} | ∅]&gt;
&lt;ε: N (• #), 5, 0, 0, [∅]&gt;
&lt;ε: N (# •), 5, 0, 0, [∅]&gt;
&lt;likes: V (&lt;SUBJ, N, 0, ∅&gt; # &lt;OBJ, N, 0, ∅&gt; •), 3, 0, 0, [∅]&gt;
&lt;know: V+EX (&lt;VISITOR, N, 1, ∅&gt; &lt;SUBJ, N, 0, ∅&gt; # &lt;SCOMP, V, 0, &lt;1, OBJ, N&gt;&gt; •), 0, 0, 0, []&gt;

The parser has polynomial complexity. The space complexity of the parser, i.e. the number of items, is O(n^(3+|D||C|)). Each item is a 5-tuple &lt;Dotted-rule, Position, μ-index, ν-index, T-stack&gt;. The number of dotted rules is a constant of the grammar, but in off-line parsing it is bounded by O(n). Position is bounded by O(n). μ-index and ν-index are two integers that keep track of u-triple satisfaction and do not contribute to the complexity count. T-stack has a number of elements that depends on the maximum length of the chain of predictions. Since the number of rules is O(n), the size of the stack is O(n). The elements of T-stack contain all the u-triples that have been introduced up to an item and are yet to be satisfied (deleted). A u-triple is of the form &lt;u, R, Y&gt;, where u is an integer that has no influence, R ∈ D and Y ∈ C. Because of the PUSH-UNION operation on T-stack, the number of possible u-triples scattered throughout the elements of T-stack is |D||C|. The number of different stacks is given by the arrangements of |D||C| u-triples on O(n) elements, that is, O(n^(|D||C|)). Hence the number of items in a set of items is bounded by O(n^(2+|D||C|)), and since there are n+1 sets of items, the total number of items is O(n^(3+|D||C|)).</Paragraph>
      <Paragraph position="2"> The time complexity of the parser is O(n^(7+3|D||C|)). Each of the three phases executes an INSERT of an item in a set. The cost of the INSERT operation depends on the implementation of the set data structure; we assume it to be linear in the size of the set, O(n^(2+|D||C|)), to simplify the calculation. The completer phase executes at most O(n^(2+|D||C|)) actions for each pair of items (two for-loops). The pairs of items are O(n^(6+2|D||C|)), but to execute the action of the completer one of the sets must have its index equal to one of the positions, which leaves O(n^(5+2|D||C|)) pairs. Thus, the completer costs O(n^(7+3|D||C|)). The predictor phase executes O(n) actions for each item to introduce the predictions (the &quot;for each rule&quot; loop); then, the loop of the pseudopredictor is O(|D||C|) (UNION+DELETE), a grammar factor. Finally, it inserts the new item in the set (O(n^(2+|D||C|))). The total number of items is O(n^(3+|D||C|)), so the predictor costs O(n^(6+2|D||C|)). The scanner phase executes one INSERT operation per item, and the items are at most O(n^(3+|D||C|)). Thus, the scanner costs O(n^(5+2|D||C|)). The total complexity of the algorithm is O(n^(7+3|D||C|)). We are aware that the (grammar-dependent) exponent can be very high, but the treatment of the set data structure for the u-triples requires expensive operations (cf. a stack). On the other hand, this formalism is able to deal with a high degree of free word order (for a comparable result, see (Becker, Rambow 1995)). Also, the complexity factor due to the cardinalities of the sets D and C is greatly reduced if we consider that linguistic constraints restrict the displacement of several categories and relations. A better estimate of the complexity can only be made by considering empirically the impact of the linguistic constraints when writing a wide-coverage grammar.</Paragraph>
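      <Paragraph> The exponent arithmetic of the preceding analysis can be checked with a few lines of Python. This is only a bookkeeping sketch: the function phase_exponents and its parameter names are hypothetical, exponents of n are added where the text multiplies bounds, the grammar factor of the pseudopredictor is ignored because it does not depend on n, and the sizes |D| = 2 and |C| = 3 are invented for the illustration.
<![CDATA[
```python
def phase_exponents(num_relations: int, num_categories: int) -> dict:
    """Exponents of n in the bounds of the analysis, for given |D| and |C|."""
    k = num_relations * num_categories   # |D||C|
    items_per_set = 2 + k                # items in one set
    all_items = items_per_set + 1        # O(n) sets of items
    insert = items_per_set               # INSERT assumed linear in the set size

    # Completer: pairs of items, minus one degree of freedom because the
    # set index must equal a position, times one INSERT per pair.
    completer = (2 * all_items - 1) + insert
    # Predictor: O(n) rules per item, one INSERT each, over all items.
    predictor = 1 + insert + all_items
    # Scanner: one INSERT per item.
    scanner = insert + all_items

    return {
        "completer": completer,
        "predictor": predictor,
        "scanner": scanner,
        "total": max(completer, predictor, scanner),
    }


bounds = phase_exponents(num_relations=2, num_categories=3)
```
]]>
      With |D||C| = 6 the completer dominates with exponent 7 + 3·6 = 25, which illustrates how quickly the grammar-dependent exponent grows.</Paragraph>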
    </Section>
  </Section>
</Paper>