<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1504">
  <Title>Parsing with Soft and Hard Constraints on Dependency Length*</Title>
  <Section position="4" start_page="0" end_page="30" type="metho">
    <SectionTitle>
2 Short Dependencies in Language
</SectionTitle>
    <Paragraph position="0"> We assume that correct parses exhibit a &amp;quot;short-dependency preference&amp;quot;: a word's dependents tend to be close to it in the string.3 If the jth word of a sentence depends on the ith word, then|i[?]j|tends to be 3 In this paper, we consider only a crude notion of &amp;quot;closeness&amp;quot;: the number of intervening words. Other distance measures could be substituted or added (following the literature on heavy-shift and sentence comprehension), including the phonological, morphological, syntactic, or referential (given/new) complexity of the intervening material (Gibson, 1998). In parsing, the most relevant previous work is due to Collins (1997), who considered three binary features of the intervening material: did it contain (a) any word tokens at all, (b) any verbs, (c) any commas or colons? Note that (b) is effective because it measures the length of a dependency in terms of the number of alternative attachment sites that the dependent skipped over, a notion that could be generalized. Similarly, McDonald et al.</Paragraph>
    <Paragraph position="1"> (2005) separately considered each of the intervening POS tags.</Paragraph>
    <Paragraph position="2">  small. This implies that neither i nor j is modified by complex phrases that fall between i and j. In terms of phrase structure, it implies that the phrases modifying word i from a given side tend to be (1) few in number, (2) ordered so that the longer phrases fall farther from i, and (3) internally structured so that the bulk of each phrase falls on the side of j away from i.</Paragraph>
    <Paragraph position="3"> These principles can be blamed for several linguistic phenomena. (1) helps explain the &amp;quot;late closure&amp;quot; or &amp;quot;attach low&amp;quot; heuristic (e.g., Frazier, 1979; Hobbs and Bear, 1990): a modifier such as a PP is more likely to attach to the closest appropriate head.</Paragraph>
    <Paragraph position="4"> (2) helps account for heavy-shift: when an NP is long and complex, take NP out, put NP on the table, and give NP to Mary are likely to be rephrased as take out NP, put on the table NP, and give Mary NP. (3) explains certain non-canonical word orders: in English, a noun's left modifier must become a right modifier if and only if it is right-heavy (a taller politician vs. a politician taller than all her rivals4), and a verb's left modifier may extrapose its right-heavy portion (An aardvark walked in who had circumnavigated the globe5).</Paragraph>
    <Paragraph position="5"> Why should sentences prefer short dependencies? Such sentences may be easier for humans to produce and comprehend. Each word can quickly &amp;quot;discharge its responsibilities,&amp;quot; emitting or finding all its dependents soon after it is uttered or heard; then it can be dropped from working memory (Church, 1980; Gibson, 1998). Such sentences also succumb nicely to disambiguation heuristics that assume short dependencies, such as low attachment. Thus, to improve comprehensibility, a speaker can make stylistic choices that shorten dependencies (e.g., heavyshift), and a language can categorically prohibit some structures that lead to long dependencies (*a  who ...] into two non-adjacent pieces, moving the heavy second piece. By slightly stretching the aardvark-who dependency in this way, it greatly shortens aardvark-walked. The same is possible for heavy, non-final right dependents: I met an aardvark yesterday who had circumnavigated the globe again stretches aardvark-who, which greatly shortens met-yesterday. These examples illustrate (3) and (2) respectively. However, the resulting non-contiguous constituents lead to non-projective parses that are beyond the scope of this paper.</Paragraph>
    <Paragraph position="6"> that another sentence that had center-embedding was inside was incomprehensible).</Paragraph>
    <Paragraph position="7"> Such functionalist pressures are not all-powerful.</Paragraph>
    <Paragraph position="8"> For example, many languages use SOV basic word order where SVO (or OVS) would give shorter dependencies. However, where the data exhibit some short-dependency preference, computer parsers as well as human parsers can obtain speed and accuracy benefits by exploiting that fact.</Paragraph>
  </Section>
  <Section position="5" start_page="30" end_page="33" type="metho">
    <SectionTitle>
3 Soft Constraints on Dependency Length
</SectionTitle>
    <Paragraph position="0"> We now enhance simple baseline probabilistic parsers for English, Chinese, and German so that they consider dependency lengths. We confine ourselves (throughout the paper) to parsing part-of-speech (POS) tag sequences. This allows us to ignore data sparseness, out-of-vocabulary, smoothing, and pruning issues, but it means that our accuracy measures are not state-of-the-art. Our techniques could be straightforwardly adapted to (bi)lexicalized parsers on actual word sequences, though not necessarily with the same success.</Paragraph>
    <Section position="1" start_page="30" end_page="31" type="sub_section">
      <SectionTitle>
3.1 Grammar Formalism
</SectionTitle>
      <Paragraph position="0"> Throughout this paper we will use split bilexical grammars, or SBGs (Eisner, 2000), a notationally simpler variant of split head-automaton grammars, or SHAGs (Eisner and Satta, 1999). The formalism is context-free. We define here a probabilistic version,6 which we use for the baseline models in our experiments. They are only baselines because the SBG generative process does not take note of dependency length.</Paragraph>
      <Paragraph position="1"> An SBG is an tuple G = (S,$,L,R). S is an alphabet of words. (In our experiments, we parse only POS tag sequences, so S is actually an alphabet of tags.) $ negationslash[?] S is a distinguished root symbol; let -S = S[?]{$}. L and R are functions from -S to probabilistic epsilon1-free finite-state automata over S. Thus, for each w[?] -S, the SBG specifies &amp;quot;left&amp;quot; and &amp;quot;right&amp;quot; probabilistic FSAs, Lw and Rw.</Paragraph>
      <Paragraph position="2"> We use Lw(G) : -S[?] -[0,1] to denote the probabilistic context-free language of phrases headed by w. Lw(G) is defined by the following simple top-down stochastic process for sampling from it:  1. Sample from the finite-state language L(Lw) a sequence l = w[?]1w[?]2 ...w[?]lscript [?] S[?] of left children, and from L(Rw) a sequence r = w1w2 ...wr [?] S[?] of right children. Each sequence is found by a random walk on its probabilistic FSA. We say the children depend on w.</Paragraph>
      <Paragraph position="3"> 2. For each i from [?]lscript to r with i negationslash= 0, recursively sample ai [?] S[?] from the context-free language Lwi(G). It is this step that indirectly determines dependency lengths.</Paragraph>
      <Paragraph position="4"> 3. Return a[?]lscript ...a[?]2a[?]1wa1a2 ...ar [?] -S[?], a  concatenation of strings.</Paragraph>
      <Paragraph position="5"> Notice that w's left children l were generated in reverse order, so w[?]1 and w1 are its closest children while w[?]lscript and wr are the farthest.</Paragraph>
      <Paragraph position="6"> Given an input sentence o = w1w2 ...wn [?]S[?], a parser attempts to recover the highest-probability derivation by which $o could have been generated from L$(G). Thus, $ plays the role of w0. A sample derivation is shown in Fig. 1a. Typically, L$ and R$ are defined so that $ must have no left children (lscript = 0) and at most one right child (r [?] 1), the latter serving as the conventional root of the parse.</Paragraph>
    </Section>
    <Section position="2" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
3.2 Baseline Models
</SectionTitle>
      <Paragraph position="0"> In the experiments reported here, we defined only very simple automata for Lw and Rw (w [?] S).</Paragraph>
      <Paragraph position="1"> However, we tried three automaton types, of varying quality, so as to evaluate the benefit of adding length-sensitivity at three different levels of baseline performance.</Paragraph>
      <Paragraph position="2"> In model A (the worst), each automaton has topology circlering a1a0a27, with a single state q1, so token w's left dependents are conditionally independent of one another given w. In model C (the best), each automaton circlering[?]-circlering a1a0a27 has an extra state q0 that allows the first (closest) dependent to be chosen differently from the rest. Model B is a compromise:7 it is like model A, but each type w [?] S may have an elevated or reduced probability of having no dependents at all. This is accomplished by using automata circlering[?]-circlering a1a0a27 as in model C, which allows the stopping probabilities p(STOP  |q0) and p(STOP |q1) to differ, but tying the conditional dis7It is equivalent to the &amp;quot;dependency model with valence&amp;quot; of Klein and Manning (2004).</Paragraph>
      <Paragraph position="3"> tributions p(q0 w[?]-q1  |q0,!STOP) and p(q1 w[?]-q1 | q1,!STOP).</Paragraph>
      <Paragraph position="4"> Finally, inSS3, L$ and R$ are restricted as above, so R$ gives a probability distribution over S only.</Paragraph>
    </Section>
    <Section position="3" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
3.3 Length-Sensitive Models
</SectionTitle>
      <Paragraph position="0"> None of the baseline models A-C explicitly model the distance between a head and child. We enhanced them by multiplying in some extra length-sensitive factors when computing a tree's probability. For each dependency, an extra factor p([?]|...) is multiplied in for the probability of the dependency's length [?] =|i[?]j|, where i and j are the positions of the head and child in the surface string.8 Again we tried three variants. In one version, this new probability p([?]|...) is conditioned only on the direction d = sign(i[?]j) of the dependency. In another version, it is conditioned only on the POS tag h of the head. In a third version, it is conditioned on d, h, and the POS tag c of the child.</Paragraph>
    </Section>
    <Section position="4" start_page="31" end_page="32" type="sub_section">
      <SectionTitle>
3.4 Parsing Algorithm
</SectionTitle>
      <Paragraph position="0"> Fig. 2a gives a variant of Eisner and Satta's (1999) SHAG parsing algorithm, adapted to SBGs, which are easier to understand.9 (We will modify this algorithm later in SS4.) The algorithm obtains O(n3) runtime, despite the need to track the position of head words, by exploiting the conditional independence between a head's left children and right children. It builds &amp;quot;half-constituents&amp;quot; denoted by a64a64 (a head word together with some modifying phrases on the right, i.e., wa1 ...ar) and a0a0 (a head word together with some modifying phrases on the left, i.e., a[?]lscript ...a[?]1w). A new dependency is introduced when a64a64 + a0a0 are combined to get a72a72 or a8a8 (a pair of linked head words with all the intervening phrases, i.e., wa1 ...araprime[?]lscriptprime ...aprime[?]1wprime, where w is respectively the parent or child of wprime).</Paragraph>
      <Paragraph position="1"> One can then combine a72a72 + a64a64 = a64a64 , or 8Since the [?] values are fully determined by the tree but every p([?]  |...) [?] 1, this crude procedure simply reduces the probability mass of every legal tree. The resulting model is deficient (does not sum to 1); the remaining probability mass goes to impossible trees whose putative dependency lengths [?] are inconsistent with the tree structure. We intend in future work to explore non-deficient models (log-linear or generative), but even the present crude approach helps.</Paragraph>
      <Paragraph position="3"> are possible in total when parsing a length-n sentence. null</Paragraph>
    </Section>
    <Section position="5" start_page="32" end_page="32" type="sub_section">
      <SectionTitle>
3.5 A Note on Word Senses
</SectionTitle>
      <Paragraph position="0"> [This section may be skipped by the casual reader.] A remark is necessary about :w and :wprime in Fig. 2a, which represent senses of the words at positions h and hprime. Like past algorithms for SBGs (Eisner, 2000), Fig. 2a is designed to be a bit more general and integrate sense disambiguation into parsing. It formally runs on an input Ohm = W1 ...Wn [?] S[?], where each Wi [?] S is a &amp;quot;confusion set&amp;quot; over possible values of the ith word wi. The algorithm recovers the highest-probability derivation that generates $o for some o [?] Ohm (i.e., o = w1 ...wn with ([?]i)wi[?]Wi).</Paragraph>
      <Paragraph position="1"> This extra level of generality is not needed for any of our experiments, but it is needed for SBG parsers to be as flexible as SHAG parsers. We include it in this paper to broaden the applicability of both Fig. 2a and our extension of it inSS4.</Paragraph>
      <Paragraph position="2"> The &amp;quot;senses&amp;quot; can be used in an SBG to pass a finite amount of information between the left and right children of a word, just as SHAGs allow.10 For example, to model the fronting of a direct object, an SBG might use a special sense of a verb, whose automata tend to generate both one more noun in l and one fewer noun in r.</Paragraph>
      <Paragraph position="3"> Senses can also be used to pass information between parents and children. Important uses are to encode lexical senses, or to enrich the dependency parse with constituent labels or depen10Fig. 2a enhances the Eisner-Satta version with explicit senses while matching its asymptotic performance. On this point, see (Eisner and Satta, 1999, SS8 and footnote 6). However, it does have a practical slowdown, in that START-LEFT nondeterministically guesses every possible sense of Wi, and these senses are pursued separately. To match the Eisner-Satta algorithm, we should not need to commit to a word's sense until we have seen all its left children. That is, left triangles and left trapezoids should not carry a sense :w at all, except for the completed left triangle (marked F) that is produced by FINISH-LEFT. FINISH-LEFT should choose a sense w of Wh according to the final state q, which reflects knowledge of Wh's left children. For this strategy to work, the transitions in Lw (used by ATTACH-LEFT) must not depend on the particular sense w but only on W. In other words, all Lw : w [?] Wh are really copies of a shared LWh, except that they may have different final states. This requirement involves no loss of generality, since the nondeterministic shared LWh is free to branch as soon as it likes onto paths that commit to the various senses w.</Paragraph>
      <Paragraph position="4"> dency labels (Eisner, 2000). For example, the input token Wi = {bank1/N/NP, bank2/N/NP, bank3/V/VP, bank3/V/S} [?] S allows four &amp;quot;senses&amp;quot; of bank, namely two nominal meanings, and two syntactically different versions of the verbal meaning, whose automata require them to expand into VP and S phrases respectively.</Paragraph>
      <Paragraph position="5"> The cubic runtime is proportional to the number of ways of instantiating the inference rules in Fig. 2a: O(n2(n + tprime)tg2), where n = |Ohm |is the input length, g = maxni=1|Wi |bounds the size of a confusion set, t bounds the number of states per automaton, and tprime [?] t bounds the number of automaton transitions from a state that emit the same word. For deterministic automata, tprime = 1.11</Paragraph>
    </Section>
    <Section position="6" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
3.6 Probabilistic Parsing
</SectionTitle>
      <Paragraph position="0"> It is easy to make the algorithm of Fig. 2a lengthsensitive. When a new dependency is added by an ATTACH rule that combines a64a64 + a0a0 , the annotations on a64a64 and a0a0 suffice to determine the dependency's length [?] = |h[?]hprime|, direction d = sign(h[?]hprime), head word w, and child word wprime.12 So the additional cost of such a dependency, e.g. p([?]  |d,w,wprime), can be included as the weight of an extra antecedent to the rule, and so included in the weight of the resulting a8a8 or a72a72 .</Paragraph>
      <Paragraph position="1"> To execute the inference rules in Fig. 2a, we use a prioritized agenda. Derived items such as a64a64 , a0a0 , a8a8 , and a72a72 are prioritized by their Viterbi-inside probabilities. This is known as uniform-cost search or shortest-hyperpath search (Nederhof, 2003). We halt as soon as a full parse (the accept item) pops from the agenda, since uniform-cost search (as a special case of the A[?] algorithm) guarantees this to be the maximum-probability parse. No other pruning is done.</Paragraph>
      <Paragraph position="2"> 11Confusion-set parsing may be regarded as parsing a particular lattice with n states and ng arcs. The algorithm can be generalized to lattice parsing, in which case it has runtime O(m2(n + tprime)t) for a lattice of n states and m arcs. Roughly, h : w is replaced by an arc, while i is replaced by a state and i[?]1 is replaced by the same state.</Paragraph>
      <Paragraph position="3"> 12For general lattice parsing, it is not possible to determine [?] while applying this rule. There h and hprime are arcs in the lattice, not integers, and different paths from h to hprime might cover different numbers of words. Thus, if one still wanted to measure dependency length in words (rather than in, say, milliseconds of speech), each item would have to record its width explicitly, leading in general to more items and increased runtime.</Paragraph>
      <Paragraph position="4">  With a prioritized agenda, a probability model that more sharply discriminates among parses will typically lead to a faster parser. (Low-probability constituents languish at the back of the agenda and are never pursued.) We will see that the length-sensitive models do run faster for this reason.</Paragraph>
    </Section>
    <Section position="7" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
3.7 Experiments with Soft Constraints
</SectionTitle>
      <Paragraph position="0"> We trained models A-C, using unsmoothed maximum likelihood estimation, on three treebanks: the Penn (English) Treebank (split in the standard way, SS2-21 train/SS23 test, or 950K/57K words), the Penn Chinese Treebank (80% train/10% test or 508K/55K words), and the German TIGER corpus (80%/10% or 539K/68K words).13 Estimation was a simple matter of counting automaton events and normalizing counts into probabilities. For each model, we also trained the three length-sensitive versions described inSS3.3.</Paragraph>
      <Paragraph position="1"> The German corpus contains non-projective trees.</Paragraph>
      <Paragraph position="2"> None of our parsers can recover non-projective dependencies (nor can our models produce them). This fact was ignored when counting events for maximum likelihood estimation: in particular, we always trained Lw and Rw on the sequence of w's immediate children, even in non-projective trees.</Paragraph>
      <Paragraph position="3"> Our results (Tab. 1) show that sharpening the probabilities with the most sophisticated distance factors p([?]  |d,h,c), consistently improved the speed of all parsers.14 The change to the code is trivial. The only overhead is the cost of looking up and multiplying in the extra distance factors.</Paragraph>
      <Paragraph position="4"> Accuracy also improved over the baseline models of English and Chinese, as well as the simpler baseline models of German. Again, the most sophisticated distance factors helped most, but even the simplest distance factor usually obtained most of the accuracy benefit.</Paragraph>
      <Paragraph position="5"> German model C fell slightly in accuracy. The speedup here suggests that the probabilities were sharpened, but often in favor of the wrong parses.</Paragraph>
      <Paragraph position="6"> We did not analyze the errors on German; it may 13Heads were extracted for English using Michael Collins' rules and Chinese using Fei Xia's rules (defaulting in both cases to right-most heads where the rules fail). German heads were extracted using the TIGER Java API; we discarded all resulting dependency structures that were cyclic or unconnected (6%). 14We measure speed abstractly by the number of items built and pushed on the agenda.</Paragraph>
      <Paragraph position="7"> be relevant that 25% of the German sentences contained a non-projective dependency between non-punctuation tokens.</Paragraph>
      <Paragraph position="8"> Studying the parser output for English, we found that the length-sensitive models preferred closer attachments, with 19.7% of tags having a nearer parent in the best parse under model C with p([?]|d,h,c) than in the original model C, 77.7% having a parent at the same distance, and only 2.5% having a farther parent. The surviving long dependencies (at any length &gt; 1) tended to be much more accurate, while the (now more numerous) length-1 dependencies were slightly less accurate than before.</Paragraph>
      <Paragraph position="9"> We caution that length sensitivity's most dramatic improvements to accuracy were on the worse base-line models, which had more room to improve. The better baseline models (B and C) were already able to indirectly capture some preference for short dependencies, by learning that some parts of speech were unlikely to have multiple left or multiple right dependents. Enhancing B and C therefore contributed less, and indeed may have had some harmful effect by over-penalizing some structures that were already appropriately penalized.15 It remains to be seen, therefore, whether distance features would help state-of-the art parsers that are already much better than model C. Such parsers may already incorporate features that indirectly impose a good model of distance, though perhaps not as cheaply.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="33" end_page="37" type="metho">
    <SectionTitle>
4 Hard Dependency-Length Constraints
</SectionTitle>
    <Paragraph position="0"> We have seen how an explicit model of distance can improve the speed and accuracy of a simple probabilistic dependency parser. Another way to capitalize on the fact that most dependencies are local is to impose a hard constraint that simply forbids long dependencies.</Paragraph>
    <Paragraph position="1"> The dependency trees that satisfy this constraint yield a regular string language.16 The constraint prevents arbitrarily deep center-embedding, as well as arbitrarily many direct dependents on a given head, 15Owing to our deficient model. A log-linear or discriminative model would be trained to correct for overlapping penalties and would avoid this risk. Non-deficient generative models are also possible to design, along lines similar to footnote 16. 16One proof is to construct a strongly equivalent CFG without center-embedding (Nederhof, 2000). Each nonterminal has the form &lt;w,q,i,j&gt; , where w [?] S, q is a state of Lw or Rw, and i,j [?] {0,1,...k[?]1,[?] k}. We leave the details as an exercise.  in how they weight the same candidate parse trees. Length-sensitive models are larger but can improve dependency accuracy and speed. (Recall is measured as the fraction of non-punctuation tags whose correct parent (if not the $ symbol) was correctly recovered by the parser; it equals precision, unless the parser left some sentences unparsed (or incompletely parsed, as in SS4), in which case precision is higher. Runtime is measured abstractly as the average number of items (i.e., a64a64 , a0a0 , a8a8 , a72a72 ) built per word. Model size is measured as the number of nonzero parameters.) either of which would allow the non-regular language {anbcn : 0 &lt; n &lt; [?]}. It does allow arbitrarily deep right- or left-branching structures.</Paragraph>
    <Section position="1" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
4.1 Vine Grammars
</SectionTitle>
      <Paragraph position="0"> The tighter the bound on dependency length, the fewer parse trees we allow and the faster we can find them using the algorithm of Fig. 2a. If the bound is too tight to allow the correct parse of some sentence, we would still like to allow an accurate partial parse: a sequence of accurate parse fragments (Hindle, 1990; Abney, 1991; Appelt et al., 1993; Chen, 1995; Grefenstette, 1996). Furthermore, we would like to use the fact that some fragment sequences are presumably more likely than others.</Paragraph>
      <Paragraph position="1"> Our partial parses will look like the one in Fig. 1b.</Paragraph>
      <Paragraph position="2"> where 4 subtrees rather than 1 are dependent on $.</Paragraph>
      <Paragraph position="3"> This is easy to arrange in the SBG formalism. We merely need to construct our SBG so that the automaton R$ is now permitted to generate multiple children--the roots of parse fragments.</Paragraph>
      <Paragraph position="4"> This R$ is a probabilistic finite-state automaton that describes legal or likely root sequences in S[?].</Paragraph>
      <Paragraph position="5"> In our experiments in this section, we will train it to be a first-order (bigram) Markov model. (Thus we construct R$ in the usual way to have |S|+ 1 states, and train it on data like the other left and right automata. During generation, its state remembers the previously generated root, if any. Recall that we are working with POS tag sequences, so the roots, like all other words, are tags in S.) The 4 subtrees in Fig. 1b appear as so many bunches of grapes hanging off a vine. We refer to the dotted dependencies upon $ as vine dependencies, and the remaining, bilexical dependencies as tree dependencies.</Paragraph>
      <Paragraph position="6"> One might informally use the term &amp;quot;vine grammar&amp;quot; (VG) for any generative formalism, intended for partial parsing, in which a parse is a constrained sequence of trees that cover the sentence. In general, a VG might use a two-part generative process: first generate a finite-state sequence of roots, then expand the roots according to some more powerful formalism. Conveniently, however, SBGs and other dependency grammars can integrate these two steps into a single formalism.</Paragraph>
    </Section>
    <Section position="2" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
4.2 Feasible Parsing
</SectionTitle>
      <Paragraph position="0"> Now, for both speed and accuracy, we will restrict the trees that may hang from the vine. We define a feasible parse under our SBG to be one in which all tree dependencies are short, i.e., their length never exceeds some hard bound k. The vine dependencies may have unbounded length, of course, as in Fig. 1b.</Paragraph>
      <Paragraph position="1"> Sentences with feasible parses form a regular language. This would also be true under other definitions of feasibility, e.g., we could have limited the depth or width of each tree on the vine. However, that would have ruled out deeply right-branching trees, which are very common in language, and  parse for the same sentence retaining only tree dependencies of length [?] k = 3. The roots of the 4 resulting parse fragments are now connected only by their dotted-line &amp;quot;vine dependencies&amp;quot; on $. Transforming (a) into (b) involves grafting subtrees rooted at &amp;quot;According&amp;quot;, &amp;quot;,&amp;quot;, and &amp;quot;.&amp;quot; onto the vine. are also the traditional way to describe finite-state sublanguages within a context-free grammar. By contrast, our limitation on dependency length ensures regularity while still allowing (for any bound k [?] 1) arbitrarily wide and deep trees, such as a-b-...- root-...-y-z.</Paragraph>
      <Paragraph position="2"> Our goal is to find the best feasible parse (if any). Rather than transform the grammar as in footnote 16, our strategy is to modify the parser so that it only considers feasible parses. The interesting problem is to achieve linear-time parsing with a grammar constant that is as small as for ordinary parsing.</Paragraph>
      <Paragraph position="3"> We also correspondingly modify the training data so that we only train on feasible parses. That is, we break any long dependencies and thereby fragment each training parse (a single tree) into a vine of one or more restricted trees. When we break a childto-parent dependency, we reattach the child to $.17 This process, grafting, is illustrated in Fig. 1. Although this new parse may score less than 100% recall of the original dependencies, it is the best feasible parse, so we would like to train the parser to find it.18 By training on the modified data, we learn more 17Any dependency covering the child must also be broken to preserve projectivity. This case arises later; see footnote 25. 18Although the parser will still not be able to find it if it is non-projective (possible in German). Arguably we should have defined &amp;quot;feasible&amp;quot; to also require projectivity, but we did not. appropriate statistics for both R$ and the other automata. If we trained on the original trees, we would inaptly learn that R$ always generates a single root rather than a certain kind of sequence of roots.</Paragraph>
      <Paragraph position="4"> For evaluation, we score tree dependencies in our feasible parses against the tree dependencies in the unmodified gold standard parses, which are not necessarily feasible. We also show oracle performance.</Paragraph>
    </Section>
    <Section position="3" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
4.3 Approach #1: FSA Parsing
</SectionTitle>
      <Paragraph position="0"> Since we are now dealing with a regular language, it is possible in principle to use a weighted finite-state automaton (FSA) to search for the best feasible parse. The idea is to find the highest-weighted path that accepts the input string o = w1w2 ...wn. Using the Viterbi algorithm, this takes time O(n).</Paragraph>
      <Paragraph position="1"> The trouble is that this linear runtime hides a constant factor, which depends on the size of the relevant part of the FSA and may be enormous for any correct FSA.19 Consider an example from Fig 1b. After nondeterministically reading w1 ...w11 = According. . . insider along the correct path, the FSA state must record (at least) that insider has no parent yet and that R$ and Rcut are in particular states that 19The full runtime is O(nE), where E is the number of FSA edges, or for a tighter estimate, the number of FSA edges that can be traversed by reading o.</Paragraph>
      <Paragraph position="2">  may still accept more children. Else the FSA cannot know whether to accept a continuation w12 ...wn.</Paragraph>
      <Paragraph position="3"> In general, after parsing a prefix w1 ...wj, the FSA state must somehow record information about all incompletely linked words in the past. It must record the sequence of past words wi (i [?] j) that still need a parent or child in the future; if wi still needs a child, it must also record the state of Rwi.</Paragraph>
      <Paragraph position="4"> Our restriction to dependency length[?]k is what allows us to build a finite-state machine (as opposed to some kind of pushdown automaton with an unbounded number of configurations). We need only build the finitely many states where the incompletely linked words are limited to at most w0 = $ and the k most recent words, wj[?]k+1 ...wj. Other states cannot extend into a feasible parse, and can be pruned. However, this still allows the FSA to be in O(2ktk+1) different states after reading w1 ...wj.</Paragraph>
      <Paragraph position="5"> Then the runtime of the Viterbi algorithm, though linear in n, is exponential in k.</Paragraph>
    </Section>
    <Section position="4" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
4.4 Approach #2: Ordinary Chart Parsing
</SectionTitle>
      <Paragraph position="0"> A much better idea for most purposes is to use a chart parser. This allows the usual dynamic programming techniques for reusing computation. (The FSA in the previous section failed to exploit many such opportunities: exponentially many states would have proceeded redundantly by building the same wj+1wj+2wj+3 constituent.) It is simple to restrict our algorithm of Fig. 2a to find only feasible parses. It is the ATTACH rules a64a64 + a0a0 that add dependencies: simply use a side condition to block them from applying unless |h[?]hprime|[?]k (short tree dependency) or h = 0 (vine dependency). This ensures that all a72a72 and a8a8 will have width[?]k or have their left edge at 0.</Paragraph>
      <Paragraph position="1"> One might now incorrectly expect runtime linear in n: the number of possible ATTACH combinations is reduced from O(n3) to O(nk2), because i and hprime are now restricted to a narrow range given h.</Paragraph>
      <Paragraph position="2"> Unfortunately, the half-constituents a64a64 and a0a0 may still be arbitrarily wide, thanks to arbitrary right- and left-branching: a feasible vine parse may be a sequence of wide trees a0a0a64a64 . Thus there are O(n2k) possible COMPLETE combinations, not to mention O(n2) ATTACH-RIGHT combinations for which h = 0. So the runtime remains quadratic.</Paragraph>
    </Section>
    <Section position="5" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
4.5 Approach #3: Specialized Chart Parsing
</SectionTitle>
      <Paragraph position="0"> How, then, do we get linear runtime and a reasonable grammar constant? We give two ways to achieve runtime of O(nk2).</Paragraph>
      <Paragraph position="1"> First, we observe without details that we can easily achieve this by starting instead with the algorithm of Eisner (2000),20 rather than Eisner and Satta (1999), and again refusing to add long tree dependencies. That algorithm effectively concatenates only trapezoids, not triangles. Each is spanned by a single dependency and so has width[?]k. The vine dependencies do lead to wide trapezoids, but these are constrained to start at 0, where $ is. So the algorithm tries at most O(nk2) combinations of the form h i+ i j (like the ATTACH combinations above) and O(nk) combinations of the form 0 i + i j, where i[?]h[?]k,j[?]i[?]k. The precise runtime is O(nk(k + tprime)tg3).</Paragraph>
      <Paragraph position="2"> We now propose a hybrid linear-time algorithm that further improves runtime to O(nk(k + tprime)tg2), saving a factor of g in the grammar constant.21 We observe that since within-tree dependencies must have length [?] k, they can all be captured within Eisner-Satta trapezoids of width [?] k. So our VG parse a0a0a64a64 [?] can be assembled by simply concatenating a sequence ( a0a0 a8a8 [?] a72a72 [?] a64a64 )[?] of these narrow trapezoids interspersed with width-0 triangles. As this is a regular sequence, we can assemble it in linear time from left to right (rather than in the order of Eisner and Satta (1999)), multiplying the items' probabilities together. Whenever we start adding the right half a72a72 [?] a64a64 of a tree along the vine, we have discovered that tree's root, so we multiply in the probability of a $-root dependency.</Paragraph>
      <Paragraph position="3"> Formally, our hybrid parsing algorithm restricts the original rules of Fig. 2a to build only trapezoids of width [?] k and triangles of width &lt; k.22 The additional inference rules in Fig. 2b then assemble the final VG parse as just described.</Paragraph>
      <Paragraph position="4"> 20With a small change that when two items are combined, the right item (rather than the left) must be simple.</Paragraph>
      <Paragraph position="5"> 21This savings comes from building the internal structure of a trapezoid from both ends inward rather than from left to right. The corresponding unrestricted algorithms (Eisner, 2000; Eisner and Satta, 1999, respectively) have exactly the same runtimes with k replaced by n.</Paragraph>
      <Paragraph position="6"> 22For the experiments of SS4.7, where k varied by type, we restricted these rules as tightly as possible given h and hprime.  improve precision at the expense of recall, for English and Chinese. German performance suffers more. Bounds shown are k = {1,2,...,10,15,20}. The dotted lines show constant F-measure of the unbounded model.</Paragraph>
    </Section>
    <Section position="6" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
4.6 Experiments with Hard Constraints
</SectionTitle>
      <Paragraph position="0"> Our experiments used the asymptotically fast hybrid parsing algorithm above. We used the same left and right automata as in model C, the best-performing model from SS3.2. However, we now define R$ to be a first-order (bigram) Markov model (SS4.1). We trained and tested on the same headed treebanks as before (SS3.7), except that we modified the training trees to make them feasible (SS4.2).</Paragraph>
      <Paragraph position="1"> Results are shown in Figures 3 (precision/recall tradeoff) and 4 (accuracy/speed tradeoff), for k [?] {1,2,...,10,15,20}. Dots correspond to different values of k. On English and Chinese, some values of k actually achieve better F-measure accuracy than the unbounded parser, by eliminating errors.23 We observed that changing R$ from a bigram to a unigram model significantly hurt performance, showing that it is in fact useful to empirically model likely sequences of parse fragments.</Paragraph>
    </Section>
    <Section position="7" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
4.7 Finer-Grained Hard Constraints
</SectionTitle>
      <Paragraph position="0"> The dependency length bound k need not be a single value. Substantially better accuracy can be retained if each dependency type--each (h,c,d) = (head tag, child tag, direction) tuple--has its own 23Because our prototype implementation of each kind of parser (baseline, soft constraints, single-bound, and type-specific bounds) is known to suffer from different inefficiencies, runtimes in milliseconds are not comparable across parsers. To give a general idea, 60-word English sentences parsed in around 300ms with no bounds, but at around 200ms with either a distance model p([?]|d,h,c) or a generous hard bound of k = 10.</Paragraph>
      <Paragraph position="1"> bound k(h,c,d). We call these type-specific bounds: they create a many-dimensional space of possible parsers. We measured speed and accuracy along a sensible path through this space, gradually tighten- null its value is decremented and trees that violate the new bound are accordingly broken, the fewest dependencies will be broken.25 3. Decrement the bound k(h,c,d) and modify the training data to respect the bound by breaking dependencies that violate the bound and &amp;quot;grafting&amp;quot; the loose portion onto the vine. Retrain the parser on the training data.</Paragraph>
      <Paragraph position="2"> 4. If all bounds are not equal to 1, go to step 2.</Paragraph>
      <Paragraph position="3"> The performance of every 200th model along the trajectory of this search is plotted in Fig. 4.26 The graph shows that type-specific bounds can speed up the parser to a given level with less loss in accuracy.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="37" end_page="39" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> As discussed in footnote 3, Collins (1997) and Mc-Donald et al. (2005) considered the POS tags intervening between a head and child. These soft constraints were very helpful, perhaps in part because they helped capture the short dependency preference (SS2). Collins used them as conditioning variables and McDonald et al. as log-linear features, whereas ourSS3 predicted them directly in a deficient model.</Paragraph>
    <Paragraph position="1"> As for hard constraints (SS4), our limitation on dependency length can be regarded as approximating a context-free language by a subset that is a regular 24In the case of the German TIGER corpus, which contains non-projective dependencies, we first make the training trees into projective vines by raising all non-projective child nodes to become heads on the vine.</Paragraph>
    <Paragraph position="2"> 25Not counting dependencies that must be broken indirectly in order to maintain projectivity. (If word 4 depends on word 7 which depends on word 2, and the 4 - 7 dependency is broken, making 4 a root, then we must also break the 2 - 7 dependency.) 26Note that k(h,c,right) = 7 bounds the width of a64a64 + a0a0 = a8a8 . For a finer-grained approach, we could instead separately bound the widths of a64a64 and a0a0 , say by kr(h,c,right) = 4 and kl(h,c,right) = 2.</Paragraph>
    <Paragraph position="3">  language. Our &amp;quot;vines&amp;quot; then let us concatenate several strings in this subset, which typically yields a superset of the original context-free language. Sub-set and superset approximations of (weighted) CFLs by (weighted) regular languages, usually by preventing center-embedding, have been widely explored; Nederhof (2000) gives a thorough review.</Paragraph>
    <Paragraph position="4"> We limit all dependency lengths (not just centerembedding).27 Further, we derive weights from a modified treebank rather than by approximating the true weights. And though regular grammar approximations are useful for other purposes, we argue that for parsing it is more efficient to perform the approximation in the parser, not in the grammar.</Paragraph>
    <Paragraph position="5"> Brants (1999) described a parser that encoded the grammar as a set of cascaded Markov models. The decoder was applied iteratively, with each iteration transforming the best (or n-best) output from the previous one until only the root symbol remained.</Paragraph>
    <Paragraph position="6"> This is a greedy variant of CFG parsing where the grammar is in Backus-Naur form.</Paragraph>
    <Paragraph position="7"> Bertsch and Nederhof (1999) gave a linear-time recognition algorithm for the recognition of the regular closure of deterministic context-free languages. Our result is related; instead of a closure of deterministic CFLs, we deal in a closure of CFLs that are assumed (by the parser) to obey some constraint on trees (like a maximum dependency length).</Paragraph>
  </Section>
  <Section position="8" start_page="39" end_page="40" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> The simple POS-sequence models we used as an experimental baseline are certainly not among the best parsers available today. They were chosen to illustrate how modeling and exploiting distance in syntax can affect various performance measures. Our approach may be helpful for other kinds of parsers as well. First, we hope that our results will generalize to more expressive grammar formalisms such as lexicalized CFG, CCG, and TAG, and to more expressively weighted grammars, such as log-linear models that can include head-child distance among other rich features. The parsing algorithms we presented also admit inside-outside variants, allowing iterative estimation methods for log-linear models (see, e.g., Miyao and Tsujii, 2002).</Paragraph>
    <Paragraph position="1"> 27Of course, this still allows right-branching or left-branching to unbounded depth.</Paragraph>
    <Paragraph position="2">  of feasible parses: The baseline (no length bound) is shown as +. Tighter bounds always improve speed, except for the most lax bounds, for which vine construction overhead incurs a slowdown. Type-specific bounds tend to maintain good F-measure at higher speeds than the single-bound approach. The vertical error bars show the &amp;quot;oracle&amp;quot; accuracy for each experiment (i.e., the F-measure if we had recovered the best feasible parse, as constructed from the gold-standard parse by grafting: see SS4.2). Runtime is measured as the number of items per word (i.e., a64a64 , a0a0 , a8a8 , a72a72 , a8a8a64a64 , a88a88a121 a88a88a121 ) built by the agenda parser. The &amp;quot;soft constraint&amp;quot; point marked with x represents the p([?]  |d,h,c)-augmented model from SS3.  Second, fast approximate parsing may play a role in more accurate parsing. It might be used to rapidly compute approximate outside-probability estimates to prioritize best-first search (e.g., Caraballo and Charniak, 1998). It might also be used to speed up the early iterations of training a weighted parsing model, which for modern training methods tends to require repeated parsing (either for the best parse, as by Taskar et al., 2004, or all parses, as by Miyao and Tsujii, 2002).</Paragraph>
    <Paragraph position="3"> Third, it would be useful to investigate algorithmic techniques and empirical benefits for limiting dependency length in more powerful grammar formalisms. Our runtime reduction from O(n3) -O(nk2) for a length-k bound applies only to a &amp;quot;split&amp;quot; bilexical grammar.28 Various kinds of synchronous grammars, in particular, are becoming important in statistical machine translation. Their high runtime complexity might be reduced by limiting monolingual dependency length (for a related idea see Schafer and Yarowsky, 2003).</Paragraph>
    <Paragraph position="4"> Finally, consider the possibility of limiting dependency length during grammar induction. We reason that a learner might start with simple structures that focus on local relationships, and gradually relax this restriction to allow more complex models.</Paragraph>
  </Section>
class="xml-element"></Paper>