<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1616"> <Title>Incremental Integer Linear Programming for Non-projective Dependency Parsing</Title> <Section position="4" start_page="129" end_page="129" type="metho"> <SectionTitle> 2 Dependency Parsing </SectionTitle>
<Paragraph position="0"> Dependency parsing is the task of attaching words to their arguments. Figure 1 shows a dependency graph for the Dutch sentence "I'll come at twelve and then you'll get what you deserve" (taken from the Alpino Corpus (van der Beek et al., 2002)). In this dependency graph the verb kom is attached to its subject ik; kom is referred to as the head of the dependency and ik as its child. In labelled dependency parsing, edges between words are labelled with the relation they capture. In the case of the dependency between kom and ik the label would be subject.</Paragraph>
<Paragraph position="1"> In a dependency tree every token must be the child of exactly one other node, either another token or the dummy root node. By definition, a dependency tree is free of cycles. For example, it must not contain dependency chains such as en - kom - ik - en. For a more formal definition see previous work (Nivre et al., 2004).</Paragraph>
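These two well-formedness conditions (a single head per token, no cycles) are easy to check mechanically. The following minimal sketch is ours rather than the paper's; it represents a parse as a head array, with tokens numbered 1..n and 0 standing for the dummy root:

```python
def is_tree(heads):
    """Check dependency-tree well-formedness.

    heads[j-1] is the head of token j (tokens are 1..n); 0 is the dummy
    root. The array gives every token exactly one head by construction,
    so it remains to check that head indices are valid and that following
    head links from any token reaches the root, i.e. there is no cycle.
    """
    n = len(heads)
    for j, h in enumerate(heads, start=1):
        if not 0 <= h <= n or h == j:
            return False
    for j in range(1, n + 1):
        seen = set()
        node = j
        while node != 0:          # walk towards the root
            if node in seen:      # revisiting a token means a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

print(is_tree([2, 0, 2]))   # True: token 2 is the root's only child
print(is_tree([2, 3, 1]))   # False: 1 -> 2 -> 3 -> 1 is a cycle, as in en -> kom -> ik -> en
```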
<Paragraph position="2"> An important distinction between dependency trees is whether they are projective or non-projective. Figure 1 is an example of a projective dependency tree: in such trees dependencies do not cross. In Dutch and other flexible word-order languages such as German and Czech we also encounter non-projective trees, in which dependencies do cross.</Paragraph>
<Paragraph position="3"> Dependency parsing is useful for applications such as relation extraction (Culotta and Sorensen, 2004) and machine translation (Ding and Palmer, 2005). Although less informative than lexicalised phrase structures, dependency structures still capture most of the predicate-argument information needed for applications. They have the advantage of being more efficient to learn and parse.</Paragraph>
<Paragraph position="4"> McDonald et al. (2005a) introduce a dependency parsing framework which treats the task as searching for the projective tree that maximises the sum of local dependency scores.</Paragraph>
Figure 2 (caption): The verb krijg is incorrectly coordinated with the preposition om.
<Paragraph position="5"> This framework is efficient and has also been extended to non-projective trees (McDonald et al., 2005b). It provides a discriminative online learning algorithm which, when combined with a rich feature set, reaches state-of-the-art performance across multiple languages.</Paragraph>
<Paragraph position="6"> However, within this framework one can only define features over single attachment decisions.</Paragraph>
<Paragraph position="7"> This leads to cases where basic linguistic constraints are not satisfied (e.g. verbs with two subjects or incompatible coordination arguments). An example of this for Dutch is illustrated in Figure 2, which was produced by the parser of McDonald et al. (2005b). Here the parse contains a coordination of incompatible word classes (a preposition and a verb).</Paragraph>
<Paragraph position="8"> Our approach is able to include additional constraints which forbid configurations such as those in Figure 2. While McDonald and Pereira (2006) address the issue of local attachment decisions by defining scores over attachment pairs, our solution is more general. Furthermore, it is complementary in the sense that we could formulate their model using ILP and then add constraints.</Paragraph>
<Paragraph position="9"> The method we present is not the only one that can take global constraints into account. Deterministic dependency parsing (Nivre et al., 2004; Yamada and Matsumoto, 2003) can apply global constraints by conditioning attachment decisions on the intermediate parse built so far. However, for efficiency a greedy search is used, which may produce sub-optimal solutions. This is not the case when using ILP.</Paragraph> </Section>
<Section position="5" start_page="129" end_page="132" type="metho"> <SectionTitle> 3 Model </SectionTitle>
<Paragraph position="0"> Our underlying model is a modified labelled version of McDonald et al. (2005b):</Paragraph>
<Paragraph position="1"> s(x, y) = \sum_{(i,j,l) \in y} s(i, j, l) = \sum_{(i,j,l) \in y} w \cdot f(i, j, l) </Paragraph>
<Paragraph position="2"> where x is a sentence, y is a set of labelled dependencies, f(i, j, l) is a multidimensional feature vector representation of the edge from token i to token j with label l, and w the corresponding weight vector. For example, a feature f101 in f could be an indicator such as:</Paragraph>
<Paragraph position="3"> f_{101}(i, j, l) = 1 if t(i) = "kom" and p(j) = pronoun, and 0 otherwise </Paragraph>
<Paragraph position="4"> where t(i) is the word at token i and p(j) the part-of-speech tag at token j.</Paragraph>
<Paragraph position="5"> Decoding in this model amounts to finding the y for a given x that maximises s(x, y), i.e. y' = \arg\max_y s(x, y), while fulfilling the following constraints: T1 For every non-root token in x there exists exactly one head; the root token has no head. T2 There are no cycles.</Paragraph>
<Paragraph position="6"> Thus far, the formulation follows McDonald et al. (2005b) and corresponds to the Maximum Spanning Tree (MST) problem. In addition to T1 and T2, we include a set of linguistically motivated constraints: A1 Heads are not allowed to have more than one outgoing edge labelled l, for all l in a set of labels U.</Paragraph>
<Paragraph position="7"> C1 In a symmetric coordination there is exactly one argument to the right of the conjunction and at least one argument to the left.</Paragraph>
<Paragraph position="8"> C2 In an asymmetric coordination there are no arguments to the left of the conjunction and at least two arguments to the right.</Paragraph>
<Paragraph position="9"> C3 There must be at least one comma between subsequent arguments to the left of a symmetric coordination.</Paragraph>
<Paragraph position="10"> C4 Arguments of a coordination must have compatible word classes.</Paragraph>
<Paragraph position="11"> P1 Two dependencies must not cross if one of their labels is in a set of labels P.</Paragraph>
<Paragraph position="12"> A1 covers constraints such as "there can only be one subject" if U contains subject (see Section 4.4 for more details of U). C1 applies to configurations which contain conjunctions such as en, of or maar ("and", "or" and "but"). C2 rules out settings where a conjunction such as zowel (translates as "both") has arguments to its left. C3 forces coordination arguments to the left of a conjunction to have commas in between. C4 avoids parses in which incompatible word classes are coordinated, such as nouns and verbs. Finally, P1 allows selective projective parsing: we can, for instance, forbid the crossing of Noun-Determiner dependencies if we add the corresponding label type to P (see Section 4.4 for more details of P). If we extend P to contain all labels we forbid any type of crossing dependencies; this corresponds to projective parsing.</Paragraph>
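The crossing condition referenced by P1 can be stated concretely: two dependencies cross when exactly one endpoint of one falls strictly between the endpoints of the other. A small illustrative check (ours, not from the paper):

```python
def crosses(edge1, edge2):
    """Return True if two dependencies cross.

    Each edge is a (head, child) pair of token positions, treated as an
    unordered span; the edges cross when exactly one endpoint of one span
    lies strictly inside the other span.
    """
    (a, b), (c, d) = sorted(edge1), sorted(edge2)
    return a < c < b < d or c < a < d < b

print(crosses((4, 9), (2, 7)))  # True: spans interleave
print(crosses((1, 3), (5, 8)))  # False: disjoint spans
print(crosses((1, 8), (3, 5)))  # False: nested spans (projective)
```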
<Section position="1" start_page="130" end_page="132" type="sub_section"> <SectionTitle> 3.1 Decoding </SectionTitle>
<Paragraph position="0"> McDonald et al. (2005b) use the Chu-Liu-Edmonds (CLE) algorithm to solve the maximum spanning tree problem. However, global constraints cannot be incorporated into the CLE algorithm (McDonald et al., 2005b). We alleviate this problem by presenting an equivalent Integer Linear Programming formulation which allows us to incorporate global constraints naturally.</Paragraph>
<Paragraph position="1"> Before giving full details of our formulation we first introduce some of the concepts of linear and integer linear programming (for a more thorough introduction see Winston and Venkataramanan (2003)).</Paragraph>
<Paragraph position="2"> Linear Programming (LP) is a tool for solving optimisation problems in which the aim is to maximise (or minimise) a given linear function with respect to a set of linear constraints. The function to be maximised (or minimised) is referred to as the objective function. A number of decision variables are under our control and exert influence on the objective function; specifically, they have to be optimised in order to maximise (or minimise) the objective function. Finally, a set of constraints restricts the values that the decision variables can take. Integer Linear Programming (ILP) is an extension of linear programming in which all decision variables must take integer values.</Paragraph>
<Paragraph position="3"> There are several explicit formulations of the MST problem as an integer linear program in the literature (Williams, 2002). They are based on the concepts of eliminating subtours (cycles), cuts (disconnections) or requiring intervertex flows (paths). However, in practice these formulations cause long solve times: the first two methods yield an exponential number of constraints. Although the latter scales cubically, it produces fractional solutions in its relaxed version; this causes long runtimes for the branch-and-bound algorithm (Williams, 2002) commonly used in integer linear programming. We found experimentally that dependency parsing models of this form do not converge on a solution after multiple hours of solving, even for small sentences.</Paragraph>
<Paragraph position="4"> As a workaround for this problem we follow an incremental approach akin to the work of Warme (1998). Instead of adding constraints which forbid all possible cycles in advance (this would result in an exponential number of constraints) we first solve the problem without any cycle constraints.</Paragraph>
<Paragraph position="5"> The solution is then examined for cycles, and if cycles are found we add constraints to forbid these cycles; the solver is then run again. This process is repeated until no more violated constraints are found. The same procedure is used for other types of constraints which are too expensive to add in advance (e.g. the constraints of P1).</Paragraph>
<Paragraph position="6"> Algorithm 1 outlines our approach. For a sentence x, Bx is the set of constraints that we add in advance and Ix are the constraints we add incrementally. Ox is the objective function and Vx is a set of variables including integer declarations. solve(C, O, V) maximises the objective function O with respect to the set of constraints C and variables V. violated(y, I) inspects the proposed solution y and returns all constraints in I which are violated.</Paragraph>
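Algorithm 1 itself is not reproduced in this extract, but the loop it describes is short. A schematic rendering in Python, treating solve and violated as the abstract operations just defined (the paper's actual implementation was Java over lp_solve):

```python
def incremental_ilp(Bx, Ix, Ox, Vx, solve, violated):
    """Incremental cutting-plane loop of Algorithm 1 (schematic).

    Bx: constraints added in advance;  Ix: incremental constraint pool;
    Ox: objective function;  Vx: variables with integer declarations;
    solve(C, O, V): maximise O subject to constraints C over variables V;
    violated(y, I): the constraints in I that solution y violates.
    """
    C = list(Bx)
    while True:
        y = solve(C, Ox, Vx)      # solve the current relaxation
        W = violated(y, Ix)       # e.g. cycles present in y
        if not W:
            return y              # y satisfies every constraint in Ix
        C.extend(W)               # forbid the observed violations, re-solve
```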
<Paragraph position="8"> The number of iterations required in this approach is at most polynomial with respect to the number of variables (Grötschel et al., 1981). In practice, this technique converges quickly (fewer than 20 iterations in 99% of approximately 12,000 sentences), yielding average solve times of less than 0.5 seconds.</Paragraph>
<Paragraph position="9"> Our approach converges quickly due to the quality of the scoring function: its weights have been trained on cycle-free data, so it is more likely to guide the search to a cycle-free solution. In the following we present the objective function Ox, variables Vx and linear constraints Bx and Ix needed for parsing x using Algorithm 1.</Paragraph>
<Paragraph position="10"> Vx contains a set of binary variables to represent labelled edges:
e_{i,j,l} \quad \forall i \in 0..n, \; j \in 1..n, \; l \in best_k(i,j)
where n is the number of tokens and the index 0 represents the root token. best_k(i,j) is the set of the k labels with the highest s(i,j,l). e_{i,j,l} equals 1 if there is an edge (i.e., a dependency) with label l from token i (head) to token j (child), and 0 otherwise. k depends on the type of constraints we want to add. For the plain MST problem it is sufficient to set k = 1 and only take the best scoring label for each token pair. However, if we want a constraint which forbids duplicate subjects we need to provide additional labels to fall back on.</Paragraph>
<Paragraph position="11"> Vx also contains a set of binary auxiliary variables:
d_{i,j} \quad \forall i \in 0..n, \; j \in 1..n
which represent the existence of a dependency between tokens i and j. We connect these to the e_{i,j,l} variables by the constraint:</Paragraph>
<Paragraph position="12"> d_{i,j} = \sum_{l \in best_k(i,j)} e_{i,j,l} </Paragraph>
<Paragraph position="13"> Given the above variables our objective function Ox can be represented as (using a suitable k):</Paragraph>
<Paragraph position="14"> \sum_{i,j} \sum_{l \in best_k(i,j)} s(i,j,l) \cdot e_{i,j,l} </Paragraph>
<Paragraph position="15"> We first introduce the set of base constraints Bx which we add in advance.</Paragraph>
<Paragraph position="16"> Only One Head (T1) Every token has exactly one head:
\sum_i d_{i,j} = 1
for non-root tokens j > 0 in x. An exception is made for the artificial root node:</Paragraph>
<Paragraph position="17"> \sum_i d_{i,0} = 0 </Paragraph>
<Paragraph position="18"> Now we present the set of constraints Ix we add incrementally. The constraints are chosen based on two criteria: (1) adding them to the base constraints (those added in advance) would result in an extremely large program, and (2) it must be efficient to detect whether the constraint is violated in y.</Paragraph>
<Paragraph position="19"> No Cycles (T2) For every possible cycle c for the sentence x we have a constraint which forbids the case where all edges in c are active simultaneously:
\sum_{(i,j) \in c} d_{i,j} \le |c| - 1 </Paragraph>
<Paragraph position="20"> Comma Coordination (C3) For each symmetric conjunction token i which forms a symmetric coordination and each set of tokens A in x to the left of i with no comma between each pair of successive tokens we add:
\sum_{j \in A} d_{i,j} \le |A| - 1 </Paragraph>
<Paragraph position="21"> Selective Projective Parsing (P1) For each pair of crossing triplets (i, j, l1) and (m, n, l2) we add the constraint:
e_{i,j,l_1} + e_{m,n,l_2} \le 1
if l1 or l2 is in P.</Paragraph>
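To make the construction concrete, here is a sketch of how the variables, objective and base constraints fit together, written against the PuLP modelling library. PuLP is our assumption for illustration only; the paper used lp_solve through a JNI wrapper. The scores dictionary, mapping (i, j, l) to s(i, j, l) over the best-k labels, is likewise assumed:

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def build_base_program(n, scores):
    """Base ILP (objective and T1) for a sentence with tokens 1..n, root 0.

    Cycle constraints (T2) and the linguistic constraints are deliberately
    NOT added here; they enter incrementally as they are found violated.
    """
    prob = LpProblem("dependency_parse", LpMaximize)
    e = {k: LpVariable(f"e_{k[0]}_{k[1]}_{k[2]}", cat=LpBinary) for k in scores}
    d = {(i, j): LpVariable(f"d_{i}_{j}", cat=LpBinary)
         for i in range(n + 1) for j in range(1, n + 1) if i != j}

    # Objective: total score of the active labelled edges.
    prob += lpSum(scores[k] * e[k] for k in scores)

    # Link each d_{i,j} to its labelled edges: d_{i,j} = sum_l e_{i,j,l}.
    for (i, j), var in d.items():
        prob += var == lpSum(e[k] for k in scores if k[:2] == (i, j))

    # T1: every non-root token has exactly one head. No d variable has the
    # root as child, which enforces the root exception by construction.
    for j in range(1, n + 1):
        prob += lpSum(d[i, j] for i in range(n + 1) if i != j) == 1

    return prob, e, d
```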
</Section> <Section position="2" start_page="132" end_page="132" type="sub_section"> <SectionTitle> 3.2 Training </SectionTitle>
<Paragraph position="0"> For training we use single-best MIRA (McDonald et al., 2005a). This is an online algorithm that learns by parsing each sentence and comparing the result with a gold standard. Training is complete after multiple passes through the whole corpus. Thus, during training we decode using the Chu-Liu-Edmonds (CLE) algorithm, due to its speed advantage over ILP (see Section 5.2 for a detailed comparison of runtimes).</Paragraph>
<Paragraph position="1"> The fact that we decode differently during training (using CLE) and testing (using ILP) may degrade performance: in the presence of the additional constraints the weights might otherwise capture other aspects of the data.</Paragraph> </Section> </Section>
<Section position="6" start_page="132" end_page="133" type="metho"> <SectionTitle> 4 Experimental Set-up </SectionTitle>
<Paragraph position="0"> Our experiments were designed to answer the following questions: 1. How much do our additional constraints improve accuracy? 2. How fast is our generic inference method in comparison with the Chu-Liu-Edmonds algorithm? 3. Can approximations be used to increase the speed of our method while remaining accurate? Before we answer these questions we briefly describe our data, the features used, the settings for U and P in our parametric constraints, our working environment and our implementation.</Paragraph>
<Section position="1" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle>
<Paragraph position="0"> We use the Alpino treebank (van der Beek et al., 2002), taken from the CoNLL shared task on multilingual dependency parsing. The CoNLL data differs slightly from the original Alpino treebank, as the corpus has been part-of-speech tagged using a Memory-Based Tagger (Daelemans et al., 1996).</Paragraph>
<Paragraph position="1"> It consists of 13,300 sentences with an average length of 14.6 tokens. The data is non-projective; more specifically, 5.4% of all dependencies are crossed by at least one other dependency. It contains approximately 6,000 sentences more than the Alpino corpus used by Malouf and van Noord (2004).</Paragraph>
<Paragraph position="2"> The training set was divided into a 10% development set (dev), while the remaining 90% is used as a training and cross-validation set (cross). Feature sets, constraints and training parameters were selected by training on cross and optimising against dev.</Paragraph>
<Paragraph position="3"> Our final accuracy scores and runtime evaluations were acquired using nine-fold cross-validation on cross.</Paragraph> </Section>
<Section position="2" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 4.2 Environment and Implementation </SectionTitle>
<Paragraph position="0"> All our experiments were conducted on an Intel Xeon at 3.8 GHz with 4 GB of RAM. We used the open-source Mixed Integer Programming library lp_solve to solve the Integer Linear Programs. Our code ran in Java and called the JNI wrapper around the lp_solve library.</Paragraph> </Section>
<Section position="3" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 4.3 Feature Sets </SectionTitle>
<Paragraph position="0"> Our feature set was determined through experimentation with the development set. The features are based upon the data provided within the Alpino treebank. Along with POS tags the corpus contains several additional attributes such as gender, number and case.</Paragraph>
<Paragraph position="1"> Our best results on the development set were achieved using the feature set of McDonald et al. (2005a) together with a set of features based on the additional attributes. These features combine the attributes of the head with those of the child. For example, if token i has the attributes a1 and a2, and token j has the attribute a3, then we create the features (a1 ∧ a3) and (a2 ∧ a3).</Paragraph>
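A sketch of this attribute crossing follows; the attribute names are hypothetical, and the real templates follow McDonald et al. (2005a):

```python
def attribute_features(head_attrs, child_attrs):
    """Conjoin every attribute of the head with every attribute of the child.

    Each feature string encodes one (head attribute, child attribute) pair,
    the (a1 ∧ a3), (a2 ∧ a3) pattern described above.
    """
    return [f"{ha}={hv}&{ca}={cv}"
            for ha, hv in sorted(head_attrs.items())
            for ca, cv in sorted(child_attrs.items())]

print(attribute_features({"a1": "x", "a2": "y"}, {"a3": "z"}))
# ['a1=x&a3=z', 'a2=y&a3=z']
```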
</Section> <Section position="4" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 4.4 Constraints </SectionTitle>
<Paragraph position="0"> All the constraints presented in Section 3 were used in our model. The set U of unique-label constraints contained su, obj1, obj2, sup, ld, vc, predc, predm, pc, pobj1, obcomp and body. Here su stands for subject and obj1 for direct object (for full details see Moortgat et al. (2000)).</Paragraph>
<Paragraph position="1"> The set of projective labels P contained cnj, for coordination dependencies, and det, for determiner dependencies. One exception was added for the coordination constraint: dependencies can cross when coordinated arguments are verbs.</Paragraph>
<Paragraph position="2"> One drawback of hard deterministic constraints is the undesirable effect noisy data can cause. We see this most prominently with coordination argument compatibility: words ending in en are typically wrongly tagged and cause our coordination argument constraint to discard correct coordinations. As a workaround we assigned words ending in en a wildcard POS tag which is compatible with all POS tags.</Paragraph> </Section> </Section>
<Section position="7" start_page="133" end_page="135" type="metho"> <SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> In this section we report our results. We not only present our accuracy but also provide an empirical evaluation of the runtime behaviour of this approach, and show how parsing can be accelerated using a simple approximation.</Paragraph>
<Section position="1" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 5.1 Accuracy </SectionTitle>
<Paragraph position="0"> An important question to answer when using global constraints is: how much of a performance boost do they provide? We ran the system without any linguistic constraints as a baseline (bl) and compared it to a system with the additional constraints (cnstr). To evaluate our systems we use the accuracy over labelled attachment decisions:</Paragraph>
<Paragraph position="1"> LAC = N_l / N_t </Paragraph>
<Paragraph position="2"> where N_l is the number of tokens with correct head and label and N_t is the total number of tokens. For completeness we also report the unlabelled accuracy:</Paragraph>
<Paragraph position="3"> UAC = N_u / N_t </Paragraph>
<Paragraph position="4"> where N_u is the number of tokens with correct head.</Paragraph>
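Both scores are simple ratios; a minimal sketch of computing them from per-token (head, label) pairs:

```python
def accuracies(pred, gold):
    """Labelled (LAC) and unlabelled (UAC) attachment accuracy.

    pred and gold are parallel lists of (head, label) pairs, one per token.
    """
    Nt = len(gold)
    Nu = sum(ph == gh for (ph, _), (gh, _) in zip(pred, gold))  # head correct
    Nl = sum(p == g for p, g in zip(pred, gold))                # head and label correct
    return Nl / Nt, Nu / Nt

lac, uac = accuracies([(2, "su"), (0, "root")], [(2, "obj1"), (0, "root")])
print(lac, uac)  # 0.5 1.0: both heads correct, one label wrong
```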
Table 1 (caption): Labelled and unlabelled accuracy using nine-fold cross-validation on cross for the baseline (bl) and constraint-based (cnstr) systems. LC and UC are the percentages of sentences with 100% labelled and unlabelled accuracy, respectively.
<Paragraph position="5"> Table 1 shows our results using nine-fold cross-validation on the cross set. The baseline system (no additional constraints) gives a labelled accuracy of 84.6% and an unlabelled accuracy of 88.9%. When we add our linguistic constraints, performance increases by 0.5% for both labelled and unlabelled accuracy. This increase is significant according to both Dan Bikel's parse comparison script (p < 0.001) and the Sign test (p < 0.001).</Paragraph>
<Paragraph position="6"> Now we give a little insight into how our results compare with the rest of the community. The reported state-of-the-art parser of Malouf and van Noord (2004) achieves 84.4% labelled accuracy, which is numerically very close to our baseline. However, they use a subset of the CoNLL Alpino treebank with a higher average number of tokens per sentence and also evaluate control relations, so the results are not directly comparable. We have also run our parser on the relatively small (approximately 400 sentences) CoNLL test data. The best performing system (McDonald et al., 2006; note that this system is different from our baseline) achieves 79.2% labelled accuracy, while our baseline system achieves 78.6% and our constrained version 79.8%. However, a significant difference is only observed between our baseline and our constraint-based system.</Paragraph>
<Paragraph position="7"> Examining the errors produced on the dev set highlights two key reasons why we do not see a greater improvement with our constraint-based system. Firstly, we cannot improve on coordinations that include words ending in en, due to the workaround presented in Section 4.4. This problem can only be solved by improving POS taggers for Dutch or by performing POS tagging within the dependency parsing framework.</Paragraph>
<Paragraph position="8"> Secondly, our system suffers from poor next-best solutions. That is, if the best solution violates some constraints, then we find the next-best solution is typically worse than the best solution with violated constraints. This appears to be a consequence of inaccurate local score distributions (as opposed to inaccurate best local scores). For example, suppose we attach two subjects, t1 and t2, to a verb, where t1 is the actual subject while t2 is meant to be labelled as object. If we forbid this configuration (two subjects) and the score of labelling t1 object is higher than that of labelling t2 subject, then the next-best solution will label t1 incorrectly as object and t2 incorrectly as subject. This is often the case, and thus results in a drop in accuracy.</Paragraph> </Section>
<Section position="2" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 5.2 Runtime Evaluation </SectionTitle>
<Paragraph position="0"> We now concentrate on the runtime of our method. While we expect a longer runtime than when using Chu-Liu-Edmonds as in previous work (McDonald et al., 2005b), we are interested in how large the increase is.</Paragraph>
<Paragraph position="1"> Table 2 shows the average solve time (ST) for sentences with respect to the number of tokens in each sentence, for our system with constraints (cnstr) and the Chu-Liu-Edmonds (CLE) algorithm. Solve times do not include feature extraction, as this is identical for all systems. For cnstr we also show the number of sentences that could not be parsed after two minutes, the average number of iterations and the average duration of the first iteration.</Paragraph>
<Paragraph position="2"> The results show that parsing using our generic approach is still reasonably fast, although significantly slower than using the Chu-Liu-Edmonds algorithm. Also, only a small number of sentences take longer than two minutes to parse. Thus, in practice we would not see a significant degradation in performance if we were to fall back on the CLE algorithm after two minutes of solving.</Paragraph>
<Paragraph position="3"> When we examine the average duration of the first iteration it appears that the majority of the solve time is spent within this iteration. This could be used to justify using the CLE algorithm to find an initial solution as a starting point for the ILP solver (see Section 6).</Paragraph> </Section>
<Section position="3" start_page="134" end_page="135" type="sub_section"> <SectionTitle> 5.3 Approximation </SectionTitle>
<Paragraph position="0"> Despite the fact that our parser can parse all sentences in a reasonable amount of time, it is still significantly slower than the CLE algorithm. While this is not crucial during decoding, it does make discriminative online training difficult, as training requires several iterations of parsing the whole corpus.</Paragraph>
Table 2 (caption): Average solve time (ST) for the system with constraints (cnstr) and the Chu-Liu-Edmonds algorithm (CLE), the number of sentences with solve times greater than 120 seconds, the average number of iterations, and the first-iteration solve time.
Table 3 (caption): Solve time (ST) for the cross dataset using varying q values and the Chu-Liu-Edmonds algorithm (CLE).
<Paragraph position="1"> Thus we investigate whether it is possible to speed up our inference using a simple approximation. For each token we now only consider the q variables in Vx with the highest-scoring edges. For example, if we set q = 2 the set of variables for a token j will contain two variables, either both for the same head i but with different labels (variables e_{i,j,l_1} and e_{i,j,l_2}) or for two distinct heads i1 and i2 (variables e_{i_1,j,l_1} and e_{i_2,j,l_2}), where labels l1 and l2 may be identical.</Paragraph>
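The pruning step itself is straightforward; a sketch (ours) of restricting each token to its q highest-scoring candidate edges before the variables in Vx are created:

```python
import heapq

def prune_candidates(scores, n, q):
    """Keep, for each child token j, only the q highest-scoring (i, j, l).

    scores maps (i, j, l) -> s(i, j, l); the pruned dictionary is what
    the variable set Vx is then built from.
    """
    pruned = {}
    for j in range(1, n + 1):
        cands = [(k, s) for k, s in scores.items() if k[1] == j]
        for k, s in heapq.nlargest(q, cands, key=lambda ks: ks[1]):
            pruned[k] = s
    return pruned

# With q = 2, a token keeps its two best edges, which may share a head
# (two labels) or have two distinct heads, as described above.
```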
<Paragraph position="2"> Table 3 shows the effect of different q values on solve time for the full corpus cross (roughly 12,000 sentences) and on overall accuracy. We see that solve time can be reduced by 80% while losing only a marginal amount of accuracy when we set q to 10. However, we are unable to reach the 20-second solve time of the CLE algorithm. Despite this, when we add the time for preprocessing and feature extraction, the CLE system parses the corpus in around 15 minutes whereas our system with q = 10 takes approximately 25 minutes (even when caching feature extraction during training, McDonald et al. (2005a) still takes approximately 10 minutes to train). When we view the total runtime of each system we see that our system is more competitive.</Paragraph> </Section> </Section>
<Section position="8" start_page="135" end_page="136" type="metho"> <SectionTitle> 6 Discussion </SectionTitle>
<Paragraph position="0"> While we have presented significant improvements using additional constraints, one may wonder whether the improvements are large enough to justify further research in this direction, especially since McDonald and Pereira (2006) present an approximate algorithm which also makes more global decisions. However, we believe that our approach is complementary to their model: we can model higher-order features by using an extended set of variables and a modified objective function.</Paragraph>
<Paragraph position="1"> Although this is likely to increase runtime, it may still be fast enough for real-world applications. In addition, it will allow exact inference, even in the case of non-projective parsing. We also argue that this approach has potential for interesting extensions and applications.</Paragraph>
<Paragraph position="2"> For example, during our runtime evaluations we found that a large fraction of solve time is spent in the first iteration of our incremental algorithm. After the first iteration the solver uses its last state to efficiently search for solutions in the presence of new constraints. Some solvers allow the specification of an initial solution as a starting point, so significant speed improvements can be expected from using the CLE algorithm to provide an initial solution.</Paragraph>
<Paragraph position="3"> Our approach uses a generic algorithm to solve a complex task, so other applications may benefit from it. For instance, Germann et al. (2001) present an ILP formulation of the Machine Translation (MT) decoding task in order to conduct exact inference. However, their model suffers from the same type of exponential blow-up we observe when we add all our cycle constraints in advance.</Paragraph>
<Paragraph position="4"> In fact, the constraints which cause the exponential explosion in their graphical formulation are of the same nature as our cycle constraints. We hope that the incremental approach will allow exact MT decoding for longer sentences.</Paragraph> </Section> </Paper>