<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1011"> <Title>Online Learning of Approximate Dependency Parsing Algorithms</Title> <Section position="3" start_page="0" end_page="83" type="metho"> <SectionTitle> 2 Maximum Spanning Tree Parsing </SectionTitle>
[Figure: root John saw a dog yesterday which was a Yorkshire Terrier]
<Paragraph> Dependency-tree parsing as the search for the maximum spanning tree (MST) in a graph was proposed by McDonald et al. (2005c). This formulation leads to efficient parsing algorithms for both projective and non-projective dependency trees, with the Eisner algorithm (Eisner, 1996) and the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) respectively. The formulation works by defining the score of a dependency tree to be the sum of its edge scores,
s(x, y) = \sum_{(i,j) \in y} s(i, j)
where x = x_1 \cdots x_n is an input sentence and y a dependency tree for x. We can view y as a set of tree edges and write (i, j) \in y to indicate an edge in y from word x_i to word x_j. Consider the example from Figure 1, where the subscripts index the nodes of the tree. The score of this tree is then simply the sum of the edge scores s(i, j) over all of its edges.</Paragraph>
<Paragraph> We call this first-order dependency parsing, since scores are restricted to a single edge in the dependency tree. The score of an edge is in turn computed as the inner product of a high-dimensional feature representation of the edge with a corresponding weight vector,
s(i, j) = w \cdot f(i, j)
This is a standard linear classifier in which the weight vector w contains the parameters to be learned during training. We should note that f(i, j) can be based on arbitrary features of the edge and the input sequence x.</Paragraph>
<Paragraph> Given a directed graph G = (V, E), the maximum spanning tree (MST) problem is to find the highest scoring subgraph of G that satisfies the tree constraint over the vertices V. By defining a graph in which the words in a sentence are the vertices and there is a directed edge between all pairs of words, with scores calculated as above, McDonald et al. (2005c) showed that dependency parsing is equivalent to finding the MST in this graph. Furthermore, it was shown that this formulation can lead to state-of-the-art results when combined with discriminative learning algorithms.</Paragraph>
<Paragraph> Although the MST formulation applies to any directed graph, our feature representations and one of the parsing algorithms (Eisner's) rely on a linear ordering of the vertices, namely the order of the words in the sentence.</Paragraph>
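<Paragraph> To make the first-order factorization concrete, the following is a minimal sketch, not the authors' implementation. The feature extractor features(x, i, j), the sparse weight dictionary w, and the edge-set representation of y are assumptions made purely for illustration.
def edge_score(w, x, i, j, features):
    # s(i, j) = w . f(i, j), computed as a sparse dot product over active features
    return sum(w.get(f, 0.0) for f in features(x, i, j))

def tree_score(w, x, edges, features):
    # s(x, y): sum of edge scores over a tree y given as (head, modifier) pairs
    return sum(edge_score(w, x, i, j, features) for (i, j) in edges)
</Paragraph>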
<Section position="1" start_page="81" end_page="82" type="sub_section"> <SectionTitle> 2.1 Second-Order MST Parsing </SectionTitle>
<Paragraph> Restricting scores to a single edge in a dependency tree gives a very impoverished view of dependency parsing. Yamada and Matsumoto (2003) showed that keeping a small amount of parsing history was crucial to improving parsing performance for their locally-trained shift-reduce SVM parser. It is reasonable to assume that other parsing models might benefit from features over previous decisions.</Paragraph>
<Paragraph> Here we will focus on methods for parsing second-order spanning trees. These models factor the score of the tree into the sum of adjacent edge pair scores. To quantify this, consider again the example from Figure 1. In the second-order spanning tree model, the score of the tree is the sum of second-order scores s(i, k, j), one for each pair of adjacent edges. Here s(i, k, j) is the score of creating a pair of adjacent edges, from word x_i to words x_k and x_j. For instance, s(2, 4, 5) is the score of creating the edges from hit to with and from hit to ball. The score functions are relative to the left or right of the parent, and we never score adjacent edges that are on different sides of the parent (for instance, there is no s(2, 1, 4) for the adjacent edges from hit to John and ball). This independence between left and right descendants allows us to use an O(n^3) second-order projective parsing algorithm, as we will see later. We write s(x_i, -, x_j) when x_j is the first left or first right dependent of word x_i. For example, s(2, -, 4) is the score of creating a dependency from hit to ball, since ball is the first child to the right of hit. More formally, if the word x_{i_0} has left children x_{i_1}, ..., x_{i_j} (ordered from farthest to closest) and right children x_{i_{j+1}}, ..., x_{i_m} (ordered from closest to farthest), the score factors as follows:
\sum_{k=1}^{j-1} s(i_0, i_{k+1}, i_k) + s(i_0, -, i_j) + s(i_0, -, i_{j+1}) + \sum_{k=j+1}^{m-1} s(i_0, i_k, i_{k+1})
This second-order factorization subsumes the first-order factorization, since the score function could simply ignore the middle argument to simulate first-order scoring. The score of a tree for second-order parsing is now
s(x, y) = \sum_{(i,k,j)} s(i, k, j)
where k and j are adjacent, same-side children of x_i in the tree y (with k = - when x_j is the first child on its side).</Paragraph>
<Paragraph> The second-order model allows us to condition on the most recent parsing decision, that is, the last dependent picked up by a particular word, which is analogous to the Markov conditioning in the Charniak parser (Charniak, 2000).</Paragraph>
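<Paragraph> As an informal illustration of this factorization (a sketch only; the score function s and the ordered child lists are assumptions of the example), the contribution of a single head and its same-side children can be computed as follows, with None standing in for the '-' marker:
def head_score(s, i, children_left, children_right):
    # children on each side are ordered from closest to farthest from the head x_i
    total = 0.0
    for side in (children_left, children_right):
        prev = None                       # '-': the next child is the first on this side
        for child in side:
            total += s(i, prev, child)    # s(i, k, j): k is the previously attached sibling
            prev = child
    return total
Summing head_score over every word in the sentence reproduces the second-order tree score defined above.</Paragraph>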
</Section>
<Section position="2" start_page="82" end_page="82" type="sub_section"> <SectionTitle> 2.2 Exact Projective Parsing </SectionTitle>
<Paragraph> For projective MST parsing, the first-order algorithm can be extended to the second-order case, as was noted by Eisner (1996). The intuition behind the algorithm is shown graphically in Figure 3, which displays both the first-order and second-order algorithms. In the first-order algorithm, a word gathers its left and right dependents independently, gathering each half of the subtree rooted by its dependent in separate stages. By splitting chart items into left and right components, the Eisner algorithm only requires 3 indices to be maintained at each step, as discussed in detail elsewhere (Eisner, 1996; McDonald et al., 2005b). For the second-order algorithm, the key insight is to delay the scoring of edges until pairs of dependents have been gathered. This allows pairs of adjacent dependents to be collected in a single stage, which permits the incorporation of second-order scores while maintaining cubic-time parsing.
[Figure 3 (caption, partially recovered): ... shows how h1 creates a dependency to h3 with the second-order knowledge that the last dependent of h1 was h2. This is done through the creation of a sibling item in part (B). In the first-order model, the dependency to h3 is created after the algorithm has forgotten that h2 was the last dependent.]
</Paragraph>
<Paragraph> The Eisner algorithm can be extended to an arbitrary mth-order model with a complexity of O(n^{m+1}), for m > 1. An mth-order parsing algorithm works similarly to the second-order algorithm, except that we collect m pairs of adjacent dependents in succession before attaching them to their parent.</Paragraph>
</Section>
<Section position="3" start_page="82" end_page="83" type="sub_section"> <SectionTitle> 2.3 Approximate Non-projective Parsing </SectionTitle>
<Paragraph> Unfortunately, second-order non-projective MST parsing is NP-hard, as shown in Appendix A. To circumvent this, we designed an approximate algorithm based on the exact O(n^3) second-order projective Eisner algorithm. The approximation works by first finding the highest scoring projective parse. It then rearranges edges in the tree, one at a time, as long as such rearrangements increase the overall score and do not violate the tree constraint. We can easily motivate this approximation by observing that even in non-projective languages like Czech and Danish, most trees are primarily projective, with only a few non-projective edges (Nivre and Nilsson, 2005). Thus, by starting with the highest scoring projective tree, we are typically only a small number of transformations away from the highest scoring non-projective tree.</Paragraph>
<Paragraph> The algorithm is shown in Figure 4. The expression y[i → j] denotes the dependency graph identical to y except that x_j's parent is x_i instead of what it was in y. The test tree(y) is true iff the dependency graph y satisfies the tree constraint.
Figure 4: Approximate second-order non-projective parsing algorithm.
1. Let y = 2-order-proj(x, s)
2. while true
3. m = -∞, c = -1, p = -1
4. for j : 1..n
5. for i : 0..n
6. y' = y[i → j]
7. if ¬tree(y') or ∃k : (i, k, j) ∈ y, continue
8. d = s(x, y') - s(x, y)
9. if d > m
10. m = d, c = j, p = i
11. end for
12. end for
13. if m > 0
14. y = y[p → c]
15. else return y
16. end while
</Paragraph>
<Paragraph> In more detail, line 1 of the algorithm sets y to the highest scoring second-order projective tree. The loop of lines 2-16 exits only when no further score improvement is possible. Each iteration seeks the single highest-scoring parent change to y that does not break the tree constraint. To that effect, the nested loops starting in lines 4 and 5 enumerate all (i, j) pairs. Line 6 sets y' to the dependency graph obtained from y by changing x_j's parent to x_i. Line 7 checks that the move from y to y' is valid, by testing that x_j's parent was not already x_i and that y' is a tree. Line 8 computes the score change from y to y'. If this change is larger than the previous best change, we record how the new tree was created (lines 9-10). After considering all possible valid edge changes to the tree, the algorithm checks whether the best new tree does in fact have a higher score. If so, we change the tree permanently and re-enter the loop. Otherwise we exit, since there are no single edge changes that can improve the score.</Paragraph>
<Paragraph> This algorithm allows for the introduction of non-projective edges because we do not restrict any of the edge changes except to maintain the tree property. In fact, if any edge change is ever made, the resulting tree is guaranteed to be non-projective; otherwise there would have been a higher scoring projective tree, which would already have been found by the exact projective parsing algorithm.</Paragraph>
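<Paragraph> The following is a minimal sketch of the rearrangement loop just described, not the authors' implementation. The helpers projective_parse (the exact second-order projective parser), score (the tree score s(x, y)) and is_tree (the tree test) are assumed, and a parse is represented as a mutable parent list indexed by modifier position, with index 0 reserved for the artificial root.
def approx_nonprojective(x, score, projective_parse, is_tree, max_changes=200):
    # Hill-climbing approximation: start from the best projective tree and greedily
    # apply the single highest-scoring parent change that keeps the structure a tree.
    y = projective_parse(x)                  # line 1: exact second-order projective parse
    best = score(x, y)                       # y[j] is the parent of word j
    n = len(y)
    for _ in range(max_changes):             # optional bound M on edge transformations
        gain, move = 0.0, None
        for j in range(1, n):                # candidate modifier x_j
            for i in range(n):               # candidate new parent x_i
                if i == j or y[j] == i:      # line 7: the parent must actually change
                    continue
                y_new = list(y)
                y_new[j] = i                 # the tree y[i -> j]
                if not is_tree(y_new):       # line 7: must remain a tree
                    continue
                d = score(x, y_new) - best   # line 8: change in score
                if d > gain:
                    gain, move = d, (i, j)   # lines 9-10: remember the best change
        if move is None:
            return y                         # no single change improves the score
        i, j = move
        y[j] = i                             # apply the best change and re-enter the loop
        best += gain
    return y
In practice the score change on line 8 can be computed incrementally from the affected second-order scores rather than by rescoring the whole tree, which is one way to obtain the O(n^2) cost per iteration mentioned below.</Paragraph>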
<Paragraph> It is not difficult to find examples for which this approximation will terminate without returning the highest-scoring non-projective parse. It is clear, however, that the approximation will always terminate: there are only a finite number of dependency trees for any given sentence, and each iteration of the loop requires an increase in score to continue. However, the loop could potentially take exponential time, so we bound the number of edge transformations by a fixed value M. It is easy to argue that this will not hurt performance. Even in freer-word-order languages such as Czech, almost all non-projective dependency trees are primarily projective, modulo a few non-projective edges. Thus, if our inference algorithm starts with the highest scoring projective parse, the best non-projective parse only differs by a small number of edge transformations. Furthermore, it is easy to show that each iteration of the loop takes O(n^2) time, resulting in an O(n^3 + Mn^2) runtime algorithm. In practice, the approximation terminates after a small number of transformations, and we did not need to bound the number of iterations in our experiments.</Paragraph>
<Paragraph> We should note that this is only one of many possible approximations we could have made. Another reasonable approach would be to first find the highest scoring first-order non-projective parse, and then rearrange edges based on second-order scores in a similar manner to the algorithm we described. We implemented this method and found that the results were slightly worse.</Paragraph>
</Section> </Section>
<Section position="4" start_page="83" end_page="84" type="metho"> <SectionTitle> 3 Danish: Parsing Secondary Parents </SectionTitle>
<Paragraph> Kromann (2001) argued for a dependency formalism called Discontinuous Grammar and annotated a large set of Danish sentences using this formalism to create the Danish Dependency Treebank (Kromann, 2003). The formalism allows a word to have multiple parents. Examples include verb coordination, in which the subject or object is an argument of several verbs, and relative clauses, in which words must satisfy dependencies both inside and outside the clause. An example is shown in Figure 5 for the sentence He looks for and sees elephants. Here, the pronoun He is the subject of both verbs in the sentence, and the noun elephants is the corresponding object.
[Figure 5: root Han spejder efter og ser elefanterne ("He looks for and sees elephants"), from Kromann (2003)]
In the Danish Dependency Treebank, roughly 5% of words have more than one parent, which breaks the single-parent (or tree) constraint we have previously required of dependency structures. Kromann also allows for cyclic dependencies, though we deal only with acyclic dependency graphs here. Though less common than trees, dependency graphs involving multiple parents are well established in the literature (Hudson, 1984). Unfortunately, the problem of finding the dependency structure with highest score in this setting is intractable (Chickering et al., 1994).</Paragraph>
<Paragraph> To create an approximate parsing algorithm for dependency structures with multiple parents, we start with our approximate second-order non-projective algorithm outlined in Figure 4. We use the non-projective algorithm since the Danish Dependency Treebank contains a small number of non-projective arcs. We then modify lines 7-10 of this algorithm so that it considers both changing the parent of a word and adding a new parent, choosing whichever single change causes the largest increase in overall score without creating a cycle.^2 As before, we make one change per iteration, and that change is the one that yields the highest-scoring new structure; a sketch of the modified inner step appears below. Using this simple new approximate parsing algorithm, we train a new parser that can produce multiple parents.
^2 We are not concerned with violating the tree constraint.</Paragraph>
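<Paragraph> The sketch below is an illustration only, with hypothetical helpers score and is_acyclic, and a parse represented as a set of (head, modifier) edges over word indices 0..n-1, with 0 the root. It shows one way the inner step of the rearrangement loop could be generalized to consider both parent changes and parent additions:
def best_single_move(x, edges, n, score, is_acyclic):
    # Return the edge set after the single best parent change or parent addition,
    # or the original edges if no such move improves the score without creating a cycle.
    base = score(x, edges)
    gain, best = 0.0, None
    for j in range(1, n):                                  # never give the root a parent
        parents = [i for (i, m) in edges if m == j]
        for i in range(n):
            if i == j or (i, j) in edges:
                continue
            candidates = [edges | {(i, j)}]                # add x_i as an extra parent of x_j
            candidates += [(edges - {(p, j)}) | {(i, j)}   # or swap out an existing parent
                           for p in parents]
            for cand in candidates:
                if not is_acyclic(cand):
                    continue
                d = score(x, cand) - base
                if d > gain:
                    gain, best = d, cand
    return best if best is not None else edges
</Paragraph>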
</Section>
<Section position="5" start_page="84" end_page="85" type="metho"> <SectionTitle> 4 Online Learning and Approximate Inference </SectionTitle>
<Paragraph> In this section, we review the work of McDonald et al. (2005b) on online large-margin dependency parsing. As usual for supervised learning, we assume a training set T = {(x_t, y_t)}_{t=1}^{T}, consisting of pairs of a sentence x_t and its correct dependency representation y_t.</Paragraph>
<Paragraph> The algorithm is an extension of the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003) to learning with structured outputs, in the present case dependency structures. Figure 6 gives pseudo-code for the algorithm. An online learning algorithm considers a single training instance for each update to the weight vector w. We use the common method of setting the final weight vector to the average of the weight vectors after each iteration (Collins, 2002), which has been shown to alleviate overfitting.
Figure 6: MIRA learning algorithm.
Training data: T = {(x_t, y_t)}_{t=1}^{T}
1. w^(0) = 0; v = 0; i = 0
2. for n : 1..N
3. for t : 1..T
4. w^(i+1) = arg min_w ||w - w^(i)|| s.t. s(x_t, y_t; w) - s(x_t, y'; w) >= L(y_t, y'), where y' = arg max_y s(x_t, y; w^(i))
5. v = v + w^(i+1)
6. i = i + 1
7. w = v / (N * T)
</Paragraph>
<Paragraph> On each iteration, the algorithm considers a single training instance. We parse this instance to obtain a predicted dependency graph, and find the smallest-norm update to the weight vector w that ensures that the training graph outscores the predicted graph by a margin proportional to the loss of the predicted graph relative to the training graph, which is the number of words with incorrect parents in the predicted tree (McDonald et al., 2005b). Note that we only impose margin constraints between the single highest-scoring graph and the correct graph relative to the current weight setting. Past work on tree-structured outputs has used constraints for the k-best scoring trees (McDonald et al., 2005b) or even for all possible trees by using factored representations (Taskar et al., 2004; McDonald et al., 2005c). However, we have found that a single margin constraint per example leads to much faster training with a negligible degradation in performance. Furthermore, this formulation relates learning directly to inference, which is important, since we want the model to set weights relative to the errors made by an approximate inference algorithm. This algorithm can thus be viewed as a large-margin version of the perceptron algorithm for structured outputs (Collins, 2002).</Paragraph>
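<Paragraph> For a single margin constraint, the MIRA quadratic program has a closed-form solution. The following sketch is illustrative only; the sparse feature dictionaries for the gold and predicted parses and the hamming-style loss are assumptions of this example rather than the authors' code:
def mira_update(w, feats_gold, feats_pred, loss):
    # Smallest change to w such that the gold parse outscores the prediction
    # by a margin of at least loss (single-constraint MIRA update).
    diff = dict(feats_gold)
    for f, v in feats_pred.items():
        diff[f] = diff.get(f, 0.0) - v                              # f(x, y_t) - f(x, y')
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return w                                                    # prediction identical to gold
    margin = sum(w.get(f, 0.0) * v for f, v in diff.items())        # s(x, y_t) - s(x, y')
    tau = max(0.0, loss - margin) / norm_sq
    for f, v in diff.items():
        w[f] = w.get(f, 0.0) + tau * v
    return w
Here loss would be the number of words with an incorrect parent in the predicted graph, and the prediction itself comes from whichever (possibly approximate) inference algorithm is being trained against.</Paragraph>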
<Paragraph> Online learning algorithms have been shown to be robust even with approximate rather than exact inference in problems such as word alignment (Moore, 2005), sequence analysis (Daumé and Marcu, 2005; McDonald et al., 2005a) and phrase-structure parsing (Collins and Roark, 2004). This robustness to approximations comes from the fact that the online framework sets weights with respect to inference. In other words, the learning method sees the errors commonly made by the approximate inference algorithm and can adjust the weights to correct for them. The work of Daumé and Marcu (2005) formalizes this intuition by presenting an online learning framework in which parameter updates are made directly with respect to errors in the inference algorithm. We show in the next section that this robustness extends to approximate dependency parsing.</Paragraph>
</Section> </Paper>