<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1012">
  <Title>Online Large-Margin Training of Dependency Parsers</Title>
  <Section position="3" start_page="91" end_page="92" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
2.1 Definitions and Background
</SectionTitle>
      <Paragraph position="0"> In what follows, the generic sentence is denoted by x (possibly subscripted); the ith word of x is denoted by xi. The generic dependency tree is denoted by y. If y is a dependency tree for sentence x, we write (i,j) [?] y to indicate that there is a directed edge from word xi to word xj in the tree, that is, xi is the parent of xj. T = {(xt,yt)}Tt=1 denotes the training data.</Paragraph>
      <Paragraph position="1"> We follow the edge based factorization method of Eisner (1996) and define the score of a dependency tree as the sum of the score of all edges in the tree,</Paragraph>
      <Paragraph position="3"> where f(i,j) is a high-dimensional binary feature representation of the edge from xi to xj. For example, in the dependency tree of Figure 1, the following feature would have a value of 1:</Paragraph>
      <Paragraph position="5"> In general, any real-valued feature may be used, but we use binary features for simplicity. The feature weights in the weight vector w are the parameters that will be learned during training. Our training algorithms are iterative. We denote by w(i) the weight vector after the ith training iteration.</Paragraph>
      <Paragraph position="6"> Finally we define dt(x) as the set of possible dependency trees for the input sentence x and bestk(x;w) as the set of k dependency trees in dt(x) that are given the highest scores by weight vector w, with ties resolved by an arbitrary but fixed rule.</Paragraph>
      <Paragraph position="7"> Three basic questions must be answered for models of this form: how to find the dependency tree y with highest score for sentence x; how to learn an appropriate weight vector w from the training data; and finally, what feature representation f(i,j) should be used. The following sections address each of these questions.</Paragraph>
    </Section>
    <Section position="2" start_page="91" end_page="92" type="sub_section">
      <SectionTitle>
2.2 Parsing Algorithm
</SectionTitle>
      <Paragraph position="0"> Given a feature representation for edges and a weight vector w, we seek the dependency tree or</Paragraph>
      <Paragraph position="2"> trees that maximize the score function, s(x,y). The primary difficulty is that for a given sentence of length n there are exponentially many possible dependency trees. Using a slightly modified version of a lexicalized CKY chart parsing algorithm, it is possible to generate and represent these sentences in a forest that is O(n5) in size and takes O(n5) time to create.</Paragraph>
      <Paragraph position="3"> Eisner (1996) made the observation that if the head of each chart item is on the left or right periphery, then it is possible to parse in O(n3). The idea is to parse the left and right dependents of a word independently and combine them at a later stage. This removes the need for the additional head indices of the O(n5) algorithm and requires only two additional binary variables that specify the direction of the item (either gathering left dependents or gathering right dependents) and whether an item is complete (available to gather more dependents). Figure 2 shows the algorithm schematically. As with normal CKY parsing, larger elements are created bottom-up from pairs of smaller elements.</Paragraph>
      <Paragraph position="4"> Eisner showed that his algorithm is sufficient for both searching the space of dependency parses and, with slight modification, finding the highest scoring tree y for a given sentence x under the edge factorization assumption. Eisner and Satta (1999) give a cubic algorithm for lexicalized phrase structures.</Paragraph>
      <Paragraph position="5"> However, it only works for a limited class of languages in which tree spines are regular. Furthermore, there is a large grammar constant, which is typically in the thousands for treebank parsers.</Paragraph>
    </Section>
    <Section position="3" start_page="92" end_page="92" type="sub_section">
      <SectionTitle>
2.3 Online Learning
</SectionTitle>
      <Paragraph position="0"> Figure 3 gives pseudo-code for the generic online learning setting. A single training instance is considered on each iteration, and parameters updated by applying an algorithm-specific update rule to the instance under consideration. The algorithm in Figure 3 returns an averaged weight vector: an auxiliary weight vector v is maintained that accumulates Training data: T = {(xt,yt)}Tt=1  1. w0 = 0; v = 0; i = 0 2. for n : 1..N 3. for t : 1..T 4. w(i+1) = update w(i) according to instance (xt,yt) 5. v = v + w(i+1) 6. i = i + 1 7. w = v/(N [?] T)  the values of w after each iteration, and the returned weight vector is the average of all the weight vectors throughout training. Averaging has been shown to help reduce overfitting (Collins, 2002).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="92" end_page="94" type="metho">
    <SectionTitle>
2.3.1 MIRA
</SectionTitle>
    <Paragraph position="0"> Crammer and Singer (2001) developed a natural method for large-margin multi-class classification, which was later extended by Taskar et al. (2003) to structured classification: minbardblwbardbl s.t. s(x,y) [?] s(x,yprime) [?] L(y,yprime) [?](x,y) [?] T , yprime [?] dt(x) where L(y,yprime) is a real-valued loss for the tree yprime relative to the correct tree y. We define the loss of a dependency tree as the number of words that have the incorrect parent. Thus, the largest loss a dependency tree can have is the length of the sentence. Informally, this update looks to create a margin between the correct dependency tree and each incorrect dependency tree at least as large as the loss of the incorrect tree. The more errors a tree has, the farther away its score will be from the score of the correct tree. In order to avoid a blow-up in the norm of the weight vector we minimize it subject to constraints that enforce the desired margin between the correct and incorrect trees1.</Paragraph>
    <Paragraph position="1">  The Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003; Crammer et al., 2003) employs this optimization directly within the online framework. On each update, MIRA attempts to keep the norm of the change to the parameter vector as small as possible, subject to correctly classifying the instance under consideration with a margin at least as large as the loss of the incorrect classifications. This can be formalized by substituting the following update into line 4 of the</Paragraph>
    <Paragraph position="3"> This is a standard quadratic programming problem that can be easily solved using Hildreth's algorithm (Censor and Zenios, 1997). Crammer and Singer (2003) and Crammer et al. (2003) provide an analysis of both the online generalization error and convergence properties of MIRA. In equation (1), s(x,y) is calculated with respect to the weight vector after optimization, w(i+1).</Paragraph>
    <Paragraph position="4"> To apply MIRA to dependency parsing, we can simply see parsing as a multi-class classification problem in which each dependency tree is one of many possible classes for a sentence. However, that interpretation fails computationally because a general sentence has exponentially many possible dependency trees and thus exponentially many margin constraints.</Paragraph>
    <Paragraph position="5"> To circumvent this problem we make the assumption that the constraints that matter for large margin optimization are those involving the incorrect trees yprime with the highest scores s(x,yprime). The resulting optimization made by MIRA (see Figure 3, line 4) would then be:</Paragraph>
    <Paragraph position="7"> reducing the number of constraints to the constant k.</Paragraph>
    <Paragraph position="8"> We tested various values of k on a development data set and found that small values of k are sufficient to achieve close to best performance, justifying our assumption. In fact, as k grew we began to observe a slight degradation of performance, indicating some overfitting to the training data. All the experiments presented here use k = 5. The Eisner (1996) algorithm can be modified to find the k-best trees while only adding an additional O(k logk) factor to the runtime (Huang and Chiang, 2005).</Paragraph>
    <Paragraph position="9"> A more common approach is to factor the structure of the output space to yield a polynomial set of local constraints (Taskar et al., 2003; Taskar et al., 2004). One such factorization for dependency trees</Paragraph>
    <Paragraph position="11"> It is trivial to show that if these O(n2) constraints are satisfied, then so are those in (1). We implemented this model, but found that the required training time was much larger than the k-best formulation and typically did not improve performance.</Paragraph>
    <Paragraph position="12"> Furthermore, the k-best formulation is more flexible with respect to the loss function since it does not assume the loss function can be factored into a sum of terms for each dependency.</Paragraph>
    <Section position="1" start_page="93" end_page="94" type="sub_section">
      <SectionTitle>
2.4 Feature Set
</SectionTitle>
      <Paragraph position="0"> Finally, we need a suitable feature representation f(i,j) for each dependency. The basic features in our model are outlined in Table 1a and b. All features are conjoined with the direction of attachment as well as the distance between the two words being attached. These features represent a system of back-off from very specific features over words and part-of-speech tags to less sparse features over just part-of-speech tags. These features are added for both the entire words as well as the 5-gram prefix if the word is longer than 5 characters.</Paragraph>
      <Paragraph position="1"> Using just features over the parent-child node pairs in the tree was not enough for high accuracy, because all attachment decisions were made outside of the context in which the words occurred. To solve this problem, we added two other types of features, which can be seen in Table 1c. Features of the first type look at words that occur between a child and its parent. These features take the form of a POS trigram: the POS of the parent, of the child, and of a word in between, for all words linearly between the parent and the child. This feature was particularly helpful for nouns identifying their parent, since  p-pos, p-pos+1, c-pos-1, c-pos p-pos-1, p-pos, c-pos-1, c-pos p-pos, p-pos+1, c-pos, c-pos+1 p-pos-1, p-pos, c-pos, c-pos+1  node. p-pos: POS of parent node. c-pos: POS of child node. p-pos+1: POS to the right of parent in sentence. p-pos-1: POS to the left of parent. c-pos+1: POS to the right of child. c-pos-1: POS to the left of child. b-pos: POS of a word in between parent and child nodes. it would typically rule out situations when a noun attached to another noun with a verb in between, which is a very uncommon phenomenon.</Paragraph>
      <Paragraph position="2"> The second type of feature provides the local context of the attachment, that is, the words before and after the parent-child pair. This feature took the form of a POS 4-gram: The POS of the parent, child, word before/after parent and word before/after child.</Paragraph>
      <Paragraph position="3"> The system also used back-off features to various tri-grams where one of the local context POS tags was removed. Adding these two features resulted in a large improvement in performance and brought the system to state-of-the-art accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
2.5 System Summary
</SectionTitle>
      <Paragraph position="0"> Besides performance (see Section 3), the approach to dependency parsing we described has several other advantages. The system is very general and contains no language specific enhancements. In fact, the results we report for English and Czech use identical features, though are obviously trained on different data. The online learning algorithms themselves are intuitive and easy to implement.</Paragraph>
      <Paragraph position="1"> The efficient O(n3) parsing algorithm of Eisner allows the system to search the entire space of dependency trees while parsing thousands of sentences in a few minutes, which is crucial for discriminative training. We compare the speed of our model to a standard lexicalized phrase structure parser in Section 3.1 and show a significant improvement in parsing times on the testing data.</Paragraph>
      <Paragraph position="2"> The major limiting factor of the system is its restriction to features over single dependency attachments. Often, when determining the next dependent for a word, it would be useful to know previous attachment decisions and incorporate these into the features. It is fairly straightforward to modify the parsing algorithm to store previous attachments.</Paragraph>
      <Paragraph position="3"> However, any modification would result in an asymptotic increase in parsing complexity.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML