<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3201"> <Title>Max-Margin Parsing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Recent work has shown that discriminative techniques frequently achieve classification accuracy that is superior to generative techniques, over a wide range of tasks. The empirical utility of models such as logistic regression and support vector machines (SVMs) in flat classification tasks like text categorization, word-sense disambiguation, and relevance routing has been repeatedly demonstrated. For sequence tasks like part-of-speech tagging or named-entity extraction, recent top-performing systems have also generally been based on discriminative sequence models, like conditional Markov models (Toutanova et al., 2003) or conditional random fields (Lafferty et al., 2001).</Paragraph> <Paragraph position="1"> A number of recent papers have considered discriminative approaches for natural language parsing (Johnson et al., 1999; Collins, 2000; Johnson, 2001; Geman and Johnson, 2002; Miyao and Tsujii, 2002; Clark and Curran, 2004; Kaplan et al., 2004; Collins, 2004).</Paragraph> <Paragraph position="2"> Broadly speaking, these approaches fall into two categories, reranking and dynamic programming approaches. In reranking methods (Johnson et al., 1999; Collins, 2000; Shen et al., 2003), an initial parser is used to generate a number of candidate parses. A discriminative model is then used to choose between these candidates. In dynamic programming methods, a large number of candidate parse trees are represented compactly in a parse tree forest or chart. Given sufficiently &quot;local&quot; features, the decoding and parameter estimation problems can be solved using dynamic programming algorithms.</Paragraph> <Paragraph position="3"> For example, (Johnson, 2001; Geman and Johnson, 2002; Miyao and Tsujii, 2002; Clark and Curran, 2004; Kaplan et al., 2004) describe approaches based on conditional log-linear (maximum entropy) models, where variants of the inside-outside algorithm can be used to efficiently calculate gradients of the log-likelihood function, despite the exponential number of trees represented by the parse forest.</Paragraph> <Paragraph position="4"> In this paper, we describe a dynamic programming approach to discriminative parsing that is an alternative to maximum entropy estimation. Our method extends the max-margin approach of Taskar et al. (2003) to the case of context-free grammars. The present method has several compelling advantages. Unlike reranking methods, which consider only a pre-pruned selection of &quot;good&quot; parses, our method is an end-to-end discriminative model over the full space of parses. This distinction can be very significant, as the set of n-best parsesoften does not contain thetrue parse. For example, in the work of Collins (2000), 41% of the correct parseswere not inthe candidate pool of [?]30-best parses. Unlike previous dynamic programming approaches, which were based on maximum entropy estimation, our method incorporates an articulated loss function which penalizes larger tree discrepancies more severely than smaller ones.1 Moreover, like perceptron-based learning, it requires only the calculation of Viterbi trees, rather than expectations over all trees (for example using the inside-outside algorithm). In practice, it converges in many fewer iterations than CRF-like approaches. 
For example, while our approach generally converged in 20-30 iterations, Clark and Curran (2004) report experiments involving 479 iterations of training for one model, and 1550 iterations for another.</Paragraph>
<Paragraph position="5"> The primary contribution of this paper is the extension of the max-margin approach of Taskar et al. (2003) to context-free grammars. We show that this framework allows high-accuracy parsing in cubic time by exploiting novel kinds of lexical information.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="2" type="metho">
<SectionTitle> 2 Discriminative Parsing </SectionTitle>
<Paragraph position="0"> In the discriminative parsing task, we want to learn a function f : X → Y, where X is a set of sentences, and Y is a set of valid parse trees according to a fixed grammar G. G maps an input x ∈ X to a set of candidate parses G(x) ⊆ Y.2 We assume a loss function L : X × Y × Y → R+. The function L(x, y, ŷ) measures the penalty for proposing the parse ŷ for x when y is the true parse. This penalty may be defined, for example, as the number of labeled spans on which the two trees do not agree. In general we assume that L(x, y, ŷ) = 0 for y = ŷ. Given labeled training examples (x_i, y_i) for i = 1...n, we seek a function f with small expected loss on unseen sentences.</Paragraph>
<Paragraph position="1"> The functions we consider take the following linear discriminant form:
f_w(x) = argmax_{y ∈ G(x)} ⟨w, Φ(x, y)⟩,   (1)
where ⟨·,·⟩ denotes the vector inner product, w ∈ R^d, and Φ is a feature-vector representation of a parse tree, Φ : X × Y → R^d (see examples below).3 Note that this class of functions includes Viterbi PCFG parsers, where the feature-vector consists of the counts of the productions used in the parse, and the parameters w are the log-probabilities of those productions.</Paragraph>
<Paragraph position="2"> 2 The space of parse trees over many grammars is naturally infinite, but can be made finite if we disallow unary chains and empty productions.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Probabilistic Estimation </SectionTitle>
<Paragraph position="0"> The traditional method of estimating the parameters of PCFGs assumes a generative grammar that defines P(x, y) and maximizes the joint log-likelihood Σ_i log P(x_i, y_i) (with some regularization). An alternative probabilistic approach is to estimate the parameters discriminatively by maximizing conditional log-likelihood. For example, the maximum entropy approach (Johnson, 2001) defines a conditional log-linear model
P_w(y | x) = (1 / Z_w(x)) exp{⟨w, Φ(x, y)⟩},
where Z_w(x) = Σ_{y ∈ G(x)} exp{⟨w, Φ(x, y)⟩}, and maximizes the conditional log-likelihood of the sample, Σ_i log P(y_i | x_i) (with some regularization).</Paragraph>
</Section>
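<Paragraph> To make the linear discriminant form in Eq. 1 and the conditional log-linear model of Section 2.1 concrete, the following Python sketch (not from the original paper) scores candidate parses by the inner product of production-count features with a weight vector. The candidate set, productions, and weights are invented for illustration, and a real parser would search a chart rather than enumerate G(x); with w set to log-probabilities of productions, f_w is exactly a Viterbi PCFG parser, and exponentiating and normalizing the same scores gives the maximum-entropy distribution P_w(y | x).

# Toy illustration: linear discriminant scoring of candidate parses with
# production-count features. Candidates are enumerated explicitly here,
# whereas a real parser would use a chart.
import math
from collections import Counter

def features(parse):
    # Phi(x, y): counts of the productions used in the parse.
    return Counter(parse)

def score(w, parse):
    # <w, Phi(x, y)>: dot product of weights with production counts.
    return sum(w.get(prod, 0.0) * cnt for prod, cnt in features(parse).items())

def f_w(w, candidates):
    # f_w(x) = argmax over y in G(x) of <w, Phi(x, y)>
    return max(candidates, key=lambda parse: score(w, parse))

def conditional_log_linear(w, candidates):
    # P_w(y | x) = exp{<w, Phi(x, y)>} / Z_w(x), as in Section 2.1.
    scores = [score(w, parse) for parse in candidates]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Hypothetical candidate parses for one sentence, each a list of productions.
G_x = [
    ["S -> NP VP", "NP -> DT NN", "VP -> VBD NP", "NP -> NNS"],
    ["S -> NP VP", "NP -> DT NN", "VP -> VBD NNS"],
]
# With w set to log-probabilities of productions, f_w is a Viterbi PCFG parser.
w = {"S -> NP VP": math.log(0.9), "NP -> DT NN": math.log(0.5),
     "VP -> VBD NP": math.log(0.3), "NP -> NNS": math.log(0.2),
     "VP -> VBD NNS": math.log(0.1)}
best = f_w(w, G_x)
probs = conditional_log_linear(w, G_x)
</Paragraph>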
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Max-Margin Estimation </SectionTitle>
<Paragraph position="0"> In this paper, we advocate a different estimation criterion, inspired by the max-margin principle of SVMs. Max-margin estimation has been used for parse reranking (Collins, 2000). Recently, it has also been extended to graphical models (Taskar et al., 2003; Altun et al., 2003) and shown to outperform the standard maximum likelihood methods. The main idea is to forego the probabilistic interpretation, and directly ensure that
y_i = argmax_{y ∈ G(x_i)} ⟨w, Φ(x_i, y)⟩
for all i in the training data.</Paragraph>
<Paragraph position="1"> We define the margin of the parameters w on the example i and parse y as the difference in value between the true parse y_i and y:
⟨w, Φ(x_i, y_i)⟩ − ⟨w, Φ(x_i, y)⟩ = ⟨w, Φ_{i,y_i} − Φ_{i,y}⟩,
where Φ_{i,y} = Φ(x_i, y) and Φ_{i,y_i} = Φ(x_i, y_i). Intuitively, the size of the margin quantifies the confidence in rejecting the mistaken parse y using the function f_w(x), modulo the scale of the parameters ||w||. We would like this rejection confidence to be larger when the mistake y is more severe, i.e. when L(x_i, y_i, y) is large. We can express this desideratum as an optimization problem:
max γ
s.t. ⟨w, Φ_{i,y_i} − Φ_{i,y}⟩ ≥ γ L_{i,y}, ∀i, ∀y ∈ G(x_i);  ||w||^2 ≤ 1,
where L_{i,y} = L(x_i, y_i, y). This quadratic program aims to separate each y ∈ G(x_i) from the target parse y_i by a margin that is proportional to the loss L(x_i, y_i, y). After a standard transformation, in which maximizing the margin is reformulated as minimizing the scale of the weights (for a fixed margin of 1), we get the following program:
min (1/2) ||w||^2 + C Σ_i ξ_i   (2)
s.t. ⟨w, Φ_{i,y_i} − Φ_{i,y}⟩ ≥ L_{i,y} − ξ_i, ∀i, ∀y ∈ G(x_i).</Paragraph>
<Paragraph position="2"> The addition of non-negative slack variables ξ_i allows one to increase the global margin by paying a local penalty on some outlying examples. The constant C dictates the desired trade-off between margin size and outliers. Note that this formulation has an exponential number of constraints, one for each possible parse y ∈ G(x_i) for each sentence i. We address this issue in section 4.</Paragraph>
<Paragraph position="3"> 3 Note that in the case that two members y1 and y2 have the same value for ⟨w, Φ(x, y)⟩, we assume that there is some fixed, deterministic way of breaking ties. For example, one approach would be to assume some default ordering on the members of Y.</Paragraph>
</Section>
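<Paragraph> To see what the slack variables in Eq. 2 buy, the sketch below (an illustration, not the paper's method) computes, for a fixed weight vector w and a single example, the smallest slack ξ_i consistent with the constraints of Eq. 2, namely the loss-augmented hinge max_y [L_{i,y} − ⟨w, Φ_{i,y_i} − Φ_{i,y}⟩]_+ computed by brute force over an explicit candidate list. The feature map and the stand-in loss are assumptions made for illustration; the paper instead works with the factored dual of the next sections.

# Toy sketch: for a fixed w, the smallest slack xi_i satisfying the
# constraints of Eq. 2 for one example is
#   xi_i = max_y [ L_{i,y} - <w, Phi_{i,y_i} - Phi_{i,y}> ]_+ ,
# computed here by enumerating an explicit candidate list G(x_i).
from collections import Counter

def dot(w, phi):
    return sum(w.get(k, 0.0) * v for k, v in phi.items())

def production_features(parse):
    return Counter(parse)

def production_loss(gold, parse):
    # Stand-in loss: number of productions of `parse` not found in `gold`
    # (the paper's losses count mismatched constituents or rule tuples).
    return sum((Counter(parse) - Counter(gold)).values())

def min_slack(w, gold, candidates):
    phi_gold = production_features(gold)
    slack = 0.0   # the [.]_+ part: y = gold already gives violation 0
    for parse in candidates:
        margin = dot(w, phi_gold) - dot(w, production_features(parse))
        violation = production_loss(gold, parse) - margin
        slack = max(slack, violation)
    return slack

# Hypothetical example: one gold parse, one mistaken candidate.
gold = ["S -> NP VP", "NP -> DT NN", "VP -> VBD"]
wrong = ["S -> NP VP", "NP -> DT NNS", "VP -> VBD"]
w = {"NP -> DT NN": 1.0, "NP -> DT NNS": 0.5}
xi = min_slack(w, gold, [gold, wrong])
# margin for `wrong` is 1.0 - 0.5 = 0.5 and its loss is 1, so xi == 0.5
</Paragraph>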
<Section position="3" start_page="0" end_page="2" type="sub_section">
<SectionTitle> 2.3 The Max-Margin Dual </SectionTitle>
<Paragraph position="0"> In SVMs, the optimization problem is solved by working with the dual of a quadratic program analogous to Eq. 2. For our problem, just as for SVMs, the dual has important computational advantages, including the &quot;kernel trick,&quot; which allows the efficient use of high-dimensional feature spaces endowed with efficient dot products (Cristianini and Shawe-Taylor, 2000). Moreover, the dual view plays a crucial role in circumventing the exponential size of the primal problem.</Paragraph>
<Paragraph position="1"> In Eq. 2, there is a constraint for each mistake y one might make on each example i, which rules out that mistake. For each mistake-exclusion constraint, the dual contains a variable α_{i,y}. Intuitively, the magnitude of α_{i,y} is proportional to the attention we must pay to that mistake in order not to make it.</Paragraph>
<Paragraph position="2"> The dual of Eq. 2 (after adding additional variables α_{i,y_i} and renormalizing by C) is given by
max_α  C Σ_{i,y} α_{i,y} L_{i,y} − (1/2) || C Σ_{i,y} (I_{i,y} − α_{i,y}) Φ_{i,y} ||^2   (3)
s.t. Σ_y α_{i,y} = 1, ∀i;  α_{i,y} ≥ 0, ∀i, y,
where I_{i,y} = I(x_i, y_i, y) indicates whether y is the true parse y_i. Given the dual solution α*, the solution to the primal problem w* is simply a weighted linear combination of the feature vectors of the correct parse and mistaken parses:
w* = C Σ_{i,y} (I_{i,y} − α*_{i,y}) Φ_{i,y}.
This is the precise sense in which mistakes with large α contribute more strongly to the model.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="2" end_page="2" type="metho">
<SectionTitle> 3 Factored Models </SectionTitle>
<Paragraph position="0"> There is a major problem with both the primal and the dual formulations above: since each potential mistake must be ruled out, the number of variables or constraints is proportional to |G(x)|, the number of possible parse trees. Even in grammars without unary chains or empty elements, the number of parses is generally exponential in the length of the sentence, so we cannot expect to solve the above problem without any assumptions about the feature-vector representation Φ and loss function L.</Paragraph>
<Paragraph position="1"> For that matter, for arbitrary representations, to find the best parse given a weight vector, we would have no choice but to enumerate all trees and score them. However, our grammars and representations are generally structured to enable efficient inference. For example, we usually assign scores to local parts of the parse such as PCFG productions. Such factored models have shared substructure properties which permit dynamic programming decompositions. In this section, we describe how this kind of decomposition can be done over the dual α distributions. The idea of this decomposition has previously been used for sequences and other Markov random fields in Taskar et al. (2003), but the present extension to CFGs is novel.</Paragraph>
<Paragraph position="2"> For clarity of presentation, we restrict the grammar to be in Chomsky normal form (CNF), where all rules in the grammar are of the form A → B C or A → a, where A, B, and C are non-terminal symbols, and a is some terminal symbol. For example, figure 1(a) shows a tree in this form.</Paragraph>
<Paragraph position="3"> We will represent each parse as a set of two types of parts. Parts of the first type are single constituent tuples ⟨A, s, e, i⟩, consisting of a non-terminal A, start-point s, end-point e, and sentence i, such as r in figure 1(b). In this representation, indices s and e refer to positions between words, rather than to words themselves. These parts correspond to the traditional notion of an edge in a tabular parser. Parts of the second type consist of CF-rule tuples ⟨A → B C, s, m, e, i⟩. The tuple specifies a particular rule A → B C, and its position, including split point m, within the sentence i, such as q in figure 1(b), and corresponds to the traditional notion of a traversal in a tabular parser. Note that parts for a basic PCFG model are not just rewrites (which can occur multiple times), but rather anchored items.</Paragraph>
<Paragraph position="4"> Formally, we assume some countable set of parts, R. We also assume a function R which maps each object (x, y) ∈ X × Y to a finite subset of R. Thus R(x, y) is the set of parts belonging to a particular parse. Equivalently, the function R(x, y) maps a derivation y to the set of parts which it includes. Because all rules are in binary-branching form, |R(x, y)| is constant across different derivations y for the same input sentence x.</Paragraph>
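<Paragraph> As an illustration of this part-based representation, the following sketch (not from the paper) extracts both kinds of parts from a small CNF tree encoded as nested tuples. The tree, its encoding, and the string form of the rules are invented for illustration, and the sentence index i is omitted.

# Toy sketch: extracting constituent parts <A, s, e> and anchored rule parts
# <A -> B C, s, m, e> from a CNF tree. A tree is (label, left, right) for a
# binary rule and (label, word) for a preterminal; s, e are positions
# between words.
def parts(tree, start=0):
    """Return (end, constituent_parts, rule_parts) for the subtree at `start`."""
    label = tree[0]
    if len(tree) == 2 and isinstance(tree[1], str):   # preterminal: A -> a
        end = start + 1
        return end, [(label, start, end)], []
    _, left, right = tree
    mid, lcons, lrules = parts(left, start)
    end, rcons, rrules = parts(right, mid)
    cons = [(label, start, end)] + lcons + rcons
    rules = [("%s -> %s %s" % (label, left[0], right[0]), start, mid, end)]
    return end, cons, rules + lrules + rrules

# Hypothetical CNF tree for "the cat sleeps":
tree = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", "sleeps"))
_, constituents, rule_tuples = parts(tree)
# constituents: [('S', 0, 3), ('NP', 0, 2), ('DT', 0, 1), ('NN', 1, 2), ('VP', 2, 3)]
# rule_tuples:  [('S -> NP VP', 0, 2, 3), ('NP -> DT NN', 0, 1, 2)]
</Paragraph>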
<Paragraph position="5"> We assume that the feature vector for a sentence and parse tree (x, y) decomposes into a sum of the feature vectors for its parts:
Φ(x, y) = Σ_{r ∈ R(x,y)} φ(x, r).</Paragraph>
<Paragraph position="6"> In CFGs, the function φ(x, r) can be any function mapping a rule production and its position in the sentence x to some feature vector representation. For example, φ could include features which identify the rule used in the production, or features which track the rule identity together with features of the words at positions s, m, e, and neighboring positions in the sentence x.</Paragraph>
<Paragraph position="7"> In addition, we assume that the loss function L(x, y, ŷ) also decomposes into a sum of local loss functions l(x, y, r) over parts, as follows:
L(x, y, ŷ) = Σ_{r ∈ R(x,ŷ)} l(x, y, r).</Paragraph>
<Paragraph position="8"> One approach would be to define l(x, y, r) to be 0 only if the non-terminal A spans words s...e in the derivation y and 1 otherwise. This would lead to L(x, y, ŷ) tracking the number of &quot;constituent errors&quot; in ŷ, where a constituent is a tuple such as ⟨A, s, e, i⟩. Another, more strict definition would be to define l(x, y, r) to be 0 if r of the type ⟨A → B C, s, m, e, i⟩ is in the derivation y and 1 otherwise. This definition would lead to L(x, y, ŷ) being the number of CF-rule tuples in ŷ which are not seen in y.4 Finally, we define indicator variables I(x, y, r) which are 1 if r ∈ R(x, y) and 0 otherwise. We also define sets R(x_i) = ∪_{y ∈ G(x_i)} R(x_i, y) for the training examples i = 1...n. Thus, R(x_i) is the set of parts that is seen in at least one of the objects {(x_i, y) : y ∈ G(x_i)}.</Paragraph>
<Paragraph position="9"> 4 This loss does not correspond to the standard scoring metrics, such as F1 or crossing brackets, but shares the sensitivity to the number of differences between trees. We have not thoroughly investigated the exact interplay between the various loss choices and the various parsing metrics. We used the constituent loss in our experiments.</Paragraph>
</Section>
<Section position="6" start_page="2" end_page="2" type="metho">
<SectionTitle> 4 Factored Dual </SectionTitle>
<Paragraph position="0"> The dual in Eq. 3 involves variables α_{i,y} for all i = 1...n, y ∈ G(x_i), and the objective is quadratic in these α variables. In addition, it turns out that the set of dual variables α_i = {α_{i,y} : y ∈ G(x_i)} for each example i is constrained to be non-negative and sum to 1.</Paragraph>
<Paragraph position="1"> It is interesting that, while the parameters w lose their probabilistic interpretation, the dual variables α_i for each sentence actually form a kind of probability distribution. Furthermore, the objective can be expressed in terms of expectations with respect to these distributions:
Q(α) = C Σ_i E_{α_i}[L_{i,y}] − (1/2) || C Σ_i (Φ_{i,y_i} − E_{α_i}[Φ_{i,y}]) ||^2.</Paragraph>
<Paragraph position="2"> We now consider how to efficiently solve the max-margin optimization problem for a factored model. As shown in Taskar et al. (2003), the dual in Eq. 3 can be reframed using &quot;marginal&quot; terms. We will also find it useful to consider this alternative formulation of the dual. Given dual variables α, we define the marginals μ_{i,r}(α_i), for all i and r ∈ R(x_i), as follows:
μ_{i,r}(α_i) = Σ_{y ∈ G(x_i)} α_{i,y} I(x_i, y, r) = E_{α_i}[I(x_i, y, r)].</Paragraph>
<Paragraph position="3"> Since the dual variables α_i form probability distributions over parse trees for each sentence i, the marginals μ_{i,r}(α_i) represent the proportion of parses that would contain part r if they were drawn from the distribution α_i. Note that the number of such marginal terms is the number of parts, which is polynomial in the length of the sentence.</Paragraph>
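<Paragraph> A minimal sketch of this definition, under the assumption that candidate parses can be enumerated explicitly: each candidate is given directly as its set of parts, and the marginal of a part is the total α-weight of the candidates containing it. The candidates, parts, and weights below are invented for illustration; the point of the factored dual is precisely to avoid this enumeration.

# Toy sketch: computing the marginals mu_{i,r}(alpha_i) by brute force from an
# explicit distribution alpha_i over candidate parses. A real implementation
# never enumerates parses and works on the chart instead.
from collections import defaultdict

def marginals(alpha_i, candidate_parts):
    # mu_{i,r} = sum_y alpha_{i,y} * I(x_i, y, r) = E_{alpha_i}[I(x_i, y, r)]
    mu = defaultdict(float)
    for a_y, parts_of_y in zip(alpha_i, candidate_parts):
        for r in parts_of_y:
            mu[r] += a_y
    return dict(mu)

# Hypothetical example: two candidate parses sharing the part ('S', 0, 3).
candidate_parts = [
    {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)},
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)},
]
alpha_i = [0.75, 0.25]          # non-negative, sums to 1
mu = marginals(alpha_i, candidate_parts)
# mu[('S', 0, 3)] == 1.0; mu[('NP', 0, 2)] == 0.75; mu[('VP', 1, 3)] == 0.25
</Paragraph>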
<Paragraph position="4"> Now consider the dual objective Q(α) in Eq. 3. It can be shown that the original objective Q(α) can be expressed in terms of these marginals as Q_m(μ(α)), where μ(α) is the vector with components μ_{i,r}(α_i), and Q_m(μ) is defined as
Q_m(μ) = C Σ_i Σ_{r ∈ R(x_i)} μ_{i,r} l_{i,r} − (1/2) || C Σ_i Σ_{r ∈ R(x_i)} (I_{i,r} − μ_{i,r}) φ_{i,r} ||^2,
where l_{i,r} = l(x_i, y_i, r), φ_{i,r} = φ(x_i, r), and I_{i,r} = I(x_i, y_i, r). This follows from substituting the factored definitions of the feature representation Φ and loss function L together with the definition of the marginals.</Paragraph>
<Paragraph position="5"> Having expressed the objective in terms of a polynomial number of variables, we now turn to the constraints on these variables. The feasible set for α is
Λ = {α : α_{i,y} ≥ 0, ∀i, y;  Σ_y α_{i,y} = 1, ∀i}.
Now let Λ_m be the space of marginal vectors which are feasible: Λ_m = {μ : ∃α ∈ Λ s.t. μ = μ(α)}. Then our original optimization problem can be reframed as max_{μ ∈ Λ_m} Q_m(μ).</Paragraph>
<Paragraph position="6"> Fortunately, in the case of PCFGs, the domain Λ_m can be described compactly with a polynomial number of linear constraints. Essentially, we need to enforce the condition that the expected proportions of parses having particular parts should be consistent with each other. Our marginals track constituent parts ⟨A, s, e, i⟩ and CF-rule-tuple parts ⟨A → B C, s, m, e, i⟩. The consistency constraints are precisely the inside-outside probability relations: the marginal of each constituent part spanning more than one word must equal the total marginal of the anchored rules that expand it,
μ_i(⟨A, s, e⟩) = Σ_{B,C} Σ_{s<m<e} μ_i(⟨A → B C, s, m, e⟩),
and the marginal of each constituent part other than the root must equal the total marginal of the anchored rules in which it appears as a child,
μ_i(⟨A, s, e⟩) = Σ_{B,C} Σ_{0≤s'<s} μ_i(⟨B → C A, s', s, e⟩) + Σ_{B,C} Σ_{e<e'≤n_i} μ_i(⟨B → A C, s, e, e'⟩),
where n_i is the length of the sentence. In addition, we must ensure non-negativity and normalization to 1:
μ_{i,r} ≥ 0, ∀i, r ∈ R(x_i);  Σ_A μ_i(⟨A, 0, n_i⟩) = 1.
The number of variables in this formulation is cubic in the length of the sentence, while the number of constraints is quadratic. This polynomial size formulation should be contrasted with the earlier formulation in Collins (2004), which has an exponential number of constraints.</Paragraph>
</Section>
<Section position="7" start_page="2" end_page="2" type="metho">
<SectionTitle> 5 Factored SMO </SectionTitle>
<Paragraph position="0"> We have reduced the problem to a polynomial size QP, which, in principle, can be solved using standard QP toolkits. However, although the number of variables and constraints in the factored dual is polynomial in the size of the data, the number of coefficients in the quadratic term of the objective is very large: quadratic in the number of sentences and dependent on the sixth power of the sentence length. Hence, in our experiments we use an online coordinate descent method analogous to the sequential minimal optimization (SMO) used for SVMs (Platt, 1999) and adapted to structured max-margin estimation in Taskar et al. (2003).</Paragraph>
<Paragraph position="1"> We omit the details of the structured SMO procedure, but the important fact about this kind of training is that, similar to the basic perceptron approach, it only requires picking up sentences one at a time, checking what the best parse is according to the current primal and dual weights, and adjusting the weights.</Paragraph>
</Section>
</Paper>