<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2904"> <Title>Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization</Title> <Section position="5" start_page="21" end_page="23" type="metho"> <SectionTitle> 3 Dependency Parsing Model </SectionTitle> <Paragraph position="0"> Given a sentence $X = (x_1, \ldots, x_n)$ we are interested in computing a directed dependency tree, $T$, over $X$. In particular, we assume that a directed dependency tree $T$ consists of ordered pairs $(x_i \rightarrow x_j)$ of words in $X$ such that each word appears in at least one pair and each word has in-degree at most one. Dependency trees are usually assumed to be projective (no crossing arcs), which means that if there is an arc $(x_i \rightarrow x_j)$, then $x_i$ is an ancestor of all the words between $x_i$ and $x_j$. Let $\mathcal{T}(X)$ denote the set of all the directed, projective trees that span $X$.</Paragraph> <Paragraph position="1"> Given an input sentence $X$, we would like to be able to compute the best parse; that is, a projective tree, $T \in \mathcal{T}(X)$, that obtains the highest score. In particular, we follow (Eisner, 1996; Eisner and Satta, 1999; McDonald et al., 2005) and assume that the score of a complete spanning tree $T$ for a given sentence, whether probabilistically motivated or not, can be decomposed as a sum of local scores for each link (a word pair). In this case, the parsing problem reduces to

$$T^{*} \;=\; \arg\max_{T \in \mathcal{T}(X)} \sum_{(x_i \rightarrow x_j) \in T} \mathrm{s}(x_i \rightarrow x_j) \qquad (1)$$

where the score $\mathrm{s}(x_i \rightarrow x_j)$ can depend on any measurable property of $x_i$ and $x_j$ within the tree $T$. This formulation is sufficiently general to capture most dependency parsing models, including probabilistic dependency models (Wang et al., 2005; Eisner, 1996) as well as non-probabilistic models (McDonald et al., 2005). For standard scoring functions, parsing requires an $O(n^3)$ dynamic programming algorithm to compute a projective tree that obtains the maximum score (Eisner and Satta, 1999; Wang et al., 2005; McDonald et al., 2005).</Paragraph> <Paragraph position="2"> For the purpose of learning, we decompose each link score into a weighted linear combination of features,

$$\mathrm{s}(x_i \rightarrow x_j) \;=\; \boldsymbol{\theta}^\top \mathbf{f}(x_i \rightarrow x_j) \qquad (2)$$

where $\boldsymbol{\theta}$ are the weight parameters to be estimated during training.</Paragraph>
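To make the factored objective concrete, here is a minimal sketch (not from the paper; the feature map, weights, and word representation are illustrative assumptions) of how a candidate tree is scored as a sum of linear link scores, as in (1) and (2).

```python
import numpy as np

def link_score(theta, feature_fn, head, modifier):
    """Score a single candidate link as a weighted feature combination, as in (2)."""
    return float(np.dot(theta, feature_fn(head, modifier)))

def tree_score(theta, feature_fn, tree):
    """Score a complete parse as the sum of its link scores, as in (1).

    `tree` is a list of (head, modifier) word pairs."""
    return sum(link_score(theta, feature_fn, h, m) for h, m in tree)

# Illustrative usage with a toy two-feature model (bias + adjacency indicator).
def toy_features(head, modifier):
    return np.array([1.0, 1.0 if abs(head[1] - modifier[1]) == 1 else 0.0])

theta = np.array([0.1, 0.5])
# Words are paired with their sentence positions, e.g. ("saw", 1).
candidate = [(("saw", 1), ("John", 0)), (("saw", 1), ("Mary", 2))]
print(tree_score(theta, toy_features, candidate))
```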
<Paragraph position="3"> Of course, the specific features used in any real situation are critical for obtaining a reasonable dependency parser. The natural sets of features to consider in this setting are very large, consisting at the very least of features indexed by all possible lexical items (words). For example, natural features to use for dependency parsing are indicators of each possible word pair,

$$f_{uv}(x_i \rightarrow x_j) \;=\; \mathbf{1}_{\{x_i = u \,\wedge\, x_j = v\}} \qquad (3)$$

which allow one to represent the tendency of two words, $u$ and $v$, to be directly linked in a parse. In this case, there is a corresponding parameter $\theta_{uv}$ to be learned for each word pair, which represents the strength of the possible linkage.</Paragraph> <Paragraph position="4"> A large number of features leads to a serious risk of over-fitting due to sparse data problems. The standard mechanisms for mitigating such effects are to combine features via abstraction (e.g. using parts-of-speech) or smoothing (e.g. using word-similarity-based smoothing). For abstraction, a common strategy is to use parts-of-speech to compress the feature set, for example by only considering the tag of the parent,

$$f_{pv}(x_i \rightarrow x_j) \;=\; \mathbf{1}_{\{\mathrm{pos}(x_i) = p \,\wedge\, x_j = v\}}$$

However, rather than use abstraction, we will follow a purely lexical approach and only consider features that are directly computable from the words themselves (or statistical quantities that are directly measurable from these words).</Paragraph> <Paragraph position="5"> In general, the most important aspect of a link feature is simply that it measures something about a candidate word pair that is predictive of whether the words will actually be linked in a given sentence. Thus, many other natural features, beyond parts-of-speech and abstract grammatical categories, immediately suggest themselves as being predictive of link existence. For example, one very useful feature is simply the degree of association between the two words as measured by their pointwise mutual information,

$$f_{\mathrm{PMI}}(x_i \rightarrow x_j) \;=\; \mathrm{PMI}(x_i, x_j)$$

(We describe in Section 6 below how we compute this association measure on an auxiliary corpus of unannotated text.) Another useful link feature is simply the distance between the two words in the sentence; that is, how many words they have between them,

$$f_{\mathrm{dist}}(x_i \rightarrow x_j) \;=\; |\,\mathrm{position}(x_i) - \mathrm{position}(x_j)\,|$$

In fact, the likelihood of a direct link between two words diminishes quickly with distance, which motivates using more rapidly increasing functions of distance, such as the square,

$$f_{\mathrm{dist2}}(x_i \rightarrow x_j) \;=\; \big(\mathrm{position}(x_i) - \mathrm{position}(x_j)\big)^2$$</Paragraph> <Paragraph position="6"> In our experiments below, we used only these simple, lexically determined features, $\{f_{uv}\}$, $f_{\mathrm{PMI}}$, $f_{\mathrm{dist}}$ and $f_{\mathrm{dist2}}$, without the parts-of-speech features $\{f_{pv}\}$. Currently, we only use undirected forms of these features, where, for example, $f_{uv} = f_{vu}$ for all pairs (or, put another way, we tie the parameters $\theta_{uv} = \theta_{vu}$). It is possible to extend the model to use directed features, but we have already found that these simple undirected features permit state-of-the-art accuracy in predicting (undirected) dependencies. Nevertheless, extending our approach to directed features and contextual features, as in (Wang et al., 2005), remains an important direction for future research.</Paragraph> </Section>
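The lexically determined link features above can be computed directly from the sentence and simple corpus statistics. The sketch below is illustrative only: the count dictionaries, the add-one fallbacks, and the feature naming are assumptions rather than the paper's exact procedure, but the PMI, distance, and squared-distance values follow the definitions in this section.

```python
import math

def pmi(count_pair, count_w1, count_w2, total):
    """Pointwise mutual information estimated from co-occurrence counts."""
    p_pair = count_pair / total
    p_w1, p_w2 = count_w1 / total, count_w2 / total
    return math.log(p_pair / (p_w1 * p_w2))

def link_features(sentence, i, j, pair_counts, word_counts, total):
    """Undirected feature values for the candidate link between positions i and j."""
    w1, w2 = sentence[i], sentence[j]
    key = tuple(sorted((w1, w2)))            # tie f_uv = f_vu
    dist = abs(i - j)
    return {
        ("pair",) + key: 1.0,                # bi-lexical indicator feature
        "PMI": pmi(pair_counts.get(key, 1), word_counts.get(w1, 1),
                   word_counts.get(w2, 1), total),
        "dist": float(dist),
        "dist2": float(dist ** 2),
    }
```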
<Section position="6" start_page="23" end_page="23" type="metho"> <SectionTitle> 4 Large Margin Training </SectionTitle> <Paragraph position="0"> Given a training set of sentences annotated with their correct dependency parses, $(X_1, T_1), \ldots, (X_N, T_N)$, the goal of learning is to estimate the parameters of the parsing model, $\boldsymbol{\theta}$. In particular, we seek values for the parameters that can accurately reconstruct the training parses, but, more importantly, are also able to accurately predict the dependency parse structure on future test sentences.</Paragraph> <Paragraph position="1"> To train $\boldsymbol{\theta}$ we follow the large margin training approach of (Taskar et al., 2003; Tsochantaridis et al., 2004), which has been applied with great success to dependency parsing (Taskar et al., 2004; McDonald et al., 2005). Large margin training can be expressed as minimizing a regularized loss (Hastie et al., 2004),

$$\min_{\boldsymbol{\theta}} \;\; \beta\,\boldsymbol{\theta}^\top\boldsymbol{\theta} \;+\; \sum_{i} \max_{L_i \in \mathcal{T}(X_i)} \Big[ \Delta(L_i, T_i) - \big(\mathrm{s}(T_i) - \mathrm{s}(L_i)\big) \Big] \qquad (4)$$

where $T_i$ is the target tree for sentence $X_i$; $L_i$ ranges over all possible alternative trees in $\mathcal{T}(X_i)$; $\mathrm{s}(T) = \sum_{(x_u \rightarrow x_v) \in T} \mathrm{s}(x_u \rightarrow x_v)$ is the score of a tree $T$; and $\Delta(L_i, T_i)$ is a measure of distance between the two trees $L_i$ and $T_i$.</Paragraph> <Paragraph position="2"> Using the techniques of (Hastie et al., 2004) one can show that minimizing (4) is equivalent to solving the quadratic program

$$\min_{\boldsymbol{\theta}, \boldsymbol{\xi}} \;\; \beta\,\boldsymbol{\theta}^\top\boldsymbol{\theta} + \mathbf{e}^\top\boldsymbol{\xi} \quad \text{subject to} \quad \mathrm{s}(T_i) - \mathrm{s}(L_i) \;\ge\; \Delta(L_i, T_i) - \xi_i, \;\; \xi_i \ge 0$$

for all $i$, $L_i \in \mathcal{T}(X_i)$, which corresponds to the training problem posed in (McDonald et al., 2005).</Paragraph> <Paragraph position="3"> Unfortunately, the quadratic program (4) has three problems one must address. First, there are exponentially many constraints, corresponding to each possible parse of each training sentence, which forces one to use alternative training procedures, such as incremental constraint generation, to slowly converge to a solution (McDonald et al., 2005; Tsochantaridis et al., 2004). Second, and related, the original loss (4) is only evaluated at the global parse tree level, and is not targeted at penalizing any specific component in an incorrect parse. Although (McDonald et al., 2005) explicitly describes this as an advantage over previous approaches (Ratnaparkhi, 1999; Yamada and Matsumoto, 2003), below we find that changing the loss to enforce a more detailed set of constraints leads to a more effective approach. Third, given the large number of bi-lexical features $\{f_{uv}\}$ in our model, solving (4) directly will over-fit any reasonable training corpus. (Moreover, using a large $\beta$ to shrink the $\boldsymbol{\theta}$ values does not mitigate the sparse data problem introduced by having so many features.) We now present our refinements that address each of these issues in turn.</Paragraph> </Section> <Section position="7" start_page="23" end_page="24" type="metho"> <SectionTitle> 5 Training with Local Constraints </SectionTitle> <Paragraph position="0"> We are initially focusing on training on just an undirected link model, where each parameter in the model is a weight $\theta_{ww'}$ between two words, $w$ and $w'$, respectively. Since links are undirected, these weights are tied, $\theta_{ww'} = \theta_{w'w}$, and we can also write the score in an undirected fashion as $\mathrm{s}(w, w') = \boldsymbol{\theta}^\top\mathbf{f}(w, w')$. The main advantage of working with the undirected link model is that the constraints needed to ensure correct parses on the training data are much easier to specify in this case.</Paragraph> <Paragraph position="1"> Ignoring the projective (no crossing arcs) constraint for the moment, an undirected dependency parse can be equated with a maximum score spanning tree of a sentence. Given a target parse, the set of constraints needed to ensure the target parse is in fact the maximum score spanning tree under the weights $\boldsymbol{\theta}$, by at least a minimum amount, is a simple set of linear constraints: for any edge $w_1 w_2$ that is not in the target parse, one simply adds the two constraints

$$\mathrm{s}(w_1, w_1') \;\ge\; \mathrm{s}(w_1, w_2) + 1 \qquad\qquad \mathrm{s}(w_2, w_2') \;\ge\; \mathrm{s}(w_1, w_2) + 1 \qquad (5)$$

where $w_1 w_1'$ and $w_2 w_2'$ are the adjacent edges that actually occur in the target parse and that are also on the path between $w_1$ and $w_2$. (These would have to be the only such edges, or there would be a loop in the parse tree.)</Paragraph>
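The local constraints (5) can be enumerated mechanically from each target parse. The following sketch assumes words are indexed by sentence position and the target parse is given as a list of undirected edges; it is a simplified illustration of the construction described above, not the authors' implementation.

```python
from collections import defaultdict

def tree_path(edges, src, dst):
    """Return the unique path (as a list of nodes) between src and dst in a tree."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    stack, parent = [src], {src: None}
    while stack:
        node = stack.pop()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                stack.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path))

def local_constraints(words, target_edges):
    """Emit constraints (5): for every candidate edge (w1, w2) missing from the
    target parse, the adjacent on-path target edges must outscore it by 1."""
    target = {frozenset(e) for e in target_edges}
    constraints = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if frozenset((i, j)) in target:
                continue
            path = tree_path(target_edges, i, j)
            # Adjacent target edges on the path, one at each endpoint.
            e1, e2 = (path[0], path[1]), (path[-2], path[-1])
            constraints.append((e1, (i, j)))   # encodes s(e1) >= s(i, j) + 1
            constraints.append((e2, (i, j)))   # encodes s(e2) >= s(i, j) + 1
    return constraints
```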
<Paragraph position="2"> These constraints behave very naturally, forcing the weight of an omitted edge to be smaller than the weights of the adjacent included edges with which it would form a loop; this ensures that the omitted edge would not be added to the maximum score spanning tree before the included edges.</Paragraph> <Paragraph position="3"> In this way, one can simply accumulate the set of linear constraints (5) for every edge that fails to be included in the target parse of the sentences where it is a candidate. We denote this set of constraints by $\mathcal{C}$. Importantly, the constraint set $\mathcal{C}$ is convex in the link weight parameters $\boldsymbol{\theta}$, as it consists only of linear constraints.</Paragraph> <Paragraph position="4"> Ignoring the non-crossing condition, the constraint set $\mathcal{C}$ is exact. However, because of the non-crossing condition, the constraint set $\mathcal{C}$ is more restrictive than necessary. For example, consider the word sequence $\ldots w_i\, w_{i+1}\, w_{i+2}\, w_{i+3} \ldots$ in which the edge $w_{i+1} w_{i+3}$ occurs in the target parse. Then the crossing edge $w_i w_{i+2}$ can be ruled out of the parse in one of two ways: it can be ruled out by making its score less than the adjacent scores as specified in (5), or it can be ruled out by making its score smaller than the score of $w_{i+1} w_{i+3}$. Thus, the exact constraint contains a disjunction of two different constraints, which creates a non-convex constraint in $\boldsymbol{\theta}$. (The union of two convex sets is not necessarily convex.) This is a weakening of the original constraint set $\mathcal{C}$. Unfortunately, this means that, given a large training corpus, the constraint set $\mathcal{C}$ can easily become infeasible.</Paragraph> <Paragraph position="5"> Nevertheless, the constraints in $\mathcal{C}$ capture much of the relevant structure in the data, and are easy to enforce. Therefore, we wish to maintain them. However, rather than impose the constraints exactly, we enforce them approximately through the introduction of slack variables $\boldsymbol{\xi}$. The relaxed constraints can then be expressed as

$$\mathrm{s}(w_1, w_1') \;\ge\; \mathrm{s}(w_1, w_2) + 1 - \xi_{w_1 w_2} \qquad\qquad \mathrm{s}(w_2, w_2') \;\ge\; \mathrm{s}(w_1, w_2) + 1 - \xi_{w_1 w_2} \qquad (6)$$

and a maximum soft margin solution can then be expressed as the quadratic program

$$\min_{\boldsymbol{\theta}, \boldsymbol{\xi} \ge 0} \;\; \beta\,\boldsymbol{\theta}^\top\boldsymbol{\theta} + \mathbf{e}^\top\boldsymbol{\xi} \quad \text{subject to (6)} \qquad (7)$$

for all constraints in $\mathcal{C}$, where $\mathbf{e}$ denotes the vector of all 1's. Even though the slacks are required because we have slightly over-constrained the parameters, given that there are so many parameters and a sparse data problem as well, it seems desirable to impose a stronger set of constraints. A set of solution parameters achieved in this way will allow maximum weight spanning trees to correctly parse nearly all of the training sentences, even without the non-crossing condition (see the results in Section 8).</Paragraph> <Paragraph position="6"> This quadratic program has the advantage of producing link parameters that will correctly parse most of the training data. Unfortunately, the main drawback of the method thus far is that it does not offer any mechanism by which the link weights $\theta_{ww'}$ can be generalized to new or rare words. Given the sparse data problem, some form of generalization is necessary to achieve good test results. We achieve this by exploiting distributional similarities between words to smooth the parameters.</Paragraph> </Section>
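The soft-margin program (7) is a small quadratic program once the constraint set has been collected. The sketch below sets it up with cvxpy purely for illustration; it assumes each constraint has already been mapped to feature vectors for the included and omitted edges, so each side's score is a linear function of the weights.

```python
import numpy as np
import cvxpy as cp

def train_soft_margin(constraint_pairs, dim, beta=1.0):
    """Solve (7): minimize beta * ||theta||^2 + sum(xi)
    subject to, for each (f_included, f_omitted) pair:
        theta . f_included >= theta . f_omitted + 1 - xi_k,  xi_k >= 0."""
    m = len(constraint_pairs)
    theta = cp.Variable(dim)
    xi = cp.Variable(m, nonneg=True)
    constraints = [
        theta @ f_inc >= theta @ f_omit + 1 - xi[k]
        for k, (f_inc, f_omit) in enumerate(constraint_pairs)
    ]
    objective = cp.Minimize(beta * cp.sum_squares(theta) + cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return theta.value

# Toy usage: two features, one constraint pair.
pairs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))]
print(train_soft_margin(pairs, dim=2))
```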
<Section position="8" start_page="24" end_page="25" type="metho"> <SectionTitle> 6 Distributional Word Similarity </SectionTitle> <Paragraph position="0"> Treebanks are an extremely precious resource. The average cost of producing a treebank parse can run as high as 30 person-minutes per sentence (20 words on average). Similarity-based smoothing, on the other hand, allows one to tap into auxiliary sources of raw unannotated text, which is practically unlimited. With this extra data, one can estimate parameters for words that have never appeared in the training corpus.</Paragraph> <Paragraph position="1"> The basic intuition behind similarity smoothing is that words that tend to appear in the same contexts tend to have similar meanings. This is known as the Distributional Hypothesis in linguistics (Harris, 1968). For example, the words test and exam are similar because both of them can follow verbs such as administer, cancel, cheat on, conduct, etc.</Paragraph> <Paragraph position="2"> Many methods have been proposed to compute distributional similarity between words, e.g., (Hindle, 1990; Pereira et al., 1993; Grefenstette, 1994; Lin, 1998). Almost all of the methods represent a word by a feature vector, where each feature corresponds to a type of context in which the word appeared. They differ in how the feature vectors are constructed and how the similarity between two feature vectors is computed.</Paragraph> <Paragraph position="3"> In our approach below, we define the features of a word $w$ to be the set of words that occurred within a small window of $w$ in a large corpus. The context window of $w$ consists of the closest non-stop-word on each side of $w$ and the stop-words in between. The value of a feature $w'$ is defined as the pointwise mutual information between $w'$ and $w$, $\mathrm{PMI}(w', w)$. The similarity between two words, $S(w_1, w_2)$, is then defined as the cosine of the angle between their feature vectors. We use this similarity information both in training and in parsing. For training, we smooth the parameters according to their underlying word-pair similarities by introducing a Laplacian regularizer, which will be introduced in the next section. For parsing, the link scores in (1) are smoothed by word similarities (similar to the approach used by (Wang et al., 2005)) before the maximum score projective dependency tree is computed.</Paragraph> </Section>
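A compact sketch of the similarity computation described in Section 6 above. The fixed-size window and the handling of marginal counts are simplifying assumptions (the paper's window skips over stop-words), but the feature values are PMI weights and the similarity is the cosine between the resulting sparse vectors.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=1):
    """Build PMI-weighted context vectors: feature w' of word w has value PMI(w', w)."""
    pair_counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    pair_counts[(w, sent[j])] += 1
    total = sum(pair_counts.values()) or 1
    center, context = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        center[w] += n
        context[c] += n
    vectors = defaultdict(dict)
    for (w, c), n in pair_counts.items():
        vectors[w][c] = math.log((n / total) /
                                 ((center[w] / total) * (context[c] / total)))
    return dict(vectors)

def cosine(u, v):
    """Cosine of the angle between two sparse feature vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```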
<Section position="9" start_page="25" end_page="26" type="metho"> <SectionTitle> 7 Laplacian Regularization </SectionTitle> <Paragraph position="0"> We wish to incorporate similarity-based smoothing in large margin training, while using the more refined constraints outlined in Section 5.</Paragraph> <Paragraph position="1"> Recall that most of the features we use, and therefore most of the parameters we need to estimate, are the bi-lexical parameters $\theta_{ww'}$ that serve as undirected link weights between words $w$ and $w'$ in our dependency parsing model (Section 3). Here we would like to ensure that two different link weights, $\theta_{w_1 w_2}$ and $\theta_{v_1 v_2}$, that involve similar words also take on similar values. The previous optimization (7) needs to be modified to take this into account.</Paragraph> <Paragraph position="2"> Smoothing the link parameters requires us to first extend the notion of word similarity to word-pair similarities, since each link involves two words. Given the similarities between individual words, computed above, we define the similarity between word pairs as the geometric mean of the similarities between corresponding words,

$$S(w_1 w_2,\, v_1 v_2) \;=\; \sqrt{S(w_1, v_1)\, S(w_2, v_2)} \qquad (8)$$

where $S(w_1, v_1)$ is defined as in Section 6 above.</Paragraph> <Paragraph position="3"> Then, instead of just solving the constraint system (7), we can also ensure that similar links take on similar parameter values by introducing a penalty on their deviations that is weighted by their similarity value. Specifically, we use the penalty

$$\frac{\beta}{2}\, \boldsymbol{\theta}^\top L(S)\, \boldsymbol{\theta} \qquad (9)$$

Here $L(S)$ is the Laplacian matrix of $S$, which is defined by $L(S) = D(S) - S$, where $D(S)$ is a diagonal matrix such that $D_{w_1 w_2,\, w_1 w_2} = \sum_{v_1 v_2} S(w_1 w_2,\, v_1 v_2)$. Since $\boldsymbol{\theta}^\top L(S)\, \boldsymbol{\theta} = \frac{1}{2} \sum_{w_1 w_2} \sum_{v_1 v_2} S(w_1 w_2,\, v_1 v_2)\, (\theta_{w_1 w_2} - \theta_{v_1 v_2})^2$, if two edges have a high similarity value, their parameters will be encouraged to take on similar values. By contrast, if two edges have low similarity, then there will be little mutual attraction on their parameter values.</Paragraph> <Paragraph position="4"> Note, however, that we do not smooth the parameters $\theta_{\mathrm{PMI}}$, $\theta_{\mathrm{dist}}$ and $\theta_{\mathrm{dist2}}$, corresponding to the pointwise mutual information, distance, and squared distance features described in Section 3, respectively. We only apply similarity smoothing to the bi-lexical parameters.</Paragraph> <Paragraph position="5"> The Laplacian regularizer (9) provides a natural smoother for the bi-lexical parameter estimates that takes into account valuable word similarity information computed as above. The Laplacian regularizer also has a significant computational advantage: it is guaranteed to be a convex quadratic function of the parameters (Zhu et al., 2001). Therefore, by combining the constraint system (7) with the Laplacian smoother (9), we obtain the convex optimization

$$\min_{\boldsymbol{\theta}, \boldsymbol{\xi} \ge 0} \;\; \frac{\beta}{2}\, \boldsymbol{\theta}^\top \tilde{L}(S)\, \boldsymbol{\theta} + \mathbf{e}^\top\boldsymbol{\xi} \quad \text{subject to (6)} \qquad (10)$$

for all constraints in $\mathcal{C}$, where $\tilde{L}(S)$ does not apply smoothing to $\theta_{\mathrm{PMI}}$, $\theta_{\mathrm{dist}}$ and $\theta_{\mathrm{dist2}}$.</Paragraph> <Paragraph position="6"> Clearly, (10) describes a large margin training program for dependency parsing, but one which uses word similarity smoothing for the bi-lexical parameters, together with the more refined set of constraints developed in Section 5. Although the constraints are more refined, they are fewer in number than in (4). That is, we now have only a polynomial number of constraints, corresponding to the word pairs in (5), rather than the exponential number over every possible parse tree in (4). Thus, we obtain a polynomial size quadratic program that can be solved for moderately large problems using standard software packages. We used CPLEX in our experiments below. As before, once optimized, the solution parameters $\boldsymbol{\theta}$ can be introduced into the dependency model (1) according to (2).</Paragraph> </Section> </Paper>
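As a closing illustration of the smoother in (9), the Laplacian penalty can be assembled directly from a pairwise edge-similarity matrix. This is a sketch under the assumption that the matrix S has already been built from (8); it is not the authors' code.

```python
import numpy as np

def laplacian(S):
    """Graph Laplacian L(S) = D(S) - S, with D the diagonal matrix of row sums."""
    S = np.asarray(S, dtype=float)
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(theta, S, beta=1.0):
    """Evaluate (beta/2) * theta^T L(S) theta, the smoother in (9).

    For symmetric S this equals (beta/4) * sum_ij S_ij (theta_i - theta_j)^2,
    so highly similar edges are pulled toward similar weights."""
    L = laplacian(S)
    return 0.5 * beta * float(theta @ L @ theta)

# Toy check: two highly similar edges with very different weights are penalized.
S = np.array([[0.0, 0.9],
              [0.9, 0.0]])
theta = np.array([1.0, -1.0])
print(laplacian_penalty(theta, S))   # -> 1.8 for this toy case
```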