File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/w05-1505_metho.xml
Size: 16,673 bytes
Last Modified: 2025-10-06 14:10:00
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1505"> <Title>Corrective Modeling for Non-Projective Dependency Parsing</Title> <Section position="4" start_page="43" end_page="44" type="metho"> <SectionTitle> 3 Constituency Parsing for Dependency Trees </SectionTitle> <Paragraph position="0"> A pragmatic justification for using constituency-based parsers in order to predict dependency structures is that currently the best Czech dependency-tree parser is a constituency-based parser (Collins et al., 1999; Zeman, 2004). In fact both Charniak's and Collins' generative probabilistic models contain lexical dependency features.</Paragraph> <Paragraph position="1"> From a generative modeling perspective, we use the constraints imposed by constituents (i.e., projectivity) to enable the encapsulation of syntactic substructures. This directly leads to efficient parsing algorithms such as the CKY algorithm and related agenda-based parsing algorithms (Manning and Sch&quot;utze, 1999). Additionally, this allows for the efficient computation of the scores for the dynamic-programming state variables (i.e., the inside and outside probabilities) that are used in efficient statistical parsers. The computational complexity advantages of dynamic programming techniques along with efficient search techniques (Caraballo and Charniak, 1998; Klein and Manning, 2003) allow for richer predictive models which include local contextual information.</Paragraph> <Paragraph position="2"> In an attempt to extend a constituency-based parsing model to train on dependency trees, Collins transforms the PDT dependency trees into constituency trees (Collins et al., 1999). In order to accomplish this task, he first normalizes the trees to remove non-projectivities. Then, he creates artificial constituents based on the parts-of-speech of the words associated with each dependency node.</Paragraph> <Paragraph position="3"> The mapping from dependency tree to constituency tree is not one-to-one. Collins describes a heuristic for choosing trees that work well with his parsing model.</Paragraph> <Section position="1" start_page="43" end_page="44" type="sub_section"> <SectionTitle> 3.1 Training a Constituency-based Parser </SectionTitle> <Paragraph position="0"> We consider two approaches to creating projective trees from dependency trees exhibiting nonprojectivities. The first is based on word-reordering and is the model that was used with the Collins parser. This algorithm identifies non-projective structures and deterministically reorders the words of the sentence to create projective trees. An alternative method, used by Charniak in the adaptation of his parser for Czech and used by Nivre and Nilsson (2005), alters the dependency links by raising the governor to a higher node in the tree whenever Bilexical dependencies are components of both the Collins and Charniak parsers and effectively model the types of syntactic subordination that we wish to extract in a dependency tree. (Bilexical models were also proposed by Eisner (Eisner, 1996)). In the absence of lexicalization, both parsers have dependency features that are encoded as head-constituent to sibling features. This information was provided by Eugene Charniak in a personal communication.</Paragraph> <Paragraph position="1"> PDT development data.</Paragraph> <Paragraph position="2"> a non-projectivity is observed. 
<Paragraph position="3"> Both of these techniques have advantages and disadvantages, which we briefly outline here: Reordering The dependency structure is preserved, but the training procedure will learn statistics for structures over word-strings that may not be part of the language. The parser, however, may be capable of constructing parses for any string of words if a smoothed grammar is being used.</Paragraph> <Paragraph position="4"> Governor-Raising The dependency structure is corrupted, leading the parser to incorporate arbitrary dependency statistics into the model. However, the parser is trained on true sentences, the words of which are in the correct linear order. We expect the parser to predict similar incorrect dependencies when sentences similar to the training data are observed.</Paragraph> <Paragraph position="6"> Although the results presented in (Collins et al., 1999) used the reordering technique, we have experimented with his parser using the governor-raising technique and observe an increase in dependency accuracy. For the remainder of the paper, we assume the governor-raising technique.</Paragraph> <Paragraph position="7"> The process of generating dependency trees from parsed constituency trees is relatively straightforward. Both the Collins and Charniak parsers provide head-word annotation on each constituent. This is precisely the information that we encode in an unlabeled dependency tree, so the dependency structure can simply be extracted from the parsed constituency trees. Furthermore, the constituency labels can be used to identify the dependency labels; however, we do not attempt to identify correct dependency labels in this work.</Paragraph> </Section> <Section position="2" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 3.2 Constituency-based errors </SectionTitle> <Paragraph position="0"> We now discuss a quantitative measure for the types of dependency errors made by constituency-based parsing techniques. For node w_i and its correct governor w_{g_t(i)}, the distance between the two nodes in the hypothesized dependency tree is: dist(w_i, w_{g_t(i)}) = d(w_i, w_{g_t(i)}) if the undirected path from w_i to w_{g_t(i)} passes through the parent of w_i, and dist(w_i, w_{g_t(i)}) = -d(w_i, w_{g_t(i)}) if it passes through a child of w_i. Ancestor, sibling, cousin, and descendant have the standard interpretation in the context of a tree. The dependency distance d(w_i, w_j) is the minimum number of dependency links traversed on the undirected path from w_i to w_j in the hypothesized dependency tree. The definition of the dist function makes a distinction between paths through the parent of w_i (positive values) and paths through the children of w_i (negative values). We found that a vast majority of the correct governors were actually hypothesized as siblings or grandparents (a dist value of 2) - an extreme local error.</Paragraph>
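<Paragraph> The following is a minimal sketch of how the signed dist measure just defined can be computed over a hypothesized tree; the list-based governor encoding and the helper names are our own assumptions, not the authors' code.

# A minimal sketch of the signed dist measure (gov[i] is the hypothesized
# governor of word i, gov[root] == -1).

def ancestors(gov, node):
    """Return the chain node, parent, grandparent, ..., root."""
    chain = []
    while node != -1:
        chain.append(node)
        node = gov[node]
    return chain

def dist(gov, dep, cand):
    """Signed distance from word `dep` to candidate governor `cand` in the
    hypothesized tree: positive if the undirected path leaves `dep` through
    its parent, negative if it leaves through one of its children."""
    up = ancestors(gov, dep)       # path dep -> root
    down = ancestors(gov, cand)    # path cand -> root
    common = next(n for n in up if n in down)   # lowest common ancestor
    d = up.index(common) + down.index(common)   # undirected path length
    return d if common != dep else -d

# Example: the hypothesized parent receives dist == 1, while a sibling or a
# grandparent of `dep` both receive dist == 2.
</Paragraph>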
<Paragraph position="13"> Figure 2 shows a histogram of the fraction of nodes whose correct governor was within a particular dist in the hypothesized tree. A dist of 1 indicates the correct governor was selected by the parser; in these graphs, the density at dist = 1 (on the x axis) shows the baseline dependency accuracy of each parser. Note that if we repaired only the nodes that are within a dist of 2 (grandparents and siblings), we can recover more than 50% of the incorrect dependency links (a raw accuracy improvement of up to 9%). We believe this distribution to be indirectly caused by the governor-raising projectivization routine. In the cases where non-projective structures can be repaired by raising the node's governor to its parent, the correct governor becomes a sibling of the node.</Paragraph> </Section> </Section> <Section position="5" start_page="44" end_page="47" type="metho"> <SectionTitle> 4 Corrective Modeling </SectionTitle> <Paragraph position="0"> The error analysis of the previous section suggests that by looking only at a local neighborhood of the proposed governor in the hypothesized trees, we can correct many of the incorrect dependencies. This fact motivates the corrective modeling procedure employed here.</Paragraph> <Paragraph position="1"> Table 1 presents the pseudo-code for the corrective procedure. The set g_h contains the indices of governors as predicted by the parser. The set of governors predicted by the corrective procedure is denoted as g'. The procedure independently corrects each node of the parsed trees, meaning that there is potential for inconsistent governor relationships to exist in the proposed set; specifically, the resulting dependency graph may have cycles. We employ a greedy search to remove cycles when they are present in the output graph.</Paragraph> <Paragraph position="2"> The final line of the algorithm picks the governor in which we are most confident. We use the correct-governor classification likelihood, p(g_t(i) = c | w_i, N(w_{g_h(i)})), as a measure of the confidence that w_c is the correct governor of w_i, where the parser had proposed w_{g_h(i)} as the governor. In effect, we create a decision list using the most likely decision if we can (i.e., there are no cycles). If the dependency graph resulting from the most likely decisions does not result in a tree, we use the decision lists to greedily select the tree for which the product of the independent decisions is maximal.</Paragraph> <Paragraph position="9"> Training the corrective model requires pairs of dependency trees; each pair contains a manually-annotated tree (i.e., the gold-standard tree) and a tree generated by the parser. This data is trivially transformed into per-node samples. For each node w_i, we generate |N(w_{g_h(i)})| samples, one for each governor candidate in the local neighborhood. One advantage of the type of corrective algorithm presented here is that it is completely disconnected from the parser used to generate the tree hypotheses. This means that the original parser need not be statistical or even constituency-based. What is critical for this technique to work is that the distribution of dependency errors be relatively local, as is the case with the errors made by the Charniak and Collins parsers. This can be determined via data analysis using the dist metric. Determining the size of the local neighborhood is data dependent. If subordinate nodes are considered as candidate governors, then a more robust cycle-removal technique would be required.</Paragraph>
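<Paragraph> As an illustration of the corrective procedure whose pseudo-code appears in Table 1, the following is a simplified per-node variant under our own assumptions: prob(i, cand, gov_h) stands in for the trained classifier's confidence that cand is the correct governor of word i, the neighborhood is restricted to the proposed governor, its governor, and its other children, and cycles are avoided greedily per node rather than by the whole-tree decision-list search described above.

# A simplified sketch of per-node correction with greedy cycle avoidance.

def neighborhood(gov_h, i):
    """Candidate governors for word i: the parser's proposal g, the governor
    of g (the grandparent of i), and the other children of g (siblings of i)."""
    g = gov_h[i]
    cands = {g}
    if gov_h[g] != -1:
        cands.add(gov_h[g])
    cands.update(j for j in range(len(gov_h)) if gov_h[j] == g and j != i)
    cands.discard(i)
    return sorted(cands)

def creates_cycle(gov, dep, cand):
    """Would attaching dep to cand introduce a cycle?"""
    node = cand
    while node != -1:
        if node == dep:
            return True
        node = gov[node]
    return False

def correct_tree(gov_h, prob):
    """For each word, rank candidate governors by classifier confidence and
    greedily keep the best cycle-free choice."""
    gov = list(gov_h)
    for i in range(len(gov_h)):
        if gov_h[i] == -1:
            continue                      # leave the root untouched
        ranked = sorted(neighborhood(gov_h, i),
                        key=lambda c: prob(i, c, gov_h), reverse=True)
        for cand in ranked:
            if not creates_cycle(gov, i, cand):
                gov[i] = cand
                break
    return gov
</Paragraph>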
<Section position="1" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 4.1 MaxEnt Estimation </SectionTitle> <Paragraph position="0"> We have chosen a MaxEnt model to estimate the correct-governor classification distribution p(g_t(i) = c | w_i, N(w_{g_h(i)})). In the next section we outline the feature set with which we have experimented, noting that the features are selected based on linguistic intuition (specifically for Czech). We choose not to factor the feature vector as it is not clear what constitutes a reasonable factorization of these features. For this reason we use the MaxEnt estimator, which provides us with the flexibility to incorporate interdependent features independently while still optimizing for likelihood.</Paragraph> <Paragraph position="3"> The maximum entropy principle states that we wish to find an estimate of p(y|x) ∈ C that maximizes the entropy over a sample set X for some set of outcomes Y, where x ∈ X is an observation and y ∈ Y is an outcome: p* = argmax_{p ∈ C} H(p), with H(p) = - Σ_{x ∈ X} ~p(x) Σ_{y ∈ Y} p(y|x) log p(y|x).</Paragraph> <Paragraph position="5"> The set C is the candidate set of distributions from which we wish to select p(y|x). We define this set as the p(y|x) that meets a feature-based expectation constraint. Specifically, we want the expected count of a feature, f(x,y), to be equivalent under the distribution p(y|x) and under the observed distribution ~p(y|x): Σ_{x,y} ~p(x) p(y|x) f_i(x,y) = Σ_{x,y} ~p(x,y) f_i(x,y) for each feature f_i.</Paragraph> <Paragraph position="7"> f_i(x,y) is a feature of our model with which we capture correlations between observations and outcomes. In the following section, we describe a set of features with which we have experimented to determine when a word is likely to be the correct governor of another word.</Paragraph> <Paragraph position="8"> We incorporate the expected feature-count constraints into the maximum entropy objective using Lagrange multipliers (additionally, constraints are added to ensure the distributions p(y|x) are consistent probability distributions): Λ(p, α) = H(p) + Σ_i α_i ( Σ_{x,y} ~p(x) p(y|x) f_i(x,y) - Σ_{x,y} ~p(x,y) f_i(x,y) ). Holding the α's constant, we compute the unconstrained maximum of the above Lagrangian form: p_α(y|x) = (1/Z_α(x)) exp( Σ_i α_i f_i(x,y) ), giving us the log-linear form of the distributions p(y|x) in C (Z is a normalization constant). Finally, we compute the α's that maximize the objective function: α* = argmax_α Λ(p_α, α). A number of algorithms have been proposed to efficiently compute the optimization described in this derivation. For a more detailed introduction to maximum entropy estimation see (Berger et al., 1996).</Paragraph> </Section> <Section position="2" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 4.2 Proposed Model </SectionTitle> <Paragraph position="0"> Given the above formulation of the MaxEnt estimation procedure, we define features over pairs of observations and outcomes. In our case, the observations are simply w_i and N(w_{g_h(i)}), and the outcome is a binary variable indicating whether c = g_t(i) (i.e., whether w_c is the correct governor). In order to limit the dimensionality of the feature space, we consider feature functions over the outcome, the current node w_i, the candidate governor node w_c, and the node proposed as the governor by the parser, w_{g_h(i)}. Table 2 describes the general classes of features used. We write F_i to indicate the form of the current child node, F_c for the form of the candidate, and F_g for the form of the governor proposed by the parser. A combined feature is denoted as L_iT_c and indicates that we observed a particular lemma for the current node with a particular tag of the candidate. In all models, we include features containing the form, the lemma, the morphological tag, and the ParserGov feature. We have experimented with different sets of feature combinations. Each combination set is intended to capture some intuitive linguistic correlation. For example, the feature component L_iT_c will fire if a particular child's lemma L_i is observed with a particular candidate's morphological tag T_c. This feature is intended to capture phenomena surrounding particles; for example, in Czech, the governor of the reflexive particle se will likely be a verb.</Paragraph>
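<Paragraph> To illustrate conjoined feature classes of the kind listed in Table 2 (e.g., L_iT_c) and the log-linear form derived above, here is a minimal sketch; the field names, feature templates, and the binary-outcome logistic scoring are our own illustration rather than the authors' actual feature set or estimator.

import math

# A minimal sketch of conjoined features and log-linear scoring (assumptions
# noted above); each word is a dict with "form", "lemma", and "tag" fields.

def features(child, cand, parser_gov):
    """Binary features over the current node (i), a candidate governor (c),
    and the parser-proposed governor (g); each is a conjunction of fields."""
    feats = [
        "Fi=%s|Fc=%s" % (child["form"], cand["form"]),
        "Li=%s|Tc=%s" % (child["lemma"], cand["tag"]),   # e.g. lemma 'se' with a verbal tag
        "Ti=%s|Tg=%s" % (child["tag"], parser_gov["tag"]),
    ]
    if cand is parser_gov:
        feats.append("ParserGov")                        # candidate is the parser's choice
    return feats

def p_correct(weights, child, cand, parser_gov):
    """Log-linear probability that `cand` is the correct governor:
    for a binary outcome, p(y=1|x) = exp(score) / (1 + exp(score))."""
    score = sum(weights.get(f, 0.0) for f in features(child, cand, parser_gov))
    return 1.0 / (1.0 + math.exp(-score))
</Paragraph>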
</Section> <Section position="3" start_page="47" end_page="47" type="sub_section"> <SectionTitle> 4.3 Related Work </SectionTitle> <Paragraph position="0"> Recent work by Nivre and Nilsson introduces a technique where the projectivization transformation is encoded in the non-terminals of constituents during parsing (Nivre and Nilsson, 2005). This allows for a deterministic procedure that undoes the projectivization in the generated parse trees, creating non-projective structures. This technique could be incorporated into a statistical parsing framework; however, we believe the sparsity of such non-projective configurations may be problematic when using smoothed backed-off grammars. We suspect that the deterministic procedure employed by Nivre and Nilsson enables their parser to greedily consider non-projective constructions when possible. This may also explain the relatively low overall performance of their parser.</Paragraph> <Paragraph position="1"> A primary difference between the Nivre and Nilsson approach and what we propose in this paper lies in the treatment of the projectivization procedure.</Paragraph> <Paragraph position="2"> While we exploit particular side-effects of the projectivization procedure, we do not assume any particular algorithm. Additionally, we consider transformations for all dependency errors, whereas their technique explicitly addresses non-projectivity errors. We mentioned above that our approach appears to be similar to that of reranking for statistical parsing (Collins, 2000; Charniak and Johnson, 2005). While it is true that we are improving upon the output of the automatic parser, we are not considering multiple alternate parses. Instead, we consider a complete set of alternate trees that are minimal perturbations of the best tree generated by the parser. In the context of dependency parsing, we do this in order to generate structures that constituency-based parsers are incapable of generating (i.e., non-projectivities).</Paragraph> <Paragraph position="3"> Recent work by Smith and Eisner (2005) on contrastive estimation suggests similar techniques to generate local neighborhoods of a parse; however, the purpose in their work is to define an approximation to the partition function for log-linear estimation (i.e., the normalization factor in a MaxEnt model).</Paragraph> </Section> </Section> </Paper>