Pseudo-Projective Dependency Parsing

2 Dependency Graph Transformations

We assume that the goal in dependency parsing is to construct a labeled dependency graph of the kind depicted in Figure 1. Formally, we define dependency graphs as follows:

1. Let R = {r1, ..., rm} be the set of permissible dependency types (arc labels).

2. A dependency graph for a string of words W = w1 ... wn is a labeled directed graph D = (W, A), where
   (a) W is the set of nodes, i.e. word tokens in the input string, ordered by a linear precedence relation <,
   (b) A is a set of labeled arcs (wi, r, wj), where wi, wj ∈ W and r ∈ R,
   (c) for every wj ∈ W, there is at most one arc (wi, r, wj) ∈ A.

3. A graph D = (W, A) is well-formed iff it is acyclic and connected.

If (wi, r, wj) ∈ A, we say that wi is the head of wj and wj a dependent of wi. In the following, we use the notation wi →r wj to mean that (wi, r, wj) ∈ A; we also use wi → wj to denote an arc with unspecified label and wi →* wj for the reflexive and transitive closure of the (unlabeled) arc relation. The dependency graph in Figure 1 satisfies all the defining conditions above, but it fails to satisfy the condition of projectivity (Kahane et al., 1998):

1. An arc wi → wk is projective iff, for every word wj occurring between wi and wk in the string (wi < wj < wk or wi > wj > wk), wi →* wj.

2. A dependency graph D = (W, A) is projective iff every arc in A is projective.

The arc connecting the head jedna (one) to the dependent Z (out-of) spans the token je (is), which is not dominated by jedna.

As observed by Kahane et al. (1998), any (non-projective) dependency graph can be transformed into a projective one by a lifting operation, which replaces each non-projective arc wj → wk by a projective arc wi → wk such that wi →* wj holds in the original graph. Here we use a slightly different notion of lift, applying to individual arcs and moving their head upwards one step at a time:

LIFT(wj → wk) = wi → wk if wi → wj ∈ A, and undefined otherwise.

Intuitively, lifting an arc makes the word wk dependent on the head wi of its original head wj (which is unique in a well-formed dependency graph), unless wj is a root, in which case the operation is undefined (but then wj → wk is necessarily projective if the dependency graph is well-formed).

Projectivizing a dependency graph by lifting non-projective arcs is a nondeterministic operation in the general case. However, since we want to preserve as much of the original structure as possible, we are interested in finding a transformation that involves a minimal number of lifts. Even this may be nondeterministic, in case the graph contains several non-projective arcs whose lifts interact, but we use the following algorithm, PROJECTIVIZE, to construct a minimal projective transformation D' = (W, A') of a (non-projective) dependency graph D = (W, A). The function SMALLEST-NONP-ARC returns the non-projective arc with the shortest distance from head to dependent (breaking ties from left to right).

Applying the function PROJECTIVIZE to the graph in Figure 1 yields the graph in Figure 2, where the problematic arc pointing to Z has been lifted from the original head jedna to the ancestor je.
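As a concrete illustration of this procedure, here is a minimal Python sketch, assuming a well-formed (acyclic, connected) input graph represented as a dict head_of that maps each dependent token index to a (head, label) pair, with the root token having no entry. The function and variable names are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of projectivization by repeated single-step lifting,
# assuming head_of: {dependent: (head, label)} with the root left out.

def dominates(head_of, ancestor, node):
    """True iff ancestor ->* node (reflexive, transitive closure of the arc relation)."""
    while True:
        if node == ancestor:
            return True
        if node not in head_of:      # reached the root without meeting ancestor
            return False
        node = head_of[node][0]

def is_projective(head_of, head, dep):
    """An arc head -> dep is projective iff the head dominates every word
    occurring strictly between head and dep in the string."""
    lo, hi = min(head, dep), max(head, dep)
    return all(dominates(head_of, head, w) for w in range(lo + 1, hi))

def smallest_nonp_arc(head_of):
    """Return the dependent of the non-projective arc with the shortest
    head-to-dependent distance (ties broken left to right), or None."""
    cands = [(abs(h - d), min(h, d), d)
             for d, (h, _) in head_of.items()
             if not is_projective(head_of, h, d)]
    return min(cands)[2] if cands else None

def projectivize(head_of):
    """Lift the smallest non-projective arc one step at a time until the
    graph is projective; returns a new head_of mapping."""
    head_of = dict(head_of)
    while True:
        dep = smallest_nonp_arc(head_of)
        if dep is None:
            return head_of
        head, label = head_of[dep]
        # LIFT: reattach dep to the head of its current head (always defined
        # here, since an arc out of the root is projective in a well-formed tree).
        head_of[dep] = (head_of[head][0], label)
```

Each lift moves the dependent's subtree one step closer to the root, so the loop terminates after a finite number of lifts.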
Using the terminology of Kahane et al. (1998), we say that jedna is the syntactic head of Z, while je is its linear head in the projectivized representation.

Unlike Kahane et al. (1998), we do not regard a projectivized representation as the final target of the parsing process. Instead, we want to apply an inverse transformation to recover the underlying (non-projective) dependency graph. In order to facilitate this task, we extend the set of arc labels to encode information about lifting operations. In principle, it would be possible to encode the exact position of the syntactic head in the label of the arc from the linear head, but this would give a potentially infinite set of arc labels and would make the training of the parser very hard. In practice, we can therefore expect a trade-off such that increasing the amount of information encoded in arc labels will cause an increase in the accuracy of the inverse transformation but a decrease in the accuracy with which the parser can construct the labeled representations. To explore this trade-off, we have performed experiments with three different encoding schemes (plus a baseline), which are described schematically in Table 1.

The baseline simply retains the original labels for all arcs, regardless of whether they have been lifted or not, and the number of distinct labels is therefore simply the number n of distinct dependency types. (Note that this is a baseline for the parsing experiment only, Experiment 2; for Experiment 1 it is meaningless as a baseline, since it would result in 0% accuracy.)

In the first encoding scheme, called Head, we use a new label d|h for each lifted arc, where d is the dependency relation between the syntactic head and the dependent in the non-projective representation, and h is the dependency relation that the syntactic head has to its own head in the underlying structure. Using this encoding scheme, the arc from je to Z in Figure 2 would be assigned the label AuxP|Sb (signifying an AuxP that has been lifted from a Sb).

In the second scheme, Head+Path, we additionally modify the label of every arc along the lifting path from the syntactic to the linear head, so that if the original label is p the new label is p|. Thus, the arc from je to jedna will be labeled Sb| (to indicate that there is a syntactic head below it). In the third and final scheme, denoted Path, we keep the extra information on path labels but drop the information about the syntactic head of the lifted arc, using the label d| instead of d|h (AuxP| instead of AuxP|Sb).

As can be seen from the last column in Table 1, both Head and Head+Path may theoretically lead to a quadratic increase in the number of distinct arc labels (Head+Path being worse than Head only by a constant factor), while the increase is only linear in the case of Path. On the other hand, we can expect Head+Path to be the most useful representation for reconstructing the underlying non-projective dependency graph. In approaching this problem, a variety of different methods are conceivable, including a more or less sophisticated use of machine learning. In the present study, we limit ourselves to an algorithmic approach, using a deterministic breadth-first search.
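To make the schemes concrete, the following minimal Python sketch shows how the extended labels could be produced during projectivization. The function names, the scheme identifiers, and the bare-bones representation of a lift are assumptions for illustration only.

```python
# Sketch of the label encodings: d is the label of the lifted arc, h is the
# label the syntactic head has to its own head, p is the label of an arc on
# the lifting path from the syntactic to the linear head.

def lifted_label(d, h, scheme):
    """Label assigned to an arc that has been lifted to its linear head."""
    if scheme == "baseline":
        return d                # original label kept, lift not recorded
    if scheme == "head":
        return d + "|" + h      # e.g. AuxP|Sb: an AuxP lifted from a Sb
    if scheme == "head+path":
        return d + "|" + h      # as Head, plus path marking (see path_label)
    if scheme == "path":
        return d + "|"          # e.g. AuxP|: head information dropped
    raise ValueError("unknown scheme: " + scheme)

def path_label(p, scheme):
    """Label assigned to an arc on the lifting path."""
    if scheme in ("head+path", "path"):
        return p + "|"          # e.g. Sb|: a syntactic head is found below this arc
    return p                    # Baseline and Head leave path arcs unchanged

# The example from Figure 2: the arc from je to Z (an AuxP lifted from a Sb)
# and the path arc from je to jedna (originally Sb).
assert lifted_label("AuxP", "Sb", "head") == "AuxP|Sb"
assert lifted_label("AuxP", "Sb", "path") == "AuxP|"
assert path_label("Sb", "head+path") == "Sb|"
```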
The details of the transformation procedure are slightly different depending on the encoding scheme:

* Head: For every arc of the form wi →d|h wn, we search the graph top-down, left-to-right, breadth-first, starting at the head node wi. If we find an arc wl →h wm, called a target arc, we replace wi →d|h wn by wm →d wn; otherwise we replace wi →d|h wn by wi →d wn (i.e. we let the linear head be the syntactic head).

* Head+Path: Same as Head, but the search only follows arcs of the form wj →p| wk, and a target arc must have the form wl →h| wm; if no target arc is found, Head is used as backoff.

* Path: Same as Head+Path, but a target arc must have the form wl →p| wm with wm having no outgoing arcs of the form wm →p'| wo; no backoff.

In section 4 we evaluate these transformations with respect to projectivized dependency treebanks, and in section 5 they are applied to parser output. Before we turn to the evaluation, however, we need to introduce the data-driven dependency parser used in the latter experiments.

3 Memory-Based Dependency Parsing

In the experiments below, we employ a data-driven deterministic dependency parser producing labeled projective dependency graphs, previously tested on Swedish (Nivre et al., 2004) and English (Nivre and Scholz, 2004). (The graphs satisfy all the well-formedness conditions given in section 2 except possibly connectedness; for robustness reasons, the parser may output a set of dependency trees instead of a single tree.) The parser builds dependency graphs by traversing the input from left to right, using a stack to store tokens that are not yet complete with respect to their dependents. At each point during the derivation, the parser has a choice between pushing the next input token onto the stack (with or without adding an arc from the token on top of the stack to the token pushed) and popping a token from the stack (with or without adding an arc from the next input token to the token popped). More details on the parsing algorithm can be found in Nivre (2003).

The choice between different actions is in general nondeterministic, and the parser relies on a memory-based classifier, trained on treebank data, to predict the next action based on features of the current parser configuration. Table 2 shows the features used in the current version of the parser. At each point during the derivation, the prediction is based on six word tokens: the two topmost tokens on the stack and the next four input tokens. For each token, three types of features may be taken into account: the word form; the part-of-speech assigned by an automatic tagger; and labels on previously assigned dependency arcs involving the token, namely the arc from its head and the arcs to its leftmost and rightmost dependents, respectively. Except for the leftmost dependent of the next input token, dependency type features are limited to tokens on the stack.

The prediction based on these features is a k-nearest neighbor classification, using the IB1 algorithm with k = 5, the modified value difference metric (MVDM), and class voting with inverse distance weighting, as implemented in the TiMBL software package (Daelemans et al., 2003).
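To illustrate how the derivation and the classifier interact, here is a schematic Python sketch of the deterministic parsing loop. The classifier interface (a predict method over the current configuration), the action names, and the configuration encoding are assumptions for exposition rather than the actual implementation, and checks that a predicted action is permissible are omitted.

```python
# Schematic sketch of a deterministic, classifier-guided parsing loop over a
# tokenized sentence; arcs are (head, label, dependent) triples of token indices.

def parse(words, classifier):
    stack, arcs = [], []
    i = 0                                   # index of the next input token
    while i < len(words):
        action, label = classifier.predict(stack, i, words, arcs)
        if action == "shift":               # push next token, no new arc
            stack.append(i)
            i += 1
        elif action == "right-arc":         # arc from top of stack to next token, then push
            arcs.append((stack[-1], label, i))
            stack.append(i)
            i += 1
        elif action == "left-arc":          # arc from next token to top of stack, then pop
            arcs.append((i, label, stack[-1]))
            stack.pop()
        elif action == "reduce":            # pop without adding an arc
            stack.pop()
        else:
            raise ValueError("unknown action: " + action)
    return arcs
```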
More details on the memory-based prediction can be found in Nivre et al. (2004) and Nivre and Scholz (2004).

4 Experiment 1: Treebank Transformation

The first experiment uses data from two dependency treebanks. The Prague Dependency Treebank (PDT) consists of more than 1M words of newspaper text, annotated on three levels: the morphological, analytical and tectogrammatical levels (Hajič, 1998). Our experiments all concern the analytical annotation, and the first experiment is based only on the training part. The Danish Dependency Treebank (DDT) comprises about 100K words of text selected from the Danish PAROLE corpus, with annotation of primary and secondary dependencies (Kromann, 2003). The entire treebank is used in the experiment, but only primary dependencies are considered. In all experiments, punctuation tokens are included in the data but omitted in evaluation scores.

In the first part of the experiment, dependency graphs from the treebanks were projectivized using the algorithm described in section 2. As shown in Table 3, the proportion of sentences containing some non-projective dependency ranges from about 15% in DDT to almost 25% in PDT. However, the overall percentage of non-projective arcs is less than 2% in PDT and less than 1% in DDT. The last four columns in Table 3 show the distribution of non-projective arcs with respect to the number of lifts required. It is worth noting that, although non-projective constructions are less frequent in DDT than in PDT, they seem to be more deeply nested, since only about 80% of the non-projective arcs in DDT can be projectivized with a single lift, while almost 95% of those in PDT require only a single lift.

In the second part of the experiment, we applied the inverse transformation based on breadth-first search under the three different encoding schemes. The results are given in Table 4. As expected, the most informative encoding, Head+Path, gives the highest accuracy, with over 99% of all non-projective arcs being recovered correctly in both data sets. However, it can be noted that the results for the least informative encoding, Path, are almost comparable, while the third encoding, Head, gives substantially worse results for both data sets. We also see that the increase in the size of the label sets for Head and Head+Path is far below the theoretical upper bounds given in Table 1. The increase is generally higher for PDT than for DDT, which indicates a greater diversity in non-projective constructions.