<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2006"> <Title>Finding non-local dependencies: beyond pattern matching</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 From the Penn Treebank to a dependency treebank </SectionTitle> <Paragraph position="0"> This section describes the corpus of dependency structures that we used to evaluate our algorithm. The corpus was automatically derived from the Penn Treebank II corpus (Marcus et al., 1993) by means of the script chunklink.pl (Buchholz, 2002), which we modified to fit our purposes. The script uses a variant of a head percolation table to identify the heads of constituents, and then converts the result to a dependency format. We refer to (Buchholz, 2002) for a thorough description of the conversion algorithm, and only highlight the two most important modifications that we made.</Paragraph> <Paragraph position="1"> One modification of the conversion algorithm concerns participles and reduced relative clauses modifying NPs. Regular participles in the Penn Treebank II are simply annotated as VPs adjoined to the modified NPs (see Figure 1(a)). These participles (also called reduced relative clauses, as they lack auxiliary verbs and complementizers) are both syntactically and semantically similar to full relative clauses, but the Penn annotation does not introduce empty complementizers, thus preventing co-indexing of a trace with any antecedent. We perform a simple heuristic modification while converting the Treebank to the dependency format: when we encounter an NP modified by a VP headed by a past participle, an object dependency is introduced between the head of the VP and the head of the NP.</Paragraph> <Paragraph position="2"> Figure 1(b) shows an example, with solid arrows denoting local and dotted arrows denoting non-local dependencies. Arrows are marked with dependency labels and go from dependents to heads.</Paragraph> <Paragraph position="3"> This simple heuristic does not allow us to handle all reduced relative clauses, because some of them correspond to PPs or NPs rather than VPs; such cases, however, are quite rare in the Treebank.</Paragraph> <Paragraph position="4"> The second important change to Buchholz's script concerns the structure of VPs. For every verb cluster, we choose the main verb as the head of the cluster, and leave modal and auxiliary verbs as dependents of the main verb. A similar modification was used by Eisner (1996) for the study of dependency parsing models. As will be described below, this allows us to &quot;factor out&quot; the tense and modality of finite clauses from our patterns, making the patterns more general.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Pattern extraction and matching </SectionTitle> <Paragraph position="0"> After converting the Penn Treebank to a dependency treebank, we first extracted non-local dependency patterns. As in (Johnson, 2002), our patterns are minimal connected fragments containing both nodes involved in a non-local dependency. However, in our case these fragments are not connected sets of local trees, but shortest paths in local dependency graphs, leading from heads to non-local dependents. Patterns do not include POS tags of the involved words, but only labels of the dependencies.
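To make the extraction step concrete, the following is a minimal Python sketch of shortest-path pattern extraction. The flat-list arc representation and the names (extract_pattern, arcs) are our own illustration of the idea, not the paper's implementation.

    from collections import deque

    def extract_pattern(arcs, head, dependent):
        """Shortest path from `head` to `dependent` in the local dependency
        graph, returned as a list of (label, direction) steps; this path,
        with its two endpoints distinguished, is the pattern.

        arcs: iterable of (dep_word, label, head_word) triples; arrows go
        from dependents to heads, as in the paper's figures.
        """
        # Build an undirected adjacency view, keeping each arc's label and
        # the direction in which it is traversed.
        adj = {}
        for dep, label, hd in arcs:
            adj.setdefault(dep, []).append((hd, label, "up"))    # dep -> head
            adj.setdefault(hd, []).append((dep, label, "down"))  # head -> dep
        # Plain BFS: the first time we reach `dependent`, the path is shortest.
        queue = deque([(head, [])])
        seen = {head}
        while queue:
            node, path = queue.popleft()
            if node == dependent:
                return path
            for nbr, label, direction in adj.get(node, []):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append((nbr, path + [(label, direction)]))
        return None  # head and dependent are not connected

For example 3 in Table 2 ("practices that the government has identified"), a call such as extract_pattern(arcs, "identified", "practices") would return only dependency labels and traversal directions; as stated above, no POS tags enter the pattern.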
Thus, a pattern is a directed graph with labeled edges and two distinguished nodes: the head and the dependent of the corresponding non-local dependency. When several patterns intersect, as may happen, for example, when a word participates in more than one non-local dependency, these patterns are handled independently. Figure 2 shows examples of dependency graphs (above) and extracted patterns (below, with filled bullets corresponding to the nodes of a non-local dependency) for the sentence &quot;Henderson will become chairman, succeeding Butler. . .&quot; As before, dotted lines denote non-local dependencies.</Paragraph> <Paragraph position="2"> The definition of a structure matching a pattern and the algorithms for pattern matching and pattern extraction from a corpus are straightforward and similar to those described in (Johnson, 2002).</Paragraph> <Paragraph position="3"> The total number of non-local dependencies found in the Penn WSJ is 57325, and the number of different extracted patterns is 987. The 80 most frequent patterns (those that we used for the evaluation of our algorithm) cover 53700 of these 57325 non-local dependencies (93.7%). These patterns were further cleaned up manually; e.g., most Penn functional tags (-TMP, -CLR, etc., but not -OBJ, -SBJ, -PRD) were removed. Thus, we ended up with 16 structural patterns (covering the same 93.7% of the Penn Treebank).</Paragraph> <Paragraph position="4"> Table 1 shows some of the patterns found in the Penn Treebank. The column Count gives the number of times a pattern introduces non-local dependencies in the corpus. The Match column gives the number of times a pattern actually occurs in the corpus (whether or not it introduces a non-local dependency). The patterns are shown as dependency graphs with labeled arrows from dependents to heads. The column Dependency shows the labels and directions of the introduced non-local dependencies.</Paragraph> <Paragraph position="5"> Clearly, an occurrence of a pattern alone is not enough for inserting a non-local dependency and determining its label, since for many patterns Match is significantly greater than Count. For this reason we introduce a set of additional structural features associated with patterns. For every occurrence of a pattern and for every word of this occurrence, we extract the following features:

- pos, the POS tag of the word;
- class, the simplified word class (similar to (Eisner, 1996));
- fin, whether the word is a verb and the head of a finite verb cluster (as opposed to infinitives, gerunds or participles);
- subj, whether the word has a dependent (possibly not included in the pattern) with the dependency label NP-SBJ; and
- obj, the same for the NP-OBJ label.</Paragraph> <Paragraph position="6"> Thus, an occurrence of a pattern is associated with a sequence of symbolic features: five features for each node in the pattern. E.g., a pattern consisting of two nodes is associated with ten features.

Table 2. Example pattern instances (numbers refer to the patterns in Table 1). Words that are neither heads nor non-local dependents are in italic; they correspond to empty bullets in the patterns in Table 1. Boldfaced words correspond to filled bullets in Table 1.

1 NP-SBJ . . . symptoms that[dep] show[head] up decades later. . .
2 ADVP . . . buying futures when[dep] future prices fall[head]. . .
3 NP-OBJ . . . practices that[dep] the government has identified[head]. . .
4 NP-SBJ . . . the airline[dep] had been planning to initiate[head] service. . .
5 NP-OBJ . . . that its absence[dep] is to blame[head] for the sluggish development. . .
6 NP-OBJ . . . the situation[dep] will get settled[head] in the short term. . .
7 NP-OBJ . . . the number[dep] of planes the company has sold[head]. . .
8 NP-SBJ . . . one of the first countries[dep] to conclude[head] its talks. . .
9 ADVP . . . buying sufficient options[dep] to purchase[head] shares. . .
10 NP-SBJ . . . both magazines[dep] are expected to announce[head] their ad rates. . .
11 NP-SBJ . . . which[dep] is looking to expand[head] its business. . .
12 NP-OBJ . . . the programs[dep] we wanted to do[head]. . .
13 NP-SBJ . . . you[dep] can't make soap without turning[head] up the flame. . .</Paragraph>
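As an illustration of how the five per-node features could be read off a pattern occurrence, here is a hedged Python sketch; the Word attributes and the simplify_class helper are hypothetical stand-ins, not the actual feature extractor.

    def node_features(word, local_dependents):
        """The five symbolic features for one node of a pattern occurrence.

        word: hypothetical object with .pos (Penn POS tag) and .finite
        (True iff the word heads a finite verb cluster);
        local_dependents: (word, label) pairs locally headed by this word.
        """
        labels = {label for _, label in local_dependents}
        return (
            word.pos,                   # pos:   POS tag of the word
            simplify_class(word.pos),   # class: simplified word class
            word.finite,                # fin:   head of a finite verb cluster?
            "NP-SBJ" in labels,         # subj:  has a local NP-SBJ dependent?
            "NP-OBJ" in labels,         # obj:   has a local NP-OBJ dependent?
        )

    def simplify_class(pos):
        """Crude stand-in for the word classes of Eisner (1996):
        collapse Penn tags into coarse groups."""
        for prefix, cls in (("VB", "verb"), ("NN", "noun"), ("JJ", "adj"),
                            ("RB", "adv"), ("PRP", "pron")):
            if pos.startswith(prefix):
                return cls
        return pos

Concatenating these tuples over the nodes of an occurrence yields the five-features-per-node vectors used in the classification step below.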
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Classification of pattern instances </SectionTitle> <Paragraph position="0"> Given a pattern instance and its feature vector, our task now is to determine whether the pattern introduces a non-local dependency and, if so, what the label of this dependency is. In many cases this is not a binary decision, since one pattern may introduce several possible labeled dependencies (e.g., the pattern • —S→ · · · • in Table 1). Our task is thus a classification task: an instance of a pattern must be assigned to one of two or more classes, corresponding to the possible dependency labels (or the absence of a dependency).</Paragraph> <Paragraph position="1"> We train a classifier on instances extracted from a corpus, and then apply it to previously unseen instances. The procedure for finding non-local dependencies now consists of two steps:

1. given a local dependency structure, find matching patterns and their feature vectors;
2. for each pattern instance found, use the classifier to identify a possible non-local dependency.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments and evaluation </SectionTitle> <Paragraph position="0"> In our experiments we used sections 02-22 of the Penn Treebank as the training corpus and section 23 as the test corpus. First, we extracted all non-local patterns from the Penn Treebank, which resulted in 987 different (pattern, non-local dependency) pairs.</Paragraph> <Paragraph position="1"> As described in Section 3, after cleaning up we kept 16 of the most common patterns.</Paragraph> <Paragraph position="2"> For each of these 16 patterns, instances of the pattern, the pattern features, and a non-local dependency label (or the special label &quot;no&quot; if no dependency was introduced by the instance) were extracted from the training and test corpora.</Paragraph> <Paragraph position="3"> We performed experiments with two statistical classifiers: the decision tree induction system C4.5 (Quinlan, 1993) and the Tilburg Memory-Based Learner (TiMBL) (Daelemans et al., 2002). In most cases TiMBL performed slightly better; the results described in this section were obtained using TiMBL.</Paragraph> <Paragraph position="4"> For each of the 16 structural patterns, a separate classifier was trained on the set of (feature vector, label) pairs extracted from the training corpus, and then evaluated on the pairs from the test corpus. Table 1 shows the results for some of the most frequent patterns, using the conventional metrics: precision (the fraction of correctly labeled dependencies among all dependencies found), recall (the fraction of correctly found dependencies among all dependencies with a given label) and f-score (the harmonic mean of precision and recall).
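Concretely, with predicted and gold dependencies represented as sets of (head, dependent, label) triples (a representation we adopt only for illustration), the three metrics can be computed as:

    def precision_recall_f(predicted, gold):
        """Precision, recall and f-score over labeled dependencies.

        predicted, gold: sets of (head, dependent, label) triples.
        """
        correct = len(predicted & gold)  # found with the correct label
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f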
The table also shows the number of times a pattern (together with a specific non-local dependency label) actually occurs in the whole Penn Treebank corpus (the column Dependency count).</Paragraph> <Paragraph position="5"> In order to compare our results to those presented in (Johnson, 2002), we measured the overall performance of the algorithm across patterns and non-local dependency labels. This corresponds to the row &quot;Overall&quot; of Table 4 in (Johnson, 2002), repeated here in Table 4. We also evaluated the procedure on NP traces across all patterns, i.e., on non-local dependencies with NP-SBJ, NP-OBJ or NP-PRD labels. This corresponds to rows 2, 3 and 4 of Table 4 in (Johnson, 2002). Our results are presented in Table 3. The first three columns show the results for those non-local dependencies that are actually covered by our 16 patterns (i.e., for 93.7% of all non-local dependencies). The last three columns present the evaluation with respect to all non-local dependencies; the precision is the same, but recall drops accordingly. These last columns give the results that can be compared to Johnson's. It is difficult, however, to make a strict comparison of our results and those in (Johnson, 2002). The two algorithms are designed for slightly different purposes: while Johnson's approach allows one to recover free empty nodes (without antecedents), we look for non-local dependencies, which corresponds to the identification of co-indexed empty nodes (note, however, the modifications described in Section 2, where we actually transform free empty nodes into co-indexed empty nodes).</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> The results presented in the previous section show that it is possible to improve over the simple pattern-matching algorithm of (Johnson, 2002) by using dependency rather than phrase-structure information, by using more skeletal patterns (as suggested by Johnson), and by associating a set of features with instances of patterns.</Paragraph> <Paragraph position="1"> One of the reasons for this improvement is that our approach allows us to discriminate between different syntactic phenomena involving non-local dependencies. In most cases our patterns correspond to linguistic phenomena. This helps us understand why a particular construction is easy or difficult for our approach, and in many cases suggests the necessary modifications to the algorithm (e.g., adding other features to instances of patterns). For example, for patterns 11 and 12 (see Tables 1 and 2) our classifier distinguishes subjects and objects reasonably well, apparently because the feature &quot;has a local object&quot; (obj) is explicitly present for all instances (for examples 11 and 12 in Table 2, expand has a local object, but do does not).</Paragraph> <Paragraph position="2"> Another reason is that the patterns are general enough to factor out minor syntactic differences in linguistic phenomena (e.g., see example 4 in Table 2). Indeed, the 16 most frequent patterns cover 93.7% of all non-local dependencies in the corpus. This is mainly due to our choices in the dependency representation, such as making the main verb the head of a verb phrase, as sketched below.
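To illustrate this representational choice, here is a minimal sketch of the re-heading of verb clusters, under two simplifying assumptions of our own: the main verb is cluster-final, and an ad-hoc AUX label is used for the demoted auxiliaries (neither assumption is from the original conversion script).

    def rehead_verb_cluster(cluster, arcs):
        """Make the main verb the head of its verb cluster.

        cluster: the verbs of one cluster in surface order, e.g.
                 ["will", "have", "been", "sold"];
        arcs:    list of (dependent, label, head) triples, modified in place.
        Assumes the cluster-final verb is the main verb.
        """
        main = cluster[-1]
        members = set(cluster)
        for i, (dep, label, head) in enumerate(arcs):
            # Re-attach outside dependents of any auxiliary to the main verb.
            if head in members and head != main and dep not in members:
                arcs[i] = (dep, label, main)
        # Drop the old arcs inside the cluster and demote the auxiliaries.
        arcs[:] = [a for a in arcs if not (a[0] in members and a[2] in members)]
        arcs.extend((aux, "AUX", main) for aux in cluster[:-1])
        return arcs

For "will have been sold", sold becomes the head and will, have, been attach to it; the finiteness information that this flattening discards is exactly what the fin feature described in Section 3 restores.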
During the conversion to a dependency treebank and the extraction of patterns, some important information may have been lost (e.g., the finiteness of a verb cluster, or the presence of a subject or object); for that reason we had to associate patterns with additional features, encoding this information and providing it to the classifier. In other words, we first take an &quot;oversimplified&quot; representation of the data, and then try to find what other data features can be useful. This strategy appears to be successful, because it allows us to identify which information is important for the recovery of non-local dependencies.</Paragraph> <Paragraph position="4"> More generally, the reasonable overall performance of the algorithm is due to the fact that for the most common non-local dependencies (extraction in relative clauses and reduced relative clauses, passivization, control and raising) the structural information we extract is enough to robustly identify non-local dependencies in a local dependency graph: the most frequent patterns in Table 1 are also those with the best scores. However, many less frequent phenomena appear to be much harder. For example, performance for relative clauses with extracted objects or adverbs is much worse than for subject relative clauses (e.g., patterns 2 and 3 vs. 1 in Table 1). Apparently, in most cases this is not due to a lack of training data, but because structural information alone is not enough: lexical preferences, subcategorization information, or even semantic properties should be considered. We think that the approach allows us to identify those &quot;hard&quot; cases.</Paragraph> <Paragraph position="5"> The natural next step in evaluating our algorithm is to work with the output of a parser instead of the original local structures from the Penn Treebank. Obviously, because of parsing errors the performance drops significantly: e.g., in the experiments reported in (Johnson, 2002) the overall f-score decreases from 0.75 to 0.68 when evaluating on parser output (see Table 4). While experimenting with Collins' parser (Collins, 1999), we found that for our algorithm the accuracy drops even more dramatically when we train the classifier on Penn Treebank data and test it on parser output. One of the reasons is that, since we run our algorithm not on the parser's output itself but on that output automatically converted to dependency structures, conversion errors also contribute to the performance drop. Moreover, the conversion script is highly tailored to the Penn Treebank annotation (with functional tags and empty nodes) and, when run on the parser's output, produces structures with somewhat different dependency labels. Since our algorithm is sensitive to the exact labels of the dependencies, it suffers from these systematic errors.</Paragraph> <Paragraph position="6"> One possible solution to this problem could be to extract patterns and train the classification algorithm not on the training part of the Penn Treebank, but on the parser output for it. This would allow us to train and test our algorithm on data of the same nature.</Paragraph> </Section> </Paper>