<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1040">
  <Title>Enriching the Output of a Parser Using Memory-Based Learning</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 An Overview of the Method
</SectionTitle>
    <Paragraph position="0"> In this section we give a high-level overview of our method for transforming a parser's output and describe the different steps of the process. In the experiments we used the parsers described in (Charniak, 2000) and (Collins, 1999). For Collins' parser the text was first POS-tagged using Ratnaparkhi's maximum enthropy tagger.</Paragraph>
    <Paragraph position="1"> The training phase of the method consists in learning which transformations need to be applied to the output of a parser to make it as similar to the treebank data as possible.</Paragraph>
    <Paragraph position="2"> As a preliminary step (Step 0), we convert the WSJ2 to a dependency corpus without losing the annotated information (functional tags, empty nodes, non-local dependencies). The same conversion is applied to the output of the parsers we consider. The details of the conversion process are described in Section 4 below.</Paragraph>
    <Paragraph position="3"> The training then proceeds by comparing graphs derived from a parser's output with the graphs from the dependency corpus, detecting various mismatches, such as incorrect arc labels and missing nodes or arcs. Then the following steps are taken to fix the mismatches:  Obviously, other modifications are possible, such as deleting arcs or moving arcs from one node to another. We leave these for future work, though, and focus on the three transformations mentioned above.</Paragraph>
    <Paragraph position="4"> The dependency corpus was split into training (WSJ sections 02-21), development (sections 00- null 2Thoughout the paper WSJ refers to the Penn Treebank II Wall Street Journal corpus.</Paragraph>
    <Paragraph position="5"> 01) and test (section 23) corpora. For each of the steps 1, 2 and 3 we proceed as follows: 1. compare the training corpus to the output of the parser on the strings of the corpus, after applying the transformations of the previous steps 2. identify possible beneficial transformations (which arc labels need to be changed or where new nodes or arcs need to be added) 3. train a memory-based classifier to predict pos null sible transformations given their context (i.e., information about the local structure of the dependency graph around possible application sites).</Paragraph>
    <Paragraph position="6"> While the definitions of the context and application site and the graph modifications are different for the three steps, the general structure of the method remains the same at each stage. Sections 6, 7 and 8 describe the steps in detail.</Paragraph>
    <Paragraph position="7"> In the application phase of the method, we proceed similarly. First, the output of the parser is converted to dependency graphs, and then the learners trained during the steps 1, 2 and 3 are applied in sequence to perform the graph transformations.</Paragraph>
    <Paragraph position="8"> Apart from the conversion from phrase structures to dependency graphs and the extraction of some linguistic features for the learning, our method does not use any information about the details of the tree-bank annotation or the parser's output: it works with arbitrary labelled directed graphs.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Step 0: From Constituents to
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Dependencies
</SectionTitle>
      <Paragraph position="0"> To convert phrase trees to dependency structures, we followed the commonly used scheme (Collins, 1999). The conversion routine,3 described below, is applied both to the original WSJ structures and the output of the parsers, though the former provides more information (e.g., traces) which is used by the conversion routine if available.</Paragraph>
      <Paragraph position="1"> First, for the treebank data, all traces are resolved and corresponding empty nodes are replaced with links to target constituents, so that syntactic trees become directed acyclic graphs. Second, for each constituent we detect its head daughters (more than one in the case of conjunction) and identify lexical heads. Then, for each constituent we output new dependencies between its lexical head and the lexical heads of its non-head daughters. The label of every new dependency is the constituent's phrase  results of the conversion to dependency structures of (c) the Penn tree and of (d) the parser's output label, stripped of all functional tags and coindexing marks, conjoined with the label of the non-head daughter, with its functional tags but without coindexing marks. Figure 1 shows an example of the original Penn annotation (a), the output of Charniak's parser (b) and the results of our conversion of these trees to dependency structures (c and d). The interpretation of the dependency labels is straightforward: e.g., the label S a0 NP-TMP corresponds to a sentence (S) being modified by a temporal noun phrase (NP-TMP).</Paragraph>
      <Paragraph position="2"> The core of the conversion routine is the selection of head daughters of the constituents. Following (Collins, 1999), we used a head table, but extended it with a set of additional rules, based on constituent labels, POS tags or, sometimes actual words, to account for situations where the head table alone gave unsatisfactory results. The most notable extension is our handling of conjunctions, which are often left relatively flat in WSJ and, as a result, in a parser's output: we used simple pattern-based heuristics to detect conjuncts and mark all conjuncts as heads of a conjunction.</Paragraph>
      <Paragraph position="3"> After the conversion, every resulting dependency structure is modified deterministically: a1 auxiliary verbs (be, do, have) become dependents of corresponding main verbs (similar to modal verbs, which are handled by the head table); null a1 to fix a WSJ inconsistency, we move the -LGS tag (indicating logical subject of passive in a by-phrase) from the PP to its child NP.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Dependency-based Evaluation of
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Parsers
</SectionTitle>
      <Paragraph position="0"> After the original WSJ structures and the parsers' outputs have been converted to dependency structures, we evaluate the performance of the parsers against the dependency corpus. We use the standard precision/recall measures over sets of dependencies (excluding punctuation marks, as usual) and evaluate Collins' and Charniak's parsers on WSJ section 23 in three settings: a1 on unlabelled dependencies; a1 on labelled dependencies with only bare labels (all functional tags discarded); a1 on labelled dependencies with functional tags. Notice that since neither Collins' nor Charniak's parser outputs WSJ functional labels, all dependencies with functional labels in the gold parse will be judged incorrect in the third setting. The evaluation results are shown in Table 1, in the row &amp;quot;step 0&amp;quot;.4 As explained above, the low numbers for the dependency evaluation with functional tags are expected, because the two parsers were not intended to produce functional labels.</Paragraph>
      <Paragraph position="1"> Interestingly, the ranking of the two parsers is different for the dependency-based evaluation than for PARSEVAL: Charniak's parser obtains a higher PARSEVAL score than Collins' (89.0% vs. 88.2%),  but slightly lower f-score on dependencies without functional tags (82.9% vs. 83.4%).</Paragraph>
      <Paragraph position="2"> To summarize the evaluation scores at this stage, both parsers perform with f-score around 87% on unlabelled dependencies. When evaluating on bare dependency labels (i.e., disregarding functional tags) the performance drops to 83%. The new errors that appear when taking labels into account come from different sources: incorrect POS tags (NN vs. VBG), different degrees of flatness of analyses in gold and test parses (JJ vs. ADJP, or CD vs. QP) and inconsistencies in the Penn annotation (VP vs. RRC). Finally, the performance goes down to around 66% when taking into account functional tags, which are not produced by the parsers at all.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Step 1: Changing Dependency Labels
</SectionTitle>
    <Paragraph position="0"> Intuitively, it seems that the 66% performance on labels with functional tags is an underestimation, because much of the missing information is easily recoverable. E.g., one can think of simple heuristics to distinguish subject NPs, temporal PPs, etc., thus introducing functional labels and improving the scores. Developing such heuristics would be a very time consuming and ad hoc process: e.g., Collins' -A and -g tags may give useful clues for this labelling, but they are not available in the output of other parsers. As an alternative to hard-coded heuristics, Blaheta and Charniak (2000) proposed to recover the Penn functional tags automatically. On the Penn Treebank, they trained a statistical model that, given a constituent in a parsed sentence and its context (parent, grandparent, head words thereof etc.), predicted the functional label, possibly empty. The method gave impressive performance, with 98.64% accuracy on all constituents and 87.28% f-score for non-empty functional labels, when applied to constituents correctly identified by Charniak's parser. If we extrapolate these results to labelled PARSEVAL with functional labels, the method would give around 87.8% performance (98.64% of the &amp;quot;usual&amp;quot; 89%) for Charniak's parser. Adding functional labels can be viewed as a relabelling task: we need to change the labels produced by a parser. We considered this more general task, and used a different approach, taking dependency graphs as input. We first parsed the training part of our dependency tree-bank (sections 02-21) and identified possible relabellings by comparing dependencies output by a parser to dependencies from the treebank.</Paragraph>
    <Paragraph position="1"> E.g., for Collins' parser the most frequent relabellings were S a0 NPa0 S a0 NP-SBJ, PP a0 NP-Aa0 PP a0 NP, VP a0 NP-Aa0 VP a0 NP, S a0 NP-Aa0 S a0 NP-SBJ and VP a0 PPa0 VP a0 PP-CLR. In total, around 30% of all the parser's dependencies had different labels in the treebank. We then learned a mapping from the parser's labels to those in the dependency corpus, using TiMBL, a memory-based classifier (Daelemans et al., 2003). The features used for the relabelling were similar to those used by Blaheta and Charniak, but redefined for dependency structures. For each dependency we included:  from the parser's output.</Paragraph>
    <Paragraph position="2"> When included in feature vectors, all dependency labels were split at 'a9', e.g., the label S a0 NP-A resulted in two features: S and NP-A.</Paragraph>
    <Paragraph position="3"> Testing was done as follows. The test corpus (section 23) was also parsed, and for each dependency a feature vector was formed and given to TiMBL to correct the dependency label. After this transformation the outputs of the parsers were evaluated, as before, on dependencies in the three settings. The results of the evaluation are shown in Table 1 (the row marked &amp;quot;step 1&amp;quot;).</Paragraph>
    <Paragraph position="4"> Let us take a closer look at the evaluation results. Obviously, relabelling does not change the unlabelled scores. The 1% improvement for evaluation on bare labels suggests that our approach is capable not only of adding functional tags, but can also correct the parser's phrase labels and part-of-speech tags: for Collins' parser the most frequent correct changes not involving functional labels were NP a0 NNa5 NP a0 JJ and NP a0 JJa5 NP a0 VBN, fixing POS tagging errors. A very substantial increase of the labelled score (from 66% to 81%), which is only 6% lower than unlabelled score, clearly indicates that, although the parsers do not produce functional labels, this information is to a large extent implicitly present in trees and can be recovered.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Comparison to Earlier Work
</SectionTitle>
      <Paragraph position="0"> One effect of the relabelling procedure described above is the recovery of Penn functional tags. Thus, it is informative to compare our results with those reported in (Blaheta and Charniak, 2000) for this same task. Blaheta and Charniak measured tagging accuracy and precision/recall for functional tag identification only for constituents correctly identified by the parser (i.e., having the correct span and nonterminal label). Since our method uses the dependency formalism, to make a meaningful comparison we need to model the notion of a constituent being correctly found by a parser. For a word a0 we say that the constituent corresponding to its maximal projection is correctly identified if there exists a1 , the head of a0 , and for the dependency a0 a5a7a1 the right part of its label (e.g., NP-SBJ for S a0 NP-SBJ) is a nonterminal (i.e., not a POS tag) and matches the right part of the label in the gold dependency structure, after stripping functional tags. Thus, the constituent's label and headword should be correct, but not necessarily the span. Moreover, 2.5% of all constituents with functional labels (246 out of 9928 in section 23) are not maximal projections. Since our method ignores functional tags of such constituents (these tags disappear after the conversion of phrase structures to dependency graphs), we consider them as errors, i.e., reducing our recall value.</Paragraph>
      <Paragraph position="1"> Below, the tagging accuracy, precision and recall are evaluated on constituents correctly identified by Charniak's parser for section 23.</Paragraph>
      <Paragraph position="2">  The difference in the accuracy is due to two reasons. First, because of the different definition of a correctly identified constituent in the parser's output, we apply our method to a greater portion of all labels produced by the parser (95% vs. 89% reported in (Blaheta and Charniak, 2000)). This might make the task for out system more difficult. And second, whereas 22% of all constituents in section 23 have a functional tag, 36% of the maximal projections have one. Since we apply our method only to labels of maximal projections, this means that our accuracy baseline (i.e., never assign any tag) is lower.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Step 2: Adding Missing Nodes
</SectionTitle>
    <Paragraph position="0"> As the row labelled &amp;quot;step 1&amp;quot; in Table 1 indicates, for both parsers the recall is relatively low (6% lower than the precision): while the WSJ trees, and hence the derived dependency structures, contain non-local dependencies and empty nodes, the parsers simply do not provide this information. To make up for this, we considered two further tranformations of the output of the parsers: adding new nodes (corresponding to empty nodes in WSJ), and adding new labelled arcs. This section describes the former modification and Section 8 the latter.</Paragraph>
    <Paragraph position="1"> As described in Section 4, when converting WSJ trees to dependency structures, traces are resolved, their empty nodes removed and new dependencies introduced. Of the remaining empty nodes (i.e., non-traces), the most frequent in WSJ are: NP PRO, empty units, empty complementizers, empty relative pronouns. To add missing empty nodes to dependency graphs, we compared the output of the parsers on the strings of the training corpus after steps 0 and 1 (conversion to dependencies and relabelling) to the structures in the corpus itself. We trained a classifier which, for every word in the parser's output, had to decide whether an empty node should be added as a new dependent of the word, and what its symbol ('*', '*U*' or '0' in WSJ), POS tag (always -NONE- in WSJ) and the label of the new dependency (e.g., 'S a0 NP-SBJ' for  NP PRO and 'VP a0 SBAR' for empty complementizers) should be. This decision is conditioned on the word itself and its context. The features used were: a1 the word and its POS tag, whether the word has any subject and object dependents, and whether it is the head of a finite verb group; a1 the same information for the word's head (if any) and also the label of the corresponding dependency; null a1 the same information for the rightmost and  leftmost dependents of the word (if exist) along with their dependency labels.</Paragraph>
    <Paragraph position="2"> In total, we extracted 23 symbolic features for every word in the corpus. TiMBL was trained on sections 02-21 and applied to the output of the parsers (after steps 0 and 1) on the test corpus (section 23), producing a list of empty nodes to be inserted in the dependency graphs. After insertion of the empty nodes, the resulting structures were evaluated against section 23 of the gold dependency treebank.</Paragraph>
    <Paragraph position="3"> The results are shown in Table 1 (the row &amp;quot;step 2&amp;quot;). For both parsers the insertion of empty nodes improves the recall by 1.5%, resulting in a 1% increase of the f-score.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Comparison to Earlier Work
</SectionTitle>
      <Paragraph position="0"> A procedure for empty node recovery was first described in (Johnson, 2002), along with an evaluation criterion: an empty node is correct if its category and position in the sentence are correct. Since our method works with dependency structures, not phrase trees, we adopt a different but comparable criterion: an empty node should be attached as a dependent to the correct word, and with the correct dependency label. Unlike the first metric, our correctness criterion also requires that possible attachment ambiguities are resolved correctly (e.g., as in the number of reports 0 they sent, where the empty relative pronoun may be attached either to number or to reports).</Paragraph>
      <Paragraph position="1"> For this task, the best published results (using Johnson's metric) were reported by Dienes and Dubey (2003), who used shallow tagging to insert empty elements. Below we give the comparison to our method. Notice that this evaluation does not include traces (i.e., empty elements with antecedents): recovery of traces is described in Section 8.</Paragraph>
      <Paragraph position="2">  For comparison we use the notation of Dienes and Dubey: PRO-NP for uncontrolled PROs (nodes '*' in the WSJ), COMP-SBAR for empty complementizers (nodes '0' with dependency label VP a0 SBAR), COMP-WHNP for empty relative pronouns (nodes '0' with dependency label X a0 SBAR, where X a0a1 VP) and UNIT for empty units (nodes '*U*').</Paragraph>
      <Paragraph position="3"> It is interesting to see that for empty nodes except for UNIT both methods have their advantages, showing better precision or better recall. Yet shallow tagging clearly performs better for UNIT.</Paragraph>
      <Paragraph position="4"> 8 Step 3: Adding Missing Dependencies We now get to the third and final step of our transformation method: adding missing arcs to dependency graphs. The parsers we considered do not explicitly provide information about non-local dependencies (control, WH-extraction) present in the treebank. Moreover, newly inserted empty nodes (step 2, Section 7) might also need more links to the rest of a sentence (e.g., the inserted empty complementizers). In this section we describe the insertion of missing dependencies.</Paragraph>
      <Paragraph position="5"> Johnson (2002) was the first to address recovery of non-local dependencies in a parser's output. He proposed a pattern-matching algorithm: first, from the training corpus the patterns that license non-local dependencies are extracted, and then these patterns are detected in unseen trees, dependencies being added when matches are found. Building on these ideas, Jijkoun (2003) used a machine learning classifier to detect matches. We extended Jijkoun's approach by providing the classifier with lexical information and using richer patterns with labels containing the Penn functional tags and empty nodes, detected at steps 1 and 2.</Paragraph>
      <Paragraph position="6"> First, we compared the output of the parsers on the strings of the training corpus after steps 0, 1 and 2 to the dependency structures in the training corpus. For every dependency that is missing in the parser's output, we find the shortest undirected path in the dependency graph connecting the head and the dependent. These paths, connected sequences of labelled dependencies, define the set of possible patterns. For our experiments we only considered patterns occuring more than 100 times in the training corpus. E.g., for Collins' parser, 67 different patterns were found.</Paragraph>
      <Paragraph position="7"> Next, from the parsers' output on the strings of the training corpus, we extracted all occurrences of the patterns, along with information about the nodes involved. For every node in an occurrence of a pattern we extracted the following features: a1 the word and its POS tag; a1 whether the word has subject and object dependents; null a1 whether the word is the head of a finite verb cluster.</Paragraph>
      <Paragraph position="8"> We then trained TiMBL to predict the label of the missing dependency (or 'none'), given an occurrence of a pattern and the features of all the nodes involved. We trained a separate classifier for each pattern.</Paragraph>
      <Paragraph position="9"> For evaluation purposes we extracted all occurrences of the patterns and the features of their nodes from the parsers' outputs for section 23 after steps 0, 1 and 2 and used TiMBL to predict and insert new dependencies. Then we compared the resulting dependency structures to the gold corpus. The results are shown in Table 1 (the row &amp;quot;step 3&amp;quot;). As expected, adding missing dependencies substantially improves the recall (by 4% for both parsers) and allows both parsers to achieve an 84% f-score on dependencies with functional tags (90% on unlabelled dependencies). The unlabelled f-score 89.9% for Collins' parser is close to the 90.9% reported in (Collins, 1999) for the evaluation on unlabelled local dependencies only (without empty nodes and traces). Since as many as 5% of all dependencies in WSJ involve traces or empty nodes, the results in Table 1 are encouraging.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
8.1 Comparison to Earlier Work
</SectionTitle>
      <Paragraph position="0"> Recently, several methods for the recovery of non-local dependencies have been described in the literature. Johnson (2002) and Jijkoun (2003) used pattern-matching on local phrase or dependency structures. Dienes and Dubey (2003) used shallow preprocessing to insert empty elements in raw sentences, making the parser itself capable of finding non-local dependencies. Their method achieves a considerable improvement over the results reported in (Johnson, 2002) and gives the best evaluation results published to date. To compare our results to Dienes and Dubey's, we carried out the transformation steps 0-3 described above, with a single modification: when adding missing dependencies (step 3), we only considered patterns that introduce non-local dependencies (i.e., traces: we kept the information whether a dependency is a trace when converting WSJ to a dependency corpus).</Paragraph>
      <Paragraph position="1"> As before, a dependency is correctly found if its head, dependent, and label are correct. For traces, this corresponds to the evaluation using the head-based antecedent representation described in (Johnson, 2002), and for empty nodes without antecedents (e.g., NP PRO) this is the measure used in Section 7.1. To make the results comparable to other methods, we strip functional tags from the dependency labels before label comparison. Below are the overall precision, recall, and f-score for our method and the scores reported in (Dienes and Dubey, 2003) for antecedent recovery using Collins' parser.</Paragraph>
      <Paragraph position="2">  Interestingly, the overall performance of our post-processing method is very similar to that of the pre- and in-processing methods of Dienes and Dubey (2003). Hence, for most cases, traces and empty nodes can be reliably identified using only local information provided by a parser, using the parser itself as a black box. This is important, since making parsers aware of non-local relations need not improve the overall performance: Dienes and Dubey (2003) report a decrease in PARSEVAL f-score from 88.2% to 86.4% after modifying Collins' parser to resolve traces internally, although this allowed them to achieve high accuracy for traces.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>