<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1018">
  <Title>A simple pattern-matching algorithm for recovering empty nodes and their antecedents</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A pattern-matching algorithm
</SectionTitle>
    <Paragraph position="0"> [Displaced caption of Table 1, truncated: "... 21 of the Penn Treebank (there are approximately 64,000 empty nodes in total). The Label column gives the terminal label of the empty node, the POS column gives its preterminal label, and the Antecedent column gives the label of its antecedent. The entry with an SBAR POS and empty label corresponds to an empty compound SBAR subtree, as explained in the text and Figure 3."] This section describes the pattern-matching algorithm in detail. In broad outline the algorithm can</Paragraph>
    <Paragraph position="1">  be regarded as an instance of the Memory-Based Learning approach, where both the pattern extraction and pattern matching involve recursively visiting all of the subtrees of the tree concerned. It can also be regarded as a kind of tree transformation, so the overall system architecture (including the parser) is an instance of the transform-detransform approach advocated by Johnson (1998). The algorithm has two phases. The first phase extracts the patterns from the trees in the training corpus. The second phase uses these extracted patterns to insert empty nodes and index their antecedents in trees that do not contain empty nodes. Before the trees are used in the training and insertion phases they are passed through a common preprocessing step, which relabels preterminal nodes dominating auxiliary verbs and transitive verbs.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Auxiliary and transitivity annotation
</SectionTitle>
      <Paragraph position="0"> The preprocessing step relabels auxiliary verbs and transitive verbs in all trees seen by the algorithm.</Paragraph>
      <Paragraph position="1"> This relabelling is deterministic and depends only on the terminal (i.e., the word) and its preterminal label.</Paragraph>
      <Paragraph position="2"> Auxiliary verbs such as is and being are relabelled as AUX and AUXG respectively. The relabelling of auxiliary verbs was performed primarily because Charniak's parser (which produced one of the test corpora) produces trees with such labels; experiments (on the development section) show that auxiliary relabelling has little effect on the algorithm's performance.</Paragraph>
      <Paragraph position="3"> The transitive verb relabelling suffixes the preterminal labels of transitive verbs with t. For example, in Figure 1 the verb likes is relabelled VBZ t in this step. A verb is deemed transitive if its stem is followed by an NP without any grammatical function annotation at least 50% of the time in the training corpus; all such verbs are relabelled whether or not any particular instance is followed by an NP.</Paragraph>
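The transitivity test above can be sketched as follows. This is a minimal sketch, not the paper's implementation: it assumes the training corpus has already been reduced to per-token observations, and the function names and the "_t" suffix spelling are illustrative.

```python
from collections import Counter

def transitive_stems(observations, threshold=0.5):
    """Decide which verb stems count as transitive.

    observations: iterable of (stem, followed_by_bare_np) pairs, one per
    verb token in the training corpus; followed_by_bare_np is True when
    the token is immediately followed by an NP carrying no grammatical
    function annotation.  A stem is deemed transitive when that happens
    at least `threshold` (50% in the paper) of the time.
    """
    totals, np_counts = Counter(), Counter()
    for stem, followed in observations:
        totals[stem] += 1
        if followed:
            np_counts[stem] += 1
    return {s for s in totals if np_counts[s] / totals[s] >= threshold}

def relabel(preterminal, stem, transitive):
    # Every instance of a transitive verb is relabelled, whether or not
    # this particular instance is followed by an NP.
    return preterminal + "_t" if stem in transitive else preterminal
```

Note that the decision is made once per stem and then applied uniformly to all of its instances, exactly because transitivity is meant as a lexical cue rather than a per-token observation.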
      <Paragraph position="4"> Intuitively, transitivity would seem to be a powerful cue that there is an empty node following a verb. Experiments on the development corpus showed that transitivity annotation provides a small but useful improvement to the algorithm's performance.</Paragraph>
      <Paragraph position="5"> The accuracy of transitivity labelling was not systematically evaluated here.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Patterns and matchings
</SectionTitle>
      <Paragraph position="0"> Informally, patterns are minimal connected tree fragments containing an empty node and all nodes co-indexed with it. The intuition is that the path from the empty node to its antecedents specifies important aspects of the context in which the empty node can appear.</Paragraph>
      <Paragraph position="1"> There are many different possible ways of realizing this intuition, but all of the ones tried gave approximately similar results, so we present the simplest one here. The results given below were generated with the following definition: the pattern for an empty node is the minimal tree fragment (i.e., connected set of local trees) required to connect the empty node with all of the nodes coindexed with it. Any indices occurring on nodes in the pattern are systematically renumbered beginning with 1. If an empty node does not bear an index, its pattern is just the local tree containing it. Figure 4 displays the single pattern that would be extracted corresponding to the two empty nodes in the tree depicted in Figure 1.</Paragraph>
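The index-renumbering step can be sketched as follows. This is a hypothetical helper operating on Penn-style suffixed labels such as WHNP-7; it assumes the pattern's node labels are supplied in some fixed traversal order, which is not specified in the paper.

```python
import re

def renumber(labels):
    """Systematically renumber coindexation indices starting from 1.

    labels: the pattern's node labels in a fixed traversal order;
    labels of the form BASE-<digits> (e.g. "WHNP-7") are treated as
    bearing index <digits>.  The first index encountered becomes 1,
    the second 2, and so on; co-indexed nodes keep matching indices.
    """
    mapping = {}
    out = []
    for label in labels:
        m = re.match(r"^(.*)-(\d+)$", label)
        if not m:
            out.append(label)
            continue
        base, idx = m.groups()
        if idx not in mapping:
            mapping[idx] = str(len(mapping) + 1)
        out.append(base + "-" + mapping[idx])
    return out
```

Renumbering makes patterns extracted from different trees comparable, so that two fragments differing only in their original treebank indices collapse to the same pattern.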
      <Paragraph position="2"> For this kind of pattern we define pattern matching informally as follows. If p is a pattern and t is a tree, then p matches t iff t is an extension of p ignoring empty nodes in p. For example, the pattern displayed in Figure 4 matches the subtree rooted under SBAR depicted in Figure 2.</Paragraph>
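A minimal sketch of this matching relation, representing trees as (label, children) tuples and marking empty terminals with the Penn Treebank label -NONE-. The representation is an assumption, and the paper's actual implementation is considerably more complex, as noted later in this section.

```python
def is_empty(node):
    """True for a node whose entire yield consists of empty terminals."""
    label, children = node
    if label == "-NONE-":
        return True
    return bool(children) and all(is_empty(c) for c in children)

def matches(pattern, tree):
    """True iff `tree` is an extension of `pattern`, ignoring the empty
    nodes in the pattern: every non-empty pattern node must be mirrored
    in the tree, and the tree may contain arbitrary further structure
    below the pattern's frontier."""
    plabel, pchildren = pattern
    tlabel, tchildren = tree
    if plabel != tlabel:
        return False
    # Drop children of the pattern that dominate only empty nodes.
    real = [c for c in pchildren if not is_empty(c)]
    if not real:                # pattern frontier: anything below matches
        return True
    if len(real) != len(tchildren):
        return False
    return all(matches(p, t) for p, t in zip(real, tchildren))
```

Substitution then runs in the opposite direction: where the pattern matches, its local trees (including the empty nodes) replace the corresponding fragment of the tree.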
      <Paragraph position="3"> If a pattern p matches a tree t, then it is possible to substitute p for the fragment of t that it matches.</Paragraph>
      <Paragraph position="4"> For example, the result of substituting the pattern shown in Figure 4 for the subtree rooted under SBAR depicted in Figure 2 is the tree shown in Figure 1.</Paragraph>
      <Paragraph position="5"> Note that the substitution process must standardize apart or renumber indices appropriately in order to avoid accidentally labelling empty nodes inserted by two independent patterns with the same index.</Paragraph>
      <Paragraph position="6"> Pattern matching and substitution can be defined more rigorously using tree automata (Gécseg and Steinby, 1984), but for reasons of space these definitions are not given here.</Paragraph>
      <Paragraph position="7"> In fact, the actual implementation of pattern matching and substitution used here is considerably more complex than just described. It goes to some lengths to handle complex cases such as adjunction and cases where two or more empty nodes' paths cross (in these cases the pattern extracted consists of the union of the local trees that constitute the patterns for each of the empty nodes). However, given the low frequency of these constructions, there is probably only one case where this extra complexity is justified: viz., the empty compound SBAR subtree shown in Figure 3.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Empty node insertion
</SectionTitle>
      <Paragraph position="0"> Suppose we have a rank-ordered list of patterns (the next subsection describes how to obtain such a list).</Paragraph>
      <Paragraph position="1"> The procedure that uses these to insert empty nodes into a tree t not containing empty nodes is as follows. We perform a pre-order traversal of the subtrees of t (i.e., visit parents before their children), and at each subtree we find the set of patterns that match the subtree. If this set is non-empty we substitute the highest ranked pattern in the set into the subtree, inserting an empty node and (if required) co-indexing it with its antecedents.</Paragraph>
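The insertion procedure can be sketched as follows. Here matches and substitute stand in for the matching and substitution routines described earlier; the (label, children) tuple representation and the function signatures are assumptions for illustration.

```python
def insert_empty_nodes(tree, ranked_patterns, matches, substitute):
    """Pre-order insertion of empty nodes.

    At the current node, try the patterns in rank order and substitute
    the highest-ranked one that matches; then recurse into the children
    of the (possibly rewritten) node, so parents are handled before
    their descendants.
    """
    for pattern in ranked_patterns:
        if matches(pattern, tree):
            tree = substitute(pattern, tree)
            break
    label, children = tree
    return (label,
            [insert_empty_nodes(c, ranked_patterns, matches, substitute)
             for c in children])
```

Because a parent is rewritten before its children are visited, a deep pattern that matches high in the tree gets its substitution in first, which is exactly the bias toward more embedded patterns discussed in the next paragraph of the paper.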
      <Paragraph position="2"> Note that the use of a pre-order traversal effectively biases the procedure toward deeper, more embedded patterns. Since empty nodes are typically located in the most embedded local trees of patterns (i.e., movement is usually upward in a tree), if two different patterns (corresponding to different non-local dependencies) could potentially insert empty nodes into the same tree fragment in t, the deeper pattern will match at a higher node in t, and hence will be substituted. Since the substitution of one pattern typically destroys the context for a match of another pattern, the shallower patterns no longer match. On the other hand, since shallower patterns contain less structure, they are likely to match a greater variety of trees than the deeper patterns, so they still have ample opportunity to apply.</Paragraph>
      <Paragraph position="3"> Finally, the pattern matching process can be sped up considerably by indexing patterns appropriately, since the number of patterns involved is quite large (approximately 11,000). For patterns of the kind described here, patterns can be indexed on their topmost local tree (i.e., the pattern's root node label and the sequence of node labels of its children).</Paragraph>
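Indexing on the topmost local tree might look like this. This is a sketch: in practice the key would be computed from each pattern after removing its empty nodes, so that it agrees with the empty-node-free subtrees being matched.

```python
def index_key(tree):
    """Key a tree by its topmost local tree: the root label plus the
    tuple of its children's labels."""
    label, children = tree
    return (label, tuple(c[0] for c in children))

def build_index(patterns):
    # Group the ~11,000 patterns by their topmost local tree so that
    # only a handful need to be tried in full at any subtree.
    index = {}
    for p in patterns:
        index.setdefault(index_key(p), []).append(p)
    return index

def candidates(index, subtree):
    return index.get(index_key(subtree), [])
```

A subtree whose root label or child sequence differs from every pattern's topmost local tree is rejected by a single dictionary lookup, without any recursive matching.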
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Pattern extraction
</SectionTitle>
      <Paragraph position="0"> After relabelling preterminals as described above, patterns are extracted during a traversal of each of the trees in the training corpus. Table 2 lists the most frequent patterns extracted from the Penn Tree-bank training corpus. The algorithm also records how often each pattern was seen; this is shown in the count column of Table 2.</Paragraph>
      <Paragraph position="1"> The next step of the algorithm determines approximately how many times each pattern can match some subtree of a version of the training corpus from which all empty nodes have been removed (regardless of whether or not the corresponding substitutions would insert empty nodes correctly). This information is shown under the match column in Table 2, and is used to filter out patterns which would most often be incorrect to apply even though they match.</Paragraph>
      <Paragraph position="2"> If c is the count value for a pattern and m is its match value, then the algorithm discards that pattern when the lower bound of a 67% confidence interval for its success probability (given c successes out of m trials) is less than 1/2. This is a standard technique for discounting success probabilities estimated from small samples (Witten and Frank, 2000). (As explained immediately below, the estimates of c and m given in Table 2 are inaccurate, so whenever the estimate of m is less than c we replace m by c in this calculation.) This pruning removes approximately 2,000 patterns, leaving 9,000 patterns.</Paragraph>
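The pruning rule can be sketched with a Wilson score interval. The paper cites Witten and Frank (2000) without giving the formula, so the choice of interval and the z value (here roughly a two-sided 67% interval) are assumptions.

```python
from math import sqrt

def wilson_lower(c, m, z=0.97):
    """Lower bound of a confidence interval for the success probability
    given c successes in m trials (Wilson score interval; z = 0.97 is
    an assumed stand-in for a two-sided 67% interval)."""
    m = max(m, c)        # the paper replaces m by c whenever m < c
    if m == 0:
        return 0.0
    p = c / m
    centre = p + z * z / (2 * m)
    margin = z * sqrt(p * (1 - p) / m + z * z / (4 * m * m))
    return (centre - margin) / (1 + z * z / m)

def keep_pattern(c, m):
    # Discard the pattern when the lower bound falls below 1/2.
    return wilson_lower(c, m) >= 0.5
```

The effect is that a pattern seen once and matched twice (a raw success rate of 1/2 on almost no evidence) is pruned, while a pattern with the same rate over hundreds of trials survives.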
      <Paragraph position="3"> The match value is obtained by making a second pre-order traversal through a version of the training data from which empty nodes are removed. It turns out that subtle differences in how the match value is obtained make a large difference to the algorithm's performance. Initially we defined the match value of a pattern to be the number of subtrees that match that pattern in the training corpus. But as explained above, the earlier substitution of a deeper pattern may prevent smaller patterns from applying, so this simple definition of match value undoubtedly over-estimates the number of times shallow patterns might apply. To avoid this over-estimation, after we have matched all patterns against a node of a training corpus tree we determine the correct pattern (if any) to apply in order to recover the empty nodes that were originally present, and reinsert the relevant empty nodes. This blocks the matching of shallower patterns, reducing their match values and hence raising their success probability. (Undoubtedly the count values are also over-estimated in the same way; however, experiments showed that estimating count values in a similar manner to the way in which match values are estimated reduces the algorithm's performance.)</Paragraph>
      <Paragraph position="4"> Finally, we rank all of the remaining patterns. We experimented with several different ranking criteria, including pattern depth, success probability (i.e., c/m) and discounted success probability. Perhaps surprisingly, all produced similar results on the development corpus. We used pattern depth as the ranking criterion to produce the results reported below because it ensures that deep patterns receive a chance to apply. For example, this ensures that the pattern inserting an empty NP * and WHNP can apply before the pattern inserting an empty complementizer 0.</Paragraph>
      <Paragraph position="5"> 3 Empty node recovery evaluation The previous section described an algorithm for restoring empty nodes and co-indexing their antecedents. This section describes two evaluation procedures for such algorithms. The first, which measures the accuracy of empty node recovery but not co-indexation, is just the standard Parseval evaluation applied to empty nodes only, viz., precision and recall and scores derived from these. In this evaluation, each node is represented by a triple consisting of its category and its left and right string positions. (Note that because empty nodes dominate the empty string, their left and right string positions are always identical.)</Paragraph>
      <Paragraph position="6"> [Displaced caption of Table 2, truncated: "... column is the number of times the pattern was found, and the Match column is an estimate of the number of times that this pattern matches some subtree in the training corpus during empty node recovery, as explained in the text."] Let G be the set of such empty node representations derived from the gold standard evaluation corpus and T the set of empty node representations</Paragraph>
      <Paragraph position="7"> derived from the corpus to be evaluated. Then, as is standard, the precision P, recall R and f-score f are calculated as follows: P = |G ∩ T| / |T|, R = |G ∩ T| / |G|, and f = 2PR / (P + R).</Paragraph>
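These metrics can be computed directly over sets of (category, left, right) triples; the sketch below mirrors the standard Parseval definitions.

```python
def prf(gold, test):
    """Precision, recall and f-score over sets of empty-node
    representations (category, left, right); for empty nodes the left
    and right string positions coincide, since they dominate the empty
    string."""
    correct = len(gold & test)
    p = correct / len(test) if test else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For the co-indexation evaluation described below, the same function applies unchanged once each triple is augmented with the (frozen) set of its antecedents' triples.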
      <Paragraph position="9"> Table 3 provides these measures for two different test corpora: (i) a version of section 23 of the Penn Treebank from which empty nodes, indices and unary branching chains consisting of nodes of the same category were removed, and (ii) the trees produced by Charniak's parser on the strings of section 23 (Charniak, 2000).</Paragraph>
      <Paragraph position="10"> To evaluate co-indexation of empty nodes and their antecedents, we augment the representation of empty nodes as follows. The augmented representation for empty nodes consists of the triple of category plus string positions as above, together with the set of triples of all of the non-empty nodes the empty node is co-indexed with. (Usually this set of antecedents is either empty or contains a single node). Precision, recall and f-score are de ned for these augmented representations as before.</Paragraph>
      <Paragraph position="11"> Note that this is a particularly stringent evaluation measure for a system including a parser, since it is necessary for the parser to produce a non-empty node of the correct category in the correct location to serve as an antecedent for the empty node. Table 4 provides these measures for the same two corpora described earlier.</Paragraph>
      <Paragraph position="12"> In an attempt to devise an evaluation measure for empty node co-indexation that depends less on syntactic structure, we experimented with a modified augmented empty node representation in which each antecedent is represented by its head's category and location. (The intuition behind this is that we do not want to penalize the empty node antecedent-finding algorithm if the parser misattaches modifiers to the antecedent.) In fact this head-based antecedent representation yields scores very similar to those obtained using the phrase-based representation. It seems that in the cases where the parser does not construct a phrase in the appropriate location to serve as the antecedent for an empty node, the syntactic structure is typically so distorted that either the pattern-matcher fails or the head-finding algorithm does not return the correct head either.</Paragraph>
      <Paragraph position="13"> [Displaced table caption, truncated: "... reported for all types of empty node that occurred more than 100 times in the gold standard corpus (section 23 of the Penn Treebank); these are ordered by frequency of occurrence in the gold standard. Section 23 is a test corpus consisting of a version of section 23 from which all empty nodes and indices were removed. The parser output was produced by Charniak's parser (Charniak, 2000)." Column headers: Empty node / Section 23 / Parser output.]</Paragraph>
    </Section>
  </Section>
</Paper>