<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1034">
  <Title>Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification</Title>
  <Section position="3" start_page="218" end_page="219" type="metho">
    <SectionTitle>
2 The Treebank Approach
</SectionTitle>
    <Paragraph position="0"> Figure 2 depicts the treebank approach to base NP identification. For training, the algorithm requires a corpus that has been annotated with base NPs.</Paragraph>
    <Paragraph position="1"> More specifically, we assume that the training corpus is a sequence of words w1, w2, ..., along with a set of base NP annotations b(i1,j1), b(i2,j2), ..., where b(i,j) indicates that the NP brackets words i through j: [NP wi ... wj]. The goal of the training phase is to create a base NP grammar from this training corpus: 1. Using any available part-of-speech tagger, assign a part-of-speech tag ti to each word wi in the training corpus.</Paragraph>
    <Paragraph position="2"> 2. Extract from each base noun phrase b(i,j) in the training corpus its sequence of part-of-speech tags ti, ..., tj to form base NP rules, one rule per base NP.</Paragraph>
    <Paragraph position="3"> 3. Remove any duplicate rules.</Paragraph>
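The three training steps above reduce to a few lines of code. A minimal Python sketch (the function and variable names are ours, not the paper's; the part-of-speech tagger of step 1 is assumed to have run already):

```python
def extract_rules(tags, bracketings):
    """Steps 2-3: pull the POS-tag sequence out of each annotated
    base NP and de-duplicate.  `tags` holds one POS tag per word of
    the training corpus; `bracketings` is a list of (i, j) pairs,
    each meaning [NP w_i ... w_j] (0-based, inclusive)."""
    rules = set()                        # a set removes duplicates (step 3)
    for i, j in bracketings:
        rules.add(tuple(tags[i:j + 1]))  # one rule per base NP (step 2)
    return rules

tags = ["NNP", "NNP", "IN", "NNPS", "NNP", "NNP"]  # a hypothetical tagged fragment
rules = extract_rules(tags, [(0, 1), (4, 5)])      # two NPs, same tag pattern
# both NPs yield the rule (NNP NNP); the duplicate collapses away
```

Using a set makes the duplicate removal of step 3 automatic.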
    <Paragraph position="4"> The resulting &quot;grammar&quot; can then be used to identify base NPs in a novel text.</Paragraph>
    <Paragraph position="5"> 1. Assign part-of-speech tags t1, t2, ... to the input words w1, w2, ...</Paragraph>
    <Paragraph position="6"> 2. Proceed through the tagged text from left to right, at each point matching the NP rules against the remaining part-of-speech tags ti, ti+1, ... in the text.</Paragraph>
    <Paragraph position="8"> Example text: Not this year. National Association of Manufacturers settled on the Hoosier capital of Indianapolis for its next meeting. And the city decided to treat its guests more like royalty or rock stars than factory owners.</Paragraph>
    <Paragraph position="10"> 3. If there are multiple rules that match beginning at ti, use the longest matching rule R. Add the new base noun phrase b(i, i+|R|-1) to the set of base NPs. Continue matching at ti+|R|.</Paragraph>
    <Paragraph position="11"> With the rules stored in an appropriate data structure, this greedy &quot;parsing&quot; of base NPs is very fast. In our implementation, for example, we store the rules in a decision tree, which permits base NP identification in time linear in the length of the tagged input text when using the longest match heuristic.</Paragraph>
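The paper stores its rules in a decision tree; a prefix trie over tag sequences gives the same linear-time, longest-match behavior. A hypothetical sketch of the matching steps under that assumption (all names and the data layout are ours):

```python
def build_trie(rules):
    """Store each rule (a tuple of POS tags) in a prefix tree; the
    key "$" marks that a complete rule ends at this node."""
    root = {}
    for rule in rules:
        node = root
        for tag in rule:
            node = node.setdefault(tag, {})
        node["$"] = True
    return root

def bracket(tags, trie):
    """Greedy left-to-right pass: at each position take the longest
    rule that matches, emit the bracket, and resume just past it.
    Returns (i, j) spans, 0-based inclusive."""
    spans, i = [], 0
    while i < len(tags):
        node, best, k = trie, None, i
        while k < len(tags) and tags[k] in node:  # walk the trie
            node = node[tags[k]]
            k += 1
            if "$" in node:       # a complete rule ends here;
                best = k          # remember the longest match so far
        if best is None:
            i += 1                # no rule starts here; skip the tag
        else:
            spans.append((i, best - 1))
            i = best              # continue matching at t_{i+|R|}
    return spans

trie = build_trie({("DT", "NN"), ("DT", "JJ", "NN"), ("NNP", "NNP")})
# e.g. "the/DT next/JJ meeting/NN in/IN Palm/NNP Beach/NNP"
print(bracket(["DT", "JJ", "NN", "IN", "NNP", "NNP"], trie))
# -> [(0, 2), (4, 5)]  (the longer rule (DT JJ NN) wins at position 0)
```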
    <Paragraph position="12"> Unfortunately, there is an obvious problem with the algorithm described above. There will be many unhelpful rules in the rule set extracted from the training corpus. These &quot;bad&quot; rules arise from four sources: bracketing errors in the corpus; tagging errors; unusual or irregular linguistic constructs (such as parenthetical expressions); and inherent ambiguities in the base NPs -- in spite of their simplicity. For example, the rule (VBG NNS), which was extracted from manufacturing/VBG titans/NNS in the example text, is ambiguous, and will cause erroneous bracketing in sentences such as The execs squeezed in a few meetings before [boarding/VBG buses/NNS] again. In order to have a viable mechanism for identifying base NPs using this algorithm, the grammar must be improved by removing problematic rules.</Paragraph>
    <Paragraph position="13"> The next section presents two such methods for automatically pruning the base NP grammar.</Paragraph>
  </Section>
  <Section position="4" start_page="219" end_page="221" type="metho">
    <SectionTitle>
3 Pruning the Base NP Grammar
</SectionTitle>
    <Paragraph position="0"> As described above, our goal is to use the base NP corpus to extract and select a set of noun phrase rules that can be used to accurately identify base NPs in novel text. Our general pruning procedure is shown in Figure 3. First, we divide the base NP corpus into two parts: a training corpus and a pruning corpus. The initial base NP grammar is extracted from the training corpus as described in Section 2.</Paragraph>
    <Paragraph position="1"> Next, the pruning corpus is used to evaluate the set of rules and produce a ranking of the rules in terms of their utility in identifying base NPs. More specifically, we use the rule set and the longest match heuristic to find all base NPs in the pruning corpus.</Paragraph>
    <Paragraph position="2"> Performance of the rule set is measured in terms of labeled precision (P): P = (# of correct proposed NPs) / (# of proposed NPs). We then assign to each rule a score that denotes the &quot;net benefit&quot; achieved by using the rule during NP parsing of the improvement corpus. The benefit of rule r is given by Br = Cr - Er, where Cr is the number of NPs correctly identified by r, and Er is the number of precision errors for which r is responsible. A rule is considered responsible for an error if it was the first rule to bracket part of a reference NP, i.e., an NP in the base NP training corpus. Thus, rules that form erroneous bracketings are not penalized if another rule previously bracketed part of the same reference NP.</Paragraph>
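The benefit score Br = Cr - Er, with blame assigned only to the first rule that damages a reference NP, might be computed as follows. This is a speculative reconstruction (the paper gives no code, and the data layout is ours), illustrated on the paper's Boca Raton fragment:

```python
def score_rules(proposed, reference):
    """`proposed` lists (rule, (i, j)) brackets in left-to-right
    order; `reference` is the set of true base-NP spans.  A rule is
    charged an error only if it is the FIRST to bracket part of a
    given reference NP."""
    benefit = {}
    broken = set()                       # reference NPs already damaged
    for rule, (i, j) in proposed:
        benefit.setdefault(rule, 0)
        if (i, j) in reference:
            benefit[rule] += 1           # counts toward Cr
            continue
        # which reference NPs does this wrong bracket overlap?
        hit = {s for s in reference if not (j < s[0] or i > s[1])}
        if hit - broken:                 # first to damage one of them
            benefit[rule] -= 1           # counts toward Er
        broken |= hit
    return benefit

# Boca/0 Raton/1 ,/2 Hot/3 Springs/4 ,/5 and/6 Palm/7 Beach/8
reference = {(0, 1), (3, 4), (7, 8)}
proposed = [(("NNP", "NNP", ",", "NNP"), (0, 3)),
            (("NNP",), (4, 4)),
            (("NNP", "NNP"), (7, 8))]
scores = score_rules(proposed, reference)
# scores: (NNP NNP , NNP) -> -1,  (NNP) -> 0,  (NNP NNP) -> 1
```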
    <Paragraph position="3"> For example, suppose the fragment containing base NPs Boca Raton, Hot Springs, and Palm Beach is bracketed as shown below.</Paragraph>
    <Paragraph position="4"> [NP Boca Raton , Hot] [NP Springs] , and [NP Palm Beach]. Rule (NNP NNP , NNP) incorrectly identifies Boca Raton , Hot as a noun phrase, so its score is -1. Rule (NNP) incorrectly identifies Springs, but it is not held responsible for the error because of the previous error by (NNP NNP , NNP) on the same original NP Hot Springs, so its score is 0. Finally, rule (NNP NNP) receives a score of 1 for correctly identifying Palm Beach as a base NP.</Paragraph>
    <Paragraph position="5"> The benefit scores from evaluation on the pruning corpus are used to rank the rules in the grammar.</Paragraph>
    <Paragraph position="6"> With such a ranking, we can improve the rule set by discarding the worst rules. Thus far, we have investigated two iterative approaches for discarding rules, a thresholding approach and an incremental approach. We describe each, in turn, in the subsections below.</Paragraph>
    <Section position="1" start_page="219" end_page="219" type="sub_section">
      <SectionTitle>
3.1 Threshold Pruning
</SectionTitle>
      <Paragraph position="0"> Given a ranking on the rule set, the threshold algorithm simply discards rules whose score is less than a predefined threshold R. For all of our experiments, we set R = 1 to select rules that propose more correct bracketings than incorrect. The process of evaluating, ranking, and discarding rules is repeated until no rules have a score less than R. For our evaluation on the WSJ corpus, this typically requires only four to five iterations.</Paragraph>
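The evaluate-rank-discard loop above might look as follows; this is a sketch, with `evaluate` standing in for scoring the rule set on the pruning corpus (e.g. the Br = Cr - Er benefit of the previous section):

```python
def threshold_prune(rules, evaluate, R=1):
    """Repeatedly score the grammar on the pruning corpus and drop
    every rule scoring below the threshold R, until none do.
    `evaluate` is assumed to return a {rule: benefit} mapping."""
    while True:
        scores = evaluate(rules)
        keep = {r for r in rules if scores.get(r, 0) >= R}
        if keep == rules:     # no rule fell below R; we are done
            return rules
        rules = keep          # re-evaluate: scores shift as rules vanish

# Toy demo with string "rules" and length as a stand-in score:
pruned = threshold_prune({"a", "bb", "ccc"}, lambda rs: {r: len(r) for r in rs}, R=2)
# keeps only "bb" and "ccc"
```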
    </Section>
    <Section position="2" start_page="219" end_page="219" type="sub_section">
      <SectionTitle>
3.2 Incremental Pruning
</SectionTitle>
      <Paragraph position="0"> Thresholding provides a very coarse mechanism for pruning the NP grammar. In particular, because of interactions between the rules during bracketing, thresholding discards rules whose score might increase in the absence of other rules that are also being discarded. Consider, for example, the Boca Raton fragment given earlier. In the absence of (NNP NNP , NNP), the rule (NNP NNP) would have received a score of three for correctly identifying all three NPs.</Paragraph>
      <Paragraph position="1"> As a result, we explored a more fine-grained method of discarding rules: Each iteration of incremental pruning discards the N worst rules, rather than all rules whose rank is less than some threshold. In all of our experiments, we set N = 10. As with thresholding, the process of evaluating, ranking, and discarding rules is repeated, this time until precision of the current rule set on the pruning corpus begins to drop. The rule set that maximized precision becomes the final rule set.</Paragraph>
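The incremental variant can be sketched the same way; `evaluate` and `precision` are assumed callbacks (the benefit scores and the labeled precision on the pruning corpus):

```python
def incremental_prune(rules, evaluate, precision, N=10):
    """Drop the N worst-scoring rules per iteration; stop when
    precision on the pruning corpus begins to drop, and return the
    rule set that maximized precision."""
    best_rules, best_p = set(rules), precision(rules)
    while len(rules) > N:
        scores = evaluate(rules)
        worst = sorted(rules, key=lambda r: scores.get(r, 0))[:N]
        rules = rules - set(worst)
        p = precision(rules)
        if p < best_p:
            break                       # precision began to drop
        best_rules, best_p = set(rules), p
    return best_rules

# Toy demo: alphabetically early "rules" score worst; precision peaks
# at four remaining rules, so pruning stops there.
kept = incremental_prune(set("abcdef"),
                         lambda rs: {r: ord(r) for r in rs},
                         lambda rs: {6: 0.5, 4: 0.7, 2: 0.6}[len(rs)],
                         N=2)
# kept == {"c", "d", "e", "f"}
```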
    </Section>
    <Section position="3" start_page="219" end_page="221" type="sub_section">
      <SectionTitle>
3.3 Human Review
</SectionTitle>
      <Paragraph position="0"> In the experiments below, we compare the thresholding and incremental methods for pruning the NP grammar to a rule set that was pruned by hand.</Paragraph>
      <Paragraph position="1"> When the training corpus is large, exhaustive review of the extracted rules is not practical. This is the case for our initial rule set, culled from the WSJ corpus, which contains approximately 4500 base NP rules. Rather than identifying and discarding individual problematic rules, our reviewer identified problematic classes of rules that could be removed from the grammar automatically. In particular, the goal of the human reviewer was to discard rules that introduced ambiguity or corresponded to overly complex base NPs. Within our partial parsing framework, these NPs are better identified by more informed components of the NLP system. Our reviewer identified the following classes of rules as possibly troublesome: rules that contain a preposition, period, or colon; rules that contain WH tags; rules that begin/end with a verb or adverb; rules that contain pronouns with any other tags; rules that contain misplaced commas or quotes; rules that end with adjectives. Rules covered under any of these classes  were omitted from the human-pruned rule sets used in the experiments of Section 4.</Paragraph>
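Several of the reviewer's rule classes reduce to simple tag-set tests over a rule's POS sequence. A hypothetical sketch covering only some of the listed classes (the tag groupings are our assumptions about the Penn Treebank tags involved; the comma/quote and pronoun classes are omitted):

```python
# Assumed Penn Treebank tag groupings for a few of the rule classes:
PREP_PUNCT = {"IN", ".", ":"}                              # preposition, period, colon
WH = {"WDT", "WP", "WP$", "WRB"}                           # WH tags
VERB_ADV = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
            "RB", "RBR", "RBS"}                            # verbs and adverbs
ADJ = {"JJ", "JJR", "JJS"}                                 # adjectives

def is_troublesome(rule):
    """True if the rule falls into one of the (covered) discarded
    classes: contains a preposition/period/colon or a WH tag,
    begins or ends with a verb or adverb, or ends with an adjective."""
    return (bool(set(rule) & (PREP_PUNCT | WH))
            or rule[0] in VERB_ADV or rule[-1] in VERB_ADV
            or rule[-1] in ADJ)

grammar = {("VBG", "NNS"), ("DT", "NN"), ("NNP", "NNP")}
pruned = {r for r in grammar if not is_troublesome(r)}
# (VBG NNS) begins with a verb tag, so it is filtered out
```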
    </Section>
  </Section>
</Paper>