File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/p98-1034_intro.xml
Size: 7,955 bytes
Last Modified: 2025-10-06 14:06:32
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1034"> <Title>Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification</Title> <Section position="2" start_page="0" end_page="218" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Finding base noun phrases is a sensible first step for many natural language processing (NLP) tasks: Accurate identification of base noun phrases is arguably the most critical component of any partial parser; in addition, information retrieval systems rely on base noun phrases as the main source of multi-word indexing terms; furthermore, the psycholinguistic studies of Gee and Grosjean (1983) indicate that text chunks like base noun phrases play an important role in human language processing. In this work we define base NPs to be simple, nonrecursive noun phrases -- noun phrases that do not contain other noun phrase descendants. The bracketed portions of Figure 1, for example, show the base NPs in one sentence from the Penn Treebank Wall Street Journal (WSJ) corpus (Marcus et al., 1993).</Paragraph> <Paragraph position="1"> Thus, the string the sunny confines of resort towns like Boca Raton and Hot Springs is too complex to be a base NP; instead, it contains four simpler noun phrases, each of which is considered a base NP: the sunny confines, resort towns, Boca Raton, and Hot Springs.</Paragraph> <Paragraph position="2"> Previous empirical research has addressed the problem of base NP identification. Several algorithms identify &quot;terminological phrases&quot; -- certain When \[it\] is \[time\] for \[their biannual powwow\] , \[the nation\] 's \[manufacturing titans\] typically jet off to \[the sunny confines\] of \[resort towns\] base noun phrases with initial determiners and modifiers removed: Justeson & Katz (1995) look for repeated phrases; Bourigault (1992) uses a hand-crafted noun phrase grammar in conjunction with heuristics for finding maximal length noun phrases; Voutilainen's NPTool (1993) uses a handcrafted lexicon and constraint grammar to find terminological noun phrases that include phrase-final prepositional phrases. Church's PARTS program (1988), on the other hand, uses a probabilistic model automatically trained on the Brown corpus to locate core noun phrases as well as to assign parts of speech.</Paragraph> <Paragraph position="3"> More recently, Ramshaw & Marcus (In press) apply transformation-based learning (Brill, 1995) to the problem. Unfortunately, it is difficult to directly compare approaches. Each method uses a slightly different definition of base NP. Each is evaluated on a different corpus. Most approaches have been evaluated by hand on a small test set rather than by automatic comparison to a large test corpus annotated by an impartial third party. A notable exception is the Ramshaw & Marcus work, which evaluates their transformation-based learning approach on a base NP corpus derived from the Penn Treebank WSJ, and achieves precision and recall levels of approximately 93%.</Paragraph> <Paragraph position="4"> This paper presents a new algorithm for identifying base NPs in an arbitrary text. Like some of the earlier work on base NP identification, ours is a trainable, corpus-based algorithm. In contrast to other corpus-based approaches, however, we hypothesized that the relatively simple nature of base NPs would permit their accurate identification using correspondingly simple methods. Assume, for example, that we use the annotated text of Figure 1 as our training corpus. To identify base NPs in an unseen text, we could simply search for all occurrences of the base NPs seen during training -- it, time, their biannual powwow, ..., Hot Springs -- and mark them as base NPs in the new text. However, this method would certainly suffer from data sparseness. Instead, we use a similar approach, but back off from lexical items to parts of speech: we identify as a base NP any string having the same part-of-speech tag sequence as a base NP from the training corpus. The training phase of the algorithm employs two previously successful techniques: like Charniak's (1996) statistical parser, our initial base NP grammar is read from a &quot;treebank&quot; corpus; then the grammar is improved by selecting rules with high &quot;benefit&quot; scores. Our benefit measure is identical to that used in transformation-based learning to select an ordered set of useful transformations (Brill, 1995).</Paragraph> <Paragraph position="5"> Using this simple algorithm with a naive heuristic for matching rules, we achieve surprising accuracy in an evaluation on two base NP corpora of varying complexity, both derived from the Penn Treebank WSJ. The first base NP corpus is that used in the Ramshaw & Marcus work. The second espouses a slightly simpler definition of base NP that conforms to the base NPs used in our Empire sentence analyzer. These simpler phrases appear to be a good starting point for partial parsers that purposely delay all complex attachment decisions to later phases of processing.</Paragraph> <Paragraph position="6"> Overall results for the approach are promising.</Paragraph> <Paragraph position="7"> For the Empire corpus, our base NP finder achieves 94% precision and recall; for the Ramshaw & Marcus corpus, it obtains 91% precision and recall, which is 2% less than the best published results. Ramshaw & Marcus, however, provide the learning algorithm with word-level information in addition to the part-of-speech information used in our base NP finder. By controlling for this disparity in available knowledge sources, we find that our base NP algorithm performs comparably, achieving slightly worse precision (-1.1%) and slightly better recall (+0.2%) than the Ramshaw & Marcus approach. Moreover, our approach offers many important advantages that make it appropriate for many NLP tasks: * Training is exceedingly simple.</Paragraph> <Paragraph position="8"> . The base NP bracketer is very fast, operating in time linear in the length of the text.</Paragraph> <Paragraph position="9"> . The accuracy of the treebank approach is good for applications that require or prefer fairly simple base NPs.</Paragraph> <Paragraph position="10"> . The learned grammar is easily modified for use with corpora that differ from the training texts.</Paragraph> <Paragraph position="11"> Rules can be selectively added to or deleted from the grammar without worrying about ordering effects.</Paragraph> <Paragraph position="12"> * Finally, our benefit-based training phase offers a simple, general approach for extracting grammars other than noun phrase grammars from annotated text.</Paragraph> <Paragraph position="13"> Note also that the treebank approach to base NP identification obtains good results in spite of a very simple algorithm for &quot;parsing&quot; base NPs. This is extremely encouraging, and our evaluation suggests at least two areas for immediate improvement. First, by replacing the naive match heuristic with a probabilistic base NP parser that incorporates lexical preferences, we would expect a nontrivial increase in recall and precision. Second, many of the remaining base NP errors tend to follow simple patterns; these might be corrected using localized, learnable repair rules.</Paragraph> <Paragraph position="14"> The remainder of the paper describes the specifics of the approach and its evaluation. The next section presents the training and application phases of the treebank approach to base NP identification in more detail. Section 3 describes our general approach for pruning the base NP grammar as well as two instantiations of that approach. The evaluation and a discussion of the results appear in Section 4, along with techniques for reducing training time and an initial investigation into the use of local repair heuristics.</Paragraph> </Section> class="xml-element"></Paper>