<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0107">
  <Title>Text Chunking using Transformation-Based Learning</Title>
  <Section position="4" start_page="82" end_page="85" type="intro">
    <SectionTitle>
2 Text Chunking
</SectionTitle>
    <Paragraph position="0"> Abney (1991) has proposed text chunking as a useful preliminary step to parsing. His chunks are inspired in part by psychological studies of Gee and Grosjean (1983) that link pause durations in reading and naive sentence diagraming to text groupings that they called C-phrases, which very roughly correspond to breaking the string after each syntactic head that is a content word. Abney's other motivation for chunking is procedural, based on the hypothesis that the identification of chunks can be done fairly dependably by finite state methods, postponing the decisions that require higher-level analysis to a parsing phase that chooses how to combine the chunks.</Paragraph>
    <Section position="1" start_page="82" end_page="84" type="sub_section">
      <SectionTitle>
2.1 Existing Chunk Identification Techniques
</SectionTitle>
      <Paragraph position="0"> Existing efforts at identifying chunks in text have been focused primarily on low-level noun group identification, frequently as a step in deriving index terms, motivated in part by the limited coverage of present broad-scale parsers when dealing with unrestricted text. Some researchers have applied grammar-based methods, combining lexical data with finite state or other grammar constraints, while others have worked on inducing statistical models either directly from the words or from automatically assigned part-of-speech classes.</Paragraph>
      <Paragraph position="1"> On the grammar-based side, Bourigault (1992) describes a system for extracting &amp;quot;terminological noun phrases&amp;quot; from French text. This system first uses heuristics to find &amp;quot;maximal length noun phrases&amp;quot;, and then uses a grammar to extract &amp;quot;terminological units.&amp;quot; For example, from the maximal NP le disque dur de la station de travail it extracts the two terminological phrases disque dur, and station de travail. Bourigault claims that the grammar can parse &amp;quot;around 95% of the maximal length noun phrases&amp;quot; in a test corpus into possible terminological phrases, which then require manual validation. However, because its goal is terminological phrases, it appears that this system ignores NP chunk-initial determiners and other initial prenominal modifiers, somewhat simplifying the parsing task.</Paragraph>
      <Paragraph position="2"> Voutilalnen (1993), in his impressive NPtool system, uses an approach that is in some ways similar to the one used here, in that he adds to his part-of-speech tags a new kind of tag that shows chunk structure; the chunk tag &amp;quot;@&gt;N&amp;quot;, for example, is used for determiners and premodifiers, both of which group with the following noun head. He uses a lexicon that lists all the possible chunk tags for each word combined with hand-built constraint grammar patterns. These patterns eliminate impossible readings to identify a somewhat idiosyncratic kind of target noun group that does not include initial determiners but does include postmodifying prepositional phrases (including determiners). Voutilainen claims recall rates of 98.5% or better with precision of 95% or better. However, the sample NPtool analysis given in the appendix of (Voutilainen, 1993), appears to be less accurate than claimed in general, with 5 apparent mistakes (and one unresolved ambiguity) out of the 32 NP chunks in that sample, as listed in Table 1. These putative errors,  combined with the claimed high performance, suggest that NPtool's definition of NP chunk i.s also tuned for extracting terminological phrases, and thus excludes many kinds of NP premodifiers, again simplifying the chunking task.</Paragraph>
      <Paragraph position="3"> NPtool parse Apparent correct parse less \[time\] \[less time\] the other hand the \[other hand\] many \[advantages\] \[many advantages\] \[b!nary addressing\] \[binary addressing and and \[instruction formats\] instruction formats\] a purely \[binary computer\] a \[purely binary computer\]  Kupiec (1993) also briefly mentions the use of finite state NP recognizers for both English and French to prepare the input for a program that identified the correspondences between NPs in bilingual corpora, but he does not directly discuss their performance.</Paragraph>
      <Paragraph position="4"> Using statistical methods, Church's Parts program (1988), in addition to identifying parts of speech, also inserted brackets identifying core NPs. These brackets were placed using a statistical model trained on Brown corpus material in which NP brackets had been inserted semi-automatically. In the small test sample shown, this system achieved 98% recall for correct brackets. At about the same time, Ejerhed (1988), working with Church, performed comparisons between finite state methods and Church's stochastic models for identifying both non-recursive clauses and non-recursive NPs in English text. In those comparisons, the stochastic methods outperformed the hand built finite-state models, with claimed accuracies of 93.5% (clauses) and 98.6% (NPs) for the statistical models compared to to 87% (clauses) and 97.8% (NPs) for the finite-state methods.</Paragraph>
      <Paragraph position="5"> Running Church's program on test material, however, reveals that the definition of NP embodied in Church's program is quite simplified in that it does not include, for example, structures or words conjoined within NP by either explicit conjunctions like &amp;quot;and&amp;quot; and &amp;quot;or&amp;quot;, or implicitly by commas. Church's chunker thus assigns the following NP chunk structures: \[a Skokie\], \[hi.\] , \[subsidiary\] \[newer\], \[big-selling prescriptions drugs\] \[the inefficiency\] , \[waste\] and \[lack\] of \[coordination\] \[Kidder\], \[Peabody\] ~ \[Co\] It is difficult to compare performance figures between studies; the definitions of the target chunks and the evaluation methodologies differ widely and are frequently incompletely specified. All of the cited performance figures above also appear to derive from manual checks by the investigators of the system's predicted output, and it is hard to estimate the impact of the system's suggested chunking on the judge's determination. We believe that the work reported here is the first study which has attempted to find NP chunks subject only to the limitation that the structures recognized do not include recursively embedded NPs, and which has measured performance by automatic comparison with a preparsed corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="84" end_page="85" type="sub_section">
      <SectionTitle>
2.2 Deriving Chunks from Treebank Parses
</SectionTitle>
      <Paragraph position="0"> We performed experiments using two different chunk structure targets, one that tried to bracket non-recursive &amp;quot;baseNPs&amp;quot; and one that partitioned sentences into non-overlapping N-type and V-type chunks, loosely following Abney's model. Training and test materials with chunk tags encoding each of these kinds of structure were derived automatically from the parsed Wall Street Journal text in the Penn Treebank (Marcus et al., i994). While this automatic derivation process introduced a small percentage of errors of its own, it was the only practical way both to provide the amount of training data required and to aJlow for fully-automatic testing.</Paragraph>
      <Paragraph position="1"> The goal of the &amp;quot;baseNP&amp;quot; chunks was to identify essentially the initial portions of non-recursive noun phrases up to the head, including determiners but not including postmodifying prepositional phrases or clauses. These chunks were extracted from the Treebank parses, basically by selecting NPs that contained no nested NPs 1. The handling of conjunction followed that of the Treebank annotators as to whether to show separate baseNPs or a single baseNP spanning the conjunction 2. Possessives were treated as a special case, viewing the possessive marker as the first word of a new baseNP, thus flattening the recursive structure in a useful way. The following sentences give examples of this baseNP chunk structure: During \[N the third quarter N\] , IN Compaq N\] purchased \[N a former Wang Laboratories manufacturing facility N\] in \[N Sterling N\], \[N Scotland N\], which will be used for IN international service and repair operations N\] * \[N The government N\] has \[N other agencies and instruments N\] for pursuing \[N these other objectives N\] * Even IN Mao Tse-tung N\] \[N's China/v\] began in \[N 1949 N\] with \[N a partnership N\] between \[N the communists N\] and \[N a number N\] of IN smaller , non-communist parties N\] * The chunks in the partitioning chunk experiments were somewhat closer to Abney's model, where the prepositions in prepositional phrases are included with the object NP up to the head in a single N-type chunk. This created substantial additional ambiguity for the system, which had to distinguish prepositions from particles. The handling of conjunction again follows the Treebank parse with nominal conjuncts parsed in the Treebank as a single NP forming a single N chunk, while those parsed as conjoined NPs become separate chunks, with any coordinating conjunctions attached like prepositions to the following N chunk.</Paragraph>
      <Paragraph position="2"> The portions of the text not involved in N-type chunks were grouped as chunks termed Vtype, though these &amp;quot;V&amp;quot; chunks included many elements that were not verbal, including adjective phrases. The internal structure of these V-type chunks loosely followed the Treebank parse, though V chunks often group together elements that were sisters in the underlying parse tree.</Paragraph>
      <Paragraph position="3"> Again, the possessive marker was viewed as initiating a new N-type chunk. The following sentences are annotated with these partitioning N and V chunks: \[N Some bankers N\] \[v are reporting v\] \[N more inquiries than usual N\] IN about CDs N\] \[N since Friday N\] *  \[N Eastern Airlines N\] \[N ' creditors N\] \[V have begun exploring v\] \[N alternative approaches N\] \[N to a Chapter 11 reorganization N\] \[Y because v\] \[g they Y\]\[Y are unhappy v\] \[g with the carrier N\] \[g's latest proposal N\] * \[N Indexing N\] \[N for the most part N\] \[v has involved simply buying v\] \[w and then holding v\] \[Y stocks N\] \[Y in the correct mix N\] \[Y to mirror V\] \[g a stock market barometer g\] * These two kinds of chunk structure derived from the Treebank data were encoded as chunk tags attached to each word and provided the targets for the transformation-based learning.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>