<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0112">
  <Title>Bootstrapping Statistical Processing into a Rule-based Natural Language Parser</Title>
  <Section position="2" start_page="0" end_page="97" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> For decades, the majority of NL parsers have been &amp;quot;rule-based.&amp;quot; In such parsers, knowledge about the syntactic structure of a language is written in the form of linguistic rules, and these rules are applied by the parser to input text segments in order to produce the resulting parse trees. Information about individual words, such as what parts-of-speech they may be, is usually stored in an online dictionary, or &amp;quot;lexicon,&amp;quot; which is accessed by the parser for each word in the input text prior to applying the linguistic rules.</Paragraph>
    <Paragraph position="1"> Although rule-based parsers are widely-used in real, working NLP systems, they have the disadvantage that extensive amounts of (dictionary) data and labor (to write the rules) by highly-skilled linguists are required in order to create, enhance, and maintain them. This is especially true if the parser is required to have &amp;quot;broad coverage&amp;quot;, i.e., if it is to be able to parse NL text from many different domains (what one might call '!general&amp;quot; text).</Paragraph>
    <Paragraph position="2"> In the last few years, there has been increasing activity in the computational linguistics community focused on making use of statistical methods to acquire information from large corpora of NL text, and on using that information in statistical NL parsers. Instead of being stored in the traditional form of dictionary data and grammatical rules, linguistic knowledge in these parsers is represented as statistical parameters, or probabilities. These probabilities are commonly used together with simpler, less specified, dictionary data and/or rules, thereby taking the place of much of the information created by sldlled labor in rule-based systems.</Paragraph>
    <Paragraph position="3"> Advantages of the statistical approach that are claimed by its proponents include a significant decrease in the amount of rule coding required to create a parser that performs adequately, and the ability to &amp;quot;tune&amp;quot; a parser to a particular type of text simply by extracting statistical information from the same type of text. Perhaps the most significant disadvantage appears to be the requirement for large amounts of training data, often in the form of large NL text corpora that have been annotated with hand-coded tags specifying parts-of-speech, syntactic function, etc. There have been a number of efforts to extract information from corpora that are not tagged (e.g., Kupiec and Maxwell 1992), but the depth of information thus obtained and its utility in &amp;quot;automatically&amp;quot; creating a NL parser is usually limited.</Paragraph>
    <Paragraph position="4"> To overcome the need for augmenting corpora with tags in order to obtain more useful inforrnation,  researchers in statistical NLP have experimented with a variety of strategies, some of which employ varying degrees of traditional linguistic abstraction. Su and Chang (1992) group words in untagged corpora into equivalence classes, according to their possible parts-of-speech. They then perform statistical analyses over these equivalence classes, rather than over the words themselves, in order to obtain higher-level trigram language models that will be used later by their statistics-based parser. Brown et al. (1992) have similarly resorted to reducing inflected word forms to their underlying lemmas before estimation of statistical parameters.</Paragraph>
    <Paragraph position="5"> Bnscoe and Carroll (1993) carry the use of traditional rule-based linguistics a step further by using a unification-based grammar as a starting point. Through a process of human-supervised training on a small corpus of text, a statistical model is then developed which is used to rank the parses produced by the grammar for a given input.</Paragraph>
    <Paragraph position="6"> A similar method of interactive training has been used by Simmons and Yu (1991) to produce favorable results.</Paragraph>
    <Paragraph position="7"> Beyond the realm of simply using traditional linguistics to enhance the quality of data extracted from corpora by statistical methods, there have been attempts to create hybrid systems that incorporate statistical information into already well-developed rule-based frameworks. For example, McKee and Maloney (1992) have used common statistical methods to extract information such as part-of-speech frequency, verb sub-categorization frames, and prepositional phrase attachment preferences from corpora and have then incorporated it into the processing in their knowledge-based parser in order to quickly expand its coverage in new domains.</Paragraph>
    <Paragraph position="8"> In comparing rule-based approaches with those which are more purely statistics-based, and including everything in between, one could claim that there is some constant amount of linguistic knowledge that is required to create an NL parser, and one must either code it explicitly into the parser (using rules), or use statistical methods to extract it from sources such as text corpora. Furthermore, in the latter case, the extraction of useful information from the raw data in corpora is facilitated by additional information provided through manual tagging, through &amp;quot;seeding&amp;quot; the process with linguistic abstractions (e.g., parts-of-speech), or through the interaction of human supervisors during the extraction process. In any case, it appears that in addition to information that may be obtained by statistical methods, generalized linguistic knowledge from a human source is also clearly desirable, if not required, in order to create truly capable parsers.</Paragraph>
    <Paragraph position="9"> Proponents of statistical metheds usually point to the data-driven aspect of their approach as enabling them to create robust parsers that can handle &amp;quot;real text.&amp;quot; Although many rule-based parsers have been limited in scope, we believe that it is indeed possible to create and maintain broad-coverage, rule-based NL systems (e.g., Jensen 1993), by carefully studying and using ample amounts of data to refme those systems. It has been our experience that the complexity and difficulty of creating such rule-based systems can be readily managed if one has a powerful and comprehensive set of tools.</Paragraph>
    <Paragraph position="10"> Nevertheless, it is also clearly desirable to be able to use statistical methods to adapt (or tune) rule-based systems automatically for particular types of text as well as to acquire additional linguistic information from corpora and to integrate it with information that has been developed by trained linguists.</Paragraph>
    <Paragraph position="11"> To the end of incorporating statistics-based processing into a rule-based parser, we have devised a &amp;quot;bootstrapping&amp;quot; method. This method uses a rule-based parser to compute part-of-speech and rule probabilities while processing a large, non-annotated corpus. These probabilities are then incorporated into the very same parser, thereby providing guidance to the parser as it assigns parts of speech to words and applies rules during the processing of new text.</Paragraph>
    <Paragraph position="12"> Although our method relies on the existence of a broad-coverage, rule-based parser, which, as discussed at the beginning of this paper, is not trivial to develop, the benefits of this approach are that relevant statistical information can be obtained automatically from large untagged corpora, and that this information can be used to improve  significantly the speed and accuracy of the parser.</Paragraph>
    <Paragraph position="13"> This method also obviates the need for any human-supervised training during the parsing process and allows for &amp;quot;tuning&amp;quot; the parser to particular types of text.</Paragraph>
  </Section>
class="xml-element"></Paper>