<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1035">
  <Title>Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> There has been a great deal of interest of late in the automatic induction of natural language grammar. Given the difficulty inherent in manually building a robust parser, along with the availability of large amounts of training material, automatic grammar induction seems like a path worth pursuing. A number of systems have been built that can be trained automatically to bracket text into syntactic constituents. In (MM90), mutual information statistics are extracted from a corpus of text, and this information is then used to parse new text. (Sam86) defines a function to score the quality of parse trees, and then uses simulated annealing to heuristically explore the entire space of possible parses for a given sentence. In (BM92a), distributional analysis techniques are applied to a large corpus to learn a context-free grammar.</Paragraph>
    <Paragraph position="1"> The most promising results to date have been based on the inside-outside algorithm, which can be used to train stochastic context-free grammars.</Paragraph>
    <Paragraph position="2"> *The author would like to thank Mark Liberman, Meiting Lu, David Magerman, Mitch Marcus, Rich Pito, Giorgio Satta, Yves Schabes and Tom Veatch.</Paragraph>
    <Paragraph position="3"> This work was supported by DARPA and AFOSR jointly under grant No. AFOSR-90-0066, and by ARO grant No. DAAL 03-89-C0031 PRI.</Paragraph>
    <Paragraph position="4"> 1 Not in the traditional sense of the term.</Paragraph>
    <Paragraph position="5"> The inside-outside algorithm (Bak79) is an extension of the finite-state based Hidden Markov Model, which has been applied successfully in many areas, including speech recognition and part-of-speech tagging. A number of recent papers have explored the potential of using the inside-outside algorithm to automatically learn a grammar (LY90, SJM90, PS92, BW92, CC92, SRO93).</Paragraph>
    <Paragraph position="6"> Below, we describe a new technique for grammar induction. The algorithm begins in a very naive state of knowledge about phrase structure. By repeatedly comparing the results of parsing in the current state to the proper phrase structure for each sentence in the training corpus, the system learns a set of ordered transformations which can be applied to reduce parsing error. We believe this technique has advantages over other methods of phrase structure induction: the system is very simple, it requires only a very small set of transformations, it achieves a high degree of accuracy, and it needs only a very small training corpus. The trained transformational parser is completely symbolic and can bracket text in linear time with respect to sentence length. In addition, since some tokens in a sentence are not even considered in parsing, the method could prove to be considerably more robust than a CFG-based approach when faced with noise or unfamiliar input. After describing the algorithm, we present results and compare them to other recent results in automatic phrase structure induction.</Paragraph>
  </Section>
</Paper>