<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1035">
  <Title>Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach</Title>
  <Section position="3" start_page="0" end_page="259" type="metho">
    <SectionTitle>
TRANSFORMATION-BASED
ERROR-DRIVEN LEARNING
</SectionTitle>
    <Paragraph position="0"> The phrase structure learning algorithm is a transformation-based error-driven learner. This learning paradigm, illustrated in figure 1, has proven to be successful in a number of different natural language applications, including part of speech tagging (Bri92, BM92b), prepositional</Paragraph>
  </Section>
  <Section position="4" start_page="259" end_page="259" type="metho">
    <SectionTitle>
[Figure 1 (transformation-based error-driven learning). Surviving diagram labels: UNANNOTATED TEXT, STATE, ANNOTATED, TRUTH, RULES]
</SectionTitle>
    <Paragraph position="0"> phrase attachment (BR93), and word classification (Bri93). In its initial state, the learner is capable of annotating text but is not very good at doing so. The initial state is usually very easy to create. In part of speech tagging, the initial state annotator assigns every word its most likely tag. In prepositional phrase attachment, the initial state annotator always attaches prepositional phrases low. In word classification, all words are initially classified as nouns. The naively annotated text is compared to the true annotation as indicated by a small manually annotated corpus, and transformations are learned that can be applied to the output of the initial state annotator to make it better resemble the truth.</Paragraph>
  </Section>
  <Section position="5" start_page="259" end_page="261" type="metho">
    <SectionTitle>
LEARNING PHRASE
STRUCTURE
</SectionTitle>
    <Paragraph position="0"> The phrase structure learning algorithm is trained on a small corpus of partially bracketed text which is also annotated with part of speech information. All of the experiments presented below were done using the Penn Treebank annotated corpus (MSM93). The learner begins in a naive initial state, knowing very little about the phrase structure of the target corpus. In particular, all that is initially known is that English tends to be right branching and that final punctuation is final punctuation. Transformations are then learned automatically that transform the output of the naive parser into output which better resembles the phrase structure found in the training corpus. Once a set of transformations has been learned, the system is capable of taking sentences tagged with parts of speech and returning a binary-branching structure with nonterminals unlabelled (footnote 2). The Initial State Of The Parser: Initially, the parser operates by assigning a right-linear structure to all sentences. The only exception is that final punctuation is attached high. So, the sentence "The dog and old cat ate ." would be incorrectly bracketed as: ((The (dog (and (old (cat ate))))) .) The parser in its initial state will obviously not bracket sentences with great accuracy. In some experiments below, we begin with an even more naive initial state of knowledge: sentences are parsed by assigning them a random binary-branching structure with final punctuation always attached high.</Paragraph>
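The right-linear initial state described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the nested-tuple tree representation, the function names, and the set of tokens treated as final punctuation are all assumptions made for the example.

```python
def right_linear(tokens):
    """Nest a non-empty token list to the right: [a, b, c] -> (a, (b, c))."""
    tree = tokens[-1]
    for token in reversed(tokens[:-1]):
        tree = (token, tree)
    return tree


def naive_parse(tokens, final_punct=(".", "?", "!")):
    """Right-linear bracketing, with sentence-final punctuation attached high.

    The contents of final_punct are an assumption of this sketch.
    """
    if len(tokens) > 1 and tokens[-1] in final_punct:
        return (right_linear(tokens[:-1]), tokens[-1])
    return right_linear(tokens)


def show(tree):
    """Render a nested-tuple tree as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(show(child) for child in tree) + ")"


print(show(naive_parse("The dog and old cat ate .".split())))
# prints ((The (dog (and (old (cat ate))))) .)
```

The output reproduces the incorrect bracketing given in the text for "The dog and old cat ate ."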
    <Section position="1" start_page="259" end_page="259" type="sub_section">
      <SectionTitle>
Structural Transformations
</SectionTitle>
      <Paragraph position="0"> The next stage involves learning a set of transformations that can be applied to the output of the naive parser to make these sentences better conform to the proper structure specified in the training corpus. The list of possible transformation types is prespecified. Each transformation makes a simple change triggered by a simple environment. In the current implementation, there are twelve allowable transformation types: * (1-8) (Add|delete) a (left|right) parenthesis to the (left|right) of part of speech tag X.</Paragraph>
      <Paragraph position="2"> * (9-12) (Add|delete) a (left|right) parenthesis between tags X and Y.</Paragraph>
      <Paragraph position="3"> To carry out a transformation by adding or deleting a parenthesis, a number of additional simple changes must take place to preserve balanced parentheses and binary branching. To give an example, to delete a left paren in a particular environment, the following operations take place (assuming, of course, that there is a left paren to delete):  1. Delete the left paren.</Paragraph>
      <Paragraph position="4"> 2. Delete the right paren that matches the just deleted paren.</Paragraph>
      <Paragraph position="5"> 3. Add a left paren to the left of the constituent immediately to the left of the deleted left paren.</Paragraph>
      <Paragraph position="6"> 4. Add a right paren to the right of the constituent immediately to the right of the deleted left paren.</Paragraph>
      <Paragraph position="7"> 5. If there is no constituent immediately to the right, or none immediately to the left, then the transformation fails to apply. [Footnote 2: This is the same output given by systems described in (MM90, Bri92, PS92, SRO93).]</Paragraph>
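The five steps for deleting a left paren can be traced on a flat list of tokens. The sketch below is one possible reading of those steps, not the original implementation; the token-list representation and the helper names (tokenize, match_right, match_left, delete_left_paren) are hypothetical.

```python
def tokenize(text):
    """Split a bracketed string into paren and word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def match_right(tokens, i):
    """Index of the right paren matching the left paren at index i."""
    depth = 0
    for j in range(i, len(tokens)):
        if tokens[j] == "(":
            depth += 1
        elif tokens[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    raise ValueError("unbalanced parentheses")


def match_left(tokens, j):
    """Index of the left paren matching the right paren at index j."""
    depth = 0
    for i in range(j, -1, -1):
        if tokens[i] == ")":
            depth += 1
        elif tokens[i] == "(":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")


def delete_left_paren(tokens, i):
    """Delete the left paren at index i; return a new list, or None on failure."""
    # Step 5: fail if there is no paren here or no constituent directly to the left.
    if i == 0 or tokens[i] != "(" or tokens[i - 1] == "(":
        return None
    if tokens[i + 1] == ")":  # step 5: no constituent directly to the right
        return None
    j = match_right(tokens, i)  # right paren removed in step 2
    # k: start of the constituent immediately to the left of the deleted paren.
    k = match_left(tokens, i - 1) if tokens[i - 1] == ")" else i - 1
    # m: end of the constituent immediately to the right of the deleted paren.
    m = match_right(tokens, i + 1) if tokens[i + 1] == "(" else i + 1
    return (tokens[:k] + ["("] + tokens[k:i]      # step 3; index i dropped (step 1)
            + tokens[i + 1:m + 1] + [")"]         # step 4
            + tokens[m + 1:j] + tokens[j + 1:])   # index j dropped (step 2)


tokens = tokenize("((The (dog barked)) .)")
result = delete_left_paren(tokens, tokens.index("dog") - 1)
print(" ".join(result))  # prints ( ( ( The dog ) barked ) . )
```

Building the result in a single expression keeps the parentheses balanced by construction: one left and one right paren are dropped, and one of each is added.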
      <Paragraph position="8"> Structurally, the transformation can be seen as follows. If we wish to delete a left paren to the right of constituent X (footnote 3), where X appears in a subtree of the form: [tree diagrams not recovered] Given the sentence (footnote 5): The dog barked .</Paragraph>
      <Paragraph position="9"> this would initially be bracketed by the naive parser as: ((The (dog barked)) .) If the transformation "delete a left paren to the right of a determiner" is applied, the structure would be transformed to the correct bracketing: (((The dog) barked) .) [Footnote 4: [...] into two structural transformations, that shown here and its converse, along with six triggering environments.]</Paragraph>
      <Paragraph position="10"> [Footnote 5: Input sentences are also labelled with parts of speech.]</Paragraph>
      <Paragraph position="11"> [...] If it is, the following steps are carried out to add the right paren: 1. Add the right paren.</Paragraph>
      <Paragraph position="12"> 2. Delete the left paren that now matches the newly added paren.</Paragraph>
      <Paragraph position="13"> 3. Find the right paren that used to match the just deleted paren and delete it.</Paragraph>
      <Paragraph position="14"> 4. Add a left paren to match the added right paren.  This results in the same structural change as deleting a left paren to the right of X in this particular structure.</Paragraph>
      <Paragraph position="15"> Applying the transformation "add a right paren to the right of a noun" to the bracketing: ((The (dog barked)) .) will once again result in the correct bracketing: (((The dog) barked) .)</Paragraph>
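The claim that the two transformations yield the same structural change can be checked on trees directly. The sketch below assumes a reading in which both rewrite a subtree (X (YY Z)) as ((X YY) Z) and differ only in which tag triggers the change. The 'word/TAG' leaf encoding, the function names, and the single bottom-up application order are assumptions of this example; the text here does not specify an application order.

```python
def tag_of(leaf):
    """Leaves are 'word/TAG' strings (a representation assumed for this sketch)."""
    return leaf.rsplit("/", 1)[1]


def last_leaf(tree):
    """Rightmost terminal dominated by a node."""
    while not isinstance(tree, str):
        tree = tree[1]
    return tree


def rotate_where(tree, fires):
    """Rewrite (X (YY Z)) as ((X YY) Z) at every node where fires(X, YY) holds.

    Children are processed first, so each node is rewritten at most once.
    """
    if isinstance(tree, str):
        return tree
    left = rotate_where(tree[0], fires)
    right = rotate_where(tree[1], fires)
    if not isinstance(right, str) and fires(left, right[0]):
        return ((left, right[0]), right[1])
    return (left, right)


def delete_left_paren_right_of(tag):
    """Trigger: the last word of X carries the given tag."""
    return lambda x, yy: tag_of(last_leaf(x)) == tag


def add_right_paren_right_of(tag):
    """Trigger: the last word of YY carries the given tag."""
    return lambda x, yy: tag_of(last_leaf(yy)) == tag


naive = (("The/DT", ("dog/NN", "barked/VBD")), "./.")
correct = ((("The/DT", "dog/NN"), "barked/VBD"), "./.")
print(rotate_where(naive, delete_left_paren_right_of("DT")) == correct)  # prints True
print(rotate_where(naive, add_right_paren_right_of("NN")) == correct)    # prints True
```

Both calls turn the naive bracketing of "The dog barked ." into the correct one, matching the two worked examples in the text.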
    </Section>
    <Section position="2" start_page="259" end_page="261" type="sub_section">
      <SectionTitle>
Learning Transformations
</SectionTitle>
      <Paragraph position="0"> Learning proceeds as follows. Sentences in the training set are first parsed using the naive parser, which assigns right-linear structure to all sentences, attaching final punctuation high. Next, for each possible instantiation of the twelve transformation templates, that particular transformation is applied to the naively parsed sentences. The resulting structures are then scored using some measure of success that compares these parses to the correct structural descriptions for the sentences provided in the training corpus. The transformation resulting in the best scoring structures then becomes the first transformation of the ordered set of transformations that are to be learned. That transformation is applied to the right-linear structures, and then learning proceeds on the corpus of improved sentence bracketings. The following procedure is carried out repeatedly on the training corpus until no more transformations can be found whose application reduces the error in parsing the training corpus:  1. The best transformation is found for the structures output by the parser in its current state (footnote 6). 2. The transformation is applied to the output resulting from bracketing the corpus using the parser in its current state.</Paragraph>
      <Paragraph position="1"> 3. This transformation is added to the end of the ordered list of transformations.</Paragraph>
      <Paragraph position="2"> 4. Go to 1.</Paragraph>
      <Paragraph position="3"> [Footnote 6: The state of the parser is defined as naive initial-state knowledge plus all transformations that currently have been learned.]</Paragraph>
      <Paragraph position="4"> After a set of transformations has been learned, it can be used to effectively parse fresh text. To parse fresh text, the text is first naively parsed and then every transformation is applied, in order, to the naively parsed text.</Paragraph>
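The greedy learning loop and the ordered application to fresh text might be sketched as follows. This is an illustration under stated assumptions, not the system described here: it uses only two of the twelve transformation types, a 'word/TAG' leaf encoding, a one-sentence toy corpus, and an exact-match score standing in for the actual success measure. All names are hypothetical.

```python
KINDS = ("delete-left-paren-right-of", "add-right-paren-right-of")


def tag_of(leaf):
    return leaf.rsplit("/", 1)[1]


def last_leaf(tree):
    while not isinstance(tree, str):
        tree = tree[1]
    return tree


def apply_rule(tree, rule):
    """One bottom-up pass rewriting (X (YY Z)) as ((X YY) Z) where the rule fires."""
    if isinstance(tree, str):
        return tree
    kind, tag = rule
    left, right = apply_rule(tree[0], rule), apply_rule(tree[1], rule)
    if not isinstance(right, str):
        trigger = left if kind == KINDS[0] else right[0]
        if tag_of(last_leaf(trigger)) == tag:
            return ((left, right[0]), right[1])
    return (left, right)


def exact_match(trees, gold):
    """Stand-in score: fraction of sentences bracketed exactly as in the gold corpus."""
    return sum(t == g for t, g in zip(trees, gold)) / len(gold)


def learn(trees, gold, rules, score=exact_match):
    """Greedily build an ordered rule list until no rule improves the score."""
    learned, current = [], score(trees, gold)
    while True:
        # Step 1: try every candidate on the corpus in its current state.
        trials = [(score([apply_rule(t, r) for t in trees], gold), r) for r in rules]
        best_score, best_rule = max(trials, key=lambda trial: trial[0])
        if not best_score > current:  # nothing reduces the error: stop
            return learned
        trees = [apply_rule(t, best_rule) for t in trees]  # step 2
        learned.append(best_rule)                          # step 3
        current = best_score                               # step 4: loop again


def parse(naive_tree, learned):
    """Apply every learned rule, in order, to a naively parsed sentence."""
    for rule in learned:
        naive_tree = apply_rule(naive_tree, rule)
    return naive_tree


naive = [(("The/DT", ("dog/NN", "barked/VBD")), "./.")]
gold = [((("The/DT", "dog/NN"), "barked/VBD"), "./.")]
rules = [(kind, tag) for kind in KINDS for tag in ("DT", "NN", "VBD")]
learned = learn(naive, gold, rules)
print(learned)  # prints [('delete-left-paren-right-of', 'DT')]
fresh = (("A/DT", ("cat/NN", "slept/VBD")), "./.")
print(parse(fresh, learned))
```

Ties between equally good candidates are broken here by list order, which is an arbitrary choice of the sketch. Any scoring function with the same signature can be passed as score.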
      <Paragraph position="5"> One nice feature of this method is that different measures of bracketing success can be used: learning can proceed in such a way as to try to optimize any specified measure of success. The measure we have chosen for our experiments is the same measure described in (PS92), which is one of the measures that arose out of a parser evaluation workshop (ea91). The measure is the percentage of constituents (strings of words between matching parentheses) from sentences output by our system which do not cross any constituents in the Penn Treebank structural description of the sentence.</Paragraph>
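The non-crossing measure described above can be sketched as a short function over word-index spans. This is a hedged illustration: the nested-tuple trees, the function names, and the decision to count every bracket pair (including the outermost one) are assumptions of the example, and the treebank tree is allowed to have more than two children per node.

```python
def spans(tree, start=0):
    """Return (end, found), where found lists a (start, end) span per bracket pair."""
    if isinstance(tree, str):
        return start + 1, []
    found, pos = [], start
    for child in tree:
        pos, sub = spans(child, pos)
        found.extend(sub)
    found.append((start, pos))
    return pos, found


def crosses(s, g):
    """True if two half-open spans overlap without one containing the other."""
    (a, b), (c, d) = s, g
    return d > b > c > a or b > d > a > c


def noncrossing_rate(system, gold):
    """Fraction of system constituents that cross no gold constituent."""
    system_spans = spans(system)[1]
    gold_spans = spans(gold)[1]
    ok = [s for s in system_spans if not any(crosses(s, g) for g in gold_spans)]
    return len(ok) / len(system_spans)


system = ((("The", "big"), ("dog", "ate")), ".")
gold = ((("The", "big", "dog"), "ate"), ".")
print(noncrossing_rate(system, gold))  # prints 0.75
```

With these two bracketings of a four-word sentence, only the span covering "dog ate" crosses a treebank constituent, so three of the four system constituents count as correct under the assumptions above.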
      <Paragraph position="6"> For example, if our system outputs: (((The big) (dog ate)) .) and the Penn Treebank bracketing for this sentence was: (((The big dog) ate) .) then the constituent "the big" would be judged correct whereas the constituent "dog ate" would not. Below are the first seven transformations found from one run of training on the Wall Street Journal corpus, which was initially bracketed using the right-linear initial-state parser.</Paragraph>
      <Paragraph position="7">  1. Delete a left paren to the left of a singular noun. 2. Delete a left paren to the left of a plural noun. 3. Delete a left paren between two proper nouns. 4. Delete a left paren to the right of a determiner. 5. Add a right paren to the left of a comma.</Paragraph>
      <Paragraph position="8"> 6. Add a right paren to the left of a period.</Paragraph>
      <Paragraph position="9"> 7. Delete a right paren to the left of a plural noun.  The first four transformations all extract noun phrases from the right-linear initial structure. The sentence "The cat meowed ." would initially be bracketed as (footnote 7): ((The (cat meowed)) .) Applying the first transformation to this bracketing would result in:</Paragraph>
    </Section>
  </Section>
</Paper>