<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1058"> <Title>Statistical parsing with an automatically-extracted tree adjoining grammar</Title> <Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 3 Some properties of probabilistic TAG </SectionTitle> <Paragraph position="0"> In a lexicalized TAG, because each composition brings together two lexical items, every composition probability involves a bilexical dependency. Given a CFG and a head-percolation scheme, an equivalent TAG can be constructed whose derivations mirror the dependency analysis implicit in the head-percolation scheme.</Paragraph> <Paragraph position="1"> Furthermore, there are some dependency analyses encodable by TAGs that are not encodable by a simple head-percolation scheme. For example, for the sentence "John should have left," Magerman's rules make should and have the heads of their respective VPs, so that there is no dependency between left and its subject John (see Figure 2a). Since nearly a quarter of nonempty subjects appear in such a configuration, this is not a small problem.</Paragraph> <Paragraph position="2"> [Figure 2: two dependency analyses of "John should have left." We could make VP the head of VP instead, but this would generate auxiliaries independently of each other, so that, for example, P(John leave) > 0.] TAG can produce the desired dependencies (Figure 2b) easily, using the grammar of Figure 1. A more complex lexicalization scheme for CFG could as well (one which kept track of two heads at a time, for example), but the TAG account is simpler and cleaner.</Paragraph> <Paragraph position="3"> Bilexical dependencies are not the only nonlocal dependencies that can be used to improve parsing accuracy. For example, the attachment of an S depends on the presence or absence of the embedded subject (Collins, 1999); Treebank-style two-level NPs are mismodeled by PCFG (Collins, 1999; Johnson, 1998); and the generation of a node depends on the label of its grandparent (Charniak, 2000; Johnson, 1998). In order to capture such dependencies in a PCFG-based model, they must be localized either by transforming the data or by modifying the parser. Such changes are not always obvious a priori and often must be devised anew for each language or each corpus.</Paragraph> <Paragraph position="4"> But none of these cases really requires special treatment in a PTAG model, because each composition probability involves not only a bilexical dependency but a "biarboreal" (tree-tree) dependency. That is, PTAG generates an entire elementary tree at once, conditioned on the entire elementary tree being modified. Thus dependencies that have to be stipulated in a PCFG by tree transformations or parser modifications are captured for free in a PTAG model. Of course, the price the PTAG model pays is sparser data; the backoff model must therefore be chosen carefully.</Paragraph> </Section> <Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Inducing a stochastic grammar from the Treebank </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Reconstructing derivations </SectionTitle> <Paragraph position="0"> We want to extract from the Penn Treebank an LTAG whose derivations mirror the dependency analysis implicit in the head-percolation rules of (Magerman, 1995; Collins, 1997). For each node η, these rules classify exactly one child of η as a head and the rest as either arguments or adjuncts; a small illustrative sketch of this classification step follows.</Paragraph>
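The head/argument/adjunct classification is driven by a head-rule table in the style of Magerman (1995) and Collins (1997). The sketch below is only a toy illustration of the general idea; the rule table, the argument test, and all names in it are simplified stand-ins of our own, not the actual rules used in the paper.

```python
# Toy sketch of head-percolation-style classification: for each node, exactly
# one child is marked HEAD; the others are marked ARGUMENT or ADJUNCT.
# The rule table and the argument test are deliberately simplified.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    role: str = None            # HEAD / ARGUMENT / ADJUNCT, filled in below

# Search direction plus a priority list of child labels for choosing the head.
HEAD_RULES = {
    "S":  ("right", ["VP", "S"]),
    "VP": ("left",  ["MD", "VB", "VBD", "VBZ", "VBP", "VBN", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}
ARGUMENT_LABELS = {"NP", "S", "SBAR"}   # toy complement test

def classify_children(node: Node) -> None:
    """Mark each child of `node` as HEAD, ARGUMENT, or ADJUNCT, recursively."""
    if not node.children:
        return
    direction, priorities = HEAD_RULES.get(node.label, ("left", []))
    order = node.children if direction == "left" else list(reversed(node.children))
    head = next((c for lbl in priorities for c in order if c.label == lbl),
                order[0])       # fall back to the first child in search order
    head.role = "HEAD"
    for child in node.children:
        if child is not head:
            child.role = "ARGUMENT" if child.label in ARGUMENT_LABELS else "ADJUNCT"
        classify_children(child)

# Example: S -> NP VP; the VP child becomes the head, the NP an argument.
s = Node("S", [Node("NP"), Node("VP", [Node("VBD"), Node("NP")])])
classify_children(s)
print([(c.label, c.role) for c in s.children])   # [('NP', 'ARGUMENT'), ('VP', 'HEAD')]
```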
<Paragraph position="1"> Using this classification we can construct a TAG derivation (including elementary trees) from a derived tree as follows: 1. If η is an adjunct, excise the subtree rooted at η to form a modifier tree. 2. If η is an argument, excise the subtree rooted at η to form an initial tree, leaving behind a substitution node. 3. If η has a right corner θ which is an argument with the same label as η (and all intervening nodes are heads), excise the segment from η down to θ to form an auxiliary tree.</Paragraph> <Paragraph position="2"> Rules (1) and (2) produce the desired result; rule (3) changes the analysis somewhat by making subtrees with recursive arguments into predicative auxiliary trees. It produces, among other things, the analysis of auxiliary verbs described in the previous section. It is applied in a greedy fashion, with potential ηs considered top-down and potential θs bottom-up. The complicated restrictions on θ are simply to ensure that a well-formed TIG derivation is produced.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Parameter estimation and smoothing </SectionTitle> <Paragraph position="0"> Now that we have augmented the training data to include TAG derivations, we could try to estimate the parameters of the model from Section 2 directly. But since the number of (tree, site) pairs is very high, the data would be too sparse. We therefore generate an elementary tree in two steps: first the tree template (that is, the elementary tree minus its anchor), then the anchor. The probabilities are decomposed as P(α | η) = P(τ_α | η) · P(w_α | τ_α, η), where τ_α is the tree template of α and w_α is the anchor itself.</Paragraph> <Paragraph position="1"> The generation of the tree template has two backoff levels: at the first level, the anchor of η is ignored, and at the second level, the POS tag of the anchor as well as the flag f are ignored. The generation of the anchor has three backoff levels: the first two are as before, and the third just conditions the anchor on its POS tag. The backed-off models are combined by linear interpolation, with the weights chosen as in (Bikel et al., 1997).</Paragraph> </Section> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 5 The experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.1 Extracting the grammar </SectionTitle> <Paragraph position="0"> We ran the algorithm given in Section 4.1 on sections 02-21 of the Penn Treebank. The extracted grammar is large (about 73,000 trees, with words seen fewer than four times replaced with the symbol *UNKNOWN*), but if we consider elementary tree templates, the grammar is quite manageable: 3626 tree templates, of which 2039 occur more than once (see Figure 4, a log-log rank-frequency plot).</Paragraph> <Paragraph position="1"> The 616 most frequent tree-template types account for 99% of tree-template tokens in the training data. Removing all but these trees from the grammar increased the error rate by about 5% (testing on a subset of section 00). A few of the most frequent tree templates are shown in Figure 3.</Paragraph>
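To make the coverage figures above concrete, here is a small sketch of the kind of bookkeeping they involve, assuming we have the tree-template identifier extracted for every tree token in the training and held-out data (the function and field names are ours, purely illustrative):

```python
# Sketch of the coverage bookkeeping: how many template types occur more than
# once, how many of the most frequent types are needed to cover a given share
# of training tokens, and how many held-out tokens use an unseen template.
from collections import Counter

def coverage_stats(train_templates, heldout_templates, mass=0.99):
    counts = Counter(train_templates)
    total = sum(counts.values())

    covered, k = 0, 0
    for _, c in counts.most_common():       # most frequent templates first
        covered += c
        k += 1
        if covered / total >= mass:
            break

    unseen = sum(1 for t in heldout_templates if t not in counts)
    return {
        "template_types": len(counts),
        "types_seen_more_than_once": sum(1 for c in counts.values() if c > 1),
        "types_for_mass": k,
        "unseen_heldout_rate": unseen / len(heldout_templates),
    }

# Tiny example: two frequent templates dominate, one held-out template is new.
train = ["t1"] * 90 + ["t2"] * 8 + ["t3", "t4"]
print(coverage_stats(train, ["t1", "t2", "t5"], mass=0.99))
```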
<Paragraph position="2"> So the extracted grammar is fairly compact, but how complete is it? If we plot the growth of the grammar during training (Figure 5), it is not clear that the grammar will ever converge, even though the very idea of a grammar presupposes a finite set of elementary trees. Much of the apparent growth arises because existing constructions are combined in new ways, and the extraction heuristics fail to factor this variation out. In a random sample of 100 once-seen elementary tree templates, we found (by casual inspection) that 34 resulted from annotation errors, 50 from deficiencies in the heuristics, and four apparently from performance errors. Only twelve appeared to be genuine.</Paragraph> <Paragraph position="3"> Therefore the continued growth of the grammar is not as rapid as Figure 5 might indicate. Moreover, our extraction heuristics evidently have room to improve. The majority of trees resulting from deficiencies in the heuristics involved complicated coordination structures, which is not surprising, since coordination has always been problematic for TAG.</Paragraph> <Paragraph position="4"> To see what the impact of this failure to converge is, we ran the grammar extractor on some held-out data (section 00). Out of 45082 tree tokens, 107 tree templates, or 0.2%, had not been seen in training. This amounts to about one unseen tree template every 20 sentences. When we consider lexicalized trees, this figure of course rises: out of the same 45082 tree tokens, 1828 lexicalized trees, or 4%, had not been seen in training.</Paragraph> <Paragraph position="5"> So the coverage of the grammar is quite good. Note that even in cases where the parser encounters a sentence for which the (fallible) extraction heuristics would have produced an unseen tree template, it is possible that the parser will use other trees to produce the correct bracketing.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.2 Parsing with the grammar </SectionTitle> <Paragraph position="0"> We used a CKY-style parser similar to the one described in (Schabes and Waters, 1996), with one modification to ensure completeness (because foot nodes are treated as empty, which CKY prohibits) and another to reduce useless substitutions. We also extended the parser to simulate sister-adjunction as regular adjunction and to compute the flag f which distinguishes the first modifier from subsequent modifiers.</Paragraph> <Paragraph position="1"> We use a beam search, computing the score of an item [η, i, j] by multiplying it by the prior probability P(η) (Goodman, 1997); any item with a score less than 10^-5 times that of the best item in a cell is pruned.</Paragraph> <Paragraph position="2"> Following (Collins, 1997), words occurring fewer than four times in training were replaced with the symbol *UNKNOWN* and tagged with the output of the part-of-speech tagger described in (Ratnaparkhi, 1996). Tree templates occurring only once in training were ignored entirely.</Paragraph> <Paragraph position="3"> We first compared the parser with (Hwa, 1998): we trained the model on sentences of length 40 or less in sections 02-09 of the Penn Treebank, down to parts of speech only, and then tested on sentences of length 40 or less in section 23, parsing from part-of-speech tag sequences to fully bracketed parses. The metric used was the percentage of guessed brackets which did not cross any correct brackets (a small sketch of this metric is given below). Our parser scored 84.4%, compared with 82.4% for (Hwa, 1998), an error reduction of 11%.</Paragraph>
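The bracket metric used in the comparison with (Hwa, 1998) scores a guessed bracket as correct whenever it does not cross any bracket in the gold standard. A minimal sketch of that computation, assuming brackets are given as (start, end) token spans (the function names are ours):

```python
# Percentage of guessed brackets that do not cross any gold bracket.
# Two spans cross if they overlap without one containing the other.
def crosses(a, b):
    (s1, e1), (s2, e2) = a, b
    return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)

def non_crossing_rate(guessed, gold):
    ok = sum(1 for g in guessed if not any(crosses(g, r) for r in gold))
    return ok / len(guessed) if guessed else 1.0

# Example: the guessed span (2, 6) crosses gold brackets, so 2 of 3 guesses count.
print(non_crossing_rate([(0, 3), (2, 6), (5, 7)], [(0, 3), (4, 8), (5, 7)]))  # ~0.667
```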
Next we compared our parser against lexicalized PCFG parsers, training on sections 02-21 and testing on section 23. The results are shown in Figure 6. [Figure 6 caption fragment: ... brackets; 0 CB = no crossing brackets; ≤ 2 CB = two or fewer crossing brackets. All figures except CB are percentages.]</Paragraph> <Paragraph position="4"> These results place our parser roughly in the middle of the lexicalized PCFG parsers.</Paragraph> <Paragraph position="5"> While the results are not state-of-the-art, they do demonstrate the viability of TAG as a framework for statistical parsing. With improvements in smoothing and cleaner handling of punctuation and coordination, perhaps these results can be brought more up to date.</Paragraph> </Section> </Section> <Section position="8" start_page="3" end_page="3" type="metho"> <SectionTitle> 6 Conclusion: related and future work </SectionTitle> <Paragraph position="0"> (Neumann, 1998) describes an experiment similar to ours, although the grammar he extracts only arrives at a complete parse for 10% of unseen sentences. (Xia, 1999) describes a grammar extraction process similar to ours, and describes some techniques for automatically filtering out invalid elementary trees. Our work has a great deal in common with independent work by Chen and Vijay-Shanker (2000). They present a more detailed discussion of various grammar extraction processes and of the performance of supertagging models (B. Srinivas, 1997) based on the extracted grammars. They do not report parsing results, though their intention is to evaluate how the various grammars affect parsing accuracy and how k-best supertagging affects parsing speed.</Paragraph> <Paragraph position="1"> Srinivas's work on supertags (B. Srinivas, 1997) also uses TAG for statistical parsing, but with a rather different strategy: tree templates are thought of as extended parts of speech, and these are assigned to words based on local (e.g., n-gram) context.</Paragraph> <Paragraph position="2"> As for future work, there are still possibilities made available by TAG which remain to be explored. One, also suggested by (Chen and Vijay-Shanker, 2000), is to group elementary trees into families and relate the trees of a family by transformations. For example, one would imagine that the distribution of active verbs and their subjects would be similar to the distribution of passive verbs and their notional subjects, yet they are treated as independent in the current model. If the two configurations could be related, then the sparseness of verb-argument dependencies would be reduced.</Paragraph> <Paragraph position="3"> Another possibility is the use of multiply-anchored trees. Nothing about PTAG requires that elementary trees have only a single anchor (or any anchor at all), so multiply-anchored trees could be used to make, for example, the attachment of a PP dependent not only on the preposition (as in the current model) but on the lexical head of the prepositional object as well, or the attachment of a relative clause dependent on the embedded verb as well as the relative pronoun. The smoothing method described above would have to be modified to account for multiple anchors.</Paragraph> <Paragraph position="4"> In summary, we have argued that TAG provides a cleaner way of looking at statistical parsing than lexicalized PCFG does, and demonstrated that in practice it performs in the same range. Moreover, the greater flexibility of TAG suggests some potential improvements which would be cumbersome to implement using a lexicalized CFG.
Further research will show whether these advantages turn out to be significant in practice.</Paragraph> </Section> </Paper>