<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1037">
  <Title>Statistical Decision-Tree Models for Parsing*</Title>
  <Section position="5" start_page="280" end_page="281" type="evalu">
    <SectionTitle>
4 Experiment Results
</SectionTitle>
    <Paragraph position="0"> In the absence of an NL system, SPATTER can be evaluated by comparing its top-ranking parse with the treebank analysis for each test sentence. The parser was applied to two different domains, IBM Computer Manuals and the Wall Street Journal.</Paragraph>
    <Section position="1" start_page="280" end_page="280" type="sub_section">
      <SectionTitle>
4.1 IBM Computer Manuals
</SectionTitle>
      <Paragraph position="0"> The first experiment uses the IBM Computer Manuals domain, which consists of sentences extracted from IBM computer manuals. The training and test sentences were annotated by the University of Lancaster. The Lancaster treebank uses 195 part-of-speech tags and 19 non-terminal labels. This tree-bank is described in great detail in (Black et al., 1993).</Paragraph>
      <Paragraph position="1"> The main reason for applying SPATTER to this domain is that IBM had spent the previous ten years developing a rule-based, unification-style probabilistic context-free grammar for parsing this domain. The purpose of the experiment was to estimate SPATTER's ability to learn the syntax for this domain directly from a treebank, instead of depending on the interpretive expertise of a grammarian. The parser was trained on the first 30,800 sentences from the Lancaster treebank. The test set included 1,473 new sentences, whose lengths range from 3 to 30 words, with a mean length of 13.7 words. These sentences are the same test sentences used in the experiments reported for IBM's parser in (Black et al., 1993). In (Black et al., 1993), IBM's parser was evaluated using the 0-crossing-brackets measure, which represents the percentage of sentences for which none of the constituents in the parser's parse violates the constituent boundaries of any constituent in the correct parse. After over ten years of grammar development, the IBM parser achieved a 0-crossing-brackets score of 69%.</Paragraph>
      <Paragraph position="2"> On this same test set, SPATTER scored 76%.</Paragraph>
    </Section>
    <Section position="2" start_page="280" end_page="281" type="sub_section">
      <SectionTitle>
4.2 Wall Street Journal
</SectionTitle>
      <Paragraph position="0"> The experiment is intended to illustrate SPATTER's ability to accurately parse a highly-ambiguous, large-vocabulary domain. These experiments use the Wall Street Journal domain, as annotated in the Penn Treebank, version 2. The Penn Treebank uses 46 part-of-speech tags and 27 non-terminal labels. 2 The WSJ portion of the Penn Treebank is divided into 25 sections, numbered 00 - 24. In these experiments, SPATTER was trained on sections 02 - 21, which contains approximately 40,000 sentences. The test results reported here are from section 00, which contains 1920 sentences, s Sections 01, 22, 23, and 24 will be used as test data in future experiments.</Paragraph>
      <Paragraph position="1"> The Penn Treebank is already tokenized and sentence detected by human annotators, and thus the test results reported here reflect this. SPATTER parses word sequences, not tag sequences. Furthermore, SPATTER does not simply pre-tag the sentences and use only the best tag sequence in parsing. Instead, it uses a probabilistic model to assign tags to the words, and considers all possible tag sequences according to the probability they are assigned by the model. No information about the legal tags for a word are extracted from the test corpus. In fact, no information other than the words is used from the test corpus.</Paragraph>
      <Paragraph position="2"> For the sake of efficiency, only the sentences of 40 words or fewer are included in these experiments. 4 For this test set, SPATTER takes on average 12 2This treebank also contains coreference information, predicate-argument relations, and trace information indicating movement; however, none of this additional information was used in these parsing experiments.</Paragraph>
      <Paragraph position="3"> SFor an independent research project on coreference, sections 00 and 01 have been annotated with detailed coreference information. A portion of these sections is being used as a development test set. Training SPATTER on them would improve parsing accuracy significantly and skew these experiments in favor of parsing-based approaches to coreference. Thus, these two sections have been excluded from the training set and reserved as test sentences.</Paragraph>
      <Paragraph position="4"> 4SPATTER returns a complete parse for all sentences of fewer then 50 words in the test set, but the sentences of 41 - 50 words required much more computation than the shorter sentences, and so they have been excluded.</Paragraph>
      <Paragraph position="5">  seconds per sentence on an SGI R4400 with 160 megabytes of RAM.</Paragraph>
      <Paragraph position="6"> To evaluate SPATTER's performance on this domain, I am using the PARSEVAL measures, as defined in (Black et al., 1991): Precision no. of correct constituents in SPATTER parse no. of constituents in SPATTER parse Recall no. of correct constituents in SPATTER parse no. of constituents in treebank parse Crossing Brackets no. of constituents which violate constituent boundaries with a constituent in the treebank parse.</Paragraph>
      <Paragraph position="7"> The precision and recall measures do not consider constituent labels in their evaluation of a parse, since the treebank label set will not necessarily coincide with the labels used by a given grammar. Since SPATTER uses the same syntactic label set as the Penn Treebank, it makes sense to report labelled precision and labelled recall. These measures are computed by considering a constituent to be correct if and only if it's label matches the label in the treebank. null Table 1 shows the results of SPATTER evaluated against the Penn Treebank on the Wall Street Journal section 00.</Paragraph>
      <Paragraph position="8">  periments.</Paragraph>
      <Paragraph position="9"> Figures 5, 6, and 7 illustrate the performance of SPATTER as a function of sentence length. SPATTER's performance degrades slowly for sentences up to around 28 words, and performs more poorly and more erratically as sentences get longer. Figure 4 indicates the frequency of each sentence length in the test corpus.</Paragraph>
      <Paragraph position="10">  function of sentence length for Wall Street Journal experiments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML