<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1030">
  <Title>FAST PARSING USING PRUNING AND GRAMMAR SPECIALIZATION</Title>
  <Section position="6" start_page="226" end_page="226" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> This section describes a number of experiments carried out to test the utility of the theoretical ideas presented above. The basic corpus used was a set of 16,000 utterances from the Air Travel Planning (ATIS; (Hemphill et al., 1990)) domain. All of these utterances were available in text form; 15,000 of them were used for training, with 1,000 held out for test purposes. Care was taken to ensure not just that the utterances themselves, but also the speakers of the utterances were disjoint between test and training data; as pointed out in (Rayner et al., 1994a), failure to observe these precautions can result in substantial spurious improvements in test data results.</Paragraph>
    <Paragraph position="1"> The 16,000 sentence corpus was analysed by the SRI Core Language Engine (Alshawi (ed), 1992), using a lexicon extended to cover the ATIS domain (Rayner, 1994). All possible grammatical analyses of each utterance were recorded, and an interactive tool was used to allow a human judge to identify the correct and incorrect readings of each utterance.</Paragraph>
    <Paragraph position="2"> The judge was a first-year undergraduate student with a good knowledge of linguistics but no prior experience with the system; the process of judging the corpus took about two and a half person-months.</Paragraph>
    <Paragraph position="3"> The input to the EBL-based grammar-specialization process was limited to readings of corpus utterances that had been judged correct. When utterances had more than one correct reading, a preference heuristic was used to select the most plausible one.</Paragraph>
    <Paragraph position="4"> Two sets of experiments were performed. In the first, increasingly large portions of the training set were used to train specialized grammars. The coverage loss due to grammar specialization was then measured on the 1,000 utterance test set. The experiment was carried out using both the chunking criteria from (Rayner and Samuelsson, 1994) (the &amp;quot;Old&amp;quot; scheme), and the chunking criteria described in Section 3 above (the &amp;quot;New&amp;quot; scheme). The results are presented in Table 1.</Paragraph>
    <Paragraph position="5"> The second set of experiments tested more directly the effect of constituent pruning and grammar specialization on the Spoken Language Translator's speed and coverage; in particular, coverage was measured on the real task of translating English into Swedish, rather than the artificial one of producing a correct QLF analysis. To this end, the first 500 test-set utterances were presented in the form of speech hypothesis lattices derived by aligning and conflating the top five sentence strings produced by a version of the DECIPHER (TM) recognizer (Murveit  number of training examples loss against et al., 1993). The lattices were analysed by four different versions of the parser, exploring the different combinations of turning constituent pruning on or off, and specialized versus unspecialized grammars.</Paragraph>
    <Paragraph position="6"> The specialized grammar used the &amp;quot;New&amp;quot; scheme, and had been trained on the full training set. Utterances which took more than 90 CPU seconds to process were timed out and counted as failures.</Paragraph>
    <Paragraph position="7"> The four sets of outputs from the parser were then translated into Swedish by the SLT transfer and generation mechanism (Agn~ et al., 1994). Finally, the four sets of candidate translations were pairwise compared in the cases where differing translations had been produced. We have found this to be an effective way of evaluating system performance. Although people differ widely in their judgements of whether a given translation can be regarded as &amp;quot;acceptable&amp;quot;, it is in most cases surprisingly easy to say which of two possible translations is preferable.</Paragraph>
    <Paragraph position="8"> The last two tables summarize the results. Table 2 gives the average processing times per input lattice for each type of processing (times measured running SICStus Prolog 3#3 on a SUN Sparc 20/HS21), showing how the time is divided between the various processing phases. Table 3 shows the relative scores of the four parsing variants, measured according to the &amp;quot;preferable translation&amp;quot; criterion.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML