<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0124">
  <Title>Analysis of Unknown Lexical Items using Morphological and Syntactic Information with the TIMIT Corpus</Title>
  <Section position="8" start_page="267" end_page="269" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> Two sets of data were collected in this experiment: one for the baseline run and one for the experimental run. Figure 1 compares the deletion rates, Figure 2 shows the insertion rates, and Figure 3 shows the match rates. Table 1 shows the total numbers of deletions and insertions, each also expressed as a percentage of the total number of parses for all the sentences in the corpus (1137).</Paragraph>
    <Paragraph position="1"> In Figure 1, we see that the deletion rate for the baseline data is zero, as expected: since every possible part of speech is assigned to each unknown word, all of the original parses should be generated. The deletion rate for the experimental run shows that some parses are deleted when a sentence contains one or more unknown words. With 10% of the open-class dictionary missing, there are 22 deletions out of 1137 total possible parse matches. This is an average of only 0.045 deletions per sentence, or 1.9% of the total parses, as shown in Figure 1 and Table 1. With 100% of the open-class dictionary missing (in other words, using only closed-class words), there are 225 deletions, an average of 0.63 per sentence, or 19.8% of the total parses. In other words, over 80% of the original parses are still produced. The deletions arise in part because many words in the complete dictionary are lexically ambiguous, whereas the morphological recognizer often assigns a smaller set of parts of speech; this can yield a correct parse without regenerating the entire parse forest.</Paragraph>
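The deletion percentages quoted above follow directly from the raw counts and the corpus total of 1137 parses. The short Python sketch below (not part of the original paper) reproduces the arithmetic as a sanity check:

```python
# Sanity check of the deletion rates reported in the text.
total_parses = 1137  # total possible parse matches across the corpus

for missing_pct, deletions in [(10, 22), (100, 225)]:
    rate = deletions / total_parses
    print(f"{missing_pct}% of dictionary missing: "
          f"{deletions} deletions -> {rate:.1%} of total parses")

# Share of original parses still produced at 100% missing ("over 80%"):
retained = 1 - 225 / total_parses
print(f"parses retained at 100% missing: {retained:.1%}")
```

Both percentages (1.9% and 19.8%) and the "over 80%" retention figure check out against the stated counts.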
    <Paragraph position="2"> The true value of the experimental approach can be seen when comparing the insertion rates.</Paragraph>
    <Paragraph position="3"> Figure 2 shows that the insertion rate in the baseline data is enormous. Baseline data are available only up to the point where 30% of the open-class dictionary is missing, because beyond that the program runs out of memory storing all of the spurious parses. Since the baseline parser assigns every open-class part of speech to an unknown word, a combinatorial explosion of parses occurs. In the baseline case with 30% of the open-class words missing, there are a total of 110,942 insertions, or 311.6 per sentence.</Paragraph>
    <Paragraph position="5"> At the same point in the experimental run (30% missing), there are only 2.5 insertions per sentence, less than one-hundredth of the baseline value. With 100% of the open-class dictionary missing, the morphological recognizer in the experimental run produces 69.3 insertions per sentence, compared with 311.6 insertions per sentence for the baseline method with only 30% of the dictionary missing. In terms of insertions, the performance of the experimental system with 100% of the open-class dictionary missing is comparable to the baseline performance with only 20% of the open-class dictionary removed. The experimental data thus show that the insertion rate is cut drastically relative to the baseline.</Paragraph>
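The "less than one-hundredth" claim can be verified directly from the quoted per-sentence figures. This brief Python sketch (an illustration, not from the paper) checks the ratio and the sentence count implied by the quoted totals:

```python
# Check the insertion-rate comparison at the 30%-missing point.
baseline_total = 110_942          # total baseline insertions at 30% missing
baseline_per_sentence = 311.6     # baseline insertions per sentence, as reported
experimental_per_sentence = 2.5   # experimental insertions per sentence at 30% missing

ratio = experimental_per_sentence / baseline_per_sentence
print(f"experimental/baseline ratio: {ratio:.4f}")  # well under 1/100

# The two quoted baseline figures together imply the corpus sentence count:
print(f"implied sentence count: {baseline_total / baseline_per_sentence:.0f}")
```

The implied sentence count (about 356) is also consistent with the deletion figures above: 225 deletions over roughly 356 sentences gives the reported 0.63 deletions per sentence.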
  </Section>
</Paper>