<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1040">
  <Title>COMBINING KNOWLEDGE SOURCES TO REORDER N-BEST SPEECH HYPOTHESIS LISTS</Title>
  <Section position="6" start_page="218" end_page="219" type="evalu">
    <SectionTitle>
3.3. Results
</SectionTitle>
    <Paragraph position="0"> Table 1 shows the sentence error rates for different preference methods and utterance lengths, using 10-best lists; Table 2 shows the word error rates for each method on the full set.</Paragraph>
    <Paragraph position="1"> The absolute decrease in the sentence error rate between 1-best and optimized 10-best with all KSs is from 33.7% to 29.9%, a proportionM decrease of 11%. This is nearly exactly the same as the improvement measured when the lists were rescored using a class trigram model, though it should be stressed that the present experiments used far less training data. The word error rate decreased from 7.5% to 6.4%, a 13% proportional decrease. Here, however, the trigram model performed significantly better, and achieved a reduction of 22%.</Paragraph>
    <Paragraph position="2"> It is apparent that nearly all of the improvement is coming from the linguistic KSs; the difference between the lines &amp;quot;recognizer + linguistic KSs&amp;quot; and &amp;quot;all available KSs&amp;quot; is not significant. Closer inspection of the results also shows that the improvement, when evaluated in the context of the spoken languagetranslation task, is rather greater than Table 1  would appear to indicate. Since the linguistic KSs only look at the abstract semantic analyses of the hypotheses, they often tend to pick harmless syntactic variants of the reference sentence; for example all el the can be substituted for all the or what are ... for which are .... When syntactic variants of this kind are scored as correct, the figures are as shown in Table 3. The improvement in sentence error rate on this method of evaluation is from 28.8% to 22.8%, a proportional decrease of 21~0. On either type of evaluation, the difference between &amp;quot;all available KSs&amp;quot; and any other method except &amp;quot;recognizer + linguistic KSs&amp;quot; is significant at the 5% level according to the McNemar sign test \[10\].</Paragraph>
    <Paragraph position="3"> One point of experimental method is interesting enough to be worth a diversion. In earlier experiments, reported in the notebook version of this paper, we had not separated the data in such a way as to ensure that the speakers of the utterances in the test and training data were always disjoint. This led to results that were both better and also qualitatively different; the N-gram KSs made a much larger contribution, and appeared to dominate the linguistic KSs. This presumably shows that there are strong surface uniformities between utterances from at least some of the speakers, which the N-gram KSs can capture more easily than the linguistic ones. It is possible that the effect is an artifact of the data-collection methods, and is wholly or partially caused by users who repeat queries after system misrecognitions.</Paragraph>
    <Paragraph position="4"> For a total of 88 utterances, there was some acceptable 10-best hypothesis, but the hypothesis chosen by the method  that made use of all available KSs was unacceptable. To get a more detailed picture of where the preference methods might be improved, we inspected these utterances and categorized them into different apparent causes of failure. Four main classes of failure were considered: Apparently impossible: There is no apparent reason to prefer the correct hypothesis to the one chosen without access to intersentential context or prosody. There were two main subclasses: either some important content word was substituted by an equally plausible alternative (e.g. &amp;quot;Minneapolis&amp;quot; instead of &amp;quot;Indianapolis&amp;quot;), or the utterance was so badly malformed that none of the alternatives seemed plausible.</Paragraph>
    <Paragraph position="5"> Coverage problem: The correct hypothesis was not in implemented linguistic coverage, but would probably have been chosen if it had been; alternately, the selected hypothesis was incorrectly classed as &amp;quot; being in linguistic coverage, but would probably not have been chosen if it had been correctly classifted as ungrammatical.</Paragraph>
    <Paragraph position="6"> Clear preference failure: The information needed to make the correct choice appeared intuitively to be present, but had not been exploited.</Paragraph>
    <Paragraph position="7"> Uncertain: Other cases.</Paragraph>
    <Paragraph position="8"> The results are summarized in Table 4.</Paragraph>
    <Paragraph position="9"> At present, the best preference method is in effect able to identify about 40% of the acceptable new hypotheses produced when going from 1-best to 10-best. (In contrast, the &amp;quot;highest-in-coverage&amp;quot; method finds only about 20%.) It appears that addressing the problems responsible for the last three failure categories could potentially improve the proportion to something between 70% and 90%. Of this increase, about two-thirds could probably be achieved by suitable improvements to linguistic coverage, and the rest by other means. It seems plausible that a fairly substantial proportion of the failures not due to coverage problems can be ascribed to the very small quantity of training data used.</Paragraph>
  </Section>
</Paper>