<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1017">
  <Title>Recent Improvements and Benchmark Results for Paramax ATIS System</Title>
  <Section position="4" start_page="91" end_page="93" type="evalu">
    <SectionTitle>
3. BENCHMARK EVALUATION
</SectionTitle>
    <Paragraph position="0"> &amp;quot; Paramax undertook the February 1992 ATIS benchmark evaluation tests, with the cooperation of BBN, who provided us with speech recognizer output as described below. Our results were, in a word, disappointing. One component of the set of factors leading to this level of performance was our concentration on improvements to our system which were of applicability to spoken language understanding systems in general, as opposed to specific features that would only be applicable to the ATIS domain. But it is clear from the experience of this test that domain-specific features will have to be included if we are to perform in the ATIS domain. While we knew we were underemphasizing such features, we were somewhat taken by surprise since as recently as three months ago our level of performance had been comparable with other sites.</Paragraph>
    <Paragraph position="1"> This latest test has not only gotten us thinking about the performance level of our system, but also about the sub-ject of evaluation in general. The DARPA Spoken Language Understanding program, along with the MUC effort, has made significant advances in the state of the art of evaluation of language understanding systems (\[4\]).</Paragraph>
    <Paragraph position="2"> Particularly for the natural language community, these advances have given a new objectivity to the assessment of the level of achievement of their systems. Tests such as the ATIS benchmarks allow participants to find out how well their systems perform on suites of hundreds of utterances, not only in the simple absolute sense, but in relation to other similar systems.</Paragraph>
    <Paragraph position="3"> The benchmark tests as administered, however, fall short of quantifying progress in any satisfactory sense. One might be tempted to claim that a history of scores on a series of similar tests gives an indication of progress or the lack thereof. It appears, however, that the only sense in which this might be meaningful is if one compares relative performance of different systems from one test to another. For example, our system, as remarked above, performed comparably to a number of other systems in past tests, yet did poorly in relation to those systems on this test. Yet that tells us nothing about whether our system improved, stagnated, or degraded.</Paragraph>
    <Paragraph position="4"> The reason that the current common evaluation paradigm fails to quantify progress over time is that the test data varies from test to test. This variability tends to lessen the reliability of comparisons between system performance on different evaluations. That is, it is difficult to interpret variations in a single system's performance over time because we cannot quantify the effect of accidental differences in the particular test data that happens to be selected for a particular evaluation. Furthermore, the test paradigm has undergone a number of changes which make it difficult to compare results from evaluation to evaluation. For example, in June 1990, the test data included only Class A utterances. In February 1991, Class D1 utterances were included, but in a separate test. In the February 1992 test, Class A, D, and X utterarices were all included together. In addition, in the current test scoring is being conducted under the min/max rules (\[5\]). All of these differences contribute to lessening the reliability of comparisons.</Paragraph>
    <Paragraph position="5"> We have experimented with a more tightly controlled variation of the common evaluation metric in which the same test data is processed by several different versions of our system - the current version, and two older versions. These older versions correspond generally to the system which was reported on in February 1991 (\[1\]) and the system which we used to participate in an informal evaluation in October 1991. 2 By holding the data constant and varying the system, we eliminate the effect of data variations on scores. Furthermore, by comparing scores produced this way with the scores our system has received on the standard evaluations, we demonstrate that variations in the test data are a real concern, because we see a much less consistent pattern of development over time with the standard evaluation. Figure 1 shows how the scores on the February 1992 natural language test for the three different versions of the Paramax system varied.</Paragraph>
    <Paragraph position="6">  From Figure 1 we can see that our system has made mod2The database revision released in May 1991 complicated matters somewhat. As we have reported elsewhere, our system has separate modules for natural language processing (PUNDIT) and database query generation (QTIP). We were forced to use the October 1991 QTIP with the February 1991 PUNDIT. Thus the performance labelled &amp;quot;February 1991&amp;quot; is really an overestimate, to whatever degree QTIP improved in that time.</Paragraph>
    <Paragraph position="7">  est improvements over time, as shown by the decrease in weighted error rate. Additionally, the percentage of correctly answered (T) inputs has increased, while the percentage of unanswered (NA) inputs has decreased and the percentage of incorrectly answered (F) inputs has remained nearly constant. In contrast, if we compare our scores on the October dry run with the February 1992 benchmark test, we find that our system obtained a weighted error score of 64.1% on the October dry run when all utterances in classes A, D1, and D were considered. For the February benchmark, the corresponding figure was 66.7%. Breaking this down more finely, our class D error decreased from 97.9% to 83.9%, while our class A error increased from 47.4% to 54.5%. From this we might have concluded that our class D performance improved at the expense of our class A performance and overall system performance.</Paragraph>
    <Section position="1" start_page="92" end_page="92" type="sub_section">
      <Paragraph position="0"> [Figure 2: Comparative performance on class A tests. Axis label: Evaluations.]</Paragraph>
      <Paragraph position="1"> Focusing only on class A utterances, we can go back farther in time, finding (Figure 2) that our system's error consistently decreased until the most recent test, which would reinforce the hypothesis that our class A performance degraded recently. Yet we see from Figure 3, which elaborates upon the information in Figure 1, that such a conclusion is unwarranted.</Paragraph>
      <Paragraph position="2"> By running the same test with different versions of the understanding system, as described above, we obtain important information on the changes over time in system performance. This simple extension of the evaluation methodology already in place supplements comparisons between systems from different sites and comparisons between different tests with clear-cut documentation of the progress made by an individual site on its system. As such, it is a valuable tool to add to the ever-increasing arsenal of objective system evaluation techniques.</Paragraph>
      <Paragraph position="3"> Table 1 summarizes our scores on the February 1992 natural language and spoken language benchmark tests.</Paragraph>
      <Paragraph position="4"> The SLS results were obtained by filtering nbest out- null put from BBN's speech recognition system through the Paramax natural language processing system, using an N of 6. This SLS score is only five percentage points inferior to our NL score. The corresponding difference for other sites ranged from ten to thirty-six points, with no correlation between actual scores and this difference. At this time we have no explanation for this intriguing phenomenon. 3</Paragraph>
    </Section>
    <Section position="2" start_page="92" end_page="93" type="sub_section">
      <SectionTitle>
3.1. Speech Recognition
</SectionTitle>
      <Paragraph position="0"> The speech recognition scores which Paramax submitted for this evaluation were produced by using the Paramax natural language processing system to filter the n-best output from BBN's speech recognition system. The n-best output used in this evaluation had a word error  will have in scoring an utterance, and if this estimate is too high, a score of F is automatically assigned, even though the answer may be correct according to the rules of rain/max scoring. The difficulty parameter is R!/(R-H)!, where R is the number of columns in the maximal answer, and H is the number of columns in the system's answer. This figure must be less than 3 * 105. Thus, for example, if the maximal answer has 15 columns, no more than 5 of them can appear in the system's answer. 20 of our answers on both the natural language test and the spoken language test were subject to this phenomenon. It is NIST's belief that no other site was similarly affected on more than 2 utterances, and they have given us permission to present scores which have been adjusted to account for comparator errors.</Paragraph>
      <Paragraph position="1">  rate of 10.7%. N was set to 6 for this test. The natural language system selected the first candidate in the n-best which passed its syntactic, semantic, and application constraints, and output this candidate as the recognized utterance. If no candidate passed the natural language constraints, the first candidate of the n-best was output as the recognized utterance. After natural language filtering, the official speech recognition score was 10.6%. Although intuitively, natural language constraints should be able to reduce speech recognition error, they do not appear to do so in this case. There are at least three possible reasons for this outcome. One, the speech recognition is already quite good, consequently there is less room for improvement. Two, the natural language processing is not very good. Three, there is always going to be some residue of speech recognizer errors which result in perfectly reasonable recognized utterances, and no amount of natural language knowledge will be able to correct these. We have not explored these hypotheses in detail; however, we have done a related experiment with n-best data of a totally different kind, which shows a remarkably similar pattern. We briefly describe this experiment in the following section.</Paragraph>
    </Section>
    <Section position="3" start_page="93" end_page="93" type="sub_section">
      <SectionTitle>
3.2. Optical Character Recognition
</SectionTitle>
      <Paragraph position="0"> Although the nbest architecture was developed in the context, of spoken language understanding, it is in fact applicable to any kind of input where indeterminacies in the input result in misrecognitions. In addition to speech recognizers, optical character recognition systems (OCR's) also have this property. Although current OCR technology is quite accurate for relatively clean data, accuracy is greatly reduced as the data becomes less clean.</Paragraph>
      <Paragraph position="1"> For example, faxed documents tend to have a high OCR error rate. Many OCR errors result in output which is meaningless, either because the output words are not legitimate words of the language, or because the output sentences do not make sense. We have applied linguistic constraints to the OCR problem using a variation of the N-best interface with a natural language processing system. Because the OCR produces only one alternative, an &amp;quot;alternative generator&amp;quot; was developed which uses spelling correction, a lexicon, and various scoring metrics to generate a list of alternatives from raw OCR output. Just as in the case of speech recognizer output, alternatives are sent to the natural language processing system, which selects the first candidate which passes its constraints.</Paragraph>
      <Paragraph position="2"> The system was tested with 120 sentences of ATIS data on which the natural language system had previously been trained. The text data was faxed and then scanned.</Paragraph>
      <Paragraph position="3"> NIST speech recognition scoring software was used to compute word error rates. The word error rate for output directly from the OCR was 13.1%. After the output was sent through the spelling corrector, it was scored on the basis of the first candidate of the N-best set of alternatives. The error rate was reduced to 4.6%. Finally, the output from the natural language system was scored, resulting in a final average error rate of 4.2%.</Paragraph>
      <Paragraph position="4"> Sending the uncorrected OCR output directly into the natural language system for processing without correction led to a 73% average weighted error rate. Spelling correction improved the error rates to 33%. Finally, with natural language correction, the weighted error rate improved to 28%. Thus, although improvements in word accuracy were minimal, application accuracy was greatly improved. This is consistent with previous experiments we have done on speech recognizer output \[6\]. Additional detail on this experiment can be found in \[7\].</Paragraph>
      <Paragraph position="5"> Interestingly, training on the OCR data led to a processing time improvement. Comparing the performance of the October system and the February system, we found that total cpu time for processing the February test data was reduced by one-third. This improvement was due to improving processing inefficiencies which were noted during analysis of the processing of the OCR data. It is encouraging that the system is sufficiently general that training on OCR data can improve the processing of natural language and spoken language data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>