<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1014">
<Title>Development and Preliminary Evaluation of the MIT ATIS System 1</Title>
<Section position="5" start_page="91" end_page="92" type="evalu">
<SectionTitle> EVALUATION </SectionTitle>
<Paragraph position="0"> Table 1 summarizes our results for the three obligatory system evaluations using the February-91 test set provided by NIST. The first test takes as input the transcriptions of the so-called Class A sentences, i.e., sentences that are context independent, and produces a CAS output. (CAS, or Common Answer Specification, is a standardized format for the information retrieved from the OAG database, which is compared against a &quot;reference&quot; CAS using a comparator provided by NIST.) The second test is the same as the first one, except that the sentences are Class D1, i.e., their interpretation depends upon a previous sentence, which is provided as additional input. The last test is the same as the first, except that the input is speech rather than text. For each data set, we give the percent correct, percent incorrect, percent with no answer, and the overall score, where the score penalizes incorrect answers weighted equally against correct answers (see the illustrative scoring sketch below).</Paragraph>
<Paragraph position="1"> Comparing the first row of Table 1 with our results from last June, our current implementation makes considerably fewer false alarms for text input. The two errors that the system made were due to a minor system bug; while the correct answer was displayed to the user locally, we inadvertently sent the wrong one to the comparator. We are encouraged by this result, since the errors were all due to factors unrelated to the development of the natural language technology, and as such can be fixed trivially.
[Table 1 (columns: Data, No. of, Correct, Incorrect, No Answer, Score) appears here.]</Paragraph>
<Paragraph position="3"> The results for the context-dependent sentences are given in column 2 of Table 1. Our system provided correct answers for 18 of the 38 context pairs, and made only 2 errors. This is a more stringent test than the first one, since providing the correct answer in this case demands that both sentences be correctly understood. One of the errors was due to the same system bug described above, i.e., the right answer was displayed but not sent. In the second one, which we considered to be the only error made by the natural language system, the system simply ignored the context.</Paragraph>
<Paragraph position="4"> The results for the Class A sentences with speech input are given in column 3 of Table 1. Of the 19 sentences for which the system provided an incorrect answer, 2 were correctly recognized but failed due to the system bug mentioned above.</Paragraph>
<Paragraph position="5"> We recently collected a sizable amount of spontaneous speech data, using a paradigm very different from the one used at TI. Our preliminary analyses of the two data sets have indicated significant differences in many dimensions, including the speaking rate, vocabulary growth, and amount of spontaneous speech disfluencies [3]. We thought it might be interesting to compare our system's performance on the two data sets. To this end, we asked B. Bly of SRI to help us generate the CAS reference answers for the designated development-test set of our database. The test set consists of 371 sentences, of which 198 were classified by Ms. Bly as Class A. Since no aspects of our system had been trained on these data, we consider it to be a legitimate test set for the purposes of this experiment, although we plan to use it in the future as a development test set.</Paragraph>
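As a point of reference for the scoring used in Tables 1 and 2, the following minimal sketch (in Python) illustrates the arithmetic described above: correct and incorrect answers are weighted equally, while unanswered queries count toward neither. This is only an illustration under that reading, not the NIST comparator or scoring software; the function name and the example counts are assumptions, chosen to be consistent with the speech-input MIT result quoted later (roughly 40% correct and 8% incorrect, giving a score of about 32%).

def atis_score(num_correct, num_incorrect, num_no_answer):
    # Overall score = percent correct minus percent incorrect; unanswered
    # queries lower the percent correct but are not otherwise penalized.
    total = num_correct + num_incorrect + num_no_answer
    pct_correct = 100.0 * num_correct / total
    pct_incorrect = 100.0 * num_incorrect / total
    return pct_correct - pct_incorrect

# Hypothetical counts consistent with the speech-input MIT run described below:
# about 40% correct and 8% incorrect give an overall score of about 32%.
print(atis_score(40, 8, 52))   # -> 32.0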
<Paragraph position="6"> The results for CAS output with both text and speech input for the MIT data are given in Table 2. We should point out that in an initial run of the text-input condition, several answers marked as &quot;incorrect&quot; were judged by us to be dubious. We submitted these questionable answers to Ms. Bly, as part of the customary follow-up process of &quot;human adjudication&quot; established at NIST. The results in Table 2 thus represent the final outcome. Some of the discrepancies were due to an error on the part of the reference answer, several were due to the &quot;yes/no&quot; vs. table problem, and several were due to the fact that SRI assumed 1990 for all dates, whereas most of the dates were actually in January of 1991.
[Table 2: MIT development test set, with CAS reference answers provided by SRI, for both text and speech input.]</Paragraph>
<Paragraph position="8"> For the text-in/CAS-out test condition, we obtained an overall score of 72.7%, which is a dramatic improvement over our results on the TI data. In two of the three errors, the back-end ignored certain critical modifiers in the frame. The third error was fairly subtle: we interpreted &quot;I'd like to book a flight between Boston and San Francisco with stops in Denver and Atlanta&quot; to mean &quot;or&quot; rather than &quot;and&quot; for the stops.</Paragraph>
<Paragraph position="9"> We produced a correct answer for almost 40% of the utterances when the speech recognizer was included in the system, with an 8% false alarm rate. This gave an overall score of 32%, which is again substantially higher than the 18.6% score we received for the recognizer results on the standard test set. The MIT test was run after some bug fixes, which would have improved the score for the TI data to 24% (see Table 3). However, this is still substantially lower than the score for the MIT set. This is all the more surprising since the MIT test data were not prescreened for speech disfluencies (the TI Class A sentences were screened to exclude disfluencies for the basic test set); we included all of the Class A sentences of each test speaker.</Paragraph>
<Paragraph position="10"> There are several possible explanations for the discrepancy.</Paragraph>
<Paragraph position="11"> We believe that the MIT sentences are spoken more fluently, as suggested by the results of a statistical analysis reported in [3]. We also suspect that MIT subjects tend to use constructs that are more straightforward and conform more closely to standard English. Finally, the MIT sentences include very few table clarification questions, a feature which allowed us to reduce the size and perplexity of our grammar.</Paragraph>
<Paragraph position="12"> In general, the speech recognition error rate for our system is significantly higher in the ATIS domain than what we have experienced with the Resource Management domain.</Paragraph>
<Paragraph position="13"> One conclusion we may draw is that spontaneous speech, out-of-domain words, and novel linguistic constructs can combine to degrade recognition performance drastically.</Paragraph>
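Table 3, referenced above, reports recognition results under the column headings Corr, Sub, Del, Ins, and Error. Those headings suggest the standard word-level scoring obtained by aligning each recognized word string against its reference transcription; the sketch below illustrates that computation under this assumption. It is illustrative only, not the paper's or NIST's scoring software, and the example sentences are hypothetical.

def word_error_counts(reference, hypothesis):
    # Align the hypothesis word string against the reference with a standard
    # edit-distance dynamic program, then report Corr, Sub, Del, Ins, and the
    # word error rate in percent: 100 * (Sub + Del + Ins) / reference length.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)        # unmatched reference words: deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)        # unmatched hypothesis words: insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            s = 0 if ref[i - 1] == hyp[j - 1] else 1
            c, sub, dl, ins = dp[i - 1][j - 1]
            sub_path = (c + s, sub + s, dl, ins)
            c, sub, dl, ins = dp[i - 1][j]
            del_path = (c + 1, sub, dl + 1, ins)
            c, sub, dl, ins = dp[i][j - 1]
            ins_path = (c + 1, sub, dl, ins + 1)
            dp[i][j] = min(sub_path, del_path, ins_path)
    _, subs, dels, ins = dp[len(ref)][len(hyp)]
    corr = len(ref) - subs - dels
    error = 100.0 * (subs + dels + ins) / len(ref)
    return corr, subs, dels, ins, error

# Hypothetical example: one deletion and one insertion against a 7-word
# reference gives roughly (6, 0, 1, 1, 28.6).
print(word_error_counts("show me flights from boston to denver",
                        "show flights from boston to the denver"))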
<Paragraph position="14"> There are several other factors that contribute to our increased speech recognition error rate. The version of our recognizer in the ATIS system used only context-independent phoneme models. This is because we have focused our research attention in speech recognition primarily on the Resource Management task, and did not devote any effort to the ATIS domain until late January. The word-pair language model that we developed has a coverage of only 50% of the TI test set sentences. It also has a perplexity of over 90, which is higher by a factor of at least four compared to what others have used (see, for example, [2]). In the next few months, we intend to incorporate context-dependent modelling into the ATIS domain. We will also replace the word-pair language model with a bigram so as to increase the coverage and lower the perplexity.
[Table 3: speech recognition results; columns: Data, No. of, Corr, Sub, Del, Ins, Error.]</Paragraph>
</Section>
</Paper>