File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/91/h91-1017_evalu.xml
Size: 5,944 bytes
Last Modified: 2025-10-06 14:00:01
<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1017"> <Title>Interface Bugs Ignored Test Set System % Correct % Incorrect % No Answer Total Error</Title> <Section position="10" start_page="108" end_page="109" type="evalu"> <SectionTitle> EXPERIMENTAL RESULTS PILOT STUDIES </SectionTitle> <Paragraph position="0"> The current implementation of SOUL was trained on the first two-thirds of the ATIS0 training data available in June 1990, consisting of Dialogs B0 through B9 and BA through BN. The training set contained 472 utterances. SOUL was evaluated on the following three independent, non-overlapping test sets. Set 1 contained the 94 Class A and context-removable utterances from the official June 90 ATIS0 test set. Sets 2 and 3 both used sentences from Dialogs BO through BZ from the June 1990 data.</Paragraph> <Paragraph position="1"> These were set aside for use as an independent test set. What is important about this data, is that unlike Set 1, and the February 1991 official DARPA test set (Set 4, described later), the data are not restricted in any manner. All utterances produced by the speakers are included in the test set, regardless of whether they are well formed, within the bounds of the domain, ambiguous, context dependent, etc. Set 2 included all 232 utterances that were not context dependent, and therefore contained unanswerable, ambiguous, ill-formed and ungrammatical utterances, as well as Class A and context-removable queries. Set 3 consisted of the remaining 29 context-dependent uuerances contained in the transcripts from Dialogs BO through BZ.</Paragraph> <Paragraph position="2"> Results of the three evaluations, comparing the performance of PHOENIX alone and PHOENIX plus SOUL are given in the table below. These results were obtained using the standard DARPA/NIST scoring software. However, we allowed for a variety of additional error messages which were more specific than the generic NIST errors. Results using Test Set 1, indicate SOUL's ability to detect and correct inaccurate and incomplete output from the PHOENIX parser, since these sentences consist only of answerable, legal and non-ambiguous utterances. As these utterances are constrained, it is expected that only minor irnprovments will result for the addition of SOUL. In contrast, Test Set 2 contains unrestricted input, namely all utterances generated which are interpretable without context. Results using Test Set 2 indicate SoUL's ability to recognize unanswerable, derive multiple interpretations for ambiguous input, to interpret ill-formed and un-grarnmatical input and to correct inaccurate output from the PHOENIX parser. Finally, results from Test Set 3 indicate SoUL's proficiency in detecting context dependent utterances. However, it should be noted that the Test Set 3 results are not representative of POENIX performance. PHOENIX is designed to process context dependent utterances only when using context.</Paragraph> <Paragraph position="3"> for processing superlatives, comparatives and yes/no utterances were moved into the PHOENIX system. Therefore, the results presented in this section differ from the Test Set #1 in the pilot study section above in that all increases in accuracy due to properly interpreting superlatives, comparatives and yes/no questions are no longer included in the SOUL results.</Paragraph> <Paragraph position="4"> The February 1991 test set contained 148 isolated utterances and 38 utterances which are interpreted using limited context, or context provided by the preceeding query and its answer. These 148 individual utterances were constrained to be &quot;Class A&quot;, or answerable and unambiguous. On the 38 &quot;DI&quot; or limited context queries, ff the interpretation or the database response produced in response to the first utterance is inaccurate, the tested utterance will probably not be correct. The results of the evaluation are presented below.</Paragraph> <Paragraph position="5"> A number of interfaces were modified after performing the pilot evaluation and just prior to performing the February 1991 evaluation presented above. As a result, we introduced a number of errors into the system. The following table breaks down the sources of error and the percentages attributable to each source.</Paragraph> <Paragraph position="6"> As seen in Table BQ3 of the 19.3% errors on Class A queries, 9% are due to back-end bugs and an additional 2% are due to bugs in the interface between the PHOENIX and SOUL modules.</Paragraph> <Paragraph position="7"> Hence, 12 out of the 19.3 % errors are due to back-end type interface bugs.</Paragraph> <Paragraph position="8"> For Class D1 queries, of the 34.21% error, 18.42 are due to interface problems, a number reasonably proportional to the Class A results. Here, 5.26% is found in the PHOENIX SOUL interface and 13.16 in the database interface.</Paragraph> <Paragraph position="9"> All in all, the real errors made by the system for Class A queries result in roughly an 8% error rate, where for Class D1 queries, the error rate is roughly 16%, exactly double they Class A error rate. This is to be expected, as an error made on the context setting portion of a query will necessarily result in an error on the context dependent portion of a set of queries.</Paragraph> <Paragraph position="10"> An analysis of the official results including interface bugs and/or of the results excluding interfaces bugs indicates that SOUL is responsible for a sight decrease in overall error rate.</Paragraph> <Paragraph position="11"> The results also indicate that SOUL is reasonably good at detecting and flagging inaccurate interpretations, even when it is not able to produce a correct interpretation.</Paragraph> <Paragraph position="12"> end bugs are presented below. These data are derived from the original and official complete log of the February 5, 1991 data run.</Paragraph> </Section> class="xml-element"></Paper>