<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1023"> <Title>Beyond Class A: A Proposal for Automatic Evaluation of Discourse</Title> <Section position="2" start_page="0" end_page="112" type="metho"> <SectionTitle> A Modest Proposal </SectionTitle> <Paragraph position="0"> We originaUy postponed evaluation of non-class A sentences because there was no consensus on automated evaluation techniques for these sentences. We would like here to propose a methodology for both &quot;unanswerable&quot; sentences and for automated evaluation of context-dependent sentences. By capturing these two additional classes in the evaluation, we can evaluate on more than 90% of the data; in addition, we can evaluate entire (wellformed) dialogues, not just isolated query/answer pairs.</Paragraph> <Section position="1" start_page="0" end_page="109" type="sub_section"> <SectionTitle> Unanswerable Queries </SectionTitle> <Paragraph position="0"> For unanswerable queries, we propose that the system recognize that the query is unanswerable and generate (for evaluation purposes) a canonical answer such as UNANSWERABLE_QUERY. This would be scored correct in exactly those cases where the query is in fact unanswerable. The use of a canonical message side-steps the tricky issue of exactly what kind of error message to issue to the user. This solution is proposed in the general spirit of the Canonical Answer Specification \[1\] which requires only a minimal answer, in order to impose the fewest constraints on the exact nature of the system's answer to the user. This must be distinguished from the use of NO_ANSWER, which flags cases where the system does not attempt to formulate a query. The NO.ANSWER response allows the system to admit that it doesn't understand something. By contrast, the UNANSWERABLE_QUERY answer actually diagnoses the cases where the system understands the query and determines that the query cannot be answered by the database.</Paragraph> </Section> <Section position="2" start_page="109" end_page="109" type="sub_section"> <SectionTitle> Capturing the Context </SectionTitle> <Paragraph position="0"> The major obstacle to evaluation of context-dependent sentences is how to provide the context required for understanding the sentences. If each system were able to replicate the context in which the data is collected, it should be possible to evaluate context-dependent queries. This context (which we will call the &quot;canonical context&quot;) consists of the query-answer pairs seen by the subject up to that point during data collection. Figure 1 shows the kind of context dependencies that are found in the ATIS corpus.</Paragraph> <Paragraph position="1"> These examples show how contextual information is used. Query 2 (... I would like to find flights going on to San Francisco on Monda~t the 9th of July) requires the previous query Q1 to determine that the starting point of this leg is Denver. Query 3 (What would be the fare on United 3~37) refers to an entity mentioned in the answer of Query 2, namely United 343. United 343 may well include several legs, flying from Chicago to Denver to San Francisco, for example, with three fares for the different segments (Chicago to Denver, Chicago to San Francisco, and Denver to San Francisco). However, Query 3 depends on context from the previous display to focus only on the fare from Denver to San Francisco. 
<Paragraph position="2"> A further example points out an additional difficulty in evaluating sentences dependent on context, namely the possibility of &quot;getting out of synch&quot;. In this example, the system misprocesses the original request, saying that there are no flights from Atlanta to Denver leaving before 11. When the follow-up query asks Show me the cheapest one, there is an apparent incoherence, since there is no &quot;cheapest&quot; one in the empty set. However, if the canonical query/answer pairs are provided during evaluation, the system can &quot;resynchronize&quot; to the information originally displayed to the user and thus recognize that it should choose the cheapest flight from the set given in the canonical answer.</Paragraph> </Section>
<Section position="3" start_page="109" end_page="109" type="sub_section"> <SectionTitle> Providing the Canonical Context </SectionTitle>
<Paragraph position="0"> The above examples illustrate what information is needed in order to understand queries in context. The next question is how to provide this &quot;canonical context&quot; (consisting of the query/answer pairs generated during data collection) for purposes of automated evaluation.</Paragraph>
<Paragraph position="1"> Providing the set of queries is, of course, not a problem: this is exactly the set of input data. (Of course, if the input is speech data, then the system could misunderstand the speech data; therefore, to preserve synchronization as much as possible, we propose that the transcribed input be provided for evaluation of speech input.) Providing the canonical answers is more of a problem, because it requires each system to reproduce the answer displayed during data gathering. Since there is no agreement as to what constitutes the best way to display the data, requiring that each system reproduce the original display seems far too constraining. However, we can provide, for evaluation purposes, the display seen by the subject during data collection. The log file in the training data contains this information in human-readable form. It can be provided in a more convenient form for automatic processing by representing the display as a list of lists, where the first element in the list is the set of column headings, and the remaining elements are the rows of data. This &quot;canonical display format&quot; is illustrated in the accompanying figure.</Paragraph>
<Paragraph position="3"> For evaluation, the canonical (transcribed) query and the canonical display would be furnished with each query, to provide the full context to the system, allowing it to &quot;resynchronize&quot; at each step in the dialogue. (There is still the possibility that the system misinterprets the query and then needs to use the query as context for a subsequent query; in this case, providing the answer may not help, unless there is some redundancy between the query and the answer.) The system could then process the query (which creates any context associated with the query) and answer the query (producing the usual CAS output). It would then reset its context to the state before query processing and add the &quot;canonical context&quot; from the canonical query and from the canonical display, leaving the system with the appropriate context to handle the next query. This is illustrated in Figure 5.</Paragraph>
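To make these two pieces concrete, here is a minimal sketch (in Python; the data values and the system interface, such as `process_query` and `restore_context`, are hypothetical and not part of any actual evaluation software) of a canonical display represented as a list of lists and of an evaluation loop that resynchronizes on the canonical query/display after each utterance.

```python
# Canonical display format: the first element is the list of column headings,
# the remaining elements are the rows of data shown to the subject during
# data collection. (Values here are illustrative.)
canonical_display = [
    ["AIRLINE", "FLIGHT", "FROM", "TO", "FARE"],
    ["UA", 343, "DEN", "SFO", 150.00],
    ["CO", 1295, "DEN", "SFO", 135.00],
]

def evaluate_dialogue(system, dialogue):
    """Hypothetical evaluation loop over the utterances of one dialogue."""
    scores = []
    for turn in dialogue:
        # Each turn is assumed to carry the transcribed query, the reference
        # CAS answer, and the canonical display seen during data collection.
        saved = system.save_context()                  # state before the query
        answer = system.process_query(turn["query"])   # produces CAS output
        scores.append(answer == turn["reference_cas"])
        # Resynchronize: discard whatever context the system built from its
        # own (possibly wrong) answer and rebuild it from the canonical query
        # and canonical display, so the next query is interpreted in the same
        # context the subject actually saw.
        system.restore_context(saved)
        system.add_context(turn["query"], turn["canonical_display"])
    return scores
```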
<Paragraph position="4"> This methodology allows the processing of an entire dialogue, even when the context may not be from the directly preceding query, but from a few queries back. At Unisys, we have already demonstrated the feasibility of substituting an &quot;external&quot; DB answer for the internally generated answer \[3\]. We currently treat the display (that is, the set of DB tuples returned) as an entity available for reference, in order to capture answer/question dependencies, as illustrated in Figure 3.</Paragraph> </Section>
<Section position="4" start_page="109" end_page="109" type="sub_section"> <SectionTitle> Ambiguous Queries </SectionTitle>
<Paragraph position="0"> In addition to the suggestions for handling unanswerable queries and context-dependent queries, there seems to be an emerging consensus that ambiguous queries can be handled by allowing any of several possible answers to be counted as correct. The system would then be resynchronized as described above, to use the canonical answer furnished during data collection.</Paragraph> </Section>
<Section position="5" start_page="109" end_page="109" type="sub_section"> <SectionTitle> Evaluation Format </SectionTitle>
<Paragraph position="0"> Taking into consideration the need for context and the need to allow systems to resynchronize as much as possible, we propose a form of test input for each utterance in a dialogue. For evaluation, the system still outputs a transcription and an answer in CAS format; these are evaluated against the SNOR transcription and the reference answer in CAS, as is done now.</Paragraph>
<Paragraph position="1"> With each utterance, the system processes the utterance, then is allowed to &quot;resynchronize&quot; against the correct question-answer pair, provided as part of the evaluation input data, before evaluating the next utterance.</Paragraph>
<Paragraph position="2"> Is It Too Easy To Cheat? One obvious drawback of this proposal is that it makes it extremely easy to cheat: the user is provided with the transcription and the database display. It is clearly easy to succumb to the temptation to look at the answer, but it is just as easy to look at the input sentences under the current system; only honesty prevents us from doing that. Providing a canonical display raises the possibility of deriving the correct answer by a simple reformatting of the canonical display. However, it would be easy to prevent this simple kind of cheating by inserting extra tuples or omitting a required tuple from the canonical display answer. This would make any answer derived from the display not compare correctly to the canonical answer. In short, the issue of cheating does not seem like an insurmountable obstacle: we are now largely on the honor system, and if we wished to make it more difficult to cheat, it is not difficult to think of minor alterations that would protect the system from obvious mappings of input to correct answer.</Paragraph> </Section>
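Pulling these strands together, the sketch below (hypothetical; CAS answers are simplified to plain Python values, and the treatment of NO_ANSWER as a separate tally is an assumption, since the text does not specify how it is scored) shows how a single answer might be scored under this proposal: any of several reference answers counts as correct for an ambiguous query, and UNANSWERABLE_QUERY is simply the sole reference answer for an unanswerable one.

```python
# Hypothetical scoring sketch for a single query under this proposal.
UNANSWERABLE_QUERY = "UNANSWERABLE_QUERY"
NO_ANSWER = "NO_ANSWER"

def score_answer(hypothesis, reference_answers):
    """Return 'correct', 'incorrect', or 'no_answer' for one query."""
    if hypothesis == NO_ANSWER:
        # The system declined to formulate a query at all.
        return "no_answer"
    if hypothesis in reference_answers:
        # Matches one of the acceptable answers; for an unanswerable query
        # the only acceptable answer is UNANSWERABLE_QUERY itself.
        return "correct"
    return "incorrect"

# Examples (illustrative values, not real CAS output):
print(score_answer(UNANSWERABLE_QUERY, [UNANSWERABLE_QUERY]))           # correct
print(score_answer([("UA", 343, 150.00)],
                   [[("UA", 343, 150.00)], [("CO", 1295, 135.00)]]))     # correct
print(score_answer(NO_ANSWER, [UNANSWERABLE_QUERY]))                    # no_answer
```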
<Section position="6" start_page="109" end_page="112" type="sub_section"> <SectionTitle> Evaluating Whole Discourses </SectionTitle>
<Paragraph position="0"> There are several arguments in favor of moving beyond class A queries:
* Yield is increased from 60% to over 90%;
* Data categorization is easier (due to elimination of the context-removable class);
* Data validation is easier (no need to rerun context-removable queries);
* Data from different data collection paradigms can be used by multiple sites;
* We address a realistic problem, not just an artificial subset.</Paragraph>
<Paragraph position="1"> This is particularly important in light of the results from the June evaluation. In general, systems performed in the 50-60% range on class A sentences. This means that the coverage of the data was in the 30-40% range. If we move on to include unanswerable queries and context-dependent queries, we are at least looking at more than 90% of the data. Given that several sites already have the ability to process context-dependent material (\[4\], \[6\], \[3\]), this should enable contractors to report significantly better overall coverage of the corpus.</Paragraph>
<Paragraph position="2"> Subjective Evaluation Criteria. In addition to these fully automated evaluation criteria, we also propose that we include some subjective evaluation criteria, specifically:</Paragraph> </Section> </Section>
<Section position="3" start_page="112" end_page="112" type="metho"> <SectionTitle>
* User Satisfaction
* Task Completion Quality and Time
</SectionTitle>
<Paragraph position="0"> At the previous meeting, the MIT group reported on results using outside evaluators to assess system performance (\[5\]). We report on a similar experiment at this meeting (\[2\]), in which three evaluators showed good reliability in scoring correct system answers. This indicates that subjective black-box evaluation is a feasible approach to system evaluation. Our suggestion is that subjective evaluation techniques be used to supplement and complement the various automated techniques under development.</Paragraph>
<Paragraph position="1"> Conclusion. This proposal does not address several important issues. For example, clearly a useful system would move towards an expert system, and not remain restricted to a DB interface. We agree that this is an important direction, but have not addressed it here. We also agree with observations that the Canonical Answer hides or conflates information; it does not capture the notion of focus, for example. And we have explicitly side-stepped the difficult issues of what kind of detailed error messages a system should provide, how it should handle failed presuppositions, and how it should respond to queries outside the DB. For the next round, we are suggesting that it is sufficient to recognize the type of problem the system has, and to supplement the objective measures with some subjective measures of how actual users react to the system.</Paragraph> </Section>
</Paper>