<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2185">
  <Title>An Interactive Domain Independent Approach to Robust Dialogue Interpretation</Title>
  <Section position="6" start_page="1132" end_page="1134" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> An empirical evaluation was conducted in order to determine how much improvement can be gained with limited amounts of interaction in the domain independent ROSE approach. This evaluation is an end-to-end evaluation where a sentence expressed in the source language is parsed into a language independent meaning representation using the ROSE approach. This meaning representation is then mapped onto a sentence in the target language. In this case both the source language and the target language are English. An additional evaluation demonstrates the improvement in interaction quality that can be gained by introducing available domain information.</Paragraph>
    <Section position="1" start_page="1133" end_page="1133" type="sub_section">
      <SectionTitle>
4.1 Domain Independent Repair
</SectionTitle>
      <Paragraph position="0"> First the system automatically selected 100 sentences from a previously unseen corpus of 500 sentences. These 100 sentences are the first 100 sentences in the set that a parse quality heuristic similar to that described in (Lavie, 1995) indicated to be of low quality. The parse quality heuristic evaluates how much skipping was necessary in the parser in order to arrive at a partial parse and how well the parser's analysis scores statistically. It should be kept in mind, then, that this testing corpus is composed of 100 of the most difficult sentences from the original corpus.</Paragraph>
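A parse quality heuristic of the kind described above can be sketched as a weighted combination of parser coverage and the statistical score of the analysis. This is a hypothetical illustration, not the actual ROSE or (Lavie, 1995) implementation; the function names, weights, and threshold are all assumptions.

```python
# Hypothetical sketch of a parse quality heuristic in the spirit of the
# one described above: it penalizes skipping done by the parser and
# rewards a high statistical score for the analysis. Weights and the
# low-quality threshold are illustrative assumptions.

def parse_quality(num_words, num_skipped, stat_score,
                  skip_weight=0.6, score_weight=0.4):
    """Return a quality score in [0, 1]; higher means a better parse.

    num_words   -- length of the input sentence in words
    num_skipped -- words the parser skipped to reach a partial parse
    stat_score  -- statistical score of the parser's analysis, in [0, 1]
    """
    coverage = 1.0 - num_skipped / num_words
    return skip_weight * coverage + score_weight * stat_score

def is_low_quality(num_words, num_skipped, stat_score, threshold=0.7):
    """Flag a sentence for the difficult-sentence test set."""
    return parse_quality(num_words, num_skipped, stat_score) < threshold
```

Under this sketch, a 10-word sentence with 5 skipped words and a weak statistical score would be flagged as low quality and selected for the test corpus.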
      <Paragraph position="1"> The goal of the evaluation was to compute average performance per question asked and to compare this with the performance with using only the partial parser as well as with using only the Hypothesis Formation phase. In each case performance was measured in terms of a translation quality score assigned by an independent human judge to the generated natural language target text. Scores of Bad, Okay, and Perfect were assigned. A score of Bad indicates that the translation does not capture the original meaning of the input sentence. Okay indicates that the translation captures the meaning of the input sentence, but not in a completely natural manner.</Paragraph>
      <Paragraph position="2"> Perfect indicates both that the resulting translation captures the meaning of the original sentence and that it does so in a smooth and fluent manner.</Paragraph>
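The Bad/Okay/Perfect grading scale above can be tallied into an acceptability rate, treating Okay and Perfect as acceptable. This is a minimal sketch of that bookkeeping, with hypothetical data; it is not code from the evaluation itself.

```python
# Minimal sketch: tallying translation-quality grades as described above.
# "Okay" and "Perfect" both count as acceptable; "Bad" does not.
from collections import Counter

def acceptability_rate(grades):
    """Fraction of translations graded Okay or Perfect."""
    counts = Counter(grades)
    acceptable = counts["Okay"] + counts["Perfect"]
    return acceptable / len(grades)

# Illustrative grades, not data from the paper.
grades = ["Bad", "Okay", "Perfect", "Bad", "Okay"]
rate = acceptability_rate(grades)  # 3 of 5 acceptable
```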
      <Paragraph position="3"> Eight native speakers of English who had never previously used the translation system participated in this evaluation, interacting with the system directly. For each sentence, the participants were presented with the original sentence and with three or fewer questions to answer. The parse result, the result of repair without interaction, and the result for each user after each question were recorded in order to be graded later by the independent judge mentioned above. Note that this evaluation was conducted on the noisiest portion of the corpus, not on an average set of naturally occurring utterances. While this evaluation indicates that repair without interaction yields an acceptable result in only 36% of these difficult cases, in an evaluation over the entire corpus it was determined to return an acceptable result in 78% of the cases.</Paragraph>
      <Paragraph position="4"> A global parameter was set such that the system never asked more than a maximum of three questions. This limitation was placed on the system in order to keep the task from becoming too tedious and time consuming for the users. It was estimated that three questions was approximately the maximum number of questions that users would be willing to answer per sentence.</Paragraph>
      <Paragraph position="5"> The results are presented in Figure 7. Repair without interaction achieves a 25% reduction in error rate. Since the partial parser only produced sufficient chunks for building an acceptable repair hypothesis in about 26% of the cases where it did not produce an acceptable hypothesis by itself, the maximum possible reduction in error rate was 26%. Thus, a 25% reduction in error rate without interaction is a very positive result. Additionally, interaction increases the system's average translation quality above that of repair without interaction. With three questions, the system achieves a 37% reduction in error rate over partial parsing alone.</Paragraph>
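The error-rate reductions quoted above are relative reductions, i.e. the fraction of the baseline's errors that were eliminated. A one-line sketch of that arithmetic, with illustrative numbers (not the paper's raw counts):

```python
# Sketch of the relative error-rate-reduction arithmetic used above:
# (baseline_errors - new_errors) / baseline_errors.

def error_rate_reduction(baseline_errors, new_errors):
    """Relative reduction in error rate, in [0, 1]."""
    return (baseline_errors - new_errors) / baseline_errors

# Illustrative numbers, not from the paper: if partial parsing alone
# leaves 100 difficult sentences unacceptable and repair without
# interaction fixes 25 of them, the reduction is 25%.
reduction = error_rate_reduction(100, 75)  # 0.25
```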
    </Section>
    <Section position="2" start_page="1133" end_page="1134" type="sub_section">
      <SectionTitle>
4.2 Discourse Based Interaction
</SectionTitle>
      <Paragraph position="0"> In a final evaluation, the quality of questions based only on feature information was compared with that of questions focused on the task level using discourse information. The discourse processor was only able to provide sufficient information for reformulating 22% of the questions in terms of the task. The reason is that this discourse processor only provides information for reformulating questions distinguishing between meaning representations that differ in terms of status and augmented temporal information.</Paragraph>
      <Paragraph position="1"> Four independent human judges were asked to grade pairs of questions, assigning a score between 1 and 5 for relevance and form and indicating which question they would prefer to answer. They were instructed to think of relevance in terms of how useful they expected the question would be in helping a computer understand the sentence the question was intended to clarify. For form, they were instructed to evaluate how natural and smooth sounding the generated question was.</Paragraph>
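The judging procedure above (1-5 scores for form and relevance, plus a three-way preference) can be aggregated as sketched below. This is a hypothetical illustration of the bookkeeping; the record layout, field names, and sample data are assumptions, not the paper's actual data.

```python
# Hypothetical sketch of aggregating the judges' grades described above:
# each judged pair carries 1-5 scores for form and relevance, plus a
# preference ("discourse", "no_discourse", or "none").
from collections import Counter
from statistics import mean

def summarize(judgments):
    """judgments: list of dicts with 'form', 'relevance', 'preference'."""
    prefs = Counter(j["preference"] for j in judgments)
    n = len(judgments)
    return {
        "mean_form": mean(j["form"] for j in judgments),
        "mean_relevance": mean(j["relevance"] for j in judgments),
        "preference_pct": {k: 100 * v / n for k, v in prefs.items()},
    }

# Illustrative records, not the paper's data.
sample = [
    {"form": 4, "relevance": 3, "preference": "discourse"},
    {"form": 2, "relevance": 2, "preference": "none"},
]
summary = summarize(sample)
```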
      <Paragraph position="2"> Interaction without discourse received on average 2.7 for form and 2.4 for relevance. Interaction with discourse, on the other hand, received 4.1 for form and 3.7 for relevance. Subjects preferred the discourse influenced question in 73.6% of the cases, expressed no preference in 14.8% of the cases, and preferred interaction without discourse in 11.6% of the cases. Though the discourse influenced question was not preferred universally, this evaluation supports the claim that humans prefer to receive clarifications on the task level and indicates that further exploration in using discourse information in repair, and particularly in interaction, is a promising avenue for future research.</Paragraph>
    </Section>
  </Section>
</Paper>