<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1023">
  <Title>Evaluation Results for the Talk'n'Travel System</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4. EVALUATION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Evaluation Design
</SectionTitle>
      <Paragraph position="0"> The 9 groups funded by the Communicator program (ATT, BBN, CMU, Lucent, MIT, MITRE, SRI, and University of Colorado)  National Institute of Standards and Technology (NIST) in June and July of 2000. A pool of approximately 80 subjects was recruited from around the United States. The only requirements were that the subjects be native speakers of American English and have Internet access. Only wireline or home cordless phones were allowed.</Paragraph>
      <Paragraph position="1"> The subjects were given a set of travel planning scenarios to attempt. There were 7 such prescribed scenarios and 2 open ones, in which the subject was allowed to propose his own task. Prescribed scenarios were given in a tabular format. An example scenario would be a round-trip flight between two cities, departing and returning on given dates, with specific arrival or departure time preferences.</Paragraph>
      <Paragraph position="2"> Each subject called each system once and attempted to work through a single scenario; the design of the experiment attempted to balance the distributions of scenarios and users across the systems.</Paragraph>
      <Paragraph position="3"> Following each scenario attempt, subjects filled out a Web-based questionnaire to determine whether subjects thought they had completed their task, how satisfied they were with using the system, and so forth. The overall form of this evaluation was thus similar to that conducted under the ARISE program (Den Os, et al 1999).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> Table 1 shows the result of these user surveys for Talk'n'Travel.</Paragraph>
    <Paragraph position="1"> The columns represent specific questions on the user survey. The first column represents the user's judgement as to whether or not he completed his task. The remaining columns, labeled Q1-Q5, are Likert scale items, for which a value of 1 signifies complete agreement, and 5 signifies complete disagreement. Lower numbers for these columns are thus better scores. The legend below the table identifies the questions.</Paragraph>
    <Paragraph position="2"> The first row gives the mean value for the measurements over all 78 sessions with Talk'n'Travel. The second row gives the mean value of the same measurements for all 9 systems participating.</Paragraph>
    <Paragraph position="3"> Talk'n'Travel's task completion score of 80.5% was the highest for all 9 participating systems. Its score on question Q5, representing user satisfaction, was the second highest.</Paragraph>
    <Paragraph position="4"> An independent analysis of task completion was also performed by comparing the logs of the session with the scenario given.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Analysis and Discussion
</SectionTitle>
      <Paragraph position="0"> We analyzed the log files of the 29.5% of the sessions that did not result in the completion of the required scenario. Table 4 gives a breakdown of the causes.</Paragraph>
      <Paragraph position="1">  The largest cause (39%) was the inability of the system to recognize a city referred to by the user, simply because that city was absent from the recognizer language model or language understander's lexicon. These cases were generally trivial to fix. The second, and most serious, cause (22%) was recognition errors that the user either did not attempt to repair or did not succeed in repairing. Dates proved troublesome in this regard, in which one date would be misrecognized for another, e.g. &amp;quot;October twenty third&amp;quot; for &amp;quot;October twenty first Another class of errors were caused by the user, in that he either gave the system different information than was prescribed by the scenario, or failed to supply the information he was supposed to. A handful of sessions failed because of additional causes, including system crashes and backend failure.</Paragraph>
      <Paragraph position="2"> Both time to completion and semantic error rate were affected by scenarios that failed because because of a missing city. In such scenarios, users would frequently repeat themselves many times in a vain attempt to be understood, thus increasing total utterance count and utterance error.</Paragraph>
      <Paragraph position="3">  An interesting result is that task success did not depend too strongly on word error rate. Even successful scenarios had an average WER of 18%, while failed scenarios had average WER of only 22%.</Paragraph>
      <Paragraph position="4"> A key issue in this experiment was whether users would actually interact with the system conversationally, or would respond only to directive prompts. For the first three sessions, we experimented with a highly general open prompt (&amp;quot;How can I help you?'), but quickly found that it tended to elicit overly general and uninformative responses (e.g. &amp;quot;I want to plan a trip&amp;quot;). We therefore switched to the more purposeful &amp;quot;What trip would you like to take?&amp;quot; for the remainder of the evaluation. Fully 70% of the time, users replied informatively to this prompt, supplying utterances &amp;quot;I would like an American flight from Miami to Sydney&amp;quot; that moved the dialogue forward.</Paragraph>
      <Paragraph position="5"> In spite of the generally high rate of success with open prompts, there was a pronounced reluctance by some users to take the initiative, leading them to not state all the constraints they had in mind. Examples included requirements on airline or arrival time. In fully 20% of all sessions, users refused multiple flights in a row, holding out for one that met a particular unstated requirement. The user could have stated this requirement explicitly, but chose not to, perhaps underestimating what the system could do. This had the effect of lengthening total interaction time with the system.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Possible Improvements
</SectionTitle>
      <Paragraph position="0"> Several possible reasons for this behavior on the part of users come to mind, and point the way to future improvements. The synthesized speech was fairly robotic in quality, which naturally tended to make the system sound less capable. The prompts themselves were not sufficiently variable, and were often repeated verbatim when a reprompt was necessary. Finally, the system's dialogue strategy needs be modified to detect when more initiative is needed from the user, and cajole him with open prompts accordingly.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>