<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2017">
  <Title>DATA COLLECTION AND ANALYSIS IN THE AIR TRAVEL PLANNING DOMAIN</Title>
  <Section position="5" start_page="121" end_page="123" type="evalu">
    <SectionTitle>
RESULTS AND ANALYSIS
</SectionTitle>
    <Paragraph position="0"> The recorded data was first transcribed and verified. Then, various phenomenon that might characterize differences between the styles and conditions examined were counted: number of words, new vocabulary items (items not seen in any previous data), and number of &amp;quot;um&amp;quot;s and other pause fillers. For the human-machine interaction, we also analyzed grammatical false starts (&amp;quot;show me the how many fares are under $200&amp;quot;) and speech false starts (&amp;quot;sh- show me only the ones under $200&amp;quot;).</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
Human-Human Data
</SectionTitle>
      <Paragraph position="0"> Twelve hours of data were recorded and transcribed. Of them, 8 hours were verified and analyzed for various characteristics including those in the table below. Note that &amp;quot;naive&amp;quot; user refers to the traveler in the traveler to travel agent conversations and &amp;quot;expert&amp;quot; user refers to the more constrained speech of the travel agent to the airline agent:</Paragraph>
    </Section>
    <Section position="2" start_page="121" end_page="122" type="sub_section">
      <SectionTitle>
User            # Dialogues   # Words   Vocab   # "um"   % um
</SectionTitle>
      <Paragraph position="0"> naive 48 9,315 1,076 501 5.4 expert 1 0 737 230 21 2.8 Experience is a major factor in dialogue efficiency. Compare the 194 words per dialogue for &amp;quot;naive&amp;quot; users to the 74 words per dialogue for the experts. The vocabulary size also changes significantly between types of user, though this is more difficult to assess given the smaller data set. However, our intuitions, based on looking at these data, is that the vocabulary is substantially more restricted for the agent-agent dialogues for two reasons: the travel agent does not try to gain the sympathy of the airline agent (which travelers often do and which opens up the vocabulary tremendously), and both agents know very well what the other can do (which reduces the vocabulary significantly). Humans interacting with machines will not be likely to try to gain the machine's sympathy, but they will use a much larger vocabulary than otherwise if they are unsure about just what capabilities the system has. We have observed this phenomena in our human-machine simulations. Another measure of efficiency is the frequency of pause fillers, which differs in the two conditions by a factor of 2. Expert users are more concise, following a wellpracticed script. Both parties have a clear idea of what each can do for the other and both want an efficient, brief conversation. Pause fillers occur in these conversations primarily when the conversation is focused on new or unknown material such as a client's seat number or an unusual regulation. In the human-human data, when the traveler is unsure of the capabilities of the the agent, the agent takes an active role in guiding the traveler. Interactive conversation, as opposed to one-way communication, increases the efficiency of problem-solving (Oviatt &amp; Cohen, 1988). This will likely be important in designing efficient SLSs for naive, untrained users.</Paragraph>
      <Paragraph position="1"> We classified 30 conversations from the data in terms of general type of query used. Five of the 30 conversations were database query-oriented; most of the observed were not strictly database queries, but, rather, expressed constraints related to the problem to be solved. Four of the five database style  conversations are from information-only calls, where no booking was made. Information calls from the human-human transcripts usually don't involve all pieces of information necessary for booking a trip. In many cases the traveler merely wants airfare for a tdp from X to Y on day Z. Specific flight information and seating arrangements are left for later.</Paragraph>
      <Paragraph position="2"> In assessing the design of initial vocabulary, we took 10 dialogues, filled out the items syntactically and semantically, and added a list of function words we had for other purposes. The percent of new words observed in each successive dialogue (where those observed are added to the pool) declines substantially as new dialogues are included. It does not, however, appear to dip below about 3% even after 48 dialogues. This is not a surprising result; it only highlights the need for dealing with (detecting, forming speech models, syntactic models and semantic models for) words outside the expected vocabulary.</Paragraph>
    </Section>
    <Section position="3" start_page="122" end_page="122" type="sub_section">
      <SectionTitle>
Human-Machine Data
</SectionTitle>
      <Paragraph position="0"> We ran two air travel planning sessions per subject. There were two separate tasks as described above, crossed with two query styles: database query and &amp;quot;regular&amp;quot; (expressing constraints). Compare the human-machine results to those from the human-human condition (repeated here):</Paragraph>
    </Section>
    <Section position="4" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
User            # Dialogues   # Words   Vocab   # "um"   % um
</SectionTitle>
      <Paragraph position="0"> naive 48 9,315 1,076 501 5.4 expert 10 737 230 21 2.8 human- 86 10,622 505 380 3.6 machine These human-machine results appear to fall in between the naive and expert user human-human results in terms of words per dialogue, vocabulary size, and frequency of pause fillers. We suspect that this relationship between the user categories will hold for speech and grammatical false starts as well. This suggests that expert human-machine users could potentially adapt to a restricted vocabulary and still maintain efficiency. Future SLSs should plan for both the naive and the expert users.  The above table compares the database query (DBQ) with the regular condition, and the first task performed by the subject with the second task (the totals are also shown). The number of &amp;quot;um&amp;quot;s includes a variety of different pause fillers used by the subjects. The false start percentages are calculated by  dividing by the total number of words observed in that session. Each subject had an average of 9 to 12 false starts per session. The number of error messages refers to the number of times subjects were presented with a &amp;quot;can't handle that request&amp;quot; response to an utterance. In the comparison between DBQ and &amp;quot;regular&amp;quot; conditions, the only significant difference is that the &amp;quot;regular&amp;quot; condition has fewer errors than the DBQ. This suggests that the condition may not have been too constraining for the subjects; perhaps nothing that a short training session could not overcome. Differences between the first and second session, however, are larger: subjects in the first session are more verbose than in the second, and correspondingly, the first session has more error messages. These results suggest that pre-session training and user practice of the system might facilitate more efficient interaction with the machine. If one 5-minute session has this strong an effect, it is perhaps not unreasonable to consider short training sessions integrated in initial SLSs.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>