<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1607">
  <Title>Comparing Several Aspects of Human-Computer and Human-Human Dialogues</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Our Data
</SectionTitle>
    <Paragraph position="0"> We have tagged 20 HH dialogues and 40 HC dialogues; the HH dialogues consist of 25,208 words and 1742 turns, and the HC dialogues consist of 8313 words and 957 turns. The HH dialogues are from the SRI ATIS collection (available at www.ai.sri.com/ communic/amex/amex.html) and the CMU Travel data (www.speech.cs.cmu.edu). The HC dialogues are from a subset of the systems participating in the DARPA Communicator data collection which was run in June and July of 2000.1 Subjects were selected by the U.S. National Institute of Standards and Technology (NIST) and each subject was paid to call all of the 9 participating sites with either a xed or open travel itinerary. The xed itineraries were intended to balance ight types (one-way, round-trip, multileg) and destinations (large US airport, small for1Ideally, of course, we would want HH dialogues collected under similar circumstances to the HC dialogues. However, no such data are available and collecting new HH data is not currently a feasible endeavor for us.</Paragraph>
    <Paragraph position="1"> eign airport, etc.).2 We made no attempt to balance the subset of data we analyzed; we merely selected 10 random users. We then identi ed four sites whose systems we expected to di er quite widely (based on researchers' descriptions of their systems), and extracted the 10 users' dialogues for each site. Each dialogue was tagged with three sets of annotations|for dialogue acts, for initiative and for provision of unsolicited information by users (HC only). Sections 2.1, 2.2 and 3.5 describe these in more detail. The passage below is excerpted from one of the human-computer interactions, using the transcribed versions of the user's turns.3 User said: i would like to fly from miami florida to sydney australia on wednesday october eleventh please System said: okay, from Miami to Sydney on Wednesday October 11. Can you provide the approximate departure time or airline? User said: [um] i prefer american airlines and i can leave anytime System said: There are no nonstop flights.</Paragraph>
    <Paragraph position="2"> I have a flight connecting in Los Angeles leaving at 7:15 a.m. a flight connecting in Los Angeles leaving at 5:20 p.m. and a flight connecting in Los Angeles leaving at 7:30 p.m. Do you like any of these? User said: yes how about the morning one System said: American flight 912 connecting to American flight 7362 in Los Angeles has been added to your itinerary. you depart Miami at 7:15 am and arrive in Sydney at 9:40 pm the following day. Where would you like to go next?</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Dialogue Act Tagging
</SectionTitle>
      <Paragraph position="0"> We have done two rounds of tagging. Our starting point was a 20-tag subset of the CSTAR Consortium tags (Consortium, 1999). We found the CSTAR tag set to be the most suitable for our task of the tagging schemes we considered, based on information provided in the survey of dialogue act annotation schemes assembled by the MATE project (Klein et al., 1998). We picked the CSTAR tags because they had been designed for task- null transcripts are reproduced with their native formatting, i.e. all caps, no punctuation, etc. The system side contains punctuation resulting from the generation process.</Paragraph>
      <Paragraph position="1"> oriented dialogues, the tag categories were su ciently clear and simple that we believed we would be able to tag the data reliably and, nally, the categories captured the sorts of distinctions we believed would be relevant. We rejected the DAMSL tag set (Core and Allen, 1997; Core et al., 1999) on the grounds that is was too sophisticated for our purposes, covering many aspects of dialogue structure that were not necessarily relevant for our task such as intentionality, grounding and context tracking. In addition, the interannotator agreement levels reported for this scheme are quite low. Some of the other tag sets we considered were (Carletta et al., 1995; Nakatani et al., 1995; van Vark et al., 1996; Di Eugenio et al., 1998; Jurafsky et al., 1997).</Paragraph>
      <Paragraph position="2"> In collaboration with AT&amp;T, we arrived at a set of changes to our tag set that would make it compatible with their e orts to tag system utterances automatically (Walker and Passonneau, 2001), in the hopes of being able to share results with them more easily. We added a situation/conversation/task distinction to a number of our tags (e.g. give-information split into give-task-info, give-situation-info and give-conversation-info). We also added a not-understand tag and collapsed some original tags into super-categories. Our revised tag set had 26 tags, and two people (one who had also done the rst round of tagging) tagged the same data set. The situation/conversation/task distinction turned out to be extremely di cult for the taggers to make; we believe that revisions to the tagging guidelines could lead to some improvement on this front, but without enumerating the kinds of utterances which fall into each category, this will remain a di cult task.</Paragraph>
      <Paragraph position="3"> We tagged each utterance that contained some speech, i.e. was not composed entirely of non-speech annotation like *pause* or [click], and we split turns4 into utterances using guidelines that had been developed internally for another purpose. Utterances on this de nition were roughly clause-sized units, and possibly fragmentary.5 This meant that there were often multiple dialogue acts (DAs) per turn, and where there were multiple sequential DAs of the same type, we collapsed them under a single tag on the assumption that they were combining to \perform&amp;quot; that DA. We initially split some of the CSTAR tags  ment the dialogues in a separate step.</Paragraph>
      <Paragraph position="4"> into implicit and explicit versions, but found that the implicit cases were so hard to identify that we were not using those tags, and they were dropped from the tag set.</Paragraph>
      <Paragraph position="5"> Tables 1 and 2 show roughly parallel sub-dialogues from the HH and HC data.6 Each turn is tagged with its DA, and the rst expert turn in Table 2 shows multiple DAs within a turn, a give-information followed by an offer.</Paragraph>
      <Paragraph position="6">  Expert: i have an American [give-task-info] Airlines ight departing Seattle at twelve fty ve p.m., arrives Tokyo at three p.m. the next day.</Paragraph>
      <Paragraph position="7"> Is that OK? [o er] User: yes I'll take it [accept] Expert: Will you return to seattle[req-task-info] from tokyo? User: what airport [req-task-info] Expert: Will you return to seattle[req-task-info] from tokyo?  With our rst tag set, our Kappa score for interannotator agreement on these dialogues is 0.90 (with two annotators). Not surprisingly, our Kappa score on the second, more complex tag set (cf. Table 10 for a list of the tags) was lower, 0.71 (0.74 on the HC data and 0.66 on the HH data). Both scores are in line with scores reported in similar tagging tasks (Klein et al., 1998): 0.56 for DAMSL (overall average), 0.83 for Maptask (experienced coders), 0.8-0.84 for Switchboard DAMSL and 0.83 for VerbMobil. The drop in score between our two tag sets emphasizes an issue which we continue to wrestle with{the trade-o between tag set complexity and tagging accuracy. At what point is it more useful to have re6Throughout the paper, we will use expert to refer to either the human or the computer travel agent, system to refer exclusively to the computer travel agent, and user to refer to the travelers.</Paragraph>
      <Paragraph position="8"> liable results from an impoverished tag set than results of questionable value from a sophisticated tag set?</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Initiative Tagging
</SectionTitle>
      <Paragraph position="0"> There is not a clearly agreed upon de nition of initiative in the literature on dialogue analysis (but see e.g., (Chu-Carroll and Brown, 1998; Jordan and Di Eugenio, 1997; Flammia and Zue, 1997)), despite the fact the terms initiative and mixed-initiative are widely used. Intuitively, it seems that control rests with the participant who is moving a conversation ahead at a given point, or selecting new topics for conversation.</Paragraph>
      <Paragraph position="1"> After experimenting with several tagging methods, we concluded that the approach presented in Walker and Whittaker (1990) adopted from (Whittaker and Stenton, 1988) best captured the aspects of the dialogue we were interested in and, as with the DAs, could be tagged reliably on our data.</Paragraph>
      <Paragraph position="2"> Each turn is tagged with which participant has control at the end of that turn, based on the utterance type. Again, we did not tag turns composed entirely of non-speech annotation, and we also excluded conventional openings and closings, following Walker and Whittaker. Below, we list the rules for tagging each utterance type; a prompt is an utterance \which did not express propositional content, such as Yeah, Okay, Uh-huh, . . . .&amp;quot; (Op cit, p. 3) The classi cation refers to the illocutionary force of the item, rather than to its particular syntactic form.</Paragraph>
      <Paragraph position="3"> Assertion: speaker has initiative unless it is a response to a question or command7 Question: speaker has initiative unless it is a response to a question or command Command: speaker has initiative Prompt: hearer has initiative Tables 3 and 4 show the same passages used above, but this time tagged for initiative. To give a sense of how the tagging rules are applied, let us step through the HC example (Table 4). Turn (1) is assigned expert-initiative, because it is an assertion which is not a response to any preceding question or command. Turn (2) is still expertinitiative, because it is an answer to the question Is that OK?. The third turn is a question and expert-initiative, but turn (4) is user-initiative because it is a question that is not a response to the previous question. The system 7Italics show our modi cation to the rule.</Paragraph>
      <Paragraph position="4"> does not address the user's question, but rather repeats its own question, so the nal turn (5) is expert-initiative.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML