<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1015">
  <Title>DATE: A Dialogue Act Tagging Scheme for Evaluation of Spoken Dialogue Systems</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. THE SPEECH-ACT DIMENSION
</SectionTitle>
    <Paragraph position="0"> The SPEECH-ACT dimension characterizes the utterance's communicative goal, and is motivated by the need to distinguish the communicative goal of an utterance from its form. As an example, consider the functional category of a REQUEST for information, found in many tagging schemes that annotate speech-acts [24, 18, 6]. Keeping the functional category of a REQUEST separate from the sentence modality distinction between question and statement makes it possible to capture the functional similarity between question and statement forms of requests, e.g., Can you tell me what time you would like to arrive? versus Please tell me what time you would like to arrive.</Paragraph>
    <Paragraph position="1"> In DATE, the speech-act dimension has ten categories. We use familiar speech-act labels, such as OFFER, REQUEST-INFO, PRESENT-INFO, ACKNOWLEDGMENT, and introduce new ones designed to help us capture generalizations about communicative behavior in this domain, on this task, given the range of system and human behavior we see in the data. One new one, for example, is STATUS-REPORT, whose speech-act function and operational definition are discussed below. Examples of each speech-act type are in Figure 4.</Paragraph>
    <Paragraph position="2">  In this domain, the REQUEST-INFO speech-acts are designed to solicit information about the trip the caller wants to book, such as the destination city (And what city are you flying to?), the desired dates and times of travel (What date would you like to travel on), or information about ground arrangements, such as hotel or car rental (Will you need a hotel in Chicago?).</Paragraph>
    <Paragraph position="3"> The PRESENT-INFO speech-acts also often pertain directly to the domain task of making travel arrangements: the system presents the user with a choice of itinerary (There are several flights from Dallas Fort Worth to Salisbury Maryland which depart between eight in the morning and noon on October fifth. You can fly on American departing at eight in the morning or ten thirty two in the morning, or on US Air departing at ten thirty five in the morning.), as well as a ticket price (Ticket price is 495 dollars), or hotel or car options.</Paragraph>
    <Paragraph position="4"> OFFERS involve requests by the caller for a system action, such as to pick a flight (I need you to tell me whether you would like to take this particular flight) or to confirm a booking (If this itinerary meets your needs, please press one; otherwise, press zero.) They typically occur after the prerequisite travel information has been obtained, and choices have been retrieved from the database.</Paragraph>
    <Paragraph position="5"> The ACKNOWLEDGMENT speech act characterizes system utterances that follow a caller's acceptance of an OFFER, e.g. I will book this leg or I am making the reservation.</Paragraph>
    <Paragraph position="6"> The STATUS-REPORT speech-act is used to inform the user about the status of the part of the domain task pertaining to the database retrieval, and can include apologies, mollification, requests to be patient, and so on. The function of these acts is to let the user know what is happening with the database lookup, whether there are problems with it, and what types of problems. While the form of these acts is typically a statement, their communicative function differs from that of typical presentations of information; they function to keep the user apprised of progress on aspects of the task that the user has no direct information about, e.g. Accessing the database; this might take a few seconds. There is also a politeness function to utterances like Sorry this is taking so long, please hold., and they often provide the user with error diagnostics: The date you specified is too far in advance.; or Please be aware that the return date must be later than the departure date.; or No records satisfy your request.; or There don't seem to be any flights from Boston.</Paragraph>
    <Paragraph position="7"> The speech-act inventory also includes two types of speech acts whose function is to confirm information that has already been provided by the caller. In order to identify and confirm the parameters of the trip, systems may ask the caller direct questions, as in SYS3 and SYS4 in Figure 2. These EXPLICIT-CONFIRM speech acts are sometimes triggered by the system's belief that a misunderstanding may have occurred. A typical example is Are you traveling to Dallas?. An alternative form of the same EXPLICIT-CONFIRM speech-act type asserts the information the system has understood and asks for confirmation in an immediately following question: I have you arriving in Dallas. Is that correct? In both cases, the caller is intended to provide a response.</Paragraph>
    <Paragraph position="8"> A less intrusive form of confirmation, which we tag as IMPLICIT-CONFIRM, typically presents the user with the system's understanding of one travel parameter immediately before asking about the next parameter. Depending on the site, implicit information can either precede the new request for information, as in Flying to Tokyo. What day are you leaving?, or can occur within the same utterance, as in What day do you want to leave London? More rarely, an implicit confirmation is followed by PRESENT-INFO: a flight on Monday September 25. Delta has a flight departing Atlanta at nine thirty. One question about the use of implicit confirmation strategy is whether the caller realizes they can correct the system when necessary [10]. Although IMPLICIT-CONFIRMS typically occur as part of a successful sequence of extracting trip information from the caller, they can also occur in situations where the system is having trouble understanding the caller. In this case, the system may attempt to instruct the user on what it is doing to remediate the problem in between an IMPLICIT-CONFIRM and a REQUEST-INFO: So far, I have you going from Tokyo. I am trying to assemble enough information to pick a flight. Right now I need you to tell me your destination.</Paragraph>
    <Paragraph position="9"> We have observed that INSTRUCTIONS are a speech-act type that distinguishes these human-computer travel planning dialogues from corresponding human-human travel planning dialogues. Instructions sometimes take the form of a statement or an imperative, and are characterized by their functional goal of clarifying the system's own actions, correcting the user's expectations, or changing the user's future manner of interacting with the system. Dialogue systems are less able to diagnose a communication problem than human travel agents, and callers are less familiar with the capabilities of such systems. As noted above, some systems resort to explicit instructions about what the system is doing or is able to do, or about what the user should try in order to assist the system: Try asking for flights between two major cities; or You can cancel the San Antonio, Texas, to Tampa, Florida flight request or change it.</Paragraph>
    <Paragraph position="10"> To change it, you can simply give new information such as a new departure time. Note that INSTRUCTIONS, unlike the preceding dialogue act types, do not directly involve a domain task.</Paragraph>
    <Paragraph position="11"> Like the INSTRUCTION speech-acts, APOLOGIES do not address a domain task. They typically occur when the system encounters problems, for example, in understanding the caller (I'm sorry, I'm having trouble understanding you), in accessing the database (Something is wrong with the flight retrieval), or with the connection (Sorry, we seem to have a bad connection. Can you please call me back later?).</Paragraph>
    <Paragraph position="12"> The OPENING/CLOSING speech act category characterizes utterances that open and close the dialogue, such as greetings or goodbyes [26]. Most of the dialogue systems open the interactions with some sort of greeting--Hello, welcome to our Communicator flight travel system, and end with a sign-off or salutation--Thank you very much for calling. This session is now over. We distinguish these utterances from other dialogue acts, but we do not tag openings separately from closings because they have a similar function and can be distinguished by their position in the discourse. We also include in this category utterances in which the systems survey the caller as to whether s/he got the information s/he needed or was happy with the system.</Paragraph>
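    <Paragraph> As an illustration only, the ten speech-act categories named in this section could be held in a small enumeration. This is a sketch under stated assumptions: the class name SpeechAct and the Python member names are hypothetical and are not part of the DATE scheme; only the label strings come from the text above.

```python
from enum import Enum


class SpeechAct(Enum):
    """The ten speech-act labels named in the text (class and member names are hypothetical)."""
    REQUEST_INFO = "REQUEST-INFO"
    PRESENT_INFO = "PRESENT-INFO"
    OFFER = "OFFER"
    ACKNOWLEDGMENT = "ACKNOWLEDGMENT"
    STATUS_REPORT = "STATUS-REPORT"
    EXPLICIT_CONFIRM = "EXPLICIT-CONFIRM"
    IMPLICIT_CONFIRM = "IMPLICIT-CONFIRM"
    INSTRUCTION = "INSTRUCTION"
    APOLOGY = "APOLOGY"
    OPENING_CLOSING = "OPENING/CLOSING"


# Look up a category from the label string used in the tagged dialogues.
act = SpeechAct("STATUS-REPORT")
print(act.name, len(SpeechAct))
```

    Keeping the label strings as enumeration values lets a tagger reject misspelled labels early, since SpeechAct("OFFERS") would raise a ValueError.</Paragraph>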
  </Section>
  <Section position="5" start_page="0" end_page="2" type="metho">
    <SectionTitle>
4. THE TASK-SUBTASK DIMENSION
</SectionTitle>
    <Paragraph position="0"> The TASK-SUBTASK dimension refers to a task model of the domain task that the system is designed to support and captures distinctions among dialogue acts that reflect the task structure.</Paragraph>
    <Paragraph position="1"> Our domain is air travel reservations, thus the main communicative task is to specify information pertaining to an air travel reservation, such as the destination city. Once a flight has been booked, ancillary tasks such as arranging for lodging or a rental car become relevant. The fundamental motivation for the TASK-SUBTASK dimension in the DATE scheme is to derive metrics related to subtasks in order to quantify how much effort a system expends on particular subtasks. This dimension distinguishes among 13 subtasks, some of which can also be grouped at a level below the top level task. The subtasks and examples are in Figure 5. The TOP-LEVEL-TRIP task describes the task which contains as its subtasks the ORIGIN, DESTINATION, DATE, TIME, AIRLINE, TRIP-TYPE, RETRIEVAL and ITINERARY tasks. The GROUND task includes both the HOTEL and CAR subtasks. Typically each COMMUNICATOR dialogue system acts as though it utilizes a task model, in that it has a particular sequence in which it will ask for task information if the user doesn't take the initiative to volunteer this information. For example, most systems ask first for the origin and destination cities, then for the date and time. Some systems ask about airline preference and others leave it to the caller to volunteer this information. A typical sequence of tasks for the flight planning portion of the dialogue is illustrated in Figure 6. As Figure 6 illustrates, any subtask can involve multiple speech acts. For example, the DATE subtask can consist of acts requesting, or implicitly or explicitly confirming the date. A similar example is provided by the subtasks of CAR (rental) and HOTEL, which include dialogue acts requesting, confirming or acknowledging arrangements to rent a car or book a hotel room on the same trip.</Paragraph>
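    <Paragraph> The grouping of subtasks described above can be sketched as a small lookup structure. This is a minimal sketch, not the representation used in the actual system: the names TASK_HIERARCHY, parent_of and effort_by_group are hypothetical, only the twelve task labels named in the text are listed (the full set of 13 is in Figure 5), and measuring effort as a character count per act is an assumption made for illustration.

```python
# Grouping of subtasks as named in the text; the full list of 13 is in Figure 5.
TASK_HIERARCHY = {
    "TOP-LEVEL-TRIP": ["ORIGIN", "DESTINATION", "DATE", "TIME", "AIRLINE",
                       "TRIP-TYPE", "RETRIEVAL", "ITINERARY"],
    "GROUND": ["HOTEL", "CAR"],
}


def parent_of(subtask):
    """Return the grouping task of a subtask, or None if it has no listed parent."""
    for parent, children in TASK_HIERARCHY.items():
        if subtask in children:
            return parent
    return None


def effort_by_group(labelled_acts):
    """Sum an effort measure (here: characters) per grouping task.

    labelled_acts is an iterable of (subtask, n_chars) pairs.
    """
    totals = {}
    for subtask, n_chars in labelled_acts:
        group = parent_of(subtask) or subtask
        totals[group] = totals.get(group, 0) + n_chars
    return totals


# Hypothetical labelled acts, for illustration only.
example = [("DATE", 10), ("ORIGIN", 5), ("CAR", 7)]
print(effort_by_group(example))
```
    </Paragraph>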
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
BD
</SectionTitle>
    <Paragraph position="0"> This dimension is used as an elaboration of each speech-act type in other tagging schemes [24].</Paragraph>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
BE
</SectionTitle>
    <Paragraph position="0"> It is tempting to also consider this dimension as a means of inferring discourse structure on the basis of utterance level labels, since it is widely believed that models of task structure drive the behavior of dialogue systems [23, 3, 22], and the relationship between discourse structure and task structure has been a core topic of research since Grosz's thesis [15]. However, we leave the inference of discourse structure as a topic for future work because the multifunctionality of many utterances suggests that the correspondence between task structure and dialogue structure may not be as straightforward as has been proposed in Grosz's work [30].</Paragraph>
    <Paragraph position="1">  SYS OK, got them. I have 13 flights. The first flight is on American at six fifty nine eh M, arriving at ten forty five PM, with a connection in Chicago.</Paragraph>
    <Paragraph position="2"> . Is that OK?  There are also differences in how each site's dialogue strategy reflects its conceptualization of the travel planning task. For example, some systems ask the user explicitly for their airline preferences whereas others do not (the systems illustrated in Figures 1 and 6 do not, whereas the one in Figure 2 does). Another difference is whether the system asks the user explicitly whether s/he wants a round-trip ticket. Some systems ask this information early on, and search for both the outbound and the return flights at the same time. Other systems do not separately model round-trip and multi-leg trips. Instead they ask the user for information leg by leg, and after requesting the user to select an itinerary for one leg of the flight, they ask whether the user has an additional destination. A final difference was that, in the June 2000 data collection, some systems such as the one illustrated in Figure 1 included the ground arrangements subtasks, and others did not.</Paragraph>
  </Section>
  <Section position="8" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5. IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> Our focus in this work is on labelling the system side of the dialogue; our goal was to develop a fully automatic 100% correct dialogue parser for the limited range of utterances produced by the 9 COMMUNICATOR systems. While we believe that it would be useful to be able to assign dialogue acts to both sides of the conversation, we expect that to require hand-labelling [1]. We also believe that in many cases the system behaviors are highly correlated with the user behaviors of interest; for example, when a user has to repeat himself because of a misunderstanding, the system has probably prompted the user multiple times for the same item of information and has probably apologized for doing so. Thus this aspect of the dialogue would also be likely to be captured by the APOLOGY dialogue act and by counts of effort expended on the particular subtask.</Paragraph>
    <Paragraph position="1"> We implemented a pattern matcher that labels the system side of each dialogue. An utterance or utterance sequence is identified automatically from a database of patterns that correspond to the dialogue act classification we arrived at in cooperation with the site developers. Where it simplifies the structure of the dialogue parser, we assign two adjacent utterances that are directed at the same goal the same DATE label, thus ignoring the utterance level segmentation, but we count the number of characters used in each act. Since some utterances are generated via recursive or iterative routines, some patterns involve wildcards.</Paragraph>
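    <Paragraph> A pattern matcher of this general kind can be sketched with regular expressions. This is a minimal sketch under stated assumptions, not the parser described above: the pattern list, the names PATTERNS, label_utterance and label_dialogue, and the rule that adjacent utterances with the same label are merged are all hypothetical simplifications. The example utterances are taken from the speech-act descriptions earlier in the paper, and only the speech-act dimension of the label is assigned.

```python
import re

# Hypothetical pattern database: (compiled pattern, speech-act label).
# ".+" plays the role of a wildcard for generated material such as city names.
PATTERNS = [
    (re.compile(r"And what city are you flying to\?"), "REQUEST-INFO"),
    (re.compile(r"Are you traveling to .+\?"), "EXPLICIT-CONFIRM"),
    (re.compile(r"Ticket price is .+ dollars\.?"), "PRESENT-INFO"),
    (re.compile(r"I'm sorry, I'm having trouble understanding you\.?"), "APOLOGY"),
]


def label_utterance(utterance):
    """Return the label of the first pattern matching the whole utterance, else None."""
    text = utterance.strip()
    for pattern, act in PATTERNS:
        if pattern.fullmatch(text):
            return act
    return None


def label_dialogue(system_utterances):
    """Label each utterance and record the number of characters per act.

    Assumption: adjacent utterances that receive the same label are merged
    into one act, which ignores the utterance-level segmentation.
    """
    acts = []
    for utterance in system_utterances:
        text = utterance.strip()
        act = label_utterance(text)
        if acts and act is not None and acts[-1][0] == act:
            acts[-1] = (act, acts[-1][1] + len(text))
        else:
            acts.append((act, len(text)))
    return acts


print(label_dialogue(["Are you traveling to Dallas?", "Ticket price is 495 dollars"]))
```

    Utterances that match no pattern are labelled None rather than guessed at, so gaps in the pattern database remain visible.</Paragraph>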
    <Paragraph position="2"> The current implementation labels the utterances with tags that are independent of any particular markup-language or representation format. We have written a transducer that takes the labelled dialogues and produces HTML output for the purpose of visualizing the distribution of dialogue acts and meta-categories in the dialogues. An additional summarizer program is used to produce a summary of the percentages and counts of each dialogue act as well as counts of meta-level groupings of the acts related to the different dimensions of the tagging scheme. We intend to use our current representation to generate ATLAS compliant representations [4].</Paragraph>
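    <Paragraph> The kind of summary described above, counts and percentages per dialogue act plus counts for meta-level groupings, can be sketched as follows. The function names summarize and summarize_groups and the CONFIRM grouping are hypothetical and chosen only for illustration; the actual summarizer program is not specified in this paper.

```python
from collections import Counter


def summarize(labels):
    """Return {label: (count, percentage)} for an iterable of dialogue act labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}


def summarize_groups(labels, group_of):
    """Same summary after mapping each label to a meta-level group.

    Labels absent from group_of are kept as their own group.
    """
    return summarize(group_of.get(label, label) for label in labels)


# Hypothetical labelled dialogue and grouping, for illustration only.
labels = ["EXPLICIT-CONFIRM", "IMPLICIT-CONFIRM", "APOLOGY", "OFFER"]
confirm_group = {"EXPLICIT-CONFIRM": "CONFIRM", "IMPLICIT-CONFIRM": "CONFIRM"}
print(summarize(labels))
print(summarize_groups(labels, confirm_group))
```
    </Paragraph>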
  </Section>
  <Section position="9" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6. RESULTS
</SectionTitle>
    <Paragraph position="0"> Our primary goal was to achieve a better understanding of the qualitative aspects of each system's dialogue behavior. We can quantify the extent to which the dialogue act metrics have the potential to improve our understanding by applying the PARADISE framework to develop a model of user satisfaction and then examining the extent to which the dialogue act metrics improve these models [31]. In other work, we show that given the standard metrics collected for the COMMUNICATOR dialogue systems, the best model accounts for 38% of the variance in user satisfaction [28].</Paragraph>
    <Paragraph position="1"> When we retrain these models with the dialogue act metrics extracted by our dialogue parser, we find that many metrics are significant predictors of user satisfaction, and that the model fit increases from 38% to 45%. When we examine which dialogue metrics are significant, we find that they include several types of meta-dialogue such as explicit and implicit confirmations of what the user said, and acknowledgments that the system is going to go ahead and do the action that the user has requested. Significant negative predictors include apologies. One interpretation of many of the significant predictors is that they are landmarks in the dialogue for achievement of particular subtasks. However, the predictors based on the core metrics included a ternary task completion metric that captures succinctly whether any task was achieved or not, and whether the exact task that the user was attempting to accomplish was achieved.</Paragraph>
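    <Paragraph> To make the notion of model fit concrete, the comparison can be sketched as two nested linear regressions, reading fit as the proportion of variance in user satisfaction accounted for by the model. This is a sketch under stated assumptions: PARADISE models are multiple linear regressions, but the data below are synthetic and invented purely for illustration, the predictors (a task completion flag and an apology count) are stand-ins for the real metric sets, and the numbers printed do not reproduce the 38% and 45% reported here. All function names are hypothetical.

```python
def fit_linear(rows, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


def predict(coefs, rows):
    """Apply fitted coefficients (intercept first) to each row of predictors."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]


def r_squared(y, y_hat):
    """Proportion of the variance in y accounted for by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot


# Synthetic data: one row per dialogue (not taken from the COMMUNICATOR corpus).
completed = [1, 0, 1, 1, 0, 1, 0, 1]      # stand-in for a core metric
apologies = [0, 3, 1, 0, 2, 2, 4, 1]      # stand-in for a dialogue act metric
satisfaction = [5, 2, 4, 5, 3, 3, 1, 4]

core_rows = [[c] for c in completed]
full_rows = [[c, a] for c, a in zip(completed, apologies)]

r2_core = r_squared(satisfaction, predict(fit_linear(core_rows, satisfaction), core_rows))
r2_full = r_squared(satisfaction, predict(fit_linear(full_rows, satisfaction), full_rows))
print(round(r2_core, 3), round(r2_full, 3))
```

    Because the second model contains every predictor of the first, its fit on the training data cannot be lower; whether the added predictors are significant is a separate question that this sketch does not test.</Paragraph>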
    <Paragraph position="2"> A plausible explanation for the increase in the model fits is that user satisfaction is sensitive to exactly how far through the task the user got, even when the user did not in fact complete the task. The other significant dialogue metrics are plausibly interpreted as acts important for error minimization. As with the task-related dialogue metrics, there were already metrics related to ASR performance in the core set of metrics. However, several of the important metrics count explicit confirmations, one of the desired date of travel, and the other of all information before searching the database, as in utterances SYS3 and SYS4 in Figure 2.</Paragraph>
  </Section>
  <Section position="10" start_page="2" end_page="2" type="metho">
    <SectionTitle>
7. DISCUSSION
</SectionTitle>
    <Paragraph position="0"> This paper has presented DATE, a dialogue act tagging scheme developed explicitly for the purpose of comparing and evaluating spoken dialogue systems. We have argued that such a scheme needs to make three important distinctions in system dialogue behaviors and we are investigating the degree to which any given type of dialogue act belongs in a single category or in multiple categories.</Paragraph>
    <Paragraph position="1"> We also propose that a tagging scheme be viewed as a partial model of a natural class of dialogues. It is a model to the degree that it represents claims about what features of the dialogue are important and are sufficiently well understood to be operationally defined. It is partial in that the distributions of the features and their relationship to one another, i.e., their possible manifestations in dialogues within the class, are an empirical question.</Paragraph>
    <Paragraph position="2"> The view that a dialogue tagging scheme is a partial model of a class of dialogues implies that a pre-existing tagging scheme can be re-used on a different research project, or by different researchers, only to the degree that it models the same natural class with respect to similar research questions, is sufficient for expressing observations about what actually occurs within the current dialogues of interest, and is sufficiently well-defined that high reliability within and across research sites can be achieved. Thus, our need to modify existing schemes was motivated precisely to the degree that existing schemes fall short of these requirements. Other researchers who began with the goal of re-utilizing existing tagging schemes have also found it necessary to modify these schemes for their research purposes [11, 18, 7].</Paragraph>
    <Paragraph position="3"> The most substantial difference between our dialogue act tagging scheme and others that have been proposed is in our expansion of the two-way distinction between dialogue tout simple vs.</Paragraph>
    <Paragraph position="4"> meta-dialogue, into a three-way distinction among the immediate dialogue goals, meta-dialogue utterances, and meta-situation utterances. Depending on further investigation, we might decide these three dimensions have equal status within the overall tagging scheme (or within the overall dialogue-modeling enterprise), or that there are two types of meta-dialogue: utterances devoted to maintaining the channel, versus utterances devoted to establishing/maintaining the frame. Further, in accord with our view that a tagging scheme is a partial model, and that it is therefore necessarily evolving as our understanding of dialogue evolves, we also believe that our formulation of any one dimension, such as the speech-act dimension, will necessarily differ from other schemes that model a speech-act dimension.</Paragraph>
    <Paragraph position="5"> Furthermore, because human-computer dialogue is at an early stage of development, any such tagging scheme must be a moving target, i.e., the more progress is made, the more likely it is we may need to modify along the way the exact features used in an annotation scheme to characterize what is going on. In particular, as system capabilities become more advanced in the travel domain, it will probably be necessary to elaborate the task model to capture different aspects of the system's problem solving activities. For example, our task model does not currently distinguish between different aspects of information about an itinerary, e.g. between presentation of price information and presentation of schedule information.</Paragraph>
    <Paragraph position="6"> We also expect that some domain-independent modifications are likely to be necessary as dialogue systems become more successful, for example to address the dimension of &quot;face&quot;, i.e. the positive politeness that a system shows to the user [5]. As an example, consider the difference between the interpretation of the utterance, There are no flights from Boston to Boston, when produced by a system vs. when produced by a human travel agent. If a human said this, it would be interpretable by the recipient as an insult to their intelligence. However, when produced by a system, it functions to identify the source of the misunderstanding. Another distinction that we don't currently make which might be useful is between the initial presentation of an item of information and its re-presentation in a summary. Summaries arguably have a different communicative function [29, 7]. Another aspect of function our representation doesn't capture is rhetorical relations between speech acts [20, 21].</Paragraph>
    <Paragraph position="7"> While we developed DATE to answer particular research questions in the COMMUNICATOR dialogues, there are likely to be aspects of DATE that can be applied elsewhere. The task dimension tagset reflects our model of the domain task. The utility of a task model may be general across domains and for this particular domain, the categories we employ are presumably typical of travel tasks and so, may be relatively portable.</Paragraph>
    <Paragraph position="8"> The speech act dimension includes categories typically found in other classifications of speech acts, such as REQUEST-INFO, OFFER, and PRESENT-INFO. We distinguish information presented to the user about the task, PRESENT-INFO, from information provided to change the user's behavior, INSTRUCTION, and from information presented in explanation or apology for an apparent interruption in the dialogue, STATUS-REPORT. The latter has some of the flavor of APOLOGIES, which have an inter-personal function, along with OPENINGS/CLOSINGS. We group GREETINGS and SIGN-OFFS into the single category of OPENINGS/CLOSINGS on the assumption that politeness forms make less contribution to perceived system success than the system's ability to carry out the task, to correct misunderstandings, and to coach the user.</Paragraph>
    <Paragraph position="9"> Our third dimension, conversational-domain, adds a new category, ABOUT-SITUATION-FRAME, to the more familiar distinction between utterances directed at a task goal vs. utterances directed at maintaining the communication. This distinction supports the separate classification of utterances directed at managing the user's assumptions about how to interact with the system on the air travel task. As we mention above, the ABOUT-SITUATION-FRAME utterances that we find in the human-computer dialogues typically did not occur in human-human air travel dialogues. In addition, as we note above, one obvious difference in the dialogue strategies implemented at different sites had to do with whether these utterances occurred upfront, within the dialogue, or both.</Paragraph>
    <Paragraph position="10"> In order to demonstrate the utility of dialogue act tags as metrics for spoken dialogue systems, we show that the use of these metrics in the application of PARADISE [31] improves our model of user satisfaction by an absolute 7%, from 38% to 45%. This is a large increase, and the fit of these models is very good for models of human behavior. We believe that we have only begun to discover the ways in which the output of the dialogue parser can be used. In future work we will examine whether other representations derived from the metrics we have applied, such as sequences or structural relations between various types of acts, might improve our performance model further. We are also collaborating with other members of the COMMUNICATOR community who are investigating the use of dialogue act and initiative tagging schemes for the purpose of comparing human-human to human-computer dialogues [1].</Paragraph>
  </Section>
  <Section position="11" start_page="2" end_page="2" type="metho">
    <SectionTitle>
8. ACKNOWLEDGMENTS
</SectionTitle>
    <Paragraph position="0"> This work was supported under DARPA GRANT MDA 972 99 3 0003 to AT&amp;T Labs Research. Thanks to Payal Prabhu and Sungbok Lee for their assistance with the implementation of the dialogue parser. We also appreciate the contribution of J. Aberdeen, E. Bratt, S. Narayanan, K. Papineni, B. Pellom, J. Polifroni, A.</Paragraph>
    <Paragraph position="1"> Potamianos, A. Rudnicky, S. Seneff, and D. Stallard who helped us understand how the DATE classification scheme applied to their COMMUNICATOR systems' dialogues.</Paragraph>
  </Section>
</Paper>