<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0221"> <Title>Training a Dialogue Act Tagger For Human-Human and Human-Computer Travel Dialogues</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Corpus, Data, Methods </SectionTitle> <Paragraph position="0"> Our experiments apply the rule learning program RIPPER (Cohen, 1996) to train a DATE dialogue act tagger for the utterances of the information provider in HC and HH travel planning dialogues. Like other automatic classi ers, RIPPER takes as input the names of a set of classes to be learned, the names and ranges of values of a xed set of features, and training data specifying the class and feature values for each example in a training set. Its output is a classi cation model for predicting the class of future examples. In RIPPER, the classi cation model is learned using greedy search guided by an information gain metric, and is expressed as an ordered set of if-then rules. Although any of several automatic classi ers could be used to train an automatic DATE tagger, RIPPER supports textual features, which are important for this problem, and outputs if-then rules that are easy to understand and which make clear which features are useful to the DATE tagger when classifying utterances.</Paragraph> <Paragraph position="1"> To apply RIPPER, the utterances in the corpus must be encoded in terms of a set of classes (the output classi cation) and a set of input features that are used as predictors for the classes. Below we describe the corpora, the classes derived from the DATE tagging scheme, the methods used for tagging the corpora using the DATE scheme, and the features that are extracted from the dialogue in which each utterance occurs.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Travel Planning Corpora </SectionTitle> <Paragraph position="0"> Our experiments utilize both HC and HH dialogues in the travel planning domain. The DARPA Communicator HC dialogue corpus consists of the June-2000 corpus and the October-2001 corpus. The June-2000 corpus contains 663 experimental dialogues collected during a three week period in June of 2000 for conversations between human users and 9 different Communicator travel planning systems.</Paragraph> <Paragraph position="1"> The October-2001 corpus contains 1252 experimental dialogues collected between April and October of 2001 for conversations between human users and 8 different COMMUNICATOR travel planning systems.</Paragraph> <Paragraph position="2"> The dialogues were quite complex, ranging between simple one way trips requiring no ground arrangements to multileg trips to international or domestic destinations that required car and hotel arrangements. The dialogues typically lasted between 2 and 10 minutes. There was a great deal of variation in the dialogue strategies implemented by the different systems, both between the sites during each collection as well as within a single site across the different collections, from 2000 to 2001. There were a total of 22930 system utterances in the June-2000 corpus and a total of 69766 utterances in the October-2001 corpus. Each dialogue interaction was logged by each system using a shared log le standard. We were primarily interested in three logged features: (1) the text of each system utterance; (2) what the recognizer understood for each user utterance; and (3) the transcription that each site provided for what the user actually said. 
We describe below in Section 2.4 how we used these three log-file features to derive the features used to train the DATE tagger.</Paragraph> <Paragraph position="3"> The HH dialogue corpus consists of the CMU-corpus (Eskenazi et al., 1999). Dialogues in the travel planning domain were collected by the Communicator group at Carnegie Mellon University (CMU), who arranged with the onsite travel agency People's Travel to record calls from a number of volunteer subjects who called the human travel agent to plan intended trips. These calls were then transcribed, and the recordings and the transcriptions were made available to members of the Communicator community. Labellers at our site subsequently segmented the travel agent side of the conversation into utterances, where each utterance realized a single dialogue act. We used this utterance-level segmentation to define the unit for tagging in the experiments described below. The CMU-corpus consists of 38 dialogues with a total of 1062 travel agent utterances.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Class Assignment </SectionTitle> <Paragraph position="0"> The classes used to train the DATE tagger are derived directly from the DATE tagging scheme (Walker et al., 2001c). DATE classifies each utterance along three cross-cutting orthogonal dimensions of utterance classification: (1) a SPEECH ACT dimension; (2) a CONVERSATIONAL-DOMAIN dimension; and (3) a TASK-SUBTASK dimension.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <Paragraph position="0"> The SPEECH ACT and CONVERSATIONAL-DOMAIN dimensions should be general across domains, while the TASK-SUBTASK dimension involves a task model that is not only domain specific, but could vary from system to system because some systems might make finer-grained subtask distinctions.</Paragraph> <Paragraph position="1"> The SPEECH ACT dimension captures distinctions between distinct communicative goals such as requesting information (REQUEST-INFO), presenting information (PRESENT-INFO), and making offers (OFFER) to act on behalf of the caller. The types of speech acts are specified and illustrated in Figure 3.</Paragraph> <Paragraph position="2"> The CONVERSATIONAL-DOMAIN dimension distinguishes between talk devoted to the task of booking airline reservations (about-task) and talk devoted to maintaining the verbal channel of communication (about-communication) (Allen and Core, 1997). DATE adds a third domain, called about-situation-frame, to distinguish utterances that provide information about the interactional context, e.g., "Try saying a short sentence" or "I know about 500 international destinations".</Paragraph> <Paragraph position="3"> The TASK-SUBTASK dimension focuses on specifying which subtask of the travel reservation task the utterance contributes to. Some examples are given in Figure 4. This dimension distinguishes among 28 subtasks, some of which can also be grouped at a level below the top-level task. The TOP-LEVEL-TRIP task describes the task that contains as its subtasks the ORIGIN, DESTINATION, DATE, TIME, AIRLINE, TRIP-TYPE, RETRIEVAL and ITINERARY tasks. The GROUND task includes both the HOTEL and CAR subtasks.
The HOTEL task includes both the HOTEL-NAME and HOTEL-LOCATION subtasks.</Paragraph> <Paragraph position="4"> Some utterances, especially about-situation-frame utterances such as instructions and apologies, are not specific to any task. For example, apologies made by the system about a misunderstanding can be made within any subtask. We give these utterances a meta value in the task dimension.</Paragraph> <Paragraph position="5"> It is possible to achieve very specific labelling of system utterances by applying all three dimensions simultaneously. For example, one set of output classes for the DATE tagger consists of the combination of all three classes, so that an utterance such as "I found three flights that match your request" is classified as ABOUT-TASK:PRESENT-INFO:FLIGHT.2 However, the DATE scheme also makes it possible to train and test a DATE tagger for just the SPEECH-ACT dimension or just the TASK dimension. Figure 5 shows utterances from a June-2000 dialogue fragment that are classified along each of the three DATE dimensions.</Paragraph> <Paragraph position="6"> Tagging utterances along the SPEECH ACT dimension provides the most general tagging. This level of categorization is task-independent and possibly situation-independent, i.e., potentially carrying over from HC to HH dialogues.</Paragraph> <Paragraph position="7"> One set of experiments simply tests performance of a DATE tagger for the speech-act dimension on the HC dialogue data. In addition, we also train a DATE tagger on the HC dialogues using only the speech act dimension, for the purpose of applying it to a test set of the CMU-corpus of HH dialogues.3</Paragraph> <Paragraph position="8"> Footnote 2: DATE labels that are specified for all three dimensions have the dimension values given in three fields separated by ":". The first field contains the value for the Conversational-Domain dimension, the second for the Speech-Act dimension, and the third for the Task-Subtask dimension.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Preparation of Training and Test Data via DATE Tagging </SectionTitle> <Paragraph position="0"> The DATE labelling of the June-2000 data was done with a semi-automatic tagger: an utterance or utterance sequence is identified and labelled automatically by reference to a database of utterance patterns hand-labelled with DATE tags. The collection and DATE labelling of the utterance patterns was done in cooperation with site developers. As discussed above, these patterns for the 2000 data set were often quite specific, and often involved whole utterances. However, since the systems use template-based generation and have only a limited number of ways of saying the same content, relatively few utterance patterns needed to be hand-labelled when compared to the actual number of utterances occurring in the corpus. Further abstraction on the patterns was done with a named-entity labeller, which replaces specific tokens of city names, airports, hotels, airlines, dates, times, cars, and car rental companies with their generic type labels. For example, "what time do you want to leave ⟨AIRPORT⟩ on ⟨DATE-TIME⟩?" is the typed utterance for "what time do you want to leave Newark International on Monday?". For the 2000 tagging, the number of utterances in the pattern database was 1700, whereas the total number of utterances in the 663 dialogues was 22930. The named-entity labeller was also applied to the system utterances in the corpus.
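The sketch below illustrates the kind of dictionary-driven substitution such a named-entity labeller performs, assuming toy vocabulary lists; the function name and the lists are hypothetical stand-ins for the much larger preclassified vocabularies described next.

```python
import re

# Toy vocabulary lists; the actual lists were preclassified items from the sites.
VOCAB = {
    "AIRPORT": ["Newark International", "JFK"],
    "DATE-TIME": ["Monday", "tomorrow morning"],
    "CITY": ["Boston", "Chicago"],
}

def type_utterance(utterance: str) -> str:
    """Replace specific tokens with their generic type labels, longest match first."""
    for label, phrases in VOCAB.items():
        for phrase in sorted(phrases, key=len, reverse=True):
            utterance = re.sub(re.escape(phrase), f"<{label}>",
                               utterance, flags=re.IGNORECASE)
    return utterance

print(type_utterance("what time do you want to leave Newark International on Monday?"))
# -> what time do you want to leave <AIRPORT> on <DATE-TIME>?
```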
We collected vocabulary lists from all the sites for the named-entity labelling task. In most cases, systems had preclassified the individual tokens into generic types.</Paragraph> <Paragraph position="1"> The tagger implements a simple pattern matching algorithm to do the dialogue act labelling: for each utterance pattern in the pattern database, the tagger attempts to find a match in the dialogues; if the match succeeds, the DATE label of that pattern is assigned to the matching utterance in the dialogue. The matching ignores punctuation, since systems vary in the way they record punctuation.4 Certain utterances have different communicative functions depending on the context in which they occur. For example, phrases like "leaving in the ⟨DATE-TIME⟩" are implicit confirmations when they constitute an utterance on their own, but are part of the flight information presentation when they occur embedded in utterances such as "I have one flight leaving in the ⟨DATE-TIME⟩". To prevent incorrect labelling for such ambiguous cases, the pattern database is sorted so that sub-patterns are listed later than the patterns within which they are embedded, and the pattern matcher is forced to match patterns in their order of occurrence in the database.</Paragraph> <Paragraph position="2"> Footnote 4: [...] segmentation problem for the tagger. We assume that the utterances in the pattern database provide the reference points for utterance boundaries.</Paragraph> <Paragraph position="3"> While this tagger achieved 100% accuracy for the 2000 data by using many specific patterns, when applied to the 2001 corpus it was able to label only 60% of the data. On examination of the unlabelled utterances, we found that many systems had augmented their inventory of named-entity items as well as system utterances from the 2000 to the 2001 data collection. As a result, there were many new patterns unaccounted for in the existing named-entity lists as well as in the pattern database. In an attempt to cover the remaining 40% of the data, we therefore augmented the named-entity lists by obtaining a new set of preclassified vocabulary items from the sites, and added 800 hand-labelled patterns to the pattern database. For the labelling of any additional unaccounted-for patterns, we implemented a contextual rule-based postprocessor that looks at the surrounding dialogue acts of an unmatched utterance within a turn and attempts to label it. The contextual rules are intended to capture rigid system dialogue behaviors that are reflected in the DATE sequences within a turn.5 For example, one very frequently occurring DATE sequence within system turns is about-task:present-info:flight, about-task:present-info:price, about-task:offer:flight. The rule using this contextual information can be informally stated as follows: if, in a turn, the first two utterances are labelled as about-task:present-info:flight and about-task:present-info:price, and the third utterance is unlabelled, assign the third utterance the label about-task:offer:flight. Not all turn-internal DATE sequences are used as contextual rules, however, because many of them are highly ambiguous. For example, about-communication:apology:meta-slu-reject can be followed by a system instruction as well as any kind of request for information (typically) repeated from the previous system utterance.
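As a concrete illustration of the flight/price/offer rule stated informally above, here is a minimal sketch of one such contextual rule. The function and label spellings are hypothetical; the actual postprocessor's rule representation is not specified in the paper.

```python
from typing import List, Optional

def apply_offer_rule(turn_labels: List[Optional[str]]) -> List[Optional[str]]:
    """If the first two DAs in a turn present a flight and its price and the
    third utterance is unlabelled, label the third as a flight offer."""
    labels = list(turn_labels)  # copy so the input turn is left untouched
    if (len(labels) >= 3
            and labels[0] == "about-task:present-info:flight"
            and labels[1] == "about-task:present-info:price"
            and labels[2] is None):
        labels[2] = "about-task:offer:flight"
    return labels

print(apply_offer_rule(
    ["about-task:present-info:flight", "about-task:present-info:price", None]))
# -> [..., ..., 'about-task:offer:flight']
```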
Figure 6 shows the current DATE tagging system, augmented with the DATE rule-based postprocessor.</Paragraph> <Paragraph position="4"> With the 2000 tagger augmented with the additional named-entity items, utterance patterns, and the postprocessor, we were able to label 98.4% of the (69766) utterances in the 2001 corpus.</Paragraph> <Paragraph position="5"> We conducted a hand evaluation of 10 dialogues selected randomly from each system. The evaluation of these 80 dialogues in total shows that we achieved 96% accuracy on the 2001 tagging.</Paragraph> <Paragraph position="6"> In order to label the HH corpus of 1062 utterances, we started with 10 dialogues (305 utterances) labelled with the CSTAR dialogue act tagging scheme (Finke et al., 1998; Doran et al., 2001). We automatically converted the labels to DATE, and then hand-corrected them. We labelled the rest of the HH data by training a DATE tagger, applying it to the remainder of the corpus, and hand-correcting the results.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Feature Extraction </SectionTitle> <Paragraph position="0"> The corpus is used to construct the machine learning features as follows. In RIPPER, feature values are continuous (numeric), set-valued (textual), or symbolic. We encoded each utterance in terms of a set of 19 features that were either derived from the log files, derived from human transcription of the user utterances, or represent aspects of the dialogue context in which each utterance occurs.</Paragraph> <Paragraph position="1"> The complete feature set used by the machine learner is described in Figure 7. The features fall into three categories: (1) target utterance features; (2) context features; and (3) whole dialogue features.</Paragraph> <Paragraph position="2"> - target utterance features: utt-string, contains-word-FLIGHT-or-AIRLINE, contains-word-HOTEL-or-ROOM, contains-word-RENTAL-or-CAR, contains-word-CITY-or-AIRPORT, contains-word-DATE-TIME, pattern-length. - context features: left-sys-utt-string, right-sys-utt-string, da-num, position-in-turn, left-da-context1, left-da-context2, usr-orig-string, usr-typed-string, rec-orig-string, rec-typed-string, usr-rec-string-identity. - whole dialogue features: system-name, turn-number.</Paragraph> <Paragraph position="3"> The target utterance features include the target utterance string for which the dialogue act is to be predicted (utt-string), and a set of features derived from the named-entity labelling about what semantic types are instantiated in the target string. For example, the feature contains-word-FLIGHT-or-AIRLINE is represented by a boolean variable specifying whether the utterance string contains the words FLIGHT or AIRLINE. Similar features are contains-word-HOTEL-or-ROOM, contains-word-RENTAL-or-CAR, contains-word-CITY-or-AIRPORT, and contains-word-DATE-TIME. The pattern-length feature encodes the character length of the target utterance. The motivation for these features is to represent basic aspects of the target utterance, e.g., its length, and the lexical items and semantic types that appear in the utterance.</Paragraph> <Paragraph position="4"> The context features encode simple aspects of the context in which the target utterance occurs. Two of these represent the system utterance strings to the left and right of the target utterance (left-sys-utt-string and right-sys-utt-string).
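Before completing the inventory of context features, here is a minimal sketch of how the target-utterance features just described might be computed from a typed utterance string. The helper is hypothetical, and the plain substring test is a simplification of whatever token matching the original system used.

```python
def target_utterance_features(typed_utt: str) -> dict:
    """Derive the target-utterance features of Figure 7 from a typed utterance."""
    text = typed_utt.upper()

    def contains_any(*words):  # simple substring check (a simplification)
        return any(w in text for w in words)

    return {
        "utt-string": typed_utt,
        "contains-word-FLIGHT-or-AIRLINE": contains_any("FLIGHT", "AIRLINE"),
        "contains-word-HOTEL-or-ROOM": contains_any("HOTEL", "ROOM"),
        "contains-word-RENTAL-or-CAR": contains_any("RENTAL", "CAR"),
        "contains-word-CITY-or-AIRPORT": contains_any("CITY", "AIRPORT"),
        "contains-word-DATE-TIME": contains_any("DATE-TIME"),
        "pattern-length": len(typed_utt),  # character length of the utterance
    }

print(target_utterance_features("I found three flights that match your request."))
```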
The left-da-context1 and left-da-context2 features represent the left unigram and bigram dialogue act context of the target utterance; this context extends beyond the target turn only as far as the last dialogue act in the previous system turn. The da-num feature encodes the number of dialogue acts in the target turn, and the position-in-turn feature encodes the position of the target utterance in its turn. In addition, the user's previous utterance is represented as part of the context, both in terms of automatically extractable features, such as what the automatic speech recognizer thought the user said (rec-orig-string) and a version of this on which the named-entity labeller has been run (rec-typed-string), and in terms of human-generated transcriptions of the user's utterance. Features based on the transcriptions include the original human transcription (usr-orig-string) and the transcription after named-entity tagging (usr-typed-string). The usr-rec-string-identity feature is a boolean feature based on comparing the user's transcribed utterance with the recognizer's hypothesis of what the user said, using simple string identity.</Paragraph> [Table legend fragment (residue from a results table): "... = Dimension of DATE used for output classification (Maj. Cl. = Majority Class, Acc = Accuracy, SE = Standard Error)"] <Paragraph position="1"> Some applications of DATE tagging would not use features derived from human-generated transcriptions, so the experiments below report accuracy figures for DATE taggers that ignore these features.</Paragraph> <Paragraph position="2"> The motivation for the context features is to represent aspects of the context in which the utterance occurs in terms of a window of surrounding lexical items and dialogue acts.</Paragraph> <Paragraph position="3"> The whole dialogue features are the name of the site whose system generated the dialogue (system-name), and the turn number of the target utterance within the whole dialogue (turn-number). For HH dialogues the system-name has the value "human".</Paragraph> <Paragraph position="4"> The motivation for including the system-name feature is to see whether there are any aspects of the dialogue act realizations that are specific to particular systems. The motivation for the turn-number feature is that particular types of dialogue acts are more likely to occur in particular phases of the dialogue.</Paragraph> </Section> </Section> </Paper>