<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0303"> <Title>Dialogue Management in the Mercury Flight Reservation System</Title> <Section position="3" start_page="0" end_page="12" type="metho"> <SectionTitle> 2 Knowledge Representations </SectionTitle> <Paragraph position="0"> Mercury's dialogue manager manipulates linguistic and world knowledge represented in the form of semantic frames. At each turn it begins with an E-form representing the constraints of the current query as a set of (key: value) pairs. This E-form pro-U: John Doe S: Please speak the month and day you registered as your password, as in, may first. U: april first S: Welcome John. You have successfully logged onto Mercury. How may I help you? U: i want to travel to washington dulles on november fourth S: From Boston to Washington Dulles on November 4. Can you provide the approximate departure time or airline preference? U: i need to arrive at around nine a m I: Okay. Flights from Boston to Washington Dulles arriving at 9:00 am on November 4. One moment please.</Paragraph> <Paragraph position="1"> S: The best matching flight is a U S Airways flight arriving at 8:59 a.m. I also have a United flight arriving at 8:36 a.m.</Paragraph> <Paragraph position="2"> U: i what time does the united flight leave S: United flight 1819 departs at 7:00 am from Boston. Shall I book this flight for you? U: yes please S: United flight 1819 has been added to your itinerary. You depart Boston at 7:00 am and arrive in Washington Dulles at 8:36 am. Where would you like to go next? U: i would like to return to boston S: What date will you be returning on? U: the same day I: Okay. United flights from Washington Dulles to Boston on November 4. One moment please. S: Can you provide a departure or arrival time? The earliest nonstop leaves at 9:30 a.m. and arrives at 10:57 a.m.</Paragraph> <Paragraph position="3"> U: i would like to leave after three thirty S: The best matching flight is a United flight departing at 4:30 p.m. I also have a United flight departing at 5:30 p.m.</Paragraph> <Paragraph position="4"> U: i will take the first one S: United flight 1278 has been added to your itinerary. You depart Washington Dulles at 4:30 pm and arrive in Boston at 6:07 pm. Shall I price your itinerary? U: yes please I: Pricing your itinerary will take a while. In the meantime, I'll summarize your itinerary for you. S: on November 4, you depart on United flight 1819 from Boston at 7:00 am and arrive in Washington Dulles at 8:36 am. From there, you depart on United flight 1278 at 4:30 pm and arrive in Boston at 6:07 pm.</Paragraph> <Paragraph position="5"> The price for your itinerary is $249.00. Shall I email your itinerary to you? U: yes S: Your itinerary has been emailed. Is there something else I can do for you? U: no thank you S: Thank you for calling! Have a nice day! &quot;S&quot; a system utterance, and 'T' an intermediate system response spoken prior to database retrieval. vides the initial values in the dialogue state, which evolves over the course of the turn as rules are executed. null In addition to the dialogue state, Mercury also makes use of several other internal representations which retain state. These are packaged up into a turn-manager history frame which is passed to the hub for safe-keeping at the end of each turn, and returned with each subsequent turn. The user model contains a record of the system's current belief about the state of the user's flight plan. 
<Paragraph position="2"> The user model contains a record of the system's current belief about the state of the user's flight plan. For example, if the user's first query were "I want to go to Denver next Friday morning, returning the following Wednesday," the system would record that this is a round-trip flight and would save the return date (unresolved, in case there was a recognition error on the forward leg date) in the user model. It would retrieve from the user model, as the forward-leg source, the default value provided by the user at enrollment. Thus the user model retains both the initial user preferences and all of the details of the current flight plan as they become available. The system also retains internally a list of all the flights that matched the user's constraint specifications, as well as a list of the subset of the flights that it has spoken about (e.g., the nonstop flights or the earliest flight). These flights become available for reference in subsequent queries such as "book the third one" or "where does the united flight stop?" </Paragraph>
<Paragraph position="3"> Whenever the system needs to go to the database, it provides an intermediate reply frame which serves both to ask the user for patience and to confirm a consistent shared model of the dialogue history. The system's final response is also a frame, and it often contains a list of flights to be read back to the user. In other cases, it may be a yes-no question, as in "Shall I email your itinerary to you?" The latter case also requires sending a system-initiative context for yes/no responses to the discourse component, and a dialogue context filter that favors confirmations and denials in the N-best selection to the language understanding component. A similar process takes place when the system prompts for information such as a date or a departure city. </Paragraph>
<Paragraph position="4"> Table 1 shows a dialogue between a user and the system involving a real trip between Boston and Washington D.C. The name and password information have been replaced with placeholders for privacy reasons. In the table, "U" stands for a user utterance, "S" for a system utterance, and "I" for an intermediate response. </Paragraph>
</Section>
<Section position="4" start_page="12" end_page="14" type="metho"> <SectionTitle> 3 Dialogue Management Strategies </SectionTitle>
<Paragraph position="0"> A finite state network is a convenient way to formulate dialogue states, as long as the system is able to maintain strong control over the order in which slots are filled, and especially when slots are filled one at a time. However, if the user is permitted the flexibility to fill slots in any order, and to fill more than one slot in a single turn, then it becomes difficult and inefficient to define a finite state network that covers the entire space of possibilities. </Paragraph>
<Paragraph position="1"> A finite state solution for a simple dialogue example is given in Figure 1. [Figure 1 appears here; only a caption fragment survives: "... the state diagram in Figure 2."] Here, we assume that the user can fill four slots (source, destination, date, and airline) and that all four must be filled before the system can retrieve flights. The slots can be filled in any order, and the user can fill any number of slots in each turn. For notational convenience, we represent the states filled/unfilled by 1/0, and we order the fields as [source, destination, date, airline], such that state "1101" says that only date is missing. At each state the system prompts the user for all missing slots. Thus, state 1000 would speak, "Can you provide one or more of destination, date, and airline?" If the user provides more than a single field in a given turn, the system advances by more than one state in the state diagram. To fully specify all possibilities, even for this simple example, requires a large number of states in the state diagram. </Paragraph>
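To make the combinatorics concrete, the short sketch below enumerates the full state space of the four-slot example. It is an illustration of the blow-up, not code from the system.

```python
from itertools import product

SLOTS = ["source", "destination", "date", "airline"]

# Every filled/unfilled combination is its own state: 2^4 = 16 here, and
# the count doubles with each additional slot. Transitions fan out even
# further, since a single user turn may fill any subset of missing slots.
for bits in product("10", repeat=len(SLOTS)):
    state = "".join(bits)
    missing = [slot for slot, b in zip(SLOTS, bits) if b == "0"]
    if missing:
        print(f"state {state}: prompt for {', '.join(missing)}")
    else:
        print(f"state {state}: all slots filled -- retrieve flights")
```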
<Paragraph position="2"> Figure 2 shows a set of ordered rules which achieve the same goals as the state diagram but with a greatly simplified dialogue control specification. In this case, state is encoded implicitly as a bit map corresponding to the filled/unfilled values of the four variables. The notation "!" stands for "NOT," meaning that the variable is not yet filled. Following the arrow is the name of the operation to be performed when the conditions are met. Each operation is usually a specific function, and is free to augment the variable space by either altering the values of pre-existing variables or introducing new variables. In our simple example the four prompt operations simply add their respective variables to the response frame and increment the variable "nprompt." The function "mark_multiple" fires only if more than one of the preceding functions has fired; its action is to insert "one or more of" into the response. The final "retrieve" operation fires only if none of the preceding operations fired. It is significant that state is dispersed not only over the variables but also over the operations. Thus, any single prompt state of the state transition matrix would involve up to two operations in the dialogue control table, and, conversely, the "mark_multiple" operation is associated with several states. </Paragraph>
<Paragraph position="3"> In addition to possibly altering the values of variables in the dialogue state, operations also have the responsibility to return, upon completion, one of three "move" states. The majority of the operations return the state "CONTINUE," which means simply to continue execution just beyond the rule that just fired. The two other possibilities are "STOP," i.e., exit from the dialogue control table, and "RESTART," which means to return to the top of the set of dialogue rules and start over. An (optional) distinguished operation, which handles activities that should occur routinely at every turn, is executed upon exiting the dialogue control process. The tests on variables can be binary, arithmetic, or string matching on values. </Paragraph>
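A minimal Python reading of this ordered-rules control is sketched below. The rule conditions and operation names (the prompt operations, "mark_multiple," "retrieve," and the CONTINUE/STOP/RESTART moves) follow the four-slot example in the text; the interpreter itself is our own sketch, not Mercury's implementation.

```python
CONTINUE, STOP, RESTART = "CONTINUE", "STOP", "RESTART"

def make_prompt(slot):
    def prompt(state, reply):
        reply.setdefault("prompts", []).append(slot)
        state["nprompt"] = state.get("nprompt", 0) + 1
        return CONTINUE
    return prompt

def mark_multiple(state, reply):
    reply["prefix"] = "one or more of"   # "Can you provide one or more of ..."
    return CONTINUE

def retrieve(state, reply):
    reply["action"] = "retrieve_flights"
    return CONTINUE

# Each rule is (conditions, operation); "!var" tests that var is unfilled.
# Rules are tried strictly in order, as in the dialogue control table.
RULES = [
    (["!source"],      make_prompt("source")),
    (["!destination"], make_prompt("destination")),
    (["!date"],        make_prompt("date")),
    (["!airline"],     make_prompt("airline")),
    (["nprompt>1"],    mark_multiple),
    (["nprompt=0"],    retrieve),   # fires only if no prompt was issued
]

def holds(cond, state):
    if cond.startswith("!"):
        return not state.get(cond[1:])
    if ">" in cond:
        var, val = cond.split(">")
        return state.get(var, 0) > int(val)
    if "=" in cond:
        var, val = cond.split("=")
        return state.get(var, 0) == int(val)
    return bool(state.get(cond))

def run_turn(state):
    reply = {}
    i = 0
    while i < len(RULES):
        conds, op = RULES[i]
        if all(holds(c, state) for c in conds):
            move = op(state, reply)
            if move == STOP:
                break
            if move == RESTART:
                i = 0
                continue
        i += 1
    return reply

# Only source is filled ("state 1000"): the system prompts for the rest.
print(run_turn({"source": "boston"}))
# {'prompts': ['destination', 'date', 'airline'], 'prefix': 'one or more of'}
```

Note how the table stays linear in the number of slots: adding a fifth slot means adding one rule, whereas the state diagram of Figure 1 would double in size.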
<Paragraph position="4"> Our Mercury system makes use of this "ordered rules" strategy for dialogue control. However, the rules deal with a great deal more complexity than simply deciding which prompts to issue when certain slots are unfilled. There are currently over 200 rules in Mercury's dialogue control table. </Paragraph>
<Paragraph position="5"> These rules can be grouped into several distinct categories. Only nine of the rules involve prompts for missing information. Another nine are involved with logging into the system, i.e., acquiring the name and the password, which may have to be entered using the telephone keypad, either upon user request or as a consequence of recognition failure. Eleven rules are concerned with meta-level interactions such as apologies for missing services (no flight status information is available) and requests for help or repetition. Several rules have to do with determining whether the current request plausibly involves a reference to a flight in a pre-existing flight list. This could be an obvious reference to the nth flight, or it could be an inquiry about the "united flight" or the "nine a.m. flight." Several more rules involve interpreting references to relative dates and/or times, such as "the following Thursday" or "that same afternoon." </Paragraph>
<Paragraph position="6"> The largest single category concerns preparing the reply frame after the database retrieval has taken place. Twenty-six rules are concerned with this task, and they are keyed on a combination of the number and type of flights retrieved and the specific request of the user (e.g., "where do they connect?"). The purpose of these functions is to reduce the number of database tuples returned for a given query to a manageable set for a spoken response. This reduction is based on a hierarchy of quality measures for each flight, beginning with any stated or inferred preference (e.g., a particular airport in the case of cities with multiple airports, or a particular airline in the case of a multi-leg booking where one leg has already been established) and including number of stops and length of flight. These functions also consolidate information about the flights to be spoken, combining shared attributes. Our ultimate goal is to produce a response such as "I have found three nonstop United flights. Flight 100 leaves at 8:45, flight 200 leaves at 9:10, and flight 300 leaves at 9:30." The dialogue control table facilitates this interaction by enabling the system developer to encode the constraints of the hierarchy in the rules. </Paragraph>
<Paragraph position="7"> Finally, there is a miscellaneous set of rules that have to do with updating the user model, preparing the intermediate reply, pricing or emailing the itinerary, preparing the database query, filtering flights, and updating the itinerary. </Paragraph>
<Paragraph position="8"> Since the operations are general functions, it is up to the system developer to decide how to parcel up the computational requirements into the individual operations. With experience, one acquires a set of guidelines to help formalize this process. As a general rule, it is preferable to limit the use of nested function calls. Instead, an operation can set a variable to indicate that another operation needs to be called, and the intended subroutine is then promoted to the status of a dialogue control operation. This has the effect of exposing more of the internal activities of the system to the dialogue control table, which serves as a very useful outline of system control flow. Another general policy is that each operation should be concerned with a single well-defined task, again in order not to conceal complexity. </Paragraph>
<Paragraph position="9"> Figure 3 shows several examples of actual rules in Mercury's dialogue control table. </Paragraph>
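As an illustration of the reply-preparation hierarchy, the sketch below ranks retrieved flights by a tuple of quality keys: stated or inferred preference first, then number of stops, then flight duration. The field names, the key ordering, and the cutoff of three spoken flights are our own illustrative assumptions, not Mercury's actual rule set.

```python
# Illustrative sketch of reply preparation: rank retrieved flights by a
# quality hierarchy and keep only a few to speak about.
def reduce_flights(flights, preferred_airline=None, max_spoken=3):
    def quality(flight):
        return (
            0 if flight["airline"] == preferred_airline else 1,  # preference first
            flight["stops"],                                     # fewer stops next
            flight["duration_min"],                              # then shorter flights
        )
    return sorted(flights, key=quality)[:max_spoken]

flights = [
    {"airline": "us airways", "flight": 450, "stops": 1, "duration_min": 150},
    {"airline": "united",     "flight": 100, "stops": 0, "duration_min": 95},
    {"airline": "united",     "flight": 200, "stops": 0, "duration_min": 97},
]
# Prefer United, e.g. because the first leg was already booked on United.
for f in reduce_flights(flights, preferred_airline="united"):
    print(f["airline"], f["flight"])
```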
</Section>
<Section position="5" start_page="14" end_page="15" type="metho"> <SectionTitle> 4 Data Collection and Evaluation </SectionTitle>
<Paragraph position="0"> Mercury first became available for data collection in October '99. Prospective users must first enroll by filling in a simple form on a Web page, where they enter, minimally, their name, email address, and password (a date). Once the user's name has been added to the recognizer and language understanding components, the user receives an email message giving the system's telephone number. Users are encouraged to book real trips. From late October to early December, we collected 49 dialogues involving real flight bookings, and these form the basis for our evaluation studies. </Paragraph>
<Paragraph position="1"> Overall, 73% of the bookings were successful (36/49). We used a very strict requirement for success. For example, in one dialogue considered unsuccessful, the system did not know the airline that the user requested, and so the user compromised and booked the trip on a different airline. Three of the failures are due to the user simply hanging up in frustration, and three others are due to the system hanging up after a misrecognized "good-bye." Two failures were due to user inattentiveness: the user believed that the trip was correctly booked, but a misrecognition produced a different itinerary from the one they were specifying. Finally, four of the failures involved completely correct bookings, but the system was unable to follow through with the pricing and/or emailing of the itinerary. Some of these involved inadequacies in the dialogue module once the user did not provide the expected response to a system request. There was a striking difference in recognition error between the successful and the incomplete bookings (11.5% vs. 26% word error rate). A heavy foreign accent accounted for some of the recognition problems. </Paragraph>
<Paragraph position="2"> Some easily measurable statistics for the successes are given in Figure 4. [Figure 4 appears here; caption: numbers are computed in terms of the number of new attributes introduced with each dialogue turn; IBR = Information Bit Rate.] The statistics are based on the "core dialogue," defined as the interval subsequent to logging on and up until the itinerary is fully specified but has not yet been priced. On average, users required less than four minutes to complete the core dialogue, although three outliers took more than seven minutes. </Paragraph>
<Section position="1" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 4.1 Log File Evaluation </SectionTitle>
<Paragraph position="0"> We have long been interested in evaluation metrics that are automatic, that can apply on a per-utterance basis, and that evaluate a significant portion of the system beyond the recognizer. In [Polifroni et al. (1998)] we proposed an E-form evaluation metric, which compares an E-form obtained by parsing the original orthography against that obtained by parsing the selected recognizer hypothesis. We believe this is a good metric for evaluating how well the recognizer and parser are doing, but it says nothing about the discourse and dialogue components. </Paragraph>
<Paragraph position="1"> We recently devised two new evaluation metrics, which we believe are useful measures for assessing the performance of the recognizer, parser, discourse, and dialogue components collectively. To compute the measures, we must reprocess the log file after the orthographic transcription has been provided for the user queries. Both the recognizer hypothesis and the original orthography are run through the system utterance by utterance, with the discourse and dialogue states being maintained exclusively by the recognizer branch. For both branches, the E-form that is produced after the turn manager has finished processing the query is sent to a special evaluation server. This server maintains a running record of all the attributes that appear in the orthography path, comparing them against their counterparts in the recognizer path. </Paragraph>
<Paragraph position="2"> The two parameters that emerge from comparing these E-forms we refer to as information bit rate (IBR) and user frustration (UF). IBR measures the average number of new attributes introduced per user query. A subsequent query that reiterates the same attribute is excluded, since it did not introduce any new information. Thus, if the user said, "I want to go from Seattle to Chicago on December 27," and the system misrecognized the date as "December 22," then a subsequent query, "I said December 27," would be registered as contributing a count of 0 to the IBR parameter. The UF parameter tabulates how many turns it took, on average, for an intended attribute to be transmitted successfully to the system. Thus, in the example above, the source and destination each took one turn, but the date took two. </Paragraph>
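The bookkeeping behind these two measures might look like the sketch below. It uses our own simplified representation, in which each turn yields one attribute dictionary per branch; the evaluation server's actual record format is not described here.

```python
from collections import defaultdict

def evaluate(turns):
    """turns: list of (ortho_attrs, reco_attrs) dict pairs, one per query."""
    seen = {}               # attribute -> value the user has already stated
    delivered = set()       # attributes the recognizer branch has gotten right
    new_counts = []
    attempts = defaultdict(int)
    for ortho, reco in turns:
        new = 0
        for attr, val in ortho.items():
            if seen.get(attr) != val:   # genuinely new information
                seen[attr] = val
                new += 1
            if attr not in delivered:   # still trying to get this across
                attempts[attr] += 1
                if reco.get(attr) == val:
                    delivered.add(attr)
        new_counts.append(new)
    ibr = sum(new_counts) / len(new_counts)
    uf = sum(attempts.values()) / len(attempts)
    return ibr, uf

# The Seattle-Chicago example from the text: the date is misrecognized
# once, so it takes two turns to be transmitted successfully.
turns = [
    ({"source": "seattle", "destination": "chicago", "date": "dec 27"},
     {"source": "seattle", "destination": "chicago", "date": "dec 22"}),
    ({"date": "dec 27"}, {"date": "dec 27"}),
]
print(evaluate(turns))  # IBR = (3 + 0) / 2 = 1.5; UF = (1 + 1 + 2) / 3 = 1.33
```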
<Paragraph position="3"> There are some difficulties with rerunning the dialogue at a later time. Both the system and the database are in a state of flux, and so the dialogue can become incoherent. For example, in one case the user said, "Book it," in response to a single flight being proposed, but due to changes in the flight schedule, the system proposed three flights in the rerun, and the dialogue became incoherent from that point on. To help alleviate incoherence, we provide a mechanism to artificially offset the date, at least to ensure that the dates the user selected haven't already passed. In spite of the above problems, we feel that these evaluation metrics show considerable promise. </Paragraph>
<Paragraph position="4"> In a pilot study, we processed a subset of our data through this evaluation configuration. We identified a set of 17 attributes that could be monitored. Five percent of the utterances had orthographies that failed to parse. These are unevaluable without human reannotation, and are hence eliminated from the pool in the discussion below, although they are clearly likely to be very problematic. Figure 5 summarizes the results for information bit rate for the remainder of the utterances. A surprisingly large percentage of the utterances introduce no new concepts. Some, but not all, of these are similar to the date misrecognition example given above. Others are cases where the user was confused about the state of the system's knowledge and decided to simply repeat all the preceding constraints just to make sure. Some are also misfirings of the endpoint detector, producing content-free utterances such as "okay." In other cases the user intended an action, but the system's understanding mechanism was not sophisticated enough to interpret it; for example, "That's good," meaning "book it." We were pleased with the percentage of sentences that contained more than one attribute. We believe that a typical directed dialogue would have far fewer utterances with more than one attribute. </Paragraph>
<Paragraph position="5"> Excluding the 5% of utterances whose orthography failed to parse, our system achieved a user frustration rate of 1.05. This means that, on average, one out of every 20 attributes had to be entered twice. We were very pleased with this number. </Paragraph>
</Section> </Section> </Paper>