<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1010">
  <Title>Data-Driven Strategies for an Automated Dialogue System</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 System Architecture and Components
</SectionTitle>
    <Paragraph position="0"> The Amities system uses the Galaxy Communicator Software Infrastructure (Seneff et al., 1998). Galaxy is a distributed, message-based, hub-and-spoke infrastructure, optimized for spoken dialogue systems.</Paragraph>
    <Paragraph position="1">  Components in the Amities system (Figure 1) include a telephony server, automatic speech recognizer, natural language understanding unit, dialogue manager, database interface server, response generator, and text-to-speech conversion.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Audio Components
</SectionTitle>
      <Paragraph position="0"> Audio components for the Amities system are provided by LIMSI. Because acoustic models have not yet been trained, the current demonstrator system uses a Nuance ASR engine and TTS Vocalizer.</Paragraph>
      <Paragraph position="1"> To enhance ASR performance, we integrated static GSL (Grammar Specification Language) grammar classes provided by Nuance for recognizing several high-frequency items: numbers, dates, money amounts, names and yes-no statements.</Paragraph>
      <Paragraph position="2"> Training data for the recognizer were collected both from our corpus of human-human dialogues and from dialogues gathered using a text-based version of the human-computer system. Using this version we collected around 100 dialogues and annotated important domain-specific information, as in this example: &amp;quot;Hi my name is [fname ; David] [lname ; Oconnor] and my account number is [account ; 278 one nine five].&amp;quot; Next we replaced these annotated entities with grammar classes. We also utilized utterances from the Amities banking corpus (Hardy et al., 2002) in which the customer specifies his/her desired task, as well as utterances which constitute common, domain-independent speech acts such as acceptances, rejections, and indications of nonunderstanding. These were also used for training the task identifier and the dialogue act classifier (Section 3.3.2). The training corpus for the recognizer consists of 1744 utterances totaling around 10,000 words.</Paragraph>
      <Paragraph position="3"> Using tools supplied by Nuance for building recognition packages, we created two speech recognition components: a British model in the UK and an American model at two US sites.</Paragraph>
      <Paragraph position="4"> For the text to speech synthesizer we used Nuance's Vocalizer 3.0, which supports multiple languages and accents. We integrated the Vocalizer and the ASR using Nuance's speech and telephony API into a Galaxy-compliant server accessible over a telephone line.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Natural Language Understanding
</SectionTitle>
      <Paragraph position="0"> The goal of the language understanding component is to take the word string output of the ASR module, and identify key semantic concepts relating to the target domain. This is a specialized kind of information extraction application, and as such, we have adapted existing IE technology to this task.</Paragraph>
      <Paragraph position="1">  We have used a modified version of the ANNIE engine (A Nearly-New IE system; Cunningham et al., 2002; Maynard, 2003). ANNIE is distributed as the default built-in IE component of the GATE framework (Cunningham et al., 2002). GATE is a pure Java-based architecture developed over the past eight years in the University of Sheffield Natural Language Processing group. ANNIE has been used for many language processing applications, in a number of languages both European and non-European. This versatility makes it an attractive proposition for use in a multilingual speech processing project.</Paragraph>
      <Paragraph position="2"> ANNIE includes customizable components necessary to complete the IE task - tokenizer, gazetteer, sentence splitter, part of speech tagger and a named entity recognizer based on a powerful engine named JAPE (Java Annotation Pattern Engine; Cunningham et al., 2000).</Paragraph>
      <Paragraph position="3"> Given an utterance from the user, the NLU unit produces both a list of tokens for detecting dialogue acts, an important research goal inside this project, and a frame with the possible named entities specified by our application. We are interested particularly in account numbers, credit card numbers, person names, dates, amounts of money, locations, addresses and telephone numbers.</Paragraph>
      <Paragraph position="4"> In order to recognize these, we have updated the gazetteer, which works by explicit look-up tables of potential candidates, and modified the rules of the transducer engine, which attempts to match new instances of named entities based on local grammatical context. There are some significant differences between the kind of prose text more typically associated with information extraction, and the kind of text we are expecting to encounter. Current models of IE rely heavily on punctuation as well as certain orthographic information, such as capitalized words indicating the presence of a name, company or location. We have access to neither of these in the output of the ASR engine, and so had to retune our processors to data which reflected that.</Paragraph>
      <Paragraph position="5"> In addition, we created new processing resources, such as those required to spot number units and translate them into textual representations of numerical values; for example, to take &amp;quot;twenty thousand one hundred and fourteen pounds&amp;quot;, and produce &amp;quot;PS20,114&amp;quot;. The ability to do this is of course vital for the performance of the system.</Paragraph>
      <Paragraph position="6"> If none of the main entities can be identified from the token string, we create a list of possible fallback entities, in the hope that partial matching would help narrow the search space.</Paragraph>
      <Paragraph position="7"> For instance, if a six-digit account number is not identified, then the incomplete number recognized in the utterance is used as a fallback entity and sent to the database server for partial matching.</Paragraph>
      <Paragraph position="8"> Our robust IE techniques have proved invaluable to the efficiency and spontaneity of our data-driven dialogue system. In a single utterance the user is free to supply several values for attributes, prompted or unprompted, allowing tasks to be completed with fewer dialogue turns.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Dialogue Manager
</SectionTitle>
      <Paragraph position="0"> The dialogue manager identifies the goals of the conversation and performs interactions to achieve those goals. Several &amp;quot;Frame Agents&amp;quot;, implemented within the dialogue manager, handle tasks such as verifying the customer's identity, identifying the customer's desired transaction, and executing those transactions. These range from a simple balance inquiry to the more complex change of address and debit-card payment. The structure of the dialogue manager is illustrated in Figure 2.</Paragraph>
      <Paragraph position="1"> Rather than depending on a script for the progression of the dialogue, the dialogue manager takes a data-driven approach, allowing the caller to take the initiative. Completing a task depends on identifying that task and filling values in frames, but this may be done in a variety of ways: one at a time, or several at once, and in any order.</Paragraph>
      <Paragraph position="2"> For example, if the customer identifies himself or herself before stating the transaction, or even if he or she provides several pieces of information in one utterance--transaction, name, account number, payment amount--the dialogue manager is flexible enough to move ahead after these variations.</Paragraph>
      <Paragraph position="3"> Prompts for attributes, if needed, are not restricted to one at a time, but they are usually combined in the way human agents request them; for example, city and county, expiration date and issue number,  If the system fails to obtain the necessary values from the user, reprompts are used, but no more than once for any single attribute. For the customer verification task, different attributes may be  requested. If the system fails even after reprompts, it will gracefully give up with an explanation such as, &amp;quot;I'm sorry, we have not been able to obtain the information necessary to update your address in our records. Please hold while I transfer you to a customer service representative.&amp;quot;  For task identification, the Amities team has made use of the data collected in over 500 conversations from a British call center, recorded, transcribed, and annotated. Adapting a vector-based approach reported by Chu-Carroll and Carpenter (1999), the Task ID Frame Agent is domain-independent and automatically trained.</Paragraph>
      <Paragraph position="4"> Tasks are represented as vectors of terms, built from the utterances requesting them. Some examples of labeled utterances are: &amp;quot;Erm I'd like to cancel the account cover premium that's on my, appeared on my statement&amp;quot; [CancelInsurance] and &amp;quot;Erm just to report a lost card please&amp;quot; [Lost/StolenCard].</Paragraph>
      <Paragraph position="5"> The training process proceeds as follows:  1. Begin with corpus of transcribed, annotated calls.</Paragraph>
      <Paragraph position="6"> 2. Document creation: For each transaction, collect raw text of callers' queries. Yield: one &amp;quot;document&amp;quot; for each transaction (about 14 of these in our corpus).</Paragraph>
      <Paragraph position="7"> 3. Text processing: Remove stopwords, stem content words, weight terms by frequency.</Paragraph>
      <Paragraph position="8"> Yield: one &amp;quot;document vector&amp;quot; for each task. 4. Compare queries and documents: Create &amp;quot;query vectors.&amp;quot; Obtain a cosine similarity score for each query/document pair. Yield: cosine scores/routing values for each query/document pair.</Paragraph>
      <Paragraph position="9"> 5. Obtain coefficients for scoring: Use binary logistic regression. Yield: a set of coefficients for each task.</Paragraph>
      <Paragraph position="10"> Next, the Task ID Frame Agent is tested on unseen utterances or queries: 1. Begin with one or more user queries.</Paragraph>
      <Paragraph position="11"> 2. Text processing: Remove stopwords, stem content words, weight terms (constant weights). Yield: &amp;quot;query vectors&amp;quot;.</Paragraph>
      <Paragraph position="12"> 3. Compare each query with each document.</Paragraph>
      <Paragraph position="13"> Yield: cosine similarity scores.</Paragraph>
      <Paragraph position="14"> 4. Compute confidence scores (use training coefficients). Yield: confidence scores,  representing the system's confidence that the queries indicate the user's choice of a particular transaction.</Paragraph>
      <Paragraph position="15"> Tests performed over the entire corpus, 80% of which was used for training and 20% for testing, resulted in a classification accuracy rate of 85% (correct task is one of the system's top 2 choices). The accuracy rate rises to 93% when we eliminate confusing or lengthy utterances, such as requests for information about payments, statements, and general questions about a customer's account.</Paragraph>
      <Paragraph position="16"> These can be difficult even for human annotators to classify.</Paragraph>
      <Paragraph position="17">  The purpose of the DA Classifier Frame Agent is to identify a caller's utterance as one or more domain-independent dialogue acts. These include Accept, Reject, Non-understanding, Opening, Closing, Backchannel, and Expression. Clearly, it is useful for a dialogue system to be able to identify accurately the various ways a person may say &amp;quot;yes&amp;quot;, &amp;quot;no&amp;quot;, or &amp;quot;what did you say?&amp;quot; As with the task identifier, we have trained the DA classifier on our corpus of transcribed, labeled human-human calls, and we have used vector-based classification techniques. Two differences from the task identifier are 1) an utterance may have multiple correct classifications, and 2) a different stoplist is necessary. Here we can filter out the usual stops, including speech dysfluencies, proper names, number words, and words with digits; but we need to include words such as yeah, uh-huh, hi, ok, thanks, pardon and sorry.</Paragraph>
      <Paragraph position="18"> Some examples of DA classification results are shown in Figure 3. For sure, ok, the classifier returns the categories Backchannel, Expression and Accept. If the dialogue manager is looking for either Accept or Reject, it can ignore Backchannel and Expression in order to detect the correct classification. In the case of certainly not, the first word has a strong tendency toward Accept, though both together constitute a Reject act.</Paragraph>
      <Paragraph position="19"> Text: &amp;quot;sure, okay&amp;quot; Text: &amp;quot;certainly not&amp;quot;  Our classifier performs well if the utterance is short and falls into one of the selected categories (86% accuracy on the British data); and it has the advantages of automatic training, domain independence, and the ability to capture a great variety of expressions. However, it can be inaccurate when applied to longer utterances, and it is not yet equipped to handle domain-specific assertions, questions, or queries about a transaction.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Database Manager
</SectionTitle>
      <Paragraph position="0"> Our system identifies users by matching information provided by the caller against a database of user information. It assumes that the speech recognizer will make errors when the caller attempts to identify himself. Therefore perfect matches with the database entries will be rare.</Paragraph>
      <Paragraph position="1"> Consequently, for each record in the database, we attach a measure of the probability that the record is the target record. Initially, these measures are estimates of the probability that this individual will call. When additional identifying information arrives, the system updates these probabilities using Bayes' rule.</Paragraph>
      <Paragraph position="2"> Thus, the system might begin with a uniform probability estimate across all database records. If the user identifies herself with a name recognized by the machine as &amp;quot;Smith&amp;quot;, the system will appropriately increment the probabilities of all entries with the name &amp;quot;Smith&amp;quot; and all entries that are known to be confused with &amp;quot;Smith&amp;quot; in proportion to their observed rate of substitution. Of course, all records not observed to be so confusable would similarly have their probabilities decreased by Bayes' rule. When enough information has come in to raise the probability for some record above a threshold (in our system 0.99 probability), the system assumes that the caller has been correctly identified. The designer may choose to include a verification dialog, but our decision was to minimize such interactions to shorten the calls.</Paragraph>
      <Paragraph position="3"> Our error-correcting database system receives tokens with an identification of what field each token should represent. The system processes the tokens serially. Each represents an observation made by the speech recognizer. To process a token, the system examines each record in the database and updates the probability that the record is the target record using Bayes' rule: where rec is the event where the record under consideration is the target record.</Paragraph>
      <Paragraph position="4"> As is common in Bayes' rule calculations, the denominator P(obs) is treated as a scaling factor, and is not calculated explicitly. All probabilities are renormalized at the end of the update of all of the records. P(rec) is the previous estimate of the probability that the record is the target record.</Paragraph>
      <Paragraph position="5"> P(obs|rec) is the probability that the recognizer returned the observation that it did given that the target record is the current record under examination. For some of the fields, such as the account number and telephone number, the user responses consist of digits. We collected data on the probability that the speech recognition system we are using mistook one digit for another and calculated the values for P(obs|rec) from the data.</Paragraph>
      <Paragraph position="6"> For fields involving place names and personal names, the probabilities were estimated.</Paragraph>
      <Paragraph position="7"> Once a record has been selected (by virtue of its probability being greater than the threshold) the system compares the individual fields of the record with values obtained by the speech recognizer. If the values differ greatly, as measured by their Levenshtein distance, the system returns the field name to the dialogue manager as a candidate for additional verification. If no record meets the threshold probability criterion, the system returns the most probable record to the dialogue manager, along with the fields which have the greatest Levenshtein distance between the recognized and actual values, as candidates for reprompting.</Paragraph>
      <Paragraph position="8"> Our database contains 100 entries for the system tests described in this paper. We describe the system in a more demanding environment with one million records in Inouye et al. (2004). In that project, we required all information to be entered by spelling the items out so that the vocabulary was limited to the alphabet plus the ten digits. In the current project, with fewer names to deal with, we allowed the complete vocabulary of the domain: names, streets, counties, and so forth.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Response Generator
</SectionTitle>
      <Paragraph position="0"> Our current English-only system preserves the language-independent features of our original trilingual generator, storing all language- and domain-specific information in separate text files.</Paragraph>
      <Paragraph position="1"> It is a template-based system, easily modified and extended. The generator constructs utterances according to the dialogue manager's specification of one or more speech acts (prompt, request, confirm, respond, inform, backchannel, accept, reject), repetition numbers, and optional lists of attributes, values, and/or the person's name. As far as possible, we modeled utterances after the human-human dialogues.</Paragraph>
      <Paragraph position="2"> For a more natural-sounding system, we collected variations of the utterances, which the generator selects at random. Requests, for example, may take one of twelve possible forms: Request, part 1 of 2: Can you just confirm  |Can I have  |Can I take |</Paragraph>
      <Paragraph position="4"> Request, part 2 of 2: [list of attributes], [person name]?  |[list of attributes], please? Offers to close or continue the dialogue are similarly varied: Closing offer, part 1 of 2: Is there anything else  |Anything else  |Is there anything else at all Closing offer, part 2 of 2: I can do for you today?  |I can help you with today?  |I can do for you?  |I can help you with? | you need today?  |you need?</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="123456789" type="metho">
    <SectionTitle>
4 Preliminary Evaluation
</SectionTitle>
    <Paragraph position="0"> Ten native speakers of English, 6 female and 4 male, were asked to participate in a preliminary inlab system evaluation (half in the UK and half in the US). The Amities system developers were not among these volunteers. Each made 9 phone calls to the system from behind a closed door, according to scenarios designed to test various customer identities as well as single or multiple tasks. After each call, participants filled out a questionnaire to register their degree of satisfaction with aspects of the interaction.</Paragraph>
    <Paragraph position="1"> Overall call success was 70%, with 98% successful completions for the VerifyId and 96% for the CheckBalance subtasks (Figure 4).</Paragraph>
    <Paragraph position="2"> &amp;quot;Failures&amp;quot; were not system crashes but simulated transfers to a human agent. There were 5 user terminations.</Paragraph>
    <Paragraph position="3"> Average word error rates were 17% for calls that were successfully completed, and 22% for failed calls. Word error rate by user ranged from 11% to  Call duration was found to reflect the complexity of each scenario, where complexity is defined as the number of &amp;quot;concepts&amp;quot; needed to complete each task. The following items are judged to be concepts: task identification; values such as first name, last name, house number, street and phone number; and positive or negative responses such as whether a new card is desired. Figures 5 and 6 illustrate the relationship between length of call and task complexity. It should be noted that customer verification, a task performed in every dialogue, requires a minimum of 3 personal details to be verified against a database record, but may require more in the case of recognition errors.</Paragraph>
    <Paragraph position="4"> The overall average number of turns per dialogue was 18.28. The user spoke an average of 6.89 words per turn and the system 11.42.</Paragraph>
    <Paragraph position="5"> User satisfaction for each call was assessed by way of a questionnaire containing five statements. These covered the clarity of the instructions, ease of doing the task, how well the system understands the caller, how well the system works, and the caller's enjoyment of the system. Participants rated each on a five-point Likert scale. Summed results showed an average score of 20.45 over all users (range 5-25; higher = stronger agreement).</Paragraph>
    <Paragraph position="6">  Figures 5 and 6. Scenario Complexity (top) and Average Call Duration by Scenario (in seconds) Although user satisfaction was high, we were more interested in identifying the major problems for the callers. Users were often frustrated by recognition failures and/or unsuccessful attempts to capture values such as a new street address, county, or phone number. Sometimes the system failed to determine that the user was finished.</Paragraph>
    <Paragraph position="7"> Because the system is designed to be non-restrictive in accepting users' input, misidentifications were common. One user spoke for many by saying, &amp;quot;There were times when the system would ask me to repeat information I had not yet given to it.&amp;quot; We plan to revise our strategy so that we are a little more cautious in our reprompts. Occasionally, the system misidentified the user's desired task and had difficulty backing off gracefully and starting the correct task. We are working on improving our recovery strategies for these cases.</Paragraph>
    <Paragraph position="8"> The following transcription of an interaction with our system illustrates the efficiency made possible when the data supplied by the caller drives the conversation: AMITIES: Hello, this is Am-ih-tee-eh! What can I do for you? USER: yes this is Stella Lang. My card number is three one six seven one four. Can you tell me the balance on my account please? AMITIES: Can you just confirm the city and county? USER: uh yes it's London, Middlesex AMITIES: I'm sorry, I didn't understand what you said. Can you repeat, please? USER: London, Middlesex AMITIES: All right, your balance is 1753 pounds and 20 pence. Anything else I can do for you? USER: yes I'd like to make a payment of fifty three pounds with my debit card please AMITIES: Can I have the debit card number, Miss Lang? USER: sure, it's two eight seven four six six four nine five eight nine five three eight nine one, and it expires October two thousand six AMITIES: What's the issue number, please? USER: one AMITIES: Ok, Miss Lang, I'll send in the payment amount of 53 pounds. Did I get that right? USER: yes, what's my new balance please? AMITIES: Ok, that's gone through. Your payment will be credited overnight and will reflect on your next statement. Your balance is 1700 pounds and 20 pence. Anything else I can do for you today? USER: no, that's it thanks AMITIES: Thanks very much for your call, goodbye.</Paragraph>
    <Paragraph position="9"> USER: bye</Paragraph>
  </Section>
  <Section position="6" start_page="123456789" end_page="123456789" type="metho">
    <SectionTitle>
5 Discussion, Future Work
</SectionTitle>
    <Paragraph position="0"> The preliminary evaluation reported here indicates promise for an automated dialogue system such as ours, which incorporates robust techniques for information extraction, record matching, task identification, dialogue act classification, and an overall data-driven strategy. Task duration and number of turns per dialogue both appear to indicate greater efficiency and corresponding user satisfaction than many other similar systems. In the DARPA Communicator evaluation, for example, between 60 and 79 calls were made to each of 8 participating sites (Walker, et al., 2001, 2002). A sample scenario for a domestic round-trip flight contained 8 concepts (airline, departure city, state, date, etc.). The average duration for such a call was over 300 seconds; whereas our overall average was 104 seconds. ASR accuracy rates in 2001 were about 60% and 75%, for airline itineraries not completed and completed; and task completion rates were 56%. Our average number of user words per turn, 6.89, is also higher than that reported for Communicator systems. This number seems to reflect lengthier responses to open prompts, responses to system requests for multiple attributes, and greater user initiative.</Paragraph>
    <Paragraph position="1"> We plan to port the system to a new domain: from telephone banking to information-technology support. As part of this effort we are again collecting data from real human-human calls. For advanced speech recognition, we hope to train our ASR on new acoustic data. We also plan to expand our dialogue act classification so that the system can recognize more types of acts, and to improve our classification reliability.</Paragraph>
  </Section>
class="xml-element"></Paper>