<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2314"> <Title>Bootstrapping Spoken Dialog Systems with Data Reuse</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 AT&T Spoken Dialog System </SectionTitle> <Paragraph position="0"> Once a phone call is established, the dialog manager prompts the caller with either a pre-recorded or a synthesized greeting message. At the same time, it activates the top-level ASR grammar. The caller's speech is then translated into text and sent to the SLU, which replies with a semantic representation of the utterance. Based on the SLU reply and the implemented dialog strategy, the DM engages in a mixed-initiative dialog to drive the user towards the goal. The DM iterates the previously described steps until the call reaches a final state (e.g., the call is transferred to a CSR or an IVR, or the caller hangs up).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 ASR </SectionTitle> <Paragraph position="0"> Robust speech recognition is a critical component of a spoken dialog system. The speech recognizer uses trigram language models based on Variable N-gram Stochastic Automata (Riccardi et al., 1996). The acoustic models are subword-unit based, with triphone context modeling and a variable number of Gaussians (4-24). The output of the ASR engine (which can be the 1-best hypothesis or a lattice) is then used as the input of the SLU component.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 SLU </SectionTitle> <Paragraph position="0"> In a natural spoken dialog system, the definition of understanding depends on the application. In this work, we focus only on goal-oriented call classification tasks, where the aim is to classify the intent of the user into one of the predefined call-types. 
As a call classification example, consider the utterance from the previous example dialog, "I would like to know my account balance", in a customer care application. Assuming that the utterance is recognized correctly, the corresponding intent or call-type would be Request(Account Balance), and the action would be prompting for the account number and then telling the user the balance, or routing the call to the Billing Department.</Paragraph> <Paragraph position="1"> Classification can be achieved either by a knowledge-based approach, which depends heavily on an expert writing manual rules, or by a data-driven approach, which trains a classification model to be used at run-time. In our current system we consider both approaches. Data-driven classification has long been studied in the machine learning community. Typically, these classification algorithms train a classification model using features extracted from the training data. More formally, each object in the training data, o_i in O_a, is represented in the form (f_i, c_i), where f_i in F_a is the feature set and c_i, a subset of C_a, is the assigned set of classes for that object for the application a. In this study, we have used an extended version of a Boosting-style classification algorithm for call classification (Schapire, 2001), so that it is now possible to develop hand-written rules to cover low-frequency classes or bias the classifier decision for some of the classes. This is explained in detail in Schapire et al. (2002). In our previous work, we used rules to bootstrap the SLU models for new applications when no training data is available (Di Fabbrizio et al., 2002).</Paragraph> <Paragraph position="2"> Classification is employed for all utterances in all dialogs, as seen in the sample dialog in Figure 1. Thus, all the expressions the users can utter are classified into pre-defined call-types before starting an application. 
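As a rough illustration of such data-driven call classification, the sketch below trains a toy centroid model over word n-gram features. This is not the Boosting algorithm the system actually uses, and the call-type names and utterances are invented for the example:

```python
from collections import Counter, defaultdict

def ngrams(utterance, n_max=2):
    """Word n-gram features (unigrams and bigrams) extracted from an utterance."""
    words = utterance.lower().split()
    feats = Counter(words)
    for n in range(2, n_max + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats

class NgramCalltypeClassifier:
    """Toy centroid classifier over n-gram counts (NOT the paper's Boosting model)."""
    def __init__(self):
        self.profiles = defaultdict(Counter)

    def train(self, labeled_utterances):
        # Accumulate one n-gram profile per call-type.
        for text, calltype in labeled_utterances:
            self.profiles[calltype].update(ngrams(text))

    def classify(self, text):
        feats = ngrams(text)
        def score(ct):
            prof = self.profiles[ct]
            total = sum(prof.values()) or 1
            return sum(cnt * prof[f] / total for f, cnt in feats.items())
        best = max(self.profiles, key=score)
        norm = sum(score(ct) for ct in self.profiles) or 1.0
        return best, score(best) / norm  # call-type plus a crude confidence in [0, 1]

clf = NgramCalltypeClassifier()
clf.train([
    ("i would like to know my account balance", "Request(Account_Balance)"),
    ("how much do i owe on my bill", "Request(Account_Balance)"),
    ("i want to talk to a person", "Request(Agent)"),
    ("get me a human agent please", "Request(Agent)"),
])
calltype, conf = clf.classify("what is my account balance")
```

A real system would replace the centroid scoring with the trained Boosting model and calibrated confidence scores, but the feature extraction step is analogous.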
Even the utterances which do not contain any specific information content get a special call-type (e.g., Hello). So, in our case, the objects in O_a are utterances and the classes in C_a are call-types for a given application a.</Paragraph> <Paragraph position="3"> In the literature, in order to determine the application-specific call-types, first a wizard data collection is performed (Gorin et al., 1997). In this approach, a human, i.e., the wizard, acts like the system, though the users of the system do not know about this. This method turned out to be better than recording user-agent (human-human) dialogs, since the responses to machine prompts were found to be significantly different from responses to humans in terms of language characteristics.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 DM </SectionTitle> <Paragraph position="0"> In a mixed-initiative Spoken Dialog System, the Dialog Manager is the key component responsible for the human-machine interaction. The DM keeps track of the specific discourse context and provides disambiguation and clarification strategies when the SLU call-types are ambiguous or have associated low confidence scores. It also extracts other information from the SLU response in order to complete the information necessary to provide a service.</Paragraph> <Paragraph position="1"> Previous work on dialog management (Abella and Gorin, 1999) shows how an object inheritance hierarchy is a convenient way of representing the task knowledge and the relationships among the objects. A formally defined Construct Algebra describes the set of operations necessary to execute actions (e.g., replies to the user or motivators). Each dialog motivator consists of a small processing unit which can be combined according to the object hierarchy to build the application. 
Although this approach demonstrated effective results in different domains (Gorin et al., 1997; Buntschuh et al., 1998), it proposes a model which differs substantially from the call flow model broadly used to specify human-machine interaction.</Paragraph> <Paragraph position="2"> Building and maintaining large-scale voice-enabled applications requires a more direct mapping between specifications and the programming model, together with authoring tools that simplify the time-consuming implementation, debugging, and testing phases. Moreover, the DM requires support for broad protocols and standard interfaces to interact with modern enterprise backend systems (e.g., databases, HTTP servers, email servers, etc.). Alternatively, VoiceXML (vxm, 2003) provides the basic infrastructure to build spoken dialog systems, but the lack of SLU support and offline tools compromises its use in data-driven classification applications.</Paragraph> <Paragraph position="3"> Our approach proposes a general and scalable framework for Spoken Dialog Systems. Figure 2 depicts the logical DM framework architecture. The Flow Controller (FC) implements an abstraction of pluggable dialog strategy modules. Different algorithms can be implemented and made available to the DM engine. Our DM provides three basic algorithms. Traditional call routing systems are better described in terms of ATNs (Augmented Transition Networks) (Bobrow and Fraser, 1969). ATNs are attractive mechanisms for dialog specification since they are (a) an almost direct translation of call flow specifications, (b) easy to augment with specific mixed-initiative interactions, and (c) practical for managing extensive dialog context. Complex knowledge-based tasks can be described concisely by a variation of knowledge trees. Plan-based dialogs are effectively defined by rules and constraints. The FC provides a compact XML-based language for authoring the appropriate dialog strategy. 
Dialog strategy algorithms are encapsulated using object-oriented paradigms. This allows dialog authors to write sub-dialogs with different algorithms, depending on the nature of the task, use them interchangeably, and exchange variables through the local and global contexts. A complete description of the DM is outside the scope of this publication and will be covered elsewhere. We will focus our attention on the ATN module, which is the one used in our experiments. The ATN engine operates on the semantic representation provided by the SLU and on the current dialog context to control the interaction flow.</Paragraph> <Paragraph position="4"> 3 Bootstrapping a Spoken Dialog System This section describes how we bootstrap the main components of a spoken dialog system, namely the ASR, SLU, and DM. For all modules, we assume no data from the application domain is available.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Unsupervised Learning of Language Models </SectionTitle> <Paragraph position="0"> State-of-the-art speech recognition systems are generally trained using in-domain transcribed utterances, the preparation of which is labor-intensive and time-consuming.</Paragraph> <Paragraph position="1"> In this work, we re-train only the statistical language models and use an acoustic model trained on data from other applications. Typically, recognition accuracy improves as more data from the application domain is added to train the statistical language models (Rosenfeld, 1995).</Paragraph> <Paragraph position="2"> In our previous work, we proposed active and unsupervised learning techniques for reducing the amount of transcribed data needed to achieve a given word accuracy in automatic speech recognition, when some data (transcribed or untranscribed) is available from the application domain (Riccardi and Hakkani-Tür, 2003). 
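A minimal sketch of the underlying idea is to score candidate sentences with bigram models estimated from out-of-domain corpora, linearly interpolated. This is a standard interpolation recipe rather than the paper's actual VNSA models; the corpora, weights, and vocabulary size below are illustrative:

```python
from collections import Counter
import math

def bigram_counts(corpus):
    """Bigram counts over a list of sentences, with sentence-boundary markers."""
    c = Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        c.update(zip(toks, toks[1:]))
    return c

def avg_logprob(sentence, corpora_counts, weights, vocab_size=1000):
    """Average log-probability per bigram under a linear interpolation of
    add-one-smoothed bigram models, one per out-of-domain corpus."""
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    lp = 0.0
    for prev, w in zip(toks, toks[1:]):
        p = 0.0
        for counts, lam in zip(corpora_counts, weights):
            prev_total = sum(c for (a, _), c in counts.items() if a == prev)
            p += lam * (counts[(prev, w)] + 1) / (prev_total + vocab_size)
        lp += math.log(p)
    return lp / (len(toks) - 1)  # normalize so sentence length does not dominate

# Hypothetical out-of-domain sources: another application's dialogs and domain web text.
app_counts = bigram_counts(["i want to check my balance", "talk to an agent"])
web_counts = bigram_counts(["refill your prescription online", "prescription drug information"])

in_style = avg_logprob("i want to refill my prescription", [app_counts, web_counts], [0.5, 0.5])
off_style = avg_logprob("quarterly revenue grew strongly", [app_counts, web_counts], [0.5, 0.5])
```

An utterance that mixes the style of the dialog corpus with the domain vocabulary of the web pages scores higher than unrelated text, which is the intuition behind combining both sources.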
Iyer and Ostendorf (1999) examined various similarity techniques for selectively sampling out-of-domain data to enhance sparse in-domain data for statistical language models, and found that even the brute-force addition of out-of-domain data is useful. Venkataraman and Wang (2003) used maximum likelihood count estimation and document similarity metrics to select a single vocabulary from many corpora of varying origins and characteristics. In these studies, the assumption is that some domain data (transcribed and/or untranscribed) is available, and its n-gram distributions are used to extend that set with additional data.</Paragraph> <Paragraph position="3"> In this paper, we focus on the reuse of transcribed data from other resources, such as human-human dialogs (e.g.,</Paragraph> <Paragraph position="4"> the Switchboard Corpus (Godfrey et al., 1992)) or human-machine dialogs from other spoken dialog applications, as well as some text data from the web pages of the application domain. We examine style and content similarity when out-of-domain data is used to train statistical language models and no in-domain human-machine dialog data is available. Intuitively, the domain web pages could be useful for learning domain-specific vocabulary, while other application data can provide the stylistic characteristics of human-machine dialogs.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Call-type Classification with Data Reuse </SectionTitle> <Paragraph position="0"> The bottleneck in building reasonably performing classification models is the amount of time and money spent on high-quality labeling. 
By labeling, we mean assigning one or more predefined labels (call-types) to each utterance.</Paragraph> <Paragraph position="1"> In our previous work, in order to build call classification systems in shorter time frames, we employed active and unsupervised learning methods to selectively sample the data to label (Tur et al., 2003; Tur and Hakkani-Tür, 2003). We also incorporated manually written rules to bootstrap the Boosting classifier (Schapire et al., 2002) and used it in the AT&T HelpDesk application (Di Fabbrizio et al., 2002).</Paragraph> <Paragraph position="2"> In this study, we aim to reuse the existing labeled data from other applications to bootstrap a given application.</Paragraph> <Paragraph position="3"> The idea is to form a library of call-types along with the associated data and let the UE expert responsible for an application exploit this information source.</Paragraph> <Paragraph position="4"> Assume that there is an oracle which categorizes all the possible natural language sentences that can be uttered in any spoken dialog application we deal with. Let us denote this set of universal classes by C, such that the call-type set of a given application is a subset of it, C_a ⊆ C. It is intuitive that some of the call-types will appear in all applications, some in only one of them, etc.</Paragraph> <Paragraph position="5"> Thus, we categorize C_a into three sets: 1. Generic Call-types: These are the intents appearing independently of the application. A typical example would be a request to talk to a human instead of a machine. Call this set C_g = {c_i | c_i ∈ C_b, for all applications b}. 2. Re-usable Call-types: These are the intents which are not generic but have already been defined for a previous application (most probably from the same or a similar industry sector) and already have labeled data. 
Call this set C_a^r = {c_i | c_i ∈ C_b, for some other application b}. 3. Specific Call-types: These are the intents specific to the application, because of specific business needs or application characteristics. Call this set C_a^s.</Paragraph> <Paragraph position="7"> Now, for each application a, we have C_a = C_g ∪ C_a^r ∪ C_a^s. It is up to the UE expert to decide which call-types are specific or reusable, i.e., the sets C_a^r and C_a^s. Given that no two applications are the same, deciding whether to reuse a call-type along with its data is very subjective. There may be two applications including the intent Request(Account Balance), one from the telecommunications sector and the other from the pharmaceutical sector, and the wording can be slightly different. For example, while in one case we may have "How much is my last phone bill", in the other we may have "do I owe you anything on the medicine". Since each classifier can tolerate some amount of language variability and noise, we assume that if the names of the intents are the same, their contents are the same. Since in some cases this assumption does not hold, it is still an open problem to automatically select the portions of data to reuse.</Paragraph> <Paragraph position="8"> Since C_g appears in all applications by definition, it is the core set of call-types in a new application, n.</Paragraph> <Paragraph position="9"> Then, if the UE expert knows the possible reusable intents, C_n^r, existing in the application, they can be added too. The bootstrap classifier can then be trained using the utterances associated with the call-types C_g ∪ C_n^r in the call-type library. For the application-specific intents, C_n^s, it is still possible to augment the classifier with a few rules, as described in Schapire et al. (2002). 
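The partition into generic, reusable, and specific call-types can be sketched programmatically. The library contents, call-type names, and expert choices below are all hypothetical:

```python
# Call-type library accumulated from earlier applications: label -> example utterances.
library = {
    "Request(Call_Transfer)": ["i want to talk to a person", "agent please"],
    "Hello": ["hello", "hi there"],
    "Request(Account_Balance)": ["what is my account balance", "how much do i owe"],
    "Explain(Bill)": ["why is my phone bill so high"],
}

generic = {"Request(Call_Transfer)", "Hello"}          # C_g: appears in every application
reusable_for_new_app = {"Request(Account_Balance)"}    # C_n^r: chosen by the UE expert
specific_for_new_app = {"Ask(Drug_Interaction)"}       # C_n^s: no library data; needs rules

def bootstrap_training_set(library, generic, reusable):
    """Collect labeled utterances for C_g ∪ C_n^r, the portion of the new
    application's call-type set that can be trained from the library alone."""
    covered = generic | reusable
    return [(utt, ct) for ct in sorted(covered) for utt in library.get(ct, [])]

data = bootstrap_training_set(library, generic, reusable_for_new_app)
```

The specific call-types remain uncovered by the library data and would be handled with a few hand-written rules, as described above.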
This is also up to the expert to decide.</Paragraph> <Paragraph position="10"> Depending on the size of the library or the similarity of the new application to the existing ones, using this approach it is possible to cover a significant portion of the intents. For example, in our experiments, we have seen that 10% of the responses to the initial prompt are requests to talk to a human. Using this system, we have the capability to continue the dialog with the user and obtain the intent before sending them to a human agent.</Paragraph> <Paragraph position="11"> Using this approach, the application begins with a reasonably well-working understanding component. One can also consider this as a more complex wizard, depending on the bootstrap model.</Paragraph> <Paragraph position="12"> Another advantage of maintaining a call-type library and exploiting it by reuse is that it automatically ensures consistency in labeling and naming. Note that the design of call-types is not a well-defined procedure and, most of the time, it is up to the expert. Using this approach, it is possible to discipline the art of call-type design to some extent.</Paragraph> <Paragraph position="13"> After the system is deployed and real data is collected, the application-specific or other reusable call-types can be determined by the UE expert to get a complete picture.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Bootstrapping the Dialog Manager with Reuse </SectionTitle> <Paragraph position="0"> Mixed-initiative dialogs generally allow users to take control of the machine dialog flow at almost any time during the interaction. For example, during the course of a dialog aiming at a specific task, a user may utter a new intention (speech act) and deviate from the previously stated goal. 
Depending on the nature of the request, the DM strategy could either decide to shift to the different context (context shift) or re-prompt, providing additional information. Similarly, other dialog strategy patterns, such as correction, start-over, repeat, confirmation, clarification, contextual help, and the already mentioned context shift, are recurring features in a mixed-initiative system.</Paragraph> <Paragraph position="1"> Our goal is to derive an overall approach to dialog management that defines templates or basic dialog strategies based on the call-type structure. For the specific call routing task described in this paper, we generalized dialog strategy templates based on the categorization of the call-types presented in Section 3.2 and on best-practice user experience design.</Paragraph> <Paragraph position="2"> Generic call-types, such as Yes, No, Hello, Goodbye, Repeat, Help, etc., are domain-independent, but are handled in most reusable sub-dialogs with the specific dialog context. When detected in any dialog turn, they trigger context-dependent system replies such as informative prompts (Help), greetings (Hello), and summarization of the previous dialog turn using the dialog history (Repeat). In this case, the dialog will handle the request and resume execution when the information has been provided.</Paragraph> <Paragraph position="3"> The Yes and No generic call-types are used for confirmation if the system is expecting a yes/no answer, or are ignored with a system re-prompt in other contexts.</Paragraph> <Paragraph position="4"> Call-types are further categorized as vague and concrete. A request like "I have a question" will be classified as the vague Ask(Info) and will generate a clarification question: "OK. What is your question?" Concrete call-types categorize a clear routing request, and they activate a confirmation dialog strategy when they are classified with low confidence scores. Concrete call-types can also have associated mandatory or optional attributes. 
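A schematic of this confidence-driven strategy follows. The threshold values, the vague set, and the attribute table are illustrative assumptions, not the deployed DM's actual parameters:

```python
def dm_action(calltype, confidence, attributes,
              vague=frozenset({"Ask(Info)"}),
              mandatory={"Request(Account_Balance)": ["AccountNumber"]},
              accept=0.7, reject=0.3):
    """Sketch of one DM turn decision: re-prompt on very low confidence,
    clarify vague intents, confirm mid-confidence intents, then collect any
    missing mandatory attributes before routing. Thresholds are illustrative."""
    if confidence < reject:
        return "reprompt"                      # nothing usable was understood
    if calltype in vague:
        return "clarify"                       # e.g. "OK. What is your question?"
    if confidence < accept:
        return "confirm"                       # low-confidence concrete call-type
    missing = [a for a in mandatory.get(calltype, []) if a not in attributes]
    if missing:
        return "collect:" + ",".join(missing)  # gather mandatory attributes
    return "route"
```

For example, a high-confidence Request(Account_Balance) without an account number would yield `collect:AccountNumber`, while the same call-type at confidence 0.5 would first be confirmed.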
For instance, the concrete call-type Request(Account Balance) requires a mandatory attribute, AccountNumber (generally captured by the SLU), to complete the task.</Paragraph> <Paragraph position="5"> We generalized sub-dialogs to handle the most common call-type attributes (telephone number, account number, zip code, credit card, etc.), including a dialog container that implements the optimal flow for multiple inputs. A common top-level dialog handles the initial open-prompt requests. Reusable dialog templates are implemented as ATNs, where the actions are executed when the network arcs are traversed and are passed as parameters at run-time. Disambiguation of multiple call-types is not supported; we only consider the top-scoring call-type, assuming that multiple call-types with high confidence are rare events.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments and Results </SectionTitle> <Paragraph position="0"> For our experiments, we selected an application from the pharmaceutical domain to bootstrap. We have evaluated the performance of the ASR language model, call classifier, and dialog manager as described below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Speech Recognition Experiments </SectionTitle> <Paragraph position="0"> To bootstrap a statistical language model for ASR, we used human-machine spoken language data from two previous AT&T VoiceTone spoken dialog applications (App.</Paragraph> <Paragraph position="1"> 1, from the telecommunications domain, and App. 2, from the medical insurance domain). We also used some data from the application domain web pages (Web). Table 1 lists the sizes of these corpora. App. Training Data and App. Test Data correspond to the training and test data we have for the new application and are used for controlled experiments. 
We also extended the available corpora with human-human dialog data from the Switchboard corpus (SWBD) (Godfrey et al., 1992).</Paragraph> <Paragraph position="2"> Table 2 summarizes some style and content features of the available corpora. For simplification, we only compared the percentage of pronouns and filled pauses to show style differences, and the domain test data out-of-vocabulary word (OOV) rate for content variations.</Paragraph> <Paragraph position="3"> The human-machine spoken dialog corpora include many more pronouns than the web data. There are even further differences between the individual pronoun distributions.</Paragraph> <Paragraph position="4"> For example, out of all the pronouns in the web data, 35% are "you" and 0% are "I", whereas in all of the human-machine dialog corpora, more than 50% of the pronouns are "I". In terms of style, both spoken dialog corpora can be considered similar. In terms of content, the second application's data is the most similar corpus, as it results in the lowest OOV rate for the domain test data. In Table 3, we show further reductions in the App. test set OOV rate when we combine these corpora.</Paragraph> <Paragraph position="5"> Figure 3 shows the effect of using various corpora as training data for the statistical language models used in the recognition of the test data.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> In-Domain </SectionTitle> <Paragraph position="0"> (Tables 1-3: corpus columns App. 1, App. 2, SWBD, Web Data, App. Training Data, App. Test Data; OOV rate of the test data.)</Paragraph> <Paragraph position="1"> We also computed ASR run-time curves by varying the beam-width of the decoder, as the characteristics of the corpora affect the size of the language model. Content-wise, the most similar corpus (App. 2) resulted in the best-performing language model when the corpora are considered separately. 
We obtained the best recognition accuracy when we augmented the App. 2 data with App. 1 and the web data. The Switchboard corpus also resulted in reasonable performance, but it produced a very large language model, slowing down recognition. In that figure, we also show the word accuracy curve when we use in-domain transcribed data for training the language model.</Paragraph> <Paragraph position="2"> Once some data from the domain is available, it is possible to weight the available out-of-domain data and the web data during reuse to achieve further improvements. When we lack any in-domain data, we expect the UE expert to reuse the application data from the most similar sectors and/or combine all available data.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Call-type Classification Experiments </SectionTitle> <Paragraph position="0"> We performed the SLU tests using the Boostexter tool (Schapire and Singer, 2000). For all experiments, we used word n-grams of transcriptions as features and iterated Boostexter 1,100 times. In this study, we assumed that all candidate utterances are first recognized by the same automatic speech recognizer (ASR), so we deal only with text input of the same quality, which corresponds to the recognition output obtained using the language model trained from the App. 1, App. 2, and Web data.</Paragraph> <Paragraph position="1"> (Figure caption: various language models and pruning thresholds.)</Paragraph> <Paragraph position="2"> As the test set, we again used the 5,537 utterances collected from a pharmaceutical domain customer care application. We used a very limited library of call-types from a telecommunications domain application. We made controlled experiments where we know the true call-types of the utterances. 
In this application, we have 97 call-types with a fairly high perplexity of 32.81.</Paragraph> <Paragraph position="3"> If an utterance has a call-type which is covered by the bootstrapped model, we expect that call-type to get high confidence. Otherwise, we expect the model to reject it, either by assigning the special call-type Not(Understood), meaning that the intent in the utterance is known not to be covered, or by assigning some call-type with low confidence. Then we compute the rejection accuracy (RA) of the bootstrap model. In order to evaluate the classifier performance for the utterances whose call-types are covered by the bootstrapped model, we used classification accuracy (CA), which is the fraction of utterances in which the top-scoring call-type is one of the true call-types assigned by a human labeler and its confidence score is more than the threshold. These two measures are actually complementary to each other. For the complete model trained with all the training data, where all the intents are covered, these two metrics are the same.</Paragraph> <Paragraph position="4"> In order to see our upper bound, we first trained a classifier using 30,000 labeled utterances from the same application. The first row of Table 4 presents these results, using both the transcriptions of the test set and the ASR output with around 68% word accuracy. As the confidence threshold, we chose a hypothetical value of 0.3 for all experiments. As seen, 78.27% classification (or rejection) accuracy is the performance using all training data. This reduces to 61.73% when we use the ASR output, mostly because of the unrecognized words which are critical for the application. This is intuitive, since the ASR language model has not been trained with domain data.</Paragraph> <Paragraph position="5"> Then we trained a generic model using only generic call-types. 
This model achieved better accuracies, as seen in the second row, since we do not expect it to distinguish among the reusable or specific call-types. Furthermore, for classification accuracy we only use the portion of the test set whose call-types are covered by the model, and the call-types in this model are definitely easier than the specific ones. The drawback is that we only cover about half of the utterances. Using the ASR output, unlike the in-domain model case, did not hurt much, since the ASR already covers the utterances with generic call-types with great accuracy.</Paragraph> <Paragraph position="6"> We then trained a bootstrapped model using 13 call-types from the library and a few simple rules written manually for three frequent intents. Since the library consists of an application from a fairly different domain, we could only exploit intents related to billing, such as Request(Account Balance). While determining the call-types for which to write rules, we actually played the expert who has previous knowledge of the application. This enabled us to increase the coverage to 70.34%.</Paragraph> <Paragraph position="7"> The most impressive result of these experiments is that we have obtained a call classifier which is trained without any in-domain data and can handle most utterances with almost the same accuracy as the one trained with extensive amounts of data. Noting the weakness of our current call-type library, we expect even better performance as we add more call-types from ongoing applications.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Dialog Level Evaluation </SectionTitle> <Paragraph position="0"> Evaluation of spoken dialog system performance is a complex task and depends on the purpose of the desired dialog metric (Paek, 2001). 
While ASR and SLU can be fairly assessed off-line using utterances collected in previous runs of the baseline system, the dialog manager requires interaction with a real, motivated user who will cooperate with the system to complete the task. Ideally, the bootstrap system has to be deployed in the field and the dialogs have to be manually labeled to provide an accurate measure of the task completion rate. Usability metrics also require direct feedback from the caller to properly measure user satisfaction (specifically, task success and dialog cost) (Walker et al., 1997). However, we are more interested in automatically comparing the bootstrap system's performance with a reference system working on the same domain and with identical dialog strategies.</Paragraph> <Paragraph position="1"> As a first-order approximation, we reused the 3,082 baseline test dialogs (5,537 utterances) collected by the live reference system and applied the same dialog turn sequence to evaluate the bootstrap system. According to the reference system call flow, the 97 call-types covered by the reference classifier are clustered into 32 DM categories (DMC). A DMC is a generalization of more specific intents.</Paragraph> <Paragraph position="2"> The bootstrap system only classifies 16 call-types and 16 DMCs, according to the bootstrapping SLU design requirements described in Section 4.2. This is only half of the reference system's DMC coverage, but it actually addresses 70.34% of the total utterance classification task.</Paragraph> <Paragraph position="3"> We simulate the execution of the dialog using data collected from a deployed system, with the following procedure: for each dialog in the reference data set, we pass each utterance to the bootstrap classifier and select the result with the highest confidence score.</Paragraph> <Paragraph position="5"> We define two confidence score thresholds: θ_accept, for acceptance, and θ_reject, for rejection. 
Call-types whose confidence scores fall between these two thresholds are confirmed. The dialog is then considered successful if the corresponding condition is verified, assuming that the dialog did not contain any relevant user intention.</Paragraph> <Paragraph position="6"> A further experiment considers only the final routing destinations (e.g., a specific type of agent or the automatic fulfillment system destination). Both the reference and bootstrap systems direct calls to 12 different destinations, implying that a few DM categories are combined into the same destination. This quantifies how effectively the system routes callers to the right place in the call center and, conversely, gives some metric for evaluating missed automation and misrouted calls. The test has been executed for both transcribed and untranscribed utterances. Results are shown in Table 5. Even with a modest 50% DM category coverage, the bootstrap system shows an overall task completion of 67.27% for transcribed data and 57.39% using the output generated by the bootstrap ASR. When considering the route destinations, completion increases to 70.67% and 61.84%, respectively. This approach explicitly ignores the dialog context, but it accounts for the call-type categorization, the confirmation mechanism, and the final route destination, which would be missed in the SLU evaluation. Although a more complete evaluation analysis is needed, these lower-bound results are indicative of the overall performance.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Summary </SectionTitle> <Paragraph position="0"> This paper shows that, by bootstrapping a spoken dialog system with existing transcribed and labeled data from out-of-domain human-machine dialogs and with common reusable dialog templates and patterns, it is possible to achieve operational performance. 
Our evaluations on a call classification system using no domain-specific data indicate 67% ASR word accuracy, 79% SLU call classification accuracy with 70% coverage, and 62% routing accuracy with 50% DM coverage. Our future work consists of developing techniques to refine the bootstrap system when application domain data becomes available.</Paragraph> </Section> </Paper>