Bootstrapping Spoken Dialog Systems with Data Reuse
Giuseppe Di Fabbrizio, Gokhan Tur, Dilek Hakkani-Tür
AT&T Labs-Research,
Florham Park, NJ, 07932
{pino,gtur,dtur}@research.att.com
Abstract
Building natural language spoken dialog sys-
tems requires large amounts of human tran-
scribed and labeled speech utterances to reach
useful operational service performance. Fur-
thermore, the design of such complex systems
consists of several manual steps. The User
Experience (UE) expert analyzes and defines
by hand the system core functionalities: the
system semantic scope (call-types) and the di-
alog manager strategy which will drive the
human-machine interaction. This approach is
expensive and error-prone since it involves sev-
eral non-trivial design decisions that can only
be evaluated after the actual system deploy-
ment. Moreover, scalability is compromised by
time, costs and the high level of UE know-how
needed to reach a consistent design. We pro-
pose a novel approach for bootstrapping spo-
ken dialog systems based on reuse of existing
transcribed and labeled data, common reusable
dialog templates and patterns, generic language
and understanding models, and a consistent de-
sign process. We demonstrate that our ap-
proach reduces design and development time
while providing an effective system without
any application-specific data.
1 Introduction
Spoken dialog systems aim to identify intents of humans,
expressed in natural language, and take actions accord-
ingly, to satisfy their requests (Gorin et al., 2002). In a
natural spoken dialog system, typically, first the speaker’s
utterance is recognized using an automatic speech rec-
ognizer (ASR). Then, the intent of the speaker is identi-
fied from the recognized sequence, using a spoken lan-
guage understanding (SLU) component. This step can
be framed as a classification problem for goal-oriented
call routing systems (Gorin et al., 2002; Natarajan et al.,
2002, among others). Then, the user would be engaged in
a dialog via clarification or confirmation prompts if nec-
essary. The role of the dialog manager (DM) is to interact
in a natural way and help the user to achieve the task that
the system is designed to support.
In our case we only consider automated call routing
systems where the task is to reach the right route in a
large call center, which could be either a live opera-
tor or an automated system. An example dialog from
a telephone-based customer care application is given in
Figure 1. Typically the design of such complex sys-
tems consists of several manual steps, including analy-
sis of existing IVR (Interactive Voice Response) systems,
customer service representative (CSR) interviews, cus-
tomers’ testimonials, CSR training material, and, when
available, human-machine unconstrained speech record-
ings. Based on these heterogeneous requirements, the
User Experience (UE) expert analyzes and defines by
hand the system core functionalities: the system semantic
scope (call-types) and the dialog manager strategy which
will drive the human-machine interaction. Once the UE
expert designs the system, large amounts of transcribed
and labeled speech utterances are needed for building the
ASR and SLU models.
In our previous work, we have presented active and
unsupervised (or semi-supervised) learning algorithms
in order to reduce the amount of labeling effort needed
while building ASR and SLU systems (Tur et al., 2003;
Tur and Hakkani-Tür, 2003; Riccardi and Hakkani-Tür,
2003). There we focus on a single application, and only
the ASR and SLU components. In this study, we aim
to exploit the labeled and transcribed data and common
reusable dialog templates and patterns obtained from
similar previous applications to bootstrap the whole spo-
ken dialog system with ASR, SLU, and DM components.
The organization of this paper is as follows. Sec-
System: How may I help you?
User: Hello?
Call-type: Hello
System: Hello, how may I help you?
User: I have a question.
Call-type: Ask(Info)
System: OK, What is your question?
User: I would like to know my account balance.
Call-type: Request(Account Balance)
System: I can help you with that. What is your account number?
User: ...

Figure 1: An example natural language dialog
tion 2 describes briefly the AT&T Spoken Dialog Sys-
tem, which we use in this study, and its main components,
ASR, SLU, and DM. In Section 3 we present our method
to bootstrap ASR, SLU, and DM for a new application.
Section 4 presents our experiments using real data from a
customer care application.
2 AT&T Spoken Dialog System
Once a phone call is established, the dialog manager
prompts the caller with either a pre-recorded or synthe-
sized greeting message. At the same time, it activates
the top level ASR grammar. The caller speech is then
translated into text and sent to the SLU which replies with
a semantic representation of the utterance. Based on the
SLU reply and the implemented dialog strategy, the DM
engages in a mixed initiative dialog to drive the user to-
wards the goal. The DM iterates the previously described
steps until the call reaches a final state (e.g. the call is
transferred to a CSR or an IVR, or the caller hangs up).
2.1 ASR
Robust speech recognition is a critical component of
a spoken dialog system. The speech recognizer uses
trigram language models based on Variable N-gram
Stochastic Automata (Riccardi et al., 1996). The acous-
tic models are subword unit based, with triphone context
modeling and a variable number of Gaussians (4-24). The
output of the ASR engine (which can be the 1-best or a
lattice) is then used as the input of the SLU component.
2.2 SLU
In a natural spoken dialog system, the definition of "un-
derstanding" depends on the application. In this work,
we focus only on goal-oriented call classification tasks,
where the aim is to classify the intent of the user into
one of the predefined call-types. As a call classification
example, consider the utterance in the previous example
dialog I would like to know my account balance, in a cus-
tomer care application. Assuming that the utterance is
recognized correctly, the corresponding intent or the call-
type would be Request(Account Balance) and the action
would be prompting for the account number and then
telling the balance to the user or routing this call to the
Billing Department.
Classification can be achieved by either a knowledge-
based approach, which depends heavily on an expert writ-
ing manual rules, or a data-driven approach, which trains
a classification model to be used during run-time. In our
current system we consider both approaches. Data-driven
classification has long been studied in the machine learn-
ing community. Typically these classification algorithms
try to train a classification model using the features from
the training data. More formally, each object in the train-
ing data, x_i ∈ X_A, is represented in the form (F_i, C_i),
where F_i ∈ F_A is the feature set and C_i ⊆ C_A is the
assigned set of classes for that object for the application
A. In this study, we have used an extended version of a
Boosting-style classification algorithm for call classifica-
tion (Schapire, 2001), so that it is now possible to develop
hand-written rules to cover low-frequency classes or bias
the classifier decision for some of the classes. This is
explained in detail in Schapire et al. (2002). In our previ-
ous work, we have used rules to bootstrap the SLU mod-
els for new applications when no training data is avail-
able (Di Fabbrizio et al., 2002).
Classification is employed for all utterances in all di-
alogs, as seen in the sample dialog in Figure 1. Thus all
the expressions the users can utter are classified into pre-
defined call-types before starting an application. Even the
utterances which do not contain any specific information
content get a special call-type (e.g. Hello). So, in our case
the objects X_A are utterances and the classes C_A are
call-types for a given application A.
In the literature, in order to determine the application-
specific call-types, first a "wizard" data collection is per-
formed (Gorin et al., 1997). In this approach, a human,
i.e. the wizard, acts like the system, though the users of the
system do not know about this. This method turned out
to be better than recording user-agent (human-human) di-
alogs, since the responses to machine prompts are found
to be significantly different from responses to humans, in
terms of language characteristics.
2.3 DM
In a mixed-initiative Spoken Dialog System, the Dia-
log Manager is the key component responsible for the
human-machine interaction. The DM keeps track of the
specific discourse context and provides disambiguation
and clarification strategies when the SLU call-types are
[Figure 2 shows the DM architecture: the Flow Controller connects an Input Processor (SLU output, XML) and an Output Processor (VoiceXML) to pluggable dialog strategy modules: an ATN (Augmented Transition Network), a Clarification module (Knowledge Tree), and a Rule-Based module (Rules), which share Context, Concepts, and Actions.]

Figure 2: Dialog Manager Architecture
ambiguous or have associated low confidence scores. It
also extracts other information from the SLU response in
order to complete the information necessary to provide a
service.
Previous work on dialog management (Abella and
Gorin, 1999) shows how an object inheritance hierar-
chy is a convenient way of representing the task knowl-
edge and the relationships among the objects. A for-
mally defined Construct Algebra describes the set of op-
erations necessary to execute actions (e.g. replies to the
user or motivators). Each dialog motivator consists of
a small processing unit which can be combined accord-
ingly to the object hierarchy to build the application. Al-
though this approach demonstrated effective results in
different domains (Gorin et al., 1997; Buntschuh et al.,
1998), it proposes a model which substantially differs
from the call flow model broadly used to specify the
human-machine interaction.
Building and maintaining large-scale voice-enabled
applications requires a more direct mapping between
specifications and the programming model, together with au-
thoring tools that simplify the time-consuming imple-
mentation, debugging, and testing phases. Moreover, the
DM requires broad protocol and standard interface sup-
port to interact with modern enterprise backend systems
(e.g. databases, HTTP servers, email servers, etc.). Alter-
natively, VoiceXML (vxm, 2003) provides the basic in-
frastructure to build spoken dialog systems, but the lack of
SLU support and offline tools compromises its use in
data-driven classification applications.
Our approach proposes a general and scalable frame-
work for Spoken Dialog Systems. Figure 2 depicts the
logical DM framework architecture. The Flow Controller
(FC) implements an abstraction of pluggable dialog strat-
egy modules. Different algorithms can be implemented
and made available to the DM engine. Our DM provides
three basic algorithms. Traditional call routing systems
are better described in terms of ATNs (Augmented Tran-
sition Networks) (Bobrow and Fraser, 1969). ATNs are
attractive mechanisms for dialog specification since they
are (a) an almost direct translation of call flow specifica-
tions, (b) easy to augment with specific mixed-initiative
interactions, and (c) practical for managing extensive dialog con-
text. Complex knowledge-based tasks can be syntheti-
cally described by a variation of knowledge trees. Plan-
based dialogs are effectively defined by rules and con-
straints.
The FC provides a synthetic XML-based language
to author the appropriate dialog strategy. Dialog strat-
egy algorithms are encapsulated using object oriented
paradigms. This allows dialog authors to write sub-
dialogs with different algorithms, depending on the na-
ture of the task, use them interchangeably, and ex-
change variables through the local and global contexts.
A complete description of the DM is beyond the scope of
this publication and will be covered elsewhere. We will
focus our attention on the ATN module, which is the one
used in our experiments. The ATN engine operates on
the semantic representation provided by the SLU and the
current dialog context to control the interaction flow.
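As an illustration of the ATN mechanism described above, the following is a minimal sketch of how states, arcs, tests, and traversal-time actions might be represented. All names and the network itself are hypothetical, not taken from the AT&T system.

```python
# Minimal ATN-style dialog network sketch (hypothetical names):
# states are nodes, and each arc carries a test over the SLU result
# plus an action executed when the arc is traversed.

class Arc:
    def __init__(self, test, action, target):
        self.test = test        # predicate over (call_type, confidence)
        self.action = action    # action/prompt executed on traversal
        self.target = target    # next state name

def step(arcs, state, call_type, confidence):
    """Traverse the first matching arc out of `state`."""
    for arc in arcs[state]:
        if arc.test(call_type, confidence):
            return arc.action, arc.target
    return "reprompt", state    # no arc matched: stay and re-prompt

arcs = {
    "start": [
        Arc(lambda c, s: c == "Request(Account_Balance)" and s >= 0.3,
            "ask_account_number", "collect_account"),
        Arc(lambda c, s: c == "Ask(Info)",
            "clarify_question", "start"),
    ],
}

action, state = step(arcs, "start", "Ask(Info)", 0.9)
```

Because actions live on arcs rather than states, the same reusable network can be parameterized with application-specific prompts at run-time, as the paper describes for reusable dialog templates.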
3 Bootstrapping a Spoken Dialog System
This section describes how we bootstrap the main compo-
nents of a spoken dialog system, namely the ASR, SLU,
and DM. For all modules, we assume no data from the
application domain is available.
3.1 Unsupervised Learning of Language Models
State-of-the-art speech recognition systems are generally
trained using in-domain transcribed utterances, prepa-
ration of which is labor intensive and time-consuming.
In this work, we re-train only the statistical language
models, and use an acoustic model trained using data
from other applications. Typically, the recognition accu-
racy improves by adding more data from the application
domain to train statistical language models (Rosenfeld,
1995).
In our previous work, we have proposed active and un-
supervised learning techniques for reducing the amount
of transcribed data needed to achieve a given word ac-
curacy, for automatic speech recognition, when some
data (transcribed or untranscribed) is available from the
application domain (Riccardi and Hakkani-Tür, 2003).
Iyer and Ostendorf (1999) have examined various sim-
ilarity techniques to selectively sample out-of-domain
data to enhance sparse in-domain data for statistical lan-
guage models, and have found that even the brute addi-
tion of out-of-domain data is useful. Venkataraman and
Wang (2003) have used maximum likelihood count esti-
mation and document similarity metrics to select a sin-
gle vocabulary from many corpora of varying origins and
characteristics. In these studies, the assumption is that
there is some domain data (transcribed and/or untran-
scribed) available, and its n-gram distributions are used
to extend that set with additional data.
In this paper, we focus on the reuse of transcribed data
from other resources, such as human-human dialogs (e.g.
Switchboard Corpus, (Godfrey et al., 1992)), or human-
machine dialogs from other spoken dialog applications,
as well as some text data from the web pages of the ap-
plication domain. We examine the style and content sim-
ilarity, when out-of-domain data is used to train statis-
tical language models and when no in-domain human-
machine dialog data is available. Intuitively, the domain
web pages could be useful to learn domain-specific vo-
cabulary. Other application data can provide stylistic
characteristics of human-machine dialogs.
3.2 Call-type Classification with Data Reuse
The bottleneck of building reasonably performing classi-
fication models is the amount of time and money spent
on high-quality labeling. By "labeling", we mean assign-
ing one or more predefined label(s) (call-type(s)) to each
utterance.
In our previous work, in order to build call classi-
fication systems in shorter time frames, we have em-
ployed active and unsupervised learning methods to se-
lectively sample the data to label (Tur et al., 2003; Tur
and Hakkani-Tür, 2003). We have also incorporated
manually written rules to bootstrap the Boosting clas-
sifier (Schapire et al., 2002) and used it in the AT&T
HelpDesk application (Di Fabbrizio et al., 2002).
In this study, we aim to reuse the existing labeled data
from other applications to bootstrap a given application.
The idea is to form a library of call-types along with the
associated data and let the UE expert responsible for that
application exploit this information source.
Assume that there is an oracle which categorizes all
the possible natural language sentences which can be ut-
tered in any spoken dialog application we deal with. Let
us denote this set of universal classes with C, such that
the call-type set of a given application A is a subset of it,
C_A ⊆ C. It is intuitive that some of the call-types will
appear in all applications, some in only one of them, etc.
Thus, we categorize C_A into three:

1. Generic Call-types: These are the intents appearing
independent of the application. A typical example
would be a request for talking to a human instead of
a machine. Call this set

   G = { c | c ∈ C_i, ∀i }

2. Re-usable Call-types: These are the intents which
are not generic but have already been defined for a
previous application (most probably from the same
or similar industry sector) and already have la-
beled data. Call this set

   R_A = { c | c ∈ C_i, ∃i ≠ A }

3. Specific Call-types: These are the intents specific
to the application, because of specific business
needs or application characteristics. Call this set

   S_A = { c | c ∉ C_i, ∀i ≠ A }

Now, for each application A, we have

   C_A = G ∪ R_A ∪ S_A
It is up to the UE expert to decide which call-types are
specific or reusable, i.e. the sets R_A and S_A. Given
that no two applications are the same, deciding on
whether to reuse a call-type along with its data is very
subjective. There may be two applications including the
intent Request(Account Balance) one from a telecommu-
nications sector, the other from a pharmaceutical sector,
and the wording can be slightly different. For example
while in one case we may have How much is my last
phone bill, the other can be do I owe you anything on the
medicine. Since each classifier can tolerate some amount
of language variability and noise, we assume that if the
names of the intents are the same, their contents are the
same. Since in some cases, this assumption does not hold,
it is still an open problem to selectively sample the por-
tions of data to reuse automatically.
Since G appears in all applications by definition, it
is the core set of call-types in a new application, B.
Then, if the UE expert knows the possible reusable in-
tents, R_B, existing in the application, they can be added
too. The bootstrap classifier can then be trained using the
utterances associated with the call-types G ∪ R_B in
the call-type library. For the application-specific intents,
S_B, it is still possible to augment the classifier with a
few rules as described in Schapire et al. (2002). This is
also up to the expert to decide.
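The assembly of bootstrap training data from the call-type library can be sketched as follows. The library structure and all call-type names here are illustrative assumptions, not the paper's actual library.

```python
# Hypothetical sketch: build a bootstrap training set by pooling the
# labeled utterances of generic call-types (G) with those of the
# reusable call-types (R_B) selected by the UE expert for the new
# application B. Application-specific call-types stay uncovered.

def bootstrap_training_set(library, generic, reusable_for_b):
    """library maps call-type name -> list of labeled utterances."""
    selected = set(generic) | set(reusable_for_b)
    data = []
    for call_type, utterances in library.items():
        if call_type in selected:
            data.extend((u, call_type) for u in utterances)
    return data

library = {
    "Request(Call_Transfer)": ["I want to talk to a human"],
    "Request(Account_Balance)": ["how much is my last phone bill"],
    "Request(Phone_Upgrade)": ["I want a new phone"],  # too specific
}
data = bootstrap_training_set(
    library,
    generic=["Request(Call_Transfer)"],
    reusable_for_b=["Request(Account_Balance)"],
)
```

The pooled pairs can then be fed to the Boosting-style classifier exactly as in-domain labeled data would be.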
Depending on the size of the library or the similarity
of the new application to the existing ones, using this ap-
proach it is possible to cover a significant portion of the
intents. For example, in our experiments, we have seen
that 10% of the responses to the initial prompt are requests
to talk to a human. Using this system, we have the capa-
bility to continue the dialog with the user and get the
intent before sending them to a human agent.
Using this approach, the application begins with a rea-
sonably well-working understanding component. One
can also consider this as a more complex wizard, depend-
ing on the bootstrap model.
Another advantage of maintaining a call-type library
and exploiting it through reuse is that it automatically ensures
consistency in labeling and naming. Note that the design
of call-types is not a well-defined procedure and most of
the time it is up to the expert. Using this approach it is
possible to discipline the art of call-type design to some
extent.
After the system is deployed and real data is collected,
then the application-specific or other reusable call-types
can be determined by the UE expert to get a complete
picture.
3.3 Bootstrapping the Dialog Manager with Reuse
Mixed-initiative dialogs generally allow users to take
control over the machine dialog flow at almost any time
in the interaction. For example, during the course of a
dialog aimed at a specific task, a user may utter a new
intention (speech act) and deviate from the previously
stated goal. Depending on the nature of the request, the
DM strategy could either decide to shift to the differ-
ent context (context shift) or re-prompt, providing addi-
tional information. Similarly, other dialog strategy pat-
terns such as correction, start-over, repeat, confirmation,
clarification, contextual help, and the already mentioned
context shift, are recurring features in a mixed-initiative
system.
Our goal is to derive some overall approach to dialog
management that would define templates or basic dialog
strategies based on the call-type structure. For the spe-
cific call routing task described in this paper, we general-
ized dialog strategy templates based on the categorization
of the call-types presented in Section 3.2 and on best-practice
user experience design.
Generic call-types, such as Yes, No, Hello, Goodbye,
Repeat, Help, etc., are domain-independent, but are han-
dled in most reusable sub-dialogs with the specific dia-
log context. When detected in any dialog turn, they trig-
ger context-dependent system replies such as informative
prompts (Help), greetings (Hello), and summarization of
the previous dialog turn using the dialog history (Repeat).
In this case, the dialog will handle the request and resume
the execution when the information has been provided.
Yes and No generic call-types are used for confirmation if
the system is expecting a yes/no answer, or are ignored with
a system re-prompt in other contexts.
Call-types are further categorized as vague and con-
crete. A request like I have a question will be classified
as the vague Ask(Info) and will generate a clarification ques-
tion: OK. What is your question? Concrete call-types cat-
egorize a clear routing request, and they activate a confir-
mation dialog strategy when they are classified with low
confidence scores. Concrete call-types can also have asso-
ciated mandatory or optional attributes. For instance, the
concrete call-type Request(Account Balance) requires a
mandatory attribute AccountNumber (generally captured
by the SLU) to complete the task.
We generalized sub-dialogs to handle the most com-
mon call-type attributes (telephone number, account
number, zip code, credit card, etc.), including a dialog
container that implements the optimal flow for multiple
inputs. A common top-level dialog handles the initial
open prompt requests. Reusable dialog templates are im-
plemented as ATNs where the actions are executed when
the network arcs are traversed and are passed as parameters
at run-time. Disambiguation of multiple call-types is not
supported. We only consider the top-scoring call-type, as-
suming that multiple call-types with high confidence are
rare events.
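The dialog-strategy decisions described in this section (generic call-types trigger context-dependent replies, vague ones trigger clarification, low-confidence concrete ones trigger confirmation) can be sketched as a simple dispatch. The threshold value and set contents are illustrative, not the deployed system's configuration.

```python
# Hedged sketch of the generalized dialog strategy template:
# generic -> context-dependent reply, vague -> clarification,
# concrete with low confidence -> confirmation, otherwise route.

VAGUE = {"Ask(Info)"}
GENERIC = {"Hello", "Help", "Repeat", "Yes", "No"}
CONFIRM_THRESHOLD = 0.3  # illustrative confidence threshold

def next_move(call_type, confidence):
    if call_type in GENERIC:
        return "context_reply"   # e.g. greeting, help, repeat summary
    if call_type in VAGUE:
        return "clarification"   # "OK. What is your question?"
    if confidence < CONFIRM_THRESHOLD:
        return "confirmation"    # low-confidence concrete request
    return "route"               # concrete and confident: route the call
```

Only the top-scoring call-type is passed in, matching the assumption above that multiple high-confidence call-types are rare.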
4 Experiments and Results
For our experiments, we selected an application from the
pharmaceutical domain to bootstrap. We have evaluated
the performance of the ASR language model, call clas-
sifier, and dialog manager as described below.
4.1 Speech Recognition Experiments
To bootstrap a statistical language model for ASR, we
used human-machine spoken language data from two pre-
vious AT&T VoiceTone spoken dialog applications (App.
1 (telecommunication domain) and App. 2 (medical in-
surance domain)). We also used some data from the ap-
plication domain web pages (Web). Table 1 lists the sizes
of these corpora. "App. Training Data" and "App. Test
Data" correspond to the training and test data we have
for the new application and are used for controlled ex-
periments. We also extended the available corpora with
human-human dialog data from the Switchboard corpus
(SWBD) (Godfrey et al., 1992).
Table 2 summarizes some style and content features
of the available corpora. For simplification, we only
compared the percentage of pronouns and filled pauses
to show style differences, and the domain test data out-
of-vocabulary (OOV) rate for content variations.
The human-machine spoken dialog corpora include many
more pronouns than the web data. There are even further
differences between the individual pronoun distributions.
For example, out of all the pronouns in the web data, 35%
are "you" and 0% are "I", whereas in all of the human-
machine dialog corpora, more than 50% of the pronouns
are "I". In terms of style, both spoken dialog corpora can
be considered as similar. In terms of content, the second
application data is the most similar corpus, as it results in
the lowest OOV rate for the domain test data. In Table 3,
we show further reductions in the App. test set OOV rate,
when we combine these corpora.
Figure 3 shows the effect of using various corpora as
training data for statistical language models used in the
                   App. 1    App. 2    Web Data   App. Training Data   App. Test Data
No. of Utterances  35,551    79,792    NA         29,561               5,537
No. of Words       329,959   385,168   71,497     299,752              47,684

Table 1: ASR Data Characteristics
                              App. 1   App. 2   SWBD    Web Data   In-Domain Training Data
Percentage of Pronouns        15.14%   9.16%    14.8%   5.30%      14.5%
Percentage of Filled Pauses   2.66%    2.27%    2.74%   0%         3.26%
Test Set OOV Rate             9.79%    1.99%    2.64%   13.36%     1.02%

Table 2: Style and content differences among various data sources.
Corpora                            Test Set OOV Rate
App 1 + App 2 Data                 1.53%
App 1 + App 2 + Web Data           1.22%
App 1 + App 2 + Web + SWBD Data    0.88%

Table 3: Effect of training corpus combination on OOV
rate of the test data.
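The OOV rates in Tables 2 and 3 follow the standard definition: the fraction of test-set word tokens not covered by the training vocabulary. A small sketch with toy data (the word lists are illustrative, not the actual corpora):

```python
# Illustrative OOV-rate computation: combining training corpora
# merges their vocabularies, so the OOV rate can only stay the same
# or drop, as seen going down Table 3.

def oov_rate(train_corpora, test_tokens):
    vocab = set()
    for corpus in train_corpora:   # union of corpus vocabularies
        vocab.update(corpus)
    oov = sum(1 for w in test_tokens if w not in vocab)
    return oov / len(test_tokens)

app1 = ["how", "much", "is", "my", "bill"]
app2 = ["do", "i", "owe", "anything"]
test = ["how", "much", "do", "i", "owe", "on", "the", "medicine"]

r1 = oov_rate([app1], test)           # App 1 alone
r12 = oov_rate([app1, app2], test)    # App 1 + App 2 combined
```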
recognition of the test data. We also computed ASR run-
time curves by varying the beam-width of the decoder, as
the characteristics of the corpora affect the size of the
language model. Content-wise, the most similar corpus
(App. 2) resulted in the best performing language model
when the corpora are considered separately. We obtained
the best recognition accuracy when we augmented App. 2
data with App. 1 and the web data. The Switchboard corpus
also resulted in reasonable performance, but the prob-
lem is that it produced a very big language model, slow-
ing down the recognition. In that figure, we also show the
word accuracy curve when we use in-domain transcribed
data for training the language model.
Once some data from the domain is available, it is pos-
sible to weight the available out-of-domain data and the
web data during reuse, to achieve further improvements.
When we lack any in-domain data, we expect the UE ex-
pert to reuse the application data from the most similar
sectors and/or combine all available data.
4.2 Call-type Classification Experiments
We have performed the SLU tests using the Boostexter
tool (Schapire and Singer, 2000). For all experiments, we
have used word n-grams of transcriptions as features and
iterated Boostexter 1,100 times. In this study we have
assumed that all candidate utterances are first recognized
by the same automatic speech recognizer (ASR), so we
deal only with text input of the same quality, which cor-
responds to the recognitions obtained using the language
model trained from App. 1, App. 2, and Web data.
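The word n-gram features mentioned above can be extracted as follows. The paper does not state the maximum n-gram order, so the trigram cap here is an assumption:

```python
# Minimal sketch of word n-gram feature extraction for the Boosting
# classifier (up to trigrams here; the exact order used in the paper
# is not specified).

def ngram_features(utterance, max_n=3):
    words = utterance.lower().split()
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats

feats = ngram_features("I would like to know my account balance")
```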
[Figure 3 plots word accuracy (%) against run time (in real time), with curves for App. 1 Data, App. 2 Data, App. 1 and 2 Data, App. 1, 2 and Web Data, In-Domain Transcribed Data, and SWBD Data.]

Figure 3: The word accuracy of in-domain test data, using
various language models and pruning thresholds.
As the test set, we again used the 5,537 utterances col-
lected from a pharmaceutical domain customer care ap-
plication. We used a very limited library of call-types
from a telecommunications domain application. We have
made controlled experiments where we know the true
call-types of the utterances. In this application we have
97 call-types with a fairly high perplexity of 32.81.
If an utterance has a call-type which is covered by
the bootstrapped model, we expect that call-type to get
high confidence. Otherwise, we expect the model to re-
ject it by assigning the special call-type Not(Understood),
meaning that the intent in the utterance is known to be
not covered, or some call-type with low confidence. Then
we compute the rejection accuracy (RA) of the bootstrap
model:
   RA = (number of correctly rejected utterances) / (number of all utterances)
                                  Transcriptions                        ASR Output
                    Coverage   Rejection Acc.   Classification Acc.   Rejection Acc.   Classification Acc.
In-Domain Model     100.00%    78.27%           78.27%                61.73%           61.73%
Generic Model       45.78%     88.53%           95.38%                85.55%           91.08%
Bootstrapped Model  70.34%     79.50%           79.13%                71.86%           68.40%

Table 4: SLU results using transcriptions and ASR output with the models trained with in-domain data, only generic
call-types, and also with call-types from the library and rules.
In order to evaluate the classifier performance for the
utterances whose call-types are covered by the boot-
strapped model, we have used classification accuracy
(CA), which is the fraction of utterances in which the top-
scoring call-type is one of the true call-types assigned
by a human labeler and its confidence score is more than
some threshold:
   CA = (number of correctly classified utterances) / (number of all utterances)
These two measures are actually complementary to
each other. For the complete model trained with all the
training data, where all the intents are covered, these two
metrics are the same.
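The two complementary metrics can be computed as below. The 0.3 threshold matches the value used in the experiments; the data points are illustrative, and the split of utterances into covered and uncovered subsets is an assumption about how the scoring is organized.

```python
# Hedged sketch of the RA/CA computation: RA is measured over
# utterances whose true intent is NOT covered by the bootstrap model,
# CA over utterances whose true intent IS covered.

THRESHOLD = 0.3
REJECT = "Not(Understood)"

def rejection_accuracy(preds):
    """preds: (predicted_call_type, confidence) pairs for utterances
    with uncovered true intents; rejected = REJECT or low confidence."""
    rejected = sum(1 for p, conf in preds
                   if p == REJECT or conf < THRESHOLD)
    return rejected / len(preds)

def classification_accuracy(preds):
    """preds: (true_call_types, predicted, confidence) triples for
    covered utterances; correct = true call-type with high confidence."""
    correct = sum(1 for true, p, conf in preds
                  if p in true and conf >= THRESHOLD)
    return correct / len(preds)

ra = rejection_accuracy([("Not(Understood)", 0.9),
                         ("Hello", 0.1),
                         ("Hello", 0.9)])
ca = classification_accuracy([({"Request(Account_Balance)"},
                               "Request(Account_Balance)", 0.8),
                              ({"Ask(Info)"}, "Hello", 0.9)])
```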
In order to see our upper bound, we first trained a clas-
sifier using 30,000 labeled utterances from the same ap-
plication. The first row of Table 4 presents these results using
both the transcriptions of the test set and the ASR
output with around 68% word accuracy. As the confi-
dence threshold we have chosen a hypothetical value of
0.3 for all experiments. As seen, 78.27% classification
(or rejection) accuracy is the performance using all train-
ing data. This reduces to 61.73% when we use the ASR
output. This is mostly because of the unrecognized words
which are critical for the application. This is intuitive,
since the ASR language model has not been trained with do-
main data.
Then we trained a generic model using only generic
call-types. This model achieved better accuracies, as
seen in the second row, since we do not expect it to distin-
guish among the reusable or specific call-types. Further-
more, for classification accuracy we only use the portion
of the test set whose call-types are covered by the model,
and the call-types in this model are definitely easier than
the specific ones. The drawback is that we only cover
about half of the utterances. Using the ASR output, un-
like in the in-domain model case, did not hurt much, since
the ASR already covers the utterances with generic call-
types with great accuracy.
We then trained a bootstrapped model using 13 call-
types from the library and a few simple rules written
manually for three frequent intents. Since the library
consists of an application from a fairly different domain,
we could only exploit intents related to billing, such as
Request(Account Balance). While determining the call-
types to write rules for, we actually played the expert who
has prior knowledge of the application. This enabled
us to increase the coverage to 70.34%.
The most impressive result of these experiments is
that we have obtained a call classifier which is trained with-
out any in-domain data and can handle most utterances
with almost the same accuracy as the one trained with extensive
amounts of data. Noting the weakness of our current call-
type library, we expect even better performance as we
add more call-types from on-going applications.
4.3 Dialog Level Evaluation
Evaluation of spoken dialog system performance is a
complex task and depends on the purpose of the desired
dialog metric (Paek, 2001). While ASR and SLU can
be fairly assessed off-line using utterances collected in
previous runs of the baseline system, the dialog manager
requires interaction with a real, motivated user who will
cooperate with the system to complete the task. Ideally,
the bootstrap system has to be deployed in the field and
the dialogs have to be manually labeled to provide an accu-
rate measure of the task completion rate. Usability metrics
also require direct feedback from the caller to properly
measure user satisfaction (specifically, task success
and dialog cost) (Walker et al., 1997). However, we are
more interested in automatically comparing the bootstrap
system performance with a reference system working
on the same domain and with identical dialog strategies.
As a first-order approximation, we reused the 3,082 baseline test dialogs (5,537 utterances) collected by the live reference system and applied the same dialog turn sequence to evaluate the bootstrap system. According to the reference system call flow, the 97 call-types covered by the reference classifier are clustered into 32 DM categories (DMCs). A DMC is a generalization of more specific intents.
The bootstrap system classifies only 16 call-types and, accordingly, 16 DMCs, following the bootstrapping SLU design requirements described in Section 4.2. This is only half of the reference system DMC coverage, but it actually addresses 70.34% of the total utterance classification task.
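The clustering of specific call-types into coarser DM categories can be pictured as a simple many-to-one mapping; all call-type and category names below are invented for illustration:

```python
# Hedged sketch of clustering call-types into coarser DM categories
# (DMCs): a many-to-one mapping from specific intents to the category
# that drives the dialog strategy. All names below are invented.

CALLTYPE_TO_DMC = {
    "Request(Account_Balance)": "Billing",
    "Explain(Bill)": "Billing",
    "Report(Service_Outage)": "Repair",
    "Request(New_Service)": "Sales",
}

def dm_category(call_type):
    # Call-types outside the mapping fall back to a generic category.
    return CALLTYPE_TO_DMC.get(call_type, "Other")

print(dm_category("Explain(Bill)"))
# -> Billing
```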
We simulate the execution of the dialog using data collected from a deployed system, with the following procedure: for each dialog d_i in the reference data set, we pass the utterance u to the bootstrap classifier and select the call-type c_t with the highest confidence score. We use two confidence score thresholds, θ_acc for acceptance and θ_rej for rejection. Call-types whose confidence scores fall between these two thresholds are confirmed. Then:

1. the dialog is considered successful if the following condition is verified for some turn t:

   score(c_t) ≥ θ_acc ∧ c_t ∈ DMC_t ∧ c_t ∈ Concrete

   where θ_acc is the acceptance threshold, DMC_t is the manually labeled reference DM category set for turn t, and Concrete is the set of concrete call-types;

2. the dialog is considered successful with confirmation if the following condition is verified for some turn t:

   θ_rej ≤ score(c_t) < θ_acc ∧ c_t ∈ DMC_t ∧ c_t ∈ Concrete

   where θ_rej is the rejection threshold;

3. if in any turn of the dialog a mismatching concrete call-type is found, the dialog is considered unsuccessful:

   score(c_t) ≥ θ_acc ∧ c_t ∉ DMC_t ∧ c_t ∈ Concrete

If none of the conditions above is satisfied, we apply steps 1 and 3 with a lower threshold and c_t ∈ Vague ∨ c_t ∈ Generic, assuming that the dialog did not contain any relevant user intention.

                                  DM Category              DM Route
                             Transcribed  ASR output  Transcribed  ASR output
concrete                        44.65%      34.13%      47.18%      36.99%
concrete+conf                   50.78%      42.20%      53.47%      44.32%
concrete+conf+vague/generic     67.27%      57.39%      70.67%      61.84%

Table 5: DM evaluation results.
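The success criteria above can be sketched in a few lines. Each turn carries the top call-type, its confidence, the reference DM category set, and whether the call-type is concrete; the threshold values and the data layout are assumptions for the example, not figures from the paper, and the vague/generic fallback is omitted for brevity:

```python
# Minimal sketch of the dialog-level success criteria. Each turn is
# (call_type, confidence, reference_dmcs, is_concrete). Threshold
# values and the data layout are illustrative assumptions.

ACC, REJ = 0.6, 0.3  # acceptance / rejection thresholds (illustrative)

def evaluate_dialog(turns):
    """Return 'success', 'success_with_confirmation', or 'failure'."""
    # A confidently accepted but mismatched concrete call-type fails
    # the dialog outright (condition 3).
    for c, conf, ref, concrete in turns:
        if conf >= ACC and concrete and c not in ref:
            return "failure"
    # Some turn confidently yields a matching concrete call-type
    # (condition 1).
    if any(conf >= ACC and concrete and c in ref
           for c, conf, ref, concrete in turns):
        return "success"
    # A matching concrete call-type in the confirmation band
    # (condition 2).
    if any(REJ <= conf < ACC and concrete and c in ref
           for c, conf, ref, concrete in turns):
        return "success_with_confirmation"
    return "failure"

print(evaluate_dialog([("Request(Account_Balance)", 0.9,
                        {"Request(Account_Balance)"}, True)]))
# -> success
```

Checking the mismatch condition first mirrors the fact that a confidently misrouted concrete intent cannot be recovered later in the dialog.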
A further experiment considers only the final routing destinations (e.g., a specific type of agent or the automatic fulfillment system destination). Both the reference and bootstrap systems direct calls to 12 different destinations, implying that a few DM categories are combined into the same destination. This quantifies how effectively the system routes callers to the right place in the call center and, conversely, gives some metric to evaluate missed automation and misrouted calls. The test has been executed for both transcribed and untranscribed utterances. Results are shown in Table 5. Even with a modest 50% DM category coverage, the bootstrap system shows an overall task completion of 67.27% in the case of transcribed data and 57.39% using the output generated by the bootstrap ASR. When considering the route destinations, completion increases to 70.67% and 61.84%, respectively. This approach explicitly ignores the dialog context, but it accounts for the call-type categorization, the confirmation mechanism, and the final route destination, which would be missed in an SLU-only evaluation. Although a more complete evaluation analysis is needed, these lower-bound results are indicative of the overall performance.
5 Summary
This paper shows that by bootstrapping a spoken dialog system, reusing existing transcribed and labeled data from out-of-domain human-machine dialogs together with common reusable dialog templates and patterns, it is possible to achieve operational performance. Our evaluations on a call classification system using no domain-specific data indicate 67% ASR word accuracy, 79% SLU call classification accuracy with 70% coverage, and 62% routing accuracy with 50% DM coverage. Our future work consists of developing techniques to refine the bootstrap system when application domain data become available.
Acknowledgments
We would like to thank Liz Alba, Lee Begeja, Harry
Blanchard, David Gibbon, Patrick Haffner, Zhu Liu,
Mazin Rahim, Bernard Renger, Behzad Shahraray, and
Jay Wilpon for many helpful discussions.
References
A. Abella and A. Gorin. 1999. Construct algebra: An-
alytical dialog management. In Proceedings of the
Annual Meeting of the Association for Computational
Linguistics (ACL), Washington, D.C., June.
D. Bobrow and B. Fraser. 1969. An augmented state transition network analysis procedure. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 557–567, Washington, D.C., May.

B. Buntschuh, C. Kamm, G. Di Fabbrizio, A. Abella, M. Mohri, S. Narayanan, I. Zeljkovic, R. D. Sharp, J. Wright, S. Marcus, J. Shaffer, R. Duncan, and J. G. Wilpon. 1998. VPQ: A spoken language interface to large scale directory information. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Sydney, Australia, November.

G. Di Fabbrizio, D. Dutton, N. Gupta, B. Hollister, M. Rahim, G. Riccardi, R. Schapire, and J. Schroeter. 2002. AT&T help desk. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, September.

J. J. Godfrey, E. C. Holliman, and J. McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 517–520, San Francisco, USA, March.

A. L. Gorin, G. Riccardi, and J. H. Wright. 1997. How May I Help You?. Speech Communication, 23:113–127.

A. L. Gorin, G. Riccardi, and J. H. Wright. 2002. Automated natural spoken dialog. IEEE Computer Magazine, 35(4):51–56, April.

R. Iyer and M. Ostendorf. 1999. Relevance weighting for combining multi-domain data for n-gram language modeling. Computer Speech and Language, 13:267–282.
P. Natarajan, R. Prasad, B. Suhm, and D. McCarthy.
2002. Speech enabled natural language call routing:
BBN call director. In Proceedings of the International
Conference on Spoken Language Processing (ICSLP),
Denver, CO, September.
T. Paek. 2001. Empirical methods for evaluating dia-
log systems. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics (ACL)
Workshop on Evaluation Methodologies for Language
and Dialogue Systems, Toulouse, France, July.
G. Riccardi and D. Hakkani-Tür. 2003. Active and unsupervised learning for automatic speech recognition. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), Geneva, Switzerland, September.

G. Riccardi, R. Pieraccini, and E. Bocchieri. 1996. Stochastic automata for language modeling. Computer Speech and Language, 10:265–293.

R. Rosenfeld. 1995. Optimizing lexical and n-gram coverage via judicious use of linguistic data. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), volume 2, pages 1763–1766, Madrid, Spain, September.

R. E. Schapire and Y. Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168.
R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta.
2002. Incorporating prior knowledge into boosting. In
Proceedings of the International Conference on Ma-
chine Learning (ICML), Sydney, Australia, July.
R. E. Schapire. 2001. The boosting approach to machine learning: An overview. In Proceedings of the MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, March.

G. Tur and D. Hakkani-Tür. 2003. Unsupervised learning for spoken language understanding. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), Geneva, Switzerland, September.

G. Tur, R. E. Schapire, and D. Hakkani-Tür. 2003. Active learning for spoken language understanding. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, May.

A. Venkataraman and W. Wang. 2003. Techniques for effective vocabulary selection. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), pages 245–248, Geneva, Switzerland, September.

W3C. 2003. Voice Extensible Markup Language (VoiceXML) version 2.0. http://www.w3.org/TR/voicexml20/.

M. A. Walker, D. J. Litman, C. A. Kamm, and A. Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) - Conference of the European Chapter of the Association for Computational Linguistics (EACL), Madrid, Spain, July.