Evaluating Automatic Dialogue Strategy Adaptation for a 
Spoken Dialogue System 
Jennifer Chu-Carroll 
Lucent Technologies Bell Laboratories 
600 Mountain Avenue 
Murray Hill, NJ 07974, U.S.A. 
jencc @research.bell-labs.corn 
Jill Suzanne Nickerson 
Harvard University 
Cambridge, MA 02138, U.S.A. 
nickerso @eecs.harvard.edu 
Abstract 
In this paper, we describe an empirical evaluation of an 
adaptive mixed initiative spoken dialogue system. We 
conducted two sets of experiments to evaluate the mixed 
initiative and automatic adaptation aspects of the system, 
and analyzed the resulting dialogues along three dimen- 
sions: performance factors, discourse features, and ini- 
tiative distribution. Our results show that 1) both the 
mixed initiative and automatic adaptation aspects led 
to better system performance in terms of user satisfac- 
tion and dialogue efficiency, and 2) the system's adap- 
tation behavior better matched user expectations, more 
efficiently resolved dialogue anomalies, and resulted in 
higher overall dialogue quality. 
1 Introduction 
Recent advances in speech technologies have enabled 
spoken dialogue systems to employ mixed initiative di- 
alogue strategies (e.g. (Allen et al., 1996; Sadek et al., 
1996; Meng et al., 1996)). Although these systems inter- 
act with users in a manner more similar to human-human 
interactions than earlier systems employing system ini- 
tiative strategies, their response strategies are typically 
selected using only local dialogue context, disregarding 
dialogue history. Therefore, their gain in naturalness and 
performance under optimal conditions is often overshad- 
owed by their inability to cope with anomalies in dia- 
logues by automatically adapting dialogue strategies. In 
contrast, Figure 1 shows a dialogue in which the sys- 
tem automatically adapts dialogue strategies based on 
the current user utterance and dialogue history. 1 Af- 
ter failing to obtain a valid response to an information- 
seeking query in utterance (4), the system adapted dia- 
logue strategies to provide additional information in (6) 
that assisted the user in responding to the query. Further- 
more, after the user responded to a limited system prompt 
in (10) with a fully-specified query in (11), implicitly 
indicating her intention to take charge of the problem- 
IS and U indicate system and user utterances, respectively. The 
words appearing in square brackets are the output from the Lucent 
Automatic Speech Recognizer (Reichl and Chou, 1998; Ortmanns et 
al., 1999), configured to use class-based probabilistic n-gram  
models. The task and dialogue initiative annotations are explained in 
Section 2.1. 
solving process, the system again adapted strategies, 
hence providing an open-ended prompt in (13). 
Previous work has shown that dialogue systems in 
which users can explicitly change the system's dia- 
logue strategies result in better performance than non- 
adaptable systems (Litman and Pan, 1999). However, 
no earlier system allowed for initiative-oriented auto- 
matic strategy adaptation based on information dynam- 
ically extracted from the user's spoken input. In this 
paper, we briefly introduce MIMIC, a mixed initiative 
spoken dialogue system that automatically adapts dia- 
logue strategies. We then describe two experiments that 
evaluated the effectiveness of MIMIC's mixed initiative 
and automatic adaptation capabilities. Our results show 
that, when analyzed along the performance dimension, 
MIMIC's mixed initiative and automatic adaptation fea- 
tures lead to more efficient dialogues and higher user sat- 
isfaction. Moreover, when analyzed along the discourse 
and initiative dimensions, MIMIC's adaptation capabil- 
ities result in dialogues in which system behavior better 
matches user expectations and dialogue anomalies are re- 
solved more efficiently. 
2 MIMIC: An Adaptive Mixed Initiative 
Dialogue System 
MIMIC is a voice-enabled, telephone-based dialogue 
system that provides movie showtime information. Two 
main features distinguish MIMIC from other existing 
spoken dialogue systems. First, MIMIC automatically 
adapts dialogue strategies based on dialogue character- 
istics and thus tailors its behavior to the dialogue con- 
text. Second, MIMIC's initiative module is decoupled 
from the dialogue manager's goal and strategy selection 
processes, thereby enabling the system's adaptation be- 
havior to be easily modified. We outline characteristics 
of MIMIC relevant to our experiments and analyses. For 
further details about the system, see (Chu-Carroll, 2000). 
2.1 Adaptation of Dialogue Strategies 
Figure 1 showed an actual interaction between MIMIC 
and a user. The task and dialogue initiative holders 
annotated for each system turn are the factors that af- 
fect MIMIC's dialogue strategy adaptation. Following 
(Chu-Carroll and Brown, 1998), task initiative tracks the 
202 
Utterance 
(1) S: Hello, this is MIMIC, the movie information system. 
(2) How can I help you? 
(3) U: What time is Analyze This playing in Montclair? 
\[what time is analyze this playing in montclair\] 
(4) S: What theater would you like? User 
(5) U: Montclair. 
\[montclair\] 
(6) S: Choices for theater are Wellmont Theater and Clearviews Screening Zone. System 
(7) What theater would you like? 
(8) U: Clearviews Screening Zone. 
lure movie screening zone\] 
(9) S: Analyze This is playing at Clearviews Screening Zone in Montclair at 1:15pm, 4:10pro, System 
and 6:30pm. 
(10) Please say the name of the movie or theater or town you would like information about. 
(11) U: What time is Analyze This playing at the Wellmont Theater? 
\[what time is analyze this playing at the wellmont theater\] 
(12) S: Analyze this is playing at Wellmont Theater in Montclair at 3:15pm, 5:45pm, 8:20pm, User 
and 10:10pm. 
(13) Can I help you with anything else? 
Task Dialogue 
Initiative Initiative 
User System 
System 
System 
System 
System 
Figure 1: An Adaptive Mixed Initiative Dialogue with MIMIC 
lead in the process toward achieving the dialogue partici- 
pants' domain goal, while dialogue initiative models the 
lead in determining the current discourse focus. In our 
information query application domain, MIMIC has task 
(and thus dialogue) initiative when its utterances are in- 
tended to provide helpful guidance toward achieving the 
user's domain goal, while it has dialogue but not task 
initiative if its utterances only specify the current dis- 
course goal. 2 For example, as a result of MIMIC taking 
over task initiative in (6), helpful guidance, in the form 
of valid response choices, was provided in its attempt 
to obtain a theater name after the user failed to answer 
an earlier question intended to solicit this information. 
In (4), MIMIC specified the current discourse goal (re- 
questing information about a missing theater) but did not 
suggest valid response choices since it only had dialogue 
initiative. 
MIMIC's ability to automatically adapt dialogue 
strategies is achieved by employing an initiative mod- 
ule that determines initiative distribution based on par- 
ticipant roles, cues detected during the current user ut- 
terance, and dialogue history (Chu-Carroll and Brown, 
1998). This initiative framework utilizes the Dempster- 
Shafer theory (Shafer, 1976; Gordon and Shortliffe, 
1984), and represents the current initiative distribution as 
two basic probability assignments (bpas) that signify the 
overall amount of evidence supporting each agent hav- 
ing task and dialogue initiatives. The effects that a cue 
has on changing the current task and dialogue initiative 
distribution are also represented as bpas, obtained using 
an iterative training procedure on a corpus of transcribed 
21n the dialogues collected in our experiments, which are described 
in Section 3, there are system turns in which MIMIC had neither task 
nor dialogue initiative. However, such cases are rare in this domain and 
will not be discussed in this paper. 
and annotated human-human dialogues. At the end of 
each user turn, the bpas representing the effects of cues 
detected during that turn are combined with the bpas rep- 
resenting the current initiative distribution to obtain the 
initiative distribution for the system's next turn. 
In Figure 1, utterance (3) triggered the cue Ambigu- 
ousAction since the town of Montclair has multiple the- 
aters. Although AmbiguousAction contributed to the sys- 
tem having both initiatives in the next turn, the effect of 
this cue, when combined with the initiative distribution 
of (3), resulted in MIMIC having dialogue but not task 
initiative in (4). However, in addition to triggering Am- 
biguousAction, utterance (5) also triggered NoNewlnfo, 
since it did not contribute any new information. The ef- 
fect of these two cues, combined with the initiative dis- 
tribution in (5), was enough to lead MIMIC to take over 
task initiative in (6). 
2.2 Modification of Adaptation Behavior 
MIMIC employs an initiative module which is decou- 
pied from its goal selection process, while allowing the 
outcome of both processes to guide its response strat- 
egy selection. A major advantage of this design is that 
changes in the system's strategy selection behavior, and 
thus its adaptation behavior, can be achieved by modi- 
fying parameters in the initiative module, while leaving 
the goal and strategy selection processes intact. The ini- 
tiative module determines initiative distribution based on 
participant roles, cues observed in the current user ut- 
terance, and dialogue history. Although all three factors 
may play a role in modifying MIMIC's adaptation be- 
havior, we discuss how the effects of participant roles 
and dialogue history are modified to create systems used 
as bases of comparison in our experiments. 
Participant roles affect the default initiative distribu- 
203 
Utterance 
(14) S: Hello, this is MIMIC, the movie information system. 
(15) Please say the name of the movie or theater or town you would like information about. 
(16) U: Analyze This. 
\[analyze this1 
(17) S: Choices for task are when and where. System 
(18) What would you like to know about Analyze This? 
(19) U: When. 
\[when is\] 
(20) S: Choices for theater are Middlebrook Galeria, Loews East Hanover, Montgomery Cinemas, System 
Atlantic Triplex Cinema, and others. 
(21) What theater would you like? 
(22) U: Montclair. 
\[montclair\] 
(23) S: Choices for theater are Wellmont Theater and Clearviews Screening Zone. System 
(24) What theater would you like? 
(25) U: Wellmont. 
\[wellmontl 
(26) S: Analyze This is playing at Wellmont Theater in Montclair at 3:15pm, 5:45pm, 8:20pm, System 
and 10:10pm. 
(27) Please say the name of the movie or theater or town you would like information about. 
(28) U: Analyze This in Montclair. 
\[analyze this in montclair\] 
(29) S: Choices for task are location, what, where, and when. System 
(30) What would you like to know about Montclair? 
(31) U: When. 
(32) \[when\] 
(33) S: Choices for theater are Wellmont Theater and Clearviews Screening Zone System 
(34) What theater would you like? 
(35) U: Clearviews Screening Zone. 
\[screening zone\] 
(36) S: Analyze This is playing at Clearviews Screening Zone in Montclair at 1:15pm, 4:10pm, System 
and 6:30pm. 
(37) Please say the name of the movie or theater or town you would like information about. 
Figure 2: A System Initiative Dialogue with MIMIC-SI 
Task Dialogue 
Initiative Initiative 
System System 
System 
System 
System 
System 
System 
System 
Sys~m 
tion for MIMIC's initial utterance. For naive users, it 
may be more appropriate for MIMIC to employ system- 
initiative strategies. This system, MIMIC-SI, can be 
achieved by setting the default initiative distribution to 
provide definite evidence for the system having both task 
and dialogue initiatives) Figure 2 shows a dialogue 
in which a user interacts with MIMIC-SI to obtain the 
same information as in Figure 1. MIMIC-SI prompts the 
user for one piece of information at a time and provides 
(sometimes unnecessary) assistance, in the form of valid 
dialogue moves, during each turn. 
Taking into account dialogue history allows MIMIC 
to determine the initiative distribution based on the cu- 
mulative effect of previously observed cues. This ef- 
fect can be disabled so that only local dialogue con- 
text affects the initiative distribution for the next turn. 
Based on MIMIC's parameter settings, this adjustment 
results in MIMIC-MI, a system which always has dia- 
3Based on the Dempster-Shafer theory, if the bpas of the default 
initiative distribution or those of a detected cue provide definite evi- 
dence for drawing a certain conclusion, then no subsequent cues have 
any effect on changing that conclusion. 
logue initiative but never has task initiative. MIMIC-MI 
behaves similarly to many existing non-adaptive mixed 
initiative dialogue systems (e.g., (Bennacef et al., 1996; 
Papineni et' al., 1999)). Figure 3 shows a dialogue in- 
volving MIMIC-MI in which the user again attempts to 
acquire the same information as in the previous two dia- 
logues. Notice that (41)-(49) constitute a failed attempt 
to solicit a theater name from the user. Instead of provid- 
ing helpful information as in (6) in Figure 1, MIMIC-MI 
relied on the user to change her problem-solving strategy 
in (50) to obtain the information needed to answer the 
system's question. 
3 Experimental Design 
Our main goal in evaluating MIMIC is to determine 
whether users find the mixed initiative and automatic 
adaptation aspects of its dialogue strategies useful. We 
compared MIMIC to two control systems: MIMIC-SI 
and MIMIC-MI, since they employ dialogue manage- 
ment strategies similar to those in many existing sys- 
tems. The comparison between MIMIC and MIMIC-SI 
204 
Utterance 
(38) S: 
(39) 
(40) U: 
(41) S: 
(42) U: 
(43) S: 
(44) U: 
(45) S: 
(46) U: 
(47) S: 
(48) U: 
(49) S: 
(50) U: 
(51) S: 
(52) U: 
(53) S: 
(54) 
(55) 
(56) 
(57) 
U: 
S: 
Hello, this is MIMIC, the movie information system. 
How can I help you? 
When is Analyze This playing in Montclair? 
\[when analyze this playing in montclair\] 
What theater would you like? 
I1 
What theater would you like? 
Um. 
lure\] 
What theater would you like? 
Montclair. 
\[montclairl 
What theater would you like? 
Any. 
\[any\] 
What theater would you like? 
Where in Montclair is Analyze This playing? 
\[where in montclair is analyze this playing\] 
Analyze This is playing at Wellmont Theater and Clearviews Screening Zone in Montclair. 
When is Analyze This playing at Wellmont Theater? 
\[when is analyze this playing at wellmont theater in\] 
Analyze This is playing at Wellmont Theater in Montclair at 3:15pro, 5:45pm, 8:20pro, 
and 10:10pro. 
Can I help you with anything else? 
When is it playing at Clearviews Screening Zone? 
\[when is it playing at movie screening zone\] 
Analyze This is playing at Clearviews Screening Zone in Montclair at 1:15pm, 4:10pm, 
and 6:30pm. 
Can I help you with anything else? 
Figure 3: A Non-Adaptive Mixed Initiative Dialogue with MIMIC-MI 
focused on the contribution of mixed-initiative dialogue 
management, while the comparison between MIMIC and 
MIMIC-MI emphasized the contribution of automatic 
strategy adaptation. The following three factors were 
controlled in our experiments: 
Town Theater 
(if playing) 
Hoboken 
Task Dialogue 
Initiative Initiative 
User System 
User System 
User System 
User System 
User System 
User System 
User System 
User System 
User System 
Movie Times after 5:10pm 
(if playing) 
Antz 
(a) Easy Task 
1. System version: For each experiment, two systems 
were used: MIMIC and a control system. In the first 
experiment MIMIC was compared with MIMIC-SI, 
and in the second experiment, with MIMIC-MI. 
2. Order: For each experiment, all subjects were ran- 
domly divided into two groups. One group per- 
formed tasks using MIMIC first, and the other group 
used the control system first. 
Town Theater Movie Two Times 
(if playing) (if playing) 
Millbum Analyze This 
Berkeley Hgts 
Mountainside 
Analyze This 
Analyze This 
Madison True Crime 
Hoboken True Crime 
(b) Difficult Task 
3. Task difficulty: 3-4 tasks which highlighted differ- 
ences between systems were used for each experi- 
ment. Based on the amount of information to be ac- 
quired, we divided the tasks into two groups: easy 
and difficult; an example of each is shown in Fig- 
ure 4. 
Figure 4: Sample Tasks for Evaluation Experiments 
Eight subjects 4 participated in each experiment. Each 
of the subjects interacted with both systems to perform 
4The subjects were Bell Labs researchers, summer students, and 
their friends. Most of them are computer scientists, electrical engi- 
205 
all tasks. The subjects completed one task per call so 
that the dialogue history for one task did not affect the 
next task. Once they had completed all tasks in sequence 
using one system, they filled out a questionnaire to as- 
sess user satisfaction by rating 8-9 statements, similar 
to those in (Walker et al., 1997), on a scale of 1-5, where 
5 indicated highest satisfaction. Approximately two days 
later, they attempted the same tasks using the other sys- 
tem. 5 These experiments resulted in 112 dialogues with 
approximately 2,800 dialogue turns. 
In addition to user satisfaction ratings, we automat- 
ically logged, derived, and manually annotated a num- 
ber of features (shown in boldface below). For each 
task/subject/system triplet, we computed the task suc- 
cess rate based on the percentage of slots correctly filled 
in on the task worksheet, and counted the # of calls 
needed to complete each task. 6 For each call, the user- 
side of the dialogue was recorded, and the elapsed time 
of the call was automatically computed. All user ut- 
terances were logged as recognized by our automatic 
speech recognizer (ASR) and manually transcribed from 
the recordings. We computed the ASR word error rate, 
ASR rejection rate, and ASR timeout rate, as well as 
# of user turns and average sentence length for each 
task/subject/system triplet. Additionally, we recorded 
the cues that the system automatically detected from 
each user utterance. All system utterances were also 
logged, along with the initiative distribution for each 
system turn and the dialogue acts selected to generate 
each system response. 
4 Results and Discussion 
Based on the features described above, we com- 
pared MIMIC and the control systems, MIMIC-SI and 
MIMIC-MI, along three dimensions: performance fea- 
tures, in which comparisons were made using previously 
proposed features relevant to system performance (e.g., 
(Price et al., 1992; Simpson and Fraser, 1993; Danieli 
and Gerbino, 1995; Walker et al., 1997)); discourse fea- 
tures, in which comparisons were made using character- 
istics of the resulting dialogues; and initiative distribu- 
tion, where initiative characteristics of all dialogues in- 
volving MIMIC from both experiments were examined. 
4.1 Performance Features 
For our performance evaluation, we first applied a three- 
way analysis of variance (ANOVA) (Cohen, 1995) to 
each feature using three factors: system version, order, 
neers, or linguists, and none had prior knowledge of MIMIC. 
SWe used the exact same set of tasks rather than designing tasks of 
similar difficulty levels because we intended to compare all available 
features between the two system versions, including ASR word error 
rate, which would have been affected by the choice of movie/theater 
names in the tasks. 
6Although the vast majority of tasks were completed in one call, 
some subjects, when unable to make progress, did not change strategies 
as in (41)-(49) in Figure 3; instead, they hung up and started the task 
over. 
Performance Feature MIMIC 
# of user turns 10.3 
Elapsed time (see.) 229.5 
ASR timeout (%) 12.5 
User satisfaction (n=8) 21.9 
ASR rejection (%) 514 
Task success (%) 100 
# of calls I 1.0 
28.1 ASR word error (%) 
Sl 
13.6 
277.5 
6.9 
19.8 
8.1 
98.8 
1.1 
31.1 
(a) MIMIC vs. MIMIC-SI (n=32) 
P 
0.0075 
0.0162 
0.0239 
0.0447 
0.1911 
0.3251 
0.572 
0.8475 
Performance Feature MIMIC 
ASR timeout (%) 5.7 
# of user turns 10.3 
User satisfaction (n=8) 29.5 
Elapsed time (see.) 200.6 
ASR word error (%) 23.0 
Task success (%) 100 
# of calls 1.21 
8.4 ASR rejection (%) 
MI p 
15.6 0.001 
14.3 0.0199 
24.4 0.0364 
246.4 0.0457 
30.6 0.0588 
98.4 0.1639 
1.21 0.5 
7.7 0.8271 
(b) MIMIC vs. MIMIC-MI (n=24) 
Table 1: Comparison of Performance Features 
and task difficulty. 7 If no interaction effects emerged, we 
compared system versions using paired sample t-tests. 8 
Following the PARADISE evaluation scheme (Walker 
et al., 1997), we divided performance features into four 
groups: 
• Task success: task success rate, # of calls. 
• Dialogue quality: ASR rejection rate, ASR timeout 
rate, ASR word error rate. 
• Dialogue efficiency: # of user turns, elapsed time. 
• System usability: user satisfaction. 
For both experiments, the ANOVAs showed no inter- 
action effects among the controlled factors. Tables l(a) 
and l(b) summarize the results of the paired sample t- 
tests based on performance features, where features that 
differed significantly between systems are shown in ital- 
ics. 9 These results show that, when compared with either 
7User satisfaction was a per subject as opposed to a per task per- 
formance feature; thus, we performed a two-way ANOVA using the 
factors system version and order. 
8This paper focuses on evaluating the effect of MIMIC's mixed ini- 
tiative and automatic adaptation capabilities. We assess these effects 
based on comparisons between system version when no interaction ef- 
fects emerged from the ANOVA tests using the factors system version, 
order, and task difficulty. Effects based on system order and task diffi- 
culty alone are beyond the scope of this paper. 
9Typically p<0.05 is considered statistically significant (Cohen, 
1995). 
206 
control system, users were more satisfied with MIMIC t° 
and that MIMIC helped users complete tasks more effi- 
ciently. Users were able to complete tasks in fewer turns 
and in a more timely manner using MIMIC. 
When comparing MIMIC and MIMIC-MI, dialogues 
involving MIMIC had a lower timeout rate. When 
MIMIC detected cues signaling anomalies in the dia- 
logue, it adapted strategies to provide assistance, which 
in addition to leading to fewer timeouts, saved users time 
and effort when they did not know what to say. In con- 
trast, users interacting with MIMIC-MI had to iteratively 
reformulate questions until they obtained the desired in- 
formation from the system, leading to more timeouts 
(see (41)-(49) in Figure 3). However, when comparing 
MIMIC and MIMIC-SI, even though users accomplished 
tasks more efficiently with MIMIC, the resulting dia- 
logues contained more timeouts. As opposed to MIMIC- 
SI, which always prompted users for one piece of infor- 
mation at a time, MIMIC typically provided more open- 
ended prompts when the user had task initiative. Even 
though this required more effort on the user's part in for- 
mulating utterances and led to more timeouts, MIMIC 
quickly adapted strategies to assist users when recog- 
nized cues indicated that they were having trouble. 
To sum up, our experiments show that both MIMIC's 
mixed initiative and automatic adaptation aspects re- 
sulted in better performance along the dialogue efficiency 
and system usability dimensions. Moreover, its adap- 
tation capabilities contributed to better performance in 
terms of dialogue quality. MIMIC, however, did not con- 
tribute to higher performance in the task success dimen- 
sion. In our movie information domain, the tasks were 
sufficiently simple; thus, all but one user in each experi- 
ment achieved a 100% task success rate. 
4.2 Discourse Features 
Our second evaluation dimension concerns characteris- 
tics of resulting dialogues. We analyzed features of user 
utterances in terms of utterance length and cues observed 
and features of system utterances in terms of dialogue 
acts. For each feature, we again applied a three-way 
ANOVA test, and if no interaction effects emerged, we 
performed a paired sample t-test to compare system ver- 
sions. 
The cues detected in user utterances provide insight 
into both user intentions and system capabilities. The 
cues that MIMIC automatically detects are a subset of 
those discussed in (Chu-Carroll and Brown, 1998): il 
• TakeOverTask: triggered when the user provides 
more information than expected; an implicit indi- 
cation that the user wants to take control of the 
l°The range of user satisfaction scores was 8-40 for experiment one 
and 9-45 for experiment two. 
l t A subset of these cues corresponds loosely to previously proposed 
evaluation metrics (e.g., (Danieli and Gerbino, 1995)). However, our 
system automatically detects these features instead of requiring manual 
annotation by experts. 
Discourse Feature MIMIC 
Cue: TakeOverTask 1.84 
Cue: AmbiguousActResolved 1.69 
'Cue: AmbiguousAction 3 
Avg sentence length (words) 6.82 
Cue: InvalidAction 1.16 
Cue: NoNewInfo 1.28 
Sl 
5 
4.59 
6.59 
5.45 
0.94 
1.38 
(a) MIMIC vs. MIMIC-SI (n=32) 
P 
0 
0 
0.0008 
0.0016 
0.1738 
0.766 
Discourse Feature MIMIC 
Cue: TakeOverTask 2.33 
Cue: InvalidAction 2.04 
Cue: NoNewlnfo 2.25 
Cue: AmbiguousActResolved 2.08 
Avg sentence length (words) 5.26 
Cue: AmbiguousAction 4.13 
MI 
0 
3.75 
4.79 
1.13 
5.63 
4.38 
(b) MIMIC vs. MIMIC-MI (n=24) 
P 
0 
0.0011 
0.0161 
0.0297 
0.1771 
0.8767 
Table 2: Comparison of User Utterance Features 
problem-solving process. 
• NoNewlnfo: triggered when the user is unable to 
make progress toward task completion, either when 
the user does not know what to say or the ASR en- 
gine fails to recognize the user's utterance. 
• lnvalidAction/InvalidActionResolved: triggered 
when the user utterance makes an invalid as- 
sumption about the domain and when the invalid 
assumption is corrected, respectively. 
• AmbiguousAction/AmbiguousActionResolved: trig- 
gered when the user query is ambiguous and when 
the ambiguity is resolved, respectively. 
Tables 2(a) and (b) summarize the results of the paired 
sample t-tests based on user utterance features where fea- 
tures whose numbers of occurrences were significantly 
different according to system version used are shown in 
italics. 12 Table 2(a) shows that users expected the system 
to adapt its strategies when they attempted to take control 
of the dialogue. Even though MIMIC-SI did not behave 
as expected, the users continued their attempts, resulting 
in significantly more occurrences of TakeOverTask in di- 
alogues with MIMIC-SI than with MIMIC. Furthermore, 
the average sentence length in dialogues with MIMIC 
was only 1.5 words per turn longer than in dialogues 
with MIMIC-SI, providing further evidence that users 
~2Since system dialogue acts are often selected based on cues de- 
tected in user utterances, we only discuss results of our user utterance 
feature analysis, using dialogue act analysis results as additional sup- 
port for our conclusions. 
207 
preferred to provide free-formed queries, regardless of 
system version used. 
Table 2(b) shows that MIMIC was more effec- 
tive at resolving dialogue anomalies than MIMIC-MI. 
More specifically, there were significantly fewer oc- 
currences of NoNewlnfo in dialogues with MIMIC 
than with MIMIC-MI. In addition, while the number 
of occurrences of AmbiguousAction was not signifi- 
cantly different for the two systems, the number that 
were resolved (AmbiguousActionResolved) was signif- 
icantly higher in interactions with MIMIC than with 
MIMIC-MI. Since NoNewlnfo and AmbiguousAction 
both prompted MIMIC to adapt strategies and, as a re- 
suit, provide additional useful information, the user was 
able to quickly resolve the problem at hand. This is fur- 
ther supported by the higher frequency of the system dia- 
logue act GiveOptions in MIMIC (p=0), which provides 
helpful information based on dialogue context. 
In sum, the results of our discourse feature analysis 
further confirm the usefulness of MIMIC's adaptation 
capabilities. Comparisons with MIMIC-SI provide ev- 
idence that MIMIC's ability to give up initiative better 
matched user expectations. Moreover, comparisons with 
MIMIC-MI show that MIMIC's ability to opportunisti- 
cally take over initiative resulted in dialogues in which 
anomalies were more efficiently resolved and progress 
toward task completion was more consistently made. 
4.3 Initiative Analysis 
Our final analysis concerns the task initiative distri- 
bution in our adaptive system in relation to the fea- 
tures previously discussed. For each dialogue involving 
MIMIC, we computed the percentage of turns in which 
MIMIC had task initiative and the correlation coefficient 
(r) between the initiative percentage and each perfor- 
mance/discourse feature. To determine if this correlation 
was significant, we performed Fisher' s r to z transform, 
upon which a conventional Z test was performed (Cohen, 
1995). 
Tables 3(a) and (b) summarize the correlation between 
the performance and discourse features and the percent- 
age of turns in which MIMIC has task initiative, respec- 
tively. 13 Again, those correlations which are statistically 
significant are shown in italics. Table 3(a) shows a strong 
positive correlation between task initiative distribution 
and the number of user turns as well as the elapsed time 
of the dialogues. Although earlier results (Table l(a)) 
show that dialogues in which the system always had task 
initiative tended to be longer, we believe that this corre- 
lation also suggests that MIMIC took over task initiative 
more often in longer dialogues, those in which the user 
was more likely to be having difficulty. Table 3(a) fur- 
ther shows moderate correlation between task initiative 
distribution and ASR rejection rate as well as ASR word 
error rate. It is possible that such a correlation exists 
13This test was not performed for user satisfaction, since user saris- 
faction was a per subject and not a per dialogue feature. 
Performance Feature r p 
# of user turns 0,71 0 
ASR rejection 0.55 0 
Elapsed time 0.51 0.00002 
ASR word error 0.46 0.00012 
~# of calls 0.15 0.1352 
! ASR timeout -0.003 0.4911 
Task success rate 0 0.5 
(a) Performance Features 
Discourse Feature r p 
Cue: AmbiguousActionResolved 0.61 0 
Cue: NoNewlnfo 0.59 0 
Cue: TakeOverTask 0.44 0.00028 
Cue: lnvalidAction 0.42 0.00057 
Average sentence length -0.40 0.00099 
Cue: AmbiguousAction 0.38 0.00169 
(b) Discourse Features 
Table 3: Correlation Between Task Initiative Distribution 
and Features (n=56) 
because ASR performance worsens when MIMIC takes 
over task initiative. However, in that case, we would have 
expected the results in Section 4.1 to show that the ASR 
rejection and word error rates for MIMIC-SI are signif- 
icantly greater than those for MIMIC, which are in turn 
significantly greater than those for MIMIC-MI, since in 
MIMIC-SI the system always had task initiative and in 
MIMIC-MI the system never took over task initiative. 
To the contrary, Tables l(a) and l(b) showed that the 
differences in ASR rejection rate and ASR word error 
rate were not significant between system versions, and 
Table l(b) showed that ASR word error rate for MIMIC- 
MI was in fact quite substantially higher than that for 
MIMIC. This suggests that the causal relationship is the 
other way around, i.e., MIMIC's adaptation capabilities 
allowed it to opportunistically take over task initiative 
when ASR performance was poor. 
Table 3(b) shows that all cues are positively correlated 
with task initiative distribution. For AmbiguousAction, 
lnvalidAction, and NoNewlnfo, this correlation exists be- 
cause observation of these cues contributed to MIMIC 
having task initiative. However, note that AmbiguousAc- 
tionResolved has a stronger positive correlation with task 
initiative distribution than does AmbiguousAction, again 
indicating that MIMIC's adaptive strategies contributed 
to more efficient resolution of ambiguous actions. 
In brief, our initiative analysis lends additional sup- 
port to the conclusions drawn in our performance and 
discourse feature analyses and provides new evidence 
for the advantages of MIMIC's adaptation capabilities. 
208 
In addition to taking over task initiative when previously 
identified dialogue anomalies were encountered (e.g., de- 
tection of ambiguous or invalid actions), our analysis 
shows that MIMIC took over task initiative when ASR 
performance was poor, allowing the system to better con- 
strain user utterances, t4 
5 Conclusions 
This paper described an empirical evaluation of MIMIC, 
an adaptive mixed initiative spoken dialogue system. We 
conducted two experiments that focused on evaluating 
the mixed initiative and automatic adaptation aspects of 
MIMIC and analyzed the results along three dimensions: 
performance features, discourse features, and initiative 
distribution. Our results showed that both the mixed 
initiative and automatic adaptation aspects of the sys- 
tem led to better performance in terms of user satisfac- 
tion and dialogue efficiency. In addition, we found that 
MIMIC's adaptation behavior better matched user expec- 
tations, more efficiently resolved anomalies in dialogues, 
and led to higher overall dialogue quality. 
Acknowledgments 
We would like to thank Bob Carpenter and Christine 
Nakatani for their help on experimental design, Jan van 
Santen for discussion on statistical analysis, and Bob 
Carpenter for his comments on an earlier draft of this pa- 
per. Support for the second author is provided by an NSF 
graduate fellowship and a Lucent Technologies GRPW 
grant. 
laAlthough not currently utilized, the ability to adapt dialogue strate- 
gies when ASR performance is poor enables the system to employ dia- 
logue strategy specific  models for ASR. 

References 

James F. Allen, Bradford W. Miller, Eric K. Ringger, 
and Teresa Sikorski. 1996. A robust system for natural spoken dialogue. In Proceedings of the 34th Annual Meeting of the Association for Computational 
Linguistics, pages 62-70. 

S. Bennacef, L. Devillers, S. Rosset, and L. Lamel. 
1996. Dialog in the RAILTEL telephone-based system. In Proceedings of the 4th International Conference on Spoken Language Processing. 

Jennifer Chu-Carroll and Michael K. Brown. 1998. An 
evidential model for tracking initiative in collaborative dialogue interactions. User Modeling and User-Adapted Interaction, 8(3-4):215-253. 

Jennifer Chu-Carroll. 2000. MIMIC: An adaptive mixed 
initiative spoken dialogue system for information 
queries. In Proceedings of the 6th ACL Conference on 
Applied Natural Language Processing. To appear. 

Paul R. Cohen. 1995. Empirical Methods for Artificial 
Intelligence. MIT Press. 

Morena Danieli and Elisabetta Gerbino. 1995. Metrics 
for evaluating dialogue strategies in a spoken  
system. In Proceedings of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pages 34-39. 

Jean Gordon and Edward H. Shortliffe. 1984. The 
Dempster-Shafer theory of evidence. In Bruce 
Buchanan and Edward Shortliffe, editors, Rule-Based 
Expert Systems: The MYCIN Experiments of the 
Stanford Heuristic Programming Project, chapter 13, 
pages 272-292. Addison-Wesley. 

Diane J. Litman and Shimei Pan. 1999. Empirically 
evaluating an adaptable spoken dialogue system. In 
Proceedings of the 7th International Conference on 
User Modeling, pages 55-64. 

H. Meng, S. Busayaponchai, J. Glass, D. Goddeau, 
L. Hetherington, E. Hurley, C. Pao, J. Polifroni, 
S. Seneff, and V. Zue. 1996. WHEELS: A conversational system in the automobile classifieds domain. In 
Proceedings of the International Conference on Spoken Language Processing, pages 542-545. 

Stefan Ortmanns, Wolfgang Reichl, and Wu Chou. 1999. 
An efficient decoding method for real time speech 
recognition. In Proceedings of the 5th European Conference on Speech Communication and Technology. 

K.A. Papineni, S. Roukos, and R.T. Ward. 1999. Free-flow dialog management using forms. In Proceedings 
of the 6th European Conference on Speech Communication and Technology, pages 1411-1414. 

Patti Price, Lynette Hirschman, Elizabeth Shriberg, and 
Elizabeth Wade. 1992. Subject-based evaluation measures for interactive spoken  systems. In Proceedings of the DARPA Speech and Natural Language 
Workshop, pages 34-39. 

Wolfgang Reichl and Wu Chou. 1998. Decision tree 
state tying based on segmental clustering for acoustic 
modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 

M.D. Sadek, A. Ferrieux, A. Cozannet, P. Bretier, E Panaget, and J. Simonin. 1996. Effective human-computer cooperative spoken dialogue: The AGS demonstrator. In Proceedings of the International 
Conference on Spoken Language Processing. 

Glenn Shafer. 1976. A Mathematical Theory of Evidence. Princeton University Press. 

Andrew Simpson and Norman M. Fraser. 1993. Black 
box and glass box evaluation of the SUNDIAL system. 
In Proceedings of the 3rd European Conference on 
Speech Communication and Technology, pages 1423-1426. 

Gert Veldhuijzen van Zanten. 1999. User modelling in 
adaptive dialogue management. In Proceedings of the 
6th European Conference on Speech Communication 
and Technology, pages 1183-1186. 

Marilyn A. Walker, Diane J. Litman, Candance A. 
Kamm, and Alicia Abella. 1997. PARADISE: A 
framework for evaluating spoken dialogue agents. In 
Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 271-280. 
