Taking Account of the User's View in 3D Multimodal Instruction 
Dialogue 
Yukiko I. Nakano and Kenji hnamura and Hisashi Ohara 
1-1 Hikari-no-oka, Yokosuka, Kanagawa, 239-0847 Japan 
{yukiko, ilnamura, ohara}@ntl;nly.isl.ntt.co.jp 
Abstract 
While recent advancements in virtual reality technology 
have created a rich communication interface linking hu- 
mans and computers, there has beefl little work on build- 
ing dialogue systems for 3D virtual worlds. This paper 
proposes a method for altering the instruction dialogue 
to match the user's view in a virtual enviromnent. -\~re 
illustrate the method with the system MID-aD, which in- 
teractively instructs the user on dismantling some parts 
of a car. First, in order to change the content of ~he 
instruction dialogue to match the user's view, we extend 
the refinement-driven plmming algorithm by using the 
user's view as a l)lan constraint. Second, to manage the 
dialogue smoothly, the systeln keeps track of the user's 
viewpoint as part of the dialogue skate and uses this 
information for coping with interruptive sul)dialogues. 
These mechanisms enable MID-3D to set instruction di- 
alogues in an incremental way; it takes account of the 
user's view even when it changes frequently. 
1 Introduction 
In a aD virtual enviromnent, we can freely walk 
through the virtual space and view three di- 
mensional objects from various angles. A inul- 
tilnodal dialogue system for such a virtual en- 
vironment should ainl to realize conversations 
which are performed in the real world. It would 
also be very useflll for education, where it is 
necessary to learn in near real-life situations. 
One of the most significant characteristics of 
3D virtual environments is that the user can se- 
lect her/his own view from whidi to observe the 
virtual world. Thus, the nmltimodal instruc- 
tion dialogue system should be able to set the 
course of the dialogue by considering the user's 
current view. However, previous works on nml- 
tilnodal presentation generation and instruc- 
tion dialogue generation (Wahlster et al., 1993; 
Moore, 1995; Cawsey, 1992) do not achieve this 
goal because they were not designed to hail- 
(lie dialogues pertbrmed in 3D virtual environ- 
ments. 
This paper proposes a method that ensures 
that the course of the dialogue matches the 
user's view in the virtual environment. More 
specificall> we focus on (1) how to select the 
contents of the dialogue since it is essential 
that the instruction dialogue system form a se- 
quence of dialogue contents that is coherent 
and comprehensible, and (2) how to control 
mixed-initiative instruction dialogues snloothly, 
especially how to manage interruptive subdia- 
logues. These two problelns basically determine 
the course of the dialogue. 
First, in order to decide the appropriate con- 
tent, we propose a content selection mechanism 
based on plan-based multilnodal presentation 
generation (Andrd and Rist, 1993; Wahlster et 
al., 1993). We extend this algorithm by using 
the user's view as a constraint in expanding the 
plan. In addition, by employing tilt incremen- 
tal planning algorithm, the syst;em can adjust 
the content to match the user's view during on- 
going conversations. 
Second, ill order to nlanage interruptive sub- 
dialogues, we propose a dialogue management 
mechanism that takes account of the user's 
view. This mechanism maintains the user's 
viewpoint as a dialogue state in addition to in- 
tentional and linguistic context (Rich and Sid- 
her, 1998). It maintains the dialogue state as a 
focus stack of discourse segments and updates 
it at each turn. Tlms, it Call track the view- 
point information in an on-going dialogue. By 
using this viewpoint inibrlnation in restarting 
the dialogue after an interruptive subdialogue, 
the dialogue Inai~agement medmnism returns 
the user's viewpoint to that of the interrupted 
segment. 
These two mechanisms work as a core dia- 
logue engine in MID-3D (Multimodal Instruc- 
tion Dialogue system for 3D virtual environ- 
ments). They make it possible to set the in- 
struction dialogue in an increnlental ww while 
572 
Figure 1: Right angle 
Figure 2: l,efl; angle 
considering the user's view. They also (mal)h'~ 
MID-a1) to (:re~te coherent and mixe, d-initiative 
(liah)gues in virtual enviromuents. 
This paper is organized as lbllows. In Sec- 
ti(m 2, we define the 1)rol)h;ms spc(:ifi(: 1;o 313 
multimoda\] (tiah)gne genera.tion. Section 3 de- 
scribes rclat;ed works. \]n S('x:l;ion 4, we pro- 
pose the MID-a1) architecture. Sections 5 ;rod 
6 des(:ril)e the contenl; plmming meclm.nism a.nd 
the dialogue manngement meclm.nism, a.nd show 
they dynami(:ally decide coherent insl;rn(:t;ions, 
and control mixed-initial;ire diah)guc.s consider- 
ing the user's view. V~/e also show a smnt)le di- 
:dogue in Section 7. 
2 Problems 
In a virtual emdromnent, the user can freely 
move a.round the world and select her/his own 
view. r\['he systelll C&llllOt; predict where the user 
will stand and what; s/he observes in the vir- 
tual environment. This section describes two 
types of 1)roblems in generating instru(:tion dia- 
logues ibr such virtual enviromnents. They arc 
caused l)y mismatches b(~,twe(;ll tile llSel'~S vi0,w- 
l)oint ;m(1 the sta.te of th(; dialogue. 
First, the syStelll shouM check whether the 
user's view matches the tbcns of the next ex- 
change when the systen~ tries to ('hange COllllllll- 
ni('ative goals. \]if a mismatch occurs, the system 
shouhl choose the instru(:tion (li~dogue content 
according to the user's view. Figure 1 a,n(1 2 m:e 
examl)les of observing a car's front suspension 
from (liff(',r(mt, points of view. In Figm'(', 1, the 
right; side of the steering system can 1)e seen, 
while Figure 2 shows the left side. If the system 
is not aware of the user's view, I;he system may 
talk about the left; tie rod end even though the 
user's view remains the right side (Figure 1). 
In such n (:ase, the system shouM chang(: its d(> 
scril)tion or ask the user to change her/his view 
to |;11('. left; side. view (Figure 2) and r('.(-Olmnen(:e 
its instruction hi)out this part. Therefore, the 
system should be al)le to change the content 
of the dialogue according 1;o the user's view. 
In order to ac(:omplish this, the system shoul(1 
lmve ;1. content selection nlechan.ism whi(:h in- 
crementally (let:ides i;h('~ content while ('he(:king 
the llSef~s (;llrrellt vi(!w. 
Second, t;here could 1)e a case in which 1;21(; 
user chang(~s 1;he, topi(: as well as the vie\vl)oillt 
as interrupl;ing the. system's instru('t;ion, i n such 
a case, the (tia.h)gue~ system shouhl kee l) track of 
the user's viewpoint as ~ 1)art of the dialogue 
state nnd return to that viewpoint when resmn- 
ing the (lia.logu(? after the interrupl;ing sul)(li- 
alogue. Sul)l)ose that while the sys|;em is (',x- 
l)lnining tlm right; t)i(; rod end, th('. user initially 
looks a,t the right side, (l"igure 1) hut then shifts 
her/his view to the left (Figure 2) and asks 
about the \]eft knu(-kle arm. After finishing a 
sub(lialogue about this arm, the syst(;nl tries 
to return to the dialogue al)out the interrupted 
topic. At this time, if the sysl;em resumed the 
dialogue using the current view (Figure 2), the 
view and the instruction would \])e(;olne mis- 
matched. When resmning the interrupted di- 
alogue, it would be less (:onfllsing to the user 
if the system retm:ned to the user's prior view- 
l)oint rather than selecting n new o11o. '\].'he user 
may be (:onfilsed if the dialogue is resulned but 
the observed state looks different. 
\,Ve address the ~fl)ove problems. In order to 
(:ope wit;h the first; problem, we present a con- 
tent selection mechanism that incrementally ex- 
pands the content plan of a multimodal dialogue 
while checking the user's view. To solve the 
second 1)roblem, we present a. dialogue nmnage- 
merit me(:\]mnism l;hat keel)s t:ra(-k of the user's 
viewpoint as a part of the diah)gue context and 
573 
............................................................................................................................................................................ -: 
~;'~ Operation l,uttons 
Figure 3: The system architecture 
uses this intbrmation in resuming the dialogue 
after interruptive subdialogues. 
3 Related work 
There are many multimodal systems, such as 
nmltimedia presentation systems and animated 
agents (Mwbury, 1993; Lester et al., 1997; 
Bares and Lester, 1997; Stone and Lester, 1996; 
Towns et al., 1998)~ all of which use 3D graph- 
ics and 3D animations. In some of them (May- 
bury, 1993; Wahlster et al., 1993; Towns et 
al., 1998), planning is used in generating mul- 
timodal presentations including graphics and 
animations. They are similar to MID-aD in 
that they use planning mechanisms in content 
planning. However, in presentation systems, 
unlike dialogue systems, the user just watches 
the presentation without changing her/his view. 
Therefore, these studies are not concerned with 
dlanging the content of the discourse to match 
the user's view. 
In some studies of dialogue management 
(Rich and Sidner, 1998; Stent et M., 1999), 
the state of the dialogue is represented using 
Grosz and Sidner's framework (Grosz and Sid- 
ner, 1986). We also adopt this theory in our di- 
alogue management mechanism. However, they 
do not keep track of the user's viewpoint infor- 
mation as a part of the dialogue state because 
they were not concerned with dialogue manage- 
ment in virtual environments. 
Studies on pedagogical agents have goals 
closer to ours. In (Rickel and .\]ohnson, 1999), 
a pedagogical agent demonstrates the sequen- 
tial operation of complex machiuery and an- 
swers some follow up questions fl'on~ the stu- 
dent. Lester et al. (1999) proposes a life- 
like pedagogical agent that supports problem- 
solving activities. Although these studies are 
concerned with building interactive learning en- 
vironments using natural , they do not 
discuss how to decide the course of on-going in- 
struction dialogues in an incremental and coher- 
ent way. 
4 Overview of the System 
Architecture 
This section describes the architecture of MID- 
3D. This system instructs users how to disman- 
tle the steering system of a cal'. Tile system 
steps through the procedure and the user can 
interrupt the system's instructions at any time. 
Figme 3 shows the architecture and a snapshot 
of the system. The 3D virtual environment is 
viewed through an application window. A 3D 
model of a part of the car is provided and a frog- 
574 
like character is used as the pedagogical agent 
(Johnson et al., 2000). The user herself/himself 
Call also al)l)ear in the virtual enviromn(mt as 
an avatar. The buttons to the right of the 3D 
scre(m are operation 1)uttons tbr changillg the 
viewpoint. By using these buttons, the user can 
freely change her/his viewt)oint at any time. 
This system consists of five main modules: 
hll)Ut Analyzer, Domain Plan Reasoner, Con- 
tent Planner (CP), Sentence Planner, Dialogue 
Manager (DM), and Virtual Environment Con- 
troller. 
First of all, the user's inputs are interpreted 
through the Input Analyzer. It receives strings 
of characters from the voice recognizer and 
the user's inputs ti'om the Virtual Environment 
Controller. It interl)rets these inputs, trans- 
forms them into a semantic reprcsentation~ and 
sends them to the DM. 
The DM, working as a dialogue management 
mechanism, keeI)s track of the dialogue (:ontext 
including the user:s view and decides the, next 
goal (or a(:tion) of the system. Ut)on receiv- 
ing an intmt from the user through the Input 
Analyzer, the DM sends it to the l)omaill Plan 
Reasoner (DPR) to get discourse goals for re- 
st)onding to the inlmt. For example, if th(: user 
requests some instruction, the DI'I{ decides the 
sequence of steps that realizes the l)rocedure 1)y 
refi~rring to domain knowh~dge. Th(: 1354 (;hen 
adds (;he discourse goals to the goal agenda. 
If the user does not sulmlit a ~lew (;ot)ie , the 
DM (:ontilmes to expand the, instruction plan 
1)y sending a goal in the goal agenda to (:lie CP. 
Details of the I)M are given in Section 6. 
After the goal is sent to the CP, it decides the 
apl)ropriate contents of instruction dialogue by 
eml)loying a refinement-driven hierar(:hi(:al lin- 
ear 1)lamfing technique. When it; receives a goal 
fl'om the DM, it exl)ands the goal and returns 
its sul)goal to the DM. 13y ret)eating this pro- 
cess, the dialogue contents are, gradually spec- 
ified. Theretbre, the CP provides the scenario 
tbr the instruction 1)ased on the control 1)rovided 
by the DM. Details of the CP are provided in 
Section 5. 
The Sentence Plalmer generates surface, lin- 
guisti(: expressions coordinated with action 
(Kato et al., 1996). The linguistic exl)ressions 
arc. output through a voice synthesizer. Actions 
;/re realized through the Virtual Enviromnent 
Controller as 3D animation. 
For the Virtual Environment Controller, we 
use HyCLASS (Kawanol)e et al., 1998), which 
<Operator 1> 
(:tleader 
:Iiftbcl 
:Constraints 
:Main-Acts 
:Subskliary-Acts 
<Operator 2> 
(:lleader 
:Effect 
:Conslraints 
:Main-Acts 
:Subsidiary-Acts 
(Inshuct-act N l l ?act MM) 
(BMB S 11 (Goal II (Done 11 ?act))) 
((KB (Obj ?act ?object)) 
(Visible-p (Visible ?ol~iect t))) 
((Look S II) 
(Request S I I (Try It (action ?act)) NO-SYNC MM)) 
((Describe- act S II ?act MM) 
(Reset S (actioll ?act)))) 
(Instruct-act S 11 ?act MM) 
(BMB S 11 (Goal I1 (Done 11 '?act))) 
((KB (Obj ?act ?object)) 
(Visiblc-p (Visible ?object oil))) 
((Look S ll) 
(Make-recognize S 11 (Object ?object) MM) 
(Rcqucst S 11 (Try I1 (action ?act)) NO-SYNC M M)) 
((l)escribc-act S 11 ?act MM) 
(Reset S (action '?act)))) 
Figure 4: Exanlt)les of Content Plan Operators 
is a 3D simulation-1)ased environment tbr edu- 
(:ational activities. Several APls are provided 
tbr controlling HyCLASS. By using these in- 
terfaces, the CP and the DM can discern the 
liser~s view and issue an action command in ()l'- 
der to challge the virtual (;nvironnmllt. \¥h(m 
HyCLASS receives an action command, it in- 
terprets the command and renders the 31) ani- 
mation corresponding to the action in real time. 
5 Selecting the Conten(; of 
Instruction Dialogue 
Ill this section, we introduce the CP and show 
how the instruction dialogue is (leeided in all 
in(:renl(:ntal way to ma, tch the user's view. 
5.1 Content Planner 
In MID-3D, the CP is (:ailed by the DM. Wheal 
a goal is put to the CP fl'(nn the DM, it; selects a 
plan operator fi)r achieving the goal, applies the 
ol)erator to lind new subgoals, and returns them 
to l;he \])M. The sul)goals are then added to the 
goal agenda maintained by the DM. Theretbre, 
the CP provides the seenm:io tbr the instruc- 
tion dialogue to the DM and enables MID-3D 
to output coherent instructions. Moreover, the 
Content Planer emt)loys depth-first search with 
a retinement-drivell hierarchical linear plmming 
algorithm as in (Cmvsey, 1992). The advantage 
of this method is that the t)lan is de, veloped in- 
crenmntally, and can be changed while the con- 
versation is in progress. Thus, by aI)plying this 
algorithm to 3D dialogues, it be(-omes lmssible 
to set instruction dialogue strategies that are 
contingent on the user's view. 
575 
5.2 Considering the User's View in 
Content Selection 
In order to decide the dialogue content accord- 
ing to tile user's view, we extend the descrip- 
tion of the content plan operator (Andrd and 
Rist, 1993) by using the user's view as a con- 
straint in plan operator selection. We also mod- 
ify the constraint checking flmctions of |;lie pre- 
vious planning algorithm such that HyCLASS 
is queried about the state of the virtual envi- 
ronment. 
Figure 4 shows examples of content plan op- 
erators. Each operator consists of the name 
of the operator (Header), the etfcct resulting 
from plan execution (Effect), the constraints for 
executing the plan (Constraints), the essential 
subgoals (Main-acts), and the optional subgoals 
(Subsidiary-acts). As shown in {Operator 1.) 
in Figure 4, we use the constraint (gisible-p 
(Visible ?object t)) to check whether the 
object is visible fl'om tile user's viewpoint. 
Actually, the CP asks HyCLASS to examine 
whether the object is in the student's field of 
view. 
If an object is bound to the ?object vari- 
able by rel~rring to the knowledge base, and 
the object is visible to the user, (Operator 1) 
is selected. As a result, two Main-Acts (look- 
ing at the, user and requesting to try to do 
the action) and two Subsidiary-Acts (showing 
how to do the action, then resetting the state) 
are set as subgoals and returned to the DM. 
In contrast, if l;he object is not visible to the 
user, {Operator 2} is selected. In this case, a 
goal for making the user i(tenti(y the object is 
added to the Main-Acts; (Hake-recognize S 
H (Object ?object) MM). 
As shown al)ove, the user's view is considered 
in deciding the instruction strategy. In addition 
to the above example, the distance between the 
target object and the user as well as three di- 
mensional overlapping of objects, can also be 
considered as constraims related to the user's 
view. 
Although the user's view is also considered in 
selecting locative expressions of objects in the 
Sentence Planner in MID-3D, we do not discuss 
this issue here becanse surface generation is not 
the tbcus of this paper. 
6 Managing Interruptive 
Subdialogue 
The DM controls the other components of MID- 
3D based on a discourse model that represents 
the state of tile dialogue. This section describes 
the DM and shows how the user's view is used 
in managing the instruction dialogue. 
6.1 Maintaining the Discourse Model 
The DM maintains a discourse model for track- 
ing the state of the dialogue. The discourse 
model consists of the discourse goal agenda 
(agenda), focus stack, and dialogue history. The 
agenda is a list of goals that should be achieved 
through a dialogue between the user and the 
system. If all the goals in the agenda are accom- 
plished, the instruction (tialogue finishes suc- 
cessflflly. The focus stack is a sta& of discourse 
segment frames (DSF). Each DSF is a frmne 
structure that stores the tbllowing inlbrmation 
as slot vMues: 
utterance content (UC): A list of utter- 
ance contents constructing a discourse segment. 
Physical actions are also regarded as uttcra.nce 
contents (D;rguson and Allen, 1998). 
discourse purpose (1)19: The purt)ose of a dis- 
course segment. 
- 9oal state (GS): A state (or states) whi('h 
shouhl 1)e accomplished to achieve the discourse 
lmrpose of the segment. 
In addition to these, we add the user's view- 
point slot to the DSF description in order to 
track the user's viewl)oint information: 
user's vic.'wpoint (UV): Current user's view- 
point, which is represented as the position and 
orientation of the camera. The position consists 
of x-, y-, and z-coordinates. The orientation 
consists of x-, y-, and z-angles of the ('amera. 
The basic algorithm of the DM is to repeat 
(a) th(; peribnning actions step and (1)) updat- 
ing the discourse model, until there is no un- 
satisfied goal in the agenda (~IYaum, 1994). In 
1)ertbrming actions step, the DM decides what 
to do next ill the current dialogue state, an(1 
then pertbnns the action. When continuing the 
system explanation, the DM posts the first goal 
in the agenda to the CP. If the user's response 
is needed in the current state, the 1)M waits tbr 
the nser's input. 
The other step in the DM algorith.m is to up- 
date the discourse model according to the state 
that results from the actions pertbrmed by the 
user as well as the actions peribrmed by the sys- 
tem. Although we do not detail this step here, 
the tbllowing operations could be executed de- 
pending on the case. if the current discourse 
purpose is accomplished, the top level DSF is 
popped and added to the dialogue history, q_/he 
576 
l I)SFI21 
DSFI2 
DSFI Jf 
J 
J 
UV: ((18, -20, -263) (0, 0.3 I, 0)) 
UC: ((IJseJ~act (Ask where heal_r)) 
I)P: (Response-to-user-act 
(Uscr-act (ask where bootr))) 
GS: ((Know 11 (About (l'lace_of boot_r)))...) 
UV: ((-38, -22, -259) (0, -0.33, 0)) 
UC: ((System-act (lnl'(~rm S 11 (Show S (Action 
rcmovc-tiemd end.I)) NO-SYNC I'R)) 
DI': (I)cscribe-acl S l I rcmove-licrod end I)) 
GS: ((Know 1I (llove-lo-do 11 
(action remove-tiered eml I)))...) 
Figure 5: Example of the state of a dialogue 
system then assunms that the user understands 
the instruction and adds the assumption to the 
user model. If a new discourse 1)urpose is in- 
troduced from the CP, the I)M creates a new 
DSF by setting the header of the selected plan 
operator in the discourse lmrpose slot mM the 
effi~ct of the operator in the goal state slot. The 
DSF is then trashed to the tbcus stack. If the 
current discourse purpose is contimmd, the DM 
updates the information of the top level DSF. 
6.2 Considering the User's View in 
Coping with Interruptive 
Subdialogues 
The main ditlbxence of the Dialogue Manager of 
our system from the i)revious one is to maintain 
the user's viewpoint information and use this in 
managing the dialogue. When the DM updates 
the information of the current DSt i', it observes 
the user's viewi~oint at that petal; and renews 
the UV slot and it also adds the sema.nl;ic rep- 
resentation of utterance (or action) in the UC 
slot. As a result, it becomes possible to update 
the user's viewpoint information at each turn, 
and to track the user's viewl)oint in an on-going 
dialogue. 
By using this mechanism, the DM can cope 
with interruptive subdialognes. In resmning 
from a subdialogue, the user may become con- 
timed if the dialogue is resumed but the ob- 
served state differs from what the user relllelll- 
hers. In order to match the view to the re- 
sumed dialogtm, the I)M refers the UV slot of 
the top DSF and puts the users view ha& to 
that point. This ensures that the user experi- 
ences a smooth transition back to the previous 
topic. Figure 5 shows an example of the state of 
a dialogue. DSF12 represents a discourse seg- 
ment that describes how to remove the left tie 
rod end. DSF121 represents the user-initiated 
interrul)tive subdialogue about where the right 
\[14\]System: The left knuckle arm is removed like this. 
(with the anilnation showing the left knuckle 
arm coming off) 
\[ 15\]User: (After moving the viewpoint to Figure I and 
clicking the right knuckle ann) 
What is this? 
\[16\]System: This is the right knuckle arm. 
\[I 7\]Uscr: OK. 
\[18\]Systeln: Now, let's continue the explanation. 
(with changing the view to the one in utterance \[ 14\]) 
\[19\]System: The left knuckle arm is removed like this. 
(with the animation showing the left knuckle arm 
coming off) 
\[20\]User: Yes. 
\[21\]System: Now, try removing the left knuckle arm. 
\[22\]User: (After clicking the left knuckle arm) Remove it. 
\[23\]System: You have succcssfitlly completed thc operation. 
\[24\]System: The next step is to remove the right tie rod cad. 
\[25\]System: Let's move to the right tie rod end. 
1126\] User: Ycs. 
\[27\]System: (Moving the view to the right) 
The right tie rod end is in fiont of you. 
\[28\] User: Yes, 1 see. 
Figure 6: Example of a dialogue with MII)-3D 
boot is. hmnediately before starting DSF\]21, 
the user's viewpoint in l)SF12 is ((-38, -22, 
-259) (0, -0.33, 0)). After completing the 
subdialogue \])y answering the user's question, 
DSF121 is l)opped and the system resmnes 
DSF12. At this time, the \])M gets the view- 
point value of the top DSF (DSF12), alld (;Oltl- 
mands ItyCLASS to change the viewpoint to 
that view, which is in this case ((-as, -22, -2,59) 
(0, -0.a3, 0)) ' The systeln then restarts the 
interrupted dialogue. 
7 Exmnple 
In order to illustrate the behavior of MID-3D, 
an example is shown in Figure 6. This is a part 
of an instruction dialogue on how to dismantle 
the steering system of a car. The current topic 
is removing the left knuckle arm. In utterance 
\[14\], the system describes how to remove this 
part in conjunction with an animation created 
by HyCLASS. 
In \[15\], the user interrupted the system's in- 
struction and asked "What is this?" by clicking 
the right knuckle arm. At this point, the user's 
speech input was interpreted in the Input An- 
~In the current system, it; is not 1)ossible to move 
the camera to an arbitrary point because of the limi- 
tations of the virtual environment controller employed. 
Accordingly, this func|;ion is al)proximated by selecting 
the nearest of several predetined viewpoints. 
577 
alyzer and a user initiative subdialogue started 
by t)ushing another DSF onto the focus stack. 
In order to answer the question, the DM asked 
the Domain Plan Reasoner how to answer the 
user's question. As a result, a discourse goal was 
returned to the DM and added to the agenda. 
The DM then sent the goal (Describe-name S 
H (object knuckle_arm_r)) to the CP. This 
goal generated utterance \[16\]. 
In system utterance \[18\], in order to resume 
the dialogue, a recta-comment, "Now let's con- 
tinue the explanation", was generated and the 
viewpoint returned to the previous one in \[14\] 
as noted in the DSF. After returning to the pre- 
vious view, the interrupted goal was re-planned. 
As a result, utterance \[19\] was generated. 
After completing this operation in \[23\], 
the next step, removing the right tie rod 
end, is started. At this time, if the 
user is viewing the left side (Figure 2) and 
the system has the goal (Instruct-act S 
H remove-tierod_end_r MR), (Operator 2} in 
Figure 4 is applied because the target object, 
right tie rod end, is not visible fi'om the user's 
viewpoint. Thus a goal of making the user view 
the right tie rod end is added as a subgoal and 
utterances \[24\] and \[25\] are generated. 
8 Discussion 
This paper proposed a inethod tbr altering in- 
struction dialogues to match the user's view in 
a virtual enviromnent. We described the Con- 
tent Planner which can incrementally decide co- 
herent instruction dialogue content to match 
changes in the user's view. We also presented 
the Dialogue Manager, which can keep track 
of the user's viewpoint in an on-going dialogue 
and use this intbrmation in resuming from inter- 
ruptive subdialogues. These mechanisms allow 
to detect mismatches between the user's view- 
point and the topic at any point in the dialogue, 
and then to choose the instruction content and 
user's viewpoint appropriately. MID-3D, an ex- 
perimental system that uses these mechanisms, 
shows that the method we proposed is effective 
in realizing instruction dialogues that suit the 
user's view in virtual enviromnents. 

References 

Elisabeth Andr6 and Thmnas Rist. 1993. The design of 
illustrated documents as a planning task. In Mark T. 
Maybury, editor, Intelligent Multimedia Interfaces, 
pages 94-116. AAAI Press / The MIT Press. 

William H. Bares and James C. Lester. 1997. Real- 
time generation of customized 3D animated explana- 
tions for knowledge-based learning environments. In 
AAAI97, pages 347-354. 

Alison Cawsey. 1992. Explanation and Interaction: The 
Computer Generation of Expalanatory Dialogues. The. 
MIT Press. 

George Ferguson and James F. Allen. 1998. TRIPS: 
An integrated intelligent problem-solving assistant. 
In AAAI98, pages 567-572. 

Barbara J. Orosz and Candace L. Sidner. 1986. Atten- 
tion, intentions, and the structure of discourse. Com- 
putational Linguistics, 12(3):175-204. 

W. Lewis Johnson, Jeff W. Rickel, and James C. Lester. 
2000. Animated pedagogical agents: Face-to-face in- 
teraction in interactive learning environments. Inter- 
national Journal of Artificial InteUigencc in Educa- 
tion. 

Tsuneaki Kato, Ynkiko I. Nakano, Hideharu Nakajima, 
and Takaaki Hasegawa. 1996. Interactive mnltimodal 
explanations and their temporal coordination. In 
ECAI-96, pages 261-265. John Willey and Sons Lim- 
ited. 

Akihisa Kawanobe, Susumn Kakuta, Hirofumi Touhei, 
and Katsumi Hosoya. 1998. Preliminary report 
on HyCLASS anthoring tool. In ED-MEDIA/ED- 
TELECOM. 

James C. Lester, Jennifer L. Voerlnan, Stuart O. Towns, 
and Charles B. Callaway. 1997. Cosmo: A lih;-like 
animated pedagogical agent witl, deictie believability. 
In IJCAI-97 Workshop, Animated Interface Agent. 

Jmnes C. Lester, Brian A. Stone, and Gray D. Stelling. 
1999. Lifelike pedagogical agents for mixed-initiative 
problem solving in constructivist learning environ- 
ments. User Modeling and User-Adapted Interaction, 
9(1-2):1-44. 

Mark T. Maybury. 1993. Planning multimedia explana- 
tion using communicative acts. In Mark T. Maylmry, 
editor, Intelligent Multimedia Interfaces, pages 59 -74. 
AAAI Press / The MIT Press. 

Johamm D. Moore. 1995. Participating in Explanatory 
Dialogues: Interpreting and I~esponding to Questions 
in Context. MIT Press. 

Chm'les Rich and Candace L. Sidner. 1998. COLLA- 
GEN: A collaboration manager for software interfhce 
agents. User Modeling and User-Adapted Interaction, 
8:315-350. 

Jeff W. Rickel and W. Lewis Johnson. 1999. Animated 
agents for procedual training in virtual reality: Per- 
ception, cognition and motor control. Applied Artifi- 
cial Intellifence, 13:343-392. 

Amanda Stent, John Dowding, Jean Mark Gawron, Eliz- 
abeth Owen Brat, and Robert Moore. 1999. The 
CommandTalk spoken dialogue systeln. In AC'Lgg, 
pages 183-190. 

Brian A. Stone and James C. Lester. 1996. Dynami- 
cally sequencing an animated pedagogical agent. In 
AAAI96, pages 424-431. 

Stuart G. Towns, Charles B. Callaway, and 3anles C. 
Lester. 1998. Generating coordinated natural lan- 
guage and 3D animations for complex spatial expla- 
nations. In AAAI98, pages 112-119. 

David R. Traum. 1994. A Computational Theory of 
Grounding in Natural Language Conversation. Ph.D. 
thesis, University of Rochester. 

Wolfgang \Vahlster, Elisabcth Andr6, Wolfgang Fin- 
kler, Hans-Jiirgen Profitlieh, and Thomas Rist. 1993. 
Plan-based integration of natural  and graph- 
ics generation. Artificial Intelligence, 63:387-427. 
