Multi-Modal-Method: 
A Design Method for Building Multi-Modal Systems 
Hideo Shimazu and Yosuke Takashima 
Information Technology Research Laboratories,
NEC Corporation
4-1-1 Miyazaki, Miyamae, Kawasaki, 216
Japan
{shimazu, yosuke}@joke.cl.nec.co.jp
Abstract 
This paper describes Multi-Modal-Method, a design method for building grammar-based multi-modal systems. Multi-Modal-Method defines the procedure which interface designers may follow in developing multi-modal systems, and provides MM-DCG, a grammatical framework for multi-modal input interpretation. Multi-Modal-Method has been inductively defined through several experimental multi-modal interface system developments. A case study of a multi-modal drawing tool development along with Multi-Modal-Method is reported.
1 Introduction 
This paper describes Multi-Modal-Method, a method for building grammar-based multi-modal systems.
The motivation behind this research is that defining such a method is necessary for building next generation interfaces. We believe the multi-modal interface is one of the advanced interfaces beyond present graphical user interfaces (GUIs) such as Windows and Macintosh. Although there has been significant research on multi-modal systems (Allgayer 1989; Cohen 1989; Cohen 1991; Hayes 1987; Kobsa 1986; Wahlster 1989), these systems have been built as task-specific expert systems, focused on the application of the ideas. Although a number of methodologies have been formulated by software scientists and consulting firms to build present GUIs, they are not applicable to multi-modal system development, because the underlying principles differ between present GUIs and multi-modal systems. Thus, we had to develop our own design methodology, optimized for multi-modal systems. We used the first grammatical framework for multi-modal systems, Multi-Modal Definite Clause Grammar (MM-DCG) (Shimazu 1994). Then, the Multi-Modal-Method was inductively defined based on several cases of grammar-based multi-modal system development.
2 Multi-modal Processing vs 
Event-driven Programming 
The multi-modal interface is one of the advanced interfaces beyond present GUIs. Present GUIs are an integration of object-oriented computing and event-driven programming.
One of the most important innovations in computer programming during the past decade has been the development of "object-oriented" computing. Viewing software components as if they are physical objects, characterizable via class/subclass relations based on simple features and/or how functions of the objects differ, is a powerful metaphor. The programmer can now imagine complex systems as built up of these simpler objects, much as a child builds a large structure out of simple building blocks or an architect arranges a functional, yet aesthetically appealing edifice from components such as wooden beams and metal girders. Thinking of the computer screen, the windows on that screen, and even the bits in those windows as simple objects composed together into a powerful editor has been an extremely compelling vision for interface designers. In fact, object-oriented programming has become a cornerstone of interface design, and the dominant metaphor in interface programming systems.
However, some recent systems have gone beyond objects for dealing with interface development. This is because, especially in window-based systems, some types of interface components do not fit well when viewed as "objects." Thinking of the mouse as a physical entity for the programmer to use makes perfect sense, but viewing a "mouse click" as an object seems less compelling. Similarly, other actions, such as sketching with a light pen, scanning a document, or speaking a sentence cannot be thought of as physical entities, but rather must be viewed as "events" which occur on an object. Thus, for example, tools on Windows like Visual Basic have been leaning toward a programming methodology that allows not only
objects, but also event-based programming.
It is our contention that while event-based programming is a step in the right direction, it does not go far enough. In particular, we claim that it is the order of events in a sequence that is critical. This is especially true in a multi-modal interface where events may be coming from a set of different computational devices, each running separately. In such an interface, a mouse click, a spoken utterance, a drawing with a light pen, and some typed commands may have to be integrated into a single input. The ordering of the input events is clearly a critical factor in understanding the meaning of such inputs, and "parsing" such a string requires a more principled approach than simply expecting an application to handle the plethora of diverse inputs in all their forms.
The major purpose of this paper is to define a framework and design methodology for a computing model which can interpret a set of events, particularly in the area of multi-modal interface design. In the next section we describe this idea more fully and develop a simple example.
                          Use of modalities
Fusion        Sequential      Parallel
Combined      ALTERNATE       SYNERGISTIC
Independent   EXCLUSIVE       CONCURRENT
Figure 1: Nigay and Coutaz's multi-modal system categorization
3 Understanding Event Streams 
Nigay and Coutaz (1993) divided multi-modal systems into four categories. They are defined by two independent features: fusion and use of modalities. "Fusion" covers the possible combination of different types of data. The absence of fusion is called "independent" whereas the presence is referred to as "combined". "Use of modalities" expresses the temporal availability of multiple modalities. This dimension covers the absence or presence of parallelism at the user interface. "Parallel use" allows the user to employ multiple modalities simultaneously. "Sequential" forces the user to use the modalities one after another. In this paper, we deal with the "synergistic" category, the most difficult among the four categories.
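The two-feature categorization can be written down as a small lookup. The following Python sketch is only an illustration: the enum and function names are hypothetical, and the assignment of the four category names to the feature pairs follows Nigay and Coutaz's design space as summarized in Figure 1.

```python
from enum import Enum


class Fusion(Enum):
    INDEPENDENT = "independent"  # no fusion of data from different modalities
    COMBINED = "combined"        # data from different modalities is fused


class Use(Enum):
    SEQUENTIAL = "sequential"    # modalities are used one after another
    PARALLEL = "parallel"        # modalities may be used simultaneously


# Category for each (fusion, use of modalities) pair.
CATEGORIES = {
    (Fusion.COMBINED, Use.SEQUENTIAL): "alternate",
    (Fusion.COMBINED, Use.PARALLEL): "synergistic",
    (Fusion.INDEPENDENT, Use.SEQUENTIAL): "exclusive",
    (Fusion.INDEPENDENT, Use.PARALLEL): "concurrent",
}


def categorize(fusion: Fusion, use: Use) -> str:
    """Return the category name for a system with the given features."""
    return CATEGORIES[(fusion, use)]
```

Under this reading, the systems addressed in this paper are those for which `categorize(Fusion.COMBINED, Use.PARALLEL)` holds.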
A simple example shows how difficult it is to understand synergistic user expressions. Consider the example of a child who is using a multimedia encyclopedia system which provides a mix of speech recognition (and language processing) and a mouse. The child states "Can this, do this," pointing at a picture on the screen and clicking the mouse during the first "this" and then choosing an item from a menu during the second. The system must realize that the first point is, say, a picture of a particular animal and the second is the menu item "fly." Somewhere, the system must create an internal representation of this query that conforms to some data (or knowledge) base query language. In the object-oriented metaphor, some sort of central application object is in charge, and must send messages to the screen, the mouse, and the voice system asking for input upon activation. This system then synthesizes that information and produces a query such as "[QUERY: Func-of <Object Dinosaur-bitmap-7> <menu item FLY>]" which it is programmed to answer.
Note, however, that as the central system object is in charge, it must send messages to (or otherwise contact) the various modalities of interaction to be aware of the possibility of input. This can be arbitrarily hard, especially as we consider that the number of modalities will keep growing as user interface technology design continues. Even for this simple example the same query can be asked many ways: the child could speak "can a pteranodon fly?"; could choose from the menu "query-function," point at the dinosaur, and then mouse "fly"; could type to a command line "query:function PTD Fly"; or use any other combination of these capabilities. The central object coordinating all these modalities must send appropriate messages at appropriate times to each of the drivers of the various devices, and then must synthesize the answers that are received.
Unfortunately, the situation is made even more complex by the fact that the system cannot extract all inputs and combine them in some single manner. The sequence in which the inputs are received can be critical; that is, the "event stream" must be analyzed as an ordered set of events which determine the interaction. If the child says "Is this (points at elephant) bigger than this (points at pteranodon)?" then the system must recognize in which order the points and the anaphoric references occur. Simply recognizing the query concerning the elephant and pteranodon is not enough; we must understand (and process) them in the correct order.
The computational metaphor we prefer is not that of objects, but rather that of processing the stream of events in a grammatical manner. Thus, instead of having a central object initiating some sort of message passing, we view each of the individual interaction techniques as producing reports concerning the events which occur and the timing of these events (e.g., the mouse in the above scenario will simply report "<Mouse-Click :Xpos 300 :Ypos 455 :start 2700 :end 2735>.")
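As a minimal sketch of this event-report view, the Python fragment below models each report as a time-stamped record and merges the per-mode streams into a single ordered stream. The `Event` fields, the `merge_streams` helper, and all timestamps are assumptions made for illustration; they are not the representation used by MM-DCG.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    mode: str     # reporting stream, e.g. "speech" or "mouse"
    content: str  # recognized word or pointing target
    start: int    # begin time, in the device's time units
    end: int      # end time


def merge_streams(*streams):
    """Merge per-mode streams, each sorted by start time, into one ordered stream."""
    return list(heapq.merge(*streams, key=lambda e: (e.start, e.end)))


# Illustrative reports for "Can this do this?" with two mouse inputs.
speech = [
    Event("speech", "can", 2400, 2600),
    Event("speech", "this", 2650, 2800),
    Event("speech", "do", 2850, 2950),
    Event("speech", "this", 3000, 3150),
]
mouse = [
    Event("mouse", "click(300,455)", 2700, 2735),
    Event("mouse", "menu(fly)", 3050, 3080),
]
ordered = merge_streams(speech, mouse)
```

No central object polls the devices here: each device only appends reports to its own stream, and the interpreter consumes the merged, ordered result.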
Using the example "can this do this", we describe more precisely how sophisticated synergistic inputs should be processed. Figure 2 shows four timing cases of a user's input of the example.
[Figure 2: Four input timings for "Can this do this?". For each of Cases 1 to 4, the figure aligns the Speech Mode utterance with the Mouse Input Mode events over time; a timeout is marked in the figure.]
Each case should be processed in a different manner:
Case 1: There are two mouse inputs, and each of them matches the corresponding speech input. Therefore, matching both inputs is easy.
Case 2: There is one mouse input, which points at a specific animated object, "pteranodon". The input matches the first "this". The second "this", therefore, is interpreted as the last referred action.
Case 3: There is one mouse input, which points at a specific action, "fly". The input matches the second "this". The first "this", therefore, is interpreted as the last referred animated object.
Case 4: There are two mouse inputs, one of which is input long after the first mouse input (for example, 1 minute after). In this case, the second mouse input is ignored because of timeout by the system. Only the first mouse input is interpreted. Therefore, case 4 is processed the same as case 2.
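The four cases can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions, not the MM-DCG mechanism: clicks are matched to the two "this" slots by what was pointed at (an object or an action) rather than by alignment with speech timestamps, the timeout is measured from the first click, and the names `Click`, `interpret`, and the 5000 ms default are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Click:
    kind: str    # "object" or "action": what the mouse input pointed at
    target: str  # e.g. "pteranodon" or "fly"
    time: int    # arrival time in milliseconds


def interpret(clicks, last_object, last_action, timeout_ms=5000):
    """Fill the (object, action) slots of "Can this do this?"."""
    clicks = sorted(clicks, key=lambda c: c.time)
    if clicks:
        first = clicks[0].time
        # Case 4: ignore mouse inputs that arrive too long after the first one.
        clicks = [c for c in clicks if c.time - first <= timeout_ms]
    slots = {"object": None, "action": None}
    for c in clicks:
        if slots[c.kind] is None:
            slots[c.kind] = c.target
    # Cases 2 and 3: a slot with no mouse input falls back to the
    # last referred object or action from the context.
    return slots["object"] or last_object, slots["action"] or last_action
```

With this rule, case 4 reduces to case 2 automatically, because the late click is discarded before the slots are filled.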
4 Multi-Modal-Method Design
Process
The design process of the Multi-Modal-Method has seven steps.
Step 1: Task selection
A number of multi-modal interfaces have been developed. There are certainly several application fields in which multi-modal systems are applicable. They include: design and editing, presentation, information retrieval, and education.
Step 2: Mode and media selection
The number and type of modes and media should be determined. Generally, mode and media do not have a one-to-one correspondence. For example, although speech input and keyboard input use different media, they are treated as the same mode because they are used and interpreted identically.
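The many-to-one relation between media and modes can be recorded as a simple table. In this Python sketch the mode names "language" and "pointing" are hypothetical labels chosen for illustration; only the grouping of speech and keyboard into one mode comes from the text.

```python
# Each input medium is assigned to the mode that interprets it.
MEDIA_TO_MODE = {
    "speech": "language",    # spoken words
    "keyboard": "language",  # typed words, interpreted identically to speech
    "mouse": "pointing",
}


def modes_for(media):
    """Return the distinct modes needed to interpret the given media."""
    return sorted({MEDIA_TO_MODE[m] for m in media})
```

A system with three media may therefore need only two modes, which reduces the number of interpreters the designer has to specify.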
Step 3: Corpus collection 
The corpus of multi-modal expressions to the application is collected. This process is the same as that for natural language processing.
Step 4: Corpus analysis
The collected corpus is analyzed. Each expression in the corpus should be analyzed based on the following criteria.
Economy: Does the expression save a user's labor? Each expression is examined as to whether it can save a user's labor when transferring his/her intention to the application system. For example, in a picture drawing tool, if a user is allowed to point at a specific object while saying "delete", he/she can save labor, because he/she does not have to move the mouse from the canvas to a menu item in the menu bar area, and then from the menu bar area back to the canvas.
Plausibility: Each expression is examined as to whether it is likely to be used in a real application. As described below, writing grammars for multi-modal interfaces requires much more effort than for single modal interfaces. Only frequently used expressions should be selected, and they should be selected carefully. The speech mode is better for selecting an item among a large number of candidates, such as choosing a city name among all cities in the USA. On the other hand, a menu interface is better for selecting one among a small number of candidates.
The set of the selected expressions becomes the seed for the specification of the designed multi-modal system.
Step 5: Specification Design 
The difficulty level of the interface design should be determined based on the analysis of selected corpus expressions. There are five difficulty levels of multi-modal input expressions (Table 1):
Level 1: Single mode input: Even in a multi-modal system, users often want to express their intentions with single modal expressions. For example, pointing at an existing object, then selecting "delete" from the menu.
Level 2: All mode inputs express identical contents: Each mode input expresses an identical content. For example, pointing at an existing object, then selecting "delete" from the menu, while saying "delete the rectangle".
Level 3: A combination of incomplete mode inputs complement each other: Each mode input does not express the contents by itself.
Each mode input complements other mode inputs; thus they express a single content. For example, pointing at an existing object, while saying "delete".
Level 4: Each mode input is contradictory: The contents generated from independent mode inputs contradict one another. For example, saying "delete the circle", while pointing at a rectangle object which hides the specified circle object on the screen. Contradictions are often solved by context analysis.
Level 5: A combination of mode inputs still lacks something: The contents generated from the combination of the interpretations generated from individual mode inputs are insufficient. For example, saying "move it here", while pointing at a specific point. The point should be unified with "here", and an object specified by "it" should be interpreted as the last referred object. This type of interpretation requires context analysis.
It becomes more difficult to interpret expressions as the level increases. In particular, since levels 4 and 5 require tight integration with context analysis, interface designers should consider whether the application users really need these levels or not.
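One way to make the five levels concrete during corpus analysis is to classify each expression from the partial interpretations that each mode contributes. The Python sketch below is an assumption-laden illustration: the slot names, the `REQUIRED` tuple, and the decision order are hypothetical, and a real system would derive the partial interpretations from grammar rules rather than receive them as dictionaries.

```python
REQUIRED = ("action", "object")  # slots a complete command must fill (assumed)


def classify(inputs):
    """Return the difficulty level (1-5) of one multi-modal expression.

    inputs maps a mode name to the slots that mode filled in,
    e.g. {"speech": {"action": "delete"}, "mouse": {"object": "rect1"}}.
    """
    merged = {}
    for slots in inputs.values():
        for name, value in slots.items():
            # setdefault keeps the first value seen for a slot.
            if merged.setdefault(name, value) != value:
                return 4  # two modes disagree: contradictory
    if any(name not in merged for name in REQUIRED):
        return 5          # even the combination lacks something
    if len(inputs) == 1:
        return 1          # single mode input
    if all(all(n in s for n in REQUIRED) for s in inputs.values()):
        return 2          # every mode is complete by itself: redundant
    return 3              # modes complement each other
```

For instance, `classify({"speech": {"action": "move"}, "mouse": {"destination": (10, 20)}})` yields level 5, because no mode supplies the object referred to by "it".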
Step 6: Architecture Design 
Any multi-modal system can have a multi-agent architecture because each mode processing is easily mapped to an independent agent. There are two extreme types of architecture which manage the agents. One is blackboard architecture, where agents exchange information using a shared memory called a blackboard. The architecture fits multi-modal systems whose multi-modal expressions are sophisticated and integrated with context analyses. The other is subsumption architecture, where each agent acts rather independently. Information exchange paths between agents are limited. The architecture fits multi-modal systems whose multi-modal expressions are simple and stereotyped. Many actual multi-modal system architectures are combinations of these extreme architectures.
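The blackboard style can be sketched in Python as a shared store that mode agents post to and a context agent reads from. All class and function names here are hypothetical, and the agents are reduced to one-line stand-ins for real recognizers.

```python
class Blackboard:
    """Shared memory that agents post partial interpretations to."""

    def __init__(self):
        self.entries = []

    def post(self, agent, slot, value):
        self.entries.append((agent, slot, value))

    def read(self, slot):
        return [(agent, value) for agent, s, value in self.entries if s == slot]


def speech_agent(board, words):
    # Stand-in: treat the last word as the object type that was named.
    board.post("speech", "object_type", words[-1])


def mouse_agent(board, pointed_type):
    # Stand-in: report the type of the object under the pointer.
    board.post("mouse", "object_type", pointed_type)


def context_agent(board):
    """Report whether the modes disagree about the object type."""
    values = {value for _, value in board.read("object_type")}
    return "conflict" if len(values) > 1 else "consistent"
```

Because every agent can see every posting, a contradictory input such as saying "delete the circle" while pointing at a rectangle is visible to the context agent; under a subsumption style, with limited exchange paths, that comparison would be harder to arrange.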
Step 7: Grammar rule writing
Each selected multi-modal expression is defined by the corresponding grammar rule to interpret it. The grammatical framework for the multi-modal expression should have the following functionalities:
(1) Modes should be interpreted equally and independently. If each mode is treated in the same manner as that of a natural language mode, syntax and semantics of inputs of each mode are defined with grammar formulation. Thus, complex multi-modal expressions can be defined declaratively and more easily.
(2) Mode interpretations should be referred to one another. Inputs of each mode should be interpreted independently. However, the interpretation of such inputs should be referred to by other mode interpretations. There are ambiguities which are solved only by integrating partial interpretations of related modes. For example, if a user states "this rectangle", pointing at a different type of object overlapping the rectangle object, the ambiguity of the object pointing must be solved by comparing the two mode interpretations.
(3) Mode interpretation should handle temporal information. Temporal information of inputs, such as input arrival time and the interval between two inputs, is important in interpreting multi-modal inputs.
Multi-Modal DCG (MM-DCG) supports these functionalities. MM-DCG is a superset of DCG (Pereira 1980); everything possible in DCG is also possible in MM-DCG. MM-DCG has two major extensions:
1. MM-DCG can receive arbitrary numbers of input streams, while DCG can receive only one. A single grammar rule in MM-DCG can allow the coexistence of grammatical categories, thus allowing for their integration.
2. In MM-DCG, each individual piece of input data is required to carry its beginning time and end time as a time stamp. Using the time stamp, MM-DCG automatically calculates the beginning time and the end time of any level of instantiated grammatical categories generated during parsing. The translator of MM-DCG to Prolog predicates generates code which performs this task.¹
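The time stamp calculation in the second extension can be pictured as propagating spans up the parse tree. The Python sketch below assumes the simplest possible rule, namely that a category begins at the earliest beginning time and ends at the latest end time of its constituents; this rule, the `Node` class, and `make_node` are assumptions for illustration, since the actual computation is performed by the code that the MM-DCG translator generates.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    category: str  # grammatical category, e.g. "noun_phrase"
    begin: int     # beginning time of the covered input
    end: int       # end time of the covered input
    children: list = field(default_factory=list)


def make_node(category, children):
    """Build a category whose time span covers all of its constituents."""
    return Node(
        category,
        min(c.begin for c in children),
        max(c.end for c in children),
        list(children),
    )


# "this circle" spoken while the mouse points at an object (illustrative times).
article = Node("article", 2650, 2800)
noun = Node("noun", 2820, 3100)
phrase = make_node("noun_phrase", [article, noun])
click = Node("pointing", 2700, 2735)
obj = make_node("object", [phrase, click])
```

Only the leaves need explicit time stamps; every higher category obtains its span from the constituents it was built from, regardless of which stream they came from.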
Figure 3 illustrates an application written in MM-DCG.
[Figure 3: Multi-modal application written in MM-DCG. The figure shows word and click inputs, a Multi-modal Interpreter containing the MM-DCG Rules, and the underlying Prolog Interpreter.]
These processes form one cycle in the system evolution. Because of the increase in multi-modal expressions, the quality of the system improves as the cycle iterates. When the system reaches the mature stage, the system is released to end users.
¹ The details of MM-DCG are described in (Shimazu 1994).
Level  Type           Example
1      single mode    pointing at an object, then "delete" from the menu
2      redundant      pointing at an object, selecting "delete" from the menu, while saying "delete the rectangle"
3      incomplete     pointing at an existing object, while saying "delete"
4      contradictory  saying "delete the circle", while pointing at a rectangle which covers the specified circle
5      lacking        saying "move it here", while pointing at a point
Table 1: Five levels of multi-modal inputs
5 Case Study 
This section describes the design process of a multi-modal drawing tool along with the Multi-Modal-Method. The following is the trace of the design process.
Step 1: Task selection. Since there has been significant research on developing multi-modal drawing tools (Hiyoshi 1994; Nigay 1993; Vo 1993; Bellik 1993), the application field is promising.
Step 2: Mode and media selection. In this experiment, we focused only on input modes. Input modes include speech, keyboard and mouse inputs. These input modes are synergistic. Output modes include pictures and text, but outputs are not synergistic.
Step 3: Corpus collection. We collected about two hundred multi-modal expressions from potential users as instructions for the multi-modal drawing tool. The users had experience with using existing drawing tools.
Step 4: Corpus analysis. The following are some of the results of the analysis of the collected corpus.
• Users want to use various mixed modes according to the situations they are dealing with.
• Users want to use abridged expressions, which requires integration of multi-modal interpretation and context analysis.
• Users want to handle existing objects as a set. For example, "Change the color of all circles".
• Users want to handle existing objects which are not shown on the display. For example, asking "how many rectangles are hidden out of the canvas?".
• Users want to use the mouse ambiguously. For example, saying "Delete this circle", while pointing at a point away from but near the circle. Such ambiguous pointing can be correctly interpreted only when multi-modal expression is allowed.
Step 5: Specification Design. The analysis taught us that multi-modal drawing tools should support levels 4 and 5 (the most difficult levels) to meet ordinary users' requirements. The specifications were determined based on these requirements.
Step 6: Architecture Design. Since the required specification is the most difficult synergy level, the architecture is blackboard architecture where each agent can exchange information in varying ways.
Step 7: Grammar rule writing. After the analysis, about forty expressions were selected, and variations of each selected expression were also generated and added. Grammar rules were defined corresponding to each multi-modal expression. Figure 4 shows a part of the grammar rules written in MM-DCG. The rules define how to interpret an imperative sentence like "Delete this circle" with varieties of expressions. It allows the spoken utterance mode (speech stream), the typing mode (keyboard stream), and the mouse pointing mode (mouse stream). Rules in the level 1 section define single modal expressions. In the level 2 section, whether different mode inputs express identical contents is examined. The combination of the verb_by_multimodal/1 clause and the second object/1 clause is an example of the level 3 expressions. In the level 4 section, select_right_meaning/3 enclosed inside curly brackets { and } is a Prolog predicate which determines the correct meaning using context analysis when different mode inputs generate contradictory meanings. Such a predicate is defined in a task-specific manner. In the level 5 section, find_appropriate_term/2 enclosed inside curly brackets { and } is a Prolog predicate which finds an appropriate term using context analysis when the combination of generated meaning of all modes still lacks information. Such a predicate is also defined in a task-specific manner. A trivial heuristic rule example is "to use the most recently appeared term".
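The trivial heuristic mentioned above, using the most recently appeared term, can be sketched as follows. This Python fragment is only an analogue of what a task-specific find_appropriate_term/2 might do; the `Context` class and its method names are hypothetical and do not reproduce the Prolog predicate.

```python
class Context:
    """Dialogue history used to fill in terms that the input left out."""

    def __init__(self):
        self._history = []  # (kind, term) pairs, oldest first

    def mention(self, kind, term):
        """Record that a term of the given kind ("object" or "action") appeared."""
        self._history.append((kind, term))

    def find_appropriate_term(self, kind):
        """Return the most recently appeared term of this kind, or None."""
        for k, term in reversed(self._history):
            if k == kind:
                return term
        return None
```

After `mention("object", "circle1")` and `mention("object", "rect2")`, a later "move it here" would resolve "it" to `rect2` under this heuristic.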
Grammar writers should understand that the number of grammar rules for multi-modal interfaces becomes much larger than for any single modal interface. If there are three modes, M1, M2, and M3, and the numbers of grammar rules
% stream definition
active_stream(speech, mouse, keyboard).

% Level 1
imperative(meaning(Action, Object)) --> verb(Action), object(Object).
verb(Action) --> verb_by_menu(Action).
verb(Action) --> verb_by_multimodal(Action).
verb_by_menu(Action) --> menu(Menu_item, Action).
verb_by_multimodal(delete) --> (speech or keyboard):[delete].
menu(menu_item_24, delete).
object(Obj) --> noun_phrase(Obj).
object(Obj) --> pointing(Obj).
noun_phrase(Obj) --> article, noun(Noun), {attribute(type, Noun, Obj)}.
article --> (speech or keyboard):[this].
noun(circle) --> (speech or keyboard):[circle].
pointing(Obj) --> mouse:[button(left, loc(X, Y))], {attribute(location, (X, Y), Obj)}.

% Level 2
verb(Action1) --> verb_by_menu(Action1), verb_by_multimodal(Action2), {Action1 == Action2}.
object(Obj1) --> noun_phrase(Obj1), pointing(Obj2), {Obj1 == Obj2}.

% Level 3

% Level 4
verb(Action) --> verb_by_menu(Action1), verb_by_multimodal(Action2), {select_right_meaning(Action1, Action2, Action)}.
object(Obj) --> noun_phrase(Obj1), pointing(Obj2), {select_right_meaning(Obj1, Obj2, Obj)}.

% Level 5
imperative(meaning(Action, Object)) --> verb(Action), {find_appropriate_term(object, Object)}.
imperative(meaning(Action, Object)) --> object(Object), {find_appropriate_term(action, Action)}.
Figure 4: Grammar Description of "Delete this circle" Using MM-DCG
for each mode are G1, G2, and G3, then the total number of the multi-modal grammar rules is the sum of the grammar rules of any combination of these three modes. Thus, the total number, G_total, is:
$$G_{total} = \sum_{\{M_1, M_2, M_3\} \supseteq S} G_S$$
where G_S denotes the number of grammar rules for the mode combination S.
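To see how quickly the number of mode combinations grows, the sum can be enumerated directly. In this Python sketch the helper names are hypothetical, the empty combination is assumed to contribute no rules, and the rule counts in the usage note are made-up numbers for illustration only.

```python
from itertools import combinations


def mode_combinations(modes):
    """Return every non-empty combination of the given modes."""
    return [
        frozenset(combo)
        for r in range(1, len(modes) + 1)
        for combo in combinations(modes, r)
    ]


def total_rules(modes, rules_for):
    """Sum the rule counts over all mode combinations.

    rules_for maps a frozenset of modes to the number of rules written
    for exactly that combination; missing combinations count as zero.
    """
    return sum(rules_for.get(s, 0) for s in mode_combinations(modes))
```

Three modes already give seven combinations to consider. With illustrative counts of 10, 5, and 3 rules for the single modes, 4 rules for the pair M1 and M2, and 2 rules for all three together, `total_rules` returns 24.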
The above steps took about two man-months for the first cycle. The most time-consuming steps were step 4 and step 7.
6 Conclusion 
This paper described the Multi-Modal-Method, a design method for building grammar-based multi-modal systems. The Multi-Modal-Method defines the procedures which interface designers may follow in developing grammar-based multi-modal systems, and provides MM-DCG, a grammatical framework for multi-modal input interpretation. The Multi-Modal-Method has been inductively defined through several experimental multi-modal interface system developments. A development process of a multi-modal drawing tool along with the Multi-Modal-Method was also introduced.
Acknowledgements 
We would like to thank Prof. James Hendler for his advice during this research and in writing this paper.
References 
Allgayer, J., Jansen-Winkeln, R., Reddig, C., and Reithinger, N., "Bidirectional Use of Knowledge in the Multi-Modal NL Access System XTRA", Proc. of IJCAI-89, 1989.
Bellik, Y., and Teil, D., "A Multimodal dialogue controller for multimodal user interface management system application: A multimodal window manager", Adjunct Proceedings of INTERCHI-93.
Cohen, P.R., Dalrymple, M., Moran, D.B., Pereira, F.C.N., et al., "Synergistic Use of Direct Manipulation and Natural Language", Proc. of CHI-89, 1989.
Cohen, P.R., "The Role of Natural Language in a Multimodal Interface", 1991 International Symposium on Next Generation Human Interface, 1991.
Hayes, P.J., "Steps towards Integrating Natural Language and Graphical Interaction for Knowledge-based Systems", Advances in Artificial Intelligence - II, Elsevier Science Publishers, 1987.
Hiyoshi, M., and Shimazu, H., "Drawing Pictures with Natural Language and Direct Manipulation", Proc. of COLING-94, 1994.
Kobsa, A., Allgayer, J., Reddig, C., Reithinger, N., Schmauks, D., Harbusch, K., and Wahlster, W., "Combining Deictic Gestures and Natural Language for Referent Identification", Proc. of COLING-86, 1986.
Nigay, L., and Coutaz, J., "A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion", Proc. of INTERCHI-93, 1993.
Pereira, F., and Warren, D.H.D., "Definite Clause Grammars for Language Analysis - A Survey of the Formalism and a Comparison with Augmented Transition Networks", Artificial Intelligence, vol. 13, no. 3, 1980.
Shimazu, H., Arita, S., and Takashima, Y., "Multi-Modal Definite Clause Grammar", Proc. of COLING-94, 1994.
Vo, M.T., and Waibel, A., "A multi-modal human-computer interface: Combination of Gesture and Speech Recognition", Adjunct Proceedings of INTERCHI-93.
Wahlster, W., "User and discourse models for multimodal communication", in J.W. Sullivan and S.W. Tyler, editors, Intelligent User Interfaces, chapter 3, ACM Press Frontiers Series, Addison Wesley Publishing, 1989.
