Multi-Modal Definite Clause Grammar 
Hideo Shimazu, Seigo Arita, and Yosuke Takashima 
Information Technology Research Lal)or~tories, NEC Corporation 
4-1-1 Miya.zaki, Miya.mae Kawasaki, 216 Ja.pan 
{sh±mazu, arita, yosuke}@joke.cl.nec.co.jp 
Abstract 
This paper describes the first reported gram- 
matical framework for a nmltimodal inter- 
face. Although multimodal interfaces offer the 
promise of a flexible and user fl'iendly means 
of human-coml)uter interaction, no study has 
yet appeared on formal granunatical f'l'ame- 
works for theln. We have developed Multi- 
Modal Definite Clause Ch'ammar (MM-I)CG), 
an extension of Definite Clause Gramumr. The 
major features of MM-I)CG inch, de eal)ability 
to handle an arbitrary mlmber of modes and 
temporal information in grammar rules, l:ur- 
ther, we have developed MM-DCG translator 
to transfer rules in MM-DCG into Prolog pred- 
icates. 
1 Introduction 
This paper describes tile first reported grammatical 
fi'amework for a multimodal interface. Specifically, the 
authors have developed MM-DC.G (Multi-Modal l)cfi- 
nite Clause Gra,nmar), an extension of I)CCI \[Pereira 
and Warren, 1980\] for lnultm3odal input processing. 
The major features of MM-DCG include capability to 
handle an arbitrary nn,nber of modes and temporal in- 
formation in grammar rules. 
The motivation behind this research was two-foht. 
First, the extension to multimodal was found t.o be 
the minimum requirement \[br natural language inter- 
face systems to be insta.lled in real al~plications. We 
have developed natural language interface for relational 
database (RDB) \[Shimazu et. al., I9!)2; Arita et. al., 
1992a; Arita et. al., 1992b\]. Spoken user queries are 
transformed into SQL specifications, and dispatched t.o 
RDBMS. The retrieved results are displayed at a com- 
puter terminal. The results include not only table forms 
but also picture images, like Figure 1. When users see 
picture images on the terminal, they naturally want to 
generate following queries by referring to such picture 
images. For example, they want to say, "Show me the 
interior of this one" or "Are there the same type of cars 
as this ear" while pointing at a specific picture on the 
display. If such multi-modal utterances be accept.able, 
the natural language interface will be more practical 
Figure I: Natural Language Interface Screen hnage 
enough to be used in many real world applications. 
Second, no st;udy has yet appeard on developing for- 
real grammatical fi'amework for multi-modal interfaces. 
Although there have been many researches on multi- 
modal systems, these systems are built as task-specific 
expert systems. The capability of such systems to pro- 
cess multi-moda.l inputs is too limited to interpret com- 
plex multi-modal expressions. This is mainly due to the 
fact that they have not developed their systems on for- 
nml grammatical framework for multi-modal interfaces. 
MM-DCG is the first reported grammatical frame- 
work for a multimodal interface. Multi-modal input 
processing rules can be written in MM-I)CG simply and 
effectively. Rules in MM-I)CG are translated into Pro- 
log predicates easily. 
2 Multi-Modal Input Processing 
Consider a query (.'xample to a nmlti-modal interface 
with a screen image like Figure. 1. A user states "Can 
this, attach this," pointing at a picture on the screen 
and clicking the mouse during the first "this" and then 
choosing an item fl'om a lllenu during the second. The 
system must realize that the first point is to a spe- 
cific autonaobile and the second is to the menu item 
"CD player". After integrating the two mouse pointing 
events into the two "this" in the utterance, the system 
nmst create an internal representation of this query that 
conforms to SQI, specifications. \[n tiffs example, even 
if the order of the two mouse clicking events is opposite, 
832 
the system Intlst generate the salne SQI, spcciiicaI.ion, 
but the interl>retation will I>e i\]l(>l'e dill|cult. In order 
to interpret such complex (:ombinatio,s of lmllti-modal 
inputs, the following requirement.s exist: 
(1) Modes should be interI)reted equally an<t in- 
del)ende.ntly. In <:onventiomtl multi-modal systems, 
natural language mode plays a major role, aml other 
modes such as mouse input mode are auxilia.ry. Inl)uts 
of auxiliary modes are merged into <;orresl)onding nat- 
ural language expressions iu a surl'ace level, and the 
merged natural language query is interpreted I>y con- 
ventionatl natural language parsers. Therefore, varie.l.y 
of accepte<l multi-modatl exl>r<'.ssions is very limited. 
llowever, If each tnode is treated wit, Is the same man- 
sler as that of ssatsls'atl \]allgSH/ge IlSOde, SyllldtX assd s(,- 
mantics of iulmts of each mode are (lefim~d with gram- 
sBar forlnulat;ion. 'Fhus, ccmq)lex exl)rcsskms can l>e de 
fined declaratively and more easily 
(2) Mode int<~'rI)reta|;ion should be r<4'(!l'red to 
one another, lnl)uts or each mode should be inter- 
preted independently. Ilowever, the interl)retatiol~ of 
such inputs should be referred I>y other lnode interl)re- 
tattions. There are alnbiguities which arc solved only by 
integrating partial interl)retabi<ms oF related modes. For 
examl>le, if user states "tiffs car", l>oi~ttit~g at an object 
which is overlal>l)ed on the. car object., the alnhiguity of 
the object pointing must he solved by conHlaring (lie 
two mode interpretations. 
(3) Mode interpre, tai;icm should handle temI>oral 
inthrmation. Tetlq>oral iuformation of inputs, such 
as input arriving time, inl,erwd between two inl)uts, 
plays an important role to i,~terl)rct mull.i-modal inputs. 
Consider an exasnl>le that a user states "\]low muc\[s is 
this car", and points at, a car i>icture a litt.le after the 
utterance. If tile interwd is three .scco~sds, the l>ointing 
event should be integrated with "this car" in the ut-. 
terance. Ilowever, if the ilH.erwd is three illi~sles, tile 
event should not I)e int.egraled. 
3 MM-DCG Design l)ecisions 
This section describes major design decisions made in 
developing MM-1)CG. Ih:eause MM-I)(X; is n superset 
of I)CG, everything possihle isl I)(!G is also possibhe in 
MM-I)CG. llowever, two major extensions are provided 
3.1 ll.eceiving Multiple Input Streams 
MM-I)CG cau receive arbitrary mind>ors o\[' different, in- 
put streams, while I)CG receives only ore!, I';ach mode 
is assigned an individual stl'ealll. Tlscrefore, a single 
grammar rule in MM-1)(:C, can allow the coexistence of 
grammatical categories in ditSwent modes, thus allow- 
ing for their integration. In addition, coa|.ext sensitiw~ 
inlbrnmtion can be inl.crclmnged among cattegories of 
different modes in a single rule. Figure 2 illustrates 
a multi-modal input processing luodule which accepts 
three independent streams. 
~' word word 
word word 
click click 
Multi-modal Ingrator 
MM-DCG Rules ::::::::::::::::::::::::::::::::::::::::::::::::: 
!:N~i~i:~& ~li:i~a~? :!:i!iii!iiil 
Prolog I, nterpfeter 
k. 
l,'igm.e 2: Multi-modal Input Processing Module 
T1 T2 T3 T4 T5 T6 I/ II 
(tl, t2, "the") (t3, 14, "blue") (t5, td, "car") 
Chronological Diroclion 
Figure 3: Time Calculation of Instant|areal Semantic 
(i:attegorics 
3.2 Cal<:ulating the Instantiated Time of 
Grammatical Categorh,,s 
Inputs of a single mode invariably have ordering rela- 
tions among them. A parser like DCG uses such order 
relations to amdyze syntax, semantics, and pragmatics. 
h,pul.s of differe.nt modes, however, have no inherent or- 
dering rehd.ions. Therefore, MM-I)CG requires tim at- 
t.achmelH: of both the beginning time and the end time 
to each individual piece or input data. MM-DCG au- 
tomatically calculates the beginning time and the end 
tiuw of any lew4 of grammatical categories generated 
during Imrsing. 
MM-I)C(; translator automatically generates the 
code which calculates the beginsfing and end times of 
any body goal in at grammar rule. The translator gen- 
erates two extra argnments to store the beginning time 
and end time into each head and body goals in MM- 
I)CG rules. The beg|truing time argument of the head 
is unified with the beg|truing time argulner, t of the first 
hody goal. The end time a.rgu,nent of the head is uniIied 
with the end time argument of the last body goal. Fig- 
m'e 3 shows the argtH\]lellt organization of noun_phrase 
rule. 
Thus, for example, if a noun phra~se category is in- 
stantiatcd by pa.rsing "tile blue car", the beginning time 
of the instant|areal category becomes equal to tile begin- 
,ring tilsle of "the", and the end time of the category is 
equal to the end time of "car". 
8.33 
Mouse input stream 
(button(left, (10, 20)) ) (button(left> (7, 25)) 9 
Time Interval 
Figure 4: Thneout C.oncept 
MM-DCG requires any input frolu every mode to 
have begimfing and end times. Thus, each item in an 
input sequence will haw; the following sl.ructilre: 
input(beginning-time, end-time, <actual input>) 
which means that the actual input was inputted frolll 
start-tlme and completed at. end-time. Adding of this 
time information is easy for ally of the SOl'l.s of till)ill. 
modes we are considering (i.e. speecll recognition, key- 
board inputs, mouse 1)oint, ing, el.c). 
One other iml)orta.nt item of notation: \[l'a variable 
is explicitly bound within at goal, the variable ret.urus 
the beginning and end times of the goal hi the R)rlll of 
a finletor. Thus, 
Time^goal 
means that "if goal succeeds, the beginnhlg time and 
end time of tile goal are rctnl'ned ill the wu'iable Time." 
Using the time iiflbrmation of instautiated categories, 
rule writers can define chronological collstra.ints aniong 
categories, for exaniple, the following descriptiotl ex- 
presses a constraint that pronoun category and pointing 
category nnist be both instantiated wittliu a five se(:- 
onds~ 
Tl'pronoun, T2"point £ng, 
{Dill is T2 - TI, Diff< 5} 
3.3 Defining Timeout in I{.ules 
Timeout is a constraint of intervals belween an input 
and its succeeding input of n. streanl (See Figure 4). If 
an interval between inputs of a st rean| hecomes larger 
than a threshold defined in gralluiu/r rllles, tile tinieout 
occurs, and tile streani is regarded C.llipl.y l.einporariiy 
although there still exist inputs in it.. 
The following points rule llleaus that "l/eceive i/louse 
clicking inputs wllile, tile interval between I.wo inputs is 
less thau 5 seconds or ilnti\] a stream I'Jecoines null, then 
return the list of the hlputs" 
points(E\]) --> mouse : E\] (s . o) . 
points(\[Pt I Pts\]) --> point(Pt), poin~s(Pts) 
point(Loc) --> mouse: \[button(left, Loc)\]. 
4 Rules Written in MM~DCG 
4.1 Syntax 
MM-DCG syntax extends I)CG in the following ways: 
• A body goal may o,' may not be specified its con- 
smiting stream: 
Irl' a body goal consumes inputs from specific 
streams, the goal must be accompanied by the 
stream names. For example, tile following rule 
noun_phrase --> keyboard:pronoun. 
nieans that "if the pronoun category is found which 
is generated by inputs from the keyboard stream, 
noun_phrase is found." If a body goal is not accom- 
pa,iied by any stream name, the goal is regarded as 
consunling sonic amount of inputs fi'om all modes. 
For example, the following rule 
noun_phrase --> noun. 
lneans that "if the noun category is found which 
is generated by inpufos frorn certain streams, 
noun_phrase is found." 
• A terminal synibol should always be accompanied 
by a specific stream name: 
For example, the following rule 
pointing--> mouse:\[button(left, loc(X, Y)\]. 
means that "if a flmctor button(left, Ice(X, Y)) is 
found at the mouse strea.nl, pointing is found". 
4.2 lime Example 
To demonstrate how MM-I)CG rules are written, this 
section describes a simple grammar needed to handle 
"object" with multi-modal inputs. 
Figure 5 shows the definition of "object". A rule 
writer defines existing slmeams specifically using a unit 
clause, active_stream/1. "Object" are specilied by using 
eitller one of the abow~' inodes or their combinations. 
The first object/l delhfition interprets natural lan- 
guage specifh:ations such as "the blue car". The second 
object/l interprets a nlouse clicking which points at a. 
sl>ecific grai>hical object on the display. The third ob- 
ject/l definition izd.erprets a combination of a natural 
language utterance and a inouse pointing, such as stat- 
ing "the bhle car" while pointing at a graphical object 
oil the display. A natural language utterance is inter- 
preted at. the noun_phrase body goal, and the identified 
object is bound to Objl. A mouse pointing event is in- 
terpreted at the pointing body goal, and the identified 
object is bound to Obj2. 
Then, Objl and Obj2 are compared their values in a 
Prolog predicate enclosed inside curly brackets { and 
}. Both variables shonld be equal. If not, because the 
interpretation of noun_phrase or pointing must be wrong, 
bacld.racking occurs. 
As seen above, a single grammar rule in MM-I)CG 
can allow the coexistence of grammatical categories in 
different niodes, thus allowing for their integration. In 
addition, teniporal and context sensitive information 
can be interchanged aniong categories of different modes 
in a single rule. 
834 
~, stream definit:ion 
active_stream(speech, mouse, keyboard). 
?, For natural language mode 
object(0bj) --> notm_phrase(\[Ibj). 
noun_phl-ase(Obj) -~> article, adjective(Att:t, A value), noun(Noun), 
{attribute(type, Noun, 0bj), attr~bute(Attr, A va\]ue, 0bj)}. 
article --> (speech or keyboard): \[the\]. 
adjective(color, blue) --> (speech or keyboard):\[b\]ue\]. 
noml(automobile) ---> (speech or keyboaid) : \[car\]. 
~. For mouse mode 
object(Obj) =-> po~nting(Ubj). 
poillting({\]bj) -~> mouse: \[button(\]efL, lee(X, Y))\] ,{attribute(location, (X, Y), 0bj)}, 
?, For combinations of modes 
object(Objl) --> noun phrase(t)bji), pointing(Obj2), {0bjl == 0bj2}. 
Figure 5: (;ramlnar I)cscriplion l",xample Using MM-I)C(; 
5 Translating MM-I)CG into Prolog 
This secl,ion describes lranslaLioll lcchniquos o\[' MM- 
I)C(; rules into Prolog i)redi('alcs, l"irst, we explain 
the translat, ion method of I~IM.I)(:(; ruh!s with a sin- 
gle stream. Even in the single, strca.i cas~', MM-I)(?(; 
translation method is dill'err,hi from Ihal of I)( I( ',. Then, 
the ira.sial, ion tecludqu<e wit.h tlmlliph' Sll'eaHiS is CX 
pie|ned. 
5.1 MM-DCG Transla|;ion for a Single Stream 
A head and body goals i. a gra.mtar ride ar~, I re,slated 
into ;~ predicate with four exLra al'guntcl~l.s Lwo for i he 
beginning time mid l, he end l inle and Iwo tLr ~'xpressing 
a eOllSttllled ill\[)ll{, Si, l'i!alll. '\['h<! l;ll.\[l!r two al'gtlHtelllS are 
tim same its the gelleral,cd al'g/llllCIll,S W\]I(!II I)(I(~ ruie,% 
are translated into Ih'olog prmlicai.es. 
The beginning tinlc arglmllml, of l, hc head is uni/icd 
with i, he begin,ring l, ilnC arguuleul, of i, he lirsl, body goal, 
and the end t;inlc argumenl, of the head is unilied with 
the elK| I, inle of the last, body goal. For eXalllp\]e, \[,h,.! 
following MM-I)C(; rule (for a single ,%r<un): 
nounphrase -- > article, adj, noun. 
is translates inl,o: 
noun_phrase(T0, T, N0, N) :- 
article(T0, TI, NO, Nl), 
adjective(T2, T3, NI, N~), 
noun(T4, T, N2, N). 
or, in Fmglish, 
There is a retail-phrase l~etu,ecn NO and N if 
there is an article Iwtwccn NO and NI, aud if 
there is an adjectiw! I,etu,~,,,u NI and N-), aml if 
there is a noun hetwec'n N2 and N, The noun- 
t~hrasc starts at (1'0, nml cm/s at T. The article 
starts at TO, a11d eu,ls at TI. "l'lw a+(j('ctivc 
starts at T2, and ends at "1'3 \[l'hc tloutl starts 
at T4, aml ends at T. 
A rule with a terniinal sylllboI is II'allS\]alcd illlO a 
ullil; ciallse, l"or examl)lc , 
noun --> keyboard:\[window\]. 
trails\[aLes into: 
noun(Ts,Te, \[input(Ts,Te, + <window' ') IN\] ,N) . 
A funcLor input/3 is inseri,cd into the third argunmnt 
forlllili,~ the input, s\[,rCalll of {,he+ predicate. The third 
al'~lllllOnl, of t, he t'llllCl\[,or input/3 is the act, ual input item, 
the "wimhm," string in this example. 
The first and second al'gUillelll, of input/3 is unitied 
wiLh the first and second argument of this unit clause 
r{~spectiw!ly, Th,~refore, if a string "window" is input via 
lhe keyboard ~t, reum, the noun category is instant|areal, 
and the beginldng and end time of the noun category 
is Llle same as t, lle start and (!lid Lime attached to the 
"window" input. 
5.2 Exte, nsion |;o Artdtrary Nmntmr of 
S trealliS 
Exl, ension frol. a single st, ream to nmltiple streams is 
easy. E;t('h stream needs four extra arguments - two for 
t, imiug iuformnt, iou and two for expressi.g a consumed 
input Si.l'Calll, Thus, i\[' there are n modes, 4n arguments 
arc ~.ldcd into head and goals argunlenl;s. 
For e×anll~lc: , if l, hcre are two streams, the noun_phrase 
defiuitioa in Lhc previous section is translated into the 
following prolog l,'edicaws with eight (2 x 4) extra ar- 
gillllell\[,S: 
llOUU_i3hras e (TxO, Tx, ~IxO, Nx, Ty O, Ty, NyO, Ny) : - 
article(TxO,Txl ,NxO,Nxi,TyO,Tyl,NyO,Nyl), 
adjective (Tx2,Tx3, Nxl ,Nx2,Ty2,Ty3, Nyi ,Ny2), 
noun(Tx4,Tx,Nx2,Nx,Ty4,Ty,Ny2,Ny). 
5.3 Extractions of Temporal Information 
If there is at variable bindi\]tg within a goal like, 
Tinle -goal 
the goal is t, ranslat, cd into a con,jullcl,ion of two body 
goals (for u single mode): 
(goal(T0, "1'1, R0, R),Time-- (T0, T1) ) 
835 
Ifthere exist n streams, tim variable Time is bound 
to a list of n time pairs, such as ~n'two modes: 
(goal(TxO,Txl,NxO,gxl,TyO,Tyl,NyO,Nyl), 
Time = \[(TxO, Txl), (TyO, Tyl)\] ) 
6 Related work 
The idea of understandillg multi-modal inputs in con- 
junction with each other, as presented in this paper, is 
not particularly new. The idea of a nnllti-n/odal input 
combining motions and pointing has been explored in a 
number of contexts. The classic 1980 paper "I>ut-That- 
There" \[Bolt, 1980\] describes an early system that pro- 
cedurally combined voice and gesture inputs. This idea 
was further explored in terms of integrating natural lan- 
guage and pointing by \[/Iayes, 1988\], who related nmlti- 
modal inputs to anaphoric reference in imtural language 
processing, particularly to t.he work o\[' \[Grosz, 1977\] and 
\[Sidner, 1979\]. Recent work in the design of direct ma- 
nipulation interfaces has also explored the notion of in- 
tegrating a set of diverse inpuls. Othe.r palmrs explor- 
ing multimodal interfaces include \[Allgayer el. al., 1989; 
Cohen el. al., 1989; Cohen, 1991; Kobsa et. al., 1986; 
Wahlster, 1989\]. Most of this work, howew.'r, has tb- 
cused on the application of the ideas, and not on the 
principles for integrating the different inputs. 1 
7 Conclusion 
In this paper, we haw; proposed the use of a grammar 
for dealing with input ewmt.s in a lmdti-modal user in- 
terface. We proposed MM-I)C(~, a novel gralnmatical 
framework for amultimodal inl.erface. MM-I)(:G is an 
extension of 1)CG for rnull.i-modal inpuls processing. 
The major features of MM-D(:G inchldc capability to 
handle an arbitrary nnnaber of modes and feral)oral in- 
formation in grammar rules. We showed its use \['or a 
simple example. The translation technique of the MM- 
DCG rules into Prolog predicates was also presented. 
An initiM implementation of MM-I)CG has been devel- 
oped at NEC Corporation, alld is currently being used 
for the development of a l)rototype mull.i-modal inter- 
face. 
References 
\[Allgayer el. al., 1989\] Allgayer, .\]., Jansen-\Vinke.ln, R., 
reddig, C., and Reithing N., 
\[Arita et. al., 1992a\] 
Arita, S., Shimazu, H., and 'lakashima, Y., "I~orl.able 
Natural Language Interface", Proc. of I.he 8th I\]nnlan 
Interface Symposium, 1992, (in Japanese). 
\[Arita et. al., 1992b\] 
Arita, S., Shimazu, Ii., and Takashima, Y., "Siml)le 
+ Robust = Pragmatic: A Natural Lal,guage Query 
Processing Model h)r Card-type 1)atabases", Proc. of 
the 13th Annual Conference of Ihe Cogldtive Sc:ience. 
Society, 1992. 
1A survey of this work is beyond I\[le scope O\[' this paper, 
the interested reader is directed to the review in \[Shneider- 
man, 1991\]. 
\[Bolt, 1980\] Bolt, R.A., "Pat-That There: Voice and Ges- 
ture at the Graphics Interface", Computer Graphics 14, 
3, 1980. 
\[Clocksin and Mellish, 1981\] (31ocksin, W.F. and Mellish, 
C.S., "Progrannning in Prolog", Springer-Verlag, 1981. 
\[Cohen et. al. , 1989\] Cohen, P.R., l)alryml)le, M., Moran, 
I).B., Pereira., F.G'.N., et al., "Synergistic Use of Direct 
Manipuhttion and Natnral Language", Proc. of CHI-88, 
1989. 
\[Cohen, 1991\] Cohen, P.R., "The Role of Natural Language 
in a MultinlodM Interface", 1991 lnternationM Sympo- 
sium on Next Generation Human Interface, 1991. 
\[Grosz, 1977\] Grosz, B. "The representation and use of fo- 
cus in a system for understanding dialogs," Proc. IJCAI 
1977, Boston, MA. 
\[ilayes, 1987\] Hayes, P.J., "Steps towards Integrating natu- 
ral Language and Graphical Interaction for Knowledge- 
based Systems", Advances in Artificial Intelligence- II, 
Elsevier Science Publishers, 1987. 
\[llayes, 1988\] llayes, P.J., "Using A Knowledge Base To 
Drive An Expert Systenl Interface With A Natural Lan- 
guage Component," in J. IIendler (ed.) Expert Systems: 
The User h~terface, Ablex Publishing, 1988. 
\[Kobsa et. al., 1986\] Kobsa, A., Allgayer, J., Reddig, C., 
Reithing, NI, Schumauks, D., lIarbusch, K., and 
Wahlster, W, "Combining Deictic Gestures and Nat- 
ural Language for Referent Identification", Proc. of 
COLING-86, 1986. 
\[Pereira and Warren, 1980\] Pereira, 
l"., and Warren, I).II.D., p"Definite Clause Grammars 
for l,angua.ge Analysis- A survey of the Formalism and 
a Comparison with Augmented Tl'ausitioll Networks", 
Artificial Intelligence, vol. 13, no. 3, 1980. 
\[Shimazu et. al., 1992\] Shimazu, 11., Arita, S., and 
Takashima, Y., "Design Tool Combining Keyword An- 
alyzer and Case-Based Parser \['or Developing Natural 
Language I)ataBase Interfaces", Proc. of COLING-92, 
1992. 
\[Shneiderman, 1991\] Designing The User Interface, Addi- 
son Wesley Publ., Reading, MA. 
\[Sidner, 1979\] Sidner, C. Towards a computational theory o\] 
definite anaphora comprehension in English Discourse, 
T1{-537, MIT AI l,ab, Cambridge, Ma. 
\[Wahlster, 1989\] Wahlster, W., "User and discourse, models 
for multimodal communication", in a.W. Sullivan and 
S.W. '\]'yler, editors, Intelligent User Interfaces, chap- 
ter3, ACM Press Frontiers Series, Addison Wesley Pub- 
lishing, 1989. 
836 
