Drawing Pictures with Natural Language and Direct Manipulation 
Mayumi Hiyoshi and Hideo Shimazu 
Information Technology Research Laboratories, NEC Corporation 
4-1-1 Miyazaki, Miyamae Kawasaki, 216 Japan 
{hiyoshi, shimazu}@joke.cl.nec.co.jp
Abstract 
A multimodal user interface allows users 
to communicate with computers using mul- 
tiple modalities, such as a mouse, a key- 
board or voice, in various combined ways. 
This paper discusses a multimodal drawing 
tool, whereby the user can use a mouse, 
a keyboard and voice effectively. Also, 
it describes an interpretation method, by 
which the system integrates voice inputs 
and pointing inputs using context. 
1 Introduction 
This paper describes an experimental implementation
of a multimodal interface. Specifically, the authors
have developed a multimodal drawing tool. The
multimodal drawing tool allows users to draw pictures
by using multiple modalities (mouse, keyboard
and voice input) in various combined ways.
Recently, most user interfaces tend to be based on
a direct manipulation method. However, the direct
manipulation method is not always better than other
methods. It is not well suited to specifying several
operations together or to operating on an object which
is not displayed. Also, it compels a user to point to a
target object precisely with a pointing device. On the
other hand, voice inputs have some advantages, since a
user can feel free to speak at any time, and the user can
use the voice input while simultaneously using other
devices. A combination of such different modalities
offers an interface which is easy for the user to use.
Many multimodal systems, which integrate natural
language inputs and pointing inputs, have been
developed [2][1][5][4]. In those systems, the user mainly
uses natural language, supported by the pointing
inputs. However, when the user has to communicate
with the computer frequently, as in a drawing tool,
it is not effective for the user to always
speak while working.
A prototype system for a multimodal drawing tool
has been developed, whereby the user can use voice
inputs freely and effectively; that is, the user
can choose a modality freely, and can use the
voice inputs only when the user wants to do so. In
such a system, input data arrive at random from
multiple modalities. The multimodal system must be
able to handle these several kinds of input data.
2 Multimodal Inputs in Drawing 
Tools 
This section describes requirements to develop gen- 
eral drawing interfaces. In existing drawing tools, 
a mouse is a major input device. In addition, some 
drawing tools assign functions to some keys on a key- 
board to reduce inconvenience in menu operations. 
Issues regarding such interfaces are as follows: 
• It is troublesome to input a function. Because
a user uses a mouse, both to select a menu and
to draw a figure, the user has to move a cursor
many times from a menu area to a canvas on
which figures are placed.
• It is troublesome to look for a menu item. As
the number of functions increases, the number of menu
items also increases. So, it becomes increasingly
difficult to look for a specific menu item.
• It is troublesome to continuously move a hand 
from a mouse to a keyboard. 
• It is not possible to express plural requirements 
simultaneously. For example, when a user wants 
to delete plural figure objects, the user has to 
choose the objects one by one. 
• The user has to point to an object correctly. For 
example, when the user wants to choose a line 
object on a display, the user has to move a cursor 
just above the line and click the mouse button. 
If the point shifts slightly, the object is not se- 
lected. 
By adding voice input functions to such an input
environment, it becomes possible to solve the first
three issues. That is, by means of operation with the
voice input, a user can concentrate on drawing, and
menu search and any labor required by changing devices
become unnecessary.
For overcoming the rest of these issues, more contrivance
is needed. The authors attempted to develop
a multimodal drawing tool, operable with both voice
inputs and pointing inputs, which has the following
functions.
• A user can choose a modality (mouse or voice)
freely, which means that the user can
use the voice inputs only when the user wants
to do so. Also, the user can use both modalities
in various combined ways. For example, the user
says "this", while pointing to one of several
objects.
• Plural requests can be expressed simultaneously
(ex. "change the color of all line objects to
green"). So, the operation efficiency will be improved.
• A user can shorten voice inputs (ex. "move
here") or omit mouse pointing events based on
the situation, if the omitted concepts are able to
be inferred from context. For example, the user
can utter "this", as a reference for previously
operated objects, without a mouse pointing.
• Ambiguous pointings are possible. When a user
wants to choose an object from among those on
a display, the user can indicate it roughly with a
brief description, using the voice input. For example,
a user points at a spot near a target object
and utters "line", whereby the nearest "line"
object to the spot is selected. Or, a user points
at objects piled up and says "circle", then only
the "circle" objects among the piled up objects
are selected.
To realize these functions, it is necessary to solve
the following new problems.
1. Matching pointing inputs with voice inputs.
In the proposed system, since pointing events
may often occur independently, it is difficult to
judge whether an event is an independent
input or whether it follows a related voice
input. So, an interpretation wherein the voice
input and pointing event are connected in the
order of arrival is not sufficient [4]. Therefore,
a pointing event should be basically handled as
an independent event. Then, the event is picked
out from input history afterward, when the system
judges that the event relates to the following
voice input.
2. Solving several input data ambiguities.
In the previous mouse based system, ambiguous
inputs do not occur, because the system requires
that a user selects menus and target objects explicitly
and exactly. Even if the voice input function
is added in such a system, it is possible to
force the user to give a detailed verbal sequence
for the operation without ambiguity. However,
when the function becomes more sophisticated,
it is difficult for the user to outline the user's intention
in detail verbally. So, it is necessary to
be able to interpret the user's ambiguous input.
Several multimodal systems have been developed
to solve these problems. For example, Hayes [4] raised
the first issue, but did not present a definite
solution. Cohen [3] presented a
solution for the first issue, by utilizing context. However,
the solution is not sufficient for application to
drawing tools, because it was presented only for query
systems. The following section describes a prototype
system for a multimodal drawing tool. Next, solutions
for these problems are presented.
3 Multimodal Drawing Tool
3.1 System Construction 
A prototype system for a multimodal drawing tool
was developed as a case study on a multimodal communication
system. Figure 1 shows a system image
of the prototype system, by which the user draws
pictures using mouse, keyboard and voice. This system
was developed on a SUN workstation using the
X-window system and was written in Prolog. Voice
input data is recognized on a personal computer, and
the recognition result is sent to the workstation.
Figure 1: System Image
3.2 Interface Examples
Figure 2 shows a screen image of the system. The
user can draw pictures with a combination of mouse
and voice, as well as with a mouse only. Input examples are
as follows:
Figure 3: Multimodal Drawing Tool Structure
(components: Mouse Input Handler, Keyboard Input Handler, Voice Input Handler, Input Integrator, Drawing Tool, Display Handler)
Figure 2: Screen Image
• If a user wants to move an existing circle object
to some point, the user says "Move this circle 
here", while pointing at a circle object and a 
destination point. The system moves the circle 
object to the specified point. 
• If a user wants to choose an existing line object 
among several objects one upon another, the us- 
er can say "line" while pointing at a point near 
the line object. The system chooses the nearest 
line object to the point. 
• If the user wants to draw a red circle, the user 
can say "red circle". The system changes a cur- 
rent color mode to red and changes the drawing 
mode to the circle mode. 
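The selection behavior in the second example, together with the piled-objects case from Section 2, can be sketched as follows. This is a minimal illustration in Python rather than the authors' Prolog implementation; the FigureObject record, its single anchor point, the radius parameter and both function names are assumptions introduced here.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class FigureObject:
    """Hypothetical figure record; the paper does not specify its data model."""
    kind: str   # figure type, e.g. "line" or "circle"
    x: float    # assumed: one representative anchor point per object
    y: float


def nearest_of_kind(objects: List[FigureObject], kind: str,
                    px: float, py: float) -> Optional[FigureObject]:
    """Return the object of the uttered type closest to the pointed spot."""
    candidates = [o for o in objects if o.kind == kind]
    if not candidates:
        return None
    return min(candidates, key=lambda o: hypot(o.x - px, o.y - py))


def piled_of_kind(objects: List[FigureObject], kind: str,
                  px: float, py: float, radius: float) -> List[FigureObject]:
    """Return every object of the uttered type within `radius` of the spot."""
    return [o for o in objects
            if o.kind == kind and hypot(o.x - px, o.y - py) <= radius]
```

Uttering "line" while pointing would call nearest_of_kind with kind "line", so a circle lying closer to the pointed spot is ignored.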
3.3 System Structure 
Figure 3 shows a prototype system structure. The
system includes Mouse Input Handler, Keyboard Input
Handler, Voice Input Handler, Drawing Tool and
Input Integrator.
Each Input Handler receives mouse input events, 
keyboard input events and voice input events, and 
sends them to the Input Integrator. 
The Input Integrator receives an input message
from each input handler, then interprets the message,
using voice scripts, which show voice input patterns
and sequences of operations related to the patterns,
as well as mouse scripts, which show mouse input
patterns and sequences of operations. When the input
data matches one of the input patterns in the
scripts, the Input Integrator executes the sequence of
operations related to the pattern. That is, the Input
Integrator sends some messages to the Drawing Tool
to carry out the sequence of operations. If the input
data matches a part of one of the input patterns
in the scripts, the Input Integrator waits for the next
input. Then, a combination of the previous input data
and the new input data is examined. Otherwise, the
interpretation fails. The Input Integrator may refer
to the Drawing Tool. For example, it refers to the
Drawing Tool for information regarding an object at
a specific position.
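The three outcomes described above (full match, partial match, failure) can be sketched as a small prefix matcher. This is only an illustrative Python sketch; the script table, the operation strings and the class interface are hypothetical, since the paper does not give the script format used in the Prolog prototype.

```python
from typing import Dict, List, Tuple

# Hypothetical scripts: an input pattern mapped to a sequence of operations.
SCRIPTS: Dict[Tuple[str, ...], List[str]] = {
    ("red", "circle"): ["set_color red", "set_mode circle"],
    ("move", "this", "here"): ["move_object"],
}


class InputIntegrator:
    """Accumulates inputs and matches them against script patterns."""

    def __init__(self, scripts: Dict[Tuple[str, ...], List[str]]) -> None:
        self.scripts = scripts
        self.pending: List[str] = []   # inputs received so far

    def receive(self, token: str) -> Tuple[str, List[str]]:
        """Return ('executed', ops), ('waiting', []) or ('failed', [])."""
        self.pending.append(token)
        key = tuple(self.pending)
        if key in self.scripts:
            # Full match: hand the operation sequence to the drawing tool.
            self.pending = []
            return "executed", self.scripts[key]
        if any(pattern[:len(key)] == key for pattern in self.scripts):
            # Partial match: wait and examine the combination with the next input.
            return "waiting", []
        # No pattern starts with the accumulated input: interpretation fails.
        self.pending = []
        return "failed", []
```

With these assumed scripts, receiving "red" returns a waiting status, and a following "circle" returns the two operations that set the color and the drawing mode.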
The Drawing Tool manages attributes for figure
objects and current status, such as color, size, line
width, etc. Also, it executes drawing and editing operations,
according to requests from the Input Integrator,
and it sends the editing results to the Display
Handler. The Display Handler modifies the expression
on the display.
4 Multimodal Data Interpretation
This section describes in detail interpretation methods
for the multimodal inputs used in the drawing
tool.
4.1 Matching Pointing Inputs with Voice 
Inputs 
In conventional multimodal systems, all anaphoric
references in voice inputs bring about pointing inputs,
and each individual pointing input is connected to an
anaphoric reference. However, in our system, a user
can operate with either a pointing input only, a voice
input only or a combination of pointing events and
voice inputs. Because a pointing event may often occur
independently, when a pointing event does occur,
the system cannot judge whether the event is an independent
input or whether it follows the related voice
input. Furthermore, the user can utter "this", as a reference
to an object operated immediately before the
utterance. So, an interpretation that the voice input
and pointing event are connected only in the order
of arrival is not sufficient. In the proposed system,
a pointing event is basically handled as an independent
event. Then, the event is picked out from input
history afterward, when the system judges that the
event relates to the following voice input. Furthermore,
the system has to interpret the voice inputs
using context (ex. the previously operated object).
In the proposed system, pointing inputs from start
to end of a voice input are kept in a queue. When
the voice input ends, the system binds phrases in
the voice input and the pointing inputs in the queue.
First, the system compares the number of anaphoric
references in the voice input and the number of pointing
inputs in the queue. Figure 4 shows timing data
for a voice input and pointing inputs. In Case(1),
the number of anaphoric references in the voice input
and the number of pointing inputs are equal. In the other
cases, a pointing input is lacking. When a pointing
input is lacking, the following three possible causes
are considered.
• The related pointing event occurred before the
voice input, and it was handled previously
(Case(2) in Fig. 4).
• The first anaphoric reference is "this" as reference
to an object which was operated immediately
before the voice input (Case(3) in Fig. 4).
• The related pointing event will occur after the
voice input (Case(4) in Fig. 4).
Figure 4: Timing data for voice input and pointing inputs
(Cases (1) to (4): the voice input "move this here" shown against the timing of the pointing inputs; in Case (3) a previously operated object takes the place of a pointing input)
The interpretation steps are as follows. 
1. The system examines an input immediately before
the voice input. If it is a pointing event,
the event is used for interpretation. That is, the
event is added at the top of the pointing queue.
2. When the above operation fails and the first
anaphoric reference is "this", then the system
picks up the object operated immediately before,
if such exists. The object information is added
at the top of the pointing queue.
3. Otherwise, the system waits for the next pointing
input. The input is added at the end of
the pointing queue. When a time out occurs, the
interpretation fails, due to a lack of a pointing
event.
If the system can obtain the necessary information,
it binds the anaphoric references in the voice
input and pointing event and object information in
the pointing queue in the order of arrival.
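The interpretation steps can be sketched in Python as follows. This is a hedged illustration, not the authors' Prolog code: the anaphora vocabulary, the (kind, payload) representation of the previous input, and the wait_for_pointing callback (assumed to return None on a time out) are all assumptions made for the example. The sketch also keeps waiting until the counts match, which slightly generalizes the single missing input discussed in the paper.

```python
from typing import Any, Callable, List, Optional, Tuple

ANAPHORA = {"this", "here"}   # assumed set of anaphoric words


def bind_references(
    words: List[str],
    queue: List[Any],
    previous_input: Optional[Tuple[str, Any]],
    last_object: Optional[Any],
    wait_for_pointing: Callable[[], Optional[Any]],
) -> Optional[List[Tuple[str, Any]]]:
    """Bind anaphoric words to pointing data; return None on failure."""
    refs = [w for w in words if w in ANAPHORA]
    pointers = list(queue)   # pointing inputs received during the utterance
    if len(pointers) < len(refs):
        if previous_input is not None and previous_input[0] == "pointing":
            # Step 1: reuse the pointing event just before the voice input.
            pointers.insert(0, previous_input[1])
        elif refs[0] == "this" and last_object is not None:
            # Step 2: "this" refers to the object operated immediately before.
            pointers.insert(0, last_object)
    while len(pointers) < len(refs):
        # Step 3: wait for a later pointing input; None models a time out.
        nxt = wait_for_pointing()
        if nxt is None:
            return None
        pointers.append(nxt)
    # Bind in order of arrival (zip ignores any surplus pointing inputs).
    return list(zip(refs, pointers))
```

For the utterance "move this here" with one queued pointing input and a pointing event immediately before the utterance, the earlier event is bound to "this" and the queued one to "here", which corresponds to Case(2).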
4.2 Solving Input Data Ambiguity 
In a conventional mouse based system, there is no
semantic ambiguity. Such systems require a user to
select menus and target objects and to edit the objects
explicitly and exactly. Even if the voice input
function is added in such a system, the user can be
forced to utter operations without ambiguity. However,
when the function becomes more sophisticated,
it is difficult for the user to utter the user's intentions
in detail. So, it is necessary to be able to interpret
the user's ambiguous input. In a multimodal drawing
tool, such as our system, one of the most essential input
ambiguities is caused by ambiguous pointings.
For example, if a user says "character string", there
are three possible interpretations: "the user wants to
edit one of the existing character strings", "the user
wants to choose one of the existing character strings"
and "the user wants to write a new character string".
In this example, the system interprets using the
following heuristic rules.
• If a pointing event does not exist immediately 
before, the system changes a drawing mode to 
the character string input mode. 
• If a pointing event upon a string object exists
just before the voice input, then the system adds
the character string object to the current selection,
that is, the group of currently selected objects.
• When a pointing event exists immediately before
the voice input and there is a character string
object near the position of the user's point (ex.
within a radius of five mm from the position),
then the character string object is added to the
current selection.
• When a pointing event exists and there is no 
character string object near the position, then 
the mode is changed to the character string input 
mode at the position. 
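These heuristic rules can be sketched as a single decision function. The Python below is a minimal illustration under stated assumptions: objects are represented as hypothetical (kind, x, y) tuples, coordinates are assumed to be in millimeters so that the default threshold of 5.0 corresponds to the five mm radius, and a pointing event directly upon an object is treated as distance zero, which merges the second and third rules into one distance test. The action names are invented for the example.

```python
from math import hypot
from typing import List, Optional, Tuple

Figure = Tuple[str, float, float]   # hypothetical (kind, x, y) record


def interpret_type_utterance(
    kind: str,
    pointing: Optional[Tuple[float, float]],
    objects: List[Figure],
    threshold: float = 5.0,   # assumed to be in mm, as are the coordinates
) -> tuple:
    """Decide between selecting an existing object and entering input mode."""
    if pointing is None:
        # Rule 1: no pointing event, so switch to the input mode for this type.
        return ("enter_input_mode", kind, None)
    px, py = pointing
    near = [
        (hypot(x - px, y - py), (k, x, y))
        for k, x, y in objects
        if k == kind and hypot(x - px, y - py) <= threshold
    ]
    if near:
        # Rules 2 and 3: an object of this type is at or near the point.
        return ("add_to_selection", min(near, key=lambda c: c[0])[1])
    # Rule 4: nothing nearby, so enter input mode at the pointed position.
    return ("enter_input_mode", kind, pointing)
```

Choosing the threshold is the delicate part of such a rule set: a value that is too small sends near misses into the input mode, which is the error case discussed in the text that follows.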
Naturally, "character string" in these heuristic
rules can be replaced by other figure types. Since such
heuristic rules are not perfect, the interpretation may
be different from the user's intention. In such a case,
it is important for a user to return from the error
condition with minimum effort. For example, assume
that a user, who wants to choose one of the "character
string" objects, says "character string" and points
on the display, but the distance between the pointed
position and the "character string" object is greater
than the predefined threshold. Then, according to
the above rules, the result of the system's interpretation
will be to input a new character string at the
position, and the drawing mode changes to the character
string input mode. In this case, the user wishes
to turn back to the intended state with
minimum effort. The system must return to the state
in which the character string input mode is canceled
and the nearest "character string" object is selected.
A solution is for the user to utter "select" only. Then,
the system understands that its interpretation was
wrong and interprets that "select" means "select a
character string object" using current context.
5 Conclusion 
For implementing a multimodal system based on a direct
manipulation system, the system has to use not
only pointing events concurrent with a voice input,
but also the context, such as input history
or information regarding the currently operated object,
as information for binding to the voice input. Furthermore,
it is important to solve any ambiguity in
inputs. This paper discussed these problems, and
described an interpretation method using a drawing
tool example. Furthermore, a prototype system for
a multimodal drawing tool has been implemented.
Much future work remains, but we believe that these
elaborate interpretations may become bases of user
friendly multimodal interfaces.
Acknowledgements 
A part of this study was conducted under the 
FRIEND21 national study project. 
References 
[1] Allgayer, J., Jansen-Winkeln, R., Reddig, C., and
Reithinger, N., "Bidirectional use of knowledge in
the multi-modal NL access system XTRA", IJCAI'89,
pp.1491-1497, 1989.
[2] Bolt, R.A., "Put-That-There: Voice and Gesture
at the Graphics Interface", Computer
Graphics 14, 3, 1980.
[3] Cohen, P.R., Dalrymple, M., Moran, D.B.,
Pereira, F.C.N., et al., "Synergistic Use of Direct
Manipulation and Natural Language", Proc. of
CHI-88, 1989.
[4] Hayes, P.J., "Steps towards Integrating Natural
Language and Graphical Interaction for
Knowledge-based Systems", Advances in Artificial
Intelligence - II, Elsevier Science Publishers,
1987.
[5] Wahlster, W., "User and discourse models for
multimodal communication", in J.W. Sullivan
and S.W. Tyler, editors, Intelligent User Interfaces,
chapter 3, ACM Press Frontiers Series, Addison
Wesley Publishing, 1989.
