<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2115">
  <Title>Drawing Pictures with Natural Language and Direct Manipulation</Title>
  <Section position="3" start_page="0" end_page="722" type="metho">
    <SectionTitle>
2 Multimodal Inputs in Drawing
Tools
</SectionTitle>
    <Paragraph position="0"> This section describes requirements for developing general drawing interfaces. In existing drawing tools, a mouse is the major input device. In addition, some drawing tools assign functions to keys on a keyboard to reduce the inconvenience of menu operations.</Paragraph>
    <Paragraph position="1"> Issues regarding such interfaces are as follows: * It is troublesome to input a function. Because a user uses a mouse both to select a menu and to draw a figure, the user has to move a cursor many times from a menu area to a canvas on which figures are placed.</Paragraph>
    <Paragraph position="2"> * It is troublesome to look for a menu item. As functions increase, menu items also increase. So, it becomes increasingly difficult to find a specific menu item.</Paragraph>
    <Paragraph position="3"> * It is impossible to operate on plural objects simultaneously. For example, when a user wants to delete plural figure objects, the user has to choose the objects one by one.</Paragraph>
    <Paragraph position="4"> * The user has to point to an object correctly. For example, when the user wants to choose a line object on a display, the user has to move a cursor just above the line and click the mouse button.</Paragraph>
    <Paragraph position="5"> If the point shifts slightly, the object is not selected. By adding voice input functions to such an input environment, it becomes possible to solve the first three issues. That is, by means of operation with the voice input, a user can concentrate on drawing, and menu searches and any labor required by changing devices become unnecessary.</Paragraph>
    <Paragraph position="6"> For overcoming the rest of these issues, more contrivance is needed. The authors attempted to develop a multimodal drawing tool, operable with both voice inputs and pointing inputs, which has the following functions.</Paragraph>
    <Paragraph position="7"> * A user can choose a modality (mouse or voice) unrestrainedly, which means that the user can use the voice inputs only when the user wants to do so. Also, the user can use both modalities in various combined ways. For example, the user says &amp;quot;this&amp;quot;, while pointing to one of several objects.</Paragraph>
    <Paragraph position="8"> * Plural requests can be expressed simultaneously (ex. &amp;quot;change the color of all line objects to green&amp;quot;). So, the operation efficiency will be improved. * A user can shorten voice inputs (ex. &amp;quot;move here&amp;quot;) or omit mouse pointing events based on the situation, if the omitted concepts can be inferred from context. For example, the user can utter &amp;quot;this&amp;quot;, as a reference to previously operated objects, without a mouse pointing.</Paragraph>
    <Paragraph position="9"> * Ambiguous pointings are possible. When a user wants to choose an object from among those on a display, the user can indicate it roughly with a brief description, using the voice input. For example, a user points at a spot near a target object and utters &amp;quot;line&amp;quot;, whereby the nearest &amp;quot;line&amp;quot; object to the spot is selected. Or, a user points at objects piled up and says &amp;quot;circle&amp;quot;, then only the &amp;quot;circle&amp;quot; objects among the piled up objects are selected.</Paragraph>
    <Paragraph position="10"> To realize these functions, it is necessary to solve the following new problems.</Paragraph>
    <Paragraph position="11"> 1. Matching pointing inputs with voice inputs.</Paragraph>
    <Paragraph position="12"> In the proposed system, since pointing events may often occur independently, it is difficult to judge whether an event is an independent input or whether it follows a related voice input. So, an interpretation wherein the voice input and pointing event are connected in the order of arrival is not sufficient \[4\]. Therefore, a pointing event should basically be handled as an independent event. Then, the event is picked out from input history afterward, when the system judges that the event relates to the following voice input.</Paragraph>
    <Paragraph position="13"> 2. Solving several input data ambiguities.</Paragraph>
    <Paragraph position="14"> In the previous mouse based system, ambiguous inputs do not occur, because the system requires that a user selects menus and target objects explicitly and exactly. Even if the voice input function is added in such a system, it is possible to force the user to give a detailed verbal sequence for the operation without ambiguity. However, when the function becomes more sophisticated, it is difficult for the user to outline the user's intention in detail verbally. So, it is necessary to be able to interpret the user's ambiguous input.</Paragraph>
    <Paragraph position="15"> Several multimodal systems have been developed to solve these problems. For example, Hayes \[4\] raised the first issue, but a definite solution was not addressed. Cohen \[3\] presented a solution for the first issue, by utilizing context. However, the solution is not sufficient for application to drawing tools, because it was presented only for query systems. The following section describes a prototype system for a multimodal drawing tool. Next, solutions for these problems are presented.</Paragraph>
  </Section>
  <Section position="4" start_page="722" end_page="724" type="metho">
    <SectionTitle>
3 Multimodal Drawing Tool
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="722" end_page="722" type="sub_section">
      <SectionTitle>
3.1 System Construction
</SectionTitle>
      <Paragraph position="0"> A prototype system for a multimodal drawing tool was developed as a case study on a multimodal communication system. Figure 1 shows a system image of the prototype system, by which the user draws pictures using mouse, keyboard and voice. This system was developed on a SUN workstation using the X-window system and was written in Prolog. Voice input data is recognized on a personal computer, and the recognition result is sent to the workstation.</Paragraph>
      <Paragraph position="1"> Figure 1: System Image</Paragraph>
    </Section>
    <Section position="2" start_page="722" end_page="723" type="sub_section">
      <SectionTitle>
3.2 Interface Examples
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows a screen image of the system. The user can draw pictures with a combination of mouse and voice, as well as using a mouse only. Input examples are as follows:  * If a user wants to move an existing circle object to some point, the user says &amp;quot;Move this circle here&amp;quot;, while pointing at a circle object and a destination point. The system moves the circle object to the specified point.</Paragraph>
      <Paragraph position="1"> * If a user wants to choose an existing line object among several objects lying one upon another, the user can say &amp;quot;line&amp;quot; while pointing at a point near the line object. The system chooses the nearest line object to the point.</Paragraph>
      <Paragraph position="2"> * If the user wants to draw a red circle, the user  can say &amp;quot;red circle&amp;quot;. The system changes a current color mode to red and changes the drawing mode to the circle mode.</Paragraph>
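      <Paragraph position="3"> The second example, selecting the nearest object of the uttered type to the pointed spot, can be sketched as follows. This is a minimal illustration in Python (the actual system was written in Prolog); the object representation and the function name are assumptions made for this sketch.

```python
import math

# Hypothetical object records: (type, x, y). This flat representation is
# an assumption for the sketch, not the paper's actual data structure.
OBJECTS = [
    ("line", 10.0, 10.0),
    ("circle", 40.0, 40.0),
    ("line", 100.0, 5.0),
]

def select_nearest(objects, obj_type, px, py):
    """Return the object of the uttered type nearest to the pointed spot."""
    candidates = [o for o in objects if o[0] == obj_type]
    if not candidates:
        return None
    # Euclidean distance from the pointed spot decides the selection.
    return min(candidates, key=lambda o: math.hypot(o[1] - px, o[2] - py))
```

With such a rule, pointing only roughly near a line and uttering its type name is enough to disambiguate the selection.</Paragraph>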
    </Section>
    <Section position="3" start_page="723" end_page="724" type="sub_section">
      <SectionTitle>
3.3 System Structure
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows a prototype system structure. The system includes the Mouse Input Handler, Keyboard Input Handler, Voice Input Handler, Drawing Tool and Input Integrator.</Paragraph>
      <Paragraph position="1"> Each Input Handler receives mouse input events, keyboard input events and voice input events, and sends them to the Input Integrator.</Paragraph>
      <Paragraph position="2"> The Input Integrator receives an input message from each input handler, then interprets the message, using voice scripts, which show voice input patterns and sequences of operations related to the patterns, as well as mouse scripts, which show mouse input patterns and sequences of operations. When the input data matches one of the input patterns in the scripts, the Input Integrator executes the sequence of operations related to the pattern. That is, the Input Integrator sends some messages to the Drawing Tool to carry out the sequences of operations. If the input data matches a part of one of the input patterns in the scripts, the Input Integrator waits for a next input. Then, a combination of the previous input data and the new input data is examined. Otherwise, the interpretation fails. The Input Integrator may refer to the Drawing Tool. For example, it refers to the Drawing Tool for information regarding an object at a specific position.</Paragraph>
      <Paragraph position="3"> Tile Drawing Toot manages attributes for tignre objects and current status, such as color, size, line width, etc. Also, it executes drawing and editing tiperations, according to requests from the Input lutegrater, and it; sends the editing results to the Display Handler. The l)isplay Handler modilies the expression on the display.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="724" end_page="725" type="metho">
    <SectionTitle>
4 Multimode Data Interpretation
</SectionTitle>
    <Paragraph position="0"> This section describes in detail interpretation methods for the multimodal inputs used in the drawing tool.</Paragraph>
    <Section position="1" start_page="724" end_page="725" type="sub_section">
      <SectionTitle>
4.1 Matching Pointing Inputs with Voice
Inputs
</SectionTitle>
      <Paragraph position="0"> In conventional multimodal systems, all anaphoric references in voice inputs bring about pointing inputs, and each pointing input is connected to an anaphoric reference. However, in our system, a user can operate with either a pointing input only, a voice input only or a combination of pointing events and voice inputs. Because a pointing event may often occur independently, when a pointing event does occur, the system cannot judge whether the event is an independent input or whether it follows the related voice input. Furthermore, the user can utter &amp;quot;this&amp;quot;, as a reference to an object operated immediately before the utterance. So, an interpretation that the voice input and pointing event are connected only in the order of arrival is not sufficient. In the proposed system, a pointing event is basically handled as an independent event. Then, the event is picked out from input history afterward, when the system judges that the event relates to the following voice input. Furthermore, the system has to interpret the voice inputs using context (ex. a previously operated object).</Paragraph>
      <Paragraph position="1"> In the proposed system, pointing inputs from the start to the end of a voice input are kept in a queue. When the voice input ends, the system binds phrases in the voice input and the pointing inputs in the queue.</Paragraph>
      <Paragraph position="2"> First, the system compares the number of anaphoric references in the voice input and the number of pointing inputs in the queue. Figure 4 shows timing data for a voice input and pointing inputs. In Case (1), the number of anaphoric references in the voice input equals the number of pointing inputs. In the other cases, a pointing input is lacking. When a pointing input is lacking, the following three possible causes are considered.</Paragraph>
      <Paragraph position="3">  The interpretation steps are as follows.</Paragraph>
      <Paragraph position="4"> 1. The system examines an input immediately before the voice input. If it is a pointing event, the event is used for interpretation. That is, the event is added at the top of the pointing queue.</Paragraph>
      <Paragraph position="5"> 2. When the above operation fails and the first anaphoric reference is &amp;quot;this&amp;quot;, then the system picks up the object operated immediately before, if such exists. The object information is added at the top of the pointing queue.</Paragraph>
      <Paragraph position="6"> 3. Otherwise, the system waits for the next pointing input. The input is added at the end of the pointing queue. When a time out occurs, the interpretation fails, due to a lack of a pointing event.</Paragraph>
      <Paragraph position="7"> If the system can obtain the necessary information, it binds the anaphoric references in the voice input and the pointing event and object information in the pointing queue in the order of arrival.</Paragraph>
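      <Paragraph position="8"> The interpretation steps above can be sketched as follows, assuming simple list-based representations for the pointing queue and the anaphoric references; the function shape and parameter names are assumptions made for this sketch.

```python
def bind_references(anaphors, queue, prior_event=None, last_object=None,
                    next_input=None):
    """Bind anaphoric references to pointing inputs (simplified sketch).

    anaphors:    words like "this", "here" from the voice input.
    queue:       pointing inputs that arrived during the utterance.
    prior_event: pointing event immediately before the utterance, if any.
    last_object: most recently operated object, if any.
    next_input:  pointing input arriving afterward (None means time out).
    """
    queue = list(queue)
    if len(anaphors) > len(queue):
        if prior_event is not None:
            queue.insert(0, prior_event)      # step 1: event just before
        elif anaphors and anaphors[0] == "this" and last_object is not None:
            queue.insert(0, last_object)      # step 2: previously operated object
        elif next_input is not None:
            queue.append(next_input)          # step 3: wait for next pointing
        else:
            return None                       # time out: interpretation fails
    if len(anaphors) > len(queue):
        return None
    return list(zip(anaphors, queue))         # bind in the order of arrival
```

For instance, an utterance of &amp;quot;move this here&amp;quot; with only one pointing input in the queue is completed by the pointing event just before the utterance.</Paragraph>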
    </Section>
    <Section position="2" start_page="725" end_page="725" type="sub_section">
      <SectionTitle>
4.2 Solving Input Data Ambiguity
</SectionTitle>
      <Paragraph position="0"> In a conventional mouse based system, there is no semantic ambiguity. Such systems require a user to select menus and target objects and to edit the objects explicitly and exactly. Even if the voice input function is added in such a system, the user can be forced to utter operations without ambiguity. However, when the function becomes more sophisticated, it is difficult for the user to utter the user's intentions in detail. So, it is necessary to be able to interpret the user's ambiguous input. In a multimodal drawing tool, such as our system, one of the most essential input ambiguities is caused by ambiguous pointings.</Paragraph>
      <Paragraph position="1"> For example, if a user says &amp;quot;character string&amp;quot;, there are three possible interpretations: &amp;quot;the user wants to edit one of the existing character strings&amp;quot;, &amp;quot;the user wants to choose one of the existing character strings&amp;quot; and &amp;quot;the user wants to write a new character string&amp;quot;. In this example, the system interprets using the following heuristic rules.</Paragraph>
      <Paragraph position="2"> * If a pointing event does not exist immediately before, the system changes a drawing mode to the character string input mode.</Paragraph>
      <Paragraph position="3"> * If a pointing event upon a string object exists just before the voice input, then the system adds the character string object to a current selection; a group of objects selected currently.</Paragraph>
      <Paragraph position="4"> * When a pointing event exists immediately before the voice input and there is a character string object near the position of the user's point (ex. within a radius of five mm from the position), then the character string object is added to a current selection.</Paragraph>
      <Paragraph position="6"> * When a pointing event exists and there is no character string object near the position, then the mode is changed to the character string input mode at the position.</Paragraph>
      <Paragraph position="7"> Naturally, &amp;quot;character string&amp;quot; in these heuristic rules can be replaced by other figure types. If this heuristic rule is not perfect, the interpretation may differ from the user's intention. In such a case, it is important for a user to return from the error condition with minimum effort. For example, assume that a user, who wants to choose one of the &amp;quot;character string&amp;quot; objects, says &amp;quot;character string&amp;quot; and points on the display, but the distance between the pointed position and the &amp;quot;character string&amp;quot; object is greater than the predefined threshold. Then, according to the above rules, the result of the system's interpretation will be to input a new character string at the position, and the drawing mode changes to the character string input mode. In this case, the user wishes to turn back to the state which the user intended with minimum effort. The system must return to the state in which the character string input mode is canceled and the nearest &amp;quot;character string&amp;quot; object is selected. A solution is for the user to utter &amp;quot;select&amp;quot; only. Then, the system understands that its interpretation was wrong and interprets that &amp;quot;select&amp;quot; means &amp;quot;select a character string object&amp;quot; using current context.</Paragraph>
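      <Paragraph position="8"> The heuristic rules above can be sketched as follows, slightly simplified by collapsing the on-object and near-object cases into one distance test. The object representation and the function shape are assumptions made for this sketch; the five mm threshold follows the paper's example.

```python
import math

def interpret_utterance(figure_type, pointing, objects, threshold=5.0):
    """Apply the heuristic rules for an utterance naming a figure type.

    pointing: (x, y) of a pointing event just before the utterance, or None.
    objects:  list of (type, x, y) records (an assumed representation).
    """
    if pointing is None:
        # Rule 1: no pointing event, so change the drawing mode.
        return ("input_mode", figure_type)
    px, py = pointing
    near = [o for o in objects
            if o[0] == figure_type
            and threshold >= math.hypot(o[1] - px, o[2] - py)]
    if near:
        # Rules 2-3: add the nearest matching object to the current selection.
        return ("select", min(near, key=lambda o: math.hypot(o[1] - px, o[2] - py)))
    # Rule 4: no matching object nearby, so enter input mode at the position.
    return ("input_mode_at", figure_type, pointing)
```

The error-recovery case in the paragraph above corresponds to the last branch: when it fires against the user's intention, a follow-up utterance of &amp;quot;select&amp;quot; would override it using current context.</Paragraph>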
    </Section>
  </Section>
</Paper>