File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/a00-1046_abstr.xml

Size: 12,513 bytes

Last Modified: 2025-10-06 13:41:33

<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1046">
  <Title>The Efficiency of Multimodal Interaction for a Map-based Task</Title>
  <Section position="1" start_page="0" end_page="333" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper compares the efficiency of using a standard direct-manipulation graphical user interface (GUI) with that of using the QuickSet pen/voice multimodal interface for supporting a military task. In this task, a user places military units and control measures (e.g., various types of lines, obstacles, objectives) on a map. Four military personnel designed and entered their own simulation scenarios via both interfaces.</Paragraph>
    <Paragraph position="1"> Analyses revealed that the multimodal interface led to an average 3.5-fold speed improvement in the average entity creation time, including all error handling. The mean time to repair errors also was 4.3 times faster when interacting multimodally. Finally, all subjects reported a strong preference for multimodal interaction.</Paragraph>
    <Paragraph position="2"> These results indicate a substantial efficiency advantage for multimodal over GUI-based interaction during map-based tasks.</Paragraph>
    <Paragraph position="3"> Introduction Nearly two decades ago at ACL'80, Professor Ben Shneiderman challenged the field of natural language processing as follows: In constructing computer systems which mimic rather than serve people, the developer may miss opportunities for applying the unique and powerful features of a computer: extreme speed, capacity to repeat tedious operations accurately, virtually unlimited storage for data, and distinctive input/output devices. Although the slow rate of human speech makes menu selection impractical, high-speed computer displays make menu selection an appealing alternative. Joysticks, light pens or the &amp;quot;mouse&amp;quot; are extremely rapid and accurate ways of selecting and moving graphic symbols or text on a display screen. Taking advantage of these and other computer-specific techniques will enable designers to create powerful tools without natural language commands. \[20, p. 139\] He also challenged us to go beyond mere claims, but to demonstrate the benefits of natural language processing technologies empirically. Since then, not only has there been a long period of unprecedented innovation in hardware, software architectures, speech processing, and natural language processing, but NLP research has also embraced empirical methods as one of its foundations. Still, we have yet to defend claims empirically that technologies for processing natural human communication are more efficient, effective, and/or preferred, than interfaces that are best viewed as &amp;quot;tools,&amp;quot; especially interfaces involving a direct manipulation style of interaction. The present research attempts to take a small step in this direction.</Paragraph>
    <Paragraph position="4"> In fact, it has often been claimed that spoken language-based human-computer interaction will not only be more natural but also more efficient than keyboard-based interaction.</Paragraph>
    <Paragraph position="5"> Many of these claims derive from early modality comparison studies \[1\], which found a 2-3 fold speedup in task performance when people communicated with each other by telephone vs. by keyboard. Studies of the use of some of the initial commercial speech recognition systems have reported efficiency gains of approximately 20% - 40% on a variety  of interactive hands-busy tasks [10] compared with keyboard input. Although these results were promising, once the time needed for error correction was included, the speed advantage of speech often evaporated [18] ~. A recent study of speech-based dictation systems [9] reported that dictation resulted in a slower and more errorful method of text creation than typing. From such results, it is often concluded that the age of spoken human-computer interaction is not yet upon us.</Paragraph>
    <Paragraph position="6"> Most of these studies have compared speech with typing, However, in order to affect mainstream computing, spoken interaction would at a minimum need to be found to be superior to graphical user interfaces (GUIs) for a variety of tasks. In an early study of one component of GUIs, Rudnicky [18] compared spoken interaction with use of a scroll bar, finding that error correction wiped out the speed advantages of speech, but users still preferred to speak. Pausch and Leatherby [17] examined the use of simple speaker-dependent discrete speech commands with a graphical editor, as compared with the standard menu-based interface. With a 19-word vocabulary, subjects were found to create drawings 21% faster using speech and mouse than with the menu-based system. They conjectured that reduction in mouse-movement was the source of the advantage. In general, more research comparing speech and spokenlanguage-based interfaces with graphical user interfaces still is needed.</Paragraph>
    <Paragraph position="7"> We hypothesize that one reason for the equivocal nature of these results is that speech is often being asked to perform an unnatural act the interface design requires people to speak when other modalities of communication would be more appropriate. In the past, strengths and weaknesses of various communication modalities have been described [2, 6, 13], and a strategy of developing multimodal user interfaces has been developed using the strengths of one mode to overcome weaknesses in another, Interface simulation studies I See also [6, 10] for a survey of results.</Paragraph>
    <Paragraph position="8">  comparing multimodal (speech/pen) interaction with speech-only have found a 35% reduction in user errors, a 30% reduction in spoken dysfluencies (which lead to recognition errors), a 10% increase in speed, and a 100% user preference for multimodal interaction over speech-only in a map-based task [14]. These results suggest that multimodal interaction may well offer advantages over GUI's for map-based tasks, and may also offer advantages for supporting error correction during dictation [16, 19].</Paragraph>
    <Paragraph position="9"> In order to investigate these issues, we undertook a study comparing a multimodal and a graphical user interface that were built for the same map-based task ~.</Paragraph>
    <Paragraph position="10">  interface [4] for supporting a common military planning/simulation task. In this task, a user arrays forces on a map by placing icons representing military units (e.g., the 82 n~ Airbome Division) and &amp;quot;control measures,&amp;quot; 2 A high-performance spoken language system was also developed for a similar task [ 12] but to our knowledge it was not formally evaluated against the relevant GUI.</Paragraph>
    <Paragraph position="11"> 3 A case study of one user was reported in [3]. This paper reports a fuller study, with different users, statistical analyses, and an expanded set of dependent measures (including error correction).  (e.g., various types of lines, obstacles, and objectives). A shared backend application subsystem, called Exlnit, takes the user specifications and attempts to decompose the higher echelon units into their constituents. It then positions the constituent units on the map, subject to the control measures and features of the terrain.</Paragraph>
    <Section position="1" start_page="332" end_page="332" type="sub_section">
      <SectionTitle>
1.2 ExInit's GUI
</SectionTitle>
      <Paragraph position="0"> Exlnit provides a direct manipulation GUI (built by MRJ Corp.) based on the Microsoft Windows suite of interface tools, including a tree-browser, drop-down scrolling lists, buttons (see Figure 1). Many military systems incorporate similar user interface tools for accomplishing these types of tasks (e.g., ModSAF [7]). The tree-browser is used to represent and access the collection of military units. The user employs the unit browser to explore the echelon hierarchy until the desired unit is located. The user then selects that unit, and drags it onto the map in order to position it on the terrain. The system then asks for confirmation of the unit's placement. Once confirmed, Exlnit invokes its deployment server to decompose the unit into its constituents and position them on the terrain. Because this is a time-consuming process depending on the echelon of the unit, only companies and smaller units were considered.</Paragraph>
      <Paragraph position="1"> To create a linear or area control measure, the user pulls down a list of all control measure  types, then scrolls and selects the desired type. Then the user pushes a button to start entering points, selects the desired locations, and finally clicks the button to exit the point creation mode. The user is asked to confirm that the selected points are correct, after which the system connects them and creates a control measure object of the appropriate type.</Paragraph>
      <Paragraph position="2"> Finally, there are many more features to this GUI, but they were not considered for the present comparison. The system and its GUI were well-received by the client, and were used to develop the largest known distributed simulation (60,000 entities) for the US</Paragraph>
    </Section>
    <Section position="2" start_page="332" end_page="333" type="sub_section">
      <SectionTitle>
Government's Synthetic Theater of War
</SectionTitle>
      <Paragraph position="0"> program (STOW).</Paragraph>
      <Paragraph position="1"> 4 There were 45 entries, viewable in a window of size 9. The entries consisted of linear features (boundaries, obstacles, etc.), then areas.</Paragraph>
    </Section>
    <Section position="3" start_page="333" end_page="333" type="sub_section">
      <SectionTitle>
1.3 QuickSet's Multimodal Interface
</SectionTitle>
      <Paragraph position="0"> QuickSet is a multimodal (pen/voice) interface for map-based tasks. With this system, a user can create entities on a map by simultaneously speaking and drawing \[4\]. With pen-based, spoken, or multimodal input, the user can annotate the map, creating points, lines, and areas of various types (see Figure 2). In virtue of its distributed multiagent architecture, QuickSet operates in various heterogeneous hardware configurations, including wearable, handheld, desktop, and wall-sized. Moreover, it controls numerous backend applications, including 3D terrain visualization \[5\] military simulation, disaster management \[15\] and medical informatics.</Paragraph>
      <Paragraph position="1"> The system operates as follows: When the pen is placed on the screen, the speech recognizer is activated, thereby allowing users to speak and gesture simultaneously. For this task, the user either selects a spot on the map and speaks the name of a unit to be placed there (e.g, &amp;quot;mechanized company&amp;quot;), or draws a control measure while speaking its name (e.g., &amp;quot;phase line green&amp;quot;). In response, QuickSet creates the appropriate military icon on its map and asks for confirmation. Speech and gesture are recognized in parallel, with the speech interpreted by a definite-clause natural language parser. For this study, IBM's Voice Type Application Factory, a continuous, speaker-independent speech recognition system, was used with a bigram grammar and 662-word vocabulary. In general, analyses of spoken language and of gesture each produce a list of interpretations represented as typed feature structures \[8\]. The language supported by the system essentially consists of complex noun phrases, including attached prepositional phrases and gerunds, and a small collection of sentence forms. Utterances can be just spoken, or coupled with pen-based gestures. Multimodal integration searches among the set of interpretations for the best joint interpretation \[8, 22\], which often disambiguates both speech and gesture simultaneously \[15\]. Typed feature structure unification provides the basic information fusion operation. Taking advantage of the system's mutual disambiguation capability, QuickSet confirms its interpretation of the user input after multimodal integration \[11\], thereby allowing the system to correct recognition and interpretation errors. If the result is acceptable, the user needs only to proceed; only unacceptable results require explicit disconfirmation. Finally, the multimodal interpretation is sent directly to the Exlnit deployment server, effectively bypassing the Exlnit GUI.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>