<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1005">
  <Title>QuickSet: Multimodal Interaction for Simulation Set-up and Control</Title>
  <Section position="4" start_page="20" end_page="21" type="metho">
    <SectionTitle>
4. SYSTEM ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> Architecturally, QuickSet uses distributed agent technologies based on the Open Agent Architecture for interoperation, information brokering and distribution. An z Open Agent Architecture is a trademark of SRI International. Natural language agent: The natural language agent currently employs a definite clause grammar and produces typed feature structures as a representation of the utterance meaning. Currently, for this task, the language consists of noun phrases that label entities, as well as a variety of imperative constructs for supplying behavior.</Paragraph>
    <Paragraph position="1"> Muitimodal integration agent: The multimodal interpretation agent accepts typed feature structure meaning representations from the language and gesture recognition agents, and produces a unified multimodal interpretation.</Paragraph>
    <Paragraph position="2">  More detail on the architecture and the individual agents =re provided in [12, 22].</Paragraph>
  </Section>
  <Section position="5" start_page="21" end_page="21" type="metho">
    <SectionTitle>
5. EXAMPLE
</SectionTitle>
    <Paragraph position="0"> Holding QuickSet in hand, the user views a map from the ModSAF simulation, and with spoken language coupled with pen gestures, issues commands to ModSAF. In otter to create a unit in QuickSet, the user would hold the pen at the desired location and utter (for instance): &amp;quot;led T72 platoon&amp;quot; resulting in a new platoon of the specified type  primarily by SRI International, but modified by us for multimodal interaction, serves as the communication channel between the OAA-brokered agents and the ModSAF simulation system. This agent offers an API for ModSAF that other agents can use.</Paragraph>
    <Paragraph position="1"> Web display agent: The Web display agent can be used to create entities, points, lines, and areas. It posts queries for updates to the state of the simulation via Java code that interacts with the blackboard and facilitator. The queries am routed to the running ModSAF simulation, and the available entities can be viewed over a WWW connection using a suitable browser.</Paragraph>
    <Paragraph position="2"> Other user interfaces: When another user interface connected to the facilitator subscribes to and produces the same set of events as others, it immediately becomes part of a collaboration. One can view this as human-human collaboration mediated by the agent architecture, or as agentagent collaboration.</Paragraph>
    <Paragraph position="3"> CommandVu agent: Since the CommandVu virtual reality system is an agent, the same multimodal interface on the handheld PC can be used to create entities and to fly the user through the 3-D terrain. For example, the user can ask &amp;quot;CommandVu, fly me to this platoon &lt;gesture on the map&gt;.&amp;quot; Application bridge agent: The bridge agent generalizes the underlying applications' API to typed feature structures, thereby providing an interface to the various applications such as ModSAF, CommandVu, and Exinit.</Paragraph>
    <Paragraph position="4"> This allows for a domain-independent integration architecture in which constraints on multimodal interpretation are stated in terms of higher-level constructs such as typed feature structures, greatly facilitating reuse. CORBA bridge agent: This agent converts OAA  establishes two platoons, a barbed-wire fence, a breached minefield, and then issues a command to one platoon to follow a traced route, The user then adds a barbed-wire fence to the simulation by drawing a line at the desired location while uttering '&amp;quot;oarbed wire.&amp;quot; Similarly a fortified line is ~. A minefield of an amorphous shape is drawn and is labeled verbally, and finally an M1A1 platoon is created as above. Then the user can assign a task to the new platoon by saying &amp;quot;M1A1 platoon follow this route&amp;quot; while drawing the route with the pen. The results of these commands are visible on the QuickSet screen, as seen in Figure 4, in the ModSAF simulation, and in the CommandVu 3D rendering of the scene. In addition to multimodal input, unimodal spoken language and gestural commands can be given at any time, depending on the user's task and preference.</Paragraph>
  </Section>
  <Section position="6" start_page="21" end_page="22" type="metho">
    <SectionTitle>
6. MULTIMODAL INTEGRATION
</SectionTitle>
    <Paragraph position="0"> Since any unimodal recognizer will make mistakes, the output of the gesture recognizer is not accepted as a simple unilateral decision. Instead the recognizer produces a set of probabilities, one for each possible interpretation of the gesture. The recognized entities, as well as their recognition probabilities, are sent to the facilitator, which forwards them to the multimodal interpretation agent. In combining the meanings of the gestural and spoken interpretations, we attempt to satisfy an important design consideration, namely that the communicative modalities should compensate for each other's weaknesses [7, 16].</Paragraph>
    <Paragraph position="1"> This is accomplished by selecting the highest scoring unified interpretation of speech and gesture. Importantly,  the unified interpretation might not include the highest scoring gestural (or spoken language) interpretation because it might not be semantically compatible with the other mode. The key to this interpretation process is the use of a typed feature structure \[1, 3\] as a meaning representation language that is common to the natural language and gestural interpretation agents. Johnston et al. \[12\] present the details of multimodal integration of continuous speech and pen-based gesture, guided by research in users' multimodal integration and synchronization strategies \[19\]. Unlike many previous approaches to multimodal integration (e.g, \[2, 9, 12, 15, 25\]) speech is not &amp;quot;in charge,&amp;quot; in the sense of relegating gesture a secondary and dependent role.</Paragraph>
    <Paragraph position="2"> This mutually-compensatory interpretation process is capable of analyzing multimodal constructions, as well as speech-only and pen-only constructions when they occur.</Paragraph>
    <Paragraph position="3"> Vo and Wood's system \[24\] is similar to the one reported here, though we believe the use of typed feature structures provides a more generally usable and formal integration mechanism than their frame-merging strategy. Cheyer and Julia \[4\] sketch a system based on Oviatt's \[17\] results and the OAA \[8\], but do not discuss the integration strategy nor multimodal compensation.</Paragraph>
  </Section>
class="xml-element"></Paper>