<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2029"> <Title>Multi-Modal Question-Answering: Questions without Keyboards</Title> <Section position="4" start_page="168" end_page="168" type="metho"> <SectionTitle> 3 Interacting with Virtual Photos </SectionTitle> <Paragraph position="0"> As mentioned in the Introduction, virtual photos can become a useful metaphor for interaction with NPCs in games. Ideally, the player should be able to take a picture of anything in the virtual world and then show that photo to an NPC to engage in a dialog about the photo contents.</Paragraph> <Paragraph position="1"> In our implementation, the player interacts with the NPC by clicking on an object in the photo to pull up a menu of context-dependent natural language queries. When the player selects an item from this menu, the query is sent to the NPC that the player is currently &quot;talking to&quot;. This menu of context-sensitive queries is crucial to the interaction because a pointing gesture without an accompanying description is ambiguous (Schmauks, 1987), and it is through this menu selection that the player expresses intent and restricts the scope of the dialog.</Paragraph> <Paragraph position="2"> There are two obvious benefits to approaching the QA interaction in this way. First, even though the topic is limited by the objects in the photo, the player is given control over the direction of the dialog. This is an improvement over the traditional scripted NPC interaction, in which the player has little control over the dialog. The other benefit is that while the player is given control over the content, the player is not granted too much control, since the photo metaphor limits the topic to things that are relevant to the game. This effectively avoids the out-of-domain, paraphrase, and ambiguity problems that commonly plague natural language interfaces.</Paragraph> <Section position="1" start_page="168" end_page="168" type="sub_section"> <SectionTitle> 3.1 Annotations </SectionTitle> <Paragraph position="0"> The quality of player-NPC interaction is directly dependent on the kind of annotations that are used. For example, associating a literal text string with each object would result in a system where the NPCs would not exhibit individuality, since they would all produce exactly the same answer to a query. Alternatively, using a global object identifier would also cause problems, because in a dynamically changing world we would need to create a system to keep track of the differences between the object at the time of the photo and the object's current state.</Paragraph> <Paragraph position="1"> It is for these reasons that we record for each object an abstract representation that we can manipulate and merge with data from other sources such as the NPC's KB. Beyond providing a place to record information about the objects that is specific to a particular photo, this also allows us to individualize the NPC responses and create a more interesting QA interaction.</Paragraph> </Section> <Section position="2" start_page="168" end_page="168" type="sub_section"> <SectionTitle> 3.2 Example Interaction </SectionTitle> <Paragraph position="0"> As a simple example, imagine a photo taken by a player that shows a few houses in a town. Taking this photo to an NPC and clicking on one of the houses will bring up a menu of possible questions that is determined by the object and the contents of the NPC's KB. Selecting the default &quot;What is this?&quot; query for an NPC that has no special knowledge of the objects in this photo will result in the generic description (stored in the photo) being used for the NPC's response (e.g., &quot;That is a blue house&quot;).</Paragraph>
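As a concrete illustration of the interaction loop just described, here is a minimal sketch in Python. All names (PhotoAnnotation, NPC, object_at, and the sample data) are hypothetical stand-ins rather than the paper's actual implementation: the clicked pixel is resolved to an object, the query menu is assembled from that object and the NPC's KB, and the default query falls back to the generic description stored in the photo.

```python
# Hypothetical sketch of the photo-click interaction; names and data are illustrative.

class PhotoAnnotation:
    def __init__(self, olm, descriptors):
        self.olm = olm                  # pixel -> object id (object locator map)
        self.descriptors = descriptors  # object id -> generic description

    def object_at(self, x, y):
        return self.olm[(x, y)]

class NPC:
    def __init__(self, name, kb):
        self.name = name
        self.kb = kb  # object id -> facts this NPC knows about the object

    def menu_for(self, obj_id):
        # Default queries are always offered; extra queries depend on this NPC's KB.
        queries = ["What is this?", "Where is this?"]
        queries += [f"Ask about: {fact}" for fact in self.kb.get(obj_id, [])]
        return queries

    def answer_default(self, obj_id, photo):
        # With no special knowledge, fall back to the photo's generic description.
        if obj_id not in self.kb:
            return f"That is {photo.descriptors[obj_id]}."
        return self.kb[obj_id][0]

# Usage: the player clicks pixel (120, 45) in the photo and asks the default query.
photo = PhotoAnnotation(olm={(120, 45): "house_3"},
                        descriptors={"house_3": "a blue house"})
villager = NPC("villager", kb={})
obj = photo.object_at(120, 45)
print(villager.menu_for(obj))               # ['What is this?', 'Where is this?']
print(villager.answer_default(obj, photo))  # That is a blue house.
```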
<Paragraph position="1"> If, however, the NPC has some knowledge about the object, then the NPC will be able to provide information beyond that provided within the photo. Given the following information: This is John's house.</Paragraph> <Paragraph position="2"> My name for John is my father.</Paragraph> <Paragraph position="3"> the NPC can piece it all together and generate &quot;That is my father's house&quot; as an answer.</Paragraph> </Section> </Section> <Section position="5" start_page="168" end_page="169" type="metho"> <SectionTitle> 4 Representing Knowledge </SectionTitle> <Paragraph position="0"> A key component of our system is the semantic representation that is used to encode not only the information that the NPC has about its surroundings, but also the contents of the virtual photo. These KBs, which are created from text documents containing natural language descriptions, form the core document set on which the QA process operates.</Paragraph> <Section position="1" start_page="168" end_page="169" type="sub_section"> <SectionTitle> 4.1 Semantic Representation </SectionTitle> <Paragraph position="0"> While there are a variety of representations that can be used to encode semantic information, we opted to use a representation that is automatically extracted from natural language text. We chose this representation because we desired a notation that could be produced directly by our parser (and Suzuki, 2002). These structures, called logical forms (LFs), are the forms that are stored in the KB. This tree structure has many advantages. First, since it is based on our broad-coverage grammar, it provides a reasonable representation for all of the things that a player or NPC is likely to want to talk about in a game. We are also readily able to generate output text from this representation by making use of our generation component. In addition, the fact that this representation is created directly from natural language input means that game designers can create these KBs without any special training in knowledge representation.</Paragraph> <Paragraph position="1"> Another advantage of this tree structure is that it is easy to manipulate by copying subtrees from one tree into another. Passing this manipulated tree to our generation component results in the text output that is presented to the user. The ease with which we can manipulate these structures allows us to dynamically create new trees and provide the NPC with the ability to talk about a wide array of subjects without having to author all of the interactions.</Paragraph> </Section> <Section position="2" start_page="169" end_page="169" type="sub_section"> <SectionTitle> 4.2 Anaphora </SectionTitle> <Paragraph position="0"> As mentioned, once the sentences for the KB have been authored, our parser automatically handles the work required to create the LFs from the text. However, we do not have a fully automatic solution for the issue of reference resolution or anaphora. For this, we currently rely on the person creating the KB to resolve references to objects within the text or KB (endophora) and in the virtual world (exophora).</Paragraph> </Section> </Section>
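The subtree manipulation described in Section 4.1 can be pictured with a small sketch. This is a toy, assuming only that an LF is a labeled tree whose leaves can be flattened into text; the real LFs produced by the parser, and the real generation component, are far richer than this.

```python
# Toy logical forms and subtree substitution; structure and labels are illustrative.
from copy import deepcopy

class LF:
    """A labeled node with ordered children."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def substitute(tree, label, replacement):
    """Return a copy of `tree` with every subtree labeled `label` replaced."""
    if tree.label == label:
        return deepcopy(replacement)
    return LF(tree.label, [substitute(c, label, replacement) for c in tree.children])

def generate(tree):
    """Stand-in for the generation component: flatten leaves left to right."""
    if not tree.children:
        return tree.label
    return " ".join(generate(c) for c in tree.children)

# Shared KB fact: "This is John's house."
johns_house = LF("NP", [LF("POSS", [LF("John's")]), LF("house")])
# NPC-specific fact: this NPC's name for John is "my father".
my_father = LF("POSS", [LF("my father's")])

# Copy the possessive subtree from the NPC's KB into the shared fact and generate.
print(generate(substitute(johns_house, "POSS", my_father)))  # my father's house
```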
<Section position="6" start_page="169" end_page="170" type="metho"> <SectionTitle> 5 Posing Questions </SectionTitle> <Paragraph position="0"> In our system, questions are posed by first narrowing down the scope of the query by selecting an object in a virtual photo, and then choosing a query from a list that is automatically produced by the QA system. This architecture places a heavy burden on the query generation component, since that is the component that determines the ultimate limitations of the system.</Paragraph> <Section position="1" start_page="169" end_page="169" type="sub_section"> <SectionTitle> 5.1 Query Generation </SectionTitle> <Paragraph position="0"> In a system where only automatically generated queries are allowed, it is important to be able to create a set of interesting queries to avoid frustrating the user. Beyond the straightforward &quot;Who/What/Where is this?&quot;-style questions, we also use a question generator (originally described by Schwartz et al. (2004) in the context of language learning) to produce a set of answerable questions about the selected object.</Paragraph> <Paragraph position="1"> Once the player selects a query, the final step in query generation is to create the LF representation of the question. This is required so that we can more easily find matches in the KB. Fortunately, because the queries are either formulaic (e.g., the &quot;Who/What/Where&quot; queries) or extracted from the KB, the LF is trivially created without requiring a runtime parsing system.</Paragraph> </Section> <Section position="2" start_page="169" end_page="169" type="sub_section"> <SectionTitle> 5.2 Knowledgebase Matching </SectionTitle> <Paragraph position="0"> When the player poses a query to an NPC, we need to find an appropriate match in the KB. To do this, we perform subtree matches between the query's LF and the contents of the KB, after first modifying the original query so that question words (e.g., Who, What, ...) are replaced with special identifiers that permit wildcard matches.</Paragraph> <Paragraph position="1"> When a match is found, a complete, grammatical response is created by replacing the wildcard node with the matching subtree and then passing this structure to the text generation component.</Paragraph> </Section> <Section position="3" start_page="169" end_page="170" type="sub_section"> <SectionTitle> 5.3 Deixis </SectionTitle> <Paragraph position="0"> In order to make the NPC's responses believable, the final step is to incorporate deictic references into the utterance. These are references that depend on the extralinguistic context, such as the identity, time, or location of the speaker or listener. Because the semantic structures are easy to manipulate, we can easily substitute the appropriate referring expression for each of these references. An example of this was given earlier, when the subtree corresponding to &quot;my father&quot; was used to refer to the owner of the house.</Paragraph> <Paragraph position="1"> This capability gives us a convenient way to support having separate KBs for shared knowledge and individual knowledge. General information can be placed in the shared KB, while knowledge that is specific to an individual (like the fact that John is &quot;my father&quot;) is stored in a separate KB that is specific to that individual.</Paragraph> <Paragraph position="2"> This allows us to avoid having to re-author the knowledge for each NPC while still allowing individualized responses.</Paragraph> </Section> </Section>
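The wildcard-style matching of Section 5.2 can be sketched as follows, using nested tuples as stand-ins for LFs. The wildcard identifier, the KB contents, and the answer frame are illustrative assumptions; the point is only that the question word is replaced by a wildcard, the query is unified against KB entries, and the bound subtree is spliced back in before generation.

```python
# Sketch of wildcard subtree matching against a KB; data structures are illustrative.
WILDCARD = "?x"

def match(query, fact, bindings=None):
    """Unify a query tree with a KB fact; the wildcard matches any subtree."""
    bindings = {} if bindings is None else bindings
    if query == WILDCARD:
        bindings[WILDCARD] = fact
        return bindings
    if isinstance(query, tuple) and isinstance(fact, tuple) and len(query) == len(fact):
        for q, f in zip(query, fact):
            if match(q, f, bindings) is None:
                return None
        return bindings
    return bindings if query == fact else None

def generate(tree):
    """Toy generation: flatten the tree left to right."""
    return " ".join(generate(t) for t in tree) if isinstance(tree, tuple) else tree

# KB fact for "John owns the blue house"; the player's "Who owns the blue house?"
# has already had its question word replaced by the wildcard identifier.
kb = [("owns", "John", ("the", "blue", "house"))]
query = ("owns", WILDCARD, ("the", "blue", "house"))

for fact in kb:
    bindings = match(query, fact)
    if bindings is not None:
        # Splice the matched subtree into a declarative answer frame and generate.
        answer = (bindings[WILDCARD], "owns", ("the", "blue", "house"))
        print(generate(answer))  # John owns the blue house
```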
<Section position="7" start_page="170" end_page="170" type="metho"> <SectionTitle> 6 Creating Annotated Photographs </SectionTitle> <Paragraph position="0"> Our virtual photos consist of three major parts: the image, the object locator map, and the object descriptors. In addition, we define some simple metadata. We use the term &quot;annotations&quot; to refer to the combination of the object locator map, the descriptors, and the metadata.</Paragraph> <Paragraph position="1"> While the photo image is trivially created by recording the camera view when the photo is taken, the other parts require special techniques and are described in the following sections.</Paragraph> <Section position="1" start_page="170" end_page="170" type="sub_section"> <SectionTitle> 6.1 The Object Locator Map (OLM) </SectionTitle> <Paragraph position="0"> The object locator map (OLM) is an image-space map that corresponds 1-to-1 with the pixels in the virtual photograph image. For each image pixel, the corresponding OLM &quot;pixel&quot; contains information about the object that occupies that image-space location. We create the OLM using the back-buffer technique originally attributed to Weghorst et al. (1984).</Paragraph> </Section> <Section position="2" start_page="170" end_page="170" type="sub_section"> <SectionTitle> 6.2 The Object Descriptors </SectionTitle> <Paragraph position="0"> The object descriptors contain the semantic descriptions of the objects plus some metadata that helps determine how the player and NPC can interact with the objects in the photo.</Paragraph> <Paragraph position="1"> In our system, we use the semantic annotations associated with each object as a generic description that contains information that would be readily apparent to someone looking at the object. Thus, these descriptions focus on the physical characteristics (derived from the object description) or current actions (derived from the current animation state) of the object.</Paragraph> </Section> </Section>
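A conceptual sketch of the OLM follows, assuming only that the back-buffer pass writes each object's integer ID in place of its color into an off-screen buffer the same size as the photo; the rasterization below is faked with screen-space rectangles, whereas a real engine would reuse its normal rendering path.

```python
# Conceptual OLM construction; the scene, IDs, and rasterization are hypothetical.
WIDTH, HEIGHT = 8, 4  # a tiny "photo" for illustration

def render_id_pass(scene, width, height):
    """Rasterize object IDs into a 2D buffer (0 = background)."""
    olm = [[0] * width for _ in range(height)]
    for obj_id, (x0, y0, x1, y1) in scene.items():
        # Stand-in rasterization: each object covers a screen-space rectangle.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                olm[y][x] = obj_id
    return olm

def object_at(olm, x, y):
    """Map a clicked photo pixel back to the object that produced it."""
    return olm[y][x]

# Two objects in screen space: object 1 (a house) and object 2 (a tree).
scene = {1: (0, 0, 4, 3), 2: (5, 1, 8, 4)}
olm = render_id_pass(scene, WIDTH, HEIGHT)

print(object_at(olm, 2, 1))  # 1 -> the house
print(object_at(olm, 6, 2))  # 2 -> the tree
```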
<Section position="8" start_page="170" end_page="171" type="metho"> <SectionTitle> 7 3D Modeling </SectionTitle> <Paragraph position="0"> The modeling of 3D scenes and objects has typically been done in isolation, where only graphical (display and performance) concerns were considered. In this section, we discuss some of the changes that are required on the modeling side to better support our interaction.</Paragraph> <Section position="1" start_page="170" end_page="171" type="sub_section"> <SectionTitle> 7.1 Enhancements </SectionTitle> <Paragraph position="0"> Beyond the enhancement of attaching abstract semantic descriptions (rather than simple text labels as in Feiner et al. (1992)) to each object in the virtual world's scene graph, we introduce a few other features to enhance the interactivity of the virtual photos.</Paragraph> <Paragraph position="1"> Semantic Anchors A limitation of attaching the semantic descriptions to objects in the 3D world is that this only covers concrete objects that have a physical representation in the world. Semi-abstract objects (called &quot;negative parts&quot; by Landau and Jackendoff (1993)), like a cave or a hole, do not have a direct representation in the world and thus do not have objects onto which semantic descriptions can be attached. However, it is certainly possible that the player might wish to refer to these objects in the course of a game.</Paragraph> <Paragraph position="2"> We provide support for these referable, non-physical objects through the use of semantic anchors, which are invisible objects in the world that provide anchor points onto which we can attach information. For example, abstract objects like a hole or a cave can be filled with a semantic anchor so that when a photo is taken of a region that includes the cave, the player can click on that region and get a meaningful result.</Paragraph> <Paragraph position="3"> Since these objects are not displayed, there is no requirement that they be closed 3D forms.</Paragraph> <Paragraph position="4"> This gives us the flexibility to create view-dependent semantic anchors by tagging regions of space based on the current view. For example, a cave entrance could be labeled simply as a &quot;cave&quot; for viewpoints outside the cave, while this same portal can be termed an &quot;exit&quot; (or left unlabeled) from vantage points inside the cave.</Paragraph> <Paragraph position="5"> By orienting these open forms correctly, we can rely on the graphic engine's backface culling to automatically remove the anchors that are inappropriate for the current view. (Backface culling is an optimization technique that removes back-facing surfaces, i.e., surfaces on the side of the object away from the viewer, from the graphic engine pipeline so that resources are not wasted processing them. This technique relies on the assumption that all the objects in the virtual world are closed 3D forms, so that drawing only the front-facing surfaces doesn't change the resulting image.)</Paragraph> </Section> <Section position="2" start_page="171" end_page="171" type="sub_section"> <SectionTitle> Action Descriptions </SectionTitle> <Paragraph position="0"> In addition to attaching semantic descriptions to objects, we also allow semantic descriptions to be added to animation sequences in the game. This provides a convenient mechanism for identifying what a person is doing in a photo so that questions relating to actions can be proposed.</Paragraph> <Paragraph position="1"> Key Features We also permit key features to be defined (as was apparently done for POKEMON SNAP) so that we can approximate object identifiability. In our implementation, we require that (at least a portion of) all key features be visible to satisfy this requirement.</Paragraph> <Paragraph position="2"> The advantage of this approach is that it is easy to implement (since there is no need to determine whether the entire key feature is visible), but it requires that the key features be chosen carefully in order to produce reasonable results.</Paragraph> </Section>
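The key-feature test is easy to express once the OLM is available. The sketch below assumes, hypothetically, that the photo yields a set of (object, feature) pairs for the features whose pixels are visible; an object is then treated as identifiable only if every one of its key features contributes at least one pixel, which mirrors the partial-visibility requirement stated above.

```python
# Sketch of the key-feature visibility test; feature names and data are illustrative.
def identifiable(obj_id, key_features, visible_features):
    """True if at least part of every key feature of obj_id appears in the photo.

    visible_features: set of (object id, feature name) pairs recovered from the
    photo's object locator map; key_features: the features required for obj_id.
    """
    return all((obj_id, feature) in visible_features for feature in key_features)

# "guard_7" is identifiable only if some pixels of both key features are visible.
visible = {("guard_7", "face"), ("guard_7", "badge"), ("house_3", "door")}
print(identifiable("guard_7", ["face", "badge"], visible))  # True
print(identifiable("house_3", ["door", "roof"], visible))   # False
```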
<Section position="3" start_page="171" end_page="171" type="sub_section"> <SectionTitle> 7.2 Limitations </SectionTitle> <Paragraph position="0"> Even with the proposed enhancements, there are clear limitations to the annotated 3D model approach that will require further investigation.</Paragraph> <Paragraph position="1"> First, there is an unfortunate disconnect between the modeled structures and semantic structures. When a designer creates a 3D model, the only consideration is the graphical presentation of the model, and so joints like a wrist or elbow are likely to be modeled as a single point.</Paragraph> <Paragraph position="2"> This contrasts with a more semantic representation, which would have the wrist extend slightly into the hand and forearm.</Paragraph> <Paragraph position="3"> Another problem is the creation of relationships between the objects in the photo. This is difficult because many relationships (like &quot;next to&quot; or &quot;behind&quot;) can mean different things in world-space (as the objects are arranged in the virtual world) and image-space (as they appear in the photo).</Paragraph> <Paragraph position="4"> And finally, there is the standard &quot;picking from an object hierarchy&quot; problem: when a node in the hierarchy is selected, the user's intent is ambiguous, since the intended item could be the node itself or any of its parent nodes.</Paragraph> </Section> </Section> </Paper>