<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2029">
<Title>Multi-Modal Question-Answering: Questions without Keyboards</Title>
<Section position="3" start_page="167" end_page="168" type="intro">
<SectionTitle> 2 Previous Work </SectionTitle>
<Paragraph position="0"> A wide variety of work has been done on integrating graphics and/or virtual environments with natural language, dating back to Winograd's (1972) classic &quot;blockworld&quot; simulation. More recently, researchers have been investigating how graphics and natural language can work together to create more compelling interfaces.</Paragraph>
<Section position="1" start_page="167" end_page="167" type="sub_section">
<SectionTitle> 2.1 Multimodal Interfaces </SectionTitle>
<Paragraph position="0"> A large body of work exists on multimodal interfaces - combining multiple modes of interaction so that the advantages of one mode offset the limitations of another. In the specific case of combining natural language and graphics, there have been two main areas of study: interacting with graphical elements to resolve ambiguous references on the natural language side (Bolt, 1980; Kobsa et al., 1986); and generating coordinated text and graphic presentations using information from a knowledge base (Andre and Rist, 1994; Towns et al., 1998).</Paragraph>
<Paragraph position="1"> In addition to these two main areas, early work by Tennant (1983) experimented with using a predictive left-corner parser to populate dynamic menus that the user would navigate to construct queries that were guaranteed to be correct and task-relevant.</Paragraph>
<Paragraph position="2"> Our work contains elements from all of these categories in that we use input gestures to resolve reference ambiguity and we make use of a KB to coordinate the linguistic and graphical information. We were also inspired by Tennant's work on restricting the player's input to avoid parsing problems. However, our work differs from previous efforts in that we:</Paragraph>
</Section>
<Section position="2" start_page="167" end_page="167" type="sub_section">
<SectionTitle> 2.2 Virtual Photographs </SectionTitle>
<Paragraph position="0"> The concept of a virtual photograph has existed for as long as people have taken screenshots of their view into a 3D environment. Recently, however, a few applications have experimented with adding a limited amount of interactivity to these static images. Video games, notably POKEMON SNAP (Nintendo, 1999), incorporate a limited form of interactive virtual photos. While there is no published information about the techniques used in these games, we can infer much by examining the level of interaction permitted.</Paragraph>
<Paragraph position="1"> In POKEMON SNAP, the player zooms around each level on a rail car, taking as many photographs of &quot;wild&quot; pokemon as possible. Scoring in the game is based not only on the number of unique subjects found (and successfully photographed), but also on the quality of the individual photographs. The judging criteria include:
* Is the subject centered in the photo?
* Is the face visible? (for identifiability)
* Does the subject occupy a large percentage of the image?
* Are there multiple pokemon of the same type?
* What is the subject doing? (pose)
In order to properly evaluate the photos, the game must perform some photo annotation when the photo is taken.
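Since the game's internals are unpublished, the following Python sketch is purely hypothetical (all field names, weights, and functions are invented); it only illustrates how the judging criteria above could be recorded as a small annotation when the photo is taken and later reduced to a single score.

```python
# Hypothetical sketch only: POKEMON SNAP's actual annotation and scoring
# logic is unpublished. This shows one way the judging criteria listed above
# could be captured at capture time and reduced to a numeric score.
from dataclasses import dataclass

@dataclass
class PhotoAnnotation:
    subject_id: str        # which pokemon was photographed
    center_offset: float   # 0.0 = perfectly centered, 1.0 = at the frame edge
    face_visible: bool     # needed to identify the subject
    coverage: float        # fraction of the image occupied by the subject
    same_type_count: int   # how many pokemon of the same type are in frame
    pose_bonus: int        # bonus awarded for what the subject is doing

def score_photo(a: PhotoAnnotation) -> int:
    """Reduce an annotation record to a single photo score (all weights invented)."""
    if not a.face_visible:
        return 0                                  # unidentifiable subjects score nothing
    score = 1000                                  # base score for a recognized subject
    score += int((1.0 - a.center_offset) * 500)   # reward good framing
    score += int(a.coverage * 1000)               # reward filling the frame
    score += (a.same_type_count - 1) * 300        # bonus for multiple same-type subjects
    score += a.pose_bonus                         # bonus for an interesting pose
    return score

# Example: a well-framed photo of a single subject with a modest pose bonus.
photo = PhotoAnnotation("pikachu", center_offset=0.1, face_visible=True,
                        coverage=0.4, same_type_count=1, pose_bonus=200)
print(score_photo(photo))
```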
However, since interaction with the photo is limited to scoring and display, these annotations are easily reduced to the set of values necessary to calculate the score. From the player's perspective, since there is no mechanism for interacting with the contents of the photo, all interaction is completed by the time the photo is taken - the photo merely serves as an additional game object.</Paragraph>
</Section>
<Section position="3" start_page="167" end_page="168" type="sub_section">
<SectionTitle> 2.3 Interactive Images </SectionTitle>
<Paragraph position="0"> Recently, considerable work has been done on making images (including electronic versions of real photographs) more interactive, either by manually or automatically annotating image contents or by making use of existing image metadata.</Paragraph>
<Paragraph position="1"> The most widely used example of this is the HTML image map (Berners-Lee and Connolly, 1995) supported by most web browsers.</Paragraph>
<Paragraph position="2"> An example more relevant to our work is the ALFRESCO system (Stock, 1991), which uses graphical representations of Italian frescos and allows the user to query using a combination of natural language and pointing gestures. Beyond the obvious difference that our system does not permit direct natural language input, our work also differs in that we annotate the images with scene information beyond a simple object ID, and we calculate the image regions automatically from the objects in the virtual world.</Paragraph>
</Section>
</Section>
</Paper>
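The section above does not detail how these image regions are computed, so the following Python sketch is only one plausible illustration (all names and the projection setup are invented): each scene object's 3D bounding box is projected through the camera that took the virtual photo, and the 2D bounding rectangle of the projected corners becomes that object's image region.

```python
# Minimal sketch (not necessarily the authors' algorithm): derive 2D image
# regions for a virtual photo by projecting the corners of each object's 3D
# bounding box through the photo's camera, then taking the 2D bounding
# rectangle of the projected points, clipped to the image.
import numpy as np

def project_point(p, view, proj, width, height):
    """Project a 3D world-space point to pixel coordinates."""
    clip = proj @ view @ np.append(p, 1.0)       # world -> clip space
    ndc = clip[:3] / clip[3]                     # perspective divide
    x = (ndc[0] * 0.5 + 0.5) * width             # NDC x -> pixel column
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * height    # NDC y -> pixel row (flipped)
    return x, y

def image_region(bbox_corners, view, proj, width, height):
    """Axis-aligned 2D region covered by an object's projected bounding-box corners."""
    pts = [project_point(c, view, proj, width, height) for c in bbox_corners]
    xs, ys = zip(*pts)
    return (max(min(xs), 0.0), max(min(ys), 0.0),
            min(max(xs), float(width)), min(max(ys), float(height)))

if __name__ == "__main__":
    # Toy example: camera at the origin looking down -z, a unit cube centered at z = -5.
    near, far = 0.1, 100.0
    f = 1.0 / np.tan(np.radians(30.0))           # 60-degree vertical field of view
    proj = np.array([[f, 0.0, 0.0, 0.0],
                     [0.0, f, 0.0, 0.0],
                     [0.0, 0.0, (far + near) / (near - far), 2 * far * near / (near - far)],
                     [0.0, 0.0, -1.0, 0.0]])
    view = np.eye(4)
    corners = [np.array([x, y, z])
               for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-5.5, -4.5)]
    print(image_region(corners, view, proj, 640, 480))
```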