<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0603"> <Title>Understanding Complex Visually Referring Utterances</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> We present a study of how people describe objects in visual scenes of the kind shown in Figure 1. Based on this study, we propose a computational model of visually-grounded language understanding. A typical referring expression for Figure 1 might be, &quot;the far back purple cone that's behind a row of green ones&quot;. In such tasks, speakers construct expressions to guide listeners' attention to intended objects. Such referring expressions succeed in communication because speakers and listeners find similar features of the visual scene to be salient, and share an understanding of how language is grounded in terms of these features. This work is a step towards our longer-term goal of developing a conversational robot (Roy et al., forthcoming 2003) that can fluidly connect language to perception and action.</Paragraph>
<Paragraph position="1"> To study the characteristics of descriptive spatial language, we collected several hundred referring expressions based on scenes similar to Figure 1. We analysed the descriptions by cataloguing the visual features that they referred to within a scene, and the range of linguistic devices (words or grammatical patterns) that they used to refer to those features. The combination of a visual feature and a corresponding linguistic device is referred to as a descriptive strategy.</Paragraph>
[Figure 1 caption fragment: &quot;... referring expressions (if this figure has been reproduced in black and white, the light cones are green in colour, the dark cones are purple)&quot;]
<Paragraph position="2"> We propose a set of computational mechanisms that correspond to the most commonly used descriptive strategies from our study. The resulting model has been implemented as a set of visual feature extraction algorithms, a lexicon that is grounded in terms of these visual features, a robust parser that captures the syntax of spoken utterances, and a compositional engine driven by the parser that combines the visual groundings of lexical units. We use the term grounded semantic composition to highlight that both the semantics of individual words and the word composition process itself are visually-grounded. We propose processes that combine the visual models of words, governed by rules of syntax. In designing our system, we made several simplifying assumptions. We assumed that word meanings are independent of the visual scene, and that semantic composition is a purely incremental process. As we will show, neither of these assumptions holds in all of our data, but our system still understands most utterances correctly.</Paragraph>
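The sketch below (not part of the original paper) illustrates one way a grounded, incremental composition of this kind could be realised: each word is grounded as a function over candidate referents in the scene, and a composition routine applies those functions in the order a parser might deliver them. The toy scene representation, lexicon entries, and function names are assumptions introduced here for exposition, not the system described above.

    # Illustrative sketch only: a toy scene, a hand-built grounded lexicon,
    # and a purely incremental composition routine. All names are assumptions
    # introduced for exposition, not the paper's implementation.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Cone:
        x: float      # horizontal position
        y: float      # depth; larger means farther back
        colour: str   # "green" or "purple"

    # A word's grounding maps (full scene, current candidates) -> new candidates.
    Grounding = Callable[[List[Cone], List[Cone]], List[Cone]]

    def colour_word(colour: str) -> Grounding:
        """Ground a colour adjective as a filter over candidate referents."""
        return lambda scene, cands: [c for c in cands if c.colour == colour]

    def far_back(scene: List[Cone], cands: List[Cone]) -> List[Cone]:
        """Ground 'far back' by keeping the candidates deepest in the scene."""
        deepest = max(c.y for c in cands)
        return [c for c in cands if c.y >= deepest - 0.5]

    LEXICON: Dict[str, Grounding] = {
        "green": colour_word("green"),
        "purple": colour_word("purple"),
        "far back": far_back,
        "cone": lambda scene, cands: cands,  # head noun: every cone is a candidate
    }

    def compose(words: List[str], scene: List[Cone]) -> List[Cone]:
        """Apply word groundings one after another (incremental composition)."""
        candidates = list(scene)
        for word in words:
            candidates = LEXICON[word](scene, candidates)
        return candidates

    if __name__ == "__main__":
        scene = [Cone(0, 0, "green"), Cone(1, 0, "green"),
                 Cone(2, 0, "green"), Cone(1, 3, "purple")]
        # A simplified parse of "the far back purple cone"
        print(compose(["far back", "purple", "cone"], scene))

In this simplification every word narrows the candidate set; the actual model additionally handles scene-dependent meanings and non-incremental composition, which is precisely where the simplifying assumptions above break down.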
<Paragraph position="3"> To evaluate the system, we collected a set of spoken utterances from three speakers. The model was able to correctly understand the visual referents of 59% of the expressions (chance performance was (1/30) Σ_{i=1}^{30} 1/i ≈ 13%). The system was able to resolve a range of linguistic phenomena that made use of relatively complex compositions of spatial semantics. We provide an analysis of the sources of failure in this evaluation, based on which we propose a number of improvements that are required to achieve human-level performance.</Paragraph>
<Paragraph position="4"> An extended report on this work can be found in (Gorniak and Roy, 2003).</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Related Work </SectionTitle>
<Paragraph position="0"> Winograd's SHRDLU is a well-known system that could understand and generate natural language referring to objects and actions in a simple blocks world (Winograd, 1970). Like our system, it performed semantic interpretation during parsing by attaching short procedures to lexical units. However, SHRDLU had access to a clean symbolic representation of the scene and only handled sentences it could parse completely. The system discussed here works with a synthetic vision system, reasons over geometric and other visual measures, and operates on accurate transcripts of noisy human speech.</Paragraph>
<Paragraph position="1"> Partee provides an overview of the general formal semantics approach and the problems of context-based meanings and meaning compositionality from this perspective (Partee, 1995). Our work reflects many of these ideas, such as viewing adjectives as functions, as well as ideas from Pustejovsky's theory of the Generative Lexicon (GL) (Pustejovsky, 1995). However, these formal approaches operate in a symbolic domain and leave the details of non-linguistic influences on meaning unspecified, whereas we take the computational modelling of these influences as our primary concern.</Paragraph>
<Paragraph position="2"> Word meanings have been approached by several researchers as a problem of associating visual representations, often with complex internal structure, with word forms. Models have been suggested for visual representations underlying spatial relations (Regier and Carlson, 2001). Models for verbs include grounding their semantics in the perception of actions (Siskind, 2001). Landau and Jackendoff provide a detailed analysis of additional visual shape features that play a role in language (Landau and Jackendoff, 1993).</Paragraph>
<Paragraph position="3"> We have previously proposed methods for visually-grounded language learning (Roy and Pentland, 2002), understanding (Roy et al., 2002), and generation (Roy, 2002). However, the treatment of semantic composition in these efforts was relatively primitive. While this simple approach worked in the constrained domains that we have addressed in the past, it does not scale to the present task.</Paragraph> </Section> </Section> </Paper>