<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1131"> <Title>Incremental generation of spatial referring expressions in situated dialog</Title> <Section position="5" start_page="1042" end_page="1045" type="metho"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> We base our GRE approach on an extension of the incremental algorithm (Dale and Reiter, 1995).</Paragraph> <Paragraph position="1"> The motivation for basing our approach on this algorithm is its polynomial complexity. The algorithm iterates through the properties of the target and for each property computes the set of distractor objects for which (a) the conjunction of the properties selected so far, and (b) the current property hold. A property is added to the list of selected properties if it reduces the size of the distractor object set. The algorithm succeeds when all the distractors have been ruled out; it fails if all the properties have been processed and there are still some distractor objects. The algorithm can be refined by ordering the checking of properties according to fixed preferences, e.g. first a taxonomic description of the target, second an absolute property such as colour, third a relative property such as size. (Dale and Reiter, 1995) also stipulate that the type description of the target should be included in the description even if its inclusion does not make the target distinguishable.</Paragraph> <Paragraph position="2"> We extend the original incremental algorithm in two ways. First, we integrate a model of object salience by modifying the condition under which a description is deemed to be distinguishing: it is distinguishing if all the distractors have been ruled out, or if the salience of the target object is greater than the highest salience score ascribed to any of the current distractors. This is motivated by the observation that people can easily resolve underdetermined references using salience (Duwe and Strohner, 1997). We model the influence of visual and discourse salience using a function salience(L), Equation 1. The function returns a value between 0 and 1 to represent the relative salience of a landmark L in the scene. The relative salience of an object is the average of its visual salience (S_vis) and its discourse salience (S_disc): salience(L) = (S_vis(L) + S_disc(L)) / 2 (Equation 1). The visual salience S_vis is computed using the algorithm of (Kelleher and van Genabith, 2004), which assigns each object in a scene a relative salience based on its perceivable size and its centrality relative to the viewer's focus of attention, returning scores in the range of 0 to 1. The discourse salience (S_disc) of an object is computed based on recency of mention (Hajičová, 1993), except that we represent the maximum overall salience in the scene as 1, and use 0 to indicate that the landmark is not salient in the current context. Algorithm 1 gives the basic algorithm with salience.</Paragraph> <Paragraph position="3"> Algorithm 1 (The Basic Incremental Algorithm). Require: T = target object; D = set of distractor objects. Initialise: P = {type, colour, size}; DESC = {}. The algorithm iterates over the properties in P, adding a property to DESC whenever it reduces the distractor set, and stopping as soon as the salience-extended distinguishing condition holds; at that point, if type(x) ∉ DESC, the type property is added before the description is returned. If all properties are processed without the condition holding, the algorithm has failed to generate a distinguishing description and returns DESC as it stands.</Paragraph> <Paragraph position="7"> Secondly, we extend the incremental algorithm in how we construct the context model used by the algorithm. The context model determines to a large degree the output of the incremental algorithm.
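(As a point of reference, the salience-extended basic algorithm described above can be sketched in Python as follows. The object representation, the property inventory and the salience scores are illustrative assumptions, not the authors' implementation.)

def salience(obj):
    # Equation 1: relative salience is the average of visual salience
    # (perceivable size, centrality of the object) and discourse salience
    # (recency of mention), both normalised to the range [0, 1].
    return (obj["s_vis"] + obj["s_disc"]) / 2.0

def is_distinguishing(target, distractors):
    # Salience-extended success condition: all distractors are ruled out,
    # or the target is more salient than every remaining distractor.
    return (not distractors or
            salience(target) > max(salience(d) for d in distractors))

def basic_incremental_algorithm(target, distractors,
                                preferred=("type", "colour", "size")):
    desc = {}
    remaining = list(distractors)
    for prop in preferred:
        if is_distinguishing(target, remaining):
            break
        # Distractors for which the properties selected so far and the
        # current property still hold.
        survivors = [d for d in remaining if d.get(prop) == target.get(prop)]
        if len(survivors) < len(remaining):
            desc[prop] = target[prop]
            remaining = survivors
    success = is_distinguishing(target, remaining)
    if "type" not in desc:
        desc["type"] = target["type"]   # always include the type of the target
    return desc, success

# Illustrative use: refer to a blue ball among a red ball and a blue box.
tgt = {"type": "ball", "colour": "blue", "size": "small", "s_vis": 0.5, "s_disc": 0.0}
ctx = [{"type": "ball", "colour": "red", "size": "small", "s_vis": 0.5, "s_disc": 0.0},
       {"type": "box", "colour": "blue", "size": "large", "s_vis": 0.5, "s_disc": 0.0}]
print(basic_incremental_algorithm(tgt, ctx))
# -> ({'type': 'ball', 'colour': 'blue'}, True), realisable as "the blue ball"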
However, Dale and Reiter do not define how this set should be constructed; they only write: &quot;[w]e define the context set to be the set of entities that the hearer is currently assumed to be attending to&quot; (Dale and Reiter, 1995, pg. 236). Before applying the incremental algorithm we must construct a context model in which we can check whether or not the description generated distinguishes the target object. To constrain the combinatorial explosion in relational scene model construction we construct a series of reduced scene models, rather than one complex exhaustive model. This construction is driven by a hierarchy of spatial relations and the partitioning of the context model into objects that may and may not function as landmarks. These two components are developed below. §3.1 discusses a hierarchy of spatial relations, and §3.2 presents a classification of landmarks and uses these groupings to create a definition of a distinguishing locative description.</Paragraph> <Paragraph position="8"> In §3.3 we give the generation algorithm integrating these components.</Paragraph> <Section position="1" start_page="1042" end_page="1044" type="sub_section"> <SectionTitle> 3.1 Cognitive Ordering of Contexts </SectionTitle> <Paragraph position="0"> Psychological research indicates that spatial relations are not preattentively perceptually available (Treisman and Gormican, 1988). Rather, their perception requires attention (Logan, 1994; Logan, 1995). These findings point to subjects constructing contextually dependent, reduced relational scene models, rather than an exhaustive context-free model. Mimicking this, we have developed an approach to context model construction that constrains the combinatorial explosion inherent in the construction of relational context models by incrementally building a series of reduced context models. Each context model focuses on a different spatial relation. The ordering of the spatial relations is based on the cognitive load of interpreting the relation. Below we motivate and develop the ordering of relations used.</Paragraph> <Paragraph position="1"> We can reasonably assume that it takes less effort to describe one object than two. Following the Principle of Minimal Cooperative Effort (Clark and Wilkes-Gibbs, 1986), one should only use a locative expression when there is no distinguishing description of the target object using a simple feature-based approach. Also, the Principle of Sensitivity (Dale and Reiter, 1995) states that when producing a referring expression, one should prefer features the hearer is known to be able to interpret and see. This points to a preference, due to cognitive load, for descriptions that identify an object using purely physical and easily perceivable features ahead of descriptions that use spatial expressions. Experimental results support this preference (van der Sluis and Krahmer, 2004).</Paragraph> <Paragraph position="2"> Similarly, we can distinguish between the cognitive loads of processing different forms of spatial relations. In comparing the cognitive load associated with different spatial relations it is important to recognize that they are represented and processed at several levels of abstraction.
For example, there is the geometric level, where metric properties are dealt with; the functional level, where the specific properties of spatial entities deriving from their functions in space are considered; and the pragmatic level, which gathers the underlying principles that people use in order to discard wrong relations or to deduce more information (Edwards and Moulin, 1998). Our discussion is grounded at the geometric level.</Paragraph> <Paragraph position="3"> Focusing on static prepositions, we assume topological prepositions have a lower perceptual load than projective ones, as perceiving two objects being close to each other is easier than the processing required to handle the frame of reference that projective prepositions demand. This parallels the treatment of type as the easiest property to process, before absolute gradable predicates (e.g. colour), which are in turn easier than relative gradable predicates (e.g. size) (Dale and Reiter, 1995).</Paragraph> <Paragraph position="5"> We can refine the topological versus projective preference further if we consider the contrastive and relative uses of these relations (§2). Perceiving and interpreting a contrastive use of a spatial relation is computationally easier than judging a relative use. Finally, within projective prepositions, psycholinguistic data indicates a perceptually based ordering of the relations: above/below are easier to perceive and interpret than in front of/behind, which in turn are easier than to the right of/to the left of (Bryant et al., 1992; Gapp, 1995). In sum, we propose the following ordering: topological contrastive < topological relative < projective contrastive < projective relative.</Paragraph> <Paragraph position="6"> For each level of this hierarchy we require a computational model of the semantics of the relation at that level that accommodates both contrastive and relative representations. In §2 we noted that the distinctions between the semantics of the different topological prepositions are often based on functional and pragmatic issues.</Paragraph> <Paragraph position="7"> Currently, however, more psycholinguistic data is required to distinguish the cognitive load associated with the different topological prepositions. We use the model of topological proximity developed in (Kelleher et al., 2006) to model all the relations at this level. Using this model we can define the extent of a region proximal to an object. If the target or one of the distractor objects is the only object within the region of proximity around a given landmark, this is taken to model a contrastive use of a topological relation relative to that landmark. If the landmark's region of proximity contains more than one object from the target and distractor object set, then it is a relative use of a topological relation. We handle the issue of frame of reference ambiguity and model the semantics of projective prepositions using the framework developed in (Kelleher et al., 2006). Here again, the contrastive-relative distinction is dependent on the number of objects within the region of space defined by the preposition (see inter alia (Talmy, 1983; Herskovits, 1986; Vandeloise, 1991; Fillmore, 1997; Garrod et al., 1999) for more discussion on these differences).</Paragraph> </Section> <Section position="2" start_page="1044" end_page="1044" type="sub_section"> <SectionTitle> 3.2 Landmarks and Descriptions </SectionTitle> <Paragraph position="0"> If we want to use a locative expression, we must choose another object in the scene to function as landmark.
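(Before turning to landmark selection, the ordering and the contrastive/relative test of §3.1 can be sketched as follows. The geometric models of (Kelleher et al., 2006) are replaced here by a placeholder predicate in_region, so all names are illustrative assumptions rather than the authors' implementation.)

import math

# The proposed ordering, cheapest relations first (Section 3.1).
RELATION_HIERARCHY = [
    ("topological", "contrastive"),
    ("topological", "relative"),
    ("projective", "contrastive"),
    ("projective", "relative"),
]

def classify_use(landmark, relation_type, target, distractors, in_region):
    # in_region(obj, landmark, relation_type) stands in for the models of
    # (Kelleher et al., 2006): True if obj lies in the region (e.g. the
    # proximal region) that the relation defines around the landmark.
    occupants = [o for o in [target] + list(distractors)
                 if in_region(o, landmark, relation_type)]
    if len(occupants) == 1:
        return "contrastive"   # a single target/distractor object in the region
    if len(occupants) > 1:
        return "relative"      # several such objects compete within the region
    return None                # the relation does not apply for this landmark

# Illustrative use with a crude distance threshold standing in for proximity:
def in_region(obj, landmark, relation_type, threshold=1.5):
    if relation_type == "topological":
        return math.dist(obj["pos"], landmark["pos"]) <= threshold
    return False               # projective regions would need a frame of reference

ball = {"pos": (0.0, 0.0)}
box = {"pos": (1.0, 0.0)}
other = {"pos": (5.0, 5.0)}
print(classify_use(box, "topological", ball, [other], in_region))   # -> 'contrastive'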
An implicit assumption in selecting a landmark is that the hearer can easily identify and locate the object within the context. A landmark can be: the speaker (3a), the hearer (3b), the scene (3c), an object in the scene (3d), or a group of objects in the scene (3e).</Paragraph> <Paragraph position="1"> (3) a. the ball on my right [speaker]
b. the ball to your left [hearer]
c. the ball on the right [scene]
d. the ball to the left of the box [an object in the scene]
e. the ball in the middle [group of objects]
Empirical research is still needed to establish whether there is a preference order between these landmark categories. Intuitively, in most situations, either of the interlocutors is an ideal landmark, because the speaker can naturally assume that the hearer is aware of the speaker's location and their own. Focusing on instances where an object in the scene is used as a landmark, several authors (Talmy, 1983; Landau, 1996; Gapp, 1995) have noted a target-landmark asymmetry: generally, the landmark object is more permanently located, larger, and taken to have greater geometric complexity. These characteristics are indicative of salient objects, and empirical results support this correlation between object salience and landmark selection (Beun and Cremers, 1998). However, the salience of an object is intrinsically linked to the context it is embedded in. For example, in Figure 5 the ball has a relatively high salience, because it is a singleton, despite the fact that it is smaller and geometrically less complex than the other figures. Moreover, in this scene it is the only object that can function as a landmark without recourse to using the scene itself or a grouping of objects. Clearly, deciding which objects in a given context are suitable to function as landmarks is a complex and contextually dependent process. Some of the factors affecting this decision are object salience and the functional relationships between objects (see (Gorniak and Roy, 2004) for further discussion on the use of spatial extrema of the scene and groups of objects in the scene as landmarks). However, one basic constraint on landmark selection is that the landmark should be distinguishable from the target. For example, given the context in Figure 5 and all other factors being equal, using a locative such as the man to the left of the man would be much less helpful than using the man to the right of the ball. Following this observation, we treat an object as a candidate landmark if the following conditions are met: (1) the object is not the target, and (2) it is not in the distractor set either.</Paragraph> <Paragraph position="2"> Furthermore, a target landmark is a member of the candidate landmark set that stands in the considered relation to the target. A distractor landmark is a member of the candidate landmark set that stands in the considered relation to a distractor object. We then define a distinguishing locative description as a locative description where there is a target landmark that can be distinguished from all the members of the set of distractor landmarks under the relation used in the locative.</Paragraph> </Section> <Section position="3" start_page="1044" end_page="1045" type="sub_section"> <SectionTitle> 3.3 Algorithm </SectionTitle> <Paragraph position="0"> We first try to generate a distinguishing description using Algorithm 1.
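(The landmark classification just defined can be sketched as follows. The spatial test stands_in is a placeholder for the models of (Kelleher et al., 2006); the names are illustrative assumptions, not the authors' code.)

def candidate_landmarks(scene, target, distractors):
    # A candidate landmark is any object in the scene that is neither the
    # target nor one of the distractors.
    return [o for o in scene
            if o is not target and all(o is not d for d in distractors)]

def partition_landmarks(candidates, relation, target, distractors, stands_in):
    # Target landmarks stand in the considered relation to the target;
    # distractor landmarks stand in that relation to at least one distractor.
    # stands_in(trajector, relation, landmark) is a placeholder spatial test.
    target_lms = [c for c in candidates if stands_in(target, relation, c)]
    distractor_lms = [c for c in candidates
                      if any(stands_in(d, relation, c) for d in distractors)]
    return target_lms, distractor_lms

A distinguishing locative description then exists for a given relation if some member of target_lms can be distinguished, using the basic incremental algorithm, from all members of distractor_lms.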
If this fails, we divide the context into three components: the target, the distractor objects, and the set of candidate landmarks.</Paragraph> <Paragraph position="1"> We then iterate through the set of candidate landmarks (using a salience ordering if there is more than one, cf. Equation 1) and try to create a distinguishing locative description. The salience ordering of the landmarks is inspired by (Conklin and McDonald, 1982), who found that the higher the salience of an object, the more likely it is to appear in the description of the scene it was embedded in.</Paragraph> <Paragraph position="2"> For each candidate landmark we iterate through the hierarchy of relations, checking for each relation whether the candidate can function as a target landmark under that relation. If so, we create a context model that defines the set of target and distractor landmarks. We create a distinguishing locative description by using the basic incremental algorithm to distinguish the target landmark from the distractor landmarks. If we succeed in generating a distinguishing locative description we return the description and stop.</Paragraph> <Paragraph position="3"> Algorithm 2 (The Locative Incremental Algorithm). DESC = Basic-Incremental-Algorithm(T, D); if DESC ≠ Distinguishing, then create CL, the set of candidate landmarks, and iterate over CL and the relation hierarchy as described above.</Paragraph> <Paragraph position="5"> If we cannot create a distinguishing locative description we face two choices: (1) iterate on to the next relation in the hierarchy, (2) create an embedded locative description distinguishing the landmark. We adopt (1) over (2), preferring the dog to the right of the car over the dog near the car to the right of the house. However, we can generate these longer embedded descriptions if needed, by replacing the call to the basic incremental algorithm for the landmark object with a call to the whole locative expression generation algorithm, using the target landmark as the target object and the set of distractor landmarks as the distractors.</Paragraph> <Paragraph position="6"> An important point in this context is the issue of infinite regression (Dale and Haddock, 1991).</Paragraph> <Paragraph position="7"> A compositional GRE system may in certain contexts generate an infinite description, trying to distinguish the landmark in terms of the target, and the target in terms of the landmark, cf. (4). However, this infinite recursion can only occur if the context is not modified between calls to the algorithm.</Paragraph> <Paragraph position="8"> This issue does not affect Algorithm 2, as each call to the algorithm results in the domain being partitioned into those objects we can and cannot use as landmarks. This not only reduces the number of object pairs that relations must be computed for, but also means that a distinguishing description for a landmark is created on a context that is a strict subset of the context the target description was generated in. This way the algorithm cannot distinguish a landmark using its target.</Paragraph> <Paragraph position="9"> (4) the bowl on the table supporting the bowl on the table supporting the bowl ...</Paragraph> </Section> <Section position="4" start_page="1045" end_page="1045" type="sub_section"> <SectionTitle> 3.4 Complexity </SectionTitle> <Paragraph position="0"> The computational complexity of the incremental algorithm is O(n_d * n_l), with n_d the number of distractors and n_l the number of attributes in the final referring description (Dale and Reiter, 1995). This complexity is independent of the number of attributes to be considered.
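(For concreteness, the control flow of Algorithm 2 described in §3.3 can be rendered in Python roughly as follows, reusing the sketches above; the helpers are passed in as parameters and all names are illustrative assumptions rather than the authors' implementation.)

def locative_incremental_algorithm(target, distractors, scene, relations,
                                   basic, salience,
                                   candidate_landmarks, partition_landmarks,
                                   stands_in):
    # First try a simple feature-based description (Algorithm 1).
    desc, ok = basic(target, distractors)
    if ok:
        return desc
    candidates = candidate_landmarks(scene, target, distractors)
    # Consider the most salient candidate landmarks first (cf. Equation 1).
    for lm in sorted(candidates, key=salience, reverse=True):
        # Iterate the relation hierarchy, cheapest relations first.
        for relation in relations:
            target_lms, distractor_lms = partition_landmarks(
                candidates, relation, target, distractors, stands_in)
            if not any(t is lm for t in target_lms):
                continue   # lm cannot function as a target landmark here
            # Distinguish the landmark itself from the distractor landmarks.
            lm_desc, lm_ok = basic(lm, [d for d in distractor_lms if d is not lm])
            if lm_ok:
                # e.g. on the scene of Figure 6 this yields {ball, proximal, box},
                # realisable as "the ball near the box"
                return {"target": desc, "relation": relation, "landmark": lm_desc}
            # Otherwise move on to the next relation rather than embedding a
            # locative description for the landmark (choice (1) above).
    return None   # no distinguishing locative description could be generated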
Algorithm 2 is bound by the same complexity. For the average case, however, we see the following. For one, with every increase in n_l, we see a strict decrease in n_d: the more attributes we need, the fewer distractors we have, due to the partitioning into distractor and target landmarks. On the other hand, we have the dynamic construction of a context model. This latter factor is not considered in (Dale and Reiter, 1995), meaning we would have to multiply in a constant factor for context construction. Depending on the size of this constant, we may see an advantage for our algorithm: because we only consider a single spatial relation each time we construct a context model, we avoid an exponential number of comparisons, needing only one comparison per ordered pair of objects for the relation under consideration (or half as many if the relation is symmetric).</Paragraph> </Section> </Section> <Section position="6" start_page="1045" end_page="1046" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> We exemplify the approach on the visual scene on the left of Figure 6. This context consists of two red boxes R1 and R2 and two blue balls B1 and B2. Imagine that we want to refer to B1. We begin by calling Algorithm 2. This in turn calls Algorithm 1, returning the property ball. This is not sufficient to create a distinguishing description, as B2 is also a ball. In this context the set of candidate landmarks equals {R1, R2}. We take R1 as the first candidate landmark, and check for topological proximity in the scene as modeled in (Kelleher et al., 2006). The image on the right of Figure 6 illustrates the resulting scene analysis: the green region on the left defines the area deemed to be proximal to R1, and the yellow region on the right defines the area proximal to R2. Clearly, B1 is in the area proximal to R1, making R1 a target landmark. As none of the distractors (i.e., B2) is located in a region that is proximal to a candidate landmark, there are no distractor landmarks.</Paragraph> <Paragraph position="1"> As a result, when the basic incremental algorithm is called to create a distinguishing description for the target landmark R1, it will return box, and this will be deemed to be a distinguishing locative description. The overall algorithm will then return the vector {ball, proximal, box}, which would result in the realiser generating a reference of the form: the ball near the box.</Paragraph> <Paragraph position="2"> The relational hierarchy used by the framework has some commonalities with the relational subsumption hierarchy proposed in (Krahmer and Theune, 2002). However, there are two important differences between them. First, an implication of the subsumption hierarchy proposed in (Krahmer and Theune, 2002) is that the semantics of the relations at lower levels in the hierarchy are subsumed by the semantics of their parent relations. For example, in the portion of the subsumption hierarchy illustrated in (Krahmer and Theune, 2002) the relation next to subsumes the relations left of and right of. By contrast, the relational hierarchy developed here is based solely on the relative cognitive load associated with the semantics of the spatial relations, and makes no claims as to the semantic relationships between the spatial relations. Secondly, (Krahmer and Theune, 2002) do not use their relational hierarchy to guide the construction of domain models.</Paragraph>
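(Returning to the worked example, the following compact, self-contained snippet reproduces the scene analysis of Figure 6; the coordinates and the distance threshold standing in for the proximity model of (Kelleher et al., 2006) are invented for illustration.)

import math

scene = {
    "B1": {"type": "ball", "colour": "blue", "pos": (1.0, 1.0)},
    "B2": {"type": "ball", "colour": "blue", "pos": (4.0, 1.0)},
    "R1": {"type": "box", "colour": "red", "pos": (1.5, 1.0)},
    "R2": {"type": "box", "colour": "red", "pos": (5.5, 1.0)},
}

def proximal(a, b, threshold=1.0):
    # crude stand-in for the proximity regions shown in Figure 6
    return math.dist(scene[a]["pos"], scene[b]["pos"]) <= threshold

target, distractors = "B1", ["B2"]
candidates = [o for o in scene if o != target and o not in distractors]   # R1, R2
target_lms = [c for c in candidates if proximal(target, c)]               # ['R1']
distractor_lms = [c for c in candidates
                  if any(proximal(d, c) for d in distractors)]            # []
if target_lms and not distractor_lms:
    print(["ball", "proximal", scene[target_lms[0]]["type"]])             # ['ball', 'proximal', 'box']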
<Paragraph position="3"> By providing a basic contextual definition of a landmark we are able to partition the context in an appropriate manner. This partitioning has two advantages. One, it reduces the complexity of the context model construction, as the relationships between the target and the distractor objects, or between the distractor objects themselves, do not need to be computed. Two, the context used during the generation of a landmark description is always a subset of the context used for a target (as the target, its distractors, and the other objects in the domain that do not stand in relation to the target or distractors under the relation being considered are excluded). As a result the framework avoids the issue of infinite recursion. Furthermore, the target-landmark relationship is automatically included as a property of the landmark, as its feature-based description need only distinguish it from objects that stand in relation to one of the distractor objects under the same spatial relationship. (For more examples, see the videos available at http://www.dfki.de/cosy/media/.)</Paragraph> <Paragraph position="4"> In future work we will focus on extending the framework to handle some of the issues affecting the incremental algorithm, see (van Deemter, 2001); for example, generating locative descriptions containing negated relations, conjunctions of relations, and relations involving sets of objects (sets of targets and landmarks).</Paragraph> </Section> </Paper>