<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1408">
  <Title>Generating Referential Descriptions in Multimedia Environments</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Basics of Existing Algorithms
</SectionTitle>
    <Paragraph position="0"> Basically, producing a distinguishing description requires selecting a set of descriptors according to criteria which reflect human preferences, and verbalizing these descriptors while meeting natural language constraints. Over the last decade, (Dale, 1989), (Dale, Haddock, 1991), (Reiter, 1990b), (Dale, Reiter, 1995), and others have contributed to this issue (see the systems NAOS (Novak, 1988), EPICURE (Dale, 1988), FN (Reiter, 1990a), and IDAS (Reiter, Dale, 1992)).</Paragraph>
    <Paragraph position="1"> Recently, we have introduced several improvements to these methods (Horacek, 1996, 1997).</Paragraph>
    <Paragraph position="2"> In some more detail, the goal is to produce a referring expression that constitutes a distinguishing description, that is, a description that applies to the entity being referred to, but not to any other object in the current context set. A context set is defined as the set of entities the addressee is currently assumed to be attending to. The contrast set is the context set without the intended referent; an equivalent term is the set of potential distractors (McDonald, 1981).</Paragraph>
    <Paragraph position="3"> This is similar to the set of entities in the focus spaces of the discourse focus stack in Grosz and Sidner's theory of discourse structure (Grosz, Sidner, 1986). The existing algorithms attempt to identify the intended referent by determining a set of descriptors attributed to that referent, that is, a set of attributes. Some algorithms also include descriptors in the description that are attributed to other entities related to the original referent, that is, relations from the point of view of the intended referent. Attributes and relations by themselves are mere predicates which still need to be mapped onto proper lexical items, not necessarily in a simple one-to-one fashion. Some of the associated problems and a proposal to systematically incorporate this mapping are described in (Horacek, 1997).</Paragraph>
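    <Paragraph>The definition of a distinguishing description can be made concrete with a minimal sketch. It assumes, purely for illustration, that every entity is stored with a flat set of descriptors; the entity names, the descriptor names, and the function is_distinguishing are hypothetical and not taken from any of the cited systems.

```python
from typing import Dict, Set


def is_distinguishing(description: Set[str],
                      referent: str,
                      context_set: Set[str],
                      properties: Dict[str, Set[str]]) -> bool:
    """True if the description fits the referent and no potential distractor."""
    # The description must apply to the intended referent itself.
    if not description.issubset(properties[referent]):
        return False
    # The contrast set is the context set without the intended referent.
    contrast_set = context_set - {referent}
    # No potential distractor may satisfy all descriptors of the description.
    return not any(description.issubset(properties[e]) for e in contrast_set)


# Hypothetical toy knowledge base: entity name mapped to its descriptors.
properties = {
    "d1": {"dog", "small", "black"},
    "d2": {"dog", "large", "white"},
    "c1": {"cat", "small", "black"},
}
context = set(properties)
print(is_distinguishing({"dog"}, "d1", context, properties))           # False
print(is_distinguishing({"dog", "small"}, "d1", context, properties))  # True
```

In this toy context, the category alone does not single out d1, because d2 is a dog as well; adding one further descriptor excludes the remaining distractor.</Paragraph>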
    <Paragraph position="4"> Viewed in procedural terms, the algorithms have to  consider three issues: 1. A cognitively motivated pre-selection of descriptors, which is based on psychologically motivated criteria that should reflect human preferences.</Paragraph>
    <Paragraph position="5"> 2. The ultimate selection of descriptors, which can  overrule the cognitively motivated pre-selection of the next descriptor due to linguistic phenomena such as implicature and due to other interference problems with previously chosen descriptors.</Paragraph>
    <Paragraph position="6"> 3. Adequately expressing the chosen set of descriptors in lexical terms.</Paragraph>
    <Paragraph position="7"> The first two issues are rather well understood for attributes, but less so for relations. The third issue is widely neglected: it is simply assumed that the chosen set of descriptors can be expressed adequately. A different line of work, the approach undertaken by Appelt and Kronfeld (Appelt, 1985a, Appelt, 1985b, Kronfeld, 1986, Appelt, Kronfeld, 1987), is very elaborate, but it suffers from limited coverage, missing assessments of the relative benefit of alternatives, and notorious inefficiency.</Paragraph>
    <Paragraph position="8"> For some time, there was a debate about various optimization criteria for comparing the suitability of alternative sets of descriptors, but we feel this issue is now settled in favor of the incremental algorithm interpretation (Reiter, Dale, 1992): preferred descriptors are sequentially included in the referring expression to be produced, provided each descriptor leads to the exclusion of at least one potential distractor. In comparison to other interpretations, it is the weakest one; it still has polynomial complexity, but its complexity is independent of the number of attributes available for building a description.</Paragraph>
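    <Paragraph>The incremental interpretation can be sketched as follows. This is a simplified illustration, not the algorithm as published: it assumes flat descriptors instead of attribute-value pairs, ignores basic-level categories, and uses a hypothetical preference order and toy knowledge base.

```python
from typing import Dict, List, Set


def incremental_description(referent: str,
                            context_set: Set[str],
                            properties: Dict[str, Set[str]],
                            preference_order: List[str]) -> List[str]:
    """Add preferred descriptors one by one while each rules out a distractor."""
    distractors = context_set - {referent}
    chosen: List[str] = []
    for descriptor in preference_order:
        if not distractors:
            break  # the referent is already identified
        if descriptor not in properties[referent]:
            continue  # the descriptor does not apply to the referent
        ruled_out = {d for d in distractors if descriptor not in properties[d]}
        if ruled_out:
            # Keep the descriptor only if it excludes at least one distractor.
            chosen.append(descriptor)
            distractors -= ruled_out
    if distractors:
        raise ValueError("no distinguishing description with these descriptors")
    return chosen


# Hypothetical toy knowledge base and preference order.
properties = {
    "d1": {"dog", "small", "black"},
    "d2": {"dog", "large", "white"},
    "c1": {"cat", "small", "black"},
}
order = ["dog", "cat", "black", "small"]
print(incremental_description("d1", set(properties), properties, order))
# ['dog', 'black']
```

Each descriptor is inspected once and never retracted, which is why the result need not be minimal but the procedure stays cheap.</Paragraph>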
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Concepts in Existing Algorithms
</SectionTitle>
    <Paragraph position="0"> Abstracting from details, the algorithms producing a distinguishing description rely on three basic concepts: * the notion of a focus space, which delimits the scope in which referents and related entities are to be found, * the notion of a descriptor, by which referents can be described and ultimately identified, * the notion of a context set, which helps to distinguish referents from one another on the basis of sets of descriptors.</Paragraph>
    <Paragraph position="1"> In addition, a number of issues are taken into account by these algorithms in one way or another: * incorporating phenomena, such as basic-level categories for objects and inferability of properties, such as non-prototypicality of mentioned properties, * search strategies and complexity considerations, such as interaction between pre-selection and ultimate selection and choices among local referent candidates (selecting among alternative relations as descriptors), * adequate expressibility of the chosen set of descriptors, in terms of naturally composed surface expressions that convey the intended meaning, thereby avoiding, for instance, scope ambiguities or misinterpretations.</Paragraph>
    <Paragraph position="2"> In the following, we attempt to transfer the basic concepts to multimedia environments or, in cases where this is not possible in a meaningful way, we propose a reasonable reinterpretation better suited to images.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Transferring Basic Concepts
</SectionTitle>
    <Paragraph position="0"> As far as the notion of a focus space is concerned, the transfer seems to work in a largely straightforward manner. Given some image of a scene in which some object is to be identified, the focus space is simply the entire picture. There is, however, a principled difference in the way a focus space is established for concrete images and for abstract language contexts: in a pure language environment, the conversational setting determines which referents are considered to be within the focus space, which may occasionally be unclear for a few referents. In a multimedia environment, this depends on some application properties. If a specific picture constitutes the situational context, the area and the shape of that picture are precisely determined, as is the associated focus space. Otherwise, the precise boundaries of the image and the associated focus space are subject to minor uncertainties, as in the abstract linguistic context.</Paragraph>
    <Paragraph position="1"> The next ingredients to consider are the descriptors, which reveal a fundamental difference between texts as an abstract medium and images as a concrete medium. Transferring the notion of a descriptor to images in a direct way would lead to a very unnatural way of communicating identificational information by a picture, especially when several descriptors are to be presented in sequence to achieve the ultimately intended identification goal. Acting this way would mean that all objects to which the first descriptor applies must be highlighted in some way, then all to which the second descriptor also applies, and so forth. Obviously, this procedure would be more confusing than helpful to an observer. Moreover, simply highlighting the intended referent might do the job, but this action alone may not always work satisfactorily if the intended referent is poorly recognizable or even invisible.</Paragraph>
    <Paragraph position="2"> Because of the inadequacy of adapting the notion of a descriptor to images in a direct fashion, we consider an alternative way of describing the intended referent: a region of the picture where the intended referent can be found or, at least, whose identification helps in finding the intended referent. More precisely, a region can either be the area minimally surrounding a specific object, or it can merely be some connected area, specified by its surroundings or by a pointer to a central position in that area. In the first case, the area is precisely defined, but it may be considerably vague in the second case.</Paragraph>
    <Paragraph position="3"> In some sense, regions and descriptors cover the focus space in an orthogonal way: while the former cover a connected area on a picture, the latter typically appear there as a set of islands. As opposed to that, a descriptor covers a connected area in the abstract descriptor-referent space, while a region typically appears there as a set of islands. On some occasions, locality descriptors may do a similar job as regions, but this would probably be less effective in many cases, when multiple locality descriptors are required. As a consequence, the selection of an adequate region differs in some crucial sense from the selection of an adequate descriptor: a candidate descriptor is chosen from a set of distinct alternatives, while determining a candidate region is more a matter of accuracy and precision in terms of appropriately fixing the borderlines of the region which lies around the intended referent or some other entity related to it. Altogether, a region typically comprises the equivalent of several descriptors as far as the contribution to the identification task is concerned: either a category of the object enclosed by the region, accompanied by a set of further descriptors, if necessary, or a suitable combination of locality descriptors.</Paragraph>
    <Paragraph position="4"> Once we have &quot;reinterpreted&quot; the notion of a descriptor in terms of regions as building elements of distinguishing descriptions for images, we have to deal with regions in computing the context set. For this concept, extending the algorithm does not prove to be difficult. Since both descriptors and regions restrict the context set in view of the entire focus space or some previously restricted part of it, although in a complementary way, the computation of the context set modified by a newly introduced region works analogously to the pure language environment.</Paragraph>
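    <Paragraph>The analogy between the two kinds of restriction can be illustrated by a small sketch. It assumes, for illustration only, that regions and object positions are axis-aligned rectangles in image coordinates; the class Box, the helper functions, and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image coordinates (an assumed region shape)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        # True if the other rectangle lies entirely inside this one.
        return (other.left >= self.left and other.top >= self.top
                and self.right >= other.right and self.bottom >= other.bottom)


def restrict_by_descriptor(context_set: Set[str], descriptor: str,
                           properties: Dict[str, Set[str]]) -> Set[str]:
    """Keep the entities to which the descriptor applies."""
    return {e for e in context_set if descriptor in properties[e]}


def restrict_by_region(context_set: Set[str], region: Box,
                       positions: Dict[str, Box]) -> Set[str]:
    """Keep the entities that are located inside the region."""
    return {e for e in context_set if region.contains(positions[e])}


# Hypothetical toy scene on a 100 x 100 image.
properties = {"d1": {"dog"}, "d2": {"dog"}, "c1": {"cat"}}
positions = {
    "d1": Box(10, 10, 30, 30),
    "d2": Box(60, 10, 90, 40),
    "c1": Box(15, 50, 35, 70),
}
left_half = Box(0, 0, 50, 100)

in_region = restrict_by_region(set(properties), left_half, positions)
print(sorted(in_region))                                              # ['c1', 'd1']
print(sorted(restrict_by_descriptor(in_region, "dog", properties)))   # ['d1']
```

Both functions map a context set to a subset of it, so they can be chained in either order.</Paragraph>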
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Changes in the Algorithms
</SectionTitle>
    <Paragraph position="0"> When extending the existing algorithms to multimedia environments, we discuss choices between regions and descriptors as well as their coordination in the existing processing schema. We first restrict our considerations to single images: allowing the incorporation of multiple images might easily complicate matters, so that temporal presentation aspects additionally come into play, requiring the design of animations. Nevertheless, accomplishing the communicative goal in an environment consisting of a single image is not always as trivial as annotating or highlighting the intended referent in some way. That entity may be invisible or poorly recognizable, so that pointing at it is simply impossible or unlikely to convey the message properly.</Paragraph>
    <Paragraph position="1"> As far as the issues involved in composing a description are concerned, some crucial differences between the media considered exist. Basic-level categories are exclusively relevant for language, and inferability is, apart from language, relevant for abstract images only.</Paragraph>
    <Paragraph position="2"> The expressibility issue, when being reinterpreted for regions of an image, yields problems, too, but they are entirely different from the expressibility problems in language generation: for images, visibility and various aspects of recognizability, such as sufficient degrees of salience in terms of shape, granularity, and brightness come into play. Judging the adequacy of these aspects is a typical issue in presenting information by a picture and, hence, can be considered the visual counterpart of expressibility on the language side.</Paragraph>
    <Paragraph position="3"> When choosing between a descriptor and a region as two candidates to focus on some portion of the environment, some principled preferences seem to be plausible  when brevity of the resulting expression is envisioned: * An 'exact' region, taken by a specific object, is probably better conveyed by the picture component, especially if several similar objects are in the focus of attention.</Paragraph>
    <Paragraph position="4"> * However, if the object is either very small (almost  invisible) or extremely large (almost covering the entire focus space), choosing language as the medium seems to be more appropriate.</Paragraph>
    <Paragraph position="5"> * A 'generic' region, that is, a region which nearly perfectly fits a locality descriptor (see (Gapp, 1995) for an operationalization of degrees of applicability), is better described by language, especially when some other region can be used more beneficially as a component of the referential description.</Paragraph>
    <Paragraph position="6"> * For ordinary regions, however, images are generally the preferred medium.</Paragraph>
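    <Paragraph>The preferences listed above can be written down as a small decision rule. The sketch below is only one possible reading: the three region kinds follow the text, whereas the numeric thresholds for &quot;very small&quot; and &quot;extremely large&quot; objects, like the function name preferred_medium, are hypothetical placeholders that an application would have to calibrate.

```python
from enum import Enum


class RegionKind(Enum):
    EXACT = "exact"        # region taken by a specific object
    GENERIC = "generic"    # region that nearly fits a locality descriptor
    ORDINARY = "ordinary"  # any other region


def preferred_medium(kind: RegionKind, area_fraction: float,
                     min_fraction: float = 0.001,
                     max_fraction: float = 0.9) -> str:
    """Return 'image' or 'language' for a candidate region.

    area_fraction is the share of the focus space covered by the region;
    min_fraction and max_fraction are assumed thresholds, not values from the text.
    """
    if kind is RegionKind.EXACT:
        # Almost invisible or almost covering the picture: prefer language.
        if min_fraction > area_fraction or area_fraction > max_fraction:
            return "language"
        return "image"
    if kind is RegionKind.GENERIC:
        return "language"
    return "image"


print(preferred_medium(RegionKind.EXACT, 0.05))     # image
print(preferred_medium(RegionKind.EXACT, 0.0005))   # language
print(preferred_medium(RegionKind.GENERIC, 0.25))   # language
```
    </Paragraph>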
    <Paragraph position="7"> In addition to the choice between a descriptor and a region as the next ingredient for narrowing the focus space, adequate coordination of the participating media is a crucial concern. In our environment, this task is largely simplified because of the restriction to a single image. However, at least some sort of annotation should be given to support the coherence of the overall description. In more complex cases, several regions of an image need to be coordinated as well, which might even require their temporal synchronization.</Paragraph>
    <Paragraph position="8"> In addition to dealing with these local preference and choice criteria, we need to incorporate the selection among descriptors and regions into a process where several selections are made in a coordinated way until the intended referent is identified. This process should largely follow the schema based on the incremental algorithm interpretation of minimality of the number of descriptors. By adopting this schema, we maintain the psychologically motivated search strategy and the reasonable computational complexity associated with that schema.</Paragraph>
    <Paragraph position="9"> Since descriptors and regions are fundamentally different, a multimedia version of the algorithm requires two choice components to be designed, one for choosing the best descriptor, and the other for choosing the best region. In addition, a referee component could be designed to make the final decision. Such choices could be repeated until a region has outscored its competing descriptor or until the communicative goal is accomplished. This way, a region can describe the intended referent directly or indirectly, that is, in terms of other entities. Because regions may make an entirely different contribution to the restriction of the focus space, a region is usually a proper alternative to a descriptor, rather than a mere substitute. In view of the environment that is given by a common focus space, that is, by a single image, a simpler strategy may even turn out to be better: a region considered most suitable is selected by the responsible component, and, if necessary, further descriptors are selected until the communicative goal is achieved. Apart from these descriptors, the language part of the description should also contain a reference to the pictorial part, such as an object category or a deictic reference to that region. Even if the region alone already accomplishes the communicative goal, such a reference phrase should be built, in order to clarify the purpose of highlighting the region. The rationale behind this strategy is that in a single image one region is usually sufficient to restrict the context set as much as possible by the pictorial component.</Paragraph>
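    <Paragraph>The simpler strategy for a single image can be sketched as follows. The sketch rests on several assumptions that are not part of the text: a region is represented by the set of entities it covers, the most suitable region is taken to be the one that contains the referent and the fewest other entities, and the wording of the reference phrase is a placeholder.

```python
from typing import Dict, List, Set, Tuple


def region_first_description(referent: str,
                             context_set: Set[str],
                             properties: Dict[str, Set[str]],
                             preference_order: List[str],
                             regions: Dict[str, Set[str]]) -> Tuple[str, List[str]]:
    """Pick one region, then add descriptors until the referent is identified."""
    usable = {name: members for name, members in regions.items()
              if referent in members}
    if not usable:
        raise ValueError("no candidate region contains the referent")
    # Assumed scoring: fewest entities of the context set inside the region.
    best = min(sorted(usable),
               key=lambda name: len(usable[name].intersection(context_set)))
    distractors = usable[best].intersection(context_set) - {referent}
    chosen: List[str] = []
    for descriptor in preference_order:
        if not distractors:
            break
        if descriptor not in properties[referent]:
            continue
        ruled_out = {d for d in distractors if descriptor not in properties[d]}
        if ruled_out:
            chosen.append(descriptor)
            distractors -= ruled_out
    if distractors:
        raise ValueError("communicative goal not reached with these descriptors")
    return best, chosen


def reference_phrase(region: str, descriptors: List[str]) -> str:
    """Build a deictic phrase that ties the language part to the marked region."""
    phrase = "the entity in the marked region '" + region + "'"
    if descriptors:
        phrase += " described as " + ", ".join(descriptors)
    return phrase


# Hypothetical toy data.
properties = {"d1": {"dog", "black"}, "d2": {"dog", "large"}, "c1": {"cat", "black"}}
regions = {"left": {"d1", "c1"}, "whole": {"d1", "d2", "c1"}}
region, descriptors = region_first_description(
    "d1", set(properties), properties, ["dog", "black", "large"], regions)
print(reference_phrase(region, descriptors))
```

Note that the reference phrase is built even when the list of descriptors is empty, in line with the requirement that the purpose of the highlighting be made explicit.</Paragraph>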
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Extending the Algorithm's Coverage
</SectionTitle>
    <Paragraph position="0"> So far, we have only considered environments consisting of a single image and language descriptions. If we move on to more complex environments in which several images may contribute to a description, we are definitely leaving the scope of the existing algorithms, since we are not just facing a single focus space, but a set or a chain of focus spaces (when considering only one image at a time). The connection among these focus spaces may vary significantly according to the way the corresponding images interact. The following constellations seem to be of interest: 1. An image and some sort of a focused part. There could be an image providing a global view of a scene, combined with images presenting views on portions of that scene that are invisible on the overview. The subsidiary images may present referents behind an obstacle, or inside some other object, or objects only partially visible in the overview.</Paragraph>
    <Paragraph position="1"> Moreover, we could be confronted with an image that shows some portion of a larger image (such as a portion of a large map), and the intended referent is located in another part of the whole image. In order to navigate between disjoint portions of a picture, two strategies seem to be promising: either presenting a sequence of pictures that gives some impression of scrolling, or presenting an overview first before moving on to the part that contains the intended referent. In both cases, these images contribute to bridging differences in locality.</Paragraph>
    <Paragraph position="2"> 2. An abstracted view and some concrete images. There could be an abstract image providing an overview of some sort (such as the map of a city) and several concrete images that refer to one or another part of the abstract image (such as a group of buildings or a square in the city). The abstract image may then be used to direct the addressee's attention to a particular area of the focus space, while the concrete images support the proper identification task.</Paragraph>
    <Paragraph position="3"> 3. Images in largely varying degrees of granularity. There could be an image providing an overview of a large scene in which individual objects appear at too small a size to be recognizable. In addition to the strategy applied to the abstract overview and the concrete images, a smoother transition seems to be a promising alternative. Depending on the degree of condensation between the overview image and images that present objects in an easily recognizable format, using a few images of intermediate size might be a suitable means to support orientation.</Paragraph>
    <Paragraph position="4"> In order to make these concepts more concrete, a lot of testing in connection with concrete applications is required. Moreover, it seems to be much harder to formulate a reasonably concrete schematic procedure and suitable criteria for a multimedia version of the algorithms discussed, because images are associated with a higher degree of freedom than language. However, if we compare the discussion in this section with the original environment underlying the generation of referential descriptions, it becomes apparent that we have left the scope of what is commonly considered as the task of generating referential descriptions in a number of places but such an effect may easily happen in extending a method to multimedia environments.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Our Future Application Area
</SectionTitle>
    <Paragraph position="0"> In the near future, we intend to apply our approach to interface with a graphical component by which we can visualize machine-generated mathematical proofs and related data structures. The task of our present interest, the identification of a particular object in the trace of a proof, is one of the issues in presenting mathematical proofs in multimedia environments. On some occasions, even groups of objects and their relations to one another may be subject to identification, which constitutes another kind of extension to the algorithms for generating referential descriptions.</Paragraph>
    <Paragraph position="1"> Proof presentation is realized within the mathematical assistant Ωmega (Benzmüller et al., 1997), an interactive environment for proof development. Within Ωmega, automated prover components such as Otter (McCune, 1994) can be called on problems considered as manageable by a machine. The result is a proof tree which needs to be fundamentally reorganized prior to presentation, because the refutation graph underlying the original proof is much too detailed to be comprehensible to humans, even to experts in mathematics. Therefore, an appropriate level of granularity is selected by condensing groups of inference steps to yield proofs built from &quot;macro-steps&quot;, which is motivated by rules of the natural deduction calculus (Gentzen, 1935). This is called the assertional level and dealt with in detail in (Huang, 1996). A typical example of an assertional-level step is the application of a lemma. Once a proof is transformed to the assertional level, it can be verbalized suitably by the Proverb System (Huang, Fiedler, 1996). Another possibility to present a proof is to visualize the proof tree, which is the kind of presentation we address in this paper.</Paragraph>
    <Paragraph position="2"> Even at the assertional level, traces of machine-found proofs may grow very large even for problems of medium complexity. Therefore, a number of measures to support identification are required, for instance, moving from an overview of the proof tree to a focused part of it.</Paragraph>
    <Paragraph position="3"> Moreover, moving from abstract to concrete environments may apply here to cases where the object to be identified lies in some detailed information about axioms or theorems, to which some node in the trace gives access.</Paragraph>
    <Paragraph position="4"> The following Figures show the trace of a moderately complex proof. The proof demonstrates the truth of the following theorem: the transitive closure of the union of two sets is identical to the transitive closure of the union of the transitive closures of the two sets, in terms of formulas: (ρ ∪ σ)* = (ρ* ∪ σ*)*. Figure 1 shows an overview of the whole proof, and Figures 2 and 3 show selected portions of it at a larger size. While individual nodes are still identifiable in the proof tree overview in Figure 1, the recognizability of nodes may easily be lost in larger proof trees, which motivates focusing on tree portions.</Paragraph>
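    <Paragraph>The statement proved in the example, namely that the transitive closure of the union of two sets equals the transitive closure of the union of their transitive closures, can be checked on a small instance. The sketch below assumes that the two sets are finite binary relations given as sets of pairs; the relations rho and sigma are arbitrary toy data, and checking one instance is of course no substitute for the proof.

```python
from itertools import product


def transitive_closure(relation):
    """Smallest transitive relation containing `relation` (a set of pairs)."""
    closure = set(relation)
    while True:
        # Compose the current relation with itself to find missing pairs.
        new_pairs = {(a, d)
                     for (a, b), (c, d) in product(closure, closure)
                     if b == c}
        if new_pairs.issubset(closure):
            return closure
        closure |= new_pairs


# Hypothetical toy relations.
rho = {(1, 2), (2, 3)}
sigma = {(3, 4)}

left = transitive_closure(rho | sigma)
right = transitive_closure(transitive_closure(rho) | transitive_closure(sigma))
print(left == right)  # True
```
    </Paragraph>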
    <Paragraph position="5">  In these proof trees, a root node represents the lemma to be proved (a root node of a subtree represents some supporting lemma), and the leaf nodes represent assumptions, axioms, or coreferences to specific subtrees in the proof. Moreover, proof derivations join nodes and their successors in upward direction. The geometric figures in the proof tree represent types of nodes: circles stand for ordinary nodes, triangles for assumptions or axioms, and squares for coreferences. The annotations in the Figures are made here by hand, to illustrate focused steps in the proof. In the implementation, a formula associated with an individual node can be viewed by clicking on that node so that the formula appears in a separate window (though in a less convenient predicate-like format rather than in the more common mathematical notation).</Paragraph>
    <Paragraph position="6"> In addition, the formulas are marked by numbers that also appear in the corresponding node of the proof tree.</Paragraph>
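    <Paragraph>The structure of such a proof tree can be sketched as a small data type. The three node kinds and the numbering of nodes follow the description above; the class and field names, the search function, and the toy tree are hypothetical and do not reflect the actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class NodeKind(Enum):
    ORDINARY = "circle"
    ASSUMPTION_OR_AXIOM = "triangle"
    COREFERENCE = "square"


@dataclass
class ProofNode:
    number: int                      # number shown in the node and at its formula
    formula: str
    kind: NodeKind = NodeKind.ORDINARY
    premises: List["ProofNode"] = field(default_factory=list)


def find_nodes(root: ProofNode,
               predicate: Callable[[ProofNode], bool]) -> List[int]:
    """Return the numbers of all nodes in the tree that satisfy the predicate."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if predicate(node):
            found.append(node.number)
        stack.extend(node.premises)
    return sorted(found)


# Hypothetical toy tree: the root is the lemma to be proved.
tree = ProofNode(1, "goal", premises=[
    ProofNode(2, "lemma L"),
    ProofNode(3, "step", premises=[
        ProofNode(4, "lemma L", kind=NodeKind.COREFERENCE),
        ProofNode(5, "axiom A", kind=NodeKind.ASSUMPTION_OR_AXIOM),
    ]),
])
print(find_nodes(tree, lambda n: n.formula == "lemma L"))  # [2, 4]
```

A search of this kind yields the nodes whose regions would have to be marked when a user asks where a given lemma is used.</Paragraph>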
    <Paragraph position="7"> As an add-on to this graphical presentation, we intend to incorporate a variety of interactive explanation facilities. One part of these facilities comprises various sorts of identification issues: * one specific object in the proof tree, * some formula or subformula associated with a specific node in the proof tree; this constellation is an instance of a concrete entity associated with some part of an abstract overview - see the second item in the extension categories introduced in Section 6, * a formula associated with a specific node in the proof tree or some part of it, that is not shown in the visible portion of the tree; this constellation is an instance of a referent that lies outside the scope of the focus space - see the first item in the extension categories, * some part of a formula associated with a specific node in the proof tree, which appears at too small a size to be recognizable; this is an instance of a referent which needs to be zoomed in on to be recognizable - see the third item in the extension categories.</Paragraph>
    <Paragraph position="8"> Moreover, multiple objects may be subject to any of the above identification issues. In the following, we illustrate these identification categories by a few examples including suitable graphical displays and associated verbal descriptions.</Paragraph>
    <Paragraph position="9"> Let us assume that the whole proof tree (as an overview) is in the current focus of attention, and the user asks: &quot;Where is the lemma '((x ⊆ y) ∧ (y transitive)) → (x* ⊆ y)' used in the proof?&quot; As an answer, the regions where the three instantiations of this lemma appear in the proof are marked (see the arrows labeled by 1 in Figure 1), and their instantiations are given as formulas in the associated verbal description. Moreover, the regions of one or several of these instantiations could be illustrated by a focused picture, such as in Figure 2. A suitable accompanying verbal description would be: &quot;That lemma is applied three times (see the annotations in the overview labeled by 1); one of these instantiations appears in the part proving (ρ ∪ σ)* ⊇ (ρ* ∪ σ*)*, where x is instantiated to c1 and y is instantiated to (c1 ∪ c2)* (see the annotation in the enhanced tree portion, corresponding to the tree portions marked by 1 in the overview).&quot; If this description is followed by a subsequent question &quot;How is the subset definition applied here?&quot;, the pictorial presentation needs to move to an adjacent portion of the proof tree, because the referent to be identified lies outside the subtree shown in Figure 2. The overview is then shown again, and the annotation in Figure 3 provides additional information, in terms of the instantiations of this definition. A suitable verbalization would be &quot;That lemma is proved in an adjacent part of the tree, where c1 ⊆ (c1 ∪ c2)* is proved, as indicated in the overview (see the annotation labeled by 2 and the tree portion marked by 2 in the overview). The subset definition is instantiated to c1 and (c1 ∪ c2)*, respectively.&quot; We believe that these modest sketches already demonstrate the usefulness of multimedia presentations in the task envisioned.
Finally, these examples illustrate the following observations: * Choices between media become even richer through the possibility to incorporate annotations, which offers itself in the domain of mathematics.</Paragraph>
    <Paragraph position="10"> * The identification task is tightly interwoven with providing additional, descriptive information, which we feel to be typical in realistic domains.</Paragraph>
    <Paragraph position="11"> * While many of the details in proof presentation are highly domain-specific, the general lines in identifying objects in multimedia environments are valid across a number of domains. However, a characteristic feature that limits the generality and at the same time greatly helps in referring to portions of the proof tree is its strictly hierarchical organization, which may be present in some, but not in many other domains.</Paragraph>
    <Paragraph position="12"> In any case, future experience will tell us more about identification techniques in multimedia environments, especially concerning the contribution of each presentation mode and their coordination, as well as about degrees of domain-dependence and independence of the techniques involved.</Paragraph>
  </Section>
</Paper>