<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1423">
  <Title>Coordination and context-dependence in the generation of embodied conversation</Title>
  <Section position="3" start_page="0" end_page="172" type="metho">
    <SectionTitle>
2 Exploring the relationship between speech and gesture
</SectionTitle>
    <Paragraph position="0"> -speech and gesture To generate embodied communicative action requires an architecture for embodied conversation; ours is provided by the agent REA (&amp;quot;Real Estate Agent&amp;quot;), a computer-generated humanoid that has an articulated graphical body, can sense the user passively through cameras and audio input, and supports communicative actions realized in speech with intonation, facial display, and animated gesture. REA currently offers the reasoning and display capabilities to act as a real estate agent showing ..... ~users'the--features&amp;quot;o~ vm i-o-wsmodels&amp;quot;of howsesthat ~ appear on-screen behind her. We use existing features of kEA here as a resem'ch platform for imple- null menting models of the relationship between speech and spontaneous hand gestures during conversation.</Paragraph>
    <Paragraph position="1"> For more details about the functionality of REA see (Cassell, 2000a).</Paragraph>
    <Paragraph position="2"> Evidence from many sources suggests that this re.lationship is aclose one..About three,quarters of al! clauses in narrative discourse are accompanied by gestures of one kind or another (McNeill, 1992), and within those clauses, the most effortful part of gestures tends to co-occur with or just before the phonologically most prominent syllable of the accompanying speech (Kendon, 1974).</Paragraph>
    <Paragraph position="3"> Of course, communication is still possible without gesture. But it has been shown that when speech is ambiguous (Thompson and Massaro, 1986) or in a speech situation with some noise (Rogers, 1978), listeners do rely on gestural cues (and, the higher the noise-to-signal ratio, the more facilitation by gesture). Similarly, Cassell et al. (1999) established that listeners rely on information conveyed only in gesture as they try to comprehend a story.</Paragraph>
    <Paragraph position="4"> Most interesting in terms of building interactive dialogue systems is the semantic and pragmatic relationship between gesture and speech. The two channels do not always manifest the same information, but what they convey is virtually always compatible. Semantically, speech and gesture give a consistent view of an overall situation. For example, gesture may depict the way in which an action was carried out when this aspect of meaning is not depicted in speech. Pragmatically, speech and gesture mark information about this meaning as advancing the purposes of the conversation in a consistent way. Indeed, gesture often emphasizes information that is also focused pragmatically by mechanisms like prosody in speech (Cassell, 2000b). The semantic and pragmatic compatibility seen in the gesture-speech relationship recalls the interaction of words and graphics in multimodal presentations (Feiner and McKeown, 1991; Green et al., 1998; Wahlster et al., 1991 ). In fact, some suggest (McNeill, 1992), that gesture and speech arise together from an underlying representation that has both visual and linguistic aspects, and so the relationship between gesture and speech is essential to the production of meaning and to its comprehension.</Paragraph>
    <Paragraph position="5"> This theoretical perspective on speech and gesture involves two key claims with computational import: that gesture and speech reflectacommon conceptual source; and that the content and form of a gesture is tuned to the communicative context and the  actor's communicative intentions. We believe that these characteristics of the use of gesture are universal, and see the key contribution of this work as providing a general framework for building dialogue systems in accord with them. However, a concrete !mplementationrequires &amp;quot; more thanJustgeneralities behind its operation; we also need an understanding of the precise ways gesture and speech are used together in a particular task and setting.</Paragraph>
    <Paragraph position="6"> To this end, we collected a sample of real-estate descriptions in line with what REA might be asked to provide. To elicit each description, we asked one subject to study a video and floor plan of a particular house, and then to describe the house to a second subject (who did not know the house and had not seen the video). During the conversation, the video and floor plan were not available to either subject; the listener was free to interrupt and ask questions.</Paragraph>
    <Paragraph position="7"> The collected conversations were transcribed, yielding 328 utterances and 134 referential gestures, and coded to describe the general communicative goals of the speaker and the kinds of semantic features realized in speech and gesture.</Paragraph>
    <Paragraph position="8"> Analysis of the data revealed that for roughly 50% of the gesture-accompanied utterances, gestural content was redundant with speech; for the other 50% gesture contributed content that was different, but complementary, to that contributed by speech.</Paragraph>
    <Paragraph position="9"> In addition, the relationship between content of gesture, content of speech and general communicative functions in house descriptions could be captured by a small number or rules; these rules are informed by and accord with our two key claims about speech and gesture. For example, one rule describes dialogue contributions whose general function was what we call presentation, to advance the description of the house by introducing a single new object.. These contributions tended to be made up of a sentence that asserted the existence of an object of some type, accompanied by a non-redundant ges* ture that elaborated theshape or location of the object. Our approach casts this extended description of a new entity, mediated by two compatible modalities, as the speaker's expression of one overall function of presentation.</Paragraph>
    <Paragraph position="10">  ( I ) is a representative example.</Paragraph>
    <Paragraph position="11"> (1) It has \[a nice garden\]. (right hand, held flat,  traces a circle, indicating location of the garden sunounding the house) Six rules account for 60% of the gestures in the  96% (precision). These patterns provide a concrete specification for the main communicative strategies and communicative resources required for REA. m full discussion of the experimental methods and analysis, and the resulting rules, can be found in (Yan, 2000).</Paragraph>
  </Section>
  <Section position="4" start_page="172" end_page="173" type="metho">
    <SectionTitle>
3 Framing the generation problem
</SectionTitle>
    <Paragraph position="0"> In REA, requests for the generation of speech and gesture are formulated within the dialogue management module. REA'S utterances reflect a coordination of multiple kinds of processing in the dialogue manager- the system recognizes that it has the floor, derives the appropriate communicative context for a response and an appropriate set of communicative goals, triggers the generation process, and realizes the resulting speech and gesture. The dialogue manager is only one component in a multithreaded architecture that carries out hardwired reactions to input as well as deliberative processing. The diversity is required in order to exhibit appropriate interactional and propositional conversational behaviors at a range of time scales, from tracking the user's movements with gaze and providing nods and other feedback as the user speaks, to participating in routine exchanges and generating principled responses to user's queries. See (Cassell, 2000a) for description and motivation of the architecture, as well as the conversational functions and behaviors it supports.</Paragraph>
    <Paragraph position="1"> REA'S design and capabilities reflect our research focus on allying conversational content with conversation management, and allying nonverbal modalities with speech: how can anembodiedagent use'all its communicative modalities to contribute new content when needed (propositional function), to signal the state of the dialogue, and to regulate the over-all process of conversation (interactional function)? Within this focus, REA's talk is firmly delimited.</Paragraph>
    <Paragraph position="2"> REA'S utterances take a question-answer format, in which the user asks about (and REA describes) a single house .at.a. time. REA'S .sentences ,ate short; generally, they contribute just a few new semantic features about particular rooms or features of the house (in speech and gesture), and flesh this contribution out with a handful of meaningful elements (in speech and gesture) that ground the contribution in shared context of the conversation.</Paragraph>
    <Paragraph position="3"> Despite the apparent simplicity, the dialogue manager must contribute a wealth of information about the domain and the conversation to represent the communicative context. This detail is needed for REA tO achieve a theoretically-motivated realization of the common patterns of speech and gesture we observed in human conversation. For example, a variety of changing features determine whether marked forms in speech and gesture are appropriate in the context. REA'S dialogue manager tracks the changing status of such features as: e Attentionalprominence, represented (as usual in natural language generation) by setting up a context set for each entity (Dale, 1992). Our model of prominence is a simple local one similar to (Strube, 1998).</Paragraph>
    <Paragraph position="4"> o Cognitive status, including whether an entity is hearer-old or hearer-new (Prince, 1992), and whether an entity is in-focus or not (Gundel et al., 1993). We can assume that houses and their rooms are hearer-new until REA describes them; and that just those entities mentioned in the prior sentence are in-focus.</Paragraph>
    <Paragraph position="5"> Information structure, including the open propositions or, following (Steedman, 1991 ), themes, which describe the salient questions currently at issue in the discourse (Prince, 1986). In REA'S dialogue, open questions are always general questions about some entity raised by a recent turn; although in principle such an open question ought to be formalized as theme(XP.Pe), REA can use the simpler theme(e).</Paragraph>
    <Paragraph position="6"> In fact, both speech and gesture depend on the same * &amp;quot; kinds of'feamresi;-andaccessthem in the same way; &amp;quot; this specification of the dialogue state crosscuts distinctions of communicative modality.</Paragraph>
    <Paragraph position="7">  Another component of context is provided by a domain knowledge base, consisting of facts explicitly labeled with the kind of information they represent. This defines the common ground in the conversation in terms of sources of information that lation of goals and tightly fits the context specified by the dialogue manager.</Paragraph>
  </Section>
  <Section position="5" start_page="173" end_page="176" type="metho">
    <SectionTitle>
4 Generation and linguistic representation
</SectionTitle>
    <Paragraph position="0"> speaker and hearer share. Modeling the discourse as We model REA'S communicative actions as coma shared source of information means that new ~e &amp;quot;'-':''~'lmsed~degf:a~c~rHeetidegn'degf'atdegmie'etementsiqnclndiiag both lexical items in speech and clusters of seman- mantic features REA imparts are added to the common ground as the dialogue proceeds. Following results from (Kelly et al., 1999) which show that information from both speech and gesture is used to provide context for ongoing talk, our common ground may be updated by both speech and gesture.</Paragraph>
    <Paragraph position="1"> The structured domain knowledge also provides a resource for specifying communicative strategies.</Paragraph>
    <Paragraph position="2"> Recall that REA'S communicative strategies are formulated in terms of functions which are common in naturally-occurring dialogues (such as &amp;quot;presentation&amp;quot;) and which lead to distinctive bundles of content in gesture and speech. The knowledge base's kinds of information provide a mechanism for specifying and reasoning about such functions. The knowledge base is structured to describe the relationship between the system's private information and the questions of interest that that information can be used to settle. Once the user's words have been interpreted, a layer of production rules constructs obligations for response (Traum and Allen, 1994); then, a second layer plans to meet these obligations by deciding to present a specified kind of information about a specified object. This determines some concrete communicative goals--facts of this kind that a contribution to dialogue could make. Both speech and gesture can access the whole structured database in realizing these concrete communicative goals. For example, a variety of facts that bear on where a residence is--which city, which neighborhood or, if appropriate, where in a building--all provide the same kind of information, and would therefore fit the obligation to specify the location of a residence. Or, to implement the rule for presentation described in connection with ( 1 ), we can associate an obligation of presentation with a cluster of facts describing an object's type, its loca-tion in a house, and its size, shape or quality.</Paragraph>
    <Paragraph position="3"> The communicative context and concrete communicative goals provide a common source for generating speech and gesture in REA. The utterance generation problem ,in REa,.then, is to construct acomplex communicative action, made up of speech and coverbal gesture, that achieves a given consteltic features expressed as gestures; since we assume that any such item usually conveys a specific piece of content, we refer to these elements generally as lexicalized descriptors. The generation task in REA thus involves selecting a number of such lexicalized descriptors and organizing them into a grammatical whole that manifests the right semantic and pragmatic coordination between speech and gesture.</Paragraph>
    <Paragraph position="4"> The information conveyed must be enough that the hearer can identify the entity in each domain reference from among its context set. Moreover, the descriptors must provide a source which allows the hearer to recover any needed new domain proposition, either explicitly or by inference.</Paragraph>
    <Paragraph position="5"> We use the SPUD generator (&amp;quot;Sentence Planning Using Description&amp;quot;) introduced in (Stone and Doran, 1997) to carry out this task for REA. SPUD builds the utterance element-by-element; at each stage of construction, SPUD'S representation of the current, incomplete utterance specifies its syntax, semantics, interpretation and fit to context. This representation both allows SPUD to determine which lexicalized descriptors are available at each stage to extend the utterance, and to assess the progress towards its communicative goals which each extension would bring about. At each stage, then, SPUD selects the available option that offers the best immediate advance toward completing the utterance successfully. (We have developed a suite of guidelines for the design of syntactic structures, semantic and pragmatic representations, and the interface between them so that SPUD'S greedy search, which is necessary for real-time performance, succeeds in finding concise and effective Utterances described by the grammar (Stone et al., 2000).) As part of the development of REA, we have constructed a new inventory of lexicalized descriptors.</Paragraph>
    <Paragraph position="6"> REA'S descriptors consist of entries that contribute to coverbal gestures, as well as revised entries for spoken words that allow for their coordination with gesture under appropriate discourse conditions. The :-organization of'these entries assures'that--rasing the same mechanism as with speech--REA'S gestures draw on the single available conceptual representa- null tion and that both REA'S gesture and the relationship between gesture and speech-vary as a function of pragmatic context in the same way as natural gestures and speech do. More abstractly, these entries enable SPUD to realize the concrete goals tied to common communicative functions with same distribution of speech and gestiire bbse~ed:iffn/lttl'ralconversations. null To explain how these entries work, we need to  consider SPUD's representation of lexicalized descriptors in more detail. Each entry is specified in three parts. The first part--the syntax of the elemenv--sets out what words or other actions the element contributes to its utterance. The syntax is a hierarchical structure, formalized using Feature-Based Lexicalized Tree Adjoining Grammar (LTAG) (Joshi et al., 1975; Schabes, 1990).</Paragraph>
    <Paragraph position="7">  Syntactic structures are also associated with referential indices that specify the entities in the discourse that the entry refers to. For the entry to apply at a particular stage, its syntactic structure must combine by LTAG operations with the syntax of the ongoing utterance.</Paragraph>
    <Paragraph position="8"> REA'S syntactic entries combine typical phrase-structure analyses of linguistic constructions with annotations that describe the occurrence of gestures in coordination with linguistic phrases. Our device for this is a construction SYNC which pairs a description of a gesture G with the syntactic structure of a spoken constituent c: SYNC (2) G C The temporal interpretation of (2) mirrors the rules for surface synchrony between speech and gesture presented in (Cassell et al., 1994). That is, the preparatory phase of gesture G is set to begin before the time constituent c begins; the stroke of gesture G (the most effortful part) co-occurs with the most phonologically prominent syllable in c; and, except in cases of coarticulation between successive gestures, by the time the constituent c is complete, the speaker must be relaxing and bringing the hands out of gesture space (while the generator specifies synchrony as described, in practice the synchronization of synthesized speech with graphics is an ongoing challenge in the REA~projeet).-Jn. sum; 'the production of gesture G is synchronized with the production of speech c. (Our representation of synchrony  in a single tree conveniently allows modules dowrLstream to describe embodied communicative actions as marked-up text.) The syntactic description of the gesture itself indicates the choices the generator must make to produce a gesture, but does not analyze a ,gesture liter~i|y--~is '~/ hier~chy :i~f ~+p~a~e ~ m~=~fi~s~-'~f~?:&amp;quot;= .... stead, these choices specify independent semantic features which we can associate with aspects of a gesture (such as handshape and trajectory through space). Our current grammar does not undertake the final step of associating semantic features to choice of particular handshapes and movements, or gesture morphology; we reserve this problem for later in the research program. We allow gesture to accompany alternative constituents by introducing alternative syntactic entries; these entries take on different pragmatic requirements (as described below) to capture their respective discourse functions.</Paragraph>
    <Paragraph position="9"> So much for syntax. The second part--the semantics of the element--is a formula that specifies the content that the element carries. Before the entry can be used, SPUD must establish that the semantics holds of the entities the entry describes. If the semantics already follows from the common ground, SPUD assumes that the hearer can use it to help identify the entities described. If the semantics is merely part of the system's private knowledge, SPUD treats it as new information for the hearer.</Paragraph>
    <Paragraph position="10"> Finally, the third part--the pragmatics of the element--is also a formula that SPUD looks to prove before using the entry. Unlike the semantics, however, the pragmatics does not achieve specific communicative goals like identifying referents. Instead, the pragmatics establishes a general fit between the entry and the context.</Paragraph>
    <Paragraph position="11"> The entry schematized in (3) illustrates these three components; the entry also suggests how these components can define coordinated actions of speech and gesture that respond coherently to the context.</Paragraph>
    <Paragraph position="12">  c 'pragmaties:&amp;quot;heardr-n-ew(x) A'theme{O) .....</Paragraph>
    <Paragraph position="13"> (3) describes the use of have to introduce a new feature of (a house) o. The feature, indicated throughout the entry by the variable x,.is realized as the object NP of the verb have, but x can also form the basis of a gesture G coordinated with the noun phrase (as indicated by the SYNC constituent). The entry asserts that o has x.</Paragraph>
    <Paragraph position="14"> (3) is a presentational Construction; in other  words, it coordinates non-redundant paired speech and gesture in the same way as demonstrated by our house description data. To represent this constraint on its use, the entry carries two pragmatic requirements: first, x must be new to the hearer; moreover, o must link up with the open question in the discourse that the sentence responds to.</Paragraph>
    <Paragraph position="15"> The pragmatic conditions of (3) help support our theory of the discourse function of gesture and speech. A similar kind of sentence could be used to address other open questions in the discourse-for example, to answer which house has a garden? This would not be a presentational function, and (3) would be infelicitous here. In that case, gesture would naturally coordinate with and elaborate on the answering information--in this case the house. So the different information structure would activate a different entry, where the gesture would coordinate with the subject and describe o.</Paragraph>
    <Paragraph position="16"> Meanwhile, alternative entries like (4a) and (4b)---two entries that both convey (4c) and that both could combine with (3) by LTAG operations-underlie our claim that our implementation allows gesture and speech to draw on a single conceptual source and fulfill similar communicative intentions.</Paragraph>
    <Paragraph position="17">  (4a) provides a structure that could substitute for the G node in (3) to produce semantically and pragmatically coordinated speech and gesture. (4a) specifies a right hand gestnre:in.wlhieh.~the hand. traces out a circular trajectory; a further decision must determine the correct handshape (node RS, as a func- null tion of the entity x that the gesture describes). We pair (4a) with the semantics in (4c), and thereby model that the gesture indicates that one object, x, surrounds another, p. Since p cannot be further described, p must be identified by an additional presupposition of the gesture which.picks up~a reference frame from the sliared context.</Paragraph>
    <Paragraph position="18"> Similarly, (4b) describes how we could modify the vP introduced by (3) (using the LTAG operation of adjunction), to produce an utterance such as It has a garden surrounding it. By pairing (4b) with the same semantics (4c), we ensure that SPUD will treat the communicative contribution of the alternative constructions of (4) in a parallel fashion. Both are triggered by accessing background knowledge and both are recognized as directly communicating specified facts.</Paragraph>
  </Section>
  <Section position="6" start_page="176" end_page="176" type="metho">
    <SectionTitle>
5 Solving the generation problem
</SectionTitle>
    <Paragraph position="0"> We now sketch how entries such as these combine together to account for REA'S utterances. Our example is the dialogue in (5): (5) a User: Tell me more about the house.</Paragraph>
    <Paragraph position="1"> b REA: It has \[a nice garden\]. (right hand, held fiat, traces a circle) REA's response indicates both that the house has a nice garden and that it surrounds the house.</Paragraph>
    <Paragraph position="2"> As we have seen, (5b) represents a common pattern of description; this particular example is motivated by an exchange two human subjects had in our study, cf. (1). (5b) represents a solution to a generation problem that arises as follows within REA'S overall architecture. The user's directive is interpreted and classified as a directive requiring a deliberative response. The dialogue manager recognizes an obligation to respond to the directive, and concludes that to fulfill the function of presenting the garden would discharge this obligation. The presentational function grounds out in the communicative goal to convey a collection of facts about the garden (type, quality, location relative to the house). Along with these goals, the dialogue manager supplies its communicative context, which represents the centrality of the house in attentional prominence, cognitive status and information structure.</Paragraph>
    <Paragraph position="3"> In producing (5b) in response to this NLG problem, SPUD both calculates the applicability of and . ~determines a preference,for-theqexiOatized descriptors involved. Initially, (3) is applicable; the system knows the house has the garden, and represents the garden as new and the house as questioned. The entry can be selected over potential alternatives based on its interpretation--it achieves a communicative goal, refers to a prominent entity, and makes a relatively specific connection to facts in the context.</Paragraph>
    <Paragraph position="4"> and what its role might be. Likewise, we need a model of the communicative effects of spontaneous coverbal gesture--one that allows us to reason naturally about the multiple goals speakers have in producing each utterance.</Paragraph>
    <Paragraph position="5"> _Similarly, in the .second, stage, SPUD evaluates .and selects (4a) because it Communicates a needed fact 7 in a way that helps flesh out a concise, balanced communicative act by supplying a gesture that by using (3) SPUD has already realized belongs here.</Paragraph>
    <Paragraph position="6"> Choices of remaining elements--the words garden and nice, the semantic features to represent the garden in the gesture--proceed similarly. Thus SPUD arrives at the response in (5b) just by reasoning from the declarative specification of the meaning and context of communicative actions.</Paragraph>
  </Section>
class="xml-element"></Paper>