<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1412"> <Title>Noun Phrase Generation for Situated Dialogs</Title> <Section position="4" start_page="1" end_page="81" type="metho"> <SectionTitle> 2 Generation for Situated Tasks </SectionTitle> <Paragraph position="0"> Many previous projects (Lauria et al., 2001; Moratz and Tenbrink, 2003; Skubic et al., 2002, inter alia) study the interpretation of situated language, e.g. for giving directions to a robot. The focus of our work is instead on generating navigation instructions for a human partner to follow.</Paragraph> <Paragraph position="1"> Linguistic studies have shown that speakers select noun phrase forms to refer to entities based on a variety of factors. Some of these factors are intrinsic to the object being described, while others are features of the context in which the expression is spoken. The entity's status within the discourse, its spatial position, and the presence of similar items from which the target referent must be distinguished have all been found to affect the lexical properties chosen for a particular referring expression (e.g., Gundel et al., 1993; Prince, 1981; Grosz et al., 1995).</Paragraph> <Paragraph position="2"> This variation is expressed in the determiner chosen (e.g. that/a), the head noun (e.g. that/door/one), and the presence of additional modifiers such as pre-nominal adjectives or prepositional phrases.</Paragraph> <Paragraph position="3"> In natural language generation, the process of generating referring expressions occurs in stages (Reiter and Dale, 1992). The process we explore in this paper is the sentence planning stage, which determines whether the context supports generating a particular referring expression as a pronoun, description, one-anaphor, etc.</Paragraph> <Paragraph position="4"> There has been extensive research both on automatic route description and on general noun phrase (NP) generation, but few projects consider extra-linguistic information as part of the context that influences dialog behavior. Poesio et al. (1999) apply statistical techniques to the problem of NP generation. However, even though the corpus used in that study contains descriptions of museum items visually accessible to the user, the features used in generation were mostly linguistic and included little information about the visual or spatial properties of the referent. Another related study in statistical NP generation (Cheng et al., 2001) focuses on choosing the modifiers to be included; again, no features derived from the situated world were used. Maass et al. (1995) use features from the world, including objects' color, height, width, and visibility, as well as the user's direction of travel and distance from objects, to generate instructions in a situated task. However, their focus is on selecting landmarks and descriptions under time pressure, rather than on selecting the linguistic form to be produced.</Paragraph> </Section> <Section position="5" start_page="81" end_page="82" type="metho"> <SectionTitle> 3 Data Collection </SectionTitle> <Paragraph position="0"> Our task setup is designed to elicit natural, spontaneous situated language from human partners. The experimental platform employs a virtual-reality (VR) world in which one partner, the direction-follower (DF), moves about to perform a series of tasks, such as pushing buttons to rearrange objects in the room and finding and picking up treasures.
The simulated world was presented from a first-person perspective on a desktop computer monitor. The DF had no knowledge of the world map or the tasks.</Paragraph> <Paragraph position="1"> His partner, the direction-giver (DG), had a paper 2D map of the world and a list of tasks to complete. As they performed the tasks, the DG had instant feedback about the DF's location in the VR world via a mirror of the DF's computer screen on the DG's monitor. The partners communicated through headset microphones. Our paid participants were self-identified native speakers of North American English. Figure 1 shows an example view of the world and the accompanying dialog fragment.</Paragraph>

[Figure 1: an example view of the world, with the accompanying dialog fragment.
DG: you can currently see three buttons... there's actually a fourth button that's kind of hidden
DF: yeah
DG: by this cabinet on the right
DF: I know, yeah
DG: ok, um, so what you wanna do is you want to go in and you're gonna press one of the buttons that's on the right hand wall, so you wanna go all the way straight into the room and then face the wall
DF: mhm
DG: there with the two buttons
DF: yep
DG: um and you wanna push the one that's on the left]

<Paragraph position="2"> The video output of the DF's computer was recorded, along with the audio from both microphones. A logfile created by the VR software recorded the DF's coordinates, gaze angle, and the positions of objects in the world 10 times per second. These data sources were synchronized using calibration markers. A technical report describing the recording equipment and software is available (Byron, 2005).</Paragraph> <Section position="1" start_page="81" end_page="82" type="sub_section"> <SectionTitle> 3.1 Corpus and Annotation Scheme </SectionTitle> <Paragraph position="0"> Using this setup, we created a corpus of 15 dialogs containing a total of 221 minutes of speech, which was transcribed and word-aligned using Praat and SONIC.</Paragraph> <Paragraph position="1"> The dialogs were further annotated using the Anvil software (Kipp, 2004) to identify a set of target referring expressions in the corpus. Because we are interested in the spatial properties of the referents of these target referring expressions, the items of interest in this experiment were restricted to objects with a defined spatial position.</Paragraph> <Paragraph position="2"> Each object in the virtual world was assigned a symbolic id, and the id of each target referring expression's referent was added to the annotation. Referring expressions with plural referents were marked as Set and labeled with a list of the members of the set. Expressions were also annotated as vague when the referent was not clear at the time of utterance, or as abandoned when the utterance was cut short. Items that did not contain a surface realization of the head of the NP (e.g., on the left) were marked with the tag empty.</Paragraph> <Paragraph position="3"> The corpus contains 1736 target expressions, of which 221 were Vague, 45 were Empty, and 228 were Sets. The remaining 1242 form the set of test items in the experiment described below. Vague items were excluded since we do not want the algorithm we develop to reproduce this behavior. Set items were excluded in order to avoid the more complex calculation of spatial properties associated with plural entities.</Paragraph>
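As a concrete illustration of this filtering step, the sketch below shows how the 1736 annotated expressions reduce to the 1242 test items. It is a minimal sketch: the record fields are hypothetical and do not reflect the actual Anvil annotation schema, which the paper does not specify at this level of detail.

from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record for one annotated target expression; the field
# names are illustrative, not the actual Anvil annotation schema.
@dataclass
class TargetExpression:
    text: str                   # surface form, e.g. "that button"
    referent_id: Optional[str]  # symbolic id of the referent, if identified
    is_set: bool = False        # plural referent, labeled with member ids
    is_vague: bool = False      # referent unclear at utterance time
    is_abandoned: bool = False  # utterance cut short
    is_empty: bool = False      # no surface head noun, e.g. "on the left"

def select_test_items(exprs: List[TargetExpression]) -> List[TargetExpression]:
    """Drop Vague, Empty, and Set expressions, keeping the singular items
    used in the experiments (1736 -> 1242 in this corpus)."""
    return [e for e in exprs if not (e.is_vague or e.is_empty or e.is_set)]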
<Paragraph position="4"> The data used in the experiments is a consensus version in which both annotators (two of the authors) agreed on the set of target expressions and their properties. Due to the constraints introduced by the task, referent annotation achieved almost perfect agreement. Only the DG's language is used in this study.</Paragraph> </Section> </Section> <Section position="6" start_page="82" end_page="85" type="metho"> <SectionTitle> 4 Algorithm Development </SectionTitle> <Paragraph position="0"> Our ultimate goal is to provide input to a surface realization component for NP generation, given the ID of a target referent and a vector of context features. It is desirable for these context features to be automatically derived, to limit reliance on human annotation, so we restricted our study to features that were either derived automatically or required minimal human annotation.</Paragraph> <Paragraph position="1"> One impact of this decision is that even though the linguistic literature predicts that syntactic features such as grammatical role are important in selecting NP forms, these features were difficult to obtain. Our corpus contains spontaneous spoken discourse, which has no sentence boundaries and relaxed structural constraints, so automatic parsing was problematic. With improved parsing techniques we may include syntactic information in the decision process for NP generation in the future, but it was not included in the current study.</Paragraph> <Paragraph position="2"> Following (Poesio et al., 1999), we consider the information conveyed by an NP to be divided into four slots which must be filled in order to generate the NP form: a determiner/quantifier slot, pre- and post-modifier slots, and a head noun slot. There were very few examples of pre-modifiers in the corpus, so we collapsed the two modifier slots into a single modifier feature. Therefore, the output from our algorithm is an NP frame specifying values for the three remaining slots for each target expression. Figure 2 shows the possible values in each slot and example slot values for two NPs. The numbers of occurrences of the NP frame slot values in the entire corpus are shown in Table 2.</Paragraph>

[Figure 2: NP frame slot values and examples. det: a, the, that, none; head: it, that, one, noun, none. Example NP frames for "it" and "that button on the right".]

<Paragraph position="3"> In the experimental VR world developed for this study, all items from the same category were designed to look identical. This was intended to encourage the subjects to use referring expressions that rely on spatial attributes or deictic reference, such as that one. The spatial properties of target referents and distractors are used as inputs to the content planning algorithm; their values in this study were calculated automatically from geometric properties of the virtual world.</Paragraph> <Paragraph position="4"> To form the training dataset, we processed each target expression with a syntactic chunker. The partial parse it produced was further processed with a regular-expression matcher to isolate the values corresponding to the three slots. Parser errors caused some low-count NP frame values, so we retained only items that occurred at least 10 times in the entire corpus. Any parser errors that remained in the data were not hand-corrected, in order to minimize human intervention.</Paragraph>
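To illustrate the slot isolation, here is a minimal sketch that maps an NP string onto a (det, head, mod) frame. It is an assumption-laden stand-in: the study used a syntactic chunker followed by regular-expression matching over partial parses, whereas this sketch applies a single pattern directly to the string; the slot inventories are taken from Figure 2 (with "an" added for completeness).

import re

# Slot inventories from Figure 2; "none" marks an unfilled slot.
NP_PATTERN = re.compile(
    r"^(?:(?P<det>an|a|the|that)\s+)?"   # optional determiner
    r"(?P<head>\w+)"                     # head: pronoun or noun
    r"(?:\s+(?P<mod>.+))?$",             # optional (post-)modifier
    re.IGNORECASE,
)

def np_frame(np_text: str):
    """Map an NP string to a (det, head, mod) frame. A rough stand-in for
    the chunker plus regular-expression pipeline described above."""
    m = NP_PATTERN.match(np_text.strip())
    if m is None:
        return "none", "none", "none"
    det = (m.group("det") or "none").lower()
    head = m.group("head").lower()
    mod = (m.group("mod") or "none").lower()
    return det, head, mod

# The two examples from Figure 2:
print(np_frame("it"))                        # ('none', 'it', 'none')
print(np_frame("that button on the right"))  # ('that', 'button', 'on the right')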
<Section position="1" start_page="82" end_page="85" type="sub_section"> <SectionTitle> 4.1 Context Features </SectionTitle> <Paragraph position="0"> Given the restrictions we impose on what is accessible to the learning algorithm, we developed a set of features for each referring expression that characterize both the referent and the context in which the expression was spoken. The context features are not only linguistic but also derived from the extralinguistic situation, including spatial relations between the referent and the DF's position and orientation at that instant. The context features for each target expression include these automatically calculated attributes as well as features from the annotation described above. Table 1 describes the full set of context features, and Figure 3 shows a schematic of the spatial context features.</Paragraph>

[Table 1: Context features.
1. Count and chainCount: the mention counts for the referent over the dialog and inside a reference chain.
2. DeltaTime and DeltaTimeChain: the time elapsed since the referent was last mentioned in the dialog overall or within a chain.
3. PrevSpeaker: the previous speaker that mentioned the ID (either DG or DF).
4. Mod (the two possible tags for Mod occurred in almost equal proportion, 49%/51%).
8. Distance: the distance between the referent and the DF's VR coordinates.
9. Angle: the angle between the center of the DF's view angle and the center of the referent.
10. Visible: a boolean value indicating whether the object is visible.
Relation to other objects in the world:
11. VisibleDistractors: the number of objects other than the target referent in the field of view.
12. SameCatVisDistractors: the number of visible distractors of the same type as the referent.
Object category and its information status:
13. Cat: the semantic category of the referent: door/cabinet/button.
14. FirstLocate: whether this is the first expression that allowed the DF to identify the object in the world, i.e., the point where joint spatial reference is accomplished.]

[Figure 3: a schematic of the spatial context features; the target object is B4.]
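A minimal sketch of how the spatial features in Table 1 could be computed from the logged DF coordinates and gaze angle, assuming simple 2D geometry and a hypothetical 90-degree field of view; the paper does not specify the VR engine's actual visibility test, which would also have to handle occlusion by walls.

import math
from dataclasses import dataclass

@dataclass
class WorldObject:
    oid: str   # symbolic id, e.g. "B4"
    cat: str   # semantic category: door / cabinet / button
    x: float
    y: float

def distance(dfx: float, dfy: float, obj: WorldObject) -> float:
    """Feature 8: Euclidean distance between the DF and the referent."""
    return math.hypot(obj.x - dfx, obj.y - dfy)

def angle(dfx: float, dfy: float, gaze_deg: float, obj: WorldObject) -> float:
    """Feature 9: angle between the center of the DF's view and the
    referent, in degrees, folded into [0, 180]."""
    bearing = math.degrees(math.atan2(obj.y - dfy, obj.x - dfx))
    diff = (bearing - gaze_deg + 180.0) % 360.0 - 180.0
    return abs(diff)

def visible(dfx, dfy, gaze_deg, obj, fov_deg=90.0) -> bool:
    """Feature 10 (simplified): within an assumed field of view;
    ignores occlusion by walls."""
    return angle(dfx, dfy, gaze_deg, obj) <= fov_deg / 2.0

def distractor_counts(dfx, dfy, gaze_deg, target, objects):
    """Features 11 and 12: visible distractors, and those sharing
    the target's category."""
    vis = [o for o in objects
           if o.oid != target.oid and visible(dfx, dfy, gaze_deg, o)]
    same_cat = [o for o in vis if o.cat == target.cat]
    return len(vis), len(same_cat)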
<Paragraph position="1"> The mention history of a target referent is important for determining the form to use in a subsequent referring expression. Ideally, the discourse history feature should indicate whether a referent has already been discussed, and the distance between a new mention and its antecedent. But determining the discourse status of items in this world was complicated by two factors: all objects of the same semantic category had identical visual features, and the VR world in which the task is conducted is a maze, which required the subjects to perform tasks, move to a different portion of the maze, and possibly return to a previously visited room. Given the visual and spatial confusion possible in this setting, there is no guarantee that our subjects could accurately determine whether they were discussing the same object they had encountered before, or remember whether that object had been discussed. While the subjects were focused on a task in a particular room, however, it is reasonable to expect that they could remember which items had been discussed.</Paragraph> <Paragraph position="2"> Therefore, the discourse histories of target objects were calculated using a re-initialization process. Each time the subjects left a VR room to pursue a different task, if more than 25 seconds elapsed before the next mention of objects in that room, those subsequent expressions were considered to begin new coreference chains. This time constant was established by examining pronominal referring expressions in the training dialogs.</Paragraph> <Paragraph position="3"> These features were used as input to develop a classifier that determines NP frames for unseen target referents in context. We chose decision trees for their ease of interpretation, but we plan to test other machine learning techniques in the future. Five dialogs were held out as unseen data, and the remaining ten were used to train and adjust the parameters of the decision process. Our first step was to test whether the three slot values are interdependent. In contrast to previous work, which focused on predicting the value of one slot at a time, we hold that because of their interdependence these decisions should not be made separately. For example, an NP that has the pronoun it as its head will never have a modifier or a determiner. If the three slots are independent, training three separate classifiers and then combining their decisions will yield better results; if they are dependent, better results will be obtained by training a single classifier on the combined label. Unfortunately, combining the labels is problematic due to data sparsity. To test these dependencies, we trained several decision trees, varying the independence assumptions:</Paragraph> <Paragraph position="4"> Independent: a decision tree was trained for each slot and their outputs were combined at the end.</Paragraph> <Paragraph position="5"> Joint: a single decision tree was trained on the combined label for all three slots.</Paragraph> <Paragraph position="6"> Conditional: three decision trees were trained in sequence, each having access to the output of the previous tree. For example, Mod-Det-Head means that first the Mod tree was trained, then a tree to classify Det using the output from Mod, and finally a tree for Head using both the Det and Mod values.</Paragraph> <Paragraph position="7"> All possible orderings of Mod, Head, and Det were tested. The best result was obtained with the ordering Mod-Det-Head, but the differences between the orderings were not significant. The 10-fold cross-validation results are shown in Table 3; there were 632 items in the data set. The Conditional trees outperformed the Independent trees by 9%, which is significant (p < .0002).</Paragraph> <Paragraph position="8"> As our training data suggest, we test the Mod-Det-Head trees against our held-out data. We decided to use a leave-one-out method of training/testing due to the sparsity of data.</Paragraph> <Paragraph position="9"> Decision tree classifiers offer the opportunity to examine the relevance of particular features in the final decision. Algorithms 1 and 2 show example trees derived for the Mod and Det features (the Head tree is not shown due to space limitations).</Paragraph>
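As an illustration of the Conditional configuration, the following sketch chains three decision trees in the Mod-Det-Head order, feeding each tree's predictions to the next as extra features. scikit-learn is an assumed toolkit here (the paper does not name its decision-tree implementation), and the context features and slot labels are assumed to be numerically encoded.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_conditional_chain(X, y_mod, y_det, y_head):
    """Train three trees in sequence (Mod -> Det -> Head); each later
    tree receives the earlier trees' outputs as additional features."""
    mod_tree = DecisionTreeClassifier(random_state=0).fit(X, y_mod)
    X_det = np.column_stack([X, mod_tree.predict(X)])
    det_tree = DecisionTreeClassifier(random_state=0).fit(X_det, y_det)
    X_head = np.column_stack([X_det, det_tree.predict(X_det)])
    head_tree = DecisionTreeClassifier(random_state=0).fit(X_head, y_head)
    return mod_tree, det_tree, head_tree

def predict_np_frames(trees, X):
    """Predict a (mod, det, head) frame for each row of context features."""
    mod_tree, det_tree, head_tree = trees
    mod = mod_tree.predict(X)
    X_det = np.column_stack([X, mod])
    det = det_tree.predict(X_det)
    X_head = np.column_stack([X_det, det])
    head = head_tree.predict(X_head)
    return list(zip(mod, det, head))

Note that prediction mirrors training: each downstream tree sees predicted rather than gold upstream labels, matching the description of each tree "having access to the output of the previous tree".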
<Paragraph position="10"> We found that there are significant dependencies between the slots in the NP form. Each time one of the slots' values was available to the decision process, it was selected as the most informative feature in the next tree. The spatial features were selected as informative in all the trees, most prevalently in the decision tree for Mod, suggesting that the decision to include extra information is driven largely by the spatial configuration. The information status and discourse history features, such as First Locate, Type, and attributes of the prior mention, were selected as good predictors for the Det slot.</Paragraph>

[Algorithm 1: an example decision tree for Mod. The surviving fragment reads: if FirstLocate = True then if VisibleDistractors = 0 then if Distance <= 116 then ...]

</Section> </Section> </Paper>