<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-3001">
  <Title>What's There to Talk About? A Multi-Modal Model of Referring Behavior in the Presence of Shared Visual Information</Title>
  <Section position="5" start_page="9" end_page="10" type="metho">
    <SectionTitle>
4 The Puzzle Task Corpus
</SectionTitle>
    <Paragraph position="0"> The corpus data used for the development of the models in this paper come from a subset of data collected over the past few years using a referential communication task called the puzzle study (Gergle et al., 2004).</Paragraph>
    <Paragraph position="1"> In this task, pairs of participants are randomly assigned to play the role of &amp;quot;Helper&amp;quot; or &amp;quot;Worker.&amp;quot; It is the goal of the task for the Helper to successfully describe a configuration of pieces to the Worker, and for the Worker to correctly arrange the pieces in their workspace. The puzzle solutions, which are only provided to the Helper, consist of four blocks selected from a larger set of eight. The goal is to have the Worker correctly place the four solution pieces in the proper configuration as quickly as possible so that they match the target solution the Helper is viewing.</Paragraph>
    <Paragraph position="2"> Each participant was seated in a separate room in front of a computer with a 21-inch display.</Paragraph>
    <Paragraph position="3"> The pairs communicated over a high-quality, full-duplex audio link with no delay. The experimental displays for the Worker and Helper are illustrated in Figure 1.</Paragraph>
    <Paragraph position="4">  Helper's view (right).</Paragraph>
    <Paragraph position="5"> The Worker's screen (left) consists of a staging area on the right hand side where the puzzle pieces are held, and a work area on the left hand side where the puzzle is constructed. The Helper's screen (right) shows the target solution on the right, and a view of the Worker's work area in the left hand panel. The advantage of this setup is that it allows exploration of a number of different arrangements of the shared visual space. For instance, we have varied the proportion of the workspace that is visually shared with the Helper in order to examine the impact of a limited field-of-view. We have offset the spatial alignment between the two displays to simulate settings of various video systems. And we have added delays to the speed with which the Helper receives visual feedback of the Worker's actions in order to simulate network congestion.</Paragraph>
    <Paragraph position="6"> Together, the data collected using the puzzle paradigm currently contains 64,430 words in the form of 10,640 contributions collected from over 100 different pairs. Preliminary estimates suggest that these data include a rich collection of over 5,500 referring expressions that were generated across a wide range of visual settings. In this paper, we examine a small portion of the data in order to assess the feasibility and potential contribution of the corpus for model development.</Paragraph>
    <Section position="1" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
4.1 Preliminary Corpus Overview
</SectionTitle>
      <Paragraph position="0"> The data collected using this paradigm includes an audio capture of the spoken conversation surrounding the task, written transcriptions of the spoken utterances, and a time-stamped record of all the piece movements and their representative state in the shared workspace (e.g., whether they are visible to both the Helper and Worker). From  these various streams of data we can parse and extract the units for inclusion in our models.</Paragraph>
      <Paragraph position="1"> For initial model development, we focus on modeling two primary conditions from the PUZZLE CORPUS. The first is the &amp;quot;No Shared Visual Information&amp;quot; condition where the Helper could not see the Worker's workspace at all. In this condition, the pair needs to successfully complete the tasks using only linguistic information. The second is the &amp;quot;Shared Visual Information&amp;quot; condition, where the Helper receives immediate visual feedback about the state of the Worker's work area. In this case, the pairs can make use of both linguistic information and shared visual information in order to successfully complete the task.</Paragraph>
      <Paragraph position="2"> As Table 1 demonstrates, we use a small random selection of data consisting of 10 dialogues from each of the Shared Visual Information and No Shared Visual Information conditions. Each of these dialogues was collected from a unique participant pair. For this evaluation, we focused primarily on pronoun usage since this has been suggested to be one of the major linguistic efficiencies gained when pairs have access to a shared visual space (Kraut et al., 2003).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="10" end_page="11" type="metho">
    <SectionTitle>
5 Preliminary Model Overviews
</SectionTitle>
    <Paragraph position="0"> The models evaluated in this paper are based on Centering Theory (Grosz et al., 1995; Grosz &amp; Sidner, 1986) and the algorithms devised by Brennan and colleagues (1987) and adapted by Tetreault (2001). We examine a language-only model based on Tetreault's Left-Right Centering (LRC) model, a visual-only model that uses a measure of visual salience to rank the objects in the visual field as possible referential anchors, and an integrated model that balances the visual information along with the linguistic information to generate a ranked list of possible anchors.</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
5.1 The Language-Only Model
</SectionTitle>
      <Paragraph position="0"> We chose the LRC algorithm (Tetreault, 2001) to serve as the basis for our language-only model. It has been shown to fare well on task-oriented spoken dialogues (Tetreault, 2005) and was easily adapted to the PUZZLE CORPUS data.</Paragraph>
      <Paragraph position="1"> LRC uses grammatical function as a central mechanism for resolving the antecedents of anaphoric references. It resolves referents by first searching in a left-to-right fashion within the current utterance for possible antecedents. It then makes co-specification links when it finds an antecedent that adheres to the selectional restrictions based on verb argument structure and agreement in terms of number and gender. If a match is not found the algorithm then searches the lists of possible antecedents in prior utterances in a similar fashion.</Paragraph>
      <Paragraph position="2"> The primary structure employed in the language-only model is a ranked entity list sorted by linguistic salience. To conserve space we do not reproduce the LRC algorithm in this paper and instead refer readers to Tetreault's original formulation (2001). We determined order based on the following precedence ranking: Subject g37 Direct Object g37 Indirect Object Any remaining ties (e.g., an utterance with two direct objects) were resolved according to a left-to-right breadth-first traversal of the parse tree.</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
5.2 The Visual-Only Model
</SectionTitle>
      <Paragraph position="0"> As the Worker moves pieces into their workspace, depending on whether or not the workspace is shared with the Helper, the objects become available for the Helper to see. The visual-only model utilized an approach based on visual salience. This method captures the relevant visual objects in the puzzle task and ranks them according to the recency with which they were active (as described below).</Paragraph>
      <Paragraph position="1"> Given the highly controlled visual environment that makes up the PUZZLE CORPUS, we have complete access to the visual pieces and exact timing information about when they become visible, are moved, or are removed from the shared workspace. In the visual-only model, we maintain an ordered list of entities that comprise the shared visual space. The entities are included in the list if they are currently visible to both the Helper and Worker, and then ranked according to the recency of their activation.2 2 This allows for objects to be dynamically rearranged depending on when they were last 'touched' by the Worker.</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.3 The Integrated Model
</SectionTitle>
      <Paragraph position="0"> We used the salience list generated from the language-only model and integrated it with the one from the visual-only model. The method of ordering the integrated list resulted from general perceptual psychology principles that suggest that highly active visual objects attract an individual's attentional processes (Scholl, 2001).</Paragraph>
      <Paragraph position="1"> In this preliminary implementation, we defined active objects as those objects that had recently moved within the shared workspace.</Paragraph>
      <Paragraph position="2"> These objects are added to the top of the linguistic-salience list which essentially rendered them as the focus of the joint activity. However, people's attention to static objects has a tendency to fade away over time. Following prior work that demonstrated the utility of a visual decay function (Byron et al., 2005b; Huls et al., 1995), we implemented a three second threshold on the lifespan of a visual entity. From the time since the object was last active, it remained on the list for three seconds. After the time expired, the object was removed and the list returned to its prior state. This mechanism was intended to capture the notion that active objects are at the center of shared attention in a collaborative task for a short period of time. After that the interlocutors revert to their recent linguistic history for the context of an interaction.</Paragraph>
      <Paragraph position="3"> It should be noted that this is work in progress and a major avenue for future work is the development of a more theoretically grounded method for integrating linguistic salience information with visual salience information.</Paragraph>
    </Section>
    <Section position="4" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
5.4 Evaluation Plan
</SectionTitle>
      <Paragraph position="0"> Together, the models described above allow us to test three basic hypotheses regarding the likely impact of linguistic and visual salience: Purely linguistic context. One hypothesis is that the visual information is completely disregarded and the entities are salient purely based on linguistic information. While our prior work has suggested this should not be the case, several existing computational models function only at this level.</Paragraph>
      <Paragraph position="1"> Purely visual context. A second possibility is that the visual information completely overrides linguistic salience. Thus, visual information dominates the discourse structure when it is available and relegates linguistic information to a subordinate role. This too should be unlikely given the fact that not all discourse deals with external elements from the surrounding world.</Paragraph>
      <Paragraph position="2"> A balance of syntactic and visual context. A third hypothesis is that both linguistic entities and visual entities are required in order to accurately and perspicuously account for patterns of observed referring behavior. Salient discourse entities result from some balance of linguistic salience and visual salience.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="11" end_page="12" type="metho">
    <SectionTitle>
6 Preliminary Results
</SectionTitle>
    <Paragraph position="0"> In order to investigate the hypotheses described above, we examined the performance of the models using hand-processed evaluations of the PUZZLE CORPUS data. The following presents the results of the three different models on 10 trials of the PUZZLE CORPUS in which the pairs had no shared visual space, and 10 trials from when the pairs had access to shared visual information representing the workspace. Two experts performed qualitative coding of the referential anchors for each pronoun in the corpus with an overall agreement of 88% (the remaining anomalies were resolved after discussion).</Paragraph>
    <Paragraph position="1"> As demonstrated in Table 2, the language-only model correctly resolved 70% of the referring expressions when applied to the set of dialogues where only language could be used to solve the task (i.e., the no shared visual information condition). However, when the same model was applied to the dialogues from the task conditions where shared visual information was available, it only resolved 41% of the referring expressions correctly. This difference was significant, 2(1,  of the PUZZLE CORPUS evaluated.</Paragraph>
    <Paragraph position="2"> In contrast, when the visual-only model was applied to the same data derived from the task conditions in which the shared visual information was available, the algorithm correctly resolved 66.7% of the referring expressions. In comparison to the 41% produced by the language-only model. This difference was also significant, 2(1, N=78) = 5.16, p = .02. However, we did not find evidence of a difference between the performance of the visual-only model on the visual task conditions and the language-only model on the  language task conditions, 2(1, N=69) = .087, p = .77 (n.s.).</Paragraph>
    <Paragraph position="3"> The integrated model with the decay function also performed reasonably well. When the integrated model was evaluated on the data where only language could be used it effectively reverts back to a language-only model, therefore achieving the same 70% performance. Yet, when it was applied to the data from the cases when the pairs had access to the shared visual information it correctly resolved 69.2% of the referring expressions. This was also better than the 41% exhibited by the language-only model, 2(1, N=78) = 6.27, p = .012; however, it did not statistically outperform the visual-only model on the same data, 2(1, N=78) = .059, p = .81 (n.s.).</Paragraph>
    <Paragraph position="4"> In general, we found that the language-only model performed reasonably well on the dialogues in which the pairs had no access to shared visual information. However, when the same model was applied to the dialogues collected from task conditions where the pairs had access to shared visual information the performance of the language-only model was significantly reduced. However, both the visual-only model and the integrated model significantly increased performance. The goal of our current work is to find a better integrated model that can achieve significantly better performance than the visual-only model. As a starting point for this investigation, we present an error analysis below.</Paragraph>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
6.1 Error Analysis
</SectionTitle>
      <Paragraph position="0"> In order to inform further development of the model, we examined a number of failure cases with the existing data. The first thing to note was that a number of the pronouns used by the pairs referred to larger visible structures in the workspace. For example, the Worker would sometimes state, &amp;quot;like this?&amp;quot;, and ask the Helper to comment on the overall configuration of the puzzle. Table 3 presents the performance results of the models after removing all expressions that did not refer to pieces of the puzzle.</Paragraph>
      <Paragraph position="1">  stricted to piece referents.</Paragraph>
      <Paragraph position="2"> In the errors that remained, the language-only model had a tendency to suffer from a number of higher-order referents such as events and actions. In addition, there were several errors that resulted from chaining errors where the initial referent was misidentified. As a result, all subsequent chains of referents were incorrect.</Paragraph>
      <Paragraph position="3"> The visual-only model and the integrated model had a tendency to suffer from timing issues. For instance, the pairs occasionally introduced a new visual entity with, &amp;quot;this one?&amp;quot; However, the piece did not appear in the workspace until a short time after the utterance was made.</Paragraph>
      <Paragraph position="4"> In such cases, the object was not available as a referent on the object list. In the future we plan to investigate the temporal alignment between the visual and linguistic streams.</Paragraph>
      <Paragraph position="5"> In other cases, problems simply resulted from the unique behaviors present when exploring human activities. Take the following example,  (3) Helper: There is an orange red that obscures  half of it and it is to the left of it In this excerpt, all of our models had trouble correctly resolving the pronouns in the utterance. However, while this counts as a strike against the model performance, the model actually presented a true account of human behavior. While the model was confused, so was the Worker. In this case, it took three more contributions from the Helper to unravel what was actually intended.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>