<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0603">
  <Title>Understanding Complex Visually Referring Utterances</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Overall Performance
</SectionTitle>
      <Paragraph position="0"> In Table 1 we present overall accuracy results, indicating for which percentage of different groups of examples our system picked the same referent as the person describing the object. The first line in the table shows performance relative to the total set of utterances collected. The second one shows the percentage of utterances our system understood correctly excluding those marked as using a descriptive strategy that was not listed in Section 4, and thus not expected to be understood by Bishop. The final line in Table 1 shows the percentage of utterances for which our system picked the correct referent relative to the clean development and testing sets. Although there is obviously room for improvement, these results are significant given that chance performance on this task is only 13.3% and  Colour Due to the simple nature of colour naming in the Bishop task, the probabilistic composers responsible for selecting objects based on colour made no errors.</Paragraph>
      <Paragraph position="1"> Spatial Extrema Our ordering composers correctly identify 100% of the cases in which a participant uses only colour and a single spatial extremum in his or her description. Participants also favour this descriptive strategy, using it with colour alone in 38% of the clean data. In the clean training data, Bishop understands 86.8% of all utterances employing spatial extrema. Participants composed one or more spatial region or extrema references in 30% of the clean data. Our ordering composers correctly interpret 85% of these cases, for example that in Figure 2 in Section 2.2.2. The mistakes our composers make are usually due to overcommitment and faulty ordering. null Spatial Regions Description by spatial region occurs alone in only 5% of the clean data, and together with other strategies in 15% of the clean data. Almost all the examples of this strategy occurring alone use words like &amp;quot;middle&amp;quot; or &amp;quot;centre&amp;quot;. The top image in Figure 8 exemplifies the use of &amp;quot;middle&amp;quot; that our ordering semantic composer models. The object referred to is the one closest to the centre of the board. The bottom image in Figure 8 shows a different interpretation of middle: the object in the middle of a (linguistically not mentioned) group of objects.</Paragraph>
      <Paragraph position="2"> Note that within the group there are two candidate centre objects, and that the one in the front is preferred. There are also further meanings of middle that we expand on in (Gorniak and Roy, 2003). In summary, we can catalogue a number of different meanings for the word &amp;quot;middle&amp;quot; in our data that are linguistically indistinguishable, but depend on visual and historical context to be correctly understood. null &amp;quot;the green one in the middle&amp;quot; &amp;quot;the purple cone in the middle&amp;quot;  Grouping Our composers implementing the grouping strategies used by participants are the most simplistic of all composers we implemented, compared to the depth of the actual phenomenon of visual grouping. As a result, Bishop only understands 29% of utterances that employ grouping in the clean training data. More sophisticated grouping algorithms have been proposed, such as Shi and Malik's (2000).</Paragraph>
      <Paragraph position="3"> Spatial Relations The AVS measure divided by distance between objects corresponds very well to human spatial relation judgements in this task. All the errors that occur in utterances that contain spatial relations are due to the possible landmarks or targets not being correctly identified (grouping or region composers might fail to provide the correct referents).</Paragraph>
      <Paragraph position="4"> Our spatial relation composer picks the correct referent in all those cases where landmarks and targets are the correct ones. Bishop understands 64.3% of all utterances that employ spatial relations in the clean training data. There are types of spatial relations such as relations based purely on distance and combined relations (&amp;quot;to the left and behind&amp;quot;) that we decided not to cover in this implementation, but that occur in the data and should be covered in future efforts. null Anaphora Our solution to the use of anaphora in the Bishop task performs perfectly (100% of utterances employing anaphora) in understanding reference back to a single object in the clean development data. However, there are more complex variants of anaphora that we do not currently cover, for example reference back to groups of objects.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>