<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0609">
  <Title>Grounding Word Meanings in Sensor Data: Dealing with Referential Uncertainty</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> This section presents the results of experiments in which word meanings are grounded in the sensor data of a mobile robot. The domain of discourse was a set of blocks.</Paragraph>
    <Paragraph position="1"> There were 32 individual blocks with one block for each possible combination of two sizes (small and large), four colors (red, blue, green and yellow) and four shapes (cone, cube, sphere and rectangle).</Paragraph>
    <Paragraph position="2"> To generate sensor data for the robot, one set of human subjects played with the blocks, repeatedly selecting a subset of the blocks and placing them in some configuration in the robot's visual field. The only restrictions placed on this activity were that there could be no more than three blocks visible at one time, two blocks of the same color could not touch, and occlusion from the perspective of the robot was not allowed.</Paragraph>
    <Paragraph position="3"> Given a configuration of blocks, the robot generated a digital image of the configuration using a color CCD camera and identified objects in the image as contiguous regions of uniform color. Given a set of objects, i.e. a set of regions of uniform color in the robot's visual field, virtual sensor groups implemented in software extracted the following information about each object: a0a2a1 measured the area of the object in pixels; a0a4a3a6a5 measured the height and width of the bounding box around the object; a0a4a3a6a5 measured the a7 and a8 coordinates of the centroid of the object in the visual field; a0a32a3a10a9a12a11 measured the hue, saturation and intensity values averaged over all pixels comprising the object;a0 a9 returned a vector of three numbers that represented the shape of the object (Stollnitz et al., 1996). In addition, the a0a2a13a15a14a17a16 sensor group returned the proximal orientation, center of mass orientation and distance for the pair of objects as described in (Regier, 1996). These sensor groups constitute the entirety of the robot's sensorimotor experience of the configurations of blocks created by the human subjects.</Paragraph>
    <Paragraph position="4"> From the 120 block configurations created by the four subjects, a random sample of 50 of configurations was shown to a different set of subjects who were asked to generate natural language utterances describing what they saw. The only restriction placed on the utterances was that they had to be truthful statements about the scenes.</Paragraph>
    <Paragraph position="5"> Recurring patterns were discovered in the audio waveforms corresponding to the utterances (Oates, 2001) and these patterns were used as candidate words. Recall that a sensor group is semantically associated with a word when the mutual information between occurrences of the word and values in the sensor groups are statistically significant. Table 1 shows thea25 values for the mutual information for a number of combinations of words and sensor groups. Note from the first column that it is clear that the meaning of the word &amp;quot;red&amp;quot; is grounded in the a0 a3a10a9a18a11 sensor group. It is the only one with a statistically significant mutual information value. As the second column indicates, the mutual information between the word &amp;quot;small&amp;quot; and the a0a19a1 sensor group is significant at the 0.05 level,  cells of the table show the probability of making an error in rejecting the null hypothesis that occurrences of the word and values in the sensor group are independent.</Paragraph>
    <Paragraph position="6">  and the mutual information between this word and the a0 a3 a5 sensor group is not significant but is rather small. Both of these sensor groups return information about the size of an object, but the a0 a3 a5 sensor group overestimates the area of non-rectangular objects because it returns the height and width of a bounding box around an object. Finally, note from the third column that the denotation of the word &amp;quot;above&amp;quot; is correctly determined to lie in the a0 a13a15a14 a16 sensor group, yet there appears to be some relationship between this word and the a0a2a3a6a5 sensor group. The reason for this is that objects that are said to be &amp;quot;above&amp;quot; tend to be much higher in the robot's visual field than all of the other objects.</Paragraph>
    <Paragraph position="7"> How is it possible to determine the extent to which a machine has discovered and represented the semantics of a set of words? We are trying to capture semantic distinctions made by humans in natural language communication, so it makes sense to ask a human how successful the system has been. This was accomplished as follows. For each word for which a semantic association was discovered, each of the training utterances that used the word were identified. For the scene associated with each utterance, the CPF underlying the word was used to identify the most probable referent of the word. For example, if the word in question was &amp;quot;red&amp;quot;, then the mean HSI values of all objects in the scene would be computed and the object for which the underlying CPF defined over HSI space yielded the highest probability would be deemed to be the referent of that word in that scene. A human subject was then asked if it made sense for that word to refer to that object in that scene.</Paragraph>
    <Paragraph position="8"> The percentage of content words (i.e. words like &amp;quot;red&amp;quot; and &amp;quot;large&amp;quot; as opposed to &amp;quot;oh&amp;quot; and &amp;quot;there&amp;quot;) for which a semantic association was discovered was a1a3a2a5a4 a2a2a11 . Given a semantic association, the two ways that it can be in error are as follows: either the wrong sensor group is selected or the conditional probability field defined over that sensor group is wrong. Given all of the configurations for which a particular word was used, the semantic accuracy is the percentage of configurations that the meaning component of the word selects an aspect of the configuration that a native speaker of the language says is appropriate.</Paragraph>
    <Paragraph position="9"> The semantic accuracy was a1a2a8a6a4 a12 a11 .</Paragraph>
  </Section>
class="xml-element"></Paper>