<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0609">
  <Title>Grounding Word Meanings in Sensor Data: Dealing with Referential Uncertainty</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A Simplified Learning Problem
</SectionTitle>
    <Paragraph position="0"> Rather than starting with the full complexity of the problem facing the learner, consider the following highly simplified version. Suppose an agent with a single sensor group, a0a32a1 , lives in a world with a single object, a0a2a1 , that periodically changes color. Each time the color changes, a one word utterance is generated describing the new color, which is one of &amp;quot;black&amp;quot;, &amp;quot;white&amp;quot; or &amp;quot;gray&amp;quot;. In this scenario there is no need to identify a0 because there is only one possibility. Also, each time a word is uttered there is perfect information about its denotation; it is the current value produced by a0 a1 a20 a0 a1 a22 . (The notation a0 a20 a0 a22 indicates the value recorded by sensor groupa0 when it is applied to object a0 . This assumes an ability to individuate objects in the environment.) Therefore, the probability that a native speaker of our simple language will utter a30 to refer toa17 is the same as the probability of hearinga30 givena17 . This fact makes it possible to recover the form of the CPF for each of the three words by noticing which values of a0a32a1 a20 a0a3a1 a22 co-occur with the words and applying Bayes' rule as follows:  a22 is simply the number of utterances containing a30 divided by the total number of utterances. The quantitiesa25 a20a17a23a22 anda25 a20a17 a26hear-Wa22 can be estimated using a number of standard techniques. We use kernel density estimators with (multivariate) Gaussian kernels to estimate probability densities such as these.</Paragraph>
    <Paragraph position="1"> The simplified version of the word-learning problem presented in this section can be made more realistic, and thus more complex, by increasing either the number of objects in the environment or the number of sensor groups available to the agent. Section 3 explores the former, and section 4 explores the latter.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Multiple Objects
</SectionTitle>
    <Paragraph position="0"> When there is no ambiguity about the referent of a word, it is possible to recover the conditional probability field that represents the word's denotational meaning by passive observation of the contexts in which it is used. Unfortunately, referential ambiguity is a feature of natural languages that we contend with on a daily basis. This ambiguity appears to be at its most extreme for young children acquiring their first language who must determine for each newly identified word the referent of the word from the infinitely many aspects of their environment that are perceptually available.</Paragraph>
    <Paragraph position="1"> Consider what happens when we add a second object, a0a7a6 , to our example domain. If both objects change color at exactly the same time, though not necessarily to the same color, the learner has no way of knowing whether an utterance refers to the value produced by a0a4a1 a20 a0a3a1 a22 or</Paragraph>
    <Paragraph position="3"> about the referent, the best the learner can do is make a guess, which will be right only a8a10a9a2a11 of the time. As the number of objects in the environment increases, this percentage decreases.</Paragraph>
    <Paragraph position="4"> Referential ambiguity can also take the form of uncertainty about the sensor group to which a word refers. Given two objects, a0a12a1 and a0 a6 , and two sensor groups, a0 a1</Paragraph>
    <Paragraph position="6"> realistic assumption that the learner has a priori knowledge about the sensor group to which a word refers. This assumption will be relaxed in the following section.</Paragraph>
    <Paragraph position="7"> Intuitively, referential ambiguity clouds the relationship between the denotational meaning of a word and the observable manifestations of its meaning, i.e. the contexts in which the word is used. As we will now demonstrate, it is possible to make the nature of this clouding precise, leading to an understanding of the impact of referential ambiguity on learnability.</Paragraph>
    <Paragraph position="8"> Suppose an agent hears an utterance a13 containing word a30 while its attention is focused on the output of sensor group a0 . (Recall that in this section we are making the assumption that the agent knows that a30 refers to a0 .) Why might a13 contain a30 ? There are two mutually exclusive and exhaustive cases: a13 is (at least in part) abouta0 , anda30 is chosen to denote the current value produced by a0 ; a13 is not about a0 , and a13 contains a30 despite this fact. The latter case might occur if, for example, a30 has multiple meanings and the utterance uses one of the meanings of a30 that does not denote a value produced by a0 .</Paragraph>
    <Paragraph position="9"> Let a14a3a15a17a16a10a18a20a19 a20 a13 a27 a0 a22 denote the fact that a13 is (at least in part) about a0 , and let a21a3a16a3a22a20a19a20a14a24a23a25a22a27a26 a20 a13 a27 a30 a22 denote the fact that a30 occurs in a13 . Then the conditional probability of an utterance containing a30 given the current value, a17 ,</Paragraph>
    <Paragraph position="11"> produced by a0 can be expressed via equation 1 in figure 1. Equation 1 is a more formal, probabilistic statement of the conditions given above under which a13 will contain a30 . It can be simplified as shown in the remainder of the figure.</Paragraph>
    <Paragraph position="12"> The first step in transforming equation 1 into equation 2 is to apply the fact thata25 a20a18a17 a2a20a19a28a22 a4 a25 a20a21a17 a22a22a6 a25 a20a19 a22a23a8 a25 a20a18a17 a1 a19a28a22 . The resulting joint probability is the probability of a conjunction of terms that contains both a14a3a15a17a16a10a18a20a19 a20 a13 a27 a0 a22 and</Paragraph>
    <Paragraph position="14"> a22 , and is therefore 0 and can be dropped.</Paragraph>
    <Paragraph position="15"> The remaining two terms are then rewritten using Bayes' rule. Finally, three substitutions are made:</Paragraph>
    <Paragraph position="17"> Simplification then leads directly to equation 2.</Paragraph>
    <Paragraph position="18"> Before discussing the implications of equation 2, consider the import of a10 and a15 . The probability that a13 is about a0 (i.e. a10 ) is the probability that the speaker and the hearer are attending to the same sensory information.</Paragraph>
    <Paragraph position="19"> When a10 a4 a12 , there is perfect shared attention, and the speaker always refers to those aspects of the physical environment to which the hearer is currently attending.</Paragraph>
    <Paragraph position="20"> When a10 a4 a9 , there is never shared attention, and the speaker always refers to aspects of the environment other than those to which the hearer is currently attending.</Paragraph>
    <Paragraph position="21"> The probability that a13 contains a30 even when a13 is not about a0 (i.e. a15 ) is the probability that a30 will be used to refer to some feature of the environment other than that measured by a0 . There are two reasons why a30 might occur in a sentence that does not refer to a0 :  some object other than the one that is the hearer's focus of attention (e.g. a0 a20 a0a2a1 a22 rather than a0 a20 a0 a6 a22 ) Note that a15 comes into play only when a10a25a24 a12 , i.e. when there is less than perfect shared attention between the speaker and the hearer.</Paragraph>
    <Paragraph position="22"> The most significant aspect of equation 2 is from the standpoint of learnability. In our original one-object, one-sensor world there was never any doubt as to the referent of a word, and it was therefore the case that utter-Wa20a0a28a27a17a23a22 a4 hear-Wa20a0a28a27a17a23a22 . This equivalence becomes clear in equation 2 by setting a10 a4 a12 and simplifying. Because it is possible to compute hear-Wa20a0a28a27 a17a23a22 from observable information via Bayes' rule, it was possible in that world to recover utter-Wa20a0a28a27a17a23a22 rather directly. However, equation 2 tells us that even in the face of imperfect shared attention (i.e. a10a26a24 a12 ) and homonymy (i.e. a15a28a27 a9 ) it is the case that hear-Wa20a0a28a27a17a23a22 is a linear transform of utter-Wa20a0a28a27a17a23a22 . Moreover, the values of a10 and a15 determine the precise nature of the transform.</Paragraph>
    <Paragraph position="23"> To get a better handle on the effects of a10 and  strate the effects of varying a10 and a15 on hear-Wa20a0a28a27 a17a23a22 . That is, the figures show how varying a10 and a15 affect the information about utter-Wa20a0a28a27a17a23a22 available to the learner.</Paragraph>
    <Paragraph position="24"> Recall from equation 2 that the conditional probability that word a30 will be heard given the current value produced bya0 is a linear function ofutter-Wa20a0a28a27a17a23a22 which has slope a10 and intercept a20a13a12 a8a29a10a21a22a16a15 . When the slope is zero (i.e. a10 a4 a9 ) the speaker and the hearer never focus on the same features of the environment, and the probability of hearing a30 is just the background probability of hearing a30 , independent of the value of a0 . When the slope is one (i.e. a10 a4 a12 ) the speaker and the hearer always focus on the same features of the environment and so the effect of a15 vanishes. The observable manifestation of utter-Wa20a0a28a27a17a23a22 and utter-Wa20a0a28a27a17a23a22 are equivalent. These two case are shown in the first and last graphs in figure 2, which contains plots of hear-Wa20a0a28a27a17a23a22 over a range of values of a15 for various fixed values of a10 .</Paragraph>
    <Paragraph position="25"> Figure 2 makes it clear that decreasing a10 preserves the overall shape of utter-Wa20a0a28a27a17a23a22 as observed through hear-Wa20a0a28a27a17a23a22 , while squashing it to fit in a smaller range of values. Increasing a10 diminishes the effect of a15 , which is to offset the entire curve vertically. That is, the higher the level of shared attention between speaker and hearer, the less the impact of the background frequency of a30 on the observable manifestation of utter-Wa20a0a28a27a17a23a22 . Figure 3, which shows plots of hear-Wa20a0a28a27a17a23a22 given a range of values of a10 for various fixed values of a15 , is another way of looking at the same data. The role of a10 in squashing the observable manifestation of utter-Wa20a0a28a27a17a23a22 is apparent, as is the role of a15 in vertically shifting the curves. Only when a10 a4 a9 is there no information about the form of utter-Wa20a0a28a27a17a23a22 in the plot of hear-Wa20a0a28a27a17a23a22 .</Paragraph>
    <Paragraph position="26"> What does all of this have to say about the impact of a10 and a15 on the learnability of word meanings from sensory information about the contexts in which they are uttered? As we will demonstrate shortly, if the following expression is true for a given conditional probability field, utter-Wa20a0a28a27a17a23a22 , then it is possible to recover that CPF from observable data (i.e. from hear-Wa20a0a28a27a17a23a22 ):</Paragraph>
    <Paragraph position="28"> The claim is as follows. If there is both a value produced by a0 that is always referred to as a30 and a value that is never referred to as a30 , one can recover the CPF that represents the denotational meaning of a30 simply by observing the contexts in which a30 is used.</Paragraph>
    <Paragraph position="29"> Intuitively, the above expression places two constraints on word meanings. First, for a worda30 whose denotation is defined over sensor group a0 , it must be the case that some value produced by a0 is (almost) universally agreed to have no better descriptor than a30 ; there is no other word in the language that is more suitable for denoting this value. Second, there must be some value produced by a0 for which it is (almost) universally agreed that a30 is not the best descriptor. It is not necessarily the case that a30 is the worst descriptor, only that some other word or words are better.</Paragraph>
    <Paragraph position="30"> As equation 2 indicates, hear-Wa20a0a28a27 a17a23a22 is a linear transform of utter-Wa20a0a28a27a17a23a22 with slope a10 and intercept</Paragraph>
    <Paragraph position="32"> equation 2 we can determine its parameters, making it possible to reverse the transform and compute the value of utter-Wa20a0a28a27a17a23a22 given the value of hear-Wa20a0a28a27 a17a23a22 . Because hear-Wa20a0a28a27a17a23a22 is a linear transform of utter-Wa20a0a28a27a17a23a22 , any value of a17 that minimizes (maximizes) one minimizes (maximizes) the other. Recall that conditional probability fields map from sensor vectors to probabilities, which must lie in the range [0, 1]. Under the assumption that utter-Wa20a0a28a27a17a23a22 takes on the value a9 at some point, such as when</Paragraph>
    <Paragraph position="34"> must be at its minimum value at that point as well. Let that value bea25a8a7a10a9a12 . Likewise, under the assumption that utter-Wa20a0a28a27a17a23a22 takes on the value a12 at some point, such as when a17 a4 a17 a1 , hear-Wa20a0a28a27a17a23a22 must be at its maximum value at that point as well. Let that value bea25a11a7a10a12a14a13 . These observations lead to the following system of two equations:</Paragraph>
    <Paragraph position="36"> Solving these equations for a10 and a15 yields the following:</Paragraph>
    <Paragraph position="38"> Recall that the goal of this exercise is to recover utter-Wa20a0a28a27a17a23a22 from its observable manifestation, hear-Wa20a0a28a27a17a23a22 . This can finally be accomplished by substituting the values for a10 and a15 given above into equation 2 and solving for utter-Wa20a0a28a27a17a23a22 as shown in figure 4.</Paragraph>
    <Paragraph position="39"> That is, one can recover the CPF that represents the denotational meaning of a word by simply scaling the range of conditional probabilities of the word given observations so that it completely spans the interval a21a9 a27 a12a14a22 .</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Multiple Sensor Groups
</SectionTitle>
    <Paragraph position="0"> This section considers a still more complex version of the problem by allowing the learner to have more than one sensor group. Suppose an agent has two sensor groups, a0 a1 and a0 a6 , and that word a30 refers to a0 a1 . The agent can observe the values produced by both sensor groups, note whether each value co-occurred with an utterance containing a30 , compute both hear-Wa20a0 a1 a27a17a23a22 and hear-Wa20a0 a6 a27a17a23a22 , and apply equation 2 to obtain utter-Wa20a0 a1 a27a17a23a22 and utter-Wa20a0 a6 a27 a17a23a22 . How is the agent to determine that utter-Wa20a0 a1 a27a17a23a22 represents the meaning of a30 and utter-Wa20a0 a6 a27 a17a23a22 is garbage? The key insight is that if the meaning of a30 is grounded in a0 , there will be some values of a17 a18 a0 for which it is more likely that a30 will be uttered than for others, and thus there will be some values for which it is more likely that a30 will be heard than others. Indeed, our ability to recover utter-Wa20a0a28a27a17a23a22 from hear-Wa20a0a28a27a17a23a22 is founded on the assumption that there is some value of  ity is one. This is not necessarily the case for the conditional probability of hearing a30 given a17 a18 a0 because the level of shared attention between the speaker and the learner, a10 , influences the range of probabilities spanned by hear-Wa20a0a28a27 a17a23a22 , with smaller values of a10 leading to smaller ranges.</Paragraph>
    <Paragraph position="1"> Note that in our simple example with two sensor groups the speaker considers only the value of a0 a1 when determining whether to uttera30 , and the learner considers only the value ofa0 a6 when constructinghear-Wa20a0 a6 a27a17a23a22 . Using the terminology and notation developed in section 3, there is no shared attention between the speaker and the learner with respect to a30 and a0 a6 , and it is therefore the case that a10 a4 a9 and hear-Wa20a0 a6 a27 a17a23a22 a4a25a15 . If the exact value of hear-Wa20a0 a6 a27a17a23a22 is known for all a17 , an obviously unrealistic assumption, it is a simple matter to determine that utter-Wa20a0 a6 a27a17a23a22 cannot represent the meaning of a30 by noting that it is constant.</Paragraph>
    <Paragraph position="2"> If utter-Wa20a0 a6 a27a17a23a22 is not constant, then the speaker is more likely to utter a30 for some values of a17 a18 a0 a6 than for others, and the meaning ofa30 is therefore grounded in  a6 . As indicated by figure 2, the height of the bumps in the conditional probability field depend on a10 , the level of shared attention, but if there are any bumps at all we know that the meaning of a30 is grounded in the corresponding sensor group and we can recover the underlying conditional probability field. Under the assumption that the exact value of hear-Wa20a0a28a27a17a23a22 can be computed, an agent can identify the sensor group in which the denotation of a word is grounded by simply recoveringutter-Wa20a0a28a27a17a23a22 for each of its sensor groups and looking for the one that is not constant.</Paragraph>
    <Paragraph position="3"> In practice, the exact value of hear-Wa20a0a28a27 a17a23a22 will not be known, and it must be estimated from a finite number of observations. That is, an estimate of hear-Wa20a0a28a27 a17a23a22 will be used to compute an estimate of utter-Wa20a0a28a27a17a23a22 . Even if there is no association between a30 and a0 , and utter-Wa20a0a28a27a17a23a22 is therefore truly constant, an estimate of this conditional probability based on finitely many data will invariably not be constant. Therefore, the strategy of identifying relevant sensor groups by looking for bumpy conditional probability fields will not work.</Paragraph>
    <Paragraph position="4"> The problem is that for any given word a30 and sensor group a0 , it is difficult to distinguish between cases in which a30 and a0 are unrelated and cases in which the meaning of a30 is grounded in a0 but shared attention is low. The solution to this problem has two parts, both of which will be described in detail shortly. First, the mutual information between occurrences of words and sensor values will be used as a measure of the degree to which hearing a30 depends on the value produced by a0 , and vice versa. Second, a non-parametric statistical test based on randomization testing will be used to convert the real-valued mutual information into a binary decision as to whether or not the denotation of a30 is grounded in  Note that a0 a20a30a2a1 a0 a22 is the mutual information between two different types of random variables, one discrete (a30 ) and one continuous (a0 ). In the expression above, the summation over the two possible values of a30 , i.e. hear-W and a4 hear-W, is unpacked, yielding a sum of two integrals over the values ofa17a19a18 a0 . Within each integral the value of a30 is held constant. Finally, recall that a17 is a vector with the same dimensionality as the sensor group from which it is drawn, so the integrals above are actually defined to range over all of the dimensions of the sensor group.</Paragraph>
    <Paragraph position="5"> When a0 a20a30 a27 a0 a22 is zero, knowing whether a30 is uttered provides no information about the value produced by a0 , and vice versa. When a0 a20a30 a27 a0 a22 is large, knowing the value of one random variable leads to a large reduction in uncertainty about the value of the other. Larger values of mutual information reflect tighter concentrations of the mass of the joint probability distribution and thus higher certainty on the part of the agent about both the circumstances in which it is appropriate to utter a30 and the denotation of a30 when it is uttered.</Paragraph>
    <Paragraph position="6"> Although mutual information provides a measure of the degree to which a30 and a0 are dependent, to understand and generate utterances containing a30 the agent must at some point make a decision as to whether or not its meaning is in fact grounded in a0 . How is the agent to make this determination based on a single scalar value? The next section describes a way of converting scalar mutual information values into binary decisions as to whether a word's meaning is grounded in a sensor group that avoids all of the potential pitfalls just described. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Randomization Testing
</SectionTitle>
      <Paragraph position="0"> Given word a30 , sensor group a0 , and their mutual information a0 a20a30a2a1 a0 a22 , the task facing the learner is to determine whether the meaning of a30 is grounded in a0 . This can be phrased as a yes-or-no question in the following two ways. Is it the case that occurrences of a30 and the values produced by a0 are dependent? Is it the case that occurrences of a30 and the values produced by a0 are not independent? The latter question is the form used in statistical hypothesis testing. In this case the null hypothesis, a0 a5 , would be that occurrences of a30 and the values produced by a0 are independent. Given a distribution of mutual information values derived under a0 a5 , it is possible to determine the probability of getting a mutual information value at least as large as a0  small, then the null hypothesis can be rejected with a correspondingly small probability of making an error in doing so (i.e. the probability of committing a type-I error is small). That is, the learner can determine that occurrences of a30 and the values produced by a0 are not independent, that the meaning of a30 is grounded in a0 , with a bounded probability of being wrong.</Paragraph>
      <Paragraph position="1"> We've now reduced the problem to that of obtaining a distribution of values of a0 a20a30a2a1 a0 a22 under a0 a5 . For most exotic distributions, such as this one, there is no parametric form. However, in such cases it is often possible to obtain an empirical distribution via a technique know as randomization testing (Cohen, 1995; Edgington, 1995).</Paragraph>
      <Paragraph position="2"> This approach can be applied to the current problem as follows - each datum corresponds to an utterance and indicates whether a30 occurred in the utterance and the value produced by a0 at the time of the utterance; the test statistic is a0 a20a30a2a1 a0 a22 ; and the null hypothesis is that occurrences of a30 and values produced by a0 are independent. If the null hypothesis is true, then whether or not a particular value produced by a0 co-occurred with a30 is strictly a matter of random chance. It is therefore a simple matter to enforce the null hypothesis by splitting the data into two lists, one containing each of the observed sensor values and one containing each of the labels that indicates whether or not a30 occurred, and creating a new data set by repeatedly randomly selecting one item from each list without replacement and pairing them together.</Paragraph>
      <Paragraph position="3"> This gives us all of the elements required by the generic randomization testing procedure described above.</Paragraph>
      <Paragraph position="4"> Given a word and a set of sensor groups, randomization testing can be applied independently to each group to determine whether it is the one in which the meaning of a30 is grounded. The answer may be in the affirmative for zero, one or more sensor groups. None of these outcomes is necessarily right or wrong. As noted previously, it may be that the meaning of the word is too abstract to ground out directly in sensory data. It may also be the case that a word has multiple meanings, each of which is grounded in a different sensor group, or a single meaning that is grounded in multiple sensor groups.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML