Grounding Word Meanings in Sensor Data:
Dealing with Referential Uncertainty
Tim Oates
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
oates@cs.umbc.edu
Abstract
We consider the problem of how the mean-
ings of words can be grounded in sensor data.
A probabilistic representation for the mean-
ings of words is defined, a method for recov-
ering meanings from observational information
about word use in the face of referential uncer-
tainty is described, and empirical results with
real utterances and robot sensor data are pre-
sented.
1 Introduction
We are interested in how robots might learn language
given qualitatively the same inputs available to children -
natural language utterances paired with sensory access to
the environment. This paper focuses on the sub-problem
of learning word meanings. Suppose a robot has acquired
a set of sound patterns that may or may not correspond to
words. How is it possible to separate the words from the
non-words, and to learn the meanings of the words?
We assume the robot’s sensory access to its environ-
ment is through a collection of primitive sensors orga-
nized into sensor groups, where each sensor group is a
set of related sensors. For example, the sensor group
F_INT might return a single value representing the mean
grayscale intensity of a set of pixels corresponding to an
object in the visual field. The sensor group F_HW might
return two values representing the height and width of the
bounding box around the object.
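A sensor group can be thought of as a function from an observed object to a small vector of related readings. The sketch below illustrates the idea; the class, group names, and field values are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Obj:
    pixels: List[int]    # grayscale intensities of the object's pixels
    height: float        # bounding-box height
    width: float         # bounding-box width

def f_int(o: Obj) -> List[float]:
    """One-dimensional sensor group: mean grayscale intensity."""
    return [sum(o.pixels) / len(o.pixels)]

def f_hw(o: Obj) -> List[float]:
    """Two-dimensional sensor group: bounding-box height and width."""
    return [o.height, o.width]

SENSOR_GROUPS: Dict[str, Callable[[Obj], List[float]]] = {
    "INT": f_int,
    "HW": f_hw,
}

o = Obj(pixels=[120, 130, 134], height=12.0, width=8.0)
readings = {name: g(o) for name, g in SENSOR_GROUPS.items()}
print(readings)   # {'INT': [128.0], 'HW': [12.0, 8.0]}
```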
Learning the meanings of words requires a represen-
tation for meaning. We use a representation that we call
a conditional probability field (CPF), which is a type of
scalar field. A scalar field is a map of the following form:

    f : R^n → R

The mapping assigns to each vector x ∈ R^n a scalar value
f(x). A conditional probability field assigns to each x,
which corresponds to a point in an n-dimensional sen-
sor group, a conditional probability of the form P(E | x),
where E denotes the occurrence of some event. Let
E(F, x) denote the CPF defined over sensor group F for
event E.
The semantics of a CPF clearly depend on the nature
of E. Two events that will be of particular importance in
learning the meanings of words are:
• utter-W - the event that word W is uttered, per-
haps as part of an utterance that refers to some fea-
ture of the world denoted by W
• hear-W - the event that word W is heard
The corresponding conditional probability fields are:
• utter-W(F, x) - the probability that word W will
be uttered by a competent speaker of the language
to denote the feature of the physical world that F
is currently sensing (i.e. that results in the current
value of x)
• hear-W(F, x) - the probability that word W will
be heard given that x ∈ F is observed
In this framework, the meaning of word W is simply
utter-W(F, x). The last plot in figure 3 shows a CPF
defined over F_INT that might represent the meaning of the
word “gray”. Grayscale intensities near 128 will be called
gray with probability almost one, whereas intensities near
0 and 255 will never be called gray. Rather, they are
“black” and “white” respectively.
Learning the denotation of W involves determining the
identity of F and then recovering utter-W(F, x). The
learner does not have direct access to utter-W(F, x).
Rather, the learner must gain information about
utter-W(F, x) indirectly, by noticing the sensory con-
texts in which W is used and those in which it is not,
i.e. via hear-W(F, x).
This problem is difficult due to referential uncertainty.
Even if the utterances the learner hears are true state-
ments about aspects of its environment that are percep-
tually available, there are usually many aspects of the
environment that might be a given word’s referent. This
is Quine’s “gavagai” problem (Quine, 1960). The algo-
rithm described in this paper solves a restricted version
of the gavagai problem, one in which the denotation of a
word must be representable as a CPF defined over one of
a set of pre-defined sensor groups.
2 A Simplified Learning Problem
Rather than starting with the full complexity of the prob-
lem facing the learner, consider the following highly sim-
plified version. Suppose an agent with a single sensor
group, F_1, lives in a world with a single object, O_1, that
periodically changes color. Each time the color changes,
a one word utterance is generated describing the new
color, which is one of "black", "white" or "gray".
In this scenario there is no need to identify F because
there is only one possibility. Also, each time a word is
uttered there is perfect information about its denotation;
it is the current value produced by F_1, i.e. F_1(O_1). (The
notation F(O) indicates the value recorded by sensor
group F when it is applied to object O. This assumes an
ability to individuate objects in the environment.) There-
fore, the probability that a native speaker of our simple
language will utter W to refer to x is the same as the
probability of hearing W given x. This fact makes it pos-
sible to recover the form of the CPF for each of the three
words by noticing which values of F_1(O_1) co-occur with
the words and applying Bayes' rule as follows:
    utter-W(F_1, x) = hear-W(F_1, x)
                    = P(hear-W | x)
                    = P(x | hear-W) P(hear-W) / P(x)
The maximum-likelihood estimate of the quantity
P(hear-W) is simply the number of utterances contain-
ing W divided by the total number of utterances. The
quantities P(x) and P(x | hear-W) can be estimated using
a number of standard techniques. We use kernel density
estimators with (multivariate) Gaussian kernels to esti-
mate probability densities such as these.
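As a concrete sketch of this estimation step, the following recovers hear-W(F, x) for a one-dimensional sensor group via Bayes' rule and simple Gaussian kernel density estimators. The toy data and the bandwidth h are invented for the example, not values from the paper.

```python
import math

def gauss_kde(samples, h):
    """1-D kernel density estimator with Gaussian kernels of bandwidth h."""
    def density(x):
        z = sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        return z / (len(samples) * h * math.sqrt(2 * math.pi))
    return density

def hear_w(x, all_values, values_with_w, h=10.0):
    """P(hear-W | x) = P(x | hear-W) P(hear-W) / P(x), all terms estimated."""
    p_w = len(values_with_w) / len(all_values)    # ML estimate of P(hear-W)
    p_x = gauss_kde(all_values, h)(x)
    p_x_given_w = gauss_kde(values_with_w, h)(x)
    return p_x_given_w * p_w / p_x

# Toy data: "gray" is uttered only when the intensity is near 128.
all_vals = [0.0, 5.0, 120.0, 125.0, 128.0, 131.0, 250.0, 255.0]
gray_vals = [120.0, 125.0, 128.0, 131.0]
print(hear_w(128.0, all_vals, gray_vals))   # close to 1
print(hear_w(0.0, all_vals, gray_vals))     # close to 0
```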
The simplified version of the word-learning problem
presented in this section can be made more realistic,
and thus more complex, by increasing either the num-
ber of objects in the environment or the number of sensor
groups available to the agent. Section 3 explores the for-
mer, and section 4 explores the latter.
3 Multiple Objects
When there is no ambiguity about the referent of a word,
it is possible to recover the conditional probability field
that represents the word’s denotational meaning by pas-
sive observation of the contexts in which it is used. Un-
fortunately, referential ambiguity is a feature of natural
languages that we contend with on a daily basis. This
ambiguity appears to be at its most extreme for young
children acquiring their first language, who must deter-
mine the referent of each newly identified word from
the infinitely many aspects of their environment that are
perceptually available.
Consider what happens when we add a second object,
O_2, to our example domain. If both objects change color
at exactly the same time, though not necessarily to the
same color, the learner has no way of knowing whether
an utterance refers to the value produced by F_1(O_1) or
F_1(O_2). In the absence of any exogenous information
about the referent, the best the learner can do is make
a guess, which will be right only 50% of the time. As
the number of objects in the environment increases, this
percentage decreases.
Referential ambiguity can also take the form of un-
certainty about the sensor group to which a word refers.
Given two objects, O_1 and O_2, and two sensor groups,
F_1 and F_2, a word can refer to any of the following:
F_1(O_1), F_1(O_2), F_2(O_1), F_2(O_2). In this section
we make the unrealistic assumption that the learner has
a priori knowledge about the sensor group to which a
word refers. This assumption will be relaxed in the fol-
lowing section.
Intuitively, referential ambiguity clouds the relation-
ship between the denotational meaning of a word and the
observable manifestations of its meaning, i.e. the contexts
in which the word is used. As we will now demonstrate,
it is possible to make the nature of this clouding precise,
leading to an understanding of the impact of referential
ambiguity on learnability.
Suppose an agent hears an utterance U containing
word W while its attention is focused on the output of
sensor group F. (Recall that in this section we are mak-
ing the assumption that the agent knows that W refers
to F.) Why might U contain W? There are two mutu-
ally exclusive and exhaustive cases: U is (at least in part)
about F, and W is chosen to denote the current value pro-
duced by F; U is not about F, and U contains W despite
this fact. The latter case might occur if, for example, W
has multiple meanings and the utterance uses one of the
meanings of W that does not denote a value produced by
F.
Let about(U, F) denote the fact that U is (at least in
part) about F, and let contains(U, W) denote the fact
that W occurs in U. Then the conditional probability
of an utterance containing W given the current value, x,
hear-W(F, x) = P(about(U, F) ∧ contains(U, W) ∨ ¬about(U, F) ∧ contains(U, W))    (1)

             = P(about(U, F) ∧ contains(U, W)) + P(¬about(U, F) ∧ contains(U, W))
               − P(about(U, F) ∧ contains(U, W) ∧ ¬about(U, F) ∧ contains(U, W))

             = P(about(U, F) ∧ contains(U, W)) + P(¬about(U, F) ∧ contains(U, W))

             = P(contains(U, W) | about(U, F)) P(about(U, F))
               + P(contains(U, W) | ¬about(U, F)) P(¬about(U, F))

             = α utter-W(F, x) + (1 − α)β    (2)

Figure 1: A derivation of the relationship between hear-W(F, x) and utter-W(F, x).
produced by F can be expressed via equation 1 in figure
1. Equation 1 is a more formal, probabilistic statement
of the conditions given above under which U will contain
W. It can be simplified as shown in the remainder of the
figure.
The first step in transforming equation 1 into equation
2 is to apply the fact that P(A ∨ B) = P(A) + P(B) −
P(A ∧ B). The resulting joint probability is the probabil-
ity of a conjunction of terms that contains both
about(U, F) and ¬about(U, F), and is therefore 0 and
can be dropped. The remaining two terms are then
rewritten using Bayes' rule. Finally, three substitutions
are made:
• utter-W(F, x) = P(contains(U, W) | about(U, F))
• α = P(about(U, F))
• β = P(contains(U, W) | ¬about(U, F))
Simplification then leads directly to equation 2.
Before discussing the implications of equation 2, con-
sider the import of α and β. The probability that U is
about F (i.e. α) is the probability that the speaker and
the hearer are attending to the same sensory information.
When α = 1, there is perfect shared attention, and the
speaker always refers to those aspects of the physical
environment to which the hearer is currently attending.
When α = 0, there is never shared attention, and the
speaker always refers to aspects of the environment other
than those to which the hearer is currently attending.
The probability that U contains W even when U is not
about F (i.e. β) is the probability that W will be used to
refer to some feature of the environment other than that
measured by F. There are two reasons why W might
occur in a sentence that does not refer to F:
• W is polysemous and one of the meanings that does
not refer to F is used in the utterance
• W is used to refer to the value produced by F for
some object other than the one that is the hearer's
focus of attention (e.g. F(O_1) rather than F(O_2))
Note that β comes into play only when α < 1, i.e. when
there is less than perfect shared attention between the
speaker and the hearer.
The most significant aspect of equation 2 concerns
learnability. In our original one-object, one-sensor world
there was never any doubt as to the referent of a word,
and it was therefore the case that utter-W(F, x) =
hear-W(F, x). This equivalence becomes clear in equa-
tion 2 by setting α = 1 and simplifying. Because it is
possible to compute hear-W(F, x) from observable in-
formation via Bayes' rule, it was possible in that world
to recover utter-W(F, x) rather directly. However, equa-
tion 2 tells us that even in the face of imperfect shared
attention (i.e. α < 1) and homonymy (i.e. β > 0) it is the
case that hear-W(F, x) is a linear transform of
utter-W(F, x). Moreover, the values of α and β deter-
mine the precise nature of the transform.
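The linear transform of equation 2 is easy to sketch numerically; the probabilities below are invented for illustration.

```python
# hear-W as a linear function of utter-W: slope alpha (degree of shared
# attention), intercept (1 - alpha) * beta (background use of W).
def hear_from_utter(utter, alpha, beta):
    return alpha * utter + (1 - alpha) * beta

utter = 0.8   # chance a speaker would use W for the current sensor value

print(hear_from_utter(utter, alpha=1.0, beta=0.3))   # 0.8: transform is identity
print(hear_from_utter(utter, alpha=0.0, beta=0.3))   # 0.3: only background use of W
print(hear_from_utter(utter, alpha=0.5, beta=0.3))   # ≈ 0.55: squashed and shifted
```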
To get a better handle on the effects of α and β on
the manifestation of utter-W(F, x) through
hear-W(F, x), consider figures 2 and 3. The last plot
in figure 2 shows an example of a conditional prob-
ability field utter-W(F, x), which is also a plot of
hear-W(F, x) when α = 1. Figures 2 and 3 demon-
strate the effects of varying α and β on hear-W(F, x).
That is, the figures show how varying α and β affects
the information about utter-W(F, x) available to the
learner.
Recall from equation 2 that the conditional probability
that word W will be heard given the current value pro-
duced by F is a linear function of utter-W(F, x) which
has slope α and intercept (1 − α)β. When the slope is
zero (i.e. α = 0) the speaker and the hearer never focus
on the same features of the environment, and the prob-
ability of hearing W is just the background probability
of hearing W, independent of the value of F. When the
slope is one (i.e. α = 1) the speaker and the hearer al-
ways focus on the same features of the environment and
so the effect of β vanishes. The observable manifestation
hear-W(F, x) and the underlying utter-W(F, x) are then
identical. These two cases are shown in the first and last
graphs in figure 2, which contains plots of hear-W(F, x)
over a range of values of β for various fixed values of α.
Figure 2 makes it clear that decreasing α preserves the
overall shape of utter-W(F, x) as observed through
hear-W(F, x), while squashing it to fit in a smaller
range of values. Increasing α diminishes the effect of β,
which is to offset the entire curve vertically. That is, the
higher the level of shared attention between speaker and
hearer, the less the impact of the background frequency of
W on the observable manifestation of utter-W(F, x).
Figure 3, which shows plots of hear-W(F, x) given
a range of values of α for various fixed values of
β, is another way of looking at the same data. The
role of α in squashing the observable manifestation of
utter-W(F, x) is apparent, as is the role of β in ver-
tically shifting the curves. Only when α = 0 is there
no information about the form of utter-W(F, x) in the
plot of hear-W(F, x).
What does all of this have to say about the impact of
α and β on the learnability of word meanings from sen-
sory information about the contexts in which they are ut-
tered? As we will demonstrate shortly, if the following
expression is true for a given conditional probability field,
utter-W(F, x), then it is possible to recover that CPF
from observable data (i.e. from hear-W(F, x)):

    ∃x_0 utter-W(F, x_0) = 0  ∧  ∃x_1 utter-W(F, x_1) = 1
The claim is as follows. If there is both a value produced
by F that is always referred to as W and a value that
is never referred to as W, one can recover the CPF that
represents the denotational meaning of W simply by ob-
serving the contexts in which W is used.
Intuitively, the above expression places two constraints
on word meanings. First, for a word W whose denotation
is defined over sensor group F, it must be the case that
some value produced by F is (almost) universally agreed
to have no better descriptor than W; there is no other
word in the language that is more suitable for denoting
this value. Second, there must be some value produced
by F for which it is (almost) universally agreed that W is
not the best descriptor. It is not necessarily the case that
W is the worst descriptor, only that some other word or
words are better.
As equation 2 indicates, hear-W(F, x) is a linear
transform of utter-W(F, x) with slope α and intercept
(1 − α)β. If we know two points on the line defined by
equation 2 we can determine its parameters, making it
possible to reverse the transform and compute the value
of utter-W(F, x) given the value of hear-W(F, x).
Because hear-W(F, x) is a linear transform of
utter-W(F, x), any value of x that minimizes (maxi-
mizes) one minimizes (maximizes) the other. Recall that
conditional probability fields map from sensor vectors to
probabilities, which must lie in the range [0, 1]. Under
the assumption that utter-W(F, x) takes on the value
0 at some point, such as when x = x_0, hear-W(F, x)
must be at its minimum value at that point as well. Let
that value be P_min. Likewise, under the assumption that
utter-W(F, x) takes on the value 1 at some point, such
as when x = x_1, hear-W(F, x) must be at its maxi-
mum value at that point as well. Let that value be P_max.
These observations lead to the following system of two
equations:

    P_min = α · utter-W(F, x_0) + (1 − α)β
          = α · 0 + (1 − α)β
          = (1 − α)β

    P_max = α · utter-W(F, x_1) + (1 − α)β
          = α · 1 + (1 − α)β
          = α + (1 − α)β
Solving these equations for α and β yields the following:

    α = P_max − P_min

    β = P_min / (1 − P_max + P_min)
Recall that the goal of this exercise is to re-
cover utter-W(F, x) from its observable manifesta-
tion, hear-W(F, x). This can finally be accomplished
by substituting the values for α and β given above into
equation 2 and solving for utter-W(F, x) as shown in
figure 4.
That is, one can recover the CPF that represents the de-
notational meaning of a word by simply scaling the range
of conditional probabilities of the word given observa-
tions so that it completely spans the interval [0, 1].
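The recovery just described amounts to a rescaling of the observed conditional probabilities. A minimal sketch, with observed values invented for illustration and assuming utter-W truly reaches both 0 and 1:

```python
def recover_utter(hear_values):
    """Rescale observed hear-W values so they span [0, 1]."""
    p_min, p_max = min(hear_values), max(hear_values)
    if p_max == p_min:
        raise ValueError("hear-W is constant; W is not grounded in this group")
    return [(h - p_min) / (p_max - p_min) for h in hear_values]

# Observations generated with alpha = 0.5 and beta = 0.5, so each value
# is 0.5 * utter-W + 0.25; the true utter-W curve is [0, 0.5, 1, 0.5, 0].
observed = [0.25, 0.5, 0.75, 0.5, 0.25]
print(recover_utter(observed))   # [0.0, 0.5, 1.0, 0.5, 0.0]
```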
4 Multiple Sensor Groups
This section considers a still more complex version of
the problem by allowing the learner to have more than
one sensor group. Suppose an agent has two sensor
groups, a0 a1 and a0 a6 , and that word a30 refers to a0 a1 . The
agent can observe the values produced by both sensor
groups, note whether each value co-occurred with an
utterance containing a30 , compute both hear-Wa20a0 a1 a27a17a23a22
and hear-Wa20a0 a6 a27a17a23a22 , and apply equation 2 to obtain
utter-Wa20a0 a1 a27a17a23a22 and utter-Wa20a0 a6 a27 a17a23a22 .
How is the agent to determine that utter-Wa20a0 a1 a27a17a23a22
represents the meaning of a30 and utter-Wa20a0 a6 a27 a17a23a22 is
garbage? The key insight is that if the meaning of a30 is
grounded in a0 , there will be some values of a17 a18 a0 for
which it is more likely that a30 will be uttered than for
others, and thus there will be some values for which it is
more likely that a30 will be heard than others. Indeed, our
ability to recover utter-Wa20a0a28a27a17a23a22 from hear-Wa20a0a28a27a17a23a22
is founded on the assumption that there is some value of
a17 a18
a0 for which the conditional probability of uttering
[Figure 2 consists of three plots of p(W|x) versus x (0 to 255), one each
for α = 0.00, α = 0.50 and α = 1.00, with curves for β = 1.00, 0.75,
0.50, 0.25 and 0.00 in each plot.]
Figure 2: The effects of β on hear-W(F, x) for various values of α.
[Figure 3 consists of three plots of p(W|x) versus x (0 to 255), one each
for β = 0.00, β = 0.50 and β = 1.00, with curves for α = 1.00, 0.75,
0.50, 0.25 and 0.00 in each plot.]
Figure 3: The effects of α on hear-W(F, x) for various values of β.
hear-W(F, x) = α utter-W(F, x) + (1 − α)β

             = (P_max − P_min) utter-W(F, x)
               + (1 − P_max + P_min) · P_min / (1 − P_max + P_min)

             = (P_max − P_min) utter-W(F, x) + P_min

utter-W(F, x) = (hear-W(F, x) − P_min) / (P_max − P_min)    (3)

Figure 4: How to derive the meaning of a word from observations of its use in the face of referential uncertainty.
W is zero and some other value for which that probabil-
ity is one. This is not necessarily the case for the con-
ditional probability of hearing W given x ∈ F because
the level of shared attention between the speaker and the
learner, α, influences the range of probabilities spanned
by hear-W(F, x), with smaller values of α leading to
smaller ranges.
Note that in our simple example with two sensor
groups the speaker considers only the value of F_1 when
determining whether to utter W, and the learner considers
only the value of F_2 when constructing hear-W(F_2, x).
Using the terminology and notation developed in section
3, there is no shared attention between the speaker and
the learner with respect to W and F_2, and it is therefore
the case that α = 0 and hear-W(F_2, x) = β.
If the exact value of hear-W(F_2, x) is known for all
x, an obviously unrealistic assumption, it is a simple
matter to determine that utter-W(F_2, x) cannot rep-
resent the meaning of W by noting that it is constant.
If utter-W(F_2, x) is not constant, then the speaker is
more likely to utter W for some values of x ∈ F_2 than
for others, and the meaning of W is therefore grounded
in F_2. As indicated by figure 2, the height of the bumps
in the conditional probability field depends on α, the
level of shared attention, but if there are any bumps at all
we know that the meaning of W is grounded in the cor-
responding sensor group and we can recover the underly-
ing conditional probability field. Under the assumption
that the exact value of hear-W(F, x) can be computed,
an agent can identify the sensor group in which the deno-
tation of a word is grounded by simply recovering
utter-W(F, x) for each of its sensor groups and looking
for the one that is not constant.
In practice, the exact value of hear-W(F, x) will not
be known, and it must be estimated from a finite number
of observations. That is, an estimate of hear-W(F, x)
will be used to compute an estimate of utter-W(F, x).
Even if there is no association between W and F, and
utter-W(F, x) is therefore truly constant, an estimate
of this conditional probability based on finitely many data
will invariably not be constant. Therefore, the strategy of
identifying relevant sensor groups by looking for bumpy
conditional probability fields will not work.
The problem is that for any given word W and sen-
sor group F, it is difficult to distinguish between cases
in which W and F are unrelated and cases in which the
meaning of W is grounded in F but shared attention is
low. The solution to this problem has two parts, both
of which will be described in detail shortly. First, the
mutual information between occurrences of words and
sensor values will be used as a measure of the degree to
which hearing W depends on the value produced by F,
and vice versa. Second, a non-parametric statistical test
based on randomization testing will be used to convert
the real-valued mutual information into a binary decision
as to whether or not the denotation of W is grounded in
F.
4.1 Mutual Information
Let I(W; F) denote the mutual information between oc-
currences of word W and values produced by sensor
group F. The value of I(W; F) is defined as follows:

    I(W; F) = ∫ P(hear-W, x) log [ P(hear-W, x) / (P(hear-W) P(x)) ] dx
            + ∫ P(¬hear-W, x) log [ P(¬hear-W, x) / (P(¬hear-W) P(x)) ] dx

Note that I(W; F) is the mutual information between
two different types of random variables, one discrete
(W) and one continuous (F). In the expression above,
the summation over the two possible values of W,
i.e. hear-W and ¬hear-W, is unpacked, yielding a sum
of two integrals over the values of x ∈ F. Within each in-
tegral the value of W is held constant. Finally, recall that
x is a vector with the same dimensionality as the sensor
group from which it is drawn, so the integrals above are
actually defined to range over all of the dimensions of the
sensor group.
When I(W; F) is zero, knowing whether W is uttered
provides no information about the value produced by F,
and vice versa. When I(W; F) is large, knowing the
value of one random variable leads to a large reduction
in uncertainty about the value of the other. Larger val-
ues of mutual information reflect tighter concentrations
of the mass of the joint probability distribution and thus
higher certainty on the part of the agent about both the
circumstances in which it is appropriate to utter W and
the denotation of W when it is uttered.
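As a sketch of how I(W; F) might be estimated from finite data, the following discretizes the sensor values into bins so that the integrals become sums; the bin count and the toy data are illustrative choices, not the paper's kernel-based formulation.

```python
import math
from collections import Counter

def mutual_information(pairs, n_bins=4, lo=0.0, hi=256.0):
    """Estimate I(W; F) from (heard_w, x) pairs by binning x."""
    width = (hi - lo) / n_bins
    binned = [(w, int((x - lo) // width)) for w, x in pairs]
    n = len(binned)
    joint = Counter(binned)
    w_counts = Counter(w for w, _ in binned)
    b_counts = Counter(b for _, b in binned)
    mi = 0.0
    for (w, b), c in joint.items():
        # p(w, b) * log( p(w, b) / (p(w) p(b)) ), with counts as probabilities
        mi += (c / n) * math.log(c * n / (w_counts[w] * b_counts[b]))
    return mi

# W is heard exactly when x falls in a mid-gray band: strong dependence.
dependent = [(96 < x < 160, float(x)) for x in range(0, 256, 8)]
# W alternates regardless of x: no dependence.
independent = [((x // 8) % 2 == 0, float(x)) for x in range(0, 256, 8)]

print(mutual_information(dependent) > 0.1)   # True
print(mutual_information(independent))       # 0.0
```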
Although mutual information provides a measure of
the degree to which W and F are dependent, to under-
stand and generate utterances containing W the agent
must at some point make a decision as to whether or
not its meaning is in fact grounded in F. How is the
agent to make this determination based on a single scalar
value? The next section describes a way of converting
scalar mutual information values into binary decisions
as to whether a word's meaning is grounded in a sen-
sor group that avoids all of the potential pitfalls just de-
scribed.
4.2 Randomization Testing
Given word W, sensor group F, and their mutual infor-
mation I(W; F), the task facing the learner is to deter-
mine whether the meaning of W is grounded in F. This
can be phrased as a yes-or-no question in the following
two ways. Is it the case that occurrences of W and the
values produced by F are dependent? Is it the case that
occurrences of W and the values produced by F are not
independent?
The latter question is the form used in statistical hy-
pothesis testing. In this case the null hypothesis, H_0,
would be that occurrences of W and the values produced
by F are independent. Given a distribution of mutual in-
formation values derived under H_0, it is possible to de-
termine the probability of getting a mutual information
value at least as large as I(W; F). If this probability is
small, then the null hypothesis can be rejected with a cor-
respondingly small probability of making an error in do-
ing so (i.e. the probability of committing a type-I error
is small). That is, the learner can determine that occur-
rences of W and the values produced by F are not inde-
pendent, that the meaning of W is grounded in F, with a
bounded probability of being wrong.
We’ve now reduced the problem to that of obtaining
a distribution of values of I(W; F) under H_0. For most
exotic distributions, such as this one, there is no paramet-
ric form. However, in such cases it is often possible to
obtain an empirical distribution via a technique known as
randomization testing (Cohen, 1995; Edgington, 1995).
This approach can be applied to the current problem
as follows: each datum corresponds to an utterance and
indicates whether W occurred in the utterance and the
value produced by F at the time of the utterance; the test
statistic is I(W; F); and the null hypothesis is that oc-
currences of W and values produced by F are indepen-
dent. If the null hypothesis is true, then whether or not
a particular value produced by F co-occurred with W is
strictly a matter of random chance. It is therefore a sim-
ple matter to enforce the null hypothesis by splitting the
data into two lists, one containing each of the observed
sensor values and one containing each of the labels that
indicates whether or not W occurred, and creating a new
data set by repeatedly randomly selecting one item from
each list without replacement and pairing them together.
This gives us all of the elements required by the generic
randomization testing procedure described above.
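The procedure just described can be sketched as follows. The test statistic is passed in as a function (e.g. a mutual-information estimator over (label, value) pairs); the shuffle count and seed are illustrative parameters.

```python
import random

def randomization_test(pairs, statistic, n_shuffles=1000, seed=0):
    """Empirical p-value for the null hypothesis that word labels and
    sensor values are independent. pairs: list of (heard_w, value)."""
    rng = random.Random(seed)
    observed = statistic(pairs)
    labels = [w for w, _ in pairs]
    values = [x for _, x in pairs]
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)                        # enforce the null hypothesis
        if statistic(list(zip(labels, values))) >= observed:
            exceed += 1
    return exceed / n_shuffles                     # small p => reject independence
```

A small returned p-value licenses the conclusion that occurrences of the word and the values produced by the sensor group are not independent.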
Given a word and a set of sensor groups, randomiza-
tion testing can be applied independently to each group
to determine whether it is the one in which the meaning
of W is grounded. The answer may be in the affirmative
for zero, one or more sensor groups. None of these out-
comes is necessarily right or wrong. As noted previously,
it may be that the meaning of the word is too abstract to
ground out directly in sensory data. It may also be the
case that a word has multiple meanings, each of which is
grounded in a different sensor group, or a single meaning
that is grounded in multiple sensor groups.
5 Experiments
This section presents the results of experiments in which
word meanings are grounded in the sensor data of a mo-
bile robot. The domain of discourse was a set of blocks.
There were 32 individual blocks with one block for each
possible combination of two sizes (small and large), four
colors (red, blue, green and yellow) and four shapes
(cone, cube, sphere and rectangle).
To generate sensor data for the robot, one set of hu-
man subjects played with the blocks, repeatedly selecting
a subset of the blocks and placing them in some config-
uration in the robot’s visual field. The only restrictions
placed on this activity were that there could be no more
than three blocks visible at one time, two blocks of the
same color could not touch, and occlusion from the per-
spective of the robot was not allowed.
Given a configuration of blocks, the robot generated
a digital image of the configuration using a color CCD
camera and identified objects in the image as contiguous
regions of uniform color. Given a set of objects, i.e. a
set of regions of uniform color in the robot’s visual field,
virtual sensor groups implemented in software extracted
the following information about each object: Γ_A mea-
sured the area of the object in pixels; Γ_HW measured the
height and width of the bounding box around the object;
Γ_XY measured the x and y coordinates of the centroid
of the object in the visual field; Γ_HSI measured the hue,
saturation and intensity values averaged over all pixels
comprising the object; Γ_S returned a vector of three num-
bers that represented the shape of the object (Stollnitz et
al., 1996). In addition, the Γ_REL sensor group returned
the proximal orientation, center of mass orientation and
distance for each pair of objects as described in (Regier,
1996). These sensor groups constitute the entirety of the
robot’s sensorimotor experience of the configurations of
blocks created by the human subjects.
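The virtual sensor groups can be pictured with a small sketch. This is not the robot's actual vision code: it assumes each object arrives as a boolean pixel mask plus a per-pixel HSI image, and only the simpler groups (Γ_A, Γ_HW, Γ_XY, Γ_HSI) are shown; the shape and relational groups would require the cited methods.

```python
import numpy as np

def sensor_groups(mask, hsi):
    """Compute the simpler virtual sensor groups for one object.

    mask: boolean array (H x W), True on the object's pixels.
    hsi:  float array (H x W x 3) of per-pixel hue/saturation/intensity.
    """
    ys, xs = np.nonzero(mask)
    return {
        "A":   (len(ys),),                        # Gamma_A: area in pixels
        "HW":  (ys.max() - ys.min() + 1,          # Gamma_HW: bounding-box height
                xs.max() - xs.min() + 1),         #           and width
        "XY":  (xs.mean(), ys.mean()),            # Gamma_XY: centroid x, y
        "HSI": tuple(hsi[mask].mean(axis=0)),     # Gamma_HSI: mean hue, sat, int
    }
```

Each group returns a fixed-length vector, so a word's CPF can be defined over the output space of whichever group grounds its meaning.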
From the 120 block configurations created by the four
subjects, a random sample of 50 configurations was
shown to a different set of subjects, who were asked
to generate natural language utterances describing what
they saw. The only restriction placed on the utterances
was that they had to be truthful statements about the
scenes.
Recurring patterns were discovered in the audio wave-
forms corresponding to the utterances (Oates, 2001) and
these patterns were used as candidate words. Recall that a
sensor group is semantically associated with a word when
the mutual information between occurrences of the word
and values in the sensor group is statistically signifi-
cant. Table 1 shows the p values for the mutual informa-
tion for a number of combinations of words and sensor
groups. The first column makes it clear that the
meaning of the word “red” is grounded in the Γ_HSI sen-
sor group: it is the only one with a statistically significant
mutual information value. As the second column indi-
cates, the mutual information between the word “small”
and the Γ_A sensor group is significant at the 0.05 level,
Table 1: For each sensor group and several words, the
cells of the table show the probability of making an error
in rejecting the null hypothesis that occurrences of the
word and values in the sensor group are independent.

Sensor     Word
Group      “red”   “small”   “above”
Γ_A         0.76     0.05      0.47
Γ_HW        0.86     0.09      0.31
Γ_XY        0.29     0.67      0.07
Γ_HSI       0.00     0.49      0.82
Γ_S         0.34     0.58      0.44
Γ_REL       0.57     0.97      0.00
and the mutual information between this word and the
Γ_HW sensor group is not significant, though its p value
is rather small. Both of these sensor groups return
information about the size of an object, but the Γ_HW
sensor group overestimates the area of non-rectangular
objects because it returns the height and width of a
bounding box around an object. Finally, note from the
third column that the denotation of the word “above” is
correctly determined to lie in the Γ_REL sensor group,
yet there appears to be some relationship between this
word and the Γ_XY sensor group. The reason for this is
that objects that are said to be “above” tend to be much
higher in the robot’s visual field than all of the other objects.
How is it possible to determine the extent to which a
machine has discovered and represented the semantics of
a set of words? We are trying to capture semantic distinc-
tions made by humans in natural language communica-
tion, so it makes sense to ask a human how successful the
system has been. This was accomplished as follows. For
each word for which a semantic association was discov-
ered, every training utterance that used the word was
identified. For the scene associated with each utter-
ance, the CPF underlying the word was used to identify
the most probable referent of the word. For example, if
the word in question was “red”, then the mean HSI values
of all objects in the scene would be computed, and the ob-
ject for which the underlying CPF defined over HSI space
yielded the highest probability would be deemed the
referent of that word in that scene. A human subject was
then asked whether it made sense for that word to refer to
that object in that scene.
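The referent-selection step can be sketched as follows. The CPF here is a toy stand-in (a function peaked near an assumed "red" point in HSI space), and the names are hypothetical; the real CPF is the learned representation described earlier.

```python
import math

def most_probable_referent(cpf, objects):
    """Return the name of the object whose sensor reading the word's
    CPF rates most probable. `objects` is a list of (name, features)."""
    return max(objects, key=lambda obj: cpf(obj[1]))[0]

# Toy CPF peaked at a hypothetical "red" point in HSI space.
def red_cpf(hsi):
    target = (0.0, 0.9, 0.5)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(hsi, target)))

# Two objects with mean-HSI feature vectors; the red sphere scores higher
# under red_cpf and is therefore chosen as the referent of "red".
scene = [("red-sphere", (0.02, 0.85, 0.50)),
         ("blue-cube",  (0.60, 0.80, 0.50))]
```

A human judge is then shown the chosen object and asked whether the word can sensibly refer to it in that scene.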
The percentage of content words (i.e. words like “red”
and “large” as opposed to “oh” and “there”) for which a
semantic association was discovered was a1a3a2a5a4 a2a2a11 . Given a
semantic association, there are two ways that it can be in
error: either the wrong sensor group is selected
or the conditional probability field defined over that sen-
sor group is wrong. Given all of the configurations for
which a particular word was used, the semantic accuracy
is the percentage of configurations for which the meaning
component of the word selects an aspect of the configuration
that a native speaker of the language says is appropriate.
The semantic accuracy was a1a2a8a6a4 a12 a11 .
6 Discussion
This paper described a method for recovering the deno-
tational meaning of a word W, i.e. utter-W(Γ, x̄), given
a set of sensory observations, each labeled according
to whether it co-occurred with an utterance contain-
ing the word, i.e. hear-W(Γ, x̄). It was shown that
hear-W(Γ, x̄) is a linear function of utter-W(Γ, x̄),
where the parameters of the transform are determined
by the level of shared attention and the background fre-
quency of W. Given two weak assumptions about the
form of utter-W(Γ, x̄), these parameters can be recov-
ered and the transform inverted. The use of mutual in-
formation and randomization testing to identify the par-
ticular sensor group that captures a word’s meaning was
also described. It is therefore possible to identify the denota-
tional meaning of a word simply by observing the con-
texts in which it is and is not used, even in the face of
imperfect shared attention and homonymy.

References
Paul R. Cohen. 1995. Empirical Methods for Artificial
Intelligence. The MIT Press.
Eugene S. Edgington. 1995. Randomization Tests. Mar-
cel Dekker.
Tim Oates. 2001. Grounding Knowledge in Sen-
sors: Unsupervised Learning for Language and Plan-
ning. Ph.D. thesis, The University of Massachusetts,
Amherst.
W. V. O. Quine. 1960. Word and Object. MIT Press.
Terry Regier. 1996. The Human Semantic Potential. The
MIT Press.
Eric J. Stollnitz, Tony D. DeRose, and David H. Salesin.
1996. Wavelets for Computer Graphics: Theory and
Applications. Morgan Kaufmann.
