Conversational Robots: Building Blocks for Grounding Word Meaning
Deb Roy
MIT Media Lab
dkroy@media.mit.edu
Kai-Yuh Hsiao
MIT Media Lab
eepness@mit.edu
Nikolaos Mavridis
MIT Media Lab
nmav@media.mit.edu
Abstract
How can we build robots that engage in fluid
spoken conversations with people, moving be-
yond canned responses to words and towards
actually understanding? As a step towards ad-
dressing this question, we introduce a robotic
architecture that provides a basis for grounding
word meanings. The architecture provides per-
ceptual, procedural, and affordance represen-
tations for grounding words. A perceptually-
coupled on-line simulator enables sensory-
motor representations that can shift points of
view. Held together, we show that this archi-
tecture provides a rich set of data structures
and procedures that provide the foundations for
grounding the meaning of certain classes of
words.
1 Introduction
Language enables people to talk about the world, past,
present, and future, real and imagined. For a robot to
do the same, it must ground language in its world as
mediated by its perceptual, motor, and cognitive capac-
ities. Many words that refer to entities in the world can
be grounded through sensory-motor associations. For in-
stance, the meaning of ball includes perceptual associ-
ations that encode how balls look and predictive models
of how balls behave. The representation of touch must in-
clude procedural associations that encode how to perform
the action, and perceptual encodings to recognize the ac-
tion in others. In this view, words serve as labels for per-
ceptual or action concepts. When a word is uttered, the
underlying concept is communicated since the speaker
and listener maintain similar associations. This basic
approach underlies most work to date in building ma-
chines that ground language (Bailey, 1997; Narayanan,
1997; Regier and Carlson, 2001; Roy and Pentland, 2002;
Siskind, 2001; Lammens, 1994; Steels, 2001).
Not all words, however, can be grounded in terms
of perceptual and procedural representations, even when
used in concrete situations. In fact, in even the simplest
conversations about everyday objects, events, and rela-
tions, we run into problems. Consider a person and a
robot sitting across a table from each other, engaged in
coordinated activity involving manipulation of objects.
After some interaction, the person says to the robot:
Touch the heavy blue thing that was on my left.
To understand and act on this command in context,
consider the range of knowledge representations that the
robot must bind words of this utterance to. Touch can
be grounded in a visually-guided motor program that en-
ables the robot to move towards and touch objects. This
is an example of a procedural association which also crit-
ically depends on perception to guide the action. Heavy
specifies a property of objects which involves affordances
that intertwine procedural representations with percep-
tual expectations (Gibson, 1979). Blue and left specify
visual properties. Thing must be grounded in terms of
both perception and affordances (one can see an object,
and expect to reach out and touch it). Was triggers a ref-
erence to the past. My triggers a shift of perspective in
space.
We have developed an architecture in which a physical
robot is coupled with a physical simulator to provide the
basis for grounding each of these classes of lexical se-
mantics1. This workshop paper provides an abbreviated
1We acknowledge that the words in this example, like most
words, have numerous additional connotations that are not cap-
tured by the representations that we have suggested. For exam-
ple, words such as touch, heavy and blue can be used metaphor-
ically to refer to emotional actions and states. Things are not al-
ways physical perceivable objects, my usually indicates posses-
sion, and so forth. Barwise and Perry use the phrase “efficiency
of language” to highlight the situation-dependent reusability of
words and utterances (Barwise and Perry, 1983). However, for
version of a forthcoming paper (Roy et al., forthcoming
2003).
The robot, called Ripley, is driven by compliant actu-
ators and is able to manipulate small objects. Ripley has
cameras, touch, and various other sensors on its “head”.
Force sensors in each actuated joint combined with po-
sition sensors provide the robot with a sense of proprio-
ception. Ripley’s visual and proprioceptive systems drive
a physical simulator that keeps a constructed version of
the world (that includes Ripley’s own physical body) in
synchronization with Ripley’s noisy perceptual input. An
object permanence module determines when to instanti-
ate and destroy objects in the model based on perceptual
evidence. Once instantiated, perception can continue to
influence the properties of an object in the model, but
knowledge of physical world dynamics is built into the
simulator and counteracts ’unreasonable’ percepts.
Language is grounded in terms of associations with el-
ements of this perceptually driven world model, as well
as direct groundings in terms of sensory and motor rep-
resentations. Although the world model directly reflects
reality, the state of the model is the result of an interpre-
tation process that compiles perceptual input into a sta-
ble registration of the environment. As opposed to direct
perception, the world model affords the ability to assume
arbitrary points of view through the use of synthetic vi-
sion which operates within the physical model, enabling
a limited form of “out of body experience”. This ability
is essential to successfully differentiate the semantics of
my left versus your left. Non-linguistic cues such as the
visual location of the communication partners can be in-
tegrated with linguistic input to context-appropriate per-
spective shifts. Shifts of perspective in time and space
may be thought of as semantic modulation functions. Al-
though the meaning of “left” in one sense remains con-
stant across usages, the words “my” and “your” modu-
late the meaning by swapping frames of reference. We
suspect that successful use of language requires constant
modulations of meanings of this and related kinds.
We describe the robot and simulator, and mechanisms
for real-time coupling. We then discuss mechanisms
within this architecture designed for the purposes of
grounding the semantics of situated, natural spoken con-
versation. Although no language understanding system
has yet been constructed, we conclude by sketching how
the semantics of each of the words and the whole utter-
ance discussed above can be grounded in the data struc-
tures and processes provided by this architecture. This
work represents steps towards our long term goal of de-
veloping robots and other machines that use language in
the utterance and context that we have described, the ground-
ings listed above play essential roles. It may be argued that
other senses of words are often metaphoric extensions of these
embodied representations (Lakoff and Johnson, 1980).
human-like ways by leveraging deep, grounded represen-
tations of meaning that “hook” into the world through
machine perception, action, and higher layers of cogni-
tive processes. The work has theoretical implications on
how language is represented and processed by machine,
and also has practical applications where natural human-
robot interaction is needed such as deep-sea robot con-
trol, remote handling of hazardous materials by robots,
and astronaut-robot communication in space.
2 Background
Although robots, speech recognizers, and speech synthe-
sizers can easily be connected in shallow ways, the re-
sults are limited to canned behavior. The proper inte-
gration of language in a robot highlights deep theoret-
ical issues that touch on virtually all aspects of artifi-
cial intelligence (and cognitive science) including per-
ception, action, memory, and planning. Along with other
researchers, we use the term grounding to refer to prob-
lem of anchoring the meaning of words and utterances in
terms of non-linguistic representations that the language
user comes to know through some combination of evolu-
tionary and lifetime learning.
A natural approach is to connect words to perceptual
classifiers so that the appearance of an object, event, or
relation in the environment can instantiate a correspond-
ing word in the robot. This basic idea has been applied
in many speech-controlled robots over the years (Brown
et al., 1992; McGuire et al., 2002; Crangle and Suppes,
1994).
Detailed models have been suggested for sensory-
motor representations underlying color (Lammens,
1994), spatial relations (Regier, 1996; Regier and Carl-
son, 2001). Models for grounding verbs include ground-
ing verb meanings in the perception of actions (Siskind,
2001), and grounding in terms of motor control programs
(Bailey, 1997; Narayanan, 1997). Object shape is clearly
important when connection language to the world, but re-
mains a challenging problem in computational models of
language grounding. Landau and Jackendoff provide a
detailed analysis of additional visual shape features that
play a role in language (Landau and Jackendoff, 1993).
In natural conversation, people speak and gesture to
coordinate joint actions (Clark, 1996). Speakers and lis-
teners use various aspects of their physical environment
to encode and decode utterance meanings. Communica-
tion partners are aware of each other’s gestures and foci
of attention and integrate these source of information into
the conversational process. Motivated by these factors,
recent work on social robots have explored mechanisms
that provide visual awareness of human partners’ gaze
and other facial cues relevant for interaction (Breazeal,
2003; Scassellati, 2002).
3 Ripley: An Interactive Robot
Ripley was designed specifically for the purposes of ex-
ploring questions of grounded language, and interactive
language acquisition. The robot has a range of motions
that enables him to move objects around on a tabletop
placed in front of him. Ripley can also look up and make
“eye contact” with a human partner. Three primary con-
siderations drove the design of the robot: (1) We are in-
terested in the effects of changes of visual perspective and
their effects on language and conversation, (2) Sensory-
motor grounding of verbs. (3) Human-directed training
of motion. For example, to teach Ripley the meaning of
“touch”, we use “show-and-tell” training in which exem-
plars of the word (in this case, motor actions) can be pre-
sented by a human trainer in tandem with verbal descrip-
tions of the action.
To address the first consideration, Ripley has cameras
placed on its head so that all motions of the body lead to
changes of view point. This design decision leads to chal-
lenges in maintaining stable perspectives in a scene, but
reflect the type of corrections that people must also con-
stantly perform. To support acquisition of verbs, Ripley
has been designed with a “mouth” that can grasp objects
and enable manipulation. As a result, the most natural
class of verbs that Ripley will learn involve manual ac-
tions such as touching, lifting, pushing, and giving. To
address the third consideration, Ripley is actuated with
compliant joints, and has “training handles”. In spite of
the fact that the resulting robot resembles an arm more
than a torso, it nonetheless serves our purposes as a vehi-
cle for experiments in situated, embodied, conversation.
In contrast, many humanoid robots are not actually able
to move their torso’s to a sufficient degree to obtain sig-
nificant variance in visual perspectives, and grasping is
often not achieved in these robots due to additional com-
plexities of control. This section provides a description of
Ripley’s hardware and low level sensory processing and
motor control software layers.
3.1 Mechanical Structure and Actuation
The robot is essentially an actuated arm, but since cam-
eras and other sensors are placed on the gripper, and the
robot is able to make “eye contact”, we often think of
the gripper as the robot’s head. The robot has seven de-
grees of freedom (DOF’s) including a 2-DOF shoulder,
1-DOF elbow, 3-DOF wrist / neck, and 1-DOF gripper
/ mouth. Each DOF other than the gripper is actuated
by series-elastic actuators (Pratt et al., 2002) in which all
force from electric motors are transferred through torsion
springs. Compression sensors are placed on each spring
and used for force feedback to the low level motion con-
troller. The use of series-elastic actuators gives Ripley the
ability to precisely sense the amount of force that is being
applied at each DOF, and leads to compliant motions.
3.2 Motion Control
A position-derivative control loop is used to track target
points that are sequenced to transit smoothly from the
starting point of a motion gesture to the end point. Nat-
ural motion trajectories are learned from human teachers
through manual demonstrations.
The robot’s motion is controlled in a layered fashion.
The lowest level is implemented in hardware and consists
of a continuous control loop between motor amplifiers
and force sensors of each DOF. At the next level of con-
trol, a microcontroller implements a position-derivative
(PD) control loop with a 5 millisecond cycle time. The
microcontroller accepts target positions from a master
controller and translates these targets into force com-
mands via the PD control loop. The resulting force com-
mands are sent down stream to the motor amplifier con-
trol loop. The same force commands are also sent up
stream back to the master controller, serving as dynamic
proprioceptive force information
To train motion trajectories, the robot is put in a grav-
ity canceling motor control mode in which forces due
to gravity are estimated based on the robot’s joint po-
sitions and counteracted through actuation. While in
this mode, a human trainer can directly move the robot
through desired motion trajectories. Motion trajectories
are recorded during training. During playback, motion
trajectories can be interrupted and smoothly revised to
follow new trajectories as determined by higher level con-
trol. We have also implemented interpolative algorithms
that blend trajectories to produce new motions that be-
yond the training set.
3.3 Sensory System and Visual Processing
Ripley’s perceptual system is based on several kinds of
sensors. Two color video cameras, a three-axis tilt ac-
celerometer (for sensing gravity), and two microphones
are mounted in the head. Force sensitive resistors provide
a sense of touch on the inside and outside surfaces of the
gripper fingers. In the work reported here, we make use of
only the visual, touch, and force sensors. The remaining
sensors will be integrated in the future. The microphones
have been used to achieve sound source localization and
will play a role in maintaining “eye contact” with com-
munication partners. The accelerometer will be used to
help correct frames of reference of visual input.
Complementing the motor system is the robot’s sensor
system. One of the most important sets of sensors is the
actuator set itself; as discussed, the actuators are force-
controlled, which means that the control loop adjusts the
force that is output by each actuator. This in turn means
that the amount of force being applied at each joint is
known. Additionally, each DOF is equipped with abso-
lute position sensors that are used for all levels of motion
control and for maintaining the zero-gravity mode.
The vision system is responsible for detecting ob-
jects in the robot’s field of view. A mixture of Gaus-
sians is used to model the background color and pro-
vides foreground/background classification. Connected
regions with uniform color are extracted from the fore-
ground regions. The three-dimensional shape of an object
is represented using histograms of local geometric fea-
tures, each of which represents the silhouette of the ob-
ject from a different viewpoint. Three dimension shapes
are represented in a view-based approach using sets of
histograms. The color of regions is represented using his-
tograms of illumination-normalized RGB values. Details
of the shape and color representations can be found in
(Roy et al., 1999).
To enable grounding of spatial terms such as “above”
and “left”, a set of spatial relations similar to (Regier,
1996) is measured between pair of objects. The first fea-
ture is the angle (relative to the horizon) of the line con-
necting the centers of area of an object pair. The second
feature is the shortest distance between the edges of the
objects. The third spatial feature measures the angle of
the line which connects the two most proximal points of
the objects.
The representations of shape, color, and spatial rela-
tions described above can also be generated from virtual
scenes based on Ripley’s mental model as described be-
low. Thus, the visual features can serve as a means to
ground words in either real time camera grounded vision
or simulated synthetic vision.
3.4 Visually-Guided Reaching
Ripley can reach out and touch objects by interpolating
between recorded motion trajectories. A set of sample
trajectories are trained by placing objects on the tabletop,
placing Ripley in a canonical position so that the table
is in view, and then manually guiding the robot until it
touches the object. A motion trajectory library is col-
lected in this way, with each trajectory indexed by the
position of the visual target. To reach an object in an arbi-
trary position, a linear interpolation between trajectories
is computed.
3.5 Encoding Environmental Affordances: Object
Weight and Compliance
Words such as “heavy” and “soft” refer to properties of
objects that cannot be passively perceived, but require
interaction with the object. Following Gibson (Gibson,
1979), we refer to such properties of objects as affor-
dances. The word comes from considerations of what
an object affords to an agent who interacts with it. For
instance, a light object can be lifted with ease as opposed
to a heavy object. To assess the weight of an unknown
object, an agent must actually lift (or at least attempt
to lift) it and gauge the level of effort required. This is
precisely how Ripley perceives weight. When an object
is placed in Ripley’s mouth, a motor routine is initiated
which tightly grasps the object and then lifts and low-
ers the object three times. While the motor program is
running, the forces experienced in each DOF (Section
3.2) are monitored. In initial word learning experiments,
Ripley is handed objects of various masses and provided
word labels. A simple Bayes classifier was trained to dis-
tinguish the semantics of “very light”, “light”, “heavy”,
and “very heavy”. In a similar vein, we also grounded
the semantics of “hard” and “soft” in terms of grasping
motor routines that monitor pressure changes at each fin-
gertip as a function of grip displacement.
4 A Perceptually-Driven “Mental Model”
Ripley integrates real-time information from its visual
and proprioceptive systems to construct an “internal
replica”, or mental model of its environment that best
explains the history of sensory data that Ripley has ob-
served2. The mental model is built upon the ODE rigid
body dynamics simulator (Smith, 2003). ODE provides
facilities for modeling the dynamics of three dimensional
rigid objects based on Newtonian physics. As Rip-
ley’s physical environment (which includes Ripley’s own
body) changes, perception of these changes drive the cre-
ation, updating, and destruction of objects in the mental
model. Although simulators are typically used in place of
physical systems, we found physical simulation to be an
ideal substrate for implementing Ripley’s mental model
(for coupled on-line simulation, see also (Cao and Shep-
herd, 1989; Davis, 1998; Surdu, 2000)).
The mental model mediates between perception of the
objective world on one hand, and the semantics of lan-
guage on the other. Although the mental model reflects
the objective environment, it is biased as a result of a pro-
jection through Ripley’s particular sensory complex. The
following sections describe the simulator, and algorithms
for real-time coupling to Ripley’s visual and propriocep-
tive systems.
The ODE simulator provides an interface for creating
and destroying rigid objects with arbitrary polyhedron ge-
ometries placed within a 3D virtual world. Client pro-
grams can apply forces to objects and update their proper-
ties during simulation. ODE computes basic Newtonian
updates of object positions at discrete time steps based
object masses and applied forces. Objects in ODE are
currently restricted to two classes. Objects in Ripley’s
workspace (the tabletop) are constrained to be spheres of
fixed size. Ripley’s body is modeled within the simula-
2Mental models have been proposed as a central mechanism
in a broad range of cognitive capacities (Johnson-Laird, 1983).
Figure 1: Ripley looks down at objects on a tabletop.
tor as a configuration of seven connected cylindrical links
terminated with a rectangular head that approximate the
dimensions and mass of the physical robot. We introduce
the following notation in order to describe the simulator
and its interaction with Ripley’s perceptual systems.
4.1 Coupling Perception to the Mental Model
An approximate physical model of Ripley’s body is built
into the simulator. The position sensors from the 7 DOFs
are used to drive a PD control loop that controls the joint
forces applied to the simulated robot. As a result, motions
of the actual robot are followed by dampened motions of
the simulated robot.
A primary motivation for introducing the mental model
is to register, stabilize, and track visually observed ob-
jects in Ripley’s environment. An object permanence
module, called the Objecter, has been developed as a
bridge between raw visual analysis and the physical sim-
ulator. When a visual region is found to stably exist for a
sustained period of time, an object is instantiated by the
Objecter in the ODE physical simulator. It is only at this
point that Ripley becomes “aware” of the object and is
able to talk about it. Once objects are instantiated in the
mental model, they are never destroyed. If Ripley looks
away from an object such that the object moves out of
view, a representation of the object persists in the mental
model. Figure 1 shows an example of Ripley looking over
the workspace with four objects in view. In Figure 2, the
left image shows the output from Ripley’s head-mounted
camera, and the right image shows corresponding simu-
lated objects which have been registered and are being
tracked.
The Objecter consists of two interconnected compo-
nents. The first component, the 2D-Objecter, tracks two-
dimension visual regions generated by the vision sys-
tem. The 2D-Objecter also implements a hysteresis func-
tion which detects visual regions that persist over time.
Figure 2: Visual regions and corresponding simulated ob-
jects in Ripley’s mental model corresponding to the view
from Figure 1
Figure 3: By positioning a synthetic camera at the posi-
tion approximating the human’s viewpoint, Ripley is able
to “visualize” the scene from the person’s point of view
which includes a partial view of Ripley.
The second component, the 3D-Objecter, takes as in-
put persistent visual regions from the 2D-Objecter, which
are brought into correspondence with a full three dimen-
sional physical model which is held by ODE. The 3D-
Objecter performs projective geometry calculations to ap-
proximate the position of objects in 3D based on 2D re-
gion locations combined with the position of the source
video camera (i.e., the position of Ripley’s head). Each
time Ripley moves (and thus changes his vantage point),
the hysteresis functions in the 2D-Objecter are reset, and
after some delay, persistent regions are detected and sent
to the 3D-Objecter. No updates to the mental model are
performed while Ripley is in motion. The key problem in
both the 2D- and 3D-Objecter is to maintain correspon-
dence across time so that objects are tracked and persist
in spite of perceptual gaps. Details of the Objecter will
be described in (Roy et al., forthcoming 2003).
4.2 Synthetic Vision and Imagined Changes of
Perspective
The ODE simulator is integrated with the OpenGL 3D
graphics environment. Within OpenGL, a 3D environ-
ment may be rendered from an arbitrary viewpoint by
positioning and orienting a synthetic camera and render-
Figure 4: Using virtual shifts of perspective, arbitrary
vantage points may be taken. The (fixed) location of the
human partner is indicated by the figure on the left.
ing the scene from the camera’s perspective. We take ad-
vantage of this OpenGL functionality to implement shifts
in perspective without physically moving Ripley’s pose.
For example, to view the workspace from the human part-
ner’s point of view, the synthetic camera is simply moved
to the approximate position of the person’s head (which
is currently a fixed location). Continuing our example,
Figures 3 and 4 show examples of two synthetic views
of the situation from Figures 1 and 2. The visual analy-
sis features described in Section 3.3 can be applied to the
images generated by synthetic vision.
4.3 Event-Based Memory
A simple form of run length encoding is used to com-
pactly represent mental model histories. Each time an ob-
ject changes a properties more than a set threshold using
a distance measure that combines color, size, and loca-
tion disparities, an event is detected in the mental model.
Thus stretches of time in which nothing in the environ-
ment changes are represented by a single frame of data
and a duration parameter with a relatively large value.
When an object changes properties, such as its position,
an event is recorded that only retains the begin and end
point of the trajectory but discards the actual path fol-
lowed during the motion. As a result, references to the
past are discretized in time along event boundaries. There
are many limitations to this approach to memory, but as
we shall see, it may nonetheless be useful in grounding
past tense references in natural language.
5 Putting the Pieces Together
We began by asking how a robot might ground the mean-
ing of the utterance, “Touch the heavy blue thing that was
on my left”. We are now able to sketch an answer to this
question. Ripley’s perceptual system, and motor control
system, and mental model each contribute elements for
grounding the meaning of this utterance. In this section,
we informally show how the various components of the
architecture provide a basis for language grounding.
The semantic grounding of each word in our ex-
ample utterance is presented using algorithmic descrip-
tions reminiscent of the procedural semantics developed
by Winograd (Winograd, 1973) and Miller & Johnson
(Miller and Johnson-Laird, 1976). To begin with a simple
case, the word “blue” is a property that may be defined as:
property Blue(x){
c←GetColorModel(x)
return fblue(c)
}
The function returns a scalar value that indicates how
strongly the color of object x matches the expected color
model encoded in fblue. The color model would be en-
coded using the color histograms and histogram com-
parison methods described in Section 3.3. The function
GetColorModel() would retrieve the color model of x
from memory, and if not found, call on motor procedures
to look at x and construct a model.
“Touch” can be grounded in the perceptually-guided
motor procedure described in Section 3.4. This reaching
gesture terminates successfully when the touch sensors
are activated and the visual system reports that the target
x remains in view:
procedure Touch(x){
repeat
Reach-towards(x)
until touch sensor(s) activated
if x in view then
return success
else
return failure
end if
}
Along similar lines, it is also useful to define a weigh
procedure (which has been implemented as described in
Section 3.5):
procedure Weigh(x){
Grasp(x)
resistance←0
while Lift(x) do
resistance←resistance + joint forces
end while
return resistance
}
Weigh() monitors the forces on the robot’s joints as it
lifts x. The accumulated forces are returned by the func-
tion. This weighing procedure provides the basis for
grounding “heavy”:
property Heavy(x){
w←GetWeight(x)
return fheavy(w)
}
Similar to GetColorModel(), GetWeight() would first
check if the weight of x is already known, and if not,
then it would optionally call Weigh() to determine the
weight.
To define “the”, “my”, and “was”, it is useful to intro-
duce a data structure that encodes contextual factors that
are salient during language use:
structure Context{
Point-of-view
Working memory
}
The point of view encodes the assumed perspective for
interpreting spatial language. The contents of working
memory would include, by default, objects currently in
the workspace and thus instantiated in Ripley’s mental
model of the workspace. However, past tense markers
such as “was” can serve as triggers for loading salient el-
ements of Ripley’s event-based memory into the working
model. To highlight its effect on the context data struc-
ture, Was() is defined as a context-shift function:
context-shift Was(context){
Working memory←Salient events from mental model
history
}
“Was” triggers a request from memory (Section 4.3) for
objects which are added to working memory, making
them accessible to other processes. The determiner “the”
indicates the selection of a single referent from working
memory:
determiner The(context){
Select most salient element from working memory
}
In the example, the semantics of “my” can be grounded
in the synthetic visual perspective shift operation de-
scribed in Section 4.2:
context-shift My(context){
context.point-of-view←GetPointOfView(speaker)
}
Where GetPointOfV iew(speaker) obtains the spatial
position and orientation of the speaker’s visual input.
“Left” is also grounded in a visual property model
which computes a geometric spatial function (Section
3.3) relative to the assumed point of view:
property Left(x, context){
trajector←GetPosition(x)
return fleft(trajector,context.point−of−view)
}
GetPosition(), like GetColorModel() would use the
least effortful means for obtaining the position of x. The
function fleft evaluates how well the position of x fits
a spatial model relative to the point of view determined
from context.
“Thing” can be grounded as:
object Thing(x){
if (IsTouchable(x) and IsViewable(x)) return true;
else return false
}
This grounding makes explicit use of two affordances of
a thing, that it be touchable and viewable. Touchability
would be grounded using Touch() and viewability based
on whether x has appeared in the mental model (which is
constructed based on visual perception).
The final step in interpreting the utterance is to com-
pose the semantics of the individual words in order to
derive the semantics of the whole utterance. We address
the problem of grounded semantic composition in detail
elsewhere (Gorniak and Roy, protectforthcoming 2003).
For current purposes, we assume that a syntactic parser
is able to parse the utterance and translate it into a nested
set of function calls:
Touch(The(Left(My(Heavy(Blue(Thing(Was(context)))))))))
The innermost argument, context, includes the as-
sumed point of view and contents of working memory.
Each nested function modifies the contents of context
by either shifting points of view, loading new contents
into working memory, or sorting / highlighting contents
of working memory. The Touch() procedure finally acts
on the specified argument.
This concludes our sketch of how we envision the im-
plemented robotic architecture would be used to ground
the semantics of the sample sentence. Clearly many im-
portant details have been left out of the discussion. Our
intent here is to convey only an overall gist of how lan-
guage would be coupled to Ripley. Our current work is
focused on the realization of this approach using spoken
language input.

References
D. Bailey. 1997. When push comes to shove: A computa-
tional model of the role of motor control in the acqui-
sition of action verbs. Ph.D. thesis, Computer science
division, EECS Department, University of California
at Berkeley.
Jon Barwise and John Perry. 1983. Situations and Atti-
tudes. MIT-Bradford.
Cynthia Breazeal. 2003. Towards sociable robots.
Robotics and Autonomous Systems, 42(3-4).
Michael K. Brown, Bruce M. Buntschuh, and Jay G.
Wilpon. 1992. SAM: A perceptive spoken lan-
guage understanding robot. IEEE Transactions on Sys-
tems, Man, and Cybernetics, 22 . IEEE Transactions
22:1390–1402.
F. Cao and B. Shepherd. 1989. Mimic: a robot planning
environment integrating real and simulated worlds. In
IEEE International Symposium on Intelligent Control,
page 459464.
Herbert Clark. 1996. Using Language. Cambridge Uni-
versity Press.
C. Crangle and P. Suppes. 1994. Language and Learning
for Robots. CSLI Publications, Stanford, CA.
W. J. Davis. 1998. On-line simulation: Need and
evolving research requirements. In J. Banks, editor,
Handbook of Simulation: Principles, Methodology,
Advances, Applications and Practice. Wiley.
James J. Gibson. 1979. The Ecological Approach to Vi-
sual Perception. Erlbaum.
Peter Gorniak and Deb Roy. forthcoming, 2003.
Grounded semantic composition for visual scenes.
P.N. Johnson-Laird. 1983. Mental Models: Towards a
Cognitive Science of Language, Inference, and Con-
sciousness. Cambridge University Press.
George Lakoff and Mark Johnson. 1980. Metaphors We
Live By. University of Chicago Press, Chicago.
Johan M. Lammens. 1994. A computational model of
color perception and color naming. Ph.D. thesis, State
University of New York.
Barbara Landau and Ray Jackendoff. 1993. ”What” and
”where” in spatial language and spatial cognition. Be-
havioral and Brain Sciences, 16:217–265.
P. McGuire, J. Fritsch, J.J. Steil, F. Roethling, G.A. Fink,
S. Wachsmuth, G. Sagerer, and H. Ritter. 2002. Multi-
modal human-machine communication for instructing
robot grasping tasks. In Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and
Systems (IROS).
George Miller and Philip Johnson-Laird. 1976. Lan-
guage and Perception. Harvard University Press.
Srinivas Narayanan. 1997. KARMA: Knowledge-based
active representations for metaphor and aspect. Ph.D.
thesis, University of California Berkeley.
J. Pratt, B. Krupp B, and C. Morse. 2002. Series elas-
tic actuators for high fidelity force control. Industrial
Robot, 29(3):234–241.
T. Regier and L. Carlson. 2001. Grounding spatial lan-
guage in perception: An empirical and computational
investigation. Journal of Experimental Psychology,
130(2):273–298.
Terry Regier. 1996. The human semantic potential. MIT
Press, Cambridge, MA.
Deb Roy and Alex Pentland. 2002. Learning words from
sights and sounds: A computational model. Cognitive
Science, 26(1):113–146.
Deb Roy, Bernt Schiele, and Alex Pentland. 1999.
Learning audio-visual associations from sensory in-
put. In Proceedings of the International Conference
of Computer Vision Workshop on the Integration of
Speech and Image Understanding, Corfu, Greece.
Deb Roy, Kai-Yuh Hsiao, and Nick Mavridis. forthcom-
ing, 2003. Coupling robot perception and on-line sim-
ulation: Towards grounding conversational semantics.
Brian Scassellati. 2002. Theory of mind for a humanoid
robot. Autonomous Robots, 12:13–24.
Jeffrey Siskind. 2001. Grounding the Lexical Semantics
of Verbs in Visual Perception using Force Dynamics
and Event Logic. Journal of Artificial Intelligence Re-
search, 15:31–90.
R Smith. 2003. ODE: Open dynamics engine.
Luc Steels. 2001. Language games for autonomous
robots. IEEE Intelligent Systems, 16(5):16–22.
John R. Surdu. 2000. Connecting simulation to the
mission operational environment. Ph.D. thesis, Texas
A&M.
T. Winograd, 1973. A Process model of Language Un-
derstanding, pages 152–186. Freeman.
