<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0610">
  <Title>Conversational Robots: Building Blocks for Grounding Word Meaning</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Ripley: An Interactive Robot
</SectionTitle>
    <Paragraph position="0"> Ripley was designed specifically for the purposes of exploring questions of grounded language, and interactive language acquisition. The robot has a range of motions that enables him to move objects around on a tabletop placed in front of him. Ripley can also look up and make &amp;quot;eye contact&amp;quot; with a human partner. Three primary considerations drove the design of the robot: (1) We are interested in the effects of changes of visual perspective and their effects on language and conversation, (2) Sensorymotor grounding of verbs. (3) Human-directed training of motion. For example, to teach Ripley the meaning of &amp;quot;touch&amp;quot;, we use &amp;quot;show-and-tell&amp;quot; training in which exemplars of the word (in this case, motor actions) can be presented by a human trainer in tandem with verbal descriptions of the action.</Paragraph>
    <Paragraph position="1"> To address the first consideration, Ripley has cameras placed on its head so that all motions of the body lead to changes of view point. This design decision leads to challenges in maintaining stable perspectives in a scene, but reflect the type of corrections that people must also constantly perform. To support acquisition of verbs, Ripley has been designed with a &amp;quot;mouth&amp;quot; that can grasp objects and enable manipulation. As a result, the most natural class of verbs that Ripley will learn involve manual actions such as touching, lifting, pushing, and giving. To address the third consideration, Ripley is actuated with compliant joints, and has &amp;quot;training handles&amp;quot;. In spite of the fact that the resulting robot resembles an arm more than a torso, it nonetheless serves our purposes as a vehicle for experiments in situated, embodied, conversation.</Paragraph>
    <Paragraph position="2"> In contrast, many humanoid robots are not actually able to move their torso's to a sufficient degree to obtain significant variance in visual perspectives, and grasping is often not achieved in these robots due to additional complexities of control. This section provides a description of Ripley's hardware and low level sensory processing and motor control software layers.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Mechanical Structure and Actuation
</SectionTitle>
      <Paragraph position="0"> The robot is essentially an actuated arm, but since cameras and other sensors are placed on the gripper, and the robot is able to make &amp;quot;eye contact&amp;quot;, we often think of the gripper as the robot's head. The robot has seven degrees of freedom (DOF's) including a 2-DOF shoulder, 1-DOF elbow, 3-DOF wrist / neck, and 1-DOF gripper / mouth. Each DOF other than the gripper is actuated by series-elastic actuators (Pratt et al., 2002) in which all force from electric motors are transferred through torsion springs. Compression sensors are placed on each spring and used for force feedback to the low level motion controller. The use of series-elastic actuators gives Ripley the ability to precisely sense the amount of force that is being applied at each DOF, and leads to compliant motions.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Motion Control
</SectionTitle>
      <Paragraph position="0"> A position-derivative control loop is used to track target points that are sequenced to transit smoothly from the starting point of a motion gesture to the end point. Natural motion trajectories are learned from human teachers through manual demonstrations.</Paragraph>
      <Paragraph position="1"> The robot's motion is controlled in a layered fashion.</Paragraph>
      <Paragraph position="2"> The lowest level is implemented in hardware and consists of a continuous control loop between motor amplifiers and force sensors of each DOF. At the next level of control, a microcontroller implements a position-derivative (PD) control loop with a 5 millisecond cycle time. The microcontroller accepts target positions from a master controller and translates these targets into force commands via the PD control loop. The resulting force commands are sent down stream to the motor amplifier control loop. The same force commands are also sent up stream back to the master controller, serving as dynamic proprioceptive force information To train motion trajectories, the robot is put in a gravity canceling motor control mode in which forces due to gravity are estimated based on the robot's joint positions and counteracted through actuation. While in this mode, a human trainer can directly move the robot through desired motion trajectories. Motion trajectories are recorded during training. During playback, motion trajectories can be interrupted and smoothly revised to follow new trajectories as determined by higher level control. We have also implemented interpolative algorithms that blend trajectories to produce new motions that beyond the training set.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Sensory System and Visual Processing
</SectionTitle>
      <Paragraph position="0"> Ripley's perceptual system is based on several kinds of sensors. Two color video cameras, a three-axis tilt accelerometer (for sensing gravity), and two microphones are mounted in the head. Force sensitive resistors provide a sense of touch on the inside and outside surfaces of the gripper fingers. In the work reported here, we make use of only the visual, touch, and force sensors. The remaining sensors will be integrated in the future. The microphones have been used to achieve sound source localization and will play a role in maintaining &amp;quot;eye contact&amp;quot; with communication partners. The accelerometer will be used to help correct frames of reference of visual input.</Paragraph>
      <Paragraph position="1"> Complementing the motor system is the robot's sensor system. One of the most important sets of sensors is the actuator set itself; as discussed, the actuators are forcecontrolled, which means that the control loop adjusts the force that is output by each actuator. This in turn means that the amount of force being applied at each joint is known. Additionally, each DOF is equipped with absolute position sensors that are used for all levels of motion control and for maintaining the zero-gravity mode.</Paragraph>
      <Paragraph position="2"> The vision system is responsible for detecting objects in the robot's field of view. A mixture of Gaussians is used to model the background color and provides foreground/background classification. Connected regions with uniform color are extracted from the foreground regions. The three-dimensional shape of an object is represented using histograms of local geometric features, each of which represents the silhouette of the object from a different viewpoint. Three dimension shapes are represented in a view-based approach using sets of histograms. The color of regions is represented using histograms of illumination-normalized RGB values. Details of the shape and color representations can be found in (Roy et al., 1999).</Paragraph>
      <Paragraph position="3"> To enable grounding of spatial terms such as &amp;quot;above&amp;quot; and &amp;quot;left&amp;quot;, a set of spatial relations similar to (Regier, 1996) is measured between pair of objects. The first feature is the angle (relative to the horizon) of the line connecting the centers of area of an object pair. The second feature is the shortest distance between the edges of the objects. The third spatial feature measures the angle of the line which connects the two most proximal points of the objects.</Paragraph>
      <Paragraph position="4"> The representations of shape, color, and spatial relations described above can also be generated from virtual scenes based on Ripley's mental model as described below. Thus, the visual features can serve as a means to ground words in either real time camera grounded vision or simulated synthetic vision.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Visually-Guided Reaching
</SectionTitle>
      <Paragraph position="0"> Ripley can reach out and touch objects by interpolating between recorded motion trajectories. A set of sample trajectories are trained by placing objects on the tabletop, placing Ripley in a canonical position so that the table is in view, and then manually guiding the robot until it touches the object. A motion trajectory library is collected in this way, with each trajectory indexed by the position of the visual target. To reach an object in an arbitrary position, a linear interpolation between trajectories is computed.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Encoding Environmental Affordances: Object
Weight and Compliance
</SectionTitle>
      <Paragraph position="0"> Words such as &amp;quot;heavy&amp;quot; and &amp;quot;soft&amp;quot; refer to properties of objects that cannot be passively perceived, but require interaction with the object. Following Gibson (Gibson, 1979), we refer to such properties of objects as affordances. The word comes from considerations of what an object affords to an agent who interacts with it. For instance, a light object can be lifted with ease as opposed to a heavy object. To assess the weight of an unknown object, an agent must actually lift (or at least attempt to lift) it and gauge the level of effort required. This is precisely how Ripley perceives weight. When an object is placed in Ripley's mouth, a motor routine is initiated which tightly grasps the object and then lifts and lowers the object three times. While the motor program is running, the forces experienced in each DOF (Section 3.2) are monitored. In initial word learning experiments, Ripley is handed objects of various masses and provided word labels. A simple Bayes classifier was trained to distinguish the semantics of &amp;quot;very light&amp;quot;, &amp;quot;light&amp;quot;, &amp;quot;heavy&amp;quot;, and &amp;quot;very heavy&amp;quot;. In a similar vein, we also grounded the semantics of &amp;quot;hard&amp;quot; and &amp;quot;soft&amp;quot; in terms of grasping motor routines that monitor pressure changes at each fingertip as a function of grip displacement.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 A Perceptually-Driven &quot;Mental Model&quot;
</SectionTitle>
    <Paragraph position="0"> Ripley integrates real-time information from its visual and proprioceptive systems to construct an &amp;quot;internal replica&amp;quot;, or mental model of its environment that best explains the history of sensory data that Ripley has observed2. The mental model is built upon the ODE rigid body dynamics simulator (Smith, 2003). ODE provides facilities for modeling the dynamics of three dimensional rigid objects based on Newtonian physics. As Ripley's physical environment (which includes Ripley's own body) changes, perception of these changes drive the creation, updating, and destruction of objects in the mental model. Although simulators are typically used in place of physical systems, we found physical simulation to be an ideal substrate for implementing Ripley's mental model (for coupled on-line simulation, see also (Cao and Shepherd, 1989; Davis, 1998; Surdu, 2000)).</Paragraph>
    <Paragraph position="1"> The mental model mediates between perception of the objective world on one hand, and the semantics of language on the other. Although the mental model reflects the objective environment, it is biased as a result of a projection through Ripley's particular sensory complex. The following sections describe the simulator, and algorithms for real-time coupling to Ripley's visual and proprioceptive systems.</Paragraph>
    <Paragraph position="2"> The ODE simulator provides an interface for creating and destroying rigid objects with arbitrary polyhedron geometries placed within a 3D virtual world. Client programs can apply forces to objects and update their properties during simulation. ODE computes basic Newtonian updates of object positions at discrete time steps based object masses and applied forces. Objects in ODE are currently restricted to two classes. Objects in Ripley's workspace (the tabletop) are constrained to be spheres of fixed size. Ripley's body is modeled within the simula- null tor as a configuration of seven connected cylindrical links terminated with a rectangular head that approximate the dimensions and mass of the physical robot. We introduce the following notation in order to describe the simulator and its interaction with Ripley's perceptual systems.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Coupling Perception to the Mental Model
</SectionTitle>
      <Paragraph position="0"> An approximate physical model of Ripley's body is built into the simulator. The position sensors from the 7 DOFs are used to drive a PD control loop that controls the joint forces applied to the simulated robot. As a result, motions of the actual robot are followed by dampened motions of the simulated robot.</Paragraph>
      <Paragraph position="1"> A primary motivation for introducing the mental model is to register, stabilize, and track visually observed objects in Ripley's environment. An object permanence module, called the Objecter, has been developed as a bridge between raw visual analysis and the physical simulator. When a visual region is found to stably exist for a sustained period of time, an object is instantiated by the Objecter in the ODE physical simulator. It is only at this point that Ripley becomes &amp;quot;aware&amp;quot; of the object and is able to talk about it. Once objects are instantiated in the mental model, they are never destroyed. If Ripley looks away from an object such that the object moves out of view, a representation of the object persists in the mental model. Figure 1 shows an example of Ripley looking over the workspace with four objects in view. In Figure 2, the left image shows the output from Ripley's head-mounted camera, and the right image shows corresponding simulated objects which have been registered and are being tracked.</Paragraph>
      <Paragraph position="2"> The Objecter consists of two interconnected components. The first component, the 2D-Objecter, tracks two-dimension visual regions generated by the vision system. The 2D-Objecter also implements a hysteresis function which detects visual regions that persist over time.  tion approximating the human's viewpoint, Ripley is able to &amp;quot;visualize&amp;quot; the scene from the person's point of view which includes a partial view of Ripley.</Paragraph>
      <Paragraph position="3"> The second component, the 3D-Objecter, takes as input persistent visual regions from the 2D-Objecter, which are brought into correspondence with a full three dimensional physical model which is held by ODE. The 3D-Objecter performs projective geometry calculations to approximate the position of objects in 3D based on 2D region locations combined with the position of the source video camera (i.e., the position of Ripley's head). Each time Ripley moves (and thus changes his vantage point), the hysteresis functions in the 2D-Objecter are reset, and after some delay, persistent regions are detected and sent to the 3D-Objecter. No updates to the mental model are performed while Ripley is in motion. The key problem in both the 2D- and 3D-Objecter is to maintain correspondence across time so that objects are tracked and persist in spite of perceptual gaps. Details of the Objecter will be described in (Roy et al., forthcoming 2003).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Synthetic Vision and Imagined Changes of
Perspective
</SectionTitle>
      <Paragraph position="0"> The ODE simulator is integrated with the OpenGL 3D graphics environment. Within OpenGL, a 3D environment may be rendered from an arbitrary viewpoint by positioning and orienting a synthetic camera and render- null vantage points may be taken. The (fixed) location of the human partner is indicated by the figure on the left. ing the scene from the camera's perspective. We take advantage of this OpenGL functionality to implement shifts in perspective without physically moving Ripley's pose.</Paragraph>
      <Paragraph position="1"> For example, to view the workspace from the human partner's point of view, the synthetic camera is simply moved to the approximate position of the person's head (which is currently a fixed location). Continuing our example, Figures 3 and 4 show examples of two synthetic views of the situation from Figures 1 and 2. The visual analysis features described in Section 3.3 can be applied to the images generated by synthetic vision.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Event-Based Memory
</SectionTitle>
      <Paragraph position="0"> A simple form of run length encoding is used to compactly represent mental model histories. Each time an object changes a properties more than a set threshold using a distance measure that combines color, size, and loca-tion disparities, an event is detected in the mental model.</Paragraph>
      <Paragraph position="1"> Thus stretches of time in which nothing in the environment changes are represented by a single frame of data and a duration parameter with a relatively large value.</Paragraph>
      <Paragraph position="2"> When an object changes properties, such as its position, an event is recorded that only retains the begin and end point of the trajectory but discards the actual path followed during the motion. As a result, references to the past are discretized in time along event boundaries. There are many limitations to this approach to memory, but as we shall see, it may nonetheless be useful in grounding past tense references in natural language.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>