<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0610">
<Title>Conversational Robots: Building Blocks for Grounding Word Meaning</Title>
<Section position="6" start_page="0" end_page="0" type="concl">
<SectionTitle> 5 Putting the Pieces Together </SectionTitle>
<Paragraph position="0"> We began by asking how a robot might ground the meaning of the utterance, &quot;Touch the heavy blue thing that was on my left&quot;. We are now able to sketch an answer to this question. Ripley's perceptual system, motor control system, and mental model each contribute elements for grounding the meaning of this utterance. In this section, we informally show how the various components of the architecture provide a basis for language grounding.</Paragraph>
<Paragraph position="1"> The semantic grounding of each word in our example utterance is presented using algorithmic descriptions reminiscent of the procedural semantics developed by Winograd (Winograd, 1973) and Miller and Johnson-Laird (Miller and Johnson-Laird, 1976). To begin with a simple case, the word &quot;blue&quot; is a property that may be defined as:

property Blue(x){
    return fblue(GetColorModel(x))
}
</Paragraph>
<Paragraph position="3"> The function returns a scalar value that indicates how strongly the color of object x matches the expected color model encoded in fblue. The color model would be encoded using the color histograms and histogram comparison methods described in Section 3.3. The function GetColorModel() would retrieve the color model of x from memory and, if it is not found, call on motor procedures to look at x and construct a model.</Paragraph>
<Paragraph position="4"> &quot;Touch&quot; can be grounded in the perceptually-guided motor procedure described in Section 3.4. This reaching gesture terminates successfully when the touch sensors are activated and the visual system reports that the target has been reached. &quot;Heavy&quot; would be grounded as a property function that checks whether the weight of x is already known and, if not, optionally calls Weigh() to determine the weight.</Paragraph>
<Paragraph position="5"> To define &quot;the&quot;, &quot;my&quot;, and &quot;was&quot;, it is useful to introduce a data structure that encodes contextual factors that are salient during language use. This context structure holds two components: an assumed point of view and a working memory. The point of view encodes the assumed perspective for interpreting spatial language. The contents of working memory would include, by default, objects currently in the workspace and thus instantiated in Ripley's mental model of the workspace. However, past tense markers such as &quot;was&quot; can serve as triggers for loading salient elements of Ripley's event-based memory into the working model. To highlight its effect on the context data structure, Was() is defined as a context-shift function:

context-shift Was(context){
    Working memory ← Salient events from mental model history
}

&quot;Was&quot; triggers a request from memory (Section 4.3) for objects which are added to working memory, making them accessible to other processes. The determiner &quot;the&quot; indicates the selection of a single referent from working memory:

determiner The(context){
    Select most salient element from working memory
}

In the example, the semantics of &quot;my&quot; can be grounded in the synthetic visual perspective shift operation described in Section 4.2:

context-shift My(context){
    context.point-of-view ← GetPointOfView(speaker)
}

where GetPointOfView(speaker) obtains the spatial position and orientation of the speaker's visual input.</Paragraph>
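<Paragraph> To make the flavor of these definitions concrete, the following is a minimal Python sketch of the property and context-shift groundings introduced so far. It is an illustration under simplifying assumptions only: the object and context structures, the placeholder color and weight values, and helper names such as get_color_model() are hypothetical stand-ins rather than the actual Ripley implementation. In keeping with the discussion above, the property functions return graded scores rather than boolean values, and the determiner selects the most salient element of working memory.

# Illustrative sketch only: a minimal Python rendering of the property and
# context-shift groundings described above. All names (Obj, Context, the
# placeholder color and weight values, get_color_model, ...) are hypothetical
# stand-ins, not the actual Ripley implementation.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Obj:
    """A minimal stand-in for an object in Ripley's mental model."""
    name: str
    color_histogram: Optional[List[float]] = None  # filled in by (simulated) vision
    weight: Optional[float] = None                  # filled in by (simulated) weighing
    salience: float = 0.0


@dataclass
class Context:
    """Contextual factors that are salient during language use."""
    point_of_view: str = "robot"
    working_memory: List[Obj] = field(default_factory=list)


def get_color_model(x: Obj) -> List[float]:
    # Use the least effortful means: return a stored model if available,
    # otherwise "look at" x and construct one (here, a placeholder observation).
    if x.color_histogram is None:
        x.color_histogram = [0.2, 0.3, 0.5]
    return x.color_histogram


def blue(x: Obj) -> float:
    """property Blue(x): graded match between x's color model and a blue prototype."""
    model = get_color_model(x)
    blue_prototype = [0.1, 0.2, 0.7]  # hypothetical target histogram
    return sum(min(a, b) for a, b in zip(model, blue_prototype))  # overlap score in [0, 1]


def heavy(x: Obj) -> float:
    """property Heavy(x): use the stored weight if known, otherwise weigh x."""
    if x.weight is None:
        x.weight = 1.0  # placeholder for a call to Weigh()
    return min(x.weight / 2.0, 1.0)  # scaled degree of "heaviness"


def was(context: Context, remembered: List[Obj]) -> Context:
    """context-shift Was(): load salient remembered objects into working memory."""
    context.working_memory.extend(remembered)
    return context


def my(context: Context, speaker_pov: str) -> Context:
    """context-shift My(): adopt the speaker's point of view."""
    context.point_of_view = speaker_pov
    return context


def the(context: Context) -> Obj:
    """determiner The(): select the most salient element of working memory."""
    return max(context.working_memory, key=lambda o: o.salience)


if __name__ == "__main__":
    remembered = [Obj("mug", weight=1.5), Obj("ball", weight=0.2)]
    ctx = my(was(Context(), remembered), speaker_pov="human")
    # Property scores re-rank the contents of working memory before selection.
    for obj in ctx.working_memory:
        obj.salience = heavy(obj) * blue(obj)
    print(ctx.point_of_view, the(ctx).name)
</Paragraph>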
<Paragraph position="6"> &quot;Left&quot; is also grounded in a visual property model which computes a geometric spatial function (Section 3.3) relative to the assumed point of view:

property Left(x, context){
    return fleft(GetPosition(x), context.point-of-view)
}
</Paragraph>
<Paragraph position="7"> GetPosition(), like GetColorModel(), would use the least effortful means for obtaining the position of x. The function fleft evaluates how well the position of x fits a spatial model relative to the point of view determined from context.</Paragraph>
<Paragraph position="8"> &quot;Thing&quot; can be grounded as:

object Thing(x){
    if (IsTouchable(x) and IsViewable(x)) return true;
    else return false
}

This grounding makes explicit use of two affordances of a thing: that it be touchable and viewable. Touchability would be grounded using Touch(), and viewability on whether x has appeared in the mental model (which is constructed based on visual perception).</Paragraph>
<Paragraph position="9"> The final step in interpreting the utterance is to compose the semantics of the individual words in order to derive the semantics of the whole utterance. We address the problem of grounded semantic composition in detail elsewhere (Gorniak and Roy, forthcoming 2003). For current purposes, we assume that a syntactic parser is able to parse the utterance and translate it into a nested set of function calls:

Touch(The(Left(My(Heavy(Blue(Thing(Was(context))))))))

The innermost argument, context, includes the assumed point of view and the contents of working memory. Each nested function modifies the contents of context by either shifting points of view, loading new contents into working memory, or sorting and highlighting the contents of working memory. The Touch() procedure finally acts on the specified argument.</Paragraph>
<Paragraph position="10"> This concludes our sketch of how we envision the implemented robotic architecture would be used to ground the semantics of the sample sentence. Clearly, many important details have been left out of the discussion. Our intent here is to convey only an overall gist of how language would be coupled to Ripley. Our current work is focused on the realization of this approach using spoken language input.</Paragraph>
</Section>
</Paper>