<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2016"> <Title>Controlling Animated Agents in Natural Language</Title> <Section position="3" start_page="0" end_page="92" type="metho"> <SectionTitle> 2 System Overview </SectionTitle> <Paragraph position="0"> A screen shot of K3 is shown in Fig. 1. There are two agents and several objects in a virtual world.</Paragraph> <Paragraph position="1"> The current system accepts simple Japanese utterances with anaphoric and elliptical expressions, such as &quot;Walk to the desk.&quot; and &quot;Further&quot;. The size of the lexicon is about 100 words.</Paragraph> <Paragraph position="2"> The architecture of the K3 system is illustrated in Fig. 2. The speech recognition module receives the user's speech input and generates a sequence of words. The syntactic/semantic analysis module analyzes the word sequence to extract a case frame. This module accepts ill-formed speech input, including postposition omission, inversion, and self-correction. At this stage, not all case slots are necessarily filled, because of ellipses in the utterance. Even when there is no ellipsis, instances of objects are not yet identified at this stage.</Paragraph> <Paragraph position="3"> Resolving ellipses and anaphora, and identifying instances in the world, are performed by the discourse analysis module. Anaphora resolution and instance identification are achieved by using plan knowledge, which will be described in section 3.</Paragraph> <Paragraph position="4"> The discourse analysis module also extracts the user's goal and hands it over to the planning modules, which build a plan to generate the appropriate animation. In other words, the planning modules translate the user's goal into animation data. However, the properties of these two ends are very different, and a straightforward translation is rather difficult. The user's goal is represented in terms of symbols, while the animation data is a sequence of numeric values. To bridge this gap, we take a two-stage approach: macro- and micro-planning.</Paragraph> <Paragraph position="5"> During the macro-planning, the planner needs to know the physical properties of objects, such as their size, location and so on. For example, to pick up a ball, the agent first needs to move to a location at which he can reach the ball. In this planning process, the distance between the ball and the agent needs to be calculated. This sort of information is represented in terms of coordinate values of the virtual space and handled by the micro-planner.</Paragraph> <Paragraph position="6"> To interface the macro- and micro-planning, we introduced the SPACE object, which represents a location in the virtual space by both its symbolic and its numeric character. The SPACE object is described in section 4.</Paragraph>
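As a concrete illustration of the data handed between these modules, the following is a minimal sketch; the CaseFrame class and the module function names are hypothetical stand-ins, not the actual K3 code.

# Hypothetical sketch of the data flow described in this section; none of
# these names come from the K3 implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class CaseFrame:
    """Predicate plus case slots; a slot left as None reflects ellipsis."""
    predicate: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_slots(self) -> List[str]:
        # Slots the discourse analysis module still has to fill.
        return [case for case, filler in self.slots.items() if filler is None]

def syntactic_semantic_analysis(words: List[str]) -> CaseFrame:
    """Extract a case frame; tolerates postposition omission, inversion,
    and self-correction.  Object instances are not identified yet."""
    ...

def discourse_analysis(frame: CaseFrame, history: List[CaseFrame]) -> CaseFrame:
    """Resolve anaphora/ellipsis and identify instances using plan
    knowledge (section 3), yielding the user's goal."""
    ...

def macro_plan(goal: CaseFrame) -> List[str]:
    """Symbol-level plan, e.g. ['walk(space#1)', 'grasp(ball#1)']."""
    ...

def micro_plan(actions: List[str]) -> List[List[float]]:
    """Numeric animation data; symbolic locations are grounded through
    SPACE objects (section 4)."""
    ...

# 'Push it.'  ->  the agent is elided (zero pronoun), the object is a pronoun
frame = CaseFrame('push', {'agent': None, 'object': 'it'})
print(frame.missing_slots())   # ['agent']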
<Paragraph position="7"> 3 Plan-based Anaphora Resolution 3.1 Surface-clue-based Resolution vs. Plan-based Resolution </Paragraph> <Paragraph position="8"> Consider the following two dialogue examples.</Paragraph> <Paragraph position="9"> (1-1) &quot;Agent X, push the red ball.&quot; (1-2) &quot;Move to the front of the blue ball.&quot; (1-3) &quot;Push it.&quot; (2-1) &quot;Agent X, pick up the red ball.&quot; (2-2) &quot;Move to the front of the blue ball.&quot; (2-3) &quot;Put it down.&quot; The second dialogue differs from the first only in the verbs of the first and third utterances. The syntactic structure of each sentence in the second dialogue (2-1)-(2-3) is the same as that of the corresponding sentence in the first dialogue (1-1)-(1-3). However, the pronoun &quot;it&quot; in (1-3) refers to &quot;the blue ball&quot; in (1-2), whereas the pronoun &quot;it&quot; in (2-3) refers to &quot;the red ball&quot; in (2-1). The difference between these two examples cannot be explained by theories based on surface clues, such as centering theory (?; ?).</Paragraph> <Paragraph position="10"> In the setting of SHRDLU-like systems, the user has a certain goal of arranging objects in the world and constructs a plan to achieve it through interaction with the system. As Cohen pointed out, users tend to break up the referring and predicating functions in speech dialogue (?). Thus, each utterance suggests a part of the plan the user is trying to carry out, rather than the whole plan. To avoid redundancy, users need to use anaphora.</Paragraph> <Paragraph position="11"> From these observations, we found that considering the user's plan is indispensable for resolving anaphora in this type of dialogue system, and we developed an anaphora resolution algorithm that uses the relations between utterances in terms of the partial plans (plan operators) corresponding to them.</Paragraph> <Paragraph position="12"> The basic idea is to identify a chain of plan operators based on their effects and preconditions. Our method, explained in the rest of this section, finds preceding utterances that share the same goal as the current utterance with respect to their corresponding plan operators, in addition to using surface linguistic clues.</Paragraph> <Section position="1" start_page="92" end_page="92" type="sub_section"> <SectionTitle> 3.2 Resolution Algorithm </SectionTitle> <Paragraph position="0"> Recognized speech input is transformed into a case frame. At this stage, anaphora is not yet resolved. Based on this case frame, a plan operator is retrieved from the plan library. This process is generally called &quot;plan recognition.&quot; A plan operator used in our system is similar to that of STRIPS (?), and consists of a precondition, an effect, and an action description.</Paragraph> <Paragraph position="1"> Variables in the retrieved plan operator are filled with the case fillers in the utterance. There might be missing case fillers when anaphora (zero pronouns) is used in the utterance. The system tries to resolve these missing elements in the plan operator. To resolve them, the system again uses clue words and the plan library.</Paragraph> <Paragraph position="2"> An overview of the anaphora resolution algorithm is shown in Figure 3.</Paragraph> <Paragraph position="3"> When the utterance includes clue words, the system uses them to search the history database for a preceding utterance that shares the same goal as the current utterance. Then, it identifies the referent on the basis of case matching.</Paragraph> <Paragraph position="4"> There are cases in which the proper preceding utterance cannot be identified even with the clue words. These cases are sent to the left branch in Fig. 3, where the plan library is used to resolve anaphora.</Paragraph> <Paragraph position="5"> When there is no clue word, or the clue word does not help to resolve the anaphora, the process goes through the left branch in Fig. 3. First, the system enumerates candidate referents using surface information, then filters them using linguistic clues and the plan library. For example, demonstratives such as &quot;this&quot; and &quot;that&quot; are usually used for objects that are in the user's view. Therefore, the referent of an anaphor with a demonstrative is restricted to the objects currently in the user's view.</Paragraph> <Paragraph position="6"> If the effect of a plan operator satisfies the precondition of another plan operator, and the utterances corresponding to these plan operators both occur in the discourse, they can be considered to serve the same goal. Thus, identifying a chain of effect-precondition relations gives important information for grouping utterances that share the same goal. We can assume that an anaphor and its referent appear within the same utterance group.</Paragraph> <Paragraph position="7"> Once the utterance group is identified, the system finds the referent by matching variables between plan operators.</Paragraph>
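To make the effect-precondition chaining and variable matching concrete, here is a minimal sketch under assumed operator definitions; PlanOperator, the literals, and the toy plan library are illustrative stand-ins, not the actual K3 plan operators.

# Minimal illustration (not the actual K3 plan library) of STRIPS-like
# operators and of resolving a missing case filler through an
# effect-precondition chain, as in dialogue (2-1)/(2-3).

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PlanOperator:
    action: str
    precondition: List[str]                 # literals such as 'holding(?obj)'
    effect: List[str]
    bindings: Dict[str, Optional[str]] = field(default_factory=dict)

def predicate(literal: str) -> str:
    return literal.split('(')[0]

def resolve_by_chain(current: PlanOperator, history: List[PlanOperator]) -> None:
    """If a preceding operator's effect satisfies a precondition of the
    current operator, the two utterances share a goal; copy the earlier
    binding into the current operator's unfilled variable."""
    for prev in reversed(history):                       # most recent first
        for eff in prev.effect:
            for pre in current.precondition:
                if predicate(eff) == predicate(pre):
                    for var, val in current.bindings.items():
                        if val is None and prev.bindings.get(var) is not None:
                            current.bindings[var] = prev.bindings[var]

# (2-1) 'Agent X, pick up the red ball.'  ->  effect: holding(ball#red)
pick_up = PlanOperator('pickUp', ['reachable(?obj)'], ['holding(?obj)'],
                       {'?obj': 'ball#red'})
# (2-3) 'Put it down.'  ->  '?obj' is left unfilled by the pronoun
put_down = PlanOperator('putDown', ['holding(?obj)'], ['on(?obj, ?place)'],
                        {'?obj': None})

resolve_by_chain(put_down, [pick_up])
print(put_down.bindings['?obj'])   # ball#red: 'it' in (2-3) is the red ball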
<Paragraph position="8"> After filtering, more than one candidate might still remain. In such a case, each candidate is assigned a score calculated from the following factors: saliency, the agent's view, and the user's view.</Paragraph> </Section> </Section> <Section position="4" start_page="92" end_page="94" type="metho"> <SectionTitle> 4 Handling Spatial Vagueness </SectionTitle> <Paragraph position="0"> To interface the macro- and micro-planning, we introduced the SPACE object, which represents a location in the virtual world. Because of space limitations, we explain the SPACE object only briefly.</Paragraph> <Paragraph position="1"> The macro planner uses plan operators described in terms of logical forms. Thus, the SPACE object is designed to behave as a symbolic object during macro-planning, referred to by its unique identifier. On the other hand, a location can be vague, and the most plausible place changes depending on the situation. Therefore, it should be treated as a region rather than as a single point. To fulfill this requirement, we adopt the idea of the potential model proposed by Yamada et al. (?). The vagueness of a location is naturally realized as a potential function embedded in the SPACE object. The most plausible point is calculated on request from the potential function using the steepest descent method.</Paragraph>
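As a rough illustration of this design, the sketch below pairs a symbolic identifier with an assumed Gaussian potential for &quot;in front of&quot;, combines potentials by point-wise multiplication (as used for conjunctions later in this section), and climbs the potential numerically as a stand-in for the steepest descent method; the Gaussian shape and all names are assumptions, not the actual K3 classes or the exact potential model of Yamada et al.

# Illustrative sketch only: a SPACE-like object that is symbolic for the
# macro planner (its id) and numeric for the micro planner (its potential).

import math
from typing import Callable, Tuple

Point = Tuple[float, float]          # (x, z) coordinates on the ground plane

class Space:
    def __init__(self, sid: str, potential: Callable[[Point], float]):
        self.sid = sid               # symbolic identifier, e.g. 'space#1'
        self.potential = potential   # plausibility of each point, in [0, 1]

    def conjoin(self, other: 'Space') -> 'Space':
        # Conjunction of two SPACEs: multiply the potentials point-wise
        # (cf. 'inFrontOf AND reachableByHand AND NOT occupied' below).
        return Space('(' + self.sid + ' AND ' + other.sid + ')',
                     lambda p: self.potential(p) * other.potential(p))

    def most_plausible_point(self, start: Point,
                             step: float = 0.05, iters: int = 2000) -> Point:
        # Numeric hill climbing on the potential; a simple stand-in for the
        # steepest descent method mentioned in the text.
        x, z = start
        eps = 1e-3
        for _ in range(iters):
            gx = (self.potential((x + eps, z)) - self.potential((x - eps, z))) / (2 * eps)
            gz = (self.potential((x, z + eps)) - self.potential((x, z - eps))) / (2 * eps)
            x, z = x + step * gx, z + step * gz
        return (x, z)

def in_front_of(sid: str, ref: Point, front_dir: Point,
                dist: float = 1.0, spread: float = 0.5) -> Space:
    """Assumed Gaussian potential peaking about `dist` in front of `ref`."""
    peak = (ref[0] + front_dir[0] * dist, ref[1] + front_dir[1] * dist)
    def pot(p: Point) -> float:
        d2 = (p[0] - peak[0]) ** 2 + (p[1] - peak[1]) ** 2
        return math.exp(-d2 / (2 * spread ** 2))
    return Space(sid, pot)

# 'in front of the desk', with the desk at the origin facing the +z direction
space1 = in_front_of('space#1', ref=(0.0, 0.0), front_dir=(0.0, 1.0))
print(space1.most_plausible_point(start=(0.3, 0.2)))   # converges near (0.0, 1.0)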
<Paragraph position="2"> Consider the following short conversation between a human (H) and a virtual agent (A).</Paragraph> <Paragraph position="3"> H: Do you see a ball in front of the desk? A: Yes.</Paragraph> <Paragraph position="4"> H: Put it on the desk.</Paragraph> <Paragraph position="5"> When the first utterance is given in the situation shown in Fig. 1, the discourse analysis module identifies an instance of &quot;a ball&quot; in the following steps.</Paragraph> <Paragraph position="6"> In step (A), an instance of SPACE is created as an instance of the class inFrontOf. The constructor of inFrontOf takes three arguments: the reference object, the viewpoint, and the axis order. Although it is also necessary to identify the reference frame, here we focus on the calculation of potential functions given a reference frame.</Paragraph> <Paragraph position="7"> Suppose the parameters of inFrontOf have been resolved in the preceding steps, and the discourse analysis module chooses the MIRROR axis order and the orientation of the axes based on the viewpoint, shown as the light-colored arrows in Fig. 4. The arrow closest to the viewpoint-based &quot;front&quot; axis ((1) in Fig. 4) is chosen as the &quot;front&quot; of the desk. Then, the parameters of the potential function corresponding to &quot;front&quot; are set.</Paragraph> <Paragraph position="8"> In step (B), the method matchObjects() returns a list of objects located in the potential field of space#1 shown in Fig. 5. The objects in the list are sorted in descending order of the potential value at their location.</Paragraph> <Paragraph position="9"> In step (C), the most plausible object satisfying the type constraint (BALL) is selected by the method getFirstMatch().</Paragraph> <Paragraph position="10"> When receiving the next utterance, &quot;Put it on the desk.&quot;, the discourse analysis module resolves the referent of the pronoun &quot;it&quot; and extracts the user's goal.</Paragraph> <Paragraph position="11"> walk(inFrontOf(ball#1, viewpoint#1, MIRROR) AND reachableByHand(ball#1) AND NOT(occupied(ball#1)))</Paragraph> <Paragraph position="12"> The movement walk takes a SPACE object representing its destination as an argument. In this example, the conjunction of three SPACE objects is given as the argument. The potential function of the resultant SPACE is calculated by multiplying the values of the corresponding three potential functions at each point.</Paragraph> <Paragraph position="13"> As this example illustrates, the SPACE object effectively plays the role of a mediator between macro- and micro-planning.</Paragraph> </Section> </Paper>