<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0605"> <Title>An Architecture for Word Learning using Bidirectional Multimodal Structural Alignment</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Implementation with Visual Domain of </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Spatial Relations </SectionTitle> <Paragraph position="0"> In order to validate our general architecture, we outline a system-in-progress which instantiates the architecture for the particular semantic domain of spatial relations. The domain of spatial relations captures the relative positioning, orientation, and movement of objects in space. Examples of sentences capturing spatial semantics include &quot;The boy threw the ball into the box on the table&quot;, &quot;The path went from the tree to the lakeside&quot;, and &quot;The sign points to the door&quot;.</Paragraph> <Paragraph position="1"> The following sections describe the methods and representations we have chosen to satisfy the requirements outlined in Section 2. Figure 2 shows how the system is designed and how the rest of this paper is organized.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Lexical-Conceptual Semantics </SectionTitle> <Paragraph position="0"> We use Lexical-Conceptual Semantics (LCS) (Jackendoff, 1983) as our semantic representation. LCS is a cognitive representation that focuses on trajectories and spatial relations. Unlike other representations such as Logical Form (LF) and Conceptual Dependency (CD), LCS delineates notions of PATHs and PLACEs. LCS is more formally outlined in (Jackendoff, 1983) and is compared to other semantic representations in (Bender, 2001).</Paragraph> <Paragraph position="1"> The following productions yield a simplified portion of the LCS space. 
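The productions referred to above can be sketched in Jackendoff-style notation; the particular inventory of path and place functions shown here is an illustrative reconstruction, not the paper's exact rule set:

```latex
% Sketch of simplified LCS productions (illustrative, following
% Jackendoff-style notation; not the paper's exact rule set).
\begin{align*}
[\mathrm{EVENT}]  &\rightarrow [\,\mathrm{GO}([\mathrm{THING}],\ [\mathrm{PATH}])\,] \\
[\mathrm{STATE}]  &\rightarrow [\,\mathrm{BE}([\mathrm{THING}],\ [\mathrm{PLACE}])\,] \\
[\mathrm{PATH}]   &\rightarrow [\,\{\mathrm{TO},\ \mathrm{FROM},\ \mathrm{TOWARD},\ \mathrm{VIA}\}([\mathrm{THING}]\ \mid\ [\mathrm{PLACE}])\,] \\
[\mathrm{PLACE}]  &\rightarrow [\,\{\mathrm{AT},\ \mathrm{IN},\ \mathrm{ON},\ \mathrm{ABOVE}\}([\mathrm{THING}])\,]
\end{align*}
```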
For a complete description, refer to (Jackendoff, 1983).</Paragraph> <Paragraph position="2"> As presented, LCS describes relations in the physical world. However, it is easily extensible to other domains, such as the temporal and possessive domains (Jackendoff, 1983; Dorr, 1992). Research focusing on using LCS in the abstract domain of social politics is also ongoing in our lab. Furthermore, it seems that much of language is spatial in nature. For example, there is significant psychological evidence that humans use spatial relations to talk about abstract domains such as time (Boroditsky, 2000). As a consequence, we believe that techniques for learning Lexical-Conceptual Semantics for words, developed here using the concrete spatial relations domain, will be extensible to many other domains.</Paragraph> </Section> <Section position="6" start_page="2" end_page="5" type="metho"> <SectionTitle> 5 Language Parsing and Structural Alignment </SectionTitle> <Paragraph position="0"> This section describes the methods used to simultaneously parse linguistic input strings and align the resultant semantic structures with those from vision. The primary architecture of the system is a constraint propagation network using grammatical rules as constraints. A custom constraint network topology is generated for each linguistic input string using a bidirectional search algorithm.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.1 Parsing/Alignment as Constraint Propagation </SectionTitle> <Paragraph position="0"> The parsing and structural alignment system may be viewed as a single large constraint, as in Figure 4. This constraint has two inputs: on one side, it takes a set of semantic representations originating from the vision processor. On the other side, it takes a linguistic input string, together with possible meanings for each word in the string, as determined by a lexicon (treating unknown words as having any possible meaning).
As output, the constraint eliminates, in each input set, all meanings which do not lead to a successful structurally aligned parse.</Paragraph> <Paragraph position="1"> In order to achieve such a complicated constraint, it is useful to decompose the constraint into a network of simpler constraints, each working over a local domain of only a few constituents rather than over the domain of an entire sentence, as in Figure 5. We can then base these subconstraints on grammatical rules over a fixed number of constituents, and trust the composed network to handle the complete sentence.</Paragraph> </Section> <Section position="2" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 5.2 Grammatical Framework for Constraints </SectionTitle> <Paragraph position="0"> The grammar framework chosen for our system is Combinatory Categorial Grammar (CCG) (Steedman, 2000).</Paragraph> <Paragraph position="1"> CCG has many advantages in a system like ours. First, there are only a handful of rules for combining constituents, and these rules are explicit and well defined. These qualities facilitate the implementation of constraints. In addition, CCG is adept at parsing around missing information, because it was designed to handle linguistic phenomena such as parasitic gapping. (An example of a sentence with parasitic gapping is &quot;John hates and Mary loves the movie,&quot; where both verbs share the same object. CCG handles this by treating &quot;John hates&quot; and &quot;Mary loves&quot; as constituents, which can then be conjoined by &quot;and&quot; into a single &quot;John hates and Mary loves&quot; constituent; traditional grammars are unable to recognize &quot;John hates&quot; as a constituent.) The ability to gracefully handle incomplete phrases is crucial in our system, because it enables us to parse around unknown words.</Paragraph> <Paragraph position="2"> [Figure caption (partially recovered): ...functions as a constraint on linguistic and visual interpretations, requiring that expressions follow grammatical rules and that they align with the semantic domain input. This example shows the system presented with the sentence &quot;The dove flies,&quot; and with a corresponding conceptual structure (from vision). In this situation, all the words were known, so the system will simply eliminate the interpretations of &quot;dove&quot; as a GO and &quot;flies&quot; as a THING. If the word &quot;dove&quot; had not been known, the system would still select the verb form of &quot;flies&quot; (by alignment), which brings &quot;dove&quot; into alignment with the appropriate fragment of semantic structure (the THING frame).]</Paragraph> <Paragraph position="3"> [Figure caption (partially recovered): ...Figure 4 is actually implemented as a network of simpler constraints, as shown here. Each constraint implements a grammatical rule, as shown in Figure 6. The topology of the constraint network is dependent on the particular linguistic string, and is constructed by bidirectional search, as described in Section 5.3.]</Paragraph> <Paragraph position="4"> In CCG, syntactic categories can either be atomic elements, or functors of those elements. The atomic elements are usually {S, N, NP}, corresponding to Sentence, Noun, and Noun Phrase. Syntactic category functors are expressed in argument-rightmost curried notation, using / or \ to indicate whether the argument is expected to the right or left, respectively. Thus NP/N indicates an NP requiring an N to the right (and is therefore the syntactic category of a determiner), while (S\NP)/NP indicates an S requiring one NP to the left and one to the right (this is the category of a monotransitive verb). For semantics, the notation is extended to X:f, indicating syntactic category X with lambda calculus semantics f.
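The category notation can be made concrete in code; the encoding below (tuples for functor categories, lambdas for semantics, a GO/THING semantics for the dove example) is our own illustration, not the system's actual data structures:

```python
# Hypothetical encoding of CCG categories and functional application;
# not the paper's actual implementation.
# Atomic categories are strings; functors are tuples (result, slash, argument),
# where "/" means the argument is expected to the right and "\" to the left.
N, NP, S = "N", "NP", "S"
DET = (NP, "/", N)        # NP/N : a determiner, looks right for an N
INTRANS = (S, "\\", NP)   # S\NP : an intransitive verb, looks left for an NP

def forward_apply(left, right):
    """Forward application: X/Y:f  Y:a  =>  X:f(a)."""
    (cat_l, sem_l), (cat_r, sem_r) = left, right
    if isinstance(cat_l, tuple) and cat_l[1] == "/" and cat_l[2] == cat_r:
        return (cat_l[0], sem_l(sem_r))
    return None

def backward_apply(left, right):
    """Backward application: Y:a  X\\Y:f  =>  X:f(a)."""
    (cat_l, sem_l), (cat_r, sem_r) = left, right
    if isinstance(cat_r, tuple) and cat_r[1] == "\\" and cat_r[2] == cat_l:
        return (cat_r[0], sem_r(sem_l))
    return None

# "The dove flies": the := NP/N, dove := N (a THING), flies := S\NP (a GO)
the = (DET, lambda x: ("THING", x))
dove = (N, "dove")
flies = (INTRANS, lambda x: ("GO", x))

np = forward_apply(the, dove)         # ("NP", ("THING", "dove"))
sentence = backward_apply(np, flies)  # ("S", ("GO", ("THING", "dove")))
```

Combining the determiner with the noun by forward application, and the resulting NP with the verb by backward application, yields an S whose semantics is the nested GO/THING frame.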
These elements combine using a few simple productions, such as the following functional application rules: X/Y:f Y:a => X:f(a) (forward application) and Y:a X\Y:f => X:f(a) (backward application).</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.3 Constructing the Constraint Network </SectionTitle> <Paragraph position="0"> The constraints we use are parse rules; therefore our constraint network topology embodies a parse tree for any sentence it can handle. Since our inputs do not include the parse tree, we must consider how to generate an appropriate constraint network topology.</Paragraph> <Paragraph position="1"> One option is to use the same network topology to handle all sentences of the same length. Such a network would have to contain every possible parse tree, and thus would essentially result in an exhaustive search of the parse space. A better solution would be to avoid the exhaustive search by constructing a custom constraint topology for each sentence, using standard heuristic parse techniques. The drawback to this approach is that we are not interested in finding just any potential parse of a phrase/sentence, nor even the most statistically probable parse. Since our intent is to perform structural alignment with input from the non-linguistic domain, our goal in parsing is to find the semantic parse structure which aligns best with the semantic structure input from the non-linguistic domain. It follows that we should use the non-linguistic input to guide our search.</Paragraph> <Paragraph position="2"> [Figure caption (partially recovered): ...as a constraint in a constraint propagation network. Any one of the three inputs can be left unspecified, and the constraint can completely determine the value based on the other two inputs.]</Paragraph> <Paragraph position="3"> Our system applies bidirectional search to the parse/alignment problem. In contrast to traditional search techniques, bidirectional search treats both the &quot;source&quot; and &quot;goal&quot; symmetrically; the search-space is traversed both forward from the source and backward from the goal.
The search processes operating in each direction interact with each other whenever their paths reach the same state in the search-space. This interaction provides hints for quickly completing the remainder of the search.</Paragraph> <Paragraph position="3"> For example, if the forward and backward paths reach the same search-state, then the forward searcher quickly reaches the goal by tracing the backward path.</Paragraph> <Paragraph position="4"> The specific style of bidirectional search we are investigating is based on Streams and Counterstreams (Ullman, 1996), in which forward and backward search paths interact with each other by means of primed pathways.</Paragraph> <Paragraph position="5"> For each transition, two priming values are maintained: a forward priming and a backward priming. Primings are used when a decision must be made between several possible transitions that could extend a search path; those transitions that have a higher priming (using the forward priming for forward searches, the backward priming for backward searches) are preferred for expansion. Transition primings in a particular direction (either forward or backward) are increased whenever a search path traverses the transition in the opposite direction. The net influence of the primings is that transitions previously traversed in one direction are more likely to be explored in the opposite direction, if the opportunity arises. By extension, primings provide clues for finding a path from any state to the target state.</Paragraph> <Paragraph position="6"> The Streams and Counterstreams approach to bidirectional search facilitates incorporation of other types of context. For example, some situational context can be captured by allowing primings from previous parses of recent sentences to influence the current parse.
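The priming bookkeeping described above can be sketched as a toy program; the class name and data layout are our illustration, not the system's implementation:

```python
# Toy sketch of Streams-and-Counterstreams priming (illustrative data
# layout; not the system's actual implementation). Each directed
# transition keeps two priming values; traversing it in one direction
# raises its priming for the opposite direction.

class Primings:
    def __init__(self):
        self.fwd = {}  # transition -> priming used by forward searches
        self.bwd = {}  # transition -> priming used by backward searches

    def traverse(self, transition, direction):
        # A traversal primes the *opposite* direction.
        if direction == "forward":
            self.bwd[transition] = self.bwd.get(transition, 0.0) + 1.0
        else:
            self.fwd[transition] = self.fwd.get(transition, 0.0) + 1.0

    def pick(self, candidates, direction):
        # Prefer the candidate with the highest priming for the
        # direction in which we are currently searching.
        table = self.fwd if direction == "forward" else self.bwd
        return max(candidates, key=lambda t: table.get(t, 0.0))
```

After a backward search crosses a transition, a later forward search facing a choice prefers that transition; this is how exploration in one direction leaves hints for the other.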
Also, statistical cues such as Lexical Attraction (Yuret, 1999) can be integrated into the system by using heuristics to bias primings.</Paragraph> </Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.4 Structural Alignment </SectionTitle> <Paragraph position="0"> The three components of structural alignment specified in Section 2.4 (atomic alignment, identification of compatible match sets, and structurally implied matches) are woven into the bidirectional search construction of the constraint network topology. When a constraint network fragment is constructed which bridges between a small portion of the linguistic input and the non-linguistic semantics, this &quot;atomic alignment&quot; primes the bidirectional search to be more likely to repeat this match while constructing larger constraint network fragments; hence atomic alignment leads to larger structural alignment.</Paragraph> <Paragraph position="1"> The constraints in the constraint network ensure that all active atomic alignments are compatible. Finally, when the constraint network bridges large portions of the linguistic and non-linguistic inputs, the non-linguistic semantic structure gets partitioned across the words in the linguistic input by the grammatical constraints. This completes the structural alignment by bringing unknown words into correspondence with their probable semantics.</Paragraph> </Section> <Section position="5" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.5 Handling Uncertainty </SectionTitle> <Paragraph position="0"> Throughout this discussion, we have considered words as being either completely learned or completely unlearned.</Paragraph> <Paragraph position="1"> Clearly, though, there is much middle ground, including words whose meanings are still ambiguous among several options, as well as words for which some meanings have been well acquired, while other valid meanings have yet to be learned.
How can our system handle this degree-of-acquisition continuum? Let us consider what we can expect from the meaning-refinement module. First, it should be able to report a set of witnessed possible meanings for each word, together with a correctness strength for each interpretation. This would be based on how regularly that interpretation has been witnessed. Furthermore, the module should be able to report the likelihood that the word still has unwitnessed interpretations; for initial occurrences of a word, this likelihood would be quite high, but with more exposure to the word, this likelihood would fall off.</Paragraph> <Paragraph position="2"> Returning to our system, we can now treat each word as having a set of known meanings together with a wild-card unknown meaning. When using bidirectional search to construct the constraint network topology, we bias the primings of transitions which reduce a word's potential meaning set, using the likelihood estimates given by the meaning-refinement module.</Paragraph> </Section> </Section> <Section position="7" start_page="5" end_page="6" type="metho"> <SectionTitle> 6 Lexical-Conceptual Structures from Video </SectionTitle> <Paragraph position="0"> The proposed system includes a vision component that is responsible for converting pixel data from a video input into the semantic structure described in Section 4. This vision system is an implementation of the ideas presented by John Bender (2001). Following Bender's prescriptions, the vision system does not perform object recognition. Instead, the goal of the system is to analyze the different paths and places that are present in a scene and, by relating these paths and places to one another, to construct an LCS representation of the actions.</Paragraph> <Section position="1" start_page="5" end_page="6" type="sub_section"> <SectionTitle> 6.1 Data Flow </SectionTitle> <Paragraph position="0"> The vision system consists of two parts.
First, video frames are analyzed in sequence and the objects present in each scene are tracked using traditional vision algorithms and techniques. (For details concerning image labeling and object extraction algorithms, see (Horn, 1986).) For each object, information about the object's size, shape, and position over the life of the scene is stored in a data structure that we call a Blob. This name was chosen to highlight the fact that the vision system makes no attempt at object recognition or fine-grained analysis and is instead concerned only with paths along which the objects (blobs) move.</Paragraph> <Paragraph position="1"> Second, the data regarding each object's progression through the scene is interpreted by an implementation of Bender's algorithm DESCRIBE v.2 to produce the semantic representation that is used by the other components of the system.</Paragraph> </Section> <Section position="2" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 6.2 Pixels to Blobs </SectionTitle> <Paragraph position="0"> The low-level portion of the vision system is fed sequences of pixel matrices by an external system that captures video data. In the current implementation, this pixel data is sent from a simulator in which the actions of simple objects take place. The pixel matrices include integer definitions of each pixel's value, supplying all color information.</Paragraph> <Paragraph position="1"> [Footnote to Section 5.5: These likelihood estimations could be generated, among other ways, by a meaning-refinement module incorporating a Bayesian model.] [Figure caption (partially recovered): ...produced by the simulator. The left image represents the start state and the right image represents the final state.]</Paragraph> <Paragraph position="2"> When the analysis of a particular scene begins, the vision system captures a snapshot of the background that it uses as a reference for all subsequent frames related to the same scene. As new video frames are input, the stored background is subtracted and the new video frames are converted to binary images.
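The background subtraction and binarization step can be sketched in a few lines of pure Python; the threshold value and the scalar pixel values are illustrative assumptions, since the real system works with full color information:

```python
# Sketch of background subtraction and binarization (illustrative only).
# Frames are matrices (lists of lists) of integer pixel values; a pixel is
# marked foreground (1) when it differs from the stored background snapshot
# by more than a threshold.

def binarize(frame, background, threshold=10):
    return [
        [1 if abs(p - b) > threshold else 0
         for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

background = [[5, 5, 5],
              [5, 5, 5]]
frame      = [[5, 90, 5],
              [5, 95, 5]]

binary = binarize(frame, background)  # -> [[0, 1, 0], [0, 1, 0]]
```

Pixels belonging to a moving object stand out against the stored background snapshot and survive the thresholding, while background pixels are zeroed out.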
A noise removal algorithm is applied to the binary images to remove any residual elements of the original background.</Paragraph> <Paragraph position="3"> Once converted to a binary representation, each video frame is labeled using an object labeling algorithm and each distinct object is identified. Each object present within a frame is overlaid with a shape that will be used in the Blob representation passed along to the next component of the system. Each of the overlaid shapes is (possibly) matched to a shape observed in a previous frame. This matching procedure attempts to identify objects persisting between frames based on proximity in size, shape, color, and position using a 4-dimensional nearest-neighbors approach. If a shape matches a previously known entry, the Blob structure corresponding to that particular object is assigned a new shape for its progression. If no match is found, a new Blob structure is created for the newly-observed object.</Paragraph> <Paragraph position="4"> Once the analysis of all frames of a scene is complete, the list of Blobs is fed to the next portion of the vision system for further interpretation.</Paragraph> <Paragraph position="5"> Figures 7 and 8 show an example of the vision system in use. Figure 7 shows the raw images representing the start and end states of the scene. Figure 8 shows a visualization of the object data created by the low-level portion of the system. The trace represents the path along which the object moved during the scene.</Paragraph> <Paragraph position="6"> [Figure caption: The moving object's position changes are tracked and the trace of its path is generated.]</Paragraph> <Paragraph position="7"> [Figure caption (partially recovered): ...based on the example scene presented in Figure 7. Note that no object recognition is in use, so the objects are given temporary names (blob0 and blob1).]</Paragraph> </Section> <Section position="3" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 6.3 Blobs to LCS </SectionTitle> <Paragraph position="0"> The generation of semantic structures from vision data concludes with an analysis of the Blobs generated by the low-level vision system. This analysis is performed by implementing an algorithm described, but never implemented in a system, by Bender (2001).</Paragraph> <Paragraph position="1"> The algorithm first examines the list of objects present in the scene and computes the simple exists? and moving? predicates. If an object is found and moving, an LCS GO frame is instantiated and the object is compared to all others present so that the appropriate path and place functions can be calculated. The calculation of path and place functions is based on a set of routines suggested by Bender.</Paragraph> <Paragraph position="2"> These routines compute the direction, intersection, and place-descriptions (above, on, left-of, etc.) for each pair of objects. Finally, the path and place functions described in Section 4 are found by examining the output of the visual routines and are added to the LCS frame.</Paragraph> <Paragraph position="3"> Figure 9 shows the LCS frame constructed by the system based on the example shown in Figure 7. The frame can now be used by the remainder of our system in the structural alignment phase.</Paragraph> </Section> </Section> </Paper>