<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1102">
  <Title>Unification-based Multimodal Parsing</Title>
  <Section position="3" start_page="624" end_page="625" type="metho">
    <SectionTitle>
2 Parsing in Multidimensional Space
</SectionTitle>
    <Paragraph position="0"> The integrator in Johnston et al 1997 does in essence parse input, but the resulting structures can only be unary or binary trees one level deep; unimodal spoken or gestural commands and multimodal combinations consisting of a single spoken element and a single gesture. In order to account for a broader range of multimodal expressions, a more general parsing mechanism is needed.</Paragraph>
    <Paragraph position="1"> Chart parsing methods have proven effective for parsing strings and are commonplace in natural language processing (Kay 1980). Chart parsing involves population of a triangular matrix of well-formed constituents: chart(i, j), where i and j are numbered vertices delimiting the start and end of the string. In its most basic formulation, chart parsing can be defined as follows, where .</Paragraph>
    <Paragraph position="2"> is an operator which combines two constituents in accordance with the rules of the grammar.</Paragraph>
    <Paragraph position="3"> chart(i, j) = U chart(i, k) * chart(k, j) i&lt;k&lt;j Crucially, this requires the combining constituents to be discrete and linearly ordered. However, multimodal input does not meet these requirements:  gestural input spans two (or three) spatial dimensions, there is an additional non-spatial acoustic dimension of speech, and both gesture and speech are distributed across the temporal dimension.</Paragraph>
    <Paragraph position="4"> Unlike words in a string, speech and gesture may overlap temporally, and there is no single dimension on which the input is linear and discrete. So then, how can we parse in this multidimensional space of speech and gesture? What is the rule for chart parsing in multi-dimensional space? Our formulation of multidimensional parsing for multimodal systems (multichart) is as follows.</Paragraph>
    <Paragraph position="6"> In place of numerical spans within a single dimension (e.g. chart(3,5)), edges in the multidimensional chart are identified by sets (e.g.</Paragraph>
    <Paragraph position="7"> multichart({\[s, 4, 2\], \[g, 6, 1\]})) containing the identifiers(IDs) of the terminal input elements they contain. When two edges combine, the ID of the resulting edge is the union of their IDs. One constraint that linearity enforced, which we can still maintain, is that a given piece of input can only be used once within a single parse. This is captured by a requirement of non-intersection between the ID sets associated with edges being combined. This requirement is especially important since a single piece of spoken or gestural input may have multiple interpretations available in the chart. To prevent multiple interpretations of a single signal being used, they are assigned IDs which are identical with respect to the the non-intersection constraint. The multichart statement enumerates all the possible combinations that need to be considered given a set of inputs whose IDs are contained in a set X.</Paragraph>
    <Paragraph position="8"> The multidimensional parsing algorithm (Figure 4) runs bottom-up from the input elements, building progressively larger constituents in accordance with the ruleset. An agenda is used to store edges to be processed. As a simplifying assumption, rules are assumed to be binary. It is straightforward to extend the approach to allow for non-binary rules using techniques from active chart parsing (Earley 1970), but this step is of limited value given the availability  of multimodal subcategorization (Section 4).</Paragraph>
    <Paragraph position="9"> while AGENDA C/ \[ \] do remove front edge from AGENDA and make it CURRENTEDGE for each EDGE, EDGE E CHART if CURRENTEDGE (1 EDGE = find set NEWEDGES = U ( (U CURRENTEDGE * EDGE) (U EDGE * CURRENTEDGE)) add NEWEDGES to end of AGENDA add CURRENTEDGE to CHART  For use in a multimodal interface, the multidimensional parsing algorithm needs to be embedded into the integration agent in such a way that input can be processed incrementally. Each new input received is handled as follows. First, to avoid unnecessary computation, stale edges are removed from the chart. A timeout feature indicates the shelflife of an edge within the chart. Second, the interpretations of the new input are treated as terminal edges, placed on the agenda, and combined with edges in the chart in accordance with the algorithm above. Third, complete edges are identified and executed. Unlike the typical case in string parsing, the goal is not to find a single parse covering the whole chart; the chart may contain several complete non-overlapping edges which can be executed. These are assigned to a category command as described in the next section. The complete edges are ranked with respect to probability. These probabilities are a function of the recognition probabilities of the elements which make up the comrrrand. The combination of probabilities is specified using declarative constraints, as described in the next section. The most probable complete edge is executed first, and all edges it intersects with are removed from the chart. The next most probable complete edge remaining is then executed and the procedure continues until there are no complete edges left in the chart. This means that selection of higher probability complete edges eliminates overlapping complete edges of lower probability from the list of edges to be executed. Lastly, the new chart is stored. In ongoing work, we are exploring the introduction of other factors to the selection process. For example, sets of disjoint complete edges which parse all of the terminal edges in the chart should likely be preferred over those that do not.</Paragraph>
    <Paragraph position="10"> Under certain circumstances, an edge can be used more than once. This capability supports multiple creation of entities. For example, the user can utter 'multiple helicopters' point point point point in order to create a series of vehicles. This significantly speeds up the creation process and limits reliance on speech recognition. Multiple commands are persistent edges; they are not removed from the chart after they have participated in the formation of an executable command. They are assigned timeouts and are removed when their alloted time runs out.</Paragraph>
    <Paragraph position="11"> These 'self-destruct' timers are zeroed each time another entity is created, allowing creations to chain together.</Paragraph>
  </Section>
  <Section position="4" start_page="625" end_page="627" type="metho">
    <SectionTitle>
3 Unification-based Multimodal
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="625" end_page="627" type="sub_section">
      <SectionTitle>
Grammar Representation
</SectionTitle>
      <Paragraph position="0"> Our grammar representation for multimodal expressions draws on unification-based approaches to syntax and semantics (Shieber 1986) such as Head- null driven phrase structure grammar (HPSG) (Pollard and Sag 1987,1994). Spoken phrases and pen gestures, which are the terminal elements of the multimodal parsing process, are referred to as lexical edges. They are assigned grammatical representations in the form of typed feature structures by the natural language and gesture interpretation agents respectively. For example, the spoken phrase &amp;quot;helicopter is assigned the representation in Figure 5.  The cat feature indicates the basic category of the element, while content specifies the semantic content. In this case, it is a create_unit command in which the object to be created is a vehicle of type helicopter, and the location is required to be a point. The remaining features specify auxiliary information such as the modality, temporal interval, and probability associated with the edge. A point gesture has the representation in Figure 6.</Paragraph>
      <Paragraph position="2"> Multimodal grammar rules are productions of the form LHS --r DTR1 DTR2 where LHS, DTR1, and DTR2 are feature structures of the form indicated above. Following HPSG, these are encoded as feature structure rule schemata. One advantage of this is that rule schemata can be hierarchically ordered, allowing for specific rules to inherit basic constraints from general rule schemata. The basic multimodal integration strategy of Johnston et al 1997 is now just one rule among many (Figure 7).</Paragraph>
      <Paragraph position="4"> The lhs,dtrl, and dtr2 features correspond to LHS, DTR1, and DTR2 in the rule above. The constraints feature indicates an ordered series of constraints which must be satisfied in order for the rule to apply. Structure-sharing in the rule representation is used to impose constraints on the input feature structures, to construct the LHS category, and to instantiate the variables in the constraints. For example, in Figure 7, the basic constraint that the location of a located command such as 'helicopter' needs to unify with the content of the gesture it combines with is captured by the structure-sharing tag \[5\]. This also instantiates the location of the resulting edge, whose content is inherited through tag \[1 \]. The application of a rule involves unifying the two candidate edges for combination against dtrl and dtr2. Rules are indexed by their cat feature in order to avoid unnecessary unification. If the edges unify with dtrl and dtr2, then the constraints are checked. If they are satisfied then a new edge is created whose category is the value of lhs and whose ID set consists of the union of the ID sets assigned to the two input edges.</Paragraph>
      <Paragraph position="5"> Constraints require certain temporal and spatial relationships to hold between edges. Complex constraints can be formed using the basic logical operators V, A, and =C/,. The temporal constraint in Figure 7, overlap(J7\], \[10\]) V follow(\[7\],\[lO\], 4), states that the time of the speech \[7\] must either overlap with or start within four seconds of the time of the gesture \[10\]. This temporal constraint is based on empirical investigation of multimodal interaction (Oviatt et al 1997). Spatial constraints are used for combinations of gestural inputs. For example, close_to(X, Y) requires two gestures to be a limited distance apart (See Figure 12 below) and contact(X, Y) determines whether the regions occupied by two objects are in contact. The remaining constraints in Figure 7 do not constrain the inputs per se, rather they are used to calculate the time, prob, and modality features for the resulting edge. For example, the constraint combine_prob(\[8\], \[11\], \[4\]) is used to combine the probabilities of two inputs and assign a joint probability to the resulting edge.</Paragraph>
      <Paragraph position="6"> In this case, the input probabilities are multiplied.</Paragraph>
      <Paragraph position="7"> The assign_modality(\[6\], \[9\], \[2\]) constraint determines the modality of the resulting edge. Auxiliary features and constraints which are not directly relevant to the discussion will be omitted.</Paragraph>
      <Paragraph position="8"> The constraints are interpreted using a prolog meta-interpreter. This basic back-tracking constraint satisfaction strategy is simplistic but adequate for current purposes. It could readily be substituted with a more sophisticated constraint solving strategy allowing for more interaction among constraints, default constraints, optimization among a series of constraints, and so on. The addition of functional constraints is common in HPSG and other unification grammar formalisms (Wittenburg 1993).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="627" end_page="627" type="metho">
    <SectionTitle>
4 Multimodal Subcategorization
</SectionTitle>
    <Paragraph position="0"> Given that multimodal grammar rules are required to be binary, how can the wide variety of commands in which speech combines with more than one gestural element be accounted for? The solution to this problem draws on the lexicalist treatment of complementation in HPSG. HPSG utilizes a sophisticated theory of subcategorization to account for the different complementation patterns that verbs and other lexical items require. Just as a verb subcategorizes for its complements, we can think of a lexical edge in the multimodal grammar as subcategorizing for the edges with which it needs to combine. For example, spoken inputs such as 'calculate distance from here to here' an d ' sandbag wall from here to here' (Figure 8) result in edges which subcategorize for two gestures. Their multimodal subcategorization is specified in a list valued subcat feature, implemented using a recursive first/rest feature structure (Shieber 1986:27-32).</Paragraph>
    <Paragraph position="2"> The cat feature is subcat_comrnand, indicating that this is an edge with an unsaturated subcategorization list. The first/rest structure indicates the two gestures the edge needs to combine with and terminates with rest: end. The temporal constraints on expressions such as these are specific to the expressions themselves and cannot be specified in the rule constraints. To support this, we allow for lexical edges to carry their own specific lexical constraints, which are held in a constraints feature at each level in the subeat list. In this case, the first gesture is constrained to overlap with the speech or come up to four seconds before it and the second gesture is required to follow the first gesture. Lexical constraints are inherited into the rule constraints in the combinatory schemata described below. Edges with subcat features are combined with other elements in the chart in accordance with general combinatory schemata. The first (Figure 9) applies to unsaturated edges which have more than one element on their subcat list. It unifies the first element of the subcat list with an element in the chart and builds a new edge of category subcat_command whose subcat list is the value of rest.</Paragraph>
    <Paragraph position="3">  The second schema (Figure 10) applies to unsaturated (cat: subcat_command) edges on whose subcat list only one element remains and generates saturated (cat: command) edges.</Paragraph>
    <Paragraph position="4">  This specification of combinatory information in the lexical edges constitutes a shift from rules to representations. The ruleset is simplified to a set of general schemata, and the lexical representation is extended to express combinatorics. However, there is still a need for rules beyond these general schemata in order to account for constructional meaning (Goldberg 1995) in multimodal input, specifically with respect to complex unimodal gestures.</Paragraph>
  </Section>
  <Section position="6" start_page="627" end_page="628" type="metho">
    <SectionTitle>
5 Visual Parsing: Complex Gestures
</SectionTitle>
    <Paragraph position="0"> In addition to combinations of speech with more than one gesture, the architecture supports unimodal gestural commands consisting of several independently recognized gestural components. For example, lines may be created using what we term gestural diacritics. If environmental noise or other factors make speaking the type of a line infeasible, it may be specified by drawing a simple gestural mark or word over a line gesture. To create a barbed wire, the user can draw a line specifying its spatial extent and then draw an alpha to indicate its type.</Paragraph>
    <Paragraph position="1"> Figure 1 1: Complex Gesture for Barbed Wire This gestural construction is licensed by the rule schema in Figure 12. It states that a line gesture  (dtrl) and an alpha gesture (dtr2) can be combined, resulting in a command to create a barbed wire. The location information is inherited from the line gesture. There is nothing inherent about alpha that makes it mean 'barbed wire'. That meaning is embodied only in its construction with a line gesture, which is captured in the rule schema. The close_to constraint requires that the centroid of the alpha be in proximity to the line.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML