<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1054">
  <Title>Finite-state Multimodal Parsing and Understanding</Title>
  <Section position="4" start_page="369" end_page="369" type="metho">
    <SectionTitle>
2 Finite-state Language Processing
</SectionTitle>
    <Paragraph position="0"> Finite-state transducers (FST) are finite-state automata (FSA) where each transition consists of an input and an output symbol. The transition is traversed if its input symbol matches the current symbol in the input and generates the output symbol associated with the transition. In other words, an FST can be regarded as a 2-tape FSA with an input tape from which the input symbols are read and an output tape where the output symbols are written.</Paragraph>
    <Paragraph position="1"> Finite-state machines have been extensively applied to many aspects of language processing including, speech recognition (Pereira and Riley, 1997; Riccardi et al., 1996), phonology (Kaplan and Kay, 1994), morphology (Koskenniemi, 1984), chunking (Abney, 1991; Joshi and Hopely, 1997; Bangalore, 1997), parsing (Roche, 1999), and machine translation (Bangalore and Riccardi, 2000).</Paragraph>
    <Paragraph position="2"> Finite-state models are attractive n~echanisms for language processing since they are (a) efficiently learnable fiom data (b) generally effective for decoding and (c) associated with a calculus for composing machines which allows for straightforward integration of constraints fl'om various levels of language processing. Furdmrmore, software implementing the finite-state calculus is available for research purposes (Mohri eta\[., 1998). Another motivation for our choice of finite-state models is that they enable tight integration of language processing with speech and gesture recognition.</Paragraph>
  </Section>
  <Section position="5" start_page="369" end_page="372" type="metho">
    <SectionTitle>
3 Finite-state Multimodal Grammars
</SectionTitle>
    <Paragraph position="0"> Multimodal integration involves merging semantic content fi'om multiple streams to build a joint interpretation for a inultimodal utterance. We use a finite-state device to parse multiple input strealns and to combine their content into a single semantic representation. For an interface with n inodes, a finite-state device operating over n+ 1 tapes is needed. The first n tapes represent the input streams and r~ + \] is an output stream representing their composition. In the case of speech and pen input there are three tapes, one for speech, one for pen gesture, and a third for their combined meaning.</Paragraph>
    <Paragraph position="1"> As an example, in the messaging application described above, users issue spoken commands such as email this person and that organization and gestm'e on the appropriate person and organization on the screen. The structure and interpretation of multimodal colnlnands of this kind can be captured declaratively in a multi-modal context-free grammar. We present a fi'agment capable of handling such commands in Figure 2.</Paragraph>
    <Paragraph position="3"> The non-terminals in the multimodal grammar are atomic symbols. The multimodal aspects el' the grammar become apparent in the terlninals. Each terminal contains three components W:G:M corresponding to the n q- 1 tapes, where W is for the spoken language stream, G is the gesture stream, and M is the combined meaning. The epsilon symbol is used to indicate when oue of these is empty in a given terminal. The symbols in W are woMs from the speech stream. The symbols in G are of two types.</Paragraph>
    <Paragraph position="4"> Symbols like Go indicate the presence of a particular kind of gesturc in the gesture stream, while those like et are used as references to entities referred to by the gesture (See Section 3.1). Simple deictic pointing gestures are assigned semantic types based on tl~e entities they are references to. Gp represents a gestural tel'erence to a person on the display, Go to an organization, and Gd lo a department. Compared with a feature-based multimodal gralnlnar, these types constitute a set of atomic categories which make ltle relewmt dislinclions for gesture events prcdicllug speech events and vice versa. For example, if the gesture is G,, then phrases like thLs person aud him arc preferred speech events and vice versa.</Paragraph>
    <Paragraph position="5"> These categories also play a role in constraining the semantic representation when the speech is under-specified with respect to semantic type (e.g. email this one). These gesture symbols can be organized into a type hierarchy reflecting the ontology of the entities in the application domain. For exampie, there might be a general type G with subtypes Go and Gp, where G v has subtypes G,,,,~ and Gpf for male and female.</Paragraph>
    <Paragraph position="6"> A multimodal CFG (MCFG) can be defined fop really as quadruple &lt; N, 7', P, S &gt;. N is the set of nonterminals. 1 ~ is the set of productions of the form A -+ (~whereA E Nand,~, C (NUT)*. Sis the start symbol for the grammar. 7' is the set ot' terminals of the l'orm (W U e) : (G U e) : M* where W is the vocabulary of speech, G is the vocabulary of gesture=GestureSymbols U EventSymbols; GcsturcSymbols ={G v, Go, Gpj', G~.., ...} and a finite collections of \],gventSymbols ={c,,c~, ..., c,,}. M is the vocabulary to lel)rcsent meaning and includes event symbols (Evenl:Symbol.s C M).</Paragraph>
    <Paragraph position="7"> In general a context-free grammar can be approximated by an FSA (Pereira and Wright 1997, Nederher 1997). The transition symbols of the approximated USA are the terminals of the context-fiee grammar and in the case of multimodal CFG as detined above, these terminals contain three components, W, G and M. The multimodal CFG fi'agmerit in Figurc 2 translates into the FSA in Figure 3, a three-tape finite state device capable of composing two input streams into a single output semantic representation stream.</Paragraph>
    <Paragraph position="8"> Our approach makes certain simplil'ying assumptions with respect to ternporal constraints. In multigesture utterances the primary flmction of temporal constraints is to force an order on the gestures. If you say move this here and make two .gestures, the first corresponds to thi s and the second to here. Our multimodal grammars encode order but do not impose explicit temporal constraints, ltowever, general temporal constraints between speech and the first gesture can be enforced belbrc the FSA is applied.</Paragraph>
    <Section position="1" start_page="370" end_page="371" type="sub_section">
      <SectionTitle>
3.1 Finite-state Meaning Representation
</SectionTitle>
      <Paragraph position="0"> A novel aspect of our approach is that in addition to capturing the structure of language with a finite state device, we also capture meaning. Tiffs is very important in nmltimodal language processing where the central goal is to capture how the multiple modes contribute to the combined interpretation. Ottr basic approach is to write symbols onto the third tape, which when concatenated together yield the semantic representation l'or the multimodal utterance. It suits out&amp;quot; purposes here to use a simple logical representation with predicates pred(....) and lists la, b,...l. Many other kinds of semantic representation could be generated. In the fl'agment in Figure 2, the word ema+-l contributes email(\[ to the semantics tape, and the list and predicate arc closed when the rule S --+ V NP e:z:\]) applies. The word person writes person( on the semantics tape.</Paragraph>
      <Paragraph position="1"> A signiiicant problem we face in adding meaning into the finite-state framework is how to reprcsent all of the different possible specific values that can be contributed by a gesture. For deictic references a unique identitier is needed for each object in the interface that the user can gesture on. For exalnple, il' the interface shows lists of people, there needs to be a unique ideutilier for each person. As part of the composition process this identifier needs</Paragraph>
      <Paragraph position="3"> to be copied from the gesture stream into the semantic representation. In the unification-based approach to multimodal integration, this is achieved by feature sharing (Johnston, 1998b). In the finite-state approach, we would need to incorporate all of the different possible IDs into the FSA. For a person with id objid345 you need an arc e:objid345:objid345 to transfer that piece of information fiom the gesture tape to the lneaning tape. All of the arcs for different IDs would have to be repeated everywhere in the network where this transfer of information is needed. Furthermore, these arcs would have to be updated as the underlying database was changed or updated. Matters are even worse for more complex pen-based data such as drawing lines and areas in an interactive map application (Cohen et al., 1998). In this case, the coordinate set from the gesture needs to be incorporated into the senmntic representation.</Paragraph>
      <Paragraph position="4"> It might not be practical to incorporate the vast nulnbet of different possible coordinate sequences into an FSA.</Paragraph>
      <Paragraph position="5"> Our solution to this problem is to store these specific values associated with incoming gestures in a finite set of buffers labeled el,e,),ea .... and in place of the specific content write in the nalne of the appropriate buffer on the gesture tape. Instead of having the specific values in the FSA, we have the transitions E:CI:C\], C:C2:C2, s:e3:e:3.., in each location where content needs to be transferred from the gesture tape to the meaning tape (See Figure 3). These are generated fi'om the ENTRY productions in the multilnodal CFG in Figure 2. The gesture interpretation module empties the buffers and starts back at el after each multimodal command, and so we am limited to a finite set of gesture events in a single utterance. Returning to the example email this person and that organization, assume the user gestures on entities objid367 and objid893. These will be stored in buffers el and e2. Figure 4 shows the speech and gesture streams and the resulting combined meaning.</Paragraph>
      <Paragraph position="6"> The elements on the meaning tape are concatenated and the buffer references are replaced to yield S: email this person and that organization G: Gp cl 'Go e2 M: email(\[ person(ct) , org(c2) \])  more recursive semantic phenomena such as possessives and other complex noun phrases are added to the grammar the resulting machines become larger. However, the computational consequences of this can be lessened by lazy ewfluation techniques (Mohri, 1997) and we believe that this finite-state approach to constructing semantic representations is viable for a broad range of sophisticated language interface tasks. We have implemented a sizeable multimodal CFG for VPQ (See Section 1): 417 rules and a lexicon of 2388 words.</Paragraph>
    </Section>
    <Section position="2" start_page="371" end_page="372" type="sub_section">
      <SectionTitle>
3.2 Multimodal Finite-state Transducers
</SectionTitle>
      <Paragraph position="0"> While a three-tape finite-state automaton is feasible in principle (Rosenberg, 1964), currently available tools for finite-state language processing (Mohri et al., 1998) only support finite-state transducers (FSTs) (two tapes). Furthermore, speech recognizers typically do not support tile use of a three-tape FSA as a language model. In order to implement our approach, we convert the three-tape FSA (Figme 3) into an FST, by decomposing the transition symbols into an input component (G x W) and output component M, thus resulting in a function, T:(G x W) --+ M. This corresponds to a transducer in which gesture symbols and words are on the :input tape and the meaning is on the output tape (Figure 6). The domain of this function T can be further curried to result in a transducer that maps 7~:G --&gt; W (Figure 7).</Paragraph>
      <Paragraph position="1"> This transducer captures the constraints that gesture places on the speech stream and we use it as a Janguage model for constraining the speech recognizer based on the recognized gesture string. In the fop lowing section, we explain how &amp;quot;F and 7% are used in conjunction with the speech recognition engine and gesture recognizer and interpreter to parse and inter- null pret nmltimodal input.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="372" end_page="372" type="metho">
    <SectionTitle>
4 Applying Multimodal Transducers
</SectionTitle>
    <Paragraph position="0"> There arc number of different ways in which multi-modal finite-state transducers can be integrated with speech and gesture recognition. The best approach to take depends on the properties of the lmrticular interface to be supported. The approach we outline here involves recognizing gesture ilrst then using the observed gestures to modify the language model for speech recognition. This is a good choice if there is limited ambiguity in gesture recognition, for exan@e, if lhe m~jority of gestures are unambiguous deictic pointing gestures.</Paragraph>
    <Paragraph position="1"> The first step is for the geslure recognition and interpretation module to process incoming pen gestures and construct a linite state machine GeslltVe corresponding to the range of gesture interpretations.</Paragraph>
    <Paragraph position="2"> Ill our example case (Figure 4) tile gesture input is unambiguous and the Gestttre linite state machine will be as in Figure 5. \]f the gestural input involves gesture recognition or is otherwise ambiguous it is represented as a lattice indicating all of the possible recognitions and interpretations o1' tile gesture stream. This allows speech to compensate for gesture errors and mutual compensation.</Paragraph>
    <Paragraph position="3">  This Ge,s'lure linite state machine is then composed with the transducer &amp;quot;R, which represents the relationship between speech and gesture (Figure 7).</Paragraph>
    <Paragraph position="4"> The result of this composition is a transducer Gesl-Lang (Figure 8). This transducer represents the relationship between this particular sl.ream of gestures and all of the possible word sequences tlmt could co-occur with those oes&amp;quot; , rares. In order to use this inlbnnation to guide the speech recognizer, we lhcn take a proiection on the output tape (speech) of Gesl-Lang to yield a finite-state machine which is used as a hmguage model for speech recognition (Figure 9). Using this model enables the gestural information to directly influence the speech recognizer's search. Speech recognition yields a lattice of possible word sequences. In our example case it yMds the wol~.t sequence email this person and that organization (Figure 10). We now need to reintegrale the geslure inl'ormation that wc removed in the prqjection step before recognition. This is achieved by composing Gest-Lang (Figure 8) with the result lattice from speech recognition (Figure 10), yielding transducer Gesl~ &amp;)eechFST (Figure 11). This transducer contains the information both from the speech stream and from the gesture stream. The next step is to generate the Colnbined meaning representation. To achieve this Gest&amp;)eechFST (G : W) is converted into an FSM GestSpeechFSM by combining output and input on one tape (G x W) (Figure 12).</Paragraph>
    <Paragraph position="5"> GestSk)eeckFSM is then composed with T (Figure 6), which relates speech and gesture to meaning, yielding file result transducer Result (Figure 13). The meaning is lead from the output tape yielding cm,dl(\[perso,,,(ca), m'O(e2)\]). We have implemented lifts approach and applied it in a multimodal interface to VPQ on a wireless PDA. In prelilninary speech recognition experiments, our approach yielded an average o1' 23% relative sentence-level error reduction on a corpus of 1000 utterances (Johnston and Bangalore, 2000).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML