<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1046">
  <Title>Edit Machines for Robust Multimodal Language Processing</Title>
  <Section position="4" start_page="361" end_page="362" type="metho">
    <SectionTitle>
3 Finite-state Multimodal Understanding
</SectionTitle>
    <Paragraph position="0"> Our approach to integrating and interpreting multimodal inputs (Johnston et al., 2002) is an extension of the finite-state approach previously proposed in (Bangalore and Johnston, 2000; Johnston and Bangalore, 2005). In this approach, a declarative multimodal grammar captures both the structure and the interpretation of multimodal and unimodal commands. The grammar consists of a set of context-free rules. The multimodal aspects of the grammar become apparent in the terminals, each of which is a triple W:G:M, consisting of speech (words, W), gesture (gesture symbols, G), and meaning (meaning symbols, M). The multimodal grammar encodes not just multimodal integration patterns but also the syntax of speech and gesture, and the assignment of meaning, here represented in XML. The symbol SEM is used to abstract over specific content such as the set of points delimiting an area or the identifiers of selected objects (Johnston et al., 2002). In Figure 3, we present a small simplified fragment from the MATCH application capable of handling information seeking requests such as phone for these three restaurants. The epsilon symbol (a2 ) indicates that a stream is empty in a given terminal.</Paragraph>
    <Paragraph position="1"> In the example above where the user says phone forthese tworestaurants whilecircling tworestaurants (Figure 2 [a]), assume the speech recognizer returns the lattice in Figure 4 (Speech). The gesture recognition component also returns a lattice (Figure 4, Gesture) indicating that the user's ink  is either a selection of two restaurants or a geographical area. In Figure 4 (Gesture) the specific content is indicated in parentheses after SEM. This content is removed before multimodal parsing and integration and replaced afterwards. For detailed explanation of our technique for abstracting over and then re-integrating specific gestural content and our approach to the representation of complex gestures see (Johnston et al., 2002). The multimodal grammar (Figure 3) expresses the relationship between what the user said, what they drew with the pen, and their combined meaning, in this case Figure 4 (Meaning). The meaning is generated by concatenating the meaning symbols and replacing SEM with the appropriate specific content: a12 cmda13a14a12 infoa13a14a12 typea13 phone a12 /typea13a15a12 obja13a16a12 resta13 [r12,r15] a12 /resta13 a12 /obja13a17a12 /infoa13a18a12 /cmda13 .</Paragraph>
    <Paragraph position="2"> For use in our system, the multimodal grammar is compiled into a cascade of finite-state transducers(Johnston andBangalore, 2000; Johnston etal., 2002; Johnston and Bangalore, 2005). As a result, processing of lattice inputs from speech and gesture processing is straightforward and efficient.</Paragraph>
    <Section position="1" start_page="362" end_page="362" type="sub_section">
      <SectionTitle>
3.1 Meaning Representation for Concept
Accuracy
</SectionTitle>
      <Paragraph position="0"> The hierarchically nested XML representation above is effective for processing by the backend application, but is not well suited for the automated determination of the performance of the language understanding mechanism. We adopt an approach, similar to (Ciaramella, 1993; Boros et al., 1996), in which the meaning representation, in our case XML, is transformed into a sorted flat list of attribute-value pairs indicating the core contentful concepts of each command. The example above yields:</Paragraph>
      <Paragraph position="2"> This allows us to calculate the performance of the understanding component using the same string matching metrics used for speech recognition accuracy. Concept Sentence Accuracy measures the number ofuser inputs for which the system got the meaning completely right (this is called Sentence Understanding in (Ciaramella, 1993)).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="362" end_page="363" type="metho">
    <SectionTitle>
4 Robust Understanding
</SectionTitle>
    <Paragraph position="0"> Robust understanding has been of great interest in the spoken language understanding literature.</Paragraph>
    <Paragraph position="1"> The issue of noisy output from the speech recognizer and disfluencies that are inherent in spoken input make it imperative for using mechanisms to provide robust understanding. As discussed in the introduction, there are two approaches to addressing robustness - partial parsing approach and classification approach. We have explored the classification-based approach to multimodal understanding in earlier work. We briefly present this approach and discuss its limitations for multimodal language processing.</Paragraph>
    <Section position="1" start_page="362" end_page="363" type="sub_section">
      <SectionTitle>
4.1 Classification-based Approach
</SectionTitle>
      <Paragraph position="0"> In previous work (Bangalore and Johnston, 2004), we viewed multimodal understanding as a sequence of classification problems in order to determine the predicate and arguments of an utterance. The meaning representation shown in (1) consists of an predicate (the command attribute) and a sequence of one or more argument attributes which are the parameters for the successful interpretation of the user's intent. For example, in (1), a19a21a20a23a22a54a24a32a26a55a28a44a30a32a31 is the predicate and a33a43a35a56a37a23a39a40a24a57a37a43a42a23a31a41a28a23a39a58a31a45a47a44a48a43a39a44a19a57a33a49a24a23a50a41a39a32a51a38a39a44a19a41a33a23a26a56a31a41a28 is the set of arguments to the predicate.</Paragraph>
      <Paragraph position="1"> We determine the predicate (a59a29a60 ) for a a61 tokenmultimodal utterance (a62a64a63 a65 )by maximizing the posterior probability as shown in Equation 2.</Paragraph>
      <Paragraph position="3"> Weview the problem of identifying and extracting arguments from a multimodal input as a problem of associating each token of the input with a specific tag that encodes the label of the argument and the span of the argument. These tags are drawn from a tagset which is constructed by  extending each argument label by three additional symbols a80a44a81a83a82a84a81a86a85 , following (Ramshaw and Marcus, 1995). These symbols correspond to cases when a token is inside (a80 ) an argument span, outside (a82 ) an argument span or at the boundary of two argument spans (a85 ) (See Table 1).</Paragraph>
      <Paragraph position="4">  Given this encoding, the problem of extracting the arguments is a search for the most likely sequence of tags (a90a91a60 ) given the input multimodal utterance a62 a63a65 as shown in Equation (3). We approximate the posterior probability a75 a69a44a76 a90a92a77a8a62 a63a65 a79 using independence assumptions as shown in Equation (4).</Paragraph>
      <Paragraph position="6"> Owing to the large set of features that are used for predicate identification and argument extraction, we estimate the probabilities using a classification model. In particular, we use the Adaboost classifier (Freund and Schapire, 1996) wherein a highly accurate classifier is build by combining many &amp;quot;weak&amp;quot; or &amp;quot;simple&amp;quot; base classifiers a106</Paragraph>
      <Paragraph position="8"> of which may only be moderately accurate. The selection of the weak classifiers proceeds iteratively picking the weak classifier that correctly classifies the examples that are misclassified by the previously selected weak classifiers. Each weak classifier is associated with a weight (a107</Paragraph>
      <Paragraph position="10"> that reflects its contribution towards minimizing the classification error. The posterior probability of a75 a69a27a76 a59a78a77 a73 a79 is computed as in Equation 5.</Paragraph>
    </Section>
    <Section position="2" start_page="363" end_page="363" type="sub_section">
      <SectionTitle>
4.2 Limitations of this approach
</SectionTitle>
      <Paragraph position="0"> Although, we have shown that the classification approach works for unimodal and simple multi-modal inputs, it is not clear how this approach can be extended to work on lattice inputs. Multimodal language processing requires the integration and joint interpretation of speech and gesture input. Multimodal integration requires alignment of the speech and gesture input. Given that the input modalities are both noisy and can receive multiple within-modality interpretations (e.g. a circle could be an &amp;quot;O&amp;quot; or an area gesture); it is necessary for the input to be represented as a multiplicity of hypotheses, which can be most compactly represented as a lattice. The multiplicity of hypotheses is also required for exploiting the mutual compensation between the two modalities as shown in (Oviatt, 1999; Bangalore and Johnston, 2000). Furthermore, in order to provide the dialog manager the best opportunity to recover the most appropriate meaning given the dialog context, we construct a lattice of semantic representations instead of providing only one semantic representation. null Inthemultimodal grammar-based approach, the alignment between speech and gesture along with their combined interpretation is utilized in deriving the multimodal finite-state transducers. These transducers are used to create a gesture-speech aligned lattice and a lattice of semantic interpretations. However, in the classification-based approach, it is not as yet clear how alignment between speech and gesture would be achieved especially when the inputs are lattice and how the aligned speech-gesture lattices can be processed to produce lattice of multimodal semantic representations. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="363" end_page="365" type="metho">
    <SectionTitle>
5 Hand-crafted Finite-State Edit Machines
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="363" end_page="364" type="sub_section">
      <SectionTitle>
Machines
</SectionTitle>
      <Paragraph position="0"> A corpus trained SLM with smoothing is more effective at recognizing what the user says, but this will not help system performance if coupled directly to a grammar-based understanding system which can only assign meanings toin-grammar utterances. In order to overcome the possible mis-match between the user's input and the language encoded in the multimodal grammar (a122a124a123 ), we introduce a weighted finite-state edit transducer to the multimodal language processing cascade. This transducer coerces the set of strings (a125 ) encoded in the lattice resulting from ASR (a122a124a126 ) to closest strings in the grammar that can be assigned an interpretation. We are interested in the string with the least costly number of edits (a68a36a69a29a70a36a71a128a127 a0 ) that can be assigned an interpretation by the grammar1.</Paragraph>
      <Paragraph position="1"> This can be achieved by composition (a129 ) of transducers followed by a search for the least cost path through a weighted transducer as shown below.</Paragraph>
      <Paragraph position="3"> We first describe the edit machine introduced in (Bangalore and Johnston, 2004) (Basic Edit) then go on to describe a smaller edit machine with higher performance (4-edit) and an edit machine  which incorporates additional heuristics (Smart edit).</Paragraph>
    <Section position="2" start_page="364" end_page="364" type="sub_section">
      <SectionTitle>
5.1 Basic edit
</SectionTitle>
      <Paragraph position="0"> Our baseline, the edit machine described in (Bangalore and Johnston, 2004), is essentially a finite-state implementation of the algorithm to compute the Levenshtein distance. It allows for unlimited insertion, deletion, and substitution of any word foranother (Figure 5). Thecosts ofinsertion, deletion, and substitution are set as equal, except for members of classes such as price (cheap, expensive), cuisine (turkish) etc., which are assigned a higher cost for deletion and substitution.</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="3" start_page="364" end_page="364" type="sub_section">
      <SectionTitle>
5.2 4-edit
</SectionTitle>
      <Paragraph position="0"> Basic edit is effective in increasing the number of strings that are assigned an interpretation (Bangalore and Johnston, 2004) but is quite large (15mb, 1 state, 978120 arcs) and adds an unacceptable amount of latency (5s on average). In order to overcome this performance problem we experimented with revising the topology of the edit machine so that it allows only a limited number of edit operations (at most four) and removed the substitution arcs, since they give rise to a82 a76 a77a27a140 a77a105 a79 arcs. For the same grammar, the resulting edit machine is about 300K with 4 states and 16796 arcs and the average latency is (0.5s). The topology of the 4-edit machine is shown in Figure 6.</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="4" start_page="364" end_page="365" type="sub_section">
      <SectionTitle>
5.3 Smart edit
</SectionTitle>
      <Paragraph position="0"> Smart edit is a 4-edit machine which incorporates a number of additional heuristics and refinements to improve performance:  1. Deletion of SLM-only words: Arcs were added to the edit transducer to allow for free deletion of any words in the SLM training data which are not found in the grammar. For example, listings in thai restaurant listings in midtown a141 thai restaurant in midtown.</Paragraph>
      <Paragraph position="1"> 2. Deletion of doubled words: A common er- null ror observed in SLM output was doubling of monosyllabic words. For example: subway to the cloisters recognized as subway to to the cloisters. Arcs were added to the edit machine to allow for free deletion of any short word when preceded by the same word.</Paragraph>
      <Paragraph position="2"> 3. Extended variable weighting of words: Insertion and deletion costs were further subdivided from two to three classes: a low cost for 'dispensable' words, (e.g. please, would, looking, a, the), a high cost for special words (slot fillers, e.g. chinese, cheap, downtown), and a medium cost for all other words, (e.g.</Paragraph>
      <Paragraph position="3"> restaurant, find).</Paragraph>
      <Paragraph position="4"> 4. Auto completion of place names: It is unlikely that grammar authors will include all of the different ways to refer to named entities such as place names. For example, if the grammar includes metropolitan museum of art the user may just say metropolitan museum. These changes can involve significant numbers of edits. A capability was added to the edit machine to complete partial specifications of place names in a single edit. This involves a closed world assumption over the set of place names. For example, if the only metropolitan museum in the database is the metropolitan museum of art we assume that we can insert of art after metropolitan museum. The algorithm for construction of these auto-completion edits enumerates all possible substrings (both contiguous andnon-contiguous) forplace names.</Paragraph>
      <Paragraph position="5"> For each of these it checks to see if the sub-string is found in more than one semantically distinct member of the set. If not, an edit sequence is added to the edit machine which freely inserts the words needed to complete the placename. Figure 7 illustrates one of the edit transductions that is added for the place name metropolitan museum of art. The algorithm which generates the autocomplete edits also generates new strings to add to the place name class for the SLM (expanded class). In order to limit over-application of the completion mechanism substrings starting in prepositions (of art a141 metropolitan museum of art) orinvolving deletion ofparts ofabbreviations are not considered for edits (b c building a141 n b c building).</Paragraph>
      <Paragraph position="6">  The average latency of SmartEdit is 0.68s. Note that the application-specific structure and weighting of SmartEdit (3,4 above) can be derived automatically: 4. runs on the placename list for the new application and the classification in 3. is primarily determined by which words correspond to fields in the underlying application database.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="365" end_page="366" type="metho">
    <SectionTitle>
6 Learning Edit Patterns
</SectionTitle>
    <Paragraph position="0"> In the previous section, we described an edit approach where the weights of the edit operations have been set by exploiting the constraints from the underlying application. In this section, we discuss an approach that learns these weights from data.</Paragraph>
    <Section position="1" start_page="365" end_page="365" type="sub_section">
      <SectionTitle>
6.1 Noisy Channel Model for Error
Correction
</SectionTitle>
      <Paragraph position="0"> The edit machine serves the purpose of translating user'sinput toastring thatcanbeassigned ameaning representation by the grammar. One of the possible shortcomings of the approach described in the preceding section is that the weights for the edit operations are set heuristically and are crafted carefully for the particular application. This process can be tedious and application-specific. In order to provide a more general approach, we couch the problem of error correction in the noisy channel modeling framework. In this regard, we follow (Ringger and Allen, 1996; Ristad and Yianilos, 1998), however, we encode the error correction model as a weighted Finite State Transducer (FST) so we can directly edit ASR input lattices. Furthermore, unlike (Ringger and Allen, 1996), the language grammar from our application filters out edited strings that cannot be assigned an interpretation by the multimodal grammar. Also, while in (Ringger and Allen, 1996) the goal is to translate to the reference string and improve recognition accuracy, in our approach the goal is to translate in order to get the reference meaning and improve concept accuracy.</Paragraph>
      <Paragraph position="1"> We let a62a142a123 be the string that can be assigned a meaning representation by the grammar and a62a144a143 be the user's input utterance. If we consider a62a10a143 to be the noisy version of the a62a11a123 , we view the decoding task as a search for the string a62 a60  We then use a Markov approximation (trigram for our purposes) to compute the joint probability  by GIZA++ toolkit (Och and Ney, 2003) for this purpose. We convert the viterbi alignment into a bilanguage representation that pairs words of the string a62a142a143 with words of a62a142a123 . A few examples of bilanguage strings are shown in Figure 8. We compute the joint n-gram model using a language modeling toolkit (Goffin et al., 2005). Equation 8 thus allows us to edit a user's utterance to a string that can be interpreted by the grammar.</Paragraph>
      <Paragraph position="2"> show:show me:me the:a1 map:a1 of:a1 midtown:midtown</Paragraph>
    </Section>
    <Section position="2" start_page="365" end_page="365" type="sub_section">
      <SectionTitle>
6.2 Deriving Translation Corpus
</SectionTitle>
      <Paragraph position="0"> Since our multimodal grammar is implemented as a finite-state transducer it is fully reversible and can be used not just to provide a meaning for input strings but can also be run in reverse to determine possible input strings for a given meaning. Our multimodal corpus was annotated for meaning using the multimodal annotation tools described in (Ehlen et al., 2002). In order to train the translation model we build a corpus that pairs the reference speech string for each utterance in the trainingdatawithatarget string. Thetarget stringisderived in two steps. First, the multimodal grammar is run in reverse on the reference meaning yielding a lattice of possible input strings. Second, the closest string in the lattice to the reference speech string is selected as the target string.</Paragraph>
    </Section>
    <Section position="3" start_page="365" end_page="366" type="sub_section">
      <SectionTitle>
6.3 FST-based Decoder
</SectionTitle>
      <Paragraph position="0"> In order to facilitate editing of ASR lattices, we represent the edit model as a weighted finite-state transducer. We first represent the joint n-gram model as a finite-state acceptor (Allauzen et al., 2004). We then interpret the symbols on each arc of the acceptor as having two components a word from user's utterance (input) and a word from the edited string (output). This transformation makes a transducer out of an acceptor. In doing so, we can directly compose the editing model with ASR lattices to produce a weighted lattice of edited strings. We further constrain the set of  edited strings to those that are interpretable by the grammar. We achieve this by composing with the language finite-state acceptor derived from the multimodal grammar as shown in Equation 5. Figure 9 shows the input string and the resulting output after editing with the trained model.</Paragraph>
      <Paragraph position="1"> Input: I'm trying to find african restaurants that are located west of midtown Edited Output: find african around west midtown Input: I'd like directions subway directions from the metropolitan museum of art to the empire state building Edited Output: subway directions from the metropolitan museum of art to the empire state building</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>