<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1005"> <Title>Balancing Data-driven and Rule-based Approaches in the Context of a Multimodal Conversational System</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Bootstrapping Corpora for Language </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Models </SectionTitle>
<Paragraph position="0"> The problem of speech recognition can be succinctly represented as a search for the most likely word sequence ($\hat{W}$) through the network created by the composition of a language of acoustic observations ($O$), an acoustic model, which is a transduction from acoustic observations to phone sequences ($A$), a pronunciation model, which is a transduction from phone sequences to word sequences ($P$), and a language model acceptor ($\Lambda$) (Pereira and Riley, 1997). The language model acceptor encodes the (weighted) word sequences permitted in an application.</Paragraph>
<Paragraph position="1"> $\hat{W} = \mathrm{BestPath}(O \circ A \circ P \circ \Lambda)$ (2)</Paragraph>
<Paragraph position="2"> Typically, $\Lambda$ is built using either a hand-crafted grammar or a statistical language model derived from a corpus of sentences from the application domain. While a grammar can be written so as to be easily portable across applications, it suffers from being too prescriptive and provides no measure of the relative likelihood of users' utterances. In contrast, in the data-driven approach a weighted grammar is automatically induced from a corpus, and the weights can be interpreted as a measure of the relative likelihood of users' utterances. However, the reliance on a domain-specific corpus is one of the significant bottlenecks of data-driven approaches, since collecting a corpus specific to a domain is an expensive and time-consuming task.</Paragraph>
<Paragraph position="3"> In this section, we investigate a range of techniques for producing a domain-specific corpus using resources such as a domain-specific grammar as well as an out-of-domain corpus. We refer to the corpus resulting from such techniques as a domain-specific derived corpus, in contrast to a domain-specific collected corpus. The idea is that the derived domain-specific corpus obviates the need for in-domain corpus collection. In particular, we are interested in techniques that result in corpora such that the performance of language models trained on them rivals the performance of models trained on corpora collected specifically for the domain. We investigate these techniques in the context of MATCH.</Paragraph>
<Paragraph position="4"> We use the notation $C_x$ for a corpus, $\lambda_x$ for the language model built from the corpus $C_x$, and $\Lambda_{\lambda_x}$ for the language model acceptor representation of the model $\lambda_x$, which can be used in Equation 2 above.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Language Model using in-domain corpus </SectionTitle>
<Paragraph position="0"> In order to evaluate the MATCH system, we collected a corpus of multimodal utterances for the MATCH domain in a laboratory setting from a set of sixteen first-time users (8 male, 8 female). We use this corpus to establish a point of reference for comparing models trained on derived corpora against models trained on an in-domain corpus.
A total of 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only) were collected and annotated, resulting from six sample task scenarios that involved finding restaurants of various types, getting their names, phone numbers, addresses, or reviews, and getting subway directions between locations. The data collected was conversational speech in which the users gestured and spoke freely. We built a class-based trigram language model ($\lambda_{match}$) using the 709 multimodal and speech-only utterances as the corpus ($C_{match}$). The performance of this model serves as the point of reference against which we compare the performance of language models trained on derived corpora.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Grammar as Language Model </SectionTitle>
<Paragraph position="0"> The multimodal CFG (a fragment is presented in Section 2) encodes the repertoire of language and gesture commands allowed by the system and their combined interpretations. The CFG can be approximated by an FSM with arcs labeled with language, gesture and meaning symbols, using well-known compilation techniques (Nederhof, 1997). The resulting FSM can be projected onto the language component and used as the language model acceptor ($\Lambda_{gram}$) for speech recognition. Note that the resulting language model acceptor is unweighted if the grammar is unweighted, and it is not robust to variation in users' input. However, due to the tight coupling of the grammar used for recognition and interpretation, every recognized string can be assigned an interpretation (though not necessarily the intended one).</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Grammar-based N-gram Language Model </SectionTitle>
<Paragraph position="0"> As mentioned earlier, a hand-crafted grammar typically suffers from being too restrictive and inadequate to cover the variation and extra-grammaticality of users' input. In contrast, an N-gram language model derives its robustness from permitting all strings over an alphabet, albeit with different likelihoods. In an attempt to provide robustness to the grammar-based model, we created a corpus ($C_{gram}$) of $N$ sentences by randomly sampling the set of paths of the grammar FSM and built a class-based N-gram language model ($\lambda_{gram}$) using this corpus. Although this corpus might not represent the true distribution of sentences in the MATCH domain, we are able to derive some of the benefits of N-gram language modeling techniques. This technique is similar to that of Galescu et al. (1998).</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Combining Grammar and Corpus </SectionTitle>
<Paragraph position="0"> A straightforward extension of the idea of sampling the grammar to create a corpus is to select those sentences from the grammar that make the resulting corpus &quot;similar&quot; to the corpus collected in the pilot studies. To create this corpus, we choose the $N$ most likely sentences as determined by a language model ($\lambda_{match}$) built using the collected corpus.
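To make this selection step concrete, the following sketch (not the authors' code; the toy grammar, stand-in "collected corpus", and add-alpha smoothing are all invented for illustration) samples sentences from a small CFG standing in for the MATCH grammar and keeps the candidates that an in-domain bigram model scores highest. A real system would sample the compiled grammar FSM and score with the class-based trigram model $\lambda_{match}$, but the ranking idea is the same.

```python
# Illustrative sketch: sample sentences from a toy CFG, then keep the N
# candidates scored highest by a bigram model trained on a stand-in
# "collected corpus" (cf. Sections 3.3 and 3.4). All rules/data are hypothetical.
import random, math
from collections import defaultdict

GRAMMAR = {  # toy CFG: nonterminals map to lists of right-hand sides
    "S": [["show", "NP"], ["how", "do", "i", "get", "to", "PLACE"]],
    "NP": [["CUISINE", "restaurants", "in", "PLACE"], ["restaurants"]],
    "CUISINE": [["italian"], ["indonesian"], ["cheap", "italian"]],
    "PLACE": [["chelsea"], ["the", "upper", "east", "side"]],
}

def sample(symbol="S"):
    # expand a nonterminal by picking a random rule; terminals pass through
    if symbol not in GRAMMAR:
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])
    return [w for s in rhs for w in sample(s)]

def bigram_logprob(sentence, counts, unigrams, vocab_size, alpha=0.5):
    # add-alpha smoothed bigram log-probability of a word list
    score = 0.0
    for prev, word in zip(["<s>"] + sentence, sentence + ["</s>"]):
        score += math.log((counts[(prev, word)] + alpha) /
                          (unigrams[prev] + alpha * vocab_size))
    return score

collected = [["show", "cheap", "italian", "restaurants", "in", "chelsea"]]  # stand-in C_match
counts, unigrams = defaultdict(int), defaultdict(int)
for sent in collected:
    for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
        counts[(prev, word)] += 1
        unigrams[prev] += 1
vocab = {w for s in collected for w in s} | {"</s>"}

candidates = {" ".join(sample()) for _ in range(200)}   # grammar-sampled corpus C_gram
ranked = sorted(candidates,
                key=lambda s: bigram_logprob(s.split(), counts, unigrams, len(vocab)),
                reverse=True)
print(ranked[:5])  # the N grammar sentences most similar to the collected data
```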
A mixture model ($\lambda_{mix}$) with mixture weight ($\alpha$) is built by interpolating the model trained on the corpus of extracted sentences ($\lambda_{extract}$) and the model trained on the collected corpus ($\lambda_{match}$).</Paragraph>
<Paragraph position="1"> $\lambda_{mix} = \alpha\,\lambda_{extract} + (1-\alpha)\,\lambda_{match}$ </Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Migrating an Out-of-Domain Corpus </SectionTitle>
<Paragraph position="0"> An alternative to using in-domain corpora for building language models is to &quot;migrate&quot; a corpus from a different domain to the MATCH domain. Migrating a corpus involves generalizing it to remove information specific to the original domain and then instantiating the generalized corpus to the MATCH domain. Although there are a number of ways of generalizing the out-of-domain corpus, the generalization we investigated involves identifying linguistic units, such as noun and verb chunks, in the out-of-domain corpus and treating them as classes. These classes are then instantiated with the corresponding linguistic units from the MATCH domain. The identification of the linguistic units in the out-of-domain corpus is done automatically using a supertagger (Bangalore and Joshi, 1999). We use a corpus collected in the context of a software helpdesk application as an example out-of-domain corpus. In cases where the out-of-domain corpus is closely related to the domain at hand, a more semantically driven generalization might be more suitable.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.6 Adapting the SwitchBoard Language Model </SectionTitle>
<Paragraph position="0"> We investigate the performance of a large-vocabulary conversational speech recognition system when applied to a specific domain such as MATCH. We used the Switchboard corpus ($C_{swbd}$) as an example of a large-vocabulary conversational speech corpus. We built a trigram model ($\lambda_{swbd}$) using the 5.4 million word corpus and investigated the effect of adapting the Switchboard language model given $N$ in-domain untranscribed speech utterances. The adaptation is done by first recognizing the in-domain speech utterances and then building a language model ($\lambda_{adapt}$) from the corpus of recognized text ($C_{adapt}$). This bootstrapping mechanism can be used to derive a domain-specific corpus and language model without any transcriptions. Similar techniques for unsupervised language model adaptation are presented in (Bacchiani and Roark, 2003; Souvignier and Kellner, 1998).</Paragraph>
<Paragraph position="2"/> </Section>
<Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.7 Adapting a wide-coverage grammar </SectionTitle>
<Paragraph position="0"> There have been a number of computational implementations of wide-coverage, domain-independent syntactic grammars for English in various formalisms (XTAG, 2001; Clark and Hockenmaier, 2002; Flickinger et al., 2000). Here, we describe a method that exploits one such grammar implementation, in the Lexicalized Tree-Adjoining Grammar (LTAG) formalism, for deriving domain-specific corpora. An LTAG consists of a set of elementary trees (supertags) (Bangalore and Joshi, 1999), each associated with a lexical item. The set of sentences generated by an LTAG can be obtained by combining supertags using substitution and adjunction operations. In related work (Rambow et al., 2002), it has been shown that for a restricted version of LTAG, the combinations of a set of supertags can be represented as an FSM.
This FSM compactly encodes the set of sentences generated by an LTAG grammar.</Paragraph>
<Paragraph position="1"> We derive a domain-specific corpus by constructing a lexicon consisting of pairings of words with their supertags that are relevant to that domain. We then compile the grammar to build an FSM of all sentences up to a given length. We sample this FSM and build a language model as discussed in Section 3.3. Given untranscribed utterances from a specific domain, we can also adapt the language model as discussed in Section 3.6.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Robust Multimodal Understanding </SectionTitle>
<Paragraph position="0"> The grammar-based interpreter uses the composition operation on FSTs to transduce multimodal strings (gesture, speech) into an interpretation. The set of speech strings that can be assigned an interpretation is exactly the set represented in the grammar. The accuracy of the meaning representation can be expected to be reasonable if the user's input matches one of the multimodal strings encoded in the grammar; but for user inputs that are not encoded in the grammar, the system returns no meaning representation. In order to improve the usability of the system, we expect it to produce a (partial) meaning representation irrespective of the grammaticality of the user's input and the coverage limitations of the grammar. It is this aspect that we refer to as robustness in understanding. We present below two approaches to robust multimodal understanding that we have developed.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Pattern Matching Approach </SectionTitle>
<Paragraph position="0"> In order to overcome the possible mismatch between the user's input and the language encoded in the multimodal grammar ($\lambda_{G}$), we use an edit-distance-based pattern matching algorithm to coerce the set of strings ($S$) encoded in the lattice resulting from ASR ($\lambda_{S}$) to match one of the strings that can be assigned an interpretation.</Paragraph>
<Paragraph position="1"> The edit operations (insertion, deletion, substitution) can be either word-based or phone-based, and each is associated with a cost. These costs can be tuned based on the word/phone confusions present in the domain. The edit operations are encoded as a transducer ($\lambda_{edit}$), as shown in Figure 5, and can apply to both one-best and lattice output of the recognizer. We are interested in the string with the least number of edits ($s^*$) that can be assigned an interpretation by the grammar. This can be achieved by composition ($\circ$) of transducers followed by a search for the least-cost path through the resulting weighted transducer, as shown below.</Paragraph>
<Paragraph position="3"> $s^* = \mathrm{BestPath}(\lambda_{S} \circ \lambda_{edit} \circ \lambda_{G})$ </Paragraph>
[Figure 5: the edit transducer, with insertion, deletion, substitution and identity arcs; the arc symbols can be words or phones.]
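As a hedged, non-FST illustration of this coercion idea, the sketch below enumerates a handful of strings standing in for the language encoded by the multimodal grammar and maps an ASR hypothesis to the closest one under word-level edit distance. The grammar strings, costs, and example hypothesis are invented; the actual system performs this search by composing $\lambda_{S}$, $\lambda_{edit}$ and $\lambda_{G}$ as above.

```python
# Toy version of the pattern-matching approach (Section 4.1): coerce an ASR
# hypothesis onto the closest interpretable string by word-level edit distance.
def edit_distance(a, b, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    # standard dynamic-programming Levenshtein distance with configurable costs
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * del_cost
    for j in range(1, len(b) + 1):
        d[0][j] = j * ins_cost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            same = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + same)
    return d[len(a)][len(b)]

GRAMMAR_STRINGS = [            # stand-in for the language encoded by the multimodal grammar
    "show cheap italian restaurants in chelsea",
    "phone numbers for these restaurants",
    "how do i get to this place",
]

def coerce(asr_hypothesis):
    # pick the interpretable string requiring the fewest edits from the ASR output
    hyp = asr_hypothesis.split()
    return min(GRAMMAR_STRINGS, key=lambda g: edit_distance(hyp, g.split()))

print(coerce("show me cheap italian restaurant in chelsea"))
```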
<Paragraph position="4"> The costs on the arcs are set up such that the cost of a substitution is less than the combined cost of a deletion and an insertion.</Paragraph>
<Paragraph position="5"> This approach is akin to example-based techniques used in other areas of NLP, such as machine translation.</Paragraph>
<Paragraph position="6"> In our case, the set of examples (encoded by the grammar) is represented as a finite-state machine.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Classification-based Approach </SectionTitle>
<Paragraph position="0"> A second approach is to view robust multimodal understanding as a sequence of classification problems whose goal is to determine the predicate and the arguments of an utterance. The meaning representation shown in (1) consists of a predicate (the command attribute) and a sequence of one or more argument attributes, which are the parameters needed for the successful interpretation of the user's intent. For example, in (1), the command value is the predicate and the remaining attributes constitute the set of arguments to the predicate.</Paragraph>
<Paragraph position="1"> We determine the predicate ($\hat{p}$) for an $n$-token multimodal utterance ($s_1^n$) by maximizing the posterior probability, as shown in Equation 7.</Paragraph>
<Paragraph position="2"> $\hat{p} = \arg\max_{p} P(p \mid s_1^n)$ (7)</Paragraph>
<Paragraph position="3"> We view the problem of identifying and extracting arguments from a multimodal input as a problem of associating each token of the input with a specific tag that encodes both the label of the argument and the span of the argument. These tags are drawn from a tagset constructed by extending each argument label with three additional symbols, I, O and B, following (Ramshaw and Marcus, 1995). These symbols correspond to cases where a token is inside (I) an argument span, outside (O) an argument span, or at the boundary of two argument spans (B) (see Table 1).</Paragraph>
<Paragraph position="6"> Given this encoding, the problem of extracting the arguments is a search for the most likely sequence of tags ($\hat{T}$):</Paragraph>
<Paragraph position="7"> $\hat{T} = \arg\max_{T} P(T \mid s_1^n)$ </Paragraph>
<Paragraph position="8"> Owing to the large set of features that are used for predicate identification and argument extraction, we estimate the probabilities using a classification model. In particular, we use the AdaBoost classifier (Freund and Schapire, 1996), wherein a highly accurate classifier is built by combining many &quot;weak&quot; or &quot;simple&quot; base classifiers ($h_t$), each of which may only be moderately accurate. The selection of the weak classifiers proceeds iteratively, at each step picking the weak classifier that correctly classifies the examples misclassified by the previously selected weak classifiers. Each weak classifier is associated with a weight ($\alpha_t$) that reflects its contribution towards minimizing the classification error.
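As a rough illustration (not the paper's implementation), the sketch below combines a few hand-written n-gram "weak classifiers" with weights $\alpha_t$ into per-predicate scores and normalizes them with a softmax. The weak classifiers, weights, and predicate names are invented, and the exact form of the paper's Equation 10 may differ.

```python
# Schematic of a boosting-style weighted vote over n-gram weak classifiers for
# predicate identification; the softmax normalization is one common choice and
# stands in for the paper's Equation 10. All features and weights are made up.
import math

# each weak classifier fires if its n-gram occurs in the utterance and votes for a predicate
WEAK_CLASSIFIERS = [
    ("phone numbers", "info_request", 1.2),
    ("how do i get", "route", 1.6),
    ("show", "show", 0.8),
]
PREDICATES = ["show", "info_request", "route", "help"]

def predicate_posteriors(utterance):
    scores = {p: 0.0 for p in PREDICATES}
    for ngram, predicate, alpha in WEAK_CLASSIFIERS:
        if ngram in utterance:
            scores[predicate] += alpha          # weighted vote of the weak classifier
    z = sum(math.exp(s) for s in scores.values())
    return {p: math.exp(s) / z for p, s in scores.items()}

print(predicate_posteriors("how do i get to the cloisters"))
```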
The posterior probability of each class given the input is computed as in Equation 10.</Paragraph>
<Paragraph position="9"> It should be noted that the data for training the classifiers can be collected from the domain or derived from an in-domain grammar using techniques similar to those presented in Section 3.</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle>
<Paragraph position="0"> We describe a set of experiments to evaluate the speech recognition performance and the concept accuracy of speech-only and speech-and-gesture exchanges in our MATCH multimodal system. We use word accuracy and string accuracy for evaluating ASR output. All results presented in this section are based on 10-fold cross-validation experiments run on the 709 spoken and multimodal exchanges collected from the pilot study described in Section 3.1.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Language Model </SectionTitle>
<Paragraph position="0"> Table 2 presents the ASR word and sentence accuracy using language models trained on the collected in-domain corpus as well as on corpora derived using the different methods discussed in Section 3. For the class-based models mentioned in the table, we defined classes based on areas of interest (e.g., riverside park, turtle pond), points of interest (e.g., Ellis Island), cuisines (e.g., Indonesian), price categories (e.g., moderately priced, expensive), and neighborhoods (e.g., Upper East Side, Chinatown). [Table 2: results for the different methods of bootstrapping domain-specific data.]</Paragraph>
<Paragraph position="1"> It is immediately apparent that the hand-crafted grammar used as a language model performs poorly, and that a language model trained on the collected domain-specific corpus performs significantly better than models trained on derived data. However, it is encouraging to note that a model trained on a derived corpus (obtained by combining the migrated out-of-domain corpus with a corpus created by sampling the in-domain grammar) is within 10% word accuracy of the model trained on the collected corpus. There are several other noteworthy observations from these experiments.</Paragraph>
<Paragraph position="2"> The performance of the language model trained on data sampled from the grammar is dramatically better than the performance of the hand-crafted grammar itself. This technique provides a promising direction for authoring portable grammars that can subsequently be sampled to build robust language models when no in-domain corpora are available. Furthermore, combining grammar and in-domain data, as described in Section 3.4, significantly outperforms all other models.</Paragraph>
<Paragraph position="3"> For the experiment on migrating an out-of-domain corpus, we used a corpus from a software helpdesk application. Table 2 shows that the migration of data using linguistic units, as described in Section 3.5, significantly outperforms a model trained only on the out-of-domain corpus. Also, combining the grammar-sampled corpus with the migrated corpus provides a further improvement.</Paragraph>
<Paragraph position="4"> The performance of the SwitchBoard model on the MATCH domain is also presented in Table 2. We built a trigram model using the 5.4 million word SwitchBoard corpus and investigated the effect of adapting the resulting language model using in-domain untranscribed speech utterances.
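A schematic of the unsupervised adaptation loop of Section 3.6, as applied to the SwitchBoard model, is sketched below. The recognizer and the n-gram "training" routine are stand-in placeholders rather than a real ASR or LM toolkit API, and the file names and text are invented.

```python
# Sketch of unsupervised LM adaptation: recognize untranscribed in-domain audio
# with the out-of-domain LM, then train a new LM on the recognized text
# (lambda_adapt from C_adapt). Placeholder functions only; not a real ASR system.
from collections import Counter

def recognize(audio_file, language_model):
    # placeholder: a real system would decode the audio with the current LM
    return "show cheap italian restaurants in chelsea"

def train_ngram_lm(sentences, order=3):
    # placeholder LM "training": count n-grams up to the given order
    counts = Counter()
    for sent in sentences:
        words = ["<s>"] * (order - 1) + sent.split() + ["</s>"]
        for i in range(len(words) - order + 1):
            counts[tuple(words[i:i + order])] += 1
    return counts

untranscribed_audio = ["utt_001.wav", "utt_002.wav"]     # hypothetical file names
swbd_lm = train_ngram_lm(["okay so how are you doing"])  # stand-in for the SwitchBoard LM
recognized = [recognize(f, swbd_lm) for f in untranscribed_audio]
adapted_lm = train_ngram_lm(recognized)                  # built without any transcriptions
print(list(adapted_lm.items())[:3])
```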
The adaptation is done by first recognizing the training partition of the in-domain speech utterances and then building a language model from the recognized text.</Paragraph>
<Paragraph position="5"> We observe that although the performance of the SwitchBoard language model on the MATCH domain is poorer than that of a model obtained by migrating data from a related domain, its performance can be significantly improved using the adaptation technique.</Paragraph>
<Paragraph position="6"> The last row of Table 2 shows the results of using the MATCH-specific lexicon to generate a corpus from a wide-coverage grammar, training a language model, and adapting the resulting model on in-domain untranscribed speech utterances, as was done for the SwitchBoard model. The class-based trigram model was built using 500,000 randomly sampled paths from the network constructed by the procedure described in Section 3.7.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Multimodal Understanding </SectionTitle>
<Paragraph position="0"> In this section, we present results on multimodal understanding using the two techniques presented in Section 4.</Paragraph>
<Paragraph position="1"> We use concept token accuracy and concept string accuracy as evaluation metrics for the entire meaning representation in these experiments. These metrics correspond to the word accuracy and string accuracy metrics used for ASR evaluation. In order to provide a finer-grained evaluation, we also break down the concept accuracy into the accuracy of identifying the predicates and the arguments. Again, we use string accuracy metrics to evaluate predicate and argument accuracy. We use the output of the ASR with the language model trained on the collected data (word accuracy of 73.8%) as the input to the understanding component.</Paragraph>
<Paragraph position="3"> The grammar-based multimodal understanding system composes the input multimodal string with the multimodal grammar, represented as an FST, to produce an interpretation. Thus an interpretation can be assigned only to those multimodal strings that are encoded in the grammar. However, the result of ASR and gesture recognition may not be one of the strings encoded in the grammar, and such strings are not assigned an interpretation. This fact is reflected in the low concept string accuracy of the baseline, as shown in Table 3.</Paragraph>
<Paragraph position="5"> The pattern-matching-based robust understanding approach mediates the mismatch between the strings output by the ASR and the strings that can be assigned an interpretation. We experimented with both word-based and phone-based pattern matching on the one-best output of the recognizer. As shown in Table 3, the pattern-matching robust understanding approach improves the concept accuracy significantly over the baseline. Furthermore, the phone-based matching method performs similarly to the word-based matching method.</Paragraph>
<Paragraph position="6"> For the classification-based approach to robust understanding, we used a total of 10 predicates, such as help, assert, and inforequest, and 20 argument types, such as cuisine, price, and location. We use unigrams, bigrams and trigrams appearing in the multimodal utterance as weak classifiers for predicate classification.
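For illustration only, the n-gram weak-classifier features for predicate classification described above can be generated as in the following sketch; the example token stream, including the gesture token, is invented.

```python
# Extract all unigram, bigram, and trigram features from a (speech + gesture)
# token stream, as used for predicate classification in Section 5.2.
def ngram_features(tokens, max_order=3):
    feats = []
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            feats.append("_".join(tokens[i:i + order]))
    return feats

tokens = ["phone", "numbers", "for", "these", "three", "restaurants", "G_area"]
print(ngram_features(tokens))
```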
In order to predict the tag of a word for argument extraction, we use the left and right trigram contexts and the tags of the preceding two tokens as weak classifiers. The results are presented in Table 3.</Paragraph>
<Paragraph position="7"> Both approaches to robust understanding significantly outperform the baseline model. However, it is interesting to note that while the pattern-matching-based approach has better argument extraction accuracy, the classification-based approach has better predicate identification accuracy. There are two possible reasons for this. First, argument extraction requires more non-local information, which is available to the pattern-matching approach, whereas the classification-based approach relies on local information and is thus better suited to identifying the simple predicates in MATCH. Second, the pattern-matching approach uses the entire grammar as a model for matching, while the classification approach is trained only on the training data, which is significantly smaller than the set of examples encoded in the grammar.</Paragraph> </Section> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Discussion </SectionTitle>
<Paragraph position="0"> Although we are not aware of any attempts to address the issue of robust understanding in the context of multimodal systems, this issue has been of great interest in the context of speech-only conversational systems (Dowding et al., 1993; Seneff, 1992; Allen et al., 2000; Lavie, 1996). The output of the recognizer in these systems is usually parsed using a handcrafted grammar that assigns a meaning representation suited to the downstream dialog component. The coverage problems of the grammar and the parsing of extra-grammatical utterances are typically addressed by retrieving fragments from the parse chart and applying operations that combine fragments to derive a meaning for the recognized utterance. We have presented an approach that achieves robust multimodal utterance understanding using an edit-distance automaton in a finite-state-based interpreter, without the need for combining fragments from a parser.</Paragraph>
<Paragraph position="1"> The issue of combining rule-based and data-driven approaches has received less attention, with a few exceptions (Wang et al., 2000; Rayner and Hockey, 2003; Wang and Acero, 2003). In a recent paper (Rayner and Hockey, 2003), the authors address this issue by employing a decision-list-based speech understanding system as a means of progressing from rule-based models to data-driven models as data becomes available. The decision-list-based understanding system also provides a method for robust understanding. In contrast, the approach presented in this paper can be used on lattices of speech and gesture to produce a lattice of meaning representations.</Paragraph> </Section> </Paper>