<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1036">
<Title>Backoff Model Training using Partially Observed Data: Application to Dialog Act Tagging</Title>
<Section position="2" start_page="0" end_page="280" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">Discourse patterns in natural conversations and meetings are well known to provide interesting and useful information about human conversational behavior, and they thus attract research from many different perspectives. Dialog acts (DAs) (Searle, 1969), which reflect the functions that utterances serve in a discourse, are one such pattern. Detecting and understanding dialog act patterns can benefit systems such as automatic speech recognition (ASR) (Stolcke et al., 1998), machine dialog translation (Lee et al., 1998), and general natural language processing (NLP) (Jurafsky et al., 1997b; He and Young, 2003). DA pattern recognition is an instance of tagging, and many different techniques have been quite successful at it, including hidden Markov models (Jurafsky et al., 1997a; Stolcke et al., 1998), semantic classification trees and polygrams (Mast et al., 1996), maximum entropy models (Ang et al., 2005), and other language models (Reithinger et al., 1996; Reithinger and Klesen, 1997). Like other tagging tasks, DA recognition can also be achieved using conditional random fields (Lafferty et al., 2001; Sutton et al., 2004) and general discriminative modeling on structured outputs (Bartlett et al., 2004). In many sequential data analysis tasks (speech, language, or DNA sequence analysis), standard dynamic Bayesian networks (DBNs) (Murphy, 2002) have shown great flexibility and are widely used. In (Ji and Bilmes, 2005), for example, an analysis of DA tagging using DBNs is performed, where the models avoid label bias through structural changes and avoid data sparseness by using a generalized backoff procedure (Bilmes and Kirchhoff, 2003).</Paragraph>
<Paragraph position="1">Most DA classification procedures assume that within a sentence of a particular fixed DA type, there is a fixed word distribution over the entire sentence. Similar to (Ma et al., 2000) (and see citations therein), we have found, however, that intra-sentence discourse patterns are inherently dynamic.</Paragraph>
<Paragraph position="2">Moreover, the patterns are specific to each type of DA, meaning a sentence will go through a DA-specific sequence of sub-DA phases or states. A generative description of this phenomenon is that a DA is first chosen, and then words are generated according to both the DA and the relative position of the word in that sentence. For example, a statement (one type of DA) can consist of a subject (noun phrase), verb phrase, and object (noun phrase); this particular sequence might be different for a different DA (e.g., a back-channel). Our belief is that explicitly modeling these internal states can help a DA-classification system in conversational meetings or dialogs.</Paragraph>
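To make this generative description concrete, one way to write it down is the following factorization. This is a minimal sketch only, and the notation (d for the DA, s_t for the hidden sub-DA state at word position t, w_t for the t-th word, and s_0 a fixed start symbol) is ours rather than the paper's:

\[
P(w_1,\dots,w_T,\, s_1,\dots,s_T \mid d)
  \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1}, d)\, P(w_t \mid s_t, d).
\]

Here the sub-DA sequence s_1,\dots,s_T (e.g., subject, then verb phrase, then object for a statement) is never observed in the training data, which is exactly what motivates the training procedure developed below.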
<Paragraph position="3">In this work, we describe an approach that is motivated by several aspects of the typical DA-classification procedure. First, it is rare to have sub-DAs labeled in training data, and indeed this is true of the corpus (Shriberg et al., 2004) that we use. Therefore, some form of unsupervised clustering or shallow pre-parsing of sub-DAs must be performed; in such a model, the sub-DAs are essentially unknown hidden variables that ideally could be trained with an expectation-maximization (EM) procedure. Second, when training models of language, it is necessary to employ some form of smoothing, since otherwise data sparseness would render standard maximum-likelihood trained models useless. Third, discrete conditional probability distributions formed using smoothed backoff models (particularly with modified Kneser-Ney smoothing (Chen and Goodman, 1998)) have been extremely successful in many language modeling tasks. Training backoff models, however, requires that all data be observed so that counts can be formed. Indeed, our DA-specific word models (implemented via backoff) will also need to condition on the current sub-DA, which is unknown at training time.</Paragraph>
<Paragraph position="4">We have therefore developed a procedure that allows us to train generalized backoff models (Bilmes and Kirchhoff, 2003) even when some or all of the variables involved in the model are hidden. We thus call our models hidden backoff models (HBMs). Our method is a form of embedded EM training (Morgan and Bourlard, 1990), and more generally a specific form of EM (Neal and Hinton, 1998).</Paragraph>
<Paragraph position="5">Our approach is similar to (Ma et al., 2000), except that our underlying language models are backoff-based and thus retain the benefits of advanced smoothing methods, and we utilize both a normal and a backoff EM step, as will be seen. We moreover wrap these ideas in the framework of dynamic Bayesian networks, which are used to represent and train all of our models.</Paragraph>
<Paragraph position="6">We evaluate our methods on the ICSI meeting recorder dialog act (MRDA) corpus (Shriberg et al., 2004) and find that our novel hidden backoff model can significantly improve dialog act tagging accuracy. With a different number of hidden states for each DA, a relative reduction in tagging error rate of as much as 6.1% can be achieved. Our best HBM result improves on the best previously reported accuracy (to our knowledge) on this corpus, which uses acoustic prosody as a feature. We have moreover developed our own prosody model, and while we have not been able to usefully employ prosody and the HBM technique together, our HBM is competitive in this case as well. Furthermore, our results show the effectiveness of our embedded EM procedure: it increases training log likelihood while simultaneously reducing error rate.</Paragraph>
<Paragraph position="7">Section 2 briefly summarizes our baseline DBN-based models for the DA tagging task. In Section 3, we introduce our HBMs. Section 4 contains experimental evaluations on the MRDA corpus, and finally Section 5 concludes.</Paragraph>
</Section>
</Paper>
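As a concrete illustration of the hidden backoff idea described in this introduction, the following is a minimal, heavily simplified Python sketch, not the paper's implementation: an E-step collects fractional (expected) counts of word/hidden-state pairs, and an M-step re-estimates a smoothed word distribution from those expected counts. For brevity it treats each word position's hidden state as independent (the paper's DBN would instead run full inference over sub-DA state sequences), reads "state" loosely as a (DA, sub-DA) pair, keeps the state prior fixed, and substitutes absolute discounting interpolated with a unigram for generalized backoff with modified Kneser-Ney; all names here are illustrative.

from collections import defaultdict

# E-step: fractional (expected) counts of (state, word) pairs, treating each
# word position's hidden state as independent given the current model.
def e_step(sentences, p_word_given_state, p_state, states):
    counts = defaultdict(float)
    for words in sentences:
        for w in words:
            scores = {s: p_state[s] * p_word_given_state[s].get(w, 1e-6)
                      for s in states}
            z = sum(scores.values())
            for s in states:
                counts[(s, w)] += scores[s] / z
    return counts

# M-step: re-estimate P(w | s) from the fractional counts with absolute
# discounting, redistributing the discounted mass along a state-independent
# unigram distribution (a crude stand-in for a proper backoff model).
def m_step(counts, discount=0.5):
    word_total = defaultdict(float)
    state_total = defaultdict(float)
    for (s, w), c in counts.items():
        word_total[w] += c
        state_total[s] += c
    grand_total = sum(word_total.values())
    unigram = {w: c / grand_total for w, c in word_total.items()}
    p = {}
    for (s, w), c in counts.items():
        p.setdefault(s, {})[w] = max(c - discount, 0.0) / state_total[s]
    for s, dist in p.items():
        leftover = 1.0 - sum(dist.values())  # mass freed by discounting
        for w, pu in unigram.items():
            dist[w] = dist.get(w, 0.0) + leftover * pu
    return p

# Toy usage: two hidden sub-DA states, a handful of utterances, a few EM
# iterations; asymmetric initialization breaks the symmetry between states.
sentences = [["yeah", "right"], ["i", "think", "so"], ["yeah", "i", "think"]]
states = ["s0", "s1"]
p_state = {"s0": 0.5, "s1": 0.5}
p_w = {"s0": {"yeah": 0.4, "right": 0.3, "i": 0.1, "think": 0.1, "so": 0.1},
       "s1": {"yeah": 0.1, "right": 0.1, "i": 0.3, "think": 0.3, "so": 0.2}}
for _ in range(5):
    p_w = m_step(e_step(sentences, p_w, p_state, states))
print(p_w["s0"])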