<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1036"> <Title>Backoff Model Training using Partially Observed Data: Application to Dialog Act Tagging</Title>
<Section position="3" start_page="280" end_page="281" type="metho"> <SectionTitle> 2 DBN-based Models for Tagging </SectionTitle>
<Paragraph position="0"> Dynamic Bayesian networks (DBNs) (Murphy, 2002) are widely used in sequential data analysis such as automatic speech recognition (ASR) and DNA sequence analysis (Durbin et al., 1999). A hidden Markov model (HMM) for DA tagging as in (Stolcke et al., 1998) is one such instance.</Paragraph>
<Paragraph position="1"> Figure 1 shows a generative DBN model that will be taken as our baseline. This DBN shows a prologue (the first time slice of the model), an epilogue (the last slice), and a chunk that is repeated sufficiently often to fit the entire data stream. In this case, the data stream consists of the words of a meeting conversation, where individuals within the meeting (hopefully) take turns speaking. In our model, the entire meeting conversation, and all turns of all speakers, are strung together into a single stream rather than treating each turn in the meeting individually. This approach has the benefit that we are able to integrate a temporal DA-to-DA model (such as a DA bigram).</Paragraph>
<Paragraph position="2"> In all our models, to simplify, we assume that the sentence change information is known (as is common with this corpus (Shriberg et al., 2004)). We next describe Figure 1 in detail. Normally, the sentence change variable is not set, so that we are within a sentence (or a particular DA). When a sentence change does not occur, the DA stays the same from slice to slice. During this time, we use a DA-specific language model (implemented via a backoff strategy) to score the words within the current DA.</Paragraph>
<Paragraph position="3"> When a sentence change event does occur, a new DA is predicted based on the DA from the previous sentence (using a DA bigram). At the beginning of a sentence, rather than conditioning on the last word of the previous sentence, we condition on the special start-of-sentence <s> token, as shown in the figure by having a special parent that is used only when sentence change is true. Lastly, at the very beginning of a meeting, a special start-of-DA token is used.</Paragraph>
<Paragraph position="4"> The joint probability under this baseline model is written as follows:</Paragraph>
<Paragraph position="5"> P(W, D) = \prod_{k} P(d_k \mid d_{k-1}) \prod_{i} P(w_{k,i} \mid w_{k,i-1}, d_k) \quad (1) </Paragraph>
<Paragraph position="6"> where W = {w_{k,i}} is the word sequence, D = {d_k} is the DA sequence, d_k is the DA of the k-th sentence, and w_{k,i} is the i-th word of the k-th sentence in the meeting.</Paragraph>
<Paragraph position="7"> Because all variables are observed when training our baseline, we use the SRILM toolkit (Stolcke, 2002), modified Kneser-Ney smoothing (Chen and Goodman, 1998), and factored extensions (Bilmes and Kirchhoff, 2003). In evaluations, the Viterbi algorithm (Viterbi, 1967) can be used to find the best DA sequence path from the words of the meeting according to the joint distribution in Equation (1).</Paragraph>
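As a concrete illustration of Equation (1), the following Python sketch computes the joint log probability of a meeting under the baseline model, assuming the DA bigram and the DA-specific word bigrams have already been trained (for example with SRILM) and exported as plain probability dictionaries. The dictionary layout and function name are illustrative assumptions, not any toolkit's actual interface.

    import math

    def baseline_log_prob(sentences, da_tags, da_bigram, word_bigram):
        # sentences: list of word lists, one per sentence; da_tags: one DA label per sentence.
        # da_bigram[(prev_da, da)] and word_bigram[(da, prev_word, word)] hold probabilities
        # from already-trained backoff models (hypothetical dictionary layout).
        logp = 0.0
        prev_da = "<start-DA>"                  # special start-of-DA token
        for words, da in zip(sentences, da_tags):
            logp += math.log(da_bigram[(prev_da, da)])           # P(d_k | d_{k-1})
            prev_word = "<s>"                                    # start-of-sentence token
            for w in words:
                logp += math.log(word_bigram[(da, prev_word, w)])  # P(w_{k,i} | w_{k,i-1}, d_k)
                prev_word = w
            prev_da = da
        return logp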
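Because sentence boundaries are assumed known, Viterbi decoding under Equation (1) reduces to a sentence-level search over DA labels in which each sentence is scored by the word model of the candidate DA. A minimal sketch under the same assumed dictionary layout (not the decoder actually used in the experiments):

    import math

    def viterbi_da(sentences, da_set, da_bigram, word_bigram):
        # Sentence-level Viterbi over DA labels; each sentence is scored by the
        # DA-specific word bigram model, exactly as in Equation (1).
        def emit(words, da):
            score, prev = 0.0, "<s>"
            for w in words:
                score += math.log(word_bigram[(da, prev, w)])
                prev = w
            return score

        # delta[d]: best log score of any DA assignment so far that ends in DA d;
        # back stores, for each later sentence, the best predecessor DA for traceback.
        delta = {d: math.log(da_bigram[("<start-DA>", d)]) + emit(sentences[0], d)
                 for d in da_set}
        back = []
        for words in sentences[1:]:
            new_delta, bp = {}, {}
            for d in da_set:
                best_prev = max(da_set,
                                key=lambda p: delta[p] + math.log(da_bigram[(p, d)]))
                new_delta[d] = (delta[best_prev]
                                + math.log(da_bigram[(best_prev, d)]) + emit(words, d))
                bp[d] = best_prev
            delta = new_delta
            back.append(bp)

        best = max(da_set, key=lambda d: delta[d])   # best DA for the last sentence
        path = [best]
        for bp in reversed(back):                    # follow predecessors backwards
            path.append(bp[path[-1]])
        return list(reversed(path))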
</Section>
<Section position="4" start_page="281" end_page="282" type="metho"> <SectionTitle> 3 Hidden Backoff Models </SectionTitle>
<Paragraph position="0"> When analyzing discourse patterns, it can be seen that sentences with different DAs usually have different internal structures. Accordingly, in this work we do not assume that sentences of every dialog act share the same hidden state patterns. For instance (and as mentioned above), a statement can consist of a noun followed by a verb phrase.</Paragraph>
<Paragraph position="1"> A problem, however, is that sub-DAs are not annotated in our training corpus. While clustering and annotation of such phrases is already a widely developed research topic (Pieraccini and Levin, 1991; Lee et al., 1997; Gildea and Jurafsky, 2002), in our approach we use an EM algorithm to learn these hidden sub-DAs in a data-driven fashion. Pictorially, we add a layer of hidden states to our baseline DBN as illustrated in Figure 2.</Paragraph>
<Paragraph position="2"> Under this model, the joint probability is:</Paragraph>
<Paragraph position="3"> P(W, S, D) = \prod_{k} P(d_k \mid d_{k-1}) \prod_{i} P(s_{k,i} \mid s_{k,i-1}, d_k) \, P(w_{k,i} \mid w_{k,i-1}, s_{k,i}, d_k) \quad (2) </Paragraph>
<Paragraph position="4"> where S = {s_{k,i}} is the hidden state sequence, s_{k,i} is the hidden state at the i-th position of the k-th sentence, and the other variables are the same as before.</Paragraph>
<Paragraph position="5"> Similar to our baseline model, the DA bigram P(d_k | d_{k-1}) can be modeled using a backoff bigram. Moreover, if the states {s_{k,i}} were known during training, the word prediction probability P(w_{k,i} | w_{k,i-1}, s_{k,i}, d_k) could also use backoff and be trained accordingly. The hidden state sequence is unknown, however, and thus cannot be used directly to produce a standard backoff model. What we desire is the ability to use a backoff model (to mitigate data sparseness) while retaining the state as a hidden (rather than an observed) variable, together with a procedure that trains the entire model to improve overall model likelihood.</Paragraph>
<Paragraph position="6"> Expectation-maximization (EM) algorithms are well known to be able to train models with hidden states. Standard advanced smoothing methods such as modified Kneser-Ney smoothing (Chen and Goodman, 1998), however, require integer counts (rather than the fractional expected counts EM produces) and, moreover, meta-counts (counts of counts). Therefore, in order to train this model, we propose an embedded training algorithm that cycles between a standard EM training procedure (to train the hidden state distribution) and a stage where the most likely hidden states (and their counts and meta-counts) are used externally to train a backoff model. This procedure can be described in detail as follows: the input to the algorithm is the words and the DA of each sentence in the meeting; the output is the corresponding conditional probability table (CPT) for hidden state transitions and a backoff model for word prediction. Because we train the backoff model while some of the variables are hidden, we call the result a hidden backoff model. While embedded Viterbi training has been used in the past to simultaneously train heterogeneous models (e.g., Markov chains and neural networks (Morgan and Bourlard, 1990)), this is, to our knowledge, the first instance of training backoff models that involve hidden variables.</Paragraph>
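The numbered steps of the algorithm are not reproduced here; the sketch below is one possible realization of the cycle just described, under stated assumptions. The callables it takes (em_update_state_cpt, viterbi_align, collect_counts, train_backoff_lm) are hypothetical stand-ins for DBN inference and backoff language-model estimation, not the authors' actual tools or interfaces.

    def train_hidden_backoff_model(meetings, state_cpt, word_lm,
                                   em_update_state_cpt, viterbi_align,
                                   collect_counts, train_backoff_lm,
                                   num_iters=10):
        # Embedded-training sketch for the hidden backoff model.
        # meetings: iterable of meetings, each a list of (words, da) pairs.
        # state_cpt: initial CPT for P(s_{k,i} | s_{k,i-1}, d_k).
        # word_lm:  initial backoff model for P(w_{k,i} | w_{k,i-1}, s_{k,i}, d_k).
        for _ in range(num_iters):
            # (a) Standard EM over the hidden sub-DA states, holding the current
            #     backoff word model fixed and updating only the transition CPT.
            state_cpt = em_update_state_cpt(meetings, state_cpt, word_lm)

            # (b) Take the single most likely hidden-state sequence for every
            #     sentence under the current model (a Viterbi alignment) ...
            alignments = [viterbi_align(words, da, state_cpt, word_lm)
                          for meeting in meetings
                          for words, da in meeting]

            # ... then collect integer counts (and counts of counts) from those
            #     alignments and rebuild the backoff word model from them,
            #     exactly as if the states had been observed.
            word_lm = train_backoff_lm(collect_counts(alignments))

        return state_cpt, word_lm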
<Paragraph position="7"> While embedded Viterbi estimation is not guaranteed to have the same convergence (or the same fixed point under convergence) as standard EM (Lember and Koloydenko, 2004), we find empirically that this is the case (see examples below). Moreover, our algorithm can easily be modified so that, instead of taking a Viterbi alignment in step 5, we use a set of random samples generated under the current model. In this case, it can be shown using a law-of-large-numbers argument that having sufficiently many samples guarantees that the algorithm will converge (we will investigate this modification in future work).</Paragraph>
<Paragraph position="8"> Of course, when decoding with such a model, a conventional Viterbi algorithm can still be used to calculate the best DA sequence.</Paragraph>
</Section> </Paper>