<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1644">
  <Title>Style &amp; Topic Language Model Adaptation Using HMM-LDA</Title>
  <Section position="5" start_page="374" end_page="375" type="metho">
    <SectionTitle>
3 HMM-LDA
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="374" end_page="374" type="sub_section">
      <SectionTitle>
3.1 Latent Dirichlet Allocation
</SectionTitle>
      <Paragraph position="0"> Discrete Principal Component Analysis describes a family of models that decompose a set of feature vectors into its principal components (Buntine and Jakulin, 2005). Describing feature vectors via their components reduces the number of parameters required to model the data, hence improving the quality of the estimated parameters when given limited training data. LSA, PLSA, and LDA are all examples from this family.</Paragraph>
      <Paragraph position="1"> Given a predefined number of desired components, LSA models feature vectors by finding a set of orthonormal components that maximize the variance using singular value decomposition (Deerwester et al., 1990). Unfortunately, the component vectors may contain non-interpretable negative values when working with word occurrence counts as feature vectors. PLSA eliminates this problem by using non-negative matrix factorization to model each document as a weighted combination of a set of non-negative feature vectors (Hofmann, 1999). However, because the number of parameters grows linearly with the number of documents, the model is prone to overfitting. Furthermore, because each training document has its own set of topic weight parameters, PLSA does not provide a generative framework for describing the probability of an unseen document (Blei et al., 2003).</Paragraph>
      <Paragraph position="2"> To address the shortcomings of PLSA, Blei et al. (2003) introduced the LDA model, which further imposes a Dirichlet distribution on the topic mixture weights corresponding to the documents in the corpus. With the number of model parameters dependent only on the number of topic mixtures and vocabulary size, LDA is less prone to overfitting and is capable of estimating the probability of unobserved test documents.</Paragraph>
      <Paragraph position="3"> Empirically, LDA has been shown to outperform PLSA in corpus perplexity, collaborative filtering, and text classification experiments (Blei et al., 2003). Various extensions to the basic LDA model have since been proposed. The Author Topic model adds an additional dependency on the author(s) to the topic mixture weights of each document (Rosen-Zvi et al., 2005). The Hierarchical Dirichlet Process is a nonparametric model that generalizes distribution parameter modeling to multiple levels. Without having to estimate the number of mixture components, this model has been shown to match the best result from LDA on a document modeling task (Teh et al., 2004).</Paragraph>
    </Section>
    <Section position="2" start_page="374" end_page="375" type="sub_section">
      <SectionTitle>
3.2 Hidden Markov Model with LDA
</SectionTitle>
      <Paragraph position="0"> HMM-LDA model proposed by Griffiths et al.</Paragraph>
      <Paragraph position="1"> (2004) combines the HMM and LDA models to separate syntactic words with local dependencies from topic-dependent content words without requiring any labeled data. Similar to HMM-based part-of-speech taggers, HMM-LDA maps each word in the document to a hidden syntactic state.</Paragraph>
      <Paragraph position="2"> Each state generates words according to a uni-gram distribution except the special topic state, where words are modeled by document-specific mixtures of topic distributions, as in LDA.</Paragraph>
      <Paragraph position="3"> Figure 1 describes this generative process in more detail.</Paragraph>
      <Paragraph position="4">  model representation of HMM-LDA. The number of states and topics are pre-specified. The topic mixture for each document is modeled with a Dirichlet distribution. Each word wi in the n-word document is generated from its hidden state si or hidden topic zi if si is the special topic state. Unlike vocabulary selection techniques that separate domain-independent words from topic-specific keywords using word collocation statistics, HMM-LDA classifies each word instance according to its context. Thus, an instance of the word &amp;quot;return&amp;quot; may be assigned to a syntactic state in &amp;quot;to return a&amp;quot;, but classified as a topic keyword in &amp;quot;expected return for&amp;quot;. By labeling each word in the training set with its syntactic state and mixture topic, HMM-LDA not only separates stylistic words from content words in a context-dependent manner, but also decomposes the corpus into a set of topic word distributions. This form of soft, context-dependent classifica-For each document d in the corpus:  1. Draw topic weights dth from )(Dirichlet a 2. For each word wi in document d: a. Draw topic zi from )l(Multinomia dth b. Draw state si from )(ultinomialM 1[?]ispi c. Draw word wi from:</Paragraph>
      <Paragraph position="6"> tion has many potential uses for language modeling, topic segmentation, and indexing.</Paragraph>
    </Section>
    <Section position="3" start_page="375" end_page="375" type="sub_section">
      <SectionTitle>
3.3 Training
</SectionTitle>
      <Paragraph position="0"> To train an HMM-LDA model, we employ the MATLAB Topic Modeling Toolbox 1.3 (Griffiths and Steyvers, 2004; Griffiths et al., 2004). This particular implementation performs Gibbs sampling, a form of Markov chain Monte Carlo (MCMC), to estimate the optimal model parameters fitted to the training data. Specifically, the algorithm creates a Markov chain whose stationary distribution matches the expected distribution of the state and topic labels for each word in the training corpus. Starting from random labels, Gibbs sampling sequentially samples the label for each hidden variable conditioned on the current value of all other variables. After a sufficient number of iterations, the Markov chain converges to the stationary distribution. We can easily compute the posterior word distribution for each state and topic from a single sample by averaging over the label counts and prior parameters. With a sufficiently large training set, we will have enough words assigned to each state and topic to yield a reasonable approximation to the underlying distribution.</Paragraph>
      <Paragraph position="1"> In the following sections, we examine the application of models derived from the HMM-LDA labels to the task of spoken lecture transcription and explore techniques on adaptive topic modeling to construct a better lecture language model.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="375" end_page="377" type="metho">
    <SectionTitle>
4 HMM-LDA Analysis
</SectionTitle>
    <Paragraph position="0"> Our language modeling experiments have been conducted on high-fidelity transcripts of approximately 168 hours of lectures from three undergraduate subjects in math, physics, and computer science (CS), as well as 79 seminars covering a wide range of topics (Glass et al., 2004).</Paragraph>
    <Paragraph position="1"> For evaluation, we withheld the set of 20 CS lectures and used the first 10 lectures as a development set and the last 10 lectures for the test set. The remainder of these data was used for training and will be referred to as the Lectures dataset.</Paragraph>
    <Paragraph position="2"> To supplement the out-of-domain lecture transcripts with topic-specific textual resources, we added the CS course textbook (Textbook) as additional training data for learning the target topics. To create topic-cohesive documents, the textbook is divided at every section heading to form 271 documents. Next, the text is heuristically segmented at sentence-like boundaries and normalized into the words corresponding to the spoken form of the text. Table 1 summarizes the data used in this evaluation.</Paragraph>
    <Paragraph position="3">  In the following analysis, we ran the Gibbs sampler against the Lectures dataset for a total of 2800 iterations, computing a model every 10 iterations, and took the model with the lowest perplexity as the final model. We built the model with 20 states and 100 topics based on preliminary experiments. We also trained an HMM-LDA model on the Textbook dataset using the same model parameters. We ran the sampler for a total of 2000 iterations, computing the perplexity every 100 iterations. Again, we selected the lowest perplexity model as the final model.</Paragraph>
    <Section position="1" start_page="375" end_page="376" type="sub_section">
      <SectionTitle>
4.1 Semantic Topics
</SectionTitle>
      <Paragraph position="0"> HMM-LDA extracts words whose distributions vary across documents and clusters them into a set of components. In Figure 2, we list the top 10 words from a random selection of 10 topics computed from the Lectures dataset. As shown, the words assigned to the LDA topic state are representative of content words and are grouped into broad semantic topics. For example, topic 4, 8, and 9 correspond to machine learning, linear algebra, and magnetism, respectively.</Paragraph>
      <Paragraph position="1"> Since the Lectures dataset consists of speech transcripts with disfluencies, it is interesting to  observe that &amp;quot;&lt;laugh&gt;&amp;quot; is the top word in a topic corresponding to childhood memories.</Paragraph>
      <Paragraph position="2"> Cursory examination of the data suggests that the speakers talking about children tend to laugh more during the lecture. Although it may not be desirable to capture speaker idiosyncrasies in the topic mixtures, HMM-LDA has clearly demonstrated its ability to capture distinctive semantic topics in a corpus. By leveraging all documents in the corpus, the model yields smoother topic word distributions that are less vulnerable to overfitting.</Paragraph>
      <Paragraph position="3"> Since HMM-LDA labels the state and topic of each word in the training corpus, we can also visualize the results by color-coding the words by their topic assignments. Figure 3 shows a color-coded excerpt from a topically coherent paragraph in the Textbook dataset. Notice how most of the content words (uppercase) are assigned to the same topic/color. Furthermore, of the 7 instances of the words &amp;quot;and&amp;quot; and &amp;quot;or&amp;quot; (underlined), 6 are correctly classified as syntactic or topic words, demonstrating the context-dependent labeling capabilities of the HMM-LDA model. Moreover, from these labels, we can identify multi-word topic key phrases (e.g.</Paragraph>
      <Paragraph position="4"> output signals, input signal, &amp;quot;and&amp;quot; gate) in addition to standalone keywords, an observation we will leverage later on with n-gram topic models.</Paragraph>
      <Paragraph position="5">  dataset showing the context-dependent topic labels. Syntactic words appear black in lowercase. Topic words are shown in uppercase with their respective topic colors. All instances of the words &amp;quot;and&amp;quot; and &amp;quot;or&amp;quot; are underlined.</Paragraph>
    </Section>
    <Section position="2" start_page="376" end_page="376" type="sub_section">
      <SectionTitle>
4.2 Syntactic States
</SectionTitle>
      <Paragraph position="0"> Since the syntactic states are shared across all documents, we expect words associated with the syntactic states when applying HMM-LDA to the Lectures dataset to reflect the lecture style vocabulary. null In Figure 4, we list the top 10 words from each of the 19 syntactic states (state 20 is the topic state). Note that each state plays a clear syntactic role. For example, state 2 contains prepositions while state 7 contains verbs. Since the model is trained on transcriptions of spontaneous speech, hesitation disfluencies (&lt;uh&gt;, &lt;um&gt;, &lt;partial&gt;) are all grouped in state 3 along with other words (so, if, okay) that frequently indicate hesitation. While many of these hesitation words are conjunctions, the words in state 6 show that most conjunctions are actually assigned to a different state representing different syntactic behavior from hesitations. As demonstrated with spontaneous speech, HMM-LDA yields syntactic states that have a good correspondence to part-of-speech labels, without requiring any labeled training data.</Paragraph>
    </Section>
    <Section position="3" start_page="376" end_page="377" type="sub_section">
      <SectionTitle>
4.3 Discussions
</SectionTitle>
      <Paragraph position="0"> Although MCMC techniques converge to the global stationary distribution, we cannot guarantee convergence from observation of the perplexity alone. Unlike EM algorithms, random sampling may actually temporarily decrease the model likelihood. Thus, in the above analysis, the number of iterations was chosen to be at least double the point at which the perplexity first appeared to converge.</Paragraph>
      <Paragraph position="1"> In addition to the number of iterations, the choice of the number of states and topics, as well as the values of the hyper-parameters on the Dirichlet prior, also impact the quality and effectiveness of the resulting model. Ideally, we run the algorithm with different combinations of the parameter values and perform model selection to choose the model with the best complexitypenalized likelihood. However, given finite computing resources, this approach is often im-We draw an INVERTER SYMBOLICALLY as in Figure 3.24. An AND GATE, also shown in Figure 3.24, is a PRIMITIVE  FUNCTION box with two INPUTS and ONE OUTPUT. It drives its OUTPUT SIGNAL to a value that is the LOGICAL AND of the INPUTS. That is, if both of its INPUT SIGNALS BECOME 1. Then ONE and GATE DELAY time later the AND GATE will force its OUTPUT SIGNAL TO be 1; otherwise the OUTPUT will be 0. An OR GATE is a SIMILAR two INPUT PRIMITIVE FUNCTION box that drives its OUTPUT SIGNAL to a value that is the LOGICAL OR of the INPUTS. That is, the  practical. As an alternative for future work, we would like to perform Gibbs sampling on the hyper-parameters (Griffiths et al., 2004) and apply the Dirichlet process to estimate the number of states and topics (Teh et al., 2004).</Paragraph>
      <Paragraph position="2"> Despite the suboptimal choice of parameters and potential lack of convergence, the labels derived from HMM-LDA are still effective for language modeling applications, as described next.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="377" end_page="379" type="metho">
    <SectionTitle>
5 Language Modeling Experiments
</SectionTitle>
    <Paragraph position="0"> To evaluate the effectiveness of models derived from the separation of syntax from content, we performed experiments that compare the perplexities and WERs of various model combinations. For a baseline, we used an adapted model (L+T) that linearly interpolates trigram models trained on the Lectures (L) and Textbook (T) datasets. In all models, all interpolation weights and additional parameters are tuned on a development set consisting of the first half of the CS lectures and tested on the second half. Unless otherwise noted, modified Kneser-Ney discounting (Chen and Goodman, 1998) is applied with the respective training set vocabulary using the SRILM Toolkit (Stolcke, 2002).</Paragraph>
    <Paragraph position="1"> To compute the word error rates associated with a specific language model, we used a speaker-independent speech recognizer (Glass, 2003). The lectures were pre-segmented into utterances by forced alignment of the reference transcription.</Paragraph>
    <Section position="1" start_page="377" end_page="377" type="sub_section">
      <SectionTitle>
5.1 Lecture Style
</SectionTitle>
      <Paragraph position="0"> In general, an n-gram model trained on a limited set of topic-specific documents tends to overemphasize words from the observed topics instead of evenly distributing weights over all potential topics. Specifically, given the list of words following an n-gram context, we would like to deemphasize the observed occurrences of topic words and ideally redistribute these counts to all potential topic words. As an approximation, we can build such a topic-deemphasized style tri-gram model (S) by using counts of only n-gram sequences that do not end on a topic word, smoothed over the Lectures vocabulary. Figure 5 shows the n-grams corresponding to an utterance used to build the style trigram model. Note that the counts of topic to style word transitions are not altered as these probabilities are mostly independent of the observed topic distribution.</Paragraph>
      <Paragraph position="1"> By interpolating the style model (S) from above with the smoothed trigram model based on the Lectures dataset (L), the combined model (L+S) achieves a 3.6% perplexity reduction and  1.0% WER reduction over (L), as shown in Table 2. Without introducing topic-specific training  data, we can already improve the generic lecture LM performance using the HMM-LDA labels.</Paragraph>
      <Paragraph position="2"> &lt;s&gt; for the SPATIAL MEMORY &lt;/s&gt; unigrams: for, the, spatial, memory, &lt;/s&gt; bigrams: &lt;s&gt; for, for the, the spatial, spatial memory, memory &lt;/s&gt; trigrams: &lt;s&gt; &lt;s&gt; for, &lt;s&gt; for the, for the spatial, the spatial memory, spatial memory &lt;/s&gt; Figure 5: Style model n-grams. Topic words in the utterance are in uppercase.</Paragraph>
    </Section>
    <Section position="2" start_page="377" end_page="378" type="sub_section">
      <SectionTitle>
5.2 Topic Domain
</SectionTitle>
      <Paragraph position="0"> Unlike Lectures, the Textbook dataset contains content words relevant to the target lectures, but in a mismatched style. Commonly, the Textbook trigram model is interpolated with the generic model to improve the probability estimates of the transitions involving topic words. The interpolation weight is chosen to best fit the probabilities of these n-gram sequences while minimizing the mismatch in style. However, with only one parameter, all n-gram contexts must share the same mixture weight. Because transitions from contexts containing topic words are rarely observed in the off-topic Lectures, the Textbook model (T) should ideally have higher weight in these contexts than contexts that are more equally observed in both datasets.</Paragraph>
      <Paragraph position="1"> One heuristic approach for adjusting the weight in these contexts is to build a topic-domain trigram model (D) from the Textbook n-gram counts with Witten-Bell smoothing (Chen and Goodman, 1998) where we emphasize the sequences containing a topic word in the context by doubling their counts. In effect, this reduces the smoothing on words following topic contexts with respect to lower-order models without significantly affecting the transitions from non-topic words. Figure 6 shows the adjusted counts for an utterance used to build the domain trigram model.</Paragraph>
      <Paragraph position="2"> &lt;s&gt; HUFFMAN CODE can be represented as a BINARY TREE ... unigrams: huffman, code, can, be, represented, as, binary, tree, ... bigrams: &lt;s&gt; huffman, huffman code (2x), code can (2x), can be, be represented, represented as, a binary, binary tree (2x), ...</Paragraph>
      <Paragraph position="3"> trigrams: &lt;s&gt; &lt;s&gt; hufmann, &lt;s&gt; hufmann code (2x), hufmann code can (2x), code can be (2x), can be represented, be represented as, represented as a, as a binary, a binary tree (2x), ... Figure 6: Domain model n-grams. Topic words in the utterance are in uppercase.</Paragraph>
      <Paragraph position="4">  Empirically, interpolating the lectures, textbook, and style models with the domain model (L+T+S+D) further decreases the perplexity by 1.4% and WER by 0.3% over (L+T+S), validating our intuition. Overall, the addition of the style and domain models reduces perplexity and WER by a noticeable 7.1% and 2.1%, respectively, as shown in Table 2.</Paragraph>
      <Paragraph position="5">  formance of various model combinations. Relative reduction is shown in parentheses.</Paragraph>
    </Section>
    <Section position="3" start_page="378" end_page="378" type="sub_section">
      <SectionTitle>
5.3 Textbook Topics
</SectionTitle>
      <Paragraph position="0"> In addition to identifying content words, HMM-LDA also assigns words to a topic based on their distribution across documents. Thus, we can apply HMM-LDA with 100 topics to the Textbook dataset to identify representative words and their associated contexts for each topic. From these labels, we can build unsmoothed trigram language models (Topic100) for each topic from the counts of observed n-gram sequences that end in a word assigned to the respective topic.</Paragraph>
      <Paragraph position="1"> Figure 7 shows a sample of the word n-grams identified via this approach for a few topics.</Paragraph>
      <Paragraph position="2"> Note that some of the n-grams are key phrases for the topic while others contain a mixture of syntactic and topic words. Unlike bag-of-words models that only identify the unigram distribution for each topic, the use of context-dependent labels enables the construction of n-gram topic models that not only characterize the frequencies of topic words, but also describe the transition contexts leading up to these words.</Paragraph>
    </Section>
    <Section position="4" start_page="378" end_page="378" type="sub_section">
      <SectionTitle>
5.4 Topic Mixtures
</SectionTitle>
      <Paragraph position="0"> Since each target lecture generally only covers a subset of the available topics, it will be ideal to identify the specific topics corresponding to a target lecture and assign those topic models more weight in a linearly interpolated mixture model.</Paragraph>
      <Paragraph position="1"> As an ideal case, we performed a cheating experiment to measure the best performance of a statically interpolated topic mixture model (L+T+S+D+Topic100) where we tuned the mixture weights of all mixture components, including the lectures, textbook, style, domain, and the 100 individual topic trigram models on individual target lectures.</Paragraph>
      <Paragraph position="2"> Table 2 shows that by weighting the component models appropriately, we can reduce the perplexity and WER by an additional 7.9% and 0.7%, respectively, over the (L+T+S+D) model even with simple linear interpolation for model combination.</Paragraph>
      <Paragraph position="3"> To gain further insight into the topic mixture model, we examine the breakdown of the normalized topic weights for a specific lecture. As shown in Figure 8, of the 100 topic models, 15 of them account for over 90% of the total weight.</Paragraph>
      <Paragraph position="4"> Thus, lectures tend to show a significant topic skew which topic adaptation approaches can model effectively.</Paragraph>
    </Section>
    <Section position="5" start_page="378" end_page="379" type="sub_section">
      <SectionTitle>
5.5 Topic Adaptation
</SectionTitle>
      <Paragraph position="0"> Unfortunately, since different lectures cover different topics, we generally cannot tune the topic mixture weights ahead of time. One approach, without any a priori knowledge of the target lecture, is to adaptively estimate the optimal mixture weights as we process the lecture (Gildea and Hofmann, 1999). However, since the topic distribution shifts over a long lecture, modeling a lecture as an interpolation of components with fixed weights may not be the most optimal. Instead, we employ an exponential decay strategy where we update the current mixture distribution by linearly interpolating it with the posterior topic distribution given the current word. Specifically, applying Bayes' rule, the probability of topic t generating the current word w is given by:</Paragraph>
      <Paragraph position="2"> To achieve the exponential decay, we update the topic distribution after each word according to</Paragraph>
      <Paragraph position="4"> adaptation rate.</Paragraph>
      <Paragraph position="5"> We evaluated this approach of dynamic mixture weight adaptation on the (L+T+S+D+Topic 100) model, with the same set of components as the cheating experiment with static weights. As shown in Table 2, the dynamic model actually outperforms the static model by more than 1% in perplexity, by better modeling the dynamic topic substructure within the lecture.</Paragraph>
      <Paragraph position="6"> To run the recognizer with a dynamic LM, we rescored the top 100 hypotheses generated with the (L+T+S+D) model using the dynamic LM.</Paragraph>
      <Paragraph position="7"> The WER obtained through such n-best rescoring yielded noticeable improvements over the (L+T+S+D) model without a priori knowledge of the topic distribution, but did not beat the optimal static model on the test set.</Paragraph>
      <Paragraph position="8"> To further gain an intuition for mixture weight adaptation, we plotted the normalized adapted weights of the topic models across the first lecture of the test set in Figure 9. Note that the topic mixture varies greatly across the lecture. In this particular lecture, the lecturer starts out with a review of the previous lecture. Subsequently, he shows an example of computation using accumulators. Finally, he focuses the lecture on stream as a data structure, with an intervening example that finds pairs of i and j that sum up to a prime. By comparing the topic labels in Figure 9 with the top words from the corresponding topics in Figure 10, we observe that the topic weights obtained via dynamic adaptation match the subject matter of the lecture fairly closely. Finally, to assess the effect that word error rate has on adaptation performance, we applied the adaptation algorithm to the corresponding transcript from the automatic speech recognizer (ASR). Traditional cache language models tend to be vulnerable to recognition errors since incorrect words in the history negatively bias the prediction of the current word. However, by adapting at a topic level, which reduces the number of dynamic parameters, the dynamic topic model is less sensitive to recognition errors. As seen in Figure 9, even with a word error rate around 40%, the normalized topic mixture weights from the ASR transcript still show a strong resemblance to the original weights from the manual  topics appearing in Figure 9.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>