Language Modeling with Sentence-Level Mixtures

2. MIXTURE LANGUAGE MODEL

2.1. General Framework

The sentence-level mixture language model was originally motivated by the observation that news stories (and certainly other domains as well) often reflect the characteristics of different topics or sub-domains, such as sports, finance, national news and local news. The likelihood of different words or n-grams can be very different in different sub-domains, and it is unlikely that one would switch sub-domains mid-sentence. A model with sentence-level mixtures of topic-dependent component models would address this problem, but the model would be more general if it also allowed for n-gram-level mixtures within the components (e.g. for robust estimation). Thus, we propose a model using mixtures at two levels: the sentence level and the n-gram level. Using trigram components, this model is described by

P(w_1, \ldots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} \left[ \theta_k P_k(w_i \mid w_{i-1}, w_{i-2}) + (1 - \theta_k) P_I(w_i \mid w_{i-1}, w_{i-2}) \right],

where k is an index to the particular topic described by the component language model P_k(\cdot \mid \cdot), P_I(\cdot \mid \cdot) is a topic-independent model that is interpolated with the topic-dependent model for purposes of robust estimation or dynamic language model adaptation, and \lambda_k and \theta_k are the sentence-level and n-gram-level mixture weights, respectively. (Note that the component-dependent term P_k(w_i \mid w_{i-1}, w_{i-2}) could itself be a mixture.)

Two important aspects of the model are the definition of "topics" and robust parameter estimation. The m component distributions of the language model correspond to different "topics", where a topic can be any broad class of sentences, such as subject area (as in the examples given above) or verb tense. Topics can be specified by hand, according to text labels if they are available, or by heuristic rules associated with known characteristics of a task domain. Topics can also be determined automatically, which is the approach taken here, using any of a variety of clustering methods to initialize the component models. Robust parameter estimation is another important issue in mixture language modeling, because the process of partitioning the data into topic-dependent subsets reduces the amount of training data available to estimate each component language model. These two issues, automatic clustering for topic initialization and robust parameter estimation, are described further in the next two subsections.
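To make the two-level mixture concrete, the sketch below scores a sentence under m topic components plus a topic-independent model, following the equation above. The probability callables, the weight lists, and the function name are illustrative placeholders assumed for this sketch, not part of the original system.

```python
import math

def sentence_log_prob(words, components, general, lam, theta):
    """Score one sentence with a sentence-level mixture of m topic trigram
    models, each interpolated at the n-gram level with a topic-independent
    model (a sketch of the two-level mixture described above).

    components : list of m callables p_k(w, w1, w2) -> trigram probability
    general    : callable p_I(w, w1, w2), the topic-independent model
    lam        : list of m sentence-level mixture weights (sums to 1)
    theta      : list of m n-gram-level interpolation weights
    """
    padded = ["<s>", "<s>"] + list(words)
    # Accumulate log P_k(sentence) for each topic component k.
    per_topic_logprob = [0.0] * len(components)
    for k, p_k in enumerate(components):
        for i in range(2, len(padded)):
            w, w1, w2 = padded[i], padded[i - 1], padded[i - 2]
            # n-gram-level interpolation with the topic-independent model
            p = theta[k] * p_k(w, w1, w2) + (1.0 - theta[k]) * general(w, w1, w2)
            per_topic_logprob[k] += math.log(p)
    # Sentence-level mixture, combined in log space for numerical stability.
    terms = [math.log(lam[k]) + lp for k, lp in enumerate(per_topic_logprob)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```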
2.2. Clustering Algorithm

Since the standard WSJ language model training data does not have topic labels associated with the text, it was necessary to use automatic clustering to identify natural groupings of the data into "topics". Because of its conceptual simplicity, agglomerative clustering is used to partition the training data into the desired number of clusters. The clustering is done at the paragraph level, relying on the assumption that an entire paragraph comes from a single topic. Each paragraph begins as a singleton cluster. Clusters are then progressively merged by computing the similarity between clusters and grouping the two most similar clusters.

The basic clustering algorithm is as follows:

1. Let the desired number of clusters be C* and the initial number of clusters C be the number of singleton data samples, or paragraphs.
2. Find the best-matched pair of clusters, say A_i and A_j, according to the similarity criterion S_{ij}.
3. Merge A_i and A_j and decrement C.
4. If the current number of clusters C = C*, stop; otherwise go to Step 2.

At the end of this stage, we have the desired number of partitions of the training data. To save computation, we run agglomerative clustering first on subsets of the data, and then continue by agglomerating the resulting clusters into a final set of m clusters.

A variety of similarity measures can be envisioned. We use a normalized measure of the number of content words shared by the two clusters. (Paragraphs comprise both function words (e.g. is, that, but) and content words (e.g. stocks, theater, trading), but the function words do not contribute to identifying a paragraph as belonging to a particular topic, so they are ignored in the similarity criterion.) Letting A_i be the set of unique content words in cluster i, |A_i| the number of elements in A_i, and N_i the number of paragraphs in cluster i, the specific measure of similarity S_{ij} of two clusters i and j is the number of shared content words, |A_i \cap A_j|, scaled by a normalization factor (a function of the cluster sizes) that is used to avoid the tendency for small clusters to group with one large cluster rather than with other small clusters.

At this point, we have only experimented with a small number of clusters, so it is difficult to see coherent topics in them. However, it appears that the current models put news related to foreign affairs (politics, as well as travel) into one cluster and news related to finance (stocks, prices, loans) into another.
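The sketch below mirrors the agglomerative procedure above: paragraphs start as singleton clusters and the two most similar clusters are merged until C* remain. Since the paper's exact normalization factor is not reproduced here, the similarity simply divides the shared content-word count by the product of the two content-word vocabulary sizes; that choice, the stop-word set, and the function names are assumptions for illustration only.

```python
def cluster_paragraphs(paragraphs, target_clusters, stop_words):
    """Greedy agglomerative clustering of paragraphs into topic clusters.

    paragraphs      : list of paragraphs, each a list of word tokens
    target_clusters : desired number of clusters C*
    stop_words      : set of function words ignored by the similarity
    """
    # Each cluster keeps its set of unique content words and a paragraph count.
    clusters = [({w for w in p if w not in stop_words}, 1) for p in paragraphs]

    def similarity(a, b):
        # Shared content words, normalized so that one large cluster does not
        # swallow every small cluster (an assumed normalization, not the
        # paper's exact factor).
        words_a, _ = a
        words_b, _ = b
        if not words_a or not words_b:
            return 0.0
        return len(words_a & words_b) / (len(words_a) * len(words_b))

    while len(clusters) > target_clusters:
        # Find the most similar pair of clusters (brute-force O(C^2) search).
        best_pair, best_sim = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if s > best_sim:
                    best_pair, best_sim = (i, j), s
        i, j = best_pair
        # Merge cluster j into cluster i and drop j, decrementing C.
        clusters[i] = (clusters[i][0] | clusters[j][0],
                       clusters[i][1] + clusters[j][1])
        del clusters[j]
    return clusters
```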
2.3. Parameter Estimation

Each component model is a conventional n-gram model. Initial n-gram estimates for the component models are based on the partitions of the training data obtained with the clustering algorithm above. The initial component models are estimated separately for each cluster, where the Witten-Bell back-off [11] is used to compute the probabilities of n-grams not observed in training, based on distributing a certain amount of the total probability mass among unseen n-grams. This method was chosen based on the results of [12] and our own comparative experiments with different back-off methods for WSJ n-gram language models. The parameters of the component models can be re-estimated using the Expectation-Maximization (EM) algorithm [13]. However, since the EM algorithm is computationally intensive, an iterative re-labeling re-estimation technique was used instead. At each iteration, the training data is re-partitioned by re-labeling each utterance according to which component model maximizes the likelihood of that utterance. Then the component n-gram statistics are re-computed using the new subsets of the training data, again using the Witten-Bell back-off technique. The iterations continue until the cluster sizes reach a steady state.

Since the component models are built on partitioned training data, there is a danger of them being undertrained. In addition to using standard back-off techniques, we have explored two main mechanisms for robust parameter estimation. One approach is to include a general model P_G, trained on all the data, as one of the mixture components. This approach has the advantage that the general model will be more appropriate for recognizing sentences that do not fall clearly into any of the topic-dependent components, but the possible disadvantage that the component models may be underutilized because they are relatively undertrained. An alternative is to interpolate the general model with each component model at the n-gram level, but this may force the component models to be too general in order to allow for unforeseen topics. Given these trade-offs, we chose to implement a compromise between the two approaches, i.e. to include a general model as one of the components, as well as some component-level smoothing via interpolation with a general model. Specifically, the model is given by

P(w_1, \ldots, w_T) = \sum_{k=1}^{m+1} \lambda_k \prod_{i=1}^{T} \left[ \theta_k P_k(w_i \mid w_{i-1}, w_{i-2}) + (1 - \theta_k) P_{G'}(w_i \mid w_{i-1}, w_{i-2}) \right], \quad P_{m+1} = P_G,

where P_{G'} is a general model (which may or may not be the same as P_G), the {\lambda_k} provide weights for the different topics, and the {\theta_k} serve to smooth the component models.

Both sets of mixture weights are estimated on a separate data set, using a maximum likelihood criterion and initializing with uniform weights. To simplify the initial implementation, we did not estimate the two sets of weights {\lambda_k} and {\theta_k} jointly. Rather, we first labeled the sentences in the mixture weight estimation data set according to their most likely component models, and then separately estimated each weight \theta_k to maximize the likelihood of the data assigned to its cluster. For a single set of data, the mixture weight estimation algorithm involves iteratively updating

\theta_k^{(p+1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{t=1}^{n_i} \frac{\theta_k^{(p)} P_k(w_t^i \mid w_{t-1}^i, w_{t-2}^i)}{\theta_k^{(p)} P_k(w_t^i \mid w_{t-1}^i, w_{t-2}^i) + (1 - \theta_k^{(p)}) P_{G'}(w_t^i \mid w_{t-1}^i, w_{t-2}^i)},

where n_i is the number of words in sentence i and N is the total number of sentences in cluster k. After the component models have been estimated, the sentence-level mixture weights {\lambda_k} are estimated using an analogous algorithm.
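As a rough illustration of the weight estimation step, the sketch below re-estimates a single n-gram-level weight \theta_k on the sentences assigned to cluster k, iterating an EM-style update of the form given above. The per-sentence and per-cluster averaging follows the variables n_i and N named in the text; the probability callables and the function name are assumed interfaces, not the original implementation.

```python
def estimate_theta(sentences, p_k, p_g, theta=0.5, iterations=20):
    """Iteratively re-estimate the n-gram-level interpolation weight theta_k
    for one topic cluster (an EM-style sketch of the update described above).

    sentences : sentences assigned to cluster k; each sentence is a list of
                (w, w1, w2) trigram tuples
    p_k, p_g  : callables returning the topic and general trigram probabilities
    theta     : initial weight (uniform initialization corresponds to 0.5)
    """
    N = len(sentences)
    for _ in range(iterations):
        acc = 0.0
        for sent in sentences:
            n_i = len(sent)
            post = 0.0
            for (w, w1, w2) in sent:
                num = theta * p_k(w, w1, w2)
                den = num + (1.0 - theta) * p_g(w, w1, w2)
                # Posterior probability that this word was generated by the
                # topic-dependent component rather than the general model.
                post += num / den
            acc += post / n_i      # average over the n_i words in sentence i
        theta = acc / N            # average over the N sentences in cluster k
    return theta
```

The sentence-level weights \lambda_k can be estimated with the same loop structure, accumulating sentence-level posteriors instead of word-level ones.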