<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1022">
  <Title>Dynamic Nonlocal Language Modeling via Hierarchical Topic-Based Adaptation</Title>
  <Section position="4" start_page="167" end_page="167" type="metho">
    <SectionTitle>
2 The Data
</SectionTitle>
    <Paragraph position="0"> The data used in this research is the Broadcast News (BN94) corpus, consisting of radio and TV news transcripts form the year 1994. From the total of 30226 documents, 20226 were used for training and the other 10000 were used as test and held-out data.</Paragraph>
    <Paragraph position="1"> The vocabulary size is approximately 120k words.</Paragraph>
  </Section>
  <Section position="5" start_page="167" end_page="169" type="metho">
    <SectionTitle>
3 Optimizing Document Clustering for Language Modeling
</SectionTitle>
    <Paragraph position="0"> for Language Modeling For the purpose of language modeling, the topic labels assigned to a document or segment of a document can be obtained either manually (by topictagging the documents) or automatically, by using an unsupervised algorithm to group similar documents in topic-like clusters. We have utilized the latter approach, for its generality and extensibility, and because there is no reason to believe that the manually assigned topics are optimal for language modeling.</Paragraph>
    <Section position="1" start_page="167" end_page="168" type="sub_section">
      <SectionTitle>
3.1 Tree Generation
</SectionTitle>
      <Paragraph position="0"> In this study, we have investigated a range of hierarchical clustering techniques, examining extensions of hierarchical agglomerative clustering, k-means clustering and top-down EM-based clustering. The latter underperformed on evaluations in Florian (1998) and is not reported here.</Paragraph>
      <Paragraph position="1"> A generic hierarchical agglomerative clustering algorithm proceeds as follows: initially each document has its own cluster. Repeatedly, the two closest clusters are merged and replaced by their union, until there is only one top-level cluster. Pairwise document similarity may be based on a range of functions, but to facilitate comparative analysis we have utilized standard cosine similarity (d(D1,D2) = &lt;D1,D2~ ) and IR-style term vectors (see Salton IIDx Ih liD2 Ih and McGill (1983)).</Paragraph>
      <Paragraph position="2"> This procedure outputs a tree in which documents on similar topics (indicated by similar term content) tend to be clustered together. The difference between average-linkage and maximum-linkage algorithms manifests in the way the similarity between clusters is computed (see Duda and Hart (1973)). A problem that appears when using hierarchical clustering is that small centroids tend to cluster with bigger centroids instead of other small centroids, often resulting in highly skewed trees such as shown in Figure 2, a=0. To overcome the problem, we devised two alternative approaches for computing the intercluster similarity: * Our first solution minimizes the attraction of large clusters by introducing a normalizing factor a to the inter-cluster distance function:</Paragraph>
      <Paragraph position="4"> a=O a = 0.3 a = 0.5 Figure 2: As a increases, the trees become more balanced, at the expense of forced clustering e=0 e = 0.15 e = 0.3 e = 0.7</Paragraph>
    </Section>
    <Section position="2" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
3.2 Optimizing the Hierarchical Structure
</SectionTitle>
      <Paragraph position="0"> To be able to compute accurate language models, one has to have sufficient data for the relative frequency estimates to be reliable. Usually, even with enough data, a smoothing scheme is employed to insure that P (wdw~ -1) &gt; 0 for any given word sequence w~.</Paragraph>
      <Paragraph position="1"> The trees obtained from the previous step have documents in the leaves, therefore not enough word mass for proper probability estimation. But, on the path from a leaf to the root, the internal nodes grow in mass, ending with the root where the counts from the entire corpus are stored. Since our intention is to use the full tree structure to interpolate between the in-node language models, we proceeded to identify a subset of internal nodes of the tree, which contain sufficient data for language model estimation. The criteria of choosing the nodes for collapsing involves a goodness function, such that the cut I is a solution to a constrained optimization problem, given the constraint that the resulting tree has exactly k leaves. Let this evaluation function be g(n), where n is a node of the tree, and suppose that we want to minimize it. Let g(n, k) be the minimum cost of creating k leaves in the subtree of root n. When the evaluation function g (n) satisfies the locality condition that it depends solely on the values g (nj,.), (where (n#)j_ 1 kare the children of node n), g (root) can be coml)uted efficiently using dynamic programming 2 : where N (Ck) is the number of vectors (documents) in cluster Ck and c (Ci) is the centroid of the i th cluster. Increasing a improves tree balance as shown in Figure 2, but as a becomes large the forced balancing degrades cluster quality. null A second approach we explored is to perform basic smoothing of term vector weights, replacing all O's with a small value e. By decreasing initial vector orthogonality, this approach facilitates attraction to small centroids, and leads to more balanced clusters as shown in Figure 3.</Paragraph>
      <Paragraph position="2"> Instead of stopping the process when the desired * number of clusters is obtained, we generate the full tree for two reasons: (1) the full hierarchical structure is exploited in our language models and (2) once the tree structure is generated, the objective function we used to partition the tree differs from that used when building the tree. Since the clustering procedure turns out to be rather expensive for large datasets (both in terms of time and memory), only 10000 documents were used for generating the initial hierarchical structure.</Paragraph>
      <Paragraph position="3"> degSection 3.2 describes the choice of optimum a.</Paragraph>
      <Paragraph position="4">  gCn, 1) = g(n) g(n, k) = min h (g (nl, jl),..* , g (n/c, jk))(3) jl,,jk &gt; 1  Let us assume for a moment that we are interested in computing a unigram topic-mixture language model. If the topic-conditional distributions have high entropy (e.g. the histogram of P(wltopic ) is fairly uniform), topic-sensitive language model interpolation will not yield any improvement, no matter how well the topic detection procedure works. Therefore, we are interested in clustering documents in such a way that the topic-conditional distribution P(wltopic) is maximally skewed. With this in mind, we selected the evaluation function to be the conditional entropy of a set of words (possibly the whole vocabulary) given the particular classification. The conditional entropy of some set of words )~V given a</Paragraph>
      <Paragraph position="6"> where c (w, Ci) is the TF-IDF factor of word w in class Ci and T is the size of the corpus. Let us observe that the conditional entropy does satisfy the locality condition mentioned earlier.</Paragraph>
      <Paragraph position="7"> Given this objective function, we identified the optimal tree cut using the dynamic-programming technique described above. We also optimized different parameters (such as a and choice of linkage method).</Paragraph>
      <Paragraph position="8"> Figure 4 illustrates that for a range of cluster sizes, maximal linkage clustering with a=0.15-0.3 yields optimal performance given the objective function in equation (2).</Paragraph>
      <Paragraph position="9"> The effect of varying a is also shown graphically in Figure 5. Successful tree construction for language modeling purposes will minimize the conditional entropy of P (~VIC). This is most clearly illustrated for the word politics, where the tree generated with a = 0.3 maximally focuses documents on this topic into a single cluster. The other words shown also exhibit this desirable highly skewed distribution of P (}4;IC) in the cluster tree generated when a = 0.3.</Paragraph>
      <Paragraph position="10"> Another investigated approach was k-means clustering (see Duda and Hart (1973)) as a robust and proven alternative to hierarchical clustering. Its application, with both our automatically derived clusters and Mangn's manually derived clusters (Mangn (1997)) used as initial partitions, actually yielded a small increase in conditional entropy and was not pursued further.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="169" end_page="172" type="metho">
    <SectionTitle>
4 Language Model Construction and Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
Evaluation
</SectionTitle>
      <Paragraph position="0"> Estimating the language model probabilities is a two-phase process. First, the topic-sensitive lani--1 gnage model probabilities P (wilt, wi_,~+~ ) are computed during the training phase. Then, at run-time, or in the testing phase, topic is dynamically identified by computing the probabilities P (tlw~ -1) as in section 4.2 and the final language model probabilities are computed using Equation (1). The tree used in the following experiments was generated using average-linkage agglomerative clustering, using parameters that optimize the objective function in Section 3.</Paragraph>
    <Section position="2" start_page="169" end_page="171" type="sub_section">
      <SectionTitle>
4.1 Language Model Construction
</SectionTitle>
      <Paragraph position="0"> The topic-specific language model probabilities are computed in a four phase process: 1. Each document is assigned to one leaf in the tree, based on the similarity to the leaves' centroids (using the cosine similarity). The document counts are added to the selected leaf's count.</Paragraph>
      <Paragraph position="1"> 2. The leaf counts are propagated up the tree such that, in the end, the counts of every internal node are equal to the sum of its children's counts. At this stage, each node of the tree has an attached language model - the relative frequencies. null 3. In the root of the tree, a discounted Good-Turing language model is computed (see Katz (1987), Chen and Goodman (1998)).</Paragraph>
      <Paragraph position="2"> 4. m-gram smooth language models are computed for each node n different than the root by three-way interpolating between the m-gram language model in the parent parent(n), the (m - 1)-gram smooth language model in node n and the m-gram relativeffrequency estimate in node n:</Paragraph>
      <Paragraph position="4"> for each node n in the tree. Based on how ~k (w~,-1) depend on the particular node n and the word history w~ -1, various models can be obtained. We investigated two approaches: a bigram model in which the ,k's are fixed over the tree, and a more general trigram model in</Paragraph>
      <Paragraph position="6"> Not all words are topic sensitive. Mangu (1997) observed that closed-class function words (FW), such as the, of, and with, have minimal probability variation across different topic parameterizations, while most open-class content words (CW) exhibit substantial topic variation. This leads us to divide the possible word pairs in two classes (topic-sensitive and not) and compute the A's in Equation (5) in such a way that the probabilities in the former set are constant in all the models. To formalize this:</Paragraph>
      <Paragraph position="8"> the &amp;quot;unknown&amp;quot; space.</Paragraph>
      <Paragraph position="9"> The imposed restriction is, then: for every word</Paragraph>
      <Paragraph position="11"> The distribution of bigrams in the training data is as follows, with roughly 30% bigram probabilities allowed to vary in the topic-sensitive models: This approach raises one interesting issue: the language model in the root assigns some probability mass to the unseen events, equal to the singletons' mass (see Good (1953),Katz (1987)). In our case, based on the assumptions made in the Good-Turing formulation, we considered that the ratio of the probability mass that goes to the unseen events and the one that goes to seen, free events should be  Then the language model probabilities are computed as in Figure 5.</Paragraph>
      <Paragraph position="12">  In general, n gram language model probabilities can be computed as in formula (5), where (A~ (w&amp;quot;'-~'J'l are adapted both for the partic- ~. 1 I / k-~l...3 ular node n and history w~ -1. The proposed dependency on the history is realized through the history count c (w~'-1) and the relevance of the history w~ -1 to the topic in the nodes n and parent (n).</Paragraph>
      <Paragraph position="13"> The intuition is that if a history is as relevant in the current node as in the parent, then the estimates in the parent should be given more importance, since they are better estimated. On the other hand, if the history is much more relevant in the current node, then the estimates in the node should be trusted more. The mean adapted A for a given height h is the tree is shown in Figure 6. This is consistent with the observation that splits in the middle of the tree tend to be most informative, while those closer to the leaves suffer from data fragmentation, and hence give relatively more weight to their parent.</Paragraph>
      <Paragraph position="14"> As before, since not all the m-grams are expected to be topic-sensitive, we use a method to insure that those rn grams are kept 'Taxed&amp;quot; to minimize noise and modeling effort. In this case, though, 2 language models with different support are used: one  h, in the unigram case that supports the topic insensitive m-grams and that is computed only once (it's a normalization of the topic-insensitive part of the overall model), and one that supports the rest of the mass and which is computed by interpolation using formula (5). Finally, the final language model in each node is computed as a mixture of the two.</Paragraph>
    </Section>
    <Section position="3" start_page="171" end_page="172" type="sub_section">
      <SectionTitle>
4.2 Dynamic Topic Adaptation
</SectionTitle>
      <Paragraph position="0"> Consider the example of predicting the word following the Broadcast News fragment: &amp;quot;It is at least on the Serb side a real drawback to the ~-?--~'. Our topic detection model, as further detailed later in this section, assigns a topic distribution to this left context (including the full previous discourse), illustrated in the upper portion of Figure 7. The model identifies that this particular context has greatest affinity with the empirically generated topic clusters #41 and #42 (which appear to have one of their foci on international events).</Paragraph>
      <Paragraph position="1"> The lower portion of Figure 7 illustrates the topic-conditional bigram probabilities P(w\[the, topic) for two candidate hypotheses for w: peace (the actually observed word in this case) and piece (an incorrect competing hypothesis). In the former case, P(peace\[the, topic) is clearly highly elevated in the most probable topics for this context (#41,#42), and thus the application of our core model combination (Equation 1) yields a posterior joint product P (w, lw~ -1) = ~'~K= 1P ($lw~-l) * Pt (w, lw~_-~+l) that is 12-times more likely than the overall bigram probability, P(air\[the) = 0.001. In contrast, the obvious accustically motivated alternative piece, has greatest probability in a far different and much more diffuse distribution of topics, yielding a joint model probability for this particular context that is 40% lower than its baseline bigram probability. This context-sensitive adaptation illustrates the efficacy of dynamic topic adaptation in increasing the model probability of the truth.</Paragraph>
      <Paragraph position="2"> Clearly the process of computing the topic detector P (tlw~ -1) is crucial. We have investigated several mechanisms for estimating this probability, the most promising is a class of normalized transformations of traditional cosine similarity between the document history vector w~ -x and the topic centroids: null</Paragraph>
      <Paragraph position="4"> One obvious choice for the function f would be the identity. However, considering a linear contribution</Paragraph>
      <Paragraph position="6"> of similarities poses a problem: because topic detection is more accurate when the history is long, even unrelated topics will have a non-trivial contribution to the final probability 3, resulting in poorer estimates.</Paragraph>
      <Paragraph position="7"> One class of transformations we investigated, that directly address the previous problem, adjusts the similarities such that closer topics weigh more and more distant ones weigh less. Therefore, f is chosen such that I(=~} &lt; ~-~ for ~E1 &lt; X2 C/~ sC/.~)- ~ - (7) f(zl) &lt; for zz &lt; z2 X I ~ ag 2 that is, ~ should be a monotonically increasing function on the interval \[0, 1\], or, equivalently f (x) = x. g (x), g being an increasing function on \[0,1\]. Choices for g(x) include x, z~(~f &gt; 0), log (z), e z .</Paragraph>
      <Paragraph position="8"> Another way of solving this problem is through the scaling operator f' (xi) = ,~-mm~ By apply- max zi --min zi &amp;quot; ing this operator, minimum values (corresponding to low-relevancy topics) do not receive any mass at all, and the mass is divided between the more relevant topics. For example, a combination of scaling and</Paragraph>
      <Paragraph position="10"> A third class of transformations we investigated considers only the closest k topics in formula (6) and ignores the more distant topics.</Paragraph>
    </Section>
    <Section position="4" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
4.3 Language Model Evaluation
</SectionTitle>
      <Paragraph position="0"> Table 1 briefly summarizes a larger table of performance measured on the bigram implementation  of this adaptive topic-based LM. For the default parameters (indicated by *), a statistically significant overall perplexity decrease of 10.5% was observed relative to a standard bigram model measured on the same 1000 test documents. Systematically modifying these parameters, we note that performance is decreased by using shorter discourse contexts (as histories never cross discourse boundaries, 5000-word histories essentially correspond to the full prior discourse). Keeping other parameters constant, g(x) = x outperforms other candidate transformations g(x) = 1 and g(x) = e z. Absence of k-nn and use of scaling both yield minor performance improvements.</Paragraph>
      <Paragraph position="1"> It is important to note that for 66% of the vocabulary the topic-based LM is identical to the core bigram model. On the 34% of the data that falls in the model's target vocabulary, however, perplexity reduction is a much more substantial 33.5% improvement. The ability to isolate a well-defined target subtask and perform very well on it makes this work especially promising for use in model combination.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>