<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1024">
  <Title>Exploring Asymmetric Clustering for Statistical Language Modeling</Title>
  <Section position="5" start_page="1" end_page="3" type="metho">
    <SectionTitle>
3 Asymmetric Cluster Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Model
</SectionTitle>
      <Paragraph position="0"> The LM predicts the next word w_i given its history h by estimating the conditional probability P(w_i|h).</Paragraph>
      <Paragraph position="1"> Using the trigram approximation, we have P(w_i|h) ≈ P(w_i|w_{i-2} w_{i-1}), assuming that the next word depends only on the two preceding words. In the ACM, we will use different clusters for words in different positions. For the predicted word w_i, we will denote the cluster of the word by PW_i, and we will refer to this as the predictive cluster. For the words w_{i-2} and w_{i-1} that we are conditioning on, we will denote their clusters by CW_{i-2} and CW_{i-1}, which we call conditional clusters. When we wish to refer to a cluster of a word w in general, we will use the notation W. The ACM estimates the probability of w_i given the two preceding words w_{i-2} and w_{i-1} as the product of the following two probabilities: (1) the probability of the predicted cluster PW_i given the preceding conditional clusters CW_{i-2} and CW_{i-1}, P(PW_i | CW_{i-2} CW_{i-1}), and (2) the probability of the word given its cluster PW_i and the preceding conditional clusters CW_{i-2} and CW_{i-1}, P(w_i | CW_{i-2} CW_{i-1} PW_i). That is,
\[ P(w_i \mid w_{i-2} w_{i-1}) = P(PW_i \mid CW_{i-2} CW_{i-1}) \times P(w_i \mid CW_{i-2} CW_{i-1} PW_i) \tag{1} \]
To deal with the data sparseness problem, we used a backoff scheme (Katz, 1987) for the parameter estimation of each sub-model. The backoff scheme recursively estimates the probability of an unseen n-gram by utilizing (n-1)-gram estimates.</Paragraph>
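      <Paragraph> To make the factorization in Equation (1) concrete, the following Python sketch (an illustration only, not the authors' implementation) assembles the ACM probability from raw counts; the relative-frequency estimates stand in for the Katz backoff sub-models actually used, and the corpus and cluster maps are hypothetical inputs.
from collections import defaultdict

def train_acm(corpus, cond_cluster, pred_cluster):
    """Collect the counts needed by the two sub-models of Equation (1).
    cond_cluster / pred_cluster map a word to its conditional / predictive
    cluster id (hypothetical inputs for this sketch)."""
    cl_num = defaultdict(int)   # counts of (CW_{i-2}, CW_{i-1}, PW_i)
    cl_den = defaultdict(int)   # counts of (CW_{i-2}, CW_{i-1})
    w_num = defaultdict(int)    # counts of (CW_{i-2}, CW_{i-1}, PW_i, w_i)
    w_den = defaultdict(int)    # counts of (CW_{i-2}, CW_{i-1}, PW_i)
    for w2, w1, w in zip(corpus, corpus[1:], corpus[2:]):
        c2, c1, p = cond_cluster[w2], cond_cluster[w1], pred_cluster[w]
        cl_num[(c2, c1, p)] += 1
        cl_den[(c2, c1)] += 1
        w_num[(c2, c1, p, w)] += 1
        w_den[(c2, c1, p)] += 1
    return cl_num, cl_den, w_num, w_den

def p_acm(w2, w1, w, model, cond_cluster, pred_cluster):
    """P(w | w2 w1) = P(PW | CW2 CW1) * P(w | CW2 CW1 PW).
    A real system would replace the raw relative frequencies below with
    Katz backoff estimates, as described in the text."""
    cl_num, cl_den, w_num, w_den = model
    c2, c1, p = cond_cluster[w2], cond_cluster[w1], pred_cluster[w]
    p_cluster = cl_num[(c2, c1, p)] / max(cl_den[(c2, c1)], 1)
    p_word = w_num[(c2, c1, p, w)] / max(w_den[(c2, c1, p)], 1)
    return p_cluster * p_word
      </Paragraph>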
      <Paragraph position="24"> The basic idea underlying the ACM is the use of different clusters for predicted and conditional words respectively. Classical cluster models are symmetric in that the same clusters are employed for both predicted and conditional words. However, the symmetric cluster model is suboptimal in practice. For example, consider a pair of words like "a" and "an". In general, "a" and "an" can follow the same words, and thus, as predicted words, they belong in the same cluster. But there are very few words that can follow both "a" and "an", so as conditional words they belong in different clusters.</Paragraph>
      <Paragraph position="25"> In generating clusters, two factors need to be considered: (1) the clustering metric, and (2) the number of clusters. In what follows, we will investigate the impact of each of these factors.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Asymmetric clustering
</SectionTitle>
      <Paragraph position="0"> The basic criterion for statistical clustering is to maximize the resulting probability (or minimize the resulting perplexity) of the training data. Many traditional clustering techniques (Brown et al., 1992) attempt to maximize the average mutual information of adjacent clusters,
\[ I(W_1; W_2) = \sum_{W_1, W_2} P(W_1, W_2) \log \frac{P(W_1, W_2)}{P(W_1) P(W_2)}, \]
where the same clusters are used for both predicted and conditional words. We will call these clustering techniques symmetric clustering, and the resulting clusters both clusters. In constructing the ACM, we used asymmetric clustering, in which different clusters are used for predicted and conditional words. In particular, for clustering conditional words, we try to minimize the perplexity of the training data for a bigram of the form P(w_i | CW_{i-1}),
\[ \Big[ \prod_{i=1}^{N} P(w_i \mid CW_{i-1}) \Big]^{-1/N}, \tag{3} \]
where N is the total number of words in the training data. We will call the resulting clusters conditional clusters, denoted by CW. For clustering predicted words, we try to minimize the perplexity of the training data of P(PW_i | w_{i-1}) × P(w_i | PW_i). We will call the resulting clusters predicted clusters, denoted by PW. We have
\[ \prod_{i=1}^{N} P(PW_i \mid w_{i-1}) \, P(w_i \mid PW_i) = \prod_{i=1}^{N} \frac{P(PW_i, w_{i-1})}{P(w_{i-1})} \cdot \frac{P(w_i, PW_i)}{P(PW_i)} = \prod_{i=1}^{N} P(w_{i-1} \mid PW_i) \cdot \frac{P(w_i)}{P(w_{i-1})}, \]
using the fact that P(w_i, PW_i) = P(w_i), since w_i determines PW_i. The factor P(w_i)/P(w_{i-1}) is independent of the clustering used. Therefore, for the selection of the best clusters, it is sufficient to try to maximize
\[ \prod_{i=1}^{N} P(w_{i-1} \mid PW_i). \tag{4} \]
(Thanks to Lillian Lee for suggesting this justification of predictive clusters.) This is very convenient since it is exactly the opposite of what was done for conditional clustering. It means that we can use the same clustering tool for both, and simply switch the order of the word pairs given to the program that gathers the raw counts for clustering, as illustrated in the sketch below.</Paragraph>
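      <Paragraph> As a minimal illustration of this point (the function, the clustering routine, and the corpus below are assumptions for the example, not the authors' tool), the only difference between gathering raw counts for conditional and for predictive clustering is the order of the word pairs handed to the clustering program:
from collections import Counter

def bigram_counts(corpus, reverse=False):
    """Raw adjacent-word pair counts. With reverse=False the pairs are
    (w_{i-1}, w_i), as needed for conditional clustering (Equation (3));
    with reverse=True they are (w_i, w_{i-1}), as needed for predictive
    clustering (Equation (4))."""
    pairs = zip(corpus, corpus[1:])
    if reverse:
        pairs = ((w, v) for v, w in pairs)
    return Counter(pairs)

# Hypothetical usage with some clustering routine cluster_words() that
# always clusters the first element of each pair:
#   cond_clusters = cluster_words(bigram_counts(corpus))
#   pred_clusters = cluster_words(bigram_counts(corpus, reverse=True))
      </Paragraph>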
      <Paragraph position="10"> The clustering technique we used creates a binary branching tree with words at the leaves. The ACM in this study is a hard cluster model, meaning that each word belongs to only one cluster, so in the clustering tree each word occurs in a single leaf. In the ACM, we actually use two different clustering trees: one is optimized for predicted words, and the other for conditional words.</Paragraph>
      <Paragraph position="11"> The basic approach to clustering we used is a top-down, splitting clustering algorithm. In each iteration, a cluster is split into two clusters in such a way that the split achieves the maximal entropy decrease (estimated by Equations (3) or (4)). Finally, we can also perform iterations of swapping all words between all clusters until convergence, i.e., until no further entropy decrease can be found. (For the experiments reported in this paper, we used the basic top-down algorithm without swapping. Although the resulting clusters without swapping are not even locally optimal, our experiments show that the quality of the clusters, in terms of the perplexity of the resulting ACM, is not inferior to that of clusters with swapping.) We find that our algorithm is much more efficient than agglomerative clustering algorithms, i.e., those which merge words bottom-up.</Paragraph>
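      <Paragraph> The following Python sketch shows the overall control flow of such a top-down splitter under stated simplifying assumptions (it is not the authors' implementation): the objective is the class-bigram log likelihood behind Equation (3), each split seeds the new half with the cluster's second most frequent member and then assigns the remaining members greedily, and efficiency is ignored for clarity.
import math
from collections import Counter

def class_bigram_loglik(assign, pair_counts):
    """Training log likelihood under P(w | C(v)) for raw pairs (v, w),
    i.e. the quantity whose perplexity Equation (3) measures."""
    cw, c = Counter(), Counter()
    for (v, w), n in pair_counts.items():
        cw[(assign[v], w)] += n
        c[assign[v]] += n
    return sum(n * math.log(cw[(assign[v], w)] / c[assign[v]])
               for (v, w), n in pair_counts.items())

def split(members, cid, new_id, assign, pair_counts, freq):
    """Split one cluster in two: seed the new half, then keep each remaining
    member wherever training likelihood is higher (a greedy stand-in for
    choosing the split with maximal entropy decrease)."""
    seeds = sorted(members, key=freq.get, reverse=True)[:2]
    if len(seeds) != 2:
        return
    assign[seeds[1]] = new_id
    for w in members:
        if w in seeds:
            continue
        assign[w] = new_id                      # try the new cluster ...
        gain_new = class_bigram_loglik(assign, pair_counts)
        assign[w] = cid                         # ... versus the old one
        if gain_new > class_bigram_loglik(assign, pair_counts):
            assign[w] = new_id

def top_down_cluster(vocab, pair_counts, levels):
    """Grow a binary clustering tree level by level: at each level every
    current cluster is split once, giving up to 2**levels leaf clusters.
    The optional swapping pass is omitted, as in the paper's experiments."""
    freq = Counter({w: 0 for w in vocab})
    for (v, _), n in pair_counts.items():
        freq[v] += n
    assign = {w: 0 for w in vocab}
    next_id = 1
    for _ in range(levels):
        for cid in sorted(set(assign.values())):
            members = [w for w in vocab if assign[w] == cid]
            split(members, cid, next_id, assign, pair_counts, freq)
            next_id += 1
    return assign
      </Paragraph>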
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Parameter optimization
</SectionTitle>
      <Paragraph position="0"> Asymmetric clustering results in two binary clustering trees. By cutting the trees at a certain level, it is possible to achieve a wide variety of different numbers of clusters. For instance, if the tree is cut after the 8th level, there will be roughly 2^8 = 256 clusters. Since the tree is not balanced, the actual number of clusters may be somewhat smaller. We use W^l to represent the cluster of a word w using a tree cut at level l. In particular, if we set l to the value "all", it means that the tree is cut at infinite depth, i.e., each cluster contains a single word. The ACM model of Equation (1) can then be rewritten in its general form as
\[ P(w_i \mid w_{i-2} w_{i-1}) = P(PW_i^l \mid CW_{i-2}^j CW_{i-1}^j) \times P(w_i \mid CW_{i-2}^k CW_{i-1}^k PW_i^l), \]
where j and k are the cut levels of the conditional clustering tree used in the cluster sub-model and the word sub-model respectively, and l is the cut level of the predictive clustering tree.</Paragraph>
      <Paragraph position="2"> To optimally apply the ACM to realistic applications with memory constraints, we are always seeking the correct balance between model size and performance. We used Stolcke's pruning method to produce many ACMs with different model sizes. In our experiments, whenever we compare techniques, we do so by comparing the performance (perplexity and CER) of the LM techniques at the same model sizes. Stolcke's pruning is an entropy-based cutoff method, which can be described as follows: all n-grams that change perplexity by less than a threshold are removed from the model. For pruning the ACM, we have two thresholds: one for the cluster sub-model P(PW_i^l | CW_{i-2}^j CW_{i-1}^j) and one for the word sub-model P(w_i | CW_{i-2}^k CW_{i-1}^k PW_i^l), denoted by t_c and t_w below.</Paragraph>
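      <Paragraph> A small sketch of the tree-cut notation above: if each leaf (word) is identified by its root-to-leaf path in the binary clustering tree, then the cluster W^l of a word at cut level l is simply the first l bits of that path. The bit-string representation and the tiny vocabulary below are assumptions for illustration, not data from the paper.
# Hypothetical leaf paths for a tiny vocabulary.
leaf_path = {
    "a":   "00101110",
    "an":  "00101111",
    "dog": "11010001",
}

def cluster_at_level(word, l):
    """W^l: cut the tree after level l; l='all' keeps the full path,
    i.e. every cluster contains a single word."""
    path = leaf_path[word]
    return path if l == "all" else path[:l]

print(cluster_at_level("a", 4))      # '0010'  -> roughly 2**4 clusters
print(cluster_at_level("a", "all"))  # '00101110' -> one word per cluster
      </Paragraph>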
      <Paragraph position="8"> In this way, we have five different parameters that need to be simultaneously optimized: l, j, k, t_c, and t_w, where t_c and t_w are the pruning thresholds for the cluster and word sub-models respectively.</Paragraph>
      <Paragraph position="11"> A brute-force approach to optimizing such a large number of parameters is prohibitively expensive. Rather than trying a large number of combinations of all five parameters, we give an alternative technique that is significantly more efficient. Simple math shows that the perplexity of the overall model P(PW_i^l | CW_{i-2}^j CW_{i-1}^j) \times P(w_i | CW_{i-2}^k CW_{i-1}^k PW_i^l) is equal to the perplexity of the cluster sub-model P(PW_i^l | CW_{i-2}^j CW_{i-1}^j) times the perplexity of the word sub-model P(w_i | CW_{i-2}^k CW_{i-1}^k PW_i^l). The size of the overall model is clearly the sum of the sizes of the two sub-models. Thus, we try a large number of values of j, l, and a pruning threshold t_c for the cluster sub-model, computing the size and perplexity of each, and a similarly large number of values of l, k, and a separate threshold t_w for the word sub-model. We can then look at all compatible pairs of these models (those with the same value of l) and quickly compute the perplexity and size of the overall models. This allows us to relatively quickly search through what would otherwise be an overwhelmingly large search space.</Paragraph>
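      <Paragraph> To make the combination step concrete, here is a small Python sketch of the search just described; the data layout is an assumption and every number is a made-up placeholder, not a result from the paper. Each candidate sub-model is summarized by its parameters, perplexity, and size; compatible pairs share the same l; the overall perplexity is the product of the two sub-model perplexities and the overall size is their sum.
from itertools import product

# Made-up candidate sub-models: (l, j, t_c, perplexity, size) for the cluster
# sub-models and (l, k, t_w, perplexity, size) for the word sub-models.
cluster_models = [(8, 6, 1e-7, 4.2, 1.5e6), (8, 8, 1e-7, 3.9, 2.8e6)]
word_models = [(8, 6, 5e-8, 30.1, 4.0e6), (8, 10, 5e-8, 27.5, 7.5e6)]

def best_acm(cluster_models, word_models, max_size):
    """Combine every compatible pair (same l), score it by overall perplexity
    (product) and size (sum), and return the best one under the size budget."""
    candidates = []
    for (l1, j, tc, pp_c, sz_c), (l2, k, tw, pp_w, sz_w) in product(
            cluster_models, word_models):
        if l1 != l2 or sz_c + sz_w > max_size:   # keep only compatible pairs that fit
            continue
        candidates.append((pp_c * pp_w, sz_c + sz_w, (l1, j, k, tc, tw)))
    return min(candidates) if candidates else None

print(best_acm(cluster_models, word_models, max_size=7e6))
      </Paragraph>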
    </Section>
  </Section>
</Paper>