File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/p02-1023_metho.xml
Size: 13,748 bytes
Last Modified: 2025-10-06 14:07:58
<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1023"> <Title>Improving Language Model Size Reduction using Better Pruning Criteria</Title>
<Section position="5" start_page="11" end_page="111" type="metho"> <SectionTitle> 3 Pruning Criteria </SectionTitle>
<Paragraph position="0"> In this section, we describe the three pruning criteria we evaluated. They are derived from LM evaluation measures including perplexity, rank, and entropy.</Paragraph>
<Paragraph position="1"> The goal of a pruning criterion is to estimate the performance loss due to pruning each bigram individually. Therefore, we represent each pruning criterion as a loss function, denoted by LF below.</Paragraph>
<Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.1 Probability </SectionTitle>
<Paragraph position="0"> The probability pruning criterion is derived from perplexity. The perplexity is defined as
\[ PP = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_{i-1})\Big) \qquad (2) \]
where N is the size of the test data. The perplexity can be roughly interpreted as the expected branching factor of the test document when presented to the LM. It is expected that lower perplexities are correlated with lower error rates.</Paragraph>
<Paragraph position="1"> The method of pruning bigram models using probability can be described as follows: all bigrams that change perplexity by less than a threshold are removed from the model. In this study, we assume that the change in model perplexity can be expressed as a weighted difference of the log probability estimate before and after pruning a bigram. The probability loss function LF_probability is defined as
\[ LF_{probability} = -\sum_{(w_{i-1}, w_i)} P(w_{i-1}, w_i)\big[\log P'(w_i \mid w_{i-1}) - \log P(w_i \mid w_{i-1})\big] \qquad (3) \]
where P(.|.) denotes the conditional probabilities assigned by the original model, P'(.|.) denotes the probabilities in the pruned model, and P(w_{i-1}, w_i) is a smoothed probability estimate in the original model. We note that LF_probability of Equation (3) is very similar to the criterion proposed by Seymore and Rosenfeld (1996), whose loss function is
\[ N(w_{i-1}, w_i)\big[\log P(w_i \mid w_{i-1}) - \log P'(w_i \mid w_{i-1})\big] \]
where N(w_{i-1}, w_i) is the discounted frequency with which the bigram was observed in training; it plays the same role as P(w_{i-1}, w_i) in Equation (3).</Paragraph>
<Paragraph position="2"> From Equations (2) and (3), we can see that a lower LF_probability is strongly correlated with lower perplexity. However, we found that LF_probability is suboptimal as a pruning criterion when evaluated on CER in our experiments. We assume that this is largely due to the deficiency of perplexity as a measure of LM performance.</Paragraph>
<Paragraph position="3"> Although perplexity is widely used because of its simplicity and efficiency, recent research shows that its correlation with error rate is not as strong as once thought. Clarkson and Robinson (2001) analyzed the reason and concluded that because the calculation of perplexity is based solely on the probabilities of the words contained in the test text, it disregards the probabilities of the alternative words that compete with the correct word (referred to as the target word below) within the decoder (e.g. in a speech recognition system). They therefore used other measures, such as rank and entropy, for LM evaluation. These measures are based on the probability distribution over the whole vocabulary. That is, if the test text is w_1, ..., w_N, the LM assigns a probability P(w_i | w_{i-1}) to each word, and the new measures are based on the values of P(w | w_{i-1}) for all w in the vocabulary. Since these measures take into account the probability distribution over all competing words (including the target word) within the decoder, they are, hopefully, better correlated with error rate and can be expected to evaluate LMs more precisely than perplexity.</Paragraph>
</Section>
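To make the probability criterion concrete, the sketch below scores every bigram with the weighted log-probability difference of Equation (3) and removes those whose loss falls below a threshold. It is only an illustrative sketch, not the implementation used in the paper: the data structures (`bigrams`, `backoff_prob`) are hypothetical, and it ignores the fact that pruning also changes the backoff weights of the pruned model.

```python
import math

def probability_loss(p_joint, p_cond_before, p_cond_after):
    """Eq. (3)-style loss for one bigram: its smoothed joint probability
    times the drop in its log conditional probability after pruning."""
    return -p_joint * (math.log(p_cond_after) - math.log(p_cond_before))

def prune_by_probability(bigrams, backoff_prob, threshold):
    """Keep a bigram only if pruning it would cost at least `threshold`.

    bigrams      : dict mapping (w_prev, w) -> (P(w|w_prev), P(w_prev, w))
    backoff_prob : function (w_prev, w) -> probability the pruned model
                   would assign through the backoff path
    """
    kept = {}
    for (w_prev, w), (p_cond, p_joint) in bigrams.items():
        loss = probability_loss(p_joint, p_cond, backoff_prob(w_prev, w))
        if loss >= threshold:   # a large loss means the bigram is worth keeping
            kept[(w_prev, w)] = (p_cond, p_joint)
    return kept
```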
<Section position="2" start_page="11" end_page="111" type="sub_section"> <SectionTitle> 3.2 Rank </SectionTitle>
<Paragraph position="0"> The rank of the target word w_i is defined as its position in the list of bigram probabilities P(w | w_{i-1}), w ∈ V, ordered from most to least probable, where V is the vocabulary. Thus the most likely word (within the decoder at a certain time point) has rank one, and the least likely has rank |V|, where |V| is the vocabulary size.</Paragraph>
<Paragraph position="1"> We propose to use rank for pruning as follows: all bigrams that change the rank by less than a threshold after pruning are removed from the model. The corresponding loss function LF_rank is defined as
\[ LF_{rank} = \sum_{(w_{i-1}, w_i)} P(w_{i-1}, w_i)\big[\log R'(w_i \mid w_{i-1}) - \log R(w_i \mid w_{i-1})\big] \]
where R(.|.) denotes the rank of the observed bigram (w_{i-1}, w_i) among all P(w | w_{i-1}), w ∈ V, before pruning, R'(.|.) is its new rank after pruning, and the summation is over all word pairs (w_{i-1}, w_i).</Paragraph>
</Section>
<Section position="3" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 3.3 Entropy </SectionTitle>
<Paragraph position="0"> Given a bigram model, the entropy H of the probability distribution over the vocabulary V in a context w_{i-1} is given by
\[ H(w_{i-1}) = -\sum_{w \in V} P(w \mid w_{i-1}) \log P(w \mid w_{i-1}) \]
We propose to use entropy for pruning as follows: all bigrams that change the entropy by less than a threshold after pruning are removed from the model.</Paragraph>
<Paragraph position="1"> The corresponding loss function LF_entropy is
\[ LF_{entropy} = \frac{1}{N}\sum_{i=1}^{N}\big[H'(w_{i-1}) - H(w_{i-1})\big] \]
where H and H' denote the entropies before and after pruning and N is the size of the test data.</Paragraph>
<Paragraph position="2"> The entropy-based pruning is conceptually similar to the pruning method proposed in (Stolcke, 1998). Stolcke used the Kullback-Leibler divergence between the pruned and un-pruned model probability distributions in a given context over the entire vocabulary. In particular, the increase in relative entropy from pruning a bigram is computed as
\[ -\sum_{(w_{i-1}, w)} P(w_{i-1}, w)\big[\log P'(w \mid w_{i-1}) - \log P(w \mid w_{i-1})\big] \]
where the summation is over all word pairs (w_{i-1}, w).</Paragraph>
</Section> </Section>
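The rank and entropy criteria both require looking at the full conditional distribution P(. | w_prev). The sketch below computes, for a single context, the two quantities the loss functions LF_rank and LF_entropy above are built from: the change in the target word's log rank and the change in the conditional entropy. The dictionaries passed in are hypothetical stand-ins for the model's renormalized distributions before and after pruning.

```python
import math

def rank_of(target, cond_probs):
    """Rank of `target` in the distribution P(. | w_prev): the most
    probable word has rank 1, the least probable rank |V|."""
    ordered = sorted(cond_probs, key=cond_probs.get, reverse=True)
    return ordered.index(target) + 1

def entropy_of(cond_probs):
    """Entropy (in nats) of the distribution P(. | w_prev) over V."""
    return -sum(p * math.log(p) for p in cond_probs.values() if p > 0)

def rank_and_entropy_change(cond_before, cond_after, target):
    """Changes used by the rank and entropy criteria when the bigram
    (w_prev, target) is pruned.

    cond_before / cond_after : dicts word -> P(word | w_prev) before and
                               after pruning (both renormalized over V).
    """
    d_rank = math.log(rank_of(target, cond_after)) - math.log(rank_of(target, cond_before))
    d_entropy = entropy_of(cond_after) - entropy_of(cond_before)
    return d_rank, d_entropy
```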
<Section position="6" start_page="111" end_page="111" type="metho"> <SectionTitle> 4 Empirical Comparison </SectionTitle>
<Paragraph position="0"> We evaluated the pruning criteria introduced in the previous section on a realistic application, Chinese text input. In this application, a string of Pinyin (phonetic alphabet) is converted into Chinese characters, which is the standard way of inputting text on Chinese computers. This problem is similar to speech recognition except that it involves no acoustic ambiguity. We measure performance in terms of character error rate (CER), which is the number of characters wrongly converted from the Pinyin string divided by the number of characters in the correct transcript. The role of the language model is, among all possible word strings that match the typed Pinyin string, to select the word string with the highest language model probability.</Paragraph>
<Paragraph position="1"> The training data we used is a balanced corpus of approximately 26 million characters from various domains of text such as newspapers, novels, and manuals. The test data consist of half a million characters that have been proofread and balanced among domain, style, and time.</Paragraph>
<Paragraph position="2"> The back-off bigram models we generated in this study are character-based models; that is, the training and test corpora are not word-segmented. As a result, the lexicon we used contains only 7,871 single Chinese characters. While word-based n-gram models are widely applied, we used character-based models for two reasons. First, pilot experiments showed that the results of word-based and character-based models are qualitatively very similar. More importantly, because we needed to build a very large number of models in our experiments, as shown below, character-based models are much more efficient, both for training and for decoding. We used the absolute discount smoothing method for model training.</Paragraph>
<Paragraph position="3"> None of the pruning techniques we consider is lossless. Therefore, whenever we compare pruning criteria, we do so by comparing their size reductions at the same CER.</Paragraph>
<Paragraph position="4"> Figure 2 shows how the CER varies with the number of bigrams in the models. For comparison, Figure 2 also includes the results of count-cutoff pruning. We can see that CER decreases as we keep more and more bigrams in the model; a steeper curve indicates a better pruning criterion.</Paragraph>
<Paragraph position="5"> The main result to notice is that rank-based pruning consistently achieves the best performance over a wide range of CER values, producing models that are 55-85% of the size of the probability-based pruned models with the same CER. An example of the detailed comparison is shown in Table 1, where the CER is 13.8% and the cutoff value is 1. The last column of Table 1 shows the relative model sizes with respect to the probability-based pruned model at 13.8% CER.</Paragraph>
<Paragraph position="6"> Another interesting result is the good performance of count cutoff, which almost overlaps with probability-based pruning at larger model sizes. Entropy-based pruning, unfortunately, achieved the worst performance. This result is consistent with that reported in (Goodman and Gao, 2000), where an explanation was offered.</Paragraph>
<Paragraph position="7"> We assume that the superior performance of rank-based pruning lies in the fact that rank, acting as an LM evaluation measure, is better correlated with CER. Clarkson and Robinson (2001) estimated the correlation between LM evaluation measures and word error rate in a speech recognition system. The part of their results relevant to our study is shown in Table 2, where r is the Pearson product-moment correlation coefficient, r_s is the Spearman rank-order correlation coefficient, and T is the Kendall rank-order correlation coefficient. The rank-based measure (i.e. the one related to our rank pruning criterion) has the best correlation with word error rate, followed by perplexity (i.e. related to our probability pruning criterion) and the mean entropy (i.e. related to our entropy pruning criterion), which supports our test results. We can conclude that LM evaluation measures that are better correlated with error rate lead to better pruning criteria.</Paragraph>
</Section>
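Because none of the criteria is lossless, the comparison above fixes a CER and asks how small each criterion can make the model. A minimal sketch of that bookkeeping is shown below, assuming each criterion has already been evaluated on a series of pruned models; the variable names and the commented usage are placeholders, not the paper's data.

```python
import numpy as np

def size_at_cer(sizes, cers, target_cer):
    """Interpolate the number of bigrams a pruning criterion needs in
    order to reach `target_cer`, from measured (size, CER) pairs."""
    sizes = np.asarray(sizes, dtype=float)
    cers = np.asarray(cers, dtype=float)
    order = np.argsort(cers)              # np.interp requires ascending x values
    return float(np.interp(target_cer, cers[order], sizes[order]))

# Hypothetical usage: relative size of rank-pruned vs. probability-pruned
# models at CER = 13.8%.
# ratio = size_at_cer(rank_sizes, rank_cers, 0.138) / \
#         size_at_cer(prob_sizes, prob_cers, 0.138)
```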
<Section position="7" start_page="111" end_page="111" type="metho"> <SectionTitle> 5 Combining Two Criteria </SectionTitle>
<Paragraph position="0"> We now investigate methods of combining the pruning criteria described above. We begin by examining the overlap of the bigrams pruned by two different criteria, to see which criteria might usefully be combined. Then the thresholding pruning algorithm described in Figure 1 is modified so as to make use of two pruning criteria simultaneously. The problem here is how to find the optimal settings of the pruning threshold pair (one threshold for each criterion) for different model sizes. We show how an optimal function defining these threshold-pair settings can be established efficiently using our techniques.</Paragraph>
<Section position="1" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 5.1 Overlap </SectionTitle>
<Paragraph position="0"> For the three pruning criteria above, we investigated the overlap of the bigrams pruned by each pair of criteria; there are three such pairs. The overlap results are shown in Figure 3.</Paragraph>
<Paragraph position="1"> We can see that the percentage of bigrams pruned by both criteria tends to increase as the model size decreases, but all criterion pairs have overlaps much lower than 100%. In particular, the average overlap between probability and entropy is approximately 71%, the largest among the three pairs, while the pair of rank and entropy has the smallest average overlap, 63.6%. These results suggest that we might be able to obtain improvements by combining two criteria for bigram pruning, since the information they provide is, in some sense, complementary.</Paragraph>
</Section>
<Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 5.2 Pruning by two criteria </SectionTitle>
<Paragraph position="0"> In order to prune a bigram model based on two criteria simultaneously, we modified the thresholding pruning algorithm described in Figure 1 so that it estimates the performance loss due to pruning each bigram individually using the two pruning criteria; the modified algorithm is shown in Figure 4. The remaining problem is how to find the optimal settings of the pruning threshold pair for different model sizes. This seems to be a very tedious task, since for each model size a large number of settings (th_1, th_2) would have to be tried to find the optimal one. We therefore convert the problem into the following one: how to find an optimal function that maps th_1 to th_2. We build a large number of models pruned using the algorithm described in Figure 4. For each model size, we find an optimal setting of the threshold pair (th_1, th_2) that results in a pruned model with the lowest CER. Finally, all these optimal threshold settings serve as the sample data from which the optimal function can be learned. In pilot experiments we found that a relatively small set of sample settings is enough to generate a function that is close enough to the optimal one. This allows us to search relatively quickly through what would otherwise be an overwhelmingly large search space (see the sketch below).</Paragraph>
</Section> </Section> </Paper>
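To make the two-criteria procedure of Section 5.2 concrete, here is a minimal sketch. The rule that a bigram is pruned only when its loss under both criteria falls below the corresponding threshold, and the linear form fitted for the threshold function, are assumptions made for illustration; the excerpt does not reproduce Figure 4 or specify the form of the learned function.

```python
import numpy as np

def prune_by_two_criteria(bigrams, loss1, loss2, th1, th2):
    """Two-criteria thresholding: remove a bigram only when its loss under
    BOTH criteria is below the corresponding threshold (assumed rule)."""
    return {bg: val for bg, val in bigrams.items()
            if not (loss1(bg) < th1 and loss2(bg) < th2)}

def fit_threshold_function(optimal_pairs):
    """Fit th2 = f(th1) from a few optimal (th1, th2) samples, one per
    model size, so that new model sizes can be reached by sweeping th1
    alone instead of searching the full (th1, th2) grid."""
    th1s, th2s = zip(*optimal_pairs)
    slope, intercept = np.polyfit(th1s, th2s, 1)   # linear fit as a placeholder
    return lambda th1: slope * th1 + intercept
```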