<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1023"> <Title>Improving Language Model Size Reduction using Better Pruning Criteria</Title> <Section position="4" start_page="1" end_page="11" type="intro"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> N-gram models predict the next word given the previous n-1 words by estimating the conditional ). In practice, n is usually set to 2 (bigram), or 3 (trigram). For simplicity, we restrict our discussion to bigrams P(w</Paragraph> <Paragraph position="2"> ), but our approaches can be extended to any n-gram. The bigram probabilities are estimated from the training data by maximum likelihood estimation (MLE). However, the intrinsic problem of MLE is Computational Linguistics (ACL), Philadelphia, July 2002, pp. 176-182. Proceedings of the 40th Annual Meeting of the Association for that of data sparseness: MLE leads to zero-value probabilities for unseen bigrams. To deal with this problem, Katz (1987) proposed a backoff scheme.</Paragraph> <Paragraph position="3"> He estimates the probability of an unseen bigram by utilizing unigram estimates as follows</Paragraph> <Paragraph position="5"> in the training data, P d represents the Good-Turing discounted estimate for seen word pairs, and a(w</Paragraph> <Paragraph position="7"> is a normalization factor.</Paragraph> <Paragraph position="8"> Due to the memory limitation in realistic applications, only a finite set of word pairs have conditional probability P(w</Paragraph> <Paragraph position="10"> represented in the model. The remaining word pairs are assigned a probability by backoff (i.e. unigram estimates). The goal of bigram pruning is to remove uncommon explicit bigram estimates P(w</Paragraph> <Paragraph position="12"> the model to reduce the number of parameters while minimizing the performance loss.</Paragraph> <Paragraph position="13"> The research on backoff n-gram model pruning can be formulated as the definition of the pruning criterion, which is used to estimate the performance loss of the pruned model. Given the pruning criterion, a simple thresholding algorithm for pruning bigram models can be described as follows: 1. Select a threshold th.</Paragraph> <Paragraph position="14"> 2. Compute the performance loss due to pruning each bigram individually using the pruning criterion.</Paragraph> <Paragraph position="15"> 3. Remove all bigrams with performance loss less than th.</Paragraph> <Paragraph position="16"> 4. Re-compute backoff weights.</Paragraph> <Paragraph position="17"> The algorithm in Figure 1 together with several pruning criteria has been studied previously (Seymore and Rosenfeld, 1996; Stolcke, 1998; Gao and Lee, 2000; etc). A comparative study of these techniques is presented in (Goodman and Gao, 2000).</Paragraph> <Paragraph position="18"> In this paper, three pruning criteria will be studied: probability, rank, and entropy. Probability serves as the baseline pruning criterion. It is derived from perplexity which has been widely used as a LM evaluation measure. Rank and entropy have been previously used as a metric for LM evaluation in (Clarkson and Robinson, 2001). In the current paper, these two measures will be studied for the purpose of backoff n-gram model pruning. In the next section, we will describe how pruning criteria are developed using these two measures.</Paragraph> </Section> class="xml-element"></Paper>