<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1023">
  <Title>Improving Language Model Size Reduction using Better Pruning Criteria</Title>
  <Section position="4" start_page="1" end_page="11" type="intro">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> N-gram models predict the next word given the previous n-1 words by estimating the conditional  ). In practice, n is usually set to 2 (bigram), or 3 (trigram). For simplicity, we restrict our discussion to bigrams P(w</Paragraph>
    <Paragraph position="2"> ), but our approaches can be extended to any n-gram. The bigram probabilities are estimated from the training data by maximum likelihood estimation (MLE). However, the intrinsic problem of MLE is Computational Linguistics (ACL), Philadelphia, July 2002, pp. 176-182. Proceedings of the 40th Annual Meeting of the Association for that of data sparseness: MLE leads to zero-value probabilities for unseen bigrams. To deal with this problem, Katz (1987) proposed a backoff scheme.</Paragraph>
    <Paragraph position="3"> He estimates the probability of an unseen bigram by utilizing unigram estimates as follows</Paragraph>
    <Paragraph position="5"> in the training data, P d represents the Good-Turing discounted estimate for seen word pairs, and a(w</Paragraph>
    <Paragraph position="7"> is a normalization factor.</Paragraph>
    <Paragraph position="8"> Due to the memory limitation in realistic applications, only a finite set of word pairs have conditional probability P(w</Paragraph>
    <Paragraph position="10"> represented in the model. The remaining word pairs are assigned a probability by backoff (i.e. unigram estimates). The goal of bigram pruning is to remove uncommon explicit bigram estimates P(w</Paragraph>
    <Paragraph position="12"> the model to reduce the number of parameters while minimizing the performance loss.</Paragraph>
    <Paragraph position="13"> The research on backoff n-gram model pruning can be formulated as the definition of the pruning criterion, which is used to estimate the performance loss of the pruned model. Given the pruning criterion, a simple thresholding algorithm for pruning bigram models can be described as follows:  1. Select a threshold th.</Paragraph>
    <Paragraph position="14"> 2. Compute the performance loss due to pruning each bigram individually using the pruning criterion.</Paragraph>
    <Paragraph position="15"> 3. Remove all bigrams with performance loss less than th.</Paragraph>
    <Paragraph position="16"> 4. Re-compute backoff weights.</Paragraph>
    <Paragraph position="17">  The algorithm in Figure 1 together with several pruning criteria has been studied previously (Seymore and Rosenfeld, 1996; Stolcke, 1998; Gao and Lee, 2000; etc). A comparative study of these techniques is presented in (Goodman and Gao, 2000).</Paragraph>
    <Paragraph position="18"> In this paper, three pruning criteria will be studied: probability, rank, and entropy. Probability serves as the baseline pruning criterion. It is derived from perplexity which has been widely used as a LM evaluation measure. Rank and entropy have been previously used as a metric for LM evaluation in (Clarkson and Robinson, 2001). In the current paper, these two measures will be studied for the purpose of backoff n-gram model pruning. In the next section, we will describe how pruning criteria are developed using these two measures.</Paragraph>
  </Section>
class="xml-element"></Paper>