<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1041">
  <Title>An Empirical Study of Smoothing Techniques for Language Modeling</Title>
  <Section position="4" start_page="310" end_page="311" type="metho">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> The simplest type of smoothing used in practice is additive smoothing (Lidstone, 1920; Johnson, 1932; aeffreys, 1948), where we take</Paragraph>
    <Paragraph position="2"> and where Lidstone and Jeffreys advocate /i = 1.</Paragraph>
    <Paragraph position="3"> Gale and Church (1990; 1994) have argued that this method generally performs poorly.</Paragraph>
    <Paragraph position="4"> The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. It is not used directly for n-gram smoothing because, like additive smoothing, it does not perform the interpolation of lower- and higher-order models essential for good performance. Good-Turing states that an n-gram that occurs r times should be treated as if it had occurred r* times, where</Paragraph>
    <Paragraph position="6"> and where n~ is the number of n-grams that. occur exactly r times in the training data.</Paragraph>
    <Paragraph position="7"> Katz smoothing (1987) extends the intuitions of Good-Turing by adding the interpolation of higher-order models with lower-order models. It is perhaps the most widely used smoothing technique in speech recognition.</Paragraph>
    <Paragraph position="8"> Church and Gale (1991) describe a smoothing method that combines the Good-Turing estimate with bucketing, the technique of partitioning a set, of n-grams into disjoint groups, where each group is characterized independently through a set of parameters. Like Katz, models are defined recursively in terms of lower-order models. Each n-gram is assigned to one of several buckets based on its frequency predicted from lower-order models. Each bucket is treated as a separate distribution and Good-Turing estimation is performed within each, giving corrected counts that are normalized to yield probabilities.</Paragraph>
    <Paragraph position="10"> single bucket The other smoothing technique besides Katz smoothing widely used in speech recognition is due to Jelinek and Mercer (1980). They present a class of smoothing models that involve linear interpolation, e.g., Brown et al. (1992) take</Paragraph>
    <Paragraph position="12"> That is, the maximum likelihood estimate is interpolated with the smoothed lower-order distribution, which is defined analogously. Training a distinct I ~-1 for each wi_,~+li-1 is not generally felicitous; Wi--n-{-1 Bahl, Jelinek, and Mercer (1983) suggest partitioni-1 ing the 1~,~-~ into buckets according to c(wi_~+l), i-- n-l-1 where all )~w~-~ in the same bucket are constrained i-- n-l-1 to have the same value.</Paragraph>
    <Paragraph position="13"> To yield meaningful results, the data used to estimate the A~!-, need to be disjoint from the data ~-- n&amp;quot;l-1 used to calculate PML .2 In held-out interpolation, one reserves a section of the training data for this purpose. Alternatively, aelinek and Mercer describe a technique called deleted interpolation where different parts of the training data rotate in training either PML or the A,o!-' ; the results are then averaged. z-- n-\[-I Several smoothing techniques are motivated within a Bayesian framework, including work by Nadas (1984) and MacKay and Peto (1995).</Paragraph>
  </Section>
  <Section position="5" start_page="311" end_page="312" type="metho">
    <SectionTitle>
3 Novel Smoothing Techniques
</SectionTitle>
    <Paragraph position="0"> Of the great many novel methods that we have tried, two techniques have performed especially well.</Paragraph>
    <Section position="1" start_page="311" end_page="311" type="sub_section">
      <SectionTitle>
3.1 Method average-count
</SectionTitle>
      <Paragraph position="0"> This scheme is an instance of Jelinek-Mercer smoothing. Referring to equation (3), recall that Bahl et al. suggest bucketing the A~!-I according i--1 to c(Wi_n+l). We have found that partitioning the ~!-~ according to the average number of counts *--~+1 per non-zero element ~(~--~&amp;quot;+1) yields better Iwi:~(~:_.+~)&gt;01 results.</Paragraph>
      <Paragraph position="1"> Intuitively, the less sparse the data for estimating i-1 PML(WilWi_n+l), the larger A~,~-~ should be. *-- ~-t-1 While larger i-1 c(wi_n+l) generally correspond to less sparse distributions, this quantity ignores the allocation of counts between words. For example, we would consider a distribution with ten counts distributed evenly among ten words to be much more sparse than a distribution with ten counts all on a single word. The average number of counts per word seems to more directly express the concept of sparseness, null In Figure 1, we graph the value of ~ assigned to each bucket under the original and new bucketing schemes on identical data. Notice that the new bucketing scheme results in a much tighter plot, indicating that it is better at grouping together distributions with similar behavior.</Paragraph>
    </Section>
    <Section position="2" start_page="311" end_page="312" type="sub_section">
      <SectionTitle>
3.2 Method one-count
</SectionTitle>
      <Paragraph position="0"> This technique combines two intuitions. First, MacKay and Peto (1995) argue that a reasonable form for a smoothed distribution is</Paragraph>
      <Paragraph position="2"> The parameter a can be thought of as the number of counts being added to the given distribution,  where the new counts are distributed as in the lower-order distribution. Secondly, the Good-Turing estimate can be interpreted as stating that the number of these extra counts should be proportional to the number of words with exactly one count in the given distribution. We have found that taking</Paragraph>
      <Paragraph position="4"> works well, where i-i i is the number of words with one count, and where/3</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="312" end_page="313" type="metho">
    <SectionTitle>
4 Experimental Methodology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="312" end_page="312" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> We used the Penn treebauk and TIPSTER corpora distributed by the Linguistic Data Consortium. From the treebank, we extracted text from the tagged Brown corpus, yielding about one million words. From TIPSTER, we used the Associated Press (AP), Wall Street Journal (WSJ), and San Jose Mercury News (SJM) data, yielding 123, 84, and 43 million words respectively. We created two distinct vocabularies, one for the Brown corpus and one for the TIPSTER data. The former vocabulary contains all 53,850 words occurring in Brown; the latter vocabulary consists of the 65,173 words occurring at least 70 times in TIPSTER.</Paragraph>
      <Paragraph position="1"> For each experiment, we selected three segments of held-out data along with the segment of training data. One held-out segment was used as the test data for performance evaluation, and the other two were used as development test data for optimizing the parameters of each smoothing method.</Paragraph>
      <Paragraph position="2"> Each piece of held-out data was chosen to be roughly 50,000 words. This decision does not reflect practice very well, as when the training data size is less than 50,000 words it is not realistic to have so much development test data available. However, we made this decision to prevent us having to optimize the training versus held-out data tradeoff for each data size. In addition, the development test data is used to optimize typically very few parameters, so in practice small held-out sets are generally adequate, and perhaps can be avoided altogether with techniques such as deleted estimation.</Paragraph>
    </Section>
    <Section position="2" start_page="312" end_page="313" type="sub_section">
      <SectionTitle>
4.2 Smoothing Implementations
</SectionTitle>
      <Paragraph position="0"> In this section, we discuss the details of our implementations of various smoothing techniques. Due to space limitations, these descriptions are not comprehensive; a more complete discussion is presented in Chen (1996). The titles of the following sections include the mnemonic we use to refer to the implementations in later sections. Unless otherwise specified, for those smoothing models defined recursively in terms of lower-order models, we end the recursion by taking the n = 0 distribution to be the uniform distribution Punif(wi) = l/IV\[. For each method, we highlight the parameters (e.g., Am and 5 below) that can be tuned to optimize performance. Parameter values are determined through training on held-out data.</Paragraph>
      <Paragraph position="1">  For our baseline smoothing method, we use an instance of Jelinek-Mercer smoothing where we constrain all A,~!-I to be equal to a single value A,~ for ,- n-hi</Paragraph>
      <Paragraph position="3"> plus-delta) We consider two versions of additive smoothing. Referring to equation (2), we fix 5 = 1 in plus-one smoothing. In plus-delta, we consider any 6.  While the original paper (Katz, 1987) uses a single parameter k, we instead use a different k for each n &gt; 1, k,~. We smooth the unigram distribution using additive smoothing with parameter 5.</Paragraph>
      <Paragraph position="4">  (church-gale) To smooth the counts n~ needed for the Good-Turing estimate, we use the technique described by Gale and Sampson (1995). We smooth the unigram distribution using Good-tiering without any bucketing. null Instead of the bucketing scheme described in the original paper, we use a scheme analogous to the one described by Bahl, Jelinek, and Mercer (1983). We make the assumption that whether a bucket is large enough for accurate Good-Turing estimation depends on how many n-grams with non-zero counts occur in it. Thus, instead of partitioning the space of P(wi-JP(wi) values in some uniform way as was done by Church and Gale, we partition the space so that at least Cmi n non-zero n-grams fall in each bucket.</Paragraph>
      <Paragraph position="5"> Finally, the original paper describes only bigram smoothing in detail; extending this method to tri-gram smoothing is ambiguous. In particular, it is unclear whether to bucket trigrams according to i-1 i--1 P(wi_JP(w d or P(wi_JP(wilwi-1). We chose the former; while the latter may yield better performance, our belief is that it is much more difficult to implement and that it requires a great deal more computation.</Paragraph>
      <Paragraph position="6">  (interp-held-out and interp-del-int) We implemented two versions of Jelinek-Mercer smoothing differing only in what data is used to  train the A's. We bucket the A ~-1 according to Wi--n-bl i-1 C(Wi_~+I) as suggested by Bahl et al. Similar to our Church-Gale implementation, we choose buckets to ensure that at least Cmi n words in the data used to train the A's fall in each bucket.</Paragraph>
      <Paragraph position="7"> In interp-held-out, the A's are trained using held-out interpolation on one of the development test sets. In interp-del-int, the A's are trained using the relaxed deleted interpolation technique described by Jelinek and Mercer, where one word is deleted at a time. In interp-del-int, we bucket an n-gram according to its count before deletion, as this turned out to significantly improve performance.  (new-avg-count and new-one-count) The implementation new-avg-count, corresponding to smoothing method average-count, is identical to interp-held-out except that we use the novel bucketing scheme described in section 3.1. In the implementation new-one-count, we have different parameters j3~ and 7~ in equation (4) for each n.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>