<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1041">
  <Title>An Empirical Study of Smoothing Techniques for Language Modeling</Title>
  <Section position="3" start_page="0" end_page="310" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> where Pm(ti) denotes the language model produced with method m and where the test data T is composed of sentences (tl,...,tzr) and contains a total of NT words. The entropy is inversely related to the average probability a model assigns to sentences in the test data, and it is generally assumed that lower entropy correlates with better performance in applications.</Paragraph>
    <Section position="1" start_page="310" end_page="310" type="sub_section">
      <SectionTitle>
1.1 Smoothing n-gram Models
</SectionTitle>
      <Paragraph position="0"> In n-gram language modeling, the probability of a string P(s) is expressed as the product of the probabilities of the words that compose the string, with each word probability conditional on the identity of the last n - 1 words, i.e., ifs = wl-..wt we have</Paragraph>
      <Paragraph position="2"> where w i j denotes the words wi * *. wj. Typically, n is taken to be two or three, corresponding to a bigram or trigram model, respectively. 1 Consider the case n = 2. To estimate the probabilities P(wilwi-,) in equation (1), one can acquire a large corpus of text, which we refer to as training data, and take</Paragraph>
      <Paragraph position="4"> where c(c 0 denotes the number of times the string c~ occurs in the text and Ns denotes the total number of words. This is called the maximum likelihood (ML) estimate for P(wilwi_l).</Paragraph>
      <Paragraph position="5"> While intuitive, the maximum likelihood estimate is a poor one when the amount of training data is small compared to the size of the model being built, as is generally the case in language modeling. For example, consider the situation where a pair of words, or bigram, say burnish the, doesn't occur in the training data. Then, we have PML(the Iburnish) = O, which is clearly inaccurate as this probability should be larger than zero. A zero bigram probability can lead to errors in speech recognition, as it disallows the bigram regardless of how informative the acoustic signal is. The term smoothing describes techniques for adjusting the maximum likelihood estimate to hopefully produce more accurate probabilities. null As an example, one simple smoothing technique is to pretend each bigram occurs once more than it actually did (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding</Paragraph>
      <Paragraph position="7"> where V is the vocabulary, the set of all words being considered. This has the desirable quality of 1To make the term P(wdw\[Z~,,+~) meaningful for i &lt; n, one can pad the beginning of the string with a distinguished token. In this work, we assume there are n - 1 such distinguished tokens preceding each sentence.</Paragraph>
      <Paragraph position="8"> preventing zero bigram probabilities. However, this scheme has the flaw of assigning the same probability to say, burnish the and burnish thou (assuming neither occurred in the training data), even though intuitively the former seems more likely because the word the is much more common than thou.</Paragraph>
      <Paragraph position="9"> To address this, another smoothing technique is to interpolate the bigram model with a unigram model PML(Wi) = c(wi)/Ns, a model that reflects how often each word occurs in the training data. For example, we can take Pinto p( i J i-1) = APM (w pW _l) + (1 getting the behavior that bigrams involving common words are assigned higher probabilities (Jelinek and Mercer, 1980).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>