<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0617">
  <Title>POS Tags and Decision Trees for Language Modeling</Title>
  <Section position="3" start_page="0" end_page="130" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> For recognizing spontaneous speech, the acoustic signal is to weak to narrow down the number of word candidates. Hence, recognizers employ a language model to take into account the likelihood of word seqiaences. To do this, the recognition problem is Cast as finding the most likely word sequence l?g given the acoustic signal A (Jelinek, 1985).</Paragraph>
    <Paragraph position="2"> The last line involves two probabilities that need to be estimated--the first due to the acoustic model Pr(AIW ) and the second due to the language model Pr(W). The language model probability can be expressed as follows, where we rewrite the sequence W explicitly as the se-</Paragraph>
    <Paragraph position="4"> To estimate the probability distribution Pr(WilWl, i-a), a training corpus is used to determine the relative frequencies. Due to sparseness of data, one must define equivalence classes amongst the contexts W~,i-1, which can be done by limiting the context to an n-gram language model (Jelinek, 1985). One can also mix in smaller size language models when there is not enough data to support the larger context by using either interpolated estimation (Jelinek and Mercer, 1980) or a backoff approach (Katz, 1987). A way of measuring the effectiveness of the estimated probability distribution is to measure the perplexity that it assigns to a test corpus (Bahl et al., 1977). Perplexity is an estimate of how well the language model is able to predict the next word of a test corpus in terms of the number of alternatives that need to be considered at each point. The perplexity of a test set Wi,N is calculated as 2 H, where H is the entropy, defined as follows.</Paragraph>
    <Paragraph position="6"/>
    <Section position="1" start_page="0" end_page="129" type="sub_section">
      <SectionTitle>
1.1 Class-based Language Models
</SectionTitle>
      <Paragraph position="0"> The choice of equivalence classes for a language model need not be the previous words.</Paragraph>
      <Paragraph position="1"> Words can be grouped into classes, and these classes can be used as the basis of the equivalence classes of the context rather than the word  identities (Jelinek, 1985). Below we give the equation usually used for a class-based trigram model, where the function 9 maps each word to its unambiguous class.</Paragraph>
      <Paragraph position="2"> Pr(Wilg(Wd ) Pr(g(Wdlg(W~-~ )g(W~-2) ) Using classes has the potential of reducing the problem of sparseness of data by allowing generalizations over similar words, as well as reducing the size of the language model.</Paragraph>
      <Paragraph position="3"> To determine the word classes, one can use the algorithm of Brown et al. (1992), which finds the classes that give high mutual information between the classes of adjacent words. In other words, for each bigram wi-lwi in a training corpus, choose the classes such that the classes for adjacent words 9(wi-1) and 9(wi) lose as little information about each other as possible. Brown et al. give a greedy algorithm for finding the classes. They start with each word in a separate class and iteratively combine classes that lead to the smallest decrease in mutual information between adjacent words.</Paragraph>
      <Paragraph position="4"> Kneser and Ney (1993) found that a class-based language model results in a perplexity improvement for the LOB corpus from 541 for a word-based bigram model to 478 for a class-based bi-gram model. Interpolating the word-based and class-based models resulted in an improvement to 439.</Paragraph>
    </Section>
    <Section position="2" start_page="129" end_page="129" type="sub_section">
      <SectionTitle>
1.2 Previous POS-Based Models
</SectionTitle>
      <Paragraph position="0"> One can also use POS tags, which capture the syntactic role of each word, as the basis of the equivalence classes (Jelinek, 1985). Consider the utterances &amp;quot;load the oranges&amp;quot; and &amp;quot;the load of bananas&amp;quot;. The word &amp;quot;load&amp;quot; is being used as an untensed verb in the first example, and as a noun in the second; and &amp;quot;oranges&amp;quot; and &amp;quot;bananas&amp;quot; are both being used as plural nouns.</Paragraph>
      <Paragraph position="1"> The POS tag of a word is influenced by, and influences the neighboring words and their POS tags. To use POS tags in language modeling, the typical approach is to sum over all of the POS possibilities. Below, we give the derivation based on using trigrams.</Paragraph>
      <Paragraph position="3"> Note that line 4 involves some simplifying assumptions; namely, that Pr(WilW~i-lP~i) can be approximated by Pr(WiIP~) and that Pr(PilWti-lP~i-1 ) can be approximated by Pr(P/IPti_i). These assumptions simplify the task of estimating the probability distributions. Relative frequency can be used directly for estimating the word probabilities, and trigram backoff and linear interpolation can be used for estimating the POS probabilities.</Paragraph>
      <Paragraph position="4"> The above approach for incorporating POS information into a language model has not been of much success in improving speech recognition performance. Srinivas (1996) reported a 24.5% increase in perplexity over a word-based model on the Wall Street Journal; Niesler and Woodland (1996) reported an 11.3% increase (but a 22-fold decrease in the number of parameters of such a model) for the LOB corpus; and Kneser and Ney (1993) report a 3% increase on the LOB corpus. The POS tags remove too much of the lexical information that is necessary for predicting the next word. Only by interpolating it with a word-based model is animprovement seen (Jelinek, 1985).</Paragraph>
    </Section>
    <Section position="3" start_page="129" end_page="130" type="sub_section">
      <SectionTitle>
1.3 Our Approach
</SectionTitle>
      <Paragraph position="0"> In past work (Heeman and Allen, 1997; Heeman, 1998), we introduced an alternative formulation for using POS tags in a language model. Here, POS tags are elevated from intermediate objects to be part of the output of the speech recognizer. Furthermore, we do not use the simplifying assumptions of the previous approach. Rather, we use a clustering algorithm to find words and POS tags that behave similarly. The output of the clustering algorithm is used by a decision tree algorithm to build a  set of equivalenc e classes of the contexts from which the word and POS probabilities are estimated. null In this paper, we show that the perplexity reduction that we previous reported using our POS-based model on the Trains corpus does translate into a word error rate reduction. The Trains corpus is very smal |with only 58,000 words of data. Hence, we also report on perplexity results using much larger amounts of training data, as afforded by using the Wall Street Journal corpus. We discuss how we take advantage of the POS tags to both improve and expedite the clustering and decision tree algorithms. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>