<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1121">
<Title>POS Tagging versus Classes in Language Modeling</Title>
<Section position="2" start_page="0" end_page="179" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> For recognizing spontaneous speech, the acoustic signal is too weak to narrow down the number of word candidates. Hence, speech recognizers employ a language model that prunes out acoustic alternatives by taking into account the previous words that were recognized. In doing this, the speech recognition problem is viewed as finding the most likely word sequence \hat{W} given the acoustic signal A (Jelinek, 1985).</Paragraph>
<Paragraph position="1"> \hat{W} = \arg\max_{W} \Pr(W \mid A) </Paragraph>
<Paragraph position="2"> We can rewrite the above using Bayes' rule.</Paragraph>
<Paragraph position="3"> \hat{W} = \arg\max_{W} \frac{\Pr(A \mid W)\,\Pr(W)}{\Pr(A)} </Paragraph>
<Paragraph position="4"> Since Pr(A) is independent of the choice of W, we simplify the above as follows.</Paragraph>
<Paragraph position="5"> \hat{W} = \arg\max_{W} \Pr(A \mid W)\,\Pr(W) </Paragraph>
<Paragraph position="6"> The first term, Pr(A|W), is the acoustic model and the second term, Pr(W), is the language model, which assigns a probability to the sequence of words W. We can rewrite W explicitly as a sequence of words W_1 W_2 W_3 ... W_N, where N is the number of words in the sequence. For expository ease, we use the notation W_{i,j} to refer to the sequence of words W_i to W_j. We now use the definition of conditional probabilities to rewrite Pr(W_{1,N}) as follows.</Paragraph>
<Paragraph position="7"> \Pr(W_{1,N}) = \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1}) </Paragraph>
<Paragraph position="8"> To estimate the probability distribution, a training corpus is typically used from which the probabilities can be estimated using relative frequencies.</Paragraph>
<Paragraph position="9"> Due to sparseness of data, one must define equivalence classes amongst the contexts W_{1,i-1}, which can be done by limiting the context to an n-gram language model (Jelinek, 1985). One can also mix in smaller-size language models when there is not enough data to support the larger context by using either interpolated estimation (Jelinek and Mercer, 1980) or a backoff approach (Katz, 1987). A way of measuring the effectiveness of the estimated probability distribution is to measure the perplexity that it assigns to a test corpus (Bahl et al., 1977). Perplexity is an estimate of how well the language model is able to predict the next word of a test corpus in terms of the number of alternatives that need to be considered at each point. The perplexity of a test set w_{1,N} is calculated as 2^H, where H is the entropy, which is defined as follows.</Paragraph>
<Paragraph position="10"> H = -\frac{1}{N} \log_2 \Pr(w_{1,N}) </Paragraph>
<Section position="1" start_page="0" end_page="179" type="sub_section">
<SectionTitle> 1.1 Class-based Language Models </SectionTitle>
<Paragraph position="0"> The choice of equivalence classes for a language model need not be the previous words. Words can be grouped into classes, and these classes can be used as the basis of the equivalence classes of the context rather than the word identities (Jelinek, 1985). Below we give the equation usually used for a class-based trigram model, where the function g maps each word to its unambiguous class.</Paragraph>
<Paragraph position="1"> \Pr(W_i \mid W_{1,i-1}) \approx \Pr(W_i \mid g(W_i)) \, \Pr(g(W_i) \mid g(W_{i-1}) \, g(W_{i-2})) Using classes has the potential of reducing the problem of sparseness of data by allowing generalizations over similar words, as well as reducing the size of the language model.</Paragraph>
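As an illustration of the ideas above, here is a minimal sketch (not from the paper) of how a class-based trigram model can be estimated by relative frequencies and evaluated by perplexity, computed as 2^H. The toy corpus, the word-to-class map g, and the flooring of unseen events are all assumptions made for this example.

```python
# A minimal sketch (not from the paper) of a class-based trigram model:
# Pr(W_i | W_{1,i-1}) ~ Pr(W_i | g(W_i)) * Pr(g(W_i) | g(W_{i-1}) g(W_{i-2})),
# with relative-frequency estimates and perplexity computed as 2^H.
# The corpus, the class map `g`, and the flooring of unseen events are assumptions.
import math
from collections import Counter

def train_class_trigram(corpus, g):
    """Estimate Pr(w | g(w)) and Pr(c3 | c1, c2) by relative frequencies."""
    word_count = Counter(corpus)
    class_count = Counter(g[w] for w in corpus)
    classes = [g[w] for w in corpus]
    tri = Counter(zip(classes, classes[1:], classes[2:]))
    bi = Counter(zip(classes, classes[1:]))
    emit = {w: word_count[w] / class_count[g[w]] for w in word_count}
    trans = {(c1, c2, c3): n / bi[(c1, c2)] for (c1, c2, c3), n in tri.items()}
    return emit, trans

def perplexity(test, g, emit, trans, floor=1e-10):
    """Perplexity = 2^H, with H = -(1/N) * sum_i log2 Pr(w_i | context)."""
    log_prob, n = 0.0, 0
    for w1, w2, w3 in zip(test, test[1:], test[2:]):
        p = emit.get(w3, floor) * trans.get((g[w1], g[w2], g[w3]), floor)
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)

# Hypothetical usage: words mapped to part-of-speech-like classes.
corpus = "the cat sat on the mat the dog sat on the rug".split()
g = {"the": "DET", "cat": "N", "dog": "N", "mat": "N", "rug": "N",
     "sat": "V", "on": "P"}
emit, trans = train_class_trigram(corpus, g)
print(perplexity("the dog sat on the mat".split(), g, emit, trans))
```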
<Paragraph position="2"> To determine the word classes, one can use the algorithm of Brown et al. (1992), which finds the classes that give high mutual information between the classes of adjacent words. In other words, for each bigram w_{i-1} w_i in a training corpus, choose the classes such that the classes for adjacent words, g(w_{i-1}) and g(w_i), lose as little information about each other as possible. Brown et al. give a greedy algorithm for finding the classes. They start with each word in a separate class and iteratively combine classes that lead to the smallest decrease in mutual information between adjacent words. Kneser and Ney (1993) found that a class-based language model results in a perplexity improvement for the LOB corpus from 541 for a word-based bigram model to 478 for a class-based bigram model. Interpolating the word-based and class-based models resulted in an improvement to 439.</Paragraph>
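As a concrete illustration of the greedy procedure just described, the following toy sketch starts with one class per word and repeatedly merges the pair of classes whose merge causes the smallest decrease in average mutual information between adjacent classes. It recomputes the mutual information from scratch at every step, so unlike the incremental bookkeeping used in practice it is only feasible for very small vocabularies; the corpus and the target number of classes are hypothetical.

```python
# A toy sketch of the greedy class-merging idea described above (not Brown et
# al.'s actual implementation): every word starts in its own class, and at each
# step the two classes whose merge loses the least average mutual information
# (AMI) between adjacent classes are combined. AMI is recomputed from scratch,
# so this is only practical for very small vocabularies.
import math
from collections import Counter
from itertools import combinations

def avg_mutual_information(corpus, cls):
    """AMI = sum over class bigrams (c1, c2) of p(c1,c2) * log2(p(c1,c2) / (p(c1) p(c2)))."""
    pairs = Counter((cls[a], cls[b]) for a, b in zip(corpus, corpus[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pairs.items():
        left[c1] += n
        right[c2] += n
    return sum((n / total) * math.log2((n / total) /
               ((left[c1] / total) * (right[c2] / total)))
               for (c1, c2), n in pairs.items())

def greedy_classes(corpus, num_classes):
    cls = {w: w for w in set(corpus)}  # each word starts in its own class
    while len(set(cls.values())) > num_classes:
        best_ami, best_cls = None, None
        for c1, c2 in combinations(sorted(set(cls.values())), 2):
            trial = {w: (c1 if c == c2 else c) for w, c in cls.items()}
            ami = avg_mutual_information(corpus, trial)
            if best_ami is None or ami > best_ami:  # least decrease in AMI
                best_ami, best_cls = ami, trial
        cls = best_cls
    return cls

# Hypothetical usage: cluster a toy corpus into 3 classes.
corpus = "the cat sat on the mat the dog sat on the rug".split()
print(greedy_classes(corpus, 3))
```
</Section> </Section> </Paper>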