<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1021"> <Title>IMPROVEMENTS IN STOCHASTIC LANGUAGE MODELING</Title>

<Section position="4" start_page="0" end_page="107" type="metho"> <SectionTitle> 2. CORRECTING OVERESTIMATION IN THE BACKOFF MODEL 2.1. The Problem </SectionTitle>

<Paragraph position="0"> The backoff n-gram language model [1] estimates the probability of $w_n$ given the immediate past history $w_1^{n-1} = (w_1, \ldots, w_{n-1})$. It is defined recursively as:

$$P_n(w_n \mid w_1^{n-1}) = \begin{cases} d \cdot \dfrac{C(w_1^n)}{C(w_1^{n-1})} & \text{if } C(w_1^n) > 0 \\[6pt] \alpha(w_1^{n-1}) \cdot P_{n-1}(w_n \mid w_2^{n-1}) & \text{if } C(w_1^n) = 0 \end{cases} \quad (1)$$

where $d$, the discount ratio, is a function of $C(w_1^n)$, and the $\alpha$'s are the backoff weights, calculated to satisfy the sum-to-1 probability constraints.</Paragraph>

<Paragraph position="3"> The backoff language model is a compact yet powerful way of modeling the dependence of the current word on its immediate history. An important factor in the backoff model is its behavior in the backed-off cases, namely when a given n-gram $w_1^n$ is found not to have occurred in the training data. In these cases, the model assumes that the probability is proportional to the estimate provided by the (n-1)-gram, $P_{n-1}(w_n \mid w_2^{n-1})$.</Paragraph>

<Paragraph position="4"> This last assumption is reasonable most of the time, since no other sources of information are available. But for frequent (n-1)-grams, there may exist sufficient statistical evidence to suggest that the backed-off probabilities should in fact be much lower. This phenomenon occurs at any value of n, but is easiest to demonstrate for the simple case of n = 2, i.e. a bigram. Consider the following fictitious but typical example:

$$N = 1{,}000{,}000, \quad C(\text{ON}) = 10{,}000, \quad C(\text{AT}) = 10{,}000, \quad C(\text{CALL}) = 100,$$
$$C(\text{ON}, \text{AT}) = 0, \quad C(\text{ON}, \text{CALL}) = 0$$

N is the total number of words in the training set, and $C(w_i, w_j)$ is the number of $(w_i, w_j)$ bigrams occurring in that set. The backoff model computes:

$$P(\text{AT} \mid \text{ON}) = \alpha(\text{ON}) \cdot P(\text{AT}), \qquad P(\text{CALL} \mid \text{ON}) = \alpha(\text{ON}) \cdot P(\text{CALL})$$

Thus, according to this model, P("AT"|"ON") >> P("CALL"|"ON"). But this is clearly incorrect. In the case of "CALL", the expected number of ("ON","CALL") bigrams, assuming independence between "ON" and "CALL", is 1, so an actual count of 0 does not give much information, and may be ignored. However, in the case of "AT", the expected chance count of ("ON","AT") is 100, so an actual count of 0 means that the real probability P("AT"|"ON") is in fact much lower than chance. The backoff model does not capture this information, and thus grossly overestimates P("AT"|"ON").</Paragraph>

<Paragraph position="9"> This deficiency of the backoff model has been pointed out before [2, p. 457] but, to the best of our knowledge, has never been corrected. We suspect the reasons are twofold. First, it occurs only in the backed-off cases; for a well-trained bigram or trigram, this happens only a small fraction of the time. Second, overestimation degrades perplexity only mildly and indirectly, by effecting a slight underestimation of all the other probabilities.</Paragraph>

<Paragraph position="10"> We therefore did not expect this phenomenon to have a strong impact on perplexity. Nevertheless, we wanted to correct the problem and to measure its effect.</Paragraph>
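<Paragraph> The chance-count argument can be made concrete in a few lines of code. The following is a minimal sketch (ours, not from the paper), assuming the example counts given above; the function name and variables are illustrative only:

```python
# Expected chance count of bigram (w1, w2) under independence:
#   E[C(w1, w2)] = C(w1) * C(w2) / N
N = 1_000_000                                   # words in the training set
C = {"ON": 10_000, "AT": 10_000, "CALL": 100}   # unigram counts

def expected_chance_count(w1: str, w2: str) -> float:
    """Expected number of (w1, w2) bigrams if w1 and w2 were independent."""
    return C[w1] * C[w2] / N

print(expected_chance_count("ON", "AT"))    # 100.0 -> an actual count of 0 is strong evidence
print(expected_chance_count("ON", "CALL"))  #   1.0 -> an actual count of 0 is uninformative
```

An observed count of 0 against an expectation of 100 means P("AT"|"ON") is far below chance, which is precisely the evidence the unmodified backoff model ignores.</Paragraph>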
<Section position="1" start_page="107" end_page="107" type="sub_section"> <SectionTitle> 2.2. The Solution: Confidence Interval Capping </SectionTitle>

<Paragraph position="0"> Let $C(w_1^n) = 0$. Given a global confidence level Q, to be determined empirically, we calculate a confidence interval in which the true value of $P(w_n \mid w_1^{n-1})$ should lie, using the constraint:

$$\left[1 - P(w_n \mid w_1^{n-1})\right]^{C(w_1^{n-1})} > Q \quad (2)$$

The confidence interval is therefore $\left[0 \,\ldots\, \bigl(1 - Q^{1/C(w_1^{n-1})}\bigr)\right]$.</Paragraph>

<Paragraph position="3"> We then provide another parameter, P (0 < P < 1), and establish a ceiling, or a cap, at a point P within the confidence interval:

$$\mathrm{CAP}_{Q,P}\bigl(C(w_1^{n-1})\bigr) = P \cdot \bigl(1 - Q^{1/C(w_1^{n-1})}\bigr) \quad (3)$$

We now require that the estimated $P(w_n \mid w_1^{n-1})$ satisfy:

$$P(w_n \mid w_1^{n-1}) \leq \mathrm{CAP}_{Q,P}\bigl(C(w_1^{n-1})\bigr) \quad (4)$$

The backoff case of the standard model is therefore modified to:

$$P_n(w_n \mid w_1^{n-1}) = \min\Bigl[\alpha(w_1^{n-1}) \cdot P_{n-1}(w_n \mid w_2^{n-1}),\; \mathrm{CAP}_{Q,P}\bigl(C(w_1^{n-1})\bigr)\Bigr] \quad (5)$$

This capping of the estimates requires renormalization. But renormalization would increase the $\alpha$'s, which would in turn cause some backed-off probabilities to exceed the cap. An iterative reestimation of the $\alpha$'s is therefore required. The process was found to converge rapidly in all cases.</Paragraph>

<Paragraph position="10"> Note that, although some computation is required to determine the new weights, once the model has been computed it is no more complicated, nor significantly more time-consuming, than the original one.</Paragraph>
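<Paragraph> The iterative reestimation can be sketched as a simple fixed-point computation. The code below is a minimal illustration (ours, not from the paper) for a single history h in a bigram model; it assumes the discounted probability mass of the seen words is given, and all names and default parameter values are illustrative:

```python
def capped_alpha(history_count, seen_mass, unigram, unseen_words,
                 Q=0.05, P=0.5, tol=1e-12, max_iter=100):
    """Reestimate the backoff weight alpha(h) for one history h so that the
    backed-off probabilities min(alpha * P(w), CAP) sum to the leftover mass.

    history_count : C(h)
    seen_mass     : total discounted probability of words observed after h
    unigram       : dict word -> P(w), the lower-order model
    unseen_words  : words that never followed h (the backed-off cases)
    """
    cap = P * (1.0 - Q ** (1.0 / history_count))              # Eq. (3)
    leftover = 1.0 - seen_mass                                # mass for backed-off words
    alpha = leftover / sum(unigram[w] for w in unseen_words)  # standard, uncapped alpha
    for _ in range(max_iter):
        assigned = sum(min(alpha * unigram[w], cap) for w in unseen_words)
        if abs(assigned - leftover) < tol:
            break                                             # normalization restored
        # Words already at the cap keep it; raise alpha over the remaining words:
        capped = sum(cap for w in unseen_words if alpha * unigram[w] >= cap)
        uncapped = sum(unigram[w] for w in unseen_words if alpha * unigram[w] < cap)
        if uncapped == 0.0:
            break     # cap too low to absorb the leftover mass
        alpha = (leftover - capped) / uncapped
    return alpha, cap

# e.g. a frequent history with a small leftover mass:
alpha, cap = capped_alpha(10_000, 0.9996,
                          {"AT": 0.01, "OF": 0.02, "CALL": 0.0001},
                          ["AT", "OF", "CALL"])
```

Each pass fixes the words at the cap and renormalizes over the rest, consistent with the behavior described above; on examples like this one the loop settles in a handful of iterations.</Paragraph>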
</Section> </Section>

<Section position="5" start_page="107" end_page="107" type="metho"> <SectionTitle> 2.3. Results </SectionTitle>

<Paragraph position="0"> The bigram perplexity reduction for various tasks is shown in Table 1, which lists, for each test set, its backoff rate and the perplexity (PP) reduction achieved. BC-48K is the Brown corpus with the unabridged vocabulary of 48,455 words. BC-5K is the same corpus, restricted to the most frequent 5,000 words. ATIS is the class-based bigram developed at CMU for the ATIS task. WSJ is the official CSR 5c.vp task.</Paragraph>

<Paragraph position="1"> Although the reduction is modest, as expected, it should be remembered that it is achieved with hardly any increase in the complexity of the model. As can be predicted from the statistical analysis, when the vocabulary is larger, the backoff rate is greater, and the improvement in perplexity can be expected to be greater too.</Paragraph> </Section>

<Section position="6" start_page="107" end_page="108" type="metho"> <SectionTitle> 3. TRIGGER-BASED ADAPTATION </SectionTitle> <Paragraph position="0"/>

<Section position="1" start_page="107" end_page="108" type="sub_section"> <SectionTitle> 3.1. Motivation and Analysis </SectionTitle>

<Paragraph position="0"> Several adaptive language models have been proposed recently [3, 4, 5, 6] which use caching of the partially dictated document and interpolate a dynamic component based on the cache with the static component. These models have been successful in reducing the perplexity of the text considerably, and [5] also reports a positive effect on the word recognition rate.</Paragraph>

<Paragraph position="1"> All of these models make direct use of the words in the history of the document. They take advantage of the fact that words, and combinations of words, that have occurred in a given document have a higher likelihood of occurring in it again. But there is another source of information in the history that has not yet been tapped: within-document correlation between words or word sequences. Consider the sentence: "The district attorney's office launched a comprehensive investigation into loans made by several well connected banks." Based on this sentence alone, a cache-based model will not be able to anticipate any of the constituent words. But a human reader might use "DISTRICT ATTORNEY" and/or "LAUNCHED" to anticipate "INVESTIGATION", and "LOANS" to anticipate "BANKS".</Paragraph>

<Paragraph position="3"> In what follows, we describe a model that attempts to capture this type of information in a systematic way, using correlation between word sequences derived from a large corpus of text. In this model, if a word sequence A is positively and significantly correlated with another word sequence B, then (A → B) is considered a "trigger pair", with A being the trigger and B the triggered sequence. When A occurs in the document, it triggers B, causing its probability estimate to be increased.</Paragraph>

<Paragraph position="4"> In order for such a model to be effective, the following issues have to be addressed:

1. How to filter all possible trigger pairs. Even if we restrict our attention to pairs where A and B are both single words, the number of such pairs is too large. Let V be the size of the vocabulary. Note that, unlike in a bigram model, where the number of different consecutive word pairs is much less than $V^2$, the number of word pairs where both words occurred in the same document is a significant fraction of $V^2$.

2. How to combine evidence from multiple triggers. This is a special case of the general problem of combining evidence from several sources. We discuss several heuristics, and a plan for a more disciplined approach.

3. How to combine the triggering model with the static model.

We will discuss all three problems and our proposed solutions to them. This is ongoing research, and not all of our ideas have been tested yet. A solution to (1) will be discussed in some detail. When combined with simple-minded solutions to (2) and (3), it resulted in a perplexity reduction of between 10% and 32%, depending on the test set. We are currently working on implementing and testing some of the other solutions.</Paragraph>

<Paragraph position="8"> 3.2. Filtering the Trigger-Pairs

Let "history" denote the part of the text already seen by the system. Let A, B be any two word sequences. Then the events A ("A occurred in the history") and $B_0$ ("B will occur next in the document") are defined, and the correlation between them can be measured by their mutual information:

$$I(A : B_0) = \sum_{a \in \{A, \bar{A}\}} \sum_{b \in \{B_0, \bar{B}_0\}} P(a, b) \log \frac{P(a, b)}{P(a)\,P(b)} \quad (6)$$

Note that, although mutual information is symmetric with regard to its arguments, it is generally not true that $I(A : B_0) = I(B : A_0)$.</Paragraph>

<Paragraph position="9"> Should mutual information be our figure of merit in selecting the most promising trigger pairs? $I(A : B_0)$ measures the average number of bits we can save by considering A in predicting $B_0$. But this savings will materialize only if $B_0$ is true, namely if we indeed encounter the word sequence B next in the document. Our best estimate of this, at the time filtering is carried out, is $P(B_0 \mid A)$. We therefore define the expected utility of the trigger pair (A → B):

$$U(A \to B) = P(B_0 \mid A) \cdot I(A : B_0) \quad (7)$$

and suggest it as a criterion for selecting trigger pairs.</Paragraph>
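<Paragraph> As an illustration, the following sketch (ours, not from the paper) estimates $I(A : B_0)$ and $U(A \to B)$ from co-occurrence counts; the count arguments and function names are illustrative assumptions:

```python
from math import log2

def mutual_information(n, n_a, n_b, n_ab):
    """I(A : B0) between the binary events 'A occurred in the history' and
    'B occurs next', estimated from counts over n word positions:
    n_a with A in the history, n_b with B next, n_ab with both."""
    mi = 0.0
    # The four cells of the (A / not-A) x (B0 / not-B0) contingency table,
    # each paired with its two marginal counts:
    cells = [(n_ab,                 n_a,     n_b),
             (n_a - n_ab,           n_a,     n - n_b),
             (n_b - n_ab,           n - n_a, n_b),
             (n - n_a - n_b + n_ab, n - n_a, n - n_b)]
    for joint, marg_a, marg_b in cells:
        if joint > 0:
            mi += (joint / n) * log2(joint * n / (marg_a * marg_b))
    return mi

def expected_utility(n, n_a, n_b, n_ab):
    """U(A -> B) = P(B0 | A) * I(A : B0), Eq. (7)."""
    return (n_ab / n_a) * mutual_information(n, n_a, n_b, n_ab)

# Filtering keeps only pairs whose utility clears an empirical threshold, e.g.:
# triggers = [(a, b) for a, b in candidates if expected_utility(*counts[a, b]) > t]
```

</Paragraph>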
</Section>

<Section position="2" start_page="108" end_page="108" type="sub_section"> <SectionTitle> 3.3. Multiply-Triggered Sequences </SectionTitle>

<Paragraph position="0"> The problem of combining evidence from multiple sources is a general, largely unsolved problem in modeling. The ideal solution is to model explicitly each combination of values of the predictor variables, but this leads to an exponential growth in the number of parameters, which renders the model untrainable. At the other extreme, we can assume linearity and simply sum the contributions from the different sources. This may be a reasonable approximation in some models, but it is clearly inadequate in our case: "LOAN" is not 3 times more likely after 3 occurrences of "BANK" than it is after only 1 occurrence.</Paragraph>

<Paragraph position="2"> Multiple triggers have several important functions:

* They increase the reliability of the prediction in the face of unreliable history. Since we usually rely on the speech recognizer to provide us with the history, each word has a non-negligible chance of being erroneous.

* They disambiguate multiple-sense words.

* They intersect several broad semantic domains, and assign a higher weight to the intersected region.</Paragraph>

<Paragraph position="8"> We plan to model multiply-triggered sequences in a way that will capture at least some of the above phenomena. This requires statistical analysis of the interaction among the triggers, especially as it relates to the triggered sequence. We have just begun this analysis. One possibility, suggested by Kai-Fu Lee, is to consider the mutual information between the triggers. Triggers with high mutual information provide little additional evidence, and thus should not simply be added up.</Paragraph>

<Paragraph position="9"> For the system reported below, we considered several simple heuristics: averaging the effect of the different triggers, using the most informative trigger only, and a quickly saturating sum. In the limited context of our current model we found no significant difference between the three.</Paragraph>

<Paragraph position="10"> 3.4. Integration with the Static Model

A straightforward way to integrate the trigger model with a static model is to interpolate them linearly, using independent data to determine the weights. A somewhat fancier variant could use weights that depend on the length of the history; we expect the weight of the adaptive component to increase as the history grows. Under linear interpolation, the trigger model can be viewed as an adaptive unigram. This is the solution we used in the system reported below.</Paragraph>

<Paragraph position="11"> However, linear interpolation is not without its faults. Existing static models, such as n-grams, are excellent at using short-range information. For our adaptive component to be useful, it should complement the prediction power of the static component. But linear interpolation means that the adaptive component is blind to short-term constraints, yet the latter strongly affect the behavior of the static model. For example, in processing the sentence "The district attorney's office launched an investigation into loans made by several well connected banks", "DISTRICT ATTORNEY" may trigger "INVESTIGATION", causing its unigram probability to be raised to its level in documents containing the words "DISTRICT ATTORNEY". But when "INVESTIGATION" actually occurs, it is preceded by "LAUNCHED AN", which causes a trigram model to predict it with an even higher probability, rendering the adaptive contribution useless.</Paragraph>

<Paragraph position="12"> Thus a better method of combining the two components is to consider the information already provided by the static model. This can be done in two different ways:

* By using a POS-based trigger model, in the spirit of [4].

* By dynamically considering the probabilities produced by the static component, and modifying only those for which the adaptive component provides useful information. We are now experimenting with this method. Since it requires dynamic renormalization, it is only suitable for recognizers which compute the entire array of probabilities for every word.</Paragraph>
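<Paragraph> To make the combination schemes concrete, here is a minimal sketch (ours, not from the paper) of the adaptive-unigram view: per-trigger contributions to a word are merged with one of the three heuristics of Section 3.3, and the result is linearly interpolated with the static model. All names and the default weight are illustrative; the saturation point of 2 * MAX and the 0.02 to 0.06 weight range follow Section 3.5:

```python
def combine(contributions, method="saturating_sum"):
    """Merge the contributions of several triggers to the same word."""
    if not contributions:
        return 0.0
    if method == "max":              # most informative trigger only
        return max(contributions)
    if method == "average":          # average the triggers' effects
        return sum(contributions) / len(contributions)
    if method == "saturating_sum":   # sum, saturating quickly (here at 2 * MAX)
        return min(sum(contributions), 2 * max(contributions))
    raise ValueError(method)

def adapted_prob(word, history_triggers, static_prob, trigger_prob, lam=0.04):
    """Linear interpolation of the trigger 'unigram' with the static model.
    trigger_prob(a, w) plays the role of P(B0 | A) for the pair (a -> w);
    lam is the weight of the adaptive component."""
    contributions = [trigger_prob(a, word) for a in history_triggers]
    return lam * combine(contributions) + (1.0 - lam) * static_prob
```

</Paragraph>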
</Section>

<Section position="3" start_page="108" end_page="108" type="sub_section"> <SectionTitle> 3.5. The Experiment </SectionTitle>

<Paragraph position="0"> We used most of the WSJ LM training corpus, 42M words in all, to train a conventional backoff trigram model [1] for the DARPA 20,000-word closed-vocabulary task. We used the same data to derive the trigger list, as described below.</Paragraph>

<Paragraph position="1"> The conditional probability provided by the trigger pair (A → B) was estimated as the fraction of times an occurrence of A in the history was followed by an occurrence of B:

$$P(B_0 \mid A) = \frac{C(A, B_0)}{C(A)}$$

For the unconditional probability $P(B_0)$ we used the static unigram probability of B. We have since switched to using the average probability with which occurrences of B in the training data are predicted by the trigram model, but the results reported here do not reflect this change.</Paragraph>

<Paragraph position="4"> We first created an index of all but the 100 most frequent words, keeping for each word a detailed description of its occurrences. We included paragraph, sentence, and word location information, to allow consideration of different distance measures and different context levels. Excluding the top 100 words reduced the storage requirements by more than 50%; we assumed that frequently used words provide little contextual information. Using the index, we systematically searched for ordered word pairs whose expected utility, as given by Eq. (7), exceeded a given threshold. Of the 400 million possible pairs, we selected some 620,000.</Paragraph>

<Paragraph position="5"> For combining multiple triggerings of the same word, we used MAX, AVERAGE, or SUM saturating at 2 * MAX, as described in Section 3.3. We found no significant difference between these methods.</Paragraph>

<Paragraph position="6"> We combined the trigger model with the static trigram using linear interpolation. The automatically derived weights varied from task to task, but were usually in the range of 0.02 to 0.06 for the trigger component. We also tried weights that depend on the length of the history, but were surprised to find no improvement.</Paragraph> </Section> </Section> </Paper>