<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2114"> <Title>A Dynamic Language Model Based on Individual Word Domains</Title> <Section position="2" start_page="789" end_page="790" type="metho"> <SectionTitle> 1 Dynamic Language Model based on Word Language Models </SectionTitle> <Paragraph position="0"> Our dynamic language model builds a language model for each individual word. In order to do this we need to select which words are to be classified as significant and then create a language model for each of them. We excluded all the stop words ('is', 'to', 'all', 'some') because of their high frequency within the text and their limited contribution to the theme of the text. A list of stop words was obtained by merging the lists used by various web search engines, for example AltaVista.</Paragraph> <Paragraph position="1"> Secondly, we need to create a dictionary that contains the frequency of each word in the corpus. This is needed because we want to exclude those non-stop words which appear too often in the training corpus, for example words like 'dollars', 'point', etc. A hash file is constructed to store large amounts of information so that it can be retrieved quickly.</Paragraph> <Paragraph position="2"> The next step is to create the global language model by obtaining the text phrases and their probabilities. Frequencies of words and phrases are derived from a large text corpus and the conditional probability of a word given a sequence of preceding words is estimated. These conditional probabilities are combined to produce an overall language model probability for any given word sequence. The probability of a sequence of words is $P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$ (1), where $w_1^n = \{w_1, w_2, w_3, \ldots, w_n\}$ is a sentence or sequence of words. The individual conditional probabilities are approximated by the maximum likelihoods $P(w_i \mid w_1^{i-1}) \approx \frac{freq(w_1 \ldots w_i)}{freq(w_1 \ldots w_{i-1})}$ (2), where $freq(X)$ is the frequency of the phrase $X$ in the text.</Paragraph> <Paragraph position="7"> In equation (2), there are often unknown sequences of words, i.e. phrases which are not in the dictionary. The maximum likelihood probability is then zero. In order to improve the prediction of an unseen event, and hence the language model, a number of techniques have been explored, for example the Good-Turing estimate (Good I. J., 1953), the backing-off method (Katz S. M., 1987), deleted interpolation (Jelinek F. and Mercer R. L., 1984) or the weighted average n-gram model (O'Boyle P., Owens M. and Smith F. J., 1994). We use the weighted average n-gram technique (WA), which combines n-gram [1] phrase distributions of several orders using a series of weighting functions. The WA n-gram model has been shown to exhibit similar predictive power to other n-gram techniques whilst enjoying several benefits.</Paragraph> <Paragraph position="8"> Firstly, an algorithm for a WA model is relatively straightforward to implement in computer software; secondly, it is a variable n-gram model whose length depends on the context; and finally, it facilitates easy model extension [2].</Paragraph>
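For concreteness, the sketch below (Python, not the authors' code) mirrors the preprocessing and maximum-likelihood estimation described above: a frequency dictionary over the corpus, selection of significant words by excluding stop words and over-frequent words, and the maximum-likelihood conditional of equation (2). The frequency cut-off, the abbreviated stop-word list and the helper names are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

STOP_WORDS = {"is", "to", "all", "some"}   # abbreviated; the full list merges several web search engines' lists
MAX_RELATIVE_FREQ = 1e-3                   # assumed cut-off for over-frequent non-stop words ('dollars', 'point', ...)

def build_frequency_dict(tokens: List[str]) -> Counter:
    """The word-frequency dictionary stored in the hash file."""
    return Counter(tokens)

def significant_words(word_freq: Counter, n_tokens: int) -> Set[str]:
    """Non-stop words that are not too frequent to carry topical information."""
    return {w for w, f in word_freq.items()
            if w not in STOP_WORDS and f / n_tokens < MAX_RELATIVE_FREQ}

def ml_conditional(phrase_freq: Dict[Tuple[str, ...], int],
                   history: Tuple[str, ...], word: str) -> float:
    """Equation (2): freq(w1..wi) / freq(w1..wi-1); zero when the history phrase is unseen."""
    denom = phrase_freq.get(history, 0)
    return phrase_freq.get(history + (word,), 0) / denom if denom else 0.0
```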
<Paragraph position="9"> The weighted average probability of a word given the preceding words is $P_{WA}(w_{m+1} \mid w_1^m) = \sum_{i=0}^{m} \lambda_i \, P_{ML}(w_{m+1} \mid w_{m+1-i} \ldots w_m)$ (3), where the weighting functions $\lambda_i$ are determined by $N$, the number of tokens in the corpus, and by $freq(w_{m+1-i} \ldots w_m)$, the frequency of the phrase $w_{m+1-i} \ldots w_m$ in the text.</Paragraph> <Paragraph position="13"> The maximum likelihood probability of a word is $P_{ML}(w) = \frac{freq(w)}{N}$ (5), where $freq(w)$ is the frequency of the word $w$ in the text. This language model (defined by equations (3) and (5)) is what we term a standard n-gram language model or global language model.</Paragraph> <Paragraph position="14"> [1] An n-gram model contains the conditional probability of a word dependent on the previous n-1 words (Jelinek F., Mercer R. L. and Bahl L. R., 1983). [2] The &quot;ease of extension&quot; refers to the fact that additional training data can be incorporated into an existing WA model without the need to re-estimate smoothing parameters.</Paragraph> <Paragraph position="16"> Finally, the last step is the creation of a language model for each significant word, which is formed in the same manner as the global language model. The word-language training corpus to be used is the amalgamation of the text fragments taken from the global training corpus in which the significant word appears. A number of choices can be made as to how the word-training corpus for each significant word is selected. We initially construct what we term the &quot;paragraph context model&quot;: the global training corpus is scanned for a particular word and, each time the word is found, the paragraph containing that word is extracted. The paragraphs of text extracted for a particular word are joined together to form an individual word-training corpus, from which an individual word language model is built. Alternative methods include storing only the sentences in which the word appears, or extracting a piece of text M- words before and M+ words after the search word.</Paragraph> <Paragraph position="17"> Additionally, some restrictions on the number of words were imposed. This was done because of the high frequency of certain words. Such words were omitted since the additional information that they provide is minimal (conversely, language models for &quot;rare&quot; words are desirable as they provide significant additional information to that contained within the global language model). Once individual language models have been formed for each significant word (trained using the standard n-gram approach as used for the global model), there remains the problem of how the individual word language models will be combined with the global language model.</Paragraph> </Section> <Section position="3" start_page="790" end_page="791" type="metho"> <SectionTitle> 2 Combining the Models </SectionTitle> <Paragraph position="0"> We need to combine the probabilities obtained from each word language model and from the global language model, in order to obtain a conditional probability for a word given a sequence of words. The first model to be tested is an arithmetic combination of the global language model and the word language models.</Paragraph> <Paragraph position="1"> All the word language models and the global language model are weighted equally. We believe that words which appear far back in the previous word history do not have as much importance as the ones closest to the current word. Therefore we need to place a restriction on the number of language models.</Paragraph>
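The following sketch (illustrative rather than the authors' implementation) shows the restricted, equally weighted combination that equation (8) below formalises; the default values echo the settings reported in the results ($\alpha$ = 0.6, at most 10 word models) but are otherwise assumptions.

```python
from typing import List

def weighted_combination(p_global: float,
                         word_model_probs: List[float],
                         alpha: float = 0.6,
                         max_models: int = 10) -> float:
    """alpha * P_global + (1 - alpha) * average of the included word-model probabilities."""
    included = word_model_probs[:max_models]      # restriction on the number of word models
    if not included:
        return p_global
    return alpha * p_global + (1 - alpha) * sum(included) / len(included)
```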
<Paragraph position="3"> First, the conditional probabilities obtained from the word language models and the global language model can be combined in a linear interpolated model: $P(w \mid w_1^n) = \lambda_0 P_{global}(w \mid w_1^n) + \sum_{i=1}^{m} \lambda_i P_i(w \mid w_1^n)$ (6), where $P_i(w \mid w_1^n)$ is the conditional probability in the word language model for the significant word $w_i$, the $\lambda_i$ are the corresponding weights and $m$ is the maximum number of word models that we are including.</Paragraph> <Paragraph position="5"> If the same weight is given to all the word language models but not to the global language model, and if a restriction on the number of word language models to be included is enforced, the weighted model is defined as $P(w \mid w_1^n) = \alpha P_{global}(w \mid w_1^n) + \frac{1-\alpha}{m} \sum_{i=1}^{m} P_i(w \mid w_1^n)$ (8), where $\alpha$ is a parameter which is chosen to optimise the model.</Paragraph> <Paragraph position="6"> Furthermore, a method was used based on an exponential decay of the word model probabilities with distance. This stands to reason, as a word appearing several words previously will generally be less relevant than more recent words. Given a sequence of words, for example &quot;We had happy times in America ...&quot;, the distances from the word America are: We 5, had 4, happy 3, times 2, in 1. Happy and Times are significant words for which we have individual word language models. The exponential decay model for the word w, where in this case w represents the significant word America, is $P_{decay}(w \mid w_1^5) \propto P_{global}(w \mid w_1^5) + e^{-3/d} P_{Happy}(w \mid w_1^5) + e^{-2/d} P_{Times}(w \mid w_1^5)$ (9), where $P_{global}(w \mid w_1^n)$ is the conditional probability of the word w following the phrase $w_1 \ldots w_n$ in the global language model and $P_{Happy}(w \mid w_1^n)$ is the conditional probability of the word w following the phrase $w_1 \ldots w_n$ in the word language model for the significant word Happy.</Paragraph> <Paragraph position="9"> The same definition applies for the word model Times. d is the exponential decay distance, with d = 5, 10, 15, etc. The decaying factor exp(-l/d) introduces a cut-off: if l > d then exp(-l/d) is set to 0, where l is the word-model-to-word distance and d is the decay distance. So far the combination methods outlined above have been explored experimentally. However, they offer a fairly simplistic means of combining the individual and global language models; more sophisticated models are likely to offer further performance gains.</Paragraph> </Section> <Section position="4" start_page="791" end_page="792" type="metho"> <SectionTitle> 3 The Probabilistic-Union Model </SectionTitle> <Paragraph position="0"> The next method is the probabilistic-union model. This model is based on the logical concept of a disjunction of conjunctions, which is implemented as a sum of products. The union model has previously been applied to noisy speech recognition (Ming J. et al., 1999). Noisy conditions during speech recognition can have a serious effect on the likelihood of some features, which are normally combined using the geometric mean. This noise has a zeroing effect upon the overall likelihood produced for that particular speech frame. The use of the probabilistic union reduces the effect that each individual feature has on the combination, thereby loosening any zeroing effect.</Paragraph> <Paragraph position="1"> For the word language models, some of the conditional probabilities are zero or very small due to the small size of some of the word-model corpora. For these word models, many of the words in the global training corpus are not in the word-model training-corpus dictionary, and so the conditional probability will in many cases be zero or near zero, reducing the overall probability.</Paragraph>
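A sketch of the exponential-decay combination just described: each word-model probability is weighted by exp(-l/d) and dropped entirely when l > d. Only the decay factor and the cut-off are taken from the text; the normalisation shown here is an assumption.

```python
import math
from typing import List, Tuple

def decay_combined_prob(p_global: float,
                        word_model_probs: List[Tuple[int, float]],  # (distance l, P_i(w | history))
                        d: float = 7.0) -> float:
    """Combine the global model with distance-decayed word-model probabilities."""
    weights = [1.0]          # weight of the global language model
    probs = [p_global]
    for l, p in word_model_probs:
        if l > d:            # exponential cut-off: contributions beyond distance d are discarded
            continue
        weights.append(math.exp(-l / d))
        probs.append(p)
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)
```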
<Paragraph position="2"> As in noisy speech recognition, we wish to reduce the effect of this zeroing in the combined model. The probabilistic-union model is one of the possible solutions for the zeroing problem when combining language models.</Paragraph> <Paragraph position="3"> The union model is best illustrated with an example in which the number of word models to be included is m=4 and the word-model probabilities are assumed to be independent. Writing $P_i = P_i(w \mid w_1^n)$ for the conditional probability given by the word language model of the significant word $w_i$, the union model of order k, $P^{(k)}_{union}(w) = P^{(k)}_{union}(w \mid w_1^n)$, is, up to a normalizing constant, the probabilistic sum of the products of the $P_i$ taken $m-k+1$ at a time; for m=4 this ranges from the single product $P_1 P_2 P_3 P_4$ (order 1) to the probabilistic sum $P_1 \oplus P_2 \oplus P_3 \oplus P_4$ (order 4). The symbol '$\oplus$' denotes the probabilistic sum, which for two probabilities is $P_1 \oplus P_2 = P_1 + P_2 - P_1 P_2$.</Paragraph> <Paragraph position="6"> The combination of the global language model with the probabilistic-union model is defined as a weighted combination of the two, $P(w \mid w_1^n) = \alpha P_{global}(w \mid w_1^n) + (1-\alpha) P_{union}(w \mid w_1^n)$, where $\alpha$ is chosen to optimise the model.</Paragraph> <Section position="1" start_page="792" end_page="792" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> To evaluate the behaviour of one language model with respect to others we use perplexity. It measures the average branching factor (per word) of the sequence at every new word, with respect to some source model. The lower the branching factor, the lower the model's error rate. Therefore, the lower the branching factor (perplexity), the better the model. Let $w_i$ be a word in the language model and $w_1^m = \{w_1, w_2, w_3, \ldots, w_m\}$ a sentence or sequence of words. The perplexity of this sequence of words is $PP(w_1^m) = P(w_1^m)^{-1/m}$.</Paragraph> <Paragraph position="2"> The Wall Street Journal corpus (version WSJ0) contains about 38 million words and a dictionary of approximately 65,000 words. We select one quarter of the articles in the global training corpus as our training corpus (since the global training corpus is large and the normalisation process takes time). To test the new language model we use a subset of the test file given by WSJ0, selected at random. The training corpus that we are using contains 172,796 paragraphs, 376,589 sentences and 9,526,187 tokens. The test file contains 150 paragraphs, 486 sentences, 8,824 tokens and 1,908 word types. Although the size of this test file is small, further experiments with bigger training corpora and test files are planned.</Paragraph> <Paragraph position="3"> Although in our first experiments we used 5-grams in the calculation of the word models, the size of the n-gram has been reduced to 3-grams because the process of normalisation is slow in these experiments.</Paragraph> <Paragraph position="4"> The model based on a simple weighted combination offers improved results, up to 10% when $\alpha$ = 0.6 in Eq. (8) and a maximum of 10 word models are combined. Better results were found when the word models were weighted depending on their distance from the current word, that is, for the exponential decay model in Eq. (9) with d = 7 and the number of word models selected by the exponential cut-off (Table 1). For this model improvements of over 17% have been found.</Paragraph> <Paragraph position="5"> [Table 1: improvements of the exponential decay models with respect to the Global Language Model over the basic 3-gram model.]</Paragraph> <Paragraph position="6"> For the probabilistic-union model, we have as many union models as there are word language models. For example, if we wish to include m=4 word language models, the four union models are those with orders 1 to 4 (as illustrated above).</Paragraph>
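The sketch below illustrates one reading of the probabilistic-union combination: the order-k model is taken as the probabilistic sum of the products of the word-model probabilities taken m-k+1 at a time (so that order 5 with m = 6 gives sums of products of pairs, as reported in the results), followed by an assumed linear mix with the global model using the reported $\alpha$ = 0.6. The subset-size convention, normalisation and function names are illustrative assumptions, not the authors' code.

```python
from functools import reduce
from itertools import combinations
from math import prod
from typing import List

def prob_sum(a: float, b: float) -> float:
    """Probabilistic sum (the '+' in the disjunction): a (+) b = a + b - a*b."""
    return a + b - a * b

def union_model(word_model_probs: List[float], order: int) -> float:
    """Order-k probabilistic-union combination of m word-model probabilities."""
    size = len(word_model_probs) - order + 1          # conjunction size for this order
    terms = [prod(subset) for subset in combinations(word_model_probs, size)]
    return reduce(prob_sum, terms)

def combine_with_global(p_global: float, p_union: float, alpha: float = 0.6) -> float:
    """Assumed linear mix of the global model and the union model (alpha from the results)."""
    return alpha * p_global + (1 - alpha) * p_union
```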
<Paragraph position="7"> The results for the probabilistic-union model when the number of word models is m=5 and m=6 are shown in the tables below.</Paragraph> <Paragraph position="8"> [Table 2: improvements of the Probabilistic-Union Model with respect to the Global Language Model over the basic 3-gram model.]</Paragraph> <Paragraph position="9"> The best result obtained so far is an improvement of 20%, obtained with a maximum of 6 word models and order 5, i.e. sums of the products of pairs (Table 2). The value of alpha is 0.6.</Paragraph> <Paragraph position="10"> Conclusion In this paper we have introduced the concept of individual word language models to improve language model performance.</Paragraph> <Paragraph position="11"> Individual word language models permit an accurate capture of the domains in which significant words occur and hence improve language model performance. We also describe a new method of combining models, the probabilistic-union model, which has yet to be fully explored but whose first results show good performance. Even though the results are preliminary, they indicate that individual word models combined with the union model offer a promising means of reducing perplexity.</Paragraph> </Section> </Section> </Paper>