<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1710">
<Title>Modeling of Long Distance Context Dependency in Chinese</Title>
<Section position="3" start_page="0" end_page="21" type="metho">
<SectionTitle> 2 Ngram Modeling </SectionTitle>
<Paragraph position="0"> Let $S = w_1 w_2 \cdots w_n$, where the $w_i$'s are the words that make up the hypothesis, $n$ is the string length and $w_i$ is the $i$-th word in the string $S$. The probability of the word string, $P(S)$, can be computed by using the chain rule:
$$P(S) = \prod_{i=1}^{n} P(w_i \mid w_1 \cdots w_{i-1}) \qquad (2.1)$$
By taking the log of both sides of equation (2.1), we have the log probability of the word string, $\log P(S)$:
$$\log P(S) = \sum_{i=1}^{n} \log P(w_i \mid w_1 \cdots w_{i-1}) \qquad (2.2)$$
So the classical task of statistical language modeling becomes how to effectively and efficiently predict the next word given the previous words, that is, to estimate expressions of the form $P(w_i \mid w_1 \cdots w_{i-1})$. For convenience, this is often written as $P(w_i \mid h_i)$, where $h_i = w_1 \cdots w_{i-1}$ is called the history.</Paragraph>
<Paragraph position="2"> Ngram modeling has been widely used in estimating $P(w_i \mid h_i)$. Within an ngram model, the probability of the word occurring next is estimated based on only the previous $N-1$ words. That is to say,
$$P(w_i \mid w_1 \cdots w_{i-1}) \approx P(w_i \mid w_{i-N+1} \cdots w_{i-1})$$
Obviously, an ngram model assumes that the probability of the next word $w_i$ is independent of the word string $w_1 \cdots w_{i-N}$ in the history. The difference between bigram, trigram and other ngram models is the value of $N$. The parameters of an ngram model are thus the probabilities $P(w_i \mid w_{i-N+1} \cdots w_{i-1})$.</Paragraph>
<Paragraph position="4"> Given $h_i = w_{i-N+1} \cdots w_{i-1}$, an ngram model estimates the log probability of the word string by re-writing equation (2.2):
$$\log P(S) = \sum_{i=1}^{n} \log P(w_i \mid w_{i-N+1} \cdots w_{i-1}) \qquad (2.6)$$</Paragraph>
<Paragraph position="6"> Let $MI(A, B, d)$ denote the mutual information between the word string pair $(A, B)$ over a distance $d$, where $d$ is the distance between the two word strings in the pair and is equal to 1 when the two word strings are adjacent:
$$MI(A, B, d) = \log \frac{P(A, B, d)}{P(A) \cdot P(B)}$$
For a word string pair $(A, B)$ over a distance $d$, where $A$ and $B$ are word strings, mutual information reflects the degree of preference relationship between the two strings over the distance $d$. Several properties of mutual information are apparent:
* $MI(A, B, d)$ is not symmetric: in general $MI(A, B, d) \neq MI(B, A, d)$.
* If $A$ and $B$ are independent over a distance $d$, then $MI(A, B, d) = 0$.
* $MI(A, B, d)$ reflects the change of the information content when the word strings $A$ and $B$ are correlated. That is to say, the higher the value of $MI(A, B, d)$, the stronger the affinity $A$ and $B$ have. Therefore, we can use mutual information to measure the degree of preference relationship between a word string pair.</Paragraph>
<Paragraph position="8"> From the view of mutual information, an ngram model assumes mutual information independence between the next word $w_i$ and the words $w_1 \cdots w_{i-N}$ outside the $N$-word window. Using an alternative view of equivalence, an ngram model is one that partitions the data into equivalence classes based on the last $N-1$ words in the history.</Paragraph>
<Paragraph position="10"> As the trigram model is the most widely used in current research, we will mainly consider the trigram-based model. By re-writing equation (2.6), the trigram model estimates the log probability of the string, $\log P(S)$, as:
$$\log P(S) = \log P(w_1) + \log P(w_2 \mid w_1) + \sum_{i=3}^{n} \log P(w_i \mid w_{i-2} w_{i-1}) \qquad (2.9)$$</Paragraph>
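The following is a minimal sketch, not taken from the paper, of how the trigram log probability of equation (2.9) and the distance-$d$ mutual information $MI(A, B, d)$ can be estimated from corpus counts. The corpus format, the function names, and the use of unsmoothed relative frequencies are illustrative assumptions.

# Illustrative sketch only: unsmoothed maximum-likelihood estimates of the
# trigram log probability (equation 2.9) and of the distance-d mutual
# information MI(A, B, d) = log P(A, B, d) / (P(A) * P(B)).
# Assumes every queried word/pair was observed in training (no smoothing).
import math
from collections import Counter

def train_counts(sentences, max_d=10):
    """Collect unigram/bigram/trigram counts and word-pair counts per distance d."""
    uni, bi, tri = Counter(), Counter(), Counter()
    pair, dist_total = Counter(), Counter()
    n_words = 0
    for s in sentences:
        n_words += len(s)
        uni.update(s)
        bi.update(zip(s, s[1:]))
        tri.update(zip(s, s[1:], s[2:]))
        for i, a in enumerate(s):
            for d in range(1, max_d):          # d = 1 means adjacent words
                if i + d < len(s):
                    pair[(a, s[i + d], d)] += 1
                    dist_total[d] += 1
    return uni, bi, tri, pair, dist_total, n_words

def trigram_logprob(s, uni, bi, tri, n_words):
    """log P(S) as in equation (2.9), using unsmoothed relative frequencies."""
    lp = math.log(uni[s[0]] / n_words)                      # log P(w1)
    if len(s) > 1:
        lp += math.log(bi[(s[0], s[1])] / uni[s[0]])        # log P(w2 | w1)
    for i in range(2, len(s)):                              # log P(wi | wi-2 wi-1)
        lp += math.log(tri[(s[i-2], s[i-1], s[i])] / bi[(s[i-2], s[i-1])])
    return lp

def mutual_information(a, b, d, uni, pair, dist_total, n_words):
    """MI(A, B, d): how strongly word a prefers word b at distance d."""
    p_ab = pair[(a, b, d)] / dist_total[d]                  # joint probability at distance d
    p_a, p_b = uni[a] / n_words, uni[b] / n_words
    return math.log(p_ab / (p_a * p_b))

In practice the maximum-likelihood estimates above would need smoothing; they are kept unsmoothed here only to mirror the definitions directly.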
<Paragraph position="12"> To incorporate long distance context dependency, the mutual information of the next word with the history is assumed to be equal to the summation of that of the next word with the first word in the history and that of the next word with the rest of the word string in the history:
$$MI(w_1 w_2 \cdots w_{i-1}, w_i) = MI(w_1, w_i, i-1) + MI(w_2 \cdots w_{i-1}, w_i)$$
Since $\log P(w_i \mid h_i) = \log P(w_i) + MI(h_i, w_i)$ by the definition of mutual information, applying this assumption recursively until only the last $N-1$ words remain in the history gives:
$$\log P(w_i \mid h_i) = \log P(w_i \mid w_{i-N+1} \cdots w_{i-1}) + \sum_{j=1}^{i-N} MI(w_j, w_i, i-j) \qquad (3.7)$$
Obviously, the first item in equation (3.7) contributes the log probability of the ngram within an $N$-word window, while the second item is the summation of mutual information which contributes the long distance context dependency of the next word $w_i$ with the individual previous words over the long distance outside the $N$-word window.</Paragraph>
<Paragraph position="14"> By using equation (3.7), equation (2.2) can be re-written as:
$$\log P(S) = \sum_{i=1}^{n} \log P(w_i \mid w_{i-N+1} \cdots w_{i-1}) + \sum_{i=1}^{n} \sum_{j=1}^{i-N} MI(w_j, w_i, i-j) \qquad (3.8)$$
That is, the new model as shown in equation (3.8) consists of two components: an ngram model and an MI model. Therefore, we call equation (3.8) an MI-Ngram model. As a special case with $N = 3$, the MI-Trigram model estimates the log probability of the string as follows:
$$\log P(S) = \log P(w_1) + \log P(w_2 \mid w_1) + \sum_{i=3}^{n} \log P(w_i \mid w_{i-2} w_{i-1}) + \sum_{i=1}^{n} \sum_{j=1}^{i-3} MI(w_j, w_i, i-j)$$
Here the first three items are the values computed by the trigram model as shown in equation (2.9), and the fourth item contributes the summation of the mutual information of the next word with the words over the long distance outside the three-word window.</Paragraph>
<Paragraph position="16"> MI-Ngram modeling incorporates the long distance context dependency by computing the mutual information of the long distance dependent word pairs. Since the number of possible long distance dependent word pairs can be very large, it is impossible for MI-Ngram modeling to incorporate all of them. Therefore, for MI-Ngram modeling to be practically useful, how to select a reasonable number of word pairs becomes very important. Here two approaches are used (Zhou et al. 1998, 1999). One is to restrict the window size of possible word pairs by computing and comparing the perplexities (Shannon 1951) of various long distance bigram models for different distances: it is found that the bigram perplexities for distances outside a 10-word window become stable. Therefore, we only consider MI-Ngram modeling with a window size of 10 words. The other is to adopt average mutual information, computed for a given distance $d$ and two words $A$ and $B$, to select a reasonable number of long distance dependent word pairs. Compared with mutual information, average mutual information takes the joint probabilities into consideration; in this way, it prefers frequently occurring word pairs. In this paper, different numbers of long distance dependent word pairs are considered in MI-Ngram modeling within a window size of 10 words to evaluate the effect of different MI model sizes.</Paragraph>
</Section>
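Below is a minimal sketch of how equation (3.8) could be scored for the trigram case ($N = 3$): the trigram log probability plus the mutual information of word pairs that fall outside the trigram window but inside the 10-word window and belong to a pre-selected pair table. It builds on train_counts(), trigram_logprob() and mutual_information() from the earlier sketch. The selection score weights $MI(A, B, d)$ by the joint probability $P(A, B, d)$ as one plausible reading of "average mutual information"; this weighting, the names, and the 1,600,000 cut-off are assumptions, not the paper's exact formulas.

# Illustrative sketch of MI-Trigram scoring (equation (3.8) with N = 3).
WINDOW = 10  # window size used in the paper for long-distance word pairs

def select_pairs(uni, pair, dist_total, n_words, top_k=1_600_000):
    """Rank candidate (a, b, d) pairs by MI weighted with the joint probability
    P(A, B, d) (an assumed 'average mutual information' score) and keep the best."""
    scored = []
    for (a, b, d), count in pair.items():
        mi = mutual_information(a, b, d, uni, pair, dist_total, n_words)
        scored.append(((count / dist_total[d]) * mi, (a, b, d)))
    scored.sort(reverse=True)
    return {key for _, key in scored[:top_k]}

def mi_trigram_logprob(s, uni, bi, tri, pair, dist_total, n_words, selected):
    """log P(S) of equation (3.8) for N = 3: trigram component + MI component."""
    lp = trigram_logprob(s, uni, bi, tri, n_words)
    for i in range(len(s)):
        # long-distance pairs: outside the 3-word window, inside the 10-word window
        for j in range(max(0, i - WINDOW + 1), i - 2):
            key = (s[j], s[i], i - j)
            if key in selected:
                lp += mutual_information(s[j], s[i], i - j,
                                         uni, pair, dist_total, n_words)
    return lp

Only pairs present in the selected table contribute, so the MI component adds little memory or computation on top of the trigram model, which is the design point made above.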
<Section position="4" start_page="21" end_page="21" type="metho">
<SectionTitle> 4 Experimentation </SectionTitle>
<Paragraph position="0"> As trigram modeling is the most widely used in current research, only MI-Trigram modeling is studied here. Furthermore, in order to demonstrate the effect of different numbers of word pairs in MI-Trigram modeling, various MI-Trigram models with different numbers of word pairs and the same window size of 10 words are trained on the XINHUA news corpus of 29 million words, with a lexicon of about 56,000 words. Finally, the various MI-Trigram models are tested on the task of Chinese word segmentation, using the Chinese tag bank PFR1.0. PFR1.0 is developed by the Institute of Computational Linguistics at Beijing University; here, only the word segmentation annotation is used.</Paragraph>
<Paragraph position="2"> Perplexity is a measure of the average number of possible choices there are for a random variable. The perplexity $PP$ of a random variable $X$ with entropy $H(X)$ is defined as $PP(X) = 2^{H(X)}$. Entropy is a measure of uncertainty about a random variable: if a random variable $X$ occurs with a probability distribution $P(x)$, then its entropy is defined as $H(X) = -\sum_{x} P(x) \log_2 P(x)$. We use this relation when computing entropy. The units of entropy are bits of information, because the entropy of a random variable corresponds to the average number of bits per event needed to encode a typical sequence of event samples from that random variable's distribution.</Paragraph>
<Paragraph position="4"> Table 1 shows the perplexities of the various MI-Trigram models and their performances on Chinese word segmentation. Here, the precision (P) measures the number of correct words in the answer file over the total number of words in the answer file, and the recall (R) measures the number of correct words in the answer file over the total number of words in the key file. The F-measure is the weighted harmonic mean of precision and recall:
$$F = \frac{2 P R}{P + R}$$
Table 1 shows that:
* The perplexity drops and the F-measure rises quickly as the number of word pairs in MI-Trigram modeling increases from 0 to 1,600,000, and both change only slowly thereafter. Therefore, at least the best 1,600,000 word pairs should be included.
* Inclusion of the best 1,600,000 word pairs decreases the perplexity of MI-Trigram modeling by about 20 percent compared with the pure trigram model.
* The performance of Chinese word segmentation using the MI-Trigram model with 1,600,000 word pairs is 0.8 percent higher than using the pure trigram model (the MI-Trigram model with 0 word pairs). That is to say, about 35 percent of errors can be corrected by incorporating only 1,600,000 word pairs into the MI-Trigram model compared with the pure trigram model.
* For the Chinese word segmentation task, recalls are about 0.7 percent higher than precisions. The main reason may be the existence of unknown words: in our experimentation, unknown words are segmented into individual Chinese characters, which makes the number of segmented words in the answer file higher than that in the key file.</Paragraph>
<Paragraph position="6"> It is clear that MI-Ngram modeling has much better performance than ngram modeling. One advantage of MI-Ngram modeling is that its number of parameters is only a little larger than that of ngram modeling. Another advantage is that the number of word pairs can be kept reasonable in size without losing too much modeling power. Compared with ngram modeling, MI-Ngram modeling also captures the long-distance context dependency of word pairs using the concept of mutual information.</Paragraph>
<Paragraph position="8"> Table 1: Perplexities and Chinese word segmentation performance of various MI-Trigram models with a window size of 10 words (columns: number of word pairs, perplexity, precision, recall, F-measure).</Paragraph>
</Section>
</Paper>
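As a worked illustration of the evaluation measures used in Section 4, the sketch below computes perplexity from a sentence log probability via the per-word entropy, and precision, recall and F-measure for word segmentation by matching character spans between the answer (system output) and the key (gold standard). It is a sketch under these assumed input conventions, not the paper's evaluation code.

# Minimal sketch of the evaluation measures: PP = 2^H with per-word entropy H,
# and precision / recall / F-measure for Chinese word segmentation, where a
# word is counted as correct if its character span occurs in both answer and key.
import math

def perplexity(logprob_e, n_words):
    """PP = 2^H with H = -(1/n) * log2 P(S); logprob_e is a natural-log probability."""
    h = -logprob_e / (n_words * math.log(2))   # per-word entropy in bits
    return 2 ** h

def spans(words):
    """Turn a word sequence into character spans, e.g. ['ab', 'c'] -> {(0, 2), (2, 3)}."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out

def prf(answer, key):
    """Precision, recall and F-measure of segmented words (answer vs. key)."""
    correct = len(spans(answer) & spans(key))
    p = correct / len(answer)                  # correct words / words in answer file
    r = correct / len(key)                     # correct words / words in key file
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

For example, prf(['他', '来', '自', '北京'], ['他', '来自', '北京']) returns P = 0.5 and R ≈ 0.67: over-segmenting an unknown word into single characters inflates the answer-file word count and hurts precision more than recall, matching the observation in Section 4.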