<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2239"> <Title>Word Association and MI-Trigger-based Language Modeling</Title> <Section position="2" start_page="1465" end_page="1465" type="metho"> <SectionTitle> 1 Concept of Trigger Pair </SectionTitle> <Paragraph position="0"> Based on the above description, we decide to use the trigger pair\[Rosenfeld94\] as the basic concept for extracting the word association information of an associated word pair. If a word A is highly associated with another word B, then (A --~ B) is considered a &quot;trigger pair&quot;, with A being the trigger and B the triggered word. When A occurs in the document, it triggers B, causing its probability estimate to change. A and B can be also extended to word sequences. For simplicity.</Paragraph> <Paragraph position="1"> here we will concentrate on the trigger relationships between single words although the ideas can be extended to longer word sequences.</Paragraph> <Paragraph position="2"> How to build a trigger-based language model? There remain two problems to be solved: 1) how to select a trigger pair? 2) how to measure a trigger pair'? We will discuss them separately in the next two sections.</Paragraph> </Section> <Section position="3" start_page="1465" end_page="1479" type="metho"> <SectionTitle> 2 Selecting Trigger Pair </SectionTitle> <Paragraph position="0"> Even if we can restrict our attention to the trigger pair (A, B) where A and B are both single words, the number of such pairs is too large. Therefore, selecting a reasonable number of the most powerful trigger pairs is important to a trigger-based language model.</Paragraph> <Section position="1" start_page="1465" end_page="1465" type="sub_section"> <SectionTitle> 2.1 Window Size </SectionTitle> <Paragraph position="0"> The most obvious way to control the number of the trigger pairs is to restrict the window size, which is the maximum distance between the trigger pair. In order to decide on a reasonable window size, we must know how much the distance between the two words in the trigger pair affects the word probabilities.</Paragraph> <Paragraph position="1"> Therefore, we construct the long-distance Word Bigram(WB) models for distanced = 1,2, .... 100. The distance-100 is used as a control, since we expect no significant information after that distance. We compute the conditional perplexity\[Shannon5 l\] for each long-distance WB model.</Paragraph> <Paragraph position="2"> Conditional perplexity is a measure of the average number of possible choices there are tbr a conditional distribution. The conditional perplexity of a conditional distribution with conditional entropy H(Y\]X) is defined to be 2 H(rtx) . Conditional Entropy is the entropy of a conditional distribution. Given two random variables )(and Y, a conditional probability mass function Prrx(YlX), and a marginal probability mass function Pr (Y), the conditional entropy of Y given X, H(Y\]X) is defined as: H(YIX)=-~-,~.Px.r(x,y)Iog: Prlx(ylx) (1) x.~Xy,EY For a large enough corpus, the conditional perplexity is usually an indication of the amount of information conveyed by the model: the lower the conditional perplexity, the more imbrmation it conveys and thus a better model. This is because the model captures as much as it can of that information, and whatever uncertainty remains shows up in the conditional perplexity. 
<Paragraph position="3"> Here, the training corpus is the XinHua corpus, which has about 57M (million) characters, or 29M words.</Paragraph> <Paragraph position="4"> From Table 1 we find that the conditional perplexity is lowest for d = 1, and that it increases significantly as we move through d = 2, 3, 4, 5 and 6. For d = 7, 8, 9, 10 and 11, the conditional perplexity increases only slightly. We conclude that significant information exists only in the last 6 words of the history. However, in this paper we restrict the maximum window size to 10.</Paragraph> </Section>

[Table 1: Conditional perplexities of the long-distance WB models for different distances (Distance vs. Perplexity).]

<Section position="3" start_page="1479" end_page="1479" type="sub_section"> <SectionTitle> 2.2 Selecting Trigger Pair </SectionTitle> <Paragraph position="0"> Given a window, we define two events: A_o, the word A occurs somewhere in the window preceding the current position; and B, the word B is the current word.</Paragraph> <Paragraph position="1"> Considering a particular trigger (A_o → B), we are interested in the correlation between the two events A_o and B.</Paragraph> <Paragraph position="2"> A simple way to assess the significance of the correlation between the two events A_o and B in the trigger (A_o → B) is to measure their cross product ratio (CPR). One often-used measure is the logarithmic measure of that quantity, which has units of bits and is defined as:

\log CPR(A_o; B) = \log_2 \frac{P(A_o, B)\, P(\bar{A}_o, \bar{B})}{P(A_o, \bar{B})\, P(\bar{A}_o, B)}    (2)

where P(X_o, Y) is the probability of the word pair (X_o, Y) occurring in the window.</Paragraph> <Paragraph position="3"> Although the cross product ratio measure is simple, it is not sufficient for determining the utility of a proposed trigger pair. Consider a highly correlated pair consisting of two rare words, and compare it to a less well correlated but much more common pair ("doctor" → "nurse"). An occurrence of the rare trigger word provides more information about its triggered word than an occurrence of the word "doctor" provides about the word "nurse". Nevertheless, since the word "doctor" is likely to be much more common in the test data, its average utility may be much higher. If we can afford to incorporate only one of the two pairs into our trigger-based model, the trigger pair ("doctor" → "nurse") may be preferable.</Paragraph> <Paragraph position="4"> Therefore, an alternative measure of the expected benefit provided by A_o in predicting B is the average mutual information (AMI) between the two:

AMI(A_o; B) = P(A_o, B) \log_2 \frac{P(B|A_o)}{P(B)} + P(A_o, \bar{B}) \log_2 \frac{P(\bar{B}|A_o)}{P(\bar{B})} + P(\bar{A}_o, B) \log_2 \frac{P(B|\bar{A}_o)}{P(B)} + P(\bar{A}_o, \bar{B}) \log_2 \frac{P(\bar{B}|\bar{A}_o)}{P(\bar{B})}    (3)

Obviously, Equation 3 takes the joint probability into consideration. We use this equation to select the trigger pairs.</Paragraph>
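A sketch of trigger-pair selection by AMI (Equation 3), assuming the four joint event counts for each candidate pair have already been collected from the corpus; the function names, the count-table layout, and the relative-frequency estimates are our assumptions, not the paper's implementation:

```python
import math
import heapq

def ami(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Average mutual information (Equation 3) between the binary events
    A_o (A occurred in the window) and B (B is the current word),
    computed from the four joint event counts."""
    n = n_ab + n_a_notb + n_nota_b + n_nota_notb
    p_b = (n_ab + n_nota_b) / n            # marginal P(B)
    total = 0.0
    for joint, cond_base, p_y in [
        (n_ab,        n_ab + n_a_notb,        p_b),      # (A_o, B)
        (n_a_notb,    n_ab + n_a_notb,        1 - p_b),  # (A_o, ~B)
        (n_nota_b,    n_nota_b + n_nota_notb, p_b),      # (~A_o, B)
        (n_nota_notb, n_nota_b + n_nota_notb, 1 - p_b),  # (~A_o, ~B)
    ]:
        if joint == 0:
            continue                        # treat 0 * log(...) as 0
        p_xy = joint / n                    # P(x, y)
        p_y_given_x = joint / cond_base     # P(y | x)
        total += p_xy * math.log2(p_y_given_x / p_y)
    return total

def select_triggers(pair_counts, top_n=1_000_000):
    """Keep the top_n candidate pairs by AMI.  pair_counts maps
    (A, B) -> (n_ab, n_a_notb, n_nota_b, n_nota_notb)."""
    return heapq.nlargest(top_n, pair_counts,
                          key=lambda p: ami(*pair_counts[p]))
```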
<Paragraph position="5"> In related works, [Rosenfeld94] used this equation, and [Church+90] used a variant of its first term, to automatically identify associated word pairs.</Paragraph> </Section> </Section> <Section position="4" start_page="1479" end_page="1479" type="metho"> <SectionTitle> 3 Measuring Trigger Pair </SectionTitle> <Paragraph position="0"> Consider a trigger pair (A_o → B) selected by the average mutual information AMI(A_o; B) of Equation 3. The mutual information MI(A_o; B) reflects the degree of preference relationship between the two words in the trigger pair, and can be computed as follows:

MI(A_o; B) = \log_2 \frac{P(A_o, B)}{P(A_o)\, P(B)}    (4)

where P(X) is the probability that the word X occurs in the corpus and P(A_o, B) is the probability that the word pair (A, B) occurs in the window.</Paragraph> <Paragraph position="1"> Several properties of mutual information are apparent:

* MI(A_o; B) is different from MI(B_o; A), i.e. mutual information is ordering-dependent.
* If A_o and B are independent, then MI(A_o; B) = 0.</Paragraph> <Paragraph position="2"> In the above equations, the mutual information MI(A_o; B) reflects the change of information content when the two words A_o and B are correlated. That is to say, the higher the value of MI(A_o; B), the stronger the affinity between the words A_o and B. Therefore, we use mutual information to measure the degree of preference relationship of a trigger pair.</Paragraph> </Section>
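A small sketch of Equation 4, assuming the windowed co-occurrence count and the corpus word counts are available; the parameter names and the relative-frequency estimates are our assumptions:

```python
import math

def mi_trigger(n_cooc, n_a, n_b, n_windows, n_words):
    """Mutual information of a trigger pair (Equation 4).
    n_cooc    -- times (A, B) co-occurred within the window
    n_a, n_b  -- corpus frequencies of the words A and B
    n_windows -- total number of window positions in the corpus
    n_words   -- corpus size in words
    Relative-frequency estimates are an assumption of this sketch."""
    p_ab = n_cooc / n_windows   # P(A_o, B): pair occurring within the window
    p_a = n_a / n_words         # P(A): word probability in the corpus
    p_b = n_b / n_words         # P(B)
    return math.log2(p_ab / (p_a * p_b))
```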
<Section position="5" start_page="1479" end_page="1479" type="metho"> <SectionTitle> 4 MI-Trigger-based Modeling </SectionTitle> <Paragraph position="0"> As discussed above, we can restrict the number of trigger pairs using a reasonable window size, select the trigger pairs using average mutual information, and then measure the trigger pairs using mutual information. In this section, we describe in greater detail how to build a trigger-based model. As the triggers are mainly determined by mutual information, we call them MI-Triggers. To build a concrete MI-Trigger model, two factors have to be considered.</Paragraph> <Paragraph position="1"> Obviously, one is the window size. As we have restricted the maximum window size to 10, we experiment with 10 different window sizes (ws = 1, 2, ..., 10).</Paragraph> <Paragraph position="2"> The other is whether to measure an MI-Trigger in a distance-independent (DI) or distance-dependent (DD) way. While a DI MI-Trigger model is simple, a DD MI-Trigger model has the potential of modeling word association better and is expected to have better performance, because many of the trigger pairs are distance-dependent. We have studied this issue on the XinHua corpus of 29M words by creating an index file that contains, for every word, a record of all of its occurrences with distance-dependent co-occurrence statistics. Some examples are shown in Table 2, which shows that "the more ... the more ..." has its highest correlation when the distance is 2, that "not only ... but also ..." has its highest correlation when the distances are 3, 4 and 5, and that "doctor / nurse" has its highest correlation when the distances are 1 and 2. After manually browsing hundreds of trigger pairs, we draw the following conclusions:

* Different trigger pairs display different behaviors.
* Behaviors of trigger pairs are distance-dependent and should be measured in a distance-dependent way.
* Most of the potential of triggers is concentrated in high-frequency words; the pair ("doctor" → "nurse") is indeed more useful than a rare-word pair.</Paragraph>

[Table 2: Co-occurrence statistics of some trigger pairs as a function of distance.]

<Paragraph position="5"> To compare the effects of the above two factors, 20 MI-Trigger models are built (the DI and DD MI-Trigger models with a window size of 1 are the same). The models differ in window size and in whether the evaluation is done in the DI or the DD way. Moreover, for ease of comparison, each MI-Trigger model includes the same number of the best trigger pairs. In our experiments, only the best 1M trigger pairs are included. Experiments to determine the effect of different numbers of trigger pairs in a trigger-based model will be conducted in Section 5.</Paragraph> <Paragraph position="6"> For simplicity, we represent a trigger pair as an XX-ws-MI-Trigger and call the corresponding trigger-based model the XX-ws-MI-Trigger model, where XX is DI or DD and ws is the window size. For example, the DD-6-MI-Trigger model is a distance-dependent MI-Trigger-based model with a window size of 6. All the models are built on the XinHua corpus of 29M words. Let us take the DD-6-MI-Trigger model as an example. We filter about 28,000 × 28,000 × 6 (≈ 4,700M) possible DD word pairs, with six different distances and about 28,000 Chinese words in the lexicon. As a first step, only word pairs that co-occur at least 3 times are kept. This results in 5.7M word pairs. Then, selecting by average mutual information, the best 1M word pairs are kept as trigger pairs. Finally, the best 1M MI-Trigger pairs are measured by mutual information. In this way, we build a DD-6-MI-Trigger model that includes the best 1M trigger pairs.</Paragraph> <Paragraph position="7"> Since the MI-Trigger-based models measure the trigger pairs using mutual information, which only reflects the change of information content when the two words in a trigger pair are correlated, a word unigram model is combined with them. Given S = w_1 w_2 ... w_n, we can estimate the logarithmic probability log P(S).</Paragraph> <Paragraph position="8"> For a DI-ws-MI-Trigger-based model,

\log P(S) \approx \sum_{i=1}^{n} \log P(w_i) + \sum_{i=2}^{n} \sum_{j=\max(1,\, i-ws)}^{i-1} MI(w_j; w_i)    (5)

and for a DD-ws-MI-Trigger-based model,

\log P(S) \approx \sum_{i=1}^{n} \log P(w_i) + \sum_{i=2}^{n} \sum_{j=\max(1,\, i-ws)}^{i-1} MI_{i-j}(w_j; w_i)    (6)

where ws is the window size and i − j is the distance between the words w_j and w_i. The first term in each of Equations 5 and 6 is the logarithmic probability of S under a word unigram model, and the second is the contribution of the MI-Trigger pairs in the MI-Trigger model.</Paragraph> <Paragraph position="9"> In order to measure the efficiency of the MI-Trigger-based models, the conditional perplexities of the 20 different models (each with the best 1M trigger pairs) are computed.</Paragraph> </Section>
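A sketch of the scoring rule in Equations 5 and 6, assuming the unigram probabilities and the MI-Trigger tables have already been built; the data-structure layout and the convention that missing pairs contribute 0 are our assumptions:

```python
import math

def log_prob(sentence, unigram, mi_table, ws, distance_dependent=False):
    """Score S = w_1 ... w_n with a unigram model plus MI-Trigger
    contributions (Equations 5 and 6).
    unigram  -- dict: word -> P(word)
    mi_table -- DI: dict (w_j, w_i) -> MI; DD: dict (w_j, w_i, d) -> MI_d
    Pairs absent from mi_table contribute 0 (an assumption of this sketch)."""
    total = sum(math.log2(unigram[w]) for w in sentence)  # unigram term
    for i in range(1, len(sentence)):
        for j in range(max(0, i - ws), i):                # window of size ws
            d = i - j                                     # distance between w_j and w_i
            key = ((sentence[j], sentence[i], d) if distance_dependent
                   else (sentence[j], sentence[i]))
            total += mi_table.get(key, 0.0)               # trigger term
    return total
```

</Paper>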