<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1147">
  <Title>Fast Computation of Lexical Affinity Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Models for Word Co-occurrence
</SectionTitle>
    <Paragraph position="0"> There are two types of models for the co-occurrence of word pairs: functional models and distance models. Distance models use only positional information to measure co-occurrence frequency (Beeferman et al., 1997; Yuret, 1998; Rosenfeld, 1996).</Paragraph>
    <Paragraph position="1"> A special case of the distance model is the n-gram model, where the only distance allowed between pairs of words in the model is one. Any pair of word represents a parameter in distance models. Therefore, these models have to deal with combinatorial explosion problems, especially when longer sequences are considered. Functional models use the underlying syntactic function of words to measure co-occurrence frequency (Weeds and Weir, 2003; Niesler and Woodland, 1997; Grefenstette, 1993).</Paragraph>
    <Paragraph position="2"> The need for parsing affects the scalability of these models.</Paragraph>
    <Paragraph position="3"> Note that both distance and functional models rely only on pairs of terms comprised of a single word. Consider the pair of terms &amp;quot;NEW YORK&amp;quot; and &amp;quot;TERRORISM&amp;quot;, or any pair where one of the two items is itself a collocation. To best of our knowledge, no model tries to estimate composite terms of form a14a15a5a17a16a19a18a20a7a10a9a11a13a12 or a14a21a5a17a16a22a18a20a7a10a9a11a23a18a25a24a26a12 where a16 ,a7 ,a11 ,a24 are words in the vocabulary, without regard to the distribution function of a14 .</Paragraph>
    <Paragraph position="4"> In this work, we use models based on distance information. The first is an independence model that is used as baseline to determine the strength of the affinity between a pair of terms. The second is intended to fit the empirical term distribution, reflecting the actual affinity between the terms.</Paragraph>
    <Paragraph position="5"> Notation. Let a27 be a random variable with range comprising of all the words in the vocabulary. Also, let us assume that a27 has multinomial probability distribution function a14a29a28 . For any pair of terms a7 and a24 , let a30a32a31a34a33a35 be a random variable with the distance distribution for the co-occurrence of terms a7 and a24 . Let the probability distribution function of the random variable a30a15a31a34a33a35 be a14a6a36a37a5a8a7a38a18a25a24a26a12 and the corresponding cumulative be a39a40a36a37a5a8a7a38a18a25a24a26a12 .</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Independence Model
</SectionTitle>
      <Paragraph position="0"> Let a7 and a24 be two terms, with occurrence probabilities a14a41a28a26a5a8a7a42a12 and a14a41a28a43a5a17a24a44a12 . The chances, under independence, of the pair a7 and a24 co-occurring within a specific distance a45 ,a14a46a36a47a5a8a7a48a18a25a24a49a9a45a23a12 is given by a geometric distribution with parameter a4 , a30a51a50a52a27a47a53a13a54a55a5a8a45a43a56a17a4a49a12 . This is straightforward since if a7 and a24 are independent then a14a6a28a43a5a8a7a57a9a24a44a12a59a58a60a14a61a28a26a5a8a7a42a12 and similarly a14a6a28a62a5a17a24a49a9a7a42a12a59a58 a14a41a28a43a5a17a24a26a12 . If we fix a position for a a7 , then if independent, the next a24 will occur with probability a14a41a28a43a5a17a24a26a12a59a63a64a5a66a65a68a67a69a14a61a28a43a5a17a24a44a12a70a12a34a71a25a72a74a73 at distance a45 of a7 . The expected distance is the mean of the geometric distribution with parametera4 .</Paragraph>
      <Paragraph position="1"> The estimation of a4 is obtained using the Maximum Likelihood Estimator for the geometric distribution. Leta75  be the number of co-occurrences with distance a45 , and a76 be the sample size:</Paragraph>
      <Paragraph position="3"> We make the assumption that multiple occurrences of a7 do not increase the chances of seeing a24 and vice-versa. This assumption implies a different estimation procedure, since we explicitly discard what Befeerman et al. and Niesler call self-triggers (Beeferman et al., 1997; Niesler and Woodland, 1997). We consider only those pairs in which the terms are adjacent, with no intervening occurrences of a7 or a24 , although other terms may appear between them Figure 1 shows that the geometric distribution fits well the observed distance of independent words DEMOCRACY and WATERMELON. When a dependency exists, the geometric model does not fit the data well, as can be seen in Figure 2. Since the geometric and exponential distributions represent related idea in discrete/continuous spaces it is expected that both have similar results, especially whena4a77a84a85a65 .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Affinity Model
</SectionTitle>
      <Paragraph position="0"> The model of affinity follows a exponential-like distribution, as in the independence model. Other researchers also used exponential models for affin- null ity (Beeferman et al., 1997; Niesler and Woodland, 1997). We use the gamma distribution, the generalized version of the exponential distribution to fit the observed data. Pairs of terms have a skewed distribution, especially when they have affinity for one another, and the gamma distribution is a good choice to model this phenomenon.</Paragraph>
      <Paragraph position="2"> (2) where a98 a5a8a91a29a12 is the complete gamma function. The exponential distribution is a special case with a91a99a58 a65 . Given a set of co-occurrence pairs, estimates for a91 anda93 can be calculated using the Maximum Likelihood Estimators given by:  to the word pair FRUITS and WATERMELON (a91a113a58 a114a44a115a117a116a10a116a10a118a10a118a23a119a62a120 ).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Computing the Empirical Distribution
</SectionTitle>
    <Paragraph position="0"> The independence and affinity models depend on a good approximation to a78 . We try to reduce the bias of the estimator by using a large corpus. Therefore, we want to scan the whole corpus efficiently in order to make this framework usable.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Corpus
</SectionTitle>
      <Paragraph position="0"> The corpus used in our experiments comprises a terabyte of Web data crawled from the general web in 2001 (Clarke et al., 2002; Terra and Clarke, 2003). The crawl was conducted using a breadth-first search from a initial seed set of URLs representing the home page of 2392 universities and other educational organizations. Pages with duplicate content were eliminated. Overall, the collection contains 53 billion words and 77 million documents. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Computing Affinity
</SectionTitle>
      <Paragraph position="0"> Given two terms, a7 and a24 , we wish to determine the affinity between them by efficiently examining all the locations in a large corpus where they cooccur. We treat the corpus as a sequence of terms</Paragraph>
      <Paragraph position="2"> a18a70a122a70a124 where a125 is the size of the corpus. This sequence is generated by concatenating together all the documents in the collection. Document boundaries are then ignored.</Paragraph>
      <Paragraph position="3"> While we are primarily interested in within-document term affinity, ignoring the boundaries simplifies both the algorithm and the model. Document information need not be maintained and manipulated by the algorithm, and document length normalization need not be considered. The order of the documents within the sequence is not of major importance. If the order is random, then our independence assumption holds when a document boundary is crossed and only the within-document affinity can be measured. If the order is determined by other factors, for example if Web pages from a single site are grouped together in the sequence, then affinity can be measured across these groups of pages.</Paragraph>
      <Paragraph position="4"> We are specifically interested in identifying all the locations where a7 and a24 co-occur. Consider a particular occurrence of a7 at position a126 in the sequence (a122a70a127a128a58a129a7 ). Assume that the next occurrence of a7 in the sequence is a122a70a130 and that the next occurrence of a24 is a122a66a131 (ignoring for now the exceptional case where a122a70a127 is close to the end of the sequence and is not followed by another a7 and a24 ). If a132a134a133a136a135 , then no a7 or a24 occurs between a122a25a127 and a122a131 , and the interval can be counted for this pair. Otherwise, if a132a138a137a139a135 let a122a34a140 be the last occurrence of a7 before a122a70a131 . No a7 or a24 occurs between a122a140 and a122a131 , and once again the interval containing the terms can be considered.</Paragraph>
      <Paragraph position="5"> Our algorithm efficiently computes all locations in a large term sequence where a7 and a24 co-occur with no intervening occurrences of either a7 or a24 .</Paragraph>
      <Paragraph position="6"> Two versions of the algorithm are given, an asymmetric version that treats terms in a specific order, and a symmetric version that allows either term to appear before the other.</Paragraph>
      <Paragraph position="7"> The algorithm depends on two access functions  Informally, the access function a141 a5a143a122a172a18a20a126a64a12 returns the position of the first occurrence of the term a122 located at or after position a126 in the term sequence. If there is no occurrence of a122 at or after position a126 , then a141 a5a143a122a95a18a20a126a64a12 returns a125a177a165a178a65 . Similarly, the access function a142a70a5a143a122a95a18a20a126a64a12 returns the position of the last occurrence of the terma122 located at or before position a126 in the term sequence. If there is no occurrence of a122 at or before position a126 , then a142a70a5a143a122a95a18a20a126a64a12 returns a114 . These access functions may be efficiently implemented using variants of the standard inverted list data structure. A very simple approach, suitable for a small corpus, stores all index information in memory. For a terma122 , a binary search over a sorted list of the positions where a122 occurs computes the result of a call to a141 a5a143a122a172a18a20a126a64a12 or a142a70a5a143a122a172a18a20a126a64a12 in a179a15a5a17a104a107a106a10a108a109a75a10a180a181a12a151a156a113a179a15a5a17a104a107a106a10a108a59a125a101a12 time. Our own implementation uses a two-level index, split between memory and disk, and implements different strategies depending on the relative frequency of a term in the corpus, minimizing disk traffic and skipping portions of the index where no co-occurrence will be found. A cache and other data structures maintain information from call to call.</Paragraph>
      <Paragraph position="8"> The asymmetric version of the algorithm is given below. Each iteration of the while loop makes three calls to access functions to generate a co-occurrence pair a5a173 a18a70a135a44a12 , representing the interval in the corpus from a122a34a140 to a122a182a131 where a7 and a24 are the start and end of the interval. The first call (a132a184a183 a141 a5a8a7a38a18a20a126a64a12 ) finds the first occurrence of a7 after a126 , and the second (a135a162a183 a141 a5a17a24a64a18a70a132a177a165a185a65a48a12 ) finds the first occurrence of a24 after that, skipping any occurrences of a24 between a126 and a132 . The third call (a173 a183a186a142a70a5a8a7a38a18a70a135a15a67a99a65a48a12 ) essentially indexes &amp;quot;backwards&amp;quot; in the corpus to locate last occurrence of a7 before a135 , skipping occurrences of a7 between a132 and a173 . Since each iteration generates a co-occurrence pair, the time complexity of the algorithm depends on a187 , the number of such pairs, rather than than number of times a7 and a24 appear individually in the corpus. Including the time required by calls to access functions, the algorithm generates all co-occurrence pairs in a179a15a5a8a187a52a104a111a106a10a108a144a125a110a12 time.</Paragraph>
      <Paragraph position="10"> The symmetric version of the algorithm is given next. It generates all locations in the term sequence where a7 and a24 co-occur with no intervening occurrences of either a7 or a24 , regardless of order. Its operation is similar to that of the asymmetric version.</Paragraph>
      <Paragraph position="12"> To demonstrate the performance of the algorithm, we apply it to the 99 word pairs described in Section 4.2 on the corpus described in Section 3.1, distributed over a 17-node cluster-of-workstations.</Paragraph>
      <Paragraph position="13"> The terms in the corpus were indexed without stemming. Table 1 presents the time required to scan all co-occurrences of given pairs of terms. We report the time for all hosts to return their results.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>