Selecting the Most Highly Correlated Pairs within a Large Vocabulary

3 Problem Definition

A corpus is provided as a set of sets of labels, together with a correlation function $sim(x, y)$ of a pair of labels that can be written in the form $f(df_a, df_b, df_c, df_d)$, where $df_a(x, y)$ is the number of documents containing both $x$ and $y$, $df_b(x, y)$ the number containing $x$ but not $y$, $df_c(x, y)$ the number containing $y$ but not $x$, and $df_d(x, y)$ the number containing neither. The task is to obtain the set of pairs of a given size whose members have $sim$ values at least as high as every pair outside the set, i.e., the most highly correlated pairs.

Examples of such functions $f(df_a, df_b, df_c, df_d)$ are the cosine measure, the Dice coefficient, pairwise mutual information, the phi coefficient, and the complementary similarity measure, which are the measures used later in this paper.

Implementation of a program that requires $O(V^2)$ memory space and $O(V^2 \cdot N)$ computation time is easy, where $V$ is the size of the vocabulary and $N$ the number of documents. A program of this type could calculate $df_a$, $df_b$, $df_c$, and $df_d$ for all pairs of $x$ and $y$, and could then provide the most highly correlated pairs. However, computation with this method is not feasible when $V$ is large. For example, in order to calculate the most highly correlated words within a newspaper over several years of publication, $V$ and $N$ each reach several hundred thousand or more, and the $V^2 \cdot N$ computation time becomes astronomically large.

4 Approach

In actual data, the number of correlated pairs is usually much smaller than the number of uncorrelated pairs. Moreover, most of the uncorrelated pairs satisfy the condition $df_a(x, y) = 0$ and are not of interest. This method takes this fact into account. It also uses the relationship between $\{N, df_a\}$ and $\{df, df_b, df_c, df_d\}$ to make the computation feasible:

1. By definition, $df(x) = df_a(x, x)$.
2. By definition, the sum of $df_a$, $df_b$, $df_c$, and $df_d$ always represents the total number of documents: $df_a(x, y) + df_b(x, y) + df_c(x, y) + df_d(x, y) = N$.
3. Similarly, the sum of $df_a(x, y)$ and $df_b(x, y)$ is the document frequency of $x$: $df_a(x, y) + df_b(x, y) = df(x)$.
4. Likewise, $df_a(x, y) + df_c(x, y) = df(y)$.
5. These four equations make it possible to express $df$, $df_b$, $df_c$, and $df_d$ in terms of $df_a$ and $N$.

These formulas indicate that the number of required two-dimensional tables is not four, but just one. In other words, if we create a table of $df_a(x, y)$ and keep one variable for $N$, we can obtain $df(x)$, $df_b(x, y)$, $df_c(x, y)$, and $df_d(x, y)$.
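As a concrete illustration of step 5, the following minimal Python sketch (the function name and the dictionary representation are illustrative assumptions, not part of the original paper) recovers the full 2x2 contingency table of a pair from the $df_a$ table and $N$ alone.

    def contingency(dfa, x, y, n_docs):
        # dfa maps (x, y) -> number of documents containing both x and y;
        # missing keys stand for df_a(x, y) = 0.
        a = dfa.get((x, y), 0)
        df_x = dfa.get((x, x), 0)   # df(x) = df_a(x, x)
        df_y = dfa.get((y, y), 0)   # df(y) = df_a(y, y)
        b = df_x - a                # documents with x but not y
        c = df_y - a                # documents with y but not x
        d = n_docs - a - b - c      # documents with neither x nor y
        return a, b, c, d

Any of the correlation measures above can then be evaluated from the returned values in constant time per pair.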
6 The memory requirement for $df_a$

Let $L$ be the maximum number of different words/labels in one document. The following property holds:

$\sum_{x} \sum_{y} df_a(x, y) \leq L^2 \cdot N$

The left side of the formula equals the total number of pairs of words/labels over all documents, which cannot exceed $L^2 \cdot N$.

This relationship indicates that if the table is stored as tuples $(x, y, df_a(x, y))$ only for pairs with $df_a(x, y) > 0$, the required memory space is $O(N)$. Tuples where $df_a(x, y) = 0$ are not necessary, because we know that $df_a(x, y) = 0$ whenever the tuple for $(x, y, df_a(x, y))$ does not exist in memory. This estimate is pessimistic: the actual number of tuples will be smaller than $L^2 \cdot N$, since not all documents contain $L$ different words/labels.

7 Obtaining $df_a$ and $N$

The algorithm to obtain $df_a(x, y)$ and $N$ is straightforward. First, the corpus must be transformed into a set of sets of words/labels. Since each document is in set form, there are no duplicated words/labels within one document. In the following program, the hashtable returns 0 for a non-existent item.

    Let DFA be an empty hashtable
    Let DF be an empty hashtable
    Let N be 0
    For each document, assign it to D
    |  N = N + 1
    |  For each word in D, assign the word to X
    |  |  For each word in D, assign the word to Y
    |  |  |  DFA(X, Y) = DFA(X, Y) + 1
    |  |  end of loop
    |  end of loop
    end of loop

The computation time for this program is less than $L^2 \cdot N$ steps. Since $L$ does not grow with $N$, the computation time is $O(N)$. Again, $L^2 \cdot N$ is a pessimistic estimate, since not all documents contain $L$ different words/labels.
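A minimal runnable Python version of this loop is sketched below, assuming the corpus is an iterable of word sequences; `defaultdict(int)` plays the role of the hashtable that returns 0 for non-existent items, and the function name is illustrative.

    from collections import defaultdict

    def count_dfa(documents):
        dfa = defaultdict(int)      # (x, y) -> number of documents containing both
        n = 0
        for doc in documents:
            n += 1
            words = set(doc)        # drop duplicate words within one document
            for x in words:
                for y in words:
                    dfa[(x, y)] += 1
        return dfa, n

For each document the inner loops touch at most $L^2$ pairs, which is the source of the $L^2 \cdot N$ bound above.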
8 Selecting Pairs

Even though $df_a$, $df_b$, $df_c$, and $df_d$ can be obtained in constant time after $O(N)$ preprocessing, there are $V^2$ values to consider in order to obtain the most highly correlated pairs. Fortunately, many of the functions that are usable as indicators of correlation, and at least all five functions used here, return a value no greater than a known constant whenever $df_a(x, y) = 0$.

The cosine measure, the Dice coefficient, and pairwise mutual information have Property 1 and Property 2 defined below. This implies that the value for a pair $(x, y)$ with $df_a(x, y) = 0$ is actually the minimum value over all pairs. Therefore, the first part of the totally ordered sequence of pairs is the sorted list of pairs with $df_a(x, y) > 0$; the rest is an arbitrary ordering of the pairs with $df_a(x, y) = 0$.

Property 1: the value is never negative.
Property 2: when $df_a(x, y) = 0$, the value is 0.

The phi coefficient and the complementary similarity measure have the following Properties 1, 2, and 3. Therefore, the first part of the totally ordered sequence where the value is positive is equal to the first part of the sorted list of pairs with $df_a(x, y) > 0$ whose value is positive. Moreover, this list contains all pairs that have a positive correlation, and it is long enough for the actual application.

... at the same time, the estimated value is positive.

It should be recalled that the number of pairs where $df_a(x, y) > 0$ is less than $L^2 \cdot N$. The sorted list is therefore obtained in $O(L^2 N \log(L^2 N))$ computation time, where $L$ is the maximum number of different words/labels in one document. Since $L$ is constant, this becomes $O(N \log N)$, even if the size of the vocabulary is very large.

It is true that, for a given fixed vocabulary of size $V$, $L^2 \cdot N$ might become larger than $V^2$ as the size of the corpus increases. Fortunately, the actual memory consumption of this procedure is also bounded above by $O(V^2)$, so no memory space is lost. When $V$ is not fixed and may become very large compared to $N$, as is the case for proper nouns, $L^2 \cdot N$ is smaller than $V^2$.
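As an illustration of this selection step, the sketch below (illustrative names; it uses the Dice coefficient $2a/(2a+b+c)$, one of the measures named above) scores only the stored pairs, relying on the fact that every pair with $df_a(x, y) = 0$ scores 0 under a measure with Properties 1 and 2 and can therefore be ignored.

    def top_pairs(dfa, k):
        # dfa maps (x, y) -> df_a(x, y); only pairs with df_a > 0 are stored.
        scored = []
        for (x, y), a in dfa.items():
            if x >= y:                  # the table is symmetric; keep one orientation
                continue
            b = dfa[(x, x)] - a         # df(x) - df_a(x, y)
            c = dfa[(y, y)] - a         # df(y) - df_a(x, y)
            dice = 2.0 * a / (2.0 * a + b + c)
            scored.append((dice, x, y))
        scored.sort(reverse=True)
        return scored[:k]               # the k most highly correlated pairs

Sorting at most $L^2 \cdot N$ stored pairs gives the $O(N \log N)$ behaviour discussed above.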
The behaviour of the method was examined in relationship to the size of the input data. When we analyzed labels of place names in a newspaper over the course of one year, the corpus consisted of about 60,000 documents. The place names totalled 1,902 after morphological analysis. The maximum number of names in one document was 142, and the average per document was 4.02. In this case, the method described here was much more efficient than the baseline system.

Table 1 shows the actual execution time of the program in the appendix as the length of the corpus is varied. This program computes similarity values for all pairs of words where $df_a > 0$. It indicates that the execution time is linear.

Our observation shows that even if the corpus were extended year by year, $L$, the maximum number of different words in one document, remains stable, even though the total number of words would keep increasing with the ongoing addition of proper nouns and new concepts.

10 For a large corpus

Although the program in the appendix cannot be applied to a corpus larger than memory, the table of $df_a$ can be obtained using only sequential access to a file. The program in the appendix stores every pair in memory, and the space requirement of $L^2 \cdot N$ may seem too large to hold there. However, a sequential file can be used to obtain the $df_a$ table, as follows. Although the computation time for $df_a$ then becomes $O(N \log N)$ rather than $O(N)$, the total computation time remains the same, because $O(N \log N)$ computation is required to select pairs in either case.

Consider data in which each line corresponds to one document. When the pairs of words in each document are written out, a file with one word pair per line is obtained. Note that since $df_a(x, y) = df_a(y, x)$, it is not necessary to record pairs where $x > y$; this reduces the space requirement. Using the merge sort algorithm, which can sort a large file using sequential access only, this file can be sorted in $O(N \log N)$ computation time. After sorting in alphabetical order, identical pairs come together, so the pairs can be counted with sequential access, thereby providing the $df_a$ table, one pair and its count per line.

It should be noted that the $df$ table can be obtained easily by extracting the lines in which the word in the first column and the word in the second column are the same, since $df(x) = df_a(x, x)$. The $df$ table can usually be stored in memory, since it is a one-dimensional array.

After storing $df$ in memory, the similarity can be computed line by line. For example, with the phi coefficient, each output line holds the coefficient followed by $df_a$, $df_b$, $df_c$, $df_d$, $x$, and $y$. Since the phi coefficient is symmetric, the value for $(x, y)$ with $x > y$ is not required; when the function is not symmetric, both orientations of each pair must be computed. The ordered list can then be obtained by sorting this table by the first column. Pairs where $df_a(x, y) = 0$ add no overhead to either memory or computation time.
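The following sketch puts the file-based pipeline together, under the assumption that pairs are tab-separated with the smaller word first; the file name, the helper names, and the use of Python's built-in sort in place of an on-disk merge sort are all illustrative simplifications.

    import math
    from itertools import groupby

    def write_pairs(documents, path="pairs.txt"):
        # One line "x<TAB>y" per co-occurring pair per document, with x <= y;
        # the diagonal pairs (x, x) are kept so that df(x) can be recovered.
        with open(path, "w") as f:
            for doc in documents:
                words = sorted(set(doc))
                for i, x in enumerate(words):
                    for y in words[i:]:
                        f.write(f"{x}\t{y}\n")

    def phi_lines(path, n_docs):
        # Sort the pair file (a true external merge sort would be used when the
        # file exceeds memory), count identical lines to obtain df_a, then emit
        # phi, df_a, df_b, df_c, df_d, x, y for every pair with df_a(x, y) > 0.
        with open(path) as f:
            lines = sorted(f)
        counts = [(key.split(), sum(1 for _ in grp)) for key, grp in groupby(lines)]
        df = {x: a for (x, y), a in counts if x == y}       # df(x) = df_a(x, x)
        for (x, y), a in counts:
            if x == y:
                continue
            b, c = df[x] - a, df[y] - a
            d = n_docs - a - b - c
            denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
            phi = (a * d - b * c) / denom if denom else 0.0
            yield phi, a, b, c, d, x, y

Sorting the emitted lines by their first field then yields the ordered list described above.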
11 Comparison with Apriori

There is a well-known algorithm for forming a list of related items, termed Apriori (Agrawal and Srikant, 1995). Apriori lists all relationships using confidence, for pairs where $df_a(x, y)$ is larger than a specified value. With Apriori, this threshold on $df_a$ can be raised in order to reduce computation, whereas the proposed method has no such threshold to adjust. This implies that Apriori may be faster than our algorithm when confidence is the measure. However, since Apriori exploits a property of confidence to reduce computation, it cannot be used with other functions, unlike the proposed method, which can employ many standard functions, at least the five measures used here, including confidence.

12 Correlation of All Substrings

When computing correlations of all substrings in a corpus, $V$ can be as large as $N \cdot (N + 1)/2$. Since the memory space requirement and the computation time of the proposed method do not depend on $V$, it can be used to generate a list of the most highly correlated substrings of any length. In some cases, however, $L$ may be too large for the computation to be practical.

The Yamamoto-Church method (Yamamoto and Church, 2001) allows the creation of a $df(x)$ table using $O(N)$ memory space and $O(N \log N)$ computation time, where $x$ ranges over all substrings in a given corpus. Yamamoto's method shows that although there may be up to $N \cdot (N + 1)/2$ substrings, they fall into a much smaller number of patterns (sets of substrings that have the same occurrence pattern). The computational cost is greatly reduced if we deal with each pattern instead of each substring. Although the order of computational complexity does not depend on $V$, $L$ differs according to whether patterns are used or not. We have also developed a system using these patterns which actually reduces the cost of computation. Although $L$ is still problematic even with the Yamamoto-Church method, and although the computation cost is much larger than when using words, the program runs much faster than the simple method.