<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1127"> <Title>Novel Association Measures Using Web Search with Double Checking</Title> <Section position="5" start_page="1009" end_page="1009" type="metho"> <SectionTitle> CODC (Co-Occurrence Double-Check), which </SectionTitle> <Paragraph position="0"> measures the association in the interval [0,1]. In the extreme cases, when f(Y@X)=0 or f(X@Y)=0, CODC(X,Y)=0; and when f(Y@X)=f(X) and f(X@Y)=f(Y), CODC(X,Y)=1. In the first case, X and Y have no association; in the second case, X and Y have the strongest association.</Paragraph> </Section> <Section position="6" start_page="1009" end_page="1011" type="metho"> <SectionTitle> 3 Association of Common Words </SectionTitle> <Paragraph position="0"> We employ Rubenstein and Goodenough's (1965) benchmark data set to compare the performance of various association measures. The data set consists of 65 word pairs. The similarities between words, called Rubenstein and Goodenough ratings (RG ratings), were rated on a scale of 0.0 (&quot;semantically unrelated&quot;) to 4.0 (&quot;highly synonymous&quot;) by 51 human subjects. The Pearson product-moment correlation coefficient r_xy between the RG ratings X and the association scores Y computed by a model measures the performance of that model:</Paragraph> <Paragraph position="1"> r_xy = sum_{i=1..n} (x_i - mean(X)) (y_i - mean(Y)) / ((n - 1) s_x s_y), where mean(X) and mean(Y) are the sample means, s_x and s_y are the sample standard deviations of X and Y, and n is the total number of samples.</Paragraph> <Paragraph position="5"> Most approaches (Resnik, 1995; Lin, 1998; Li et al., 2003) used only 28 of the word pairs. Resnik (1995) obtained information content from WordNet and achieved a correlation coefficient of 0.745. Lin (1998) proposed an information-theoretic similarity measure and achieved a correlation coefficient of 0.8224. Li et al.
(2003) combined semantic density, path length, and depth effect from WordNet and achieved a correlation coefficient of 0.8914.</Paragraph> <Paragraph position="6"> In our experiments on the benchmark data set, we used information from the Web rather than WordNet. Table 1 summarizes the correlation coefficients between the RG ratings and the association scores computed by our WSDC model. We consider numbers of snippets from 100 to 900. The results show that CODC > VariantCosine > VariantJaccard > VariantOverlap > VariantDice. The CODC measure achieves the best performance, 0.8492, when α = 0.15 and 600 snippets in total are analyzed. Matsuo et al. (2004) used the Jaccard coefficient to calculate the similarity between personal names using the Web. The coefficient is defined as follows:</Paragraph> <Paragraph position="8"> Jaccard(X, Y) = f(X ∧ Y) / f(X ∨ Y), where f(X ∧ Y) is the number of pages including X's and Y's homepages when the query &quot;X and Y&quot; is submitted to a search engine, and f(X ∨ Y) is the number of pages including X's or Y's homepages when the query &quot;X or Y&quot; is submitted to a search engine. We revised this formula as Jaccard_s(X, Y) = f_s(X ∧ Y) / f_s(X ∨ Y) and evaluated it on the same benchmark, where f_s(X ∧ Y) is the number of snippets in which X and Y co-occur in the top N snippets of the query &quot;X and Y&quot;, and f_s(X ∨ Y) is the number of snippets containing X or Y in the top N snippets of the query &quot;X or Y&quot;.</Paragraph> <Paragraph position="10"> The last row of Table 1 shows that Jaccard_s is worse than the other models when the number of snippets is larger than 100. Table 2 lists the results of previous research (Resnik, 1995; Lin, 1998; Li et al., 2003) and of our WSDC models using the VariantCosine and CODC measures. The 28 word pairs used in the experiments are shown. The CODC measure can compete with Li et al. (2003). The word pair &quot;car-journey&quot;, whose similarity value is 0 in those papers (Resnik, 1995; Lin, 1998; Li et al., 2003), is captured by our model.
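To make the measure concrete, here is a small Python sketch of CODC. The paper's exact functional form is not reproduced in this extract, so the body below is one plausible shape that satisfies the stated boundary cases (0 when either double-check frequency is 0, and 1 when f(Y@X) = f(X) and f(X@Y) = f(Y)); the smoothing exponent alpha = 0.15 follows the text, while the helper name codc is ours.

```python
import math

def codc(f_y_at_x, f_x, f_x_at_y, f_y, alpha=0.15):
    """Co-Occurrence Double-Check score in [0, 1] (assumed form).

    f_y_at_x: frequency of Y in the top snippets returned for query X;
    f_x_at_y: frequency of X in the top snippets returned for query Y.
    Boundary cases follow the text: 0 if either double check fails,
    1 when both ratios reach their maxima.
    """
    if f_y_at_x == 0 or f_x_at_y == 0:
        return 0.0
    # Product of the two double-check ratios; lies in (0, 1].
    p = (f_y_at_x / f_x) * (f_x_at_y / f_y)
    # Smoothed mapping of p into (0, 1]; p = 1 gives exactly 1.0.
    return math.exp(-((-math.log(p)) ** alpha))

print(codc(10, 10, 5, 5))  # strongest association: 1.0
print(codc(0, 10, 5, 5))   # failed double check: 0.0
```

The decay is deliberately gentle for alpha well below 1, so moderately associated pairs are not pushed toward zero.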
In contrast, our model cannot deal with the two word pairs &quot;crane-implement&quot; and &quot;bird-crane&quot;.</Paragraph> </Section> <Section position="7" start_page="1011" end_page="1013" type="metho"> <SectionTitle> 4 Association of Named Entities </SectionTitle> <Paragraph position="0"> Although the correlation coefficient of the WSDC model built on the Web is slightly worse than that of the models built on WordNet, the Web provides a live vocabulary that includes, in particular, named entities. We will demonstrate how to extend our WSDC method to mine the association of personal names, which would be difficult to resolve with previous approaches. We design two experiments, a link detection test and named entity clustering, to evaluate the association of named entities.</Paragraph> <Paragraph position="1"> Given a named-entity set L, we define a link detection test to check whether any two named entities NE_i and NE_j (i ≠ j) in L have a relationship R, using the following three strategies.</Paragraph> <Paragraph position="5"> Strategy of direct association: if f(NE_j@NE_i) > 0 and f(NE_i@NE_j) > 0, then the link detection test says &quot;yes&quot;, i.e., NE_i and NE_j have a direct association. Otherwise, the test says &quot;no&quot;. Figure 1(a) shows the direct association.</Paragraph> <Paragraph position="11"> Strategy of association matrix: let M be a binary matrix whose entry m_ij is 1 when NE_i and NE_j have a direct association, and let M^T be the transpose matrix of M. The matrix A = M × M^T is an association matrix, where a_ij counts the common named entities with which NE_i and NE_j both associate directly; NE_i and NE_j should associate with at least λ common named entities directly. The strategy of association matrix specifies: if a_ij ≥ λ, then the link detection test says &quot;yes&quot;, otherwise it says &quot;no&quot;. An example is shown in Figure 1(b).</Paragraph> <Paragraph position="17"> Strategy of scalar association matrix: s_ij denotes how many named entities NE_k there are with which NE_i and NE_j both associate indirectly. In the example of Figure 1(c), two named entities indirectly associate with NE_k at the same time.
We can define that NE_i and NE_j have an indirect association if s_ij is larger than a threshold δ. In other words, if s_ij > δ, then the link detection test says &quot;yes&quot;, otherwise it says &quot;no&quot;.</Paragraph> <Paragraph position="22"> To evaluate the performance of the above three strategies, we prepare a test set extracted from the dmoz web site (http://dmoz.org), the most comprehensive human-edited directory of the Web. The test data consists of three communities, shown in Table 3: actors, tennis players, and golfers. In total, 220 named entities are considered. The gold standard for the link detection test is built as follows: we compose 24,090 (= 220 × 219 / 2) named entity pairs, and assign &quot;yes&quot; to those pairs belonging to the same community.</Paragraph> <Paragraph position="23"> When collecting the related values for computing the double-check frequencies for a named entity pair (NE_i, NE_j), we consider the naming styles of persons. For example, &quot;Alba, Jessica&quot; has four possible writings: &quot;Alba, Jessica&quot;, &quot;Jessica Alba&quot;, &quot;J. Alba&quot;, and &quot;Alba, J.&quot; We get the top N snippets for each naming style, and filter out duplicate snippets as well as snippets whose URLs include dmoz.org or google.com. Table 4 lists the experimental results of link detection on the test set. The precisions of the two baselines are: guessing all &quot;yes&quot; (46.45%) and guessing all &quot;no&quot; (53.55%). All three strategies beat the two baselines, and performance improves as the number of snippets increases. The strategy of direct association shows that using double checks to measure the association of named entities works as well as it does for the association of common words. For the strategy of association matrix, the best performance, 90.14%, occurs with 900 snippets and λ = 6. When a larger number of snippets is used, a larger threshold is necessary to achieve better performance.
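Under our reading of the strategy just described, the association matrix can be sketched in a few lines of Python. The names M, A, and the threshold λ follow the text; the toy data below are ours.

```python
def association_matrix(m):
    """A = M × M^T for a symmetric 0/1 direct-association matrix M.

    a[i][j] counts the named entities directly associated with
    both NE_i and NE_j, i.e., their common neighbors.
    """
    n = len(m)
    return [[sum(m[i][k] * m[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def link_test(a, i, j, lam):
    """Say "yes" when NE_i and NE_j share at least lam common neighbors."""
    return a[i][j] >= lam

# Toy example: NE_0 and NE_1 are not directly linked,
# but both are directly linked to NE_2 and NE_3.
m = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
a = association_matrix(m)
print(a[0][1])                 # 2 common neighbors (NE_2 and NE_3)
print(link_test(a, 0, 1, 2))   # True
```

The quadratic-time construction is fine at this scale; the test set above has only 220 named entities.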
Figure 2(a) illustrates the relationship between precision and the threshold λ; performance decreases when λ > 6. The strategy of scalar association matrix outperforms the strategy of association matrix for some λ and δ. Figure 2(b) shows the relationship between precision and the threshold δ for several numbers of snippets and values of λ.</Paragraph> <Paragraph position="26"> In the link detection test, we only consider the binary outcome of double checks, i.e., whether f(NE_j@NE_i) > 0 and f(NE_i@NE_j) > 0.</Paragraph> <Paragraph position="28"> Next we employ the five formulas proposed in Section 2 to cluster named entities. The same data set as in the link detection test is adopted. An agglomerative average-link clustering algorithm is used to partition the given 220 named entities based on Formulas (1)-(5). Four-fold cross-validation is employed and the B-CUBED metric (Bagga and Baldwin, 1998) is adopted to evaluate the clustering results. Table 5 summarizes the experimental results. CODC (Formula 5), which behaves best in computing the association of common words, still achieves the best performance across different numbers of snippets in named entity clustering. The F-scores of the other formulas exceed 95% when more snippets are used to compute the double-check frequencies.</Paragraph> </Section> <Section position="8" start_page="1013" end_page="1014" type="metho"> <SectionTitle> 5 Disambiguation Using Association of Named Entities </SectionTitle> <Paragraph position="0"> This section demonstrates how to employ association mined from the Web to resolve the ambiguities of named entities. Assume there are n named entities, NE_1, ..., NE_n, to be disambiguated. A named entity NE_j has m accompanying names, called cue names later, CN_j1, ..., CN_jm.</Paragraph> <Paragraph position="2"> We have two alternatives for using the cue names: one is to use them directly; the other is to treat each cue name as an initial seed.
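The seed-based community expansion, whose steps are detailed next, can be written as a simple breadth-first loop. This is a minimal sketch under stated assumptions: fetch_snippets, extract_patterns, and codc_score are hypothetical stand-ins for the search-engine query, the suffix-tree pattern extraction (Lin and Chen, 2006), and the CODC validation, and the threshold theta and the size limits are illustrative.

```python
def expand_community(seed, fetch_snippets, extract_patterns, codc_score,
                     theta=0.5, max_nodes=50, max_layers=3):
    """Grow a community of associated names from one cue-name seed.

    (1) Collection: fetch top-N snippets for each current seed and
        extract candidate patterns B_i.
    (2) Validation: keep B_i as a new seed when codc_score(seed, B_i)
        exceeds the threshold theta.
    The loop stops when enough nodes have been collected or the
    maximum number of layers has been reached.
    """
    community = {seed}
    frontier = [seed]
    for _ in range(max_layers):
        next_frontier = []
        for s in frontier:
            for cand in extract_patterns(fetch_snippets(s)):
                if cand not in community and codc_score(s, cand) > theta:
                    community.add(cand)
                    next_frontier.append(cand)
                    if len(community) >= max_nodes:
                        return community
        frontier = next_frontier
    return community
```

Each validated pattern becomes a new seed, so the community grows layer by layer until one of the two stopping conditions fires.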
Figure 3 sketches the concept of community expansion.</Paragraph> <Paragraph position="3"> (1) Collection: We submit a seed to Google and select the top N returned snippets. Then, we use suffix trees to extract possible patterns (Lin and Chen, 2006).</Paragraph> <Paragraph position="4"> (2) Validation: We calculate the CODC score of each extracted pattern (denoted B_i). If the association of B_i is strong enough, i.e., larger than a threshold θ, we employ B_i as a new seed and repeat steps (1) and (2). This procedure stops when either the expected number of nodes has been collected or the maximum number of layers has been reached. (3) Union: The communities initiated by the individual seeds are merged.</Paragraph> <Section position="1" start_page="1014" end_page="1014" type="sub_section"> <SectionTitle> (&quot;Chien-Ming Wang&quot;) </SectionTitle> <Paragraph position="0"> In a cascaded personal name disambiguation system (Wei, 2006), the association of named entities is used together with other cues such as titles, common terms, and so on. Assume k clusters, c_1, ..., c_k, are generated; a cluster that cannot be resolved is left undecided.</Paragraph> <Paragraph position="1"> To evaluate personal name disambiguation, we prepare three corpora for the ambiguous name &quot;Wang Jian Min&quot; (Chien-Ming Wang) from the United Daily News Knowledge Base (UDN), Google Taiwan (TW), and Google China (CN). Table 6 summarizes the statistics of the test data sets. In the UDN news data set, 37 different persons are mentioned; of these, 13 occur more than once. The most famous person is a pitcher of the New York Yankees, who accounts for 94.29% of the 2,205 documents. In the TW and CN web data sets, there are 24 and 107 different persons, respectively. The majority in the TW data set is still the New York Yankees' &quot;Chien-Ming Wang&quot;; he appears in 331 web pages, accounting for 88.03%. Comparatively, the majority in the CN data set is a research fellow of the Chinese Academy of Social Sciences, and he only accounts for 18.29% of the 421 web pages.
In total, 36 different &quot;Chien-Ming Wang&quot;s occur more than once. Thus, CN is an unbiased corpus.</Paragraph> </Section> </Section> <Section position="9" start_page="1014" end_page="1015" type="metho"> <SectionTitle> UDN TW CN </SectionTitle> <Paragraph position="0"> Table 7 shows the performance of a personal name disambiguation system without (M1) and with (M2) community expansion. In the news data set (UDN), M1 is slightly better than M2: compared to M1, M2's F-score decreases by 0.98%. In contrast, in the two web data sets (TW and CN), M2 is much better than M1, with F-score increases of 9.65% and 14.22%, respectively. This shows that mining the association of named entities from the Web is very useful for disambiguating ambiguous names. The application also indirectly confirms the effectiveness of the proposed association measures.</Paragraph> </Section> </Paper>