<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2004">
  <Title>Measuring Semantic Relatedness Using People and WordNet</Title>
  <Section position="4" start_page="13" end_page="13" type="metho">
    <SectionTitle>
3 Relatedness Scores
</SectionTitle>
    <Paragraph position="0"> Our idea is to induce scores for pairs of anchored items with their anchors (henceforth, AApairs) using the cumulative annotations by 20 people.3 Thus, an AApair written by all 20 people scores 20, and that written by just one person scores 1. The scores would correspond to the perceived relatedness of the pair of concepts in the given text.</Paragraph>
    <Paragraph position="1"> In Beigman Klebanov and Shamir's (2006) core classification data, no distinctions are retained between pairs marked by 19 or 13 people. Now we are interested in the relative relatedness, so it is important to handle cases where the BS data might under-rate a pair. One such case are multi-word items; we remove AApairs with suspect multi-word elements.4 Further, we retain only pairs that belong to open-class parts of speech (henceforth, POS), as functional categories contribute little to the lexical texture (Halliday and Hasan, 1976). The Size column of table 1 shows the number of AApairs for each BS text, after the aforementioned exclusions.</Paragraph>
    <Paragraph position="2"> The induced scores correspond to cumulative judgements of a group of people. How well do they represent the people's ideas? One way to measure group homogeneity is leave-one-out estimation, as done by Resnik (1995) for MC data, attaining the high average correlation of r = 0.88. In the current case, however, every specific person made a binary decision, whereas a group is represented by scores 1 to 20; such difference in granularity is problematic for correlation or rank order analysis.</Paragraph>
    <Paragraph position="3"> Another way to measure group homogeneity is to split it into subgroups and compare scores emerging from the different subgroups. We know from Beigman Klebanov and Shamir's (2006) analysis that it is not the case that the 20-subject group clusters into subgroups that systematically produced different patterns of answers. This leads us to expect relative lack of sensitivity to the exact splits into subgroups. null To validate this reasoning, we performed 100 random choices of two 9-subject4 groups, calculated the scores induced by the two groups, and computed  Pearson correlation between the two lists. Thus, for every BS text, we have a distribution of 100 coefficients, which is approximately normal. Estimations of u and s of these distributions are u = .69 [?].82 (av. 0.75), s = .02[?].03 for the different BS texts.</Paragraph>
    <Paragraph position="4"> To summarize: although the homogeneity is lower than for MC data, we observe good average inter-group correlations with little deviation across the 100 splits. We now turn to discussion of a relatedness measure, which we will evaluate using the data.</Paragraph>
  </Section>
  <Section position="5" start_page="13" end_page="14" type="metho">
    <SectionTitle>
4 Gic: WordNet-based Measure
</SectionTitle>
    <Paragraph position="0"> Measures using WordNet taxonomy are state-of-the-art in capturing semantic similarity, attaining r=.85 -.89 correlations with the MC dataset (Jiang and Conrath, 1997; Budanitsky and Hirst, 2006).</Paragraph>
    <Paragraph position="1"> However, they fall short of measuring relatedness, as, operating within a single-POS taxonomy, they cannot meaningfully compare kill to death. This is a major limitation with respect to BS data, where only about 40% of pairs are nominal, and less than 10% are verbal. We develop a WordNet-based measure that would allow cross-POS comparisons, using glosses in addition to the taxonomy.</Paragraph>
    <Paragraph position="2"> One family of WordNet measures are methods based on estimation of information content (henceforth, IC) of concepts, as proposed in (Resnik, 1995). Resnik's key idea in corpus-based information content induction using a taxonomy is to count every appearance of a concept as mentions of all its hypernyms as well. This way, artifact#n#1, although rarely mentioned explicitly, receives high frequency and low IC value. We will count a concept's mention towards all its hypernyms AND all words5 that appear in its own and its hypernyms' glosses. Analogously to artifact, we expect properties mentioned in glosses of more general concepts to be less informative, as those pertain to more things (ex., visible, a property of anything that is-a physical object).</Paragraph>
    <Paragraph position="3"> The details of the algorithm for information content induction from taxonomy and gloss information (ICGT) are given in appendix A.</Paragraph>
    <Paragraph position="4"> To estimate the semantic affinity between two senses A and B, we average the ICGT values of the  with human ratings. r &gt; 0.16 is significant at p &lt; .05; r &gt; .23 is significant at p &lt; .01. Average correlation (AvBS) is r=.28 (Gic), r=.17 (BP).</Paragraph>
    <Paragraph position="5"> If A[?] (the word of which A is a sense) appears in the expanded gloss of B, we take the maximum between the ICGT(A[?]) and the value returned by the 3-smoothed calculation. To compare two words, we take the maximum value returned by pairwise comparisons of their WordNet senses.7 The performance of this measure is shown under Gic in table 1. Gic manages robust but weak correlations, never reaching the r = .40 threshold.</Paragraph>
  </Section>
class="xml-element"></Paper>