
<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0604">
  <Title>New Experiments in Distributional Representations of Synonymy</Title>
  <Section position="6" start_page="28" end_page="30" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We experimented with various distance measures and context policies using the full North American News corpus. We count approximately one billion words in this corpus, which is roughly four times the size of the largest corpus considered by Ehlert.</Paragraph>
    <Paragraph position="1"> Except where noted, the numbers reported here are the result of taking the full WBST, a total of 23,570 test questions. Given this number of questions, scores where most of the results fall are accurate to within plus or minus 0.6% at the 95% confidence level.</Paragraph>
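The quoted margin of error follows from the normal approximation to the binomial. A quick check, assuming a score near the middle of the observed range (e.g. 65%):

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval
    for a binomial proportion p estimated from n trials."""
    return z * math.sqrt(p * (1.0 - p) / n)

# With 23,570 questions and scores around 65%, the 95% margin
# works out to roughly 0.6%, as stated in the text.
print(round(ci_half_width(0.65, 23570) * 100, 2))
```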
    <Section position="1" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.1 Performance Bounds
</SectionTitle>
      <Paragraph position="0"> In order to provide a point of comparison, the paper's authors each answered the same random sample of 100 questions from each part of speech. Average performance over this sample was 88.4%. The one non-native speaker scored 80.3%. As will be seen, even this lower score is better than the best automated result.</Paragraph>
      <Paragraph position="1"> The expected score, in the absence of any semantic information, is 25%. However, as noted, target and answer words are more polysemous than decoy words on average, and this can be exploited to establish a higher baseline. Since the frequency of a word is correlated with its polysemy, a strategy which always selects the most frequent word among the response words yields 39.2%, 34.5%, 29.1%, and 38.0% on nouns, verbs, adjectives, and adverbs, respectively, for an average score of 35.2%.</Paragraph>
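The frequency baseline described above can be sketched as follows; the frequency table here is a toy stand-in, not corpus data:

```python
def frequency_baseline(choices, freq):
    """Answer a multiple-choice synonym question by always picking
    the most frequent response word (freq: word -> corpus count).
    Exploits the correlation between frequency and polysemy."""
    return max(choices, key=lambda w: freq.get(w, 0))

# Toy illustration: the most frequent (and typically most
# polysemous) choice is selected.
freq = {"run": 120000, "sprint": 3000, "abscond": 150, "trot": 900}
print(frequency_baseline(["sprint", "run", "abscond", "trot"], freq))  # run
```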
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
4.2 An Initial Comparison
</SectionTitle>
      <Paragraph position="0"> Table 2 displays a basic comparison of the distance measures and context definitions enumerated so far.</Paragraph>
      <Paragraph position="1"> For each distance measure (Manhattan, Euclidean, Cosine, Hellinger, and Ehlert), results are shown for window sizes 1 to 4 (columns). Results are further sub-divided according to whether strict direction and distance are false (None), only strict direction is true (Dir), or both strict direction and strict distance are true (Dir+Dist). In bold is the best score, along with any scores indistinguishable from it at the 95% confidence level.</Paragraph>
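The three context policies can be read as different feature extractors over a fixed window. The sketch below is one plausible interpretation, not the authors' code:

```python
def contexts(tokens, i, window, strict_dir=False, strict_dist=False):
    """Context features for tokens[i] under a window-based policy.
    - None policy: bag of nearby words
    - Dir: each word tagged with its direction (left/right)
    - Dir+Dist: each word tagged with its signed offset
    (strict_dist subsumes strict_dir, mirroring the Dir+Dist column.)"""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or j < 0 or j >= len(tokens):
            continue
        w = tokens[j]
        if strict_dist:
            feats.append((w, off))                       # direction and distance
        elif strict_dir:
            feats.append((w, "L" if off < 0 else "R"))   # direction only
        else:
            feats.append(w)                              # word only
    return feats

toks = "the quick brown fox jumps".split()
print(contexts(toks, 2, 1))                      # ['quick', 'fox']
print(contexts(toks, 2, 1, strict_dir=True))     # [('quick', 'L'), ('fox', 'R')]
```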
      <Paragraph position="2"> Notable in Table 2 are the somewhat depressed scores, compared with those reported for the TOEFL. Ehlert reports a best score on the TOEFL of 82%, whereas the best we are able to achieve on the WBST is 67.6%. Although there are differences in some of the experimental details (Ehlert employs a triangular window weighting and experiments with stemming), these probably do not account for the discrepancy. Rather, this appears to be a harder test than the TOEFL--despite the fact that all words involved are seen with high frequency.</Paragraph>
      <Paragraph position="3"> It is hard to escape the conclusion that, in pursuit of high scores, choice of distance measure is more critical than the specific definition of context. All scores returned by the Ehlert metric are significantly higher than any returned by other distance measures.</Paragraph>
      <Paragraph position="4"> Among the Ehlert scores, there is surprising lack of sensitivity to context policy, given a window of size 2 or larger.</Paragraph>
      <Paragraph position="5"> Although the Hellinger distance yields scores only in the middle of the pack, it might be that other divergences from the α-divergence family, such as the KL-divergence, would yield better scores. We experimented with various settings of α in Equation 1. In all cases, we observed bell-shaped curves with peaks approximately at α = 0.5 and locally worst performance with values at or near 0 or 1. This held true whether we used maximum likelihood estimates or a simple smoothing regime in which all cells of the co-occurrence matrix were initialized with various fixed values. It is possible that numerical issues are nevertheless partly responsible for the poor showing of the KL-divergence. However, given the symmetry of the synonymy relation, it would be surprising if some value of α far from 0.5 were ultimately shown to be best.</Paragraph>
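Equation 1 is not reproduced in this section; the sketch below assumes the standard α-divergence family, which interpolates between Hellinger (α = 0.5) and KL (α → 1), and is a reconstruction rather than the paper's exact formula:

```python
import math

def alpha_divergence(p, q, alpha):
    """Standard alpha-divergence between probability vectors p and q.
    At alpha = 0.5 this is (up to scaling) the squared Hellinger
    distance; the KL-divergence is recovered in the limit alpha -> 1."""
    assert 0 < alpha < 1
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (1 - s) / (alpha * (1 - alpha))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# At alpha = 0.5 the divergence equals twice the squared Hellinger distance.
hell_sq = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
print(abs(alpha_divergence(p, q, 0.5) - 2 * hell_sq) < 1e-12)
```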
    </Section>
    <Section position="3" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.3 The Importance of Weighting
</SectionTitle>
      <Paragraph position="0"> The Ehlert measure and the cosine are closely related--both involve an inner product between vectors--yet they return very different scores in Table 2. There are two differences between these methods, normalization and vector element weighting.</Paragraph>
      <Paragraph position="1"> We presume that normalization does not account for the large score difference, and attribute the discrepancy, and the general strength of the Ehlert measure, to importance weighting.</Paragraph>
      <Paragraph position="2"> In information retrieval, it is common to take the cosine between vectors whose elements are not raw frequency counts, but counts weighted using some version of the &amp;quot;inverse document frequency&amp;quot; (IDF). We ran the cosine experiment again, this time weighting the count of context c by log(N/n_c), where N is the number of rows in the count matrix and n_c is the number of rows containing a non-zero count for context c. The results confirmed our expectation. The performance of &amp;quot;CosineIDF&amp;quot; for a window size of 3 with strict direction was 64.0%, which is better than Hellinger but worse than the Ehlert measure. This was the best result returned for &amp;quot;CosineIDF.&amp;quot;</Paragraph>
    </Section>
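The IDF-weighted cosine of Section 4.3 can be sketched as follows. The helper names and the use of the natural log are assumptions; the paper does not specify the log base:

```python
import math

def idf_weights(count_matrix):
    """IDF weight per context column: log(N / n_c), where N is the
    number of rows in the count matrix and n_c is the number of rows
    with a non-zero count for context c."""
    n_rows = len(count_matrix)
    n_cols = len(count_matrix[0])
    weights = []
    for c in range(n_cols):
        n_c = sum(1 for row in count_matrix if row[c] > 0)
        weights.append(math.log(n_rows / n_c) if n_c else 0.0)
    return weights

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cosine_idf(counts, i, j):
    """Cosine between rows i and j after IDF-reweighting each column."""
    w = idf_weights(counts)
    u = [c * wi for c, wi in zip(counts[i], w)]
    v = [c * wi for c, wi in zip(counts[j], w)]
    return cosine(u, v)

# Toy co-occurrence matrix: rows = words, columns = contexts.
M = [[3, 0, 1],
     [2, 0, 1],
     [0, 5, 0]]
print(round(cosine_idf(M, 0, 1), 3))
```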
    <Section position="4" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.4 Optimizing Distance Measures
</SectionTitle>
      <Paragraph position="0"> Both the Hellinger distance and the Ehlert measure are members of the family of measures defined by Equation 4. Although there are theoretical reasons to prefer each to neighboring members of the same family (see the discussion following Equation 1), we undertook to validate this preference empirically.</Paragraph>
      <Paragraph position="1"> We conducted parameter sweeps of β, α, and γ, first exploring members of the family α = γ, of which both Hellinger and Ehlert are members. Specifically, we explored the space between α = γ = 0.5 and α = γ = 1, first in increments of 0.1, then in increments of 0.01 around the approximate maximum, in all cases varying β widely.</Paragraph>
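The coarse-to-fine sweep described above can be sketched generically. The objective here is a stand-in peaked function, not the actual WBST score:

```python
def coarse_to_fine_sweep(score, lo, hi, coarse=0.1, fine=0.01):
    """Sweep a parameter in coarse increments over [lo, hi], then
    refine around the best coarse point in fine increments."""
    def argmax(points):
        return max(points, key=score)
    n = int(round((hi - lo) / coarse))
    best = argmax([lo + k * coarse for k in range(n + 1)])
    fine_lo, fine_hi = max(lo, best - coarse), min(hi, best + coarse)
    m = int(round((fine_hi - fine_lo) / fine))
    return argmax([fine_lo + k * fine for k in range(m + 1)])

# Stand-in objective peaked at 0.75, mimicking the bell-shaped
# curves reported in the text.
peak = lambda x: -(x - 0.75) ** 2
print(round(coarse_to_fine_sweep(peak, 0.5, 1.0), 2))
```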
      <Paragraph position="2"> This experiment clearly favored a region midway between the Hellinger and Ehlert measures. [Table 3: Comparison of the Ehlert measure and the &amp;quot;optimal&amp;quot; point in the space of measures defined by Equation 4 (α = γ = 0.75, β = 1.1), by part of speech. Context policy is window size 3 with strict direction.]</Paragraph>
      <Paragraph position="3"> We identified α = γ = 0.75, with β = 1.1, as the approximate midpoint of this optimal region. We next varied α and γ independently around this point. This resulted in no improvement to the score, confirming our expectation that some point along α = γ would be best. For the sake of brevity, we will refer to this measure as the &amp;quot;Optimal&amp;quot; measure.</Paragraph>
      <Paragraph position="4"> As Table 3 indicates, this measure is significantly better than the Ehlert measure, or any other measure investigated here.</Paragraph>
      <Paragraph position="5"> This clear separation between Ehlert and Optimal does not hold for the original TOEFL. Using the same context policy, we applied these measures to 298 of the 300 questions used by Ehlert (all questions except those involving multi-word terms, which our framework does not currently support).</Paragraph>
      <Paragraph position="6"> Optimal returns 84.2%, while Ehlert's measure returns 83.6%, which is slightly better than the 82% reported by Ehlert. The two results are not distinguishable with any statistical significance.</Paragraph>
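A rough check of why the two TOEFL results are statistically indistinguishable, using the normal approximation for a single proportion (a paired test such as McNemar's would be more appropriate, but the conclusion is the same):

```python
import math

def margin(p, n, z=1.96):
    """95% normal-approximation margin of error for a proportion
    p estimated from n questions."""
    return z * math.sqrt(p * (1 - p) / n)

n = 298
opt, ehl = 0.842, 0.836
# Each score alone carries a margin of roughly +/-4 points...
print(round(margin(opt, n) * 100, 1))
# ...which dwarfs the observed 0.6-point gap between the measures.
print(round((opt - ehl) * 100, 1))
```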
      <Paragraph position="7"> Interesting in Table 3 is the range of scores seen across parts of speech. The variation is even wider under other measures, the usual ordering among parts of speech being (from highest to lowest) adverb, adjective, noun, verb. In Section 4.6, we attempt to shed some light on both this ordering and the close outcome we observe on the TOEFL.</Paragraph>
    </Section>
    <Section position="5" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
4.5 Optimizing Context Policy
</SectionTitle>
      <Paragraph position="0"> It is certain that not every contextual token seen within the co-occurrence window is equally important to the detection of synonymy, and probable that some such tokens are useless or even detrimental.</Paragraph>
      <Paragraph position="1"> On the one hand, the many low-frequency events in the tails of the context distributions consume a lot of space, perhaps without contributing much information. On the other, very-high-frequency terms are typically closed-class and stop words, possibly too common to be useful in making semantic distinctions. We investigated excluding words at both ends of the frequency spectrum.</Paragraph>
      <Paragraph position="2"> We experimented with two kinds of exclusion policies: one excluding the a0 most frequent terms, for a0 ranging between 10 and 200; and one excluding terms occurring fewer than a0 times, for a0 ranging from 3 up to 100. Both Ehlert and Optimal were largely invariant across all settings; no statistically significant improvements or degradations were observed. Optimal returned scores ranging from 72.0%, when contexts with marginal frequency fewer than 100 were ignored, up to 72.6%, when the 200 most frequent terms were excluded.</Paragraph>
      <Paragraph position="3"> Note that there is a large qualitative difference between the two exclusion procedures. Whereas we exclude at most 200 words in the high-frequency experiment, the number of terms excluded in the low-frequency experiment ranges from 939,496 (minimum frequency 3) to 1,534,427 (minimum frequency 100), out of a vocabulary containing about 1.6 million terms. Thus, it is possible to reduce the expense of corpus analysis substantially without sacrificing semantic fidelity.</Paragraph>
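The two exclusion policies can be sketched together; the frequency table is a toy stand-in for corpus counts:

```python
from collections import Counter

def build_exclusion_set(freqs, top_k=0, min_count=0):
    """Vocabulary exclusion under the two policies described above:
    drop the top_k most frequent terms, and/or drop terms occurring
    fewer than min_count times."""
    by_freq = [w for w, _ in freqs.most_common()]
    excluded = set(by_freq[:top_k])
    excluded |= {w for w, c in freqs.items() if c < min_count}
    return excluded

freqs = Counter({"the": 1000, "of": 800, "fox": 40, "jumps": 5, "vorpal": 2})
print(sorted(build_exclusion_set(freqs, top_k=2)))      # ['of', 'the']
print(sorted(build_exclusion_set(freqs, min_count=3)))  # ['vorpal']
```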
    </Section>
    <Section position="6" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
4.6 Polysemy
</SectionTitle>
      <Paragraph position="0"> We hypothesized that the variation in scores across part of speech has to do with the average number of senses seen in a test set. Common verbs, for example, tend to be much more polysemous (and syntactically ambiguous) than common adverbs. WordNet allows us to test this hypothesis.</Paragraph>
      <Paragraph position="1"> We define the polysemy level of a question as the sum of the number of senses in WordNet of its target and answer words. Polysemy levels in our question set range from 2 up to 116. Calculating the average polysemy level for questions in the various parts of speech--5.1, 6.7, 7.5, and 10.4, for adverbs, adjectives, nouns, and verbs, respectively--provides support for our hypothesis, inasmuch as this ordering aligns with test scores. By contrast, the average polysemy level in the TOEFL, which spans all four parts of speech, is 4.6.</Paragraph>
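The polysemy-level computation is straightforward; the sense inventory below is a toy stand-in for WordNet (illustrative counts, not actual WordNet data):

```python
# Toy sense inventory mapping (word, part-of-speech) -> sense count,
# standing in for WordNet lookups.
SENSES = {
    ("run", "v"): 41, ("sprint", "v"): 2,
    ("quickly", "r"): 1, ("fast", "r"): 2,
}

def polysemy_level(target, answer, pos, senses=SENSES):
    """Polysemy level of a question: number of senses of its target
    word plus number of senses of its answer word."""
    return senses[(target, pos)] + senses[(answer, pos)]

# Verbs tend to score far higher than adverbs, matching the
# ordering reported in the text.
print(polysemy_level("run", "sprint", "v"))      # 43
print(polysemy_level("quickly", "fast", "r"))    # 3
```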
      <Paragraph position="2"> Plotting performance against polysemy level helps explain why Ehlert and Optimal return roughly equivalent performance on the original TOEFL. Figure 1 plots the Ehlert and Optimal measures as a function of the polysemy level of the questions. To produce this plot, we grouped questions according to polysemy level, creating many smaller tests, and scored each measure on each test separately.</Paragraph>
      <Paragraph position="3"> At low polysemy levels, the Ehlert and Optimal measures perform equally well. The advantage of Optimal over Ehlert appears to lie specifically in its relative strength in handling polysemous terms.</Paragraph>
    </Section>
  </Section>
</Paper>