<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1005">
  <Title>Distributional Similarity Models: Clustering vs. Nearest Neighbors</Title>
  <Section position="5" start_page="35" end_page="39" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
3.1 Methodology
</SectionTitle>
      <Paragraph position="0"> We compared the two similarity-based estimation techniques at the following decision task, which evaluates their ability to choose the more likely of two unseen cooccurrences.</Paragraph>
      <Paragraph position="1"> Test instances consist of noun-verb-verb triples (n, vl, v2), where both (n, Vl) and (n, v2) are unseen cooccurrences, but (n, vl) is more likely (how this is determined is discussed below). For each test instance, the language model probabilities 151 dej 15(vlln) and i52 dej 15(v2\]n) are computed; the result of the test is either correct (151 &gt; 152), incorrect (/51 &lt; ~52,) or a tie (151 = 152). Overall performance is measured by the error rate on the entire test set, defined as 1 ~(# of incorrect choices + (# of ties)/2), where T is the number of test triples, not counting multiplicities.</Paragraph>
      <Paragraph position="2"> Our global experimental design was to run ten-fold cross-validation experiments comparing distributional clustering, nearest-neighbors averaging, and Katz's backoff (the baseline) on the decision task just outlined. All results we report below are averages over the ten train-test splits.</Paragraph>
      <Paragraph position="3"> For each split, test triples were created from the held-out test set. Each model used the training set to calculate all basic quantities (e.g., p(vln ) for each verb and noun), but not to train k.</Paragraph>
      <Paragraph position="4"> Then, the performance of each similarity-based model was evaluated on the test triples for a sequence of settings for k.</Paragraph>
      <Paragraph position="5"> We expected that clustering performance with respect to the baseline would initially improve and then decline. That is, we conjectured that the model would overgeneralize at small k but overfit the training data at large k. In contrast, for nearest-neighbors averaging, we hypothesized monotonically decreasing performance curves: using only the very most similar words would yield high performance, whereas including more distant, uninformative words would result in lower accuracy. From previous experience, we believed that both methods would do well with respect to backoff.</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="37" type="sub_section">
      <SectionTitle>
3.2 Data
</SectionTitle>
      <Paragraph position="0"> In order to implement the experimental methodology just described, we employed the follow data preparation method: i. Gather verb-object pairs using the CASS partial parser (Abney, 1996) Partition set of pairs into ten folds .</Paragraph>
      <Paragraph position="1">  3. For each test fold, (a) discard seen pairs and duplicates (b) discard pairs with unseen nouns or unseen verbs (e) for each remaining (n, vl), create (n, vl, v2) such that (n, v~) is less likely  Step 3b is necessary because neither the similarity-based methods nor backoff handle novel unigrams gracefully.</Paragraph>
      <Paragraph position="2"> We instantiated this schema in three ways: AP89 We retrieved 1,577,582 verb-object pairs from 1989 Associated Press (AP) newswire, discarding singletons (pairs occurring only once) as is commonly done in language modeling. We split this set by type 3, which does not realistically model how new data occurs in real life, but does conveniently guarantee that the entire test set is unseen. In step  n is indeed more likely to cooccur with Vl, even though (n, v2) is plausible (since it did in fact occur).</Paragraph>
      <Paragraph position="3"> 3When a corpus is split by type, all instances of a given type must end up in the same partition. If the split is by token, then instances of the same type may end up in different partitions. For example, for corpus '% b a c', &amp;quot;a b&amp;quot; +&amp;quot;a c&amp;quot; is a valid split by token, but not by type.</Paragraph>
      <Paragraph position="4">  AP90unseen 1,483,728 pairs were extracted from 1990 AP newswire and split by token. Although splitting by token is undoubtedly a better way to generate train-test splits than splitting by type, it had the unfortunate side effect of diminishing the average percentage of unseen cooccurrences in the test sets to 14%. While this is still a substantial fraction of the data (demonstrating the seriousness of the sparse data problem), it caused difficulties in creating test triples: after applying filtering step 3b, there were relatively few candidate nouns and verbs satisfying the fairly stringent condition 3c. Therefore, singletons were retained in the AP90 data. Step 3c was carried out as for AP89.</Paragraph>
      <Paragraph position="5"> AP90fake The procedure for creating the AP90unseen data resulted in much smaller test sets than in the AP89 case (see Table I). To generate larger test sets, we used the same folds as in AP90unseen, but implemented step 3c differently. Instead of selecting v2 from cooccurrences (n, v2) in the held-out set, test triples were constructed using v2 that never cooccurred with n in either the training or the test data. That is, each test triple represented a choice between a plausible cooccurrence (n, Vl) and an implausible (&amp;quot;fake&amp;quot;) cooccurrence (n, v2). To ensure a large differential between the two alternatives, we further restricted (n, Vl) to occur at least twice (in the test fold). We also chose v2 from the set of 50 most frequent verbs, resulting in much higher error rates for backoff.</Paragraph>
    </Section>
    <Section position="3" start_page="37" end_page="39" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> We now present evaluation results ordered by relative difficulty of the decision task.</Paragraph>
      <Paragraph position="1"> Figure 2 shows the performance of distributional clustering and nearest-neighbors averaging on the AP90fake data (in all plots, error bars represent one standard deviation). Recall that the task here was to distinguish between plausible and implausible cooccurrences, making it  a somewhat easier problem than that posed in the AP89 and AP90unseen experiments. Both similarity-based methods improved on the base-line error (which, by construction of the test triples, was guaranteed to be high) by as much as 40%. Also, the curves have the shapes predicted in section 3.1.</Paragraph>
      <Paragraph position="2"> all clu'sters nearest cluster  to backoff on AP90fake test sets.</Paragraph>
      <Paragraph position="3"> We next examine our AP89 experiment results, shown in Figure 3. The similarity-based methods clearly outperform backoff, with the best error reductions occurring at small k for both types of models. Nearest-neighbors averaging appears to have the advantage over distributional clustering, and the nearest cluster method yields lower error rates than the averaged cluster method (the differences are statistically significant according to the paired t-test). We might hypothesize that nearest-neighbors averaging is better in situations of extreme sparsity of data. However, these results must be taken with some caution given their unrealistic type-based train-test split.</Paragraph>
      <Paragraph position="4"> A striking feature of Figure 3 is that all the curves have the same shape, which is not at all what we predicted in section 3.1. The reason  that the very most similar words are apparently not as informative as slightly more distant words is due to recall errors. Observe that if (n, vl) and (n, v2) are unseen in the training data, and if word n' has very small Jensen-Shannon divergence to n, then chances are that n ~ also does not occur with either Vl or v2, resulting in an estimate of zero probability for both test cooccurrences. Figure 4 proves that this is the case: if zero-ties are ignored, then the error rate curve for nearest-neighbors averaging has the expected shape. Of course, clustering is not prone to this problem because it automatically smoothes its probability estimates. average error over APe9, normal vs. precision results nearest neighbors nearest neighbors. Ignodng recall errors  showing the effect of ignoring recall mistakes. Finally, Figure 5 presents the results of  our AP90unseen experiments. Again, the use of similarity information provides better-thanbaseline performance, but, due to the relative difficulty of the decision task in these experiments (indicated by the higher baseline error rate with respect to AP89), the maximum average improvements are in the 6-8% range.</Paragraph>
      <Paragraph position="5"> The error rate reductions posted by weighted-average clustering, nearest-centroid clustering, and nearest-neighbors averaging are all well within the standard deviations of each other.</Paragraph>
      <Paragraph position="6">  to backoff on AP90unseen test sets. As in the AP89 case, the nonmonotonicity of the nearest-neighbors averaging curve is due to recall errors.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>