<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2605">
  <Title>Using Selectional Profile Distance to Detect Verb Alternations</Title>
  <Section position="6" start_page="5" end_page="5" type="evalu">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluate the SPD method on selectional profiles created using the method of Clark and Weir (2002), with comparison to the other distance measures as explained above. In the calculation of SPD, we compare the two node distance measures, dist_WP (Wu and Palmer, 1994) and dist_edge, and the two ways of propagating selectional profiles, without entropy and with entropy, as described in Section 3. These settings are mentioned when relevant to distinguishing the results.</Paragraph>
    <Paragraph position="1"> [Table 1 caption: Best accuracies for each condition (development set and threshold), along with the measure(s) that produce that result. SPD refers to SPD without entropy, using either dist_WP or dist_edge. &amp;quot;all&amp;quot;, &amp;quot;high&amp;quot;, and &amp;quot;low&amp;quot; refer to the different frequency bands.]</Paragraph>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.1 Development Results
</SectionTitle>
      <Paragraph position="0"> On the two development sets, SPD generally performs better than the other measures. In particular, our measure achieves a best accuracy of 69% (random baseline of 50%, broad class fillers, all verbs). The best performance is compiled in Table 1. Observe that in each condition, SPD (without entropy, using either dist_WP or dist_edge) is always the best (or tied for best) at classifying all verbs, and at classifying at least one other frequency band. No other measure performs consistently as well as SPD. Indeed, on closer examination, in the cases where SPD is not the best, it has the second best performance. Interestingly, we also discover that cosine works well in the low frequency band.</Paragraph>
      <Paragraph position="1"> There is only a small difference in SPD performance between the two development sets. Recall that broad class fillers contain non-causatives from a wider variety of classes than restricted class fillers, which we thought would make the classification task harder, because of more variation in the data. However, not only is the broad class performance not lower, but there are some cases in which it surpasses the restricted class performance. At least for these verbs, the amount of variation in the classes has little impact.</Paragraph>
      <Paragraph position="2"> SPD with entropy does not perform best on development verbs. However, in comparison to the vector distance measures (which yield below chance accuracies in most cases), SPD with entropy does achieve reasonable accuracies. It is always above chance, and sometimes second best.</Paragraph>
      <Paragraph position="3"> [Table 2 caption: Best and second best accuracies in testing, along with the measure(s) that produced the result, using a median threshold. &amp;quot;all&amp;quot;, &amp;quot;high&amp;quot;, and &amp;quot;low&amp;quot; refer to the different frequency bands.]</Paragraph>
      <Paragraph position="4"> Generally, across both development sets, using a median threshold works somewhat better than an average threshold. To focus our testing phase, we use only the median threshold.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.2 Test Results
</SectionTitle>
      <Paragraph position="0"> Table 2 shows both the best and second best results in the testing phase. Here, similarly to the development results, SPD is the best (or tied for best) at classifying all verbs, and verbs in the low frequency band. In cases where it is not the best, it is the second best.</Paragraph>
      <Paragraph position="1"> Contrary to the development results, the SPD measures with entropy fare somewhat better than those without entropy. To examine the difference in performance, we do a pairwise comparison of the actual verb classifications. In the &amp;quot;all&amp;quot; frequency case, SPD with entropy has 7 false positives,8 and SPD without entropy has 8 false positives, 5 of which are misclassified by both.</Paragraph>
      <Paragraph position="2"> Furthermore, with the exception of one verb, the remaining false positives are quite near the threshold. The trends in the low frequency band are quite similar--there is considerable overlap between the false positives of SPD with and without entropy. Given the similarity of the classifications, we conclude that the two propagation methods (with or without entropy) would likely be comparable on larger sets of verbs.</Paragraph>
      <Paragraph position="3"> Recall that we also experiment with two different node distance measures (dist_WP and dist_edge). Interestingly, the performance of the two is remarkably similar. In fact, the actual classifications themselves are very similar. Note that Wu and Palmer (1994) designed their measure such that shallow nodes are less similar than nodes that are deeper in the WordNet hierarchy. This property is certainly lacking in the edge distance measure. Here we can only speculate that perhaps our selectional profiles are relatively similar in terms of depth, so that taking relative depth into account in the distance measure has little impact.</Paragraph>
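The depth-sensitivity contrast between the two node distance measures can be illustrated on a toy taxonomy. This is a sketch under stated assumptions, not the paper's implementation: the node names are invented (WordNet is not consulted), and the root is assigned depth 1, a common convention for Wu-Palmer similarity.

```python
# Toy taxonomy, child -> parent (invented; not WordNet).
parent = {
    "entity": None,
    "object": "entity", "animal": "object", "dog": "animal", "cat": "animal",
    "abstraction": "entity", "idea": "abstraction",
}

def ancestors(node):
    # Path from the node up to the root, inclusive.
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node):
    return len(ancestors(node))  # root has depth 1

def lcs(a, b):
    # Lowest common subsumer: first ancestor of b that also dominates a.
    anc_a = set(ancestors(a))
    return next(n for n in ancestors(b) if n in anc_a)

def wu_palmer_sim(a, b):
    # Wu and Palmer (1994): 2 * depth(lcs) / (depth(a) + depth(b)).
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))

def edge_dist(a, b):
    # Number of edges on the path between a and b through their lcs.
    c = lcs(a, b)
    return (depth(a) - depth(c)) + (depth(b) - depth(c))

# A deep pair and a shallow pair at the same edge distance:
print(edge_dist("dog", "cat"), wu_palmer_sim("dog", "cat"))
print(edge_dist("object", "abstraction"), wu_palmer_sim("object", "abstraction"))
```

Both pairs sit two edges apart, but Wu-Palmer scores the deep pair (0.75) as more similar than the shallow one (0.5), which is precisely the property the edge distance lacks.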
      <Paragraph position="4"> For comparison, we replicate McCarthy's method,9 which only achieves above chance performance in a few cases: on the development verbs with restricted fillers (56%, low frequency verbs, average threshold), and on the development verbs with broad class fillers (58%, all verbs, average threshold; and 62%, low frequency verbs, median threshold). This result is very different from her reported results. One major difference between our experimental set-up and hers is the selection of verbs. We do not hand-select our causative verbs to ensure they undergo the causative alternation. We speculate that there is more noise in our data than in McCarthy's, and that our method is less sensitive to that noise.</Paragraph>
      <Paragraph position="5"> 8Hence, 14 verbs are misclassified, since we use the median threshold, which splits the verbs exactly in half into the two classes. 9We replicate McCarthy's method using tree cuts produced by Li and Abe's technique, which are propagated to their lowest common subsumers, with their distance measured by skew divergence.</Paragraph>
      <Paragraph position="6"> One puzzle in the pattern of results is the cosine performance--cosine has the best or second best accuracy across all bands in the test data, while it is best mostly in the low band in development. We are a bit surprised that cosine works well at all. In the future, we intend to examine the conditions where cosine is a sufficient discriminator.</Paragraph>
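As a point of reference for the cosine result, a vector measure of this kind can be sketched as follows. The slot vectors are invented, and treating each slot as a bag of noun frequencies is an assumption about the representation, not a detail taken from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse frequency vectors (dicts).
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented noun frequencies for a verb's subject and object slots.
subj = {"window": 3, "glass": 2, "door": 1}
obj = {"window": 2, "glass": 3, "vase": 1}
print(cosine(subj, obj))
```

On this representation, a high cosine between the two slots signals the shared fillers one would expect of an alternating verb.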
    </Section>
    <Section position="3" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.3 Frequency Bands
</SectionTitle>
      <Paragraph position="0"> Somewhat surprisingly, we often get better performance with both the low and high frequency bands individually than we do with all verbs together. By inspection, we observe that low frequency verbs tend to have smaller distances between two slots and high frequency verbs tend to have larger distances. As a result, the threshold for all verbs is in between the thresholds for each of the frequency bands. When classifying both types of verbs, the frequency effect may result in more false positives for low frequency verbs, and more false negatives for high frequency verbs.</Paragraph>
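The frequency effect described above can be made concrete with invented numbers: low frequency verbs cluster at small distances and high frequency verbs at large ones, so a single global threshold falls between the bands. The verb names, values, and the rule "causative iff distance <= threshold" are all illustrative assumptions.

```python
from statistics import median

# Invented SPD values; "caus" verbs should fall below the threshold.
low = {"low_caus": 0.10, "low_non": 0.25}
high = {"high_caus": 0.55, "high_non": 0.90}

global_thr = median(list(low.values()) + list(high.values()))
# The global threshold (0.4) lies between the bands, so every low band
# verb looks causative (a false positive for low_non) and every high
# band verb looks non-causative (a false negative for high_caus).

per_band = {"low": median(low.values()), "high": median(high.values())}
# Per-band thresholds (0.175 and 0.725) classify all four verbs correctly.
```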
      <Paragraph position="1"> We examine the combined performance of the individual frequency bands, in comparison to the performance on all verbs. Here, we define &amp;quot;combined performance&amp;quot; as the average of the accuracies from each frequency band.</Paragraph>
      <Paragraph position="2"> (The averages are weighted averages if each band contains a different number of verbs.) We find that SPD with entropy attains an averaged accuracy of 70%, an improvement of 5% over the best accuracy when classifying all verbs together.</Paragraph>
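The combined performance computation is a weighted average of per-band accuracies. The band sizes below are invented to show both the equal-sized case, which reduces to the plain average, and an unequal case.

```python
def combined_accuracy(band_results):
    # Weighted average of per-band accuracies; band_results maps
    # band -> (accuracy, number_of_verbs).
    total = sum(n for _, n in band_results.values())
    return sum(acc * n for acc, n in band_results.values()) / total

# Equal bands reproduce the plain average: a low band accuracy of 80%
# and a high band accuracy of 70% combine to 75%.
print(combined_accuracy({"low": (0.80, 20), "high": (0.70, 20)}))
# Unequal bands (sizes invented) pull the result toward the larger band.
print(combined_accuracy({"low": (0.80, 10), "high": (0.70, 30)}))
```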
      <Paragraph position="3"> Separating the frequency bands is an effective way to remove the frequency effect.10 Stemming from this analysis, a possible refinement to separating the frequency bands is to use a different classifier in each frequency band, then combine their performance. We observe that combining the best accuracies gives us an accuracy of 75% (best low band accuracy of 80% and best high band accuracy of 70%), outperforming the &amp;quot;all verbs&amp;quot; best accuracy by 10%. Although in our current results there is no one classifier that is clearly the best overall for a particular frequency band, we plan to examine further the relationship between verb frequency and various distance measures.</Paragraph>
      <Paragraph position="4"> 10Another method is to use some type of &amp;quot;expected distance&amp;quot; as a normalizing factor (Paola Merlo, p.c.). However, it is as yet unclear how we would calculate this number.</Paragraph>
    </Section>
  </Section>
</Paper>