<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2411">
<Title>Calculating Semantic Distance between Word Sense Probability Distributions</Title>
<Section position="6" start_page="3" end_page="3" type="evalu">
<SectionTitle> 5 Experimental Evaluation </SectionTitle>
<Paragraph position="0"> We evaluate the SPD method on sense profiles created using the method of Clark and Weir (2002), in comparison to the other distance measures (skew and cos), as explained above. In the calculation of SPD, we compare the two node distance measures, $d_{WP}$ (Wu and Palmer, 1994) and $d_{edge}$, and the two ways of propagating sense profiles, without entropy ($S_P$) and with entropy ($S_E$), as described in Section 3. These settings are mentioned when relevant to distinguishing the results. Recall that in all experiments the random baseline is 50%.</Paragraph>
<Section position="1" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.1 Development Results </SectionTitle>
<Paragraph position="0"> On both the original pilot verbs (36 verbs) and the extended development set (60 verbs), SPD performs as well as or better than the other measures. The top performance in each experimental condition is compiled in Table 1. On the pilot verbs, our measure achieves a best accuracy of 75%. On the development verbs, SPD (without entropy, using either $d_{WP}$ or $d_{edge}$) is also best or tied for best at classifying all verbs, as well as the verbs in each frequency band. No other measure performs as consistently well as SPD.</Paragraph>
<Paragraph position="1"> We find that SPD with entropy does not work as well as SPD without entropy. However, it is often second best (with the exception of high frequency verbs).</Paragraph>
<Paragraph position="2"> There is some difference in the SPD performance on all verbs between the pilot and development sets. Recall that the pilot set contains 36 verbs from an earlier experiment (which are not evenly distributed among the frequency bands), and the development set contains 60 verbs (the pilot set plus additional verbs, with each band containing 20 verbs). We compare the two verb sets and discover that, in the pilot set, high and medium frequency verbs outnumber the low frequency verbs. To better compare with the pilot verb results, we run the experiment on only the high and medium frequency development verbs. See the &quot;high-med&quot; column under &quot;Development Verbs&quot; in Table 1: SPD achieves an accuracy of 75%, equalling the best performance on the pilot set.</Paragraph>
[Table 1 caption: Best accuracies, with the measure(s) that produced the result, using a median threshold. SPD refers to SPD without entropy, using the indicated node distance measure. &quot;all&quot;, &quot;high&quot;, &quot;med&quot;, and &quot;low&quot; refer to the different frequency bands. &quot;avg(h,m,l)&quot; refers to the average accuracy of the three frequency bands.]
</Section>
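For concreteness, the following is a minimal sketch of the two node distance measures compared above, using NLTK's WordNet interface. It is not the authors' implementation: taking $d_{WP}$ as one minus Wu-Palmer similarity and $d_{edge}$ as the raw shortest-path length are assumptions, since the paper's exact normalizations are not reproduced here.

    # A sketch of the two node distance measures, via NLTK's WordNet.
    # Both distance forms are illustrative assumptions, not the
    # paper's exact definitions.
    from nltk.corpus import wordnet as wn

    def d_wp(s1, s2):
        # Wu-Palmer similarity = 2 * depth(LCS) / (depth(s1) + depth(s2));
        # deeper (more specific) pairs come out as more similar, so we
        # invert the similarity to obtain a distance in [0, 1].
        return 1.0 - (s1.wup_similarity(s2) or 0.0)

    def d_edge(s1, s2):
        # Plain edge distance: the number of edges on the shortest path
        # between the two nodes, insensitive to how deep the nodes sit.
        dist = s1.shortest_path_distance(s2)
        return float(dist) if dist is not None else float("inf")

    # A shallow pair and a deep pair: d_wp separates them by depth,
    # while d_edge sees only path length.
    for s1, s2 in [(wn.synset("entity.n.01"), wn.synset("abstraction.n.06")),
                   (wn.synset("dog.n.01"), wn.synset("wolf.n.01"))]:
        print(s1.name(), s2.name(), round(d_wp(s1, s2), 2), d_edge(s1, s2))

The shallow pair receives a larger $d_{WP}$ than the deep pair, which is the depth sensitivity discussed in Section 5.2 below.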
<Section position="2" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.2 Test Results </SectionTitle>
<Paragraph position="0"> Table 2 shows the best results in the testing phase. Again, SPD has the most consistent performance. Here, similarly to the development results, SPD is best (or tied for best) at classifying the verbs in the individual frequency bands. However, in classifying all verbs together, it is not the best; it is second best at 63%.</Paragraph>
<Paragraph position="1"> As in the development results, SPD measures without entropy ($S_P$) fare better than those with entropy ($S_E$). However, unlike the development results, $S_E$ does not do well at all. To examine $S_E$'s poor performance, we do a pairwise comparison of the actual classifications of the two SPD methods. In all cases, many causative verbs that are classified correctly (i.e., small profile distance) by $S_P$ are no longer correct using $S_E$ (i.e., they are now classified as having a large profile distance). By propagating entropy instead of probability mass, the distance between profiles is incorrectly amplified for causative verbs. Since this phenomenon is not observed in the development set, it is unclear under what circumstances the distance is amplified by entropy propagation. We conclude that simple propagation of profile mass is the more consistent method of the two for this application.</Paragraph>
<Paragraph position="2"> Recall that we also experiment with two different node distance measures ($d_{WP}$ and $d_{edge}$). Though not identical, the performance of the two is remarkably similar. In fact, the actual classifications themselves are very similar. Note that Wu and Palmer (1994) designed their measure such that shallow nodes (i.e., less specific senses) are less similar than nodes that are deeper in the WordNet hierarchy, a property that the edge distance measure lacks. We hypothesized that our sense profiles are similar in terms of depth, so that taking relative depth into account in the distance measure has little impact. Comparing the sense profiles of groups of verbs reveals that, with one exception (non-causative development verbs), the difference in depth is not statistically significant (paired t-test). In the one case that is statistically significant, the average difference in depth is less than two levels.</Paragraph>
<Paragraph position="3"> For comparison, we replicate McCarthy's method on our test verbs, using tree cuts produced by Li and Abe's technique, which are propagated to their lowest common subsumers, with their distance measured by skew divergence (sketched below). Recall that we do not hand-select our causative verbs to ensure they undergo the causative alternation; therefore, there is more noise in our data than in McCarthy's. In the presence of this noise, her method performs quite well in many cases: it is best or tied for best on the development verbs at medium frequency (70%), and on the test verbs for all verbs (67%), high frequency (80%), and medium frequency (80%). However, it does not do well on low frequency verbs at all (below chance, at 40%).</Paragraph>
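As a reference point for the replication, skew divergence (Lee, 2001) between two sense distributions can be computed as below. This is a minimal sketch under assumptions: the smoothing parameter alpha = 0.99 is the commonly used value and the direction of comparison is illustrative; McCarthy's exact settings are not reproduced here.

    import numpy as np

    def skew_divergence(p, q, alpha=0.99):
        # Skew divergence: KL(p || alpha * q + (1 - alpha) * p).
        # Mixing a small amount of p into q keeps the divergence finite
        # when q assigns zero probability to senses that p supports --
        # the usual failure mode of plain KL on sparse sense profiles.
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        m = alpha * q + (1.0 - alpha) * p
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

    # Toy sense profiles over five generalized senses; q gives zero
    # mass to a sense that p supports, which plain KL could not handle.
    p = [0.5, 0.3, 0.1, 0.1, 0.0]
    q = [0.4, 0.0, 0.2, 0.2, 0.2]
    print(skew_divergence(p, q))

Note that the measure is asymmetric: skew_divergence(p, q) and skew_divergence(q, p) generally differ.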
<Paragraph position="4"> We suspect the problem is twofold, arising from the dependence of her method on tree cut models (Li and Abe, 1998). The first problem is that one needs to generalize the tree cut profiles to their common subsumers to use skew divergence; as a result, as mentioned earlier, the semantic specificity of the profiles is lost. The second problem is that, as Wagner (2000) points out, less data tends to yield tree cuts that are more general (further up in the hierarchy). Therefore, low frequency verbs have more general profiles, and the distance between profiles is less informative. We conclude that McCarthy's method is less appropriate for low frequency data than ours.</Paragraph>
</Section>
<Section position="3" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.3 Frequency Bands </SectionTitle>
<Paragraph position="0"> Somewhat surprisingly, we often get better performance on the frequency bands individually than on all verbs together. By inspection, we observe that low frequency verbs tend to have smaller distances between the two slots, and high frequency verbs tend to have larger distances. As a result, the threshold for all verbs lies between the thresholds for each of these frequency bands.</Paragraph>
<Paragraph position="1"> When classifying all verbs, this frequency effect may result in more false positives for low frequency verbs and more false negatives for high frequency verbs.</Paragraph>
<Paragraph position="2"> We examine the combined performance of the individual frequency bands, in comparison to the performance on all verbs. Here, we define &quot;combined performance&quot; as the average of the accuracies from each frequency band.</Paragraph>
<Paragraph position="3"> We find that SPD without entropy attains an averaged accuracy of 70%, an improvement of 3% over the best accuracy when classifying all verbs together. Separating the frequency bands is an effective way to remove the frequency effect. Stemming from this analysis, a possible refinement to separating the frequency bands is to use a different classifier in each frequency band and then combine their performance. However, we observe that the best SPD performer in one frequency band tends to be the best performer in the other bands (development: SPD without entropy, $d_{edge}$; test: SPD without entropy, $d_{WP}$). There does not seem to be a relationship between verb frequency and the various distance measures.</Paragraph>
</Section>
</Section>
</Paper>