<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1605"> <Title>Distributional Measures of Concept-Distance: A Task-oriented Evaluation</Title> <Section position="6" start_page="38" end_page="40" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate the distributional concept-distance measures, we used them in the tasks of ranking word pairs in order of their semantic distance and of correcting real-word spelling errors, and compared our results with those we obtained on the same tasks with distributional word-distance measures, and with those that Budanitsky and Hirst (2006) obtained with WordNet-based semantic measures.</Paragraph> <Paragraph position="1"> The distributional concept-distance measures used a bootstrapped WCCM created from the BNC and the Macquarie Thesaurus. The word-distance measures used a word-word co-occurrence matrix created from the BNC alone. The BNC was not lemmatized, part-of-speech tagged, or chunked.</Paragraph> <Paragraph position="2"> The vocabulary was restricted to the words present in the thesaurus (about 98,000 word types), both to provide a level evaluation platform and to keep the matrix to a manageable size. Co-occurrence counts less than 5 were reset to 0, and words that co-occurred with more than 2000 other words were stoplisted (543 in all).</Paragraph> <Paragraph position="3"> Applications that require distance values will enjoy a run-time benefit if the distances are precomputed. While it is easy to completely populate the concept-concept co-occurrence matrix, completely populating the word-word distance matrix is a non-trivial task because of memory and time constraints.</Paragraph> <Section position="1" start_page="38" end_page="39" type="sub_section"> <SectionTitle> 5.1 Ranking word pairs </SectionTitle> <Paragraph position="0"> A direct approach to evaluating linguistic distance measures is to determine how closely they accord with human judgment and intuition. Given a set of word-pairs, humans can rank them in order of their distance--placing near-synonyms at one end of the ranking and unrelated pairs at the other. Rubenstein and Goodenough (1965) provide a &quot;gold-standard&quot; list of 65 human-ranked word-pairs (based on the responses of 51 subjects). One automatic word-distance estimator, then, is deemed more accurate than another if its ranking of word-pairs correlates more closely with this human ranking. Measures of concept-distance can perform this task by computing, for each word-pair, the concept-distance between all pairs of senses of the two words, and taking the distance of the closest sense pair as the word-distance; a sketch of this strategy is given below. This is based on the assumption that when humans are asked to judge the semantic distance between a pair of words, they implicitly consider its closest senses. For example, most people will agree that bank and interest are semantically related, even though both have multiple senses--most of which are unrelated. Alternatively, the method could take the average of the distances of all pairs of senses.</Paragraph>
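To make the closest-sense strategy concrete, the following is a minimal Python sketch. The helpers senses() (a thesaurus sense lookup) and concept_distance() (any of the distributional concept-distance measures) are hypothetical placeholders, not the authors' actual interface; SciPy's spearmanr stands in for the rank-correlation computation.

```python
from itertools import product

from scipy.stats import spearmanr


def word_distance(w1, w2, senses, concept_distance):
    """Word-distance = distance of the closest pair of senses of the two words."""
    return min(concept_distance(c1, c2)
               for c1, c2 in product(senses(w1), senses(w2)))


def rank_correlation(word_pairs, human_ranks, senses, concept_distance):
    """Spearman correlation between a measure's ranking of the word-pairs
    (e.g., the 65 Rubenstein-Goodenough pairs) and the human ranking."""
    distances = [word_distance(w1, w2, senses, concept_distance)
                 for w1, w2 in word_pairs]
    rho, _ = spearmanr(distances, human_ranks)
    return rho
```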
<Paragraph position="1"> As we wanted to perform experiments with both concept-concept and word-word distance matrices, we populated them as and when new distance values were calculated. (We do not report results for the average-sense approach: distributional measures give very large distance values when sense-pairs are unrelated--values that dominate the averages, overwhelming the others, and making the results meaningless.) [Table 2, giving the correlation of each measure's ranking with the human ranking, appeared here; the best results for each measure-type were shown in boldface.] These correlations are, however, notably lower than those obtained by the best WordNet-based measures (not shown in the table), which fall in the range .78 to .84 (Budanitsky and Hirst, 2006).</Paragraph> </Section> <Section position="2" start_page="39" end_page="40" type="sub_section"> <SectionTitle> 5.2 Real-word spelling error correction </SectionTitle> <Paragraph position="0"> The set of Rubenstein and Goodenough word pairs is much too small to safely assume that measures which work well on it will do so for the entire English vocabulary. Consequently, semantic measures have traditionally been evaluated through applications that use them, such as the work by Hirst and Budanitsky (2005) on correcting real-word spelling errors (or malapropisms). If a word in a text is not &quot;semantically close&quot; to any other word in its context, then it is considered a suspect. If the suspect has a spelling-variant that is &quot;semantically close&quot; to a word in its context, then the suspect is declared a probable real-word spelling error, an &quot;alarm&quot; is raised, and the related spelling-variant is considered its correction. Hirst and Budanitsky tested the method on 500 articles from the 1987-89 Wall Street Journal corpus, replacing every 200th word with a spelling-variant. We adopt this method and this test data, but whereas Hirst and Budanitsky used WordNet-based semantic measures, we use the distributional measures Distrib_word and Distrib_concept.</Paragraph> <Paragraph position="1"> To determine whether two words are &quot;semantically close&quot; according to a given distance measure, a threshold must be set: if the distance between two words is less than this threshold, they are considered semantically close. Hirst and Budanitsky (2005) pointed out that there is a notably wide band between 1.83 and 2.36 (on a scale of 0-4) such that human subjects assigned all Rubenstein and Goodenough word pairs values either higher than 2.36 or lower than 1.83. They argue that a suitable threshold between semantically close and semantically distant lies somewhere within this band, and they therefore set thresholds for the WordNet-based measures such that there was maximum overlap between what the measures and what the human judgments considered semantically close and distant. Following this idea, we use an automatic method to determine thresholds for the various Distrib_word and Distrib_concept measures; a sketch of the procedure is given below. Given the list of Rubenstein and Goodenough word pairs ordered by a distance measure, we take the mean of each pair of consecutive distance values as a candidate threshold. For each candidate threshold, we then count the word-pairs correctly classified as semantically close or semantically distant, where the gold classification is the side of the band on which the human judgments place a pair. The candidate threshold with the highest accuracy is chosen as the threshold.</Paragraph>
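A minimal sketch of this threshold-selection procedure, assuming the word-pairs arrive as (distance, human_is_close) tuples; the input format and names are illustrative assumptions, not the authors' actual code.

```python
def choose_threshold(pairs):
    """pairs: list of (distance, human_is_close) tuples, where human_is_close
    is True if human judgments place the pair on the semantically close side
    of the 1.83-2.36 band."""
    values = sorted(d for d, _ in pairs)
    # candidate thresholds: the mean of each pair of consecutive distance values
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]

    def accuracy(t):
        # a pair is classified as semantically close when its distance is below t
        return sum((d < t) == is_close for d, is_close in pairs)

    return max(candidates, key=accuracy)
```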
<Paragraph position="2"> We follow Hirst and St-Onge (1998) in the metrics that we use to evaluate real-word spelling correction; they are listed in Table 3. Suspect ratio and alarm ratio evaluate the processes of identifying suspects and raising alarms, respectively. Detection ratio is the product of the two, and measures overall performance in detecting errors. Correction ratio indicates overall correction performance, and is the &quot;bottom-line&quot; statistic that we focus on. Values greater than 1 for each of these ratios indicate results better than random guessing. The ability of the system to determine the intended word, given that it has correctly detected an error, is indicated by the correction accuracy (0 to 1). Notice that the correction ratio is the product of the detection ratio and the correction accuracy. The overall (single-point) precision P (no. of true-alarms / no. of alarms), recall R (no. of true-alarms / no. of malapropisms), and F-score (F = 2PR / (P + R)) of detection are also computed. The product of the detection F-score and correction accuracy, which we will call correction performance, can also be used as a bottom-line performance metric.</Paragraph> <Paragraph position="5"> Table 4 details the performance of the Distrib_word and Distrib_concept measures. For comparison, the results obtained by Hirst and Budanitsky (2005) with the WNet_concept measures are also shown. Observe that the correction ratio results for the Distrib_word measures are poor compared to those for the Distrib_concept measures; the concept-distance measures are clearly superior, in particular ASD_cp and Cos_cp. Moreover, if we take correction ratio to be the bottom-line statistic, then the Distrib_concept measures outperform all WNet_concept measures except the Jiang-Conrath measure. If we take correction performance to be the bottom-line statistic, then again the distributional concept-distance measures outperform the word-distance measures, except in the case of Lin_pmi, which gives slightly poorer results with concept-distance. Also, in contrast to the correction ratio values, the Leacock-Chodorow measure yields higher correction performance values than the best Distrib_concept measures. While it is clear that the Leacock-Chodorow measure is less accurate in choosing the right spelling-variant for an alarm (correction accuracy), detection ratio and detection F-score present contrary pictures of its relative performance in detection. As correction ratio is the product of a number of ratios, each evaluating one stage of malapropism correction (identifying suspects, raising alarms, and applying the correction), we believe it is a better indicator of overall performance than correction performance, which is a not-so-elegant product of an F-score and an accuracy. However, no matter which of the two is chosen as the bottom-line statistic, the results show that the newly proposed distributional concept-distance measures are clearly superior to the word-distance measures. Further, of all the WordNet-based measures, only that proposed by Jiang and Conrath consistently outperforms the best distributional concept-distance measures on both bottom-line statistics.</Paragraph>
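To make the arithmetic among these metrics explicit, the following sketch combines them exactly as defined above; suspect_ratio, alarm_ratio, and correction_accuracy are taken as given (their internal definitions are those of Hirst and St-Onge, 1998), and the function signature is an illustrative assumption.

```python
def bottom_line_statistics(suspect_ratio, alarm_ratio, correction_accuracy,
                           true_alarms, alarms, malapropisms):
    """Compute the two bottom-line statistics used in this section."""
    detection_ratio = suspect_ratio * alarm_ratio            # > 1 beats chance
    correction_ratio = detection_ratio * correction_accuracy
    precision = true_alarms / alarms                         # P
    recall = true_alarms / malapropisms                      # R
    f_score = 2 * precision * recall / (precision + recall)
    correction_performance = f_score * correction_accuracy
    return correction_ratio, correction_performance
```

</Section> </Section> </Paper>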