<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1018">
  <Title>Word Sense Induction: Triplet-Based Clustering and Automatic Evaluation</Title>
  <Section position="5" start_page="139" end_page="142" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Sch&amp;quot;utze (1992) introduced a pseudoword-based evaluation method for WSD algorithms. The idea is to take two arbitrarily chosen words like banana and door and replace all occurrences of either word by the new pseudoword bananadoor.</Paragraph>
    <Paragraph position="1"> Then WSD is applied to each sentence and the amount of correctly disambiguated sentences is measured. A disambiguation in this case is correct, if the sentence like I ate the banana is assigned to sense #1 (banana) instead of #2 (door).</Paragraph>
    <Paragraph position="2"> In other words all sentences where one of the two words occurs are viewed as one set and the WSD algorithm is then supposed to sort them correctly apart. This, in fact, is very similar to the WSI task, which is supposed to sort the set of words apart that co-occur with the target word and refer to its different meanings. Thus, again it is possible to take two words, view their co-occurrences as one set and let the WSI algorithm sort them apart.</Paragraph>
    <Paragraph position="3"> For example, the word banana might have co-occurrences such as apple, fruit, coconut, ... and the word door co-occurrences such as open, front, locked, .... The WSI algorithm would therefore have to disambiguate the pseudoword bananadoor with the co-occurrences apple, open, fruit, front, locked, ....</Paragraph>
    <Paragraph position="4"> In short, the method merges the co-occurrences of two words into one set of words. Then, the WSI algorithm is applied to that set of co-occurrences and the evaluation measures the result by comparingittotheoriginalco-occurrencesets. Inorderto find out whether a given sense has been correctly identified by the WSI algorithm, its retrieval precision (rP) - the similarity of the found sense with the original sense using the overlap measure - can be computed. In the present evaluations, the threshold of 0.6 was chosen, which means that at least 60% of words of the found sense must overlapwiththeoriginalsenseinordertobecountedas null a correctly found sense. The average numbers of similarity are much higher, ranging between 85% and 95%.</Paragraph>
    <Paragraph position="5"> It is further informativeto measure retrieval recall (rR) - the amount of words that have been correctly retrieved into the correct sense. If, e.g., two words are merged into a pseudoword and the meaning of each of these two words is represented by 200 co-occurring words, then it could happen that one of the senses has been correctly found by the WSI algorithm containing 110 words with an overlap similarity of 0.91. That means that only 100 words representing the original sense were retrieved, resulting in a 50% retrieval recall. This retrieval recall also has an upper bound for two reasons. The average overlap ratio of the co-occurrences of the word pairs used for the evaluation was 3.6%. Another factor lowering the upper bound by an unknown amount is the fact that some of the words are ambiguous. If the algorithm correctly finds different senses of one of the two original words, then only one of the found senses will be chosen to represent the original 'meaning' of the original word. All words assigned to the other sense are lost to the other sense.</Paragraph>
    <Paragraph position="6"> Using terms from information retrieval makes sense because this task can be reformulated as follows: Given a set of 400 words and one out of several word senses, try to retrieve all words belonging to that sense (retrieval recall) without retrieving any wrong ones (retrieval precision). A sense is then defined as correctly found by the WSI algorithm, if its retrieval precision is above 60% and retrieval recall above 25%. The latter number implies that at least 50 words have to be retrieved correctly since the initial co-occurrence sets contained200words. Thisalsoassumesthat50words would be sufficient to characterize a sense if the WSI algorithm is not only used to evaluate itself.</Paragraph>
    <Paragraph position="7"> The reason to set the minimum retrieval precision to any value above 50% is to avoid a too strong  baseline, see below.</Paragraph>
    <Paragraph position="8"> Using these prerequisites it is possible to define precision and recall (based on retrieval precision and retrieval recall) which will be used to measure the quality of the WSI algorithm.</Paragraph>
    <Paragraph position="9"> Precision (P) is defined as the number of times the original co-occurrence sets are properly restored divided by the number of different sets found. Precision has therefore an unknown upper bound below 100%, because any two words chosen could be ambiguous themselves. Thus, if the algorithm finds three meanings of the pseudoword that might be because one of the two words was ambiguous and had two meanings, and hence precision will only be 66%, although the algorithm operated flawlessly.</Paragraph>
    <Paragraph position="10"> Recall (R) is defined as the number of senses found divided by the number of words merged to create the pseudoword. For example, recall is60% if five words are used to create the pseudoword, but only three senses were found correctly (according to retrieval precision and retrieval recall). There is at least one possible baseline for the four introduced measures. One is an algorithm that does nothing, resulting in a single set of 400 co-occurrences of the pseudo-word. This set has a retrieval Precision rP of 50% compared to either of the two original 'senses' because for any of the two senses only half of the 'retrieved' words match. This is below the allowed 60% and thus does not count as a correctly found sense. This means that also retrieval Recall rR, Recall R are both 0% and Precision P in such a case (nothing correctly retrieved, but also nothing wrong retrieved) is defined to be 100%.</Paragraph>
    <Paragraph position="11"> As mentioned in the previous sections, there are several parameters that have a strong impact on the quality of a WSI algorithm. One interesting question is, whether the quality of disambiguation depends on the type of ambiguity: Would the WSI based on sentence co-occurrences (and hence on the bag-of-words model) produce better results for two syntactically different senses or for two senses differing by topic (as predicted by Sch&amp;quot;utze (1992)). This can be simulated by choosing two words of different word classes to create the pseudoword, such as the (dominantly) noun committee and the (dominantly) verb accept.</Paragraph>
    <Paragraph position="12"> Another interesting question concerns the influence of frequency of either the word itself or the sense to be found. The latter, for example, can be simulated by choosing one high-frequent word and one low-frequent word, thus representing a well-represented vs. a poorly represented sense.</Paragraph>
    <Paragraph position="13"> The aim of the evaluation is to test the described parameters and produce an overall average of precisionandrecallandatthesametimemakeitcom- null pletely reproducable by third parties. Therefore therawBNCwithoutbaseformreduction(because lemmatization introduces additional ambiguity) or POS-tags was used and nine groups each containing five words were picked semi-randomly (avoiding extremely ambiguous words, with respect to WordNet, if possible):  crispy, unrepresented, homoclinic, bitchy These nine groups were used to design fours tests, each focussing on a different variable. The high frequent nouns are around 9000 occurrences, medium frequent around 300 and low frequent around 50.</Paragraph>
    <Section position="1" start_page="140" end_page="142" type="sub_section">
      <SectionTitle>
4.1 Influence of word class and frequency
</SectionTitle>
      <Paragraph position="0"> In the first run of all four tests, sentence co-occurrences were used as features. In the first test, all words of equal word class were viewed as one set of 15 words. This results in parenleftbig152 parenrightbig = 105 possibilities to combine two of these words into  a pseudoword and test the results of the WSI algorithm. The purpose of this test is to examine whether there is a tendency for senses of certain word classes to be easier induced. As can be seen from Table 1, sense induction of verbs using sentence co-occurrences performs worse compared to nouns. This could be explained by the fact that verbs are less semantically specific and need more  syntacticcuesorgeneralizations-bothhardlycovered by the underlying bag-of-words model - in order to be disambiguated properly. At the same time, nouns and adjectives are much better distinguishable by topical key words. These results seem to be in unison with the prediction made by Sch&amp;quot;utze (1992).</Paragraph>
      <Paragraph position="1">  put word in Test 1. Showing precision P and recall R, as well as average retrieval precision rP and recall rR.</Paragraph>
      <Paragraph position="2"> In the second test, all three types of possible combinations of the word classes are tested, i.e.</Paragraph>
      <Paragraph position="3"> pseudowords consisting of a noun and a verb, a nouns and an adjective and a verb with an adjective. For each combination there are 15*15 = 225 possibilities of combining a word from one word class with a word from another word class. The purpose of this test was to demonstrate possible differences between WSI of different word class combinations. Thiscorrespondstocaseswhenone word form can be both a nound and a verb, e.g. a walk and to walk or a noun and an adjective, for example a nice color and color TV. However, the results in Table 2 show no clear tendencies other than perhaps that WSI of adjectival senses from verb senses seems to be slightly more difficult.</Paragraph>
      <Paragraph position="4">  senses to be found in Test 2.</Paragraph>
      <Paragraph position="5"> The third test was designed to show the influence of frequency of the input word. All words of equal frequency are taken as one group with parenleftbig152 parenrightbig = 105 possible combinations. The results in Table 3 show a clear tendency for higherfrequent word combinations to achieve a better quality of WSI over lower frequency words. The steep performance drop in recall becomes immediately clear when looking at the retrieval recall of the found senses. This is not surprising, since with the low frequency words, each occuring only about50times in the BNC, the algorithm runs into the data sparseness problem that has already been pointed out as problematic for WSI (Ferret, 2004).  in Test 3.</Paragraph>
      <Paragraph position="6"> The fourth test finally shows which influence the overrepresentation of one sense over another has on WSI. For this purpose, three possible combinations of frequency classes, high-frequent with middle, high with low and middle with low-frequent words were created with 15 * 15 = 225 possible word pairs. Table 4 demonstrates a steep drop in recall whenever a low-frequent word is part of the pseudoword. This reflects the fact that it is more difficult for the algorithm to find the sense that was represented by the less frequent word. The unusually high precision value for the high/low combination can be explained by the fact that in this case mostly only one sense was found (the one of the frequent word). Therefore recall is close to 50% whereas precision is closer to 100%.</Paragraph>
      <Paragraph position="7">  senses based on frequency of the two constituents of the pseudoword in Test 4.</Paragraph>
      <Paragraph position="8"> Finally it is possible to provide the averages for the entire test runs comprising 1980 tests. The macro averages over all tests are P = 85.42%, R = 72.90%, rP = 86.83% and rR = 62.30%, the micro averages are almost the same. Using the same thresholds but only pairs instead of triplets  results in P = 91.00%, R = 60.40%, rP = 83.94% and rR = 62.58%. Or in other words, more often only one sense is retrieved and the F-measures of F = 78.66% for triplets compared to F = 72.61% for pairs confirm an improvement by 6% by using triplets.</Paragraph>
    </Section>
    <Section position="2" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
4.2 Window size
</SectionTitle>
      <Paragraph position="0"> The second run of all four tests using direct neighbors as features failed due to the data sparseness problem. There were 17.5 million word pairs co-occurring significantly within sentences in the BNC according to the log-likelihood measure used. Even there, words with low frequency showed a strong performance loss as compared to the high-frequent words. Compared to that there were only 2.3 million word pairs co-occurring directly next to each other. The overall results of the second run with macro averages P = 56.01%, R = 40.64%, rP = 54.28% and rR = 26.79% will not be reiterated here in detail because they are highly inconclusive due to the data sparseness.</Paragraph>
      <Paragraph position="1"> The inconclusiveness derives from the fact that contrary to the results of the first run, the results here vary strongly for various parameter settings and cannot be considered as stable.</Paragraph>
      <Paragraph position="2"> Although these results are insufficient to show theinfluenceofcontextrepresentationsonthetype of induced senses as they were supposed to, they allow several other insights. Firstly, corpus size doesobviouslymatterforWSIasmoredatawould probably have alleviated the sparseness problem.</Paragraph>
      <Paragraph position="3"> Secondly, while perhaps one context representation might be theoretically superior to another (such as neighbor co-occurrences vs. sentence co-occurrences), the effect various representations have on the data richness were by far stronger in the presented tests.</Paragraph>
    </Section>
    <Section position="3" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
4.3 Examples
</SectionTitle>
      <Paragraph position="0"> In the light of rather abstract, pseudoword-based evaluations some real examples sometimes help to reduce the abstractness of the presented results.</Paragraph>
      <Paragraph position="1"> Threewords, sheet, line and space werechosenarbitrarily and some words representing the induced senses are listed below.</Paragraph>
      <Paragraph position="2">  * sheet - beneath, blank, blanket, blotting, bottom, canvas, cardboard - accounts, amount, amounts, asset, assets, attributable, balance * line - angle, argument, assembly, axis, bottom, boundary, cell, circle, column - lines, link, locomotive, locomotives, loop, metres, mouth, north, parallel * space - astronaut, launch, launched, manned,  mission, orbit, rocket, satellite - air, allocated, atmosphere, blank, breathing, buildings, ceiling, confined These examples show that the found differentiations between senses of words indeed are intuitive. They also show that the found senses are only the most distinguishable ones and many futher senses are missing even though they do appear in the BNC, some of them even frequently. It seems that for finer grained distinctions the bag-of-words model is not appropriate, although it might prove to be sufficient for other applications such as Information Retrieval. Varying contextual representationsmightprovetobecomplementarytotheap- null proach presented here and enable the detection of syntactic differences or collocational usages of a word.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>