<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0213">
  <Title>A Perspective on Word Sense Disambiguation Methods and Their Evaluation</Title>
  <Section position="4" start_page="80" end_page="81" type="metho">
    <SectionTitle>
3 Proposals
</SectionTitle>
    <Paragraph position="0"> Proposal 1. A better evaluation criterion. At present, the standard for evaluation of word sense disambiguation algorithms is the &amp;quot;exact match&amp;quot; criterion, or simple accuracy: $\%\,\mathrm{correct} = 100 \times \frac{\#\ \text{exactly matched sense tags}}{\#\ \text{assigned sense tags}}$. Despite its appealing simplicity, this criterion suffers from some obvious drawbacks. For example, consider the context: ... bought an interest in Lydak Corp ... (1) and assume the existence of 4 hypothetical systems that assign the probability distributions in Table 2 to the 4 major senses of interest.</Paragraph>
    <Paragraph position="1"> Senses of interest: (1) monetary (e.g. on a loan); (2) stake or share (the correct sense in this context); (3) benefit/advantage/sake; (4) intellectual curiosity. [Table 2: probability distributions assigned by the four hypothetical systems to the example context (1) above.]</Paragraph>
    <Paragraph position="2"> Each of the systems assigns the incorrect classification (sense 1), given that the correct sense is 2 (a stake or share). However, System 1 has been able to nearly rule out senses 3 and 4 and assigns reasonably high probability to the correct sense, yet it is given the same penalty as the other systems, which either have ruled out the correct sense (Systems 2 and 4) or effectively claim ignorance (System 3).</Paragraph>
    <Paragraph position="3"> If we intend to use the output of the sense tagger as input to another probabilistic system, such as a speech recognizer, topic classifier or IR system, it is important that the sense tagger yield probabilities with its classifications that are as accurate and robust as possible. If the tagger is confident in its answer, it should assign high probability to its chosen classification. If it is less confident, but has effectively ruled out several options, the assigned probability distribution should reflect this too. A solution to this problem comes from the speech community, where cross-entropy (or its related measures, perplexity and Kullback-Leibler distance) is used to evaluate how well a model assigns probabilities to its predictions. The easily computable formula for cross-entropy is $-\frac{1}{N}\sum_{i=1}^{N}\log_2 \Pr_A(cs_i \mid w_i, \mathrm{context}_i)$, where $N$ is the number of test instances and $\Pr_A$ is the probability assigned by algorithm $A$ to the correct sense $cs_i$ of polysemous word $w_i$ in $\mathrm{context}_i$. Crucially, given the hypothetical case above, the sense disambiguation algorithm in System 1 would get much of the credit for assigning high probability, even if not the highest probability, to the correct sense. Just as crucially, an algorithm would be penalized heavily for assigning very low probability to the correct sense. In aggregate, optimal performance is achieved under this measure by systems that assign as accurate a probability estimate as possible to their classifications, neither too conservative (System 3) nor too overconfident (Systems 2 and 4).</Paragraph>
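A minimal sketch of how the exact-match and cross-entropy criteria might be computed. The probability distributions below are illustrative assumptions standing in for Table 2 (whose values are not reproduced here); the sketch also includes the log-free variant proposed later in this section:

```python
import math

# Illustrative (not the paper's) probability distributions over the four
# senses of "interest" for example context (1); sense "stake/share" is correct.
CORRECT = "stake/share"
systems = {
    "System 1": {"monetary": 0.55, "stake/share": 0.40, "benefit/sake": 0.03, "curiosity": 0.02},
    "System 2": {"monetary": 0.98, "stake/share": 0.005, "benefit/sake": 0.01, "curiosity": 0.005},
    "System 3": {"monetary": 0.25, "stake/share": 0.25, "benefit/sake": 0.25, "curiosity": 0.25},
    "System 4": {"monetary": 0.90, "stake/share": 0.02, "benefit/sake": 0.04, "curiosity": 0.04},
}

def exact_match(predictions, gold):
    """Simple accuracy: % of instances whose highest-probability sense is correct."""
    hits = sum(max(p, key=p.get) == g for p, g in zip(predictions, gold))
    return 100.0 * hits / len(predictions)

def cross_entropy(predictions, gold):
    """Mean negative log2 probability assigned to the correct sense (lower is better)."""
    return -sum(math.log2(p[g]) for p, g in zip(predictions, gold)) / len(predictions)

def mean_correct_probability(predictions, gold):
    """Log-free variant: mean probability mass on the correct sense (higher is better);
    with hard 0/1 tags this reduces to simple accuracy expressed as a fraction."""
    return sum(p[g] for p, g in zip(predictions, gold)) / len(predictions)

# Treat the single example context as a one-instance test set for each system.
for name, dist in systems.items():
    print(f"{name}: exact match = {exact_match([dist], [CORRECT]):.0f}%, "
          f"cross-entropy = {cross_entropy([dist], [CORRECT]):.2f} bits, "
          f"mean Pr(correct) = {mean_correct_probability([dist], [CORRECT]):.3f}")
```

Under these assumed distributions all four systems score 0% on exact match for this instance, while cross-entropy separates System 1 (about 1.3 bits) from Systems 2 and 4 (roughly 7.6 and 5.6 bits), reflecting the gradations discussed above.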
    <Paragraph position="4"> This evaluation measure does not necessarily obviate the exact match criterion, and the two could be used in conjunction with each other since they make use of the same test data. However, a measure based on cross-entropy or perplexity would provide a fairer test, especially for the common case where several fine-grained senses may be correct and it is nearly impossible to select exactly the sense chosen by the human annotator.</Paragraph>
    <Paragraph position="5"> Finally, not all classification algorithms return probability values. For these systems, and for those that yield poorly estimated values, a variant of the cross-entropy measure without the log term ($\frac{1}{N}\sum_{i=1}^{N}\Pr_A(cs_i \mid w_i, \mathrm{context}_i)$) can be used to measure improvement in restricting and/or roughly ordering the possible classification set without excessive penalties for poor or absent probability estimates. In the case where the assigned tag is given probability 1 and all other senses probability 0, this measure is equivalent to simple % correct. Proposal 2. Make evaluation sensitive to semantic/communicative distance between subsenses.</Paragraph>
    <Paragraph position="6"> Current WSD evaluation metrics also fail to take into account semantic/communicative distance between senses when assigning penalties for incorrect labels. This is most evident when word senses are nested or arranged hierarchically, as shown in the example sense inventory for bank in Table 3.</Paragraph>
    <Paragraph position="7"> An erroneous classification between close siblings in the sense hierarchy should be given relatively little penalty, while misclassifications across homographs should receive a much greater penalty. The penalty matrix distance(subsense1, subsense2) could capture simple hierarchical distance (e.g. (Resnik, 1995; Richardson et al., 1994)), derived from a single semantic hierarchy such as WordNet, or be based on a weighted average of simple hierarchical distances from multiple sources, such as sense/subsense hierarchies in several dictionaries. A very simple example of such a distance matrix can be constructed for the bank sense hierarchy. Penalties could also be based on general pairwise functional communicative distance: errors between subtle sense differences would receive little penalty while gross errors likely to result in misunderstanding would receive a large penalty. Such communicative distance matrices could be derived from several sources. They could be based on psycholinguistic data, such as experimentally derived estimates of similarity or confusability (Miller and Charles, 1991; Resnik, 1995). They could be based on a given task: e.g. in speech synthesis only those sense distinction errors corresponding to pronunciation distinctions (e.g. bass /bæs/ vs. bass /beIs/) would be penalized. For the machine-translation application, only those sense differences lexicalized differently in the target language would be penalized, with the penalty proportional to communicative distance.2 (Footnote 2: Such distance could be based on the weighted percentage of all languages that lexicalize the two subsenses differently.) In general, such a distance matrix could support arbitrary communicative cost/penalty functions, dynamically changeable according to task.</Paragraph>
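A sketch of how such a hierarchical distance penalty matrix might be built. The nested bank inventory and level weights below are illustrative stand-ins, not the paper's Tables 3 and 4:

```python
# Hypothetical nested sense inventory for "bank": each sense is a path of
# (homograph, sense, subsense) labels; None marks an absent subsense level.
SENSE_PATHS = {
    "1a": ("bank1", "financial institution", "the company"),
    "1b": ("bank1", "financial institution", "the building"),
    "1c": ("bank1", "supply or reserve", None),
    "2a": ("bank2", "sloping land beside a river", None),
}

def hierarchical_distance(path_a, path_b, level_weights=(1.0, 0.5, 0.1)):
    """Sum the weight of every level at which two sense paths differ, so that
    crossing a homograph boundary costs far more than confusing close subsenses."""
    return sum(w for w, a, b in zip(level_weights, path_a, path_b) if a != b)

labels = list(SENSE_PATHS)
for row in labels:
    penalties = [hierarchical_distance(SENSE_PATHS[row], SENSE_PATHS[col]) for col in labels]
    print(row, ["%.1f" % p for p in penalties])
```

Under these assumed weights, confusing subsenses 1a and 1b costs 0.1, while confusing either with the river-bank homograph 2a costs 1.6; a communicative-distance matrix would simply replace these weights with task-specific or psycholinguistically derived costs.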
    <Paragraph position="8"> There are several ways in which such a (hierarchical) distance penalty weighting could be utilized along with the cross-entropy measure. The simplest is to minimize the mean distance/cost between the assigned sense ($as_i$) and correct sense ($cs_i$) over all $N$ examples as an independent figure of merit:</Paragraph>
  </Section>
  <Section position="5" start_page="81" end_page="84" type="metho">
    <SectionTitle>
$\frac{1}{N}\sum_{i=1}^{N} \mathrm{distance}(cs_i, as_i)$
</SectionTitle>
    <Paragraph position="0"> However, one could also use a metric such as the following, which measures the efficacy of probability assignment in a manner that penalizes probabilities assigned to incorrect senses, weighted by the communicative distance/cost between each incorrect sense and the correct one:</Paragraph>
    <Paragraph position="1"> $\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{S_i} \Pr_A(s_j \mid w_i, \mathrm{context}_i) \times \mathrm{distance}(cs_i, s_j)$</Paragraph>
    <Paragraph position="2"> where for any test example $i$, we consider all $S_i$ senses $s_j$ of word $w_i$, weighting the probability mass assigned by the classifier $A$ to incorrect senses ($\Pr_A(s_j \mid w_i, \mathrm{context}_i)$) by the communicative distance or cost of that misclassification.3 Note that in the special case of sense tagging without probability estimates (all are either 0 or 1), this formula is equivalent to the previous one (simple mean distance or cost minimization).</Paragraph>
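A sketch, under stated assumptions, of the distance-weighted figure of merit just described: the `distance` argument can be any pairwise cost function (such as the hierarchical one sketched earlier), and the toy sense labels and distributions below are made up:

```python
def mean_distance_cost(predictions, gold_senses, distance):
    """Distance-weighted penalty: for each test instance, sum the probability
    mass placed on every sense, weighted by that sense's communicative
    distance from the correct sense (the correct sense contributes 0)."""
    total = 0.0
    for dist_over_senses, gold in zip(predictions, gold_senses):
        total += sum(prob * distance(gold, sense)
                     for sense, prob in dist_over_senses.items())
    return total / len(predictions)

# Toy usage with a 0/1 homograph-level distance (hypothetical sense labels):
homograph_distance = lambda a, b: 0.0 if a.split("/")[0] == b.split("/")[0] else 1.0
soft = [{"bank1/company": 0.6, "bank1/building": 0.3, "bank2/riverside": 0.1}]
hard = [{"bank1/company": 1.0, "bank1/building": 0.0, "bank2/riverside": 0.0}]
print(mean_distance_cost(soft, ["bank1/building"], homograph_distance))  # 0.1
print(mean_distance_cost(hard, ["bank1/building"], homograph_distance))  # 0.0 (same homograph)
```

With hard 0/1 tags the same function computes the simple mean distance between assigned and correct senses, matching the equivalence noted above.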
    <Paragraph position="3"> Proposal 3. A framework for common evaluation and test set generation. Supervised and unsupervised sense disambiguation methods have different needs regarding system development and evaluation. Although unsupervised methods may be evaluated (with some limitations) by a sequentially tagged corpus such as the WordNet semantic concordance (with a large number of polysemous words represented but with few examples of each), supervised methods require much larger data sets focused on a subset of polysemous words to provide adequately large training and testing material. It is hoped that US and international sources will see fit to fund such a data annotation effort. To facilitate discussion of this issue, the following is a proposed framework for providing this data, satisfying the needs of both supervised and unsupervised tagging research.</Paragraph>
    <Paragraph position="4"> 3 Although this function enumerates over all $S_i$ senses of $w_i$, because $\mathrm{distance}(cs_i, cs_i) = 0$ this function only penalizes probability mass assigned to incorrect senses for the given example.</Paragraph>
    <Paragraph position="6"> 1. Select/Collect a very large (e.g., N = 1 billion words), diverse unannotated corpus.</Paragraph>
    <Paragraph position="7"> 2. Select a sense inventory (e.g. WordNet, LDOCE) with respect to which algorithms will be evaluated (see Proposal 4).</Paragraph>
    <Paragraph position="8"> 3. Pick a subset of R &lt; N (e.g., 100M) words of unannotated text, and release it to the community as a training set.</Paragraph>
    <Paragraph position="9"> 4. Pick a smaller subset of S &lt; R &lt; N (e.g., 10M) words of text as the source of the test set. Generate the test set as follows: (a) Select a set of M (e.g., 100) ambiguous words that will be used as the basis for the evaluation, without telling the research community what those words will be.</Paragraph>
    <Paragraph position="10"> (b) For each of the M words, annotate all available instances of that word in the test corpus. Make sure each annotator tags all instances of a single word, e.g. using a concordance tool, as opposed to going through the corpus sequentially.</Paragraph>
    <Paragraph position="11">  (c) For each of the M words, compute evaluation statistics using individual annotators against other annotators.</Paragraph>
    <Paragraph position="12"> (d) For each of the M words, go through the cases where annotators disagreed and make a consensus choice, by vote if necessary.</Paragraph>
    <Paragraph position="13"> 5. Instruct participants in the evaluation to &amp;quot;freeze&amp;quot; their code; that is, from this point on no changes may be made.</Paragraph>
    <Paragraph position="14"> 6. Have each participating algorithm do WSD on the full S-word test corpus.</Paragraph>
    <Paragraph position="15"> 7. Evaluate the performance of each algorithm, considering only instances of the M words annotated as the basis for the evaluation. Compare exact match, cross-entropy, and inter-judge reliability measures (e.g. Cohen's κ; a small illustrative sketch follows the discussion of this proposal) using annotator-vs-annotator results as an upper bound.</Paragraph>
    <Paragraph position="16"> 8. Release this year's S-word test corpus as a development corpus for those algorithms that require supervised training, so they can participate from now on, being evaluated in the future via cross-validation.</Paragraph>
    <Paragraph position="17"> 9. Go back to Step 3 for next year's evaluation. There are a number of advantages to this paradigm, in comparison with simply trying to annotate large corpora with word sense information. First, it combines an emphasis on broad coverage with the advantages of evaluating on a limited set of words, as is done traditionally in the WSD literature. Step 4a can involve any form of criteria (frequency, level of ambiguity, part of speech, etc.) to narrow down the set of candidate words, and then employ random selection among those candidates. At the same time, it avoids a common criticism of studies based on evaluating using small sets of words, namely that not enough attention is paid to scalability. In this evaluation paradigm, algorithms must be able to sense tag all words in the corpus meeting the specified criteria, because there is no way to know in advance which words will be used to compute the figure(s) of merit.</Paragraph>
    <Paragraph position="18"> Second, the process avoids some of the problems that arise in using exhaustively annotated corpora for evaluation. By focusing on a relatively small set of polysemous words, much larger data sets for each can be produced. This focus will also allow more attention to be paid to selecting and vetting comprehensive and robust sense inventories, including detailed specifications and definitions for each. Furthermore, by having annotators focus on one word at a time using concordance software, the initial level of consistency is likely to be far higher than that obtained by a process in which one jumps from word to word by going sequentially through a text, repeatedly refamiliarizing oneself with a different sense inventory at each word. Finally, by computing inter-annotator statistics blindly and then allowing annotators to confer on disagreements, a cleaner test set can be obtained without sacrificing trustworthy upper bounds on performance.</Paragraph>
    <Paragraph position="19"> Third, the experience of the Penn Treebank and other annotation efforts has demonstrated that it is difficult to select and freeze a comprehensive tag set for the entire vocabulary in advance. Studying and writing detailed sense tagging guidelines for each word is comparable to the effort required to create a new dictionary. By focusing on only 100 or so polysemous words per evaluation, the annotating organization can afford a multi-pass study of, and detailed tagging guidelines for, the sense inventory present in the data for each target word. This would be prohibitively expensive to do for the full English vocabulary. Also, by utilizing different sets of words in each evaluation, such factors as the level of detail and the sources of the sense inventories may change without worrying about maintaining consistency with previous data.</Paragraph>
    <Paragraph position="20"> Fourth, both unsupervised and supervised WSD algorithms are better accommodated in terms of the amount of data available. Unsupervised algorithms  can be given very large quantities of training data: since they require no annotation the value of R can be quite large. And although supervised algorithms are typically plagued by sparse data, this approach will yield much larger training and testing sets per word, facilitating the exploration and development of data intensive supervised algorithms.</Paragraph>
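Step 7 of the framework compares algorithm performance against inter-judge reliability. A minimal sketch of Cohen's κ for two annotators, using made-up sense tags:

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Cohen's kappa: observed agreement between two annotators, corrected for
    the agreement expected by chance from their marginal tag distributions."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b.get(tag, 0) for tag in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical tags from two annotators over ten instances of one target word:
annotator_1 = ["s1", "s1", "s2", "s2", "s1", "s3", "s1", "s2", "s1", "s2"]
annotator_2 = ["s1", "s1", "s2", "s1", "s1", "s3", "s1", "s2", "s2", "s2"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # about 0.655
```

Per-word agreement figures of this kind, computed blindly before adjudication, give the trustworthy upper bound on performance discussed above.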
    <Paragraph position="21"> Proposal 4. A multilingual sense inventory for evaluation. One of the most fraught issues in applied lexical semantics is how to define word senses. Although we certainly do not propose a definitive answer to that question, we suggest here a general-purpose criterion that can be applied to existing sources of word senses in a way that, we believe, makes sense both for target applications and for evaluation, and is compatible with the major sources of available training and test data.</Paragraph>
    <Paragraph position="22"> The essence of the proposal is to restrict a word sense inventory to those distinctions that are typically lexicalized cross-linguistically. This cuts a middle ground between restricting oneself to homographs within a single language, which tends toward very coarse-grained distinctions, and an attempt to express all the fine-grained distinctions made in a language, as found in monolingual dictionaries. In practice the idea would be to define a set of target languages (and associated bilingual dictionaries), and then to require that any sense distinction be realized lexically in a minimum subset of those languages. This would eliminate many distinctions that are arguably better treated as regular polysemy. For example, table can be used to refer to both a physical object and a group of people: (1) a. The waiter put the food on the table.</Paragraph>
    <Paragraph position="23">  b. Then he told another table their food was almost ready.</Paragraph>
    <Paragraph position="24"> c. He finally brought appetizers to the table an hour later.</Paragraph>
    <Paragraph position="25"> In German the two meanings can actually be lexicalized differently (Tisch vs. Tischrunde). However, as such sense distinctions are typically conflated into a single word in most languages, and because even German can use Tisch in both cases, one could plausibly argue for a common sense inventory for evaluation that conflates these meanings.</Paragraph>
    <Paragraph position="26"> A useful reference source for both training and evaluation would be a table linking sense numbers in established lexical resources (such as WordNet or LDOCE) with these crosslinguistic translation distinctions. An example of such a map is given in Table 5; such a map could be extracted semi-automatically from bilingual dictionaries or from the EuroWordNet effort (Bloksma et al., 1996), which provides both semantic hierarchies and interlingual node linkages, currently for the languages Spanish, Italian, Dutch and English. We note that the table follows many lexical resources, such as the original WordNet, in being organized at the top level according to parts of speech. This seems to us a sensible approach to take for sense inventories, especially in light of Wilks and Stevenson's (1996) observation that part-of-speech tagging accomplishes much of the work of semantic disambiguation, at least at the level of homographs.</Paragraph>
    <Paragraph position="27"> Although cross-linguistic divergence is a significant problem, and 1-1 translation maps do not exist for all sense-language pairs, this table suggests how multiple parallel bilingual corpora for different language pairs can be used to yield sets of training data covering different subsets of the English sense inventory, which in aggregate may yield tagged data for all given sense distinctions even when any one language alone is not adequate.</Paragraph>
    <Paragraph position="28"> For example, a German-English parallel corpus could yield tagged data for Senses 1 and 2 of interest, and the presence of certain Spanish words (provecho, beneficio) aligned with interest in a Spanish-English corpus will tag some instances of Sense 5, with a Japanese-English aligned corpus potentially providing data for the remaining sense distinctions. In some cases it will not be possible to find any language (with adequate on-line parallel corpora) that lexicalizes some subtle English sense distinctions differently, but this may be evidence that the distinction is regular or subtle enough to be excluded or handled by other means.</Paragraph>
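A sketch of how such sense projection from aligned bilingual corpora might look. The sense map fragment is a hypothetical stand-in for Table 5, and the German entries (Zins, Anteil, Interesse) are illustrative assumptions rather than the paper's data:

```python
# Hypothetical fragment of a multilingual sense map for "interest" (a stand-in
# for Table 5): for each language, aligned target words whose presence
# indicates a particular English sense.
SENSE_MAP = {
    "interest": {
        "es": {"provecho": "sense 5", "beneficio": "sense 5"},
        "de": {"Zins": "sense 1", "Anteil": "sense 2"},
    }
}

def project_sense(english_word, language, aligned_translation, sense_map=SENSE_MAP):
    """Return the English sense implied by the aligned translation, or None
    if this language pair does not lexicalize the distinction."""
    return sense_map.get(english_word, {}).get(language, {}).get(aligned_translation)

print(project_sense("interest", "es", "beneficio"))  # 'sense 5'
print(project_sense("interest", "de", "Zins"))       # 'sense 1'
print(project_sense("interest", "de", "Interesse"))  # None -> distinction not captured
```

Running such a lookup over word-aligned corpora for several language pairs would, in aggregate, yield sense-tagged English training data covering complementary subsets of the sense inventory.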
    <Paragraph position="29"> Note that Table 5 is not intended for direct use in machine translation. Also note that when two word senses are in a cell they are not necessarily synonyms. In some cases they realize differences in meaning or contextual usage that are salient in the target language. However, at the level of sense distinction given in the table, they correspond to the same word senses in English, and the presence of either in an aligned bilingual corpus will indicate the same English word sense.</Paragraph>
    <Paragraph position="30"> Monolingual sense tagging of another language such as Spanish would yield a similar map, such as distinguishing the senses of the Spanish word dedo, which can mean 'finger' or 'toe'. Either English or German could be used to distinguish these senses, but not Italian or French, which share the same sense ambiguity.</Paragraph>
    <Paragraph position="31"> It would also be helpful for Table 5 to include alignments between multiple monolingual sense representations, such as COBUILD sense numbers, LDOCE tags or WordNet synsets, to support the sharing and leveraging of results between multiple systems. This brings to the fore an existing problem, of course: different sense inventories lead to different algorithmic biases. For example, WordNet as a sense inventory would tend to bias an evaluation in favor of algorithms that take advantage of taxonomic structure; LDOCE might bias in favor of algorithms that can take advantage of topical/subject codes, and so forth. Unfortunately we have no solution to propose for the problem of which representation (if any) should be the ultimate standard, and leave it as a point for discussion.</Paragraph>
  </Section>
</Paper>