<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-3001">
  <Title>The Interaction of Knowledge Sources in Word Sense Disambiguation</Title>
  <Section position="5" start_page="328" end_page="329" type="metho">
    <SectionTitle>
4 An example of this situation is shown in the bottom row of Table 2.
</SectionTitle>
    <Paragraph position="0"> Table 2 Examples of the four word types introduced in Section 3.2. The leftmost column indicates the full set of homographs for the example words, with upper case indicating the correct homograph. The remaining columns show (respectively) the part-of-speech assigned by the tagger, the resulting set of senses after filtering, and the type of the word.
All homographs | PoS tag | After tagging | Word type
N, v, v | n | N | Full disambiguation
n, adj, V | v | V | Full disambiguation
n, V, v | v | V, v | Partial disambiguation
n, N, v | n | n, N | Partial disambiguation
N, n | n | N, n | No disambiguation
v, V | v | v, V | No disambiguation
N, v, v | v | v, v | PoS error
N, v, v | adj | N, v, v | PoS error
Table 3 Error analysis for the experiment on WSD by part of speech alone.</Paragraph>
  </Section>
  <Section position="6" start_page="329" end_page="338" type="metho">
    <SectionTitle>
4. A Sense Tagger which Combines Knowledge Sources
</SectionTitle>
    <Paragraph position="0"> We adopt a framework in which different knowledge sources are applied as separate modules. One type of module, a filter, can be used to remove senses from consideration when a knowledge source identifies them as unlikely in context. Another type can be used when a knowledge source provides evidence for a sense but cannot identify it confidently; we call these partial taggers (in the spirit of McCarthy's notion of &amp;quot;partial information&amp;quot; \[McCarthy and Hayes, 1969\]). The choice of whether to apply a knowledge source as either a filter or a partial tagger depends on whether it is likely to rule out correct senses. If a knowledge source is unlikely to reject the correct sense, then it can be safely implemented as a filter; otherwise implementation as a partial tagger would be more appropriate. In addition, it is necessary to represent the context of ambiguous words so that this information can be used in the disambiguation process.</Paragraph>
    <Paragraph position="1"> In the system described here these modules are referred to as feature extractors.</Paragraph>
    <Paragraph position="2"> Our sense tagger is implemented within this modular architecture, one where each module is a filter, partial tagger, or feature extractor. The architecture of the system is represented in Figure 2. This system currently incorporates a single filter (part-of-speech filter), three partial taggers (simulated annealing, subject codes, selectional restrictions) and a single feature extractor (collocation extractor).</Paragraph>
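To make this modular architecture concrete, the sketch below shows one way the three module types could be expressed as interfaces. It is a minimal sketch under assumed names (Filter, PartialTagger, FeatureExtractor, disambiguate_word); the paper does not specify its implementation at this level.

```python
# Illustrative sketch only: class and function names are assumptions,
# not the system's actual code.

class Filter:
    """Removes senses that a knowledge source judges impossible in context."""
    def filter(self, word, senses, context):
        raise NotImplementedError

class PartialTagger:
    """Returns the subset of senses a knowledge source gives evidence for."""
    def tag(self, word, senses, context):
        raise NotImplementedError

class FeatureExtractor:
    """Represents the context of an ambiguous word (e.g. collocations)."""
    def extract(self, word, context):
        raise NotImplementedError

def disambiguate_word(word, senses, context, filters, partial_taggers, extractors):
    # Filters may safely discard senses; partial taggers only supply evidence,
    # which is later combined by a learning algorithm (Section 4.7).
    for f in filters:
        senses = f.filter(word, senses, context)
    evidence = [t.tag(word, senses, context) for t in partial_taggers]
    features = [e.extract(word, context) for e in extractors]
    return senses, evidence, features
```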
    <Section position="1" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
4.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> Before the filters or partial taggers are applied, the text is tokenized, lemmatized, split into sentences, and part-of-speech tagged, again using Brill's tagger. A named entity identifier is then run over the text to mark and categorize proper names, which will provide information for the selectional restrictions partial tagger (see Section 4.4).</Paragraph>
      <Paragraph position="1"> These preprocessing stages are carried out by modules from Sheffield University's Information Extraction system, LaSIE, and are described in more detail by Gaizauskas et al. (1996).</Paragraph>
      <Paragraph position="2"> Our system disambiguates only the content words in the text, and the part-of-speech tags are used to decide which are content words. There is no attempt to disambiguate any of the words identified as part of a named entity. These are excluded because they have already been analyzed semantically by means of the classification added by the named entity identifier (see Section 4.4). Another reason for not attempting WSD on named entities is that when words are used as names they are not being used in any of the senses listed in a dictionary. For example, Rose and May are names but there are no senses in LDOCE for this usage. It may be possible to create a dummy entry in the set of LDOCE senses indicating that the word is being used as a name, but then the sense tagger would simply repeat work carried out by the named entity identifier.</Paragraph>
    </Section>
    <Section position="2" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
4.2 Part-of-Speech filtering
</SectionTitle>
      <Paragraph position="0"> We take the part-of-speech tags assigned by the Brill tagger and use a manually created mapping to translate these to the corresponding LDOCE grammatical category (see Section 3.2). Any senses which do not correspond to the category returned are removed from consideration. In practice, the filtering is carried out at the same time as the lexical lookup phase and the senses whose grammatical categories do not correspond to the tag assigned are never attached to the ambiguous word. There is also an option of turning off filtering so that all senses are attached regardless of the part-of-speech tag.</Paragraph>
      <Paragraph position="1"> If none of the dictionary senses for a given word agree with the part-of-speech tag then all are kept.</Paragraph>
      <Paragraph position="2"> It could be reasonably argued that removing senses is a dangerous strategy since, if the part-of-speech tagger made an error, the correct sense could be removed from consideration. However, the experiments described in Section 3.2 indicate that part-of-speech information is unlikely to reject the correct sense and can be safely implemented as a filter.</Paragraph>
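As a rough illustration of this filtering step, the following sketch maps tagger output to LDOCE grammatical categories and discards non-matching senses, retaining all senses when none agree, as described above. The tag-to-category table is a toy assumption standing in for the manually created mapping.

```python
# Toy sketch of the part-of-speech filter; BRILL_TO_LDOCE is an assumed
# fragment, not the manually created mapping used in the system.

BRILL_TO_LDOCE = {
    "NN": "n", "NNS": "n",              # nouns
    "VB": "v", "VBD": "v", "VBZ": "v",  # verbs
    "JJ": "adj",                        # adjectives
    "RB": "adv",                        # adverbs
}

def pos_filter(senses, brill_tag, filtering_on=True):
    """`senses` is a list of (sense_id, ldoce_category) pairs.

    Senses whose category disagrees with the tag are removed; if no sense
    agrees, or filtering is switched off, all senses are retained."""
    if not filtering_on:
        return senses
    category = BRILL_TO_LDOCE.get(brill_tag)
    kept = [s for s in senses if s[1] == category]
    return kept if kept else senses
```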
    </Section>
    <Section position="3" start_page="331" end_page="332" type="sub_section">
      <SectionTitle>
4.3 Optimizing Dictionary Definition Overlap
</SectionTitle>
      <Paragraph position="0"> Lesk (1986) proposed that WSD could be carried out using an overlap count of content words in dictionary definitions as a measure of semantic closeness. This method would tag all content words in a sentence with their senses from a dictionary that contains textual definitions. However, it was found that the computations which would be necessary to test every combination of senses, even for a sentence of modest length, were prohibitive.</Paragraph>
      <Paragraph position="1"> The approach was made practical by Cowie, Guthrie, and Guthrie (1992) (see also (Wilks, Slator, and Guthrie 1996)). Rather than computing the overlap for all possible combinations of senses, an approximate solution is identified by the simulated annealing optimization algorithm (Metropolis et al. 1953). Although this algorithm is not guaranteed to find the global solution to an optimization problem, it has been shown to find solutions that are not significantly different from the optimal one (Press et al. 1988). Cowie et al. used LDOCE for their implementation and found it correctly disambiguated 47% of words to the sense level and 72% to the homograph level</Paragraph>
      <Paragraph position="3"> [Figure 3 (diagram): Bruce and Guthrie's hierarchy of LDOCE semantic codes. The recoverable category labels are solid, liquid, gas, plant, animal, and human, with solid subdivided into movable solid and nonmovable solid, animal into animal male and animal female, and human into human male and human female.]</Paragraph>
      <Paragraph position="4"> when compared with manually assigned senses. The optimization must be carried out relative to a function that evaluates the suitability of a particular choice of senses. In the Cowie et al. implementation this was done using a simple count of the number of words (tokens) in common between all the definitions for a given choice of senses. However, this method prefers longer definitions, since they have more words that can contribute to the overlap, and short definitions or definitions by synonym are correspondingly penalized. We addressed this problem by computing the overlap in a different way: instead of each word contributing one, we normalized its contribution by the number of words in the definition it came from. In their implementation Cowie et al. also added pragmatic codes to the overlap computation; however, we prefer to keep different knowledge sources separate and use this information in another partial tagger (see Section 4.5). The Cowie et al. implementation returned one sense for each ambiguous word in the sentence without any indication of the system's confidence in its choice, but we adapted the system to return a set of suggested senses for each ambiguous word in the sentence.</Paragraph>
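To illustrate the normalized overlap described above, the sketch below scores one particular choice of senses; each shared word contributes 1/length of the definition it came from rather than 1. The tokenization and the exact normalization are assumptions, and simulated annealing would search over combinations of senses to maximize such a score.

```python
# Sketch of a normalized definition-overlap score for one combination of
# senses. `definitions` holds one list of content-word tokens per
# ambiguous word in the sentence.

def normalized_overlap(definitions):
    score = 0.0
    for i, tokens in enumerate(definitions):
        # Words occurring in any of the *other* chosen definitions.
        others = set()
        for j, other in enumerate(definitions):
            if j != i:
                others.update(other)
        weight = 1.0 / max(len(tokens), 1)   # normalize by definition length
        score += weight * sum(1 for tok in tokens if tok in others)
    return score

# e.g. normalized_overlap([["move", "quickly", "foot"],
#                          ["path", "move", "along"]])
```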
    </Section>
    <Section position="4" start_page="332" end_page="335" type="sub_section">
      <SectionTitle>
4.4 Selectional Preferences
</SectionTitle>
      <Paragraph position="0"> Our next partial tagger returns the set of senses for each word that is licensed by selectional preferences (in the sense of Wilks 1975). LDOCE senses are marked with selectional restrictions expressed by 36 semantic codes not ordered in a hierarchy.</Paragraph>
      <Paragraph position="1"> However, the codes are clearly not of equal levels of generality; for example, the code H is used to represent all humans, while M represents human males. Thus for a restriction with type H, we would want to allow words with the more specific semantic class M to meet it. This can be computed if the semantic categories are organized into a hierarchy.</Paragraph>
      <Paragraph position="2"> Then all categories subsumed by another category will be regarded as satisfying the restriction. Bruce and Guthrie (1992) manually identified relations between the LDOCE semantic classes, grouping the codes into small sets with roughly the same meaning and attached descriptions; for example M, K are grouped as a pair described as &amp;quot;human male&amp;quot;. The hierarchy produced is shown in Figure 3.</Paragraph>
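A minimal sketch of the subsumption test this hierarchy supports is given below. The parent table is a small illustrative fragment; the label chosen for the ancestor of the human codes is an assumption rather than a code read off Figure 3.

```python
# Toy fragment of the semantic-code hierarchy; the "ANIMATE" ancestor
# label is an assumption for illustration.

PARENT = {
    "M": "H",         # human male   -> human
    "F": "H",         # human female -> human
    "H": "ANIMATE",   # human        -> assumed ancestor
}

def satisfies(code, restriction):
    """A code satisfies a restriction if it equals the restriction or is
    subsumed by it (the restriction is an ancestor in the hierarchy)."""
    while code is not None:
        if code == restriction:
            return True
        code = PARENT.get(code)
    return False

# satisfies("M", "H") -> True: a human-male noun meets a restriction
# asking for any human.
```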
      <Paragraph position="3">  The named entities identified as part of the preprocessing phase (Section 4.1) are used by this module, which requires first a mapping between the name types and LDOCE semantic codes, shown in Table 4.</Paragraph>
      <Paragraph position="4"> Any use of preferences for sense selection requires prior identification of the site in the sentence where such a relationship holds. Although prior identification was not done by syntactic methods in Wilks (1975), it is often easiest to think of the relationships as specified in grammatical terms, e.g., as subject-verb, verb-object, adjective-noun etc. We perform this step by means of a shallow syntactic analyzer (Stevenson 1998) which finds the following grammatical relations: the subject, direct and indirect object of each verb (if any), and the noun modified by an adjective. Stevenson (1998) describes an evaluation of this system in which the relations identified were compared with those derived from Penn TreeBank parses (Marcus, Santorini, and Marcinkiewicz 1993). It was found that the parser achieved 51% precision and 69% recall.</Paragraph>
      <Paragraph position="5"> The preference resolution algorithm begins by examining a verb and the nouns it dominates. Each sense of the verb applies a preference to those nouns such that some of their senses may be disallowed. Some verb senses will disallow all senses of a particular noun they dominate; these verb senses are immediately rejected.</Paragraph>
      <Paragraph position="6"> This process leaves us with a set of verb senses that do not conflict with the nouns the verb governs, and a set of noun senses licensed by at least one of those verb senses. For each noun, we then check whether it is modified by an adjective. If it is, we reject any senses of the adjective which do not agree with any of the remaining noun senses. This approach is rather conservative in that it does not reject a sense unless it is impossible for it to fit into the preference pattern of the sentence.</Paragraph>
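The sketch below illustrates the verb-noun part of this conservative procedure (the adjective step is omitted for brevity). The data structures and example codes are assumptions, and satisfies() stands for the hierarchy check sketched earlier.

```python
# Illustrative sketch of preference resolution between a verb and the
# nouns it dominates; representations are assumptions.

def resolve_preferences(verb_senses, dominated_nouns, satisfies):
    """`verb_senses` maps a verb sense to its restriction per slot, e.g.
    {"run_1": {"subj": "H", "obj": "T"}}. `dominated_nouns` maps a slot to
    the candidate senses of the noun filling it, e.g.
    {"subj": {"John_1": "M"}, "obj": {"course_1": "J", "course_2": "T"}}."""
    surviving_verbs = set()
    licensed_nouns = {slot: set() for slot in dominated_nouns}
    for vsense, restrictions in verb_senses.items():
        allowed_per_slot = {}
        for slot, noun_senses in dominated_nouns.items():
            restriction = restrictions.get(slot)
            allowed_per_slot[slot] = {
                ns for ns, code in noun_senses.items()
                if restriction is None or satisfies(code, restriction)
            }
        if all(allowed_per_slot[slot] for slot in dominated_nouns):
            # Consistent with at least one sense of every dominated noun.
            surviving_verbs.add(vsense)
            for slot, allowed in allowed_per_slot.items():
                licensed_nouns[slot] |= allowed
        # Otherwise this verb sense is rejected immediately.
    return surviving_verbs, licensed_nouns
```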
      <Paragraph position="7"> In order to explain this process more fully we provide a walk-through explanation of the procedure applied to a toy example shown in Table 5. It is assumed that the named-entity identifier has correctly identified John as a person and that the shallow parser has found the correct syntactic relations. In order to make this example as straightforward as possible, we consider only the case in which the ambiguous words have few senses. The disambiguation process operates by considering the relations between the words in known grammatical relations, and before it begins we have essentially a set of possible senses for each word related via their syntax. This situation is represented by the topmost tree in Figure 4.</Paragraph>
      <Paragraph position="8"> Disambiguation is carried out by considering each verb sense in turn, beginning with run(1). As run is being used transitively, it places two restrictions on the sentence: the subject must satisfy the restriction human and the object the restriction abstract.</Paragraph>
      <Paragraph position="10"> [Table 5 (toy example; recoverable entries): proper name; to control an organisation, "run IBM"; to move quickly by foot, "run a marathon"; undulating terrain, "hilly road". Figure 4: Restriction resolution in toy example.] In this example, John has been identified as a named entity and marked as human, so the subject restriction is not broken. Note that, if the restriction were broken, then the verb sense run(1) would be marked as incorrect by this partial tagger and no further attempt would be made to resolve its restrictions. As this was not the case, we consider the direct-object slot, which places the restriction abstract on the noun which fills it. course(2) fulfils this criterion. course is also modified by hilly, which expects a noun of type nonmovable solid. However, course(2) is marked abstract, which does not comply with this restriction. Therefore, assuming that run is being used in its first sense leads to a situation in which there is no set of senses which comply with all the restrictions placed on them; run(1) is therefore not the correct sense of run and the partial tagger marks this sense as wrong. This situation is represented by the tree at the bottom left of Figure 4. The sense course(2) is not rejected at this point since it may be found to be acceptable in the configuration of senses for another sense of run. The algorithm now assumes that run(2) is the correct sense. This implies that course(1) is the correct sense as it complies with the inanimate restriction that that verb sense places on the direct object. As well as complying with the restriction imposed by run(2), course(1) also complies with the one imposed by hilly(1), since nonmovable solid is subsumed by inanimate. Therefore, assuming that the senses run(2) and course(1) are being used does not lead to any restrictions being broken and the algorithm marks these as correct.</Paragraph>
      <Paragraph position="11"> Before leaving this example it is worth discussing a few additional points. The sense course(2) is marked as incorrect because there is no sense of run with which an interpretation of the sentence can be constructed using course(2). If there were further senses of run in our example, and course(2) were found to be suitable for those extra senses, then the algorithm would mark the second sense of course as correct. There is, however, no condition under which run(1) could be considered as correct through the consideration of further verb senses. Also, although John and hilly are not ambiguous in this example, they still participate in the disambiguation process. In fact they are vital to its success, as the correct senses could not have been identified without considering the restrictions placed by the adjective hilly.</Paragraph>
      <Paragraph position="12"> This partial tagger returns, for all ambiguous noun, verb, and adjective occurrences in the text, the set of senses which satisfy the preferences imposed on those words. Adverbs do not have any selectional preferences in LDOCE and so are ignored by this partial tagger.</Paragraph>
    </Section>
    <Section position="5" start_page="335" end_page="336" type="sub_section">
      <SectionTitle>
4.5 Subject Codes
</SectionTitle>
      <Paragraph position="0"> Our final partial tagger is a re-implementation of the algorithm developed by Yarowsky (1992). This algorithm is dependent upon a categorization of words in the lexicon into subject areas--Yarowsky used the large categories of Roget's Thesaurus. In LDOCE, primary pragmatic codes indicate the general topic of a text in which a sense is likely to be used. For example, LN means &amp;quot;Linguistics and Grammar&amp;quot; and this code is assigned to some senses of words such as &amp;quot;ellipsis&amp;quot;, &amp;quot;ablative&amp;quot;, &amp;quot;bilingual&amp;quot; and &amp;quot;intransitive&amp;quot;. Roget is a thesaurus, so each entry in the lexicon belongs to one of the large categories; but over half (56%) of the senses in LDOCE are not assigned a primary code. We therefore created a dummy category, denoted by --, to indicate a sense which is not associated with any specific subject area; this category is assigned to all senses without a primary pragmatic code. These differences between the structures of LDOCE and Roget meant that we had to adapt the original algorithm reported in Yarowsky (1992).</Paragraph>
      <Paragraph position="1"> In Yarowsky's implementation, the correct subject category is estimated by applying (6), which maximizes the sum of a Bayesian term (the fraction on the right) over the words in the context. However, LDOCE exhibits an extremely skewed distribution of codes across senses, and Yarowsky's assumption that subject codes occur with equal probability is unlikely to be useful in this application. We gained a rough estimate of the probability of each subject category by determining the proportion of senses in LDOCE to which it was assigned and applying the maximum likelihood estimate. It was found that results improved when this rough estimate of the likelihood of pragmatic codes was used. This procedure generates estimates based on counts of types, and it is possible that the estimate could be improved by counting tokens, although the problem of polysemy in the training data would have to be overcome in some way.</Paragraph>
      <Paragraph position="2"> The algorithm relies upon the calculation of probabilities gained from corpus statistics: Yarowsky used the Grolier's Encyclopaedia, which comprised a 10 million word corpus. Our implementation used nearly 14 million words from the non-dialogue portion of the British National Corpus (Burnard 1995). Yarowsky used smoothing procedures to compensate for data sparseness in the training corpus (detailed in Gale, Church, and Yarowsky \[1992b\]), which we did not implement. Instead, we attempted to avoid this problem by considering only words which appeared at least 10 times in the training contexts of a particular word. A context model is created for each pragmatic code by examining 50 words on either side of any word in the corpus containing a sense marked with that code. Disambiguation is carried out by examining the same 100 word context window for an ambiguous word and comparing it against the models for each of its possible categories. Further details may be found in Yarowsky (1992).</Paragraph>
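As a very rough sketch of how such a comparison of context models might look, the code below scores each candidate pragmatic code over the context window; the model representation, weighting, and treatment of rare words are simplifications rather than the exact procedure of Yarowsky (1992) or our adaptation of it.

```python
import math

# `code_models` maps a pragmatic code to per-word weights standing in for
# Pr(w|code)/Pr(w); `code_prior` holds the rough maximum-likelihood
# estimate of each code described above. Both are assumptions.

def best_pragmatic_code(context_words, candidate_codes, code_models, code_prior):
    def score(code):
        model = code_models.get(code, {})
        total = math.log(code_prior.get(code, 1e-9))
        for w in context_words:
            weight = model.get(w)
            if weight is not None:       # skip words unseen in training
                total += math.log(weight)
        return total
    return max(candidate_codes, key=score)

# The partial tagger then returns every sense of the ambiguous word that
# carries the winning code; several senses may share it.
```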
      <Paragraph position="3"> Yarowsky reports 92% correct disambiguation over 12 test words, with an average of three possible Roget large categories. However, LDOCE has a higher level of average ambiguity and does not contain as complete a thesaural hierarchy as Roget, so we would not expect such good results when the algorithm is adapted to LDOCE. Consequently, we implemented the approach as a partial tagger. The algorithm identifies the most likely pragmatic code and returns the set of senses which are marked with that code. In LDOCE, several senses of a word may be marked with the same pragmatic code, so this partial tagger may return more than one sense for an ambiguous word.</Paragraph>
    </Section>
    <Section position="6" start_page="336" end_page="336" type="sub_section">
      <SectionTitle>
4.6 Collocation Extractor
</SectionTitle>
      <Paragraph position="0"> The final disambiguation module is the only feature extractor in our system and is based on collocations. A set of 10 collocates is extracted for each ambiguous word in the text: first word to the left, first word to the right, second word to the left, second word to the right, first noun to the left, first noun to the right, first verb to the left, first verb to the right, first adjective to the left, and first adjective to the right. Some of these types of collocation were also used by Brown et al. (1991) and Yarowsky (1993) (see Section 2.3). All collocates are searched for within the sentence which contains the ambiguous word. If some particular collocation does not exist for an ambiguous word, for example if it is the first or last word in a sentence, then a null value (NoColl) is stored instead. Rather than the surface form of the cooccurrence, morphological roots are stored, as this allows for a smaller set of collocations, helping to cope with data sparseness. The surface form of the ambiguous word is also extracted from the text and stored. The extracted collocations and surface form combine to represent the context of each ambiguous word.</Paragraph>
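A minimal sketch of this extraction step is shown below, assuming the sentence is available as a list of (lemma, part-of-speech) pairs with simplified tags; the representation is an assumption.

```python
# Sketch of the collocation extractor; `sentence` is a list of
# (lemma, pos) pairs and `i` indexes the ambiguous word.

NOCOLL = "NoColl"   # marker for a collocate missing from the sentence

def extract_collocations(sentence, i):
    def word_at(offset):
        j = i + offset
        return sentence[j][0] if 0 <= j < len(sentence) else NOCOLL

    def first_with_pos(pos, direction):
        j = i + direction
        while 0 <= j < len(sentence):
            if sentence[j][1] == pos:
                return sentence[j][0]
            j += direction
        return NOCOLL

    return [
        word_at(-1), word_at(+1),                      # 1st word left/right
        word_at(-2), word_at(+2),                      # 2nd word left/right
        first_with_pos("n", -1), first_with_pos("n", +1),
        first_with_pos("v", -1), first_with_pos("v", +1),
        first_with_pos("adj", -1), first_with_pos("adj", +1),
    ]
```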
    </Section>
    <Section position="7" start_page="336" end_page="338" type="sub_section">
      <SectionTitle>
4.7 Combining Disambiguation Modules
</SectionTitle>
      <Paragraph position="0"> The results from the disambiguation modules (filter, partial taggers, and feature extractor) are then presented to a machine learning algorithm, which combines them.</Paragraph>
      <Paragraph position="1"> The algorithm we chose was the TiMBL memory-based learning algorithm (Daelemans et al. 1999). Memory-based learning is another name for exemplar-based learning, as employed by Ng and Lee (Section 2.3). The TiMBL algorithm has already been used for various NLP tasks including part-of-speech tagging and PP-attachment (Daelemans et al. 1996; Zavrel, Daelemans, and Veenstra 1997).</Paragraph>
      <Paragraph position="2"> Like PEBLS, which formed the core of Ng and Lee's LEXAS system, TiMBL classifies new examples by comparing them against previously seen cases. The class of the most similar example is assigned. At the heart of this approach is the distance metric Δ(X, Y), which computes the similarity between instances X and Y. This measure is calculated using the weighted overlap metric shown in (8), which calculates the total distance by computing the sum of the distance between each position in the feature vector.</Paragraph>
      <Paragraph position="3"> \[ \Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i) \tag{8} \]
\[ \delta(x_i, y_i) = \frac{\left| x_i - y_i \right|}{\max_i - \min_i} \text{ if numeric; otherwise } \delta(x_i, y_i) = 0 \text{ if } x_i = y_i, \; 1 \text{ if } x_i \neq y_i \tag{9} \]</Paragraph>
      <Paragraph position="4"> From (9) we can see that TiMBL treats numeric and symbolic features differently. For numeric features, the unweighted distance is computed as the difference between the values for that feature in each instance, divided by the maximum possible distance computed over all pairs of instances in the database. 5 For symbolic features, the unweighted distance is 0 if they are identical, and 1 otherwise. For both numeric and symbolic features, this distance is multiplied by the weight for the particular feature, based on the Gain Ratio measure introduced by Quinlan (1993). This is a measure of the difference in uncertainty between the situations with and without knowledge of the value of that feature, as in (10).</Paragraph>
      <Paragraph position="5"> \[ w_i = \frac{H(C) - \sum_{v} \Pr(v) \times H(C \mid v)}{H(v)} \tag{10} \] where C is the set of classifications, v ranges over all values of the feature i, and H(C) is the entropy of the class labels. Probabilities are estimated from frequency of occurrence in the training data. The numerator of this formula determines the knowledge about the distribution of classes that is added by knowing the value of feature i. However, this measure can overestimate the value of features with large numbers of possible values. To compensate, it is divided by H(v), the entropy of the feature values.</Paragraph>
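The toy code below illustrates the measures in (8)-(10) for purely symbolic features. TiMBL computes these internally; this is only a sketch of the definitions, not of TiMBL itself.

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())

def gain_ratio(feature_values, classes):
    """Equation (10): information gain of a feature divided by H(v)."""
    gain = entropy(classes)
    total = len(classes)
    for v in set(feature_values):
        subset = [c for fv, c in zip(feature_values, classes) if fv == v]
        gain -= (len(subset) / total) * entropy(subset)
    h_v = entropy(feature_values)
    return gain / h_v if h_v > 0 else 0.0

def weighted_overlap_distance(x, y, weights):
    """Equations (8)-(9) restricted to symbolic features."""
    return sum(w * (0 if xi == yi else 1)
               for xi, yi, w in zip(x, y, weights))
```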
      <Paragraph position="6"> Word senses are presented to TiMBL in a feature-vector representation, with each sense which was not removed by the part of speech filter being represented by a separate vector. The vectors are formed from the following pieces of information in order: headword, homograph number, sense number, rank of sense (the order of the sense in the lexicon), part of speech from lexicon, output from the three partial taggers (simulated annealing, subject codes, and selectional restrictions), surface form of headword from the text, the ten collocates, and an indicator of whether the sense is appropriate or not in the context (correct or incorrect).</Paragraph>
      <Paragraph position="7"> Figure 5 shows the feature vectors generated for the word influence in the context shown. The final value in the feature vector shows whether the sense is correct or not in the particular context. We can see that, in this case, there is one correct sense, influence_1_1a, the definition of which is &amp;quot;power to gain an effect on the mind of or get results from, without asking or doing anything&amp;quot;.
influence 1 1a 1 n influences 1 12.03 y NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate correct
influence 1 1b 2 n influences 0 12.03 y NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate incorrect
influence 1 2 3 n influences 0 12.03 y NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate incorrect
influence 1 3 4 n influences 0 12.03 y NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate incorrect
influence 1 4 5 n influences 0 12.03 n NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate incorrect
influence 1 5 6 n influences 0 12.03 n NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate incorrect
influence 1 6 7 n influences 0 12.03 n NoColl manner NoColl eliminate NoColl in NoColl political NoColl eliminate incorrect
Figure 5 Example feature-vector representation.</Paragraph>
      <Paragraph position="8"> Features 10-19 are produced by the collocation extractor, and these are identical since each vector is taken from the same context. Features 7-9 show the results of the partial taggers. The first is the output from simulated annealing, the second the subject code, and the third the selectional restrictions. All noun senses of influence share the same pragmatic code (--), and consequently this partial tagger returns the same score for each sense.</Paragraph>
      <Paragraph position="9"> A final point worth noting is that in LDOCE, influence has a verb sense which the part-of-speech filter removed from consideration, and consequently this sense is not included in the feature-vector representation.</Paragraph>
      <Paragraph position="10"> The TiMBL algorithm is trained on tokens presented in this format. When disambiguating unannotated text, the algorithm is applied to data presented in the same format without the classification. The unclassified vectors are then compared with all the training examples, and each is assigned the class of the closest one.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="338" end_page="341" type="metho">
    <SectionTitle>
5. Evaluation Strategy
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="338" end_page="340" type="sub_section">
      <SectionTitle>
5.1 Evaluation Corpus
</SectionTitle>
      <Paragraph position="0"> The evaluation of WSD algorithms has recently become a much-studied area. Gale, Church, and Yarowsky (1992a), Resnik and Yarowsky (1997), and Melamed and Resnik (2000) each presented arguments for adopting various evaluation strategies, with Resnik and Yarowsky's proposal directly influencing the set-up of SENSEVAL (Kilgarriff 1998). At the heart of their proposals is the ability of human subjects to mark up text with the phenomenon in question (WSD in this case) and evaluate the results of computation. This linguistic phenomenon has proved to be far more elusive and complex than many others. We have discussed this at length elsewhere (Wilks 1997) and will assume here that humans can mark up text for senses to a sufficient degree.</Paragraph>
      <Paragraph position="1"> Kilgarriff (1993) questioned the possibility of creating sense-tagged texts, claiming the task to be impossible. However, it should be borne in mind that no alternative has yet been widely accepted and that Kilgarriff himself used the markup-and-test model for SENSEVAL. In the following discussion we compare the evaluation methodology adopted here with those proposed by others.</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 27, Number 3 The standard evaluation procedure for WSD is to compare the output of the system against gold standard texts, but these are very labor-intensive to obtain; lexical semantic markup is generally considered to be a more difficult and time-consuming task than part-of-speech markup (Fellbaum et al. 1998). Rather than expend a vast amount of effort on manual tagging we decided to combine two existing resources: SEMCOR (Landes, Leacock, and Tengi 1998), and SENSUS (Knight and Luk 1994).</Paragraph>
      <Paragraph position="3"> SEMCOR is a 200,000-word corpus with the content words manually tagged as part of the WordNet project. The semantic tagging was carried out by trained lexicographers under disciplined conditions that attempted to keep tagging inconsistencies to a minimum. SENSUS is a large-scale ontology designed for machine translation and was itself produced by merging the ontological hierarchies of WordNet, LDOCE (as derived by Bruce and Guthrie, see Section 4.4), and the Penman Upper Model (Bateman et al. 1990) from ISI. To facilitate the merging of these three resources to produce SENSUS, Knight and Luk were required to derive a mapping between WordNet and LDOCE senses. We used this mapping to translate the WordNet-tagged content words in SEMCOR to LDOCE tags.</Paragraph>
      <Paragraph position="4"> The mapping of senses is not one-to-one, and some WordNet synsets are mapped onto two or three LDOCE senses when WordNet does not distinguish between them.</Paragraph>
      <Paragraph position="5"> The mapping also contained significant gaps, chiefly words and senses not in the translation scheme. SEMCOR contains 91,808 words tagged with WordNet synsets, 6,071 of which are proper names, which we ignored, leaving 85,737 words which could potentially be translated. The translation contains only 36,869 words tagged with LDOCE senses; however, this is a reasonable size for an evaluation corpus for the task, and it is several orders of magnitude larger than those used by other researchers working in large vocabulary WSD, for example Cowie, Guthrie, and Guthrie (1992), Harley and Glennon (1997), and Mahesh et al. (1997). This corpus was also constructed without the excessive cost of additional hand-tagging and does not introduce any of the inconsistencies that can occur with a poorly controlled tagging strategy.</Paragraph>
      <Paragraph position="6"> Resnik and Yarowsky (1997) proposed to evaluate large vocabulary WSD systems by choosing a set of test words and providing annotated test and training examples for just these words, allowing supervised and unsupervised algorithms to be tested on the same vocabulary. This model was implemented in SENSEVAL (Kilgarriff 1998).</Paragraph>
      <Paragraph position="7"> However, for the evaluation of the system presented here, there would have been no benefit from using this strategy since it still involves the manual tagging of large amounts of data and this effort could be used to create a gold standard corpus in which all content words are disambiguated. It is possible that some computational techniques may evaluate well over a small vocabulary but may not work for a large set of words, and the evaluation strategy proposed by Resnik and Yarowsky will not discriminate between these cases.</Paragraph>
      <Paragraph position="8"> In our evaluation corpus, the most frequent ambiguous type is have, which appears 604 times. A large number of words (2407) occur only once, and nearly 95% have 25 occurrences or less. Table 6 shows the distribution of ambiguous types by number of corpus tokens. It is worth noting that, as would be expected, the observed distribution is highly Zipfian (Zipf 1935).</Paragraph>
      <Paragraph position="9"> Differences in evaluation corpora make comparison difficult. However, some idea of the difficulty of WSD can be gained by calculating properties of the evaluation corpus. Gale, Church, and Yarowsky (1992a) suggest that the lowest level of performance which can be reasonably expected from a WSD system is that achieved by assigning the most likely sense in all cases. Since the first sense in LDOCE is usually the most frequent, we calculate this baseline figure using a heuristic which assumes the first sense is always correct. This is the same baseline heuristic we used for the experiments reported in Section 3, although those were for the homograph level. We applied the naive heuristic of always choosing the first sense in our corpus and found that 30.9% of senses were correctly disambiguated.</Paragraph>
      <Paragraph position="10"> Another measure that gives insight into an evaluation corpus is to count the average polysemy, i.e., the number of possible senses we can expect for each ambiguous word in the corpus. The average polysemy is calculated by counting the sum of possible senses for each ambiguous token and dividing by the number of tokens. This is represented by (11), where w ranges over all ambiguous tokens in the corpus, S(w) is the number of possible senses for word w, and N is the number of ambiguous tokens.</Paragraph>
      <Paragraph position="11"> \[ \text{Average polysemy} = \frac{\sum_{w \in \text{text}} S(w)}{N} \tag{11} \] The average polysemy for our evaluation corpus is 14.62.</Paragraph>
      <Paragraph position="12"> Our annotated corpus has the unusual property that more than one sense may be marked as correct for a particular token. This is an unavoidable side-effect of a mapping between lexicon senses which is not one-to-one. However, it does not imply that WSD is easier in this corpus than in one in which only a single sense is marked for each token, as can be shown from an imaginary example. The worst case for a WSD algorithm is when each of the possible semantic tags for a given word occurs with equal frequency in a corpus, and so the prior probabilities exhibit a uniform, uninformative distribution. Then a corpus with an average polysemy of 5, and 2 senses marked correct on each ambiguous token, will have a baseline not less than 40%.</Paragraph>
      <Paragraph position="13"> However, one with an average polysemy of 2, and only a single sense on each, will have a baseline of at least 50%. Test corpora in which each ambiguous token has exactly two senses were used by Brown et al. (1991), Yarowsky (1995) and others.</Paragraph>
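For concreteness, a toy computation of the first-sense baseline and the average polysemy of (11) could look as follows; the corpus representation (a list of ambiguous tokens, each with its candidate senses in lexicon order and the set of senses marked correct) is an assumption.

```python
# Toy computation of the corpus statistics discussed above.

def corpus_statistics(tokens):
    """`tokens` is a list of dicts such as
    {"senses": ["s1", "s2"], "correct": {"s1"}}."""
    ambiguous = [t for t in tokens if len(t["senses"]) > 1]
    n = len(ambiguous)
    first_sense_baseline = sum(
        1 for t in ambiguous if t["senses"][0] in t["correct"]) / n
    average_polysemy = sum(len(t["senses"]) for t in ambiguous) / n
    return first_sense_baseline, average_polysemy
```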
      <Paragraph position="14"> Our system was tested using a technique known as 10-fold cross validation. This process is carried out by splitting the available data into ten roughly equal subsets. One of the subsets is chosen as the test data and the TiMBL algorithm is trained on the remainder. This is repeated ten times, so that each subset is used as test data exactly once, and results are averaged across all of the test runs. This technique provides two advantages: first, the best use can be made of the available data, and second, the computed results are more statistically reliable than those obtained by simply setting aside a single portion of the data for testing.</Paragraph>
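A minimal sketch of this 10-fold cross-validation set-up is shown below; the fit/score learner interface is a hypothetical stand-in for TiMBL.

```python
import random

def ten_fold_cross_validation(examples, make_learner, k=10, seed=0):
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]     # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        learner = make_learner()               # fresh learner per run
        learner.fit(train)
        scores.append(learner.score(test))     # e.g. exact match accuracy
    return sum(scores) / k                     # averaged over the ten runs
```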
    </Section>
    <Section position="2" start_page="340" end_page="341" type="sub_section">
      <SectionTitle>
5.2 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The choice of scoring metric is an important one in the evaluation of WSD algorithms.</Paragraph>
      <Paragraph position="1"> The most commonly used metric is the ratio of words for which the system has assigned the correct sense to those which it attempted to disambiguate. Resnik and Yarowsky (1997) dubbed this the exact match metric, which is usually expressed as a percentage calculated according to the formula in (12).</Paragraph>
      <Paragraph position="2"> \[ \text{Exact match} = \frac{\text{Number of correctly assigned senses}}{\text{Number of senses assigned}} \times 100\% \tag{12} \] Resnik and Yarowsky criticize this metric because it assumes a WSD system commits to a particular sense. They propose an alternative metric based on cross-entropy that compares the probabilities for each sense as assigned by a WSD system against those in the gold standard text. The formula in (13) shows the method for computing this metric, where the WSD system has processed N words and Pr(cs_i) is the probability assigned to the correct sense of word i.</Paragraph>
      <Paragraph position="4"> \[ -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(cs_i) \tag{13} \]</Paragraph>
      <Paragraph position="5"> When a WSD system assigns only one sense to a word and that sense is incorrect, that word is scored as ∞. Consequently, the formula in (13) returns ∞ if there is at least one word in the test set for which the tagger assigns a zero probability to the correct sense. For WSD systems which assign exactly one sense to each word, this metric returns 0 if all words are tagged correctly, and ∞ otherwise. This metric is potentially very useful for the evaluation of WSD systems that return non-zero probabilities for each possible sense; however, it is not useful for the system presented in this paper, or others that are not based on probabilistic models.</Paragraph>
      <Paragraph position="6"> Melamed and Resnik (2000) propose a metric for scoring WSD output when there may be more than one correct sense in the gold standard text, as with the evaluation corpus we use. They mention that when a WSD system returns more than one sense it is difficult to tell if they are intended to be disjunctive or conjunctive. The score for a token is computed by dividing the number of correct senses identified by the algorithm by the total it returns, making the metric equivalent to precision in information retrieval (van Rijsbergen 1979). 6 For systems which return exactly one sense for each word, this equates to scoring a token as 1 if the sense returned is correct, and 0 otherwise. For the evaluation of the system presented here, the metric proposed by Melamed and Resnik is then equivalent to the exact match metric.</Paragraph>
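A sketch of this scoring scheme is shown below, under an assumed representation of system output and gold standard as parallel lists of sense sets; when exactly one sense is returned per token it reduces to the exact match metric in (12).

```python
# Sketch of precision-style scoring over tokens with possibly several
# correct senses; data structures are assumptions.

def score_token(returned_senses, gold_senses):
    if not returned_senses:
        return 0.0
    hits = sum(1 for s in returned_senses if s in gold_senses)
    return hits / len(returned_senses)

def exact_match_percentage(system_output, gold_standard):
    """Parallel lists: one set of senses per attempted token."""
    scores = [score_token(out, gold)
              for out, gold in zip(system_output, gold_standard)]
    return 100.0 * sum(scores) / len(scores)
```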
      <Paragraph position="7"> The exact match metric has the advantage of being widely used in the WSD literature. In our experiments the exact match figure is computed at the LDOCE sense level, where the number of tokens correctly disambiguated to the sense level is divided by the number ambiguous at that level. At the homograph level, the number correctly disambiguated to the homograph is divided by the number which are polyhomographic.</Paragraph>
    </Section>
  </Section>
</Paper>