File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1302_metho.xml
Size: 20,578 bytes
Last Modified: 2025-10-06 14:08:34
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1302"> <Title>Unsupervised Monolingual and Bilingual Word-Sense Disambiguation of Medical Documents using UMLS</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Bilingual Disambiguation </SectionTitle> <Paragraph position="0"> The mapping between word-forms and senses differs across languages, and for this reason the importance of word-sense disambiguation has long been recognised for machine translation. By the same token, pairs of translated documents naturally contain information for disambiguation. For example, if in a particular context the English word drugs is translated into French as drogues rather than medicaments, then the English word drugs is being used to mean narcotics rather than medicines. This observation has been used for some years on varying scales. Brown et al. (1991) pioneered the use of statistical WSD for translation, building a translation model from one million sentences in English and French. Using this model to help with translation decisions (such as whether prendre should be translated as take or make) increased the number of acceptable translations produced by their system by 8%. Gale et al. (1992) use parallel translations to obtain training and testing data for word-sense disambiguation. Ide (1999) investigates the information made available by a translation of George Orwell's Nineteen Eighty-Four into six languages, using this to analyse the related senses of nine ambiguous English words into hierarchical clusters.</Paragraph> <Paragraph position="1"> These applications have all been case studies of a handful of particularly interesting words.
The large scale of the semantic annotation carried out by the MUCHMORE project has made it possible to extend the bilingual disambiguation technique to entire dictionaries and corpora.</Paragraph> <Paragraph position="2"> To disambiguate an instance of an ambiguous term, we consulted the translation of the abstract in which it appeared. We regarded the translated abstract as disambiguating the ambiguous term if it met the following two criteria: * Only one of the ambiguous term's CUIs was assigned to any term in the translated abstract.</Paragraph> <Paragraph position="3"> * At least one of the terms to which this CUI was assigned in the translated abstract was unambiguous (i.e. was not also assigned another CUI).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Results for Bilingual Disambiguation </SectionTitle> <Paragraph position="0"> We attempted both to disambiguate terms in the German abstracts using the corresponding English abstracts, and to disambiguate terms in the English abstracts using the corresponding German ones. In this collection of documents, we were able to disambiguate 1802 occurrences of 63 English terms and 1500 occurrences of 43 German terms. Comparing this with the evaluation corpora gave the results in the accompanying results table. As can be seen, the recall and coverage of this method are not especially good, but the precision (at least for English) is very high. The German results contain roughly the same proportion of correct decisions as the English, but many more incorrect ones as well.</Paragraph> <Paragraph position="1"> Our disambiguation results break down into three cases: 1. Terms ambiguous in one language that translate as multiple unambiguous terms in the other language; one of the meanings is medical and the other is not.</Paragraph> <Paragraph position="2"> 2.
Terms ambiguous in one language that translate as multiple unambiguous terms in the other language; both of the terms are medical.</Paragraph> <Paragraph position="3"> 3. Terms that are ambiguous between two meanings that are difficult to distinguish.</Paragraph> <Paragraph position="4"> One striking aspect of the results was that relatively few terms were disambiguated to different senses in different occurrences. This phenomenon was particularly extreme in disambiguating the German terms; of the 43 German terms disambiguated, 42 were assigned the same sense every time we were able to disambiguate them. Only one term, Metastase, was assigned different senses: 88 times it was assigned CUI C0027627 (&quot;The spread of cancer from one part of the body to another ...&quot;, associated with the English term Metastasis) and 6 times it was assigned CUI C0036525 (&quot;Used with neoplasms to indicate the secondary location to which the neoplastic process has metastasized&quot;, corresponding to the English terms metastatic and secondary). Metastase therefore falls into category 2 from above, although the distinction between the two meanings is relatively subtle.</Paragraph> <Paragraph position="5"> The first and third categories above account for the vast majority of cases, in which only one meaning is ever selected. It is easy to see why this would happen in the first category, and it is what we want to happen. (Footnote: ... according to the evaluation corpora; Recall is the proportion of instances in the evaluation corpora for which a correct decision was made, and Coverage is the proportion of instances in the evaluation corpora for which any decision was made. It follows that Recall = Precision × Coverage.)</Paragraph> <Paragraph position="6">
For instance, the German term Krebse can refer either to crabs (Crustaceans) or to cancerous growths; it is not surprising that only the latter meaning turns up in the corpus under consideration and that we can determine this from the unambiguous English translation cancers.</Paragraph> <Paragraph position="7"> In English, somewhat more terms were disambiguated in multiple ways: eight terms were assigned two different senses across their occurrences. All three types of ambiguity were apparent. For instance, the second type (medical/medical ambiguity) appeared for the term Aging, which can refer either to aging people (Alte Menschen) or to the process of aging itself (Altern); both meanings appeared in our corpus.</Paragraph> <Paragraph position="8"> In general, the bilingual method correctly finds the meanings of approximately one fifth of the ambiguous terms, and makes only a few mistakes for English but many more for German.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Collocational disambiguation </SectionTitle> <Paragraph position="0"> By a 'collocation' we mean a fixed expression formed by a group of words occurring together, such as blood vessel or New York. (For the purposes of this paper we only consider contiguous multiword expressions which are listed in UMLS.) There is a strong and well-known tendency for words to express only one sense in a given collocation. This property of words was first described and quantified by Yarowsky (1993), and has become known generally as the 'One Sense Per Collocation' property.</Paragraph> <Paragraph position="1"> Yarowsky (1995) used the one sense per collocation property as an essential ingredient for an unsupervised Word-Sense Disambiguation algorithm.
For example, the collocations plant life and manufacturing plant are used as 'seed-examples' for the living thing and building senses of plant, and these examples can then be used as high-precision training data to perform more general high-recall disambiguation.</Paragraph> <Paragraph position="2"> While Yarowsky's algorithm is unsupervised (the algorithm does not need a large collection of annotated training examples), it still needs direct human intervention to recognise which ambiguous terms are amenable to this technique, and to choose appropriate 'seed-collocations' for each sense. Thus the algorithm still requires expert human judgments, which leads to a bottleneck when trying to scale such methods to provide Word-Sense Disambiguation for a whole document collection.</Paragraph> <Paragraph position="3"> A possible method for widening this bottleneck is to use existing lexical resources to provide seed collocations. The texts of dictionary definitions have been used as a traditional source of information for disambiguation (Lesk, 1986). The richly detailed structure of UMLS provides a special opportunity to combine both of these approaches, because many multiword expressions and collocations are included in UMLS as separate concepts.</Paragraph> <Paragraph position="4"> For example, the term pressure has the following three senses in UMLS, each of which is assigned to a different semantic type (TUI). [Table: Sense of pressure | Semantic Type; surviving collocation entries: arterial pressure, lung pressure, intraocular pressure.] This leads to the hypothesis that the term pressure, when used in any of these collocations, is used with the meaning corresponding to the same semantic type.
This allows deductions of the following form: Collocation: bar pressure, mean pressure; Semantic type: Quantitative Concept; Sense of pressure: C0033095, physical pressure. Since nearly all English and German multiword technical medical terms are head-final, it follows that a multiword term is usually of the same semantic type as its head, the final word. (So for example, lung cancer is a kind of cancer, not a kind of lung.) For English, UMLS 2001 contains over 800,000 multiword expressions whose last word is also a term in UMLS. Over 350,000 of these expressions have a last word which on its own, with no other context, would be regarded as ambiguous (it has more than one CUI in UMLS). Over 50,000 of these multiword expressions are unambiguous, with a unique semantic type which is shared by only one of the meanings of the potentially ambiguous final word. The ambiguity of the final word in such multiword expressions is thus resolved, providing over 50,000 'seed collocations' for use in semantically annotating documents with disambiguated word senses.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Results for collocational disambiguation </SectionTitle> <Paragraph position="0"> Unfortunately, results for collocational disambiguation (Table 3) were disappointing compared with the promising number of seed collocations we expected to find. Precision was high, but comparatively few of the collocations suggested by UMLS were found in the Springer corpus.</Paragraph> <Paragraph position="1"> In retrospect, this may not be surprising given that many of the 'collocations' in UMLS are rather collections of words such as C0374270 intracoronary percutaneous placements single stent transcatheter vessel which would almost never occur in natural text. Thus very few of the potential collocations we extracted from UMLS actually occurred in the Springer corpus.
This scarcity was especially pronounced for German, because so many terms which are several words in English are compounded into a single word in German. For example, the term C0035330 retinal vessel does occur in the (English) Springer corpus and contains the ambiguous word vessel, whose ambiguity is successfully resolved using the collocational method. However, in German this concept is represented by the single word</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> C0035330 Retinagefaesse </SectionTitle> <Paragraph position="0"> and so this ambiguity never arises in the first place.</Paragraph> <Paragraph position="1"> It should still be remarked that the few decisions that were made by the collocational method were very accurate, demonstrating that we can get some high-precision results using this method. It is possible that recall could be improved by relaxing the conditions which a multiword expression in UMLS must satisfy to be used as a seed-collocation.</Paragraph> <Paragraph position="2"> 5 Disambiguation using related UMLS terms found in the same context. While the collocational method turned out to give disappointing recall, it showed that accurate information could be extracted directly from the existing UMLS and used for disambiguation, without extra human intervention or supervision. What we needed was advice on how to get more of this high-quality information out of UMLS, which we still believed to be a very rich source of information which we were not yet exploiting fully. Fortunately, no fewer than three additional sources of information for disambiguation using related terms from UMLS were suggested by a medical expert (footnote 5). The suggestion was that we should consider terms that were linked by conceptual relations (as given by the MRREL and MRCXT files in the UMLS source) and terms which were noted as coindexing concepts in the same MEDLINE abstract (as given by the MRCOC file in the UMLS source).
For each separate sense of an ambiguous word, this would give a set of related concepts, and if examples of any of these related concepts were found in the corpus near to one of the ambiguous words, it might indicate that the correct sense of the ambiguous word was the one related to this particular concept.</Paragraph> <Paragraph position="3"> This method is effectively one of the many variants of Lesk's (1986) original dictionary-based method for disambiguation, where the words appearing in the definitions of different senses of ambiguous words are used to indicate that those senses are being used if they are observed near the ambiguous word. However, we gain over purely dictionary-based methods because the words that occur in dictionary definitions rarely correspond well with those that occur in text. The information we collected from UMLS did not suffer from this drawback: the pairs of coindexing concepts from MRCOC were derived precisely from human judgements that these two concepts both occurred in the same text in MEDLINE.</Paragraph> <Paragraph position="4"> The disambiguation method proceeds as follows.</Paragraph> <Paragraph position="5"> For each ambiguous word w, we find its possible senses {s_j(w)}. For each sense s_j, find all CUIs in the MRREL, MRCXT or MRCOC files that are related to this sense, and call this set {c_rel(s_j)}. Then for each occurrence of the ambiguous word w in the corpus we examine the local context to see if a term t occurs whose sense (CUI; see footnote 6) is one of the concepts in {c_rel(s_j)}, and if so take this as positive evidence that the sense s_j is the appropriate one for this context, by increasing the score of s_j by 1. In this way, each sense s_j in context gets assigned a score which measures the number of terms in this context which are related to this sense.
Finally, choose the sense with the highest score.</Paragraph> <Paragraph position="6"> Footnote 5: Personal communication from Stuart Nelson (instrumental in the design of UMLS), at the MUCHMORE workshop in Croatia, September 2002.</Paragraph> <Paragraph position="7"> Footnote 6: This fails to take into account that the term t might itself be ambiguous; it is possible that results could be improved still further by allowing for mutual disambiguation of more than one term at once.</Paragraph> <Paragraph position="8"> One open question for this algorithm is what region of text to use as a context-window. We experimented with using sentences, documents and whole subdomains, where a 'subdomain' was considered to be all of the abstracts appearing in one of the journals in the Springer corpus, such as Arthroskopie or Der Chirurg. Thus our results (for each language) vary according to which knowledge sources were used (Conceptually Related Terms from MRREL and MRCXT or coindexing terms from MRCOC, or a combination), and according to whether the context-window for recording cooccurrence was a sentence, a document or a subdomain.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.1 Results for disambiguation based on related UMLS concepts </SectionTitle> <Paragraph position="0"> The results obtained using this method (Tables 5.1 and 5.1) were excellent, preserving (and in some cases improving) the high precision of the bilingual and collocational methods while greatly extending coverage and recall. The results obtained by using the coindexing terms for disambiguation were particularly impressive, which coincides with a long-held view in the field that terms which are topically related to a target word can be much richer clues for disambiguation than terms which are (say) hierarchically related.
We are very fortunate to have such a wealth of information about the cooccurrence of pairs of concepts through UMLS, which appears to have provided the benefits of cooccurrence data from a manually annotated training sample without having to perform the costly manual annotation.</Paragraph> <Paragraph position="1"> In particular, for English (Table 5.1), results were actually better using only coindexing terms rather than combining this information with hierarchically related terms: both precision and recall are best when using only the MRCOC knowledge source. As we had expected, recall and coverage increased but precision decreased slightly when using larger contexts. The German results (Table 5.1) were slightly different, and even more successful, with nearly 60% of the evaluation corpus being correctly disambiguated and nearly 80% of the decisions being correct. Here, there was some small gain when combining the knowledge sources, though the results using only coindexing terms were almost as good. For the German experiments, using larger contexts resulted in greater recall and greater precision. This was unexpected; one hypothesis is that the sparser coverage of the German UMLS contributed to less predictable results on the sentence level.</Paragraph> <Paragraph position="2"> These results are comparable with some of the better SENSEVAL results (Kilgarriff and Rosenzweig, 2000) which used fully supervised methods, though the comparison may not be accurate because we are choosing between fewer senses than on average in SENSEVAL, and because of the doubts over our interannotator agreement.</Paragraph> <Paragraph position="3"> Comparing these results with the number of words disambiguated in the whole corpus (Table 1), it is apparent that the average coverage of this method is actually higher for the whole corpus (over 80%) than for the words in the evaluation corpus.
It is possible that this reflects the fact that the evaluation corpus was specifically chosen to include words with 'interesting' ambiguities, which might include words which are more difficult than average to disambiguate. It is possible that over the whole corpus, the method actually works even better than on just the evaluation corpus.</Paragraph> <Paragraph position="4"> This technique is quite groundbreaking, because it shows that a lexical resource derived almost entirely from English data (MEDLINE indexing terms) could successfully be used for automatic disambiguation in a German corpus. (The alignment of documents and their translations was not even considered for these experiments, so the results do not depend at all on our having access to a parallel corpus.) This is because the UMLS relations are defined between concepts rather than between words. Thus if we know that there is a relationship between two concepts, we can use that relationship for disambiguation, even if the original evidence for this relationship was derived from information in a different language from the language of the document we are seeking to disambiguate. We are assigning the correct senses based not upon how terms are related in language, but upon how medical concepts are related to one another.</Paragraph> <Paragraph position="5"> It follows that this technique for disambiguation should be applicable to any language which UMLS covers, and applicable at very little cost. This proposal should stimulate further research and, not too far behind, successful practical implementation.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Summary and Conclusion </SectionTitle> <Paragraph position="0"> We have described three implementations of unsupervised word-sense disambiguation techniques for medical documents.
The bilingual method relies on the availability of a translated parallel corpus; the collocational and relational methods rely solely on the structure of UMLS, and could therefore be applied to new collections of medical documents without requiring any new resources. The method of disambiguation using relations between terms given by UMLS was by far the most successful method, achieving 74% precision, 66% coverage for English</Paragraph> </Section> </Paper>
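The relation-based scoring procedure of Section 5 can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function names, the `related` dictionary (standing in for relations read from MRREL, MRCXT or MRCOC) and the placeholder identifiers such as `REL_A` are all hypothetical. The two candidate CUIs are the senses of Metastase quoted in Section 3.1. Counting every occurrence of a related term in the context, and returning no decision on a tie or when no related term is found, are assumptions, since the text does not specify these cases.

```python
from collections import Counter


def score_senses(candidate_senses, related, context_cuis):
    """Score each candidate sense by counting related CUIs in the context.

    candidate_senses: CUIs that the ambiguous word may denote (the senses s_j).
    related: hypothetical dict mapping a sense CUI to the set of CUIs related
        to it (a stand-in for data drawn from MRREL, MRCXT or MRCOC).
    context_cuis: CUIs of the other terms found in the context window.
    """
    scores = Counter({sense: 0 for sense in candidate_senses})
    for cui in context_cuis:
        for sense in candidate_senses:
            if cui in related.get(sense, set()):
                scores[sense] += 1  # one point per related term in context
    return scores


def choose_sense(candidate_senses, related, context_cuis):
    """Return the top-scoring sense, or None if there is no evidence or a tie."""
    ranked = score_senses(candidate_senses, related, context_cuis).most_common()
    if not ranked or ranked[0][1] == 0:
        return None  # no related term was found in the context
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: abstain on ties
    return ranked[0][0]


if __name__ == "__main__":
    # Placeholder relation sets; real ones would come from the UMLS files.
    related = {
        "C0027627": {"REL_A", "REL_B"},
        "C0036525": {"REL_C"},
    }
    context = ["REL_A", "REL_C", "REL_B", "OTHER"]
    print(choose_sense(["C0027627", "C0036525"], related, context))
```

Abstaining when no sense wins outright reflects the distinction the paper draws between coverage and precision: a method that declines to decide lowers coverage but does not add incorrect decisions. The context window passed in as `context_cuis` could be a sentence, a document or a subdomain, as in the experiments described above.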