<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1302">
  <Title>Unsupervised Monolingual and Bilingual Word-Sense Disambiguation of Medical Documents using UMLS</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper reports on experiments in monolingual and multilingual word sense disambiguation (WSD) in the medical domain using the Unified Medical Language System (UMLS). The work described was carried out as part of the MUCHMORE project for multilingual organisation and retrieval of medical information, for which WSD is particularly important.</Paragraph>
    <Paragraph position="1"> The importance of WSD to multilingual applications stems from the simple fact that meanings represented by a single word in one language may be represented by multiple words in other languages. For example, in German the English word drug would be translated as medikamente when it refers to medically therapeutic drugs, but as drogen when it refers to a recreationally taken narcotic substance of the kind that many governments prohibit by law.</Paragraph>
    <Paragraph position="2"> The ability to disambiguate is therefore essential to the task of machine translation: when translating from English to Spanish or from English to German we would need to make the distinctions mentioned above and other similar ones. Even short of the task of full translation, WSD is crucial to applications such as cross-lingual information retrieval (CLIR), since search terms entered in the language used for querying must be appropriately rendered in the language used for retrieval. WSD has become a well-established subfield of natural language processing with its own evaluation standards and SENSEVAL competitions (Kilgarriff and Rosenzweig, 2000).</Paragraph>
    <Paragraph position="3"> Methods for WSD can effectively be divided into those that require manually annotated training data (supervised methods) and those that do not (unsupervised methods) (Ide and Véronis, 1998). In general, supervised methods are less scalable than unsupervised methods because they rely on training data which may be costly and unrealistic to produce, and even then might be available for only a few ambiguous terms. The goal of our work on disambiguation in the MUCHMORE project is to enable the correct semantic annotation of entire document collections with all terms which are potentially relevant for organisation, retrieval and summarisation of information. Therefore a decision was taken early on in the project that we should focus on unsupervised methods, which have the potential to be scaled up enough to meet our needs.</Paragraph>
    <Paragraph position="4"> This paper is arranged as follows. In Section 2 we describe the lexical resource (UMLS) and the corpora we used for our experiments. We then describe and evaluate three different methods for disambiguation. The bilingual method (Section 3) takes advantage of our having a translated corpus, because knowing the translation of an ambiguous word can be enough to determine its sense. The collocational method (Section 4) uses the occurrence of a term in a recognised fixed expression to determine its meaning.</Paragraph>
    <Paragraph position="5"> UMLS relation based methods (Section 5) use relations between terms in UMLS to determine which sense is being used in a particular instance. Other techniques used in the MUCHMORE project include domain-specific sense selection (Buitelaar and Sacaleanu, 2001), used to select senses appropriate to the medical domain from a general lexical resource, and instance-based learning, a machine-learning technique that has been adapted for word-sense disambiguation (Widdows et al., 2003).</Paragraph>
    <Paragraph position="6"> 2 Language resources used in these experiments</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Lexical Resource -- UMLS
</SectionTitle>
      <Paragraph position="0"> The Unified Medical Language System (UMLS) is a resource that contains linguistic, terminological and semantic information in the medical domain. It is organised in three parts: Specialist Lexicon, MetaThesaurus and Semantic Network. The MetaThesaurus contains concepts from more than 60 standardised medical thesauri, of which for our purposes we only use the concepts from MeSH (the Medical Subject Headings thesaurus). This decision is based on the fact that MeSH is also available in German. The semantic information that we use in annotation is the so-called Concept Unique Identifier (CUI), a code that represents a concept in the UMLS MetaThesaurus. We consider the possible 'senses' of a term to be the set of CUIs which list this term as a possible realisation. For example, UMLS contains the term trauma as a possible realisation of the following two concepts: C0043251 (Injuries and Wounds; Wounds and Injuries; trauma; traumatic disorders; Traumatic injury) and C0021501 (Physical Trauma; Trauma (Physical); trauma). Each of these CUIs is a possible sense of the term trauma. The term trauma is therefore noted as ambiguous, since it can be used to express more than one UMLS concept. The purpose of disambiguation is to find out which of these possible senses is actually being used in each particular context where the term trauma is used.</Paragraph>
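The definition of a term's senses as the set of CUIs that list it can be illustrated with a minimal Python sketch. It uses only the two trauma concepts quoted above; the dictionary layout, the function names and the lower-casing of terms are assumptions made for illustration, not the MUCHMORE implementation or the UMLS file format.

```python
from collections import defaultdict

# Toy stand-in for the MetaThesaurus: each CUI lists the terms that realise it.
# Only the two concepts quoted in the text are included.
CONCEPTS = {
    "C0043251": ["Injuries and Wounds", "Wounds and Injuries", "trauma",
                 "traumatic disorders", "Traumatic injury"],
    "C0021501": ["Physical Trauma", "Trauma (Physical)", "trauma"],
}

def build_sense_index(concepts):
    """Map each lower-cased term to the set of CUIs that list it (case folding is an assumption)."""
    index = defaultdict(set)
    for cui, terms in concepts.items():
        for term in terms:
            index[term.lower()].add(cui)
    return dict(index)

def is_ambiguous(term, index):
    """A term is ambiguous if it can realise more than one concept."""
    return len(index.get(term.lower(), ())) > 1

SENSES = build_sense_index(CONCEPTS)
print(sorted(SENSES["trauma"]))                    # both CUIs are candidate senses
print(is_ambiguous("trauma", SENSES))              # True
print(is_ambiguous("traumatic disorders", SENSES)) # False
```

In this toy index, trauma is the only term listed under both concepts, so it is the only one flagged as ambiguous.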
      <Paragraph position="1"> CUIs in UMLS are also linked to one another by a number of relations. These include: * 'Broader term', which is similar to the hypernymy relation in WordNet (Fellbaum, 1998). In general, x is a 'broader term' for y if every y is also a (kind of) x.</Paragraph>
      <Paragraph position="2"> * More generally, 'related terms' are listed, where possible relationships include 'is like', 'is clinically associated with'.</Paragraph>
      <Paragraph position="3"> * Cooccurring concepts, which are pairs of concepts that are linked in some information source. In particular, two concepts are regarded as cooccurring if they have both been used to manually index the same document in MEDLINE. We will refer to such pairs of concepts as coindexing concepts.</Paragraph>
      <Paragraph position="4"> * Collocations and multiword expressions. For example, the term liver transplant is included separately in UMLS, as well as both the terms liver and transplant. This information can sometimes be used for disambiguation.</Paragraph>
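The coindexing relation in the list above (two concepts that index the same document) can be sketched as follows. This is a hedged illustration only: the document ids and concept labels are placeholders rather than real MEDLINE data, and the function name coindexing_pairs is hypothetical.

```python
from itertools import combinations

def coindexing_pairs(doc_index):
    """Return the unordered concept pairs that index at least one common document.

    doc_index maps a document id to the set of concept ids used to index it.
    Each pair is stored once, as a tuple in sorted order.
    """
    pairs = set()
    for concepts in doc_index.values():
        for a, b in combinations(sorted(concepts), 2):
            pairs.add((a, b))
    return pairs

# Placeholder index, invented for illustration (not real MEDLINE data).
DOC_INDEX = {
    "doc1": {"concept_A", "concept_B"},
    "doc2": {"concept_B", "concept_C", "concept_D"},
}

PAIRS = coindexing_pairs(DOC_INDEX)
print(("concept_A", "concept_B") in PAIRS)  # True: both index doc1
print(("concept_A", "concept_C") in PAIRS)  # False: no shared document
```

Storing each pair in sorted order keeps the relation symmetric without recording both directions.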
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Springer Corpus of Medical
Abstracts
</SectionTitle>
      <Paragraph position="0"> The experiments and implementations of WSD described in this paper were all carried out on a parallel corpus of English-German medical scientific abstracts obtained from the Springer Link web site. The corpus consists of approximately 1 million tokens for each language. Abstracts are from 41 medical journals, each of which constitutes a relatively homogeneous medical sub-domain (e.g. Neurology, Radiology, etc.). The corpus was automatically marked up with morphosyntactic and semantic information, as described by Špela Vintar et al. (2002). In brief, whenever a token is encountered in the corpus that is listed as a term in UMLS, the document is annotated with the CUI under which that term is listed. Ambiguity is introduced by this markup process because the lexical resources often list a particular term as a possible realisation of more than one concept or CUI, as with the trauma example above, in which case the document is annotated with all of these possible CUIs.</Paragraph>
      <Paragraph position="1"> The number of tokens of UMLS terms included by this annotation process is given in Table 1. The table shows how many tokens were found by the annotation process, listed according to how many possible senses each of these tokens was assigned in UMLS (so that the number of ambiguous tokens is the number of tokens with more than one possible sense). The greater number of concepts found in the English corpus reflects two facts: UMLS has greater coverage for English than for German, and many terms that English expresses as separate single words are expressed by larger compound terms in German (for example knee + joint = kniegelenk). Table 1 also shows how many tokens of UMLS concepts were in the annotated corpus after we applied the disambiguation process described in Section 5, which proved to be our most successful method. As can be seen, our disambiguation methods resolved some 83% of the ambiguities in the English corpus and 87% of the ambiguities in the German corpus (we refer to this proportion as the 'Coverage' of the method). However, this only measures the number of disambiguation decisions that were made: in order to determine how many of these decisions were correct, evaluation corpora were needed.</Paragraph>
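The 'Coverage' figure can be read as the share of ambiguous tokens for which a decision was made. The sketch below assumes it is computed from the counts of ambiguous tokens before and after disambiguation; the function name and the counts are invented for illustration and are not the figures from Table 1.

```python
def coverage(ambiguous_before, ambiguous_after):
    """Proportion of ambiguous tokens for which a disambiguation decision was made."""
    if ambiguous_before == 0:
        raise ValueError("no ambiguous tokens to disambiguate")
    return (ambiguous_before - ambiguous_after) / ambiguous_before

# Made-up counts, chosen only to illustrate the arithmetic.
print(round(coverage(1000, 170), 2))  # 0.83
```

Coverage says nothing about whether the decisions were correct, which is why a separate evaluation is required.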
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Evaluation Corpora
</SectionTitle>
      <Paragraph position="0"> An important aspect of word sense disambiguation is the evaluation of different methods and parameters.</Paragraph>
      <Paragraph position="1"> Unfortunately, there is a lack of test sets for evaluation, specifically for languages other than English and even more so for specific domains like medicine.</Paragraph>
      <Paragraph position="2"> Given that our work focuses on German as well as English text in the medical domain, we had to develop our own evaluation corpora in order to test our disambiguation methods.</Paragraph>
      <Paragraph position="3"> Because in the MUCHMORE project we developed an extensive format for linguistic and semantic annotation (Špela Vintar et al., 2002) that includes annotation with UMLS concepts, we could automatically generate lists of all ambiguous UMLS types (English and German) along with their token frequencies in the corpus. Using these lists we selected a set of 70 frequent types for English (token frequencies at least 28, 41 types having token frequencies over 100). For German, we only selected 24 ambiguous types (token frequencies at least 11, 7 types having token frequencies over 100) because there are fewer ambiguous terms in the German annotation (see Table 1). We automatically selected instances to be annotated using a random selection of occurrences if the token frequency was higher than 100, and using all occurrences if the token frequency was lower than 100. The level of ambiguity for these UMLS terms is mostly limited to only 2 senses; only 7 English terms have 3 senses.</Paragraph>
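The instance selection rule just described can be sketched in Python. The text does not state how many occurrences were drawn for frequent types, nor what happens at a frequency of exactly 100, so sample_size=100, the fixed seed and the treatment of the boundary case (all occurrences kept) are assumptions of this sketch, as is the function name.

```python
import random

def select_instances(occurrences, threshold=100, sample_size=100, seed=0):
    """Keep all occurrences of a rare type; draw a random sample for a frequent one.

    threshold follows the text; sample_size and seed are illustrative assumptions.
    """
    if len(occurrences) > threshold:
        return random.Random(seed).sample(occurrences, sample_size)
    return list(occurrences)

rare = [f"occ{i}" for i in range(28)]        # a type at the minimum English frequency
frequent = [f"occ{i}" for i in range(250)]   # a hypothetical frequent type
print(len(select_instances(rare)), len(select_instances(frequent)))  # 28 100
```

Fixing the seed makes the sample reproducible, which is convenient when several annotators must see the same instances.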
      <Paragraph position="4"> Correct senses of the English tokens in context were chosen by three medical experts, two native speakers of German and one of English. The German evaluation corpus was annotated by the two German speakers. Interannotator agreement for individual terms ranged from very low to very high, with an average of 65% for German and 51% for English (where all three annotators agreed). The reasons for this low score are still under investigation. In some cases, the UMLS definitions were insufficient to give a clear distinction between concepts, especially when the concepts came from different original thesauri. This allowed the decision of whether a particular definition gave a meaningful 'sense' to be more or less subjective. Approximately half of the disagreements between annotators occurred with terms where interannotator agreement was less than 10%, which is evidence that a significant amount of the disagreement between annotators was on the type level rather than the token level. In other cases, it is possible that there was insufficient contextual information provided for annotators to agree. If one of the annotators was unable to choose any of the senses and declared an instance to be 'unspecified', this also counted against interannotator agreement. Whatever is responsible, our interannotator agreement fell far short of the 88%-100% achieved in SENSEVAL (Kilgarriff and Rosenzweig, 2000, §7), and until this problem is solved or better datasets are found, this poor agreement casts doubt on the generality of the results obtained in this paper.</Paragraph>
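One way to compute the agreement figure described above is the fraction of instances on which every annotator chose the same sense. The sketch below assumes that definition and treats any 'unspecified' label as a disagreement, as the text indicates; the function name and the sample annotations are invented for illustration.

```python
def full_agreement_rate(annotations):
    """Fraction of instances on which all annotators chose the same concrete sense.

    annotations holds one tuple of sense labels per instance.
    Any 'unspecified' label counts as disagreement.
    """
    if not annotations:
        raise ValueError("no annotated instances")
    agreed = sum(
        1 for labels in annotations
        if "unspecified" not in labels and len(set(labels)) == 1
    )
    return agreed / len(annotations)

# Invented annotations for four instances of the term trauma by three annotators.
SAMPLE = [
    ("C0043251", "C0043251", "C0043251"),
    ("C0043251", "C0021501", "C0043251"),
    ("C0021501", "C0021501", "unspecified"),
    ("C0021501", "C0021501", "C0021501"),
]
print(full_agreement_rate(SAMPLE))  # 0.5
```

Computing this rate per term rather than over the whole corpus would expose the type-level disagreements mentioned above.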
      <Paragraph position="5"> A 'gold standard' was produced for the German UMLS evaluation corpus and used to evaluate the disambiguation of German UMLS concepts. The English experiments were evaluated on those tokens for which the annotators agreed. More details and discussion of the annotation process are available in the project report (Widdows et al., 2003).</Paragraph>
      <Paragraph position="6"> In the rest of this paper we describe the techniques that used these resources to build systems for word sense disambiguation, and evaluate their level of success.</Paragraph>
    </Section>
  </Section>
</Paper>