<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3110"> <Title>A Large Scale Terminology Resource for Biomedical Text Processing</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> It has been widely recognized that the biomedical literature is now so large, and growing so quickly, that it is becoming increasingly difficult for researchers to access the published results that are relevant to their research. Consequently, any technology that can facilitate this access should help to increase research productivity. This has led to an increased interest in the application of natural language processing techniques for the automatic capture of biomedical content from journal abstracts, complete papers, and other textual documents (Gaizauskas et al., 2003; Hahn et al., 2002; Pustejovsky et al., 2002; Rindflesch et al., 2000).</Paragraph> <Paragraph position="1"> An essential processing step in these applications is the identification and semantic classification of technical terms in text, since these terms often point to entities about which information should be extracted. Proper semantic classification of terms also helps in resolving anaphora and extracting relations whose arguments are restricted semantically.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Challenge </SectionTitle> <Paragraph position="0"> Any technical domain generates very large numbers of terms - single or multiword expressions that have some specialised use or meaning in that domain. For example, the UMLS Metathesaurus (Humphreys et al., 1998), which provides a semantic classification of terms from a wide range of vocabularies in the clinical and biomedical domain, currently contains well over 2 million distinct English terms.</Paragraph> <Paragraph position="1"> For a variety of reasons, recognizing these terms in text is not a trivial task. 
First of all, terms are often long multi-token sequences, e.g. 3-methyladenine-DNA glycosylase I. Moreover, since terms are referred to repeatedly in discourse, there is a benefit in their being short and unambiguous, so they are frequently abbreviated and acronymized, e.g. CvL for chromobacterium viscosum lipase. However, abbreviations may not always occur together with their full forms in a text, the method of abbreviation is not predictable in all cases, and many three-letter abbreviations are highly overloaded.</Paragraph> <Paragraph position="2"> Terms are also subject to a high degree of orthographic variation as a result of the representation of non-Latin characters, e.g. α-helix vs. alpha-helix, capitalization, e.g. DNA vs. dna, hyphenation, e.g. anti-histamine vs. antihistamine, and British and American spelling variants, e.g. tumour vs. tumor. Furthermore, biomedical science is a dynamic field: new terms are constantly being introduced while old ones fall into disuse. Finally, certain classes of biomedical terms exhibit metonymy, e.g. when a protein is referred to by the gene that expresses it.</Paragraph> <Paragraph position="3"> To begin to address these issues in term recognition, we are building a large-scale resource for storing and recognizing technical terminology, called Termino. This resource must store complex, heterogeneous information about large numbers of terms. At the same time, term recognition must be performed in realistic times. Termino attempts to reconcile this tension by maintaining a flexible, extensible relational database for storing terminological information and compiling finite state machines from this database to do term look-up.</Paragraph> <Paragraph position="4"> HLT-NAACL 2004 Workshop: Biolink 2004, Linking Biological Literature, Ontologies and Databases, pp. 53-60. Association for Computational Linguistics.</Paragraph>
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Context </SectionTitle> <Paragraph position="0"> Termino is being developed in the context of two ongoing projects: CLEF, for Clinical E-Science Framework (Rector et al., 2003) and myGrid (Goble et al., 2003). Both these projects involve an Information Extraction component. Information Extraction is the activity of identifying pre-defined classes of entities and relationships in natural language texts and storing this information in a structured format enabling rapid and effective access to the information; see, e.g., Gaizauskas and Wilks (1998) and Grishman (1997).</Paragraph> <Paragraph position="1"> The goal of the CLEF project is to extract information from patient records regarding the treatment of cancer.</Paragraph> <Paragraph position="2"> The treatment of cancer patients may extend over several years and the resulting clinical record may include many documents, such as clinic letters, case notes, lab reports, discharge summaries, etc. These documents are generally full of medical terms naming entities such as body parts, drugs, problems (i.e. symptoms and diseases), investigations and interventions. Some of these terms are particular to the hospital from which the document originates. We aim to identify these classes of entities, as well as relationships between such entities, e.g. that an investigation has indicated a particular problem, which, in turn, has been treated with a particular intervention.
The information extracted from the patient records is potentially of value for immediate patient care, but can also be used to support longitudinal and epidemiological medical studies, and to assist policy makers and health care managers in regard to planning and clinical governance.</Paragraph> <Paragraph position="3"> The myGrid project aims to present research biologists with a unified workbench through which component bioinformatic services can be accessed using a workflow model. These services may be remotely located from the user and will be exploited via grid or web-service channels. A text extraction service will form one of these services and will facilitate access to information in the scientific literature. This text service comprises an off-line and an on-line component. The off-line component involves pre-processing a large biological sciences corpus, in this case the contents of Medline, in order to identify various biological entities such as genes, enzymes, and proteins, and relationships between them such as structural and locative relations. These entities and relationships are referred to in Medline abstracts by a very large number of technical terms and expressions, which contributes to the complexity of processing these texts. The on-line component supports access to the extracted information, as well as to the raw texts, via a SOAP interface to an SQL database.</Paragraph> <Paragraph position="4"> Despite the different objectives for text extraction within the CLEF and myGrid projects, many of the technical challenges they face are the same, such as the need for extensive capabilities to recognize and classify biomedical entities as described using complex technical terminology in text. As a consequence we are constructing a general framework for the extraction of information from biomedical text: AMBIT, a system for acquiring medical and biological information from text. 
An overview of the AMBIT logical architecture is shown in Figure 1.</Paragraph> <Paragraph position="5"> The AMBIT system contains several engines, of which Termino is one. The Information Extraction Engine pulls selected information out of natural language text and pushes this information into a set of pre-defined templates. These are structured objects which consist of one or more slots for holding the extracted entities and relations. The Query Engine allows users to access information through traditional free text search and search based on the structured information produced by the Information Extraction Engine, so that queries may refer to specific entities and classes of entities, and specific kinds of relations that are recognised to hold between them. The Text Indexing Engine is used to index text and extracted, structured information for the purposes of information retrieval. The AMBIT system contains two further components: an interface layer, which provides a web or grid channel to allow user and program access to the system; and a database which holds free text and structured information that can be searched through the Query Engine.</Paragraph> <Paragraph position="6"> Termino interacts with the Query Engine and the Text Indexing Engine to provide terminological support for query formulation and text indexation. It also provides knowledge for the Information Extraction Engine to use in identifying and classifying biomedical entities in text.</Paragraph> <Paragraph position="7"> The Terminology Engine can furthermore be called by users and remote programs to access information from the various lexical resources that are integrated in the terminological database.</Paragraph> </Section> </Section> </Paper>