<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2702"> <Title>Annotation and Disambiguation of Semantic Types in Biomedical Text: a Cascaded Approach to Named Entity Recognition</Title> <Section position="2" start_page="0" end_page="11" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Publishers of biomedical journals have widely adopted XML as the underlying format from which other formats, such as PDF and HTML, are generated. For example, documents in XML format are available from the National Library of Medicine (Medline abstracts and PubMed Central documents), and from BioMed Central (full text journal articles). Other publishers are heading in the same direction. Such documents contain logical markup to organize meta-information such as title, author(s), sections, headings, citations, references, etc. Inside the text of a document, XML is used for physical markup, e.g. text in italic or boldface, subscript and superscript insertions, etc. Manually generated semantic markup is available only on the document level (e.g. MeSH terms).</Paragraph> <Paragraph position="1"> One of the most distinctive features of scientific biomedical literature is that it contains a large number of terms and entities, the majority of which are described in public electronic databases. Terms (such as names of genes, proteins, gene products, organisms, drugs, chemical compounds, etc.) are a key factor for accessing and integrating the information stored in the literature (Krauthammer and Nenadic, 2004). Identification and markup of names and terms in text serve two purposes: (1) Users profit from highlighted semantic types, e.g.
protein/gene, drug, species, and from links to the defining database for immediate access and exploration.</Paragraph> <Paragraph position="2"> (2) Identified terms facilitate and improve statistical and NLP-based text analysis (Hirschman et al., 2005; Kirsch et al., 2005).</Paragraph> <Paragraph position="3"> In this paper we describe a cascaded approach to named-entity recognition (NER) and markup in biomedicine that is embedded into EBIMed, an on-line service to access the literature (Rebholz-Schuhmann et al., forthcoming). EBIMed serves both purposes mentioned above. It keeps the annotations provided by publishers and inserts XML annotations while processing the text. Named entities from different resources are identified in the text. The individual modules provide annotation of protein names with unique identifiers, disambiguation of protein names that are ambiguous acronyms, and annotation of drugs, Gene Ontology terms and species. The identification of protein named entities can be further used in an alternative pipeline to identify events such as protein-protein interactions and associations between terms and mutations (Blaschke et al., 1999; Rzhetsky et al., 2004; Rebholz-Schuhmann et al., 2004; Nenadic and Ananiadou, 2006).</Paragraph> <Paragraph position="4"> The rest of the paper is organised as follows.</Paragraph> <Paragraph position="5"> In Section 2 we briefly discuss problems with biomedical NER. The cascaded approach and an on-line text mining system are described in Sections 3 and 4, respectively. We discuss the lessons learnt from the on-line application and remaining open problems in Section 5, while conclusions are presented in Section 6.</Paragraph> </Section> </Paper>