<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2702">
  <Title>Annotation and Disambiguation of Semantic Types in Biomedical Text: a Cascaded Approach to Named Entity Recognition</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Publishers of biomedical journals increasingly use XML as the underlying document format. We present a modular text-processing pipeline that inserts XML markup into such documents at every processing step, resulting in multi-dimensional markup. This markup is used to identify and disambiguate named entities of several semantic types (protein/gene, Gene Ontology terms, drugs and species) and to communicate data from one module to the next. Each module independently adds, changes or removes markup, which allows for modularization and a flexible setup of the processing pipeline. We also describe how the cascaded approach is embedded in a large-scale XML-based application (EBIMed) used for on-line access to biomedical literature. We discuss the lessons learnt so far, as well as the open problems that remain to be resolved. In particular, we argue that pragmatic, tailored solutions reduce the need for overlapping annotations, although not completely without cost.</Paragraph>
  </Section>
</Paper>