<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-3009">
  <Title>A Hybrid Approach to Biomedical Named Entity Recognition and Semantic Role Labeling</Title>
  <Section position="2" start_page="0" end_page="243" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The volume of biomedical literature available on the Web has experienced unprecedented growth in recent years, and demand for efficient methods to process this material has increased accordingly.</Paragraph>
    <Paragraph position="1"> Lately, there has been a surge of interest in mining biomedical literature. To this end, more and more information extraction (IE) systems using natural language processing (NLP) technologies have been developed for use in the biomedical field. Key biomedical IE tasks include named entity (NE) recognition (NER), such as the recognition of protein and gene names; and relation extraction, such as the extraction of protein-protein and gene-gene interactions.</Paragraph>
    <Paragraph position="2"> NER identifies named entities from natural language texts and classifies them into specific classes according to a defined ontology or classification.</Paragraph>
    <Paragraph position="3"> In general, biomedical NEs do not follow any nomenclature and may comprise long compound words and short abbreviations. Some NEs contain various symbols and other spelling variations. On average, an NE has five synonyms (Tsai et al., 2006a), and it may belong to multiple categories intrinsically. Since biomedical language and vocabulary are highly complex and evolving rapidly, Bio-NER is a very challenging problem, which raises a number of difficulties.</Paragraph>
    <Paragraph position="4"> The other main focus of Bio-IE is relation extraction. Most systems only extract the relation targets (e.g., proteins, genes) and the verbs representing those relations, overlooking the many adverbial and prepositional phrases and words that describe location, manner, timing, condition, and extent. However, the information in such phrases may be important for precise definition and clarification of complex biological relations.</Paragraph>
    <Paragraph position="5"> This problem can be tackled by using semantic role labeling (SRL) because it not only recognizes main roles, such as agents and objects, but also extracts adjunct roles such as location, manner, timing, condition, and extent. Morarescu et al. (2005) have demonstrated that full parsing and SRL can improve the performance of relation extraction, resulting in an F-score increase of 15 percentage points (from 67% to 82%). This significant result leads us to surmise that SRL may also have potential for relation extraction in the biomedical domain. Unfortunately, no SRL system for the biomedical domain exists.</Paragraph>
    <Paragraph position="6"> In this paper, we tackle the problems of both biomedical SRL and NER. Our contributions are (1) employing web lexicons and template-based post-processing to boost the performance of Bio-NER; and (2) constructing a proposition bank on top of the popular biomedical GENIA treebank, following the PropBank annotation scheme, and developing a biomedical SRL system. We adapt an SRL system trained on the Wall Street Journal (WSJ) corpus to the biomedical domain. On adjunct arguments, especially those relevant to the biomedical domain, its performance is unsatisfactory. We therefore develop automatically generated templates for identifying these arguments.</Paragraph>
  </Section>
</Paper>