<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1306">
  <Title>Boosting Precision and Recall of Dictionary-Based Protein Name Recognition</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The rapid increase of machine-readable biomedical texts (e.g. MEDLINE) makes automatic information extraction from these texts much more attractive.</Paragraph>
    <Paragraph position="1"> In particular, extracting information on protein-protein interactions from MEDLINE abstracts is regarded as one of the most important tasks today (Marcotte et al., 2001; Thomas et al., 2000; Ono et al., 2001).</Paragraph>
    <Paragraph position="2"> To extract information about proteins, one first has to recognize protein names in a text. This kind of problem has been studied in the field of natural language processing as a named entity recognition task. Ohta et al. (2002) provided the GENIA corpus, an annotated corpus of MEDLINE abstracts, which can be used as a gold standard for evaluating and training named entity recognition algorithms. Several research efforts have used machine learning techniques to recognize biological entities in texts (Takeuchi and Collier, 2002; Kim and Tsujii, 2002; Kazama et al., 2002).</Paragraph>
    <Paragraph position="3"> One drawback of these machine-learning-based approaches is that they do not provide identification information for recognized terms. For the purpose of extracting protein-protein interactions, the ID information of recognized proteins, such as a GenBank ID or SwissProt ID, is indispensable for integrating the extracted information with the data in other information sources.</Paragraph>
    <Paragraph position="4"> Dictionary-based approaches, on the other hand, intrinsically provide ID information because they recognize a term by searching the dictionary for the entry most similar (or identical) to the target term. This advantage currently makes dictionary-based approaches particularly useful as the first step for practical information extraction from biomedical documents (Ono et al., 2001).</Paragraph>
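As a minimal sketch of why dictionary lookup yields identifiers for free, the following Python fragment maps protein names to database IDs. The dictionary entries, the ID strings, and the function name `lookup_protein` are hypothetical placeholders chosen for illustration; they are not data or code from this paper.

```python
# Hypothetical toy dictionary: protein name -> database ID.
# The IDs are made-up placeholders, not real database entries.
PROTEIN_DICT = {
    "nf-kappa b": "ID_0001",
    "interleukin 2": "ID_0002",
}


def lookup_protein(term):
    """Return the ID of an exactly matching dictionary entry, or None."""
    return PROTEIN_DICT.get(term.lower())


if __name__ == "__main__":
    print(lookup_protein("NF-Kappa B"))  # a matched entry carries its ID
    print(lookup_protein("NFkappaB"))    # an unlisted spelling is missed
```

Because every recognized term is a dictionary entry, the ID comes with the match; the price is that exact lookup misses any spelling not listed in the dictionary.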
    <Paragraph position="5"> However, dictionary-based approaches have two serious problems. One is the large number of false positives, mainly caused by short names, which significantly degrades overall precision. Although this problem can be avoided by excluding short names from the dictionary, such a solution makes it impossible to recognize short protein names. We tackle this problem by using a machine learning technique.</Paragraph>
    <Paragraph position="6"> Each recognized candidate is checked by a classifier trained on an annotated corpus to determine whether it is really a protein name.</Paragraph>
    <Paragraph position="7"> The other problem of dictionary-based approaches is spelling variation. For example, the protein name &quot;NF-Kappa B&quot; has many spelling variants such as &quot;NF Kappa B,&quot; &quot;NF kappa B,&quot; &quot;NF kappaB,&quot; and &quot;NFkappaB.&quot; Exact matching techniques, however, regard these as completely different terms.</Paragraph>
    <Paragraph position="8"> We alleviate this problem by using an approximate string matching method in which surface-level similarities between terms are considered.</Paragraph>
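To illustrate the idea of surface-level similarity, here is a minimal Python sketch. It is not the approximate string searching algorithm of this paper, which Section 3 presents; it swaps in the generic similarity ratio of `difflib.SequenceMatcher` from the Python standard library. The normalization step, the 0.8 threshold, and the names `normalize`, `similarity`, and `best_match` are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop spaces and hyphens (an assumed normalization)."""
    return name.lower().replace(" ", "").replace("-", "")


def similarity(a, b):
    """Surface-level similarity in [0, 1], computed after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(term, dictionary, threshold=0.8):
    """Return (entry, score) for the most similar entry, or None.

    None is returned when the dictionary is empty or when the best
    score falls below the (assumed) threshold.
    """
    if not dictionary:
        return None
    scored = [(entry, similarity(term, entry)) for entry in dictionary]
    entry, score = max(scored, key=lambda pair: pair[1])
    return (entry, score) if score >= threshold else None


if __name__ == "__main__":
    names = ["NF-Kappa B", "interleukin 2"]
    for variant in ["NF Kappa B", "NF kappa B", "NF kappaB", "NFkappaB"]:
        print(variant, best_match(variant, names))
```

With this normalization, the four variants listed above all reduce to the same string as the dictionary entry, so each maps to it, whereas exact matching would miss them. Scanning every entry is linear in the dictionary size, which is acceptable only for a toy example.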
    <Paragraph position="9"> This paper is organized as follows. Section 2 gives an overview of our method. Section 3 presents the approximate string searching algorithm for candidate recognition. Section 4 describes how to filter out false recognitions with a machine learning method. Section 5 presents the experimental results on the GENIA corpus. Related work is described in Section 6. Finally, Section 7 offers some concluding remarks.</Paragraph>
  </Section>
</Paper>