<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1309"> <Title>Protein Name Tagging for Biomedical Annotation in Text</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper describes a protein name tagging method, a fundamental precursor to information extraction of protein-protein interactions (PPIs) from MEDLINE abstracts. Previous work in bio-entity (including protein) recognition can be categorized into three approaches: (a) exact and approximate string matching (Hanisch et al., 2003), (b) hand-crafted rule-based approaches (Fukuda et al., 1998; Olsson et al., 2002), and (c) machine learning (Collier et al., 2000; Kazama et al., 2002).</Paragraph> <Paragraph position="1"> Previous approaches in (b) and (c) ignore the fact that bio-entities have boundary ambiguities.</Paragraph> <Paragraph position="2"> Unlike in general English, a space character is not a sufficient token delimiter in biomedical text. Moreover, name descriptions in biomedical resources are mostly compounds. Conventional English preprocessing is a pipeline of simple tokenization followed by part-of-speech tagging. Tokenization is based on graphic words so that the subsequent part-of-speech tagging can work. This conventional paradigm does not properly handle the peculiarities of biomedical English.</Paragraph> <Paragraph position="3"> To remedy the problem, we propose a morphological analysis that achieves sophisticated tokenization and adapts biomedical resources effectively.</Paragraph> <Paragraph position="4"> Our method identifies protein names by chunking based on morphemes, the smallest units determined by morphological analysis. We do not use graphic words as the unit of chunking, in order to avoid the under-segmentation problem. 
Suppose that a protein name appears as a substring of a graphic word.</Paragraph> <Paragraph position="5"> Chunking based on graphic words then fails, because graphic words are too coarsely segmented. Chunking based on morphemes overcomes this problem, and the exact boundaries of protein names are recognized more accurately.</Paragraph> <Paragraph position="6"> Below, we describe our method of protein name tagging, including preprocessing and feature extraction (Section 2), and report experimental results (Section 3).</Paragraph> <Paragraph position="7"> We then discuss related work in bio-entity recognition (Section 4) and give concluding remarks (Section 5).</Paragraph> </Section> </Paper>