<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3305">
  <Title>A Priority Model for Named Entities</Title>
  <Section position="3" start_page="0" end_page="33" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> A successful gene and protein NER system must address the complexity and ambiguity inherent in this domain. Hand-crafted rules alone are unable to capture these phenomena in large biomedical text collections. Most biomedical NER systems use some form of language modeling, consisting of an observed sequence of words and a hidden sequence of tags. The goal is to find the tag sequence with maximal probability given the observed word sequence. McDonald and Pereira (2005) use conditional random fields (CRF) to identify the beginning, inside and outside of gene and protein names. GuoDong et al. (2005) use an ensemble of one support vector machine and two Hidden Markov Models (HMMs).</Paragraph>
    <Paragraph position="1"> Kinoshita et al. (2005) use a second-order Markov model. Dingare et al. (2005) use a maximum entropy Markov model (MEMM) with large feature sets.</Paragraph>
    <Paragraph position="2"> NER is a difficult task because it requires both the identification of the boundaries of an entity in text and the classification of that entity. In this paper, we focus on the classification step. Spasic et al. (2005) use the MaSTerClass case-based reasoning system for biomedical term classification. MaSTerClass uses term contexts from an annotated corpus of 2072 MEDLINE abstracts related to nuclear receptors as a basis for classifying new terms. Its set of classes is a subset of the UMLS Semantic Network (McCray, 1989), which does not include genes and proteins. Liu et al. (2002) classified terms that represent multiple UMLS concepts by examining the conceptual relatives of the concepts. Hatzivassiloglou et al. (2001) classified terms known to belong to the classes Protein, Gene and/or RNA using unsupervised learning, achieving accuracy rates up to 85%. The AZuRE system (Podowski et al., 2004) uses a separate modified Naive Bayes model for each of 20K genes. A term is disambiguated based on its contextual similarity to each model. Nenadic et al. (2003) recognized the importance of terminological knowledge for biomedical text mining. They used the C/NC methods, calculating both the intrinsic characteristics of terms (such as their frequency of occurrence as substrings of other terms) and the context of terms as linear combinations. These biomedical classification systems all rely on the context surrounding named entities. While we recognize the importance of context, we believe one must strive for the appropriate blend of information coming from the context and information that is inherent in the name itself. This explains our focus on names without context in this work.</Paragraph>
    <Paragraph position="3"> We believe one can improve gene and protein entity classification by using more training data and/or using a more appropriate model for names.</Paragraph>
    <Paragraph position="4"> Current sources of training data are deficient in important biomedical terminologies like cell line names. To address this deficiency, we constructed the SemCat database, based on a subset of the</Paragraph>
  </Section>
</Paper>