<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-3009">
  <Title>A Hybrid Approach to Biomedical Named Entity Recognition and Semantic Role Labeling</Title>
  <Section position="3" start_page="243" end_page="244" type="metho">
    <SectionTitle>
2 Biomedical Named Entity Recognition
</SectionTitle>
    <Paragraph position="0"> Our Bio-NER system uses the CRF model (Lafferty et al., 2001), which has proven its effectiveness in several sequence tagging tasks.</Paragraph>
    <Section position="1" start_page="243" end_page="243" type="sub_section">
      <SectionTitle>
2.1 Features and Post-Processing
</SectionTitle>
      <Paragraph position="0"> Orthographical Features In our experience, ALLCAPS, CAPSMIX, and INITCAP are more useful than the other orthographical features. The details are listed in (Tsai et al., 2006a).</Paragraph>
      <Paragraph position="1"> Context Features Words preceding or following the target word may be useful for determining its category. In our experience, a suitable window size is five.</Paragraph>
      <Paragraph position="2"> Part-of-speech Features Part-of-speech information is quite useful for identifying NEs. Verbs and prepositions usually indicate an NE's boundaries, whereas nouns not found in the dictionary are usually good candidates for named entities. Our experience indicates that five is also a suitable window size. The MBT POS tagger is used to provide POS information. We trained it on GENIA 3.02p and achieved 97.85% accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="243" end_page="244" type="sub_section">
      <SectionTitle>
Word Shape Features
</SectionTitle>
      <Paragraph position="0"> As NEs in the same category may look similar (e.g., IL-2 and IL-4), we have to find a simple way to normalize all similar words. According to our method, capitalized characters are all replaced by 'A', digits are all replaced by '0', non-English characters are replaced by '_' (underscore), and non-capitalized characters are replaced by 'a'. To further normalize these words, we reduce consecutive strings of identical characters to one character. Affix Features Some affixes can provide good clues for classifying named entities (e.g., &amp;quot;ase&amp;quot;). In our experience, an acceptable affix length is 3-5 characters.</Paragraph>
      <Paragraph position="1"> Lexicon Features Depending on the quality of a given dictionary, our system uses one of two different lexicon features to estimate the possibility of a token in a biomedical named entity. The first feature determines whether a token is part of a multi-word NE in the dictionary, while the second feature calculates the minimum distance between the given token and a dictionary.</Paragraph>
      <Paragraph position="2"> In our experience, the first feature is effective for a dictionary containing high-quality items, for example, human-curated protein dictionaries. The second feature is effective for a dictionary that has a large number of items that are not very accurate, for example, web or database lexicons. Details can be found in (Tsai et al., 2006a).</Paragraph>
      <Paragraph position="3"> Post-Processing We count the number of occurrences of a word x appearing in the rightmost position of all NEs in each category. Let the maximum occurrence be n,  and the corresponding category be c. The total number of occurrences of x in the rightmost position of an NE is T; c/T is the consistency rate of x. According to our analysis of the training set of the JNLPBA 2004 data, 75% of words have a consistency rate of over 95%. We record this 75% of words and their associated categories in a table.</Paragraph>
      <Paragraph position="4"> After testing, we crosscheck all the rightmost words of NEs found by our system against this table. If they match, we overwrite the NE categories with those from the table.</Paragraph>
    </Section>
    <Section position="3" start_page="244" end_page="244" type="sub_section">
      <SectionTitle>
2.2 Experiments and Summary
</SectionTitle>
      <Paragraph position="0"> We perform 10-fold cross validation on the GENIA V3.02 corpus (Kim et al., 2003) to compare our CRF-based system with other biomedical NER systems. The experimental results are reported in Table 1. Our system outperforms other systems in protein names by an F-score of at least 2.6%. For DNA names, our performance is very close to that of the best system.</Paragraph>
      <Paragraph position="1">  recognition on the GENIA V3.02 corpus We have made every effort to implement a variety of linguistic features in our system's CRF framework. Thanks to these features and the nature of CRF, our system outperforms state-of-the-art machine-learning-based systems, especially in the recognition of protein names.</Paragraph>
      <Paragraph position="2"> Our system still has difficulty recognizing long, complicated NEs and coordinated NEs and distinguishing between overlapping NE classes, e.g., cell-line and cell-type. This is because biomedical texts have complicated sentence structures and involve more expert knowledge than texts from the general newswire domain. Since pure machine learning approaches cannot model long contextual phenomena well due to context window size limitations and data sparseness, we believe that template-based methods, which exploit long templates containing different levels of linguistic information, may be of help. Certain errors, such as incorrect boundary identification, are more tolerable if the main purpose is to discover relations between NEs (Tsai et al., 2006c). We shall exploit more linguistic features, such as composite features and external features, in the future. However, machine leaning approaches suffer from a serious problem of annotation inconsistency, which confuses machine learning models and makes evaluation difficult. In order to reduce human annotation effort and alleviate the scarcity of available annotated corpora, we shall learn from web corpora to develop machine learning techniques in different biomedical domains.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="244" end_page="245" type="metho">
    <SectionTitle>
3 Biomedical Semantic Role Labeling
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the main steps in building a biomedical SRL system: (1) create semantic roles for each biomedical verb; (2) construct a biomedical corpus, annotated with verbs and their corresponding semantic roles; (3) build an automatic semantic interpretation model, using the annotated text as a training corpus for machine learning. However, on adjunct arguments, especially on those highly relevant to the biomedical domain, such as AM-LOC (location), the performance is not satisfactory. We therefore develop a template generation method to create templates that are used as features for identifying these argument types.</Paragraph>
    <Section position="1" start_page="244" end_page="244" type="sub_section">
      <SectionTitle>
3.1 Biomedical Proposition Bank -- BioProp
</SectionTitle>
      <Paragraph position="0"> Our biomedical proposition bank, BioProp, is based on the GENIA Treebank (Yuka et al., 2005), which is a 491-abstract corpus annotated with syntactic structures. The semantic annotation in BioProp is added to the proper constituents in a syntactic tree.</Paragraph>
      <Paragraph position="1"> Basically, we adopt the definitions in PropBank (Palmer et al., 2005). For the verbs not in Prop-Bank, such as &amp;quot;phosphorylate&amp;quot;, we define their framesets. Since the annotation is time-consuming, we adopt a semi-automatic approach. We adapt an SRL system trained on PropBank (Wall Street Journal corpus) to the biomedical domain. We first use this SRL system to automatically annotate our corpus, and then human annotators to double check the system's results. Therefore, human effort is greatly reduced.</Paragraph>
    </Section>
    <Section position="2" start_page="244" end_page="245" type="sub_section">
      <SectionTitle>
3.2 Biomedical SRL System -- SEROW
</SectionTitle>
      <Paragraph position="0"> Following (Punyakanok et al., 2004), we formulate SRL as a constituent-by-constituent (C-by-C) tagging problem. We use BioProp to train our biomedical SRL system, SEROW (Tsai et al., 2006b), which uses a maximum entropy (ME) machine-learning model. We use the basic features described in (Xue &amp; Palmer, 2004). In addition, we automatically generate templates which can be used to improve classification of biomedical argument types. The details of SEROW system are described in (Tsai et al., 2005) and (Tsai et al., 2006b).</Paragraph>
    </Section>
    <Section position="3" start_page="245" end_page="245" type="sub_section">
      <SectionTitle>
3.3 Experiment and Summary
</SectionTitle>
      <Paragraph position="0"> Our experimental results show that a newswire English SRL system that achieves an F-score of 86.29% can maintain an F-score of 64.64% when ported to the biomedical domain. By using SE-ROW, we can increase that F-score by 22.9%.</Paragraph>
      <Paragraph position="1"> Adding automatically generated template features further increases overall F-score by 0.47% and adjunct (AM) F-score by 1.57%, respectively.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>