<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1304">
  <Title>Enhancing Performance of Protein Name Recognizers Using Collocation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Named entities are basic constituents of a document, and recognizing them is a fundamental step in document understanding. In the well-known Message Understanding Conference (MUC) (Darpa, 1998), named entity extraction, covering organizations, people, and locations along with date/time, monetary, and percentage expressions, is one of the evaluation tasks. Several approaches have been proposed to capture these types of terms. For example, corpus-based methods are employed to extract Chinese personal names, and rule-based methods are used to extract Chinese date/time expressions and monetary and percentage expressions (Chen and Lee, 1996; Chen, et al., 1998). The corpus-based approach is adopted for personal names because a large personal name database is available for training. In contrast, rules with good coverage exist for date/time expressions, so the rule-based approach is adopted for them.</Paragraph>
    <Paragraph position="1"> In the past, named entity extraction focused mainly on general domains. Recently, a large number of scientific documents have been published, particularly in biomedical domains. Several attempts have been made to mine knowledge from biomedical documents (Hirschman, et al., 2002).</Paragraph>
    <Paragraph position="2"> One of their goals is to construct a knowledge base automatically and to find new information embedded in documents (Craven and Kumlien, 1999). Similar information extraction work has been carried out in this domain. Named entities such as protein names, gene names, drug names, and disease names have been recognized (Collier, et al., 2000; Fukuda, et al., 1998; Olsson, et al., 2002; Rindflesch, et al., 2000). In addition, the relationships among these entities, e.g., protein-protein, protein-gene, drug-gene, and drug-disease, have been extracted (Blaschke, et al., 1999; Frideman, et al., 2001; Hou and Chen, 2002; Marcotte, et al., 2001; Ng and Wong, 1999; Park, et al., 2001; Rindflesch, et al., 2000; Thomas, et al., 2000; Wong, 2001).</Paragraph>
    <Paragraph position="3"> Collocation denotes two or more words having strong relationships (Manning and Schutze, 1999).</Paragraph>
    <Paragraph position="4"> The related techniques have been applied to terminology extraction, natural language generation, parsing, and so on. This paper deals with a special kind of collocation in the biological domain, namely protein collocation. We use statistical methods to find keywords that co-occur with protein names. Such terms, called collocates of proteins hereafter, are used as restrictions in protein name extraction. The main aim of this approach is to improve precision at little cost to recall.</Paragraph>
    <Paragraph position="5"> The rest of the paper is organized as follows.</Paragraph>
    <Paragraph position="6"> The protein name recognizers used in this study are introduced in Section 2. The collocation method we adopted is described in Section 3. The filtering and integration strategies are explained in Sections 4 and 5, respectively. Finally, Section 6 gives concluding remarks and lists some future work.  Detecting protein names is challenging because of their variable structure, their resemblance to regular noun phrases, and their similarity to the names of other kinds of biological substances. Previous approaches to biological named entity extraction fall into two types: rule-based (Fukuda, et al., 1998; Humphreys, et al., 2000; Olsson, et al., 2002) and corpus-based (Collier, et al., 2000).</Paragraph>
    <Paragraph position="7"> KeX, developed by Fukuda, et al. (1998), and Yapex, developed by Olsson, et al. (2002), are based on handcrafted rules for extracting protein names. Collier, et al. (2000) trained a Hidden Markov Model on a small corpus of 100 MEDLINE abstracts to extract names of genes and gene products.</Paragraph>
    <Paragraph position="8"> Different taggers have their specific features.</Paragraph>
    <Paragraph position="9"> KeX was evaluated on 30 abstracts on the SH3 domain and 50 abstracts on signal transduction, and achieved 94.70% precision and 98.84% recall.</Paragraph>
    <Paragraph position="10"> Yapex was applied to a test corpus of 101 abstracts. Of these, 48 were retrieved with queries on protein binding and interaction, and 53 were randomly chosen from the GENIA corpus. Its performance in tagging protein names was 67.8% precision and 66.4% recall. When KeX was applied to the same test corpus, it achieved 40.4% precision and 41.1% recall. This shows that each tagger has its own characteristics, and that changing the domain can change a tagger's performance.</Paragraph>
    <Paragraph position="11"> Consequently, how to select the correct molecular entities from those proposed by the existing taggers is an interesting issue.</Paragraph>
    <Paragraph position="12"> Statistical Methods for Collocation. The overall flow of our method is shown in Figure 1. To extract protein collocates, we need a corpus in which protein names have been tagged. Thus, in the first step, we prepare a tagged biological corpus by looking up a protein lexicon. Then, common stopwords are removed and stemming is applied to gather and group the more informative words. Next, the collocation values of proteins and their surrounding words are calculated. Finally, we use these values to decide which neighbouring words are the desired collocates. The major modules are described in detail in the following subsections.</Paragraph>
    <Paragraph position="14"> [Figure 1, preprocessing stage: 1. Remove stopwords; 2. Stem]</Paragraph>
    <Paragraph position="16"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Step 1: Tagging the Corpus
</SectionTitle>
      <Paragraph position="0"> On the one hand, to calculate the collocation values of words with proteins from a corpus, protein names must first be recognized. On the other hand, the goal of this paper is to improve the performance of protein name tagging.</Paragraph>
      <Paragraph position="1"> Hence, preparing a corpus tagged with protein names and developing a high-performance protein name tagger appear to form a chicken-and-egg problem.</Paragraph>
      <Paragraph position="2"> Because the corpus developed in the first step is used only to extract the contextual information of proteins, a completely tagged corpus is not necessary at this step. A dictionary-based approach to name tagging, i.e., full pattern matching between the dictionary entries and the words in the corpus, is simple. Its main weakness is coverage: protein names that appear in the corpus but are not listed in the dictionary will not be recognized. This approach therefore produces only a partially tagged corpus, but that is enough to acquire contextual information for later use.</Paragraph>
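      <Paragraph> The dictionary lookup described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function name tag_proteins, the lower-casing of tokens, and the longest-match policy are assumptions made for the example, and any lexicon passed in is a placeholder for the real protein lexicon.

```python
def tag_proteins(tokens, lexicon):
    """Mark tokens covered by a lexicon entry (hypothetical helper).

    Returns one boolean per token; True means the token is part of a
    protein name found by exact matching against the lexicon.
    """
    # Store each lexicon entry as a tuple of lower-cased words.
    entries = {tuple(name.lower().split()) for name in lexicon}
    max_len = max((len(e) for e in entries), default=0)
    lowered = [t.lower() for t in tokens]
    tags = [False] * len(tokens)
    i = 0
    while i != len(tokens):
        matched = 0
        # Assumption: try the longest span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                tags[j] = True
            i += matched
        else:
            i += 1
    return tags
```

      Names missing from the lexicon stay untagged, which is exactly the coverage limitation noted above.</Paragraph>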
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Step 2: Preprocessing
3.2.1 Step 2.1: Exclusion of Stopwords
</SectionTitle>
      <Paragraph position="0"> Stopwords are common English words (such as the preposition &quot;in&quot; and the article &quot;the&quot;) that appear frequently in text but do not help to discriminate special classes. Because they occur throughout the corpus, they should be filtered out. The stopword list in this study was compiled with reference to the stoplists of Fox (1992), but words that also appear in the protein lexicon were removed from it. For example, &quot;of&quot; is a constituent of the protein name &quot;capsid of the lumazine&quot;, so &quot;of&quot; is excluded from the stoplist. The final list contains 387 stopwords.</Paragraph>
      <Paragraph position="1">  Stemming transforms an inflected form into its root form. For example, &quot;inhibited&quot; and &quot;inhibition&quot; are both mapped to the root form &quot;inhibit&quot; after stemming.</Paragraph>
      <Paragraph position="2"> Stemming groups words with the same meaning, so that more of the information around the proteins is captured.</Paragraph>
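      <Paragraph> The two preprocessing steps can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, the stoplist passed in stands in for the Fox (1992) list, and toy_stem is a deliberately crude suffix stripper used only because the stemmer applied in the study is not named here.

```python
def build_stoplist(base_stoplist, protein_lexicon):
    """Drop from the stoplist every word that occurs in a protein name."""
    lexicon_words = {w.lower() for name in protein_lexicon for w in name.split()}
    return {w.lower() for w in base_stoplist} - lexicon_words


def toy_stem(word):
    """Crude suffix stripping (assumption: a stand-in for a real stemmer)."""
    for suffix in ("ion", "ed", "ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tokens, stoplist):
    """Remove stopwords, then stem the remaining lower-cased tokens."""
    return [toy_stem(t.lower()) for t in tokens if t.lower() not in stoplist]
```

      With this toy rule set, both example forms from the text reduce to the same root, so their counts are pooled in the later steps.</Paragraph>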
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Step 3: Computing Collocation Statistics
</SectionTitle>
      <Paragraph position="0"> The collocates of proteins are those terms that often co-occur with protein names in the corpus.</Paragraph>
      <Paragraph position="1"> In this step, we calculate three collocation statistics to find the significant terms around proteins.</Paragraph>
      <Paragraph position="2"> Frequency. The collocates are first selected by frequency. To capture more flexible relationships, we define a collocation window of five words on each side of a protein name, so that collocation bigrams at a distance are also captured. In general, words that occur more often in the collocation windows are preferred, but there is no accepted standard threshold for frequency. Hence, other collocation models are also considered.</Paragraph>
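      <Paragraph> Counting within the collocation window can be sketched as follows. This Python sketch assumes that each protein name has already been collapsed to a single token and that is_protein is a parallel list of booleans; both conventions, and the name window_counts, are assumptions of the example and not details taken from the paper.

```python
from collections import Counter


def window_counts(tokens, is_protein, window=5):
    """Count non-protein words within `window` tokens of any protein token."""
    counts = Counter()
    for i, flag in enumerate(is_protein):
        if not flag:
            continue
        # Clip the window to the boundaries of the token list.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if not is_protein[j]:
                counts[tokens[j]] += 1
    return counts
```

      Calling counts.most_common() then gives a frequency-ranked list of candidate collocates.</Paragraph>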
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Mean and Variance
</SectionTitle>
      <Paragraph position="0"> The mean offset of a collocate indicates how far it is typically located from protein names, and the variance shows the deviation from that mean. A standard deviation of zero indicates that the collocate and the protein name always occur at exactly the same distance, equal to the mean value. If the standard deviation is low, the two words usually occur at about the same distance, i.e., near the mean value. If the standard deviation is high, the collocate and the protein name occur at widely varying distances.</Paragraph>
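      <Paragraph> The mean and variance of offsets can be computed as in the following Python sketch. It assumes signed offsets (negative when the word precedes the protein), single-token protein names, and the sample variance; these choices and the name offset_stats are assumptions of the example.

```python
from statistics import mean, variance


def offset_stats(tokens, is_protein, word, window=5):
    """Return (mean, sample variance) of signed offsets from proteins to `word`.

    Returns None when `word` never occurs inside a collocation window.
    """
    offsets = []
    for i, flag in enumerate(is_protein):
        if not flag:
            continue
        for d in range(-window, window + 1):
            j = i + d
            # Skip the protein position itself and positions outside the text.
            if d != 0 and j in range(len(tokens)) and tokens[j] == word:
                offsets.append(d)
    if not offsets:
        return None
    # The sample variance needs at least two observations.
    var = variance(offsets) if len(offsets) >= 2 else 0.0
    return mean(offsets), var
```

      A variance close to zero marks a word that sits at a fixed position relative to protein names.</Paragraph>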
      <Paragraph position="1"> t-test Model. Once the mean and variance have been computed, we need to know whether two words co-occur more often than would be expected by chance. Moreover, we also have to know whether the standard deviation is low enough.</Paragraph>
      <Paragraph position="2"> In other words, we have to set a threshold in the above approach. To obtain statistical confidence that two words have a collocation relationship, we adopt t-test hypothesis testing.</Paragraph>
      <Paragraph position="3"> The t-value for each word i is formulated as follows:</Paragraph>
      <Paragraph position="5"> p is the probability of protein.</Paragraph>
      <Paragraph position="6"> When α (the significance level) is equal to 0.005, the critical value of t is 2.576. In the t-test model, if the t-value of a word is larger than 2.576, the word is regarded as a good collocate of protein with 99.5% confidence.</Paragraph>
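      <Paragraph> The thresholding step can be illustrated with the following Python sketch. It uses the standard t-test formulation for collocations given by Manning and Schutze (1999), in which the observed co-occurrence probability is compared with the product of the two individual probabilities; whether this matches the formula used in this paper term by term is an assumption, as are the function and parameter names.

```python
import math


def t_value(count_word, count_protein, count_pair, n):
    """t statistic for a word co-occurring with protein names.

    count_word    -- occurrences of the word
    count_protein -- occurrences of protein names
    count_pair    -- co-occurrences of the word with a protein name
    n             -- total number of observations (e.g. tokens)
    """
    if count_pair == 0 or count_pair == n:
        raise ValueError("t is undefined when the sample variance is zero")
    x_bar = count_pair / n                           # observed probability
    mu = (count_word / n) * (count_protein / n)      # expected if independent
    s2 = x_bar * (1.0 - x_bar)                       # Bernoulli variance
    return (x_bar - mu) / math.sqrt(s2 / n)


def is_collocate(t, critical=2.576):
    """Accept a word when its t-value exceeds the critical value."""
    return t > critical
```

      Words whose t-value does not exceed the critical value are treated as co-occurring with proteins by chance and are discarded.</Paragraph>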
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Step 4: Extraction of Collocates
</SectionTitle>
      <Paragraph position="0"> We applied the above procedure to a corpus of 1,514 MEDLINE abstracts downloaded from the PASTA website at the University of Sheffield [http://www.dcs.shef.ac.uk/nlp/pasta]. Of the 4,782 distinct stemmed words appearing in the collocation windows, 541 were selected as collocates in Step 3. The collocates are not tagged with parts of speech, so the output may contain nouns, prepositions, numbers, verbs, etc.</Paragraph>
      <Paragraph position="1"> The collocates extracted from a corpus can not only serve as constraints on protein names but also facilitate the discovery of relationships between proteins. In past work on the extraction of biological information, such as Blaschke, et al. (1999), Ng and Wong (1999), and Ono, et al. (2001), verbs are the major targets. This is because many of the subjects and objects of these verbs are names of genes or proteins. To ensure that the collocates selected in Step 3 are verbs, we assign parts of speech to these words.</Paragraph>
      <Paragraph position="2"> Appendix A lists the collocates and their variations.</Paragraph>
    </Section>
  </Section>
</Paper>