<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1209">
  <Title>Support Vector Machine Approach to Extracting Gene References into Function from Biological Documents</Title>
  <Section position="3" start_page="0" end_page="54" type="metho">
    <SectionTitle>
2 Architecture Overview
</SectionTitle>
    <Paragraph position="0"> A complete annotation system may be done at two stages, including (1) extraction of molecular function for a gene from a publication and (2) alignment of this function with a GO term. Figure 1 shows an example. The left part is an MEDLINE abstract with the function description highlighted.</Paragraph>
    <Paragraph position="1"> The middle part is the corresponding GeneRIF.</Paragraph>
    <Paragraph position="2"> The matching words are in bold, and the similar words are underlined. The right part is the GO annotation. This figure shows a possible solution of maintaining the knowledge bases and ontology using natural language processing technology. We addressed automation of the first stage in this paper. The overall architecture is shown in Figure 2.</Paragraph>
    <Paragraph position="3"> First, we constructed a training corpus in such a way that GeneRIFs were collected from LocusLink and the corresponding abstracts were retrieved from  MEDLINE. &amp;quot;GRIF words&amp;quot; and their weights were derived from the training corpus. Then Support Vector Machines were trained using the derived corpus. Given a new abstract, a sentence is selected from the abstract to be the candidate GeneRIF.</Paragraph>
  </Section>
  <Section position="4" start_page="54" end_page="54" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> We adopted several weighting schemes to locate the GeneRIF sentence in an abstract in the official runs (Hou et al., 2003). Inspired by the work by Jelier et al. (2003), we incorporated their definition of classes into our weighting schemes, converting this task into a classification problem using SVMs as the classifier. We ran SVMs on both sets of features proposed by Hou et al. (2003) and Jelier et al. (2003), respectively. Finally, all the features were combined and some feature selection methods were applied to train the classifier.</Paragraph>
    <Section position="1" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
3.1 Training and test material preparation
</SectionTitle>
      <Paragraph position="0"> Since GeneRIFs are often cited verbatim from abstracts, we decided to reproduce the GeneRIF by selecting one sentence in the abstract. Therefore, for each abstract in our training corpus, the sentence most similar to the GeneRIF was labelled as the GeneRIF sentence using Classic Dice coefficient as similarity measure. Totally, 259,244 abstracts were used, excluding the abstracts for testing. The test data for evaluation are the 139 abstracts used in TREC 2003 Genomics track.</Paragraph>
    </Section>
    <Section position="2" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
3.2 GRIF word extraction and weighting scheme
</SectionTitle>
      <Paragraph position="0"> scheme We called the matched words between GeneRIF and the selected sentence as GRIF words in this paper. GRIF words represent the favorite vocabulary that human experts use to describe gene functions. After stop word removal and stemming operation, 10,506 GRIF words were extracted. In our previous work (Hou et al., 2003) , we first generated the weight for each GRIF word. Given an abstract, the score of each sentence is the sum of weights of all the GRIF words in this sentence.</Paragraph>
      <Paragraph position="1"> Finally, the sentence with the highest score is selected as the candidate GeneRIF. This method is denoted as OUR weighting scheme, and several heuristic weighting schemes were investigated.</Paragraph>
      <Paragraph position="2"> Here, we only present the weighting scheme used in SVMs classification. The weighting scheme is as follows. For GRIF word i, the number of occurrence Gin in all the GeneRIF sentences and the number of occurrence Ain in all the abstracts were computed and AiGi nn / was assigned to GRIF word i as its weight.</Paragraph>
    </Section>
    <Section position="3" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
3.3 Classification
3.3.1 Class definition and feature extraction
</SectionTitle>
      <Paragraph position="0"> The distribution of GeneRIF sentences showed that the position of a sentence in an abstract is an important clue to where the answer sentence is.</Paragraph>
      <Paragraph position="1"> Jelier et al. (2003) considered only the title, the first three and the last five sentences, achieving the best performance in TREC official runs. Their Naive Bayes model is as follows. An abstract a is assigned a class vj by calculating vNB:  The Bcl10 gene was recently isolated from the breakpoint region of t(1;14)(p22;q32) in mucosa-associated lymphoid tissue (MALT) lymphomas.</Paragraph>
      <Paragraph position="2"> Somatic mutations of Bcl10 were found in not only t(1;14)-bearing MALT lymphomas, but also a wide range of other tumors. ...... Our results strongly suggest that somatic mutations of Bcl10 are extremely rare in malignant cartilaginous tumors and do not commonly contribute to their molecular pathogenesis.</Paragraph>
      <Paragraph position="4"> where vj is one of the nine positions aforementioned, S is the set of 9 sentence positions, Wa,i is the set of all word positions in sentence i in abstract a, wk,i is the occurrence of the normalized word at position k in sentence i and V is the set of 9 classes.</Paragraph>
      <Paragraph position="5"> We, therefore, represented each abstract by a feature vector composed of the scores of 9 sentences. Furthermore, with a list of our 10,506 GRIF words at hand, we also computed the occurrences of these words in each sentence, given an abstract. Each abstract is then represented by the number of occurrences of these words in the 9 sentences respectively, i.e., the feature vector is 94,554 in length. Classification based on this type of features is denoted the sentence-wise bag of words model in the rest of this paper. Combining these two models, we got totally 94,563 features.</Paragraph>
      <Paragraph position="6"> Since we are extracting sentences discussing gene functions, it's reasonable to expect gene or protein names in the GeneRIF sentence. Therefore, we employed Yapex (Olsson et al., 2002) and GAPSCORE (Chang et al., 2004) protein/gene name detectors to count the number of protein/gene names in each of the 9 sentences, resulting in 94,581 features.</Paragraph>
      <Paragraph position="7">  The whole process related to SVM was done via LIBSVM - A Library for Support Vector Machines (Hsu et al., 2003). Radial basis kernel was adopted based on our previous experience. However, further verification showed that the combined model with either linear or polynomial kernel only slightly surpassed the baseline, attaining 50.67% for CD. In order to get the best-performing classifier, we tuned two parameters, C and gamma. They are the penalty coefficient in optimization and a parameter for the radial basis kernel, respectively. Four-fold cross validation accuracy was used to select the best parameter pair.</Paragraph>
      <Paragraph position="8"> 3.3.3 Picking up the answe r sentence Test instances were first fed to the classifier to get the predicted positions of GeneRIF sentences. In case that the predicted position doesn't have a sentence, which would happen when the abstract doesn't have enough sentences, the sentence with the highest score is picked for the weighting scheme and the combined model, otherwise the title is picked for the sentence-wise bag of words model.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>