<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3303"> <Title>Using the Gene Ontology for Subcellular Localization Prediction</Title> <Section position="4" start_page="18" end_page="21" type="metho"> <SectionTitle> 3 Methods </SectionTitle> <Paragraph position="0"> The workflow used to perform our experiments is outlined in Figure 1.</Paragraph> <Section position="1" start_page="18" end_page="19" type="sub_section"> <SectionTitle> 3.1 The Data Set </SectionTitle> <Paragraph position="0"> The first step in evaluating the usefulness of GO as a knowledge source is to create a data set. [Figure 1 caption (partial): ...in this paper. Abstracts are gathered for proteins with known localization (process a). Treatments are applied to abstracts to create three Data Sets (process b).]</Paragraph> <Paragraph position="1"> This process begins with a set of proteins with known subcellular localization annotations (Figure 1). For this we use Proteome Analyst's (PA) data sets (Lu et al., 2004; Szafron et al., 2004). The PA group used these data sets to create very accurate subcellular classifiers based on the keyword fields of Swiss-Prot entries for homologous proteins. Here we use PA's current data set of proteins collected from Swiss-Prot (version 48.3) and impose one further criterion: the subcellular localization annotation may be no longer than four words. This constraint avoids including proteins whose localization category was incorrectly extracted from a long sentence describing several aspects of localization. For example, consider the subcellular annotation &quot;attached to the plasma membrane by a lipid anchor&quot;, which could mean the protein's functional components are either cytoplasmic or extracellular (depending on which side of the plasma membrane the protein is anchored). PA's simple parsing scheme could mistake this description as meaning that the protein performs its function in the plasma membrane. 
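This four-word filter can be sketched as follows. A minimal illustration, assuming annotations are available as plain strings; the protein IDs and the helper name are hypothetical, not PA's real API.

```python
# Hypothetical sketch of the annotation-length filter described above.

def keep_protein(subcellular_annotation):
    """Keep a protein only if its subcellular annotation is at most
    four words long, to avoid mislabeled multi-aspect descriptions."""
    return len(subcellular_annotation.split()) in range(5)  # i.e. 0..4 words

proteins = {
    "P1": "nuclear",
    "P2": "attached to the plasma membrane by a lipid anchor",  # 9 words: dropped
}
filtered = {pid: ann for pid, ann in proteins.items() if keep_protein(ann)}
```

Only "P1" survives the filter; the nine-word anchor description is excluded.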
Our length constraint reduces the chances of including mislabeled training instances in our data.</Paragraph> <Paragraph position="2"> [Table 1 caption (partial): ...than the sum of the rows because proteins may belong to more than one localization class.]</Paragraph> <Paragraph position="3"> PA has data sets for five organisms (animal, plant, fungi, gram negative bacteria and gram positive bacteria). The animal data set was chosen for our study because it is PA's largest, and medical research has the most to gain from increased annotations for animal proteins. PA's data sets have binary labeling, and each class has its own training file. For example, in the nuclear data set a nuclear protein appears with the label &quot;+1&quot;, and non-nuclear proteins appear with the label &quot;-1&quot;. Our training data includes 317 proteins that localize to more than one location, so they appear with a positive label in more than one data set. For example, a protein that is both cytoplasmic and peroxisomal will appear with the label &quot;+1&quot; in both the peroxisomal and cytoplasmic sets, and with the label &quot;-1&quot; in all other sets. Our data set has 7652 proteins across 9 classes (Table 1). To take advantage of the information in the abstracts of proteins with multiple localizations, we use a one-against-all classification model, rather than a &quot;single most confident class&quot; approach.</Paragraph> </Section> <Section position="2" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 3.2 Retrieve Abstracts </SectionTitle> <Paragraph position="0"> Now that a set of proteins with known localizations has been created, we gather each protein's abstracts and abstract titles (Figure 1, process a).</Paragraph> <Paragraph position="1"> We do not include full text because it can be difficult to obtain automatically and because using full text does not improve F-measure (Sinclair and Webber, 2004). 
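The one-against-all labeling described in Section 3.1 can be sketched as follows. Protein IDs and localization sets are illustrative, not PA's real data files.

```python
# Sketch of binary one-against-all labeling: one label file per class,
# with +1 for proteins in the class and -1 for all others.

CLASSES = ["nuclear", "cytoplasmic", "peroxisomal"]

def binary_labels(protein_locs, classes=CLASSES):
    """For each class, label a protein +1 if it localizes there, else -1.
    A multi-location protein is positive in more than one class file."""
    return {
        c: {pid: (1 if c in locs else -1) for pid, locs in protein_locs.items()}
        for c in classes
    }

labels = binary_labels({
    "P1": {"cytoplasmic", "peroxisomal"},  # dual-localized protein
    "P2": {"nuclear"},
})
```

Here "P1" is positive in both the cytoplasmic and peroxisomal files and negative in the nuclear file, mirroring the cytoplasmic/peroxisomal example in the text.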
Abstracts for each protein are retrieved using the PubMed IDs recorded in the Swiss-Prot database.</Paragraph> <Paragraph position="2"> PubMed (http://www.pubmed.gov) is a database of life science articles. It should be noted that more than one protein in Swiss-Prot may point to the same abstract in PubMed. Because the performance of our classifiers is estimated using cross-validation (discussed in Section 3.4), it is important that the same abstract does not appear in both testing and training sets during any stage of cross-validation. To address this problem, all abstracts that appear more than once in the complete set of abstracts are removed. The distribution of the remaining abstracts among the 9 subcellular localization classes is shown in Table 1. For simplicity, the fact that an abstract may actually be discussing more than one protein is ignored. However, because we remove duplicate abstracts, many abstracts discussing more than one protein are eliminated.</Paragraph> <Paragraph position="3"> In Table 1 there are more abstracts than proteins because each protein may have more than one associated abstract. Classes with fewer than 100 abstracts were deemed to have too little information for training. This constraint eliminated the plasma membrane and Golgi classes, although they remained as negative data for the other 7 training sets.</Paragraph> <Paragraph position="4"> It is likely that not every abstract associated with a protein discusses subcellular localization. However, because the Swiss-Prot entries for proteins in our data set have subcellular annotations, some research must have been performed to ascertain localization, and it should therefore be reported in at least one abstract. 
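The duplicate-abstract removal described above can be sketched as follows. PubMed IDs are illustrative; any abstract referenced by more than one protein is dropped entirely, so it can never straddle a training/test split.

```python
# Sketch of abstract de-duplication: keep only PubMed IDs that occur
# exactly once across the whole protein-to-abstract mapping.

from collections import Counter

def drop_shared_abstracts(protein_pmids):
    """Map protein id to its PubMed IDs, keeping only IDs used once."""
    counts = Counter(pmid for pmids in protein_pmids.values() for pmid in pmids)
    return {
        pid: [pmid for pmid in pmids if counts[pmid] == 1]
        for pid, pmids in protein_pmids.items()
    }

unique = drop_shared_abstracts({
    "P1": ["11111", "22222"],
    "P2": ["22222", "33333"],  # "22222" is shared by P1 and P2, so it is removed
})
```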
If the topics of the other abstracts are truly unrelated to localization, then their distribution of words may be the same for all localization classes.</Paragraph> <Paragraph position="5"> However, even if an abstract does not discuss localization directly, it may discuss some other property that is correlated with localization (e.g. function).</Paragraph> <Paragraph position="6"> In this case, terms that differentiate between localization classes will be found by the classifier.</Paragraph> </Section> <Section position="3" start_page="19" end_page="21" type="sub_section"> <SectionTitle> 3.3 Processing Abstracts </SectionTitle> <Paragraph position="0"> Three different data sets are made by processing our retrieved abstracts (Figure 1, process b). [Figure 2 caption (partial): ...methods of abstract processing. Data Set 1 is our baseline, Data Set 2 incorporates synonym resolution and Data Set 3 incorporates synonym resolution and term generalization. Word counts are shown here for simplicity, though our experiments use TFIDF.]</Paragraph> <Paragraph position="1"> An example illustrating our three processing techniques is shown in Figure 2.</Paragraph> <Paragraph position="2"> In Data Set 1, abstracts are tokenized and each word is stemmed using Porter's stemming algorithm (Porter, 1980). The words are then transformed into a vector of <word,TFIDF> pairs.</Paragraph> <Paragraph position="4"> TFIDF(wi) = f(wi) x log(n/D(wi)), where f(wi) is the number of times word wi appears in documents associated with a protein, n is the total number of training documents and D(wi) is the number of documents in the whole training set that contain the word wi. TFIDF was first proposed by Salton and Buckley (1988) and has been used extensively in various forms for text categorization (Joachims, 1998; Stapley et al., 2002). 
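The TFIDF weighting defined above can be sketched as follows: each word's count in a protein's documents is scaled by the log of inverse document frequency over the whole training set. The toy documents are illustrative.

```python
# Sketch of TFIDF(w) = f(w) * log(n / D(w)): f(w) counts w in the
# documents for one protein, n is the total number of training
# documents, and D(w) is the number of documents containing w.

import math

def tfidf_vector(protein_docs, all_docs):
    """Build one {word: TFIDF} vector from a protein's bag of words."""
    n = len(all_docs)
    df = {}
    for doc in all_docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    counts = {}
    for doc in protein_docs:
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
    return {w: f * math.log(n / df[w]) for w, f in counts.items()}

docs = [["nuclear", "protein"], ["nuclear", "import"], ["membrane"]]
vec = tfidf_vector(docs[:2], docs)  # first two abstracts belong to one protein
```

"nuclear" appears twice for this protein but in two of three documents overall, so it is weighted 2 x log(3/2); "membrane" never occurs for the protein and gets no entry.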
The words from all abstracts for a single protein are amalgamated into one &quot;bag of words&quot; that becomes the training instance representing the protein.</Paragraph> <Paragraph position="5"> The GO hierarchy can act as a thesaurus for words with synonyms. For example, the GO encodes the fact that &quot;metabolic process&quot; is a synonym for &quot;metabolism&quot; (see Figure 3). Data Set 2 uses GO's &quot;exact synonym&quot; field for synonym resolution and adds extra features to the vector of words from Data Set 1. [Figure 3 caption (partial): ...hierarchy. GO nodes are shown as ovals, synonyms appear as grey rectangles.]</Paragraph> <Paragraph position="6"> We search a stemmed version of the abstracts for matches to stemmed GO node names or synonyms. If a match is found, the GO node name (deemed the canonical representative for its set of synonyms) is associated with the abstract. In Figure 2 the phrase &quot;regulation of osmotic pressure&quot; appears in the text. A lookup in the GO synonym dictionary indicates that this is an exact synonym of the GO node &quot;osmoregulation&quot;, so we associate the term &quot;osmoregulation&quot; with the training instance. This approach combines the weight of several synonyms into one representative, allowing the SVM to more accurately model the author's intent, and it identifies multi-word phrases that are otherwise lost during tokenization. Table 2 shows the increase in the average number of features per training instance as a result of our synonym resolution technique.</Paragraph> <Paragraph position="7"> To express the relationships between terms, the GO hierarchy is organized as a directed acyclic graph (DAG). For example, &quot;thermoregulation&quot; is a type of &quot;homeostasis&quot;, which is a &quot;physiological process&quot;. This &quot;is a&quot; relationship is expressed as a series of parent-child relationships (see Figure 3). 
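The ancestor walk over these parent-child links can be sketched as follows. The toy PARENTS table encodes the "is a" relations just described; real GO nodes may have several parents, which the DAG walk handles.

```python
# Sketch of ancestor lookup in the GO DAG for term generalization.

PARENTS = {
    "thermoregulation": ["homeostasis"],
    "osmoregulation": ["homeostasis"],
    "homeostasis": ["physiological process"],
    "physiological process": [],
}

def ancestors(node, parents=PARENTS):
    """Collect every ancestor of a node by walking parent links."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

# Ancestors are added with a "GO " prefix so generalized terms are
# distinguishable from names that literally occurred in the abstract.
generalized = sorted("GO " + a for a in ancestors("thermoregulation"))
```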
In Data Set 3 we use the GO for synonym resolution (as in Data Set 2) and we also use its hierarchical structure to generalize specific terms into broader concepts. [Table 2 caption (partial): ...instance for 7 subcellular localization categories in animals. Data Set 1 is the baseline, Data Set 2 incorporates synonym resolution and Data Set 3 uses synonym resolution and term generalization.]</Paragraph> <Paragraph position="8"> For Data Set 3, if a GO node name (or synonym) is found in an abstract, the names of all ancestors of the matched node are included in the training instance along with the word vectors from Data Set 2 (see Figure 2). These additional node names are prepended with the string &quot;GO &quot;, which allows the SVM to differentiate between the case where a GO node name appears exactly in the text and the case where a GO node name's child appeared in the text and the ancestor was added by generalization. Term generalization increases the average number of features per training instance (Table 2).</Paragraph> <Paragraph position="9"> Term generalization gives the SVM algorithm the opportunity to learn correlations that exist between general terms and subcellular localization even if the general term never appears in an abstract and we encounter only its more specific children. Without term generalization the SVM has no concept of the relationship between child and parent terms, nor between sibling terms. For some localization categories more general terms may be the most informative, while in other cases specific terms may be best. Because our technique adds features to training instances and never removes any, the SVM can assign lower weights to the generalized terms in cases where the localization category demands it.</Paragraph> </Section> <Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.4 Evaluation </SectionTitle> <Paragraph position="0"> Each of our classifiers was evaluated using 10-fold cross-validation. 
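The folding and composite scoring can be sketched as follows. A simplified version: instances are assigned to folds round-robin rather than stratified, and the classifier itself is left out; the counts fed to the scorer are pooled over all folds.

```python
# Sketch of 10-fold splitting and composite precision/recall/F-measure.

def ten_fold_indices(n_instances, folds=10):
    """Assign instance i to fold i % folds (round-robin; the real
    experiments use stratified partitions, not implemented here)."""
    return [i % folds for i in range(n_instances)]

def composite_f_measure(true_pos, false_pos, false_neg):
    """Precision, recall and F-measure pooled over all folds."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

With, say, 8 true positives, 2 false positives and 2 false negatives pooled across folds, precision, recall and F-measure are all 0.8.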
In 10-fold cross-validation, each Data Set is split into 10 stratified partitions. For the first &quot;fold&quot;, a classifier is trained on 9 of the 10 partitions and the tenth partition is used to test the classifier. This is repeated for nine more folds, holding out a different tenth each time. The results of all 10 folds are combined, and composite precision, recall and F-measures are computed. Cross-validation accurately estimates the prediction statistics of a classifier, since each instance is used as a test case at some point during validation.</Paragraph> <Paragraph position="1"> The SVM implementation libSVM (Chang and Lin, 2001) was used to conduct our experiments. A linear kernel and default parameters were used in all cases; no parameter searching was done. Precision, recall and F-measure were calculated for each experiment.</Paragraph> </Section> </Section> </Paper>