<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3303">
  <Title>Using the Gene Ontology for Subcellular Localization Prediction</Title>
  <Section position="5" start_page="21" end_page="22" type="evalu">
    <SectionTitle>
4 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> ... combined synonym resolution/term generalization, respectively. Paired t-tests (p=0.05) were performed between the baseline, synonym resolution and term generalization Data Sets, where each sample is one fold of cross-validation. Classifiers that perform significantly better than the baseline appear in bold in Table 3. For example, the lysosome classifiers trained on Data Sets 2 and 3 are both significantly better than the baseline, and results for Data Set 3 are significantly better than results for Data Set 2, which is signified with an asterisk. In the case of the nucleus classifier, no abstract processing technique was significantly better, so no column appears in bold.</Paragraph>
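The fold-wise paired t-test described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the per-fold F-measures, the choice of ten folds, and the use of a two-tailed critical value (2.262 for 9 degrees of freedom at the 0.05 level) are all assumptions made for the example.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """Paired t statistic for per-fold scores (treatment minus baseline)."""
    if len(baseline) != len(treatment) or not len(baseline) >= 2:
        raise ValueError("need two equally sized samples with at least 2 folds")
    # One difference per cross-validation fold: the folds are the paired samples.
    diffs = [t - b for b, t in zip(baseline, treatment)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-fold F-measures (illustrative values only, not from the paper).
data_set_1 = [0.74, 0.76, 0.73, 0.75, 0.77, 0.74, 0.72, 0.76, 0.75, 0.73]
data_set_3 = [0.78, 0.79, 0.75, 0.79, 0.80, 0.77, 0.76, 0.78, 0.79, 0.76]

# Two-tailed critical value of Student's t for 9 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF9 = 2.262

t_value = paired_t_statistic(data_set_1, data_set_3)
print("t =", round(t_value, 3), "significant:", abs(t_value) > T_CRITICAL_DF9)
```

Pairing by fold removes fold-to-fold variation from the comparison, so a small but consistent gain can still reach significance. With a different number of folds, the critical value has to be changed to match the new degrees of freedom.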
    <Paragraph position="1"> In six of the seven classes, classifiers trained on Data Set 2 are significantly better than the baseline, and in no case are they worse. In Data Set 3, five of the seven classifiers are significantly better than the baseline, and in no case are they worse. For the lysosome and peroxisome classes our combined synonym resolution/term generalization technique produced results that are significantly better than synonym resolution alone. The average results of Data Set 2 are significantly better than Data Set 1 and the average results of Data Set 3 are significantly better than Data Set 2 and Data Set 1. On average, synonym resolution and term generalization combined give an improvement of 3%, and synonym resolution alone yields a 1.7% improvement. Because term generalization and synonym resolution never produce classifiers that are worse than synonym resolution alone, and in some cases the result is 7.8% better than the baseline, Data Set 3 can be confidently used for text categorization of all seven animal subcellular localization classes.</Paragraph>
    <Paragraph position="2"> [Table 3 caption, partially recovered] ... significantly improved over the baseline (p=0.05) appear in bold, and those with an asterisk (*) are significantly better than both other data sets. Change in F-measure compared to baseline is shown for Data Sets 2 and 3. Standard deviation is shown in parentheses.</Paragraph>
    <Paragraph position="3"> Our baseline SVM classifier performs quite well compared to the baselines reported in related work. At worst, our baseline classifier has an F-measure of 0.740. The text-only classifier reported by Höglund et al. has an F-measure in the range [0.449,0.851] (Höglund et al., 2006), and the text-only classifiers presented by Stapley et al. begin with a baseline classifier with an F-measure in the range [0.31,0.80] (Stapley et al., 2002). Although their approaches gave a greater increase in performance, their low baselines left more room for improvement.</Paragraph>
    <Paragraph position="4"> Though we use different data sets than Höglund et al. (2006), we compare our results to theirs on a class-by-class basis. For the 7 localization classes for which we both make predictions, the F-measures of our classifiers trained on Data Set 3 exceed the F-measures of the Höglund et al. text-only classifiers in all cases, and our Data Set 3 classifier beats the F-measure of the MultiLocText classifier for 5 classes (see supplementary material http://www.cs.ualberta.ca/~alona/bioNLP).</Paragraph>
    <Paragraph position="5"> In addition, our technique does not preclude using the techniques presented by Höglund et al. and Stapley et al., and combining our approach with techniques involving protein sequence information may result in an even stronger subcellular localization predictor.</Paragraph>
    <Paragraph position="6"> We do not assert that using abstract text alone is the best way to predict subcellular localization, only that if text is used, one must extract as much from it as possible. We are currently working on incorporating the classifications given by our text classifiers into Proteome Analyst's subcellular classifier to improve upon its already strong predictors (Lu et al., 2004), as they do not currently use any information present in the abstracts of homologous proteins.</Paragraph>
  </Section>
</Paper>