File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-3303_intro.xml

Size: 3,632 bytes

Last Modified: 2025-10-06 14:04:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3303">
  <Title>Using the Gene Ontology for Subcellular Localization Prediction</Title>
  <Section position="2" start_page="0" end_page="17" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Can computers extract the semantic content of academic journal abstracts? This paper explores the use of natural language techniques for processing biological abstracts to answer this question in a specific domain. Our prototype method predicts the subcellular localization of proteins (the part of the biological cell where a protein performs its function) by performing text classification on related journal abstracts. In the last two decades, there has been explosive growth in molecular biology research. Molecular biologists organize their findings into a common set of databases. One such database is Swiss-Prot, in which each entry corresponds to a protein. As of version 49.1 (February 21, 2006), Swiss-Prot contains more than 200,000 proteins, 190,000 of which link to biological journal abstracts. Unfortunately, a much smaller percentage of protein entries are annotated with other types of information. For example, only about half the entries have subcellular localization annotations. This disparity is partially due to the fact that humans annotate these databases manually and cannot keep up with the influx of data. If a computer could be trained to produce annotations by processing journal abstracts, proteins in the Swiss-Prot database could be curated semi-automatically.</Paragraph>
    <Paragraph position="1"> Document classification is the process of categorizing a set of text documents into one or more of a predefined set of classes. The classification of biological abstracts is an interesting specialization of general document classification, in that scientific language is often neither understandable by, nor written for, the layperson. It is full of specialized terms and acronyms, and it often displays high levels of synonymy. For example, the &quot;PAM complex&quot;, which exists in the mitochondrion of the biological cell, is also referred to with the phrases &quot;presequence translocase-associated import motor&quot; and &quot;mitochondrial import motor&quot;. This also illustrates the fact that biological terms often span word boundaries, so their collective meaning is lost when text is tokenized on whitespace.</Paragraph>
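    The tokenization problem described above can be illustrated with a minimal Python sketch. It is not the method used in this paper: the term lexicon, the function names, and the greedy longest-match strategy are all assumptions made for illustration, and punctuation handling is deliberately omitted.

```python
# Hypothetical toy lexicon of multi-word terms (illustration only);
# a real system would load these from a curated vocabulary.
MULTIWORD_TERMS = {
    ("pam", "complex"),
    ("mitochondrial", "import", "motor"),
    ("presequence", "translocase-associated", "import", "motor"),
}
MAX_TERM_LEN = max(len(term) for term in MULTIWORD_TERMS)


def whitespace_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()


def phrase_tokenize(text):
    """Greedy longest-match tokenizer that keeps known multi-word terms whole."""
    words = whitespace_tokenize(text)
    tokens = []
    i = 0
    while i != len(words):
        # Try the longest candidate phrase first, then shrink to a single word.
        for n in range(min(MAX_TERM_LEN, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in MULTIWORD_TERMS:
                tokens.append(" ".join(candidate))
                i += n
                break
    return tokens


sentence = "The mitochondrial import motor drives protein translocation"
print(whitespace_tokenize(sentence))  # the term is split into three unrelated tokens
print(phrase_tokenize(sentence))      # the term survives as a single token
```

    With whitespace tokenization, a classifier sees "mitochondrial", "import" and "motor" as independent features; the phrase-aware variant preserves the term as one feature, at the cost of needing a term lexicon.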
    <Paragraph position="2"> To overcome the challenges of scientific language, our technique employs the Gene Ontology (GO) (Ashburner et al., 2000) as a source of expert knowledge. The GO is a controlled vocabulary of biological terms developed and maintained by biologists. In this paper we use the knowledge represented by the GO to complement the information present in journal abstracts. Specifically, we show that:
      * the GO can be used as a thesaurus
      * the hierarchical structure of the GO can be used to generalize specific terms into broad concepts
      * simple techniques using the GO significantly improve text classification
      Although biological abstracts are challenging documents to classify, solving this problem will yield important benefits. With sufficiently accurate text classifiers, the abstracts of Swiss-Prot entries could be used to automatically annotate corresponding proteins, meaning biologists could more efficiently identify proteins of interest. Less time spent sifting through unannotated proteins translates into more time spent on new science, performing important experiments and uncovering fresh knowledge.</Paragraph>
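    The two uses of the GO listed above (as a thesaurus, and as a hierarchy for generalizing specific terms) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ontology is a hand-made toy, the TOY identifiers and parent links are placeholders rather than real GO entries, and the function names are hypothetical.

```python
# Hypothetical toy ontology. The identifiers and parent links below are
# placeholders for illustration and do NOT correspond to real GO entries.
SYNONYMS = {
    "pam complex": "TOY:0003",
    "presequence translocase-associated import motor": "TOY:0003",
    "mitochondrial import motor": "TOY:0003",
    "mitochondrial inner membrane": "TOY:0002",
    "mitochondrion": "TOY:0001",
}
PARENTS = {
    "TOY:0003": ["TOY:0002"],
    "TOY:0002": ["TOY:0001"],
    "TOY:0001": [],
}


def canonical_id(term):
    """Thesaurus step: map any known synonym to one identifier (or None)."""
    return SYNONYMS.get(term.lower())


def ancestors(term_id):
    """Hierarchy step: collect every broader concept reachable via parent links."""
    seen = set()
    stack = list(PARENTS.get(term_id, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def expand_features(terms):
    """Replace known terms by their identifier plus all ancestor identifiers."""
    features = set()
    for term in terms:
        term_id = canonical_id(term)
        if term_id is None:
            features.add(term.lower())  # unknown terms pass through unchanged
        else:
            features.add(term_id)
            features.update(ancestors(term_id))
    return features


print(sorted(expand_features(["PAM complex", "kinase"])))
```

    Under this scheme, abstracts that mention different synonyms of the same complex share a feature, and abstracts that mention different specific terms share the features of their common broader concepts.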
  </Section>
</Paper>