<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3304">
  <Title>Integrating Ontological Knowledge and Textual Evidence in Estimating Gene and Gene Product Similarity</Title>
  <Section position="3" start_page="25" end_page="27" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> GO-based similarity methods that focus on measuring intra-ontological relations have adopted the information theoretic treatment of semantic similarity developed in Natural Language Processing; see Budanitsky (1999) for an extensive survey. An example of such a treatment is given by Resnik (1995), who defines the semantic similarity between two concept nodes c1 and c2 in a graph as the information content of their least common superordinate (lcs), as shown in (1). The information content of a concept node c, IC(c), is computed as -log p(c), where p(c) is the probability of encountering instances of c in a specific corpus. (1) sim(c1,c2) = IC(lcs(c1,c2))</Paragraph>
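    <Paragraph> As a minimal sketch of how IC(c) and Resnik's measure might be computed, the Python fragment below uses a hypothetical toy graph and hypothetical annotation counts (none of the node names or counts come from GO). It assumes that p(c) is estimated from annotations to c or to any of its descendants, and that the least common superordinate is taken to be the shared ancestor with the highest information content. Both are common conventions rather than details stated above.

```python
import math

# Hypothetical toy DAG: each node maps to its parents (not real GO codes).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "ab": ["a", "b"],
}

# Hypothetical number of annotations made directly to each node.
DIRECT_COUNTS = {"root": 0, "a": 2, "b": 4, "a1": 3, "a2": 1, "ab": 2}


def ancestors(node):
    """Return the node itself plus every node reachable via parent links."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(c) = -log p(c); p(c) counts annotations to c or any descendant."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = dict.fromkeys(PARENTS, 0)
    for node, count in DIRECT_COUNTS.items():
        for anc in ancestors(node):
            cumulative[anc] += count
    return {c: -math.log(cumulative[c] / total) for c in PARENTS}


def resnik(c1, c2, ic):
    """Resnik similarity: the highest IC among the shared ancestors."""
    common = ancestors(c1).intersection(ancestors(c2))
    return max(ic[c] for c in common)


IC = information_content()
print(resnik("a1", "a2", IC))  # the two siblings share the ancestor "a"
```

    In this toy graph the root has probability 1 and therefore zero information content, so two nodes whose only shared ancestor is the root receive a similarity of 0.</Paragraph>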
    <Paragraph position="2"> Jiang and Conrath (1997) provide a refinement of Resnik's measure by factoring in the distance from each concept to the least common superordinate, as shown in (2).</Paragraph>
    <Paragraph position="4"> Lin (1998) provides a slight variant of Jiang's and Conrath's measure, as indicated in (3).</Paragraph>
    <Paragraph position="6"/>
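    <Paragraph> The three measures can be compared side by side with a short Python sketch. The forms used here are the commonly cited ones and are an assumption, since equations (1)-(3) are not reproduced in this text: Resnik as IC(lcs), Jiang and Conrath as the inverse of the distance IC(c1) + IC(c2) - 2 IC(lcs), and Lin as 2 IC(lcs) / (IC(c1) + IC(c2)). The function names, the handling of zero denominators, and the input values are hypothetical.

```python
import math


def resnik(ic_lcs):
    """Resnik: the information content of the least common superordinate."""
    return ic_lcs


def jiang_conrath(ic1, ic2, ic_lcs):
    """Inverse of the Jiang-Conrath distance IC(c1) + IC(c2) - 2 * IC(lcs)."""
    distance = ic1 + ic2 - 2.0 * ic_lcs
    # Assumption: identical concepts (distance 0) get an infinite similarity.
    return math.inf if distance == 0 else 1.0 / distance


def lin(ic1, ic2, ic_lcs):
    """Lin: twice IC(lcs), normalized by IC(c1) + IC(c2)."""
    total = ic1 + ic2
    # Assumption: two zero-information concepts are given similarity 0.
    return 0.0 if total == 0 else 2.0 * ic_lcs / total


# Hypothetical information content values for c1, c2 and their lcs.
ic_c1, ic_c2, ic_lcs = 2.0, 3.0, 1.0
scores = {
    "resnik": resnik(ic_lcs),
    "jiang_conrath": jiang_conrath(ic_c1, ic_c2, ic_lcs),
    "lin": lin(ic_c1, ic_c2, ic_lcs),
}
print(scores)
```

    Only the Lin form is bounded between 0 and 1 under these definitions, which is worth keeping in mind when the scores are later multiplied by a cosine value.</Paragraph>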
    <Paragraph position="8"> The information theoretic approach is well suited to assessing GO code similarity, since each gene subontology is formalized as a directed acyclic graph. In addition, the GO database includes numerous curated GO annotations which can be used to calculate the information content of each GO code with high reliability. Evaluations of this methodology have yielded promising results. For example, Lord et al. (2002, 2003) demonstrate that there is a strong correlation between GO-based similarity judgments for human proteins and similarity judgments obtained through BLAST searches for the same proteins. Azuaje et al. (2005) show that there is a strong connection between the degree of GO-based similarity and the expression correlation of gene products.</Paragraph>
    <Paragraph position="9"> As Bodenreider et al. (2005) remark, the main problem with the information theoretic approach to GO code similarity is that it does not take into account associative relations across the gene ontologies. For example, the two GO codes 0050909 (sensory perception of taste) and 0008527 (taste receptor activity) belong to different gene ontologies (BP and MF), but they are undeniably very closely related. The information theoretic approach would simply miss associations of this kind as it is not designed to capture inter-ontological relations. Bodenreider et al. (2005) propose to recover associative relations across the gene ontologies using a variety of statistical techniques which estimate the similarity of two GO codes inter-ontologically in terms of the distribution of the gene product annotations associated with the two GO codes in the GO database. One such technique is an adaptation of the vector space model frequently used in Information Retrieval (Salton et al. 1975), where each GO code is represented as a vector of gene-based features weighted according to their distribution in the GO annotation database, and the similarity between two GO codes is computed as the cosine of the vectors for the two codes.</Paragraph>
    <Paragraph position="10"> Footnotes: For ease of exposition, we have converted Jiang's and Conrath's semantic distance measure to semantic similarity by taking its inverse, following Pedersen et al. (2005). GO database: http://www.godatabase.org/dev/database.</Paragraph>
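    <Paragraph> The vector space technique described above can be illustrated with a small Python sketch. Each GO code is represented as a sparse vector mapping gene product identifiers to weights. The identifiers and weights below are hypothetical, and the weighting scheme actually used by Bodenreider et al. (2005) is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical gene product features for a BP code and an MF code.
bp_code = {"gp1": 1.0, "gp2": 1.0}
mf_code = {"gp1": 1.0, "gp2": 1.0, "gp3": 1.0}
unrelated_code = {"gp9": 2.0}

print(cosine(bp_code, mf_code))
print(cosine(bp_code, unrelated_code))
```

    Two codes from different ontologies score highly when they tend to annotate the same gene products, which is the kind of association the intra-ontological measures cannot see.</Paragraph>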
    <Paragraph position="11"> The ability to measure associative relations across the gene ontologies can significantly augment the functionality of the information theoretic approach so as to provide a more comprehensive assessment of gene and gene product similarity.</Paragraph>
    <Paragraph position="12"> However, in spite of their complementarity, the two GO code similarity measures are not easily integrated. This is because the two measures are obtained through different methods, express distinct senses of similarity (i.e. intra- and inter-ontological) and are thus not directly comparable.</Paragraph>
    <Paragraph position="13"> Posse et al. (2006) develop a GO-based similarity algorithm, XOA (short for Cross-Ontological Analytics), capable of combining intra- and inter-ontological relations by &quot;translating&quot; each associative relation across the gene ontologies into a hierarchical relation within a single ontology. More precisely, let c1 denote a GO code in the gene ontology O1 and c2 a GO code in the gene ontology O2. The XOA similarity between c1 and c2 is defined as shown in (4), where
      * cos(ci,cj) denotes the cosine associative measure proposed by Bodenreider et al. (2005),
      * sim(ci,cj) denotes any of the three intra-ontological semantic similarities described above, see (1)-(3), and</Paragraph>
    <Paragraph position="15"> * max over ci in Oj of {f(ci)} denotes the maximum of the function f() over all GO codes ci in the gene ontology Oj.</Paragraph>
    <Paragraph position="16"> The major innovation of the XOA approach is to allow the comparison of two nodes c1, c2 across distinct ontologies O1, O2 by mapping c1 into its closest node c4 in O2 and c2 into its closest node c3 in O1. The inter-ontological semantic similarity between c1 and c2 can then be estimated from the intra-ontological semantic similarities between c1-c3 and c2-c4, using multiplication with the associative relations between c2-c3 and c1-c4 as a score enrichment device.</Paragraph>
    <Paragraph position="17"> Footnote: If c1 and c2 are in the same ontology, i.e. O1=O2, then XOA(c1,c2) is still computed as in (4). In most cases, the maximum in (4) would be obtained with c3 = c2 and c4 = c1, so that XOA(c1,c2) would simply be computed as sim(c1,c2). However, there are situations where there exists a GO code c3 (c4) in the same ontology which
      * is highly associated with c1 (c2),
      * is semantically close to c2 (c1), and
      * leads to a value for sim(c1,c3) x cos(c2,c3) (sim(c2,c4) x cos(c1,c4)) that is higher than sim(c1,c2).</Paragraph>
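    <Paragraph> The following Python sketch shows one reading of the XOA score for a pair of GO codes, reconstructed from the description above rather than copied from (4): the larger of the maximum of sim(c1,c3) x cos(c2,c3) over all c3 in O1 and the maximum of sim(c2,c4) x cos(c1,c4) over all c4 in O2. The code names, the score tables and the helper make_lookup are hypothetical, and the intra-ontological similarity is assumed to be normalized so that a code has similarity 1 with itself.

```python
def make_lookup(table):
    """Build a symmetric score function from a dict keyed by code pairs."""
    def score(x, y):
        if x == y:
            return 1.0  # assumption: a code is maximally related to itself
        return table.get((x, y), table.get((y, x), 0.0))
    return score


def xoa(c1, c2, ontology1, ontology2, sim, cos):
    """Cross-ontological score for c1 in ontology1 and c2 in ontology2."""
    # Map c2 into ontology1: pick the c3 that is close to c1 and associated with c2.
    via_o1 = max(sim(c1, c3) * cos(c2, c3) for c3 in ontology1)
    # Map c1 into ontology2: pick the c4 that is close to c2 and associated with c1.
    via_o2 = max(sim(c2, c4) * cos(c1, c4) for c4 in ontology2)
    return max(via_o1, via_o2)


# Hypothetical ontologies with two codes each.
O1 = ["p1", "p2"]
O2 = ["f1", "f2"]

# Hypothetical intra-ontological similarities and cross-ontology cosines.
sim = make_lookup({("p1", "p2"): 0.5, ("f1", "f2"): 0.8})
cos = make_lookup({
    ("p1", "f1"): 0.2, ("p1", "f2"): 0.9,
    ("p2", "f1"): 0.6, ("p2", "f2"): 0.1,
})

print(xoa("p1", "f1", O1, O2, sim, cos))
```

    In this toy setting the direct association between p1 and f1 is weak, but f2 is both close to f1 and strongly associated with p1, so the score is driven by the path through f2.</Paragraph>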
    <Paragraph position="19"> Posse et al. (2006) show that the XOA similarity measure provides substantial advantages. For example, a comparative evaluation of protein similarity, following the benchmark study of Lord et al. (2002, 2003), reveals that XOA provides the basis for a better correlation with protein sequence similarities, as measured by BLAST bit score, than any intra-ontological semantic similarity measure.</Paragraph>
    <Paragraph position="20"> The XOA similarity between genes/gene products derives from the XOA similarity between GO codes.</Paragraph>
    <Paragraph position="21"> Let GP1 and GP2 be two genes/gene products. Let c11, c12, ..., c1n denote the set of GO codes associated with GP1 and c21, c22, ..., c2m the set of GO codes associated with GP2. The XOA similarity between GP1 and GP2 is defined as in (5), where i=1,...,n and j=1,...,m. (5) XOA(GP1,GP2) = max {XOA(c1i, c2j)}</Paragraph>
    <Paragraph position="22"> The results of the study by Posse et al. (2006) are shown in Table 1. Note that the correlation between protein similarities based on intra-ontological similarity measures and BLAST bit scores in Table 1 is given for each choice of gene ontology (MF, BP, CC). This is because intra-ontological similarity methods only take into account GO codes that are in the same ontology and can therefore only assess protein similarity from a single ontology viewpoint. By contrast, the XOA-based protein similarity measure makes use of GO codes that can belong to any of the three gene ontologies and need not be broken down by single ontologies, although the contribution of each gene ontology, or even of single GO codes, can still be isolated if so desired.</Paragraph>
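    <Paragraph> The gene-level definition in (5) amounts to taking the best-scoring pair of GO codes across the two annotation sets. A minimal Python sketch follows. The code-level scoring function is passed in as a parameter, and the GO code labels and scores are hypothetical placeholders, not values from Posse et al. (2006).

```python
from itertools import product


def xoa_gene_products(codes1, codes2, xoa_code):
    """Maximum code-level XOA score over all pairs (c1i, c2j)."""
    return max(xoa_code(c1, c2) for c1, c2 in product(codes1, codes2))


# Hypothetical code-level XOA scores for two annotated gene products.
PAIR_SCORES = {
    ("c11", "c21"): 0.3,
    ("c11", "c22"): 0.7,
    ("c12", "c21"): 0.5,
    ("c12", "c22"): 0.1,
}


def toy_xoa_code(c1, c2):
    """Look up a precomputed score; unknown pairs count as unrelated."""
    return PAIR_SCORES.get((c1, c2), 0.0)


gp1_codes = ["c11", "c12"]
gp2_codes = ["c21", "c22"]
print(xoa_gene_products(gp1_codes, gp2_codes, toy_xoa_code))
```

    Because the maximum ranges over every pair, the codes may come from any of the three gene ontologies, which is why the XOA-based measure does not have to be reported per ontology.</Paragraph>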
    <Paragraph position="23"> Is it possible to improve on these XOA results by factoring in textual evidence? We will address this question in the remaining part of the paper.</Paragraph>
    <Paragraph position="24"> Table 1: [...] correlation coefficients between BLAST bit score and semantic similarities, calculated using a set of 255,502 protein pairs (adapted from Posse et al. 2006).</Paragraph>
  </Section>
</Paper>