<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3112"> <Title>Mork</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Using SemGen to Compare OMIM and MEDLINE </SectionTitle> <Paragraph position="0"> SemGen errors notwithstanding, we are investigating possibilities for exploiting automatically extracted gene interaction predications. We discuss an application which compares MEDLINE text to OMIM documents, for specified diseases. LocusLink preferred gene symbols and GO terms are an integral part of this processing. We feel it is instructive to investigate the consequences of this comparison, anticipating results that are effective enough for practical application.</Paragraph> <Paragraph position="1"> We selected five diseases with a genetic component (Alzheimer's disease, Crohn's disease, lung cancer, prostate cancer, and sickle cell anemia), and retrieved the corresponding OMIM report for each disease, automatically discarding sections such as references, headings, and edit history. We also queried PubMed for each disease and retrieved all MEDLINE citations that were more recent than the corresponding OMIM report. Both OMIM and MEDLINE files were then submitted to SemGen.</Paragraph> <Paragraph position="2"> For each disease, the MEDLINE file was larger than the corresponding OMIM file, and the categorizer eliminated some parts of each file as not being in the molecular biology domain. 
Table 1 shows the number of sentences in the original input files and the number processed after the categorizer eliminated sentences not in the molecular biology domain.</Paragraph> <Paragraph position="3"> For example, a paragraph in the OMIM file for Alzheimer's disease beginning with the sentence &quot;Alzheimer disease is by far the most common cause of dementia&quot; was eliminated, as was a MEDLINE citation with the title &quot;Semantic decision making in early probable AD: A PET activation study&quot;.</Paragraph> <Paragraph position="4"> An overview of predication types retrieved by SemGen is given in Table 2 for the files on Alzheimer's disease. Of the gene-disease predications, the majority had predicate ASSOCIATED_WITH (15 from OMIM and [...]). We developed a program that compares semantic predications found in MEDLINE abstracts to those found in an OMIM report associated with a particular disease and classifies the comparison between two predications as an exact match, a partial match, or no match. The category of a comparison is determined by examining the two argument fields and the predicate field of the predications. If all three fields match, the comparison is an exact match; if exactly two fields match, it is a partial match. All other cases are classified as no match.</Paragraph> <Paragraph position="5"> Although fewer than half of the predications extracted by SemGen are likely to be correct, we provide some examples from the files on Alzheimer's disease.</Paragraph> <Paragraph position="6"> (The system retains the document IDs, which are suppressed here for clarity.) 
Examples of partial matches between gene-disease predications extracted from [...]. Some of the gene-gene interaction predications found in MEDLINE but not in OMIM are listed in (18).</Paragraph> <Paragraph position="8"/> </Section> <Section position="5" start_page="2" end_page="7" type="metho"> <SectionTitle> 6 Using the GO Terms </SectionTitle> <Paragraph position="0"> As noted above, for each gene argument in the predications identified by SemGen, we retrieved from LocusLink the GO terms associated with that gene. We have begun to investigate ways in which these terms might be used to compare genes by looking at the gene-gene interaction predications extracted from MEDLINE that did not occur in OMIM.</Paragraph> <Paragraph position="1"> To support this work, we developed a program that sorts gene-gene interaction predications by the GO terms of their arguments. For each gene function, the predications in which both arguments share the same function are listed first. These are followed by the predications in which only the first argument has that gene function, and then the predications in which only the second argument has the relevant gene function. A typical output file from this process is shown in (19). The three branches of the Gene Ontology provide a uniform system for relating genes by function. The terms in the molecular function and biological process branches are perhaps most useful for this purpose; however, we have begun by considering all three branches (including the cellular component branch). The most effective method of exploiting GO annotations remains a matter of research.</Paragraph> <Paragraph position="2"> It is important to recognize that GO mapping is not precise; different annotators may make different GO assignments for the same gene. 
Nevertheless, GO annotations provide considerable potential for relating the molecular functions and biological processes of genes.</Paragraph> <Paragraph position="3"> We consider one of the predications extracted from the MEDLINE file for prostate cancer that did not occur in OMIM: 19) EGR1|INTERACT_WITH|AR Both genes EGR1 and AR in LocusLink elicit the same human gene set (367 Hs AR; 1026 Hs CDKN1A; 1958 Hs EGR1; 3949 Hs LDLR; 4664 Hs NAB1; 4665 Hs NAB2; 5734 Hs PTGER4; 114034 Hs TOE1). This suggests a high degree of sequence homology and functional similarity. In addition, LocusLink provides the following GO terms for the two genes: 20) EGR1: early growth response 1; LocusID: 1958 Gene Ontology: transcription factor activity; regulation of transcription, DNA-dependent; nucleus 21) AR: androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease); LocusID: 367 Gene Ontology: androgen receptor activity; steroid binding; receptor activity; transcription factor activity; transport; sex differentiation; regulation of transcription, DNA-dependent; signal transduction; cell-cell signaling; nucleus (The GO provides additional, hierarchical information for terms, which we have not yet exploited.) Thirty percent of the predications examined had some degree of overlap in their GO terms. For example, the terms for EGR1 (transcription factor activity; regulation of transcription, DNA-dependent; and nucleus) are identical to three of the GO terms for the AR gene. This concordance may not be typical of the majority of paired genes in our sample. However, in the case of genes that do not exhibit such complete overlap, concordance might be obtained at higher nodes in the classification scheme.</Paragraph> <Paragraph position="4"> An alternate approach for assessing distance between GO annotations has been suggested by Lord et al. (2003a, 2003b). 
They propose a &quot;semantic similarity measure&quot; using ontologies to explore the relationships between genes that may have associated interaction or function. The authors consider the information content of each GO term, which is derived from the number of times each term, or any child term, occurs.</Paragraph> <Paragraph position="5"> The fact that any one gene has a number of GO annotations indicates that a particular gene may perform more than one function, or that its function may be classified under a number of molecular activities. Some of these activities may belong to the same GO structure, extending down it to varying degrees. For example, for gene AR, &quot;receptor activity&quot; (GO 4872) partially overlaps with &quot;androgen receptor activity&quot; (GO 4882), as does &quot;steroid binding&quot; (GO 5496) with &quot;transcription factor activity&quot; (GO 3700), and &quot;signal transduction&quot; (GO 7165) with &quot;cell-cell signaling&quot; (GO 7267). This indicates that in assessing similarity one needs to examine the ontology structure and not rely solely on the GO terms.</Paragraph> <Paragraph position="6"> While we have no experimental evidence, we would like to speculate about the functional or biological significance indicated by similarity in GO annotation. There are three orthogonal aspects to GO: molecular function, biological process, and cellular component. If two genes map more closely in one of the taxonomies, then their function is necessarily more closely related. The majority of GO terms are in the molecular function taxonomy. It is conceivable that genes that map more closely could be involved in the same cascade or participate in the same genetic regulatory network. 
There is increasing interest in genetic networks (e.g. [...]),</Paragraph> <Paragraph position="8"> and combining the ability to search and extract information from the literature with GO mapping could prove effective in elucidating the functional interactions of genes.</Paragraph> </Section> </Paper>