File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-3112_intro.xml
Size: 9,902 bytes
Last Modified: 2025-10-06 14:02:49
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3112"> <Title>Mork</Title> <Section position="3" start_page="2" end_page="2" type="intro"> <SectionTitle> 3 SemGen </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Molecular Biology Resources </SectionTitle> <Paragraph position="0"> To support and supplement the information extracted by SemGen from biomedical text, we draw on two resources, LocusLink and the Gene Ontology. LocusLink (Wheeler et al. 2004) provides a single query interface to curated genomic sequences and genetic loci. It presents information on official nomenclature, aliases, sequence accessions, phenotypes, OMIM numbers, homology, map locations, and related Web sites, among others. Of particular interest is the Reference Sequence (RefSeq) collection, which provides a comprehensive, curated, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products for major research organisms. Currently, SemGen uses LocusLink to obtain normalized gene names and Gene Ontology annotations.</Paragraph> <Paragraph position="1"> The Gene Ontology (GO) (The Gene Ontology Consortium 2000, 2001, 2004) aims to provide a dynamic controlled vocabulary that can be applied to all organisms, even while knowledge of gene and protein function is incomplete or unfolding. The GO consists of three separate ontologies: molecular function, biological process, and cellular component. These three branches are used to characterize gene function and products and provide a comprehensive structure that permits the annotation of molecular attributes of genes in various organisms. 
We use GO annotations to examine whether there are identifiable patterns, or concordance, in the function of gene pairs identified by SemGen.</Paragraph> <Paragraph position="2"> SemGen identifies gene interaction predications based on semantic interpretation adapted from SemRep (Srinivasan and Rindflesch 2002; Rindflesch and Fiszman 2003), a general natural language processing system being developed for the biomedical domain. After the application of a statistically based labeled categorizer (Humphrey 1999) that limits input text to the molecular biology domain, SemGen processing proceeds in three major phases: categorial analysis, identification of concepts, and identification of relations.</Paragraph> <Paragraph position="3"> The initial phase relies on a parser that draws on the SPECIALIST Lexicon (McCray et al. 1994) and the Xerox Part-of-Speech Tagger (Cutting et al. 1992) to produce an underspecified categorial analysis.</Paragraph> <Paragraph position="4"> In the phase for identifying concepts, disorders as well as genes and proteins are isolated by mapping simple noun phrases from the previous phase to concepts in the Unified Medical Language System (Humphreys et al. 1998), using MetaMap (Aronson 2001). ABGene, a program that identifies genes and proteins using several statistical and empirical methods (Tanabe and Wilbur 2002), is also consulted during this phase. In addition, a small list of signal words (such as gene, codon, and exon) helps identify genetic phenomena. For example, the genetic phenomena in (4) are identified from the sentence in (3). Concepts isolated in this phase serve as potential arguments in the next phase.</Paragraph> <Paragraph position="5"> 3) WIF1 was down-regulated in 64% of primary prostate cancers, while SFRP4 was up-regulated in 81% of the patients.</Paragraph> <Paragraph position="7"> During the final phase, in which relations are identified, the predicates of semantic propositions are based on indicator rules. 
These stipulate verbs, nominalizations, and prepositions that &quot;indicate&quot; semantic predicates. During this phase, argument identification is constrained by an underspecified dependency grammar, which also attempts to accommodate coordinated arguments as well as predicates.</Paragraph> <Paragraph position="8"> SemGen originally had twenty rules indicating one of three etiology relations between genetic phenomena and diseases, namely CAUSE, PREDISPOSE, and ASSOCIATED_WITH. In this project, we extended SemGen to cover gene-gene interaction relations: INHIBIT, STIMULATE, and INTERACT_WITH. About 20 indicator rules were taken from MedMiner (Tanabe et al. 1999).</Paragraph> <Paragraph position="9"> We supplemented this list by taking advantage of the verbs identified in syntactic predications by GeneScene (Leroy et al. 2003). SemGen has 46 gene-gene interaction indicator rules (mostly verbs), including 16 for INHIBIT (such as block, deplete, down-regulate); 12 for INTERACT_WITH (bind, implicate, influence, mediate); and 18 for STIMULATE (amplify, activate, induce, upregulate). An overview of the SemGen system is given in Figure 1, and an example is provided below. 
SemGen processing on input text (5) produces the underspecified syntactic structure (represented schematically) in (6).</Paragraph> <Paragraph position="10"> (7) illustrates genetic phenomena identified, and (8) shows the final semantic interpretation.</Paragraph> <Paragraph position="11"> 5) We show here that EGR1 binds to the AR in prostate carcinoma cells, and an EGR1-AR complex can be detected by chromatin immunoprecipitation at the enhancer of an endogenous AR target gene.</Paragraph> <Paragraph position="12"> 6) [We] [show] [here] [that] [EGR1] [binds] [to the AR] [in prostate carcinoma cells,] [and] [an EGR1-AR complex] [can] [be] [detected] [by chromatin immunoprecipitation] [at the enhancer] [of an endogenous AR target gene]</Paragraph> <Paragraph position="14"> During processing, SemGen normalizes gene symbols using the preferred symbol from LocusLink. The final interpretation with LocusLink gene symbol is shown in (9).</Paragraph> <Paragraph position="15"> 9) EGR1|INTERACT_WITH|AR As we retrieve the LocusLink symbol for a gene, we also get the GO terms associated with that gene. We are interested in extending the application of our textual analysis and knowledge extraction methodology and relating it to other biomedical and genomic resources. Gene Ontology is one such important resource, and below we discuss the possibility that GO might shed additional light on the biological relationship between genes that are paired functionally based on textual analysis. 
The GO terms for the genes in (9) are given in (10) and (11).</Paragraph> <Paragraph position="16"> 10) EGR1|[transcription factor activity; regulation of transcription, DNA-dependent; nucleus] 11) AR|[androgen receptor activity; steroid binding; receptor activity; transcription factor activity; transport; sex differentiation; regulation of transcription, DNA-dependent; signal transduction; cell-cell signaling; nucleus] 4 SemGen Evaluation and Error Analysis Before suggesting an application using SemGen output, we discuss the results of error analysis performed on 344 sentences from MEDLINE citations related to six genetic diseases: Alzheimer's disease, Crohn's disease, lung cancer, ovarian cancer, prostate cancer and sickle cell anemia. Out of 442 predications identified by SemGen, 181 were correct, for 41% precision. This is not yet accurate enough to support a production system; however, the majority of the errors are concentrated in two syntactic areas, and we believe that with further development it is possible to provide output effective for supporting practical applications.</Paragraph> <Paragraph position="17"> The majority of the errors fall into one of two major syntactic classes, relativization and coordination. A further source of error is the fact that we have not yet addressed interaction relations that involve a process in addition to a gene.</Paragraph> <Paragraph position="18"> Reduced relative clauses, such as mediated by Tip60 in (12), are a rich source of argument identification errors. 12) LRPICD dramatically inhibits APP-derived intracellular domain/Fe65 transactivation mediated by Tip60.</Paragraph> <Paragraph position="19"> SemGen wrongly interpreted this sentence as asserting that LRPICD inhibits Tip60. The rules of the underspecified dependency grammar that identify arguments essentially look to the left and right of a verb for a noun phrase that has been marked as referring to a genetic phenomenon. 
Arguments are not allowed to be used in more than one predication (unless licensed by coordination or as the head of a relative clause).</Paragraph> <Paragraph position="20"> A number of phenomena conspire in (12) to wrongly allow TIP60 to be analyzed as the object of inhibits. The actual object, transactivation, was not recognized because we have not yet addressed processes as arguments of gene interaction predications. Further, the predication on transactivation, with argument TIP60, was not interpreted, and hence TIP60 was available (incorrectly) for the object of inhibits. If we had recognized the relative clause in (12), TIP60 would not have been reused as an argument of inhibits, since only heads of relative clauses can be reused.</Paragraph> <Paragraph position="21"> The underspecified analysis on which SemGen is based is not always effective in identifying verb phrase coordination, as in (13), leading to the incorrect interpretation that WIF1 interacts with SFRP4.</Paragraph> <Paragraph position="22"> 13) WIF1 was down-regulated in 64% of primary prostate cancers, while SFRP4 was up-regulated in 81% of the patients.</Paragraph> <Paragraph position="23"> A further source of error in this sentence is that down-regulated was analyzed by the tagger as a past tense form rather than a past participle, thus causing the argument identification phase to look for an object to the right of this verb form. A further issue here is that we have not yet addressed truncated passives.</Paragraph> </Section> </Section> </Paper>