<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3102"> <Title>Gene/protein/family name recognition in biomedical literature</Title> <Section position="2" start_page="2" end_page="2" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> With the increasing number of biomedical papers and their electronic publication in NCBI-PUBMED, there is a growing focus on information retrieval from texts. In particular, the recent development of procedures for large-scale experiments, such as yeast two-hybrid screening, mass spectrometry, and DNA/protein microarrays, has brought about many changes in the knowledge required by biologists and chemists. Because these experiments produce large amounts of data on many genes at one time, biologists need extensive knowledge of numerous genes to analyze the data obtained, and acquiring this knowledge manually from the vast biomedical literature is beyond their capability. Since, in many cases, the main objective of text processing is the extraction of protein-protein/gene interactions or gene function, the first problem to solve is gene/protein/compound name recognition. To date, various protein/gene name taggers have been proposed, mainly for Homo sapiens. These methods can be roughly divided into rule-based approaches (Fukuda et al. 1998), statistical approaches, including machine learning (Collier et al.</Paragraph> <Paragraph position="1"> 2000, Nobata et al. 1999), dictionary/knowledge-based approaches (Humphreys et al. 2000, Jenssen et al. 2001, Koike et al. 2003), or a combination of these approaches (Tanabe and Wilbur, 2002). Since merely recognizing gene/protein names is insufficient for organizing the extracted information by gene, dictionary-based name recognition appears useful for assigning a locus to each extracted gene/protein name. Naming conventions differ considerably between organisms. 
Therefore, an approach appropriate to each organism is required.</Paragraph> <Paragraph position="2"> HLT-NAACL 2004 Workshop: Biolink 2004, Linking Biological Literature, Ontologies and Databases, pp. 9-16. Association for Computational Linguistics.</Paragraph> <Paragraph position="3"> There are three main problems in dictionary-based searching: (1) the existence of multi-sense words; (2) variations in gene names; and (3) the existence of ambiguous names. The first problem is mainly seen in symbol (abbreviated) types. For example, HAC1 is a synonym for both &quot;tripartite motif-containing 3&quot; and &quot;hyperpolarization activated cyclic nucleotide-gated potassium channel 2&quot; in H. sapiens. Further, some gene names, especially in Drosophila melanogaster, are spelled the same as verbs (lack, ...), adjectives (white, yellow, ...), common nouns (spot, twin, ...), and prepositions (of, ...). The second problem is trivial variation in gene names (orthographical, morphological, syntactic, lexico-semantic, insertion/deletion, permutation, or pragmatic). For example, &quot;mitogen-activated protein kinase 1&quot; and &quot;protein kinase mitogen-activated, 1&quot; indicate the same gene, as do &quot;NIK serine/threonine protein kinase&quot; and &quot;NIK protein kinase&quot;. The third problem is caused by ambiguous expression of the gene name in the text. The problems of multi-sense words and ambiguity are well summarized by Tuason et al. (2004). In many cases, the family name is used instead of the gene name, either because a unique gene locus has not been specified, especially for genes with multiple paralogs, or to avoid repeating the same expression. For example, in abstracts from 1996 indexed with MeSH terms for human, the &quot;14-3-3&quot; family name was counted 107 times, while &quot;14-3-3 alpha, beta, delta, gamma&quot; gene name expressions did not appear at all. 
Thus, a family name dictionary is required along with a gene name dictionary to specify the gene locus or loci. In this study, we addressed the above-mentioned problems, as far as possible, using simple heuristics. 2 Construction of the gene name dictionary The gene name dictionary, GENA, was constructed using the major databases, including GenAtlas, for Schizosaccharomyces pombe, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Rattus norvegicus, and Homo sapiens. Database entries were merged using the 'official symbol' or ORF name, the link data provided by each entry, and the protein-sequence data entry. Database priorities were assigned in advance. For example, in H.</Paragraph> <Paragraph position="4"> sapiens, HUGO, LocusLink, GDB, and GenAtlas were registered in this order, using the merged entry for the same 'official symbol'. LocusLink's 'preferred symbol', which is not yet administered by HUGO, was also used.</Paragraph> <Paragraph position="5"> Entries in SWISS-PROT and TrEMBL were merged with these registered data using the link data for 'Genew' provided by SWISS-PROT and TrEMBL. The rest of the entries were merged using the protein-IDs for LocusLink, SWISS-PROT, and TrEMBL. For example, LocusLink provides unique representative mRNA and protein sequences, and related sequences belonging to the same gene. If a protein-sequence entry in SWISS-PROT or TrEMBL matched any of these LocusLink sequence entries, the entries were merged.</Paragraph> <Paragraph position="6"> These registered data were also linked with the PIR entries using protein-ID entries. In principle, for all organisms, protein sequences without an 'official symbol' or 'preferred symbol' were not registered. Each entry consisted of an 'official symbol' and an 'official full name', provided by the representative institution for each organism (such as HUGO), together with 'synonyms' and 'gene products'. S. cerevisiae and C. 
elegans do not have 'official full names'. The distinction between these elements of each 'name' simply depends on the 'item headings' in each database. Although GENA registers gene names and their product names separately for each locus, and also records whether the entry's product is protein or RNA, we do not distinguish between them here. Hereafter, we do not distinguish a 'gene product' from a gene name 'synonym'. Unfortunately, databases contain numerous mistaken or inappropriate gene/protein names. The reliability of each synonym was judged according to the database source.</Paragraph> <Paragraph position="7"> To meet our information extraction purposes, only gene names above a certain reliability can be used. Meaningless names (e.g., hypothetical protein), higher-concept names (e.g., membrane protein), and apparently wrong names (e.g., OK) were removed from the data semi-automatically using WordNet vocabularies and term frequencies over all abstracts from one year. In the evaluation for this study, synonym names entered only in TrEMBL or PIR, except for names manually checked in our laboratory, were removed due to their low reliability.</Paragraph> <Paragraph position="8"> In addition to these data, we added synonym names using the following methods. (1) Abbreviations of synonyms were added using an abbreviation extraction algorithm (Schwartz and Hearst, 2003). (2) Plausible gene names were extracted from the subject and object nouns of verbs, such as 'phosphorylate' and 'methylate', that restrict their subjects and objects (both must be protein/gene/family names). These are by-products of protein-interaction extraction in our project. 
The corresponding 'official symbol' was searched for using a partial match against registered names, and the result was finally checked manually.</Paragraph> <Paragraph position="9"> Compound names were gathered from the index of the biochemical dictionary, KEGG (http://www.genome.ad.jp/kegg/kegg2.html), MeSH terms, and UMLS (http://www.nlm.nih.gov/research/umls/) and were registered in GENA. Some high-concept terms were removed manually. Compound name searches were not evaluated in this study. Currently (January, 2004), GENA contains about 920,000 registered gene/protein names and 210,000 compound names.</Paragraph> <Paragraph position="10"> GENA is managed using Postgres and supports command-line searching and Web searching (http://www.gena.ontology.ims.u-tokyo.ac.jp). Because all the words that make up each name are indexed, searches can take into account word-order permutations in long gene names.</Paragraph> <Paragraph position="11"> 3 Construction of family name dictionary The family name dictionary was constructed using SWISS-PROT family names, PIR family names, INTERPRO family names (http://www.ebi.ac.uk/interpro/), gene/protein names in GENA, and clustering by sequence similarity. Family names are hierarchical named entities. For example, &quot;MAPK1&quot; is a member of the &quot;MAPK family&quot; and the &quot;MAPK family&quot; is a member of the &quot;Ser/Thr protein kinase family&quot;; in turn, this family is a member of &quot;protein kinase&quot;, and &quot;protein kinase&quot; is a type of &quot;kinase&quot;. Although &quot;family&quot; is usually used to indicate &quot;similar sequence groups that probably have the same origin&quot;, it is sometimes also used to mean &quot;sequence groups that have almost the same function&quot;. In this paper, we use &quot;family&quot; to mean &quot;ambiguous gene/protein names that indicate similar sequences or biological functions&quot;. 
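The hierarchy in the MAPK1 example can be sketched as a small lookup structure. This is a minimal illustration only, not the implementation used for the family name dictionary: the PARENTS table, the function names ancestors and is_member, and the choice to allow several parents per name are all assumptions made for the sketch.

```python
from collections import deque

# Hypothetical parent links, taken only from the MAPK1 example in the text.
# Each name maps to a list so that a name may have more than one parent
# (an assumption of this sketch).
PARENTS = {
    'MAPK1': ['MAPK family'],
    'MAPK family': ['Ser/Thr protein kinase family'],
    'Ser/Thr protein kinase family': ['protein kinase'],
    'protein kinase': ['kinase'],
}


def ancestors(name, parents=PARENTS):
    # Return every broader name reachable from `name`, nearest first.
    seen = []
    queue = deque(parents.get(name, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another parent
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen


def is_member(name, family, parents=PARENTS):
    # True if `family` is a direct or indirect parent of `name`.
    return family in ancestors(name, parents)
```

With these links, is_member('MAPK1', 'kinase') holds while is_member('kinase', 'MAPK1') does not, so a mention of a broader name could be related to a gene entry by walking upward from the gene.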
Plausible family names based on gene names are the common parts of multiple gene names, such as &quot;MAPK&quot; of &quot;MAPK[number]&quot;, &quot;14-3-3&quot; of &quot;14-3-3 [Greek alphabet[alpha-delta]/alphabet[a-d]]&quot;, &quot;protein kinase&quot; of &quot;Tyr protein kinase&quot; and &quot;Ser/Thr protein kinase&quot;, and &quot;kinase&quot; of &quot;Inositol kinase&quot; and &quot;protein kinase&quot;. The backbone of the family hierarchy was constructed based on the INTERPRO family hierarchy. As far as possible, the remaining hierarchy was constructed manually, taking into account sequence similarities obtained by Markov clustering (Enright et al. 2002) of all-versus-all BLAST results. The hierarchy has a directed acyclic graph structure. Family names span organisms, so the family name dictionary is common to all organisms. The family database is available from http://marine.ims.utokyo.ac.jp:8080/Dict/family. Currently (January, 2004), it contains about 16,000 entries and 70,000 registered names.</Paragraph> </Section> </Paper>