<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3102"> <Title>Gene/protein/family name recognition in biomedical literature</Title> <Section position="7" start_page="2" end_page="2" type="concl"> <SectionTitle> 8 Conclusions: </SectionTitle> <Paragraph position="0"> We constructed gene name and family name dictionaries to link each gene name to a gene locus and to relate ambiguous names to gene families. Our preliminary investigations showed that more than one-third, and up to one-half, of the gene/protein names in abstracts are written using ambiguous names such as family- or superfamily-level names. This indicates that dictionary-based gene/protein/family name recognition requires not only a gene name dictionary but also a hierarchical family name dictionary. Using the gene name dictionary GENA, the family name dictionary we constructed, and our searching method, 95, 91, and 89% of the protein/gene/family names in abstracts on S. cerevisiae, D. melanogaster, and H. sapiens, respectively, were detected, with precisions of 96, 92, and 94%. The simple heuristics we developed appear useful for matching gene/family names in text against dictionary entry names, although additional trivial changes are required to address the ambiguity of gene names. These methods are also useful for extracting data on protein interactions and protein functions. However, gene/protein/family name recognition is a deep problem. For example, &quot;NFkappaB&quot; denotes the complex of &quot;NFKB1&quot; and &quot;RELA&quot; in many contexts but sometimes denotes &quot;NFKB1&quot; alone. Unfortunately, we did not resolve such complicated cases. Although different organisms have different naming conventions, the nomenclature for mammals is similar to that for H. sapiens, and most bacterial and archaeal gene/protein/family names follow conventions similar to the nomenclature for S. cerevisiae. We therefore expect that gene name recognition problems for most organisms can be addressed using our method. 
Dictionary-based name recognition cannot find new gene names or synonyms. However, the whole human, Drosophila, and yeast genomes have already been sequenced, so the appearance of new synonyms can be expected to decrease, or new synonyms can be inferred from the known names they reference. In addition, with the introduction of the family name dictionary, some new genes can be retrieved through the higher-level concept name (family name), even if the new gene name itself is not registered in GENA. Accordingly, dictionary-based name recognition is expected to be sufficient for information extraction in these organisms. Protein-interaction and protein-function information extracted using these procedures for gene/protein/family name recognition is available from http://prime.ontlogy.ims.u-toky.ac.jp.</Paragraph> </Section> </Paper>