<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3102"> <Title>Gene/protein/family name recognition in biomedical literature</Title> <Section position="6" start_page="2" end_page="2" type="evalu"> <SectionTitle> 7 Related Work </SectionTitle> <Paragraph position="0"> Various protein/gene name recognition methods have been reported, with some success, as briefly reviewed in the introduction and reviewed in detail in the references (Hirshman et al., 2002). However, most of them do not specify the gene locus. Further, they were developed mainly for H. sapiens. Since naming conventions differ among organisms, their recognition performance for other organisms is unknown.</Paragraph> <Paragraph position="1"> Hirshman et al. (2002) reported dictionary-based name recognition. They discussed the difficulty of recognizing D. melanogaster gene names and showed that precision increases when gene names that are also ordinary English words are removed. Tuason et al. (2004) investigated the ambiguity of gene names within each organism, among organisms (mouse, worm, fly, and yeast), and with general English words. Tsuruoka and Tsujii (2003) also reported dictionary-based name recognition, and our method is similar to theirs. They resolved trivial gene name variations using dynamic programming and tries, whereas our method addresses trivial variations by normalizing dictionary names and devising the trie structure, without dynamic programming; the required CPU time is therefore expected to be greatly reduced without decreasing precision and recall. Our protein name recognition standard differs slightly from theirs, so a direct comparison of precision and recall with their results seems meaningless. Their method focuses on protein names (excluding gene names) and does not seem to distinguish whether a protein name candidate represents the protein itself in context. 
(e.g., &quot;IL-1 receptor antagonist&quot; and &quot;IL-1 receptor expression&quot;: only the latter refers to the IL-1 receptor itself.)</Paragraph> <Paragraph position="2"> Further, our method attempts to resolve the ambiguity of gene names (names shared by multiple genes). Since long protein names are usually written in abbreviated form, name variations caused by permutation and insertion/deletion of words in long names are picked up in the ambiguity resolution process.</Paragraph> </Section></Paper>