<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3102">
  <Title>Gene/protein/family name recognition in biomedical literature</Title>
  <Section position="3" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Gene/protein/family name searches using a devised trie
</SectionTitle>
    <Paragraph position="0"> Texts were searched for gene/protein/family names using a purpose-built trie, which speeds up name lookup. A separate trie was built for each organism. The core terms stored in the trie were generated from GENA, using the following main heuristics.</Paragraph>
    <Paragraph position="1">  (1) Special characters are replaced by a space. (2) In principle, both Arabic-numeral and Roman-numeral forms are generated.</Paragraph>
    <Paragraph position="2"> (3) The space before a number is removed, unless the character preceding the space is itself a digit (e.g., 14-3-3 becomes &quot;14 3 3&quot;).</Paragraph>
    <Paragraph position="3"> (4) Forms with and without a space are generated before Greek letter names and single Latin letters (a/A, b/B, c/C, ...), for example &quot;14 3 3 alpha&quot;, &quot;14 3 3alpha&quot;, &quot;14 3 3 a&quot;, and &quot;14 3 3a&quot;. (5) Common words at the end of gene names, such as &quot;protein&quot;, &quot;gene&quot;, &quot;sub-family&quot;, &quot;family&quot;, and &quot;group&quot;, are removed. However, they are kept if removing them changes the meaning of the name. For example, &quot;T-cell surface protein&quot; indicates a &quot;protein on the T-cell surface&quot;, while &quot;T-cell surface&quot; usually indicates &quot;the surface of the T-cell&quot;, and removing &quot;protein&quot; from &quot;memory-related protein&quot; causes &quot;memory-related function&quot; to be wrongly recognized as 'memory related /gene-name' 'function'. When &quot;protein&quot;, &quot;gene&quot;, &quot;sub-family&quot;, &quot;group&quot;, or &quot;family&quot; appears within a gene name, variants with and without the word are generated.</Paragraph>
    <Paragraph position="4"> (6) For symbol-type names (fewer than seven characters), variants prefixed with the initial of the organism are also generated. For example, for MAPK1 in H. sapiens, hMAPK1 and h MAPK1 are used. For S. cerevisiae, the protein name is generated by appending &quot;p&quot; to the gene name. For example, the protein of STE7 is STE7p. For D. melanogaster mutations, names with an appended &quot;+&quot; are used, for example lt+ for lt.</Paragraph>
    <Paragraph position="5"> (7) All names are converted to lowercase, and plural forms are also generated. Some names are case sensitive, and some must be written in all capital letters. In principle, when a name has the same spelling as a common noun, adverb, or adjective, only its all-capitals form is adopted in H. sapiens, M. musculus, and R. norvegicus (judged using WordNet vocabulary entries of fewer than five characters; the length limit removes words that merely happen to share a spelling, without removing biological names that are registered in WordNet).</Paragraph>
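    <Paragraph> As an illustration only, heuristics (1), (3), (4), and the lowercasing part of (7) can be sketched in a few lines of Python. The function names, the short Greek-letter list, and the choice of regular expressions are assumptions made for this sketch and are not taken from the original program; numeral conversion, plural generation, and the D. melanogaster "+" names are left out.
<![CDATA[
```python
import re

# Illustrative subset of Greek letter names (assumption, not the authors' list).
GREEK = ("alpha", "beta", "gamma", "delta")


def normalize(name):
    """Heuristics (1), (3), (7): lowercase, specials to spaces, close up numbers."""
    # (7) lowercase, (1) every run of special characters becomes one space.
    s = re.sub(r"[^0-9a-z]+", " ", name.lower()).strip()
    # (3) drop the space before a digit unless the preceding character is a digit.
    return re.sub(r"(?<=[a-z]) (?=[0-9])", "", s)


def suffix_variants(norm):
    """Heuristic (4): spaced and unspaced forms of a trailing Greek/single letter."""
    head, sep, last = norm.rpartition(" ")
    if sep and (last in GREEK or (len(last) == 1 and last.isalpha())):
        return {head + " " + last, head + last}
    return {norm}


if __name__ == "__main__":
    print(normalize("14-3-3"))
    print(sorted(suffix_variants(normalize("14-3-3 alpha"))))
```
]]>
    The sketch expects the spaced form as input to suffix_variants; a name that is already written without the space is returned unchanged.</Paragraph>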
    <Paragraph position="6"> All-capitals names were recognized in the trie. Case-sensitive words such as cAMP and CAMP were selected empirically and checked after the trie search. Many Drosophila melanogaster genes have the same spelling as verbs, adjectives, common nouns, and prepositions. To reduce false positives, these gene names, identified using WordNet vocabularies, are replaced by &quot;gene name + specified word&quot; phrases. For example, the gene name &quot;yellow&quot; is replaced by &quot;yellow locus&quot;, &quot;yellow gene&quot;, &quot;yellow protein&quot;, &quot;yellow allele&quot;, etc. The trie search starts at the character following a space, &quot;-&quot;, &quot;/&quot;, or period, or at the head of a sentence. When several gene names match at the same position, the ID of the longest name is output. When specific terms such as &quot;antagonist&quot;, &quot;receptor&quot;, &quot;cell&quot;, and &quot;inhibitor&quot; are adjacent to the gene name, the hit gene name ID is not output, since the phrase then denotes a different gene/protein or is not a gene/protein name at all. Likewise, when terms such as &quot;promoter&quot; and &quot;mutant&quot; are adjacent to the gene name, the phrase does not denote the gene/protein/family itself; however, for our purpose of extracting genetic interactions, such phrases are treated the same as gene/protein/family names. When specific terms such as &quot;number&quot; precede the gene name, the hit gene name ID is not output, since such hits are multi-sense words and, in most cases, are not gene/protein names. Parentheses are also treated specially, so that &quot;mitogen activated protein kinase (MAPK) 1&quot; is recognized as &quot;mitogen activated protein kinase 1 (MAPK1)&quot;.
Contiguous gene descriptions such as &quot;GATA-4/5/6&quot; are also treated specially, as shown in Figure 1. If a name is a synonym of several genes, all of the corresponding gene IDs are output at this stage.</Paragraph>
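    <Paragraph> The search rules described in this section (start positions after a space, "-", "/", or period, longest-name preference, multiple IDs per name, and suppression when a blocking term follows) can be illustrated with a minimal Python sketch. The class and function names are hypothetical, the blocking list contains only the four terms quoted in the text, and the requirement that a match end at a non-alphanumeric character is an added assumption; the parenthesis and "GATA-4/5/6" handling is not covered.
<![CDATA[
```python
BOUNDARY = " -/."  # a search may start right after one of these characters
BLOCKING_NEXT = {"antagonist", "receptor", "cell", "inhibitor"}


class Trie:
    """Character trie mapping normalized names to sets of gene IDs."""

    def __init__(self):
        self.root = {}

    def add(self, name, gene_id):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        # "$ids" cannot collide with a child key: children are single characters.
        node.setdefault("$ids", set()).add(gene_id)

    def longest_match(self, text, start):
        """Return (end, ids) for the longest name starting at `start`, or None."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            # Assumption: a name must end at the text end or before a non-alphanumeric.
            clean_end = i + 1 == len(text) or not text[i + 1].isalnum()
            if "$ids" in node and clean_end:
                best = (i + 1, node["$ids"])
        return best


def search(trie, text):
    """Scan `text` and return (matched name, sorted IDs) pairs."""
    hits, i = [], 0
    while i < len(text):
        at_start = i == 0 or text[i - 1] in BOUNDARY
        match = trie.longest_match(text, i) if at_start else None
        if match is None:
            i += 1
            continue
        end, ids = match
        following = text[end:].split()
        next_word = following[0].strip(".,;:") if following else ""
        if next_word not in BLOCKING_NEXT:
            hits.append((text[i:end], sorted(ids)))
        i = end  # continue after the consumed name
    return hits
```
]]>
    For example, with "map kinase" and "map kinase kinase" both registered, the text "the map kinase kinase binds map kinase inhibitor" yields only the longer name, because the second occurrence is followed by a blocking term.</Paragraph>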
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5 Resolving multi-sense words
</SectionTitle>
    <Paragraph position="0"> To resolve the problem of multi-sense words, we used information from the whole text. When the hit name is shorter than a certain length (seven characters for H. sapiens; the length differs for each organism), it may be an abbreviation of something else (not only another gene name, but also an experimental method or the name of an apparatus).</Paragraph>
    <Paragraph position="1"> To avoid false-positive words as far as possible, we used the following heuristics in M. musculus, R.</Paragraph>
    <Paragraph position="2"> norvegicus, and H. sapiens.</Paragraph>
    <Paragraph position="3"> 1) If the corresponding full name, or a name longer than six characters, is written in the same abstract, the hit gene ID is used.</Paragraph>
    <Paragraph position="4"> When the full name and abbreviation pairs are written in the abstract as &amp;quot;plausible full name (the hit name)&amp;quot; or &amp;quot;plausible full name [the hit name]&amp;quot;, the following procedures are carried out.</Paragraph>
    <Paragraph position="5">  2) If the full/long name is a complete match for the synonyms or full name of the corresponding ID, the hit gene ID is used.</Paragraph>
    <Paragraph position="6"> 3) If the full/long name, obtained with the abbreviation extraction algorithm (Schwartz and Hearst, 2003), is not a complete match for the names of the corresponding ID but consists only of words used in any name of that ID (i.e., it matches when changes in word order are allowed), the hit ID is adopted. If not, the hit ID is discarded.</Paragraph>
    <Paragraph position="7"> 4) If information on full names or long names is not  found in the abstract, a key-word search of all the abstracts is carried out. If at least one key word is detected, the ID is used.</Paragraph>
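    <Paragraph> The four steps above can be read as a decision cascade. The following Python sketch is one possible reading, not the original program: the function name, the skip-word list, and the example gene "abc1" are hypothetical, and the full name paired with the abbreviation is assumed to have been extracted beforehand (for example with the Schwartz and Hearst algorithm) and is simply passed in.
<![CDATA[
```python
import re

# Words ignored when comparing name components (illustrative assumption).
SKIP = {"of", "with", "sub-family", "family"}


def words(name):
    """Lowercased component words of a name, without skipped words."""
    return [w for w in re.findall(r"[a-z0-9-]+", name.lower()) if w not in SKIP]


def resolve(abstract, names, keywords, paired_full_name=None, min_len=7):
    """Return "accept" or "discard" for a short hit name of one candidate ID.

    names            -- all synonyms/full names registered for the candidate ID
    keywords         -- keywords registered for the candidate ID
    paired_full_name -- full name found as "full name (hit)" in the abstract, if any
    """
    text = abstract.lower()
    # Step 1: a registered name longer than six characters occurs in the abstract.
    if any(len(n) >= min_len and n.lower() in text for n in names):
        return "accept"
    if paired_full_name is not None:
        # Step 2: the paired full name is exactly one of the registered names.
        if paired_full_name.lower() in {n.lower() for n in names}:
            return "accept"
        # Step 3: every word of the paired name is a component of some registered name.
        vocab = {w for n in names for w in words(n)}
        paired = words(paired_full_name)
        return "accept" if paired and all(w in vocab for w in paired) else "discard"
    # Step 4: no full/long name available, so fall back to the keyword search.
    return "accept" if any(k.lower() in text for k in keywords) else "discard"
```
]]>
    With the hypothetical names ["abc1", "alpha binding carrier 1"], the pair "carrier binding alpha (abc)" is accepted at step 3 because only the word order differs, whereas "automatic bias control (abc1)" is discarded.</Paragraph>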
    <Paragraph position="8"> These steps are summarized in Figure 2.</Paragraph>
    <Paragraph position="9"> (The numbers in Fig. 2 correspond to the step numbers above.) However, step (2) is not sufficient in some cases, because an abbreviation is sometimes defined only once for a whole family. For example, PubMed ID 8248212 contains: &quot;... the recently described TAP (transporter associated with antigen processing) genes have been mapped approximately midway between DP and DQ. ...</Paragraph>
    <Paragraph position="10"> In addition to the alleles of TAP1 that have been described, others were identified during this study.&quot; &quot;TAP1&quot; is a synonym for both &quot;transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)&quot; and &quot;transient receptor potential cation channel, subfamily C, member 4 associated protein&quot;. In most cases, the full name is written only once for the same family. In this case, the former (&quot;transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)&quot;) is correct. Accordingly, the full name and abbreviation pair for &quot;TAP&quot;, without the number, is also checked. Since all of its words (&quot;transporter&quot;, &quot;associated&quot;, &quot;antigen&quot;, &quot;processing&quot;) are components of synonyms of TAP1, TAP1 is recognized. Prepositions such as &quot;of&quot; and &quot;with&quot;, and frequently used words such as &quot;sub-family&quot; and &quot;family&quot;, are skipped in this process. Furthermore, to allow for lexico-semantic variation, adjective and noun forms are provided for each word as far as possible, using WordNet vocabularies and UMLS.</Paragraph>
    <Paragraph position="11">  With this treatment, synonyms that share a spelling can be distinguished only when a full (or nearly full) name and abbreviation pair appears in the text. In some cases, the candidate genes belong to the same family. For example, LRE2 is a synonym for both &quot;LINE retrotransposable element 2&quot; and &quot;LINE retrotransposable element 3&quot;. In this case, the distinction between them is very fine and seems unimportant. In some abstracts, full names are not written at all. To resolve this issue, we used keywords for each gene, selected from all the words and terms (contiguous words) that make up its synonym names and family names, as described in step (4). When at least one keyword is detected, the ID is accepted. A keyword must appear fewer than 50 times among genes (this limit applies only to words extracted from gene names, not to words from family names), must appear less than a certain frequency in all abstracts, and must not be common to different genes that have synonyms with the same spelling. Even with a keyword search, locus identification for &quot;p#&quot; names denoting &quot;# kDa&quot; proteins, such as p60 and p61, is quite difficult, except for well-known names such as p53 and p38. For well-known name-ID pairs such as cAMP (cyclic AMP) and CD2 (cluster designation 2), the IDs are accepted in order to recover false negatives, even if the full/longer name is not written in the abstract and no keywords are detected.</Paragraph>
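    <Paragraph> The keyword selection criteria can be sketched as a filter over the words that make up each gene's names. This is a simplified reading under stated assumptions: "appearing fewer than 50 times among genes" is interpreted as the number of genes whose names contain the word, the abstract-frequency threshold of 1000 is a placeholder because the text gives no value, family-name words are not treated separately, and the function and parameter names are hypothetical.
<![CDATA[
```python
import re
from collections import Counter, defaultdict


def tokens(name):
    """Component words of a name: start with a letter, at least two characters."""
    return set(re.findall(r"[a-z][a-z0-9]+", name.lower()))


def select_keywords(names_by_id, abstract_freq,
                    max_gene_count=50, max_abstract_freq=1000):
    """Return {gene ID: set of keywords} following the three stated criteria."""
    words_by_id = {gid: set().union(*map(tokens, names))
                   for gid, names in names_by_id.items()}

    # Number of genes whose names contain each word.
    gene_count = Counter()
    for ws in words_by_id.values():
        gene_count.update(ws)

    # Genes that share at least one identically spelled synonym.
    ids_by_name = defaultdict(set)
    for gid, names in names_by_id.items():
        for n in names:
            ids_by_name[n.lower()].add(gid)
    rivals = defaultdict(set)
    for ids in ids_by_name.values():
        for gid in ids:
            rivals[gid] |= ids - {gid}

    result = {}
    for gid, ws in words_by_id.items():
        # Words also used by a rival gene cannot tell the two genes apart.
        shared = set().union(*(words_by_id[r] for r in rivals[gid]))
        result[gid] = {
            w for w in ws
            if gene_count[w] < max_gene_count
            and abstract_freq.get(w, 0) < max_abstract_freq
            and w not in shared
        }
    return result
```
]]>
    In this sketch a word shared by two genes with an identically spelled synonym is dropped for both, so only words that can separate the two candidates remain as keywords.</Paragraph>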
    <Paragraph position="12"> Automatic keyword selection using conventional methods such as tf-idf (Salton and Yang, 1973) and SMART (Singhal et al., 1996) may be applicable. However, the number of abstracts per gene is too small in many cases for effective keyword selection. Therefore, this approach was not applied in this study.</Paragraph>
    <Paragraph position="13"> For S. cerevisiae, C. elegans, and D. melanogaster, in most cases, the full names of symbols are not written.</Paragraph>
    <Paragraph position="14"> The hit ID is discarded only when the symbol appears in a symbol (abbreviation)-full name pair and the full name either is not the corresponding gene name or contains a word that is not a component of the synonyms.</Paragraph>
    <Paragraph position="15"> Although we removed, as far as possible, gene names that we judged to be wrong or inappropriate, some of the remaining names either do not seem to be genuine synonyms or are rarely used. These can cause errors. For example, LPS is a synonym for both &quot;interferon regulatory factor 6&quot; (according to, for example, LocusLink and GenAtlas) and &quot;lipopolysaccharide&quot; in H. sapiens. However, our investigations indicate that LPS is not used to indicate &quot;interferon regulatory factor 6&quot; in abstracts.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6 Experiment and Results
</SectionTitle>
    <Paragraph position="0"> To evaluate the recall and precision of our method for gene/protein/family name recognition, we manually tagged 100 abstracts (from 1996) for each of the following organisms: S. cerevisiae, D.</Paragraph>
    <Paragraph position="1"> melanogaster, and H. sapiens, with the MeSH terms &quot;saccharomyces cerevisiae&quot;, &quot;drosophila melanogaster&quot;, and &quot;human&quot;, respectively. Table 1 shows the results. This evaluation examined whether each gene/family ID was correctly assigned in the abstract.</Paragraph>
    <Paragraph position="2"> (Each ID was counted only once per abstract.) When precision and recall were instead calculated over all gene/family name mentions (so that each ID can be counted more than once per abstract), they did not change much, staying within 2-5% of the values counted once per abstract. In judging family name recognition, slightly relaxed criteria were used. If an exactly matching entry was not registered in the family name dictionary, the ID of a higher-level concept was assigned. For example, &quot;lactate dehydrogenase&quot; was not registered in the family name dictionary, so this name was assigned the ID &quot;dehydrogenase&quot;. Even when other organisms are mentioned in the same abstract, their gene names are in principle not extracted. However, human, rat, and mouse are not distinguished in this validation. Family names in other organisms are also extracted in this evaluation.</Paragraph>
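    <Paragraph> The per-abstract counting rule used in this evaluation, where each ID is counted at most once per abstract, corresponds to comparing sets of IDs. The Python sketch below shows that computation; the function name and the dictionary layout are assumptions for illustration and do not describe the evaluation scripts actually used.
<![CDATA[
```python
def precision_recall(gold, predicted):
    """Micro-averaged precision and recall over abstracts.

    gold, predicted -- dicts mapping an abstract ID to the set of gene/family
    IDs tagged by hand and output by the system, respectively. Using sets
    means that each ID is counted only once per abstract.
    """
    tp = fp = fn = 0
    for abstract_id in gold.keys() | predicted.keys():
        g = gold.get(abstract_id, set())
        p = predicted.get(abstract_id, set())
        tp += len(g & p)   # correctly assigned IDs
        fp += len(p - g)   # assigned but not in the manual tagging
        fn += len(g - p)   # manually tagged but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```
]]>
    Counting every mention instead would only require replacing the sets with multisets of mentions, for example collections.Counter objects.</Paragraph>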
    <Paragraph position="3"> As shown in Table 2, in all organisms, more than one-third of the gene names were written as family names. This indicates the need for hierarchical gene names, as in the family dictionary, although conventional methods have scarcely addressed this point. The recall and precision for these organisms shown in Table 1 are relatively high in a rough comparison with previous reports</Paragraph>
    <Paragraph position="4"> (precision: 72-93%, recall: 76-94%; reviewed in Hirschman 2002). The errors were as follows. In H. sapiens, only 4 and 1 names, registered in GENA and the family name dictionary, were initially recognized as gene/family names but then erroneously discarded by the procedures used to confirm ambiguous names. Many of these errors were caused by failures of the keyword search. For family names in particular, the keywords seem to be insufficient.</Paragraph>
    <Paragraph position="5"> These cases can probably be addressed to some extent by also using the keywords of the higher/lower concept IDs. In some cases, the full-name and abbreviation match failed.</Paragraph>
    <Paragraph position="6"> For example, in &quot;urokinase-type plasminogen activator receptor (uPAR, CD87)&quot;, the full-name and abbreviation match failed because the parentheses contain two names. Such errors could be recovered by the keyword search; however, the present program does not use this recovery step. The recall of family names in H. sapiens is slightly low because of the variety of families, as shown in Table 1. There were 6 and 4 false-positive gene/protein names in S. cerevisiae and H. sapiens, respectively, and 7 and 5 false-positive family names in S. cerevisiae and H. sapiens, respectively. Most of them were short names that were not removed because their keywords were inappropriate. Some were caused by inappropriate GENA entries.</Paragraph>
    <Paragraph position="7"> For D. melanogaster, 10 gene/protein names that were registered in GENA were not recognized as gene/family names. Many of them were general nouns or adjectives that were not used in a &quot;gene name + specified words&quot; phrase in the abstracts. The rest were gene/protein names removed in the trie implementation steps because of their confusing spellings, such as &quot;104&quot;. Recognition of mutant gene names was also quite difficult with this method, since the superscript denoting the mutation is converted into normal characters in NCBI abstracts, and newly developed mutants are named by changing the superscript. Four family names were recognized once and then erroneously discarded in the keyword search steps. Thirty-one gene/protein names and 12 family names were false positives. Most of the false-positive gene/protein names were misleading names such as 19A. Such misleading names were removed or replaced by the &quot;gene name + specified words&quot; phrase as far as possible, using some heuristics and term frequencies in abstracts. However, some remained. Some false positives were gene names of other organisms that were wrongly extracted.</Paragraph>
    <Paragraph position="8"> Under the strict criteria for family name recognition, 10, 18, and 10 names were recognized as higher concepts in H.</Paragraph>
    <Paragraph position="9"> sapiens, D. melanogaster, and S. cerevisiae, respectively.</Paragraph>
    <Paragraph position="10"> More detailed entries need to be registered in the family name dictionary.</Paragraph>
    <Paragraph position="11"> The name detection heuristics seem to be sufficient: no name detection failed because of trivial name variations in H. sapiens and S. cerevisiae, and only one failed in D. melanogaster, apart from mutant variations. The ambiguity resolution steps leave some room for improvement through more sophisticated keyword searching.</Paragraph>
    <Paragraph position="12"> In our laboratory, protein interaction information and protein functions were automatically extracted and stored in PRIME (http://prime.ontology.ims.u-tokyo.ac.jp) and in the protein kinase database (http://kinasedb.ontology.ims.u-tokyo.ac.jp, Koike et al., 2003). With this procedure, some false positives were not extracted, since the phrase patterns for protein interactions and protein functions did not match them. That is, some wrongly recognized names were removed as a result of considering the local context. At this stage, the numbers of false-positive names were 0, 4, and 3 for S. cerevisiae, D. melanogaster, and H. sapiens, respectively. Using the family name dictionary greatly increased the recognition of ambiguous names. However, a new difficulty was found in extracting information. Many family names are identical to nouns that describe a function. Therefore, even if a phrase pattern is used, a wrong interaction may be extracted. For example, from PUBMED_11279098: &quot;We also identified key residue pairs in the hydrophobic core of the Cet1 protomer that support the active site tunnel and stabilize the triphosphatase in vivo.&quot; It is difficult to judge automatically from this sentence whether &quot;triphosphatase&quot; means the function of Cet1 or the name of another protein family. All the interaction information in this abstract indicates that &quot;triphosphatase&quot; is the activity of Cet1. Our program wrongly extracted &quot;Cet1/gene-name&quot; stabilize &quot;triphosphatase/family-name&quot;. Additional heuristics are required to remove such wrongly extracted data.</Paragraph>
  </Section>
</Paper>