<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3306"> <Title>Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries</Title> <Section position="3" start_page="41" end_page="42" type="metho"> <SectionTitle> 2 Automatically Extracted Synonym Dictionaries </SectionTitle> <Paragraph position="0"> Even when restricted to human genes, biomedical researchers mention genes in a highly variable manner, with a minimum of adherence to the gene naming standard provided by the Human Gene Nomenclature Committee (HGNC). In addition, frequent variations in spelling and punctuation generate additional non-standard forms. Extracting gene synonyms automatically from online databases has several benefits (Cohen, 2005). First, online databases contain highly accurate annotations from expert curators, and thus serve as excellent information sources. Second, refreshing of specialized lexicons from online sources provides a means to obtain new information automatically and with no human intervention. We thus sought a way to rapidly collect as many human gene identifiers as possible.</Paragraph> <Paragraph position="1"> All the statistics used in this section are from online database holdings last extracted on February 20, 2006.</Paragraph> <Section position="1" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.1 Building the Initial Dictionaries </SectionTitle> <Paragraph position="0"> Nineteen online websites and databases were initially surveyed to identify a set of resources that collectively contain a large proportion of all known human gene identifiers. After examining the 19 resources with a limited but representative set of gene names, we determined that only four databases together contained all identifiers (excluding resource-specific identifiers used for internal tracking purposes) used by the 19 resources. We then built an automated retrieval agent to extract gene synonyms from these four online databases: the HGNC Genew database, Entrez Gene, Swiss-Prot, and Stanford SOURCE. The results were collected into a single dictionary. Each entry in the dictionary consists of a gene identifier and a corresponding official HGNC symbol. For data from HGNC, withdrawn entries were excluded. Retrieving gene synonyms from SOURCE required a list of gene identifiers to query SOURCE, which was compiled by the retrieval agent from the other sources (i.e., HGNC, Entrez Gene and Swiss-Prot). In total, there were 333,297 entries in the combined dictionary.</Paragraph> </Section> <Section position="2" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.2 Rule-Based Filter for Purification </SectionTitle> <Paragraph position="0"> Examination of the initial dictionary showed that some entries did not fit our definition of a gene identifier, usually because they were peripheral (e.g., a GenBank sequence identifier) or described a gene class (e.g., an Enzyme Commission identifier or a term such as &quot;tyrosine kinase&quot;). A rule-based filter was imposed to prune these uninformative synonyms. The rules remove identifiers under these conditions: 1. Follows the form of a GenBank or EC accession ID (e.g., 1-2 letters followed by 5-6 digits). 2. Contains at most 2 characters and 1 letter but is not an official HGNC symbol (e.g., P1).</Paragraph> <Paragraph position="1"> 3. Matches a description in the OMIM morbid list (e.g., Tangier disease).</Paragraph> <Paragraph position="2"> 4. Is a gene EC number. 5.
Ends with ', family ?', where ? is a capital letter or a digit.</Paragraph> <Paragraph position="3"> 6. Follows the form of a DNA clone (e.g., 1-4 digits followed by a single letter, followed by 1-2 digits).</Paragraph> <Paragraph position="4"> 7. Starts with 'similar to' (e.g., similar to zinc finger protein 533).</Paragraph> <Paragraph position="5"> Our filter pruned 9,384 entries (2.82%).</Paragraph> </Section> <Section position="3" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 2.3 Internal Update Across the Dictionaries </SectionTitle> <Paragraph position="0"> We used HGNC-designated human gene symbols as the unique identifiers. However, we found that certain gene symbols listed as &quot;official&quot; in the non-HGNC sources were not always current, and that other assigned symbols were not officially designated as such by HGNC. To remedy these issues, we treated HGNC as the most reliable source and Entrez Gene as the next most reliable, and then updated our dictionary as follows: In the initial dictionary, some synonyms were associated with symbols that were later withdrawn by HGNC. Our retrieval agent extracted a list of 5,048 withdrawn symbols from HGNC, and then replaced any outdated symbols in the dictionary with the official ones. Sixty withdrawn symbols were found to be ambiguous, but we found none of them appearing as symbols in our dictionary.</Paragraph> <Paragraph position="1"> If a symbol used by Swiss-Prot or SOURCE was not found as a symbol in HGNC or Entrez Gene, but was a non-ambiguous synonym in HGNC or Entrez Gene, then we replaced it with the corresponding symbol of that non-ambiguous synonym.</Paragraph> <Paragraph position="2"> Among the 323,913 remaining entries, 801 entries (0.25%) had symbols updated. After removing duplicate entries (42.19%), 187,267 distinct symbol-synonym pairs representing 33,463 unique genes were present. All tasks addressed in this section were performed automatically by the retrieval agent.</Paragraph> </Section> </Section> <Section position="4" start_page="42" end_page="43" type="metho"> <SectionTitle> 3 Exact String Matching </SectionTitle> <Paragraph position="0"> We initially invoked several string transformations for gene normalization, including: 1. Normalization of case.</Paragraph> <Paragraph position="1"> 2. Replacement of hyphens with spaces. 3. Removal of punctuation.</Paragraph> <Paragraph position="2"> 4. Removal of parenthesized materials. 5. Removal of stop words (stop word list: ftp://ftp.cs.cornell.edu/pub/smart/English.stop).</Paragraph> <Paragraph position="3"> 6. Stemming, where the Porter stemmer was employed (Porter, 1980).</Paragraph> <Paragraph position="4"> 7. Removal of all spaces.</Paragraph> <Paragraph position="5"> The first four transformations are derived from (Cohen et al., 2002). Not all the rules we experimented with demonstrated good results for human gene name normalization. For example, we found that stemming is inappropriate for this task. To amend potential boundary errors of tagged mentions, or to match variants of the synonyms, four mention reductions (Cohen et al., 2002) were also applied to the mentions or synonyms: 1. Removal of the first character.</Paragraph> <Paragraph position="6"> 2. Removal of the first word.</Paragraph> <Paragraph position="7"> 3. Removal of the last character.</Paragraph> <Paragraph position="8"> 4.
Removal of the last word.</Paragraph> <Paragraph position="9"> To support these operations, we built a system that allows transformations and reductions to be invoked flexibly, including chaining rules in various sequences, grouping rules for simultaneous invocation, and applying transformations to the candidate mention, the dictionary, or both. For example, the mention &quot;alpha2Cadrenergic receptor&quot; in PMID 8967963 matches the synonym &quot;Alpha-2C adrenergic receptor&quot; of gene ADRA2C after normalizing case, replacing hyphens with spaces, and removing spaces. Each rule can be built into an invocation sequence deemed optimal by evaluation for a given application domain. A normalization step is defined here as the process of finding string matches after a sequence of chained transformations, with optional reductions of the mentions or synonyms. We call a normalization step safe if it generally makes only minor changes to mentions. Conversely, a normalization step is called aggressive if it often makes substantial changes. However, a normalization step that is safe for long mentions may not be safe for short ones. Hence, our system allows a user to set optional parameters specifying the minimal mention length and/or the minimal normalized mention length required to invoke a match.</Paragraph> <Paragraph position="10"> A normalization system consists of multiple normalization steps in sequence. Transformations are applied sequentially and a match is searched for; if no match is identified for a particular step, the algorithm proceeds to the next step. The normalization steps and the optional conditions are encoded in our program, which allows a flexible system to be specified by a sequence of step codes. Our general principle is to design a normalization system that invokes safe normalization steps first, and then gradually moves to more aggressive ones. As the process lengthens, precision decreases while recall increases. The balance between precision and recall desired for a particular application can be defined by the user.</Paragraph> <Paragraph position="11"> Specifically, given a string s, we use T(s) to denote the transformed string. All seven transformation rules listed at the beginning of this section are idempotent, since T(T(s)) = T(s). Two transformations T1 and T2 are called commutative if T1(T2(s)) = T2(T1(s)). The first four transformations listed form a set of commutative rules. Knowledge of these properties helps in designing a normalization system.</Paragraph> <Paragraph position="12"> Recall that NER systems, such as those required for BioCreAtIvE task 1B, consist of two stages. For our applications of interest, the normalization input is generated by a gene tagger (McDonald and Pereira, 2005), followed by the normalization system described here as the second stage. In the second stage, more synonyms do not necessarily imply better performance, because less frequently used or less informative synonyms may result in ambiguous matches, where a match is called ambiguous if it associates a mention with multiple gene identifiers. For example, from the Swiss-Prot dictionary we know the gene mention 'MDR1' in PMID 8880878 is a synonym uniquely representing the ABCB1 gene. However, if we include synonyms from HGNC, the match becomes ambiguous because the TBC1D9 gene also uses the synonym 'MDR1'.</Paragraph>
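<Paragraph> To make the stepwise, safest-first matching concrete, the following is a minimal Python sketch of a chained-transformation lookup, assuming a dictionary that maps each synonym string to the set of HGNC symbols it can denote; the particular step chains, the length guard, and all function names are illustrative rather than the authors' implementation. </Paragraph>
```python
import re
import string

# Illustrative transformations (a subset of the seven listed above).
def lowercase(s):
    return s.lower()

def hyphens_to_spaces(s):
    return s.replace("-", " ")

def drop_parenthesized(s):
    return re.sub(r"\([^)]*\)", " ", s)

def strip_punctuation(s):
    return s.translate(str.maketrans("", "", string.punctuation))

def remove_spaces(s):
    return re.sub(r"\s+", "", s)

# Steps ordered from safe to aggressive; each step is a chain of transformations.
# Parenthesized material is dropped before punctuation stripping so the parentheses
# are still present when that rule runs.
STEPS = [
    [lowercase],
    [lowercase, hyphens_to_spaces],
    [lowercase, hyphens_to_spaces, drop_parenthesized, strip_punctuation],
    [lowercase, hyphens_to_spaces, drop_parenthesized, strip_punctuation, remove_spaces],
]

def apply_chain(chain, s):
    for transform in chain:
        s = transform(s)
    return s

def normalize(mention, synonym_to_symbols, min_len=2):
    """Return the HGNC symbols matched at the first (safest) successful step."""
    for chain in STEPS:
        key = apply_chain(chain, mention)
        if len(key) < min_len:          # optional length guard for aggressive steps
            continue
        # The same chain is applied to every dictionary synonym (rebuilt here for
        # clarity; a real system would precompute one index per step).
        index = {}
        for synonym, symbols in synonym_to_symbols.items():
            index.setdefault(apply_chain(chain, synonym), set()).update(symbols)
        if key in index:
            return index[key]           # more than one symbol means an ambiguous match
    return set()

# Example from the text: "alpha2Cadrenergic receptor" (PMID 8967963) only matches
# "Alpha-2C adrenergic receptor" after case folding, hyphen replacement, and space removal.
demo = {"Alpha-2C adrenergic receptor": {"ADRA2C"}}
print(normalize("alpha2Cadrenergic receptor", demo))   # -> {'ADRA2C'}
```
<Paragraph> Because the transformations are idempotent and the first four commute, as noted above, chaining and grouping them does not depend on the exact order in which those rules are written. </Paragraph>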
<Paragraph position="13"> We investigated the rules separately, designed the initial normalization procedure, and tuned our system at the end. To evaluate the efficacy of our compiled dictionary and its sources, we determined the accuracy of our system with all transformations and reductions invoked sequentially, and without any effort to optimize the sequence (see Section 6 for evaluation details). The goal of this experiment was to evaluate the effectiveness of each vocabulary source alone and in combination. Our experimental results at the mention level are summarized in Table 1. The best two-stage system achieved a precision of 0.725 and recall of 0.704 with an F-measure of 0.714, using only HGNC and Swiss-Prot entries. As errors can derive from the tagger, from the normalization, or from both in combination, we also assessed the performance of our normalization program alone by directly normalizing the mentions in the gold standard file used for evaluation (i.e., assuming the tagger is perfect). Our normalization system achieved 0.824 F-measure (0.958 precision and 0.723 recall) in this evaluation.</Paragraph> </Section> <Section position="5" start_page="43" end_page="44" type="metho"> <SectionTitle> 4 Approximate String Matching </SectionTitle> <Paragraph position="0"> Approximate string matching techniques are well developed for entity identification. Given two strings, a distance metric generates a score that reflects their similarity. Various string distance metrics have been developed based upon edit distance, string tokenization, or a hybrid of the two approaches (Cohen et al., 2003). Given a gene mention, we consider the synonym(s) with the highest score to be a match if the score is higher than a defined threshold. Our program also allows optional string transformations and provides a user-defined parameter for the minimal mention length required for approximate string matching. The choice of method may be affected by several factors, such as the application domain, features of the strings representing the entity class, and the particular data sets used. For gene NER, various scoring methods have been favored (Crim et al., 2005; Cohen et al., 2003; Wellner et al., 2005).</Paragraph> <Paragraph position="1"> Approximate string matching is usually considered more aggressive than exact string matching with transformations; hence, we applied it as the last step of our normalization sequence. To assess the usefulness of approximate string matching, we began with our best dictionary subset from Section 3 (i.e., using HGNC and Swiss-Prot), and applied approximate string matching as an additional normalization step.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> Matching for Gene Normalization. </SectionTitle> <Paragraph position="0"> We selected six existing distance metrics that appeared to be useful for human gene normalization: Jaro, JaroWinkler, SmithWaterman, TFIDF, UnsmoothedJS, and Jaccard. Our experiment showed that TFIDF, UnsmoothedJS and Jaccard outperformed the others for human gene normalization in our system, as shown in Figure 1. By incorporating approximate string matching using any of these metrics into our system, overall performance was slightly improved to 0.718 F-measure (0.724 precision and 0.713 recall) when employing a high threshold (0.95). However, in most scenarios, approximate matching did not considerably improve recall and had a non-trivial detrimental effect upon precision.</Paragraph>
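<Paragraph> As one concrete instance of the thresholded matching step, the sketch below scores a mention against the synonym dictionary with a token-level Jaccard similarity; Jaccard is one of the six metrics named above, but the tokenization, tie handling, and function names here are illustrative and are not the implementation used in the experiments. </Paragraph>
```python
def jaccard(tokens_a, tokens_b):
    """Token-level Jaccard similarity between two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def approximate_match(mention, synonym_to_symbols, threshold=0.95, min_len=5):
    """Return the symbols of the best-scoring synonym(s) at or above the threshold."""
    if len(mention) < min_len:              # optional guard against very short mentions
        return set()
    mention_tokens = mention.lower().split()
    best_score, best_symbols = 0.0, set()
    for synonym, symbols in synonym_to_symbols.items():
        score = jaccard(mention_tokens, synonym.lower().split())
        if score > best_score:
            best_score, best_symbols = score, set(symbols)
        elif score == best_score and score > 0.0:
            best_symbols |= set(symbols)    # ties: keep all top-scoring synonyms
    return best_symbols if best_score >= threshold else set()
```
<Paragraph> A result containing more than one symbol is an ambiguous match in the sense defined in Section 5; the threshold of 0.95 mirrors the high threshold reported above. </Paragraph>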
</Section> </Section> <Section position="6" start_page="44" end_page="46" type="metho"> <SectionTitle> 5 Ambiguity Analysis </SectionTitle> <Paragraph position="0"> Gene identifier ambiguity is inherent in synonym dictionaries, and is also generated during normalization steps that transform mention strings.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 5.1 Ambiguity in Synonym Dictionaries </SectionTitle> <Paragraph position="0"> If multiple gene identifiers share the same synonym, ambiguity results. Table 2 shows the level of ambiguity between and among the four sources of gene identifiers used by our dictionary. The rate of ambiguity ranges from 0.89% to 2.83%, comparable with the rates for mouse (1.5%) and Drosophila (3.6%) identifiers (Hirschman et al., 2005).</Paragraph> <Paragraph position="1"> Table 2: Ambiguity among the four sources of the human gene dictionary.</Paragraph> <Paragraph position="2"> Figure 2 is a log-log plot showing the distribution of ambiguous synonyms, where the degree is the number of gene identifiers that a synonym is associated with. Comparing Figure 2 with (Hirschman et al., 2005, Figure 3), we noted that on average, human gene synonyms are less ambiguous than those of the three model organisms.</Paragraph> <Paragraph position="3"> Another type of ambiguity is caused by gene symbols or synonyms that are common English words or other biological terms. Our dictionary contains 11 gene symbols identical to common stop words: T, AS, DO, ET, IF, RD, TH, ASK, ITS, SHE and WAS.</Paragraph> </Section> <Section position="2" start_page="44" end_page="46" type="sub_section"> <SectionTitle> 5.2 Ambiguous Matches in Gene Normalization </SectionTitle> <Paragraph position="0"> We call a match ambiguous if it associates a mention with multiple gene identifiers. Although the normalization procedure may create ambiguity, a mention that matches multiple synonyms is not necessarily ambiguous. For example, the gene mention &quot;M creatine kinase&quot; in PMID 1690725 matches the synonyms &quot;Creatine kinase M-type&quot; and &quot;Creatine kinase, M chain&quot; in our dictionary using the TFIDF scoring method (with score 0.866). In this case, both synonyms are associated with the CKM gene, so the match is not ambiguous. Conversely, even if a mention matches only one synonym, it can be ambiguous, because the synonym itself may be ambiguous.</Paragraph> <Paragraph position="1"> Figure 3 shows the result of an experiment conducted on 200,000 MEDLINE abstracts, where the degree of ambiguity is the number of gene identifiers that a mention is associated with. The maximum, average, and standard deviation of the ambiguity degrees are 20, 1.129 and 0.550, respectively. The overall ambiguity rate of all matched mentions was 8.16%, and the rate of ambiguity is less than 10% at each step. Successful disambiguation can increase the true positive match rate and therefore improve performance, but is beyond the scope of the current investigation.</Paragraph>
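<Paragraph> The ambiguity statistics reported above can be computed directly from the match output; the following small sketch assumes the output is available as (mention, gene-identifier set) pairs, and the data structure and function name are illustrative. </Paragraph>
```python
from statistics import mean, pstdev

def ambiguity_stats(matches):
    """matches: iterable of (mention, gene_id_set) pairs for matched mentions only."""
    degrees = [len(gene_ids) for _, gene_ids in matches]
    ambiguous = sum(1 for d in degrees if d > 1)
    return {
        "max_degree": max(degrees),
        "mean_degree": mean(degrees),
        "stdev_degree": pstdev(degrees),
        "ambiguity_rate": ambiguous / len(degrees),   # fraction of ambiguous matches
    }
```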
<Paragraph position="2"> Finally, we were interested in determining the effectiveness of an optimized system based upon the gene normalization system described above, coupled with a state-of-the-art gene tagger. To determine the optimal results of such a system, we created a corpus of 100 MEDLINE abstracts that together contained 1,094 gene mentions for 170 unique genes (also used in the evaluations above).</Paragraph> <Paragraph position="3"> These documents were a subset of those used to train the tagger, and thus measure optimal, rather than typical MEDLINE, performance (data for a generalized evaluation is forthcoming). This corpus was manually annotated to identify human genes, according to a precise definition of gene mentions that an NER gene system would be reasonably expected to tag and normalize correctly. Briefly, the definition included only human genes, excluded multi-protein complexes and antibodies, excluded chained mentions of genes (e.g., &quot;HDAC1- and -2 genes&quot;), and excluded gene classes that were not normalizable to a specific symbol (e.g., tyrosine kinase). Documents were dual-pass annotated in full and then adjudicated by a third expert. Adjudication revealed a very high level of agreement between annotators.</Paragraph> <Paragraph position="4"> To optimize the rule set for human gene normalization, we evaluated for each rule up to 200 cases randomly chosen from all MEDLINE files in which invocation of that specific rule alone resulted in a match. Most of the transformations worked perfectly or very well. Stemming and removal of the first or last word or character each demonstrated poor performance, as genes and gene classes were often incorrectly converted to other gene instances (e.g., &quot;CAP&quot; and &quot;CAPS&quot; are distinct genes). Removal of stop words generated a high rate of false positives. Rules were ranked according to their precision when invoked separately. A high-performing sequence was &quot;0 01 02 03 06 016 026 036&quot;, with 0 denoting case-insensitivity, 1 replacement of hyphens with spaces, 2 removal of punctuation, 3 removal of parenthesized materials, and 6 removal of spaces; grouped digits indicate simultaneous invocation of each specified rule in the group. Table 3 indicates the cumulative accuracy achieved at each step. (Rule sets omitted from this sequence did not affect matches in the gold standard file and therefore the scores were unchanged; these rule sets may improve performance in other cases.) A formalized determination of an optimal sequence is in progress. Approximate matching did not considerably improve recall and had a non-trivial detrimental effect upon precision.</Paragraph>
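<Paragraph> Read according to this description, the quoted step-code string expands into an ordered list of simultaneously invoked rule groups; the sketch below shows that decoding, where the digit-to-rule mapping follows the text but the string encoding itself is only a hypothetical representation. </Paragraph>
```python
# Hypothetical digit-to-rule mapping, following the description in the text.
RULES = {
    "0": "case-insensitivity",
    "1": "replace hyphens with spaces",
    "2": "remove punctuation",
    "3": "remove parenthesized material",
    "6": "remove all spaces",
}

def decode_sequence(sequence):
    """Expand a step-code string into an ordered list of simultaneously applied rules."""
    return [[RULES[code] for code in group] for group in sequence.split()]

for step in decode_sequence("0 01 02 03 06 016 026 036"):
    print(step)
# Each printed list is one normalization step; its rules are applied together before
# an exact match is attempted, and the steps run from safest to most aggressive.
```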
<Paragraph position="5"> We assessed the optimized system in two evaluations. First, we used the actual textual mentions of each gene from the gold standard files as input to our optimized normalization sequence, in order to determine the accuracy of the normalization process alone. Second, we used a previously developed CRF gene tagger (McDonald and Pereira, 2005) to tag the gold standard files, and then used the tagger's output as input to our normalization sequence. This second evaluation determined the accuracy of a combined NER system for human gene identification.</Paragraph> <Paragraph position="6"> Depending upon the application, evaluation may be more meaningful either at the mention level (redundantly), where each individual mention is evaluated independently for accuracy, or, as in the case of BioCreAtIvE task 1B, at the document level (non-redundantly), where all mentions within a document are considered equivalent. For pure information extraction tasks, mention-level accuracy is a relevant performance indicator. However, for applications such as information extraction-based information retrieval (e.g., the identification of documents mentioning a specific gene), document-level accuracy is a relevant gauge of system performance.</Paragraph> <Paragraph position="7"> For normalization alone, at the mention level our optimized normalization system achieved 0.882 precision, 0.704 recall, and 0.783 F-measure. At the document level, the normalization results were 1.000 precision, 0.994 recall, and 0.997 F-measure.</Paragraph> <Paragraph position="8"> For the combined NER system, the performance was 0.718 precision, 0.626 recall, and 0.669 F-measure at the mention level. At the document level, the NER system results were 0.957 precision, 0.857 recall, and 0.901 F-measure. The lower accuracy of the combined system reflects the fact that the tagger and the normalizer each introduce errors, and these error rates are multiplicative in combination.</Paragraph> </Section> </Section> </Paper>