<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3328"> <Title>Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain</Title> <Section position="3" start_page="0" end_page="139" type="metho"> <SectionTitle> 2 Building a test set </SectionTitle> <Paragraph position="0"> In this section we present a new test set created to evaluate named entity recognition for Drosophila fly genes. To our knowledge, there is only one other test set built for this purpose, presented in Morgan et al. (2004), which was annotated by two annotators. The inter-annotator agreement achieved was 87% F-score between the two annotators, which according to the authors reflects the difficulty of the task. Vlachos et al. (2006) evaluated their system on both versions of this test set and obtained significantly different results. The disagreements between the two versions were attributed to difficulties in applying the guidelines used for the annotation. Therefore, they produced a version of this dataset resolving the differences between these two versions using revised guidelines, partially based on those developed for ACE (2004). In this work, we applied these guidelines to construct a new test set, which resulted in their refinement and clarification.</Paragraph> <Paragraph position="1"> The basic idea is that gene names (<gn>) are annotated in any position they are encountered in the text, including cases where they are not referring to the actual gene but are used to refer to a different entity. Names of gene families, reporter genes and genes not belonging to Drosophila are tagged as gene names too. In addition, following the ACE guidelines, for each gene name we annotate the shortest surrounding noun phrase. These noun phrases are classified further into gene mentions (<gm>) and other mentions (<om>), depending on whether the mentions refer to an actual gene or not, respectively. Most of the time, this distinction can be performed by looking at the head noun of the noun phrase: * <gm>the <gn>faf</gn> gene</gm> * <om>the <gn>Reaper</gn> protein</om> However, in many cases the noun phrase itself is not sufficient to classify the mention, especially when the mention consists of just the gene name, because it is quite common in the biomedical literature to use a gene name to refer to a protein or to other gene products. In order to classify such cases, the annotators need to take into account the context in which the mention appears. In the following examples, the word of the context that enables us to make the distinction between gene mentions (<gm>) and other mentions is underlined:</Paragraph> <Paragraph position="3"> It is worth noticing as well that sometimes more than one gene name may appear within the same noun phrase. As the examples that follow demonstrate, this enables us to annotate consistently cases of coordination, which is another source of disagreement (Dingare et al., 2004): * <gm><gn>male-specific lethal-1</gn>, <gn>-2</gn> and <gn>-3</gn> genes</gm> The test set produced consists of the abstracts from 82 articles curated by FlyBase. We used the tokenizer of RASP (Briscoe and Carroll, 2002) to process the text, resulting in 15,703 tokens. The size and the characteristics of the dataset are comparable with those of Morgan et al. (2004), as can be observed from the statistics of Table 1, except for the number of non-unique gene names. 
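To make the mark-up above concrete, the following sketch (our own illustration, not part of the released corpus tools; whitespace tokenization is a simplification, since the corpus itself was tokenized with RASP) converts the inline <gn>/<gm>/<om> annotation into the token-level IOB representation in which the corpus is distributed:

    import re

    # Minimal sketch: drop the mention-level tags (<gm>, <om>), keep the
    # gene-name tags (<gn>) and emit one (token, IOB-label) pair per
    # whitespace-separated token.
    def inline_to_iob(annotated_sentence):
        text = re.sub(r'</?(gm|om)>', '', annotated_sentence)
        pairs, inside, first = [], False, False
        for piece in re.split(r'(</?gn>)', text):
            if piece == '<gn>':
                inside, first = True, True
            elif piece == '</gn>':
                inside = False
            else:
                for tok in piece.split():
                    if inside:
                        pairs.append((tok, 'B-GENE' if first else 'I-GENE'))
                        first = False
                    else:
                        pairs.append((tok, 'O'))
        return pairs

    print(inline_to_iob('<gm>the <gn>faf</gn> gene</gm> and <om>the <gn>Reaper</gn> protein</om>'))
    # [('the', 'O'), ('faf', 'B-GENE'), ('gene', 'O'), ('and', 'O'),
    #  ('the', 'O'), ('Reaper', 'B-GENE'), ('protein', 'O')]
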
Apart from the different guidelines, another difference is that we used the original text of the abstracts, without any post-processing apart from the tokenization. The dataset from Morgan et al. (2004) had been stripped of all punctuation characters, e.g. periods and commas.</Paragraph> <Paragraph position="4"> Keeping the text intact renders this new dataset more realistic and, most importantly, allows the use of tools that rely on this information, such as syntactic parsers.</Paragraph> <Paragraph position="5"> The annotation of gene names was performed by a computational linguist and a FlyBase curator.</Paragraph> <Paragraph position="6"> We estimated the inter-annotator agreement in two ways. First, we calculated the F-score achieved between them, which was 91%. Secondly, we used the Kappa coefficient (Carletta, 1996), which has become the standard metric for inter-annotator agreement; the score obtained was 0.905. This high agreement score can be attributed to the clarification of what a gene name should capture through the introduction of gene mentions and other mentions. It must be mentioned that in the experiments that follow in the rest of the paper, only the gene names were used to evaluate the performance of bootstrapping. The identification and the classification of mentions are the subject of ongoing research.</Paragraph> <Paragraph position="7"> The annotation of mentions presented greater difficulty, because computational linguists do not have sufficient knowledge of biology in order to use the context of the mentions, whilst biologists are not trained to identify noun phrases in text. In this effort, the boundaries of the mentions were defined by the computational linguist and the classification was performed by the curator. A more detailed description of the guidelines, as well as the corpus itself in IOB format, are available for download.</Paragraph> </Section> <Section position="4" start_page="139" end_page="140" type="metho"> <SectionTitle> 3 Bootstrapping NER </SectionTitle> <Paragraph position="0"> For the bootstrapping experiments presented in this paper we employed the system developed by Vlachos et al. (2006), which was an improvement of the system of Morgan et al. (2004). In brief, the abstracts of all the articles curated by FlyBase were retrieved and tokenized by RASP (Briscoe and Carroll, 2002). For each article, the gene names and their synonyms that were recorded by the curators were annotated automatically on its abstract using longest-extent pattern matching. The pattern matching is flexible in order to accommodate capitalization and punctuation variations. This process resulted in a large but noisy training set, consisting of 2,923,199 tokens and containing 117,279 gene names, 16,944 of which are unique. The abstracts used in the test set presented in the previous section were excluded. We used them, though, to evaluate the performance of the training data generation process. This material was used to train the HMM-based NER module of the open-source toolkit LingPipe.</Paragraph> <Paragraph position="1"> The performance achieved on the corpus presented in the previous section appears in Table 2 in the row &quot;std&quot;. Following the improvements suggested by Vlachos et al. 
(2006), we also re-annotated as gene names the tokens that were annotated as such by the data generation process more than 80% of the time (row &quot;std-enhanced&quot;), which slightly increased the performance.</Paragraph> <Paragraph position="2"> In order to assess the usefulness of this bootstrapping method, we evaluated the performance of the HMM-based tagger when trained on manually annotated data. For this purpose we used the annotated data from BioCreative-2004 (Blaschke et al., 2004) task 1A. In that task, the participants were requested to identify which terms in a biomedical research article are gene and/or protein names, which is roughly the same task as the one we are dealing with in this paper. Therefore we would expect that, even though the material used for the annotation is not drawn from the exact domain of our test data (FlyBase curated abstracts), it would still be useful to train a system to identify gene names.</Paragraph> <Paragraph position="3"> The results in Table 2 show that this is not the case.</Paragraph> <Paragraph position="4"> Apart from the domain shift, the deterioration of the performance could also be attributed to the different guidelines used. However, given that the tasks are roughly the same, it is a very important result that manually annotated training material leads to such poor performance, compared to the performance achieved using automatically created training data.</Paragraph> <Paragraph position="5"> This evidence suggests that manually created resources, which are expensive, might not be useful even in tasks slightly different from those they were initially designed for. Moreover, it suggests that semi-supervised or unsupervised methods for creating training material are alternatives worth exploring.</Paragraph> </Section> <Section position="5" start_page="140" end_page="140" type="metho"> <SectionTitle> 4 Evaluating NER </SectionTitle> <Paragraph position="0"> The standard evaluation metric used for NER is the F-score (Van Rijsbergen, 1979), which is the harmonic average of Recall and Precision. It is very successful and popular, because it penalizes systems that underperform in either of these two aspects. Also, it takes into consideration the existence of multi-token entities by rewarding systems able to identify the entity boundaries correctly and penalizing them for partial matches. In this section we suggest an extension to this evaluation, which we believe is meaningful and informative for trainable NER systems.</Paragraph> <Paragraph position="1"> There are two main expectations from trainable systems. The first one is that they will be able to identify entities that they have encountered during their training. This is not as easy as it might seem, because in many domains tokens representing entity names of a certain type can also appear as common words or as entity names of a different type. Using examples from the biomedical domain, &quot;to&quot; can be a gene name but it is also used as a preposition. Also, gene names are commonly used as protein names, rendering the task of distinguishing between the two types non-trivial, even if examples of those names exist in the training data. The second expectation is that trainable systems should be able to learn from the training data patterns that will allow them to generalize to unseen named entities. The features that depend on the context and on observations on the tokens play an important role in this aspect of the performance. 
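As a reference point for the scores reported in this paper, the following is a minimal sketch of the standard entity-level evaluation, under which a partially recognized entity counts against both precision and recall (the function name and the span format are our own simplifications, not part of any evaluation script released with this work):

    def prf(gold_spans, predicted_spans):
        # Entity-level exact-match scoring: a partially recognized entity is
        # simultaneously a precision error (spurious span) and a recall error
        # (missed gold span); F is the harmonic mean 2PR/(P+R).
        gold, pred = set(gold_spans), set(predicted_spans)
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    # Spans are (start_token, end_token, type) triples, e.g. derived from IOB labels.
    print(prf({(1, 1, 'GENE'), (6, 8, 'GENE')}, {(1, 1, 'GENE'), (6, 7, 'GENE')}))
    # (0.5, 0.5, 0.5)
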
The ability to generalize to unseen named entities is very significant because it is unlikely that training material can cover all possible names and, moreover, in most domains new names appear regularly.</Paragraph> <Paragraph position="2"> A common way to assess these two aspects is to measure the performance on seen and unseen data separately. It is straightforward to apply this in tasks with token-based evaluation, such as part-of-speech tagging (Curran and Clark, 2003). However, in the case of NER, this is not entirely appropriate due to the existence of multi-token entities. For example, consider the case of the gene name &quot;head inhibition defective&quot;, which consists of three common words that are very likely to occur independently of each other in a training set. If this gene name appears in the test set but not in the training set, with a token-based evaluation its identification (or not) would count towards the performance on seen tokens if the tokens appeared independently. Moreover, a system would be rewarded or penalized for each of the tokens. One approach to circumvent these problems and evaluate the performance of a system on unseen named entities is to replace all the named entities of the test set with strings that do not appear in the training data, as in Morgan et al. (2004). There are two problems with this evaluation. Firstly, it alters the morphology of the unseen named entities, which is usually a source of good features for recognizing them. Secondly, it affects the contexts in which the unseen named entities occur, which need not be the same as those of seen named entities.</Paragraph> <Paragraph position="3"> In order to overcome these problems, we used the following method. We partitioned the correct answers and the recall errors according to whether the named entity in question has been encountered in the training data as a named entity at least once. The precision errors are partitioned into seen and unseen depending on whether the string that was incorrectly annotated as a named entity by the system has been encountered in the training data as a named entity at least once. Following the standard F-score definition, partially recognized named entities count as both precision and recall errors.</Paragraph> <Paragraph position="4"> Using examples from the biomedical domain, if &quot;to&quot; has been encountered at least once as a gene name in the training data and an occurrence of it in the test dataset is erroneously tagged as a gene name, this will count as a precision error on seen named entities. Similarly, if &quot;to&quot; has never been encountered in the training data as a gene name but an occurrence of it in the test dataset is erroneously tagged as a common word, this will count as a recall error on unseen named entities. In a multi-token example, if &quot;head inhibition defective&quot; is a gene name in the test dataset and it has been seen as such in the training data, but the NER system (erroneously) tagged &quot;head inhibition&quot; as a gene name (which is not in the training data), then this would result in a recall error on seen named entities and a precision error on unseen named entities.</Paragraph> </Section> <Section position="6" start_page="140" end_page="143" type="metho"> <SectionTitle> 5 Improving performance </SectionTitle> <Paragraph position="0"> Using this extended evaluation we re-evaluated the named entity recognition system of Vlachos et al. (2006) and Table 3 presents the results. 
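The extended evaluation used here can be sketched as follows (our own simplified rendering, assuming exact-match spans and a hypothetical set seen_strings of entity strings that occurred as named entities in the training data):

    def partitioned_prf(gold, pred, seen_strings):
        # gold, pred: dicts mapping (start, end) spans to their surface strings.
        # Correct answers and recall errors are split by whether the *gold* string
        # was seen in training; precision errors by whether the *predicted* string was.
        stats = {'seen': {'tp': 0, 'fp': 0, 'fn': 0},
                 'unseen': {'tp': 0, 'fp': 0, 'fn': 0}}
        for span, string in gold.items():
            part = 'seen' if string in seen_strings else 'unseen'
            stats[part]['tp' if span in pred and pred[span] == string else 'fn'] += 1
        for span, string in pred.items():
            if span not in gold or gold[span] != string:
                part = 'seen' if string in seen_strings else 'unseen'
                stats[part]['fp'] += 1
        result = {}
        for part, s in stats.items():
            p = s['tp'] / (s['tp'] + s['fp']) if s['tp'] + s['fp'] else 0.0
            r = s['tp'] / (s['tp'] + s['fn']) if s['tp'] + s['fn'] else 0.0
            result[part] = (p, r, 2 * p * r / (p + r) if p + r else 0.0)
        return result

Under this scheme, the partially recognized &quot;head inhibition&quot; example above yields a recall error in the seen partition and a precision error in the unseen one, as described in the previous section.
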
The big gap in the performance on seen and unseen named entities can be attributed to the highly lexicalized nature of the algorithm used. Tokens that have not been seen in the training data are passed on to a module that classifies them according to their morphology, which, given the variety of gene names and their overlap with common words, is unlikely to be sufficient. Also, the limited window used by the tagger (previous label and two previous tokens) does not allow the capture of long-range contexts that could improve the recognition of unseen gene names.</Paragraph> <Paragraph position="1"> We believe that this evaluation allows a fair comparison between the data generation process that created the training data and the HMM-based tagger. This comparison should take into account the performance of the latter only on seen named entities, since the former is applied only to those abstracts for which lists of the genes mentioned have been compiled manually by the curators. The result of this comparison is in favor of the HMM, which achieves 94.5% F-score compared to 82.1% for the data generation process, mainly due to the improved recall (95.9% versus 73.5%). This is a very encouraging result for bootstrapping techniques using noisy training material, because it demonstrates that the trained classifier can deal efficiently with the noise inserted.</Paragraph> <Paragraph position="2"> From the analysis performed in this section, it becomes obvious that the system is rather weak in identifying unseen gene names. The latter contribute 31% of all the gene names in our test dataset, with respect to the training data produced automatically to train the HMM. Each of the following subsections describes different ideas employed to improve the performance of our system. As our baseline, we kept the version that uses the training data produced by re-annotating as gene names tokens that appear as part of gene names more than 80% of the time. This version has resulted in the best performance obtained so far.</Paragraph> <Section position="1" start_page="141" end_page="142" type="sub_section"> <SectionTitle> 5.1 Substitution </SectionTitle> <Paragraph position="0"> A first approach to improve the overall performance is to increase the coverage of gene names in the training data. We noticed that the training set produced by the process described earlier contains 16,944 unique gene names, while the dictionary of all gene names from FlyBase contains 97,227 entries.</Paragraph> <Paragraph position="1"> This observation suggests that the dictionary is not fully exploited. This is expected, since the dictionary entries are obtained from the full papers, while the training data generation process is applied only to their abstracts, which are unlikely to contain all of them.</Paragraph> <Paragraph position="2"> In order to include all the dictionary entries in the training material, we substituted in the training dataset produced earlier each of the existing gene names with entries from the dictionary. The process was repeated until each of the dictionary entries was included once in the training data. The assumption that we take advantage of is that gene names should appear in similar lexical contexts, even if the resulting text is nonsensical from a biomedical perspective. For example, in a sentence containing the phrase &quot;the sws mutant&quot;, the immediate lexical context could justify the presence of any gene name in place of &quot;sws&quot;, even though the whole sentence would become untruthful and even incomprehensible. 
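A minimal sketch of this substitution step (our own illustration; the (tokens, spans) representation and the function name are placeholders, not the actual implementation): each annotated gene name is replaced by the next unused dictionary entry, cycling over the corpus until every entry has been planted once.

    def substitute(annotated_sentences, dictionary):
        # annotated_sentences: list of (tokens, spans) pairs, where each span is a
        # (start, end) token index of an annotated gene name, sorted left to right.
        # dictionary: the full list of FlyBase gene names and synonyms.
        # The corpus is cycled as often as needed to exhaust the dictionary
        # (assumes the corpus contains at least one annotated gene name).
        entries = iter(dictionary)
        out = []
        while True:
            for tokens, spans in annotated_sentences:
                new_tokens, new_spans, offset = list(tokens), [], 0
                for start, end in spans:
                    entry = next(entries, None)
                    if entry is None:
                        return out            # every dictionary entry used once
                    replacement = entry.split()
                    new_tokens[start + offset:end + offset + 1] = replacement
                    new_spans.append((start + offset,
                                      start + offset + len(replacement) - 1))
                    offset += len(replacement) - (end - start + 1)
                out.append((new_tokens, new_spans))

    # e.g. substitute([(['the', 'sws', 'mutant'], [(1, 1)])], ['male-specific lethal-1'])
    # -> [(['the', 'male-specific', 'lethal-1', 'mutant'], [(1, 2)])]
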
Although through this process we are bound to repeat errors of the training data, we expect the gains from the increased coverage to alleviate their effect. The resulting corpus consisted of 4,062,439 tokens containing each of the 97,227 gene names of the dictionary once. Training the HMM-based tagger with this data yielded 78.3% F-score (Table 4, row &quot;sub&quot;). 438 out of the 629 genes of the test set were seen in the training data.</Paragraph> <Paragraph position="3"> The drop in precision exemplifies the importance of using naturally occurring training material. Also, 59 gene names that were annotated in the training data due to the flexible pattern matching are not included anymore since they are not in the dictionary, which explains the drop in recall. Given these observations, we trained the HMM-based tagger on both versions of the training data, which consisted of 5,527,024 tokens and 218,711 gene names, 106,235 of which are unique. The resulting classifier had seen in its training data 79% of the gene names in the test set (497 out of 629) and it achieved 82.8% F-score (row &quot;bsl+sub&quot; in Table 4). It is worth pointing out that this improvement is not due to ameliorating the performance on unseen named entities but due to including more of them in the training data, thereby taking advantage of the high performance on seen named entities (93.7%). Direct comparisons between these three versions of the system on seen and unseen gene names are not meaningful, because the separation into seen and unseen gene names changes with the genes covered in the training set and therefore we would be evaluating on different data.</Paragraph> </Section> <Section position="2" start_page="142" end_page="142" type="sub_section"> <SectionTitle> 5.2 Excluding sentences not containing entities </SectionTitle> <Paragraph position="0"> From the evaluation of the dictionary-based tagger in Section 3 we confirmed our initial expectation that it achieves high precision and relatively low recall.</Paragraph> <Paragraph position="1"> Therefore, we anticipate most mistakes in the training data to be unrecognized gene names (false negatives). In an attempt to reduce them, we removed from the training data sentences that did not contain any annotated gene names. This process resulted in keeping 63,872 of the original 111,810 sentences. Of course, such processing also removes many correctly identified common words (true negatives), but given that the latter are more frequent in our data we expect it not to have a significant impact.</Paragraph> <Paragraph position="2"> The results appear in Table 5.</Paragraph> <Paragraph position="3"> In this experiment, we can compare the performances on unseen data because the gene names that were included in the training data did not change.</Paragraph> <Paragraph position="4"> As we expected, the F-score on unseen gene names rose substantially, mainly due to the improvement in recall (from 32.3% to 46.2%). The overall F-score deteriorated, which is due to the drop in precision.</Paragraph> <Paragraph position="5"> An error analysis showed that most of the precision errors introduced were on tokens that can be part of gene names as well as common words, which suggests that removing sentences without annotated entities from the training data deprives the classifier of contexts that would help the resolution of such cases. 
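The sentence-exclusion step of this subsection reduces to a simple sieve over the automatically annotated corpus; a sketch under the same hypothetical (tokens, spans) representation as above:

    def keep_sentences_with_entities(annotated_sentences):
        # Drop training sentences that contain no annotated gene name: most false
        # negatives introduced by the dictionary-based annotation live in such
        # sentences, at the cost of also discarding useful true negatives.
        return [(tokens, spans) for tokens, spans in annotated_sentences if spans]
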
Still, such an approach could be of interest in cases where we expect a significant number of novel gene names.</Paragraph> </Section> <Section position="3" start_page="142" end_page="143" type="sub_section"> <SectionTitle> 5.3 Filtering contexts </SectionTitle> <Paragraph position="0"> The results of the previous two subsections suggested that improvements can be achieved through substitution and through the exclusion of sentences without entities, attempting to include more gene names in the training data and to exclude false negatives from it.</Paragraph> <Paragraph position="1"> However, the benefits from them were hampered by the crude way these methods were applied, resulting in the repetition of mistakes as well as the exclusion of true negatives. Therefore, we tried to filter the contexts used for substitution and the sentences that were excluded using the confidence of the HMM-based tagger.</Paragraph> <Paragraph position="2"> In order to accomplish this, we used the &quot;std-enhanced&quot; version of the HMM-based tagger to re-annotate the training data that had been generated automatically. From this process, we obtained a second version of the training data, which we expected to differ from the original one produced by the data generation process, since the HMM-based tagger should behave differently. Indeed, the agreement between the training data and its re-annotation by the HMM-based tagger was 96% F-score. We estimated the entropy of the tagger for each token and for each sentence we calculated the average entropy over all its tokens. We expected that sentences less likely to contain errors would be sentences on which the two versions of the training data agreed and which, in addition, the HMM-based tagger annotated with low entropy, an intuition similar to that of co-training (Blum and Mitchell, 1998). Following this, we removed from the dataset the sentences on which the HMM-based tagger disagreed with the annotation of the data generation process, or on which it agreed but the average entropy of their tokens was above a certain threshold. By setting this threshold at 0.01, we kept 72,534 of the original 111,810 sentences, which contained 61,798 gene names, 11,574 of which are unique. Using this dataset as training data we achieved 80.4% F-score (row &quot;filter&quot; in Table 6). Even though this score is lower than our baseline (81.5% F-score), this filtered dataset should be more appropriate for applying substitution because it should contain fewer errors.</Paragraph> <Paragraph position="3"> Indeed, applying substitution to this dataset gave better results than applying it to the original data. The performance of the HMM-based tagger trained on it was 80.6% F-score (row &quot;filter-sub&quot; in Table 6) compared to 78.3% (row &quot;sub&quot; in Table 4). Since both training datasets contain the same gene names (the ones contained in the FlyBase dictionary), we can also compare the performance on unseen data, which improved from 46.7% to 48.6%. This improvement can be attributed to the exclusion of some false negatives from the training data, which improved the recall on unseen data from 42.9% to 47.1%. Finally, we combined the dataset produced with filtering and substitution with the original dataset. Training the HMM-based tagger on this dataset resulted in 83% F-score, which is the best performance we obtained.</Paragraph> </Section> </Section> </Paper>