<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2209">
  <Title>Active Annotation</Title>
  <Section position="4" start_page="64" end_page="65" type="metho">
    <SectionTitle>
3 Effect of errors
</SectionTitle>
    <Paragraph position="0"> Noise in the training data is acommon issue in training machine learning for NLP tasks. It can have significant effect on the performance, as it was pointed out by Dingare et al. (2004), where the performance of the same system on the same task (named entity recognition in the biomedical domain) was lower when using noisier material. The effect of noise in the data used to train machine learning algorithms forNLPtaskshasbeen explored byOsborne (2002), using the task of shallow parsing as the case study and a variety of learners. The impact of different  types of noise was explored and learner specific extensions were proposed in order to deal with it.</Paragraph>
    <Paragraph position="1"> In ourexperiments weexplored the effect of noise intraining theselected namedentity recognition system, keeping in mind that we are going to use an unsupervised method to create the training material.</Paragraph>
    <Paragraph position="2"> Thekindofnoiseweexpect ismislabelled instances.</Paragraph>
    <Paragraph position="3"> In order to simulate the behavior of a hypothetical unsupervised method, wecorrupted the training data artificially using the following models: * LowRecall: Change tokens labelled as entities to non-entities. It must be noted that in this model, due to the presence of multi-token entities precision is reduced too, albeit less than recall.</Paragraph>
    <Paragraph position="4"> * LowRecall WholeEntities: Change the labeling of whole entities to non-entities. In this model, precision is kept intact.</Paragraph>
    <Paragraph position="5">  * LowPrecision: Change tokens labelled as non-entities to entities.</Paragraph>
    <Paragraph position="6"> * Random: Entities and non-entities are changed  randomly. It can be viewed alternatively as a random tagger which labels the data with some accuracy.</Paragraph>
    <Paragraph position="7"> The level of noise inserted is adjusted by specifying the probability with which a candidate label is changed. In all the experiments in this paper, for a particular model and level of noise, the corruption of the dataset was repeated five times in order to produce more reliable results. In practice, the behavior of an unsupervised method is likely to be a mixture of the above models. However, given that the method would tag the data with a certain performance, we attempted through our experiments to identify which of these (extreme) behaviors would be less harmful. In Figure 1, we present graphs showing the effect of noise inserted with the above models. The experimental procedure was to add noise to the training data according to a model, evaluate the performance of the hypothetical tagger that produced it, train Lingpipe on thenoisy training data and evaluate the performance of the latter on the test data. The process was repeated for various levels of noise. In the top graph, the F-score achieved by Lingpipe (F-ling) is plotted against the F-score of  the hypothetical tagger (F-tag), while in the bottom graph the F-score achieved by Lingpipe (F-ling) is plotted against the number of erroneous classifications made by the hypothetical tagger.</Paragraph>
    <Paragraph position="8">  the top graph and (b) the number of errors made by the hypothetical tagger in the bottom graph.</Paragraph>
    <Paragraph position="9"> A first observation is that limited noise does not affect the performance significantly, a phenomenon that can be attributed to the capacity of the machine learning method to deal with noise. From the point of view of correcting mistakes in the training data this suggests that not all mistakes need to be corrected. Another observation is that while the performance for all the models follow similar curves when plotted against the F-score of the hypothetical tagger, the same does not hold when plotted against the number of errors. While this can be attributed to the unbalanced nature of the task (very few entity tokens compared to non-entities), it also suggests that the raw number of errors in the training data is not a good indicator for the performance obtained by training on it. However, it represents the effort required to obtain the maximum performance from the data by correcting it.</Paragraph>
  </Section>
  <Section position="5" start_page="65" end_page="66" type="metho">
    <SectionTitle>
4 Active Annotation
</SectionTitle>
    <Paragraph position="0"> In this section we present a detailed description of the active annotation framework. Initially, we have a pool of unlabeled data, D, whose instances are annotated using anunsupervised methodu, whichdoes not need training data. As expected, a significant amount of errors is inserted during this process. A list L is created containing the tokens that have not been checked by a human annotator. Then, a supervised learner s is used to train a model M on this noisy training material. A query module q, which uses the model created by s decides which instances of D will be selected to be checked for errors by a human annotator. The selected instances are removed from L so that q does not select them again in future. The learner s is then trained on this partially corrected training data and the sequence is repeated from the point of applying the querying module q. The algorithm written in pseudocode appears in Figure 2.</Paragraph>
    <Paragraph position="1"> Data D, unsupervised tagger u, supervised learner s, query module q.</Paragraph>
    <Paragraph position="2"> Initialization: Apply u to D.</Paragraph>
    <Paragraph position="3"> Create list of instances L.</Paragraph>
    <Paragraph position="4"> Loop: Using s train a model M on D.</Paragraph>
    <Paragraph position="5"> Using q and M select a batch of instances B to be checked.</Paragraph>
    <Paragraph position="6"> Correct the instances of B in D.</Paragraph>
    <Paragraph position="7"> Remove the instances of B from L.</Paragraph>
    <Paragraph position="8">  Comparing it with active learning, the similarities are apparent. Both frameworks have a loop in which a query module q, using a model produced by the learner, selects instances to be presented to a human annotator. The efficiency of active annotation can be measured in two ways, both of them used in evaluating active learning. The first is to measure the reduction in the checked instances needed in order to achieve a certain level of performance. The second is the increase in performance for a fixed number of checked instances. Following the active learning  paradigm, a baseline for active annotation is random selection of instances to be checked.</Paragraph>
    <Paragraph position="9"> There are though some notable differences. During initialization, an unsupervised method u is required to provide an initial tagging on the data D. This is an important restriction which is imposed by the lack of any annotated data. Even under this restriction, there are some options available, especially for tasks which have compiled resources. One option is to use an unsupervised learning algorithm, such the one presented by Collins &amp; Singer (1999), where a seed set of rules is used to bootstrap a rule-based named entity recognizer. A different approach could be the use of a dictionary-based tagger, as in Morgan et al. (2003). It must be noted that the unsupervised method used to provide the initial tagging does not need to generalize to any data (a common problem for such methods), it only needs to perform well on the data used during active annotation. Generalization on unseen data is an attribute we hope that the supervised learning method s will have after training on the annotated material created with active annotation.</Paragraph>
    <Paragraph position="10"> The query module q is also different from the corresponding module in active learning. Instead of selecting unlabeled informative instances to be annotated and added to the training data, its purpose is to identify likely errors in the imperfectly labelled training data, so that they are checked and corrected by the human annotator.</Paragraph>
    <Paragraph position="11"> In order to perform error-detection, we chose to adapt the approach of Nakagawa and Matsumoto (2002) which resembles uncertainty based sampling for active learning. According to their paradigm, likely errors in the training data are instances that are &amp;quot;hard&amp;quot; for the classifier and inconsistent with the rest of the data. In our case, we used the uncertainty of the classifier as the measure of the &amp;quot;hardness&amp;quot; of an instance. As an indication of inconsistency, we used the disagreement of the label assigned bytheclassifierwiththecurrent label ofthe instance. Intuitively, if the classifier disagrees with the label of an instance used in its training, it indicates that there have been other similar instances in the training data that were labelled differently. Returning to the description of active annotation, the query module q ranks the instances inLfirst by their inconsistency and then by decreasing uncertainty of the classifier. As aresult, instances that are inconsistent with the rest of the data and hard for the classifier are selected first, then those that are inconsistent but easy for the classifier, then the consistent ones but hard for the classifier and finally the consistent and easy ones.</Paragraph>
    <Paragraph position="12"> While this method of detecting errors resembles uncertainty sampling, there are other approaches that could have been used instead and they can be very different. Sj&amp;quot;obergh and Knutsson (2005) inserted artificial errors and trained a classifier to recognize them. Dickinson and Meuers (2003) proposed methods based on n-grams occurring with different labellings in the corpus. Therefore, while it is reasonable to expect some correlation between the selections of active annotation and active learning (hard instances are likely to be erroneously annotated by the unsupervised tagger), the task of selecting hard instances is quite different from detecting errors. The use of the disagreement between taggers for selecting candidates for manual correction is reminiscent of corrected co-training (Pierce and Cardie, 2001). However, the main difference is corrected co-training results in a manually annotated corpus, while active annotation allows automatically annotated instances to be kept.</Paragraph>
  </Section>
  <Section position="6" start_page="66" end_page="67" type="metho">
    <SectionTitle>
5 HMM uncertainty estimation
</SectionTitle>
    <Paragraph position="0"> In order to perform error detection according to the previous section we need to obtain uncertainty estimations over each token from the named entity recognition module of Lingpipe. For each token t and possible label l, Lingpipe estimates the following Hidden Markov Model from the training data:</Paragraph>
    <Paragraph position="2"> Whenannotating acertain textpassage, thetokens are fixed and the joint probability of Equation 1 is computed for each possible combination of labels.</Paragraph>
    <Paragraph position="3"> From Bayes' rule, we obtain:</Paragraph>
    <Paragraph position="5"> For fixed token sequence t[n],t[n [?] 1],t[n [?] 2] and previous label (l[n [?] 1]) the second term of the  left part of Equation 2 is a fixed value. Therefore, under these conditions, we can write:</Paragraph>
    <Paragraph position="7"> From Equation 3 we obtain an approximation for the conditional distribution of the current label (l[n]) conditioned on the previous label (l[n [?] 1]) for a fixed sequence of tokens. It must be stressed that the later restriction is very important. The resulting distribution from Equation 3 cannot be compared across different token sequences. However, for the purpose of computing the uncertainty over a fixed token sequence it is a reasonable approximation. One way to estimate the uncertainty of the classifier is to calculate the conditional entropy of this distribution. The conditional entropy for a distribution P(X|Y ) can be computed as:</Paragraph>
    <Paragraph position="9"> In our case, X is l[n] and Y is l[n [?] 1]. Function 4 can be interpreted as the weighted sum of the entropies of P(l[n]|l[n [?] 1]) for each value of l[n [?] 1], in our case the weighted sum of entropies of the distribution of the current label for each possible previous label. The probabilities for each tag (needed for P(l[n [?] 1])) are not calculated directly from the model. P(l[n]) corresponds to P(l[n]|t[n],t[n [?] 1],t[n [?] 2]), but since we are considering a fixed token sequence, we approximate its distribution using the conditional probability P(l[n]|t[n],l[n [?] 1],t[n [?] 1],t[n [?] 2]), by marginalizing over l[n[?] 1].</Paragraph>
    <Paragraph position="10"> Again, itmustbenoted thattheabovecalculations are to be used in estimating uncertainty over a single word. Oneproperty oftheconditional entropy isthat it estimates the uncertainty of the predictions for the current label given knowledge of the previous tag, which is important in our application because we need the uncertainty over each label independently from the rest of the sequence. This is confirmed by the theory, from which we know that for a conditional distribution of X given Y the following equation holds, H[X|Y ] = H[X,Y ] [?] H[Y ], where H denotes the entropy.</Paragraph>
    <Paragraph position="11"> A different way of obtaining uncertainty estimations from HMMs in the framework of active learning has been presented in (Scheffer et al., 2001). There, the uncertainty is estimated by the margin between the two most likely predictions that would result in a different current label, explicitly:</Paragraph>
    <Paragraph position="13"> Intuitively, the margin M is the difference between the two highest scored predictions that disagree. The lower the margin, the higher the uncertainty oftheHMMonthe token atquestion. Adrawback of this method is that it doesn't take into account the distribution of the previous label. It is possible that the two highest scored predictions are obtained for two different previous labels. It may also be the case that ahighly scored label can be obtained given a very improbable previous label. Finally, an alternative that we did not explore in this work is the Field Confidence Estimation (Culotta and Mc-Callum, 2004), which allows the estimation of confidence over sequences of tokens, instead of singleton tokens only. However, in this work confidence estimation over singleton tokens is sufficient.</Paragraph>
  </Section>
  <Section position="7" start_page="67" end_page="68" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section we present results from applying active annotation to biomedical named entity recognition. Using the noise models described in Section 3, we corrupted the training data and then using Lingpipe as the supervised learner we applied the algorithm of Figure 2. The batch of tokens selected to be checked in each round was 2000 tokens. As a base-line for comparison we used random selection of tokens to be checked. The results for various noise models and levels are presented in the graphs of Figure 3. In each of these graphs, the performance of Lingpipe trained on the partially corrected material (F-ling) is plotted against the number of checked instances, under the label &amp;quot;entropy&amp;quot;.</Paragraph>
    <Paragraph position="1"> In all the experiments, active annotation significantly outperformed random selection, with the exception of 50% Random, where the high level of noise (the F-score of the hypothetical tagger that provided the initial data was 0.1) affected the initial  against the number of checked instances for various models and levels of noise.</Paragraph>
    <Paragraph position="2"> judgements of the query module on which instances should be checked. After having checked some portion of the dataset though, active annotation started outperforming random selection. In the graphs for the 10% Random, 20% LowRecall WholeEntities and 50% LowRecall noise models, under the label &amp;quot;margin&amp;quot;, appear the performance curves obtained using the uncertainty estimation of Scheffer et al. (2001). Even though active annotation using this method performs better than random selection, active annotation using conditional entropy performs significantly better. These results provide evidence ofthe theoretical advantages of conditional entropy described earlier. We also ran experiments using pure uncertainty based sampling (i.e. without checking the consistency of the labels) on selecting instances to be checked. The performance curves appear under the label &amp;quot;uncertainty&amp;quot; for the 20% LowRecall, 50% Random and 20% LowPrecision noise models. The uncertainty was estimated using the method described in Section 5. As expected, uncertainty based sampling performed reasonably well, betterthan random selection but worse than using labelling consistency, except for the initial stage of 20% LowPrecision.</Paragraph>
  </Section>
  <Section position="8" start_page="68" end_page="69" type="metho">
    <SectionTitle>
7 Active Annotation versus Active Learning
</SectionTitle>
      <Paragraph position="0"> Inordertocompare active annotation toactive learning, we run active learning experiments using the same dataset and software. The paradigm employed was uncertainty based sampling, using the uncertainty estimation presented in Sec. 5. HMMsrequire annotated sequences of tokens, therefore annotating whole sentences seemed as the natural choice, as in (Becker et al., 2005). While token-level selections could be used in combination with EM, (as in (Scheffer et al., 2001)), constructing a corpus of individual tokens would result in a corpus that would be very difficult to be reused, since it would be partially labelled. We employed the two standard options of selecting sentences, selecting the sentences with the highest average uncertainty over the tokens or selecting the sentence containing the most uncertain token. As cost metric we used the number of tokens, which allows more straightforward comparison with active annotation.</Paragraph>
      <Paragraph position="1"> In Figure 4 (left graph), each active learning experiment isstarted byselecting arandom sentence as seed data, repeating the seed selection 5 times. The random selection is repeated 5 times for each seed selection. As in (Becker et al., 2005), selecting the sentences with the highest average uncertainty (ave) performs better than selecting those with the most uncertain token (max).</Paragraph>
      <Paragraph position="2"> In the right graph, we compare the best active learning method with active annotation. Apparently, the performance of active annotation is highly dependent on the performance of the unsupervised tagger used to provide us with the initial annotation of the data. In the graph, we include curves for two of the noise models reported in the previous section, LowRecall20% and LowRecall50% which correspond to tagging performance of 0.66 / 0.69 / 0.67 and 0.33 / 0.43 / 0.37 respectively, in terms of Recall / Precision / F. We consider such tagging performances feasible with a dictionary-based tagger, since Morgan et al. (2003) report performance of  0.88 / 0/78 / 83 with such a method.</Paragraph>
      <Paragraph position="3"> These results demonstrate that active annotation, given a reasonable starting point, can achieve reductions in the annotation cost comparable to those of active learning. Furthermore, active annotation produces an actual corpus, albeit noisy. Active learning, as pointed out by Baldridge &amp; Osborne (2004), while it reduces the amount of training material needed, it selects data that might not be useful to train a different learner. In the active annotation framework, it is likely to preserve correct instances that might not be useful to the machine learning method used to create it, but maybe beneficial to a different method. Furthermore, producing an actual corpus can be very important when adding new features to the model. In the case of biomedical NER, one could consider adding document-level features, such as whether a token has been seen as part of a gene name earlier in the document. With the corpus constructed using active learning this is not feasible, since it is unlikely that all the sentences of a document are selected for annotation. Also, if one intended to use the same corpus for a different task, such as anaphora resolution, again the imperfectly annotated corpus constructed using active annotation can be used more efficiently than the partially annotated one produced by active learning.</Paragraph>
  </Section>
  <Section position="9" start_page="69" end_page="70" type="metho">
    <SectionTitle>
8 Selecting errors
</SectionTitle>
    <Paragraph position="0"> In order to investigate further the behavior of active annotation, we evaluated the performance of the trained supervised method against the number of errors corrected by the human annotator. The aim of this experiment was to verify whether the improvement in performance compared to random selection is due to selecting &amp;quot;informative&amp;quot; errors to correct, or due tothe efficiency ofthe errordetection technique.</Paragraph>
    <Paragraph position="1">  is plotted against the number of corrected errors.</Paragraph>
    <Paragraph position="2"> Right: Errors corrected plotted against the number of checked tokens.</Paragraph>
    <Paragraph position="3"> In Figure 5, we present such graphs for the 10% Random noise model. Similar results were obtained with different noise models. As can be observed on the left graph, the errors corrected initially during random selection are farmoreinformative compared to those corrected at the early stages of active annotation (labelled &amp;quot;entropy&amp;quot;). The explanation for this is that using the error detection method described in Section 4, the errors that are detected are those on which the supervised method s disagrees with the training material, which implies that even if such an instance is indeed an error then it didn't affect s.</Paragraph>
    <Paragraph position="4"> Therefore, correcting such errors will not improve the performance significantly. Informative errors are those that s has learnt to reproduce with high certainty. However, such errors are hard to detect because similar attributes are exhibited usually by correctly labelled instances. This can be verified by the curves labelled &amp;quot;reverse&amp;quot; in the graphs of Figure 5, in which the ranking of the instances to be selected was reversed, so that instances where the supervised method agrees confidently with the training material  are selected first. The fact that errors with high uncertainty arelessinformative thanthosewithlowuncertainty suggests that active annotation, while being related to active learning, it is sufficiently different. Theright graph suggests that the error-detection performance during active annotation is much better than that of random selection. Therefore, the performance of active annotation could be improved by preserving the high error-detection performance and selecting more informative errors.</Paragraph>
  </Section>
class="xml-element"></Paper>