<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3202">
  <Title>Active Learning and the Total Cost of Annotation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Measuring annotation cost
</SectionTitle>
    <Paragraph position="0"> To aid identification of the best parse out of all those licensed by the ERG, the Redwoods annotation environment provides local discriminants which the annotator can mark as true or false properties for the analysis of a sentence in order to disambiguate large portions of the parse forest. As such, the annotator does not need to inspect all parses and so parses are narrowed down quickly (usually exponentially so) even for sentences with a large number of parses.</Paragraph>
    <Paragraph position="1"> More interestingly, it means that the labeling burden is relative to the number of possible parses (rather than the number of constituents in a parse).</Paragraph>
    <Paragraph position="2"> Data about how many discriminants were needed to annotate each sentence is recorded in Redwoods.</Paragraph>
    <Paragraph position="3"> Typically, more ambiguous sentences require more discriminant values to be set, reflecting the extra effort put into identifying the best parse. We showed in Osborne and Baldridge (2004) that discriminant cost does provide a more accurate approximation of annotation cost than assigning a fixed unit cost for each sentence. We thus use discriminants as the basis of calculating annotation cost to evaluate the effectiveness of different experiment AL conditions.</Paragraph>
    <Paragraph position="4"> Specifically, we set the cost of annotating a given sentence as the number of discriminants whose value were set by the human annotator plus one to indicate a final 'eyeball' step where the annotator selects the best parse of the few remaining ones.1 The discriminant cost of the examples we use averages 3.34 and ranges from 1 to 14.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Active learning
</SectionTitle>
    <Paragraph position="0"> Suppose we have a set of examples and labels Dn = {&lt;x1,y1&gt; ,&lt;x2,y2&gt; ,...} which is to be extended with a new labeled example {&lt;xi,yi&gt; }. The information gain for some model is maximized after selecting, labeling, and adding a new example xi to Dn such that the noise level of xi is low and both the bias and variance of some model using Dn [?] {&lt;xi,yi&gt; } are minimized (Cohn et al., 1995).</Paragraph>
    <Paragraph position="1"> In practice, selecting data points for labeling such that a model's variance and/or bias is maximally minimized is computationally intractable, so approximations are typically used instead. One such approximation is uncertainty sampling. Uncertainty sampling (also called tree entropy by Hwa (2000)), measures the uncertainty of a model over the set of parses of a given sentence, based on the conditional 1This eyeball step is not always taken, but Redwoods does not contain information about when this occurred, so we apply the cost for the step uniformly for all examples.</Paragraph>
    <Paragraph position="2"> distribution it assigns to them. Following Hwa, we use the following measure to quantify uncertainty:</Paragraph>
    <Paragraph position="4"> t denotes the set of analyses produced by the ERG for the sentence and Mk is some model. Higher values of fus(s,t,Mk) indicate examples on which the learner is most uncertain . Calculating fus is trivial with the conditional log-linear and perceptrons models described in section 2.2.</Paragraph>
    <Paragraph position="5"> Uncertainty sampling as defined above is a single-model approach. It can be improved by simply replacing the probability of a single log-linear (or perceptron) model with a product probability:</Paragraph>
    <Paragraph position="7"> M is the set of models M1 ...Mn. As we mentioned earlier, AL for parse selection is potentially problematic as sentences vary both in length and the number of parses they have. Nonetheless, the above measures do not use any extra normalization as we have found no major differences after experimenting with a variety of normalization strategies.</Paragraph>
    <Paragraph position="8"> We use random sampling as a baseline and uncertainty sampling for AL. Osborne and Baldridge (2004) show that uncertainty sampling produces good results compared with other AL methods.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experimental framework
</SectionTitle>
    <Paragraph position="0"> For all experiments, we used a 20-fold cross-validation strategy by randomly selecting 10% (roughly 500 sentences) for the test set and selecting samples from the remaining 90% (roughly 4500 sentences) as training material. Each run of AL begins with a single randomly chosen annotated seed sentence. At each round, new examples are selected for annotation from a randomly chosen, fixed sized 500 sentence subset according to random selection or uncertainty sampling until models reach certain desired accuracies. We select 20 examples for annotation at each round, and exclude all examples that have more than 500 parses.2 2Other parameter settings (such as how many examples to label at each stage) did not produce substantially different results to those reported here.</Paragraph>
    <Paragraph position="1"> AL results are usually presented in terms of the amount of labeling necessary to achieve given performance levels. We say that one method is better than another method if, for a given performance level, less annotation is required. The performance metric used here is parse selection accuracy as described in section 2.3.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Reusing training material
</SectionTitle>
    <Paragraph position="0"> AL can be considered as selecting some labeled training set which is 'tuned' to the needs of a particular model. Typically, we might wish to reuse labeled training material, so a natural question to ask is how general are training sets created using AL. So, if we later improved upon our feature set, or else improved upon our learner, would the previously created training set still be useful? If AL selects highly idiosyncratic datasets then we would not be able to reuse our datasets and thus it might, for example, actually be better to label datasets using random sampling. This is a realistic situation since models typically change and evolve over time -- it would be very problematic if the training set itself inherently limits the benefit of later attempts to improve the model.</Paragraph>
    <Paragraph position="1"> We use two baselines to evaluate how well a model is able to reuse data selected for labeling by another model: (1) Selecting the data randomly.</Paragraph>
    <Paragraph position="2"> This provides the essential baseline; if AL in reuse situations is going to be useful, it ought to outperform this model-free approach. (2) Reuse by the AL model itself. This is the standard AL scenario; against this, we can determine if reused data can be as good as when a model selects data for itself.</Paragraph>
    <Paragraph position="3"> We evaluate a variety of reuse scenarios. We refer to the model used with AL as the selector and the model that is reusing that labeled data as the reuser. Models can differ in the machine learning algorithm and/or the feature set they use. To measure relatedness, we use Spearman's rank correlation on the rankings that two models assign to the parses of a sentence. The overall relatedness of two models is calculated as the average rank correlation on all examples tested in a 10-fold parse selection experiment using all available training material.</Paragraph>
    <Paragraph position="4"> Figure 1 shows complete learning curves for LL-CONFIG when it reuses material selected by itself, LL-CONGLOM, P-MRS, and random sampling. The graph shows that self-reuse is the most effective of all strategies - this is the idealized situation commonly assumed in active learning studies. However, the graph reveals that random sampling is actually more effective than selection both by LL-CONGLOM until nearly 70% accuracy is reached and by P-MRS until about 73%. Finally, we see that the material selected by LL-CONGLOM is always more effective for LL-CONFIG than that selected by P-MRS. The reason for this can be explained by the relatedness of each of these selector models to LL-CONFIG: LL- null reusing material by different selectors.</Paragraph>
    <Paragraph position="5"> Table 2 fleshes out the relationship between relatedness and reusability more fully. It shows the annotation cost incurred by various reusers to reach 65%, 70%, and 73% accuracy when material is selected by various models. The list is ordered from top to bottom according to the rank correlation of the two models. The first three lines provide the baselines of when LL-PROD, LL-CONGLOM, and LL-CONFIG select material for themselves. The last three show the amount of material needed by these models when random sampling is used. The rest gives the results for when the selector differs from the reuser.</Paragraph>
    <Paragraph position="6"> For each performance level, the percent increase in annotation cost over self-reuse is given. For example, a cost of 2300 discriminants is required for LL-PROD to reach the 73% performance level when it reuses material selected by LL-CONGLOM; this is a 10% increase over the 2100 discriminants needed when LL-PROD selects for itself. Similarly, the 5500 discriminants needed by LL-CONGLOM to reach 73% when reusing material selected by LL-CONFIG is a 31% increase over the 4200 discriminants LL-CONGLOM needs with its own selection.</Paragraph>
    <Paragraph position="7"> As can be seen from Table 2, reuse always leads to an increase in cost over self-reuse to reach a given level of performance. How much that increase will be is in general inversely related to the rank correlation of the two models. Furthermore, considering each reusing model individually, this relationship is almost entirely inversely related at all performance levels, with the exception of P-CONGLOM and LL-MRS selecting for LL-CONFIG at the 73% level.</Paragraph>
    <Paragraph position="8"> The reason for some models being more related to others is generally easy to see. For example, LL-CONFIG and LL-CONGLOM are highly related to LL-PROD, of which they are both components. In both of these cases, using AL for use by LL-PROD beats random sampling by a large amount.</Paragraph>
    <Paragraph position="9"> That LL-MRS is more related to LL-CONGLOM than to LL-CONFIG is explained by the fact the mrs feature set is actually a subset of the conglom set.</Paragraph>
    <Paragraph position="10"> The former contains 15% of the latter's features.</Paragraph>
    <Paragraph position="11"> Accordingly, material selected by LL-MRS is also generally more reusable by LL-CONGLOM than to LL-CONFIG. This is encouraging since the case of LL-CONGLOM reusing material selected by LL-MRS represents the common situation in which an initial model - that was used to develop the corpus - is continually improved upon.</Paragraph>
    <Paragraph position="12"> A particularly striking aspect revealed by Figure 1 and Table 2 is that random sampling is overwhelmingly a better strategy when there is still little labeled material. AL tends to select examples which are more ambiguous and hence have a higher discriminant cost. So, while these examples may be highly informative for the selector model, they are not cheap -- and are far less effective when reused by another model.</Paragraph>
    <Paragraph position="13"> Considering unit cost (i.e., each sentence costs the same) instead of discriminant cost (which assigns a variable cost per sentence), AL is generally more effective than random sampling for reuse throughout all accuracy levels - but not always. For example, even using unit cost, random sampling is better than selection by LL-MRS or P-MRS for reuse by  and the percent increase (Incr) in cost over use of material selected by the reuser. LL-CONFIG until 67% accuracy. Thus, LL-MRS and P-MRS are so divergent from LL-CONFIG that their selections are truly sub-optimal for LL-CONFIG, particularly in the initial stages.</Paragraph>
    <Paragraph position="14"> Together, these results shows that AL cannot be used blindly and always be expected to reduce the total cost of annotation. The data is tuned to the models used during AL and how useful that data will be for other models depends on the degree of relatedness of the models under consideration.</Paragraph>
    <Paragraph position="15"> Given that AL may or may not provide cost reductions, we consider the effect that semi-automating annotation has on reducing the total cost of annotation when used with and without AL.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Semi-automated labeling
</SectionTitle>
    <Paragraph position="0"> Corpus building, with or without AL, is generally viewed as selecting examples and then from scratch labeling such examples. This can be inefficient, especially when dealing with labels that have complex internal structures, as a model may be able to ruleout some of the labeling possibilities.</Paragraph>
    <Paragraph position="1"> For our domain, we exploit the fact that we may already have partial information about an example's label by presenting only the top n-best parses to the annotator, who then navigates to the best parse within that set using those discriminants relevant to that set of parses. Rather than using a value for n that is fixed or proportional to the ambiguity of the sentence, we simply select all parses for which the model assigns a probability higher than chance. This has the advantage of reducing the number of parses presented to the annotator as the model uses more training material and reduces its uncertainty.</Paragraph>
    <Paragraph position="2"> When the true best parse is within the top n presented to the annotator, the cost we record is the number of discriminants needed to identify it from that subset, plus one - the same calculation as when all parses are presented, with the advantage that fewer discriminants and parses need to be inspected.</Paragraph>
    <Paragraph position="3"> When the best parse is not present in the n-best subset, there is a question as to how to record the annotation cost. The discriminant decisions made in reducing the subset are still valid and useful in identifying the best parse from the entire set, but we must incur some penalty for the fact that the annotator must confirm that this is the case. To determine the cost for such situations, we add one to the usual full cost of annotating the sentence. This encodes what we feel is a reasonable reflection of the penalty since decisions taken in the n-best phase are still valid in the context of all parses.3  mance levels when using n-best automation (NB).</Paragraph>
    <Paragraph position="4"> Table 3 shows the effects of using semi-automated labeling with LL-PROD. As can be seen, random selection costs reduce dramatically with n-best automation (compare rows 1 and 3). It is also an early winner over basic uncertainty sampling (row 2), though the latter eventually reaches the higher accuracies more quickly. Nonetheless, the mixture of AL and semi-automation provides the biggest over-all gains: to reach 73% accuracy, n-best uncertainty sampling (row 4) reduces the cost by 17% over n-best random sampling (row 3) and by 15% over basic uncertainty sampling (row 2). Similar patterns hold for n-best automation with LL-CONFIG.</Paragraph>
    <Paragraph position="5"> Figure 2 provides an overall view on the accumulative effects of ensembling, n-best automation, and uncertainty sampling in the ideal situation of reuse by the AL model itself. Ensemble models and n-best automation show that massive improvements can be made without AL. Nonetheless, we see the largest reductions by using AL, n-best automation, and ensemble models together: LL-PROD using uncertainty sampling and n-best automation (row 4 of Table 3) reaches 73% accuracy with a cost of 1760 compared to 8560 needed by LL-CONFIG using random sampling without automation. This is our best annotation saving: a cost reduction of 80%.</Paragraph>
    <Paragraph position="6"> 8 Closing the reuse gap The previous section's semi-automated labeling experiments did not involve reuse. If models are expected to evolve, could n-best automation fill in the cost gap created by reuse? To test this, we considered reusing examples with our best model (LL3When we do not allow ourselves to benefit from such labeling decisions, our annotation savings naturally decrease, but not below when we do not use n-best labeling.</Paragraph>
    <Paragraph position="7">  provements to the annotation scenario starting from random sampling with LL-CONFIG: ensembling, n-best automation, and uncertainty sampling.</Paragraph>
    <Paragraph position="8"> PROD), as selected by different models using both AL and n-best automation as a combined strategy.</Paragraph>
    <Paragraph position="9"> For LL-CONFIG and LL-CONGLOM as selectors, the gap is entirely closed: costs for reuse were virtually equal to when LL-PROD selects examples for itself without n-best (Table 3, row 2).</Paragraph>
    <Paragraph position="10"> The gap also closes when n-best automation and AL are used with the weaker LL-MRS model. Performance (Table 4, row 1) still falls far short of LL-PROD selecting for itself without n-best (Table 3, row 2). However, the gap closes even more when n-best automation and random sampling are used with  mance levels in reuse situations where n-best automation (NB) was used with LL-MRS with uncertainty sampling (US) or random sampling (RAND).</Paragraph>
    <Paragraph position="11"> Interestingly, when using a weak selector (LL-MRS), n-best automation combined with random sampling was more effective than when combined with uncertainty sampling. The reason for this is clear. Since AL typically selects more ambiguous examples, a weak model has more difficulty getting the best parse within the n-best when AL is used.</Paragraph>
    <Paragraph position="12"> Thus, the gains from the more informative examples selected by AL are surpassed by the gains that come with the easier labeling with random sampling.</Paragraph>
    <Paragraph position="13"> For most situations, n-best automation is beneficial: the gap introduced by reuse can be reduced. n-best automation never results in an increase in cost. This is still true even if we do not allow ourselves to reuse those discriminants which were used to select the best parse from the n-best subset and the best parse was not actually present in that subset.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
9 Related work
</SectionTitle>
    <Paragraph position="0"> There is a large body of AL work in the machine learning literature, but less so within natural language processing (NLP). Most work in NLP has primarily focused upon uncertainty sampling (Hwa, 2000; Tang et al., 2002). Hwa (2001) considered reuse of examples selected for one parser by another with uncertainty sampling. This performed better than sequential sampling but was only half as effective as self-selection. Here, we have considered reuse with respect to many models and their co-relatedness. Also, we compare reuse performance against against random sampling, which we showed previously to be a much stronger baseline than sequential sampling for the Redwoods corpus (Osborne and Baldridge, 2004). Hwa et al. (2003) showed that for parsers, AL outperforms the closely related co-training, and that some of the labeling could be automated. However, their approach requires strict independence assumptions.</Paragraph>
  </Section>
class="xml-element"></Paper>