<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1012"> <Title>Ensemble-based Active Learning for Parse Selection</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Parse selection </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The Redwoods treebank </SectionTitle> <Paragraph position="0"> Many broad coverage grammars providing detailed syntactic and semantic analyses of sentences exist for a variety of computational grammar frameworks, but their purely symbolic nature means that when ordering licensed analyses, parse selection models are necessary. To overcome this limitation for the HPSG English Resource Grammar (ERG, Flickinger (2000)), the Redwoods tree-bank has been created to provide annotated training material (Oepen et al., 2002).</Paragraph> <Paragraph position="1"> For each utterance in Redwoods, analyses licensed by the ERG are enumerated and the correct one, if present, is indicated. Each analysis is represented as a tree that records the grammar rules which were used to derive it.</Paragraph> <Paragraph position="2"> For example, Figure 1a shows the preferred derivation tree, out of three analyses, for what can I do for you?.</Paragraph> <Paragraph position="3"> Using these trees and the ERG, several different views of analyses can be recovered: phrase structures, semantic interpretations, and elementary dependency graphs. The phrase structures contain detailed HPSG non-terminals but are otherwise of the variety familiar from context-free grammar, as can be seen in Figure 1b.</Paragraph> <Paragraph position="4"> Unlike most treebanks, Redwoods also provides semantic information for utterances. The semantic interpretations are expressed using Minimal Recursion Semantics (MRS) (Copestake et al., 2001), which provides the means to represent interpretations with a flat, underspecified semantics using terms of the predicate calculus and generalized quantifiers. An example MRS structure is given in Figure 2.</Paragraph> <Paragraph position="5"> An elementary dependency graph is a simplified abstraction on a full MRS structure which uses no under-specification and retains only the major semantic predicates and their relations to one another.</Paragraph> <Paragraph position="6"> In this paper, we report results using the third growth of Redwoods, which contains 5302 sentences for which there are at least two parses and for which a unique preferred parse is identified. These sentences have 9.3 words and 58.0 parses on average. Due to the small size of Redwoods and the underlying complexity of the system, exploring the effect of AL techniques for this domain is of practical, as well as theoretical, interest.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Modeling parse selection </SectionTitle> <Paragraph position="0"> As is now standard for feature-based grammars, we use log-linear models for parse selection (Johnson et al., 1999). 
Log-linear models are popular for their ability to incorporate a wide variety of features without making assumptions about their independence. For log-linear models, the conditional probability of an analysis $t$ given a sentence $s$ with a set of analyses $\tau$ is
$$P(t \mid s, m) = \frac{\exp\big(\sum_i \lambda_i f_i(t)\big)}{Z(s)} \qquad (1)$$
where $f_i(t)$ returns the number of times feature $i$ occurs in analysis $t$, $\lambda_i$ is a weight, $Z(s)$ is a normalization factor for the sentence, and $m$ is a model. The parse with the highest probability is taken as the preferred parse for the model. We use the limited memory variable metric algorithm (Malouf, 2002) to determine the weights. Note that because the ERG usually only produces relatively few parses for in-coverage sentences, we can simply enumerate all parses and rank them.</Paragraph> <Paragraph position="3"> The previous parse selection model (equation 1) uses a single model. It is possible to improve performance using an ensemble parse selection model. We create our ensemble model (called a product model) using the product-of-experts formulation (Hinton, 1999):
$$P(t \mid s, m_1 \ldots m_n) = \frac{\prod_{j=1}^{n} P(t \mid s, m_j)}{Z(s)} \qquad (2)$$
Note that each individual model $m_j$ is a well-defined distribution and is usually taken from a fixed set of models. $Z(s)$ is a normalization factor to ensure the product distribution sums to one over the set of possible parses.</Paragraph> <Paragraph position="4"> A product model effectively averages the contributions made by each of the individual models. Our product model, although simple, is sufficient to show enhanced performance when using multiple models. Of course, other ensemble techniques could be used instead.</Paragraph> <Paragraph position="5"> [Fragment of the caption of Figure 2: the top-level event index is given first; it is followed by a list of elementary predications, each of which is preceded by a label that allows it to be related to other predications; the final list is a set of constraints on how labels may be equated.]</Paragraph>
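<Paragraph position="6"> To make the two models above concrete, the following is a minimal illustrative sketch (not the implementation used in the paper; the function names, feature representation, and weight format are assumptions) of how equation (1) and the product-of-experts combination in equation (2) could be computed in Python:

import math

def loglinear_probs(parses, weights, feature_fn):
    """Equation (1): P(t | s, m) for each candidate parse t of one sentence.

    weights    : dict mapping feature name to its weight lambda_i (the model m)
    feature_fn : returns a dict of feature counts f_i(t) for a parse t
    """
    scores = []
    for t in parses:
        feats = feature_fn(t)
        scores.append(sum(weights.get(f, 0.0) * c for f, c in feats.items()))
    shift = max(scores)                              # log-sum-exp for stability
    z = sum(math.exp(s - shift) for s in scores)     # normalization factor Z(s)
    return [math.exp(s - shift) / z for s in scores]

def product_probs(parses, models):
    """Equation (2): product-of-experts combination of several log-linear models.

    models : list of (weights, feature_fn) pairs, one per component model
    """
    per_model = [loglinear_probs(parses, w, fn) for w, fn in models]
    raw = [math.prod(ps) for ps in zip(*per_model)]
    z = sum(raw)                                     # product distribution sums to one
    return [p / z for p in raw]

def preferred_parse(parses, probs):
    """The parse with the highest probability is taken as the preferred parse."""
    return max(zip(parses, probs), key=lambda pair: pair[1])[0]

In this sketch the weights would come from an estimation procedure such as the limited memory variable metric algorithm mentioned above; the sketch only covers ranking the parses of a single sentence.</Paragraph>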
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Three feature sets </SectionTitle> <Paragraph position="0"> Utilizing the various structures made available by Redwoods (derivation trees, phrase structures, MRS structures, and elementary dependency graphs), we create three distinct feature sets: configurational, ngram, and conglomerate. These three feature sets are used to train log-linear models. They incorporate different aspects of the parse selection task and so have different properties.</Paragraph> <Paragraph position="1"> This is crucial for creating diverse models for use in product ensembles as well as for the ensemble-based AL algorithms discussed in section 4.</Paragraph> <Paragraph position="2"> The configurational feature set is based on the derivation tree features described by Toutanova et al. (2003) and takes into account parent, grandparent, and sibling relationships among the nodes of the trees (such as that given in Figure 1a). The ngram set, described by Baldridge and Osborne (2003), also uses derivation trees; however, it uses a linearized representation of trees to create ngrams over the tree nodes. This feature creation strategy encodes many (but not all) of the relationships in the configurational set, and also captures some additional long-distance relationships.</Paragraph> <Paragraph position="3"> The conglomerate feature set uses a mixture of features gleaned from phrase structures, MRS structures, and elementary dependency graphs. Each of these representations contains less information than that provided by derivation trees, but together they provide a different and comprehensive view of the ERG semantic analyses.</Paragraph> <Paragraph position="4"> The features contributed by phrase structures are simply ngrams of the kind described above for derivation trees.</Paragraph> <Paragraph position="5"> The features drawn from the MRS structures and elementary dependency graphs capture various dominance and co-occurrence relationships between nodes in the structures, as well as some global characteristics such as how many predications and nodes they contain.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Parse selection performance </SectionTitle> <Paragraph position="0"> Parse selection accuracy is measured using exact match, so a model is awarded a point if it picks some parse for a sentence and that parse is the correct analysis indicated in Redwoods. To deal with ties, the accuracy is given as $1/n$ when a model ranks $n$ parses highest and the best parse is one of them.</Paragraph> <Paragraph position="1"> Using the configurational, ngram, and conglomerate feature sets described in section 2.3, we create three log-linear models, which we will refer to as LL-CONFIG, LL-NGRAM, and LL-CONGLOM, respectively. We also create an ensemble model (called LL-PROD) with them using equation 2. The results for a chance baseline (selecting a parse at random), each of the three base models, and LL-PROD are given in Table 1. These are 10-fold cross-validation results, using all the training data when estimating models and the test split when evaluating them.</Paragraph> <Paragraph position="2"> Though their overall accuracy is similar, the single models only agree about 80% of the time, and performance varies by 3-4% between them on different folds of the cross-validation. Such variation is crucial for use in ensembles, and indeed, LL-PROD reduces the error rate of the best single model by roughly 10%. Redwoods is different from other treebanks in that the treebank itself changes as the ERG is improved. LL-PROD's accuracy of 77.78% is the highest reported performance on version 3 of Redwoods. Results have also been presented for versions 1 (Baldridge and Osborne, 2003) and 1.5 (Oepen et al., 2002; Toutanova et al., 2003), both of which have considerably less ambiguity than version 3. Accordingly, LL-PROD's accuracy increases to 84.23% when tested on version 1.5, which has 3834 ambiguous sentences with an average length of 7.98 and average ambiguity of 11.05.</Paragraph>
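<Paragraph position="3"> As an illustration of the exact-match scoring with ties described at the start of this subsection, per-sentence credit could be computed as follows (a hypothetical helper, not code from the Redwoods tools):

def exact_match_credit(parse_probs, gold_index):
    """Award 1/n when the gold parse is among the n parses ranked highest."""
    top = max(parse_probs)
    tied = [i for i, p in enumerate(parse_probs) if p == top]
    return 1.0 / len(tied) if gold_index in tied else 0.0

Accuracy over a test set is then the mean of these per-sentence credits.</Paragraph>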
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Measuring annotation cost </SectionTitle> <Paragraph position="0"> When evaluating AL methods, we compare them on two metrics: the absolute number of sentences they select (unit cost) and the summed number of decisions needed to select an individual preferred parse from a set of possible parses (discriminant cost). Unit cost is commonly used in AL research (Tang et al., 2002), but discriminant cost is more fine-grained. Discriminant cost works as follows. Annotation for Redwoods does not consist of actually drawing parse trees; instead, it involves picking the correct parse out of those produced by the ERG. To facilitate this task, Redwoods presents local discriminants which disambiguate large portions of the parse forest. This means that the annotator does not need to inspect all parses when specifying the intended analysis, and so possible parses are narrowed down quickly even for sentences with a large number of parses. More interestingly, it means that the labeling burden is relative to the number of possible parses rather than the number of constituents in a parse. The discriminant cost of the examples we use averages 3.34 per sentence and ranges from 1 to 14.</Paragraph> <Paragraph position="1"> We measure the discriminant cost of annotating a sentence as the number of discriminant decisions needed to narrow the parse forest down to the preferred parse in Redwoods, plus one to reflect the final decision of selecting the preferred parse from the reduced parse forest. (Hwa (2000) uses the number of brackets needed to construct a parse tree as another annotation cost; our approach measures the cost of a more efficient labelling strategy than Hwa's tree drawing.) Although we have not measured the cognitive burden on humans, we strongly believe that simply selecting the best parse is far more efficient than drawing the best parse for some sentence (as exemplified by Hwa (2000)). However, an interesting tension here is that we are committed to the ERG producing the intended parse within the set of analyses. When drawing a parse tree, by definition, the best parse is created. This may not always be true when using a manually written grammar such as the ERG.</Paragraph>
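<Paragraph position="2"> A minimal sketch of the two costs (the helper names are ours; the discriminant counts themselves come from the annotation process):

def discriminant_cost(num_discriminant_decisions):
    """Decisions needed to narrow down the parse forest, plus one final selection."""
    return num_discriminant_decisions + 1

def corpus_costs(decision_counts):
    """Unit cost and summed discriminant cost for a batch of selected sentences."""
    unit_cost = len(decision_counts)
    disc_cost = sum(discriminant_cost(d) for d in decision_counts)
    return unit_cost, disc_cost
</Paragraph>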
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Active learning </SectionTitle> <Paragraph position="0"> Performance for some model is maximized after selecting, labeling, and adding a new example $x$ to the labeled training material $\mathcal{D}$ such that the noise level of $x$ is low and both the bias and variance of the model using $\mathcal{D} \cup \{x\}$ are minimized (Cohn et al., 1995). If examples are selected for labeling using a strategy of minimizing either variance or bias, then typically the error rate of a model decreases much faster than if examples are simply selected randomly for labeling.</Paragraph> <Paragraph position="1"> In reality, selecting data points for labeling such that a model's variance and/or bias is maximally minimized is computationally intractable, so approximations are typically used instead. Ensemble methods can improve the performance of our active learners. An ensemble active learner uses more than one component model. For example, query-by-committee is an ensemble AL method, as is our generalization of uncertainty sampling.</Paragraph> <Paragraph position="2"> In this section, we describe the AL methods that we tested on Redwoods, which include both single-model and ensemble-based AL techniques. Our single-model approaches are not meant to be exhaustive. In principle, there is no reason why we could not have also tried (within a kernel-based environment) selecting examples by their distance to a separating hyperplane (Tong and Koller, 2000) or else using the computationally demanding approach of Roy and McCallum (2001).</Paragraph> <Paragraph position="3"> AL for parse selection is potentially problematic as sentences vary both in length and in the number of parses they have. After experimenting with, and without, a variety of normalization strategies, we found that generally there were no major differences overall. None of our methods therefore uses any extra normalization.</Paragraph> <Paragraph position="4"> In all our methods, $\tau$ denotes the set of analyses produced by the ERG for the sentence, $m$ is some model, and $\mathcal{M} = \{m_1, \ldots, m_n\}$ is the set of models.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Uncertainty sampling </SectionTitle> <Paragraph position="0"> Uncertainty sampling (also called tree entropy by Hwa (2000)) measures the uncertainty of a model over the set of parses of a given sentence, based on the conditional distribution it assigns to them. Following Hwa, we use the following measure to quantify uncertainty:
$$f_{te}(s, \tau, m) = -\sum_{t \in \tau} P(t \mid s, m) \log P(t \mid s, m)$$
Higher values of $f_{te}(s, \tau, m)$ indicate examples on which the learner is most uncertain and thus presumably are more informative. Calculating $f_{te}$ is trivial with the conditional log-linear models described in section 2.2.</Paragraph> <Paragraph position="1"> Uncertainty sampling as defined above is a single-model approach. It can be generalized to an ensemble by simply replacing the probability of a single log-linear model with the product probability of equation 2:
$$f_{pte}(s, \tau, \mathcal{M}) = -\sum_{t \in \tau} P(t \mid s, m_1 \ldots m_n) \log P(t \mid s, m_1 \ldots m_n)$$
</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Fixed Query-by-Committee </SectionTitle> <Paragraph position="0"> Another AL method is inspired by the query-by-committee (QBC) algorithm (Freund et al., 1997; Argamon-Engelson and Dagan, 1999). According to QBC, one should select data points when a group of models cannot agree as to the predicted labeling.</Paragraph> <Paragraph position="1"> Using a fixed committee consisting of $n$ distinct models, the examples we select for annotation are those for which the models most disagree on the preferred parse. One way of measuring this is with vote entropy:
$$f_{ve}(s, \tau, \mathcal{M}) = -\sum_{t \in \tau} \frac{V(t, s)}{n} \log \frac{V(t, s)}{n}$$
where $V(t, s)$ is the number of committee members that preferred parse $t$. (We also experimented with Kullback-Leibler divergence to the mean (Pereira et al., 1993; McCallum and Nigam, 1998), but it performed no better than the simpler vote entropy metric.) QBC is inherently an ensemble-based method. We use a fixed set of models in our committee and refer to the resulting sample selection method as fixed QBC. Clearly there are many other possibilities for creating our ensemble, such as sampling from the set of all possible models.</Paragraph>
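<Paragraph position="2"> Both disagreement measures above are simple functions of model output; a small illustrative sketch (function and variable names are ours, not from the experimental setup described here):

import math

def tree_entropy(parse_probs):
    """Uncertainty sampling: entropy of one model's distribution over the parses."""
    return -sum(p * math.log(p) for p in parse_probs if p > 0.0)

def vote_entropy(preferred_parses, committee_size):
    """Fixed QBC: entropy of the committee's votes for the preferred parse.

    preferred_parses : one entry per committee member, giving the parse
                       (e.g. an index) that the member ranks highest
    """
    votes = {}
    for t in preferred_parses:
        votes[t] = votes.get(t, 0) + 1
    return -sum((v / committee_size) * math.log(v / committee_size)
                for v in votes.values())

Sentences scoring highest under either measure would be the first to be sent for annotation.</Paragraph>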
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Lowest best probability selection </SectionTitle> <Paragraph position="0"> Uncertainty sampling considers the overall shape of a distribution to determine how confident a model is for a given example. A radically simpler way of determining the potential informativity of an example is simply to consider the absolute probability of the most highly ranked parse. The smaller this probability, the less confident the model is for that example and the more useful it will be to know its true label.</Paragraph> <Paragraph position="1"> We call this new method lowest best probability (LBP) selection, and calculate it as follows:
$$f_{lbp}(s, \tau, m) = \max_{t \in \tau} P(t \mid s, m)$$
with the examples receiving the lowest values selected for annotation.</Paragraph> <Paragraph position="2"> LBP can be extended for use with an ensemble model in the same manner as uncertainty sampling (that is, replace the single model probability with a product).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="50" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> To test the effectiveness of the various AL strategies discussed in the previous section, we perform simulation studies of annotating version 3 of Redwoods.</Paragraph> <Paragraph position="1"> For all experiments, we used a tenfold cross-validation strategy by randomly selecting 10% of Redwoods (roughly 500 sentences) for the test set and selecting samples from the remaining 90% of the corpus (roughly 4800 sentences) as training material. Each run of AL begins with a single randomly chosen annotated seed sentence.</Paragraph> <Paragraph position="2"> At each round, new examples are selected for annotation from a randomly chosen, fixed-sized a34 a0 a0 sentence subset according to the method, until the annotated training material made available to the learners contains at least a36 a0 a0 a0 examples and a2 a0 a0 a0 discriminants. We select a36 a0 examples for manual annotation at each round, and exclude all examples that have more than 500 parses. Other parameter settings did not produce substantially different results to those reported here.</Paragraph> <Paragraph position="3"> AL results are usually presented in terms of the amount of labeling necessary to achieve given performance levels. We say that one method is better than another if, for a given performance level, less annotation is required. The performance metric used here is parse selection accuracy as described in section 2.4.</Paragraph>
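<Paragraph position="4"> One simulated selection round under this protocol might look like the following sketch (the helper names and default parameter values are placeholders, not the exact settings used in these experiments):

import random

def select_batch(unlabeled, scorer, pool_size=500, batch_size=20, max_parses=500):
    """Score a random pool of unannotated sentences and pick the top batch.

    unlabeled : list of (sentence, parses) examples still awaiting annotation
    scorer    : maps an example to an informativity score, e.g. the tree entropy
                or vote entropy over its parses (higher means more informative)
    """
    pool = random.sample(unlabeled, min(pool_size, len(unlabeled)))
    pool = [ex for ex in pool if not len(ex[1]) > max_parses]  # drop very ambiguous sentences
    ranked = sorted(pool, key=scorer, reverse=True)            # most informative first
    return ranked[:batch_size]

Each selected sentence would then be annotated (incurring its discriminant cost) and added to the training material before the models are re-estimated for the next round.</Paragraph>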
<Section position="1" start_page="0" end_page="50" type="sub_section"> <SectionTitle> 5.1 Baseline results </SectionTitle> <Paragraph position="0"> Frequently, baseline results are those produced by random sampling for a single model. Figure 3a shows a set of baseline results: LL-CONFIG (the best single model) using random sampling and the stronger baseline result of LL-PROD, also using random sampling. Quite clearly, we see that LL-PROD (which uses all three feature sets) outperforms LL-CONFIG. Although not shown, LL-PROD also outperforms LL-NGRAM and LL-CONGLOM trained using random sampling. These results show that the common practice in AL of only reporting the convergence results of a single model, trained using random sampling, can be misleading: we can improve upon the performance of a single model without using AL by using an ensemble model. Our main baseline system is therefore LL-PROD, trained progressively with randomly sampled examples.</Paragraph> </Section> <Section position="2" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 5.2 Ensemble active learning results </SectionTitle> <Paragraph position="0"> Figure 3b compares uncertainty sampling using LL-CONFIG (the lower curve), random sampling using LL-PROD, and uncertainty sampling using LL-PROD.</Paragraph> <Paragraph position="1"> The first thing to note is that random sampling for the ensemble outperforms uncertainty sampling for the single model. This shows that single-model AL results can themselves be beaten by a model that does not use AL.</Paragraph> <Paragraph position="2"> Nonetheless, the graph also shows that an ensemble parse selection model using an ensemble AL method outperforms an ensemble parse selection model not using AL. Table 2 shows the amount of labeling (as measured using our discriminant cost function) that each selection method needs to achieve a given performance level.</Paragraph> <Paragraph position="3"> The top two methods are random baselines; the third method is uncertainty sampling using a single model, while the remaining three methods are all ensemble active learners. There, and in the following text, labels of the form rand-config mean (in this case) using random sampling for LL-CONFIG; labels of the form rand-prod mean (again in this case) random sampling for LL-PROD; the legend QBC means using query-by-committee, with all three base models, when selecting examples for LL-PROD.</Paragraph> <Paragraph position="4"> All three ensemble AL methods (product uncertainty sampling, QBC, and product LBP) provide large gains over random sampling of all kinds. There is very little to distinguish the three methods, though product uncertainty sampling proves the strongest overall, providing a 53.6% reduction over rand-prod to achieve 77% accuracy and a 73.5% reduction over rand-config to reach 75% accuracy.</Paragraph> <Paragraph position="5"> To understand whether product uncertainty is indeed choosing more wisely, it is important to consider the performance of an ensemble parse selection model when examples are chosen by a single-model AL method; that is, using a single-model AL method, but labeling examples using an ensemble model. If the ensemble AL method using the ensemble parse selection model performed only as well as a single-model AL method also using the ensemble parse selection model, then the ensemble parse selection model, rather than the ensemble AL method's choice of more informative examples, would be responsible for the improved performance. We find that, as expected, selecting examples using LL-CONFIG for LL-PROD is worse than LL-PROD selecting for itself.</Paragraph> </Section> <Section position="3" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 5.3 Simple selection metrics </SectionTitle> <Paragraph position="0"> Since sentences have variable length and ambiguity, there are four obvious selection metrics that make no use of AL methods: select sentences that are longer, shorter, more ambiguous, or less ambiguous. We tested all four with LL-PROD and found none which improved on random sampling with the same model. For example, selecting the least ambiguous sentences performs the worst of all experiments we ran, with selection by shortest sentences close behind, respectively requiring 61.9% and 55.4% increases in discriminant cost over random sampling to reach 70% accuracy.</Paragraph> <Paragraph position="1"> Selecting the most ambiguous examples dramatically demonstrates the difference between unit cost and discriminant cost. While that selection method requires a 17.4% increase in discriminant cost to reach 70%, it provides a 27.9% reduction in unit cost. Figure 4 compares (a) unit cost with (b) discriminant cost for ambiguity selection versus random sampling (with LL-PROD).</Paragraph> <Paragraph position="2"> It is also important to consider sequential selection, a default strategy typically adopted by annotators. This was the worst of all AL methods, requiring an increase of 45.5% in discriminant cost over random sampling.
This is most likely because the four sections of Redwoods come from two slightly different domains: appointment scheduling and travel planning dialogs. Because of this, sequential selection does not choose examples from the latter domain until all those from the former have been selected, and it thus lacks examples that are similar to those in the test set from the latter domain.</Paragraph> </Section> </Section> <Section position="6" start_page="50" end_page="50" type="metho"> <SectionTitle> 6 Related work </SectionTitle> <Paragraph position="0"> There is a large body of AL work in the machine learning literature, but much less within natural language processing. There is even less work on ensemble-based AL.</Paragraph> <Paragraph position="1"> Baram et al. (2003) consider selection of individual AL methods at run-time. However, their AL methods are only ever based on single-model approaches.</Paragraph> <Paragraph position="2"> Turning to parsing, most work has utilized uncertainty sampling (Thompson et al., 1999; Hwa, 2000; Tang et al., 2002). In all cases, relatively simple parsers were bootstrapped, and the comparison was with a single model trained using random sampling. As we pointed out earlier, our product model, not using AL, can outperform single-model active learning.</Paragraph> <Paragraph position="3"> Baldridge and Osborne (2003) also applied AL to Redwoods. They used only two feature sets and considered neither product models nor our simple LBP method. Additionally, they used the unit cost assumption.</Paragraph> <Paragraph position="4"> Hwa et al. (2003) showed that for parsers, AL outperforms the closely related co-training, and that some of the labeling could be automated. However, their approach requires strict independence assumptions.</Paragraph> </Section> </Paper>