<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1012">
<Title>Ensemble-based Active Learning for Parse Selection</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Active learning (AL) methods, such as uncertainty sampling (Cohn et al., 1995) or query by committee (Seung et al., 1992), can dramatically reduce the cost of creating an annotated dataset. In particular, they enable the rapid creation of labeled datasets, which can then be used to train speech and language technologies. Progress in AL will therefore translate into even greater savings in annotation cost and hence faster creation of speech and language systems.</Paragraph>
<Paragraph position="1"> In this paper, we: • Present a novel way of improving uncertainty sampling by generalizing it from a single model to an ensemble of models. This generalization easily outperforms single-model uncertainty sampling.</Paragraph>
<Paragraph position="2"> • Introduce a new, extremely simple AL method (called lowest best probability selection) which is competitive with uncertainty sampling and can also be improved using ensemble techniques (both selection methods are sketched at the end of this section).</Paragraph>
<Paragraph position="3"> • Show that an ensemble of models trained on randomly sampled examples can outperform a single model trained using (single-model) AL methods.</Paragraph>
<Paragraph position="4"> • Demonstrate further reductions in annotation cost when we train the ensemble parse selection model on examples selected by an ensemble-based active learner. This result shows that ensemble learning can improve both the underlying model and the way we select examples for it.</Paragraph>
<Paragraph position="5"> Our domain is parse selection for Head-Driven Phrase Structure Grammar (HPSG). Although annotated corpora exist for HPSG, they do not exist in significant volumes and are limited to a few small domains (Oepen et al., 2002). Even if it were possible to bootstrap from the Penn Treebank, it is unlikely that this would yield sufficient quantities of the high-quality material needed to improve parse selection for detailed linguistic formalisms such as HPSG. There is thus a pressing need to efficiently create significant volumes of annotated material.</Paragraph>
<Paragraph position="6"> Applying AL to parse selection is much more challenging than applying it to simpler tasks such as text classification or part-of-speech tagging. Our labels are complex objects rather than discrete values drawn from a small, fixed set. Furthermore, sentences vary in length and in their number of parses, which adds to the complexity of the task.</Paragraph>
<Paragraph position="7"> Our results specific to parse selection show that: • An ensemble of three parse selection models achieves a 10.8% reduction in error rate over the best single model.</Paragraph>
<Paragraph position="8"> • Annotation cost should not be measured as a unit expenditure per example. Using a more refined cost metric, based upon how efficiently the correct parse can be selected from a set of possible parses, we show that some AL methods are more effective than others, even though they perform similarly under the unit-cost-per-example assumption.</Paragraph>
<Paragraph position="9"> • Ad-hoc selection methods based upon superficial characteristics of the data, such as sentence length or ambiguity rate, are typically worse than random sampling. This motivates using AL methods.</Paragraph>
<Paragraph position="10"> • Labeling sentences in the order they appear in the corpus - as is typically done in annotation - performs much worse than random selection.</Paragraph>
<Paragraph position="11"> Throughout this paper, we shall treat the terms sentence and example as interchangeable; we shall also consider parses and labels as equivalent. We shall use the term method whenever we are talking about AL, and model whenever we are talking about parse selection.</Paragraph>
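The following is a minimal editorial sketch of the two selection methods discussed above; it is not code from the paper. It assumes each parse selection model exposes a hypothetical parse_distribution(sentence) method returning a normalized probability distribution over that sentence's candidate parses; all names are illustrative.

import math

def tree_entropy(parse_probs):
    # Uncertainty of a distribution over a sentence's candidate parses.
    return -sum(p * math.log(p) for p in parse_probs if p > 0)

def ensemble_distribution(models, sentence):
    # Combine the members' parse distributions with a uniform mixture
    # (other combination schemes are possible; this is an assumption).
    dists = [m.parse_distribution(sentence) for m in models]
    n_parses = len(dists[0])
    return [sum(d[i] for d in dists) / len(dists) for i in range(n_parses)]

def uncertainty_sampling(models, unlabeled, batch_size):
    # Ensemble-based uncertainty sampling: pick the sentences whose combined
    # parse distribution has the highest entropy. With a one-element models
    # list this reduces to single-model uncertainty sampling.
    scored = [(tree_entropy(ensemble_distribution(models, s)), s) for s in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:batch_size]]

def lowest_best_probability(models, unlabeled, batch_size):
    # Lowest best probability selection: pick the sentences whose most
    # probable parse receives the lowest probability.
    scored = [(max(ensemble_distribution(models, s)), s) for s in unlabeled]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:batch_size]]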
</Section>
</Paper>