<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-3001">
<Title>Sample Selection for Statistical Parsing</Title>
<Section position="2" start_page="0" end_page="254" type="abstr">
<SectionTitle>1. Introduction</SectionTitle>
<Paragraph position="0"> Many learning tasks for natural language processing require supervised training; that is, the system successfully learns a concept only if it has been given annotated training data. For example, while it is difficult to induce a grammar from raw text alone, the task is tractable when the syntactic analysis for each sentence is provided as part of the training data (Pereira and Schabes 1992). Current state-of-the-art statistical parsers (Collins 1999; Charniak 2000) are all trained on large annotated corpora such as the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993). However, supervised training data are difficult to obtain: existing corpora might not contain the relevant type of annotation, and the data might not be in the domain of interest. For example, one might need lexical-semantic analyses in addition to the syntactic analyses in the treebank, or one might be interested in processing languages, domains, or genres for which there are no annotated corpora. Because supervised training demands significant human involvement (e.g., annotating the syntactic structure of each sentence by hand), creating a new corpus is a labor-intensive and time-consuming endeavor. The goal of this work is to minimize a system's reliance on annotated training data.</Paragraph>
<Paragraph position="1"> One promising approach to mitigating the annotation bottleneck is to use sample selection, a variant of active learning. Sample selection is an interactive learning method in which the machine takes the initiative in selecting unlabeled data for the human to annotate. Under this framework, the system has access to a large pool of unlabeled data, and it has to predict how much it can learn from each candidate in the pool if that candidate is labeled. More quantitatively, we associate each candidate in the pool with a training utility value (TUV). If the system can accurately identify the subset of examples with the highest TUV, it will have located the most beneficial training examples, thus freeing the annotators from having to label less informative examples.</Paragraph>
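<Paragraph position="2"> The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the article's implementation: the model interface (a hypothetical model.distribution(sentence) method returning a probability distribution over candidate analyses), the use of distribution entropy as the training utility value, and the batch size are all assumptions made for the sake of the example.
import math

def utility(model, sentence):
    # Illustrative TUV: entropy of the model's distribution over candidate
    # analyses for the sentence; higher entropy means a more uncertain learner.
    # model.distribution() is a hypothetical method assumed for this sketch.
    probs = model.distribution(sentence)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_samples(model, unlabeled_pool, batch_size=100):
    # One round of pool-based sample selection: score every unlabeled
    # candidate by its estimated training utility and hand the highest-scoring
    # batch to the human annotators for labeling.
    ranked = sorted(unlabeled_pool,
                    key=lambda sentence: utility(model, sentence),
                    reverse=True)
    return ranked[:batch_size]
In practice the selected examples would be labeled, added to the training set, and the model retrained before the next selection round.</Paragraph>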
<Paragraph position="3"> In this article, we apply sample selection to two syntactic learning tasks: training a prepositional-phrase attachment (PP-attachment) model and training a statistical parsing model. We are interested in addressing two main questions. First, what are good predictors of a candidate's training utility? We propose several predictive criteria and define evaluation functions based on them to rank the candidates' utility. We have performed experiments comparing the effect of these evaluation functions on the size of the training corpus. We find that, with a judiciously chosen evaluation function, sample selection can significantly reduce the size of the training corpus. The second main question is: Are the predictors consistently effective for different types of learners? We compare the predictive criteria both across tasks (between PP-attachment and parsing) and within a single task (applying the criteria to two parsing models: an expectation-maximization-trained parser and a count-based parser). We find that the learner's uncertainty is a robust predictive criterion that can be easily applied to different learning models.</Paragraph>
</Section>
</Paper>