<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0206">
<Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Data Selection in Semi-supervised Learning for Name Tagging</Title>
<Section position="9" start_page="53" end_page="54" type="concl">
<SectionTitle>7 Conclusions and Future Work</SectionTitle>
<Paragraph position="0">This paper demonstrates the effectiveness of two straightforward semi-supervised learning methods for improving a state-of-the-art name tagger, and investigates the importance of data selection for this application.</Paragraph>
<Paragraph position="1">Banko and Brill (2001) suggested that the development of very large training corpora may be central to progress in empirical natural language processing. As expected, we did obtain improvement from unsupervised bootstrapping over large amounts of unlabeled data. However, exploiting a very large corpus did not by itself produce the greatest performance gain. Rather, we observed that good measures for selecting relevant unlabeled documents and useful labeled sentences are important.</Paragraph>
<Paragraph position="2">The work described here complements the active learning research of Scheffer et al. (2001). They presented an effective active learning approach that selects &quot;difficult&quot; (small-margin) sentences to be labeled by hand and then added to the training set. Our approach selects &quot;easy&quot; sentences, those with large margins, to add automatically to the training set. Combining these methods can magnify the gains possible with active learning.</Paragraph>
<Paragraph position="3">In the future we plan to try topic identification techniques to select relevant unlabeled documents, and to use downstream information extraction components, such as coreference resolution and relation detection, to measure the confidence of the tagging for sentences.
We are also interested in applying clustering as a pre-processing step for bootstrapping.</Paragraph>
</Section>
</Paper>
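The large-margin selection described above can be sketched as follows. This is a minimal illustration only: the tagger output format, the probability representation, and the 0.9 threshold are assumptions for the sketch, not details of the authors' actual system.

```python
# Sketch of margin-based self-training selection ("easy" sentences).
# The data format and the 0.9 threshold are illustrative assumptions;
# they are not taken from the paper's system.

def margin(hypothesis_probs):
    """Margin: probability gap between the two best tagging hypotheses."""
    top = sorted(hypothesis_probs, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]

def select_easy_sentences(machine_tagged, threshold=0.9):
    """Keep machine-tagged sentences whose margin exceeds the threshold.

    These confident sentences are added automatically to the training set
    (self-training), the complement of active learning's choice of
    small-margin sentences for hand labeling.
    """
    return [(sentence, tags)
            for sentence, tags, probs in machine_tagged
            if margin(probs) >= threshold]

# Toy example: (sentence, predicted name tags, top-hypothesis probabilities)
machine_tagged = [
    ("John lives in Boston", ["PER", "O", "O", "LOC"], [0.98, 0.01]),
    ("Jordan beat Jordan", ["PER", "O", "LOC"], [0.55, 0.40]),
]
easy = select_easy_sentences(machine_tagged)  # keeps only the first sentence
```

An active learner would invert the final test (`margin(probs) < threshold`) to pick the small-margin sentences for hand annotation, which is why the two selection strategies combine naturally.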