<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1034"> <Title>From detecting errors to automatically correcting them</Title> <Section position="3" start_page="0" end_page="265" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Annotated corpora serve as training material and as &quot;gold standard&quot; testing material for the development of tools in computational linguistics, and as a source of data for theoretical linguists searching for relevant language patterns. However, they contain annotation errors, and such errors provide unreliable training and evaluation data, as has been previously shown (see ch. 1 of Dickinson (2005) and references therein). Improving the quality of linguistic annotation where possible is thus a key issue for the use of annotated corpora in computational and theoretical linguistics.</Paragraph>
<Paragraph position="1"> Research has gone into automatically detecting annotation errors for part-of-speech annotation (van Halteren, 2000; Květoň and Oliva, 2002; Dickinson and Meurers, 2003), yet there has been virtually no work on automatically or semi-automatically correcting such annotation errors.1 Automatic correction can speed up corpus improvement efforts and provide new data for training NLP technology on the corpus. Additionally, an investigation into automatic correction forces us to re-evaluate the technology using the corpus, providing new insights into that technology.</Paragraph>
<Paragraph position="2"> In this paper we propose to automatically correct part-of-speech (POS) annotation errors in corpora by adapting existing technology for POS disambiguation. We build the correction work on top of a POS error detection phase, described in section 2. In section 3 we discuss how to evaluate corpus correction work, given that we have no benchmark corpus to compare with. We turn to the actual work of correction in section 4, using two different POS taggers as automatic correctors and using the Wall Street Journal (WSJ) corpus as our data. After more thoroughly investigating how problematic tagging distinctions affect the POS disambiguation task, in section 5 we modify the tagging model to better account for these distinctions, and we show that this significantly reduces the error rate of the corpus.</Paragraph>
<Paragraph position="3"> It might be objected that automatic correction of annotation errors will cause information to be lost or will make the corpus worse than it was, but the construction of a large corpus generally requires semi-automated methods of annotation, and automatic tools must be used sensibly at every stage in the corpus-building process. Automated annotation methods are not perfect, but humans also introduce errors, through biases and inconsistent judgments. Thus, automatic corpus correction methods can be used semi-automatically, just as the original corpus creation methods were used.</Paragraph>
<Paragraph position="4"> 1Oliva (2001) specifies hand-written rules to detect and then correct errors, but there is no general correction scheme.</Paragraph> </Section> </Paper>