<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1101">
  <Title>Detecting Errors in Corpora Using Support Vector Machines</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Corpora are widely used in natural language processing today. For example, many statistical part-of-speech (POS) taggers have been developed, and they use corpora as training data to obtain statistical information or rules (Brill, 1995; Ratnaparkhi, 1996). For corpus-based natural language processing systems, both the quantity and the quality of the corpus affect performance. In general, corpora are annotated by hand and are therefore error-prone. These errors are problematic for corpus-based systems: they become false training examples and degrade the performance of the systems. Furthermore, incorrect instances may be used as test examples and prevent the accurate measurement of performance. (Presently with Service Media Laboratory, Corporate Research and Development Center, Oki Electric Industry Co., Ltd.) Many studies and improvements have been conducted for</Paragraph>
    <Paragraph position="1"> POS tagging, and major POS tagging methods achieve an accuracy of 96-97% on the Penn Treebank WSJ corpus, but obtaining higher accuracy is difficult (Ratnaparkhi, 1996). It has been noted that this limitation is largely caused by inconsistencies in the corpus (Ratnaparkhi, 1996; Padró and Màrquez, 1998; van Halteren et al., 2001). Therefore, correcting the errors in a corpus and improving its quality is important.</Paragraph>
    <Paragraph position="2"> However, finding and correcting errors in corpora by hand is costly, since corpora are usually very large. Hence, automatic detection of errors in corpora is necessary.</Paragraph>
    <Paragraph position="3"> One approach to corpus error detection is the use of machine learning techniques (Abney et al., 1999; Matsumoto and Yamashita, 2000; Ma et al., 2001). These methods regard elements that are difficult for a learning model (e.g., boosting or neural networks) to learn as corpus errors.</Paragraph>
    <Paragraph position="4"> Abney et al. (1999) studied corpus error detection using boosting. Boosting assigns weights to training examples, and the weights become large for examples that are difficult to classify. Because examples mislabeled by annotators tend to be difficult to classify, these authors detected errors in POS tags and PP attachment information in a corpus by extracting examples with large weights.</Paragraph>
    <Paragraph position="5"> Some probabilistic approaches to corpus error detection have also been studied (Eskin, 2000; Murata et al., 2000). Eskin (2000) performed corpus error detection using anomaly detection. He assumed that all the elements in a corpus are generated by a mixture model consisting of two distributions, a majority distribution (typically a structured distribution) and an anomalous distribution (a uniform random distribution), and that erroneous elements are generated by the anomalous distribution. For each element in the corpus, the likelihood of the mixture model is calculated both when the element is assumed to be generated from the majority distribution and when it is assumed to be generated from the anomalous one. The element is detected as an error if the likelihood in the latter case is sufficiently large.</Paragraph>
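A much-simplified sketch of this mixture idea (not Eskin's exact likelihood-ratio test) estimates the majority component from empirical symbol frequencies, takes the anomalous component to be uniform, and flags an element when the posterior probability of the anomalous component exceeds 0.5; the mixing weight `lam` and the toy tag data are assumptions for illustration:

```python
from collections import Counter

def anomaly_posteriors(observations, lam=0.05):
    # Simplified mixture-based anomaly detection:
    # majority component = empirical distribution over symbols,
    # anomalous component = uniform over the observed vocabulary.
    counts = Counter(observations)
    n = len(observations)
    p_unif = 1.0 / len(counts)
    post = {}
    for sym, c in counts.items():
        p_maj = c / n
        a = lam * p_unif          # anomalous-component mass
        m = (1 - lam) * p_maj     # majority-component mass
        post[sym] = a / (a + m)   # posterior P(anomalous | sym)
    return post

# Toy POS-tag observations for one word: 'DT' is the dominant tag,
# a single 'NN' plays the role of a possible annotation error.
tags = ['DT'] * 99 + ['NN'] * 1
post = anomaly_posteriors(tags)
suspected = [t for t, p in post.items() if p > 0.5]
```

The rare tag receives a high anomaly posterior and is flagged, while the dominant tag is not, mirroring the likelihood comparison described above.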
    <Paragraph position="6"> In this paper, we focus on detecting errors in corpora annotated with POS tags, and propose a method for corpus error detection using support vector machines (SVMs). SVMs are a machine learning model that has recently been applied with success to many natural language processing tasks. In the next section, we explain how SVMs can be used for corpus error detection.</Paragraph>
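The paper's own SVM procedure is described in the next section; as background only, one natural realization of the general idea is to train a soft-margin SVM on the annotated examples and flag training examples that fall on the wrong side of the learned boundary. The toy data and scikit-learn setup below are illustrative assumptions, not the authors' method:

```python
import numpy as np
from sklearn.svm import SVC

# Toy annotated data: labels split at x = 5, except index 1,
# which is deliberately mislabeled (a simulated annotation error).
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.where(X.ravel() >= 5, 1, -1)
y[1] = 1  # annotation error

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Flag training examples on the wrong side of the learned decision
# boundary (negative functional margin) as suspected errors.
margins = y * clf.decision_function(X)
suspects = np.where(margins < 0)[0].tolist()
```

Examples inside the margin (functional margin below 1) could be flagged as well for a more sensitive detector; the threshold is a design choice.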
  </Section>
</Paper>