<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2020"> <Title>Detecting Errors within a Corpus using Anomaly Detection</Title> <Section position="2" start_page="0" end_page="148" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Manually marking corpora is a time-consuming and expensive process, and it is subject to human error by the experts doing the marking.</Paragraph> <Paragraph position="1"> Unfortunately, many natural language processing methods are sensitive to these errors. To ensure accuracy, typically several experts make passes over the corpus to ensure consistency. For large corpora, this can be a tremendous expense.</Paragraph> <Paragraph position="2"> In this paper, we propose a method for automatically detecting errors in a marked corpus using an anomaly detection technique. This technique detects anomalies, i.e., elements which do not fit in with the rest of the corpus. When applied to marked corpora, the anomalies tend to be errors in the markings of the corpus.</Paragraph> <Paragraph position="3"> To detect the anomalies, we first compute a probability distribution over the entire corpus.</Paragraph> <Paragraph position="4"> Then we apply a statistical test which identifies which elements are anomalies. In this case, the anomalies are the elements with very low likelihood. These elements are marked as errors and are removed from the corpus, and the model is recomputed on the remaining elements. Upon conclusion, we are left with two data sets: one containing the normal elements and the other containing the detected anomalous elements.</Paragraph> <Paragraph position="5"> We evaluate this method on the part-of-speech-tagged portion of the Penn Treebank corpus (Marcus et al., 1993). In one experiment, our method detected 1000 anomalies within a data set of 1.25 million tagged elements. 
Human judges evaluated the results of the application of this method and verified that 69% of the identified anomalies are in fact tagging errors. In another experiment, our method detected 4000 anomalies, of which 44% are tagging errors.</Paragraph> <Section position="1" start_page="0" end_page="148" type="sub_section"> <SectionTitle> 1.1 Related Work </SectionTitle> <Paragraph position="0"> The tagged portion of the Penn Treebank has been extensively utilized for the construction and evaluation of taggers. This includes transformation-based tagging (Brill, 1994; Brill and Wu, 1998). Weischedel et al. (1993) applied Markov models to tagging. Abney et al. (1999) applied boosting to part-of-speech tagging. Ratnaparkhi (1996) estimates a probability distribution for tagging using a maximum entropy approach.</Paragraph> <Paragraph position="1"> Regarding error detection in corpora, Ratnaparkhi (1996) discusses inconsistencies in the Penn Treebank and relates them to inter-annotator differences in tagging style. Abney, Schapire and Singer (1999) discuss how to use boosting for cleaning data.</Paragraph> <Paragraph position="2"> Much work related to the anomaly detection problem stems from the study of outliers in the field of statistics. This work examines detecting and dealing with outliers in univariate data, multivariate data, and structured data where the probability distribution over the data is given a priori. Statistics provides a set of discordancy tests which can be applied to any given element in the dataset to determine whether it is an outlier. A survey of outliers in statistics is given in Barnett and Lewis (1994).</Paragraph> <Paragraph position="3"> Anomaly detection is extensively used within the field of computer security, specifically in intrusion detection (Denning, 1987). 
Typically, anomaly detection methods are applied to detect attacks by comparing the activity during an attack to the activity under normal use (Lane and Brodley, 1997; Warrender et al., 1999). The method used in this paper is based on an anomaly detection method which detects anomalies in noisy data (Eskin, 2000).</Paragraph> <Paragraph position="4"> The sparse Markov transducer probability modeling method is an extension of adaptive mixtures of probabilistic transducers (Singer, 1997; Pereira and Singer, 1999). Naive Bayes learning, which is used to estimate probabilities in this paper, is described in Mitchell (1997).</Paragraph> </Section> </Section> </Paper>
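
The iterative procedure described in the introduction (estimate a probability distribution over the corpus, flag very-low-likelihood elements as anomalies, discard them, and re-estimate on the rest) can be sketched as follows. This is a minimal illustration only: the frequency-based P(tag | word) likelihood and the `threshold` and `max_rounds` parameters are assumptions for the sketch, standing in for the paper's actual sparse Markov transducer and naive Bayes estimators.

```python
from collections import Counter


def detect_anomalies(elements, threshold, max_rounds=5):
    """Iteratively flag low-likelihood elements as anomalies.

    `elements` is a list of (word, tag) pairs. The per-element
    likelihood here is a simple conditional frequency P(tag | word);
    the paper uses richer probability models (hypothetical stand-in).
    Returns (normal_elements, anomalous_elements).
    """
    normal = list(elements)
    anomalies = []
    for _ in range(max_rounds):
        # Re-estimate the probability model on the remaining elements.
        word_counts = Counter(w for w, _ in normal)
        pair_counts = Counter(normal)
        # Flag elements whose likelihood falls below the threshold.
        flagged = [e for e in normal
                   if pair_counts[e] / word_counts[e[0]] < threshold]
        if not flagged:
            break  # model has converged; nothing left to remove
        anomalies.extend(flagged)
        # Throw the flagged elements out and recompute on the rest.
        flagged_set = set(flagged)
        normal = [e for e in normal if e not in flagged_set]
    return normal, anomalies
```

For example, if a word is tagged `VB` nine times and `NN` once, the single `NN` tagging has likelihood 0.1 under this model and is flagged at a threshold of 0.2, after which the re-estimated model assigns the remaining taggings likelihood 1.0 and the loop terminates.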