<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2020"> <Title>Detecting Errors within a Corpus using Anomaly Detection</Title> <Section position="3" start_page="148" end_page="149" type="metho"> <SectionTitle> 2 Anomaly Detection </SectionTitle> <Paragraph position="0"> More formally, anomaly detection is the process of determining when an element of data is an outlier. Given a set of training data whose probability distribution is not known in advance, we want to construct an automatic method for detecting anomalies.</Paragraph> <Paragraph position="1"> We are interested in detecting anomalies for two main reasons. First, we are interested in modeling the data, and anomalies can contaminate the model. Second, the anomalies themselves can be of interest, as they may indicate rarely occurring events. For the purposes of this work, we are most interested in identifying mistagged elements, i.e. the second case.</Paragraph> <Paragraph position="2"> In order to motivate a method for detecting anomalies, we must first make assumptions about how the anomalies occur in the data. We use a &quot;mixture model&quot; for explaining the presence of anomalies, one of several popular models in statistics for explaining outliers (Barnett and Lewis, 1994). In the mixture model, there are two probability distributions which generate the data. An element x_i is either generated from the majority distribution or, with (small) probability λ, from an alternate (anomalous) distribution. Our distribution for the data, D, is then: D = (1 - λ)M + λA (1) where M is the majority distribution and A is the anomalous distribution. The mixture framework for explaining anomalies is independent of the properties of the distributions M and A. In other words, no assumptions about the nature of the probability distributions are necessary. The specific probability distributions, M and A, are chosen based on prior knowledge of the problem. Typically M is a structured distribution which is estimated over the data using a machine learning technique, while A is a uniform (random) distribution representing elements which do not fit into M.</Paragraph> <Paragraph position="3"> In the corpus error detection problem, we assume that for each tag in the corpus, with probability (1 - λ) the human annotator marks the corpus with the correct tag, and with probability λ the human annotator makes an error.</Paragraph> <Paragraph position="4"> In the case of an error, we assume that the tag is chosen at random.</Paragraph>
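<Paragraph> As a concrete illustration of this assumed noise process, the sketch below simulates it over a sequence of true tags. This is a minimal sketch, not the paper's code: the tagset, the value of λ, and all names are invented for illustration.</Paragraph>
```python
import random

# Hypothetical tagset and error rate; the paper fixes neither value.
TAGSET = ["NN", "VB", "VBP", "DT", "JJ"]
LAM = 0.03  # lambda: probability that the annotator errs on a tag

def annotate(true_tags, lam=LAM, rng=random):
    """Simulate the mixture model's annotation process: with probability
    (1 - lam) the correct tag is kept (majority distribution M); with
    probability lam the tag is drawn uniformly at random (anomalous
    distribution A)."""
    observed = []
    for tag in true_tags:
        if rng.random() < lam:
            observed.append(rng.choice(TAGSET))  # error: uniform over tagset
        else:
            observed.append(tag)                 # correct annotation
    return observed
```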
<Section position="1" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 2.1 Detection of Anomalies </SectionTitle> <Paragraph position="0"> Detecting anomalies, in this framework, is equivalent to determining which elements were generated by the distribution A and which elements were generated by the distribution M. Elements generated by A are anomalies, while elements generated by M are not. In our case, we have probability functions associated with the distributions M and A, P_M and P_A respectively.</Paragraph> <Paragraph position="1"> The algorithm partitions the data into two sets, the normal elements M and the anomalies A. For each element, we determine whether it is an anomaly, in which case it should be included in A, or a majority element, in which case it should be included in M. We measure the likelihood of the distribution under both cases to make this determination.</Paragraph> <Paragraph position="2"> The likelihood, L, of the distribution D with probability function P over elements x_1, ..., x_N is defined as follows: L(D) = ∏_{i=1}^{N} P(x_i) = ((1 - λ)^{|M|} ∏_{x_i ∈ M} P_M(x_i)) (λ^{|A|} ∏_{x_j ∈ A} P_A(x_j)) (2)</Paragraph> <Paragraph position="3"> Since the product of small numbers is difficult to compute, we instead compute the log likelihood, LL. The log likelihood for our case is: LL(D) = |M| log(1 - λ) + Σ_{x_i ∈ M} log P_M(x_i) + |A| log λ + Σ_{x_j ∈ A} log P_A(x_j) (3)</Paragraph> <Paragraph position="4"> In order to determine which elements are anomalies, we use a general principle for determining outliers in multivariate data (Barnett, 1979). We measure how likely each element x_i is to be an outlier by computing the difference in the log likelihood of the distribution when the element is removed from the majority distribution and included in the anomalous distribution. If this difference is sufficiently large, we declare the element an anomaly.</Paragraph> <Paragraph position="5"> Specifically, how large this difference must be depends on the probability distributions and on prior knowledge of the problem, such as the rate of the anomalies, λ.</Paragraph>
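<Paragraph> The decision rule can be made concrete with a short sketch. This is a simplified illustration, assuming the probability functions P_M and P_A are supplied as callables; the function names, λ, and the threshold are placeholders, since the paper leaves the latter two to prior knowledge of the problem.</Paragraph>
```python
import math

def is_anomaly(x, log_p_M, log_p_A, lam, threshold):
    """Change in the log likelihood LL of equation (3) when x is moved
    from the majority set M to the anomalous set A, holding P_M fixed
    (a full implementation would also reestimate P_M after removing x).
    If the gain exceeds the threshold, x is declared an anomaly."""
    ll_if_majority = math.log(1.0 - lam) + log_p_M(x)  # x stays in M
    ll_if_anomaly = math.log(lam) + log_p_A(x)         # x moves to A
    return (ll_if_anomaly - ll_if_majority) > threshold
```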
</Section> </Section> <Section position="4" start_page="149" end_page="150" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 3.1 Corpus </SectionTitle> <Paragraph position="0"> The corpus we use is the Penn Treebank tagged corpus. The corpus contains approximately 1.25 million manually tagged words from Wall Street Journal articles. For each word, a record is generated containing the following elements: 1. The tag of the current word, T_i.</Paragraph> <Paragraph position="1"> 2. The current word, W_i.</Paragraph> <Paragraph position="2"> 3. The previous tag, T_{i-1}.</Paragraph> <Paragraph position="3"> 4. The next tag, T_{i+1}.</Paragraph> <Paragraph position="4"> Over records containing these four elements, we compute our probability distributions.</Paragraph> </Section> <Section position="2" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 3.2 Probability Modeling Methods </SectionTitle> <Paragraph position="0"> The anomaly detection framework is independent of the specific probability distributions, and different distributions have different properties. We can therefore choose the probability distribution that best models the data based on our intuitions about the problem.</Paragraph> <Paragraph position="1"> To illustrate this, we perform two sets of experiments, each using a different probability modeling method. The first set of experiments uses sparse Markov transducers as the probability modeling method, while the second uses a simple naive Bayes method.</Paragraph> </Section> <Section position="3" start_page="149" end_page="150" type="sub_section"> <SectionTitle> 3.3 Sparse Markov Transducers </SectionTitle> <Paragraph position="0"> Sparse Markov transducers compute probabilistic mappings over sparse data. A Markov transducer is defined to be a probability distribution conditional on a finite set of inputs. A Markov transducer of order L is a conditional probability distribution of the form: P(Y_t | X_t X_{t-1} ... X_{t-L+1}) (4) where the X_k are random variables over the input alphabet Σ_in and Y_t is a random variable over the output alphabet Σ_out. This probability distribution stochastically defines a mapping of strings over the input alphabet into the output alphabet. The mapping is conditional on the L previous input symbols.</Paragraph> <Paragraph position="1"> In the case of sparse data, the probability distribution is conditioned on only some of the inputs. We use sparse Markov transducers to model these types of distributions. A sparse Markov transducer is a conditional probability of the form: P(Y_t | φ^{n_1} X_{t_1} φ^{n_2} X_{t_2} ... φ^{n_k} X_{t_k}) (5) where φ represents a wild card symbol and t_i = t - Σ_{j=1}^{i} n_j - (i - 1). The goal of the sparse Markov transducer estimation algorithm is to estimate a conditional probability of this form based upon a set of inputs and their corresponding outputs. However, the task is complicated because it is not known a priori which inputs the probability distribution is conditioned on.</Paragraph> <Paragraph position="2"> Intuitively, a fixed order Markov chain of order L is equivalent to an n-gram with n = L. In a variable order Markov chain, the value of n changes depending on the context. For example, some elements in the data may use a bigram, while others may use a trigram. The sparse Markov transducer uses a weighted sum of n-grams for different values of n, and these weights depend on the context. In addition, the weighted sum ranges not only over n-grams, but also over n-grams with wild cards, such as a trigram in which only the first and last elements are conditioned on. In this case we are looking at the input sequence of the current word, W_t, the previous tag, T_{t-1}, and the next tag, T_{t+1}. The output is the set of all possible tags. The models in the weighted sum are the trigram W_t T_{t-1} T_{t+1}; the bigrams W_t T_{t-1}, W_t T_{t+1} and T_{t-1} T_{t+1}; and the unigrams W_t, T_{t-1} and T_{t+1}.</Paragraph> <Paragraph position="3"> The specific weight of each model depends on the context, i.e. the actual values of W_t, T_{t-1}, and T_{t+1}.</Paragraph> <Paragraph position="4"> Sparse Markov transducers depend on a set of prior probabilities that incorporate prior knowledge about the importance of various elements in the input sequence. These prior probabilities are set based on the problem. For this problem, we use the priors to encode the information that the current word, W_t, is very important in determining the part of speech.</Paragraph> <Paragraph position="5"> Each model in the weighted sum uses a pseudo-count predictor. This predictor computes the probability of an output (tag) from the number of times that a specific output was seen in a given context. In order to avoid probabilities of 0, we assume that we have seen each output at least once in every context. In fact, these predictors can be any probability distribution, and the choice can also depend on what works best for the task.</Paragraph> </Section>
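<Paragraph> The pseudo-count predictor can be sketched in a few lines. This is an illustrative implementation of a single model in the weighted sum (the context could be, e.g., the pair (W_t, T_{t-1})); the class name is invented, and the actual estimator additionally mixes many such models with context-dependent weights.</Paragraph>
```python
from collections import defaultdict

class PseudoCountPredictor:
    """Pseudo-count estimate of P(tag | context) for a single context
    model. Every tag is treated as having been seen once in every
    context, so no output ever receives probability 0."""

    def __init__(self, tagset):
        self.tagset = list(tagset)
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, context, tag):
        self.counts[context][tag] += 1  # one more observation in this context

    def prob(self, context, tag):
        seen = self.counts[context]
        total = sum(seen.values()) + len(self.tagset)  # +1 pseudo-count per tag
        return (seen[tag] + 1) / total
```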
<Section position="4" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 3.4 Naive Bayes </SectionTitle> <Paragraph position="0"> The probability distribution for the tags was also estimated using a straightforward naive Bayes approach.</Paragraph> <Paragraph position="1"> We are interested in the probability of a tag given the current word, the previous tag, and the next tag, i.e. the probability distribution P(T_i | W_i, T_{i-1}, T_{i+1}), which using Bayes' rule is equivalent to: P(T_i | W_i, T_{i-1}, T_{i+1}) = P(W_i, T_{i-1}, T_{i+1} | T_i) P(T_i) / P(W_i, T_{i-1}, T_{i+1}) (6)</Paragraph> <Paragraph position="2"> If we make the naive Bayes independence assumption and assume that the denominator is constant over all values, this reduces to: P(T_i | W_i, T_{i-1}, T_{i+1}) = C P(W_i | T_i) P(T_{i-1} | T_i) P(T_{i+1} | T_i) P(T_i) (7) where C is a normalization constant that makes the probabilities sum to 1. Each of the values on the right side of the equation can easily be computed over the data, yielding an estimate of the probability distribution.</Paragraph> </Section>
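<Paragraph> Equation (7) translates directly into code. The sketch below assumes the component distributions have already been estimated (for instance, with pseudo-count predictors as above) and are supplied as callables; all names are illustrative. The constant C is omitted, since it does not affect which tag is most likely.</Paragraph>
```python
def naive_bayes_score(tag, word, prev_tag, next_tag,
                      p_word, p_prev, p_next, p_tag):
    """Right-hand side of equation (7) up to the constant C: an
    unnormalized estimate of P(T_i | W_i, T_{i-1}, T_{i+1})."""
    return (p_word(word, tag)        # P(W_i | T_i)
            * p_prev(prev_tag, tag)  # P(T_{i-1} | T_i)
            * p_next(next_tag, tag)  # P(T_{i+1} | T_i)
            * p_tag(tag))            # P(T_i)

def suggested_tag(tagset, word, prev_tag, next_tag,
                  p_word, p_prev, p_next, p_tag):
    """The suggested tag for a context is the tag with the highest score."""
    return max(tagset, key=lambda t: naive_bayes_score(
        t, word, prev_tag, next_tag, p_word, p_prev, p_next, p_tag))
```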
<Section position="5" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 3.5 Computing Probability Distributions </SectionTitle> <Paragraph position="0"> Each probability distribution was trained over the records, giving a model over the entire data.</Paragraph> <Paragraph position="1"> The probability model is then used to determine whether or not an element is an anomaly by applying the test in equation (3). Typically this can be done efficiently because the approach does not require reestimating the model over the entire data set. If an element is designated as an anomaly, we remove it from the set of normal elements and efficiently reestimate the probability distribution in order to detect further anomalous elements.</Paragraph> </Section> </Section> <Section position="5" start_page="150" end_page="151" type="metho"> <SectionTitle> 4 Results/Evaluation </SectionTitle> <Paragraph position="0"> The method was applied to the Penn Treebank corpus and a set of anomalies was generated. These anomalies were evaluated by human judges to determine whether they are in fact tagging errors in the corpus. The human judges were natural language processing researchers (not the author) familiar with the Penn Treebank markings.</Paragraph> <Paragraph position="1"> In the experiments involving the sparse Markov transducers, 7055 anomalies were detected after applying the method. In the experiments involving the naive Bayes learning method, 6213 anomalies were detected.</Paragraph> <Paragraph position="2"> Sample output from the system is shown in figure 1. Each error is shown in its context, marked with !!!. The likelihood of the tag is also given, which is extremely low for the errors. The system also outputs a suggested tag and its likelihood, the suggested tag being the tag with the highest likelihood for that context. As we can see, these errors are clearly annotation errors.</Paragraph> <Paragraph position="3"> Since the anomalies detected by the two probability modeling methods differed only slightly, we performed human judge verification of the errors over only the results of the sparse Markov transducer experiments.</Paragraph> <Paragraph position="4"> The anomalies were ranked by their likelihood. Using this ranking, the set of anomalies was broken up into sets of 1000 records. We examined the first 4000 elements by randomly selecting 100 elements out of each 1000.</Paragraph> <Paragraph position="5"> Human judges were presented with the system output for four sets of 100 anomalies. The judges were asked to choose among three options for each example: 1. Corpus Error - The tag in the corpus sentence is incorrect.</Paragraph> <Paragraph position="6"> 2. Unsure - The judge is unsure whether or not the corpus tag is correct.</Paragraph> <Paragraph position="7"> 3. System Error - The tag in the corpus sentence is correct and the system incorrectly marked it as an error.</Paragraph> <Paragraph position="8"> The &quot;unsure&quot; choice was allowed because of the inherent subtleties in differentiating between types of tags such as &quot;VB vs. VBP&quot; or &quot;VBD vs. VBN&quot;.</Paragraph> <Paragraph position="9"> Over the 400 examples evaluated, 158 were corpus errors, 202 were system errors, and the judges were unsure in 40 of the cases. The corpus error rate was computed by discarding the unsure cases and computing: Corpus error rate = Corpus Errors / (System Errors + Corpus Errors) (8) The total corpus error rate over the 400 manually checked examples was 44%. As can be seen, many of the anomalies are in fact errors in the corpus.</Paragraph> <Paragraph position="10"> For each error, we asked the human judge to determine whether the correct tag is the system's suggested tag. Out of the 158 corpus errors in total, the system's suggested tag would have corrected the error in 145 cases.</Paragraph> <Paragraph position="11"> Since the verified examples were selected at random, we can assume that 91% of corpus errors would be automatically corrected if the system replaced each suspect tag with the suggested tag. Ignoring the &quot;unsure&quot; elements for the purposes of this analysis, if we had attempted to automatically correct the first 1000 examples, where the error rate was 69%, this method would have reduced the total number of errors in the corpus by 245.</Paragraph> </Section> </Paper>