<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1101"> <Title>Detecting Errors in Corpora Using Support Vector Machines</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Corpus Error Detection Using Support Vector Machines </SectionTitle> <Paragraph position="0"> Training data for corpus error detection is usually not available, so we have to solve it as an unsupervised learning problem. We consider the problem in the following way: in general, a corpus is built according to a set of guidelines, so it should be consistent. If there is an exceptional element in the corpus that jeopardizes consistency, it is likely to be an error. Therefore, corpus error detection can be conducted by detecting exceptional elements that cause inconsistency.</Paragraph> <Paragraph position="1"> While this is a simple and straightforward approach and any machine learning method is applicable to this task, we use SVMs as the learning algorithm in the settings described in Section 2.2. The advantage of using SVMs in this setting is the following: each position in the annotated corpus receives a weight according to the SVM algorithm, and these weights can be used as confidence levels for erroneous examples. By effectively using those weights, the inspection of the erroneous parts can be undertaken in order of confidence, so that efficient browsing of the corpus becomes possible. We believe this is a particular advantage of our method compared with methods based on other machine learning algorithms.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Support Vector Machines </SectionTitle> <Paragraph position="0"> Support Vector Machines (SVMs) are a supervised machine learning algorithm for binary classification (Vapnik, 1998). 
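As background for the discussion below, a trained kernelized SVM labels a test point by a weighted sum of kernel evaluations against the training examples. The following minimal Python sketch illustrates this; the training points, labels, and weights are made-up illustrative values, not from the paper:

```python
def poly_kernel(a, b, degree=2):
    """Second-order polynomial kernel, the kernel used in the paper's experiments."""
    return (sum(ai * bi for ai, bi in zip(a, b)) + 1) ** degree

def svm_predict(x, support_vectors, labels, alphas, b=0.0):
    """Sum kernel evaluations against training examples, weighted by alpha_i * y_i,
    and return the sign of the result as the predicted label."""
    score = sum(a * y * poly_kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score + b >= 0 else -1

# Toy training set: two separable points with hypothetical weights alpha_i.
svs    = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.1, 0.1]

print(svm_predict([0.9, 1.1], svs, labels, alphas))    # near the positive example
print(svm_predict([-1.2, -0.8], svs, labels, alphas))  # near the negative example
```

The weights here are fixed by hand; in practice they are the dual variables produced by SVM training, which is exactly what the formulation below describes.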
Given l training examples of feature vectors x_i ∈ R^L with labels y_i ∈ {+1, −1}, SVMs map them into a high-dimensional space by a nonlinear function Φ(x) and linearly separate them there. The optimal separating hyperplane is found by solving the following quadratic programming problem: maximize Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C and Σ_{i=1}^{l} α_i y_i = 0,</Paragraph> <Paragraph position="2"> where the function K(x_i, x_j), the inner product of the nonlinear mappings (K(x_i, x_j) = Φ(x_i) · Φ(x_j)), is called a kernel function, and the constant C penalizes training errors and is the upper bound of the α_i. Given a test example x, its label y is decided by summing the inner products of the test example and the training examples, weighted by α_i: y = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b ),</Paragraph> <Paragraph position="4"> where b is a threshold value. Thus, SVMs assign a weight α_i to each training example. The weights are large for examples that are hard for SVMs to classify; that is, exceptional examples in the training data receive large weights. We conduct corpus error detection using these weights.</Paragraph> <Paragraph position="5"> To detect exceptional examples in a corpus annotated with POS tags, we first construct an SVM model for POS tagging using all the elements in the corpus as training examples.</Paragraph> <Paragraph position="6"> Note that each example corresponds to a word in the corpus. The SVMs then assign weights to the examples, and large weights are assigned to difficult examples. Finally, we extract the examples whose weight is greater than or equal to a threshold value α. In the next subsection, we describe how to construct an SVM model for POS tagging.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Revision Learning for POS tagging </SectionTitle> <Paragraph position="0"> We use a revision learning method (Nakagawa et al., 2002) for POS tagging with SVMs.1 
(Footnote 1: The method of (... et al., 2000) can also be used for POS tagging with SVMs, but it has a large computational cost and cannot directly handle the word segmentation that is necessary for Japanese morphological analysis.)</Paragraph> <Paragraph position="1"> This method creates training examples of SVMs with binary labels for each POS tag class using a stochastic model (e.g., an n-gram model) as follows: each word in a corpus becomes a positive example of its POS tag class. We then build a simple stochastic POS tagger based on an n-gram (POS bigram or trigram) model, and the words in the corpus that the stochastic model failed to tag with the correct part-of-speech are collected as negative examples of the incorrect POS tag class. In this way, revision learning builds SVM models that revise the outputs of the stochastic model.</Paragraph> <Paragraph position="2"> For example, assume that for the sentence &quot;11/CD million/CD yen/NNS are/VBP paid/VBN&quot;, a stochastic model incorrectly tags: &quot;11/CD million/CD yen/NN are/VBP paid/VBN&quot;. In this case, the following training examples are created for SVMs: a positive example of the class NNS for the word &quot;yen&quot;, and a negative example of the class NN for the word &quot;yen&quot;.</Paragraph> <Paragraph position="4"> Thus, positive and negative examples are created for each class (POS tag), and an SVM model is trained for each class using these training examples.</Paragraph> <Paragraph position="5"> In English POS tagging, for each word w in the tagged corpus, we use the following features for SVMs: 1. the POS tags and the lexical forms of the two words preceding w; 2. the POS tags and the lexical forms of the two words succeeding w; 3. the lexical form of w, its prefixes and suffixes of up to four characters, and the existence of numerals, capital letters and hyphens in w.</Paragraph> <Paragraph position="6"> Japanese morphological analysis can be conducted with revision learning in almost the same way as English POS tagging, and we use the following features for a morpheme μ: 1. 
the POS tags, the lexical forms and the inflection forms of the two morphemes preceding μ; 2. the POS tags and the lexical forms of the two morphemes succeeding μ; 3. the lexical form and the inflection form of μ.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Extraction of Inconsistencies </SectionTitle> <Paragraph position="0"> So far, we have discussed how to detect exceptional elements in a corpus. However, this alone is insufficient and inconvenient for corpus error detection, because an exceptional element is not always an error; it may be a correct exceptional element or an incorrect one. Furthermore, it is often difficult to judge whether an element is a true error when only the exceptional element itself is shown. To solve these problems, we extract not only an exceptional example but also another, similar example that is inconsistent with it. If the exceptional example is correct, the second example is likely to be an error, and vice versa.</Paragraph> <Paragraph position="1"> We assume that an inconsistency occurs when two examples have similar features but opposite labels. The similarity between two examples x_i and x_j on SVMs is measured by the following distance: d(x_i, x_j) = ||Φ(x_i) − Φ(x_j)|| = √( K(x_i, x_i) − 2 K(x_i, x_j) + K(x_j, x_j) ).</Paragraph> <Paragraph position="3"> We can extract inconsistencies from a corpus as follows: given an example x which was detected as exceptional (as described in the previous subsection), we extract the example z with the smallest value of the distance d(x, z) among the examples whose label differs from that of x. 
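The distance just defined can be computed from kernel values alone, without constructing Φ explicitly. The sketch below illustrates the nearest-opposite-example search on toy vectors; the feature encoding and data are hypothetical, and the second-order polynomial kernel matches the one used in the experiments:

```python
import math

def poly_kernel(a, b, degree=2):
    # Second-order polynomial kernel K(x, z) = (x . z + 1)^2.
    return (sum(ai * bi for ai, bi in zip(a, b)) + 1) ** degree

def kernel_distance(x, z):
    # d(x, z) = ||Phi(x) - Phi(z)|| expanded via kernel evaluations only.
    return math.sqrt(poly_kernel(x, x) - 2 * poly_kernel(x, z) + poly_kernel(z, z))

def nearest_opposite(x, x_label, examples):
    """Among (vector, label) examples whose label differs from x's,
    return the one closest to x in the kernel-induced feature space."""
    opposite = [(v, y) for v, y in examples if y != x_label]
    return min(opposite, key=lambda e: kernel_distance(x, e[0]))

# Toy data: feature vectors with binary labels (+1 for one class, -1 otherwise).
data = [([1.0, 0.0], -1), ([0.9, 0.1], -1), ([0.0, 1.0], -1)]
z, z_label = nearest_opposite([1.0, 0.0], 1, data)
print(z)  # the opposite-labeled example most similar to x
```

Because identical feature vectors have distance zero, an exceptional example paired with its nearest opposite neighbor surfaces exactly the kind of contradiction described above.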
Intuitively, z is the opposite-labeled example closest to x in the SVMs' high-dimensional space, and it may be the reason that x was given a large weight.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> We perform corpus error detection experiments using the Penn Treebank WSJ corpus (English), the RWCP corpus (Japanese) and the Kyoto University Corpus (Japanese). In the following experiments, we use SVMs with a second-order polynomial kernel, and the upper bound value C is set to 1.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Correctly Detected Errors </SectionTitle> <Paragraph position="0">
pay about 11 million yen/NNS ( $ 77,000
budgeted about 11 million yen/NN ( $ 77,500
, president and chief/JJ executive officer of
named president and chief/NN executive officer for
its fiscal first quarter ended/VBN Sept. 30
its first quarter ended/VBD Sept. 30 was</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Incorrectly Detected Errors </SectionTitle> <Paragraph position="0">
EOS 3/LS . EOS
Send your child to Nov. 1-Dec . EOS 3/CD . EOS</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experiments on the Penn Treebank WSJ Corpus (English) </SectionTitle> <Paragraph position="0"> Experiments are performed on the Penn Treebank WSJ corpus, which consists of 53,113 sentences (1,284,792 tokens).</Paragraph> <Paragraph position="1"> We create SVM models for POS tagging using the corpus with revision learning. The distribution of the obtained weights α_i is shown in Figure 1. The values of α_i concentrate near the lower bound zero and the upper bound C. The examples with α_i near the upper bound seem to be exceptional. 
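The extraction step that follows from this observation amounts to thresholding the weights and ranking the survivors so that the highest-confidence error candidates are inspected first. A small sketch, with made-up α values rather than weights from the paper's experiments:

```python
def extract_exceptional(alphas, threshold=0.5):
    """Return (index, alpha) pairs with alpha >= threshold, sorted so the
    highest-confidence error candidates come first for inspection."""
    candidates = [(i, a) for i, a in enumerate(alphas) if a >= threshold]
    return sorted(candidates, key=lambda p: p[1], reverse=True)

# Hypothetical weights: most alphas sit near 0, a few near the upper bound C = 1.
weights = [0.01, 0.0, 0.97, 0.02, 0.55, 0.0, 1.0]
print(extract_exceptional(weights))  # [(6, 1.0), (2, 0.97), (4, 0.55)]
```

Sorting by weight is what makes the confidence-ordered browsing described in Section 2 possible.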
Therefore, we regarded the examples with α_i ≥ 0.5 as exceptional examples (i.e., α = 0.5). As a result, 1,740 elements were detected as errors. We implemented a browsing tool for corpus error detection in HTML (see Figure 2). A detected inconsistency pair is displayed in the lower part of the screen. We examined by hand whether the detected errors were true errors for the first 200 (in corpus order) of the 1,740 detected elements; 199 were actual errors and 1 was not. The precision (the ratio of correctly detected errors to all detected errors) was 99.5%. Examples of correctly and incorrectly detected errors from the corpus are shown in Table 1. The underlined words were detected as errors. Judging whether they are true errors is easy when the pair of mutually contradictory examples is compared.</Paragraph> <Paragraph position="2"> To examine the recall (the ratio of correctly detected errors to all actual errors existing in the corpus), we conducted another experiment on artificial data. We made the artificial data by randomly changing the POS tags of randomly selected ambiguous tokens in the WSJ corpus. The tags of 12,848 tokens (1% of the whole corpus) were changed, and the results are shown in Table 2 for various values of α.2</Paragraph> <Paragraph position="3"> Smaller thresholds α yielded larger recall, but the values were not high.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Experiments on the RWCP Corpus (Japanese) </SectionTitle> <Paragraph position="0"> We use the RWCP corpus, which consists of 35,743 sentences (921,946 morphemes).</Paragraph> <Paragraph position="1"> The distribution of the weights α_i is shown in Figure 3. 
The distribution of α_i shows the same tendency as in the case of the WSJ corpus.</Paragraph> <Paragraph position="2"> We conducted corpus error detection for various values of α, and examined by hand whether the detected errors were true errors. The results are shown in Table 3, where the correctly detected errors are divided into two types, errors of word segmentation and errors of POS tagging, since Japanese has both kinds of ambiguity: word segmentation and POS tagging. A precision of more than 80% was obtained, and the number of POS tag errors was larger than that of segmentation errors.</Paragraph> <Paragraph position="3"> Examples of correctly and incorrectly detected errors from the corpus are shown in Table 4. The underlined morphemes were detected as errors. Among the examples of correctly detected errors, both segmentation errors (upper) and POS tag errors (lower) are detected. On the other hand, the examples of incorrectly detected errors show the limitations of our method. We use the two morphemes on either side of the current morpheme as features for SVMs. In these examples, the two morphemes on either side are the same and only the POS tag of the current morpheme differs, so the SVMs cannot distinguish them and regard them as errors (inconsistencies).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Experiments on the Kyoto University Corpus (Japanese) </SectionTitle> <Paragraph position="0"> Experiments are performed on a portion of the Kyoto University corpus version 2.0, consisting of the articles of January 1 and of January 3 to January 9 (a total of 9,204 sentences, 229,816 morphemes). We set the value of α to 0.5.</Paragraph> <Paragraph position="1"> By repeating corpus error detection and correcting the detected errors by hand, new errors that were not detected previously may be detected. 
To examine this, we repeated corpus error detection and correction by hand. Table 5 shows the results. All the detected errors in all rounds were true errors; that is, the precision was 100%. As corpus error detection is applied repeatedly, the number of detected errors decreases rapidly, and no errors were detected in the fourth round. In short, even when we repeated corpus error detection with feedback, few new errors were detected in this experiment.</Paragraph> </Section> </Section> </Paper>