<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2206">
  <Title>Spotting the 'Odd-one-out': Data-Driven Error Detection and Correction in Textual Databases</Title>
  <Section position="4" start_page="41" end_page="41" type="metho">
    <SectionTitle>
3 Data
</SectionTitle>
    <Paragraph position="0"> We tested our error correction methods on a database containing information about animal specimens collected by researchers at Naturalis, the Dutch Natural History Museum.5 The database contains 16,870 entries and 35 columns.</Paragraph>
    <Paragraph position="1"> Each entry provides information about one or several specimens, for example, who collected it, where and when it was found, its position in the zoological taxonomy, the publication which first described and classified the specimen, and so on.</Paragraph>
    <Paragraph position="2"> Some columns contain fairly free text (e.g., SPECIAL REMARKS), others contain textual content6 of a specific type and in a relatively fixed format, such as proper names (e.g., COLLECTOR or LO-CATION), bibliographical information (PUBLICA-TION), dates (e.g., COLLECTION DATE) or numbers (e.g., REGISTRATION NUMBER).</Paragraph>
    <Paragraph position="3"> Some database cells are left unfilled; just under 40% of all cells are filled (i.e., 229,430 cells). There is a relatively large variance in the number of different values in each column, ranging from three for CLASS (i.e., Reptilia, Amphibia, and a remark pointing to a taxonomic inconsistency in the entry) to over 2,000 for SPECIAL REMARKS, which is only filled for a minority of the entries.</Paragraph>
    <Paragraph position="4"> On the other hand there is also some repetition of cell contents, even for the free text columns, which often contain formulaic expressions. For example, the strings no further data available or (found) dead on road occur repeatedly in the special remarks field. A certain amount of repetition is characteristic for many textual databases, and we exploit this in our error correction methods.</Paragraph>
    <Paragraph position="5"> While most of the entries are in Dutch or English, thedatabasealsocontainstextstringsinseveral other languages, such as Portuguese or French (and Latin for the taxonomic names). In principle, there is no limit to which languages can occur in the database. For example, the PUBLICATION column often contains text strings (e.g., the title of the publication) in languages other than Dutch or English.</Paragraph>
  </Section>
  <Section position="5" start_page="41" end_page="43" type="metho">
    <SectionTitle>
4 Horizontal Error Correction
</SectionTitle>
    <Paragraph position="0"> The different fields in a database are often not statistically independent; i.e., for a given entry,  sense, i.e., comprising all character strings, including dates and numbers.</Paragraph>
    <Paragraph position="1"> the likelihood of a particular value in one field may be dependent on the values in (some of) the other fields. In our database, for example, there is an interdependency between the LOCATION and the COUNTRY columns: the probability that the COUNTRY column contains the value South Africa increases if the LOCATION column contains the string Tafel Mountain (and vice versa). Similar interdependencies hold between other columns, such as LOCATION and ALTITUDE, or COUNTRY and BIOTOPE, or between the columns encoding a specimen's position in the zoological taxonomy (e.g., SPECIES and FAMILY). Given enough data, many of these interdependencies can be determined automatically and exploited to identify field values that are likely to be erroneous.</Paragraph>
    <Paragraph position="2"> This idea bears some similarity to the approach by Marcus and Maletic (2000) who infer association rules for a data set and then look for outliers relative to these rules. However, we do not explicitly infer rules. Instead, we trained TiMBL (Daelemans et al., 2004), a memory-based learner, to predict the value of a field given the values of other fields for the entry. If the predicted value differs from the original value, it is signalled as a potential error to a human annotator.</Paragraph>
    <Paragraph position="3"> We applied the method to the taxonomic fields (CLASS, ORDER, FAMILY, GENUS, SPECIES and SUB-SPECIES), because it is possible, albeit somewhat time-consuming, for a non-expert to check the values of these fields against a published zoologicaltaxonomy. Wesplitthedatainto80%training set, 10% development set and 10% test set. As not all taxonomic fields are filled for all entries, the exact sizes for each data set differ, depending on which field is to be predicted (see Table 1).</Paragraph>
    <Paragraph position="4"> We used the development data to set TiMBL's parameters, such as the number of nearest neighbours to be taken into account or the similarity metric (van den Bosch, 2004). Ideally, one would want to choose the setting which optimised the error detection accuracy. However, this would require manual annotation of the errors in the development set. As this is fairly time consuming, we abstained from it. Instead we chose the parameter setting which maximised the value prediction accuracy for each taxonomic field, i.e. the setting for which the disagreement between the values predicted by TiMBL and the values in the database was smallest. The motivation for this was that a high prediction accuracy will minimise the num- null ber of potential errors that get flagged (i.e., disagreementsbetweenTiMBLandthedatabase)and null thus, hopefully, lead to a higher error detection precision, i.e., less work for the human annotator who has to check the potential errors.</Paragraph>
    <Paragraph position="5"> training devel. test  We also used the development data to perform some feature selection. We compared (i) using the values of all other fields (for a given entry) as features and (ii) only using the other taxonomic fields plus the author field, which encodes which taxonomist first described the species to which a given specimen belongs.7 The reduced feature set was found to lead to better or equal performance for all taxonomic fields and was thus used in the experiments reported below.</Paragraph>
    <Paragraph position="6"> For each taxonomic field, we then trained TiMBL on the training set and applied it to the test set, using the optimised parameter settings.</Paragraph>
    <Paragraph position="7"> Table 2 shows the value prediction accuracies for each taxonomic field and the accuracies achieved by two baseline classifiers: (i) randomly selecting a value from the values found in the training set (random) and (ii) always predicting the (training set) majority value (majority). The prediction accuracies are relatively high, even for the lowest fields in the taxonomy, SPECIES and SUBSPECIES,whichshouldbethemostdifficulttopre- null dict. Hence it is in principle possible to predict the value of a taxonomic field from the values of other fields in the database. To determine whether the taxonomic fields are exceptional in this respect, we also tested how well non-taxonomic fields can be predicted. We found that all fields can be predicted with a relatively high accuracy. The lowest accuracy (63%) is obtained for the BIOTOPE field. For most fields, accuracies of around 70% 7The author information provides useful cues for the prediction of taxonomic fields because taxonomists often specialise on a particular zoological group. For example, a taxonomist who specialises on Ranidae (frogs) is unlikely to have published a description of a species belonging to Serpentes (snakes).</Paragraph>
    <Paragraph position="8"> are achieved; this applies even to the &amp;quot;free text&amp;quot;  nomic field values (horizontal method) To determine whether this method is suitable for semi-automatic error correction, we looked at the cases in which the value predicted by TiMBL differed from the original value. There are three potential reasons for such a disagreement: (i) the value predicted by TiMBL is wrong, (ii) the value predicted by TiMBL is correct and the original value in the database is wrong, and (iii) both values are correct and the two terms are (zoological) synonyms. For the fields CLASS, ORDER, FAMILY and GENUS, we checked the values predicted by TiMBL against two published zoological taxonomies8 and counted how many times the predicted value was the correct value. We did not check the two lowest fields (SUB SPECIES and SPECIES), as the correct values for these fields can  onlybedeterminedreliablybylookingatthespecimens themselves, not by looking at the other taxonomic values for an entry. For the evaluation, we  focusedonerrorcorrectionratherthanerrordetection, hence cases where both the value predicted by TiMBL and the original value in the database were wrong, were counted as TiMBL errors.</Paragraph>
    <Paragraph position="9"> Table 3 shows the results (the absolute numbers of database errors, synonyms and TiMBL errors are shown in brackets). It can be seen that TiMBL detects several errors in the database and predicts the correct values for them. It also finds several synonyms. For GENUS, however, the vast majority of disagreements between TiMBL and the database is due to TiMBL errors. This can be explained by the fact that GENUS is relatively low in the taxonomy (directly above SPECIES). As the  for the value of a lower field, the lower a field is in the taxonomy the more difficult it is to predict its value accurately.</Paragraph>
    <Paragraph position="10"> So far we have only looked at the precision of our error detection method (i.e., what proportion of flagged errors are real errors). Error detection recall (i.e., the proportion of real errors that is flagged) is often difficult to determine precisely because this would involve manually checking the dataset (or a significant subset) for errors, which is typically quite time-consuming. However, if errors are identified and corrected semiautomatically, recall is more important than precision; a low precision means more work for the human expert who is checking the potential errors, a low recall, however, means that many errors are not detected at all, which may severely limit the usefulness of the system.</Paragraph>
    <Paragraph position="11"> To estimate the recall obtained by the horizontal error detection method, we introduced errors artificially and determined what percentage of these artificial errors was detected. For each taxonomic field, we changed the value of 10% of the entries, which were randomly selected. In these entries, the original values were replaced by one of the other attested values for this field. The new value was selected randomly and with uniform probability for all values. Of course, this method can only provide an estimate of the true recall, as it is possible that real errors are distributed differently, e.g., some values may be more easily confused by humans than others. Table 4 shows the results. The estimated recall is fairly high; in all cases above 90%. This suggests that a significant proportion of the errors is detected by our method.</Paragraph>
  </Section>
  <Section position="6" start_page="43" end_page="45" type="metho">
    <SectionTitle>
5 Vertical Error Correction
</SectionTitle>
    <Paragraph position="0"> While the horizontal method described in the previous section aimed at correcting values which are inconsistent with the remaining fields of a database entry, vertical error correction is aimed at a different type of error, namely, text strings which were entered in the wrong column of the  (horizontal method) database. For example, in our database, information about the biotope in which a specimen was found may have been entered in the SPECIAL REMARKS column rather than the BIOTOPE column. Errors of this type are quite frequent. They can be accidental, i.e., the person entering the information inadvertently chose the wrong column, but they can also be due to misinterpretation, e.g., the person entering the information may believe that it fits the SPECIAL REMARKS column better than the BIOTOPE column or they may not know that there is a BIOTOPE column. Some of these errors may also stem from changes in the database structure itself, e.g., maybe the BIOTOPE column was only added after the data was entered.9 Identifying this type of error can be recast as a text classification task: given the content of a cell, i.e., a string of text, the aim is to determine which column the string most likely belongs to. Text strings which are classified as belonging to a different column than they are currently in, represent a potential error. Recasting error detection as a text classification problem allows the use of supervised machine learning methods, as training data (i.e., text strings labelled with the column they belong to) can easily be obtained from the database. We tokenised the text strings in all database fields10 and labelled them with the column they 9Many databases, especially in the cultural heritage domain, are not designed and maintained by database experts. Over time, such database are likely to evolve and change structurally. In our specimens database, for example, several columns were only added at later stages.</Paragraph>
    <Paragraph position="1">  occur in. Each string was represented as a vector of 48 features, encoding the (i) string itself and some of its typographical properties (13 features), and (ii) its similarity with each of the 35 columns (in terms of weighted token overlap) (35 features).</Paragraph>
    <Paragraph position="2"> The typographical properties we encoded were: the number of tokens in the string and whether it contained an initial (i.e., an individual capitalised letter), a number, a unit of measurement (e.g., km), punctuation, an abbreviation, a word (as opposed to only numbers, punctuation etc.), a capitalised word, a non-capitalised word, a short word (&lt; 4 characters), a long word, or a complex word (e.g., containing a hyphen).</Paragraph>
    <Paragraph position="3"> The similarity between a string, consisting of a set T of tokens t1 ...tn, and a column colx was defined as:</Paragraph>
    <Paragraph position="5"> where tfidfticolx is the tfidf weight (term frequency - inverse document frequency, cf. (Sparck-Jones, 1972)) of token ti in column colx. This weight encodes how representative a token is of a column. The term frequency, tfti,colx, of a token ti in column colx is the number of occurrences of ti in colx divided by the number of occurrences of all tokens in colx. The term frequency is 0 if the token does not occur in the column. The inverse document frequency, idfti, of a token ti is the number of all columns in the database divided by the number of columns containing ti. Finally, the tfidf weight for a term ti in column colx is defined as: tfidfti,colx = tfti,colx log idfti A high tfidf weight for a given token in a given column means that the token frequently occurs in that column but rarely in other columns, thus the token is a good indicator for that column. Typically tfidf weights are only calculated for content words, however we calculated them for all tokens, partly because the use of stop word lists to filter out function words would have jeopardised the language independence of our method and partly because function words and even punctuation can be very useful for distinguishing different columns. For example, prepositions such as under often indicate BIOTOPE, as in under a stone.</Paragraph>
    <Paragraph position="6"> Sabine Buchholz. The inclusion of multi-lingual abbreviations in the rule set ensures that this tokeniser is robust enough to also cope with text strings in English and other Western European languages.</Paragraph>
    <Paragraph position="7"> To assign a text string to one of the 35 database columns, we trained TiMBL (Daelemans et al., 2004) on the feature vectors of all other database cells labelled with the column they belong to.11 Cases where the predicted column differed from the current column of the string were recorded as potential errors.</Paragraph>
    <Paragraph position="8"> We applied the classifier to all filled database cells. For each of the strings identified as potential errors, we checked manually (i) whether this was a real error (i.e., error detection) and (ii) whether the column predicted by the classifier was the correct one (i.e., error correction). While checking for this type of error is much faster than checking for errors in the taxonomic fields, it is sometimes difficult to tell whether a flagged error is a real error. In some cases it is not obvious which column a string belongs to, for example because two columns are very similar in content (such as LOCATION and FINDING PLACE), in other cases the content of a database field contains several pieces of information which would best be located in different columns. For instance, the string found with broken neck near Karlobag arguably couldbe split between the SPECIAL REMARKS and the LOCA-TION columns. We were conservative in the first case, i.e., we did not count an error as correctly identified if the string could belong to the original column, but we gave the algorithm credit for flagging potential errors where part of the string should be in a different column.</Paragraph>
    <Paragraph position="9">  Theresultsareshowninthesecondcolumn(unfiltered)inTable5. Theclassifierfound836potential errors, 148 of these were found to be real errors. For 100 of the correctly identified errors the predicted column was the correct column. Some of the corrected errors can be found in Table 6.</Paragraph>
    <Paragraph position="10"> Note that the system corrected errors in both English and Dutch text strings without requiring language identification or any language-specific resources (apart from tokenisation).</Paragraph>
    <Paragraph position="11"> We also calculated the precision of error detection (i.e., the number of real errors divided by the number of flagged errors) and the error correction accuracy (i.e., the number of correctly corrected errors divided by the number correctly identified errors). The error detection precision is relatively low (17.70%). In general a low precision means relatively more work for the human expert check11We used the default settings (IB1, Weighted Overlap Metric, Information Gain Ratio weighting) and k=3.</Paragraph>
    <Paragraph position="12">  string original column corrected column op boom ongeveer 2,5 m boven grond SPECIAL REMARKS BIOTOPE (on a tree about 2.5 m above ground)  flagged errors 836 262 real errors 148 67 correctly corrected 100 54 precision error detection 17.70 % 25.57% accuracy error correction 67.57% 80.60%</Paragraph>
    <Paragraph position="0"> rection for all database fields (vertical method) ing the flagged errors. However, note that the system considerably reduces the number of database fields that have to be checked (i.e., 836 out of 229,430 filled fields). We also found that, for this type of error, error checking can be done relatively quickly even by a non-expert; checking the 836 errors took less than 30 minutes. Furthermore, the correction accuracy is fairly high (67.57%), i.e., for most of the correctly identified errors the correctcolumnissuggested. Thismeansthatformost errors the user can simply choose the column suggested by the classifier.</Paragraph>
    <Paragraph position="1"> In an attempt to increase the detection precision we applied two filters and only flagged errors which passed these filters. First, we filtered out potential errors if the original and the predicted column were of a similar type (e.g., if both contained person names or dates) as we noticed that our method was very prone to misclassifications in these cases.12 For example, if the name M.S.</Paragraph>
    <Paragraph position="2"> Hoogmoed occurs several times in the COLLECTOR column and a few times in the DONATOR column, the latter cases are flagged by the system as potential errors. However, it is entirely normal for a person to occur in both the COLLECTOR and the DONATOR column. What is more, it is impossible 12Note, that this filter requires a (very limited) amount of background knowledge, i.e. knowledge about which columns are of a similar type.</Paragraph>
    <Paragraph position="3"> to determine on the basis of the text string M.S.</Paragraph>
    <Paragraph position="4"> Hoogmoed alone, whether the correct column for this string in a given entry is DONATOR or COLLECTOR or both.13 Secondly, we only flagged errorswherethepredictedcolumnwasemptyforthe null current database entry. If the predicted column is already occupied, the string is unlikely to belong to that column (unless the string in that column is also an error). The third column in Table 5 (filtered) shows the results. It can be seen that detection precision increases to 25.57% and correction precision to 80.60%, however the system also finds noticeably fewer errors (67 vs. 148).</Paragraph>
    <Paragraph position="5">  Estimating the error detection recall (i.e., the number of identified errors divided by the over-all number of errors in the database) would involve manually identifying all the errors in the database. This was not feasible for the database as a whole. Instead we manually checked three of the free text columns, namely, BIOTOPE, PUBLICATION and SPECIAL REMARKS, for errors and calculated the recall and precision for these. Table 7 shows the results. For BIOTOPE and PUBLICATION the recall is relatively high (94% and 100%, respectively), for SPECIAL REMARKS it is much lower (24%). The low recall for SPECIAL REMARKS is probably due to the fact that this col13Note,however,thatthehorizontalerrordetectionmethod null proposed in the previous section might detect an erroneous occurrence of this string (based on the values of other fields in the entry).</Paragraph>
    <Paragraph position="6">  umnisveryheterogeneous, thusitisfairlydifficult to find the true errors in it. While the precision is relatively low for all three columns, the number of flagged errors (ranging from 58 for PUBLICATION to 298 for SPECIAL REMARKS) is still small enough for manual checking.</Paragraph>
  </Section>
class="xml-element"></Paper>