<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2206">
  <Title>Spotting the 'Odd-one-out': Data-Driven Error Detection and Correction in Textual Databases</Title>
  <Section position="8" start_page="46" end_page="46" type="concl">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have presented two methods for (semi-)automatic error detection and correction in textual databases. The two methods are aimed at different types of errors: horizontal error correction attempts to identify and correct inconsistent values within a database record; vertical error correction is aimed at values which were accidentally entered in the wrong column.</Paragraph>
    <Paragraph position="1"> Both methods are data-driven and require little or no background knowledge. The methods are also language-independent and can be applied to multi-lingual databases. While we utilise supervised machine learning, no manual annotation of training data is required, as the training set is obtained directly from the database.</Paragraph>
    <Paragraph position="2"> We tested the two methods on an animal specimens database and found that a significant proportion of errors could be detected: up to 97% for horizontal error detection and up to 100% for vertical error detection. While the error detection precision was fairly low for both methods (up to 55% for the horizontal method and up to 25.57% for the vertical method), the number of potential errors flagged was still sufficiently small to check manually. Furthermore, the automatically predicted correction for an error was often the right one. Hence, it would be feasible to employ the two methods in a semi-automatic error correction set-up where potential errors together with a suggested correction are flagged and presented to a user.</Paragraph>
    <Paragraph position="3"> As the two error correction methods are to some extent complementary, it would be worthwhile to investigate whether they can be combined. Some errors flagged by the horizontal method will not be detected by the vertical method: for instance, values which are valid in a given column but inconsistent with the values of other fields. On the other hand, values which were entered in the wrong column should, in theory, also be detected by the horizontal method. For example, if the correct FAMILY for Rana aurora is Ranidae, it should make no difference whether the (incorrect) value in the FAMILY field is Bufonidae, which is a valid value for FAMILY but the wrong family for Rana aurora, or Amphibia, which is not a valid value for FAMILY but the correct CLASS value for Rana aurora; in both cases the error should be detected. Hence, if both methods predict an error in a given field, this should increase the likelihood that there is indeed an error. This could be exploited to obtain a higher precision. We plan to experiment with this idea in future research.</Paragraph>
    <Paragraph position="4"> Acknowledgments The research reported in this paper was funded by NWO (Netherlands Organisation for Scientific Research) and carried out at the Naturalis Research Labs in Leiden. We would like to thank Pim Arntzen and Erik van Nieukerken from Naturalis for guidance and helpful discussions. We are also grateful to two anonymous reviewers for useful comments.</Paragraph>
  </Section>
</Paper>