<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2206">
  <Title>Spotting the 'Odd-one-out': Data-Driven Error Detection and Correction in Textual Databases</Title>
  <Section position="2" start_page="0" end_page="40" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Over the last decades, more and more information has become available in digital form; a major part of this information is textual. While some textual information is stored in raw or typeset form (i.e., as more or less flat text), a lot is semi-structured in databases. A popular example of a textual database is Amazon's book database,1 which contains fields for &quot;author&quot;, &quot;title&quot;, &quot;publisher&quot;, &quot;summary&quot;, etc. Information about collections in the cultural heritage domain is also frequently stored in (semi-)textual databases. An example of a publicly accessible database of this type is the University of St. Andrews's photographic collection. Such databases are an important resource for researchers in the field, especially if the contents can be systematically searched and queried. However, information retrieval from databases can be adversely affected by errors and inconsistencies in the data. For example, a zoologist interested in finding out about the different biotopes (i.e., habitats) in which a given species was found might query a zoological specimens database for the content of the BIOTOPE column for all specimens of that species. Whenever information about the biotope was entered in the wrong column, that particular record will not be retrieved by such a query. Similarly, if an entry erroneously lists the wrong species, it will also not be retrieved.</Paragraph>
    <Paragraph position="1"> Usually it is impossible to avoid errors completely, even in well maintained databases. Errors can arise for a variety of reasons, ranging from technical limitations (e.g., copy-and-paste errors) to different interpretations of what type of information should be entered into different database fields. The latter situation is especially prevalent if the database is maintained by several people. Manual identification and correction of errors is frequently infeasible due to the size of the database. A more realistic approach would be to use automatic means to identify potential errors; these could then be flagged and presented to a human expert, and subsequently corrected manually or semi-automatically. Error detection and correction can be performed as a pre-processing step for information extraction from databases, or it can be interleaved with it.</Paragraph>
    <Paragraph position="2"> In this paper, we explore whether it is possible to detect and correct potential errors in textual databases by applying data-driven clean-up methods which are able to work in the absence of background knowledge (e.g., knowledge about the domain or the structure of the database) and instead rely on the data itself to discover inconsistencies and errors. Ideally, error detection should also be language independent, i.e., require no or few language-specific tools, such as part-of-speech taggers or chunkers. Aiming for language independence is motivated by the observation that many databases, especially in the cultural heritage domain, are multi-lingual and contain strings of text in various languages. If textual data-cleaning methods are to be useful for such databases, they should ideally be able to process all text strings, not only those in the majority language.</Paragraph>
    <Paragraph position="3"> While there has been a significant amount of previous research on identifying and correcting errors in data sets, most methods are not particularly suitable for textual databases (see Section 2). We present two methods which are. Both methods are data-driven and knowledge-lean; errors are identified through comparisons with other database fields. We utilise supervised machine learning, but the training data is derived directly from the database, i.e., no manual annotation of data is necessary. In the first method, the database fields of individual entries are compared, and improbable combinations are flagged as potential errors. Because the focus is on individual entries, i.e., rows in the database, we call this horizontal error correction. The second method aims at a different type of error, namely values which were entered in the wrong column of the database. Potential errors of this type are determined by comparing the content of a database cell to (the cells of) all database columns and determining which column it fits best. Because the focus is on columns, we refer to this method as vertical error correction.</Paragraph>
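The vertical error-correction idea above can be illustrated with a minimal sketch: treat each column's cells as training text for that column, score a cell against every column's word statistics, and flag it when another column fits better than its own. This is a hypothetical illustration, not the paper's implementation; the table, the column names, and the additively smoothed unigram scoring are all assumptions standing in for the database and the supervised learner actually used.

```python
import math
from collections import Counter

def train_column_models(table):
    """table: dict mapping column name to a list of cell strings.
    Returns a per-column Counter of lowercased word frequencies."""
    return {col: Counter(w for cell in cells for w in cell.lower().split())
            for col, cells in table.items()}

def score(model, text, alpha=1.0):
    """Smoothed log-probability-style score of text under one column's model."""
    total = sum(model.values())
    vocab = len(model) + 1
    return sum(math.log((model[w] + alpha) / (total + alpha * vocab))
               for w in text.lower().split())

def best_column(models, text):
    """The column whose word statistics fit the text best."""
    return max(models, key=lambda col: score(models[col], text))

# Hypothetical specimen records with two textual columns.
table = {
    "BIOTOPE": ["dry savanna grassland", "tropical rain forest", "swamp forest edge"],
    "LOCATION": ["Kenya, Nairobi district", "Tanzania, Arusha", "Uganda, Kampala"],
}
models = train_column_models(table)

# A value that was (hypothetically) entered in the LOCATION column:
# it scores higher under BIOTOPE, so it would be flagged as a potential error.
print(best_column(models, "savanna grassland near river"))
```

In a real database one would compare each cell's best-fitting column to the column it actually occupies and flag mismatches for a human expert, in line with the semi-automatic correction workflow described above.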
  </Section>
</Paper>