<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3325">
  <Title>The Difficulties of Taxonomic Name Extraction and a Solution</Title>
  <Section position="2" start_page="0" end_page="126" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Digitization of biosystematics publications is currently a major issue. These publications contain the names and descriptions of taxonomic genera and species.</Paragraph>
    <Paragraph position="1"> The names are important because they identify the various genera and species. They also position the species in the tree of life, which in turn is useful for a broad variety of biology tasks. Hence, recognition of taxonomic names is relevant. However, manual extraction of these names is time-consuming and expensive.</Paragraph>
    <Paragraph position="2"> The main problem for the automated recognition of these names is to distinguish them from the surrounding text, including other Named Entities (NE). Named Entity Recognition (NER) is currently an active research topic. However, conventional NER techniques are not readily applicable here, for two reasons. First, their NE categories are rather high-level, e.g., names of organizations or persons (cf. common NER benchmarks such as Carreras 2005). Such a classification is too coarse for our context. The structure of taxonomic names varies widely and can be complex. Second, those recognizers require large bodies of training data. Since digitization of biosystematics documents has started only recently, such data is not yet available in biosystematics. At the same time, it is important to demonstrate right away that text-learning technology is of help to biosystematics as well.</Paragraph>
    <Paragraph position="3"> This paper reports on our experiences with learning techniques for the automated extraction of taxonomic names from documents. Several techniques are obviously useful in this context: (1) language recognition: taxonomic names are a combination of Latin or Latinized words, whereas the surrounding text is written in English; (2) structure recognition: taxonomic names follow a certain structure; (3) lexicon support: certain words are never part of taxonomic names, while others may well be.</Paragraph>
    <Paragraph position="4"> On the other hand, an individual technique in isolation is not sufficient for taxonomic name extraction. Mikheev (1999) has shown that a combining approach, i.e., one that integrates the results of several different techniques, is superior to the individual techniques for common NER. Combining approaches are also promising for taxonomic name extraction. This paper makes the following contributions. First, we have conducted a thorough inspection of taxonomic names. An important observation is that one cannot model taxonomic names both concisely and precisely using regular expressions.</Paragraph>
    <Paragraph position="5"> As is done in bootstrapping, we use two kinds of regular expressions: precision rules, whose instances are taxonomic names with very high probability, and recall rules, whose instances are a superset of all taxonomic names. We propose a meaningful definition of precision rules and recall rules for taxonomic names.</Paragraph>
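The distinction between precision rules and recall rules can be illustrated with a minimal Python sketch. Both patterns below are hypothetical toy rules chosen for illustration; they are not the rules defined in this paper, and the toy recall rule does not cover every real name form (for instance, abbreviated genera).

```python
import re

# Hypothetical precision rule: capitalized genus, lowercase epithet, then an
# author name and a four-digit year. A full match is very likely a taxonomic name.
PRECISION_RULE = re.compile(r"[A-Z][a-z]+ [a-z]+ [A-Z][a-z]+,? [0-9]{4}")

# Hypothetical recall rule: a capitalized word, optionally followed by one
# lowercase word. Simple binomials match, but so do many ordinary phrases.
RECALL_RULE = re.compile(r"[A-Z][a-z]+(?: [a-z]+)?")


def classify(phrase):
    """Return 'name', 'candidate', or 'rejected' for a phrase."""
    if PRECISION_RULE.fullmatch(phrase):
        return "name"        # accepted by the precision rule
    if RECALL_RULE.fullmatch(phrase):
        return "candidate"   # possible name, needs further checks
    return "rejected"        # fails even the permissive recall rule


for phrase in ["Formica rufa Linnaeus, 1761", "Formica rufa",
               "The species", "in the forest"]:
    print(phrase, "->", classify(phrase))
```

In this sketch, "The species" ends up as a candidate even though it is not a name. This is the kind of false positive a recall rule tolerates and that later processing steps would have to resolve.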
    <Paragraph position="6"> Second, the essence of a combining approach is to arrange the individual specific approaches in the right order. We propose such a composition for taxonomic name extraction, and we explain why it is superior to other compositions that may also appear feasible at first sight.</Paragraph>
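Why the order of the individual steps matters can be shown with a small Python sketch of a cascade. Everything in it is an assumption made for illustration: the two steps, the tiny word list, and the word-ending heuristic are hypothetical and are not the composition proposed in this paper.

```python
# Hypothetical tiny lexicon of English words that never occur in names.
ENGLISH_WORDS = {"the", "in", "of", "and", "species"}


def lexicon_step(phrase):
    """Reject a phrase containing a known English word; otherwise undecided."""
    if any(w.lower() in ENGLISH_WORDS for w in phrase.split()):
        return False
    return None


def structure_step(phrase):
    """Accept 'Genus epithet' shapes with an assumed Latin-looking ending."""
    words = phrase.split()
    if (len(words) == 2 and words[0].istitle()
            and words[1].endswith(("a", "us", "um", "is"))):
        return True
    return None


def run_cascade(phrases, steps):
    """Apply steps in order; each returns True, False, or None (undecided)."""
    decided, undecided = {}, list(phrases)
    for step in steps:
        remaining = []
        for phrase in undecided:
            verdict = step(phrase)
            if verdict is None:
                remaining.append(phrase)   # pass on to the next step
            else:
                decided[phrase] = verdict  # first decisive step wins
        undecided = remaining
    return decided, undecided


phrases = ["Formica rufa", "The fauna", "Lasius niger"]
print(run_cascade(phrases, [lexicon_step, structure_step]))
print(run_cascade(phrases, [structure_step, lexicon_step]))
```

With the lexicon step first, "The fauna" is rejected. With the structure step first, the same phrase is wrongly accepted as a name, because the first decisive step determines the outcome. Phrases that no step decides, such as "Lasius niger" here, remain in the undecided list and could be referred to the user.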
    <Paragraph position="7"> Finally, to quantify the impact of the various alternatives described so far, we report on experimental results. The evaluation is based on a corpus of biosystematics documents marked up by hand. The best solution achieves about 99.2% in precision and recall. It prompts the user for only 0.2% of the words.</Paragraph>
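For readers unfamiliar with the two measures, the following Python sketch shows how precision and recall are commonly computed against a hand-marked gold standard. The function name and the example sets are hypothetical, and the sketch assumes set-based matching of extracted names, which may differ from the exact evaluation protocol used here.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted names against gold names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted.intersection(gold))
    # Precision: share of predicted names that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold names that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: three names extracted, four names in the gold markup.
predicted = {"Formica rufa", "Lasius niger", "The species"}
gold = {"Formica rufa", "Lasius niger", "Apis mellifera", "Bombus terrestris"}
print(precision_recall(predicted, gold))
```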
    <Paragraph position="8"> The remainder of the paper is organized as follows: Section 2 discusses related approaches. Section 3 introduces some preliminaries. Section 4 describes one specific combining approach in some detail.</Paragraph>
    <Paragraph position="9"> Section 5 features an evaluation. Section 6 concludes.</Paragraph>
  </Section>
</Paper>