File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/relat/06/w06-3325_relat.xml
Size: 6,087 bytes
Last Modified: 2025-10-06 14:15:56
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3325"> <Title>The Difficulties of Taxonomic Name Extraction and a Solution</Title> <Section position="3" start_page="126" end_page="127" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> This section reviews solutions to problems related to the extraction of taxonomic names.</Paragraph> <Section position="1" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 2.1 Named Entity Recognition </SectionTitle> <Paragraph position="0"> Taxonomic names are a special case of named entity. Named entity recognition (NER) has received much attention in the recent past, which has yielded a variety of methods. The most common ones are list lookups, grammars, rules, and statistical methods like SVMs (Bikel 1997). All these techniques have been developed for tasks like the one presented by Carreras (2005). Thus, they focus on recognizing common named entities (NEs) such as locations and persons. Consequently, they are not suited to the complex and variable structure of taxonomic names (see Section 3.3). Another problem with common NER techniques is that they usually require several hundred thousand words of pre-annotated training data.</Paragraph> </Section> <Section position="2" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 2.2 List-based Techniques </SectionTitle> <Paragraph position="0"> List-based NER techniques (Palmer 1997) make use of lists to determine whether a word is an NE of the category sought. Using a thesaurus as the sole positive list is not an option for taxonomic names, because all existing thesauri are incomplete. Nevertheless, such a list makes it possible to recognize known parts of taxonomic names.</Paragraph> <Paragraph position="1"> The inverse approach would be list-based exclusion, using a common English dictionary. Koning (2005) combines such an approach with structural rules. In isolation, however, it is not an option either. First, it would not exclude proper names reliably. 
Second, it would exclude parts of taxonomic names that are also used in common English.</Paragraph> <Paragraph position="2"> However, exclusion of sure negatives, i.e., words that are never part of taxonomic names, simplifies the classification.</Paragraph> </Section> <Section position="3" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 2.3 Rule-based Techniques </SectionTitle> <Paragraph position="0"> Rule-based techniques do not require pre-annotated training data. They extract words or word sequences based on their structure. Yoshida (1999) applies regular expressions to extract the names of proteins. This approach exploits the very distinctive syntax of protein names such as NG-monomethyl-L-arginine.</Paragraph> <Paragraph position="1"> There are also rules for the syntax of taxonomic names, but they are less restrictive. For instance, Prenolepis (Nylanderia) vividula Erin subsp. guatemalensis Forel var. itinerans Forel is a taxonomic name, and so is Dolichoderus decollatus.</Paragraph> <Paragraph position="2"> Because of the wide range of optional parts, it is impossible to find a regular expression that matches all taxonomic names and at the same time provides satisfactory precision. Koning (2005) presents an approach based on regular expressions and static dictionaries. This technique performs satisfactorily compared to common NER approaches, but its notion of what counts as a positive is restricted. For instance, it leaves aside taxonomic names that do not specify a genus. However, the idea of rule-based filters for the phrases of documents is helpful.</Paragraph> </Section> <Section position="4" start_page="126" end_page="127" type="sub_section"> <SectionTitle> 2.4 Bootstrapping </SectionTitle> <Paragraph position="0"> Instead of a large amount of labeled training data, Bootstrapping uses a few labeled examples (&quot;seeds&quot;) and an even larger amount of unlabeled data for training. 
Jones (1999) has shown that this approach performs as well as techniques requiring labeled training data. However, Bootstrapping is not readily applicable to our particular problem.</Paragraph> <Paragraph position="1"> Niu (2003) used an unlabeled corpus of 88,000,000 words for training a named entity recognizer. For our purpose, even unlabeled training data is not available at this order of magnitude, at least for now.</Paragraph> </Section> <Section position="5" start_page="127" end_page="127" type="sub_section"> <SectionTitle> 2.5 Active Learning </SectionTitle> <Paragraph position="0"> According to Day (1997), the original idea of Active Learning was to speed up the creation of large labeled training corpora from unlabeled documents. The system uses all of its knowledge during all phases of learning. Thus, it labels most of the data items automatically and requires user interaction only in rare cases. In order to increase data quality, we include user interaction in our taxonomic name extractor as well.</Paragraph> </Section> <Section position="6" start_page="127" end_page="127" type="sub_section"> <SectionTitle> 2.6 Gene and Protein Name Extraction </SectionTitle> <Paragraph position="0"> In the recent past, the major focus of biomedical NER has been the recognition of gene and protein names. Tanabe (2002) gives a good overview of various approaches to this task. Frequently used techniques are structural rules, dictionary lookups and Hidden Markov Models. Most of the approaches use the output of a part-of-speech tagger as additional evidence. Both gene and protein names differ from taxonomic names in that the nomenclature rules for them are far stricter.</Paragraph> <Paragraph position="1"> For instance, they never include the name of the discoverer or author of a given part. In addition, they contain parts that are easily distinguished from the surrounding text by their structure, which is not true of taxonomic names. 
Consequently, the techniques for gene or protein name recognition are not feasible for the extraction of taxonomic names.</Paragraph> </Section> </Section> </Paper>