<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1315"> <Title>An Investigation of Various Information Sources for Classifying Biological Names</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper, we investigate the extent to which different sources of information contribute to the task of classifying the type of biological entity a phrase might refer to. The classification task is an integral part of named entity extraction. For this reason, name classification has been studied as part of the named entity extraction task in the NLP and information extraction communities (see, for example, (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999) and the various approaches reported in the MUC conferences (MUC-6, 1995)). However, many of these approaches do not treat the detection of names (i.e., identifying a sequence of characters and words in text as a name) and their classification as separate phases. We believe there is much to gain from examining the two as separate tasks, since the classification task, the focus of this work, is sufficiently distinct from the name identification task. More importantly, from the perspective of the current work, we hope to show that the sources of information that help in solving the two tasks are quite distinct.</Paragraph> <Paragraph position="1"> As in the name classification approaches of (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999), we investigate both name-internal and name-external clues. 
However, we believe that the situation in the specialized domain of biomedicine is sufficiently distinct that the clues for this domain need further investigation, and that the classification task has not received the attention it deserves.</Paragraph> <Paragraph position="2"> A large number of name extraction methods proposed in this specialized domain have focused on extracting protein names only (Fukuda et al., 1998; Franzen et al., 2002; Tanabe et al., 2002). Since only one class is recognized, the only task these methods directly address is that of identifying a string of characters and/or words that constitutes a protein name.</Paragraph> <Paragraph position="3"> These methods do not, at least in an explicit manner, have to consider the classification task.</Paragraph> <Paragraph position="4"> There are some important reasons to consider the detection of names of other types of entities of biological relevance. Firstly, information extraction need not be limited to protein-protein interactions, and extracting names of other types of entities will be required for other types of interactions. Secondly, classification of names can help improve the precision of the methods. For example, KEX (Fukuda et al., 1998) is a protein name recognizer and hence labels each name it detects as a protein. However, names of different types of entities share similar surface characteristics (including the use of digits, special characters, and capitalization). For this reason, KEX and other protein name recognizers can pick up names of entities other than proteins (and label them as proteins). (Narayanaswamy et al., 2003) reports that recognizing that some of these names are not those of proteins allows their method to improve the precision of protein name detection. Thirdly, detecting names of different classes will help in coreference resolution, the importance of which is well recognized in the IE domain. 
In such specialized domains, the sortal/class information will play an important role in this task. In fact, the coreference resolution method described in (Castaño et al., 2002) seeks to use such information by using the UMLS system and by applying type coercion. Finally, many information extraction methods are based on identifying or inducing patterns by which information (of the kind being extracted) is expressed in natural language text. If we can tag the text with occurrences of various types of names (or phrases that refer to biological entities), then better generalizations of patterns can be induced.</Paragraph> <Paragraph position="5"> There are at least two efforts (Narayanaswamy et al., 2003; Kazama et al., 2002) that consider the recognition of names of different classes of biomedical relevance. Work reported in (Pustejovsky et al., 2002; Castaño et al., 2002) also seeks to classify or find the sortal information of phrases that refer to biological entities. However, classification was not the primary focus of these papers, and hence the classification methods and their accuracy are not described in much detail. Other related works include those of (Hatzivassiloglou et al., 2001; Liu et al., 2001), which use external or contextual clues to disambiguate ambiguous expressions. While these works may be viewed as similar to word sense disambiguation (WSD), the one reported in (Hatzivassiloglou et al., 2001) in particular is close to classification as well. 
In that work, names are disambiguated between gene, protein and RNA senses using the context of individual occurrences.</Paragraph> <Paragraph position="6"> The Unified Medical Language System (UMLS) was developed at the National Library of Medicine, part of the National Institutes of Health, in Bethesda, USA.</Paragraph> <Paragraph position="7"> While our interest is in the classification of phrases that refer to entities of biomedical significance, in this work we limit ourselves to name classification.</Paragraph> <Paragraph position="8"> In our investigations, we wish to use an annotated corpus for both inducing and evaluating features.</Paragraph> <Paragraph position="9"> We are unaware of any large corpus where phrases are annotated with their classes. However, large corpora for named entity extraction in this domain are being developed, and fortunately, corpora such as GENIA, being developed at the University of Tokyo, are freely available. We make use of this corpus and hence investigate the classification of names only.</Paragraph> <Paragraph position="10"> However, we believe that the conclusions we draw in this regard will apply equally to the classification of phrases other than names.</Paragraph> </Section> </Paper>