File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-0301_intro.xml
Size: 4,959 bytes
Last Modified: 2025-10-06 14:01:28
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0301"> <Title>Tuning Support Vector Machines for Biomedical Named Entity Recognition</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The application of natural language processing (NLP) is now a key research topic in bioinformatics. Since it is practically impossible for a researcher to grasp the huge amount of knowledge provided in the form of natural language, e.g., journal papers, there is a strong demand for biomedical information extraction (IE), which automatically extracts knowledge from biomedical papers using NLP techniques (Ohta et al., 1997; Proux et al., 2000; Yakushiji et al., 2001).</Paragraph>
<Paragraph position="1"> The process called named entity recognition, which finds the entities that fill the information slots, e.g., proteins, DNAs, RNAs, cells, etc., in the biomedical context, is an important building block in such biomedical IE systems. Conceptually, named entity recognition consists of two tasks: identification, which finds the region of a named entity in a text, and classification, which determines the semantic class of that named entity. The following illustrates biomedical named entity recognition.</Paragraph>
<Paragraph position="2"> &quot;Thus, [CIITA]_PROTEIN not only activates the expression of [class II genes]_DNA but recruits another B cell-specific coactivator to increase transcriptional activity of [class II promoters]_DNA in [B cells]_CELLTYPE.&quot; Machine learning approaches have been applied to biomedical named entity recognition (Nobata et al., 1999; Collier et al., 2000; Yamada et al., 2000; Shimpuku, 2002). However, no work has achieved sufficient recognition accuracy. One reason is the lack of annotated corpora for training, as is often the case in a new domain. Nobata et al. (1999) and Collier et al. (2000) trained their models with only 100 annotated paper abstracts from the MEDLINE database (National Library of Medicine, 1999), and Yamada et al. (2000) used only 77 annotated abstracts. In addition, it is difficult to compare the techniques used in these studies because each used its own closed corpus.</Paragraph>
<Paragraph position="3"> To overcome this situation, the GENIA corpus (Ohta et al., 2002) has been developed; at this time it is the largest annotated biomedical corpus available to the public, containing 670 annotated abstracts from the MEDLINE database.</Paragraph>
<Paragraph position="4"> Another reason for the low accuracy is that biomedical named entities are inherently harder to recognize with standard feature sets than the named entities in newswire articles (Nobata et al., 2000).</Paragraph>
<Paragraph position="5"> Thus, we need to employ powerful machine learning techniques that can incorporate varied and complex features in a consistent way.</Paragraph>
<Paragraph position="6"> Support Vector Machines (SVMs) (Vapnik, 1995) and the Maximum Entropy (ME) method (Berger et al., 1996) are powerful learning methods that satisfy these requirements and have been applied successfully to other NLP tasks (Kudo and Matsumoto, 2000; Nakagawa et al., 2001; Ratnaparkhi, 1996). In this paper, we apply Support Vector Machines to biomedical named entity recognition and train them with the GENIA corpus.</Paragraph>
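As a concrete illustration of the identification and classification tasks described above, the following Python sketch converts the annotated example sentence into one label per word, where each label combines region information (beginning/inside of an entity, or outside) with the entity's semantic class. This is a hypothetical illustration, not the authors' code; the B/I/O label scheme, the simplified tokenization, and the encode_labels helper are assumptions made for the example.

def encode_labels(tokens, entities):
    """tokens: list of words; entities: list of (start, end, cls) spans,
    with end exclusive. Returns one class label per token."""
    labels = ["O"] * len(tokens)              # "O" = outside any entity
    for start, end, cls in entities:
        labels[start] = "B-" + cls            # first word of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + cls            # remaining words of the entity
    return labels

# Simplified tokenization of the example sentence quoted above.
tokens = ["CIITA", "activates", "the", "expression", "of",
          "class", "II", "genes", "in", "B", "cells", "."]
entities = [(0, 1, "PROTEIN"), (5, 8, "DNA"), (9, 11, "CELLTYPE")]

for token, label in zip(tokens, encode_labels(tokens, entities)):
    print(token, label)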
<Paragraph position="7"> We formulate named entity recognition as the classification of each word, together with its context, into one of the classes that jointly represent the region information and the semantic class of named entities. Although there is previous work that applied SVMs to the biomedical named entity task in this formulation (Yamada et al., 2000), their method of constructing a classifier with SVMs, one-vs-rest, fails to train a classifier on the entire GENIA corpus, since the cost of SVM training is super-linear in the number of training samples. Even with a more feasible method, pairwise (Kressel, 1998), which is employed in (Kudo and Matsumoto, 2000), we cannot train a classifier in a reasonable time, because this formulation yields a large number of samples belonging to the non-entity class. To solve this problem, we propose splitting the non-entity class into several sub-classes using part-of-speech information. We show that this technique not only makes training feasible but also improves the accuracy.</Paragraph>
<Paragraph position="8"> In addition, we explore new features such as word cache and the states of an unsupervised HMM for named entity recognition using SVMs. In the experiments, we show the effect of using these features and compare the overall performance of our SVM-based recognition system with a system using the Maximum Entropy method, which is an alternative to the SVM method.</Paragraph>
</Section></Paper>
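Below is a minimal sketch of the class-splitting idea described in the introduction above, assuming a scikit-learn SVC (which internally trains pairwise, i.e., one-vs-one, binary classifiers) rather than the authors' original implementation. Non-entity ("O") words are relabeled into POS-specific sub-classes such as O-NN or O-VBZ before training, which breaks the single large negative class into smaller ones, and every O-* prediction is mapped back to plain "O" afterwards. The toy data, the word/POS feature set, and the helper names are assumptions made for the illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

def split_non_entity(label, pos_tag):
    # Replace the single non-entity class with a POS-specific sub-class.
    return "O-" + pos_tag if label == "O" else label

def merge_non_entity(label):
    # Collapse the sub-classes back to "O" after prediction.
    return "O" if label.startswith("O-") else label

# Toy training data: (word, POS tag, gold label) triples -- hypothetical.
train = [("CIITA", "NN", "B-PROTEIN"), ("activates", "VBZ", "O"),
         ("the", "DT", "O"), ("expression", "NN", "O"),
         ("B", "NN", "B-CELLTYPE"), ("cells", "NNS", "I-CELLTYPE")]

features = [{"word": w, "pos": p} for w, p, _ in train]
labels = [split_non_entity(y, p) for _, p, y in train]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)
classifier = SVC(kernel="linear").fit(X, labels)

# Predict a label for an unseen word and undo the splitting.
test = [{"word": "genes", "pos": "NNS"}]
predicted = [merge_non_entity(y)
             for y in classifier.predict(vectorizer.transform(test))]
print(predicted)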