<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1305"> <Title>Two-Phase Biomedical NE Recognition based on SVMs</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Knowledge discovery in the rapidly growing area of biomedicine is very important. Because most of this knowledge is available only as a vast amount of natural-language text, it is impossible to grasp all of it. Recently, computational text analysis techniques based on NLP have attracted attention in bioinformatics. Recognizing named entities such as proteins, DNAs, RNAs, and cells has become one of the most fundamental tasks in biomedical knowledge discovery. Conceptually, named entity recognition consists of two tasks: identification, which finds the boundaries of a named entity in a text, and classification, which determines the semantic class of that named entity. Many machine learning approaches have been applied to biomedical named entity recognition (Nobata, 1999) (Hatzivassiloglou, 2001) (Kazama, 2002). However, no work has achieved sufficient recognition accuracy. One reason is the lack of annotated corpora, a problem that has been somewhat alleviated by the release of the GENIA corpus v3.0 (GENIA, 2003). Another reason is that biomedical named entities are harder to recognize with general features than the named entities in newswire articles. In addition, since non-entity words far outnumber entity words in biomedical documents, the class distribution in a representation that combines a B/I/O tag with a semantic class C is so severely unbalanced that training, especially SVM training, costs too much time and resources (Hsu, 2001).</Paragraph> <Paragraph position="1"> Kazama and his colleagues therefore tackled these problems by tuning SVMs (Kazama, 2002).
They split the class with the unbalanced distribution into several subclasses to reduce the training cost.</Paragraph> <Paragraph position="2"> To address the data sparseness problem, they explored various features such as word cache features and HMM state features. According to their report, the word cache and HMM state features had a positive effect on performance.</Paragraph> <Paragraph position="3"> However, they did not separate the identification task from semantic classification; instead, they tried to classify the named entities in a single integrated process.</Paragraph> <Paragraph position="4"> The features useful for identifying a biomedical entity, however, differ from those useful for semantically classifying it. For example, while the orthographical characteristics and the part-of-speech tag sequence of an entity are strongly related to identification, they are only weakly related to semantic classification. On the other hand, context words seem to provide useful clues to the semantic classification of a given entity. Therefore, we separate the identification task from the semantic classification task and select different features for each task. This approach enables us to solve the unbalanced class distribution problem, which often occurs in a single, complicated approach. In addition, to improve performance, we post-process the results of the SVM classifiers using a dictionary.</Paragraph> <Paragraph position="5"> That is, we adopt a simple dictionary lookup method to correct the errors made by the SVMs in the identification phase.</Paragraph> <Paragraph position="6"> Through experiments, we show how separating the entity recognition task into two sub-tasks contributes to improving the performance of biomedical named entity recognition. We also show the effect of the hybrid approach that combines SVMs with dictionary lookup.</Paragraph> </Section> </Paper>