<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1305"> <Title>Two-Phase Biomedical NE Recognition based on SVMs</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Definition of Named Entity </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Classification Problem </SectionTitle> <Paragraph position="0"> We divide named entity recognition into two subtasks: the identification task, which finds the regions of named entities in a text, and the semantic classification task, which determines their semantic classes. Figure 1 illustrates the proposed method, which is called two-phase named entity recognition</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Recognition </SectionTitle> <Paragraph position="0"> The identification task is formulated as the classification of each word into one of two classes, T or O, which represent region information. The region information is encoded by using a simple T/O representation: T means that the current word is part of a named entity, and O means that it is not. With this representation, we need only one binary SVM classifier for the two classes T and O.</Paragraph> <Paragraph position="1"> The semantic classification task assigns one of the semantic classes to each identified entity. At the semantic classification phase, we need to classify only the identified entities into one of the N semantic classes, because the entities have already been identified; non-entity words are ignored at this phase, so only the N semantic classes need to be distinguished. Note that the total number of classes, N+1, is remarkably small compared with the 2N+1 classes required in the complicated recognition approaches, in which a class is represented by combining region information B/I/O with a semantic class C. 
This considerably reduces the workload of named entity recognition.</Paragraph> <Paragraph position="2"> Especially when using SVMs, the number of classes is critical to training, in terms of both training time and required resources. Let L be the number of training samples and N be the number of classes. The one-vs-rest method then takes N × O(L) in the training step. The complicated approach with the B/I/O notation requires (2N+1) × O(L_words), where L_words is the total number of words in the training corpus. In contrast, the proposed approach requires N × O(L_entities) + O(L_words).</Paragraph> <Paragraph position="3"> Here, L_words stands for the number of words in the training corpus and L_entities for the number of entities. This is a considerable reduction in training cost, and it ultimately affects the performance of the entity recognizer.</Paragraph> <Paragraph position="4"> To achieve high performance on the defined tasks, we use SVMs (Joachims, 2002), a machine learning approach that has shown the best performance in various NLP tasks, and we post-process the classification results of the SVMs by utilizing a dictionary. Figure 2 outlines the proposed two-phase named entity recognition system. At each phase, each SVM classifier outputs the class with the best score. For multi-class classification based on the binary SVM classifier, we use the one-vs-rest method and the linear kernel in both tasks. Furthermore, to correct errors made by the SVMs, an entity-word dictionary constructed from the training corpus is utilized in the identification phase. Because the boundary words of an entity might be excluded during entity identification, the dictionary is searched to check whether the boundary words of an identified entity were excluded. If a boundary word was excluded, we concatenate the left or right word adjacent to the identified entity. 
This post-processing may enhance the capability of the entity identifier.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Biomedical Named Entity Identification </SectionTitle> <Paragraph position="0"> Named entity identification is defined as the classification of each word into one of the classes that represent the region information. The region information is encoded by using a simple T/O representation: T means that the current word is part of a named entity, and O means that it is not.</Paragraph> <Paragraph position="1"> This representation yields two classes for the task, and we build just one binary SVM classifier for them. By accepting the results of the SVM classifier, we determine the boundaries of an entity. To correct boundary errors, we post-process the identified entities with the entity-word dictionary.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Features for Entity Identification </SectionTitle> <Paragraph position="0"> An input x to an SVM classifier is a feature representation of the target word to be classified and its context.</Paragraph> <Paragraph position="1"> We use a bit-vector representation. The features of the designated word are composed of orthographical features and the prefix, suffix, and lexical form of the word.</Paragraph> <Paragraph position="2"> Table 1 shows all 24 orthographical features. Each may be a discriminative feature for biomedical named entities such as proteins, DNA, and RNA. In fact, the name of a protein, DNA, or RNA is typically composed of alphanumeric strings combined with characters such as Greek letters or special symbols.</Paragraph> <Paragraph position="3"> In the definition, k is the word position relative to the target word. A negative value represents a preceding word and a positive value represents a following word. 
Among them, the part-of-speech tag sequence of the target word and its context words acts as a kind of syntactic rule for composing an entity, and the lexical information serves as a filter for identifying entities that are as semantically cohesive as possible.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Post-Processing by Dictionary Look-Up </SectionTitle> <Paragraph position="0"> After classifying the given instances, we post-process the identified entities. During the post-processing, we scan the identified entities and examine the words adjacent to them. If the part-of-speech of an adjacent word belongs to one of the groups adjective, noun, or cardinal, we look up the dictionary to check whether the word is in it. If it exists in the dictionary, we include the word in the entity region. The dictionary is constructed from the words that make up the named entities in the training corpus; stopwords are ignored.</Paragraph> <Paragraph position="1"> Figure 3 illustrates the post-processing algorithm.</Paragraph> <Paragraph position="2"> In Figure 3, the word cell, adjacent to the left of the identified entity cycle-dependent transcription, has the part-of-speech NN and exists in the dictionary.</Paragraph> <Paragraph position="3"> The word factor, adjacent to the right of the entity, has the part-of-speech NN and also exists in the dictionary. Therefore, we include the words cell and factor in the entity region and change the position tags of the words in the entity.</Paragraph> <Paragraph position="4"> With this post-processing, we can correct errors made by the SVM classifier. 
It also helps overcome the low-coverage problem of the small entity dictionary.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Semantic Classification of Biomedical </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Named Entity </SectionTitle> <Paragraph position="0"> The objects of the semantic tagging are the entities identified in the identification phase. Each entity is assigned to a proper semantic class by voting among the SVM classifiers.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Features for Semantic Classification </SectionTitle> <Paragraph position="0"> For semantically tagging an entity, an input x to an SVM classifier is represented by a feature vector.</Paragraph> <Paragraph position="1"> The vector is composed of the following indicator features: lw_i is set to 1 if the left context contains the ith word in the left context word list, and 0 otherwise</Paragraph> <Paragraph position="3"> rw_i is set to 1 if the right context contains the ith word in the right context word list, and 0 otherwise</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> Of the above features, fw_i checks whether the entity contains the ith functional word. The functional words are similar to the feature terms used by (Fukuda, 1998). For example, functional words such as factor, receptor, and protein are very helpful for classifying named entities as proteins, and functional words such as gene, promoter, and motif are very useful for classifying DNA.</Paragraph> <Paragraph position="1"> We divide the context features of a given entity into two kinds: inside context features and outside context features. As inside context features, we take at most three words from the back end of the entity. 
We make a list of the inside context words by collecting the words that appear in the range of the inside context.</Paragraph> <Paragraph position="2"> If one of the three words is the ith word in the inside context word list, we set the inw_i bit to 1. The outside context features are grouped into left ones and right ones. For the left and right context features, we restrict them to noun or verb words in the sentence, whose positions are not specified. This grouping has the effect of alleviating the data-sparseness problem of using a word as a feature.</Paragraph> <Paragraph position="3"> For example, consider a sentence containing the entity RNA polymerase II: General transcription factor are required for accurate initiation of transcription by RNA polymerase II PROTEIN.</Paragraph> <Paragraph position="4"> The nouns transcription, factor, initiation and the verbs are, required are selected as left context features, and the words RNA, polymerase, II are selected as inside context features. The bit field corresponding to each of the selected words is set to 1. In this case, there are no right context features. Since the entity contains the functional word RNA, the bit field of RNA is also set to 1.</Paragraph> <Paragraph position="5"> For classifying a given entity, we build as many SVM classifiers as there are semantic classes, using the linear kernel and the one-vs-rest classification method.</Paragraph> </Section> </Paper>
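The dictionary-based boundary correction of Section 3.2 can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the names (extend_entity, EXTEND_POS, entity_dict) and the use of Penn Treebank POS tags are assumptions, since the paper does not specify a tag set.

```python
# Hypothetical sketch of the Section 3.2 post-processing: extend an
# identified entity span by absorbing adjacent adjective/noun/cardinal
# words that appear in the entity-word dictionary built from training data.

# POS tags treated as adjective, noun, or cardinal (Penn Treebank assumed).
EXTEND_POS = {"JJ", "NN", "NNS", "NNP", "NNPS", "CD"}

def extend_entity(tokens, pos_tags, start, end, entity_dict):
    """Widen the half-open span [start, end) of an identified entity."""
    # Absorb qualifying words on the left boundary.
    while (start > 0 and pos_tags[start - 1] in EXTEND_POS
           and tokens[start - 1].lower() in entity_dict):
        start -= 1
    # Absorb qualifying words on the right boundary.
    while (end < len(tokens) and pos_tags[end] in EXTEND_POS
           and tokens[end].lower() in entity_dict):
        end += 1
    return start, end

# Worked example following Figure 3: the SVM identified only
# "cycle-dependent transcription"; "cell" and "factor" are recovered.
tokens = ["the", "cell", "cycle-dependent", "transcription", "factor"]
pos    = ["DT",  "NN",   "JJ",              "NN",            "NN"]
entity_dict = {"cell", "factor", "transcription", "cycle-dependent"}

print(extend_entity(tokens, pos, 2, 4, entity_dict))  # → (1, 5)
```

In the paper, the extension is followed by updating the position (T/O) tags of the absorbed words; that bookkeeping is omitted from this sketch.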