<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0301"> <Title>Tuning Support Vector Machines for Biomedical Named Entity Recognition</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The GENIA Corpus </SectionTitle> <Paragraph position="0"> The GENIA corpus is an annotated corpus of paper abstracts taken from the MEDLINE database.</Paragraph> <Paragraph position="1"> Currently, 670 abstracts are annotated with named entity tags by biomedical experts and made available to public (Ver. 1.1).1 These 670 abstracts are a subset of more than 5,000 abstracts obtained by the query &quot;human AND blood cell AND transcription factor&quot; to the MEDLINE database. Table 1 shows basic statistics of the GENIA corpus. Since the GENIA corpus is intended to be extensive, there exist 24 distinct named entity classes in the corpus.2 Our task is to find a named entity region in a paper abstract and correctly select its class out of these 24 classes. This number of classes is relatively large compared with other corpora used in previous studies, and compared with the named entity task for newswire articles. This indicates that the task with the GENIA corpus is hard, apart from the di culty of the biomedical domain itself.</Paragraph> <Paragraph position="2"> tive/disjunctive named entity expressions such as &quot;human B- or T-cell lines&quot; (Kim et al., 2001). In this paper we ignore such expressions and consider that constituents in such expressions are annotated as a dummy class &quot;temp&quot;.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Named Entity Recognition as Classification </SectionTitle> <Paragraph position="0"> We formulate the named entity task as the classification of each word with context to one of the classes that represent region information and named entity's semantic class. Several representations to encode region information are proposed and examined (Ramshaw and Marcus, 1995; Uchimoto et al., 2000; Kudo and Matsumoto, 2001). In this paper, we employ the simplest BIO representation, which is also used in (Yamada et al., 2000). We modify this representation in Section 5.1 in order to accelerate the SVM training.</Paragraph> <Paragraph position="1"> In the BIO representation, the region information is represented as the class prefixes &quot;B-&quot; and &quot;I-&quot;, and a class &quot;O&quot;. B- means that the current word is at the beginning of a named entity, I- means that the current word is in a named entity (but not at the beginning), and O means the word is not in a named entity. For each named entity class C, class B-C and I-C are produced. Therefore, if we have N named entity classes, the BIO representation yields 2N + 1 classes, which will be the targets of a classifier. 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Support Vector Machines </SectionTitle> <Paragraph position="0"> Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) are powerful methods for learning a classifier, and they have been applied successfully to many NLP tasks such as base phrase chunking (Kudo and Matsumoto, 2000) and part-of-speech tagging (Nakagawa et al., 2001).</Paragraph> <Paragraph position="1"> An SVM constructs a binary classifier that outputs +1 or -1 given a sample vector x ∈ R^n. The decision is based on the separating hyperplane as follows:</Paragraph> <Paragraph position="2"> f(x) = w · x + b,   c(x) = sign(f(x))   (1)</Paragraph> <Paragraph position="3"> The class for an input x, c(x), is determined by which side of the space separated by the hyperplane w · x + b = 0 the input lies on.</Paragraph> <Paragraph position="4"> Given a set of labeled training samples {(y_1, x_1), ..., (y_L, x_L)}, x_i ∈ R^n, y_i ∈ {+1, -1}, the SVM training tries to find the optimal hyperplane, i.e., the hyperplane with the maximum margin. The margin is defined as the distance between the hyperplane and the training samples nearest to the hyperplane. Maximizing the margin requires that these nearest samples (support vectors) exist on both sides of the separating hyperplane and that the hyperplane lies exactly at the midpoint between them. This margin maximization is tightly related to the good generalization power of SVMs.</Paragraph> <Paragraph position="5"> Assuming that |w · x_i + b| = 1 at the support vectors without loss of generality, the SVM training can be formulated as the following optimization problem:3 minimize (1/2)||w||^2 subject to y_i(w · x_i + b) ≥ 1, i = 1, ..., L. The solution of this problem is known to be written as follows, using only the support vectors and their weights:</Paragraph> <Paragraph position="6"> f(x) = Σ_{i ∈ SVs} α_i y_i (x_i · x) + b</Paragraph> <Paragraph position="7"> In the SVM learning, we can use a function k(x_i, x_j), called a kernel function, instead of the inner product in the above equation. Introducing a kernel function corresponds to mapping an original input x by some Φ(x), such that Φ(x_i) · Φ(x_j) = k(x_i, x_j), into another, usually higher dimensional, feature space, and constructing the optimal hyperplane in that space. By using kernel functions, we can construct a non-linear separating surface in the original feature space. Fortunately, such non-linear training does not increase the computational cost if the calculation of the kernel function is as cheap as the inner product. A polynomial function defined as (s x_i · x_j + r)^d is popular in applications of SVMs to NLP (Kudo and Matsumoto, 2000; Yamada et al., 2000; Kudo and Matsumoto, 2001), because it has an intuitively sound interpretation: each dimension of the mapped space is a (weighted) conjunction of d features in the original sample.</Paragraph> <Paragraph position="8"> 3 When the training samples are not linearly separable, we allow the constraints to be violated with some penalty. In the experiments, we use the so-called 1-norm soft margin formulation: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i(w · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, ..., L.</Paragraph> </Section>
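As a minimal sketch of the binary classifier just described (assuming the support vectors, their weights α_i y_i, and the bias b have already been obtained by training; the function names are illustrative), the kernelized decision function can be written as follows.

```python
import numpy as np

def poly_kernel(xi, xj, s=1.0, r=1.0, d=2):
    """Polynomial kernel (s * xi.xj + r)^d."""
    return (s * np.dot(xi, xj) + r) ** d

def svm_decision(x, support_vectors, alpha_y, b, kernel=poly_kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b over the support vectors."""
    return sum(ay * kernel(xi, x) for xi, ay in zip(support_vectors, alpha_y)) + b

def classify(x, support_vectors, alpha_y, b):
    """c(x) = sign(f(x)): +1 or -1 depending on the side of the hyperplane."""
    return 1 if svm_decision(x, support_vectors, alpha_y, b) >= 0 else -1
```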
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Multi-Class SVMs </SectionTitle> <Paragraph position="0"> As described above, the standard SVM learning constructs a binary classifier. To make a named entity recognition system based on the BIO representation, we require a multi-class classifier. Among several methods for constructing a multi-class SVM (Hsu and Lin, 2002), we use the pairwise method proposed by Kressel (1998) instead of the one-vs-rest method used in (Yamada et al., 2000), and we extend the BIO representation to enable training with the entire GENIA corpus. Here we describe the one-vs-rest method and the pairwise method to show the necessity of our extension.</Paragraph> <Paragraph position="1"> Both the one-vs-rest and pairwise methods construct a multi-class classifier by combining many binary SVMs. In the following explanation, K denotes the number of target classes.</Paragraph> <Paragraph position="2"> one-vs-rest Construct K binary SVMs, each of which determines whether the sample should be classified as class i or as one of the other classes.</Paragraph> <Paragraph position="3"> The output is the class with the maximum f(x) in Equation 1.</Paragraph> <Paragraph position="4"> pairwise Construct K(K-1)/2 binary SVMs, each of which determines whether the sample should be classified as class i or as class j. Each binary SVM has one vote, and the output is the class with the maximum number of votes.</Paragraph> <Paragraph position="5"> Because the SVM training is a quadratic optimization problem, its cost is super-linear in the size of the training set even with tailored techniques such as SMO (Platt, 1998) and kernel evaluation caching (Joachims, 1998). Let L be the number of training samples; then the one-vs-rest method takes time K × O_SVM(L), where O_SVM(L) denotes the cost of training one SVM on L samples. The BIO formulation produces one training sample per word, and training with the GENIA corpus involves over 100,000 training samples, as can be seen from Table 1. Therefore, it is apparent that the one-vs-rest method is impractical with the GENIA corpus. On the other hand, if the target classes are equally distributed, the pairwise method will take time K(K-1)/2 × O_SVM(2L/K). This method is worthwhile because each training run is much faster, though it requires training (K-1)/2 times more classifiers. It is also reported that the pairwise method achieves higher accuracy than other methods on some benchmarks (Kressel, 1998; Hsu and Lin, 2002).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Input Features </SectionTitle> <Paragraph position="0"> An input x to an SVM classifier is a feature representation of the word to be classified and its context.</Paragraph> <Paragraph position="1"> We use a bit-vector representation, each dimension of which indicates whether the input matches a certain feature. The features commonly used for the named entity recognition task are the word feature w_{k,i}(x) (1 iff the word at relative position k is the i-th word of the vocabulary) and, defined analogously, the part-of-speech feature pos_{k,i}(x), the prefix feature pre_{k,i}(x), the suffix feature suf_{k,i}(x), the substring feature sub_{k,i}(x), and the preceding class feature pc_{k,i}(x) (1 iff the class already assigned at position k is the i-th class).</Paragraph> <Paragraph position="2"> In the above definitions, k is a relative word position from the word to be classified. A negative value represents a preceding word's position, and a positive value represents a following word's position. Note that we assume that the classification proceeds from left to right, as can be seen in the definition of the preceding class feature. For the SVM classification, we do not use a dynamic argmax-type decision such as the Viterbi algorithm, since it is difficult to define a good comparable confidence value for a prediction, such as a probability. The consequences of this limitation will be discussed with the experimental results.</Paragraph>
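A minimal sketch of how such a bit-vector input could be assembled from the word, POS, prefix, suffix, and preceding-class features (the window sizes, prefix/suffix lengths, and string-keyed encoding are illustrative assumptions, not the exact configuration used in the experiments):

```python
def extract_features(words, pos_tags, prev_classes, t):
    """Collect the active binary features for the word at position t.

    Each active feature is returned as a string; a separate index would
    map these strings to dimensions of a sparse bit vector.
    prev_classes holds the classes already assigned to preceding words.
    """
    feats = []
    for k in (-2, -1, 0, 1, 2):          # word and POS window
        if 0 <= t + k < len(words):
            w = words[t + k]
            feats.append("w[%d]=%s" % (k, w.lower()))
            feats.append("pos[%d]=%s" % (k, pos_tags[t + k]))
            feats.append("pre[%d]=%s" % (k, w[:3].lower()))   # prefix feature
            feats.append("suf[%d]=%s" % (k, w[-3:].lower()))  # suffix feature
    for k in (-2, -1):                    # preceding (already assigned) classes
        if t + k >= 0:
            feats.append("pc[%d]=%s" % (k, prev_classes[t + k]))
    return feats
```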
<Paragraph position="3"> Features usually form a group in which some variables, such as the position, are left unspecified. In this paper, we instantiate all features, i.e., we instantiate them for all i, for each group and position. It is then convenient to denote the set of features for a group g and a position k as g_k (e.g., w_k and pos_k). Using this notation, we write a feature set as {w_{-1}, w_0, pre_{-1}, pre_0, pc_{-1}}.4 This feature description determines the input vectors given to the classifier.</Paragraph> <Paragraph position="4"> The maximum entropy method, with which we compare our SVM-based method, defines the probability that the class is c given an input vector x as P(c|x) = (1/Z(x)) exp(Σ_i λ_i f_i(c, x)),</Paragraph> <Paragraph position="5"> where Z(x) is a normalization constant and f_i(c, x) is a feature function. A feature function is defined in the same way as the features in the SVM learning, except that it also includes c, e.g., f(c, x) = (c is the j-th class) ∧ w_{k,i}(x). If x contains previously assigned classes, then the most probable class sequence, ĉ_1^T = argmax_{c_1,...,c_T} ∏_{t=1}^T P(c_t|x_t), is searched for by using a Viterbi-type algorithm. For the experiments, we use the maximum entropy tagging method described in (Kazama et al., 2001), which is a variant of (Ratnaparkhi, 1996) modified to use HMM state features.</Paragraph> <Paragraph position="6"> 5 Tuning of SVMs for the Biomedical NE Task</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Class Splitting Technique </SectionTitle> <Paragraph position="0"> In Section 3.3, we described how, if the target classes are equally distributed, the pairwise method reduces the training cost. In our case, however, we have a very unbalanced class distribution, with a large number of samples belonging to the class &quot;O&quot; (see Table 1). This leads to the same situation as with the one-vs-rest method: if L_O is the number of samples belonging to the class &quot;O&quot;, then the most dominant part of the training takes time K × O_SVM(L_O).</Paragraph> <Paragraph position="1"> One solution to this unbalanced class distribution problem is to split the class &quot;O&quot; into several sub-classes effectively. This reduces the training cost for the same reason that the pairwise method works.</Paragraph> <Paragraph position="2"> In this paper, we propose to split the non-entity class according to the part-of-speech (POS) information of the word. That is, given a part-of-speech tag set POS, we produce |POS| new classes &quot;O-p&quot;, p ∈ POS. Since the POS tagger used in this paper outputs the 45 Penn Treebank POS tags, we obtain 45 new sub-classes corresponding to non-entity regions, such as &quot;O-NNS&quot; (plural nouns), &quot;O-JJ&quot; (adjectives), and &quot;O-DT&quot; (determiners). Splitting by POS information also seems useful for improving the system accuracy, because in named entity recognition we must discriminate between nouns in named entities and nouns in ordinary noun phrases. In the experiments, we show that this class splitting technique not only makes the training feasible but also improves the accuracy.</Paragraph> </Section>
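The following minimal sketch (illustrative function names; the tag inventories are only examples) shows the label set produced by the class splitting technique and how a plain "O" label is replaced by its POS-specific sub-class.

```python
def bio_labels(entity_classes, pos_tags):
    """Targets with the class splitting technique: B-C and I-C for each
    entity class C, plus one O-p sub-class per POS tag p instead of a
    single O class."""
    labels = []
    for c in entity_classes:
        labels += ["B-" + c, "I-" + c]
    labels += ["O-" + p for p in pos_tags]
    return labels

def split_label(label, pos_tag):
    """Replace a plain 'O' label by its POS-specific sub-class."""
    return "O-" + pos_tag if label == "O" else label

# e.g. 24 entity classes and 45 Penn Treebank POS tags
# give 24 * 2 + 45 = 93 target classes.
```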
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Word Cache and HMM Features </SectionTitle> <Paragraph position="0"> In addition to the standard features, we explore the word cache feature and the HMM state feature, mainly to address the data sparseness problem.</Paragraph> <Paragraph position="1"> Although the GENIA corpus is the largest annotated corpus in the biomedical domain, it is still small compared with other linguistically annotated corpora such as the Penn Treebank. Thus, the data sparseness problem is severe and must be treated carefully. Usually, data sparseness is countered by using more general features that apply to a broader set of instances (e.g., disjunctions). While polynomial kernels in the SVM learning can effectively generate feature conjunctions, kernel functions that can effectively generate feature disjunctions are not known. Thus, we should explicitly add dimensions for such general features.</Paragraph> <Paragraph position="2"> The word cache feature is defined as the disjunction of several word features: wc_{K,i} ≡ ∨_{k ∈ K} w_{k,i}, where K = {k_1, ..., k_n} is a set of positions. We intend the word cache feature to capture the similarity of patterns that share a common key word, as in the following.</Paragraph> <Paragraph position="3"> (a) &quot;human W_{-2} W_{-1} W_0&quot; and &quot;human W_{-1} W_0&quot; (b) &quot;W_0 gene&quot; and &quot;W_0 W_1 gene&quot; We use a left word cache, defined as lwc_{k,i} ≡ wc_{{-k,...,0},i}, and a right word cache, defined as rwc_{k,i} ≡ wc_{{1,...,k},i}, for patterns like (a) and (b) above, respectively.</Paragraph> <Paragraph position="4"> Kazama et al. (2001) proposed using the Viterbi state sequence of a hidden Markov model (HMM) as features to alleviate the data sparseness problem in the maximum entropy tagging model. An HMM is trained on a large amount of unannotated text by an unsupervised learning method. Because the number of states of the HMM is usually made smaller than |V|, the Viterbi states give smoothed but maximally informative representations of word patterns tuned to the domain from which the raw texts are taken.</Paragraph> <Paragraph position="5"> The HMM state feature is defined in the same way as the word feature: hmm_{k,i}(x) is 1 iff the Viterbi state assigned to the word at position k is the i-th HMM state.</Paragraph> <Paragraph position="6"> In the experiments, we train an HMM using the raw MEDLINE abstracts in the GENIA corpus, and show that the HMM state feature can improve the accuracy.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Implementation Issues </SectionTitle> <Paragraph position="0"> Towards practical named entity recognition using SVMs, we have addressed the following implementation issues. It would be impossible to carry out the experiments in a reasonable time without such efforts.</Paragraph> <Paragraph position="1"> Parallel Training: The training of pairwise SVMs is trivially parallel, i.e., each SVM can be trained separately. Since computers with two or more CPUs are not expensive these days, parallelization is a very practical way to accelerate the training of pairwise SVMs.</Paragraph> <Paragraph position="2"> Fast Winner Finding: Although the pairwise method reduces the cost of training, it greatly increases the number of classifications needed to determine the class of one sample. For example, in our experiments using the GENIA corpus, the BIO representation with class splitting yields more than 4,000 classification pairs. Fortunately, we can stop the classifications once a class gets K - 1 votes, and this stopping greatly saves classification time (Kressel, 1998). Moreover, we can stop the classifications when the current number of votes of a class is greater than the maximum possible number of votes of any other class.</Paragraph> <Paragraph position="3"> Support Vector Caching: In the pairwise method, though we have a large number of classifiers, each classifier shares some support vectors with other classifiers. By storing the bodies of all support vectors together and letting each classifier keep only its weights, we can greatly reduce the size of the classifier. The sharing of support vectors can also be exploited to accelerate the classification by caching the value of the kernel function between a support vector and the sample being classified.</Paragraph> </Section> </Section>
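As a minimal sketch of pairwise voting with the early stopping described under Fast Winner Finding (the binary-classifier interface decide(i, j, x) is an illustrative assumption, not the actual implementation):

```python
def pairwise_classify(x, classes, decide):
    """Pairwise voting; decide(i, j, x) returns the winner of the
    binary SVM trained on classes i and j.

    Voting stops early once a class has K-1 votes, or when no other
    class can overtake the current leader with its remaining matches.
    """
    K = len(classes)
    votes = {c: 0 for c in classes}
    remaining = {c: K - 1 for c in classes}   # matches each class still has
    for a in range(K):
        for b in range(a + 1, K):
            winner = decide(classes[a], classes[b], x)
            loser = classes[b] if winner == classes[a] else classes[a]
            votes[winner] += 1
            remaining[winner] -= 1
            remaining[loser] -= 1
            if votes[winner] == K - 1:
                return winner
            best = max(votes, key=votes.get)
            if all(votes[c] + remaining[c] < votes[best]
                   for c in classes if c != best):
                return best
    return max(votes, key=votes.get)
```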
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> To conduct the experiments, we divided the 670 abstracts of the GENIA corpus (Ver. 1.1) into a training part (590 abstracts; 4,487 sentences; 133,915 words) and a test part (80 abstracts; 622 sentences; 18,211 words).6 Texts are tokenized using the Penn Treebank's tokenizer. An HMM for the HMM state features was trained with the raw abstracts of the GENIA corpus (39,116 sentences).7 The number of states is 160. The vocabulary for the word feature is constructed by taking the most frequent 10,000 words from the above raw abstracts, and the prefix/suffix/substring lists by taking the most frequent 10,000 prefixes/suffixes/substrings.8 The performance is measured by precision, recall, and F-score, which are the standard measures for named entity recognition. [Table 2: training time and accuracy with and without the class splitting technique; the number of training samples includes SOS and EOS (special words for the start/end of a sentence).]</Paragraph> <Paragraph position="1"> Systems based on the BIO representation may produce an inconsistent class sequence such as &quot;O B-DNA I-RNA O&quot;. We interpret such outputs as follows: once a named entity starts with &quot;B-C&quot;, the named entity with class &quot;C&quot; ends only when we see another &quot;B-&quot; or &quot;O-&quot; tag.</Paragraph> <Paragraph position="2"> We implemented the SMO algorithm (Platt, 1998) and the techniques described in (Joachims, 1998) for soft margin SVMs in the C++ programming language, and implemented support code for pairwise classification and parallel training in the Java programming language. To obtain the POS information required for the features and the class splitting, we used the English POS tagger described in (Kazama et al., 2001).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Class Splitting Technique </SectionTitle> <Paragraph position="0"> First, we show the effect of the class splitting described in Section 5.1. Varying the size of the training data, we compared the change in training time and accuracy with and without the class splitting. We used the feature set {&lt;w, pre, suf, sub, pos&gt;_[-2,...,2], pc_[-2,-1]} and the inner product kernel.9 The training time was measured on a machine with four 700MHz PentiumIIIs and 16GB of RAM. Table 2 shows the results of the experiments, and Figure 1 shows them graphically. We can see that without splitting we soon suffer from the super-linearity of the SVM training, while with splitting we can handle training with over 100,000 samples in a reasonable time. It is very important that the splitting technique does not sacrifice accuracy for speed; rather, it improves the accuracy.</Paragraph> </Section>
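For completeness, a minimal sketch of the interpretation rule for inconsistent BIO sequences stated at the beginning of this section (the function name and the exclusive-end span convention are illustrative):

```python
def decode_entities(labels):
    """Extract (start, end, class) spans from a BIO label sequence;
    end is exclusive.

    Following the interpretation rule above, an entity opened by "B-C"
    is closed only by the next "B-" or "O" tag, so a stray "I-D" inside
    it is absorbed into the open entity of class C.
    """
    entities, start, cls = [], None, None
    for i, label in enumerate(labels + ["O"]):     # sentinel to flush
        if label.startswith("B-") or label == "O" or label.startswith("O-"):
            if cls is not None:
                entities.append((start, i, cls))
                start, cls = None, None
            if label.startswith("B-"):
                start, cls = i, label[2:]
        # an "I-" tag never closes the currently open entity
    return entities

# decode_entities(["O", "B-DNA", "I-RNA", "O"]) -> [(1, 3, "DNA")]
```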
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Word Cache and HMM State Features </SectionTitle> <Paragraph position="0"> In this experiment, we examine the effect of the word cache feature and the HMM state feature described in Section 5.2. The effect is assessed by the accuracy gain observed when adding each feature set to a base feature set, and by the accuracy degradation observed when subtracting it from a (complete) base set. The first column (A) in Table 3 shows an adding case, where the base feature set is {w_[-2,...,2]}. Columns (B) and (C) show subtracting cases, where the base feature set is {&lt;w, pre, suf, sub, pos, hmm&gt;_[-k,...,k], lwc_k, rwc_k, pc_[-2,-1]} with k = 2 and k = 3, respectively. The kernel function is the inner product. We can see that the word cache and HMM state features clearly improve the recognition accuracy. In the table, we also include the accuracy change for the other standard features. Preceding classes and suffixes are definitely helpful. On the other hand, the substring feature is not effective in our setting. Although the effects of part-of-speech tags and prefixes are less clear-cut, it can be said that they are practically effective, since they show positive effects in the case of the maximum performance.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Comparison with the ME Method </SectionTitle> <Paragraph position="0"> In this set of experiments, we compare our SVM-based system with a named entity recognition system based on the maximum entropy method. For the SVM system, we used the feature set {&lt;w, pre, suf, pos, hmm&gt;_[-3,...,3], lwc_3, rwc_3, pc_[-2,-1]}, which was shown to be the best in the previous experiment. The compared system is the maximum entropy tagging model described in (Kazama et al., 2001). Though it supports several character-type features, such as number and hyphen, and some conjunctive features, such as word n-grams, we do not use these features here, in order to compare the performance under conditions as close as possible. The feature set used in the maximum entropy system is expressed as {&lt;w, pre, suf, pos, hmm&gt;_[-2,...,2], pc_[-2,-1]}.10 Both systems use the BIO representation with splitting. 10 When the width becomes [-3,...,3], the accuracy degrades (53.72 to 51.73 in F-score).</Paragraph> <Paragraph position="1"> Table 4 shows the accuracies of both systems. For the SVM system, we show the results with the inner product kernel and several polynomial kernels. The row &quot;All (id)&quot; shows the accuracy from the viewpoint of the identification task, which only finds the named entity regions. The accuracies for several major entity classes are also shown. The SVM system with the 2-dimensional polynomial kernel achieves the highest accuracy. This comparison may be unfair, since a polynomial kernel has the effect of using conjunctive features while the ME system does not use such conjunctive features. Nevertheless, the facts that we can introduce the polynomial kernel very easily, that there are very few parameters to be tuned,11 and that we could achieve the higher accuracy show an advantage of the SVM system.</Paragraph> <Paragraph position="2"> It is interesting to discuss why the SVM systems with the inner product kernel (and the polynomial kernel with d = 1) are outperformed by the ME system. We here discuss two possible reasons. The first is that the SVM system does not use a dynamic decision such as the Viterbi algorithm, while the ME system does. To see this, we degraded the ME system so that it predicts the classes deterministically, without using the Viterbi algorithm. We found that this system marks only 51.54 in F-score. Thus, it can be said that a dynamic decision is important for this named entity task.
However, although a method to convert the outputs of a binary SVM into probabilistic values has been proposed (Platt, 1999), it is not known how to obtain from the outputs of a multi-class SVM the meaningful probabilistic values needed by Viterbi-type algorithms. Solving this problem is certainly a part of our future work. The second possible reason is that the SVM system in this paper does not use any cut-off or feature truncation method to remove data noise, while the ME system uses a simple feature cut-off method.12 We observed that the ME system without the cut-off marks only 49.11 in F-score. Thus, such a noise reduction method is also important. However, the cut-off method of the ME system cannot be applied without modification since, as described in Section 3.4, the definitions of the features differ between the two approaches. It can be said that the features in the ME method are &quot;finer&quot; than those in the SVMs. In this sense, the ME method allows us more flexible feature selection. This is an advantage of the ME method.</Paragraph> <Paragraph position="4"> 11 C, s, r, and d. 12 Features that occur less than 10 times are removed.</Paragraph> <Paragraph position="5"> The accuracies achieved by both systems can be considered high compared with those of the previous methods, if we consider that we have 24 named entity classes. However, the accuracies are not sufficient for practical use. Though higher accuracy will be achieved with a larger annotated corpus, we should also explore more effective features and find effective feature combination methods to exploit such a large corpus maximally.</Paragraph> </Section> </Section> </Paper>