<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1054"> <Title>Efficient Support Vector Classifiers for Named Entity Recognition</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Named Entity (NE) recognition is a task in which proper nouns and numerical information in a document are detected and classified into categories such as person, organization, and date. It is a key technology of Information Extraction and Open-Domain Question Answering (Voorhees and Harman, 2000).</Paragraph> <Paragraph position="1"> We are building a trainable Open-Domain Question Answering System called SAIQA-II. In this paper, we show that an NE recognizer based on Support Vector Machines (SVMs) gives better scores than conventional systems. SVMs have given high performance in various classification tasks (Joachims, 1998; Kudo and Matsumoto, 2001).</Paragraph> <Paragraph position="2"> However, it turned out that off-the-shelf SVM classifiers are too inefficient for NE recognition.</Paragraph> <Paragraph position="3"> The recognizer runs at a rate of only 85 bytes/sec on an Athlon 1.3 GHz Linux PC, while rule-based systems (e.g., Isozaki, (2001)) can process several kilobytes in a second. The major reason is the inefficiency of SVM classifiers. There are other reports on the slowness of SVM classifiers. Another SVM-based NE recognizer (Yamada and Matsumoto, 2001) is 0.8 sentences/sec on a Pentium III 933 MHz PC. An SVM-based part-of-speech (POS) tagger (Nakagawa et al., 2001) is 20 tokens/sec on an Alpha 21164A 500 MHz processor. It is difficult to use such slow systems in practical applications.</Paragraph> <Paragraph position="4"> In this paper, we present a method that makes the NE system substantially faster. This method can also be applied to other tasks in natural language processing such as chunking and POS tagging. Another problem with SVMs is its incomprehensibility. It is not clear which features are important or how they work. The above method is also useful for finding useless features. We also mention a method to reduce training time.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Support Vector Machines </SectionTitle> <Paragraph position="0"> Suppose we have a set of training data for a two-class problem: a2a4a3a6a5a8a7a10a9a11a5a13a12a14a7a8a15a8a15a8a15a16a7a17a2a4a3a19a18a20a7a10a9a21a18a22a12, wherea3a24a23a10a2a26a25 a27a29a28 a12 is a feature vector of the a30-th sample in the training data and a9a23 a25a32a31a34a33a36a35a37a7a8a38a20a35a34a39 is the label for the sample. The goal is to find a decision function that accurately predicts a9 for unseen a3 . A non-linear SVM classifier gives a decision function Here, a40 a2a4a3a41a12a60a42a61a33a36a35 meansa3 is a member of a certain class anda40 a2a4a3a41a12a36a42a32a38a20a35 meansa3 is not a member. a55a23s are called support vectors and are representatives of training examples. a62 is the number of support vectors. Therefore, computational complexity ofa43a63a2a4a3a41a12 is proportional toa62 . Support vectors and other constants are determined by solving a certain quadratic programming problem. a52a54a2a4a3a6a7a56a55a64a12 is a kernel that implicitly maps vectors into a higher dimensional space. Typical kernels use dot products: a52a54a2a4a3a6a7a56a55a64a12a65a42a67a66a24a2a4a3a69a68a17a55a64a12. 
<Paragraph position="1"> A polynomial kernel of degree $d$ is given by $k(x \cdot z) = (x \cdot z + 1)^d$. We can use various kernels, and the design of an appropriate kernel for a particular application is an important research issue.</Paragraph>
<Paragraph position="2"> Figure 1 shows a linearly separable case. The decision hyperplane defined by $g(x) = 0$ separates positive and negative examples by the largest margin. The solid line indicates the decision hyperplane, and the two parallel dotted lines indicate the margin between positive and negative examples. Since such a separating hyperplane may not exist, a positive parameter $C$ is introduced to allow misclassifications. See Vapnik (1995).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 1.2 SVM-based NE recognition </SectionTitle>
<Paragraph position="0"> As far as we know, the first SVM-based NE system was proposed by Yamada et al. (2001) for Japanese. Their system is an extension of Kudo's chunking system (Kudo and Matsumoto, 2001), which gave the best performance at the CoNLL-2000 shared task. In their system, every word in a sentence is classified sequentially from the beginning or the end of the sentence. However, since Yamada et al. did not compare it with other methods under the same conditions, it is not clear whether their NE system is better or not. Here, we show that our SVM-based NE system is more accurate than conventional systems. Our system uses the Viterbi search (Allen, 1995) instead of sequential determination.</Paragraph>
<Paragraph position="1"> For training, we use the 'CRL data', which was prepared for IREX (Information Retrieval and Extraction Exercise; Sekine and Eriguchi, 2000). It has about 19,000 NEs in 1,174 articles. We also use additional data prepared by Isozaki (2001). Both datasets are based on Mainichi Newspaper's 1994 and 1995 CD-ROMs. We use IREX's formal test data, called GENERAL, which has 1,510 named entities in 71 articles from Mainichi Newspaper of 1999. Systems are compared in terms of GENERAL's F-measure, which is the harmonic mean of recall and precision and is defined as follows: $$\text{F-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}},$$ where $\text{recall} = M / (\text{the number of correct NEs})$, $\text{precision} = M / (\text{the number of NEs extracted by the system})$, and $M$ is the number of NEs correctly extracted and classified by the system.</Paragraph>
<Paragraph position="2"> We developed an SVM-based NE system by following our NE system based on maximum entropy (ME) modeling (Isozaki, 2001); we simply replaced the ME model with SVM classifiers. The above datasets are processed by the morphological analyzer ChaSen 2.2.1, which tokenizes a sentence into words and adds POS tags. ChaSen uses about 90 POS tags, such as common-noun and location-name. Since most unknown words are proper nouns, ChaSen's parameters for unknown words are modified for better results. Then, a character type tag is added to each word; 17 character types are used, such as all-kanji and small-integer. See Isozaki (2001) for details.</Paragraph>
<Paragraph position="3"> Now, Japanese NE recognition is solved by the classification of words (Sekine et al., 1998; Borthwick, 1999; Uchimoto et al., 2000). For instance, the words in "President George Herbert Bush said Clinton is . . ." are classified as follows: "President" = OTHER, "George" = PERSON-BEGIN, "Herbert" = PERSON-MIDDLE, "Bush" = PERSON-END, "said" = OTHER, "Clinton" = PERSON-SINGLE, "is" = OTHER. In this way, the first word of a person's name is labeled PERSON-BEGIN and the last word is labeled PERSON-END; other words in the name are PERSON-MIDDLE. If a person's name is expressed by a single word, it is labeled PERSON-SINGLE. If a word does not belong to any named entity, it is labeled OTHER. Since IREX defines eight NE classes, words are classified into 33 ($= 8 \times 4 + 1$) categories.</Paragraph>
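To make the labeling scheme concrete, here is a small sketch (our illustration, not part of the original system; the tagged input format is hypothetical) that derives these word classes from an entity-annotated word sequence:

    # Sketch of the BEGIN/MIDDLE/END/SINGLE/OTHER labeling described above.
    # Input: (word, entity-class-or-None) pairs; adjacent same-class words
    # are treated as one entity, a simplification sufficient for this example.
    def label_words(tagged_words):
        labels = []
        n = len(tagged_words)
        for i, (word, entity) in enumerate(tagged_words):
            if entity is None:
                labels.append("OTHER")
                continue
            starts = i == 0 or tagged_words[i - 1][1] != entity
            ends = i == n - 1 or tagged_words[i + 1][1] != entity
            if starts and ends:
                labels.append(entity + "-SINGLE")
            elif starts:
                labels.append(entity + "-BEGIN")
            elif ends:
                labels.append(entity + "-END")
            else:
                labels.append(entity + "-MIDDLE")
        return labels

    words = [("President", None), ("George", "PERSON"), ("Herbert", "PERSON"),
             ("Bush", "PERSON"), ("said", None), ("Clinton", "PERSON"), ("is", None)]
    # Prints OTHER, PERSON-BEGIN, PERSON-MIDDLE, PERSON-END, OTHER,
    # PERSON-SINGLE, OTHER, matching the example in the text.
    print(label_words(words))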
<Paragraph position="4"> Each sample is represented by 15 features because each word has three features (part-of-speech tag, character type, and the word itself), and the two preceding words and the two succeeding words are also used for context dependence. Although infrequent features are usually removed to prevent overfitting, we use all features because SVMs are robust. Each sample is represented by a long binary vector, i.e., a sequence of 0 (false) and 1 (true). For instance, "Bush" in the above example is represented by a long binary vector in which only the elements corresponding to its 15 active features are 1 and all the other elements are 0.</Paragraph>
<Paragraph position="5"> Here, we have to consider the following problems. First, SVMs can solve only two-class problems; therefore, we have to reduce the above multi-class problem to a group of two-class problems. Second, we have to consider consistency among word classes in a sentence: for instance, a word classified as PERSON-BEGIN should be followed by PERSON-MIDDLE or PERSON-END. This implies that the system has to determine the best combination of word classes from numerous possibilities. Here, we solve these problems by combining existing methods.</Paragraph>
<Paragraph position="6"> There are a few approaches to extending SVMs to cover $K$-class problems. Here, we employ the "one class versus all others" approach: each classifier $f_c(x)$ is trained to distinguish members of a class $c$ from non-members. In this method, two or more classifiers may give $+1$ to an unseen vector, or no classifier may give $+1$. One common way to avoid such situations is to compare the $g_c(x)$ values and choose the class index $c$ of the largest $g_c(x)$. The consistency problem is solved by the Viterbi search. Since SVMs do not output probabilities, we use the SVM+sigmoid method (Platt, 2000): we use a sigmoid function $s(x) = 1/(1 + \exp(-\beta x))$ to map each $g_c(x)$ to a probability-like value.</Paragraph>
<Paragraph position="7"> The output of the Viterbi search is adjusted by a postprocessor for wrong word boundaries. The adjustment rules are also statistically determined (Isozaki, 2001).</Paragraph>
</Section>
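A schematic sketch of this one-versus-rest step with sigmoid-mapped scores (our illustration; the value of beta and the representation of the per-class classifiers are assumptions, not the authors' code):

    import math

    def class_scores(x, classifiers, beta=1.0):
        # For each class c, map the SVM score g_c(x) to a probability-like
        # value with the sigmoid s(x) = 1 / (1 + exp(-beta * x)) (Platt, 2000).
        # `classifiers` is assumed to map a class name to a function g_c.
        return {c: 1.0 / (1.0 + math.exp(-beta * g_c(x)))
                for c, g_c in classifiers.items()}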
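And a minimal Viterbi search over word-class sequences, where a consistency predicate such as "PERSON-BEGIN must be followed by PERSON-MIDDLE or PERSON-END" rules out inconsistent paths (again a simplified sketch under our own assumptions, not the original system):

    def viterbi(score_seq, allowed):
        # score_seq: one dict per word mapping class -> probability-like value.
        # allowed(prev, cur): True if the class bigram is consistent.
        # Returns the highest-scoring consistent label sequence.
        best = {c: (p, [c]) for c, p in score_seq[0].items()}
        for scores in score_seq[1:]:
            new_best = {}
            for cur, p in scores.items():
                cands = [(bp * p, path + [cur])
                         for prev, (bp, path) in best.items()
                         if allowed(prev, cur)]
                if cands:
                    new_best[cur] = max(cands)
            best = new_best
        return max(best.values())[1]

    # Example constraint in the spirit of the text: a *-BEGIN or *-MIDDLE
    # label must be followed by the same entity's -MIDDLE or -END label,
    # and -MIDDLE/-END may only continue an already-open entity.
    def allowed(prev, cur):
        if prev.endswith("-BEGIN") or prev.endswith("-MIDDLE"):
            entity = prev.rsplit("-", 1)[0]
            return cur in (entity + "-MIDDLE", entity + "-END")
        return not (cur.endswith("-MIDDLE") or cur.endswith("-END"))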
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 1.3 Comparison of NE recognizers </SectionTitle>
<Paragraph position="0"> We use a fixed value $C = 100$. F-measures are not very sensitive to $C$ unless $C$ is too small. When we used 1,038,986 training vectors, GENERAL's F-measure was 89.64% for $C = 0.1$ and 90.03% for $C = 100$. We use the quadratic kernel ($d = 2$) because it gives the best results: polynomial kernels of degree 1, 2, and 3 were compared, and degrees 1 and 2 resulted in 83.03% and 88.31%, respectively, when we used only CRL data.</Paragraph>
<Paragraph position="1"> [Figure: comparison of NE recognizers in terms of GENERAL's F-measure.] 'ME' indicates our ME system and 'RG+DT' indicates a rule-based machine learning system (Isozaki, 2001). According to this graph, 'SVM' is better than the other systems.</Paragraph>
<Paragraph position="2"> However, SVM classifiers are too slow. The well-known SVM-Light 3.50 (Joachims, 1999) took 1.2 days to classify 569,994 vectors derived from 2 MB of documents; that is, it runs at only 19 bytes/sec. TinySVM's classifier seems to be the best optimized among publicly available SVM toolkits, but it still works at only 92 bytes/sec.</Paragraph>
</Section>
</Section>
</Paper>