<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1054"> <Title>Efficient Support Vector Classifiers for Named Entity Recognition</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Efficient Classifiers </SectionTitle> <Paragraph position="0"> In this section, we investigate the cause of this inefficiency and propose a solution. All experiments are conducted on training data of 569,994 vectors.</Paragraph> <Paragraph position="1"> The total size of the original news articles was 2 MB and the number of NEs was 39,022. According to the definition of f(x), a classifier has to process ℓ support vectors for each x. Table 1 shows ℓ for the different word classes. According to this table, classification of one word requires x's dot products with 228,306 support vectors in 33 classifiers. Therefore, the classifiers are very slow. We have never seen such large values of ℓ in the SVM literature on pattern recognition. The reason for the large ℓ is word features. In other domains such as character recognition, dimensionality is fixed, whereas here it increases monotonically with respect to the size of the training data. Since SVMs learn combinations of features, ℓ tends to be very large. This tendency will hold for other natural language processing tasks, too.</Paragraph> <Paragraph position="2"> Here, we focus on the quadratic kernel K(u) = (1 + u)^2, which yielded the best score in the above experiments. Suppose x = (x[1], ..., x[N]) has only k (=15) non-zero elements. The dot product of x and z_i = (z_i[1], ..., z_i[N]) is given by x · z_i = Σ_j x[j] z_i[j]. Since the features are binary, expanding K(x · z_i) = (1 + x · z_i)^2 over all support vectors collapses f(x) into a constant w0 plus a weight w1[j] for every non-zero element x[j] and a weight w2[j,h] for every non-zero pair x[j]x[h]. Accordingly, we only need to add 1 + k + k(k-1)/2 (=121) constants to get f(x). Therefore, we can expect this method to be much faster than a naive implementation that computes tens of thousands of dot products at run time. We call this method 'XQK' (eXpand the Quadratic Kernel).</Paragraph> <Paragraph position="3"> Table 1 compares TinySVM and XQK in terms of the CPU time taken to apply the 33 classifiers to the training data. Classes are sorted by ℓ. Small numbers in parentheses indicate the initialization time for reading the support vectors z_i and allocating memory. XQK requires a longer initialization time in order to prepare w0, w1, and w2.</Paragraph> <Paragraph position="5"> TinySVM took 11,490.26 seconds (3.2 hours) in total to apply OTHER's classifier to all vectors in the training data. Its initialization phase took 2.13 seconds, and all vectors in the training data were classified in 11,488.13 (= 11,490.26 - 2.13) seconds. On the other hand, XQK took 225.28 seconds in total, of which the initialization phase took 174.17 seconds. Therefore, the 569,994 vectors were classified in 51.11 seconds. The initialization time can be disregarded because we can reuse the above coefficients.</Paragraph> <Paragraph position="6"> Consequently, XQK is 224.8 (= 11,488.13/51.11) times faster than TinySVM for OTHER. TinySVM took 6 hours to process all the word classes, whereas XQK took only 17 minutes. XQK is 102 times faster than SVM-Light 3.50, which took 1.2 days.</Paragraph>
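To make the expansion concrete, here is a minimal sketch of XQK-style precomputation and classification, assuming binary feature vectors, the quadratic kernel K(u) = (1 + u)^2, and a decision function of the form f(x) = Σ_i α_i y_i K(x · z_i) + b. The function names and the set-of-feature-ids data layout are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

def expand_quadratic_kernel(support_vectors, coeffs, b):
    """Precompute XQK-style weights for f(x) = sum_i c_i * (1 + x.z_i)^2 + b,
    where c_i = alpha_i * y_i and features are binary.
    support_vectors: list of sets of feature ids; coeffs: list of c_i."""
    w0 = b + sum(coeffs)                       # constant term: b + sum_i c_i
    w1 = defaultdict(float)                    # per-feature weights
    w2 = defaultdict(float)                    # per-feature-pair weights
    for z, c in zip(support_vectors, coeffs):
        for j in z:
            w1[j] += 3.0 * c                   # 2*x.z plus the diagonal of (x.z)^2
        for j, h in combinations(sorted(z), 2):
            w2[(j, h)] += 2.0 * c              # cross terms of (x.z)^2
    return w0, dict(w1), dict(w2)

def classify(x, w0, w1, w2):
    """Evaluate f(x) by adding 1 + k + k(k-1)/2 precomputed constants,
    where k is the number of non-zero features in x (a set of feature ids)."""
    feats = sorted(x)
    score = w0 + sum(w1.get(j, 0.0) for j in feats)
    score += sum(w2.get((j, h), 0.0) for j, h in combinations(feats, 2))
    return score
```

For a vector with k = 15 non-zero features, classify adds exactly 1 + 15 + 105 = 121 precomputed constants, which is why it avoids the tens of thousands of dot products of the naive implementation.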
<Paragraph position="7"> XQK's initialization time and memory consumption depend on the number of non-zero elements in z_i. Therefore, removal of useless features would be beneficial. Conventional SVMs do not tell us how an individual feature works, because weights are given not to features but to K(x, z_i). However, the above weights (w1 and w2) clarify how a feature or a feature pair works. We can use this fact for feature selection after the training.</Paragraph> <Paragraph position="8"> We simplify f(x) by removing all features whose weights are smaller in absolute value than a threshold θ. The largest θ that does not change the number of misclassifications for the training data is found by a binary search for each word class. We call this method 'XQK-FS' (XQK with Feature Selection). This approximation slightly degraded GENERAL's F-measure from 88.31% to 88.03%.</Paragraph> <Paragraph position="9"> Table 2 shows the reduction of features that appear in support vectors. Classes are sorted by the number of original features. For instance, OTHER has 56,220 features in its support vectors. According to the binary search, its performance did not change even when the number of features was reduced to 21,852. The total number of features was reduced by 75% and that of weights was reduced by 60%. The table also shows CPU time for classification with the selected features. XQK-FS is 28.5 (= 21,754.23/763.10) times faster than TinySVM. Although the reduction of features is significant, the reduction of CPU time is moderate, because most of the removed features are infrequent ones. However, simple removal of infrequent features without considering weights damages the system's performance. For instance, when we removed 5,066 features that appeared four times or less in the training data, the modified classifier for ORGANIZATION-END misclassified 103 training examples, whereas the original classifier misclassified only 19 examples. On the other hand, XQK-FS removed 12,141 features without an increase in misclassifications for the training data.</Paragraph> <Paragraph position="10"> XQK can easily be extended to the more general quadratic kernel K(u) = (c0 + c1 u)^2 and to non-binary sparse vectors. XQK-FS can also be used to select useful features before training with other kernels. As mentioned above, we conducted an experiment with the cubic kernel (d = 3) using all features. When we trained the cubic kernel classifiers using only the features selected by XQK-FS, TinySVM's classification time was reduced by 40% because ℓ was reduced by 38%. GENERAL's F-measure was slightly improved from 87.04% to 87.10%. On the other hand, when we trained the cubic kernel classifiers using only the features that appeared three times or more (without considering weights), TinySVM's classification time was reduced by only 14% and the F-measure was slightly degraded to 86.85%. Therefore, we expect XQK-FS to be useful as a feature selection method for other kernels when such kernels give much better results than the quadratic kernel.</Paragraph> </Section>
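The threshold search could look like the following sketch. It assumes features are dropped when the absolute values of their weights fall below θ, and that the training error behaves monotonically enough in θ for a binary search, as the description above implies. The names prune_features and largest_safe_threshold, and the reuse of the classify helper from the earlier sketch, are assumptions for illustration, not the authors' code.

```python
def prune_features(w1, w2, theta):
    """Drop per-feature and per-pair weights whose magnitude is below theta."""
    w1p = {j: w for j, w in w1.items() if abs(w) >= theta}
    w2p = {jh: w for jh, w in w2.items() if abs(w) >= theta}
    return w1p, w2p

def largest_safe_threshold(w0, w1, w2, data, labels, lo=0.0, hi=1.0, iters=20):
    """Binary search for the largest theta that does not increase the number of
    training misclassifications relative to the unpruned classifier.
    Uses classify() from the XQK sketch above; data is a list of feature-id sets."""
    def errors(w1x, w2x):
        return sum(1 for x, y in zip(data, labels)
                   if (classify(x, w0, w1x, w2x) >= 0) != (y > 0))
    baseline = errors(w1, w2)
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        if errors(*prune_features(w1, w2, mid)) <= baseline:
            best, lo = mid, mid          # safe: try a larger threshold
        else:
            hi = mid                     # too aggressive: shrink the interval
    return best
```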
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Reduction of training time </SectionTitle> <Paragraph position="0"> Since training of the 33 classifiers also takes a long time, it is difficult to try various combinations of parameters and features. Here, we present a solution for this problem. During training, the calculation of K(x_i · x_1), K(x_i · x_2), ..., K(x_i · x_n) for various x_i is dominant. Conventional systems save time by caching the results. By analyzing TinySVM's classifier, we found that they can be calculated more efficiently.</Paragraph> <Paragraph position="1"> For sparse vectors, most SVM classifiers (e.g., SVM-Light) use a sparse dot product algorithm (Platt, 1999) that compares the non-zero elements of x with those of z_i to get K(x · z_i) in f(x). However, x is common to all of the dot products K(x · z_1), ..., K(x · z_ℓ). Therefore, we can implement a faster classifier that calculates them concurrently. TinySVM's classifier prepares a list fi2si[j] that contains all z_i whose j-th coordinates are not zero. In addition, counters for x · z_1, ..., x · z_ℓ are prepared, because dot products of binary vectors are integers. Then, for each non-zero x[j], the counters are incremented for all z_i in fi2si[j]. By checking only the members of fi2si[j] for non-zero x[j], the classifier is not bothered by the fruitless cases x[j] = 0, z_i[j] ≠ 0 and x[j] ≠ 0, z_i[j] = 0. Therefore, TinySVM's classifier is faster than other classifiers. This method is applicable to any kernel based on dot products.</Paragraph> <Paragraph position="2"> For the training phase, we can build fi2si'[j], which contains all x_i whose j-th coordinates are not zero. Then, K(x_i · x_1), ..., K(x_i · x_n) can be efficiently calculated because x_i is common. This improvement is effective especially when the cache is small and/or the training data is large. When we used a 200 MB cache, the improved system took only 13 hours for training on the CRL data, while TinySVM and SVM-Light took 30 hours and 46 hours, respectively, for the same cache size. Although we have examined other SVM toolkits, we could not find any system that uses this approach in the training phase.</Paragraph> </Section>
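A minimal sketch of the inverted-index idea described above: build fi2si[j] once, then compute all dot products that share a common x by walking only its non-zero coordinates. The Python names and the set-of-feature-ids representation are assumptions for illustration, not TinySVM's actual data structures.

```python
from collections import defaultdict

def build_index(vectors):
    """fi2si[j]: ids of all vectors whose j-th coordinate is non-zero.
    Each vector is represented as a set of feature ids."""
    index = defaultdict(list)
    for i, v in enumerate(vectors):
        for j in v:
            index[j].append(i)
    return index

def dot_products_with_all(x, vectors, index):
    """Compute x.z_1, ..., x.z_n concurrently: for each non-zero x[j],
    bump a counter for every vector listed under j. Pairs where either
    coordinate is zero are never touched."""
    counts = [0] * len(vectors)
    for j in x:                                # x is a set of feature ids
        for i in index.get(j, ()):
            counts[i] += 1                     # binary features: dot products are integers
    return counts
```

The kernel values then follow directly, e.g. [(1 + c) ** 2 for c in counts] for the quadratic kernel; in the training phase the same index built over the training vectors x_i fills one row of the kernel cache at a time.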
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The above methods can also be applied to other tasks in natural language processing, such as chunking and POS tagging, because quadratic kernels give good results there as well.</Paragraph> <Paragraph position="1"> Utsuro et al. (2001) report that a combination of two NE recognizers attained F = 84.07%, but wrong word boundary cases are excluded. Our system attained 85.04%, and word boundaries are adjusted automatically. Yamada (Yamada et al., 2001) also reports that d = 2 is best. Although his system attained F = 83.7% in 5-fold cross-validation on the CRL data (Yamada and Matsumoto, 2001), our system attained 86.8%. Since we followed Isozaki's implementation (Isozaki, 2001), our system differs from Yamada's system in the following points: 1) adjustment of word boundaries, 2) ChaSen's parameters for unknown words, 3) character types, 4) use of the Viterbi search.</Paragraph> <Paragraph position="2"> For efficient classification, Burges and Schölkopf (1997) propose an approximation method that uses "reduced set vectors" instead of support vectors. Since the number of reduced set vectors is smaller than ℓ, the classifiers become more efficient, but the computational cost of determining the vectors is very large. Osuna and Girosi (1999) propose two methods. The first approximates f(x) by support vector regression, but this method is applicable only when a certain parameter is large enough. The second reformulates the training phase. Our approach is simpler than these methods. Downs et al. (2001) try to reduce the number of support vectors by using linear dependence.</Paragraph> <Paragraph position="3"> We can also reduce the run-time complexity of a multi-class problem by cascading SVMs in the form of a binary tree (Schwenker, 2001) or a directed acyclic graph (Platt et al., 2000). Yamada and Matsumoto (2001) applied such a method to their NE system and reduced its CPU time by 39%. This approach can be combined with our SVM classifiers.</Paragraph> <Paragraph position="4"> NE recognition can be regarded as a variable-length multi-class problem. For this kind of problem, probability-based kernels have been studied as more theoretically well-founded methods (Jaakkola and Haussler, 1998; Tsuda et al., 2001; Shimodaira et al., 2001).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> Our SVM-based NE recognizer attained F = 90.03%. This is the best score, as far as we know. Since it was too slow, we made the SVMs faster. The improved classifier is 21 times faster than TinySVM and 102 times faster than SVM-Light. The improved training program is 2.3 times faster than TinySVM and 3.5 times faster than SVM-Light.</Paragraph> <Paragraph position="1"> We also presented an SVM-based feature selection method that removed 75% of the features. These methods can also be applied to other tasks such as chunking and POS tagging.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> Acknowledgment </SectionTitle> <Paragraph position="0"> We would like to thank Yutaka Sasaki for the training data. We thank the members of the Knowledge Processing Research Group for valuable comments and discussion. We also thank Shigeru Katagiri and</Paragraph> </Section> </Paper>