<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0610"> <Title>Using Uneven Margins SVM and Perceptron for Information Extraction</Title> <Section position="4" start_page="72" end_page="73" type="metho"> <SectionTitle> 2 Uneven Margins SVM and Perceptron </SectionTitle> <Paragraph position="0"> Li and Shawe-Taylor (2003) introduced an uneven margins parameter into the SVM to deal with imbalanced classification problems. They showed that the SVM with uneven margins outperformed the standard SVM on document classification problems with imbalanced training data. Formally, given a training set Z = ((x_1, y_1), ..., (x_m, y_m)), where x_i is the n-dimensional input vector and y_i (= +1 or -1) its label, the SVM with uneven margins is obtained by solving the quadratic optimisation problem:</Paragraph> <Paragraph position="1">
$$\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad & \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & \langle \mathbf{w}, \mathbf{x}_i \rangle + \xi_i + b \ge 1 \quad \text{if } y_i = +1, \\
& \langle \mathbf{w}, \mathbf{x}_i \rangle - \xi_i + b \le -\tau \quad \text{if } y_i = -1, \\
& \xi_i \ge 0 \quad \text{for } i = 1, \dots, m.
\end{aligned}$$
</Paragraph> <Paragraph position="2"> We can see that the uneven margins parameter τ was added to the constraints of the optimisation problem. τ is the ratio of the negative margin to the positive margin of the classifier and is equal to 1 in the standard SVM. For an imbalanced dataset with a few positive examples and many negative ones, it is beneficial to use a larger margin for the positive examples than for the negative ones. Li and Shawe-Taylor (2003) also showed that the solution of the above problem can be obtained by solving a related standard SVM problem, for example by using a publicly available SVM package.</Paragraph> <Paragraph position="3"> The Perceptron is an on-line learning algorithm for linear classification. It processes the training examples one by one, predicting the label of each. If the prediction is correct, the example is passed over; otherwise, the example is used to correct the model. The algorithm stops when the model classifies all training examples correctly. The margin Perceptron not only classifies every training example correctly but also outputs, for every training example, a value (before thresholding) larger than a predefined parameter (the margin). The margin Perceptron has better generalisation capability than the standard Perceptron. Li et al. (2002) proposed the Perceptron algorithm with uneven margins (PAUM) by introducing two margin parameters τ+ and τ− into the updating rules for the positive and negative examples, respectively. Like the uneven margins parameter in the SVM, the two margin parameters allow PAUM to handle imbalanced datasets better than both the standard Perceptron and the margin Perceptron. Additionally, Perceptron learning is guaranteed to terminate after a finite number of updates only on a linearly separable training set. Hence, a regularisation parameter λ is used in PAUM to guarantee that the algorithm stops for any training dataset after a finite number of updates. PAUM is simple and fast and performed very well on document classification, in particular on imbalanced data.</Paragraph>
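To make the update rules concrete, the following is a minimal Python sketch of PAUM as described above. It is an illustrative reading, not the authors' implementation: the default margin values, the learning rate, and the way λ is realised (a per-example extra coordinate, equivalent to adding λ to the kernel diagonal, which makes any training set linearly separable) are our own assumptions.

```python
import numpy as np

def paum(X, y, tau_pos=1.0, tau_neg=1.0, lam=1.0, eta=1.0, max_epochs=100):
    """Perceptron Algorithm with Uneven Margins -- an illustrative sketch.

    X: (m, n) array of input vectors; y: (m,) array of labels in {+1, -1}.
    tau_pos / tau_neg are the margins required of positive / negative
    examples. lam is realised here as a per-example extra coordinate
    (equivalent to adding lam to the kernel diagonal), which makes any
    training set linearly separable and hence guarantees termination.
    """
    m, n = X.shape
    Xa = np.hstack([X, np.sqrt(lam) * np.eye(m)])  # augmented inputs
    w = np.zeros(n + m)
    b = 0.0
    margin = {1: tau_pos, -1: tau_neg}
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(Xa, y):
            # Update whenever the signed output fails to clear the
            # margin required for this example's class.
            if yi * (w @ xi + b) <= margin[yi]:
                w += eta * yi * xi
                b += eta * yi
                updated = True
        if not updated:  # every example clears its required margin
            break
    # The per-example coordinates only enforce separability during
    # training; return the weights over the original features plus bias.
    return w[:n], b
```

Setting τ+ = τ− > 0 recovers the margin Perceptron, and τ+ = τ− = 0 the standard Perceptron.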
</Section> <Section position="5" start_page="73" end_page="77" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="73" end_page="74" type="sub_section"> <SectionTitle> 3.1 Classifier-Based Framework for IE </SectionTitle> <Paragraph position="0"> In the experiments we adopted a classifier-based framework for applying the SVM and PAUM algorithms to IE. The framework consists of three stages: pre-processing the documents to obtain feature vectors, learning classifiers or applying classifiers to test documents, and finally post-processing the results to tag the documents.</Paragraph> <Paragraph position="1"> The aim of the pre-processing is to form input vectors from documents. Each document is first processed using the open-source ANNIE system, which is part of GATE (Cunningham et al., 2002). This produces a number of linguistic (NLP) features, including token form, capitalisation information, token kind, lemma, part-of-speech (POS) tag, semantic classes from gazetteers, and named entity types according to ANNIE's rule-based recogniser.</Paragraph> <Paragraph position="2"> Based on this linguistic information, an input vector is constructed for each token, as we iterate through the tokens in each document (including words, numbers, punctuation and other symbols) to decide whether the current token belongs to an information entity or not. Since in IE the context of a token is usually as important as the token itself, the features in the input vector come not only from the current token but also from the preceding and following ones.</Paragraph> <Paragraph position="3"> As the input vector incorporates information from the context surrounding the current token, features from different tokens can be weighted differently, based on their position in the context. The weighting scheme we use is the reciprocal scheme, which weights the surrounding tokens in inverse proportion to their distance from the token at the centre of the context window. This reflects the intuition that the nearer a neighbouring token is, the more important it is for classifying the given token. Our experiments showed that this weighting scheme obtained better results than the commonly used equal weighting of features (Li et al., 2005).</Paragraph> <Paragraph position="4"> The key part of the framework is to convert the recognition of information entities into binary classification tasks: one classifier decides whether a token is the start of an entity, and another whether it is the end token.</Paragraph> <Paragraph position="5"> After classification, the start and end tags of the entities are obtained and need to be combined into one entity tag. Therefore some post-processing is needed to guarantee tag consistency and to try to improve the results by exploiting other information. The currently implemented procedure has three stages. First, in order to guarantee the consistency of the recognition results, the document is scanned from left to right to remove start tags without matching end tags and end tags without preceding start tags. The second stage filters out candidate entities from the output of the first stage based on their length: a candidate entity tag is removed if the entity's length (i.e., the number of tokens) is not equal to the length of any entity of the same type in the training set. The third stage puts together all possible tags for a sequence of tokens and chooses the best one according to a probability computed from the outputs of the classifiers (before thresholding) via a sigmoid function.</Paragraph>
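Two pieces of this framework are easy to make concrete: the reciprocal weighting of context features and the sigmoid scoring used in the third post-processing stage. The Python sketch below illustrates both under stated assumptions; the function names, the position-prefixed feature naming, and the averaging of the start and end scores are our own illustrative choices, not the system's actual code.

```python
import math
from collections import defaultdict

def context_features(token_feats, i, window=4):
    """Weighted feature vector for the token at position i.

    token_feats: list of per-token feature lists (e.g. the NLP features
    produced by ANNIE). Features of neighbouring tokens within `window`
    positions are weighted by the reciprocal of their distance to the
    centre token (the reciprocal scheme described above).
    """
    vec = defaultdict(float)
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(token_feats):
            weight = 1.0 / max(1, abs(offset))  # centre token: weight 1
            for feat in token_feats[j]:
                # Prefix by relative position so that the same feature
                # at different positions stays distinct.
                vec[f"{offset}:{feat}"] += weight
    return dict(vec)

def entity_probability(start_score, end_score):
    """Combine the raw (pre-threshold) start/end classifier outputs into
    a probability via a sigmoid, as in the third post-processing stage;
    averaging the two scores is an illustrative assumption."""
    s = 0.5 * (start_score + end_score)
    return 1.0 / (1.0 + math.exp(-s))
```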
</Section> <Section position="2" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.2 The Experimental Datasets </SectionTitle> <Paragraph position="0"> The paper reports evaluation results on three corpora covering different IE tasks: named entity recognition (CoNLL-2003) and template filling, or scenario templates, in different domains (Jobs and CFP). The CoNLL-2003 shared task provides the most recent evaluation results of many learning algorithms on named entity recognition. The Jobs corpus has also been used recently by several learning systems. The CFP corpus was created as part of the recent Pascal Challenge on the evaluation of machine learning methods for IE.</Paragraph> <Paragraph position="1"> In detail, we used the English part of the CoNLL-2003 shared task dataset, which consists of 946 documents for training, 216 documents for development (e.g., tuning the parameters of the learning algorithm), and 231 documents for evaluation (i.e., testing), all of which are news articles taken from the Reuters English corpus (RCV1). The corpus contains four types of named entities: person, location, organisation, and miscellaneous names. In the other two corpora, domain-specific information was extracted into a number of slots. The Jobs corpus includes 300 computer-related job advertisements and 17 slots encoding job details, such as title, salary, recruiter, computer language, application, and platform. The CFP corpus contains workshop calls for papers (CFP), 600 of which were annotated. The corpus includes 11 slots, such as workshop and conference names and acronyms, workshop date, location, and homepage.</Paragraph> </Section> <Section position="3" start_page="74" end_page="76" type="sub_section"> <SectionTitle> 3.3 Comparison to Other Systems </SectionTitle> <Paragraph position="0"> Named Entity Recognition. The algorithms are evaluated on the CoNLL-2003 dataset. Since this dataset comes with development data for tuning the learning algorithm, different settings were tried in order to obtain the best performance on the development set.</Paragraph> <Paragraph position="1"> Different SVM kernel types, window sizes (namely, the number of tokens on either side of the token at the centre of the window), and values of the uneven margins parameter τ were tested. We found that a quadratic kernel, a window size of 4, and τ = 0.5 produced the best results on the development set. These settings were used in all experiments on the CoNLL-2003 dataset in this paper, unless otherwise stated. The parameter settings for PAUM described in Li et al. (2002) were used in all our experiments with PAUM, unless otherwise stated.</Paragraph> <Paragraph position="4"> Table 1 presents the results of our system using three learning algorithms, the uneven margins SVM, the standard SVM, and PAUM, on the CoNLL-2003 test set, together with the results of three participating systems in the CoNLL-2003 shared task: the best system (Florian et al., 2003), the SVM-based system (Mayfield et al., 2003), and the Perceptron-based system (Carreras et al., 2003).</Paragraph> <Paragraph position="5"> Firstly, our uneven margins SVM system performed significantly better than the other SVM-based system. Since the two systems differ not only in the SVM models used but also in other aspects such as the NLP features and the framework, in order to make a fair comparison between the uneven margins SVM and the standard SVM we also present the results of the two learning algorithms implemented in our framework. We can see from Table 1 that, under the same experimental settings, the uneven margins SVM again performed better than the standard SVM.</Paragraph> <Paragraph position="6"> Secondly, our PAUM-based system performed slightly better than the system based on the voted Perceptron, but there is no significant difference between them.
Note that the two systems adopted different mechanisms to deal with the imbalanced data in IE (see Section 1). The structure of the PAUM system is simpler than that of the voted Perceptron system.</Paragraph> <Paragraph position="7"> Finally, the PAUM system performed worse than the SVM system. On the other hand, the training time of PAUM is only 1% of that of the SVM, and the PAUM implementation is much simpler than that of the SVM. Therefore, when simplicity and speed are required, PAUM presents a good alternative.</Paragraph> <Paragraph position="8"> Template Filling. On the Jobs corpus our systems are compared to several state-of-the-art learning systems, which include the rule-based systems Rapier (Califf, 1998), (LP)2 (Ciravegna, 2001) and BWI (Freitag and Kushmerick, 2000), the statistical system HMM (Freitag and Kushmerick, 2000), and the double classification system (Sitter and Daelemans, 2003). In order to make the comparison as informative as possible, our experiments adopt the same settings as those used by (LP)2, which previously reported the highest results on this dataset. In particular, the results are obtained by averaging the performance over ten runs, using a random half of the corpus for training and the rest for testing. Only basic NLP features are used: token form, capitalisation information, token types, and lemmas.</Paragraph> <Paragraph position="9"> Preliminary experiments established that the SVM with a linear kernel obtained better results than the SVM with a quadratic kernel on the Jobs corpus (Li et al., 2005). Hence we used the SVM with a linear kernel in the experiments on the Jobs data. Note that PAUM always uses a linear kernel in our experiments.</Paragraph> <Paragraph position="10"> Table 2 presents the results of our systems as well as the other six systems which have been evaluated on the Jobs corpus. Overall performance is reported as macro-averaged F1, with standard deviations for the MA F1 of our systems given in parentheses; the highest score on each slot and overall appears in bold. Note that results for all 17 slots are available for only three systems, Rapier, (LP)2 and double classification, while results for only some slots were available for the other three systems. We computed the macro-averaged F1 (the mean of the F1 scores of all slots) for our systems as well as for the three fully evaluated systems in order to compare overall performance.</Paragraph>
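Written out, the overall measure is simply the unweighted mean of the per-slot F1 scores; this is the standard definition, stated here only for completeness:

```latex
\text{MA-}F_1 = \frac{1}{K}\sum_{k=1}^{K} F_1^{(k)},
\qquad
F_1^{(k)} = \frac{2\, P_k R_k}{P_k + R_k},
```

where K = 17 is the number of slots and P_k, R_k are the precision and recall on slot k.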
<Paragraph position="11"> Firstly, the overall performance of our two systems is significantly better than that of the other three fully evaluated systems. The PAUM system achieves the best performance on 5 of the 17 slots, and the SVM system performs best on another 3 slots. Secondly, the double classification system had much worse overall performance than our systems and the other two fully evaluated systems. HMM was evaluated on only two slots; it achieved the best result on one slot but was much worse on the other than our two systems and some of the others. Finally, somewhat surprisingly, our PAUM system achieves better performance than the SVM system on this dataset. Moreover, the computation time of PAUM is about 1/3 of that of the SVM. Hence, the PAUM system performs quite satisfactorily on the Jobs corpus.</Paragraph> <Paragraph position="12"> Our systems were also evaluated by participating in the Pascal challenge Evaluating Machine Learning for Information Extraction. The evaluation provided not only the CFP corpus but also the linguistic features for all tokens, obtained by pre-processing the documents. The main purpose of the challenge was to evaluate machine learning algorithms based on the same linguistic features. The only compulsory task was Task 1, which used 400 annotated documents for training and another 200 annotated documents for testing. See Ireson and Ciravegna (2005) for a short overview of the challenge. The learning methods explored by the participating systems included (LP)2, HMM, CRF, SVM, and a variety of combinations of different learning algorithms. Firstly, the system of the challenge organisers, which is based on (LP)2, obtained the best result for Task 1, followed by one of our participating systems, which combined the uneven margins SVM and PAUM (see Ireson and Ciravegna (2005)). Our SVM and PAUM systems on their own were respectively in fourth and fifth position among the 20 participating systems. Secondly, at least six other participating systems were also based on SVM but used a different IE framework and possibly different SVM models from our SVM system. Our SVM system achieved better results than all those SVM-based systems, showing that the SVM models and the IE framework of our system are well suited to the IE task. Thirdly, our PAUM-based system was not as good as our SVM system but was still better than the other SVM-based systems. The computation time of the PAUM system was about 1/5 of that of our SVM system.</Paragraph> <Paragraph position="13"> Table 3 presents the per-slot results (F-measures, %) and the overall performance of our SVM and PAUM systems on the CFP corpus, together with the system with the best overall result; the highest score on each slot appears in bold.</Paragraph> <Paragraph position="14"> Compared to the best system, our SVM system performed better on two slots and had similar results on many of the other slots. The best system had extremely good results on two slots, C-acronym and C-homepage; in fact, the F1 values of the best system on these two slots were more than double those of every other participating system.</Paragraph> </Section> <Section position="4" start_page="76" end_page="77" type="sub_section"> <SectionTitle> 3.4 Effects of Uneven Margins Parameter </SectionTitle> <Paragraph position="0"> A number of experiments were conducted to investigate the influence of the uneven margins parameter on the performance of the SVM and the Perceptron. Table 4 shows the results for several different values of the uneven margins parameter for the SVM and the Perceptron, respectively, on the CoNLL-2003 and Jobs datasets. The SVM with uneven margins (τ < 1.0) had better results than the standard SVM (τ = 1).</Paragraph> <Paragraph position="1"> We can also see that the results were similar for τ between 0.4 and 0.6, showing that the results are not particularly sensitive to the value of the uneven margins parameter. The uneven margins parameter has a similar effect on the Perceptron as on the SVM: Table 4 shows that PAUM had better results than both the standard Perceptron and the margin Perceptron.</Paragraph> <Paragraph position="2"> Our conjecture was that the uneven margins parameter is more helpful on small training sets, because the smaller a training set is, the more imbalanced it may be. Therefore we carried out experiments with small numbers of training documents.
Table 5 shows the results of the standard SVM and the uneven margins SVM for different numbers of training documents from the CoNLL-2003 and Jobs datasets. The performance of both the standard SVM and the uneven margins SVM improves consistently as more training documents are used. Moreover, compared to the results on large training sets shown in Table 4, the uneven margins SVM obtains a greater improvement over the standard SVM on small training sets. We can see that the smaller the training set, the better the results of the uneven margins SVM in comparison to the standard SVM.</Paragraph> </Section> </Section> </Paper>