File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1027_metho.xml
Size: 7,504 bytes
Last Modified: 2025-10-06 14:08:28
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1027"> <Title>Virtual Examples for Text Classification with Support Vector Machines</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Support Vector Machines </SectionTitle> <Paragraph position="0"> In this section we give some theoretical definitions of SVMs. Assume that we are given the training data The decision function CV in SVM framework is defined as:</Paragraph> <Paragraph position="2"> are weights. Besides, the weights AB The solution gives an optimal hyperplane, which is a decision boundary between the two classes. Figure 1 illustrates an optimal hyperplane and its support vectors. null</Paragraph> </Section> <Section position="4" start_page="0" end_page="1" type="metho"> <SectionTitle> 3 Virtual Examples and Virtual Support </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> Vectors </SectionTitle> <Paragraph position="0"> Virtual examples are generated from labeled examples. null Based on prior knowledge of a target task, the label of a generated example is set to the same value as that of the original example.</Paragraph> <Paragraph position="1"> For example, in hand-written digit recognition, virtual examples can be created on the assumption that the label of an example is unchanged even if the example is shifted by one pixel in the four principal directions (Sch&quot;olkopf et al., 1996; DeCoste and Sch&quot;olkopf, 2002).</Paragraph> <Paragraph position="2"> Virtual examples that are generated from support vectors are called virtual support vectors (Sch&quot;olkopf We discuss here only virtual examples which are generated from labeled examples. We do not consider examples, the labels of which are not known.</Paragraph> <Paragraph position="3"> et al., 1996). Reasonable virtual support vectors are expected to give a better optimal hyperplane. Assuming that virtual support vectors represent natural variations of examples of a target task, the decision boundary should be more accurate. Figure 2 illustrates the idea of virtual support vectors. Note that after virtual support vectors are given, the hyperplane is different from that in Figure 1.</Paragraph> </Section> </Section> <Section position="5" start_page="1" end_page="3" type="metho"> <SectionTitle> 4 Virtual Examples for Text Classification </SectionTitle> <Paragraph position="0"> We assume on text classification the following: Assumption 1 The category of a document is unchanged even if a small number of words are added or deleted.</Paragraph> <Paragraph position="1"> This assumption is reasonable. In typical cases of text classification most of the documents usually contain two or more keywords which may indicate the categories of the documents.</Paragraph> <Paragraph position="2"> Following Assumption 1, we propose two methods to create virtual examples for text classification. One method is to delete some portion of a document. The label of a virtual example is given from the original document. The other method is to add a small number of words to a document. The words to be added are taken from documents, the label of which is the same as that of the document. Although one can invent various methods to create virtual examples based on Assumption 1, we propose here very simple ones.</Paragraph> <Paragraph position="3"> Before describing our methods, we describe text representation which we used in this study. We tokenize a document to words, downcase them and then remove stopwords, where the stopword list of freeWAIS-sf is used. Stemming is not performed.</Paragraph> <Paragraph position="4"> We adopt binary feature vectors where word frequency is not used.</Paragraph> <Paragraph position="5"> Now we describe the two proposed methods: GenerateByDeletion and GenerateByAddition. Assume that we are given a feature vector (a document) D8 then remove the feature CU, where randB4B5 is a function which generates a random number from BC to BD, and D8 is a parameter to decide how many features are deleted.</Paragraph> <Paragraph position="6"> For example, suppose that we have a set of documents as in Table 1. Some possible virtual examples generated from Document 1 by GenerateByDeletion 1. Collect from a training set documents, the label of which is the same as that of DC.</Paragraph> <Paragraph position="7"> 2. Concatenate all the feature vectors (documents) to create an array CP of features. Each element of CP is a feature which represents a word.</Paragraph> <Paragraph position="8"> 3. Copy DC to DC 4. For each binary feature CU of DC BC , if randB4B5 AK D8 then select a feature randomly from CP and put</Paragraph> <Paragraph position="10"> For example, when we want to generate a virtual example from Document 2 in Table 1 by GenerateByAddition algorithm, first we create an array CP BP to evaluate the proposed methods. The dataset has several splits of a training set and a test set. We used here &quot;ModApte&quot; split, which is most widely used in the literature on text classification. This split has 9,603 training examples and 3,299 test examples. More than 100 categories are in the dataset. We use, however, only the most frequent 10 categories. Table 2 shows the 10 categories and the number of training and test examples in each of the categories.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.2 Performance Measures </SectionTitle> <Paragraph position="0"> We use F-measure (van Rijsbergen, 1979; Lewis and Gale, 1994) as a primal performance measure to evaluate the result. F-measure is defined as: where D4 is precision and D5 is recall and AC is a parameter which decides the relative weight of precision and recall. The D4 and the D5 are defined as:</Paragraph> </Section> </Section> <Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> D4 BP </SectionTitle> <Paragraph position="0"> number of positive and correct outputs number of positive outputs</Paragraph> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> D5 BP </SectionTitle> <Paragraph position="0"> number of positive and correct outputs number of positive examples In Equation 4, usually AC BP BD is used, which means it gives equal weight to precision and recall. When we evaluate the performance of a classifier to a multiple category dataset, there are two ways to compute F-measure: macro-averaging and micro-averaging (Yang, 1999). The former way is to first compute F-measure for each category and then average them, while the latter way is to first compute precision and recall for all the categories and use them to calculate the F-measure.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.3 SVM setting </SectionTitle> <Paragraph position="0"> Through our experiments we used our original SVM tools, the algorithm of which is based on SMO (Sequential Minimal Optimization) by Platt (1999). We used linear SVMs and set a misclassification cost BV to BCBMBCBDBIBHBGBD which is BDBPB4the average of DCA1DCB5 where DC is a feature vector in the 9,603 size training set.</Paragraph> <Paragraph position="1"> For simplicity, we fixed BV through all the experiments. We built a binary classifier for each of the 10 categories shown in Table 2.</Paragraph> </Section> </Section> class="xml-element"></Paper>