<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2604">
  <Title>Basque Country ccpzejaa@si.ehu.es Iñaki Alegria UPV-EHU Basque Country acpalloi@si.ehu.es Olatz Arregi UPV-EHU Basque Country acparuro@si.ehu.es</Title>
  <Section position="4" start_page="26" end_page="28" type="metho">
    <SectionTitle>
3 Proposed Approach
</SectionTitle>
    <Paragraph position="0"> In this paper we propose a multiclassifier based document categorization system. Documents in the training and testing sets are represented in a reduced dimensional vector space. Different trainingdatabasesaregeneratedfromtheoriginaltrain- null 2Actually, this result is obtained for 118 categories which correspond to the 115 mentioned before and three more categories which have testing documents but no training document assigned.</Paragraph>
    <Paragraph position="1">  ModApte split.</Paragraph>
    <Paragraph position="2"> ing dataset in order to construct the multiclassifier. We use the k-NN classification algorithm, which according to each training database makes a prediction for testing documents. Finally, a Bayesian voting scheme is used in order to definitively assign category labels to testing documents.</Paragraph>
    <Paragraph position="3"> In the rest of this section we make a brief review of the SVD dimensionality reduction technique, the k-NN algorithm and the combination of classifiers used.</Paragraph>
    <Section position="1" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
3.1 The SVD Dimensionality Reduction
Technique
</SectionTitle>
      <Paragraph position="0"> TheclassicalVectorSpaceModel(VSM)hasbeen successfully employed to represent documents in text categorization tasks. The newer method of Latent Semantic Indexing (LSI) 3 (Deerwester et al., 1990) is a variant of the VSM in which documents are represented in a lower dimensional space created from the input training dataset. It is based on the assumption that there is some underlying latent semantic structure in the term-document matrix that is corrupted by the wide variety of words used in documents. This is referred to as the problem of polysemy and synonymy. The basicideaisthatiftwodocumentvectorsrepresent two very similar topics, many words will co-occur on them, and they will have very close semantic structures after dimension reduction.</Paragraph>
      <Paragraph position="1"> The SVD technique used by LSI consists in factoring term-document matrix M into the product of three matrices, M = USV T where S is a diagonal matrix of singular values in non-increasing order, andU andV are orthogonal matrices of singular vectors (term and document vectors, respectively). Matrix M can be approximated by a lower rank Mp which is calculated by using the p largest singular values of M. This operation is called dimensionality reduction, and the p-dimensional  space to which document vectors are projected is called the reduced space. Choosing the right dimension p is required for successful application of the LSI/SVD technique. However, since there is no theoretical optimum value for it, potentially expensive experimentation may be required to determine it (Berry and Browne, 1999).</Paragraph>
      <Paragraph position="2"> Fordocumentcategorizationpurposes(Dumais, 2004), the testing document q is also projected to the p-dimensional space, qp = qTUpS[?]1p , and the cosine is usually calculated to measure the semantic similarity between training and testing document vectors.</Paragraph>
      <Paragraph position="3"> In Figure 1 we can see an ilustration of the document vector projection. Documents in the training collection are represented by using the term-document matrix M, and each one of the documents is represented by a vector in the Rm vector space like in the traditional vector space model (VSM)scheme. Afterwards,thedimensionpisselected, and by applying SVD vectors are projected to the reduced space. Documents in the testing collection will also be projected to the same reduced space.</Paragraph>
      <Paragraph position="4">  reduced space by using SVD.</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
3.2 The k nearest neighbor classification algorithm (k-NN)
</SectionTitle>
      <Paragraph position="0"> k-NN is a distance-based classification approach. According to this approach, given an arbitrary testing document, the k-NN classifier ranks its nearest neighbors among the training documents, and uses the categories of the k top-ranking neighbors to predict the categories of the testing document (Dasarathy, 1991). In this paper, the training and testing documents are represented as reduced dimensional vectors in the lower dimensional space, and in order to find the nearest neighbors of a given document, we calculate the cosine similarity measure.</Paragraph>
      <Paragraph position="1"> In Figure 2 an ilustration of this phase can be seen, where some training documents and a testing document q are projected in the Rp reduced space. The nearest to the qp testing document are considered to be the vectors which have the smallest angle with qp. According to the category labels of the nearest documents, a category label prediction, c, will be made for testing document q.</Paragraph>
      <Paragraph position="2">  We have decided to use the k-NN classifier because it has been found that on the Reuters-21578 database it performs best among the conventional methods (Joachims, 1998; Yang, 1999) and because we have obtained good results in our previous work on text categorization for documents writteninBasque, ahighlyinflectedlanguage(Zelaia et al., 2005). Besides, the k-NN classification algorithm can be easily adapted to multilabel categorization problems such as Reuters.</Paragraph>
    </Section>
    <Section position="3" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
3.3 Combination of classifiers
</SectionTitle>
      <Paragraph position="0"> The combination of multiple classifiers has been intensively studied with the aim of improving the accuracy of individual components (Ho et al., 1994). Two widely used techniques to implement this approach are bagging (Breiman, 1996), that uses more than one model of the same paradigm; and boosting (Freund and Schapire, 1999), in which a different weight is given to different training examples looking for a better accuracy.</Paragraph>
      <Paragraph position="1"> In our experiment we have decided to construct a multiclassifier via bagging. In bagging, a set of training databasesTDi is generated by selectingn training examples drawn randomly with replacement from the original training database TD of n examples. When a set of n1 training examples,  n1 &lt; n, is chosen from the original training collection, the bagging is said to be applied by randomsubsampling. Thisistheapproachusedinour work. The n1 parameter has been selected via tuning. In Section 4.3 the selection will be explained in a more extended way.</Paragraph>
      <Paragraph position="2"> According to the random subsampling, given a testing document q, the classifier will make a label prediction ci based on each one of the training databases TDi. One way to combine the predictions is by Bayesian voting (Dietterich, 1998), where a confidence value cvicj is calculated for each training database TDi and category cj to be predicted. These confidence values have been calculated based on the original training collection. Confidence values are summed by category. The category cj that gets the highest value is finally proposed as a prediction for the testing document.</Paragraph>
      <Paragraph position="3"> In Figure 3 an ilustration of the whole experiment can be seen. First, vectors in the VSM are projected to the reduced space by using SVD. Next, random subsampling is applied to the training database TD to obtain different training databases TDi. Afterwards the k-NN classifier is applied for each TDi to make category label predictions. Finally, Bayesian voting is used to combine predictions, and cj, and in some cases ck as well, will be the final category label prediction of the categorization system for testing document q.</Paragraph>
      <Paragraph position="4"> In Section 4.3 the cases when a second category label prediction ck is given are explained.</Paragraph>
      <Paragraph position="6"/>
    </Section>
  </Section>
  <Section position="5" start_page="28" end_page="30" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> Theaimofthis sectionis todescribethedocument collection used in our experiment and to give an account of the preprocessing techniques and parameter settings we have applied.</Paragraph>
    <Paragraph position="1"> When machine learning and other approaches areappliedtotextcategorizationproblems, acommon technique has been to decompose the multiclass problem into multiple, independent binary classification problems. In this paper, we adopt a differentapproach. Wewillbeprimarilyinterested in a classifier which produces a ranking of possible labels for a given document, with the hope that the appropriate labels will appear at the top of the ranking.</Paragraph>
    <Section position="1" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.1 Document Collection
</SectionTitle>
      <Paragraph position="0"> As previously mentioned, the experiment reported in this paper has been carried out for the Reuters21578dataset4 compiledbyDavidLewisandoriginally collected by the Carnegie group from the Reuters newswire in 1987. We use one of the most widely used training/testing divisions, the &amp;quot;ModApte&amp;quot; split, in which 75 % of the documents (9,603documents)areselectedfortrainingandthe remaining 25 % (3299 documents) to test the accuracy of the classifier.</Paragraph>
      <Paragraph position="1"> Document distribution over categories in both thetrainingandthetestingsetsisveryunbalanced: the 10 most frequent categories, top-10, account 75% of the training documents; the rest is distributed among the other 108 categories.</Paragraph>
      <Paragraph position="2"> According to the number of labels assigned to each document, many of them (19% in training and 8.48% in testing) are not assigned to any category, and some of them are assigned to 12. We have decided to keep the unlabeled documents in both the training and testing collections, as it is suggested in (Lewis, 2004)5.</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
4.2 Preprocessing
</SectionTitle>
      <Paragraph position="0"> The original format of the text documents is in SGML. We perform some preprocessing to filter out the unused parts of a document. We preserved only the title and the body text, punctuation and numbers have been removed and all letters have been converted to lowercase. We have  lows: &amp;quot;If you are using a learning algorithm that requires each training document to have at least TOPICS category, you can screen out the training documents with no TOPICS categories. Please do NOT screen out any of the 3,299 documents - that will make your results incomparable with other studies.&amp;quot;  used the tools provided in the web6 in order to extract text and categories from each document. We have stemmed the training and testing documents by using the Porter stemmer (Porter, 1980)7. By usingit, caseandflectioninformationareremoved from words. Consequently, the same experiment has been carried out for the two forms of the document collection: word-forms and Porter stems.</Paragraph>
      <Paragraph position="1"> According to the dimension reduction, we have created the matrices for the two mentioned document collection forms. The sizes of the training matrices created are 15591 x 9603 for word-forms and 11114 x 9603 for Porter stems. Differentnumberofdimensionshavebeenexperimented null (p = 100,300,500,700).</Paragraph>
    </Section>
    <Section position="3" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
4.3 Parameter setting
</SectionTitle>
      <Paragraph position="0"> We have designed our experiment in order to optimize the microaveraged F1 score. Based on previous experiments (Zelaia et al., 2005), we have set parameter k for the k-NN algorithm to k = 3.</Paragraph>
      <Paragraph position="1"> This way, the k-NN classifier will give a category label prediction based on the categories of the 3 nearest ones.</Paragraph>
      <Paragraph position="2"> On the other hand, we also needed to decide the number of training databases TDi to create. It has to be taken into account that a high number of training databases implies an increasing computational cost for the final classification system. We decided to create 30 training databases. However, this is a parameter that has not been optimized.</Paragraph>
      <Paragraph position="3"> Therearetwootherparameterswhichhavebeen tuned: the size of each training database and the threshold for multilabeling. We now briefly give some cues about the tuning performed.</Paragraph>
      <Paragraph position="4"> 4.3.1 The size of the training databases As we have previously mentioned, documents have been randomly selected from the original training database in order to construct the 30 training databases TDi used in our classification system. There are n = 9,603 documents in the original Reuters training collection. We had to decide the number of documents to select in order to construct each TDi. The number of documents selected from each category preserves the proportion of documents in the original one. We have experimented to select different numbers n1 &lt; n</Paragraph>
      <Paragraph position="6"> where ti is the total number of training documents in category i. In Figure 4 it can be seen the variation of the n1 parameter depending on the value of parameter j. We have experimented different j values, and evaluated the results. Based on the results obtained we decided to select j = 60, which means that each one of the 30 training databases will have n1 = 298 documents. As we can see, the final classification system will be using trainingdatabaseswhicharequitesmallerthattheorig- null inal one. This gives a lower computational cost, and makes the classification system faster.</Paragraph>
      <Paragraph position="7">  The k-NN algorithm predicts a unique category label for each testing document, based on the ranked list of categories obtained for each training database TDi8. As previously mentioned, we use Bayesian voting to combine the predictions.</Paragraph>
      <Paragraph position="8"> TheReuters-21578isamultilabeldatabase, and therefore, we had to decide in which cases to assignasecondcategorylabeltoatestingdocument. null Given thatcj is the category with the highest value in Bayesian voting and ck the next one, the second ck category label will be assigned when the following relation is true: cvck &gt; cvcj x r, r = 0.1,0.2,...,0.9,1  InFigure5wecanseethemeannumberofcategories assigned to a document for different values 8It has to be noted that unlabeled documents have been preserved, and thus, our classification system treats unlabeled documents as documents of a new category  of r. Results obtained were evaluated and based on them we decided to select r = 0.4, which corresponds to a ratio of 1.05 categories.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>