<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2604">
  <Title>Basque Country ccpzejaa@si.ehu.es Iñaki Alegria UPV-EHU Basque Country acpalloi@si.ehu.es Olatz Arregi UPV-EHU Basque Country acparuro@si.ehu.es</Title>
  <Section position="2" start_page="0" end_page="25" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Document Categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. Researchers have concentrated their efforts on finding appropriate ways to represent documents, index them, and construct classifiers that assign the correct categories to each document. Both document representation and the classification method are crucial steps in the categorization process.</Paragraph>
    <Paragraph position="1"> In this paper we address both issues. On the one hand, we use Latent Semantic Indexing (LSI) (Deerwester et al., 1990), a variant of the vector space model (VSM) (Salton and McGill, 1983), to obtain the vector representation of documents. This technique compresses the vectors representing documents into vectors of a lower-dimensional space. LSI, which is based on the Singular Value Decomposition (SVD) of matrices, has been shown to extract the relations among words and documents by means of their context of use, and has been successfully applied to Information Retrieval tasks.</Paragraph>
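The dimensionality reduction described above can be illustrated with a minimal sketch. It assumes a small dense terms-by-documents matrix and uses NumPy's SVD; the function names lsi_project and fold_in, and the choice of k, are hypothetical and are not taken from the paper.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Reduce a terms-by-documents matrix to k-dimensional document vectors.

    Hypothetical helper: returns the k leading left singular vectors and
    one k-dimensional row per document.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    # Document coordinates in the reduced space: columns of diag(s_k) @ Vt_k.
    doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T
    return U_k, doc_vectors

def fold_in(term_vector, U_k):
    """Project a term vector into the same reduced space (U_k^T d)."""
    return U_k.T @ term_vector

# Toy matrix: 5 terms (rows) by 4 documents (columns); values are made up.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])
U_k, doc_vectors = lsi_project(A, k=2)
```

Projecting a document that is already in the matrix with fold_in gives the same coordinates as its row in doc_vectors, which is why unseen test documents can be mapped into the reduced space in the same way.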
    <Paragraph position="2"> On the other hand, we construct a multiclassifier (Ho et al., 1994) which uses different training databases. These databases are obtained from the original training set by random subsampling.</Paragraph>
    <Paragraph position="3"> We implement this approach by bagging, and use the k-NN classification algorithm to predict the categories of test documents. Finally, we combine all the predictions made for a given document by Bayesian voting.</Paragraph>
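The overall shape of such a multiclassifier can be sketched as follows. This is an illustration under several assumptions rather than the authors' system: it is single-label, it draws each training database by sampling with replacement, it uses cosine similarity for k-NN, and it combines predictions by plain majority vote as a stand-in for the Bayesian voting scheme. All function and parameter names are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Label x by majority among its k most cosine-similar training rows."""
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(x)
    sims = (train_X @ x) / np.where(norms == 0, 1.0, norms)
    nearest = np.argsort(-sims)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def bagged_knn_predict(train_X, train_y, x, n_bags=11, k=3, seed=0):
    """Train-free bagging: one k-NN vote per resampled training database."""
    rng = np.random.default_rng(seed)
    n = len(train_y)
    votes = []
    for _ in range(n_bags):
        # Assumption: each database is a bootstrap sample of the original set.
        idx = rng.integers(0, n, size=n)
        bag_y = [train_y[i] for i in idx]
        votes.append(knn_predict(train_X[idx], bag_y, x, k))
    # Simple majority vote (not the Bayesian voting used in the paper).
    return Counter(votes).most_common(1)[0][0]

# Toy data: two well-separated groups of 2-dimensional document vectors.
X = np.array([
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.8, 0.1], [0.9, 0.0],
    [0.0, 1.0], [0.1, 0.9], [0.1, 1.0], [0.1, 0.8], [0.0, 0.9],
])
y = ["a"] * 5 + ["b"] * 5
label_near_a = bagged_knn_predict(X, y, np.array([1.0, 0.05]))
label_near_b = bagged_knn_predict(X, y, np.array([0.05, 1.0]))
```

A multilabel variant would keep every category whose combined score passes a threshold instead of returning a single winner.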
    <Paragraph position="4"> The experiment we present has been evaluated on the standard Reuters-21578 document collection.</Paragraph>
    <Paragraph position="5"> Reuters-21578 is a multilabel document collection, which means that categories are not mutually exclusive: the same document may be relevant to more than one category. Being aware of the results published in the most recent literature, and having obtained good results in our experiments, we consider the categorization method presented in this paper an interesting contribution to text categorization tasks.</Paragraph>
    <Paragraph position="6"> The remainder of this paper is organized as follows. Section 2 discusses related work on document categorization for the Reuters-21578 collection. In Section 3, we present our approach to the multilabel text categorization task. In Section 4, the experimental setup is introduced, and details about the Reuters database, the preprocessing applied, and the parameter settings are provided. In Section 5, experimental results are presented and discussed. Finally, Section 6 contains some conclusions and comments on future work.</Paragraph>
  </Section>
</Paper>