File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/c02-1155_intro.xml

Size: 3,241 bytes

Last Modified: 2025-10-06 14:01:24

<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1155">
  <Title>Multi-Dimensional Text Classification</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Most previous work on text classification has focused on assigning documents to a flat, predefined set of categories (or classes) (Lewis and Ringuette, 1994; Eui-Hong and Karypis, 2000), with no structural relationships among the categories. Many existing collections are organized in this flat structure, such as Reuters newswire, OHSUMED and TREC. To improve classification accuracy, a variety of learning techniques have been developed, including regression models (Yang and Chute, 1992), nearest neighbour classification (Yang and Liu, 1999), Bayesian approaches (Lewis and Ringuette, 1994; McCallum et al., 1998), decision trees (Lewis and Ringuette, 1994), neural networks (Wiener et al., 1995) and support vector machines (Dumais and Chen, 2000). However, flat categories become very difficult to browse or search when the number of categories is large. A natural extension is to arrange documents in a topic hierarchy instead of a simple flat structure. When people organize extensive data sets into fine-grained classes, a topic hierarchy is often employed to make the large collection of classes (categories) more manageable. This structure is known as a category hierarchy. Many popular search engines and text databases apply this structure, such as Yahoo, Google Directory, Netscape search and MEDLINE. Several recent works attempt to automate text classification based on such a category hierarchy (McCallum et al., 1998; Chuang W. T. et al., 2000). However, with a large number of classes or a large hierarchy, the training data per class becomes sparse at the lower levels of the hierarchy, which decreases classification accuracy for the lower-level classes. A further problem is that a traditional category hierarchy may be too rigid to construct, since several different category hierarchies are possible for the same data set.</Paragraph>
    <Paragraph position="1"> To cope with these problems, this paper proposes a new framework for text classification, called the multi-dimensional framework. The framework allows multiple predefined sets of categories (viewed as multiple dimensions) instead of a single flat set of categories. Each set of classes, with training examples (documents) attached to each class, represents one criterion for classifying a new text document, so multiple sets of classes provide several criteria. A document is classified under each of these criteria (dimensions) and is assigned one class per criterion (dimension). The multi-dimensional approach has two merits: (1) it supports multiple viewpoints of classification, and (2) it offers a solution to the data sparseness problem. The efficiency of multi-dimensional classification is investigated using three classifiers: k-NN, naive Bayes and centroid-based methods.</Paragraph>
  </Section>
</Paper>
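
The idea of assigning one class per dimension, described in the second paragraph of the introduction, can be illustrated with a small sketch. It is not the authors' implementation: the dimension names (`topic`, `genre`), the toy training documents, and all function names are hypothetical, and the classifier is a bare-bones centroid-based method using raw term frequencies and cosine similarity, chosen only because centroid-based classification is one of the three methods the paper names.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(examples):
    """Build one summed term vector per class from (text, label) pairs.

    Cosine similarity ignores vector length, so the sum points in the
    same direction as the mean and can stand in for the centroid.
    """
    centroids = defaultdict(Counter)
    for text, label in examples:
        centroids[label].update(vectorize(text))
    return dict(centroids)


def train_multi(training):
    """Train an independent centroid model for every dimension."""
    by_dimension = defaultdict(list)
    for text, labels in training:
        for dimension, label in labels.items():
            by_dimension[dimension].append((text, label))
    return {dim: train_centroids(ex) for dim, ex in by_dimension.items()}


def classify_multi(text, models):
    """Assign one class per dimension: the class with the closest centroid."""
    vec = vectorize(text)
    return {
        dim: max(centroids, key=lambda label: cosine(vec, centroids[label]))
        for dim, centroids in models.items()
    }


# Hypothetical toy data: every document carries one label per dimension.
TRAINING = [
    ("football match score goal report", {"topic": "sports", "genre": "news"}),
    ("football team great fun opinion", {"topic": "sports", "genre": "review"}),
    ("election vote parliament report", {"topic": "politics", "genre": "news"}),
    ("election candidate weak opinion", {"topic": "politics", "genre": "review"}),
]

models = train_multi(TRAINING)
print(classify_multi("football team opinion", models))
```

In this toy setting, two dimensions with two classes each cover four label combinations, and every class is trained on two documents. A single flat set of four combined classes would leave one document per class, which is a small-scale picture of the data sparseness argument made above.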