<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1155">
  <Title>Multi-Dimensional Text Classification</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Multi-Dimensional Category Model for Text Classification
</SectionTitle>
    <Paragraph position="0"> Categories are a powerful tool for managing large numbers of text documents. By grouping text documents into a set of categories, we can efficiently store or search for the information we need. The structure of the categories, called the category model, thus becomes one of the most important factors determining how efficiently text documents can be organized. In the past, two traditional category models, the flat and hierarchical category models, were applied to organizing text documents. However, these models have several disadvantages. Under the flat category model, browsing and searching become difficult as the number of categories grows. Under the hierarchical category model, constructing a good hierarchy is a complicated task: in many cases it is not intuitive to determine the upward/downward relations among categories, and several different hierarchies are possible for the same document set. Moreover, because hierarchies in the hierarchical category model are static, browsing and searching documents along the hierarchy always proceed in a fixed order, from the root to a leaf node, so searching flexibility is lost.</Paragraph>
    <Paragraph position="1"> As an alternative to the flat and hierarchical category models, the multi-dimensional category model is introduced. The concept of a multi-dimensional data model is well known in the field of database technology, where it has been shown to be powerful for modeling data warehouses and OLAP, allowing users to store, view and utilize relational data efficiently (Jiawei and Micheline, 2001). This section describes a way to apply the multi-dimensional data model to text classification, a so-called multi-dimensional category model. The proposed model is an extension of the flat category model in which documents are classified not into a single set of categories but into multiple sets. Each set of categories can be viewed as a dimension, in the sense that documents may be classified into different kinds of categories. For example, in Figure 1, a set of news issues (documents) is classified along three dimensions, TOPIC, ZONE and MOOD, comprising {sports, economics, politics, social, entertainment, science and technology}, {domestic, intra-continental, inter-continental} and {good news, bad news, neutral news}, respectively. A news issue in a Thailand newspaper titled &amp;quot;Airplanes attacked World Trade Center&amp;quot; can be classified as &amp;quot;social news&amp;quot;, &amp;quot;inter-continental&amp;quot; and &amp;quot;bad news&amp;quot; in the first, second and third dimensions, respectively.</Paragraph>
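    <Paragraph> The labelling scheme above can be pictured as a small data structure. The following Python fragment is our own illustration (the validate_labels helper is hypothetical, not from the paper); it records one category per dimension for the example news issue:

```python
# Sketch of the multi-dimensional category model: each document is
# labelled once per dimension, independently. The dimension and category
# names follow the TOPIC/ZONE/MOOD example in the text.
DIMENSIONS = {
    "TOPIC": {"sports", "economics", "politics", "social",
              "entertainment", "science and technology"},
    "ZONE": {"domestic", "intra-continental", "inter-continental"},
    "MOOD": {"good news", "bad news", "neutral news"},
}

def validate_labels(labels):
    """Check that a document carries exactly one valid label per dimension."""
    if set(labels) != set(DIMENSIONS):
        raise ValueError("a label is required for every dimension")
    for dim, cat in labels.items():
        if cat not in DIMENSIONS[dim]:
            raise ValueError(f"{cat!r} is not a category of {dim}")
    return True

# The news issue from the running example:
doc_labels = {"TOPIC": "social", "ZONE": "inter-continental", "MOOD": "bad news"}
print(validate_labels(doc_labels))  # True
```
    </Paragraph>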
    <Paragraph position="2">  Compared with the flat and hierarchical category models, the multi-dimensional model has the following merits. First, it is more natural than the flat model in that a document can be classified based not on a single criterion (one dimension) but on multiple criteria (multiple dimensions). Second, in contrast with the hierarchical model, it allows us to browse or search documents flexibly, without the order constraint imposed by the hierarchy's structure. Lastly, the multi-dimensional category model can be transformed to and represented by flat or hierarchical category models, even though the converse transformations are not always intuitive.</Paragraph>
    <Paragraph position="3"> Continuing the previous example, the flat and hierarchical models corresponding to the multi-dimensional model in Figure 1 are illustrated in Figures 2 and 3, respectively. The total number of derived flat categories equals the product of the numbers of categories in the dimensions, i.e., 54 (=6x3x3). In the derived hierarchical model, the number of leaf categories is likewise 54, but there are also 24 (=6+6x3) internal categories. Note that the figure shows only one possible hierarchy, in which the dimensions are ordered TOPIC, ZONE and MOOD; in total there are 6 (=3!) possible hierarchies for the model in Figure 1.</Paragraph>
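    <Paragraph> The counts in this paragraph follow directly from the dimension sizes; a few lines of Python (assuming the 6/3/3 sizes of Figure 1) verify them:

```python
from math import factorial, prod

# Category counts per dimension from Figure 1: TOPIC has 6, ZONE 3, MOOD 3.
dims = [6, 3, 3]

flat_leaves = prod(dims)  # every combination becomes one flat category -> 54

# Internal nodes of the derived hierarchy for one fixed dimension order:
# one node per category at level 1, one per (level-1, level-2) pair, etc.
internal = sum(prod(dims[:i]) for i in range(1, len(dims)))  # 6 + 18 = 24

orderings = factorial(len(dims))  # possible dimension orders -> 6 hierarchies

print(flat_leaves, internal, orderings)  # 54 24 6
```
    </Paragraph>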
    <Paragraph position="4"> From the viewpoint of category representation, the derived flat model enumerates all combinations of categories, which makes the representation of a class more precise than that of a class in the multi-dimensional model. From the viewpoint of relationship constraints, however, the derived flat category model ignores the relationships among categories, while the derived hierarchical model declares such relationships explicitly and rigidly; the multi-dimensional model is a compromise between these two. These differences affect classification efficiency, as shown in the next section.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Multi-Dimensional Classification
</SectionTitle>
    <Paragraph position="0"> As described in the previous section, a multi-dimensional category model can be transformed into flat and hierarchical category models. As a result, there are three different classification strategies: flat-based, hierarchical-based and multi-dimensional-based methods.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Flat-based classification
</SectionTitle>
      <Paragraph position="0"> The naive method for classifying documents according to a multi-dimensional model is flat-based classification. After the multi-dimensional category model is transformed into a flat category model, traditional flat classification is applied directly to the derived flat categories. The granularity of the derived flat categories is finer than that of the original multi-dimensional categories, since all combinations of classes across the dimensions are enumerated. This implies that a flat category represents its class more precisely than a multi-dimensional category does, so one can expect high classification accuracy. On the other hand, the number of training documents per class is reduced. As a consequence, flat classification may suffer from sparseness of training data, which makes classification harder and can reduce accuracy. In terms of computational cost, a test document has to be compared with all the enumerated classes, resulting in high computation time.</Paragraph>
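      <Paragraph> The enumeration step described above amounts to a Cartesian product over the dimensions. A minimal sketch, with abbreviated category names of our own choosing:

```python
from itertools import product

# Deriving the flat label set: every combination of per-dimension
# categories becomes one flat class, so a test document must be
# compared against all of them.
topic = ["sports", "economics", "politics", "social", "entertainment", "sci-tech"]
zone = ["domestic", "intra-continental", "inter-continental"]
mood = ["good", "bad", "neutral"]

flat_classes = list(product(topic, zone, mood))
print(len(flat_classes))  # 54
```
      </Paragraph>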
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Hierarchical-based classification
</SectionTitle>
      <Paragraph position="0"> The second method is to transform the multi-dimensional category model into a hierarchical category model and then apply standard hierarchical classification to the derived hierarchy. As described in Section 2, several hierarchies can be generated from one multi-dimensional model, depending on the order of the dimensions. Classification proceeds along the hierarchy from the root to a leaf; the decision on the class to which a document belongs is made step by step. Classifications at different levels operate on different granularities of training data. Nodes at levels close to the root have coarser granularity, so they represent their classes less precisely, but more training documents are available for them. Conversely, nodes near the leaves have finer granularity and thus more precise representations, but fewer training documents. The classification accuracy varies with the order of the dimensions in the hierarchy.</Paragraph>
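      <Paragraph> The top-down procedure can be sketched as follows; classify_level is a hypothetical stand-in for whatever single-level classifier (k-NN, NB or CB) is trained at each node, and here it is only a stub:

```python
# Hedged sketch of top-down hierarchical classification for one
# fixed dimension order (TOPIC -> ZONE -> MOOD).
def classify_level(doc, candidates):
    # Stub decision; a real classifier would score `doc` against
    # training data restricted to each candidate class.
    return candidates[0]

def hierarchical_classify(doc, dimension_order):
    """Walk the derived hierarchy from root to leaf, one dimension per level."""
    path = []
    for dim_name, categories in dimension_order:
        path.append((dim_name, classify_level(doc, categories)))
    return path

order = [("TOPIC", ["sports", "economics"]),
         ("ZONE", ["domestic", "inter-continental"]),
         ("MOOD", ["good", "bad"])]
print(hierarchical_classify("some news text", order))
# [('TOPIC', 'sports'), ('ZONE', 'domestic'), ('MOOD', 'good')]
```
      </Paragraph>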
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Multi-dimensional-based classification
</SectionTitle>
      <Paragraph position="0"> It is possible to directly classify a document using the multi-dimensional category model.</Paragraph>
      <Paragraph position="1"> The class of the document in each dimension is determined independently. We call this multi-dimensional-based classification. Compared with flat-based classification, the granularity of multi-dimensional classification is coarser: for each dimension, a document is classified against the categories of that dimension only, instead of against the set of finer combined categories used in flat classification. Although a multi-dimensional category does not represent any finer category precisely, the number of training documents per class is relatively high. As a consequence, multi-dimensional classification attains high accuracy in each dimension, and hence high overall classification accuracy, when the number of training documents is small. It also runs faster than flat-based classification, since fewer classes need to be compared.</Paragraph>
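      <Paragraph> The speed claim above can be made concrete: multi-dimensional classification compares a document against the sum of the dimension sizes, while flat classification compares it against their product. A small check, assuming the 6/3/3 dimensions of Figure 1:

```python
# Comparison counts for one test document under the two strategies.
dims = {"TOPIC": 6, "ZONE": 3, "MOOD": 3}

# Multi-dimensional: each dimension is classified independently,
# so comparisons are the SUM of the dimension sizes.
md_comparisons = sum(dims.values())  # 6 + 3 + 3 = 12

# Flat: every combined category is a candidate,
# so comparisons are the PRODUCT of the dimension sizes.
flat_comparisons = 1
for n in dims.values():
    flat_comparisons *= n  # 6 * 3 * 3 = 54

print(md_comparisons, flat_comparisons)  # 12 54
```
      </Paragraph>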
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> To investigate the efficiency of text classification on the multi-dimensional category model, three well-known classification algorithms, namely the k-nearest neighbor (k-NN), naive Bayesian (NB) and centroid-based (CB) approaches, are applied.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 k-NN Classifier
</SectionTitle>
      <Paragraph position="0"> As a similarity-based method, the k-nearest neighbor (k-NN) classifier is applied to our text classification. First, the classifier finds the k most similar documents (i.e., the k nearest neighbors) of the test document being classified.</Paragraph>
      <Paragraph position="1"> The similarity of the test document to a class is then computed by summing the similarities of those documents, among the k neighbors, whose class equals that class. The test document is assigned the class with the highest similarity. Two parameters are involved: the definition of similarity and the number k. While the standard similarity is defined as tfxidf, a variant</Paragraph>
      <Paragraph position="3"> )xidf, which performed better in our preliminary experiments, is applied in this work. The parameter k is determined by the experiments shown in the next section.</Paragraph>
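      <Paragraph> The decision rule just described (sum similarities per class over the k nearest neighbors, then take the maximum) can be sketched as follows; the similarity values are toy numbers, not computed from any particular weighting scheme:

```python
from collections import defaultdict

def knn_classify(neighbors, k):
    """neighbors: list of (similarity, class_label) pairs for all
    training documents; keep the k most similar, sum similarity per
    class, and return the class with the largest total."""
    top_k = sorted(neighbors, reverse=True)[:k]
    score = defaultdict(float)
    for sim, label in top_k:
        score[label] += sim
    return max(score, key=score.get)

neighbors = [(0.9, "sports"), (0.8, "politics"), (0.7, "sports"), (0.2, "politics")]
print(knn_classify(neighbors, k=3))  # sports: 0.9 + 0.7 beats politics: 0.8
```
      </Paragraph>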
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Naive Bayes Classifier
</SectionTitle>
      <Paragraph position="0"> The standard naive Bayesian (NB) classifier is applied as a statistical approach to our text classification in this work. For each document, the classifier first calculates the posterior probability P(c</Paragraph>
      <Paragraph position="2"> |d) that the document belongs to each class c and assigns the document to the class with the highest posterior probability. Basically, a document d can be represented by a bag of words (i.e., a vector of the occurrence frequencies of the words in the document). NB assumes that the effect of a word's occurrence on a given class is independent of the occurrences of other words. Under this assumption, an NB classifier finds the most probable class for the document.</Paragraph>
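      <Paragraph> A minimal sketch of such an NB classifier, with toy training documents and add-one smoothing (a common choice; the paper does not specify its smoothing):

```python
import math
from collections import Counter

# Naive Bayes with the word-independence assumption: the chosen class
# maximizes log P(c) + sum of log P(w | c) over the words w in the document.
def train_nb(docs_by_class):
    model = {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        model[c] = (
            math.log(len(docs) / total_docs),  # log prior
            {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab},
        )
    return model

def nb_classify(model, words):
    def score(c):
        prior, likelihood = model[c]
        return prior + sum(likelihood.get(w, 0.0) for w in words)
    return max(model, key=score)

model = train_nb({
    "sports": [["goal", "match"], ["team", "match"]],
    "economics": [["market", "stock"], ["stock", "trade"]],
})
print(nb_classify(model, ["match", "goal"]))  # sports
```
      </Paragraph>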
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Centroid-based Classifier
</SectionTitle>
      <Paragraph position="0"> Applied in our implementation is a variant of centroid-based classification (CB) with weighting methods different from the standard tfidf weighting. A centroid-based classifier (CB) is a modified version of the k-NN classifier. Instead of comparing the test document with all training documents, CB calculates a centroid (a vector) for the training documents in each class and compares the test document with these centroids to find the most probable (most similar) class. A simple centroid-based classifier represents a document as a vector, each dimension of which expresses a term in the document with a tfxidf weight. The resultant vector is</Paragraph>
      <Paragraph position="2"> normalized by the document length to a unit-length vector. A different version of the centroid vector is the so-called prototype vector (Chuang et al., 2000). Instead of normalizing each vector in the class before calculating the centroid, the prototype vector is calculated by normalizing the summation of all document vectors in the class. Both the centroid-based and prototype-vector methods obtained high classification accuracy with small time complexity. In our implementation, we use a variant of the prototype vector that does not apply the standard tf-idf but uses either of the following weighting formulas, which we call CB1 and CB2; they were empirically shown to work well in (Theeramunkong and Lertnattee, 2001).</Paragraph>
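      <Paragraph> The difference between the centroid and prototype vectors lies only in where the normalization happens. A small sketch with plain term frequencies as weights (the paper's CB1/CB2 weightings would replace these):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def centroid(doc_vectors):
    """Average of length-normalized document vectors."""
    normed = [normalize(v) for v in doc_vectors]
    n = len(normed)
    return [sum(col) / n for col in zip(*normed)]

def prototype(doc_vectors):
    """Normalization of the SUM of raw document vectors."""
    summed = [sum(col) for col in zip(*doc_vectors)]
    return normalize(summed)

docs = [[3.0, 4.0], [6.0, 8.0]]  # same direction, different lengths
print(centroid(docs))   # [0.6, 0.8]
print(prototype(docs))  # [0.6, 0.8] -- identical here; the two vectors
                        # differ when documents point in different directions
```
      </Paragraph>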
      <Paragraph position="3"> In formula (2), icsd stands for the inter-class standard deviation, tfrms is the root mean square of the document term frequency in a class, and sd denotes the standard deviation. After this weighting, a prototype vector is constructed for each class. Due to the length limitation of the paper, we omit the details of these formulas; the full description can be found in (Theeramunkong and Lertnattee, 2001).</Paragraph>
    </Section>
  </Section>
</Paper>