<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1155">
  <Title>Multi-Dimensional Text Classification</Title>
  <Section position="6" start_page="0" end_page="12" type="evalu">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Two data sets, WebKB and the Drug Information collection (DI), are used for evaluating our multi-dimensional model. These two data sets can be viewed as two-dimensional category models as follows. Composed of 8,145 web pages, the WebKB data set is a collection of web pages from the computer science departments of four universities, with some additional pages from other universities. The original collection is divided into seven classes (1st dimension): student, faculty, staff, course, project, department and others. Within each class, five subclasses (2nd dimension) are defined according to the university a web page belongs to: Cornell, Texas, Washington, Wisconsin and miscellaneous. In our experiment, we use the four most popular classes: student, faculty, course and project. This amounts to 4,199 web pages. Drug information, the second data set, is a collection of web documents collected from www.rxlist.com. This collection is composed of 4,480 web pages providing information about widely used drugs in seven topics (1st dimension): adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. There exists exactly one page for each drug in each class, i.e., the number of recorded drugs is 640 (=4480/7). Moreover, we manually grouped the drugs according to major pharmacological actions, resulting in five classes (2nd dimension): chemotherapy (Chem), neuro-muscular system (NMS), cardiovascular &amp; hematopoietic (CVS), hormone (Horm) and respiratory system (Resp). The multi-dimensional classification is tested using four algorithms: k-NN, NB and two centroid-based classifiers (CB1 and CB2). In k-NN, the parameter k is set to 20 for WebKB and 35 for DI. For the centroid-based method, the applied weighting systems are those shown in Section 4.3. All experiments were performed with 10-fold cross validation. That is, 90% of the documents are kept as a training set while the remaining 10% are used for testing. The performance was measured by classification accuracy, defined as the ratio between the number of documents assigned their correct classes and the total number of test documents. As preprocessing, some stop words (e.g., a, an, the) and all tags (e.g., &lt;B&gt;, &lt;/HTML&gt;) were removed from the documents to eliminate the effect of these common and typographic words. In the rest of this section, the results of flat and hierarchical classification on the two data sets are presented first, followed by those of multi-dimensional classification. Finally, an overall discussion is given.</Paragraph>
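The accuracy measure and tag/stop-word preprocessing described above can be sketched as follows (a minimal illustration; the function names, the tiny stop list and the regular expressions are our own, not the paper's):

```python
import re

STOP_WORDS = {"a", "an", "the"}  # small illustrative stop list

def preprocess(html_text):
    """Strip tags (e.g., <B>, </HTML>) and stop words from a document."""
    text = re.sub(r"<[^>]+>", " ", html_text)        # drop all tags
    tokens = re.findall(r"[a-z]+", text.lower())     # naive tokenization
    return [t for t in tokens if t not in STOP_WORDS]

def accuracy(predicted, correct):
    """Ratio of documents assigned their correct class to all test docs."""
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)
```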
    <Section position="1" start_page="0" end_page="12" type="sub_section">
      <SectionTitle>
5.1 Flat-based Classification
</SectionTitle>
      <Paragraph position="0"> In this experiment, test documents are classified into the most specific classes, i.e., combined classes of the form D1xD2. Therefore, the number of classes equals the product of the numbers of classes in the two dimensions: 20 (=5x4) classes for WebKB and 35 (=7x5) classes for DI. A test document was assigned the class that gained the highest score from the applied classifier. Table 1 displays the classification accuracy of flat classification on the WebKB and DI data sets. Here, two measures, two-dimension and single-dimension accuracy, are taken into account. In the table, D1xD2 shows the two-dimension accuracy, where a test document is assigned its completely correct class. D1 and D2, the single-dimension accuracies, denote the accuracy of the first and second dimensions, respectively, where the class in each single dimension is derived from the result class D1xD2.</Paragraph>
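Flat classification over the product label space can be sketched as follows (a hypothetical sketch: `score` stands in for any of the paper's classifiers, and the label constants are the WebKB classes named above):

```python
from itertools import product

# Assumed label sets (WebKB: 4 selected classes x 5 universities).
DIM1 = ["student", "faculty", "course", "project"]
DIM2 = ["Cornell", "Texas", "Washington", "Wisconsin", "misc"]

def flat_classify(doc, score):
    """Assign the highest-scoring combined class over the product space.

    `score(doc, label)` is an assumed callable standing in for any
    classifier scoring function (k-NN, NB, CB1 or CB2).
    """
    labels = list(product(DIM1, DIM2))  # 4 x 5 = 20 flat classes
    return max(labels, key=lambda lab: score(doc, lab))
```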
      <Paragraph position="1"> The results show that the centroid-based classifiers perform better than k-NN and NB; CB1 and CB2 work well on WebKB and DI, respectively. Even when the two-dimension accuracy is low, high single-dimension accuracy is obtained.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
5.2 Hierarchical-based Classification
</SectionTitle>
      <Paragraph position="0"> Since there are two dimensions in each data set, hierarchical-based classification can be performed in two different ways according to the classifying order. In the first version, documents are classified based on the first dimension to determine the class to which they belong; they are then classified again according to the second dimension using the model of that class. The other version classifies documents based on the second dimension first and then the first dimension. The results are shown in Table 2. In the tables, D1, D2 and D1xD2 denote the accuracy of the first dimension, the accuracy of the second dimension and the two-dimension accuracy, respectively; here the second-dimension accuracy is that obtained using the result of classifying on the first dimension.</Paragraph>
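The two-stage procedure just described can be sketched as follows (names are ours; the reverse ordering simply swaps the roles of the two dimensions):

```python
def hierarchical_classify(doc, classify_dim1, dim2_models):
    """Two-stage classification: first dimension, then second dimension.

    classify_dim1(doc)   -> predicted class in the first dimension
    dim2_models[c1](doc) -> predicted class in the second dimension,
                            using the model trained for class c1.
    """
    c1 = classify_dim1(doc)
    c2 = dim2_models[c1](doc)
    return c1, c2
```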
      <Paragraph position="1"> From the results, we found that the centroid-based classifiers again perform better than k-NN and NB, and that CB1 works well on WebKB while CB2 gains the highest accuracy on DI. In almost all cases, the hierarchical-based classification performs better than the flat-based classification. Moreover, an interesting observation is that classifying on the worse dimension before the better one yields a better result.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
5.3 Multi-dimensional Classification
</SectionTitle>
      <Paragraph position="0"> In the last experiment, multi-dimensional classification is investigated. Documents are classified twice based on two dimensions independently. The results of the first and second dimensions are combined to be the suggested class for a test document. The classification accuracy of multi-dimensional classification is shown in Table 3.</Paragraph>
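The independent, per-dimension classification with combined results can be sketched as follows (a minimal sketch with our own names):

```python
def multidim_classify(doc, per_dim_classifiers):
    """Classify a document once per dimension, independently, and
    combine the per-dimension predictions into one suggested class."""
    return tuple(classify(doc) for classify in per_dim_classifiers)
```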
      <Paragraph position="1"> D1xD2 is the two-dimension accuracy of the class formed by combining the classes suggested in the first and second dimensions. From the results, we found that CB1 performs well on WebKB but NB gains the highest accuracy on DI. The multi-dimensional classification outperforms flat classification in most cases, but sometimes the hierarchical-based classification performs better.</Paragraph>
    </Section>
    <Section position="4" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
5.4 Overall Evaluation and Discussion
</SectionTitle>
      <Paragraph position="0"> Two accuracy criteria are considered: (1) all dimensions are correct, or (2) some dimensions are correct. The classification accuracy based on the first criterion has been shown in all previous tables as the two-dimension accuracy. Under the second criterion, a classification is counted when at least one dimension is correct.</Paragraph>
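The two criteria can be sketched as follows (our own formulation, treating each prediction and gold label as a tuple with one entry per dimension):

```python
def all_dims_accuracy(predicted, gold):
    """Criterion (1): a document counts only if every dimension is correct."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def some_dims_accuracy(predicted, gold):
    """Criterion (2): a document counts if at least one dimension is correct."""
    hits = sum(any(pi == gi for pi, gi in zip(p, g))
               for p, g in zip(predicted, gold))
    return hits / len(gold)
```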
      <Paragraph position="1"> The result is summarized in Table 4. The multi-dimensional classification outperforms other two methods for WebKB but the hierarchical-based classification sometimes works better for DI.</Paragraph>
      <Paragraph position="2"> (Table 4: classification accuracy when some dimensions are correct.)</Paragraph>
      <Paragraph position="3"> From this result, some observations can be made. Two tradeoff factors affect the classification accuracy of the multi-dimensional category model: training set size and the granularity of classes. The flat-based classification in the multi-dimensional model deals with the finest granularity of classes, because all combinations of classes from the predefined dimensions are combined to form one large set of classes. Although this precise representation of classes may increase accuracy, the flat-based classification suffers from the sparseness problem: the number of training documents per class is reduced, and accuracy is low when the training set is small. The multi-dimensional-based classification copes with the coarsest granularity of classes. The number of training documents per class is therefore larger than in the flat-based approach, but the representation of classes is less exact. However, it works well when the training set is relatively small. The hierarchical-based classification occupies a medium granularity of classes; however, its training sets are smaller than those of the multi-dimensional approach at the lower levels of the hierarchy. It works well when the training set is of medium size.</Paragraph>
    </Section>
  </Section>
</Paper>