<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1045">
  <Title>A Method of Cluster-Based Indexing of Textual Data</Title>
  <Section position="5" start_page="0" end_page="1" type="evalu">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
4.1 The Data Set
</SectionTitle>
      <Paragraph position="0"> In our experiments, we used NTCIR-J1  ,a Japanese text collection for retrieval tasks that is composed of abstracts of conference papers organized by Japanese academic societies. In preparing the data for the experiments, we first selected 52,867 papers from five different societies: 23,105 from the Society of Polymer Science, Japan (SPSJ), 20,482 from the Japan Society of Civil Engineers (JSCE), 4,832 from the Japan Society for Precision Engineering (JSPE), 2,434 from the Ecological Society of Japan (ESJ), and 2,014 from the Japanese Society for Artificial Intelligence (JSAI). The papers were then analyzed by the morphological analyzer ChaSen Ver.2.02 (Matsumoto et al., 1999) to extract nouns and compound nouns using the Part-Of-Speech tags. Next, the co-occurrence frequencies between documents and terms were collected. After preprocessing, the number of distinctive terms was 772,852 for the 52,867 documents.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Clustering Results
</SectionTitle>
      <Paragraph position="0"> In our first experiments, we used a framework of unsupervised text categorization, where the quality of the generated clusters was evaluated</Paragraph>
      <Paragraph position="2"> by the goodness of the separation between different societies. To investigate the effect of the discounting parameter, it was given the values d =0.1,0.3,0.5,0.7,0.9, 0.95.</Paragraph>
      <Paragraph position="3"> Table 1 compares the total number of generated clusters (c), the average number of documents per cluster (s d ), and the average number of terms per cluster (s t ), for different values of d. We also examined the ratio of unique clusters that consist only of documents from a single society (r s ), and an inside-cluster ratio that is defined as the average relative weight of the dominant society for each cluster (r i ). Here, the weight of each society within a cluster was calculated as the sum of the significance weights of its component documents given by Eq. (10). The results shown in Table 1 indicate that reducing the value of d improves the quality of the generated clusters: with smaller d, the single society ratio and the inside-cluster ratio becomes higher, while the number of generated clusters becomes smaller.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Categorization Results
</SectionTitle>
      <Paragraph position="0"> In our second experiment, we used a framework of supervised text categorization, where the generated clusters were used as indices for classifying documents between the existing societies, and the categorization performance was examined.</Paragraph>
      <Paragraph position="1"> For this purpose, the documents were first divided into a training set of 50,182 documents and a test set of 2,641 documents. Then, assuming that the originating societies of the training documents are known, the significance weights of the five societies were calculated for each cluster generated in the previous experiments.</Paragraph>
      <Paragraph position="2"> Next, the test documents were assigned to one of the five societies based on the membership of the multiple clusters to which they belong.</Paragraph>
      <Paragraph position="3"> For comparison, two supervised text categorization methods, naive Bayes and Support Vector Machine (SVM), were also applied to the same training and test sets.</Paragraph>
      <Paragraph position="4"> The results are shown in Table 2. In this case, the performance was better for larger d, indicating that the major factor determining the categorization performance was the number of clusters rather than their quality. For d =0.5 [?] 0.95, each tested document appeared in at least one of the generated clusters, and the performance was almost comparable to the performance of standard text categorization methods: slightly better than naive Bayes, but not so good as SVM. We also compared the performance for varied sizes of training sets and also using different combination of societies, but the tendency remained the same.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.4 Further Analysis
</SectionTitle>
      <Paragraph position="0"> Table 3 compares the patterns of misclassification, where the columns and rows represent the classified and the real categories, respectively. It can be seen that as far as minor categories such as ESJ and JSAI are concerned, the proposed micro-clustering method performed slightly better than SVM. The reason may be that the former method is based on locally conformed clusters and less affected by the skew of the distribution of category sizes.</Paragraph>
      <Paragraph position="1"> However, the details are left for further investigation. null In addition, by manually analyzing the individual misclassified documents, it can be confirmed that most of them dealt with interdomain topics. For example, nine out of the ten JSCE documents misclassified as ESJ were related to environmental issues; six out of the 14 JSPE documents misclassified as JSCE, as well as all seven JSPE documents misclassified as JSAI, were related to the application of artificial intelligence techniques. These were the major causes of the performance difference of the two methods. null  We also tested the categorization performance without local improvement where the top 50 terms at most survive unconditionally after forming the initial clusters. In this case, the clustering works similarly to the automatic relevance feedback in information retrieval. Using the same data set, the result was 2,564 correct judgments (F-value 0.971), which shows the effectiveness of local improvement in reducing noise in automatic relevance feedback.</Paragraph>
      <Paragraph position="2"> Effect of cluster duplication check: Because we do not apply any duplication check in our generation step, the same cluster may appear repeatedly in the resulting cluster set. We have also tested the other case where clusters with terms or document sets identical to existing better-performing clusters were eliminated. The obtained categorization performance was slightly worse than the one without elimination. For example, the best performance obtained for d =0.9 was 2,582 correct judgments (F-value 0.978) with 137,867 (30% reduced) clusters.</Paragraph>
      <Paragraph position="3"> The results indicate that the system does not necessarily require expensive redundancy checks for the generated clusters as a whole. Such consideration becomes necessary when the formulated clusters are presented to users, in which case, the duplication check can be applied only locally.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>