<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1069">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Comparison and Semi-Quantitative Analysis of Words and Character-Bigrams as Features in Chinese Text Categorization</Title>
  <Section position="4" start_page="545" end_page="546" type="intro">
    <SectionTitle>
2 Performance Comparison
</SectionTitle>
    <Paragraph position="0"> Three document collections in Chinese language are used in this study.</Paragraph>
    <Paragraph position="1"> The electronic version of Chinese Encyclopedia (&amp;quot;CE&amp;quot;): It has 55 subject categories and 71674 single-labeled documents (entries). It is randomly split by a proportion of 9:1 into a training set with 64533 documents and a test set with 7141 documents. Every document has the fulltext. This data collection does not have much of a sparseness problem.</Paragraph>
    <Paragraph position="2"> The training data from a national Chinese text categorization evaluation  (&amp;quot;CTC&amp;quot;): It has 36 subject categories and 3600 single-labeled  documents. It is randomly split by a proportion of 4:1 into a training set with 2800 documents and a test set with 720 documents. Documents in this data collection are from various sources including news websites, and some documents  The Annual Evaluation of Chinese Text Categorization 2004, by 863 National Natural Science Foundation.  In the original document collection, a document might have a secondary category label. In this study, only the primary category label is reserved.</Paragraph>
    <Paragraph position="3"> may be very short. This data collection has a moderate sparseness problem.</Paragraph>
    <Paragraph position="4"> A manually word-segmented corpus from the State Language Affairs Commission (&amp;quot;LC&amp;quot;): It has more than 100 categories and more than 20000 single-labeled documents  . In this study, we choose a subset of 12 categories with the most documents (totally 2022 documents). It is randomly split by a proportion of 2:1 into a training set and a test set. Every document has the full-text and has been entirely word-segmented null  by hand (which could be regarded as a golden standard of segmentation).</Paragraph>
    <Paragraph position="5"> All experiments in this study are carried out at various feature space dimensionalities to show the scalability. Classifiers used in this study are Rocchio and SVM. All experiments here are multi-class tasks and each document is assigned a single category label.</Paragraph>
    <Paragraph position="6"> The outline of this section is as follows: Sub-section 2.1 shows experiments based on the Rocchio classifier, feature selection schemes besides Chi and term weighting schemes besides tfidf to compare the automatic segmented word features with bigram features on CE and CTC, and both document collections lead to similar behaviors; Subsection 2.2 shows experiments on CE by a SVM classifier, in which, unlike with the Rocchio method, Chi feature selection scheme and tfidf term weighting scheme outperform other schemes; Subsection 2.3 shows experiments by a SVM classifier with Chi feature selection and tfidf term weighting on LC (manual word segmentation) to compare the best word features with bigram features.</Paragraph>
    <Section position="1" start_page="545" end_page="546" type="sub_section">
      <SectionTitle>
2.1 The Rocchio Method and Various Settings
</SectionTitle>
      <Paragraph position="0"> The Rocchio method is rooted in the IR tradition, and is very different from machine learning ones (such as SVM) (Joachims, 1997; Sebastiani, 2002). Therefore, we choose it here as one of the representative classifiers to be examined. In the experiment, the control parameter of negative examples is set to 0, so this Rocchio based classifier is in fact a centroid-based classifier.</Paragraph>
      <Paragraph position="1"> Chi max is a state-of-the-art feature selection criterion for dimensionality reduction (Yang and Peterson, 1997; Rogati and Yang, 2002). Chimax* null CIG (Xue and Sun, 2003a) is reported to be better in Chinese text categorization by a cen- null Not completed.</Paragraph>
      <Paragraph position="2">  And POS (part-of-speech) tagged as well. But POS tags are not used in this study.</Paragraph>
      <Paragraph position="3">  troid based classifier, so we choose it as another representative feature selection criterion besides</Paragraph>
      <Paragraph position="5"> Likewise, as for term weighting schemes, in addition to tfidf, the state of the art (Baeza-Yates and Ribeiro-Neto, 1999), we also choose tfidf*CIG (Xue and Sun, 2003b).</Paragraph>
      <Paragraph position="6"> Two word segmentation schemes are used for the word-indexing of documents. One is the maximum match algorithm (&amp;quot;mmword&amp;quot; in the figures), which is a representative of simple and fast word segmentation algorithms. The other is</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>