File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/p06-1069_abstr.xml

Size: 1,088 bytes

Last Modified: 2025-10-06 13:45:01

<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1069">
  <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Comparison and Semi-Quantitative Analysis of Words and Character-Bigrams as Features in Chinese Text Categorization</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Words and character-bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their values as features for Chinese text categorization has been reported heretofore. We carry out here a full performance comparison between them by experiments on various document collections (including a manually word-segmented corpus as a golden standard), and a semi-quantitative analysis to elucidate the characteristics of their behavior; and try to provide some preliminary clue for feature term choice (in most cases, character-bigrams are better than words) and dimensionality setting in text categorization systems.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML