<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1069">
<Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. A Comparison and Semi-Quantitative Analysis of Words and Character-Bigrams as Features in Chinese Text Categorization</Title>
<Section position="8" start_page="550" end_page="551" type="concl">
<SectionTitle> 5 Conclusion </SectionTitle>
<Paragraph position="0"> In this paper, we aimed to compare thoroughly the value of words and bigrams as feature terms in text categorization, and to make the underlying mechanism explicit.</Paragraph>
<Paragraph position="1"> Experimental comparison showed that the Chi feature selection scheme and the tfidf term weighting scheme remain the best choices for (Chinese) text categorization with an SVM classifier. In most cases, the bigram scheme outperforms the word scheme at high dimensionalities and usually reaches its top performance at a dimensionality of around 70000. The word scheme often outperforms the bigram scheme at low dimensionalities and reaches its top performance at a dimensionality below 40000.</Paragraph>
<Paragraph position="2"> Whether the best performance of the word scheme exceeds the best performance of the bigram scheme depends considerably on the word segmentation precision and the number of categories. The word scheme performs better with higher word segmentation precision and fewer categories (under 10).</Paragraph>
<Paragraph position="3"> A word scheme costs more document indexing time than a bigram scheme does; however, a bigram scheme costs more training time and classification time than a word scheme does at the same performance level, owing to its higher dimensionality.
Considering that document indexing is needed in both the training phase and the classification phase, a high-precision word scheme is more time-consuming as a whole than a bigram scheme.</Paragraph>
<Paragraph position="4"> As a concluding suggestion: a word scheme is better suited to small-scale tasks (no more than 10 categories and no strict classification speed requirements) and requires a high-precision word segmentation system; a bigram scheme is better suited to large-scale tasks (dozens of categories or more) without overly strict training speed requirements, because a high dimensionality and a large number of categories lead to a long training time.</Paragraph>
</Section>
</Paper>