File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1069_metho.xml
Size: 19,509 bytes
Last Modified: 2025-10-06 14:10:17
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1069"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Comparison and Semi-Quantitative Analysis of Words and Character-Bigrams as Features in Chinese Text Categorization</Title> <Section position="5" start_page="546" end_page="548" type="metho"> <SectionTitle> ICTCLAS </SectionTitle> <Paragraph position="0"> (&quot;lqword&quot; in the figures). ICTCLAS is one of the best word segmentation systems (SIGHAN 2003) and reaches a segmentation precision of more than 97%, so we choose it as a representative of state-of-the-art schemes for automatic word-indexing of document).</Paragraph> <Paragraph position="1"> For evaluation of single-label classifications,</Paragraph> <Paragraph position="3"> -measure, precision, recall and accuracy (Baeza-Yates and Ribeiro-Neto, 1999; Sebastiani, 2002) have the same value by microaveraging , and are labeled with &quot;performance&quot; in the following figures.</Paragraph> <Paragraph position="4"> dimensionality curves of the chi-tfidf approach and the approach with CIG, by mmword, lqword and bigram document indexing, on the CE document collection. We can see that the original chi-tfidf approach is better at low dimensionalities (less than 10000 dimensions), while the CIG version is better at high dimensionalities and reaches a higher limit.</Paragraph> <Paragraph position="5"> Microaveraging is more prefered in most cases than macroaveraging (Sebastiani 2002).</Paragraph> <Paragraph position="6"> In all figures in this paper, curves might be truncated due to the large scale of dimensionality, especially the curves of the CTC document collection. The curves fluctuate more than the curves for the CE collection because of sparseness; The CE collection is more sensitive to the additions of terms that come with the increase of dimensionality. The CE curves in the following figures show similar fluctuations for the same reason.</Paragraph> <Paragraph position="7"> For a parallel comparison among mmword, lqword and bigram schemes, the curves in Figure 1 and Figure 2 are regrouped and shown in Figure 3 and Figure 4.</Paragraph> <Paragraph position="8"> bigram scheme. For these kinds of figures, at least one of the following is satisfied: (a) every curve has shown its zenith; (b) only one curve is not complete and has shown a higher zenith than other curves; (c) a margin line is shown to indicate the limit of the incomplete curve. We can see that the lqword scheme outperforms the mmword scheme at almost any dimensionality, which means the more precise the word segmentation the better the classification performance. At the same time, the bigram scheme outperforms both of the word schemes on a high dimensionality, wherea the word schemes might outperform the bigram scheme on a low dimensionality. null Till now, the experiments on CE and CTC show the same characteristics despite the performance fluctuation on CTC caused by sparseness. 
<Section position="1" start_page="547" end_page="547" type="sub_section"> <SectionTitle> 2.2 SVM on Words and Bigrams </SectionTitle> <Paragraph position="0"> As stated in the previous subsection, the lqword scheme always outperforms the mmword scheme, so we compare here only the lqword scheme with the bigram scheme.</Paragraph> <Paragraph position="1"> The Support Vector Machine (SVM) is one of the best classifiers at present (Vapnik, 1995; Joachims, 1998), so we choose it as the main classifier in this study. The SVM implementation used here is LIBSVM (Chang, 2001); the SVM type is set to &quot;C-SVC&quot; and the kernel type is set to linear, and a one-against-one scheme is used for the multi-class classification.</Paragraph> <Paragraph position="2"> Because the effectiveness of CIG with an SVM classifier is not examined in Xue and Sun's (2003a, 2003b) reports, we test here all four combinations of feature selection and term weighting with and without CIG. The experimental results on the CE collection are shown in Figure 5. Here we find that the chi-tfidf combination outperforms every approach with CIG, which is the opposite of the results with the Rocchio method. Moreover, the results with SVM are all better than the results with the Rocchio method. It is worth noting that the feature selection scheme and the term weighting scheme are thus related to the classifier; in other words, no feature selection scheme or term weighting scheme is absolutely the best for all classifiers. Therefore, a reasonable choice is to select the best performing combination of feature selection scheme, term weighting scheme and classifier, i.e. chi-tfidf with SVM.</Paragraph>
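The following scikit-learn pipeline is a rough sketch of the chosen combination (chi-square feature selection, tfidf term weighting and a linear C-SVC classifier). The corpus variables, the bigram analyzer and the dimensionality k are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a chi-tfidf + linear SVM text categorization pipeline.
# scikit-learn's SVC wraps LIBSVM's C-SVC and uses a one-against-one
# scheme for multi-class problems, matching the setup described above.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

def char_bigrams(text):
    """Index a document by its overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

pipeline = Pipeline([
    ("counts", CountVectorizer(analyzer=char_bigrams)),  # or a word segmenter for the word schemes
    ("select", SelectKBest(chi2, k=50000)),              # chi-square feature selection (k = dimensionality, at most the vocabulary size)
    ("weight", TfidfTransformer()),                      # tfidf term weighting
    ("svm", SVC(kernel="linear", C=1.0)),                # C-SVC with a linear kernel
])

# Hypothetical usage, assuming train_texts/train_labels/test_texts exist:
# pipeline.fit(train_texts, train_labels)
# predicted = pipeline.predict(test_texts)
```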
<Paragraph position="3"> The curves for the lqword scheme and the bigram scheme are redrawn in Figure 6 to make them clearer. The curves shown in Figure 6 are similar to those in Figure 3. The differences are: (a) a larger dimensionality is needed for the bigram scheme to start outperforming the lqword scheme; (b) the two schemes have a smaller performance gap.</Paragraph> <Paragraph position="4"> The lqword scheme reaches its top performance at a dimensionality of around 40000, and the bigram scheme reaches its top performance at a dimensionality of around 60000 to 70000, after which the performance of both schemes decreases slowly. The reason is that the terms ranked low in feature selection are in fact noise and do not help the classification, which is why the feature selection phase is necessary.</Paragraph> </Section> <Section position="2" start_page="547" end_page="548" type="sub_section"> <SectionTitle> 2.3 Comparing Manually Segmented Words and Bigrams </SectionTitle> <Paragraph position="0"> Up to now, bigram features seem to be better than word features at fairly large dimensionalities.</Paragraph> <Paragraph position="1"> But it appears that word segmentation precision affects classification performance. So we choose here a fully manually segmented document collection to find the best performance a word scheme could reach and compare it with the bigram scheme.</Paragraph> <Paragraph position="2"> On the LC document collection (in the corresponding figure, the circles indicate the maxima and the dash-dot lines indicate the superior limit and the asymptotic inferior limit of the bigram scheme), the word scheme reaches its top performance around a dimensionality of 20000, and this top performance is a little higher than the bigram scheme's zenith, which occurs around a dimensionality of 70000.</Paragraph> <Paragraph position="3"> Besides this experiment on 12 categories of the LC document collection, some experiments on fewer (2 to 6) categories of this subset were also done and showed similar behaviors. The word scheme shows a better performance than the bigram scheme and needs a much lower dimensionality; the simpler the classification task is, the more distinct this behavior is.</Paragraph> </Section> </Section> <Section position="6" start_page="548" end_page="549" type="metho"> <SectionTitle> 3 Qualitative Analysis </SectionTitle> <Paragraph position="0"> To analyze the performance of words and bigrams as feature terms in Chinese text categorization, we need to investigate the following two aspects.</Paragraph> <Section position="1" start_page="548" end_page="548" type="sub_section"> <SectionTitle> 3.1 An Individual Feature Perspective </SectionTitle> <Paragraph position="0"> The word is a natural semantic unit in the Chinese language and expresses a complete meaning in text. The bigram is not a natural semantic unit and might not express a complete meaning in text, but there are also reasons for the bigram to be a good feature term.</Paragraph> <Paragraph position="1"> First, two-character words and three-character words account for most multi-character Chinese words (Liu and Liang, 1986). A two-character word can be substituted by the identical bigram. At the granularity of most categorization tasks, a three-character word can often be substituted by one of its sub-bigrams (namely an &quot;intraword bigram&quot; in the next section) without a change of meaning. For instance, &quot;Biao Sai&quot; is a sub-bigram of the word &quot;Jin Biao Sai (tournament)&quot; and could represent it without ambiguity.</Paragraph> <Paragraph position="2"> Second, a bigram may overlap two successive words (namely an &quot;interword bigram&quot; in the next section), and thus to some extent fills the role of a word-bigram. The word-bigram, as a more definite (although sparser) feature, surely helps the classification. For instance, &quot;Qi Yu&quot; is a bigram overlapping the two successive words &quot;Tian Qi (weather)&quot; and &quot;Yu Bao (forecast)&quot;, and could almost replace the word-bigram (also a phrase) &quot;Tian Qi Yu Bao (weather forecast)&quot;, which is more likely to be a representative feature of the category &quot;Qi Xiang Xue (meteorology)&quot; than either word alone.</Paragraph> <Paragraph position="3"> Third, due to the first issue, bigram features have some capability of identifying OOV (out-of-vocabulary) words (in this paper, &quot;OOV words&quot; are the words that occur in the test documents but not in the training documents), and thus help improve the recall of the classification.</Paragraph> <Paragraph position="4"> The above issues state the advantages of bigrams compared with words. But in the first and second issues, the equivalence between a bigram and a word or word-bigram is not perfect. For instance, the word &quot;Wen Xue (literature)&quot; is also a sub-bigram of the word &quot;Tian Wen Xue (astronomy)&quot;, but their meanings are completely different. So the loss and distortion of semantic information is a disadvantage of bigram features compared with word features. Furthermore, one-character words cover about 7% of words and more than 30% of word occurrences in the Chinese language; they are effective in the word scheme and are not involved in the above issues. Note that the impact of effective one-character words on the classification is not as large as their total frequency suggests, because the high-frequency ones are often too common to have good classification power, for instance the word &quot;De (of, 's)&quot;.</Paragraph> </Section>
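To make the intraword/interword distinction above concrete, here is a tiny Python sketch using the &quot;weather forecast&quot; example; the Chinese characters are reconstructed from the romanized glosses in the text and are assumptions of this illustration.

```python
# Overlapping character bigrams of the running "weather forecast" example.
def char_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

words = ["天气", "预报"]      # "Tian Qi" (weather), "Yu Bao" (forecast)
text = "".join(words)         # the phrase "Tian Qi Yu Bao" (weather forecast)

print(char_bigrams(text))     # ['天气', '气预', '预报']
# '天气' and '预报' are intraword bigrams (each identical to a two-character word),
# while '气预' ("Qi Yu") is an interword bigram that spans the word boundary and
# acts as a proxy for the word-bigram "Tian Qi Yu Bao".
```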
<Section position="2" start_page="548" end_page="549" type="sub_section"> <SectionTitle> 3.2 A Mass Feature Perspective </SectionTitle> <Paragraph position="0"> Features do not act independently in text classification; they are assembled together to constitute a feature space. Except for a few models such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990), most models assume the feature space to be orthogonal. This assumption might not affect the effectiveness of the models, but the semantic redundancy and complementation among the feature terms do impact the classification efficiency at a given dimensionality. According to the first issue addressed in the previous subsection, a bigram might cover for more than one word. For instance, the bigram &quot;Zhi Wu&quot; is a sub-bigram of the words &quot;Zhi Wu (fabric)&quot;, &quot;Mian Zhi Wu (cotton fabric)&quot; and &quot;Zhen Zhi Wu (knitted fabric)&quot;, and also a good substitute for them. So, to a certain extent, word features are redundant with regard to the bigram features associated with them. Similarly, according to the second issue addressed, a bigram might cover for more than one word-bigram. For instance, the bigram &quot;Pian Xiao&quot; is a sub-bigram of the word-bigrams (phrases) &quot;Duan Pian Xiao Shuo (short story)&quot;, &quot;Zhong Pian Xiao Shuo (novelette)&quot; and &quot;Chang Pian Xiao Shuo (novel)&quot;, and also a good substitute for them. So, as an addition to the second issue stated in the previous subsection, a bigram feature might even cover for more than one word-bigram.</Paragraph> <Paragraph position="1"> On the other hand, bigram features are also redundant with regard to the word features associated with them. For instance, &quot;Jin Biao&quot; and &quot;Biao Sai&quot; are both sub-bigrams of the previously mentioned word &quot;Jin Biao Sai&quot;. In some cases, more than one sub-bigram can be a good representative of a word.</Paragraph> <Paragraph position="2"> We make a word list and a bigram list, each sorted by the feature selection criterion in descending order, and try to find how the relative redundancy degrees of the word list and the bigram list vary with the dimensionality. The following observations come from an inspection of the two lists (not shown here due to space limitations).</Paragraph> <Paragraph position="3"> The relative redundancy rate in the word list stays roughly even while the dimensionality varies, to a certain extent, because words that share a common sub-bigram might not have similar statistics and are thus scattered across the word feature list.</Paragraph> <Paragraph position="4"> Note that these words are possibly ranked lower in the list than the sub-bigram, because feature selection criteria (such as chi-square) often prefer higher frequency terms to lower frequency ones, and every word containing the bigram certainly has a lower frequency than the bigram itself.</Paragraph> <Paragraph position="5"> The relative redundancy in the bigram list might not be as even as in the word list. Good (representative) sub-bigrams of a word are quite likely to be ranked close to the word itself. For instance, &quot;Zuo Qu&quot; and &quot;Qu Jia&quot; are sub-bigrams of the word &quot;Zuo Qu Jia (music composer)&quot;, and both the bigrams and the word are at the top of their lists.</Paragraph>
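A rough sketch of how the relative redundancy discussed above could be measured: the fraction of top-k words that have at least one sub-bigram among the top-k bigrams. The ranked lists and the value of k are assumptions for illustration; the paper only reports the qualitative observation.

```python
# Estimate the relative redundancy of a ranked word list with respect to a
# ranked bigram list (both assumed sorted in descending order of the feature
# selection criterion, e.g. chi-square).
def sub_bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def redundancy_rate(word_list, bigram_list, k):
    top_bigrams = set(bigram_list[:k])
    top_words = [w for w in word_list[:k] if len(w) >= 2]
    redundant = sum(1 for w in top_words if sub_bigrams(w) & top_bigrams)
    return redundant / len(top_words) if top_words else 0.0

# Hypothetical usage: track how the redundancy varies with the dimensionality k.
# for k in (1000, 10000, 50000):
#     print(k, redundancy_rate(word_list, bigram_list, k))
```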
<Paragraph position="6"> Therefore, the bigram list has a relatively large redundancy rate at low dimensionalities. The redundancy rate should decrease as the dimensionality increases, because: (a) the relative redundancy in the word list counteracts the redundancy in the bigram list, since the words that contain the same bigram are gradually included as the dimensionality increases; and (b) the proportion of interword bigrams increases in the bigram list, and there is generally no redundancy between interword bigrams and intraword bigrams.</Paragraph> <Paragraph position="7"> Last, there are more bigram features than word features, because bigrams can overlap each other in the text while words cannot. Thus the bigrams as a whole should theoretically contain more information than the words as a whole.</Paragraph> <Paragraph position="8"> From the above analysis and observations, bigram features are expected to outperform word features at high dimensionalities, and word features are expected to outperform bigram features at low dimensionalities.</Paragraph> </Section> </Section> <Section position="7" start_page="549" end_page="550" type="metho"> <SectionTitle> 4 Semi-Quantitative Analysis </SectionTitle> <Paragraph position="0"> In this section, a preliminary statistical analysis is presented to corroborate the statements in the above qualitative analysis; it is expected to be consistent with the experimental results shown in Section 2. All statistics in this section are based on the CE document collection and the lqword segmentation scheme (because the CE document collection is large enough to provide good statistical characteristics).</Paragraph> <Section position="1" start_page="549" end_page="550" type="sub_section"> <SectionTitle> 4.1 Intraword Bigrams and Interword Bigrams </SectionTitle> <Paragraph position="0"> In the previous section, only the intraword bigrams were discussed together with the words, but every bigram may have both intraword occurrences and interword occurrences. Therefore we need to distinguish these two kinds of bigrams at a statistical level. For every bigram, the number of intraword occurrences and the number of interword occurrences are counted, and from these two counts a metric is derived to indicate the bigram's natural propensity to be an intraword bigram. The probability density of bigrams over this metric is a mixture of two Gaussian distributions, the left one corresponding to &quot;natural interword bigrams&quot; and the right one to &quot;natural intraword bigrams&quot;; we can moderately distinguish these two kinds of bigrams by a division at -1.4.</Paragraph> </Section>
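The exact formula of this metric is not preserved in this extract; a plausible reading, sketched below, is the log ratio of a bigram's intraword occurrence count to its interword occurrence count, with the division at -1.4 used as a threshold. The smoothing constant is an added assumption.

```python
# Assumed form of the intraword-propensity metric: log(intraword count /
# interword count), smoothed so that bigrams with zero counts on one side
# do not cause a division-by-zero or log-of-zero error.
import math

def intraword_propensity(n_intra, n_inter, eps=0.5):
    return math.log((n_intra + eps) / (n_inter + eps))

def is_natural_intraword(n_intra, n_inter, threshold=-1.4):
    # Bigrams above the division at -1.4 fall in the right-hand Gaussian
    # ("natural intraword bigrams"), the rest in the left-hand one.
    return intraword_propensity(n_intra, n_inter) > threshold

print(is_natural_intraword(120, 30))   # True: occurs mostly inside words
print(is_natural_intraword(2, 400))    # False: a natural interword bigram
```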
<Section position="2" start_page="550" end_page="550" type="sub_section"> <SectionTitle> 4.2 Overall Information Quantity of a Feature Space </SectionTitle> <Paragraph position="0"> The performance limit of a classification is related to the quantity of information used, so a quantitative metric of the information a feature space can provide is needed. Feature Quantity (Aizawa, 2000) is suitable for this purpose because it comes from information theory and is additive; tfidf was also reported to be an appropriate estimate of feature quantity (defined as probability × information). Because probability is involved as a factor, the overall information provided by a feature space can be calculated on the training data by summation.</Paragraph> <Paragraph position="1"> The redundancy and complementation mentioned in Subsection 3.2 must be taken into account in the calculation of the overall information quantity. For bigrams, the redundancy between two intraword bigrams with regard to the words associated with them is defined over every word w that contains both of the two bigrams. The overall information quantity is then obtained by subtracting the redundancy between each pair of bigrams from the sum of all features' feature quantity (tfidf); redundancy among more than two bigrams is ignored. For words, there is only complementation but no redundancy; the complementation of a word w with regard to the bigrams associated with it is defined piecewise, according to whether an intraword bigram b contained by w exists among the features or not. The overall information is calculated by summing the complementations of all words.</Paragraph> </Section> <Section position="3" start_page="550" end_page="550" type="sub_section"> <SectionTitle> 4.3 Statistics and Discussion </SectionTitle> <Paragraph position="0"> Figure 9 shows the variation of these overall information metrics on the CE document collection.</Paragraph> <Paragraph position="1"> It corroborates the characteristics analyzed in Section 3 and corresponds with the performance curves in Section 2.</Paragraph> <Paragraph position="2"> Figure 10 shows the proportion of interword bigrams at different dimensionalities, which also corresponds with the analysis in Section 3.</Paragraph> <Paragraph position="3"> The curves do not cross at exactly the same dimensionality as in the figures in Section 2, because other factors also impact the classification performance: (a) the OOV word identifying capability, as stated in Subsection 3.1; (b) the word segmentation precision; (c) the granularity of the categories (words have more definite semantic meanings than bigrams and lead to a better performance for small category granularities); (d) the noise terms introduced into the feature space as the dimensionality increases. With these factors, the actual curves would not keep increasing as they do in Figure 9.</Paragraph> </Section> </Section> </Paper>