Text Classification in Asian Languages without Word Segmentation

Fuchun Peng†‡  Xiangji Huang†  Dale Schuurmans†  Shaojun Wang†§
† School of Computer Science, University of Waterloo, Ontario, Canada
‡ Department of Computer Science, University of Massachusetts, Amherst, MA, USA
§ Department of Statistics, University of Toronto, Ontario, Canada
{f3peng, jhuang, dale, sjwang}@ai.uwaterloo.ca
Abstract
We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling based approach has a strong information-theoretic basis and avoids an explicit feature selection procedure, which can discard a significant amount of useful information. We systematically study the key factors in language modeling and their influence on classification. Experiments on Chinese TREC and Japanese NTCIR topic detection show that this simple approach can achieve better performance than traditional approaches while avoiding word segmentation, which demonstrates its superiority in Asian language text classification.
1 Introduction
Text classification addresses the problem of assigning a given passage of text (or a document) to one or more predefined classes. This is an important area of information retrieval research that has been heavily investigated, although most of the research activity has concentrated on English text (Dumais, 1998; Yang, 1999). Text classification in Asian languages such as Chinese and Japanese, however, is also an important (and relatively more recent) area of research that introduces a number of additional difficulties. One difficulty with Chinese and Japanese text classification is that, unlike English, Chinese and Japanese texts do not have explicit whitespace between words. This means that some form of word segmentation is normally required before further processing. However, word segmentation itself is a difficult problem in these languages. A second difficulty is the lack of standard benchmark data sets for these languages. Nevertheless, there has recently been notable progress on Chinese and Japanese text classification (Aizawa, 2001; He et al., 2001).
Many standard machine learning techniques have been applied to text categorization problems, such as naive Bayes classifiers, support vector machines, linear least squares models, neural networks, and k-nearest neighbor classifiers (Sebastiani, 2002; Yang, 1999). Unfortunately, most current text classifiers work with word level features. However, word identification in Asian languages, such as Chinese and Japanese, is itself a hard problem. To avoid the word segmentation problem, character level n-gram models have been proposed (Cavnar and Trenkle, 1994; Damashek, 1995). There, n-grams were used as features for a traditional feature selection process, and classifiers based on calculating feature-vector similarities were then deployed. This approach has many shortcomings. First, there is an enormous number of possible features to consider in text categorization, and standard feature selection approaches do not always cope well in such circumstances. For example, given a sufficiently large number of features, the cumulative effect of uncommon features can still have an important effect on classification accuracy, even though infrequent features individually contribute less information than common features. Therefore, throwing away uncommon features is usually not an appropriate strategy in this domain (Aizawa, 2001). Another problem is that feature selection normally uses indirect tests, such as χ² or mutual information, which involve setting arbitrary thresholds and conducting a heuristic greedy search to find a good subset of features. Moreover, by treating text categorization as a classical classification problem, standard approaches can ignore the fact that texts are written in natural language, which means that they have many implicit regularities that can be well modeled by specific tools from natural language processing.
In this paper, we present a simple text categorization approach based on statistical n-gram language modeling that overcomes the above shortcomings in a principled fashion. An advantage we exploit is that the language modeling approach does not discard low frequency features during classification, as is commonly done in traditional classification learning approaches. Also, the language modeling approach uses n-gram models to capture more contextual information than standard bag-of-words approaches, and employs better smoothing techniques than standard classification learning. These advantages are supported by our empirical results on Chinese and Japanese data.
2 Language Model Text Classifiers
The goal of language modeling is to predict the probability of natural word sequences; or more simply, to put high probability on word sequences that actually occur (and low probability on word sequences that never occur). Given a word sequence w_1 w_2 ... w_T to be used as a test corpus, the quality of a language model can be measured by the empirical perplexity (or entropy) on this corpus

    \mathrm{Perplexity} = \left[ \prod_{i=1}^{T} \frac{1}{\Pr(w_i | w_1 \cdots w_{i-1})} \right]^{1/T}    (1)
The goal of language modeling is to obtain a small
perplexity.
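As a concrete illustration, the following sketch (Python; the cond_prob model function is a hypothetical stand-in for any of the n-gram models described below) evaluates Equ. (1) in log space to avoid numerical underflow on long test corpora:

    import math

    def perplexity(seq, cond_prob, n):
        """Empirical perplexity of token sequence `seq`, as in Equ. (1).

        `cond_prob(history, w)` is assumed to return Pr(w | history),
        where `history` is a tuple of up to n-1 preceding tokens.
        """
        neg_log_prob = 0.0
        for i, w in enumerate(seq):
            history = tuple(seq[max(0, i - n + 1):i])
            neg_log_prob -= math.log(cond_prob(history, w))
        # T-th root of the product of 1/Pr(.) terms = exp of the mean.
        return math.exp(neg_log_prob / len(seq))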
2.1 n-gram language modeling
The simplest and most successful basis for language modeling is the n-gram model. Note that by the chain rule of probability we can write the probability of any sequence as

    \Pr(w_1 w_2 \cdots w_T) = \prod_{i=1}^{T} \Pr(w_i | w_1 \cdots w_{i-1})    (2)
An n-gram model approximates this probability by assuming that the only words relevant to predicting \Pr(w_i | w_1 \cdots w_{i-1}) are the previous n-1 words; that is, it assumes the Markov n-gram independence assumption

    \Pr(w_i | w_1 \cdots w_{i-1}) = \Pr(w_i | w_{i-n+1} \cdots w_{i-1})
A straightforward maximum likelihood estimate of n-gram probabilities from a corpus is given by the observed frequency

    \Pr(w_i | w_{i-n+1} \cdots w_{i-1}) = \frac{\#(w_{i-n+1} \cdots w_i)}{\#(w_{i-n+1} \cdots w_{i-1})}    (3)
where #(.) is the number of occurrences of a specified gram in the training corpus. Unfortunately, using grams of length up to n entails estimating the probability of V^n events, where V is the size of the word vocabulary. This quickly overwhelms modern computational and data resources for even modest choices of n (beyond 3 to 6). Also, because of the heavy-tailed nature of language (i.e., Zipf's law), one is likely to encounter novel n-grams that were never witnessed during training. Therefore, some mechanism for assigning non-zero probability to novel n-grams is a central and unavoidable issue. One standard approach to smoothing probability estimates to cope with sparse data problems (and to cope with potentially missing n-grams) is to use some sort of back-off estimator
    \Pr(w_i | w_{i-n+1} \cdots w_{i-1}) =
        \begin{cases}
            \hat{\Pr}(w_i | w_{i-n+1} \cdots w_{i-1}) & \text{if } \#(w_{i-n+1} \cdots w_i) > 0 \\
            \beta(w_{i-n+1} \cdots w_{i-1}) \, \Pr(w_i | w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
        \end{cases}    (4)
where

    \hat{\Pr}(w_i | w_{i-n+1} \cdots w_{i-1}) = \mathrm{discount} \times \frac{\#(w_{i-n+1} \cdots w_i)}{\#(w_{i-n+1} \cdots w_{i-1})}    (5)

is the discounted probability, and \beta(w_{i-n+1} \cdots w_{i-1}) is a normalization constant calculated to be
    \beta(w_{i-n+1} \cdots w_{i-1}) =
        \frac{1 - \sum_{x :\, \#(w_{i-n+1} \cdots w_{i-1} x) > 0} \hat{\Pr}(x | w_{i-n+1} \cdots w_{i-1})}
             {1 - \sum_{x :\, \#(w_{i-n+1} \cdots w_{i-1} x) > 0} \hat{\Pr}(x | w_{i-n+2} \cdots w_{i-1})}    (6)
The discounted probability (5) can be computed using different smoothing approaches, including Laplace smoothing, linear smoothing, absolute smoothing, Good-Turing smoothing, and Witten-Bell smoothing (Chen and Goodman, 1998).
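To make the back-off computation concrete, here is a minimal sketch of a bigram model with absolute discounting (Equs. (4)-(6)), backing off to add-one-smoothed unigrams; the recursive multi-order case follows the same pattern. The function names and the discount value are illustrative assumptions, not the exact implementation used in our experiments.

    from collections import Counter, defaultdict

    def train_bigram_backoff(tokens, delta=0.5):
        """Bigram back-off model with absolute discounting (Equs. (4)-(6))."""
        uni = Counter(tokens)
        bi = Counter(zip(tokens, tokens[1:]))
        total, vsize = len(tokens), len(uni)
        # Successors observed after each history word; needed for beta, Equ. (6).
        succ = defaultdict(set)
        for (u, v) in bi:
            succ[u].add(v)

        def p_uni(w):
            # Add-one-smoothed unigram: the distribution we back off to.
            return (uni[w] + 1) / (total + vsize)

        def p(history, w):
            if not history or history[-1] not in succ:
                return p_uni(w)
            h = history[-1]
            if bi[(h, w)] > 0:
                # Discounted probability, Equ. (5), with absolute discounting.
                return (bi[(h, w)] - delta) / uni[h]
            # Back-off branch of Equ. (4): beta(h) * Pr(w), beta from Equ. (6).
            left = 1.0 - sum((bi[(h, v)] - delta) / uni[h] for v in succ[h])
            right = 1.0 - sum(p_uni(v) for v in succ[h])
            return (left / right) * p_uni(w)

        return p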
The language models described above use individual words as the basic unit, although one could instead consider models that use individual characters as the basic unit. The remaining details stay the same in this case. The only difference is that the character vocabulary is always much smaller than the word vocabulary, which means that one can normally use a much higher order n in a character level n-gram model (although the text spanned by a character model is still usually less than that spanned by a word model). The benefits of the character level model in the context of text classification are multi-fold: it avoids the need for explicit word segmentation in the case of Asian languages, and it greatly reduces the sparse data problems associated with large vocabulary models. In this paper, we experiment with character level models to avoid word segmentation in Chinese and Japanese.
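Extracting character level units from unsegmented text is then trivial; a minimal sketch:

    def char_ngrams(text, n):
        """All overlapping character n-grams; no word segmentation required."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    # Works directly on an unsegmented Chinese string:
    char_ngrams("文本分类", 2)   # -> ['文本', '本分', '分类']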
2.2 Language models as text classifiers

Text classifiers attempt to identify attributes that distinguish documents in different categories. Such attributes may include vocabulary terms, average word length, local n-grams, or global syntactic and semantic properties. Language models also attempt to capture such regularities, and hence provide another natural avenue to constructing text classifiers.
Our approach to applying language models to text categorization is to use Bayesian decision theory. Assume we wish to classify a text D = w_1 w_2 \cdots w_m into a category c \in C = \{c_1, \ldots, c_{|C|}\}. A natural choice is to pick the category c that has the largest posterior probability given the text. That is,

    c^* = \arg\max_{c \in C} \Pr(c | D)    (7)

Using Bayes rule, this can be rewritten as

    c^* = \arg\max_{c \in C} \Pr(c) \Pr(D | c)    (8)
        = \arg\max_{c \in C} \Pr(c) \prod_{i=1}^{m} \Pr_c(w_i | w_{i-n+1} \cdots w_{i-1})    (9)
Here, \Pr(D | c) is the likelihood of D under category c, which can be computed by n-gram language modeling. The likelihood is related to perplexity by Equ. (1). The prior \Pr(c) can be computed from training data or can be used to incorporate further assumptions, such as a uniform or Dirichlet distribution.
Therefore, our approach is to learn a separate back-off language model for each category, by training on a data set from that category. Then, to categorize a new text D, we supply D to each language model, evaluate the likelihood (or entropy) of D under each model, and pick the winning category according to Equ. (9).
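In outline (a sketch assuming per-category model functions of the same form as above; models and priors are hypothetical containers, not a library API):

    import math

    def classify(doc, models, priors, n):
        """Return the category maximizing Equ. (9), computed in log space.

        `models[c]` is a function cond_prob(history, w) -> Pr_c(w | history)
        trained on category c; `priors[c]` is Pr(c).
        """
        best_c, best_score = None, float("-inf")
        for c, cond_prob in models.items():
            score = math.log(priors[c])
            for i, w in enumerate(doc):
                history = tuple(doc[max(0, i - n + 1):i])
                score += math.log(cond_prob(history, w))
            if score > best_score:
                best_c, best_score = c, score
        return best_c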
The inference of an n-gram based text classifier is very similar to that of a naive Bayes classifier (to be discussed below). In fact, n-gram classifiers are a straightforward generalization of naive Bayes (Peng and Schuurmans, 2003).
3 Traditional Text Classifiers

We introduce the three standard text classifiers that we will compare against below.
3.1 Naive Bayes classifiers

A simple yet effective learning algorithm for text classification is the naive Bayes classifier. In this model, a document D is normally represented by a vector of M attributes D = (a_1, a_2, \ldots, a_M). The naive Bayes model assumes that all of the attribute values a_j are independent given the category label c. Thus, a maximum a posteriori (MAP) classifier can be constructed as follows.
    c^* = \arg\max_{c \in C} \left\{ \Pr(c) \prod_{j=1}^{M} \Pr(a_j | c) \right\}    (10)
To cope with features that remain unobserved during training, the estimate of \Pr(a_j | c) is usually adjusted by Laplace smoothing

    \Pr(a_j | c) = \frac{N_j^c + \lambda_j}{N^c + \lambda}    (11)

where N_j^c is the frequency of attribute j in D_c, N^c = \sum_j N_j^c, and \lambda = \sum_j \lambda_j. A special case of Laplace smoothing is add-one smoothing, obtained by setting \lambda_j = 1. We use add-one smoothing in our experiments below.
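For reference, a compact sketch of this baseline (Equs. (10)-(11) with \lambda_j = 1); representing documents as token lists is an assumption made here for illustration:

    import math
    from collections import Counter, defaultdict

    def train_naive_bayes(labeled_docs):
        """labeled_docs: list of (tokens, category) pairs.

        Returns a classifier implementing Equ. (10) with the add-one
        smoothed estimates of Equ. (11).
        """
        cat_freq = Counter(c for _, c in labeled_docs)
        attr_freq = defaultdict(Counter)   # N_j^c: attribute counts per category
        vocab = set()
        for tokens, c in labeled_docs:
            attr_freq[c].update(tokens)
            vocab.update(tokens)

        def log_posterior(tokens, c):
            logp = math.log(cat_freq[c] / len(labeled_docs))   # log Pr(c)
            n_c = sum(attr_freq[c].values())                   # N^c
            for a in tokens:
                logp += math.log((attr_freq[c][a] + 1) / (n_c + len(vocab)))
            return logp

        def classify(tokens):
            return max(cat_freq, key=lambda c: log_posterior(tokens, c))

        return classify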
3.2 Ad hoc n-gram text classifiers

In this method, a test document D and a class label c are both represented by vectors of n-gram features, and a distance measure between the representations of D and c is defined. The features to be used during classification are usually selected by employing heuristic methods, such as χ² or mutual information scoring, that involve setting cutoff thresholds and conducting a greedy search for a good feature subset. We refer to this method as the ad hoc n-gram based text classifier. The final classification decision is made according to

    c^* = \arg\min_{c \in C} \{ \mathrm{distance}(D, c) \}    (12)
Different distance metrics can be used in this approach. We implemented a simple re-ranking distance, which is sometimes referred to as the out-of-place (OOP) measure (Cavnar and Trenkle, 1994). In this method, a document is represented by an n-gram profile that contains selected n-grams sorted by decreasing frequency. For each n-gram in a test document profile, we find its counterpart in the class profile and compute the number of places its location differs. The distance between a test document and a class is computed by summing the individual out-of-place values.
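A minimal sketch of the OOP distance follows; the profile size k and the maximum penalty assigned to n-grams absent from the class profile are illustrative parameter choices, not the exact settings used in our experiments.

    from collections import Counter

    def oop_distance(doc_text, class_profile, n=2, k=300):
        """Out-of-place distance between a document and a class profile.

        `class_profile`: the class's n-grams sorted by decreasing frequency.
        An n-gram absent from the class profile incurs the maximum
        displacement k.
        """
        counts = Counter(doc_text[i:i + n] for i in range(len(doc_text) - n + 1))
        doc_profile = [g for g, _ in counts.most_common(k)]
        class_rank = {g: r for r, g in enumerate(class_profile[:k])}
        return sum(abs(r - class_rank[g]) if g in class_rank else k
                   for r, g in enumerate(doc_profile))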
3.3 Support vector machine classifiers

Given a set of N linearly separable training examples S = \{x_i \in R^n \mid i = 1, 2, \ldots, N\}, where each example belongs to one of two classes, y_i \in \{+1, -1\}, the SVM approach seeks the optimal hyperplane w \cdot x + b = 0 that separates the positive and negative examples with the largest margin. The problem can be formulated as solving the following quadratic programming problem (Vapnik, 1995).

    minimize    \frac{1}{2} \|w\|^2    (13)
    subject to  y_i (w \cdot x_i + b) \geq 1
In our experiments below, we use the SVM^light (Joachims, 1998) toolkit with default settings.
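A rough modern equivalent of this setup can be sketched with scikit-learn's LinearSVC standing in for SVM^light (an assumption for illustration; the toy strings and labels below are likewise hypothetical), using character n-gram count features so that no segmentation is needed:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy binary problem: +1 vs. -1 on unsegmented Chinese strings.
    train_texts = ["股市上涨", "利率下调", "球队获胜", "比赛结束"]
    train_labels = [1, 1, -1, -1]

    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2)),  # character n-grams
        LinearSVC(),  # linear max-margin classifier, as in Equ. (13)
    )
    clf.fit(train_texts, train_labels)
    print(clf.predict(["明日比赛"]))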
4 Empirical evaluation
We now present our experimental results on Chinese and Japanese text classification problems. The Chinese data set we used has been previously investigated in (He et al., 2001). The corpus is a subset of the TREC-5 People's Daily news corpus published by the Linguistic Data Consortium (LDC) in 1995. The entire TREC-5 data set contains 164,789 documents on a variety of topics, including international and domestic news, sports, and culture. The corpus was originally intended for research on information retrieval. To make the data set suitable for text categorization, documents were first clustered into 101 groups that shared the same headline (as indicated by an SGML tag). The six most frequent groups were selected to make a Chinese text categorization data set.

For Japanese text classification, we consider the Japanese text classification data investigated by (Aizawa, 2001). This data set was converted from the NTCIR-J1 data set originally created for Japanese text retrieval research. The conversion process is similar to that used for the Chinese data. The final text classification data set has 24 categories which are unevenly distributed.
4.1 Experimental paradigm

Both the Chinese and Japanese data sets involve classifying into a large number of categories, where each document is assigned a single category. Many classification techniques, such as SVMs, are intrinsically defined for two class problems, and have to be extended to handle these multiple category data sets. For SVMs, we employ the standard technique of first converting the |C|-category classification problem into |C| binary classification problems.

For the experiments on Chinese data, we follow (He et al., 2001) and convert the problem into 6 binary classification problems. In each case, we randomly select 500 positive examples and then select 500 negative examples evenly from among the remaining categories to form the training data. The testing set contains 100 positive documents and 100 negative documents generated in the same way. The training set and testing set do not overlap and do not contain repeated documents.

For the experiments on Japanese data, we follow (Aizawa, 2001) and directly experiment with a 24-class classification problem. The NTCIR data sets are unevenly distributed across categories. The training data consists of 310,355 documents distributed unevenly among the categories (with a minimum of 1,747 and maximum of 83,668 documents per category), and the testing set contains 10,000 documents unevenly distributed among categories (with a minimum of 56 and maximum of 2,696 documents per category).
4.2 Measuring classification performance

In the Chinese experiments, where 6 binary classification problems are formulated, we measured classification performance by micro-averaged F-measure scores. To calculate the micro-averaged score, we formed an aggregate confusion matrix by adding up the individual confusion matrices from each category. The micro-averaged precision, recall, and F-measure can then be computed from the aggregated confusion matrix.

For the Japanese experiments, we measured overall accuracy and the macro-averaged F-measure. Here the precision, recall, and F-measure of each individual category can be computed from a |C| × |C| confusion matrix. Macro-averaged scores can be computed by averaging the individual scores. The overall accuracy is computed by dividing the number of correctly identified documents (summing the counts across the diagonal) by the total number of test documents.
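The following sketch computes these scores; the [[tp, fn], [fp, tn]] layout assumed for the per-category binary matrices is an illustrative convention.

    import numpy as np

    def micro_f1(binary_confusions):
        """Micro-averaged F-measure from per-category 2x2 confusion
        matrices laid out as [[tp, fn], [fp, tn]]."""
        agg = np.sum(binary_confusions, axis=0)        # aggregate matrix
        tp, fn, fp = agg[0, 0], agg[0, 1], agg[1, 0]
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)

    def macro_f1_and_accuracy(conf):
        """Macro-averaged F and overall accuracy from a |C| x |C| confusion
        matrix (rows = true category, columns = predicted category)."""
        conf = np.asarray(conf, dtype=float)
        tp = np.diag(conf)
        prec, rec = tp / conf.sum(axis=0), tp / conf.sum(axis=1)
        macro_f = float(np.mean(2 * prec * rec / (prec + rec)))
        accuracy = float(tp.sum() / conf.sum())        # diagonal over total
        return macro_f, accuracy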
4.3 Results on Chinese data

Table 1 gives the results of the character level language modeling approach, where rows correspond to different smoothing techniques and columns correspond to different n-gram orders n = 1, 2, 3, 4. The entries are micro-averaged F-measures. (Note that the naive Bayes result corresponds to n-gram order 1 with add-one smoothing, which is italicized in the table.) The results for the ad hoc OOP classifier and for the SVM classifier are shown in Table 2 and Table 3 respectively, where the columns labeled "Feature #" give the number of features selected.
n            1      2      3      4
Add-one      0.856  0.802  0.797  0.805
Absolute     0.856  0.868  0.867  0.868
Good-Turing  0.856  0.863  0.861  0.862
Linear       0.857  0.861  0.861  0.865
Witten-Bell  0.857  0.860  0.865  0.864

Table 1: Results of the character level language modeling classifier on Chinese data.
Feature # Micro-F1 Feature # Micro-F1
100 0.7808 500 0.7848
200 0.8012 1000 0.7883
300 0.8087 1500 0.7664
400 0.7889 2000 0.7290
Table 2: Results of the character level OOP classifier on Chinese data.
Feature # Micro-F1 Feature # Micro-F1
100 0.811 500 0.817
200 0.813 1000 0.817
300 0.817 1500 0.815
400 0.816 2000 0.816
Table 3: Results of the character level SVM classifier on Chinese data.
4.4 Results on Japanese data

For the Japanese data, we experimented with byte level models (where in fact each Japanese character is represented by two bytes). We used byte level models to avoid possible character level segmentation errors that might be introduced, because we lacked the knowledge to detect misalignment errors in Japanese characters. The results of byte level language modeling classifiers on the Japanese data are shown in Table 4. (Note that the naive Bayes result corresponds to n-gram order 2 with add-one smoothing, which is italicized in the table.) The results for the OOP classifier are shown in Table 5. Note that the SVM is not applied in this situation, since we are conducting multiple category classification directly while the SVM is designed for binary classification. However, Aizawa (2001) reported a performance of about 85% with SVMs by converting the problem into 24 binary classification problems and by performing word segmentation as preprocessing.
Feature # Accuracy Macro-F
100 0.2044 0.1692
200 0.2830 0.2308
300 0.3100 0.2677
400 0.3616 0.3118
500 0.3682 0.3295
1000 0.4416 0.4073
2000 0.4990 0.4510
3000 0.4770 0.4315
4000 0.4462 0.3820
5000 0.3706 0.3139
Table 5: Results of the byte level OOP classifier on Japanese data.
5 Discussion and analysis

We now give a detailed analysis and discussion based on the above results. We first compare the language model based classifiers with the other classifiers, and then analyze the influence of the order n of the n-gram model, the influence of the smoothing method, and the influence of feature selection in the traditional approaches.

5.1 Comparing classifier performance

Table 6 summarizes the best results obtained by each classifier. The results for the language model (LM) classifiers are better than (or at least comparable to) those of the other approaches for both the Chinese and Japanese data, while avoiding word segmentation. The SVM result on Japanese data is obtained from (Aizawa, 2001), where word segmentation was performed as preprocessing. Note that SVM classifiers do not perform as well in our Chinese text classification experiments as they did in English text classification (Dumais, 1998), nor did they in Japanese text classification (Aizawa, 2001). The reason warrants further investigation.
Overall, the language modeling approach appears to demonstrate state of the art performance for Chinese and Japanese text classification. The reasons for the improvement appear to be three-fold. First, the language modeling approach always considers every feature during classification, and can thereby avoid an error-prone feature selection process. Second, the use of n-grams in the model relaxes the restrictive independence assumption of naive Bayes. Third, the techniques of statistical language modeling offer better smoothing methods for coping with features that are unobserved during training.
                            LM     NB     OOP     SVM
Chinese (character level)   0.868  0.856  0.8087  0.817
Japanese (byte level)       0.84   0.66   0.4990  85% (Aizawa, 2001)

Table 6: Comparison of the best classifier results.
5.2 Influence of the n-gram order

The order n is a key factor in n-gram language modeling. An order n that is too small will not capture sufficient information to accurately model character dependencies. On the other hand, a context n that is too large will create sparse data problems in training. In our Chinese experiments, we did not observe significant improvement when using higher order n-gram models. The reason is the early onset of sparse data problems: at the moment, we only have limited training data for the Chinese data set (1M in size, 500 documents per class for training). If more training data were available, the higher order models might begin to show an advantage. For example, in the larger Japanese data set (7M in size on average, 12,931 documents per class for training), we observe an obvious increase in classification performance with higher order models (Table 4). However, here too, when n becomes too large, overfitting begins to occur, as illustrated in Figure 1.

n   Add-one       Absolute      Good-Turing   Linear        Witten-Bell
    Accu.  F-Mac  Accu.  F-Mac  Accu.  F-Mac  Accu.  F-Mac  Accu.  F-Mac
1   0.33   0.29   0.33   0.29   0.34   0.29   0.34   0.29   0.34   0.29
2   0.66   0.63   0.66   0.62   0.66   0.61   0.66   0.63   0.66   0.62
3   0.77   0.68   0.75   0.72   0.75   0.72   0.76   0.73   0.75   0.72
4   0.74   0.51   0.81   0.77   0.81   0.76   0.82   0.76   0.81   0.77
5   0.69   0.42   0.83   0.77   0.83   0.76   0.83   0.76   0.83   0.77
6   0.66   0.42   0.84   0.76   0.83   0.75   0.83   0.75   0.84   0.77
7   0.64   0.38   0.84   0.75   0.83   0.74   0.83   0.74   0.84   0.76
8   0.62   0.31   0.83   0.74   0.83   0.73   0.83   0.73   0.84   0.76

Table 4: Results of the byte level language model classifier on Japanese data.
[Figure 1: Effects of the order n of n-gram language models. Overall accuracy vs. order n, add-one smoothing, Japanese data.]
5.3 Influence of smoothing techniques

Smoothing plays a key role in language modeling. Its effect on classification is illustrated in Figure 2. In both cases we have examined, add-one smoothing is clearly the worst smoothing technique, since it systematically overfits much earlier than the more sophisticated smoothing techniques. The other smoothing techniques do not demonstrate a significant difference in classification accuracy on our Chinese and Japanese data, although they do show a difference in the perplexity of the language models themselves (not shown here to save space). Since our goal is to make a final decision based on the ranking of perplexities, not just their absolute values, a superior smoothing method in the sense of perplexity reduction does not necessarily lead to a better decision from the perspective of categorization accuracy.
[Figure 2: Effects of the smoothing techniques. Accuracy vs. order n of the n-gram models on Chinese and Japanese topic detection, for absolute, Good-Turing, linear, Witten-Bell, and add-one smoothing.]
5.4 Influence of feature selection

The number of features selected is a key factor in determining the classification performance of the OOP and SVM classifiers, as shown in Figure 3. Obviously the OOP classifier is adversely affected by increasing the number of selected features. By contrast, the SVM classifier is very robust with respect to the number of features, which is expected because the complexity of the SVM classifier is determined by the number of support vectors, not the dimensionality of the feature space. In practice, some heuristic search methods are normally used to obtain an optimal subset of features. However, in our language modeling based approach, we avoid explicit feature selection by considering all possible features, and the importance of each individual feature is measured by its contribution to the perplexity (or entropy) value.
[Figure 3: Effects of the number of selected features. Macro-F of the OOP and SVM classifiers vs. the number of selected features.]
5.5 Related Work

The use of n-gram models has also been extensively investigated in information retrieval. However, unlike previous research (Cavnar and Trenkle, 1994; Damashek, 1995), where researchers have used n-grams as features for a traditional feature selection process and then deployed classifiers based on calculating feature-vector similarities, we consider all n-grams as features and determine their importance implicitly by assessing their contribution to perplexity. In this way, we avoid an error-prone feature selection step.

Language modeling for text classification is a relatively new area. In principle, any language model can be used to perform text categorization. However, n-gram models are extremely simple and have been found to be effective in many applications. Teahan and Harper (2001) used a PPM (prediction by partial matching) model for text categorization, where they seek the model that obtains the best compression on a new document.
6 Conclusion

We have presented a simple language model based approach, without word segmentation, for Chinese and Japanese text classification. In comparison to three standard text classifiers, the language modeling approach consistently demonstrates better classification accuracy while avoiding word segmentation and feature selection. Although straightforward, the language modeling approach appears to give state of the art results for Chinese and Japanese text classification.

It has been found that word segmentation in Chinese text retrieval is tricky, and that the relationship between word segmentation and retrieval performance is not monotonic (Peng et al., 2002). However, since text classification and text retrieval are two different tasks, it is not clear whether the same relationship exists in the text classification context. We are currently investigating this issue, and interesting findings have already been observed.

References

A. Aizawa. 2001. Linguistic Techniques to Improve the Performance of Automatic Text Categorization. In Proceedings of NLPRS 2001.

W. Cavnar and J. Trenkle. 1994. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94.

S. Chen and J. Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Harvard University.

M. Damashek. 1995. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, 267(10), pages 843-848.

S. Dumais, J. Platt, D. Heckerman, and M. Sahami. 1998. Inductive Learning Algorithms and Representations for Text Categorization. In Proceedings of CIKM-98.

J. He, A. Tan, and C. Tan. 2001. On Machine Learning Methods for Chinese Document Classification. Applied Intelligence, Special Issue on Text and Web Mining.

T. Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of ECML-98.

F. Peng, X. Huang, D. Schuurmans, and N. Cercone. 2002. Investigating the Relationship of Word Segmentation Performance and Retrieval Performance in Chinese IR. In Proceedings of COLING 2002.

F. Peng and D. Schuurmans. 2003. Combining Naive Bayes and N-Gram Language Models for Text Classification. In Proceedings of ECIR 2003.

F. Sebastiani. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1).

W. Teahan and D. Harper. 2001. Using Compression-Based Language Models for Text Categorization. In Proceedings of LMIR 2001.

V. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.

Y. Yang. 1999. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval Journal.