Machine Learning Methods for 
Chinese Web Page Categorization 
Ji He 1, Ah-Hwee Tan 2 and Chew-Lira Tan 1 
1School of Computing, National University of Singapore 
10 Kent Ridge Crescent, Singapore 119260 
(heji,tancl}@comp.nus.edu.sg 
2Kent Ridge Digital Labs 
21 Heng Mui Keng Terrace, Singapore 119613 
ahhwee@krdl.org.sg 
Abstract 
This paper reports our evaluation of 
k Nearest Neighbor (kNN), Support 
Vector Machines (SVM), and Adap- 
tive Resonance Associative Map 
(ARAM) on Chinese web page clas- 
sification. Benchmark experiments 
based on a Chinese web corpus 
showed that their predictive per- 
formance were roughly comparable 
although ARAM and kNN slightly 
outperformed SVM in small cate- 
gories. In addition, inserting rules 
into ARAM helped to improve per- 
formance, especially for small well- 
defined categories. 
1 Introduction 
Text categorization refers to the task of au- 
tomatically assigning one or multiple pre- 
defined category labels to free text docu- 
ments. Whereas an extensive range of meth- 
ods has been applied to English text cate- 
gorization, relatively few have been bench- 
marked for Chinese text categorization. Typi- 
cal approaches to Chinese text categorization, 
such as Naive Bayes (NB) (Zhu, 1987), Vector 
Space Model (VSM) (Zou et al., 1998; Zou et 
al., 1999) and Linear List Square Fit (LLSF) 
(Cao et al., 1999; Yang, 1994), have well stud- 
ied theoretical basis derived from the informa- 
tion retrieval research, but are not known to 
be the best classifiers (Yang and Liu, 1999; 
Yang, 1999). In addition, there is a lack of 
publicly available Chinese corpus for evaluat- 
ing Chinese text categorization systems. 
This paper reports our applications of 
three statistical machine learning methods, 
namely k Nearest Neighbor system (kNN) 
(Dasarathy, 1991), Support Vector Machines 
(SVM) (Cortes and Vapnik, 1995), and Adap- 
tive Resonance Associative Map (ARAM) 
(Tan, 1995) to Chinese web page categoriza- 
tion. kNN and SVM have been reported as 
the top performing methods for English text 
categorization (Yang and Liu, 1999). ARAM 
belongs to a popularly known family of pre- 
dictive self-organizing neural networks which 
until recently has not been used for docu- 
ment classification. The trio has been eval- 
uated based on a Chinese corpus consisting 
of news articles extracted from People's Daily 
(He et al., 2000). This article reports the ex- 
periments of a much more challenging task in 
classifying Chinese web pages. The Chinese 
web corpus was created by downloading from 
various Chinese web sites covering a wide vari- 
ety of topics. There is a great diversity among 
the web pages in terms of document length, 
style, and content. The objectives of our ex- 
periments are two-folded. First, we examine 
and compare the capabilities of these meth- 
ods in learning categorization knowledge from 
real-fife web docllments. Second, we investi- 
gate if incorporating domain knowledge de- 
rived from the category description can en- 
hance ARAM's predictive performance. 
The rest of this article is organized as fol- 
lows. Section 2 describes our choice of the fea- 
ture selection and extraction methods. Sec- 
tion 3 gives a sllrnrnary of the kNN and SVM, 
and presents the less familiar ARAM algo- 
rithm in more details. Section 4 presents our 
evaluation paradigm and reports the experi- 
93 
mental results. 
2 Feature Selection and Extraction 
A pre-requisite of text categorization is to ex- 
tract a suitable feature representation of the 
documents. Typically, word stems are sug- 
gested as the representation units by infor- 
mation retrieval research. However, unlike 
English and other Indo-European s, 
Chinese text does not have a natural delim- 
iter between words. As a consequence, word 
segmentation is a major issue in Chinese doc- 
ument processing. Chinese word segmenta- 
tion methods have been extensively discussed 
in the literature. Unfortunately perfect preci- 
sion and disambiguation cannot be reached. 
As a result, the inherent errors caused by 
word segmentation always remains as a prob- 
lem in Chinese information processing. 
In our experiments, a word-class bi-gram 
model is adopted to segment each training 
document into a set of tokens. The lexi- 
con used by the segmentation model contains 
64,000 words in 1,006 classes. High precision 
segmentation is not the focus of our work. In- 
stead we aim to compare different classifier's 
performance on noisy document set as long as 
the errors caused by word segmentation are 
reasonably low. 
To select keyword features for classifica- 
tion, X (CHI) statistics is adopted as the 
ranking metric in our experiments. A prior 
study on several well-known corpora in- 
cluding Reuters-21578 and OHSUMED has 
proven that CHI statistics generally outper- 
forms other feature ranking measures, such 
as term strength (TS), document frequency 
(DF), mutual information (MI), and informa- 
tion gain (IG) (Yang and J.P, 1997). 
During keyword extraction, the document 
is first segmented and converted into a 
keyword frequency vector (t fl, t f2,..., t.f M ) , 
where tfi is the in-document term frequency 
of keyword wi, and M is the number of the 
keyword features selected. A term weight- 
ing method based on inverse document .fre- 
quency (IDF) (Salton, 1988) and the L1- 
norm~llzation are then applied on the fre- 
quency vector to produce the keyword feature 
vector 
(Xl, X2, • - • , XM) 
x = max{xi} ' (i) 
in which xi is computed by 
zi = (1 + log 2 tfi) log2 ~ (2) 
ni ' 
where n is the number of documents in the 
whole training set, and ni is the number of 
training documents in which the keyword wi 
occurs at least once. 
3 The Classifiers 
3.1 k Nearest Neighbor 
k Nearest Neighbor (kNN) is a tradi- 
tional statistical pattern recognition algo- 
rithm (Dasarathy, 1991). It has been studied 
extensively for text categorization (Yang and 
Liu, 1999). In essence, kNN makes the predic- 
tion based on the k training patterns that are 
closest to the unseen (test) pattern, accord- 
ing to a distance metric. The distance metric 
that measures the similarity between two nor- 
malized patterns can be either a simple LI- 
distance function or a L2-distance function, 
such as the plain Euclidean distance defined 
by 
D(a,b)=~s~. (a~-bi)2. (3) 
The class assignment to the test pattern is 
based on the class assignment of the closest k 
training patterns. A commonly used method 
is to label the test pattern with the class that 
has the most instances among the k nearest 
neighbors. Specifically, the class index y(x) 
assigned to the test pattern x is given by 
yCx) ..-.. arg'max, {n(dj, )ld.:j kNN}, (4) 
where n(dj, ~) is the number of training pat- 
tern dj in the k nearest neighbor set that are 
associated with class c4. 
The drawback of kNN is the difficulty in 
deciding a optimal k value. Typically it has 
to be determined through conducting a series 
of experiments using different values. 
94 
3.2 Support Vector Machines 
Support Vector Machines (SVM) is a rela- 
tively new class of machine learning tech- 
niques first introduced by Vapnik (Cortes 
and Vapnik, 1995). Based on the structural 
risk minimization principle from the compu- 
tational learning theory, SVM seeks a decision 
surface to separate the tralning data points 
into two classes and makes decisions based on 
the support vectors that are selected as the 
only effective elements from the training set. 
Given a set of linearly separable points 
s = {x  Rnli = 1,2,...,N}, each point xi 
belongs to one of the two classes, labeled as 
yiE{-1,+l}. A separating hyper-plane di- 
vides S into two sides, each side containing 
points with the same class label only. The 
separating hyper-plane can be identified by 
the pair (w,b) that satisfies 
w-x+b=0 
and yi(w'xi + b)>l (5) 
for i = 1, 2,..., N; where the dot product op- 
eration • is defined by 
w. x ---- ~ wixi (6) 
for vectors w and x. Thus the goal of the 
SVM learning is to find the optimal separat- 
ing hyper-plane (OSH) that has the maximal 
margin to both sides. This can be formula- 
rized as: 
minimize ½w. w 
subject to yi(w.xi + b)>l (7) 
The points that are closest to the OSH are 
termed support vectors (Fig. 1). 
The SVM problem can be extended to lin- 
early non-separable case and non-linear case. 
Various quadratic programming algorithms 
have been proposed and extensively studied 
to solve the SVM problem (Cortes and Vap- 
nik, 1995; Joachims, 1998; Joacbims, 1999). 
During classification, SVM makes decision 
based on the OSH instead of the whole 
training set. It simply finds out on which 
side of the OSH the test pattern is located. 
0 0 O0 
o o , 
J///--.. 
Figure 1: Separating hyperplanes (the set 
of solid lines), optimal separating hyperpIane 
(the bold solid line), and support vectors (data 
points on the dashed lines). The dashed lines 
identify the max margin. 
This property makes SVM highly compet- 
itive, compared with other traditional pat- 
tern recognition methods, in terms of com- 
putational efficiency and predictive accuracy 
(Yang and Liu, 1999). 
In recent years, Joachims has done much re- 
search on the application of SVM to text cat- 
egorization (Joachims, 1998). His SVM zight 
system published via http://www-ai.cs.uni- 
dortmund.de/FORSCHUNG/VERFAHREN/ 
SVM_LIGHT/svm_light.eng.html is used in 
our benchmark experiments. 
3.3 Adaptive Resonance Associative 
Map 
Adaptive Resonance Associative Map 
(ARAM) is a class of predictive serf- 
organizing neural networks that performs 
incremental supervised learning of recog- 
nition categories (pattern classes) and 
multidimensional maps of patterns. An 
ARAM system can be visualized as two 
overlapping Adaptive Resonance Theory 
(ART) modules consisting of two input fields 
F~ and F1 b with an F2 category field (Tan, 
1995; Tan, 1997) (Fig. 2). For classification 
problems, the F~ field serves as the input 
field containing the document feature vector 
and the F1 b field serves as the output field 
containing the class prediction vector. The 
F2 field contains the activities of the recogni- 
tion categories that are used to encode the 
patterns. 
95 
.. I ±/± I 
x !.'1 x, 
ARTa A B ARTb 
Figure 2: The Adaptive Resonance Associa- 
tive Map architecture 
When performing classification tasks, 
ARAM formulates recognition categories of 
input patterns, and associates each cate- 
gory with its respective prediction. During 
learning, given an input pattern (document 
feature) presented at the F~ input layer 
and an output pattern (known class label) 
presented at the Fib output field, the category 
field F2 selects a winner that receives the 
largest overall input. The winning node se- 
lected in F2 then triggers a top-down priming 
on F~ and F~, monitored by separate reset 
mechanisms. Code stabilization is ensured 
by restricting encoding to states where 
resonance are reached in both modules. 
By synchronizing the un.qupervised catego- 
rization of two pattern sets, ARAM learns 
supervised mapping between the pattern sets. 
Due to the code stabilization mechanism, 
fast learning in a real-time environment is 
feasible. 
The knowledge that ARAM discovers dur- 
ing learning is compatible with IF-THEN 
rule-based presentation. Specifically, each 
node in the FF2 field represents a recognition 
category associating the F~ patterns with the 
F1 b output vectors. Learned weight vectors, 
one for each F2 node, constitute a set of rules 
that link antecedents to consequences. At any 
point during the incremental learning process, 
the system architecture can be translated into 
a compact set of rules. Similarly, domain 
knowledge in the form of IF-THEN rules can 
be inserted into ARAM architecture. 
The ART modules used in ARAM can be 
ART 1, which categorizes binary patterns, or 
analog ART modules such as ART 2, ART 2- 
A, and fuzzy ART, which categorize both bi- 
nary and analog patterns. The fuzzy ARAM 
(Tan, 1995) algorithm based on fuzzy ART 
(Carpenter et al., 1991) is introduced below. 
Parameters: Fuzzy ARAM dynamics are 
determined by the choice parameters aa > 0 
and ab > 0; the learning rates ~a E \[0, 1\] and 
~b E \[0, 1\]; the vigilance parameters Pa E \[0, 1\] 
and Pb E \[0, 1\]; and the contribution parame- 
ter '7 E \[0, 1\]. 
Weight vectors: Each F2 category node j 
is associated with two adaptive weight tem- 
plates w~ and w~. Initially, all category nodes 
are uncommitted and all weights equal ones. 
After a category node is selected for encoding, 
it becomes committed. 
Category choice: Given the F~ and F1 b in- 
put vectors A and B, for each F2 node j, the 
choice function Tj is defined by 
IA Aw~l IB A w~l 
= ~a~ + Iw~'l + (1 --~)~b + Iw~l' (S) 
where the fuzzy AND operation A is defined 
by 
(p A q)i --~ min(pi, qi), (9) 
and where the norm I-I is defined by 
IPl -= ~Pi (10) 
i 
for vectors p and q. 
The system is said to make a choice when 
at most one F2 node can become active. The 
choice is indexed at J where 
Tj = ma,x{Tj : for all F2 node j}. (11) 
When a category choice is made at node J, 
yj = 1; andyj =0 for all j ~ J. 
Resonance or reset: Resonance occurs if 
the match .functions, m~ and m~, meet the 
vigilance criteria in their respective modules: 
IA A w~l m~ = \[AI _> pa (12) 
and 
m~ = IB A w~l> Pb. (13) 
IBI - 
96 
Learning then ensues, as defined below. If 
any of the vigilance constraints is violated, 
mismatch reset occurs in which the value of 
the choice function Tj is set to 0 for the du- 
ration of the i.nput presentation. The search 
process repeats to select another new index J 
until resonance is achieved. 
Learning: Once the search ends, the weight 
vectors w~ and w~ are updated according to 
the equations 
W~ (new) -- (1 ,~ iRa(old) - -..,,,j +&(A^w3 
(14) 
and 
wb, cnew)~ (i ~ ~_ b(old) = - ,bJWj + ~b(B A wbj (Old)) 
(15) 
respectively. Fast learning corresponds to set- 
ting/~a =/~b = 1 for committed nodes. 
Classification: During classification, using 
the choice rule, only the F2 node J that re- 
ceives maximal F~ ~ F2 input Tj predicts 
ARTb output. In simulations, 
1 if j = J where Tj > Tk 
yj = for all k ¢ J (16) 
0 otherwise. 
The F1 b activity vector x b is given by 
J 
The output prediction vector B is then given 
by 
B ~ (bl, b2,.., bN) = X b (18) 
where bi indicates the likelihood or confidence 
of assigning a pattern to category i. 
Rule insertion: Rule insertion proceeds in 
two phases. The first phase parses the rules 
for keyword features. When a new keyword is 
encountered, it is added to a keyword feature 
table containing keywords obtained through 
automatic feature selection from training 
documents. Based on the keyword feature 
table, the second phase of rule insertion 
translates each rule into a M-dimensional 
vector a and a N-dimensional vector b, where 
M is the total number of features in the 
keyword feature table and N is the number 
of categories. Given a rule of the following 
format, 
IF Xl, X2, - • • , Xm 
THEN Yl, Y2,.-., Yn 
where xt,..., xm are antecedents and 
Yt,... ,Yn are consequences, the algorithm 
derives a pair of vectors a and b such that 
• for each index i = 1,..., M, 
1 ifwi = xj for some j 6 {1,...,m} 
ai = 0 otherwise 
(19) 
where wi is the i th entry in the keyword fea- 
ture table; and for each index i = 1,..., N, 
1 ifwi = yj for some j E {1,...,n} 
bi = 0 otherwise 
(20) 
where wi is the class label of the category i. 
The vector pairs derived from the rules are 
then used as training patterns to initialize a 
ARAM network. During rule insertion, the 
vigilance parameters Pa and Pb are each set 
to 1 to ensure that only identical attribute 
vectors are grouped into one recognition cat- 
egory. Contradictory symbolic rules are de- 
tected during rule insertion when identical in- 
put attribute vectors are associated with dis- 
tinct output attribute vectors. 
4 Empirical Evaluation 
4.1 The Chinese Web Corpus 
The Chinese web corpus, colleeted in-house, 
consists of web pages downloaded from vari- 
ous Chinese web sites covering a wide variety 
of topics. Our experiments are based on a 
subset of the corpus consisting of 8 top-level 
categories and over 6,000 documents. For 
each category, we conduct binary classifica- 
tion experiments in which we tag the cur- 
rent category as the positive category and the 
other seven categories as the negative cate- 
gories. The corpus is further partitioned into 
training and testing data such that the num- 
ber of the training documents is at least 2.5 
times of that of the testing documents (Table 
1). 
97 
Table 1: The eight top-level categories in the 
Chinese web corpus, and the training and test 
samples by category. 
Category Description Train Test Art 
Art Topic regarding 325 102 Belief 
literature, art 
Belief Philosophy and 131 40 B/z 
religious beliefs 
Biz Business 2647 727 Edu 
Edu Education 205 77 
IT Computer and 1085 309 /T 
internet informatics 
Joy Online fresh, 636 216 
interesting info Joy 
Med Medical care 155 57 Meal 
related web sites 
Sci Various kinds 119 39 Sci 
of science 
Table 2: A sample set of 19 rules generated 
based on the accompanied description of the 
Chinese web categories. 
:- ~ (Chinese painting) 
:- ~" (.pray) ~.~r~ (rabbi) 
:- {~ (promotion) ~ (rrcal estate) 
~:P (cli~O 
:- ~-~ (undergradua~) -~-~ (supervisor) 
~2N (campus) 
:- ~:~k (version) ~ (virus) 
g/~k~ (ffirewan) ~ (program) 
:- ~'~ (lantern riddle) 
:- ~ (health cam) ~J~ (pmscriplion) 
\[~ (medical jurisprudence) 
:- ~fl ~ ~ (supernaturalism) 
~ (high technology) 
4.2 Experiment Paradigm 
kNN experiments used the plain Euclidean 
distance defined by equation (3) as the simi- 
laxity measure. On each pattern set contain- 
ing a varying number of documents, different 
values of k ranging ~om 1 to 29 were tested 
and the best results were recorded. Only odd 
k were used to ensure that a prediction can 
always be made. 
SVM experiments used the default built-in 
inductive SVM parameter set in  VM tight, 
which is described in detail on the web site 
and elsewhere (Joachims, 1999). 
ARAM experiments employed a standard 
set of parameter values of fuzzy ARAM. In 
addition, using a voting strategy, 5 ARAM 
systems were trained using the same set of 
patterns in different orders of presentation 
and were combined to yield a final prediction 
vector. 
To derive domain theory on web page clas- 
sification, a varying number (ranging from 10 
to 30) of trainiug documents from each cate- 
gory were reviewed. A set of domain knowl- 
edge consists of 56 rules with about one to 10 
rules for each category was generated. Only 
positive rules that link keyword antecedents 
to positive category consequences were in- 
cluded (Table 2). 
4.3 Performance Measures 
Our experiments adopt the most commonly 
used performance measures, including the re- 
call, precision, and F1 measures. Recall (R) is 
the percentage of the documents for a given 
category that are classified correctly. Preci- 
sion (P) is the percentage of the predicted 
documents for a given category that are clas- 
sifted correctly. Ft rating is one of the com- 
monly used measures to combine R and P into 
a single rating, defined as 
2RP 
Ft = (R + P)" (21) 
These scores are calculated for a series of bi- 
nary classification experiments, one for each 
category. Micro-averaged scores and macro- 
averaged scores on the whole corpus are 
then produced across the experiments. With 
micro-averaging, the performance measures 
are produced across the documents by adding 
up all the documents counts across the differ- 
ent tests, and calculating using these summed 
values. With macro-averaging, each category 
is assigned with the same weight and per- 
formance measures are calculated across the 
categories. It is understandable that micro- 
averaged scores and macro-averaged scores re- 
flect a classifier's performance on large cate- 
gories and small categories respectively (Yang 
and Liu, 1999). 
98 
Table 3: 
classifiers on the Chinese web 
kNN 
Category 
Art 
Belief 
Biz 
Edu IT 
Joy Med 
$ci 
Predictive performance of the four 
P R 
corpus. 
! SVM 
F1 ~P R 
.440 .398 .402 
.548 .556 .500 
.706 .692 .703 
.365 .602 .074 
.321 .394 .307 
.291 .462 .255 
.494 .330 .544 
.213! .137 .179 
.795 .304 
.773 .425 
.724 .689 
.380 .351 
.309 .333 
.381 .236 
.833 .351 
.625 .128 
F~ 
.400 
.526 
.698 
.180 
.345 
.328 
.411 
.156 
Micro-ave. .584 .482 .528 .523 .521 .522 
Macro-ave. .600 .352 .422 .384 .454 .380 
ARAM ARAMw/rules 
Category P R Fx P R F1 
Art Belief 
Biz Edu 
IT Joy 
Med Sei 
.653 .461 .540 
.750 .750 .750 
.742 .622 .677 
.421 .312 .358 
.444 .259 .327 
.600 .208 .309 
.421 .421 .421 
.292 .179 .222 
.706 .471 .565 
.714 .750 .732 
.745 .604 .667 
.420 .273 .331 
.437 .291 .350 
.618 .194 .296 
.448 .456 .452 
.409 .231 .295 
Micro-ave. .619 .453 .523 .628 .450 .524 
Macro-ave. .540 .402 .451 .562 .409 .461 
4.4 Results and Discussions 
Table 3 summarizes the three classifier's per- 
formances on the test corpus in terms of pre- 
cision, recall, and F1 measures. The micro- 
averaged scores produced by the trio, which 
were predominantly determined by the clas- 
sifters' performance on the large categories 
(such as Biz, IT, and Joy), were roughly com- 
parable. Among the three, kNN seemed to 
be marginally better than SVM and ARAM. 
Inserting rules into ARAM did not have a 
significant impact. This showed that do- 
main knowledge was not very useful for cat- 
egories that already have a large number of 
training examples. The differences in the 
macro-averaged scores produced by the three 
classifiers, however, were much more signifi- 
cant. The macro-averaged F1 score obtained 
by ARAM was noticeably better than that of 
kNN, which in turn was higher than that of 
SVM. This indicates that ARAM (and kNN) 
tends to outperform SVM in small categories 
that have a smaller number of training pat- 
terns. 
We are particularly interested in the classi- 
fier's learning ability on small categories. In 
certain applications, such as personalized con- 
tent delivery, a large pre-labeled training cor- 
pus may not be available. Therefore, a classi- 
fiefs ability of learning from a small training 
pattern set is a major concern. The different 
approaches adopted by these three classifiers 
in learning categorization knowledge are best 
• seen in the light of the distinct learning pe- 
culiarities they exhibit on the small training 
sets. 
kNN is a lazy learning method in the sense 
that it does not carry out any off-line learning 
to generate a particular category knowledge 
representation. Instead, kNN performs on- 
line scoring to find the training patterns that 
are nearest to a test pattern and makes the 
decision based on the statistical presumption 
that patterns in the same category have simi- 
lar feature representations. The presumption 
is basically true to most pattern instances. 
Thus kNN exhibits a relatively stable perfor- 
manee across small and large categories. 
SVM identifies optimal separating hyper- 
plane (OSH) across the training data points 
and makes classification decisions based on 
the representative data instances (known as 
support vectors). Compared with kNN, SVM 
is more computationally efficient during clas- 
sification for large-scale training sets. How- 
ever, the OSH generated using small train- 
ing sets may not be very representative, espe- 
cially when the training patterns are sparsely 
distributed and there is a relatively narrow 
margin between the positive and negative pat- 
terns. In our experiments on small train- 
ing sets including Art, Belief, Edu, and Sci, 
SVM's performance were generally lower than 
those of kNN and ARAM. 
ARAM generates recognition categories 
from the input training patterns. The incre- 
mentally learned rules abstract the major rep- 
resentations of the training patterns and elim- 
inate minor inconsistencies in the data pat- 
terns. During classifying, it works in a sim- 
ilar fashion as kNN. The major difference is 
that AI:tAM uses the learned recognition cat- 
egories as the similarity-scoring unit whereas 
kNN uses the raw in-processed training pat- 
terns as the distance-scoring unit. It follows 
99 
that ARAM is notably more scalable than 
kNN by its pattern abstraction capability and 
therefore is more suitable for handling very 
large data sets. 
The overall improvement in predictive per- 
formance obtained by inserting rules into 
ARAM is also of particular interest to us. 
ARAM's performance was more likely to be 
improved by rule insertion in categories that 
are well defined and have relatively fewer 
numbers of training patterns. As long as a 
user is able to abstract the category knowl- 
edge into certain specific rule representa- 
tion, domain knowledge could complement 
the limited knowledge acquired through a 
small training set quite effectively. 
Acknowledgements 
We would like to thank our colleagues, Jian 
Su and Guo-Dong Zhou, for providing the 
Chinese segmentation software and Fon-Lin 
Lai for his valuable suggestions in designing 
the experiment system. In addition, we thank 
T. Joachims at the University of Dortmund 
for making SVM light available. 

References 
Suqing Cao, Fuhu Zeng, and Huanguang Cao. 
1999. A mathematical model for automatic 
Chinese text categorization. Journal of the 
China Society for Scientific and Technical In- 
formation \[in Chinese\], 1999(1). 
G.A. Carpenter, S. Grossberg, and D.B. Rosen. 
1991. Fuzzy ART: Fast stable learning and cat- 
egorization of analog patterns by an adaptive 
resonance system. Neural Networks, 4:759-771. 
C. Cortes and V. Vapnik. 1995. Support vector 
networks. Machine learning, 20:273-297. 
Belur V. Dasarathy. 1991. Nearest Neighbor (NN) 
Norms: NN Pattern Classification Techniques. 
IEEE Computer Society Press, Las Alamitos, 
California. 
Ji He, A.-H. Tan, and Chew-Lira Tan. 2000. 
A comparative study on Chinese text catego- 
rization methods. In PRICAI'2000 Interna- 
tional Workshop on Text and Web Mining, Mel- 
bourne, August. 
T. Joachims. 1998. Text categorization with sup- 
port vector machines: Learning with many rel- 
evant features. In Proceedings of the European 
Conference on Machine Learning, Springer. 
T. Joachims. 1999. Making large-Scales SVM 
learning Pracical. Advances in Kernel Methods 
- Support Vector Learning. B. Scholkopf, C. 
Burges and A. Smola (ed.), MIT Press. 
Salton. 1988. Term weighting approaches in au- 
tomatic text retrieval. Information Processing 
and Management, 24(5):513-523. 
A.-H. Tan. 1995. Adaptive resonance associative 
map. Neural Networks, 8(3):437--446. 
A.-H. Tan. 1997. Cascade ARTMAP: Integrat- 
ing neural computation and symbolic knowl- 
edge processing. IEEE Transactions on Neural 
Networks, 8(2):237-235. 
Y. Yang and Pedersen J.P. 1997. A comparative 
study on feature selection in text categoriza- 
tion. In the Fourteenth International Confer- 
ence on Machine Learning (ICML'97), pages 
412-420. 
Y. Yang and X. Liu. 1999. A re-examination 
of text categorization methods. In ~nd An- 
nual International ACM SIGIR Conference on 
Research and Development in Information Re- 
trieval (SIGIR'99), pages 42-49. 
Y. Yang. 1994. Expert network: Effective and ef- 
ficient learning from human decisions in text 
categorization and retrieval. In 17th Annual 
International A CM SIGIR Conference on Re- 
search and Development in Information Re- 
trieval (SIGIR '94). 
Y. Yang. 1999. An evaluation of statistical ap- 
proaches to text categorization. Journal of In- 
formation Retrieval, 1(1/2):67-88. 
Lanjuan Zhu. 1987. The theory and experiments 
on automatic Chinese documents classification. 
Journal of the China Society for Scientific and 
Technical Information \[in Chinese\], 1987(6). 
Tao Zou, Ji-Cheng Wang, Yuan Huang, and Fu- 
Yan Zhang. 1998. The design and implementa- 
tion of an automatic Chinese documents classi- 
fication system. Journal for Chinese Informa- 
tion \[in Chinese\], 1998(2). 
Tao Zou, Yuan Huang, and Fuyan Zhang. 1999. 
Technology of information mining on WWW. 
Journal of the China Society for Scientific and 
Technical Information \[in Chinese\], 1999(4). 
