A Comparison of Manual and Automatic Constructions of Category
Hierarchy for Classifying Large Corpora
Fumiyo Fukumoto
Interdisciplinary Graduate
School of Medicine and Engineering
Univ. of Yamanashi
fukumoto@skye.esb.yamanashi.ac.jp
Yoshimi Suzuki
Interdisciplinary Graduate
School of Medicine and Engineering
Univ. of Yamanashi
ysuzuki@ccn.yamanashi.ac.jp
Abstract
We address the problem of dealing with a large
collection of data, and investigate the use of
automatically constructing a category hierarchy
from a given set of categories to improve clas-
sification of large corpora. We use two well-
known techniques, a partitioning clustering
method, k-means, and a loss function, to create the category
hierarchy. k-means clusters the given cate-
gories into a hierarchy. To select the proper num-
ber of clusters k, we use a loss function which mea-
sures the degree of our disappointment in any
differences between the true distribution over
inputs and the learner's prediction. Once the
optimal number of clusters k is selected, the procedure
is repeated within each cluster. Our evaluation
using the 1996 Reuters corpus, which consists
of 806,791 documents, shows that the automati-
cally constructed hierarchy improves classifi-
cation accuracy.
1 Introduction
Text classification has an important role to play, espe-
cially with the recent explosion of readily available on-
line documents. Much of the previous work on text clas-
sification uses statistical and machine learning techniques.
However, the increasing number of documents and cate-
gories often hampers the development of practical classifi-
cation systems, mainly because of statistical, computational, and
representational problems (Dietterich, 2000). One strat-
egy for solving these problems is to use category hierar-
chies. The idea behind this is that when humans organize
extensive data sets into fine-grained categories, category
hierarchies are often employed to make the large collec-
tion of categories more manageable.
McCallum et al. presented a method called 'shrink-
age' to improve parameter estimates by taking advan-
tage of a hierarchy (McCallum, 1999). They tested their
method using three different real-world datasets: 20,000
articles from UseNet, 6,440 web pages from the In-
dustry Sector data, and 14,831 pages from Yahoo, and
showed improved performance. Dumais et al. also de-
scribed a method for hierarchical classification of Web
content, using 50,078 Web pages for training and
10,024 for testing, with promising results (Dumais and
Chen, 2000). Both of these approaches use hierarchies that are
manually constructed. Such hierarchies require costly human
intervention, since the number of categories and the size
of the target corpora are usually very large. Further, man-
ually constructed hierarchies are kept very general in order to
meet the needs of the many newly accessible sources of
text data, and are sometimes constructed by
relying on human intuition. It is therefore difficult to
keep them consistent, which is problematic for classifying
text automatically.
In this paper, we address the problem of dealing with a
large collection of data, and propose a method to gener-
ate a category hierarchy for text classification. Our method
uses two well-known techniques, the partitioning clustering
method called k-means and a loss function, to create the
hierarchical structure. k-means partitions a set of given
categories into k clusters, locally minimizing the average
squared distance between the data points and the clus-
ter centers. The algorithm iterates over the
data it is permitted to classify during each
iteration, and constructs the category hierarchy. To select the
proper number of clusters k during each iteration, we use a loss
function which measures the degree of our disappoint-
ment in any differences between the true distribution over
inputs and the learner's prediction. Another focus of this
paper is whether or not a large collection of data, the
1996 Reuters corpus, helps to generate a category hier-
archy which is used to classify documents.
The rest of the paper is organized as follows. The next
section presents a brief review of earlier work. We then
explain the basic framework for constructing category hi-
erarchy, and describe hierarchical classification. Finally,
we report some experiments using the 1996 Reuters cor-
pus with a discussion of evaluation.
2 Related Work
Automatically generating hierarchies is not a new goal
for NLP and its applications, and there have
been several attempts to create various types of hier-
archies (Koller and Sahami, 1997), (Nevill-Manning et
al., 1999), (Sanderson and Croft, 1999). One early attempt
is that of Crouch (1988), who automatically generated
thesauri. Cutting et al. proposed a method called Scat-
ter/Gather in which clustering is used to create document
hierarchies(Cutting et al., 1992). Lawrie et al. proposed a
method to create domain specific hierarchies that can be
used for browsing a document set and locating relevant
documents(Lawrie and Croft, 2000).
Several researchers have also in-
vestigated the use of automatically generated hierarchies
for a particular application, text classification. Iwayama
et al. presented a probabilistic clustering algorithm called
Hierarchical Bayesian Clustering (HBC) to construct a set
of clusters for text classification (Iwayama and Tokunaga,
1995). They focused on a probabilistic model of text
categorisation that searches for the most likely clusters
to which an unseen document should be assigned.
They tested their method using two data sets:
Japanese dictionary data called 'Gendai yogo no kiso-
tisiki', which contains 18,476 word entries, and a collec-
tion of English news stories from the Wall Street Jour-
nal, which consists of 12,380 articles. The HBC model
showed a 2–3% improvement in breakeven point over the
non-hierarchical model.
Weigend et al. proposed a method to generate hi-
erarchies using a probabilistic approach(Weigend et al.,
1999). They used an exploratory cluster analysis to create
hierarchies, and this was then verified by human assign-
ments. They used the Reuters-22173 collection and defined two
levels of categories: 5 top-level categories called meta-topics (agriculture, en-
ergy, foreign exchange, metals, and a miscellaneous cate-
gory), and other category groups, each as-
signed to its meta-topic. Their method is based on a prob-
abilistic approach that frames the learning problem as one
of function approximation for the posterior probability of
the topic vector given the input vector. They used a neu-
ral net architecture and explored several input represen-
tations. Information from each level of the hierarchy is
combined in a multiplicative fashion, so no hard decisions
have to be made except at the leaf nodes. They found
a 5% advantage in average precision for the hierarchical
representation when using words.
All of the methods mentioned above perform well, but the
collections they were tested on are small compared with many real-
istic applications. In this paper, we investigate whether a large
collection of data helps to generate a hierarchy, i.e. whether
the result is statistically significantly better than results which uti-
lize a hand-built hierarchical structure. This question has not previously
been explored in the context of hierarchical classification,
beyond the reported improvements of hierarchical models over
flat models.
3 Generating Hierarchical Structure
3.1 Document Representation
To generate hierarchies, we need to address the question
of how to represent texts (Cutting et al., 1992), (Lawrie
and Croft, 2000). The total number of words in the corpus
is too large to use directly, and doing so would be computationally very expensive.
We use two statistical techniques to reduce the number
of inputs. The first is to use category vectors instead of
document vectors. The number of input vectors is then not
the number of training documents, but rather the
number of different categories. This makes the
large collection of data more manageable. The second is
a well-known technique, the mutual information measure
between a word and a category, which we use as the value in
each dimension of the vector (Cover and Thomas, 1991).
More formally, each category in the training set is rep-
resented using a vector of weighted words, which we call a
category vector. Category vectors are represented as points
in Euclidean space in the k-means clustering algorithm.
Let $c_j$ be one of the categories $c_1, \cdots, c_m$, and let
the vector assigned to $c_j$ be $(c_{1j}, c_{2j}, \cdots, c_{nj})$. The mu-
tual information $MI(W, Cat)$ between a word $W$ and a
category $Cat$ is defined as:
\[
MI(W, Cat) = \sum_{W \in \{w, \bar{w}\}} \sum_{Cat \in \{c, \bar{c}\}} P(W, Cat) \log \frac{P(W, Cat)}{P(W) P(Cat)} \tag{1}
\]
Each $c_{ij}$ ($1 \le i \le n$) is the value of the mutual information
between $w_i$ and $c_j$. We select the 1,000 words with the
largest mutual information for each category.
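As an illustration, formula (1) and the feature selection step can be sketched from document-level counts as follows. This is a minimal sketch, not the authors' code; the `documents` structure (one pair of a category set and a word set per document) is an assumption made for the example.

```python
import math
from collections import Counter, defaultdict

def mutual_information(n, n_w, n_c, n_wc):
    """MI(W, Cat) of formula (1) from document counts: n = corpus size,
    n_w = documents containing w, n_c = documents labeled c, and
    n_wc = documents containing w and labeled c. The sum runs over the
    four cells (w, c), (w, not-c), (not-w, c), (not-w, not-c)."""
    mi = 0.0
    for joint, marg_w, marg_c in [(n_wc, n_w, n_c),
                                  (n_w - n_wc, n_w, n - n_c),
                                  (n_c - n_wc, n - n_w, n_c),
                                  (n - n_w - n_c + n_wc, n - n_w, n - n_c)]:
        if joint > 0:
            p = joint / n
            mi += p * math.log(p / ((marg_w / n) * (marg_c / n)))
    return mi

def category_vectors(documents, n_features=1000):
    """Build one MI-weighted vector (a sparse dict) per category, keeping
    the n_features highest-MI words, as in Section 3.1. `documents` is a
    list of (set_of_categories, set_of_words) pairs (assumed input)."""
    n = len(documents)
    word_df, cat_df, joint = Counter(), Counter(), defaultdict(Counter)
    for cats, words in documents:
        word_df.update(words)
        for c in cats:
            cat_df[c] += 1
            joint[c].update(words)
    vectors = {}
    for c in cat_df:
        scored = {w: mutual_information(n, word_df[w], cat_df[c], n_wc)
                  for w, n_wc in joint[c].items()}
        top = sorted(scored, key=scored.get, reverse=True)[:n_features]
        vectors[c] = {w: scored[w] for w in top}
    return vectors
```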
3.2 Clustering
Clustering has long been used to group data, with many
applications (Jain and Dubes, 1988). We use a simple
clustering technique, k-means, to group categories and
construct a category hierarchy (Duda and Hart, 1973). k-
means is based on iterative relocation that partitions a
dataset into k clusters. The algorithm keeps track of the
centroids, i.e. seed points, of the subsets, and proceeds in
iterations. In each iteration, the following is performed:
(i) for each point x, find the seed point which is closest
to x and associate x with this seed point; (ii) re-estimate
each seed point's location by taking the center of mass
of the points associated with it. Before the first iteration the
seed points are initialized to random values. However, a
bad choice of initial centers can have a great impact on
performance, since k-means is fully deterministic given
the starting seed points. We note that by utilizing a hierar-
chical structure, the classification problem can be decom-
posed into a set of smaller problems corresponding to hi-
erarchical splits in the tree: one first
learns rough distinctions among classes at the top level;
lower-level distinctions are then learned only within the
appropriate top-level branch of the tree, leading to more special-
ized classifiers. We thus selected the top k most frequent cate-
gories as initial seed points. Figure 1 illustrates a sample
hierarchy obtained by k-means. The input is a set of cat-
egory vectors. Seed points assigned to each cluster are
underlined in Figure 1.
[Figure 1: Hierarchical structure obtained by k-means. At the top level (k=3), the input categories C1–C10 are split into the clusters {C1, C2, C3}, {C4, C5}, and {C6, C7, C8, C9, C10}; each cluster is then recursively partitioned (k=2 and k=3 at the next level), and the seed point of each cluster is underlined.]
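A compact sketch of this seeded k-means over category vectors is given below; it is an illustration of the procedure described above, not the authors' implementation. The rows of X are assumed to be category vectors sorted by category frequency, so the first k rows serve as the initial seed points.

```python
import numpy as np

def kmeans_seeded(X, k, max_iter=100):
    """k-means in which the first k rows of X (by assumption, the k most
    frequent categories) are the initial seed points, making the run
    deterministic. Returns (assignments, centroids)."""
    centroids = X[:k].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # (i) associate each point with its closest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break                       # converged: assignments unchanged
        assign = new_assign
        # (ii) re-estimate each seed point as the center of mass of its points
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assign, centroids
```

Seeding with the most frequent categories removes the randomness of the usual initialization, which matters here because the whole hierarchy is derived from these runs.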
In general, the number of clusters k is not given beforehand.
We thus use a loss function, derived from Naive
Bayes (NB) classifiers, to evaluate the goodness of each k.
3.3 NB
Naive Bayes(NB) probabilistic classifiers are commonly
studied in machine learning(Mitchell, 1996). The basic
idea in NB approaches is to use the joint probabilities of
words and categories to estimate the probabilities of cat-
egories given a document. The NB assumption is that all
the words in a text are conditionally independent given
the value of a classification variable. There are several
versions of the NB classifier. Recent studies on the Naive
Bayes classifier proposed by McCallum et al. have re-
ported higher performance than some other commonly used
versions of NB on several data collections (McCallum,
1999). We use McCallum et al.'s model of NB,
which is shown in formula (2).
\[
P(c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_{ik}} \mid c_j; \hat{\theta})}{\sum_{r=1}^{|C|} P(c_r \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_{ik}} \mid c_r; \hat{\theta})} \tag{2}
\]
where
\[
\hat{\theta}_{tj} \equiv P(w_t \mid c_j; \hat{\theta}) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) P(c_j \mid d_i)}
\]
\[
\hat{\theta}_{c_j} \equiv P(c_j \mid \hat{\theta}) = \frac{\sum_{i=1}^{|D|} P(c_j \mid d_i)}{|D|}
\]
$|V|$ refers to the size of the vocabulary, $|D|$ denotes the num-
ber of labeled training documents, and $|C|$ denotes the
number of categories. $|d_i|$ denotes the length of document $d_i$. $w_{d_{ik}}$
is the word in position $k$ of document $d_i$, where the sub-
script $d_{ik}$ indicates an index into the vocabulary.
$N(w_t, d_i)$ denotes the number of times word $w_t$ occurs
in document $d_i$, and $P(c_j \mid d_i) \in \{0, 1\}$.
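As an illustration, the estimates above can be written down directly from word-count matrices. The following is a minimal sketch, not the authors' implementation; the count matrices `N` and `P_cd` are assumed inputs, and NumPy is used for brevity.

```python
import numpy as np

def train_nb(N, P_cd):
    """Laplace-smoothed estimates of formula (2). N is a |D| x |V| matrix
    of word counts N(w_t, d_i); P_cd is a |D| x |C| 0/1 label matrix
    P(c_j | d_i). Returns (log_prior, log_theta)."""
    V = N.shape[1]
    counts = N.T @ P_cd                        # |V| x |C|: sum_i N(w_t,d_i) P(c_j|d_i)
    theta = (1.0 + counts) / (V + counts.sum(axis=0, keepdims=True))
    prior = P_cd.sum(axis=0) / P_cd.shape[0]   # theta_{c_j} = sum_i P(c_j|d_i) / |D|
    return np.log(prior), np.log(theta)

def predict_nb(x, log_prior, log_theta):
    """Posterior P(c_j | d; theta-hat) for a word-count vector x of
    length |V|, i.e. the normalisation of formula (2), done in log space."""
    log_post = log_prior + x @ log_theta       # log of the numerator of (2)
    log_post -= log_post.max()                 # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

Working in log space avoids the underflow that the product over word positions in formula (2) would otherwise cause on long documents.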
There are several strategies for assigning categories to
a document based on the probability $P(c_j \mid d_i; \hat{\theta})$, such as
the k-per-doc strategy (Field, 1975), and the probability threshold
and proportional assignment strategies (Lewis, 1992).
We use the probability threshold (PT) strategy, where each
document is assigned to every category whose probability exceeds a thresh-
old $\theta$. (We tested all three assignment strategies in the experiments,
and obtained better results with the probability threshold strategy
than with the others.) The threshold $\theta$ can be set to control preci-
sion and recall. Increasing $\theta$ results in fewer test items
meeting the criterion, which usually increases precision
but decreases recall. Conversely, decreasing $\theta$ typically
decreases precision but increases recall. In the flat non-
hierarchical model, we chose $\theta$ for each category so as
to optimize the F measure on training
samples and development test samples. For the manually and
automatically constructed hierarchies, we chose $\theta$ at each
level of the hierarchy using training samples and develop-
ment test samples.
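For concreteness, the per-category threshold search can be sketched as a simple grid search over development data; this is an assumed procedure consistent with the description above, not the paper's exact method, and `scores` and `gold` are hypothetical inputs.

```python
import numpy as np

def tune_threshold(scores, gold, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the probability threshold theta that maximises F on development
    data. `scores` holds the posteriors P(c | d) of one category over the
    development documents; `gold` holds the matching 0/1 judgements."""
    best_theta, best_f = grid[0], -1.0
    for theta in grid:
        pred = scores >= theta
        tp = float(np.sum(pred & (gold == 1)))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(gold.sum(), 1)
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f > best_f:
            best_theta, best_f = theta, f
    return best_theta
```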
3.4 Estimating Error Reduction
Let $P(y \mid x)$ be an unknown conditional distribution over
inputs, $x$, and output classes, $y \in \{y_1, y_2, \cdots, y_n\}$, and
let $P(x)$ be the marginal 'input' distribution. The learner
is given a labeled training set $D$, and estimates a classi-
fication function that, given an input $x$, produces an esti-
mated output distribution $\hat{P}_D(y \mid x)$. The expected error
of the learner can be defined as follows:
\[
E_{\hat{P}_D} = \int_x L(P(y \mid x), \hat{P}_D(y \mid x)) P(x) \, dx \tag{3}
\]
where $L$ is some loss function that measures the degree
of our disappointment in any differences between the true
distribution, $P(y \mid x)$, and the learner's prediction,
$\hat{P}_D(y \mid x)$. We use a log loss, which is defined as follows:
\[
L = \sum_{y \in Y} P(y \mid x) \log(\hat{P}_D(y \mid x)) \tag{4}
\]
Suppose that we are choosing the optimal number of clusters k for
the k-means algorithm. Let $D_k$ ($2 \le k \le n$) be one
of the results obtained by the k-means algorithm, i.e. a
set of seed points (categories) with labeled training samples.
The learner aims to select the result $D_i$ such that the
learner trained on the set $D_i$ has a lower error rate than on any
other set:
\[
\forall k \;\; E_{\hat{P}_{D_i}} \le E_{\hat{P}_{D_k}} \tag{5}
\]
We defined a loss function as follows:
\[
\hat{E}_{\hat{P}_{D_i}} = -\frac{1}{|Y_k|} \frac{1}{|X|} \sum_{x \in X} \sum_{y \in Y_k} P(y \mid x) \log(\hat{P}_{D_i}(y \mid x)) \tag{6}
\]
$Y_k$ in formula (6) denotes the set of seed points (categories)
of $D_k$. We note that the true output distribution $P(y \mid x)$
in formula (6) is unknown for each sample $x$. Roy
et al. (Roy and McCallum, 2001) proposed a method of
active learning that directly optimizes expected future
error under log loss, using the entropy of the posterior class
distribution on a sample of the unlabeled examples. We
applied their technique to estimate $P(y \mid x)$ using the current
learner. More precisely, from the development training
samples $D$, a different training set is created. The learner
then creates a new classifier from the set. This procedure
is repeated $m$ times, and the final class posterior for an
instance is taken to be the average of the class posteriors
of each of the classifiers.
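A sketch of how formula (6) can be computed, with the bagged committee standing in for the unknown P(y|x), follows; the `predict_proba` interface is an assumption made for illustration.

```python
import numpy as np

def committee_posterior(classifiers, X):
    """Stand-in for the unknown P(y|x): the average of the class posteriors
    of m classifiers trained on different samples (bagging), following
    Roy and McCallum (2001). `predict_proba` is an assumed interface
    returning an |X| x |C| matrix of posteriors."""
    return np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)

def clustering_loss(P_true, P_learner, seed_classes):
    """Formula (6): average log loss restricted to the seed-point
    categories Y_k of a partition D_k. P_true and P_learner are
    |X| x |C| posterior matrices; seed_classes indexes Y_k."""
    P = P_true[:, seed_classes]
    Q = P_learner[:, seed_classes]
    return -np.mean(np.sum(P * np.log(Q), axis=1)) / len(seed_classes)
```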
3.5 Generating Category Hierarchy
The algorithm for generating category hierarchy is as fol-
lows:
1. Create category vectors from the given training sam-
ples.
2. Apply k-means $n-1$ times, once for each $k$ ($2 \le k \le n$), where
$n$ is the number of different categories.
3. Apply a loss function (6) to each result.
4. Select the $i$-th result using formula (5), i.e. the result
such that the learner trained on the $i$-th set has a lower
error rate than any other result.
5. Assign every seed point (category) of the clusters in
the $i$-th result to a node of the tree.
For each cluster of sub-branches, eliminate the seed
point and repeat steps 2–5, i.e. run a local
k-means within each cluster of children, until the number of
categories in each cluster is less than two. A sketch of the
whole procedure is given below.
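The five steps above, together with the recursive repetition, might be sketched as follows. This illustration reuses the `kmeans_seeded` sketch given earlier, assumes the category vectors X are ordered by category frequency, and takes `train_loss` as a hypothetical callback implementing formula (6).

```python
def build_hierarchy(categories, X, train_loss):
    """Recursive hierarchy generation (steps 1-5 above). `categories` and
    the rows of X are assumed to be in frequency order; `train_loss(assign)`
    returns the loss of formula (6) for one k-means partition.
    Returns nested dicts {seed category: subtree}."""
    n = len(categories)
    if n == 0:
        return {}
    if n == 1:                 # fewer than two categories: stop splitting
        return {categories[0]: {}}
    # Steps 2-4: run k-means for every k in 2..n, keep the lowest-loss result.
    best = min((train_loss(assign), k, assign)
               for k, assign in ((k, kmeans_seeded(X, k)[0])
                                 for k in range(2, n + 1)))
    _, k, assign = best
    # Step 5: each cluster's seed category becomes a node of the tree; the
    # seed is then eliminated and steps 2-5 repeat within the cluster.
    tree = {}
    for j in range(k):
        # seed of cluster j is category j (assumed to stay in its own cluster)
        members = [i for i in range(n) if assign[i] == j and i != j]
        tree[categories[j]] = build_hierarchy(
            [categories[i] for i in members], X[members], train_loss)
    return tree
```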
4 Hierarchical Classification
Like Dumais's approach (Dumais and Chen, 2000), we
classify test data using the hierarchy. We select the 1,000
features with the largest mutual information for each cat-
egory, and use them for testing. The selected features are
used as input to the NB classifiers.
We exploit the hierarchy by learning separate classi-
fiers at each internal node of the tree. Then, using these
classifiers, we assign categories to each test sample us-
ing the probability threshold strategy, where each sample is
assigned to the categories above a threshold $\theta$. The pro-
cess is repeated by greedily selecting sub-branches until
a leaf is reached.
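A minimal sketch of this top-down procedure follows, under an assumed encoding of the hierarchy as nested dicts (matching the build_hierarchy sketch above) and an assumed per-node classifier interface; it is not the authors' code.

```python
def classify(doc, tree, get_classifier, thetas, level=0):
    """Greedy top-down classification. `get_classifier(tree)` returns the
    NB classifier trained at that internal node and `thetas[level]` the
    threshold tuned for that hierarchical level; both interfaces are
    assumptions. Returns the set of categories assigned to `doc`."""
    assigned = set()
    if not tree:                                   # reached a leaf
        return assigned
    posterior = get_classifier(tree).predict(doc)  # {category: P(c | doc)}
    for category, subtree in tree.items():
        if posterior.get(category, 0.0) >= thetas[level]:
            assigned.add(category)                 # assign, then descend greedily
            assigned |= classify(doc, subtree, get_classifier, thetas, level + 1)
    return assigned
```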
5 Evaluation
5.1 Data and Evaluation Methodology
We compare the automatically created hierarchy with a flat model and
a manually constructed hierarchy with respect to classifica-
tion accuracy. We further evaluate the generated category
hierarchy from two perspectives: we examine (i) whether
or not the number of categories affects the construction of a cat-
egory hierarchy, and (ii) whether or not a large collection
of data helps to generate a category hierarchy.
The data we used is the 1996 Reuters corpus, which
has recently become available (Reuters, 2000). The corpus, covering 20th
Aug., 1996 to 19th Aug., 1997, consists of 806,791 doc-
uments. These documents are organized into 126 topi-
cal categories in a hierarchy of up to five levels. After eliminat-
ing unlabeled documents, we divided the documents into
four sets. Table 1 shows the data used in
each model, i.e. the flat non-hierarchical model, the manually
constructed hierarchy, and the automatically created hierar-
chy. A shared notation (X) in Table 1 denotes a pair of
training and test data. For example, '(F1) Training data'
shows that 145,919 samples are used for training the NB clas-
sifiers, and '(F1) Test data' shows that 290,665 sam-
ples are used for classification. We selected the 102 cate-
gories which have at least one document in each data set.
We obtained a vocabulary of 320,935 unique words
after eliminating words which occur only once, stem-
ming with a part-of-speech tagger (Schmid, 1995), and stop-
word removal. The number of categories per document
is 3.21 on average. For both the hierarchical and
non-hierarchical cases, we select the 1,000 features with
the largest MI for each of the 102 categories, and create
the category vectors.
Following Roy et al.'s method, we use bagging to reduce the
variance of the estimate of the true output distribution $P(y \mid x)$. From
Table 1: Data sets used in each method

              Training samples      Dev. training samples    Dev. test samples        Test samples
# of samples  145,919               300,000                  60,021                   290,665
Date          ’96/08/20–’96/10/30   ’96/10/30–’97/03/19      ’97/03/19–’97/04/01      ’97/04/01–’97/08/19
Flat          (F1) Training data    (F2) Training data for   (F2) Test data for       (F1) Test data
                                    estimating PT            estimating PT
Manual        (M1) Training data    (M2) Training data for   (M2) Test data for       (M1) Test data
                                    estimating PT at each    estimating PT at each
                                    hierarchical level       hierarchical level
Automatic     (A1) Training data    (A2) Training data for   (A2) Test data for       (A1) Test data
                                    estimating PT at each    estimating PT at each
                                    hierarchical level;      hierarchical level;
                                    (A3) Training data for   (A3) Test data for
                                    estimating the true      estimating the true
                                    output distribution      output distribution
our original development training set, 300,000 docu-
ments, a different training set which consists of 200,000
documents is created by random sampling. The learner
then creates a new NB classifier from this sample. This
procedure is repeated 10 times, and the final class poste-
rior for an instance is taken to be the average of the class
posteriors for each of the classifiers.
For evaluating the effectiveness of category assign-
ments, we use the standard recall, precision, and F-score.
Recall is defined to be the ratio of correct assignments
by the system divided by the total number of correct as-
signments. Precision is the ratio of correct assignments
by the system divided by the total number of the system’s
assignments. The F-score, which combines recall ($r$) and
precision ($p$) with equal weight, is $F(r, p) = \frac{2rp}{r+p}$. We
use the micro-averaged F-score, which is computed globally
over $n$ (the total number of categories) $\times$ $m$ (the number of
total test documents) binary decisions.
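These global counts can be sketched in a few lines; representing `pred` and `gold` as sets of (document, category) pairs is an assumption made for the example.

```python
def micro_f(pred, gold):
    """Micro-averaged recall, precision, and F-score, computed globally
    over all (document, category) binary decisions. `pred` and `gold`
    are assumed to be sets of (document_id, category) pairs."""
    tp = len(pred & gold)                     # correct assignments
    recall = tp / len(gold) if gold else 0.0  # correct / all true assignments
    precision = tp / len(pred) if pred else 0.0
    f = 2 * recall * precision / (recall + precision) if tp else 0.0
    return recall, precision, f
```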
5.2 Results and Discussion
5.2.1 Generating Category Hierarchy
Table 2 shows the top level of the manually constructed
hierarchy. Table 3 shows the portion of the au-
tomatically generated hierarchy which is associated with
the categories shown in Table 2. 'x-y-⋯' shows the ID
number assigned to each node of the tree. For
example, 3-5-1 shows that the ID numbers at the top, sec-
ond, and third levels are 3, 5, and 1, respectively. $\theta$ shows
the threshold value obtained from the training samples and
development test samples.
Tables 2 and 3 indicate that the automatically con-
structed hierarchy has different properties from the manually
created hierarchy. When the top-level categories of a hi-
erarchical structure have equally discriminating proper-
ties, they are useful for text classification. In the man-
ually constructed hierarchy (Reuters, 2000), there are 25
categories at the top level, while the result of our method,
based on corpus statistics, shows that the top 4 most frequent
categories are selected as discriminative properties,
and the other categories are sub-categorised under 'Govern-
ment/social', except for 'Labour issues' and 'Weather'. In
the automatically generated hierarchy, 'Economics' →
'Expenditure' → 'Welfare' and 'Economics' → 'Labour
issues' are created, while 'Welfare' and 'Labour issues'
belong to the top level in the manually constructed hier-
archy.
Another interesting feature of our result is that some of
the related categories are merged into one cluster, while
in the manual hierarchy, they are in different locations. Ta-
ble 4 illustrates the sample result of related categories in
the automatically created hierarchy. In Table 4, for exam-
ple, ‘Ec competition/subsidy’ is sub-categorised by ‘Mo-
nopolies/competition’. In a similar way, ‘E31’, ‘E311’,
‘E143’, and ‘E132’ are classified into ‘MCAT’, since
these categories are related to market news.
5.2.2 Classification
As just described, thresholds for each level of a hier-
archy were established on the training samples and de-
velopment test samples. We then use these thresholds for
text classification, i.e. for each level of a hierarchy, if a
test sample exceeds the threshold, we assigned the cate-
gory to the test sample. A test sample can belong to zero, one,
or more than one category.
Table 5 shows the result of classification accuracy.
'Flat' and 'Manual' show the baselines, i.e. the result
when all 102 categories are treated as a flat non-hierarchical
problem, and the result using the manually constructed hi-
erarchy, respectively. 'Automatic' denotes the result of
our method. 'miR', 'miP', and 'miF' refer to the micro-
averaged recall, precision, and F-score, respectively.
Table 5 shows that the overall F value obtained by our
method was 3.9% better than the Flat model, and 3.3%
better than the Manual model. Both results are sta-
Table 2: Manually constructed hierarchies
Level θ Category node
Top 0.400 1 Corporate/industrial 2 Economics 3 Government/social 4 Markets
5 Crime 6 Defence 7 International relations 8 Disasters
9 Entertainment 10 Environment 11 Fashion 12 Health
13 Labour issues 14 Obituaries 15 Human interest 16 Domestic politics
17 Biographies/people 18 Religion 19 Science 20 Sports
21 Tourism 22 War 23 Elections 24 Weather
25 Welfare
Table 3: Automatically constructed hierarchies
Level θ Category node
Top 0.422 1 Corporate/industrial 2 Economics 3 Government/social 4 Markets
Second 0.098 3-1 Domestic politics 3-2 International relations 3-3 War 3-4 Crime
3-5 Merchandise trade 3-6 Sports 3-7 Defence 3-8 Disasters
3-9 Elections 3-10 Biographies/people 4-10 Weather 2-5 Expenditure
2-6 Labour issues
Third 0.992 3-1-1 Health 3-5-1 Tourism 3-8-1 Environment 3-10-1 Entertainment
3-10-2 Religion 3-10-3 Science 3-10-4 Human interest 3-10-5 Obituaries
2-5-1 Welfare
Fourth 0.999 3-10-4-1 Fashion
Table 5: Classification accuracy
Method miR miP miF
Flat 0.753 0.647 0.695
Manual 0.685 0.708 0.701
Automatic 0.807 0.675 0.734
tistically significant according to a micro sign test, P-value ≤
0.01 (Yang and Liu, 1999). Somewhat surprisingly, there
is no significant difference between Flat and Manual (mi-
cro sign test, P-value > 0.05). This shows that manually
constructing a hierarchy that suits a given corpus is
a difficult task. The overall F value of our method is
0.734. Classifying large data sets with similar categories is a
difficult task, so we did not expect exceptionally
high accuracy of the kind reported for Reuters-21578 (over
0.85 F-score (Yang and Liu, 1999)). Performance on the
closed data, i.e. the training samples and development test
samples, was 0.705 for 'Flat', 0.720 for 'Manual', and
0.782 for 'Automatic'. This is therefore a diffi-
cult learning task, and generalization to the test set is quite
reasonable.
Table 6 and Table 7 illustrate the results at each hi-
erarchical level for the manually constructed hierarchy and for
our method, respectively. 'Clusters' denotes the number
of clusters, and 'Categories' refers to the number of cat-
egories at each level. The F-score of 'Manual' for the
top-level categories is 0.744, and that of our method is
0.919; both outperform the flat model. However, the
performance of both 'Manual' and our method monoton-
ically decreases as the depth from the top level
increases, and the overall F-score at the lower levels
of the hierarchies is very low. This is because, at the lower
levels, similar or identical features may be used
within the same top-level category. This sug-
gests that we should be able to obtain further advantages
in efficiency in the hierarchical approach by reducing the
number of features which are not useful discriminators
at the lower levels of the hierarchies (Koller and Sahami,
1997).
Table 6: Accuracy by hierarchical level(Manual)
Level Clusters Categories miR miP miF
Top 25 25 0.753 0.724 0.744
Second 16 63 0.524 0.553 0.543
Third 37 37 0.431 0.600 0.500
Fourth 43 43 0.276 0.103 0.146
Fifth 1 1000
Table 7: Accuracy by hierarchical level(Automatic)
Level Clusters Categories miR miP miF
Top 4 4 0.988 0.859 0.919
Second 59 59 0.813 0.697 0.750
Third 35 35 0.568 0.126 0.151
Fourth 4 4 0.246 0.102 0.103
Table 4: The sample result of related categories
Node Category Node Category Node Category
1-22 Monopolies/competition(C34) 1-22-1 Ec competition/subsidy(G157)
2-7 Defence(GDEF) 2-7-1 Defence contracts(C331)
3 Markets(MCAT) 3-11 Output/capacity(E31)
3-12 Industrial production(E311)
3-13 Retail sales(E143) 3-13-1 Housing starts(E61)
3-14 Wholesale prices(E132)
4 Economics(ECAT) 4-2 Monetary/economic(E12)
4-8 Ec monetary/economic(G154)
4 Economics(ECAT) 4-6 Labour issues(GJOB)
4-7 Employment/labour(E41) 4-7-1 Unemployment(E411)
5.2.3 The Number of Categories v.s. Accuracy
Figure 2 shows classification accuracy for dif-
ferent numbers of categories, i.e. 10, 50, and 102
categories. Each set of 10 and 50 categories is created from the training
samples by random sampling, and the sampling is
repeated 10 times. (In our method, 10 hierarchies are thus
constructed for each category set size.) Each point in Figure 2 is the average
performance over the 10 sets.
[Figure 2: Accuracy v.s. category size. Micro-averaged F-score (ranging from about 0.69 to 0.75) plotted against the number of categories (10 to 102) for the Flat, Manual, and Automatic methods.]
Figure 2 shows that our method outperforms the flat
model and the manually constructed hierarchy at every point in the
graph. As can be seen, the greater the number of cate-
gories, the more likely it is that a test sample is in-
correctly classified. Specifically, the performance of the flat
model using 10 categories was an F-score of 0.720, and that with
50 categories was 0.695. This drop in accuracy indicates
that the flat model is likely to be difficult to train when there
are a large number of classes with a large number of fea-
tures.
5.2.4 Efficiency of Large Corpora
Figure 3 shows classification accuracy for dif-
ferent sizes of training samples, i.e. 10,000,
100,000, and 145,919 samples. (The numbers of development
training, development test, and test samples are those shown
in Table 1.) Each set of samples is
created by random sampling, except for the full set of 145,919
samples. The sampling process is repeated 10 times. The
average accuracy of each result across the 10 sets of sam-
ples is reported in Figure 3.
[Figure 3: Accuracy v.s. training corpus size. Micro-averaged F-score (ranging from about 0.60 to 0.74) plotted against the number of training samples (0 to 160,000) for the Flat, Manual, and Automatic methods.]
Performance benefits significantly from much
larger training samples. With 10,000 training sam-
ples, all methods perform poorly; each learns rapidly as
data is added, and the flat model in particular is ex-
tremely sensitive to the amount of training data. At every
point in the graph, the result of our method with cluster-
based generalisation outperforms the other two
methods; notably, its relative advantage is largest
with less training data.
6 Conclusions
We proposed a method for generating a category hierar-
chy in order to improve text classification performance.
We used k-means and a loss function which is derived
from NB classifiers. We found small advantages in the
F-score for the automatically generated hierarchy, compared
with both the baseline flat non-hierarchical model and the manually
constructed hierarchy, when trained on large samples. We
have also shown that our method retains its advantage
when less training data is available. Future work includes (i) ex-
tracting features which discriminate between categories
within the same cluster with low F-score, (ii) using other
machine learning techniques to obtain further advantages
in efficiency in dealing with a large collection of data, (iii)
comparing the method with other techniques such as hier-
archical agglomerative clustering and ‘X-means’(Pelleg
and Moore, 2000), and (iv) developing evaluation method
between manual and automatic construction of hierar-
chies to learn more about the strengths and weaknesses
of the two methods of classifying documents.
Acknowledgments
We would like to thank anonymous reviewers for their
valuable comments. We also would like to express many
thanks to the Research and Standards Group of Reuters
who provided us the corpora.

References
T. Cover and J. Thomas. 1991. Elements of Information
Theory. Wiley.
C. Crouch. 1988. A Cluster-based Approach to The-
saurus Construction. In Proc. of the 11th Annual In-
ternational ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 309–320.
D. Cutting, D. Karger, J. Pedersen, and J. Tukey. 1992.
Scatter/gather: A Cluster-based Approach to Brows-
ing Large Document Collections. In Proc. of the
15th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 318–329.
T.G. Dietterich. 2000. Ensemble Methods in Machine
Learning. In Proc. of the 1st International Workshop
on Multiple Classifier Systems.
R. Duda and P. Hart. 1973. Pattern Classification and
Scene Analysis. Wiley.
S. Dumais and H. Chen. 2000. Hierarchical Classifica-
tion of Web Content. In Proc. of the 23rd Annual In-
ternational ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 256–263.
B. Field. 1975. Towards Automatic Indexing: Auto-
matic Assignment of Controlled Language Indexing
and Classification from Free Indexing. Journal of
Documentation, pages 246–265.
M. Iwayama and T. Tokunaga. 1995. Cluster-Based Text
Categorization: A Comparison of Category Search
Strategies. In Proc. of the 18th Annual International
ACM SIGIR Conference on Research and Development
in Information Retrieval, pages 273–280.
A.K. Jain and R.C. Dubes. 1988. Algorithms for Clus-
tering Data. Englewood Cliffs N.J. Prentice Hall.
D. Koller and M. Sahami. 1997. Hierarchically Classify-
ing Documents using Very Few Words. In Proc. of the
14th International Conference on Machine Learning,
pages 170–178.
D. Lawrie and W.B. Croft. 2000. Discovering and Com-
paring Hierarchies. In Proc. of RIAO 2000 Conference,
pages 314–330.
D.D. Lewis. 1992. An Evaluation of Phrasal and Clus-
tered Representations on a Text Categorization Task.
In Proc. of the 15th Annual International ACM SIGIR
Conference on Research and Development in Informa-
tion Retrieval, pages 37–50.
A.K. McCallum. 1999. Multi-Label Text Classification
with a Mixture Model Trained by EM. In Revised ver-
sion of paper appearing in AAAI’99 Workshop on Text
Learning.
T. Mitchell. 1996. Machine Learning. McGraw Hill.
C.G. Nevill-Manning, I.H. Witten, and G.W. Paynter. 1999.
Lexically-Generated Subject Hierarchies for Browsing
Large Collections. International Journal on Digital Li-
braries, 2(3):111–123.
D. Pelleg and A. Moore. 2000. X-means: Extending
K-means with Efficient Estimation of the Number of
Clusters. In Proc. of the 17th International Conference
on Machine Learning, pages 725–734.
Reuters. 2000. Reuters Corpus Volume 1 En-
glish Language. 1996-08-20 to 1997-08-19,
release date 2000-11-03, Format version 1,
http://www.reuters.com/researchandstandards/corpus/.
N. Roy and A.K. McCallum. 2001. Toward Optimal
Active Learning through Sampling Estimation of Error
Reduction. In Proc. of the 18th International Confer-
ence on Machine Learning, pages 441–448.
M. Sanderson and B. Croft. 1999. Deriving Concept
Hierarchies from Text. In Proc. of the 22nd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 206–213.
H. Schmid. 1995. Improvements in Part-of-Speech Tag-
ging with an Application to German. In Proc. of the
EACL SIGDAT Workshop.
A.S. Weigend, E.D. Wiener, and J.O. Pedersen. 1999.
Exploiting Hierarchy in Text Categorization. Informa-
tion Retrieval, 1(3):193–216.
Y. Yang and X. Liu. 1999. A Re-Examination of Text
Categorization Methods. In Proc. of the 22nd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 42–49.
