The Effect of Topological Structure on Hierarchical 
Text Categorization 
Stephen D'Alessio Keitha Murray 
Robert Schiaffino 
Department of Computer and Information Science 
Iona College 
New Rochelle, N.Y. 10801, USA 
sdalessio@iona.edu, kmurray@iona.edu, rschiaffino@iona.edu 
Aaron Kershenbaum 
Department of Computer Science 
Polytechnic University 
Hawthorne, N.Y. 10532, USA 
akershen@duke.poly.edu 
Abstract 
The problem of assigning documents to categories 
in a hierarchically organized taxonomy and the ef- 
fect of modifying the topology of the hierarchy 
are considered. Given a training corpus of doc- 
uments already placed in categories, vocabulary is 
extracted. The vocabulary, words that appear with
high relative frequency within a given category,
characterizes each subject area by being associated
with nodes in the hierarchy. Each node's vocabulary
is filtered and its words are assigned weights with
respect to the specific category. Test documents 
are scanned for this vocabulary and categories are 
ranked with respect to the document based on the 
presence of terms from this vocabulary. Documents 
are assigned to categories based on these rankings. 
Precision and recall are measured. 
We present an algorithm for associating words 
with individual categories within the hierarchy and 
demonstrate that precision and recall can be sig- 
nificantly improved by solving the categorization 
problem taking the topology of the hierarchy into 
account. We also show that these results can be 
improved even further by intelligently selecting
intermediate categories in the hierarchy. Solving the
problem iteratively, moving downward from the 
root of the taxonomy to the leaf nodes, we improve 
precision from 82% to 89% and recall from 82% 
to 87% on the much-studied Reuters-21578 corpus 
with 135 categories organized in a three-level hier- 
archy of categories. 
1 Introduction and Background 
The proliferation of available online information
attributable to the explosive use of the Internet has
brought about the necessity for text retrieval
systems that can assist the user in accessing this
information in an effective, efficient and timely
manner. Today's search engines have had difficulty
keeping pace with the increasing amount of
information that continuously needs to be indexed and
searched. Categorization of the original text is a
means by which the information can be arranged
and organized to facilitate the retrieval task.
Natural language processing systems can be used to
query against these pre-specified categories,
yielding retrieval results more acceptable and
beneficial to the user.
The document categorization problem is one of 
assigning newly arriving documents to categories 
within a given hierarchy of categories. In general, 
lower level categories may be part of more than 
one higher level category. Moreover, a document 
may belong to more than one low-level category. 
While the techniques described here can be applied 
to this more general problem, the experiments we 
have conducted, to date, have been carried out on a 
corpus where each document is a member of a sin- 
gle category and the categories form a tree rather 
than a more general directed acyclic graph. We
limited the investigation to this more specific problem
in order to focus the investigation on the effect of 
making use of the hierarchy, specifically on changes 
in the topology of the hierarchy.
Most computational experience discussed in
the literature deals with hierarchies that are
trees. Indeed, until recently, most problems
discussed dealt with categorization within a
simple (non-hierarchical) set of categories (Frakes
and Baeza-Yates, 1992). The Reuters-21578
corpus (available at David Lewis's home page:
http://www.research.att.com/~lewis) has been
studied extensively. Yang (Yang, 1997) compares
14 categorization algorithms applied to this Reuters
corpus as a flat categorization problem on 135
categories. This same corpus has been more recently
studied by others treating the categories as a
hierarchy (Chakrabarti et al., 1997) (Koller and
Sahami, 1997) (Ng et al., 1997) (Yang, 1996). Yang
examines a portion of the OHSUMED (Hersh et al.,
1994) corpus of medical abstracts, a part of the
National Library of Medicine corpus that has over
9 million abstracts organized into over 10,000
categories in a taxonomy (called MESH) which is
seven levels deep in some places.
We describe an algorithm for hierarchical docu- 
ment categorization where the vocabulary and term 
weights are associated with categories at each level 
in the taxonomy and where the categorization process
itself is iterated over levels in the hierarchy.
Thus a given term may be a discriminator at one 
level in the taxonomy receiving a large weight and 
then become a stopword at another level in the hi- 
erarchy. We also consider making modifications to 
the hierarchy itself as a means of increasing the ac- 
curacy and speed of the categorization process. 
There are two strong motivations for taking the 
hierarchy into account. First, experience to date 
has demonstrated that both precision and recall de- 
crease as the number of categories increases (Apte 
et al., 1994) (Yang, 1996). One of the reasons for 
this is that as the scope of the corpus increases, 
terms become increasingly polysemous. This is
particularly evident for acronyms, which are often
limited by the number of 3- and 4-letter combinations,
and which are reused from one domain to another.
The second motivation for doing categorization
within a hierarchical setting is that it affords the ability
to deal with very large problems. As the number 
of categories grows, the need for domain-specific 
vocabulary grows as well. Thus, we quickly reach 
the point where the index no longer fits in mem- 
ory and we are trading accuracy against speed and 
software complexity. On the other hand, by treat- 
ing the problem hierarchically, we can decompose 
it into several problems each involving a smaller 
number of categories and smaller domain-specific 
vocabularies and perhaps yield savings of several
orders of magnitude.
Feature selection, deciding which terms to actually
include in the indexing and categorization process,
is another aspect affected by the size of the corpus.
Some methods remove words with low frequencies 
both in order to reduce the number of features and 
because such words are often unreliable due to the 
low confidence in their distribution of occurrence 
across categories. Depending on the size of the cor- 
pus, this may still leave over 10,000 features, which 
renders even the simplest categorization methods 
too slow to be of use on very large corpora and 
renders the more complex ones entirely infeasible. 
Methods that incorporate additional feature
selection have been studied (Apte et al., 1994)
(Chakrabarti et al., 1997) (Deerwester et al., 1990)
(Koller and Sahami, 1996) (Lewis, 1992) (Ng et al.,
1997) (Yang and Pederson, 1997). The effectiveness
of these feature selection methods varies. Most
reduce the size of the feature set by one to two orders
of magnitude without significantly reducing
precision and recall from what is obtained with larger
feature sets. Some approaches assign weights to
the features and then assign category ranks based
on a sum of the weights of features present. Some
weigh the features further by their frequency in the
test documents. These methods are all known as
linear classifiers and are computationally simplest
and most efficient, but they sometimes lose
accuracy because of the assumption they make that the
features appear independently in documents. More
sophisticated categorization methods base the
category ranks on groups of terms (Chakrabarti et
al., 1997) (Heckerman, 1996) (Koller and Sahami,
1997) (Sahami, 1996) (Yang, 1997). The methods
that approach the problem hierarchically compute
probabilities and make the categorization decision
one level in the taxonomy at a time.
Precision and recall are used by most authors as a 
measure of the effectiveness of the algorithms. Most 
of the simpler methods achieved values for these 
near 80% for the Reuters corpus (Apte et al., 1994) 
(Cohen and Singer, 1996). More computationally 
expensive methods using the same corpus achieved 
results near 90% (Koller and Sahami, 1997) while 
the methods that used hierarchy obtained small ino 
creases in precision and large increases in speed (Ng 
et al., 1997). As the number of categories increased 
in a corpus (OSHUMED), precision and recall de- 
cline to 60% (Yang 1996). 
2 Problem Definition 
2.1 Definition of Categories 
We are given a set of categories where sets of cat- 
egories can be further organized into supercate- 
gories. We are given a training corpus and, for each 
document, the category to which it belongs. Doc- 
uments can, in general, be members of more than 
one category. In that case, it is possible to consider
a binary categorization problem where a decision is 
made whether each document is or is not in each 
category. Here, we examine the M-ary categoriza- 
tion problem where we choose a single category for 
each document. 
2.2 Document Corpus and Taxonomy 
We use the Reuters-21578 corpus, Distribution 1.0,
which comprises 21,578 documents, representing
what remains of the original Reuters-22173 corpus
after the elimination of 595 duplicates by Steve
Lynch and David Lewis in 1996. The size of the
corpus is 28,329,337 bytes, yielding an average
document size of 1,313 bytes. The documents are
"categorized" along five axes: topics, people,
places, organizations, and exchanges. We consider
only the categorization along the topics axis. Close
to half of the documents (10,211) have no topic and,
as Yang (Yang, 1996) and others suggest, we do not
include these documents in either our training or
test sets. Note that, unlike Lewis (acting for
consistency with earlier studies), we consider as
no-category those documents that have no categories
listed between the topic tags in the Reuters-21578
corpus. This leaves 11,367 documents with one or
more topics. Most of these documents (9,495) have
only a single topic. The average number of topics
per document is 1.26.
The Reuters collection uses a set of 135 categories 
organized as a flat taxonomy. Although the collec- 
tion does not have a pre-defined hierarchical clas- 
sification structure, additional information on the 
category sets available at Lewis's site describes an 
organization that has 5 additional categories that 
become supercategories of all but 3 of the original 
topics categories. Adding a root forms a 3-level
hierarchy (see Figure 1). Figure 1 includes counts
by selected individual leaf categories and summa- 
rized by upper level supercategories. The number 
of categories per supercategory varies widely, from 
a minimum of 2 to a maximum of 78. The number 
of test documents per category also varies widely, 
from a minimum of 0 (for 76 such categories) to a 
maximum of 1,156 (earn). On the other hand, doc- 
ument size does not vary greatly across categories. 
In the same way that a wide variation in document
size makes ranking documents with respect to a
query in information retrieval difficult, it is
difficult to accurately rank categories with respect
to a document when the number of documents per
category varies greatly across categories. Of course,
we cannot control the number of documents actually
in each category. We can reduce this variation to
some extent by altering the hierarchy, at least
temporarily, during the categorization process. Thus,
for example, the hierarchy described in Figure 1
groups the "acq" and "earn" categories into a common
supercategory "corporate". Each of these categories
separately contains more documents than all of the
other supercategories. Thus, we might improve the
precision of the categorization process by
"promoting" these categories to supercategories.
This idea is explored in Section 4.
It might also help to temporarily move a category 
to a different part of the hierarchy when it shares 
important features with other categories there. In 
this case, by moving the categories under a com- 
mon parent we can reliably get the document to 
that parent and then, using features that specifi- 
cally separate these categories from one another, we 
can accurately complete the categorization. Mov- 
ing categories is also explored in Section 4. 
2.3 Performance Metrics 
We measure the effectiveness of our algorithm by
using the standard measures of microaveraged
precision and recall; i.e., the ratio of correct
decisions to the total number of decisions and the
ratio of correct decisions to the total number of
documents.
We do, however, sometimes leave documents in 
non-leaf categories and then, in measuring precision 
and recall, count these as "no-category", reducing 
recall but not precision. 
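
The two ratios can be illustrated with a minimal
Python sketch. The function name and the input
layout (pairs of true category and assigned node)
are assumptions made for illustration; they are not
taken from the system described here.

```python
def micro_precision_recall(assignments, leaves):
    """assignments: list of (true_category, assigned_node) pairs.

    A document whose assigned node is not a leaf counts as "no-category":
    it is not a decision, so it lowers recall but leaves precision unchanged.
    """
    decisions = [(t, a) for t, a in assignments if a in leaves]
    correct = sum(1 for t, a in decisions if t == a)
    precision = correct / len(decisions) if decisions else 0.0
    recall = correct / len(assignments) if assignments else 0.0
    return precision, recall


# Hypothetical toy data: four documents, one of them left at the root.
toy = [("acq", "acq"), ("earn", "earn"), ("acq", "earn"), ("trade", "root")]
precision, recall = micro_precision_recall(toy, leaves={"acq", "earn", "trade"})
```

In this toy case three decisions are made, two of
them correct, so precision is 2/3 while recall is
2/4.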
3 Algorithm Description 
3.1 Overview 
We begin by creating training and test files us- 
ing the 9,495 single-category documents from the 
Reuters-21578 corpus. While this led to somewhat 
higher precision and recall than would have been 
obtained by including multicategory documents, 
our 89% precision and 87% recall are also higher than
the roughly 80% typically reported for categoriza- 
tion methods of comparable speed and complexity. 
Thus, our approach is comparable to those methods 
and serves as a reasonable baseline against which 
to study the effects of the hierarchy. 
The corpus is divided randomly, using a 
70%/30% split, into a training corpus of 6,753 
training documents and 2,742 test documents. 
Documents in both the training and test corpora 
are then divided into words using the same
procedure. Non-alphabetic characters (with the
exception of "-") are removed and all characters are
lowercased. Stopwords are removed. The document
is then parsed into "words"; i.e., character strings 
delimited by whitespace, and these words are then 
used as features. 
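
A minimal Python sketch of this preprocessing step
follows. The stopword list is a hypothetical
placeholder, since the list actually used is not
given, and the sketch assumes that whitespace is
preserved as the word delimiter when non-alphabetic
characters are stripped.

```python
import re

# Hypothetical stopword list; the list actually used is not specified.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text, stopwords=STOPWORDS):
    """Split a document into word features.

    Assumption: whitespace is kept as the delimiter, so only characters
    that are neither letters, hyphens, nor whitespace are deleted.
    """
    cleaned = re.sub(r"[^A-Za-z\-\s]", "", text).lower()
    return [w for w in cleaned.split() if w not in stopwords]


words = tokenize("The U.S. trade-deficit rose 3.5% in 1987!")
```

Here `words` is `["us", "trade-deficit", "rose"]`:
digits and punctuation are dropped, the hyphen
survives, and "the" and "in" are removed as
stopwords.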
Next, we count the number of times each feature 
appears in each document and, from that, we com- 
pute the total number of times each feature appears 
in training documents in each category. We retain 
only features appearing 2 or more times in a single 
training document or 10 or more times across the 
training corpus. All other features are discarded as 
being insufficiently reliable. 
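
The counting and reliability filter can be sketched
as follows. This is an illustration under stated
assumptions rather than the implementation used in
the experiments: the function name, the input
layout, and the returned structures are all
hypothetical.

```python
from collections import Counter, defaultdict


def select_reliable_features(training_docs, min_in_doc=2, min_in_corpus=10):
    """training_docs: iterable of (category, list_of_words) pairs.

    Returns (counts, kept): counts[category][feature] is the number of
    occurrences of a retained feature in that category's training
    documents, and kept is the set of retained features.
    """
    corpus_total = Counter()      # occurrences across the training corpus
    max_in_one_doc = Counter()    # largest count seen in a single document
    per_category = defaultdict(Counter)
    for category, words in training_docs:
        doc_counts = Counter(words)
        corpus_total.update(doc_counts)
        per_category[category].update(doc_counts)
        for feature, n in doc_counts.items():
            max_in_one_doc[feature] = max(max_in_one_doc[feature], n)
    kept = {f for f in corpus_total
            if max_in_one_doc[f] >= min_in_doc
            or corpus_total[f] >= min_in_corpus}
    counts = {c: Counter({f: n for f, n in cnt.items() if f in kept})
              for c, cnt in per_category.items()}
    return counts, kept


# Hypothetical two-document training set.
docs = [("earn", ["profit", "profit", "mln"]), ("acq", ["merger", "mln"])]
counts, kept = select_reliable_features(docs)
```

In this toy corpus only "profit" survives, because
it appears twice in one document; "mln" and
"merger" meet neither threshold.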
Next we use a variant of the ACTION Algorithm 
(Wong et al. 1996), described in detail in Section 
3.2 below, to associate features with nodes in the 
taxonomy. This is one of the aspects that make 
our approach novel. This algorithm is particularly 
useful because it allows us to compare the frequency 
of a feature within a category with its frequency in 
sibling categories in the same subtree. This is more 
effective than just comparing the frequency within 
a category with global frequency as it focuses on 
the decision actually being made at that node in 
the hierarchy. 
By eliminating most features from most cate- 
gories, we gain several advantages. First, by limit- 
ing the appearance of a feature to a small number 
of categories (usually, just one) where it is an un- 
ambiguous discriminator, we improve the precision 
of the categorization process. Second, by working 
with a small number of features, we avoid optimiza- 
tion over a large number of features, and have a 
procedure with low computational complexity that 
can be applied to large problems with many cate- 
gories. (Currently the number of features is set to 
50). Our feature selection procedure most closely 
resembles rule induction (Apte et al., 1994) but it 
differs from that approach in that it considers the 
interactions among a larger number of features for 
a given amount of computational effort. 
Weights are now assigned to the surviving
features in each category. We associate a weight,
$W_{fc}$, with each surviving feature, f, in category
c. We define $W_{fc}$ by:

$$W_{fc} = \lambda + (1 - \lambda)\,\frac{N_{fc}}{M_c} \tag{1}$$

where $N_{fc}$ is the number of times f appears in c,
$M_c$ is the maximum frequency of any feature in c,
and $\lambda$ is a parameter (currently set to 0.4).
We also assign a negative weight to features
associated with siblings (successors of the same
parent node) of each category. A feature appearing in
one or more siblings of c but not in c itself is
assigned a negative weight

$$W_{fc} = -\left(\lambda + (1 - \lambda)\,\frac{N_{fp}}{M_p}\right) \tag{2}$$

where p is the parent of c in the hierarchy. Thus
$N_{fp}$ is the number of times f appears in the
parent of c, which is in turn the number of times f
appears in all siblings of c, since it does not
appear in c itself at all. $M_p$ is the maximum
frequency of any feature in c's parent.
Finally, we filter the set of positive and negative 
words associated with each category, retaining, at 
most, 50 positive and 50 negative words with high- 
est weights for each category, both leaf and interior. 
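
The weighting and filtering of Equations (1) and
(2) can be sketched in Python as below. The function
name and argument layout are hypothetical, and two
points are assumptions: that the maximum frequencies
are taken over the counts passed in, and that the
"highest" negative weights are those of largest
magnitude.

```python
def category_weights(own_counts, sibling_features, parent_counts,
                     lam=0.4, top_k=50):
    """Return {feature: weight} for one category c.

    own_counts:       {feature: N_fc} for features associated with c
    sibling_features: features associated with siblings of c
    parent_counts:    {feature: N_fp} counts in c's parent p
    """
    m_c = max(own_counts.values(), default=1)
    m_p = max(parent_counts.values(), default=1)
    # Equation (1): positive weight for features associated with c.
    positive = {f: lam + (1 - lam) * n / m_c for f, n in own_counts.items()}
    # Equation (2): negative weight for sibling features absent from c.
    negative = {f: -(lam + (1 - lam) * parent_counts.get(f, 0) / m_p)
                for f in sibling_features if f not in own_counts}
    # Assumption: for negatives, "highest" means largest magnitude.
    top_pos = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    top_neg = sorted(negative.items(), key=lambda kv: kv[1])[:top_k]
    return dict(top_pos + top_neg)


# Hypothetical counts for an "acq"-like category and its parent.
w = category_weights(
    own_counts={"merger": 10, "stake": 5},
    sibling_features={"profit", "merger"},
    parent_counts={"merger": 10, "stake": 5, "profit": 20},
)
```

With these toy counts, "merger" receives weight
1.0, "stake" 0.7, and the sibling-only feature
"profit" receives -1.0.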
We now have an index suitable for use in the
category ranking process. The index contains features
and a weight, $W_{fc}$, associated with each feature
in each category. Given a document, d, a rank can
now be associated with each category with respect
to d. Let F be the set of features, f, in d. The
ranking of category c with respect to document d,
$R_{cd}$, is then defined to be:

$$R_{cd} = \sum_{f} N_{fd}\, W_{fc} \tag{3}$$

where the sum is over all positive and negative
features associated with c and $N_{fd}$ is the number
of times f appears in d. Note that, in practice, the
sum is taken only over features that are in the
intersection of the sets of features actually
appearing in d and actually associated with c. Note
that $R_{cd}$ may be positive, negative or zero.
Test document d is now placed in a category.
Starting at r, the root of the hierarchy, we
compute $R_{cd}$ for all c which are successors of r.
If all $R_{cd}$ are zero or negative, d is left at r.
If any $R_{cd}$
is positive, let c' be the category with the highest 
rank. If c' is a leaf node, d is placed in c'. If c' 
is an interior node, the contest is repeated at node 
c'. Thus, d is eventually placed either in a leaf cat- 
egory which wins a contest among its siblings or 
in an interior node none of whose children have a 
positive rank with respect to d. In this latter case, 
we may say that d is actually placed in the interior 
category, partially categorized or not categorized at 
all. Which of these we choose is dependent upon 
the application and on how much we value precision 
versus recall. 
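
The ranking and top-down placement procedure can be
sketched as follows. The data layout (a dictionary
of children per node and a dictionary of feature
weights per node) and the function names are
assumptions for illustration. The text does not say
how ties between siblings are broken; this sketch
keeps the first child listed.

```python
from collections import Counter


def rank(doc_counts, weights):
    """Sum of N_fd * W_fc over features both in d and associated with c."""
    return sum(n * weights[f] for f, n in doc_counts.items() if f in weights)


def categorize(words, children, weights, root="root"):
    """Descend from the root, following the best positively ranked child.

    children: {node: [child, ...]} (leaves are absent or map to [])
    weights:  {node: {feature: weight}}
    Returns the node where the document comes to rest.
    """
    doc_counts = Counter(words)
    node = root
    while children.get(node):
        ranked = [(rank(doc_counts, weights.get(c, {})), c)
                  for c in children[node]]
        best_rank, best = max(ranked, key=lambda rc: rc[0])
        if best_rank > 0:
            node = best          # repeat the contest one level down
        else:
            return node          # no child has a positive rank
    return node


# Hypothetical hierarchy and weights.
children = {"root": ["corporate", "trade"], "corporate": ["acq", "earn"]}
weights = {
    "corporate": {"shares": 1.0, "tariff": -0.5},
    "trade": {"tariff": 1.0, "shares": -0.5},
    "acq": {"merger": 1.0, "dividend": -0.8},
    "earn": {"dividend": 1.0, "merger": -0.8},
}
placed = categorize(["shares", "merger", "merger"], children, weights)
```

In this toy example the document reaches the leaf
"acq". A document containing only "shares" would
stop at the interior node "corporate", and one with
no indexed features would remain at the root.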
3.2 The ACTION Algorithm 
The ACTION Algorithm was first described in
(Wong et al., 1996) as a method of associating
documents with categories within a hierarchy. Here,
we use it to associate vocabulary with nodes in a 
hierarchy and associate documents with the nodes 
using the procedure described in Section 3.1 above. 
The original algorithm applied to problems with
documents at interior and leaf nodes. Although our
adaptations apply to that more general case also, we
describe the algorithm with respect to the simpler
case in which documents appear only at leaf nodes,
since that is the structure of the corpus we are using.
The algorithm begins by counting $N_{fc}$, the
number of times feature f appears in documents
associated with category c in the training set, for
all f and c. There is a level, $L_c$, associated with
each category, c, in the hierarchy. By convention,
the root is at level 1; its immediate successors are
at level 2, etc.
We then define $EF_{fc}$, the effective frequency of
the subtree rooted at node c with respect to feature
f, as

$$EF_{fc} = \sum_{j \in S_c} N_{fj} \tag{4}$$

Thus, $EF_{fc}$ is the total number of occurrences of
f in c and all subcategories, $S_c$, of node c.
Finally, we define $V_{fc}$, the significance value of
c with respect to f, as

$$V_{fc} = L_c \times EF_{fc} \tag{5}$$
Thus, a node gets credit, in proportion to its level, 
for occurrences of f in itself and in its successors. 
The farther down the tree a node is, the more credit 
it is given for its level, but the higher up the tree 
a node is, the larger the subtree rooted at c and 
the larger the credit it gets for effective frequency. 
A competition thus takes place between each node
and its parent (immediate predecessor). For each
feature, f, the significance value of c, $V_{fc}$, is
compared with that of its parent p, $V_{fp}$, and if
the child's value is smaller then f is
removed from node c. Thus a parent can remove a
feature from a child but not vice versa. In the case 
of a tie, the child loses the feature. All this compe- 
tition proceeds from the leaves upward towards the 
root. 
The net effect of this is that if a feature occurs in 
only a single child of a given parent, then the child 
retains the feature (as does the parent), but if the 
feature occurs significantly in more than one child 
of the same parent, then only the parent retains the 
feature. 
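
The competition can be sketched in Python as shown
below. This is a simplified illustration, not the
implementation of (Wong et al., 1996): it assumes
that the competition compares significance values
(level times effective frequency), that these values
are computed once from the training counts so the
order of comparisons does not matter, and that the
tree is given as a dictionary of children. All names
are hypothetical.

```python
def action_select(counts, children, level, root="root"):
    """Return {node: set of features retained after the competition}.

    counts:   {node: {feature: N_fc}} training counts at each node
    children: {node: [child, ...]}
    level:    {node: level number}, root at level 1 by convention
    """
    ef = {}

    def effective(node):
        # Effective frequency: counts at the node plus its whole subtree.
        total = dict(counts.get(node, {}))
        for child in children.get(node, []):
            for f, n in effective(child).items():
                total[f] = total.get(f, 0) + n
        ef[node] = total
        return total

    effective(root)
    # Significance value: level times effective frequency.
    sig = {c: {f: level[c] * n for f, n in fs.items()} for c, fs in ef.items()}
    retained = {c: set(fs) for c, fs in ef.items()}
    for parent, kids in children.items():
        for child in kids:
            for f in list(retained[child]):
                # The child loses the feature on a smaller value or a tie.
                if sig[parent].get(f, 0) >= sig[child][f]:
                    retained[child].discard(f)
    return retained


# Hypothetical counts: "mln" is shared by two siblings.
counts = {"acq": {"merger": 10, "mln": 6}, "earn": {"dividend": 8, "mln": 6}}
children = {"root": ["corporate"], "corporate": ["acq", "earn"]}
level = {"root": 1, "corporate": 2, "acq": 3, "earn": 3}
retained = action_select(counts, children, level)
```

In this toy tree "merger" and "dividend" each occur
in a single child and are kept there, while "mln",
which occurs in both children, is kept only by
their parent.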
Several advantages accrue from all this. First, 
common features, including stopwords, will natu- 
rally rise to the root, where they will not participate 
in any rankings. Thus, this algorithm is a gener- 
alized version of removing stopwords. If a feature 
is prominent in several children of the same node, 
the parent will remove it from all of them. Ideally, 
words that are important for making fine distinc- 
tions among categories farther down in the category 
hierarchy, but are ambiguous at higher levels, will 
participate only in places where they can help. 
Note that we never directly remove a feature from 
the parent even when the child retains it. The rea- 
son for this is that we may need the feature to get 
the document to the parent; if it doesn't reach the 
parent it can never reach the child. In the case 
where a feature strongly represents only one cate- 
gory, there is no harm in the parent retaining it. In 
the cases where it is ambiguous at the level of the 
parent, the grandparent removes it from the parent 
(its child). 
Thus, at the end of the algorithm when we filter 
the feature set for each category (leaf and non-leaf) 
retaining only the 50 most highly ranked positive 
and negative words, at non-leaf categories we also 
retain any words retained by their children. 
4 Computational Experience
There are a number of ways that the performance of 
a hierarchical categorization system can be tuned. 
Two alternatives that we are exploring are modi- 
fying the topology of the hierarchy and adjusting 
the weighting functions. This paper describes the 
experiments that we performed in order to under- 
stand the effects of modifying the topology. In an- 
other paper, we describe the effect of adjusting the 
level numbers (weights) of the categories within the 
hierarchy (D'Alessio et al., 1998). Our ultimate ob- 
jective is to find a set of transformations that can 
be applied to a hierarchy as a part of the training 
process. 
In the first experiment no hierarchy was used, 
that is, none of the 5 Reuters supercategories were 
used. We applied our feature selection algorithm 
and our categorization algorithm in the normal
manner; however, we assigned the root a level of
0. The effect of this is to prevent any features from 
being associated with the root. We refer to this
organization as Flat-0. The remaining categories 
then keep their 50 most significant positive and neg- 
ative features. The results for overall precision and 
recall, the number of unclassified documents (doc- 
uments left at the root), and a selected example are 
reported in Table 1. Examining the results of this 
experiment shows that our algorithm does poorly 
in the case of several small categories. For
example, there are only 4 petrochemical test
documents; however, our algorithm assigned 124
documents to the petrochemicals category, of which
only 1 was actually a petrochemicals document. Other
small categories such as lumber, strategic metals,
and money supply exhibit similar behavior. An
examination of these categories shows that in each
case they share a few key features with a larger
category. When these features appear in a test
document they are given disproportionate weight in
the smaller categories. Of the incorrect documents
assigned to petrochemicals, nearly all (118) were
either acquisitions or earnings documents. The
vocabulary associated with the petrochemicals
category in Flat-0 includes words such as "mln" and
"dlrs" that are also earnings and acquisitions
words. The formula
used to assign weights to words found in test doc- 
uments uses a normalization factor to account for 
the differences in the sizes of the categories. In this 
case the net effect is to bias the decision towards 
petrochemicals whenever these words appear in a 
test document. 
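
The bias can be illustrated with a small worked
example. The counts below are hypothetical and were
chosen only to show the mechanism: because each
weight is normalized by the maximum feature
frequency within its own category, a shared word can
weigh more in a small category than in a large one.

```python
def weight(n_fc, m_c, lam=0.4):
    """Positive feature weight: lam + (1 - lam) * N_fc / M_c."""
    return lam + (1 - lam) * n_fc / m_c


# Hypothetical counts for a shared word such as "mln".
small = weight(n_fc=8, m_c=10)        # small category, e.g. petrochemicals
large = weight(n_fc=900, m_c=3000)    # large category, e.g. earnings

# Contribution of three occurrences of the word to each category's rank.
small_rank = 3 * small
large_rank = 3 * large
```

Under these assumed counts the word weighs 0.88 in
the small category and 0.58 in the large one, so
each occurrence in a test document pulls the
decision toward the small category.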
One advantage of using a hierarchy is that it 
should provide a mechanism for moving features 
to positions where they aid in categorization and 
remove features from positions where they are am- 
biguous. We tested this hypothesis by introducing 
a simple hierarchical organization. We changed the 
level of the root node from 0 to 1, and gave the 
subcategories of the root a level of 2. We refer to 
this organization as Flat-1. Again, each category
kept its 50 most significant positive and negative
features and the categorization algorithm was
applied to the same test data as above. The
comparison between Flat-0 and Flat-1, for this case,
is also given in Table 1. Note the significant
improvement in precision and recall. Examination of
the vocabulary associated with the petrochemicals
category in Flat-1 shows that it no longer includes
"mln" and "dlrs": the ACTION algorithm has removed
them, preventing this small category from stealing
documents from larger categories with some similar
features. Additionally, the time required for the
categorization was reduced by a factor of one third.
This experiment demonstrated the beneficial effect
of using the ACTION algorithm with the hierarchy by
allowing us to efficiently compare the relative
frequency of features within a category and outside
a category. The ambiguous words that were previously
associated with petrochemicals were either moved to
the root, where they became stop words, or were
moved to other categories.
We then conducted a number of experiments to 
explore how modifying the topology of the hier- 
archy affects the categorization. As a baseline, 
we used the hierarchy of topics supplied with the 
Reuters corpus (see Figure 1) referred to as the Ba- 
sic hierarchy. This organization is significantly dif- 
ferent from Flat-1 in that it is a three-level hierar- 
chy with 5 supercategories. We applied our feature 
selection and categorization algorithms using the 
same test data as above. The results for overall 
precision and recall, the precisions and recalls asso- 
ciated with the acquisitions and earnings categories 
themselves, and document placement counts are
reported in Table 2 below. The time required for the
categorization for the Basic hierarchy was approx- 
imately one half the time required for the Flat-1 
case. An examination of the results shows that this 
hierarchy also avoids the small category problem 
experienced in Flat-0. However the overall perfor- 
mance was not as good as in Flat-1. We identified 
and analyzed situations where the use of the deeper 
hierarchy caused problems and attempted to study 
the problems by modifying the hierarchy. 
An error analysis using the dispersion matrix
identified the first problem as occurring when
sibling leaf categories steal documents from each
other. An example is the case of the earnings and
acquisitions categories. In the Basic hierarchy both
earnings and acquisitions are subcategories of the
corporate category while in Flat-1 both are
subcategories of the root. A comparison of the
precision and recall for acquisitions and earnings
using Flat-1 versus Basic shows that acquisitions'
recall drops from 92% to 77% with the other values
remaining somewhat comparable. In this case
the deeper hierarchy impedes performance. An
examination of the dispersion matrix (Table 2) for
the Basic hierarchy shows that 91 acquisitions
documents are classified as earnings documents and
15 earnings documents are classified as acquisitions
documents, with another 19 acquisitions documents
being left at the corporate node. This indicates
that most of the earnings and acquisitions
documents are being correctly classified as corporate
documents; however, in many cases there is
insufficient information to make the distinction
between earnings and acquisitions. We hypothesize
that in this case, our vocabulary selection algorithm
has removed too many terms from earnings and
acquisitions and given them to corporate. Removing
the corporate category from the hierarchy would allow
earnings and acquisitions to become subcategories
of the root and retain more of their significant
features.

Table 1: Comparison Between Flat-0 and Flat-1

          Overall Prec/Rec (%)
Flat-0    82.85/82.79
Flat-1    89.36/85.74

[Table 1 also reported the number of unclassified
documents and the correct and incorrect
petrochemicals assignments; those values are not
recoverable here.]
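
The recall figures quoted for the Basic hierarchy
can be checked with a few lines of Python. The
counts are those given in the text and in the
Figure 1 totals; only the variable names are
introduced here.

```python
# Test-document totals per category (from Figure 1).
acq_total, earn_total = 688, 1156

# Placement counts for the Basic hierarchy (from the dispersion matrix).
acq_correct, acq_as_earn, acq_at_corporate = 529, 91, 19
earn_correct, earn_as_acq = 1121, 15

acq_recall = acq_correct / acq_total
earn_recall = earn_correct / earn_total

# Share of acquisitions documents that at least reached the corporate
# subtree (placed in acq, in earn, or left at the corporate node).
acq_in_corporate = (acq_correct + acq_as_earn + acq_at_corporate) / acq_total
```

The acquisitions recall works out to about 77%,
matching the figure quoted, while about 93% of
acquisitions documents reach the corporate subtree,
which is consistent with the observation that the
errors arise mainly in separating the two siblings.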
We tested this hypothesis by constructing a new
hierarchy, Var-1, by removing corporate from the
Basic hierarchy. Table 3 summarizes the
comparison between these two topologies. The table
illustrates that in the case where acquisitions and
earnings are both children of the root (Var-1) there
is less stealing of documents occurring between
these two siblings, resulting in an overall
improvement over the Basic hierarchy case.
A second problem we identified is that in some
cases the vocabulary selection process removes too
many features from a leaf category, with the result
that it becomes difficult to properly categorize
documents belonging to that category. An example
of this can be seen with the category interest. As
shown in the dispersion matrix for the Basic
hierarchy, there are 104 test documents belonging
to the interest category; however, only 24 interest
documents are correctly classified. In this case
interest is a subcategory of the root and most of its
incorrectly classified test documents (68) are
classified in the economic indicators subtree. Here
we have a slightly different problem. We do not have
sibling leaves stealing documents from each other.
Only one economic indicators document, a trade
document, is placed in interest. We have a leaf
category competing directly with a larger, similar
subtree. As a result many of its documents are placed
in the subtree. We hypothesize that in this case
the leaf category should be moved into the subtree.
This would allow the smaller category to compete
for the documents that are assigned to the subtree.
We tested this hypothesis by constructing a new
hierarchy, Var-2, by making interest a subcategory
of economic indicators. Table 4 summarizes the
comparison of the overall precision and recall, and
selected document placement counts for the Basic
hierarchy, Var-2, and a third hierarchy, called
Var-3, that is a variation combining variations one
and two. Again, we see an improvement in overall
precision and recall, but this time it was a result of
making a category that was weak and losing its
documents stronger by moving it to a position where
it could directly compete for features and thus
documents.
A third type of problem was also identified. At
times a leaf category will have poor precision
because it is assigned many documents not belonging
to the category. In some cases this occurs because
documents were incorrectly classified at a higher
node in the hierarchy. These documents are then
examined along the wrong path and are placed in
an incorrect leaf. An example of this occurs in the
category trade, which is the largest subcategory of
economic indicators in the Basic hierarchy. The
dispersion matrix shows that there are 104 trade
test documents, 94 of which are correctly
classified; 101 other documents are incorrectly
classified as trade documents. This is not a case of a
category stealing documents from its sibling
categories; rather, documents belonging to a variety
of non-economic indicator categories are incorrectly
classified as economic indicators documents. When we
have to decide which subcategory of economic
indicators to place the documents into, trade, being
the largest subcategory, attracts the majority of the
documents. We hypothesize that we can correct
this problem by moving trade and making it a
subcategory of the root. This has two effects. First,
it weakens economic indicators by removing one of
its largest categories. Second, it weakens trade
because it lowers its level number and therefore
reduces the significance of its features. This is
exactly the reverse of the actions that we took with
interest, a category that was too weak to attract
the documents it needed.
To test our hypothesis we constructed a new hierarchy, Var-4, by making
trade a subcategory of the root. We also incorporated our other
variations, so that earnings and acquisitions are also subcategories of
the root and interest is a subcategory of economic indicators. Table 5
reports the comparison of the Basic and Var-4 hierarchies. The overall
precision and recall improve again, this time by taking a category that
was stealing documents because it was

Figure 1: Reuters basic hierarchy. [Figure not recoverable from the scan. Legible node labels with their numbers of test documents: root 2742, acquisitions 688, earnings 1156, trade 104.]

            Root  Corp   Acq  Earn  Interest  Trade  Eci*  Other
Root           0     0     0     0         0      0     0      0
Corp           0     0     0     0         0      0     0      0
Acq           25    19   529    91         0      8     1     15
Earn           2     1    15  1121         0      4     9      4
Interest       0     3     4     2        24     24    44      3
Trade          0     0     1     2         1     94     0      6
Eci*           2     1     0     7         0     21   122      9
Other          1     4    30    15         2     44    18    388
Table 2: Dispersion Table for Basic Hierarchy
The columns list the categories where documents were placed by the algorithm;
the rows list the categories the documents were actually in.
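
The per-category figures quoted in the text can be recomputed from the dispersion counts. The sketch below is illustrative only: the function name `precision_recall` is hypothetical, the all-zero Root and Corp rows are omitted, and two cells that are hard to read in the scan are taken as assumptions (Eci* placed in Trade as 21, which agrees with the 101 non-trade documents reported as trade, and Other placed at Root as 1).

```python
# Dispersion counts transcribed from Table 2.
# Rows: true category; columns: category assigned by the algorithm.
# Assumptions: the Eci*/Trade cell is read as 21 and the Other/Root cell as 1.
labels = ["Root", "Corp", "Acq", "Earn", "Interest", "Trade", "Eci*", "Other"]
rows = {
    "Acq":      [25, 19, 529,   91,  0,  8,   1,  15],
    "Earn":     [ 2,  1,  15, 1121,  0,  4,   9,   4],
    "Interest": [ 0,  3,   4,    2, 24, 24,  44,   3],
    "Trade":    [ 0,  0,   1,    2,  1, 94,   0,   6],
    "Eci*":     [ 2,  1,   0,    7,  0, 21, 122,   9],
    "Other":    [ 1,  4,  30,   15,  2, 44,  18, 388],
}


def precision_recall(category):
    """Precision and recall for one category from the dispersion counts."""
    col = labels.index(category)
    correct = rows[category][col]
    assigned = sum(r[col] for r in rows.values())  # column total
    actual = sum(rows[category])                   # row total
    return correct / assigned, correct / actual


# Eci* and Other aggregate several categories, so only leaf rows are reported.
for cat in ("Acq", "Earn", "Interest", "Trade"):
    p, r = precision_recall(cat)
    print(f"{cat:8s} precision={p:.2f} recall={r:.2f}")
```

Under these readings, acquisitions and earnings come out at roughly 91%/77% and 91%/97% precision/recall, in line with the Basic row of Table 3, while trade shows high recall but low precision, which is the pattern discussed in the text.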

         Overall       Acq        Earn       Acq docs   Earn     Acq
         Prec/Rec      Prec/Rec   Prec/Rec   at Corp    as Acq   as Earn
Basic    85.71/82.06   91/77      91/97      19         15       91
Var-1    87.55/84.61   92/90      97/94      -          38       16
Table 3: Comparison Between Basic and Var-1

         Overall       Interest docs   Non-Interest       Interest docs placed
         Prec/Rec      as Interest     docs as Interest   incorrectly in eci subtree
Basic    85.71/82.06   24              3                  68
Var-2    86.46/82.93   74              28                 26
Var-3    88.72/85.78   76              30                 27
Table 4: Comparison Among Basic, Var-2 and Var-3

         Overall       Trade docs   Non-Trade
         Prec/Rec      as Trade     docs as Trade
Basic    85.71/82.06   94           101
Var-4    89.49/86.91   87           24
Table 5: Comparison Between Basic and Var-4

too strong and moving it to a position where it had 
to compete with equally strong siblings. 
Table 6 summarizes the results for all the hierarchies discussed above.
5 Summary 
We have demonstrated that using a hierarchy can have a positive impact
on the categorization task. Precision and recall are increased and the
processing time is substantially reduced. In addition we have shown
that the topology of the hierarchy can be modified to produce
improvements in precision and recall. Our ultimate goal is to identify
a set of transformations, category level settings, and the conditions
under which each should be applied so that we can automatically train
the hierarchy. This would allow us to begin with a minimal hierarchy
such as Flat-1 and, using training data, automatically evolve an
optimal hierarchy. We are continuing to do research in this area.
An obvious danger when using a hierarchy is that placing a document
into its correct category involves multiple decision points. If an
error is made at an upper level in the hierarchy, the document will be
placed incorrectly. Therefore it is critical that these early decisions
be extremely accurate. Our experiments demonstrate that it is possible
to achieve this accuracy. In the case of Flat-1 only one decision point
is used and 2351 of 2742 (85.7%) test documents are placed in correct
categories. In the case of Var-4, if we look at only the first level,
2467 of the 2742 (89.7%) test documents are placed into the correct
subcategory. In addition, in Flat-1 the root is unable to make any
decision for 111 (4%) documents, while in Var-4 there are only 23
(0.8%) such documents. On a supercategory basis, the root performed
better for some than others. For commodities, it had precision and
recall around 82%. For energy, it had about 93% precision and recall.
Likewise, the performance of the interior nodes in the hierarchy
varied. Economic indicators had an 88% precision and a 76% recall,
while commodities had a 96% precision and a 93% recall. Thus we see
that there is room for further improvement via moving categories from
one part of the hierarchy to another, and this investigation is the
focus of our current research.
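
The dependence on early decisions can be made concrete with a small calculation. This is an illustrative sketch, not part of the experiments: it assumes that a document is placed correctly only if every decision on its root-to-leaf path is correct, the function name `path_accuracy` is hypothetical, and the two-level accuracies are made-up values chosen only to show the effect.

```python
from math import prod


def path_accuracy(decision_accuracies):
    """Probability that every decision on a root-to-leaf path is correct,
    assuming each accuracy is conditional on the earlier decisions being right."""
    return prod(decision_accuracies)


# Flat-1 has a single decision point: 2351 of 2742 test documents correct.
flat_1 = path_accuracy([2351 / 2742])

# Hypothetical two-level path: the final accuracy can never exceed the
# accuracy of the first decision, however good the lower level is.
first_level, second_level = 0.90, 0.95  # illustrative values only
two_level = path_accuracy([first_level, second_level])

print(f"Flat-1:    {flat_1:.1%}")
print(f"two-level: {two_level:.1%} (ceiling {first_level:.1%})")
```

Under this assumed model the first-level accuracy acts as a ceiling on the final result, which is why errors at the root are the most costly.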

References 
Apte C., Damerau F. and Weiss S.M. (1994) Automated Learning of
Decision Rules for Text Categorization. ACM Transactions on Information
Systems, 233-251.
Chakrabarti S., Dom B., Agrawal R., and Raghavan P. (1997) Using
Taxonomy, Discriminants and Signatures for Navigating in Text
Databases. Proceedings of the 23rd VLDB Conference; Athens, Greece.
Cohen W.W. and Singer Y. (1996) Context-Sensitive 
Learning Methods for Text Categorization. Pro- 
ceedings of the 19th Annual ACM/SIGIR Confer- 
ence. 
D'Alessio S., Kershenbaum A., Murray K., Schiaffino R. (1998) Category
Levels in Hierarchical Text Categorization. Proceedings of the Third
Conference on Empirical Methods in Natural Language Processing
(EMNLP-3).
Deerwester S., Dumais S., Furnas G., Landauer T., and 
Harshman R. (1990) Indexing by Latent Semantic 
Analysis. Journal of the American Society for In- 
formation Science, 41(6), pp. 391-407. 
Frakes W.B. and Baeza-Yates R. (1992) Informa- 
tion Retrieval: Data Structures and Algorithms. 
Prentice-Hall. 
Heckerman D. (1996) Bayesian Networks for Knowledge Discovery. Advances
in Knowledge Discovery and Data Mining. Fayyad, Piatetsky-Shapiro,
Smyth and Uthurusamy eds., MIT Press.
Hersh W., Buckley C., Leone T. and Hickman D. (1994) OHSUMED: An
Interactive Retrieval Evaluation and a New Large Text Collection for
Research. Proceedings of the 17th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval,
Philadelphia.
Koller D. and Sahami M. (1996) Towards Optimal Feature Selection.
International Conference on Machine Learning, Volume 13,
Morgan-Kauffman.
Koller D. and Sahami M. (1997) Hierarchically Clas- 
sifying Documents using Very Few Words. Inter- 
national Conference on Machine Learning, Volume 
14, Morgan-Kauffman. 
Larkey L. and Croft W.B. (1996) Combining Classifiers in Text
Categorization. Proceedings of the 19th Annual ACM/SIGIR Conference.
Lewis D. (1992) Text Representation for Intelligent Text Retrieval: A
Classification-Oriented View. Text-Based Intelligent Systems, P.S.
Jacobs ed., Lawrence Erlbaum.
Lewis D. and Ringuette M. (1994) A Comparison of Two Learning
Algorithms for Text Categorization. Third Annual Symposium on Document
Analysis and Information Retrieval, Las Vegas, pp. 81-93.
Ng H.-T., Goh W.-B. and Low K.-L. (1997) Feature Selection, Perceptron
Learning and a Usability Case Study. Proceedings of the 20th Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, Philadelphia, July 27-31, pp. 67-73.
Sahami M. (1996) Learning Limited Dependence Bayesian Classifiers.
Proc. KDD-96, pp. 335-338.
Salton G. (1989) Automatic Text Processing: The Transformation,
Analysis and Retrieval of Information by Computer. Addison-Wesley.
van Rijsbergen C.J. (1979) Information Retrieval. Butterworths, London,
second edition.
Witten I.H., Moffat A. and Bell T. (1994) Managing 
Gigabytes. Van Nostrand Reinhold. 
Wong J.W.T., Wan W.K. and Young G. (1996) ACTION: Automatic
Classification for Full-Text Documents. SIGIR Forum 30(1), pp. 11-25.
Yang Y. (1997) An Evaluation of Statistical Ap- 
proaches to Text Categorization. Technical Report 
CMU-CS-97-127, Computer Science Department, 
Carnegie Mellon University. 
Yang Y. (1996) An Evaluation of Statistical Approaches to MEDLINE
Indexing. Proceedings of the AMIA, pp. 358-362.
Yang Y. and Chute C.G. (1992) A Linear Least Squares Fit Mapping Method
for Information Retrieval from Natural Language Texts. Proceedings of
COLING '92, pp. 447-453.
Yang Y. and Pedersen J.O. (1997) Feature Selection in Statistical
Learning of Text Categorization. International Conference on Machine
Learning, Volume 14, Morgan-Kauffman.
(1997) UMLS Knowledge Sources, 8th Edition. National Library of
Medicine.
