DOCUMENT CLASSIFICATION BY 
MACHINE:Theory and Practice 
Louise Guthrie 
Elbert Walker 
New Mexico State University 
Las Cruces, New Mexico 88001 
Joe Guthrie 
University of 'lb.xas at E1 Paso 
E1 Paso, 'l>xa~s 79968 
Abstract 
In this note, we present results concerning the the- 
ory and practice of determining for a given document 
which of several categories it best fits. We describe a 
mathematical model of classification schemes and the 
one scheme which can be proved optimal among all 
those based on word frequencies. Finally, we report 
the results of an experiment which illustrates the effi- 
cacy of this classification method. 
TOPICAL PAPER, 
Subject Area: TEXT PROCESSING 
1 Introduction 
A problem of considerable interest in Computational 
Linguistics is thai; of classifying documents via com- 
puter processing \[lIayes, 1992; Lewis 1992; Walker 
and Amsler, 1986\]. Siml>ly put, it is this: a docu- 
ment is one of several types, and a machine process- 
ing of the document is to determine of wbicll type. 
In this note, we present results concerning the theory 
and practice of classification schemes t)ased on word 
frequencies. The theoretical results are about matlt. 
ematical models of classification schemes, and apply' 
to any document classitication problem to tile extent 
that the model represents faithfully that problem. One 
must cimosc a model that not only provides a math- 
ematical description of the problem at imnd, but one 
in which the desired calculations can be made. For ex- 
ample, in document classificatiou, it would bc nice to 
be able to calcnlatc the probability that a document 
on subject i will be classified as on subject i. Further, 
it would be comforting to know that there is no bet- 
ter scheme than the ouc being used. Our models have 
these characteristics. They are siml)lc, the calculations 
of probabilities of correct document classification are 
straightforward, and we imve proved that there are no 
schemes using tile same information that have better 
success rates. In an experiment the scheme was u~d 
to classify two types of documents, and was found to 
work very well indeed. 
2 The Description of a Classifi- 
cation Scheme 
Suppose that we must classify a document into one of k 
types. These types arc known. Here, k is any positive 
integer at least 2, and a typical value might be any- 
where from 2 to 10. I)enote these types T1,7~,..., 7l'k. 
The set of words in tile language is broken into m dis- 
joint subsets W1, W2,..., W,,. Now from a host ofdoc- 
umeuts, or a large body of literature, on subject ~/~, tile 
frequencies Pij of words in W i are determined. So with 
subject ~ we have associated the vector of frequencies 
(pil, Pi2, . . • ,Pim), and of cour~ pil+pi2+. . .+Pim = 1. 
Now, given a document on one of the possible k sub- 
jects, it is classified as follows. The document has n 
words in it, nl words from I4/1, n2 words from W~,..., 
and nm words from Win. Based on this information, 
a calculation is made to determine from which sub- 
ject the document is most likely to have come, and is 
so classified. This calculation is key: there arc many 
possible calculations on which a classification can be 
made, but some are better titan others. We will prove 
that in this situation, there is a best one. 
We elaborate on a specific case which ~ems to 
hold promise. The idea is that the frequencies 
(Pil, Pi2, •.., pi,,,) will be ditferent enough from i to i to 
distinguish between types of documents. From a docu- 
ment of word length n, let nj be the number of words in 
Wj. Titus the vector of word frequencies for that par- 
ticular document is (hi/n, n2/n,..., nm/n). The word 
frequencies gfrom a document of type i should resem- 
ble the frequencies (Pil, Plm... ,Pim), and indeed, the 
classification scheme is to declare the documeut to he 
of type Ti if its freqnencies "most closely resemble" the 
frequencies (Pll, Pi2, • •., Pi,,). Intuitively, if two of tile 
vectors are (Pil, Pi2,...,Pim) very nearly equal, then 
it will be difficult to distinguish documents of those 
two types. Thus the success of cla.ssification depends 
critically on the vectors (pil, pi'~,..., pim) of frequen- 
cies. Equivalently, the sets Wj are critical, and must 
be chosen with great care. The particular situation we 
have in mind is this. Faeh of the types of documents is 
1059 
on a rather special topic, calling for a somewhat spe- 
cialized vocabulary. The Language is broken into k + 1 
disjoint sets W1, Wu,..., Wk+l of words. For i < k, 
the words in W/ are "specific" to subject i, and Wk+l 
consists of tire remaining words in the language. Now 
from a host of documents, or a large body of literature, 
on the subject T/, we determine the frequencies pij of 
words in W/. But first, the word sets Wi are needed, 
and it, is also from such bodies of text that they will be 
determined. Doing this in a manner that is optimal for 
our models is a difficult problem, but doing it in such 
a way that our models are very effective seems quite 
routine. 
So with subject Ti we have associated the vector 
of frequencies (Pil, Pi2,...,Pim), the vector being of 
length one more than the number of types of docu- 
ments. Since the words in Wi are specific to documents 
of type 7\], these vectors of frequencies should be quite 
dissimilar and allow a sharp demarkation between doc- 
ument types. This particular scheme has the added ad- 
vantage that m is small, being k+l, only one more than 
the number of document types. Further, our scheme 
will involve only a few hundred words, those that ap- 
pear in Wl, W2,..., Wk, with the remainder appearing 
in Wk+l. This makes is possible to calculate the prob- 
abilities of correct classification of documents of each 
particular type. Such calculations are intractable for 
large m, even on fast machines. There are classification 
schemes being used with m in the thousands, making 
an exact mathematical calculation of probabilities of 
correct classification next to impossible. But with k 
and m small, say no more than 10, such calculations 
are possible. 
3 The Mathematical Model 
A mathematical description of the situation just 
described is this. We are given k multino- 
mial populations, with the i-th having frequencies 
(plr, pi~,...,Pi,~). The i-th population may be en- 
visioned to be an infinite set consisting of m types of 
elements, with the proportion of type j being Pij. We 
are given a random sample of size n from one of the 
populations, and are asked to determine from which of 
the populations it came. If the sample came from pop- 
ulation i, then the probability that it has nj elements 
of type j is given by the formula 
n m (n!/,,, !,~!... n.~ !)(P;~'PT~"" 'P,. )' 
This is an elementary probabilistic fact. If a sample to 
be classified has nj elements of type j, we simply make 
this calculation for each i, and judge the sample to be 
from population i if the largest of the results was for 
the i-th population. Thus, the sample is judged to be 
from the i-th population if the probability of getting 
the particular n/'s that were gotten is the largest for 
that population. 
To determine which of 
(n!/nl!n2! ..n IX\[,~'~,," .... p~,~,) • rn "\]kYil ~i2 
is the largest, it is only necessary to determine which of 
nl n~ r'.m the (Plr Pi2 " " "Pin, ) is largest, and that is an easy ma- 
chine calculation. All numbers are known beforehand 
except the n i's, which are counted from the sample. 
Before illustrating success rates with some calcula- 
tions, some comments on our modeling of this docn- 
meat classification scheme are in order. The i-th multi- 
nomial population represents text of type 7~. This text 
consists of m types of things, namely words from each 
of the W i. The frequencies (pit, Pi~,... ,pin,) give the 
proportion of words from the classes W1, W'2,..., Wm 
in text of type 7~. A random sample of size n repre- 
sents a document'of word length n. This last repre- 
sentation is arguable: a document of length n is not 
a random sample of n words from its type of text. 
It is a structured sequence of such words• The va- 
lidity of the model proposed depends on a document 
reflecting the properties of a random sample in the fre- 
quencies of its words of each type. Intuitively, long 
documents will do that• Short ones may not. The 
success of any implementation will hinge on the fre- 
quencies (Pit, Pi2,...,Pim). These frequencies must 
differ enough from document type to document type 
so that documents ('an be distinguished on the basis of 
them. 
4 Some Calculations 
We now illustrate with some calculations for a simple 
case: there arc two kinds of documents, T1 and 7~, and 
three kinds of words. We have in mind here that Wj 
consists of words specific to documents of type Tz, W:2 
specific to T2, and that Wa consists of the remaining 
words in the language. So we have the frequencies 
(pu, pr2, w3) and (m~, m2, m3). Of course vi.~ = ~- 
Pll -Pi2. Now we are given a document that we know 
is either of type 711 or of type 7~, and we nmst discern 
which type it is on the basis of its word frequencies. 
Suppose it has nj words of type j, j = 1,2,3. We 
calculate the numbers 
nl n2 r~3 ti = Pil Pi2 Pi3 
for i = 1, 2, and declare the document to be of type 7~ if 
ti is the larger of the two. Now what is the probability 
of success? }tere is the calculation. If a document 
of size n is drawn from a trinomial population with 
parameters (p11, P12, pla), the probability of getting 
nl words of type l, n2 words of type 2, and n3 words 
of type 3 is 
n ! ~l In ll'~ ! nl ~ na ( ./ r. 2. 3.)(PllP12P13). 
Thus to calculate the probability of classifying sue- 
cessfidly a document of type 7'1 ms being of that type, 
we must add these expressions over all those triples 
(nl,n2, n3) for which tl is larger than t2. This is a 
1060 
fairly easy coml)utation, and we have carried it out for 
a host of different p's and n's. Table I contains results 
of some of these calculations. 
Table i gives the probability of classifying a doc- 
ument of type T1 as of type 7~, and of classifying a 
document of type 7~ as of type '/~. These probabili- 
ties are labeled Prob(f) and Prob(2), respcctively. Of 
course, here we get for free the probability that a docu- 
ment of type 7'1 will be classified ms of type 7~, namely 
1- Prob(1). Similarly, 1~- l'rob(2) is tile probability 
that a document of type 7.) will be classified as of type 
7\]. The Plj are the frequencies of words from Wj for 
documents of type '/~, and n is the muuber of words in 
the document. 
Table 1 
Ptj .08 .04 .88 
P2j .03 .06 .91 
n 50 100 200 400 
Prob(1) .760 .871 .951 .991 
Prob(2) .842 .899 .959 .992 
\[ ~,,.o~(2) 
.10 .03 .87 
.02 .05 .93 
50 100 200 400 
.894 .963 .995 .999 
.920 .975 .997 .999 
I Plj .08 .04 .88 p~2 .07 .04 .89 I 
n 50 100 200 400 
Prob(1) .575 .553 .595 .638 
Prob(2) .533 .598 .617 .658 
There are several things worth noting in Table 1. 
The frequencies used in tile table were chosen to il- 
lustrate the behavior of the scheme, att(l not necessar- 
ily to reflect document claqsification reality, l\[owevcr, 
consider the first set of l?equeneies (.08, .()4, .88) and 
(.03, .06, .91). This represents a circnmstan(-c where 
documents of type T1 have eight percent of their words 
specific to that subjcct, and four percent specific to the 
other subject. Documents of type 7.) have six percent 
of their words specific to its subject, and three percent 
specific to the other sutlject. These percentages seem 
to be easily attainable. Our scheme correctly classifies 
a document of length 200 and of type q'l 95.1 percent of 
the time, and a docmneut of length 400 99.1 percent of 
the time. The last set of frequencies, (.08, .04, .88) and 
(.07, .04, .89) arc Mrnost alike, and as the table shows, 
do not serve to classify documents correctly with high 
probability. In general, the probabilities of success arc 
remarkably high, even for relatively small n, and in the 
experiment reported on in Section 6, it was easy to find 
word scts with satisfatory frequencies. 
It is a fact that the probability of success can 
be made as close to 1 as desired by taking n large 
enough, assuming that (Ptt, Pv~, Pro) is not identi- 
cal to (P'21, P2~, P23). llowever, since for reasonable 
frequencies, tile probabilities of success are high for n 
just a few hundred, this snggests that long documents 
would not have to he completely tabulated in ordcr to 
be classified correctly with high probability. One could 
just use. a random sample of appropriate size from tile 
document. 
The following table give some success rates for the 
case where thcre are three kinds of documents and four 
word cla.,~qes. The rates are surprisingly high. 
Table 2 
PU .05 .03 .02 .90 
P2j .01 .06 .01 .92 
P~L._ .04 .02 .08 .86 
n 50 1O0 200 400 
Prob(1) .703 .871 .966 .997 
Prob(2) .884 .938 .985 .999 
Prob(3) .826 .922 .981 .998 
Plj .05 .03 .02 .90 
P'zj .01 .05 .0l .93 
P3j .03 .02 .05 .90 
n 50 100 200 400 
l'rob(l) .651 .784 .906 .978 
l'rob(2) .826 .917 .977 .998 
1"rob(3) .697 .815 .916 .978 
5 Theoretical Results 
In this section, we prove our optilnality result. But 
first we must give it a precise mathematical formu- 
lation. 'lb say that there is no better claqsification 
schcme than some given one, wc nmst know not only 
what "better" means, we must kuow precisely what 
a classitication schenu~ is. The setup is as in Sec-- 
tion 3. We have k multinomial populations with 
frequencies (Pil,Plu,...,plm), i = 1,2,...,k. We, are 
given a random sample of size n from one of the 
populations and are forced to assert from wMch one 
it camc. Tbe infi)rmation at our disposal, besides 
the set of frequencies (pit,pin,...,pim), is, for each 
j, thc number nj of elements of type j ill the sam- 
pie. So the inlormation Lfrom the sample is the tu- 
pie (hi, n=,,...,n,,). Our scheme for specifying front 
which population it came is to say that it came 
• t\/ I'll tl:~ * t~m front population i if (n!/nl !n~! .. n.~.)tpi~ Piu " 'Pi,,, ) 
is maximnm over the i's. This then, determines 
which (n~, nu .... , n.,) re.nits in which cla.~sification. 
()ur scheme partitions the sample space, that is, the 
set of all the tuples (nl,n2 .... ,n.,), into k pieces, 
1061 
the i-th piece being those tuples (nl,n2,...,nm) 
for which (n!/nl!nz!."n,,,!)(p:?l~p ..... P~r~) is maxi- 
mum. For a given sample (or document) size n, this 
leads to the definition of a scheme as any partition 
{A1, A~ .... , Ak} of the set of tuples (nl, n~,..., nm) 
for which ~i ni = n into k pieces. The procedure then 
is to classify a sample as coming from the i-th pop- 
ulation if the tuple (hi, n2,..., am) gotten from the 
sample is in Ai. It doesn't matter how this partition 
is arrived at. Our method is via the probabilities 
rl m qi(nl, nu,..., nm) = (n!/nl!n2!." nm!)(P~?P~¢ ""Pi,n ). 
There are many ways we could define optimality. 
A definition that has particular charm is to define a 
scheme to be optimal if no other scheme has an higher 
overall probability of correct classification. But in this 
setup, we have no way of knowing the overall rate of 
correct classification because we do not know what pro- 
portion of samples come from what populations. So we 
cannot use that definition. An alternate definition that 
makes sense is to define a scheme to be optimal if no 
other scheme has, for each population, a higher proba- 
bility of correct classification of samples from that pop- 
ulation. But our scheme is optimal in a much stronger 
sense. We define a scheme A1,A2,...,Ak to be opti- 
mal if for any other scheme B1, B2, • •., B~, 
~_~ P(AdTi) >_ ~ P(BiIT~). 
Proofs of the theorems ill this note will be given 
elsewhere, .. 
Theorem 1 Let T1,T2,...,Tk be multinomial popu- 
lations with the i- th population having frequencies 
(Pil ,Pi2 ..... Pim). For a random sample of size n from 
one of these populations, let nj be the number of ele- 
ments of type j. Let 
tl m qi(nl, n2 .... ,nm) = ~:n I/n./1.1n^l"'z. n m "\]~vill~('~n~ vi2"~n .... Pim )" 
Then 
space 
by 
the partition of the sample ! 
{(nl,n2,...,n~) : nj > 0, Ejnj = n} given 
Ai ~- {(nl, n2 ..... am): qi(nl,n2 ..... am)) > 
qj(nl, n2,..., Urn) for i # j} 
is an optimal schente for determining from which of 
the populations a sample of size n came. 
An interesting feature of Table 1 is that for all fre- 
quencies Prob(1) + Prob(2) is greater for sample size 
100 than for sample size 50. This supports our intu- 
ition that larger sample sizes should yield better re- 
sults. This is indeed a fact. 
Theorem 2 The following inequality holds, with 
equality only in the trivial case that Pik ~--- Pjk for all i, 
j, and k, 
m~x(((, + 1)!/(,,!n~! • • • n.,!)pT:p~¢... P,"2) -> 
n-t-1 
max((n!/(nl !n~! . . . -,, . ,.i, ,'is " " Pi,, 1, 
n 
where ~n+l means to sum over those tuples (hi, n~, 
• • . , nm) whose sum is n+ 1, and ~n means to sum 
over those tuples (nl, n2, • . . , n~) whose sum is n. 
6 Practical Results 
Our theoretical results assure us that documents can 
be classified correctly if we have appropriate sets of 
words. We have algorithms which compute the proba- 
bility of classifying document types correctly given the 
document size and the probability of some specialized 
sets of words appearing in the two document types. 
Tables 1 and 2 show some sample outputs from that 
program. Intuitively, we need sets of words whicb ap- 
pear much more often in one text type than tile other, 
but the words do not need to appear in either text type 
very often. Below we describe an experiment with two 
document collections that indicates that appropriate 
word sets can be chosen easily. Moreover, in our sam- 
ple experiment, the word sets were chosen automati- 
cally and the classification scheme worked perfectly, as 
predicted by our theoretical results. 
Two appropriate collections of text were available 
at the Computing Research Laboratory. The first was 
made up of 1000 texts on busine~ (joint ventures) 
/,from the DAR.PA TIPSTER project and the second 
collection consisted of 1100 texts from the Message 
Understanding Conference (MUC) \[Sundheim, 1991\] 
describing terrorist incidents in South America. The 
business texts were all newspaper articles, whereas the 
MUC texts were transmitted by teletype and came 
from various sources, such as excerpts from newspa- 
per articles, radio reports, or tape rccorded messages. 
The collections were prepared by human analysts who 
judged the relevance of the documents in the collec- 
tions. Each collection contained about half a million 
words. 
We removed any dates, annotations, or header infor- 
mation from the documents which uniquely identified 
it as being of one text type or another. We divided 
each collection of texts in half to form two training 
sets and two test sets of documents, yielding four col- 
lections of about a quarter of a million words each. We 
treated each of the training sets as one huge text and 
obtained frequency counts for each of the words in the 
text. Words were not stemmed and no stop list was 
used. Thc result was two lists of words with their cor- 
responding frequencies, one for the TIPSTER training 
set and one for the MUC training set. 
Our goal at this point was to choose two sets of 
words, which we call TIP-SET and MUC-SET, that 
could be used to distinguish the documents. We knew 
from the results of TABLE 1 that if we could identify 
one set of words (TIP-SET) that appeared in the TIP- 
1062 
STER documents with probability .1 and in the MUC 
documents with low probability (say .03 or less) and 
another set (MUC-SET) that appeared with probabil- 
ity .1 in the MUC documents and a low probability 
(say .03 or less) in the TIPSTER documents, that we 
could achieve perfect or nearly perfect classification. 
We used a simple heuristic in our initial tests: choose 
the TIP-SET by choosing words which were among the 
300 most frequent in the TIPSTER training set and 
not in the 500 most frequent in the MUC training set. 
We intended to vary the 300 and 500 to see if we could 
choose good sets. However, this algorithm yielded a 
set of words that appeared with probability .13 in the 
TIPSTER training set and with probability .01 in the 
MUC training set. Note that even though no stop list 
was used when the frequency counts were taken, this 
procedure effectively creates a stop list automatically. 
The same algorithm was used to create the MUC-SET: 
choose words from among the 300 most frequent in the 
MUC training set if they did not appear in tile 500 
most frequent in tim TIPSTER~ training set. 
Our theoretical results implied that we could classify 
each document type correctly 99.99% or the time if we 
had documents with at least 200 words. Our average 
document size in the two collections was 500 words. We 
then tested the classification scheme on the remaining 
half (those not used for training) of each document set. 
Only one document was classitied differently from the 
truman classification. 
When we read the text in question, it was our opin- 
ion that the original document classification by a hu- 
man was incorrect. If we change the classification 
of this text, then our document classitication scheme 
worked perfectly on 700 documents. It should be noted 
that the two document collections that were available 
to us were on very different subject matter, so the 
choice of the word sets was extremely easy. We expect 
that differentiating texts which are on related subject 
areas will be much more difficult and we are developing 
retinements for this task. 
\[Walker and Amsler, 1986\] D. Walker and R. Amsler, 
The Use of Machine-Readable Dictionaries in Sublan- 
guage Analysis, Analyzing Language in l~eslricled Do- 
mains, Grishman and Kittredge, eds., Lawrence Erl- 
baum, I\[illsdale, NJ. 
7 References 
\[Hayes, 1992\] Philip Hayes, Intelligent tligh-Volume 
q hxt Processing Using Shallow, Domain Specific Tech- 
niques, 7~xt-Based Intelligent Systems, P. Jacobs, cd., 
Lawrence Erlbaum, Ilillsdale, N J, pp. 227- 241. 
\[Lewis, 1992\] David Lewis, Feature Selection and 
Feature Extraction for Text Categorization, Proceed- 
ings Speech and Natural Language Workshop, Morgan 
Kaufman, San Mateo, CA, February 1992, pp. 212- 
217. 
\[Sundheim, 1991\] Beth Sundheim, editor. Proceedings 
of the Third Message Understanding Evaluation and 
Conference, Morgan Kaufman, Los Altos, CA, May 
1991. 
1063 
