Content-Oriented Categorization of Document Images 
Takehiro Nakayama 
FX Palo Alto Laboratory, Inc. 
3400 Hillview Avenue 
Palo Alto, CA 94304 USA 
nakaymrm@pal.xerox.com 
Abstract 
We have developed a technique that catego- 
rizes document images based on their con- 
tent. Unlike conventional methods that use 
optical character recognition (OCR), we con- 
vert document images into word shape 
tokens, a shape-based representation of 
words. Because we need only recognize
simple graphical features from the image, this
process is much faster than OCR. Although
the mapping between word shape tokens and 
words is one-to-many, they are a rich source 
of information for content characterization. 
Using a vector space classifier with a 
scanned document image database, we show 
that the word shape token-based approach is 
quite adequate for content-oriented categori- 
zation in terms of accuracy compared with 
conventional OCR-based approaches. 
1 Introduction 
The number of documents available on the network is 
increasing with the development of the computational 
infrastructure. Accordingly, information retrieval has 
become one of the most important research topics in 
natural language processing (NLP). In the digital net- 
work world, documents are usually distributed in 
either text file or image format, where the former is a 
sequence of character codes (e.g., ASCII) and the latter
is a bitmap. Although only text files are machine-readable
and convenient from the viewpoint of information
retrieval, many documents are available as
images alone. They are easily generated by scanning
hard-copy documents, which remain in massive use in
the real world.
While most information retrieval systems have 
been designed for text files, there are some systems 
proposed for images. They convert images into text 
files using optical character recognition (OCR) to uti- 
lize existing NLP techniques. Even though state-of- 
the-art OCR creates noisy output with recognition 
errors (Rice, et al., 1995), prior work has shown that 
OCR output is satisfactory for retrieval purposes (Ittner,
et al., 1995; Mittendorf, et al., 1995; Myers and
Mulgaonkar, 1995; Wenzel and Hoch, 1995). The 
inaccuracy of OCR can be largely mitigated. How- 
ever, little attention has been paid to reducing the 
computational expense of OCR. OCR is a major bot- 
tleneck for information retrieval systems in terms of 
speed. For example, Myers and Mulgaonkar reported 
in their OCR-based information extraction system 
that the total processing time was dominated by char- 
acter and word recognition processes (Myers and 
Mulgaonkar, 1995). This suggests an important ques- 
tion: "how much NLP can be done without character 
recognition (Church, et al., 1994)?" 
As an alternative technique to OCR, there is word 
shape token processing which converts images into a 
shape-based representation. It recognizes coarse 
character shape classes (character shape codes) rather 
than character codes. Because the number of charac- 
ter shape codes is small and they are defined by sim- 
ple graphical features, their recognition from images 
is inexpensive. Word shape token processing has 
been proven to be of use for European language iden- 
tification (Nakayama and Spitz, 1993; Sibun and 
Spitz, 1994). Also, its feasibility for content characterization
has been discussed using controlled (noise-free)
on-line data sets (Nakayama, 1994;
Nakayama, 1995; Sibun and Farrar, 1994). However,
no analysis has been done with real document images, 
which are usually degraded in quality. In addition, a 
comparative evaluation between the word shape 
token-based and the OCR-based approach is needed. 
We have developed a technique which automati- 
cally categorizes document images into pre-defined 
classes based on their content. It employs a vector 
space classifier drawn from many robust statistical 
techniques in information retrieval (see Salton, 1991). 
We show in this paper that our technique can categorize
as accurately as the conventional OCR-based
approach while processing much faster.
In the next section, we describe the definition of 
character shape codes and word shape tokens, and 
their generation from document images. In section 3, 
we outline the automated categorization system which 
we developed. In section 4, with the use of a topic- 
tagged document image database, we show the word 
shape token-based approach is quite adequate for con- 
tent-oriented categorization in comparison with a con- 
ventional OCR-based system. In section 5, we 
discuss the experimental results and future work. 
2 Character Shape Code and Word 
Shape Token 
A character shape code is a machine-readable code 
which represents a set of graphically similar characters.
A word shape token is a sequence of one or
more character shape codes which represents a word.
Character shape codes are defined differently by the 
selection of graphical features. In this paper, we con- 
sider the number of connected components, vertical 
location, and deep concavity as graphical features to 
classify characters. First, we identify the positions of 
the text lines as shown in figure 1. Second, we iden- 
tify the character cells, and count the number of con- 
nected components in each character cell. Third, we 
note their position with respect to the text lines. 
Finally, we identify the presence of a deep eastward/ 
southward concavity. In figure 1, vertical location 
classifies characters into three groups--{"l"} {"g"} 
{"a", "n", "u", "e"}; characters that occupy the space 
between the top and the baseline, characters that 
occupy the space between the x-height line and the 
bottom, and characters that occupy the space between 
the x-height line and the baseline, respectively. The 
last one is further classified by presence or absence of 
a deep eastward/southward concavity. Resultant 
groups are {"a", "u"} {"e"} {"n"}. 
The defined character classes and the members for 
the ASCII character set are shown in Table 1. Once 
classification has been performed, the resulting char- 
acter shape codes are grouped by word boundary and 
used as word shape tokens for the downstream pro- 
cessing. Figure 2 gives an example of generated word 
shape token representation with its original document 
image. 
Figure 1: text line parameter positions (above)
and connected components (below)
Table 1: character shape code membership
character
shape code      members
A               A-Z b d f h k l t 0-9 # $ & @
x               a m o r s u v w x z
e               c e
n               n
i               i
g               g p q y
j               j
(punctuation)   ' " - , . : ; = ! ? ( ) / < > [ ] { } |
There are many different languages in common
use around the world and many different scripts
in which these languages are typeset.
AAexe xxe xxng AiAAexenA Axngxxgex in exxxxn
xxe xxxxnA AAe xxxAA xnA xxAg AiAAexenA xexigAx
in xAieA AAexe Axngxxgex xxe AggexeA.
Figure 2: document image (above) and generated
word shape tokens (below)
note: there is an error (many -> xxAg) in the
second line due to a small ink drop
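The character-to-code mapping can be sketched in a few lines of
Python. This is only an illustration working on text rather than
on images: the membership lists are assumed to be as read from
Table 1, the names SHAPE_CLASSES, to_shape_token and
to_shape_tokens are hypothetical, and passing unlisted characters
(such as punctuation) through unchanged is an assumption.

```python
import string
from typing import Dict, List

# Character shape code membership, as read from Table 1 (assumption).
SHAPE_CLASSES: Dict[str, str] = {
    "A": string.ascii_uppercase + "bdfhklt" + string.digits + "#$&@",
    "x": "amorsuvwxz",
    "e": "ce",
    "n": "n",
    "i": "i",
    "g": "gpqy",
    "j": "j",
}

# Invert the table into a per-character lookup.
CHAR_TO_CODE: Dict[str, str] = {
    ch: code for code, members in SHAPE_CLASSES.items() for ch in members
}


def to_shape_token(word: str) -> str:
    """Map a word to its word shape token.

    Characters not listed above (e.g. punctuation) are passed through
    unchanged, which is an assumption of this sketch.
    """
    return "".join(CHAR_TO_CODE.get(ch, ch) for ch in word)


def to_shape_tokens(text: str) -> List[str]:
    """Split on whitespace and convert each word to a word shape token."""
    return [to_shape_token(word) for word in text.split()]


if __name__ == "__main__":
    line = "There are many different languages in common"
    # Prints: AAexe xxe xxng AiAAexenA Axngxxgex in exxxxn
    print(" ".join(to_shape_tokens(line)))
```

Applied to the first line of the sample text in Figure 2, the
sketch yields the error-free token sequence for that line.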
Our character shape code recognition does not
require complicated image analysis. For example,
distinguishing "c" from "e" is a difficult task for OCR 
that requires a considerable computational expense 
(Ho and Baird, 1994), whereas they are in the same 
class in our representation (Table 1). Also, our pro- 
cess is free from font identification which is manda- 
tory for OCR (for font identification complexity, see 
Zramdini and Ingold, 1993). As a result, the process 
of word shape token generation from images is much 
faster than current OCR technology. 
While we save computational expense, we lose
some of the information that the original document
images contain. Table 1 shows that the mapping between
character shape codes and original characters is
one-to-many: we use only seven character shape codes
{A x e n i g j}^1 to represent all alphabetical characters.
1. We use boldface to represent the character 
shape codes. 
This would seem to be very ambiguous. However, 
when used for mapping between word shape tokens 
and original words, the ambiguity is much reduced. 
We show this using a lexicon of 122,545 distinct word 
(surface-form) entries. When we transformed the lex- 
icon into word shape token representation, the number 
of distinct entries was reduced to 89,065. This means 
one word shape token mapped to 1.38 words on aver- 
age. Next, we extracted nouns, which are important 
content-representing words for information retrieval, 
from the lexicon. We were then left with 75,043 dis- 
tinct word entries. Similarly, we obtained 57,049 dis- 
tinct word shape tokens from them. This time, one 
word shape token mapped to 1.32 words. More 
importantly, most of them--49,953 of 57,049 word 
shape tokens (87.6%)--mapped to a single word. 
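The ambiguity figures above are simple ratios over a lexicon
grouped by word shape token. The following sketch shows the
computation on a toy lexicon and then rechecks the reported
ratios. The toy lexicon, the function names, and the compact
translation table (built from the Table 1 membership lists as we
read them) are assumptions made for illustration, not the
lexicon or code used in the experiments.

```python
import string
from collections import defaultdict

# Compact Table 1 mapping (assumed reading); n, i and j map to themselves.
_SOURCE = (string.ascii_uppercase + "bdfhklt" + string.digits + "#$&@"
           + "amorsuvwxz" + "ce" + "gpqy")
_TARGET = "A" * 47 + "x" * 10 + "e" * 2 + "g" * 4
_TABLE = str.maketrans(_SOURCE, _TARGET)


def to_shape_token(word):
    """Convert a word to its word shape token."""
    return word.translate(_TABLE)


def ambiguity_stats(lexicon):
    """Return (distinct tokens, words per token, share of unambiguous tokens)."""
    groups = defaultdict(set)
    for word in set(lexicon):
        groups[to_shape_token(word)].add(word)
    n_tokens = len(groups)
    n_words = sum(len(words) for words in groups.values())
    unambiguous = sum(1 for words in groups.values() if len(words) == 1)
    return n_tokens, n_words / n_tokens, unambiguous / n_tokens


if __name__ == "__main__":
    # Toy lexicon: "cat"/"eat" and "dog"/"bag" collide, the rest do not.
    toy = ["cat", "eat", "dog", "bag", "market", "stock"]
    print(ambiguity_stats(toy))

    # Ratios reported in the text, recomputed from the stated counts.
    print(round(122545 / 89065, 2))        # words per token, full lexicon
    print(round(75043 / 57049, 2))         # words per token, nouns only
    print(round(100 * 49953 / 57049, 1))   # % of noun tokens with one word
```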
3 Categorization System
We implemented a content-oriented categorization
system to evaluate the word shape token-based
approach in comparison with the OCR-based
approach. The system, which uses the vector space
classifier, consists of three main processes as shown
in figure 3.
First, the system transforms the test document
image into a sequence of word shape tokens as
described in the previous section, where conventional
systems perform OCR to generate a sequence of
ASCII encoded words.
Next, it generates a document profile through the
following stages:
Stage 1. The system removes punctuation marks.
Note that they are distinguishable from alphabetical
characters in the character shape code
representation (Table 1).
Stage 2. The system removes word shape tokens
corresponding to stop-words. In this process, it
may also remove some non stop-words because
of the one-to-many mapping between word shape
tokens and words. In the OCR-based approach, it
removes stop-words.
Stage 3. The system computes frequencies of
word shape tokens to generate a document profile.
The document profile $D_i$ is represented as a
vector of numeric weights,
$D_i = (w_{i1}, w_{i2}, \ldots, w_{ik}, \ldots, w_{it})$,
where $w_{ik}$ is the weight given the kth word shape
token in the ith document, and t is the number of
distinct word shape tokens of the ith document. We
use the relative frequency between 0 and 1 as the weight.
For the OCR-based approach, read "word" for
"word shape token".
Finally, the system measures the degree of similarity
between the document profile and a category profile.
The category profile $D_j$ is also represented as a
vector derived in the same manner from a collection
of topic-tagged document images. The system uses
the cosine measure to compute the similarity:
\[
\mathrm{sim}(D_i, D_j) =
\frac{\sum_{k=1}^{t} w_{ik} w_{jk}}
     {\sqrt{\sum_{k=1}^{t} w_{ik}^{2} \cdot \sum_{k=1}^{t} w_{jk}^{2}}}
\]
The greater the value of $\mathrm{sim}(D_i, D_j)$, the greater
the similarity between $D_i$ and $D_j$. For each prepared
category profile, the system computes the similarity and
assigns the test document to the most similar category^1.
Figure 3: categorization process. A hard-copy document is
scanned into a test document image (bitmap); word shape
token generation (or OCR) produces word shape tokens (or
ASCII encoded words); profile generation produces a
document profile; similarity measurement against the
category profiles (training data) gives the category
assignment.
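The profile generation and similarity measurement steps of this
section can be sketched as follows. This is a minimal
illustration under stated assumptions, not the implementation
used in the experiments: the function names are hypothetical,
punctuation is assumed to have been removed already (Stage 1),
a category profile is assumed to be built by pooling the tokens
of its training documents, and the toy token lists are invented.

```python
import math
from collections import Counter


def make_profile(tokens, stop_tokens=frozenset()):
    """Stages 2 and 3: drop stop tokens, then use relative frequency as weight."""
    counts = Counter(t for t in tokens if t not in stop_tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def category_profile(documents, stop_tokens=frozenset()):
    """Profile of a category; pooling the training documents is an assumption."""
    pooled = [t for doc in documents for t in doc]
    return make_profile(pooled, stop_tokens)


def cosine(p, q):
    """Cosine measure between two sparse weight vectors stored as dicts."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values()) *
                     sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def categorize(doc_profile, category_profiles):
    """Assign the document to the single most similar category."""
    return max(category_profiles,
               key=lambda name: cosine(doc_profile, category_profiles[name]))


if __name__ == "__main__":
    # Invented training data, already converted to word shape tokens.
    training = {
        "golf": [["gxAA", "exxxxe"], ["gxAA", "Aee"]],
        "stock": [["xAxeA", "xxxAeA"], ["xAxeA", "gxiee"]],
    }
    profiles = {name: category_profile(docs) for name, docs in training.items()}
    test_doc = make_profile(["xAxeA", "gxiee", "gxAA"])
    print(categorize(test_doc, profiles))
```

For the OCR-based approach the same code applies with words in
place of word shape tokens.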
4 Performance Assessment 
We have constructed a document image database to 
compare our categorization approach with the con- 
ventional OCR-based approach. First, we carefully 
chose ten topic categories with strong boundaries. In 
general, the accuracy of an automated categorization
system is evaluated against expert judgements.
However, experts do not always agree in their
judgements. For an unbiased comparative experiment
between the two approaches, we chose relatively
specific topics. The resulting topic categories are
affirmative action, Internet, stock market, local traffic,
Presidential race, Athletics (MLB), Giants (MLB),
PGA golf, Tokyo subway attack, and food recipe.
Second, we manually collected the body portion of 50
newswire articles for each category, 500 documents
in total. Each was clearly relevant to a single category
and much less relevant to the other categories.
Third, we printed them using a 300-dpi laser printer,
and made nth generation photo-copies from them to
degrade the image quality. In the photo-copy process,
documents were degraded due to spreading
toner, low print contrast, paper placement angle,
paper flaws, and so on. Finally, we scanned the hard-copy
documents of the first, the third, and the fifth
generation with a 300-dpi scanner. As a result, we
obtained 500 topic-tagged document images for each
of the nth generation photo-copies (n = 1, 3, 5). Figure 4
shows scanned image samples. The average size of
the original documents was 647 words, ranging from 63
to 2,860 words. The standard deviation was 377 words.
1. In this paper, documents are always
assigned to a single category.
Figure 4: scanned image samples from nth
generation photo-copies (n = 1, 3, 5), each showing the
line "There are many different languages in common"
We transformed the document images into word
shape tokens and ASCII encoded words. We
randomly took 30 images from each category (300 in
total) as training data to generate category profiles,
and tested on the remaining 20 images (200 in total).
We used ScanWorX OCR (Xerox Imaging Systems)^1
for the ASCII encoding.
Table 2 shows the processing time for the transformation
of all images on a SPARCstation 10 (Sun
Microsystems). Although it had not been optimized,
word shape token generation was 8 to 52 times faster
than OCR. The difference increased with
n (n = 1, 3, 5). The OCR speed was highly dependent
on image quality. Its word recognition
accuracy was also affected by image quality: 96.3%,
92.8%, and 80.7% for the first, the third, and the fifth
generation copies, respectively. It is well understood
that OCR is slower and generates numerous errors for
lower quality images (Taghva, et al., 1994). On the
other hand, word shape token generation was a little
faster for lower quality images. This unfavorable
result was mainly caused by the lack of a character
segmentation function. Some characters touched each
other in lower quality images, and were treated as a
single character in the process of word shape token
generation. Consequently, the number of characters
to process became smaller.
1. This is one of the state-of-the-art OCRs in
terms of speed and accuracy; see Rice, et
al., 1995.
Table 2: processing time (seconds) for word shape
token (WST) generation and OCR
       image quality (nth generation photo-copies)
       n=1      n=3      n=5
WST    1860     1814     1702
OCR    15408    32322    87986
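The "8 to 52 times faster" figure follows directly from the
times in Table 2. A short check, using only the reported
values (the variable names are illustrative):

```python
# Processing times in seconds from Table 2, keyed by photo-copy generation n.
wst_seconds = {1: 1860, 3: 1814, 5: 1702}
ocr_seconds = {1: 15408, 3: 32322, 5: 87986}

# Ratio of OCR time to word shape token generation time.
speedup = {n: ocr_seconds[n] / wst_seconds[n] for n in wst_seconds}

for n in sorted(speedup):
    print(f"n={n}: OCR took {speedup[n]:.1f} times as long as WST generation")
```

The ratios are about 8.3, 17.8 and 51.7 for n = 1, 3 and 5.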
Our system categorized the test documents in word 
shape token and ASCII format as described in the pre- 
vious section. As shown in Table 3, the accuracy of 
the word shape token-based approach for higher qual- 
ity images (n = 1, 3) was nearly equal to that of the 
OCR-based approach. For lower quality images (n = 
5), the former was significantly lower than the latter. 
Tables 4 and 5 show the accuracy of the two
approaches as a function of the size of test documents.
When images were of higher quality (n = 1, 3), there
was little correlation between the accuracy and the
size. When they were of lower quality (n = 5), the
OCR-based approach had a stronger correlation
between the accuracy and the size than the word
shape token-based approach. This can be explained
as follows. In statistical categorization, it is generally
difficult to get good accuracy when the size of the
test document is small. In the OCR-based approach
with the first and the third generation copies (n = 1, 3),
the test documents were large enough for this
categorization task. When the OCR encountered the fifth
generation copies (n = 5), it garbled many words.
Most of them were transformed into ill-formed
(unknown) words^2 rather than mistaken for other
words. These ill-formed words were ignored in our
similarity measurement. Thus, they did not act as a
negative factor, but virtually made the size of the test
document smaller. On the other hand, in the word
shape token-based approach with the first and the
third generation copies (n = 1, 3), the test documents
were similarly large enough. When it encountered the
fifth generation copies (n = 5), it also garbled many
words. This time, however, they were mistaken for other
word shape tokens (e.g., many -> xxAg in Fig. 2), and
acted as a negative factor that reduced the accuracy.
2. ScanWorX outputs a word with a reject 
mark when it is unable to recognize or is 
unsure in recognition (e.g., meterii~g). 
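The contrast drawn above between rejected words and misread
tokens can be illustrated with the cosine measure on invented
counts. This is an idealized sketch: it assumes rejected words
are simply dropped and that the loss is spread evenly over all
terms, and the term names and counts are hypothetical. Because
the cosine measure is scale-invariant, raw counts stand in for
relative frequencies here.

```python
import math
from collections import Counter


def cosine(p, q):
    """Cosine measure between two sparse count vectors."""
    dot = sum(w * q.get(t, 0) for t, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values()) *
                     sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


# Hypothetical category profile and a clean document on the same topic.
category = Counter({"t1": 8, "t2": 6, "t3": 4, "t4": 2})
clean = Counter({"t1": 8, "t2": 6, "t3": 4, "t4": 2})

# OCR-like behaviour: half of the occurrences are rejected and ignored,
# so the document shrinks but keeps the same direction.
rejected = Counter({t: c // 2 for t, c in clean.items()})

# WST-like behaviour: the same occurrences are instead misread as other
# tokens, which adds weight on terms the category does not contain.
substituted = rejected + Counter({"wrong": 10})

if __name__ == "__main__":
    print(f"clean:       {cosine(category, clean):.3f}")
    print(f"rejected:    {cosine(category, rejected):.3f}")
    print(f"substituted: {cosine(category, substituted):.3f}")
```

In this toy setting the similarity is unchanged by rejection and
roughly halved by substitution. In practice, rejection still
hurts short documents because fewer terms remain, which is
consistent with the size dependence seen for OCR at n = 5.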
Table 3: categorization accuracy for the word
shape token-based (WST) and the OCR-based
approach (number of correctly assigned
documents / number of test documents)
       image quality (nth generation photo-copies)
       n=1             n=3             n=5
WST    193/200 (97%)   192/200 (96%)   154/200 (77%)
OCR    196/200 (98%)   196/200 (98%)   189/200 (95%)
Table 4: accuracy of the word shape token-based
categorization as a function of the size of test
documents
        size of test documents (number of words)
        0 - 400         400 - 800       800 -
n = 1   50/51 (98%)     84/86 (98%)     59/63 (94%)
n = 3   51/51 (100%)    81/86 (94%)     60/63 (95%)
n = 5   39/51 (76%)     62/86 (72%)     53/63 (84%)
Table 5: accuracy of the OCR-based
categorization as a function of the size of test
documents
        size of test documents (number of words)
        0 - 400         400 - 800       800 -
n = 1   50/51 (98%)     85/86 (99%)     61/63 (97%)
n = 3   50/51 (98%)     85/86 (99%)     61/63 (97%)
n = 5   44/51 (86%)     84/86 (98%)     61/63 (97%)
5 Discussion 
From the experimental results in the previous section, 
our hypothesis that the word shape token-based approach
is quite adequate for content-oriented categorization 
was strongly supported at least for the document 
images from first and third generation photo-copies. 
This means that the mapping ambiguity between word 
shape tokens and original words was acceptable for 
the categorization purpose. The accuracy drop 
observed with the fifth generation photo-copies was 
not due to the mapping ambiguity but was caused by 
recognition errors. Unlike OCR, which attempts to
correctly recognize each word using lexical information,
word shape token generation is only faithful to
the original image. Thus, it makes many errors with
low quality images, whereas OCR flags illegible
characters. For categorization, signaling uncertainty is
better than incorrect recognition. It would be possible
to utilize lexical information in word shape token
representation to reduce errors. However, we must
pay attention to its computational expense.
Although it is arguable whether word stemming
algorithms contribute to improving the categorization
accuracy (Riloff, 1995), we would like to develop a
stemming algorithm for the word shape token
representation. It would
be of use for other information retrieval applications
such as word-spotting. We feel the word shape token
representation is sufficient for locating some suffixes
accurately. For example, 1,651 words had the
suffix "-tion" in the lexicon of 122,545 distinct word
entries. We obtained a set of word shape tokens from
them. The set mapped to only 25 words without the
suffix^1. Similarly, word shape tokens from all 8,077
words with the suffix "-ing" mapped to only 20 words
without the suffix^2.
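Suffix spotting in the shape domain amounts to testing whether a
token ends with the shape of the suffix. The sketch below is
illustrative only: the translation table assumes the Table 1
membership lists as we read them, the function names are
hypothetical, and the word list is a small hand-picked sample
that includes the false matches given in the footnotes.

```python
import string

# Compact Table 1 mapping (assumed reading); n, i and j map to themselves.
_SOURCE = (string.ascii_uppercase + "bdfhklt" + string.digits + "#$&@"
           + "amorsuvwxz" + "ce" + "gpqy")
_TARGET = "A" * 47 + "x" * 10 + "e" * 2 + "g" * 4
_TABLE = str.maketrans(_SOURCE, _TARGET)


def to_shape_token(word):
    """Convert a word to its word shape token."""
    return word.translate(_TABLE)


def token_has_suffix(token, suffix):
    """True if the token ends with the shape of the given suffix."""
    return token.endswith(to_shape_token(suffix))


def false_matches(words, suffix):
    """Words whose tokens look like they carry the suffix although they do not."""
    return [w for w in words
            if token_has_suffix(to_shape_token(w), suffix)
            and not w.endswith(suffix)]


if __name__ == "__main__":
    sample = ["station", "fashion", "comedian", "running", "destiny", "tiny"]
    print(to_shape_token("tion"), to_shape_token("ing"))
    print(false_matches(sample, "tion"))
    print(false_matches(sample, "ing"))
```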
Because all capital letters map to A (Table 1), it is 
difficult to identify words with only capital letters, 
which are sometimes important content-representing 
words (e.g., acronyms). We need to find a graphical 
feature to distinguish some capital letters from others, 
considering the complexity of image analysis. 
When we extend word shape token processing
to other applications, it is important to note that the
word shape token representation is meaningful only
to the computer and is hardly human-friendly. Thus, it
should be used in unsupervised systems that require no
human interaction. Our technique would be
useful for automated sorting of incoming faxes by
content. It could also serve as an automated
dictionary selector for OCR systems that use
domain-specific dictionaries.
6 Conclusion 
Several studies have suggested that OCR output is 
satisfactory for information retrieval in terms of accu- 
racy. However, OCR is a major bottleneck for infor- 
mation retrieval systems in terms of speed. 
We have described a technique to generate word 
shape tokens from document images, and have shown 
that this shape-based representation can be generated 
much faster than current OCR technology. Further, 
we have shown how word shape token processing can 
be applied to content-oriented categorization. In spite 
of the mapping ambiguity between word shape tokens 
and words, we have shown that the word shape token-based
approach can categorize document images of
good quality with nearly the same accuracy as the
conventional OCR-based approach. When images are
of poor quality, the accuracy drops significantly due to
misrecognition of word shape tokens, whereas
OCR flags illegible characters rather than
making errors.
1. e.g., AxxAixn-fashion, exxeAixn-comedian
2. e.g., AexAing-destiny, Aing-tiny
Acknowledgments 
We would like to thank Dan Kuokka for his comments,
Ron Maim for his programming assistance,
and Arlene Holloway for constructing our
document image database.
References 
Kenneth W. Church, William A. Gale, Jonathan I.
Helfman, and David D. Lewis. 1994. Fax: an
alternative to SGML. In Proceedings of the 15th
International Conference on Computational
Linguistics, pages 525-529, Kyoto, Japan.
Tin Kam Ho and Henry Baird. 1994. Asymptotic
accuracy of two-class discrimination. In
Proceedings of the Third Annual Symposium on
Document Analysis and Information Retrieval, pages
275-288, Las Vegas, Nevada.
David J. Ittner, David D. Lewis, and David D. Ahn.
1995. Text categorization of low quality images.
In Proceedings of the Fourth Annual Symposium
on Document Analysis and Information
Retrieval, pages 301-315, Las Vegas, Nevada.
Elke Mittendorf, Peter Schauble, and Paraic Sheridan.
1995. Applying probabilistic term weighting to
OCR text in the case of a large alphabetic library
catalogue. In Proceedings of the 18th Annual
International ACM SIGIR Conference on
Research and Development in Information
Retrieval, pages 328-335, Seattle, Washington.
Gregory K. Myers and Prasanna G. Mulgaonkar.
1995. Automatic extraction of information from
printed documents. In Proceedings of the Fourth
Annual Symposium on Document Analysis and
Information Retrieval, pages 81-88, Las Vegas,
Nevada.
Takehiro Nakayama and A. Lawrence Spitz. 1993.
European language determination from image. In
Proceedings of the Second International
Conference on Document Analysis and Recognition,
pages 159-162, Tsukuba Science City, Japan.
Takehiro Nakayama. 1994. Modeling content
identification from document images. In Proceedings
of the Fourth ACL Conference on Applied Natural
Language Processing, pages 22-27, Stuttgart,
Germany.
Takehiro Nakayama. 1995. Text categorization using
word shape tokens. In Proceedings of the Second
Conference of the Pacific Association for
Computational Linguistics, pages 207-217, Brisbane,
Australia.
Stephen V. Rice, Frank R. Jenkins, and Thomas A.
Nartker. 1995. The fourth annual test of OCR
accuracy. Information Science Research Institute
1993 Annual Research Report, University of
Nevada, Las Vegas, pages 11-50.
Ellen Riloff. 1995. Little words can make a big
difference for text classification. In Proceedings of
the 18th Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval, pages 130-136, Seattle,
Washington.
Gerard Salton. 1991. Developments in automatic text
retrieval. Science, Vol. 253, pages 974-980.
Penelope Sibun and David S. Farrar. 1994. Content
characterization using word shape tokens. In
Proceedings of the 15th International Conference on
Computational Linguistics, pages 686-690,
Kyoto, Japan.
Penelope Sibun and A. Lawrence Spitz. 1994.
Language determination: natural language processing
from scanned document images. In Proceedings
of the Fourth ACL Conference on Applied Natural
Language Processing, pages 15-21, Stuttgart,
Germany.
Kazem Taghva, Julie Borsack, Allen Condit, and
Srinivas Erva. 1994. The effects of noisy data on
text retrieval. Journal of the American Society
for Information Science, 45, pages 50-58.
Claudia Wenzel and Rainer Hoch. 1995. Text
categorization of scanned documents applying a
rule-based approach. In Proceedings of the Fourth
Annual Symposium on Document Analysis and
Information Retrieval, pages 333-346, Las Vegas,
Nevada.
Abdelwahab Zramdini and Rolf Ingold. 1993.
Optical font recognition from projection profiles.
Electronic Publishing, Vol. 6(3), pages 249-260.
