CONTENT CHARACTERIZATION USING WORD SHAPE TOKENS 
Penelope Sibun and David S. Farrar 
Fuji Xerox Palo Alto Laboratory, 3400 Hillview Avenue, Palo Alto, CA 94304 
sibun@pal.xerox.com, farrar@pal.xerox.com 
Abstract 
By quickly classifying character images into character
shape categories, it is possible to automatically extract
syntactic information from the text of document images
without optical character recognition. Using word shape
tokens composed of these character shape codes, a properly
trained text tagger can extract part-of-speech information
from scanned document images. Later components of a
document processing system can then use this information
to locate topics, characterize document style, and assist in
information retrieval.
1 INTRODUCTION
There are many text processing tasks that we would
like to accomplish, such as document classification, text
database structuring, matching documents with queries,
and topic characterization. The field of computational
linguistics has developed a variety of techniques for
accomplishing these tasks for text documents represented
by character codes (e.g., ASCII). However, many
documents for which we would like to use our automated
techniques are not stored online in character-coded format,
but instead exist only on paper. Optical character
recognition (OCR) is a technique for converting scanned
document images into character codes. By using OCR,
document images can be converted into a form amenable
to existing text processing techniques. However, OCR is
expensive, slow, and often inaccurate. Because of these
drawbacks, we would like to avoid OCR if we can, or at
the least, postpone using OCR until we are confident that
a document warrants detailed processing. In other words,
we would like a high-bandwidth document processing
system that is sensitive enough to detect desired document
features.
Our document understanding goals at the Fuji Xerox
Palo Alto Laboratory include language determination
(Nakayama and Spitz, 1993; Sibun and Spitz,
forthcoming), content characterization, and style
characterization. Toward these goals, we are developing a
set of methods for extracting information from document
images which do not depend on OCR. We have been
working toward our goal of inexpensive content
characterization by adapting a part-of-speech tagger to
process word shape tokens rather than character coded
words. Part-of-speech tagging is a technique that has been
developed and refined over the past several years, and it
provides an inexpensive, fast, and reliable source of
information for recognizing noun phrases and other
syntax-related text features which help characterize a
document's content.
In this paper, we describe how we combine our
technology for determining word shape tokens with
text-tagging technology. We are developing systems that can
extract noun phrases and other content characteristics
using only word shape tokens that have been tagged with
their parts of speech. Using this approach, we can process
document images quickly to determine whether OCR is
warranted, for example, when a text is a likely match for
keywords in a database query.
In the next two sections, we describe how word shape
tokens are derived; in section four, we discuss
part-of-speech tagging; in the following four sections, we
describe in detail part-of-speech tagging using word shape
tokens; in sections nine and ten we discuss our results.
2 WORD SHAPE TOKEN CREATION 
In this section we briefly describe our system that 
constructs character shape codes and word shape tokens 
from a document image (for more detail, see Nakayama
and Spitz, 1993; Sibun and Spitz, forthcoming). To
recognize character shape codes from an image, some
transformations are first made to correct for various
scanning artifacts such as skew angle and text line
curvature. On each text line, four horizontal lines define
three significant zones: the area between the baseline and
the top of characters such as "x" is the x-zone; the area
above the x-height level is the ascender zone; the area
below the x-zone is the descender zone (figure 1). The
text line is further divided into character cells by vertical
boundaries which delineate the connected components of
each character image.
Figure 1: The text line parameter positions (top, x-height,
baseline, bottom).
The majority of characters can easily be mapped to a
small number of distinct codes (figure 2).1 Characters
which are contained entirely in the x-zone map to shape
code x; characters which extend from the baseline to above
the x-height line map to shape code A; and those which
extend from below the baseline to the x-height line map
to shape code g. Characters which map to A, x, or g are
composed of a single connected component. Some
characters contain more than one connected component:
an x-height character with a single diacritical mark in the
ascender zone maps to i; a character with a descender and a
single diacritical mark maps to j. Most common
punctuation marks map to unique shape codes; however,
some are mapped into shape codes shared with alphabetic
characters (e.g., "&" maps to shape code A).
1 If this mapping can be done from document images, it can
more trivially be accomplished from character coded
documents, such as ASCII text (providing, of course, that the
method of encoding is known).
Shape Code   Characters
A            A-Z b d f h k l t 0-9 # $ & / ( ) [ ]
x            a c e m n o r s u v w x z
i            i, and a e i o u n carrying a diacritical mark
g            g p q y ç
j            j
Figure 2: Character shape codes.
3 SHAPE CONVERSION
In general, our approach to document processing
finesses the problems inherent in mapping from an image
to a character coded representation: we map instead from
the image to a shape-based representation. This technique
can transform even a degraded document image into a
representation which provides useful abstractions about the
text of a document. The shape-based representation that
we construct is proving to be a remarkably rich source of
information. While our initial goal has been to use it for
language identification in support of downstream OCR
processes, we are finding that this representation may be a
sufficient source of information for document content
characterization, such as that supported by part-of-speech
tagging.
In our tagging work, we have used character shape
coded text derived from normal character-coded text. This
is simply because we do not have access to enough image
documents on which to train a tagger. We call the process
of creating a shape-based version of the document from the
character code based version shape conversion.
For the purpose of text tagging, then, we can think of
the word shape token representation as an approximation
of the representation composed of words. We can think
about the relationship between words and word shape
tokens as a mapping from a word to its corresponding
word shape token. For example, the word "apple" maps to
the word shape token xggAx, and the word "apples" maps
to the word shape token xggAxx.
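The word-to-token mapping can be illustrated with a minimal sketch. The
character classes below follow the prose description of the shape codes for
unaccented letters and digits only; the names ASCENDER, X_ZONE, DESCENDER,
shape_code, and shape_convert are assumptions introduced for this example and
are not part of the system described in this paper.

```python
# Assumed character classes for unaccented text (illustrative only).
ASCENDER = set("ABCDEFGHIJKLMNOPQRSTUVWXYZbdfhklt0123456789#$&/")
X_ZONE = set("acemnorsuvwxz")
DESCENDER = set("gpqy")


def shape_code(ch):
    """Map one character to its character shape code."""
    if ch in ASCENDER:
        return "A"
    if ch in X_ZONE:
        return "x"
    if ch in DESCENDER:
        return "g"
    if ch in ("i", "j"):
        return ch  # x-height or descender character with a dot above
    return ch      # assumption: any other character keeps its own code


def shape_convert(word):
    """Map a surface form to its word shape token."""
    return "".join(shape_code(ch) for ch in word)


print(shape_convert("apple"), shape_convert("apples"))
```

Under these assumptions, distinct words such as "cat" and "rat" receive the
same token, which is the source of the extra ambiguity discussed in later
sections.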
In documents, words exist as surface forms, not as
morphological systems; thus "apple" and "apples" are
different words. Therefore, it is of no use to us to have a
lexicon organized in terms of stems and suffixes; instead,
our lexicon is composed of surface forms like "apple" and
"apples". Throughout the rest of this paper, when we say
"words", we mean words as surface forms.
4 PART-OF-SPEECH TAGGING
A part-of-speech tagger is a system that uses context
to assign parts of speech to words. Part-of-speech
information facilitates higher-level analysis, such as
recognizing noun phrases and other patterns in text.
Several different approaches have been used for building
text taggers. A particular form of Markov model has been
widely used that assumes that a word depends
probabilistically on just its part-of-speech category, which
in turn depends solely on the categories of the preceding
two words. Training the model is sometimes done by
means of a large tagged corpus, but this is not necessary.
The Baum-Welch algorithm (Baum, 1972), also known as
the Forward-Backward algorithm, can be used. In this
case, the model is called a hidden Markov model (HMM),
since state transitions (i.e., part-of-speech categories) are
assumed to be unobservable.
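The independence assumptions just described can be written compactly. The
notation is introduced here for illustration only: w_i is the i-th word (or
word shape token) of a sentence of length n, t_i is its part-of-speech
category, and t_0 and t_{-1} are assumed sentence-start markers.

```latex
P(w_1,\dots,w_n,\; t_1,\dots,t_n)
  \;=\; \prod_{i=1}^{n} P(w_i \mid t_i)\; P(t_i \mid t_{i-2},\, t_{i-1})
```

Tagging then amounts to choosing the category sequence t_1, ..., t_n that
maximizes this product for the observed words.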
For this work, we use an HMM-based text tagger that
is publicly available from Xerox PARC. As described in
Cutting et al. (1992), the PARC tagger is efficient and
highly flexible. It is particularly important that the tagger
can be trained on any corpus of text, using any lexicon.
This flexibility allows us to shape-convert our training
corpus and lexicon, as described in section 5, without
needing to modify the tagger itself. Below we outline the
basic operation of the PARC tagger; please refer to
Cutting et al. (1992) for further detail.
1. Text destined for the tagger first encounters a
tokenizer, whose duty is to convert text (a sequence of
characters) into a sequence of tokens. Each sentence
boundary is also identified by the tokenizer, and is passed
as a special token.
2. The tokenizer passes tokens to the lexicon, where
tokens are matched with a set of surface forms, each
annotated with a part-of-speech tag. The set of tags
constitutes an ambiguity class. The lexicon passes along a
stream of (surface form, ambiguity class) pairs.
3a. In training mode, the tagger takes long sequences
of ambiguity classes as input. It uses the Baum-Welch
algorithm to produce a trained HMM, which is used as
input in tagging mode. Training is performed on some
corpus of interest; this corpus may be of broad coverage or
may be genre specific.
3b. In tagging mode, the tagger buffers sequences of
ambiguity classes between sentence boundaries. These
sequences are disambiguated by computing the maximal
path through the HMM with the Viterbi algorithm
(Viterbi, 1967). Operating at sentence granularity does not
sacrifice accuracy, since sentence boundaries are unambiguous.
Output consists of pairs of surface forms and tags.
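Step 3b can be sketched as a small Viterbi search over one sentence of
ambiguity classes. This is an illustrative simplification, not the PARC
tagger: it uses a first-order model (each category depends only on the
previous one), the probability tables are invented toy values, and the names
FLOOR, log_p, and viterbi are assumptions made for this example.

```python
import math

FLOOR = 1e-6  # assumed probability for events missing from the toy tables


def log_p(table, key):
    """Log probability of key, falling back to FLOOR when unseen."""
    return math.log(table.get(key, FLOOR))


def viterbi(classes, start, trans, emit):
    """Most probable tag sequence for one sentence of ambiguity classes."""
    # best[tag] = (log probability, path) of the best path ending in tag
    best = {
        tag: (log_p(start, tag) + log_p(emit, (tag, classes[0])), [tag])
        for tag in classes[0]
    }
    for cls in classes[1:]:
        step = {}
        for tag in cls:  # only tags in the ambiguity class are candidates
            score, path = max(
                (prev_score + log_p(trans, (prev, tag)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score + log_p(emit, (tag, cls)), path + [tag])
        best = step
    return max(best.values())[1]


# Toy sentence: three tokens with their ambiguity classes.
sentence = [("det",), ("noun", "verb"), ("verb",)]
start = {"det": 0.6, "noun": 0.3, "verb": 0.1}
trans = {("det", "noun"): 0.9, ("det", "verb"): 0.1,
         ("noun", "verb"): 0.8, ("verb", "verb"): 0.1}
tags = viterbi(sentence, start, trans, emit={})  # emissions left uniform
print(tags)  # ['det', 'noun', 'verb']
```

Restricting the candidate tags at each position to the token's ambiguity
class keeps the search small, which is one reason sentence-at-a-time
decoding is inexpensive.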
5 THE LEXICON 
The word shape tagging in our work follows the
HMM-based process described above. Both word shape
tagging and standard word tagging require a lexicon.
5.1 Constructing the Lexicon
A word shape lexicon can be derived from a standard 
lexicon of words. The lexicon used with the standard text 
tagger contains a list of all the distinct surface forms 
likely to be encountered in the language. Associated with
each surface form is a list of the possible parts of speech
that the surface form can have. For example:
apple noun
apples plural noun
eat verb 
eats third person singular verb 
t~l noun, adjective 
the determiner
Once we have a lexicon which consists of surface forms,
we can use it to create a lexicon of word shape tokens for
word shape tagging. In particular, this transformation
consists of the following steps:
1. Shape convert the surface forms to their
corresponding word shape tokens.
2. Sort the lexicon by surface form word shape. At
this stage there may be duplicate word shape tokens.
3. Eliminate duplicate entries in the lexicon: collect
all parts of speech behind one word shape token (combine
their ambiguity classes). At this stage each word shape
token should be unique.
4. Eliminate duplicate parts of speech behind each
word shape token. At this stage each part of speech
should be unique within each ambiguity class.
The lexicon fragment above would be converted to: 
xggAx noun
xggAxx plural noun
xxA verb, noun, adjective
xxAx third person singular verb
AAx determiner
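The four conversion steps can be sketched in a few lines. The sketch is an
assumption-laden illustration rather than the procedure actually used: the
function names are invented, shape_convert handles only unaccented letters
and digits, and the entry "red" is a hypothetical stand-in for a noun and
adjective entry whose shape is xxA.

```python
from collections import defaultdict


def shape_convert(word):
    """Simplified shape conversion for plain letters and digits."""
    codes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in "bdfhklt":
            codes.append("A")
        elif ch in "gpqy":
            codes.append("g")
        elif ch in "ij":
            codes.append(ch)
        else:
            codes.append("x")
    return "".join(codes)


def convert_lexicon(lexicon):
    """Build a word shape lexicon from a surface form lexicon."""
    merged = defaultdict(set)
    for surface_form, tags in lexicon.items():
        # Steps 1, 3, 4: convert, merge entries, drop duplicate tags.
        merged[shape_convert(surface_form)].update(tags)
    # Step 2: order the entries by word shape token.
    return {shape: sorted(tags) for shape, tags in sorted(merged.items())}


standard_lexicon = {
    "apple": ["noun"],
    "apples": ["plural noun"],
    "eat": ["verb"],
    "eats": ["third person singular verb"],
    "red": ["noun", "adjective"],  # hypothetical entry
    "the": ["determiner"],
}
word_shape_lexicon = convert_lexicon(standard_lexicon)
for shape, tags in word_shape_lexicon.items():
    print(shape, ", ".join(tags))
```

Using a set per word shape token makes steps 3 and 4 fall out of a single
pass, so an explicit sort-then-merge is not needed in this sketch.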
5.2 Analysis of the Lexicon 
For this work, we use a lexicon provided by Xerox
PARC. This lexicon is organized so that there is an entry
for each of roughly 150,000 surface forms. For word
shape tagging, we shape converted this lexicon. As can be
seen in the table, shape conversion results in about 50,000
distinct word shape surface forms. This suggests that, on
average, each word shape token is a mapping of three
surface forms. However, about 30,000 of the word shape
tokens are unique, that is, correspond to a single surface
form.
Surface Forms              Count     % Total
Standard Lexicon           148,703   100.0
Shape-converted Lexicon     47,102    31.7
Shape-converted Unique      28,949    19.5
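The ratios behind the table can be checked with a short calculation. This is
only an arithmetic sketch; it takes the shape-converted count to be 47,102,
and the variable names are introduced here for illustration.

```python
standard = 148_703         # surface forms in the standard lexicon
shape_converted = 47_102   # distinct word shape tokens after conversion
unique = 28_949            # tokens that correspond to one surface form

# Size of the word shape lexicon relative to the standard lexicon.
pct_of_standard = 100 * shape_converted / standard
# Average number of surface forms behind each word shape token.
forms_per_token = standard / shape_converted
# Unique tokens as a share of the standard lexicon and of all tokens.
unique_pct_of_standard = 100 * unique / standard
unique_pct_of_tokens = 100 * unique / shape_converted

print(round(pct_of_standard, 1), round(forms_per_token, 2))
print(round(unique_pct_of_standard, 1), round(unique_pct_of_tokens, 1))
```

The last figure shows that the unique tokens, while about a fifth of the
standard lexicon, make up well over half of the word shape lexicon itself.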
Thus, the word shape lexicon is approximately one-third
the size of the standard lexicon. Clearly, information
has been lost, but not as much as one might think. In
fact, the unique word shape tokens, which correspond to
about 20% of the surface forms in the standard lexicon,
carry exactly as much information as their corresponding
character-coded words. While some surface forms that map
to unique word shape tokens are long and infrequent (like
"flibbertigibbet", AAiAAxxAigiAAxA), many are
short, common words:
apple xggAx
apples xggAxx
AAigA 
thirsty AAixxAg
lifelike AiAxAiAx 
gxAxxg 
gxgAxg 
gxgAxgx 
While word shape tokens that are unique have the same
parts of speech as their corresponding surface forms, the
others will tend on average to have many more parts of
speech than an average surface form. This depends
somewhat on the tagset (see section 6). In general, word
shape tokens frequently have as many as 10 to 15 parts of
speech, whereas standard surface forms rarely have more
than 4 or 5.
6 DEVISING THE TAGSET 
The tagset is implicit in the lexicon: it includes all
parts of speech listed in any entry of the lexicon; it also
includes a small set of tags for punctuation, such as
comma, hyphen, and sentence boundary. Although the
tagset is not explicitly defined, we can modify it by
mapping from selected tags found in the lexicon to other
tags of our choosing. For example, the lexicon
distinguishes between verb tenses and has separate tags for
different combinations of verb tense, person, and number:
present tense verb, past tense verb, third person singular
present verb, etc. If we preferred, we could map all these
different verb forms to a single verb tag. However, we
typically prefer to maintain such distinctions, as the text
tagger can take advantage of differences in the surface
forms of verbs with different tenses in order to uniquely
identify their parts of speech.
Shape conversion collapses different surface forms onto
one word shape and merges their ambiguity classes. The
result is that there tend to be fewer distinct surface forms,
and that each surface form has, on average, a larger
ambiguity class. If this ambiguity is problematic, one
way to reduce it may be to reduce the size of the tagset.
For example, we may choose to have one undifferentiated
verb tag rather than a set which differentiates tense,
person, and number. With fewer possible parts of speech
to choose from, the HMM may find the part-of-speech
selection more constrained. This in turn may improve its
accuracy at selecting one of the tags that are available.
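Reducing the tagset by mapping tags can be sketched as follows. The mapping
table and the names TAG_MAP and collapse are assumptions made for this
example; the tag names are taken from the verb distinctions mentioned above.

```python
# Hypothetical mapping from fine-grained verb tags to one verb tag.
TAG_MAP = {
    "present tense verb": "verb",
    "past tense verb": "verb",
    "third person singular present verb": "verb",
}


def collapse(ambiguity_class, tag_map=TAG_MAP):
    """Apply the tag mapping and remove the duplicates it creates."""
    return sorted({tag_map.get(tag, tag) for tag in ambiguity_class})


fine = ["noun", "past tense verb", "present tense verb",
        "third person singular present verb"]
coarse = collapse(fine)
print(len(fine), len(coarse), coarse)
```

In this toy case an ambiguity class of four tags shrinks to two, at the cost
of no longer distinguishing verb tense, person, and number.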
The uninteresting case, of course, is where every word
shape has the same tag, that is, a tagset of one. This
situation yields no useful syntactic information from the
document. Since the use of word shape tokens does reduce
the amount of information that is available to the tagger,
it may reduce the number of different tags it can accurately
assign. The proper size of the tagset becomes constrained
on one hand by the amount of syntactic information we
wish to extract (more information with a larger tagset) and
on the other by the size of the ambiguity classes of the
word shape tokens (more ambiguity with a larger tagset).
Its proper size is thus an empirical question. For our tests
we used tagsets with approximately 30 parts of speech.
7 THE TRAINING PROCESS
Just as the hidden Markov model for standard text
tagging requires a large corpus of text to train on, the word
shape HMM requires a large corpus of text that has been
converted to word shape tokens. We used at least 3.5
megabytes of ASCII text for our standard text tagger's
corpus; we then shape converted this text to create the
corpus for the word shape tagger. This corpus consisted of
a variety of different writing styles (from colloquial to
professional) and difficulty levels (from casual to erudite).
Examples include essays by humorists, proposals for new
government policies, and classic works of literature.
8 THE TAGGING PROCESS
With the word shape lexicon in place and an adequately
trained HMM, word shape tagging works just as standard
text tagging does. In particular, word shape tagging
consists of the following steps:
1. A stream of text is tokenized into a stream of word
shape tokens segmented into sentences.
2. The shape-converted lexicon assigns an ambiguity
class to each word shape token. The result is a stream of
sentences composed of (word shape token, ambiguity
class) pairs.
3. The tagger uses the trained hidden Markov model to
compute the highest probability part of speech for each
word shape token in a sentence. The result is a stream of
(word shape token, part of speech) pairs, which are
grouped according to sentence boundaries.
We can then use the resulting parts of speech to inform
other segments of a document understanding system. The
word shape part-of-speech tagger thus accepts word shape
tokens grouped by sentence boundaries; within those
boundaries, it assigns the most likely part of speech to
each word shape token.
9 RESULTS
In this section, we introduce a tool which can
recognize noun phrases in sentences, and we use this tool
to compare the performance of the standard tagger and the
word shape tagger. We exemplify the comparison with
two texts: one on which the standard tagger performs very
well, and one on which it does relatively poorly. While
the word shape tagger does less well in each case, its
behavior tracks that of the standard tagger, exhibiting
similar successes and failures. For the particular task of
finding simple noun phrases, the word shape tagger's
performance is less than that of the standard tagger's, but a
large fraction of the noun phrases still are found.
We have a system that can recognize simple noun
phrases when given as input the sequence of tags for a
sentence. Each of these phrases comprises a contiguous
sequence of tags that satisfies a simple grammar. For
example, a noun phrase can be simply a pronoun tag or an
arbitrary sequence of noun and adjective tags, possibly
preceded by a determiner tag and possibly with an
embedded possessive tag.2 The longest possible such
sequences are found. Conjunctions are not recognized as
part of a noun phrase, nor is prepositional phrase
attachment performed. We can be confident of finding
many simple noun phrases because the word "the" has the
unique word shape AAx.3 Recognition of noun phrases
is a first step in topic identification: the topic of a
document is likely to be indicated by its most frequent
noun phrases.
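A recognizer of this kind can be sketched with a regular expression over an
encoded tag sequence. This is one possible reading of the simple grammar
described above, not the recognizer used in this work; the tag names, the
one-letter codes, and the names CODES, NP_PATTERN, and find_noun_phrases are
assumptions made for the example.

```python
import re

# One-letter codes for the tags the grammar cares about (assumed names).
CODES = {"determiner": "D", "noun": "N", "adjective": "J",
         "possessive": "P", "pronoun": "R"}

# A pronoun alone, or an optional determiner followed by nouns and
# adjectives, optionally continued across embedded possessive tags.
NP_PATTERN = re.compile(r"R|D?[NJ]+(?:P[NJ]+)*")


def find_noun_phrases(tags):
    """Return (start, end) index pairs of the longest simple noun phrases."""
    encoded = "".join(CODES.get(tag, "O") for tag in tags)
    return [(m.start(), m.end()) for m in NP_PATTERN.finditer(encoded)]


# Tags for a sentence shaped like "the cat 's pajamas ' stripes amuse her".
tags = ["determiner", "noun", "possessive", "noun", "possessive", "noun",
        "verb", "pronoun"]
print(find_noun_phrases(tags))  # [(0, 6), (7, 8)]
```

Because the quantifiers are greedy and matches are taken left to right, each
reported span is the longest phrase starting at that position, mirroring the
longest-match behavior described in the text.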
In evaluating the tagger error rate, we use several
measures (see tables). We calculate the percentage of total
errors, the percentage of trivial errors, and the percentage
of pernicious errors (there are a few errors that do not fall
in either of the latter categories). Tagging "alarming" in
"what the advocates are finding alarming" as a present
participle rather than as an adjective is an example of a
trivial error. Pernicious errors typically involve
mistagging nouns as verbs or verbs as nouns (in English,
there are many surface forms that can be either nominal or
verbal). These latter errors cause problems in later
processing, such as detecting simple noun phrases, since
they may obscure noun phrases or introduce spurious
ones.
2 The possessive tag is used for "'s" or "'", as in "the cat's
pajamas' stripes".
3 Another English word, "tho," also maps to AAx;
fortunately, in most contexts this word is rare.
We compare the standard tagger and the word shape
tagger by counting the matches in the streams of output
tags. We do not demand strict matches, but instead allow
the tags to belong to pertinent equivalence classes. For
example, the standard tagger labels the noun "monitors" as
a plural noun, and the word shape tagger labels
xxxiAxxx simply as a noun. We consider this a match,
since a noun and a plural noun are equally well recognized
as part of a noun phrase.
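The difference between strict matching and matching under equivalence
classes can be sketched as follows. The equivalence table, the tag
sequences, and the function name match_rate are illustrative assumptions,
not the data or code used for the reported figures.

```python
def match_rate(standard_tags, shape_tags, equivalence=None):
    """Fraction of positions where the two taggers agree.

    With an equivalence mapping, tags are compared after being mapped
    to their class representative; without one, matches are exact.
    """
    equivalence = equivalence or {}
    pairs = list(zip(standard_tags, shape_tags))
    hits = sum(
        equivalence.get(a, a) == equivalence.get(b, b) for a, b in pairs
    )
    return hits / len(pairs)


# Hypothetical equivalence class: a plural noun counts as a noun.
EQUIVALENCE = {"plural noun": "noun"}

standard_output = ["determiner", "plural noun", "verb"]
shape_output = ["determiner", "noun", "noun"]

exact = match_rate(standard_output, shape_output)
loose = match_rate(standard_output, shape_output, EQUIVALENCE)
print(exact, loose)
```

Here the second position is a mismatch under exact comparison but a match
once plural nouns and nouns are treated as equivalent; the third position is
a mismatch either way.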
Almost all instances of mismatches result from the
standard tagger being right and the word shape tagger being
wrong. Very occasionally the situation is the reverse, but
this is to be expected as within the normal range of
probabilities. More interesting is the observation that
almost every pernicious error made by the standard tagger
is repeated by the word shape tagger. We take this as
confirmation of the word shape tagger's ability to
approximate the standard tagger's performance.
The first comparison of tagger performance involves an
excerpt of about 300 words from a government document. The
standard tagger's performance is better than 95% correct, or
better than 97% if trivial errors are disregarded. The word
shape tagger's performance is a 59% match of the standard
tagger's (or 51% if only exact matches are considered).
The noun phrase recognizer found 113 simple noun
phrases in the standard tagger's output and 77 (68%) of
these in the word shape tagger's output.
Table: Standard Tagger Errors
Table: Matching Output of Standard Tagger and Word Shape Tagger
             Disregarding         Including all
             Trivial Mismatches   Mismatches
Government   59%                  51%
Nonsense     47%                  38%
Table: Noun Phrases Recognized from Tagger Output
The second comparison is of tagging a piece of nonsense
verse of about 140 words. The standard tagger's performance is
89% correct, or 94% disregarding trivial errors. The word
shape tagger's performance is a 47% match (or 38%
considering only exact matches). The noun phrase
recognizer found 45 simple noun phrases in the standard
tagger's output and 17 (38%) of these in the word shape
tagger's output.
Further study is needed to determine exactly how
reliable word shape part-of-speech tagging and simple
noun phrase recognition will be in finding the topic or
topics in a document image. One means of improving
this reliability may be our technique for grammatical
function assignment which uses only the output of the
part-of-speech tagger and phrase recognizers (Sibun, 1991).
However, we can already use part-of-speech tagging and
simple noun phrase recognition as a tool for discerning
something about the content of the document by
discovering at least some of its noun phrases. Since our
document recognition technology allows us to use word
shape tokens to index directly into the document image,
we can also identify parts of the image as promising
candidates for OCR.
10 DISCUSSION
Although the word shape tagger deals with greater
ambiguity, it can still extract significant information from
a text. The increase in ambiguity is not as high as might
be expected: a large number of word shapes remain
unambiguous after the lexicon has been shape converted.
As noted above, the creation of the word shape lexicon
from the standard lexicon reduces the number of distinct
entries to approximately one-third. For example, distinct
words such as "cat" and "rat" map onto the same word
shape token xxA. Nevertheless, the complexity of
English spelling still allows a large proportion of surface
forms to be distinguished merely by their word shapes.
Several improvements on our technique remain to be
fully implemented. We do not yet have a principled way
to determine the optimal tagset for a given corpus of text.
As noted above, there is a tension between the size of the
tagset and the amount of syntactic information that is
available in the word shape tokens.
We are also investigating computationally inexpensive
ways of making finer distinctions between characters that
map to the character shape codes x and A. Initially,
parentheses and brackets were always classified as A and
distorted any word shape they were adjacent to: for
example, "(USA)" would be shape converted to AAAAA.
Recently we have made progress in recognizing these
non-alphabetic characters as word shape token delimiters, rather
than parts of the word shape tokens themselves. It may
also be useful to distinguish more alphabetic character
classes by mapping scanned character images to a larger
set of character shape codes. We can extract more useful
information by distinguishing upper case letters from
lower case letters, such as "h" and "k", which map to the
character shape code A. A larger number of character
shape codes gives us more information about the word
shape tokens, and helps to reduce ambiguity. However,
we must be careful to choose character shape features
which can be easily detected in the image and quickly
classified by a character shape code.
In keeping with Fuji Xerox's multi-lingual document
emphasis, we are also exploring ways in which this
method may be applied to other Roman-alphabet
languages, such as French, German, Dutch, and Spanish.
The technique will need to be evaluated separately for each
language, however, to better understand how each
language's typographic conventions may be reflected in its
word shape.
11 CONCLUSION
We have presented a new technique for the
understanding of English document images without optical
character recognition. By scanning and categorizing
character shapes, it is possible to extract word shapes from
the document text; these word shape tokens can then be
used as input to a tagger which determines part-of-speech
information. This part-of-speech information can then be
used to inform other document understanding techniques,
including noun phrase recognition and topic identification.
The lack of OCR means we cannot extract all of the
information contained in the scanned document's image;
nevertheless, the information from the word shape tokens
allows us to characterize the document's content with
significant accuracy, and more quickly than if we
performed OCR.
Acknowledgments 
We thank Larry Spitz and Masa Ozaki for their useful 
comments. 
References 
Baum, L. E. "An inequality and associated maximization
technique in statistical estimation for probabilistic
functions of a Markov process." Inequalities, 3: 1-8,
1972.
Cutting, Doug, Julian Kupiec, Jan Pedersen, and Penelope
Sibun. "A Practical Part-of-Speech Tagger." In
Proceedings of the Third Conference on Applied
Natural Language Processing (ACL), pp 133-140,
Trento, Italy, 1992. Also Report SSL-92-01/P92-00001,
Xerox Palo Alto Research Center, 1992.
Nakayama, Takehiro and A. L. Spitz. "European
Language Determination from Image." In Proceedings
of the Second International Conference on Document
Analysis and Recognition, pp 159-162, Tsukuba
Science City, Japan, 1993.
Sibun, Penelope. "Grammatical Function Assignment in
Unrestricted Text." Internal Report, Xerox Palo Alto
Research Center, 1991.
Sibun, Penelope and A. Lawrence Spitz. "Language
Determination: Natural Language Processing from
Scanned Document Images." Forthcoming.
Viterbi, A. J. "Error bounds for convolution codes and an
asymptotically optimal decoding algorithm." IEEE
Transactions on Information Theory, pp 260-269,
April 1967.
