Speech and Text-Image Processing in Documents 
Marcia A. Bush 
Xerox Palo Alto Research Center 
3333 Coyote Hill Road 
Palo Alto, CA 94304 
ABSTRACT 
Two themes have evolved in speech and text image processing work
at Xerox PARC that expand and redefine the role of recognition
technology in document-oriented applications. One is the
development of systems that provide functionality similar to that
of text processors but operate directly on audio and scanned image
data. A second, related theme is the use of speech and text-image
recognition to retrieve arbitrary, user-specified information from
documents with signal content. This paper discusses three research
initiatives at PARC that exemplify these themes: a text-image
editor[1], a wordspotter for voice editing and indexing[12], and
a decoding framework for scanned-document content retrieval[4].¹
The discussion focuses on key concepts embodied in the research
that enable novel signal-based document processing functionality.
1. INTRODUCTION 
Research on application of spoken language processing to
document creation and information retrieval has focused on
the use of speech as an interface to systems that operate
primarily on text-based material. Products of such work include
commercially available voice transcription devices, as well
as DARPA-sponsored systems developed to support speech
recognition and database interaction tasks (e.g., the Resource
Management[9] and Air Travel Information Systems[8] tasks,
respectively). Similarly, work on text image processing has
focused primarily on optical character recognition (OCR) as
a means of transforming paper-based documents into
manipulable (i.e., ASCII-based) electronic form. In both cases, the
paradigm is one of format conversion, in which audio or image
data are converted into symbolic representations that fully
describe the content and structure of the associated document
or query.
Over the past few years, two themes have evolved in speech
and text image processing work at Xerox PARC that expand
and redefine the role of recognition technology in
document-oriented applications. The first of these is the development
of systems that provide functionality similar to that of text
processors but operate directly on audio and scanned image
data. These systems represent alternatives to the traditional
format conversion paradigm. They are based on a principle of
partial document modeling, in which only enough signal
analysis is performed to accomplish the user's goal. The systems
are intended to facilitate authoring and editing of documents
for which the input and output media are the same. Examples
include authoring of voice mail and editing of facsimile-based
document images.
¹ Thanks to S. Bagley, P. Chou, G. Kopec and L. Wilcox for agreeing to
have their work discussed here. This work represents a sample, rather
than a full survey, of relevant speech and text-image research at PARC.
A second, and related, research theme is the use of speech 
and text-image recognition to retrieve arbitrary, user-specified 
information from documents with signal content. The focus 
is again on partial document models that are defined in only 
enough detail to satisfy task requirements. Depending on the
application, format conversion may or may not be a desired
goal. For example, in retrieving relevant portions of a lengthy 
audio document via keyword spotting, a simple time index 
is sufficient. On the other hand, extracting numerical data 
from tabular images to facilitate on-line calculations requires 
at least partial transcription into symbolic form. 
This paper discusses three research initiatives at PARC that
exemplify these themes: a text-image editor[1], a wordspotter
for voice editing and indexing[12], and a decoding framework
for scanned-document content retrieval[4]. Overviews of the
three systems are provided in Sections 2 through 4,
respectively; concluding comments are contained in Section 5. The
discussion focuses on key concepts embodied in the research
that enable novel signal-based document processing
functionality. Technical details of the individual efforts are described
in the associated references.
2. TEXT IMAGE EDITING: IMAGE EMACS
Image EMACS is an editor for scanned documents in which
the inputs and outputs are binary text images[1]. The primary
document representation in Image EMACS is a set of image
elements extracted from scanned text through simple
geometrical analysis. These elements consist of groupings of
connected components (i.e., connected regions of black pixels)[2]
that correspond roughly to character images. Editing is
performed via operations on the connected components, using
editing commands patterned after the text editor EMACS[11].
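
The paper does not describe the extraction procedure beyond citing [2], so the following is only a minimal sketch of how connected regions of black pixels might be collected from a binary image. The breadth-first labeling, the choice of 8-connectivity, and the names `connected_components` and `bounding_box` are all assumptions made for illustration, not the Image EMACS implementation.

```python
from collections import deque


def connected_components(bitmap):
    """Collect 8-connected regions of black (1) pixels in a binary image.

    `bitmap` is a list of equal-length rows of 0/1 values. The result is a
    list of components, each a sorted list of (row, col) coordinates, in the
    order their first pixel is met in a row-major scan.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(sorted(pixels))
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component, inclusive."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return (min(ys), min(xs), max(ys), max(xs))


# A dotted stem (like the letter "i") next to a solid block.
image = [
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
components = connected_components(image)
boxes = [bounding_box(p) for p in components]
```

In this toy image the dot and the stem come out as separate components, which suggests why the editor works on groupings of components rather than on single components when approximating character images.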
Image EMACS supports two classes of editing operations. 
The first class is based on viewing text as a linear sequence of
characters, defined in terms of connected components.
Traditional "cut and paste" functionality is enabled by a selection
of insertion and deletion commands (e.g., delete-character,
kill-line, yank region). As with text, editing is typically
performed in the vicinity of a cursor, and operations to adjust
cursor position are provided (e.g., forward-word, end-of-buffer).
Characters can also be inserted by normal typing. This is
accomplished by binding keyboard keys to character bitmaps
from a stored font or, alternatively, to user-selected character
images in a scanned document. Correlation-based matching
of the character image bound to a given key against
successive connected-component groupings allows for image-based
character search.
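
As a rough illustration of image-based character search, the sketch below scores a key's bitmap against successive glyph bitmaps and returns the first one that matches. The paper does not specify the correlation measure, so normalized cross-correlation of top-left-aligned binary bitmaps is an assumption here, as are the function names and the arbitrary 0.9 threshold.

```python
import math


def _pad(bitmap, rows, cols):
    """Pad a binary bitmap with white (0) pixels to rows x cols, top-left aligned."""
    padded = [list(row) + [0] * (cols - len(row)) for row in bitmap]
    padded += [[0] * cols for _ in range(rows - len(bitmap))]
    return padded


def correlation(a, b):
    """Normalized cross-correlation of two binary bitmaps, in [0, 1]."""
    rows = max(len(a), len(b))
    cols = max(len(a[0]), len(b[0]))
    a, b = _pad(a, rows, cols), _pad(b, rows, cols)
    overlap = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    black_a = sum(map(sum, a))
    black_b = sum(map(sum, b))
    if black_a == 0 or black_b == 0:
        return 0.0
    return overlap / math.sqrt(black_a * black_b)


def search_forward(template, glyphs, start=0, threshold=0.9):
    """Index of the first glyph at or after `start` that matches, else None."""
    for i in range(start, len(glyphs)):
        if correlation(template, glyphs[i]) >= threshold:
            return i
    return None


# Two toy glyph shapes standing in for character images.
GLYPH_T = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]
GLYPH_L = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 1]]

line = [GLYPH_L, GLYPH_T, GLYPH_L]
hit = search_forward(GLYPH_T, line)
```

Note that the search compares pixels only; no character label is consulted at any point, which is consistent with the label-free editing described in this section.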
The second class of operations supported by Image EMACS
is based on viewing text as a two-dimensional arrangement of
glyphs on an image plane[1]. These operations provide
typographic functionality, such as horizontal and vertical character
placement, interword spacing, vertical line spacing,
indentation, centering and line justification. Placement of adjacent
characters is accomplished using font metrics estimated for
each character directly from the image[5]. These metrics
allow for typographically acceptable character spacing,
including character overlap where appropriate.
Taken together, Image EMACS commands are intended to 
convey the impression that the user is editing a text-based 
document. In actuality, the system is manipulating image 
components rather than character codes. Moreover, while 
the user is free to assign character labels to specific image 
components, editing, including both insertion and search, is 
accomplished without explicit knowledge of character
identity. This approach enables interactive text-image editing and
reproduction, independent of font or writing system. 
Figures 1 and 2 show an example of a scanned multilingual
document before and after editing with Image EMACS. The
example demonstrates the results of image-based insertion,
deletion, substitution and justification, as well as
intermingling of text in several writing systems and languages
(paragraphs 4 through 7).² Such capabilities are potentially
achievable using a format-conversion paradigm; however, this would
require more sophisticated OCR functionality than currently
exists, as well as access to fonts and stylistic information used
in rendering the original document.
3. AUDIO EDITING AND INDEXING 
A second example of signal-based document processing is
provided by a wordspotter developed to support editing and
indexing of documents which originate and are intended to
remain in audio form[12]. Examples include voice mail,
dictated instructions and pre-recorded radio broadcasts or
commentaries. The wordspotter can also be used to retrieve
relevant portions of less structured audio, such as recorded
lectures or telephone messages. In contrast with most previous
wordspotting applications (e.g., [15, 10]), unconstrained
keyword vocabularies are critical to such editing and indexing
tasks.
² Syntax and semantics are not necessarily preserved in the example,
thanks to the user's lack of familiarity with most of the languages involved.
The wordspotter is similar to Image EMACS in at least three
ways: 1) it is based on partial modeling of signal content; 2) it
requires user specification of keyword models; and 3) it makes
no explicit use of linguistic knowledge during recognition,
though users are free to assign interpretations to keywords.
The wordspotter is also speaker or, more accurately,
sound-source dependent. These constraints allow for vocabulary
and language independence, as well as for the spotting of
non-speech audio sounds.
The wordspotter is based on hidden Markov models (HMMs)
and is trained in two stages[13]. The first is a static stage, in
which a short segment of the user's speech (typically 1 minute
or less) is used to create a background, or non-keyword,
HMM. The second stage of training is dynamic, in that
keyword models are created while the system is in use. Model
specification requires only a single repetition of a keyword
and, thus, to the system user, is indistinguishable from
keyword spotting. Spotting is performed using an HMM network
consisting of a parallel connection of the background model
and the appropriate keyword model. A forward-backward
search[13] is used to identify keyword start and end times,
both of which are required to enable editing operations such
as keyword insertion and deletion.
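
The parallel background/keyword network can be illustrated with a toy decoder. This is a sketch under strong simplifying assumptions, not the method of [13]: it substitutes a plain Viterbi pass for the forward-backward search, uses discrete symbols in place of acoustic features, models the background with a single state, and takes hand-set probabilities instead of trained ones. The function name `spot_keyword` and its parameters are hypothetical.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def spot_keyword(obs, bg_emit, kw_emits, p_enter=0.1, p_stay=0.5):
    """Return inclusive (start, end) frame spans where the keyword is found.

    obs      -- sequence of discrete observation symbols, one per frame
    bg_emit  -- dict symbol -> probability for the single background state
    kw_emits -- non-empty list of such dicts, one per left-to-right keyword state
    """
    if not obs:
        return []
    n = len(kw_emits)                     # number of keyword states
    emits = [bg_emit] + list(kw_emits)    # state 0 is the background model
    size = n + 1

    # Parallel network: the background loops or enters the keyword; each
    # keyword state loops or advances; the last one returns to background.
    trans = [[NEG_INF] * size for _ in range(size)]
    trans[0][0] = _log(1.0 - p_enter)
    trans[0][1] = _log(p_enter)
    for k in range(1, size):
        trans[k][k] = _log(p_stay)
        trans[k][k + 1 if k < n else 0] = _log(1.0 - p_stay)

    def emit(state, symbol):
        return _log(emits[state].get(symbol, 0.0))

    # Viterbi recursion with back-pointers.
    score = [NEG_INF] * size
    score[0] = _log(1.0 - p_enter) + emit(0, obs[0])
    score[1] = _log(p_enter) + emit(1, obs[0])
    back = []
    for symbol in obs[1:]:
        new_score, pointers = [], []
        for j in range(size):
            best = max(range(size), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[best] + trans[best][j] + emit(j, symbol))
            pointers.append(best)
        back.append(pointers)
        score = new_score

    # A path must end in the background or in the final keyword state.
    state = 0 if score[0] >= score[n] else n
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()

    # Runs of non-background states are the keyword hits.
    hits, start = [], None
    for t, s in enumerate(path):
        if s != 0 and start is None:
            start = t
        elif s == 0 and start is not None:
            hits.append((start, t - 1))
            start = None
    if start is not None:
        hits.append((start, len(path) - 1))
    return hits


# Toy data: "x" is background sound; the keyword is the sequence a, b, c.
background = {"x": 0.7, "a": 0.1, "b": 0.1, "c": 0.1}
keyword = [
    {"a": 0.97, "b": 0.01, "c": 0.01, "x": 0.01},
    {"b": 0.97, "a": 0.01, "c": 0.01, "x": 0.01},
    {"c": 0.97, "a": 0.01, "b": 0.01, "x": 0.01},
]
print(spot_keyword(list("xxabcxx"), background, keyword))
```

Returning both the start and the end frame of each hit mirrors the requirement stated above: an editing operation such as deletion needs to know where the keyword begins and where it ends.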
The audio editor is implemented on a Sun Microsystems
SPARCstation and makes use of its standard audio hardware. A
videotape demonstrating its use in several multilingual
application scenarios is available from SIGGRAPH[14].
4. DOCUMENT IMAGE DECODING 
Document image decoding (DID) is a framework for scanned
document recognition which extends hidden Markov
modeling concepts to two-dimensional image data[4]. In analogy
with the HMM approach to speech recognition, the decoding
framework assumes a communication theory model based on
three elements: an image generator, a noisy channel and an
image decoder. The image generator consists of a message
source, which generates a symbol string containing the
information to be communicated, and an imager, which formats
or encodes the message into an ideal bitmap. The channel
transforms this ideal image into a noisy observed image by
introducing distortions associated with printing and scanning.
The decoder estimates the message from the observed
image using maximum a posteriori decoding and a Viterbi-like
decoding algorithm[6].
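
The maximum a posteriori criterion can be written out as follows. The symbols are assumptions introduced here rather than the notation of [4]: M denotes a candidate message, Z the observed image, P(M) the message source prior, and P(Z | M) the combined effect of imager and channel.

```latex
\hat{M} \;=\; \arg\max_{M} P(M \mid Z)
        \;=\; \arg\max_{M} \frac{P(Z \mid M)\, P(M)}{P(Z)}
        \;=\; \arg\max_{M} P(Z \mid M)\, P(M)
```

The last step holds because P(Z) does not depend on M, so the decoder only needs models of the source and of the imager and channel, which are the three components named above.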
A key objective being pursued within the DID framework is
the automatic generation of optimized decoders from explicit
models of message source, imager and channel[3]. The goal
is to enable application-oriented users to specify such models
without sophisticated knowledge of image analysis and
recognition techniques. It is intended that both the format and type
of information returned by the document decoder be under the
user's control. The basic approach is to support declarative
specification of a priori document information (e.g., page
layout, font metrics) and task constraints via formal stochastic
grammars.
Figures 3 through 5 illustrate the application of DID to the
extraction of subject headings, listing types, business names
and telephone numbers from scanned yellow pages[7]. A
slightly reduced version of a sample scanned yellow page
column is shown in Figure 3 and a finite-state top-level source
model in Figure 4. The yellow page column includes a subject
heading and examples of several different listing types. These,
in turn, are associated with branches of the source model. The
full model contains more than 6000 branches and 1600 nodes.
Figure 5 shows the result of using a decoder generated from the
model to extract the desired information from the yellow page
column. Automatically generated decoders have also been
used to recognize a variety of other document types, including
dictionary entries, musical notation and baseball box scores.
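
To make the idea of a finite-state source model concrete, the sketch below encodes a miniature model as a table of branches and scores a sequence of listing types against it. Everything in it is hypothetical: the node names, the branch probabilities (Figure 4 omits the real ones), and the helper functions. The tag names and the record format are patterned loosely on the decoder output of Figure 5.

```python
import math

# Hypothetical miniature source model. Each node maps to its outgoing
# branches as (next_node, probability, listing_tag); the numbers are invented.
SOURCE_MODEL = {
    "start": [("listings", 1.0, "subject")],
    "listings": [
        ("listings", 0.6, "standard"),
        ("listings", 0.2, "bold"),
        ("listings", 0.1, "boxed"),
        ("end", 0.1, None),           # untagged branch that ends the column
    ],
}


def path_log_prob(tags, model=SOURCE_MODEL):
    """Log-probability that the source emits `tags` in order and then stops."""
    node, total = "start", 0.0
    for tag in list(tags) + [None]:
        for next_node, prob, branch_tag in model.get(node, []):
            if branch_tag == tag:
                total += math.log(prob)
                node = next_node
                break
        else:
            return float("-inf")      # no branch from this node carries the tag
    return total


def render(tag, name, number):
    """Format one decoded listing as a tagged record."""
    return "\\%s{\\name{%s} \\number{%s}}" % (tag, name, number)


column = ["subject", "standard", "bold"]
log_p = path_log_prob(column)
record = render("standard", "Mr Telephone Man", "692 4128")
```

A real decoder would combine such source probabilities with imager and channel scores for each branch; the table above covers only the source side of the model.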
5. SUMMARY 
The speech and text-image recognition initiatives discussed
in the preceding sections illustrate two research themes at
Xerox PARC which expand and redefine the role of
recognition technology in document-oriented applications. These
include the development of editors which operate directly
on audio and scanned image data, and the use of speech
and text-image recognition to retrieve arbitrary information
from documents with signal content. Key concepts
embodied in these research efforts include partial document models,
task-oriented document recognition, user specification and
interpretation of recognition models, and automatic
generation of recognizers from declarative models. These concepts
enable the realization of a broad range of signal-based
document processing operations, including font, vocabulary and
language-independent editing and retrieval.
References 
1. Bagley, S. and Kopec, G. "Editing images of text". Technical
Report P92-000150, Xerox PARC, Palo Alto, CA, November 1992.
2. Horn, B. Robot Vision, The MIT Press, Cambridge, MA, 1986.
3. Kopec, G. and Chou, P. "Automatic generation of custom
document image decoders". Submitted to ICDAR'93: Second IAPR
Conference on Document Analysis and Recognition, Tsukuba
Science City, Japan, October 1993.
4. Kopec, G. and Chou, P. "Document image decoding using
Markov sources". To be presented at ICASSP-93, Minneapolis,
MN, April 1993.
5. Kopec, G. "Least-squares font metric estimation from images".
EDL Report EDL-92-008, Xerox PARC, Palo Alto, CA, July 1992.
6. Kopec, G. "Row-major scheduling of image decoders".
Submitted to IEEE Trans. on Image Processing, February 1992.
7. Pacific Bell Smart Yellow Pages, Palo Alto, Redwood City and
Menlo Park, 1992.
8. Price, P. "Evaluation of Spoken Language Systems: The ATIS
Domain". Proc. Third DARPA Speech and Language Workshop,
P. Price (ed.), Morgan Kaufmann, June 1990.
9. Price, P., Fisher, W., Bernstein, J. and Pallett, D. "The DARPA
1000-Word Resource Management Database for Continuous
Speech Recognition". Proceedings of ICASSP-88, 1988,
pp 651-654.
10. Rohlicek, R., Russell, W., Roukos, S. and Gish, H.
"Continuous hidden Markov modeling for speaker-independent word
spotting". Proceedings of ICASSP-89, 1989, pp 627-630.
11. Stallman, R. GNU Emacs Manual, Free Software Foundation,
Cambridge, MA, 1986.
12. Wilcox, L. and Bush, M. "HMM-based wordspotting for voice
editing and audio indexing". Proceedings of Eurospeech-91,
Genova, Italy, 1991, pp 25-28.
13. Wilcox, L. and Bush, M. "Training and search algorithms for an
interactive wordspotting system". Proceedings of ICASSP-92,
1992, pp II-97 - II-100.
14. Wilcox, L., Smith, I. and Bush, M. "Wordspotting for voice
editing and indexing". Proceedings of CHI '92, 1992,
pp 655-656. Video on SIGGRAPH Video Review 76-77.
15. Wilpon, J., Miller, L., and Modi, P. "Improvements and
applications for key word recognition using hidden Markov modeling
techniques". Proceedings of ICASSP-91, 1991, pp 309-312.
Figure 1: Original scanned document image.
Figure 2: Scanned document image after Image EMACS editing.
Figure 3: Scanned yellow page column, slightly reduced.
Figure 4: Top-level source model of yellow page column.
Transition probabilities omitted for simplicity.
Figure 5: Decoder output for yellow-page column in Figure 3.
The output is a sequence of tagged records, for example:
\bold{\name{APEX COMMUNICATIONS INC} \number{408 773 9600}}
\standard{\name{Mr Telephone Man} \number{692 4128}}
