File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/n03-1018_intro.xml
Size: 2,392 bytes
Last Modified: 2025-10-06 14:01:44
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1018"> <Title>A Generative Probabilistic OCR Model for NLP Applications</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Although a great deal of text is now available in electronic form, vast quantities of information still exist primarily (or only) in print. Critical applications of NLP technology, such as rapid, rough document translation in the field (Holland and Schlesiger, 1998) or information retrieval from scanned documents (Croft et al., 1994), can depend heavily on the quality of optical character recognition (OCR) output. Doermann (1998) comments, &quot;Although the concept of a raw document image database is attractive, comprehensive solutions which do not require complete and accurate conversion to a machine-readable form continue to be elusive for practical systems.&quot; Unfortunately, the output of commercial OCR systems is far from perfect, especially when the language in question is resource-poor (Kanungo et al., in revision). And efforts to acquire new language resources from hardcopy using OCR (Doermann et al., 2002) face something of a chicken-and-egg problem. The problem is compounded by the fact that most OCR systems are black boxes that do not allow user tuning or re-training -- Baird (1999, reported in (Frederking, 1999)) comments that the lack of ability to rapidly retarget OCR/NLP applications to new languages is &quot;largely due to the monolithic structure of current OCR technology, where language-specific constraints are deeply enmeshed with all the other code.&quot; In this paper, we describe a complete probabilistic, generative model for OCR, motivated specifically by (a) the need to deal with monolithic OCR systems, (b) the focus on OCR as a component in NLP applications, and (c) the ultimate goal of using OCR to help acquire resources for new languages from printed text. 
After presenting the model itself, we discuss the model's implementation, training, and its use for post-OCR error correction. We then present two evaluations: one for standalone OCR correction, and one in which OCR is used to acquire a translation lexicon from printed text. We conclude with a discussion of related research and directions for future work.</Paragraph> </Section> </Paper>