<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1018">
  <Title>A Generative Probabilistic OCR Model for NLP Applications</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Model
</SectionTitle>
    <Paragraph position="0"> Generative &quot;noisy channel&quot; models relate an observable string $O$ to an underlying sequence, in this case recognized character strings and underlying word sequences $W$. This relationship is modeled by $P(W, O)$, decomposed by Bayes's Rule into steps modeled by $P(W)$ (the source model) and $P(O \mid W)$ (comprising sub-steps generating $O$ from $W$). Each step and sub-step is completely modular, so one can flexibly make use of existing sub-models or devise new ones as necessary. We begin with preliminary definitions and notation, illustrated in Figure 1. A true word sequence $W = \langle W_1, \ldots, W_r \rangle$ corresponds to a true character sequence $C = \langle C_1, \ldots, C_n \rangle$, and the OCR system's output character sequence is given by $O = \langle O_1, \ldots, O_m \rangle$. A segmentation of the true character sequence into</Paragraph>
    <Paragraph position="2"> Segment boundaries are only allowed between characters.</Paragraph>
    <Paragraph position="3"> Subsequences are denoted using segmentation positions</Paragraph>
    <Paragraph position="5"> Correspondingly, a segmentation of the OCR'd character sequence into $q$ subsequences is given by</Paragraph>
    <Paragraph position="7"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Generation of True Word Sequence
</SectionTitle>
      <Paragraph position="0"> The generative process begins with production of the true word sequence $W$ with probability $P(W)$; for example,</Paragraph>
      <Paragraph position="2"> Modeling the underlying sequence at the word level facilitates integration with NLP models, which is our ultimate goal. For example, the distribution $P(W)$ can be defined using $n$-grams, parse structure, or any other tool in the language modeling arsenal.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 From Words to Characters
</SectionTitle>
      <Paragraph position="0"> The first step in transforming $W$ to $O$ is generation of a character sequence $C$, modeled as $P(C \mid W)$. This step accommodates the character-based nature of OCR systems, and provides a place to model the mapping of different character sequences to the same word sequence (case/font variation) or vice versa (e.g. ambiguous word segmentation in Chinese). If the language in question provides explicit word boundaries (e.g. words are separated by spaces when printed) then we output '#' to represent visible word boundaries. One possible $C$ for our example $W$ is $C$ = &quot;This#is#an#example.&quot; Segmentation of the character sequence, modeled as $P(a \mid C, W)$, is motivated by the fact that most OCR systems first perform image segmentation, and then perform recognition on a word by word basis.</Paragraph>
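      <Paragraph> As an illustration of the word-to-character step, the following is a minimal sketch, not the implementation described here. The function name render_characters, the variant labels, the example token list, and the rule that punctuation attaches to the preceding word without a '#' are all assumptions made for the example.
<![CDATA[
```python
import string


def render_characters(words, variants, boundary="#"):
    """Render a word sequence as one character sequence with visible boundaries.

    `variants` names the case variant chosen for each word (hypothetical labels).
    """
    pieces = []
    for word, variant in zip(words, variants):
        text = {
            "upper": word.upper(),
            "lower": word.lower(),
            "leading": word.capitalize(),
        }[variant]
        # Assumption: a punctuation-only token is printed flush against the
        # previous word, so no boundary marker is emitted before it.
        if pieces and all(ch in string.punctuation for ch in word):
            pieces[-1] += text
        else:
            pieces.append(text)
    return boundary.join(pieces)


example = render_characters(
    ["this", "is", "an", "example", "."],
    ["leading", "lower", "lower", "lower", "lower"],
)
print(example)
```
]]>
      Under these assumptions the sketch reproduces the character sequence quoted for the running example.</Paragraph>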
      <Paragraph position="1"> For a language with clear word boundaries (or reliable tokenization or segmentation algorithms), one could simply use spaces to segment the character sequence in a non-probabilistic way. However, OCR systems may make segmentation errors and resulting subsequences may or may not be words. Therefore, a probabilistic segmentation model that accommodates word merge/split errors is necessary.</Paragraph>
      <Paragraph position="2"> If a segment boundary coincides with a word boundary, the word boundary marker '#' is considered a part of the segment on both sides. A possible segmentation for our example is $a = \langle 8, 11, 13 \rangle$, i.e. $C^1$ = &quot;This#is#&quot;, $C^2$ = &quot;#an#&quot;, $C^3$ = &quot;#ex&quot;, $C^4$ = &quot;ample.&quot; Notice the merge error in segment 1 and the split error involving segments 3 and 4.</Paragraph>
    </Section>
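    <Paragraph> The convention that a boundary marker belongs to the segments on both sides can be made concrete with a short sketch. The function name split_segments is hypothetical, and the boundary positions 8, 11 and 13 (counted in characters from the start of the sequence) are inferred here from the segments quoted for the running example.
<![CDATA[
```python
def split_segments(chars, boundaries, marker="#"):
    """Cut `chars` after each position in `boundaries` (1-based character counts).

    When the character just before a cut is the word boundary marker, it is
    kept at the end of the left segment and repeated at the start of the
    right segment.
    """
    segments = []
    start = 0
    for end in list(boundaries) + [len(chars)]:
        segments.append(chars[start:end])
        # Share the marker with the next segment if the cut falls on it.
        start = end - 1 if chars[end - 1] == marker else end
    return segments


print(split_segments("This#is#an#example.", [8, 11, 13]))
```
]]>
    With these positions the first segment merges two words and the last two segments split one word, matching the merge and split errors noted for the example.</Paragraph>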
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Character Sequence Transformation
</SectionTitle>
      <Paragraph position="0"> Our characterization of the final step, transformation into an observed character sequence, is motivated by the need to model OCR systems' character-level recognition errors. We model each subsequence $C^i$ as being transformed into an OCR subsequence $O^i$</Paragraph>
      <Paragraph position="2"> and we assume each $C^i$ [...] Kolak and Resnik (2002). This is also a logical place to make use of confidence values if provided by the OCR system. We assume that # is always deleted (modeling merge errors), and can never be inserted. Boundary markers at segment boundaries are re-inserted when segments are put together to create $O$, since they will be part of the OCR output (not as #, but most likely as spaces). For our example $C^i$</Paragraph>
      <Paragraph position="4"> Assuming independence of the individual steps, the complete model estimates joint probability</Paragraph>
      <Paragraph position="6"/>
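      <Paragraph> Reading the steps of Sections 2.1 to 2.4 as independent factors, one plausible form of the joint probability is sketched below. The symbols $a$ and $b$ for the segmentations of the true and OCR'd character sequences are assumed names, and the exact conditioning of the last factor is an assumption rather than a quotation.

```latex
\[
P(O, b, a, C, W) \;=\; P(O, b \mid a, C, W)\; P(a \mid C, W)\; P(C \mid W)\; P(W)
\]
```
      </Paragraph>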
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"> We have implemented the generative model using a weighted finite state model (FSM) framework, which provides a strong theoretical foundation, ease of integration for different components, and reduced implementation time thanks to available toolkits such as the AT&amp;T FSM Toolkit (Mohri et al., 1998). Each step is represented and trained as a separate FSM, and the resulting FSMs are then composed together to create a single FSM that encodes the whole model. Details of parameter estimation and decoding follow.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> The specific model definition and estimation methods assume that a training corpus is available, containing $\langle O, C, W \rangle$ triples.</Paragraph>
      <Paragraph position="1"> Generation of True Word Sequence. We use an n-gram language model as the source model for the original word sequence: an open vocabulary, trigram language model with back-off generated using the CMU-Cambridge Toolkit (Clarkson and Rosenfeld, 1997). The model is trained on the $W$ from the training data using the Witten-Bell discounting option for smoothing, and encoded as a simple FSM. We made a closed vocabulary assumption to evaluate the effectiveness of our model when all correct words are in its lexicon. Therefore, although the language model is trained on only the training data, the words in the test set are included in the language model FSM, and treated as unseen vocabulary.</Paragraph>
      <Paragraph position="2"> From Words to Characters. We generate three different character sequence variants for each word: upper case, lower case, and leading case (e.g. this → {THIS, this, This}). For each word, the distribution over case variations is learned from the $\langle W, C \rangle$ pairs in the training corpus. For words that do not appear in the corpus, or do not occur often enough to allow reliable estimation, we back off to word-independent case variant probabilities. (Currently, we assume a Latin alphabet. Mixed-case text is not included since it drastically increases the number of alternatives; at run time, mixed-case words are normalized as a preprocessing step.)</Paragraph>
      <Paragraph position="3"> Segmentation. Our current implementation makes an independent decision for each character pair whether to insert a boundary between them. To reduce the search space associated with the model, we limit the number of boundary insertions to one per word, allowing at most two-way word-level splits. The probability of inserting a segment boundary between two characters, conditioned on the character pair, is estimated from the training corpus, with Witten-Bell discounting (Witten and Bell, 1991) used to handle unseen character pairs.</Paragraph>
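      <Paragraph> A minimal sketch of estimating the boundary probability for a character pair follows. It uses the interpolated form of Witten-Bell smoothing with a pair-independent boundary rate as the lower-order distribution; both choices, along with the function name train_boundary_model and the toy observations, are assumptions for illustration and may differ from the estimator actually used.
<![CDATA[
```python
from collections import Counter


def train_boundary_model(examples):
    """Build P(boundary | character pair) from ((left, right), has_boundary) pairs."""
    pair_counts = Counter()      # how often each character pair was seen
    boundary_counts = Counter()  # how often a boundary was inserted in it
    outcomes = {}                # distinct outcomes (True/False) seen per pair
    total = 0
    boundaries = 0
    for pair, has_boundary in examples:
        pair_counts[pair] += 1
        boundary_counts[pair] += int(has_boundary)
        outcomes.setdefault(pair, set()).add(bool(has_boundary))
        total += 1
        boundaries += int(has_boundary)
    backoff = boundaries / total  # pair-independent boundary rate

    def prob(pair):
        seen = pair_counts[pair]
        if seen == 0:
            return backoff  # unseen pair: rely entirely on the back-off rate
        types = len(outcomes[pair])
        weight = seen / (seen + types)  # Witten-Bell interpolation weight
        return weight * boundary_counts[pair] / seen + (1 - weight) * backoff

    return prob


prob = train_boundary_model([
    (("s", "i"), True),
    (("s", "i"), False),
    (("a", "n"), False),
    (("a", "n"), False),
])
print(prob(("s", "i")), prob(("a", "n")), prob(("x", "y")))
```
]]>
      Pairs that always behaved the same way in training keep most of their mass on the observed outcome, while unseen pairs fall back to the overall boundary rate.</Paragraph>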
      <Paragraph position="4"> Character Sequence Transformation. This step is implemented as a probabilistic string edit process. The confusion tables for edit operations are estimated using Viterbi-style training on $\langle O, C \rangle$ pairs in the training data. Our current implementation allows for substitution, deletion, and insertion errors, and does not use context characters (see footnote 4). Figure 2 shows a fragment of a weighted FSM model for this step.</Paragraph>
      <Paragraph position="5"> Final Cleanup. At this stage, special symbols that were inserted into the character sequence are removed and the final output sequence is formed. For instance, segment boundary symbols are removed or replaced with spaces depending on the language.</Paragraph>
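      <Paragraph> To make the context-free string edit process of the character sequence transformation step concrete, the sketch below scores the single most probable alignment between a true segment and an observed segment by dynamic programming. The function name best_edit_probability, the table layout, and the toy probabilities are assumptions; a full model would also need properly normalized edit distributions.
<![CDATA[
```python
def best_edit_probability(truth, observed, sub, dele, ins):
    """Probability of the best alignment turning `truth` into `observed`.

    sub[(c, o)] is the probability of reading c as o (including c == o),
    dele[c] the probability of dropping c, ins[o] of inserting o.
    Missing entries count as probability zero.
    """
    rows, cols = len(truth) + 1, len(observed) + 1
    best = [[0.0] * cols for _ in range(rows)]
    best[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i > 0:  # delete truth[i - 1]
                cand = best[i - 1][j] * dele.get(truth[i - 1], 0.0)
                best[i][j] = max(best[i][j], cand)
            if j > 0:  # insert observed[j - 1]
                cand = best[i][j - 1] * ins.get(observed[j - 1], 0.0)
                best[i][j] = max(best[i][j], cand)
            if i > 0 and j > 0:  # substitute or copy
                pair = (truth[i - 1], observed[j - 1])
                cand = best[i - 1][j - 1] * sub.get(pair, 0.0)
                best[i][j] = max(best[i][j], cand)
    return best[rows - 1][cols - 1]


# Toy confusion tables, invented for illustration only.
sub = {("c", "c"): 0.9, ("a", "a"): 0.9, ("a", "o"): 0.1, ("t", "t"): 0.9}
dele = {"t": 0.05}
print(best_edit_probability("cat", "cot", sub, dele, {}))
```
]]>
      Viterbi-style training would repeatedly find such best alignments on the training pairs and re-count the edit operations they use.</Paragraph>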
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Decoding
</SectionTitle>
      <Paragraph position="0"> Decoding is the process of finding the &quot;best&quot; $W$ for an</Paragraph>
      <Paragraph position="2"> Decoding within the FSM framework is straightforward: we first compose all the components of the model in order, and then invert the resulting FSM. This produces a single transducer that takes a sequence of OCR characters as input, and returns all possible sequences of truth words as output, along with their weights. One can then simply encode OCR character sequences as FSMs and compose them with the model transducer to perform decoding.</Paragraph>
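      <Paragraph> The compose-then-invert recipe can be illustrated with finite weighted relations stored as Python dictionaries. This is only an analogy under simplifying assumptions: whole strings stand in for paths, the function names compose, invert and decode are hypothetical, none of this is the API of the AT&amp;T FSM Toolkit, and the probabilities are invented.
<![CDATA[
```python
from collections import defaultdict


def compose(first, second):
    """Compose {(x, y): p} with {(y, z): q} into {(x, z): sum over y of p * q}."""
    by_input = defaultdict(list)
    for (y, z), q in second.items():
        by_input[y].append((z, q))
    result = defaultdict(float)
    for (x, y), p in first.items():
        for z, q in by_input.get(y, []):
            result[(x, z)] += p * q
    return dict(result)


def invert(relation):
    """Swap input and output sides of a weighted relation."""
    return {(y, x): p for (x, y), p in relation.items()}


def decode(model, observed):
    """List (truth, weight) candidates for an observed string, best first."""
    candidates = [(w, p) for (o, w), p in model.items() if o == observed]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


source = {("terre", "terre"): 0.6, ("terne", "terne"): 0.4}        # P(W)
to_chars = {("terre", "terre"): 1.0, ("terne", "terne"): 1.0}      # P(C | W)
to_ocr = {("terre", "terre"): 0.8, ("terre", "terne"): 0.2,
          ("terne", "terne"): 0.9, ("terne", "terre"): 0.1}        # P(O | C)

model = invert(compose(compose(source, to_chars), to_ocr))
print(decode(model, "terne"))
```
]]>
      Composing in generation order yields the joint weights, and inversion turns the result into a map from OCR output back to candidate truths.</Paragraph>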
      <Paragraph position="3"> Note that the same output sequence can be generated through multiple paths, and we need to sum over all paths to find the overall probability of that sequence. This can be achieved by determinizing the output FSM generated by the decoding process. However, for practical reasons, we chose to first find the $N$-best paths in the resulting FSM and then combine the ones that generate the same output.</Paragraph>
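      <Paragraph> Combining the best paths that share an output amounts to summing their probabilities per output string, as in the sketch below. The function name combine_paths and the path list are hypothetical.
<![CDATA[
```python
from collections import defaultdict


def combine_paths(paths):
    """Sum the probabilities of paths with identical output, best total first."""
    totals = defaultdict(float)
    for output, probability in paths:
        totals[output] += probability
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented path list: the single best path is not the best output overall.
paths = [("la terne", 0.25), ("la terre", 0.20), ("la terre", 0.15)]
print(combine_paths(paths))
```
]]>
      In this toy list the highest-scoring single path is outranked once the two paths for the other output are combined, which is why the summation matters.</Paragraph>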
      <Paragraph position="4"> The resulting lattice or $N$-best list is easily integrated with other probabilistic models over words, or the most probable sequence can be used as the output of the post-OCR correction process.</Paragraph>
      <Paragraph position="5"> Footnote 4: We are working on conditioning on neighbor characters, and using character merge/split errors. These extensions are conceptually trivial; however, practical constraints such as the FSM sizes make the problem more challenging.</Paragraph>
      <Paragraph position="7"> Footnote 5: The probabilities are constructed for illustration, but realistic: notice how n is much more likely to be confused for c than</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> We report on two experiments. In the first, we evaluate the correction performance of our model on real OCR data. In the second, we evaluate the effect of correction in a representative NLP scenario, acquiring a translation lexicon from hardcopy text.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Training and Test Data
</SectionTitle>
      <Paragraph position="0"> Although most researchers are interested in improving the results of OCR on degraded documents, we are primarily interested in developing and improving OCR in new languages for use in NLP. A possible approach to retargeting OCR for a new language is to employ an existing OCR system from a &amp;quot;nearby&amp;quot; language, and then to apply our error correction framework. For these experiments, therefore, we created our experimental data by scanning a hardcopy Bible using both an English and a French OCR system. (See Kanungo et al. (in revision) and Resnik et al. (1999) for discussion of the Bible as a resource for multilingual OCR and NLP.) We have used the output of the English system run on French input to simulate the situation where available resources of one language are used to acquire resources in another language that is similar.</Paragraph>
      <Paragraph position="1"> It was necessary to pre-process the data in order to eliminate the differences between the on-line version that we used as the ground truth and the hardcopy, such as footnotes, glossary, cross-references, and page numbers. We have not corrected hyphenations, case differences, etc.</Paragraph>
      <Paragraph position="2"> Our evaluation metrics for OCR performance are Word Error Rate (WER) and Character Error Rate (CER), which are defined as follows:</Paragraph>
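      <Paragraph> A minimal sketch of the two metrics follows, assuming the common definitions in which each rate is the edit distance between the ground truth and the OCR output (over words or characters) divided by the length of the ground truth. The function names and the sample strings are illustrative.
<![CDATA[
```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (unit costs)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                             # deletion
                current[j - 1] + 1,                          # insertion
                previous[j - 1] + (ref_item != hyp_item),    # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(truth, output):
    """Word-level edit distance divided by the number of words in the truth."""
    reference = truth.split()
    return edit_distance(reference, output.split()) / len(reference)


def char_error_rate(truth, output):
    """Character-level edit distance divided by the number of truth characters."""
    return edit_distance(truth, output) / len(truth)


print(word_error_rate("la terre est ronde", "la terne est ronde"))
print(char_error_rate("la terre est ronde", "la terne est ronde"))
```
]]>
      A single wrong character costs a whole word at the word level, which is why WER is typically much higher than CER on the same output.</Paragraph>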
      <Paragraph position="4"> Since we are interested in recovering the original word sequence rather than the character sequence, evaluations are performed on lowercased and tokenized data. Note, however, that our system works on the original case OCR data, and generates a sequence of word IDs, which are converted to a lowercase character sequence for evaluation.</Paragraph>
      <Paragraph position="5"> We have divided the data, which has 29317 lines, into 10 equal-size disjoint sets, and used the first 9 as the training data, and the first 500 lines of the last one as the test data. The WER and CER for the English OCR system on the French test data were 18.31% and 5.01% respectively. The error rates were 5.98% and 2.11% for the output generated by the French OCR system on the same input. When single characters and non-alphabetical tokens are ignored, the WER and CER drop to 17.21% and 4.28% for the English OCR system, and to 4.96% and 1.68% for the French OCR system.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Reduction of OCR Error Rates
</SectionTitle>
      <Paragraph position="0"> We evaluated the performance of our model by studying the reduction in WER and CER after correction. The input to the system was original case, tokenized OCR output, and the output of the system was a sequence of word IDs that are converted to lowercase character sequences for evaluation.</Paragraph>
      <Paragraph position="1"> All the results are summarized in Table 1. The conditions side gives various parameters for each experiment. The language model (LM) is either (word) unigram or trigram. Word to character conversion (WC) can allow the three case variations mentioned earlier, or simply pick the most probable variant for each word. Segmentation (SG) can be disabled, or 2-way splits and merges may be allowed. Finally, the character level error model (EM) may be trained on various subsets of training data. Table 2 gives the adjusted results when ignoring all single characters and tokens that do not contain any alphabetical character.</Paragraph>
      <Paragraph position="2"> As can be seen from the tables, as we increase the training size of the character error model from one section to five sections, the performance increases. However, there  is a slight decrease in performance when the training size is increased to 9 sections. This suggests that our training procedures, while effective, may require refinement as additional training data becomes available. When we replace the unigram language model with a trigram model, the results improve as expected. However, the most interesting case is the last experiment, where word merge/split errors are allowed.</Paragraph>
      <Paragraph position="3"> Word merge/split errors cause an exponential increase in the search space. If there are $n$ words that need to be corrected together, they can be grouped in $2^{n-1}$ different ways, ranging from $n$ distinct tokens to a single token. For each of those groups, there are $N^{km}$ possible correct word sequences, where $m$ is the number of tokens in that group, $k$ is the maximum number of words that can merge together, and $N$ is the vocabulary size. Although it is possible to avoid some computation using dynamic programming, doing so would require some deviation from the FSM framework.</Paragraph>
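      <Paragraph> The growth in the number of groupings is easy to check by enumeration: each of the gaps between adjacent tokens is either a cut or a merge, so a sequence of n tokens has 2^(n-1) contiguous groupings. The generator name groupings below is hypothetical.
<![CDATA[
```python
from itertools import product


def groupings(tokens):
    """Yield every way to merge adjacent tokens into contiguous groups."""
    for cuts in product([False, True], repeat=len(tokens) - 1):
        groups = []
        current = [tokens[0]]
        for token, cut in zip(tokens[1:], cuts):
            if cut:
                groups.append(current)
                current = [token]
            else:
                current.append(token)
        groups.append(current)
        yield groups


tokens = ["a", "b", "c", "d"]
all_groupings = list(groupings(tokens))
print(len(all_groupings))  # equals 2 ** (len(tokens) - 1)
```
]]>
      The enumeration covers everything from all tokens kept separate to all tokens merged into one, which is the range described above.</Paragraph>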
      <Paragraph position="6"> We have instead used several restrictions to reduce the search space. First, we allowed only 2-way merge and split errors, restricting the search space to bigrams. We further reduce the search space by searching through only the bigrams that are seen in the training data. We also introduced character error thresholds, letting us eliminate candidates based on their length. For instance, if we are trying to correct a sequence of 10 characters and have set a threshold of 0.2, we need only check candidates whose length is between 8 and 12. The last restriction we imposed is to force selection of the most likely case for each word rather than allowing all three case variations. Despite all these limitations, the ability to handle word merge/split errors improves performance significantly. It is notable that our model allows global interactions between the distinct components. As an example, if the input is &quot;ter- re&quot;, the system returns &quot;mer se&quot; as the most probable correction. When &quot;la ter- re&quot; is given as the input, interaction between the language model, segmentation model, and the character error model chooses the correct sequence &quot;la terre&quot;. In this example, the language model overcomes the preference of the segmentation model to insert word boundaries at whitespaces.</Paragraph>
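      <Paragraph> The length-based pruning implied by a character error threshold can be sketched as follows. The function names, the rounding rule (rounding the allowed slack down), and the candidate list are assumptions for illustration.
<![CDATA[
```python
import math


def length_bounds(length, threshold):
    """Return the (shortest, longest) candidate lengths worth checking."""
    # Small epsilon guards against floating-point products such as 1.9999999.
    slack = math.floor(length * threshold + 1e-9)
    return max(length - slack, 0), length + slack


def filter_candidates(observed, candidates, threshold):
    """Keep only candidates whose length is within the allowed bounds."""
    low, high = length_bounds(len(observed), threshold)
    return [c for c in candidates if low <= len(c) <= high]


print(length_bounds(10, 0.2))
print(filter_candidates("abcdefghij", ["abc", "abcdefgh", "abcdefghijkl"], 0.2))
```
]]>
      For a 10-character input and a threshold of 0.2 the bounds are 8 and 12, in line with the example in the text.</Paragraph>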
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Translation Lexicon Generation
</SectionTitle>
      <Paragraph position="0"> We used the problem of unsupervised creation of translation lexicons from automatically generated word alignment of parallel text as a representative NLP task to evaluate the impact of OCR correction on usability of OCR text. We assume that the English side of the parallel text is online and its foreign language translation is generated using an OCR system (see footnote 8). Our goal is to apply our OCR error correcting procedures prior to alignment so the resulting translation lexicon has the same quality as if it had been derived from error-free text.</Paragraph>
      <Paragraph position="1"> We trained an IBM style translation model (Brown et al., 1990) using GIZA++ (Och and Ney, 2000) on the 500 test lines used in our experiments paired with corresponding English lines from an online Bible. Word level alignments generated by GIZA++ were used to extract cross-language word co-occurrence frequencies, and candidate translation lexicon entries were scored according to the log likelihood ratio (Dunning, 1993) (cf. (Resnik and Melamed, 1997)).</Paragraph>
      <Paragraph position="2"> Footnote 8: Alternatively, the English side can be obtained via OCR and corrected.</Paragraph>
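      <Paragraph> For reference, the log likelihood ratio for a word pair can be computed from a 2x2 table of co-occurrence counts as the G-squared statistic. The sketch below assumes that formulation; the function name, argument layout, and counts are illustrative and do not describe the exact scoring pipeline used here.
<![CDATA[
```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: both words aligned together, k12: first word without the second,
    k21: second word without the first, k22: neither word.
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    table = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            count = table[i][j]
            if count > 0:  # empty cells contribute nothing
                expected = row_sums[i] * col_sums[j] / total
                g2 += count * math.log(count / expected)
    return 2.0 * g2


print(log_likelihood_ratio(10, 10, 10, 10))  # independent words
print(log_likelihood_ratio(10, 0, 0, 10))    # perfectly associated words
```
]]>
      Independent words score zero, and the score grows as the observed co-occurrence departs from what independence would predict.</Paragraph>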
      <Paragraph position="3"> We generated three such lexicons by pairing the English with the French ground truth, uncorrected OCR output, and its corrected version. All text was tokenized and lowercased, and single character tokens and tokens with no letters were removed. This method of generating a translation lexicon works well, as Table 3 illustrates with the top twenty entries from the lexicon generated using ground truth French.</Paragraph>
      <Paragraph position="4"> Table 3: Top twenty entries from the lexicon generated using ground truth French (English / French): and / et; for / car; of / de; if / si; god / dieu; ye / vous; we / nous; you / vous; christ / christ; the / le; not / pas; law / loi; but / mais; jesus / jésus; lord / seigneur; as / comme; the / la; that / qui; is / est; in / dans.</Paragraph>
      <Paragraph position="5"> Figure 3 gives the precision-recall curves for the translation lexicons generated from OCR using the English OCR system on French hardcopy input with and without correction, using the top 1000 entries of the lexicon generated from ground truth as the target set. Since we are interested in the effect of OCR, independent of the performance of the lexicon generation method, the lexicon auto-generated from the ground truth provides a reasonable target set. (More detailed evaluation of translation lexicon acquisition is a topic for future work.)</Paragraph>
      <Paragraph position="6"> The graph clearly illustrates that the precision of the translation lexicon generated using original OCR data degrades quickly as recall increases, whereas the corrected version maintains its precision above 90% up to a recall of 80%.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> There has been considerable research on automatically correcting words in text in general, and correction of OCR output in particular. Kukich (1992) provides a general survey of the research in the area. Unfortunately, there is no commonly used evaluation base for OCR error correction, making comparison of experimental results difficult.</Paragraph>
    <Paragraph position="1"> Some systems integrate the post-processor with the actual character recognizer to allow interaction between the two. In an early study, Hanson et al. (1976) report a word error rate of about 2% and a reject rate of 1%, without a dictionary. Sinha and Prasada (1988) achieve 97% word recognition, ignoring punctuation, using an augmented dictionary, a Viterbi style algorithm, and manual heuristics.</Paragraph>
    <Paragraph position="2"> Many systems treat OCR as a black box, generally employing word and/or character level $n$-grams along with character confusion probabilities. Srihari et al. (1983) is one typical example, reporting up to 87% error correction on artificial data and relying (as we do) on a lexicon for correction. Goshtasby and Ehrich (1988) present a method based on probabilistic relaxation labeling, using context characters to constrain the probability of each character. They do not use a lexicon but do require the probabilities assigned to individual characters by the OCR system.</Paragraph>
    <Paragraph position="3"> Jones et al. (1991) describe an OCR post-processing system comparable to ours, and report error reductions of 70-90%. Their system is designed around a stratified algorithm. The first phase performs isolated word correction using rewrite rules, allowing words that are not in the lexicon. The second phase attempts correcting word split errors, and the last phase uses word bigram probabilities to improve correction. The three phases interact with each other to guide the search. In comparison to our work, the main difference is our focus on an end-to-end generative model versus their stratified algorithm centered around correction.</Paragraph>
    <Paragraph position="4"> Perez-Cortes et al. (2000) describe a system that uses a stochastic FSM that accepts the smallest k-testable language consistent with a representative language sample. Depending on the value of k, correction can be restricted to the sample language, or variations may be allowed. They report reducing error rate from 33% to below 2% on OCR output of hand-written Spanish names from forms.</Paragraph>
    <Paragraph position="5"> Pal et al. (2000) describe a method for OCR error correction of an inflectional Indian language using morphological parsing, and report correcting 84% of the words with a single character error. Although it is limited to single errors, the system demonstrates the possibility of correcting OCR errors in morphologically rich languages.</Paragraph>
    <Paragraph position="6"> Taghva and Stofsky (2001) take a different approach to post-processing and propose an interactive spelling correction system specifically designed for OCR error correction. The system uses multiple information resources to propose correction candidates and lets the user review the candidates and make corrections.</Paragraph>
    <Paragraph position="7"> Although segmentation errors have been addressed to some degree in previous work, to the best of our knowledge our model is the first that explicitly incorporates segmentation. Similarly, many systems make use of a language model, a character confusion model, etc., but none have developed an end-to-end model that formally describes the OCR process from the generation of the true word sequence to the output of the OCR system in a manner that allows for statistical parameter estimation. Our model is also the first to explicitly model the conversion of a sequence of words into a character sequence.</Paragraph>
  </Section>
</Paper>