<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2138">
  <Title>Combining Trigram and Winnow in Thai OCR Error Correction</Title>
  <Section position="6" start_page="840" end_page="841" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We prepared a corpus of about 9,000 sentences (140,000 words, 1,300,000 characters) to evaluate our methods. The corpus is split into two parts: the first, about 80% of the whole corpus, is used as the training set for both the trigram model and Winnow, and the rest is used as the test set.</Paragraph>
    <Paragraph position="1"> Based on the prepared corpus, experiments were conducted to compare our methods. The results are shown in Table 1 and Table 2.</Paragraph>
    <Paragraph position="2"> Table 1 shows the percentage of word errors in the entire text. Table 2 shows the percentage of corrected word errors after applying Trigram and Winnow. The results reveal that the trigram model can correct both non-word and real-word errors, but it also introduces some new errors. For the trigram model, real-word errors are more difficult to correct than non-word errors. When Winnow is combined with the trigram model, both types of errors are further reduced, and the improvement in real-word error correction is more pronounced.</Paragraph>
    <Paragraph position="3"> Trigram+Winnow performs better than Trigram alone because it can exploit more useful features in correction, namely context words and collocation features. For example, the word &quot;d~&quot; (to bring) is frequently recognized as &quot;~&quot; (water) because the characters &quot;~&quot; are mistakenly replaced with the single character &quot;~&quot; by OCR. In this case, Trigram cannot effectively recover the correct word &quot;~&quot; from the real-word error &quot;d~&quot;. Winnow corrects &quot;d~&quot; effectively because the algorithm finds context words that indicate the occurrence of &quot;~&quot;, such as &quot;=L~a&quot; (evaporate) and &quot;~&quot; (plant). Note that these context words cannot be used by Trigram to correct real-word errors.</Paragraph>
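    <Paragraph position="4"> The role of context words can be illustrated with a minimal sketch of a Winnow-style corrector. It is an illustration under stated assumptions, not the implementation evaluated above: it keeps one Winnow classifier per candidate word in a confusion set, uses only unordered context words as features (no collocation features), and picks the candidate with the highest score. The class and function names, the parameter values (theta, alpha, beta), the placeholder labels WATER and BRING, and the toy training data are all hypothetical.

```python
from collections import defaultdict


class Winnow:
    """Mistake-driven linear classifier with multiplicative weight updates."""

    def __init__(self, theta=2.0, alpha=1.5, beta=0.5):
        self.theta = theta  # decision threshold (assumed value)
        self.alpha = alpha  # promotion factor, greater than 1 (assumed value)
        self.beta = beta    # demotion factor, between 0 and 1 (assumed value)
        self.weights = defaultdict(lambda: 1.0)  # every feature starts at 1.0

    def score(self, features):
        # Sum the weights of the features that are active in this context.
        return sum(self.weights[f] for f in features)

    def predict(self, features):
        return self.score(features) > self.theta

    def update(self, features, label):
        # Weights change only when the current prediction is wrong.
        if self.predict(features) == label:
            return
        factor = self.alpha if label else self.beta
        for f in features:
            self.weights[f] *= factor


def train(confusion_set, examples, epochs=5):
    """Train one classifier per candidate word on (context words, correct word) pairs."""
    classifiers = {word: Winnow() for word in confusion_set}
    for _ in range(epochs):
        for features, correct_word in examples:
            for word, clf in classifiers.items():
                clf.update(features, word == correct_word)
    return classifiers


def correct(classifiers, features):
    """Return the candidate whose classifier scores the context highest."""
    return max(classifiers, key=lambda word: classifiers[word].score(features))


# Hypothetical toy data: context words observed around each correct word.
examples = [
    ({"evaporate", "hot"}, "WATER"),
    ({"plant", "pour"}, "WATER"),
    ({"gift", "home"}, "BRING"),
    ({"friend", "gift"}, "BRING"),
]
classifiers = train({"WATER", "BRING"}, examples)
print(correct(classifiers, {"evaporate", "plant"}))
```

    In this toy setting, the context words for evaporate and plant are promoted only in the classifier for the water candidate, so a context containing both favours that candidate regardless of which adjacent words a trigram model would see.</Paragraph>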
  </Section>
</Paper>