<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2138"> <Title>Combining Trigram and Winnow in Thai OCR Error Correction</Title> <Section position="6" start_page="840" end_page="841" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We prepared a corpus of about 9,000 sentences (140,000 words, 1,300,000 characters) for evaluating our methods. The corpus is divided into two parts: the first part, containing about 80% of the whole corpus, is used as a training set for both the trigram model and Winnow, and the rest is used as a test set.</Paragraph> <Paragraph position="1"> Based on the prepared corpus, experiments were conducted to compare our methods. The results are shown in Table 1 and Table 2.</Paragraph> <Paragraph position="2"> Table 1 shows the percentage of word errors in the entire text. Table 2 shows the percentage of corrected word errors after applying Trigram and Winnow. The results reveal that the trigram model can correct both non-word and real-word errors, but introduces some new errors. For the trigram model, real-word errors are more difficult to correct than non-word errors. When Winnow is combined with the trigram model, both types of errors are further reduced, and the improvement in real-word error correction is more pronounced.</Paragraph> <Paragraph position="3"> The reason for the better performance of Trigram+Winnow over Trigram alone is that the former can exploit more useful features, i.e., context words and collocation features, in correction. For example, the word &quot;d~&quot; (to bring) is frequently recognized as &quot;~&quot; (water) because the characters &quot;~&quot; are mistakenly replaced with a single character &quot; &quot;~' by OCR. In this case, Trigram cannot effectively restore the real-word error &quot;d~&quot; to the correct word &quot;~&quot;. 
The word &quot;d~&quot; is effectively corrected by Winnow because the algorithm finds context words that indicate the occurrence of &quot;~&quot;, such as the words &quot;=L~a&quot; (evaporate) and &quot;~&quot; (plant). Note that these context words cannot be used by Trigram to correct real-word errors.</Paragraph> </Section> </Paper>