<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2198"> <Title>WORD CLASS DISCOVERY FOR POSTPROCESSING CHINESE HANDWRITING RECOGNITION</Title> <Section position="2" start_page="0" end_page="1221" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Class-based language models (Brown et al., 1992) have been proposed to deal with two problems confronting the well-known word n-gram language models: (1) data sparseness: the amount of training data is insufficient for estimating the huge number of parameters; and (2) domain robustness: the model is not adaptable to new application domains. The classes can be either linguistic categories or statistical word clusters. The former includes morphological features (Lee L. et al., 1993), grammatical parts-of-speech (Derouault and Merialdo, 1986; Church, 1989; Chang and Chen, 1993a), and semantic categories. The latter uses word classes discovered by the computer from statistical characteristics of very large corpora. Several groups have recently worked on corpus-based word class discovery, such as Brown et al. (1992), Jardino and Adda (1993), Schütze (1993), and Chang and Chen (1993b). However, the practical value of word class discovery needs to be proved in real-world applications. In this paper, we apply the discovered word classes to language models for contextual postprocessing of Chinese handwriting recognition. The Chinese language has more than 10,000 character categories. The problem of Chinese character recognition is therefore very challenging and has attracted many researchers. The field is usually divided into three types: on-line recognition, printed character recognition, and handwriting recognition, in order of difficulty. Recognition systems have been reported to achieve character accuracies ranging from 60% to 99%, for different character recognizers, text types, and producers. 
Misrecognitions and/or rejections are hard to avoid due to the problems of different fonts, characters with similar shapes, character segmentation, different writers, and algorithmic imperfections. Therefore, contextual postprocessing of the recognition results is very useful both in reducing the number of recognition errors and in saving time in human proofreading.</Paragraph> <Paragraph position="1"> Contextual postprocessing of character recognition results is not novel: Shinghal (1983) and Sinha (1988) proposed approaches for English; Sugimura and Saito (1985) dealt with reject correction for Japanese character recognition; and several researchers (Chou B. and Chang, 1992; Lee H. et al., 1993) presented approaches for postprocessing Chinese character recognition, to name a few.</Paragraph> <Paragraph position="2"> Three large text corpora have been used in the experiments: the 10-million-character 1991ud for collecting character bigrams and word frequencies, the 540-thousand-character day7 for word class discovery, and the 92-thousand-character poll2 for evaluating postprocessing language models. A simulated annealing approach is used to discover the statistical word classes in the training corpus. The discovery process converges to an optimal class assignment of the words, with minimal perplexity for a predefined number of classes. The discovered word classes are then used in the class bigram language model for postprocessing.</Paragraph> <Paragraph position="3"> We have used a state-of-the-art Chinese handwriting recognizer (Li et al., 1992) developed by ATC, CCL, ITRI, Taiwan as the basis of our experiments. The CCL/HCCR handwritten character database (5401 character categories, 200 samples per category) (Li et al., 1991) was automatically sorted according to character quality (Chou S. and Yu, 1993). The recognizer produces the N best category candidates for each character sample in the test part of the database. 
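The annealing-based class discovery described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the toy corpus, the cooling schedule, and the add-one-smoothed class bigram likelihood used as the acceptance criterion are all our own assumptions (the paper minimizes perplexity, which is a monotone transform of this log-likelihood).

```python
import math
import random
from collections import Counter

def class_bigram_logprob(corpus, assign, num_classes):
    """Corpus log-likelihood under a class bigram model:
    P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i),
    with add-one smoothing on the class bigrams (an assumption)."""
    cls = [assign[w] for w in corpus]
    bigrams = Counter(zip(cls, cls[1:]))
    unigrams = Counter(cls)
    word_counts = Counter(corpus)
    ll = 0.0
    for (a, _b), n in bigrams.items():
        ll += n * math.log((n + 1) / (unigrams[a] + num_classes))
    for w, n in word_counts.items():
        ll += n * math.log(n / unigrams[assign[w]])
    return ll

def discover_classes(corpus, num_classes, steps=1000, t0=1.0,
                     cooling=0.995, seed=0):
    """Simulated annealing over word-to-class assignments: propose
    moving one word to a random class, accept improvements always
    and degradations with probability exp(delta / T)."""
    rng = random.Random(seed)
    vocab = sorted(set(corpus))
    assign = {w: rng.randrange(num_classes) for w in vocab}
    cur = class_bigram_logprob(corpus, assign, num_classes)
    t = t0
    for _ in range(steps):
        w = rng.choice(vocab)
        old = assign[w]
        assign[w] = rng.randrange(num_classes)
        ll = class_bigram_logprob(corpus, assign, num_classes)
        if ll >= cur or rng.random() < math.exp((ll - cur) / t):
            cur = ll          # accept the move
        else:
            assign[w] = old   # reject, restore old class
        t *= cooling          # cool the temperature
    return assign, cur
```

With a small synthetic corpus, the returned mapping assigns every vocabulary word to one of the requested classes, and the accepted log-likelihood rises (perplexity falls) as the annealing proceeds.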
The postprocessor then uses as its input the category candidates for the poll2 corpus and chooses one of the candidates for each character as its output.</Paragraph> <Paragraph position="4"> For comparison, we have also implemented three other language models: a least-word model, a word-frequency model, and the powerful inter-word character bigram model (Lee L. et al., 1993). We have conducted extensive experiments with the discovered class bigram model (varying the number of classes) and these three competing models on character samples of different quality. The experimental results show that our discovered class bigram model outperforms the three competing models.</Paragraph> </Section></Paper>
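Choosing one candidate per character from the recognizer's N-best lists amounts to a best-path search over the candidate lattice under the language model. A minimal dynamic-programming (Viterbi-style) sketch follows; the scoring callbacks `logp_bigram` and `logp_init` are hypothetical stand-ins for the paper's class bigram model, under which `logp_bigram(p, c)` would be `log P(class(c) | class(p)) + log P(c | class(c))`:

```python
def viterbi_select(candidates, logp_bigram, logp_init):
    """Given candidates[i] = list of category candidates for position i,
    choose one candidate per position maximizing the total bigram
    log-probability of the resulting sequence."""
    # Scores for the first position.
    prev = {c: logp_init(c) for c in candidates[0]}
    backptrs = []
    for cands in candidates[1:]:
        cur, ptr = {}, {}
        for c in cands:
            # Best predecessor for candidate c at this position.
            p_best, score = max(
                ((p, s + logp_bigram(p, c)) for p, s in prev.items()),
                key=lambda t: t[1])
            cur[c], ptr[c] = score, p_best
        backptrs.append(ptr)
        prev = cur
    # Trace back from the best final candidate.
    out = [max(prev, key=prev.get)]
    for ptr in reversed(backptrs):
        out.append(ptr[out[-1]])
    return out[::-1]
```

The same driver works unchanged for the competing models (least-word, word-frequency, inter-word character bigram): only the scoring callbacks differ, which is how a side-by-side comparison like the paper's can be organized.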