
<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2198">
  <Title>WORD CLASS DISCOVERY FOR POSTPROCESSING CHINESE HANDWRITING RECOGNITION</Title>
  <Section position="5" start_page="1222" end_page="7224" type="evalu">
    <SectionTitle>
4. EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1222" end_page="1222" type="sub_section">
      <SectionTitle>
4.1 The corpora and word bigrams
</SectionTitle>
      <Paragraph position="0"> The 1991 U l) newsl)aper corpus (199lad) of approximately 10,000,000 characters has beeo used for collecting the character bigrams and word frequencies used in the lWCll model. A sul)corpus of 1991ud, day7, was used for word (;lass discovery.</Paragraph>
      <Paragraph position="1"> 'l!he subcorpns is first segmented automat|rally into sentences, then into words by our Viterbid)ased word identification program VSG. SI, atistics of the day7 sub-corpus arc; summarized: 42,537 senteuces, 2;t,9&amp;quot;/7 wor(b  character, 4,460 d-character).</Paragraph>
      <Paragraph position="2"> A sin,pie program is then used for counting the word collocation frequencies for the 23,977x23,977 word bi gram, iu which only 203,304 entries arc; nonzero. After that, the full word bigram is stored in compressed form.</Paragraph>
      <Paragraph position="3"> 'Fhe simulated anneMing procedm:e is w~ry timeconsuming; that is why we have used the smMler day7 rather than the original 1991ud corpus for word class discove.ry. For example, it took 201.2 CPU hours on a I)EC 3000/500 AXP workstation to classify 23,977 words into 200 classes with 50,000 trials in each of 416 iterations, using the day7 corpus.</Paragraph>
      <Paragraph position="4"> An iudelmndent, set of news abstract artMes, polL2, were collected for evaluating the l)erforntance of lan-guage models, polL2 is cli\[\['erenl; from day7 in both pulAisher and time period poll2 contains 6,930 sentences or !t2,710 (Jhiuese characters.</Paragraph>
    </Section>
    <Section position="2" start_page="1222" end_page="1222" type="sub_section">
      <SectionTitle>
4.2 Handwriting recognition
</SectionTitle>
      <Paragraph position="0"> We have used a state-of-the art Chinese handwriting recognizer (I,i el el., 1992) de.veloped by ATC, (XII,, \['I'll\], Taiwan as the basis of our experiments. The (',(JlffllCCl/. hamlwritten character tie|abase (5401 character categories, 2(10 samples each category) ('IS| el el., 1991)was first automatically sorted according I.o character quality (Chou S. and Yu, 1993), then was divided into two t~m'l,s: the odd-rank s~mq)les \]))r |;rMn ing the recognizer, the. eves-rank samples as iteM-out test data.</Paragraph>
      <Paragraph position="1"> We have used for our experiments three sets of character samples, CQI0, CQ20, and CQ30, which are the saml)les with quMity ranks 10, 20, and 30, respectively. The recognition results; are sumu,arized in Table l(a).</Paragraph>
      <Paragraph position="2"> The table shows the n,unbers of character samples in which position the correct character categories were ranked by the recognizer. There are, for example, 5,270 character samples ranked 1, 105 ranked 2, 15 ranked 3~ ..., aud 4 ranked after 10, for CQI0. The error rates, in terms of character categories, would be 2.43%, '3.48%, and 4.07%, for (JQI0, CQ20, and (X230, respe.ctiw~ly.</Paragraph>
    </Section>
    <Section position="3" start_page="1222" end_page="1223" type="sub_section">
      <SectionTitle>
4.3 Word class discovery
</SectionTitle>
      <Paragraph position="0"> The day7 subcorlms was used for discovering word classes. Tim initial contiguration is: Words with tYequency less tlum m (currently set to 6) are assigne'd to Class-0, the unseen word (:lass (Jardino and Adda 1993); i)ttnctuation marks are assigned to a speciM class Class-l; aud l 4 character numl)er words are assigned to Classes 2 .5, resl)ectively; all other words are assigne.d to Class--0. The word-types assigne(t to the six spe.cial classes classes 0-5 are not subject to reassignment. '\['he control \[)a.ra/tleter (7.\]) is initially set to 0.1 and the amlealing factor af 0.9.</Paragraph>
      <Paragraph position="1"> We have conducted rmmbers of experiments with  (b) Number of Correct Characters in po1+-2  different predefined number of classes NC. The automatie discovery procedure stops when the perplexity converges or the control parameter approaches to zero. The converged perplexities range from 670 to 1200, depending on NC. Classifications with higher NC have lower training set perplexities. However, we have to careful about the problem of overtraining due to insufficient training data. See Chang and Chen (1993b) for discussion on the problem.</Paragraph>
      <Paragraph position="2"> A statistical langnage model must be able to deal with the problem of unseen words and bigrams, in real-world applications. We adopt a simple linear smoothing scheme, similar to Jardino and Adda (1993). The interpolation parameters ct and C/? are set to 1 - 10 -'~ and 0.1, respectively.</Paragraph>
    </Section>
    <Section position="4" start_page="1223" end_page="7224" type="sub_section">
      <SectionTitle>
4.4 Contextual postprocessing
</SectionTitle>
      <Paragraph position="0"> The poll2 corpus of 92,710 Chinese characters was used for evaluating the performance of contextual postprocessing. The recognition resnlts for the three sets of character samples were used as the basis of evalnation.</Paragraph>
      <Paragraph position="1"> Table 1 (b) shows the recognition results in terms of the poli2 corpus. The corpus contains 52 uncommon characters which do not belong to any of the 5401 character categories. The table shows the nmnbers of characters in the corpus in which position the correct characters were ranked by the recognizer. For example, there are 90,778 characters ranked 1, 1451 ranked 2, 178 ranked 3, ..., and 50 ranked after 10, in terms of the CQI0 samples. The recognition error rate for CQ10 would be 2.08%, without contextual postprocessing. 3'he er-For rate for CQ20, 4.08%, is higher than that for CQ30, 3.25%, because some very common characters, e.g., ;/~ , ~ in CQ20 samples are misrecognized. We set the number of candidates K to 6 in the experiments, as a tradeoff for better performance. Therefore, the characters ranked after 6 and the 52 uncomnmn characters are impossible to recover using the postprocessor. The optimM results a language model can do are thus with error rates 0.11%, 0.48%, and 0.22%, for CQ10, CQ20, and CQ30, respectively.</Paragraph>
      <Paragraph position="2"> The changes the postprocessor makes can be classified into three types: wrong-to-correct (XO), correctto-wrong (OX), and wrong-to-wrong (XX). In the XO type, a wrong character (i.e., a recognition error) is cot: rected; in the OX type, a correct character is changed to a wrong one; and in the XX type, a wrong character is changed to another different wrong one. The performance of the postprocessor can be evaluated as the net gain, @XOs - #OXs.</Paragraph>
      <Paragraph position="3">  processing \['or the three sets of character samples. The columns XO, OX, XX, and Gain list the average numbers of characters in types XO, OX, XX, and XO-OX, respectively. The last column ER lists the overall error rates after postprocessing with the various language models. The No Grammar row lists the error rates without postprocessing; the rows Least Word, Word Freq., and IWCB show the results for the Least;-Word, Word-Frequency, and Inter-word Character Bi-gram models; and tire NC rows show the results for discovered class bigram models with different nnmbers of classes. We observe from Table 2 that: * Our discovered class bigram model out-performed the other three models in general. The order of performance is: NC = 200 &gt; IWCI3 &gt; Wt ~' &gt; LW. The average error rates are - Kecognizer: 3.14%, LW:2.76%, WF:1.29%, lWCB:I.10%, and NC = 200: 0.82%.</Paragraph>
      <Paragraph position="4"> In other words, our NC = 200 rednced the error rate by 73.89%, while IWCB reduced it by 64.97%,  WF by 58.92%, and LW by 12.10%. Note that a 0.27% average of the characters arc always wrong; that; is, the least error rate is 9.27%. le, xcluding these characters, the NC = 200 model reduced the error rate by 80.84%! * The l,east-word model is not sufficient (it has negative gain for CQ10), and the Word-frequency model is much better, reducing the error rates by more than Iifty percent.</Paragraph>
      <Paragraph position="5"> * Our model outperformed the powerful \[WCB model, except for CQ20. The difference of CQ20 performance is just 0.05%, while our model out-performed IWCB by much larger margins, 0.51% and 0.4:3%, tbr CQ10 and CQ30, respeetiw~ly. Besides, the storage requirement of otlr model is much less than that of 1WCB model.</Paragraph>
      <Paragraph position="6"> * The IWCB model usually corrects more errors than ours, while it also commits much more OX mistakes.</Paragraph>
      <Paragraph position="7"> * The optimal NC vahtes for the discovered class bigram models are 200 for CQ10 and CQ20, and 150 for CQ30. This is consistent to the common rule of thumb: the size of training data should be at least ten times the number of parameters, which suggests a NC value of 189 for the size of the ctay7 corpus (355,347 words).</Paragraph>
      <Paragraph position="8"> The N(; = 500 models are apparently overtrained, which is consistent to the evaluation of test t(,t perplexities we discussed in (?hang and Chen (1993b).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>