<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1074"> <Title>Hybrid Neuro and Rule-Based Part of Speech Taggers</Title> <Section position="4" start_page="509" end_page="510" type="metho"> <SectionTitle> 3 Hybrid System </SectionTitle> <Paragraph position="0"> Our hybrid system (Fig. 1) consists of a neuro tagger, which is used as an initial-state annotator, and a rule-based corrector, which corrects the outputs of the neuro tagger. When a word sequence W_t [see Eq. (2)] is given, the neuro tagger first outputs a tagging result τ_N(W_t) for the target word w_t. The rule-based corrector then corrects the output of the neuro tagger as a fine tuner and gives the final tagging result</Paragraph> <Section position="1" start_page="509" end_page="510" type="sub_section"> <SectionTitle> 3.1 Neuro tagger </SectionTitle> <Paragraph position="0"> As shown in Fig. 2, the neuro tagger consists of a three-layer perceptron with elastic input.</Paragraph> <Paragraph position="1"> This section mainly describes the construction of the input and output of the neuro tagger, and the elasticity by which it becomes possible to use a variable length of context for tagging. For details of the architecture of the perceptron see, e.g., Haykin, 1994, and for details of the features of the neuro tagger see Ma and Isahara, 1998 and Ma, et al., 1999.</Paragraph> <Paragraph position="2"> Input IPT is constructed from word sequence W_t [Eq. (2)], which is centered on target word w_t and has length l + 1 + r:</Paragraph> <Paragraph position="4"> provided that input length l + 1 + r has elasticity, as described at the end of this section. When word w is given in position x (x = t-l, ..., t+r),</Paragraph> </Section> </Section> <Section position="5" start_page="510" end_page="511" type="metho"> <SectionTitle> OPT </SectionTitle> <Paragraph position="0"> its input ipt_x is defined as ipt_x =
(e_{w,1}·g_x, ..., e_{w,γ}·g_x), (4) where g_x is the information gain, which can be obtained using information theory (for details see Ma and Isahara, 1998), and γ is the number of types of POSs. If w is a word that appears in the training data, then each bit e_{w,i} can be obtained: e_{w,i} = Prob(τ_i|w), (5) where Prob(τ_i|w) is the prior probability of τ_i that the word w can take. It is estimated from the training data:</Paragraph> <Paragraph position="2"> where C(τ_i, w) is the number of times both τ_i and w appear, and C(w) is the number of times w appears in the training data. If w is a word that does not appear in the training data, then each bit e_{w,i} is obtained:</Paragraph> <Paragraph position="4"> where γ_w is the number of POSs that the word w can take. Output OPT is defined as OPT = (o_1, ..., o_γ), (8) provided that the output OPT is decoded as</Paragraph> <Paragraph position="6"> where τ_N(W_t) is the tagging result obtained by the neuro tagger.</Paragraph> <Paragraph position="7"> There is more information available for constructing the input for words on the left, because they have already been tagged. In the tagging phase, instead of using (4)-(6), the input can be constructed simply as</Paragraph> <Paragraph position="9"> where i = 1, ..., l, and OPT(-i) means the output of the tagger for the i-th word before the target word. However, in the training process, the output of the tagger is not always correct and cannot be fed back to the inputs directly.</Paragraph> <Paragraph position="10"> Instead, a weighted average of the actual output and the desired output is used:</Paragraph> <Paragraph position="12"> where E_OBJ and E_ACT are the objective and actual errors. Thus, at the beginning of training, the weighting of the desired output is large.
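The per-word encoding of Eqs. (5)-(7) amounts to a smoothed POS distribution for each word, scaled by the information gain of its position. A minimal sketch under those definitions (not the authors' implementation; the function and argument names are hypothetical):

```python
from collections import Counter

GAMMA = 47  # number of POS types in the corpus

def encode_word(word, counts, info_gain, pos_types):
    """Build the input vector ipt_x for one word.

    counts: Counter of (pos, word) pairs from the training data.
    info_gain: g_x, the information gain of this input position.
    pos_types: for unknown words, the candidate POSs the word can take.
    """
    total = sum(counts[(pos, word)] for pos in range(GAMMA))
    if total > 0:
        # Known word: e_{w,i} = Prob(tau_i | w) = C(tau_i, w) / C(w)
        e = [counts[(pos, word)] / total for pos in range(GAMMA)]
    else:
        # Unknown word: uniform weight 1/gamma_w over its candidate POSs
        e = [1.0 / len(pos_types) if pos in pos_types else 0.0
             for pos in range(GAMMA)]
    # Each bit is weighted by the information gain of the position
    return [ei * info_gain for ei in e]
```

A missing key in a `Counter` simply counts as zero, which keeps the known/unknown-word branch test compact.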
It decreases to zero during training.</Paragraph> <Paragraph position="13"> Elastic inputs are used in the neuro tagger so that the length of context is variable in tagging, based on longest-context priority. In detail, (l, r) is initially set as large as possible for tagging. If τ_N(W_t) = Unknown, then (l, r) is reduced by some constant interval. This process is repeated until τ_N(W_t) ≠ Unknown or (l, r) = (0, 0). On the other hand, to make the same set of connection weights of the neuro tagger with the largest (l, r) available as much as possible when using short inputs for tagging, in the training phase the neuro tagger is regarded as a neural network that has gradually grown from a small one. The training is therefore performed step by step from small networks to large ones (for details see Ma, et al. 1999).</Paragraph> <Section position="1" start_page="511" end_page="511" type="sub_section"> <SectionTitle> 3.2 Rule-based corrector </SectionTitle> <Paragraph position="0"> Even when the POS of a word can be determined with certainty by only the word on the left, for example, the neuro tagger still tries to tag based on the complete context. That is, in general, what the neuro tagger can easily acquire by learning is the rules whose conditional parts are constructed by all inputs ipt_x</Paragraph> <Paragraph position="2"> It is difficult for the neuro tagger to learn rules whose conditional parts are constructed by only a single input like (ipt_x → OPT)1). Also, although lexical information is very important in tagging, it is difficult for the neuro tagger to use it, because doing so would make the network enormous. That is, the neuro tagger cannot acquire rules whose conditional parts consist of lexical information like (w → OPT), (w ∧ τ → OPT), and (w_1 ∧ w_2 → OPT), where w, w_1, and w_2 are words and τ is the POS.
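Rules of exactly this single-input and lexicalized form, hard for the network to learn, are trivial to represent and apply symbolically, which is what motivates the corrector. A minimal sketch of applying an ordered rule list (hypothetical names and example tags, not the authors' code):

```python
def apply_rules(words, tags, rules):
    """Apply an ordered list of transformation rules to a tagged sequence.

    Each rule is (condition, from_tag, to_tag), where condition(words,
    tags, i) tests the context of position i, e.g. a lexicalized
    condition such as "left word is w_1".
    """
    tags = list(tags)
    for condition, from_tag, to_tag in rules:  # strict rule order
        for i in range(len(words)):
            if tags[i] == from_tag and condition(words, tags, i):
                tags[i] = to_tag
    return tags

# Hypothetical example rule: change tag "VATT" to "NCMN"
# when the left word is "khong"
rule = (lambda ws, ts, i: i > 0 and ws[i - 1] == "khong", "VATT", "NCMN")
```

Because the rules are applied in the order they were learned, a later rule may further correct the output of an earlier one, mirroring the repetitive correction process described for the corrector.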
Furthermore, because of convergence and over-training problems, it is impossible and also not advisable to train neural nets to an accuracy of 100%. The training should be stopped at an appropriate level of accuracy. Thus, the neural net may not acquire some useful rules.</Paragraph> <Paragraph position="3"> The transformation rule-based corrector makes up for these crucial shortcomings.</Paragraph> <Paragraph position="4"> The rules are acquired from a training corpus using a set of transformation templates by transformation-based error-driven learning (Brill, 1994). The templates are constructed using only those that supply the rules that the neuro tagger can hardly acquire, i.e., those 1)The neuro tagger can also learn this kind of rules because it can tag the word using only ipt_t (the input of the target word), in the case of reducing the (l, r) to (0, 0), as described in Sec. 3.1. The rules with single input described here, however, are a more general case, in which the input can be ipt_x (x = t-l, ..., t+r). for acquiring the rules with single input, with lexical information, and with AND logical input of POSs and lexical information. The set of templates is shown in Table 1 2).</Paragraph> <Paragraph position="5"> According to the learning procedure shown in Table 2, an ordered list of transformation rules is acquired by applying the template set to a training corpus, which has already been tagged by the neuro tagger. After the transformation rules are acquired, a corpus is tagged as follows. It is first tagged by the neuro tagger.
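The core of the learning procedure in Table 2 is the scoring in step 4, which trades corrections off against newly introduced errors. A minimal sketch of that step, assuming the corpus is held as parallel current/desired tag lists (hypothetical names, not the authors' code):

```python
def score_rule(rule, words, current_tags, desired_tags, h=100):
    """Score a candidate rule: cnt_good - h * cnt_bad (Table 2, step 4).

    cnt_good counts corrections (a wrong tag becomes right); cnt_bad
    counts damage (a right tag becomes wrong). A large h, e.g. 100,
    makes rule generation strict, as in the fine-tuning corrector.
    """
    condition, from_tag, to_tag = rule
    cnt_good = cnt_bad = 0
    for i in range(len(words)):
        if current_tags[i] == from_tag and condition(words, current_tags, i):
            if desired_tags[i] == to_tag:
                cnt_good += 1
            elif desired_tags[i] == from_tag:
                cnt_bad += 1
    return cnt_good - h * cnt_bad
```

Learning would repeatedly pick the highest-scoring rule, apply it to the corpus, and stop once no rule scores positively, matching steps 4-7 of the procedure.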
The tagged corpus is then corrected by using the ordered list of transformation rules.</Paragraph> <Paragraph position="6"> The correction is a repetitive process of applying the rules in order to the corpus, which is then updated, until all rules have been applied.</Paragraph> </Section> </Section> <Section position="6" start_page="511" end_page="513" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> Data: For our computer experiments, we used the same Thai corpus used by Ma et al. (1999).</Paragraph> <Paragraph position="1"> Its 10,452 sentences were randomly divided into two sets: one with 8,322 sentences for training and the other with 2,130 sentences for testing. The training set contained 124,331 words, of which 22,311 were ambiguous; the testing set contained 34,544 words, of which 6,717 were ambiguous. For training the neuro tagger, only the ambiguous words in the training set were used. For training the HMM, all the words in the training set were used. In both cases, all the words in the training set were used to estimate Prob(τ_i|w), the probability of τ_i that word w can take (for details on the HMM, see Ma, et al., 1999). In the corpus, 47 types of POSs are defined (Charoenporn et al., 1997); i.e., γ = 47.</Paragraph> <Paragraph position="2"> Neuro tagger: The neuro tagger was constructed by a three-layer perceptron whose input-middle-output layers had p, 2γ, and γ units, respectively, where p = γ × (l + 1 + r). The (l + 1 + r) had the following elasticity. In training, the (l, r) was increased step by step as (1, 1)</Paragraph> <Paragraph position="4"> training from a small to a large network was performed. In tagging, on the other hand, the 2)To see whether this set is suitable, a number of additional experiments were conducted using various sets of templates. The details are described in Sec.
4.</Paragraph> <Paragraph position="5"> Table 1: Set of templates for transformation rules. Change tag τ_a to tag τ_b when: (single input) (input consists of a POS) 1. left (right) word is tagged τ.</Paragraph> <Paragraph position="6"> 2. second left (right) word is tagged τ. 3. third left (right) word is tagged τ. (input consists of a word) 4. target word is w.</Paragraph> <Paragraph position="7"> 5. left (right) word is w.</Paragraph> <Paragraph position="8"> 6. second left (right) word is w.</Paragraph> <Paragraph position="9"> (AND logical input of words) 7. target word is w_1 and left (right) word is w_2. 8. left (right) word is w_1 and second left (right) word is w_2. 9. left word is w_1 and right word is w_2. (AND logical input of POS and words) 10. target word is w_1 and left (right) word is tagged τ. 11. left (right) word is w_1 and left (right) word is tagged τ. 12. target word is w_1, left (right) word is w_2, and left (right) word is tagged τ. Table 2: Procedure for learning transformation rules. 1. Apply neuro tagger to training corpus, which is then updated. 2. Compare tagged results with desired ones and find errors. 3. Match templates for all errors and obtain set of transformation rules. 4. Select rule in corpus with the maximum value of (cnt_good - h · cnt_bad), where cnt_good: number that transforms incorrect tags to correct ones; cnt_bad: number that transforms correct tags to incorrect ones; h: weight to control the strictness of generating the rule. 5. Apply selected rule to training corpus, which is then updated. 6. Append selected rule to ordered list of transformation rules. 7. Repeat steps 2 through 6 until no such rule can be selected, i.e., cnt_good - h ·
cnt_bad < 0.</Paragraph> <Paragraph position="10"> (l, r) was inversely reduced step by step as (3, 3)</Paragraph> <Paragraph position="12"> (0, 0) as needed, provided that the number of units in the middle layer was kept at the maximum value.</Paragraph> <Paragraph position="13"> Rule-based corrector: The parameter h in the evaluation function (cnt_good - h · cnt_bad) used in the learning procedure (Table 2) is a weight to control the strictness of generating a rule. If h is large, the weight of cnt_bad is large and the possibility of generating incorrect rules is reduced. By regarding the neuro tagger as already having high accuracy and using the rule-based corrector as a fine tuner, weight h was set to a large value, 100. Applying the templates to the training corpus, which had already been tagged by the neuro tagger, we obtained an ordered list of 520 transformation rules. Table 3 shows the first 15 transformation rules.</Paragraph> <Paragraph position="14"> Results: Table 4 shows the results of POS tagging for the testing data. In addition to the accuracy of the neuro tagger and hybrid system, the table also shows the accuracy of a baseline model, the HMM, and a rule-based model for comparison. The baseline model is one that performs tagging without using contextual information; instead, it performs tagging using only frequency information: the probability of the POS that each word can take. The rule-based model, to be exact, is also a hybrid system</Paragraph> </Section> <Section position="7" start_page="513" end_page="514" type="metho"> <SectionTitle> 1 PREL 2 PREL 3 Unknown 4 XVHI4 5 VATT 6 Unknown 7 NCI4N 8 VATT </SectionTitle> <Paragraph position="0"> consisting of an initial-state annotator and a set of transformation rules. As the initial-state annotator, however, the baseline model is used instead of the neuro tagger. And its rule set
has 1,177 transformation rules acquired from a more general template set, which is described at the end of this section. The reason for using a general template set is that the set of transformation rules in the rule-based model should be the main annotator, not a fine post-processing tuner. For the same reason, the parameter to control the strictness of generating a rule, h, was set to a small value, 1, so that a larger number of rules were generated.</Paragraph> <Paragraph position="1"> As shown in the table, the accuracy of the neuro tagger was far higher than that of the HMM and higher than that of the rule-based model. The accuracy of the rule-based model, on the other hand, was also far higher than that of the HMM, although it was inferior to that of the neuro tagger. The accuracy of the hybrid system was 1.1% higher than that of the neuro tagger. Actually, the rule-based corrector corrected 88.4% and 19.7% of the errors made by the neuro tagger for the training and testing data, respectively.</Paragraph> <Paragraph position="2"> Because the template set shown in Table 1 was designed only to make up for the shortcomings of the neuro tagger, the set is small compared to that used by Brill (1994). To see whether this set is large enough for our system, we performed two additional experiments in which (1) a set constructed by adding the templates with OR logical input of words to the original set and (2) a set constructed by further adding the templates with AND and OR logical inputs of POSs to the set of case (1) were used. The set used in case (2) included the set used by Brill (1994) and all the sets used in our experiments. It was also used for acquiring the transformation rules in the rule-based model.</Paragraph> <Paragraph position="3"> The experimental results show that compared to the original case, the accuracy in case (1) was improved very little and the accuracy in case (2) was also improved by only 0.03%.
These results show that the original set is nearly large enough for our system.</Paragraph> <Paragraph position="4"> To see whether the set is suitable for our system, we performed an additional experiment using the original set in which the templates with OR logical inputs were used instead of the templates with AND logical inputs. The accuracy dropped by 0.1%. Therefore, the templates with AND logical inputs are more suitable than those with OR logical inputs.</Paragraph> <Paragraph position="5"> We also performed an experiment using a template set without lexical information. In this case, the accuracy dropped by 0.9%, indicating that lexical information is important in tagging. To determine the effect of using a large h for generating rules, we performed an experiment with h = 1. In this case, the accuracy dropped by only 0.045%, an insignificant difference compared to the case of h = 100.</Paragraph> <Paragraph position="6"> By examining the acquired rules that were obtained by applying the most complete template set, i.e., the set used in case (2) described above, we found that 99.9% of them were those that can be obtained by applying the original set of templates. That is, the acquired rules were almost all those that are difficult for the neuro tagger to acquire. This reinforced our expectation that the rule-based approach is a well-suited method to cope with the shortcomings of the neuro tagger.</Paragraph> <Paragraph position="7"> Finally, it should be noted that in the literature, tagging accuracy is usually defined by counting all the words regardless of whether they are ambiguous or not. If we used this definition, the accuracy of our hybrid system would be 99.1%.</Paragraph> </Section> </Paper>