<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1051"> <Title>Committee-based Decision Making in Probabilistic Partial Parsing</Title> <Section position="7" start_page="350" end_page="353" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="350" end_page="351" type="sub_section"> <SectionTitle> 4.1 Settings </SectionTitle> <Paragraph position="0"> We conducted experiments using the following five statistical parsers, including a parser based on maximum entropy estimation, the parser KANA, and a top-down model incorporating lexical collocation statistics. Since the dependency score matrices given by KANA have no probabilistic semantics, we normalized them for each row using a certain function manually tuned for this parser. Equation (1) was used for estimating DPs.</Paragraph> <Paragraph position="1"> * Peach Pie Parser (Uchimoto et al., 1999): a bottom-up model based on maximum entropy estimation.</Paragraph> <Paragraph position="2"> Note that these models were developed fully independently of each other and have significantly different characters (for a comparison of their performance, see Table 1). In what follows, these models are referred to anonymously. For the source of the training/test set, we used the Kyoto corpus (ver.2.0) (Kurohashi et al., 1997), which is a collection of Japanese newspaper articles annotated in terms of word boundaries, POS tags, BP boundaries, and inter-BP dependency relations. The corpus originally contained 19,956 sentences. To make the training/test sets, we first removed all the sentences that were rejected by any of the above five parsers (3,146 sentences). For the remaining 16,810 sentences, we next checked the consistency of the BP boundaries given by the parsers, since they had slightly different criteria for BP segmentation from each other. In this process, we tried to recover as many inconsistent boundaries as possible.
For example, we found there were quite a few cases where a parser recognized a certain word sequence as a single BP, whereas some other parser recognized the same sequence as two BPs. In such a case, we regarded that sequence as a single BP under a certain condition. As a result, we obtained 13,990 sentences that can be accepted by all the parsers with all the BP boundaries consistent.2 We used this set for training and evaluation.</Paragraph> <Paragraph position="3"> For closed tests, we used 11,192 sentences (66,536 BPs3) for both training and tests.</Paragraph> <Paragraph position="4"> For open tests, we conducted five-fold cross-validation on the whole sentence set.</Paragraph> <Paragraph position="5"> 2In the BP concatenation process described here, quite a few trivial dependency relations between neighboring BPs were removed from the test set. This made our test set slightly more difficult than it should have been.</Paragraph> <Paragraph position="6"> 3This is the total number of BPs excluding the right-most two BPs of each sentence. Since, in Japanese, a BP always depends on a BP following it, the right-most BP of a sentence does not depend on any other BP, and the second right-most BP always depends on the right-most BP. Therefore, they were not treated as subjects of evaluation.</Paragraph> <Paragraph position="7"> For the classification of problems, we manually established the following twelve classes, each of which is defined in terms of a certain morphological pattern of depending BPs: 1. nominal BP with a case marker &quot;wa (topic)&quot; 2. nominal BP with a case marker &quot;no (POS)&quot; 3. nominal BP with a case marker &quot;ga (NOM)&quot; 4. nominal BP with a case marker &quot;wo (ACC)&quot; 5. nominal BP with a case marker &quot;ni (DAT)&quot; 6.
nominal BP with a case marker &quot;de (LOC/...)&quot;</Paragraph> </Section> <Section position="2" start_page="351" end_page="353" type="sub_section"> <SectionTitle> 4.2 Results and discussion </SectionTitle> <Paragraph position="0"> Table 1 shows the total/11-point accuracy of each individual model. The performance of the individual models ranged widely, from 0.96 down to 0.86 in 11-point accuracy. Remember that A is the optimal model, and there are two second-best models, B and C, which are closely comparable.</Paragraph> <Paragraph position="1"> In what follows, we use these scores as the baseline for evaluating the error reduction achieved by organizing a committee.</Paragraph> <Paragraph position="2"> The performance of various committees is shown in Figures 4 and 5. Our primary interest here is whether the weighting functions presented above effectively contribute to error reduction. According to those two figures, although the contribution of the function Normal was not very visible, the function Class consistently improved the accuracy. These results are good evidence for the important role of weighting functions in combining parsers.</Paragraph> <Paragraph position="3"> While we manually built the problem classification in our experiment, automatic classification techniques will obviously also be worth considering. We then conducted another experiment to examine the effects of multiple voting. One can straightforwardly simulate a single-voting committee by replacing w_ij in equation (7) with a single-voting weight w^s_i.</Paragraph> <Paragraph position="5"> The results are shown in Figure 7, which compares the original multiple-voting committees and the simulated single-voting committees.</Paragraph> <Paragraph position="6"> Clearly, in our settings, multiple voting significantly outperformed single voting, particularly when the size of a committee is small.</Paragraph> <Paragraph position="7"> The next issues are whether a committee always outperforms its individual members, and if not, what should be considered in organizing a committee. Figures 4 and 5 show that committees not including the optimal model A achieved extensive improvements, whereas the merit of organizing committees including A is not very visible. This can be partly attributed to the fact that the competence of the individual members diverged widely, and A significantly outperforms the other models.</Paragraph> <Paragraph position="8"> Given the good error reduction achieved by committees containing comparable members such as BC, BD, and BCD, however, it should be reasonable to expect that a committee including A would achieve a significant improvement if another nearly optimal model were also incorporated. To empirically prove this assumption, we conducted another experiment, where we added another parser, KNP (Kurohashi et al., 1994), to each committee that appears in Figure 4. KNP is much closer to model A in total accuracy than the other models (0.8725 in total accuracy). However, it does not provide DP matrices, since it is designed in a rule-based fashion; the current version of KNP provides only the best-preferred parse tree for each input sentence, without any scoring annotation. We thus let KNP simply vote with its total accuracy. The results are shown in Figure 6.
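The contrast between multiple and single voting described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: the function and variable names are ours, the uniform weights stand in for the weighting functions of equation (7), and the DP distributions are toy numbers. A parser like KNP, which offers no DP matrix, corresponds to a single-voting member that casts its whole weight for its one preferred candidate.

```python
# Minimal sketch (our own construction, not the paper's code) of combining
# dependency-probability (DP) distributions from several parsers.

def multiple_vote(dp_rows, weights):
    """Multiple voting: weighted sum of each parser's full DP distribution.

    dp_rows: one dict per parser mapping candidate head position -> probability.
    weights: one weight per parser (stand-in for w_ij in equation (7)).
    Returns the candidate head with the highest combined score.
    """
    scores = {}
    for dp, w in zip(dp_rows, weights):
        for head, prob in dp.items():
            scores[head] = scores.get(head, 0.0) + w * prob
    return max(scores, key=scores.get)

def single_vote(dp_rows, weights):
    """Single voting: each parser casts its whole weight for its top candidate."""
    scores = {}
    for dp, w in zip(dp_rows, weights):
        best = max(dp, key=dp.get)
        scores[best] = scores.get(best, 0.0) + w
    return max(scores, key=scores.get)

# Three parsers judging which following BP a given BP depends on.
dp_rows = [
    {2: 0.6, 3: 0.4},
    {2: 0.4, 3: 0.6},
    {3: 0.55, 2: 0.45},
]
weights = [1.0, 1.0, 1.0]

print(multiple_vote(dp_rows, weights))  # candidate 3: 0.4 + 0.6 + 0.55 = 1.55
print(single_vote(dp_rows, weights))    # candidate 3: two votes against one
```

Note that the two schemes can disagree: a parser that is only mildly confident still contributes its full distribution under multiple voting, whereas under single voting its second-ranked candidates are discarded, which is one way to read the observed advantage of multiple voting for small committees.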
This time, all the committees achieved significant improvements, with the maximum error reduction rate in the 30% range. As suggested by the results of this experiment with KNP, our scheme allows a rule-based, non-parametric parser to take part in a committee while preserving the committee's ability to output parametric DP matrices. To push the argument further, suppose a plausible situation where we have an optimal but non-parametric rule-based parser and several suboptimal statistical parsers. In such a case, our committee-based scheme may be able to organize a committee that can provide DP matrices while preserving the original total accuracy of the rule-based parser. To see this, we conducted another small experiment, where we combined KNP with each of C and D, both of which are less competent than KNP. The resulting committees successfully provided reasonable P-A curves, as shown in Figure 8, while even further improving the original total accuracy of KNP (0.8725 to 0.8868 for CF and 0.8860 for DF). Furthermore, the committees also improved the 11-point accuracy over C and D (0.9291 to 0.9600 for CF and 0.9266 to 0.9561 for DF).</Paragraph> <Paragraph position="9"> These results suggest that our committee-based scheme does work even if the most competent member of a committee is rule-based and thus non-parametric.</Paragraph> </Section> </Section> </Paper>