File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/00/c00-1019_evalu.xml
Size: 7,058 bytes
Last Modified: 2025-10-06 13:58:33
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1019"> <Title>Figure 2: Sample Clusters (French/English word pairs): OUVRIERES/LABOUR, FAÇON/EVENT, ÉVIDENCE/CLEARLY, EVIDENCE/OBVIOUSLY, HOMMES/POLITICIANS, PRISONNIERS/PRISONERS, RETOUR/BACK, REVENIR/BACK, CONVENU/AGREED, SIGNE/SIGNED, VU/SEEN, AGRICOLE/AGRICULTURE, ENTIER/AROUND, ENTIER/THROUGHOUT, OCCIDENTAL/WESTERN, AVEUGLES/BLIND, CHAUSSURES/SHOES, CONSTRUCTEURS/BUILDERS, PENSIONNÉS/PENSIONERS, RETRAITES/PENSIONERS, VETEMENTS/CLOTHING, POISSON/FISH, PORC/PORK</Title> <Section position="7" start_page="128" end_page="129" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The method described in this paper does (subjectively) a very good job of clustering like words together, and using the clusters to generalize EBMT gives a considerable boost to the performance of the EBMT system. Figure 2 shows a sampling of the smaller clusters generated from 1.1 million words of Hansard text. While the members of a cluster are often semantically linked (as in cluster 848, which contains types of political parties, or cluster stag), they need not be. Those clusters whose members are not semantically linked generally contain words which are all the same part of speech, number, and gender (as in cluster 2472, which contains exclusively plural nouns), but as will be discussed in the next section, even those clusters whose members are totally unrelated may be useful and correct. One fairly common occurrence among the smaller clusters is that various synonymous translations of a word (from either source or target language) will cluster together, as in cluster 1652. 
This is particularly useful when the target-language word is the same, as this allows various ways of expressing the same thing to be translated when any of those forms are present in the training corpus.</Paragraph> <Paragraph position="1"> Figure 3 shows how adding automatically-generated equivalence classes substantially increases the coverage of the EBMT system. Alternatively, much less text is required to reach a specific level of coverage. The lowest curve in the graph is the percentage of the 45,000-word test text for which the EBMT system was able to generate translations when using strict lexical matching against the training corpus. The top-most curve shows the best performance, previously achieved using both a large set of equivalence classes (in the form of tagged entries from the ARTFL dictionary) and a production-rule grammar (Brown, 1999). Of the two center curves, the lower is the performance when generalizing the training corpus using the equivalence classes which were automatically generated from that same text, and the upper shows the performance using clustering with the 425 seed pairs.</Paragraph> <Paragraph position="2"> As can be seen in Figure 3, 80% coverage of the test text is achieved with less than 300,000 words using manually-created generalization information and with approximately 300,000 words when using automatically-created generalization information, but requires 1.2 million words when not using generalization. 90% coverage is reached with less than 500,000 words using manually-created information</Paragraph> <Paragraph position="3"> and should be reached with less than 1.2 million words using automatically-created generalization information, versus 7 million words without generalization. This reduction by a 
factor of four to five in the amount of text is accomplished with little or no degradation in the quality of the translations. Adding a small amount of knowledge in the form of 425 seed pairs reduces the required training text even further; this can largely be attributed to the merging of clusters which would otherwise have remained distinct, thus increasing the level of generalization. Adding the production-rule grammar to the seeded clustering had little effect. When using more than 50,000 words of training text, the increase in coverage from adding the grammar was negligible, and even with the smallest training corpora, the increase was very modest.</Paragraph> <Paragraph position="4"> Using the same thresholds that were used in the fully-automatic case, clustering on 1.1 million words expands the initial 425 word pairs in 37 clusters to 4200 word pairs, and adds an additional 555 word pairs in 140 further nontrivial clusters. This compares very favorably to the 3506 word pairs in 221 clusters found without seeding. Figure 3: EBMT Performance with and without Generalization (curves: automatic clustering; clustering w/425 seeds; full manual generalization)</Paragraph> <Paragraph position="5"> The program also runs reasonably quickly.</Paragraph> <Paragraph position="6"> The step of creating context term vectors converts approximately 500,000 words of raw text per minute on a 300 MHz processor. 
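To make the vector-building step more concrete, the following is a minimal Python sketch, not the program's actual code. It assumes that a context term vector simply counts the words occurring within a fixed window around each occurrence of a target word. The function name `context_vectors`, the window size, the unweighted counts, and the use of single words rather than word pairs as keys are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def context_vectors(tokens, targets, window=3):
    """Count the words seen within `window` positions of each target word.

    Hypothetical helper: the window size and the plain (unweighted) counts
    are assumptions made for illustration only.
    """
    vectors = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                vectors[tok][tokens[j]] += 1
    return vectors


text = "the house passed the bill and the senate passed the motion".split()
vecs = context_vectors(text, {"bill", "motion"}, window=2)
print(dict(vecs["bill"]))
print(dict(vecs["motion"]))
```

In this toy input, "bill" and "motion" share the context terms "passed" and "the", which is the kind of overlap a later clustering step could use to group them.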
For agglomerative clustering, the processing time is roughly quadratic in the number of word pairs, with a theoretical cubic worst case; the 17,527 distinct word pairs found from the million-word training corpus require about 25 minutes to cluster.</Paragraph> </Section> <Section position="8" start_page="129" end_page="129" type="evalu"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> One statement made earlier deserves clarification: the members of a cluster need not be related to each other in any way, either syntactically or semantically, for a cluster to be useful and correct. This is because (absent a grammar) we do not care about the features of the words in the cluster, only whether their translations follow the same pattern.</Paragraph> <Paragraph position="1"> An illustration based on actual experience is useful here. In early testing of the group-average clustering algorithm with seeding, the &lt;conjunction&gt; seed class of &quot;and&quot; and &quot;or&quot; was used. Clustering augmented this seed class with &quot;,&quot; (comma), &quot;in&quot;, and &quot;by&quot;. One can easily see that the comma is a valid member of the class, since it takes the place of &quot;and&quot; in lists of items. But what about &quot;in&quot; and &quot;by&quot;, which are prepositions rather than conjunctions? If one considers the translation pattern</Paragraph> <Paragraph position="3"> it becomes clear that all of the terms in the expanded class give a correct translation when placed in the blank in this pattern. Indeed, one could imagine a production-rule grammar geared toward taking advantage of such common translation patterns regardless of conventional linguistic features.</Paragraph> </Section> </Paper>