<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1019"> <Title>OUVRIERES LA BOUR FA(,J()N h;VENT P * 17' I~VIDLNCL CLEARLY EVIDh;NC\]'; OBVIO USIN HOMMF, S POIATICIANS PRISONNIFJ{S PR/SONEI{S RETOUR, BA.CII(, REVENIR BACK CONVENU AGREED SIGNE SIGNEI) VU SEEN AGRJCOLE AGR1C UL'I'URE ENT'IER AROUN\]) E N T I ER T Ill RO U G I\] O U T OCCIDENTAL WESTERN AVIDUGLI~S BI,IND CIIA.USSURI'2S SI-IOES CONSTRUC;I'EURS BUILDh;RS PENSIONN, F,S PENSIONERS RISTRAITES PENSIONERS VETEMENTS CLOTHING POISSON FISI\] PORC IK)RK Figure 2: Sanli)le Chlsters</Title> <Section position="5" start_page="126" end_page="127" type="metho"> <SectionTitle> 3 Clustering Approaches </SectionTitle> <Paragraph position="0"> A tota.l of six clustering a.lgoHthms ha.v(~ I)oen 1.ested; th roe variants of grout)-a.vora.go. ('\]tlsl.('.,'ins a.nd i, hree of agglomera.tive clustering. Incl'omental group-a.vera.ge clustering was ilnplemented tirst, to provide a. proof of concopt, borore the COml)uta.tiona.lly more expensive a.gglomerative (bottom-up) clusteril~g was i lnplemented. null The incremental groul)-a.vera.ge a.lgoril;hms all exa.mine each word pair in turn, computing a similsu:ity measure to evory existing clustor. If th(; 1)(;st siinila.rity measur(; is a l)ov(~ a. l)r(;del;er nfin('d threshold, the new word pair is i)laced in tile corresponding cluster; otherwis% a now (;\]usi;er is crea.ted. The th roe varianl;s diltT, r only in tile simila.rity moasure eml)loyed: :1. cosin(; similarity 1)(;1;w(~(;n 1,h(~ i)s(;u(lo<locumonl, a.nd the centroid o1&quot; the oxisting clus- null ter (standard grOUl)-a.vera.ge clusto.rillg;) 2. a.verage of' i;\]lo cosine similaril;ies l)otwe(;n the l)seudo-docuni(;nl; a.nd all nl(;nll)ers o\[' the 0xisting (:lust(;,' (a.voragc-link clustoring) null 3. square root of' 1;h(; a.vcrag(; of 1;lie S(luared cosine simila.r\]l;io.s I)ctweon l;he l)seudo(locuinent an(\] all molnl)(~,'s or l he existing ('hlster (rool.-nloa.n-sqllar(, nlo(lifical.ion of average-liNlC/ clustering) Thoso i;hro(~ vnria.tiol,S give hlc,'eas\]ngly IIl()l'(': weight to 1,ho nea.rer mcml)ers of' tho oxist.ing cl ust;cr.</Paragraph> <Paragraph position="1"> Tim t)o(;1;oin-u 1) a.gglomera.tive algoril;hms all funcl;ion I)y (;tea.tills a. clustor For each I)Seudo(\[o(:unlenl,, t;hon r(;i)(;a.1;(;(lly ln(u:ging l:li(; two clusl;ors witli the \]iighesl; siinila.ril,y score unl,il 110 (,WO C\]tlS|,orH \]lSt,vo ,% ,q\]iilila,ril;y .~(:Ol'(~ (~x('.(~(;ding a l)re(Iol;ornlino(\] 1;hl:eshold. The three vari-;/,IIi;S }/,ga, ill differ ()lily ill 1;lio S\]liiilaril,y lllO}lStll'O O llll)loyc(l: \]. cosine simila.rity between clustor centroids (st~ul(la.rd agglomei:a.tivo clustering) 2. a.vera.ge of cosine sitnilariLy 1)etween men> l)ers of the two clusters (a.vera.ge-tink) 3. nia.xilnal cosino similarity betweon a.ny pair Of ni('.nll)oi:s of l,\]ie i;wo clusl;(',rs (single-lin\]{) l&quot;oi: (;acli of the va.i:ia.tions a.bovc, the l)r(~(l(;1,er niincd (;hreshol(I is a. funci;ion of word \['r(xluoncy. Two words wliich each a.l)l)ea.r only onc(Y in the entire tra.ining text a.nd ha.re a. high simila.rib, score a.ro more likely to ha.re a.l)l)ea.red in siniila.r contexl;s I)y cohicide.nce l:ha.n 1;wo wor(ls which each a,1)pea.r ill 1;he traJliil/g 1;(;xi; lifty tin-its. thor side as context, a.nd a. linca.r dcca.y in t;erm weights, two singleton words achievo a. 
<Paragraph position="2"> The threshold function is expressed in terms of the frequency of occurrence in the training texts. For single, unclustered word pairs, the frequency is simply the number of times the word pair was encountered. When performing group-average clustering, the frequency assigned to a cluster is the sum of the frequencies of all the members; for agglomerative clustering, the frequency is the sum when using centroids and the maximum frequency among the members when using the average or nearest-neighbor similarity. The value of the threshold for a given pair of clusters is the value of the threshold function at the lower word frequency. Figure 1 shows the threshold function used in the experiments whose results are reported here; clustering is only allowed if the similarity measure is above the indicated threshold value.</Paragraph> <Paragraph position="3"> On its own, clustering is quite successful for generalizing EBMT examples, but the fully-automated production of clusters is not compatible with adding a production-rule grammar as described in (Brown, 1999). Therefore, the clustering process may be seeded with a set of manually-generated clusters.</Paragraph> <Paragraph position="4"> When seed clusters are available, the clustering process is modified in two ways. First, the group-average approaches add an initial cluster for each seed cluster and the agglomerative approaches add an initial cluster for each word pair; these initial clusters are tagged with the name of the seed cluster. Second, whenever a tagged cluster is merged with an untagged one or another cluster with the same tag, the combination inherits the tag; further, merging two clusters with different tags is disallowed. As a result, the initial seed clusters are expanded by adding additional word pairs while preventing any of the seed clusters from merging with each other.</Paragraph>
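A minimal sketch of how the frequency-dependent threshold and the seed tags might jointly gate a merge is given below. It is our own illustration: `threshold_for` merely stands in for the function plotted in Figure 1 (whose actual values are not reproduced here), and the class and function names are ours.

```python
# Sketch (assumptions ours) of the merge decision during agglomerative clustering,
# combining the seed-tag constraint with the frequency-dependent threshold.

def threshold_for(freq):
    # Placeholder for the Figure 1 threshold function: stricter (higher)
    # thresholds for rarer word pairs.  Illustrative values only.
    return max(0.9 - 0.01 * freq, 0.3)

class Cluster:
    def __init__(self, members, freq, tag=None):
        self.members = members   # list of pseudo-document vectors
        self.freq = freq         # frequency assigned to the cluster
        self.tag = tag           # seed-class name, or None if untagged

def may_merge(c1, c2, similarity):
    """Return True if the two clusters are allowed to merge."""
    # Seed-tag constraint: clusters carrying different seed tags never merge.
    if c1.tag and c2.tag and c1.tag != c2.tag:
        return False
    # The threshold is evaluated at the lower of the two cluster frequencies.
    return similarity(c1.members, c2.members) > threshold_for(min(c1.freq, c2.freq))

def merge(c1, c2):
    """Merge two clusters; the combination inherits whichever tag is present.
    The frequency update shown is the average/single-link rule (maximum);
    centroid-based clustering would sum the frequencies instead."""
    return Cluster(c1.members + c2.members,
                   max(c1.freq, c2.freq),
                   c1.tag or c2.tag)
```

A merge candidate would then be evaluated as, for example, `may_merge(c1, c2, clus_sim_single_link)` using one of the cluster-similarity functions sketched earlier.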
<Paragraph position="5"> One special case is handled separately, namely numeric strings. If both the source-language and target-language words of a word pair are numeric strings, the word pair is treated as if it had been specified in the seed class <number>. Word pairs not containing a digit in either word can optionally be prevented from being added to the <number> cluster unless explicitly seeded in that cluster.</Paragraph> <Paragraph position="6"> The former feature ensures that numbers will appear in a single cluster, rather than in multiple clusters. The latter avoids the inclusion of the many non-numeric word pairs (primarily adjectives) which would otherwise tend to cluster with numbers, because both they and numbers are used as modifiers.</Paragraph> <Paragraph position="7"> Once clustering is completed, any clusters which have inherited the same tag (which is possible when using agglomerative clustering) are merged. Those clusters which contain more than one pseudo-document are output, together with any inherited label, and can be used as a set of equivalence classes for EBMT.</Paragraph> <Paragraph position="8"> Agglomerative clustering using the maximal cosine similarity (single-link) produced the subjectively best clusters, and was used for the experiments described here.</Paragraph> </Section> <Section position="6" start_page="127" end_page="128" type="metho"> <SectionTitle> 4 Experiment </SectionTitle> <Paragraph position="0"> The method described in the previous two sections was tested on French-English EBMT.</Paragraph> <Paragraph position="1"> The training corpus was a subset of the IBM Hansard corpus of Canadian parliamentary proceedings (Linguistic Data Consortium, 1997), containing a total of slightly more than one million words, approximately half in each language. Word-level alignment between French and English was performed using a dictionary containing entries derived statistically from the full Hansard corpus, augmented by the ARTFL French-English dictionary (ARTFL Project, 1998). This dictionary was used for all EBMT and clustering runs.</Paragraph> <Paragraph position="2"> The effects of varying the amount of training text were determined by further splitting the training corpus into smaller segments and using differing numbers of segments. For each run using clustering, the first K segments of the corpus are concatenated into a single file, which is used as input for both the clustering program and the EBMT system. The clustering program is run to determine a set of equivalence classes, and these classes are then provided to the EBMT system along with the training examples to be indexed. Held-out Hansard text (approximately 45,000 words) is then translated, and the percentage of the words in the test text for which the EBMT system could find matches and generate a translation is determined.</Paragraph> <Paragraph position="3"> To test the effects of adding seed clusters, a set of initial clusters was generated with the help of the ARTFL dictionary. First, the 500 most frequent words in the million-word Hansard subset (excluding punctuation) were extracted. These terms were then matched against the ARTFL dictionary, removing those words which had multi-word translations as well as several which listed multiple parts of speech for the same translation (multiple parts of speech can only be used if the corresponding translations are distinct from each other). The remaining 420 translation pairs, tagged for part of speech, were then converted into seed clusters and provided to the clustering program. To facilitate experiments using the pre-existing production-rule grammar, additional translation pairs from the manually-generated equivalence classes were added to provide seeds for five equivalence classes which are not present in the dictionary.</Paragraph>
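The seed-cluster construction just described might be sketched as follows. This is our reconstruction only: the dictionary format, the grouping of the surviving pairs by part of speech, and all names are assumptions of ours rather than details given in the paper.

```python
# Sketch (assumptions ours) of building seed clusters from the most frequent
# words of the training subset and a bilingual dictionary.
from collections import Counter

def build_seed_clusters(tokens, dictionary, n_most_frequent=500):
    """tokens: French tokens of the training subset, punctuation excluded.
    dictionary: maps a French word to a list of (translation, part_of_speech)
    entries, as one might extract from the ARTFL French-English dictionary."""
    seeds = {}   # part of speech -> set of (source, target) pairs
    for word, _count in Counter(tokens).most_common(n_most_frequent):
        entries = dictionary.get(word, [])
        if not entries:
            continue
        # Drop words that have a multi-word translation.
        if any(" " in tr for tr, _pos in entries):
            continue
        # Drop words listing the same translation under multiple parts of speech,
        # since parts of speech are only usable when the translations differ.
        by_translation = Counter(tr for tr, _pos in entries)
        if any(n > 1 for n in by_translation.values()):
            continue
        for tr, pos in entries:
            # Grouping by part of speech is one plausible way to form the named
            # seed classes; the exact grouping is not spelled out in the text.
            seeds.setdefault(pos, set()).add((word, tr))
    return seeds
```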
</Section> </Paper>