<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1078"> <Title>Alignment of Shared Forests for Bilingual Corpora</Title> <Section position="3" start_page="0" end_page="460" type="metho"> <SectionTitle> 2 Our Approach </SectionTitle> <Paragraph position="0"> For each input sentence our parser produces a set of trees, corresponding to each possible syntactic analysis. Our parse trees are transformed into a &quot;regularized&quot; format, to represent the predicate-argument structure. For each sentence, the output of the parser is a structure-sharing forest. An example of structure sharing between two parse trees of the same input sentence is shown in Figure 1. We apply the parser to the source and target sentences, using a Spanish and an English grammar, respectively. The resulting sets of structure-sharing parse trees form the input to the alignment procedure.</Paragraph> <Paragraph position="1"> Our alignment program employs dynamic programming 1 algorithms, which are described in detail in later sections. The program begins at the roots of the source and target trees, and proceeds top-down recursively, filling a matrix of scores.</Paragraph> <Paragraph position="2"> Given N nodes in the source tree T_s = T(V_s, E_s) 2 and M nodes in the target tree T_t = T(V_t, E_t), the score matrix is an N × M matrix. For each pair of nodes x_i ∈ V_s, i = 1, ..., N, and y_j ∈ V_t, j = 1, ..., M, the corresponding entry in the score matrix is a</Paragraph> <Paragraph position="3"> measure of how well x_i matches y_j. The score for each pair of nodes depends only on the closeness of the lexical entries associated with the nodes and [...] 1 Cf., e.g., (Cormen et al., 1990), pp. 299-328. 2 The expression T(V_s, E_s) denotes a tree as a pair of sets: V_s is the set of vertices (nodes) in the tree, and E_s is the set of edges (arcs).</Paragraph> <Paragraph position="5"> [...] a dependency-type structure is assumed. 
4 Hence a relation ROLE(PRED, ARG) is represented by an arc, the PRED at the tail of the arc, and the ARG (or the role recipient) at the head</Paragraph> <Paragraph position="6"> 3 By improving the asymptotic speed of alignment on these few sentences, we open the possibility for using much larger corpora in future work. 4 (Sato and Nagao, 1990) and (Matsumoto et al., 1993) also assume dependency-type structure in their example-based work.</Paragraph> <Paragraph position="7"> of the arc. For example, [...] is the subject of the predicate [...] in Figure 1. In a regularized parse, certain closed syntactic classes, such as prepositions and subordinate conjunctions, are represented as arc labels denoting roles (e.g., the preposition [...] in Figure 1) rather than as nodes in the structure.</Paragraph> <Paragraph position="8"> Structure sharing among the trees in the parse forest allows us to reduce the number of computed scores. We compute the score for a given pair of subtrees only once, regardless of the number of trees which share these subtrees, because the score of a pair of nodes depends only on the scores of their descendants (and not of their ancestors).</Paragraph> <Paragraph position="9"> Currently our parser records structure sharing only between NPs. 
Experiments in which [...] structure is shared, as in Figure 1, suggest that extending structure sharing to other types of nodes would further improve performance. This structure sharing approach is based on previous work in optimizing Feature Structure based parsing (see for example (Karttunen, 1985) and (Pereira, 1985)).</Paragraph> <Paragraph position="10"> 4 The LCA-Preserving Algorithms We first discuss the formal aspects of the alignment problem and introduce terminology.</Paragraph> <Paragraph position="11"> 4.1 The Maximal Tree Alignment Problem The objective is to find a maximum score correspondence between nodes in a pair of trees. The statement of the problem of aligning two trees 5 T and T' corresponds closely to that found in (Matsumoto et al., 1993). Our algorithms are based on those presented in (Steel and Warnow, 1993), (Farach et al., 1995b) and [...].</Paragraph> <Paragraph position="12"> We say that a node x is a common ancestor of nodes a and b in a tree T if there exist paths of length ≥ 0 from x to a and from x to b. The least common ancestor (lca) of two nodes a and b is the node x_0 = lca(a, b), such that 1. x_0 is a common ancestor of a and b, and 2. for any other common ancestor x of a 
and b, x_0 is a descendant of x.</Paragraph> <Paragraph position="13"> An alignment between two trees T = (V, E) and T' = (V', E') is a correspondence (a one-to-one mapping) f: S → S', where S ⊆ V and S' ⊆ V', which maintains the following relationship: 5 For simplicity of presentation we state the problem in terms of alignment of trees. In practice we are using an optimized variant of the algorithm, which aligns pairs of structure-sharing forests.</Paragraph> <Paragraph position="14"> between a, b and c, and a', b' and c'.</Paragraph> <Paragraph position="15"> If nodes a ∈ V and b ∈ V map into nodes</Paragraph> <Paragraph position="17"> To illustrate, in Figure 2 there is no lca-preserving alignment of the two trees which maps all three of the leaf nodes a, b and c into the nodes a', b' and c'. Lca-preserving alignments are possible which map any two of the leaves.</Paragraph> <Paragraph position="18"> The algorithm assumes that least common ancestors are preserved in the alignment. We assign a score to each alignment based on the labels of the corresponding nodes and the arcs from these nodes, as described below. The algorithm seeks an alignment with maximal score.</Paragraph> <Section position="1" start_page="460" end_page="460" type="sub_section"> <SectionTitle> 4.2 The Algorithm </SectionTitle> <Paragraph position="0"> Let T_s and T_t be the source and the target trees. The algorithm uses dynamic programming to build up, in a bottom-up fashion, the scores for matching each node in T_s against each node of T_t. There are O(n^2) such scores, where n = max(|T_s|, |T_t|) is the number of nodes in the trees. Let d(v) be the degree of a node v. 
We denote children of v by v_i, i = 1, ..., d(v), and the arc (v, v_i) by #_i.</Paragraph> <Paragraph position="1"> Procedure SCORE_lca: The dynamic programming builds up a score function S(v, v') for all v ∈ T_s and v' ∈ T_t, which is stored in a |T_s| × |T_t| matrix S. The value S(v, v') is the score assigned to the best match between the two subtrees rooted at v in T_s and at v' in T_t. Initially, S is filled with undefined values. When a value for S(v, v') is required, and the corresponding entry in the matrix is undefined, it is recursively computed by the following formula:</Paragraph> <Paragraph position="3"> The function MATCH_lca(v, v') is a measure of how well the nodes v and v' align, and is computed as follows:</Paragraph> <Paragraph position="5"> the label on source node v corresponds to the label on target node v' in the bilingual dictionary. Lex_arc(#, #') is the corresponding measure for arcs.</Paragraph> <Paragraph position="6"> P(v, v') is the set of all possible pairings of the children of v against the children of v'.</Paragraph> <Paragraph position="7"> There are O(d!) such pairings, where d is the smaller of the degrees of v and v'.</Paragraph> <Paragraph position="8"> P(#_i) ≥ 0 is the penalty for collapsing the edge v_i, which may depend on the label of that edge.</Paragraph> <Paragraph position="9"> The summation in (2) ranges over all pairs, denoted by (i, j), which appear in a given pairing p ∈ P(v, v'). The summation is evaluated for all O(d!) possible pairings. The pairing with the maximum score is then selected.</Paragraph> <Paragraph position="10"> The total running time for computing the scores of all of the O(n^2) node pairs v and v' is O(d! n^2), where d is the lesser of the degrees of the source and target trees. Computing the max term in (2) can be mapped into the Maximum-Weight Clique problem (which is NP-complete), cf. (Farach et al., 1995b). 
However, in the NLP domain, the running time is contained because d &lt; 6 for most trees encountered in practice. Next we describe a heuristic which achieves a time bound quadratic in the size of the tree.</Paragraph> </Section> </Section> <Section position="4" start_page="460" end_page="462" type="metho"> <SectionTitle> 5 A Greedy Heuristic </SectionTitle> <Paragraph position="0"> We can reduce the computation time of the max term in (2) if we do not consider all of the O(d!) pairings of the children of v and v'. Instead we use a greedy approach and choose the d highest-scoring, mutually disjoint pairs from among the d^2 possible pairs of children of v and v'. The justification for this heuristic is that we expect that the high-scoring pairs will dominate, and will be a priori mutually disjoint.</Paragraph> <Paragraph position="1"> The following is an alternative, greedy procedure for computing S(v, v'): 1. ∀ i, j s.t. 1 ≤ i ≤ d(v), 1 ≤ j ≤ d(v'), compute the corresponding entry in a d(v) × d(v')</Paragraph> <Paragraph position="3"> The entry M_ij of M = M(v, v') is the score of matching the ith child of v with the jth child of v'.6 2. Let TOP ← {} be the set of highest scoring pairs.</Paragraph> <Paragraph position="4"> 3. Find the largest entry M_{i_0 j_0} in the matrix, such that neither its row nor its column is already occupied by some pair in TOP:</Paragraph> <Paragraph position="6"> where the coordinates (i_0, j_0) are such that</Paragraph> <Paragraph position="8"> 4. Repeat the above step d times, where d = min(d(v), d(v')).</Paragraph> <Paragraph position="9"> 5. Compute the result:</Paragraph> <Paragraph position="11"> With sorting, this can be done in O(d log d + d^2) time.</Paragraph> <Paragraph position="12"> The validity of this heuristic can be tested by comparing the performance of the procedures using the computation in (2) and in (4).</Paragraph> <Paragraph position="13"> 6 Note: if we disregard the arc labels for simplicity, and set Lex_arc(·, ·) 
= 0, then we do not need to build M, and may simply use M_ij = S(v_i, v'_j).</Paragraph> </Section> <Section position="5" start_page="462" end_page="462" type="metho"> <SectionTitle> 6 Strict Lexical Matching Heuristic </SectionTitle> <Paragraph position="0"> (Grishman, 1994) employed an optimization heuristic which favored lexical matches. For each source node v with label L(v), the procedure using this heuristic would first attempt to find a target node v' with label L(v') such that L(v) translated as L(v') in the bilingual dictionary (a perfect lexical match). If such a lexical match was found, the procedure did not attempt to match v with any other target node.</Paragraph> <Paragraph position="1"> A similar heuristic (Lex-Match) was incorporated into our program as the following preprocessing steps: 1. For each source node v, all possible lexical matches are identified in the target tree.7 If</Paragraph> <Paragraph position="2"> v has at least one possible lexical match, all of those positions in the score matrix S which do not correspond to a lexical match of v are set to zero.</Paragraph> <Paragraph position="3"> 2. For each target node v' which has at least one lexical match, all of those positions in the score matrix which do not correspond to a lexical match of v' are set to zero.</Paragraph> <Paragraph position="4"> By setting to zero those positions in the score matrix which represent unlikely matches, this heuristic prevents these scores from ever being calculated, substantially reducing the running time. Lex-Match, unlike the (Grishman, 1994) heuristic, allows one source node to match lexically with more than one node in the target tree.</Paragraph> </Section> <Section position="6" start_page="462" end_page="463" type="metho"> <SectionTitle> 7 Implementation </SectionTitle> <Paragraph position="0"> We have implemented the greedy LCA-preserving algorithm with the following features: Penalties: The penalties for collapsing edges were set to 0. 
8 Scores: A Lex_node score of 100 and a Lex_arc score of 21 were awarded for each match using our bilingual dictionary. These functions have the value 0 if there is no lexical match. 7 A match M(v, v') is also a lexical match if either M(v, w') or M(w, v') is a lexical match, where w and w' are children of v and v', respectively.</Paragraph> <Paragraph position="1"> 8 When penalties are set to zero and an empty bilingual dictionary is used, the alignment algorithm fills the scoring matrix with zeros. When we introduce non-zero penalties, the alignment procedure prefers matches between nodes dominating similar structures, since nodes dominating dissimilar structures receive negative scores. We expect that non-zero penalties will improve precision with a nonempty bilingual dictionary, because they will favor similar structures. In preliminary testing, penalty values of 20 and 30 yielded improvements in precision.</Paragraph> <Paragraph position="2"> Optimization Variables: We experimented with variants of the procedures which included Structure Sharing (Struc-Share) and the Lexical Match Optimization (Lex-Match), as well as with those that did not.</Paragraph> <Paragraph position="3"> Table 1 shows the time consumed by our program to align sentences under different conditions. The baseline refers to our program without any optimizations (which is at least 6 times faster than before using this algorithm). The optimization variables have different effects on the different texts. We believe that structure sharing has a much stronger effect on Curious George than on El Camino Real because the former has longer sentences which produced more parses. The Lex-Match optimization has a greater effect on El Camino Real than on Curious George because all of the words contained in El Camino Real are included in our bilingual dictionary, but only a small portion of the words in Curious George are included. 
We expect that as the size of our dictionary increases, the Lex-Match optimization will have a greater effect.</Paragraph> <Paragraph position="4"> The precision for each aligned pair of sentences is computed according to the formula: |ResultSet ∩ AnswerKey| / |ResultSet|, where ResultSet is the set of source parses to which the alignment procedure assigned the highest score, and AnswerKey is the set of best source parses as judged by one of the experimenters. 9 This precision measure was previously used in (Matsumoto et al., 1993) and (Grishman, 1994).</Paragraph> <Paragraph position="5"> Table 2 compares the precision of the alignment procedure with and without the Lex-Match heuristic (structure sharing had no effect on the scores). The slight increase in precision observed with the Lex-Match optimization may be an indication that we should raise the score for lexical matches of node labels. 9 If there was no correct parse, the parses with the fewest errors were used for purposes of alignment.</Paragraph> </Section> </Paper>