<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2131">
  <Title>Finding Structural Correspondences from Bilingual Parsed Corpus for Corpus-based Translation</Title>
  <Section position="3" start_page="0" end_page="908" type="metho">
    <SectionTitle>
2 Finding Structural Correspondences
</SectionTitle>
    <Paragraph position="0"> This section describes methods for finding structural correspondences for a pair of parsed trees.</Paragraph>
    <Section position="1" start_page="0" end_page="906" type="sub_section">
      <SectionTitle>
2.1 Data Structure
</SectionTitle>
      <Paragraph position="0"> Before going into the details of finding structural correspondences, we describe the data format of a</Paragraph>
      <Paragraph position="2"> used in this paper is a tree consisting of nodes and links (or arcs), where a node represents a content word, while a link represents a functional word or a relation between content words. For instance, as shown in Figure 2, a preposition &quot;at&quot; is represented as a link in English.</Paragraph>
    </Section>
    <Section position="2" start_page="906" end_page="907" type="sub_section">
      <SectionTitle>
2.2 Finding Word Correspondences
</SectionTitle>
      <Paragraph position="0"> The first task for finding structural correspondences is to find word correspondences between the nodes of a source parsed tree and the nodes of a target parsed tree.</Paragraph>
      <Paragraph position="1"> Word correspondences are found by consulting a source-to-target translation dictionary. Most words can find a unique translation candidate in a target tree, but there are cases such that there are many translation candidates in a target parsed tree for a source word. Therefore, the main task of finding word correspondences is to determine the most plausible translation word among candidates. We call a pair of a source word and its translation candidate word in a target tree a word correspondence candidate denoted by WC(s, t), where s is a source word and t is a target word. If WC(s, t) is a word correspondence candidate such that there is no other WC originating from s, then it is called a WA word correspondence.</Paragraph>
      <Paragraph position="2"> The basic idea to select the most plausible word correspondence candidate is to select a candidate which is near to another word correspondence whose source is also near to the source word in question.</Paragraph>
      <Paragraph position="3"> Suppose a source word s has multiple candidate translation target words t_i (i = 1, ..., n), that is, there are multiple WCs originating from s. We denote these multiple word correspondence candidates by WC(s, t_i). For each WC of s, this procedure finds the neighbor WA correspondence whose distance to WC is below a threshold. The distance between WC(s1, t1) and WA(s2, t2) is defined as the distance between s1 and s2 plus the distance between t1 and t2, where a distance between two nodes is defined as the number of nodes in the path whose ends are the two nodes. Among WCs of s for which a neighbor WA is found, the one with the smallest distance is chosen as the word correspondence of s, and WCs which are not chosen are invalidated (or deleted). We call a word correspondence found by this procedure WX. We currently use 3 as the distance threshold of the above procedure. This procedure is applied to all source nodes which have multiple WCs. Figure 3 shows an example of WX word correspondence. In this example, since the Japanese word &quot;ki&quot; has two English translation word candidates &quot;time&quot; and &quot;period,&quot; there are two WCs (WC1 and WC2). The direct parent node &quot;yuuryo&quot; of &quot;ki&quot; has a WA correspondence (WA1) to &quot;concern,&quot; and the direct child node &quot;ikou&quot; also has a WA correspondence (WA2) to &quot;transition.&quot; In this case, since the distance between WC2 and WA2 is smaller than the distance between WC1 and WA1, WC2 is changed to a WX, and WC1 is abandoned.</Paragraph>
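      <Paragraph>The WX selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the child-to-parent dictionaries, the function names chain_to_root, path_node_count and select_wx, and the toy trees are all assumptions made for the example. Node distance is taken literally as the number of nodes on the path, both ends included; since the text does not pin down how the threshold of 3 interacts with that count, the threshold is left as a parameter.

```python
def chain_to_root(parent, node):
    """Return [node, parent(node), ..., root] for a child-to-parent map."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def path_node_count(parent, a, b):
    """Number of nodes on the tree path between a and b, both ends included."""
    steps_from_a = {n: i for i, n in enumerate(chain_to_root(parent, a))}
    for j, n in enumerate(chain_to_root(parent, b)):
        if n in steps_from_a:  # first shared ancestor is the meeting point
            return steps_from_a[n] + j + 1
    raise ValueError("nodes are not in the same tree")


def select_wx(src_parent, tgt_parent, s, candidates, wa_pairs, threshold):
    """Pick the candidate target of s that lies closest to a WA pair.

    Returns None when no candidate has a WA neighbor whose distance
    is below the threshold.
    """
    best = None
    for t in candidates:
        for s2, t2 in wa_pairs:
            d = (path_node_count(src_parent, s, s2)
                 + path_node_count(tgt_parent, t, t2))
            if threshold > d and (best is None or best[0] > d):
                best = (d, t)
    return None if best is None else best[1]


# Invented toy trees (hypothetical shapes, not the trees of Figure 3).
src_parent = {"yuuryo": None, "ki": "yuuryo", "ikou": "ki"}
tgt_parent = {"concern": None, "period": "concern", "transition": "period",
              "other": "concern", "time": "other"}
wa_pairs = [("yuuryo", "concern"), ("ikou", "transition")]

chosen = select_wx(src_parent, tgt_parent, "ki", ["time", "period"],
                   wa_pairs, threshold=5)
print(chosen)
```

In this toy setting the candidate &quot;period&quot; sits next to both WA pairs, while &quot;time&quot; is further away in the target tree, so &quot;period&quot; is the one selected.</Paragraph>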
      <Paragraph position="4"> In addition to WX correspondences, we consider a special case such that given a word correspondence W(s, t), if s has only one child node which is a leaf and t has also only one child node which is a leaf, then we construct a new word correspondence called WS from these two leaf nodes. This WS procedure is applied to all word correspondences.</Paragraph>
      <Paragraph position="5"> Note that this word correspondence is not to select one of candidates, rather it is a new finding of word correspondence by utilizing a special structure. For instance, in Figure 3, if there is a word correspondence between &quot;ki&quot; and &quot;period&quot; and there is no word correspondence between &quot;ikou&quot; and &quot;transition,&quot; then WS(ikou, transition) will be found by this procedure.</Paragraph>
      <Paragraph position="6"> These WX and WS procedures are continuously applied until no new word correspondences are found.</Paragraph>
      <Paragraph position="7"> After applying the above WX and WS procedures, there are some target words t such that t is a destination of a WC(s, t) and there is no other WC whose destination is t. In this case, the WC(s, t) correspondence candidate is chosen as a valid word correspondence between s and t, and it is called a WZ word correspondence.</Paragraph>
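      <Paragraph>The WZ rule can be illustrated with a short sketch. It assumes that the WC candidates still unresolved after the WX and WS procedures are available as a list of (source, target) pairs; the function name select_wz and the sample data are hypothetical.

```python
from collections import Counter


def select_wz(remaining_candidates):
    """Keep the WC pairs whose target word is the destination of no other WC."""
    hits = Counter(t for _, t in remaining_candidates)
    return [(s, t) for s, t in remaining_candidates if hits[t] == 1]


# Hypothetical leftover candidates: t2 is claimed twice, t1 only once.
remaining = [("s1", "t1"), ("s1", "t2"), ("s2", "t2")]
wz_pairs = select_wz(remaining)
print(wz_pairs)
```

Here only the pair with target t1 qualifies, because t2 is the destination of two candidates.</Paragraph>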
      <Paragraph position="8"> We call a source node or a target node of a word correspondence an anchor node in what follows.</Paragraph>
      <Paragraph position="9"> The above procedures for finding word correspondences are summarized as follows: Find WCs by consulting translation dictionary;</Paragraph>
      <Paragraph position="11"> if no new word correspondence is found, then break;</Paragraph>
      <Paragraph position="13"/>
    </Section>
    <Section position="3" start_page="907" end_page="908" type="sub_section">
      <SectionTitle>
2.3 Finding Phrasal Correspondences
</SectionTitle>
      <Paragraph position="0"> The next step is to find phrasal correspondences based on word correspondences found by the procedures described in the previous section. What we would like to retrieve here is a set of phrasal correspondences which covers all elements of a pair of dependency trees.</Paragraph>
      <Paragraph position="1"> In what follows, we call a portion of a tree which consists of nodes in a path from a node n1 to another node n2 which is a descendant of n1 a linear tree denoted by LT(n1, n2), and we denote a minimal subtree including specified nodes n1, ..., n_x by T(n1, ..., n_x). For instance, in the English tree structure (the right tree) in Figure 4, LT(technology, science) is a rectangular area covering the nodes from &quot;technology&quot; down to &quot;science,&quot; and T(factor, country) is a polygonal area covering &quot;factor,&quot; &quot;affect,&quot; &quot;policy,&quot; and &quot;country.&quot; The first step is to find a pair of word correspondences W1(s1, t1) and W2(s2, t2) such that s1 and s2 construct a linear tree LT(s1, s2) and there is no anchor node in the path from s1 to s2 other than s1 and s2, where W1 and W2 denote any type of word correspondences, and we assume there is a word correspondence between the roots of the source and target trees by default. We construct a phrasal correspondence from source nodes in LT(s1, s2) and target nodes in T(t1, t2), denoted by P(LT(s1, s2), T(t1, t2)). For instance, in Figure 4, P11, P12, P2, P3 and P4 are source portions of phrasal correspondences found in this step.</Paragraph>
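      <Paragraph>The two tree notions used here, the linear tree LT(n1, n2) and the minimal subtree T(n1, ..., n_x), can be sketched as below. The child-to-parent dictionary, the function names and the small tree with nodes a to e are assumptions made for illustration only; they do not reproduce Figure 4.

```python
def chain_to_root(parent, node):
    """Return [node, parent(node), ..., root] for a child-to-parent map."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def linear_tree(parent, top, bottom):
    """Nodes of LT(top, bottom), listed from the ancestor down to the descendant."""
    chain = chain_to_root(parent, bottom)
    if top not in chain:
        raise ValueError("top is not an ancestor of bottom")
    return list(reversed(chain[: chain.index(top) + 1]))


def minimal_subtree(parent, nodes):
    """Node set of T(nodes): the smallest connected part containing all of them."""
    chains = [chain_to_root(parent, n) for n in nodes]
    common = set(chains[0]).intersection(*chains[1:])
    if not common:
        raise ValueError("nodes are not in the same tree")
    # Climbing from any node, the first shared ancestor is the lowest one.
    lowest = next(n for n in chains[0] if n in common)
    covered = set()
    for chain in chains:
        covered.update(chain[: chain.index(lowest) + 1])
    return covered


# Hypothetical tree: a is the root, b and e hang under a, c and d under b.
parent = {"a": None, "b": "a", "c": "b", "d": "b", "e": "a"}

print(linear_tree(parent, "a", "c"))
print(sorted(minimal_subtree(parent, ["c", "d"])))
print(sorted(minimal_subtree(parent, ["c", "e"])))
```

A phrasal correspondence P(LT(s1, s2), T(t1, t2)) could then be held as a pair made of the node list from linear_tree on the source side and the node set from minimal_subtree on the target side.</Paragraph>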
      <Paragraph position="2"> The next step checks, for each P, if all anchor nodes of word correspondences whose source or target node is included in P are also included in P.</Paragraph>
      <Paragraph position="3"> If a phrasal correspondence satisfies this condition, then it is called closed, otherwise it is called open.</Paragraph>
      <Paragraph position="4"> Further, nodes which are not included in the P in question are called open nodes. If a P is open, then it is merged with other phrasal correspondences having open nodes of P so that the merged phrasal correspondence becomes closed.</Paragraph>
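      <Paragraph>The closed/open test and the merging step might look like the following sketch. It assumes each phrasal correspondence is held as a pair of node sets (source nodes, target nodes) and each word correspondence as a (source, target) pair; the names open_nodes and merge_until_closed and the sample data are hypothetical.

```python
def open_nodes(p_src, p_tgt, word_corrs):
    """Anchor nodes that P lacks; P is closed exactly when this set is empty."""
    missing = set()
    for s, t in word_corrs:
        if s in p_src or t in p_tgt:
            if s not in p_src:
                missing.add(("src", s))
            if t not in p_tgt:
                missing.add(("tgt", t))
    return missing


def merge_until_closed(start, phrases, word_corrs):
    """Merge phrases[start] with phrases holding its open nodes until closed."""
    src, tgt = set(phrases[start][0]), set(phrases[start][1])
    merged = {start}
    while True:
        missing = open_nodes(src, tgt, word_corrs)
        if not missing:
            return src, tgt, merged
        grew = False
        for i, (q_src, q_tgt) in enumerate(phrases):
            if i in merged:
                continue
            holds_open_node = any(
                (side == "src" and n in q_src) or (side == "tgt" and n in q_tgt)
                for side, n in missing
            )
            if holds_open_node:
                src |= set(q_src)
                tgt |= set(q_tgt)
                merged.add(i)
                grew = True
        if not grew:  # nothing left to merge; give back what we have
            return src, tgt, merged


# Hypothetical data: the first phrase crosses the second one on both sides.
word_corrs = [("s1", "t1"), ("s2", "t2"), ("s3", "t3")]
phrases = [({"s1", "s2"}, {"t1", "t3"}), ({"s1", "s3"}, {"t1", "t2"})]

print(open_nodes(phrases[0][0], phrases[0][1], word_corrs))
print(merge_until_closed(0, phrases, word_corrs))
```

In the sample, the first phrase is open because it holds s2 without t2 and t3 without s3; merging it with the second phrase yields a closed correspondence.</Paragraph>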
      <Paragraph position="5"> Next, each P_p is checked if there is another P_q which shares any nodes other than anchor nodes with P_p. If this is the case, these P_p and P_q are merged into one phrasal correspondence. In Figure 4, phrasal correspondences P11 and P12 are merged into P1, since their source portions LT(haikei, koku) and LT(haikei, seisaku) share &quot;doukou&quot; which is not an anchor node.</Paragraph>
      <Paragraph position="6"> Finally, any path whose nodes other than the root are not included in any Ps but whose root node is included in a P is searched for. This procedure is applied to both source and target trees. A path found by this procedure is called an open path, and its root node is called a pivot. If such an open path is found, it is processed as follows: for each pivot node, (a) if the pivot is not an anchor node, then open paths originating from the pivot are merged into a P having the pivot, (b) if the pivot is an anchor node, then a new phrasal correspondence is created from open paths originating from the anchor nodes of the word correspondence.</Paragraph>
      <Paragraph position="7"> In Figure 4, we finally get four phrasal correspondences.</Paragraph>
      <Paragraph position="9"> if the pivot is not an anchor, then merge the path into the P having the pivot, otherwise create a new P by merging all open paths having the pivot;</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="908" end_page="908" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="908" end_page="908" type="sub_section">
      <SectionTitle>
3.1 Corpus and Dictionary
</SectionTitle>
      <Paragraph position="0"> We used documents from White Papers on Science and Technology (1994 to 1996) published by the Science and Technology Agency (STA) of the Japanese government. STA published these White Papers in both Japanese and English. The Communications Research Laboratory of the Ministry of Posts and Telecommunications of the Japanese government supplied us with the bilingual corpus, which is already roughly aligned. We made a bilingual corpus consisting of parsed dependency structures by using the KNP[2] Japanese parser (developed by Kyoto University) for Japanese sentences and the ESG[5] English parser (developed by IBM Watson Research Center) for English sentences.</Paragraph>
      <Paragraph position="1"> We made about 500 sentence pairs, each of which has a one-to-one sentence correspondence, from the raw data of the White Papers, and randomly selected about 130 sentence pairs for experiments.</Paragraph>
      <Paragraph position="2"> However, since a parser does not always produce correct parse trees, we excluded some sentence pairs which have severe parse errors, and finally got 115 sentence pairs as a test set.</Paragraph>
      <Paragraph position="3"> As a translation word dictionary between Japanese and English, we first used a J-to-E translation dictionary which has more than 100,000 entries, but we found that there are some word correspondences not covered in this dictionary. Therefore, we merged entries from an E-to-J translation dictionary in order to get much broader coverage. The total number of entries is now more than 150,000.</Paragraph>
    </Section>
    <Section position="2" start_page="908" end_page="908" type="sub_section">
      <SectionTitle>
3.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the result of the experiment for finding word correspondences. A row with ALL in the type column shows the total accuracy of word correspondences and other rows show the accuracy of each type. It is clear that WA correspondences have a very high accuracy. Other word correspondences also have a relatively high accuracy.</Paragraph>
      <Paragraph position="1"> Table 2 shows the result of experiments for finding phrasal correspondences. The row with ALL in the type column shows the total accuracy of phrasal correspondences found by the proposed procedure.</Paragraph>
      <Paragraph position="2"> This accuracy level is not promising and it is not useful for later processes since it needs human checking and correction. Therefore, we subcategorize each phrasal correspondence, and check the accuracy for each subcategory.</Paragraph>
      <Paragraph position="3"> We consider the following subcategories for phrasal  direct child of t1.</Paragraph>
      <Paragraph position="4"> * LTX ... P(LT(s1, s2), LT(t1, t2)) such that all nodes other than s2 and t2 have only one child node.</Paragraph>
      <Paragraph position="5"> * LTY ... P(LT(s1, s2), LT(t1, t2)) such that all nodes other than s1, s2, t1 and t2 have only one child node.</Paragraph>
      <Paragraph position="6"> LTX is a special case of LTY, since s1 and t1 of LTX must have only one child node, whereas those of LTY may have more than one child node. A subcategory test for a phrasal correspondence is done in the above order. Examples of these subcategories are shown in Fig 5.</Paragraph>
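      <Paragraph>As a rough sketch of the LTX and LTY tests, the function below checks the single-child condition on both sides in the stated order. Several points are assumptions: child counts are taken from the whole tree, each side is passed as the top-to-bottom node list of its linear tree, and the label OTHER for everything else is invented here. Only LTX and LTY are covered.

```python
def single_child(children, nodes):
    """True when every listed node has exactly one child in its tree."""
    return all(len(children.get(n, ())) == 1 for n in nodes)


def classify(src_path, tgt_path, src_children, tgt_children):
    """Return 'LTX', 'LTY' or 'OTHER' for P(LT(s1, s2), LT(t1, t2)).

    src_path and tgt_path list the nodes of each linear tree from top
    (s1 or t1) to bottom (s2 or t2).
    """
    # LTX: every node except the bottom ones has a single child.
    if (single_child(src_children, src_path[:-1])
            and single_child(tgt_children, tgt_path[:-1])):
        return "LTX"
    # LTY: same condition, but the top nodes are exempt as well.
    if (single_child(src_children, src_path[1:-1])
            and single_child(tgt_children, tgt_path[1:-1])):
        return "LTY"
    return "OTHER"


# Hypothetical trees: the target root x has two children, so LTX fails.
src_children = {"a": ["b"], "b": ["c"], "c": []}
tgt_children = {"x": ["y", "w"], "y": ["z"], "z": [], "w": []}

label = classify(["a", "b", "c"], ["x", "y", "z"], src_children, tgt_children)
print(label)
```

Because LTX is tested first, a correspondence that satisfies both conditions is reported as LTX, which matches the ordering described above.</Paragraph>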
      <Paragraph position="7"> The results for these subcategories are also shown in Table 2. Subcategories MIN and LTX have very high accuracy and this result is very promising, since we can avoid manual checking for these phrasal correspondences, or we would check only these types of phrasal correspondences manually and discard other types.</Paragraph>
      <Paragraph position="8"> As stated earlier, since we removed only sentences with severe parsing errors from the test set, please note that the above numbers of experimental results are calculated for a bilingual parsed corpus including parsing errors.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="908" end_page="910" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> There have been some studies on structural alignment of bilingual texts such as [1, 4, 13, 3, 6]. Our work is similar to these previous studies at the conceptual level, but different in some aspects. [1] reported a method for extracting translation templates by CKY parsing of bilingual sentences. This work is to get phrase-structure level phrasal correspondences, but our work is to get dependency-structure level phrasal correspondences. [4] proposed a method for extracting structural matching (pairs of dependency trees) by calculating matching similarities of two dependency structures. Their work focuses on the parsing ambiguity resolution by calculating structural matching. Further, [3, 6] proposed structural alignment of dependency structures. Their work assumed that least common ancestors of each fragment of a structural correspondence are preserved, but our work does not have such a structural restriction. [13] is different from the others in that it tries to find phrasal correspondences by comparing an MT result and its manual correction. In addition to these differences, the main difference is to find classes (or categories) of phrasal correspondences which have high accuracy. In general, since bilingual structural alignment is a very complicated and difficult task, it is very hard to get more than 90% accuracy in total. If we get only such an accuracy rate, the result is not useful, since we need manual checks for all the correspondences retrieved. But, if we can get some classes of phrasal correspondence with, for instance, more than 90% accuracy rate, then we can reduce manual checking for phrasal correspondences in such classes, and this reduces the development cost of translation patterns used in the later corpus-based translation process. As shown in the previous section, we could find that all classes of word correspondences and two subclasses of phrasal correspondences are more than 90% accurate.</Paragraph>
    <Paragraph position="1"> When actually using this automatically retrieved structural correspondence data, we must consider how to manually correct the incomplete parts and how to reuse manual correction data if the parser results are changed.</Paragraph>
    <Paragraph position="2"> As for the former issue, we need an easy-to-use tool to modify correspondences to reduce the cost of manual operation. We have developed a GUI tool as shown in Figure 6. In this figure, the bottom half presents a pair of source and target dependency structures with word correspondences (solid lines) and phrasal correspondences (sequences of shaded circles). You can easily correct correspondences by looking at this graphical presentation. As for the latter issue, we must develop methods for reusing the manual correction data as much as possible even if the parser outputs are changed.</Paragraph>
    <Paragraph position="3"> We have developed a tool for attaching phrasal correspondences by using existing phrasal correspondence data. This is implemented as follows: Each phrasal correspondence is assigned a signature which is a pair of source and target sentences, each of which has bracketed segments which are included in the phrasal correspondence. For instance,  In the above example, segments between '[' and ']' represent a phrasal correspondence.</Paragraph>
    <Paragraph position="4"> If new parsed dependency structures for a sentence pair are given, for each phrasal correspondence signature of the sentence pair, nodes in the structures which are inside brackets of the signature are marked, and if there is a minimal subtree consisting of only marked nodes, then a phrasal correspondence is reconstructed from the phrasal correspondence signature. By using this tool, we can efficiently reuse the manual efforts as much as possible even if parsers are updated.</Paragraph>
  </Section>
</Paper>