File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-1006_metho.xml
Size: 18,050 bytes
Last Modified: 2025-10-06 14:07:09
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1006"> <Title>The Effects of Word Order and Segmentation on Translation Retrieval Performance</Title> <Section position="4" start_page="35" end_page="37" type="metho"> <SectionTitle> 3 Similarity metrics </SectionTitle> <Paragraph position="0"> Due to o111&quot; interest in the efli~cts of both word order and seglnentation, we must have a selection of similarity lnetrics compatible with the various permutations of these two 1)arameter types. We choose to look at a nunlber of bag-of-words and word order-sensitive methods which are compatible with both character-based and word-based indexing, and vary the intmt to model tile etl~ects of the two indexing paradigms. The particular bag-of-word approactles we target are tlm vector space model (Manning and Schiitze, 1.999, p300) and &quot;token intersection&quot;, a silnple ratio-based similarity nletric. For word order-sensitive approaches, we test edit distance (Wagner and Fisher, 1974; Planas and Furuse, 1999), &quot;sequential correspondence&quot; and &quot;weigllted sequential correspondence&quot;.</Paragraph> <Paragraph position="1"> Each of tile similarity metrics eillpirically describes the sintilarity between two inlmt strings tmi mid i~., 2 where we define tmi as a source language string taken fl'om the TM and i~. as the input string which we are seeking to 1hatch within the TM.</Paragraph> <Paragraph position="2"> One featnre of all similarity metrics given here is that they have fine-grained discriminatory potential and are able to narrow down the final set of translation candidates to a handfld of, and in nlost cases one, outlmt. This was a deliberate design decision, and aimed at example-based machine translation applications, where human judgement cannot be relied upon to single out the most appropriate translation from multiple system outputs. In this, we set ourselves apart from the research of Sunlita and Tsutsumi (1.991), for example, who judge the system to have been successful if there are a total of 100 or less outputs, aud a useful translation is contained within them. Note that it would be a relatively simple pro2Note that the ordering here is arbitrary, and that all the similarity metrics described herein are commutative for the given implementations.</Paragraph> <Paragraph position="3"> cedure to fall ()lit the 11111111)e1&quot; of Olltt)lltS to it ill ollr case, tly taking tim top n ranking outputs.</Paragraph> <Paragraph position="4"> For all silnitarity metrics, we weight different .\]ai)mmse segment tyl)es according to their exl)ected impact on translation, in the form of the sweigh, t fllnctioll: Segment type s,wcight punctuation 0 other segments 1 W(' exl)erinlentally trialled intermediate swcight settings tbr ditt'erent character tyl)es (in the case of character-based indexing) or segment tyl)eS (in the case of word-based indexing), none of which was fomtd to apl)reciat)ly iml)rove performance. :~ a.1 Similarity metrics used in this research Vector space model Within our imt)lenmntation of the reactor space Inodol (VSM), the segment content of each string is (lescril)('.(l as a vector, ma(le u l) of 3 single dimension for each segment tok(,n occurring within tmi or in. The. value of each vector eolnt)onent is given as the weighted frequen(-y of that token accor(ling to its sweiqht vahle, such that any nulnber of 3 given i)un(:tuation mark will produce a fl'e(luen(:y of 0. 
<Paragraph position="5"> The strings tm_i of maximal similarity are those which produce the maximum value for the vector cosine.</Paragraph> <Paragraph position="6"> Note that VSM considers only segment frequency and is insensitive to word order.</Paragraph> <Paragraph position="7"> Token intersection
The token intersection of tm_i and in is defined as the cumulative intersecting frequency of tokens appearing in each of the strings, normalised according to the combined segment lengths of tm_i and in. Formally, this equates to:

tint(tm_i, in) = 2 × Σ_t min(freq_{tm_i}(t), freq_{in}(t)) / (len(tm_i) + len(in))    (2)

where each t is a token occurring in either tm_i or in, freq_s(t) is defined as the sweight-based frequency of token t occurring in string s, and len(s) is the segment length of string s, that is the sweight-based count of segments contained in s.</Paragraph> <Paragraph position="9"> As for VSM, the string(s) tm_i most similar to in are those which generate the maximum value for tint(tm_i, in).</Paragraph> <Paragraph position="10"> Note that word order plays no part in the calculation.</Paragraph> <Paragraph position="11"> Edit distance
The first of the word order-sensitive methods is edit distance (Wagner and Fischer, 1974; Planas and Furuse, 1999). Essentially, the segment-based edit distance between strings tm_i and in is the minimum number of primitive edit operations on single segments required to transform tm_i into in (and vice versa), based upon the operations of segment equality (segments tm_{i,m} and in_n are identical), segment deletion (delete segment a from a given position in string s) and segment insertion (insert segment a into a given position in string s). The cost associated with each operation on segment a is defined as:</Paragraph> <Section position="1" start_page="36" end_page="37" type="sub_section"> <SectionTitle> Operation Cost </SectionTitle> <Paragraph position="0">
Operation            Cost
segment equality     0
segment deletion     sweight(a)
segment insertion    sweight(a)

Unlike the other similarity metrics, smaller values indicate greater similarity for edit distance, and identical strings have edit distance 0. The word order sensitivity of edit distance is perhaps best exemplified by way of the following example, where segment delimiters are given as &quot;・&quot;:

(1)  冬・の・雨  &quot;winter rain&quot;
(2a) 夏・の・雨  &quot;summer rain&quot;
(2b) 雨・の・夏  &quot;a rainy summer&quot;

Here, the edit distance from (1) to (2a) is 1 + 1 = 2, as one deletion operation is required to remove 冬 [fuyu] &quot;winter&quot; and one insertion operation is required to add 夏 [natu] &quot;summer&quot;. The edit distance from (1) to (2b), on the other hand, is 1 + 1 + 1 + 1 = 4, despite (2b) being identical in segment content to (2a). In terms of edit distance, therefore, (2a) is adjudged more similar to (1) than (2b).</Paragraph>
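The cost scheme above lends itself to the standard dynamic-programming formulation of edit distance. The sketch below (Python rather than the authors' Perl 5, with an illustrative sweight and our own function names) reproduces the figures of the worked example.

```python
def sweight(segment):
    # 0 for punctuation, 1 for all other segments; the punctuation set is illustrative
    return 0 if segment in {"、", "。", "・", "（", "）"} else 1

def edit_distance(tm, inp):
    """sweight-weighted segment edit distance, using only segment equality (cost 0),
    segment deletion and segment insertion (each costing the sweight of the segment)."""
    m, n = len(tm), len(inp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + sweight(tm[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + sweight(inp[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(d[i - 1][j] + sweight(tm[i - 1]),   # delete a segment of tm
                       d[i][j - 1] + sweight(inp[j - 1]))  # insert a segment of in
            if tm[i - 1] == inp[j - 1]:
                best = min(best, d[i - 1][j - 1])          # segment equality, cost 0
            d[i][j] = best
    return d[m][n]

# The worked example from the text:
print(edit_distance(["冬", "の", "雨"], ["夏", "の", "雨"]))  # -> 2, i.e. (1) vs (2a)
print(edit_distance(["冬", "の", "雨"], ["雨", "の", "夏"]))  # -> 4, i.e. (1) vs (2b)
```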
<Paragraph position="1"> Sequential correspondence
Sequential correspondence is a measure of the maximum substring similarity between tm_i and in, normalised according to the combined segment lengths len(tm_i) and len(in). Essentially, this method requires that all substring matches submatch(tm_i, in) between tm_i and in be calculated, and the maximum seqcorr ratio returned, where seqcorr is defined as:

seqcorr(tm_i, in) = 2 × |submatch(tm_i, in)| / (len(tm_i) + len(in))    (3)</Paragraph> <Paragraph position="2"> Here, the cardinality operator applied to submatch(tm_i, in) returns the combined segment length of the matching substrings, weighted according to sweight. That is:

|submatch(tm_i, in)| = Σ_j Σ_k sweight(ss_{j,k})    (4)

for each segment ss_{j,k} of each matching substring ss_j ∈ submatch(tm_i, in).</Paragraph> <Paragraph position="3"> Returning to our example from above, the similarity for (1) and (2a) is 2×2 / (3+3) = 2/3, whereas that for (1) and (2b) is 2×1 / (3+3) = 1/3.</Paragraph> <Paragraph position="4"> Weighted sequential correspondence
Weighted sequential correspondence, the last of the word order-sensitive methods, is an extension of sequential correspondence. It attempts to remedy the deficiency of sequential correspondence that the contiguity of substring matches is not taken into consideration. Given the input string a1 a2 a3 a4, for example, sequential correspondence would suggest equal similarity (of 8/11) with the strings a1 b a2 c a3 d a4 and a1 a2 a3 a4 e f g, despite the second of these being more likely to produce a translation at least partially resembling that of the input string.</Paragraph> <Paragraph position="5"> We get around this by associating an incremental weight with each matching segment, assessing the contiguity of left-neighbouring segments, in the manner described by Sato (1992) for character-based matching. Namely, the kth segment of a matched substring is given the multiplicative weight min(k, Max), where Max was set to 4 in evaluation, after Sato. |submatch(tm_i, in)| from equation (3) thus becomes:

|submatch(tm_i, in)| = Σ_j Σ_k min(k, Max) × sweight(ss_{j,k})    (5)</Paragraph> <Paragraph position="6"> We similarly modify the definition of the len function for a string s to:

len(s) = Σ_j min(j, Max) × sweight(s_j)    (6)

for each segment s_j of s.</Paragraph> </Section>
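The paper does not spell out the submatch computation procedurally (the authors describe a naive recursive exhaustive search in Perl 5; see Section 3.2). The sketch below is one possible dynamic-programming reading in Python, treating submatch as the best-scoring order-preserving, non-overlapping set of common substrings; this interpretation, the function names and the punctuation set are our own assumptions.

```python
MAX_CONTIG = 4  # the "Max" of equations (5) and (6), set to 4 after Sato (1992)

def sweight(seg):
    return 0 if seg in {"、", "。", "・"} else 1  # illustrative punctuation set

def weighted_len(segs, max_contig=MAX_CONTIG):
    # the modified len() of equation (6): the j-th segment is weighted by min(j, Max)
    return sum(min(j, max_contig) * sweight(s) for j, s in enumerate(segs, start=1))

def weighted_submatch(tm, inp, max_contig=MAX_CONTIG):
    """Best contiguity- and sweight-weighted score over order-preserving,
    non-overlapping common substrings of tm and inp (equation (5)).
    dp[j][k] is the best score over the tm prefix processed so far and the first
    j segments of inp, where k > 0 means a contiguous match of (capped) length k
    ends exactly at that point, and k = 0 means no match ends there."""
    NEG = float("-inf")
    n = len(inp)
    dp = [[0.0] + [NEG] * max_contig for _ in range(n + 1)]  # empty tm prefix
    for i in range(1, len(tm) + 1):
        new = [[NEG] * (max_contig + 1) for _ in range(n + 1)]
        new[0][0] = 0.0
        for j in range(1, n + 1):
            # break any run by skipping tm[i-1] or inp[j-1]
            new[j][0] = max(max(dp[j]), max(new[j - 1]))
            if tm[i - 1] == inp[j - 1]:
                for k in range(max_contig + 1):
                    if dp[j - 1][k] == NEG:
                        continue
                    run = min(k + 1, max_contig)  # the new segment is the run-th of its substring
                    new[j][run] = max(new[j][run], dp[j - 1][k] + run * sweight(tm[i - 1]))
        dp = new
    return max(max(row) for row in dp)

def weighted_seqcorr(tm, inp, max_contig=MAX_CONTIG):
    # equation (3), with |submatch| and len as redefined in equations (5) and (6);
    # max_contig = 1 collapses the contiguity weighting, giving plain sequential correspondence
    return 2 * weighted_submatch(tm, inp, max_contig) / (
        weighted_len(tm, max_contig) + weighted_len(inp, max_contig))

# The contiguity example from the text (all segments have sweight 1):
a = ["a1", "a2", "a3", "a4"]
print(weighted_seqcorr(a, ["a1", "b", "a2", "c", "a3", "d", "a4"], 1))  # plain seqcorr: 8/11
print(weighted_seqcorr(a, ["a1", "a2", "a3", "a4", "e", "f", "g"], 1))  # plain seqcorr: 8/11
print(weighted_seqcorr(a, ["a1", "b", "a2", "c", "a3", "d", "a4"]))     # weighted: 0.25
print(weighted_seqcorr(a, ["a1", "a2", "a3", "a4", "e", "f", "g"]))     # weighted: 0.625
```

As intended, the weighted variant separates the two strings that plain sequential correspondence scores identically, favouring the contiguous match.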
<Section position="2" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 3.2 Retrieval speed optimisation </SectionTitle> <Paragraph position="0"> While this paper is mainly concerned with accuracy, we take a moment here to discuss the potential to accelerate the proposed methods, to get a feel for their relative speeds in actual retrieval.</Paragraph> <Paragraph position="1"> One immediate and effective way in which we can limit the search space for all methods is to use the current top-ranking score in establishing upper and lower bounds on the length of strings which have the potential to better that score. For token intersection, for example, from the fixed length len(in) of the input string in and the current top score α, we can calculate the following bounds based on the greatest possible degree of match between in and tm_i:

Upper bound:  len(tm_i) ≤ ⌊(2 − α) × len(in) / α⌋    (7)
Lower bound:  len(tm_i) ≥ ⌈α × len(in) / (2 − α)⌉    (8)

In a similar fashion, we can stipulate a corridor of allowable segment lengths for tm_i for sequential correspondence and weighted sequential correspondence.</Paragraph> <Paragraph position="2"> For edit distance, we make the observation that for a current minimum edit distance of α, the following inequality over len(tm_i) must be satisfied for tm_i to have a chance of bettering α:

len(in) − α < len(tm_i) < len(in) + α    (9)</Paragraph> <Paragraph position="4"> We can also limit the number of string comparisons required to reach the optimal match with in, by indexing each tm_i by its component segments and working through the component segments of in in ascending order of global frequency. At each iteration, we consider each previously unmatched translation record containing the current segment token, adjusting the upper and lower bounds as we go, given that the translation records for a given iteration cannot have contained segment tokens already processed. The maximum possible segment correspondence between the strings is therefore decreasing on each iteration.</Paragraph> <Paragraph position="5"> We are also able to completely discount strings with no segment component in common with in in this way.</Paragraph> <Paragraph position="6"> Through these two methods, we were able to greatly reduce the number of string comparisons in the word-based indexing evaluation for the VSM, token intersection, sequential correspondence and weighted sequential correspondence methods in particular, and for edit distance to a lesser degree. The degree of reduction for character-based indexing was not as marked, due to the massive increase in the number of translation records sharing some character content with in.</Paragraph> <Paragraph position="7"> There is also considerable scope to accelerate the matching mechanisms used by the word order-sensitive approaches. Currently, all approaches are implemented in Perl 5, and the word order-sensitive approaches use a naive, highly recursive method to exhaustively generate all substring matches and determine the similarity for each. One obvious way in which we could enhance this implementation would be to use an N-gram index as proposed by Nagao and Mori (1994). Dynamic programming (DP) techniques would undoubtedly lead to greater efficiency, as suggested by Cranias et al. (1995, 1997) and also Planas and Furuse (this volume).</Paragraph>
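To make the pruning concrete, the following is a small Python sketch of the length corridors above (the helper names are ours; equation (9) is used as reconstructed above, and the token intersection score α is assumed to satisfy 0 < α ≤ 1).

```python
import math

def token_intersection_bounds(len_in, alpha):
    """Equations (7) and (8): only TM strings whose sweight-based segment length
    lies in [lower, upper] can possibly better the current top score alpha."""
    upper = math.floor((2 - alpha) * len_in / alpha)
    lower = math.ceil(alpha * len_in / (2 - alpha))
    return lower, upper  # e.g. token_intersection_bounds(10, 0.5) == (4, 30)

def edit_distance_bounds(len_in, alpha):
    """Equation (9): a TM string can only better a current best edit distance alpha
    if its length differs from len(in) by less than alpha (exclusive bounds)."""
    return len_in - alpha, len_in + alpha

def candidates_within_bounds(tm_lengths, len_in, alpha):
    # filter TM record indices by the token intersection length corridor
    lower, upper = token_intersection_bounds(len_in, alpha)
    return [i for i, length in enumerate(tm_lengths) if lower <= length <= upper]
```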
</Section> </Section> <Section position="5" start_page="37" end_page="38" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 4.1 Evaluation specifications </SectionTitle> <Paragraph position="0"> Evaluation was partitioned off into character-based and word-based indexing for the various similarity methods. For word-based indexing, segmentation was carried out with ChaSen v2.0b (Matsumoto et al., 1999). No attempt was made to post-edit the segmented output, in the interests of maintaining consistency in the data.</Paragraph> <Paragraph position="1"> Segmented and non-segmented strings were tested using a single program, with segment length set to a single character for non-segmented strings.</Paragraph> <Paragraph position="2"> As test data, we used 2336 unique translation records deriving from technical field reports on construction machinery translated from Japanese into English. Translation records varied in size from single-word technical terms taken from an SL technical glossary, to multiple-sentence strings, at an average segment length of 13.4 and an average character length of 26.1. All Japanese strings of length 6 characters or more (a total of 1802 strings) were extracted from the test data, leaving a residue glossary of technical terms (533 strings) for which we would not expect to find useful matches in the TM. The retrieval accuracy over the 1802 longer strings was then verified by 10-fold cross validation, including the glossary in the test TM on each iteration.</Paragraph> <Paragraph position="3"> Note that the test data was pre-partitioned into single technical terms, single sentences or sentence clusters, each constituting a single translation record. Partitions were taken as given in evaluation, whereas for real-world TM systems the automation of this process comprises an important component of the overall system, preceding translation retrieval. While acknowledging the importance of this step and its interaction with retrieval performance, we choose to sidestep it for the purposes of this paper, and leave it for future research.</Paragraph> <Paragraph position="4"> In an effort to make evaluation as objective and empirical as possible, the appropriateness of the translation candidate(s) proposed by the different metrics was evaluated according to the minimum edit distance between the translation candidate(s) and the unique model translation. In this, we transferred the edit distance method described above directly across to the target language (English), with segments as words and a corresponding sweight schema under which punctuation and stop words are weighted down. Stop words are defined as those contained within the SMART (Salton, 1971) stop word list.5 The system output was judged to be correct if it contained a translation optimally close to the model translation; the average optimal edit distance from the model translation was 4.73.

5 ftp://ftp.cs.cornell.edu/pub/smart/english.stop</Paragraph> <Paragraph position="5"> We set the additional criterion that the different metrics should be able to determine whether the top-ranking translation candidate is likely to be useful to the translator, and that no output should be given if the closest matching translation record was outside a certain range of &quot;translation usefulness&quot;. In practice, this was set to the edit distance between the model translation and the empty string (i.e. the edit cost of creating the model translation from scratch). This cutoff point was realised for the different similarity metrics by thresholding over the similarity scores.</Paragraph>
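As a concrete (and necessarily hedged) illustration of this evaluation procedure, the Python sketch below scores candidates against the model translation by word-level edit distance under an English sweight function. The stop-word weight, the punctuation test, the tiny stop-word subset and all names are illustrative stand-ins; the paper's actual sweight values for English are not reproduced here.

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for"}  # tiny illustrative
# subset; the paper uses the full SMART stop word list

def sweight_en(word, stop_weight=0.0):
    # stop_weight is an assumed value, not one taken from the paper
    if all(ch in string.punctuation for ch in word):
        return 0.0
    if word.lower() in STOP_WORDS:
        return stop_weight
    return 1.0

def weighted_edit_distance(a, b, sweight=sweight_en):
    """Insertion/deletion edit distance over word lists, with sweight-based costs."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + sweight(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + sweight(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + sweight(a[i - 1]),
                          d[i][j - 1] + sweight(b[j - 1]))
            if a[i - 1] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 1][j - 1])
    return d[m][n]

def output_is_correct(candidates, model, tm_translations):
    """The output is judged correct if it contains a translation optimally close to
    the model, i.e. at the minimum edit distance achievable over the whole TM."""
    dist = lambda t: weighted_edit_distance(t.split(), model.split())
    optimal = min(dist(t) for t in tm_translations)
    return any(dist(c) <= optimal for c in candidates)
```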
<Paragraph position="6"> The different thresholds settled upon experimentally for all similarity metrics are given in brackets in the second column of Table 1, with the threshold for edit distance dynamically set to the edit distance between the input and the empty string.</Paragraph> <Paragraph position="7"> We set ourselves apart from conventional research on TM retrieval performance in adopting this objective numerical evaluation method. Traditionally, retrieval performance has been gauged by the subjective usefulness of the closest matching element of the system output, as judged by a human and described by way of a discrete set of translation quality descriptors (e.g. Nakamura, 1989; Sumita and Tsutsumi, 1991; Sato, 1992). Perhaps the closest evaluation attempts to what we propose are those of Planas and Furuse (1999), in setting a mechanical cutoff for &quot;translation usability&quot; as the ability to generate the model translation from a given translation candidate by editing less than half of the component words, and Nirenburg et al. (1993), in calculating the weighted number of key strokes required to convert the system output into an appropriate translation of the original input. The method of Nirenburg et al. (1993) is certainly more indicative of true target language usefulness, but is dependent on the competence of the translator editing the TM system output, and is not automated to the degree that our method is.</Paragraph> </Section> </Section> </Paper>