The Effects of Word Order and Segmentation on Translation 
Retrieval Performance 
Timothy Baldwin and Hozumi Tanaka 
Tokyo Institute of Technology
2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, JAPAN
{tim, tanaka}@cl.cs.titech.ac.jp
Abstract 
This research looks at the effects of word order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system. We implement a number of both bag-of-words and word order-sensitive similarity metrics, and test each over character-based and word-based indexing. The translation retrieval performance of each system configuration is evaluated empirically through the notion of word edit distance between translation candidate outputs and the model translation. Our results indicate that character-based indexing is consistently superior to word-based indexing, suggesting that segmentation is an unnecessary luxury in the given domain. Word order-sensitive approaches are demonstrated to generally outperform bag-of-words methods, with source language segment-level edit distance proving the most effective similarity metric.
1 Introduction 
Translation memories (TM's) are a well-established technology within the human and machine translation fraternities, due to the high translation precision they afford. Essentially, TM's are a list of translation records (source language strings paired with a unique target language translation), which the TM system accesses in suggesting a list of target language translation candidates which may be helpful to the translator in translating a given source language input.¹
Naturally, TM systems have no way of accessing the target language equivalent of the source language input, and hence the list of target language translation candidates is determined based on source language similarity between the current input and translation examples within the TM, with translation equivalent(s) of maximally similar source language string(s) given as the translation candidate(s). This is based on the assumption that structural and semantic similarities between target language translations will be reflected in the original source language equivalents.
One reason for the popularity of TM's is the low operational burden they pose to the user, in that translation pairs are largely acquired automatically from observation of the incremental translation process, and translation candidates can be produced on demand almost instantaneously. To support this low overhead, TM systems must allow fast access into the potentially large-scale TM, but at the same time be able to predict translation similarity with high accuracy. Here, there is clearly a trade-off between access/retrieval speed and predictive accuracy of the retrieval mechanism. Traditionally, research on TM retrieval methods has focused on speed, with little cross-evaluation of the accuracy of different methods. We prefer to focus on accuracy, and present empirical data evidencing the relative predictive potential of different similarity metrics over different parameterisations.

¹ See Planas (1998) for a thorough review of commercial TM systems.
In this paper, we focus on comparison of different retrieval algorithms for non-segmenting languages, based around a TM system from Japanese to English. Non-segmenting languages are those which do not involve delimiters (e.g. spaces) between words, and include Japanese, Chinese and Thai. We are particularly interested in the part the orthogonal parameters of segmentation and word order play in the speed/accuracy trade-off. That is, by doing away with segmentation and relying solely on character-level comparison (character-based indexing), do we significantly degrade match performance, as compared to word-level comparison (word-based indexing)? Similarly, by ignoring word order and treating each source language string as a "bag of words", do we genuinely lose out over word order-sensitive approaches? The main objective of this research is thus to determine whether the computational overhead associated with more stringent approaches (i.e. word-based indexing and word order-sensitive approaches) is commensurate with the performance gains they offer.
To preempt what follows, the major contributions of this research are: (a) empirical evaluation of different comparison methods over actual Japanese-English TM data, focusing on four orthogonal retrieval paradigms; (b) the finding that, over the target data, character-based indexing is consistently superior to word-based indexing in identifying the translation candidate most similar to the optimal translation for a given input; and (c) empirical verification of the supremacy of word order-sensitive exhaustive string comparison methods over boolean match methods.
In the following sections we discuss the effects
of segmentation and word order (§ 2) and present a number of both bag-of-words and word order-sensitive similarity metrics (§ 3), before going on to evaluate the different methods with character-based and word-based indexing (§ 4). We then conclude the paper in Section 5.
2 Segmentation and word order 
Using segmentation to divide strings into component words or morphemes has the obvious advantage of clustering characters into semantic units, which in the case of ideogram-based languages such as Japanese (in the form of kanji characters) and Chinese, generally disambiguates character meaning. The kanji character '弁', for example, can be used to mean any of "to discern/discriminate", "to speak/argue" and "a valve", but word context easily resolves such ambiguity. In this sense, our intuition is that segmented strings should produce better results than non-segmented strings.
Looking to past research on similarity metrics for TM systems, almost all systems involving Japanese as the source language rely on segmentation (e.g. (Nakamura, 1989; Sumita and Tsutsumi, 1991; Kitamura and Yamamoto, 1996; Tanaka, 1997)), with Sato (1992) and Sato and Kawase (1994) providing rare instances of character-based systems.
By avoiding the need to segment text, we: (a) alleviate computational overhead; (b) avoid the need to commit ourselves to a particular analysis type in the case of ambiguity; (c) avoid the issue of how to deal with unknown words; (d) avoid the need for stemming/lemmatisation; and (e) to a large extent get around problems related to the normalisation of lexical alternation (see Baldwin and Tanaka (1999) for a discussion of problems related to lexical alternation in Japanese). Additionally, we can use the commonly ambiguous nature of individual kanji characters to our advantage, in modelling semantic similarity between related words with character overlap. With word-based indexing, this would only be possible with the aid of a thesaurus.
Similarly for word order, we would expect that translation records that preserve the word (segment) order observed in the input string would provide closer-matching translations than translation records containing those same segments in a different order. Naturally, enforcing preservation of word order is going to place a significant burden on the matching mechanism, in that a number of different substring match schemata are inevitably going to be produced between any two strings, each of which must be considered on its own merits.
To the authors' knowledge, there is no TM system operating from Japanese that does not rely on word/segment/character order to some degree. Tanaka (1997) uses pivotal content words identified by the user to search through the TM and locate translation records which contain those same content words in the same order and preferably the same segment distance apart. Nakamura (1989) similarly gives preference to translation records in which the content words contained in the original input occur in the same linear order, although there is the scope to back off to translation records which do not preserve the original word order. Sumita and Tsutsumi (1991) take the opposite tack in iteratively filtering out NPs and adverbs to leave only functional words and matrix-level predicates, and find translation records which contain those same key words in the same ordering, preferably with the same segment types between them in the same numbers. Nirenburg et al. (1993) propose a word order-sensitive metric based on "string composition discrepancy", and incrementally relax the restriction on the quality of match required to include word lemmata, word synonyms and then word hypernyms, increasing the match penalty as they go. Sato and Kawase (1994) employ a more local model of character order in modelling similarity according to N-grams fashioned from the original string.
The greatest advantage in ignoring word/segment order is computational, in that we significantly reduce the search space and require only a single overall comparison per string pair. Below, we analyse whether this gain in speed outweighs any losses in retrieval performance.
3 Similarity metrics 
Due to our interest in the effects of both word order and segmentation, we must have a selection of similarity metrics compatible with the various permutations of these two parameter types. We choose to look at a number of bag-of-words and word order-sensitive methods which are compatible with both character-based and word-based indexing, and vary the input to model the effects of the two indexing paradigms. The particular bag-of-words approaches we target are the vector space model (Manning and Schütze, 1999, p300) and "token intersection", a simple ratio-based similarity metric. For word order-sensitive approaches, we test edit distance (Wagner and Fisher, 1974; Planas and Furuse, 1999), "sequential correspondence" and "weighted sequential correspondence".
Each of the similarity metrics empirically describes the similarity between two input strings tm_i and in,² where we define tm_i as a source language string taken from the TM and in as the input string which we are seeking to match within the TM.
One feature of all similarity metrics given here is that they have fine-grained discriminatory potential and are able to narrow down the final set of translation candidates to a handful of outputs, and in most cases to one. This was a deliberate design decision, aimed at example-based machine translation applications, where human judgement cannot be relied upon to single out the most appropriate translation from multiple system outputs. In this, we set ourselves apart from the research of Sumita and Tsutsumi (1991), for example, who judge the system to have been successful if there are a total of 100 or fewer outputs and a useful translation is contained within them. Note that it would be a relatively simple procedure to fan out the number of outputs to n in our case, by taking the top n ranking outputs.

² Note that the ordering here is arbitrary, and that all the similarity metrics described herein are commutative for the given implementations.
For all similarity metrics, we weight different Japanese segment types according to their expected impact on translation, in the form of the sweight function:

Segment type      sweight
punctuation       0
other segments    1

We experimentally trialled intermediate sweight settings for different character types (in the case of character-based indexing) or segment types (in the case of word-based indexing), none of which was found to appreciably improve performance.³
3.1 Similarity metrics used in this research
Vector space model 
Within our implementation of the vector space model (VSM), the segment content of each string is described as a vector, made up of a single dimension for each segment token occurring within tm_i or in. The value of each vector component is given as the weighted frequency of that token according to its sweight value, such that any number of a given punctuation mark will produce a frequency of 0. The string similarity of tm_i and in is then defined as the cosine of the angle between the vectors of tm_i and in, calculated as:
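$$\cos(\vec{\mathit{tm}_i}, \vec{\mathit{in}}) = \frac{\vec{\mathit{tm}_i} \cdot \vec{\mathit{in}}}{|\vec{\mathit{tm}_i}| \, |\vec{\mathit{in}}|} \qquad (1)$$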
where dot product and vector length coincide with the standard definitions.
The strings tm_i of maximal similarity are those which produce the maximum value for the vector cosine.
Note that VSM considers only segment frequency and is insensitive to word order.
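A minimal Python sketch of the VSM computation in equation (1) follows. It is an illustration, not the authors' Perl implementation: segments are plain strings, character-based indexing is modelled simply as `list(text)`, and the `PUNCTUATION` inventory is a hypothetical stand-in, since the paper only states that punctuation has sweight 0 and all other segments sweight 1.

```python
import math
from collections import Counter

# Hypothetical punctuation inventory: the paper only states that
# punctuation receives sweight 0 and all other segments sweight 1.
PUNCTUATION = set("、。，．！？「」（）・,.!?()")


def sweight(segment):
    """Segment weight: 0 for punctuation, 1 for every other segment."""
    return 0.0 if segment in PUNCTUATION else 1.0


def weighted_vector(segments):
    """Map each segment token to its sweight-based frequency."""
    vec = Counter()
    for seg in segments:
        vec[seg] += sweight(seg)
    return vec


def vsm_similarity(tm, inp):
    """Cosine of the angle between the two weighted frequency vectors (equation (1))."""
    v_tm, v_in = weighted_vector(tm), weighted_vector(inp)
    dot = sum(v_tm[t] * v_in[t] for t in v_tm)
    norm = math.sqrt(sum(x * x for x in v_tm.values())) * math.sqrt(
        sum(x * x for x in v_in.values()))
    return dot / norm if norm else 0.0


# Character-based indexing: every character is a segment.
print(vsm_similarity(list("夏の雨"), list("雨の夏")))  # same segments, different order
```

Because only weighted frequencies enter the vectors, two strings holding the same segments in different orders receive the same (maximal) score, which is exactly the word order insensitivity noted above.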
Token intersection 
The token intersection of tm_i and in is defined as the cumulative intersecting frequency of tokens appearing in each of the strings, normalised according to the combined segment lengths of tm_i and in. Formally, this equates to:
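$$\mathrm{tint}(\mathit{tm}_i, \mathit{in}) = \frac{2 \times \sum_t \min\bigl(\mathrm{freq}_{\mathit{tm}_i}(t), \mathrm{freq}_{\mathit{in}}(t)\bigr)}{\mathrm{len}(\mathit{tm}_i) + \mathrm{len}(\mathit{in})} \qquad (2)$$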
where each t is a token occurring in either tm_i or in, freq_s(t) is defined as the sweight-based frequency of token t occurring in string s, and len(s) is the segment length of string s, that is, the sweight-based count of segments contained in s.

³ If anything, weighting down hiragana characters, for example, due to their common occurrence as inflectional suffixes or particles (as per Fujii and Croft (1993)) led to a significant drop in performance. Similarly, weighting down stop word-like functional parts-of-speech in Japanese had little effect, unlike weighting down stop words in the case of English (see below).
As for VSM, the string(s) tm_i most similar to in are those which generate the maximum value for tint(tm_i, in).
Note that word order does not take any part in the calculation.
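Equation (2) can be sketched in Python as below, again as an illustration under assumptions: the `PUNCTUATION` set is hypothetical, and segments are plain strings (characters for character-based indexing, words for word-based indexing).

```python
from collections import Counter

# Hypothetical punctuation inventory (sweight 0); all other segments have sweight 1.
PUNCTUATION = set("、。，．！？「」（）・,.!?()")


def sweight(segment):
    return 0.0 if segment in PUNCTUATION else 1.0


def weighted_freq(segments):
    """freq_s(t): sweight-based frequency of each token t in string s."""
    freq = Counter()
    for seg in segments:
        freq[seg] += sweight(seg)
    return freq


def seg_len(segments):
    """len(s): sweight-based count of the segments in s."""
    return sum(sweight(seg) for seg in segments)


def token_intersection(tm, inp):
    """Equation (2): twice the intersecting frequency over the combined lengths."""
    f_tm, f_in = weighted_freq(tm), weighted_freq(inp)
    overlap = sum(min(f_tm[t], f_in[t]) for t in f_tm.keys() | f_in.keys())
    total = seg_len(tm) + seg_len(inp)
    return 2 * overlap / total if total else 0.0
```

The factor of 2 scales the score into [0, 1], so identical bags of segments score 1 regardless of their order.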
Edit distance 
The first of the word order-sensitive methods is edit distance (Wagner and Fisher, 1974; Planas and Furuse, 1999). Essentially, the segment-based edit distance between strings tm_i and in is the minimum number of primitive edit operations on single segments required to transform tm_i into in (and vice versa), based upon the operations of segment equality (segments tm_{i,m} and in_n are identical), segment deletion (delete segment a from a given position in string s) and segment insertion (insert segment a into a given position in string s). The cost associated with each operation on segment a is defined as:⁴
Operation           Cost
segment equality    0
segment deletion    sweight(a)
segment insertion   sweight(a)
Unlike the other similarity metrics, smaller values indicate greater similarity for edit distance, and identical strings have edit distance 0.
The word order sensitivity of edit distance is perhaps best exemplified by way of the following example, where segment delimiters are given as '|'.

(1) 冬|の|雨 "winter rain"
(2a) 夏|の|雨 "summer rain"
(2b) 雨|の|夏 "a rainy summer"

Here, the edit distance from (1) to (2a) is 1 + 1 = 2, as one deletion operation is required to remove 冬 [huyu] "winter" and one insertion operation is required to add 夏 [natu] "summer". The edit distance from (1) to (2b), on the other hand, is 1 + 1 + 1 + 1 = 4, despite (2b) being identical in segment content to (2a). In terms of edit distance, therefore, (2a) is adjudged more similar to (1) than (2b).
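The following Python sketch computes this segment-level edit distance with the cost table above (equality 0, deletion and insertion sweight, no substitution operation). It uses the standard Wagner-Fischer style dynamic programming table rather than the naive recursive search the authors describe in § 3.2; the `PUNCTUATION` set is a hypothetical stand-in.

```python
# Hypothetical punctuation inventory (sweight 0); all other segments have sweight 1.
PUNCTUATION = set("、。，．！？「」（）・,.!?()")


def sweight(segment):
    return 0.0 if segment in PUNCTUATION else 1.0


def edit_distance(a, b, weight=sweight):
    """Segment-level edit distance using equality (cost 0), deletion and
    insertion (cost = sweight of the segment); substitution is not allowed."""
    n, m = len(a), len(b)
    # d[i][j] = cheapest way to transform a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + weight(a[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + weight(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + weight(a[i - 1]),   # delete a[i-1]
                       d[i][j - 1] + weight(b[j - 1]))   # insert b[j-1]
            if a[i - 1] == b[j - 1]:
                best = min(best, d[i - 1][j - 1])        # segment equality
            d[i][j] = best
    return d[n][m]


print(edit_distance(["冬", "の", "雨"], ["夏", "の", "雨"]))  # example (1) vs (2a)
print(edit_distance(["冬", "の", "雨"], ["雨", "の", "夏"]))  # example (1) vs (2b)
```

With deletion and insertion costing the same, the distance is symmetric in its two arguments, which matches the commutativity requirement in footnote 4.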
Sequential correspondence 
Sequential correspondence is a measure of the maximum substring similarity between tm_i and in, normalised according to the combined segment lengths len(tm_i) and len(in). Essentially, this method requires that all substring matches submatch(tm_i, in) between tm_i and in be calculated, and the maximum seqcorr ratio returned, where seqcorr is defined as:
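$$\mathrm{seqcorr}(\mathit{tm}_i, \mathit{in}) = \frac{2 \times \max |\mathrm{submatch}(\mathit{tm}_i, \mathit{in})|}{\mathrm{len}(\mathit{tm}_i) + \mathrm{len}(\mathit{in})} \qquad (3)$$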
⁴ Note that the costs for deletion and insertion must be equal to maintain commutativity.
Here, the cardinality operator applied to submatch(tm_i, in) returns the combined segment length of matching substrings, weighted according to sweight. That is:
$$|\mathrm{submatch}(\mathit{tm}_i, \mathit{in})| = \sum_{j,k} \mathit{sweight}(ss_{j,k}) \qquad (4)$$
for each segment ss_{j,k} of each matching substring ss_j ∈ submatch(tm_i, in).
Returning to our example from above, the similarity for (1) and (2a) is (2 × 2)/(3 + 3) = 2/3, whereas that for (1) and (2b) is (2 × 1)/(3 + 3) = 1/3.
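The worked example suggests one reading of max|submatch(tm_i, in)|: the largest sweight-weighted total length of a set of non-overlapping substring matches that preserve order in both strings, so that (2b) can match only one of its shared segments. Under that assumption, the maximum reduces to a weighted longest common subsequence, which the Python sketch below computes by dynamic programming instead of enumerating every substring match. `PUNCTUATION` is again a hypothetical stand-in.

```python
# Hypothetical punctuation inventory (sweight 0); all other segments have sweight 1.
PUNCTUATION = set("、。，．！？「」（）・,.!?()")


def sweight(segment):
    return 0.0 if segment in PUNCTUATION else 1.0


def seg_len(segments):
    """len(s): sweight-based count of the segments in s."""
    return sum(sweight(seg) for seg in segments)


def max_submatch(tm, inp):
    """Largest sweight-weighted length of an order-preserving set of substring
    matches, computed as a weighted longest common subsequence."""
    n, m = len(tm), len(inp)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if tm[i - 1] == inp[j - 1]:
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + sweight(tm[i - 1]))
    return best[n][m]


def seqcorr(tm, inp):
    """Equation (3)."""
    total = seg_len(tm) + seg_len(inp)
    return 2 * max_submatch(tm, inp) / total if total else 0.0
```

Under this reading, the score ignores how the matched segments are grouped into substrings, which is precisely the deficiency that weighted sequential correspondence addresses.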
Weighted sequential correspondence 
Weighted sequential correspondence, the last of the word order-sensitive methods, is an extension of sequential correspondence. It attempts to remedy a deficiency of sequential correspondence, namely that the contiguity of substring matches is not taken into consideration. Given input string a1a2a3a4, for example, sequential correspondence would suggest equal similarity (of 8/11) with strings a1ba2ca3da4 and a1a2a3a4efg, despite the second of these being more likely to produce a translation at least partially resembling that of the input string.
We get around this by associating an incremental weight with each matching segment, assessing the contiguity of left-neighbouring segments in the manner described by Sato (1992) for character-based matching. Namely, the kth segment of a matched substring is given the multiplicative weight min(k, Max), where Max was set to 4 in evaluation after Sato. |submatch(tm_i, in)| from equation (3) thus becomes:
$$|\mathrm{submatch}(\mathit{tm}_i, \mathit{in})| = \sum_{j,k} \min\bigl(k \times \mathit{sweight}(ss_{j,k}), \mathit{Max}\bigr) \qquad (5)$$
for each substring ss_j ∈ submatch(tm_i, in). We similarly modify the definition of the len function for a string s to:
$$\mathrm{len}(s) = \sum_j \min\bigl(j \times \mathit{sweight}(s_j), \mathit{Max}\bigr) \qquad (6)$$
for each segment s_j of s.
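A Python sketch of equations (5) and (6) follows, maximised by dynamic programming over order-preserving sets of substring matches (the same reading as for plain sequential correspondence). Two assumptions should be flagged: k counts every segment position within a matched substring, including zero-weight punctuation, and a contiguous run is never worth splitting, so the maximum naturally keeps runs whole. `PUNCTUATION` and `MAX` are illustrative names; only the value Max = 4 comes from the text.

```python
# Hypothetical punctuation inventory (sweight 0); other segments have sweight 1.
PUNCTUATION = set("、。，．！？「」（）・,.!?()")
MAX = 4  # contiguity cap; set to 4 after Sato (1992)


def sweight(segment):
    return 0.0 if segment in PUNCTUATION else 1.0


def weighted_len(segments, max_weight=MAX):
    """Equation (6): the j-th segment of s contributes min(j * sweight, Max)."""
    return sum(min(j * sweight(seg), max_weight)
               for j, seg in enumerate(segments, start=1))


def weighted_submatch(tm, inp, max_weight=MAX):
    """Equation (5), maximised over order-preserving sets of substring matches.

    best[i][j]: best score for the prefixes tm[:i] and inp[:j].
    run[i][j][k]: best score when tm[i-1] and inp[j-1] are matched as the
    k-th segment of a contiguous matching substring.
    """
    n, m = len(tm), len(inp)
    neg = float("-inf")
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    run = [[[neg] * (min(n, m) + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if tm[i - 1] == inp[j - 1]:
                w = sweight(tm[i - 1])
                # Start a new matching substring at this segment (k = 1).
                run[i][j][1] = best[i - 1][j - 1] + min(w, max_weight)
                # Or extend a substring ending at the diagonal neighbour.
                for k in range(2, min(i, j) + 1):
                    if run[i - 1][j - 1][k - 1] > neg:
                        run[i][j][k] = run[i - 1][j - 1][k - 1] + min(k * w, max_weight)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], *run[i][j][1:])
    return best[n][m]


def weighted_seqcorr(tm, inp, max_weight=MAX):
    total = weighted_len(tm, max_weight) + weighted_len(inp, max_weight)
    return 2 * weighted_submatch(tm, inp, max_weight) / total if total else 0.0


inp = ["a1", "a2", "a3", "a4"]
print(weighted_seqcorr(["a1", "b", "a2", "c", "a3", "d", "a4"], inp))  # scattered matches
print(weighted_seqcorr(["a1", "a2", "a3", "a4", "e", "f", "g"], inp))  # one contiguous match
```

On the a1a2a3a4 example, the contiguous candidate now scores higher than the scattered one, whereas plain sequential correspondence gives both 8/11.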
3.2 Retrieval speed optimisation
While this paper is mainly concerned with accuracy, 
we take a moment out here to discuss the potential 
to accelerate the proposed methods, to get a feel for 
their relative speeds in actual retrieval. 
One immediate and effective way in which we can limit the search space for all methods is to use the current top-ranking score in establishing upper and lower bounds on the length of strings which have the potential to better that score. For token intersection, for example, from the fixed length len(in) of input string in and current top score α, we can calculate the following bounds based on the greatest possible degree of match between in and tm_i:
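$$\text{Upper bound: } \mathrm{len}(\mathit{tm}_i) \le \left\lfloor \frac{(2 - \alpha)\,\mathrm{len}(\mathit{in})}{\alpha} \right\rfloor \qquad (7)$$

$$\text{Lower bound: } \mathrm{len}(\mathit{tm}_i) \ge \left\lceil \frac{\alpha\,\mathrm{len}(\mathit{in})}{2 - \alpha} \right\rceil \qquad (8)$$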
In a similar fashion, we can stipulate a corridor of allowable segment lengths for tm_i for sequential correspondence and weighted sequential correspondence. For edit distance, we make the observation that for a current minimum edit distance of α, the following inequality over len(tm_i) must be satisfied for tm_i to have a chance of bettering α:
$$\mathrm{len}(\mathit{in}) - \alpha < \mathrm{len}(\mathit{tm}_i) < \mathrm{len}(\mathit{in}) + \alpha \qquad (9)$$
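The length filters in equations (7) to (9) can be sketched as follows. The function names are hypothetical, and integer segment lengths are assumed (every non-punctuation segment has sweight 1). The bounds follow because token intersection is at most 2 × min(len(tm_i), len(in)) / (len(tm_i) + len(in)), and because every unmatched segment adds its sweight to the edit distance, so the distance can never fall below |len(tm_i) − len(in)|.

```python
import math


def tint_length_bounds(len_in, alpha):
    """Equations (7) and (8): the range of len(tm_i) that can still reach a token
    intersection score of alpha, given a fixed len(in) and 0 < alpha <= 1."""
    upper = math.floor((2 - alpha) * len_in / alpha)
    lower = math.ceil(alpha * len_in / (2 - alpha))
    return lower, upper


def may_beat_edit_distance(len_tm, len_in, alpha):
    """Equation (9): tm_i can only undercut the current minimum edit distance
    alpha if its length lies strictly within alpha of len(in)."""
    return len_in - alpha < len_tm < len_in + alpha


print(tint_length_bounds(10, 0.5))  # candidate lengths for len(in) = 10, score 0.5
```

In practice the bounds are recomputed whenever the running best score α improves, so the corridor of lengths that must be compared narrows as retrieval proceeds.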
We can also limit the number of string comparisons required to reach the optimal match with in, by indexing each tm_i by its component segments and working through the component segments of in in ascending order of global frequency. At each iteration, we consider each previously unmatched translation record containing the current segment token, adjusting the upper and lower bounds as we go, given that translation records for a given iteration cannot have contained segment tokens already processed. The maximum possible segment correspondence between the strings is therefore decreasing on each iteration. We are also able to completely discount strings with no segment component in common with in in this way.
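A sketch of this frequency-ordered traversal, specialised to token intersection, is given below. All names are hypothetical, and for brevity every segment has sweight 1. A record first reached while processing a given input segment cannot contain any segment processed earlier, so its overlap with in is at most the number of input segments still unprocessed. Since len(tm_i) is at least that overlap, its score is at most 2r / (r + len(in)) for r remaining segments, which lets the loop stop early.

```python
from collections import Counter, defaultdict


def token_intersection(tm, inp):
    """Unweighted token intersection (every segment has sweight 1 here)."""
    overlap = sum((Counter(tm) & Counter(inp)).values())
    total = len(tm) + len(inp)
    return 2 * overlap / total if total else 0.0


def build_index(tm_records):
    """Inverted index from segment to the ids of records containing it,
    plus the global frequency of each segment over the whole TM."""
    index = defaultdict(set)
    global_freq = Counter()
    for rid, segments in enumerate(tm_records):
        global_freq.update(segments)
        for seg in segments:
            index[seg].add(rid)
    return index, global_freq


def retrieve(inp, tm_records, index, global_freq):
    """Visit records segment by segment, rarest input segment first, and stop
    once no unvisited record can reach the current best score."""
    in_counts = Counter(inp)
    order = sorted(in_counts, key=lambda seg: (global_freq[seg], seg))
    remaining = len(inp)  # input segments not yet processed (with multiplicity)
    visited, best, best_ids = set(), 0.0, []
    for seg in order:
        # Upper bound on the score of any record not yet visited.
        if 2 * remaining / (remaining + len(inp)) < best:
            break
        for rid in sorted(index.get(seg, ())):
            if rid in visited:
                continue
            visited.add(rid)
            score = token_intersection(tm_records[rid], inp)
            if score > best:
                best, best_ids = score, [rid]
            elif score == best:
                best_ids.append(rid)
        remaining -= in_counts[seg]
    return best, best_ids


tm = [list("冬の雨"), list("夏の雨"), list("雪"), list("雨の夏")]
index, global_freq = build_index(tm)
print(retrieve(list("夏の雨"), tm, index, global_freq))
```

Records sharing no segment with the input (here 雪) are never scored at all, which is the "completely discount" case mentioned above.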
Through these two methods, we were able to greatly reduce the number of string comparisons in word-based indexing evaluation for the VSM, token intersection, sequential correspondence and weighted sequential correspondence methods in particular, and for edit distance to a lesser degree. The degree of reduction for character-based indexing was not as marked, due to the massive increase in the number of translation records sharing some character content with in.
There is also considerable scope to accelerate the matching mechanisms used by the word order-sensitive approaches. Currently, all approaches are implemented in Perl 5, and the word order-sensitive approaches use a naive, highly recursive method to exhaustively generate all substring matches and determine the similarity for each. One obvious way in which we could enhance this implementation would be to use an N-gram index as proposed by Nagao and Mori (1994). Dynamic Programming (DP) techniques would undoubtedly lead to greater efficiency, as suggested by Cranias et al. (1995, 1997) and also Planas and Furuse (this volume).
4 Evaluation 
4.1 Evaluation specifications 
Evaluation was partitioned into character-based and word-based indexing for the various similarity methods. For word-based indexing, segmentation was carried out with ChaSen v2.0b (Matsumoto et al., 1999). No attempt was made to post-edit the segmented output, in the interests of maintaining consistency in the data. Segmented and non-segmented strings were tested using a single program, with segment length set to a single character for non-segmented strings.
As test data, we used 2336 unique translation records deriving from technical field reports on construction machinery translated from Japanese into English. Translation records varied in size from single-word technical terms taken from a technical glossary, to multiple-sentence strings, at an average segment length of 13.4 and average character length of 26.1. All Japanese strings of length 6 characters or more (a total of 1802 strings) were extracted from the test data, leaving a residual glossary of technical terms (533 strings), for which we would not expect to find useful matches in the TM. The retrieval accuracy over the 1802 longer strings was then verified by 10-fold cross validation, including the glossary in the test TM on each iteration.

Indexing         Similarity metric (threshold)   Accuracy        Edit discrep.   Ave. outputs   Ave. time
Character-based  Vector space model (0.5)        44.0            4.86            1.04 (0.97)    2.14
                 Token intersection (0.4)        44.3            3.25            1.01 (0.99)    2.24
                 Edit distance (len(in))         50.2            1.82            1.39 (0.80)    4.75
                 Sequential corr. (0.4)          46.6            2.92            1.02 (0.98)    3.20
                 Weighted seq. corr. (0.2)       45.6            2.89            1.04 (0.97)    4.10
Word-based       Vector space model (0.5)        43.7 (-0.8%)    5.21            1.17 (0.91)    0.76
                 Token intersection (0.4)        43.0 (-2.9%)    3.12            1.01 (0.99)    0.88
                 Edit distance (len(in))         47.3 (-5.9%)    2.03            1.90 (0.69)    1.00
                 Sequential corr. (0.4)          43.1 (-7.4%)    3.06            1.01 (0.99)    1.10
                 Weighted seq. corr. (0.2)       40.7 (-10.7%)   3.30            1.14 (0.92)    1.24

Table 1: Results for the different similarity metrics under character-based and word-based indexing
Note that the test data was pre-partitioned into single technical terms, single sentences or sentence clusters, each constituting a single translation record. Partitions were taken as given in evaluation, whereas for real-world TM systems, the automation of this process comprises an important component of the overall system, preceding translation retrieval. While acknowledging the importance of this step and its interaction with retrieval performance, we choose to sidestep it for the purposes of this paper, and leave it for future research.
In an effort to make evaluation as objective and empirical as possible, the appropriateness of the translation candidate(s) proposed by the different metrics was evaluated according to the minimum edit distance between the translation candidate(s) and the unique model translation. In this, we transferred the edit distance method described above directly across to the target language (English), with segments as words and the following sweight schema:
Segment type    sweight
punctuation     0
stop words      0.2
other words     1
Stop words are defined as those contained within the SMART (Salton, 1971) stop word list.⁵ The system output was judged to be correct if it contained a translation optimally close to the model translation; the average optimal edit distance from the model translation was 4.73.
⁵ ftp://ftp.cs.cornell.edu/pub/smart/english.stop
We set the additional criterion that the different metrics should be able to determine whether the top-ranking translation candidate is likely to be useful to the translator, and that no output should be given if the closest matching translation record was outside a certain range of "translation usefulness". In practice, this was set to the edit distance between the model translation and the empty string (i.e. the edit cost of creating the model translation from scratch). This cutoff point was realised for the different similarity metrics by thresholding over the similarity scores. The different thresholds settled upon experimentally for all similarity metrics are given in brackets in the second column of Table 1, with the threshold for edit distance dynamically set to the edit distance between the input and the empty string.
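The target-side evaluation measure and the usefulness cutoff can be sketched together as below. `STOP_WORDS` is a hypothetical eight-word stand-in for the much longer SMART list, punctuation is detected with Python's `string.punctuation`, and the example sentences are invented for illustration. The cutoff is simply the edit distance from the empty string, i.e. the summed sweight of the model translation.

```python
import string

# Hypothetical stand-in for the SMART stop word list, which is much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def english_sweight(word):
    """Target-side sweight: punctuation 0, stop words 0.2, other words 1."""
    if all(ch in string.punctuation for ch in word):
        return 0.0
    if word.lower() in STOP_WORDS:
        return 0.2
    return 1.0


def edit_distance(a, b, weight=english_sweight):
    """Word-level edit distance using only deletion and insertion, each
    costing the sweight of the word concerned."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + weight(a[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + weight(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + weight(a[i - 1]),
                          d[i][j - 1] + weight(b[j - 1]))
            if a[i - 1] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 1][j - 1])
    return d[n][m]


def usefulness_cutoff(model):
    """Edit cost of building the model translation from the empty string."""
    return edit_distance([], model)


model = "the engine is running .".split()
for candidate in ["the engine is stopped", "engine running"]:
    print(candidate, edit_distance(candidate.split(), model), usefulness_cutoff(model))
```

Because stop words cost only 0.2, a candidate that differs from the model solely in function words ("engine running") is judged much closer than one that differs in a content word ("the engine is stopped").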
We set ourselves apart from conventional research on TM retrieval performance in adopting this objective numerical evaluation method. Traditionally, retrieval performance has been gauged by the subjective usefulness of the closest matching element of the system output (as judged by a human), and described by way of a discrete set of translation quality descriptors (e.g. (Nakamura, 1989; Sumita and Tsutsumi, 1991; Sato, 1992)). Perhaps the closest evaluation attempts to what we propose are those of Planas and Furuse (1999), in setting a mechanical cutoff for "translation usability" as the ability to generate the model translation from a given translation candidate by editing less than half the component words, and Nirenburg et al. (1993), in calculating the weighted number of keystrokes required to convert the system output into an appropriate translation for the original input. The method of Nirenburg et al. (1993) is certainly more indicative of true target language usefulness, but is dependent on the competence of the translator editing the TM system output, and not automated to the degree our method is.
4.2 Results 
The results for the different similarity metrics with character-based and word-based indexing are given in Table 1, with the two bag-of-words approaches partitioned off from the three word order-sensitive approaches for each indexing paradigm. "Accuracy" is an indication of the proportion of inputs for which an optimal translation was produced; character-based indexing accuracies in bold indicate a significant⁶ advantage over the corresponding word-based indexing accuracy, and figures in brackets for word-based indexing indicate the relative performance gain over the corresponding character-based indexing configuration. "Edit discrep." refers to the mean minimum edit distance discrepancy between translation candidate(s) and optimal translation(s) in the case of the translation candidate set containing no optimal translations. "Ave. outputs" describes the average number of translation candidates output by the system, with the figure in brackets being the proportion of inputs for which a unique translation candidate was produced. "Ave. time" describes the average time taken to determine the translation candidate(s) for a single input, relative to the time taken for word-based edit distance retrieval.
Perhaps the most striking result is that character-based indexing produces a superior match accuracy to word-based indexing for all similarity metrics, at a significant margin for all three word order-based methods. This is the complete opposite of what we had expected, although it does fit in with the findings of Fujii and Croft (1993) that character-based indexing performs comparably with word-based indexing in Japanese information retrieval.
Looking to word order, we see that edit distance outperforms all other methods for both character- and word-based indexing, peaking at just over 50% for character-based indexing. The relative performance of the remaining methods is variable, with the two bag-of-words methods being superior to or roughly equivalent to sequential correspondence and weighted sequential correspondence for word-based indexing, but the word order-based methods having a clear advantage over the bag-of-words methods for character-based indexing. It is thus difficult to draw any hard and fast conclusion as to the relative merits of word order-based versus bag-of-words methods, other than to say that edit distance would appear to have a clear advantage over other methods.
The figures for edit discrepancy in the case of non-optimal translation candidate(s) are equally interesting, and suggest that on the whole, the various methods err more conservatively for character-based than word-based indexing. The most robust method is (source language) edit distance, at an edit discrepancy of 1.82 and 2.03 for character-based and word-based indexing, respectively.
All methods were able to produce just over one translation candidate on average, with all methods other than edit distance returning a unique translation candidate over 90% of the time. The greater number of outputs for the edit distance method can certainly be viewed as one reason for its inflated performance, although the lower level of ambiguity but higher accuracy for character-based indexing would tend to suggest otherwise.
Lastly, word-based indexing was found to be faster than character-based indexing across the board, for the simple reason that the number of character segments is always going to be greater than or equal to the number of word segments. The average segment lengths quoted above (26.1 characters vs. 13.4 words) indicate that we generally have twice as many characters as words in a given string. Additionally, the acceleration technique described in § 3.2, of sequentially working through the segment components of the input string in increasing order of global frequency, has a greater effect for word-based indexing than character-based indexing, accentuating any speed disparity.

⁶ As determined by the paired t test (p < 0.05).
4.3 Reflections on the results 
An immediate explanation for character-based indexing's empirical edge over word-based indexing is the semantic smoothing effect of individual kanji characters, alluded to above (§ 2). To take an example, the single-segment nouns 操作 [sôsa] and 作動 [sadô] both mean "operation", but would not match under word-based indexing. Character-based indexing, on the other hand, would recognise the overlap in character content, and in the process pick up on the semantic correspondence between the two words.
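The smoothing effect can be made concrete with unweighted token intersection. This is a small sketch, assuming that word-based indexing yields each of the two nouns as a single segment (as stated above) and that character-based indexing splits each into its two kanji.

```python
from collections import Counter


def token_intersection(tm, inp):
    """Unweighted token intersection over segment lists (sweight 1 throughout)."""
    overlap = sum((Counter(tm) & Counter(inp)).values())
    total = len(tm) + len(inp)
    return 2 * overlap / total if total else 0.0


word_based = token_intersection(["操作"], ["作動"])          # one segment per noun
char_based = token_intersection(list("操作"), list("作動"))  # one segment per kanji
print(word_based, char_based)  # 0.0 0.5
```

Under word-based indexing the two nouns share no segment at all, while under character-based indexing the shared kanji 作 yields partial credit.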
To take the opposite tack, one reason why word-based indexing may have been disadvantaged is that we did not stem or lemmatise words in word-based indexing. Having said this, the output from ChaSen is such that stems of inflecting words are given as a single segment, with inflectional morphemes each presented as separate segments. In this sense, stemming would only act to delete the inflectional morphemes, and not add anything new.
Another way in which the output of ChaSen could conceivably have affected retrieval performance is that technical terms tended to be over-segmented. Experimentally combining recognised technical terms into a single segment (particularly in the case of contiguous katakana segments, in the manner of Fujii and Croft (1993)), however, degraded rather than improved retrieval performance for both character-based and word-based indexing. As such, this side-effect of ChaSen would not appear to have impinged on retrieval accuracy.
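Merging contiguous katakana segments can be sketched as below. This is an illustrative assumption, not the exact procedure used in the experiment or by Fujii and Croft (1993): a segment is treated as katakana if every character lies in the Unicode Katakana block (U+30A0 to U+30FF, which includes the long-vowel mark ー), and the names `is_katakana` and `merge_katakana` are hypothetical.

```python
def is_katakana(segment):
    """True if every character falls in the Katakana block U+30A0-U+30FF."""
    return bool(segment) and all("\u30a0" <= ch <= "\u30ff" for ch in segment)

def merge_katakana(segments):
    """Join runs of adjacent katakana segments into a single segment."""
    merged = []
    for seg in segments:
        if merged and is_katakana(seg) and is_katakana(merged[-1]):
            merged[-1] += seg  # extend the current katakana run
        else:
            merged.append(seg)
    return merged

# An over-segmented technical term, e.g. as a morphological analyser might split it.
print(merge_katakana(["データ", "ベース", "を", "検索"]))  # ['データベース', 'を', '検索']
```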
One other plausible reason for the unexpected results is that the test data could have been in some way inherently better suited to character-based indexing than to word-based indexing, although the fact that the results were cross-validated would tend to rule out this possibility.
A surprising result was the lacklustre performance 
of the weighted sequential correspondence method as 
compared to simple sequential correspondence. We 
have no explanation for the drop in accuracy, other 
than to speculate that either the proposed formu- 
lation is in some way flawed or contiguity of match 
does not impinge on translation similarity to the de- 
gree we had expected. 
To return to the original question posed above of retrieval speed vs. accuracy, the word order-sensitive edit distance approach would seem to hold a genuine edge over the other methods in both accuracy and translation discrepancy, to a degree that would suggest the extra computational overhead is warranted. It must be said that the TM used in evaluation was too small to get a genuine feel for the computational overhead that would be experienced in a real-world TM system context of potentially millions rather than thousands of translation records. At the same time, however, coding up the edit distance procedure in a language faster than Perl, using character rather than string comparison procedures and applying dynamic programming techniques or similar, may well offset the large increase in the number of comparisons demanded of the system.
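The dynamic programming formulation referred to above is the classic one of Wagner and Fischer (1974). A minimal sketch follows, assuming unit costs for insertion, deletion and substitution; the costs and any length normalisation used in the paper's own edit distance metric are not restated in this section, so treat those as assumptions. The same function applies to character sequences (character-based indexing) and to word sequences (word-based indexing, or the word edit distance used in evaluation).

```python
def edit_distance(a, b):
    """Wagner-Fischer edit distance with unit costs, keeping only two DP rows."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(edit_distance(list("操作"), list("作動")))               # 2 (character segments)
print(edit_distance(["the", "cat", "sat"], ["the", "cat"]))  # 1 (word segments)
```

Keeping two rows gives O(|a| x |b|) time and O(|b|) memory per comparison, which is the cost that multiplies with the number of translation records in a large TM.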
5 Concluding remarks 
This research is concerned with the relative import of word order and segmentation on translation retrieval performance for a TM system. We modelled the effects of word order sensitivity vs. bag-of-words word order insensitivity by implementing a total of five similarity metrics: two bag-of-words approaches (the vector space model and "token intersection") and three word order-sensitive approaches (edit distance, "sequential correspondence" and "weighted sequential correspondence"). Each of these metrics was then tested under character-based and word-based indexing, to determine what effect segmentation would have on retrieval performance. Empirical evaluation based around the target language edit distance of proposed translation candidates revealed that character-based indexing consistently produced greater accuracy than word-based indexing, and that the word order-sensitive edit distance metric clearly outperformed all other methods under both indexing paradigms.
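To illustrate the bag-of-words side of the comparison, here is a minimal vector space model sketch in the style of Salton (1971): cosine similarity over raw segment counts. The raw term-frequency weighting and the function name `cosine` are assumptions for illustration; the paper's actual weighting scheme is not restated in this section.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two segment sequences using raw term frequencies."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0  # empty input: define similarity as 0

print(round(cosine(list("操作"), list("作動")), 3))  # 0.5
print(round(cosine(["a", "b"], ["b", "a"]), 3))      # 1.0: word order is ignored
```

The second call shows why such metrics are word order-insensitive: a reordered string scores as a perfect match, whereas an order-sensitive metric such as edit distance would distinguish the two.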
The main area in which we feel this research could be enhanced is to validate the findings of this paper by expanding evaluation to other domains and test sets, which we leave as an item for future research. We also skirted around the issue of translation record partitioning, and wish to investigate how different partitioning methods perform against each other. One important area in which we hope to expand our research is to look at the effects of character type on character-based indexing. Kanji would appear to be helping the case of character-based indexing at present, and it would be highly revealing to look at whether results comparable to those presented here would be produced for full kana-based (alphabetic) Japanese input, or for other alphabet-based non-segmenting languages such as Thai.
Acknowledgements 
Vital input into this research was received from Francis Bond (NTT), Emmanuel Planas (NTT), and three anonymous reviewers.

References 

T. Baldwin and H. Tanaka. 1999. The applications of unsupervised learning to Japanese grapheme-phoneme alignment. In Proc. of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 9-16.

L. Cranias, H. Papageorgiou, and S. Piperidis. 1995. A Matching Technique in Example-Based Machine Translation. cmp-lg/9508005.

L. Cranias, H. Papageorgiou, and S. Piperidis. 1997. Example retrieval from a translation memory. Natural Language Engineering, 3(4):255-77.

H. Fujii and W.B. Croft. 1993. A comparison of indexing techniques for Japanese text retrieval. In Proc. of 16th International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR'93), pages 237-46.

M. Kitamura and H. Yamamoto. 1996. Translation retrieval system using alignment data from parallel texts. In Proc. of the 53rd Annual Meeting of the IPSJ, volume 2, pages 385-6. (In Japanese).

C. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Y. Matsumoto, A. Kitauchi, T. Yamashita, and Y. Hirano. 1999. Japanese Morphological Analysis System ChaSen Version 2.0 Manual. Technical Report NAIST-IS-TR99009, NAIST.

M. Nagao and S. Mori. 1994. A new method of N-gram statistics for large number of N and automatic extraction of words and phrases from large text data of Japanese. In Proc. of the 15th International Conference on Computational Linguistics (COLING 94), pages 611-5.

N. Nakamura. 1989. Translation support by retrieving bilingual texts. In Proc. of the 38th Annual Meeting of the IPSJ, volume 1, pages 357-8. (In Japanese).

S. Nirenburg, C. Domashnev, and D.J. Grannes. 1993. Two approaches to matching in example-based machine translation. In Proc. of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-93), pages 47-57.

E. Planas and O. Furuse. 1999. Formalizing translation memories. In Proc. of Machine Translation Summit VII, pages 331-9.

E. Planas. 1998. A Case Study on Memory Based Machine Translation Tools. PhD Fellow Working Paper, United Nations University.

G. Salton. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall.

S. Sato and T. Kawase. 1994. A High-Speed Best Match Retrieval Method for Japanese Text. Technical Report IS-RR-94-9I, JAIST.

S. Sato. 1992. CTM: An example-based translation aid system. In Proc. of the 14th International Conference on Computational Linguistics (COLING '92), pages 1259-63.

E. Sumita and Y. Tsutsumi. 1991. A practical method of retrieving similar examples for translation aid. Transactions of the IEICE, J74-D-II(10):1437-47. (In Japanese).

H. Tanaka. 1997. An efficient way of gauging similarity between long Japanese expressions. In Information Processing Society of Japan SIG Notes, volume 97, no. 85, pages 69-74. (In Japanese).

R. Wagner and M. Fischer. 1974. The string-to-string correction problem. Journal of the ACM, 21(1):168-73.
