CTM: An Example-Based Translation Aid System
Satoshi SATO *
School of Information Science
Japan Institute of Science and Technology, Hokuriku
Tatsunokuchi, Ishikawa, 923-12, Japan
sato@jaist-east.ac.jp
Abstract 
This paper describes a Japanese-English translation aid system,
CTM, which has a useful capability for flexible retrieval of texts
from bilingual corpora or translation databases. Translation
examples (pairs of a text and its translation equivalent) are very
helpful when translating similar text. Our character-based best
match retrieval method can retrieve translation examples similar to
the given input. This method has the following advantages: (1) it
accepts free-style translation examples, i.e., pairs of any text
string and its translation equivalent, (2) morphological analysis is
unnecessary, and (3) it accepts free-style inputs (i.e., any text
strings) for retrieval. We show retrieval examples with the
following characteristic features: phrasal expression, long-distance
dependency, idiom, synonym, and semantic ambiguity.
1 Introduction 
In the late 1980s, several commercial Japanese-English machine
translation systems were developed in Japan. In these systems, the
computer is the agent of translation, while the user assists by
editing the translation inputs and revising the results. Although
they are useful for translating large amounts of text roughly and
rapidly, high-quality translation is impossible.
Translation aid is another kind of machine translation: the user is
the agent of translation, while the computer provides him or her
with helpful tools, e.g., quick-retrieval electronic dictionaries. A
quick-retrieval bilingual corpus is also useful, specifically when
it has a flexible (best match) retrieval mechanism, because
translation examples (pairs of a source text and its translation
equivalent) are very helpful when translating similar text. This
type of system is called an example-based translation aid \[6\], and
there are two prototype systems for Japanese-English translation:
ETOC \[8\] and Nakamura's system \[5\].
* The author transferred from Kyoto University on April 1, 1992.
This work was done at Kyoto University.
\[Input Text\] (He is a great swimmer.) → Best Match Retrieval → \[Retrieved Example\] You're a great actor.
Figure 1: Basic Configuration of Example-Based Translation Aid
Figure 1 shows the basic configuration of an example-based
translation aid (EBTA). It consists of two components: the
translation database, which is the collection of translation
examples, and the best match retrieval engine, which retrieves the
example that is most similar to the given input text. The
characteristic of an EBTA system is that it accepts free-style text
inputs for retrieval: it frees the user from learning a formal
language for database queries.
The central problem in EBTA is the implementation of best match
retrieval. Two methods have been proposed: one is syntax-matching
driven by generalization rules in ETOC \[8\], and the other is
Nakamura's method using content words \[5\]. Both are word-based
best match retrieval methods,¹ which need morphological analysis.
This paper proposes a character-based best match retrieval method,
specifically for Japanese texts. Compared with the word-based
methods, the character-based method has the following advantages:
• Morphological analysis is unnecessary.
• Some kinds of synonyms can be retrieved without a thesaurus.
This method has been implemented in CTM,² a Japanese-English
translation aid system for writing/translating technical papers.
¹ In the word-based (resp. character-based) best match retrieval
method, a word (resp. character) is a primitive.
² CTM is named from the Japanese phrase "Chotto Tsukatte Mitene",
which means "use it any time you want".
Actes de COLING-92, Nantes, 23-28 août 1992 1259 Proc. of COLING-92, Nantes, Aug. 23-28, 1992
2 The Character-Based Best Match Retrieval Method
2.1 Characteristics of Japanese Written Texts
Japanese written texts have the following remarkable
characteristics, which are not found in European languages such as
English, French, and German.
1. The number of characters is very large.
The number of characters used in text is more than 7,000 in
Japanese, while it is fewer than a hundred in a European language.
2. Synonyms often share the same Kanji character.
Japanese characters are divided into three types: Hiragana (83
characters), Katakana (86 characters), and Kanji. A Hiragana or
Katakana character expresses a sound, and a Kanji character
represents a semantic primitive. For example, the Kanji character
"~" means "thinking", and it is used for constructing several words
concerned with thinking: e.g., ,~( ~ (thinking), ~",~,
(consideration), ~,~' (deep thinking), ~~Ta (think), ~\[~TTa
(devise).
3. There is no delimiter between words.
In European languages, white space is the delimiter for word
separation. In contrast, Japanese has no explicit delimiter.
Therefore, the main part of Japanese morphological analysis is
dividing a text string into words, which is not an easy task.³
These characteristics of Japanese suggest the character-based best
match, because:
1. While the word-based method needs morphological analysis, the
character-based method does not.
2. In order to retrieve synonyms, the word-based method needs a
thesaurus. In contrast, the character-based method can retrieve
some kinds of synonyms without a thesaurus, because synonyms often
share the same Kanji character in Japanese.
2.2 The Character-Based Best Match 
The character-based best match can be determined 
by defining the distance or similarity measure be- 
tween two strings. 
The simple measure of similarity between two strings,
$A = a_1 a_2 \ldots a_x$ and $B = b_1 b_2 \ldots b_y$, is the number
of matching characters under the character order constraint. It is
not a particularly good measure, but it makes a convenient starting
point. We define it as follows:

$$S(A, B) = s(x, y)$$

$$s(i,j) = \begin{cases} 0 & \text{if } i = 0 \lor j = 0 \\ \max \{\, s(i-1,j-1) + m(i,j),\; s(i-1,j),\; s(i,j-1) \,\} & \text{if } (1 \le i \le x) \land (1 \le j \le y) \end{cases}$$

$$m(i,j) = \begin{cases} 1 & \text{if } a_i = b_j \\ 0 & \text{if } a_i \ne b_j \end{cases}$$

³ For example, a Japanese morphological analysis program developed
by Kyoto University fails to analyze 3 to 5% of sentences.
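The recurrence for the simple measure is the familiar longest-common-subsequence length. The following is a minimal Python sketch of it; the function name `simple_similarity` is chosen here for illustration and is not taken from CTM.

```python
def simple_similarity(a: str, b: str) -> int:
    """Count matching characters between a and b, respecting character order."""
    x, y = len(a), len(b)
    # s[i][j] scores the first i characters of a against the first j of b.
    # Row 0 and column 0 stay 0, which is the base case s(i, j) = 0.
    s = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            m = 1 if a[i - 1] == b[j - 1] else 0
            s[i][j] = max(s[i - 1][j - 1] + m, s[i - 1][j], s[i][j - 1])
    return s[x][y]


print(simple_similarity("abcde", "xaxbxcx"))  # 3 ("a", "b", "c" match in order)
```

Note that the three scattered matches above score the same as three adjacent ones would, which is the weakness the next paragraphs address.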
This measure often produces undesirable results, because it ignores
the continuity of matching characters. For example, consider the
following strings:
A = I"I~R4~'¢70 (solve the problem)
B = t~a~ m 5,~j~1ce~ Ltco (He solved the problem yesterday.)
B' = (determine the method for solving the problem)
We want S(A,B) > S(A,B'), but the above measure produces
S(A,B) < S(A,B'). To solve this problem, we add a bonus for
continuous matching characters. This can be done by modifying
m(i,j) in the above definition:
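$$S(A, B) = s(x, y)$$

$$s(i,j) = \begin{cases} 0 & \text{if } i = 0 \lor j = 0 \\ \max \{\, s(i-1,j-1) + \min(cm(i,j), W),\; s(i-1,j),\; s(i,j-1) \,\} & \text{if } (1 \le i \le x) \land (1 \le j \le y) \end{cases}$$

$$cm(i,j) = \begin{cases} 0 & \text{if } i = 0 \lor j = 0 \\ cm(i-1,j-1) + m(i,j) & \text{if } (1 \le i \le x) \land (1 \le j \le y) \end{cases}$$

$$m(i,j) = \begin{cases} 1 & \text{if } a_i = b_j \\ 0 & \text{if } a_i \ne b_j \end{cases}$$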
This is the similarity score that we use, where W is a parameter
that determines the maximum value of the bonus for continuous
matching characters. When W = 1, this definition is the same as the
previous definition. Table 1 shows S(A,B) and S(A,B') with varying
values of W. Usually we use W = 4.⁴
⁴ This value was determined empirically. It may be explained as
follows. The average character length of a Japanese word is about
two, and we feel that the continuous matching of two words is a
strong match.
Table 1: Scores vs. W
W          1   2   3   4   5
S(A,B)     5   9  12  14  15
S(A,B')    7   9   9   9   9
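A minimal Python sketch of the bonus-weighted score follows. It rests on one assumption that the printed recurrence leaves implicit: the run counter cm falls back to zero whenever the two characters differ, so that only unbroken runs earn the bonus and W = 1 reduces to the simple measure, as the text states. The function name and the two test strings are this sketch's own; the strings are illustrative Japanese sentences that share one run of five characters, not a recovery of the paper's A and B.

```python
def similarity(a: str, b: str, w: int = 4) -> int:
    """Order-preserving match score with a bonus, capped at w, for continuous runs."""
    x, y = len(a), len(b)
    s = [[0] * (y + 1) for _ in range(x + 1)]
    cm = [[0] * (y + 1) for _ in range(x + 1)]  # length of the matching run ending at (i, j)
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            if a[i - 1] == b[j - 1]:
                cm[i][j] = cm[i - 1][j - 1] + 1
            # Assumption: on a mismatch cm[i][j] stays 0, i.e. the run is reset.
            s[i][j] = max(
                s[i - 1][j - 1] + min(cm[i][j], w),
                s[i - 1][j],
                s[i][j - 1],
            )
    return s[x][y]


# Illustrative strings (hypothetical, chosen for this sketch).
A = "問題を解決する"              # "solve the problem"
B = "彼は昨日その問題を解決した"  # "He solved the problem yesterday."
print([similarity(A, B, w) for w in range(1, 6)])  # [5, 9, 12, 14, 15]
```

With a single shared run of five characters, the k-th character of the run contributes min(k, W), so the score grows with W until W reaches the run length. This is the same progression as the S(A,B) row of Table 1.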
Table 2: Translation Database
ID   Japanese     English
1    ~ <'95"69    several
2    ~"3"C ~      every time
3    ~aO~         some day
4    ~ O') }      yesterday
Table 3: Character Index
Ch.     IDs        Ch.    IDs
~       1, 2, 3    O      1, 2, 3
fl      4          -¢     2
7~ ~    1, 3       a)     1, 4
4 ~     2          <      1
2.3 Acceleration by Character Index 
For best match retrieval, we use an acceleration method based on a
character index.⁵ The character index is a table that lists, for
every character, the IDs of the examples in which the character
appears. Table 2 shows an example of a translation database and
Table 3 shows its character index.
In the first stage of the retrieval, the character index is used
for the pre-selection of examples. Figure 2 illustrates the
pre-selection process, which is as follows:
1. Look up the records for the characters that appear in the input
string.
2. For every example, compute the pre-selection score, PSS, which
can be obtained by counting the occurrences of the example's ID in
the records. It is the number of matching characters between the
input string and the example, ignoring the character order
constraint.
3. Select the top N examples that have the largest pre-selection
scores, where N is a parameter; we usually use N = 200.⁶
In the second stage of the retrieval, the similarity scores of the
pre-selected examples are computed, and the examples are ordered by
score.
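The first stage can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CTM's implementation (which is written in C): the index stores each example ID once per distinct character, each distinct input character is looked up once, and ties are broken by the smaller ID. The function names and the toy database are made up for this sketch; its Japanese strings are loosely modelled on Table 2 but are guesses, not text recovered from the paper.

```python
from collections import defaultdict


def build_character_index(database):
    """Map each character to the set of example IDs whose Japanese text contains it."""
    index = defaultdict(set)
    for example_id, (japanese, _english) in database.items():
        for ch in japanese:
            index[ch].add(example_id)
    return index


def preselect(query, index, n=200):
    """Return (top-n example IDs, PSS per example) for the query string."""
    pss = defaultdict(int)
    for ch in set(query):                      # step 1: look up each input character
        for example_id in index.get(ch, ()):
            pss[example_id] += 1               # step 2: count ID occurrences in the records
    # step 3: largest PSS first; smaller ID first on ties (an assumption)
    ranked = sorted(pss, key=lambda eid: (-pss[eid], eid))
    return ranked[:n], dict(pss)


# Hypothetical toy database: ID -> (Japanese, English).
database = {
    1: ("いくつかの", "several"),
    2: ("いつでも", "any time"),
    3: ("いつか", "some day"),
    4: ("きのう", "yesterday"),
}
index = build_character_index(database)
top, scores = preselect("いつかの", index, n=2)
print(top)     # [1, 3]
print(scores)  # PSS: example 1 -> 4, 2 -> 2, 3 -> 3, 4 -> 1
```

Only the examples returned in `top` would then be passed to the order-sensitive similarity score in the second stage.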
⁵ We cannot compute the similarity score of every example in the
database, because the computation takes about 5 milliseconds for an
average input string (10 characters) and an average example (50
characters) on a SparcStation 2.
⁶ This value was determined empirically.
Figure 2: Pre-selection using Character Index (the characters of the input string are looked up in the character index, giving a PSS for each example ID)
3 The CTM System
Figure 3: The CTM system
The retrieval mechanism described above has been implemented in
CTM, a Japanese-English translation aid system. CTM is written in C
and runs on Sun Workstations. Figure 3 shows the configuration of
CTM: it consists of three programs.
mkdb: The program that creates the character index from the
translation database.
CTM server: The main program, which retrieves the best matched
examples for the given input.⁷
MTC⁸: The client program on NEmacs (Nihongo (Japanese) GNU Emacs),
which interacts with the CTM server via Ethernet.
The translation database of CTM consists of text files, in which a
Japanese text string and an English text string appear one after
the other. These files can be made semi-automatically from Japanese
text files and the corresponding English text files by using the
alignment program \[1\]. We have made translation databases from
several sources: Table 4 shows them.
4 Retrieval Examples 
We show here CTM retrieval examples with the following features:
phrasal expression, long-distance dependency, idiom, synonym, and
semantic ambiguity.
Figure 4 shows a retrieval example for the phrasal expression
"~ < "Dh~C)~J~:)::~¢,~'~.~J'Yo (consider from several points of
view)". Although there is no exactly matching expression in the
database, CTM can retrieve examples that help us translate it.
⁷ The CTM server has other facilities: character-based exact match
retrieval for Japanese texts, and word-based best or exact match
retrieval for English texts.
⁸ MTC is named from the Japanese phrase "Motto Tsukatte Choudai",
which means "use it more and more".
Table 4: The CTM Translation Databases
Name          Direction   Records   KByte   Source(s)
ScienceYYMM   E→J          11,115   3,175   Scientific American & its Japanese translation (Nikkei Science)
ML1           E→J           2,655     458   Chap. 1-4 in Machine Learning \[3\] & its Japanese translation
.11,~         J→E           4,230     139   Entry words in \[4\]
MTE           J→E           3,938     379   Test examples in \[2\]
EX            J→E           6,624     595   Translation examples collected by Oikawa
TJ            J→E           1,d67     259   The column Tensei-Jingo in the Asahi Newspaper
KD            J→E          38,190   2,729   Examples in \[7\]
Total                      67,619   7,733
CTM(Ab)> ~' { -9 h'o)~,t.~(7,~' ¢9 ~-~g,-J- 7a
Score = 28, DB = Science8710, ID = 598, File = 03.ej
~o),t 5 l:~, < -gz>o),~,f*~f~$1/61o)t¢,lt;t.~,h~gSL~ &, ~',)
From the viewpoint of several material limits, then, gallium
arsenide offers advantages over silicon in speed.
Score = 24, DB = Science8710, ID = 549, File = fla.ej
~-~ o) 5 -9o) t~<)~ft~:~,~ b. 3 -9 e~.,~,~, r~ ~-.g!.
Each level of the hierarchy can be considered from three different
points of view, which are respectively theory, practice and
historical analogy.
Figure 4: Example (Phrasal Expression)
CTM(Ab)>~ b'CJz ~, 
Score = 9, DB = Science8710, ID = 1649, File = 07.ej 
This is no small undertaking, however, and success presupposes that
society generates significant demand.
Score = 9, DB = Science8710, ID = 1944, File = 09.ej
This view is not really in conflict with the traditional model of
medical libraries as information centers.
Figure 5: Example (Long-Distance Dependency)
CTM(Ab)> \[. -~/,~'~"9 ~.b
Score = 18, DB = MTE, ID = 79, File = mttest.je
,~Xlat ~ O) b o l~'¢g -9 zbt~ t~o
I grasped the tail of a cat.
Score = 18, DB = MTE, ID = 78, File = mttest.je
I found his weak point.
Figure 6: Example (Idiom)
CTM supports the retrieval of long-distance dependencies: Figure 5
shows a retrieval example, where "~ L"C" is an adverb and
,,¢'¢~+v, is an auxiliary adjective for negation; they are often
used together with the general meaning "never".
CTM also supports the retrieval of idiomatic expressions: Figure 6
shows an example. In this figure, the first retrieved example has
the literal meaning, and the second has the idiomatic meaning.
The character-based best match method can retrieve synonyms.
Figure 7 shows an example: in this case, CTM retrieved an exact
match example with "~',~,-~j-To (consider/examine)" and two
examples with two synonyms, "~l,iJg'?}~ "¢ To (gain insight into)"
and "~JJ~J--5 (observe)".
CTM(Ab)>-~f~.T 7~
Score = 10, DB = ML1, ID = 605, File = 03.ej
~, ~ g'~tU5 t~':~f~-~ (MSC "~4~) '~.g-9
In particular, we examine methods for finding the
maximally-specific conjunctive generalizations
(MSC-generalizations) that cover all of the training examples of a
given concept.
Score = 7, DB = Science9003, ID = 468, File = nl~,\[i t&l.e.ej
Presumably the therapist's interpretations help patients to gain
insight into the effects of the unconscious mind on their conscious
thoughts, feelings and behaviors.
Score = 6, DB = ML1, ID = 147, File = 01.ej
:~¢~'~.
• Active experimentation, where the learner perturbs the
environment to observe the results of its perturbations.
Figure 7: Example (Synonym)
Figure 8 shows three retrieval examples for the Japanese
construction "NOUN+/+~-+~o;~z '', where "IS." is a case marker and
"~9)~ " is the past form of the verb ".),,To". There are several
translations of "/k.Ta". The first input "~LI'~ (office) ~:-Jvo/'z"
has two meanings: one is "entered the office" and the other is
"joined as a new member of the office". The second input
"J~ (ear) ~S-/vo/'#." is an idiomatic expression that means
"heard". The last input ":eg~-t~: (bookstore) {,2/k.-'9 ? ~:'' is
more complicated: the translation depends not only on the
"~E (ni)"-case but also on the "~ (ga)"-case. The retrieval
examples show the following three cases:
1. "\]k (human)÷ 7~'~+ ~.'_ " (room)+ ~Z q- 2k. 7o"
(a human enters the room)
2. "$il, (wind)+i/+g\[l\[+._" (room)+~Z-+.A.7o"
(the wind blows into the room)
3. "~ (book) + 7~'+-~l'.'.: (bookstore) + ~5 + & 7o"
(the book arrives at the bookstore)
Score = 14, DB = MTE, ID = 290, File = mttest.je
He entered the classroom from the back entrance.
Score = 14, DB = Science9003, ID = 404, File = inter.e.ej
Huey-Mei Wang, a recent graduate student in my laboratory, extended
these findings by showing that the IL-2 receptor functions as an
"on-off" switch.
CTM(Ab)> 'II~5 L o-Z75:
Score = 14, DB = MTE, ID = 279, File = mttest.je
Rumors reached her ears.
CTM(Ab)> ~1":-)~:~ ft. o ?,"
Score = 14, DB = EX, ID = 5947, File = yourei.je
It appears that the thief entered the room by the window.
Score = 14, DB = MTE, ID = 283, File = mttest.je
:~- 5 ~/~L;O ~ ~ v- A ~, t: o
Draft blew into the room.
Score = 12, DB = MTE, ID = 278, File = mttest.je
Newly published books arrived at the bookstore.
Figure 8: Example (Ambiguity)
5 Evaluation 
It is very difficult to evaluate a translation aid system, because
its effectiveness essentially depends on the user's satisfaction:
when the user feels that the system is helpful, it is effective.
The evaluation of CTM is now in progress, and we show some
experimental results here.
The Retrieval Time 
Empirically, we obtained the following equation, which estimates
the retrieval time (in milliseconds):

$$\mathrm{time}(l, k, N) = l \times \left( 10 \times k + \tfrac{2}{3} \times N \right)$$

where l is the length of the input string, k (megabytes) is the
database size, and N is the pre-selection parameter. For example,
if l = 10 (characters), k = 8 (megabytes), and N = 200, then
time = 2,133 (milliseconds). This shows that the current system
responds in a few seconds, which is not very fast. More
acceleration is needed for larger databases.
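The worked figure can be checked with a few lines of Python. This is only an arithmetic sketch: the function name is made up here, and the coefficient 10 on the database size is the value that reproduces the stated 2,133 milliseconds for a 10-character input.

```python
def estimated_retrieval_time_ms(length: int, db_megabytes: float, n: int) -> float:
    """Empirical estimate: length * (10 * database size in MB + 2/3 * N), in milliseconds."""
    return length * (10 * db_megabytes + (2 / 3) * n)


# Input of 10 characters, 8-megabyte database, N = 200 pre-selected examples.
print(round(estimated_retrieval_time_ms(10, 8, 200)))  # 2133
```

Both terms scale with the input length; the first grows with the database size and the second with the number of pre-selected examples.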
Evaluation of 100 retrievals 
We have evaluated 100 retrieval results by hand. We gave one of the
following grades to each retrieved example.
A The example exactly matches the input.
B The example provides enough information about the translation of
the whole input.
C The example provides information about the translation of some
part of the input.
F The example provides almost no information about the translation
of the input.
We evaluated the top five examples for each retrieval, and the best
grade among them is used as the evaluation of the retrieval.⁹
Table 5 shows the result of the evaluation. The table shows that
(1) we can obtain very useful information from 47% of the
retrievals (grades A and B), and (2) we can obtain at least some
information from 81% of the retrievals (grades A, B, and C).
Table 5: Evaluation of 100 retrievals
              Character Length
Grade   1-5   6-10   10-15   15-20   20-30   Total
A        21      6       0       0       0      27
B         4     10       3       2       1      20
C         1     15      10       6       2      34
F         9      4       3       0       3      19
Total    35     35      16       8       6     100
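The two percentages follow directly from the row totals of Table 5, as the short Python check below shows. The dictionary layout is this sketch's own; the three zero cells in row A are read from garbled characters and are consistent with the printed row and column totals.

```python
# Counts per grade, one entry per character-length column of Table 5.
table5 = {
    "A": [21, 6, 0, 0, 0],
    "B": [4, 10, 3, 2, 1],
    "C": [1, 15, 10, 6, 2],
    "F": [9, 4, 3, 0, 3],
}

totals = {grade: sum(counts) for grade, counts in table5.items()}
n = sum(totals.values())

very_useful = 100 * (totals["A"] + totals["B"]) / n                   # grades A and B
some_info = 100 * (totals["A"] + totals["B"] + totals["C"]) / n       # grades A, B and C

print(totals)                  # {'A': 27, 'B': 20, 'C': 34, 'F': 19}
print(n)                       # 100
print(very_useful, some_info)  # 47.0 81.0
```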
Acknowledgments 
The author would like to thank Ms. Yuko Tomita, who helped me to
make the translation databases.
References 
\[1\] Gale, W. and Church, K.: A Program for Aligning Sentences in
Bilingual Corpora, Proc. of ACL-91, pp.177-184, 1991.
\[2\] Ikehara, S.: Test Sentences for Evaluating Japanese-English
Machine Translation, (in Japanese), NTT, 1991.
\[3\] Michalski, R., Carbonell, J. and Mitchell, T. (Eds.): Machine
Learning, Tioga Publishing Company, 1983.
\[4\] Nagao, M. et al. (Eds.): Iwanami Encyclopedic Dictionary of
Computer Science, (in Japanese), Iwanami Shoten, 1990.
\[5\] Nakamura, N.: Translation Support by Retrieving Bilingual
Texts, (in Japanese), Proc. of 38th Convention of IPSJ, pp.357-358,
1989.
\[6\] Sato, S.: Example-Based Translation Approach, Proc. of
International Workshop on Fundamental Research for the Future
Generation of Natural Language Processing, ATR Interpreting
Telephony Research Laboratories, pp.1-16, 1991.
\[7\] Shimizu and Narita (Eds.): The Kodansha Japanese-English
Dictionary, Koudansha, 1976.
\[8\] Sumita, E. and Tsutsumi, Y.: A Translation Aid System Using
Flexible Text Retrieval Based on Syntax-Matching, TRL Research
Report, TR-87-1019, Tokyo Research Laboratory, IBM, 1988.
⁹ It is enough for the user to find a useful example in the top
five examples.
