Representation and Recognition Method 
for Multi-Word Translation Units 
in Korean-to-Japanese MT System 
Kyonghi Moon 
Dept. of Computer Science & Engineering 
Pohang Univ. of Science and Technology 
San 31 Hyoia-dong Nam-gu, Pohang 790-784 
Republic of Korea 
khmoon @ kle.po stech.ac.kr 
Jong-Hyeok Lee 
Dept. of Computer Science & Engineering 
Pohang Univ. of Science and Technology 
San 31 Hyoja-dong Nam-gu, Pohang 790-784 
Republic of Korea 
jhlee@postech.ac.kr 
Abstract 
Due to grammatical similarities, even a 
one-to-one mapping between Korean and 
Japanese words (or morphemes) can usually 
result in a high quality Korean-to-Japanese 
machine translation. However, multi-word 
translation units (MWTU) such as idioms, 
compound words, etc., need an n-to-m 
mapping, and their component words often 
do not appear adjacently, resulting in a 
discontinuous MWTU. During translation, 
the MWTU should be treated as one lexical 
item rather than a phrase. In this paper, we 
define the types of MWTUs and propose 
their representation and recognition method 
depending on their characteristics in 
Korean-to-Japanese MT system. In an 
experimental evaluation, the proposed 
method turned out to be very effective in 
handling MWTUs, showing an average 
recognition accuracy of 98.4% and a fast 
recognition time. 
1 Introduction 
As a transfer problem in a machine 
translation (MT), lexical and structural 
differences exist between source and target 
s, which requires l-n, m-n, or n-1 
mapping strategies for machine translation 
system. For such mapping strategies, we need to 
treat several (n, or m) words (or morphemes) as 
a single translation unit. Although some 
researches (D.Santos,1990; Linden E.,1990; 
Yoon Sung Hoe, 1992; Ha Gyu Lee, 1994; 
D.Arnold,1994) employ the term "idiom" for 
these units, we prefer MWTU (Multi-Word 
Translation Unit) because it is a more general 
and broader term for MT environment. 
Up to now, some reseamh has focused on 
recognition and transfer of MWTUs, although 
very little research has been undertaken for 
Korean-to-Japanese machine translation systems 
(Seen-He Kim,1997). In previous researches, 
some tended to simplify the problem by treating 
only special types of MWTUs, while others had 
some recognition errors and took too much 
recognition time because they did not restrict the 
recognition scope (D.Santos,1990; Yoon Sung 
Hee,1992; Ha Gyu Lee, 1994; Seen-He 
Kim, 1997). 
For a Korean-to-English MT, Lee and Kim 
(Ha Gyu Lee,1994) uses only weak restrictions 
like adjacent inforlnation for recognition scope. 
However, their method needs stronger 
restrictions to resolve recognition errors and to 
speed up the process. Although some differences 
exist depending on which kinds of source and 
target s are dealt with, MWTUs in 
Korean-to-Japanese MT frequently have their 
component words close together, so that one can 
predict the location of their separated component 
words. For this reason, we can enhance the 
recognition accuracy and time effectively by 
restricting the recognition scope according to the 
characteristics of an MWTU rather than taking 
the whole sentence as the scope. 
Moreover, the method by Lee and Kim (Ha 
Gyu Lee,1994) deals with only surface-level 
consistency without considering word order 
because Korean has ahnost free word order. It is 
obvious that the method can deal with variable 
544 
word-order MWTUs, but some incorrect 
recognition results arc possible whcn meaning 
changes according to word order. Because 
MWTUs to be treated in Korean-to-Japanese 
MT have an almost fixed word order sequence, 
their meaning may vary if the word order is 
changed. In (1), both sentences have the same 
lexical words (or morphemes), but while the first 
sentence must be treated as an MWTU, the 
second, which has the different sequence from 
the first, does not have the meaning of an 
MWTU. In (1), the words surrounded with a box 
are an essential component morpheme for an 
MWTU. 
(big) (nose) (get hurt) 
/*(1) had a bit~)erience */ 
(nose) (get hurt) (big) 
/* It is serious (that I) got hurt in my nose */ 
In this paper, to solve the word order 
problem and thus enhance a recognition 
accuracy and time for MWTUs, we fix the word 
order in an MWTU and define the recognition 
scope of component words according to their 
characteristics. Based on it, then we propose a 
representation and recognition method of 
MWTUs for a Korean-to-Japanese MT system. 
In the rest of this paper, details will be presented 
about lhese proposed ideas, logclher with some 
evalualion results. For representing Korean and 
Japanese expressions, the 1994-SK (ROK 
Ministry of Education) and the Kunrei 
Romanization systems are used respectively. 
2 Processing of MWTUs 
In developing MT systems, we frequently 
contact with some differences in word spacing, 
grammar, and so on, between sotuve and target 
s. But the method and degree of 
difficulty of handling them highly depend upon 
the nature of the source and target hmguage in 
the MT system. In this paper, we treat the 
representation and recognition methods of 
MWTUs according to their characteristics for 
only a Korean-to-Japanese MT system. 
2.1 Types of MWTU 
There call be 1-1, l-m, n-l, and n-m 
mapping relations of morphemes between source 
and target  in machine translation. Due 
to the grammatical similarities of Korean and 
Japanese, Korean-to-Japanese machine 
translation systems have been developed under 
the direct MT strategy, which assumes a 1-1 
mapping relation. But a uniform application of 
this 1-1 mapping relation will easily result in an 
unnatural translation. 
It is not difficult to handle a 1-1 and l-m 
mapping relations in Korean-to-Japanese MT 
system although it uses only direct MT strategy, 
because it is easy to recognize only one 
morpheme in source , Korean. It is also 
due to the fact that Japanese correspondences 
have characteristics of non-spacing and 
continuity, which allows several words to be 
treated as a single word. In this reason, we need 
to consider just types with n-I and n-m mapping 
relations. Table 1 shows the types of MWTUs to 
be handled in Korean-to-Japanese MT. 
The compound words in Table 1 are the 
units that must be translated into one Japanese 
morpheme though they are conlpound words ill 
Korean. For example, "wodett peuroseseo" is a 
Korean compound word which consists of two 
morphemes "wodeu" and "l)euroseseo", but its 
Japanese equivalent is only one morpheme, 
"walmro". The Korean word '),eojju -co be 
l-dal" is also a compound word, made by 2 
lexical morphemes "yeoiju" and "be" and 1 
functional morpheme "-eo", but it also 
corresponds to only one Japanese equivalent 
morpheme, "ukagal-u\]". in these cases, the 
Korean compound words shoukl be recognized 
as one unit to be transformed into one Japanese 
morpheme. 
We can classify verbal nouns into 2 types 
according to their Japanese equivalents. Table 2 
shows them. If we define a Korean verbal noun 
as X and its equivalent in Japanese as X', and 
another single word in Japanese as Y, we can 
describe the two types of relations between 
Korean and Japanese verbal nouns as below. 
Although the type 1 satisfies l:l mapping 
relation, the type 2 does not. So, for the type2, 
the verbal noun, X (e.g., "chuka") and "ha\[-da\]" 
need to be recognized as a single unit to be 
transformed into a Japanese equivalent, Y. 
545 
5) Idiom :: :: 
~\] l-&,l 
(congratulation) (do) 
(noise) (play) 
(thing) (equal) 
(bi\[~) (nose) \[,,,~, I(~et hurt) 
(first) (see) 
(in favor of) 
/* ask */ 
iwal-ul 
/* congratulate */ 
sawa\[-gu\] 
/* disturb */ 
soul-da\] 
/,I: seenl *'/ 
hide -i me -hi a \[-u\] 
/* have a bitter experience */ 
hazime -masi -te 
/* How do you do */ 
-110 t(l111(~ -I10 
/* lbr */ 
ITable 2\] Types of verbal nouns 
X + ha\[-dal 
X + hal-dal 
Japanese 
X' .t- SHl"tl 
Y 
Collocation patterns are the units that 
frequently co-occurr in sentences and affect the 
semantics of each other. There are two kinds of 
collocation patterns. In one, each component 
morpheme is translated into different equivalents, 
such as "dambae \[-reul\] piu\[-&ll(smoke)" 
corresponding to "tabako -o su\[-u\]", and in the 
other, all component morphemes must be 
translated into one Japanese morpheme with an 
equivalent meaning, such as "soran \[-eul\] 
piu\[-da\]" corresponding to "sawa\[-gu\]". While 
the morphemes in the former case have a l-to-1 
mapping relation, the morphemes in the latter 
case have an n-to-1 mapping relation and 
therefore, must be treated as a single morpheme. 
While some modalitics consist of only one 
morpheme like "-eot" or "-da", there are also 
some modalities made up of several morphemes 
like "-neun geot gat". Accordingly, the latter 
must be handled as an MWTU. 
An Idiom is a general idiomatic unit 
defined in a dictionary. Generally, since an 
idiom does not reflect literal meaning itself, 
translating their component morphemes 
individually results in very different meaning, In 
this case, it must be treated as a single unit. 
A colloquial idiomatic phrase is also 
composed of several morphemes, but it is 
recognized like a single unit word. For instance, 
the Korean greeting "cheoeum bee 1) -get 
-seumnida" corresponds to "hazime -masi -le" 
in Japanese. In this case, a 1-to-I mapping 
transformation results in an unnatural translation. 
Therefore, it also should be recognized as 
MWTUs. 
Moreover, MWTUs can be used for groups 
of words that can give a more natural translation 
when they are treated as one unit. We will call 
these groups of words semi-words. 
2.2 The Characteristics of MWTUs 
To minimize the recognition time and 
recognition error rate of MWTUs, we need to 
represent MWTUs according to their 
characteristics. The following shows the 
characteristics of MWTUs. 
1) Fixed word order 
All of the 7 types of MWTUs in Table 1 
have a fixed word order sequence, even though 
Korean and Japanese are known as free word 
order s. Expressions such as "keu -n ko 
dachi" and "-neun geot gat" nmst be recognized 
546 
as MWTUs, but their meaning may be changed 
from thin of MWTUs if the word order sequence 
has been changed. This provides a good 
characteristic for simply representing MWTUs. 
2) Extension by insertion o1' other words 
For some kinds of MWTUs, it is possible to 
insert some grammatical morphemes or other 
words between their component n~orplaemes of 
an MWTU. "-do" in (2) , "-reul" and "-reul geu 
-ege" in (3) are those cases. 
(go) (means) (is) 
/* (l) can go */ 
/* (1) can go, too */ 
............................... (3) 
(a favor) (owe) 
/* be obliged to */ 
~ -reul ~ -da 
/* be obliged to */ 
,~ -reul eg~ -egg ~ -da 
(he) 
/* be obliged to him */ 
According to this feature, the relations 
between immediately located two component 
morphemes of MWTUs can be classified as 
follows: 
A. tightly connected : the relation that no 
morpheme can be inserted between them 
B. loosely connected : the relation that some 
morphemes can be inserted between them. 
B-I. Only particles mad endings of a 
word are allowed to be inserted between 
them. 
B-2. Any kinds of morphemes can be 
inserted between them. 
\[Figure I\] Relations between two adjacent 
component morphemes of MWTUs 
3) Strong cohesion 
Although some MWTUs have 
characteristics of extension by insertion of other 
words, component morphemes in an MWTU 
have strong cohesion, not only logically but also 
physically. This means that tile recognition o1' an 
MWTU is possible by local comparison of its 
physical location. But it does not imply that the 
scope is limited in a simple sentence structure. 
4) The predictable recognition scope of 
MWTUs 
It is possible to predict tile recognition 
scope between two adjacent component 
morphemes of MWTUs, according to the above 
characteristics. The scope can be predicted as 
follows l'or each type of MWTUs shown in 
Table 1. 
Component morphemes of a compound 
word are corltiguous to the next Olle, so their 
scopes are predictable. 
Both verbal nouns and collocation patterns 
have the l'orm combined with "Noun" and 
"Verb", where other words can be inserted 
between them. But in the case (51' 
"Noun+Verb+Verb", which is the fern1 that 
another verb is inserted between the noun and 
verb, its meaning may be different in that of an 
MWTU. So ttae scope of the "Verb" can be 
limited up to the position of the first verb 
appearing after the "Noun", that is, the position 
where the POS(part-ofspeech) appears. 
Component morphemes of a modality have 
an especially strong cohesion. So at most, one 
particle is often inserted next to the bound noun. 
From this, we can predict the next component 
morpheme apart from pro component at most in 
distance 2. 
idioms, colloquial idiomatic phrases and 
senti-words consist of various colnponenl 
morphemes, which results in various scopes for 
MWTU recognition. The scopes of each 
conlpollellt ll\]Ol'phellles froul pl*e-colllponellt 
morphemes can be determined by distance 1, 
distance 2, or infinity. But inl'inite scope can 
also be limited by the position which the POS of 
the component morpheme appears. 
2.3 Representation of MWTU 
The representation of an MWTU must be 
considered in order to enhance recognition 
accuracy and speed up the process. Accordingly, 
in this paper, we propose representation method 
(51' MWTUs according to the characteristics 
mentioned in section 2.2. 
One basic rule for MWTU representation is 
that an MWTU is composed of only lexical 
morphemes if possible, that is, grammatical 
547 
morphemes such as particles and the endings of 
a word will be extracted in the representation 
because of the above characteristics which are 
freely inserted and omitted. However, 
grammatical morphemes affecting the meanings 
of MWTUs must be described. 
Next, according to the characteristics 
described in section 2.2, we need to represent 
recognition scopes between adjacent component 
morphemes and POS of each component 
morpheme for the restriction of recognition 
scope. 
m,(POS,, d,2) m2(POS 2, d2~) ... m,(POS,, d,, m) ... 
m (POS,,, d,,.,,+,) 
m~: i-th COlnponent morpheme o1' an MWTU 
POS~ : POSofm~ 
d~.~+ x : maximum distance from m, tom~+~ 
\[Figure 2\] Representation of an MWTU 
d~,~+~ has 4 kinds of values according to 
Figure 1. For the case of A, d~,~+, is 1, for the case 
of B-l, it is 2, for the case of B-2, it is ~, mad 
then for the last component morpheme, it is 
always 0 because (n+l)-th component 
morpheme doesn't exist. 
The examples of MWTUs described by 
above representation are shown in Figure 3. 
• wodeu(N,1)proseseo(N,O) ~ wapuro 
(word) (processor) /* word processor */ 
• yeojju(V, 1 ) -eo(mC, 1 ) bo(V,0) ~ ukaga 
(ask) (see) /* ask */ 
• keu(ADJ, 1 ) -n(mT, l ) ko(N,2) dachi(V,O) 
(big) (nose) (get hurt) 
hidoinwnia /* have a bitter experience */ 
• -neunOnT,l) geot(ND,2) gat(ADJ,O) ~ sou 
(thing) (equal) /* seem */ 
• chuka(N,oo) ha(V,0) ~ iwa 
(congratulation) (do) /* congratulation*/ 
• -reul(j,1 ) wiha(V, 1) -n(mT,0) ~ notameno 
(in favor of) /* for */ 
• sesang(N, 2) muljeong(N, oo) moreu(V,O) 
(world) (condition) (don't know) 
seziniuto I* be ignorant of the world */ 
• jal(B,l) meok (V,l) -eot(e,l) -seumnidaOnT,O) 
(well) (eat) 
gotisousamadesita 
/* I have enjoyed my dinner very much */ 
\[Figure 3\] Examples of MWTUs 
Each MWTU is entered into the dictionary 
as an entry word such as the general morphemes 
as shown in Figure 4. Additionally, for 
recognition, we made the first component 
morpheme of the MWTU have an MWTU field, 
which is composed of MWTUs starting from the 
entry word. This means that only one access to 
the dictionary is needed after an MWTU is 
confirmed. Figure 4 shows the dictionary 
structure for an MWTU. 
4____ 
;(mouth) (use) 
/* speak carelessly */ 
\[Connection info. for K~ 
\[Semantic info., Colloc~ 
\[Japanese equivalence, 
Janane, se ..... \] 
(ip) 
prcan\] 
tion pattern\] 
Connection 
\[Connection info. lot Kcrean, 
MWTU {ip(N ~ wlli(y;O), ip(N, 
bareu(V,O) ..... } \] 
\[Semantic info., Collocation pattern\] 
\[Japanese equivalence, Connection info. \['or 
.lanane~e ..... 1 
info. for 
~) 
\[Figure 4\] Dictionary for an MWTU 
2.4 Recognition of MWTU 
Some rules are required in order to 
recognize MWTUs represented like those in 
section 2.3. 
First, the recognition scope of m~+~ after 
recognizing m~ is decided by POS~+, and d~.~+ c For 
restricting the recognition scope maximally 
while preventing other recognition errors, we 
formulated recognition scopes of each 
component morphemes of an MWTU as follows. 
RS(Recognition Scope) = min\[real_dist~<, d,+,\] 
real dist~+~ : the distance fi'om ln~ tothe i~oint 
- ' that the POS of In\[+ ~ appears at 
first in an input sentence 
d~ ~+~ : maximum distance from m~ to in ,+, 
\[Figure 5\] Recognition scope 
In (4), for an MWTU "ip(N,oo) nolli(V,O), 
the recognition scope of "nolli" is 3 because dl, 2 
is oo and real_dist,, 2 is 3, which is fi'om 6-3. For 
an MWTU, "-ji(mC,2) an(V,0), the recognition 
scope of "an" is 1 because d3. 2 is 2 and real_dist,, 2 
is 1, which is from 12-11. Therefore, we can 
recognize MWTUs by a small comparison. 
548 
position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
Korean: ,,e-ga~-eul geureoke ~-,nyeon bi,um-eul bat ~ ~-get-neut,ya ?...(4) 
(you) (mlmth) (in that manner) (u/e) (censure) (receive)\(nqt) 
speak carelessly in lhat manncr,'~u may be censured. */ 
Japanese: anatrt -ga sou nuka -se -ha hinan -o uke na -i ka "~ 
(you) (in that manner) (speak carelessly) (censure) (receive) (not) 
/* If you speak carelessly in that manner, you may be censured. */ 
position l 2 
Korean "~-reul 
(a favor) 
I 
Japanese (a): !mlzi -nagmTa 
(deploring) 
Japanese (b): 
3 4 5 6 7 8 9 10 11 12 
hanta, -her -myeo ~ -neun hae -reul barabo -at -da ....... (5) 
(deploring) (?we or set) (sun) (look) 
/ 
~"~' /*Denlorine his circumstance. I looked at a settine sun.*/ 
osewanina -ru hi -o nagame -ta (X) 
(be obligated to) (sun) (look) 
/*Deploring, I looked at a sun which I am obligated to.*/ 
minoue -o tanzi -nagara irihi -o nagame -ta (O) 
(circumstance) (deploring) (a setting sun) (look) 
/*Deploring his circumstance, he looked at a setting sun.*/ 
\[Figure 6\] Recognition examples 
This Recognition rule can also prohibit 
some recognition errors generated from 
urlrlecessary comparisons. For instance, the 
recognition scope of "ji" in an MWTU 
"sinse(N,oo),ji(V,0)" was limited by 2, which is 
the minimum value between d~=(oo) and 
real_distj.2(3-1=2). So it prohibits errors, such as 
Japanese (a) in (5), occurring when an MWTU is 
recognized in whole sentence. 
The second rule states that morphemes 
inserted between the component morphemes of 
the recognized MWTU must be rearranged in 
the following manner: 
1) ff inserted morphemes are lexical 
morphemes, they are rearranged to the front of 
the MWTU. "geureoke(in that manner)" in (4) is 
such a case. 
2) If they are grammatical morphemes, they 
are ignored when they directly follow any 
component of the MWTU, and they are 
transl~rred to the front of the MWTU together 
with the inserted lexieal morphemes when thcy 
follow any inserted lexical morphemes. In (4), 
"-eul" is the former case. If any grammatical 
morpheme such as "-do" or "-ha" is attached 
after "geureoke", it will be the latter case. 
Third, if a morpheme is the common subset 
of the two MWTUs, we select the one such that 
its first component morpheme locates in the 
pre-position. This rule is used to reduce the 
recognition time by skipping morphemes which 
are subsets of the pre-confirmed MWTUs 
Fourth, we select the superset of MWTU in 
case that two or more MWTUs starting from a 
same morpheme are recognized and one is the 
superset of the others. For" example, let us 
consider two MWTUs: '~iamsi -man -yo (wait a 
moment)" and 'ijamsi -man(for a little while)", ff 
",jamsi -man-yo" is recognized, '~iamsi -man" 
can also be recognized and '~amsi -man -yo" is 
the supcrset of "jamsi -man". In this case, we 
select the supersct, '~antsi -man -yo". 
549 
3 Evaluation 
To demonstrate the efficiency of our 
proposed method, we applied it to a 
Korean-to-Japanese machine translation system 
(COBALT-K/J), and evaluated its recognition 
accuracy and recognition time. COBALT-K/J 
consists of about 150,000 general purpose words 
and 7,500 MWTUs. For the test corpus, we 
arbitrary extracted 2,808 sentences from a 10 
million word corpus, the KIBS (Korean 
Information Base System). MWTUs registered 
in the dictionary appeared 3,647 times in them. 
Table 3 shows the evaluation results 
classified by the types of MWTUs. 
\[Table 3\] Evaluation results on the recognition 
of MWTUs 
~i, Accur 
~u~o,~ 
33 32 97.0% 
:A)g:No! i 
son 
918 907 
C0116~afio ..... 33 29 
1326 i 292 
5 5 
Coil0quial 
~3 83 
1249 ! 242 
> tola! 3,647 3,590 
98.8% 1.05 
87.9% 1.82 
97.4% 1.02 
100% 1.3 
100% 1.08 
99.4% t 1.01 
98.4% 1.03 
In Table 3, idioms, collocation patterns and 
compound words have a very low frequency 
while verbal nouns, modalities and semi-words 
have a relatively high frequency. Nevertheless, 
98.4% of the test samples were recognized 
correctly. In order to recognize an MWTU, it 
needed only 1.03 comparisons per each 
component morpheme of the MWTU on the 
average. This shows the effectiveness and the 
speed of our proposed method for treating 
MWTUs in Korean-to-Japanese MT. 
Conclusion 
In this paper, we classified the different 
kinds o1' MWTUs and proposed a representation 
and recognition method for them in a 
Korean-to-Japanese MT. 
MWTUs in Korean-to-Japanese MT have 
the characteristics of fixed word order, strong 
cohesion, predictable scope of its component 
morphemes, extension by other words, etc. 
Accordingly, we enhanced accuracy and 
recognition time by representing and 
recognizing MWTUs according to their 
characteristics. 
In our experiment, 98.4% of the test 
samples were recognized correctly, which shows 
the effectiveness of our proposed method. In 
future work, we will research in more strict 
recognition restrictions and plan to extract 
MWTUs from a corpus automatically. 

References 

D. Santos(1990), Lexical gaps and idioms in 
machine translation, 13" International 
Conference of Computational Linguistics. 
Coling 90, Finland, pp. 330-335. 

Linden E., Wessel K. (1990), Ambiguio~ 
resolution and the retriewE of idioms: two 
aM)roaches, 13'" International Conference of 
Computational Linguistics. Coling 90, Finland, 
pp. 245-248. 

Yoon Sung Hee (1992), Idiomatical and 
Collocational Approach to English-Korean 
Machine Translation., Proceedings of 
1CCPOL '92, pp.56-60. 

Ha Gyu Lee, Yung Taek Kim (1994), 
Representation arm Recognition of Korean 
Idioms for Machine Translation, Journal of the 
Korean Information Science Society, Vol. 21, 
No. 1, pp.139-149 (written in Korean). 

Seon-Ho Kim (1997), Lexicon-Based Approach 
to Recognition and Tran,sfer of Multi-Word 
Translation Units hz Korean-Japanese 
Machine 7)'anslation, MS Thesis, Pohang 
University of Science and Technology (written 
in Korean). 

D.Arnold, L.Balkan, R. Lee Hurnphreys, 
S.Meijer, L.sadler (1994), Machine 
Transhttion, Blackwell, USA. 
