<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1079"> <Title>Representation and Recognition Method for Multi-Word Translation Units in Korean-to-Japanese MT System</Title> <Section position="3" start_page="544" end_page="546" type="metho"> <SectionTitle> 2 Processing of MWTUs </SectionTitle> <Paragraph position="0"> In developing MT systems, we frequently encounter differences in word spacing, grammar, and so on, between the source and target languages. The method and the difficulty of handling them depend heavily on the nature of the source and target languages in the MT system. In this paper, we treat the representation and recognition methods of MWTUs according to their characteristics, for a Korean-to-Japanese MT system only.</Paragraph> <Section position="1" start_page="544" end_page="545" type="sub_section"> <SectionTitle> 2.1 Types of MWTU </SectionTitle> <Paragraph position="0"> There can be 1-1, 1-m, n-1, and n-m mapping relations of morphemes between the source and target languages in machine translation. Due to the grammatical similarities of Korean and Japanese, Korean-to-Japanese machine translation systems have been developed under the direct MT strategy, which assumes a 1-1 mapping relation. But a uniform application of this 1-1 mapping relation easily results in an unnatural translation.</Paragraph> <Paragraph position="1"> It is not difficult to handle 1-1 and 1-m mapping relations in a Korean-to-Japanese MT system, even one that uses only the direct MT strategy, because it is easy to recognize a single morpheme in the source language, Korean. It also helps that the Japanese correspondences are written contiguously, without spaces, which allows several words to be treated as a single word. For this reason, we need to consider only the types with n-1 and n-m mapping relations. 
Table 1 shows the types of MWTUs to be handled in Korean-to-Japanese MT.</Paragraph> <Paragraph position="2"> The compound words in Table 1 are units that must be translated into one Japanese morpheme even though they are compound words in Korean. For example, &quot;wodeu peuroseseo&quot; is a Korean compound word which consists of the two morphemes &quot;wodeu&quot; and &quot;peuroseseo&quot;, but its Japanese equivalent is only one morpheme, &quot;wapuro&quot;. The Korean word &quot;yeojju -eo bo\[-da\]&quot; is also a compound word, made of 2 lexical morphemes, &quot;yeojju&quot; and &quot;bo&quot;, and 1 functional morpheme, &quot;-eo&quot;, but it also corresponds to only one Japanese morpheme, &quot;ukaga\[-u\]&quot;. In these cases, the Korean compound words should be recognized as one unit to be transformed into one Japanese morpheme.</Paragraph> <Paragraph position="3"> We can classify verbal nouns into 2 types according to their Japanese equivalents. Table 2 shows them. If we define a Korean verbal noun as X, its equivalent in Japanese as X', and another single word in Japanese as Y, we can describe the two types of relations between Korean and Japanese verbal nouns as below.</Paragraph> <Paragraph position="4"> Although type 1 satisfies a 1:1 mapping relation, type 2 does not. So, for type 2, the verbal noun X (e.g., &quot;chuka&quot;) and &quot;ha\[-da\]&quot; need to be recognized as a single unit to be transformed into a Japanese equivalent, Y.</Paragraph> <Paragraph position="5"> Collocation patterns are units that frequently co-occur in sentences and affect each other's semantics. There are two kinds of collocation patterns. 
In one, each component morpheme is translated into a different equivalent, such as &quot;dambae \[-reul\] piu\[-da\] (smoke)&quot; corresponding to &quot;tabako -o su\[-u\]&quot;, and in the other, all component morphemes must be translated into one Japanese morpheme with an equivalent meaning, such as &quot;soran \[-eul\] piu\[-da\]&quot; corresponding to &quot;sawa\[-gu\]&quot;. While the morphemes in the former case have a 1-to-1 mapping relation, the morphemes in the latter case have an n-to-1 mapping relation and therefore must be treated as a single morpheme.</Paragraph> <Paragraph position="6"> While some modalities consist of only one morpheme, like &quot;-eot&quot; or &quot;-da&quot;, there are also modalities made up of several morphemes, like &quot;-neun geot gat&quot;. Accordingly, the latter must be handled as an MWTU.</Paragraph> <Paragraph position="7"> An idiom is a general idiomatic unit defined in a dictionary. Since an idiom generally does not carry its literal meaning, translating its component morphemes individually results in a very different meaning. In this case, it must be treated as a single unit.</Paragraph> <Paragraph position="8"> A colloquial idiomatic phrase is also composed of several morphemes, but it is recognized like a single word. For instance, the Korean greeting &quot;cheoeum boep -get -seumnida&quot; corresponds to &quot;hazime -masi -te&quot; in Japanese. In this case, a 1-to-1 mapping transformation results in an unnatural translation. Therefore, it should also be recognized as an MWTU.</Paragraph> <Paragraph position="9"> Moreover, MWTUs can be used for groups of words that give a more natural translation when they are treated as one unit. 
We will call these groups of words semi-words.</Paragraph> </Section> <Section position="2" start_page="545" end_page="546" type="sub_section"> <SectionTitle> 2.2 The Characteristics of MWTUs </SectionTitle> <Paragraph position="0"> To minimize the recognition time and recognition error rate of MWTUs, we need to represent MWTUs according to their characteristics. The following shows the characteristics of MWTUs.</Paragraph> <Paragraph position="1"> 1) Fixed word order. All of the 7 types of MWTUs in Table 1 have a fixed word order, even though Korean and Japanese are known as free word order languages. Expressions such as &quot;keu -n ko dachi&quot; and &quot;-neun geot gat&quot; must be recognized as MWTUs, but their meaning may differ from that of the MWTUs if the word order is changed. This property makes MWTUs simple to represent. 2) Extension by insertion of other words. For some kinds of MWTUs, it is possible to insert grammatical morphemes or other words between the component morphemes of the MWTU. &quot;-do&quot; in (2), and &quot;-reul&quot; and &quot;-reul geu -ege&quot; in (3), are such cases.</Paragraph> <Paragraph position="3"> According to this feature, the relations between two adjacent component morphemes of an MWTU can be classified as follows: A. tightly connected : the relation in which no morpheme can be inserted between them. B. loosely connected : the relation in which some morphemes can be inserted between them.</Paragraph> <Paragraph position="4"> B-1. Only particles and word endings are allowed to be inserted between them.</Paragraph> <Paragraph position="5"> B-2. 
Any kind of morpheme can be inserted between them.</Paragraph> <Paragraph position="6"> \[Figure 1\] Relations between two adjacent component morphemes of MWTUs</Paragraph> </Section> </Section> <Section position="4" start_page="546" end_page="549" type="metho"> <SectionTitle> 3) Strong cohesion </SectionTitle> <Paragraph position="0"> Although some MWTUs can be extended by the insertion of other words, the component morphemes in an MWTU have strong cohesion, not only logically but also physically. This means that the recognition of an MWTU is possible by local comparison of physical locations. But it does not imply that the scope is limited to a simple sentence structure. 4) The predictable recognition scope of MWTUs. It is possible to predict the recognition scope between two adjacent component morphemes of an MWTU, according to the above characteristics. The scope can be predicted as follows for each type of MWTU shown in Table 1. Component morphemes of a compound word are contiguous to the next one, so their scopes are predictable.</Paragraph> <Paragraph position="1"> Both verbal nouns and collocation patterns have the form of a &quot;Noun&quot; combined with a &quot;Verb&quot;, where other words can be inserted between them. But in the case of &quot;Noun+Verb+Verb&quot;, the form in which another verb is inserted between the noun and the verb, the meaning may be different from that of the MWTU. So the scope of the &quot;Verb&quot; can be limited to the position of the first verb appearing after the &quot;Noun&quot;, that is, the position where the POS (part-of-speech) appears.</Paragraph> <Paragraph position="2"> Component morphemes of a modality have an especially strong cohesion. So at most, one particle is often inserted next to the bound noun. 
From this, we can predict that the next component morpheme lies at a distance of at most 2 from the preceding component.</Paragraph> <Paragraph position="3"> Idioms, colloquial idiomatic phrases and semi-words consist of various component morphemes, which results in various scopes for MWTU recognition. The scope of each component morpheme from its pre-component morpheme can be determined as distance 1, distance 2, or infinity. But an infinite scope can also be limited by the position at which the POS of the component morpheme appears.</Paragraph> <Section position="1" start_page="546" end_page="547" type="sub_section"> <SectionTitle> 2.3 Representation of MWTU </SectionTitle> <Paragraph position="0"> The representation of an MWTU must be considered in order to enhance recognition accuracy and speed up the process. Accordingly, in this paper, we propose a representation method for MWTUs according to the characteristics mentioned in section 2.2.</Paragraph> <Paragraph position="1"> One basic rule for MWTU representation is that an MWTU is composed of only lexical morphemes if possible. That is, grammatical morphemes such as particles and word endings are left out of the representation, because, as noted above, they can be freely inserted and omitted. However, grammatical morphemes affecting the meaning of an MWTU must be described.</Paragraph> <Paragraph position="2"> Next, according to the characteristics described in section 2.2, we need to represent the recognition scope between adjacent component morphemes and the POS of each component morpheme, in order to restrict the recognition scope.</Paragraph> <Paragraph position="4"> $d_{i,i+1}$ : maximum distance from $m_i$ to $m_{i+1}$ \[Figure 2\] Representation of an MWTU. $d_{i,i+1}$ has 4 kinds of values according to Figure 1. 
For case A, $d_{i,i+1}$ is 1; for case B-1, it is 2; for case B-2, it is $\infty$; and for the last component morpheme, it is always 0 because an (n+1)-th component morpheme does not exist.</Paragraph> <Paragraph position="5"> Examples of MWTUs described with the above representation are shown in Figure 3.</Paragraph> <Paragraph position="7"> /* I have enjoyed my dinner very much */ \[Figure 3\] Examples of MWTUs. Each MWTU is entered into the dictionary as an entry word, in the same way as ordinary morphemes, as shown in Figure 4. Additionally, for recognition, we gave the first component morpheme of each MWTU an MWTU field, which lists the MWTUs starting from that entry word. This means that only one access to the dictionary is needed after an MWTU is confirmed. Figure 4 shows the dictionary structure for an MWTU.</Paragraph> </Section> <Section position="2" start_page="547" end_page="549" type="sub_section"> <SectionTitle> 2.4 Recognition of MWTU </SectionTitle> <Paragraph position="0"> Some rules are required in order to recognize MWTUs represented as in section 2.3.</Paragraph> <Paragraph position="1"> First, the recognition scope of $m_{i+1}$ after recognizing $m_i$ is decided by $POS_{i+1}$ and $d_{i,i+1}$. To restrict the recognition scope as much as possible while preventing other recognition errors, we formulated the recognition scope of each component morpheme of an MWTU as follows.</Paragraph> <Paragraph position="2"> $RS = \min(\mathit{real\_dist}_{i,i+1},\ d_{i,i+1})$, where RS is the recognition scope; $\mathit{real\_dist}_{i,i+1}$ : the distance from $m_i$ to the point at which the POS of $m_{i+1}$ first appears in the input sentence; $d_{i,i+1}$ : maximum distance from $m_i$ to $m_{i+1}$. \[Figure 5\] Recognition scope. In (4), for the MWTU &quot;ip(N,$\infty$) nolli(V,0)&quot;, the recognition scope of &quot;nolli&quot; is 3 because $d_{1,2}$ is $\infty$ and $\mathit{real\_dist}_{1,2}$ is 3, which comes from 6-3. For the MWTU &quot;-ji(mC,2) an(V,0)&quot;, the recognition scope of &quot;an&quot; is 1 because $d_{1,2}$ is 2 and $\mathit{real\_dist}_{1,2}$ is 1, which comes from 12-11. 
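The scope computation in these two examples can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system's implementation: the names Morph, Component, real_dist, recognition_scope and find_next are hypothetical, the POS tags other than N, V and mC are illustrative, and the romanized sentence is a reconstruction of example (4).

```python
import math
from dataclasses import dataclass

INF = math.inf  # stands for the unbounded distance of relation B-2


@dataclass(frozen=True)
class Morph:
    form: str  # romanized morpheme in the input sentence
    pos: str   # part-of-speech tag (tag names are illustrative)


@dataclass(frozen=True)
class Component:
    form: str
    pos: str
    d_next: float  # 1 (A), 2 (B-1), INF (B-2), or 0 for the last component


def real_dist(sentence, start, pos):
    """Distance from index `start` to the first later morpheme tagged `pos`."""
    for offset, morph in enumerate(sentence[start + 1:], start=1):
        if morph.pos == pos:
            return offset
    return INF  # the POS never appears after `start`


def recognition_scope(sentence, start, current, following):
    """RS = min(real_dist, d) for the component that follows `current`."""
    return min(real_dist(sentence, start, following.pos), current.d_next)


def find_next(sentence, start, current, following):
    """Index of `following` inside the recognition scope, or None."""
    scope = recognition_scope(sentence, start, current, following)
    if scope == INF:
        return None  # unbounded d and no morpheme with the required POS
    last = min(start + int(scope), len(sentence) - 1)
    for index in range(start + 1, last + 1):
        morph = sentence[index]
        if (morph.form, morph.pos) == (following.form, following.pos):
            return index
    return None


# Reconstructed example (4); list indices are 0-based, the paper counts from 1.
FORMS = "ne -ga ip -eul geureoke nolli -myeon binan -eul bat -ji an -get -neunya ?".split()
TAGS = "N P N P ADV V E N P V mC V E E S".split()  # assumed tags
SENTENCE = [Morph(form, tag) for form, tag in zip(FORMS, TAGS)]

IP, NOLLI = Component("ip", "N", INF), Component("nolli", "V", 0)
JI, AN = Component("-ji", "mC", 2), Component("an", "V", 0)

print(recognition_scope(SENTENCE, 2, IP, NOLLI))  # scope for "nolli"
print(recognition_scope(SENTENCE, 10, JI, AN))    # scope for "an"
```

Under these assumptions the two calls give the scopes 3 and 1 stated above, and find_next then inspects only the positions inside that scope instead of the rest of the sentence.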
Therefore, we can recognize MWTUs with only a few comparisons.</Paragraph> <Paragraph position="3">
position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Korean: ne -ga ip -eul geureoke nolli -myeon binan -eul bat -ji an -get -neunya ? ...(4)
(you) (mouth) (in that manner) (use) (censure) (receive) (not)
/* If you speak carelessly in that manner, you may be censured. */
Japanese: anata -ga sou nuka -se -ba hinan -o uke na -i ka
(you) (in that manner) (speak carelessly) (censure) (receive) (not)
...(5) /* Deploring his circumstance, I looked at a setting sun. */
Japanese (a): osewanina -ru hi -o nagame -ta (X)
(be obligated to) (sun) (look) /* Deploring, I looked at a sun which I am obligated to. */
Japanese (b): minoue -o tanzi -nagara irihi -o nagame -ta (O)
(circumstance) (deploring) (a setting sun) (look) /* Deploring his circumstance, he looked at a setting sun. */
\[Figure 6\] Recognition examples
This recognition rule can also prevent recognition errors caused by unnecessary comparisons. For instance, the recognition scope of &quot;ji&quot; in the MWTU &quot;sinse(N,$\infty$) ji(V,0)&quot; is limited to 2, which is the minimum of $d_{1,2}$ ($\infty$) and $\mathit{real\_dist}_{1,2}$ (3-1=2). So it prevents errors such as Japanese (a) in (5), which occur when an MWTU is searched for across the whole sentence.</Paragraph> <Paragraph position="4"> The second rule states that morphemes inserted between the component morphemes of the recognized MWTU must be rearranged in the following manner: 1) If the inserted morphemes are lexical morphemes, they are moved to the front of the MWTU. &quot;geureoke (in that manner)&quot; in (4) is such a case.</Paragraph> <Paragraph position="5"> 2) If they are grammatical morphemes, they are ignored when they directly follow a component of the MWTU, and they are transferred to the front of the MWTU together with the inserted lexical morphemes when they follow an inserted lexical morpheme. In (4), &quot;-eul&quot; is the former case. 
If any grammatical morpheme such as &quot;-do&quot; or &quot;-ha&quot; is attached after &quot;geureoke&quot;, it is the latter case. Third, if a morpheme is a common subset of two MWTUs, we select the MWTU whose first component morpheme comes earlier in the sentence. This rule reduces the recognition time by skipping morphemes which are subsets of already confirmed MWTUs. Fourth, when two or more MWTUs starting from the same morpheme are recognized and one is a superset of the others, we select the superset. For example, consider two MWTUs: &quot;jamsi -man -yo (wait a moment)&quot; and &quot;jamsi -man (for a little while)&quot;. If &quot;jamsi -man -yo&quot; is recognized, &quot;jamsi -man&quot; can also be recognized, and &quot;jamsi -man -yo&quot; is the superset of &quot;jamsi -man&quot;. In this case, we select the superset, &quot;jamsi -man -yo&quot;.</Paragraph> </Section> </Section> </Paper>