<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1099"> <Title>An English-Korean Transliteration Model Using Pronunciation and Contextual Rules</Title> <Section position="2" start_page="0" end_page="11" type="relat"> <SectionTitle> 2. Related works </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="11" type="sub_section"> <SectionTitle> 2.1 Probability based transliteration </SectionTitle> <Paragraph position="0"> Lee et al. (1998) used formula (1) to generate a transliterated Korean word 'K' for a given English word 'E'. They defined a pronunciation unit - a chunk of graphemes, or alphabets, that can be mapped to a phoneme - and divided an English word into pronunciation units (PUs) for transliteration. For example, the English word 'board (/B AO R D/)' can be divided into 'b/B/:oa/AO/:r/R/:d/D/', where 'b', 'oa', 'r', and 'd' are PUs. An English word 'E' was represented as a sequence of English PUs, 'E = epu_1:epu_2:...:epu_n'.</Paragraph> <Paragraph position="2"> Korean PUs (kpu_i) were generated according to each epu_i.</Paragraph> <Paragraph position="4"> Lee et al. (1998) considered all possible English PU sequences and their corresponding Korean PU sequences for a given English word, because its pronunciation was not determined. For example, 'data' can have PU sequences such as 'd:at:a', 'da:ta', 'd:a:t:a', and so on. If the total number of English PUs in E is N and the average number of kpu_i per epu_i</Paragraph> <Paragraph position="6"> is M, the total number of generated Korean PU sequences will be about N*M. They then selected the best result among them as the Korean transliteration of the word.</Paragraph> <Paragraph position="8"/> <Paragraph position="10"> A neural network was used to approximate P(E|K).</Paragraph> <Paragraph position="12"/> </Section> <Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 2.2 Decision Tree based transliteration </SectionTitle> <Paragraph position="0"> Kang et al.
(2000; 2001) proposed an English alphabet-to-Korean alphabet conversion method based on a decision tree. This method used six attribute values - the three English alphabets to the left and the three English alphabets to the right - for determining the Korean alphabets corresponding to English alphabets. For each English alphabet, a corresponding decision tree is constructed.</Paragraph> <Paragraph position="1"> Table 1 shows an example of transliteration for the English word 'data'. In Table 1, (E) represents the current English alphabet, and K represents the Korean alphabets generated by the decision trees. (Henceforth, ':' will be used as a PU boundary.)</Paragraph> <Paragraph position="3"> < < < d a t a : 'd'; < < d a t a > : 'e-i'; < d a t a > > : 't'; d a t a > > > : 'a'. This method showed about 49% precision with 6,185 E-K pairs for training and 1,000 E-K pairs for testing.</Paragraph> <Paragraph position="4"> Though the previous works showed relatively good results, they also have some limitations. Because they focused on converting English alphabets to Korean alphabets, they did not consider phonetic features such as phonemes, or word formation features such as the origin of the English word. This causes errors when pronunciation or origin is an important clue for transliteration - as in 'Mcdonald' (pronunciation is needed) and 'amylase' (the origin of the English word is needed).</Paragraph> <Paragraph position="5"> 3.
An English-Korean Transliteration</Paragraph> </Section> <Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> Model using Pronunciation and Contextual Rules 3.1 Overall System Description </SectionTitle> <Paragraph position="0"> Figure 1 shows the overall system description.</Paragraph> <Paragraph position="1"> Our method is composed of two phases - alignment (section 3.2) and transliteration (sections 3.3, 3.4, 3.5 and 3.6).</Paragraph> <Paragraph position="2"> First, an English pronunciation unit (hereafter, EPU) and its corresponding phoneme are aligned. EPU-to-Phoneme alignment finds the most phonetically probable correspondence between an English pronunciation unit and a phoneme. The EPU-to-phoneme aligned results acquired from the alignment algorithm offer training data for estimating the pronunciation of English words that are not registered in a pronunciation dictionary, for example 'zinkenite'. Second, English words are transliterated into Korean words through several steps. Using an English pronunciation dictionary (P-DIC), we can assign a pronunciation to a given English word. (The term 'pronunciation unit' is used with the same meaning as in Lee et al. (1998).) When the word is not registered in P-DIC, we check whether it has a complex word form (section 3.3). For detecting a complex word form, we divide the given English word into two words (word+word) using the entries of P-DIC. If both of them are in P-DIC, we can assign a pronunciation to the given word; otherwise we should estimate its pronunciation (section 3.5). Then, we check whether the English word is of Greek origin or not (section 3.4). Because the E-K transliteration of English words of Greek origin differs from that of pure English words, it is important to detect them.</Paragraph> <Paragraph position="3"> The pronunciation of English words that are not registered in a P-DIC is estimated (section 3.5) in the next step.
Finally, Korean transliterated words are generated using conversion rules (section 3.6). The right side of Figure 1 shows a transliteration example for an English word.</Paragraph> </Section> <Section position="4" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2 EPU-to-Phoneme Alignment </SectionTitle> <Paragraph position="0"> EPU-to-Phoneme (hereafter, EPU-P) alignment finds the most phonetically probable correspondence between an English pronunciation unit and a phoneme. For example, one of the possible alignments for the English word 'board' and its pronunciation '/B AO R D/' is as follows.</Paragraph> <Paragraph position="1"> (A word like 'broadcasting' might be divided into three words - 'broad', 'cast' and 'ing' - but in the training corpus and pronunciation dictionary, every complex word is divided into two words, like 'broad' and 'casting'.) For automatic EPU-P alignment, we used a modified version of Kang's E-K alignment algorithm (Kang et al., 2000; Kang et al., 2001), which is based on Covington's algorithm (Covington, 1996). Covington views an alignment as a way of stepping through two words - a word on one side and a word on the other - while performing a 'match' or 'skip' operation on each step. Kang added 'forward bind' and 'backward bind' operations to handle one-to-many, many-to-one and many-to-many alignments. In the alignment result for the word 'board', 'M' represents 'match' and '<' represents 'backward bind'.</Paragraph> <Paragraph position="2"> Unlike the previous alignment algorithm, we combine the 'skip' and 'bind' operations, because the 'skip' operation can be replaced with the 'bind' operation. This forces every PU to be mapped to a phoneme; that is, our algorithm allows neither null-to-phoneme nor PU-to-null alignments. All the valid alignments that are possible with the 'match' and 'bind' operations can be generated.
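A brute-force sketch of this generate-and-select scheme follows; the operation encoding, penalty values, and the tiny matching-table sample are illustrative assumptions, not the paper's actual tables.

```python
def alignments(pus, phones):
    """Enumerate all alignments in which each step pairs one or more PUs
    with one or more phonemes ('match' is 1:1; 'bind' covers the rest).
    No PU or phoneme is left unaligned, as in the paper's algorithm."""
    if not pus and not phones:
        yield []
        return
    for i in range(1, len(pus) + 1):
        for j in range(1, len(phones) + 1):
            head = (tuple(pus[:i]), tuple(phones[:j]))
            for rest in alignments(pus[i:], phones[j:]):
                yield [head] + rest

def best_alignment(pus, phones, penalty):
    """Select the alignment with the least total penalty."""
    return min(alignments(pus, phones),
               key=lambda a: sum(penalty(p, q) for p, q in a))

# Hypothetical penalty: 0 for pairs in an assumed matching table, 1 otherwise.
TABLE = {(("b",), ("B",)), (("oa",), ("AO",)), (("r",), ("R",)), (("d",), ("D",))}
def penalty(pu, ph):
    return 0 if (pu, ph) in TABLE else 1
```

With this toy table, `best_alignment(["b", "oa", "r", "d"], ["B", "AO", "R", "D"], penalty)` recovers the 'b/B/:oa/AO/:r/R/:d/D/' alignment, since every other grouping incurs a penalty.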
(The phoneme notation used here is one of the methods for coding phonemes into ASCII characters.)</Paragraph> <Paragraph position="3"> Alignment may thus be interpreted as finding the best result among all of them. (In this paper, vowel pronunciation includes diphthongs.) To find the best result, a penalty scheme is used - the best alignment is the one with the least penalty value. Since Kang's method focused on E-K character alignment, its penalty scheme and E-K character-matching table were restricted to E-K alignment. Instead of Kang's E-K character penalty scheme, we developed an EPU-P penalty scheme and an EPU-P matching table using manually aligned EPU-P data. We assume that any vowel can be aligned with any vowel phoneme without penalty. Table 3 shows our penalty metrics and table 4 shows an example of EPU-P alignment.</Paragraph> <Paragraph position="4"> We aligned about 120,000 English word and pronunciation pairs in 'The CMU Pronouncing Dictionary'. To evaluate the performance of the alignment, we randomly selected 100 results. The accuracy of EPU-P alignment is 99%.</Paragraph> </Section> <Section position="5" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.3 Dealing with a Complex word form </SectionTitle> <Paragraph position="0"> Some English words are not in P-DIC because they have a complex word form. In this paper, we define words in a complex word form as those composed of two base nouns that are in P-DIC.</Paragraph> <Paragraph position="1"> When a given word is not in P-DIC, it is segmented into all possible pairs of two words. If both words are in P-DIC, we can assign their pronunciation. For example, 'cutline' can be segmented into 'c+utline', 'cu+tline', 'cut+line' and so on. 'cut+line' is the correct segmentation of 'cutline', because 'cut' and 'line' are in P-DIC. If a word is not in P-DIC and is not in a complex word form, we should estimate its pronunciation.
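A minimal sketch of this segmentation, assuming a small illustrative P-DIC sample and a hypothetical function name:

```python
# Illustrative sketch of complex-word segmentation (section 3.3).
# P_DIC below is a toy sample standing in for the real pronunciation dictionary.
P_DIC = {"cut", "line", "broad", "casting"}

def segment_complex_word(word):
    """Return the first split word = left + right with both halves in P-DIC,
    or None if the word is not in a complex word form."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in P_DIC and right in P_DIC:
            return left, right
    return None
```

Here `segment_complex_word("cutline")` yields `("cut", "line")`, while words without a valid split fall through to pronunciation estimation (section 3.5).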
The details of estimating pronunciation will be described in section 3.5.</Paragraph> </Section> <Section position="6" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.4 Detecting English words of Greek origin </SectionTitle> <Paragraph position="0"> In Korean, there are two methods of E-K transliteration - 'written word transliteration' and 'spoken word transliteration' (Lee et al., 1998). The two methods use a similar mechanism for consonant transliteration. However, when transliterating vowels, 'written word transliteration' uses the vowel's character and 'spoken word transliteration' uses its phoneme. For example, 'a' in 'piano' can be transliterated into 'pi-a-no' with its character and 'pi-e-no' with its phoneme.</Paragraph> <Paragraph position="1"> Since a vowel in a pure English word is usually transliterated using its phoneme, while a vowel in an English word of Greek origin is usually transliterated with its character - for example, 'hernia' (he-reu-ni-a), 'acacia' (a-ka-si-a), 'adenoid' (a-de-no-i-deu), and so on - it is important to detect words of Greek origin. We use suffix and prefix patterns (Luschnig, 2001) for detecting them; Table 5 shows the patterns. If a word has one of the affixes in Table 5, we classify it as a word of Greek origin; otherwise, as a pure English word.</Paragraph> <Paragraph position="2"> Suffix: -ic, -tic, -ac, -ics, -ical, -oid, -ite, -ast, -isk, -iscus, -ia, -sis, -me, -ma. Prefix: amphi-, ana-, anti-, apo-, dia-, dys-, ec-, ecto-, enantio-, endo-, epi-, cata-, cat-, meta-, met-, palin-, pali-, para-, par-, peri-, pros-, hyper-, hypo-, hyp-. Table 5. Suffix and prefix patterns for detecting English words of Greek origin.</Paragraph> </Section> <Section position="7" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.5 Estimating Pronunciation </SectionTitle> <Paragraph position="0"> Estimating pronunciation is composed of two steps.
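Section 3.4's affix-based check could be sketched as follows, using the Table 5 patterns; the function name and tuple encoding are assumptions.

```python
# Affix-based detection of Greek-origin words (section 3.4), using the
# suffix and prefix patterns of Table 5.
SUFFIXES = ("ic", "tic", "ac", "ics", "ical", "oid", "ite", "ast",
            "isk", "iscus", "ia", "sis", "me", "ma")
PREFIXES = ("amphi", "ana", "anti", "apo", "dia", "dys", "ec", "ecto",
            "enantio", "endo", "epi", "cata", "cat", "meta", "met",
            "palin", "pali", "para", "par", "peri", "pros",
            "hyper", "hypo", "hyp")

def is_greek_origin(word):
    """A word is judged to be of Greek origin if it carries any Table 5 affix."""
    w = word.lower()
    return w.endswith(SUFFIXES) or w.startswith(PREFIXES)
```

For instance, 'hernia' and 'adenoid' match the suffixes '-ia' and '-oid', while 'board' matches nothing and is treated as a pure English word.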
Using aligned EPU-P pairs as training data, we can find the EPUs in a given English word ('Chunking EPU') and assign their appropriate phonemes ('EPU-to-Phoneme assignment'). To deal with English words of Greek origin, we categorize the EPU-P aligned data into pure English words (E-class) and English words of Greek origin (G-class), and construct the 'Chunking EPU' module and the 'EPU-to-Phoneme assignment' module for each class.</Paragraph> <Paragraph position="1"> 'Chunking EPU' finds the boundaries of the EPUs in an English word. For example, we can find the EPUs in 'board' as 'b:oa:r:d'. For chunking EPUs, we used C4.5 (Quinlan, 1993) with ten attributes - the five alphabets to the left and the five to the right. This setting shows the best result among various settings, such as eight attributes (left four and right four - 87.2%) and so on. (Some Greek affixes, such as the prefixes 'a-' and 'an-' and the suffixes '-y' and '-m', are not used because they may cause errors.)</Paragraph> <Paragraph position="2"> C4.5 is one of the popular methods for recognizing the boundaries of chunks. Unlike Kang et al. (2000), we use 90% of the EPU-P aligned data as training data and 10% as test data. Our 'Chunking EPU' module shows 91.7% precision.</Paragraph> <Paragraph position="4"> EPU-to-Phoneme assignment can be represented as formula (5); p(P) and p(E|P) are approximated as formulas (6) and (7).</Paragraph> </Section> <Section position="8" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.6 Phoneme-to-Korean Conversion </SectionTitle> <Paragraph position="0"> Our Phoneme-to-Korean (P-K) conversion method is based on the English-to-Korean Standard Conversion Rule (EKSCR) (Ministry, 1995).</Paragraph> <Paragraph position="1"> EKSCR is composed of nine general rules and five rules for specific cases - each rule contains several sub-rules. It describes a transliteration method from English alphabets or phonemes to Korean alphabets.
It uses English phonemes as transliteration conditions - if a phoneme is A, then it is transliterated into a Korean alphabet B.</Paragraph> <Paragraph position="2"> However, EKSCR does not contain enough rules to generate correct Korean words for the corresponding English words, because it mainly focuses on mapping one English phoneme to one Korean character without the context of phonemes and PUs. For example, the English word 'board' with its pronunciation '/B AO R D/' is transliterated into 'bo-reu-deu' by EKSCR - the correct transliteration is 'bo-deu'.</Paragraph> <Paragraph position="3"> In E-K transliteration, the phoneme 'R' before consonant phonemes and after vowel phonemes is rarely transliterated into Korean characters. (Note that the phoneme 'R' in English words of Greek origin is frequently transliterated into the Korean consonant 'r'.) Our method produces EPUs and their phonemes, which makes it possible for the E-K conversion method (section 3.6) to use the context of an EPU and its phoneme. Because an alphabet-to-alphabet mapping method does not use EPUs and their phonemes, it may show errors when a phoneme and its context are the most important clues - for example, 'Mcdonald'.</Paragraph> <Paragraph position="4"> These contextual rules are very important for generating correct Korean transliterated words.</Paragraph> <Paragraph position="5"> We capture contextual rules by observing errors in results generated by applying EKSCR to 200 randomly selected words from the CMU pronunciation dictionary. The selected words are not in the test data of the experiment. Among the generated rules, we selected the 27 contextual rules with high frequency (above 5).</Paragraph> <Paragraph position="6"> Table 6 shows some rules and the conditions under which they fire. There are three conditions - 'Context', 'TPU (Target PU)', and 'TP (Target Phoneme)'.
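One plausible encoding of such rules is shown below; the matching code, names, and phoneme sets are assumptions, while the sample rule mirrors the '[R] after VP and before CP' entry of Table 6.

```python
# Illustrative encoding of contextual conversion rules (section 3.6).
# A rule fires when its context test, target PU (TPU), and target phoneme (TP)
# all match; the sets below are toy subsets, not the paper's full inventories.
VOWEL_PHONEMES = {"AO", "AH", "ER", "IY"}
CONSONANT_PHONEMES = {"B", "D", "R", "S", "M"}

def r_after_vp_before_cp(pus, phones, i):
    """Context condition '[R] after VP and before CP'."""
    return (0 < i < len(phones) - 1
            and phones[i - 1] in VOWEL_PHONEMES
            and phones[i + 1] in CONSONANT_PHONEMES)

RULES = [
    # (context test, TPU, TP, Korean output)
    (r_after_vp_before_cp, "r", "R", "eu"),
]

def apply_rules(pus, phones, i, default):
    """Return the output of the first firing rule, else the default conversion."""
    for context, tpu, tp, out in RULES:
        if pus[i] == tpu and phones[i] == tp and context(pus, phones, i):
            return out
    return default
```

For 'board' (`b:oa:r:d` with `/B AO R D/`), the sample rule fires on the 'r' PU because 'R' follows a vowel phoneme and precedes a consonant phoneme.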
In the context condition, '[]', '{}', C, VP, and CP represent a phoneme, a pronunciation unit, a consonant, vowel phonemes, and consonant phonemes, respectively. The rule with context condition '[R] after VP and before CP' is not fired for English words of Greek origin; except for it, all rules are applied to both classes. Some rules from Table 6: context 'C + {le}', TPU 'le', TP 'AH L', output 'eul'; context '{or} at the end of a word', TPU 'or', TP 'ER', output 'eo'; context '{or} in a word', TPU 'or', TP 'ER', output 'eu'; context '{sm} at the end of a word', TPU 'sm', TP 'S AH M', output 'jeum'; context '[R] after VP and before CP', TPU 'r', TP 'R', output 'eu'.</Paragraph> </Section> <Section position="9" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4. Experiment 4.1 Experimental Setup </SectionTitle> <Paragraph position="0"> We use two data sets for the accuracy test. Test Set I (Lee et al., 1998) is composed of 1,650 E-K pairs. Since this test set was used as a common testbed in (Lee et al., 1998; Kim et al., 1999; Kang et al., 2000; Kang et al., 2001), we use it for comparison between our method and the other methods: 1,500 pairs are used as training data for the other methods, and 150 pairs are used as test data for all methods. Test set II (Kang et al., 2000) consists of 7,185 E-K pairs - 6,185 pairs for training and 1,000 for testing. We use Test set II to compare our method with (Kang et al., 2000), which shows the best result among the previous works.</Paragraph> <Paragraph position="1"> Evaluation is performed by word accuracy (W.A.) and character accuracy (C.A.), which were used as the evaluation measures in the previous works (Lee and Choi 1998; Kim and Choi 1999; Kang and Choi 2000).</Paragraph> <Paragraph position="3"> C.A. = (L - (i + d + s)) / L</Paragraph> <Paragraph position="4"> where L represents the length of the original string, and i, d, and s represent the numbers of insertions, deletions and substitutions, respectively.
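The character-accuracy measure could be computed as below, taking the combined edit counts from a standard Levenshtein computation and clamping the value to zero when the edit operations exceed L (Hall and Dowling, 1980); the function names are illustrative.

```python
# Character accuracy (C.A.): C.A. = (L - (i + d + s)) / L, clamped to zero
# when i + d + s exceeds L. The combined count i + d + s equals the minimum
# edit distance between the correct and the generated string.

def edit_ops(a, b):
    """Minimum combined number of insertions, deletions and substitutions
    needed to turn string b into string a (Wagner-Fischer dynamic program)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        d[x][0] = x
    for y in range(n + 1):
        d[0][y] = y
    for x in range(1, m + 1):
        for y in range(1, n + 1):
            cost = 0 if a[x - 1] == b[y - 1] else 1
            d[x][y] = min(d[x - 1][y] + 1, d[x][y - 1] + 1,
                          d[x - 1][y - 1] + cost)
    return d[m][n]

def character_accuracy(original, generated):
    L = len(original)
    ops = edit_ops(original, generated)
    return max(0.0, (L - ops) / L)
```

A single substitution in a four-character string, for example, gives a C.A. of 0.75, while a generated string needing more edits than L characters scores zero.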
If L < (i + d + s), we consider C.A. to be zero (Hall and Dowling, 1980).</Paragraph> <Paragraph position="5"> We perform three experiments as follows.</Paragraph> <Paragraph position="6"> - Comparison Test: comparison between our method and the previous works. - Dictionary Test: performance of transliteration for words registered in a pronunciation dictionary and for words not registered in it. - Component Test: performance contribution of each component. The results are shown for Test set I and Test set II respectively. In the tables, our method shows higher performance, especially in W.A. Moreover, our method shows higher C.A.; this means that even when the words generated by our method are not the correct answer, they are more similar to the correct transliteration.</Paragraph> <Paragraph position="7"> This also holds in the comparison with the 20 highest-ranked results.</Paragraph> <Paragraph position="8"> For the dictionary test, we use the test data of Test set II. In the results, 'registered' words show higher performance. This can be explained by the facts that the contextual rules were constructed from words registered in a P-DIC and that the pronunciation estimation module makes some errors. However, 'not registered' words also show relatively good performance.</Paragraph> <Paragraph position="9"> For the component test, we use the words that are 'not registered' in Test set II. The components tested in the 'Component test' are 'Dealing with words in a complex word form' [C], 'Detecting English words of Greek origin' [G], and 'Contextual rules' [R]. In the results, [G] and [R] show good results, in contrast to [C]. There are so few words in complex word forms that [C] does not show a significant performance improvement, though its performance is relatively good - about 70% W.A. for the 43 such words (out of 313 words in total). For an effective comparison, it will be necessary to consider the number of words each component handles. Our method shows better performance than 'W/O [R]' (EKSCR).
It indicates that contextual rules are important.</Paragraph> </Section> <Section position="10" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3. Discussion </SectionTitle> <Paragraph position="0"> The previous works focused on an alphabet-to-alphabet mapping method. However, because transliteration is more phonetic than orthographic, it may be difficult to acquire relevant results without phonetic information. (The Hangul alphabet has phonetic as well as orthographic characteristics, so it may be adopted into our method in the role of phonemes.) In the results, words such as 'crepe (keu-le-i-peu/ keu-le-pe)', 'dealer (dil-leo/ di-eol-leo)', 'diode (da-i-o-deu/ di-o-deu)', and 'pheromone (pe-ro-mon/ pe-eo-o-mon)' produce errors in the previous works, because they are transliterated into Korean according to pronunciation, and such patterns cannot be acquired with an alphabet-to-alphabet mapping method. For example, 'e' before 'p' in 'crepe' is transliterated into the Korean characters 'e-i' by the previous works, but it is usually transliterated into 'e' in the training data. The origin of an English word also contributes to performance improvement - for example, for words such as 'hittite (hi-ta-i-teu/ ha-i-ta-i-teu)', 'hernia (he-leu-ni-a/ heo-ni-a)', and 'cafeteria (ka-pe-te-li-a/ ka-pi-te-ri-a)'. In summary, E-K transliteration is not an alphabet-to-alphabet mapping problem but a problem that can be solved with the mixed use of alphabet, phoneme, and word formation information.</Paragraph> <Paragraph position="1"> In the experiments, we find that vowel transliteration, rather than consonant transliteration, is the main source of errors in E-K transliteration. In particular, 'AH' is the most ambiguous phoneme, because it can be transliterated into several Korean characters such as 'eo', 'e', 'u', and so on. To improve the performance of E-K transliteration, more specific rules may be necessary to handle vowel transliteration.</Paragraph> </Section> </Section> </Paper>