<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1056">
  <Title>An English to Korean Transliteration Model of Extended Markov Window</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
LG Electronics Institute of Technology
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="383" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Automatic transliteration problem 1s to transcribe foreign words in one's own alphabet.</Paragraph>
    <Paragraph position="1"> Machine generated transliteration can be useful in various applications such as indexing in an information retrieval system and pronunciation synthesis in a text-to-speech system. In this paper we present a model for statistical English-to-Korean transliteration that generates transliteration candidates with probability. The model is designed to utilize various information sources by extending a conventional Markov window. Also, an efficient and accurate method for alignment and syllabification of pronunciation units is described. The experimental results show a recall of 0.939 for trained words and 0.875 for untrained words when the best 10 candidates are considered.</Paragraph>
    <Paragraph position="2"> Introduction As the amount of international communication increases, more foreign words arc flooding into the Korean language. Especially in the area of comlmter and information science, it has been reported that 29.4% of index terms are transliterated fiom or directly written in English in the case of a balanced corpus, KT-SET \[18\]. The transliteration of l'oreign words is indispensable in Korean language processing.</Paragraph>
    <Paragraph position="3"> In information retrieval, a simple method of processing foreign words is via query term translation based on a synonym dictionary of foreign words and their target transliteration. It is necessary to automate the construction process of a synonym dictionary since its maintenance requires continuous efforts for ever-incoming foreign words. Another area to which transliteration can be applied is a text-to-speech system where orthographic words are transcribed into phonetic symbols, in such applications, maximum likelihood \[15\], decision tree \[1\], neural network \[10\] or weighted finited-state acceptor \[19\] has been used for finding the best fit.</Paragraph>
    <Paragraph position="4"> English-to-Korean transliteration problem is that of generating an appropriate Korean word given an English word. In general, there can be various possible transliterations in Korean which correspond to a single English word. It is COlnmon that the newly imported foreign word is transliterated into several possible candidate words based on pronunciation, out of which only a few survive in competition over a period of time. In tiffs respect, a statistical approach makes sense where multiple transliteration variations exist for one word, generating candidates in probable order.</Paragraph>
    <Paragraph position="5"> In this paper, we present a statistical method to transliterate English words in Korean alphabet to generate various candidates. In the next section, we describe a phonetic mapping table construction. In Section 2, we describe alignment and syllabification methods, and in Section 3, mathematical formulation for a statistical model is presented. Section 4 provides experimental results, and finally, we state our conclusions.</Paragraph>
    <Paragraph position="6"> ~' Present a&amp;h'ess: Sevice Engineering Team, Chollia, Service Development Division, DACOM Cotporation, Seoul,</Paragraph>
    <Paragraph position="8"/>
  </Section>
  <Section position="3" start_page="383" end_page="383" type="metho">
    <SectionTitle>
1 Phonetic mapping table construction
</SectionTitle>
    <Paragraph position="0"> First of all, we generate a mapping between English and Korean phonetic unit pairs (Table 6).</Paragraph>
    <Paragraph position="1"> In doing so, we use pronunciation sylnbols for English words (Table 5) as defined in the Oxford computer-usable dictionary \[12\]. The English and Korean phonetic unit can be a consonant, a vowel or some composite of them so as to make transliteration mapping unique and accurate. The orthography for foreign word transliteration to Korean provides a siml?le mapping from English to Korean phonetic units. But in reality, there are a lot of transliteration cases that do not follow the orthography. Table 6-1 has been constructed by examining a significant amount of corpus so that we can cover as many cases as possible.</Paragraph>
    <Paragraph position="2"> Table 6-2 shows complex cases where a combination of two or more English phonelnes are mapped to multiple candidates of a composite Korean phonetic unit. This phonetic mapping table is carefully constructed so as to produce a unique candidate in syllabification and aligmnent in the training stage. When a given English pronunciation can be syllabificated into serveral milts or a single composite unit, we adopt a heuristic that only the composite unit consisting of longer phonetic units is considered. For example, the English phonetic unit &amp;quot;u @&amp;quot; can be mapped to a Korean phonetic unit ,,@o\] \[u@\]&amp;quot; or &amp;quot;wdegq \[w@\]&amp;quot; even though the colnposition of each unit mapping of &amp;quot;u&amp;quot; and &amp;quot;@&amp;quot; can result in other composite mappings such as &amp;quot;-degr-deg\] \[juG\]&amp;quot;, ,,~o\] \[wI@\]&amp;quot;, ,,o_o\] \[wuja\]&amp;quot;, etc. This composite phonetic unit mapping is also useful for statistical tagging since composite units provide more accurate statistical information when they are well devised.</Paragraph>
  </Section>
  <Section position="4" start_page="383" end_page="384" type="metho">
    <SectionTitle>
2 Alignment and syllabification method
</SectionTitle>
    <Paragraph position="0"> method The alignment and syllabification process is critical for probabilistic tagging as it is closely linked to computational complexity. There can be combinatorial explosion of state sequences because potential syllables may overlap the same letter sequences. A statistical approach called, Forward-Backward parameter estimation algorithm, is used by Sharman in phonetic transcription problem \[2\]. But a statistical approach for syllabification requires expensive computatioual resources and a large amount of training corpus. Moreover, it often results in many improper candidates. In this paper, we propose a simple heuristic alignment and syllabification method that is fast and efficient. The maiu principle in separating phonetic units is to manage a phonetic unit of English and that of Korean to be mapped in a unique way. For example, the pronunciation notation &amp;quot;@R&amp;quot; of the suffix &amp;quot;-er&amp;quot; in &amp;quot;computer&amp;quot; is mapped to ,,cq \[@R\]&amp;quot; in Korean. In this case, the complex pronunciation &amp;quot;@R&amp;quot; is treated as one phonetic unit. There are many such examples in complex vowels, as in &amp;quot;we&amp;quot; to &amp;quot;~-\]\] \[we\]&amp;quot;, &amp;quot;'jo&amp;quot; to ,,.~o \[jo TM j, etc. It is essential to come up with a phonetic unit mapping table that can reduce the time complexity of a tagger and also contribute to accurate transliteration results. Table 6 shows the examples of phonetic units and their mapping to Korean.</Paragraph>
    <Paragraph position="1"> The alignment process in training consists of two stages. The first is consonant alignment which identifies corresponding consonant pairs by scanning the English phonetic unit mad Korean notation. The second is vowel alignment which separates corresponding vowel pairs within the consonant alignment results of stage 1. Figure 1 shows an aligmnent example in training. The aligned and syllabificated units are used to extract statistical inforination from the training corpus. The alignment process always produces one result. This is possible because of the predefined English to Korean phonetic unit mapping in Table 6.</Paragraph>
    <Paragraph position="2"> Input: English pronunciation and Korean notation First stage: consonant alignment Second stage: vowel alignment</Paragraph>
    <Paragraph position="4"> &lt;Figure l&gt;AIignment example for training data input.</Paragraph>
    <Paragraph position="5">  '/' mark: a segmentation position by a consonant '\]' mark: a segmentation position by a vowel.  and alignment. To take the English word &amp;quot;computer&amp;quot; as an exalnple, the English pronunciation notation &amp;quot;k@mpu@R&amp;quot; is retrieved froln the Oxford dictionary, in the first stage, it is segmented in flont of the consonants &amp;quot;k&amp;quot;, &amp;quot;m&amp;quot;, &amp;quot;p&amp;quot; and &amp;quot;t&amp;quot; which are aligned with the corresponding Korean consonants &amp;quot;=1 \[k\]&amp;quot;, &amp;quot;rJ \[m\]&amp;quot;, &amp;quot;~ \[p\]&amp;quot; and &amp;quot;E It\]&amp;quot;. In the second stage, it is segmented in flont of the vowels &amp;quot;@&amp;quot;, &amp;quot;u&amp;quot; and &amp;quot;@R&amp;quot; and aligned with the corresponding Korean vowels &amp;quot;-\] \[@R\]&amp;quot;, &amp;quot;-lT \[ju\]&amp;quot; and &amp;quot;-\] \[@R\]&amp;quot;. The composite vowel &amp;quot;@R&amp;quot; is not divided into two simple vowels &amp;quot;@&amp;quot; and &amp;quot;R&amp;quot; since it is aligned to Korean &amp;quot;-\] \[@R\]&amp;quot; in accordance with entry in Table 6-2. When it is possible to syllabificate in more than one ways, only the longest phonetic unit is selected so that an alignment always ends up being unique during the training process.</Paragraph>
    <Paragraph position="6"> After the training stage, an input English word must be syllabificated automatically so that it can be transliterated by our tagger. During this stage, all possible syllabil'ication candidates are geuerated and are given as inputs to the statistical tagger so that the proper Korean notation can be found.</Paragraph>
  </Section>
  <Section position="5" start_page="384" end_page="386" type="metho">
    <SectionTitle>
3 Statistical transliteration model
</SectionTitle>
    <Paragraph position="0"> A probabilistic tagger finds the most probable set of Korean notation candidates fl'om the possible syllabificated results of English pronunciation notation. Let \[7, 8, 9\] proposed a statistical transliteration model based on the statistical translation model-I by Brown \[2\] that uses only a simple information source of a word pair.</Paragraph>
    <Paragraph position="1"> Various kinds of information sources are involved in the English to Korean transliteration problem. But it is not easy to systematically exploit various information sources by extending the Markov window in a statistical model. The tagging model proposed in this paper exploits not only simple pronunciation unit-to-unit mapping froul English to Korean, but also more complex contextual information of multiple units mapping. In what follows, we explain how the contextual information is represented as conditional probabilities.</Paragraph>
    <Paragraph position="2"> An English word E's pronunciation S is fouud in a phonetic dictionary. Suppose that S can be segmented into a sequence of syllabificated units sis 2...s. where s~ is an English phonetic unit as in Table 6. Also suppose thatKis a Korean word, where lq is the i-th phonetic unit of K.</Paragraph>
    <Paragraph position="4"> Let us say P(E, K) is the probability that an English word E is transliterated to a Korean word K. What we have to find is K where P(E, K) is lnaximized given E. This probability can be approximated by substituting the English word E with its prontmciation S. Thus, the following formula holds.</Paragraph>
    <Paragraph position="5"> arg max P(E, K)</Paragraph>
    <Paragraph position="7"> (2) where P(S) is called language model and I'(K\]S) is called translation, model. P(S) is not constant given a fixed input word because there can be a number of syllabification candidates.</Paragraph>
    <Paragraph position="8"> In detwmining k~, four neighborhood variables are taken into account, while conventional tagging models use only two neighborhood wuiables. The extended Markov window of infolnlation source is defined as in Figure 2. It also shows a conventional Markov window using a dashed line. Mathematical fornmlation for Markov window extension is not an easy probleln since extended window aggravates data sparseness. We will explain our solution in the next step.</Paragraph>
    <Paragraph position="9">  Now, the translation lnodel, P(K\[S) in equation (2) can be approximated by Markov assumption as follows.</Paragraph>
    <Paragraph position="11"> Equation(3) still has data sparseness problem in gathering information directly from the training corpus. So we expand it using Markov chain in order to replace the conditional probability term in (3) with more fragmented probability terms.</Paragraph>
    <Paragraph position="13"> In Equation (4), there are two kinds of approximations in probability terms. First,</Paragraph>
    <Paragraph position="15"> This approxilnation is based on our heuristic that kj_~and s~ ~ provide somewhat redundant information in deterlnining s,. Secondly,</Paragraph>
    <Paragraph position="17"> respectively, based on a heuristic that k,_ts~_ ~ is farther off than k, sj. and is redundant. Equation (4) can be reduced to Equation (5) because l'(s,, I k,,k,) of (4) is equivalent to ,'(k, Is, ,k, ,)</Paragraph>
    <Paragraph position="19"> The language model we use is a bigram language model (Brown et al. \[61)</Paragraph>
    <Paragraph position="21"> Now, our statistical tagging model can be formulated as Equation (7) when the translation model (5) and the language model (6) are applied to the transliteration model (2)  reduced to a conventional bigram tagging model (Eq. 8). that is a base model of Brown model-1 \[2\]. Charniak \[4\], Merialdo \[11\] and Lee \[7, 8. 9\].</Paragraph>
    <Paragraph position="22"> arg max P(S. K) --_ arg max I~ P(ki I k.,)P(s~ I ki) (8) K K i Equation (7) is the final tagging model proposed in this paper. We use a back-off strategy \[10, 1 1\] as follows, because our tagging model may have a data sparseness problem.</Paragraph>
    <Paragraph position="23"> P(k, \] s,_,k,_l).~ P(k, I k,,), if Count(s,_,ki_l) = 0 .,'(s, I k, s, ,) .~ p(.~, I k,). if Counz(k,s,_,) = 0 p(si+ 1 \]kisi) w, F(s/+ 1 Is/), if Coltrlt(kisi) = 0 Each probability term in equation (7) is obtained fi'om the training data. The statistical tagger modeled here uses Viterbi algorithm \[12\] for its search to get N-Best candidates.</Paragraph>
  </Section>
class="xml-element"></Paper>