<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2025"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Modified Joint Source-Channel Model for Transliteration. Asif Ekbal</Title> <Section position="4" start_page="193" end_page="193" type="metho"> <SectionTitle> 3 Proposed Models and Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="193" end_page="193" type="sub_section"> <SectionTitle> Scheme </SectionTitle> <Paragraph position="0"> Machine transliteration has been viewed as a sense disambiguation problem. A number of transliteration models have been proposed that can generate the English transliteration of a Bengali word that is not registered in any bilingual or pronunciation dictionary. The Bengali word is divided into Transliteration Units (TUs) that have the pattern C+M, where C represents a vowel, a consonant, or a conjunct and M represents the vowel modifier or matra.</Paragraph> <Paragraph position="1"> An English word is divided into TUs that have the pattern C*V*, where C represents a consonant and V represents a vowel. The TUs are treated as the lexical units for machine transliteration. The system considers the Bengali and English contextual information simultaneously, in the form of collocated TUs, to calculate the plausibility of transliteration from each Bengali TU to the various candidate English TUs, and chooses the candidate with the maximum probability. This is equivalent to choosing the most appropriate sense of a word in the source language to identify its representation in the target language. The system learns the mappings automatically from the bilingual training corpus, guided by linguistic features. The output of this mapping process is a decision-list classifier with collocated TUs in the source language and their equivalent TUs in collocation in the target language, along with the probability of each decision obtained from the training corpus. 
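As an illustration of the C*V* pattern for English TUs, the following minimal sketch segments a lowercase romanized name with a regular expression. It is not the authors' implementation: the function name english_tus, the restriction to the vowels a, e, i, o, u, and the treatment of every other character (including y) as a consonant are assumptions made here for illustration.

```python
import re

# Assumption: lowercase ASCII input, with a, e, i, o, u as the only vowels.
VOWELS = "aeiou"

# One TU is a (possibly empty) consonant run followed by a vowel run (C*V*);
# the second alternative catches a trailing consonant run with no vowel.
TU_PATTERN = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+|[^{VOWELS}]+")


def english_tus(word: str) -> list[str]:
    """Split a romanized name into C*V* transliteration units (hypothetical helper)."""
    return TU_PATTERN.findall(word.lower())


print(english_tus("brijmohan"))
print(english_tus("loknath"))
```

Under these assumptions, brijmohan is segmented as bri |jmo |ha |n and loknath as lo |kna |th.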
The machine transliteration of the input Bengali word is obtained using direct orthographic mapping by identifying the equivalent English TU for each Bengali TU in the input and then placing the English TUs in order. The proposed models differ in the nature of the collocational statistics used during the machine transliteration process: a monogram model with no context; a bigram model with the previous source TU (with respect to the current TU to be transliterated) as the context; a bigram model with the next source TU as the context; a bigram model with the previous source and target TUs as the context (this is the joint source-channel model); a trigram model with the previous and next source TUs as the context; and the modified joint source-channel model with the previous and next source TUs and the previous target TU as the context.</Paragraph> </Section> </Section> <Section position="5" start_page="193" end_page="194" type="metho"> <SectionTitle> * Model A </SectionTitle> <Paragraph position="0"> In this model, no context is considered on either the source or the target side. 
This is essentially the monogram model.</Paragraph> <Paragraph position="2"> This is essentially a bigram model with the previous source TU, i.e., the source TU occurring to the left of the current TU to be transliterated, as the context.</Paragraph> <Paragraph position="4"> This is essentially a bigram model with the next source TU, i.e., the source TU occurring to the right of the current TU to be transliterated, as the context.</Paragraph> <Paragraph position="6"> This is essentially the joint source-channel model, where the previous TUs on both the source and the target sides are considered as the context.</Paragraph> <Paragraph position="7"> The previous TU on the target side refers to the transliterated TU to the immediate left of the current target TU to be transliterated.</Paragraph> <Paragraph position="9"/> </Section> <Section position="6" start_page="194" end_page="194" type="metho"> <SectionTitle> * Model E </SectionTitle> <Paragraph position="0"> This is basically the trigram model, where the previous and the next source TUs are considered as the context.</Paragraph> <Paragraph position="2"> In this model, the previous and the next TUs in the source and the previous target TU are considered as the context. This is the modified joint source-channel model.</Paragraph> <Paragraph position="4"> The performance of the system is evaluated in terms of Transliteration Unit Agreement Ratio (TUAR) and Word Agreement Ratio (WAR), following the evaluation scheme in (Goto et al., 2003). The evaluation parameter Character Agreement Ratio in (Goto et al., 2003) has been modified to Transliteration Unit Agreement Ratio because vowel modifier (matra) symbols in Bengali words are not independent and must always follow a consonant or a conjunct in a Transliteration Unit. 
Let B be the input Bengali word, E be the English transliteration given by the user in the open test, and E' be the transliteration generated by the system. TUAR is defined as TUAR = (L - Err) / L, where L is the number of TUs in E and Err is the number of wrongly transliterated TUs in E'. WAR is defined as WAR = (S - Err') / S, where S is the test sample size and Err' is the number of erroneous names generated by the system (i.e., names for which E' does not match E). Each of these models has been evaluated with linguistic knowledge of the set of possible conjuncts and diphthongs in Bengali and their equivalents in English. It has been observed that the modified joint source-channel model with linguistic knowledge performs best in terms of both Word Agreement Ratio and Transliteration Unit Agreement Ratio.</Paragraph> </Section> <Section position="7" start_page="194" end_page="196" type="metho"> <SectionTitle> 4 Bengali-English Machine Transliteration </SectionTitle> <Paragraph position="0"> Translation of named entities is a tricky task: it involves both translation and transliteration.</Paragraph> <Paragraph position="1"> Transliteration is commonly used for named entities, even when the words could be translated [LXToc V_ (janata dal) is translated to Janata Dal (literal translation) although LXToc (Janata) and V_ (Dal) are vocabulary words]. On the other hand, ^cV[yYCI[y x[y`Yx[yVic_I^ (jadavpur viswavidyalaya) is translated to Jadavpur University, in which ^cV[yYCI[y (Jadavpur) is transliterated to Jadavpur and x[y`Yx[yVic_I^ (viswavidyalaya) is translated to University.</Paragraph> <Paragraph position="2"> A bilingual training corpus has been kept that contains entries mapping Bengali names to their respective English transliterations. 
To automatically analyze the bilingual training corpus and acquire the knowledge needed to map new Bengali names to English, TUs are extracted from the Bengali names and the corresponding English names, and the Bengali TUs are associated with their English counterparts.</Paragraph> <Paragraph position="3"> Some examples are given below:</Paragraph> <Paragraph position="5"> After retrieving the transliteration units from a Bengali-English name pair, the system associates the Bengali TUs with the English TUs along with the TUs in context.</Paragraph> <Paragraph position="6"> For example, it derives the following transliteration pairs or rules from the name-pair:</Paragraph> <Paragraph position="8"> In some cases, however, the number of transliteration units retrieved from the Bengali and English words may differ. The name pair [ [yELa]cc/X (brijmohan) - brijmohan ] yields 5 TUs on the Bengali side and 4 TUs on the English side [ [yE |L |a]c |c/ |X - bri |jmo |ha |n]. In such cases, the system cannot align the TUs automatically, and linguistic knowledge is used to resolve the confusion. A knowledge base that contains a list of Bengali conjuncts and diphthongs and their possible English representations has been kept. The hypothesis followed in the present work is that the problem TU on the English side always has the maximum length. If more than one English TU has the same length, the system starts its analysis from the first one. In the above example, the TUs bri and jmo have the same length. The system consults the knowledge base and ascertains that bri is valid and that jmo cannot be a valid TU in English, since there is no corresponding conjunct representation in Bengali. So jmo is split into 2 TUs, j and mo, and the system aligns the 5 TUs as [[yE |L |a]c |c/ |X - bri |j |mo |ha |n]. 
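The length-based hypothesis described above can be sketched as follows. This is a simplified illustration rather than the system's actual procedure: the function split_problem_tu and the set VALID_CLUSTERS are hypothetical, the knowledge base is reduced to a set of romanized consonant clusters assumed to have a Bengali conjunct, and only a single split is attempted.

```python
import re

# Hypothetical stand-in for the knowledge base: consonant clusters assumed
# to have a conjunct representation in Bengali.
VALID_CLUSTERS = {"br"}


def split_problem_tu(english_tus, bengali_count, valid_clusters=VALID_CLUSTERS):
    """Split one over-long English TU when the English side has too few TUs."""
    if len(english_tus) >= bengali_count:
        return list(english_tus)
    # Longest TUs first; the sort is stable, so ties keep left-to-right order.
    order = sorted(range(len(english_tus)), key=lambda i: -len(english_tus[i]))
    for i in order:
        tu = english_tus[i]
        cluster = re.match(r"[^aeiou]*", tu).group()
        # A multi-consonant onset with no conjunct counterpart marks the problem TU.
        if len(cluster) >= 2 and cluster not in valid_clusters:
            return english_tus[:i] + [tu[0], tu[1:]] + english_tus[i + 1:]
    return list(english_tus)


print(split_problem_tu(["bri", "jmo", "ha", "n"], 5))
```

With the assumed knowledge base, the input bri |jmo |ha |n and a Bengali TU count of 5 give bri |j |mo |ha |n, as in the example above.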
Similarly, [a_cEoXcU (loknath) - loknath] is initially split as [ a_c |Eo |Xc |U - lo |kna |th ], and then as [ lo |k |na |th ], since kna has the maximum length and does not have any valid conjunct representation in Bengali.</Paragraph> <Paragraph position="9"> In some cases, the knowledge of Bengali diphthongs resolves the problem. In the following example, [I[yc |+ |]c (raima) - rai |ma], the numbers of TUs on the two sides do not match. The English TU rai is chosen for analysis because it is longer than the other TU, ma. The vowel sequence ai corresponds to a diphthong in Bengali that has two valid representations &lt; %c+, B &gt;. The first representation signifies that a matra is associated with the previous character, followed by the character +. This matches the present Bengali input. Thus, the English vowel sequence ai is separated from the TU rai (rai - r |ai), and the intermediate form of the name pair is [I[yc |+ |]c (raima) - r |ai |ma]. Here, a matra is associated with the Bengali TU that corresponds to the English TU r, so there must be a vowel attached to the TU r. The TU ai is further split into a and i (ai - a |i), the first one (i.e., a) is merged with the previous TU (i.e., r), and finally the name pair appears as [ I[yc | + |]c (raima) - ra |i |ma].</Paragraph> <Paragraph position="10"> In the following two examples, the numbers of TUs on the two sides do not match.</Paragraph> <Paragraph position="11"> [ aV |[y |I[yc |L (devraj) - de |vra |j ] [ aac |] |Xc |U (somnath) - so |mna |th] It is observed that both vr and mn represent valid conjuncts in Bengali, but these examples contain the constituent Bengali consonants in order and not the conjunct representation. 
During the training phase, if, for some conjunct, examples with the conjunct representation are outnumbered by examples with the constituent-consonant representation, the conjunct is removed from the linguistic knowledge base, and training examples with the conjunct representation are moved to a Direct example base, which contains the English words and their Bengali transliterations. The above two name pairs can then be realigned as [ aV |[y |I[yc |L (devraj) - de |v |ra |j ] [ aac |] |Xc |U (somnath) - so |m |na |th]. Otherwise, if such conjuncts are retained in the linguistic knowledge base, training examples with the constituent-consonant representation are moved to the Direct example base.</Paragraph> <Paragraph position="12"> The Bengali names and their English transliterations are split into TUs in such a way that a one-to-one correspondence results after the linguistic information is applied. In some cases, however, a zero-to-one or many-to-one relationship exists. An example of a zero-to-one relationship [Φ - h] is the name-pair [%c |{c (alla) - a |lla |h], while the name-pair [%c |+ | x\o (aivy) - i |vy] is an example of a many-to-one relationship [%c, + - i]. These bilingual examples should also be included in the Direct example base.</Paragraph> <Paragraph position="13"> In some cases, the linguistic knowledge apparently solves the mapping problem, but not always. From the name-pair [[yI[yFc (barkha) - barkha], the system initially generates the mapping [[y |I[y |Fc - ba |rkha], which is not one-to-one. It then consults the linguistic knowledge base, breaks up the transliteration unit (rkha - rk |ha), and generates the final aligned transliteration pair [[y |I[y |Fc - ba |rk |ha] (since it finds that rk has a valid conjunct representation in Bengali but rkh does not), which is an incorrect transliteration pair for training the system. It should have been [[y |I[y |Fc - ba |r |kha]. 
Errors of this type can be detected by following the alignment process from the target side during the training phase. Such training examples may be either manually aligned or maintained in the Direct example base.</Paragraph> </Section> </Paper>