<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1005"> <Title>Translating Names and Technical Terms in Arabic Text</Title> <Section position="4" start_page="35" end_page="35" type="metho"> <SectionTitle> 3 Adapting to Arabic </SectionTitle> <Paragraph position="0"> There are many interesting differences between Arabic and Japanese transliteration. One is that Japanese uses a special alphabet for borrowed foreign names and terms. With Arabic, there are no such obvious clues, and it is difficult to determine even whether to attempt a back-transliteration, to say nothing of computing an accurate one. We will not address this problem directly here, but we will try to avoid inappropriate transliterations. While the Japanese system is robust (everything gets some transliteration), we will build a deliberately more brittle Arabic system, whose failure signals that transliteration may not be the correct option.</Paragraph> <Paragraph position="1"> While Japanese borrows almost exclusively from English, Arabic borrows from a wider variety of languages, including many European ones. Fortunately, our pronunciation dictionary includes many non-English names, but we should expect to fail more often on transliterations from, say, French or Russian. Japanese katakana writing seems perfectly phonetic, but there is actually some uncertainty in how phonetic sequences are rendered orthographically.</Paragraph> <Paragraph position="2"> Arabic is even less deterministically phonetic; short vowels do not usually appear in written text. Long vowels, which are normally written in Arabic, often but not always correspond to English stressed vowels; they are also sometimes inserted in foreign words to help disambiguate pronunciation. Because true pronunciation is hidden, we should expect that it will be harder to establish phonetic correspondences between English and Arabic.</Paragraph> <Paragraph position="3"> Japanese and Arabic have similar consonant-conflation problems. 
A Japanese r sound may have an English r or l source, while an Arabic b may come from p or b. This is what makes back-transliteration hard. However, a striking difference is that while Japanese writing adds extra vowels, Arabic writing deletes vowels. For example [2]: Henriette: H EH N R IY EH T (English); h e n o r i e t t o (Japanese); =h n r y t (Arabic). This means potentially much more ambiguity: with Japanese we have to figure out which vowels shouldn't be there (deletion), but with Arabic we have to figure out which vowels should be there (addition).</Paragraph> <Paragraph position="4"> [2] The English phonemic representation uses the phoneme set from the online Carnegie Mellon University Pronouncing Dictionary, a machine-readable pronunciation dictionary for North American English (http://www.speech.cs.cmu.edu/cgi-bin/cmudict).</Paragraph> <Paragraph position="5"> For cases where Arabic has two potential mappings for one English consonant, the ambiguity does not matter. Resolving that ambiguity is a bonus when going in the backwards direction: English T, for example, can be safely posited for Arabic t or T without losing any information.</Paragraph> </Section> <Section position="5" start_page="35" end_page="36" type="metho"> <SectionTitle> 4 New Model for Arabic </SectionTitle> <Paragraph position="0"> Fortunately, the first two models of (Knight and Graehl, 1997) deal with English only, so we can re-use them directly for Arabic/English transliteration.</Paragraph> <Paragraph position="1"> These are P(w), the probability of a particular English word sequence, and P(e|w), the probability of an English sound sequence given a word sequence.</Paragraph> <Paragraph position="2"> For example, P(Peter) may be 0.00035 and P(P IY T ER | Peter) may be 1.0 (if Peter has only one pronunciation). To follow the Japanese system, we would next propose a new model P(q|e) for generating Arabic phoneme sequences from English ones, and another model P(a|q) for Arabic orthography. 
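Read as a noisy-channel cascade, the Japanese-style decomposition just described would score a candidate English word sequence w against an observed Arabic string a roughly as shown here. This is only an illustrative sketch: the text names the four models but does not write out a combined formula, so the arg max and the sums over the hidden sequences e (English sounds) and q (Arabic sounds) are assumptions, and a practical system might keep only the best path instead of summing.

```latex
\hat{w} \;=\; \operatorname*{arg\,max}_{w}\; P(w) \sum_{e} P(e \mid w) \sum_{q} P(q \mid e)\, P(a \mid q)
```

Under this reading, every factor involving q depends on Arabic pronunciations that are never observed in written text.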
We would then attempt to find data resources for estimating these probabilities. This is hard, because true Arabic pronunciations are hidden and no databases are available for directly estimating probabilities involving them.</Paragraph> <Paragraph position="3"> Instead, we will build only one new model, P(a|e), which converts English phoneme sequences directly into Arabic writing. We might expect the model to include probabilities that look like:</Paragraph> <Paragraph position="5"> The next problem is to estimate these numbers empirically from data. We did not have a large bilingual dictionary of names and terms for Arabic/English, so we built a small 150-word dictionary by hand. We looked up English word pronunciations in a phonetic dictionary, generating the English-phoneme-to-Arabic-writing training data shown in Figure 1.</Paragraph> <Paragraph position="6"> We applied the EM learning algorithm described in (Knight and Graehl, 1997) to this data, with one variation. They required that each English sound produce at least one Japanese sound. This worked because Japanese sound sequences are always longer than English ones, due to extra Japanese vowels. Arabic letter sequences, on the other hand, may be shorter than their English counterparts, so we allow each English sound the option of producing no Arabic letters at all. This puts an extra computational strain on the learning algorithm, but is otherwise not difficult. Initial results were satisfactory. The program learned to map English sounds onto Arabic letter sequences, e.g., Nicholas onto nykwl!s and Williams onto wlymz. We applied our three probabilistic models to previously unseen Arabic strings and obtained the top n English back-transliterations for each. Figure 1 (training data): ((AE N T OW N IY OW) (! ' n T w n y w)) ((AE N T AH N IY) (! ' n T w n y)) ((AA N W AA R) (! ' n w r)) ((AA R M IH T IH JH) (! ' r m y t ! j)) ((AA R N AA L D OW) (! r n l d w)) ((AE T K IH N Z) (! ' t k y n z)) ((K AO L V IY N OW) (k ! l f y n w)) ((K AE M ER AH N) (k ! m r ! n)) ((K AH M IY L) (k m y l)) ((K AA R L AH) (k ! r l !)) ((K AE R AH L) (k ! r w l)) ((K EH R AH L AY N) (k ! r w l y n)) ((K EH R AH L IH N) (k ! r w l y n)) ((K AA R V ER) (k ! r f r)) ((K AE S AH L) (k ! s l)) ((K R IH S) (k r y s)) ((K R IH S CH AH N) (k r y s t s h n)) ((K R IH S T AH F ER) (k r y s t w f r)) ((K L AO D) (k l w d)) ((K L AY D) (k l ! y d)) ((K AA K R AH N) (k w k r ! n)) ((K UH K) (k w k)) ((K AO R IH G AH N) (k w r y G ! n)) ((EH B ER HH AA R T) (! + y b r =h ! r d)) ((EH D M AH N D) (! + d m w n)) ((EH D W ER D) (! ' d w ! r d)) ((AH L AY AH S) (! + l y ! s)) ((IH L IH Z AH B AH TH) (! + l y z ! b y t h))</Paragraph> </Section> <Section position="6" start_page="36" end_page="37" type="metho"> <SectionTitle> 5 Problems Specific to Arabic </SectionTitle> <Paragraph position="0"> One problem was the production of many wrong English phrases, all containing the sound D. For example, the Arabic sequence frym!n yielded two possible English sources, Freeman and Friedman. The latter is incorrect. The problem proved to be that, like several vowels, an English D sound sometimes produces no Arabic letters. This happens in cases like Edward (!'dw!r) and Raymond (rymwn). Inspection showed that D should only be dropped in word-final position, however, and not in the middle of a word like Friedman.</Paragraph> <Paragraph position="1"> This brings into question the entire shape of our P(a|e) model, which is based on a substitution of Arabic letters for an English sound, independent of that sound's context. Fortunately, we could incorporate an only-drop-final-D constraint by extending the model's transducer format.</Paragraph> <Paragraph position="2"> The old transducer looked like this: [diagram not preserved], while the new transducer looks like this: [diagram not preserved].</Paragraph> <Paragraph position="5"> After dropping a D, the new transducer finds itself in a final state with no further transitions. It can consume no further English sound input, so it has, by definition, come to the end of the word. We noticed a similar effect with English vowels at the end of words. For example, the system suggested both Manuel and Manuela as possible sources for m!nwyl. Manuela is incorrect; we eliminated this frequent error with a technique like the one described above.</Paragraph> <Paragraph position="6"> A third problem also concerned English vowels.</Paragraph> <Paragraph position="7"> For Arabic !'wkt!fyw, the system produced both Octavio and Octavia as potential sources, though the latter is wrong. While it is possible for the English vowel AH (final in Octavia) to produce Arabic w in some contexts (e.g., rwjr/Roger), it cannot do so at the end of a word. EH and AA have the same property. Furthermore, none of those three vowels can produce the letter y when in word-final position. Other vowels like IY may of course do so.</Paragraph> <Paragraph position="8"> We pursued a general solution, replacing each instance of an English vowel in our training data with one of three symbols, depending on its position in the word. For example, an AH in word-initial position was replaced by AH-S; word-final AH was replaced by AH-F; word-medial AH was left as AH. This increases our vowel sound inventory by a factor of three, and even though AH might be pronounced the same in any position, the three distinct AH- symbols can acquire different mappings to Arabic. In the case of AH, learning revealed:</Paragraph> <Paragraph position="10"> We can see that word-final AH can never be dropped. 
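The position-tagging step described above can be sketched as a small preprocessing function over a phoneme list. This is a minimal illustration, not the authors' implementation: the function name `tag_vowel_positions` and the `VOWELS` set (the usual CMU dictionary vowel inventory) are assumptions, as is the choice to tag a one-phoneme word as word-initial.

```python
# Hypothetical helper, for illustration only.
# VOWELS is assumed to be the CMU Pronouncing Dictionary vowel inventory.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}


def tag_vowel_positions(phonemes):
    """Return a copy of `phonemes` with vowels marked by word position.

    Word-initial vowels get the suffix "-S", word-final vowels get "-F",
    and word-medial vowels (and all consonants) are left unchanged.
    A single-phoneme word is treated as word-initial (an assumption).
    """
    last = len(phonemes) - 1
    tagged = []
    for i, p in enumerate(phonemes):
        if p not in VOWELS:
            tagged.append(p)          # consonants are never tagged
        elif i == 0:
            tagged.append(p + "-S")   # word-initial vowel
        elif i == last:
            tagged.append(p + "-F")   # word-final vowel
        else:
            tagged.append(p)          # word-medial vowel stays as-is
    return tagged


if __name__ == "__main__":
    print(tag_vowel_positions("AE N T AH N IY".split()))
    print(tag_vowel_positions("K AA R L AH".split()))
```

Because the tags become part of the symbol, an unchanged learner can estimate separate Arabic mappings for AH-S, AH, and AH-F without any other modification.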
We can also see that word-initial AH can be dropped; this goes beyond the constraints we originally envisioned. Figure 2 shows the complete table of sound-letter mappings.</Paragraph> <Paragraph position="11"> We introduced just enough context in our sound mappings to achieve reasonable results. We could, of course, introduce left and right context for every sound symbol, but this would fragment our data; it is difficult to learn general rules from a handful of examples. Linguistic guidance helped us overcome these problems.</Paragraph> </Section> <Section position="7" start_page="37" end_page="39" type="metho"> <SectionTitle> 6 Example </SectionTitle> <Paragraph position="0"> Here we show the internal workings of the system through an example. Suppose we observe the Arabic string br!nstn. First, we pass it through the P(a|e) model from Figure 2, producing the network of possible English sound sequences shown in Figure 3. Each sequence e_i could produce (or &quot;explain&quot;) br!nstn and is scored with P(br!nstn | e_i).</Paragraph> <Paragraph position="2"> [Figure 2 caption, truncated: ... probabilistic mappings to Arabic sound sequences, as learned by estimation-maximization.]</Paragraph> <Paragraph position="3"> Next, we pass this network through the P(e|w) model to produce a new network of English phrases. Finally, we re-score this network with the P(w) model. This marks a preference for common English words/names over uncommon ones. Here are the top n sequences at each stage:</Paragraph> </Section> <Section position="8" start_page="39" end_page="39" type="metho"> <SectionTitle> 7 Results and Discussion </SectionTitle> <Paragraph position="0"> We supplied a list of 2800 test names in Arabic to our program and received translations for 900 of them. Those not translated were frequently not foreign names at all, so the program is right to fail in many such cases. 
Sample results are shown in Figure 4.</Paragraph> <Paragraph position="1"> The program offers many good translations but still makes errors of omission and commission. Some of these errors show evidence of lexical or orthographic influence or of interference from other languages (such as French).</Paragraph> <Paragraph position="2"> English G is incorrectly produced from its voiceless counterpart in Arabic, k. For example, krys comes out correctly as Chris and Kris but also, incorrectly, as Grace. The source of the G-k correspondence in the training data is the English name Alexander (AE L AH G Z AE N D ER), which is !lksndr in our training corpus. A voiced fricative G is available in Arabic, which in other contexts corresponds to the English voiced stop G, although it, too, is only an approximation. It appears that orthographic English X is perceived to correspond to Arabic ks, perhaps due partly to French influence. Another possible influence is the existing Arabic name !skndr (which has k), from the same Greek source as the English name Alexander, but with metathesis of k and s.</Paragraph> <Paragraph position="3"> Sometimes an Arabic version of a foreign name is not strictly a transliteration, but rather a translation or lexicalized borrowing that has been adapted to Arabic phonological patterns. For example, the name Edward is found in our data as !'dw!rd, !'dw!r, and !+dw!r. The last version, an Arabicization of the original Anglo-Saxon name, is pronounced Idwar. The approach taken here is flexible enough to find such matches.</Paragraph> <Paragraph position="4"> Allowing the English sound D to have a zero match word-finally (also a possible French influence) proves to be too strong an assumption, leading to matches such as !'lfr: Oliver, Alfred. 
A revised rule would allow the D to drop word-finally only when immediately preceded by another consonant (which consonant could further be limited to a sonorant).</Paragraph> <Paragraph position="5"> Another anomaly that is a source of error is the mapping of English CH to Arabic x, which carries the same weight (0.5) as the mapping of CH to Arabic tsh (0.5). This derives from the name Koch, which in Arabic is kwx, as in the German pronunciation of the name, as opposed to the English pronunciation. This kind of language interference can be minimized by enlarging the training data set.</Paragraph> </Section> <Section position="9" start_page="39" end_page="39" type="metho"> <SectionTitle> Figure 4 </SectionTitle> <Paragraph position="0"> [Figure 4, English outputs only; the Arabic inputs are not preserved: ABBEY ABBY ABBIE ADAMS ADDAMS EDRIS EDWARD EDOUARD EDUARD EDWARD EDOUARD EDUARD AVERA ALAN ALLEN ALLAN ALBERT ALPERT ELBERT ALBERTI ALBERTY ALPER ALVARO ALFARO ALVERO OLIVER OLIVER ALFRED ALEXANDER ALEXANDER ALEXANDRE ALAN ALLEN ALLAN ELLIS ALICE LAS ALLISON ALISON ELLISON AMOS AMOSS AMICO AMERCO EMIL EMILE EMAIL AMMAN AMIN AMMEEN AMER AMIR AMOR ANA ANNA ANA INIGUEZ ANTOINE ANTOINE ANTOINETTE ANTOINETTE ANTON ANTOON ANTOINE ANTONY ANTONI ANTONE ANTONIA ANTONIO ANTONIU ANDREW ANDREU ANDREA ANDREA ANDRIA. Caption, truncated: ... translations of names written in Arabic.]</Paragraph> <Paragraph position="1"> English orthography appears to have skewed the training data in the English name Frederick, pronounced F R EH D R IH K. In Arabic we have frdyrk as well as frydryk, frdyryk and frdryk for this name. The English spelling has three vowels, but the English phonemic representation lacks the middle vowel. But some Arabic versions have a (long) &quot;vowel&quot; (y) in the middle vowel position, leading in the training data to the incorrect mapping of English R to Arabic y. 
This results in incorrect translations, such as Racine for ysyn.</Paragraph> <Paragraph position="2"> As might be expected when the sound system of one language is being re-interpreted in another, Arabic transliteration is not always attuned to the subtleties of English phonotactic variation, especially when the variants are not reflected in English orthography. An example is the assimilation in voicing of English fricatives to an immediately preceding stop consonant. In James, pronounced JH EY M Z, the final consonant is a voiced Z although it is spelled with the voiceless variant, s. In this case, Arabic follows the English orthography rather than pronunciation, transliterating it jyms. Similarly, Horowitz is pronounced HH AO R OW IH T S in English, with a final devoiced S rather than the voiced variant z present in the spelling, whereas the Arabic transliteration follows the English spelling, =hwrwwytz. The present version of the program applies these variant correspondences indiscriminately, such that s!ymwn is translated as Simon or Zyman. Separating out these correspondences according to their positions in the word, as was done with the vowels, would help to rectify this, by reducing the probability of an S-z correspondence in less likely positions (e.g., initial position).</Paragraph> <Paragraph position="3"> Some Arabic transliterations carry the imprint of English spelling even when it departs still farther from the pronunciation. For example, Gr!=h!m is an Arabic transliteration for the English name Graham, pronounced G R AE M (an alternative is the Arabic Gr!m). These mappings were not found by the program (even though they might be readily evident to a human). 
This kind of spelling-transliteration lies outside of the phonemic correspondences to Arabic orthography that the program has learned.</Paragraph> <Paragraph position="4"> Vowels are still a problem, even when they are distinguished by their position in the word. In the test cases given in the Introduction (answers are Mike McCurry, OPEC, and Internet Explorer), the quality of the Arabic vowels, when present, matches the English vowels fairly well. However, as can be seen from names like Frederick, the decision as to whether or not to insert a vowel is arbitrary and somewhat dependent on English orthography, which influences the quality as well as the position of the Arabic vowel.</Paragraph> <Paragraph position="5"> Medial English AH, for example, is normally ! (alif) but can also be found in Arabic as w or y (e.g., English Jeremy, pronounced JH EH R AH M IY, is written in Arabic as jyrymy). This results in incorrect translations, such as Amman for Arabic !'myn.</Paragraph> <Paragraph position="6"> In this initial model, English vowel stress was not represented. Because long vowels in Arabic are usually stressed, one might expect that English stressed vowels would be equated with Arabic long vowels for purposes of transliteration. However, our data suggest that English stress does not have a strong correlation with Arabic long vowel placement in transliterated names. For example, Kevin mapped to kyfyn and kfyn, but not kyfn. If stress were a factor here and were interpreted as a long vowel, kyfn would be predicted as a preferred transliteration based on the phonemic representation of Kevin as K EH1 V IH N (where &quot;1&quot; indicates primary stress). Similarly, fyktwr and fktwr were found for Victor but not the expected fyktr. kynth is found, but so are kynyth and knyth. In syllable-final position at least, it appears that stress does not outweigh other factors in Arabic vowel placement. 
However, the relation of English stress to Arabic vowel placement in other positions might be used to rule out unlikely translations (such as Camille with final stress for Arabic k!ml) and deserves further study.</Paragraph> <Paragraph position="7"> All of these observations point to places where the system can be improved. A larger training data set, more selective contextual mappings, and refinement of linguistic rules are all potential ways to capture these improvements within the finite-state framework, and we hope to study them in the near future.</Paragraph> </Section> </Paper>