<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1508"> <Title>Transliteration of Proper Names in Cross-Lingual Information Retrieval</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Translation System Description </SectionTitle> <Paragraph position="0"> We break down the transliteration process into the following steps, as depicted in Figure 1.</Paragraph> <Paragraph position="1"> 1. Conversion of an English name into a phonemic representation using the Festival speech synthesis system.</Paragraph> <Paragraph position="2"> 2. Translation of the English phoneme sequence into a sequence of generalized initials and finals, or GIFs -- commonly used sub-syllabic units for expressing pronunciations of Chinese characters.</Paragraph> <Paragraph position="3"> 3. Transformation of the GIF sequence into a sequence of pin-yin symbols without tone.</Paragraph> <Paragraph position="4"> 4. Translation of the pin-yin sequence into a character sequence.</Paragraph> <Paragraph position="5"> Steps 1 and 3 are deterministic transformations, while Steps 2 and 4 are accomplished using statistical means.</Paragraph> <Paragraph position="6"> The IBM source-channel model for statistical machine translation (P. Brown et al., 1993) plays a central role in our system. We therefore describe it very briefly here for completeness. In this model, an $m$-word foreign language sentence $f = f_1 f_2 \ldots f_m$ is modeled as the output of a &quot;noisy channel&quot; whose input is its correct $l$-word English translation $e = e_1 e_2 \ldots e_l$, and having observed the channel output $f$, one seeks a posteriori the most likely English sentence $\hat{e} = \arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)$. It is assumed that a paired corpus of foreign-language sentences and their English translations, and the language model, are available both for training models as well as for decoding -- the task of determining the most likely translation $\hat{e}$.</Paragraph> <Paragraph position="7"> Since we seek Chinese names which are transliterations of a given English name, the notion of words in a sentence in the IBM model above is replaced with phonemes in a word. The roles of English and Chinese are also reversed. Therefore, $f = f_1 f_2 \ldots f_m$ represents a sequence of English phonemes, and $e = e_1 e_2 \ldots e_l$, for instance, a sequence of GIF symbols in Step 2 described above. The overall architecture of the proposed transliteration system is illustrated in Figure 2.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Translation Model Training </SectionTitle> <Paragraph position="0"> We have available from Meng et al. (2000) a small list of about 3875 English names and their Chinese transliterations. A pin-yin rendering of the Chinese transliteration is also provided. We use the Festival text-to-speech system to obtain a phonemic pronunciation of each English name. We also replace all pin-yin symbols by their pronunciations, which are described using an inventory of generalized initials and finals. The pronunciation table for this purpose is obtained from an elementary Mandarin text-book. The vocabulary of the English side of this corpus is 43 phonemes, and that of the Chinese side is 58 GIF symbols (21 initials and 37 finals). Note, however, that only 409 of the $21 \times 37$ possible initial-final combinations constitute legal pin-yin symbols.</Paragraph>
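<Paragraph> The mapping from toneless pin-yin syllables to generalized initials and finals is a deterministic, table-driven step. The following minimal Python sketch illustrates the idea with a longest-prefix match over the standard 21 Mandarin initials; the initial inventory and the treatment of zero-initial syllables shown here are illustrative assumptions, not the pronunciation table actually used.

# Minimal sketch: split a toneless pin-yin syllable into an initial and a
# final by longest-prefix match. Spelling adjustments (y/w forms, u-umlaut)
# are ignored, so this is illustrative rather than the exact table.
INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s"],
    key=len, reverse=True)  # try two-letter initials (zh, ch, sh) first

def split_pinyin(syllable):
    """Return (initial, final); the initial is empty for zero-initial syllables."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # e.g. "an" and "er" have no initial

for s in ["zhang", "lin", "an", "shi"]:
    print(s, split_pinyin(s))
</Paragraph>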
<Paragraph position="1"> A second corpus of 3875 &quot;sentence&quot; pairs is derived corresponding to the fourth and fifth lines of Figure 1, this time to train a statistical model to translate pin-yin sequences to Chinese characters.</Paragraph> <Paragraph position="2"> The vocabulary of the pin-yin side of this corpus is 282 and that of the character side is about 680.</Paragraph> <Paragraph position="3"> These, of course, are much smaller than the inventory of Chinese pin-yin- and character-sets. We note that certain characters are preferentially used in transliteration over others, and the resulting frequency of character-usage is not the same as in unrestricted Chinese text. However, there is no distinct set of characters used exclusively for transliteration.</Paragraph> <Paragraph position="4"> For purposes of comparison with the transliteration accuracy reported by Meng et al. (2001), we divide this list into 2233 training name-pairs and 1541 test name-pairs. For subsequent CLIR experiments, we create a larger training set of 3625 name-pairs, leaving only 250 name-pairs for intrinsic testing of transliteration performance. The actual training of all translation models proceeds according to a standard recipe recommended in GIZA++, namely 5 iterations of Model 1, followed by 5 of Model 2, 10 HMM iterations and 10 iterations of Model 4.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Language Model Training </SectionTitle> <Paragraph position="0"> The GIF language model required for translating English phoneme sequences to GIF sequences is estimated from the training portion of the 3875 Chinese names. A trigram language model on the GIF vocabulary is estimated with the CMU toolkit, using Good-Turing smoothing and Katz back-off. Note that, due to the smoothing, this language model does not necessarily assign zero probability to an illegal GIF sequence, e.g., one containing two consecutive initials. This causes the first translation system to sometimes, though very rarely, produce GIF sequences which do not correspond to any pin-yin sequence. We make an ad hoc correction of such sequences when mapping a GIF sequence to pin-yin, which is otherwise trivial for all legal sequences of initials and finals. Specifically, a final e or i or a is tried, in that order, between consecutive initials until a legitimate sequence of pin-yin symbols obtains, as sketched below.</Paragraph>
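<Paragraph> A minimal Python sketch of this repair step. The set of initials and the set of legal initial-final combinations (legal_pinyin) are hypothetical stand-ins for the tables described above; only the insertion policy -- trying e, then i, then a between adjacent initials -- follows the text.

# Minimal sketch of the ad hoc GIF-sequence repair: whenever two initials are
# adjacent (an illegal configuration), try inserting the final "e", then "i",
# then "a" so that the first initial combines into a legal pin-yin syllable.
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s"}

def repair_gif_sequence(gifs, legal_pinyin):
    """gifs: list of GIF symbols; legal_pinyin: set of legal initial+final strings."""
    repaired = []
    for sym, nxt in zip(gifs, list(gifs[1:]) + [None]):
        repaired.append(sym)
        if sym in INITIALS and nxt in INITIALS:
            for final in ("e", "i", "a"):
                if sym + final in legal_pinyin:
                    repaired.append(final)
                    break
    return repaired
</Paragraph>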
<Paragraph position="1"> The language model required for translating pin-yin sequences to Chinese characters is relatively straightforward. A character trigram model with Good-Turing discounting and Katz back-off is estimated from the list of transliterated names.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Decoding Issues </SectionTitle> <Paragraph position="0"> We use the ReWrite decoder provided by ISI, along with the two translation models and their corresponding language models trained, either on 2233 or 3625 name-pairs, as described above, to perform transliteration of English names in the respective test sets of 1541 or 250 name-pairs.</Paragraph> <Paragraph position="1"> 1. An English name is first converted to a phoneme sequence via Festival.</Paragraph> <Paragraph position="2"> 2. The phoneme sequence is translated into a GIF sequence using the first translation model described above.</Paragraph> <Paragraph position="3"> 3. The translation output is corrected if necessary to create a legitimate pin-yin sequence.</Paragraph> <Paragraph position="4"> 4. The pin-yin sequence is translated into a sequence of Chinese characters using a second translation model, also described above.</Paragraph> <Paragraph position="5"> A small but important manual setting in the ReWrite decoder is the list of zero-fertility words. In the IBM model described earlier, these are the words $e_i$ which may be &quot;deleted&quot; by the noisy channel when transforming $e$ into $f$. For the decoder, these are therefore the words which may be optionally inserted in $e$ even when there is no word in $f$ of which they are considered a direct translation. For the usual case of Chinese to English translation, these would usually be articles and other function words which may not be prevalent in the foreign language but are frequent in English.</Paragraph> <Paragraph position="6"> For the phoneme-to-GIF translation model, the &quot;words&quot; which need to be inserted in this manner are syllabic nuclei! This is because Mandarin does not permit complex consonant clusters of the kind that is quite prevalent in English. This linguistic knowledge, however, need not be imparted by hand in the IBM model. One can, indeed, derive such a list from the trained models by simply reading off the list of symbols which have zero fertility with high probability. This list, in our case, is {-i, e, u, o, r, ü, ou, c, iu, ie}.</Paragraph> <Paragraph position="7"> The second translation system, for converting pin-yin sequences to character sequences, has a one-to-one mapping between symbols and therefore has no words with zero fertility.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Intrinsic Evaluation of Transliteration </SectionTitle> <Paragraph position="0"> We evaluate the efficacy of our transliteration at two levels. For comparison with the very comparable set-up of Meng et al. (2001), we measure the accuracy of the pin-yin output produced by our system after Step 3 in Section 2.3. The results are shown in Table 1, where the pin-yin error rate is the edit distance between the &quot;correct&quot; pin-yin representation of the correct transliteration and the pin-yin sequence output by the system.</Paragraph> <Paragraph position="1"> Note that the pin-yin error performance of our fully statistical method is quite competitive with previous results. We further note that increasing the training data results in a further reduction of the syllable error rate. We concede that this performance, while comparable to other systems, is not satisfactory and merits further investigation.</Paragraph> <Paragraph position="2"> We also evaluate the efficacy of our second translation system, which maps the pin-yin sequence produced by the previous stages to a sequence of Chinese characters, and obtain a character error rate of 12.6%. Thus every correctly recognized pin-yin symbol still has some chance of being mapped to a wrong character, resulting in a character error rate higher than the pin-yin error rate. Note that while significantly lower error rates have been reported for converting pin-yin to characters in generic Chinese text, ours is a highly specialized subset of transliterated foreign names, where the choice between several characters sharing the same pin-yin symbol is somewhat arbitrary.</Paragraph>
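<Paragraph> As a rough illustration of how the pin-yin (syllable) error rate of Table 1 can be computed, the Python sketch below uses the Levenshtein distance between reference and hypothesized syllable sequences, normalized by the total number of reference syllables; the exact normalization used in the paper is assumed, and the example syllable sequences are made up.

# Minimal sketch of a pin-yin error rate: total edit distance between the
# reference and hypothesized syllable sequences, divided by the number of
# reference syllables in the test set.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of pin-yin syllables."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            sub = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(sub, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]

def pinyin_error_rate(pairs):
    """pairs: iterable of (reference_syllables, hypothesis_syllables)."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total

# One substituted syllable out of six reference syllables: error rate 1/6.
print(pinyin_error_rate([(["ke", "lin", "dun"], ["ke", "lin", "du"]),
                         (["shi", "mi", "si"], ["shi", "mi", "si"])]))
</Paragraph>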
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Spoken Document Retrieval System </SectionTitle> <Paragraph position="0"> Several multi-lingual speech and text applications require some form of name transliteration, cross-lingual spoken document retrieval being a prototypical example. We build upon the experimental infrastructure developed at the 2000 Johns Hopkins Summer Workshop (Meng et al., 2000), where considerable work was done towards indexing and retrieving Mandarin audio to match English text queries. Specifically, we find that in a large number of queries used in those experiments, English proper names are not available in the translation lexicon, and are subsequently ignored during retrieval. We use the technique described above to transliterate all such names into Chinese characters and observe the effect on retrieval performance.</Paragraph> <Paragraph position="1"> The TDT-2 corpus, which we use for our experiments, contains 2265 audio clips of Mandarin news stories, along with several thousand contemporaneously published Chinese text articles, and English text and audio broadcasts. The articles tend to be several hundred to a few thousand words long, while the audio clips tend to be two minutes or less on average. The purpose of the corpus is to facilitate research in topic detection and tracking, and exhaustive relevance judgments are provided for several topics; i.e., for each of at least 17 topics, every English and Chinese article and news clip has been examined by a human assessor and determined to be either on- or off-topic.</Paragraph> <Paragraph position="2"> We randomly select an English article on each of the 17 topics as a query, and wish to retrieve all the Mandarin audio clips on the same topic without retrieving any that are off-topic. To mitigate the variability due to query selection, we choose up to 12 different English articles for each of the 17 topics and average retrieval performance over this selection before reporting any results. We use the query term-selection and translation technique described by Meng et al. (2000) to convert the English document to Chinese, the only augmentation being the transliterated names -- there are roughly 2000 tokens in the queries which are not translatable, and almost all of them are proper names. We report IR performance with and without the name transliteration.</Paragraph> <Paragraph position="3"> We use a different information retrieval system from the one used in the 2000 Workshop (Meng et al., 2000) to perform the retrieval task. A brief description of the system is therefore in order.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The HAIRCUT System </SectionTitle> <Paragraph position="0"> The Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) is a research retrieval system developed at the Johns Hopkins University Applied Physics Laboratory. The system was developed to investigate knowledge-light methods for linguistic processing in text retrieval. HAIRCUT uses a statistical language model of retrieval such as the one explored by Hiemstra (2001). The model ranks documents according to the probability that the terms in a query are generated by a document. Various smoothing methods have been proposed to combine the contribution of each term based on the document model and also a generic model of the language. Many have found that a simple mixture model, using document term frequencies for the former and occurrence statistics from a large corpus for the latter, works quite well.</Paragraph>
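<Paragraph> A minimal sketch of this kind of query-likelihood scoring with a fixed mixture weight is given below; the weight, the add-one floor on the background estimate, and the representation of terms are illustrative assumptions, not HAIRCUT's actual configuration.

# Minimal sketch of query-likelihood retrieval with mixture smoothing:
# score(Q, D) = sum over query terms t of
#     log( LAMBDA * P(t | D) + (1 - LAMBDA) * P(t | corpus) ).
import math
from collections import Counter

LAMBDA = 0.5  # illustrative mixture weight

def score(query_terms, doc_terms, corpus_counts, corpus_size):
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    total = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        # Add-one floor so unseen query terms do not zero out the score.
        p_bg = (corpus_counts[t] + 1) / (corpus_size + len(corpus_counts))
        total += math.log(LAMBDA * p_doc + (1 - LAMBDA) * p_bg)
    return total

def rank(query_terms, docs, corpus_counts, corpus_size):
    """docs: dict mapping document id to its list of terms; returns ids, best first."""
    return sorted(docs, reverse=True,
                  key=lambda d: score(query_terms, docs[d], corpus_counts, corpus_size))
</Paragraph>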
<Paragraph position="1"> McNamee and Mayfield (2001) have shown using HAIRCUT that overlapping character n-grams are effective for retrieval in non-Asian languages (e.g., using n=6), and that translingual retrieval between closely related languages is quite feasible even without translation resources of any kind (McNamee and Mayfield, 2002).</Paragraph> <Paragraph position="2"> For the task of retrieving Mandarin audio from Chinese text queries on the TDT-2 task, the system described by Meng et al. (2000) achieved a mean average precision of 0.733 using character bigrams for indexing. On identical queries, HAIRCUT achieved 0.762 using character bigrams. This figure forms the monolingual baseline for our CLIR system.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Cross-Lingual Retrieval Performance </SectionTitle> <Paragraph position="0"> We first indexed the automatic transcription of the TDT-2 Mandarin audio collection using character bigrams, as done by Meng et al. (2000). We performed CLIR using the Chinese translations of the English queries, with and without transliteration of proper names, and compared the standard 11-step mean average precision (mAP) on the TDT-2 audio corpus. Our results and the corresponding results from Meng et al. (2001) are reported in Table 2.</Paragraph> <Paragraph position="1"> Without name transliteration, the performance of the two CLIR systems is nearly identical: a paired t-test shows that the difference between the mAPs of 0.514 and 0.501 is significant only at a p-value of 0.74.</Paragraph> <Paragraph position="2"> A small improvement in mAP is obtained by the HAIRCUT system with name transliteration over the system without name transliteration: the improvement from 0.501 to 0.515 is statistically significant at a p-value of 0.084. The statistical significance of the improvement from 0.514 to 0.522 by Meng et al. (2001) is not known to us. In any event, a need for improvement in transliteration is suggested by this result.</Paragraph> <Paragraph position="3"> We recently received a large list of nearly 2M Chinese-English named-entity pairs from the LDC.</Paragraph> <Paragraph position="4"> As a pilot experiment, we simply added this list to the translation lexicon of the CLIR system, i.e., we &quot;translated&quot; those names in our English queries which happened to be available in this LDC list.</Paragraph> <Paragraph position="5"> This happens to cover more than 85% of the previously untranslatable names in our queries. For the remaining names, we continued to use our automatic transliterator. To our surprise, the mAP improvement from 0.501 to 0.506 was statistically insignificant (p-value of 0.421), and the reason why using the ostensibly correct transliterations most of the time still does not result in any significant gain in CLIR performance continues to elude us.</Paragraph>
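<Paragraph> The significance figures quoted above come from paired t-tests over per-query retrieval scores. A minimal sketch using scipy follows; the per-query average precision values are made up for illustration, and only the paired test itself reflects the methodology described here.

# Paired t-test over per-query average precision for two retrieval conditions.
from scipy import stats

ap_without = [0.42, 0.55, 0.61, 0.37, 0.58]  # hypothetical per-query AP, baseline
ap_with = [0.45, 0.57, 0.60, 0.41, 0.62]     # same queries, with transliteration

t_stat, p_value = stats.ttest_rel(ap_with, ap_without)
print(f"mAP {sum(ap_with) / len(ap_with):.3f} vs {sum(ap_without) / len(ap_without):.3f}, "
      f"p = {p_value:.3f}")
</Paragraph>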
<Paragraph position="6"> We conjecture that the fact that the audio has been processed by an automatic speech recognition system, which in all likelihood did not have many of the proper names in question in its vocabulary, may be the cause of this dismal performance. It is plausible, though we cannot find a stronger justification for it, that by using the 10-best transliterations produced by our automatic system, we are adding robustness against ASR errors in the retrieval of proper names.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A Large Chinese-English Translation </SectionTitle> <Paragraph position="0"> The LDC Chinese-English named entity list was compiled from Xinhua News sources, and consists of nine pairs of lists, one each to cover person-names, place-names, organizations, etc. While there are indeed nearly 2 million name-pairs in this list, a large number of formatting, character encoding and other errors exist in this beta release, making it difficult to use the corpus as is in our statistical MT system. From this resource we have tried using the two lists corresponding to person-names and place-names, attempting to augment the training data for our system described previously in Section 2.1. However, we further screened these lists in order to eliminate possible errors.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Extracting Named Entity Transliteration Pairs for Translation Model Training </SectionTitle> <Paragraph position="0"> There are nearly 1 million pairs of person or place-names in the LDC corpus. In order to obtain a clean corpus of Named Entity transliterations we performed the following steps: 1. We converted all name-pairs into a parallel corpus of English phonemes on one side and Chinese GIFs on the other by the procedure described earlier.</Paragraph> <Paragraph position="1"> 2. We trained a statistical MT system for translating from English phonemes to Chinese GIFs on this corpus.</Paragraph> <Paragraph position="2"> 3. We then aligned all the (nearly 1M) training &quot;sentence&quot; pairs with this translation model, and extracted roughly a third of the sentences with an alignment score above a certain tunable threshold. This resulted in the extraction of 346,860 name-pairs.</Paragraph> <Paragraph position="3"> We set aside 3122 of the selected name-pairs as a test set, trained a phoneme-to-GIF translation system on the remaining, presumably &quot;good,&quot; training set and evaluated the pin-yin error rate of the transliteration. The result of this evaluation is reported in Table 3 against the line &quot;Huge MT (Self),&quot; where we also report the transliteration performance of the so-called Big MT system of Table 1 on this new test set. We note, again with some dismay, that the additional training data did not result in a significant improvement in transliteration performance.</Paragraph> <Paragraph position="4"> Table 3: transliteration performance with varying amounts of training data and different data selection procedures.</Paragraph>
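<Paragraph> A minimal sketch of the score-based selection used in Step 3: keep only those phoneme/GIF &quot;sentence&quot; pairs whose alignment score under a trained translation model clears a tunable threshold. Here alignment_score is a hypothetical stand-in for the alignment (log-)probability assigned by the model, and the threshold value is purely illustrative.

# Keep roughly the best-aligned pairs; the threshold is tuned on held-out data.
THRESHOLD = -25.0  # illustrative value, chosen so that about a third survive

def select_clean_pairs(pairs, alignment_score, threshold=THRESHOLD):
    """pairs: iterable of (english_phonemes, chinese_gifs) training pairs."""
    kept = []
    for eng, chi in pairs:
        if alignment_score(eng, chi) >= threshold:
            kept.append((eng, chi))
    return kept
</Paragraph>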
<Paragraph position="5"> We continue to believe that careful data-selection is the key to successful use of this beta-release of the LDC Named Entity corpus. We therefore went back to Step 3 of the procedure outlined above, where we had used alignment scores from an MT system to select &quot;good&quot; sentence-pairs from our training data, and instead of using the MT system trained in Step 2 immediately preceding it, we used the previously built Big MT system of Section 2.1, which we know is trained on a small but clean data-set of 3625 name-pairs. With a similar threshold as above, we again selected roughly 300K name-pairs, being careful to leave out any pair which appears in the 3122-pair test set described above, and re-estimated the entire phoneme-to-GIF translation system on this new corpus. We evaluated this system on the 3122-pair test set for transliteration performance, and the results are included in Table 3.</Paragraph> <Paragraph position="6"> Note that significant improvements in transliteration performance result from this alternate method of data selection.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Cross-Lingual Retrieval Performance -- II </SectionTitle> <Paragraph position="0"> We reran the CLIR experiments on the TDT-2 corpus using the somewhat improved entity transliterator described above, with the same query and document collection specifications as the experiments reported in Table 2. The results of this second experiment are reported in Table 4, where the performance of the Big MT transliterator is reproduced for comparison.</Paragraph> <Paragraph position="1"> Table 4: CLIR performance with and without name transliteration.</Paragraph> <Paragraph position="2"> Note that the gain in CLIR performance is again only somewhat significant, with the improvement in mAP from 0.501 to 0.517 being significant only at a p-value of 0.080.</Paragraph> </Section> </Section> </Paper>