File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/h05-1061_metho.xml
Size: 14,383 bytes
Last Modified: 2025-10-06 14:09:35
<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1061"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 483-490, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Mining Key Phrase Translations from Web Corpora</Title> <Section position="4" start_page="484" end_page="486" type="metho"> <SectionTitle> 3 Extracting Key Phrase Translation </SectionTitle> <Paragraph position="0"> When the Chinese key phrase and its English hint words are sent to Google as the query, the returned web page snippets contain the source query and possibly its translation. We preprocess the snippets to remove irrelevant information. The preprocessing steps are as follows (a minimal code sketch follows the example below):
2. Convert HTML special characters (e.g., "&lt;") to the corresponding ASCII character ("<");
3. Segment Chinese words based on a maximum string matching algorithm; the segmentation is later used to calculate the translation probability between a Chinese key phrase and an English translation candidate;
4. Replace punctuation marks with the phrase separator '|';
5. Replace non-query Chinese words with the placeholder mark '+', so that they indicate the distance between an English phrase and the Chinese key phrase.</Paragraph> <Paragraph position="2"> For example, the snippet

<< <b> Lang Qiao Yi Meng </b> >> (the bridges of madison county)[review]. Fa Bu Zhe : anjing | Fa Bu Shi Jian : 2004-01-25 Xing Qi Ri 02:13 | Zui Xin Geng Xin Shi Jian

(the romanized tokens stand for the original Chinese text: "Lang Qiao Yi Meng" is the Chinese title of the film, "Fa Bu Zhe" means "publisher", "Fa Bu Shi Jian" means "publish time", "Xing Qi Ri" means "Sunday", and "Zui Xin Geng Xin Shi Jian" means "last update time") is converted into a form like

| <b> Lang Qiao Yi Meng </b> | the bridges of madison county | review | + anjing | ++ 2004-01-25 + 02:13 | +++

where "<b>" and "</b>" mark the start and end positions of the Chinese key phrase. The candidate English phrases "the bridges of madison county", "review", and "anjing" will be aligned to the source key phrase according to a combined feature set: a transliteration model, which captures pronunciation similarity; a translation model, which captures semantic similarity; and a frequency-distance model, which reflects their relevancy. These models are described below.</Paragraph>
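Concretely, steps 2-5 can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the authors' code: the punctuation regular expression, the per-character stand-in for the maximum-string-matching segmenter of step 3, and all function names are hypothetical.

```python
import html
import re

def preprocess_snippet(snippet: str, query: str) -> str:
    """Steps 2-5 of the snippet preprocessing described above."""
    # Step 2: convert HTML entities ("&lt;" -> "<", "&amp;" -> "&", ...).
    text = html.unescape(snippet)
    # Protect the <b>...</b> markers around the source key phrase.
    text = text.replace("<b>", " <B> ").replace("</b>", " </B> ")
    # Step 4: replace punctuation marks with the phrase separator '|'.
    text = re.sub(r"[《》()\[\].,:;!?、。，：；]", " | ", text)
    # Step 5: replace non-query Chinese words with the placeholder '+'.
    # A real system would first run maximum-string-matching segmentation
    # (step 3); replacing each non-query CJK character is a rough stand-in.
    out = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff" and ch not in query:
            out.append("+")
        else:
            out.append(ch)
    return re.sub(r"\s+", " ", "".join(out)).strip()
```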
<Section position="1" start_page="484" end_page="485" type="sub_section"> <SectionTitle> 3.1 Transliteration Model </SectionTitle> <Paragraph position="0"> The transliteration model captures the phonetic similarity between a Chinese phrase and an English translation candidate via string alignment.</Paragraph> <Paragraph position="1"> Many key phrases are person and location names, which are phonetically translated and whose written forms resemble their pronunciations. Therefore it is possible to discover these translation pairs through their surface strings. Surface string transliteration does not need a pronunciation lexicon to map words into phoneme sequences; thus it is especially appealing for OOV word translation. For non-Latin languages like Chinese, a romanization script called "pinyin" maps each Chinese character into a Latin letter string. This normalization makes the string alignment possible.</Paragraph> <Paragraph position="2"> We adopt the transliteration model proposed in (Huang et al. 2003). This model calculates the probabilistic Levenshtein distance between a romanized source string and a target string. Unlike the traditional Levenshtein distance calculation, the character alignment cost is not binary (0/1); rather, it is the logarithm of the character alignment probability, which ensures that characters with similar pronunciations (e.g., `p` and `b`) have higher alignment probabilities and lower costs. These probabilities are automatically learned from bilingual name lists using EM.</Paragraph> <Paragraph position="3"> Assume the Chinese phrase f has J Chinese characters, $f = f_1 f_2 \ldots f_J$, and the English candidate phrase e has L English words, $e = e_1 e_2 \ldots e_L$. The transliteration cost between a Chinese query and an English translation candidate is calculated as:

$$C_{trl}(f, e) = -\sum_{j=1}^{J} \sum_{i} \log p\big(e_{a(j,i)} \mid y_{ji}\big)$$

where $y_j$ is the pinyin of Chinese character $f_j$, $y_{ji}$ is the $i$-th letter in $y_j$, and $e_{a(j)}$ and $e_{a(j,i)}$ are their aligned English letters, respectively; $p(e_{a(j,i)} \mid y_{ji})$ is the letter transliteration probability. The transliteration cost between a Chinese phrase and an English phrase is thus approximated by the sum of the letter transliteration costs along the optimal alignment path, which is identified with dynamic programming.</Paragraph> </Section>
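The probabilistic Levenshtein distance above is a standard edit-distance dynamic program in which each edit is charged the negative log of a letter alignment probability. A minimal sketch, assuming the alignment probabilities (learned elsewhere from bilingual name lists via EM, as the paper describes) are supplied as a smoothed, strictly positive function; all names are illustrative:

```python
import math

def transliteration_cost(pinyin, english, align_prob):
    """Probabilistic Levenshtein distance via dynamic programming.

    align_prob(a, b) -> probability of aligning source letter a to
    target letter b; b=None encodes a deletion, a=None an insertion.
    Probabilities are assumed smoothed (strictly positive).
    """
    n, m = len(pinyin), len(english)
    INF = float("inf")
    # cost[i][j] = min cost of aligning pinyin[:i] with english[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n:  # delete a source letter
                c = -math.log(align_prob(pinyin[i], None))
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + c)
            if j < m:  # insert a target letter
                c = -math.log(align_prob(None, english[j]))
                cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + c)
            if i < n and j < m:  # align source letter to target letter
                c = -math.log(align_prob(pinyin[i], english[j]))
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], cost[i][j] + c)
    return cost[n][m]
```

The optimal alignment path itself can be recovered by storing backpointers alongside the costs.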
<Section position="2" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 3.2 Translation Model </SectionTitle> <Paragraph position="0"> The translation model measures the semantic equivalence between a Chinese phrase and an English candidate. One widely used model is the IBM model (Brown et al. 1993). The phrase translation probability is computed using IBM Model 1 as:

$$P(e \mid f) = \frac{1}{(J+1)^{L}} \prod_{l=1}^{L} \sum_{j=0}^{J} t(e_l \mid f_j)$$

where $t(e_l \mid f_j)$ is the lexical translation probability, which can be calculated according to the IBM models. This alignment model is asymmetric, as one source word can only be aligned to one target word, while one target word can be aligned to multiple source words. We estimate both $P(e \mid f)$ and $P(f \mid e)$, and define the translation cost as:

$$C_{trans}(f, e) = -\log P(e \mid f) - \log P(f \mid e)$$</Paragraph> </Section> <Section position="3" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 3.3 Frequency-Distance Model </SectionTitle> <Paragraph position="0"> The more often a bilingual phrase pair co-occurs, or the closer a bilingual phrase pair is within a snippet, the more likely the two phrases are translations of each other. The frequency-distance model measures this correlation.</Paragraph> <Paragraph position="1"> Suppose $S$ is the set of returned snippets for query $f$, and a single returned snippet is $s \in S$. The source phrase $f$ occurs in $s$ at positions $f_s^{(1)}, f_s^{(2)}, \ldots$ (since $f$ may occur several times in a snippet). The frequency-distance weight of an English candidate $e$ is

$$w(e) = \sum_{s \in S} \sum_{i} \frac{1}{d\big(f_s^{(i)}, e\big)}$$

where $d(f_s^{(i)}, e)$ is the distance between phrase occurrence $f_s^{(i)}$ and $e$, i.e., how many words lie between the two phrases (the separator '|' is not counted).</Paragraph> </Section>
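A minimal sketch of the frequency-distance weight $w(e)$, treating each phrase as a single token and guarding against a zero distance for adjacent phrases; both simplifications are ours, not the paper's:

```python
def frequency_distance_weight(snippets, query, candidate):
    """w(e): sum, over every co-occurrence of the query phrase and the
    candidate, of 1 / (word distance between them).  Each snippet is a
    token list in which punctuation has already been replaced by the
    separator '|' (not counted as a word).
    """
    weight = 0.0
    for tokens in snippets:
        q_pos = [i for i, t in enumerate(tokens) if t == query]
        c_pos = [i for i, t in enumerate(tokens) if t == candidate]
        for q in q_pos:
            for c in c_pos:
                lo, hi = sorted((q, c))
                # words strictly between the two phrases, ignoring '|'
                dist = sum(1 for t in tokens[lo + 1:hi] if t != "|")
                weight += 1.0 / max(dist, 1)  # guard for adjacent phrases
    return weight
```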
<Section position="4" start_page="485" end_page="486" type="sub_section"> <SectionTitle> 3.4 Feature Combination </SectionTitle> <Paragraph position="0"> Define the confidence measure for the transliteration model as:

$$trl_{ph}(e) = \frac{P_{trl}(e \mid f)\, w(e)^{m}}{\sum_{e'} P_{trl}(e' \mid f)\, w(e')^{m}}$$

where $e$ and $e'$ are English candidate phrases and $m$ is the weight of the distance model; we empirically chose $m = 2$ in our experiments. This measure indicates how good the English phrase $e$ is compared with the other candidates under the transliteration model. The translation model confidence measure $trans_{ph}(e)$ is defined similarly. The overall feature cost is the linear combination of the transliteration cost and the translation cost, weighted by their respective confidence scores:

$$C(e) = \lambda\, trl_{ph}(e)\, C_{trl}(f, e) + (1 - \lambda)\, trans_{ph}(e)\, C_{trans}(f, e)$$

where the linear combination weight $\lambda$ is chosen empirically. While $trl_{ph}$ and $trans_{ph}$ represent the relative rank of the current candidate among all compared candidates, $C_{trl}$ and $C_{trans}$ indicate its absolute likelihood, which is useful for rejecting the top-1 candidate when the true translation does not occur in any returned snippet.</Paragraph> </Section> </Section>
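A sketch of the confidence-weighted combination. The functional form of $trl_{ph}$ and $trans_{ph}$ is reconstructed from the text above (each cost is mapped back to an unnormalized probability with exp(-C) and boosted by $w(e)^m$ before normalization), so treat it as an approximation rather than the paper's exact measure; the values of m and lam are placeholders:

```python
import math

def combine_costs(candidates, m=2.0, lam=0.5):
    """Confidence-weighted linear combination of transliteration and
    translation costs.  `candidates` maps each English phrase e to a
    tuple (c_trl, c_trans, w): its transliteration cost C_trl(f,e),
    translation cost C_trans(f,e), and frequency-distance weight w(e).
    Returns the candidates ranked by ascending combined cost.
    """
    def confidence(idx):
        # exp(-cost) turns a cost back into an unnormalized
        # probability, boosted by w(e)**m, then normalized over
        # all competing candidates.
        scores = {e: math.exp(-v[idx]) * (v[2] ** m)
                  for e, v in candidates.items()}
        z = sum(scores.values()) or 1.0
        return {e: s / z for e, s in scores.items()}

    trl_ph = confidence(0)    # relative rank under transliteration
    trans_ph = confidence(1)  # relative rank under translation
    combined = {e: lam * trl_ph[e] * v[0] + (1.0 - lam) * trans_ph[e] * v[1]
                for e, v in candidates.items()}
    return sorted(combined, key=combined.get)
```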
<Section position="5" start_page="486" end_page="487" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We evaluated our approach by translating a set of key phrases from different domains. We selected 310 Chinese key phrases from 12 domains as the test set, almost equally distributed across these domains, and manually translated them to obtain the reference translations. Table 1 shows some typical phrases and their translations; note that correct key phrase translation requires both phonetic transliteration and semantic translation. We evaluated the inclusion rate, defined as the percentage of correct key phrase translations that can be retrieved in the returned snippets; the alignment accuracy, defined as the percentage of key phrase translations that can be correctly aligned, given that these translations are included in the snippets; and the overall translation accuracy, defined as the percentage of key phrases that can be translated correctly. We compared our approach with the LiveTrans system (Cheng et al. 2004), an unknown word translator using web corpora, and observed better translation performance with our approach.</Paragraph> <Section position="1" start_page="486" end_page="487" type="sub_section"> <SectionTitle> 4.1 Query Translation Inclusion Rate </SectionTitle> <Paragraph position="0"> In the first round of query search, for each Chinese key phrase f, on average 13 unique snippets were returned to identify relevant Chinese hint words f', and the top 5 f's were selected to generate hint words e's. In the second round, f and the e's were sent to Google again to retrieve mixed-language snippets, which were used to extract e, the correct translation of f.</Paragraph> <Paragraph position="1"> Figure 3 shows the inclusion rate vs. the number of snippets used for three mixed-language web page searching strategies:
* Search English web pages containing f (Cheng et al. 2004);
* Search English web pages containing f and hint words e';
* Search any web pages containing f and hint words e', as proposed in this paper.</Paragraph> <Paragraph position="2"> The first search strategy resulted in a relatively low inclusion rate; the second achieved a much higher one. However, such English pages were limited: on average only 45 unique snippets could be found for each f, which resulted in a maximum inclusion rate of 85.8%. With cross-lingual query expansion over any web pages, the search space was much larger but more focused, and we achieved an inclusion rate of 89.7% using 32 mixed-language snippets and of 95.2% using 165 snippets, both from the second-round retrieval. These web pages are labeled by Google as "English" web pages, though they may contain non-English characters.</Paragraph> </Section> <Section position="2" start_page="487" end_page="487" type="sub_section"> <SectionTitle> 4.2 Translation Alignment Accuracy </SectionTitle> <Paragraph position="0"> We evaluated our key phrase extraction model by testing queries whose correct translations were included in the returned snippets. We used different feature combinations on differently sized snippet sets to compare their alignment accuracies. Table 2 shows the results. Here "Trl" means using the transliteration model, "Trans" means using the translation model, and "Fq-dis" means using the frequency-distance model. The frequency-distance model was the strongest single model in both cases (with and without hint words), while incorporating the phonetic and semantic features provided additional strength to the overall performance.</Paragraph> <Paragraph position="1"> Combining all three features yielded the best accuracy. Note that when more candidate translations were available through query expansion, the alignment accuracy improved by 30% relative, thanks to the frequency-distance model.</Paragraph> <Paragraph position="2"> However, using the transliteration and/or translation models alone decreased performance, because more incorrect translation candidates came from the returned snippets. After incorporating the frequency-distance model, correct translations had the maximum frequency-distance weights and were more likely to be selected as the top hypothesis. Therefore the combined model obtained the highest translation accuracy.</Paragraph> </Section> <Section position="3" start_page="487" end_page="487" type="sub_section"> <SectionTitle> 4.3 Overall Translation Quality </SectionTitle> <Paragraph position="0"> The overall translation qualities are listed in Table 3, which shows the translation accuracies of the top 5 hypotheses using different numbers of snippets. A hypothesized translation was considered correct when it matched one of the reference translations. Using more snippets always increased the overall translation accuracy, and with all 165 snippets (on average per query), our approach achieved 80% top-1 translation accuracy and 90% top-5 accuracy.</Paragraph> <Paragraph position="1"> We compared our translations with those from a research statistical machine translation system (CMU-SMT, Vogel et al. 2003) and a web-based MT engine (BabelFish). Due to the lack of topic-relevant context and the many OOV words in the source key phrases, their results were not satisfactory. We also compared our system with LiveTrans, which searches only within English web pages and thus has a more limited search space and more noise (incorrect English candidates), making it more difficult to select the correct translation. Table 4 lists some example key phrase translations mined from web corpora, together with the translations from BabelFish.</Paragraph> </Section> </Section> <Section position="6" start_page="487" end_page="489" type="metho"> <SectionTitle> 5 Relevant Work </SectionTitle> <Paragraph position="0"> Both (Cheng et al. 2004) and (Zhang and Vines 2004) exploited web corpora for translating OOV terms and queries. Compared with their work, our proposed method differs in both the web page search space and the translation extraction features. Figure 4 illustrates the three different search strategies. Suppose we want to translate the Chinese query "Fu Shi De". (Cheng et al. 2004) searched only the 188 English web pages that contained the source query, and 53% of them (100 pages) had the correct translation. (Zhang and Vines 2004) searched all 55,100 web pages containing the query, and 10% of them (5,490 pages) had the correct translation. Our approach used query expansion to search any web pages containing "Fu Shi De" and English hint words, which is a larger search space than that of (Cheng et al. 2004) and a more focused one than that of (Zhang and Vines 2004), as illustrated by the shaded region in Figure 4.</Paragraph> <Paragraph position="2"> For the translation extraction features, we took advantage of machine transliteration and machine translation models, and combined them with frequency and distance information.</Paragraph> </Section> </Paper>