<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1068"> <Title>Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Performance Evaluation </SectionTitle> <Paragraph position="0"> We conducted extensive experiments to examine the performance of the proposed approach. We obtained the search-result pages for a term by submitting it to real-world search engines, including Google and Openfind (http://www.openfind.com.tw). Only the first 100 snippets returned were used as the corpus.</Paragraph> <Paragraph position="1"> Performance Metric: The average top-n inclusion rate was adopted as the metric for the extraction of translation equivalents. For a set of terms to be translated, the top-n inclusion rate was defined as the percentage of terms whose translations could be found among the first n extracted translations. The experiments were divided into direct translation and transitive translation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Direct Translation </SectionTitle> <Paragraph position="0"> Data set: We collected English terms from the logs of two real-world Chinese search engines in Taiwan, i.e.</Paragraph> <Paragraph position="1"> Dreamer (http://www.dreamer.com.tw) and GAIS (http://gais.cs.ccu.edu.tw). These were English terms in the Chinese logs that potentially needed correct translations. The Dreamer log contained 228,566 unique query terms from a period of over 3 months in 1998, while the GAIS log contained 114,182 unique query terms from a period of two weeks in 1999. The collection contained a set of 430 frequent English terms, which were obtained from the 1,230 English terms among the 9,709 most popular query terms (those with frequencies above 10 in both logs). 
About 36% (156/430) of the collection could be found in the LDC (Linguistic Data Consortium, http://www.ldc.upenn.edu/Projects/Chinese)</Paragraph> <Paragraph position="2"> English-to-Chinese lexicon with 120K entries, while about 64% (274/430) were not covered by the lexicon.</Paragraph> <Paragraph position="3"> English-to-Chinese Translation: In this experiment, we directly translated the collected 430 English terms into traditional Chinese. Table 1 shows the results in terms of the top 1-5 inclusion rates for the translation of the collected English terms. &quot;χ²&quot;, &quot;CV&quot;, and &quot;χ²+CV&quot; represent the chi-square method, the context-vector method, and the combined chi-square plus context-vector method, respectively. Although both the chi-square and context-vector methods were effective on their own, the combined method (χ²+CV) achieved the highest inclusion rates in every case, which suggests that the two methods are complementary. The proposed approach was found to be effective in finding translations of proper names, e.g.</Paragraph> <Paragraph position="4"> personal names &quot;Jordan&quot; (Qiao Dan, Qiao Deng), &quot;Keanu Reeves&quot; (Ji Nu Li Wei, Ji Nuo Li Wei), company names &quot;TOYOTA&quot; (Feng Tian), &quot;EPSON&quot; (Ai Pu Sheng), and technical terms &quot;EDI&quot; (Dian Zi Zi Liao Jiao Huan), &quot;Ethernet&quot; (Yi Tai Wang Lu), etc.</Paragraph> <Paragraph position="5"> English-to-Chinese Translation for Mainland China, Taiwan and Hong Kong: Chinese can be classified into simplified Chinese (SC) and traditional Chinese (TC) based on its writing form or character encoding scheme. SC is mainly used in mainland China, while TC is mainly used in Taiwan and Hong Kong (HK). In this experiment, we further investigated the effectiveness of the proposed approach in English-to-Chinese translation for the three regions. 
The collected 430 English terms were classified into five types: people, organization, place, computer and network, and others.</Paragraph> <Paragraph position="6"> Tables 2 and 3 show the statistical results and some examples, respectively. In Table 3, the number indicates a translated term's rank. The underlined terms were correct translations and the others were relevant translations. These translations might benefit CLIR tasks; for their performance, see our earlier work, which focused on translating unknown queries (Cheng et al., 2004). The results in Table 2 show that the top-1 translations for mainland China and HK were less reliable than those for Taiwan.</Paragraph> <Paragraph position="7"> One possible reason was that the test terms were collected from Taiwan's search engine logs. Most of them were popular in Taiwan but not in the other regions.</Paragraph> <Paragraph position="8"> Retrieving only 100 snippets might not provide a balanced or sufficient corpus for translation extraction. However, the inclusion rates for the three regions were close in the top-5. Among the five types, type place, which contained the names of well-known countries and cities, achieved the highest inclusion rates in every case and had almost no regional variations (9%, 1/11); the exception was the city &quot;Sydney&quot;, which was translated into Xi Ni (Sydney) in SC for mainland China and HK and Xue Li (Sydney) in TC for Taiwan. Type computer and network, which contained technical terms, had the most regional variations (41%, 47/115), followed by type people with 36% (5/14). In general, the translations in these two types were adapted to usage in the different regions. 
On the other hand, only 10% (15/147) and 8% (12/143) of the translations in types organization and others, respectively, had regional variations. This was because most of the terms in type others were general terms such as &quot;bank&quot; and &quot;movies&quot;, and many of the terms in type organization were local Taiwanese companies whose translations did not vary in mainland China and HK.</Paragraph> <Paragraph position="9"> Moreover, many translations in types people, organization, and computer and network were quite different in Taiwan and mainland China. For example, the personal name &quot;Brad Pitt&quot; was translated into &quot;Bi Bi Te&quot; in SC and &quot;Bu Lai De Bi Te&quot; in TC, the company name &quot;Ericsson&quot; into &quot;Ai Li Xin&quot; in SC and &quot;Yi Li Xin&quot; in TC, and the computer-related term &quot;EDI&quot; into &quot;Dian Zi Shu Ju Lian Tong&quot; in SC and &quot;Dian Zi Zi Liao Jiao Huan&quot; in TC. In general, the translations in HK were more likely to cover both the mainland China and Taiwan translations.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Multilingual &amp; Transitive Translation </SectionTitle> <Paragraph position="0"> variations among the five types as mentioned in the previous subsection, we collected two other data sets to examine the performance of the proposed approach in multilingual and transitive translation. The data sets contained 50 scientists' names and 50 disease names in English, which were randomly selected from 256 scientists (Science/People) and 664 diseases (Health/Diseases) in the Yahoo! Directory (http://www.yahoo.com), respectively.</Paragraph> <Paragraph position="1"> English-to-Japanese/Korean Translation: In this experiment, the collected scientists' names and disease names in English were translated into Japanese and Korean to examine whether the proposed approach was applicable to other Asian languages. 
As the results in Table 4 show, for English-to-Japanese translation, the top-1, top-3, and top-5 inclusion rates were 35%, 52%, and 63%, respectively; for English-to-Korean translation, the top-1, top-3, and top-5 inclusion rates were 32%, 54%, and 63%, respectively, on average.</Paragraph> <Paragraph position="2"> Chinese-to-Japanese/Korean Translation via English: To further investigate whether the proposed transitive approach is applicable to language pairs that are not frequently mixed in documents, such as Chinese and Japanese (or Korean), we performed transitive translation via English. In this experiment, we first manually translated the collected English data sets into traditional Chinese and then performed Chinese-to-Japanese/Korean translation using English as the third language.</Paragraph> <Paragraph position="3"> The results in Table 4 show that the propagation of translation errors reduced the translation accuracy. For example, the inclusion rates of Chinese-to-Japanese translation were lower than those of English-to-Japanese translation, since the intermediate Chinese-to-English translation reached inclusion rates of only 70%-86% in the top 1-5. Although transitive translation might produce noisier translations, it still produced acceptable translation candidates for human verification. In Table 4, 45%-50% of the extracted top 5 Japanese or Korean terms might have correct translations.</Paragraph> </Section> </Section> </Paper>