<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1068"> <Title>Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Compilation of translation lexicons is a crucial process for machine translation (MT) (Brown et al., 1990) and cross-language information retrieval (CLIR) systems (Nie et al., 1999). Much effort has been devoted to constructing translation lexicons automatically from domain-specific corpora (Melamed, 2000; Smadja et al., 1996; Kupiec, 1993).</Paragraph> <Paragraph position="1"> However, such methods face two fundamental problems that merit further investigation: the translation of regional variations, and the lack of up-to-date, high-lexical-coverage corpus sources.</Paragraph> <Paragraph position="2"> The first problem results from the fact that the translations of a term may vary across dialectal regions. Translation lexicons constructed with conventional methods may not adapt to regional usage. For example, a Chinese-English lexicon constructed from a Hong Kong corpus cannot be directly adapted for use in mainland China or Taiwan. An obvious example is the word &quot;taxi&quot;, which is normally translated as &quot;De Shi&quot; (a Chinese transliteration of taxi) in Hong Kong, completely different from the Chinese translations &quot;Chu Zu Che&quot; (rental cars) in mainland China and &quot;Ji Cheng Che&quot; (cars with meters) in Taiwan. Moreover, transliterations of a term are often pronounced differently across regions. For example, the company name &quot;Sony&quot; is transliterated as &quot;Xin Li&quot; (xinli) in Taiwan and &quot;Suo Ni&quot; (suoni) in mainland China. In today's increasingly internationalized world, such terms appear more and more often. 
We believe their translations should reflect the cultural differences across dialectal regions.</Paragraph> <Paragraph position="3"> Translations that ignore regional usage can lead to serious misunderstandings, especially when the context of the original terms is unavailable.</Paragraph> <Paragraph position="4"> Halpern (2000) discussed the importance of translating simplified and traditional Chinese lexemes that are semantically, not orthographically, equivalent in various regions. However, previous work on constructing translation lexicons for use in different regions has been limited. This may result from the second problem: most conventional approaches rely heavily on domain-specific corpora. Such corpora may be insufficient, or unavailable, for certain domains.</Paragraph> <Paragraph position="5"> The Web is becoming the largest data repository in the world. A number of studies have explored using the Web to complement insufficient corpora. Most of them (Kilgarriff et al., 2003) tried to automatically collect from the Web parallel texts in different language versions (e.g., English and Chinese), rather than different regional versions (e.g., Chinese in Hong Kong and Taiwan). These methods are feasible, but sufficient parallel texts can be extracted only for certain language pairs and subject domains. In contrast to this previous work, Lu et al. (2002) utilized Web anchor texts as a comparable bilingual corpus to extract translations for out-of-vocabulary (OOV) terms, i.e., terms not covered by general translation dictionaries. 
This approach is applicable to the compilation of translation lexicons in diverse domains, but it requires powerful crawlers and high network bandwidth to gather Web data.</Paragraph> <Paragraph position="6"> Fortunately, for some language pairs, such as Asian languages paired with English, the Web contains many pages written in a mixture of two or more languages.</Paragraph> <Paragraph position="7"> Many of them contain bilingual translations of terms, including OOV terms such as company, personal, and technical names. In addition, geographic information about Web pages provides useful clues to the regions where translations appear. We are therefore interested in whether these characteristics make it possible to automatically construct multilingual translation lexicons with regional variations. Real search engines, such as Google (http://www.google.com) and AltaVista (http://www.altavista.com), allow us to restrict a search for English terms to pages in a certain language, e.g., Chinese or Japanese.</Paragraph> <Paragraph position="8"> This motivates us to investigate how to construct translation lexicons from bilingual search-result pages (as the corpus), which are normally returned as a long ranked list of snippets (including titles and page descriptions) that help users locate relevant pages.</Paragraph> <Paragraph position="9"> This paper proposes a systematic approach to creating multilingual translation lexicons with regional variations by mining bilingual search-result pages. The bilingual pages retrieved by a term in one language are adopted as the corpus for extracting its translations in another language. 
Three major problems must be addressed: (1) extracting translations for unknown terms: how to extract translations with correct lexical boundaries from noisy bilingual search-result pages, and how to estimate term similarity in order to select correct translations from the extracted candidates; (2) finding translations with regional variations: how to find regional translation variations that seldom co-occur in the same Web pages, and how to identify the languages of the retrieved search-result pages when their location clues (e.g., URLs) may not indicate the language they are written in; and (3) translation with limited corpora: how to translate terms when search-result pages are insufficient for particular language pairs, such as Chinese and Japanese, or simplified and traditional Chinese.</Paragraph> <Paragraph position="10"> The goal of this paper is to deal with these three problems. Given a term in one language, all possible translations are extracted from the retrieved bilingual search-result pages based on their similarity to the term. For language pairs lacking such corpora, a transitive translation model is proposed, in which the source term is translated into the target language through an intermediate language. The transitive translation model is further enhanced with a competitive linking algorithm, which effectively alleviates the problem of error propagation in the translation process, where errors may arise from incorrect identification of ambiguous terms in the intermediate language. 
In addition, because the search-result pages may contain snippets that are not actually written in the target language, a filtering process is performed to eliminate translation variations that are not of interest.</Paragraph> <Paragraph position="11"> Several experiments were conducted to examine the performance of the proposed approach.</Paragraph> <Paragraph position="12"> The experimental results show that the approach generates effective translation equivalents for various terms, especially OOV terms such as proper nouns and technical names, which can be used to enrich general translation dictionaries. The results also reveal that the created translation lexicons can reflect different cultural aspects across regions such as Taiwan, Hong Kong, and mainland China.</Paragraph> <Paragraph position="13"> In the rest of this paper, we review related work on translation extraction in Section 2. We present the transitive translation model and describe the direct translation process in Sections 3 and 4, respectively. The conducted experiments and their results are described in Section 5. Finally, Section 6 gives some concluding remarks.</Paragraph> </Section> </Paper>
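The transitive translation idea described in the introduction, combining source-to-intermediate and intermediate-to-target similarity scores and then resolving ambiguity greedily in the spirit of a competitive linking algorithm, can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the function names, the product-of-scores combination rule, and the toy similarity scores are all assumptions for illustration.

```python
def transitive_candidates(src_to_mid, mid_to_tgt):
    """Combine source->intermediate and intermediate->target similarity
    scores into source->target scores, here (as an assumption) by taking
    the product of the two link scores and keeping the best-scoring
    intermediate term for each (source, target) pair."""
    scores = {}
    for (src, mid1), sim1 in src_to_mid.items():
        for (mid2, tgt), sim2 in mid_to_tgt.items():
            if mid1 == mid2:
                key = (src, tgt)
                scores[key] = max(scores.get(key, 0.0), sim1 * sim2)
    return scores

def competitive_linking(pair_scores):
    """Greedy one-to-one linking: repeatedly accept the highest-scoring
    remaining pair and discard all other pairs sharing either term.
    This limits error propagation through ambiguous intermediate terms,
    since a term claimed by a strong link cannot be reused by weaker ones."""
    links = {}
    for (src, tgt), _ in sorted(pair_scores.items(),
                                key=lambda kv: kv[1], reverse=True):
        if src not in links and tgt not in links.values():
            links[src] = tgt
    return links

# Hypothetical toy data: similarity scores between abstract terms.
src_to_mid = {("A", "x"): 0.9, ("A", "y"): 0.4, ("B", "y"): 0.8}
mid_to_tgt = {("x", "T1"): 0.7, ("y", "T2"): 0.9}
links = competitive_linking(transitive_candidates(src_to_mid, mid_to_tgt))
# links == {"B": "T2", "A": "T1"}: the weaker ambiguous link (A, T2) is
# blocked because T2 was already claimed by the stronger pair (B, T2).
```

In this toy run, the ambiguous source term A has two transitive candidates (T1 via x with score 0.63, T2 via y with score 0.36); because the greedy pass accepts (B, T2) first with score 0.72, A is forced onto T1, illustrating how one-to-one linking suppresses a spurious translation through an ambiguous intermediate term.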