<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2026"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Chinese-English Term Translation Mining Based on Semantic Prediction Gaolin Fang, Hao Yu, and Fumihito Nishino</Title> <Section position="3" start_page="0" end_page="199" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The goal of Web-based Chinese-English (C-E) term translation mining is to use a statistical method to acquire, from the Web, translations of terms or proper nouns that cannot be found in a dictionary (e.g., San Guo Yan Yi → The Romance of Three Kingdoms), and then to build a reading/writing assistant application on top of them. When translating or writing articles in a foreign language, people often encounter terms for which they cannot find a native translation even after many lookup attempts. Skilled users may resort to a Web search engine, but the large number of irrelevant pages and the redundant information retrieved keep them from finding useful information. Thus, a system is needed that automatically mines translation knowledge of terms from the abundant information on the Web, so as to help users read and write foreign-language articles accurately.</Paragraph> <Paragraph position="1"> A Web-based term translation mining system has many applications. 1) Reading/writing assistance. 2) A construction tool for bilingual or multilingual dictionaries for machine translation. The system can not only provide translation candidates for compiling a lexicon, but also rescore the candidate list of an existing dictionary. English can also be used as a pivot language to build a lexicon translation bridge between two languages with few bilingual annotations (e.g., Japanese and Chinese). 3) Providing translations of unknown queries in cross-language information retrieval (CLIR).
4) Serving as a typical application paradigm for combining CLIR and Web mining.</Paragraph> <Paragraph position="2"> Automatic acquisition of bilingual translations has been extensively researched in the literature.</Paragraph> <Paragraph position="3"> The methods for acquiring translations can be summarized in the following six categories. 1) Acquiring translations from parallel corpora. To reduce the workload of manual annotation, researchers have proposed various methods to automatically collect parallel corpora in different language versions from the Web (Kilgarriff, 2003). 2) Acquiring translations from non-parallel corpora (Fung, 1997; Rapp, 1999). This approach relies on the clue that, in large corpora, the context of a source term is very similar to that of its target translation. 3) Acquiring translations by combining the translations of constituent words (Li et al., 2003). 4) Acquiring translations using cognate matching (Gey, 2004) or transliteration (Seo et al., 2004). This method is well suited to translation between two languages with some intrinsic relationship, e.g., from Japanese to Chinese or from Korean to English. 5) Acquiring translations using anchor text information (Lu et al., 2004). 6) Acquiring translations from the Web.</Paragraph> <Paragraph position="4"> When people write in Asian languages (Chinese, Japanese, and Korean), they often add the associated English meaning after a term. With the development of the Web and the growing availability of electronic documents, digital libraries, and scientific articles, these resources will become more and more abundant. Thus, acquiring term translations from the Web is a feasible and effective approach. Nagata et al. (2001) proposed an empirical function of the byte distance between Japanese and English terms as an evaluation criterion to extract translations of Japanese words, and the results could be used as a Japanese-English dictionary. Cheng et al.
(2004) utilized the Web as the corpus source to translate unknown English queries for CLIR. They proposed context-vector and chi-square methods to determine Chinese translations for unknown query terms by mining the top 100 search-result pages from Web search engines.</Paragraph> <Paragraph position="5"> Zhang and Vines (2004) proposed using a Web search engine to obtain translations of Chinese out-of-vocabulary terms from the Web to improve CLIR performance. The method used Chinese terms as queries, retrieved the top 100 document snippets from Google, and then estimated possible translations using co-occurrence information. From the review above, we can see that previous work did not address the issue of how to obtain effective Web pages with bilingual annotations, and mainly used frequency as the clue for mining translations. In fact, the top 100 Web results seldom contain effective English equivalents.</Paragraph> <Paragraph position="6"> Apart from frequency information, other features such as distribution, length ratio, distance, keywords, key symbols, and boundary information also have an important impact on term translation mining. In this paper, an approach based on semantic prediction is proposed to obtain effective Web pages; to acquire a correct translation list, an evaluation strategy based on a weighted sum of multiple features is employed to rank the candidates.</Paragraph> <Paragraph position="7"> The remainder of this paper is organized as follows. Section 2 gives an overview of the system. Section 3 describes effective Web page collection. Section 4 introduces translation candidate construction and noise handling. Section 5 presents candidate evaluation based on multiple features. Section 6 reports experimental results. The conclusion is drawn in the last section.</Paragraph> </Section> </Paper>