<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1068"> <Title>Creating Multilingual Translation Lexicons with Regional Variations Using Web Corpora</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Construction of Translation Lexicons </SectionTitle> <Paragraph position="0"> To construct translation lexicons with regional variations, we propose a transitive translation model Strans(s,t) to estimate how likely a term s in one (source) language ls translates into a term t in another (target) language lt. Given the term s in ls, we first extract a set of terms C={tj}, where tj in lt acts as a translation candidate of s, from a corpus. In this case, the corpus consists of a set of search-result pages retrieved from search engines using term s as a query. Based on our previous work (Cheng et al., 2004), we can efficiently extract the terms tj by calculating an association measure for every character or word n-gram in the corpus and applying the local maxima algorithm. The association measure is determined by the degree of cohesion holding the words together within a word n-gram, and is enhanced by examining whether a word n-gram has complete lexical boundaries. Next, we rank the extracted candidates C as a list T in decreasing order of the model Strans(s,t) as the result.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Bilingual Search-Result Pages </SectionTitle> <Paragraph position="0"> The Web contains rich texts in a mixture of multiple languages and from different regions. For example, Chinese pages on the Web may be written with traditional or simplified Chinese as the principal language and English as an auxiliary language. 
According to our observations, translated terms frequently occur together with the original term in such mixed-language texts.</Paragraph> <Paragraph position="1"> For example, Figure 1 illustrates the search-result pages of the English term &quot;George Bush,&quot; which was submitted to Google to search Chinese pages in different regions. Figure 1 (a) contains the translations &quot;Qiao Zhi Bu Xi &quot; (George Bush) and &quot;Bu Xi &quot; (Bush), obtained from pages in Taiwan. In Figures 1 (b) and (c) the term &quot;George Bush&quot; is translated into &quot;Bu Shi &quot; (busir) or &quot;Bu Shen &quot; (buson) in mainland China and &quot;Bu Shu &quot; (busu) in Hong Kong.</Paragraph> <Paragraph position="2"> This characteristic of bilingual search-result pages is also useful for other language pairs, such as other Asian languages mixed with English.</Paragraph> <Paragraph position="3"> For each term to be translated from one (source) language, we first submit it to a search engine to locate bilingual Web documents that contain the term and are written in another (target) language from a specified region. 
The returned search-result pages containing snippets (illustrated in Figure 1), instead of the documents themselves, are collected as a corpus from which translation candidates are extracted and correct translations are then selected.</Paragraph> <Paragraph position="4"> Compared with parallel corpora and anchor texts, bilingual search-result pages are easier to collect and promptly reflect the dynamic content of the Web.</Paragraph> <Paragraph position="5"> In addition, geographic information about Web pages, such as URLs, also provides useful clues to the regions where translations appear.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The Transitive Translation Model </SectionTitle> <Paragraph position="0"> Transitive translation is particularly necessary for the translation of terms with regional variations because the variations seldom co-occur in the same bilingual pages. To estimate how likely a candidate t in T is the translation of term s, the transitive translation model first performs so-called direct translation, which attempts to learn translational equivalents directly from the corpus. The direct translation method is simple, but strongly affected by the quality of the adopted corpus. (A detailed description of the direct translation method is given in Section 4.) If the term s and its translation t appear infrequently, the statistical information obtained from the corpus might not be reliable. For example, a term in simplified Chinese, e.g. Hu Lian Wang (Internet), does not usually co-occur with its variation in traditional Chinese, e.g. Wang Ji Wang Lu (Internet). To deal with this problem, our idea is that the term s can first be translated into an intermediate translation m, which might co-occur with s, via a third (or intermediate) language lm. 
The correct translation t can then be extracted if it can be found as a translation of m.</Paragraph> <Paragraph position="1"> The transitive translation model, therefore, combines the processes of both direct translation and indirect translation, and is defined as:</Paragraph> <Paragraph position="2"> Strans(s,t) = Sdirect(s,t), if Sdirect(s,t) exceeds q; otherwise Strans(s,t) = max over m of (Sdirect(m,t) x v),</Paragraph> <Paragraph position="3"> where m is one of the top k most probable intermediate translations of s in language lm, v is the confidence value of m's accuracy, which can be estimated based on m's probability of occurring in the corpus, and q is a predefined threshold value.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 The Competitive Linking Algorithm </SectionTitle> <Paragraph position="0"> One major challenge of the transitive translation model is the propagation of translation errors. That is, an incorrect m will significantly reduce the accuracy of the translation of s into t. A typical case is the indirect association problem (Melamed, 2000), as shown in Figure 2, in which we want to translate the term s1 (s=s1). Assume that t1 is s1's corresponding translation, but appears infrequently with s1. An indirect association error might arise when t2, the translation of s1's highly relevant term s2, co-occurs often with s1. This problem is especially important when translation is a many-to-many mapping. To reduce such errors and enhance the reliability of the estimation, a competitive linking algorithm, which is extended from Melamed's work (Melamed, 2000), is developed to determine the correct translations. The idea of the algorithm is described below. For each translated term tj in T in lt, we translate it back into the original language ls and then model the translation mappings as a bipartite graph, as shown in Figure 2, where the vertices on each side correspond to the terms {si} or {tj} in one language. 
An edge eij indicates that the corresponding two terms si and tj might be translations of each other, and is weighted by the sum of Sdirect(si,tj) and Sdirect(tj,si). Based on the weighted values, we can examine whether each translated term tj in T in lt can be correctly translated back into the original term s1. If term tj has any translations better than term s1 in ls, term tj might be a so-called indirect association error and should be eliminated from T. In the above example, if the weight of e22 is larger than that of e12, the term &quot;Technology&quot; will not be considered as the translation of &quot;Wang Ji Wang Lu &quot; (Internet). Finally, for all translated terms {tj} in T that are not eliminated, we re-rank them by the weights of the edges {eij}, and the top k ones are then taken as the translations. A more detailed description of the algorithm can be found in Lu et al. (2004).</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Direct Translation </SectionTitle> <Paragraph position="0"> In this section, we describe the details of the direct translation process, i.e. the way to compute Sdirect(s,t). Three methods will be presented to estimate the similarity between a source term and each of its translation candidates. Moreover, because the search-result pages of a term might contain snippets that are not actually written in the target language, we will introduce a filtering method to eliminate the translation variations not of interest.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Translation Extraction </SectionTitle> <Paragraph position="0"> The Chi-square Method: A number of statistical measures have been proposed for estimating term association based on co-occurrence analysis, including mutual information, the DICE coefficient, the chi-square test, and the log-likelihood ratio (Rapp, 1999). 
The chi-square test (X2) is adopted in our study because the required parameters can be obtained by submitting Boolean queries to search engines and utilizing the returned page counts (number of pages).</Paragraph> <Paragraph position="2"> Given a term s and a translation candidate t, suppose the total number of Web pages is N; the number of pages containing both s and t, n(s,t), is a; the number of pages containing s but not t, n(s,!t), is b; the number of pages containing t but not s, n(!s,t), is c; and the number of pages containing neither s nor t, n(!s,!t), is d. (Although d is not provided by search engines, it can be computed by d=N-a-b-c.) Assume s and t are independent. Then, the expected frequency of (s,t), E(s,t), is (a+c)(a+b)/N; the expected frequency of (s,!t), E(s,!t), is (b+d)(a+b)/N; the expected frequency of (!s,t), E(!s,t), is (a+c)(c+d)/N; and the expected frequency of (!s,!t), E(!s,!t), is (b+d)(c+d)/N.</Paragraph> <Paragraph position="3"> Hence, the conventional chi-square test can be computed as X2(s,t) = N x (ad-bc)^2 / ((a+b)(a+c)(b+d)(c+d)). Although the chi-square method is simple to compute, it is more applicable to high-frequency terms than to low-frequency terms, since the former are more likely to appear with their candidates. Moreover, the fact that certain candidates frequently co-occur with term s does not imply that they are appropriate translations.</Paragraph> <Paragraph position="4"> Thus, another method is presented.</Paragraph> <Paragraph position="5"> The Context-Vector Method: The basic idea of this method is that term s's translation equivalents may share common contextual terms with s in the search-result pages, similar to Rapp (1999). For both s and its candidates C, we take the contextual terms in the search-result pages as their features. 
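As an illustration, the chi-square association above depends only on the four page counts a, b, c, d. A minimal Python sketch (the counts are assumed to come from search-engine page counts; names are ours):

```python
def chi_square(a, b, c, d):
    """2x2 chi-square association between a term s and a candidate t.

    a = n(s,t):   pages containing both s and t
    b = n(s,!t):  pages containing s but not t
    c = n(!s,t):  pages containing t but not s
    d = n(!s,!t): pages containing neither, i.e. d = N - a - b - c
    """
    n = a + b + c + d                      # total number of pages N
    num = n * (a * d - b * c) ** 2         # N(ad - bc)^2
    den = (a + b) * (a + c) * (b + d) * (c + d)
    return num / den if den else 0.0
```

When s and t are independent (ad = bc), the score is 0; strong co-occurrence drives it up.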
The similarity between s and each candidate in C is then computed from their feature vectors in the vector-space model.</Paragraph> <Paragraph position="6"> Herein, we adopt the conventional tf-idf weighting scheme to estimate the significance of features and define it as:</Paragraph> <Paragraph position="7"> w(ti,p) = f(ti,p) x log(N/n),</Paragraph> <Paragraph position="8"> where f(ti,p) is the frequency of term ti in search-result page p, N is the total number of Web pages, and n is the number of pages containing ti. Finally, the similarity between term s and its translation candidate t is computed with the cosine measure:</Paragraph> <Paragraph position="10"> SCVdirect(s,t) = cos(cvs, cvt), where cvs and cvt are the context vectors of s and t, respectively.</Paragraph> <Paragraph position="11"> In the context-vector method, a low-frequency term still has a chance of obtaining correct translations if it shares common contexts with its translations in the search-result pages. Although the method provides an effective way to overcome the chi-square method's problem, its performance depends heavily on the quality of the retrieved search-result pages, such as the size and number of snippets. Also, feature selection needs to be carefully handled in some cases.</Paragraph> <Paragraph position="12"> The Combined Method: The context-vector and chi-square methods are basically complementary. Intuitively, a more complete solution is to integrate the two. Considering the different ranges of similarity values produced by the two methods, we compute the similarity between term s and its translation candidate t as the weighted sum of 1/RX2(s,t) and 1/RCV(s,t). Here RX2(s,t) (or RCV(s,t)) represents the similarity ranking of candidate t with respect to s and is assigned a value from 1 to k (the number of outputs) in decreasing order of the similarity measure SX2direct(s,t) (or SCVdirect(s,t)). 
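The inverse-rank combination just described might be sketched as follows (a minimal Python sketch; the function names and the weight alpha are our own, since the paper does not fix the weighting):

```python
def combined_scores(chi_scores, cv_scores, alpha=0.5):
    """Fuse two candidate rankings by a weighted sum of inverse ranks.

    chi_scores, cv_scores: dicts mapping each candidate to its similarity
    score under the chi-square and context-vector methods, respectively.
    Rank 1 is the most similar candidate under each method.
    """
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {t: r for r, t in enumerate(ordered, start=1)}

    r_chi, r_cv = ranks(chi_scores), ranks(cv_scores)
    # alpha / R_X2 + (1 - alpha) / R_CV: high in both rankings -> high here
    return {t: alpha / r_chi[t] + (1 - alpha) / r_cv[t] for t in chi_scores}
```

Because only ranks enter the sum, the differing score ranges of the two methods cancel out, which is the point of the rank-based fusion.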
That is, if the similarity rankings of t are high in both the context-vector and chi-square methods, it will also be ranked high in the combined method.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Translation Filtering </SectionTitle> <Paragraph position="0"> The direct translation process assumes that the retrieved search-result pages of a term contain only snippets from a certain region (e.g. Hong Kong) and written in the target language (e.g. traditional Chinese). However, this assumption is not always reliable, because the location (e.g. URL) of a Web page does not guarantee that it is written in the principal language used in that region. Also, we cannot identify the language of a snippet simply by its character encoding scheme, because different regions may use the same encoding scheme (e.g. Taiwan and Hong Kong mainly use the same traditional Chinese encoding scheme).</Paragraph> <Paragraph position="1"> From previous work (Tsou et al., 2004) we know that word entropies significantly reflect language differences in Hong Kong, Taiwan and mainland China.</Paragraph> <Paragraph position="2"> Herein, we propose another method for dealing with the above problem. Since our goal is to eliminate the translation candidates {tj} that do not come from snippets in language lt, for each candidate tj we merge all of the snippets that contain tj into a document and then identify the corresponding language of tj based on that document. We train a uni-gram language model for each language of concern and perform language identification based on a discrimination function, which locates the maximum character or word entropy and is defined as:</Paragraph> <Paragraph position="3"> lang(tj) = arg max over l in L of Hl(N(tj)),</Paragraph> <Paragraph position="4"> where N(tj) is the collection of the snippets containing tj, L is the set of languages to be identified, and Hl(.) is the character or word entropy of N(tj) computed with the uni-gram language model of l. 
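A minimal sketch of this filtering step, assuming unigram character models estimated from regional training data. Note that as a stand-in for the entropy-based discrimination function above, we score each candidate language by cross-entropy and keep the best-fitting (lowest) one; the names below are ours:

```python
import math
from collections import Counter

def model_fit(text, model, eps=1e-9):
    """Cross-entropy of `text` under a unigram character model.

    `model` maps characters to probabilities estimated from regional
    training data (e.g. portal pages from Taiwan vs. mainland China);
    unseen characters are smoothed with a tiny floor probability `eps`.
    Lower values mean the model explains the text better.
    """
    counts = Counter(text)
    total = sum(counts.values())
    return sum((n / total) * -math.log(model.get(ch, eps))
               for ch, n in counts.items())

def identify(text, models):
    """Return the language whose unigram model best fits `text`."""
    return min(models, key=lambda lang: model_fit(text, models[lang]))
```

A candidate tj would then be dropped whenever `identify` on its merged snippets disagrees with the target language lt.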
The candidate tj will be eliminated if lang(tj) is not lt.</Paragraph> <Paragraph position="5"> To examine the feasibility of the proposed method in identifying the Chinese used in Taiwan, mainland China and Hong Kong, we conducted a preliminary experiment. To avoid the data sparseness of a tri-gram language model, we simply used the above uni-gram model to perform language identification.</Paragraph> <Paragraph position="6"> Even so, the experimental results showed that very high identification accuracy can be achieved. Some Web portals contain different versions for specific regions, such as Yahoo! Taiwan (http://tw.yahoo.com) and Yahoo! Hong Kong (http://hk.yahoo.com).</Paragraph> <Paragraph position="8"> This allows us to collect regional training data for constructing language models. In the task of translating English terms into traditional Chinese in Taiwan, the extracted candidates for &quot;laser&quot; contained &quot;Lei She &quot; (the translation of laser mainly used in Taiwan) and &quot;Ji Guang &quot; (the translation of laser mainly used in mainland China). Based on the merged snippets, we found that &quot;Ji Guang &quot; had a higher entropy value under the language model of mainland China, while &quot;Lei She &quot; had higher entropy values under the language models of Taiwan and Hong Kong.</Paragraph> </Section> </Section> </Paper>