<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2026"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. Chinese-English Term Translation Mining Based on Semantic Prediction. Gaolin Fang, Hao Yu, and Fumihito Nishino</Title> <Section position="4" start_page="199" end_page="199" type="metho"> <SectionTitle> 2 System Overview </SectionTitle> <Paragraph position="0"> The C-E term translation mining system based on semantic prediction is illustrated in Figure 1.</Paragraph> <Paragraph position="1"> The system consists of two parts: Web page handling and term translation mining. Web page handling includes effective Web page collection and HTML analysis. Effective Web page collection gathers Web pages that carry bilingual annotations, using semantic prediction; these pages are then fed into the HTML analysis module, where candidate features and text information are extracted. Term translation mining includes candidate unit construction, candidate noise handling, and candidate ranking. Translation candidates are first formed by the candidate unit construction module; we then analyze their noise and apply corresponding methods to remove it. Finally, a multi-feature approach is employed to rank the candidates.</Paragraph> <Paragraph position="2"> Correctly recognizing all kinds of bilingual annotation forms on the Web allows a mining system to extract comprehensive translation results. After analyzing a large number of Web page examples, we summarize the translation distribution forms into six categories, illustrated in Figure 2: 1) Direct annotation (a): some pairs have nothing between them (a1), while others have symbol marks (a2, a3); 2) Separate annotation: there are English letters (b1) or some Chinese words (b2, b3) between the pair; 3) Subset form (c); 4) Table form (d); 5) List form (e); and 6) Explanation form (f).</Paragraph> <Paragraph position="3"> (Figure 1 also lists the ranking features: 1. Frequency; 2. Distribution; 3. Distance; 4. Length ratio; 5. Key symbols and boundary information.)</Paragraph> </Section> <Section position="5" start_page="199" end_page="201" type="metho"> <SectionTitle> 3 Effective Web page collection </SectionTitle> <Paragraph position="0"> To mine the English translations of Chinese terms and proper names, we must obtain effective Web pages, that is, collect Web pages that contain not only the Chinese characters but also their corresponding English equivalents. However, when a Chinese technical term is submitted to a general Web search engine, the number of retrieved relevant pages is very large, and downloading all of them would be prohibitively time-consuming. If only the top-100 Web page abstracts are used for translation estimation, as in previous work, effective English equivalents are seldom present for most Chinese terms in our experiments, for example: &quot;San Guo Yan Yi , San Hao Xue Sheng , Bai Mu Da San Jiao , Che Pai Hao &quot;. In this paper, a feasible method based on semantic prediction is proposed to acquire effective Web pages automatically. In the proposed method, the possible English meanings of every constituent unit of a Chinese term are predicted and further expanded using semantically relevant knowledge, and these expansion units, together with the original query, are used to search for bilingual Web pages. From the top-20 retrieved pages, feedback learning is employed to extract more semantically relevant terms by frequency and average length.
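As a concrete sketch of this expansion step, the expanded query string can be assembled from the term and its predicted unit translations. `build_expanded_query` is an illustrative helper, not part of the described system; the unit-translation prediction and the search-engine call are outside the snippet:

```python
def build_expanded_query(term, unit_translations):
    """Combine the original query with the OR-ed predicted English
    meanings of its constituent units, as in the semantic prediction
    expansion step."""
    expansion = " | ".join(unit_translations)
    return '"%s" + (%s)' % (term, expansion)

# Unit translations predicted for the paper's example term.
query = build_expanded_query(
    "San Guo Yan Yi",
    ["three", "country", "nation", "act", "practice", "meaning", "justice"])
```

The resulting string mirrors the query form shown later in Section 3.1.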
The refined expansion terms, together with the original query, are once more sent to retrieve effective relevant Web pages.</Paragraph> <Section position="1" start_page="199" end_page="201" type="sub_section"> <SectionTitle> 3.1 Term expansion </SectionTitle> <Paragraph position="0"> Term expansion uses predicted semantically relevant terms in the target language to expand the query, thereby resolving the issue that the top retrieved Web pages seldom contain effective English annotations. Our idea is based on the assumption that although the meanings of Chinese technical terms cannot be known exactly from their constituent characters and words alone, closely related semantic and vocabulary information can be inferred and predicted. For example, the unit translations of the term &quot;San Guo Yan Yi &quot; are, respectively: three (San ), country, nation (Guo ), act, practice (Yan ), and meaning, justice (Yi ). From these English translations, we gain a general impression of &quot;things about three countries&quot;. After expansion, the query item for the example above becomes &quot;San Guo Yan Yi &quot;+ (three |country |nation |act |practice |meaning | justice). The whole procedure consists of three steps: unit segmentation, item translation knowledge base construction, and expansion knowledge base evaluation.</Paragraph> <Paragraph position="1"> Unit segmentation. Finding the constituent units of a technical term is a segmentation procedure. Because most Chinese terms consist of out-of-vocabulary words or individually meaningless characters, general word segmentation programs perform poorly on them. In this paper, a segmentation method is employed so that possible meaningful constituent units are found. In the inner structure of proper nouns and terms, the rightmost unit usually contains a headword that reflects the major meaning of the term.
Sometimes, the modifier starts from the leftmost point of a term and forms a multi-character unit. As a result, forward maximum matching and backward maximum matching are both conducted on the term, and all the overlapped segmented units are added to the candidate items. For example, for the term &quot;abcd&quot;, the forward segmented units are &quot;ab cd&quot; and the backward units are &quot;a bcd&quot;, so &quot;ab cd a bcd&quot; are taken as our segmented items.</Paragraph> <Paragraph position="2"> Item translation knowledge base construction. Because the segmented units of a technical term or proper name are often short abbreviation items, the limited translations provided by general dictionaries often cannot satisfy the demands of translation prediction. Here, a semantic-expansion-based method is proposed to construct the item translation knowledge base. In this method, we keep only the noun and adjective items of 1-3 characters from the dictionary. If a dictionary word is longer than two characters and contains an item already in the knowledge base, the word's translations are added as translation candidates of that item. For example, the Chinese term &quot;Liu Tong Gu &quot; can be segmented into the units &quot;Liu Tong &quot; and &quot;Gu &quot;, where &quot;Gu &quot; has only the two English meanings &quot;section, thigh&quot; in the dictionary.</Paragraph> <Paragraph position="4"> However, we can derive its meaning from longer words that include this item, such as &quot;Gu Dong , Gu Piao &quot;. Thus, their respective translations &quot;stock, stockholder&quot; are added to the knowledge base list of &quot;Gu &quot; (see Figure 3).</Paragraph> <Paragraph position="5"> Expansion knowledge base evaluation. To avoid over-expanding the translations of one item, the retrieved hit count from the Web is used as a scoring criterion to remove irrelevant expansion items and rank the remaining candidates.
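The overlapped forward/backward maximum matching described above can be sketched as follows. This is a minimal illustration over the schematic term "abcd" with a toy vocabulary; the real system matches Chinese character strings against its dictionary:

```python
def fmm(term, vocab, maxlen=4):
    # Forward maximum matching: take the longest vocabulary item
    # starting at the left end; fall back to a single character.
    units, i = [], 0
    while i != len(term):
        for j in range(min(len(term), i + maxlen), i, -1):
            if term[i:j] in vocab or j == i + 1:
                units.append(term[i:j])
                i = j
                break
    return units

def bmm(term, vocab, maxlen=4):
    # Backward maximum matching: the same idea from the right end.
    units, j = [], len(term)
    while j:
        for i in range(max(0, j - maxlen), j):
            if term[i:j] in vocab or i == j - 1:
                units.insert(0, term[i:j])
                j = i
                break
    return units

def candidate_units(term, vocab):
    f, b = fmm(term, vocab), bmm(term, vocab)
    # All overlapped segmented units from both passes become candidate items.
    return f + [u for u in b if u not in f]

# Forward gives "ab cd", backward gives "a bcd", as in the paper's example.
units = candidate_units("abcd", {"ab", "cd", "bcd"})
```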
For example, &quot;Gu &quot; and its expansion translation &quot;stock&quot; are combined into a new query &quot;Gu stock -Gu Piao &quot;. It is sent to a general search engine such as Google to obtain a count, where only the co-occurrences of &quot; Gu &quot; and &quot;stock&quot; excluding the word &quot;Gu Piao &quot; are counted. The retrieved count is about 316000. If the count for an item is lower than a certain threshold (100), the evaluated translation is not added to that item in the knowledge base. The expanded candidates for each item in the dictionary are sorted by their retrieved counts.</Paragraph> </Section> <Section position="2" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 3.2 Feedback learning </SectionTitle> <Paragraph position="0"> Although pseudo-relevance feedback (PRF) has been used successfully in information retrieval (IR), whether as PRF in single-language IR or as pre-translation and post-translation PRF in CLIR, the feedback results go from source language to source language or from target language to target language; that is, the language of the feedback units is the same as the retrieval language. Our novelty is that the input language (Chinese) differs from the feedback target language (English); that is, we realize feedback from the source language to the target language, and this feedback technique is applied to the term mining field for the first time.</Paragraph> <Paragraph position="1"> After the expansion by semantic prediction, the predicted meaning of an item may deviate from its actual sense, so the retrieved documents are not necessarily the expected results.
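The threshold-and-rank evaluation above can be sketched as follows. `hit_count` is a stand-in for the search-engine request (counting pages containing the item and the candidate translation while excluding the longer source word), and the sample counts other than 316000 are illustrative:

```python
THRESHOLD = 100  # occurrence threshold used in the paper

def evaluate_expansions(item, candidates, exclude, hit_count):
    """Keep expansion translations of `item` whose Web hit count reaches
    the threshold, ranked by descending count. `hit_count(item, cand,
    exclude)` abstracts the external search-engine query."""
    scored = [(c, hit_count(item, c, exclude)) for c in candidates]
    kept = [(c, n) for c, n in scored if n >= THRESHOLD]
    return [c for c, _ in sorted(kept, key=lambda cn: -cn[1])]

# Illustrative counts; the paper reports ~316000 for "Gu stock -Gu Piao".
fake_counts = {"stock": 316000, "stockholder": 52000, "thigh": 40}
ranked = evaluate_expansions("Gu", ["stock", "stockholder", "thigh"],
                             "Gu Piao", lambda i, c, e: fake_counts[c])
```

With these counts, "thigh" falls below the threshold and is dropped.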
In this paper, a PRF technique is employed to acquire more accurate, semantically relevant terms.</Paragraph> <Paragraph position="2"> First, we collect the top-20 documents from the search results after term expansion, and then select from these documents the target-language units that are highly related to the original source-language query. However, how to select these units effectively is a challenging issue. In the literature, researchers have proposed different methods, such as Rocchio's method and Robertson's probabilistic method, to solve this problem. After some experimental comparisons, a simple evaluation method using term frequency and average length is presented in this paper. The evaluation method is defined as follows: the score of a candidate t is N / (1 + D(t)), where D(t) = (1/N) * sum_i D_i(s,t).</Paragraph> <Paragraph position="4"> D(t) represents the average length between the source word s and the target candidate t: the greater the average length, the lower the relevance degree between the source term and the candidate. The purpose of adding 1 to D(t) is to avoid division overflow when the average length is equal to zero.</Paragraph> <Paragraph position="6"> D_i(s,t) denotes the byte distance between the source word and the target candidate in the ith occurrence, and N represents the total number of candidate occurrences in the estimated Web pages. This evaluation method is well suited to discriminating among words with low but equal term frequencies. From the candidates ranked after PRF feedback, the top-5 candidates are selected as our refined expansion items.
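Since the formula is not cleanly reproduced in the source text, the sketch below assumes the form implied by the description: term frequency divided by one plus the average byte distance, with the top-5 candidates kept:

```python
def prf_score(distances):
    """Score one target candidate t. `distances` holds the byte distances
    D_i(s, t) to the source term over the estimated Web pages, so
    len(distances) is the term frequency N. Adding 1 to the average
    distance avoids division by zero, and larger averages lower the score."""
    n = len(distances)
    return n / (1.0 + sum(distances) / n)

def refine_expansions(candidates, top_k=5):
    """candidates: {candidate: [byte distances]}; the top-5 ranked
    candidates become the refined expansion items."""
    return sorted(candidates, key=lambda t: prf_score(candidates[t]),
                  reverse=True)[:top_k]
```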
In the previous example, the refined expansion items are: Kingdoms, Three, Romance, Chinese, Traditional.</Paragraph> <Paragraph position="7"> These refined expansion terms, together with the original query, &quot;San Guo Yan Yi &quot;+(Kingdoms |Three | Romance |Chinese |Traditional), are once more sent to retrieve relevant results, which are treated as the effective Web pages used in the following estimation process.</Paragraph> </Section> </Section> <Section position="6" start_page="201" end_page="202" type="metho"> <SectionTitle> 4 Translation candidate construction and noise solution </SectionTitle> <Paragraph position="0"> The goal of translation candidate construction is to construct and mine all possible translation forms of terms from the Web, and to estimate their feature information, such as frequency and distribution, effectively. In the converted text, we locate the position of the query keyword and then take a 100-byte window centered on it. In this window, each English word serves as a starting index, and string candidates are constructed by growing the string one English word at a time. String candidates are indexed in the database with hashing and binary search. If an item identical to the input candidate already exists, its frequency is increased by 1; otherwise, the candidate is added at this position in the database. After one Web page is handled, the distribution information is also updated. In the implementation, a stop-word table and some heuristic rules about the beginning and end relative to the keyword position are employed to accelerate the counting process.</Paragraph> <Paragraph position="1"> The aim of noise solution is to remove the irrelevant items and redundant information formed in the mining process. These noises are defined in the following two categories.</Paragraph> <Paragraph position="2"> 1) Subset redundancy.
The characteristic is that the item is a substring of another item but has a lower frequency than that item. For example, in &quot;Che Pai Hao :License plate number (6), License plate (5)&quot;, the candidate &quot;License plate&quot; is subset redundancy. Such items should be removed.</Paragraph> <Paragraph position="3"> 2) Affix redundancy. The characteristic is that the item is the prefix or suffix of another item but has a higher frequency than that item. For example, 1. &quot;San Guo Yan Yi : Three Kingdoms (30), Romance of the Three Kingdoms (22), The Romance of Three Kingdoms (7)&quot;, 2. &quot;Lan Chou Gu : Blue Chip (35), Blue Chip Economic Indicators (10)&quot;. In Example 1, the item &quot;Three Kingdoms&quot; is suffix redundancy and should be removed. In Example 2, the term &quot;Blue Chip&quot; fits the definition of prefix redundancy, yet it is a correct translation candidate. Thus, the problem of affix redundancy is complex enough that we need an evaluation method to decide whether to retain or drop the candidate.</Paragraph> <Paragraph position="4"> To deal with subset redundancy and affix redundancy, sort-based subset deletion and mutual information methods are respectively proposed. For more details, refer to our previous paper (Fang et al., 2005).</Paragraph> </Section> <Section position="7" start_page="202" end_page="203" type="metho"> <SectionTitle> 5 Candidate evaluation based on multi-features </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="202" end_page="203" type="sub_section"> <SectionTitle> 5.1 Possible features for translation pairs </SectionTitle> <Paragraph position="0"> By analyzing a large number of Web pages, we obtain the following possible features that have an important influence on term translation mining.
They include: 1) candidate frequency and its distribution over different Web pages, 2) the length ratio between the source term and the target candidate (S-T), 3) the distance between S-T, and 4) keywords, key symbols, and boundary information between S-T.</Paragraph> <Paragraph position="1"> 1) Candidate frequency and its distribution Translation candidate frequency is the most important feature and is the basis of decision-making. Only terms whose frequencies are greater than a certain threshold are further considered as candidates in our system. The distribution feature reflects the occurrences of one candidate across different Web pages. If the distribution is very uniform, the candidate is more likely to be the translation equivalent and receives a greater weight. This also accords with our intuition. For example, the translation candidates of the term &quot;Ren Gu Qi Quan &quot; include &quot;put option&quot; and &quot;short put&quot;, and their frequencies are both 5. However, their distributions are &quot;1, 1, 1, 1, 1&quot; and &quot;2, 2, 1&quot;. The distribution of &quot;put option&quot; is more uniform, so it becomes a translation candidate of &quot;Ren Gu Qi Quan &quot; with a greater weight. 2) Length ratio between S-T The length ratio between S-T should satisfy certain constraints. Only when the word count of a candidate falls within a certain range is the possibility of its being a translation high.</Paragraph> <Paragraph position="2"> To estimate the length-ratio relation between S-T, we gathered statistics over a database of 5800 term translation pairs. For example, when a Chinese term has three characters, i.e., W=3, the probability of an English translation with two words is largest, about P(E=2 |W =3)= 78%, and there are almost no occurrences outside the range 1-4. Thus, different weights can be applied to different candidates using the statistical distribution of the length ratio.
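The length-ratio statistic can be applied as a simple lookup table. In the sketch below, only P(E=2 | W=3) = 0.78 comes from the paper; the remaining entries are hypothetical placeholders:

```python
# P(E = en_words | W = zh_chars); only (3, 2) -> 0.78 is from the paper,
# the other values are illustrative stand-ins.
LENGTH_RATIO_P = {(3, 1): 0.12, (3, 2): 0.78, (3, 3): 0.07, (3, 4): 0.03}

def length_ratio_weight(zh_chars, en_words):
    """p_L(s, t): probability of a candidate's word count given the
    source term's character count; ratios never observed get zero weight."""
    return LENGTH_RATIO_P.get((zh_chars, en_words), 0.0)
```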
The weight contributing to the evaluation function is set according to these estimated probabilities in the experiments.</Paragraph> <Paragraph position="3"> 3) Distance between S-T Intuitively, the longer the distance between S-T, the smaller the probability that they form a translation pair. Using this knowledge, we can alleviate the effect of some noise by applying different weights when collecting possibly correct candidates far from the source term. To estimate the distance between S-T, experiments are carried out on 5800*200 pages with 5800 term pairs, and the statistical results are depicted as the histogram of distances in Figure 4.</Paragraph> <Paragraph position="4"> In the figure, a negative value indicates that the English translation is located in front of the Chinese term, and a positive value indicates that it is behind the Chinese term. As the figure shows, most candidates are distributed within the range of -60 to 60 bytes, with few occurrences outside this range. The numbers of translations appearing before and after the term are nearly equal. The curve resembles a Gaussian probability distribution, so a Gaussian model is proposed to fit it. By curve fitting, the parameters of the Gaussian model are obtained, i.e., u=1 and sigma=2. Thus, the contribution probability of the distance to the ranking function is formulated as p_D(i,j) = (1/(sqrt(2*pi)*sigma)) * exp(-(d(i,j)-u)^2/(2*sigma^2)), where d(i,j) represents the byte distance between the source term i and the candidate j.</Paragraph> <Paragraph position="5"> 4) Keywords, key symbols and boundary information between S-T Some Chinese keywords or capital English abbreviation letters between S-T can provide an important clue for acquiring possibly correct translations. These Chinese keywords include words such as &quot;Zhong Wen Jiao , Zhong Wen Yi Wei , Zhong Wen Ming Cheng , Zhong Wen Ming Cheng Wei , Zhong Wen Cheng Wei , Huo Cheng Wei , You Cheng Wei , Ying Wen Jiao , Ying Wen Ming Wei , Ying Wen Cheng Wei , Ying Wen Quan Cheng &quot;.
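The fitted distance weight can be sketched as a Gaussian density with the reported parameters u=1 and sigma=2; the normalization constant is an assumption, since the exact formula is garbled in the source text:

```python
import math

U, SIGMA = 1.0, 2.0  # Gaussian parameters obtained by curve fitting

def distance_weight(d):
    """Contribution probability for a byte distance d between the source
    term and a candidate: a Gaussian bump around U, so candidates far
    from the term receive small weights."""
    return (math.exp(-((d - U) ** 2) / (2 * SIGMA ** 2))
            / (math.sqrt(2 * math.pi) * SIGMA))
```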
The punctuation between S-T can also provide very strong constraints; for example, when the marks &quot;( )( ) [ ]&quot; are present, the probability of a translation pair greatly increases. Thus, judging these cases correctly not only makes the translation mining results more comprehensive but also increases the possibility that the candidate is one of the correct translations.</Paragraph> <Paragraph position="6"> Boundary information refers to the fact that the context of candidates on the Web carries distinct marks, for example, the position of a transition from continuous Chinese to English, a place with a bracket ellipsis, or independent units in the HTML text.</Paragraph> </Section> <Section position="2" start_page="203" end_page="203" type="sub_section"> <SectionTitle> 5.2 Candidate evaluation method </SectionTitle> <Paragraph position="0"> After translation noise handling, we evaluate candidate translations so that probable candidates receive higher scores. A method using the weighted combination of multiple features, including candidate frequency, distribution, length ratio, distance, keywords, key symbols, and boundary information between S-T, is proposed to rank the candidates.</Paragraph> <Paragraph position="1"> The evaluation method is formulated as Score(t) = p_L(s,t) * sum_{i=1..N} sum_j (lambda * p_D(i,j) + w * delta(i,j)). The bigger these component values are, the more they contribute to the whole evaluation formula, and correspondingly the higher the candidate's score. The length-ratio relation p_L(s,t) reflects the proportion relation between S-T as a whole, so its weight acts on Score(t) from the macro view. The weights are trained on a large number of technical terms and proper nouns, where each relation corresponds to one probability. N denotes the total number of Web pages that contain candidates, and partly reflects the distribution of candidates over different Web pages. The greater N is, the greater Score(t) becomes.
The distance relation p_D(i,j) is defined as the distance contribution probability of the jth source-candidate pair on the ith Web page, which acts on every word pair found on the Web from the micro view. Its calculation formula is defined in Section 5.1. The lambda represents the weight given to counting the nearest-distance occurrence on each Web page. w * delta(i,j) is the contribution of keywords, key symbols, and boundary information. If predefined keywords, key symbols, or boundary information appear between S-T, i.e., delta(i,j) = 1, the evaluation formula gives a reward w; otherwise, delta(i,j) = 0 indicates that there is no impact on the whole equation.</Paragraph> </Section> </Section> </Paper>