<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1081"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Concept Unification of Terms in Different Languages for IR</Title> <Section position="4" start_page="641" end_page="641" type="metho"> <SectionTitle> 2 Concept Unification </SectionTitle> <Paragraph position="0"> The essence of concept unification of terms in different languages is similar to that of query translation for cross-language information retrieval (CLIR), which has been widely explored (Cheng et al., 2004; Cao and Li, 2002; Fung et al., 1998; Lee, 2004; Nagata et al., 2001; Rapp, 1999; Zhang et al., 2005; Zhang and Vine, 2004).</Paragraph> <Paragraph position="1"> For concept unification in the index, key English phrases must first be extracted from local Web pages. After they are translated into the local language, each English phrase and its translation(s) are treated as the same index unit for IR. Unlike previous work on query term translation, which aims at finding relevant terms in another language for the target term in the source language, conceptual unification requires high translation precision. Although fuzzy Chinese translations (e.g., Bing Du (virus), Chen Ying Hao (designer's name), Dian Nao Bing Du (computer virus)) of the English term &quot;CIH&quot; can enhance CLIR performance through the &quot;query expansion&quot; gain (Cheng et al., 2004), they do not work for the conceptual unification of terms in different languages for IR.</Paragraph> <Paragraph position="2"> While there are many additional sources that can be utilized for phrase translation (e.g., anchor text, parallel or comparable corpora), we resort to mixed-language Web pages, which are local Web pages containing some English words, because they are easily obtainable and frequently refreshed. 
Because English words sometimes appear together with their equivalents in a local language in Web texts, as shown in Figure 1, it is possible to mine the mixed-language search-result pages obtained from Web search engines and extract proper translations for these English words, which are treated as queries. Given the linguistic nature of Chinese and Korean, we integrate the phoneme and semanteme rather than relying on statistical information alone to pick out the right translation from the search-result pages.</Paragraph> </Section> <Section position="5" start_page="641" end_page="641" type="metho"> <SectionTitle> 3 Key Phrase Extraction </SectionTitle> <Paragraph position="0"> Since our intention is to unify semantically identical words in different languages and index them together, the primary task is to decide which kinds of key English phrases in local Web pages need to be conceptually unified.</Paragraph> <Paragraph position="1"> Jeong et al. (1999) extract Korean foreign words for concept unification based on statistical information. Some of the English equivalents of these Korean foreign words, however, may not exist in Korean Web pages.</Paragraph> <Paragraph position="2"> Therefore, it is meaningless to perform cross-language concept unification for these words.</Paragraph> <Paragraph position="3"> The English equivalent would not benefit retrieval performance, since no local Web pages contain it, even if the search system builds a semantic class spanning both the local language and English for these words. In addition, the method for detecting Korean foreign words may introduce some noise. 
The Korean terms detected as foreign words are sometimes not meaningful.</Paragraph> <Paragraph position="4"> Therefore, we proceed the other way around, choosing English phrases from the local Web pages based on certain selection criteria.</Paragraph> <Paragraph position="5"> Instead of extracting all the English phrases in the local Web pages, we select only the English phrases that occur within special marks, including quotation marks and parentheses, because English phrases within these marks reveal, to some extent, their significance in information searching. In addition, if a phrase starts with certain stemming words (e.g., for, as) or includes a special sign, it is excluded from the phrases to be translated.</Paragraph> </Section> <Section position="6" start_page="641" end_page="1234" type="metho"> <SectionTitle> 4 Translation of English Phrases </SectionTitle> <Paragraph position="0"> To translate the extracted English phrases, we query the search engine with each English phrase to retrieve the local Web pages containing it.</Paragraph> <Paragraph position="1"> For each document returned, only the title and the query-biased summary are kept for further analysis. We extract the translation(s) of the English phrases from these collected documents.</Paragraph> <Section position="1" start_page="641" end_page="1234" type="sub_section"> <SectionTitle> 4.1 Extraction of Candidates for Selection </SectionTitle> <Paragraph position="0"> After querying the search engine with the English phrase, we obtain the snippets (title and summary) of Web texts in the returned search-result pages, as shown in Figure 1. The next step is to extract translation candidates from a window of limited size that includes the English phrase in these snippets. 
Because of the agglutinative nature of the Chinese and Korean languages, we should group the words in the local language into proper units as translation candidates instead of treating each individual word as a candidate. There are two typical ways: one is to group the words based on their co-occurrence information in the corpus (Cheng et al., 2004), and the other is to employ all sequential combinations of the words as candidates (Zhang and Vine, 2004). Although the first reduces the number of candidates, it risks losing the right combination of words. We adopt the second in our approach. Returning to the example in Figure 1, if there are three Chinese characters (Wei Te Bi) within the pre-defined window, the translation candidates for the English phrase &quot;Viterbi&quot; are &quot;Wei&quot;, &quot;Te&quot;, &quot;Bi&quot;, &quot;Wei Te&quot;, &quot;Te Bi&quot;, and &quot;Wei Te Bi&quot;. The number of candidates in the second method, however, increases greatly as the window size k is enlarged. Since the number of words, n, available within the window size k is generally larger than the predefined maximum candidate length, m, it is unreasonable to use all adjacent sequential combinations of the available words within the window. Therefore, we tune the method as follows: 1. If n ≤ m, all adjacent sequential combinations of words within the window are treated as candidates. 2. If n > m, only adjacent sequential combinations of which the word number is less than m are regarded as candidates. For example, if we set n to 4 and m to 2, the window &quot;w1 w2 w3 w4&quot; consists of four words. 
Therefore, only &quot;</Paragraph> <Paragraph position="2"> employed as the candidates for final translation selection.</Paragraph> <Paragraph position="3"> Based on our experiments, this tuning method achieves the same performance while greatly reducing the number of candidates.</Paragraph> </Section> <Section position="2" start_page="1234" end_page="1234" type="sub_section"> <SectionTitle> 4.2 Selection of candidates </SectionTitle> <Paragraph position="0"> The final step is to select the proper candidate(s) as the translation(s) of the key English phrase.</Paragraph> <Paragraph position="1"> We present a method that considers the statistical, phonetic and semantic features of the English candidates for selection.</Paragraph> <Paragraph position="2"> Statistical information such as co-occurrence, Chi-square, and mutual information between the English term and the candidates helps distinguish the right translation(s). Using Cheng's Chi-square method (Cheng et al., 2004), the probability of finding the right translation for a specific English term is around 30% in the top-1 case and 70% in the top-5 case. Since our goal is to find the counterpart(s) of the English phrase and treat them as one index unit in IR, this accuracy level is not satisfactory. Because it seems difficult to improve the precision solely through variants of statistical methods, we also consider the semantic and phonetic information of candidates in addition to the statistical information. For example, given the English key phrase &quot;Attack of the clones&quot;, the right Korean translation &quot;keulronyiseubgyeog&quot; is far from the top 10 selected by the Chi-square method (Cheng et al., 2004). However, based on the semantic match between &quot;seubgyeog&quot; and &quot;Attack&quot; and the phonetic match between &quot;keulron&quot; and &quot;clones&quot;, we can safely infer that it is the right translation. 
The same rule applies to the Chinese translation &quot;Ke Long Ren De Jin Gong&quot;, where &quot;Ke Long Ren&quot; is a phonetic match for &quot;clones&quot; and &quot;Jin Gong&quot; semantically corresponds to &quot;attack&quot;.</Paragraph> <Paragraph position="3"> In the selection step, we first remove most of the noise candidates using the statistical method and then re-rank the remaining candidates by semantic and phonetic similarity.</Paragraph> </Section> <Section position="3" start_page="1234" end_page="1234" type="sub_section"> <SectionTitle> 4.3 Statistical model </SectionTitle> <Paragraph position="0"> There are several statistical models for ranking the candidates. Nagata (2001) and Huang (2005) rank the candidates using, respectively, the frequency of co-occurrence and the textual distance, i.e., the number of words between the key phrase and the candidates in the text. Although the details of the methods differ considerably, both share the assumption that the more frequently a candidate co-occurs with the key phrase, the more likely they are to be the right translations of each other. In addition, they observed that most of the right translations of the key phrase are close to it in the text, especially right after or before it (e.g. &quot; ...</Paragraph> <Paragraph position="1"> yeonbangsusagug(FBI)i...&quot;). Zhang (2004) suggested a statistical model based on the frequency of co-occurrence and the length of the candidates.</Paragraph> <Paragraph position="2"> Because this model does not consider the distance between the key phrase and a candidate, a right translation located far from the key phrase also has a chance of being selected. We observe, however, that such cases are very rare in our study, and most of the right translations are located within 5 to 8 words of the key phrase. 
The distance information is therefore a valuable factor to consider.</Paragraph> <Paragraph position="3"> In our statistical model, we consider the frequency, length and location of candidates together. The intuition is that if a candidate is the right translation, it tends to co-occur with the key phrase frequently and to be located close to the key phrase, and the longer the candidate, the higher its chance of being the right translation. The formula to calculate the ranking score for a candidate is as follows:</Paragraph> <Paragraph position="5"> $d_k(q, c)$ is the word distance between the English phrase q and the candidate</Paragraph> <Paragraph position="7"> c in the k-th occurrence of the candidate in the search-result pages. If q is adjacent to</Paragraph> <Paragraph position="9"> c, the word distance is one. If there is one word between them, it is counted as two, and so forth. a is the coefficient constant, and max Freq len[?] is the max reciprocal of</Paragraph> <Paragraph position="11"/> </Section> <Section position="4" start_page="1234" end_page="1234" type="sub_section"> <SectionTitle> 4.4 Phonetic and semantic model </SectionTitle> <Paragraph position="0"> Phonetic and semantic match: There has been some related work on extracting term translations based on the transliteration model (Kang and Choi, 2002; Kang and Kim, 2000). Unlike transliteration, which attempts to generate an English transliteration given a foreign word in the local language, our approach is a kind of matching problem, since we already have the candidates and aim at selecting the right ones as the final translation(s) for the English key phrase.</Paragraph> <Paragraph position="1"> While the transliteration method is partially successful, it suffers from the problem that transliteration rules are not applied consistently. The English key phrase whose translation we are looking for sometimes contains several words that may appear in a dictionary as independent units. 
Therefore, it can only be partially matched based on phonetic similarity, and the remaining part may be matched by semantic similarity in such a situation. Returning to the above example, &quot;clone&quot; is matched with &quot;keulron&quot; by phonetic similarity, while &quot;of&quot; and &quot;attack&quot; are matched with &quot;yi&quot; and &quot;seubgyeog&quot;, respectively, by semantic similarity. The objective is to find a set of mappings between the English word(s) in the key phrase and the local-language word(s) in a candidate that maximizes the sum of the semantic and phonetic mapping weights. We call this sum the SSP (score of semanteme and phoneme). The higher the SSP value, the higher the probability that the candidate is the right translation.</Paragraph> <Paragraph position="2"> The solution to this maximization problem can be found using exhaustive search.</Paragraph> <Paragraph position="3"> However, the complexity is very high in practice when a large number of pairs must be processed. As shown in Figure 2, the problem can be represented as a weighted bipartite graph matching problem. Let the English key phrase, E, be represented as a sequence of tokens $\langle ew_1, ew_2, \ldots \rangle$, and the candidate, C, as a sequence of tokens $\langle cw_1, cw_2, \ldots \rangle$.</Paragraph> <Paragraph position="4"> Each English and candidate token is represented as a graph vertex. An edge $(ew_i, cw_j)$</Paragraph> <Paragraph position="6"> carries the weight $\omega(ew_i, cw_j)$, calculated as the average of the normalized semantic and phonetic values, whose calculation details are explained below. In order to balance the number of vertices on both sides, we add virtual vertices with zero weight on the side with fewer vertices. The SSP is calculated as $$\mathrm{SSP} = \max_{p} \sum_{i=1}^{n} \omega(ew_i, cw_{p(i)})$$</Paragraph> <Paragraph position="8"> where p is a permutation of {1, 2, 3, ..., n}. 
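To make the SSP computation concrete, the following is a minimal sketch rather than the authors' implementation. It uses exhaustive search over permutations (not the Kuhn-Munkres algorithm) because that is the shortest way to show the objective, so it is only practical for very small phrases. The function name `ssp` and all weight values are hypothetical; each weight stands in for the averaged, normalized phonetic and semantic value of one edge.

```python
from itertools import permutations


def ssp(weights):
    """Return the largest total mapping weight for a weight matrix.

    weights[i][j] is an assumed edge weight between English token i and
    candidate token j. Rows and columns need not be equal in number.
    """
    rows = len(weights)
    cols = max((len(row) for row in weights), default=0)
    n = max(rows, cols)
    # Pad with zero-weight virtual vertices so both sides have n vertices.
    square = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            square[i][j] = w
    # Exhaustive search over every permutation p of range(n).
    return max(
        (sum(square[i][p[i]] for i in range(n)) for p in permutations(range(n))),
        default=0.0,
    )


# Hypothetical weights for "attack of the clones" against the candidate
# tokens "keulron", "yi", "seubgyeog" (columns, in that order).
example = [
    [0.0, 0.0, 0.9],  # attack
    [0.0, 0.8, 0.0],  # of
    [0.0, 0.0, 0.0],  # the
    [0.7, 0.0, 0.0],  # clones
]
print(round(ssp(example), 2))  # prints 2.4
```

In this example the unmatched English token "the" is paired with the zero-weight virtual vertex, so it adds nothing to the score. For realistic sizes a polynomial-time assignment solver would replace the permutation loop.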
This maximization can be solved by the Kuhn-Munkres algorithm (also known as the Hungarian algorithm) with polynomial time complexity (Munkres, 1957).</Paragraph> <Paragraph position="10"> If two languages have a close linguistic relationship, such as English and French, cognate matching (Davis, 1997) is typically employed to translate otherwise untranslatable terms. Interestingly, Buckley et al. (2000) point out that &quot;English query words are treated as potentially misspelled French words&quot; and attempt to treat English words as variations of French words according to lexicographical rules. However, when two languages are very distinct, e.g., English-Korean or English-Chinese, transliteration from English words is utilized for cognate matching.</Paragraph> <Paragraph position="11"> The phonetic weight is the transliteration probability between the English phrase and a candidate in the local language. We adopt the method of Jeong et al. (1999) with some adjustments. In essence, we compute the probabilities of particular English ee, and the candidate in the local language is comprised of a string of phonetic elements.</Paragraph> <Paragraph position="13"> cc. For Korean, the phonetic elements are Korean letters such as &quot;g&quot;, &quot;i&quot;, &quot;r&quot;, &quot;h&quot;, etc. For Chinese, the phonetic elements are the elements of pinyin.</Paragraph> <Paragraph position="15"> g is a pronunciation unit comprised of one or more English letters (e.g., 'ss' for 's', a Korean letter). The first term in the product corresponds to the transition probability between two states in the HMM and the second term to the output probability for each possible output that could correspond to the state, where the states are all possible distinct English pronunciation units for the given Korean or Chinese word. 
Because the difference between the Korean/Chinese and English phonetic systems makes the above uni-gram model almost impractical in terms of output quality, bi-grams are used in place of the single letters in the above equation. Therefore, the phonetic weight should be calculated as:</Paragraph> <Paragraph position="17"> is computed from the training corpus as the ratio between the fre-</Paragraph> <Paragraph position="19"> is substituted with a space marker.</Paragraph> <Paragraph position="20"> The semantic weight is calculated from the bilingual dictionary. The bilingual dictionaries we currently employ for the local languages are the Korean-English WordNet and the LDC Chinese-English dictionary, with additional entries inserted manually. The weight relies on the degree of overlap between an English translation and the candidate semanteme: $$w(E, C) = \arg\max \frac{\text{No. of overlapping units}}{\text{total No. of units}}$$ For example, given the English phrase &quot;Inha University&quot; and its candidate &quot;inhadae&quot; (Inha University), &quot;University&quot; is translated into &quot;daehaggyo&quot;; therefore, the semantic weight between &quot;University&quot; and &quot;dae&quot; is about 0.33, because only one third of the full translation is available in the candidate.</Paragraph> <Paragraph position="21"> Because phonetic and semantic weights differ in range, we normalize them by dividing by the maximum phonetic and semantic weights, respectively, in each pair of an English phrase and a candidate, if the maximum is larger than zero.</Paragraph> <Paragraph position="22"> Our strategy for picking the final translation(s) differs from others in two respects. If the SSP values of all candidates are less than the threshold, the top candidate obtained by the statistical model is selected as the final translation. Otherwise, we re-rank the candidates according to their SSP values. 
Then we look down through the new ranked list and draw a &quot;virtual&quot; line where there is a big jump in SSP value. If there is no big jump, the &quot;virtual&quot; line is drawn at the bottom of the new ranked list. Instead of only the top-1 candidate, all candidates above the &quot;virtual&quot; line are selected as the final translations, because an English phrase may have more than one correct translation in the local language. Returning to the previous example, the English term &quot;Viterbi&quot; corresponds to two Chinese translations, &quot;Wei Te Bi&quot; and &quot;Wei Te Bi&quot;. The candidate list based on the statistical information is &quot;Bian Ma, Suan Fa, Yi Ma, Wei Te Bi, ..., Wei Te Bi&quot;. We then calculate the SSP values of these candidates and re-rank the candidates whose SSP values are larger than the threshold, which we set to 0.3.</Paragraph> <Paragraph position="23"> Since the SSP values of &quot;Wei Te Bi&quot; (0.91) and &quot;Wei Te Bi&quot; (0.91) are both larger than the threshold and there is no big jump between them, both are selected as the final translations.</Paragraph> </Section> </Section> </Paper>
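The final selection strategy described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code. The function name `select_translations`, the `jump` parameter (the source does not say how large a "big jump" is), the candidate labels, and the SSP values of the three non-matching candidates are all assumptions; only the 0.3 threshold and the 0.91 scores come from the text.

```python
def select_translations(stat_ranked, ssp_scores, threshold=0.3, jump=0.3):
    """Pick final translations from statistically ranked candidates.

    stat_ranked: candidates in statistical-model order, best first.
    ssp_scores:  dict mapping each candidate to its SSP value.
    jump:        assumed size of a "big jump" in SSP (not given in the text).
    """
    if not stat_ranked:
        return []
    # Keep only candidates whose SSP value exceeds the threshold.
    kept = [c for c in stat_ranked if ssp_scores.get(c, 0.0) > threshold]
    if not kept:
        # No candidate passes: fall back to the statistical top-1.
        return [stat_ranked[0]]
    # Re-rank by SSP, highest first; the sort is stable, so ties keep
    # their statistical order.
    kept.sort(key=lambda c: ssp_scores[c], reverse=True)
    selected = [kept[0]]
    for prev, cur in zip(kept, kept[1:]):
        if ssp_scores[prev] - ssp_scores[cur] >= jump:
            break  # the "virtual" line is drawn here
        selected.append(cur)
    return selected


ranked = ["bian ma", "suan fa", "yi ma", "viterbi_a", "viterbi_b"]
scores = {"bian ma": 0.1, "suan fa": 0.05, "yi ma": 0.2,
          "viterbi_a": 0.91, "viterbi_b": 0.91}
print(select_translations(ranked, scores))  # ['viterbi_a', 'viterbi_b']
```

Here `viterbi_a` and `viterbi_b` stand in for the two Chinese translations of "Viterbi", which share one romanization in the text. Both pass the threshold and no jump separates them, so both are returned.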