<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1081"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Concept Unification of Terms in Different Languages for IR</Title> <Section position="3" start_page="0" end_page="641" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The mixed use of English and local languages presents a classical vocabulary mismatch problem in monolingual information retrieval (MIR). The problem is especially significant for Asian languages, because words in the local language are often mixed with English words. Although English terms and their equivalents in a local language refer to the same concept, traditional MIR erroneously treats them as independent index units. Such separation of semantically identical words in different languages may limit retrieval performance. For instance, as shown in Figure 1, there are three kinds of Chinese Web pages containing information related to &quot;Viterbi Algorithm (Wei Te Bi Suan Fa)&quot;. The first contains &quot;Viterbi Algorithm&quot; but not its Chinese equivalent &quot;Wei Te Bi Suan Fa&quot;. The second contains &quot;Wei Te Bi Suan Fa&quot; but not &quot;Viterbi Algorithm&quot;. The third contains both. A user would expect a query with either &quot;Viterbi Algorithm&quot; or &quot;Wei Te Bi Suan Fa&quot; to retrieve all three groups of Chinese Web pages; otherwise, some potentially useful information will be missed.</Paragraph> <Paragraph position="1"> Furthermore, one English term may have several corresponding terms in another language. For instance, the Korean words &quot;dijital&quot;, &quot;dijiteul&quot;, and &quot;dijiteol&quot; all appear in local Web pages; they all correspond to the English word &quot;digital&quot; but take different forms because of different phonetic interpretations. Establishing an equivalence class among the three Korean words and their English counterpart is indispensable. 
With such an equivalence class, even when the query is &quot;dijital&quot;, the Web pages containing &quot;dijiteul&quot;, &quot;dijiteol&quot; or &quot;digital&quot; can all be retrieved. The same holds for Chinese terms. For example, two semantically identical Chinese terms, &quot;Wei Te Bi&quot; and &quot;Wei Te Bi&quot;, correspond to one English term, &quot;Viterbi&quot;. There should be a semantic equivalence relation between them.</Paragraph> <Paragraph position="2"> Although tracing the original English term from a term in a native language by back transliteration (Jeong et al., 1999) is a good way to build such a mapping, it applies only to words that are amenable to phoneme-based transliteration. It is difficult to extend the method to abbreviations and compound words.</Paragraph> <Paragraph position="3"> Since English abbreviations frequently appear in Korean and Chinese texts, such as &quot;segyemuyeoggigu (WTO)&quot; in Korean and &quot;Shi Jie Mao Yi Zu Zhi (WTO)&quot; in Chinese, it is essential in IR to have a mapping between these English abbreviations and the corresponding local words. The same applies to compound words such as &quot;seouldae (Seoul National University)&quot; in Korean and &quot;Feng Niu Bing (mad cow disease)&quot; in Chinese. Recognizing the limitations of transliteration, we present a way to extract the key English phrases in local Web pages and conceptually unify them with their semantically identical terms in the local language.</Paragraph> </Section> </Paper>