<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1081">
  <Title>Concept Unification of Terms in Different Languages for IR</Title>
  <Section position="7" start_page="1234" end_page="1234" type="evalu">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> Although the technique we developed has values in their own right and can be applied for other language engineering fields such as query translation for CLIR, we intend to understand to what extent monolingual information retrieval effectiveness can be increased when relevant terms in different language are treated as one unit while indexing. We first examine the translation precision and then study the impact of our approach for monolingual IR.</Paragraph>
    <Paragraph position="1"> We crawls the web pages of a specific domain (university &amp; research) by WIRE crawler provided by center of Web Research, university of Chile (http://www.cwr.cl/projects/WIRE/). Currently, we have downloaded 32 sites with 5,847  Korean Web pages and 74 sites with 13,765 Chinese Web pages. 232 and 746 English terms were extracted from Korean Web pages and Chinese Web pages, respectively. The accuracy of unifying semantically identical words in different languages is dependant on the translation performance. The translation results are shown in table 1. As it can be observed, 77% of English terms from Korean web pages and 83% of English terms from Chinese Web pages can be strictly translated into accurate Korean and Chinese, respectively. However, additional 15% and 14% translations contained at least one Korean and Chinese translations, respectively. The errors were brought in by containing additional related information or incomplete translation. For instance, the English term &amp;quot;blue chip&amp;quot; is translated into &amp;quot;Lan Xin (blue chip)&amp;quot;, &amp;quot;Lan Chou Gu (a kind of stock)&amp;quot;. However, another acceptable translation &amp;quot;Ji You Gu (a kind of stock)&amp;quot; is ignored. An example for incomplete translation is English phrase &amp;quot; SIGIR 2005&amp;quot; which only can be translate into &amp;quot;Guo Ji Ji Suan Ji Jian Suo Nian Hui (international conference of computer information retrieval&amp;quot; ignoring the year.</Paragraph>
    <Paragraph position="2">  We also compare our approach with two well-known translation systems. We selected 200 English words and translate them into Chinese and Korean by these systems. Table2 and Table 3 show the results in terms of the top 1, 3, 5 inclusion rates for Korean and Chinese translation, respectively. &amp;quot;Exactly and incomplete&amp;quot; translations are all regarded as the right translations. &amp;quot;LiveTrans&amp;quot; and &amp;quot;Google&amp;quot; represent the systems against which we compared the translation ability. Google provides a machine translation function to translate text such as Web pages. Although it works pretty well to translate sentences, it is ineligible for short terms where only a little contextual information is available for translation.</Paragraph>
    <Paragraph position="3"> LiveTrans (Cheng et al., 2004) provided by the WKD lab in Academia Sinica is the first unknown word translation system based on webmining. There are two ways in this system to translate words: the fast one with lower precision is based on the &amp;quot;chi-square&amp;quot; method (  kh ) and the smart one with higher precision is based on &amp;quot;context-vector&amp;quot; method (CV) and &amp;quot;chi-square&amp;quot; method (  kh ) together. &amp;quot;ST&amp;quot; and &amp;quot;ST+PS&amp;quot; represent our approaches based on statistic model and statistic model plus phonetic and semantic model, respectively.</Paragraph>
    <Paragraph position="4">  kh ) in both Table 2 and 3, the same doesn't hold for each individual. For instance, &amp;quot;Jordan&amp;quot; is the English translation of Korean term &amp;quot;yoreudan&amp;quot;, which ranks 2nd and  kh +CV), respectively. The context-vector sometimes misguides the selection. In our two-step selection approach, the final selection would not be diverted by the false statistic information. In addition, in order to examine the contribution of distance information in the statistical method, we ran our experiments based on statistical method (ST) with two different conditions. In the first case, we set (, ) ki dqc to 1, that is, the location information of all candidates is ignored. In the second case, (, ) ki dqc is calculated based on the real textual distance of the candidates. As in both Table 2 and Table 3, the later case shows better performance.</Paragraph>
    <Paragraph position="5"> As shown in both Table 2 and Table 3, it can be observed that &amp;quot;ST+PS&amp;quot; shows the best performance, then followed by &amp;quot;LiveTrans (smart)&amp;quot;, &amp;quot;ST&amp;quot;, &amp;quot;LiveTrans(fast)&amp;quot;, and &amp;quot;Google&amp;quot;. The sta- null tistical methods seem to be able to give a rough estimate for potential translations without giving high precision. Considering the contextual words surrounding the candidates and the English phrase can further improve the precision but still less than the improvement made by the phonetic and semantic information in our approach. High precision is very important to the practical application of the translation results. The wrong translation sometimes leads to more damage to its later application than without any translation available. For instance, the Chinese translation of &amp;quot;viterbi&amp;quot; is &amp;quot;Suan Fa (algorithm)&amp;quot; by LiveTrans (fast). Obviously, treating &amp;quot;Viterbi&amp;quot; and &amp;quot;Suan Fa (algorithm)&amp;quot;as one index unit is not acceptable. We ran monolingual retrieval experiment to examine the impact of our concept unification on IR. The retrieval system is based on the vector space model with our own indexing scheme to which the concept unification part was added.</Paragraph>
    <Paragraph position="6"> We employed the standard tf idfx scheme for index term weighting and idf for query term weighting. Our experiment is based on KT-SET test collection (Kim et al., 1994). It contains 934 documents and 30 queries together with relevance judgments for them.</Paragraph>
    <Paragraph position="7"> In our index scheme, we extracted the key English phrases in the Korean texts, and translated them. Each English phrases and its equivalence(s) in Korean is treated as one index unit. The baseline against which we compared our approach applied a relatively simple indexing technique. It uses a dictionary that is Korean-English WordNet, to identify index terms. The effectiveness of the baseline scheme is comparable with other indexing methods (Lee and Ahn, 1999). While there is a possibility that an indexing method with a full morphological analysis may perform better than our rather simple method, it would also suffer from the same problem, which can be alleviated by concept unification approach. As shown in Figure 3, we obtained 14.9 % improvement based on mean average 11-pt precision. It should be also noted that this result was obtained even with the errors made by the unification of semantically identical terms in different languages.</Paragraph>
  </Section>
</Paper>