Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 641–648,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Concept Unification of Terms in Different Languages for IR 
 
Qing Li,  Sung-Hyon Myaeng 
Information & Communications 
University, Korea 
{liqing,myaeng}@icu.ac.kr
Yun Jin  
Chungnam National 
University, Korea 
wkim@cnu.ac.kr 
Bo-yeong Kang 
Seoul National University, 
Korea 
comeng99@snu.ac.kr 
 
  
 
Abstract 
Due to the historical and cultural reasons, 
English phases, especially the proper 
nouns and new words, frequently appear 
in Web pages written primarily in Asian 
languages such as Chinese and Korean. 
Although these English terms and their 
equivalences in the Asian languages refer 
to the same concept, they are erroneously 
treated as independent index units in tra-
ditional Information Retrieval (IR). This 
paper describes the degree to which the 
problem arises in IR and suggests a novel 
technique to solve it. Our method firstly 
extracts an English phrase from Asian 
language Web pages, and then unifies the 
extracted phrase and its equivalence(s) in 
the language as one index unit. Experi-
mental results show that the high preci-
sion of our conceptual unification ap-
proach greatly improves the IR perform-
ance. 
1 Introduction 
The mixed use of English and local languages 
presents a classical problem of vocabulary mis-
match in monolingual information retrieval 
(MIR). The problem is significant especially in 
Asian language because words in the local lan-
guages are often mixed with English words. Al-
though English terms and their equivalences in a 
local language refer to the same concept, they are 
erroneously treated as independent index units in 
traditional MIR. Such separation of semantically 
identical words in different languages may limit 
retrieval performance. For instance, as shown in 
Figure 1, there are three kinds of Chinese Web 
pages containing information related with 
“Viterbi Algorithm (韦特比算法 )”.  The first 
case contains “Viterbi Algorithm” but not its 
Chinese equivalence “韦特比算法 ”. The second 
 
 
Figure 1.  Three Kinds of Web Pages 
contains “韦特比算法 ” but not “Viterbi Algo-
rithm”. The third has both of them. A user would 
expect that a query with either “Viterbi Algo-
rithm” or “韦特比算法 ” would retrieve all of 
these three groups of Chinese Web pages. Oth-
erwise some potentially useful information will 
be ignored.  
Furthermore, one English term may have sev-
eral corresponding terms in a different language. 
For instance, Korean words “디지탈”, “디지틀”, 
and “디지털” are found in local Web pages, 
which all correspond to the English word “digi-
tal” but are in different forms because of differ-
ent phonetic interpretations. Establishing an 
equivalence class among the three Korean words 
and the English counterpart is indispensable. By 
doing so, although the query is “디지탈”, the 
Web pages containing “디지틀”, “디지털” or 
“digital” can be all retrieved. The same goes to 
Chinese terms. For example, two same semantic 
Chinese terms “维特比 ” and “韦特比 ” corre-
spond to one English term “Viterbi”. There 
should be a semantic equivalence relation be-
tween them. 
Although tracing the original English term 
from a term in a native language by back trans-
literation (Jeong et al., 1999) is a good way to 
build such mapping, it is only applicable to the 
words that are amenable for transliteration based 
on the phoneme. It is difficult to expand the 
method to abbreviations and compound words. 
641
Since English abbreviations frequently appear in 
Korean and Chinese texts, such as 
“세계무역기구 (WTO)” in Korean, “世界贸易
组织  (WTO)” in Chinese, it is essential in IR to 
have a mapping between these English abbrevia-
tions and the corresponding words. The same 
applies to the compound words like “서울대 
(Seoul National University)” in Korean, “疯牛病
(mad cow disease)” in Chinese. Realizing the 
limitation of the transliteration, we present a way 
to extract the key English phrases in local Web 
pages and conceptually unify them with their 
semantically identical terms in the local language.  
2 Concept Unification 
The essence of the concept unification of terms 
in different languages is similar to that of the 
query translation for cross-language information 
retrieval (CLIR) which has been widely explored 
(Cheng et al., 2004; Cao and Li, 2002; Fung et 
al., 1998; Lee, 2004; Nagata et al., 2001; Rapp, 
1999; Zhang et al., 2005; Zhang and Vine, 2004).  
For concept unification in index, firstly key Eng-
lish phrases should be extracted from local Web 
pages. After translating them into the local lan-
guage, the English phrase and their translation(s) 
are treated as the same index units for IR. Differ-
ent from previous work on query term translation 
that aims at finding relevant terms in another 
language for the target term in source language, 
conceptual unification requires a high translation 
precision. Although the fuzzy Chinese transla-
tions (e.g. “ 病毒 (virus), 陈盈豪  (designer’s 
name), 电脑病毒  (computer virus)) of  English 
term “CIH” can enhance the CLIR performance 
by the “query expansion” gain (Cheng et al., 
2004), it does not work in the conceptual unifica-
tion of terms in different languages for IR.   
While there are lots of additional sources to be 
utilized for phrase translation (e.g., anchor text, 
parallel or comparable corpus), we resort to the 
mixed language Web pages which are the local 
Web pages with some English words, because 
they are easily obtainable and frequently self-
refresh.  
Observing the fact that English words some-
times appear together with their equivalence in a 
local language in Web texts as shown in Figure 1, 
it is possible to mine the mixed language search-
result pages obtained from Web search engines 
and extract proper translations for these English 
words that are treated as queries. Due to the lan-
guage nature of Chinese and Korean, we inte-
grate the phoneme and semanteme instead of 
statistical information alone to pick out the right 
translation from the search-result pages. 
3 Key Phrase Extraction 
Since our intention is to unify the semantically 
identical words in different languages and index 
them together, the primary task is to decide what 
kinds of key English phrases in local Web pages 
are necessary to be conceptually unified.  
In (Jeong et al., 1999), it extracts the Korean 
foreign words for concept unification based on 
statistical information. Some of the English 
equivalences of these Korean foreign words, 
however, may not exist in the Korean Web pages. 
Therefore, it is meaningless to do the cross-
language concept unification for these words. 
The English equivalence would not benefit any 
retrieval performance since no local Web pages 
contain it, even if the search system builds a se-
mantic class among both local language and 
English for these words. In addition, the method 
for detecting Korean foreign words may bring 
some noise. The Korean terms detected as for-
eign words sometimes are not meaningful. 
Therefore, we do it the other way around by 
choosing the English phrases from the local Web 
pages based on a certain selection criteria.  
Instead of extracting all the English phrases in 
the local Web pages, we only select the English 
phrases that occurred within the special marks 
including quotation marks and parenthesis. Be-
cause English phrases within these markers re-
veal their significance in information searching 
to some extent. In addition, if the phrase starts 
with some stemming words (e.g., for, as) or in-
cludes some special sign, it is excluded as the 
phrases to be translated.  
4 Translation of English Phrases 
In order to translate the English phrases extracted, 
we query the search engine with English phrases 
to retrieve the local Web pages containing them. 
For each document returned, only the title and 
the query-biased summary are kept for further 
analysis. We dig out the translation(s) for the 
English phrases from these collected documents.  
4.1 Extraction of Candidates for Selection 
After querying the search engine with the Eng-
lish phrase, we can get the snippets (title and 
summary) of Web texts in the returned search-
result pages as shown in Figure 1. The next step 
then is to extract translation candidates within a 
window of a limited size, which includes the 
642
English phrase, in the snippets of Web texts in 
the returned search-result pages. Because of the 
agglutinative nature of the Chinese and Korean 
languages, we should group the words in the lo-
cal language into proper units as translation can-
didates, instead of treating each individual word 
as candidates. There are two typical ways: one is 
to group the words based on their co-occurrence 
information in the corpus (Cheng et al., 2004), 
and the other is to employ all sequential combi-
nations of the words as the candidates (Zhang 
and Vine, 2004). Although the first reduces the 
number of candidates, it risks losing the right 
combination of words as candidates. We adopt 
the second in our approach, so that,  return to the 
aforementioned example in Figure 1, if there are 
three Chinese characters (韦特比 ) within the pre-
defined window, the translation candidates for 
English phrases “Viterbi” are  “韦 ”,“特 ”, “比 ”, 
“韦特  ”, “特比 ”, and “韦特比 ”. The number of 
candidates in the second method, however, is 
greatly increased by enlarging the window size 
k . Realizing that the number of words, n , avail-
able in the window size, k , is generally larger 
than the predefined maximum length of candi-
date, m ,  it is unreasonable to use all adjacent 
sequential combinations of available words 
within the window size k . Therefore, we tune 
the method as follows: 
1. If nm≤ , all adjacent sequential combina-
tions of words within the window are treated as 
candidates 
2. If nm> , only adjacent sequential combina-
tions of which the word number is less than m  
are regarded as candidates. For example, if we 
set n  to 4 and m  to 2, the window “
1234
wwww ” 
consists of four words. Therefore, only “
12
ww ”, 
“
23
ww”, “
34
ww”, “
1
w ”, “
2
w ”，  “
3
w ”, “
4
w ” are 
employed as the candidates for final translation 
selection.  
Based on our experiments, this tuning method 
achieves the same performance while reducing 
the candidate size greatly. 
4.2 Selection of candidates  
The final step is to select the proper candidate(s) 
as the translation(s) of the key English phrase. 
We present a method that considers the statistical, 
phonetic and semantic features of the English 
candidates for selection.  
Statistical information such as co-occurrence, 
Chi-square, mutual information between the 
English term and candidates helps distinguish the 
right translation(s). Using Cheng’s Chi-square 
method (Cheng et al., 2004), the probability to 
find the right translation for English specific 
term is around 30% in the top-1 case and 70% in 
the top-5 case. Since our goal is to find the corre-
sponding counterpart(s) of the English phrase to 
treat them as one index unit in IR, the accuracy 
level is not satisfactory. Since it seems difficult 
to improve the precision solely through variant 
statistical methods, we also consider semantic 
and phonetic information of candidates besides 
the statistical information. For example, given 
the English Key phrase “Attack of the clones”, 
the right Korean translation “클론의습격” is far 
away from the top-10 selected by Chi-square 
method (Cheng et al., 2004). However, based on 
the semantic match of “습격” and “Attack”, and 
the phonetic match of “클론” and “clones”, we 
can safely infer they are the right translation. The 
same rule applies to the Chinese translation  “克
隆人的进攻 ”, where “克隆人 ” is phonetically 
match for “clones” and “进攻 ” semantically cor-
responds to “attack”.  
 In selection step, we first remove most of the 
noise candidates based on the statistical method 
and re-rank the candidates based on the semantic 
and phonetic similarity. 
4.3 Statistical model 
There are several statistical models to rank the 
candidates.  Nagata (2001) and Huang (2005) use 
the frequency of co-occurrence and the textual 
distance, the number of words between the Key 
phrase and candidates in texts to rank the candi-
dates, respectively. Although the details of the 
methods are quite different, both of them share 
the same assumption that the higher co-
occurrence between candidates and the Key 
phrase, the more possible they are the right trans-
lations for each other. In addition, they observed 
that most of the right translations for the Key 
phrase are close to it in the text, especially, right 
after or before the key phrase (e.g. “ …  
연방수사국(FBI)이…”). Zhang (2004) sug-
gested a statistical model based on the frequency 
of co-occurrence and the length of the candidates. 
In the model, since the distance between the key 
phrase and a candidate is not considered, the 
right translation located far away from the key 
phrase also has a chance to be selected. We ob-
serve, however, that such case is very rare in our 
study, and most of right translations are located 
within 5~8 words. The distance information is a 
valuable factor to be considered.  
643
In our statistical model, we consider the fre-
quency, length and location of candidates to-
gether. The intuition is that if the candidate is the 
right translation, it tends to co-occur with the key 
phrase frequently; its location tends to be close to 
the key phrase; and the longer the candidates’ 
length, the higher the chance to be the right 
translation. The formula to calculate the ranking 
score for a candidate is as follows: 
1
() (,)
(, ) (1 )
max max
kiki
FL i
len Freq len
len c d q c
wqc αα
−
=× +− ×
∑
 
where (, )
ki
dqc is the word distance between the 
English phrase  q  and the candidate 
i
c  in the k-
th occurrence of candidate in the search-result 
pages. If  q  is adjacent to 
i
c  , the word distance 
is one. If there is one word between them, it is 
counted as two and so forth.  α  is the coefficient 
constant, and max
Freq len−
 is the max reciprocal of 
(, )
ki
dqc  among all the candidates. ()
i
len c  is the 
number of characters in the candidate 
i
c .  
4.4 Phonetic and semantic model 
Phonetic and semantic match: There has been 
some related work on extracting term translation 
based on the transliteration model (Kang and 
Choi, 2002; Kang and Kim, 2000). Different 
from transliteration that attempts to generate 
English transliteration given a foreign word in 
local language, our approach is a kind a match 
problem since we already have the candidates 
and aim at selecting the right candidates as the 
final translation(s) for the English key phrase. 
While the transliteration method is partially 
successful, it suffers form the problem that trans-
literation rules are not applied consistently. The 
English key phrase for which we are looking for 
the translation sometimes contains several words 
that may appear in a dictionary as an independent 
unit. Therefore, it can only be partially matched 
based on the phonetic similarity, and the rest part 
may be matched by the semantic similarity in 
such situation. Returning to the above example, 
“clone” is matched with “클론” by phonetic 
similarity. “of” and “attack” are matched with 
“의” and “습격” respectively by semantic simi-
larity. The objective is to find a set of mappings 
between the English word(s) in the key phrase 
and the local language word(s) in candidates, 
which maximize the sum of the semantic and 
phonetic mapping weights. We call the sum as 
SSP (Score of semanteme and phoneme). The 
higher SSP value is, the higher the probability of 
the candidate to be the right translation.  
The solution for a maximization problem can 
be found using an exhaustive search method. 
However, the complexity is very high in practice 
for a large number of pairs to be processed.  As 
shown in Figure 2, the problem can be repre-
sented as a bipartite weighted graph matching 
problem. Let the English key phrase, E, be repre-
sented as a sequence of tokens 
1
,...,
m
ew ew<>, and 
the candidate in local language, C, be repre-
sented as a sequence of tokens
1
,...,
n
cw cw<>. 
Each English and candidate token is represented 
as a graph vertex. An edge (, )
ij
ew cw  is formed 
with the weight (, )
ij
ew cwω  calculated as the av-
erage of normalized semantic and phonetic val-
ues, whose calculation details are explained be-
low. In order to balance the number of vertices 
on both sides, we add the virtual vertex (vertices) 
with zero weight on the side with less number of 
vertices. The SSP is calculated:  
n
()
i=1
SSP=argmax ( , )
ii
kw ew
π
ω
∑
 
where π  is a permutation of {1, 2, 3, …, n}. It 
can be solved by the Kuhn-Munkres algorithm 
(also known as Hungarian algorithm) with poly-
nomial time complexity (Munkres, 1957).  
 
 
Figure 2. Matching based on the semanteme and 
phoneme 
Phonetic & Semantic Weights: If two lan-
guages have a close linguistic relationship such 
as English and French, cognate matching (Davis, 
1997) is typically employed to translate the un-
translatable terms. Interestingly, Buckley et al., 
(2000) points out that “English query words are 
treated as potentially misspelled French words” 
and attempts to treat English words as variations 
of French words according to lexicographical 
rules.  However, when two languages are very 
distinct, e.g., English–Korean, English–Chinese, 
transliteration from English words is utilized for 
cognate matching. 
Phonetic weight is the transliteration probabil-
ity between English and candidates in local lan-
guage. We adopt the method in (Jeong et al., 
1999) with some adjustments. In essence, we 
compute the probabilities of particular English 
클론습격 의
The of Clones Attack
644
key phrase EW given a candidate in the local 
language CW.  
11
11 1
( , ) ( ,..., , ,..., )
1
( ,..., , ,..., ) log ( | ) ( | )
phoneme phoneme m k
phoneme n k j j j j
j
EW CW e e c c
g gc c Pg g Pc g
n
ωω
ω
−
=
==
∑
 
where the English phrase consists of a string of 
English alphabets 
1
,...,
m
ee, and the candidate in 
the local language is comprised of  a string of 
phonetic elements. 
1
,...,
k
cc. For Korean language, 
the phonetic element is the Korean alphabets 
such as “ㄱ”, “ㅣ”, “ㄹ” , “ㅎ” and etc. For Chi-
nese language, the phonetic elements mean the 
elements of “pinying”.  
i
g  is a pronunciation unit 
comprised of one or more English alphabets 
( e.g., ‘ss’ for ‘ㅅ’, a Korean alphabet ).  
The first term in the product corresponds to 
the transition probability between two states in 
HMM and the second term to the output prob-
ability for each possible output that could corre-
spond to the state, where the states are all possi-
ble distinct English pronunciation units for the 
given Korean or Chinese word. Because the dif-
ference between Korean/Chinese and English 
phonetic systems makes the above uni-gram 
model almost impractical in terms of output 
quality, bi-grams are applied to substitute the 
single alphabet in the above equation. Therefore, 
the phonetic weight should be calculated as:  
11 1 1
1
(,) log( | )( | )
phoneme j j j j j j j j
j
EC Pg g g g Pcc g g
n
ω
+− + +
=
∑
where 
11
(| )
jj j j
Pcc g g
++
 is computed from the 
training corpus as the ratio between the fre-
quency of 
1jj
cc
+
 in the candidates, which were 
originated from 
1jj
g g
+
in English words, to the 
frequency of 
1jj
g g
+
. If 1j =  or j n= , 
1j
g
−
 or 
1j
g
+
, 
1j
c
+
 is substituted with a space marker.  
The semantic weight is calculated from the bi-
lingual dictionary. The current bilingual diction-
ary we employed for the local languages are Ko-
rean-English WorldNet and LDC Chinese-
English dictionary with additional entries in-
serted manually. The weight relies on the degree 
of overlaps between an English translation and 
the candidate  
semanteme
No. of  overlapping units
w(E,C)=argmax
total No. of   units  
 
For example, given the English phrase “Inha 
University” and its candidate “인하대 (Inha 
University), “University” is translated into 
“대학교”, therefore, the semantic weight be-
tween “University” and “대” is about 0.33 be-
cause only one third of the full translation is 
available in the candidate. 
Due to the range difference between phonetic 
and semantic weights, we normalized them by 
dividing the maximum phonetic and semantic 
weights in each pair of the English phrase and a 
candidate if the maximum is larger than zero.  
The strategy for us to pick up the final transla-
tion(s) is distinct on two different aspects from 
the others. If the SSP values of all candidates are 
less than the threshold, the top one obtained by 
statistical model is selected as the final transla-
tion. Otherwise, we re-rank the candidates ac-
cording to the SSP value. Then we look down 
through the new rank list and draw a “virtual” 
line if there is a big jump of SSP value. If there is 
no big jump of SSP values, the “virtual” line is 
drawn at the bottom of the new rank list. Instead 
of the top-1 candidate, the candidates above the 
“virtual” line are all selected as the final transla-
tions. It is because that an English phrase may 
have more than one correct translation in the lo-
cal language. Return to the previous example, the 
English term “Viterbi” corresponds to two Chi-
nese translations “维特比 ” and “韦特比 ”. The 
candidate list based on the statistical information 
is “编码 , 算法 , 译码 , 维特比 ,…,韦特比 ”. We 
then calculate the SSP value of these candidates 
and re-rank the candidates whose SSP values are 
larger than the threshold which we set to 0.3. 
Since the SSP value of “维特比 (0.91)” and “韦
特比 (0.91)” are both larger than the threshold 
and there is no big jump, both of them are se-
lected as the final translation.  
5 Experimental Evaluation  
Although the technique we developed has values 
in their own right and can be applied for other 
language engineering fields such as query trans-
lation for CLIR, we intend to understand to what 
extent monolingual information retrieval effec-
tiveness can be increased when relevant terms in 
different language are treated as one unit while 
indexing. We first examine the translation preci-
sion and then study the impact of our approach 
for monolingual IR. 
We crawls the web pages of a specific domain 
(university & research) by WIRE crawler pro-
vided by center of Web Research, university of 
Chile (http://www.cwr.cl/projects/WIRE/). Cur-
rently, we have downloaded 32 sites with 5,847 
645
Korean Web pages and 74 sites with 13,765 Chi-
nese Web pages. 232 and 746 English terms 
were extracted from Korean Web pages and Chi-
nese Web pages, respectively.  The accuracy of 
unifying semantically identical words in different 
languages is dependant on the translation per-
formance. The translation results are shown in 
table 1.  As it can be observed, 77% of English 
terms from Korean web pages and 83% of Eng-
lish terms from Chinese Web pages can be 
strictly translated into accurate Korean and Chi-
nese, respectively. However, additional 15% and 
14% translations contained at least one Korean 
and Chinese translations, respectively.  The er-
rors were brought in by containing additional 
related information or incomplete translation. For 
instance, the English term “blue chip” is trans-
lated into “蓝芯 (blue chip)”, “蓝筹股  (a kind of 
stock)”. However, another acceptable translation 
“绩优股  (a kind of stock)” is ignored.  An ex-
ample for incomplete translation is English 
phrase “ SIGIR 2005” which only can be trans-
late into “国际计算机检索年会  (international 
conference of computer information retrieval” 
ignoring the year.  
 
Korean Chinese  
No. % No. % 
Exactly correct 179 77% 618 83% 
At least one is 
correct but not all 
35 15% 103 14% 
Wrong translation 18 8% 25 3% 
Total 232 100% 746 100% 
Table 1. Translation performance 
We also compare our approach with two well-
known translation systems. We selected 200 
English words and translate them into Chinese 
and Korean by these systems.  Table2 and Table 
3 show the results in terms of the top 1, 3, 5 in-
clusion rates for Korean and Chinese translation, 
respectively. “Exactly and incomplete” transla-
tions are all regarded as the right translations.  
“LiveTrans” and “Google” represent the systems 
against which we compared the translation abil-
ity. Google provides a machine translation func-
tion to translate text such as Web pages. Al-
though it works pretty well to translate sentences, 
it is ineligible for short terms where only a little 
contextual information is available for translation. 
LiveTrans (Cheng et al., 2004) provided by the 
WKD lab in Academia Sinica is the first un-
known word translation system based on web-
mining. There are two ways in this system to 
translate words: the fast one with lower precision 
is based on the “chi-square” method (
2
χ ) and the 
smart one with higher precision is based on “con-
text-vector” method (CV) and “chi-square” 
method (
2
χ ) together. “ST” and “ST+PS” repre-
sent our approaches based on statistic model and 
statistic model plus phonetic and semantic model, 
respectively.   
 
 
Top -1 Top-3 Top -5
Google 56% NA NA 
“Fast” 
2
χ 37% 43% 53.5%Live 
Trans 
“Smart” 
2
χ
+CV 
42% 49% 60% 
ST(d
k
=1) 28.5 % 41% 47% 
ST 39 % 46.5% 55.5%
Our 
Methods
ST+PS 93% 93% 93% 
Table 2. Comparison (Chinese case) 
 
 
Top -1 Top-3 Top -5
Google 44% NA NA 
“Fast” 
2
χ 28% 37.5% 45% Live 
Trans 
“Smart” 
2
χ
+CV 
24.5% 44% 50% 
ST(d
k
=1) 26.5 % 35.5% 41.5%
ST 32 % 40% 46.5%
Our 
Methods
ST+PS 89% 89.5% 89.5%
Table 3. Comparison  (Korean case) 
 
Even though the overall performance of Li-
veTrans’ combined method (
2
χ +CV) is better 
than the simple method (
2
χ ) in both Table 2 and 
3, the same doesn’t hold for each individual. For 
instance, “Jordan”  is the English translation of 
Korean term   “요르단”, which  ranks 2nd and 
5th in (
2
χ )  and (
2
χ +CV), respectively. The con-
text-vector sometimes misguides the selection.  
In our two-step selection approach, the final 
selection would not be diverted by the false sta-
tistic information. In addition, in order to exam-
ine the contribution of distance information in 
the statistical method, we ran our experiments 
based on statistical method (ST) with two differ-
ent conditions. In the first case, we set  (, )
ki
dqc to 
1, that is, the location information of all candi-
dates is ignored. In the second case, (, )
ki
dqc is 
calculated based on the real textual distance of 
the candidates. As in both Table 2 and Table 3, 
the later case shows better performance. 
As shown in both Table 2 and Table 3, it can 
be observed that “ST+PS” shows the best per-
formance, then followed by “LiveTrans (smart)”, 
“ST”, “LiveTrans(fast)”, and “Google”. The sta-
646
tistical methods seem to be able to give a rough 
estimate for potential translations without giving 
high precision. Considering the contextual words 
surrounding the candidates and the English 
phrase can further improve the precision but still 
less than the improvement made by the phonetic 
and semantic information in our approach. High 
precision is very important to the practical appli-
cation of the translation results. The wrong trans-
lation sometimes leads to more damage to its 
later application than without any translation 
available.  For instance, the Chinese translation 
of “viterbi” is “算法 (algorithm)” by LiveTrans 
(fast). Obviously, treating “Viterbi” and  “算法  
(algorithm)”as one index unit is not acceptable.    
We ran monolingual retrieval experiment to 
examine the impact of our concept unification on 
IR. The retrieval system is based on the vector 
space model with our own indexing scheme to 
which the concept unification part was added. 
We employed the standard tf idf×  scheme for 
index term weighting and idf  for query term 
weighting. Our experiment is based on KT-SET 
test collection (Kim et al., 1994). It contains 934 
documents and 30 queries together with rele-
vance judgments for them. 
In our index scheme, we extracted the key 
English phrases in the Korean texts, and trans-
lated them. Each English phrases and its equiva-
lence(s) in Korean is treated as one index unit. 
The baseline against which we compared our 
approach applied a relatively simple indexing 
technique. It uses a dictionary that is Korean-
English WordNet, to identify index terms. The 
effectiveness of the baseline scheme is compara-
ble with other indexing methods (Lee and Ahn, 
1999). While there is a possibility that an index-
ing method with a full morphological analysis 
may perform better than our rather simple 
method, it would also suffer from the same prob-
lem, which can be alleviated by concept unifica-
tion approach. As shown in Figure 3, we ob-
tained 14.9 % improvement based on mean aver-
age 11-pt precision. It should be also noted that 
this result was obtained even with the errors 
made by the unification of semantically identical 
terms in different languages. 
6 Conclusion 
In this paper, we showed the importance of the 
unification of semantically identical terms in dif-
ferent languages for Asian monolingual informa-
tion retrieval, especially Chinese and Korean. 
Taking the utilization of the high translation ac-
curacy of our previous work, we successfully 
unified the most semantically identical terms in 
the corpus.  This is along the line of work where 
researchers attempt to index documents with 
concepts rather than words. We would extend 
our work along this road in the future. 
 
Recall
0. .2.4.6.81.0
Pr
ec
is
i
o
n
0.0
.2
.4
.6
.8
1.0
Baseline
Conceptual Unification
 
Figure 3. Korean Monolingual IR 
Reference 
Buckley, C., Mitra, M., Janet, A. and Walz, C.C.. 
2000.  Using Clustering and Super Concepts within 
SMART: TREC 6. Information Processing & 
Management. 36(1): 109-131. 
Cao, Y. and Li., H.. 2002.  Base Noun Phrase Transla-
tion Using Web Data and the EM Algorithm. In 
Proc. of. the 19th COLING. 
Cheng, P.,  Teng, J., Chen, R., Wang, J., Liu,W., 
Chen, L.. 2004.  Translating Specific Queries with 
Web Corpora for Cross-language Information Re-
trieval. In Proc. of ACM SIGIR. 
Davis, M.. 1997. New Experiments in Cross-language 
Text Retrieval at NMSU's Computing Research 
Lab. In Proc. Of TREC-5. 
Fung, P. and Yee., L.Y.. 1998. An IR Approach for 
Translating New Words from Nonparallel, Compa-
rable Texts. In Proc. of  COLING/ACL-98. 
Huang, F., Zhang, Y. and Vogel, S.. 2005. Mining 
Key Phrase Translations from Web Corpora, In 
Proc. of the Human Language Technologies Con-
ference (HLT-EMNLP).   
Jeong, K. S., Myaeng, S. H., Lee, J. S., Choi, K. S.. 
1999. Automatic identification and back-
transliteration of foreign words for information re-
trieval. Information Processing & Management. 
35(4): 523-540. 
Kang, B. J., and Choi, K. S. 2002. Effective Foreign 
Word Extraction for Korean Information Retrieval. 
Information Processing & Management, 38(1): 91-
109. 
647
Kang, I. H. and Kim, G. C.. 2000. English-to-Korean 
Transliteration using Multiple Unbounded Over-
lapping Phoneme Chunks. In Proc. of COLING . 
Kim, S.-H. et al.. 1994. Development of the Test Set 
for Testing Automatic Indexing. In Proc. of the 
22nd KISS Spring Conference. (in Korean).  
Lee, J, H. and Ahn, J. S.. 1996. Using N-grams for 
Korean Test Retrieval. In Proc. of SIGIR. 
Lee, J. S.. 2004.  Automatic Extraction of Translation 
Phrase Enclosed within Parentheses using Bilin-
gual Alignment Method. In Proc. of the 5th China-
Korea Joint Symposium on Oriental Language 
Processing and Pattern Recognition. 
Munkres, J.. 1957. Algorithms for the Assignment 
and Transportation Problems. J. Soc. Indust. Appl. 
Math., 5 (1957).  
Nagata, M., Saito, T., and Suzuki, K.. 2001. Using the 
Web as a Bilingual Dictionary. In Proc. of ACL 
'2001 DD-MT Workshop. 
Rapp, R.. 1999. Automatic Identification of Word 
Translations from Unrelated English and German 
corpora. In Proc. of ACL. 
Zhang, Y., Huang, F. and Vogel, S.. 2005. Mining 
Translations of OOV Terms from the Web through 
Cross-lingual Query Expansion, In Proc. of ACM 
SIGIR-05. 
Zhang, Y. and Vines, P.. 2004. Using the Web for 
Automated Translation Extraction in Cross-
Language Information Retrieval. In Proc. of ACM 
SIGIR-04. 
648
