Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 199–206,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Chinese-English Term Translation Mining Based on 
Semantic Prediction  
 
 
Gaolin Fang, Hao Yu, and Fumihito Nishino 
Fujitsu Research and Development Center, Co., LTD. Beijing 100016, China  
{glfang, yu, nishino}@cn.fujitsu.com
 
  
 
Abstract 
Using abundant Web resources to mine 
Chinese term translations can be applied 
in many fields such as reading/writing as-
sistant, machine translation and cross-
language information retrieval. In mining 
English translations of Chinese terms, 
how to obtain effective Web pages and 
evaluate translation candidates are two 
challenging issues. In this paper, the ap-
proach based on semantic prediction is 
first proposed to obtain effective Web 
pages. The proposed method predicts 
possible English meanings according to 
each constituent unit of Chinese term, and 
expands these English items using 
semantically relevant knowledge for 
searching. The refined related terms are 
extracted from top retrieved documents 
through feedback learning to construct a 
new query expansion for acquiring more 
effective Web pages. For obtaining a cor-
rect translation list, a translation 
evaluation method in the weighted sum of 
multi-features is presented to rank these 
candidates estimated from effective Web 
pages. Experimental results demonstrate 
that the proposed method has good per-
formance in Chinese-English term trans-
lation acquisition, and achieves 82.9% 
accuracy. 
1 Introduction 
The goal of Web-based Chinese-English (C-E) 
term translation mining is to acquire translations 
of terms or proper nouns which cannot be looked 
up in the dictionary from the Web using a statis-
tical method, and then construct an application 
system for reading/writing assistant (e.g., 三国演
义Gc6The Romance of Three Kingdoms). During 
translating or writing foreign language articles, 
people usually encounter terms, but they cannot 
obtain native translations after many lookup ef-
forts. Some skilled users perhaps resort to a Web 
search engine, but a large amount of retrieved 
irrelevant pages and redundant information ham-
per them to acquire effective information. Thus, 
it is necessary to provide a system to automati-
cally mine translation knowledge of terms using 
abundant Web information so as to help users 
accurately read or write foreign language articles.  
The system of Web-based term translation 
mining has many applications. 1) Read-
ing/writing assistant. 2) The construction tool of 
bilingual or multilingual dictionary for machine 
translation. The system can not only provide 
translation candidates for compiling a lexicon, 
but also rescore the candidate list of the diction-
ary. We can also use English as a medium lan-
guage to build a lexicon translation bridge 
between two languages with few bilingual anno-
tations (e.g., Japanese and Chinese). 3) Provide 
the translations of unknown queries in cross-
language information retrieval (CLIR). 4) As one 
of the typical application paradigms of the com-
bination of CLIR and Web mining. 
Automatic acquisition of bilingual translations 
has been extensively researched in the literature. 
The methods of acquiring translations are usually 
summarized as the following six categories. 1) 
Acquiring translations from parallel corpora. To 
reduce the workload of manual annotations, re-
searchers have proposed different methods to 
automatically collect parallel corpora of different 
language versions from the Web (Kilgarriff, 
2003). 2) Acquiring translations from non-
parallel corpora (Fung, 1997; Rapp, 1999). It is 
based on the clue that the context of source term 
is very similar to that of target translation in a 
large amount of corpora. 3) Acquiring transla-
tions from a combination of translations of con-
stituent words (Li et al., 2003). 4) Acquiring 
translations using cognate matching (Gey, 2004) 
199
or transliteration (Seo et al., 2004). This method 
is very suitable for the translation between two 
languages with some intrinsic relationships, e.g., 
acquiring translations from Japanese to Chinese 
or from Korean to English. 5) Acquiring transla-
tions using anchor text information (Lu et al., 
2004). 6) Acquiring translations from the Web. 
When people use Asia language (Chinese, Japa-
nese, and Korean) to write, they often annotate 
associated English meanings after terms. With 
the development of Web and the open of accessi-
ble electronic documents, digital library, and sci-
entific articles, these resources will become more 
and more abundant. Thus, acquiring term transla-
tions from the Web is a feasible and effective 
way. Nagata et al. (2001) proposed an empirical 
function of the byte distance between Japanese 
and English terms as an evaluation criterion to 
extract translations of Japanese words, and the 
results could be used as a Japanese-English dic-
tionary.   
Cheng et al. (2004) utilized the Web as the 
corpus source to translate English unknown que-
ries for CLIR. They proposed context-vector and 
chi-square methods to determine Chinese transla-
tions for unknown query terms via mining of top 
100 search-result pages from Web search engines.  
Zhang and Vines (2004) proposed using a Web 
search engine to obtain translations of Chinese 
out-of-vocabulary terms from the Web to im-
prove CLIR performance. The method used Chi-
nese as query items, and retrieved previous 100 
document snippets by Google, and then estimated 
possible translations using co-occurrence infor-
mation.  
From the review above, we know that previous 
related researches didn’t concern the issue how to 
obtain effective Web pages with bilingual 
annotations, and they mainly utilized the 
frequency feature as the clue to mine the 
translation. In fact, previous 100 Web results 
seldom contain effective English equivalents. 
Apart from the frequency information, there are 
some other features such as distribution, length 
ratio, distance, keywords, key symbols and 
boundary information which have very important 
impacts on term translation mining. In this paper, 
the approach based on semantic prediction is 
proposed to obtain effective Web pages; for 
acquiring a correct translation list, the evaluation 
strategy in the weighted sum of multi-features is 
employed to rank the candidates.  
The remainder of this paper is organized as 
follows. In Section 2, we give an overview of the 
system. Section 3 proposes effective Web page 
collection. In Section 4, we introduce translation 
candidate construction and noise solution. Sec-
tion 5 presents candidate evaluation based on 
multi-features. Section 6 shows experimental 
results. The conclusion is drawn in the last sec-
tion. 
2 System Overview 
The C-E term translation mining system based on 
semantic prediction is illustrated in Figure 1. 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 1. The Chinese-English term translation min-
ing system based on semantic prediction 
 
The system consists of two parts: Web page 
handling and term translation mining. Web page 
handling includes effective Web page collection 
and HTML analysis. The function of effective 
Web page collection is to collect these Web 
pages with bilingual annotations using semantic 
prediction, and then these pages are inputted into 
HTML analysis module, where possible features 
and text information are extracted. Term transla-
tion mining includes candidate unit construction, 
candidate noise solution, and rank&sort candi-
dates. Translation candidates are formed through 
candidate unit construction module, and then we 
analyze their noises and propose the correspond-
ing methods to handle them. At last, the approach 
using multi-features is employed to rank these 
candidates. 
Correctly exploring all kinds of bilingual anno-
tation forms on the Web can make a mining sys-
tem extract comprehensive translation results. 
After analyzing a large amount of Web page ex-
amples, translation distribution forms is summa-
rized as six categories in Figure 2: 1) Direct 
annotation (a). some have nothing (a1), and some 
have symbol marks (a2, a3) between the pair; 2) 
Separate annotation. There are English letters (b1) 
or some Chinese words (b2, b3) between the pair; 
3) Subset form (c); 4) Table form (d); 5) List 
form (e); and 6) Explanation form (f).  
 
 
Query  
“白朗峰” 
WWW
Features
1. Frequency 
2. Distribution 
3. Distance 
4. Length ratio
5. Key symbols
and boundary 
Rank & sort 
candidates
Candidate unit 
construction 
Result 
“Mont Blanc”
Effective 
 Web page 
collection 
HTML 
analysis 
Candidate noise 
solution 
200
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 2. The examples of translation distribution 
forms 
3 Effective Web page collection 
For mining the English translations of Chinese 
terms and proper names, we must obtain effective 
Web pages, that is, collecting these Web pages 
that contain not only Chinese characters but also 
the corresponding English equivalents. However, 
in a general Web search engine, when you input a 
Chinese technical term, the number of retrieved 
relevant Web pages is very large. It is infeasible 
to download all the Web pages because of a huge 
time-consuming process. If only the 100 abstracts 
of Web pages are used for the translation estima-
tion just as in the previous work, effective Eng-
lish equivalent words are seldom contained for 
most Chinese terms in our experiments, for ex-
ample: “三国演义, 三好学生, 百慕大三角, 车牌
号”. In this paper, a feasible method based on 
semantic prediction is proposed to automatically 
acquire effective Web pages. In the proposed 
method, possible English meanings of every con-
stituent unit of a Chinese term are predicted and 
further expanded by using semantically relevant 
knowledge, and these expansion units with the 
original query are inputted to search bilingual 
Web pages. In the retrieved top-20 Web pages, 
feedback learning is employed to extract more 
semantically-relevant terms by frequency and 
average length. The refined expansion terms, to-
gether with the original query, are once more sent 
to retrieve effective relevant Web pages. 
3.1 Term expansion  
Term expansion is to use predictive semantically-
relevant terms of target language as the expan-
sion of queries, and therefore resolve the issue 
that top retrieved Web pages seldom contain ef-
fective English annotations. Our idea is based on 
the assumption that the meanings of Chinese 
technical terms aren’t exactly known just through 
their constituent characters and words, but the 
closely related semantics and vocabulary infor-
mation may be inferred and predicted. For exam-
ple, the corresponding unit translations of a term 
“三国演义” are respectively: three(三), country, 
nation(国), act, practice(演), and meaning, jus-
tice(义). As seen from these English translations, 
we have a general impression of “things about 
three countries”. After expanding, the query item 
for the example above becomes "三国演义"+ 
(three | country | nation | act | practice | meaning | 
justice). The whole procedure consists of three 
steps: unit segmentation, item translation knowl-
edge base construction, and expansion knowl-
edge base evaluation. 
Unit segmentation. Getting the constituent 
units of a technical term is a segmentation proce-
dure. Because most Chinese terms consist of out-
of-vocabulary words or meaningless characters, 
the performance using general word segmenta-
tion programs is not very desirable. In this paper, 
a segmentation method is employed to handle 
term segmentation so that possible meaningful 
constituent units are found. In the inner structure 
of proper nouns or terms, the rightmost unit usu-
ally contains a headword to reflect the major 
meaning of the term. Sometimes, the modifier 
starts from the leftmost point of a term to form a 
multi-character unit. As a result, forward maxi-
mum matching and backward maximum match-
ing are respectively conducted on the term, and 
all the overlapped segmented units are added to 
candidate items. For example, for the term 
“abcd”, forward segmented units are “ab cd”, 
backward are “a bcd”, so “ab cd a bcd” will be 
viewed as our segmented items. 
Item translation knowledge base construc-
tion. Because the segmented units of a technical 
term or proper name often consist of abbreviation 
items with shorter length, limited translations 
provided by general dictionaries often cannot 
satisfy the demand of translation prediction. Here, 
a semantic expansion based method is proposed 
to construct item translation knowledge base. In 
this method, we only keep these nouns or adjec-
tive items consisting of 1-3 characters in the dic-
tionary. If an item length is greater than two 
characters and contains any item in the knowl-
edge base, its translation will be added as transla-
tion candidates of this item. For example, the 
Chinese term “流通股” can be segmented into 
the units “流通” and “股”, where “股” has only 
two English meanings “section, thigh” in the dic-
tionary. However, we can derive its meaning us-
(a1) (a2) (a3)
(b1) (b2) (b3)
(c) (d) (e) (f)
201
ing the longer word including this item such as 
“股东, 股票”. Thus, their respective translations 
“stock, stockholder” are added into the knowl-
edge base list of “股” (see Figure 3).  
 
 
 
 
 
 
Figure 3. An expansion example in the dictionary 
knowledge base 
Expansion knowledge base evaluation. To 
avoid over-expanding of translations for one item, 
using the retrieved number from the Web as our 
scoring criterion is employed to remove irrele-
vant expansion items and rank those possible 
candidates. For example, “股” and its expansion 
translation “stock” are combined as a new query 
“股 stock –股票”. It is sent to a general search 
engine like Google to obtain the count number, 
where only the co-occurrence of “ 股 ” and 
“stock” excluding the word “股票” is counted. 
The retrieved number is about 316000. If the oc-
currence number of an item is lower than a cer-
tain threshold (100), the evaluated translation 
will not be added to the item in the knowledge 
base. Those expanded candidates for the item in 
the dictionary are sorted through their retrieved 
number. 
3.2 Feedback learning 
Though pseudo-relevance feedback (PRF) has 
been successfully used in the information re-
trieval (IR), whether PRF in single-language IR 
or pre-translation PRF and post-translation PRF 
in CLIR, the feedback results are from source 
language to source language or target language to 
target language, that is, the language of feedback 
units is same as the retrieval language. Our novel 
is that the input language (Chinese) is different 
from the feedback target language (English), that 
is, realizing the feedback from source language to 
target language, and this feedback technique is 
also first applied to the term mining field. 
After the expansion of semantic prediction, the 
predicted meaning of an item has some devia-
tions with its actual sense, so the retrieved docu-
ments are perhaps not our expected results. In 
this paper, a PRF technique is employed to ac-
quire more accurate, semantically relevant terms. 
At first, we collect top-20 documents from search 
results after term expansion, and then select 
target language units from these documents, 
get language units from these documents, which 
are highly related with the original query in 
source language. However, how to effectively 
select these units is a challenging issue. In the 
literature, researchers have proposed different 
methods such as Rocchio’s method or Robert-
son’s probabilistic method to solve this problem. 
After some experimental comparisons, a simple 
evaluation method using term frequency and av-
erage length is presented in this paper. The 
evaluation method is defined as follows: 
1)(
1
)()(
+∆
+=
t
tftw , where 
N
tsD
t
N
i i
∑
=∆
=1
),(
)(  (1) 
Δ(t) represents the average length between the 
source word s and the target candidate t. If the 
greater that the average length is, the relevance 
degree between source terms and candidates will 
become lower. The purpose of adding Δ(t) to 1 
is to avoid the divide overflow in the case that the 
average length is equal to zero. D
i
(s,t) denotes the 
byte distance between source words and target 
candidates, and N represents the total number of 
candidate occurrences in the estimated Web 
pages. This evaluation method is very suitable for 
the discrimination of these words with lower, but 
same term frequencies. In the ranked candidates 
after PRF feedback, top-5 candidates are selected 
as our refined expansion items. In the previous 
example, the refined expansion items are: King-
doms, Three, Romance, Chinese, Traditional. 
These refined expansion terms, together with the 
original query, "三国演义"+(Kingdoms | Three | 
Romance | Chinese | Traditional) are once more 
sent to retrieve relevant results, which are viewed 
as effective Web pages used in the process of the 
following estimation. 
4 Translation candidate construction and 
noise solution 
The goal of translation candidate construction is 
to construct and mine all kinds of possible trans-
lation forms of terms from the Web, and effec-
tively estimate their feature information such as 
frequency and distribution. In the transferred text, 
we locate the position of a query keyword, and 
then obtain a 100-byte window with keyword as 
the center. In this window, each English word is 
built as a beginning index, and then string candi-
dates are constructed with the increase of string 
in the form of one English word unit. String can-
didates are indexed in the database with hash and 
binary search method. If there exists the same 
item as the inputted candidate, its frequency is 
increased by 1, otherwise, this candidate is added 
股
股股票 股股东 
202
to this position of the database. After handling 
one Web page, the distribution information is 
also estimated at the same time. In the program-
ming implementation, the table of stop words and 
some heuristic rules of the beginning and end 
with respect to the keyword position are em-
ployed to accelerate the statistics process. 
The aim of noise solution is to remove these ir-
relevant items and redundant information formed 
in the process of mining. These noises are de-
fined as the following two categories.  
1) Subset redundancy. The characteristic is 
that this item is a subset of one item, but its fre-
quency is lower than that item. For example, “车
牌号：License plate number (6), License plate 
(5)”, where the candidate “License plate” belongs 
to subset redundancy. They should be removed.   
2) Affix redundancy. The characteristic is that 
this item is the prefix or suffix of one item, but its 
frequency is greater than that item. For example, 
1. “三国演义: Three Kingdoms (30), Romance 
of the Three Kingdoms (22), The Romance of 
Three Kingdoms (7)”, 2. “蓝筹股: Blue Chip 
(35), Blue Chip Economic Indicators (10)”. In 
Example 1, the item “Three Kingdoms” is suffix 
redundancy and should be removed. In Example 
2, the term “Blue Chip” is in accord with the 
definition of prefix redundancy information, but 
this term is a correct translation candidate. Thus, 
the problem of affix redundancy information is 
so complex that we need an evaluation method to 
decide to retain or drop the candidate.  
To deal with subset redundancy and affix 
redundancy information, sort-based subset 
deletion and mutual information methods are 
respectively proposed. More details refer to our 
previous paper (Fang et al., 2005). 
5 Candidate evaluation based on multi-
features 
5.1 Possible features for translation pairs 
Through analyzing mass Web pages, we obtain 
the following possible features that have impor-
tant influences on term translation mining. They 
include: 1) candidate frequency and its distribu-
tion in different Web pages, 2) length ratio be-
tween source terms and target candidates (S-T), 3) 
distance between S-T, and 4) keywords, key 
symbols and boundary information between S-T. 
1) Candidate frequency and its distribution  
Translation candidate frequency is the most 
important feature and is the basis of decision-
making. Only the terms whose frequencies are 
greater than a certain threshold are further con-
sidered as candidates in our system. Distribution 
feature reflects the occurrence information of one 
candidate in different Webs. If the distribution is 
very uniform, this candidate will more possibly 
become as the translation equivalent with a 
greater weight. This is also in accord with our 
intuition. For example, the translation candidates 
of the term “认股期权” include “put option” and 
“short put”, and their frequencies are both 5. 
However, their distributions are “1, 1, 1, 1, 1” 
and “2, 2, 1”. The distribution of “put option” is 
more uniform, so it will become as a translation 
candidate of “认股期权” with a greater weight.  
2) Length ratio between S-T 
The length ratio between S-T should satisfy 
certain constraints. Only the word number of a 
candidate falls within a certain range, the possi-
bility of becoming a translation is great.  
To estimate the length ratio relation between 
S-T, we conduct the statistics on the database 
with 5800 term translation pairs. For example, 
when Chinese term has three characters, i.e. W=3, 
the probability of English translations with two 
words is largest, about P(E=2 |W =3)= 78%, and 
there is nearly no occurrence out of the range of 
1-4. Thus, different weights can be impacted on 
different candidates by using statistical distribu-
tion information of length ratio. The weight con-
tributing to the evaluation function is set 
according to these estimated probabilities in the 
experiments. 
3) Distance between S-T 
Intuitively, if the distance between S-T is 
longer, the probability of being a translation pair 
will become smaller. Using this knowledge we 
can alleviate the effect of some noises through 
impacting different weights when we collect pos-
sible correct candidates far from the source term.  
To estimate the distance between S-T, experi-
ments are carried on 5800*200 pages with 5800 
term pairs, and statistical results are depicted as 
the histogram of distances in Figure 4. 
0
2000
4000
6000
8000
10000
12000
14000
-10 0 -7 5 -50 -25 0 25 50 7 5 1 00
 
Figure 4. The histogram of distances between S-T 
203
 
In the figure, negative value represents that 
English translation located in front of the Chinese 
term, and positive value represents English trans-
lation is behind the Chinese term. As shown from 
the figure, we know that most candidates are dis-
tributed in the range of -60-60 bytes, and few 
occurrences are out of this range. The numbers of 
translations appearing in front of the term and 
after the term are nearly equal. The curve looks 
like Gaussian probability distribution, so Gaus-
sian models are proposed to model it. By the 
curve fitting, the parameters of Gaussian models 
are obtained, i.e. u=1 and sigma=2. Thus, the 
contribution probability of distance to the ranking 
function is formulized as 
8/)1),((
2
22
1
),(
−−
=
jiD
D
ejip
π
, where D(i,j) repre-
sents the byte distance between the source term i 
and the candidate j.  
4) Keywords, key symbols and boundary in-
formation between S-T 
Some Chinese keywords or capital English ab-
breviation letters between S-T can provide an 
important clue for the acquisition of possible cor-
rect translations. These Chinese keywords in-
clude the words such as “中文叫, 中文译为, 
中文名称, 中文名称为, 中文称为, 或称为, 
又称为, 英文叫, 英文名为, 英文称为, 英
文全称”. The punctuations between S-T can also 
provide very strong constraints, for example, 
when the marks “（ ）( ) [ ]” exist, the probabil-
ity of being a translation pair will greatly increase. 
Thus, correctly judging these cases can not only 
make translation finding results more compre-
hensive, but also increase the possibility that this 
candidate is as one of correct translations. 
Boundary information refers to the fact that the 
context of candidates on the Web has distinct 
mark information, for example, the position of 
transition from continuous Chinese to English, 
the place with bracket ellipsis and independent 
units in the HTML text. 
5.2 Candidate evaluation method 
After translation noise handling, we evaluate 
candidate translations so that possible candidates 
get higher scores. The method in the weighted 
sum of multi-features including: candidate fre-
quency, distribution, length ratio, distance, key-
words, key symbols and boundary information 
between S-T, is proposed to rank the candidates. 
The evaluation method is formulized as follows:  
∑∑ ++=
=
N
i
j DL
wjijiptsptScore
1
1
)),(),(([),()( δλ  
)]),(),((max
2
wjijip
D
j
δλ + , 1
21
=+ λλ    (2) 
In the equation, Score(t) is proportional to 
),( tsp
L
, N and ),( jip
D
. If the bigger these com-
ponent values are, the more they contribute to the 
whole evaluation formula, and correspondingly 
the candidate has higher score. The length ratio 
relation ),( tsp
L
 reflects the proportion relation 
between S-T as a whole, so its weight will be 
impacted on the Score(t) in the macro-view. The 
weights are trained through a large amount of 
technical terms and proper nouns, where each 
relation corresponds to one probability. N de-
notes the total number of Web pages that contain 
candidates, and partly reflects the distribution 
information of candidates in different Web pages. 
If the greater N is, the greater Score(t) will be-
come. The distance relation ),( jip
D
 is defined as 
the distance contribution probability of the jth 
source-candidate pair on the ith Web pages, 
which is impacted on every word pair emerged 
on the Web in the point of micro-view. Its calcu-
lation formula is defined in Section 5.1. The 
weights of 
1
λ  and 
2
λ  represent the proportion of 
term frequency and term distribution, and 
1
λ  de-
notes the weight of the total number of one can-
didate occurrences, and 
2
λ  represents the weight 
of counting the nearest distance occurrence for 
each Web page. wji ),(δ  is the contribution prob-
ability of keywords, key symbols and boundary 
information. If there are predefined keywords, 
key symbols, and boundary information between 
S-T, i.e., 1),( =jiδ , then the evaluation formula 
will give a reward w, otherwise, 0),( =jiδ  indi-
cate that there is no impact on the whole equation. 
6 Experiments 
Our experimental data consist of two sets: 400 C-
E term pairs and 3511 C-E term pairs in the fi-
nancial domain. There is no intersection between 
the two sets. Each term often consists of 2-8 Chi-
nese characters, and the associated translation 
contains 2-5 English words. In the test set of 400 
terms, there are more than one English translation 
for every Chinese term, and only one English 
translation for 3511 term pairs. In the test sets, 
Chinese terms are inputted to our system on 
batch, and their corresponding translations are 
viewed as a criterion to evaluate these mined 
candidates. The top n accuracy is defined as the 
204
percentage of terms whose top n translations in-
clude correct translations in the term pairs. A se-
ries of experiments are conducted on the two test 
sets.  
Experiments on the number of feedback 
pages: To obtain the best parameter of feedback 
Web pages that influence the whole system accu-
racy, we perform the experiments on the test set 
of 400 terms. The number of feedback Web 
pages is respectively set to 0, 10, 20, 30, and 40. 
N=1, 3, 5 represent the accuracies of top 1, 3, and 
5. From the feedback pages, previous 5 semanti-
cally-relevant terms are extracted to construct a 
new query expansion for retrieving more effec-
tive Web pages. Translation candidates are mined 
from these effective pages, whose accuracy 
curves are depicted in Figure 5. 
60
65
70
75
80
85
90
95
100
010203040
The number of feedback Web pages
A
ccu
r
acy
N=1
N=3
N=5
 
Figure 5.  The number of feedback Web pages 
 
As seen from the figure above, when the num-
ber of feedback Web pages is 20, the accuracy 
reaches the best. Thus, the feedback parameter in 
our experiments is set to 20. 
Experiments on the parameter 
1
λ : In the 
candidate evaluation method using multi-features, 
the parameter of 
1
λ  need be chosen through the 
experiments. To obtain the best parameter, the 
experiments are set as follows. The accuracy of 
top 5 candidates is viewed as a performance cri-
terion. The parameters are respectively set from 0 
to 1 with the increase of 0.1 step. The results are 
listed in Figure 6. As seen from the figure, 
1
λ =0.4 is best parameter, and therefore 
2
λ =0.6. 
In the following experiments, the parameters are 
all set to this value. 
80
85
90
95
100
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Parameter
Accur
acy
Figure 6.  The relation between the parameter
1
λ  and 
the accuracy 
Experiments on the test set of 400 terms us-
ing different methods: The methods respec-
tively without prediction(NP), with prediction(P), 
with prediction and feedback(PF) only using term 
frequency (TM), and with prediction and feed-
back using multi-features(PF+MF) are employed 
on the test set of 400 terms. The results are listed 
in Table 1. As seen from this table, if there is no 
semantic prediction, the obtained translations 
from Web pages are about 48% in the top 30 
candidates. This is because general search en-
gines will retrieve more relevant Chinese Web 
pages rather than those effective pages including 
English meanings. Thus, the semantic prediction 
method is employed. Experiments demonstrate 
the method with semantic prediction distinctly 
improves the accuracy, about 36.8%. To further 
improve the performance, the feedback learning 
technique is proposed, and it increases the aver-
age accuracy of 6.5%. Though TM is very effec-
tive in mining the term translation, the multi-
feature method fully utilizes the context of can-
didates, and therefore obtains more accurate re-
sults, about 92.8% in the top 5 candidates. 
 
Table 1. The term translation results using different 
methods 
 Top30 Top10 Top5 Top3 Top1
NP 48.0 47.5 46.0 44.0 28.0
P 84.8 83.3 82.3 79.3 60.8
PF+TM 91.3 90.8 90.3 88.3 71.0
PF+MF 95.0 94.5 92.8 91.5 78.8
 
Experiments on a large vocabulary: To vali-
date our system performance, experiments are 
carried on a large vocabulary of 3511 terms using 
different methods. One method is to use term 
frequency (TM) as an evaluation criterion, and 
the other method is to use multi-features (MF) as 
an evaluation criterion. Experimental results are 
shown as follows. 
 
Table 2. The term translation results on a large vo-
cabulary 
 Top30 Top10 Top5 Top3 Top1
TM 82.5 81.2 78.3 73.5 49.4
MF 89.1 88.4 86.0 82.9 58.2
 
From Table 2, we know the accuracy with top 
5 candidates is about 86.0%. The method using 
multi-features is better than that of using term 
frequency, and improves an average accuracy of 
7.94% 
Some examples of acquiring English transla-
tions of Chinese terms are provided in Table 3. 
1
λ
205
Only top 3 English translations are listed for each 
Chinese term.  
 
Table 3.  Some C-E mining examples  
Chinese 
terms 
The list of English translations  
(Top 3) 
三国演义 
The Three Kingdoms 
The Romance of the Three Kingdoms
The Romance of Three Kingdoms 
三好学生 
Merit student 
"Three Goods" student 
Excellent League member 
蓝筹股 
Blue Chip 
Blue Chips 
Blue chip stocks 
白朗峰 
Mont Blanc 
Mont-Blanc 
Chamonix Mont-Blanc 
百慕大三角 
Burmuda Triangle 
Bermuda Triangle 
The Bermuda Triangle 
车牌号 
License plate number 
Vehicle plate number 
Vehicle identification no 
 
7 Conclusions  
In this paper, the method based on semantic 
prediction is first proposed to acquire effective 
Web pages. The proposed method predicts 
possible meanings according to each constituent 
unit of Chinese term, and expands these items for 
searching using semantically relevant knowledge, 
and then the refined related terms are extracted 
from top retrieved documents through feedback 
learning to construct a new query expansion for 
acquiring more effective Web pages. For obtain-
ing a correct translation list, the translation 
evaluation method using multi-features is pre-
sented to rank these candidates. Experimental 
results show that this method has good perform-
ance in Chinese-English translation acquisition, 
about 82.9% accuracy in the top 3 candidates. 
References  
P.J. Cheng, J.W. Teng, R.C. Chen, et al. 2004. Trans-
lating unknown queries with web corpora for 
cross-language information retrieval, Proc. ACM 
SIGIR, pp. 146-153. 
G.L. Fang, H. Yu, and F. Nishino. 2005. Web-Based 
Terminology Translation Mining, Proc. IJCNLP, 
pp. 1004-1016. 
P. Fung. 1997. Finding terminology translations from 
nonparallel corpora, Proc. Fifth Annual Work-
shop on Very Large Corpora (WVLC'97), pp. 
192-202. 
F.C. Gey. 2004. Chinese and Korean topic search of 
Japanese news collections, In Working Notes of 
the Fourth NTCIR Workshop Meeting, Cross-
Lingual Information Retrieval Task, pp. 214-218. 
A. Kilgarriff and G. Grefenstette. 2003. Introduction 
to the special issue on the Web as corpus, Com-
putational Linguistics, 29(3): 333-348.  
H. Li, Y. Cao, and C. Li. 2003.Using bilingual web 
data to mine and rank translations, IEEE Intelli-
gent Systems, 18(4): 54-59. 
W.H. Lu, L.F. Chien, and H.J. Lee. 2004. Anchor text 
mining for translation of Web queries: A transi-
tive translation approach, ACM Trans. Informa-
tion System, 22(2): 242-269. 
M. Nagata, T. Saito, and K. Suzuki. 2001. Using the 
web as a bilingual dictionary, Proc. ACL 2001 
Workshop Data-Driven Methods in Machine 
Translation, pp. 95-102. 
R. Rapp. 1999. Automatic identification of word 
translations from unrelated English and German 
corpora, Proc. 37th Annual Meeting Assoc. Com-
putational Linguistics, pp. 519-526. 
H.C. Seo, S.B. Kim, H.G. Lim and H.C. Rim. 2004. 
KUNLP system for NTCIR-4 Korean-English 
cross language information retrieval, In Working 
Notes of the Fourth NTCIR Workshop Meeting, 
Cross-Lingual Information Retrieval Task, pp. 
103-109. 
Y. Zhang and P. Vines. 2004. Using the web for 
automated translation extraction in cross-
language information retrieval, Proc. ACM 
SIGIR, pp. 162-169. 
206
