Proceedings of the ACL Interactive Poster and Demonstration Sessions,
pages 37–40, Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
Learning Source-Target Surface Patterns for  
Web-based Terminology Translation 
 
Jian-Cheng Wu 
Department of Computer Science 
National Tsing Hua University 
101, Kuangfu Road,  
Hsinchu, 300, Taiwan 
D928322@oz.nthu.edu.tw 
Tracy Lin 
Dep. of Communication Eng. 
National Chiao Tung University 
1001, Ta Hsueh Road,  
Hsinchu, 300, Taiwan 
tracylin@cm.nctu.edu.tw 
Jason S. Chang 
Department of Computer Science 
National Tsing Hua University 
101, Kuangfu Road,  
Hsinchu, 300, Taiwan 
jschang@cs.nthu.edu.tw 
 
Abstract 
This paper introduces a method for learn-
ing to find translation of a given source 
term on the Web. In the approach, the 
source term is used as a query and part of 
patterns to retrieve and extract transla-
tions in Web pages. The method involves 
using a bilingual term list to learn source-
target surface patterns. At runtime, the 
given term is submitted to a search engine 
then the candidate translations are ex-
tracted from the returned summaries and 
subsequently ranked based on the surface 
patterns, occurrence counts, and translit-
eration knowledge. We present a proto-
type called TermMine that applies the 
method to translate terms. Evaluation on a 
set of encyclopedia terms shows that the 
method significantly outperforms the 
state-of-the-art online machine translation 
systems. 
1 Introduction 
Translation of terms has long been recognized as 
the bottleneck of translation by translators. By re-
using prior translations a significant time spent in 
translating terms can be saved. For many years 
now, Computer-Aided Translation (CAT) tools 
have been touted as very useful for productivity 
and quality gains for translators. CAT tools such as 
Trados typically require up-front investment to 
populate multilingual terminology and translation 
memory. However, such investment has proven 
prohibitive for many in-house translation depart-
ments and freelancer translators and the actual 
productivity gains realized have been insignificant 
except for a few, very repetitive types of content. 
Much more productivity gain could be achieved by 
providing translation service of terminology. 
Consider the job of translating a textbook such 
as “Artificial Intelligence – A Modern Approach.” 
The best practice is probably to start by translating 
the indexes (Figure 1). It is not uncommon for 
these repetitive terms to be translated once and 
applied consistently throughout the book. For ex-
ample, A good translation F = "N� " for the 
given term E = "acoustic model," might be avail-
able on the Web due to the common practice of 
including the source terms (often in brackets,  see 
Figure 2)  when   using  a   translated term (e.g. 
"…�	Z�;N�� Acoustic Model�  t
�t�  …"). The surface patterns of co-
occurring source and target terms (e.g., "F �E") 
can be learned by using the Web as corpus. Intui-
tively, we can submit E and F to a search engine  
 
Figure 1. Some index entries in “Artificial intelli-
gence – A Modern Approach” page 1045. 
academy award, 458 
accessible, 41 
accusative case, 806 
Acero, A., 580, 1010 
Acharya, A., 131, 994 
achieves, 389 
Ackley, D. H., 133, 987 
acoustic model, 568 
 
Figure 2. Examples of web page summaries with 
relevant translations returned by Google for some 
source terms in Figure 1. 
1. ... 
�v� Academy Awards. a林B� Berlin International 
Film Festival. ... 
2. ... �兩H �x�� �(inherent Case)  d��SH��
(accusative Case) eSH~  ... 
3. ... �S� d�4禮I�(Alfred H. Ackley) 領�186�
 d�S���年來
*I�說.. 
4. ..�*k識/� �_}量Y�料 d�_I	$D參數 d	�
練�*J�|� �Acoustic Model  ���t�...  
37
and then extract the strings beginning with F and 
ending with E (or vice versa) to obtain recurring 
source-target patterns. At runtime, we can submit 
E as query, request specifically for target-language 
web-pages. With these surface patterns, we can 
then extract translation candidates Fs from the 
summaries returned by the search engine. Addi-
tional information of occurrence counts and trans-
literation patterns can be taken into consideration 
to rank Fs. 
 
Table 1. Translations by the machine translation 
system Google Translate and TermMine. 
Terms Google Translate TermMine
academy award *�	�� 
�v� 
accusative case *���� �� 
Ackley 
- 4禮 
acoustic model **�|� J�|� 
 
For instance, among many candidate translations, 
we will pick the translations "J�|�" for "acous-
tic model" and "4禮" for "Ackley, " because 
they fit certain surface-target surface patterns and 
appears most often in the relevant webpage sum-
maries. Furthermore, the first morpheme "" in "
4禮"  is consistent with prior transliterations of 
"A-" in "Ackley" (See Table 1). 
We present a prototype system called TermMine, 
that automatically extracts translation on the Web 
(Section 3.3) based on surface patterns of target 
translation and source term in Web pages auto-
matically learned on bilingual terms (Section 3.1). 
Furthermore, we also draw on our previous work 
on machine transliteration (Section 3.2) to provide 
additional evidence. We evaluate TermMine on a 
set of encyclopedia terms and compare the quality 
of translation of TermMine (Section 4) with a 
online translation system. The results seem to indi-
cate the method produce significantly better results 
than previous work. 
2 Related Work 
There is a resurgent of interested in data-intensive 
approach to machine translation, a research area 
started from 1950s. Most work in the large body of 
research on machine translation (Hutchins and 
Somers, 1992), involves production of sentence-
by-sentence translation for a given source text. In 
our work, we consider a more restricted case where 
the given text is a short phrase of terminology or 
proper names (e.g., “acoustic model” or “George 
Bush”).  
A number of systems aim to translate words and 
phrases out of the sentence context. For example, 
Knight and Graehl (1998) describe and evaluate a 
multi-stage method for performing backwards 
transliteration of Japanese names and technical 
terms into English by the machine using a genera-
tive model. In addition, Koehn and Knight (2003) 
show that it is reasonable to define noun phrase 
translation without context as an independent MT 
subtask and build a noun phrase translation subsys-
tem that improves statistical machine translation 
methods. 
Nagata, Saito, and Suzuki (2001) present a sys-
tem for finding English translations for a given 
Japanese technical term by searching for mixed 
Japanese-English texts on the Web. The method 
involves locating English phrases near the given 
Japanese term and scoring them based on occur-
rence counts and geometric probabilistic function 
of byte distance between the source and target 
terms. Kwok also implemented a term translation 
system for CLIR along the same line. 
Cao and Li (2002) propose a new method to 
translate base noun phrases. The method involves 
first using Web-based method by Nagata et al., and 
if no translations are found on the Web, backing 
off to a hybrid method based on dictionary and 
Web-based statistics on words and context vectors. 
They experimented with noun-noun NP report that 
910 out of 1,000 NPs can be translated with an av-
erage precision rate of 63%.  
In contrast to the previous research, we present a 
system that automatically learns surface patterns 
for finding translations of a given term on the Web 
without using a dictionary. We exploit the conven-
tion of including the source term with the transla-
tion in the form of recurring patterns to extract 
translations. Additional evident of data redundancy 
and transliteration patterns is utilized to validate 
translations found on the Web. 
3 The TermMine System 
In this section we describe a strategy for searching 
the Web pages containing translations of a given 
term (e.g., “Bill Clinton” or “aircraft carrier”) and 
extracting translations therein. The proposed 
method involves learning the surface pattern 
38
knowledge (Section 3.1) necessary for locating 
translations. A transliteration model automatically 
trained on a list of proper name and transliterations 
(Section 3.2) is also utilized to evaluate and select 
transliterations for proper-name terms. These 
knowledge sources are used in concert to search, 
rank, and extract translations (Section 3.3). 
3.1 Source and Target Surface patterns 
With a set of terms and translations, we can learn 
the co-occurring patterns of a source term E and its 
translation F following the procedure below: 
(1) Submit a conjunctive query (i.e. E AND F) for 
each pair (E, F) in a bilingual term list to a 
search engine. 
(2) Tokenize the retrieved summaries into three 
types of tokens: I. A punctuation II. A source 
word, designated with the letter "w" III. A 
maximal block of target words (or characters in 
the case of language without word delimiters 
such as Mandarin or Japanese). 
(3) Replace the tokens for E’s instances with the 
symbol “E” and the type-III token containing 
the translation F with the symbol “F”. Note the 
token denoted as “F” is a maximal string cover-
ing the given translation but containing no 
punctuations or words in the source language. 
(4) Calculate the distance between E and F by 
counting the number of tokens in between.  
(5) Extract the strings of tokens from E to F (or the 
other way around) within a maximum distance 
of d (d is set to 3) to produce ranked surface 
patterns P. 
 
For instance, with the source-target pair ("Califor-
nia," "�") and a retrieved summary of "... -�

�. 北� Northern California. ...," the surface 
pattern "FwE" of distance 1 will be derived. 
3.2 Transliteration Model  
TermMine also relies on a machine transliteration 
model (Lin, Wu and Chang 2004) to confirm the 
transliteration of proper names. We use a list of 
names and transliterations to estimate the translit-
eration probability function P(τ
 
| ω), for any given 
transliteration unit (TU) ω and transliteration char-
acter (TC) τ. Based on the Expectation Maximiza-
tion (EM) algorithm. A TU for an English name 
can be a syllable or consonants which corresponds 
to a character in the target transliteration. Table 2 
shows some examples of sub-lexical alignment 
between proper names and transliterations. 
Table 2. Examples of aligned transliteration units. 
Name transliteration Viterbi alignment 
Spagna v�5- s-v pag- � n-5 a--
Kohn �� Koh-� n-� 
Nayyar 	v
� Nay-	v yar- 
� 
Rivard 里YC ri-里 var- Y d-C 
Hall �' ha-� ll-' 
Kalam 藍 ka- lam-藍 
Figure 3. Transliteration probability trained on 
1,800 bilingual names (λ denotes an empty string). 
τ ω P(τ|ω) τ ω P(τ|ω) τ ω P(τ|ω)
- .458 b : .700 ye � .667 
� .271  λ .133  葉 .333 
 .059  , .033 z 	� .476 
a 
λ .051
 a .033  
λ .286 
� .923 an � .923  { .095 an 
� .077  � .077  z .048 
3.3 Finding and locating translations 
At runtime, TermMine follows the following steps 
to translate a given term E: 
(1) Webpage retrieval. The term E is submitted to 
a Web search engine with the language option 
set to the target language to obtain a set of 
summaries.  
(2) Matching patterns against summaries. The 
surface patterns P learned in the training phase 
are applied to match E in the tokenized summa-
ries, to extract a token that matches the F sym-
bol in the pattern. 
(3) Generating candidates. We take the distinct 
substrings C of all matched Fs as the candidates. 
(4) Ranking candidates. We evaluate and select 
translation candidates by using both data re-
dundancy and the transliteration model. Candi-
dates with a count or transliteration probability 
lower than empirically determined thresholds 
are discarded. 
I. Data redundancy. We rank translation candi-
dates by numbers of instances it appeared in the 
retrieved summaries. 
II. Transliteration Model. For upper-case E, we 
assume E is a proper name and evaluate each 
candidate translation C by the likelihood of C as 
the transliteration of E using the transliteration 
model described in (Lin, Wu and Chang 2004).  
39
Figure 4. The distribution of distances between 
source and target terms in Web pages. 
 
    
    
    
    
    
 % J T U B O D F
 $ P V O U
 $ P V O U
                                   
                     
Figure 5. The distribution of distances between 
source and target terms in Web pages. 
Pattern Count Acc. Percent Example distance
FE 3036 28.1% -	$拉v ATLAS 0 
EF 1925 45.9% Elton John' m
v  0 
E(F 1485 59.7% Austria(
��利 -1 
F �E 1251 71.2% -	$拉v � Atlas 1 
F(E 361 74.6% -	$拉v (Atlas 1 
F.E 203 76.5% Peter Pan. �- �  1 
EwF 197 78.3% C	- Northern California -1 
E,F 153 79.7% Mexico, �i  -1 
F � �E 137 81.0% a�R �� Titanic 2 
F � �E 119 82.1% -	$拉v �  �Atlas 2 
 
 
(5) Expanding the tentative translation. Based 
on a heuristics proposed by Smadja (1991) to 
expand bigrams to full collocations, we extend 
the top-ranking candidate with count n on both 
sides, while keeping the count greater than n/2 
(empirically determined). Note that the con-
stant n is set to 10 in the experiment described 
in Section 4. 
(6) Final ranking. Rank the expanded versions of 
candidates by occurrence count and output the 
ranked list. 
4 Experimental results 
We took the answers of the first 215 questions on a 
quiz Website (www.quiz-zone.co.uk) and hand-
translations as the training data to obtain a of sur-
face patterns. For all but 17 source terms, we are 
able to find at least 3 instances of co-occurring of 
source term and translation. Figure 4 shows distri-
bution of the distances between co-occurring 
source and target terms. The distances tend to con-
centrate between - 3 and + 3 (10,680 out of 12,398 
instances, or 86%). The 212 surface patterns ob-
tained from these 10,860 instances, have a very 
skew distribution with the ten most frequent sur-
face patterns accounting for 82% of the cases (see 
Figure 5). In addition to source-target surface pat-
terns, we also trained a transliteration model (see 
Figure 3) on 1,800 bilingual proper names appear-
ing in Taiwanese editions of Scientific American 
magazine.  
Test results on a set of 300 randomly selected 
proper names and technical terms from Encyclope-
dia Britannica indicate that TermMine produces 
300 top-ranking answers, of which 263 is the exact 
translations (86%) and 293 contain the answer key 
(98%). In comparison, the online machine transla-
tion service, Google translate produces only 156 
translations in full, with 103 (34%) matching the 
answer key exactly, and 145 (48%) containing the 
answer key. 
5 Conclusion 
We present a novel Web-based, data-intensive ap-
proach to terminology translation from English to 
Mandarin Chinese. Experimental results and con-
trastive evaluation indicate significant improve-
ment over previous work and a state-of-sate 
commercial MT system.  
References 
Y. Cao and H. Li. (2002). Base Noun Phrase Translation Us-
ing Web Data and the EM Algorithm, In Proc. of COLING 
2002, pp.127-133. 
W. Hutchins and H. Somers. (1992). An Introduction to Ma-
chine Translation. Academic Press. 
K. Knight, J. Graehl. (1998). Machine Transliteration. In 
Journal of Computational Linguistics 24(4), pp.599-612. 
P. Koehn, K. Knight. (2003). Feature-Rich Statistical Transla-
tion of Noun Phrases. In Proc. of ACL 2003, pp.311-318. 
K. L. Kwok, The Chinet system. (2004). (personal 
communication). 
T. Lin, J.C. Wu, J. S. Chang. (2004). Extraction of Name and 
Transliteration in Monolingual and Parallel Corpora. In 
Proc. of AMTA 2004, pp.177-186. 
M. Nagata, T. Saito, and K. Suzuki. (2001). Using the Web as 
a bilingual dictionary. In Proc. of ACL 2001 DD-MT 
Workshop, pp.95-102. 
F. A. Smadja. (1991). From N-Grams to Collocations: An 
Evaluation of Xtract. In Proc. of ACL 1991,  pp.279-284. 
40
