BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS
BASED ON LINGUISTIC AND STATISTICAL INFORMATION
Akira Kumano Hideki Hirakawa
R & D Center, Toshiba Corporation 
1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki, 210, JAPAN 
{ km n,hirakawa} @ist.rdc.toshiba.co.jp 
Abstract 
A method for generating a machine translation 
(MT) dictionary from parallel texts is described. 
This method utilizes both statistical information 
and linguistic information to obtain corresponding 
words or phrases in parallel texts. By combining 
these two types of information, translation pairs 
which cannot be obtained by a linguistic-based 
method can be extracted. Over 70% accurate translations
of compound nouns and over 50% of unknown
words are obtained as the first candidate from small
Japanese/English parallel texts containing severe
distortions.
1 INTRODUCTION 
Parallel texts (corpora) are useful resources for 
acquiring a variety of linguistic knowledge (Dagan,
1991; Matsumoto, 1993), especially for machine 
translation systems which inherently require cus- 
tomizations. Translation dictionaries are, needless 
to say, the most basic and powerful knowledge 
source for improving and customizing translation 
systems. Our research interest lies in automatic gen- 
eration of translation dictionaries from parallel 
texts. In this perspective, finding corresponding 
words or phrases in bilingual texts will be the fun- 
damental factor for accurate translation. 
Statistics-based processing has proven to be very 
powerful for aligning sentences and words in paral- 
lel corpora (Brown, 1991; Gale, 1993; Chen, 1993). 
Kupiec proposes an algorithm for finding noun phrases
in bilingual corpora (Kupiec, 1993). In this algorithm,
noun-phrase candidates are extracted from
tagged and aligned parallel texts using a noun phrase
recognizer, and the correspondences of these noun
phrases are calculated based on the EM algorithm.
Accuracy of around 90% has been attained for the
hundred highest ranking correspondences. Statistics-based
processing is effective when a relatively large
amount of parallel texts is available, i.e. when high 
frequencies are obtained. 
On the other hand, existing linguistic knowl- 
edge can be used for finding corresponding words or 
phrases in parallel texts. For example, possible tar- 
get expressions for a source expression provided by a 
translation system (linguistic knowledge source) can 
be a key in searching the corresponding expressions 
in a corpus (Nogami, 1991; Katoh, 1993). Yamamoto
(1993) proposes a method for generating a translation
dictionary from Japanese/English parallel
texts. In this method, English and Japanese compound
noun phrases are extracted from parallel
texts and their correspondences are searched for by
matching their possible translations generated by the
existing translation dictionary. However, the
acquirable noun phrases are limited by the linguistic
generative power of the translation dictionary. Furthermore,
this method utilizes no sentence alignment
information, which could reduce errors in finding
noun phrase correspondences.
This paper proposes a new method for generat- 
ing an MT dictionary from parallel texts. It uti- 
lizes both statistical and linguistic information to 
obtain corresponding words or phrases in parallel 
texts. By combining these two types of informa- 
tion, translation pairs which cannot be obtained by 
the above linguistic-based method can be extracted, 
and a highly accurate translation dictionary is generated
from relatively small parallel texts.
2 APPROACH TO BUILDING AN MT DICTIONARY
Our goal in building an MT dictionary from parallel
texts is to develop a robust method which
enables highly accurate extraction of translation 
pairs from a relatively small amount of parallel 
texts as well as from parallel texts containing 
severe distortions. 
In real-world applications, it is generally
extremely difficult, especially for MT users, to
obtain a large amount of high-quality parallel texts
in one specific domain. If the source and target
languages do not belong to the same linguistic family,
as with Japanese and English, the situation becomes
grave.
As one typical example of MT dictionary compilation,
we have selected Japanese and English patent
documents, which contain many state-of-the-art
technical terms. Although these documents are not
culturally biased, in many cases the organization
of the Japanese and English texts differs greatly, and
extensive changes are made in translating from
Japanese to English and vice versa. This makes
word extraction from patents difficult.
[Flowchart: Japanese Text and English Text, unit
extraction, Corresponding Unit Table, candidate
generation, linguistic and statistical information,
Translation Pairs]
Fig. 1: Flow of building an MT dictionary
from parallel texts
To solve this problem, we explored an appropriate
method of integrating linguistic information
and statistical information.
Linguistic information is useful in making
an intelligent judgment about correspondence
between two languages even from partial texts
because of its lexical, syntactic, and semantic knowledge;
statistical information is characterized by its
robustness against noise because it can transform
many actual examples into an abstract form.
The flow of our method, illustrated in Fig. 1, is
as follows:
(1) Unit Extraction:
Parts of documents ("units") are extracted from
both Japanese and English texts.
(2) Unit Mapping:
Each Japanese unit is mapped into English units.
(3) Term Extraction:
Japanese term candidates are extracted by the
NP recognizer.
(4) Translation Candidate Generation:
English translation candidates for Japanese
terms are extracted from English units.
(5) English Translation Estimation:
The translation candidates are evaluated to
obtain the best one.
The subsequent sections describe each process in
detail.
3 FORMING UNIT CORRESPONDENCES
The plausible hypothesis that parallel sentences
contain corresponding linguistic expressions is the
major premise in Kupiec (1993). This type of information
should be widely used. The problem is that
the alignment method based on the sentence bead
model (Brown, 1991) is not applicable to patent documents
due to severe distortions in their document
structures and sentence correspondences. Consequently,
we have introduced a concept called "unit,"
which corresponds to a part of a sentence, and adopted
a new method that extracts corresponding units by
using linguistic knowledge as a primary source of
information.
3.1 Extraction of Units
First, units are extracted from parallel texts.
A unit corresponds to a sentence or a phrase in the
text. Terms which should be extracted can be found
within a unit. The rest of the words in the unit are
called contextual information for the extracted
term. The size of units determines the effectiveness
of the succeeding unit mapping process. For example,
if we set noun phrases (entry words in a dictionary)
as a unit, no contextual information is available,
and thus the probability that corresponding
relations hold decreases. In our present implementation,
we set sentences as units as a first
approximation.
3.2 Mapping of Units
Next, the unit mapping process creates a corresponding
unit table from Japanese and English
units. This table stores the correspondence relationship
between units and its likelihood. The likelihood
is calculated based on the linguistic information
in an MT bilingual dictionary.
Our unit mapping algorithm is given below:
(1) Let J be the set of all content words in the
Japanese unit JU (m is the number of words):
$$J = \{J_1, J_2, \ldots, J_m\}$$
(2) Let E be the set of all content words in the
English unit EU (n is the number of words):
$$E = \{E_1, E_2, \ldots, E_n\}$$
(3) x is the number of J_i's whose translation candidate
list includes some E_j in E.
(4) y is the number of E_j's which are included in the
translation candidate list of some J_i in J.
(5) The correspondence likelihood CL is given by
$$CL(JU, EU) = \frac{x + y}{m + n}$$
For each JU, the M (currently 3) English units
with the highest CL(JU, EU) are stored in the corresponding
unit table.
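As a rough illustration of the mapping algorithm above, the following Python sketch computes CL for one Japanese unit against several English units and keeps the M best. The function names, the dictionary layout (a mapping from a Japanese word to a set of English translation candidates), and the placeholder identifiers such as "j_bit" are assumptions made for this example and are not part of the described system.

```python
from typing import Dict, List, Sequence, Set, Tuple


def correspondence_likelihood(
    ju_words: Sequence[str],
    eu_words: Sequence[str],
    bilingual_dict: Dict[str, Set[str]],
) -> float:
    """CL(JU, EU) = (x + y) / (m + n) over the content-word sets of both units."""
    j_set, e_set = set(ju_words), set(eu_words)
    m, n = len(j_set), len(e_set)
    if m + n == 0:
        return 0.0
    # x: Japanese words whose translation candidate list includes some word of E
    x = sum(1 for j in j_set if bilingual_dict.get(j, set()) & e_set)
    # y: English words that occur in the candidate list of some word of J
    covered: Set[str] = set()
    for j in j_set:
        covered |= bilingual_dict.get(j, set())
    y = len(e_set & covered)
    return (x + y) / (m + n)


def top_english_units(
    ju_words: Sequence[str],
    english_units: Sequence[Sequence[str]],
    bilingual_dict: Dict[str, Set[str]],
    m_best: int = 3,
) -> List[Tuple[int, float]]:
    """Return (unit index, CL) for the m_best English units with the highest CL."""
    scored = [
        (idx, correspondence_likelihood(ju_words, eu, bilingual_dict))
        for idx, eu in enumerate(english_units)
    ]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:m_best]


# Toy data (hypothetical): "j_*" stand in for Japanese content words.
toy_dict = {
    "j_bit": {"bit"},
    "j_line": {"line", "wire"},
    "j_form": {"form", "formation"},
}
ju = ["j_bit", "j_line", "j_form"]
english_units = [
    ["bit", "line", "form"],
    ["cell", "array"],
    ["line", "width", "region"],
]
ranking = top_english_units(ju, english_units, toy_dict)
print(ranking)
```

Treating J and E as sets follows the wording of steps (1) and (2); ties between units are broken here by unit index, which is an arbitrary choice of this sketch.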
4 GENERATING TRANSLATION 
CANDIDATES 
4.1 Extraction of Japanese Terms 
Errors in the extraction of terms and phrases 
from parallel texts eventually lead to a failure in 
acquiring the correct term/phrase correspondences. 
In Kupiec (1993) and Yamamoto (1993), term and
phrase extraction is applied to both sides of the parallel
texts. In contrast, we extract only Japanese terms
from units, thereby reducing the errors caused by
the term/phrase recognizer. Japanese NPs can be recognized
more accurately than English NPs because
Japanese has considerably fewer multi-category words.
In the current implementation, the following 
two types of term candidates are extracted by the 
NP recognizer: 
(A) Compound nouns (including verbal nouns)
Examples: terms meaning "open bit line configuration"
and "minimum featuring size"
(B) Unknown words (nouns, verbal nouns)
Examples: terms meaning "to laminate, to form"
and "polishing"
Our NP recognizer utilizes the sentence analyzer
of a practical MT system. The word dictionary 
includes approximately 70,000 Japanese entries. 
4.2 Finding Translation Candidates 
Generation of English translation candidates for 
a Japanese term is essentially based on the following 
hypothesis: 
Hypothesis 1 
The English translation of an extracted term in
a Japanese unit is contained in the corresponding
English unit.
Now an arbitrary word sequence in correspon- 
ding units can be a translation candidate of the 
Japanese term. We extract English translation candi- 
dates in two steps: 
Step 1 : Select English corresponding units. 
Step 2: Extract n-gram data from the units. 
Step 1 : 
When the extracted term appears in N Japanese 
units, N×M English units will be stored in the cor- 
responding unit table with their correspondence like- 
lihood. The N highest corresponding units within 
N×M combinations are extracted. When N is less 
than M, the M highest combinations are selected.
Step 2: 
Suppose that the correct English translation of
the Japanese term JW is EW, and that the number of
Japanese units in which JW appears is FJU(JW) (=
N). From Hypothesis 1 that the translation is contained
in the corresponding units EU1, EU2, ...,
EU_FJU(JW), EW would be a word sequence which
often appears in the corresponding units. In order to
get such EW, we use n-gram data.
The frequency of each n-gram (1 ≤ n ≤ 2 × (the
number of component words in JW)) in the
FJU(JW) English units is calculated, and then EW
candidates are ranked by frequency as EWC1,
EWC2, ..., EWCj. Because an EWC with a low frequency
in the corresponding units is unlikely to be the
correct translation, n-grams with a frequency less
than FJU(JW)/4 are heuristically excluded from the
candidates. N-grams containing a be-verb and n-grams
which start or end with a preposition or an article
are also excluded from the candidates.
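The two steps above can be sketched in Python as follows. This is only an illustrative reading of the procedure: the function names, the stop-word lists, and the choice to count an n-gram once per unit (in line with the later definition of FEU as a number of units) are assumptions of this example, and the 1/4 threshold is exposed as a parameter.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Illustrative word lists (assumptions); the paper does not enumerate them.
BE_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being"}
PREPOSITIONS_AND_ARTICLES = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with", "from",
}


def select_units(
    table_entries: Sequence[Tuple[float, List[str]]],
    n_japanese_units: int,
    m_best: int = 3,
) -> List[List[str]]:
    """Step 1: keep the max(N, M) English units with the highest CL."""
    keep = max(n_japanese_units, m_best)
    ranked = sorted(table_entries, key=lambda entry: -entry[0])
    return [words for _, words in ranked[:keep]]


def candidate_ngrams(
    units: Sequence[Sequence[str]],
    jw_length: int,
    fju: int,
    min_ratio: float = 0.25,
) -> List[Tuple[str, int]]:
    """Step 2: rank n-grams (1 <= n <= 2 * jw_length) by unit frequency."""
    max_n = 2 * jw_length
    counts: Counter = Counter()
    for words in units:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        counts.update(seen)  # each n-gram is counted once per unit
    result = []
    for gram, freq in counts.items():
        if freq < fju * min_ratio:  # too rare in the corresponding units
            continue
        if any(word in BE_VERBS for word in gram):
            continue
        if gram[0] in PREPOSITIONS_AND_ARTICLES or gram[-1] in PREPOSITIONS_AND_ARTICLES:
            continue
        result.append((" ".join(gram), freq))
    result.sort(key=lambda item: (-item[1], item[0]))
    return result


# Toy corresponding units (hypothetical), already tokenized and lowercased.
toy_units = [
    "the open bit line configuration is used".split(),
    "an open bit line configuration".split(),
    "open bit line of the cell".split(),
    "the cell array".split(),
]
cands = dict(candidate_ngrams(toy_units, jw_length=4, fju=4))
strict = dict(candidate_ngrams(toy_units, jw_length=4, fju=12))
```

With fju=4 the frequency threshold is 1, so only the be-verb and edge-word filters remove candidates; with fju=12 the threshold rises to 3 and the full four-word candidate drops out while "open bit line" remains.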
5 ESTIMATING ENGLISH TRANSLATIONS
The translation likelihood (TL) of one translation
candidate EWCi for the term JW is defined as:
$$TL(JW, EWC_i) = F(TLS(JW, EWC_i),\ TLL(JW, EWC_i))$$
where TLS(JW, EWCi) is the "Translation Likelihood
based on Statistical information," and TLL(JW,
EWCi) is the "Translation Likelihood based on Linguistic
information."
5.1 Statistical Information
TLS(JW, EWCi) is the frequency score based on
the statistical information from Hypothesis 1 that a
word sequence which appears as often in the corresponding
units as JW does in the Japanese units is more
likely to be EW. It is quantitatively defined as the
probability that the translation candidate appears in
the corresponding units. That is,
$$TLS(JW, EWC_i) = \frac{FEU(EWC_i)}{FJU(JW)}$$
where FEU(EWCi) is the number of corresponding
units in which EWCi appears.
5.2 Linguistic Information 
TLL(JW, EWCi) is the word similarity score
based on the accuracy of the correspondence between
the term JW and the translation candidate EWCi,
obtained by using linguistic information in the MT
bilingual dictionary. Suppose one translation candidate
of the term JW = wj_1, wj_2, ..., wj_k is
EWCi = we_1, we_2, ..., we_l.
Then we use the following hypothesis.
Hypothesis 2
(a) If the length of EWCi is close to the length
of JW, JW and EWCi are likely to correspond
to each other.
(b) JW and EWCi with more word translation
correspondences are likely to correspond to each
other.
Under this hypothesis, the following correspondence
relation (1) is the best: term JW and translation
candidate EWCi have the same length k (= l), and
all of their component words correspond in the dictionary.
Here wj_i ⇒ we_i indicates that we_i is included in
wj_i's translation candidates in the MT bilingual
dictionary.
(1) wj_1 ⇒ we_1, wj_2 ⇒ we_2, ..., wj_k ⇒ we_k
More generally, the relation of each word (wj)
in term JW and each word (we) in translation candidate
EWCi is classified into the following four
classes:
i) wj ⇒ we
ii) wj → we
iii) wj → φ
iv) φ → we (φ indicates no word)
i) shows a pair with a translation relation in the
bilingual dictionary; ii) shows a pair whose
correspondence is not described in the bilingual
dictionary; iii) and iv) indicate that the corresponding
word for wj or we is missing. In iii), JW is longer
than EWCi; and vice versa in iv).
In order to estimate the correspondence between JW
and EWCi, i) and ii) are scored by similarity to the
virtual translation which holds the relation (1).
When the number of words is the same, score Q
(constant) is given. αQ (α > 0) is added to Q when
there is a translation relation, to reflect the higher
reliability of i). Therefore, Q + αQ = (1 + α)Q is given to
a word pair of i), and Q to a word pair of ii).
Now since we disregard the word order of a
term, JW and EWCi are represented as sets of words:
$$JW = wj_1, wj_2, \ldots, wj_k \rightarrow \{wj_1, wj_2, \ldots, wj_k\}$$
$$EWC_i = we_1, we_2, \ldots, we_l \rightarrow \{we_1, we_2, \ldots, we_l\}$$
The number of words with a lexical correspondence
relation in wj and we, the number of words in
wj without a relation, and the number of words in
we without a relation are counted as x, y, and z,
respectively. That is, x + y = k and x + z = l.
TLL(JW, EWCi) is given as the ratio of the
score of EWCi to the score of the virtual
translation.
When y ≥ z,
$$TLL(JW, EWC_i) = \frac{x(1+\alpha)Q + zQ}{(x+y)(1+\alpha)Q}$$
Otherwise,
$$TLL(JW, EWC_i) = \frac{x(1+\alpha)Q + yQ - (z-y)Q}{(x+y)(1+\alpha)Q}$$
Thus,
$$TLL(JW, EWC_i) = \begin{cases} \dfrac{x(1+\alpha) + z}{(x+y)(1+\alpha)} & (y \ge z) \\[2ex] \dfrac{x(1+\alpha) + 2y - z}{(x+y)(1+\alpha)} & (\text{otherwise}) \end{cases}$$
By definition, TLL(JW, EWCi) ≤ 1. The value
of α is determined as 2 by evaluating sample
translation pairs.
The following are the TLLs of three EWCs for
a JW meaning "open bit line configuration," which
consists of four component words (k = 4) meaning
"open," "bit," "line," and "method, process,"
respectively.
bit line configuration
x = 2, y = 2, z = 1 ∴ TLL = (2×3+1)/(4×3) = 0.58
open bit line
x = 3, y = 1, z = 0 ∴ TLL = (3×3)/(4×3) = 0.75
open bit line configuration
x = 3, y = 1, z = 1 ∴ TLL = (3×3+1)/(4×3) = 0.83
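To make the piecewise definition concrete, here is a small Python sketch that counts x, y, and z for a candidate and evaluates TLL with α = 2. The helper names, the greedy one-to-one matching used to count x, and the placeholder identifiers such as "j_open" (standing in for the Japanese component words) are assumptions of this example.

```python
from typing import Dict, Sequence, Set, Tuple


def count_xyz(
    jw_words: Sequence[str],
    ewc_words: Sequence[str],
    bilingual_dict: Dict[str, Set[str]],
) -> Tuple[int, int, int]:
    """Count matched pairs (x), unmatched JW words (y), unmatched EWC words (z).

    Matching is greedy and one-to-one; this is a simplification for the sketch.
    """
    unused = list(ewc_words)
    x = 0
    for wj in jw_words:
        for we in unused:
            if we in bilingual_dict.get(wj, set()):
                unused.remove(we)  # safe: we leave the loop immediately
                x += 1
                break
    return x, len(jw_words) - x, len(ewc_words) - x


def tll(x: int, y: int, z: int, alpha: float = 2.0) -> float:
    """Linguistic translation likelihood; Q cancels out of the ratio."""
    if x + y == 0:
        raise ValueError("JW must contain at least one word")
    denom = (x + y) * (1 + alpha)
    if y >= z:
        return (x * (1 + alpha) + z) / denom
    return (x * (1 + alpha) + 2 * y - z) / denom


# Placeholder dictionary for the four component words of the example term.
toy_dict = {
    "j_open": {"open"},
    "j_bit": {"bit"},
    "j_line": {"line"},
    "j_method": {"method", "process"},
}
jw = ["j_open", "j_bit", "j_line", "j_method"]

for candidate in ("bit line configuration", "open bit line", "open bit line configuration"):
    x, y, z = count_xyz(jw, candidate.split(), toy_dict)
    print(candidate, (x, y, z), round(tll(x, y, z), 2))
```

The three candidates yield the same (x, y, z) triples and, after rounding to two decimals, the same TLL values as the worked example above.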
5.3 Combination of Statistical and Linguistic
Information
We define the translation likelihood TL(JW,
EWCi) as below:
$$TL(JW, EWC_i) = \frac{m\,TLS(JW, EWC_i) + n\,TLL(JW, EWC_i)}{m + n}$$
When the value is examined with the ratio n/m held
constant, a low value of TLS(JW, EWCi) adversely affects
the total score, especially when the frequency
FJU(JW) is 5 or less. This shows that TLS(JW,
EWCi) should be weighted heavily for JWs which
appear often, but not for JWs with a low frequency.
Therefore we tentatively define β = n/m as a
function of the frequency FJU(JW), because β should be
higher when FJU(JW) is low.
$$\beta = G(FJU(JW)) = \frac{p}{\{FJU(JW)\}^q - r} + s$$
where r is the possible minimum frequency, and s is
the limit of β as the word frequency becomes high enough.
Values p=4, q=1, r=1, and s=0.5 are used in the following
experiments. By introducing β, F is rewritten
as:
$$F(TLS(JW, EWC_i),\ TLL(JW, EWC_i)) = \frac{TLS(JW, EWC_i) + \beta\,TLL(JW, EWC_i)}{1 + \beta}$$
In case {FJU(JW)}^q is equal to or less than r,
β is meaningless. For such JWs, TL(JW, EWCi) is
redefined simply as:
$$TL(JW, EWC_i) = TLL(JW, EWC_i)$$
Finally the translation candidate EWC i with 
the largest value of TL(JW, EWCi) is assumed to be 
the correct English translation. 
Table 1 shows the translation candidates with the
three highest TLs for the JW meaning "open bit line
configuration." Its frequency in the Japanese text is
FJU(JW) = 19 (β = 4/(19-1) + 0.5 = 0.72).
Consequently, the correct
translation EWC3, open bit line configuration, is
obtained.
Table 1: Estimation of English translation
EWCi                          FEU   TLS    TLL    TL
bit line configuration         19   1.00   0.58   0.82
open bit line                  18   0.95   0.75   0.86
open bit line configuration    18   0.95   0.83   0.90
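The combination of the two scores can be checked with a short Python sketch that recomputes the TL column of Table 1 from FEU, FJU(JW) = 19, and the unrounded TLL values. The function names are assumptions of this example; the parameter values p=4, q=1, r=1, and s=0.5 are those stated in the text.

```python
def beta(fju: float, p: float = 4.0, q: float = 1.0, r: float = 1.0, s: float = 0.5) -> float:
    """Weight of TLL relative to TLS; larger for low-frequency terms."""
    return p / (fju ** q - r) + s


def tls(feu: float, fju: float) -> float:
    """Statistical likelihood: share of corresponding units containing the candidate."""
    return feu / fju


def translation_likelihood(
    feu: float,
    fju: float,
    tll_value: float,
    p: float = 4.0,
    q: float = 1.0,
    r: float = 1.0,
    s: float = 0.5,
) -> float:
    """TL = (TLS + beta * TLL) / (1 + beta), or TLL alone when beta is undefined."""
    if fju ** q <= r:
        return tll_value
    b = beta(fju, p, q, r, s)
    return (tls(feu, fju) + b * tll_value) / (1 + b)


FJU = 19
# (candidate, FEU, TLL) taken from Table 1, with TLL kept as exact fractions.
candidates = [
    ("bit line configuration", 19, 7 / 12),
    ("open bit line", 18, 9 / 12),
    ("open bit line configuration", 18, 10 / 12),
]
scores = {name: translation_likelihood(feu, FJU, t) for name, feu, t in candidates}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Computed from unrounded inputs, the scores agree with the TL column of Table 1 to within about 0.01, and the ranking is the same, with "open bit line configuration" on top.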
6 EVALUATION AND DISCUSSION 
To evaluate this method, we have estimated 
English translations of Japanese terms in seven paral- 
lel texts (Japanese specifications of patents on semi- 
conductors and their English translations by human 
translators) and compared the translations with the 
correct data given by experts in building an MT
dictionary. The size of a Japanese text is 7,508 to
26,927 characters in 127 to 616 sentences; 99,286
characters in 2,148 sentences in total. Examples of
correct translation pairs estimated with the highest TL
are listed in Fig. 2.
Compound nouns:
minimum featuring size
element separation region
open bit line configuration
column address strobe
cell array
Unknown words:
polishing
collector
to form
Fig. 2: Correct translation pairs (English side only;
the Japanese terms are illegible in the source)
Table 2 shows the ranking of the correctly estimated
translation pairs in seven sample texts. The
upper row shows the average over the seven individual
texts; the lower row shows the result using all seven
texts at one time. The translation of over 70% of
compound nouns is obtained as the first candidate,
and over 80% in the top three. The result for
unknown words is 54.0% and 65.0%, respectively. Though the
accuracy for the unknown words is relatively low,
their estimation was impossible for Yamamoto
(1993). Here, the terms whose correct translations
are not found in the English texts are excluded from
the evaluation. Such data occur when human experts
give a noun translation for a Japanese verbal noun
term which is translated as a verb in the actual
text. The ratio of this kind of translation pair is
about 3%. The rate of the correct data is calculated
as a ratio of the total occurrences.
The accuracy for the average of unknown words
is 52.4% in the top three. The result using all texts
is significantly better than the average because the
statistical information is the major factor in the
current implementation. Use of more linguistic
information such as in Dagan (1991) and Matsumoto
(1993) would improve the total performance.
Linguistic information has proven effective to 
estimate translations of low-frequency terms. Of 
terms which appeared only once in a Japanese text, 
215 translations are obtained correctly as the first 
candidate from 327 terms (65.7%) in seven texts. 
The fourth example of compound nouns in Fig.
2 shows the advantage of statistical information,
because the correct translation was obtained in spite
of wrong word segmentation. The Japanese term
really consists of three words, which correspond to
"column," "address," and "strobe," respectively. But
word segmentation output four words, because the word
for "strobe" is unknown and a part of it is known as
"strike."
Table 2: Accuracy of translation estimates
Compound nouns (occurrences)
            total   first estimate    top 3 estimates
1 text      [illegible]
7 texts     3,224   72.9% (2,349)     83.3% (2,680)
Unknown words (occurrences)
            total   first estimate    top 3 estimates
1 text      55.6    30.1% (16.7)      52.4% (29.1)
7 texts     389     54.0% (210)       65.0% (253)
The cases where no correct translation has been
obtained need to be examined. The major reasons
for failures are:
1. Errors in mapping corresponding units.
2. Errors in word segmentation of unknown
compound words.
Unit mapping errors occur when a one-to-one
unit correspondence does not exist. The experiment
using one text shows that 12 out of 98 Japanese sentences
have no one-to-one corresponding English sentence.
For better unit correspondence, the units
should be smaller, for example, a clause or a verb
phrase, so as to make the correspondence accuracy and
frequency in text higher and statistical information
more effective. This would improve the unit mapping
when one Japanese sentence is translated into several
English sentences or vice versa.
Segmentation errors of unknown words
often arise in the case of Katakana compound words.
Katakana is the phonetic alphabet in Japanese for
spelling foreign words. Since many compound
nouns in a technical field consist of Katakana with
no space between component words, a much larger
lexicon will contribute to more accurate segmentation.
7 CONCLUSION 
An MT dictionary has been generated from
Japanese and English parallel texts. The method
proposed in this paper assumes unit correspondence
and utilizes linguistic information in an MT bilingual
dictionary as well as statistical information,
namely, word frequency, to estimate the English
translation. Over 70% accurate translations for compound
nouns are obtained as the first candidate from
small (about 300 sentences) Japanese/English parallel
texts (patent specifications) containing severe
distortions. The accuracy of the first translation
candidates for unknown words, which cannot be
obtained by a linguistic-based method, is over 50%.
The current implementation shows promising
results for a difficult target (patent texts) despite
relatively simple linguistic knowledge. The overall
performance will be improved by using more linguistic
knowledge and optimizing parameters calculated
by statistical information.
References 
Brown, P. F.; Lai, J. C.; and Mercer, R. L. (1991).
"Aligning sentences in parallel corpora." In
Proc. of the 29th Annual Meeting of the ACL,
169-176.
Chen, S. F. (1993). "Aligning sentences in bilingual
corpora using lexical information." In Proc. of
the 31st Annual Meeting of the ACL, 9-16.
Dagan, I.; Itai, A.; and Schwall, U. (1991). "Two
languages are more informative than one." In
Proc. of the 29th Annual Meeting of the ACL,
130-137.
Gale, W. A., and Church, K. W. (1993). "A program
for aligning sentences in bilingual corpora."
Computational Linguistics, 19(1), 75-90.
Katoh, N. (1993). "Word selection by searching the
translation candidates on monolingual texts in
target language." Technical Report of IEICE,
NLC93-32. (in Japanese)
Kupiec, J. (1993). "An algorithm for finding noun
phrase correspondences in bilingual corpora." In
Proc. of the 31st Annual Meeting of the ACL,
17-22.
Matsumoto, Y.; Ishimoto, H.; and Utsuro, T.
(1993). "Structural Matching of Parallel
Texts." In Proc. of the 31st Annual Meeting of
the ACL, 23-30.
Nogami, H.; Kumano, A.; Tanaka, K.; and Amano,
S. (1991). "Learning of translation words using
target-language documents." In Proc. of 42nd
Annual Meeting of IPSJ, 2C-6. (in Japanese)
Yamamoto, Y., and Sakamoto, M. (1993).
"Extraction of technical term bilingual dictionary
from bilingual corpus." IPSJ SIG Notes,
NL94-12. (in Japanese)
