Automated Generalization of Translation Examples 
Ralf D. Brown 
()a.lnc'g'c Mellon Univclsity, l~a.nguage Technologies 111stil;ul:('. 
l)ittsl)urgh, I)A \]5213-3890 
ralf+ ~,/cs. (;1 n u. cdu 
Abstract 
l~t:ovious work has shown thai adding gen- 
era.liza.tion of the exa.ml)les in the corpus of 
a.n exa.ml)le-1)ased machine tra.nsla.tion (I'31LMT) 
system ea, n reduce 1;he re(ltfire.d amount o\[' pre- 
tra.nsla.ted exa.ml)le text l)y as \[iltl(;\]l }is a.ii order 
o\[' magnitude for Spa.nish-l';nglish and l,'rench- 
l~;nglish I+',I~Mrl '. Using word clusto.t:itlg to a.tt- 
toma.ticaJly generalize the example eorl>uS ca.n 
provide the majority o\[' this inlprovement for 
l,'rench-l'hlglish wil;h no nlanuaI illtervelltioll; 
the prior work required a. la.rge I)iliugual dic- 
lionary ta.gged wil;}l 1)a.rls of speech aud the 
manual crea.tion of gl'.%llllll.:ll" rules. /~y seeding 
the clustering with a. small a.mou nt of manually- 
crea.ted iM'orma.tion, even t)el;ter t)erl'ornla.nce 
ea.n be a.chieved. This pa.l)ev descril)es a. method 
whereby bilingual word clustering ca.n 1)e per- 
\[brined using sta.nda.rd 'nto,zoli'n.qttal document 
cl ustering techniques, a, nd its e\[l'ectiveness at re- 
d ucing the. size of the exam l)le corpus ,'eq u ire(I. 
1 hltroduction 
I';xanq)le-I{ased Machine 'l'ranslaLion (I';I{M'I') 
relies on a. collection of textual units (usually 
sentences) and their tra, nsla, l,ions. New text |,o 
1)e tra, nsla, ted is nla,tched a,ga, inst the source- 
langua.ge ha.If of the colh'x;tion, and the corre- 
sponding tra.nsla.tions from the ta.rget-langua.ge 
half axe used to generate a. l;ra.nsh~tion of the 
new text. 
l~xperience with several  pairs has 
shown that producing a.n EBMT system which 
provides reasomt.ble t, ra.nsla.tion coverage of un- 
restricted texts using simple textual matching 
requires on the order of two million words of 
pre-translated texts (one million words in each 
l;mguage); if either la.nguage is highly in\[letting, 
polysynthetic, or (worse yet) a.gglu tina.tive, even 
llloro text will be required. It ma.y I)e difficult, 
time-consuming, and expensive to obtain tha.t 
much pa.rallel text, pa.rtieula.rly for lesser-used 
la.nguage pairs. Thus, it' one' wishes to develop 
a. new tr,~nslator ra.pidly a.nd a.t low cost, tech- 
niques are needed which permit the 131~MT sys~ 
tom to 1)erform just as well using substantia.lly 
less example text. 
lk~th the C,a.ijin Ii;I~MT system 1)y Veale and 
~ " r \Va.y (\]997) and 1,he a.uthor's l~\]~h/l I sySteln 
(I999) COllVel't {;he examples in the corpus into 
teml)la.tes against which the new texts ea.n I)e 
ma.tched. (la.ijin va.ria.I)lizes the well-formed 
segment mappings between source a.nd ta.rget 
sentences 1;}ta.t it is able to find, using a. closed 
set o\[' markers to segment 1.he input into l)hrasos. 
q'he a.utllor'.~ syslem i)er\['orms its generaliza.tioll 
using equix,a.lence classes (both syntactic a.nd se- 
ma.ntic) a.nd a. production-rule grammar. First, 
any occurrences of terms conta,ined in a,n equiv- 
alence class are replaced l)y a. token giving 1.he 
name of the equiwdence (:lass, a.nd then the 
gramma.r rules a~re used to replace l)a.tterns of 
words a.nd tokens I)y more genera.l tokens (such 
as <NI'> for noun phrases). (\]{town, 1999) 
showed t\]la.t one ca.n reduce the corpus size by 
as much as a.\]l order o\[' ma.gnitude in this way. 
(liven l;ha.t, explicit, ma.llua.lly-gom~ra.ted equi- 
va.lence classes red uce the need for exam l)le text, 
an obvious extelmion would l)e I;o ;~tte\]nl)t lo 
gelleral.e tll(~se classes a.ul;olna.tica.l\[y frolll the 
corpus of pre-tra.nslated exanlples. This pa.- 
1)or describes ()lie ~q)l)roa,ch to a.utoma.ted ex- 
1;racl;ioll of equiva.lence classes, using clustering 
teclmiques. 
The rema.inder of this l)aper describes how 
to 1)erform bilingua.1 word clustering using stan- 
dard monoh;ngual document clustering tech- 
niques 1)y converting the problem space; the 
va.rious clustering algorithms which were inves- 
tiga.ted; mid the effectiveness of generaliza.tion 
using the derived clusters a.t reducing the re- 
quired amount of example text. 
2 Converting the Problem 
The task of clustering words a.ccording to their 
occurrence pa, tterns ca, n 1)e testa,ted as a, sta, n- 
dard document-clustering task by converting 
the l)rol)lem sl)a.ce. For each unique word to be 
clllstered, crea.te a. l)seudo-doculnent conta.ining 
the words of the contexts in which theft word N)- 
125 
pears, and use the word itself as tile document 
identifier. After the pseudo-documents are clus- 
tered, retrieving the identitier for each docu- 
ment in a particular cluster l)roduces tile list of 
words occurring in su\[\[iciently similar contexts 
to be considered equivalent \['or the l)urposes of 
generalizing an EBM(1 ~ system. 
By itself, this approach only produces a 
monolingual clustering, but we require a, bilin- 
gum clustering fox" proper generalization since 
different senses of a word will appear in differing 
contexts. The method of Barrachina and Vilar 
(1999) provides the means for injecting bilingual 
information into the clustering process. 
Using a bilingual dictionary -- which may be 
created fl'om the corl)us using statistical meth- 
()<Is, such as those of Peter \]~rown el al (71990) or 
the author's own l)r(~viotls• work (Brown, 11997) 
and the parallel text, create a rough ma.pping 
1)etween the words in the source- half of 
each translation example in tile corpus and tile 
target- half el'that example. Whenever 
there is exactly one l)ossible translation candi- 
date listed for a word by the mapping, generate 
a bilingual word pair consisting of the word and 
its translation. This word pair will be treated 
as an indivisible token in further processing, 
adding bilingual information to the clustering 
process. \]eorming 1)airs in this manner causes 
each distinct translation of a. word to be treated 
as a separate sense; although translation pairs 
do not exactly correspond to word senses, pairs 
can be formed without any additional knowl- 
edge sonrces and are what tile EBM:I' systern 
requires for its equivalence classes. 
1,'or every unique word pair found in the 1)re- 
vious step, we a.ccurnulate counts for each word 
in the surrounding context of its occurrences. 
The context of ~n occurrence is defined to be 
tile N words immediately prior to and the N 
words immediately following the occurrence; N 
currently is set to 3. Because word order is im- 
portant, counts are accumulated separately for 
each position within the context, i.e. for N = 3, 
a particular context word may contribute to any 
of six different counts, depending on its loca- 
tion relative to the occurrence. Further, as the 
distance ffoln the occurrence increases, the sur- 
rounding words become less likely to be a true 
part of the word-pair's context, so tile counts 
are weighted to give the greatest importance 
to the words immediately adjacent to the word 
pair being examined. Currently, a silnple linear 
decay fl'om 1.0 to -~ is used, but other decay 
functions such as the reciprocal of the distance 
are also possible. Tile resulting weighted set of 
word counts tbrms the above-mentioned I)seudo- 
document which is converted into a term vector 
Ibr cosine similarity computations (a standaM 
measure in information retrieval, defined as the 
dot product of two term vectors normalized to 
unit length), 
If the clustering is seeded with a. set of ini- 
tial equivalence classes (which will be discussed 
below), then the equivalences will be used to 
generalize the contexts as they are added to tile 
overall counts \['or tile word pair. Any words in 
the context for which a unique correspondence 
can be found (and f'or which the word and its 
corresponding translation are one of the pah:s 
in an equivalence class) will be counted as if the 
name of the equivMence class had been l)resent 
in the text rather than the original word. For 
example, if days of the week are an equivalence 
class, then ':(lid he come on Fridas:' and "did 
he leave on Mends3:' will yield identical con- 
text vectors for "come" and "leave", maldng it 
easier \['or those two terms to chlster together. 
To illustrate the conversion process, consider 
tile li'rench word "('inq" in two examl)les where 
it translates into English as ::five" (thus forming 
tile word pair "cinq_fi ve") : 
<NUt> <NI/L> Le ci,zq jours dcpuis la 
<NUL> <NUL> 73e five dags si~zce lhe 
ellcs com'me~,cc~w~,t c~z cinq jours .<NUL> 
they will begin i~), five days .<NUL> 
where <NUt> is used as a placeholder when 
the word pair is too near the beginning or end 
of the sentence for the flfll context to be present. 
Note that the word order on the target- 
side \]s not considered when building the term 
vector, so it need llOt be the same as on the 
source- side; the examples were chosen 
with the same word order merely for clarity. 
The resulting ternl vector for "cinqJive" is 
a.s follows, where the numbers in parentheses 
indicate the context word's position relative to 
the word pair under consideration: 
Word Occur Weight 
<NWl.>(-3) 1 0.333 
elles(-3) 1 0.333 
1 0.667 
commenceront(-2) 1 0.667 
Le(q) 1 1.ooo 
en(-1) 1 1.000 
jours(J) 2 2.000 
depuis(2) 1 0.667 
.(2) 1 0.667 
la(3) 1 0.333 
<NUL>(3) 1 0.333 
Term vectors such as tile above are then clus- 
tered to determine equivalent usages among 
words. 
126 
3 Clustering Approaches 
A tota.l of six clustering a.lgoHthms ha.v(~ I)oen 
1.ested; th roe variants of grout)-a.vora.go. ('\]tlsl.('.,'- 
ins a.nd i, hree of agglomera.tive clustering. In- 
cl'omental group-a.vera.ge clustering was ilnple- 
mented tirst, to provide a. proof of concopt, 
borore the COml)uta.tiona.lly more expensive a.g- 
glomerative (bottom-up) clusteril~g was i lnple- 
mented. 
The incremental groul)-a.vera.ge a.lgoril;hms all 
exa.mine each word pair in turn, computing a 
similsu:ity measure to evory existing clustor. If 
th(; 1)(;st siinila.rity measur(; is a l)ov(~ a. l)r(;del;er - 
nfin('d threshold, the new word pair is i)laced 
in tile corresponding cluster; otherwis% a now 
(;\]usi;er is crea.ted. The th roe varianl;s diltT, r only 
in tile simila.rity moasure eml)loyed: 
:1. cosin(; similarity 1)(;1;w(~(;n 1,h(~ i)s(;u(lo<loc- 
umonl, a.nd the centroid o1" the oxisting clus- 
ter (standard grOUl)-a.vera.ge clusto.rillg;) 
2. a.verage of' i;\]lo cosine similaril;ies l)otwe(;n 
the l)seudo-docuni(;nl; a.nd all nl(;nll)ers o\[' 
the 0xisting (:lust(;,' (a.voragc-link clustor- 
ing) 
3. square root of' 1;h(; a.vcrag(; of 1;lie S(luared 
cosine simila.r\]l;io.s I)ctweon l;he l)seudo- 
(locuinent an(\] all molnl)(~,'s or l he existing 
('hlster (rool.-nloa.n-sqllar(, nlo(lifical.ion of 
average-liNl¢ clustering) 
Thoso i;hro(~ vnria.tiol,S give hlc,'eas\]ngly IIl()l'(': 
weight to 1,ho nea.rer mcml)ers of' tho oxist.ing 
cl ust;cr. 
Tim t)o(;1;oin-u 1) a.gglomera.tive algoril;hms all 
funcl;ion I)y (;tea.tills a. clustor For each I)Seudo- 
(\[o(:unlenl,, t;hon r(;i)(;a.1;(;(lly ln(u:ging l:li(; two 
clusl;ors witli the \]iighesl; siinila.ril,y score unl,il 
110 (,WO C\]tlS|,orH \]lSt,vo ,% ,q\]iilila,ril;y .~(:Ol'(~ (~x('.(~(;d- 
ing a l)re(Iol;ornlino(\] 1;hl:eshold. The three vari-- 
;/,IIi;S }/,ga, ill differ ()lily ill 1;lio S\]liiilaril,y lllO}lStll'O 
O llll)loyc(l: 
\]. cosine simila.rity between clustor centroids 
(st~ul(la.rd agglomei:a.tivo clustering) 
2. a.vera.ge of cosine sitnilariLy 1)etween men> 
l)ers of the two clusters (a.vera.ge-tink) 
3. nia.xilnal cosino similarity betweon a.ny pair 
Of ni('.nll)oi:s of l,\]ie i;wo clusl;(',rs (single-lin\]{) 
l"oi: (;acli of the va.i:ia.tions a.bovc, the l)r(~(l(;1,er - 
niincd (;hreshol(I is a. funci;ion of word \['r(xluoncy. 
Two words wliich each a.l)l)ea.r only onc(Y in the 
entire tra.ining text a.nd ha.re a. high simila.rib, 
score a.ro more likely to ha.re a.l)l)ea.red in siniila.r 
contexl;s I)y cohicide.nce l:ha.n 1;wo wor(ls which 
each a,1)pea.r ill 1;he traJliil/g 1;(;xi; lifty tin-its. 
l,'ro( t UO I/cy 
5 (J 
7 
8- 
10 - 
\]2 - \] 5 
>16 
Threshold- 
1 \] .00 
2 0.85 
3 0.80 
4 0.75 
0.70 
0.65 
0.60 
9 0.55 
1 \] 0.50 
0.45 
0.40 
I,'igure \] : Chtsleling 'l'hro.shold t unction 
I~br exa.ml)le ~ when using threo words on ei- 
thor side as context, a.nd a. linca.r dcca.y in t;erm 
weights, two singleton words achievo a. sinlitar- 
it; 5, scor(', of ().321 (1.000 is the ma.ximum t)os- 
siblc) if just one o\[" the immodia,tely a(lja,ccnt 
words is the sa.mc for 1)oth, evon if none of' 1;ho 
other five context words axe the sa, mc'. /ks the 
number o\[' occul'renc('s increases, l;ho contri\])u- 
l,ion t,o the simila.rit,y score o\[' hidividua.l words 
decreases, ma.king it less likely 1;o encounter a 
high score by chance. Ilencc, we wish to set 
a. si;ricl;er 1;hres\],ol(l \['or clustering low-frequollcy 
words i;hati higho,'-l'roquelmy words. 
The thr(~shold Function is exI)ressc(l in 1,(~rms 
of tim fr('(lU(mcy o1" occurrence in th(~ 1,ra.il,ing 
1.exl.s. I"or si,,gle, ull('lus(;ere(\[ \vord pairs, I, ho 
t'requollcy is sinll)ly 1,11o numb(~r ol' 1;hnos I, he 
wor(I 1)a.ir was (m(:ounl,(u'(,d. When I)e,'\['orn> 
ing groul)-a.\,erag(; (;lu.qlx;ring, the l'requoncy as- 
signod l;() a. ('\]/ml;('.r is tim sum o\[' (;h(; frequencios 
of a.ll the members; for agglomera.l.ive (:lust('.ri)lg, 
the \['re(ltten(;y is the sum when using cent;roids 
and 1,he lnaximunl fre(lucn('y <tnlong the m(;m- 
I)oJ'S wllen using l;he average or lmarest-,,(;ighl)or 
,~imila.rity. The va.lu(~ of' the (;hr(>shold \['or a. given 
pair of ('lusi,('ms is the va.lue of tim thr(~,~hold 
I'unction a.t the lower word frequency. \]:igure 1 
sl,ows l,h(', threshold tunction used in the (,Xl)Cr- 
iments whose results a, rc rel)ortcd here; cluster- 
ins is only allowed if the simila, rity measure is 
a.1)ove the indicated threshold vahm. 
On its own, clustering is quite suc(:essfill for 
generalizing EBMT ('Xaml)les, I)ut the fully- 
a.utomated t)roducl;ion of clusters is not com- 
t)a.tible with adding a, l)roduction-rule gra.mma.r 
as (lcscril)od in (l~rown, \]999). Therel'ore, the 
clustering process may 1)e seeded with a. set of 
m an u a.lly-gc'nera.ted clusters. 
VVhell seed clusters m'e a.va.ilablo., the cluster- 
ins process is moditied in two ways. First, l;he 
grOUl)-avera.ge a.pl)roa.clms a.dd an initiaJ clusl;er 
for o.a,('h soed cluslcr and the a.gglolnera.tive a p- 
127 
proaches add an initial cluster for each word 
pair; these initial clusters are tagged with the 
name of the seed cluster. Second, whenever a 
tagged chister is merged with an untagged one 
or another cluster with the same tag, the com- 
bination inherits the tag; further, merging two 
clusters with different tags is disallowed. As a 
result, the initial seed chlsters are expanded by 
adding additional word pairs while preventing 
any of the seed clusters from themselves inerg- 
ing with each other. 
One special case is handled sepa.rately, 
namely numeric strings. If both the source- 
 and target-l~mguage words of a word 
pair are numeric strings, the word pair is treated 
as if it had been specified ill the seed class 
<number>. Word pairs not containing a digit 
in either word can optionally be prevented fi'om 
being added to the <number> chlster unless 
explicitly seeded in that cluster. The former 
feature eusures that nunibers will apl)ear in a. 
single cluster, rather than in multiple chlsters. 
The latter avoids the inclusion of the many non- 
numeric word pairs (primarily adjectives) which 
would otherwise tend to cluster with numbers, 
because both they and numbers are used as 
modifiers. 
Once clustering is completed, any clusters 
which have inherited the same tag (which is 
possible when using agglomerative clustering) 
are merged. Those clusters which contain more 
than one pseudo-document are output, together 
with any inherited label, a.nd can be used as a 
set of equivalence classes for EBMT. 
Agglomerative chlstering using the maximal 
cosine sinfila.rity (single-link) produced the sub- 
jectively best clusters, and was used for the ex- 
periments described here. 
4 Experiment 
The Inethod described in the previous two 
sections was tested on French-English EBMT. 
The training corpus was a subset of the 1BM 
Ilansard corpns of Canadian parliamentary pro- 
ceedings (Linguistic Data Consortium, 1997), 
containing a total of slightly more than one 
million words, approximately half in each lan- 
guage. Word-level alignment between French 
and English was pertbrmed using a dictio- 
nary containing entries derived statistically 
from the full Hansard corpus, auglnented by 
the ARTH, French-English did;iona.ry (ARTFL 
Project, 1998). This dictionary was used for all 
EBMT and chlstering runs. 
The efl'ects of varying the amount of train- 
ing texts were determined by further sl)litting 
the training corl)us into smaller seglnents aM 
using differing numbers of segments. For each 
Clust I 
238 
260 
348 
522 
535 
1375 
1386 
1528 
;1563 
;1.652 
2008 
21.82 
2472 
3539 
M elnbers 
lJ.IS'l'OIl{E HISTOIW 
ECONOMIE ECONOMY 
CERTAINI!~MENT CEI{TAI NLY 
CERTAINEMENT SURELY 
CERTES SURELY 
JAMAIS NEVER 
PAS NOT 
I~EUT-F, TRE MAY 
H~OI~ABLEMENT PROBAI~LY 
QUE ONLY 
l.{lfl';N NOTItING 
S\[JREMENT CERTAINLY 
SUREMENT SURELY 
VRAIMENT REALLY 
CONSERVATEUR CONSEIWATI \q~J 
CQNSERVATEUII TORY 
I)EMOCIi,NI.'IQUE DEMOCtl, ATIC 
I)I~IVl OCRATIQUIE NDP 
LIBI~RAL LII3ERA L 
l)l A{NII, A%LS LAS \] 
\])ERNIIjEI{ES PAST 
I)ERNIIERIDS I{h;CENT 
PI{OCI\]A INF, S NEXT 
Q UELQUES FEW 
QUF, LQUh;S SOME 
AVONS HA\q'; 
SOMMES ARI'~ 
p p tr, p 1 ,LLC 10RALL CAMPAIGN 
EM~2CTOIi,A ILF~ EIAECTION 
FI~I)I~RAM:,S-I)I {OXq N C IAL1,;S 
FEI)ERA L-PllOVINCIAI, 
INDUS'FRIEM3~S INI)US'I'IIIAI, 
OUVRIERES LA BOUR 
FA(,J()N h;VENT 
P • 17' I~VIDLNCL CLEARLY 
EVIDh;NC\]'; OBVIO USIN 
HOMMF, S POIATICIANS 
PRISONNIFJ{S PR/SONEI{S 
RETOUR, BA.CII(, 
REVENIR BACK 
CONVENU AGREED 
SIGNE SIGNEI) 
VU SEEN 
AGRJCOLE AGR1C UL'I'URE 
ENT'IER AROUN\]) 
E N T I ER T Ill RO U G I\] O U T 
OCCIDENTAL WESTERN 
AVIDUGLI~S BI,IND 
CIIA.USSURI'2S SI-IOES 
CONSTRUC;I'EURS BUILDh;RS 
PENSIONN, F,S PENSIONERS 
RISTRAITES PENSIONERS 
VETEMENTS CLOTHING 
POISSON FISI\] 
PORC IK)RK 
Figure 2: Sanli)le Chlsters 
128 
run using clustering, the first K segments of 
the corl)uS a.re cones.Lena.ted into a. single file, 
which is used as inl)ut \['or both the clustering 
l)t:ogra, m a.nd the EI{M:I.' system. The clust;er- 
ltlg 1)rogranl is rtltt (;o deternfine a. set o1" equiv- 
alence classes, a.nd these classes a.re then pro- 
vkled to tile I';I{M:I' systetn a Jest with the tra.in- 
ing exa, mples to be indexed, lleld-out lla.nsa.rd 
text (a,1)I)roxima.lsely d5,0()O words)is then tra.ns- 
laLed, +tnd tile l)ercenta.ge of tile words in the 
test text for which the I~;I~M~.I ' system could 
lind ma,tches a.nd generate a. tl'a.lasla.tion is de- 
termined. 
To test the efl'ects of adding seed ('lttsters+ 
a set of' initia.1 clusters was generated with 
the \]te.lp of the A I{:I'I"I, dict;iona.ry. First, the 
500 most frequ(:nt words in the milliou-word 
\]\]~msa.rd sul)se.t (excluding pun('.\[;uation) were 
extracted. These terms were then nmtched 
a.gMnst the AI~.TFI, dictionary, removing those 
words which had multi-word transla.tions as 
well a.s severaJ which listed multil)le parts el" 
sl)eech For the same tra,nslation (multil>le l>a, rts 
of speech can only 1)e used i\[' the corresi>on(I- 
ing tra.tlsla.tiolls are distinct f'rom each <)ther). 
The remaining d20 tra.nslal.ion pairs, tagged for 
l)a.rt o\[' speech, were then convert:e(l inl,o se(~(I 
clusters a.nd l)rovided to the clustering t)rogra.nl. 
To fa.cilita.te experiments using the t)re-existing 
l)roduction-rule grammar, tire a.d(litiona,I tra.ns- 
la£ion I)a,h's from the lna,nually-gelmra, ix~(1 equiv- 
aJe.n(:e ('la.sses were a.dded t;o l)rovide seeds for 
five equiva.\]ence classes which a.re not, l)resent in 
the dictiona.ry. 
5 Results 
The nlethod (les('ril>ed i,I this l)a, per does (Sttl) 
jectively) a, very good jol> of clustering like 
words toget\]wx, a lid using the clusters to getl- 
era.lize EI{MT gives a. (;onsidera.I)le boost, to the. 
l)etTVol'ltl~-Lt,ce+ of' the l<\]\]\]\]\/l~\[ ' SySl;(':lll. 
l"igure 2 shows a, sa.ml)ling of tile sma.ller 
clusters generated from 1.\] million words o\[' 
Hansard text. While the nmmbers of a, clus- 
ter are often semantica,lly linked (a,s in cluster 
848, which cotltains types of politica.1 paxl;ies, or 
cluster stag), they need not be. Those clusters 
whose members a.re not semantically linked gen- 
eraJly contain words which a.17e all the sa.me l)a.rt 
of sl)eech , numl)er, a.nd gender (a.s in (:luster 
2472, which costa.ins exclusively plural nouns) 
1)ut a.s will be discussed in t;he next section, 
even those chlsters whose ,neml)ers a.re tota.lly 
unrela.ted may 1)e useful a.nd correct.. One J'a.h:ly 
cotlltl\]Otl occurreltce a, lllOllg the smaller clusters 
is that various synonymous 1;ra.nslnt;ions o\[ a 
word (from either source or target ) 
will chlster together, as in cluster \]652. This 
is pa.rticula.rly useful when tile ta.rget- 
word is the sa.me, a.s this a.llows va.rious wa.ys of 
expressing t.he same thing to be tra.nsla.ted when 
~l.lly Of" those \['OFtlIS ~/l'e present in the tra.ining 
('orpl.t s. 
Figure 3 shows how adding a.utoma.tically- 
generated equiva.lence classes sul)sta.ntially in- 
creases the covers,we of the EI3MT system. A1- 
terna.tively, lnuch less text is required to rea.ch 
a. specific level of coverage. The lowest curve in 
the. graph is tile percentage of the d5,000-word 
test text for which the EI{M:J' system was able 
to genera.te tra.nsla.tions when using strict lexi- 
c+d matching against the trahling corpus. The 
lop-most curve shows the best performa.nce, pre- 
viously achieved using 1)oth a, la.rge set of eqttiva- 
lento classes (in t;he fornt of tagged entries from 
the \]\ItYI'II+'I, dicl;iona.rv) a.nd a. production-rule 
gra.nlntar (\]{rows, J999). Of the two center 
curves, the lower is the performs.nee when gen- 
era.lizing the tra.ining corl)us using the equiv- 
alence classes which were autolna.tica.lly goner- 
ated from that same text, a.nd tim upper shows 
the t)erforma.tlce using ('lustering with the d25 
seed pairs. 
/ks can b~, seen in Figure 3, 80% cover- 
age of the test text is achieved with less than 
300,000 words using nta.ntta.lly-crea.te(l gener- 
alizat, ion information a.nd with approxima.te- 
ly 300,000 words wllen using a.utonmtically- 
creaJ;ed genera.liza.tion informa.tion, but requires 
1.2 million words when not using genera.liza.- 
ties. 90% covers.we is reached with less than 
500,000 words using lna.nua.lly-ereat.ed informa. 
lion a.nd should I>e reached with less t.ha.n 1.2 
tnillion words using a.utonm.tically-crealed gen- 
era.lization informa.tk)n, versus T million words 
without genera.liza.tion. Tiffs reduction I)y a. tim- 
(or off our to live in tile amount of text is accom- 
1)lishe(I with lit;tie o)' no degradation in the qual- 
ity of the tra.nsla.tions. Adding a. small amount 
of kt,owle(lge in the f'ornt o1" 425 seed pairs re- 
(lutes the required trahling text; even further; 
this ca.n la.rgely be attril)uted to the merging of 
clusters which would otherwise have rema.ined 
distinct, thus increasing the level of generaliza.- 
ties. 
Adding the production-rule gratnma.r to the 
seeded clustering had little effect. When usirtg 
more than 50,000 words of tra.ining text, the in- 
crease in coverage from adding the gram m a,r was 
negligible, and even with the sma.llest training 
corl)ora, (,he+ increase wa.s very modest. 
Using the sa.me thresltolds tha.t were used in 
tile fully-~mtonla.tic case, clustering on 1.\] mil- 
lion words expands the initial 425 word pairs 
in 37 clusters to a200 word pairs, a.nd adds a.n 
additions.1 555 word pairs in \]d() further non- 
(;t:ivia,1 clusters. This (:Oral)ares very fa.vorably 
129 
c- 
O o 
(D 
C1) 
CD 
O 
O 
90 
80 
70 
60 
50 
40 
30 
20 
' s 
i 
--x- .... : ...... x ........... x .... i ..... -x .......... * .... ..... 
~- .~+" 
, zz 
V3 .... i ~ , (a" / 
, // / 
/ 
i 
I 
0.2 0.4 0.6 
Corpus Size (millions of words) 
lexical matching only -~-- 
with automatic clustering -4-- 
clustering w/425 seeds -D--, 
full manual g~neralization --x--- 
0.8 
l i'igure 3: BI3MT \]~el'formance with and without Generalization 
to the 3506 word 1)airs in 221 clusters tbund 
without seeding. 
'l'he 1)rogram also runs reasonably quickly. 
The step of creating context term vectors con- 
verts approximately 500,000 words of raw text 
per minute on a 300 MHz processor. 1,'or ag- 
glomerative clustering, the processing time is 
roughly quadratic in the number of word \])airs, 
with a theoretical cubic worst case; the 17,527 
distinct word pairs found from the million-word 
training corpus require about 25 minutes to 
cluster. 
6 Discussion 
One statement made earlier deserves cla.rifica- 
tion: l;he members of ~ cluster need not be re- 
lated to each other in any way, either syntacti- 
cally or semantically, for a cluster to be useful 
and correct. This is because (absent a gram- 
mar) we do not care about the features of tile 
words in the cluster, only wh, cthc~" their tr(msla- 
lion,s Jbllow the same pattcrT~. 
An illustration based on actual experience 
is useful here. In early testing of the group- 
average clustering algorithm with seeding, the 
<conjunction> seed class of "and" and "or" 
was used. Clustering augmented this seed class 
with "," (comma.), "in", and %y". One can eas- 
ily see tha.t the comma is a valid member of the 
class, since it takes the place of "and" in lists 
of items. 13ut wllat about ':in" and "135;", wlfich 
are prepositions rather than conjunctions2 11' 
one considers the tra.nsbttion t)attern 
__ 7~ C \[ • __ Fr'eNl~ I'>cNP2 --+ EW/A 1 1 F-~:/NI):2 
it becomes clear that all of the terms in the 
expanded class give a correct translation when 
placed in the blank in this pattern, lndeed, 
one could imagine a production-rule grammar 
geared toward taking advmltage of such com- 
mon translation patterns regardless of conven- 
tional linguistic features. 
7 Conclusion and Future Work 
Using word clustering to automatically gener- 
alize the example corpus of an I;BM'I? system 
can provide the majority of tile improvement 
which can be achieved using both ~ manually- 
generated set of equivalence ('lasses and a pro- 
duct;ion rule grammar. The use of a set of small 
initial equivalence classes produces a substan- 
tial further reduction in training text at a very 
low cost (a few hours) in lal)or. 
130 
An obvious {'~xtension to using st,.e{\] clusl;ors 
iS (;(} 1+180 (,110 I'Osllll; ()\[' a, ClU,'ql;{':l'illg 1"I+111 ;IS l;tl{? 
i\]lil;ia\] seed \['or a second it{;ra,l;io\]t o1' chlsl,er- 
ing, sin('{', th{; additional g{,neralization of lo- 
{;a.i COlll;{!xl;s cnabl(;d 1)y the la.rgcr s{,e(1 clusl,(,J's 
will l)ormit a.(l(litional ex\])allSiOll O\['LIlo clusl,(Brs. 
l:or such itera.tivo {:lustoring, a.II but the last 
rou n(1 shouI(1 l)l'(2Slllllal)ly USe sl;ri(;Ler 1,hresh- 
ol(Is, to avoi(1 adding goo many irr{;l{,A,ant inonl- 
t)ers Lo tim clusLers. I)rdiminary OXl)erinmnts 
hay{ B been inconclusive --although ihc result o\[' 
a second it{wation {'onta.ins more {,{'.rms ill the 
{;lusl;ers, IBBMT l}erforma.nce {toes not seem to 
lint)rove. 
More sophistica.ted {;hlsl;o.l'illg; a.lg(}rithms such 
as k-lneans and (l('+terlninLqtic a.nnealing l\]lay' 
1)rovi(lo \])etter-qua.lity clust{ws for bcl, ter t)ei't"of 
lllall{;e} :-1+,{; the (~xi)ens(; of illCl'Oas(;(\] t)ro{'eHsill~ 
tim{'.. 
This a.i)l}Z:oach to gelWXa.l,ing e(luival('Jw(~ 
cla.sses should worl( j usl; as well \['or l)h rases as I'or 
single words, simply hy mo(lil~qng {;he conver- 
Si()ll SLOp 1;O el'oat;(; C(}lltOXt VeCl;ors l"or phrases. 
This enhancenmnt would elimi,lal;{'~ i;he current 
limitation t, hat trat,slal;ion \]):q,il:S l,O 1)O. clust(;red 
\]\]lUSt t)O single words in 1)oth s. \Vot:k 
or, this n\]o(lifi{;al;ion is {:urP(~ll|;ly ttn(ler way. 
An inleresting \['ui, ur{~ (;xI)eriment would 1)(~ 
tbr{'going gratnnlar rules based {)n standa.rd 
gl:allllll:-/,l;ical \['{'.:-1+,l;tll'(~s Sll{:h as \]).~l,rl, o\[' st){"(':{:\]l , 
and inst{,ad crea,tinp; a gran~ma, r guid(,{I I} 3 , 
{;1~{; ('lusters I'oun(l fully aul,o~tati{'ally (wil, houl, 
sce{liug) fronl th{~ exa.nll}lc re\l,. 'File r{,{:(;nt 
woH{ I)y +\(lcTait and 'l¥..iillo (I 999) {}, OXtl':dcl.,- 
ing tra1~slal, ion t}al;l;{'+rn,q woul(l a.t)poa.r t,o 1}o. a 
l){;rfe{:l; {;oml)lc'nmnt, as 1;h{'5 are it, e\[t'ect lind- 
i,g {:ont;ext strings wit\], (}l}e. slots, while the 
work descril)ed h('.re lit,(ls {,he fillers I'(}1' tJ~{)s(' 
slots. (liv{;n the al)ility to learn such +1+. gra.mmar 
without l\]\]a.nual interv{mtion, it would \])e(:onl{'. 
I)ossil)l{~ to ere'at{; an I!'I:~MT 8yst{m\] usillg g{:ql- 
era, liz{,(l e,:aml)les from nol, hi\]~g ~n{)r{; than l)ar- 
allel l;ext~ which for n~any hulguag(, pairs could 
also 1)c acquired a hnost fully a, utom~tically 1)y 
crawling the World Wide VVel) (Resnil{, :1.998). 

References 

A\]~.TFL Project. :1998. I"re'nch-£'nglish. Dic- 
lionarg. I}roject for American and French 
l{esearch {}i\] the Treasury of {,he l:rench 
l,anguage, University o\[" Chicago. http ://- 
human±t ies. uch±cago, edu/kgTFL, html. 

Sergio Barrachina a.nd Juan Miguel \:ilar. 1999. 
Bilingual Clustering Using Monolingual Algo- 
rithms. Ill 15vcecdings of lhe l@hlh \]'n.lcrna- 
~ional (7onJ?rence on Theoretical and Method- 
ological Issues in Machinc 7)'anslatio'n (~1341- 
99), pages 77 87, Chester, England, August. 

Peter Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. 1990. A Statistical Approach 
to Machine Translation. Computational Linguistics. 16:79 85. 

l{alF \]). \]h:own. 1997. Autonlated \])ictionary 
l';xLracLion lot "l(nowlcdge-l,'ree:' Example- 
1},ascd Translation. In l)~vccedin:l s of ihc 
,5'cvcnlh h,l(rnaiional (.'o,@Tvncc on "IJm- 
orclical and A4clhodolo:lical l~ue.s in Ma- 
chi'~c 7}'an.~lalion (TM1-97), F, ages\] 11 1 \]8, 
Sant~t \]:c, New Mexico, July. ht;1;p://- 
www. cs. cmu. edu/Oral:f/papers, html. 

Rail" I). lh'own. 1999. Adding l,ingttistic l(nowl- 
edge Co a l,exica\] \]';xamp\]e-\]~ased Tra, nsla- 
tion System. In I)rocccding.s of thc 12iEM, h 
hzlcrnational (,'onj)rencc on 7'hcorel, ical mzd 
Mclhodoh::lical fss.uc.~ in Machine 7)'anslation. 
(7341-99), pages 22-32, Chesl:er, I~;ngland, 
August. http ://www. cs. cmu. edu/~ratf/- 
papers .html. 

I,illguistic I)al, a. Con.~orLium. 1997. lla~>'ard 
(,'ou~u.~ of I)aralld £'Rqlish (rod t'Yc,~ch. 
IAnguistic l)ata Consorl;ium, \])ecember. 
http ://wwu. ldc. upenn, edu/. 

l(evin McTl'a.it and Arturo Trujil\]o. 1999. A 
l,atlguag{>Neutral Sl)arse-\])ata Algorithnl fbr 
I:x tracti n g 'I'ranslation I}atterns. \] II \])ro(:ccd- 
in:l s of lhc l'/i:lhlh hztcrlzational Co~@rcncc 
on 77~colv:ical and Mcthodoh:gical Is.sues in 
MachMc 7~rmtslatio~t (7'A41-99), l)ages 98 
1()8, Ch{~ster, l'3ngla,(1, August. 

I}hilil } l{(,snik. \]998. I)ara.llel Strait(Is: A 1)re - 
liminary lnvesCiga.tion into Mining the \V(;b 
r 1 { for l~\]lingua.1 e\ . In I)a.vid l'a.rweI1, l,a.urle 
(:left)or, and Edua.rd llovy, editors, Mac.hine 
7}'a~.~laZion and the hdbrm.alion Soup: 7'hird 
(;'on, fcrcncc of Ihc A.~..s'ociation for A'lachi~c 
'l}.anslation in Ihe Americas (A A4'F4-9S), vol- 
un\]e 1529 o\[' Lccl'mv No*ca in ArZi./icial lnlcl- 
ligc:~zcc, i)ages 72 82, l,anghorne, I}ennsylva - 
ida, ()ct{}l}er. Springer. 

Tony Vcalc and An(ly Wa.y. 1997. Gaijin: A 
'lhml)la.te-I)riven Bootstrapl)ing AI)l)roacll 
to Exa, ml)Ie-Ba.sed Ma,chin(; Tra.nsla.tion. 
:It\] 1}roecedin:p of the NeMNLl~'97. New 
Mel.hods in Natural Langauge 1)rocessessin9, 
Sofia., lhdgaria., Sel)teml)er. http://- 
wwu. compapp, dcu. ie/"tonyv/papers/- 
gaij in. html. 

Ellen M. \:oorhees. 1986. lmplel\]mnting Ag- 
glomera.tive llierarchical Clustering Algo- 
rithms for Use in \])ocument l/etrieval. 
I'nJbrmation l~rocessi'ng and Mana.qeme'nt, 
22(6):d65 ,176. 
