Example-Based Machine Translation in the Pangloss System 
Ralf D. Brown 
Center for Machine Translation
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3890
ralf@cs.cmu.edu
Abstract 
The Pangloss Example-Based Machine Translation engine (PanEBMT)¹ is a
translation system requiring essentially no knowledge of the structure
of a language, merely a large parallel corpus of example sentences and
a bilingual dictionary. Input texts are segmented into sequences of
words occurring in the corpus, for which translations are determined by
subsentential alignment of the sentence pairs containing those
sequences. These partial translations are then combined with the
results of other translation engines to form the final translation
produced by the Pangloss system. In an internal evaluation, PanEBMT
achieved 70.2% coverage of unrestricted Spanish news-wire text, despite
a simplistic subsentential alignment algorithm, a suboptimal
dictionary, and a corpus from a different domain than the evaluation
texts.
1 Introduction 
Pangloss (Nirenburg et al., 1995) is a multi-engine machine translation
system, in which several translation engines are run in parallel to
propose translations of various portions of the input, from which the
final translation is selected by a statistical language model. PanEBMT
is one of the translation engines used by Pangloss.
EBMT is essentially translation-by-analogy: given a source-language
passage S and a collection of aligned source/target text pairs, find the
"best" match for S in the source-language half of the text collection,
and accept the target-language half of that match as the translation.
PanEBMT, like other example-based translation systems, uses essentially
no knowledge about its source or target languages; what little
knowledge it does use is optional, and is supplied in a configuration
file. Its three main knowledge sources are: a sententially-aligned
parallel bilingual corpus; a bilingual dictionary; and a
target-language root/synonym list. The fourth (minor and optional)
knowledge source is the language-specific information provided in the
configuration file, which consists of a list of tokenizations equating
words within classes such as weekdays, a list of words which may be
elided during alignment (such as articles), and a list of words which
may be inserted.
¹ This work as part of the Pangloss project was supported by the U.S.
Department of Defense.
2 Parallel Bilingual Corpus 
The corpus used by PanEBMT consists of a set of source/target sentence
pairs, and is fully indexed on the source-language sentences. The
corpus is not aligned at any granularity finer than the sentence pair;
subsentential alignment is performed at run-time based on the sentence
fragments selected and the other knowledge sources.
The corpus index lists all occurrences of every word and punctuation
mark in the source-language sentences contained in the corpus. The
index has been designed to permit incremental updates, allowing new
sentence pairs to be added to the corpus as they become available (for
example, to implement a translation memory with the system's own
output). The text is tokenized prior to indexing, so that words in any
of the equivalence classes defined in the EBMT configuration file (such
as month names, countries, or measuring units), as well as the
predefined equivalence class <number>, are indexed under the
equivalence class rather than their own names. For each distinct token,
the index contains a list of the token's occurrences, consisting of a
sentence identifier and the word number within the sentence. At
translation time, PanEBMT back-substitutes the appropriate
target-language word into any translation which involves any tokenized
words.
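The index structure described above can be sketched as follows. This is a minimal illustration in Python, not the C++ of the actual system; the class table, the lowercasing, the whitespace tokenization, and the names `EQUIV_CLASSES`, `tokenize`, and `CorpusIndex` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical equivalence classes; the real ones come from the configuration file.
EQUIV_CLASSES = {
    "enero": "<month>", "febrero": "<month>",
    "lunes": "<weekday>", "martes": "<weekday>",
}

def tokenize(sentence):
    """Lowercase, split on whitespace, and map class members to their class token.
    Lowercasing and whitespace splitting are simplifying assumptions."""
    tokens = []
    for word in sentence.lower().split():
        if word.isdigit():
            tokens.append("<number>")
        else:
            tokens.append(EQUIV_CLASSES.get(word, word))
    return tokens

class CorpusIndex:
    """Maps each token to a list of (sentence id, word number) occurrences."""

    def __init__(self):
        self.pairs = []                       # (source tokens, target sentence)
        self.occurrences = defaultdict(list)  # token -> [(sentence id, word number)]

    def add_pair(self, source, target):
        """Incrementally add one sentence pair and return its sentence id."""
        sent_id = len(self.pairs)
        tokens = tokenize(source)
        self.pairs.append((tokens, target))
        for pos, tok in enumerate(tokens):
            self.occurrences[tok].append((sent_id, pos))
        return sent_id

index = CorpusIndex()
index.add_pair("El lunes 3 de enero", "Monday 3 January")
index.add_pair("el martes", "Tuesday")
```

Because both weekday names are indexed under `<weekday>`, an input containing either one matches both corpus sentences; back-substitution of the correct target word is left out of this sketch.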
The bilingual corpus used for the results reported here consists of
726,406 Spanish-English sentence pairs drawn primarily from the UN
Multilingual Corpus available from the Linguistic Data Consortium
(Graff and Finch, 1994) (Figure 1), with a small admixture of texts
from the Pan-American Health Organization and prior project
evaluations², indexed as described above.
Las fuentes de esos comentarios y
recomendaciones son las siguientes :
The sources of these comments and
recommendations are :
El informe de la Junta de Auditores a la
Asamblea General que incluye las
observaciones del Director Ejecutivo
del UNICEF sobre los comentarios y
recomendaciones de la Junta de
Auditores ;
The report of the Board of Auditors to
the General Assembly which incorporates
the observations of the Executive
Director of UNICEF on the comments and
recommendations of the Board of
Auditors ;
Figure 1: Corpus Sentence Pairs
(ACADMICOS ACADEMICS ACADEMICAL
TITLES DEGREES)
(ACAECIDO HAPPEN)
(ACAECIDOS HAPPEN)
(ACANTONADAS CANTON QUARTER TROOPS)
(ACANTONAMIENTO CANTONMENT)
(ACARREA CARRY CART HAUL TRANSPORT
CAUSE OCCASION)
(ACARREABA CARRY CART HAUL TRANSPORT
CAUSE OCCASION)
(ACARREARON CARRY CART HAUL TRANSPORT
CAUSE OCCASION)
(ACARREAR TRANSPORT HAUL CART CARRY
LUG ALONG BRING DOWN CAUSE OCCASION
ITS TRAIN RESULT GIVE RISE)
Figure 2: Bilingual Dictionary Entries
Together, the bilingual dictionary and target-language list of roots
and synonyms (extracted from WordNet when translating into English)
provide the necessary information to find associations between
source-language and target-language words in the selected sentence
pairs. These associations are used in performing subsentential
alignment. A source word is considered to be associated with a
target-language word whenever either the target word itself or any of
the words in its root/synonym list appear in the list of possible
translations for the source word given by the dictionary.
Not all words will be associated one-to-one; however, the current
implementation requires that at least one such unique association be
found in order to provide an anchor for the alignment process.
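The association test can be sketched in a few lines of Python. The function names, the toy synonym list, and the reading of "unique association" as a link in which neither word has any other partner are assumptions for illustration; the dictionary entries echo the style of Figure 2.

```python
def associated(source_word, target_word, dictionary, synonyms):
    """True if the target word, or any word in its root/synonym list,
    is among the dictionary translations of the source word."""
    translations = set(dictionary.get(source_word, ()))
    candidates = {target_word} | set(synonyms.get(target_word, ()))
    return bool(translations & candidates)

def unique_associations(source_words, target_words, dictionary, synonyms):
    """Return (source index, target index) links that are one-to-one,
    i.e. neither word is associated with any other word (assumed reading)."""
    links = [(i, j)
             for i, s in enumerate(source_words)
             for j, t in enumerate(target_words)
             if associated(s, t, dictionary, synonyms)]
    src_count, tgt_count = {}, {}
    for i, j in links:
        src_count[i] = src_count.get(i, 0) + 1
        tgt_count[j] = tgt_count.get(j, 0) + 1
    return [(i, j) for i, j in links if src_count[i] == 1 and tgt_count[j] == 1]

# Toy data: the synonym list is invented for this example.
DICTIONARY = {
    "ACARREA": ["CARRY", "CART", "HAUL", "TRANSPORT", "CAUSE", "OCCASION"],
    "ACAECIDO": ["HAPPEN"],
}
SYNONYMS = {"OCCURRED": ["OCCUR", "HAPPEN"], "HAULS": ["HAUL"]}

anchors = unique_associations(["ACARREA", "ACAECIDO"], ["OCCURRED", "HAULS"],
                              DICTIONARY, SYNONYMS)
```

Here "OCCURRED" is linked to "ACAECIDO" only through its synonym list, which is the case the root/synonym list exists to cover.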
3 Implementation 
PanEBMT is implemented in C++, using the 
FramepaC library (Brown, 1996) for accessing 
Lisp data structures stored in files or sent from the 
main Pangloss module via Unix pipes. PanEBMT 
consists of approximately 13,300 lines of code, in- 
cluding the code for a glossary mode which will 
not be described here. 
PanEBMT uses a re-processed version of the 
bilingual dictionary used by Pangloss's dictionary 
translation engine (Figure 2). The re-processing 
consists of removing various high-frequency words 
and splitting all multi-word definitions into a list
of single words, needed to find one-to-one associ- 
ations. 
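A rough Python sketch of this re-processing step follows. The stop-word list, the raw multi-word definitions, the uppercasing, and the removal of duplicate words are assumptions for illustration; the paper does not give these details.

```python
def reprocess(entries, stop_words):
    """Split multi-word definitions into single words and drop stop words.
    `entries` maps a source word to a list of (possibly multi-word) definitions."""
    result = {}
    for source, definitions in entries.items():
        words = []
        for definition in definitions:
            for word in definition.upper().split():
                # Dropping duplicates is an assumption, not stated in the paper.
                if word not in stop_words and word not in words:
                    words.append(word)
        result[source] = words
    return result

STOP_WORDS = {"TO", "THE", "OF", "IN"}   # hypothetical high-frequency word list
RAW_ENTRIES = {                          # invented definitions for illustration
    "ACARREAR": ["to transport", "to haul", "to bring down", "to give rise to"],
}
processed = reprocess(RAW_ENTRIES, STOP_WORDS)
```

The result is a flat word list per source word, which is the shape needed for the single-word association test.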
² 10250 sentence pairs stem from the PAHO corpus
and 552 pairs from evaluations.
4 EBMT's Place in Pangloss 
PanEBMT is merely one of the translation en- 
gines used by Pangloss; the others are trans- 
fer engines (dictionaries and glossaries) and a 
knowledge-based machine translation engine (Fig- 
ure 3). Each of these produces a set of candi- 
date translations for various segments of the in- 
put, which are then combined into a chart (Figure 
3). The chart is passed through a statistical lan- 
guage model to determine the best path through 
the chart, which is then output as the translation 
of the original input sentence. 
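The idea of choosing a best path through the chart can be illustrated with a small dynamic program. This is only a sketch: the per-arc `cost` function is a stand-in assumption for the statistical language model, which in the real system scores how well translations fit together, and the arc format and names are hypothetical.

```python
def best_path(arcs, length, cost):
    """Cheapest sequence of arcs covering input positions 0..length.
    `arcs` is a list of (start, end, translation) with end > start;
    `cost` scores a single arc (a stand-in for the language model)."""
    best = {0: (0.0, [])}            # position -> (total cost, arcs used so far)
    for pos in range(length):
        if pos not in best:          # no arc sequence reaches this position
            continue
        total, path = best[pos]
        for arc in arcs:
            start, end, _ = arc
            if start != pos:
                continue
            candidate = (total + cost(arc), path + [arc])
            if end not in best or candidate[0] < best[end][0]:
                best[end] = candidate
    return best.get(length)          # None if the input cannot be fully covered

ARCS = [(0, 1, "the"), (1, 2, "bank"), (0, 2, "the Bank"), (2, 3, "of")]
# Constant cost per arc, so the path using the fewest arcs wins.
chosen = best_path(ARCS, 3, cost=lambda arc: 1.0)
```

With a constant cost the two-word arc is preferred over two one-word arcs; a real language model would instead compare the fluency of the competing target strings.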
5 EBMT Operation 
The EBMT engine produces translations in two 
phases: 
1. find chunks by searching the corpus index for 
occurrences of consecutive words from the in- 
put text 
2. perform subsentential alignment on each sen- 
tence pair found in the first phase to deter- 
mine the translation of the chunk 
In contrast with other work on example-based translation, such as
(Maruyama and Watanabe, 1992) or early Pangloss EBMT experiments
(Nirenburg et al., 1993), PanEBMT does not find 
an optimal partitioning of the input. Instead, it 
attempts to produce translations of every word 
sequence in the input sentence which appears in 
its corpus. The final selection of the "correct" 
cover for the input is left for the statistical lan- 
guage model, as is the case for all of the other 
translation engines in Pangloss. An advantage of 
this approach is that it avoids discarding possible
chunks merely because they are not part of the 
"optimal" cover for the input, instead selecting 
the input coverage by how well the translations fit 
together to form a complete translation. 
Figure 3: Pangloss Machine Translation System Architecture (diagram
not reproduced; legible labels: Source Text, Transfer MT, Post-edit,
Target Text)
To find chunks, the engine sequentially looks up each word of the input
in the index. The occurrence list for each word is compared against the
occurrence list for the prior word and against the list of chunks
extending to the prior word. For each occurrence which is adjacent to
an occurrence of the prior word, a new chunk is created or an existing
chunk is extended as appropriate. After processing all input words in
this manner, the engine has determined all possible substrings of the
input containing at least two words which are present in the corpus.
Since the more frequent word sequences can occur hundreds of times in
the corpus, the list of chunks is culled to eliminate all but the last
five (by default) occurrences of any distinct word sequence. By
selecting the last occurrences of each word sequence, one effectively
gives the most recent additions to the corpus the highest weight,
precisely what is needed for a translation memory.
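The chunk search can be sketched as follows. This Python version is an illustration under assumptions: the occurrence lists are plain dictionaries, a single table of runs stands in for both the prior word's occurrence list and the list of chunks extending to it, and the names `find_chunks` and `OCCURRENCES` are hypothetical.

```python
from collections import defaultdict

def find_chunks(input_tokens, occurrences, keep=5):
    """Return {word sequence: [(sentence id, start position), ...]} for every
    sequence of two or more consecutive input words found in the corpus."""
    chunks = defaultdict(list)
    prev_runs = {}   # (sentence, position) of prior word -> input index where run began
    for i, token in enumerate(input_tokens):
        runs = {}
        for sent, pos in occurrences.get(token, ()):
            # Adjacent to an occurrence of the prior word? Then extend that run.
            start = prev_runs.get((sent, pos - 1), i)
            runs[(sent, pos)] = start
            for begin in range(start, i):        # every substring ending at word i
                seq = tuple(input_tokens[begin:i + 1])
                chunks[seq].append((sent, pos - (i - begin)))
        prev_runs = runs
    # Cull: keep only the last `keep` occurrences, i.e. the most recently added pairs.
    return {seq: sorted(occ)[-keep:] for seq, occ in chunks.items()}

# Toy index for three corpus sentences:
#   0: "el banco de"   1: "... ... el banco"   2: "banco de"
OCCURRENCES = {
    "el": [(0, 0), (1, 2)],
    "banco": [(0, 1), (1, 3), (2, 0)],
    "de": [(0, 2), (2, 1)],
}
found = find_chunks(["el", "banco", "de"], OCCURRENCES)
latest_only = find_chunks(["el", "banco", "de"], OCCURRENCES, keep=1)
```

Sentence identifiers grow as pairs are added, so sorting and keeping the tail of each list retains the newest examples.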
Next, the sentence pairs containing the chunks found in the first phase
are read from disk, and alignment is performed on each in order to
determine the translation of the chunk unless the match is against the
entire corpus entry, in which case the entire target-language sentence
is taken as the translation. Alignment currently uses a rather
simplistic brute-force approach very similar to that of (Nirenburg et
al., 1994) which identifies the minimum and maximum possible segments
of the target-language sentence which could possibly correspond to the
chunk, and then applies a scoring function to every possible substring
of the maximum segment containing at least the minimum segment. The
substring with the best score is then selected as the aligned match for
the chunk.
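The brute-force enumeration can be sketched as below. How the minimum and maximum segments are derived is not shown; they are taken as inputs. The toy scoring function and all names are assumptions for illustration, and lower scores are treated as better, in line with the scores reported for the sample output.

```python
def best_alignment(target_words, min_seg, max_seg, score):
    """Score every substring of the maximum segment that contains the minimum
    segment; return (best score, (start, end)) with `end` inclusive."""
    (min_lo, min_hi), (max_lo, max_hi) = min_seg, max_seg
    best = None
    for start in range(max_lo, min_lo + 1):
        for end in range(min_hi, max_hi + 1):
            value = score(target_words[start:end + 1])
            if best is None or value < best[0]:
                best = (value, (start, end))
    return best

# Toy scoring function (an assumption): one penalty point per unwanted word
# in the candidate and one per wanted word missing from it.
WANTED = {"the", "monetary", "authorities"}

def toy_score(candidate):
    extra = sum(1 for word in candidate if word not in WANTED)
    missing = len(WANTED - set(candidate))
    return extra + missing

TARGET = "chosen by the monetary authorities on Monday".split()
result = best_alignment(TARGET, min_seg=(3, 4), max_seg=(1, 6), score=toy_score)
```

With a minimum segment of two words and a maximum of six, only nine candidates are scored, which is why the exhaustive search remains affordable for short chunks.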
The alignment scoring function is computed from the weighted sum of a
number of extremely simple test functions. The weights can be changed
for differing lengths of the source chunk in order to adapt to varying
impacts of the tests with varying numbers of words in the chunk, as
well as varying impacts as some or all of the raw test scores change.
The test functions include (in approximate order of importance) such
measures as a) the number of source words without correspondences in
the target, b) the number of target words without correspondences in
the source, c) matching words in source/target without
correspondences, d) number of words with correspondence in the full
target but not the candidate chunk, e) common sentence boundaries, f)
elidable source words, g) insertable target words, and h) the
difference in length between source and target chunks.
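A weighted sum with length-dependent weights might look like the following sketch. The weight values, the three test names, and the fallback to the longest configured length are hypothetical; the dependence of the weights on the raw test scores themselves is not modelled.

```python
def alignment_score(raw_tests, weights_by_length, chunk_length):
    """Weighted sum of raw test values; lower is better.
    `weights_by_length` maps a source chunk length to one weight per test,
    falling back to the longest configured length (an assumption)."""
    longest = max(weights_by_length)
    weights = weights_by_length.get(chunk_length, weights_by_length[longest])
    return sum(weights[name] * value for name, value in raw_tests.items())

# Hypothetical weights covering three of the tests listed above.
WEIGHTS = {
    2: {"unmatched_source": 2.0, "unmatched_target": 1.0, "length_diff": 0.5},
    3: {"unmatched_source": 1.5, "unmatched_target": 1.0, "length_diff": 0.25},
}
RAW_TESTS = {"unmatched_source": 1, "unmatched_target": 2, "length_diff": 1}

score_two = alignment_score(RAW_TESTS, WEIGHTS, 2)
score_seven = alignment_score(RAW_TESTS, WEIGHTS, 7)
```

The same raw test values give a worse score for a two-word chunk than for a longer one here, illustrating how an unmatched source word can be made to matter more when the chunk is short.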
There is one exception to the above procedure 
for retrieving and aligning chunks. If any of the 
chunks covers the entire input string and the en- 
tire source-language half of a corpus sentence pair, 
then all other chunks are discarded and the target-language half of
the pair is produced as the translation. This speeds up the system when
operating in translation memory mode, as would be the case in a system
used to translate revisions of previous texts. Unlike a pure
translation memory, however, PanEBMT does not require an exact match
with a memorized translation.
Figure 4 shows the set of translations generated from one sentence. The
output is shown in the format used for standalone testing, which
generates only the best translation for each distinct chunk; when
integrated with the rest of Pangloss, PanEBMT also includes information
indicating which portion of the input sentence and which pair from the
corpus were used, and can produce multiple translations for each chunk.
The number next to the source-language chunk in the output indicates
the value of the scoring function, where higher values are worse. Very
poor alignments (scores greater than five times the source chunk
length) have already been omitted from the output.
6 Recent Enhancements 
The EBMT engine described here is a completely 
new implementation in C++ replacing an earlier
Lisp version. The previous version had performed 
very poorly (to the point where its results were 
El Banco de Santander habia sido
elegido el lunes por las autoridades
monetarias espanolas para comprar el
Banco Espanol de Credito (Banesto),
cuarto banco espanol.
"El Banco de" (0)
("the Bank of")
"El Banco de Santander" (1)
("the Bank of Santander")
"Banco de" (0)
("Bank of")
"Banco de Santander" (1)
("Bank of Santander")
"de Santander" (0)
("of Santander")
"habia sido" (0.5)
("been")
"elegido el" (0)
("chosen the")
"el lunes por" (0)
("Monday by the")
"por las" (0)
("by the")
"por las autoridades" (14.2)
("by the health authorities")
"por las autoridades monetarias" (0)
("by the monetary authorities")
"las autoridades monetarias" (0)
("the monetary authorities")
"comprar el" (0)
("buying the")
"Espanol de Credito" (13.2)
("Spanish Institute of Credit for")
"de Credito" (0)
("of credit")
"de Credito (" (1)
("of credit (")
"Credito (" (0)
("credit (")
", cuarto" (0)
(", fourth")
"banco espanol" (0)
("Spanish bank")
"espanol ." (0)
("Spanish .")
Figure 4: Sample Translations
Input words                          9169
Matched against corpus     90.4%     8294
Alignable                  84.5%     7748
Good alignments            70.2%     6439
Table 1: Coverage and Sentence Alignability
Engine      Proposed          Selected
Name        Arcs    Words     Arcs    Words    Cover
DICT        27482   27482     3451    3451     9167
EBMT        11005   34992     1527    4768     6439
GLOSS       17663   19249     1567    1774     5780
Overall:    46580   71998     5415    9169     9169
Table 2: Contributions of Pangloss Engines
essentially ignored when combining the outputs 
of the various translation engines), for two main 
reasons: inadequate corpus size and incomplete 
indexing. 
The earlier incarnation had used a corpus of 
considerably less than 40 megabytes of text, com- 
pared to the 270 megabytes used for the results de- 
scribed herein. The seven-fold increase in corpus 
size produces a proportional increase in matches. 
Not only was the corpus fairly small, the text which was used was not
fully indexed. To limit the size of the index file, a long list of the
most frequent words was omitted from the index, as were punctuation
marks. Although allowances were made for the words on the stop list,
the missing punctuation marks always forced a break in chunks,
frequently limiting the size of chunks which could be found. Further,
allowance was made for the un-indexed frequent words by permitting any
sequence of frequent words between two indexed words, producing many
erroneous matches.
The newer implementation fully indexes the 
corpus, and thus examines only exact matches
with the input, ensuring that only good matches 
are actually processed. Further, PanEBMT can 
index certain word pairs to, in effect, precompute 
some two-word chunks. When applied to the five 
to ten most frequent words, this pairing can re- 
duce processing time during translation by dra- 
matically reducing the amount of data which must 
be read from the index file (for example, there 
might be 10,000 occurrences of a word pair instead 
of 1,000,000 occurrences of one of the words and 
100,000 of the other word), and thus the number 
of adjacency comparisons which must be made. 
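The effect of pair indexing can be illustrated with a small sketch. Which pairs are indexed (here, any adjacent pair containing a frequent word) and all names are assumptions; the point is only that a single lookup in the pair index yields the same occurrences that would otherwise require comparing two full occurrence lists.

```python
from collections import defaultdict

def build_pair_index(sentences, frequent):
    """Index adjacent word pairs in which either word is in `frequent`."""
    pairs = defaultdict(list)
    for sent_id, tokens in enumerate(sentences):
        for pos in range(len(tokens) - 1):
            first, second = tokens[pos], tokens[pos + 1]
            if first in frequent or second in frequent:
                pairs[(first, second)].append((sent_id, pos))
    return pairs

def adjacent_by_comparison(occ_first, occ_second):
    """The work a pair index avoids: match up two full occurrence lists."""
    second = set(occ_second)
    return [(s, p) for s, p in occ_first if (s, p + 1) in second]

SENTENCES = [["el", "banco", "de", "credito"], ["de", "el", "banco"], ["banco", "de"]]
PAIR_INDEX = build_pair_index(SENTENCES, frequent={"de"})

# Single-word occurrence lists for the same toy corpus.
OCC_BANCO = [(0, 1), (1, 2), (2, 0)]
OCC_DE = [(0, 2), (1, 0), (2, 1)]
```

Looking up `("banco", "de")` in `PAIR_INDEX` returns the two adjacent occurrences directly, without reading the separate lists for "banco" and "de".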
7 Performance 
7.1 Accuracy 
PanEBMT was first put to the test during an 
internal evaluation in August 1995, which was similar in design to the
ARPA MT evaluations (White & O'Connell, 1994). During this evaluation,
twenty newswire articles (selected from the 100 articles used in the
prior ARPA evaluation) averaging about 450 words each were processed
and subsequently examined. For this paper, another evaluation was
performed using a subset of the Pangloss system on the 25? sentences
in the twenty articles. Table 2 shows the total number of arcs proposed
by each translation engine used, the number selected for output by the
statistical language model, and the number of source words represented
by those arcs. The final column shows the total number of source words
covered by at least one proposed arc. The values for individual engines
do not sum to the Overall value because multiple engines can produce
equivalent arcs, which are combined in the chart, with both engines
credited for the arc. The engines listed in the tables are
• DICTionary: PanEBMT's association dictionary, used here primarily to
provide coverage for words not otherwise covered
• EBMT: PanEBMT
• GLOSSaries: hand-crafted word/phrase bilingual glossaries
7.2 Speed
Indexing a 270 megabyte corpus requires approximately 45 minutes on a
Sun Sparcstation LX when all files are located on local disks, and
another ?0 minutes to pack the index (not required, but improves speed
at run time). Incremental addition of new data to the corpus proceeds
at a rate of roughly six megabytes per minute.
A sample text of 15 sentences totaling 414 words and punctuation marks
can be processed in just under three minutes. The 20 texts used in the
evaluation can be completely processed in two hours, including separate
passes for dictionary lookups and statistical modeling by a separate
program (described in (Brown and Frederking, 1995)); PanEBMT accounts
for about 80 minutes of those two hours.
The above timings represent a variety of speed optimizations which have
been applied since the August 1995 evaluation, resulting in a doubling
of the indexing speed and tripling of translation speed.
8 Strengths and Weaknesses 
As currently implemented, PanEBMT has both strengths and weaknesses.
Its strengths are that the minimal knowledge required allows quick
retargeting and that its design provides for graceful degradation. Its
weaknesses are that it is unable to completely cover inputs, that it
does not perform well when the correspondences between source-language
and target-language words are not one-to-one, and that (like
statistically-based translation systems) it is sensitive to differences
between the example corpus and the sentences to be translated.
The astute reader will have noticed that there have been virtually no
mentions of the source or target languages in this paper; they are not
relevant to discussions of the design and operation of the engine,
since the only language-dependent knowledge consists of the equivalence
classes and the lists of insertable and elidable words, which are
provided via the configuration file. This language-independent aspect
of EBMT makes PanEBMT rapidly retargetable to other language pairs, and
in fact there are already versions of PanEBMT providing
Serbocroatian-to-English and English-to-Serbocroatian translations (no
experimental data is as yet available for Serbocroatian because the
complete dictionary and corpus are still being acquired). Given the
three required knowledge sources of corpus, dictionary, and word-root
list, PanEBMT can begin producing translations for a new language pair
in only a few hours. Fine tuning will require one to two weeks to
determine reasonable word classes for tokenization (along with the
required re-indexing of the corpus) and to adjust the scoring function
weights.
Number and quality of translations degrades gradually as the size and
quality of the bilingual dictionary and synonym list decrease. An
incomplete dictionary or root/synonym list merely causes PanEBMT to
miss some potential translations. Similarly, a smaller corpus produces
fewer potential matches, but there is no point for any of the three
knowledge sources at which the engine suddenly ceases to function. One
can take advantage of this gradual behavior by building the knowledge
sources incrementally and using EBMT for translations even before the
knowledge sources have been completed. In particular, by adding
post-edited output of the MT system back into the corpus, the system
can be bootstrapped from a relatively modest initial corpus (precisely
the idea behind a translation memory).
During preparation of this paper, several extraneous lines were
discovered in the corpus files, which caused more than 29,???
sentence pairs (over 4% of the corpus) to be corrupted. Due to the
extra lines, the corrupted pairs consisted of the English target
sentence from one pair and the Spanish source sentence from the
following pair. This error had not been discovered earlier because it
had no obvious effect on PanEBMT's performance, a clear example of the
system's graceful degradation property.
Lack of complete input coverage is a severe obstacle to using PanEBMT
as a stand-alone translation system. The engine can not generate a
chunk for a word unless it both co-occurs with ei- 
ther the preceding or following word somewhere in 
the corpus, and at least one occurrence can be successfully
aligned. Additionally, candidate chunks
are omitted if the alignment was successful but
the scoring function indicates a poor match. Un- 
less all of these conditions are met, a gap in output 
occurs for the particular input word. In the con- 
text of the Pangloss system, such gaps are not a 
problem, since one of the other engines can usually 
supply a translation covering each gap. 
As currently implemented, the EBMT engine is 
unable to properly deal with translations that do 
not involve one-for-one correspondences between 
source and target words (e.g. Spanish "mil millones" corresponding to
English "billions"). Lack of a one-to-one correspondence between
source-language and target-language expressions can often cause the
alignment to be incorrect or fail altogether under the current
alignment algorithm.
Since the corpus used in the experiments 
described here was based almost entirely on 
the UN proceedings rather than newswire text, 
PanEBMT did not find many long chunks during 
the evaluation. In fact, the average chunk was just 
over three words in length, and less than three per- 
cent of the chunks were more than six words long. 
This quite naturally affects the quality of the final 
translation, since many short pieces must be as- 
sembled into a translation rather than one or two 
long segments. 
Despite all these difficulties, PanEBMT was able to cover 70.2% of its
input with good chunks, and generate some translation for more than 84%
(... ordinarily not output at all). Integrating the hand-crafted
glossaries from Pangloss into the corpus, thus adding 148,600
effectively pre-aligned phrases to the corpus, improved the matches
against the corpus from 90.4% to 90.9% of the input, and the coverage
with good chunks to 73.3%.
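The baseline percentages quoted above can be recomputed from the word counts in Table 1. The short Python check below is only a sketch; the use of truncation to one decimal place is an observation about the arithmetic, not something the paper states. Truncation reproduces all three reported figures, whereas rounding would turn 90.4% into 90.5%.

```python
INPUT_WORDS = 9169
COUNTS = {"matched": 8294, "alignable": 7748, "good": 6439}   # from Table 1

def truncated_percent(count, total):
    """Percentage truncated (not rounded) to one decimal place."""
    return (1000 * count // total) / 10

percent = {name: truncated_percent(n, INPUT_WORDS) for name, n in COUNTS.items()}
rounded_matched = round(100 * COUNTS["matched"] / INPUT_WORDS, 1)
```

The glossary-augmented figures (90.9% and 73.3%) cannot be checked the same way because the corresponding word counts are not given.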
9 Future Enhancements 
Since PanEBMT is a fairly new implementation, there is still much that
could be done to enhance it. Among the improvements being considered
are: improving the quality of the dictionary (in progress); supporting
one-to-many or many-to-one associations for alignment; optimizing the
test-function weights; other alignment algorithms; using linguistic
information such as morphological variants and source-language
synonymy to increase the number of matches against the corpus; using
approximate matchings when no exact matches exist in the corpus; and
using a classifier algorithm to remove redundancy from the corpus
(suggested by C. Domashnev).
References 
Ralf Brown (in preparation). FramepaC User's Manual. Carnegie Mellon
University Center for Machine Translation technical memorandum.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/ralf/pub/WWW/papers.html
Ralf Brown and Robert Frederking 1995. Applying Statistical English
Language Modeling to Symbolic Machine Translation. In Proceedings of
the Sixth International Conference on Theoretical and Methodological
Issues in Machine Translation (TMI-95), pages 221-239. Leuven,
Belgium.
David Graff and Rebecca Finch 1994. Multilingual Text Resources at the
Linguistic Data Consortium. In Proceedings of the 1994 ARPA Human
Language Technology Workshop. Morgan Kaufmann.
H. Maruyama and H. Watanabe 1992. Tree Cover Search Algorithm for
Example-Based Translation. In Proceedings of the Fourth International
Conference on Theoretical and Methodological Issues in Machine
Translation (TMI-92), pages 173-184. Montreal.
M. Nagao 1984. A Framework of a Mechanical Translation between
Japanese and English by Analogy Principle. In Artificial and Human
Intelligence, A. Elithorn and R. Banerji (eds). NATO Publications.
Sergei Nirenburg, (ed.). 1995. "The Pangloss Mark III Machine
Translation System." Joint Technical Report, Computing Research
Laboratory (New Mexico State University), Center for Machine
Translation (Carnegie Mellon University), Information Sciences
Institute (University of Southern California). Issued as CMU technical
report CMU-CMT-95-145.
Sergei Nirenburg, Stephen Beale, and Constantine Domashnev 1994. A
Full-Text Experiment in Example-Based Machine Translation. In New
Methods in Language Processing. Manchester, England.
Sergei Nirenburg, Constantine Domashnev, and Dean J. Grannes 1993. Two
Approaches to Matching in EBMT. In Proceedings of the Fifth
International Conference on Theoretical and Methodological Issues in
Machine Translation (TMI-93).
White, J.S. and T. O'Connell. 1994. "Evaluation in the ARPA Machine
Translation Program: 1993 Methodology." In Proceedings of the ARPA
HLT Workshop. Plainsboro, NJ.
