The Candide System for Machine Translation 
Adam L. Berger, Peter F. Brown,* Stephen A. Della Pietra, Vincent J. Della Pietra, 
John R. Gillett, John D. Lafferty, Robert L. Mercer,* Harry Printz, Lubos Ures
IBM Thomas J. Watson Research Center 
P.O. Box 704 
Yorktown Heights, NY 10598 
ABSTRACT 
We present an overview of Candide, a system for automatic 
translation of French text to English text. Candide uses 
methods of information theory and statistics to develop a 
probability model of the translation process. This model, 
which is made to accord as closely as possible with a large 
body of French and English sentence pairs, is then used to 
generate English translations of previously unseen French 
sentences. This paper provides a tutorial in these methods, 
discussions of the training and operation of the system, and 
a summary of test results. 
1. Introduction 
Candide is an experimental computer program, now in its 
fifth year of development at IBM, for translation of French 
text to English text. Our goal is to perform fully-automatic,
high-quality text-to-text translation. However, because we 
are still far from achieving this goal, the program can be used 
in both fully-automatic and translator's-assistant modes. 
Our approach is founded upon the statistical analysis of lan- 
guage. Our chief tools are the source-channel model of com-
munication, parametric probability models of language and 
translation, and an assortment of numerical algorithms for 
training such models from examples. This paper presents el- 
ementary expositions of each of these ideas, and explains how 
they have been assembled to produce Candide.
In Section 2 we introduce the necessary ideas from informa- 
tion theory and statistics. The reader is assumed to know el- 
ementary probability theory at the level of [1]. In Sections 3
and 4 we discuss our language and translation models. In 
Section 5 we describe the operation of Candide as it trans- 
lates a French document. In Section 6 we present results of 
our internal evaluations and the ARPA Machine Translation
Project evaluations. Section 7 is a summary and conclusion. 
2. Statistical Translation 
Consider the problem of translating French text to English 
text. Given a French sentence f, we imagine that it was 
originally rendered as an equivalent English sentence e. To
obtain the French, the English was transmitted over a noisy
communication channel, which has the curious property that 
English sentences sent into it emerge as their French trans- 
lations. The central assumption of Candide's design is that 
the characteristics of this channel can be determined experi- 
mentally, and expressed mathematically. 
*Current address: Renaissance Technologies, Stony Brook, NY 
e → English-to-French Channel → f → French-to-English Decoder → ê

Figure 1: The Source-Channel Formalism of Translation.
Here f is the French text to be translated, e is the putative
original English rendering, and ê is the English translation.
This formalism can be exploited to yield French-to-English 
translations as follows. Let us write Pr(e | f) for the probability
that e was the original English rendering of the French f.
Given a French sentence f, the problem of automatic translation
reduces to finding the English sentence that maximizes
Pr(e | f). That is, we seek ê = argmax_e Pr(e | f).
By virtue of Bayes' Theorem, we have

    ê = argmax_e Pr(e | f) = argmax_e Pr(f | e) Pr(e)    (1)
The term Pr(f | e) models the probability that f emerges
from the channel when e is its input. We call this function
the translation model; its domain is all pairs (f, e) of French
and English word-strings. The term Pr(e) models the a priori
probability that e was supplied as the channel input. We call
this function the language model. Each of these factors--the
translation model and the language model--independently
produces a score for a candidate English translation e. The
translation model ensures that the words of e express the
ideas of f, and the language model ensures that e is a gram-
matical sentence. Candide selects as its translation the e that
maximizes their product.
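As a minimal illustration of this decision rule, the sketch below scores each candidate English word string by the product of the two models, computed in log space. The tiny translation table, unigram language model, and candidate list are toy placeholders invented for the example, not Candide's actual models.

    import math

    # Toy stand-ins for the two models of equation (1).
    T_TABLE = {("le", "the"): 0.95, ("le", "a"): 0.05,
               ("chien", "dog"): 0.9, ("chien", "cat"): 0.1}
    UNIGRAM = {"the": 0.07, "a": 0.05, "dog": 0.001, "cat": 0.001}

    def translation_log_prob(f_words, e_words):
        # Crude word-for-word stand-in for log Pr(f | e).
        if len(f_words) != len(e_words):
            return float("-inf")
        return sum(math.log(T_TABLE.get((f, e), 1e-6))
                   for f, e in zip(f_words, e_words))

    def language_log_prob(e_words):
        # Unigram stand-in for log Pr(e).
        return sum(math.log(UNIGRAM.get(e, 1e-6)) for e in e_words)

    def best_translation(f_words, candidates):
        # Equation (1): choose e maximizing Pr(f | e) Pr(e).
        return max(candidates,
                   key=lambda e: translation_log_prob(f_words, e) + language_log_prob(e))

    print(best_translation(["le", "chien"], [["the", "dog"], ["a", "cat"]]))
    # -> ['the', 'dog']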
This discussion begs two important questions. First, where 
do the models Pr(f | e) and Pr(e) come from? Second, even
if we can get our hands on them, how can we search the set of
all English strings to find ê? These questions are addressed
in the next two sections. 
2.1. Probability Models 
We begin with a brief detour into probability theory. A prob- 
ability model is a mathematical formula that purports to ex- 
press the chance of some observation. A parametric model is 
a probability model with adjustable parameters, which can 
be changed to make the model better match some body of 
data. 
Let us write c for a body of data to be modeled, and θ for a
vector of parameters. The quantity Pr_θ(c), computed according
to some formula involving c and θ, is called the likelihood
of c. It is the model's assignment of probability to the observation
sequence c, according to the current parameter values θ.
Typically the formula for the likelihood includes some constraints
on the elements of θ to ensure that Pr_θ(c) really is a
probability distribution--that is, it is always a real value in
[0, 1], and for fixed θ the sum Σ_c Pr_θ(c) over all possible c
vectors is 1.
Consider the problem of training this parametric model on the
data c; that is, adjusting θ to maximize Pr_θ(c). Finding
the maximizing θ is an exercise in constrained optimization.
If the expression for Pr_θ(c) is of a suitable (simple) form, the
maximizing parameter vector θ can be solved for directly.
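For instance, when the model is a simple categorical distribution over symbols (an illustrative choice, not one of Candide's models), the constrained maximum-likelihood solution is available in closed form: the probability of each symbol is just its relative frequency.

    from collections import Counter

    def fit_categorical(observations):
        # Maximum-likelihood parameters of a categorical model: theta holds one
        # probability per symbol, constrained to be non-negative and sum to 1,
        # and the constrained optimum is the relative frequency of each symbol.
        counts = Counter(observations)
        total = sum(counts.values())
        return {symbol: n / total for symbol, n in counts.items()}

    print(fit_categorical("abracadabra"))
    # {'a': 0.4545..., 'b': 0.1818..., 'r': 0.1818..., 'c': 0.0909..., 'd': 0.0909...}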
The key elements of this problem are 
• a vector θ of adjustable parameters,
• constraints on these parameters to ensure that we have a model,
• a vector c of observations, and
• the adjustment of θ, subject to constraints, to maximize the likelihood Pr_θ(c).
We often seek more than a probability model of some observed
data c. There may be some hidden statistics h, which
are related to c, but which are never directly revealed; in general
h itself is restricted to some set H of admissible values.
For instance, c may be a large corpus of grammatical text,
and h an assignment of parts-of-speech to each of its words.
In such cases, we proceed as follows. First we write down a
parametric model Pr_θ(c, h). Then we attempt to adjust the
parameter vector θ to maximize the likelihood Pr_θ(c), where
this latter is obtained as the sum Σ_{h∈H} Pr_θ(c, h).

Unfortunately, when we attempt to solve this more complicated
problem, we often discover that we cannot find a closed-form
solution for θ. Instead we obtain formulae that express
each of the desired parameters in terms of all the others, and
also in terms of the observation vector c.

Nevertheless, we can frequently apply an iterative technique
called the Expectation-Maximization or EM algorithm; this
is a recipe for computing a sequence θ_1, θ_2, ... of parameter
vectors. It can be shown [2] that under suitable conditions,
each iteration of the algorithm is guaranteed to produce a
better model of the training vector c; that is,

    Pr_{θ_{t+1}}(c) ≥ Pr_{θ_t}(c),    (2)

with strict inequality everywhere except at stationary points
of Pr_θ(c). When we adjust the model's parameters this way,
we say it has been EM-trained.

Training a model with hidden statistics is just like training
one that lacks them, except that it is not possible to find
a maximizing θ in just one go. Training is now an iterative
process, involving repeated passes over the observation
vector. Each pass yields an improved model of that data.

Now we relate these methods to the problem at hand, which
is to develop a translation model Pr(f | e) and a language
model Pr(e). Consider the translation model. As any first-year
language student knows, word-for-word translation of
English to French does not work. The dictionary equivalents
of the English words can move forward or backward in the
sentence, they may disappear completely, and new French
words may appear to arise spontaneously.

Guided by this observation, our approach has been to write
down an enormous parametric expression, Pr_θ(f | e), for the
translation model. To give the reader some idea of the scale of
the computation, there is a parameter, t(f | e), for the probability
that any given English word e will translate as any
given French word f. There are parameters for the probability
that any f may arise spontaneously, and that any e
may simply disappear. There are parameters that words may
move forward or backward 1, 2, 3, ... positions. And so on.

We use a similar approach to write an expression for Pr_θ(e).
In this case the parameters express things like the probability
that a word e_i may appear in a sentence after some word
sequence e_1 e_2 ... e_{i-1}. In general, the parameters are of the
form Pr(e_i | v), where the vector v is a combination of observable
statistics, like the identities of nearby words, and hidden
statistics, like the grammatical structure of the sentence. We
refer to v as a history, from which we predict e_i.

The parameter values of both models are determined by EM
training. For the translation model, the training data consists
of English-French sentence pairs (e, f), where e and f
are translations of one another. For the language model, it
consists exclusively of English text.
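To make the EM recipe concrete, the following sketch applies it to a toy hidden-data problem unrelated to translation: each observed sequence of coin flips was produced by one of two coins of unknown bias, and which coin was used is hidden. Each pass computes posterior probabilities of the hidden choice (the E-step) and re-estimates the biases from the resulting expected counts (the M-step); successive passes satisfy inequality (2).

    def em_two_coins(sequences, iters=20):
        # theta = (bias of coin A, bias of coin B); which coin produced each
        # sequence is the hidden statistic h.
        theta_a, theta_b = 0.6, 0.5              # arbitrary starting point
        for _ in range(iters):
            heads_a = tails_a = heads_b = tails_b = 0.0
            for seq in sequences:
                h, t = seq.count("H"), seq.count("T")
                like_a = theta_a ** h * (1 - theta_a) ** t
                like_b = theta_b ** h * (1 - theta_b) ** t
                post_a = like_a / (like_a + like_b)   # E-step: Pr(coin A | seq, theta)
                post_b = 1.0 - post_a
                heads_a += post_a * h
                tails_a += post_a * t
                heads_b += post_b * h
                tails_b += post_b * t
            theta_a = heads_a / (heads_a + tails_a)   # M-step: re-estimate biases
            theta_b = heads_b / (heads_b + tails_b)
        return theta_a, theta_b

    print(em_two_coins(["HHHHHTHHHH", "TTHTTTHTTT", "HHHHTHHHHH", "THTTTTTHTT"]))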
2.2. Decoding 
We do not actually search the infinite set of all English word
strings to find the ê that maximizes equation (1). Even if we
restricted ourselves to word strings of bounded length, for any
realistic length and English vocabulary E, this is far too large
a set to search exhaustively. Instead we adapt the well-known
stack decoding algorithm [5] of speech recognition. Though
we will say more about decoding in Section 6 below, most
of our research effort has been devoted to the two modeling
problems.
This is not without reason. The translation scheme we have
just described can fail in only two ways. The first way is a
search error, which means that our decoding procedure did
not yield the ê that maximizes Pr(f | e) Pr(e). The second
way is a modeling error, which means that the best English
translation, as supplied by a competent human, did not maximize
this same product. Our tests show that only 5% of our
system's errors are search errors.
3. Language Modeling 
Let e be a string of English words e_1 ... e_L. A language model
Pr(e) gives the probability that e would appear in grammat- 
ical English text. 
By the laws of conditional probability we may write

    Pr(e) = Pr(e_1 ... e_L)
          = Pr(e_1) Pr(e_2 | e_1) Pr(e_3 | e_1 e_2) ... Pr(e_L | e_1 ... e_{L-1}).

Given this decomposition, the language modeler's job is to
estimate each of the conditional distributions on the right-hand side.
If |E| is the size of the English vocabulary, then the number
of different histories e_1 ... e_{k-1} in the kth conditional grows
as |E|^{k-1}. This presents problems both in practice and in
principle--the former because we don't have enough storage
to write down all the different histories, the latter because
even if we could, any one history would be exceedingly rare,
making it impossible to estimate probabilities accurately.
For these reasons, Candide has used the so-called trigram
model as its workhorse. In this model, we use the approximation

    Pr(e_k | e_1 ... e_{k-1}) ≈ Pr(e_k | e_{k-2} e_{k-1})

for each term on the right-hand side above. That is, we limit
the history to two words. Each triple (e_{k-2}, e_{k-1}, e_k) is called
a trigram.
It remains to estimate the Pr(e_k | e_{k-2} e_{k-1}). One solution
is to use maximum-likelihood trigram probabilities,
T(e_k | e_{k-2} e_{k-1}). These are obtained by scanning the training
corpus c, counting the incidence of each trigram, and using
these counts to form the appropriate conditional estimates.
But even for this modest history size, we frequently encounter
trigrams during translation that do not appear during training.
This is not surprising, since there are |E|^3 = 1.773 × 10^15
possible different trigrams, yet we can encounter no more
than |c| of them during training. There are 75,349,888 distinct
trigrams in our training corpus, of which 53,737,350
occur exactly once.
For this reason, we employ the technique of deleted interpolation
[6]: we express Pr(e_k | e_{k-2} e_{k-1}) as a linear combination
of the trigram probability T(e_k | e_{k-2} e_{k-1}), the bigram probability
B(e_k | e_{k-1}), the unigram probability U(e_k), and the
uniform probability 1/|E|. The distributions B and U are
obtained by counting the incidence of bigrams and unigrams
in the same training corpus c. But there are fewer distinct
bigrams, so we have a higher chance of seeing any given one
in our training data, and a still higher chance of seeing any
given unigram. The resulting formula for Pr(e_k | e_{k-2} e_{k-1}) is
called the smoothed trigram model.
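The sketch below shows the form such a model takes; the interpolation weights are fixed by hand here, whereas deleted interpolation actually estimates them from held-out data, and the sentence-boundary handling is simplified.

    from collections import Counter

    def train_smoothed_trigram(corpus, vocab_size, weights=(0.5, 0.3, 0.15, 0.05)):
        # Collect trigram, bigram, and unigram counts (and matching history
        # counts) from a corpus given as a list of tokenized sentences.
        tri, bi, uni = Counter(), Counter(), Counter()
        hist2, hist1 = Counter(), Counter()
        total = 0
        for sent in corpus:
            words = ["<s>", "<s>"] + sent
            for k in range(2, len(words)):
                w2, w1, w = words[k - 2], words[k - 1], words[k]
                tri[(w2, w1, w)] += 1
                hist2[(w2, w1)] += 1
                bi[(w1, w)] += 1
                hist1[w1] += 1
                uni[w] += 1
                total += 1

        def prob(w, w2, w1):
            # Interpolated estimate of Pr(w | w2 w1): trigram, bigram, unigram,
            # and uniform estimates mixed with fixed weights.
            l3, l2, l1, l0 = weights
            t = tri[(w2, w1, w)] / hist2[(w2, w1)] if hist2[(w2, w1)] else 0.0
            b = bi[(w1, w)] / hist1[w1] if hist1[w1] else 0.0
            u = uni[w] / total if total else 0.0
            return l3 * t + l2 * b + l1 * u + l0 / vocab_size

        return prob

    lm = train_smoothed_trigram([["the", "dog", "ate", "my", "homework"]], vocab_size=50000)
    print(lm("ate", "the", "dog"))    # interpolated estimate of Pr(ate | the dog)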
Even the smoothed trigram model leaves much to be desired, 
since it does not account for semantic and syntactic depen- 
dencies among words that do not lie within the same trigram. 
This has led us to use a link grammar model. This is a train- 
able, probabilistic grammar that attempts to capture all the 
information present in the trigram model, and also to make 
the long-range connections among words needed to advance 
beyond it. Link grammars are discussed in detail in [7].
4. Translation Modeling 
This section describes the elements of our translation model,
Pr(f | e). We have two distinct translation models, both de-
scribed here: an EM-trained model, and a maximum-entropy 
model. 
As we explain in Section 4.2 below, the EM-trained model 
is developed through a succession of five provisional models. 
Before we describe them, we introduce the notion of align- 
ment. 
4.1. Alignment 
Consider a pair of French and English sentences (e, f) that are 
translations of one another. Although we argued above that 
word-for-word translation will not work to develop f from e, 
it is clear that there is some relation between the individual 
words of the two sentences. A typical assignment of relations 
is depicted in Figure 2. 
The_1 dog_2 ate_3 my_4 homework_5
Le_1 chien_2 a_3 mangé_4 mes_5 devoirs_6

Figure 2: Alignment of a French-English Sentence Pair. The
subscripts give the position of each word in the sentence.
We call such a set of connections between sentences an alignment.
Formally we express it as a set a of pairs (j, i), where
each pair stands for a connection between the jth word of f
and the ith word of e. Our intention is to connect f_j and
e_i when e_i was one of the words expressing in English the
concept that f_j (possibly along with other words of f) expresses
in French. In its most general form, an alignment
may consist of any set a of (j, i) pairs. But for simplicity, we
restrict ourselves to alignments in which each French word is
connected to a unique English word.
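Concretely, the evident alignment for the sentence pair of Figure 2 can be written as a set of (j, i) pairs or, in the restricted form, as a vector indexed by French position; the representation below is purely illustrative.

    # The alignment of Figure 2 as (j, i) pairs: French position j connects to
    # English position i (1-based, as in the figure).
    alignment = {(1, 1), (2, 2), (3, 3), (4, 3), (5, 4), (6, 5)}

    # In the restricted form each French word connects to exactly one English
    # word, so the same alignment fits in a vector a with a[j] = i.
    a = {j: i for (j, i) in alignment}
    assert len(a) == len(alignment)    # holds only for the restricted alignments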
We cannot hope to discover alignments with certainty. Our
strategy is to train a parametric model for the joint distribution
Pr(f, a | e), where the alignment a is hidden. In principle,
the desired conditional Pr(f | e) may then be obtained
as Σ_a Pr(f, a | e), where the sum is taken over all possible
alignments of e and f. In practice this is possible only
for our first two models. For the remaining models, we approximate
Pr(f | e) as follows. During training, we find the
single most probable alignment â, and sum Pr(f, a | e) over
a small neighborhood of â. During decoding, we simply use Pr(f, â | e).
4.2. EM-Trained Models 
We now sketch the structure of five models of increasing com- 
plexity, the last of which is our EM-trained translation model. 
For an in-depth treatment, the reader is referred to [3].
1. Word Translation This is our simplest model, intended 
to discover probable individual-word translations. The free 
parameters of this model are word translation probabilities
t(f_j | e_i). Each of these parameters is initialized to 1/|F|,
where F is our French vocabulary. Thus we make no initial
assumptions about appropriate French-English word pairings.
The iterative training procedure automatically finds
appropriate translations, and assigns them high probability.
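A minimal sketch of this style of iterative training appears below: each pass attributes every French word fractionally to the English words of its sentence pair (in proportion to the current t values) and then renormalizes the expected counts. It omits the null word and the many refinements described in [3].

    from collections import defaultdict

    def train_model1(pairs, french_vocab, english_vocab, iterations=10):
        # pairs is a list of (english_words, french_words) sentence pairs.
        # Uniform initialization: no prior assumption about word pairings.
        t = {(f, e): 1.0 / len(french_vocab)
             for f in french_vocab for e in english_vocab}
        for _ in range(iterations):
            count = defaultdict(float)    # expected count of e translating as f
            total = defaultdict(float)    # expected count of e translating as anything
            for e_words, f_words in pairs:
                for f in f_words:
                    norm = sum(t[(f, e)] for e in e_words)
                    for e in e_words:
                        c = t[(f, e)] / norm          # posterior that e generated f
                        count[(f, e)] += c
                        total[e] += c
            for (f, e) in t:                          # re-estimate by renormalizing
                if total[e] > 0:
                    t[(f, e)] = count[(f, e)] / total[e]
        return t

    pairs = [(["the", "dog"], ["le", "chien"]),
             (["the", "cat"], ["le", "chat"])]
    t = train_model1(pairs, {"le", "chien", "chat"}, {"the", "dog", "cat"})
    print(round(t[("chien", "dog")], 3))   # rises well above the uniform 1/3 start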
2. Local Alignment To make our model more realistic,
we introduce an alignment variable a_j for each position j
of f; a_j is the position in e to which the jth word of f is
aligned. (French words that appear to arise spontaneously
are said to align to the null word, in position 0 of e.) Formally,
we insert a parameter Pr(a_j | j, m, l) into our expression for
Pr(f, a | e). This expresses the probability that position j
in an arbitrary French sentence of length m is aligned with
position a_j in any English sentence of length l that is its
translation. The identities of the words in these positions do
not influence the alignment probabilities.
3. Fertilities As we observed earlier, a single English word
may yield 0, 1, or more French words, for instance as when not
translates to ne...pas. This idea is implicit in our notion of
alignment, but not explicitly related to word identities. To
capture this phenomenon explicitly, this model introduces
the notion of fertility. The fertility φ(e_i) is the number of
French words in f that e_i generates in translation. Fertility is
incorporated into this model through the parameters φ(n | e_i),
the probability that φ(e_i) equals n.
4. Class-Based Alignment In the preceding model,
though the fertilities are conditioned upon word identities,
the alignment parameters are not. We have already pointed
out how unrealistic this is, since it aligns positions in the (e, f)
pair with no regard for the words found there. This model
remedies the problem by expressing alignments in terms of
parameters that depend upon the classes of words that lie at
the aligned positions. Each word f in our French vocabulary
F is placed in one of approximately 50 classes; likewise for
each e in the English vocabulary E. The assignment of words
to classes is made automatically through another statistical
training procedure [3].
5. Non-Deficient Alignment The preceding two models 
suffer from a problem we call deficiency: they assign non-zero 
probability to "alignments" that do not correspond to strings 
of French words at all. For instance, two French words may 
be assigned to lie at the same position in the sentence. Words 
may be placed before the start of the sentence, or after its 
end. This model eliminates such spurious alignments. 
These five models are trained in succession on the same data, 
with the final parameter values of one model serving as the 
starting point for the next. For the current version of Can- 
dide, we used a corpus of 2,205,733 English-French sentence 
pairs, drawn mostly from the Hansards, which are the pro- 
ceedings of the Canadian Parliament. The entire compu- 
tation took a total of approximately 3600 processor-hours 
distributed over fifteen IBM Model 530H POWERstations. 
The reader may be wondering why we have five translation 
models instead of one. This is because the EM algorithm, 
though guaranteed to converge to a local maximum, need 
not converge to a global one. A weakness of the algorithm is 
that it may yield a parameter vector θ that is indeed a local
maximum, but which does not model the data well. 
It so happens though that model 1 has a special form that en- 
sures that EM training is guaranteed to converge to a global 
maximum. By using model 1's final parameter vector as the
initial vector for model 2, we are assured that we are at a 
reasonably good starting point for training the latter. By 
extension of this argument, we proceed through the training 
of each model in succession, with some confidence that each 
model's starting point is a good one. 
4.3. Context Sensitive Models 
All of the preceding translation models make one important 
simplification: each English word acts independently of all 
the others in the sentence to generate the French words to 
which it is aligned. But it is easy to convince oneself that this 
approach is inadequate; clearly run will translate differently 
in Let's run the program! and Let's run the race!. Intuitively, 
we would like to make the translation of a word depend upon 
the context in which it appears.
For this reason, we have constructed translation models that 
take context into account. Our instinct is to make the trans- 
lation of a word depend upon its neighbors, say writing 
t(f_j | e_i, e_{i±1}, ...) for the word-translation probabilities. But
this is impractical, because of the same difficulties that con- 
front language models with long histories. 
To overcome this, we employ a technique--maximum-entropy 
modeling--that deals with small chosen subsets of a poten- 
tially large number of conditioning variables. We begin with 
a large set Q = {b_1(f, e, e), b_2(f, e, e), b_3(f, e, e), ...} of binary-valued
functions. Each such function asks some yes/no question
about the French word f, the English word e, and the
context e in which e appears.

The training procedure works iteratively to find a small subset
Q′ = {b_{k1}(f_j, e_i, e), b_{k2}(f_j, e_i, e), ..., b_{kn}(f_j, e_i, e)} that
disambiguates the senses of the English word in context. Formally,
it develops a distribution t(f_j | e_i, Q′) that tells us if
f_j is a good translation of e_i in the context e. Since this procedure
is costly in computer time, we develop such models
only for the 2,000 most common English words. For more
information about maximum-entropy modeling, the reader is
referred to [4].
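The sketch below shows the log-linear form such a model takes: the probability of a French word is proportional to the exponential of the summed weights of the active features. The two feature functions and their weights are invented for illustration; Candide's actual feature selection and weight training are far more elaborate.

    import math

    def maxent_translation_prob(f, e, context, features, weights, french_vocab):
        # Pr(f | e, context) proportional to exp(sum of weights of active features).
        def score(f_candidate):
            return math.exp(sum(w for b, w in zip(features, weights)
                                if b(f_candidate, e, context)))
        z = sum(score(f_prime) for f_prime in french_vocab)   # normalization
        return score(f) / z

    # Hypothetical binary features for the English word "run".
    features = [
        lambda f, e, ctx: f == "exécuter" and "program" in ctx,
        lambda f, e, ctx: f == "courir" and "race" in ctx,
    ]
    weights = [1.2, 1.5]    # in practice these are trained, not set by hand
    print(maxent_translation_prob("exécuter", "run",
                                  ["let's", "run", "the", "program"],
                                  features, weights,
                                  {"exécuter", "courir", "marcher"}))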
5. Analysis-Transfer-Synthesis 
Although we try to obtain accurate estimates of the parame- 
ters of our translation models by training on a large amount 
of text, this data is not used as effectively as it might be. For 
instance, the word-translation probabilities t(parle | speaks)
and t(parlent | speak) must be learned separately, though they
express the underlying equivalence of the infinitives parler
and to speak.
For this reason, we have adopted for Candide a variation of 
the analysis-transfer-synthesis paradigm. In this paradigm, 
translation takes place not between raw French and English 
texts, but between intermediate forms of the two languages. 
Note that because translation is effected between interme- 
diate French and intermediate English, all our models are 
trained upon intermediate text as well. For training, each 
(e, f) pair of our data is subjected to an analysis step: the 
French is rendered into an intermediate French f', the Eng- 
lish into intermediate English e'. The English transformation 
is constructed to ensure that it is invertible; its inverse, from 
intermediate English to standard English, is usually called 
synthesis. 
The aim of these transformations is three-fold: to suppress
lexical variations that conceal regularities between the two
languages, to reduce the size of both vocabularies, and to
reduce the burden on the alignment model by making corresponding
phrases resemble each other as closely as possible
with respect to length and word order.
Both the English and the French analysis steps consist of 
five classes of operations: segmentation, name and number 
detection, case and spelling correction, morphological analy- 
sis, and linguistic normalization. During segmentation, the 
French is divided (if possible) into shorter phrases that rep- 
resent distinct concepts. This does not modify the text, but 
the translation model, used later, respects this division by 
ignoring alignments that cross segment boundaries. 
During name and number detection, numbers and proper
names--word strings such as Ethiopie, Grande Bretagne and
$.85--are removed from the French text and replaced by
generic name and number markers. Removing names and
numbers greatly reduces the size of E and F. The excised
texts are translated by rule and kept in a table, to be substituted
back into the English sentence during synthesis.
During case and spelling correction, we correct any obvi- 
ous spelling errors, and suppress the case variations in word 
spellings that arise from the conventions of English and 
French typography. 
During morphological analysis, we first use a hidden Markov
model [8] to assign part-of-speech labels to the French, then
use these labels to replace inflected verb forms with their
infinitives, preceded by an appropriate tense marker. We also
put nouns into singular form and precede them by number
markers, and perform a variety of other morphological transformations.
Finally, during linguistic normalization we perform a series of
word reorderings, insertions and rewritings intended to regularize
each language, and to make the two languages more
closely resemble each other. For example, the contractions
au and du are rewritten as à le and de le. Constructions such
as il y a and ne...pas are replaced with one-word tokens. The
English possessive construction is made to resemble French
by removing the 's or s' suffix, reordering noun phrases, and
inserting an additional token. Thus my aunt's pen becomes
intermediate English dummy-article pen 's my aunt; note the
similarity to the French le stylo de ma tante.
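A toy illustration of this kind of token-level rewriting, covering only the two contractions mentioned above:

    # Illustrative rewriting rules; the actual normalization step applies a much
    # larger battery of reorderings, insertions, and rewritings.
    REWRITES = {"au": ["à", "le"], "du": ["de", "le"]}

    def normalize_french(tokens):
        out = []
        for tok in tokens:
            out.extend(REWRITES.get(tok, [tok]))
        return out

    print(normalize_french(["le", "stylo", "du", "professeur"]))
    # ['le', 'stylo', 'de', 'le', 'professeur']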
6. Operation of Candide 
In previous sections we have indicated how the parameters 
of Candide's various models are determined via the EM algo- 
rithm and maximum-entropy methods. We now outline the
steps involved in the execution of Candide as it translates a 
French passage into English. The process of translation, di- 
vided into analysis, transfer, and synthesis stages, is depicted 
in Figure 3. 
In the analysis stage, the French input string f is converted
into f′, as discussed above. The output of this stage is denoted
in Figure 3 as Intermediate French.
The transfer stage constitutes the decoding process sketched 
in Section 2.2 above. Decoding consists of two steps. In the 
first step, Candide develops a set H* of candidate decodings, 
using coarse versions of our translation and language models 
to select its elements. In the second step, the system expands 
H* and rescores the enlarged set using more sophisticated 
models. We now describe both steps in greater detail. 
In the first step, Candide applies a variation of the stack 
decoding algorithm to generate candidate decodings. Decod- 
ing proceeds left-to-right, one intermediate English word at a 
time. At each stage we maintain a ranked set H^(i) of partial
hypotheses for the intermediate English ê′.
In general, the elements of H^(i) are partial decodings of f′;
that is, only the leading i words of ê′ have been filled in,
and these account for only some of the words of f′. To advance
the decoding, some elements of H^(i) are selected to be
extended by one word. The translation and language models
work together to generate the (i + 1)st word; the resulting
partial decodings are ranked; this ranked set is H^(i+1).
An hypothesis is complete when all words of f′ have been
accounted for. Note that while the intermediate English is
generated left-to-right, the treatment of intermediate French
words does not necessarily proceed left-to-right, due to the
word-reordering property of the channel. This is one of the
key ways that translation differs from speech--a difference
that greatly complicates the decoding process.
The ranking of hypotheses is according to the product
Pr(f′ | e′) Pr(e′). In the interest of speed, and because we
must deal with partial rather than complete sentences, we
employ the EM-trained translation model and the smoothed
trigram language model. The output of this step is a ranked
set H* of the 140 best intermediate English sentences.
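The sketch below gives a greatly simplified picture of this first step: partial hypotheses are repeatedly extended by one English word and pruned to a ranked set. The extend and score callbacks stand in for the coarse translation and language models, and the real decoder's hypothesis management, scoring, and treatment of word reordering are considerably more involved.

    import heapq

    def stack_decode(f_words, extend, score, beam=140, max_len=30):
        # A hypothesis is (english_prefix, covered): the English words chosen so
        # far and the set of French positions they account for.  extend() proposes
        # (next_word, newly_covered_positions) continuations; score() combines the
        # coarse translation-model and language-model scores of a hypothesis.
        hyps = [((), frozenset())]
        complete = []
        for _ in range(max_len):
            new_hyps = []
            for prefix, covered in hyps:
                if len(covered) == len(f_words):
                    complete.append((prefix, covered))   # all French words accounted for
                    continue
                for word, positions in extend((prefix, covered), f_words):
                    new_hyps.append((prefix + (word,), covered | frozenset(positions)))
            if not new_hyps:
                break
            hyps = heapq.nlargest(beam, new_hyps, key=score)   # the ranked set H(i+1)
        complete.extend(h for h in hyps if len(h[1]) == len(f_words))
        return heapq.nlargest(beam, complete, key=score)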
During the second step, called perturbation search, we enlarge 
H* by considering sequences of single-word deletions, inser- 
tions or replacements to its elements. Then we rerank the 
enlarged set using the link grammar language model and the 
maximum-entropy translation model. The highest-scoring in- 
termediate English sentence that we encounter during pertur- 
bation search is the output ê′ of the transfer stage.
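A sketch of the neighborhood generated during perturbation search; rescoring with the refined models is not shown, and a full vocabulary would make this set enormous, so a realistic use would restrict the candidate words.

    def perturbations(sentence, vocabulary):
        # Single-word deletions, replacements, and insertions of a word string.
        words = list(sentence)
        out = []
        for i in range(len(words)):
            out.append(words[:i] + words[i + 1:])                       # deletion
            out += [words[:i] + [w] + words[i + 1:]
                    for w in vocabulary if w != words[i]]               # replacement
        for i in range(len(words) + 1):
            out += [words[:i] + [w] + words[i:] for w in vocabulary]    # insertion
        return out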
The final stage, synthesis, converts the intermediate English
ê′ into a plain English sentence ê.
7. Performance 
We evaluate our system in two ways: through participation 
in the ARPA evaluations, and through our own internal tests. 
The ARPA evaluation methodology, devised and executed by 
PRC, is detailed in [9]; we recount it here briefly. ARPA pro-
vides us with a set of French passages, which we process in 
two ways. First, the passages are translated without any hu- 
man intervention. This is the fully-automatic mode. Second, 
each of the same passages is translated by two different hu- 
mans, once with and once without the aid of Transman, our 
translation assistance tool. Transman presents the user with 
an automated dictionary, a text editor, and parallel views of 
the French source and the English fully-automatic transla- 
tion. The passages are ordered in such a way as to suppress 
the influence of differing levels of translation skill, and the 
times of all the human translations are recorded. 
Figure 3: Steps in the Execution of Candide. The stages are:
French input f; analysis (case and spelling correction, name
and number detection, segmentation, morphological analysis,
word reordering) yielding intermediate French f′; transfer
(stack decoding with coarse models, then perturbation search
with refined models) yielding intermediate English ê′; and
English synthesis, yielding the English output ê.

PRC scores all the resulting English texts for fluency and adequacy,
reporting these as numbers between 0 and 1. Fluency
is intended to measure the well-formedness of translated
sentences; adequacy is intended to measure to what extent
the meaning of each source text is present in the translations.
The advantage afforded by Transman is determined
by computing the ratio t_Transman / t_manual for each passage,
where the numerator is the time to translate the passage with
Transman's aid, and the denominator is the time for unaided
manual translation.
The means of all these statistics are presented in Table 1. 
As a benchmark, this table includes a line reporting fluency
and adequacy results in these tests for Systran, a commercial 
fully-automatic French-English translation system, consid- 
ered by some to be the world's best. 
Our in-house evaluation methodology consists of fully- 
automatic translation of 100 sentences of 15 words or less; 
each translation is judged either correct or incorrect. These 
sentences are drawn from the same domain as our training 
data--the Hansard corpus--but they are of course not sen- 
tences that we trained on. Our 1992 system produced 45 
correct translations; our 1993 system produced 62 correct 
translations. 
              Fluency          Adequacy         Time Ratio
              1992    1993     1992    1993     1992    1993
  Systran     .466    .540     .686    .743
  Candide     .511    .580     .575    .670
  Transman    .819    .838     .837    .850     .688    .625
  Manual      .833    .840
Table 1: ARPA Evaluation Results. The Systran line reports
results for Systran French-to-English fully-automatic translations.
The Candide line reports results for our system's
fully-automatic translations; the Transman line reports
results for our system's machine-assisted translations.
8. Summary 
We began with a review of the source-channel formalism of 
information theory, and how it may be applied to translation. 
Our approach reduces to formulating and training two para- 
metric probability models: the language model Pr(e), and 
the translation model Pr(f | e). We described the structure
of both models, and how they are trained. 
We explained the use of the analysis-transfer-synthesis par- 
adigm, and sketched the system's operation. Finally, we gave 
performance results for Candide, in both its human-assisted 
and fully-automatic operating modes. 
In our opinion, the most promising avenues for exploration 
are: the continued elaboration of the link grammar language 
model, more sophisticated translation models, the maximum- 
entropy modeling technique, and a more systematic approach 
to French and English morphological and syntactic analysis. 
References 
1. Allen, Arnold O. Probability, Statistics and Queueing 
Theory, Academic Press, New York, NY, 1978. 
2. Baum, L. E. An inequality and associated maximization 
technique in statistical estimation of probabilistic func- 
tions of a Markov process. Inequalities, 3, 1972, pp 1-8.
3. Brown, Peter F., Stephen A. Della Pietra, Vincent J. 
Della Pietra, Robert L. Mercer. The mathematics of 
statistical machine translation: parameter estimation. 
Computational Linguistics 19(2), June 1993, pp 263- 
311. 
4. Jaynes, E. T. Notes on present status and future 
prospects. Maximum Entropy and Bayesian Methods,
W. T. Grandy and L. H. Schick, eds. Kluwer Academic 
Press, 1990, pp 1-13. 
5. Jelinek, Frederick. A fast sequential decoding algorithm
using a stack. IBM Journal of Research and Develop- 
ment, 13, November 1969, pp 675-685. 
6. Jelinek, F., R. L. Mercer. Interpolated estimation of 
Markov source parameters from sparse data. In Proceed- 
ings, Workshop on Pattern Recognition in Practice, Am- 
sterdam, The Netherlands, 1980. 
7. Lafferty, John, Daniel Sleator, Davy Temperley. Grammatical
trigrams: a probabilistic model of link grammar.
Proceedings of the 1992 AAAI Fall Symposium on
Probabilistic Approaches to Natural Language.
8. Merialdo, Bernard. Tagging text with a probabilistic
model. Proceedings of the IBM Natural Language ITL, 
Paris, France, 1990, pp 161-172. 
9. White, John S., Theresa A. O'Connell, Lynn M. Carl- 
son. Evaluation of machine translation. In Human Lan- 
guage Technology, Morgan Kaufmann Publishers, 1993,
pp 206-210. 