A STATISTICAL APPROACH TO LANGUAGE TRANSLATION 
P. BROWN, J. COCKE, S. DELLA PIETRA, V. DELLA PIETRA,
F. JELINEK, R. MERCER, and P. ROOSSIN
IBM Research Division 
T.J. Watson Research Center 
Department of Computer Science 
P.O. Box 218 
Yorktown Heights, N.Y. 10598 
ABSTRACT 
An approach to automatic translation is outlined that utilizes
techniques of statistical information extraction from large data
bases. The method is based on the availability of pairs of large
corresponding texts that are translations of each other. In our
case, the texts are in English and French.
Fundamental to the technique is a complex glossary of
correspondence of fixed locutions. The steps of the proposed
translation process are: (1) Partition the source text into a set
of fixed locutions. (2) Use the glossary plus contextual
information to select the corresponding set of fixed locutions in
the target language. (3) Arrange the words of the target fixed
locutions into a sequence forming the target sentence.
We have developed statistical techniques facilitating both the
automatic creation of the glossary, and the performance of the
three translation steps, all on the basis of an alignment of
corresponding sentences in the two texts.
While we are not yet able to provide examples of French /
English translation, we present some encouraging intermediate
results concerning glossary creation and the arrangement of target
word sequences.
1. INTRODUCTION 
In this paper we will outline an approach to automatic translation
that utilizes techniques of statistical information extraction from
large data bases. These self-organizing techniques have proven
successful in the field of automatic speech recognition [1,2,3].
Statistical approaches have also been used recently in
lexicography [4] and natural language processing [3,5,6]. The idea
of automatic translation by statistical (information theoretic)
methods was proposed many years ago by Warren Weaver [7].
As will be seen in the body of the paper, the suggested technique
is based on the availability of pairs of large corresponding texts
that are translations of each other. In particular, we have chosen
to work with the English and French languages because we were
able to obtain the bilingual Hansard corpus of proceedings of the
Canadian parliament containing 30 million words of text [8]. We
also prefer to apply our ideas initially to two languages whose
word order is similar, a condition that French and English satisfy.
Our approach eschews the use of an intermediate mechanism
(language) that would encode the "meaning" of the source text.
The proposal will seem especially radical since very little will be
said about employment of conventional grammars. This
omission, however, is not essential, and may only reflect our
relative lack of tools as well as our uncertainty about the degree
of grammar sophistication required. We are keeping an open
mind!
In what follows we will not be able to give actual results of French
/ English translation: our less than a year old project is not far
enough along. Rather, we will outline our current thinking, sketch
certain techniques, and substantiate our optimism by presenting
some intermediate quantitative data. We wrote this somewhat
speculative paper hoping to stimulate interest in applications of
statistics to translation and to seek cooperation in achieving this
difficult task.
2. A HEURISTIC OUTLINE OF THE BASIC PHILOSOPHY
Figure 1 juxtaposes a rather typical pair of corresponding English
and French sentences, as they appear in the Hansard corpus.
They are arranged graphically so as to make evident that (a) the
literal word order is on the whole preserved, (b) the clausal (and
perhaps phrasal) structure is preserved, and (c) the sentence pairs
contain stretches of essentially literal correspondence interrupted
by fixed locutions. In the latter category are [I rise on = je
souleve], [affecting = a propos], and [one which reflects on =
pour mettre en doute].
It can thus be argued that translation ought to be based on a
complex glossary of correspondence of fixed locutions. Included
would be single words as well as phrases consisting of contiguous
or non-contiguous words. E.g., [word = mot], [word = propos],
[not = ne ... pas], [no = ne ... pas], [seat belt = ceinture], [ate =
a mange] and even (perhaps) [one which reflects on = pour
mettre en doute], etc.
Translation can be somewhat naively regarded as a three stage
process:
(1) Partition the source text into a set of fixed locutions.
(2) Use the glossary plus contextual information to select the 
corresponding set of fixed locutions in the target language. 
(3) Arrange the words of the target fixed locutions into a 
sequence that forms the target sentence. 
This naive approach forms the basis of our work. In fact, we have 
developed statistical techniques facilitating the creation of the 
glossary, and the performance of the three translation steps. 
While the only way to refute the many weighty objections to our
ideas would be to construct a machine that actually carries out
satisfactory translation, some mitigating comments are in order.
We do not hope to partition uniquely the source sentence into 
locutions. In most cases, many partitions will be possible, each 
having a probability attached to it. 
Whether "affecting" is to be translated as "a propos" or
"concernant," or, as our dictionary has it, "touchant" or
"emouvant," or in a variety of other ways, depends on the rest
of the sentence. However, a statistical indication may be
obtained from the presence or absence of particular guide words
in that sentence. The statistical technique of decision trees [9]
can be used to determine the guide word set, and to estimate the
probability to be attached to each possible translate.
The sequential arrangement of target words obtained from the
glossary may depend on an analysis of the source sentence. For
instance, clause correspondence may be insisted upon, in which
case only permutations of words which originate in the same
source clause would be possible. Furthermore, the character of
the source clause may affect the probability of use of certain
function words in the target clause. There is, of course, nothing
to prevent the use of more detailed information about the
structure of the parse of the source sentence. However,
preliminary experiments presented below indicate that only a very
crude grammar may be needed (see Section 6).
3. CREATING THE GLOSSARY, FIRST ATTEMPT
We have already indicated in the previous section why creating a
glossary is not just a matter of copying some currently available
dictionary into the computer. In fact, in the paired sentences of
Figure 1, "affecting" was translated as "a propos," a
correspondence that is not ordinarily available. Laying aside for
the time being the desirability of (idiomatic) word cluster - to -
word cluster translation, what we are after at first is to find for
each word f in the (French) source language the list of words
{e_1, e_2, ..., e_n} of the (English) target language into which f can
translate, and the probability P(e_i | f) that such a translation takes
place.
A first approach to a solution that takes advantage of a large data
base of paired sentences (referred to as 'training text') may be as
follows. Suppose for a moment that in every French / English
sentence pair each French word f translates into one and only one
English word e, and that this word is somehow revealed to the
computer. Then we could proceed by:
1. Establish a counter C(e_i, f) for each word e_i of the English
vocabulary. Initially set C(e_i, f) = 0 for all words e_i. Set J = 1.
2. Find the Jth occurrence of the word f in the French text. Let
it take place in the Kth sentence, and let its translate be the qth
word in the Kth English sentence E = e_{i_1}, e_{i_2}, ..., e_{i_n}. Then
increment by 1 the counter C(e_{i_q}, f).
3. Increase J by 1 and repeat steps 2 and 3.
Setting M(f) equal to the sum of all the counters C(e_i, f) at the
conclusion of the above operation (in fact, it is easy to see that
M(f) is the number of occurrences of f in the total French text),
we could then estimate the probability P(e_i | f) of translating the
word f by the word e_i by the fraction C(e_i, f) / M(f).
The problem with the above approach is that it relies on correct
identification of the translates of French words, i.e., on the
solution of a significant part of the translation problem. In the
absence of such identification, the obvious recourse is to profess
complete ignorance, beyond knowing that the translate is one of
the words of the corresponding English sentence, each of its
words being equally likely. Step 2 of the above algorithm then
must be changed to
2'. Find the Jth occurrence of the word f in the French text. Let
it take place in the Kth sentence, and let the Kth English sentence
consist of words e_{i_1}, e_{i_2}, ..., e_{i_n}. Then increment the counters
C(e_{i_1}, f), C(e_{i_2}, f), ..., C(e_{i_n}, f) by the fraction 1/n.
This second approach is based on the faith that in a large corpus,
the frequency of occurrence of true translates of f in
corresponding English sentences would overwhelm that of other
candidates whose appearance in those sentences is accidental.
This belief is obviously flawed. In particular, the article "the" 
would get the highest count since it would appear multiply in 
practically every English sentence, and similar problems would 
exist with other function words as well. 
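The fractional counting of step 2' and the flaw just described can be sketched in a few lines of Python. This is only an illustration: the function names and the three-sentence toy corpus are assumptions made here, not material from the Hansard data.

```python
from collections import Counter, defaultdict

def fractional_counts(pairs):
    """Step 2': each occurrence of f adds 1/n to every word of the
    paired English sentence (n = length of that sentence)."""
    counts = defaultdict(Counter)          # counts[f][e] plays the role of C(e, f)
    for french, english in pairs:
        n = len(english)
        for f in french:                   # every occurrence of f
            for e in english:              # every English token, repeats included
                counts[f][e] += 1.0 / n
    return counts

def translation_probs(counts):
    """Estimate P(e | f) as C(e, f) / M(f)."""
    probs = {}
    for f, c in counts.items():
        m = sum(c.values())                # equals M(f), the occurrences of f
        probs[f] = {e: v / m for e, v in c.items()}
    return probs

# Hypothetical toy training text (invented for illustration).
toy_pairs = [
    ("le lait".split(), "the milk".split()),
    ("le chat boit le lait".split(), "the cat drinks the milk".split()),
    ("le chat".split(), "the cat".split()),
]
counts = fractional_counts(toy_pairs)
probs = translation_probs(counts)
best_for_lait = max(probs["lait"], key=probs["lait"].get)
```

On this toy text the article "the" receives a larger share for "lait" than "milk" does, which is the flaw discussed above in miniature.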
What needs to be done is to introduce some sort of normalization
that would appropriately discount for the expected frequency of
occurrence of words. Let P(e_i) denote the probability (based on
the above procedure) that the word e_i is a translate of a randomly
chosen French word. P(e_i) is given by
$$ P(e_i) = \sum_{f'} P(e_i \mid f') \, P(f') = \sum_{f'} P(e_i \mid f') \, M(f') / M \tag{3.1} $$
where M is the total length of the French text, and M(f') is the
number of occurrences of f' in that text (as before). The fraction
P(e_i | f) / P(e_i) is an indicator of the strength of association of e_i
with f, since P(e_i | f) is normalized by the frequency P(e_i) of
associating e_i with an average word. Thus it is reasonable to
consider e_i a likely translate of f if the ratio P(e_i | f) / P(e_i) is
sufficiently large.
The above normalization may seem arbitrary, but it has a sound
underpinning from the field of Information Theory [10]. In fact,
the quantity
$$ I(e_i; f) = \log \frac{P(e_i \mid f)}{P(e_i)} \tag{3.2} $$
is the mutual information between the French word f and the
English word e_i.
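A minimal sketch of the normalization, assuming the conditional estimates P(e | f) and the occurrence counts M(f) are already available as dictionaries. The function name and all numbers below are invented for illustration.

```python
import math

def mutual_information(p_e_given_f, occurrences):
    """Compute I(e; f) = log(P(e | f) / P(e)), with P(e) obtained by
    averaging P(e | f') over French words weighted by M(f') / M."""
    total = sum(occurrences.values())                  # M
    p_e = {}
    for f, dist in p_e_given_f.items():
        for e, p in dist.items():
            p_e[e] = p_e.get(e, 0.0) + p * occurrences[f] / total
    return {
        f: {e: math.log(p / p_e[e]) for e, p in dist.items() if p > 0}
        for f, dist in p_e_given_f.items()
    }

# Hypothetical estimates in which "the" dominates every list.
p_e_given_f = {
    "lait": {"the": 0.45, "milk": 0.35, "cat": 0.10, "drinks": 0.10},
    "chat": {"the": 0.45, "cat": 0.35, "milk": 0.10, "drinks": 0.10},
    "le":   {"the": 0.45, "cat": 0.225, "milk": 0.225, "drinks": 0.10},
}
occurrences = {"lait": 2, "chat": 2, "le": 4}          # M(f)

mi = mutual_information(p_e_given_f, occurrences)
best_for_lait = max(mi["lait"], key=mi["lait"].get)
```

Because "the" is equally probable under every French word in this toy table, its mutual information with "lait" is zero, and "milk" moves to the top of the list.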
Unfortunately, while normalization yields ordered lists of likely
English word translates of French words, it does not provide us
with the desired probability values. Furthermore, we get no
guidance as to the size of a threshold T such that e_i would be a
candidate translate of f if and only if
$$ I(e_i; f) > T \tag{3.3} $$
Various ad hoc modifications exist to circumvent the two
problems. One might, for instance, find the pair e_i, f with the
highest mutual information, eliminate e_i and f from all
corresponding sentences in which they occur (i.e. decide once
and for all that in those sentences e_i is the translate of f!), then
re-compute all the quantities over the shortened texts, determine
the new maximizing pair e_i', f' and continue the process until
some arbitrary stopping rule is invoked.
Before the next section introduces a better approach that yields
probabilities, we present in Figure 2 a list of high mutual
information English words for some selected French words. The
reader will agree that even the flawed technique is quite powerful.
4. A SIMPLE GLOSSARY BASED ON A MODEL
OF THE TRANSLATION PROCESS
We will now revert to our original ambition of deriving
probabilities of translation, P(e_i | f). Let us start by observing
that the algorithm of the previous section has the following flaw:
Should it be "decided" that the qth word, e_{i_q}, of the English
sentence is the translate of the rth word, f_{i_r}, of the French
sentence, that process makes no provision for removing e_{i_q} from
consideration as a candidate translate of any of the remaining
French words (those not in the rth position)! We need to find a
method to decide (probabilistically!) which English word was
generated by which French one, and then estimate P(e_i | f) by the
relative frequency with which f gave rise to e_i as "observed" in the
texts of paired French / English sentence translates. Our
procedure will be based on a model (an admittedly crude one) of
how English words are generated from their French counterparts.
With a slight additional refinement to be specified in the next
section (see the discussion on position distortion), the following
model will do the trick. Augment the English vocabulary by the
NULL word e_0 that leaves no trace in the English text. Then each
French word f will produce exactly one 'primary' English word
(which may be, however, invisible). Furthermore, primary
English words can produce a number of secondary ones.
The provisions for the null word and for the production of
secondary words will account for the unequal length of
corresponding French and English sentences. It would be
expected that some (but not all) French function words would
be killed by producing null words, and that English ones would
be created by secondary production. In particular, in the example
of Figure 1, one would expect that "reflects" would generate both
"which" and "on" by secondary production, and "rise" would
similarly generate "on." On the other hand, the article "l'" of
"l'Orateur" and the preposition "a" of "a propos" would both
be expected to generate a null word in the primary process.
This model of generation of English words from French ones then
requires the specification of the following quantities:
1. The probabilities P(e_i | f) that the ith word of the English
dictionary was generated by the French word f.
2. The probabilities Q(e_j | e_i) that the jth English word is
generated from the ith one in a secondary generation process.
3. The probabilities R(k | e_i) that the ith English word generates
exactly k other words in the secondary process. By convention,
we set R(0 | e_0) = 1 to assure that the null word does not generate
any other words.
The model probability that the word f generates e_{i_1} in the primary
process, and e_{i_2}, ..., e_{i_k} in the secondary one, is equal to the product
$$ P(e_{i_1} \mid f) \, R(k-1 \mid e_{i_1}) \, Q(e_{i_2} \mid e_{i_1}) \, Q(e_{i_3} \mid e_{i_1}) \cdots Q(e_{i_k} \mid e_{i_1}) \tag{4.1} $$
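As a small illustration of the product (4.1), the sketch below evaluates it for one French word. The nested-dictionary layout of P, R and Q and every number in the toy tables are assumptions chosen for the example.

```python
def generation_probability(f, primary, secondaries, P, R, Q):
    """Probability that f generates `primary` in the primary process and
    the words in `secondaries` in the secondary one, as in (4.1).

    P[f][e]   : primary generation probability P(e | f)
    R[e][k]   : probability that e generates exactly k secondary words
    Q[e][e2]  : probability that e2 is generated from e secondarily
    """
    prob = P[f][primary] * R[primary][len(secondaries)]
    for s in secondaries:
        prob *= Q[primary][s]
    return prob

# Hypothetical model parameters (not estimates from any corpus).
P = {"souleve": {"rise": 0.5, "raise": 0.5}}
R = {"rise": {0: 0.1, 1: 0.8, 2: 0.1}}
Q = {"rise": {"on": 0.2, "up": 0.1}}

# "souleve" generates "rise", which in turn generates "on".
p = generation_probability("souleve", "rise", ["on"], P, R, Q)
```

Here len(secondaries) corresponds to k - 1 in (4.1), since k counts the primary word together with its secondary words.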
Given a pair of English and French sentences E and F, by the
term generation pattern $ we understand the specification of
which English words were generated from which French ones,
and which secondary words from which primary ones. Therefore,
the probability P(E,$ | F) of generating the words of E in a
pattern $ from those of F is given simply by a product of factors
like (4.1), one for each French word. We can then think of
estimating the probabilities P(e_i | f), R(k | e_i), and Q(e_j | e_i) by the
following algorithm at the start of which all counters are set to
0:
1. For a sentence pair E, F of the texts, find that pattern $ that
gives the maximal value of P(E,$ | F), and then make the
(somewhat impulsive) decision that that pattern $ actually took
place.
2. If in the pattern $, f gave rise to e_i, augment counter
CP(e_i, f) by 1; if e_i gave rise to k secondary English words,
augment counter CR(k, e_i) by 1; if e_j is any (secondary) word that
was given rise to by e_i, augment counter CQ(e_j, e_i) by 1.
3. Carry out steps 1 and 2 for all sentence pairs of the training
text.
4. Estimate the model probabilities by normalizing the
corresponding counters, i.e.,
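$$ P(e_i \mid f) = \mathit{CP}(e_i, f) / \mathit{CP}(f) \quad \text{where} \quad \mathit{CP}(f) = \sum_i \mathit{CP}(e_i, f) $$
$$ R(k \mid e_i) = \mathit{CR}(k, e_i) / \mathit{CR}(e_i) \quad \text{where} \quad \mathit{CR}(e_i) = \sum_k \mathit{CR}(k, e_i) $$
$$ Q(e_j \mid e_i) = \mathit{CQ}(e_j, e_i) / \mathit{CQ}(e_i) \quad \text{where} \quad \mathit{CQ}(e_i) = \sum_j \mathit{CQ}(e_j, e_i) $$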
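Steps 2 to 4 can be sketched as follows, assuming that step 1 has already selected one generation pattern per sentence pair. The representation of a pattern as a list of (French word, primary word, secondary words) triples, and the two toy patterns, are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def normalize(table):
    """Turn a table of counters into conditional relative frequencies."""
    result = {}
    for key, counter in table.items():
        total = sum(counter.values())
        result[key] = {x: c / total for x, c in counter.items()}
    return result

def estimate_from_patterns(patterns):
    """Accumulate the counters CP, CR, CQ over the chosen patterns
    (steps 2 and 3) and normalize them (step 4)."""
    cp = defaultdict(Counter)   # cp[f][e]   ~ CP(e, f)
    cr = defaultdict(Counter)   # cr[e][k]   ~ CR(k, e)
    cq = defaultdict(Counter)   # cq[e][e2]  ~ CQ(e2, e)
    for pattern in patterns:
        for f, primary, secondaries in pattern:
            cp[f][primary] += 1
            cr[primary][len(secondaries)] += 1
            for s in secondaries:
                cq[primary][s] += 1
    return normalize(cp), normalize(cr), normalize(cq)

# Two hypothetical patterns, one per sentence pair.
patterns = [
    [("je", "I", []), ("souleve", "rise", ["on"]), ("la", "a", [])],
    [("souleve", "raise", []), ("la", "the", [])],
]
P, R, Q = estimate_from_patterns(patterns)
```

A French word that produces the null word would simply appear in a pattern with a reserved primary token and an empty list of secondary words.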
The problem with the above algorithm is that it is circular: in
order to evaluate P(E,$ | F) one needs to know the probabilities
P(e_i | f), R(k | e_i), and Q(e_j | e_i) in the first place! Fortunately, the
difficulty can be alleviated by use of iterative re-estimation, which
is a technique that starts out by guessing the values of unknown
quantities and gradually re-adjusts them so as to account better
and better for given data [11].
More precisely, given any specification of the probabilities
P(e_i | f), R(k | e_i), and Q(e_j | e_i), we compute the probabilities
P(E,$ | F) needed in step 1, and after carrying out step 4, we use
the freshly obtained probabilities P(e_i | f), R(k | e_i), and Q(e_j | e_i)
to repeat the process from step 1 again, etc. We halt the
computation when the obtained estimates stop changing from
iteration to iteration.
While it can be shown that the probability estimates obtained in
the above process will converge [11,12], it cannot be proven that
the values obtained will be the desired ones. A heuristic argument
can be formulated making it plausible that a more complex but
computationally excessive version [13] will succeed. Its truncated
modification leads to a glossary that seems a very satisfactory
one. We present some interesting examples of its P(e_i | f) entries
in Figure 3.
Two important aspects of this process have not yet been dealt
with: the initial selection of values of P(e_i | f), R(k | e_i), and
Q(e_j | e_i), and a method of finding the pattern $ maximizing
P(E,$ | F).
A good starting point is as follows:
A. Make Q(e_j | e_i) = 1/K, where K is the size of the English
vocabulary.
B. Let R(1 | e_i) = 0.8, R(0 | e_i) = 0.1, R(2 | e_i) = R(3 | e_i) =
R(4 | e_i) = R(5 | e_i) = 0.025 for all words e_i except the null word
e_0. Let R(0 | e_0) = 1.0.
C. To determine the initial distribution P(e_i | f) proceed as
follows:
(i) Estimate first P(e_i | f) by the algorithm of Section 3.
(ii) Compute the mutual information values I(e_i; f) by formula
(3.2), and for each f find the 20 words e_i for which I(e_i; f) is
largest.
(iii) Let P(e_0 | f) = P(e_i | f) = (1/21) - ε for all words e_i on the
list obtained in (ii), where ε is some small positive number.
Distribute the remaining probability mass uniformly over all the
English words not on the list.
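Step C (iii) might be sketched as below. The function name, the "NULL" token standing for e_0, the value of eps, and the toy scores are all assumptions. The leftover mass is computed here so that the distribution sums to one, which is one reading of the step rather than a detail stated in the text.

```python
def initial_distribution(mi_for_f, vocabulary, top_n=20, eps=1e-4, null_word="NULL"):
    """Give the null word and the top_n highest mutual information words
    an equal share 1/(top_n + 1) - eps, and spread what is left uniformly
    over the rest of the vocabulary."""
    ranked = sorted(mi_for_f, key=mi_for_f.get, reverse=True)[:top_n]
    favoured = [null_word] + ranked
    share = 1.0 / (top_n + 1) - eps
    dist = {e: share for e in favoured}
    rest = [e for e in vocabulary if e not in dist]
    leftover = 1.0 - share * len(favoured)     # mass not yet assigned
    for e in rest:
        dist[e] = leftover / len(rest)
    return dist

# Hypothetical mutual information scores for one French word,
# with top_n shrunk to 2 so the toy vocabulary has words left over.
mi_for_lait = {"milk": 1.2, "drinks": 0.1, "the": 0.0, "cat": -0.3, "a": -1.0}
vocabulary = ["NULL", "the", "milk", "cat", "drinks", "a"]
dist = initial_distribution(mi_for_lait, vocabulary, top_n=2)
```

With top_n=20 the favoured share becomes (1/21) - eps, matching the description in (iii).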
Finding the maximizing pattern $ for a given sentence pair E, F
is a well-studied technical problem with a variety of
computationally feasible solutions that are suboptimal in some
practically unimportant respects [14]. Not to interrupt the flow
of intuitive ideas, we omit the discussion of the corresponding
algorithms.
5. TOWARD A COMPLEX GLOSSARY
In the previous section we have introduced a technique that
derives a word - to - word translation glossary. We will now
refine the model to make the probabilities a better reflection of
reality, and then outline an approach for including in the glossary
the fixed locutions discussed in Section 2.
It should be noted that while English / French translation is quite
local (as illustrated by the alignment of Figure 1), the model
leading to (4.1) did not take advantage of this affinity of the two
languages: the relative position of the word translate pairs in their
respective sentences was not taken into account. If m and n
denote the respective lengths of corresponding French and
English sentences, then the probability that e_{i_k} (the kth word in the
English sentence) is a primary translate of f_{i_h} (the hth word in the
French sentence) should more accurately be given by the
probability P(e_{i_k}, k | f_{i_h}, h, m, n) that depends both on word
positions and sentence lengths. To keep the formulation as simple
as possible, we can restrict ourselves to the functional form
$$ P(e_{i_k}, k \mid f_{i_h}, h, m, n) = \mathit{PW}(e_{i_k} \mid f_{i_h}) \, \mathit{PD}(k \mid h, m, n) \tag{5.1} $$
In (5.1) we make the 'distortion' distribution PD(k | h, m, n)
independent of the identity of the words whose positional
discrepancy it describes.
As far as secondary generation is concerned, it is first clear that
the production of preceding words differs from that of those that
follow. So the R and Q probabilities should be split into left and
right probabilities RL and QL, and RR and QR. Furthermore,
we should provide the Q-probabilities with their own distortion
components that would depend on the distance of the secondary
word from its primary 'parent'. As a result of these
considerations, the probability that f_{i_h} generates (for instance) the
primary word e_{i_k} and preceding and following secondary words
e_{i_{k-3}}, e_{i_{k-1}}, e_{i_{k+2}} would be given by
$$ \mathit{PW}(e_{i_k} \mid f_{i_h}) \, \mathit{PD}(k \mid h, m, n) \, \mathit{RL}(2 \mid e_{i_k}) \, \mathit{RR}(1 \mid e_{i_k}) \, \mathit{QL}(e_{i_{k-3}}, 3 \mid e_{i_k}) \, \mathit{QL}(e_{i_{k-1}}, 1 \mid e_{i_k}) \, \mathit{QR}(e_{i_{k+2}}, 2 \mid e_{i_k}) \tag{5.2} $$
Obviously, other distortion formulations are possible. The
purpose of any is to sharpen the derivation process by restricting
the choice of translates to the positionally likely candidates in the
corresponding sentence.
To find fixed locutions in English, we can use the final
probabilities QL and QR obtained by the method of the previous
section to compute mutual informations between primary and
secondary word pairs,
$$ \mathit{IR}(e; e') = \log \frac{\mathit{QR}(e' \mid e)}{P(e')} \tag{5.3} $$
and
$$ \mathit{IL}(e'; e) = \log \frac{\mathit{QL}(e' \mid e)}{P(e')} $$
where P(e') = C(e')/N is the relative frequency of occurrence of
the secondary word e' in the English text (C(e') denotes the
number of occurrences of e' in the text of size N), and QR and
QL are the average secondary generation probabilities,
$$ \mathit{QR}(e' \mid e) = \sum_i \mathit{QR}(e', i \mid e) \tag{5.4} $$
and
$$ \mathit{QL}(e' \mid e) = \sum_i \mathit{QL}(e', i \mid e) $$
We can then establish an experimentally appropriate threshold
T, and include in the glossary all pairs (e, e') and (e', e) whose
mutual information exceeds T.
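The thresholding step can be illustrated with a short sketch. It assumes the distance-summed probabilities QR(e' | e) and QL(e' | e) and the relative frequencies P(e') are given as dictionaries. The function name, the threshold value and all numbers are invented for the example.

```python
import math

def two_word_locutions(qr, ql, p, threshold):
    """Return ordered word pairs whose mutual information with respect to
    right (QR) or left (QL) secondary generation exceeds the threshold."""
    pairs = []
    for e, dist in qr.items():                 # e' follows its parent e
        for e2, q in dist.items():
            if q > 0 and math.log(q / p[e2]) > threshold:
                pairs.append((e, e2))
    for e, dist in ql.items():                 # e' precedes its parent e
        for e2, q in dist.items():
            if q > 0 and math.log(q / p[e2]) > threshold:
                pairs.append((e2, e))
    return pairs

# Hypothetical secondary generation probabilities and word frequencies.
qr = {"seat": {"belt": 0.4, "the": 0.05}}
ql = {"reflects": {"which": 0.3}}
p = {"belt": 0.001, "the": 0.06, "which": 0.004}

locutions = two_word_locutions(qr, ql, p, threshold=2.0)
```

In this toy setting "seat belt" and "which reflects" pass the threshold, while the pair formed with the frequent word "the" does not.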
While the process above results in two-word fixed locutions,
longer locutions can be obtained iteratively in the next round after 
the two-word variety had been included in the glossary and in the 
formulation of its creation. 
To obtain French locutions, one must simply reverse the direction 
of the translation process, making English and French the source 
and target languages, respectively. 
With two-word locutions present in both the English and French
parts of the glossary, it is necessary to reformulate the generation
process (4.1). The change would be minimal if we could decide
to treat the words of a locution (f, f') as a single word f* =
(f, f') rather than as two separate words f and f' whenever both
are found in a sentence. In such a case nothing more than a
recoding of the French text would be required. However, such a
radical step would almost certainly be wrong: it could well
connect auxiliaries and participles that were not part of a single
past construction. Clearly then, the choice between separateness
and unity should be statistical, with probabilities estimated in the
overall glossary construction process and initialized according to
the frequencies with which elements of the pair f, f' were
associated or not by secondary generation when they appeared in
the same sentence.
Since the approach of this section was not yet used to obtain any 
results, we will leave its complete mathematical specification to a 
future report. 
6. GENERATION OF TRANSLATED TEXT
We have pointed out in Section 2 that translation can be
somewhat naively regarded as a three stage process:
(1) Partition the source text into a set of fixed locutions.
(2) Use the glossary plus contextual information to select the
corresponding set of fixed locutions in the target language.
(3) Arrange the words of the target fixed locutions into a
sequence forming the target sentence.
We have just finished arguing in Section 5 that the partitioning
of source text into locutions is somewhat complex, and that it
must be approached statistically. The basic idea of using
contextual information to select the correct 'sense' of a locution
is to construct a contextual glossary based on a probability of the
form P(e | f, Φ[F]) where e and f are English and French
locutions, and Φ[F] denotes a 'lexical' equivalence class of the
sentence F. The test of class membership would typically depend
on the presence of some combination of words in F. The choice
of an appropriate equivalence classification scheme would, of
course, be the subject of research based on yet another statistical
formulation. The estimate of P(e | f, Φ[F]) would be derived
from counts of locution alignments in sentence translate pairs, the
alignments being estimated based on non-contextual glossary
probabilities of the form (5.2).
The last step in our translation scheme is the re-arrangement of
the words of the generated English locutions into an appropriate
sequence. To see whether this can be done statistically, we
explored what would happen in the impossibly optimistic case
where the words generated in (2) were exactly those of the
English sentence (only their order would be unknown).
From a large English corpus we derived estimates of trigram
probabilities, P(e_3 | e_1, e_2), that the word e_3 follows immediately
the sequence pair e_1, e_2. A model of English sentence production
based on a trigram estimate would conclude that a sentence
e_1, e_2, ..., e_n is generated with probability
$$ P(e_1, e_2) \, P(e_3 \mid e_1, e_2) \, P(e_4 \mid e_2, e_3) \cdots P(e_n \mid e_{n-2}, e_{n-1}) \tag{6.1} $$
We then took other English sentences (not included in the
training corpus) and determined which of the n! different
arrangements of their n words was most likely, using the formula
(6.1). We found that in 63% of sentences of 10 words or less,
the most likely arrangement was the original English sentence.
Furthermore, the most likely arrangement preserved the meaning
of the original sentence in 79% of the cases.
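The experiment can be imitated on a toy scale with the sketch below, which scores every permutation of a bag of words with formula (6.1). The three-sentence training corpus, the add-alpha smoothing, and the way the starting pair probability P(e_1, e_2) is estimated from sentence-initial pairs are all assumptions of this sketch; the text does not say how the trigram estimates were smoothed. Exhaustive search over n! orderings is practical only for short sentences.

```python
import itertools
import math
from collections import Counter

def train(sentences):
    """Collect sentence-initial pair counts and trigram counts."""
    start, context, trigrams, vocab = Counter(), Counter(), Counter(), set()
    for s in sentences:
        vocab.update(s)
        start[(s[0], s[1])] += 1
        for a, b, c in zip(s, s[1:], s[2:]):
            context[(a, b)] += 1
            trigrams[(a, b, c)] += 1
    return start, context, trigrams, len(vocab), len(sentences)

def log_score(words, model, alpha=0.1):
    """Log of formula (6.1) with add-alpha smoothing (an assumption here)."""
    start, context, trigrams, v, n_sent = model
    logp = math.log((start[(words[0], words[1])] + alpha) / (n_sent + alpha * v * v))
    for a, b, c in zip(words, words[1:], words[2:]):
        logp += math.log((trigrams[(a, b, c)] + alpha) / (context[(a, b)] + alpha * v))
    return logp

def best_arrangement(words, model):
    """Try all n! orderings and keep the most probable one."""
    return list(max(itertools.permutations(words), key=lambda p: log_score(p, model)))

# Hypothetical training corpus (invented for illustration).
corpus = [
    "he did this later".split(),
    "he did this yesterday".split(),
    "she did this later".split(),
]
model = train(corpus)
restored = best_arrangement(["later", "this", "did", "he"], model)
```

On this toy corpus the scrambled words are restored to "he did this later", the only ordering whose starting pair and both trigrams were seen in training.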
Figure 4 shows examples of synonymous and non-synonymous
re-arrangements.
We realize that very little hope exists of the glossary yielding the
words and only the words of an English sentence translating the
original French one, and that, furthermore, English sentences are
typically longer than 10 words. Nevertheless, we feel that the
above result is a hopeful one for future statistical translation
methods incorporating the use of appropriate syntactic structure
information.
Mr. Speaker, I rise on a question of privilege
Monsieur l'Orateur, je souleve la question de privilege
affecting the rights and prerogatives of parliamentary committees
a propos des droits et des prerogatives des comites parlementaires
and one which reflects on the word of two ministers
et pour mettre en doute les propos de deux ministres
of the Crown.
de la Couronne.
FIGURE 1
ALIGNMENT OF A FRENCH AND ENGLISH SENTENCE PAIR
eau water 
lait milk 
banque bank 
banques banks 
hier yesterday 
janvier January 
jours days 
votre your 
enfants children
trop too 
toujours always 
trois three 
monde world 
pourquoi why 
aujourd'hui today
sans without 
lui him 
mais but 
suis am 
seulement only
peut cannot 
ceintures seat 
ceintures belts
bravo ! 
FIGURE 2 
A LIST OF HIGH MUTUAL INFORMATION FRENCH-ENGLISH 
WORD PAIRS 
WHICH                      QUI
1. qui         0.380       who         0.188
2. que         0.177       which       0.161
3. dont        0.082       that        0.084
4. de          0.060                   0.038
5. d'          0.035       to          0.032
6. laquelle    0.031       of          0.027
7. ou          0.027       the         0.026
8. et          0.022       what        0.018

THEREFORE                  DONC
1. donc        0.514       therefore   0.322
2. consequent  0.075       so          0.147
3. par         0.074       is          0.034
4. ce          0.066       then        0.024
5. pourquoi    0.064       thus        0.022
6. alors       0.025       the         0.018
7. il          0.025       that        0.013
8. aussi       0.015       us          0.012

STILL                      ENCORE
1. encore      0.435       still       0.181
2. toujours    0.230       again       0.174
3. reste       0.027       yet         0.148
4. ***         0.020       even        0.055
5. quand       0.018       more        0.046
6. meme        0.017       another     0.030
7. de          0.015       further     0.021
8. de          0.014       once        0.013
FIGURE 3 (PART I) 
EXAMPLES OF PARTIAL GLOSSARY LISTS OF MOST LIKELY 
WORD TRANSLATES AND THEIR PROBABILITIES 
Note: *** denotes miscellaneous words not belonging to the lexicon. 
PEOPLE                     GENS
1. les         0.267       people      0.781
2. gens        0.244       they        0.013
3. personnes   0.100       those       0.009
4. population  0.055       individuals 0.008
5. peuple      0.035       persons     0.005
6. canadiens   0.031       people's    0.004
7. habitants   0.024       men         0.004
8. ceux        0.023       person      0.003

OBTAIN                     OBTENIR
1. obtenir     0.457       get         0.301
2. pour        0.050       obtain      0.108
3. les         0.033       have        0.036
4. de          0.031       getting     0.032
5. trouver     0.026       seeking     0.023
6. se          0.025       available   0.021
7. obtenu      0.020       obtaining   0.021
8. procurer    0.020       information 0.016

QUICKLY                    RAPIDEMENT
1. rapidement  0.508       quickly     0.389
2. vite        0.130       rapidly     0.147
3. tot         0.042       fast        0.052
4. rapide      0.021       quick       0.042
5. brievement  0.019       soon        0.036
6. aussitot    0.013       faster      0.035
7. plus        0.012       speedy      0.026
8. bientot     0.012       briefly     0.025
FIGURE 3 (PART II) 
EXAMPLES OF PARTIAL GLOSSARY LISTS OF MOST LIKELY 
WORD TRANSLATES AND THEIR PROBABILITIES 
EXAMPLES OF RECONSTRUCTION THAT PRESERVE
MEANING: 
would I report directly to you? 
I would report directly to you? 
now let me mention some of the disadvantages. 
let me mention some of the disadvantages now.
he did this several hours later. 
this he did several hours later. 
EXAMPLES OF RECONSTRUCTION THAT DO NOT PRESERVE 
MEANING 
these people have a fairly large rate of turnover. 
of these people have a fairly large turnover rate. 
in our organization research has two missions. 
in our missions research organization has two. 
exactly how this might be done is not clear. 
clear is not exactly how this might be done. 
FIGURE 4 
STATISTICAL ARRANGEMENT OF WORDS BELONGING TO 
ENGLISH SENTENCES 

REFERENCES 

[1] L.R. Bahl, F. Jelinek, and R.L. Mercer: A maximum likelihood
approach to continuous speech recognition, IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-5 (2): 179-190, March
1983.

[2] J.K. Baker: Stochastic modeling for automatic speech
understanding. In R.A. Reddy, editor, Speech Recognition, pages
521-541, Academic Press, New York, 1979.

[3] J.D. Ferguson: Hidden Markov analysis: An introduction. In J.D.
Ferguson, Ed., Hidden Markov Models for Speech. Princeton, New
Jersey, IDA-CRD, Oct. 1980, pp. 8-15.

[4] J. McH. Sinclair: "Lexicographic Evidence" in Dictionaries,
Lexicography and Language Learning (ELT Documents: 120), editor
R. Ilson, New York: Pergamon Press, pp. 81-94, 1985.

[5] R.G. Garside, G.N. Leech and G.R. Sampson: The Computational
Analysis of English: a Corpus-Based Approach, Longman, 1987.

[6] G.R. Sampson: "A Stochastic Approach to Parsing" in Proceedings
of the 11th International Conference on Computational Linguistics
(COLING '86), Bonn, pp. 151-155, 1986.

[7] W. Weaver: Translation (1949). Reproduced in: Locke, W.N. &
Booth, A.D., eds.: Machine Translation of Languages. Cambridge, MA:
MIT Press, 1955.

[8] Hansards: Official Proceedings of the House of Commons of Canada,
1974-78, Canadian Government Printing Bureau, Hull, Quebec,
Canada.

[9] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone:
Classification and Regression Trees, Wadsworth and Brooks, Monterey,
CA, 1984.

[10] R.G. Gallager: Information Theory and Reliable Communication,
John Wiley and Sons, Inc., New York, 1968.

[11] A.P. Dempster, N.M. Laird, and D.B. Rubin: Maximum likelihood
from incomplete data via the EM algorithm, Journal of the Royal
Statistical Society, 39(B): 1-38, 1977.

[12] A.J. Viterbi: Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm, IEEE Transactions on
Information Theory, IT-13: 260-267, 1967.

[13] L.E. Baum: An inequality and associated maximization technique
in statistical estimation of probabilistic functions of a Markov process,
Inequalities, 3: 1-8, 1972.

[14] F. Jelinek: A fast sequential decoding algorithm using a stack, IBM
T. J. Watson Research Development, vol. 13, pp. 675-685, Nov. 1969.
