Robust Bilingual Word Alignment 
for Machine Aided Translation 
Ido Dagan Kenneth W. Church 
AT&T Bell Laboratories 
600 Mountain Avenue 
Murray Hill, NJ 07974 
William A. Gale 
Abstract 
We have developed a new program called word_align 
for aligning parallel text, text such 
as the Canadian Hansards that are available in 
two or more languages. The program takes the 
output of char_align (Church, 1993), a robust 
alternative to sentence-based alignment programs,
and applies word-level constraints using
a version of Brown et al.'s Model 2 (Brown
et al., 1993), modified and extended to deal
with robustness issues. Word_align was tested 
on a subset of Canadian Hansards supplied by 
Simard (Simard et al., 1992). The combina- 
tion of word_align plus char_align reduces the 
variance (average square error) by a factor of 
5 over char_align alone. More importantly, be- 
cause word_align and char_align were designed 
to work robustly on texts that are smaller and 
more noisy than the Hansards, it has been pos- 
sible to successfully deploy the programs at 
AT&T Language Line Services, a commercial 
translation service, to help them with difficult 
terminology. 
1 Introduction 
Aligning parallel texts has recently received consid- 
erable attention (Warwick et al., 1990; Brown et al., 
1991a; Gale and Church, 1991b; Gale and Church, 
1991a; Kay and Rosenschein, 1993; Simard et al., 
1992; Church, 1993; Kupiec, 1993; Matsumoto et 
al., 1993). These methods have been used in ma- 
chine translation (Brown et al., 1990; Sadler, 1989), 
terminology research and translation aids (Isabelle, 
1992; Ogden and Gonzales, 1993), bilingual lexi- 
cography (Klavans and Tzoukermann, 1990), col- 
location studies (Smadja, 1992), word-sense disam- 
biguation (Brown et al., 1991b; Gale et al., 1992) 
and information retrieval in a multilingual environ- 
ment (Landauer and Littman, 1990). 
The information retrieval application may be 
of particular relevance to this audience. It would 
be highly desirable for users to be able to express 
queries in whatever language they chose and re- 
trieve documents that may or may not have been 
written in the same language as the query. Lan- 
dauer and Littman used SVD analysis (or Latent 
Semantic Indexing) on the Canadian Hansards, 
parliamentary debates that are published in both 
English and French, in order to estimate a kind of 
soft thesaurus. They then showed that these esti- 
mates could be used to retrieve documents appro- 
priately in the bilingual condition where the query 
and the document were written in different lan- 
guages. 
We have been most interested in the terminol- 
ogy application. How does Microsoft, or some other 
software vendor, want "dialog box," "text box," 
and "menu box" to be translated in their man- 
uals? Considerable time is spent on terminology 
questions, many of which have already been solved 
by other translators working on similar texts. It 
ought to be possible for a translator to point at 
an instance of "dialog box" in the English version 
of the Microsoft Windows manual and see how it 
was translated in the French version of the same 
manual. Alternatively, the translator can ask for a 
bilingual concordance as shown in Figure 1. A PC- 
based terminology reuse tool is being developed to 
do exactly this. The tool depends crucially
on the results of an alignment program to deter- 
mine which parts of the source text correspond with 
which parts of the target text. 
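As a rough illustration of the kind of lookup such a tool performs, the following sketch builds concordance lines with glosses from a word-level alignment. It is not the tool described here: the function name, the dictionary representation of the alignment, and the toy sentence pair are all assumptions made for the example.

```python
from typing import Dict, List

def concordance(english: List[str], french: List[str],
                alignment: Dict[int, int], keyword: str,
                width: int = 3) -> List[List[str]]:
    """Return one [english_line, gloss_line] pair per occurrence of keyword.

    alignment maps an English token index to a French token index
    (a hypothetical representation of word-level alignment output).
    """
    rows = []
    for pos, word in enumerate(english):
        if word.lower() != keyword:
            continue
        lo = max(0, pos - width)
        hi = min(len(english), pos + width + 1)
        eng = [english[i] for i in range(lo, hi)]
        # Gloss: the aligned French word, or "_" when the token is unaligned.
        gloss = [french[alignment[i]] if i in alignment else "_"
                 for i in range(lo, hi)]
        rows.append([" ".join(eng), " ".join(gloss)])
    return rows

# Toy data, invented for the example.
english = "the dialog box closes and the command is carried out".split()
french = "la boite de dialogue se ferme et le programme execute la commande".split()
alignment = {1: 3, 2: 1, 3: 5, 6: 11}

for eng_line, gloss_line in concordance(english, french, alignment, "box"):
    print(eng_line)
    print(gloss_line)
```

The glosses are only as good as the alignment that supplies them, which is why the quality of the alignment program matters so much for this application.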
In working with the translators at AT&T Lan- 
guage Line Services, a commercial translation ser- 
vice, we discovered that we needed to completely 
redesign our alignment programs in order to deal 
more effectively with texts supplied by Language 
Line's customers. All too often the texts are not 
available in electronic form, and may need to be 
scanned in and processed by an OCR (optical char- 
acter recognition) device. Even if the texts are 
available in electronic form, it may not be worth 
the effort to clean them up by hand. Real texts are 
not like the Hansards; real texts are much smaller
and not nearly as clean as the ideal texts that have
been used in previous studies.

(1) English: displayed . In the Save As dialog box , this area is called Save
    Glosses: afficha Dana Enregistrer Enregistrer dialogue boite cette zone est Enregistr
    French:  ainsi que son extension . Dana la boite de dialogue Enregistrer sous, cette zone eat appele

(2) English: x When you choose a command button , the dialog box closes and the command is
    Glosses: Lorsque commande bouton dialogue boite ferme commande execute
    French:  sissez un bouton de commande , la boite de dialogue se ferme et le programme execute la tom

(3) English: button . Dr doubl - lick the Control - menu box . Or press ESC . If a dialog box d
    Glosses: r bouton cliquer lois Systeme menu case Si dialogue boite p
    French:  ouvez aussi cliquer deux lois sur la case du menu Systeme . II eat egalement possible d ' a

(4) English: When you move to an empty text box , an iiisertion point ( flastung ve
    Glosses: Lorsque placez texte zone insertion (
    French:  de Lorsque vous vous placez darts une zone de texte vide , un point d ' insertion ( barre vertic

Figure 1: A small sample of a bilingual concordance, based on the output of word_align. Four concordances
for the word "box" are shown, identifying three different translations for the word: boite, case, zone. The
concordances are selected from English and French versions of the Microsoft Windows manual (with some
errors introduced by OCR). There are three lines of text for each instance of "box": (1) English, (2) glosses,
and (3) French. The glosses are selected from the French text (the third line), and are written underneath
the corresponding English words, as identified by word_align.
To deal with these robustness issues, Church 
(1993) developed a character-based alignment 
method called char_align. The method was in- 
tended as a replacement for sentence-based meth- 
ods (e.g., (Brown et al., 1991a; Gale and Church, 
1991b; Kay and Rosenschein, 1993)), which are 
very sensitive to noise. This paper describes a 
new program, called word_align, that starts with 
an initial "rough" alignment (e.g., the output of 
char_align or a sentence-based alignment method), 
and produces improved alignments by exploiting 
constraints at the word level. The alignment algorithm
consists of two steps: (1) estimate translation
probabilities, and (2) use these probabilities
to search for the most probable alignment path. The
two steps are described in the following section. 
2 The Alignment Algorithm
2.1 Estimation of translation 
probabilities 
The translation probabilities are estimated using a 
method based on Brown et al.'s Model 2 (1993), 
which is summarized in the following subsection, 
2.1.1. Then, in subsection 2.1.2, we describe 
modifications that achieve three goals: (1) en- 
able word_align to accept input which may not be 
aligned by sentence (e.g. char_align's output), (2) 
reduce the number of parameters that need to be 
estimated, and (3) prepare the ground for the sec- 
ond step, the search for the best alignment (de- 
scribed in section 2.2). 
2.1.1 Brown et al.'s Model 
In the context of their statistical machine trans- 
lation project (Brown et al., 1990), Brown et al. 
estimate $\Pr(f \mid e)$, the probability that f, a sentence
in one language (say French), is the translation of
e, a sentence in the other language (say English).
$\Pr(f \mid e)$ is computed using the concept of alignment,
denoted by a, which is a set of connections between
each French word in f and the corresponding English
word in e. A connection, which we will write
as $con^{f,e}_{j,i}$, specifies that position j in f is connected
to position i in e. If a French word in f does not
correspond to any English word in e, then it is
connected to the special word null (position 0 in
e). Notice that this model is directional, as each 
French position is connected to exactly one posi- 
tion in the English sentence (which might be the 
null word), and accordingly the number of connec- 
tions in an alignment is equal to the length of the 
French sentence. However, an English word may be 
connected to several words in the French sentence, 
or not connected at all. 
Using alignments, the translation probability 
for a pair of sentences is expressed as 
\[ \Pr(f \mid e) = \sum_{a \in \mathcal{A}} \Pr(f, a \mid e) \tag{1} \]
where $\mathcal{A}$ is the set of all combinatorially possible
alignments for the sentences f and e (calligraphic
font will be used to denote sets).
In their paper, Brown et al. present a series of 
5 models of $\Pr(f \mid e)$. The first two of these 5 models
are summarized here. 
Model 1 
Model 1 assumes that $\Pr(f, a \mid e)$ depends primarily
on $t(f \mid e)$, the probability that an occurrence
of the English word e is translated as the French
word f. That is,
\[ \Pr(f \mid e) = \sum_{a \in \mathcal{A}} \Pr(f, a \mid e) = \sum_{a \in \mathcal{A}} C_{f,e} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \tag{2} \]
where $C_{f,e}$, an irrelevant constant, accounts for
certain dependencies on sentence lengths, which
are not important for our purposes here. Except
for $C_{f,e}$, most of the notation is borrowed from
Brown et al. The variable j is used to refer to a
position in a French sentence, and the variable i
is used to refer to a position in an English sentence.
The expression $f_j$ is used to refer to the French
word in position j of a French sentence, and $e_i$ is
used to refer to the English word in position i of
an English sentence. An alignment, a, is a set of
pairs (j, i), each of which connects a position in a
French sentence with a corresponding position in
an English sentence. The expression $a_j$ is used
to refer to the English position that is connected
to the French position j, and the expression $e_{a_j}$
is used to refer to the English word in position $a_j$.
The variable m is used to denote the length of
the French sentence and the variable l is used to
denote the length of the English sentence.
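The sum over alignments in equation 2 can be checked numerically on a toy example. The sketch below enumerates every alignment of a two-word French sentence against a two-word English sentence plus the null word, and compares the result with the factored form $\prod_j \sum_i t(f_j \mid e_i)$, which is equivalent because each French position chooses its English position independently. The constant $C_{f,e}$ is taken to be 1, and the words and probability values are invented for the example.

```python
from itertools import product
from math import isclose, prod

def model1_sum_over_alignments(f_words, e_words, t):
    """Sum over all alignments a of prod_j t(f_j | e_{a_j}) (equation 2, C = 1)."""
    positions = range(len(e_words))  # position 0 holds the null word
    total = 0.0
    for a in product(positions, repeat=len(f_words)):
        total += prod(t[(f, e_words[i])] for f, i in zip(f_words, a))
    return total

def model1_factored(f_words, e_words, t):
    """Equivalent closed form: prod_j sum_i t(f_j | e_i)."""
    return prod(sum(t[(f, e)] for e in e_words) for f in f_words)

# Toy values, invented for the example.
e_words = ["<null>", "the", "box"]
f_words = ["la", "boite"]
t = {("la", "<null>"): 0.1, ("la", "the"): 0.7, ("la", "box"): 0.1,
     ("boite", "<null>"): 0.1, ("boite", "the"): 0.1, ("boite", "box"): 0.8}

n_alignments = len(e_words) ** len(f_words)  # (l + 1) ** m
brute = model1_sum_over_alignments(f_words, e_words, t)
closed = model1_factored(f_words, e_words, t)
print(n_alignments, brute, closed)
```

The number of alignments grows as $(l+1)^m$, so the brute-force enumeration is only practical for very short sentences; the factored form is what makes the estimation tractable.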
There are quite a number of constraints that 
could be used to estimate $\Pr(f, a \mid e)$. Model 1 depends
primarily on the translation probabilities,
$t(f \mid e)$, and does not make use of constraints
involving the positions within an alignment. These
constraints will be exploited in Model 2. 
Brown et al. estimate $t(f \mid e)$ on the basis of a
training set, a set of English and French sentences
that have been aligned at the sentence level. Those
values of $t(f \mid e)$ that maximize the probability of
the training set are called the maximum likelihood
estimates. Brown et al. show that the maximum
likelihood estimates satisfy
\[ t(f \mid e) = \frac{\sum_{con^{f,e}_{j,i} \in \mathcal{CON}_{f,e}} \Pr(con^{f,e}_{j,i})}{\sum_{con^{f,e}_{j,i} \in \mathcal{CON}_{\cdot,e}} \Pr(con^{f,e}_{j,i})} \tag{3} \]
where $\mathcal{CON}_{f,e}$ and $\mathcal{CON}_{\cdot,e}$ denote sets of connections:
the set $\mathcal{CON}_{f,e}$ contains all connections
in the training data between f and e, and the
set $\mathcal{CON}_{\cdot,e}$ contains all connections between some
French word and e. The probability of a connection,
$con^{f,e}_{j,i}$, is the sum of the probabilities of all
alignments that contain it. Notice that equation
3 satisfies the constraint $\sum_{f} t(f \mid e) = 1$ for each
English word e.
It follows from the definition of Model 1 that
the probability of a connection satisfies:
\[ \Pr(con^{f,e}_{j,i}) = \frac{t(f_j \mid e_i)}{\sum_{k=0}^{l} t(f_j \mid e_k)} \tag{4} \]
Recall that $f_j$ refers to the French word in position
j of the French sentence f of length m, and that
$e_i$ refers to the English word in position i of the
English sentence e of length l. Also, remember
that position 0 is reserved for the null word.
Equations 3 and 4 are used iteratively to estimate
$t(f \mid e)$. That is, we start with an initial guess
for $t(f \mid e)$. We then evaluate the right hand side
of equation 4, and compute the probability of the
connections in the training set. Then we evaluate
equation 3, obtain new estimates for the translation
probabilities, and repeat the process until it
converges. This iterative process is known as the
EM algorithm and has been shown to converge to
a stationary point (Baum, 1972; Dempster et al.,
1977). Moreover, Brown et al. show that Model
1 has a unique maximum, and therefore, in this
special case, the EM algorithm is guaranteed to
converge to the maximum likelihood solution, and
does not depend on the initial guess.
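The iteration between equations 4 and 3 can be sketched in a few lines. The code below is a minimal illustration on a two-sentence toy corpus, not the implementation of Brown et al.; the function name, the uniform initial guess, the fixed number of iterations, and the toy sentences are assumptions made for the example.

```python
from collections import defaultdict

NULL = "<null>"

def train_model1(pairs, iterations=10):
    """EM for Model 1. pairs: list of (french_words, english_words)."""
    # Position 0 of every English sentence is reserved for the null word.
    pairs = [(f, [NULL] + e) for f, e in pairs]
    f_vocab = {w for f, _ in pairs for w in f}
    # Uniform initial guess for t(f | e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of connections f-e
        total = defaultdict(float)  # expected number of connections to e
        for f_sent, e_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)  # denominator of eq. 4
                for e in e_sent:
                    p = t[(f, e)] / norm               # equation 4
                    count[(f, e)] += p
                    total[e] += p
        # Equation 3: renormalize the expected counts for each English word.
        t = defaultdict(float,
                        {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy corpus, invented for the example.
pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]
t = train_model1(pairs)
print(round(t[("la", "the")], 3), round(t[("maison", "house")], 3))
```

On this corpus "la" co-occurs with "the" in both sentences, so its estimated probability given "the" rises above that of "maison" or "fleur" after the first iteration.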
Model 2 
Model 2 improves upon Model 1 by making use
of the positions within an alignment. For instance,
it is much more likely that the first word of an English
sentence will be connected to a word near the
beginning of the corresponding French sentence,
than to some word near the end of the French sentence.
Model 2 enhances Model 1 with the assumption
that the probability of a connection, $con^{f,e}_{j,i}$,
depends also on j and i (the positions in f and
e), as well as on m and l (the lengths of the two
sentences). This dependence is expressed through
the term $a(i \mid j, m, l)$, which denotes the probability
of connecting position j in a French sentence of
length m with position i in an English sentence
of length l. Since each French position is connected
to exactly one English position, the constraint
$\sum_{i=0}^{l} a(i \mid j, m, l) = 1$ should hold for all j,
m and l. In place of equation 2, we now have:
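\[ \Pr(f \mid e) = \sum_{a \in \mathcal{A}} \Pr(f, a \mid e) = \sum_{a \in \mathcal{A}} C_{f,e} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \, a(a_j \mid j, m, l) \tag{5} \]
where $C_{f,e}$ is an irrelevant constant.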
As in Model 1, equation 3 holds for the maximum
likelihood estimates of the translation probabilities.
The corresponding equation for the maximum
likelihood estimates of $a(i \mid j, m, l)$ is:
\[ a(i \mid j, m, l) = \frac{\sum_{con^{f,e}_{j,i} \in \mathcal{CON}^{m,l}_{j,i}} \Pr(con^{f,e}_{j,i})}{\sum_{con^{f,e}_{j,i} \in \mathcal{CON}^{m,l}_{j,\cdot}} \Pr(con^{f,e}_{j,i})} \tag{6} \]
where $\mathcal{CON}^{m,l}_{j,i}$ denotes the set of connections in the
training data between positions j and i in French
and English sentences of lengths m and l, respectively.
Similarly, $\mathcal{CON}^{m,l}_{j,\cdot}$ denotes the set of connections
between position j and some English position,
in sentences of these lengths.
Instead of equation 4, we obtain the following
equation for the probability of a connection:
\[ \Pr(con^{f,e}_{j,i}) = \frac{t(f_j \mid e_i) \cdot a(i \mid j, m, l)}{\sum_{k=0}^{l} t(f_j \mid e_k) \cdot a(k \mid j, m, l)} \tag{7} \]
Notice that Model 1 is a special case of Model 2,
where $a(i \mid j, m, l)$ is held fixed at $\frac{1}{l+1}$.
As before, the EM algorithm is used to compute
maximum likelihood estimates for $t(f \mid e)$ and
$a(i \mid j, m, l)$ (using first equation 7, and then equations
3 and 6). However, in this case, Model 2
does not have a unique maximum, and therefore
the results depend on the initial guesses. Brown
et al. therefore use Model 1 to obtain estimates for
$t(f \mid e)$ which do not depend on the initial guesses.
These values are then used as the initial guesses of
$t(f \mid e)$ in Model 2.
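The relation between equations 4 and 7 can be made concrete with a small sketch. The function below computes the connection probabilities of equation 7 for a single French position; passing a uniform $a(i \mid j, m, l) = 1/(l+1)$ reproduces the Model 1 values of equation 4. The function name and all numeric values are invented for the example.

```python
def connection_posteriors(f_word, e_sent, t, a_j):
    """Equation 7 for one French position j.

    e_sent[0] is the null word; a_j[i] stands for a(i | j, m, l).
    Returns Pr(con_{j,i}) for every English position i.
    """
    scores = [t.get((f_word, e), 0.0) * a_j[i] for i, e in enumerate(e_sent)]
    norm = sum(scores)
    return [s / norm for s in scores]

# Toy values, invented for the example.
e_sent = ["<null>", "the", "box"]
t = {("boite", "<null>"): 0.1, ("boite", "the"): 0.1, ("boite", "box"): 0.8}

# Model 1 as a special case: a(i | j, m, l) fixed at 1 / (l + 1).
uniform = [1.0 / len(e_sent)] * len(e_sent)
model1 = connection_posteriors("boite", e_sent, t, uniform)

# A position-sensitive distribution changes the result.
biased = [0.1, 0.6, 0.3]
model2 = connection_posteriors("boite", e_sent, t, biased)
print(model1, model2)
```

With the uniform distribution the position term cancels in the ratio, which is why Model 1 needs no position parameters at all.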
2.1.2 Our model 
As mentioned in the introduction, we are interested 
in aligning corpora that are smaller and noisier 
than the Hansards. This implies severe practical 
constraints on the word alignment algorithm. As 
mentioned earlier, we chose to start with the out- 
put of char_align because it is more robust than al- 
ternative sentence-based methods. This choice, of 
course, requires certain modifications to the model 
of Brown et al. to accommodate as input an initial 
rough alignment (such as produced by char_align) 
instead of pairs of aligned sentences. It is also 
useful to reduce the number of parameters that we 
are trying to estimate, because we have much less 
data and much more noise. The paragraphs below 
describe our modifications which are intended to 
meet these somewhat different requirements. The 
two major modifications are: (a) replacing the 
sentence-by-sentence alignment with a single global 
alignment for the entire corpus, and (b) replacing 
the set of probabilities $a(i \mid j, m, l)$ with a small set
of offset probabilities. 
Word_align starts with an initial rough align- 
ment, I, which maps French positions to English 
positions (if the mapping is partial, we use linear 
extrapolation to make it complete). Our goal is to 
find a global alignment, A, which is more accurate 
than I. To achieve this goal, we first use I to deter- 
mine which connections will be considered for A. 
Let $con_{j,i}$ denote a connection between position j
in the French corpus and position i in the English
corpus (the superscripts in $con^{f,e}_{j,i}$ are omitted, as
there is no notion of sentences). We assume that
$con_{j,i}$ is a possible connection only if i falls within a
limited window which is centered around I(j), such
that:
I(j)- w < i < I(j) + w (8) 
where w is a predetermined parameter specifying 
the size of the window (we typically set w to 20 
words). Connections that fall outside this window 
are assumed to have a zero probability. This as- 
sumption replaces the assumption of Brown et al. 
that connections which cross boundaries of aligned 
sentences have a zero probability. In this new 
framework, equation 3 becomes: 
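\[ t(f \mid e) = \frac{\sum_{con_{j,i} \in \mathcal{CON}_{f,e}} \Pr(con_{j,i})}{\sum_{con_{j,i} \in \mathcal{CON}_{\cdot,e}} \Pr(con_{j,i})} \tag{9} \]
where $\mathcal{CON}_{f,e}$ and $\mathcal{CON}_{\cdot,e}$ are taken from the set
of possible connections, as defined by (8).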
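The construction of the set of possible connections can be sketched as follows. The code completes a partial rough alignment by linear interpolation between anchor points (extending the first and last segments beyond the anchors) and then lists the English positions allowed for each French position. The representation of the rough alignment as a sorted list of anchor pairs, the rounding to whole positions, and the treatment of both window endpoints as allowed are assumptions made for the example.

```python
from bisect import bisect_right

def complete_alignment(anchors, n_french):
    """Extend a partial rough alignment to every French position.

    anchors: sorted list of (french_pos, english_pos) pairs with at least
    two distinct French positions. Returns a list I with I[j] for all j.
    """
    js = [j for j, _ in anchors]
    completed = []
    for j in range(n_french):
        # Pick the segment around j; the end segments are extended linearly.
        k = min(max(bisect_right(js, j), 1), len(anchors) - 1)
        (j0, i0), (j1, i1) = anchors[k - 1], anchors[k]
        completed.append(round(i0 + (j - j0) * (i1 - i0) / (j1 - j0)))
    return completed

def candidate_connections(I, n_english, w=20):
    """English positions within w words of I(j), clipped to the corpus."""
    return [range(max(0, I[j] - w), min(n_english, I[j] + w + 1))
            for j in range(len(I))]

# Toy rough alignment: two anchor points, invented for the example.
I = complete_alignment([(0, 0), (10, 20)], n_french=12)
windows = candidate_connections(I, n_english=25, w=2)
print(I[5], list(windows[5]))
```

Restricting each French position to at most 2w + 1 candidates keeps the cost of every EM iteration linear in the corpus length.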
Turning to Model 2, the parameters of the form
$a(i \mid j, m, l)$ are somewhat more problematic. First,
since there are no sentence boundaries, there are no
direct equivalents for i, j, m and l. Secondly, there
are too many parameters to be estimated, given the
limited size of our corpora (one parameter for each
combination of i, j, m and l). Fortunately, these
parameters are highly redundant. For example, it
is likely that $a(i \mid j, m, l)$ will be very close to
$a(i+1 \mid j+1, m, l)$ and $a(i \mid j, m+1, l+1)$.
In order to deal with these concerns, we replace
probabilities of the form $a(i \mid j, m, l)$ with a
small set of offset probabilities. We use k to denote
the offset between i, an English position which corresponds
to the French position j, and the English
position which the input alignment I connects to
j: $k = i - I(j)$. An offset probability, $o(k)$, is the
probability of having an offset k for some arbitrary
connection. According to (8), k ranges between
-w and w. Thus, instead of equation 6, we have
\[ o(k) = \frac{\sum_{con_{j,i} \in \mathcal{CON}_{k}} \Pr(con_{j,i})}{\sum_{con_{j,i} \in \mathcal{CON}} \Pr(con_{j,i})} \tag{10} \]
where $\mathcal{CON}$ is the set of all connections and $\mathcal{CON}_{k}$
is the set of all connections with offset k. Instead
of equation 7, we have
\[ \Pr(con_{j,i}) = \frac{t(f_j \mid e_i) \cdot o(i - I(j))}{\sum_{h=I(j)-w}^{I(j)+w} t(f_j \mid e_h) \cdot o(h - I(j))} \tag{11} \]
The last three equations are used in the EM 
algorithm in an iterative fashion as before to es- 
timate the translation probabilities and the offset 
probabilities. Table 1 and Figure 2 show some val- 
ues that were estimated in this way. The input 
consisted of a pair of Microsoft Windows manu- 
als in English (125,000 words) and its equivalent in 
French (143,000 words). Table 1 shows four French 
words and the four most likely translations, sorted 
by $t(e \mid f)$ (footnote 1). Note that the correct translation(s) are
usually near the front of the list, though there is a 
tendency for the program to be confused by collo- 
cates such as "information about". Figure 2 shows 
the probability estimates for offsets from the ini- 
tial alignment I. Note that smaller offsets are more 
likely than larger ones, as we would expect. More- 
over, the distribution is reasonably close to normal, 
as indicated by the dotted line, which was gener- 
ated by a Gaussian with a mean of 0 and standard 
deviation of 10 (footnote 2).
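One iteration of the estimation procedure of equations 9, 10 and 11 can be sketched as follows. This is an illustration under stated assumptions rather than the word_align implementation: the uniform initial values, the data structures, the inclusive window, and the tiny toy corpus are all invented for the example, and the filters described in the text are left out.

```python
from collections import defaultdict

def window(I_j, n_english, w):
    """English positions between I(j) - w and I(j) + w, clipped to the corpus."""
    return range(max(0, I_j - w), min(n_english, I_j + w + 1))

def init_params(french, english, I, w):
    """Uniform starting values (an assumption made for this sketch)."""
    n_f = len(set(french))
    t = {(f, english[i]): 1.0 / n_f
         for j, f in enumerate(french)
         for i in window(I[j], len(english), w)}
    o = {k: 1.0 / (2 * w + 1) for k in range(-w, w + 1)}
    return t, o

def em_iteration(french, english, I, t, o, w):
    """Return re-estimated (t, o) after one pass over the corpus."""
    t_count = defaultdict(float)
    e_total = defaultdict(float)
    o_count = defaultdict(float)
    for j, f in enumerate(french):
        scores = {i: t.get((f, english[i]), 0.0) * o.get(i - I[j], 0.0)
                  for i in window(I[j], len(english), w)}
        norm = sum(scores.values())
        if norm == 0.0:
            continue  # no possible connection for this French position
        for i, s in scores.items():
            p = s / norm                        # equation 11
            t_count[(f, english[i])] += p
            e_total[english[i]] += p
            o_count[i - I[j]] += p
    new_t = {fe: c / e_total[fe[1]] for fe, c in t_count.items()}  # eq. 9
    o_total = sum(o_count.values())
    new_o = {k: c / o_total for k, c in o_count.items()}           # eq. 10
    return new_t, new_o

# Toy corpus and rough alignment, invented for the example.
french = ["la", "boite", "la", "zone"]
english = ["the", "box", "the", "area"]
I = [0, 1, 2, 3]
w = 1
t, o = init_params(french, english, I, w)
for _ in range(5):
    t, o = em_iteration(french, english, I, t, o, w)
print(sorted(o.items()))
```

A corpus of four words is far too small to yield meaningful estimates; the example only shows how the expected counts are accumulated and renormalized.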
We have found it useful to make use of three fil- 
ters to deal with robustness issues. Empirically, we 
found that both high frequency and low frequency 
words caused difficulties and therefore connections 
involving these words are filtered out. The thresholds
are set to exclude the most frequent function
words and punctuation, as well as words with fewer
than 3 occurrences. In addition, following a similar
filter by Brown et al., small values of $t(f \mid e)$ are set
to 0 after each iteration of the EM algorithm because
these small values often correspond to inappropriate
translations. Finally, connections to null
are ignored. Such connections model French words 
that are often omitted in the English translation. 
However, because of OCR errors and other sources 
of noise, it was decided that this phenomenon was 
too difficult to model. 
Some words will not be aligned because of these 
heuristics. It may not be necessary, however, to 
align all words in order to meet the goal of help- 
ing translators (and lexicographers) with difficult 
terminology. 
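The first two filters can be sketched as simple functions over word counts and over the translation table. The function names and the default thresholds are placeholders chosen for the example, and the sketch does not renormalize the table after pruning, since the text does not say whether that is done.

```python
from collections import Counter

def frequency_filter(tokens, min_count=3, max_count=1700):
    """Return the word types whose frequency lies within the thresholds."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if min_count <= c <= max_count}

def prune_small(t, threshold=0.005):
    """Drop (that is, set to 0) translation probabilities below the threshold."""
    return {pair: p for pair, p in t.items() if p >= threshold}

# Toy data, invented for the example.
tokens = ["the"] * 5 + ["rare"] * 2 + ["box"] * 3
kept_types = frequency_filter(tokens, min_count=3, max_count=4)
pruned = prune_small({("boite", "box"): 0.9, ("la", "box"): 0.001})
print(kept_types, pruned)
```

In an EM loop, `prune_small` would be applied to the translation table at the end of every iteration, and only connections between kept types would be considered.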
2.2 Finding the most probable 
alignment 
The EM algorithm produces two sets of maximum
likelihood probability estimates: translation
probabilities, $t(f \mid e)$, and offset probabilities, $o(k)$.
Brown et al. select their preferred alignment simply
by choosing the most probable alignment according
to the maximum likelihood probabilities, relative to
the given sentence alignment. In the terms of our
model, it is necessary to select the alignment A
that maximizes:
\[ \prod_{con_{j,i} \in A} t(f_j \mid e_i) \cdot o(i - I(j)) \tag{12} \]
Unfortunately, this method does not model the dependence
between connections for French words
that are near one another. For example, the fact
that the French position j was connected to the
English position i will not increase the probability
that j + 1 will be connected to an English position
near i. The absence of such dependence can easily
confuse the program, mainly in aligning adjacent
occurrences of the same word, which are common
in technical texts. Brown et al. introduce such dependence
in their Model 4. We have selected a
simpler alternative defined in terms of offset probabilities.

Footnote 1: In this example, French is used as the source
language and English as the target.
Footnote 2: The center of the estimated distribution seems
flatter than in a normal distribution. This might
be explained by a higher tendency for local changes
of word order within phrases than for order changes
among phrases. This is merely a hypothesis, though,
which requires further testing.
2.2.1 Determining the set of relevant 
connections 
The first step in finding the most probable align- 
ment is to determine the relevant connections for 
each French position. Relevant connections are required
to be reasonably likely, that is, their translation
probability $t(f \mid e)$ should exceed some minimal
threshold. Moreover, they are required to fall
within a window between I(j) - w and I(j) + w in 
the English corpus, as in the previous step (param- 
eter estimation). We call a French position relevant 
if it has at least one relevant connection. Each 
alignment A then consists of exactly one connec- 
tion for each relevant French position (the irrele- 
vant positions are ignored). 
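The selection of relevant connections can be sketched as a filter over the window of each French position. The function name, the dictionary representation of the translation table, and the toy values are assumptions made for the example.

```python
def relevant_connections(french, english, I, t, w=20, min_t=0.005):
    """Map each relevant French position j to its relevant connections.

    Each connection is a pair (i, t(f_j | e_i)) with i inside the window
    around I(j) and a translation probability above min_t. Positions with
    no such connection are irrelevant and do not appear in the result.
    """
    relevant = {}
    for j, f in enumerate(french):
        lo = max(0, I[j] - w)
        hi = min(len(english) - 1, I[j] + w)
        options = [(i, t.get((f, english[i]), 0.0)) for i in range(lo, hi + 1)]
        options = [(i, p) for i, p in options if p > min_t]
        if options:
            relevant[j] = options
    return relevant

# Toy values, invented for the example.
french = ["la", "boite"]
english = ["the", "box"]
t = {("boite", "box"): 0.8, ("la", "box"): 0.001}
print(relevant_connections(french, english, [0, 1], t, w=1))
```

Here "la" has no connection above the threshold, so position 0 is irrelevant and is ignored in the search that follows.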
2.2.2 Determining the most probable 
alignment 
To model the dependency between connections in
an alignment, we assume that the offset of a connection
is determined relative to the preceding connection
in A, instead of relative to the initial alignment,
I. For this purpose, we define A'(j) as a linear
extrapolation from the preceding connection in
A:
\[ A'(j) = A(j_{prev}) + (j - j_{prev}) \frac{N_E}{N_F} \tag{13} \]
where $j_{prev}$ is the last French position before j
which is aligned by A, and $N_E$ and $N_F$ are the
lengths of the English and French corpora. A'(j)
thus predicts the connection of j, knowing the connection
of $j_{prev}$ and assuming that the two languages
have the same word order. Instead of (12),
the most probable alignment maximizes
\[ \prod_{con_{j,i} \in A} t(f_j \mid e_i) \cdot o(i - A'(j)) \tag{14} \]
French word     English translations (with probabilities)
zone            box (0.58)          area (0.28)    want (0.04)    In (0.02)
fermer          close (0.44)        when (0.08)    Close (0.07)   selected (0.06)
informations    information (0.66)  about (0.15)   For (0.12)     see (0.04)
insertion       insertion (0.61)    point (0.23)   Edit (0.06)    To (0.05)
Table 1: Estimated translation probabilities

[Plot omitted. Horizontal axis: offset, from -20 to 20.]
Figure 2: Estimated offset probabilities (solid line) along with a Gaussian (dashed line) for comparison.
We approximate the offset probabilities, $o(k)$, relative
to A', using the maximum likelihood estimates
which were computed relative to I (as described in
Section 2.1.2).
We use a dynamic programming algorithm to
find the most probable alignment. This enables
us to know the value $A(j_{prev})$ when dealing with
position j. To avoid connections with very low
probability (due to a large offset) we require that
$t(f_j \mid e_i) \cdot o(i - A'(j))$ exceed a pre-specified threshold
T (footnote 3). If the threshold is not exceeded, the
connection is dropped from the alignment, and
$t(f_j \mid e_i) \cdot o(i - A'(j))$ for that connection is set to
T when computing (14). T can therefore be interpreted
as a global setting of the probability that
a random position will be connected to the null
English word (footnote 4). A similar dynamic programming
approach was used by Gale and Church for word
alignment (Gale and Church, 1991a), to handle
dependency between connections.

Footnote 3: In fact, the threshold on $t(f_j \mid e_i)$, which is used to
determine the relevant connections (described in the
previous subsection), is used just as an efficient early
application of the threshold T. This early application
is possible when $t(f_j \mid e_i) \cdot o(k_{max}) < T$, where $k_{max}$ is
the value of k with maximal $o(k)$.
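The search can be sketched as dynamic programming over states that record the last kept connection, since by (13) the score of every later connection depends only on that one. The code below is an illustration, not the word_align implementation. Several choices are assumptions: dropping a position at cost T is always offered as an option, the first position is predicted from I(j), offsets are rounded to whole words, and whole paths are stored in the table instead of back-pointers.

```python
import math

def offer(beam, state, logp, path):
    """Keep only the best-scoring path that ends in each state."""
    if state not in beam or logp > beam[state][0]:
        beam[state] = (logp, path)

def best_alignment(candidates, o, I, ratio, T):
    """Maximize (14) over alignments.

    candidates: list of (j, [(i, t_ji), ...]) for the relevant French
    positions in increasing order of j. o: offset probabilities.
    ratio: N_E / N_F. T: threshold, also the score of a dropped position.
    Returns the list of kept connections (j, i).
    """
    # State: the last kept connection (j_prev, i_prev), or None if none yet.
    beam = {None: (0.0, [])}
    for j, options in candidates:
        new_beam = {}
        for state, (logp, path) in beam.items():
            if state is None:
                predicted = I[j]                           # no predecessor yet
            else:
                j_prev, i_prev = state
                predicted = i_prev + (j - j_prev) * ratio  # equation 13
            # Option 1: drop position j, scoring it as T.
            offer(new_beam, state, logp + math.log(T), path)
            # Option 2: connect j to one of its relevant English positions.
            for i, t_ji in options:
                score = t_ji * o.get(round(i - predicted), 0.0)
                if score > T:
                    offer(new_beam, (j, i), logp + math.log(score),
                          path + [(j, i)])
        beam = new_beam
    return max(beam.values(), key=lambda v: v[0])[1]

# Two adjacent occurrences of the same word; toy values for the example.
o = {-1: 0.2, 0: 0.6, 1: 0.2}
candidates = [(0, [(0, 0.5), (1, 0.5)]), (1, [(0, 0.5), (1, 0.5)])]
print(best_alignment(candidates, o, I=[0, 0], ratio=1.0, T=0.01))
```

In the toy example the rough alignment maps both French positions to English position 0, so scoring offsets relative to I would connect both to the same English word; scoring relative to the preceding connection spreads them over the two adjacent occurrences.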
3 Evaluation 
Word_align was first evaluated on a representative 
sample of Canadian Hansards (160,000 words in 
English and French). The sample was kindly pro- 
vided by Simard et al., along with alignments of 
sentence boundaries as determined by their panel 
of 8 judges (Simard et al., 1992). 
Ten iterations of the EM algorithm were computed
to estimate the parameters of the model.
The window size was set to 20 words in each direction,
and the minimal threshold for $t(f \mid e)$ was
set to 0.005. We considered connections whose
source and target words had frequencies between 3
and 1700 (1700 is the highest frequency of a content
word in the corpus; we thus excluded as many
function words as possible, but no content words).
In this experiment, we used French as the source
language and English as the target language.

Footnote 4: As mentioned earlier, we do not directly estimate
translation probabilities for the null English word.
Figure 3 presents the alignment error rate of
word_align. It is compared with the error rate of
word_align's input, i.e. the initial rough alignment
which is produced by char_align. The errors are
sampled at sentence boundaries, and are measured
as the relative distance between the output of the
alignment program and the "true" alignment, as
defined by the human judges (footnote 5). The histograms
present errors in the range of -20 to 20, which covers
about 95% of the data (footnote 6). It can be seen that
word_align decreases the error rate significantly
(notice the different scales of the vertical axes). In
55% of the cases, there is no error in word_align's
output (distance of 0), in 73% the distance from
the correct alignment is at most 1, and in 84% the
distance is at most 3.
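The error statistics reported here can be computed along the following lines. The sketch assumes that the alignment has already been completed for every French position and that the reference alignment is given as pairs of French and English positions at sentence boundaries; both representations and the toy numbers are assumptions made for the example.

```python
def error_summary(alignment, true_boundaries, thresholds=(0, 1, 3)):
    """Errors at sentence boundaries, their mean square, and coverage.

    alignment: list mapping every French position to an English position.
    true_boundaries: list of (french_pos, english_pos) reference pairs.
    """
    errors = [alignment[j] - i for j, i in true_boundaries]
    mean_square = sum(e * e for e in errors) / len(errors)
    within = {d: sum(abs(e) <= d for e in errors) / len(errors)
              for d in thresholds}
    return errors, mean_square, within

# Toy values, invented for the example.
alignment = [0, 1, 2, 4, 4, 5]
truth = [(1, 1), (3, 3), (5, 8)]
errors, mean_square, within = error_summary(alignment, truth)
print(errors, mean_square, within)
```

Comparing the mean square error of two programs on the same reference pairs gives the kind of variance ratio quoted for word_align and char_align.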
A second evaluation of word_align was per- 
formed on noisy technical documents, of the type 
typically available for AT&T Language Line Ser- 
vices. We used the English and French versions of 
a manual of monitoring equipment (about 65,000 
words), both scanned by an OCR device. We sam- 
pled the English vocabulary with frequency be- 
tween three and 450 occurrences, the same vocabu- 
lary that was used for alignment. We sampled 100 
types from the top fifth by frequency of the vocabu- 
lary (quintile), 80 types from the second quintile, 60 
from the third, 40 from the fourth, and 20 from the 
bottom quintile. We used this stratified sampling 
because we wanted to make more accurate state- 
ments about our error rate by tokens than we would 
have obtained from random sampling, or even from 
equal weighting of the quintiles. After choosing the 
300 types from the vocabulary list, one token for 
each type was chosen at random from the corpus. 
By hand, the best corresponding position in the 
French version was chosen, to be compared with 
word_align's output.
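The sampling scheme can be sketched as follows. The code assumes that quintiles contain equal numbers of word types ranked by frequency, which the description above does not state explicitly; the function names, the fixed random seed, and the toy vocabulary are also invented for the example.

```python
import random
from collections import Counter

def stratified_sample(tokens, per_quintile=(100, 80, 60, 40, 20),
                      min_count=3, max_count=450, seed=0):
    """Sample word types from frequency quintiles of the vocabulary."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    # Vocabulary within the frequency thresholds, most frequent first.
    vocab = sorted((w for w, c in counts.items()
                    if min_count <= c <= max_count),
                   key=lambda w: (-counts[w], w))
    sample = []
    for q, n in enumerate(per_quintile):
        stratum = vocab[q * len(vocab) // 5:(q + 1) * len(vocab) // 5]
        sample.extend(rng.sample(stratum, min(n, len(stratum))))
    return sample

def pick_token(tokens, word, rng):
    """Choose one occurrence of word at random; returns its position."""
    return rng.choice([i for i, w in enumerate(tokens) if w == word])

# Toy corpus: ten types with frequencies 3 to 12, invented for the example.
tokens = [f"w{c}" for c in range(3, 13) for _ in range(c)]
sample = stratified_sample(tokens, per_quintile=(2, 1, 1, 1, 1))
rng = random.Random(1)
positions = [pick_token(tokens, w, rng) for w in sample]
print(sample)
```

Sampling more types from the frequent quintiles concentrates the manual checking on the words that account for most tokens, which is the stated motivation for the stratification.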
Table 2 summarizes the results of the second
experiment. The figures indicate the expected relative
frequency of each offset from the correct alignment.
This relative frequency was computed according
to the word frequencies in the stratified
sample. As shown in the table, for 60.5% of the tokens
the alignment is accurate, and in 84% the offset
from the correct alignment is at most 3. These
figures demonstrate the usefulness of word_align for
constructing bilingual lexicons, and its impact on
the quality of bilingual concordances (as in Figure 1).
Indeed, using bilingual concordances which
are based on word_align's output, the translators at
AT&T Language Line Services are now producing
bilingual terminology lexicons at a rate of 60-100
terms per hour! This is compared with the previous
rate of about 30 terms per hour using char_align's
output, and a far lower rate before alignment
tools were available.

[Histograms omitted. Panels: "char_align errors (in words)" and "word_align errors (in words)"; horizontal axes from -20 to 20.]
Figure 3: Word_align reduces the variance (average
square error) by a factor of 5 over char_align alone
(notice the vertical scales).

Footnote 5: As explained earlier, word_align produces a partial
alignment. For the purpose of the evaluation, we used
linear interpolation to get alignments for all the positions
in the sample.
Footnote 6: Recall that the window size we used is 20 words
in each direction, which means that word_align cannot
recover from larger errors in char_align.
4 Conclusions 
Compared with other word alignment algorithms 
(Brown et al., 1993; Gale and Church, 1991a), 
word_align does not require sentence alignment as 
input, and was shown to produce useful align- 
ments for small and noisy corpora. Its robust- 
ness was achieved by modifying Brown et al.'s 
Model 2 to handle an initial "rough" alignment, 
reducing the number of parameters and introduc- 
ing a dependency between alignments of adjacent 
words. Taking the output of char_align as input,
word_align produces significantly better, word-level,
alignments on the kind of corpora that are
typically available to translators. This improvement
increased the rate of constructing bilingual
terminology lexicons at AT&T Language Line Services
by a factor of 2-3. In addition, the alignments
may also be helpful to developers of lexicons
for machine translation systems. Word_align thus
provides an example of how a model such as Brown
et al.'s Model 2, that was originally designed for
research in statistical machine translation, can be
modified to achieve practical, though less ambitious,
goals in the near term.

Offset from          Percentage    Cumulative
correct alignment                  percentage
0                    60.5%         60.5%
1                    10.8%         71.3%
2                    7.5%          78.8%
3                    5.2%          84%
4                    1.6%          85.6%
Table 2: Word_align's precision on noisy input,
scanned by an OCR device.

REFERENCES 
L. E. Baum. 1972. An inequality and an associated
maximization technique in statistical estimation
of probabilistic functions of a Markov
process. Inequalities, 3:1-8.
P. Brown, J. Cocke, S. Della Pietra,
V. Della Pietra, F. Jelinek, R. L. Mercer, and
P. S. Roossin. 1990. A statistical approach to
language translation. Computational Linguistics,
16(2):79-85.
P. Brown, J. Lai, and R. Mercer. 1991a. Aligning 
sentences in parallel corpora. In Proc. of the 
Annual Meeting of the ACL. 
P. Brown, S. Della Pietra, V. Della Pietra, and 
R. Mercer. 1991b. Word sense disambiguation 
using statistical methods. In Proc. of the Annual
Meeting of the ACL.
Peter Brown, Stephen Della Pietra, Vincent Della 
Pietra, and Robert Mercer. 1993. The mathe- 
matics of machine translation: parameter esti- 
mation. Computational Linguistics. to appear. 
Kenneth W. Church. 1993. Char_align: A program 
for aligning parallel texts at character level. In 
Proc. of the Annual Meeting of the ACL. 
A. P. Dempster, N. M. Laird, and D. B. Rubin. 
1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the 
Royal Statistical Society, 39(B):1-38. 
William Gale and Kenneth Church. 1991a. Identi- 
fying word correspondence in parallel text. In 
Proc. of the DARPA Workshop on Speech and 
Natural Language. 
William Gale and Kenneth Church. 1991b. A pro- 
gram for aligning sentences in bilingual cor- 
pora. In Proc. of the Annual Meeting of the 
ACL. 
William Gale, Kenneth Church, and David 
Yarowsky. 1992. Using bilingual materials 
to develop word sense disambiguation meth- 
ods. In Proc. of the International Conference 
on Theoretical and Methodological Issues in Machine
Translation.
P. Isabelle. 1992. Bi-textual aids for translators. 
In Proc. of the Annual Conference of the UW 
Center for the New OED and Text Research. 
M. Kay and M. Rosenschein. 1993. Text- 
translation alignment. Computational Linguis- 
tics. to appear. 
J. Klavans and E. Tzoukermann. 1990. The bicord 
system. In Proc. of COLING. 
Julian Kupiec. 1993. An algorithm for finding 
noun phrase correspondences in bilingual cor- 
pora. In Proc. of the Annual Meeting of the 
ACL. 
Thomas K. Landauer and Michael L. Littman. 
1990. Fully automatic cross-language docu- 
ment retrieval using latent semantic indexing. 
In Proc. of the Annual Conference of the UW 
Center for the New OED and Text Research. 
Yuji Matsumoto, Hiroyuki Ishimoto, Takehito Ut- 
suro, and Makoto Nagao. 1993. Structural 
matching of parallel texts. In Proc. of the Annual
Meeting of the ACL.
William Ogden and Margarita Gonzales. 1993.
Norm - a system for translators. Demonstra- 
tion at ARPA Workshop on Human Language 
Technology. 
V. Sadler. 1989. Working with analogical seman- 
tics: Disambiguation techniques in DLT. Foris 
Publications. 
M. Simard, G. Foster, and P. Isabelle. 1992. Us- 
ing cognates to align sentences in bilingual cor- 
pora. In Proc. of the International Conference 
on Theoretical and Methodological Issues in Machine
Translation.
Frank Smadja. 1992. How to compile a bilingual 
collocational lexicon automatically. In AAAI 
Workshop on Statistically-based Natural Lan- 
guage Processing Techniques, July. 
S. Warwick, J. Hajic, and G. Russell. 1990. Search- 
ing on tagged corpora: linguistically motivated 
concordance analysis. In Proc. of the Annual 
Conference of the UW Center for the New 
OED and Text Research. 
