ALIGNING SENTENCES IN BILINGUAL CORPORA USING 
LEXICAL INFORMATION 
Stanley F. Chen* 
Aiken Computation Laboratory 
Division of Applied Sciences 
Harvard University 
Cambridge, MA 02138 
Internet: sfc@calliope.harvard.edu 
Abstract 
In this paper, we describe a fast algorithm for 
aligning sentences with their translations in a 
bilingual corpus. Existing efficient algorithms ig- 
nore word identities and only consider sentence 
length (Brown el al., 1991b; Gale and Church, 
1991). Our algorithm constructs a simple statisti- 
cal word-to-word translation model on the fly dur- 
ing alignment. We find the alignment that maxi- 
mizes the probability of generating the corpus with 
this translation model. We have achieved an error 
rate of approximately 0.4% on Canadian Hansard 
data, which is a significant improvement over pre- 
vious results. The algorithm is language indepen- 
dent. 
1 Introduction 
In this paper, we describe an algorithm for align- 
ing sentences with their translations in a bilingual 
corpus. Aligned bilingual corpora have proved 
useful in many tasks, including machine transla- 
tion (Brown e/ al., 1990; Sadler, 1989), sense dis- 
ambiguation (Brown el al., 1991a; Dagan el at., 
1991; Gale el al., 1992), and bilingual lexicogra- 
phy (Klavans and Tzoukermann, 1990; Warwick 
and Russell, 1990). 
The task is difficult because sentences frequently 
do not align one-to-one. Sometimes sentences 
align many-to-one, and often there are deletions in 
*The author wishes to thank Peter Brown, Stephen Del- 
laPietra, Vincent DellaPietra, and Robert Mercer for their 
suggestions, support, and relentless taunting. The author 
also wishes to thank Jan Hajic and Meredith Goldsmith 
as well as the aforementioned for checking the aligmnents 
produced by the implementation. 
one of the supposedly parallel corpora of a bilin- 
gual corpus. These deletions can be substantial; 
in the Canadian Hansard corpus, there are many 
deletions of several thousand sentences and one 
deletion of over 90,000 sentences. 
Previous work includes (Brown el al., 1991b) 
and (Gale and Church, 1991). In Brown, align- 
ment is based solely on the number of words in 
each sentence; the actual identities of words are 
ignored. The general idea is that the closer in 
length two sentences are, the more likely they 
align. To perform the search for the best align- 
ment, dynamic programming (Bellman, 1957) is 
used. Because dynamic programming requires 
time quadratic in the length of the text aligned, 
it is not practical to align a large corpus as a sin- 
gle unit. The computation required is drastically 
reduced if the bilingual corpus can be subdivided 
into smaller chunks. Brown uses anchors to per- 
form this subdivision. An anchor is a piece of text 
likely to be present at the same location in both 
of the parallel corpora of a bilingual corpus. Dy- 
namic programming is used to align anchors, and 
then dynamic programming is used again to align 
the text between anchors. 
The Gale algorithm is similar to the Brown al- 
gorithm except that instead of basing alignment 
on the number of words in sentences, alignment is 
based on the number of characters in sentences. 
Dynamic programming is also used to search for 
the best alignment. Large corpora are assumed to 
be already subdivided into smaller chunks. 
While these algorithms have achieved remark- 
ably good performance, there is definite room for 
improvement. These algorithms are not robust 
with respect to non-literal translations and small 
deletions; they can easily misalign small passages 
9 
Mr. McInnis? M. McInnis? 
Yes. Oui. 
Mr. Saunders? M. Saunders? 
No. Non. 
Mr. Cossitt? M. Cossitt? 
Yes. Oui. 
: 
Figure 1: A Bilingual Corpus Fragment 
because they ignore word identities. For example, 
the type of passage depicted in Figure 1 occurs in 
the Hansard corpus. With length-based alignment 
algorithms, these passages may well be misaligned 
by an even number of sentences if one of the cor- 
pora contains a deletion. In addition, with length- 
based algorithms it is difficult to automatically re- 
cover from large deletions. In Brown, anchors are 
used to deal with this issue, but the selection of 
anchors requires manual inspection of the corpus 
to be aligned. Gale does not discuss this issue. 
Alignment algorithms that use lexical informa- 
tion offer a potential for higher accuracy. Previ- 
ous work includes (Kay, 1991) and (Catizone el 
al., 1989). However, to date lexically-based al- 
gorithms have not proved efficient enough to be 
suitable for large corpora. 
In this paper, we describe a fast algorithm 
for sentence alignment that uses lexical informa- 
tion. The algorithm constructs a simple statistical 
word-to-word translation model on the fly during 
sentence alignment. We find the alignment that 
maximizes the probability of generating the corpus 
with this translation model. The search strategy 
used is dynamic programming with thresholding. 
Because of thresholding, the search is linear in the 
length of the corpus so that a corpus need not be 
subdivided into smaller chunks. The search strat- 
egy is robust with respect to large deletions; lex- 
ical information allows us to confidently identify 
the beginning and end of deletions. 
2 The Alignment Model 
2.1 The Alignment Framework 
We use an example to introduce our framework for 
alignment. Consider the bilingual corpus (E, ~') 
displayed in Figure 2. Assume that we have con- 
structed a model for English-to-French transla- 
tion,/.e., for all E and Fp we have an estimate for 
P(Fp\]E), the probability that the English sentence 
E translates to the French passage Fp. Then, we 
can assign a probability to the English corpus E 
translating to the French corpus :T with a partic- 
ular alignment. For example, consider the align- 
ment .41 where sentence E1 corresponds to sen- 
tence F1 and sentence E2 corresponds to sentences 
F2 and F3. We get 
P(-~',.4~l,f:) = P(FIIE1)P(F~., FsIE2), 
assuming that successive sentences translate inde- 
pendently of each other. This value should be rel- 
atively large, since F1 is a good translation of E1 
and (F2, F3) is a good translation of E2. Another 
possible alignment .42 is one where E1 maps to 
nothing and E2 maps to F1, F2, and F3. We get 
P(.F',.42\]£) = P(elE1)P(F~, F2, F3IE2) 
This value should be fairly low, since the align- 
ment does not map the English sentences to their 
translations. Hence, if our translation model is 
accurate we will have 
P(~',`41I,~) >> P(.r,.421,f:) 
In general, the more sentences that are mapped 
to their translations in an alignment .4, the higher 
the value of P(~,.AIE). We can extend this idea 
to produce an alignment algorithm given a trans- 
lation model. In particular, we take the alignment 
of a corpus (~, ~) to be the alignment ,4 that max- 
imizes P(~',`41E). The more accurate the transla- 
tion model, the more accurate the resulting align- 
ment will be. 
However, because the parameters are all of the 
form P(FplE ) where E is a sentence, the above 
framework is not amenable to the situation where 
a French sentence corresponds to no English sen- 
tences. Hence, we use a slightly different frame- 
work. We view a bilingual corpus as a sequence 
of sentence beads (Brown et al., 1991b), where a 
sentence bead corresponds to an irreducible group 
of sentences that align with each other. For exam- 
ple, the correct alignment of the bilingual corpus 
in Figure 2 consists of the sentence bead \[El; F1\] 
followed by the sentence bead \[E2; \];'2, F3\]. We 
can represent an alignment `4 of a corpus as a se- 
quence of sentence beads (\[Epl; Fpl\], \[Ep2; F~\],...), 
where the E~ and F~ can be zero, one, or more 
sentences long. 
Under this paradigm, instead of expressing the 
translation model as a conditional distribution 
10 
English (£) 
El That is what the consumers 
are interested in and that 
is what the party is 
interested in. 
E2 Hon. members opposite scoff 
at the freeze suggested by 
this party; to them it is 
laughable. 
French (~) 
/'i Voil~ ce qui int6resse le 
consommateur et roll& ce 
que int6resse notre parti. 
F2 Les d6put6s d'en face se 
moquent du gel que a 
propos6 notre parti. 
F3 Pour eux, c'est une mesure 
risible. 
Figure 2: A Bilingual Corpus 
P(FpIE ) we express the translation model as a 
distribution P(\[Ep; Fp\]) over sentence beads. The 
alignment problem becomes discovering the align- 
ment A that maximizes the joint distribution 
P(£,2",.A). Assuming that successive sentence 
beads are generated independently, we get 
L 
P(C, Yr, A) = p(L) H P(\[E~;F~\]) 
k=l 
where A t. 1 = (\[E~,F;\],..., \[EL; FL\])is consistent 
with g and ~" and where p(L) is the probability 
that a corpus contains L sentence beads. 
2.2 The Basic Translation Model 
For our translation model, we desire the simplest 
model that incorporates lexical information effec- 
tively. We describe our model in terms of a series 
of increasingly complex models. In this section, 
we only consider the generation of sentence beads 
containing a single English sentence E = el ""en 
and single French sentence F = fl""fm. As a 
starting point, consider a model that assumes that 
all individual words are independent. We take 
n 
P(\[E; F\]) = p(n)p(m) H p(ei) fi p(fj) 
i=l j=l 
where p(n) is the probability that an English sen- 
tence is n words long, p(m) is the probability that 
a French sentence is m words long, p(ei) is the fre- 
quency of the word ei in English, and p(fj) is the 
frequency of the word fj in French. 
To capture the dependence between individual 
English words and individual French words, we 
generate English and French words in pairs in 
addition to singly. For two words e and f that 
are mutual translations, instead of having the two 
terms p(e) and p(f) in the above equation we 
would like a single term p(e, f) that is substan- 
tially larger than p(e)p(f). To this end, we intro- 
duce the concept of a word bead. A word bead is 
either a single English word, a single French word, 
or a single English word and a single French word. 
We refer to these as 1:0, 0:1, and 1:1 word beads, 
respectively. Instead of generating a pair of sen- 
tences word by word, we generate sentences bead 
by bead, using the hl word beads to capture the 
dependence between English and French words. 
As a first cut, consider the following "model": 
P* (B) = p(l) H p(bi) 
i=1 
where B = {bl, ..., bl} is a multiset of word beads, 
p(l) is the probability that an English sentence 
and a French sentence contain l word beads, and 
p(bi) denotes the frequency of the word bead bi. 
This simple model captures lexical dependencies 
between English and French sentences. 
However, this "model" does not satisfy the con- 
straint that ~B P*(B) = 1; because beddings B 
are unordered multisets, the sum is substantially 
less than one. To force this model to sum to one, 
we simply normalize by a constant so that we re- 
tain the qualitative aspects of the model. We take 
l p(t) "b" 
P(B) =- II p\[ i) N, Z 
While a beading B describes an unordered mul- 
tiset of English and French words, sentences are 
in actuality ordered sequences of words. We need 
to model word ordering, and ideally the probabil- 
ity of a sentence bead should depend on the or- 
dering of its component words. For example, the 
sentence John ate Fido should have a higher prob- 
ability of aligning with the sentence Jean a mang4 
Fido than with the sentence Fido a mang4 Jean. 
11 
However, modeling word order under translation 
is notoriously difficult (Brown et al., 1993), and it 
is unclear how much improvement in accuracy a 
good model of word order would provide. Hence, 
we model word order using a uniform distribution; 
we take 
I 
P(\[E;F\],B)- p(l) Hp(bi) Nin!m! i=1 
which gives us 
p(\[E;F\])= E p(l) ,(s) N,n!m! H p(b,) 
B i=1 
where B ranges over beadings consistent with 
\[E; F\] and l(B) denotes the number of beads in 
B. Recall that n is the length of the English sen- 
tence and m is the length of the French sentence. 
2.3 The Complete Translation 
Model 
In this section, we extend the translation model 
to other types of sentence beads. For simplicity, 
we only consider sentence beads consisting of one 
English sentence, one French sentence, one En- 
glish sentence and one French sentence, two En- 
glish sentences and one French sentence, and one 
English sentence and two French sentences. We 
refer to these as 1:0, 0:1, 1:1, 2:1, and 1:2 sentence 
beads, respectively. 
For 1:1 sentence beads, we take 
t(B) 
P(\[E; F\]) = P1:1 E P1:1(/) H p(bi) NLHn!ml 
B i=1 
where B ranges over beadings consistent with 
\[E;F\] and where Pz:I is the probability of gen- 
erating a 1:1 sentence bead. 
To model 1:0 sentence beads, we use a similar 
equation except that we only use 1:0 word beads, 
and we do not need to sum over beadings since 
there is only one word beading consistent with a 
1:0 sentence bead. We take 
I 
P(\[E\]) = Pl-o Pz:0(/) HP(ei) 
• Nl,l:0n! i=1 
Notice that n = I. We use an analogous equation 
for 0:1 sentence beads. 
For 2:1 sentence beads, we take 
z(s) 
P2:l (/) H p(bi) Pr(\[E1, E2; F\]) = P~:I E Nl 2:lnl !n2!m! 
B ' i=1 
where the sum ranges over beadings B consistent 
with the sentence bead. We use an analogous 
equation for 1:2 sentence beads. 
3 Implementation 
Due to space limitations, we cannot describe the 
implementation in full detail. We present its most 
significant characteristics in this section; for a 
more complete discussion please refer to (Chen, 
1993). 
3.1 Parameterization 
We chose to model sentence length using a Poisson 
distribution, i.e., we took 
At1:0 
Pl:0(/) - l! e ~1:0 
for some Al:0, and analogously for the other types 
of sentence beads. At first, we tried to estimate 
each A parameter independently, but we found 
that after training one or two A would be unnat- 
urally small or large in order to specifically model 
very short or very long sentences. To prevent this 
phenomenon, we tied the A values for the different 
types of sentence beads together. We took 
A1:1 A2:l AI:2 Al:0=A0:l---~--- 3 - 3 (1) 
To model the parameters p(L) representing the 
probability that the bilingual corpus is L sen- 
tence beads in length, we assumed a uniform 
distribution, z This allows us to ignore this term, 
since length will not influence the probability of 
an alignment. We felt this was reasonable becattse 
it is unclear what a priori information we have on 
the length of a corpus. 
In modeling the frequency of word beads, notice 
that there are five distinct distributions we need 
to model: the distribution of 1:0 word beads in 1:0 
sentence beads, the distribution of 0:1 word beads 
in 0:1 sentence beads, and the distribution of all 
word beads in 1:1, 2:1, and 1:2 sentence beads. To 
reduce the number of independent parameters we 
need to estimate, we tied these distributions to- 
gether. We assumed that the distribution of word 
beads in 1:1, 2:1, and 1:2 sentence beads are iden- 
tical. We took the distribution of word beads in 
1 To be precise, we assumed a uniform distribution over 
some arbitrarily large finite range, as one cannot have a 
uniform distribution over a countably infinite set. 
12 
1:0 and 0:1 sentence beads to be identical as well 
except restricted to the relevant subset of word 
beads and normalized appropriately, i.e., we took 
pb(e) for e E Be pc(e) : pb(e') 
and 
Pb(f) for f E By P:(f) = ~'~.:'eB, Pb(f') 
where Pe refers to the distribution of word beads 
in 1:0 sentence beads, pf refers to the distribu- 
tion of word beads in 0:1 sentence beads, pb refers 
to the distribution of word beads in 1:1, 2:1, and 
1:2 sentence beads, and Be and B I refer to the 
sets of 1:0 and 0:1 word beads in the vocabulary, 
respectively. 
3.2 Evaluating the Probability of a 
Sentence Bead 
The probability of generating a 0:1 or 1:0 sentence 
bead can be calculated efficiently using the equa- 
tion given in Section 2.3. To evaluate the proba- 
bilities of the other sentence beads requires a sum 
over an exponential number of word beadings. We 
make the gross approximation that this sum is 
roughly equal to the maximum term in the sum. 
For example, with 1:1 sentence beads we have 
Z(B) 
P(\[E;F\]) = px:lE Pa:I(/) Hp(bi) Nz,Hn!m! 
B i=1 
,~ pllmax{ Pl:I(/) I(B) : B N~m! Hp(bi)} 
i=l 
Even with this approximation, the calculation 
of P(\[E; F\]) is still intractable since it requires a 
search for the most probable beading. We use a 
greedy heuristic to perform this search; we are not 
guaranteed to find the most probable beading. We 
begin with every word in its own bead. We then 
find the 0:1 bead and 1:0 bead that, when replaced 
with a 1:1 word bead, results in the greatest in- 
crease in probability. We repeat this process until 
we can no longer find a 0:1 and 1:0 bead pair that 
when replaced would increase the probability of 
the beading. 
3.3 Parameter Estimation 
We estimate parameters by using a variation of the 
Viterbi version of the expectation-maximization 
(EM) algorithm (Dempster et al., 1977). The 
Viterbi version is used to reduce computational 
complexity. We use an incremental variation of the 
algorithm to reduce the number of passes through 
the corpus required. 
In the EM algorithm, an expectation phase, 
where counts on the corpus are taken using the 
current estimates of the parameters, is alternated 
with a maximization phase, where parameters are 
re-estimated based on the counts just taken. Im- 
proved parameters lead to improved counts which 
lead to even more accurate parameters. In the in- 
cremental version of the EM algorithm we use, in- 
stead of re-estimating parameters after each com- 
plete pass through the corpus, we re-estimate pa- 
rameters after each sentence. By re-estimating pa- 
rameters continually as we take counts on the cor- 
pus, we can align later sections of the corpus more 
reliably based on alignments of earlier sections. 
We can align a corpus with only a single pass, si- 
multaneously producing alignments and updating 
the model as we proceed. 
More specifically, we initialize parameters by 
taking counts on a small body of previously 
aligned data. To estimate word bead frequencies, 
we maintain a count c(b) for each word bead that 
records the number of times the word bead b oc- 
curs in the most probable word beading of a sen- 
tence bead. We take 
c(b) 
pb(b) - Eb, c(V) 
We initialize the counts c(b) to 1 for 0:1 and 1:0 
word beads, so that these beads can occur in bead- 
ings with nonzero probability. To enable 1:1 word 
beads to occur in beadings with nonzero probabil- 
ity, we initialize their counts to a small value when- 
ever we see the corresponding 0:1 and 1:0 word 
beads occur in the most probable word beading of 
a sentence bead. 
To estimate the sentence length parameters ,~, 
we divide the number of word beads in the most 
probable beading of the initial training data by 
the total number of sentences. This gives us an 
estimate for hi:0, and the other ~ parameters can 
be calculated using equation (1). 
We have found that one hundred sentence pairs 
are sufficient to train the model to a state where it 
can align adequately. At this point, we can process 
unaligned text and use the alignments we produce 
to further train the model. We update parameters 
based on the newly aligned text in the same way 
that we update parameters based on the initial 
\]3 
training data. 2 
To align a corpus in a single pass the model 
must be fairly accurate before starting or else the 
beginning of the corpus will be poorly aligned. 
Hence, after bootstrapping the model on one hun- 
dred sentence pairs, we train the algorithm on a 
chunk of the unaligned target bilingual corpus, 
typically 20,000 sentence pairs, before making one 
pass through the entire corpus to produce the ac- 
tual alignment. 
3.4 Search 
It is natural to use dynamic programming to 
search for the best alignment; one can find the 
most probable of an exponential number of align- 
ments using quadratic time and memory. Align- 
ment can be viewed as a "shortest distance" prob- 
lem, where the "distance" associated with a sen- 
tence bead is the negative logarithm of its proba- 
bility. The probability of an alignment is inversely 
related to the sum of the distances associated with 
its component sentence beads. 
Given the size of existing bilingual corpora and 
the computation necessary to evaluate the proba- 
bility of a sentence bead, a quadratic algorithm is 
still too profligate. However, most alignments are 
one-to-one, so we can reap great benefits through 
intelligent thresholding. By considering only a 
subset of all possible alignments, we reduce the 
computation to a linear one. 
Dynamic programming consists of incrementally 
finding the best alignment of longer and longer 
prefixes of the bilingual corpus. We prune all 
alignment prefixes that have a substantially lower 
probability than the most probable alignment pre- 
fix of the same length. 
2 In theory, one cannot decide whether a particular sen- 
tence bead belongs to the best alignment of a corpus un- 
til the whole corpus has been processed. In practice, some 
partial alignments will have much higher probabilities than 
all other ahgnments, and it is desirable to train on these 
partial alignments to aid in aligning later sections of the 
corpus. To decide when it is reasonably safe to train on a 
particular sentence bead, we take advantage of the thresh- 
olding described in Section 3.4, where improbable partial 
alignments are discarded. At a given point in time in align- 
ing a corpus, all undiscarded partial alignments will have 
some sentence beads in common. When a sentence bead is 
common to all active partial alignments, we consider it to 
he safe to train on. 
3.5 Deletion Identification 
Deletions are automatically handled within the 
standard dynamic programming framework. How- 
ever, because of thresholding, we must handle 
large deletions using a separate mechanism. 
Because lexical information is used, correct 
alignments receive vastly greater probabilities 
than incorrect alignments. Consequently, thresh- 
olding is generally very aggressive and our search 
beam in the dynamic programming array is nar- 
row. However, when there is a large deletion in 
one of the parallel corpora, consistent lexical cor- 
respondences disappear so no one alignment has 
a much higher probability than the others and 
our search beam becomes wide. When the search 
beam reaches a certain width, we take this to in- 
dicate the beginning of a deletion. 
To identify the end of a deletion, we search lin- 
early through both corpora simultaneously. All 
occurrences of words whose frequency is below a 
certain value are recorded in a hash table. When- 
ever we notice the occurrence of a rare word in 
one corpus and its translation in the other, we 
take this as a candidate location for the end of the 
deletion. For each candidate location, we exam- 
ine the forty sentences following the occurrence of 
the rare word in each of the two parallel corpora. 
We use dynamic programming to find the prob- 
ability of the best alignment of these two blocks 
of sentences. If this probability is sufficiently high 
we take the candidate location to be the end of 
the deletion. Because it is extremely unlikely that 
there are two very similar sets of forty sentences 
in a corpus, this deletion identification algorithm 
is robust. In addition, because we key off of rare 
words in considering ending points, deletion iden- 
tification requires time linear in the length of the 
deletion. 
4 Results 
Using this algorithm, we have aligned three large 
English/French corpora. We have aligned a cor- 
pus of 3,000,000 sentences (of both English and 
French) of the Canadian Hansards, a corpus of 
1,000,000 sentences of newer Hansard proceedings, 
and a corpus of 2,000,000 sentences of proceed- 
ings from the European Economic Community. In 
each case, we first bootstrapped the translation 
model by training on 100 previously aligned sen- 
tence pairs. We then trained the model further on 
14 
20,000 sentences of the target corpus. Note that 
these 20,000 sentences were not previously aligned. 
Because of the very low error rates involved, in- 
stead of direct sampling we decided to estimate 
the error of the old Hansard corpus through com- 
parison with the alignment found by Brown of the 
same corpus. We manually inspected over 500 lo- 
cations where the two alignments differed to esti- 
mate our error rate on the alignments disagreed 
upon. Taking the error rate of the Brown align- 
ment to be 0.6%, we estimated the overall error 
rate of our alignment to be 0.4%. 
In addition, in the Brown alignment approxi- 
mately 10% of the corpus was discarded because 
of indications that it would be difficult to align. 
Their error rate of 0.6% holds on the remaining 
sentences. Our error rate of 0.4% holds on the 
entire corpus. Gale reports an approximate error 
rate of 2% on a different body of Hansard data 
with no discarding, and an error rate of 0.4% if 
20% of the sentences can be discarded. 
Hence, with our algorithm we can achieve at 
least as high accuracy as the Brown and Gale algo- 
rithms without discarding any data. This is espe- 
cially significant since, presumably, the sentences 
discarded by the Brown and Gale algorithms are 
those sentences most difficult to align. 
In addition, the errors made by our algorithm 
are generally of a fairly trivial nature. We ran- 
domly sampled 300 alignments from the newer 
Hansard corpus. The two errors we found are 
displayed in Figures 3 and 4. In the first error, 
E1 was aligned with F1 and E2 was aligned with 
/'2. The correct alignment maps E1 and E2 to F1 
and F2 to nothing. In the second error, E1 was 
aligned with F1 and F2 was aligned to nothing. 
Both of these errors could have been avoided with 
improved sentence boundary detection. Because 
length-based alignment algorithms ignore lexical 
information, their errors can be of a more spec- 
tacular nature. 
The rate of alignment ranged from 2,000 to 
5,000 sentences of both English and French per 
hour on an IBM RS/6000 530H workstation. The 
alignment algorithm lends itself well to paralleliza- 
tion; we can use the deletion identification mecha- 
nism to automatically identify locations where we 
can subdivide a bilingual corpus. While it required 
on the order of 500 machine-hours to align the 
newer Hansard corpus, it took only 1.5 days of 
real time to complete the job on fifteen machines. 
5 Discussion 
We have described an accurate, robust, and fast 
algorithm for sentence alignment. The algorithm 
can handle large deletions in text, it is language 
independent, and it is parallelizable. It requires 
a minimum of human intervention; for each lan- 
guage pair 100 sentences need to be aligned by 
hand to bootstrap the translation model. 
The use of lexical information requires a great 
computational cost. Even with numerous approxi- 
mations, this algorithm is tens of times slower than 
the Brown and Gale algorithms. This is acceptable 
given that alignment is a one-time cost and given 
available computing power. It is unclear, though, 
how much further it is worthwhile to proceed. 
The natural next step in sentence alignment is 
to account for word ordering in the translation 
model, e.g., the models described in (Brown et 
al., 1993) could be used. However, substantially 
greater computing power is required before these 
approaches can become practical, and there is not 
much room for further improvements in accuracy. 

References 
(Bellman, 1957) Richard Bellman. Dynamic Pro- 
gramming. Princeton University Press, Princeton 
N.J., 1957. 
(Brown et al., 1990) Peter F. Brown, John Cocke, 
Stephen A. DellaPietra, Vincent J. DellaPietra, 
Frederick Jelinek, John D. Lafferty, Robert L. 
Mercer, and Paul S. Roossin. A statistical ap- 
proach to machine translation. Computational 
Linguistics, 16(2):79-85, June 1990. 
(Brown et al., 1991a) Peter F. Brown, Stephen A. 
DellaPietra, Vincent J. DellaPietra, and Ro- 
bert L. Mercer. Word sense disambiguation using 
statistical methods. In Proceedings 29th Annu- 
al Meeting of the ACL, pages 265-270, Berkeley, 
CA, June 1991. 
(Brown et al., 1991b) Peter F. Brown, Jennifer C. 
Lai, and Robert L. Mercer. Aligning sentences 
in parallel corpora. In Proceedings 29th Annual 
Meeting of the ACL, pages 169-176, Berkeley, 
CA, June 1991. 
(Brown et al., 1993) Peter F. Brown, Stephen A. Del- 
laPietra, Vincent J. DellaPietra, and Robert L. 
Mercer. The mathematics of machine transla- 
tion: Parameter estimation. Computational Lin- 
guistics, 1993. To appear. 
(Catizone et al., 1989) Roberta Catizone, Graham 
Russell, and Susan Warwick. Deriving transla- 
tion data from bilingual texts. In Proceedings 
of the First International Acquisition Workshop, 
Detroit, Michigan, August 1989. 
(Chen, 1993) Stanley 17. Chen. Aligning sentences in 
bilingual corpora using lexical information. Tech- 
nical Report TR-12-93, Harvard University, 1993. 
(Dagan et al., 1991) Ido Dagan, Alon Itai, and U1- 
rike Schwall. Two languages are more informa- 
tive than one. In Proceedings of the 29th Annual 
Meeting of the ACL, pages 130-137, 1991. 
(Dempster et al., 1977) A.P. Dempster, N.M. Laird, 
and D.B. Rubin. Maximum likelihood from in- 
complete data via the EM algorithm. Journal of 
the Royal Statistical Society, 39(B):1-38, 1977. 
(Gale and Church, 1991) William A. Gale and Ken- 
neth W. Church. A program for aligning sen- 
tences in bilingual corpora. In Proceedings of the 
29th Annual Meeting of the ACL, Berkeley, Cali- 
fornia, June 1991. 
(Gale et al., 1992) William A. Gale, Kenneth W. 
Church, and David Yarowsky. Using bilingual 
materials to develop word sense disambiguation 
methods. In Proceedings of the Fourth Interna- 
tional Conference on Theoretical and Methodolog- 
ical lssues in Machine Translation, pages 101- 
112, Montr4al, Canada, June 1992. 
(Kay, 1991) Martin Kay. Text-translation alignment. 
In ACH/ALLC 'gl: "Making Connections" Con- 
ference Handbook, Tempe, Arizona, March 1991. 
(Klavans and Tzoukermann, 1990) Judith Klavans 
and Evelyne Tzoukermann. The bicord system. 
In COLING-gO, pages 174-179, Helsinki, Fin- 
land, August 1990. 
(Sadler, 1989) V. Sadler. The Bilingual Knowledge 
Bank - A New Conceptual Basis for MT. 
BSO/Research, Utrecht, 1989. 
(Warwick and Russell, 1990) Susan Warwick and 
Graham Russell. Bilingual concordancing and 
bilingual lexicography. In EURALEX 4th later- 
national Congress, M~laga, Spain, 1990. 
