ALIGNING SENTENCES IN PARALLEL CORPORA 
Peter F. Brown, Jennifer C. Lai, a, nd Robert L. Mercer 
IBM Thomas J. Watson Research Center 
P.O. Box 704 
Yorktown Heights, NY 10598 
ABSTRACT 
In this paper we describe a statistical tech- 
nique for aligning sentences with their translations 
in two parallel corpora. In addition to certain 
anchor points that are available in our da.ta, the 
only information about the sentences that we use 
for calculating alignments is the number of tokens 
that they contain. Because we make no use of the 
lexical details of the sentence, the alignment com- 
putation is fast and therefore practical for appli- 
cation to very large collections of text. We have 
used this technique to align several million sen- 
tences in the English-French Hans~trd corpora and 
have achieved an accuracy in excess of 99% in a 
random selected set of 1000 sentence pairs that we 
checked by hand. We show that even without the 
benefit of anchor points the correlation between 
the lengths of aligned sentences is strong enough 
that we should expect to achieve an accuracy of 
between 96% and 97%. Thus, the technique may 
be applicable to a wider variety of texts than we 
have yet tried. 
INTRODUCTION 
Recent work by Brown et al., \[Brown et 
al., 1988, Brown et al., 1990\] has quickened 
anew the long dormant idea of using statistical 
techniques to carry out machine translation 
from one natural language to another. The 
lynchpin of their approach is a. large collection 
of pairs of sentences that. are mutual transla- 
tions. Beyond providing grist to the sta.tisti- 
cal mill, such pairs of sentences are valuable 
to researchers in bilingual lexicography \[I(la.- 
va.ns and Tzoukerma.nn, 1990, Warwick and 
Russell, 1990\] and may be usefifl in other ap- 
proaches to machine translation \[Sadler, 1989\]. 
In this paper, we consider the problem of 
extra.cting from pa.raJlel French and F, nglish 
corpora pairs sentences that are translations 
of one another. The task is not trivial because 
at times a single sentence in one language is 
translated as two or more sentences in the 
other language. At other times a sentence, 
or even a whole passage, may be missing from 
one or the other of the corpora. 
If a person is given two parallel texts and 
asked to match up the sentences in them, it is 
na.tural for him to look at the words in the sen- 
tences. Elaborating this intuitively appealing 
insight, researchers at Xerox and at ISSCO 
\[Kay, 1991, Catizone et al., 1989\] have devel- 
oped alignment Mgodthms that pair sentences 
according to the words that they contain. Any 
such algorithm is necessarily slow and, despite 
the potential for highly accurate alignment, 
may be unsuitable for very large collections 
of text. Our algorithm makes no use of the 
lexical details of the corpora, but deals only 
with the number of words in each sentence. 
Although we have used it only to align paral- 
lel French and English corpora from the pro- 
ceedings of the Canadian Parliament, we ex- 
pect that our technique wouhl work on other 
French and English corpora and even on other 
pairs of languages. The work of Gale and 
Church , \[Gale and Church, 1991\], who use 
a very similar method but measure sentence 
lengths in characters rather than in words, 
supports this promise of wider applica.bility. 
TIIE HANSARD CORPORA 
Brown el al., \[Brown et al., 1990\] describe 
the process by which the proceedings of the 
Ca.nadian Parliament are recorded. In Canada, 
these proceedings are re\[erred to as tta.nsards. 
169 
Our Hansard corpora consist of the llansards 
from 1973 through 1986. There are two files 
for each session of parliament: one English 
and one French. After converting the obscure 
text markup language of the raw data. to TEX , 
we combined all of the English files into a sin- 
gle, large English corpus and all of the French 
files into a single, large French corpus. We 
then segmented the text of each corpus into 
tokens and combined the tokens into groups 
that we call sentences. Generally, these con- 
form to the grade-school notion of a sentence: 
they begin with a capital letter, contain a. 
verb, and end with some type of sentence-final 
punctuation. Occasionally, they fall short of 
this ideal and so each corpus contains a num- 
ber of sentence fragments and other groupings 
of words that we nonetheless refer to as sen- 
tences. With this broad interpretation, the 
English corpus contains 85,016,286 tokens in 
3,510,744 sentences, and the French corpus 
contains 97,857,452 tokens in 3,690,425 sen- 
tences. The average English sentence has 24.2 
tokens, while the average French sentence is 
about 9.5% longer with 26.5 tokens. 
The left-hand side of Figure 1 shows the 
raw data for a portion of the English corpus, 
and the right-hand side shows the same por- 
tion after we converted it to TEX and divided 
it up into sentences. The sentence numbers do 
not advance regularly because we have edited 
the sample in order to display a variety of phe- 
nolnena. 
In addition to a verbatim record of the 
proceedings and its translation, the ttansards 
include session numbers, names of speakers, 
time stamps, question numbers, and indica- 
tions of the original language in which each 
speech was delivered. We retain this auxiliary 
information in the form of comments sprin- 
kled throughout the text. Each comment has 
the form \SCM{} ... \ECM{} as shown 
on the right-hand side of Figure 1. \]n ad- 
dition to these comments, which encode in- 
formation explicitly present in the data, we 
inserted Paragraph comments as suggested by 
the space command of which we see aa exam- 
ple in the eighth line on the left-hand side of 
Figure 1. 
We mark the beginning of a parliamentary 
session with a Document comment as shown 
in Sentence 1 on the right-hand side of Fig- 
ure 1. Usually, when a member addresses the 
parliament, his name is recorded and we en- 
code it in an Author comment. We see an ex- 
ample of this in Sentence 4. If the president 
speaks, he is referred to in the English cor- 
pus as Mr. Speaker and in the French corpus 
as M. le Prdsideut. If several members speak 
at once, a shockingly regular occurrence, they 
are referred to as Some Hon. Members in the 
English and as Des Voix in the French. Times 
are recorded either ~ exact times on a. 24-hour 
basis as in $entencc 8\], or as inexact times of 
which there are two forms: Time = Later, 
and Time = Recess. These are rendered in 
French as Time = Plus Tard and Time = Re- 
cess. Other types of comments that appear 
are shown in Table 1. 
ALIGNING ANCHOR POINTS 
After examining the Hansard corpora, we 
realized that the comments laced throughout 
would serve as uscflll anchor points in any 
alignment process. We divide the comments 
into major and minor anchors as follows. The 
comments Author = Mr. Speaker, Author = 
ill. le Pr(sident, Author = Some Hon. Mem- 
bers, and Author = Des Voix are called minor 
anchors. All other comments are called major 
anchors with the exception of the Paragraph 
comment which is not treated as an anchor at 
all. The minor anchors are much more com- 
mon than any particular major anchor, mak- 
ing an alignment based on them less robust 
against deletions than one based on the ma- 
jor anchors. Therefore, we have carried out 
the alignment of anchor points in two passes, 
first aligning the major anchors and then the 
minor anchors. 
Usually, the major anchors appear in both 
languages. Sometimes, however, through inat- 
tentlon on the part of the translator or other 
misa.dvel~ture, the tla.me of a speaker may be 
garbled or a comment may be omitted. In the 
first alignment pass, we assign to alignments 
170 
/*START_COMMENT* Beginning file = 048 
101 H002-108 script A *END_COMMENT*/ 
.TB 029 060 090 099 
.PL 060 
.LL 120 
.NF 
The House met at 2 p.m. 
.SP 
*boMr. Donald MacInnis (Cape Breton 
-East Richmond):*ro Mr. Speaker, 
I rise on a question of privilege af- 
fecting the rights and prerogatives 
of parliamentary committees and one 
which reflects on the word of two 
ministers. 
.SP 
*boMr. Speaker: *roThe hon. member's 
motion is proposed to the 
House under the terms of Standing 
Order 43. Is there unanimous consent? 
.SP 
*boSome hon. Members: *roAgreed. 
s*itText*ro) 
Question No. 17--*boMr. Mazankowski: 
*to 
I. For the period April I, 1973 to 
January 31, 1974, what amount of 
money was expended on the operation 
and maintenance of the Prime 
Minister's residence at Harrington 
Lake, Quebec? 
.SP 
(1415) 
s*itLater:*ro) 
.SP 
*boMr. Cossitt:*ro Mr. Speaker, I rise 
on a point of order to ask for 
clarification by the parliamentary 
secretary. 
1. \SCM{} Document = 048 101 H002-108 
script A \ECM{) 
2. The House met at 2 p.m. 
3. \SCM{} Paragraph \ECM{} 
4. \SCM{} Author = Mr. Donald MacInnis 
(Cape Breton-East Richmond) \ECM{} 
5. Mr. Speaker, I rise on a question of 
privilege affecting the rights and 
prerogatives of parliamentary 
committees and one which reflects on 
the word of two ministers. 
21. \SCM{} Paragraph \ECM{} 
22. \SCM{} Author = Mr. Speaker \ECM{} 
23. The hon. member's motion is proposed 
to the House under the terms of 
Standing Order 43. 
44. Is there unanimous consent? 
45. \SCM{} Paragraph \ECM{) 
46. \SCM{-} Author = Some hon. Members 
\ECM{} 
47. Agreed. 
61. \SCM{} Source = Text \ECM{} 
62. \SCM{} Question = 17 \ECM{} 
63. \SCM{} Author = Mr. Mazankowski 
\ECMO 
64. I. 
65. For the period April I, 1973 to 
January 31, 1974, .hat amount of 
money was expended on the operation 
and maintenance of the Prime 
Minister's residence at Harrington 
Lake, Quebec? 
66. \SCM{} Paragraph \ECN{} 
81. \SCM{) Time = (1415) \ECM{} 
82. \SCM{) Time = Later \ECM{) 
83. \SCM{} Paragraph \ECM{} 
84. \SCM{} Author = Mr. Cossitt \ECM{} 
85. Mr. Speaker, I rise on a point of 
order to ask for clarification by 
the parliamentary secretary. 
Figure 1: A sample of text before and after cleanup 
171 
a cost that favors exact matches and penalizes 
omissions or garbled matches. Thus, for ex- 
ample, we assign a cost of 0 to the pair Time 
= Later and Time = Plus Tard, but a cost 
of 10 to the pair Time = Later and Author 
= Mr. Bateman. We set the cost of a dele- 
tion at 5. For two names, we set the cost by 
counting the number of insertions, deletions, 
and substitutions necessary to transform one 
name, letter by letter, into the other. This 
value is then reduced to the range 0 to 10. 
Given the costs described above, it is a 
standard problem in dynamic programming 
to find that alignment of the major anchors 
in the two corpora with the least total cost 
\[Bellman, 1957\]. In theory, the time and space 
required to find this alignment grow as the 
product of the lengths of the two sequences 
to be aligned. In practice, however, by using 
thresholds and the partial traceback technique 
described by Brown, Spohrer, Hochschild, and 
Baker , \[Brown et al., 1982\], the time required 
can be made linear in the length of the se- 
quences, and the space can be made constant. 
Even so, the computational demand is severe 
since, in places, the two corpora are out of 
alignment by as many as 90,000 sentences ow- 
ing to mislabelled or missing files. 
This first pass renders the data as a se- 
quence of sections between aligned major an- 
chors. In the second pass, we accept or reject 
each section in turn according to the popula- 
tion of minor anchors that it contains. Specifi- 
cally, we accept a section provided that, within 
the section, both corpora contain the same 
number of minor anchors in the same order. 
Otherwise, we reject the section. Altogether, 
we reject about 10% of the data in each cor- 
pus. The minor anchors serve to divide the 
remaining sections into subsections thai. range 
in size from one sentence to several thousand 
sentences and average about ten sentences. 
ALIGNING SENTENCES AND 
PARAGRAPH BOUNDARIES 
We turn now to the question of aligning 
the individual sentences in a subsection be- 
tween minor anchors. Since the number of 
English 
Source = English 
Source = Translation 
Source = Text 
Source = List Item 
Source = Question 
Source = Answer 
Fren(;h 
Source = Traduction 
Source = Francais 
Source = Texte 
Source = List Item 
Source = Question 
Source = Reponse 
Table 1: Examples of comments 
sentences in the French corpus differs from the 
number in the English corpus, it is clear that 
they cannot be in one-to-one correspondence 
throughout. Visual inspection of the two cor- 
pora quickly reveals that although roughly 90% 
of the English sentences correspond to single 
French sentences, there are many instances 
where a single sentence in one corpus is rep- 
resented by two consecutive sentences in the 
other. Rarer, but still present, are examples 
of sentences that appear in one corpus but 
leave no trace in the other. If one is moder- 
ately well acquainted with both English and 
French, it is a simple matter to decide how the 
sentences should be aligned. Unfortunately, 
the sizes of our corpora make it impractical 
for us to obtain a complete set of alignments 
by hand. Rather, we must necessarily employ 
some automatic scheme. 
It is not surprising and further inspection 
verifies that tile number of tokens in sentences 
that are translations of one another are corre- 
lated. We looked, therefore, at the possibility 
of obtaining alignments solely on the basis of 
sentence lengths in tokens. From this point of 
view, each corl)us is simply a sequence of sen- 
tence lengths punctuated by occasional para- 
graph markers. Figure 2 shows the initial por- 
tion of such a pair of corpora. We have circled 
groups of sentence lengths to show the cor- 
rect alignment. We call each of the groupings 
a bead. In this example, we have an el-bead 
followed by an eft-bead followed by an e-bead 
followed by a ¶~¶l-bead. An alignment, then, 
is simply a sequence of beads that accounts 
for the observed sequences of sentence lengths 
and paragraph markers. We imagine the sen- 
tences in a subsection to have been generated 
by a pa.ir of random processes, the first pro- 
172 
Figure 2: Division of aligned corpora into beads 
Bead 
e 
/ 
,f ee/ 
eft 
¶! 
¶o¶t 
Text 
one English sentence 
one French sentence 
one English and one French sentence 
two English and one French sentence 
one English and two French sentences 
one English paragraph 
one French paragraph 
one English and one French paragraph 
Table 2: Alignment Beads 
ducing a sequence of beads and the second 
choosing the lengths of the sentences in each 
bead. 
Figure 3 shows the two-state Markov model 
that we use for generating beads. -We assume 
that a single sentence in one language lines up 
with zero, one, or two sentences in the other 
and that paragraph markers may be deleted. 
Thus, we allow any of the eight beads shown in 
Table 2. We also assume that Pr (e) = Pr (f), 
Pr (eft)= er (ee/), and Pr (¶¢) = Pr(¶t). 
BEAD 
...... s-L-°--P- ....... ;!:::O 
Figure 3: Finite state model for generating beads 
Given a bead, we determine the lengths of 
the sentences it contains as follows. We a.s- 
sume the probability of an English sentence 
of length g~ given an e-bead to be the same 
as the probability of an English sentence of 
length ee in the text as a whole. We denote 
this probability by Pr(ee). Similarly, we as- 
sume the probability of a French sentence of 
length g! given an f-bead to be Pr (gY)" For an 
el-bead, we assume that the English sentence 
has length e, with probability Pr (~e) and that 
log of the ratio of length of the French sen- 
tence to the length of the English sentence is 
uormMly distributed with mean /t and vari- 
ance a 2. Thus, if r = log(gt/ge), we assume 
that 
er(ts\[e, ) = c exp\[-(r- (1) 
with 0¢ chosen so that the sum of Pr(tllt, ) 
over positive values of gI is equal to unity. For 
an eel-bead, we assume that each of the En- 
glish sentences is drawn independently from 
the distribution Pr(t.) and that the log of 
the ratio of the length of the French sentence 
to the sum of the lengths of the English sen- 
tences is normally distributed with the same 
mean and variance as for an el-bead. Finally, 
for an eft-bead, we assume that the length of 
the English sentence is drawn from the distri- 
bution Pr (g,) and that the log of the ratio of 
the sum of the lengths of the French sentences 
to the length of the English sentence is nor- 
mally distributed asbefore. Then, given the 
sum of the lengths of the French sentences, 
we assume that tile probability of a particular 
pair of lengths,/~11 and ~12, is proportional to 
Vr (ef,) Pr (~S~) . 
Together, these two random processes form 
a hidden Markov model \[Baum, 1972\] for the 
generation of aligned pairs of corpora.. We de- 
termined the distributions, Pr (g,) and Pr (aS), 
front the relative frequencies of various sen- 
tence lengths in our data. Figure 4 shows for 
each language a. histogram of these for sen- 
tences with fewer than 81 tokens. Except for 
lengths 2 and 4, which include a large num- 
ber of formulaic sentences in both the French 
and the English, the distributions are very 
smooth. 
For short sentences, the relative frequency 
is a reliable estimate of the corresponding prob- 
ability since for both French and English we 
have more than 100 sentences of each length 
less tha.n 8\]. We estimated the probabilities 
173 
I 80 
mentenee length 
1 80 
.entenea length 
Figure 4: Histograms of French (top) and English (bottom) sentence lengths 
174 
of greater lengths by fitting the observed fre- 
quencies of longer sentences to the tail of a 
Poisson distribution. 
We determined M1 of the other parameters 
by applying the EM algorithm to a large sam- 
pie of text \[Baum, 1972, Dempster et al., 1977\]. 
The resulting values are shown in Table 3. 
From these parameters, we can see that 91% 
of the English sentences and 98% of the En- 
glish paragraph markers line up one-to-one 
with their French counterparts. A random 
variable z, the log of which is normMly dis- 
tributed with mean # and variance o ~, has 
mean value exp(/t + a2/2). We can also see, 
therefore, that the total length of the French 
text in an el-, eel-, or eft-bead should be about 
9.8% greater on average than the total length 
of the corresponding English text. Since most 
sentences belong to el-beads, this is close to 
the value of 9.5% given in Section 2 for the 
amount by which the length of the average 
French sentences exceeds that of the average 
English sentence. 
We can compute from the parameters in 
Table 3 that the entropy of the bead produc- 
tion process is 1.26 bits per sentence. Us- 
ing the parameters # and (r 2, we can combine 
the observed distribution of English sentence 
lengths shown in Figure 4 with the conditional 
distribution of French sentence lengths given 
English sentence lengths in Equation (1) to 
obtain the joint distribution of French and 
English sentences lengths in el-, eel-, and eft- 
beads. From this joint distribution, we can 
compute that the mutual information between 
French and English sentence lengths in these 
beads is 1.85 bits per sentence. We see there- 
fore that, even in the absence of the anchor 
points produced by the first two pa.sses, the 
correla.tion in sentence lengths is strong enough 
to allow alignment with an error rate that 
is asymptotically less than 100%. lh;arten- 
ing though such a result may be to the theo- 
retician, this is a sufficiently coarse bound on 
the error rate to warrant further study. Ac- 
cordingly, we wrote a program to Simulate the 
alignment process that we had in mind. Using 
Pr(e¢), Pr((¢), and the parameters from Ta- 
Parameter Estimate 
er (e), Pr(/) .007 
Pr (e/) .690 
Pr (eel), Pr (eft) .020 
Pr (¶~), Pr (¶f) .005 
It. .072 
tr 2 .043 
Table 3: P~rameter estimates 
ble 3, we generated an artificial pair of aligned 
corpora. We then determined the most prob- 
able alignment for the data. We :recorded 
the fraction of el-beads in the most probable 
alignment that did not correspond to el-beads 
in the true Mignment as the error rate for the 
process. We repeated this process many thou- 
sands of times and found that we could ex- 
pect an error rate of about 0.9% given the 
frequency of anchor points from the first two 
pa,sses. 
By varying the parameters of the hidden 
Markov model, we explored the effect of an- 
chor points and paragraph ma.rkers on the ac- 
curacy of alignment. We found that with para- 
graph markers but no ~tnchor points, we could 
expect an error rate of 2.0%, with anchor points 
but no l)~tra.graph markers, we could expect an 
error rate of 2.3%, and with neither anchor 
points nor pa.ragraph markers, we could ex- 
pect an error rate of 3.2%. Thus, while anchor 
points and paragraph markers are important, 
alignment is still feasible without them. This 
is promising since it suggests that one may 
be able to apply the same technique to data 
where frequent anchor points are not avail- 
able. 
RESULTS 
We aplflied the alignment algorithm of Sec- 
t.ions 3 and 4 to the Ca.na.dian Hansa.rd data 
described in Section 2. The job ran for l0 
clays on au IBM Model 3090 mainframe un- 
der an operating system that permitted ac- 
cess to 16 mega.bytes of virtual memory. The 
most probable alignment contained 2,869,041 
el-beads. Some of our colleagues helped us 
175 
And love and kisses to you, too. 
... mugwumps who sit on the fence with 
their mugs on one side and their 
wumps on the other side and do not 
know which side to come down on. 
At first reading, she may have. 
Pareillelnent. 
... en voulant m&lager la ch~vre et le choux 
ils n'arrivent 1)as k prendre patti. 
Elle semble en effet avoir un grief tout a 
fait valable, du moins au premier 
abord. 
Table 4: Unusual but correct alignments 
examine a random sample of 1000 of these 
beads, and we found 6 in which sentences were 
not translations of one another. This is con- 
sistent with the expected error rate ol 0.9% 
mentioned above. In some cases, the algo- 
rithm correctly aligns sentences with very dif- 
ferent lengths. Table 4 shows some interesting 
examples of this. 

REFERENCES 
\[Baum, 1972\] Baum, L. (1972). An inequality 
and associated maximization technique in 
statistical estimation of probabilistic func- 
tions of a Markov process. Inequalities, 3:1- 
8. 
\[Bellman, 1957\] Bellman, R. (1957). Dy- 
namic Programming. Princeton University 
Press, Princeton N.J. 
\[Brown et al., 1982\] Brown, P., Spohrer, J., 
Hochschild, P., and Baker, J. (1982). Par- 
tial traceback and dynamic programming. 
In Proceedings of the IEEE International 
Conference on Acoustics, Speech and Signal 
Processing, pages 1629-1632, Paris, France. 
\[Brown et ai., 1990\] Brown, P. F., Cocke, J., 
DellaPietra, S. A., DellaPietra, V. J., Je- 
linek, F., Lafferty, J. D., Mercer, R. L., 
and Roossin, P. S. (1990). A statisticM ap- 
proach to machine translation. Computa- 
tional Linguistics, 16(2):79-85. 
\[Brown et al., 1988\] Brown, P. F., Cocke, J., 
DellaPietra, S. A., DellaPietra., V. J., .le- 
linek, F., Mercer, R. L., and Roossin, P. S. 
(1988). A statistical approach to language 
translation. In Proceedings of the I2th In- 
ternational Conference on Computational 
Linguisticsl Budapest, Hungary. 
\[Catizone et al., 1989\] Catizone, R., Russell, 
G., and Warwick, S. (1989). Deriving trans- 
lation data \[rom bilingual texts. In Proceed- 
ings of the First International Acquisition 
Workshop, Detroit, Michigan. 
\[Dempster et al., \]977\] Dempster, A., Laird, 
N., and Rubin, D. (1977). Maximum likeli- 
hood from incomplete data via the EM al- 
gorithm. Journal of the Royal Statistical 
Society, 39(B):1-38. 
\[Gale and Church, 1991\] Gale, W. A. and 
Church, K. W. (1991). A program for align- 
ing sentences in bilingual corpora. In Pro- 
ceedings of the 2gth Annual Meeting of the 
A ssociation for Computational Linguistics, 
Berkeley, California. 
\[Kay, \]991\] Kay, M. (1991). Text-translation 
alignment. In ACII/ALLC '91: "Mak- 
in.q Connections" Conference Handbook, 
Tempe, Arizona. 
\[Klavans and Tzoukermann, 1990\] 
Kiavans, .l. and Tzoukermann, E. (1990). 
The bicord system. \]n COLING-90, pages 
174-179, Ilelsinki, Finland. 
\[Sadler, 19~9\] Sadler, V. (1989). The Bilin- 
gual Knowledge Bank- A New Conceptual 
Basis for MT. BSO/Research, Utrecht. 
\[Warwick and Russell, 1990\] Wa.rwick, S. and 
Russell, G. (1990). Bilingual concordancing 
and bilingnM lexicography. In EURALEX 
4th International Congress, M~ilaga, Spain. 
