K-vec: A New Approach for Aligning Parallel Texts 
Pascale Fung 
Computer Science Department 
Columbia University 
New York, NY 10027 USA 
fung@cs.columbia.edu 
Kenneth Ward Church 
AT&T Bell Laboratories 
600 Mountain Ave. 
Murray Hill, NJ 07974 USA 
kwc@research.att.com
Abstract 
Various methods have been proposed for aligning 
texts in two or more languages such as the 
Canadian Parliamentary Debates (Hansards). Some 
of these methods generate a bilingual lexicon as a 
by-product. We present an alternative alignment
strategy, which we call K-vec, that starts by
estimating the lexicon. For example, it discovers
that the English word fisheries is similar to the
French pêches by noting that the distribution of
fisheries in the English text is similar to the
distribution of pêches in the French. K-vec does
not depend on sentence boundaries.
1. Motivation 
There have been quite a number of recent papers on 
parallel text: Brown et al (1990, 1991, 1993), Chen 
(1993), Church (1993), Church et al (1993), Dagan 
et al (1993), Gale and Church (1991, 1993), 
Isabelle (1992), Kay and Röscheisen (1993),
Klavans and Tzoukermann (1990), Kupiec (1993), 
Matsumoto (1991), Ogden and Gonzales (1993), 
Shemtov (1993), Simard et al (1992), Warwick- 
Armstrong and Russell (1990), Wu (to appear). 
Most of this work has been focused on European 
language pairs, especially English-French. It 
remains an open question how well these methods 
might generalize to other language pairs, especially 
pairs such as English-Japanese and English- 
Chinese. 
In previous work (Church et al, 1993), we have
reported some preliminary success in aligning the
English and Japanese versions of the AWK manual
(Aho, Kernighan, Weinberger (1980)), using
char_align (Church, 1993), a method that looks for
character sequences that are the same in both the
source and target. The char_align method was
designed for European language pairs, where
cognates often share character sequences, e.g.,
government and gouvernement. In general, this
approach doesn't work between languages such as
English and Japanese, which are written in different
alphabets. The AWK manual happens to contain a
large number of examples and technical words that
are the same in the English source and the Japanese
target.
It remains an open question how we might be able 
to align a broader class of texts, especially those 
that are written in different character sets and share 
relatively few character sequences. The K-vec 
method attempts to address this question. 
2. The K-vec Algorithm 
K-vec starts by estimating the lexicon. Consider
the example: fisheries → pêches. The K-vec
algorithm will discover this fact by noting that the
distribution of fisheries in the English text is
similar to the distribution of pêches in the French.
The concordances for fisheries and pêches are
shown in Tables 1 and 2 (at the end of this paper). 1
1. These tables were computed from a small fragment of the 
Canadian Hansards that has been used in a number of other 
studies: Church (1993) and Simard et al (1992). The 
English text has 165,160 words and the French text has 
185,615 words. 
There are 19 instances of fisheries and 21 instances
of pêches. The numbers along the left hand edge
show where the concordances were found in the
texts. We want to know whether the distribution of
numbers in Table 1 is similar to those in Table 2,
and if so, we will suspect that fisheries and pêches
are translations of one another. A quick look at the
two tables suggests that the two distributions are
probably very similar, though not quite identical. 2
We use a simple representation of the distribution
of fisheries and pêches. The English text and the
French text are each split into K pieces. Then we
determine whether or not the word in question
appears in each of the K pieces. Thus, we denote
the distribution of fisheries in the English text with
a K-dimensional binary vector, Vf, and similarly,
we denote the distribution of pêches in the French
text with a K-dimensional binary vector, Vp. The
ith bit of Vf indicates whether or not fisheries
occurs in the ith piece of the English text, and
similarly, the ith bit of Vp indicates whether or not
pêches occurs in the ith piece of the French text.
If we take K to be 10, the first three instances of
fisheries in Table 1 fall into piece 2, and the
remaining 16 fall into piece 8. Similarly, the first 4
instances of pêches in Table 2 fall into piece 2, and
the remaining 17 fall into piece 8. Thus,
Vf = Vp = <0, 0, 1, 0, 0, 0, 0, 0, 1, 0>
Now, we want to know if Vf is similar to Vp, and if
we find that it is, then we will suspect that fisheries
→ pêches. In this example, of course, the vectors
are identical, so practically any reasonable
similarity statistic ought to produce the desired
result.
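The representation described above can be sketched in a few lines. This is only an illustrative sketch: the function name kvec, the use of 0-based word offsets, and the toy positions are assumptions made for the example, not details taken from the paper.

```python
def kvec(positions, text_length, k):
    """Return a K-dimensional binary vector marking which pieces contain a word.

    positions   -- 0-based word offsets where the word occurs (assumed input format)
    text_length -- total number of words in the text
    k           -- number of equal-sized pieces
    """
    vector = [0] * k
    for pos in positions:
        # Map the offset to one of k equal-sized pieces; clamp to the last piece.
        piece = min(pos * k // text_length, k - 1)
        vector[piece] = 1
    return vector


# Toy data (not from the Hansards): a 100-word text split into K=10 pieces.
toy_positions = [21, 25, 88]
print(kvec(toy_positions, text_length=100, k=10))
```

Because only presence or absence in a piece is recorded, several occurrences in the same piece collapse into a single 1 bit.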
3. fisheries is not the translation of lections
Before describing how we estimate the similarity of
Vf and Vp, let us see what would happen if we tried
to compare fisheries with a completely unrelated
word, e.g., lections. (This word should be the
translation of elections, not fisheries.)
2. At most, fisheries can account for only 19 instances of pêches,
leaving at least 2 instances of pêches unexplained.
As can be seen in the concordances in Table 3, for 
K=10, the vector is <1, 1, 0, 1, 1, 0, 1, 0, 0, 0>. By
almost any measure of similarity one could 
imagine, this vector will be found to be quite 
different from the one for fisheries, and therefore, 
we will correctly discover that fisheries is not the 
translation of lections. 
To make this argument a little more precise, it 
might help to compare the contingency matrices in 
Tables 5 and 6. The contingency matrices show: 
(a) the number of pieces where both the English 
and French word were found, (b) the number of 
pieces where just the English word was found, (c) 
the number of pieces where just the French word 
was found, and (d) the number of pieces where
neither word was found. 
Table 4: A contingency matrix
                French
English         a      b
                c      d

Table 5: fisheries vs. pêches
                pêches
fisheries       2      0
                0      8

Table 6: fisheries vs. lections
                lections
fisheries       0      2
                4      4
In general, if the English and French words are 
good translations of one another, as in Table 5, then 
a should be large, and b and c should be small. In 
contrast, if the two words are not good translations 
of one another, as in Table 6, then a should be 
small, and b and c should be large. 
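As a rough illustration of how the four cells could be counted from two K-vecs, consider the sketch below. The function name contingency is hypothetical; the two vectors are the identical fisheries and pêches vectors given earlier for K=10.

```python
def contingency(v_english, v_french):
    """Count the cells (a, b, c, d) of the contingency matrix for two K-vecs.

    a: pieces with both words, b: English word only,
    c: French word only,      d: neither word.
    """
    if len(v_english) != len(v_french):
        raise ValueError("both vectors must have the same number of pieces")
    a = b = c = d = 0
    for e, f in zip(v_english, v_french):
        if e and f:
            a += 1
        elif e:
            b += 1
        elif f:
            c += 1
        else:
            d += 1
    return a, b, c, d


v_fisheries = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
v_peches = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
print(contingency(v_fisheries, v_peches))  # (2, 0, 0, 8), as in Table 5
```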
4. Mutual Information 
Intuitively, these statements seem to be true, but we 
need to make them more precise. One could have 
chosen quite a number of similarity metrics for this 
purpose. We use mutual information: 
$$\log_2 \frac{\mathrm{prob}(V_f, V_p)}{\mathrm{prob}(V_f)\,\mathrm{prob}(V_p)}$$
That is, we want to compare the probability of 
seeing fisheries and pêches in the same piece to
chance. The probability of seeing the two words in 
the same piece is simply: 
$$\mathrm{prob}(V_f, V_p) = \frac{a}{a+b+c+d}$$
The marginal probabilities are:
$$\mathrm{prob}(V_f) = \frac{a+b}{a+b+c+d}$$
$$\mathrm{prob}(V_p) = \frac{a+c}{a+b+c+d}$$
For fisheries → pêches, prob(Vf, Vp) = prob(Vf)
= prob(Vp) = 0.2. Thus, the mutual information is
log2 5 or 2.32 bits, meaning that the joint
probability is 5 times larger than chance would
predict. In contrast, for fisheries → lections,
prob(Vf, Vp) = 0, prob(Vf) = 0.2 and prob(Vp) = 0.4.
Thus, the mutual information is log2 0, meaning that
the joint probability is infinitely smaller than
chance would predict. We conclude
that it is quite likely that fisheries and pêches are
translations of one another, much more so than
fisheries and lections.
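The calculation can be checked with a short sketch that takes the four contingency cells as input. The function name mutual_information and the choice to return negative infinity when the joint count is zero are assumptions made for this illustration.

```python
import math


def mutual_information(a, b, c, d):
    """Mutual information (in bits) from contingency cells a, b, c, d."""
    n = a + b + c + d
    p_joint = a / n          # prob(Vf, Vp)
    p_f = (a + b) / n        # prob(Vf)
    p_p = (a + c) / n        # prob(Vp)
    if p_joint == 0:
        # log2(0): the words never share a piece (convention chosen for this sketch).
        return float("-inf")
    return math.log2(p_joint / (p_f * p_p))


print(round(mutual_information(2, 0, 0, 8), 2))  # Table 5: 2.32 bits
print(mutual_information(0, 2, 4, 4))            # Table 6: -inf
```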
5. Significance 
Unfortunately, mutual information is often 
unreliable when the counts are small. For example, 
there are lots of infrequent words. If we pick a pair 
of these words at random, there is a very large 
chance that they would receive a large mutual 
information value by chance. For example, let e be
an English word that appeared just once and let f be
a French word that appeared just once. Then, there
is a non-trivial chance (1/K) that e and f will appear
in the same piece, as shown in Table 7. If this
should happen, the mutual information estimate
would be very large, i.e., log2 K, and probably
misleading.
Table 7:
                f
e               1      0
                0      9
In order to avoid this problem, we use a t-score to 
filter out insignificant mutual information values. 
$$t = \frac{\mathrm{prob}(V_f, V_p) - \mathrm{prob}(V_f)\,\mathrm{prob}(V_p)}{\sqrt{\frac{1}{K}\,\mathrm{prob}(V_f, V_p)}}$$
Using the numbers in Table 7, t=1, which is not
significant. (A t of 1.65 or more would be
significant at the p > 0.95 confidence level.)
Similarly, if e and f appeared in just two pieces
each, then there is approximately a 1/K^2 chance that
they would both appear in the same two pieces, and
then the mutual information score would be quite
high, log2 (K/2), but we probably wouldn't believe it
because the t-score would be only √2. By this
definition of significance, we need to see the two
words in at least 3 different pieces before the result
would be considered significant.
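The threshold of 3 pieces can be made explicit with a short worked derivation. It assumes the t-score has the form (prob(Vf, Vp) - prob(Vf) prob(Vp)) / sqrt(prob(Vf, Vp) / K), and that e and f each occur in exactly n pieces, which happen to be the same n pieces; the symbol n is introduced here only for illustration.

```latex
\mathrm{prob}(V_f, V_p) = \mathrm{prob}(V_f) = \mathrm{prob}(V_p) = \frac{n}{K}
\quad\Longrightarrow\quad
t = \frac{\frac{n}{K} - \frac{n^2}{K^2}}{\sqrt{\frac{1}{K}\cdot\frac{n}{K}}}
  = \sqrt{n}\,\Bigl(1 - \frac{n}{K}\Bigr)
  \approx \sqrt{n} \quad (n \ll K)
```

Under these assumptions, n = 1 gives t of about 1, n = 2 gives about 1.41, and n = 3 gives about 1.73, which is the first value to exceed 1.65.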
This means, unfortunately, that we would reject 
fisheries → pêches because we found them in only
two pieces. The problem, of course, is that we 
don't have enough pieces. When K=10, there 
simply isn't enough resolution to see what's going 
on. At K=100, we obtain the contingency matrix 
shown in Table 8, and the t-score is significant 
(t=2.1). 
Table 8: K=100
                pêches
fisheries       5      0
                1      94
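The significance filter can be sketched as follows, assuming the t-score is the difference between the joint probability and the product of the marginals, divided by sqrt(prob(Vf, Vp) / K). The function name t_score and the handling of a zero joint count are assumptions made for this example; the cell values are those of Tables 7 and 8.

```python
import math


def t_score(a, b, c, d):
    """t-score for a contingency matrix with K = a + b + c + d pieces."""
    k = a + b + c + d
    p_joint = a / k
    p_f = (a + b) / k
    p_p = (a + c) / k
    if p_joint == 0:
        # No shared pieces: treat as maximally insignificant (sketch convention).
        return float("-inf")
    return (p_joint - p_f * p_p) / math.sqrt(p_joint / k)


print(round(t_score(1, 0, 0, 9), 1))   # Table 7 (K=10): 0.9, below 1.65
print(round(t_score(5, 0, 1, 94), 1))  # Table 8 (K=100): 2.1, above 1.65
```

The value for Table 7 comes out slightly below 1, in line with the text's conclusion that a single shared piece is not significant.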
How do we choose K? As we have seen, if we 
choose too small a K, then the mutual information 
values will be unreliable. However, we can only 
increase K up to a point. If we set K to a 
ridiculously large value, say the size of the English 
text, then an English word and its translations are 
likely to fall in slightly different pieces due to 
random fluctuations and we would miss the signal. 
For this work, we set K to the square root of the 
size of the corpus. 
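As a small illustration of this rule of thumb, the sketch below takes the integer square root of a word count. Measuring corpus size in words of one text is an assumption of the example, and choose_k is a hypothetical helper name.

```python
import math


def choose_k(corpus_size_in_words):
    """Pick K as the (integer) square root of the corpus size, at least 1."""
    return max(1, math.isqrt(corpus_size_in_words))


# Using the English word count reported for the Hansard fragment.
print(choose_k(165160))  # 406
```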
K should be thought of as a scale parameter. If we 
use too low a resolution, then everything turns into 
a blur and it is hard to see anything. But if we use 
too high a resolution, then we can miss the signal if 
it isn't just exactly where we are looking. 
Ideally, we would like to apply the K-vec algorithm 
to all pairs of English and French words, but 
unfortunately, there are too many such pairs to 
consider. We therefore limited the search to pairs 
of words in the frequency range: 3-10. This 
heuristic makes the search practical, and catches 
many interesting pairs. 3
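The frequency restriction might be sketched as below. The helper names words_in_range and candidate_pairs, and the toy token lists, are assumptions made for illustration; only the 3-10 range comes from the text.

```python
from collections import Counter
from itertools import product


def words_in_range(tokens, low=3, high=10):
    """Return the words whose frequency lies in [low, high], sorted for stable output."""
    counts = Counter(tokens)
    return sorted(w for w, n in counts.items() if low <= n <= high)


def candidate_pairs(english_tokens, french_tokens, low=3, high=10):
    """All (English, French) pairs where both words pass the frequency filter."""
    return list(product(words_in_range(english_tokens, low, high),
                        words_in_range(french_tokens, low, high)))


# Toy texts: "the" and "les" are too frequent, "vote" and "scrutin" too rare.
english = ["fisheries"] * 3 + ["the"] * 12 + ["vote"]
french = ["pêches"] * 4 + ["les"] * 11 + ["scrutin"] * 2
print(candidate_pairs(english, french))  # [('fisheries', 'pêches')]
```

Each surviving pair would then be scored with mutual information and filtered by the t-score.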
6. Results 
This algorithm was applied to a fragment of the 
Canadian Hansards that has been used in a number 
of other studies: Church (1993) and Simard et al 
(1992). The 30 significant pairs with the largest 
mutual information values are shown in Table 9. 
As can be seen, the results provide a quick-and-dirty
estimate of a bilingual lexicon. When the pair
is not a direct translation, it is often the translation
of a collocate, as illustrated by acheteur → Limited
and Santé → Welfare. (Note that some words in
Table 9 are spelled the same way in English and
French; this information is not used by the K-vec
algorithm.)
Using a scatter plot technique developed by Church
and Helfman (1993) called dotplot, we can visualize
the alignment, as illustrated in Figure 1. The
source text (Nx bytes) is concatenated to the target 
text (Ny bytes) to form a single input sequence of 
Nx+Ny bytes. A dot is placed in position i,j 
whenever the input token at position i is the same 
as the input token at position j. 
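The dot placement rule can be illustrated with a minimal sketch. It works on word tokens rather than bytes and returns a set of coordinates instead of drawing a plot; both simplifications, the function name dotplot, and the toy token lists are assumptions of this example and not a description of the dotplot program itself.

```python
def dotplot(source_tokens, target_tokens):
    """Return the set of (i, j) positions where tokens i and j are identical.

    The source and target are concatenated into one sequence, as in the text.
    """
    tokens = source_tokens + target_tokens
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    dots = set()
    for indexes in positions.values():
        # Every pair of occurrences of the same token yields a dot.
        for i in indexes:
            for j in indexes:
                dots.add((i, j))
    return dots


source = ["VIA", "Rail", "1981"]   # toy "English" tokens
target = ["1981", "VIA"]           # toy "French" tokens
dots = dotplot(source, target)
# Dots linking the source half (i < 3) to the target half (j >= 3):
print(sorted((i, j) for i, j in dots if i < len(source) <= j))  # [(0, 4), (2, 3)]
```

Grouping positions by token avoids comparing every pair of positions directly, although very frequent tokens still produce many dots.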
The equality constraint is relaxed in Figure 2. A 
dot is placed in position i,j whenever the input 
token at position i is highly associated with the 
input token at position j as determined by the 
mutual information score of their respective K-vecs.
In addition, Figure 2 shows a detailed, magnified
and rotated view of the diagonal line. The
alignment program tracks this line with as much 
precision as possible. 
3. The low frequency words (frequency less than 3) would
have been rejected anyway as insignificant.
Table 9: K-vec results
MI    French           English
3.2   Beauce           Beauce
3.2   Comeau           Comeau
3.2   1981             1981
3.0   Richmond         Richmond
3.0   Rail             VIA
3.0   pêches           Fisheries
2.8   Deans            Deans
2.8   Prud             Prud
2.8   Prud             homme
2.7   acheteur         Limited
2.7   Communications   Communications
2.7   MacDonald        MacDonald
2.6   Mazankowski      Mazankowski
2.5   croisière        nuclear
2.5   Santé            Welfare
2.5   39               39
2.5   Johnston         Johnston
2.5   essais           nuclear
2.5   Université       University
2.5   bois             lumber
2.5   Angus            Angus
2.4   Angus            VIA
2.4   Saskatoon        University
2.4   agriculteurs     farmers
2.4   inflation        inflation
2.4   James            James
2.4   Vanier           Vanier
2.4   Santé            Health
2.3   royale           languages
2.3   grief            grievance
7. Conclusions 
The K-vec algorithm generates a quick-and-dirty 
estimate of a bilingual lexicon. This estimate could 
be used as a starting point for a more detailed 
alignment algorithm such as word_align (Dagan et 
al, 1993). In this way, we might be able to apply 
word_align to a broader class of language 
combinations including possibly English-Japanese 
and English-Chinese. Currently, word_align 
depends on char_align (Church, 1993) to generate
a starting point, which limits its applicability to 
European languages since char_align was designed 
for language pairs that share a common alphabet. 
References 
Aho, Kernighan, Weinberger (1980) "The AWK 
Programming Language," Addison-Wesley, 
Reading, Massachusetts, USA. 
Figure 1: A Dotplot of the Hansards
Figure 2: K-vec view of Hansards
Brown, P., J. Cocke, S. Della Pietra, V. Della 
Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. 
Roossin, (1990) "A Statistical Approach to 
Machine Translation," Computational Linguistics, 
vol. 16, pp. 79-85. 
Brown, P., Lai, J., and Mercer, R. (1991) 
"Aligning Sentences in Parallel Corpora," ACL- 
91. 
Brown, P., Della Pietra, S., Della Pietra, V., and 
Mercer, R. (1993), "The mathematics of machine 
translation: parameter estimation," Computational 
Linguistics, pp. 263-312. 
Chen, S. (1993) "Aligning Sentences in Bilingual 
Corpora Using Lexical information," ACL-93, pp. 
9-16. 
Church, K. (1993) "Char_align: A Program for 
Aligning Parallel Texts at the Character Level," 
ACL-93, pp. 1-8. 
Church, K., Dagan, I., Gale, W., Fung, P., 
Helfman, J., Satish, B. (1993) "Aligning Parallel 
Texts: Do Methods Developed for English-French 
Generalize to Asian Languages?" Pacific Asia 
Conference on Formal and Computational 
Linguistics. 
Church, K. and Helfman, J. (1993) "Dotplot: a 
Program for Exploring Self-Similarity in Millions 
of Lines of Text and Code," The Journal of 
Computational and Graphical Statistics, 2:2, pp. 
153-174. 
Dagan, I., Church, K., and Gale, W. (1993)
"Robust Word Alignment for Machine Aided
Translation," Proceedings of the Workshop on
Very Large Corpora: Academic and Industrial
Perspectives, available from the ACL, pp. 1-8.
Gale, W., and Church, K. (1991) "Identifying
Word Correspondences in Parallel Text," Fourth
Darpa Workshop on Speech and Natural
Language, Asilomar.
Gale, W., and Church, K. (1993) "A Program for 
Aligning Sentences in Bilingual Corpora," 
Computational Linguistics, also presented at ACL- 
91. 
Isabelle, P. (1992) "Bi-Textual Aids for
Translators," in Proceedings of the Eighth Annual
Conference of the UW Centre for the New OED
and Text Research, available from the UW Centre
for the New OED and Text Research, University of
Waterloo, Waterloo, Ontario, Canada.
Kay, M. (1980) "The Proper Place of Men and
Machines in Language Translation," unpublished
ms., Xerox, Palo Alto, CA.
Kay, M. and Röscheisen, M. (1993)
"Text-Translation Alignment," Computational
Linguistics, pp. 121-142.
Klavans, J., and Tzoukermann, E., (1990), "The
BICORD System," COLING-90, pp 174-179.
Kupiec, J. (1993) "An Algorithm for Finding Noun 
Phrase Correspondences in Bilingual Corpora," 
ACL-93, pp. 17-22. 
Matsumoto, Y., Ishimoto, H., Utsuro, T. and
Nagao, M. (1993) "Structural Matching of Parallel 
Texts," ACL-93, pp. 23-30. 
Shemtov, H. (1993) "Text Alignment in a Tool for
Translating Revised Documents," EACL, pp. 449-453.
Simard, M., Foster, G., and Isabelle, P. (1992)
"Using Cognates to Align Sentences in Bilingual
Corpora," Fourth International Conference on
Theoretical and Methodological Issues in Machine
Translation (TMI-92), Montreal, Canada.
Warwick-Armstrong, S. and G. Russell (1990)
"Bilingual Concordancing and Bilingual
Lexicography," Euralex.
Wu, D. (to appear) "Aligning Parallel
English-Chinese Text Statistically with Lexical Criteria,"
ACL-94.

Table 1: Concordances for fisheries
28312   Mr. Speaker, my question is for the Minister of Fisheries and Oceans. Allegations have been made
28388   of the stocks ? Hon. Thomas Siddon ( Minister of Fisheries and Oceans ): Mr. Speaker, I tell the
28440   calculation on which the provincial Department of Fisheries makes this allegation and I find that it
128630  private sector is quite weak. Let us turn now to fisheries, an industry which as most important to
128885  The fishermen would like to see the Department of Fisheries and Oceans put more effort towards the p
128907  s in particular. The budget of the Department of Fisheries and Oceans has been reduced to such ate
130887  ' habitation ' ' from which to base his trade in fisheries and furs. He brought with him the first
132282  ase just outside of my riding. The Department of Fisheries and Oceans provides employment for many
132629  and all indications are that the richness of its fisheries resource will enable it to maintain its
132996  taxpayer. The role of the federal Department of Fisheries and Oceans is central to the concerns of
134026  is the new Chairman of the Standing Committee on Fisheries and Oceans. I am sure he will bring a w
134186  ortunity to discuss it with me as a member of the Fisheries Committee. The Hon. Member asked what
134289  he proposal has been submitted to the Minister of Fisheries and Oceans ( Mr. Siddon ) which I hope
134367  ch as well as on his selection as Chairman of the Fisheries Committee. I have worked with Mr. Come
134394  his intense interest and expertise in the area of fisheries. It seems most appropriate, given that
134785  r from Eastern Canada and the new Chairman of the Fisheries and Oceans Committee. We know that the
134796  d Oceans Committee. We know that the Minister of Fisheries and Oceans ( Mr. Siddon ), should we s
134834  ows the importance of research and development to fisheries and oceans. Is he now ready to tell the
134876  research and development component in the area of fisheries and oceans at Bedford, in order that th
Table 2: Concordances for pêches
31547   oyez certain que je présenterai mes excuses. Les pêches L ' existence possible d ' un marché noir e
31590   ésident, ma question s ' adresse au ministre des Pêches et des Océans. On aurait pêché, débarqué
31671   poissons ? L ' hon. Thomas Siddon ( ministre des Pêches et des Océans )
31728   calculs sur lesquels le ministère provincial des pêches fonde ses allégations, et j ' y ai relevé
144855  ivé est beaucoup plus faible. Parlons un peu des pêches, un secteur très important pour l ' Atlant
145100  braconnage. Ils voudraient que le ministère des Pêches et des Océans fasse davantage, particulièr
145121  es stocks de homards. Le budget du ministère des Pêches et des Océans a té amputé de telle sorte qu
148873  endant l ' hiver, lorsque l ' agriculture et les pêches sont peu près leur point mort, bon nombre
149085  xtérieur de ma circonscription. Le ministère des Pêches et des Océans assure de l ' emploi bien d '
149837  s. Dans le rapport Kirby de 1983 portant sur les pêches de la côte est, on a mal expliqué le systè
149960  eniers publics. Le rôle du ministère fédéral des Pêches et des Océans se trouve au centre des préoc
151108  soit le nouveau président du comité permanent des pêches et océans. Je suis sûr que ses vastes conn
151292  avec moi, en ma qualité de membre du comité des pêches et océans. Le député a demandé quelles per
151398  is savoir qu ' elle a té proposée au ministre des Pêches et Océans ( M. Siddon ) et j ' espère qu '
151498  de son choix au poste de président du comité des pêches. Je travaille avec M. Comeau depuis deux
151521  et je connais tout l ' intérêt qu ' il porte aux pêches, ainsi que sa compétence cet gard. Cela s
151936  Est du pays et maintenant président du Comité des pêches et des océans. On sait que le ministre des
151947  êches et des océans. On sait que le ministre des Pêches et des Océans ( M. Siddon ) a, disons, a
151997  recherche et du développement dans le domaine des pêches et des océans. Est - il prêt aujourd ' hui
152049  recherche et du développement dans le domaine des pêches et des océans Bedford afin que ce laboratoi
152168  s endroits où ils se trouvent et l ' avenir des pêches dans l ' Est. Le président suppléant ( M.

Table 3: Concordances for lections
88      de prendre la parole aujourd ' hui. Bien que les lections au cours desquelles on nous a lus la tête
207     ui servent ensemble la Chambre des communes. Les lections qui se sont tenues au début de la deuxièm
12439   n place les mesures de contrôle suffisantes. Les lections approchaient et les libéraux voulaient me
14999   reprendre le contenu de son discours lectoral des lections de 1984. On se rappelle, et tous les Ca
16164   ertainement et s ' en rappelleront aux prochaines lections de tout ce qui aurait pu leur arriver. L
16386   n apercevront encore une fois lors des prochaines lections. Des lections, monsieur le Président,
16389   ncore une fois lors des prochaines lections. Des lections, monsieur le Président, il y en a eu de
16431   avec eux - mêmes l ' analyse des résultats de ces lections complémentaires, constateront qu ' ils o
17419   s et ils réagissent. Ils ont réagi aux dernières lections complémentaires et ils réagiront encore a
17427   émentaires et ils réagiront encore aux prochaines lections. Finalement, monsieur le Président, pa
17438   t, monsieur le Président, parlant de prochaines lections.., j ' coutais mon honorable collègue
17461   M. Layton ) dire tantôt que, antérieurement aux lections de 1984, les gens de Lachine voulaient u
55169   étitions. Je suggérerais au Comité permanent des lections, des privilèges et de la procédure d ' t
56641   ulever cette question au comité des privilèges et lections, car il y a de sérieux doutes sur l ' in
57853   doivent tre renvoyées au comité des privilèges et lections. J ' ai l ' intention d ' en saisir ce c
59027   rêt soumettre la question au comité permanent des lections, des privilèges et de la procédure. J '
67980   le 16 janvier 1986... M. Hovdebo: Après les lections. M. James:... le ministre d ' alors
70161   tinuer faire ce qu ' ils ont fait depuis quelques lections, c ' est - - dire rejeter le Nouveau par
70456   que les gens le retiennent jusqu ' aux prochaines lections. De cette façon vous allez tre rejetés d
103132  donc transmis mon mandat au directeur général des lections, afin de l ' autoriser mettre un nouveau
103186  , deux députés ont avisé le directeur général des lections d ' une vacance survenue la Chambre ; il

