A Bootstrapping Method for Extracting Bilingual Text Pairs 
Hiroshi Masuiehil Raymond Flournoy* 
*Fuii Xerox Co., Ltd. 
Corporate Research Center 
430 Sakai, Nakai-machi, Ashigarakami-gun, 
Kanagawa 259-0157, Japan 
{ masuichi, flournoy, kauflnann, 
Stefan Kaufmann * Stanley Peters * 
• Center for the Study of Language and Information 
Stanford University 
210 Panama Street, Stanford, 
CA 94305-4115, U.S.A. 
peters} @csli.stanford.edu 
Abstract 
This paper proposes a method for extracting 
bilingual text pairs from a comparable cor- 
pus. The basic idea of the method is to ap- 
ply bootstrapping to an existing corpus- 
based cross- information retrieval 
(CLIR) approach. We conducted prelimi- 
nary tests with English and Japanese bilin- 
gual corpora. The bootstrapping method 
led to much better results for the task of ex- 
tracting translation pairs compared with a 
corpus-based CLIR method without boot- 
strapping, and the extracted translation pairs 
could be useftfl training data for improving 
results of the corpus-based CLIR method. 
1 Introduction 
A parallel corpus is an important resource for 
corpus-based approaches to CLIR. These 
approaches use parallel corpora as statistical 
training data and then retrieve documents writ- 
ten in a  different from that of the query. 
One disadvantage of these approaches is lack of 
resources. Parallel corpora are not always 
readily available and those that are available 
tend to be relatively small or to cover only a 
small number of subjects. 
A bilingual comparable corpus is a set of texts 
in two different s from the same do- 
main or on the same topic. Unlike a parallel 
corpus it is composed independently in the re- 
spective  text sets. It can be more 
readily obtained from the Internet or CD-ROM 
resources than parallel corpora. Zanettin 
(1998) introduced several available bilingual 
comparable corpora such as news paper articles 
selected by dates and subject codes, medical 
articles from journals and textbooks, and articles 
for tourists from brochures and guides. Zanet- 
tin (1994) also reported that it is highly likely 
that much relevant information can be found 
across s in a topic-related bilingual 
comparable corpus. In this paper, we propose a 
method for extracting bilingual text pairs which 
share the same information fiom a bilingual 
colnparable corpus, and show the possibility that 
the resulting bilingual text pairs can be useful 
for corpus-based CLIR approaches when we use 
them as training data instead of a parallel corpus. 
Sheridan (1998) also proposed an approach to 
building lnultilingual test collection from com- 
parable corpora consisting of news articles. 
The idea is to reduce the work of manual rele- 
vance judgements by restricting news articles to 
be examined to a couple of days. Disadvan- 
tages to this approach are that it relies on time- 
sensitive texts, texts obtained by this approach 
are constrained to referencing specific events, 
and nontrivial work by hulnans is still necessm'y. 
On the other hand, our goal is to extract bilin- 
gual text pairs automatically from any kind of 
bilingual comparable corpora. 
This paper is organized as follows: Section 
2 introduces the basic idea for extracting 
relevant text pairs from a bilingual comparable 
corpus. Our method is based on a corpus-based 
CLIR method, so we overview previous corpus- 
based CLIR approaches in Section 3. Section 4 
describes an experimental procedure, the results 
it produced, and an analysis of the results. The 
conclusion is given in Section 5. 
2 The Basic Idea 
As we will describe in Section 3, several CLIR 
approaches that rely on parallel corpora have 
been proposed and lead to successful retrieval 
results. In those approaches, a parallel corpus 
used as training data should be large enough to 
obtain good retrieval results. Although we use 
a CLIR method which relies on a parallel corpus, 
we begin with a very small parallel corpus. We 
retrieve bilingual text pairs from a bilingual 
comparable corpus using the small parallel cor- 
pus as training data. Then we concatenate the 
text pairs to the initial small parallel corpus and 
grow the parallel corpus by iterating the retrieval 
and concatenation processes (Figure 1). 
1066 
I comparable COlptlS I 
Figure 1: The bootstrapping method 
This kind of bootstrapping method has a 
problem, however: It is highly sensitive to the 
accuracy of the text pairs obtained in the early 
stages of the iterations. In order to solve this 
problem, we concatenate only a small number of 
the most "reliable" text pairs to the initial paral- 
lel corpus in the early stages, then gradually 
increase the number of the text pairs which are 
concatenated to the initial parallel corpus. We 
will describe the details of the method in Section 4. 
3 Corpus-based CLIR approaches 
3.1 Previous Researches 
As we mentioned in Section 2, we use a CLIR 
method which relies on a parallel corpus in our 
bootstrapping method. One approach to cor- 
pus-based CLIR is to use the Latent Semantic 
Indexing technique proposed by Fumas et al. 
(1988) on a parallel corpus to construct a lan- 
guage illdependent representation el' queries and 
documents (Landauer and Lfltman, 1990). 
Another approach that relies on a parallel corpus 
has been suggested by l)unning and l)avis 
(1993). Their method is based on the vector 
space model and involves the linear trausforula- 
tion of the representation of a query. A parallel 
corpus can also be used to enhance existing 
knowledge-based resources. The resources ark 
used to translate the query and then classical IR 
matching techniques are applied to compute the 
similarity between the trauslated query and 
documents (Hull and Grel'enstette, 1996). 
3.2 hfforlnation Mapping for CLIR 
For our bootstrapping method, we adopted a 
CLIR method which is based on the hfforma- 
tion Mapping approach (Masuichi et al., 
1999). Information Mapping is basically a 
wlriant of the vector space model, and is based 
on an approach first proposed by Schtitze 
(1995). The approach is closely related to 
Latent Semantic Indexing, and the dilTerence 
between these two is discussed in Schfitze and 
Pedersen (1997). Note that our bootstrapping 
method does not depend on any particuhu" 
properties o1' the Information Mapping ap- 
proach, so it could employ other corpus-based 
CLIR methods such as Latent Semantic in- 
dexing. 
Information Mapping begins with a large 
word-by-word matrix. A list of n content- 
bearing words and m w)cabulary words corre- 
spond to the columns and the rows of the 
matrix. The most fiequently appearing n 
words in a training corpus are selected as 
content-bearing words and the most frequently 
appearing m words as vocabulary words. 
Each cell of the matrix holds the nmnber of 
total cooccurrences between a content-bearing 
word and a vocabulary word in the training 
corpus. In this way, an n-dimensional vector 
which represents the word's distributional 
behavior is produced t'or each vocabulary 
word. Then the original n-dimensional vec- 
tor space is converted into a condensed, lower- 
dilnensional, real-valued matrix using Singular 
Value l)ecomposition (SVD) (Berry, 1992). 
The lower-dimensional vector space is called 
word space. A document vector and a query 
vector are calculated by summing the vectors 
corresponding to the vocabulary words in the 
document or the query, and the proximity 
between the two vectors is del'ined as the cosi- 
ne of the angle between them. 
To apply this method to CL1R, we regard 
each translation pair in a training parallel 
corpus of  LI and L2 as a single 
COlnpotmd document and create a word-by- 
word matrix and then a word space. The 
word space represents a hmguage independent 
vector space for vocabulary words in both 1,1 
and L2, and therefore query and document 
vectors in both LI and L2 can be calculated 
and compmed in the salne word space. 
4 Experimental tests and Results 
4.1 Tests with complete-pair corpora 
We used an English-Japanese bilingual patent 
text corpus for our experilnental tests. For 
our first test, we prepared I000 English- 
Japanese patent text pairs as a pseudo bilin- 
gual comparable corpus. For each Japanese 
patent text in the corpus, its English transla- 
tion by humans exists j, so this corpus could be 
regarded as an ideal bilingual comparable 
corpus. We also prepared 100 pairs as an 
initial parallel corpus (a training corpus) to 
create an initial word space. All the patents 
The quality of the translations wtrics greatly from 
word-for-word translations to short sunnnaries. 
1067 
in the two corpora were randomly selected 
from the Japanese patents issued in 1991, and 
the two corpora shared no patent. We used 
only the title and abstract texts and removed 
all other information, such as author, patent ID 
and issue date. Table 1 shows an example of 
an English-Japanese pair in the corpora. All 
characters in the English texts are l-byte char- 
acters and all characters, including alphabeti- 
cal and numerical characters, in the Japanese 
texts are 2-byte, so there is no word which is 
shared by both English and Japanese texts. 
We used all words which appeared in a train- 
ing corpus as vocabulary words, and the most 
frequently appearing 3000 English words as 
content-bearing words and then reduced the 
dimension of the vectors from 3000 to 200 by 
SVD. 
llose liar 'l'ransl~zrring Fertilizer from Fcflilizer Tmlki of Mobile \["arm Machine Abslracl: 
PROlll ,I~M TO lie .SOI.VI.~D: To provide a mechanism To arrange a ferlilizer Imnsli~r bose 
from a ferlilizer lallk wilhoul catlsing hindrance Io lhe olher mcchaniSlllS, t}lc. SOI.UTION: 
A fertilizer Irans fer hose 38 Io deliver a f¢~lilizer rrllll| il fcriilizer lank 31 placed al ~1 side of 
a mobile machine l~dy I Io lhe downslream side of a ferlilizing par128 is laid along the 
oilier circulllli?rellCe of a passage 23 placed ~l\[Ollg die back and a side o1" a drivcl's seal 8 and 
exlending \[ioln Ihe driver's seal 8 Io a working inacbille I I. 
1 ~ll:~ I-/'6.~tzltE~t')x-'ga 1 h',6 ll~llEt~t128 T ~'~IV'IIE~/I~.J~'C'~IIB,I~,--7,.a 8-'2, 
Table 1 : An example of an English- 
Japanese patent pair 
We began with a word space created from 
the 100 English-Japanese translation pairs (the 
initial parallel corpus). Then using the word 
space, we calculated 1000 English patent 
vectors and 1000 Japanese patent vectors 
which correspond to the patent texts in the 
pseudo comparable corpus. Next we extract- 
ed English-Japanese patent pairs which satis- 
fied the simple condition that the English 
patent vector in the pair has the highest prox- 
imity (the biggest cosine) with the Japanese 
patent vector in the pair among the 1000 Ja- 
panese patent vectors, and vice versa (hereaf- 
ter we call these pairs mutual-proximity pairs). 
Note that mutual-proximity pairs are, of 
course, not always correct translation pairs. 
Then we selected the 10 most "reliable" mu- 
tual-proximity pairs, assuming that the higher 
the proximity between the two vectors of a 
mutual-proximity pair, the more reliable the 
mutual-proximity pair is. Finally we con- 
catenated the 10 mutual-proximity pairs to the 
initial 100 translation pairs. This is the first 
stage of our bootstrapping method. 
In the second stage, we created a new word 
space regarding the 110 English-Japanese 
pairs obtained in the first stage as a training 
corpus. Then we selected the 20 most reli- 
able mutual-proximity pairs and concatenated 
them to the initial 100 patent translation pairs. 
At the Nth stage, we selected the N* 10 most 
reliable lnutual-proximity pairs. If the num- 
ber of the nmtual-proximity pairs obtained in 
the stage is less than N*I0, all of the mutual- 
proximity pairs were concatenated to the ini- 
tial 100 patent translation pairs. 
We repeated this procedure up to the 100th 
stage. At the 100th stage, we obtained 727 
mutual-proximity pairs and 721 pairs out of 
the 727 pairs were correct translation pairs. 
Therefore the recall of the obtained pairs was 
72.1% (721/1000) and the precision was 
99.2% (721/727) (see the column of Testl and 
the row of the "bootstrapping method" of 
Table 2). On the other hand, we obtained 
341 mutual-proximity pairs and 258 pairs out 
of the 341 pairs were correct translation pairs 
in the case of the normal Information Mapping 
method which corresponds to the first stage of 
our bootstrapping method. In this case, the 
recall was 25.8% and the precision was 75.7% 
(see the column of Testl and the row of the 
"normal method" of Table 2). 
I 0C, 
8(2 
6(: 
,I(i 
2( 
~si/m (e~) 
~ recall ( ~ ) 
, I . I , I , I , 
20 40 6(1 80 100 
Figure 2: The change of precision and recall 
with complete-pair corpus 
Figure 2 shows the change of the precision 
and the recall through the 100 stages. The 
precision was kept over 93.3% and the recall 
went up gradually. We could successfully 
grow the bilingual text pairs using bootstrap- 
ping. 
llormal 
method 
boot- 
strapping 
method 
Prec 75.7 
Rec 25.8 
Prec 99.2 
Rec 72.1 
75.6 76.6 
26.6 26.9 
99.1 99.7 
74.0 73.0 
78.2 72.8 
25.4 27.1 
98.9 98.7 
71.0 70.6 
Table 2: Results of extracting tests with 
complete-pair corpus 
1068 
We prepared 4 more different sets of 1000 
pairs 1'or pseudo comparable corpora and dif- 
ferent sets of 100 pairs for initial parallel 
corpora, and repeated the same test 4 more 
times. Table 2 shows results of the 5 tests of 
the bootstrapping method and the normal 
Information Mapping method. In each case the 
bootstrapping method could drastically im- 
prove both the precision and the recall. 
We also conducted tests to see if the result- 
ing text pairs obtained at the 100th stage in the 
previous tests are useful for the normal Infer 
marion Mapping method. We prepared an- 
other 1000 English-Japanese patent translation 
pairs for each of the 5 previous tests as 
evaluation corpora. No same patents were 
shared between any two of all the corpora. 
We extracted mutual-proximity pairs froln the 
new 1000 English-Japanese pair with the nor- 
mal Information Mapping method, using (1) 
the initial parallel corpus in the previous test, 
(2) the initial parallel corpus + the mutual- 
proximity pairs obtained in the previous test, 
(3) the initial parallel corpus + the 1000 Eng- 
lish-Japanese correct translation pairs in the 
pseudo comparable corpus of the previous test, 
as a training corpus respectively. For exam- 
ple, in Test 1, the number of pairs in the 
refining corpus is 100 for (1), 827 with 6 error 
pairs for (2) and 1100 for (3). 
the Introduction, it is highly likely that a real 
bilingual comparable corpus includes bilingual 
pairs which share the same information, but it 
also includes a lot of irrelevant texts. To 
simulate this, we replaced half of the Japanese 
patent texts in the pseudo comparable corpora 
of the previous tests with different Japanese 
patent texts which were randomly selected. 
Therefore the corpus included 500 English- 
Japanese translation pairs, and 500 English 
patents and 500 Japanese patents which were 
totally irrelevant to each other. 
100 
80 
60 
40 
20 
0 
I I I I 
~f~fl ~ I , I rcc dll(%) , 
20 40 60 80 100 
Figure 3: The change of precision and recall 
with 50%-error-pair corpus 
lh'cC initial 
pairs Rcc 
inilail + Prcc boot- 
stmpping pmrs Rec 
initail + Prec complete 
pmrs Rcc 77.5 79.3 79.0 77.4 
Table 3: Results of ewduat\]on 
complete-pair corpus 
77.5 77.3 
23.8 29.3 
98.9 98.7 
74.5 75.0 
99.0 99.1 
73.3 
26.1 
98.8 
75.1 75.0 
99.6 98.7 
75.6 75.4 
25.1 25.8 
99. I 99.2 
73.5 
98.7 
78.6 
tests for 
Table 3 shows the results. The results of 
(3) can be considered as the ceilings of the 
precision aud the recall, because we used all 
the correct translation pairs in the pseudo 
comparable corpus. In each case, both the 
precision and the recall of (2) is very close to 
the ceilings, so we think the bilingual text 
pairs obtained by our bootstrapping method is 
useful as a training corpus for the normal 
Information Mapping method. 
4.2 Tests with incomplete-pair corpora 
In the tests described above, we used the ideal 
pseudo COlnparable corpus. As described in 
nolill~|\] 
mclhod 
boor 
stral~ping 
method 
Prcc 55.3 
Rcc 28.4 
l'rcc 82.1 
P, cc 69.8 
50.0 52.8 
26.8 28,0 
81.4 83.8 
70.8 67.4 
46.6 53,5 
25.(} 29.4 
81.0 80,7 
67.4 69.2 
Table 4: Results of extracting tests 
with 50%-error-pair corpus 
l'rec 77.5 77.3 73.3 75.6 initial 
pairs Rec 23.8 29.3 26. I 25. I 
initail + l'rec 96.2 95.7 93.4 93.3 boot- 
strapping pairs Rec 61. I 61.9 59.8 57.4 
initail + Prec 98.4 98.7 97.9 98,0 complete 
pan's Roe 66.1 70.7 68.6 69.0 71.1 
Table 5: Results of evaluation tests 
for 50%-error-pair corpus 
75.4 
25.8 
95.9 
60.5 
98.9 
Results are shown in Figure 3, Table 4 and 
Table5, which correspond to Figure 2, Table 2 
and Table 3 respectively. 
1069 
I(~) I I I I 
80 
60 
40 
20 
0 
+ recall (%) 
¢ 
20 40 60 80 100 
Figure 4: The change of precision and recall 
with 80%-error-pair corpus 
llOrlllal 
method 
boot- 
strapping 
method 
l'rcc 23.4 
Rec 27.5 
Prec 52.5 
Rec 58.0 
17.8 20.7 
20.0 24.5 
55.2 50.9 
53.0 55.O 
21.1 27.0 
23.5 30.0 
53.2 53.4 
53.5 50.2 
Table 6: Results of extracting tests 
with 80%-error-pair corpus 
initial 
pairs 
inilail + 
boot- strapping 
paws 
initail + complete 
pailS 
Table 7: Results of evaluation tests 
(5, for 80 N-error-pair corpus 
Prec 
Rcc 
Prcc 
Rcc 
Prcc 
Rcc 
77.5 
23.8 
82.9 
37.9 
96.1 
54.7 
77.3 73,3 
29.3 26.1 
85.5 81.1 
36.3 35.3 
96.4 96.0 
58.7 55.0 
75.6 
25. I 
83.5 
33.5 
94.0 
55.3 
75.4 
25.8 
85.4 
33.4 
95.7 
53.4 
Figure 4, Table 6 and Table 7 show results in 
the case that we replaced 80% of Japanese pat- 
ent texts with irrelevant Japanese patent texts. 
The results of these tests are not as good as 
the results of tests with the ideal pseudo compa- 
rable corpora. Figure 4 and 6 show, however, 
the bootstrapping method iml?roved both the 
precision and recall of the extracted text pairs as 
compared to the normal method. Figure 5 and 
7 also show that the bilingual text pairs obtained 
by the bootstrapping method are still useful as a 
training corpus for the normal method. 
5 Conclusion 
We proposed a lnethod of extracting bilingual 
text pairs from a comparable corpus. The 
method is based on an existing corpus-based 
CLIR method and uses bootstrapping. Alt- 
hough our research is in the preliminary stage of 
development and tested with artificial corpora 
consisting of English and Japanese patent texts, 
the bootstrapping led to nmch better results for 
the task of extracting translation pairs than the 
results produced by a normal CLIR method, and 
the extracted translation pairs could be useful for 
improving the results of the normal CLIR when 
we used them as a training corpus. 

References 

Berry, M. W. (1992) Large Scale Singular Vahte Computations. International Journal of Supercom- 
puter Applications, 6/1, pp. 13-49. 

Dunning T. E. and Davis M. W. (1993) Multi-lingual information retrieval. Computational Memoranda 
iri Co~aitive and Computer Science MCCS-93-252, New-Mexico State University, Computing Re- 
search Laboratory. 

Furnas, G. W., Dcerwester, S., Dumais, S. T., Lan- daucr, T. K., H~rshlnan, R. A., Streeter, L. A. and 
Lochbaum, K. E. (1988) lnfbrmation retrieval us- ing a singular value decomj)osition model of latel!t 
semantic structure. In proceedings of die l lth ACM International Conference on Research and 
Development in hfformation Retrieval, pp. 465- 480. 

Hull, D. and Grefenstette, G. (1996) Quelying across s." A dictionaty-bas'ed apRroach to mul- 
tilinettal infomtation retrieval. In Proceedings of SIGIR'96, \[~p. 49-57. 

Landauer, T. K. and Littman, L. M. (1990) Fully automatic cross- document retrieval using 
latettt semantic indexittg. In Proceedintzs of lhe 6lli Conference of Univcrsi.ly ol' Wterloo Centre for the 
New Oxford English Dictionary and Text Research, pp. 31-38. 

Masuichi, H., Flournoy, R., Kaufinann, S. and Peters, S. (1999) Query, Translation Method for Cross 
Language h!forthation Retrieval. In Proc'eedinjg8 of the "Worksh6p on Machine Translation lor Cross 
Lantzua~ze Inlormation Retrieval, MT Summit VII, pp. 30-S4. 

Schfitze, H. (1995) Ambiguity Resohttion in Lcm- guage Learning: Computa'tional atzd Cognitive 
9 ) Z Models. \[ hD tlicsis, Stanford University, l)cpart- 
mcnt of Linguistics. 

Schi.itze, H. and Pederscn, J. (1997)A coocur-retlce- 
based thesaurus attd two al~lglications to in- .folT~lation retrieval. Informatmn Processing & 
management, 33/3, pp. 307-318. 

Sheridan, P., Ballerini, J. P. and Schfinble, P. (1998) Building a large multilingual test,, collection from 
conqmrable news documents. In 'Cross-Lanfam~c Information Retrieval", Kluwer Academic PuNis'fi- 
ers, pp. 137-150. 

Zanettin, F. (1994) Parallel Words: Designing a Bilingual Database lk~r Translation Actiwties. In 
"Corpora in Language Education and Research: a Selection of Papers fi'om TALC 94", Lancaster 
University, UK, pp. 99-111. 

Zanettin, F. (1998) Bilingual comp(Irable con)era and the training of translators. In META, XLIII, 
4, Special Issue. The corpus-based ap.proach: a new paradigm in translation s'tudics", pp. 616-630. 
