Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 60–67,
Sydney, July 2006. ©2006 Association for Computational Linguistics
A fast and accurate method for detecting English-Japanese parallel texts
Ken'ichi Fukushima, Kenjiro Taura and Takashi Chikayama
University of Tokyo
ken@tkl.iis.u-tokyo.ac.jp
{tau,chikayama}@logos.ic.i.u-tokyo.ac.jp
Abstract
A parallel corpus is a valuable resource used in many fields of multilingual natural language processing. One of the most significant problems in using parallel corpora is their limited availability. Researchers have investigated approaches to collecting parallel texts from the Web. A basic component of these approaches is an algorithm that judges whether a pair of texts is parallel or not. In this paper, we propose an algorithm that accelerates this task without losing accuracy by preprocessing a bilingual dictionary as well as the collection of texts. This method achieved a throughput of 250,000 pairs/sec on a single CPU, with a best F1 score of 0.960 on the task of detecting 200 Japanese-English translation pairs out of 40,000. The method is applicable to texts of any format, and is not specific to HTML documents labeled with URLs. We report the details of these preprocessing methods and the fast comparison algorithm. To the best of our knowledge, this is the first reported experiment of extracting Japanese-English parallel texts from a large corpus based solely on linguistic content.
1 Introduction
A “parallel text” is a pair of texts written in different languages that are translations of each other. A compilation of parallel texts offered in a serviceable form is called a “parallel corpus.” Parallel corpora are very valuable resources in various fields of multilingual natural language processing such as statistical machine translation (Brown et al., 1990), cross-lingual IR (Chen and Nie, 2000), and dictionary construction (Nagao, 1996).
However, it is generally difficult to obtain parallel corpora of sufficient quantity and quality. Only a few varieties of parallel corpora exist. In addition, their languages have been biased toward English-French, and their contents toward official documents of governmental institutions or software manuals. Therefore, it is often difficult to find a parallel corpus that meets the needs of specific research.
To solve this problem, approaches to collecting parallel texts from the Web have been proposed. On the Web, all sorts of languages are used, though English is dominant, and the content of the texts seems to be as diverse as human activities themselves. Therefore, this approach has the potential to overcome the limitations in the use of parallel corpora.
Previous works successfully built parallel corpora of interesting sizes. Most of them utilized URL strings or HTML tags as clues to efficiently find parallel documents (Yang and Li, 2002; Nadeau and Foster, 2004). Depending on such information specific to webpages limits the applicability of these methods. Even for webpages, many parallel texts not conforming to the presupposed styles will be left undetected. In this work, we have therefore decided to focus on a generally applicable method, based solely on the textual content of the documents. The main challenge then is how to make judgements fast.
Our proposed method utilizes a bilingual dictionary which, for each word in one language, gives the list of its translations in the other. The method preprocesses both the bilingual dictionary and the collection of texts to make the comparison of text pairs in a subsequent stage faster. A comparison of a text pair is carried out simply by comparing two streams of integers, without any dictionary or table lookup, in time linear in the sum of the two text sizes. With this method, we achieved a throughput of 250,000 pairs/sec on a single Xeon CPU (2.4 GHz). The best F1 score is 0.960, for a dataset which includes 200 true pairs out of 40,000 candidate pairs. Further comments on these numbers are given in Section 4.
In addition, to the best of our knowledge, this is the first reported experiment of extracting Japanese-English parallel texts using a method based solely on their linguistic contents.
2 Related Work
There have been several attempts to collect parallel texts from the Web. We mention two contrasting approaches among them.
2.1 BITS
Ma and Liberman collected English-German parallel webpages (Ma and Liberman, 1999). They began with a list of websites belonging to domains associated with German-speaking areas and searched for parallel webpages on these sites. For each site, they downloaded a subset of the site to investigate what languages it is written in, and then downloaded all pages if it proved to be English-German bilingual. For each pair of English and German documents, they judged whether the two are mutual translations, in the following manner. First, they searched a bilingual dictionary for all English-German word pairs in the text pair. If a word pair is found in the dictionary, it is recognized as evidence of translation. Finally, they divided the number of recognized pairs by the sum of the lengths of the two texts and regarded this value as a score of translationality. When this score is greater than a given threshold, the pair is judged to be a mutual translation. They succeeded in creating a parallel corpus of about 63 MB with 10 machines in 20 days.
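Ma and Liberman's scoring step can be sketched as follows. This is a minimal illustration under our own naming, and it omits the distance window BITS applies when enumerating word pairs; the dictionary format is an assumption, not taken from their paper.

```python
def bits_score(words_a, words_b, dictionary):
    """Count word pairs appearing in the bilingual dictionary and
    normalize by the combined text length (a score of translationality).
    `dictionary` maps a word of language A to a set of its translations."""
    evidence = sum(
        1
        for a in words_a
        for b in words_b
        if b in dictionary.get(a, ())
    )
    return evidence / (len(words_a) + len(words_b))
```

Note that the double loop makes a single comparison quadratic in the text length, which is exactly the cost our method avoids.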
The number of webpages is considered to have increased far more rapidly than the performance of computers over the past seven years. Therefore, we think it is important to reduce the computational cost of such a system.
2.2 STRAND
If we simply make a decision for all pairs in a collection of texts, the calculation takes Ω(n²) comparisons of text pairs, where n is the number of documents in the collection. In fact, most researchers utilize properties peculiar to certain parallel webpages to reduce the number of candidate pairs in advance. Resnik and Smith focused on the fact that a page pair tends to be a mutual translation when their URL strings meet a certain condition, and examined only page pairs which satisfy it (Resnik and Smith, 2003). A URL string sometimes contains a substring which indicates the language in which the page is written. For example, a webpage written in Japanese sometimes has a substring such as j, jp, jpn, n, euc or sjis in its URL. They regard a pair of pages as a candidate when their URLs match completely after removing such language-specific substrings, and only for these candidates did they make a detailed comparison with a bilingual dictionary. They were successful in collecting 2190 parallel pairs from 8294 candidates. However, this URL condition seems too strict for the purpose, as they found only 8294 candidate pairs in as much as 20 terabytes of webpages.
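The URL-matching condition can be illustrated with a small sketch. The marker list and the regular expression below are our own illustrative assumptions, not Resnik and Smith's actual patterns:

```python
import re

# language-specific substrings, only when delimited by -, _, . or /
LANG_MARKERS = re.compile(r"(?<=[-_./])(jpn|jp|j|n|euc|sjis|eng|en|e)(?=[-_./]|$)")

def url_key(url):
    """Strip language markers so that the URLs of mutual translations
    compare equal (a rough sketch of the STRAND-style heuristic)."""
    return LANG_MARKERS.sub("", url)
```

Two pages whose URLs differ only in a language marker, such as `index-j.html` and `index-e.html`, then map to the same key and become a candidate pair.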
3 Proposed Method
3.1 Problem settings
There are several evaluation criteria for parallel text mining algorithms. They include accuracy, execution speed, and generality. We say an algorithm is general when it can be applied to texts of any format, not only to webpages with associated webpage-specific information (e.g., URLs and tags). In this paper, we focus on developing a fast and general algorithm for determining whether a pair of texts is parallel.
In general, there are two complementary ways to improve the speed of parallel text mining. One is to reduce the number of “candidate pairs” to be compared. The other is to make a single comparison of two texts faster. An example of the former is Resnik and Smith's URL matching method, which is able to mine parallel texts from very large corpora of terabytes. However, this approach is very specific to the Web and, even if we restrict our interest to webpages, there may be a significant number of parallel pages whose URLs do not match the prescribed patterns and are therefore filtered out. Our method is in the latter category, and is generally applicable to texts of any format. The approach depends only on the linguistic content of texts. Reducing the number of comparisons while maintaining generality will be one of our future works.

Figure 1: Outline of the method

The outline of the method is as follows. First, we preprocess a bilingual dictionary and build a mapping from words to integers, which we call “semantic IDs.” Texts are then preprocessed, converting each word to its corresponding semantic ID plus the position of its occurrence. Then we compare all pairs of texts using their converted representations (Figure 1). Comparing a pair of texts is fast because it is performed in time linear in the length of the texts and does not need any table lookup or string manipulation.
3.2 Preprocessing a bilingual dictionary
We take only nouns into account in our algorithm. For the language pair of English and Japanese, the correspondence between the part of speech of a word and that of its translation is not so clear, and may make the problem more difficult. Results were actually worse when every open-class word was considered than when only nouns were.
The first stage of the method is to assign an integer called a semantic ID to every word (in both languages) that appears in the bilingual dictionary. The goal is to assign the same ID to a pair of words that are translations of each other. In an ideal situation where each word of one language corresponds one-to-one with a word of the other language, all we would need to do is assign a different ID to every translational relationship between two words. The main purpose of this conversion is to make the comparison of two texts in a subsequent stage faster.
However, it is not exactly that simple. A word very often has more than one word as its translation, so the naive method described above is not directly applicable. We devised an approximate solution to address this complexity. We build a bigraph whose nodes are the words in the dictionary and whose edges are the translational relationships between them. This graph consists of many small connected components, each representing a group of words that are expected to have similar meanings. We then make a mapping from each word to its semantic ID. Two words are considered translations of each other when they have the same semantic ID.
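The grouping can be implemented with a standard union-find structure. The following is a sketch under our own naming; words are tagged with their language so that identical spellings in the two languages stay distinct:

```python
def assign_semantic_ids(translation_pairs):
    """Group dictionary words into connected components of the
    translation graph and number each component with a semantic ID."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for ja, en in translation_pairs:
        # an edge joins a Japanese word with its English translation
        ra, rb = find(("ja", ja)), find(("en", en))
        if ra != rb:
            parent[ra] = rb

    # number the components consecutively
    ids, semantic_id = {}, {}
    for w in parent:
        root = find(w)
        semantic_id[w] = ids.setdefault(root, len(ids))
    return semantic_id
```

Every word reachable through any chain of translational edges receives the same ID, which is exactly what produces both the good and the bad side-effects discussed next.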
This method causes a side-effect of connecting two words not directly related in the dictionary. This has both good and bad effects. A good effect is that it may connect two words that do not explicitly appear as translations in the dictionary, but are used as translations in practice (see Section 4.3). In other words, new translational word pairs are detected. A bad effect, on the other hand, is that it potentially connects many words that do not share meanings at all. Figure 2 shows an actual example of such an undesirable component observed in our experiment. One can go from “fruit” to “army” through several hops, and these words are treated as an identical entity in subsequent steps of our technique. Furthermore, in the most extreme case, a very large connected component can be created. Table 1 shows the statistics of the component sizes for the English-Japanese dictionary we used in our experiment (EDR Electronic Dictionary).
Figure 2: Example of an undesirable graph
Most components are fairly small (< 10 words). The largest connected component, however, consisted of 3563 nodes out of the 28001 nodes in the entire graph, and 3943 edges out of 19413. As we will see in the next section, this had a devastating effect on the quality of judgements, so we clearly need a method that circumvents this situation. One possibility is to simply drop very large components. Another is to divide the graph into small components. We have tried both approaches.
Table 1: Statistics of the component sizes

# of nodes   # of components
2            6629
3            1498
4             463
5             212
6             125
7              69
8              44
9              32
10∼           106
For partitioning graphs, we used a very simple greedy method. Even though a more sophisticated method taking advantage of linguistic insights may be possible, this work uses a very simple partitioning method that only looks at the graph structure. A graph is partitioned into two parts having an equal number of nodes, and the partitioning is performed recursively until each part becomes smaller than a given threshold. The threshold is chosen so that it yields the best result for a training set and is then applied to the test data. For each bisection, we begin with a random partition and improve it by a local greedy search. Given the current partition, it seeks the pair of nodes which, if swapped, maximally reduces the number of edges crossing the two parts. Ties are broken arbitrarily when there are many such pairs. If no single swap reduces the number of edges across parts, we simply stop (i.e., local search). A semantic ID is then given to each part.
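The bisection step can be sketched as a best-improvement local search. The function below is a simplified illustration under our own naming, not the authors' implementation:

```python
import random

def bisect(nodes, edges, seed=0):
    """Split `nodes` into two equal halves at random, then repeatedly
    apply the swap of one node from each side that most reduces the
    number of crossing edges; stop when no swap improves the cut."""
    order = list(nodes)
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    side = {v: i < half for i, v in enumerate(order)}

    def cut():
        return sum(1 for a, b in edges if side[a] != side[b])

    best = cut()
    while True:
        swap = None
        left = [v for v in order if side[v]]
        right = [v for v in order if not side[v]]
        for a in left:
            for b in right:
                side[a], side[b] = side[b], side[a]  # try the swap
                c = cut()
                side[a], side[b] = side[b], side[a]  # undo it
                if c < best:
                    best, swap = c, (a, b)
        if swap is None:
            return side, best
        a, b = swap
        side[a], side[b] = side[b], side[a]  # apply the best swap
```

Applied recursively to each half, this yields parts below the size threshold while keeping the halves balanced.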
This process loses the connections between words that are translations in the original dictionary but are separated by the partitioning. We describe a method to partially recover this loss at the end of the next section, after describing how texts are preprocessed.
3.3 Preprocessing texts
Each text (document) is preprocessed as follows. Texts are segmented into words and tagged with parts of speech. Inflection problems are addressed with lemmatization. Each word is converted into the pair (nid, pos), where nid is the semantic ID of the partition containing the word and pos its position of occurrence. The position is normalized and represented as a floating point number between 0.0 and 1.0. Any word which does not appear in the dictionary is simply ignored. The position is used to judge whether words having an equal ID occur in similar positions in both texts, so that they suggest a translation.
After converting each word, all (nid, pos) pairs are sorted, first by their semantic IDs, breaking ties with positions. This sorting takes O(n log n) time for a document of n words. This preprocessing needs to be performed only once for each document.
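The per-document conversion can be sketched as follows (the names are ours; `semantic_id` stands for the word-to-ID mapping built in Section 3.2):

```python
def preprocess_text(words, semantic_id):
    """Convert a lemmatized token list into the sorted sequence of
    (nid, pos) pairs: nid is the word's semantic ID and pos its
    occurrence position normalized into [0.0, 1.0].
    Out-of-dictionary words are ignored."""
    n = len(words)
    seq = [
        (semantic_id[w], i / (n - 1) if n > 1 else 0.0)
        for i, w in enumerate(words)
        if w in semantic_id
    ]
    seq.sort()  # by semantic ID, ties broken by position
    return seq
```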
We recover the connections between word pairs separated by the partitioning in the following manner. Suppose words J and E are translations of each other in the dictionary, J is in a partition whose semantic ID is x, and E is in another partition whose semantic ID is y. In this case, we translate J into the two elements x and y. The result is as if two separate words, one in component x and another in y, appeared in the original text, so it may potentially have an undesirable side-effect on the quality of judgement. It is therefore important to keep the number of such pairs reasonably small. We experimented with both cases, one in which we recover separated connections and one in which we do not.
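The recovery step can be sketched with a hypothetical helper (`partition_id` maps a word to the ID of its part, and `translations` lists its direct dictionary translations; both names are our own):

```python
def recovered_ids(word, translations, partition_id):
    """Emit the word's own partition ID plus the IDs of the partitions
    holding its direct dictionary translations, so that pairs cut by
    the partitioning can still match."""
    ids = {partition_id[word]}
    for t in translations.get(word, ()):
        ids.add(partition_id[t])
    return sorted(ids)
```

A word whose translations were all kept in its own part emits a single ID, so the sequence length only grows where the partitioning actually cut an edge.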
3.4 Comparing document pairs
We judge whether a text pair is likely to be a translation by comparing the two sequences obtained by the preprocessing. We count the number of word pairs that have an equal semantic ID and whose positions are within a distance threshold. The best threshold is chosen to yield the best result for the training set and is then applied to the test set. This process takes time linear in the length of the texts since the sequences are sorted. First, we set cursors at the first element of each of the two sequences. When the semantic IDs of the elements under the cursors are equal and the difference between their positions is within the threshold, we count them as evidence of translationality and move both cursors forward. Otherwise, the cursor on the element which is lesser according to the sorting criteria is moved forward. In this step, we do not perform any further search to determine whether the original words of the elements were related directly in the bilingual dictionary, giving preference to speed over accuracy. We repeat this operation until either of the cursors reaches the end of its sequence. Finally, we divide the number of matching elements by the sum of the lengths of the two documents. We define this value as the “tscore,” which stands for translational score. At least one cursor moves after each comparison, so this algorithm finishes in time linear in the length of the texts.
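The cursor-based comparison can be sketched as follows. This is a sketch with our own names; for simplicity it normalizes by the lengths of the converted sequences as a stand-in for the document lengths:

```python
def tscore(seq_a, seq_b, dist_threshold=0.2):
    """Linear-time merge of two sorted (nid, pos) sequences: count
    element pairs with equal semantic ID and positions within the
    distance threshold, then normalize by the combined length."""
    i = j = matches = 0
    while i < len(seq_a) and j < len(seq_b):
        nid_a, pos_a = seq_a[i]
        nid_b, pos_b = seq_b[j]
        if nid_a == nid_b and abs(pos_a - pos_b) <= dist_threshold:
            matches += 1          # evidence of translationality
            i += 1
            j += 1
        elif (nid_a, pos_a) <= (nid_b, pos_b):
            i += 1                # advance the lesser cursor
        else:
            j += 1
    return matches / (len(seq_a) + len(seq_b))
```

Because each iteration advances at least one cursor and no dictionary lookup or string comparison is involved, a single judgement is a plain linear scan over two integer/float streams.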
4 Experiments
4.1 Preparation
To evaluate our method, we used the EDR Electronic Dictionary¹ as a bilingual dictionary and Fry's Japanese-English parallel web corpus (Fry, 2005) as sample data. In this experiment, we considered only nouns (see Section 3.2) and obtained a graph which consists of 28001 nodes, 19413 edges, and 9178 connected components, of which the largest has 3563 nodes and 3943 edges. Large components, including this one, need to be partitioned.
We conducted partitioning with different thresholds and developed various word–ID mappings. For each mapping, we made several variations in two respects. One is whether cut connections are recovered or not. The other is whether, and how many, numerals, which can easily be utilized to boost the vocabulary of the dictionary, are added to the bilingual dictionary.
The parallel corpus we used had been collected by Fry from four news sites. Most texts in the corpus are news reports on computer technology, and the rest are on various fields of science. A single document is typically 1,000–6,000 bytes. Fry detected parallel texts based only on HTML tags and link structures, which depend on the websites, without looking at textual content, so there are many false pairs in his corpus. Therefore, to evaluate our method precisely, we used only 400 true parallel pairs that were randomly selected and checked by human inspection. We divided them evenly and randomly into two parts and used one half as a training set and the other as a test set. In the experiments described in Sections 4.4 and 4.5, we used other portions of the corpus to scale up the experiments.
For tokenization and POS tagging, we applied MeCab² to the Japanese texts and the SS Tagger³ to the English texts. Because the SS Tagger does not perform lemmatization, we used the morphstr() function in the WordNet library⁴.

¹ EDR Electronic Dictionary. http://www2.nict.go.jp/kk/e416/EDR/
4.2 Effect of large components and partitioning
Figure 3 shows the results of experiments under several conditions. There are three groups of bars: (A) treat every connected component equally regardless of its size, (B) simply drop the largest component, and (C) divide large components into smaller parts. In each group, the upper bar corresponds to the case where the algorithm works without a distance threshold, and the lower with one (0.2). The figures attached to each bar are the maximum F1 score, which is a popular measure to evaluate a classification algorithm, and indicate how accurately a method is able to detect the 200 true text pairs in the test set of 40,000 pairs. We did not recover word connections broken in the partitioning step, and did not add any numerals to the vocabulary of the bilingual dictionary this time.
The significant difference between (A) and (B) clearly shows the devastating effect of large components. The difference between (B) and (C) shows that the accuracy can be further improved if large components are partitioned into small ones so as to utilize as much information as possible. In addition, the accuracy consistently improves by using the distance threshold.
Figure 3: Effect of the graph partitioning

² MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://mecab.sourceforge.jp/
³ SS Tagger - a part-of-speech tagger for English. http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/
⁴ WordNet - a lexical database for the English language. http://wordnet.princeton.edu/

Next, we determined the best word–ID mapping and distance threshold and tested their performance through a 2-fold cross validation. The best mapping among those was the one which

• divides a component recursively until the number of nodes of each language becomes no more than 30,
• does not recover connections that are cut in the partitioning, and
• adds numerals from 0 to 999.

The best distance threshold was 0.2, and the best tscore threshold 0.102. We tested this rule and these thresholds on the test set. The result was F1 = 0.960.
4.3 Effect of false translation pairs
Our method of matching words differs from Ma and Liberman's. While they only count word pairs that directly appear in a bilingual dictionary, we identify all words having the same semantic ID. The potential merits and drawbacks for accuracy have been described in Section 3.2. We compared the accuracy of the two algorithms to investigate the effect of our approximate matching. To this end, we implemented Ma and Liberman's method with all other conditions and input data being equal to those in the last section. We got a maximum F1 of 0.933 as a result, which is slightly worse than the figure reported in their paper. Though it is difficult to conclude where the difference stems from, there are several factors worth pointing out. First, our experiment is done for English-Japanese, while Ma and Liberman's experiment was for English-German, which are more similar than English and Japanese are. Second, their data set contains a much higher proportion of true pairs (240 out of 300) than our data set does (200 out of 40,000).
Figure 4: The two word-matching policies
This number is also worse than that of our experiment (Figure 4). This shows that, at least in this experiment, our approach of identifying more pairs than the original dictionary causes more good effects than bad in total. We looked at word pairs which are matched in our method but not in Ma and Liberman's. While most of the pairs can hardly be considered strict translations, some of them are pairs practically used as translations. Examples of such pairs are shown in Figure 5.
Figure 5: Word pairs not in the dictionary
4.4 Execution Speed
We have argued that execution speed is a major advantage of our method. We achieved a throughput of 250,000 pairs/sec on a single Xeon (2.4 GHz) processor. It is difficult to make a fair comparison of execution speed because Ma and Liberman's paper does not describe enough details about their experiments other than processing 3145 websites with 10 Sparc stations for 10 days. Just for a rough estimate, we introduce some bold assumptions. Say there were a thousand pages for each language in a website, or in other words, a million page pairs, and the performance of processors has grown by 32 times in the past seven years; then our method works more than 40 times faster than Ma and Liberman's. This difference seems to be caused by a difference in the complexity of the two algorithms. To the extent written in their paper, Ma and Liberman calculated a score of translationality by enumerating all combinations of two words within a distance threshold and searching a bilingual dictionary for each combination of words. This algorithm takes Ω(n²) time, where n is the length of a text, while our method takes O(n) time. In addition, our method does not need any string manipulation in the comparison step.

Figure 6: An example of false-positive text pairs
4.5 Analysis of misdetections
We analyzed text pairs for which the judgements differ between Fry's method and ours.
Among the pairs Fry determined to be translations, we examined the 10 pairs ranked highest by our algorithm. Two of them are in fact translations, which were not detected by Fry's method, which uses no linguistic information. The remaining eight pairs are not translations. Three of the eight pairs are about bioscience, and the word “cell” occurs many times (Figure 6). When words with an identical semantic ID appear repeatedly in the two texts being compared, their distances are likely to be within the distance threshold and the pair gets an unreasonably high tscore. Therefore, if we take the frequency of each semantic ID in a text into account, we might be able to improve the accuracy.
We performed the same examination on the 10 pairs ranked lowest among those Fry determined not to be translations, but no interesting feature could be found at the moment.
5 Summary and Future Work
In this paper, we proposed a fast and accurate method for detecting parallel texts in a collection. This method consists of three major parts: preprocessing a bilingual dictionary into a word–ID conversion rule, converting texts into ID sequences, and comparing the sequences. With this method, we achieved 250,000 pairs/sec on a single CPU and a best F1 score of 0.960. In addition, this method utilizes only the linguistic information of the textual content, so it is generally applicable. This means it can detect parallel documents in any format. Furthermore, our method is in essence independent of the languages involved. It can be applied to any pair of languages if a bilingual dictionary between the languages is available (a general language dictionary suffices).
Our future study will include improving both accuracy and speed while retaining the generality. For accuracy, as we described in Section 4.5, the tscore tends to increase when an identical semantic ID appears many times in a text. We might be able to deal with this problem by taking into account the probability that the distance between words falls within the threshold. Large connected components were partitioned by a very simple method in the present work. More sophisticated partitioning methods may improve the accuracy of the judgement. For speed, reducing the number of comparisons is the most important issue that needs to be addressed.

References
Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Comput. Linguist., 16(2):79–85.

Jiang Chen and Jian-Yun Nie. 2000. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proceedings of the sixth conference on Applied natural language processing, pages 21–28, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

John Fry. 2005. Assembling a parallel corpus from RSS news feeds. In Workshop on Example-Based Machine Translation, MT Summit X, Phuket, Thailand, September.

Xiaoyi Ma and Mark Liberman. 1999. BITS: A method for bilingual text search over the web. In Machine Translation Summit VII, September.

David Nadeau and George Foster. 2004. Real-time identification of parallel texts from bilingual newsfeed. In Computational Linguistics in the North-East (CLiNE 2004), pages 21–28.

Makoto Nagao, editor. 1996. Natural Language Processing. Number 15 in Iwanami Software Science. Iwanami Shoten. In Japanese.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Comput. Linguist., 29(3):349–380.

C. C. Yang and K. W. Li. 2002. Mining English/Chinese parallel documents from the World Wide Web. In Proceedings of the International World Wide Web Conference, Honolulu, Hawaii, May.
