Finding Similar Sentences across Multiple Languages in Wikipedia
Sisay Fissaha Adafre, Maarten de Rijke
ISLA, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
sfissaha,mdr@science.uva.nl
Abstract
We investigate whether the Wikipedia corpus is amenable to multilingual
analysis that aims at generating parallel corpora. We present the results of
the application of two simple heuristics for the identification of similar
text across multiple languages in Wikipedia. Despite the simplicity of the
methods, evaluation carried out on a sample of Wikipedia pages shows
encouraging results.
1 Introduction
Parallel corpora form the basis of much multilingual research in natural
language processing, ranging from developing multilingual lexicons to
statistical machine translation systems. As a consequence, collecting and
aligning text corpora written in different languages constitutes an important
prerequisite for these research activities.
Wikipedia is a free, multilingual online encyclopedia. Currently, it has
entries for more than 200 languages, the English Wikipedia being the largest
one with 895,674 articles, and no fewer than eight language versions having
upwards of 100,000 articles as of January 2006. As can be seen in Figure 1,
Wikipedia pages for major European languages have reached a level where they
can support multilingual research. Despite these developments in its content,
research on Wikipedia has largely focused on monolingual aspects so far; see,
e.g., (Voss, 2005) for an overview.
In this paper, we focus on multilingual aspects of Wikipedia. In particular,
we investigate to what extent we can use properties of Wikipedia itself to
generate similar sentences across different languages. As usual, we consider
two sentences similar if they contain (some or a large amount of) overlapping
information. This includes cases in which sentences may be exact translations
of each other, one sentence may be contained within another, or both share
some bits of information.
Figure 1: Wikipedia pages for the top 15 languages (en, de, fr, ja, pl, it,
sv, nl, pt, es, zh, ru, no, fi, da), on a scale from 0 to 1,000,000.
The conceptually simple but fundamental task of identifying similar sentences
across multiple languages has a number of motivations. First, as mentioned
earlier, sentence-aligned corpora play an important role in corpus-based
language processing methods in general. Second, in the context of Wikipedia,
being able to align similar sentences across multiple languages provides
insight into Wikipedia as a knowledge source: to what extent does a given
topic get different kinds of attention in different languages? Third, the
ability to find similar content in other languages while creating a page for a
topic in one language constitutes a useful type of editing support.
Furthermore, finding similar content across different languages can form the
basis for multilingual summarization and question answering support for
Wikipedia; at present the latter task is being developed into a pilot for
CLEF 2006 (WiQA, 2006).
There are different approaches for finding similar sentences across multiple
languages in non-parallel but comparable corpora. Most methods for finding
similar sentences assume the availability of a clean parallel corpus. In
Wikipedia, two versions of a Wikipedia topic in two different languages are a
good starting point for searching for similar sentences. However, these pages
may not always conform to the typical definition of a bitext which current
techniques assume. Bitext generally refers to two versions of a text in two
different languages (Melamed, 1996). Though it is not known how information is
shared among the different languages in Wikipedia, some pages tend to be
translations of each other whereas the majority of the pages tend to be
written independently of each other. Therefore, two versions of the same topic
in two different languages cannot simply be taken as parallel corpora. This in
turn limits the application of some of the currently available techniques.
In this paper, we present two approaches for finding similar sentences across
multiple languages in Wikipedia. The first approach uses freely available
online machine translation resources for translating pages and then computes
monolingual sentence similarity. The approach needs a translation system, and
these are not available for every pair of languages in Wikipedia.
This motivates a second approach to finding similar sentences across multiple
languages, one which uses a bilingual title translation lexicon induced
automatically using the link structure of Wikipedia. Briefly, two sentences
are similar if they link to the same entities (or rather: to pages about the
same entities), and we use Wikipedia itself to relate pages about a given
entity across multiple languages. In Wikipedia, pages on the same topic in
different languages are topically closely related. This means that even if one
page is not a translation of another, they tend to share some common
information. Our underlying assumption here is that there is a general
agreement on the kind of information that needs to be included in the pages of
different types of topics, such as the biography of a person or the definition
and description of a concept, and that this agreement is to a considerable
extent “materialized” in the hypertext links (and their anchor texts) in
Wikipedia.
Our main research question in this paper is this: how do the two methods just
outlined differ? A priori it seems that the translation based approach to
finding similar sentences across multiple languages will have a higher recall
than the link-based method, while the latter outperforms the former in terms
of precision. Is this correct?
The remainder of the paper is organized as follows. In Section 2, we briefly
discuss related work. Section 3 provides a detailed description of Wikipedia
as a corpus. The two approaches to identifying similar sentences across
multiple languages are presented in Section 4. An experimental evaluation is
presented in Section 5. We conclude in Section 6.
2 Related Work
The main focus of this paper lies with multilingual text similarity and its
application to information access in the context of Wikipedia. Current
research work related to Wikipedia mostly describes its monolingual properties
(Ciffolilli, 2003; Viégas et al., 2004; Lih, 2004; Miller, 2005; Bellomi and
Bonato, 2005; Voss, 2005; Fissaha Adafre and de Rijke, 2005). This is probably
due to the fact that different language versions of Wikipedia have different
growth rates. Others describe its application in question answering and other
types of IR systems (Ahn et al., 2005). We believe that currently, Wikipedia
pages for major European languages have reached a level where they can support
multilingual research.
On the other hand, there is a rich body of knowledge relating to multilingual
text similarity. This includes example-based machine translation,
cross-lingual information retrieval, statistical machine translation, sentence
alignment cost functions, and bilingual phrase translation (Kirk Evans, 2005).
Each approach uses relatively different features (content and structural
features) in identifying similar text from bilingual corpora. Furthermore,
most methods assume that the bilingual corpora can be sentence aligned. This
assumption does not hold for our case since our corpus is not parallel. In
this paper, we use content based features for identifying similar text across
multilingual corpora. In particular, we compare bilingual lexicon and MT
system based methods for identifying similar text in Wikipedia.
3 Wikipedia as a Multilingual Corpus
Wikipedia is a free online encyclopedia which is administered by the
non-profit Wikimedia Foundation. The aim of the project is to develop free
encyclopedias for different languages. It is a collaborative effort of a
community of volunteers, and its content can be edited by anyone. It is
attracting increasing attention amongst web users and has joined the top 50
most popular sites.
As of January 1, 2006, there are versions of Wikipedia in more than 200
languages, with sizes ranging from 1 to over 800,000 articles. We used the
ASCII text version of the English and Dutch Wikipedia, which are available as
database dumps. Each entry of the encyclopedia (a page in the online version)
corresponds to a single line in the text file. Each line consists of an ID
(usually the name of the entity) followed by its description. The description
part contains the body of the text that describes the entity. It contains a
mixture of plain text and text with HTML tags. References to other Wikipedia
pages in the text are marked using “[[” and “]]”, which correspond to a
hyperlink in the online version of Wikipedia. Most of the formatting
information which is not relevant for the current task has been removed.
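
To make the “[[” “]]” markup concrete, the following minimal sketch pulls the page references out of a description string. It is an illustration only, not the code used for the experiments: the function name extract_links is hypothetical, and we assume that a link may carry a separate anchor text after a pipe, as in standard wiki markup.

```python
import re

# Matches [[Target]] and [[Target|anchor text]]. The pipe form is standard
# wiki markup, but that it occurs in the dumps is an assumption here.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(description):
    """Return (target, anchor) pairs for every [[...]] reference."""
    links = []
    for match in WIKILINK.finditer(description):
        target = match.group(1).strip()
        # Without a pipe, the anchor text is the target title itself.
        anchor = (match.group(2) or target).strip()
        links.append((target, anchor))
    return links


sample = "He was born in the [[Netherlands]] to [[Netherlands|Dutch]] parents."
print(extract_links(sample))
# [('Netherlands', 'Netherlands'), ('Netherlands', 'Dutch')]
```

The second link in the sample shows how one page can be reached through an anchor text that differs from its title.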
3.1 Links within a single language
Wikipedia is a hypertext document with a rich link structure. A description of
an entity usually contains hypertext links to other pages within or outside
Wikipedia. The majority of these links correspond to entities which are
related to the entity being described and have a separate entry in Wikipedia.
These links are used to guide the reader to a more detailed description of the
concept denoted by the anchor text. In other words, the links in Wikipedia
typically indicate a topical association between the pages, or rather the
entities being described by the pages. E.g., in describing a particular
person, reference will be made to such entities as country, organization and
other important entities which are related to it and which themselves have
entries in Wikipedia. In general, due to the peculiar characteristics of an
encyclopedia corpus, the hyperlinks found in encyclopedia text are used to
exemplify those instances of hyperlinks that exist among topically related
entities (Ghani et al., 2001; Rao and Turoff, 1990).
Each Wikipedia page is identified with a unique ID. These IDs are formed by
concatenating the words of the titles of the Wikipedia pages, which are unique
for each page, e.g., the page on Vincent van Gogh has “Vincent van Gogh” as
its title and “Vincent van Gogh” as its ID. Each page may, however, be
represented by different anchor texts in a hyperlink. The anchor texts may be
simple morphological variants of the title, such as a plural form, or may
represent a closely related semantic concept. For example, the anchor text
“Dutch” may point to the page for the Netherlands. In a sense, the IDs
function as the canonical form for several related concepts.
3.2 Links across different languages
Different versions of a page in different languages are also hyperlinked. For
a given page, translations of its title in other languages for which pages
exist are given as hyperlinks. This property is particularly useful for the
current task as it helps us to align the corpus at the page level.
Furthermore, it also allows us to induce a bilingual lexicon consisting of the
Wikipedia titles. Conceptual mismatch between the pages (e.g., Roof vs.
Dakconstructie) is rare, and the lexicon is generally of high quality. Unlike
a general lexicon, this lexicon contains a relatively large number of names of
individuals and other entities which are highly informative and hence are
useful in identifying similar text. This lexicon will form the backbone of one
of the methods for identifying similar text across different languages, as
will be shown in Section 4.
4 Approaches
We describe two approaches for identifying similar sentences across different
languages. The first uses an MT system to obtain a rough translation of a
given page in one language into another and then uses word overlap between
sentences as a similarity measure. One advantage of this method is that it
relies on a large lexical resource which is bigger than what can be extracted
from Wikipedia. However, the translation can be less accurate, especially for
the Wikipedia titles which form part of the content of a page and are very
informative.
The second approach relies on a bilingual lexicon which is generated from
Wikipedia using the link structure: pages on the same topic in different
languages are hyperlinked; see Figure 2. We use the titles of the pages that
are linked in this manner to create a bilingual lexicon. Thus, our bilingual
lexicon consists of terms that represent concepts or entities that have
entries in Wikipedia, and we will represent sentences by entries from this
lexicon: an entry is used to represent the content of a sentence if the
sentence contains a hypertext link to the Wikipedia page for that entry.
Sentence similarity is then captured in terms of the lexicon entries they
share. In other words, the similarity measure that we use in this approach is
based on “concept” or “page title” overlap. Intuitively, this approach has the
advantage of producing a brief but highly accurate representation of
sentences, more accurate, we assume, than the MT approach, as the titles carry
important semantic information; it will also be more accurate than the MT
approach because the translations of the titles are done manually.
Figure 2: Links to pages devoted to the same topic in other languages.
Both approaches assume that the Wikipedia corpus is aligned at the page level.
This is easily achieved using the link structure since, again, pages on the
same topic in different languages are hyperlinked. This, in turn, narrows down
the search for similar text to the page level. Hence, for a given text of a
page (sentence or chunk) in one language, we search for its equivalent text
(sentence or chunk) only in the corresponding page in the other language, not
in the entire corpus.
We now describe the two approaches in more detail. To remain focused and avoid
getting lost in technical details, we consider only two languages in our
technical descriptions and evaluations below: Dutch and English; it will be
clear from our presentation, however, that our second approach can be used for
any pair of languages in Wikipedia.
4.1 An MT based approach
In this approach, we translate the Dutch Wikipedia page into English using an
online MT system. We refer to the English page as source and the translated
(Dutch page) version as target. We used the Babelfish MT system of Altavista.
It supports a number of language pairs, among which is Dutch-English. After
both pages have been made available in English, we split the pages into
sentences or text chunks. We then link each text chunk or sentence in the
source to each chunk or sentence in the target. Following this, we compute a
simple word overlap score for each pair. We used the Jaccard similarity
measure for this purpose. Content words are our main features for the
computation of similarity; hence, we remove stopwords. Grammatically correct
translations may not be necessary since we are using simple word overlap as
our similarity measure.
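
The scoring step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the tokenizer, the tiny stopword list, and the function names are all hypothetical, and the input is assumed to be two lists of English sentences (the source page and the machine-translated target page).

```python
import re

# Tiny illustrative stopword list; the list actually used is not specified.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "and", "to"}


def content_words(sentence):
    """Lowercase, tokenize on word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score_pairs(source_sentences, target_sentences):
    """Link every source sentence to every target sentence and score the pair."""
    pairs = []
    for i, src in enumerate(source_sentences):
        for j, tgt in enumerate(target_sentences):
            pairs.append((jaccard(content_words(src), content_words(tgt)), i, j))
    return pairs


print(score_pairs(["The cat sat"], ["A cat sat", "Dogs bark"]))
# [(1.0, 0, 0), (0.0, 0, 1)]
```

Because only sets of content words are compared, word order and grammaticality of the rough translation do not affect the score.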
The above procedure will generate a large set of pairs, not all of which will
actually be similar. Therefore, we filter the list assuming a one-to-one
correspondence, where for each source sentence we identify at most one target
sentence. This is a rather strict criterion (another possibility being
one-to-many), given the fact that the corpus is generally assumed to be not
parallel. But it gives some idea of how much of the text corpus can be aligned
at smaller units (i.e., sentences or text chunks).
Filtering works as follows. First we sort the pairs in decreasing order of
their similarity scores. This results in a ranked list of text pairs in which
the most similar pairs are ranked at the top and the least similar pairs at
the bottom. Next we take the top ranking pair. Since we are assuming a
one-to-one correspondence, we remove all other pairs ranked lower in the list
that contain either of the sentences or text chunks in the top ranking pair.
We then repeat this process taking the second top ranking pair. Each step
results in a smaller list. The process continues until there are no more pairs
to remove.
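
The filtering procedure amounts to a greedy one-to-one selection over the ranked list, which a single pass can reproduce by remembering which sentences are already taken. The sketch below is illustrative only; the function name and the (score, source_id, target_id) tuple layout are hypothetical, and discarding pairs with no overlap at all (the min_score parameter) is our assumption, since no threshold is stated in the text.

```python
def filter_one_to_one(scored_pairs, min_score=0.0):
    """Greedy one-to-one filtering of (score, source_id, target_id) tuples.

    Pairs are visited in decreasing order of score; a pair is kept only if
    neither of its sentences already occurs in a higher ranked kept pair.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    used_sources, used_targets = set(), set()
    kept = []
    for score, src, tgt in ranked:
        if score <= min_score:
            # Assumption: pairs without any overlap are never returned.
            continue
        if src in used_sources or tgt in used_targets:
            continue
        kept.append((score, src, tgt))
        used_sources.add(src)
        used_targets.add(tgt)
    return kept


pairs = [(0.9, 0, 0), (0.8, 0, 1), (0.7, 1, 0), (0.5, 1, 1), (0.0, 2, 2)]
print(filter_one_to_one(pairs))
# [(0.9, 0, 0), (0.5, 1, 1)]
```

In the example, the pairs scored 0.8 and 0.7 are dropped because each reuses a sentence from the top ranking pair.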
4.2 Using a link-based bilingual lexicon
As mentioned previously, this approach makes use of a bilingual lexicon that
is generated from Wikipedia using the link structure. A high level description
of the algorithm is given in Figure 3. Below, we first describe how the
bilingual lexicon is acquired and how it is used for enriching the link
structure of Wikipedia. Finally, we detail how the bilingual lexicon is used
for the identification of similar sentences.
• Generate the bilingual lexicon.
• Given a topic, get the corresponding pages from the English and Dutch
Wikipedia.
• Split the pages into sentences and enrich the hyperlinks in the sentences or
identify named entities in the pages.
• Represent the sentences in these pages using the bilingual lexicon.
• Compute term overlap between the sentences thus represented.
Figure 3: The pseudo-algorithm for identifying similar sentences using a
link-based bilingual lexicon.
Generating the bilingual lexicon
Unlike the MT based approach, which uses content words from the general
vocabulary as features, in this approach we use page titles and their
translations (as obtained through hyperlinks as explained above) as our
primitives for the computation of multilingual similarity. The first step of
this approach, then, is acquiring the bilingual lexicon, but this is
relatively straightforward. For each Wikipedia page in one language,
translations of the title in other languages, for which there are separate
entries, are given as hyperlinks. This information is used to generate a
bilingual translation lexicon. Most of these titles are content bearing noun
phrases and are very useful in multilingual similarity computation (Kirk
Evans, 2005). Most of these noun phrases are already disambiguated, and may
consist of either a single word or multiword units.
Wikipedia uses a redirection facility to map several titles into a canonical
form. These titles are mostly synonymous expressions. We used Wikipedia’s
redirect feature to identify synonymous expressions.
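
A minimal sketch of the lexicon induction, under assumptions that go beyond the text: we assume the cross-language hyperlinks appear in the page text as links prefixed with a language code (e.g., “[[nl:...]]”), that pages are available as a mapping from title to text, and that redirects are available as a mapping from alias to canonical title. The function names and the alias “Giant manta” are hypothetical.

```python
import re


def interlanguage_link(page_text, lang):
    """Return the title linked under the given language code, or None."""
    pattern = r"\[\[" + re.escape(lang) + r":([^\[\]|]+)\]\]"
    match = re.search(pattern, page_text)
    return match.group(1).strip() if match else None


def build_lexicon(pages, lang="nl", redirects=None):
    """Map each title (and its redirect aliases) to the linked title in `lang`."""
    lexicon = {}
    for title, page_text in pages.items():
        translation = interlanguage_link(page_text, lang)
        if translation is not None:
            lexicon[title] = translation
    # Redirect aliases inherit the translation of their canonical title.
    for alias, canonical in (redirects or {}).items():
        if canonical in lexicon:
            lexicon.setdefault(alias, lexicon[canonical])
    return lexicon


pages = {
    "Manta ray": "The manta ray is a large ray. [[nl:Reuzenmanta]]",
    "Treason": "Treason is a crime. [[nl:Landverraad]]",
    "Kettle": "A kettle is used for boiling water.",
}
redirects = {"Giant manta": "Manta ray"}  # hypothetical alias
print(build_lexicon(pages, "nl", redirects))
```

Pages without a link to the other language, such as the third entry in the example, simply contribute no lexicon entry.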
Canonical representation of a sentence
Once we have the bilingual lexicon, the next step is to represent the
sentences in both languages using this lexicon. Each sentence is represented
by the set of hyperlinks it contains. We search for each hyperlink in the
bilingual lexicon. If it is found, we replace the hyperlink with the
corresponding unique identification of the bilingual lexicon entry. If it is
not found, the hyperlink will be included as is as part of the representation.
This is done since Dutch and English are closely related languages and may
share many cognate pairs.
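
As an illustration of this representation, the sketch below maps the link targets of a sentence to lexicon entry identifiers and keeps unknown targets unchanged. Using the English title as the identifier of an entry is our assumption, as are the function names and the example title pairs.

```python
def represent(link_targets, to_entry_id):
    """Represent a sentence as the set of lexicon entry IDs of its links.

    Targets missing from the lexicon are kept as is, so that identical
    titles in both languages can still match.
    """
    return {to_entry_id.get(target, target) for target in link_targets}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Hypothetical Dutch-to-entry mapping; the English title serves as the ID.
nl_to_id = {"Nederland": "Netherlands", "Schilder": "Painter"}

dutch = represent(["Nederland", "Schilder", "Zundert"], nl_to_id)
english = represent(["Netherlands", "Zundert", "Post-Impressionism"], {})
print(jaccard(dutch, english))
# 0.5
```

Here “Zundert” is absent from the lexicon but still contributes to the overlap, because it is written identically in both sentences.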
Enriching the Wikipedia link structure
As described in the previous section, the method uses hyperlinks in a sentence
as a highly focused entity-based representation of the aboutness of the
sentence. In Wikipedia, not all occurrences of named entities or concepts that
have entries in Wikipedia are actually used as anchor text of a hypertext
link; because of this, a number of sentences may needlessly be left out from
the similarity computation process. In order to avoid this problem, we
automatically identify other relevant hyperlinks using the bilingual lexicon
generated in the previous section.
Identification of additional hyperlinks in Wikipedia sentences works as
follows. First we split the sentences into constituent words. We then generate
word N-grams, keeping the relative order of words in the sentences. Since the
anchor texts of hypertext links may be multiword expressions, we start with
higher order N-grams (N=4). We search for these N-grams in the bilingual
lexicon. If an N-gram is found in the lexicon, it is taken as a new hyperlink
and will form part of the representation of the sentence. The process is
repeated for lower order N-grams.
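
The N-gram lookup can be sketched as follows. Two details are our assumptions rather than part of the description above: matching is exact and case-sensitive, and words already covered by a longer match are not matched again at lower orders. The function name and the example titles are hypothetical.

```python
import re


def enrich_links(sentence, lexicon_titles, max_n=4):
    """Find lexicon titles in a sentence, trying longer N-grams first."""
    words = re.findall(r"\w+", sentence)
    covered = [False] * len(words)
    found = []
    for n in range(max_n, 0, -1):
        for start in range(len(words) - n + 1):
            span = range(start, start + n)
            # Assumption: skip words already claimed by a longer match.
            if any(covered[i] for i in span):
                continue
            candidate = " ".join(words[start:start + n])
            if candidate in lexicon_titles:
                found.append(candidate)
                for i in span:
                    covered[i] = True
    return found


titles = {"Vincent van Gogh", "Gogh", "Zundert", "Netherlands"}
sentence = "Vincent van Gogh was born in Zundert in the Netherlands."
print(enrich_links(sentence, titles))
# ['Vincent van Gogh', 'Zundert', 'Netherlands']
```

Starting from the highest order ensures that the full name is matched before any of its parts, so the shorter title “Gogh” is not added separately.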
Identifying similar sentences
Once we are done representing the sentences as described previously, the final
step involves computation of the term overlap between the sentence pairs and
filtering the resulting list. The remaining steps are similar to those
described in the MT based approach. For completeness, we briefly repeat the
steps here. First, all sentences from a Dutch Wikipedia page are linked to all
sentences of the corresponding English Wikipedia page. We then compute the
similarity between the sentence representations, using the Jaccard similarity
coefficient.
A sentence in a Dutch page may be similar to several sentences in the English
page, which may result in a large number of spurious pairs. Therefore, we
filter the list using the following recursive procedure. First, the sentence
pairs are sorted by their similarity scores. We take the pair with the highest
similarity score. We then eliminate all other sentence pairs from the list
that contain either of the sentences in this pair. We continue this process
taking the second highest ranking pair. Note that this procedure assumes a
one-to-one matching rule; a sentence in Dutch can be linked to at most one
sentence in English.
5 Experimental Evaluation
Now that we have described the two algorithms for identifying similar
sentences, we return to our research questions. In order to answer them we run
the experiment described below.
5.1 Set-up
We took a random sample of 30 English-Dutch Wikipedia page pairs. Each page is
split into sentences. We generated candidate Dutch-English sentence pairs and
passed them on to the two methods. Both methods return a ranked list of
sentence pairs that are similar. As explained above, we assumed a one-to-one
correspondence, i.e., one English sentence can be linked to at most one Dutch
sentence.
The outputs of the systems are manually evaluated. We apply a relatively
lenient criterion in assessing the results. If two sentences overlap in terms
of their information content then we consider them to be similar. This
includes cases in which sentences may be exact translations of each other, one
sentence may be contained within another, or both share some bits of
information.
5.2 Results
Table 1 shows the results of the two methods described in Section 4. In the
table, we give two types of numbers for each of the two methods, MT and
Bilingual lexicon: Total (the total number of sentence pairs) and Match (the
number of correctly identified sentence pairs) generated by the two
approaches.
Overall, the two approaches tend to produce similar numbers of correctly
identified similar sentence pairs. The systems seem to perform well on pages
which tend to be alignable at sentence level, i.e., parallel. This is clearly
seen on the following pages: Pierluigi Collina, Marcus Cornelius Fronto,
George F. Kennan, which show a high similarity at sentence level. Some pages
contain very small descriptions and hence the figures for correct similar
sentences are also small. For other topics such as Classicism (Dutch:
Classicisme), Tennis, and Tank, though they are described in sufficient detail
in both languages, there tends to be less overlap among the text. The methods
tend to retrieve more accurate similar pairs from person pages than other
pages, especially those pages describing more abstract concepts. However, this
needs to be tested more thoroughly.
When we look at the total number of sentence pairs returned, we notice that
the bilingual lexicon based method consistently returns a smaller number of
similar sentence pairs, which makes the method more accurate than the MT based
approach. On average, the MT based approach returns 4.5 (26%) correct
sentences and the bilingual lexicon based approach returns 2.9 correct
sentences (45%). But, on average, the MT approach returns roughly three times
as many sentence pairs as the bilingual lexicon approach. This may be due to
the fact that the latter makes use of a restricted set of important terms or
concepts whereas the former uses a large general lexicon. Though we remove
some of the most frequently occurring stopwords in the MT based approach, it
still generates a large number of incorrect similar sentence pairs due to some
common words.
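
The percentages above follow from the averages in the last row of Table 1, taking the share of correct pairs as Match divided by Total. The short sketch below reproduces this arithmetic; the function name precision is ours.

```python
def precision(match, total):
    """Fraction of returned sentence pairs that were judged correct."""
    return match / total


# Averages from the last row of Table 1.
mt_total, mt_match = 17.6, 4.5
lex_total, lex_match = 6.5, 2.9

mt_precision = precision(mt_match, mt_total)
lex_precision = precision(lex_match, lex_total)
pair_ratio = mt_total / lex_total

print(f"MT: {mt_precision:.0%}, lexicon: {lex_precision:.0%}, "
      f"pair ratio: {pair_ratio:.1f}")
# MT: 26%, lexicon: 45%, pair ratio: 2.7
```

The ratio of 2.7 is what the text rounds to "roughly three times as many" returned pairs for the MT approach.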
In general, the number of correctly identified similar sentence pairs
extracted seems small. However, most of the Dutch pages are relatively small,
which sets the upper bound on the number of correctly identified sentence
pairs that can be extracted. On average, each Dutch Wikipedia page in the
sample contains 18 sentences whereas English Wikipedia pages contain 65
sentences. Excluding the pages for Tennis, Tank (Dutch: voertuig), and
Tricolor, which are relatively large, each Dutch page contains on average 8
sentences, which is even smaller. Given the fact that the pages are in general
not parallel, the methods, using simple heuristics, identified high quality
translation equivalent sentence pairs from most Wikipedia pages. Furthermore,
a close examination of the output of the two approaches shows that both tend
to identify the same set of similar sentence pairs.
We ran our bilingual lexicon based approach on the whole Dutch-English
Wikipedia corpus. The method returned about 80M candidate similar sentences.
Though we do not have the resources to evaluate this output, the results we
got from sample data (cf. Table 1) suggest that it contains a significant
amount of correctly identified similar sentences.
Title                                                 MT           Bilingual Lexicon
English                    Dutch                      Total Match  Total Match
Hersfeld-Rotenburg         Hersfeld-Rotenburg         2 3 2
Manganese nodule           Mangaanknol                5 2 1 1
Kettle                     Ketel                      1 1
Treason                    Landverraad                2 1
Pierluigi Collina          Pierluigi Collina          14 13 13 11
Province of Ferrara        Ferrara (provincie)        7 1 1 1
Classicism                 Classicisme                8 1
Tennis                     Tennis                     93 4 15 3
Hysteria                   Hysterie                   14 6 9 5
George F. Kennan           George Kennan              27 12 29 11
Marcus Cornelius Fronto    Marcus Cornelius Fronto    11 9 5 5
Delphi                     Delphi (Griekenland)       34 2 8 1
De Beers                   De Beers                   11 5 10 5
Pavel Popovich             Pavel Popovytsj            7 4 4 4
Rice pudding               Rijstebrij                 11 1 4
Manta ray                  Reuzenmanta                15 3 7 2
Michelstadt                Michelstadt                1 1 1 1
Tank                       Tank (voertuig)            84 3 27 2
Cheyenne (Wyoming)         Cheyenne (Wyoming)         5 2 2 2
Goa                        Goa (deelstaat)            13 4 6 1
Tricolour                  Driekleur                  57 36 13 12
Oral cancer                Mondkanker                 25 2 7 2
Pallium                    Pallium                    12 2 5 4
Ajanta                     Ajanta                     3 3 2 2
Captain Jack (band)        Captain Jack               16 3 2 2
Proboscis Monkey           Neusaap                    15 6 4 1
Patti Smith                Patti Smith                6 2 4 2
Flores Island, Portugal    Flores (Azoren)            3 2 1 1
Mercury 8                  Mercury MA8                11 3 4 1
Mutation                   Mutatie                    16 4 6 3
Average                                               17.6 4.5 6.5 2.9
Table 1: Test topics (columns 1 and 2). The total number of sentence pairs
(column 3) and the number of correctly identified similar sentence pairs
(column 4) returned by the MT based approach. The total number of sentence
pairs (column 5) and the number of correctly identified similar sentence pairs
(column 6) returned by the method using a bilingual lexicon.
6 Conclusion
In this paper we focused on multilingual aspects of Wikipedia. In particular,
we investigated the potential of Wikipedia for generating parallel corpora by
applying different methods for identifying similar text across multiple
languages. We presented two methods and carried out an evaluation on a sample
of Dutch-English Wikipedia pages. The results show that both methods, using
simple heuristics, were able to identify similar text between pairs of
Wikipedia pages, though they differ in accuracy.
The bilingual lexicon approach returns fewer incorrect pairs than the MT based
approach. We interpret this as saying that our bilingual lexicon based method
provides a more accurate representation of the aboutness of sentences in
Wikipedia than the MT based approach. Furthermore, the results we obtained on
a sample of Wikipedia pages and the output of running the bilingual lexicon
based approach on the whole Dutch-English Wikipedia corpus give some
indication of the potential of Wikipedia for generating parallel corpora.
As to future work, the sentence similarity detection methods that we
considered are not perfect. E.g., the MT based approach relies on rough
translations; it is important to investigate the contribution of high quality
translations. The bilingual lexicon approach uses only lexical features; other
language specific sentence features might help improve results.
Acknowledgments
This research was supported by the Netherlands Organization for Scientific
Research (NWO) under project numbers 017.001.190, 220-80-001, 264-70-050,
612-13-001, 612.000.106, 612.000.207, 612.066.302, 612.069.006, 640.001.501,
and 640.002.501.
References
D. Ahn, V. Jijkoun, G. Mishne, K. Müller, M. de Rijke, and S. Schlobach. 2005.
Using Wikipedia at the TREC QA Track. In E.M. Voorhees and L.P. Buckland,
editors, The Thirteenth Text Retrieval Conference (TREC 2004).
F. Bellomi and R. Bonato. 2005. Lexical authorities in an encyclopedic corpus:
a case study with wikipedia. URL:
http://www.fran.it/blog/2005/01/lexical-authorities-in-encyclopedic.html.
Site accessed on June 9, 2005.
A. Ciffolilli. 2003. Phantom authority, self-selective recruitment and
retention of members in virtual communities: The case of Wikipedia. First
Monday, 8(12).
S. Fissaha Adafre and M. de Rijke. 2005. Discovering missing links in
Wikipedia. In Proceedings of the Workshop on Link Discovery: Issues,
Approaches and Applications (LinkKDD-2005).
R. Ghani, S. Slattery, and Y. Yang. 2001. Hypertext categorization using
hyperlink patterns and meta data. In Carla Brodley and Andrea Danyluk,
editors, Proceedings of ICML-01, 18th International Conference on Machine
Learning, pages 178–185.
D. Kirk Evans. 2005. Identifying similarity in text: Multi-lingual analysis
for summarization. URL:
http://www1.cs.columbia.edu/nlp/theses/dave_evans.pdf. Site accessed on
January 5, 2006.
A. Lih. 2004. Wikipedia as participatory journalism: Reliable sources? Metrics
for evaluating collaborative media as a news resource. In Proceedings of the
5th International Symposium on Online Journalism.
D. Melamed. 1996. A geometric approach to mapping bitext correspondence. In
Eric Brill and Kenneth Church, editors, Proceedings of the Conference on
Empirical Methods in Natural Language Processing, pages 1–12, Somerset, New
Jersey. Association for Computational Linguistics.
N. Miller. 2005. Wikipedia and the disappearing “Author”. ETC: A Review of
General Semantics, 62(1):37–40.
U. Rao and M. Turoff. 1990. Hypertext functionality: A theoretical framework.
International Journal of Human-Computer Interaction.
F. Viégas, M. Wattenberg, and D. Kushal. 2004. Studying cooperation and
conflict between authors with history flow visualization. In Proceedings of
the 2004 conference on Human factors in computing systems.
J. Voss. 2005. Measuring Wikipedia. In Proceedings 10th International
Conference of the International Society for Scientometrics and Informetrics.
WiQA. 2006. Question answering using Wikipedia. URL:
http://ilps.science.uva.nl/WiQA/. Site accessed on January 5, 2006.