Errors in wikis: new challenges and new opportunities - a discussion document
Ann Copestake
Computer Laboratory
University of Cambridge
aac@cl.cam.ac.uk
Abstract
This discussion document concerns the challenges to assessments of
reliability posed by wikis and the potential for language processing
techniques for aiding readers to decide whether to trust particular
text.
1 Wikis and the trust problem
Wikis, especially open wikis, pose new challenges for readers in
deciding whether information is trustworthy. An article in a wikipedia
may be generally well-written and appear authoritative, so that the
reader is inclined to trust it, but have some additions by other
authors which are incorrect. Corrections may eventually get made, but
there will be a time lag. In particular, many people are now using
Wikipedia (www.wikipedia.org) as a major reference source, so the
potential for misinformation to be spread is increasing. It has
already become apparent that articles about politicians are being
edited by their staff to make them more favourable and no doubt
various interest groups are manipulating information in more subtle
ways. In fact, as wikis develop, problems with reliability may get
worse: authors who wrote an article several years ago won’t care so
much about its content and may not bother to check edits. When obscure
topics are covered by a wiki, the community which is capable of
checking facts may be small.
Of course errors arise in old text too, but a generally authoritative
conventional article is unlikely to contain a really major error about
a central topic. Different old text publications have different
perspectives, political or otherwise, but the overall slant is usually
well known and hence not problematic. Non-wiki web pages may have
unknown authors, but the domain offers some guide to reliability and
to likely skew, and the pages can be assessed as a whole. The issue
here is not the overall number of errors in wikis versus published
text or web pages, but how a reader can decide to trust a particular
piece of information when they cannot use the article as a whole as a
guide.
There is a need for automatic tools which could provide an aid for the
reader who needs to assess trustworthiness and also for authors and
moderators scanning changes. Similarly, moderators need tools for
identification of vandalism, libel, advertising and so on.
Questions:
1. Is wiki reliability really a problem for readers, as I hypothesise?
   Perhaps readers who are not expert in a topic can detect
   problematic material in a wiki article, despite the multiple
   authorship.
2. Can we use language processing tools to help readers identify
   errors and misinformation in wiki pages?
2 Learning trustworthiness
The availability of change histories on wikis is a resource which
could be exploited for training purposes by language processing
systems designed to evaluate trustworthiness. If it is possible to
categorise users as trustworthy or non-trustworthy/unknown by
independent criteria (such as overall contribution level), then we can
use changes made by trustworthy users that delete additions made by
the unknown users as a means of categorising some text as bad.
(Possibly the
comments made by the editors could lead to sub-categorization of the
badness as error vs vandalism etc.) A tool for highlighting possible
problem edits in wikis might thus be developed on the basis of a large
amount of training data. Techniques derived from areas such as
language-based spam detection, subjectivity measurement and so on
could be relevant. However, one of the relatively novel aspects of the
wiki problem is that we are looking at categorisation of small text
snippets rather than larger quantities of text. Thus techniques that
rely on stylistic cues probably won’t work. Ideally, we need to be
able to identify the actual information provided by individual
contributors and classify this as reliable or unreliable. One way of
looking at this is by dividing text into factoids (in the
summarisation sense). Factoid identification is a really hard problem,
but maybe the wiki edits themselves could help here.
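
The labelling idea above (text added by an unknown user and later deleted by a trustworthy user is treated as bad) might be sketched roughly as follows. This is only an illustration under several assumptions of mine: the `Revision` record, the function names, the sentence-level granularity and the toy history are all hypothetical, and the set of trusted users is taken as given rather than derived from contribution level.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Revision:
    author: str
    sentences: list  # article text, assumed already split into sentences


def diff(old, new):
    """Return (added, deleted) sentences between two consecutive revisions."""
    added, deleted = [], []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return added, deleted


def label_bad_text(history, trusted):
    """Collect sentences added by non-trusted users and later deleted by trusted ones."""
    pending = {}  # sentence -> non-trusted author who added it
    bad = []      # (sentence, author who added it, trusted author who deleted it)
    for prev, cur in zip(history, history[1:]):
        added, deleted = diff(prev.sentences, cur.sentences)
        for sentence in deleted:
            adder = pending.pop(sentence, None)
            # Only a deletion by a trusted user counts as evidence of badness.
            if adder is not None and cur.author in trusted:
                bad.append((sentence, adder, cur.author))
        if cur.author not in trusted:
            for sentence in added:
                pending[sentence] = cur.author
    return bad


# Toy history (invented for illustration, not real wiki data).
history = [
    Revision("alice", ["The college has a library.", "It is open daily."]),
    Revision("anon42", ["The college has a library.", "It is open daily.",
                        "It is the best library in the world."]),
    Revision("alice", ["The college has a library.", "It is open daily."]),
]
examples = label_bad_text(history, trusted={"alice"})
print(examples)
```

Exact sentence matching is the crudest possible choice here: a trusted user who rewrites a sentence rather than deleting it produces a "replace" that this sketch also counts as a deletion, which may or may not be the desired behaviour for training data.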
Questions:
1. Can we automatically classify wiki contributors as
   reliable/unreliable?
2. Do trustworthy users’ edits provide good training data?
3. Are there any features of text snippets that allow classification
   of reliability? (My guess: identification of vandalism will be
   possible but more subtle effects won’t be detectable.)
4. What tools could be adapted from other areas of language processing
   to address these issues?
3 An ontology of errors?
As an extension of the ideas in the previous section, perhaps wiki
histories could be mined as a repository of commonly believed false
information. For instance, the EN wikipedia entry for University of
Cambridge currently (Jan 5th, 2006) states:
Undergraduate admission to Cambridge colleges used to depend on
knowledge of Latin and Ancient Greek, subjects taught principally in
the United Kingdom at fee-paying schools, called public schools.
(‘public schools’ was linked)
One way in which this is wrong is that British ‘public schools’ (in
this sense) are only a small proportion of the fee-paying schools, but
equating public schools with all fee-paying schools is a common error.
Suppose a trustworthy editor corrects this particular error in this
article (and perhaps similar errors in the same or other articles). If
we can automatically analyse and store the correction, we could use it
to check for the same error in other text. As wikis get larger, this
might become a useful resource for error detection/evaluation of many
text types. Thus errors in wikis are an opportunity as well as a
challenge.
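
As a very rough illustration of storing a correction and reusing it, the sketch below extracts the changed phrase from a before/after pair and then looks for the uncorrected phrase in other text. Everything here is an assumption on my part: the function names, the two-word context window, the plain substring matching, and in particular the "after" sentence, which is a made-up correction and not an actual Wikipedia edit. A real system would need something far closer to factoid-level matching than string search.

```python
from difflib import SequenceMatcher


def extract_correction(before, after, context=2):
    """Return (wrong_phrase, corrected_phrase) for the region an edit changed.

    A few words of context are kept on each side so that a pure insertion
    still yields a non-empty phrase to search for.
    """
    old, new = before.split(), after.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not changes:
        return None
    i1 = max(changes[0][1] - context, 0)
    i2 = min(changes[-1][2] + context, len(old))
    j1 = max(changes[0][3] - context, 0)
    j2 = min(changes[-1][4] + context, len(new))
    return " ".join(old[i1:i2]), " ".join(new[j1:j2])


def find_known_errors(text, store):
    """Return stored corrections whose wrong phrase occurs in the text."""
    normalised = " ".join(text.lower().split())
    return [(wrong, right) for wrong, right in store
            if wrong.lower() in normalised]


# "before" is a fragment of the sentence quoted above;
# "after" is a HYPOTHETICAL correction invented for this example.
before = ("taught principally in the United Kingdom at fee-paying schools, "
          "called public schools.")
after = ("taught principally in the United Kingdom at fee-paying schools, "
         "some of which are called public schools.")

correction = extract_correction(before, after)
store = [correction]

other_text = "Most pupils at fee-paying schools, called public schools, studied Latin."
hits = find_known_errors(other_text, store)
print(correction)
print(hits)
```

The context window trades precision against recall: with too little context the stored phrase matches unrelated text, and with too much it only ever matches the sentence it came from.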
