Text Linkage in the Wiki Medium – A Comparative Study
Alexander Mehler
Department of Computational Linguistics & Text Technology
Bielefeld University
Bielefeld, Germany
Alexander.Mehler@uni-bielefeld.de
Abstract
We analyze four different types of document networks with respect to their
small world characteristics. These characteristics allow distinguishing
wiki-based systems from citation networks and from more traditional text-based
networks augmented by hyperlinks. The study provides evidence that a more
appropriate network model is needed which better reflects the specifics of wiki
systems. It puts emphasis on their topological differences, which result from
wiki-related linking, compared to other text-based networks.
1 Introduction
With the advent of web-based communication, more and more corpora are
accessible which manifest complex networks based on intertextual relations.
This includes the areas of scientific communication (e.g. digital libraries
such as CiteSeer), press communication (e.g. the New York Times, which links
topically related articles), technical communication (e.g. the Apache Software
Foundation's documentation of open source projects) and electronic
encyclopedias (e.g. Wikipedia and its releases in a multitude of languages).
These are sources of large corpora of web documents which are connected by
citation links (digital libraries), content-based add-ons (online press
communication) or hyperlinks to related lexicon articles (electronic
encyclopedias).
Obviously, a corpus of such documents is more than a set of textual units.
There is structure formation above the level of single documents which can be
described by means of graph theory and network analysis (Newman, 2003). But
what is new about this kind of structure formation? Or do we just face the kind
of structuring which is already known from other linguistic networks?
This paper focuses on the specifics of networking in wiki-based systems. It
tackles the following questions: What structure do wiki-based text networks
have? Can we expect a wiki-specific topology compared to more traditional
(e.g. citation) networks? Or can we expect comparable results when applying
network analysis to these emerging networks? In the following sections, these
questions are approached by the example of a language-specific release of
Wikipedia as well as by wikis for technical documentation. That is, we
contribute to answering the question why wikis can be seen as something new
compared to other text types from the point of view of networking.
In order to support this argumentation, section (2) introduces those network
coefficients which are analyzed within the present comparative study. As a
preprocessing step, section (3) outlines a webgenre model which in sections
(4.1) and (4.2) is used to represent and extract instances of four types of
document networks. This allows applying the coefficients of section (2) to
these instances (section 4.3) and narrowing down wiki-based networks
(section 5). The final section concludes and outlines future work.
2 Network Analysis
For the time being, the overall structure of complex networks is investigated
in terms of Small Worlds (SW) (Newman, 2003). Since its introduction by Milgram
(1967), this notion awaited formalization as a measurable property of large
complex networks which allows distinguishing small worlds from random graphs.
Such a formalization was introduced by Watts & Strogatz (1998), who
characterize small worlds by two properties: First, other than in regular
graphs, any randomly chosen pair of nodes in a small world has, on average, a
considerably shorter geodesic distance.[1] Second, compared to random graphs,
small worlds show a considerably higher level of cluster formation.
In this framework, cluster formation is measured by means of the average
fraction of the number $\triangledown(v_i)$ of triangles connected to vertex
$v_i$ and the number $\underline{\vee}(v_i)$ of triples centered on $v_i$
(Watts and Strogatz, 1998):[2]

$$C_2 = \frac{1}{n} \sum_i \frac{\triangledown(v_i)}{\underline{\vee}(v_i)} \qquad (1)$$
Alternatively, the cluster coefficient C1 computes the fraction of the number
of triangles in the whole network and the number of its connected vertex
triples. Further, the mean geodesic distance l of a network is the arithmetic
mean of the lengths of the shortest paths between all pairs of vertices in the
network. Watts and Strogatz observe high cluster values and short average
geodesic distances in small worlds, which apparently combine cluster formation
with shortcuts as prerequisites of efficient information flow. In the area of
information networks, this property has been demonstrated for the WWW (Adamic,
1999), but also for co-occurrence networks (Ferrer i Cancho and Solé, 2001) and
semantic networks (Steyvers and Tenenbaum, 2005).
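As an illustration of the two cluster coefficients, the following minimal
Python sketch computes C1 and C2 for a small undirected graph. The function
name `clustering_coefficients` and the adjacency-dictionary input are
assumptions made here for illustration and are not part of the study. Two
conventions are also assumed: C1 is taken as three times the number of
triangles divided by the number of connected triples (the normalization used by
Newman, 2003), and vertices without any triple contribute zero to C2.

```python
from itertools import combinations


def clustering_coefficients(adj):
    """Return (C1, C2) for an undirected graph without self-loops.

    adj maps each vertex to the set of its neighbours (hypothetical input format).
    """
    triangles_at = {}  # number of triangles that vertex v takes part in
    triples_at = {}    # number of connected triples centered on v
    for v, nbrs in adj.items():
        k = len(nbrs)
        triples_at[v] = k * (k - 1) // 2
        # A pair of neighbours of v closes a triangle if the two are linked.
        triangles_at[v] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])

    n = len(adj)
    # C2: mean of the local ratios; vertices without triples count as 0 (assumption).
    c2 = sum(triangles_at[v] / triples_at[v] for v in adj if triples_at[v]) / n

    # C1: every triangle is counted once at each of its three vertices, so the
    # sum over vertices already equals 3 * (number of triangles).
    total_triples = sum(triples_at.values())
    c1 = sum(triangles_at.values()) / total_triples if total_triples else 0.0
    return c1, c2


# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
c1, c2 = clustering_coefficients(toy)
```

On the toy graph the two coefficients differ (3/5 versus 7/12), which shows why
both are reported separately in a comparative analysis.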
In addition to the SW model of Watts & Strogatz, link distributions were also
examined in order to characterize complex networks: Barabási & Albert (1999)
argue that the vertex connectivity of social networks is distributed according
to a scale-free power law. They refer to the observation, confirmed by many
social-semiotic networks but not by instances of the random graph model of
Erdős & Rényi (Bollobás, 1985), that the number of links per vertex can be
reliably predicted by a power law. Thus, the probability P(k) that a randomly
chosen vertex interacts with k other vertices of the same network is
approximately

$$P(k) \sim k^{-\gamma} \qquad (2)$$
[1] The geodesic distance of two vertices in a graph is the length of the
shortest path in-between.
[2] A triangle is a subgraph of three nodes linked to each other. Note that all
coefficients presented in the following sections relate by default to
undirected graphs.

Successfully fitting a power law to the distribution of out degrees of vertices
in complex networks indicates “that most nodes will be relatively poorly
connected, while a select minority of hubs will be very highly connected.”
(Watts, 2003, p. 107). Thus, for a fixed number of links, the smaller the γ
value, the shallower the slope of the curve in a log-log plot and the higher
the number of edges to which the most connected hub is incident.
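The relation between the exponent γ and the slope in a log-log plot can be
made concrete with a small Python sketch. It estimates γ by an ordinary
least-squares line through the points (log k, log P(k)). This is a deliberately
simple stand-in chosen for illustration; it is not the fitting procedure used
in the study, and the function names are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Empirical P(k) for all positive degrees k, as a dict sorted by k."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}


def loglog_exponent(dist):
    """Estimate gamma as the negated slope of log P(k) against log k."""
    xs = [math.log(k) for k in dist]
    ys = [math.log(p) for p in dist.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("at least two distinct degrees are needed")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic degree sequence whose frequencies follow k^-2 exactly.
degrees = [1] * 400 + [2] * 100 + [4] * 25 + [5] * 16
gamma = loglog_exponent(degree_distribution(degrees))
```

A smaller estimate corresponds to a flatter line in the log-log plot, that is,
to a distribution in which highly connected hubs are comparatively more
frequent.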
A limitation of this model is that it views the probability of linking a source
node to a target node as depending solely on the connectivity of the latter. In
contrast to this, Newman (2003) proposes a model in which this probability also
depends on the connectivity of the former. This is done in order to account for
social networks in which vertices tend to be linked if they share certain
properties (Newman and Park, 2003), a tendency which is called assortative
mixing. According to Newman & Park (2003), it allows distinguishing social
networks from non-social (e.g. artificial and biological) ones even if they are
uniformly classified as small worlds according to the model of Watts & Strogatz
(1998). Newman & Park (2003) analyze assortative mixing of vertex degrees, that
is, the correlation of the degrees of linked vertices. They confirm that this
correlation is positive in the case of social networks, but negative in the
case of technical networks (e.g. the Internet), which thus exhibit
disassortative mixing (of degrees).
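The correlation of the degrees of linked vertices can be sketched in a few
lines of Python. The sketch below computes the Pearson correlation of the
degrees found at the two ends of every edge, counting each undirected edge in
both directions. The function name and the edge-list input are assumptions made
for illustration, and a simple graph (no self-loops, no duplicate edges) is
presupposed.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of end-point degrees over all undirected edges."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1

    # Each edge contributes both orientations, which makes the measure symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so the means coincide
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # undefined when all end points have equal degree
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var


# A star is the extreme disassortative case: the hub links only to leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r_star = degree_assortativity(star)
```

Positive values indicate that well-connected vertices tend to link to other
well-connected vertices, whereas negative values indicate that hubs mainly
attach to poorly connected vertices.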
Although these SW models were applied to citation networks, WWW graphs,
semantic networks and co-occurrence graphs, and thus to a variety of linguistic
networks, a comparative study which focuses on wiki-based structure formation
in comparison to other networks of textual units is missing so far. In this
paper, we present such a study. That is, we examine SW coefficients which allow
distinguishing wiki-based systems from more “traditional” networks. In order to
do that, a generalized web document model is needed to uniformly represent the
document networks to be compared. In the following section, a webgenre model is
outlined for this purpose.
3 A Webgenre Structure Model
Linguistic structures vary with the functions of the discourses in which they
are manifested (Biber, 1995; Karlgren and Cutting, 1994). In analogy to the
weak contextual hypothesis (Miller and Charles, 1991), one might state that
structural differences reflect functional ones as far as they are confirmed by
a significantly high number of textual units and thus are identifiable as
recurrent patterns. In this sense, we expect web documents to be
distinguishable by the functional structures they manifest. More specifically,
we agree with the notion of webgenre (Yoshioka and Herman, 2000), according to
which the functional structure of web documents is determined by their
membership in genres (e.g. of conference websites, personal home pages or
electronic encyclopedias).
Our hypothesis is that what is common to instances of different webgenres is
the existence of an implicit logical document structure (LDS), in analogy to
textual units whose LDS is described in terms of section, paragraph and
sentence categories (Power et al., 2003). In the case of web documents, we
hypothesize that their LDS comprises four levels:
• Document networks consist of documents which serve possibly heterogeneous
  functions, if necessary independently of each other. A web document network
  is given, for example, by the system of websites of a university.
• Web documents manifest, typically in the form of websites, pragmatically
  closed acts of web-based communication (e.g. conference organization or
  online presentation). Each web document is seen to organize a system of
  dependent subfunctions which in turn are manifested by modules.
• Document modules are, ideally, functionally homogeneous subunits of web
  documents which manifest single, but dependent subfunctions in the sense that
  their realization is bound to the realization of other subfunctions
  manifested by the same encompassing document. Examples of such subfunctions
  are call for papers, program presentation or conference venue organization as
  subfunctions of the function of web-based conference organization.
• Finally, elementary building blocks (e.g. lists, tables, sections) only occur
  as dependent parts of document modules.
This enumeration does not imply a one-to-one mapping between functionally
demarcated manifested units (e.g. modules) and manifesting (layout) units
(e.g. web pages). Obviously, the same functional variety (e.g. of a personal
academic home page) which is mapped by a website of dozens of interlinked pages
may also be manifested by a single page. The many-to-many relation induced by
this and related examples is described in more detail in Mehler & Gleim (2005).
The central hypothesis of this paper is that genre-specific structure formation
also concerns document networks. That is, we expect them to vary with respect
to structural characteristics according to the varying functions they meet.
Thus, we do not expect that different types of document networks (e.g. systems
of genre-specific websites vs. wiki-based networks vs. online citation
networks) manifest homogeneous characteristics, but significant variations
thereof. As we concentrate on coefficients which were originally introduced in
the context of small world analyses, we expect, more concretely, that different
network types vary according to their fit to or deviation from the small world
model. As we analyze only a couple of networks, this observation is bound to
the corpus of networks considered in this study. It nevertheless hints at how
to rethink network analysis in the context of newly emerging network types such
as Wikipedia.
In order to support this argumentation, the following section presents a model
for representing and extracting document networks. After that, the SW
characteristics of these networks are computed and discussed.
4 Network Modeling and Analysis
4.1 Graph Modeling
In order to analyze the characteristics of document networks, a format for
uniformly representing their structure is needed. In this section, we present
generalized trees for this task. Generalized trees are graphs with a kernel
tree-like structure, henceforth called kernel hierarchy, superimposed by
graph-forming edges as models of hyperlinks. Figure (1) illustrates this graph
model. It distinguishes three levels of structure formation:
1. According to the webgenre model of section (3), L1-graphs map document
   networks and thus corpora of interlinked (web) documents.

In section (4.3), four sources of such networks are explored: wiki document
networks, citation networks, webgenre corpora and, for comparison with a more
traditional medium, networks of newspaper articles.

Figure 1: The stratified model of network representation with kernel
hierarchies of L2-graphs.
2. L2-graphs model the structure of web documents as constituents of a given
   network. This structure is seen to be based on kernel hierarchies
   superimposed, amongst others, by up, down and across links (see fig. 1).

In the case of webgenre corpora, L2-graphs model websites. In the case of
citation networks, they map documents which consist of a scientific article and
add-ons in the form of citation links. Likewise, in the case of online
newspapers, L2-graphs model articles together with content-based hyperlinks.
Finally, in the case of wikis, L2-graphs represent wiki documents each of which
consists of a wiki article together with a corresponding discussion and editing
page. According to the webgenre model of section (3), L2-graphs model web
documents which consist of nodes whose structuring is finally described by
L3-graphs:

3. L3-graphs model the structure of document modules.
In the case of webgenre corpora, L3-graphs map the DOM[3]-based structure of
the web pages of the websites involved. In the case of all other networks
distinguished above, they represent the logical structure of single text units
(e.g. the section and paragraph structuring of a lexicon, newspaper or
scientific article). Note that the tree-like structure of a document module may
be superimposed by hyperlinks, too, as illustrated in figure (1) by the
vertices m and n.

[3] I.e. Document Object Model.
The kernel hierarchy of an L2-graph is constituted by kernel links, which are
distinguished from across, up, down and outside links (Amitay et al., 2003;
Eiron and McCurley, 2003; Mehler and Gleim, 2005). These types can be
distinguished as follows:
• Kernel links associate dominating nodes with their immediately dominated
  successor nodes in terms of the kernel hierarchy.
• Down links associate nodes with one of their (mediately) dominated successor
  nodes in terms of the kernel hierarchy.
• Up links analogously associate nodes of the kernel hierarchy with one of
  their (mediately) dominating predecessor nodes.
• Across links associate nodes of the kernel hierarchy none of which is an
  (im-)mediate predecessor of the other in terms of the kernel hierarchy.
• Extra (or outside) links associate nodes of the kernel hierarchy with nodes
  of other documents.
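The link typology above can be sketched as a small classification procedure in
Python. The sketch assumes that the kernel hierarchy is given as a
child-to-parent map and that every node is assigned to exactly one document;
both data structures, as well as all names, are hypothetical. It further
assumes that any link to a proper ancestor counts as an up link, and it adds a
reflexive class for links from a node to itself.

```python
def ancestors(node, parent):
    """Yield the proper ancestors of node in the kernel hierarchy, nearest first."""
    while node in parent:  # the root has no entry in the parent map
        node = parent[node]
        yield node


def classify_link(u, v, parent, doc_of):
    """Classify the hyperlink u -> v relative to the kernel hierarchy."""
    if doc_of[u] != doc_of[v]:
        return "outside"    # target belongs to another document
    if u == v:
        return "reflexive"
    if parent.get(v) == u:
        return "kernel"     # u immediately dominates v
    if u in ancestors(v, parent):
        return "down"       # u mediately dominates v
    if v in ancestors(u, parent):
        return "up"         # v dominates u (assumption: immediate parent included)
    return "across"         # neither node dominates the other


# Toy conference website: home -> {cfp, program}, program -> session1.
parent = {"cfp": "home", "program": "home", "session1": "program"}
doc_of = {"home": "conf", "cfp": "conf", "program": "conf",
          "session1": "conf", "ext": "other"}
labels = [classify_link(u, v, parent, doc_of) for u, v in
          [("home", "cfp"), ("home", "session1"), ("session1", "home"),
           ("cfp", "session1"), ("cfp", "ext")]]
```

Classifying links this way requires that the kernel hierarchy has been
determined beforehand; the sketch does not address how that hierarchy is
induced from a website.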
Kernel hierarchies are exemplified by a conference website headed by a title
and menu page referring to, for example, the corresponding call for papers,
which in turn leads to pages on the different conference sessions etc., so that
finally a hierarchical structure evolves. In this example, the kernel hierarchy
evidently reflects navigational constraints. That is, the position of a page in
the tree reflects the probability of being navigated to by a reader starting
from the root page and following kernel links only.
The kernel hierarchy of a wiki document is spanned by an article page in
conjunction with the corresponding discussion (or talk), history and edit this
or view source pages, which altogether form a flatly structured tree. Likewise,
in the case of citation networks such as the CiteSeer system (Lawrence et al.,
1999), a document consists of the various (e.g. PDF or PS) versions of the
focal article as well as of one or more web pages manifesting its citations by
means of hyperlinks.
From the point of view of document network analysis, L2-graphs and interlinks
(see fig. 1) are most relevant as they span the corresponding network mediated
by documents (e.g. websites) and modules (e.g. web pages). This allows
specifying which links of which type in which network are examined in the
present study:
• In the case of citation networks, citation links are modeled as interlinks as
  they relate (scientific) articles encapsulated by documents of this network
  type. Citation networks are explored by the example of the CiteSeer system:
  We analyze a sample of more than 550,000 articles (see table 1); the basic
  population covers up to 800,000 documents.
• In the case of newspaper article networks, content-based links are explored
  as resources of networking. This is done by the example of the 1997 volume of
  the German newspaper Süddeutsche Zeitung (see table 1). That is, firstly,
  nodes are given by articles where two nodes are interlinked if the
  corresponding articles contain see also links to each other. In the online
  and ePaper issue of this newspaper these links are manifested as hyperlinks.
  Secondly, articles are linked if they appear on the same page of the same
  issue so that they belong to the same thematic field. By means of these
  criteria, a bipartite network (Watts, 2003) is built in which the top-mode is
  spanned by topic and page units, whereas the bottom-mode consists of text
  units. In such a network, two texts are interlinked whenever they relate to
  at least one common topic or appear on the same page of the same issue.
• In the case of webgenres, we explore a corpus of 1,096 conference websites
  (see tables 1 and 2), henceforth called the indogram corpus.[4] We analyze
  the out degrees of all web pages of these websites and thus explore kernel,
  up, down, across, inter and outside links on the level of L2-graphs. This is
  done in order to get a baseline for our comparative study, since WWW-based
  networks are well known for their small world behavior. More specifically,
  this relates to estimations of the exponent γ of power laws fitted to their
  degree distributions (Newman, 2003).
• These three networks are explored in order to comparatively study networking
  in Wikipedia, which is analyzed by the example of its German release
  de.wikipedia.org (see table 1). Because of the rich system of its node and
  link types (see section 4.2), we explore three variants thereof. Further, in
  order to get a more reliable picture of wiki-based structure formation, we
  also analyze wikis in the area of technical documentation. This is done by
  the example of three wikis on open source projects of the Apache Software
  Foundation (cf. wiki.apache.org).

variable                  value
number of websites        1,096
number of web pages      50,943
number of hyperlinks    303,278
maximum depth                23
maximum width             1,035
average size                 46
average width                38
average height                2

Table 2: A corpus of conference and workshop websites (counting unit: web
pages).
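The bipartite construction described above for the newspaper article network
can be sketched as a one-mode projection in Python: texts are linked whenever
they share at least one top-mode unit (a topic or a page of an issue). The
input format, a list of (text, unit) pairs, and all identifiers are assumptions
made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def project_bipartite(memberships):
    """Project a bipartite text/unit network onto its text (bottom) mode.

    memberships: iterable of (text_id, unit_id) pairs, where a unit is a topic
    or a page of an issue (hypothetical encoding).
    Returns a set of undirected edges as sorted pairs of text ids.
    """
    texts_of = defaultdict(set)
    for text, unit in memberships:
        texts_of[unit].add(text)

    edges = set()
    for texts in texts_of.values():
        # All texts attached to the same unit become pairwise interlinked.
        for a, b in combinations(sorted(texts), 2):
            edges.add((a, b))
    return edges


memberships = [("t1", "topic:A"), ("t2", "topic:A"),
               ("t2", "page:3"), ("t3", "page:3"),
               ("t4", "topic:B")]
text_edges = project_bipartite(memberships)
```

Because every unit turns its texts into a clique, such a projection tends to
produce many triangles, which is worth keeping in mind when cluster values of
projected networks are compared with those of other network types.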
In the following section, the extraction of Wikipedia-based networks is
explained in more detail.
4.2 Graph Extraction – the Case of Wiki-based Document Networks
In the following section, we analyze the network spanned by document modules of
the German Wikipedia and their interlinks.[5] This cannot simply be done by
extracting all its article pages. The reason is that Wikipedia documents
consist of modules (manifested by pages) of various types which are likewise
connected by links of different types. Consequently, the choice of instances of
these types has to be carefully considered.

[4] See http://ariadne.coli.uni-bielefeld.de/indogram/resources.html for the
list of URLs of the documents involved.
[5] We downloaded and extracted the XML release of this wiki; cf.
http://download.wikimedia.org/wikipedia/de/pagescurrent.xml.bz2.

network                   network genre                   node                                  |V|        |E|
de.wikipedia.org          electronic encyclopedia         wiki unit (e.g. article or talk)
  variant I                                                                                 303,999  5,895,615
  variant II                                                                                406,074  6,449,906
  variant III                                                                               796,454  9,161,706
wiki.apache.org/jakarta   online technical documentation  wiki unit                             916     21,835
wiki.apache.org/struts    online technical documentation  wiki unit                           1,358     40,650
wiki.apache.org/ws        online technical documentation  wiki unit                           1,042     23,871
citeseer.ist.psu.edu      digital library                 open archive record               575,326  5,366,832
indogram                  conference websites genre       web page                           50,943    303,278
Süddeutsche Zeitung 1997  press communication             newspaper article                  87,944  2,179,544

Table 1: The document networks analyzed and the sizes |V| and |E| of their
vertex and edge sets.
Table (3) lists the node types (and their frequencies) as found in the wiki or
additionally introduced into the study in order to organize the type system
into a hierarchy. One heuristic for extracting instances of node types relates
to the URL of the corresponding page. Category, portal and MediaWiki pages, for
example, contain the prefix Kategorie, Portal and MediaWiki, respectively,
separated by a colon from the page name suffix (as in
http://de.wikipedia.org/wiki/Kategorie:Musik).
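The URL-based heuristic can be sketched in Python as follows. Only the three
prefixes named above are covered; the mapping table, the function name and the
fallback to the Article type are assumptions made for illustration, since the
actual extraction distinguishes many more types (see table 3).

```python
# Prefixes mentioned in the text, mapped to node types (illustrative subset).
PREFIX_TO_TYPE = {
    "Kategorie": "Category",
    "Portal": "Portal",
    "MediaWiki": "MediaWiki",
}


def node_type_from_url(url, default="Article"):
    """Guess the node type of a wiki page from the prefix of its page name."""
    title = url.rsplit("/wiki/", 1)[-1]      # page name after the /wiki/ path
    prefix, colon, _ = title.partition(":")  # split at the first colon, if any
    if colon and prefix in PREFIX_TO_TYPE:
        return PREFIX_TO_TYPE[prefix]
    return default                           # assumption: unprefixed pages are articles


category = node_type_from_url("http://de.wikipedia.org/wiki/Kategorie:Musik")
article = node_type_from_url("http://de.wikipedia.org/wiki/Musik")
```

A prefix test of this kind cannot recognize types that are not encoded in the
page name, which is one reason why it is described only as one heuristic among
others.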
Analogously, table (4) lists the edge types either found within the wiki or
additionally introduced into the study. Of special interest are redirect nodes
and links, which manifest transitive and, thus, mediate links of content-based
units. An article node v may be linked, for example, with a redirect node r
which in turn redirects to an article w. In this case, the document network
contains two edges (v,r), (r,w) which have to be resolved to a single edge
(v,w) if redirects are to be excluded, in accordance with what the MediaWiki
system does when processing them.
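The resolution of redirects can be sketched in Python. The sketch replaces
every edge that ends in a redirect node by an edge to the redirect's target and
drops the edges that leave redirect nodes. Following chains of several
redirects and discarding redirect loops are assumptions of this sketch, as are
the function name and the input format (an edge list plus a map from redirect
nodes to their targets).

```python
def resolve_redirects(edges, redirect_target):
    """Collapse edge pairs (v, r), (r, w) into a single edge (v, w).

    edges: iterable of (source, target) pairs.
    redirect_target: dict mapping each redirect node to the node it points to.
    """
    def final_target(node):
        seen = set()
        # Follow the redirect chain; stop on a non-redirect node or on a loop.
        while node in redirect_target and node not in seen:
            seen.add(node)
            node = redirect_target[node]
        return node

    resolved = set()
    for src, dst in edges:
        if src in redirect_target:
            continue              # the (r, w) part is absorbed into (v, w)
        dst = final_target(dst)
        if dst in redirect_target:
            continue              # redirect loop: no content node is reached
        resolved.add((src, dst))
    return resolved


edges = [("v", "r"), ("r", "w"), ("v", "x")]
resolved = resolve_redirects(edges, {"r": "w"})
```

Returning a set also merges duplicates, so that an article linking to w both
directly and via r contributes a single edge.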
Based on these considerations, we compute network characteristics of three
extractions of the German Wikipedia (see table 1): Variant I consists of a
graph whose vertex set contains all Article nodes and whose edge set is based
on Interlinks and appropriately resolved Redirect links. Variant II enlarges
variant I by including other content-related wiki units, i.e. ArticleTalk,
Portal, PortalTalk, and Disambiguation pages (multiply typed nodes were
excluded). Variant III consists of a graph which covers all vertices and edges
found in the extraction.
Type                  Frequency
Documents total         796,454
Article                 303,999
RedirectNode            190,193
Talk                    115,314
  ArticleTalk            78,224
  UserTalk               30,924
  ImageTalk               2,379
  WikipediaTalk           1,380
  CategoryTalk            1,272
  TemplateTalk              705
  PortalTalk                339
  MediaWikiTalk              64
  HelpTalk                   27
Image                    97,402
User                     32,150
Disambiguation           22,768
Category                 21,999
Template                  6,794
Wikipedia                 3,435
MediaWiki                 1,575
Portal                      791
Help                         34

Table 3: The system of node types and their frequencies within the German
Wikipedia.
4.3 Network Analysis
Based on the input networks described in the previous section, we compute the
SW coefficients described in section (2). Average geodesic distances are
computed by means of the Dijkstra algorithm based on samples of 1,000 vertices
of the input networks (or the whole vertex set if it is of minor cardinality).
Power law fittings were computed based on the model $P(x) = a x^{-\gamma} + b$.
Note that table (1) does not list the cardinalities of multisets of edges and,
thus, does not count multiple edges connecting the same pair of vertices within
the corresponding input network; therefore, the numbers in table (1) do not
necessarily conform to the counts of link types in table (4). Note further that
we compute, as usually done in SW analyses, characteristics of undirected
graphs. In the case of wiki-based networks, this is justified by the
possibility to process back links in MediaWiki systems. In the case of the
CiteSeer system, this is justified by the fact that it always displays citation
and cited by links. Finally, in the case of the newspaper article network, this
is due to the fact that it is based on a bipartite graph (see above). Note that
the indogram corpus consists of predominantly unrelated websites and thus does
not allow computing cluster and distance coefficients.

Type                            Frequency
Links total                    17,814,539
Interlink                      12,818,378
  CategoryLink                  1,415,295
    Categorizes                   704,092
    CategorizedBy                 704,092
    CategoryAssociatesWith          7,111
  TopicOfTalk                     103,253
  TalkOfTopic                      88,095
  HyponymOf                        26,704
  HyperonymOf                      26,704
  InterPortalAssociation            1,796
Broken                          2,361,902
Outside                         1,276,818
  InterWiki                       789,065
  External                        487,753
Intra                           1,175,290
  Kernel                        1,153,928
  Across                            6,331
  Up                                6,121
  Reflexive                         5,433
  Down                              3,477
Redirect                          182,151

Table 4: The system of link types and their frequencies within the German
Wikipedia.
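The sampling scheme for average geodesic distances can be sketched in Python.
Since the graphs are treated as undirected and unweighted, the sketch uses
breadth-first search, which yields the same shortest path lengths as the
Dijkstra algorithm in this setting; this substitution, the function names, the
fixed random seed and the decision to skip unreachable pairs are assumptions of
the sketch rather than details reported in the study.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sampled_mean_geodesic(adj, sample_size=1000, seed=0):
    """Mean geodesic distance estimated from a sample of source vertices.

    The whole vertex set is used if it has at most sample_size elements.
    Pairs without a connecting path are ignored (assumption).
    """
    vertices = list(adj)
    if len(vertices) > sample_size:
        vertices = random.Random(seed).sample(vertices, sample_size)
    total = pairs = 0
    for s in vertices:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


# Path graph a - b - c: distances 1, 1 and 2, hence a mean of 4/3.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
mean_l = sampled_mean_geodesic(path)
```

Sampling source vertices keeps the cost at one traversal per sampled vertex,
which is what makes the computation feasible for networks with several hundred
thousand vertices.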
5 Discussion
The numerical results in table (5) are remarkable as they allow identifying
three types of networks:
• On the one hand, we observe the extreme case of the Süddeutsche Zeitung, that
  is, of the newspaper article network. It is the only network which, at the
  same time, has very high cluster values, short geodesic distances and a high
  degree of assortative mixing. Thus, its values support the assertion that it
  behaves as a small world in the sense of the model of Watts & Strogatz. The
  only exception is the remarkably low γ value, where, according to the model
  of Barabási & Albert (1999), a higher value was expected.
• On the other hand, the CiteSeer sample is the reverse case: It has very low
  values of C1 and C2, tends to show neither assortative nor disassortative
  mixing, and at the same time has a low γ value. The small cluster values can
  be explained by the low probability with which two authors cited by a focal
  article are related by a citation relation on their own.[6]
• The third group is given by the wiki-based networks: They tend to have higher
  C1 and C2 values than the citation network does, but also tend to show
  stochastic mixing and short geodesic distances. The cluster values are
  confirmed by the wikis of technical documentation (also w.r.t. their
  numerical order). Thus, these wikis tend to be small worlds according to the
  model of Watts & Strogatz, but also exhibit disassortative mixing, comparable
  to technical networks but in departure from social networks. Consequently,
  they are ranked in-between the citation and the newspaper article network.

[6] Although articles can be expected which cite, for example, de Saussure and
Chomsky, there certainly exist far fewer citations of de Saussure in articles
of Chomsky.

All these networks show rather short geodesic distances. Thus, l seems to be
inappropriate with respect to distinguishing them in terms of SW
characteristics. Further, all these examples show remarkably low values of the
γ coefficient. In contrast to this, power laws as fitted in the analyses
reported by Newman (2003) tend to have much higher exponents: Newman reports on
values which range between 1.4 and 3.0. This result is only realized by the
indogram corpus of conference websites, thus, by a sample of WWW documents
whose out degree distribution is fitted by a power law with exponent
γ = 2.562.

These findings support the view that, compared to WWW-based networks, wiki
systems behave more like “traditional” networks of textual units, but are new
in the sense that their topology approximates neither the one of citation
networks nor the one of content-based networks of newspaper articles. In other
words: As intertextual relations are genre-sensitive (e.g. citations in
scientific communication vs. content-based relations in press communication
vs. hyperlinks in online encyclopedias), networks based on such relations seem
to inherit this genre sensitivity. That is, for varying genres (e.g. of
scientific, technical or press communication), differences in topological
characteristics of their instance networks are expected. The study presents
results in support of this view of the genre sensitivity of text-based
networks.

6 Conclusion
We presented a comparative study of document networks based on small world
characteristics.
According to our findings, three classes of networks were distinguished. This
classification separates wiki-based systems from more traditional text networks
but also from WWW-based webgenres. Thus, the study provides evidence that there
exist genre-specific characteristics of text-based networks. This raises the
question of models of network growth which better account for these findings.
Future work aims at elaborating such a model.

instance                  type          ⟨d⟩      l       γ        C1        C2       r
Wikipedia variant I       undirected  19.39  3.247  0.4222  0.009840  0.223171  −0.10
Wikipedia variant II      undirected  15.88  3.554  0.5273  0.009555  0.186392  −0.09
Wikipedia variant III     undirected  11.50  4.004  0.7405  0.007169  0.138602  −0.05
wiki.apache.org/jakarta   undirected  23.84  4.488  0.2949  0.193325  0.539429  −0.50
wiki.apache.org/struts    undirected  29.93  4.530  0.2023  0.162044  0.402418  −0.45
wiki.apache.org/ws        undirected  22.91  4.541  0.1989  0.174974  0.485342  −0.48
citeseer.ist.psu.edu      undirected   9.33  4.607  0.9801  0.027743  0.067786  −0.04
indogram                  directed     5.95    ×××  2.562       ×××       ×××    ×××
Süddeutsche Zeitung       undirected  24.78  4.245  0.1146  0.663973  0.683839  0.699

Table 5: Numerical values of SW-related coefficients of structure formation in
complex networks: the average number ⟨d⟩ of edges per node, the mean geodesic
distance l, the exponent γ of successfully fitted power laws, the cluster
values C1, C2 and the coefficient r of assortative mixing.

References
Lada A. Adamic. 1999. The small world of web. In Serge Abiteboul and Anne-Marie Vercoustre, editors, Research and Advanced Technology for Digital Libraries, pages 443–452. Springer, Berlin.
Einat Amitay, David Carmel, Adam Darlow, Ronny Lempel, and Aya Soffer. 2003. The connectivity sonar: detecting site functionality by structural patterns. In Proc. of the 14th ACM conference on Hypertext and Hypermedia, pages 38–47.
Albert-László Barabási and Réka Albert. 1999. Emergence of scaling in random networks. Science, 286:509–512.
Douglas Biber. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press, Cambridge.
Béla Bollobás. 1985. Random Graphs. Academic Press, London.
Nadav Eiron and Kevin S. McCurley. 2003. Untangling compound documents on the web. In Proceedings of the 14th ACM conference on Hypertext and Hypermedia, Nottingham, UK, pages 85–94.
Ramon Ferrer i Cancho and Ricard V. Solé. 2001. The small-world of human language. Proceedings of the Royal Society of London. Series B, Biological Sciences, 268(1482):2261–2265, November.
Jussi Karlgren and Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proc. of COLING '94, volume II, pages 1071–1075, Kyoto, Japan.
Steve Lawrence, C. Lee Giles, and Kurt Bollacker. 1999. Digital libraries and Autonomous Citation Indexing. IEEE Computer, 32(6):67–71.
Alexander Mehler and Rüdiger Gleim. 2005. The net for the graphs: towards webgenre representation for corpus linguistic studies. In Marco Baroni and Silvia Bernardini, editors, WaCky! Working papers on the Web as corpus. Gedit, Bologna, Italy.
Stanley Milgram. 1967. The small-world problem. Psychology Today, 2:60–67.
George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
Mark E. J. Newman and Juyong Park. 2003. Why social networks are different from other types of networks. Physical Review E, 68:036122.
Mark E. J. Newman. 2003. The structure and function of complex networks. SIAM Review, 45:167–256.
Richard Power, Donia Scott, and Nadjet Bouayad-Agha. 2003. Document structure. Computational Linguistics, 29(2):211–260.
Mark Steyvers and Josh Tenenbaum. 2005. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):41–78.
Duncan J. Watts and Steven H. Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature, 393:440–442.
Duncan J. Watts. 2003. Six Degrees. The Science of a Connected Age. Norton & Company, New York.
Takeshi Yoshioka and George Herman. 2000. Coordinating information using genres. Technical report, Massachusetts Institute of Technology, August.
