Computational Linkuistics: word triggers across hyperlinks
Dragomir R. Radev¹², Hong Qi¹, Daniel Tam², Adam Winkel²
¹ School of Information and ² Department of EECS
University of Michigan
Ann Arbor, MI 48109-1092
{radev, hqi, dtam, winkel}@umich.edu
Abstract
It is known that content words tend to be self-triggers, that is, the probability that a content word appears again in a document, given that it has already appeared once, is significantly higher than the probability of its first occurrence. We look at self-triggerability across hyperlinks on the Web. We show that the probability of a word $w_i$ appearing in a Web document $d_j$ depends on the presence of $w_i$ in documents pointing to $d_j$. For Document Modeling, we propose the use of a correction factor, $R$, which indicates how much more likely a word is to appear in a document given that another document containing the same word links to it.
1 Introduction
Given the size of the Web, it is intuitively very hard to find a given page of interest by just following links. Classic results have shown, however, that the link structure of the Web is not random. Various models have been proposed, including power-law distributions (the “rich get richer” model) and lexical models. In this paper, we will investigate how the presence of a given word in a given Web document $d_j$ affects the presence of the same word in documents linked to $d_j$. We will use the term Computational Linkuistics to describe the study of hyperlinks for Document Modeling and Information Retrieval purposes.
1.1 Link structure of the Web
Random graphs have been studied by Erdős and Rényi (Erdős and Rényi, 1960). In a random graph, edges are added sequentially, with both vertices of a new edge chosen at random.
The diameter $d$ of the Web (that is, the average number of links from any given page to another) has been found to be a constant, approximately $18.59 \approx \log N / \log k$, where $N$ is the number of documents on the Web and $k$ is the average document out-degree (i.e., the number of pages linked from the document). This result was described in (Barabási and Albert, 1999) and is based on a corpus of 800 M web pages. This estimate of $d$ would entail that, in a random graph model, the size of the Web would be approximately $10^{16}$, which is 10 M times its actual size.
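As a rough check of this claim (our own back-of-the-envelope sketch, assuming the average out-degree of roughly 7 reported by Barabási and Albert rather than any value given in this paper), one can compute the number of documents a random graph with this diameter would have to contain:

```python
d = 18.59       # measured average shortest-path length of the Web
k = 7           # assumed average out-degree (reported as roughly 7)
n_actual = 8e8  # size of the corpus behind the measurement

# In a random graph, d ~ log N / log k, so a diameter of 18.59 implies
# N ~ k^d documents.
n_random = k ** d
print(f"implied random-graph size: {n_random:.1e}")       # ~5e15
print(f"ratio to actual size: {n_random / n_actual:.1e}") # ~6e6, close to 10M
```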
Clearly, a random graph model is not an appropriate description of the Web. Instead, it has been shown that, due to preferential attachment (Barabási and Albert, 1999), the out-degree distribution follows a power law. The preferential model makes it more likely that a new random edge will connect two vertices that already have a high degree. Specifically, the degree of pages is distributed according to $P(d = k) \propto 1/k^{\gamma}$, where $\gamma$ is a constant strictly greater than 0. (Note that this is different from $1/\gamma^{k}$, the exponentially decaying distribution of out-degree on random graphs.) As a result, random walks on the Web graph soon reach well-connected nodes.
1.2 Lexical structure of the Web
Davison (Davison, 2000) discusses the topical locality hypothesis, namely that new edges are more likely to connect pages that are semantically related. In Davison's experiment, semantic and link distances between pairs of pages from a 100 K page corpus were computed. Davison describes results associating TF*IDF cosine similarity (Salton and McGill, 1983) with link hop distance. He reports that the cosine similarity between pages selected at random from his corpus is 0.02, whereas that number increases significantly for topologically related pages: 0.31 for pages from the same Web domain, 0.23 for linked pages, and 0.19 for sibling pages (pages pointed to by the same page).
Menczer (Menczer, 2001) introduces the link-content conjecture, which states that the semantic content of a web page can be inferred from the pages that point to it. Menczer uses a corpus of 373 K pages and employs a non-linear least squares fit to come up with a semantic model connecting the cosine-based semantic similarity $S(p_1, p_2)$ and the link distance $\delta(p_1, p_2)$ between two pages $p_1$ and $p_2$ (the shortest directed distance on the hypertext graph from $p_1$ to $p_2$). Menczer reports that $S$ and $\delta$ are connected via a power law:
$$S(\delta) \approx S_0 + (1 - S_0)\, e^{-\alpha_1 \delta^{\alpha_2}},$$
where $S_0$ represents the noise level in similarity.
Menczer reports empirically determined values of the parameters of the fit as follows: $\alpha_1 = 1.8$, $\alpha_2 = 0.6$, and $S_0 = 0.03$.
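To illustrate how such a fit can be obtained (a minimal sketch with made-up similarity measurements, not Menczer's actual data or code), a non-linear least squares fit of this functional form can be computed with scipy:

```python
import numpy as np
from scipy.optimize import curve_fit

def similarity(delta, alpha1, alpha2, s0):
    """Decay of semantic similarity with link distance delta."""
    return s0 + (1.0 - s0) * np.exp(-alpha1 * delta ** alpha2)

# Hypothetical mean cosine similarities at each link distance.
delta = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
sim = np.array([0.23, 0.12, 0.08, 0.06, 0.05, 0.04, 0.04])

(alpha1, alpha2, s0), _ = curve_fit(similarity, delta, sim,
                                    p0=[1.0, 0.5, 0.02],
                                    bounds=(0.0, [10.0, 5.0, 1.0]))
print(f"alpha1={alpha1:.2f} alpha2={alpha2:.2f} S0={s0:.3f}")
```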
Menczer's results further confirm Davison's observations that pages adjacent in hyperlink space to a given page are semantically connected.
Our idea has been to investigate the circumstances under which the semantic similarity between linked pages can be explained in terms of the presence of individual words across links.
1.3 Document modeling
In the computational linguistics and speech communities, the notion of a language model is used to describe a probability distribution over words. Since a cluster of documents contains a subset of an entire language, a document model is a special case of a language model. As such, it can be expressed as a conditional probability distribution indicating how likely a word is to appear in a document given some context (e.g., other similar documents, the topic of the document, etc.). Language models are used in speech recognition (Chen and Goodman, 1996), document indexing (Bookstein and Swanson, 1974; Croft and Harper, 1979), and information retrieval (Ponte and Croft, 1998).
Document models are a special class of language models. One property of document models is that they can be used to predict certain lexical properties of textual documents, e.g., the frequency of a given word. Mosteller and Wallace (Mosteller and Wallace, 1984) discovered that content words are “bursty”: the appearance of a content word significantly increases the probability that the word will appear again. Church and his colleagues (Church and Gale, 1995; Church, 2000) describe document models based on the distribution of the frequencies of individual words over large document collections. In (Church and Gale, 1995), Church and Gale compare document models based on the Poisson distribution, the 2-Poisson distribution (Bookstein and Swanson, 1974), as well as generic Poisson mixtures. A Poisson mixture is described by
$$P(x) = \int_0^{\infty} \phi(\lambda)\, p(\lambda, x)\, d\lambda,$$
where $p(\lambda, x) = e^{-\lambda} \lambda^{x} / x!$ for a given non-negative integer value of $x$.
Church and Gale empirically show that Poisson mixtures are a more accurate model for describing the distribution of words in documents within a corpus. They obtain the best fits with the Negative Binomial model and the K-mixture, both special cases of Poisson mixtures (Church and Gale, 1995). In the Negative Binomial case,
$$\phi(\lambda) = \frac{\lambda^{a-1}\, e^{-\lambda/b}}{b^{a}\, \Gamma(a)}$$
(which is the Gamma distribution), whereas in the K-mixture,
$$\phi(\lambda) = (1 - \alpha)\, \delta(\lambda) + \frac{\alpha}{\beta}\, e^{-\lambda/\beta},$$
where $\delta(x)$ is Dirac's delta function.
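To make the comparison concrete, here is a small sketch (with illustrative parameter values, not fitted to any corpus) of a plain Poisson probability alongside the standard closed form that results from mixing a Poisson with the delta-plus-exponential density above:

```python
import math

def poisson_pmf(k, lam):
    """Probability that a word occurs k times under a Poisson model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def k_mixture_pmf(k, alpha, beta):
    """Closed form of the K-mixture: a point mass at zero occurrences
    plus a geometric tail (follows from the mixing density above)."""
    at_zero = 1.0 if k == 0 else 0.0
    return ((1.0 - alpha) * at_zero
            + (alpha / (beta + 1.0)) * (beta / (beta + 1.0)) ** k)

# Illustrative comparison for k = 0..4 occurrences:
for k in range(5):
    print(k, round(poisson_pmf(k, 0.5), 4), round(k_mixture_pmf(k, 0.3, 2.0), 4))
```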
Our study focuses on modeling across hyperlinks. Documents linked across the Web are often written by people with different backgrounds and language usage patterns.
1.4 Link-based document models
In Church et al.'s experiments, the documents being modeled do not have hyperlinks between them. When modeling hyperlinked corpora, it is important to decompose the document model into link-free and link-dependent components. The link-free component predicts the probability of a word $w$ appearing in a document $D$ regardless of the documents that point to $D$. The link-dependent part makes use of a particular incarnation of the link-content conjecture, namely micro link-content dependency (MLD), which we will propose in this paper.
1.5 Our framework
In traditional Information Retrieval, the main object that is represented, and searched, is the document. In our setup, we will be looking at the hyperlink between two documents as the main object to retrieve. If a page $p_i$ points to page $p_j$ via link $l_k$, then we will consider $l_k$ as the object to index, and the two pages that it links as features describing the link.
For our experiments, we used the 2-Gigabyte wt2g corpus (Hawking, 2002), which contains 247,491 Web documents connected by 3,118,248 links. These documents contain 948,036 unique words (after Porter-style stemming).
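A minimal sketch of what such a link-centric representation might look like (this data structure is our own illustration, not an interface defined in the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    """A hyperlink as the primary retrieval object; the two pages it
    connects serve as its features."""
    link_id: int
    source_page: str         # URL of the pointing page p_i
    target_page: str         # URL of the pointed-to page p_j
    source_terms: frozenset  # stemmed words of p_i
    target_terms: frozenset  # stemmed words of p_j

link = Link(1, "http://a.example/", "http://b.example/",
            frozenset({"educ", "school"}), frozenset({"educ", "web"}))
```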
2 A link-based document model
It is well known that the distributions of words in text depend on many factors such as genre, topic, author, etc. Certain words with high content have been found to “trigger” other words to appear. Interestingly, the hyperlinks which connect text on the Web may also affect the word distributions in the hypertext. For example, if a page $p_i$ that contains education points to page $p_j$, then we would expect a higher probability of seeing education in page $p_j$ than in a random page. Our experiment was designed to discover how the links between pages can trigger words and change the word distributions.
For each stemmed word in wt2g, we compute the following numbers:

PagesContainingWord = how many pages in the collection contain the word.

OutgoingLinks = the total number of outgoing links in all the pages that contain the word.

LinkedPagesContainingWord = how many of the linked pages contain the word.

For the latter two measures, only the links inside the collection were considered.
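A sketch of how these counts can be gathered, over a toy three-page collection rather than wt2g (the representation and names are ours):

```python
# Toy collection: page id -> (set of stemmed words, outgoing page ids).
pages = {
    "p1": ({"educ", "school"}, ["p2", "p3"]),
    "p2": ({"educ", "web"},    ["p3"]),
    "p3": ({"web", "link"},    []),
}

def word_statistics(pages, word):
    pages_containing_word = 0
    outgoing_links = 0
    linked_pages_containing_word = 0
    for terms, outlinks in pages.values():
        if word in terms:
            pages_containing_word += 1
            for target in outlinks:
                if target in pages:  # count only links inside the collection
                    outgoing_links += 1
                    if word in pages[target][0]:
                        linked_pages_containing_word += 1
    return pages_containing_word, outgoing_links, linked_pages_containing_word

print(word_statistics(pages, "educ"))  # (2, 3, 1)
```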
The probability of a word $w$ appearing in a random page $p_i$ is computed as
$$P_{prior} = P(w \in p_i) = \frac{\mathit{PagesContainingWord}}{\mathit{TotalPages}},$$
where TotalPages = 247,491. If $p_i$ contains the word $w$ and points to a page $p_j$, then the probability of the word $w$ appearing in $p_j$ is computed as
$$P_{posterior} = P(w \in p_j \mid p_i \rightarrow p_j \wedge w \in p_i) = \frac{\mathit{LinkedPagesContainingWord}}{\mathit{OutgoingLinks}}.$$
For instance, in the wt2g corpus there are 55,654 pages that contain the word each, and these pages have a total of 46,163 links pointing to pages in the collection, 15,815 of which contain the word each. Therefore, its prior probability is $\frac{55{,}654}{247{,}491} = 0.2249$, and its posterior probability is $\frac{15{,}815}{46{,}163} = 0.3426$.
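In code, the prior, the posterior, and their ratio for each are a direct transcription of the formulas above:

```python
TOTAL_PAGES = 247_491

pages_containing = 55_654   # pages containing the stemmed word "each"
outgoing_links = 46_163     # in-collection links leaving those pages
linked_containing = 15_815  # link targets that also contain "each"

p_prior = pages_containing / TOTAL_PAGES          # ~0.2249
p_posterior = linked_containing / outgoing_links  # ~0.3426
print(p_prior, p_posterior, p_posterior / p_prior)  # link effect R ~ 1.52
```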
We are interested in the ratio of posterior over prior
probability for each stemmed word and would like to see
if there is any interesting relationship between this ratio
and other linguistic features.
We will look at the ratio $R = P_{posterior} / P_{prior}$ (the link effect), which describes how much more likely a linked page is to contain a given word than a random page.
IDF (Inverse Document Frequency) values based on the wt2g corpus are also computed. We compute IDF using the formula
$$idf(w) = -\log_2 df(w),$$
where $df(w)$ is the document frequency of $w$, i.e., the fraction of the $n_d$ documents in the collection that contain $w$.
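The IDF computation over the same counts, using the $-\log_2$ form given above (a one-line sketch in our notation):

```python
import math

N_DOCS = 247_491

def idf(n_pages_containing_word):
    """IDF as the negative log2 of the document-frequency fraction."""
    return -math.log2(n_pages_containing_word / N_DOCS)

print(round(idf(55_654), 2))  # "each": ~2.15
```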
2.1 Results and Discussion
Table 1 shows the different measures for the 2000 words with the lowest IDF. Each line shows the average values over a chunk of 100 words.
As one can see in the table, the posterior probabilities are always higher than the prior probabilities. Hypothesis testing shows that the difference between prior and posterior is statistically significant, which verifies our assumption. It is noticeable that the link effect follows the same trend as the IDF values; the correlation coefficient of these two columns is 0.7112. It is customary to use IDF as an indicator of a word's content: low IDF usually implies a low content value. We would like to investigate whether the link effect can be used instead of IDF for certain IR tasks. Consider the sample words between and american in Table 1. Intuitively, american has more content than between, but the latter has an IDF of 2.37, higher than that of the former (2.36). Their link effects, however, agree with intuition: american, at 1.97, is one standard deviation higher than between, at 1.40.
Table 2 compares the link effects for two ranges of sample words with roughly the same IDF values within each range, showing the words ordered by IDF and by link effect ($R$). As one can see, the link effect tends to be high for content words even when the IDF value alone cannot discriminate among them.
                IDF ≈ 3                                  IDF ≈ 4
   sorted by IDF      sorted by R           sorted by IDF       sorted by R
   word      IDF      word        R         word        IDF     word         R
   human     2.981    close     1.675       centuri     3.988   extend    2.085
   accord    2.983    among     1.770       interact    3.990   beyond    2.477
   perform   2.984    further   1.796       introduct   3.993   front     2.606
   close     2.985    expect    1.864       front       3.994   centuri   2.713
   press     2.992    accord    1.922       travel      3.997   elimin    2.753
   applic    2.992    assist    1.962       elimin      4.009   damag     2.757
   expect    2.997    human     2.093       opinion     4.013   introduct 2.843
   among     2.998    perform   2.095       damag       4.017   opinion   2.984
   assist    3.004    applic    2.203       beyond      4.019   travel    3.491
   further   3.011    press     2.388       extend      4.021   interact  3.527

Table 2: IDF vs. link effect R
Figure 1 describes a linear fit of $R$ over the 2000 words with the lowest IDF in our corpus. A very clear trend can be observed: over most words, the value of $R$ is almost constant. When we looked only at the top 100 or 200 words, the trend was even cleaner. However, with 2000 words one cannot help but notice a number of outliers in the left-hand part of the figure. We ran a K-means clusterer (with K=2) to identify two clusters of words. The clusterer stopped after 32 iterations, having identified the two clusters (Figures 2 and 3), each with a very clear trend. Their means are 1.86 and 3.57, respectively.
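A sketch of the clustering step, with synthetic (prior, posterior) points standing in for the real wt2g values and scikit-learn's KMeans standing in for whatever clusterer was actually used (the paper does not name one):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (prior, posterior) pairs: two linear trends with slopes
# near the reported cluster means of 1.86 and 3.57.
rng = np.random.default_rng(0)
prior = rng.uniform(0.01, 0.25, size=2000)
slope = rng.choice([1.86, 3.57], size=2000)
posterior = np.clip(slope * prior + rng.normal(0.0, 0.01, size=2000), 0.0, 1.0)

points = np.column_stack([prior, posterior])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Slope of a least-squares line through the origin for each cluster;
# a rough stand-in for the per-cluster linear fits in Figures 2 and 3.
for c in (0, 1):
    x, y = prior[labels == c], posterior[labels == c]
    print(f"cluster {c}: slope {(x @ y) / (x @ x):.2f}")
```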
[Figure 1: Linear fit for the 2000 lowest-IDF words. The X axis represents the prior probability; the Y axis corresponds to the posterior probability.]
3 Conclusion
In this paper we discussed some properties of hyperlinked Web documents. We showed that the probability of a word $w_i$ appearing in a Web document $d_j$ depends on the presence of $w_i$ in documents pointing to $d_j$.
Words          Prior    Posterior  IDF       Link effect R  Sample words
1-100          0.4047   0.5293     1.6761    1.3639         the, of, make, and
101-200        0.2141   0.3574     2.3803    1.6745         under, go, between, american
201-300        0.1688   0.3209     2.6896    1.9047         market, subject, special, mean
301-400        0.1386   0.2876     2.9513    2.0750         administr, put, establish, ask
401-500        0.1192   0.2588     3.1548    2.1750         understand, social, hand, share
501-600        0.1046   0.2426     3.3326    2.3179         prevent, staff, risk, north
601-700        0.0934   0.2246     3.4879    2.4085         trade, class, size, california
701-800        0.0839   0.2201     3.6354    2.6233         global, drug, letter, softwar
801-900        0.0752   0.2004     3.7884    2.6668         sound, tool, monitor, transport
901-1000       0.0669   0.2024     3.9499    3.0200         permit, target, east, normal
1001-1100      0.0605   0.1823     4.0909    3.0149         approxim, telephon, danger, europ
1101-1200      0.0548   0.1710     4.2292    3.1213         favor, richard, map, pictur
1201-1300      0.0498   0.1752     4.3635    3.5210         professor, earth, english, republican
1301-1400      0.0454   0.1652     4.4934    3.6366         medicin, doctor, church, color
1401-1500      0.0416   0.1630     4.6166    3.9224         permiss, agenda, programm, prioriti
...            ...      ...        ...       ...            ...
100001-100100  0.0000   0.0642     12.4774   363.7331       sinker, surmont, thong, undergrowth
500001-500100  0.0000   0.0215     16.9270   2658.9231      scheflin, schena, schendel, scheriff

Table 1: Some measurements over 20 sets of 100 words among the 2000 lowest-IDF words, plus 2 sets of 100 words among the words of higher IDF
[Figure 2: Linear fit for Cluster 1, which contains many low-IDF words such as by, with, from, etc.]
[Figure 3: Linear fit for Cluster 2. Sample words from this cluster are photo, dream, path, etc.]
References
Albert-László Barabási and Réka Albert. 1999. Emergence of scaling in random networks. Science, 286:509–512.
Abraham Bookstein and Don Swanson. 1974. Probabilistic models for automatic indexing. Journal of the American Society for Information Science, 25(5):118–132.
Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In ACL-96, pages 310–318, Santa Cruz, CA. ACL.
Kenneth Church and William Gale. 1995. Poisson mixtures. Natural Language Engineering.
Kenneth Church. 2000. Empirical estimates of adaptation: the chance of two Noriegas is closer to $p/2$ than $p^2$. In COLING, Saarbrücken, Germany, August.
W. Bruce Croft and David J. Harper. 1979. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35:285–295.
Brian Davison. 2000. Topical locality in the Web. In SIGIR 2000, Athens, Greece, July.
P. Erdős and A. Rényi. 1960. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61.
David Hawking. 2002. Web research collections – TREC Web track. http://www.ted.cmis.csiro.au/TRECWeb/.
Filippo Menczer. 2001. Links tell us about lexical and
semantic web content. http://arxiv.org/cs.IR/0108004.
Frederick Mosteller and David L. Wallace. 1984. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Springer Series in Statistics, Springer-Verlag.
Jay Ponte and Bruce Croft. 1998. A language modeling approach to information retrieval. In SIGIR 1998, pages 275–281, Melbourne, Australia, August.
Gerard Salton and Michael J. McGill. 1983. Introduction
to Modern Information Retrieval. McGraw-Hill, New
York, NY.
