A Method for Relating Multiple Newspaper Articles by Using 
Graphs, and Its Application to Webcasting 
Naohiko Uramoto and Koichi Takeda 
IBM Research, Tokyo Research Laboratory 
1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken 242 Japan 
{ uramoto, takeda } @trl. ibm. co.j p 
Abstract 
This paper describes methods for relating (thread- 
ing) multiple newspaper articles, and for visualizing 
various characteristics of them by using a directed 
graph. A set of articles is represented by a set of 
word vectors, and the similarity between the vec- 
tors is then calculated. The graph is constructed 
from the similarity matrix. By applying some con- 
straints on the chronological ordering of articles, an 
efficient threading algorithm that runs in O(n) time 
(where n is the number of articles) is obtained. The 
constructed graph is visualized with words that rep- 
resent the topics of the threads, and words that rep- 
resent new information in each article. The thread- 
ing technique is suitable for Webcasting (push) ap- 
plications. A threading server determines relation- 
ships among articles from various news sources, and 
creates files containing their threading information. 
This information is represented in eXtended Markup 
Language (XML), and can be visualized on most 
Web browsers. The XML-based representation and 
a current prototype are described in this paper. 
1 Introduction 
The vast quantity of information available today 
makes it difficult to search for and understand the 
information that we want. If there are many related 
documents about a topic, it is important to capture 
their relationships so that we can obtain a clearer 
overview. However, most information resources, in- 
cluding newspaper articles do not have explicit re- 
lationships. For example, although documents on 
the Web are connected by hyperlinks, relationships 
cannot be specified. 
Webcasting ("push") applications such as Point- 
cast i constitute a promising solution to the prob- 
lem of information overloading, but the articles they 
provide do not have links, or else must be manually 
linked at a high cost in terms of time and effort. 
This paper describes methods for relating news- 
paper articles automatically, and its application for 
a Webcasting application. A set of article on a par- 
I htt p://www.pointcast.com 
ticular topic is ordered chronologically, and the re- 
sults are represented as a directed graph. There are 
various ways of relating documents and visualizing 
their structure. For example, USENET articles can 
be accessed by means of newsreader software. In the 
system, a label (title) is attached to each posted mes- 
sage, specifying whether it deals with a new topic or 
is a reply to a previous message. A chain of articles 
on a topic is called a thread. In this case, the rela- 
tionships between the articles are explicitly defined. 
This post/reply-based approach makes it possible for 
a reader to group all the messages on a particular 
topic. However, it is difficult to capture the story of 
the thread from its thread structure, since appropri- 
ate titles are not added to the messages. 
This paper aims to provide ways of relating mul- 
tiple news articles and representing their structure 
in a way that is easy to understand and computa- 
tionally inexpensive. A set of relationships is defined 
here as a directed graph. A node indicates an arti- 
cle, and an arc from node X to Y indicates that the 
article X is followed by Y (or that X is adjacent to 
Y). An article contains both known and unknown 
(new) information. Known information consists of 
words shared by the beginning and ending points of 
an arc. When node X is adjacent to Y, the words 
are represented by (X fq Y). The known information 
is called genus words in this paper. Even if an article 
follows another one, it generally contains some new 
information. This information can be represented 
by subtraction (Y- X) (Damashek, 1995), and is 
called differentia words, by analogy with definition 
sentences in dictionaries, which contain genus words 
and differentia. In this paper, genus and differentiae 
words are used to calculate the similarities between 
two articles, and to visualize topics in a set of arti- 
cles. 
Since articles are ordered chronologically, there 
are some time constraints on the connectivity of 
nodes. A graph is created by constructing an ad- 
jacency matrix for nodes, which in turn is created 
from a similarity matrix for nodes. 
Some potential features of articles in a set can be 
determined by analyzing some formal aspects of the 
1307 
d2 d3 
od5 .od6 
Figure 1: Example of a Directed Graph G 
corresponding graph. For example, the paths in the 
graph show the stories of the nodes they contain. 
Multiple paths for a node (article) show that there 
are multiple stories associated with it. Furthermore, 
if the node has a long path, it is in the "main stream" 
of the topic represented by the graph. An efficient 
algorithm for finding such paths is described, later 
in the paper. 
Application of the threading method to docu- 
ments on the Web would be very useful because, al- 
though such documents are connected by hyperlinks, 
their relationships cannot be specified. In this paper, 
generated threads by this method are represented in 
eXtended Markup Language (XML) (XML, 1997), 
which is the proposed standard for exchange of in- 
formation on the Web. XML-based threads can be 
used by webcasting or push services, since various 
tools for parsing and visualizing threads are avail- 
able. 
In Section 2, a directed graph structure for arti- 
cles is defined, and the procedure for constructing a 
directed graph is described in Section 3. In Section 
4, some features of the created graph are discussed. 
Section 5 introduces a webcasting application by us- 
ing the threading technique, and Section 6 concludes 
the paper. 
2 Definition of a Graph Structure 
A set of articles is represented as an ordered set V: 
V = {dx,d2,...,d,}. 
The suffix sequence 1, 2,..., n represents the pas- 
sage of time. Article di is older than di+l. The order 
is obtained from the publication dates of the articles. 
Different time points arbitrarily are assigned to ar- 
ticles published on the same day. 
Related articles are represented as a directed 
graph (V,A). V is a set of nodes. A is a set of 
ordered pairs (i, j), where i and j are members of 
V. Figure 1 shows an example of a directed graph. 
In this case, the graph is represented as follows: 
V = {dl,d2,d3,d4,ds,d6,d6,d7}, A = {(dl,d2), 
(d2, d3), (dl, d4), (d5, d6), (d2, dT), (d3, ds), (dT, ds)} 
The nodes are ordered chronologically. The fol- 
lowing constraint is introduced into the graph: 
M = 
dl 
d2 
d3 
d4 
45 
d6 
d7 
ds 
dx d2 d3 d4 d5 d6 d7 ds 
0 1 0 1 0 0 0 0 
0 0 1 0 0 0 1 0 
0 0 0 0 0 0 0 1 
0 0 0 0 0 0 0 0 
0 0 0 0 0 1 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 
0 0 0 0 0 0 0 0 
Figure 3: Adjacency Matrix Mc of G 
Constraint 1 
For (di,dj) 6 A, i < j 
The constraint simply shows that an old article 
cannot follow a new one. 
3 Creating a Graph Structure for 
Articles 
This section describes how to construct a directed 
graph structure from a set of articles. Any directed 
graph can be represented by a matrix. Figure 3 
shows the adjacency matrix MG of the graph G in 
Figure 1. 
For example, a value of "1" for the (1, 2) element 
in M indicates that dx is adjacent to d2. Since an 
article cannot follow itself, the value of (i, i) elements 
is "0". From the time constraint defined in Section 
3, MG is an upper triangle matrix. 
The following is a procedure for constructing a 
directed graph for related articles: 
1. Calculate the similarity and difference between 
articles. 
2. Construct a similarity matrix. 
3. Convert the matrix into an adjacency matrix. 
In the next section, each step is illustrated by us- 
ing the set of articles V in Figure 2 on the subject 
of nuclear testing taken from the Nikkei Shinbun. 2 
3.1 Calculating the similarities and 
differences between articles 
The function sim(di,dj) calculates the word-based 
similarity between two articles. It is defined on the 
basis of Salton's Vector Space Model (Salton, 1968). 
Words are extracted from an article by using a mor- 
phological analyzer. Next, nouns and verbs are ex- 
tracted as keywords. 
_ di wdi sim(di,dj) = ~-,k,,, wkw k~ 
kWkw) k kw\] 
2The articles were originally written in Japanese. 
1308 
dl: The prime minister of France says that it is necessary to restart nuclear testing. 
d2: The Defense Minister suggests restarting nuclear testing. 
d3: At a summit conference, the Prime Minister will adopt a policy of requesting the French Government to 
halt nuclear testing. 
d4: China's latest nuclear test will hold up negotiations on a treaty to abolish such testing. 
d5: The Minister of Foreign Affairs, Mr. Youhei Kohno, takes a critical attitude toward China, and asks 
France to understand Japan's position. 
d6: The prime minister of New Zealand asks the French Government not to restart nuclear testing. 
dT: President of France states that nuclear testing will restart in September, and that France will conduct 
eight tests between now and next May. 
d8: France states that it will restart nuclear testing. This will hamper nuclear disarmament. 
dg: France states that it will restart nuclear testing. Australia halts defense cooperation with France. 
dlo: France states that it will restart nuclear testing. The U.S. expresses regret at the decision. 
Figure 2: V: Articles about nuclear testing 
Here, di is the weight given to the keyword Wkw 
kw in article di. Modification of the TF.IDF 
value (Robertson et al., 1976) is used for the weight- 
ing. 9d, is the weight assigned to the keyword kw, kw 
which is a differentia word for di. 
Cdl (kw) k dl 
= . u -(kwl . g w, 
d, r 1.5 kw E differentia(di) 
gkw = ~ 1 otherwise. 
Other parameters are defined as follows: 
k: constant value 
Cd,(kw): frequency of word kw in d(i) 
Cd, : number of words in d(i) 
Nk(kw): number of articles that contain the word 
kw in k articles di-k,... ,di 
The function differentia(d{) returns a set of key- 
words that appear in dj but do not appear in the 
last k articles. 
di.fferentia(di) = {kw\[Cd,(kw) > 0, and for all 
dt, 
where i - k < l < i, Cd,(kw) = O} 
3.2 Constructing a similarity matrix 
A similarity matrix for a set of articles is constructed 
by using the sim function. In a conventional hierar- 
chical clustering algorithm, a similarity for any com- 
bination of two articles is required in order to con- 
struct a hierarchical tree of the set of articles. This 
causes ~ calculations of the similarity func- 
tion, for n articles, with a consequent complexity 
of O(n2). This is very expensive when n is large. 
In our algorithm for constructing a similarity ma- 
trix, shown in Figure 4, the complexity of construct- 
ing a graph structure for an article set by using a 
constraint is O(n). The following constraint, which 
procedure MakeDistanceMatrix 
for i= 2 to n begin 
if i-k< 1 thens+- 1 elses+--i-k 
forj =stoi-lbegin 
a(i,j) +- sim(di,dj) 
j~-j+l 
end 
i+-i+l 
end 
Figure 4: Procedure for Constructing Similarity Ma- 
trix 
includes Constraint 1, is used for in threading algo- 
rithm. 
Constraint 2 
For (di,dj) E A, j - (k + l) <i<j 
This constraint means that an article can only fol- 
low the last k articles. As the result, the number of 
times the similarity matrix needs to be calculated is 
reduced by kn, giving a complexity of O(n). 
By using the algorithm, each similarity between 
nodes is calculated, and the similarity matrix in Fig- 
ure 5 shows a similarity matrix S of V. In this case, 
keywords are extracted from title sentences, and k 
is set to five. 
3.3 Conversion into an adjacency matrix 
From the similarity matrix, an adjacency matrix is 
constructed. An element s(i, j) in the similarity ma- 
trix corresponds to the element ss(i,j) in the adja- 
cency matrix SS. There are various strategies for the 
conversion. In this paper, ss(i,j) is set to 1 when 
s(i, j) > 0.18, and any node can follow at most k/2 
nodes, in this case two nodes. Figure 6 shows a re- 
sult of the conversion. Finally, a directed graph for 
V is created (Figure 7). Figure 8 shows a graph that 
visualizes the content of the articles in our example. 
1309 
S = 
dl 
d2 
d3 
d4 
ds 
d~ 
d7 
d8 
d9 
dlo 
dl d2 d3 d4 d5 d6 d7 ds d9 dio 
0 .309 .239 .072 .131 .319 0 0 0 0 
0 0 .159 .072 .131 .319 .197 0 0 0 
0 0 0 .056 .103 .498 .103 .124 0 0 
0 0 0 0 .186 .056 .046 .056 .046 0 
0 0 0 0 0 .102 .085 .102 .128 .096 
0 0 0 0 0 0 .154 .176 .206 .209 
0 0 0 0 0 0 0 .308 .320 .323 
0 0 0 0 0 0 0 0 .257 .279 
0 0 0 0 0 0 0 0 0 .287 
0 0 0 0 0 0 0 0 0 0 
Figure 5: Similarity Matrix S 
dl 
d~ 
d3 
d4 
ds 
d6 
d7 
ds 
d9 
dlo 
dl d2 ds d4 ds d6 d7 ds d9 d,o 
0 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 1 1 0 0 0 
0 0 0 0 0 I 0 0 0 0 
0 0 0 0 I 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 
0 0 0 0 0 0 0 0 1 0 
0 0 0 0 0 0 0 0 0 I 
0 0 0 0 0 0 0 0 0 0 
d2 dl 
d4 
o ,C s 
d8 
d9 
dlO 
Figure 7: Directed Graph G1 for V 
Figure 6: Adjacency Matrix SS Converted from S 
There are two threads in the graph. One concerns 
for France's restarting of nuclear testing. The other 
concerns China's latest nuclear test. The "France" 
thread contains two sub-threads. One concern re- 
quests by other countries for France to reconsider its 
stated intension of restarting nuclear testing, and the 
other concerns responses by other countries to the 
France government's official statement on testing. 
Some articles are followed by multiple articles. For 
example, d7 is the first official statement on France's 
restarting of nuclear testing, and many related arti- 
cles on this topic follow. 
Each rectangle in Figure 8 represents an article. 
Words in a rectangle are differentia words for the 
articles. These words show new information in the 
article, and make it easy to understand the content 
of the articles. If a word in an article appears in 
the differentia words for its parent article, the word 
may represent a "turning point" in the story of the 
articles. For example, the word "state" is the dif- 
ferentia word for dT, and is in its adjacent articles 
ds, dg, anddlo. This means that d7 is a starting point 
of the new topic "state." Such words are called topic 
words, and are represented in Figure 8 by bold type. 
Several features of the graph visualize the charac- 
teristics and relationships of the articles: these fea- 
tures will be discussed in the next section. 
It is difficult to evaluate the result of threading. 
We are implementing it in a webcasting (push) ap- 
plication so that it can be evaluated by the many 
people who use ordinary web browsers. The attempt 
is described in Section 5. 
4 Features of a Graph 
This section describes how the features of a con- 
structed graph represent the characteristics of arti- 
cles. 
4.1 In-degree and Out-degree 
The in-degree is the number of arcs leading to a node, 
while the out-degree is the number of arcs leading 
from it. The in-degree of di can be calculated by 
adding up the elements in the i-th column of an adja- 
cency matrix. The out-degree of di can be calculated 
by adding up the elements in the i-th row of the ma- 
trix (Figure 9). In Botafogo et al. (Botafogo et al., 
1992), a node that has a high out-degree is called an 
index node, while a node that has a high in-degree is 
called a reference node in their analysis of hypertext. 
In the set of articles V shown in Figure 9, d7 is an 
index node. In this paper, an index node denotes the 
beginning of a new topic. When the topic is impor- 
tant, many articles follow, and consequently the out- 
1310 
dl 
France restart 
nuclear testing 
d4 
China latest hold-up 
negotiation treaty 
d3 d6 
halt Summit request France ~ New Zealand 
restart 
nuc,ear 
l/ _Isuggest \[/ ~Defence Ministe~ 
~dd/r d8 
esident state \[state ,\[ hamper 
conduct~ 1 \[September \[ disarmament 
\ \ d9 
\ ~ Australia \ \] defence 
\[ cooperation 
China , Mr. Yohei Kohno ~ ~U.S. express attitude understand ~ regret decision 
Japan position 
Figure 8: Visualized Content for G1 
dl d2 d3 d4 d5 d6 d7 d8 d9 dl0 
in 0 1 1 0 1 2 1 1 2 2 
out 2 2 1 1 0 0 3 1 1 0 
Figure 9: In-degree/Out-degree of the Graph G1 
degree for the node increases. The contribution of 
reference nodes is not clear in V (d6, ds, and d9 have 
max in-degrees). Nodes that have high in-degree 
have two characteristics. The first is that when the 
articles contain multiple topics, they have many in- 
bound arcs, each representing a different topic. The 
second is that when the articles are closely related 
for a particular topic, the in-degrees of related nodes 
increase, since these articles are connected to each 
other. 
4.2 Path 
A path from one node to another node shows the 
"story flow" of articles. Multiple paths between 
two nodes show different stories about the nodes. 
For example, there are three paths between dl, 
which is a first node, and dl0. The shortest path 
(dl, d2,, dT, dl0) gives a simple outline of the articles. 
The longest path (d,, d2, d7, ds, dg, dl0) contains all 
related information on the topic. By extracting long 
paths from the graph and combining them, various 
stories can be created. 
The length of a path shows how the nodes on it 
\[ along to the "main stream" of the story. For ex- 
mple, the maximum length of a path through d6, is 
three, while that of a path through d7 is five. This 
means that a path that contains d7 is on a main 
stream of the thread and is likely to be continued. 
The longest paths for nodes can be calculated by 
using the algorithm shown in Figure 11. Its com- 
plexity is O(n), since the maximum number of arcs 
is at most nk for n nodes, from Constraint 2, defined 
in Section 3.2. 
4.3 Cycle 
A cycle 3 shows the existence of a topic. In V, 
{dT, ds, dg, dl0} is a cycle for the topic "statement." 
By recognizing cycles, we can extract topics from the 
whole graph. Furthermore, we can abstract articles 
by reducing cycles to single nodes. 
5 XML-based Representation for 
Threads 
It is important that the threading information be ex- 
changeable when we apply our method to Web docu- 
ments. Extended Markup Language (XML) is a pro- 
posed standard (XML, 1997) specified by the World 
Wide Web Consortium (W3C). In XML, tags and 
3Formally, it is called a semi-cycle, since the graph is di- 
rected. 
1311 
attributes can be defined, whereas in HTML they 
are fixed. XML documents can be used to exchange 
information that has various data structure. For 
example, Channel Definition Format (CDF)(CDF, 
1997) is a standard to offer frequently updated col- 
lections of information (channels) on Web. A CDF 
document can contains a collection of articles that 
have tree structure. In this paper, graph structures 
of created threads are represented in XML. Figure 10 
shows a part of the thread in Figure 8. 
The <thread> tag shows the beginning of the 
thread. It contains a set of deceptions for arti- 
cles, each marked <article>. Each instance of 
the <article> tag has a reference to its source 
document, an identifying id, genus and differentia 
words, and other information on the article. The 
tag <follows> is used to denote arcs from the ar- 
ticle to related articles. 
The XML documents can be separate from the 
source articles. They can be provided as part of a 
"push" service for Internet users, offering a solution 
to the problem of information overloading. In such 
a service, gatherer collects articles from Web sites 
and threader makes threads for them. The results 
are stored in XML, and then pushed to subscribers 
who can capture the flow of topics by following the 
threads. In another scenario, when a user gets an 
article, and wants to see its origin or the next re- 
lated article, he or she gets the thread containing 
the article by consulting the threading server. The 
advantage of using XML is that it will be supported 
by various tools, including Web browsers. Now we 
are prototyping the threading service system by us- 
ing a XML processor developed at our laboratory. 
Figure 12 shows a Java applet for viewing threads, 
which can run on major Web browsers. A XML doc- 
ument is parsed and visualized as tree-like structure. 
6 Related Work 
There have been several studies how to relate arti- 
cles (McKeown et al., 1995; Yamamoto et al., 1995; 
Mani et al., 1997). McKeown et al. reported a 
method for summarizing news articles (McKeown 
et al., 1995). In their approach, templates, which 
have slots and their values (for example, incident- 
location="New York"), are extracted from the ar- 
ticles. Summary sentences are constructed by com- 
bining the templates. Although this approach can 
capture topics contained in the articles, the relation- 
ships between articles are not visualized. 
Clustering techniques make it possible to visual- 
ize the contents of a set of documents. Hearst et 
al. proposed the scatter/gather approach for facil- 
itating information retrieval (Hearst et al., 1995). 
Maarek et al. related documents by using an hier- 
archical clustering algorithm that interacts with the 
user. Although these clustering algorithms impose a 
procedure GetMaxtPath(A) 
//Get max path MaxPath\[i\] for di. A is a set of arcs. 
for i = 1 to n begin MaxPath\[i\] +- NULL end 
for j = 1 to n begin 
fori=j-ktoj- lbegin 
if (di, dj) E A then 
if Length(MaxPath\[j\]) < Length(MaxPath\[i\]) + 1 
then MaxPath\[j\] e-- Connect(MaxPath\[i\],(di,dj)) 
i+--i+ 1 
end 
j+-j+l 
end 
procedure Length(path) 
returns the number of arcs in path. 
procedure Connect(path, arc) 
if path = (do,..., di) and arc = (di, dj), then 
return (do,..., di, dj). 
Figure 11: Procedure for Finding the Longest Path 
heavy computation cost, our threading algorithm is 
efficient, because it uses a chronological constraint. 
7 Conclusion 
We have described methods for threading multiple 
articles and for visualizing various characteristics of 
them by using directed graphs. An efficient thread- 
ing algorithm whose complexity is O(n) (where n is 
the number of articles) was introduced with some 
constraints on the chronological ordering of articles. 
Some further work can be done to improve our 
method. There are sonie strategies for constructing 
an adjacency matrix from a distance matrix. Differ- 
ent strategies give different graphs. We are now eval- 
uating our method by testing it with various strate- 
gies. 
The development of a technique for visualizing di- 
rected graphs is another task for the future. Al- 
though directed graphs show more useful informa- 
tion than tree structures, they are difficult to display 
in a readily understandable way. Software tools for 
handling graphs are also required. 
Formal features of graphs can express the under- 
lying characteristics of articles. More efficient and 
useful algorithms are needed to overcome the prob- 
lem of information overload. 

References 
R. Botafogo, E. Rivlin, and B Shnederman. 1992. 
Structural Analysis of Hypertexts: Identifying Hi- 
erarchies and Useful Metrics. A CM Transaction 
on Information Science, pages 143-179, Vol. 10, 
No. 2. 
C. Ellerman. 1997. 
Channel Definition Format (CDF). 
http://www.microsoft.com/standards/cdf.htm. 
M. Damashek. 1995. Gauging Similarity with n- 
Grams: Language Independent Categorization of 
Text. Proc. of Science, pages 843-848, Vol. 267. 
M. A. Hearst, D. R. Karger, and J. O. Pederson. 
1995. Scatter/Gather as a Tool for Navigation of 
Retrieval Results. Proc. of AAAI Fall Symposium 
on AI Applications in Knowledge Navigation and 
Retrieval. 
N. Jardine, and R. Sibson. 1968. The Construction 
of Hierarchic and Non-Hierarchic Classifications. 
Computer, pages 177-184. 
I. Mani and E. Bloedorn. 1997. Multi-document 
Summarization by Graph Search and Matching 
Proe. of AAAI'97, pages. 622-628. 
Y. Maarek and A. Wecker. 1994. The Librarian As- 
sistant: Automatically Assemblin 9 Books into Dy- 
namic Bookshelves. Proc. of RIAO. 
K. McKeown and D. Radev. 1995. Generating 
Summaries of Multiple News Articles. Proc. of SI- 
GIR, pages 74-82. 
S. E. Robertson and K. S. Jones. 1976. Relevance 
Weighting of Search Terms. JASIS, pages 129- 
146, Vol. 27. 
G. Salton. 1968. Automatic Information Organiza- 
tion and Retrieval. New York, NY: McGraw-Hill. 
T. Bray, J. Paoli, and C. M. Sperberg-McQeen. 1997 
Extensible Markup Language (XML). Proposed 
Recommendation. World Wide Web Consortium. 
http://www.w3.org/TR/PR-xml/ 
K. Yamamoto, S. Masuyama, and S. Naito. 1995. 
An Empirical Study on Summarizing Multiple 
Texts of Japanese Newspaper Articles. Proc. of 
NLPRS'95, pages 461-466. 
