File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/p98-2213_intro.xml
Size: 6,120 bytes
Last Modified: 2025-10-06 14:06:38
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2213"> <Title>A Method for Relating Multiple Newspaper Articles by Using Graphs, and Its Application to Webcasting</Title> <Section position="2" start_page="0" end_page="1307" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The vast quantity of information available today makes it difficult to search for and understand the information that we want. If there are many related documents about a topic, it is important to capture their relationships so that we can obtain a clearer overview. However, most information resources, including newspaper articles, do not have explicit relationships. For example, although documents on the Web are connected by hyperlinks, their relationships cannot be specified.</Paragraph> <Paragraph position="1"> Webcasting (&quot;push&quot;) applications such as Pointcast (http://www.pointcast.com) constitute a promising solution to the problem of information overload, but the articles they provide do not have links, or else must be manually linked at a high cost in terms of time and effort.</Paragraph> <Paragraph position="2"> This paper describes methods for relating newspaper articles automatically, and their application to Webcasting. A set of articles on a particular topic is ordered chronologically, and the results are represented as a directed graph. There are various ways of relating documents and visualizing their structure. For example, USENET articles can be accessed by means of newsreader software. In that system, a label (title) is attached to each posted message, specifying whether it deals with a new topic or is a reply to a previous message. A chain of articles on a topic is called a thread. In this case, the relationships between the articles are explicitly defined. This post/reply-based approach makes it possible for a reader to group all the messages on a particular topic. 
However, it is difficult to capture the story of the thread from its thread structure, since appropriate titles are not added to the messages.</Paragraph> <Paragraph position="3"> This paper aims to provide ways of relating multiple news articles and representing their structure in a way that is easy to understand and computationally inexpensive. A set of relationships is defined here as a directed graph. A node indicates an article, and an arc from node X to node Y indicates that article X is followed by article Y (or that X is adjacent to Y). An article contains both known and unknown (new) information. Known information consists of words shared by the beginning and ending points of an arc. When node X is adjacent to Y, these words are represented by (X ∩ Y). The known information is called genus words in this paper. Even if an article follows another one, it generally contains some new information. This information can be represented by subtraction (Y - X) (Damashek, 1995), and is called differentia words, by analogy with definition sentences in dictionaries, which contain genus words and differentia. In this paper, genus and differentia words are used to calculate the similarities between two articles, and to visualize topics in a set of articles. Since articles are ordered chronologically, there are some time constraints on the connectivity of nodes. A graph is created by constructing an adjacency matrix for nodes, which in turn is created from a similarity matrix for nodes.</Paragraph> <Paragraph position="4"> Some potential features of articles in a set can be determined by analyzing some formal aspects of the corresponding graph.</Paragraph> <Paragraph position="6"> For example, the paths in the graph show the stories of the nodes they contain.</Paragraph> <Paragraph position="7"> Multiple paths for a node (article) show that there are multiple stories associated with it. 
Furthermore, if the node has a long path, it is in the &quot;main stream&quot; of the topic represented by the graph. An efficient algorithm for finding such paths is described later in the paper.</Paragraph> <Paragraph position="8"> Application of the threading method to documents on the Web would be very useful because, although such documents are connected by hyperlinks, their relationships cannot be specified. In this paper, threads generated by this method are represented in Extensible Markup Language (XML) (XML, 1997), which is the proposed standard for exchange of information on the Web. XML-based threads can be used by webcasting or push services, since various tools for parsing and visualizing threads are available. In Section 2, a directed graph structure for articles is defined, and the procedure for constructing a directed graph is described in Section 3. In Section 4, some features of the created graph are discussed.</Paragraph> <Paragraph position="9"> Section 5 introduces a webcasting application that uses the threading technique, and Section 6 concludes the paper.</Paragraph> <Paragraph position="10"> 2 Definition of a Graph Structure A set of articles is represented as an ordered set V: V = {d1, d2, ..., dn}</Paragraph> <Paragraph position="12"> The suffix sequence 1, 2, ..., n represents the passage of time. Article di is older than di+1. The order is obtained from the publication dates of the articles.</Paragraph> <Paragraph position="13"> Different time points are arbitrarily assigned to articles published on the same day.</Paragraph> <Paragraph position="14"> Related articles are represented as a directed graph (V, A). V is a set of nodes. A is a set of ordered pairs (i, j), where i and j are members of V. 
Figure 1 shows an example of a directed graph.</Paragraph> <Paragraph position="15"> In this case, the graph is represented as follows:</Paragraph> <Paragraph position="17"> (d2, d3), (d1, d4), (d5, d6), (d2, d7), (d3, d8), (d7, d8)} The nodes are ordered chronologically. The following constraint is introduced into the graph:</Paragraph> <Paragraph position="19"> Constraint 1: For (di, dj) ∈ A, i &lt; j. The constraint simply states that an old article cannot follow a new one.</Paragraph> </Section> </Paper>