<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3013">
  <Title>Language Independent Extractive Summarization</Title>
  <Section position="4" start_page="0" end_page="50" type="metho">
    <SectionTitle>
2 Extractive Summarization
</SectionTitle>
    <Paragraph position="0"> Ranking algorithms, such as Kleinberg's HITS algorithm (Kleinberg, 1999) or Google's PageRank (Brin and Page, 1998), have traditionally been used with success in Web-link analysis, social networks, and more recently in text processing applications. In short, a graph-based ranking algorithm is a way of deciding on the importance of a vertex within a graph by taking into account global information recursively computed from the entire graph, rather than relying only on local vertex-specific information. The basic idea implemented by the ranking model is that of voting or recommendation. When one vertex links to another, it is casting a vote for that other vertex. The more votes that are cast for a vertex, the higher the importance of that vertex.</Paragraph>
    <Paragraph position="1"> These graph ranking algorithms are based on a random walk model, where a walker takes random steps on the graph, with the walk being modeled as a Markov process - that is, the decision on what edge to follow is based solely on the vertex where the walker is currently located. Under certain conditions, this model converges to a stationary distribution of probabilities associated with vertices in the graph, representing the probability of finding the walker at a certain vertex in the graph. Based on the ergodic theorem for Markov chains (Grimmett and Stirzaker, 1989), the algorithms are guaranteed to converge if the graph is both aperiodic and irreducible. The first condition is achieved for any non-bipartite graph, while the second condition holds for any strongly connected graph. Both conditions are achieved in the graphs constructed for the extractive summarization application implemented in TextRank.</Paragraph>
    <Paragraph position="2"> While there are several graph-based ranking algorithms previously proposed in the literature, we focus on two algorithms, namely PageRank (Brin and Page, 1998) and HITS (Kleinberg, 1999).</Paragraph>
    <Paragraph position="3"> Let G = (V, E) be a directed graph with the set of vertices V and set of edges E, where E is a subset of V x V. For a given vertex Vi, let In(Vi) be the set of vertices that point to it (predecessors), and let Out(Vi) be the set of vertices that vertex Vi points to (successors).</Paragraph>
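    <Paragraph> As an illustration of the notation above, the following minimal sketch derives In(Vi) and Out(Vi) from a list of directed edges. The function name build_neighbor_sets and the three-vertex example graph are hypothetical and are not taken from the original system.

```python
from collections import defaultdict


def build_neighbor_sets(edges):
    """Return (in_sets, out_sets) for a directed graph given as (source, target) pairs."""
    in_sets = defaultdict(set)
    out_sets = defaultdict(set)
    for source, target in edges:
        out_sets[source].add(target)  # source points to target (successor)
        in_sets[target].add(source)   # target is pointed to by source (predecessor)
    return in_sets, out_sets


# Hypothetical three-vertex graph: V1 -> V2, V1 -> V3, V2 -> V3.
in_sets, out_sets = build_neighbor_sets([("V1", "V2"), ("V1", "V3"), ("V2", "V3")])
print(sorted(in_sets["V3"]))   # ['V1', 'V2']
print(sorted(out_sets["V1"]))  # ['V2', 'V3']
```

Keeping both maps makes either direction of lookup a single dictionary access, which is convenient for the ranking updates discussed in the following subsections.</Paragraph>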
    <Section position="1" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
2.1 PageRank
</SectionTitle>
      <Paragraph position="0"> PageRank (Brin and Page, 1998) is perhaps one of the most popular ranking algorithms, and was designed as a method for Web link analysis. Unlike other graph ranking algorithms, PageRank integrates the impact of both incoming and outgoing links into one single model, and therefore it produces only one set of scores:</Paragraph>
      <Paragraph position="1"> $$PR(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$$</Paragraph>
      <Paragraph position="2"> where d is a parameter set between 0 and 1 that integrates random jumps into the random walk model.</Paragraph>
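      <Paragraph> A minimal sketch of the iteration follows, assuming the standard PageRank update in which each vertex receives (1 - d) plus d times the sum, over its predecessors Vj, of PR(Vj) divided by |Out(Vj)|. The default d = 0.85, the tolerance, the iteration cap, and the example graph are assumptions made for illustration; none of them is specified in the text above.

```python
def pagerank(out_links, d=0.85, tol=1e-6, max_iter=100):
    """Iterate PR(Vi) = (1 - d) + d * sum of PR(Vj) / |Out(Vj)| over Vj in In(Vi)."""
    # Collect every vertex that appears as a source or as a target.
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    scores = {v: 1.0 for v in vertices}  # arbitrary initial values
    for _ in range(max_iter):
        new_scores = {v: 1.0 - d for v in vertices}
        for source, targets in out_links.items():
            if not targets:
                continue  # a vertex without successors casts no votes
            share = d * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        # Largest per-vertex change in this pass, used as the stopping criterion.
        change = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if tol > change:
            break
    return scores


# Hypothetical strongly connected graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
for vertex in sorted(scores, key=scores.get, reverse=True):
    print(vertex, round(scores[vertex], 3))
```

In this example vertex C receives votes from both A and B and therefore ends up with the highest score, followed by A and then B.</Paragraph>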
    </Section>
    <Section position="2" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
2.2 HITS
</SectionTitle>
      <Paragraph position="0"> HITS (Hyperlink-Induced Topic Search) (Kleinberg, 1999) is an iterative algorithm that was designed for ranking Web pages according to their degree of &quot;authority&quot;. The HITS algorithm makes a distinction between &quot;authorities&quot; (pages with a large number of incoming links) and &quot;hubs&quot; (pages with a large number of outgoing links). For each vertex, HITS produces two sets of scores - an &quot;authority&quot; score, and a &quot;hub&quot; score:</Paragraph>
      <Paragraph position="1"> $$HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j)$$ $$HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j)$$</Paragraph>
      <Paragraph position="2"> Starting from arbitrary values assigned to each node in the graph, the ranking algorithm iterates until convergence below a given threshold is achieved. After running the algorithm, a score is associated with each vertex, which represents the importance of that vertex within the graph. Note that the final values are not affected by the choice of the initial values; only the number of iterations to convergence may differ.</Paragraph>
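      <Paragraph> The iterate-until-convergence procedure can be sketched for HITS as follows, taking the authority score of a vertex to be the sum of the hub scores of its predecessors and the hub score to be the sum of the authority scores of its successors. The rescaling of both score vectors to unit length after each pass is an assumption added here to keep the values bounded; the threshold, the iteration cap, and the example graph are likewise hypothetical.

```python
import math


def _normalize(values):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(x * x for x in values.values())) or 1.0
    return {v: x / norm for v, x in values.items()}


def hits(out_links, tol=1e-8, max_iter=1000):
    """Return (authority, hub) score dictionaries for a directed graph."""
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    # Invert the successor lists to obtain the predecessors of each vertex.
    in_links = {v: [] for v in vertices}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    authority = {v: 1.0 for v in vertices}  # arbitrary initial values
    hub = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        new_auth = _normalize(
            {v: sum(hub[u] for u in in_links[v]) for v in vertices}
        )
        new_hub = _normalize(
            {v: sum(new_auth[u] for u in out_links.get(v, [])) for v in vertices}
        )
        change = max(
            max(abs(new_auth[v] - authority[v]), abs(new_hub[v] - hub[v]))
            for v in vertices
        )
        authority, hub = new_auth, new_hub
        if tol > change:
            break
    return authority, hub


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
authority, hub = hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(max(authority, key=authority.get))  # C
print(max(hub, key=hub.get))              # A
```

Vertex C has the most incoming links and becomes the strongest authority, while A, which points to both B and C, becomes the strongest hub.</Paragraph>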
      <Paragraph position="3"> When the graphs are built starting with natural language texts, it may be useful to integrate into the graph model the strength of the connection between two vertices Vi and Vj, indicated as a weight wij added to the corresponding edge. Consequently, the ranking algorithm is adapted to include edge weights, e.g. for PageRank the score is determined using the following formula (a similar change can be applied to the HITS algorithm):</Paragraph>
      <Paragraph position="4"> $$PR^W(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} PR^W(V_j)$$</Paragraph>
      <Paragraph position="5"> While the final vertex scores (and therefore rankings) for weighted graphs differ significantly from those of their unweighted alternatives, the number of iterations to convergence and the shape of the convergence curves are almost identical for weighted and unweighted graphs.</Paragraph>
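      <Paragraph> A minimal sketch of the weighted variant is given below, assuming that each vertex distributes its score to its successors in proportion to the weight of the connecting edge relative to the total weight of its outgoing edges. The function name, the default d = 0.85, the stopping rule, and the example weights are hypothetical.

```python
def weighted_pagerank(weights, d=0.85, tol=1e-6, max_iter=100):
    """weights maps (source, target) pairs to positive edge weights."""
    vertices = set()
    out_total = {}  # total outgoing weight per source vertex
    for (source, target), w in weights.items():
        vertices.update((source, target))
        out_total[source] = out_total.get(source, 0.0) + w
    scores = {v: 1.0 for v in vertices}  # arbitrary initial values
    for _ in range(max_iter):
        new_scores = {v: 1.0 - d for v in vertices}
        for (source, target), w in weights.items():
            # The vote is split in proportion to the edge weight.
            new_scores[target] += d * (w / out_total[source]) * scores[source]
        change = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if tol > change:
            break
    return scores


# Hypothetical weights: the edge A -> C is three times as strong as A -> B.
edges = {("A", "B"): 1.0, ("A", "C"): 3.0, ("B", "C"): 1.0, ("C", "A"): 1.0}
weighted = weighted_pagerank(edges)
unweighted = weighted_pagerank({edge: 1.0 for edge in edges})
print({v: round(s, 3) for v, s in sorted(weighted.items())})
print({v: round(s, 3) for v, s in sorted(unweighted.items())})
```

With all weights equal the computation reduces to the unweighted case; raising the weight of A -> C shifts score away from B, which illustrates how edge weights change the final scores.</Paragraph>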
      <Paragraph position="6"> For the task of single-document extractive summarization, the goal is to rank the sentences in a given text with respect to their importance for the overall understanding of the text. A graph is therefore constructed by adding a vertex for each sentence in the text, and edges between vertices are established using sentence inter-connections, defined using a simple similarity relation measured as a function of content overlap. Such a relation between two sentences can be seen as a process of recommendation: a sentence that addresses certain concepts in a text gives the reader a recommendation to refer to other sentences in the text that address the same concepts, and therefore a link can be drawn between any two such sentences that share common content.</Paragraph>
      <Paragraph position="7"> The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences, or it can be run through filters that e.g. eliminate stopwords, count only words of a certain category, etc. Moreover, to avoid promoting long sentences, we use a normalization factor and divide the content overlap of two sentences by the length of each sentence.</Paragraph>
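      <Paragraph> One possible reading of this similarity measure is sketched below. The exact normalization is not spelled out above, so dividing the number of shared tokens by the sum of the two sentence lengths is an assumption, as are the naive whitespace tokenizer, the optional stopword filter, and the function names.

```python
def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (word.strip(".,;:!?\"'()[]") for word in sentence.lower().split())
    return [word for word in words if word]


def overlap_similarity(sentence_a, sentence_b, stopwords=frozenset()):
    """Shared token types divided by the combined token count of both sentences."""
    tokens_a = [t for t in tokenize(sentence_a) if t not in stopwords]
    tokens_b = [t for t in tokenize(sentence_b) if t not in stopwords]
    if not tokens_a or not tokens_b:
        return 0.0  # nothing to compare after filtering
    common = set(tokens_a).intersection(tokens_b)
    return len(common) / (len(tokens_a) + len(tokens_b))


first = "The cat sat on the mat."
second = "The dog sat on the log."
print(overlap_similarity(first, second))                             # 0.25
print(overlap_similarity(first, second, stopwords={"the", "on"}))
```

Without filtering, the two sentences share three token types out of twelve tokens in total; with the stopword filter only "sat" remains in common, so the similarity drops.</Paragraph>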
      <Paragraph position="8"> The resulting graph is highly connected, with a weight associated with each edge, indicating the strength of the connections between various sentence pairs in the text. The graph can be represented as: (a) a simple undirected graph; (b) a directed weighted graph with the orientation of edges set from a sentence to sentences that follow in the text (directed forward); or (c) a directed weighted graph with the orientation of edges set from a sentence to previous sentences in the text (directed backward).</Paragraph>
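      <Paragraph> The three representations can be sketched as follows, with sentences identified by their position in the text. Storing an undirected edge as a pair of opposite directed edges, skipping sentence pairs with zero similarity, and the function and mode names are all assumptions made for this illustration.

```python
def build_sentence_graph(similarity, n, mode="undirected"):
    """Return a dict mapping (source, target) index pairs to edge weights.

    similarity(i, j) gives the weight between sentences i and j, with j after i.
    """
    if mode not in ("undirected", "forward", "backward"):
        raise ValueError("unknown mode: " + mode)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            weight = similarity(i, j)
            if weight == 0:
                continue  # no shared content, so no edge
            if mode in ("undirected", "forward"):
                edges[(i, j)] = weight  # earlier sentence points to later one
            if mode in ("undirected", "backward"):
                edges[(j, i)] = weight  # later sentence points to earlier one
    return edges


# Hypothetical similarities between three sentences.
table = {(0, 1): 0.5, (1, 2): 0.2}


def lookup(i, j):
    return table.get((i, j), 0.0)


forward = build_sentence_graph(lookup, 3, mode="forward")
backward = build_sentence_graph(lookup, 3, mode="backward")
undirected = build_sentence_graph(lookup, 3, mode="undirected")
print(forward)   # {(0, 1): 0.5, (1, 2): 0.2}
print(backward)  # {(1, 0): 0.5, (2, 1): 0.2}
```

The resulting dictionary has the same (source, target) to weight shape regardless of the orientation chosen, so one ranking routine can be applied to all three representations.</Paragraph>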
      <Paragraph position="9"> After the ranking algorithm is run on the graph, sentences are sorted in descending order of their score, and the top ranked sentences are selected for inclusion in the summary. Figure 1 shows an example of a weighted graph built for a short sample text.</Paragraph>
      <Paragraph position="10"> [1] Watching the new movie, &quot;Imagine: John Lennon,&quot; was very painful for the late Beatle's wife, Yoko Ono.</Paragraph>
      <Paragraph position="11"> [2] &quot;The only reason why I did watch it to the end is because I'm responsible for it, even though somebody else made it,&quot; she said. [3] Cassettes, film footage and other elements of the acclaimed movie were collected by Ono.</Paragraph>
      <Paragraph position="12"> [4] She also took cassettes of interviews by Lennon, which were edited in such a way that he narrates the picture.</Paragraph>
      <Paragraph position="13"> [5] Andrew Solt (&quot;This Is Elvis&quot;) directed, Solt and David L. Wolper produced and Solt and Sam Egan wrote it.</Paragraph>
      <Paragraph position="14"> [6] &quot;I think this is really the definitive documentary of John Lennon's life,&quot; Ono said in an interview.</Paragraph>
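      <Paragraph> The final selection step, in which sentences are sorted by score and the top ranked ones are kept, can be sketched as below. The scores and placeholder sentences are hypothetical and do not correspond to the sample text; returning the selected sentences in their original document order is an additional assumption.

```python
def select_summary(sentences, scores, k=2):
    """Pick the k highest scoring sentences and return them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore the original order for readability
    return [sentences[i] for i in chosen]


# Hypothetical sentences and scores, for illustration only.
sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
scores = [0.7, 1.4, 0.9, 1.1]
summary = select_summary(sentences, scores, k=2)
print(summary)  # ['Second sentence.', 'Fourth sentence.']
```

The parameter k controls the summary length; it could equally be derived from a word budget, which is not shown here.</Paragraph>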
    </Section>
  </Section>
</Paper>