<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3020"> <Title>Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Graph-Based Ranking Algorithms </SectionTitle> <Paragraph position="0"> Graph-based ranking algorithms are essentially a way of deciding the importance of a vertex within a graph, based on information drawn from the graph structure.</Paragraph> <Paragraph position="1"> In this section, we present three graph-based ranking algorithms that were previously found to be successful on a range of ranking problems. We also show how these algorithms can be adapted to undirected or weighted graphs, which are particularly useful in the context of text-based ranking applications.</Paragraph> <Paragraph position="2"> Let $G = (V, E)$ be a directed graph with the set of vertices $V$ and the set of edges $E$, where $E$ is a subset of $V \times V$. For a given vertex $V_i$, let $In(V_i)$ be the set of vertices that point to it (predecessors), and let $Out(V_i)$ be the set of vertices that vertex $V_i$ points to (successors).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2.1 HITS </SectionTitle> <Paragraph position="0"> HITS (Hyperlink-Induced Topic Search) (Kleinberg, 1999) is an iterative algorithm that was designed for ranking Web pages according to their degree of &quot;authority&quot;. The HITS algorithm makes a distinction between &quot;authorities&quot; (pages with a large number of incoming links) and &quot;hubs&quot; (pages with a large number of outgoing links). 
For each vertex, HITS produces two sets of scores, an &quot;authority&quot; score and a &quot;hub&quot; score:</Paragraph> <Paragraph position="2"> $$HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j)$$ $$HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j)$$ </Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Positional Power Function </SectionTitle> <Paragraph position="0"> Introduced by Herings et al. (2001), the positional power function is a ranking algorithm that determines the score of a vertex as a function that combines both the number of its successors and the scores of its successors: $$POS_P(V_i) = \sum_{V_j \in Out(V_i)} \frac{1}{|V|} (1 + POS_P(V_j))$$ </Paragraph> <Paragraph position="2"> The counterpart of the positional power function is the positional weakness function, defined as:</Paragraph> <Paragraph position="4"> $$POS_W(V_i) = \sum_{V_j \in In(V_i)} \frac{1}{|V|} (1 + POS_W(V_j))$$ </Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 PageRank </SectionTitle> <Paragraph position="0"> PageRank (Brin and Page, 1998) is perhaps one of the most popular ranking algorithms, and was designed as a method for Web link analysis. Unlike other ranking algorithms, PageRank integrates the impact of both incoming and outgoing links into one single model, and therefore it produces only one set of scores: $$PR(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$$ </Paragraph> <Paragraph position="2"> where $d$ is a parameter that is set between 0 and 1 (see footnote 1).</Paragraph> <Paragraph position="3"> For each of these algorithms, starting from arbitrary values assigned to each node in the graph, the computation iterates until convergence below a given threshold is achieved. After running the algorithm, a score is associated with each vertex, which represents the &quot;importance&quot; or &quot;power&quot; of that vertex within the graph. 
Note that the final values are not affected by the choice of the initial values; only the number of iterations to convergence may differ.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Undirected Graphs </SectionTitle> <Paragraph position="0"> Although traditionally applied to directed graphs, recursive graph-based ranking algorithms can also be applied to undirected graphs, in which case the out-degree of a vertex is equal to the in-degree of the vertex. For loosely connected graphs, with the number of edges proportional to the number of vertices, undirected graphs tend to have more gradual convergence curves. As the connectivity of the graph increases (i.e. larger number of edges), convergence is usually achieved after fewer iterations, and the convergence curves for directed and undirected graphs practically overlap.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Weighted Graphs </SectionTitle> <Paragraph position="0"> In the context of Web surfing or citation analysis, it is unusual for a vertex to include multiple or partial links to another vertex, and hence the original definition of graph-based ranking algorithms assumes unweighted graphs.</Paragraph> <Paragraph position="1"> However, in our TextRank model the graphs are built from natural language texts, and may include multiple or partial links between the units (vertices) that are extracted from text. 
It may therefore be useful to indicate and incorporate into the model the &quot;strength&quot; of the connection between two vertices $V_i$ and $V_j$ as a weight $w_{ij}$ added to the corresponding edge that connects the two vertices.</Paragraph> <Paragraph position="2"> Consequently, we introduce new formulae for graph-based ranking that take into account edge weights when computing the score associated with a vertex in the graph.</Paragraph> <Paragraph position="3"> Footnote 1: The factor $d$ is usually set at 0.85 (Brin and Page, 1998), and this is the value we also use in our implementation.</Paragraph> <Paragraph position="5"> While the final vertex scores (and therefore rankings) for weighted graphs differ significantly from those of their unweighted alternatives, the number of iterations to convergence and the shape of the convergence curves are almost identical for weighted and unweighted graphs.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Sentence Extraction </SectionTitle> <Paragraph position="0"> To enable the application of graph-based ranking algorithms to natural language texts, TextRank starts by building a graph that represents the text, and interconnects words or other text entities with meaningful relations. For the task of sentence extraction, the goal is to rank entire sentences, and therefore a vertex is added to the graph for each sentence in the text.</Paragraph> <Paragraph position="1"> To establish connections (edges) between sentences, we define a &quot;similarity&quot; relation, where &quot;similarity&quot; is measured as a function of content overlap. 
Such a relation between two sentences can be seen as a process of &quot;recommendation&quot;: a sentence that addresses certain concepts in a text gives the reader a &quot;recommendation&quot; to refer to other sentences in the text that address the same concepts, and therefore a link can be drawn between any two such sentences that share common content.</Paragraph> <Paragraph position="2"> The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences, or it can be run through syntactic filters, which only count words of a certain syntactic category. Moreover, to avoid promoting long sentences, we use a normalization factor, and divide the content overlap of two sentences by the length of each sentence.</Paragraph> <Paragraph position="3"> Formally, given two sentences $S_i$ and $S_j$, with a sentence being represented by the set of $N_i$ words that appear in the sentence, $S_i = W^i_1, W^i_2, \ldots, W^i_{N_i}$, the similarity of $S_i$ and $S_j$ is defined as: $$Similarity(S_i, S_j) = \frac{|\{W_k \mid W_k \in S_i \text{ and } W_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$ </Paragraph> <Paragraph position="5"> The resulting graph is highly connected, with a weight associated with each edge, indicating the strength of the connections between various sentence pairs in the text. The text is therefore represented as a weighted graph, and consequently we use the weighted graph-based ranking formulae introduced in Section 2.5. 
The graph can be represented as: (a) a simple undirected graph; (b) a directed weighted graph with the orientation of edges set from a sentence to sentences that follow in the text (directed forward); or (c) a directed weighted graph with the orientation of edges set from a sentence to previous sentences in the text (directed backward).</Paragraph> <Paragraph position="6"> After the ranking algorithm is run on the graph, sentences are sorted in descending order of their score, and the top-ranked sentences are selected for inclusion in the summary.</Paragraph> <Paragraph position="7"> Figure 1 shows a text sample, and the associated weighted graph constructed for this text. The figure also shows sample weights attached to the edges connected to vertex 9, and the final score computed for each vertex, using the PR formula, applied on an undirected graph. The sentences with the highest rank are selected for inclusion in the abstract. For this sample article, sentences with IDs 9, 15, 16, and 18 are extracted, resulting in a summary of about 100 words, which, according to automatic evaluation measures, is ranked second among summaries produced by 15 other systems (see Section 4 for evaluation methodology).</Paragraph> </Section> </Paper>