<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3247"> <Title>LexPageRank: Prestige in Multi-Document Text Summarization</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Sentence centrality and centroid-based </SectionTitle> <Paragraph position="0"> summarization Extractive summarization produces summaries by choosing a subset of the sentences in the original documents. This process can be viewed as choosing the most central sentences in a (multi-document) cluster that give the necessary and enough amount of information related to the main theme of the cluster. Centrality of a sentence is often defined in terms of the centrality of the words that it contains. A common way of assessing word centrality is to look at the centroid. The centroid of a cluster is a pseudo-document which consists of words that have frequency*IDF scores above a predefined threshold. In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered as central. Formally, the centroid score of a sentence is the cosine of the angle between the centroid vector of the whole cluster and the individual centroid of the sentence. This is a measure of how close the sentence is to the centroid of the cluster. Centroid-based summarization has given promising results in the past (Radev et al., 2001).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Prestige-based sentence centrality </SectionTitle> <Paragraph position="0"> In this section, we propose a new method to measure sentence centrality based on prestige in social networks, which has also inspired many ideas in the computer networks and information retrieval.</Paragraph> <Paragraph position="1"> A cluster of documents can be viewed as a network of sentences that are related to each other.</Paragraph> <Paragraph position="2"> Some sentences are more similar to each other while some others may share only a little information with the rest of the sentences. We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic. There are two points to clarify in this definition of centrality. First is how to define similarity between two sentences. Second is how to compute the overall prestige of a sentence given its similarity to other sentences. For the similarity metric, we use cosine. A cluster may be represented by a cosine similarity matrix where each entry in the matrix is the similarity between the corresponding sentence pair. Figure 1 shows a subset of a cluster used in DUC 2004, and the corresponding cosine similarity matrix. Sentence ID da0 sa1 indicates the a1 th sentence in the a0 th document. In the following sections, we discuss two methods to compute sentence prestige using this matrix.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Degree centrality </SectionTitle> <Paragraph position="0"> In a cluster of related documents, many of the sentences are expected to be somewhat similar to each other since they are all about the same topic. This can be seen in Figure 1 where the majority of the values in the similarity matrix are nonzero. 
Since we are interested in significant similarities, we can eliminate the low values in this matrix by defining a threshold, so that the cluster can be viewed as an (undirected) graph in which each sentence of the cluster is a node and significantly similar sentences are connected to each other. Figure 2 shows the graphs that correspond to the adjacency matrices derived by assuming that the pairs of sentences in Figure 1 with a similarity above $0.1$, $0.2$, and $0.3$, respectively, are similar to each other. We define degree centrality as the degree of each node in the similarity graph. As seen in Table 1, the choice of cosine threshold dramatically influences the interpretation of centrality.</Paragraph> <Paragraph position="1"> A threshold that is too low may mistakenly take weak similarities into consideration, while a threshold that is too high may lose many of the similarity relations in a cluster.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Eigenvector centrality and LexPageRank </SectionTitle> <Paragraph position="0"> When computing degree centrality, we have treated each edge as a vote to determine the overall prestige value of each node. This is a totally democratic method where each vote counts the same. However, this may have a negative effect on the quality of the summaries in cases where several unwanted sentences vote for each other and raise their prestige. As an extreme example, consider a noisy cluster where all the documents are related to each other, but only one of them is about a somewhat different topic. Obviously, we would not want any of the sentences in the unrelated document to be included in a generic summary of the cluster. However, assume that the unrelated document contains some sentences that are very prestigious considering only the votes within that document. These sentences would get artificially high centrality scores from the local votes of a specific set of sentences. This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account when weighting each vote. Our approach is inspired by a similar idea used in computing web page prestige.</Paragraph> <Paragraph position="1"> One of the most successful applications of prestige is PageRank (Page et al., 1998), the underlying technology behind the Google search engine.</Paragraph> <Paragraph position="2"> PageRank is a method for assigning a prestige score to each page in the Web independent of a specific query. In PageRank, the score of a page is determined by the number of pages that link to it as well as the individual scores of the linking pages. More formally, the PageRank of a page $p$ is given as follows: $$PR(p) = (1 - d) + d \left( \frac{PR(t_1)}{C(t_1)} + \cdots + \frac{PR(t_n)}{C(t_n)} \right)$$ where $t_1, \ldots, t_n$ are the pages that link to $p$, $C(t_i)$ is the number of outgoing links from page $t_i$, and $d$ is the damping factor, which can be set between $0$ and $1$. This recursively defined value can be computed by forming the binary adjacency matrix $B$ of the Web, where $B(i,j) = 1$ if there is a link from page $i$ to page $j$, normalizing this matrix so that its row sums equal $1$, and finding the principal eigenvector of the normalized matrix. The PageRank of the $i$th page then equals the $i$th entry in the eigenvector. The principal eigenvector of a matrix can be computed with a simple iterative power method.</Paragraph>
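A minimal sketch of this power method over a binary adjacency matrix; the function name, default parameters, and convergence test are our own choices, not taken from the paper:

```python
def pagerank(adj, d=0.85, tol=1e-8, max_iter=200):
    """Power-method PageRank: adj[i][j] == 1 means node i links to node j.

    Each row is implicitly normalized by the node's out-degree, so the
    iteration converges to the fixed point of
    PR(p) = (1 - d) + d * sum_i PR(t_i) / C(t_i),
    i.e. the principal eigenvector of the row-normalized matrix.
    """
    n = len(adj)
    out_degree = [sum(row) for row in adj]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = [
            (1.0 - d) + d * sum(scores[i] * adj[i][j] / out_degree[i]
                                for i in range(n) if out_degree[i] > 0)
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(scores, new_scores)) < tol:
            return new_scores
        scores = new_scores
    return scores
```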
<Paragraph position="5"> This method can be directly applied to the cosine similarity graph to find the most prestigious sentences in a document cluster. We use PageRank to weight each vote so that a vote coming from a more prestigious sentence has a greater value in the centrality of a sentence. Note that unlike the original PageRank method, the graph is undirected since cosine similarity is a symmetric relation. However, this does not make any difference in the computation of the principal eigenvector. We call this new measure of sentence centrality lexical PageRank, or LexPageRank.</Paragraph> <Paragraph position="6"> [Figure 1: A subset of cluster d1003t from DUC 2004 and the corresponding cosine similarity matrix; the example sentence texts are omitted here.] </Paragraph> <Paragraph position="7"> Table 3 shows the LexPageRank scores for the graphs in Figure 2, setting the damping factor to $1$. For comparison, the Centroid score for each sentence is also shown in the table. All the numbers are normalized so that the highest ranked sentence gets the score $1$.
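Putting the two sketches together, a LexPageRank computation over a thresholded similarity graph might look as follows. The defaults are assumptions on our part: the paper reports thresholds of 0.1 to 0.3 and sets the damping factor to 1 for Table 3, while we default to 0.85 here because a damping factor of 1 is not guaranteed to converge on disconnected or bipartite graphs:

```python
def lexpagerank(sim_matrix, threshold=0.2, d=0.85):
    """Threshold the cosine similarity matrix into an undirected graph,
    then score its nodes with the power method defined above.

    Returns both degree centrality (Section 3.1) and LexPageRank
    (Section 3.2) scores, the latter normalized so that the highest
    ranked sentence gets the score 1, as in Table 3 of the paper.
    """
    n = len(sim_matrix)
    adj = [[1 if i != j and sim_matrix[i][j] > threshold else 0
            for j in range(n)] for i in range(n)]
    degrees = [sum(row) for row in adj]   # degree centrality
    scores = pagerank(adj, d=d)           # eigenvector centrality
    top = max(scores) or 1.0              # guard against an edgeless graph
    return degrees, [s / top for s in scores]
```

Because the thresholded graph is undirected, each edge appears in both directions of the adjacency matrix, which is why the symmetric similarity relation can be fed to the same power-method routine unchanged.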
It is obvious from the figures that the threshold choice affects the LexPageRank rankings of some sentences.</Paragraph> <Paragraph position="8"> [Table 3: LexPageRank scores for thresholds 0.1, 0.2, and 0.3, respectively, for the cluster in Figure 1.] </Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Comparison with Centroid </SectionTitle> <Paragraph position="0"> The graph-based centrality approach we have introduced has several advantages over Centroid. [Figure 2: Similarity graphs for the cluster in Figure 1; sentence d4s1 is the most central sentence for thresholds 0.1 and 0.2.] First of all, it accounts for information subsumption among sentences. If the information content of a sentence subsumes that of another sentence in a cluster, it is naturally preferable to include the one that contains more information in the summary. The degree of a node in the cosine similarity graph is an indication of how much common information the sentence has with the other sentences. Sentence d4s1 in Figure 1 gets the highest score since it almost subsumes the information in the first two sentences of the cluster and has some common information with the others. Another advantage is that it prevents unnaturally high IDF scores from boosting the score of a sentence that is unrelated to the topic. Although the frequency of the words is taken into account while computing the Centroid score, a sentence that contains many rare words with high IDF values may get a high Centroid score even if the words do not occur elsewhere in the cluster.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments on DUC 2004 data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 DUC 2004 data and ROUGE </SectionTitle> <Paragraph position="0"> We used DUC 2004 data in our experiments. There are three generic summarization tasks in DUC 2004 (Tasks 2, 4a, and 4b) which are appropriate for testing our new feature, LexPageRank. Task 2 involves summarization of 50 TDT English clusters. The goal of Task 4 is to produce summaries of machine translation output (in English) of 24 Arabic TDT documents.</Paragraph> <Paragraph position="1"> For evaluation, we used the new automatic summary evaluation metric ROUGE, which was used for the first time in DUC 2004. ROUGE is a recall-based metric for fixed-length summaries which is based on n-gram co-occurrence. It reports separate scores for 1-, 2-, 3-, and 4-gram matching, and also for longest common subsequence co-occurrence. Among these different scores, the unigram-based ROUGE score (ROUGE-1) has been shown to agree with human judgements the most (Lin and Hovy, 2003). We report three of the ROUGE metrics in our experimental results: ROUGE-1 (unigram-based), ROUGE-2 (bigram-based), and ROUGE-W (based on the longest common subsequence, weighted by length).</Paragraph> <Paragraph position="2"> There are 8 different human judges for DUC 2004 Task 2, and 4 for DUC 2004 Task 4. However, a subset of exactly 4 different human judges produced model summaries for any given cluster. ROUGE requires a limit on the length of the summaries to be able to make a fair evaluation.
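To make the recall orientation of ROUGE-1 concrete, here is a simplified sketch of unigram recall against multiple reference summaries. The official ROUGE toolkit differs in several details (stemming options, jackknifing across judges, length truncation), so this is only illustrative:

```python
from collections import Counter

def rouge_1_recall(candidate_tokens, reference_token_lists):
    """Simplified ROUGE-1: clipped unigram matches over total reference unigrams."""
    cand = Counter(candidate_tokens)
    matched = total = 0
    for ref_tokens in reference_token_lists:
        ref = Counter(ref_tokens)
        # Clip each word's credit at the candidate's count for that word.
        matched += sum(min(count, cand[w]) for w, count in ref.items())
        total += sum(ref.values())
    return matched / total if total else 0.0
```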
To stick to the DUC 2004 specifications and to be able to compare our system with human summaries as well as with the other DUC participants, we produced 665-byte summaries for each cluster and computed ROUGE scores against the human summaries.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 MEAD summarization toolkit </SectionTitle> <Paragraph position="0"> MEAD is a publicly available toolkit for extractive multi-document summarization. Although it comes as a centroid-based summarization system by default, its feature set can be extended to implement other methods.</Paragraph> <Paragraph position="1"> The MEAD summarizer consists of three components. In the first step, the feature extractor, each sentence in the input document (or cluster of documents) is converted into a feature vector using the user-defined features. Second, the feature vector is converted to a scalar value using the combiner. At the last stage, known as the reranker, the scores of sentences that occur in related pairs are adjusted upwards or downwards based on the type of relation between the sentences in the pair. The reranker penalizes the sentences that are similar to the sentences already included in the summary, so that better information coverage is achieved.</Paragraph> <Paragraph position="2"> The three default features that come with the MEAD distribution are Centroid, Position, and Length. Position is the normalized value of the position of a sentence in the document, such that the first sentence of a document gets the maximum Position value of 1 and the last sentence gets the value 0.</Paragraph> <Paragraph position="3"> Length is not a real feature score, but a cutoff value that ignores sentences shorter than the given threshold. Several rerankers are implemented in MEAD. We observed the best results with the Maximal Marginal Relevance (MMR) reranker (Carbonell and Goldstein, 1998) and the default reranker of the system, which is based on Cross-Sentence Informational Subsumption (CSIS) (Radev, 2000). All of our experiments shown in Section 4.3 use the CSIS reranker.</Paragraph> <Paragraph position="4"> A MEAD policy is a combination of three components: (a) the command lines for all features, (b) the formula for converting the feature vector to a scalar, and (c) the command line for the reranker. A sample policy might be the one shown in Figure 4.</Paragraph> <Paragraph position="5"> This example specifies the three default MEAD features (Centroid, Position, LengthCutoff) and our new LexPageRank feature used in our experiments.</Paragraph> <Paragraph position="6"> Our LexPageRank implementation requires the cosine similarity threshold, $0.2$ in the example, as an argument. Each number next to a feature name shows the relative weight of that feature (except for LengthCutoff, where the number 9 indicates the threshold for selecting a sentence based on the number of words in the sentence). The reranker in the example is a word-based MMR reranker with a cosine similarity threshold of 0.5.</Paragraph>
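As a rough illustration of the pipeline just described, the sketch below shows a linear combiner and a greedy MMR-style reranker. This is not MEAD's actual code or policy syntax; the function names, the weighted linear combination, and the greedy selection are all our own simplifications:

```python
def combine(feature_vectors, weights):
    """Combiner sketch: weighted linear combination of per-sentence features,
    e.g. weights = {"Centroid": 1.0, "Position": 1.0, "LexPageRank": 1.0}."""
    return [sum(weights.get(name, 0.0) * value for name, value in fv.items())
            for fv in feature_vectors]

def mmr_style_rerank(scores, sim_matrix, n_select, sim_threshold=0.5):
    """Reranker sketch: greedily take the best-scored sentence that is not
    too similar to any sentence already selected for the summary."""
    selected = []
    for i in sorted(range(len(scores)), key=lambda k: scores[k], reverse=True):
        if len(selected) == n_select:
            break
        if all(sim_matrix[i][j] < sim_threshold for j in selected):
            selected.append(i)
    return selected
```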
We normalize each feature so that the sentence with the maximum score gets the value 1.</Paragraph> <Paragraph position="1"> We ran MEAD with several policies using different feature weights and combinations of features. We fixed the Length cutoff at 9 and the weight of the Position feature at 1 in all of the policies. We did not try a weight higher than 2.0 for any of the features, since our earlier observations on MEAD showed that feature weights that are too high result in poor summaries.</Paragraph> <Paragraph position="2"> Table 2 and Table 3 show the ROUGE scores we obtained in the experiments using LexPageRank, Degree, and Centroid in Tasks 2 and 4, respectively, sorted by ROUGE-1 score. 'lprXTY' indicates a policy in which the weight for LexPageRank is $X$ and $Y$ is used as the threshold. 'degreeXTY' is similar, except that the degree of a node in the similarity graph is used instead of its LexPageRank score. Finally, 'CX' shows a policy with Centroid weight $X$. We also include two baselines for each data set. 'random' indicates a method where we picked random sentences from the cluster to produce a summary. We performed five random runs for each data set; the results in the tables are for the median runs. The second baseline, shown as 'lead-based' in the tables, uses only the Position feature without any centrality method. This is tantamount to producing lead-based summaries, which is a widely used and very challenging baseline in the text summarization community (Brandow et al., 1995).</Paragraph> <Paragraph position="3"> The top scores we obtained on all data sets come from our new methods. The results provide strong evidence that Degree and LexPageRank are better than Centroid in multi-document generic text summarization. However, it is hard to say that Degree and LexPageRank are significantly different from each other. This is an indication that Degree may already be a good enough measure to assess the centrality of a node in the similarity graph. Considering its relatively low complexity, degree centrality still serves as a plausible alternative when one needs a simple implementation. Indeed, the computation of Degree can be done on the fly, as a side product of LexPageRank, just before the power method is applied to the similarity graph.</Paragraph> <Paragraph position="4"> Another interesting observation in the results is the effect of the threshold. Most of the top ROUGE scores belong to the runs with the threshold $0.1$, and the runs with the threshold $0.3$ are worse than the others most of the time. This is due to the information loss in the similarity graphs as we move to higher thresholds, as discussed in Section 3.</Paragraph> <Paragraph position="5"> As a comparison with the other summarization systems, we present the official scores for the top five DUC 2004 participants and the human summaries in Table 4 and Table 5 for Tasks 2 and 4, respectively. Our top few results for each task are either better than or statistically indistinguishable from the best system in the official runs, considering the 95% confidence interval.</Paragraph> </Section> </Section> </Paper>