<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1017">
<Title>Event-Based Extractive Summarization</Title>
<Section position="7" start_page="0" end_page="0" type="evalu">
<SectionTitle> 6 Experiments </SectionTitle>
<Paragraph position="0"> We chose as our input data the document sets used in the evaluation of multidocument summarization during the first Document Understanding Conference (DUC), organized by NIST (Harman and Marcu, 2001). This collection contains 30 test document sets, each with approximately 10 news stories on different events; the document sets vary significantly in their internal coherence. For each document set, three human-constructed summaries are provided at each of the target lengths of 50, 100, 200, and 400 words. We selected DUC 2001 because ideal summaries are available for multiple lengths.</Paragraph>
<Paragraph position="1"> Concepts and Textual Units. Our textual units are sentences, while the features representing concepts are either atomic events, as described in Section 4, or a fairly basic and widely used set of lexical features, namely the list of words present in each input text. The algorithm for extracting event triplets assigns a weight to each such triplet, while for words we used their tf*idf values as weights, taking idf values from http://elib.cs.berkeley.edu/docfreq/.</Paragraph>
<Paragraph position="3"> Evaluation Metric. Given the difficulty of arriving at a universally accepted evaluation measure for summarization, and the fact that obtaining human judgments is time-consuming and labor-intensive, we adopted an automated process for comparing system-produced summaries to "ideal" summaries written by humans. The method, ROUGE (Lin and Hovy, 2003), is based on n-gram overlap between the system-produced and ideal summaries. As such, it is a recall-based measure, and it requires that the length of the summaries be controlled to allow meaningful comparisons.</Paragraph>
<Paragraph position="4"> ROUGE can be readily applied to compare the performance of different systems on the same set of documents, assuming that ideal summaries are available for those documents. At the same time, ROUGE evaluation has not yet been tested extensively, and ROUGE scores are difficult to interpret because they are not absolute and are not comparable across source document sets.</Paragraph>
<Paragraph position="6"> In our comparison, we used as reference summaries those created by NIST assessors for the DUC task of generic summarization. The human annotators might not have created the same model summaries had they been asked for summaries describing the major events in the input texts rather than generic summaries.</Paragraph>
<Paragraph position="7"> Summary Length. For a given set of features and a given selection algorithm, we obtain a sorted list of sentences extracted according to that algorithm. Then, for each DUC document set, we create four summaries of length 50, 100, 200, and 400 words. In all the suggested methods a whole sentence is added at every step; we extracted exactly 50, 100, 200, and 400 words out of the top sentences, truncating the last sentence if necessary.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.1 Results: Static Greedy Algorithm </SectionTitle>
<Paragraph position="0"> In our first experiment we use the static greedy algorithm to create summaries of various lengths.
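As a concrete illustration of the length-controlled summary construction and the recall-oriented scoring described above, the following sketch (plain Python; all function and variable names are hypothetical, and the unigram-recall measure is only in the spirit of ROUGE, not the official ROUGE implementation) shows how summaries of exactly 50, 100, 200, and 400 words can be cut from a ranked sentence list and compared against the reference summaries.

    from collections import Counter

    # Illustrative sketch only: names and details here are assumptions,
    # not the implementation used for the experiments reported in this paper.

    def build_summary(ranked_sentences, target_length):
        """Concatenate top-ranked sentences and truncate the last one so
        that the summary contains exactly target_length words (or fewer,
        if the ranked list runs out)."""
        words = []
        for sentence in ranked_sentences:
            remaining = target_length - len(words)
            if remaining == 0:
                break
            words.extend(sentence.split()[:remaining])
        return " ".join(words)

    def unigram_recall(system_summary, reference_summaries):
        """Recall-oriented unigram overlap in the spirit of ROUGE-1:
        the fraction of reference unigrams that also occur in the system
        summary, averaged over the available reference summaries."""
        system_counts = Counter(system_summary.lower().split())
        scores = []
        for reference in reference_summaries:
            reference_counts = Counter(reference.lower().split())
            overlap = sum(min(count, system_counts[token])
                          for token, count in reference_counts.items())
            scores.append(overlap / max(1, sum(reference_counts.values())))
        return sum(scores) / len(scores)

    # One summary per DUC target length:
    # for n in (50, 100, 200, 400):
    #     summary = build_summary(ranked_sentences, n)
    #     score = unigram_recall(summary, reference_summaries_for_length[n])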
Table 2 shows in how many cases out of the 30 document sets the summary created according to atomic events receives a higher or lower ROUGE score than the summary created according to tf*idf features (rows "events better" and "tf*idf better", respectively). Row "equal" indicates how many of the document sets receive the same ROUGE score under both feature sets. We report the number of times each system is better, rather than the average ROUGE score in each case, because ROUGE scores depend on each particular document set.</Paragraph>
<Paragraph position="1"> It is clear from Table 2 that the summaries created using atomic events are better in the majority of cases than the summaries created using tf*idf.</Paragraph>
<Paragraph position="2"> Figure 1 shows ROUGE scores for 400-word summaries. Although in most cases the performance of the event-based summarizer is higher than that of the tf*idf-based summarizer, for some document sets tf*idf gives the better scores. This phenomenon can be explained through an additional analysis of the document sets according to their internal coherence. Atomic event extraction works best for a collection of documents with well-defined constituent parts of events and where the documents are clustered around one specific major event. For such document sets, atomic events are good features on which to base the summary. In contrast, some DUC 2001 document sets describe a succession of multiple events linked in time, or different events of the same type (e.g., Clarence Thomas' ascendancy to the Supreme Court, document set 7 in Figure 1, or the history of airplane crashes, document set 30 in Figure 1). In such cases, many different participants are mentioned with only a few common elements (e.g., Clarence Thomas himself). Thus, most of the atomic events have similarly low weights, and it is difficult to identify the atomic events that point out the most important textual units.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.2 Results: Adaptive Greedy Algorithm </SectionTitle>
<Paragraph position="0"> For the second experiment we used the adaptive greedy algorithm, which accounts for information overlap across sentences in the summary. As in the case of the simpler static greedy algorithm, we observe that events lead to better performance than tf*idf in most document sets (Table 3). Table 3 is in fact similar to Table 2, with slightly increased numbers of document sets for which events receive higher ROUGE scores for the 100- and 200-word summaries. It is interesting to see that the difference between the ROUGE scores of the summarizers based on atomic events and on tf*idf features becomes more distinct when the adaptive greedy algorithm is used; Figure 2 shows this for 400-word summaries.</Paragraph>
<Paragraph position="1"> As Table 4 shows, the use of the adaptive greedy algorithm improves the performance of a summarizer based on atomic events in comparison to the static greedy algorithm. In contrast, the reverse is true when tf*idf is used (Table 5). Figure 3 shows the change in ROUGE scores that the introduction of the adaptive algorithm offers for 400-word summaries. This indicates that tf*idf is not compatible with our information redundancy component; a likely explanation is that words are correlated, and the presence of an important word makes other words in the same sentence also potentially important, a fact not captured by the tf*idf feature.
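To make the role of the information-overlap component concrete, the sketch below shows one way an overlap-aware ("adaptive") greedy selection loop can be organized: once a sentence is selected, the concepts it covers stop contributing to the scores of the remaining sentences. The names and the zero-out reweighting step are illustrative assumptions for this sketch, not necessarily the exact adaptive procedure defined earlier in the paper.

    # Illustrative sketch of an overlap-aware (adaptive) greedy selection loop.
    # The zero-out reweighting below is an assumption made for illustration.

    def adaptive_greedy(sentences, concept_weights, max_sentences):
        """sentences: list of (sentence_text, concept_id_set) pairs;
        concept_weights: dict mapping concept id to weight (an atomic-event
        weight or a tf*idf value)."""
        remaining_weights = dict(concept_weights)
        selected = []
        candidates = list(sentences)
        while candidates and len(selected) != max_sentences:
            # Score each candidate by the total weight of the concepts it
            # covers that are not yet covered by the summary so far.
            def uncovered_weight(item):
                _, concepts = item
                return sum(remaining_weights.get(c, 0.0) for c in concepts)
            best = max(candidates, key=uncovered_weight)
            if uncovered_weight(best) == 0.0:
                break  # nothing new left to add
            selected.append(best[0])
            candidates.remove(best)
            # Adaptive step: concepts already covered no longer contribute.
            for concept in best[1]:
                remaining_weights[concept] = 0.0
        return selected

In the static variant of the algorithm, this reweighting step is simply absent, so each sentence is scored once against the original concept weights.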
Events, on the other hand, exhibit less of a dependence on each other than words do, since each triplet captures a specific interaction between two entities.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.3 Results: Modified Greedy Algorithm </SectionTitle>
<Paragraph position="0"> In the case of the modified adaptive greedy algorithm we see an improvement in performance, in comparison with the summarizers using the static greedy algorithm, for both events and tf*idf (Tables 6 and 7). In other words, the prioritization of individual important concepts addresses the correlation between words and allows the summarizer to benefit from redundancy reduction even when tf*idf is used as the feature set. The modified adaptive algorithm offers a slight improvement in ROUGE scores over the unmodified adaptive algorithm. Also, as Table 8 makes clear, events remain the better feature choice over tf*idf.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.4 Results: Comparison with DUC systems </SectionTitle>
<Paragraph position="0"> For our final experiment we used the 30 test document sets provided for the DUC 2003 competition, for which the summaries produced by the participating summarization systems were also released. In DUC 2003 the task was to create summaries of length 100 words only.</Paragraph>
<Paragraph position="1"> We calculated ROUGE scores for the released summaries created by the DUC participants and compared them to the scores of our system, with atomic events as features and the adaptive greedy algorithm as the filtering method. In 14 out of 30 cases our system outperforms the median of the scores of the 15 participating systems on that specific document set. We view this comparison as quite encouraging, as our system does not employ any of the additional features (such as sentence position or time information) used by the best DUC summarization systems, nor was it adapted to the DUC domain.</Paragraph>
<Paragraph position="2"> Again, the suitability (and relative performance) of the event-based summarizer varies according to the type of documents being summarized, indicating that our approach is better suited to a subset of the document sets. For example, our system scored below all the other systems for the document set about a meteor shower, which included a lot of background information and no well-defined constituents of events. In contrast, our system performed better than any DUC system for the document set describing an abortion-related murder, where it was clear who was killed, and where and when it happened.</Paragraph>
</Section>
</Section>
</Paper>