<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1124">
  <Title>Improving summarization through rhetorical parsing tuning</Title>
  <Section position="5" start_page="211" end_page="214" type="evalu">
    <SectionTitle>
4.4.2 Results
</SectionTitle>
    <Paragraph position="0"> The TREC corpus. We have experimented with different values for noOfTries, noOfSteps, and Δw. When we ran the algorithm shown in figure 2 on the collection of 25 texts in our training TREC corpus, with noOfTries = 50, noOfSteps = 60, and Δw = 0.4, we obtained multiple configurations of weights that yielded maximal F-values of the recall and precision figures at both 10% and 20% cutoffs. Table 5 shows only the two best configurations for each cutoff. The best configuration of weights for the 10% cutoff recalls 68.33% of the sentences considered important by human judges in the whole TREC corpus, with a precision of 84.16%. The F-value of the recall and precision figures for this configuration is 75.42%, which is approximately 4% lower than the F-value that pertains to human judges and 3.5% higher than the F-value that pertains to the lead-based algorithm. The results in table 5 show that at the 10% cutoff there is not much difference between summaries built by human judges, by the rhetorical parser, and by the lead-based algorithm.</Paragraph>
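The F-values quoted throughout are the harmonic mean of the recall and precision figures. As a quick check, a minimal Python sketch reproduces the 10% cutoff figure reported above:

```python
def f_value(recall, precision):
    """Harmonic mean of recall and precision (the F-value used in the paper)."""
    return 2 * recall * precision / (recall + precision)

# Best 10% cutoff configuration on the TREC corpus:
# 68.33% recall, 84.16% precision.
print(round(f_value(0.6833, 0.8416) * 100, 2))  # → 75.42
```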
    <Paragraph position="1"> Since the lead-based algorithm performs so well at the 10% level, the only conclusion that we can draw is that for short newspaper articles, the lead-based algorithm is the most efficient solution.</Paragraph>
    <Paragraph position="2"> At the 20% cutoff, the best configuration of weights recalls 59.51% of the sentences considered important by human judges in the whole corpus, with 72.11% precision. The F-value of the recall and precision figures for this configuration is 65.21%, which is about 7.5% lower than the F-value that pertains to human judges and 8.5% higher than the F-value that pertains to the lead-based algorithm. These results suggest that when we want to build longer summaries, the lead-based heuristic is no longer appropriate even within the newspaper genre (and even for simple articles).</Paragraph>
    <Paragraph position="3"> The Scientific American corpus. We also ran the algorithm shown in figure 2 on the collection of five texts in our Scientific American corpus, with noOfTries = 120, noOfSteps = 50, and Δw = 0.4. Table 6 shows 2 configurations of weights that yielded maximal F-values of the recall and precision figures at the clause-like unit level and 4 configurations of weights that yielded maximal F-values at the sentence level. The best combination of weights for summarization at the clause-like unit level recalls 67.57% of the elementary units considered important by human judges, with a precision of 73.53%. The F-value of the recall and precision figures for this configuration is 70.42%, which is less than 1% lower than the F-value that pertains to human judges and about 30% higher than the F-value that pertains to the lead-based algorithm. This result significantly outperforms the previous 52.77% recall and 50.00% precision figures that were obtained by Marcu (1997c; 1997a) using only the shape-based heuristic.</Paragraph>
    <Paragraph position="4"> The best combination of weights for summarization at the sentence level recalls 69.23% of the sentences considered important by human judges, with a precision of 64.29%. The F-value of the recall and precision figures for this configuration is 66.67%, which is about 12% lower than the F-value that pertains to human judges but about 12% higher than the F-value that pertains to the lead-based algorithm. These results suggest that although Scientific American articles cannot be summarized properly by applying a simple, lead-based heuristic, they can be summarized by applying the discourse-based algorithm.</Paragraph>
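The experiments above score textual units with a weighted combination of heuristics and then keep the top-scoring units up to the cutoff. The paper does not spell out the combination function here, so the sketch below assumes a simple weighted linear combination; the heuristic names and the `top_units` helper are illustrative, not taken from the paper:

```python
def combined_score(scores, weights):
    """Weighted linear combination of per-heuristic scores for one unit.

    `scores` and `weights` map heuristic names (e.g. 'position', 'title',
    'shape') to values; both the names and the linear form are assumptions.
    """
    return sum(weights[h] * scores[h] for h in weights)

def top_units(units, weights, cutoff):
    """Select the ids of the top `cutoff` fraction of units by combined score."""
    ranked = sorted(units, key=lambda u: combined_score(u["scores"], weights),
                    reverse=True)
    k = max(1, round(cutoff * len(units)))
    return [u["id"] for u in ranked[:k]]
```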
    <Paragraph position="5"> Discussion. Because of the limited size of the corpora and because of the rhetorical ambiguity of some of the texts in the TREC corpus, carrying out cross-validation experiments was either meaningless or prohibitively expensive. As a consequence, the recall and precision results that are reported in this paper can be interpreted as being only suggestive of discourse-based summarization performance. However, the experiments do support conclusions that pertain to the integration of multiple heuristics.</Paragraph>
    <Paragraph position="6"> [Figure 2: The weight-tuning algorithm.
Input: A corpus of texts C_T; the manually built summaries S_T for the texts in C_T; NoOfTries, NoOfSteps, Δw.
Output: The weights W_max = {w_clus, w_mark, ..., w_con} that yield the best summaries with respect to C_T and S_T.
1. W_max = {w_clus, w_mark, ..., w_con} = {rand(0,1), rand(0,1), ..., rand(0,1)}
2. for tries = 1 to NoOfTries
3.   W_t = {w_clus, w_mark, ..., w_con} = {rand(0,1), rand(0,1), ..., rand(0,1)}
4.   F_t = F_RecallAndPrecision(w_clus, w_mark, ..., w_con);
5.   for flips = 1 to NoOfSteps
6.     F_1 = F_RecallAndPrecision(w_clus + Δw, w_mark, ..., w_con);
7.     F_2 = F_RecallAndPrecision(w_clus - Δw, w_mark, ..., w_con);
8.     ...
9.     F_13 = F_RecallAndPrecision(w_clus, w_mark, ..., w_con + Δw);
10.    F_14 = F_RecallAndPrecision(w_clus, w_mark, ..., w_con - Δw);
11.    F_max = max(F_t, F_1, F_2, ..., F_14);
12.    F_t = randomOf(F_max);
13.    W_t = weightsOf(F_t);
14.  endfor
16. endfor
17. return W_max]</Paragraph>
    <Paragraph position="7"> The analysis of the patterns of weights in tables 5 and 6 shows that, for both corpora, no individual heuristic is a clear winner with respect to its contribution to obtaining good summaries. For the TREC corpus, with the exception of the rhetorical-clustering- and the connectedness-based heuristics, all other heuristics seem to contribute consistently to the improvement in summarization quality. For the Scientific American corpus, when combined with other heuristics, the marker-, rhetorical-clustering-, shape-, and title-based heuristics seem to contribute consistently to the improvement in recall and precision figures in almost all cases. In contrast, the clustering-, position-, and connectedness-based heuristics seem to be even detrimental with respect to the collection of texts that we considered.</Paragraph>
    <Paragraph position="11"> However, the conclusion that seems to be supported by  the data in tables 5 and 6 is that the strength of a summarization system does not depend so much on the use of one heuristic, but rather on the ability of the system to use an optimal combination of heuristics. The data also shows that the optimal combinations need not necessarily follow a common pattern: for example, the combinations of heuristics that yield the highest F-values of the recall and precision figures for the 10% cutoff in the TREC corpus differ dramatically: one combination relies almost entirely on the position-based heuristic, while the other combination uses a much more balanced combination of heuristics that is slightly biased towards assigning more importance to the clustering-based heuristic.</Paragraph>
    <Paragraph position="12"> In addition to the analysis of the patterns of weights that yielded optimal summaries in the two corpora, we also examined the appropriateness of using combinations of weights that were optimal for a given summary cutoff in order to summarize texts at a different cutoff. Table 7 shows the recall and precision figures that are obtained when the patterns of weights that yielded optimal summaries at the 10% cutoff are used to summarize the texts in the TREC corpus at the 20% cutoff, and the recall and precision figures that are obtained when the patterns of weights that yielded optimal summaries at the 20% cutoff are used to summarize the same texts at the 10% cutoff. As the figures in table 7 show, the combinations of heuristics that yielded optimal summaries at a particular cutoff do not yield optimal summaries at other cutoffs, although we can still find combinations of heuristics that outperform the lead-based algorithm at both cutoffs. The results in table 7 suggest that there are at least two ways in which one can train a summarization system. If the system is going to be used frequently to summarize texts at a given cutoff, then it makes sense to train it to produce good summaries at that cutoff. However, if the system is going to be used to generate summaries of various lengths, then a different training methodology should be adopted, one that would ensure optimality across the whole cutoff spectrum, from 1 to 99%.</Paragraph>
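Re-scoring a weight configuration at a different cutoff amounts to re-selecting the top units at the new cutoff and comparing them against the units the human judges marked as important. A minimal sketch of that comparison (unit ids are illustrative):

```python
def recall_precision(selected, human):
    """Recall and precision of a summary against the human-judged
    important units; both arguments are collections of unit ids."""
    selected, human = set(selected), set(human)
    hits = len(selected & human)
    return hits / len(human), hits / len(selected)

# Evaluating one configuration at two cutoffs simply means selecting
# top units at 10% and at 20% and calling recall_precision on each.
```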
  </Section>
</Paper>