<?xml version="1.0" standalone="yes"?> <Paper uid="X98-1025"> <Title>SUMMARIZATION: (1) USING MMR FOR DIVERSITY-BASED RERANKING AND (2) EVALUATING SUMMARIES</Title> <Section position="11" start_page="188" end_page="193" type="evalu"> <SectionTitle> 9. EXPERIMENTS AND RESULTS </SectionTitle>
<Paragraph position="0"> In this section we describe the experiments we performed and the results obtained in evaluating the MMR diversity gain (Section 9.1), query expansion (Section 9.2), and compression (Section 9.3).</Paragraph>
<Section position="1" start_page="188" end_page="189" type="sub_section"> <SectionTitle> 9.1 MMR (Diversity Gain) </SectionTitle>
<Paragraph position="0"> In order to evaluate the relevance loss incurred for the MMR diversity gain in single-document summarization, we created summaries at two document length percentages (measured by number of sentences) and determined how many relevant sentences the summaries contained.</Paragraph>
<Paragraph position="1"> The results are given in Table 2 for document percentages 0.25 and 0.1. Two precision scores were calculated: (1) TREC relevance plus at least one CMU assessor marking the document as relevant (yielding 23 documents), and (2) at least two of the three CMU assessors marking the document as relevant (yielding 15 documents). From these scores we can see that there is no statistically significant difference between the λ=1, λ=.7, and λ=.3 scores. This is often explained by cases where the λ=1 ranking failed to pick up a piece of relevant information that the reranking with λ=.7 or .3 did, or vice versa. The baseline (baseln) contains the first N sentences of the document, where N is the number of sentences in the summary.</Paragraph>
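To make the λ trade-off concrete, the following is a minimal sketch of MMR-style sentence reranking. It is not the system's actual code: the similarity function, sentence list, and parameter names are illustrative assumptions, with a generic `sim(a, b)` standing in for whatever cosine or tf-idf similarity is used.

```python
# Minimal sketch of MMR sentence reranking (illustrative interfaces, not the paper's code).
from typing import Callable, List, Sequence

def mmr_rerank(
    sentences: Sequence[str],
    query: str,
    sim: Callable[[str, str], float],  # assumed similarity measure, e.g. cosine over tf-idf vectors
    lam: float = 0.7,                  # lambda: 1.0 = pure relevance, lower values reward diversity
    k: int = 5,                        # number of sentences to select
) -> List[str]:
    selected: List[str] = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def mmr_score(s: str) -> float:
            relevance = sim(s, query)
            # penalty for being similar to anything already selected
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam=1 this reduces to a relevance-only ranking; lam=.7 or .3 penalizes sentences similar to those already chosen, which is the diversity gain whose relevance cost is measured above.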
</Section>
<Section position="2" start_page="189" end_page="193" type="sub_section"> <SectionTitle> 9.2 Query Expansion </SectionTitle>
<Paragraph position="0"> We expanded the original queries by: (1) adding the highest ranked sentence of the document (a form of pseudo-relevance feedback), (2) adding the title, and (3) adding the title and the highest ranked sentence.</Paragraph>
<Paragraph position="1"> The most significant effects were shown for the short queries (see Figures 7, 9); for the longer queries, the effect was smaller (see Figures 8, 10). For 20% document length (characters, rounded up to the sentence boundary), adding the highest ranked sentence (prf) and the title to the query helps performance for the 110 set relevant summary judgments (Figures 7, 8). For 10% document length and short queries, adding just the title performed better than adding both prf and the title (Figures 9, 10). We will determine whether these results hold over more extensive data.</Paragraph>
<Paragraph position="2"> [Figures 7-10 (plots): performance curves; legend entries include 110 Set relevant docs (query, query+prf+title, short query, short query+prf+title), Model Summs (query, query+prf+title), and summary lengths of .1 to .5 of document length.] </Paragraph>
<Paragraph position="3"> These results are similar to those obtained for document information retrieval [27]. Since 72% of the first sentences were marked relevant (Table 3), one area we plan to explore is results using the first sentence in the summary and/or query under specified circumstances, such as our first sentence heuristics (Section 4).</Paragraph>
</Section>
<Section position="3" start_page="193" end_page="193" type="sub_section"> <SectionTitle> 9.3 Compression </SectionTitle>
<Paragraph position="0"> An important evaluation criterion for summarization is the ideal summary output length (compression of the document) and how it affects the user's task. To begin looking at this issue, we evaluated the performance of our system at different summary lengths as a percentage of the document length.</Paragraph>
<Paragraph position="1"> We used a document compression factor based on the number of characters in the document. If this cutoff fell in the middle of a sentence, the rest of the sentence was allowed, so the output summary ends up being slightly longer than the actual compression factor.</Paragraph>
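As a concrete illustration of the cutoff rule just described, here is a small sketch that truncates a sentence list to a character budget but always completes the sentence in progress. The function and its 20% example are illustrative assumptions, not the system's actual implementation, and sentence segmentation and ordering (e.g. by the summarizer's ranking) are assumed to be done beforehand.

```python
# Sketch of character-based compression rounded up to the sentence boundary (illustrative only).
from typing import List

def compress(sentences: List[str], compression: float) -> List[str]:
    """Keep sentences until the character budget is reached; the sentence that
    crosses the budget is kept whole, so the output is slightly longer than
    compression * total_chars."""
    total_chars = sum(len(s) for s in sentences)
    budget = compression * total_chars
    kept: List[str] = []
    used = 0
    for s in sentences:
        kept.append(s)
        used += len(s)
        if used >= budget:  # budget reached mid-sentence: stop after finishing this one
            break
    return kept

# Example: a 20% summary of a toy "document"
doc = ["First sentence of the article.", "Second sentence.", "Third sentence.", "Fourth."]
print(compress(doc, 0.2))
```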
<Paragraph position="2"> The data set statistics are shown in Tables 3 and 4. Note that non-relevant documents (Table 4) still have a high percentage of relevant sentences. Ten documents in the 110 set were non-relevant and had no relevant sentences. We also see that the summary length, or number of relevant sentences chosen per document, varies significantly.</Paragraph>
<Paragraph position="3"> Summaries were compared using the modified interpolated normalized recall-precision curve described previously (Section 8.2).</Paragraph>
<Paragraph position="4"> In Figure 11, we examine the effect of compression on normalized recall and precision, and in Figure 12 we show a plot of normalized F1. The F1 graph indicates that the normalized F1 score is helped by having the pseudo-relevance feedback sentence and the title in the query, thereby extracting relevant sentences that would otherwise be missed. As the number of sentences allowed in the summary grows, the difficulty of finding relevant sentences grows, and thus adding the prf sentence and the title to the query helps to find relevant sentences for a particular document. We need to study further the effects of query expansion and compression on summarization, and to see whether our preliminary results hold for additional data sets.</Paragraph>
<Paragraph position="5"> If we calculate the normalized F1 score for the first sentence retrieved in the summary, we obtain a score of .80 for the 110 Set standard query, .67 for the 110 Set short query, and .79 for the Model Summaries. This indicates that even for the short query we obtain a relevant sentence two thirds of the time. Ideally, however, this first-sentence retrieval score would be 1.0, and we will explore methods to increase this score as well as to select a "highly relevant" first retrieved sentence for the document.</Paragraph>
</Section> </Section> </Paper>