File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/03/w03-0508_concl.xml
Size: 5,104 bytes
Last Modified: 2025-10-06 13:53:40
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0508"> <Title>Examining the consensus between human summaries: initial experiments with factoid analysis</Title> <Section position="5" start_page="0" end_page="0" type="concl"> <SectionTitle> 4 Discussion and future work </SectionTitle>
<Paragraph position="0"> From our experiences so far, it seems that both our innovations, viz. using multiple summaries and measuring with factoids, are worth pursuing further. We summarise the results for our test text in the following: + We observe a very wide selection of factoids in the summaries, only a few of which are included by all summarisers.</Paragraph>
<Paragraph position="1"> + The number of factoids found does not tail off as new summaries are considered.</Paragraph>
<Paragraph position="2"> + There is a clear importance hierarchy of factoids which allows us to compile a consensus summary.</Paragraph>
<Paragraph position="3"> + If single summaries are used as gold standard, the correlation between rankings based on two such gold standard summaries is low.</Paragraph>
<Paragraph position="4"> + We could not find any large clusters of highly correlated summarisers in our data.</Paragraph>
<Paragraph position="5"> + Stability with respect to the consensus summary can only be expected if a larger number of summaries is collected (at least 30-40 summaries).</Paragraph>
<Paragraph position="6"> + A unigram-based measurement shows only low correlation with the factoid-based measurement. The information that is gained through multiple summaries with factoid similarity is insufficiently approximated by the currently used substitutes, as the observations above show. However, what we have described here must clearly be seen as an initial experiment, and there is still much to be done.</Paragraph>
<Paragraph position="7"> First of all, the notation of the factoid (currently atomic) needs to be made more expressive, e.g. by the addition of variables for discourse referents and events, which will make factoids more similar to FOPL expressions, and/or by the use of a typing mechanism to indicate the various forms of inference/implication. We also need to identify a good weighting scheme to be used in measuring the similarity of factoid vectors. The weighting should correct for the variation between factoids in information content, for their different positions along an inference chain, and possibly for their position in the summary. It should also be able to express some notion of the importance of the factoids, e.g. as measured by the number of summaries containing the factoid (a small sketch of such a weighted comparison is given below).</Paragraph>
<Paragraph position="8"> Something else to investigate is the presence and distribution of factoids, types of factoids and relations between factoids in summaries and summary collections. We have the strong feeling that some of our observations were tightly linked to the type of text we used. We would like to build a balanced corpus of texts, of various subject areas and lengths, and their summaries, at several different lengths and possibly even multi-document, so that we can study this factor. An open question is how many summaries we should try to get for each of the texts in the corpus. It is unlikely that we will be able to collect 50 summaries for each new text. Furthermore, the texts of the corpus should also be summarised by as many machine summarisers as possible, so that we can test ranking these on the basis of factoids in a realistic framework.</Paragraph>
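As a purely illustrative sketch of the kind of weighting scheme mentioned above, factoid occurrences could be encoded as sets and compared with a weighted overlap measure. The factoid labels (F1, F2, ...), the frequency-derived weights and the Dice-style overlap are all assumptions made for the example, not the scheme used in our experiments.

```python
from collections import Counter

def factoid_weights(summaries):
    """Weight each factoid by the fraction of summaries containing it.
    This is only one candidate notion of importance; information content or
    position along an inference chain could be folded in the same way."""
    counts = Counter(f for summary in summaries for f in summary)
    return {f: c / len(summaries) for f, c in counts.items()}

def weighted_similarity(factoids_a, factoids_b, weights):
    """Dice-style overlap between two factoid sets, weighted per factoid."""
    a, b = set(factoids_a), set(factoids_b)
    overlap = sum(weights.get(f, 0.0) for f in a & b)
    norm = sum(weights.get(f, 0.0) for f in a) + sum(weights.get(f, 0.0) for f in b)
    return 2 * overlap / norm if norm else 0.0

# Hypothetical factoid labels standing in for manually annotated factoids.
human_summaries = [{"F1", "F2", "F3"}, {"F1", "F3", "F5"}, {"F1", "F4"}]
weights = factoid_weights(human_summaries)
print(weighted_similarity({"F1", "F2"}, {"F1", "F4"}, weights))
```

With frequency-derived weights of this kind, agreement on a factoid that every summariser includes counts for more than agreement on a rarely mentioned one.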
<Paragraph position="9"> A final line of investigation is searching for ways to reduce the cost of factoid analysis. The first reason why this analysis is currently expensive is the need for large summary bases for consensus summaries.</Paragraph>
<Paragraph position="10"> There is hope, however, that this can be circumvented by using larger numbers of texts, as is the case in IR and in MT, where discrepancies prove to average out when large enough datasets are used. Papineni et al., e.g., were able to show that the ranking with their Bleu measure of the five evaluated translators (two human and three machine) remained stable if only a single reference translation was used, suggesting that "we may use a big corpus with a single reference translation, provided that the translations are not all from the same translator". Possibly a similar averaging effect will occur in the evaluation of summarisation, so that smaller summary bases can be used (a small sketch of such a ranking-stability check is given below). The second reason is the need for human annotation of factoids. Although simple unigram-based methods prove insufficient, we will hopefully be able to come a long way in automating factoid identification on the basis of existing NLP techniques, combined with information gained about factoids in research as described in the previous paragraph. All in all, the use of consensus summaries and factoid analysis, even though expensive to set up for the moment, provides a promising alternative which could well bring us closer to a solution to several problems in summarisation evaluation.</Paragraph> </Section> </Paper>
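As a sketch of how such a ranking-stability check could be run once machine summaries are available: the system names, factoid sets, consensus threshold (half the human summaries) and recall-style score below are all invented for illustration, not results or methods from our experiments.

```python
from scipy.stats import spearmanr

def factoid_recall(system_factoids, reference_factoids):
    """Fraction of reference factoids that a system summary covers."""
    ref = set(reference_factoids)
    return len(ref & set(system_factoids)) / len(ref) if ref else 0.0

# Hypothetical factoid sets for three machine summarisers.
systems = {
    "sys_A": {"F1", "F2", "F3"},
    "sys_B": {"F1", "F4"},
    "sys_C": {"F2", "F3", "F6"},
}

# Consensus reference: factoids occurring in at least half of the human summaries.
human_summaries = [{"F1", "F2", "F3"}, {"F1", "F3", "F5"}, {"F1", "F2", "F4"}]
consensus = {f for f in set().union(*human_summaries)
             if sum(f in s for s in human_summaries) >= len(human_summaries) / 2}
single_reference = human_summaries[1]

consensus_scores = [factoid_recall(s, consensus) for s in systems.values()]
single_scores = [factoid_recall(s, single_reference) for s in systems.values()]

# High rank correlation would suggest that a single reference summary suffices
# for ranking systems; low correlation would argue for the full consensus.
rho, _ = spearmanr(consensus_scores, single_scores)
print(rho)
```

This mirrors the stability test Papineni et al. describe for Bleu, applied to factoid coverage rather than n-gram statistics.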