<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3254">
<Title>Evaluating information content by factoid analysis: human annotation and stability</Title>
<Section position="6" start_page="0" end_page="0" type="concl">
<SectionTitle> 6 Discussion and future work </SectionTitle>
<Paragraph position="0"> We have presented a new information-based summarization metric, the weighted factoid score, which uses multiple summaries as its gold standard and which measures information overlap rather than string overlap. It can be reliably and objectively annotated in arbitrary text, as our high human agreement values show.</Paragraph>
<Paragraph position="1"> We summarize our results as follows: factoids can be defined with high agreement by independently operating annotators in naturally occurring text (K=.70) and independently annotated with even higher agreement (K=.86 and .87).</Paragraph>
<Paragraph position="2"> We therefore consider the definition of factoids intuitive and reproducible.</Paragraph>
<Paragraph position="3"> The number of factoids found does not tail off as new summaries are considered, but weighting factoids by frequency and/or location in the summary allows for a stable summary metric. We expect this can be improved further by including an information-content weighting factor.</Paragraph>
<Paragraph position="4"> If single summaries are used as gold standard (as in many other summarization evaluations), the correlation between rankings based on two such gold-standard summaries can be expected to be low; in our two experiments, the correlations were ρ=0.20 and 0.48 on average. According to our estimates, stable factoid scores can only be expected if a larger number of summaries is collected (in the range of 20–30 summaries).</Paragraph>
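As an illustration of the metric and the rank comparisons discussed above, the following is a minimal Python sketch, not the authors' implementation. It assumes summaries have already been reduced to sets of factoid identifiers, weights each factoid by its frequency across the gold-standard pool (one of the weighting options mentioned above; location-based weighting is omitted), and compares two score lists with Spearman's ρ. All identifiers and data are hypothetical.

    from collections import Counter

    def factoid_weights(gold_summaries):
        # Weight each factoid by the fraction of gold-standard summaries
        # that contain it (frequency weighting only; location weighting omitted).
        counts = Counter(f for summary in gold_summaries for f in set(summary))
        n = len(gold_summaries)
        return {f: c / n for f, c in counts.items()}

    def weighted_factoid_score(system_factoids, weights):
        # Share of the total gold-standard factoid weight recovered
        # by the system summary.
        total = sum(weights.values())
        recovered = sum(w for f, w in weights.items() if f in system_factoids)
        return recovered / total if total else 0.0

    def spearman_rho(xs, ys):
        # Spearman rank correlation between two score lists (assumes no ties).
        def ranks(values):
            order = sorted(range(len(values)), key=lambda i: values[i])
            r = [0] * len(values)
            for rank, i in enumerate(order):
                r[i] = rank + 1
            return r
        rx, ry = ranks(xs), ranks(ys)
        n = len(xs)
        d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

    # Hypothetical example: three gold-standard summaries and one system summary.
    gold = [{"f1", "f2", "f3"}, {"f1", "f3"}, {"f2", "f3", "f4"}]
    weights = factoid_weights(gold)                       # f3 -> 1.0, f1/f2 -> 0.67, f4 -> 0.33
    print(weighted_factoid_score({"f1", "f3"}, weights))  # 0.625

    # System scores obtained under two different gold standards can then be compared:
    print(spearman_rho([0.60, 0.20, 0.90], [0.50, 0.40, 0.70]))  # 1.0 (identical rankings)

In this scheme, factoids mentioned by more gold-standard summaries contribute more weight, so scores become less sensitive to any single summary as the pool grows.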
<Paragraph position="5"> System rankings based on the factoid score show only low correlation with rankings based on a) DUC-based information overlap and b) unigrams, a measure based on words shared between the gold-standard summaries and the system summary. As far as b) is concerned, this is expected, as factoid comparison abstracts over wording and captures linguistic variation of the same meaning. However, the ROUGE measure currently in development is considering various n-gram and WordNet-based paraphrasing options (Lin, personal communication). We expect that this measure has the potential for better ranking correlation with factoid ranking, and we are currently investigating this.</Paragraph>
<Paragraph position="6"> We also plan to expand our data sets to more texts, in order to investigate the presence and distribution of factoids, the types of factoids, and the relations between factoids in summaries and summary collections. Currently, we have two large factoid-annotated data sets, with 20 and 50 summaries respectively, and a workable procedure for annotating factoids, including the guidelines that were used to achieve good agreement. We now plan to enlist the help of new annotators to increase our data pool.</Paragraph>
<Paragraph position="7"> Another pressing line of investigation is reducing the cost of factoid analysis. The first reason why the analysis is currently expensive is the need for a large summary base per consensus summary. This may be circumvented by using larger numbers of different texts, as in IR and MT evaluation, where discrepancies prove to average out once large enough data sets are used. The second reason is the need for human annotation of factoids. Although simple word-based methods prove insufficient, we expect that existing and emerging NLP techniques based on deeper processing may help with automatic factoid identification.</Paragraph>
<Paragraph position="8"> All in all, the use of factoid analysis and the weighted factoid score, even though initially expensive to set up, provides a promising alternative which could well bring us closer to a solution to several problems in summarization evaluation.</Paragraph>
</Section>
</Paper>