<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0506"> <Title>A Study for Documents Summarization based on Personal Annotation</Title> <Section position="6" start_page="1" end_page="1" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> Evaluating text summarization is a deep problem, and open questions remain about the appropriate methods and types of evaluation. There are a variety of possible bases for comparing summarization performance, e.g., summary to source, or system summary to manual summary. In general, evaluation methods for text summarization fall into two categories (Firmin and B, 1998; Mani and Maybury, 1999). The first is intrinsic evaluation, which judges the quality of a summarization directly through analysis of the summary, including user judgments of the fluency of the summary, its coverage of the &quot;key/essential ideas&quot;, or its similarity to an &quot;ideal&quot; summary, which is hard to establish. The other is extrinsic evaluation, which judges the quality of a summarization by how it affects the completion of other tasks, such as question answering and comprehension tasks.</Paragraph>
<Paragraph position="1"> Here we use intrinsic evaluation to assess our summarization performance: the system summary is compared with an ideal manual summary. Since we need to collect annotations for the experimental documents, which requires reading through the text, the manual summaries can be produced right after this reading.</Paragraph>
<Paragraph position="2"> The document dataset to be evaluated is supplied with human annotations and summaries, which will be described in detail in the next section.</Paragraph>
<Paragraph position="3"> For each annotated document, our annotation-based summarization (ABS) system produces two versions of the summary: a generic summary, which does not consider annotations, and an annotated summary, which does. For evaluation, we compare the human-made summary with the generic summary, and the human-made summary with the annotated summary. Many measures are available for such comparisons (Firmin and B, 1998; Mani and Maybury, 1999), such as precision and recall, some of which are used in our evaluation. Another measure we are interested in is the cosine similarity of two summaries, which is defined over keywords and reflects the overall similarity of the two summaries in terms of their global keyword distribution. Precision and recall are generally applied to sentences; in fact, they can also be applied to keywords, where they reflect the percentage of keywords correctly identified. Therefore, in addition to summary similarity, our evaluation measures include sentence precision, sentence recall, keyword precision, and keyword recall. For keyword evaluation, a keyword is correct only if it occurs in the human-made summary. For sentence evaluation, a sentence in the summary is correct if it shares enough keywords with the corresponding sentence in the human-made summary, that is, if their similarity (calculated in the same way as summary similarity) is above a certain threshold.
We use two types of sentence matching in the experiments: perfect match, in which a sentence in the summary is correct only if it occurs in the manual summary; and conditional match, in which most concepts of the two sentences agree, i.e., the match similarity threshold is set below 1.</Paragraph>
<Paragraph position="4"> For a set of annotated documents, the average values of the above five measures are calculated to show the overall performance of the comparison.</Paragraph>
</Section> </Paper>
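As a rough illustration of the keyword-based measures described in Section 4, the sketch below computes the cosine similarity of two summaries over their keyword frequency vectors, and keyword precision/recall against the human-made summary. This is not the authors' implementation; the keyword extraction step is assumed to happen elsewhere, and all names are hypothetical.

import math
from collections import Counter
from typing import Iterable, Tuple

def keyword_vector(keywords: Iterable[str]) -> Counter:
    # Term-frequency vector over a summary's keywords.
    return Counter(keywords)

def cosine_similarity(vec_a: Counter, vec_b: Counter) -> float:
    # Summary similarity: cosine of the two keyword frequency vectors.
    dot = sum(vec_a[k] * vec_b[k] for k in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_precision_recall(system_kw: Iterable[str],
                             human_kw: Iterable[str]) -> Tuple[float, float]:
    # A system keyword counts as correct only if it occurs in the human-made summary.
    system, human = set(system_kw), set(human_kw)
    correct = len(system & human)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(human) if human else 0.0
    return precision, recall

Counter-based term-frequency vectors keep the sketch short; any keyword weighting scheme could be substituted without changing the measures themselves.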
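The sentence-level measures and the two matching modes can be sketched in the same spirit. Here `similarity` stands for any sentence-to-sentence similarity computed, as above, over keyword vectors; a threshold of 1 corresponds to perfect match (the sentence must occur in the manual summary) and a threshold below 1 to conditional match. The averaging step over a document set is included. Again, this is a hypothetical sketch rather than the paper's code, and the direction of matching used for recall is an assumption.

from typing import Callable, Dict, List, Tuple

def sentence_precision_recall(system_sents: List[str],
                              human_sents: List[str],
                              similarity: Callable[[str, str], float],
                              threshold: float = 1.0) -> Tuple[float, float]:
    # Perfect match (threshold >= 1): a sentence is correct only if it occurs
    # verbatim in the manual summary.  Conditional match (threshold < 1): a
    # sentence is correct if some manual sentence is similar enough.
    def correct(sent: str) -> bool:
        if threshold >= 1.0:
            return sent in human_sents
        return any(similarity(sent, h) >= threshold for h in human_sents)

    def covered(hsent: str) -> bool:
        if threshold >= 1.0:
            return hsent in system_sents
        return any(similarity(s, hsent) >= threshold for s in system_sents)

    precision = (sum(1 for s in system_sents if correct(s)) / len(system_sents)
                 if system_sents else 0.0)
    recall = (sum(1 for h in human_sents if covered(h)) / len(human_sents)
              if human_sents else 0.0)
    return precision, recall

def average_measures(per_document: List[Dict[str, float]]) -> Dict[str, float]:
    # Average the five measures (summary similarity, sentence precision/recall,
    # keyword precision/recall) over a set of annotated documents.
    keys = per_document[0].keys()
    return {k: sum(d[k] for d in per_document) / len(per_document) for k in keys}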