<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1027"> <Title>An Empirical Study of Information Synthesis Tasks</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Creation of an Information Synthesis Testbed </SectionTitle> <Paragraph position="0"> We refer to Information Synthesis as the process of generating a topic-oriented report from a non-trivial amount of relevant, possibly interrelated documents. The first goal of our work is the generation of a testbed (ISCORPUS) with manually produced reports that serve as a starting point for further empirical studies and evaluation of information synthesis systems. This section describes how this testbed has been built.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Document collection and topic set </SectionTitle> <Paragraph position="0"> The testbed must have a number of features which, taken together, differentiate the task from current multi-document summarization evaluations: Complex information needs. Since Information Synthesis is a step that immediately follows a document retrieval process, it seems natural to start with standard IR topics as used in evaluation conferences such as TREC, CLEF or NTCIR. The title/description/narrative topics commonly used in such evaluation exercises are especially well suited for an Information Synthesis task: they are complex and well defined, unlike, for instance, typical web queries.</Paragraph> <Paragraph position="1"> We have selected the Spanish CLEF 2001-2003 news collection testbed (Peters et al., 2002), because Spanish is the native language of the subjects recruited for the manual generation of reports. Out of the CLEF topic set, we have chosen the eight topics with the largest number of documents manually judged as relevant in the assessment pools. We have slightly reworded the topics to change the document retrieval focus (&quot;Find documents that...&quot;) into an information synthesis wording (&quot;Generate a report about...&quot;). Table 1 shows the eight selected topics.</Paragraph> <Paragraph position="2"> C042: Generate a report about the invasion of Haiti by UN/US soldiers.</Paragraph> <Paragraph position="3"> C045: Generate a report about the main negotiators of the Middle East peace treaty between Israel and Jordan, giving detailed information on the treaty.</Paragraph> <Paragraph position="4"> C047: What are the reasons for the military intervention of Russia in Chechnya? C048: Reasons for the withdrawal of United Nations (UN) peacekeeping forces from Bosnia.</Paragraph> <Paragraph position="5"> C050: Generate a report about the uprising of Indians in Chiapas (Mexico).</Paragraph> <Paragraph position="6"> C085: Generate a report about the operation &quot;Turquoise&quot;, the French humanitarian program in Rwanda.</Paragraph> <Paragraph position="7"> C056: Generate a report about campaigns against racism in Europe.</Paragraph> <Paragraph position="8"> C080: Generate a report about hunger strikes attempted in order to attract attention to a cause.</Paragraph> <Paragraph position="9"> This set of eight CLEF topics has two distinct subsets: in a majority of cases (the first six topics), it is necessary to study how a situation evolves over time; the importance of every event related to the topic can only be established in relation to the others. The invasion of Haiti by UN and USA troops (C042) is an example of such a topic. 
We will refer to them as &quot;Topic Tracking&quot; (TT) reports, because they resemble the kind of topics used in that task. The last two questics (C056 and C080), however, resemble Information Extraction tasks: essentially, the user has to detect and describe instances of a generic event (cases of hunger strikes and campaigns against racism in Europe); hence we will refer to them as &quot;IE&quot; reports.</Paragraph> <Paragraph position="10"> Topic tracking reports need a more elaborate treatment of the information in the documents, and are therefore more interesting from the point of view of Information Synthesis. We have, however, decided to keep the two IE topics; first, because they also reflect a realistic synthesis task; and second, because they can provide contrastive information when compared with TT reports.</Paragraph> <Paragraph position="11"> Large document sets. All the selected CLEF topics have more than one hundred documents judged as relevant by the CLEF assessors. For homogeneity, we have restricted the task to the first 100 documents for each topic (in chronological order).</Paragraph> <Paragraph position="12"> Complex reports. The elaboration of a comprehensive report requires more space than is allowed in current multi-document summarization experiments. We have established a maximum of fifty sentences per summary, i.e., half a sentence per document. This limit satisfies three conditions: a) it is large enough to contain the essential information about the topic, b) it requires a substantial compression effort from the user, and c) it avoids defaulting to a &quot;first sentence&quot; strategy by lazy (or tired) users, because this strategy would double the maximum size allowed.</Paragraph> <Paragraph position="13"> We decided that the report generation would be an extractive task, which consists of selecting sentences from the documents. Obviously, a realistic information synthesis process also involves rewriting and elaboration of the texts contained in the documents. Keeping the task extractive has, however, two major advantages: first, it permits a direct comparison to automatic systems, which will typically be extractive; and second, it is a simpler task that produces less fatigue.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Generation of manual reports </SectionTitle> <Paragraph position="0"> Nine subjects between 25 and 35 years old were recruited for the manual generation of reports. All of them self-reported university degrees and extensive experience in using search engines and performing information searches.</Paragraph> <Paragraph position="1"> All subjects were given an in-place detailed description of the task in order to minimize divergent interpretations. They were told that, in a first step, they had to generate reports with as much information as possible about every topic within the fifty-sentence limit. In a second step, which would take place six months afterwards, they would be examined on each of the eight topics. The only documentation allowed during the exam would be the reports generated in the first phase of the experiment. Subjects scoring best would be rewarded.</Paragraph> <Paragraph position="2"> These instructions had two practical effects: first, the competitive setup was an extra motivation for achieving better results. And second, users tried to take advantage of all available space, and thus most reports were close to the fifty-sentence limit. 
The time limit per topic was set to 30 minutes, which is tight for the information synthesis task, but prevents the effects of fatigue.</Paragraph> <Paragraph position="3"> We implemented an interface to facilitate the generation of extractive reports. The system displays a list with the titles of the relevant documents in chronological order. Clicking on a title displays the full document, where the user can select any sentence(s) and add them to the final report. A different frame displays the selected sentences (also in chronological order), together with one bar indicating the remaining time and another bar indicating the remaining space. The 50-sentence limit can be temporarily exceeded and, once the 30-minute limit has been reached, the user can still remove sentences from the report until it is back within the sentence limit.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Questionnaires </SectionTitle> <Paragraph position="0"> After summarizing every topic, the following questionnaire was filled in by every user: Who are the main people involved in the topic? What are the main organizations participating in the topic? What are the key factors in the topic? Users provided free-text answers to these questions, with their freshly generated summary at hand. We did not provide any suggestions or constraints at this point, except that a maximum of eight slots were available per question (i.e. a maximum of 8 x 3 = 24 key concepts per topic, per user).</Paragraph> <Paragraph position="1"> This is, for instance, the answer of one user for topic C042 about the invasion of Haiti by UN and USA troops in 1994: militares golpistas (coup-plotting soldiers), golpe militar (military coup), restaurar la democracia (reinstatement of democracy). Finally, a single list of key concepts is generated for each topic, joining all the different answers. Redundant concepts (e.g. &quot;war&quot; and &quot;conflict&quot;) were inspected and collapsed by hand. These lists of key concepts constitute the gold standard for the similarity metric described in Section 3.2.5.</Paragraph>
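As an illustration of how the per-user answers can be merged into a single gold-standard list, the sketch below (Python, not part of the original study) joins the answers of all users and collapses redundant concepts through a hand-crafted synonym map; the map, the normalisation, and the concept strings in the usage example are hypothetical stand-ins for the manual inspection described above.

```python
from typing import Dict, List

def build_gold_standard(user_answers: List[List[str]],
                        synonym_map: Dict[str, str]) -> List[str]:
    """Join the key concepts proposed by all users for one topic and
    collapse redundant entries using a manually curated synonym map."""
    gold: List[str] = []
    seen = set()
    for answers in user_answers:            # one list of up to 24 concepts per user
        for concept in answers:
            canonical = synonym_map.get(concept.strip().lower(),
                                        concept.strip().lower())
            if canonical not in seen:
                seen.add(canonical)
                gold.append(canonical)
    return gold

# Hypothetical usage for topic C042 (concept strings are illustrative only):
answers_user_1 = ["militares golpistas", "golpe militar", "restaurar la democracia"]
answers_user_2 = ["golpe de estado", "restaurar la democracia"]
synonyms = {"golpe de estado": "golpe militar"}  # stands in for the manual collapsing step
print(build_gold_standard([answers_user_1, answers_user_2], synonyms))
```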
<Paragraph position="2"> Besides identifying key concepts, users also filled in the following questionnaire: Were you familiar with the topic? Was it hard for you to elaborate the report? Did you miss the possibility of introducing annotations or rewriting parts of the report by hand? Do you consider that you generated a good report? Are you tired? Out of the answers provided by users, the most remarkable facts are that: only in 6% of the cases did the user miss &quot;a lot&quot; the possibility of rewriting or adding comments to the report. The fact that reports are made extractively did not seem to be a significant problem for our users.</Paragraph> <Paragraph position="3"> In 73% of the cases, the user was quite or very satisfied with his or her summary.</Paragraph> <Paragraph position="4"> These are indications that the practical constraints imposed on the task (time limit and extractive nature of the summaries) do not necessarily compromise the representativeness of the testbed. The time limit is very tight, but the temporal arrangement of documents and their highly redundant nature facilitate skipping repetitive material (some pieces of news are discarded just by looking at the title, without examining the content).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Generation of baseline reports </SectionTitle> <Paragraph position="0"> We have automatically generated baseline reports in two steps. In the first step, for every topic, we produced 30 tentative baseline reports using DUC-style criteria: - 18 summaries consist only of picking the first sentence of each document in 18 different document subsets.</Paragraph> <Paragraph position="1"> The subsets are formed using different strategies, e.g. the most relevant documents for the query (according to the Inquery search engine), one document per day, the first or last 50 documents in chronological order, etc.</Paragraph> <Paragraph position="2"> - The other 12 summaries consist of a) picking the first n sentences of a set of selected documents (with different values of n and different sets of documents) and b) taking the full content of a few documents. In both cases, document sets are formed with criteria similar to those above.</Paragraph> <Paragraph position="3"> In the second step, out of these 30 baseline reports, we selected the 10 reports that have the highest sentence overlap with the manual summaries.</Paragraph> <Paragraph position="4"> This second step increases the quality of the baselines, making the task of differentiating manual and baseline reports more challenging.</Paragraph>
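To make the selection step concrete, the sketch below (Python, not from the paper) scores each candidate baseline by the number of its sentences that also appear in some manual report and keeps the ten best candidates; this particular definition of &quot;sentence overlap&quot; is an assumption, since the text does not spell it out.

```python
from typing import List, Set

def overlap_score(candidate: Set[str], manual_reports: List[Set[str]]) -> int:
    """Number of candidate sentences that occur in at least one manual report
    (reports are modelled as sets of sentence identifiers)."""
    manual_sentences = set().union(*manual_reports) if manual_reports else set()
    return len(candidate & manual_sentences)

def select_baselines(candidates: List[Set[str]],
                     manual_reports: List[Set[str]], k: int = 10) -> List[Set[str]]:
    """Keep the k candidate baselines with the highest sentence overlap."""
    ranked = sorted(candidates,
                    key=lambda c: overlap_score(c, manual_reports),
                    reverse=True)
    return ranked[:k]
```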
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Comparison of similarity metrics </SectionTitle> <Paragraph position="0"> Formal aspects of a summary (or report), such as legibility, grammatical correctness, informativeness, etc., can only be evaluated manually. However, automatic evaluation metrics can play a useful role in evaluating how well the information from the original sources is preserved (Mani, 2001).</Paragraph> <Paragraph position="1"> Previous studies have shown that it is feasible to evaluate the output of summarization systems automatically (Lin and Hovy, 2003). The process is based on similarity metrics between texts. The first step is to establish a (manual) reference summary; the automatically generated summaries are then ranked according to their similarity to the reference summary.</Paragraph> <Paragraph position="2"> The challenge is, then, to define an appropriate proximity metric for reports generated in the information synthesis task.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 How to compare similarity metrics without human judgments? The QARLA estimation </SectionTitle> <Paragraph position="0"> In tasks such as Machine Translation and Summarization, the quality of a proximity metric is measured in terms of the correlation between the ranking produced by the metric and a reference ranking produced by human judges. An optimal similarity metric should produce the same ranking as human judges.</Paragraph> <Paragraph position="1"> In our case, acquiring human judgments about the quality of the baseline reports is too costly, and probably cannot be done reliably: a fine-grained evaluation of 50-sentence reports summarizing sets of 100 documents is a very complex task, which would probably produce different rankings from different judges.</Paragraph> <Paragraph position="2"> We believe there is a cheaper and more robust way of comparing similarity metrics without using human assessments. We assume a simple hypothesis: the best metric should be the one that best discriminates between manual and automatically generated reports. In other words, a similarity metric that cannot distinguish manual and automatic reports cannot be a good metric. Then, all we need is an estimation of how well a similarity metric separates manual and automatic reports. We propose to use the probability that, given any manual report Mref, any other manual report M is closer to Mref than any other automatic report A:</Paragraph> <Paragraph position="3"> QARLA(sim) = P( sim(M, Mref) > sim(A, Mref) )</Paragraph> <Paragraph position="4"> where M is the set of manually generated reports, A is the set of automatically generated reports, and &quot;sim&quot; is the similarity metric being evaluated. We refer to this value as the QARLA estimation.</Paragraph> <Paragraph position="5"> QARLA has two interesting features: No human assessments are needed to compute QARLA; only a set of manually produced summaries and a set of automatic summaries are needed for each topic considered. This reduces the cost of creating the testbed and, in addition, eliminates the possible bias introduced by human judges.</Paragraph> <Paragraph position="6"> It is easy to collect enough data to achieve statistically significant results. For instance, our testbed provides 720 combinations per topic to estimate the QARLA probability (we have nine manual plus ten automatic summaries per topic).</Paragraph> <Paragraph position="7"> A good QARLA value does not guarantee that a similarity metric will produce the same rankings as human judges, but a good similarity metric must have a good QARLA value: it is unlikely that a measure that cannot distinguish between manual and automatic summaries can still produce high-quality rankings of automatic summaries by comparison to manual reference summaries.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Similarity metrics </SectionTitle> <Paragraph position="0"> We have compared five different metrics using the QARLA estimation. The first three are meant as baselines; the fourth is the standard similarity metric used to evaluate summaries (ROUGE); and the last one, introduced in this paper, is based on the overlap of key concepts.</Paragraph> <Paragraph position="1"> The following metric estimates the similarity of two reports from the set of documents which are represented in both reports (i.e. at least one sentence in each report belongs to the document).</Paragraph> <Paragraph position="3"> where Mr is the reference report, M a second report, and Doc(Mr), Doc(M) are the documents to which the sentences in Mr and M belong.</Paragraph>
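To make the QARLA estimation concrete, the sketch below (Python, not from the paper) counts, over all (Mref, M, A) combinations, how often a manual report is closer to the manual reference than an automatic one; the document-overlap similarity given as an example assumes a particular normalisation, which is not specified in the text above.

```python
from itertools import permutations
from typing import Callable, Dict, List, Set

# A report is modelled as a mapping from document ids to the set of
# selected sentence ids (a hypothetical representation, for illustration).
Report = Dict[str, Set[int]]

def doc_overlap_sim(m_ref: Report, m: Report) -> float:
    """Document co-selection similarity (assumed normalisation: fraction of
    the reference report's documents that also contribute to m)."""
    docs_ref, docs_m = set(m_ref), set(m)
    if not docs_ref:
        return 0.0
    return len(docs_ref & docs_m) / len(docs_ref)

def qarla(sim: Callable[[Report, Report], float],
          manual: List[Report], automatic: List[Report]) -> float:
    """Probability that a manual report M is closer to a manual reference Mref
    than an automatic report A, over all combinations
    (9 x 8 x 10 = 720 per topic in the ISCORPUS setting)."""
    hits, total = 0, 0
    for m_ref, m in permutations(manual, 2):   # Mref and M are distinct manual reports
        for a in automatic:
            total += 1
            if sim(m, m_ref) > sim(a, m_ref):
                hits += 1
    return hits / total if total else 0.0
```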
<Paragraph position="4"> The more sentences in common between two reports, the more similar their content will be. We can measure Recall (how many sentences from the reference report are also in the contrastive report) and Precision (how many sentences from the contrastive report are also in the reference report):</Paragraph> <Paragraph position="5"> Recall(Mr, M) = |S(Mr) ∩ S(M)| / |S(Mr)| , Precision(Mr, M) = |S(Mr) ∩ S(M)| / |S(M)|</Paragraph> <Paragraph position="6"> where S(Mr) and S(M) are the sets of sentences in the reports Mr (reference) and M (contrastive).</Paragraph> <Paragraph position="7"> 3.2.3 Baseline 4: Perplexity. A language model is a probability distribution over word sequences obtained from some training corpora (see e.g. (Manning and Schutze, 1999)). Perplexity is a measure of the degree of surprise of a text or corpus given a language model. In our case, we build a language model LM(Mr) for the reference report Mr, and measure the perplexity of the contrastive report M with respect to that language model.</Paragraph> <Paragraph position="9"> We have used the Good-Turing discount algorithm to compute the language models (Clarkson and Rosenfeld, 1997). Note that this is also a baseline metric, because it only measures whether the content of the contrastive report is compatible with the reference report, but it does not consider coverage: a single sentence from the reference report will have a low perplexity, even if it covers only a small fraction of the whole report. This problem is mitigated by the fact that we are comparing reports of approximately the same size and without repeated sentences.</Paragraph> <Paragraph position="10"> The distance between two summaries can be established as a function of their vocabulary (unigrams) and how this vocabulary is used (n-grams). From this point of view, some of the measures used in the evaluation of Machine Translation systems, such as BLEU (Papineni et al., 2002), have been imported into the summarization task. BLEU is based on the precision and n-gram co-occurrence between an automatic translation and a reference manual translation. (Lin and Hovy, 2003) tried to apply BLEU as a measure to evaluate summaries, but the results were not as good as in Machine Translation. Indeed, some of the characteristics that define a good translation are not related to the features of a good summary; Lin and Hovy therefore proposed a recall-based variation of BLEU, known as ROUGE. The idea is the same: the quality of a proposed summary can be calculated as a function of the n-grams it has in common with the units of a model summary.</Paragraph> <Paragraph position="11"> The units can be sentences or discourse units:</Paragraph> <Paragraph position="12"> ROUGE-n = Σ_{u ∈ MU} Countm(u) / Σ_{u ∈ MU} Count(u)</Paragraph> <Paragraph position="13"> where MU is the set of model units, Countm is the maximum number of n-grams co-occurring in a peer summary and a model unit, and Count is the number of n-grams in the model unit. It has been established that unigram- and bigram-based metrics produce rankings of automatic summaries that are better (i.e. more similar to a human-produced ranking) than those based on n-grams with n > 2.</Paragraph> <Paragraph position="14"> For our experiment, we have only considered unigrams (lemmatized words, excluding stop words), which gives good results with standard summaries (Lin and Hovy, 2003).</Paragraph>
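The co-selection baselines and the unigram ROUGE configuration used here can be illustrated with the following sketch (Python, not from the paper); the tokenisation, the absence of real lemmatisation and the toy stop-word list are simplifying assumptions.

```python
from collections import Counter
from typing import List, Set

STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # toy list; the study used Spanish text

def sentence_recall(ref_sents: Set[str], contrastive_sents: Set[str]) -> float:
    """Fraction of reference sentences that also appear in the contrastive report."""
    return len(ref_sents & contrastive_sents) / len(ref_sents) if ref_sents else 0.0

def sentence_precision(ref_sents: Set[str], contrastive_sents: Set[str]) -> float:
    """Fraction of contrastive sentences that also appear in the reference report."""
    return len(ref_sents & contrastive_sents) / len(contrastive_sents) if contrastive_sents else 0.0

def unigrams(text: str) -> Counter:
    """Lower-cased content-word unigrams (lemmatization omitted for brevity)."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def rouge_1(model_units: List[str], peer_summary: str) -> float:
    """Unigram recall of the peer summary against the model units,
    following the Countm / Count scheme described above."""
    peer = unigrams(peer_summary)
    matched = total = 0
    for unit in model_units:
        grams = unigrams(unit)
        total += sum(grams.values())
        # clipped co-occurrence: a unigram counts at most as often as it appears in the peer
        matched += sum(min(c, peer[g]) for g, c in grams.items())
    return matched / total if total else 0.0
```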
<Paragraph position="15"> Two summaries generated by different subjects may differ in the documents that contribute to the summary, in the sentences that are chosen, and even in the information that they provide. In our Information Synthesis setting, where topics are complex and the number of documents to summarize is large, it is to be expected that similarity measures based on document, sentence or n-gram overlap do not give large similarity values between pairs of manually generated summaries.</Paragraph> <Paragraph position="16"> Our hypothesis is that two manual reports, even if they differ in their information content, will have the same (or very similar) key concepts; if this is true, comparing the key concepts of two reports can be a better similarity measure than the previous ones.</Paragraph> <Paragraph position="17"> In order to measure the overlap of key concepts between two reports, we create a vector kc(M) for every report, such that every element in the vector represents the frequency of a key concept in the report relative to the size of the report: kc(M)i = freq(Ci, M) / |words(M)|, where freq(Ci, M) is the number of times the key concept Ci appears in the report M, and |words(M)| is the number of words in the report.</Paragraph> <Paragraph position="18"> The key concept similarity NICOS (Nuclear Informative Concept Similarity) between two reports M and Mr can then be defined as the inverse of the Euclidean distance between their associated concept vectors:</Paragraph> <Paragraph position="19"> NICOS(M, Mr) = 1 / ||kc(M) - kc(Mr)||</Paragraph> <Paragraph position="20"> In our experiment, the dimensions of the kc vectors correspond to the list of key concepts provided by our test subjects (see Section 2.3). This list is our gold standard for every topic.</Paragraph>
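A minimal sketch of the key-concept vectors and the NICOS score is given below (Python, not from the paper); counting concept occurrences by simple substring matching and adding a small constant to keep the score finite for identical vectors are simplifying assumptions.

```python
import math
from typing import List

def kc_vector(report_text: str, key_concepts: List[str]) -> List[float]:
    """Frequency of each gold-standard key concept, normalised by report length."""
    words = report_text.lower().split()
    n_words = max(len(words), 1)
    text = " ".join(words)
    # naive substring counting; real matching would use lemmas and phrase boundaries
    return [text.count(concept.lower()) / n_words for concept in key_concepts]

def nicos(report_m: str, report_ref: str, key_concepts: List[str],
          eps: float = 1e-9) -> float:
    """Inverse Euclidean distance between the two key-concept vectors."""
    v_m = kc_vector(report_m, key_concepts)
    v_r = kc_vector(report_ref, key_concepts)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_m, v_r)))
    return 1.0 / (dist + eps)   # eps is an assumption, not part of the original definition
```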
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental results </SectionTitle> <Paragraph position="0"> Figure 1 shows, for every topic (horizontal axis), the QARLA estimation obtained for each similarity metric, i.e., the probability of a manual report being closer to another manual report than to an automatic report. Table 2 shows the average QARLA measure across all topics.</Paragraph> <Paragraph position="1"> For the six TT topics, the key concept similarity NICOS performs 43% better than ROUGE, and all baselines give poor results (all their QARLA probabilities are below chance, QARLA < 0.5). A non-parametric Wilcoxon sign test confirms that the difference between NICOS and ROUGE is highly significant (p < 0.005). This is an indication that the Information Synthesis task, as we have defined it, should not be studied as a standard summarization problem. It also confirms our hypothesis that key concepts tend to be stable across different users, and may help to generate the reports.</Paragraph> <Paragraph position="2"> The behavior of the two Information Extraction (IE) topics is substantially different from that of the TT topics. While the ROUGE measure remains stable (0.53 versus 0.54), the key concept similarity is much worse with IE topics (0.52 versus 0.77). On the other hand, all baselines improve, and some of them (SentenceSim precision and perplexity) give better results than both ROUGE and NICOS.</Paragraph> <Paragraph position="3"> Of course, no reliable conclusion can be obtained from only two IE topics. But the observed differences suggest that TT and IE may need different approaches, both to the automatic generation of reports and to their evaluation.</Paragraph> <Paragraph position="4"> One possible reason for this different behavior is that IE topics do not have a set of consistent key concepts; every case of a hunger strike, for instance, involves different people, organizations and places. The average number of different key concepts is 18.7 for TT topics and 28.5 for IE topics, a difference that reveals less agreement between subjects, supporting this argument.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Related work </SectionTitle> <Paragraph position="0"> Besides the measures included in our experiment, there are other criteria to compare summaries which could also be tested for Information Synthesis: Annotation of relevant sentences in a corpus.</Paragraph> <Paragraph position="1"> (Khandelwal et al., 2001) propose a task, called &quot;Temporal Summarization&quot;, that combines summarization and topic tracking. The paper describes the creation of an evaluation corpus in which the most relevant sentences in a set of related news stories were annotated. Summaries are evaluated with a measure called &quot;novel recall&quot;, based on the sentences selected by a summarization system and the sentences manually associated with events in the corpus. The agreement rate between subjects in the identification of key events and the sentence annotation does not correspond to the agreement between reports that we have obtained in our experiments. There are at least two reasons to explain this: first, (Khandelwal et al., 2001) work on an average of 43 documents, half the size of the topics in our corpus.</Paragraph> <Paragraph position="2"> Second, although both experiments work with topics, the information needs in our testbed are more complex (e.g. motivations for the invasion of Chechnya). Factoids. One of the problems in the evaluation of summaries is the versatility of human language: two different summaries may contain the same information expressed in different ways. In (Halteren and Teufel, 2003), the content of summaries is manually represented by decomposing sentences into factoids, or simple facts.</Paragraph> <Paragraph position="3"> They also annotate the composition, generalization and implication relations between extracted factoids. The resulting measure is different from unigram-based similarity. The main problem of factoids, as compared to other metrics, is that they require a costly manual processing of the summaries to be evaluated.</Paragraph> </Section> </Paper>