<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0906"> <Title>Evaluating Summaries and Answers: Two Sides of the Same Coin?</Title> <Section position="3" start_page="41" end_page="42" type="metho"> <SectionTitle> 2 Convergence of QA and Summarization </SectionTitle> <Paragraph position="0"> Question answering was initially conceived as essentially a fine-grained information retrieval task.</Paragraph> <Paragraph position="1"> Much research has focused on so-called factoid questions, which can typically be answered by named entities such as people, organizations, locations, etc. As an example, a system might return &quot;Bee Gees&quot; as the answer to the question &quot;What band did the music for the 1970's film 'Saturday Night Fever'?&quot;. For such well-specified information needs, question answering systems represent an improvement over traditional document retrieval systems because they do not require a user to manually browse through a ranked list of &quot;hits&quot;. Since 1999, the NIST-organized question answering tracks at TREC (see, for example, Voorhees 2003a) have served as a focal point of research in the field, providing an annual forum for evaluating systems developed by teams from all over the world. The model has been duplicated and elaborated on by CLEF in Europe and NTCIR in Asia, both of which have also introduced cross-lingual elements.</Paragraph> <Paragraph position="2"> Recently, research in question answering has shifted away from factoid questions to more complex information needs. This new direction can be characterized as a move towards answers that can only be arrived at through some form of reasoning and answers that require drawing information from multiple sources. Indeed, there are many types of questions that require integration of both capabilities: extracting raw information &quot;nuggets&quot; from potentially relevant documents, reasoning over these basic facts to draw additional inferences, and synthesizing an appropriate answer based on this knowledge. &quot;What is the role of the Libyan government in the Lockerbie bombing?&quot; is an example of such a complex question.</Paragraph> <Paragraph position="3"> Commonalities between the task of answering complex questions and summarizing multiple documents are evident when one considers broader research trends. Both tasks require the ability to draw together elements from multiple sources and cope with redundant, inconsistent, and contradictory information. Both tasks require extracting finer-grained (i.e., sub-document) segments, albeit based on different criteria. These observations point to the convergence of question answering and multi-document summarization.</Paragraph> <Paragraph position="4"> Complementary developments in the summarization community mirror the aforementioned shifts in question answering research. Most notably, the DUC 2005 task requires systems to generate answers to natural language questions based on a collection of known relevant documents: &quot;The system task in 2005 will be to synthesize from a set of 25-50 documents a brief, well-organized, fluent answer to a need for information that cannot be met by just stating a name, date, quantity, etc.&quot; (DUC 2005 guidelines). 
These guidelines were modeled after the information synthesis task suggested by Amigó et al. (2004), which they characterize as &quot;the process of (given a complex information need) extracting, organizing, and inter-relating the pieces of information contained in a set of relevant documents, in order to obtain a comprehensive, non-redundant report that satisfies the information need&quot;. One of the examples they provide, &quot;I'm looking for information concerning the history of text compression both before and with computers&quot;, looks remarkably like a user information need that current question answering systems aspire to satisfy. The idea of topic-oriented multi-document summarization is not new (Goldstein et al., 2000), but only recently have the connections to question answering become explicit. Incidentally, it appears that the current vision of question answering is more ambitious than the information synthesis task because in the former, the set of relevant documents is not known in advance, but must first be discovered within a larger corpus.</Paragraph> <Paragraph position="5"> There is, however, an important difference between question answering and topic-focused multi-document summarization: whereas summaries are compressible in length, the same cannot be said of answers.1 For question answering, it is difficult to fix the length of a response a priori: there may be cases where it is impossible to fit a coherent, complete answer into an allotted space. On the other hand, summaries are condensed representations of content, and should theoretically be expandable and compressible based on the level of detail desired.</Paragraph> <Paragraph position="6"> What are the implications, for system evaluations, of this convergence between question answering and multi-document summarization? We believe that the two fields have much to benefit from each other. In one direction, the question answering community currently lacks experience in automatically evaluating unstructured answers, which has been the focus of much research in document summarization.</Paragraph> <Paragraph position="7"> In the other direction, the question answering community, due to its roots in information retrieval, has a good grasp on the notions of relevance and topicality, which are critical to the assessment of topic-oriented summaries. In the next section, we present a case study in leveraging summarization evaluation techniques to automatically evaluate definition questions. Following that, we discuss how lessons from question answering (and more broadly, information retrieval) can be applied to assist in evaluating summarization systems.</Paragraph> </Section> <Section position="4" start_page="42" end_page="44" type="metho"> <SectionTitle> 3 Definition Questions: A Case Study </SectionTitle> <Paragraph position="0"> Definition questions represent complex information needs that involve integrating facts from multiple documents. A typical definition question is &quot;What is the Cassini space probe?&quot;, to which a system might respond with answers that include &quot;interplanetary probe to Saturn&quot;, &quot;carries the Huygens probe to study the atmosphere of Titan, Saturn's largest moon&quot;, and &quot;a joint project between NASA, ESA, and ASI&quot;. The goal of the task is to return as many interesting &quot;nuggets&quot; of information as possible about the target entity being defined (the Cassini space probe, in this case) while minimizing the amount of irrelevant information retrieved. 
In the two formal evaluations of definition questions that have been conducted at TREC (in 2003 and 2004), an information nugget is operationalized as a fact for which an assessor could make a binary decision as to whether a response contained that nugget (Voorhees, 2003b). Additionally, information nuggets are classified as either vital or okay. Vital nuggets represent facts central to the target entity, and should be present in a &quot;good&quot; definition. Okay nuggets contribute worthwhile information about the target, but are not essential. As an example, assessors' nuggets for the question &quot;Who is Aaron Copland?&quot; are shown in Table 1. The distinction between vital and okay nuggets is consequential for the score calculation, which we will discuss below.</Paragraph> <Paragraph position="1"> In the TREC setup, a system response to a definition question consists of an unordered set of answer strings, each paired with the identifier of the document from which it was extracted. Each of these answer strings is presumed to contain one or more information nuggets. Although there is no explicit limit on the length of each answer string or on the number of answer strings a system is allowed to return, verbosity is penalized, as we shall see below.</Paragraph> <Paragraph position="2"> To evaluate system output, NIST gathers answer strings from all participants, hides their association with the runs that produced them, and presents all answer strings to a human assessor. Using these responses and research performed during the original development of the question (with an off-the-shelf document retrieval system), the assessor creates an &quot;answer key&quot;; Table 1 shows the official answer key for the question &quot;Who is Aaron Copland?&quot;.</Paragraph> <Paragraph position="3"> After this answer key has been created, NIST assessors then go back over each run and manually judge whether or not each nugget is present in a particular system's response. Figure 1 shows a few examples of real system output and the nuggets that were found in them.</Paragraph> <Paragraph position="4"> The final score of a particular answer is computed as an F-measure, a weighted harmonic mean of nugget precision and recall. The β parameter controls the relative importance of precision and recall, and is heavily biased towards the latter to model the nature of the task. Nugget recall is calculated solely as a function of the vital nuggets, which means that a system receives no &quot;credit&quot; (in terms of recall) for returning okay nuggets. Nugget precision is approximated by a length allowance based on the number of vital and okay nuggets returned; a response longer than the allowed length is subjected to a verbosity penalty. Using answer length as a proxy for precision appears to be a reasonable compromise because a pilot study demonstrated that it was impossible for humans to consistently enumerate the total number of nuggets in a response, a necessary step in calculating nugget precision (Voorhees, 2003b).</Paragraph>
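As a concrete illustration, the following Python sketch implements the scoring as we read it from Voorhees (2003b): recall is computed over vital nuggets only, precision is approximated through a length allowance of 100 non-whitespace characters for every nugget (vital or okay) returned, and the β parameter heavily favors recall (the 2003 evaluation used β = 5). The function and its interface are our own illustration, not the official scoring script.

def nugget_f_score(num_vital_returned, num_okay_returned, num_vital_in_key,
                   answer_length, beta=5.0, allowance_per_nugget=100):
    """Sketch of the TREC definition-question F-measure (after Voorhees, 2003b).

    answer_length: total non-whitespace characters in the system response.
    beta: recall bias; the TREC 2003 evaluation used beta = 5.
    """
    # Recall counts only vital nuggets; okay nuggets earn no recall credit.
    recall = num_vital_returned / num_vital_in_key if num_vital_in_key else 0.0

    # Precision is approximated by a length allowance: 100 non-whitespace
    # characters for every vital or okay nugget returned.
    allowance = allowance_per_nugget * (num_vital_returned + num_okay_returned)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length

    # Recall-weighted F-measure.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

For example, a 450-character response containing two of four vital nuggets and one okay nugget scores nugget_f_score(2, 1, 4, 450) ≈ 0.50: recall is 0.5, the allowance is 300 characters, precision is about 0.67, and recall dominates the combination.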
<Paragraph position="5"> The current TREC setup for evaluating definition questions necessitates having a human &quot;in the loop&quot;. Even though answer keys are available for questions from previous years, determining whether a nugget was actually retrieved by a system currently requires human judgment. Without a fully automated evaluation method, it is difficult to consistently and reproducibly assess the performance of a system outside the annual TREC cycle. Thus, researchers cannot carry out controlled laboratory experiments to rapidly explore the solution space. In many other fields in computational linguistics, the ability to conduct evaluations with quick turnaround has led to rapid progress in the state of the art. Question answering for definition questions appears to be missing this critical ingredient.</Paragraph> <Paragraph position="6"> To address this evaluation gap, we have recently developed POURPRE, a method for automatically evaluating definition questions based on idf-weighted unigram co-occurrences (Lin and Demner-Fushman, 2005). This idea of employing n-gram co-occurrence statistics to score the output of a computer system against one or more desired reference outputs has its roots in the BLEU metric for machine translation (Papineni et al., 2002) and the ROUGE metric for summarization (Lin and Hovy, 2003). Note that metrics for automatically evaluating definitions should be, like metrics for evaluating summaries, biased towards recall. Fluency (i.e., precision) is not usually of concern because most systems employ extractive techniques to produce answers. Our study reports good correlation between the automatically computed POURPRE metric and official TREC system rankings. This measure will hopefully spur progress in definition question answering systems.</Paragraph>
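To illustrate the flavor of the approach, here is a simplified Python sketch (not the official POURPRE implementation, whose details appear in Lin and Demner-Fushman (2005)): each reference nugget receives a soft match score equal to its best idf-weighted unigram overlap with any single answer string, and these soft scores can then stand in for the binary assessor judgments in the nugget F-measure sketched earlier. All helper names and the choice of idf source are our own assumptions.

import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(background_documents):
    # Inverse document frequency estimated over some background collection;
    # any reasonably large corpus will do for this sketch.
    n = len(background_documents)
    df = Counter()
    for doc in background_documents:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / df[term]) for term in df}

def nugget_match(nugget, answer_strings, idf=None):
    # Soft match score in [0, 1]: the best idf-weighted unigram overlap
    # between the nugget description and any single answer string.
    terms = tokenize(nugget)
    weight = (lambda t: idf.get(t, 1.0)) if idf else (lambda t: 1.0)
    total = sum(weight(t) for t in terms)
    if total == 0.0:
        return 0.0
    best = 0.0
    for answer in answer_strings:
        answer_terms = set(tokenize(answer))
        matched = sum(weight(t) for t in terms if t in answer_terms)
        best = max(best, matched / total)
    return best

Summing such soft match scores over the vital nuggets in the answer key, in place of the binary counts produced by human assessors, yields an automatic approximation of nugget recall; the metric itself is then validated by checking how well the system rankings it induces correlate with the official rankings.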
<Paragraph position="7"> The development of automatic evaluation metrics based on n-gram co-occurrence for question answering is an example of successful knowledge transfer from summarization to question answering evaluation. We believe that there exist many more opportunities for future exploration; as an example, there are remarkable similarities between information nuggets in definition question answering and recently proposed methods for assessing summaries based on fine-grained semantic units (Teufel and van Halteren, 2004; Nenkova and Passonneau, 2004).</Paragraph> <Paragraph position="8"> Another promising direction of research in definition question answering involves applying the Pyramid Method (Nenkova and Passonneau, 2004) to better model the vital/okay nugget distinction. As it currently stands, the vital/okay dichotomy is troublesome because there is no way to operationalize such a classification scheme within a system; see Hildebrandt et al. (2004) for more discussion. Yet, the effects on score are significant: a system that returns, for example, all the okay nuggets but none of the vital nuggets would receive a score of zero. In truth, the vital/okay distinction is a poor attempt at modeling the fact that some nuggets about a target are more important than others; this is exactly what the Pyramid Method is designed to capture. &quot;Building pyramids&quot; for definition questions is an avenue of research that we are currently pursuing.</Paragraph> <Paragraph position="9"> In the next section, we discuss opportunities for knowledge transfer in the other direction; i.e., how summarization evaluation can benefit from work in question answering evaluation.</Paragraph> </Section> <Section position="5" start_page="44" end_page="46" type="metho"> <SectionTitle> 4 Putting the Relevance in Summarization </SectionTitle> <Paragraph position="0"> The definition of a meaningful extrinsic evaluation metric (e.g., a task-based measure) is an issue that the summarization community has long grappled with (Mani et al., 2002). This issue has been one of the driving factors towards summaries that are specifically responsive to complex information needs. The evaluation of such summaries hinges on the notions of relevance and topicality, two themes that have received much research attention in the information retrieval community, from which question answering evolved.</Paragraph> <Paragraph position="1"> Debates about the nature of relevance are almost as old as the field of information retrieval itself (Cooper, 1971; Saracevic, 1975; Harter, 1992; Barry and Schamber, 1998; Mizzaro, 1998; Spink and Greisdorf, 2001). Theoretical discussions aside, there is evidence suggesting that there exist substantial inter-assessor differences in document-level relevance judgments (Voorhees, 2000; Voorhees, 2002); in the TREC ad hoc tracks, for example, overlap between two humans can be less than 50%.</Paragraph> <Paragraph position="2"> For factoid question answering, it has also been shown that the notion of answer correctness is less well-defined than one would expect (Voorhees and Tice, 2000; Lin and Katz, 2005, in press). This inescapable fact about the nature of information needs represents a fundamental philosophical difference between research in information retrieval and computational linguistics. Information retrieval researchers accept the fact that the notion of &quot;ground truth&quot; is not particularly meaningful, and any prescriptive attempt to dictate otherwise would result in brittle and overtrained systems of limited value. A retrieval system must be sensitive to the inevitable variations in relevance exhibited by different users.</Paragraph> <Paragraph position="3"> This philosophy contrasts with computational linguistics research, where ground truth does in fact exist. For example, there is a single correct parse of a natural language sentence (modulo truly ambiguous sentences), there is the notion of a correct word sense (modulo granularity issues), etc.</Paragraph> <Paragraph position="4"> This view also pervades evaluation in machine translation and document summarization, and is implicitly codified in intrinsic metrics, except that there is now the notion of multiple correct answers (i.e., the reference texts).</Paragraph> <Paragraph position="5"> Faced with the inevitability of variations in humans' notion of relevance, how can information retrieval researchers confidently draw conclusions about system performance and the effectiveness of various techniques? Meta-evaluations have shown that while some measures such as recall are relatively meaningless in absolute terms (e.g., the total number of relevant documents cannot be known without exhaustive assessment of the entire corpus, which is impractical for current document collections), relative comparisons between systems are remarkably stable. That is, if system A performs better than system B (by a metric such as mean average precision, for example), system A is highly likely to outperform system B with any alternative set of relevance judgments representing a different notion of relevance (Voorhees, 2000; Voorhees, 2002).</Paragraph> <Paragraph position="6"> Thus, it remains possible to determine the relative effectiveness of different retrieval techniques, and to use evaluation results to guide system development.</Paragraph> <Paragraph position="7"> We believe that this philosophical starting point for conducting evaluations is an important point that summarization researchers should take to heart, considering that notions such as relevance and topicality are central to the evaluation of the information synthesis task. What are the concrete implications of this view? 
We outline some thoughts below: First, we believe that summarization metrics should embrace variations in human judgment as an inescapable part of the evaluation process. Measures for automatically assessing the quality of a system's output, such as ROUGE, implicitly assume that the &quot;best summary&quot; is a statistical agglomeration of the reference summaries, which is not likely to be true. Until recently, ROUGE &quot;hard-coded&quot; the so-called &quot;jackknifing&quot; procedure to estimate average human performance. Fortunately, it appears researchers have realized that &quot;model averaging&quot; may not be the best way to capture the existence of many &quot;equally good&quot; summaries. As an example, the Pyramid Method (Nenkova and Passonneau, 2004) represents a good first attempt at a realistic model of human variation.</Paragraph> <Paragraph position="8"> Second, the view that variations in judgment are an inescapable part of extrinsic evaluations would lead one to conclude that low inter-annotator agreement is not necessarily bad. Computational linguistics research generally attaches great value to high kappa measures (Carletta, 1996), which indicate high human agreement on a particular task. Low agreement is seen as a barrier to conducting reproducible research and to drawing generalizable conclusions. However, this is not necessarily true: low agreement in information retrieval has not been a handicap for advancing the state of the art. When dealing with notions such as relevance, low kappa values can most likely be attributed to the nature of the task itself. Attempting to raise agreement by, for example, developing rigid assessment guidelines may do more harm than good. Prescriptive attempts to define what a good answer or summary should be will lead to systems that are not useful in real-world settings. Instead, we should focus research on adaptable, flexible systems.</Paragraph> <Paragraph position="9"> Third, meta-evaluations are important. The information retrieval literature has an established tradition of evaluating evaluations post hoc to ensure the reliability and fairness of the results. The aforementioned studies examining the impact of different relevance judgments are examples of such work. Due to the variability in human judgments, systems are essentially aiming at a moving target, which necessitates continual examination of whether evaluations are accurately answering the research questions and producing trustworthy results.</Paragraph> <Paragraph position="10"> Fourth, a measure for assessing the quality of automatic scoring metrics should reflect the philosophical starting points that we have been discussing.</Paragraph> <Paragraph position="11"> As a specific example, the correlation between an automatically calculated metric and actual human preferences is better quantified by Kendall's τ than by the coefficient of determination R². Since relative system comparisons are more meaningful than absolute scores, we are generally less interested in correlations among the scores themselves than in the rankings of systems produced by those scores. Kendall's τ computes the &quot;distance&quot; between two rankings as the minimum number of pairwise adjacent swaps necessary to convert one ranking into the other. This value is normalized by the number of item pairs such that two identical rankings produce a correlation of 1.0; the correlation between a ranking and its perfect inverse is -1.0; and the expected correlation of two rankings chosen at random is 0.0. 
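A minimal Python sketch of this rank correlation, assuming two complete rankings over the same set of systems with no ties (the interface is illustrative, not the evaluation script actually used):

from itertools import combinations

def kendall_tau(ranks_a, ranks_b):
    # ranks_a, ranks_b: dicts mapping system id -> rank (1 = best), no ties.
    systems = list(ranks_a)
    assert set(systems) == set(ranks_b), "rankings must cover the same systems"
    pairs = list(combinations(systems, 2))
    assert pairs, "need at least two systems"
    # A pair is discordant if the two rankings order it differently; the number
    # of discordant pairs equals the minimum number of adjacent swaps needed
    # to turn one ranking into the other.
    discordant = sum(
        1 for x, y in pairs
        if (ranks_a[x] - ranks_a[y]) * (ranks_b[x] - ranks_b[y]) < 0
    )
    return 1.0 - 2.0 * discordant / len(pairs)

For instance, comparing an official ranking {"A": 1, "B": 2, "C": 3} against a metric-induced ranking {"A": 1, "B": 3, "C": 2} gives one discordant pair out of three, hence τ = 1 - 2/3 ≈ 0.33.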
Typically, a τ value greater than 0.8 is considered &quot;good&quot;, although 0.9 represents a threshold researchers generally aim for.</Paragraph> </Section> </Paper>