<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1049">
<Title>Will Pyramids Built of Nuggets Topple Over?</Title>
<Section position="4" start_page="384" end_page="384" type="metho">
<SectionTitle> 3 What's Vital? What's Okay? </SectionTitle>
<Paragraph position="0"> Previously, we have argued that the vital/okay distinction is a source of instability in the nugget-based evaluation methodology, especially given the manner in which F-score is calculated (Hildebrandt et al., 2004; Lin and Demner-Fushman, 2005a).</Paragraph>
<Paragraph position="1"> Since only vital nuggets figure into the calculation of nugget recall, there is a large "quantization effect" on system scores for topics that have few vital nuggets. For example, on a question that has only one vital nugget, a system cannot obtain a non-zero score unless that vital nugget is retrieved. In reality, whether or not a system returned a passage containing that single vital nugget is often a matter of luck, which is compounded by assessor judgment errors.</Paragraph>
<Paragraph position="2"> Furthermore, there do not appear to be any reliable indicators for predicting the importance of a nugget, which makes the task of developing systems even more challenging.</Paragraph>
<Paragraph position="3"> The polarizing effect of the vital/okay distinction calls into question the stability of TREC evaluations. Table 2 shows statistics about the number of questions that have only one or two vital nuggets.</Paragraph>
<Paragraph position="4"> Compared to the size of the test set, these numbers are relatively large. As a concrete example, "F16" is the target for question 71.7 from TREC 2005; the only vital nugget is "First F16s built in 1974". The practical effect of the vital/okay distinction in its current form is the number of questions for which the median system score across all submitted runs is zero: 22 in TREC 2003, 41 in TREC 2004, and 44 in TREC 2005.</Paragraph>
<Paragraph position="5"> An evaluation in which the median score for many questions is zero has many shortcomings. For one, it is difficult to tell whether a particular run is "better" than another--even though the two may differ greatly in other salient properties, such as length. The discriminative power of the present F-score measure is called into question: are present systems really that bad, or is the current scoring model insufficient to discriminate between different (poorly performing) systems? Also, as pointed out by Voorhees (2005), a score distribution heavily skewed towards zero makes meta-analysis of evaluation stability hard to perform: since such studies depend on variability in scores, evaluations would appear more stable than they really are.</Paragraph>
<Paragraph position="6"> While there are obvious shortcomings to the current scheme of labeling nuggets as either "vital" or "okay", the distinction does begin to capture the intuition that "not all nuggets are created equal". Some nuggets are inherently more important than others, and this should be reflected in the evaluation methodology. The solution, we believe, is to solicit judgments from multiple assessors and develop a more refined sense of nugget importance. However, given finite resources, it is important to balance the amount of additional manual effort required against the gains derived from that effort. We present the idea of building "nugget pyramids", which addresses the shortcomings noted here, and then assess the implications of this new scoring model against data from TREC 2003, 2004, and 2005.</Paragraph>
</Section>
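To make the quantization effect described above concrete, here is a minimal sketch of the existing vital-only scoring model (the 100 non-whitespace-character allowance per matched nugget and the beta = 3 setting are the standard TREC conventions discussed later in this paper; the function name and the example numbers are purely illustrative, not the official scoring code):

```python
def vital_only_fscore(vital_matched, vital_total, nuggets_matched, length, beta=3):
    """Sketch of the vital-only nugget F-score (hypothetical helper).

    vital_matched:   vital nuggets found in the system response
    vital_total:     vital nuggets in the answer key
    nuggets_matched: all matched nuggets (vital or okay), each earning a
                     100 non-whitespace-character length allowance
    length:          response length in non-whitespace characters
    """
    recall = vital_matched / vital_total if vital_total else 0.0
    allowance = 100 * nuggets_matched
    precision = 1.0 if length <= allowance else 1.0 - (length - allowance) / length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# A topic with a single vital nugget: miss it and the F-score is zero,
# no matter how many okay nuggets the response contains.
print(vital_only_fscore(0, 1, nuggets_matched=3, length=250))  # 0.0
print(vital_only_fscore(1, 1, nuggets_matched=3, length=250))  # 1.0
```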
<Section position="5" start_page="384" end_page="385" type="metho">
<SectionTitle> 4 Building Nugget Pyramids </SectionTitle>
<Paragraph position="0"> As previously pointed out (Lin and Demner-Fushman, 2005b), the question answering and summarization communities are converging on the task of addressing complex information needs from complementary perspectives; see, for example, the recent DUC task of query-focused multi-document summarization (Amigó et al., 2004; Dang, 2005).</Paragraph>
<Paragraph position="1"> From an evaluation point of view, this provides opportunities for cross-fertilization and the exchange of fresh ideas. As an example of this intellectual discourse, the recently-developed POURPRE metric for automatically evaluating answers to complex questions (Lin and Demner-Fushman, 2005a) employs n-gram overlap to compare system responses to reference output, an idea originally implemented in the ROUGE metric for summarization evaluation (Lin and Hovy, 2003). Drawing additional inspiration from research on summarization evaluation, we adapt the pyramid evaluation scheme (Nenkova and Passonneau, 2004) to address the shortcomings of the vital/okay distinction in the nugget-based evaluation methodology.</Paragraph>
<Paragraph position="2"> The basic intuition behind the pyramid scheme (Nenkova and Passonneau, 2004) is simple: the importance of a fact is directly related to the number of people who recognize it as such (i.e., its popularity). The evaluation methodology calls for assessors to annotate Semantic Content Units (SCUs) found within model reference summaries. The weight assigned to an SCU is equal to the number of annotators who have marked that particular unit. These SCUs can be arranged in a pyramid, with the highest-scoring elements at the top: a "good" summary should contain SCUs from a higher tier of the pyramid before a lower tier, since such elements are deemed "more vital".</Paragraph>
<Paragraph position="3"> This pyramid scheme can be easily adapted for question answering evaluation, since a nugget is roughly comparable to a Semantic Content Unit.</Paragraph>
<Paragraph position="4"> We propose to build nugget pyramids for answers to complex questions by soliciting vital/okay judgments from multiple assessors, i.e., taking the original reference nuggets and asking different humans to classify each as either "vital" or "okay". The weight assigned to each nugget is simply the number of different assessors who deemed it vital. We then normalize the nugget weights (per question) so that the maximum possible weight is one, by dividing each nugget weight by the maximum weight for that particular question. Therefore, a nugget assigned "vital" by the most assessors (not necessarily all of them) receives a weight of one. (Since there may be multiple nuggets with the highest weight, what we are building is sometimes actually a frustum.) The introduction of a more granular notion of nugget importance should be reflected in the calculation of F-score. We propose that nugget recall be modified to take nugget weight into account:</Paragraph>
<Paragraph position="5"> \[ R = \frac{\sum_{m \in A} w_m}{\sum_{n \in V} w_n} \] </Paragraph>
<Paragraph position="6"> where A is the set of reference nuggets that are matched within a system's response, V is the set of all reference nuggets, and w_m and w_n are the weights of nuggets m and n, respectively. Instead of a binary distinction based solely on matching vital nuggets, all nuggets now factor into the calculation of recall, subject to their weights. Note that this new scoring model captures the existing binary vital/okay distinction as a special case: vital nuggets receive a weight of one, and okay nuggets a weight of zero.</Paragraph>
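The following is a minimal sketch of the pyramid weighting and the modified recall defined above; the helper names and the toy judgment matrix are hypothetical, not part of any official evaluation tools:

```python
def pyramid_weights(judgments):
    """judgments[a][n] = 1 if assessor a marked nugget n as vital, else 0.
    Weights are vital-vote counts, normalized per question so that the
    most-voted nugget receives weight 1.0."""
    num_nuggets = len(judgments[0])
    votes = [sum(assessor[n] for assessor in judgments) for n in range(num_nuggets)]
    top = max(votes) or 1  # guard against a question with no vital votes
    return [v / top for v in votes]

def weighted_recall(matched, weights):
    """matched: indices of reference nuggets found in the system response."""
    return sum(weights[n] for n in matched) / sum(weights)

# Toy example: three assessors, four reference nuggets.
judgments = [
    [1, 1, 0, 0],  # assessor 1 marks nuggets 0 and 1 as vital
    [1, 0, 0, 1],  # assessor 2
    [1, 0, 0, 0],  # assessor 3
]
weights = pyramid_weights(judgments)                     # [1.0, 0.33, 0.0, 0.33]
print(weighted_recall(matched={0, 3}, weights=weights))  # 0.8
```

With a single assessor, these weights reduce to the original binary scheme: vital nuggets weigh one and okay nuggets weigh zero.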
<Paragraph position="7"> We propose to leave the calculation of nugget precision unchanged: a system receives a length allowance of 100 non-whitespace characters for every nugget it retrieves (regardless of importance), and longer answers are penalized for verbosity.</Paragraph>
<Paragraph position="9"> Having outlined our revisions to the standard nugget-based scoring method, we now describe our methodology for evaluating this new model and demonstrate how it overcomes many of the shortcomings of the existing paradigm.</Paragraph>
</Section>
<Section position="6" start_page="385" end_page="386" type="metho">
<SectionTitle> 5 Evaluation Methodology </SectionTitle>
<Paragraph position="0"> We evaluate our methodology for building "nugget pyramids" using runs submitted to the TREC 2003, 2004, and 2005 question answering tracks (2003 "definition" questions; 2004 and 2005 "other" questions). There were 50 questions in the 2003 test set, 64 in 2004, and 75 in 2005. In total, there were 54 runs submitted to TREC 2003, 63 to TREC 2004, and 72 to TREC 2005. NIST assessors manually annotated the nuggets found in each system's response, which allows us to calculate final F-scores under different scoring models.</Paragraph>
<Paragraph position="1"> We recruited a total of nine different assessors for this study. Assessors consisted of graduate students in library and information science and computer science at the University of Maryland, as well as volunteers from the question answering community (obtained via a posting to NIST's TREC QA mailing list). Each assessor was given the reference nuggets along with the original questions and asked to classify each nugget as vital or okay. They were purposely asked to make these judgments without reference to documents in the corpus in order to expedite the assessment process--our goal is to propose a refinement to the current nugget evaluation methodology that addresses its shortcomings while minimizing the amount of additional effort required. Combined with the answer key created by the original NIST assessors, we obtained a total of ten judgments for every single nugget in the three test sets. We measured the correlation between system rankings generated by different scoring models using Kendall's τ, a commonly-used rank correlation measure in information retrieval for quantifying the similarity between different scoring methods. Kendall's τ computes the "distance" between two rankings as the minimum number of pairwise adjacent swaps necessary to convert one ranking into the other. This value is normalized by the number of items being ranked such that two identical rankings produce a correlation of 1.0, the correlation between a ranking and its perfect inverse is -1.0, and the expected correlation of two rankings chosen at random is 0.0. Typically, a value greater than 0.8 is considered "good", although 0.9 represents the threshold researchers generally aim for.</Paragraph>
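A small sketch of the τ computation characterized above, counting discordant pairs (each of which corresponds to one of the adjacent swaps needed to turn one ranking into the other); in practice a library routine such as scipy.stats.kendalltau could be used instead, and the run names below are purely illustrative:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping each run to its rank under two
    scoring models (ties are ignored in this sketch)."""
    runs = list(rank_a)
    n_pairs = len(runs) * (len(runs) - 1) / 2
    discordant = sum(
        1 for x, y in combinations(runs, 2)
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
    )
    return 1.0 - 2.0 * discordant / n_pairs

official = {"runA": 1, "runB": 2, "runC": 3}
inverse = {"runA": 3, "runB": 2, "runC": 1}
print(kendall_tau(official, official))  # identical rankings -> 1.0
print(kendall_tau(official, inverse))   # perfect inverse    -> -1.0
```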
<Paragraph position="2"> We hypothesized that system rankings are relatively unstable with respect to individual assessors' judgments; that is, how well a given system scores depends to a large extent on which assessor's judgments are used for evaluation. This stems from an inescapable fact of such evaluations, well known from studies of relevance in the information retrieval literature (Voorhees, 1998): humans have legitimate differences of opinion regarding a nugget's importance, and there is no such thing as "the correct answer". However, we also hypothesized that these variations can be smoothed out by building "nugget pyramids" in the manner we have described. Nugget weights reflect the combined judgments of many individual assessors, and scores generated with these weights taken into account should correlate better with each individual assessor's opinion.</Paragraph>
</Section>
<Section position="7" start_page="386" end_page="388" type="metho">
<SectionTitle> 6 Results </SectionTitle>
<Paragraph position="0"> To verify our hypothesis about the instability of using any individual assessor's judgments, we calculated the Kendall's τ correlation between system scores generated using the "official" vital/okay judgments (provided by NIST assessors) and each individual assessor's judgments. This is shown in Table 3.</Paragraph>
<Paragraph position="1"> The original NIST judgments are listed as "assessor 0" (and are not included in the averages). For all scoring models discussed in this paper, we set β, the parameter that controls the relative importance of precision and recall, to three. Results show that although the official rankings generally correlate well with the rankings generated by our nine additional assessors, the agreement is far from perfect. Yet, in reality, the opinions of our nine assessors are no less valid than those of the NIST assessors--NIST does not occupy a privileged position on what constitutes a good "definition". We can see that variations in human judgments do not appear to be adequately captured by the current scoring model.</Paragraph>
<Paragraph position="2"> Table 3 also shows the number of questions for which systems' median score was zero based on each individual assessor's judgments (out of 50 questions for TREC 2003, 64 for TREC 2004, and 75 for TREC 2005). These numbers are worrisome: in TREC 2004, for example, over half the questions (on average) have a median score of zero, and over three quarters of the questions do according to assessor 9. This is problematic for the various reasons discussed in Section 3.</Paragraph>
<Paragraph position="4"> To evaluate scoring models that combine the opinions of multiple assessors, we built "nugget pyramids" using all ten sets of judgments in the manner outlined in Section 4. All runs submitted to each of the TREC evaluations were then rescored using the modified F-score formula, which takes into account this finer-grained notion of nugget importance. Rankings generated by this model were then compared against those generated by each individual assessor's judgments.</Paragraph>
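Schematically, the comparison just described could be sketched as follows, assuming per-run F-scores have already been computed under the pyramid model and under each assessor's individual judgments (the function and argument names are hypothetical; scipy's kendalltau stands in for the τ computation):

```python
from statistics import mean
from scipy.stats import kendalltau

def average_agreement(pyramid_scores, per_assessor_scores):
    """Average Kendall's tau between the ranking induced by pyramid-based
    F-scores and the ranking induced by each assessor's own judgments.

    pyramid_scores:      {run_id: F-score under the ten-assessor pyramid}
    per_assessor_scores: {assessor_id: {run_id: F-score using only that
                          assessor's vital/okay labels}}
    """
    runs = sorted(pyramid_scores)
    pyramid = [pyramid_scores[r] for r in runs]
    taus = []
    for scores in per_assessor_scores.values():
        tau, _ = kendalltau(pyramid, [scores[r] for r in runs])
        taus.append(tau)
    return mean(taus)
```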
<Paragraph position="6"> Results are shown in Table 4. As can be seen, the correlations observed are higher than those in Table 3, meaning that a nugget pyramid better captures the opinions of each individual assessor. A two-tailed t-test reveals that the differences in the averages are statistically significant (p << 0.01 for TREC 2003/2005, p < 0.05 for TREC 2004).</Paragraph>
<Paragraph position="7"> What is the effect of combining judgments from different numbers of assessors? To answer this question, we built ten different nugget pyramids of varying "sizes", i.e., combining judgments from one through ten assessors. The Kendall's τ correlations between scores generated by each of these pyramids and scores generated by each individual assessor's judgments were computed. For each pyramid, we then computed the average across all rank correlations, which captures the extent to which that particular pyramid represents the opinions of all ten assessors.</Paragraph>
<Paragraph position="9"> These results are shown in Figure 2. The increase in Kendall's τ that comes from adding a second assessor is statistically significant, as revealed by a two-tailed t-test (p << 0.01 for TREC 2003/2005, p < 0.05 for TREC 2004), but ANOVA reveals no statistically significant differences beyond two assessors. From these results, we conclude that adding a second assessor yields a scoring model that is significantly better at capturing the variance in human relevance judgments, but that in this respect little is gained beyond two assessors. If this were the only advantage provided by nugget pyramids, the boost in rank correlations might not be sufficient to justify the extra manual effort involved in building them. As we shall see, however, nugget pyramids offer other benefits as well.</Paragraph>
<Paragraph position="10"> Evaluation with our nugget pyramids greatly reduces the number of questions whose median score is zero. As previously discussed, a strict vital/okay split translates into a score of zero for systems that do not return any vital nuggets. Nugget pyramids, however, reflect a more refined sense of nugget importance, which results in fewer zero scores. Figure 3 shows the number of questions whose median score is zero (normalized as a fraction of the entire test set) for nugget pyramids built from varying numbers of assessors. With four or more assessors, the number of questions whose median is zero for the TREC 2003 test set drops to 17; for TREC 2004, it drops to 23 with seven or more assessors; for TREC 2005, to 27 with nine or more assessors. In other words, F-scores generated using our methodology are far more discriminative. The remaining questions with zero medians, we believe, accurately reflect the state of the art in question answering performance.</Paragraph>
<Paragraph position="11"> An example of a nugget pyramid that combines the opinions of all ten assessors is shown in Table 5 for the target "AARP"; judgments from the original NIST assessors are also shown (cf. Table 1).
Note that there is a strong correlation between the original vital/okay judgments and the refined nugget weights based on the pyramid, indicating that (in this case, at least) the intuition of the NIST assessor matches that of the other assessors.</Paragraph>
</Section>
<Section position="8" start_page="388" end_page="389" type="metho">
<SectionTitle> 7 Discussion </SectionTitle>
<Paragraph position="0"> In balancing the tradeoff between the advantages provided by nugget pyramids and the additional manual effort necessary to create them, what is the optimal number of assessors from whom to solicit judgments? The results shown in Figures 2 and 3 provide some answers. In terms of better capturing different assessors' opinions, little appears to be gained from going beyond two assessors. However, adding more judgments does decrease the number of questions whose median score is zero, resulting in a more discriminative metric. Beyond five assessors, the number of questions with a zero median score remains relatively stable. We believe that around five assessors yield the smallest nugget pyramid that confers the advantages of the methodology.</Paragraph>
<Paragraph position="2"> The idea of building "nugget pyramids" is an extension of a similarly-named evaluation scheme in document summarization, although there are important differences. Nenkova and Passonneau (2004) call for multiple assessors to annotate SCUs, which is much more involved than the methodology presented here, where the nuggets are fixed and assessors only provide additional judgments about their importance. This has the advantage of streamlining the assessment process, but it also has the potential to miss important nuggets that were not identified in the first place. Our experimental results, however, suggest that this is a worthwhile tradeoff.</Paragraph>
<Paragraph position="3"> The explicit goal of this work was to develop scoring models for nugget-based evaluation that address the shortcomings of the present approach while introducing minimal overhead in terms of additional resource requirements. To this end, we have been successful.</Paragraph>
<Paragraph position="4"> Nevertheless, there are a number of issues worth mentioning. To speed up the assessment process, assessors were instructed to provide "snap judgments" given only the list of nuggets and the target. No additional context was provided, e.g., documents from the corpus or sample system responses.</Paragraph>
<Paragraph position="5"> It is also important to note that the reference nuggets were never meant to be read by other people--NIST makes no claim that they are well-formed descriptions of the facts themselves. These answer keys were primarily note-taking devices to assist in the assessment process. The important question, however, is whether the scoring variations caused by poorly-phrased nuggets are smaller than the variations caused by legitimate inter-assessor disagreement regarding nugget importance. Our experiments suggest that, overall, the nugget pyramid scheme is sound and can adequately cope with these difficulties.</Paragraph>
</Section>
</Paper>