<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1049"> <Title>Will Pyramids Built of Nuggets Topple Over?</Title> <Section position="3" start_page="383" end_page="384" type="intro"> <SectionTitle> 2 Evaluation of Complex Questions </SectionTitle> <Paragraph position="0"> To date, NIST has conducted three large-scale evaluations of complex questions using a nugget-based evaluation methodology: &quot;definition&quot; questions in TREC 2003, &quot;other&quot; questions in TREC 2004 and TREC 2005, and &quot;relationship&quot; questions in TREC 2005. Since relatively few teams participated in the 2005 evaluation of &quot;relationship&quot; questions, this work focuses on the three years' worth of &quot;definition/other&quot; questions. The nugget-based paradigm has been previously detailed in a number of papers (Voorhees, 2003; Hildebrandt et al., 2004; Lin and Demner-Fushman, 2005a); here, we present only a short summary.</Paragraph> <Paragraph position="1"> System responses to complex questions consist of an unordered set of passages. To evaluate answers, NIST pools answer strings from all participants, removes their association with the runs that produced them, and presents them to a human assessor. Using these responses and research performed during the original development of the question, the assessor creates an &quot;answer key&quot; comprised of a list of &quot;nuggets&quot;--essentially, facts about the target. According to TREC guidelines, a nugget is defined as a fact for which the assessor could make a binary decision as to whether a response contained that nugget (Voorhees, 2003). As an example, relevant nuggets for the target &quot;AARP&quot; are shown in Table 1. In addition to creating the nuggets, the assessor also manually classifies each as either &quot;vital&quot; or &quot;okay&quot;. Vital nuggets represent concepts that must be in a &quot;good&quot; definition; on the other hand, okay nuggets contribute worthwhile information about the target but are not essential. The distinction has important implications, described below.</Paragraph> <Paragraph position="2"> Once the answer key of vital/okay nuggets is created, the assessor goes back and manually scores each run. For each system response, he or she decides whether or not each nugget is present. The final F-score for an answer is computed in the manner described in Figure 1, and the final score of a system run is the mean of scores across all questions. The per-question F-score is a harmonic mean between nugget precision and nugget recall, where recall is heavily favored (controlled by the b parameter, set to five in 2003 and three in 2004 and 2005). Nugget recall is computed solely on vital nuggets vital 30+ million members okay Spends heavily on research & education vital Largest seniors organization vital Largest dues paying organization vital Membership eligibility is 50+ okay Abbreviated name to attract boomers okay Most of its work done by volunteers okay Receives millions for product endorsements okay Receives millions from product endorsements (which means no credit is given for returning okay nuggets), while nugget precision is approximated by a length allowance based on the number of both vital and okay nuggets returned. Early in a pilot study, researchers discovered that it was impossible for assessors to enumerate the total set of nuggets contained in a system response (Voorhees, 2003), which corresponds to the denominator in the precision calculation. 
<Paragraph position="3"> Note that while a question's answer key only needs to be created once, assessors must manually determine if each nugget is present in a system's response. This human involvement has been identified as a bottleneck in the evaluation process, although we have recently developed an automatic scoring metric called POURPRE that correlates well with human judgments (Lin and Demner-Fushman, 2005a).</Paragraph>

<Paragraph position="4"> in the different testsets.</Paragraph>

</Section>
</Paper>