<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1048">
  <Title>Nuggeteer: Automatic Nugget-Based Evaluation using Descriptions and Judgements</Title>
  <Section position="4" start_page="377" end_page="377" type="metho">
    <SectionTitle>
3 The Data
</SectionTitle>
    <Paragraph position="0"> For our experiments, we used the definition questions from TREC2003, the 'other' questions from TREC2004 and TREC2005, and the relationship questions from TREC2005. (Voorhees, 2003; Voorhees, 2004; Voorhees, 2005) The distribution of nuggets and questions is shown for each data set in Table 1. The number of nuggets by number of  For TREC2003 and TREC2004, the run-tags indicate the submitting institution. For TREC2005 we did not run the nonanonymized data in time for this submission. In the TREC2005  of systems that found each nugget.</Paragraph>
    <Paragraph position="1"> system responses assigned that nugget (difficulty of nuggets, in a sense) is shown in Figure 4. More than a quarter of relationship nuggets were not found by any system. Among all data sets, many nuggets were found in none or just a few responses.</Paragraph>
  </Section>
  <Section position="5" start_page="377" end_page="378" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> We report correlation (R  ), and Kendall's t b , following Lin and Demner-Fushman. Nuggeteer's scores are in the same range as real system scores, so we also report average root mean squared error from the official results. We 'corrected' the official judgements by assigning a nugget to a response if that response was judged to contain that nugget in any assessment for any system.</Paragraph>
    <Section position="1" start_page="377" end_page="378" type="sub_section">
      <SectionTitle>
4.1 Comparison with Pourpre
</SectionTitle>
      <Paragraph position="0"> (Lin et al., 2005) report Pourpre and Rouge performance with Pourpre optimal thresholds for TREC definition questions, as reproduced in Table 2.</Paragraph>
      <Paragraph position="1"> Nuggeteer's results are shown in the last column.</Paragraph>
      <Paragraph position="2">  Table 3 shows a comparison of Pourpre and Nuggeteer's correlations with official scores. As ex- null We report only micro-averaged results, because we wish to emphasize the interpretability of Nuggeteer scores. While the correlations of macro-averaged scores with official scores may be higher (as seems to be the case for Pourpre), the actual values of the micro-averaged scores are more interpretable because they include a variance.</Paragraph>
      <Paragraph position="3">  numbers of vital and okay nuggets, the average total number of nuggets per question, the number of participating systems, the average number of responses per system, and the average number of responses per question over all systems.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="378" end_page="380" type="metho">
    <SectionTitle>
4 Results (continued)
</SectionTitle>
    <Paragraph position="0"> Run micro, cnt macro, cnt micro, idf macro, idf default stop nostem, bigram, micro, idf</Paragraph>
    <Paragraph position="2"> cial scores, for each data set (D=&amp;quot;definition&amp;quot;, O=&amp;quot;other&amp;quot;, R=&amp;quot;relationship&amp;quot;). t=1 means same order, t=-1 means reverse order. Pourpre and Rouge scores reproduced from (Lin and Demner-Fushman, 2005).</Paragraph>
    <Paragraph position="3">  mse) between scores generated by Pourpre/Nuggeteer and official scores, for the same settings as the t comparison above.</Paragraph>
    <Paragraph position="4"> pected from the Kendall's t comparisons, Pourpre's correlation is about the same or higher in 2003, but fares progressively worse in the subsequent tasks.</Paragraph>
    <Paragraph position="5"> To ensure that Pourpre scores correlated sufficiently with official scores, Lin and Demner-Fushman used the difference in official score between runs whose ranks Pourpre had swapped, and showed that the majority of swaps were between runs whose official scores were less than the 0.1 apart, a threshold for assessor agreement reported in (Voorhees, 2003).</Paragraph>
    <Paragraph position="6"> Nuggeteer scores are not only correlated with, but actually meant to approximate, the assessment scores; thus we can use a stronger evaluation: root mean squared error of Nuggeteer scores against official scores. This estimates the average difference between the Nuggeteer score and the official score, and at 0.077, the estimate is below the 0.1 threshold. This evaluation is meant to show that the scores are &amp;quot;good enough&amp;quot; for experimental evaluation, and in Section 4.4 we will substantiate Lin and Demner-Fushman's observation that higher correlation scores may reflect overtraining rather than actual improvement.</Paragraph>
    <Paragraph position="7"> Accordingly, rather than reporting the best Nuggeteer scores (Kendall's t and R  ) above, we follow Pourpre's lead in reporting a single variant (no stemming, bigrams) that performs well across the data sets. As with Pourpre's evaluation, the par- null ted against Nuggeteer scores (idf term weighting, no stemming, bigrams) for each data set (all F-measures have b = 3), with the Nuggeteer 95% confidence intervals on the score. Across the four datasets, 6 systems (3%) have an official score outside Nuggeteer's 95% confidence interval.</Paragraph>
    <Paragraph position="8"> ticular thresholds for each year are experimentally optimized. A scatter plot of Nuggeteer performance on the definition tasks is shown in Figure 5.</Paragraph>
    <Section position="1" start_page="379" end_page="379" type="sub_section">
      <SectionTitle>
4.2 N-gram size and stemming
</SectionTitle>
      <Paragraph position="0"> A hypothesis advanced with Pourpre is that bigrams, trigrams, and longer n-grams will primarily account for the fluency of an answer, rather than its semantic content, and thus not aid the scoring process. We included the option to use longer n-grams within Nuggeteer, and have found that using bigrams can yield very slightly better results than using unigrams. From inspection, bigrams sometimes capture named entity and grammatical order features.</Paragraph>
      <Paragraph position="1"> Experiments with Pourpre showed that stemming hurt slightly at peak performances. Nuggeteer has the same tendency at all n-gram sizes.</Paragraph>
      <Paragraph position="2"> Figure 6 compares Kendall's t over the possible thresholds, n-gram lengths, and stemming. The choice of threshold matters by far the most.</Paragraph>
    </Section>
    <Section position="2" start_page="379" end_page="379" type="sub_section">
      <SectionTitle>
4.3 Term weighting and stopwords
</SectionTitle>
      <Paragraph position="0"> models: a baseline coin, three models of different granularity with globally specified false positive and negative error rates, and a model with too many parameters, where even the error rates have per-nugget granularity. We select the most probable model, the per-nugget threshold model.</Paragraph>
    </Section>
    <Section position="3" start_page="379" end_page="380" type="sub_section">
      <SectionTitle>
4.4 Thresholds
</SectionTitle>
      <Paragraph position="0"> We experimented with Bayesian models for automatic threshold selection. In the models, a system response contains or does not contain each nugget as a function of the response's Nuggeteer score plus noise. Table 4 shows that, as expected, the best models do not make assumptions about thresholds being equal within a question or dataset. It is interesting to note that Bayesian inference catches the overparametrization of the model where error rates vary per-nugget as well. In essence, we do not need those additional parameters to explain the variation in the data.</Paragraph>
      <Paragraph position="1"> The t of the best selection of parameters on the 2003 data set using the model with one threshold per  nugget and global errors is 0.837 ( [?] mse=0.037).</Paragraph>
      <Paragraph position="2"> We have indeed overtrained the best threshold for this dataset (compare t=0.879,</Paragraph>
    </Section>
    <Section position="4" start_page="380" end_page="380" type="sub_section">
      <SectionTitle>
4.5 Training on System Responses
</SectionTitle>
      <Paragraph position="0"> Intuitively, if a fact is expressed by a system response, then another response with similar n-grams may also contain the same fact. To test this intuition, we tried expanding our judgement method (Equation 3) to select the maximum judgement score from among those of the nugget description and each of the system responses judged to contain that nugget.</Paragraph>
      <Paragraph position="1"> Unfortunately, the assessors did not mark which portion of a response expresses a nugget, so we also find spurious similarity, as shown in Figure 7. The final results are not conclusively better or worse overall, and the process is far more expensive.</Paragraph>
      <Paragraph position="2"> We are currently exploring the same extension for multiple &amp;quot;nugget descriptions&amp;quot; generated by manually selecting the appropriate portions of system responses containing each nugget.</Paragraph>
    </Section>
    <Section position="5" start_page="380" end_page="380" type="sub_section">
      <SectionTitle>
4.6 Judgment Precision and Recall
</SectionTitle>
      <Paragraph position="0"> Because Nuggeteer makes a nugget classification for each system response, we can report precision and recall on the nugget assignments. Table 5 shows Nuggeteer's agreement rate with assessors on whether each response contains a nugget.</Paragraph>
    </Section>
    <Section position="6" start_page="380" end_page="380" type="sub_section">
      <SectionTitle>
4.7 Novel Judgements
</SectionTitle>
      <Paragraph position="0"> Approximate evaluation will tend to undervalue new results, simply because they may not have keyword overlap with existing nugget descriptions. We are therefore creating tools to help developers manually assess their system outputs.</Paragraph>
      <Paragraph position="1"> As a proof of concept, we ran Nuggeteer on the best 2005 &amp;quot;other&amp;quot; system (not giving Nuggeteer  Unlike human assessors, Nuggeteer is not able to pick the &amp;quot;best&amp;quot; response containing a nugget if multiple responses have it, and will instead pick the first, so these values are artifactually low. However, 2005 results may be high because these results reflect anonymized runs.</Paragraph>
      <Paragraph position="2">  ments, under best settings for each year, and under the default settings.</Paragraph>
      <Paragraph position="3"> the official judgements), and manualy corrected its guesses.</Paragraph>
      <Paragraph position="4">  Assessment took about 6 hours, and our judgements had precision of 78% and recall of 90%, for F-measure 0.803+- 0.065 (compare Table 5). The official score of .299 was still within the confidence interval, but now on the high side rather than the low (.257+- .07), because we found the answers quite good. In fact, we were often tempted to add new nuggets! We later learned that it was a manual run, produced by a student at the University of Maryland.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="380" end_page="381" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Pourpre pioneered automatic nugget-based assessment for definition questions, and thus enabled a rapid experimental cycle of system development.</Paragraph>
    <Paragraph position="1"> Nuggeteer improves on that functionality, and critically adds: * an interpretable score, comparable to official scores, with near-human error rates, * a reliable confidence interval on the estimated score, * scoring known responses exactly, * support for improving the accuracy of the score through additional annotation, and * a more robust training process We have shown that Nuggeteer evaluates the definition and relationship tasks with comparable rank swap rates to Pourpre. We explored the effects of stemming, term weighting, n-gram size, stopword removal, and use of system responses for training, all with little effect. We showed that previous methods of selecting a threshold overtrained, and have  We used a low threshold to make the task mostly correcting and less searching. This is clearly not how assessors should work, but is expedient for developers.</Paragraph>
    <Paragraph position="2">  question id 1901, response rank 2, response score 0.14 response text: best american classical music bears its stamp: witness aaron copland, whose &amp;quot;american-sounding&amp;quot; music was composed by a (the response was a sentence fragment) assigned nugget description: born brooklyn ny 1900 bigram matches: &amp;quot;american classical&amp;quot;, &amp;quot;american-sounding music&amp;quot;, &amp;quot;best american&amp;quot;, &amp;quot;whose american-sounding&amp;quot;, &amp;quot;witness aaron&amp;quot;, &amp;quot;copland whose&amp;quot;, &amp;quot;stamp witness&amp;quot;, ... response containing the nugget: Even the best American classical music bears its stamp:  ny 1900&amp;quot; at a recall score well above that of the background, despite containing none of those words. briefly described a promising way to select finer-grained thresholds automatically.</Paragraph>
    <Paragraph position="3"> Our experiences in using judgements of system responses point to the need for a better annotation of nugget content. It is possible to give Nuggeteer multiple nugget descriptions for each nugget. Manually extracting the relevant portions of correctlyjudged system responses may not be an overly arduous task, and may offer higher accuracy. It would be ideal if the community--including the assessors-were able to create and promulgate a gold-standard set of nugget descriptions for previous years.</Paragraph>
    <Paragraph position="4"> Nuggeteer currently supports evaluation for the TREC definition, 'other', and relationship tasks, for</Paragraph>
  </Section>
class="xml-element"></Paper>