<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1048">
  <Title>Nuggeteer: Automatic Nugget-Based Evaluation using Descriptions and Judgements</Title>
  <Section position="3" start_page="376" end_page="377" type="intro">
    <SectionTitle>
2 Approach
</SectionTitle>
    <Paragraph position="0"> Nuggeteer builds one binary classifier per nugget for each question, based on n-grams (up to trigrams) in the description and optionally in any provided judgement files. The classifiers use a weight for each n-gram, an informativeness measure for each n-gram, and a threshold for accepting a response as bearing the nugget.</Paragraph>
    <Section position="1" start_page="376" end_page="376" type="sub_section">
      <SectionTitle>
2.1 N-gram weight
</SectionTitle>
      <Paragraph position="0"> The idf-based weight for an n-gram w</Paragraph>
      <Paragraph position="2"> sum of unigram idf counts from the AQUAINT corpus of English newspaper text, the corpus from which responses for the TREC tasks are drawn. We did not explore using n-gram idfs. A tf component is not meaningful because the data are so sparse.</Paragraph>
    </Section>
    <Section position="2" start_page="376" end_page="376" type="sub_section">
      <SectionTitle>
2.2 Informativeness
</SectionTitle>
      <Paragraph position="0"> Let G be the set of nuggets for some question. Informativeness of an n-gram for a nugget g is calculated based on how many other nuggets in that question  This captures the Bayesian intuition that the more outcomes a piece of evidence is associated with, the less confidence we can have in predicting the outcome based on that evidence.</Paragraph>
    </Section>
    <Section position="3" start_page="376" end_page="377" type="sub_section">
      <SectionTitle>
2.3 Judgement
</SectionTitle>
      <Paragraph position="0"> Nuggeteer does not guess on responses which have been judged by a human to contain a nugget, or those which have unambiguously judged not to, but assigns the known judgement.</Paragraph>
      <Paragraph position="1">  For unseen responses, we determine the n-gram recall for each nugget g and candidate response</Paragraph>
      <Paragraph position="3"> by breaking the candidate into n-grams and finding the sum of scores:</Paragraph>
      <Paragraph position="5"> The candidate is considered to contain all nuggets whose recall exceeds some threshold. Put another  If a response was submitted, and no response from the same system was judged to contain a nugget, then the response is considered to not contain the nugget. We normalized whitespace and case for matching previously seen responses.</Paragraph>
      <Paragraph position="6">  way, we build an n-gram language model for each nugget, and assign those nuggets whose predicted likelihood exceeds a threshold.</Paragraph>
      <Paragraph position="7"> When several responses contain a nugget, Nuggeteer picks the first (instead of the best, as assessors can) for purposes of scoring.</Paragraph>
    </Section>
    <Section position="4" start_page="377" end_page="377" type="sub_section">
      <SectionTitle>
2.4 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> We explored a number of parameters in the scoring function: stemming, n-gram size, idf weights vs. count weights, and the effect of removing stopwords. We tested all 24 combinations, and for each experiment, we cross-validated by leaving out one submitted system, or where possible, one submitting institution (to avoid training and testing on potentially very similar systems).</Paragraph>
      <Paragraph position="1">  Each experiment was performed using a range of thresholds for Equation 3 above, and we selected the best performing threshold for each data set.</Paragraph>
      <Paragraph position="2">  Because the threshold was selected after crossvalidation, it is exposed to overtraining. We used a single global threshold to minimize this risk, but we have no reason to think that the thresholds for different nuggets are related.</Paragraph>
      <Paragraph position="3"> Selecting thresholds as part of the training process can maximize accuracy while eliminating overtraining. We therefore explored Bayesian models for automatic threshold selection. We model assignment of nuggets to responses as caused by the scores according to a noisy threshold function, with separate false positive and false negative error rates. We varied thresholds and error rates by entire dataset, by question, or by individual nugget, evaluating them using Bayesian model selection.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>