<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2004">
  <Title>Exploiting Diversity for Answering Questions</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> Several measurements were made to ascertain the quality of the various selection techniques, as seen in Figure 1.</Paragraph>
    <Paragraph position="1"> Precision, P, indicates the accuracy of the technique: the percentage of answers that were judged to be correct.</Paragraph>
    <Paragraph position="2"> avgP is the main measure used by NIST this year: the average precision over all prefixes of the sequence of answers, ordered from high to low confidence. Strict corresponds to the correctness criterion used by NIST: according to the assessor judgment, the answer must be exact and justified by the referenced document. The Loose figures discard these two criteria, so a wider range of assessor judgments counts as correct. The Loose P measure was the one optimized during development.</Paragraph>
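As a rough illustration of the avgP definition above, the following Python sketch averages the precision of every prefix of a confidence-ranked answer list. The function names and the boolean encoding of assessor judgments are assumptions made for this example; they are not part of any NIST tooling.

```python
def precision(judgments):
    """Fraction of answers judged correct (the P measure)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


def average_prefix_precision(judgments):
    """avgP: mean precision over all prefixes of a ranked answer list.

    `judgments` holds one boolean per question (True = judged correct),
    ordered from the most to the least confident answer.
    """
    if not judgments:
        return 0.0
    correct_so_far = 0
    running_total = 0.0
    for rank, is_correct in enumerate(judgments, start=1):
        if is_correct:
            correct_so_far += 1
        # Precision of the prefix made of the first `rank` answers.
        running_total += correct_so_far / rank
    return running_total / len(judgments)


# Hypothetical judgments for four questions, most confident answer first.
ranked = [True, False, True, True]
print(precision(ranked))                 # 0.75
print(average_prefix_precision(ranked))  # 35/48, about 0.729
```

Moving the single wrong answer to the end of the ranking raises avgP while leaving P unchanged, so the measure rewards placing correct answers at high confidence.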
    <Paragraph position="3"> Figure 1 shows both development and test set results for answer selection experiments involving a sample of the distance measures with which we experimented, as well as the best-performing system involved in the evaluation. All design and selection of the distance measures was done by hill-climbing on the development set; performance on the test set was measured only after this exploration was complete. Two general observations can be made about these results (and others not shown): taking into account a prior based on the document source (including NIL) is useful, as is working with feature bags from the answers rather than sets. The best-performing selection system used all character strings of length 5 or less as features, combined with the multiset Tanimoto distance measure described above, and scaled with document source priors. Furthermore, a numeric string mismatch was weighted to be twice as costly as a non-numeric string mismatch.</Paragraph>
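To make the feature and distance choice concrete, here is a minimal Python sketch of character substring bags (length 5 or less) compared with a multiset Tanimoto distance, taken here to be one minus the ratio of summed minimum counts to summed maximum counts. That formula is an assumption, since the measure is defined outside this section. The document source priors and the doubled numeric mismatch cost are omitted because their exact form is not given here, and all function names are illustrative.

```python
from collections import Counter


def char_ngram_bag(text, max_len=5):
    """Multiset of every character substring of length 1..max_len."""
    bag = Counter()
    for n in range(1, max_len + 1):
        for start in range(len(text) - n + 1):
            bag[text[start:start + n]] += 1
    return bag


def tanimoto_distance(bag_a, bag_b):
    """Assumed multiset Tanimoto distance: 1 - sum(min) / sum(max)."""
    features = set(bag_a) | set(bag_b)
    shared = sum(min(bag_a[f], bag_b[f]) for f in features)
    total = sum(max(bag_a[f], bag_b[f]) for f in features)
    if total == 0:
        return 0.0  # two empty bags are treated as identical
    return 1.0 - shared / total


def as_set(bag):
    """Collapse a bag to a set by capping every count at 1."""
    return Counter(dict.fromkeys(bag, 1))


bag_long = char_ngram_bag("aa")   # {"a": 2, "aa": 1}
bag_short = char_ngram_bag("a")   # {"a": 1}
print(tanimoto_distance(bag_long, bag_short))                  # bags: 2/3
print(tanimoto_distance(as_set(bag_long), as_set(bag_short)))  # sets: 0.5
```

The last two lines show how bags and sets can disagree: the repeated character counts toward the distance only in the bag representation.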
    <Paragraph position="4"> Question 1674 provides an example that contrasts this best selector with a simple voting scheme (exact string match). The question was "What day did Neil Armstrong land on the moon?" Simple voting selected "1969" (incorrect), while the best measure above selected "July 20, 1969" (correct). Although a plurality of systems answered with "1969", many others answered with variants of the correct answer that differed in punctuation, as well as "on July 20, 1969"; "July 18, 1969"; "July 14, 1999"; and even simply "20". All of these, including the incorrect instances of "1969", contributed to the correct answer being selected.</Paragraph>
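The contrast can be sketched in Python on a small, hypothetical list of submitted answers (not the actual TREC submissions). The sketch assumes that the similarity-based selector picks the candidate with the highest total multiset Tanimoto similarity to all submitted answers, using bags of character substrings of length 5 or less. This section does not spell out the selection rule, source priors and numeric weighting are left out, and every name below is illustrative.

```python
from collections import Counter


def char_ngram_bag(text, max_len=5):
    """Multiset of every character substring of length 1..max_len."""
    return Counter(
        text[start:start + n]
        for n in range(1, max_len + 1)
        for start in range(len(text) - n + 1)
    )


def tanimoto_similarity(bag_a, bag_b):
    """Assumed multiset Tanimoto similarity: sum(min) / sum(max)."""
    features = set(bag_a) | set(bag_b)
    total = sum(max(bag_a[f], bag_b[f]) for f in features)
    if total == 0:
        return 1.0  # two empty bags are treated as identical
    return sum(min(bag_a[f], bag_b[f]) for f in features) / total


def select_by_voting(answers):
    """Exact-string-match voting: the most frequent string wins."""
    return Counter(answers).most_common(1)[0][0]


def select_by_similarity(answers):
    """Pick the candidate most similar, in total, to all submitted answers."""
    bags = {answer: char_ngram_bag(answer) for answer in set(answers)}

    def support(candidate):
        return sum(tanimoto_similarity(bags[candidate], bags[a]) for a in answers)

    return max(bags, key=support)


# Hypothetical submissions for the moon-landing question.
submitted = ["1969", "1969", "July 20, 1969", "July 20 1969", "on July 20, 1969"]
print(select_by_voting(submitted))      # 1969
print(select_by_similarity(submitted))  # July 20, 1969
```

In this toy list the two votes for 1969 still lend some support to the full date, because every substring of 1969 also occurs inside July 20, 1969.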
  </Section>
</Paper>