File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/03/n03-1034_concl.xml
Size: 3,372 bytes
Last Modified: 2025-10-06 13:53:29
<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1034">
<Title>References</Title>
<Section position="6" start_page="0" end_page="0" type="concl">
<SectionTitle> 5 Conclusion </SectionTitle>
<Paragraph position="0"> Evaluating natural language processing technology is critical to advancing the state of the art, but it also consumes significant resources. It is therefore important to validate new evaluation tasks and to establish the boundaries of what can legitimately be concluded from the evaluation. This paper presented an assessment of the task in the TREC 2002 QA track.</Paragraph>
<Paragraph position="1"> While the task in earlier QA tracks had already been validated, changes to the 2002 task were significant enough to warrant further examination. In particular, the 2002 task required systems to return exact answers, to return one response per question, and to rank questions by confidence in the response; the evaluation metric emphasized the ranking.</Paragraph>
<Paragraph position="2"> Each of these changes could increase the variability in the evaluation as compared to the earlier task. Examination of the track results did show some increase in variability, but also confirmed that system comparisons are sufficiently stable for an effective evaluation. Human assessors do not always agree as to whether an answer is exact, but these disagreements reflect the well-known differences of opinion as to correctness rather than any inherent difficulty in recognizing whether an answer is exact. The confidence-weighted score is sensitive to changes in judgments for questions that are ranked highly, and is therefore a less stable measure than a raw count of the number correct. Nonetheless, all of the observed inversions in confidence-weighted scores when systems were evaluated using different judgment sets were between systems whose scores differed by less than 0.07, the smallest difference for which the error rate of concluding that two runs are different is less than 5% for test sets of 500 questions. (Since the trials overlap, there may be correlations among the trials that could bias the estimates of the error rates as compared to what would be obtained with an equal number of samples drawn from a much larger initial set of questions.)</Paragraph>
<Paragraph position="3"> A major part of the cost of an evaluation is building the necessary evaluation infrastructure such as training materials, scoring procedures, and judgment sets. The net cost of an evaluation is greatly reduced if such infrastructure is reusable, since the initial costs are amortized over many additional users. Reusable infrastructure also accelerates the pace of technological advancement since it allows researchers to run their own experiments and receive rapid feedback on the quality of alternative methods. Unfortunately, neither the initial task within the TREC QA track nor the TREC 2002 task produces a reusable QA test collection. That is, it is not currently possible to use the judgment set produced during TREC to accurately evaluate a QA run that uses the same document and question sets as the TREC runs but was not judged by the human assessors. Methods for approximating evaluation scores exist (Breck et al., 2000; Voorhees and Tice, 2000), but they are not completely reliable. A key area for future work is to devise a truly reusable QA evaluation infrastructure.</Paragraph>
</Section>
</Paper>
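
Editor's note: the confidence-weighted score discussed above averages, over all cutoffs i, the fraction of the top-i responses (ordered by the system's confidence) that are judged correct. The short Python sketch below is a minimal illustration of that computation, not part of the track infrastructure; the function name and the boolean-list representation of judgments are our own. It makes concrete why a changed judgment near the top of the ranking affects many terms of the sum while one near the bottom affects only a few.

    def confidence_weighted_score(judgments):
        """Minimal sketch of a confidence-weighted score.

        `judgments` holds one boolean per question, ordered from the
        system's most confident response to its least confident one.
        The score is the mean over all cutoffs i of the fraction of
        the top-i responses judged correct, so the first-ranked
        judgment enters every term while the last enters only one.
        """
        total = 0.0
        correct_so_far = 0
        for i, is_correct in enumerate(judgments, start=1):
            if is_correct:
                correct_so_far += 1
            total += correct_so_far / i
        return total / len(judgments)

    # Example (hypothetical judgments): flipping the top-ranked judgment
    # moves the score far more than flipping the bottom-ranked one.
    ranked = [True, True, False, True, False]
    print(confidence_weighted_score(ranked))                # ~0.80
    print(confidence_weighted_score([False] + ranked[1:]))  # ~0.35
    print(confidence_weighted_score(ranked[:-1] + [True]))  # ~0.84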