<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1034">
<Title>References</Title>
<Section position="1" start_page="0" end_page="0" type="abstr">
<SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.</Paragraph>
<Paragraph position="1"> Metric-based evaluations of human language technology such as MUC, TREC, and DUC continue to proliferate (Sparck Jones, 2001). This proliferation is not difficult to understand: evaluations can forge communities, accelerate technology transfer, and advance the state of the art. Yet evaluations are not without their costs. In addition to the financial resources required to support the evaluation, there are also the costs of researcher time and focus. Since a poorly defined evaluation task wastes research effort, it is important to examine the validity of an evaluation task. In this paper, we assess the quality of the new question answering task that was the focus of the TREC 2002 question answering track.</Paragraph>
<Paragraph position="2"> TREC is a workshop series designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing results. The conference has focused primarily on the traditional information retrieval problem of retrieving a ranked list of documents in response to a statement of information need, but also includes other tasks, called tracks, that focus on new areas or particularly difficult aspects of information retrieval. A question answering (QA) track was started in TREC in 1999 (TREC-8) to address the problem of returning answers, rather than document lists, in response to a question.</Paragraph>
<Paragraph position="3"> The task for each of the first three years of the QA track was essentially the same. Participants received a large corpus of newswire documents and a set of factoid questions such as How many calories are in a Big Mac? and Who invented the paper clip? Systems were required to return a ranked list of up to five [document-id, answer-string] pairs per question such that each answer string was believed to contain an answer to the question. Human assessors read each string and decided whether the string actually did contain an answer to the question. An individual question received a score equal to the reciprocal of the rank at which the first correct response was returned, or zero if none of the five responses contained a correct answer. The score for a submission was then the mean of the individual questions' reciprocal ranks.</Paragraph>
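The scoring described in the preceding paragraph is the mean reciprocal rank (MRR) measure. The sketch below illustrates that computation under the assumption that each returned [document-id, answer-string] pair carries a boolean assessor judgment; the function names (question_score, mean_reciprocal_rank) are illustrative only and are not taken from any TREC scoring tool.

    # Minimal sketch of mean reciprocal rank (MRR) scoring as described above.
    # Names and data layout are assumptions for illustration, not TREC code.

    def question_score(judgments):
        """judgments: assessor decisions (True = string judged to contain a
        correct answer) for the up-to-five returned pairs, in rank order."""
        for rank, correct in enumerate(judgments, start=1):
            if correct:
                return 1.0 / rank   # reciprocal of the first correct rank
        return 0.0                  # no correct response among those returned

    def mean_reciprocal_rank(per_question_judgments):
        """Mean of the individual questions' reciprocal ranks."""
        scores = [question_score(j) for j in per_question_judgments]
        return sum(scores) / len(scores)

    # Example: correct answers at ranks 1 and 3, plus one unanswered question.
    run = [[True, False, False], [False, False, True, False], [False] * 5]
    print(mean_reciprocal_rank(run))  # (1 + 1/3 + 0) / 3 = 0.444...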
<Paragraph position="4"> Analysis of the TREC-8 track confirmed the reliability of this evaluation task (Voorhees and Tice, 2000): the assessors understood and could do their assessing job; relative scores between systems were stable despite differences of opinion by assessors; and intuitively better systems received better scores.</Paragraph>
<Paragraph position="5"> The task for the TREC 2002 QA track changed significantly from the previous years' task, and thus a new assessment of the track is needed. This paper provides that assessment by examining both the ability of the human assessors to make the required judgments and the effect that differences in assessor opinions have on comparative results, and by empirically establishing confidence intervals for the reliability of a comparison as a function of the difference in effectiveness scores. The first section defines the 2002 QA task and provides a brief summary of the system results. The following three sections look at each of the evaluation issues in turn. The final section summarizes the findings and outlines shortcomings of the evaluation that remain to be addressed.</Paragraph>
</Section>
</Paper>