<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1034">
  <Title>References</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 The TREC 2002 QA Track
</SectionTitle>
    <Paragraph position="0"> The goal of the question answering track is to foster research on systems that retrieve answers rather than documents, with particular emphasis on systems that function in unrestricted domains. To date the track has considered only a very restricted version of the general question answering problem, finding answers to closed-class questions in a large corpus of newspaper articles. Kupiec defined a closed-class question as &amp;quot;a question stated in natural language, which assumes some definite answer typified by a noun phrase rather than a procedural answer&amp;quot; (Kupiec, 1993). The TREC 2002 track continued to use closed-class questions, but made two major departures from the task as defined in earlier years. The first difference was that systems were to return exact answers rather than the text snippets containing an answer that were accepted previously. The second difference was that systems were required to return exactly one response per question and the questions were to be ranked by the system's confidence in the answer it had found.</Paragraph>
    <Paragraph position="1"> The change to exact answers was motivated by the belief that a system's ability to recognize the precise extent of the answer is crucial to improving question answering technology. The problems with using text snippets as responses were illustrated in the TREC 2001 track. Each of the answer strings shown in Figure 1 was judged correct for the question What river in the US is known as the Big Muddy?, yet earlier responses are clearly better than later ones. Accepting only exact answers as correct forces systems to demonstrate that they know precisely where the answer lies in the snippets.</Paragraph>
    <Paragraph position="2"> The second change, ranking questions by confidence in the answer, tested a system's ability to recognize when it has found a correct answer. Systems must be able to recognize when they do not know the answer to avoid returning incorrect responses. In many applications returning a wrong answer is much worse than returning a &amp;quot;Don't know&amp;quot; response.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Task Definition
</SectionTitle>
      <Paragraph position="0"> Incorporating these two changes into the previous QA task resulted in the following task definition. Participants were given a large corpus of newswire articles and a set of 500 closed-class questions. Some of the questions did not have answers in the document collection. A run consisted of exactly one response for each question. A response was either a [document-id, answer-string] pair or the string &amp;quot;NIL&amp;quot;, which was used to indicate that the system believed there was no correct answer in the collection. Within a run, questions were ordered from most confident response to least confident response. All runs were required to be produced completely automatically-no manual intervention of any kind was permitted.</Paragraph>
      <Paragraph position="1"> The document collection used as the source of answers was the the AQUAINT Corpus of English News Text (LDC catalog number LDC2002T31). The collection is comprised of documents from three different sources: the AP newswire from 1998-2000, the New York Times newswire from 1998-2000, and the (English portion of the) Xinhua News Agency from 1996-2000. There are approximately 1,033,000 documents and 3 gigabytes of text in the collection.</Paragraph>
      <Paragraph position="2"> The test set of questions were drawn from MSNSearch and AskJeeves logs. NIST assessors searched the document collection for answers to candidate questions from the logs. NIST staff selected the final test set from among the candidates that had answers, keeping some questions for which the assessors found no answer. NIST corrected the spelling, punctuation, and grammar of the questions in the logs1, but left the content as it was. NIST did not include any definition questions (Who is Duke Ellington? What are polymers?) in the test set, but otherwise made no attempt to control the relative number of different types of questions in the test set.</Paragraph>
      <Paragraph position="3"> A system response consisting of an [document-id, answer-string] pair was assigned exactly one judgment by a human assessor as follows: wrong: the answer string does not contain a correct answer or the answer is not responsive; not supported: the answer string contains a correct answer but the document returned does not support that answer; not exact: the answer string contains a correct answer and the document supports that answer, but the string contains more than just the answer (or is missing bits of the answer); right: the answer string consists of exactly a correct answer and that answer is supported by the document returned.</Paragraph>
      <Paragraph position="4"> Only responses judged right were counted as correct in the final scoring. A NIL response was counted as correct if there is no known answer in the document collection for that question (i.e., the assessors did not find an answer during the candidate selection phase and no system returned a right response for it). Forty-six questions have no known answer in the collection.</Paragraph>
      <Paragraph position="5"> The scoring metric used, called the confidence-weighted score, was chosen to emphasize the system's ability to correctly rank its responses. The metric is 1Unfortunately, some errors remain in the test questions. Scores were nevertheless computed over all 500 questions as released by NIST.</Paragraph>
      <Paragraph position="6"> the Mississippi Known as Big Muddy, the Mississippi is the longest as Big Muddy , the Mississippi is the longest messed with . Known as Big Muddy , the Mississip Mississippi is the longest river in the US the Mississippi is the longest river in the US, the Mississippi is the longest river(Mississippi) has brought the Mississippi to its lowest ipes.In Life on the Mississippi, Mark Twain wrote t  an analog of document retrieval's uninterpolated average precision in that it rewards a system for a correct answer early in the ranking more than it rewards for a correct answer later in the ranking. More formally, if there are a0 questions in the test set, the confidence-weighted score is defined to be</Paragraph>
      <Paragraph position="8"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Track Results
</SectionTitle>
      <Paragraph position="0"> Table 1 gives evaluation results for a subset of the runs submitted to the TREC 2002 QA track. The table includes one run each from the ten groups who submitted the top-scoring runs. The run shown in the table is the run with the best confidence-weighted score (&amp;quot;Score&amp;quot;). Also given in the table are the percentage of questions answered correctly, and the precision and recall for recognizing when there is no correct answer in the document collection (&amp;quot;NIL Accuracy&amp;quot;). Precision of recognizing no answer is the ratio of the number of times NIL was returned and correct to the number of times it was returned; recall is the ratio of the number of times NIL was returned and correct to the number of times it was correct (46).</Paragraph>
      <Paragraph position="1"> QA systems have become increasingly complex over the four years of the TREC track such that there is now little in common across all systems. Generally a system will classify an incoming question according to an ontology of question types (which varies from small sets of broad categories to highly-detailed hierarchical schemes) and then perform type-specific processing. Many TREC 2002 systems used specific data sources such as name lists and gazetteers, which were searched when the system determined the question to be of an appropriate type. The web was used as a data source by most systems, though it was used in different ways. For some systems the web was the primary source of an answer that the system then mapped to a document in the corpus to return as a response. Other</Paragraph>
      <Paragraph position="3"> systems did the reverse: used the corpus as the primary source of answers and then verified candidate answers on the web. Still other systems used the web as one of several sources whose combined evidence selected the final response.</Paragraph>
      <Paragraph position="4"> The results in Table 1 illustrate that the confidence-weighted score does indeed emphasize a system's ability to rank correctly answered questions before incorrectly answered questions. For example, the exactanswer run has a greater confidence-weighted score than the pris2002 run despite answering 19 fewer questions correctly (54.2 % answered correctly vs. 58.0 % answered correctly). The systems used a variety of approaches to creating their question rankings. Almost all systems used question type as a factor since some question types are easier to answer than others. Some systems use a score to rank candidate answers for a question. When that score is comparable across questions, it can also be used to rank questions. A few groups used a training set of previous years' questions and answers to learn a good feature set and corresponding weights to predict confidence. Many systems used NIL as an indicator that the system couldn't find an answer (rather than the system was sure there was no answer), so ranked NIL responses last. With the exception of the top-scoring LCCmain2002 run, though, the NIL accuracy scores are low, indicating that systems had trouble recognizing when there was no answer in the document collection.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Judging Responses
</SectionTitle>
    <Paragraph position="0"> The TREC QA track is a comparative evaluation. In a comparative evaluation, each of two methods is used to solve a common sample set of problems, and the methods' output is scored using some evaluation metric. The method whose output produces a better evaluation score is assumed to be the more effective method. An important feature of a comparative evaluation is that only relative scores are required. In other words, the only requirement of the evaluation methodology for a comparative evaluation is that it reliably rank better methods ahead of worse methods.</Paragraph>
    <Paragraph position="1"> The remainder of this paper examines the question of whether the QA task defined above reliably ranks systems. The first aspect of the investigation examines whether human assessors can recognize exact answers.</Paragraph>
    <Paragraph position="2"> The evidence suggests that they can, though the differences of opinion as to correctness observed in earlier QA tracks remain. The second part of the investigation looks at the effect the differences of opinion have on rankings of systems given that there is only response per question and the evaluation metric emphasizes the systems' ranking of questions by confidence. The final aspect of the investigation addresses the sensitivity of the evaluation. While evaluation scores can be computed to an arbitrary number of decimal places, not all differences are meaningful. The sensitivity analysis empirically determines the minimum difference in scores required to have a small probability of error in concluding that one system is better than the other.</Paragraph>
    <Paragraph position="3"> While the idea of an exact answer is intuitively obvious, it is very difficult to formally define. As with correctness, exactness is essentially a personal opinion. Thus whether or not an answer is exact is ultimately up to the assessor. NIST did provide guidelines to the assessors regarding exactness. The guidelines stated that exact answers need not be the most minimal response possible.</Paragraph>
    <Paragraph position="4"> For example, &amp;quot;Mississippi river&amp;quot; should be accepted as exact for the Big Muddy question despite the fact that &amp;quot;river&amp;quot; is redundant since all correct responses must be a river. The guidelines also suggested that ungrammatical responses are generally not exact; a location question can have &amp;quot;in Mississippi&amp;quot; as an exact answer, but not &amp;quot;Mississippi in&amp;quot;. The guidelines also emphasized that even &amp;quot;quality&amp;quot; responses--strings that contained both a correct answer and justification for that answer--were to be  ments.</Paragraph>
    <Paragraph position="5"> considered inexact for the purposes of this evaluation.</Paragraph>
    <Paragraph position="6"> To test whether assessors consistently recognize exact answers, each question was independently judged by three different assessors. Of the 15,948 [document-id, answer-string] response pairs across all 500 questions, 1886 pairs (11.8 %) had some disagreement among the three assessors as to which of the four judgments should be assigned to the pair. Note, however, that there were only 3725 pairs that had at least one judge assign a judgment that was something other than 'wrong'. Thus, there was some disagreement among the judges for half of all responses that were not obviously wrong.</Paragraph>
    <Paragraph position="7"> Table 2 shows the distribution of the assessors' disagreements. Each response pair is associated with a triple of judgments according to the three judgments assigned by the different assessors. In the table the judgments are denoted by W for wrong, R for right, U for unsupported, and X for inexact. The table shows the number of pairs that are associated with each triple, plus the percentage of the total number of disagreements that that triple represents. null The largest number of disagreements involves right and inexact judgments: the RRX and RXX combinations account for a third of the total disagreements. Fortunately inspection of these disagreements reveals that they do not in general represent a new category of disagreement. Instead, many of the granularity differences observed in earlier QA judgment sets (Voorhees and Tice, 2000) are now reflected in this distinction. For example, a correct response for Who is Tom Cruise married to? is Nicole Kidman. Some assessors accepted just &amp;quot;Kidman&amp;quot;, but others marked &amp;quot;Kidman&amp;quot; as inexact. Some assessors also accepted &amp;quot;actress Nicole Kidman&amp;quot;, which some rejected as inexact. Similar issues arose with dates and place names. For dates and quantities, there was disagreement whether slightly off responses are wrong or inexact. For example, when the correct response is April 20, 1999, is April 19, 1999 wrong or inexact? This last distinction doesn't matter very much in practice since in either case the response is not right.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Stability of Comparative Results
</SectionTitle>
    <Paragraph position="0"> The TREC-8 track demonstrated that QA evaluation results based on text snippets and mean reciprocal rank scoring is stable despite differences in assessor opinions (Voorhees and Tice, 2000). Given that the exact answer judgments reflect these same differences of opinion, are confidence-weighted scores computed over only one response per question also stable? We repeat the test for stability used in TREC-8 to answer this question.</Paragraph>
    <Paragraph position="1"> The three assessors who judged a question were arbitrarily assigned as assessor 1, assessor 2, or assessor 3. The assessor 1 judgments for all questions were gathered into judgment set 1, the assessor 2 judgments into judgment set 2, and the assessor 3 judgments into judgment set 3. These three judgment sets were combined through adjudication into a final judgment set, which is the judgment set used to produce the official TREC 2002 scores.</Paragraph>
    <Paragraph position="2"> Each run was scored using each of the four judgment sets. For each judgment set, the runs were ranked in order from most effective to least effective using either the confidence-weighted score or the raw number of correctly answered questions. The distance between two rankings of runs was computed using a correlation measure based on Kendall's a1 (Stuart, 1983). Kendall's a1 computes the distance between two rankings as the minimum number of pairwise adjacent swaps to turn one ranking into the other. The distance is normalized by the number of items being ranked such that two identical rankings produce a correlation of a1  a10 , the correlation between a ranking and its perfect inverse is a2 a1  a10 , and the expected correlation of two rankings chosen at random is a10  a10 . Table 3 gives the correlations between all pairs of rankings for both evaluation metrics.</Paragraph>
    <Paragraph position="3"> The average a1 correlation with the adjudicated ranking for the TREC-8 results was 0.956; for TREC 2001, where two assessors judged each question, the average correlation was 0.967. The correlations for the exact answer case are somewhat smaller: the average correlation is 0.930 for the confidence-weighted score and 0.945 for the raw count of number correct. Correlations are slightly higher for the adjudicated judgment set, probably because the adjudicated set has a very small incidence of errors. The higher correlation for the raw count measure likely reflects the fact that the confidence-weighted score is much more sensitive to differences in judgments for questions at small (close to one) ranks.</Paragraph>
    <Paragraph position="4"> Smaller correlations between system rankings indicate that comparative results are less stable. It is not surprising that an evaluation based on one response per question is less stable than an evaluation based on five responses per question--there is inherently less information included in the evaluation. At issue is whether the rankings are stable enough to have confidence in the evaluation results. It would be nice to have a critical value for a1 such that correlations greater than the critical value guarantee a quality evaluation. Unfortunately, no such value can exist since a1 values depend on the set of runs being compared. In practice, we have considered correlations greater than 0.9 to be acceptable (Voorhees, 2001), so both evaluating using the confidence-weighted score and evaluating using the raw count of number correct are sufficiently stable.</Paragraph>
    <Paragraph position="5"> The vast majority of &amp;quot;swaps&amp;quot; (pairs of run such that one member of the pair evaluates as better under one evaluation condition while the other evaluates as better under the alternate condition) that occur when using different human assessors involve systems whose scores are very similar. There is a total of 177 swaps that occur when the three one-judge rankings are compared with the adjudicated ranking when using the confidence-weighted score. Only 4 of the 177 swaps involve pairs of runs whose difference in scores, a3 , is at least 0.05 as computed using the adjudicated judgment set, and there are no swaps when a3 is at least 0.07. As will be shown in the next section, runs with scores that are this similar should be assumed to be equally effective, so some swapping is to be expected.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Sensitivity Analysis
</SectionTitle>
    <Paragraph position="0"> Human judgments are not the only source of variability when evaluating QA systems. As is true with document retrieval systems, QA system effectiveness depends on the questions that are asked, so the particular set of questions included in a test set will affect evaluation results. Since the test set of questions is assumed to be a random sample of the universe of possible questions, there is always some chance that a comparison of two systems using any given test set will lead to the wrong conclusion.</Paragraph>
    <Paragraph position="1"> The probability of an error can be made arbitrarily small by using arbitrarily many questions, but there are practical limits to the number of questions that can be included in an evaluation.</Paragraph>
    <Paragraph position="2"> Following our work for document retrieval evaluation (Voorhees and Buckley, 2002), we can use the runs submitted to the QA track to empirically determine the relationship between the number of questions in a test set, the observed difference in scores (a3 ), and the likelihood that a single comparison of two QA runs leads to the correct conclusion. Once established, the relationship can be used to derive the minimum difference in scores required for a certain level of confidence in the results given there are 500 questions in the test set.</Paragraph>
    <Paragraph position="3"> The core of the procedure is comparing the effectiveness of a pair runs on two disjoint question sets of equal size to see if the two sets disagree as to which of the runs is better. We define the error rate as the percentage of comparisons that result in a swap. Since the QA track used 500 questions, we can directly compute the error rate for question set sizes up to 250 questions. By fitting curves to the values observed for question set sizes up to 250, we can extrapolate the error rates to question sets up to 500 questions.</Paragraph>
    <Paragraph position="4"> When calculating the error rate, the difference between two runs' confidence-weighted scores is categorized into one of 21 bins based on the size of the difference. The first bin contains runs with a difference of less than 0.01 (including no difference at all). The next bin contains runs whose difference is at least 0.01 but less than 0.02.</Paragraph>
    <Paragraph position="5"> The limits for the remaining bins increase by increments of 0.01, with the last bin containing all runs with a difference of at least 0.2.</Paragraph>
    <Paragraph position="6"> The requirement that the question sets be disjoint ensures that the comparisons are made on independent samples of the space of questions. That is, we assume a universe of all possible closed-class questions, and an (unknown) probability distribution of the scores for each of the two runs. We also assume that the set of questions used in the TREC 2002 QA track is a random sample of the universe of questions. A random selection from the TREC question set gives a random, paired selection from each of the runs' confidence-weighted score distributions.</Paragraph>
    <Paragraph position="7"> We take one random sample as a base case, and a different random sample (the disjoint sets) as the test case to see if the results agree.</Paragraph>
    <Paragraph position="8"> Each question set size from 1 to 250 is treated as a separate experiment. Within an experiment, we randomly select two disjoint sets of questions of the required size. We compute the confidence-weighted score over both question sets for all runs, then count the number of times we see a swap for all pairs of runs using the bins to segregate the counts by size of the difference in scores. The entire procedure is repeated 10 times (i.e., we perform 10 trials), with the counts of the number of swaps kept as running totals over all trials2. The ratio of the number of 2While the two question sets used within any one trial are disjoint, and thus independent samples, the question sets across trials are drawn from the same initial set of 500 questions and thus overlap. Because the question sets among the different swaps to the total number of cases that land in a bin is the error rate for that bin.</Paragraph>
    <Paragraph position="9"> The error rates computed from this procedure are then used to fit curves of the form a0a2a1a3a1a5a4a3a1a3a6a8a7a10a9a12a11a14a13a16a15 a6 a11a18a17a20a19a22a21a24a23 where a15 a6 and a15a26a25 are parameters to be estimated and a27 is the size of the question set. A different curve is fit for each different bin. The input to the curve-fitting procedure used only question set sizes greater than 20 since smaller question set sizes are both uninteresting and very noisy. Curves could not be fit for the first bin (differences less than .01), for the same reason, or for bins where differences were greater than 0.16. Curves could not be fit for large differences because too much of the curve is in the long flat tail.</Paragraph>
    <Paragraph position="10"> The resulting extrapolated error rate curves are plotted in Figure 2. In the figure, the question set size is plotted on the x-axis and the error rate is plotted on the y-axis. The error rate for 500 questions when a difference of 0.05 in confidence-weighted scores is observed is approximately 8 %. That is, if we know nothing about two systems except their scores which differ by 0.05, and if we repeat the experiment on 100 different sets of 500 questions, then on average we can expect 8 out of those 100 sets to favor one system while the remaining 92 to favor the other.</Paragraph>
    <Paragraph position="11"> The horizontal line in the graph in Figure 2 is drawn at an error rate of 5 %, a level of confidence commonly used in experimental designs. For question set sizes of 500 questions, there needs to be an absolute difference of at least 0.07 in confidence-weighted scores before the error rate is less than 5 %. Using the 5 % error rate standard, the pris2002, IRST02D1, and IBMPQSQACYC runs from Table 1 should be considered equivalently effective, as should the uwmtB3, BBN2002C, isi02, limsiQalir2, and ali2002b runs.</Paragraph>
  </Section>
class="xml-element"></Paper>