<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1201"> <Title>Looking Under the Hood: Tools for Diagnosing Your Question Answering Engine</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The data </SectionTitle>
<Paragraph position="0"> The experiments in Sections 3, 4, and 5 were performed on two question answering data sets: (1) the TREC-8 Question Answering Track data set and (2) the CBC reading comprehension data set.</Paragraph>
<Paragraph position="1"> We briefly describe each of these data sets and their corresponding tasks.</Paragraph>
<Paragraph position="2"> The task of the TREC-8 Question Answering track was to find the answer to 198 questions in a document collection of roughly 500,000 newswire documents. For each question, systems were allowed to return a ranked list of 5 short (either 50-character or 250-character) responses. As a service to track participants, AT&T provided the top documents returned by their retrieval engine for each of the TREC questions. Sections 4 and 5 present analyses that use all sentences in the top 10 of these documents. Each sentence is classified as correct or incorrect automatically: a sentence is judged correct if it contains at least half of the stemmed content words in the answer key.</Paragraph>
<Paragraph position="3"> We have compared this automatic evaluation to the judgments of the TREC-8 QA track assessors and found the two to agree 93-95% of the time (Breck et al., 2000).</Paragraph>
<Paragraph position="4"> The CBC data set was created for the Johns Hopkins Summer 2000 Workshop on Reading Comprehension. Texts were collected from the Canadian Broadcasting Corporation web page for kids (http://cbc4kids.ca/) and average 24 sentences in length. The stories were adapted from newswire texts to be appropriate for adolescent children, and most fall into the following domains: politics, health, education, science, human interest, disaster, sports, business, crime, war, entertainment, and environment. For each CBC story, 8-12 questions and an answer key were generated.2 We used a 650-question subset of the data and the corresponding 75 stories. The answer candidates for each question in this data set were all sentences in the document. The sentences were scored against the answer key by the automatic method described previously.</Paragraph>
<Paragraph position="5"> 2This work was performed by Lisa Ferro and Tim Bevins of the MITRE Corporation. Dr. Ferro has professional experience writing questions for reading comprehension exams and led the question writing effort.</Paragraph>
</Section>
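To make the overlap criterion above concrete, the following is a minimal sketch of such a judge. It is an illustration rather than the authors' implementation: the NLTK Porter stemmer and English stopword list stand in for whatever stemmer and content-word filter were actually used, and the function names are chosen here for clarity.

```python
# Minimal sketch of the overlap heuristic described above: a sentence is
# judged correct if it contains at least half of the stemmed content words
# in the answer key. The stemmer, stopword list, and tokenization are
# assumptions; the paper does not specify which were used.
# Requires: nltk.download("stopwords")
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

_stemmer = PorterStemmer()
_stopwords = set(stopwords.words("english"))

def content_stems(text):
    """Lowercase, tokenize, drop stopwords, and stem the remaining words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {_stemmer.stem(t) for t in tokens if t not in _stopwords}

def sentence_is_correct(sentence, answer_key):
    """True if the sentence contains at least half of the stemmed
    content words of the answer key."""
    key_stems = content_stems(answer_key)
    if not key_stems:
        return False
    overlap = key_stems & content_stems(sentence)
    return len(overlap) >= 0.5 * len(key_stems)
```

For instance, with an answer key of "cars, trucks and motors", a sentence mentioning cars and trucks would be judged correct (two of three stemmed content words are present).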
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Analyzing the number of answer opportunities per question </SectionTitle>
<Paragraph position="0"> In this section we explore the impact of multiple answer opportunities on end-to-end system performance. A question may have multiple answers for two reasons: (1) there is more than one distinct answer to the question, and (2) there may be multiple instances of each answer. For example, &quot;What does the Peugeot company manufacture?&quot; can be answered by trucks, cars, or motors, and each of these answers may occur in many sentences that provide enough context to answer the question. The table insert in Figure 1 shows that, on average, there are 7 answer occurrences per question in the TREC-8 collection.3 In contrast, there are only 1.25 answer occurrences per question in a CBC document. The number of answer occurrences varies widely, as the standard deviations illustrate. The medians, an answer frequency of 3 for TREC and 1 for CBC, perhaps give a more realistic sense of the answer frequency for most questions. 3We would like to thank John Burger and John Aberdeen for help preparing Figure 1.</Paragraph>
<Paragraph position="1"> To gather these data we manually reviewed 50 randomly chosen TREC-8 questions and identified all answers to them in our text collection. We defined an &quot;answer&quot; as a text fragment that contains the answer string in a context sufficient to answer the question. Figure 1 shows the resulting graph. The x-axis displays the number of answer occurrences found in the text collection per question, and the y-axis shows the percentage of questions that had x answers.</Paragraph>
<Paragraph position="2"> For example, 26% of the TREC-8 questions had only 1 answer occurrence, and 20% of the TREC-8 questions had exactly 2 answer occurrences (the black bars). The most prolific question had 67 answer occurrences (the Peugeot example mentioned above). Figure 1 also shows the analysis of 219 CBC questions. In contrast, 80% of the CBC questions had only 1 answer occurrence in the targeted document, and 16% had exactly 2 answer occurrences.</Paragraph>
<Paragraph position="3"> Each point in Figure 2 represents one of the 50 questions we examined.4 The x-axis shows the number of answer opportunities for the question, and the y-axis represents the percentage of systems that generated a correct answer5 for the question. For example, for the question with 67 answer occurrences, 80% of the systems produced a correct answer. In contrast, many questions had a single answer occurrence, and the percentage of systems that got those correct varied from about 2% to 60%. 5A system was judged to have produced a correct answer if a correct answer was in its response set.</Paragraph>
<Paragraph position="4"> The circles in Figure 2 represent the average percentage of systems that answered questions correctly, taken over all questions with the same number of answer occurrences. For example, on average about 27% of the systems produced a correct answer for questions that had exactly one answer occurrence, but about 50% of the systems produced a correct answer for questions with 7 answer opportunities. Overall, a clear pattern emerges: the performance of TREC-8 systems was strongly correlated with the number of answer opportunities present in the document collection.</Paragraph>
</Section> </Paper>
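The circles in Figure 2 amount to a group-by-and-average over per-question system accuracy. The sketch below illustrates that aggregation under assumed, hypothetical inputs (it is not the authors' code): opportunities maps each question id to its number of answer occurrences, and correct records, for each (system, question) pair, whether that system's response set contained a correct answer.

```python
# Minimal sketch of the aggregation behind Figure 2. The input data
# structures are hypothetical stand-ins, not taken from the paper.
from collections import defaultdict

def percent_systems_correct(correct, systems, qid):
    """Fraction of systems whose response set contained a correct answer
    for question qid."""
    return sum(correct.get((s, qid), False) for s in systems) / len(systems)

def average_by_opportunities(opportunities, correct, systems):
    """For each number of answer occurrences, average the percentage of
    systems that answered the corresponding questions correctly
    (the circles in Figure 2)."""
    buckets = defaultdict(list)
    for qid, n_opps in opportunities.items():
        buckets[n_opps].append(percent_systems_correct(correct, systems, qid))
    return {n: sum(vals) / len(vals) for n, vals in sorted(buckets.items())}
```

Applied to the data described above, such a table would reproduce the reported trend: roughly 27% of systems correct for questions with a single answer occurrence versus about 50% for questions with 7.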