<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1201">
  <Title>Looking Under the Hood: Tools for Diagnosing Your Question Answering Engine</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> When building a system to perform a task, the most important statistic is performance on an end-to-end evaluation. For the task of open-domain question answering against text collections, there have been two large-scale end-to-end evaluations: (TREC-8 Proceedings, 1999) and (TREC-9 Proceedings, 2000). In addition, a number of researchers have built systems to take reading comprehension examinations designed to evaluate children's reading levels (Charniak et al., 2000; Hirschman et al., 1999; Ng et al., 2000; Riloff and Thelen, 2000; Wang et al., 2000). These performance statistics have been useful for determining how well techniques work.</Paragraph>
    <Paragraph position="1"> However, raw performance statistics are not enough. If the score is low, we need to understand what went wrong and how to fix it. If the score is high, it is important to understand why.</Paragraph>
    <Paragraph position="2"> For example, performance may depend on characteristics of the current test set and might not carry over to a new domain. It would also be useful to know whether a particular characteristic of the system is central; if so, the system could be streamlined and simplified.</Paragraph>
    <Paragraph position="3"> In this paper, we explore ways of gaining insight into question answering system performance. First, we analyze the impact of having multiple answer opportunities for a question. We found that TREC-8 Q/A systems performed better on questions that had multiple answer opportunities in the document collection. Second, we present a variety of graphs to visualize and analyze functions for ranking sentences. The graphs revealed that relative score, rather than absolute score, is paramount. Third, we introduce bounds on functions that use term overlap1 to rank sentences. Fourth, we compute the expected score of a hypothetical Q/A system that correctly identifies the answer type for a question and correctly identifies all entities of that type in answer sentences. We found that a surprising amount of ambiguity remains because sentences often contain multiple entities of the same type.</Paragraph>
    <Paragraph position="4"> 1 Throughout the text, we use "overlap" to refer to the intersection of sets of words, most often the words in the question and the words in a sentence.</Paragraph>
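To make this definition concrete, the following is a minimal sketch (in Python; our own illustration, not code from the paper) of scoring and ranking candidate answer sentences by word overlap with the question. The tokenization and function names are assumptions chosen for clarity, not the authors' implementation.

```python
def tokenize(text):
    # Deliberately simple tokenization: lowercase and split on whitespace.
    # A real Q/A system would use its own tokenizer, stopword list, etc.
    return set(text.lower().split())

def overlap_score(question, sentence):
    # "Overlap" as defined in the footnote: the size of the intersection
    # between the question's word set and the sentence's word set.
    return len(tokenize(question) & tokenize(sentence))

def rank_sentences(question, sentences):
    # Rank candidate answer sentences by descending overlap with the question.
    return sorted(sentences, key=lambda s: overlap_score(question, s), reverse=True)

# Example usage: the sentence sharing the most words with the question ranks first.
question = "When did the Titanic sink?"
sentences = [
    "The Titanic sank in 1912 after striking an iceberg.",
    "Many ships crossed the Atlantic that year.",
]
print(rank_sentences(question, sentences)[0])
```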
  </Section>
</Paper>