<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1135">
  <Title>Improving QA Accuracy by Question Inversion</Title>
  <Section position="8" start_page="1076" end_page="1077" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Due to the complexity of the learned algorithm, we decided to evaluate in stages. We first performed an evaluation with a fixed question type, to verify that the purely arithmetic components of the algorithm were performing reasonably. We then evaluated on the entire TREC12 factoid question set.</Paragraph>
    <Section position="1" start_page="1076" end_page="1077" type="sub_section">
      <SectionTitle>
4.1 Evaluation 1
</SectionTitle>
      <Paragraph position="0"> We created a fixed question set of 50 questions of the form &amp;quot;What is the capital of X?&amp;quot;, for each state in the U.S. The inverted question &amp;quot;What state is Z the capital of?&amp;quot; was correctly generated in each case. We evaluated against two corpora: the AQUAINT corpus, of a little over a million news-wire documents, and the CNS corpus, with about 37,000 documents from the Center for Nonproliferation Studies in Monterey, CA. We expected there to be answers to most questions in the former corpus, so we hoped there our method would be useful in converting 2 nd place answers to first place. The latter corpus is about WMDs, so we expected there to be holes in the state capital coverage  , for which nil identification would be useful.</Paragraph>
      <Paragraph position="1">  We manually determined that only 23 state capitals were attested to in the CNS corpus, compared with all in AQUAINT.  We added Tbilisi to the answer key for &amp;quot;What is the capital of Georgia?&amp;quot;, since there was nothing in the question to disambiguate Georgia.</Paragraph>
      <Paragraph position="2">  The baseline is our regular search-based QA-System without the Constraint process. In this baseline system there was no special processing for nil questions, other than if the search (which always contained some required terms) returned no documents. Our results are shown in Table 2.</Paragraph>
      <Paragraph position="3">  corpora.</Paragraph>
      <Paragraph position="4"> On the AQUAINT corpus, four out of seven 2 nd place finishers went to first place. On the CNS corpus 16 out of a possible 26 correct no-answer cases were discovered, at a cost of losing three previously correct answers. The percentage correct score increased by a relative 10.3% for AQUAINT and 186% for CNS. In both cases, the error rate was reduced by about a third.</Paragraph>
    </Section>
    <Section position="2" start_page="1077" end_page="1077" type="sub_section">
      <SectionTitle>
4.2 Evaluation 2
</SectionTitle>
      <Paragraph position="0"> For the second evaluation, we processed the 414 factoid questions from TREC12. Of special interest here are the questions initially in first and second places, and in addition any questions for which nils were found.</Paragraph>
      <Paragraph position="1"> As seen in Table 1, there were 32 questions which originally evaluated in rank 2. Of these, four questions were not invertible because they had no terms that were annotated with any of our named-entity types, e.g. #2285 &amp;quot;How much does it cost for gastric bypass surgery?&amp;quot; Of the remaining 28 questions, 12 were promoted to first place. In addition, two new nils were found.</Paragraph>
      <Paragraph position="2"> On the down side, four out of 108 previous first place answers were lost. There was of course movement in the ranks two and beyond whenever nils were introduced in first place, but these do not affect the current TREC-QA factoid correctness measure, which is whether the top answer is correct or not. These results are summarized in Table 3.</Paragraph>
      <Paragraph position="3"> While the overall percentage improvement was small, note that only second-place answers were candidates for re-ranking, and 43% of these were promoted to first place and hence judged correct.</Paragraph>
      <Paragraph position="4"> Only 3.7% of originally correct questions were casualties. To the extent that these percentages are stable across other collections, as long as the size of the set of second-place answers is at least about 1/10 of the set of first-place answers, this form of the Constraint process can be applied effectively.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>