<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1021">
  <Title>Ranking suspected answers to natural language questions using predictive annotation</Title>
  <Section position="5" start_page="150" end_page="152" type="metho">
    <SectionTitle>
3 Answer selection
</SectionTitle>
    <Paragraph position="0"> So far, we have described how we retrieve relevant passages that may contain the answer to a query. The output of GuruQA is a list of 10 short passages containing altogether a large  number (often more than 30 or 40) of potential answers in the form of phrases annotated with QA-Tokens.</Paragraph>
    <Section position="1" start_page="151" end_page="151" type="sub_section">
      <SectionTitle>
3.1 Answer ranking
</SectionTitle>
      <Paragraph position="0"> We now describe two algorithms, AnSel and Werlect, which rank the spans returned by GuruQA. AnSel and Werlect 1 use different approaches, which we describe, evaluate and compare and contrast. The output of either system consists of five text extracts per question that contain the likeliest answers to the questions.</Paragraph>
    </Section>
    <Section position="2" start_page="151" end_page="151" type="sub_section">
      <SectionTitle>
3.2 Sample Input to AnSel/Werlect
</SectionTitle>
      <Paragraph position="0"> The role of answer selection is to decide which among the spans extracted by GuruQA are most likely to contain the precise answer to the questions. Figure 3 contains an example of the data structure passed from GuruQA to our answer selection module.</Paragraph>
      <Paragraph position="1"> The input consists of four items:  question (e.g., &amp;quot;PERSONS NAMES&amp;quot;). The text in Figure 3 contains five spans (potential answers), of which three (&amp;quot;Biography of Margaret Thatcher&amp;quot;, &amp;quot;Hugo Young&amp;quot;, and &amp;quot;Margaret Thatcher&amp;quot;) are of types included in the SYN-class for the question (PERSON NAME).</Paragraph>
      <Paragraph position="2"> The full output of GuruQA for this question includes a total of 14 potential spans (5 PERSONs and 9 NAMEs).</Paragraph>
    </Section>
    <Section position="3" start_page="151" end_page="152" type="sub_section">
      <SectionTitle>
3.3 Sample Output of AnSel/Werlect
</SectionTitle>
      <Paragraph position="0"> The answer selection module has two outputs: internal (phrase) and external (text passage).</Paragraph>
      <Paragraph position="1"> Internal output: The internal output is a ranked list of spans as shown in Table 1. It represents a ranked list of the spans (potential answers) sent by GuruQA.</Paragraph>
      <Paragraph position="2"> External output: The external output is a ranked list of 50-byte and 250-byte extracts. These extracts are selected in a way to cover the highest-ranked spans in the list of potential answers. Examples are given later in the paper.</Paragraph>
      <Paragraph position="3"> The external output was required for the TREC evaluation while system's internal output can be used in a variety of applications, e.g., to highlight the actual span that we believe is the answer to the question within the context of the passage in which it appears.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="152" end_page="153" type="metho">
    <SectionTitle>
4 Analysis of corpus and question sets
</SectionTitle>
    <Paragraph position="0"> sets In this section we describe the corpora used for training and evaluation as well as the questions contained in the training and evaluation question sets.</Paragraph>
    <Section position="1" start_page="152" end_page="152" type="sub_section">
      <SectionTitle>
4.1 Corpus analysis
</SectionTitle>
      <Paragraph position="0"> For both training and evaluation, we used the TREC corpus, consisting of approximately 2 GB of articles from four news agencies.</Paragraph>
    </Section>
    <Section position="2" start_page="152" end_page="152" type="sub_section">
      <SectionTitle>
4.2 Training set TR38
</SectionTitle>
      <Paragraph position="0"> To train our system, we used 38 questions (see Figure 4) for which the answers were provided by NIST.</Paragraph>
    </Section>
    <Section position="3" start_page="152" end_page="153" type="sub_section">
      <SectionTitle>
4.3 Test set T200
</SectionTitle>
      <Paragraph position="0"> The majority of the 200 questions (see Figure 5) in the evaluation set (T200) were not substantially different from these in TR38, although the introduction of &amp;quot;why&amp;quot; and &amp;quot;how&amp;quot; questions as well as the wording of questions in the format &amp;quot;Name X&amp;quot; made the task slightly harder.</Paragraph>
      <Paragraph position="1"> Questlon/Answer (T200) Q: Why did David Koresh ask the FBI for a word processor? A: to record his revelations.</Paragraph>
      <Paragraph position="2"> Q: How tall is the Matterhorn? A: 14,776 feet 9 inches Q: How tall is the replica of the Matterhorn  Some examples of problematic questions are shown in Figure 6.</Paragraph>
      <Paragraph position="3">  Q: Why did David Koresh ask the FBI for a word processor? Q: Name the first private citizen to fly in space.</Paragraph>
      <Paragraph position="4"> Q: What is considered the costliest disaster the insurance industry has ever faced? Q: What did John Hinckley do to impress</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="153" end_page="153" type="metho">
    <SectionTitle>
5 AnSel
</SectionTitle>
    <Paragraph position="0"> AnSel uses an optimization algorithm with 7 predictive variables to describe how likely a given span is to be the correct answer to a question. The variables are illustrated with examples related to the sample question number 10001 from TR38 &amp;quot;Who was Johnny Mathis' high school track coach?&amp;quot;. The potential answers (extracted by GuruQA) are shown in Table 2.</Paragraph>
    <Section position="1" start_page="153" end_page="153" type="sub_section">
      <SectionTitle>
5.1 Feature selection
</SectionTitle>
      <Paragraph position="0"> The seven span features described below were found to correlate with the correct answers.</Paragraph>
      <Paragraph position="1"> Number: position of the span among M1 spans returned from the hit-list.</Paragraph>
      <Paragraph position="2"> Rspanno: position of the span among all spans returned within the current passage.</Paragraph>
      <Paragraph position="3"> Count: number of spans of any span class retrieved within the current passage.</Paragraph>
      <Paragraph position="4"> Notinq: the number of words in the span that do not appear in the query.</Paragraph>
      <Paragraph position="5"> Type: the position of the span type in the list of potential span types. Example: Type (&amp;quot;Lou Vasquez&amp;quot;) = 1, because the span type of &amp;quot;Lou Vasquez&amp;quot;, namely &amp;quot;PER-SON&amp;quot; appears first in the SYN-class &amp;quot;PER-</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="153" end_page="153" type="metho">
    <SectionTitle>
SON ORG NAME ROLE&amp;quot;.
</SectionTitle>
    <Paragraph position="0"> Avgdst: the average distance in words between the beginning of the span and query words that also appear in the passage. Example: given the passage &amp;quot;Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said.&amp;quot; and the span &amp;quot;Tim O'Donohue&amp;quot;, the value of avgdst is equal to 8.</Paragraph>
    <Paragraph position="1"> Sscore: passage relevance as computed by GuruQA. null Number: the position of the span among all retrieved spans.</Paragraph>
    <Section position="1" start_page="153" end_page="153" type="sub_section">
      <SectionTitle>
5.2 AnSel algorithm
</SectionTitle>
      <Paragraph position="0"> The TOTAL score for a given potential answer is computed as a linear combination of the features described in the previous subsection:</Paragraph>
      <Paragraph position="2"> The Mgorithm that the training component of AnSel uses to learn the weights used in the formula is shown in Figure 7.</Paragraph>
      <Paragraph position="3"> For each &lt;question,span&gt; tuple in training  set : i. Compute features for each span 2. Compute TOTAL score for each span using current set of weights Kepeat 3. Compute performance on training set 4. Adjust weights wi through  At runtime, the weights are used to rank potential answers. Each span is assigned a TOTAL score and the top 5 distinct extracts of 50 (or 250) bytes centered around the span are output. The 50-byte extracts for question 10001 are shown in Figure 8. For lack of space, we are omitting the 250-byte extracts.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="153" end_page="155" type="metho">
    <SectionTitle>
6 Werlect
</SectionTitle>
    <Paragraph position="0"> The Werlect algorithm used many of the same features of phrases used by AnSel, but employed a different ranking scheme.</Paragraph>
    <Section position="1" start_page="153" end_page="154" type="sub_section">
      <SectionTitle>
6.1 Approach
</SectionTitle>
      <Paragraph position="0"> Unlike AnSel, Werlect is based on a two-step, rule-based process approximating a function with interaction between variables. In the first stage of this algorithm, we assign a rank to  every relevant phrase within each sentence according to how likely it is to be the target answer. Next, we generate and rank each N-byte fragment based on the sentence score given by GuruQA, measures of the fragment's relevance, and the ranks of its component phrases. Unlike AnSel, Werlect was optimized through manual trial-and-error using the TR38 questions.</Paragraph>
    </Section>
    <Section position="2" start_page="154" end_page="154" type="sub_section">
      <SectionTitle>
6.2 Step One: Feature Selection
</SectionTitle>
      <Paragraph position="0"> The features considered in Werlect that were also used by AnSel, were Type, Avgdst and Sscore. Two additional features were also taken into account: NotinqW: a modified version of Notinq. As in AnSel, spans that are contained in the query are given a rank of 0. However, partial matches are weighted favorably in some cases. For example, if the question asks, &amp;quot;Who was Lincoln's Secretary of State?&amp;quot; a noun phrase that contains &amp;quot;Secretary of State&amp;quot; is more likely to be the answer than one that does not. In this example, the phrase, &amp;quot;Secretary of State William Seward&amp;quot; is the most likely candidate. This criterion also seems to play a role in the event that Resporator fails to identify relevant phrase types. For example, in the training question, &amp;quot;What shape is a porpoise's tooth?&amp;quot; the phrase &amp;quot;spade-shaped&amp;quot; is correctly selected from among all nouns and adjectives of the sentences returned by Guru-QA.</Paragraph>
      <Paragraph position="1"> Frequency: how often the span occurs across different passages. For example, the test question, &amp;quot;How many lives were lost in the Pan Am crash in Lockerbie, Scotland?&amp;quot; resulted in four potential answers in the first two sentences returned by Guru-QA. Table 3 shows the frequencies of each term, and their eventual influence on the span rank. The repeated occurrence of &amp;quot;270&amp;quot;, helps promote it to first place.</Paragraph>
    </Section>
    <Section position="3" start_page="154" end_page="155" type="sub_section">
      <SectionTitle>
6.3 Step two: ranking the sentence spans
</SectionTitle>
      <Paragraph position="0"> spans After each relevant span is assigned a rank, we rank all possible text segments of 50 (or 250) bytes from the hit list based on the sum of the phrase ranks plus additional points for other words in the segment that match the query.</Paragraph>
      <Paragraph position="1"> The algorithm used by Werlect is shown in  i. Let candidate_set = all potential answers, ranked and sorted.</Paragraph>
      <Paragraph position="2"> 2. For each hit-list passage, extract ali spans of 50 (or 250) bytes, on word boundaries.</Paragraph>
      <Paragraph position="3"> 3. Rank and sort all segments based on phrase ranks, matching terms, and sentence ranks.</Paragraph>
      <Paragraph position="4"> 4. For each candidate in sorted</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="155" end_page="155" type="metho">
    <SectionTitle>
5. Output answer_set
</SectionTitle>
    <Paragraph position="0"> noted that on the 14 questions we were unable to classify with a QA-Token, Werlect (runs W50 and W250) achieved an MRAR of 3.5 to Ansel's 2.0.</Paragraph>
    <Paragraph position="1"> The cumulative RAR of A50 on T200 (Table 4) is 63.22 (i.e., we got 49 questions among the 198 right from our first try and 39 others within the first five answers).</Paragraph>
    <Paragraph position="2"> The performance of A250 on T200 is shown in Table 5. We were able to answer 71 questions with our first answer and 38 others within our first five answers (cumulative RAR = 85.17).</Paragraph>
    <Paragraph position="3"> To better characterize the performance of our system, we split the 198 questions into 20 groups of 10 questions. Our performance on groups of questions ranged from 0.87 to 5.50 MRAR for A50 and from 1.98 to 7.5 MRAR for A250 (Table 6).</Paragraph>
  </Section>
  <Section position="11" start_page="155" end_page="155" type="metho">
    <SectionTitle>
7 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the performance of our system using results from our four official runs.</Paragraph>
    <Section position="1" start_page="155" end_page="155" type="sub_section">
      <SectionTitle>
7.1 Evaluation scheme
</SectionTitle>
      <Paragraph position="0"> For each question, the performance is computed as the reciprocal value of the rank (RAR) of the highest-ranked correct answer given by the system. For example, if the system has given the correct answer in three positions: second, third, and fifth, RAR for that question is ! 2&amp;quot; The Mean Reciprocal Answer Rank (MRAR) is used to compute the overall performance of systems participating in the TREC evaluation:  Finally, Table 7 shows how our official runs compare to the rest of the 25 official submissions. Our performance using AnSel and 50-byte output was 0.430. The performance of Werlect was 0.395. On 250 bytes, AnSel scored</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="155" end_page="156" type="metho">
    <SectionTitle>
8 Conclusion
</SectionTitle>
    <Paragraph position="0"> We presented a new technique, predictive annotation, for finding answers to natural language questions in text corpora. We showed that a system based on predictive annotation can deliver very good results compared to other competing systems.</Paragraph>
    <Paragraph position="1"> We described a set of features that correlate with the plausibility of a given text span being a good answer to a question. We experi- null mented with two algorithms for ranking potential answers based on these features. We discovered that a linear combination of these features performs better overall, while a non-linear algorithm performs better on unclassified questions.</Paragraph>
  </Section>
class="xml-element"></Paper>