<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2004">
<Title>Exploiting Diversity for Answering Questions</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Background: The TREC Question Answering Task </SectionTitle>
<Paragraph position="0"> Under the auspices of the National Institute of Standards and Technology (NIST), the Text REtrieval Conferences (TREC) have been an annual opportunity for the information retrieval community to evaluate techniques on a variety of tasks. For the last four years, TREC has included a question answering activity, wherein commercial and academic groups from around the world can evaluate systems designed to retrieve answers to questions, rather than simply documents from queries (Voorhees, 2002). This past year, 34 groups participated by running their systems on 500 previously unseen questions, against a corpus of approximately one million newswire documents. The task was to retrieve a single, short phrasal answer to each question from one of the documents, returning the answer string along with the document identifier. Answers were judged strictly correct by the NIST assessors only if the indicated document justified the answer appropriately, and if no extraneous material was included in the answer string.</Paragraph>
<Paragraph position="1"> For example, Question 1399 from the 2002 evaluation was: What mythical Scottish town appears for one day every 100 years? Participating systems returned Hong Kong, Tartan, Lockerbie, and Brigadoon, as well as a number of other candidates. Only Brigadoon was judged to be correct, and only if the system also pointed to a document that explicitly justified that answer--a document that simply mentioned the town was insufficient. Systems also had the option of indicating that they believed a question to be unanswerable from the corpus, by returning the NIL document ID.</Paragraph>
<Paragraph position="2"> This year, TREC QA participants were encouraged to develop confidence assessment techniques for their systems. Systems returned the answer set sorted by decreasing confidence that each answer was correct. This ranking was taken into account by the main evaluation metric, average precision, defined as follows:
\[ \mathrm{AP} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{number of correct answers in the first } i \text{ positions}}{i} \]
where N is the total number of answers in the evaluation set. In this way, correct answers near the top of the system's ranking count for far more than those near the bottom, and systems are rewarded for good confidence estimates.</Paragraph>
</Section>
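To make the evaluation metric concrete, here is a minimal sketch (ours, not part of the original paper) of how this average precision can be computed over a confidence-ranked list of answer judgments; the function name and the boolean input format are illustrative assumptions.

# Illustrative sketch: TREC-2002-style average precision over a ranked list.
# `judgments` is assumed to be a list of booleans, ordered from the answer the
# system is most confident about down to the least confident one.
def average_precision(judgments):
    n = len(judgments)            # N: total number of answers in the evaluation set
    correct_so_far = 0
    running_sum = 0.0
    for i, is_correct in enumerate(judgments, start=1):
        if is_correct:
            correct_so_far += 1
        running_sum += correct_so_far / i   # precision over the first i answers
    return running_sum / n

# The same three correct answers score higher when ranked confidently at the top:
print(average_precision([True, True, False, True]))   # ~0.854
print(average_precision([False, True, True, True]))   # ~0.479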
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Methods </SectionTitle>
<Paragraph position="0"> The task we are faced with is straightforward. Given a collection of answers to a question, choose the one most likely to be correct. For our purposes, each answer consists of the answer string and an identifier for an associated document. Our data was initially limited in that it did not indicate which answers were provided by which system--see the discussion below. Note that we use no knowledge of the question or of the document collection.</Paragraph>
<Paragraph position="1"> Our assumption is that the authors of the individual systems have milked the information in their inputs to the best of their capabilities. Our goal is to combine their outputs, not to re-investigate the original problem.</Paragraph>
<Paragraph position="2"> In TREC 2002's main QA evaluation there were 67 different systems or variants thereof involved. Thus, our corpus consists of 67 x 500 = 33,500 answers. To guard against any implicit bias due to repeated experimentation on the small dataset available, we randomly selected a 100-question subset for development of our techniques--the remaining 400 questions were kept as a test set, evaluated only once, when development was complete. While we might have wished to pursue parametric techniques, we felt that this training set was too small to explore any but the simplest (non-parametric) techniques. An exception is the experiments described below involving priors over the document sources and the systems themselves.</Paragraph>
<Paragraph position="3"> Voting is an easily understood technique for selecting an answer from among the 67 suggestions. Unfortunately, voting techniques do not provide a mechanism for utilizing partial matches between proposed answers. While his original goal was the selection of representative DNA sequences, Gusfield (1993) introduced a general method for selecting a candidate sequence that is close to an ideal centroid of a set of sequences. His technique works for any distance measure that satisfies the triangle inequality, and guarantees that the sum of pairwise distances (SOP) from the proposed answers to the chosen answer will be no more than twice the SOP to the actual centroid (even though the centroid may not be in the set). This basic technique has been used successfully for combining parsers (Henderson, 1999). Appealingly, the centroid method reduces to simple voting when an "exact match" distance is used (the complement of the Kronecker delta).</Paragraph>
<Paragraph position="4"> One advantage of both simple voting and the centroid method is that they give values (distances) that are comparable between questions. An answer that receives 20 votes is more reliable than an answer that receives 10 votes, and likewise for generalized SOP values. This gives a principled method for ranking results by confidence and measuring average precision, as required for this year's TREC evaluations.</Paragraph>
<Paragraph position="5"> In selecting appropriate distance measures between answers, both words and characters were explored as atomic units of similarity. Two well-known non-parametric distances are available in the literature: Levenshtein edit distance on strings and Tanimoto distance on sets (Duda et al., 2001). The latter is defined as follows:
\[ D_T(S_1, S_2) = \frac{|S_1| + |S_2| - 2\,|S_1 \cap S_2|}{|S_1| + |S_2| - |S_1 \cap S_2|} \]
We experimented with each of these, and also generalized the Tanimoto distance to handle multisets by defining the obvious function to map multisets to simple sets: given a multiset containing n instances of a repeated element x, we create a simple set by subscripting, e.g., {x, x, x} becomes {x_1, x_2, x_3}, and then compute the standard Tanimoto distance on the resulting simple sets.</Paragraph>
<Paragraph position="10"> Overall, systems seemed to be conservative and answered with the NIL document (no answer) at a rather high rate (17% of all answer strings this year). To compensate for this, a "source prior" was collected from the 100-question training set. These four numbers recorded the accuracy expected when systems generated answers from the four document sources (Associated Press, New York Times, Xinhua News, and NIL). Those numbers were then used to scale the distance measures for the corresponding answer strings. Other than these priors, no other features of the document ID string were used.</Paragraph>
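As a concrete illustration of the machinery just described, here is a minimal sketch (ours, not the authors' implementation) that combines a multiset-generalized Tanimoto distance over answer words with minimum-SOP (centroid-style) selection, plus one plausible way to fold in the source prior; the document-ID parsing, the particular scaling rule, and all function names are illustrative assumptions rather than details given in the paper.

# Illustrative sketch, not the authors' code: centroid-style answer selection.
import re
from collections import Counter

def to_simple_set(tokens):
    # Map a multiset of word tokens to a simple set by subscripting repeats,
    # e.g. ['new', 'new', 'york'] -> {('new', 1), ('new', 2), ('york', 1)}.
    counts = Counter()
    simple = set()
    for tok in tokens:
        counts[tok] += 1
        simple.add((tok, counts[tok]))
    return simple

def tanimoto(a_tokens, b_tokens):
    # Tanimoto distance on the subscripted sets: 0 = identical, 1 = disjoint.
    a, b = to_simple_set(a_tokens), to_simple_set(b_tokens)
    union = len(a | b)
    return 0.0 if union == 0 else (union - len(a & b)) / union

def choose_answer(answers, source_prior=None):
    # answers: list of (answer_string, doc_id) pairs from the contributing systems.
    # Select the answer with the smallest sum of pairwise distances (SOP) to all
    # proposed answers; -SOP serves as a confidence value comparable across questions.
    # With an exact-match distance this selection reduces to simple voting.
    def weight(doc_id):
        # Assumed prior scaling: shrink the SOP of answers drawn from sources with
        # higher expected accuracy (the exact scaling rule is not specified here).
        if not source_prior:
            return 1.0
        match = re.match(r'[A-Za-z]+', doc_id)   # assumed doc-ID format, e.g. 'NYT...'
        source = match.group(0) if match else doc_id
        return 1.0 - source_prior.get(source, 0.0)

    best, best_sop = None, float('inf')
    for ans, doc in answers:
        sop = weight(doc) * sum(tanimoto(ans.lower().split(), other.lower().split())
                                for other, _ in answers)
        if sop < best_sop:
            best, best_sop = (ans, doc), sop
    return best, -best_sop

# Example: three systems agree on "Brigadoon", one proposes "Lockerbie".
candidates = [("Brigadoon", "NYT19990101.0001"), ("Brigadoon", "APW19990101.0002"),
              ("brigadoon", "XIE19990101.0003"), ("Lockerbie", "NYT19990101.0004")]
print(choose_answer(candidates))   # (('Brigadoon', 'NYT19990101.0001'), -1.0)

Because the Tanimoto distance satisfies the triangle inequality, the factor-of-two SOP bound from Gusfield (1993) mentioned above applies to this style of selection.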
</Section>
</Paper>