<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1054">
<Title>Is It the Right Answer? Exploiting Web Redundancy for Answer Validation</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle> 5 Experiments and Discussion </SectionTitle>
<Paragraph position="0"> A number of experiments have been carried out in order to check the validity of the proposed answer validation technique. As a data set, the 492 questions of the TREC-2001 database have been used.</Paragraph>
<Paragraph position="1"> For each question, at most three correct answers and three wrong answers have been randomly selected from the TREC-2001 participants' submissions, resulting in a corpus of 2726 question-answer pairs (some questions have fewer than three positive answers in the corpus). As mentioned above, AltaVista was used as the search engine.</Paragraph>
<Paragraph position="2"> A baseline for the answer validation experiment was defined by considering how often an answer occurs in the top 10 documents among those (1000 for each question) provided by NIST to TREC-2001 participants. An answer was judged correct for a question if it appeared at least once in the first 10 documents retrieved for that question; otherwise it was judged incorrect. Baseline results are reported in Table 2.</Paragraph>
<Paragraph position="3"> We carried out several experiments in order to check a number of working hypotheses. Three independent factors were considered: Estimation method. We implemented three measures (reported in Section 4.2) to estimate the answer validity score: PMI, MLHR and CCP.</Paragraph>
<Paragraph position="4"> Threshold. We wanted to estimate the role of two different kinds of thresholds for the assessment of answer validation. In the case of an absolute threshold, if the answer validity score for a candidate answer is below the threshold, the answer is considered wrong; otherwise it is accepted as relevant. In a second type of experiment, for every question and its corresponding answers the program chooses the answer with the highest validity score and calculates a relative threshold on that basis (i.e. the threshold is set to a fixed fraction of the maximum validity score); in addition, the relative threshold should be larger than a certain minimum value.</Paragraph>
<Paragraph position="7"> Question type. We wanted to check performance variation based on different types of TREC-2001 questions. In particular, we separated definition and generic questions from questions asking for true named entities.</Paragraph>
<Paragraph position="8"> Tables 2 and 3 report the results of the automatic answer validation experiments obtained respectively on all the TREC-2001 questions and on the subset of named-entity questions (i.e. with definition and generic questions removed). For each estimation method we report precision, recall and success rate. Success rate best represents the performance of the system, being the percentage of [question, answer] pairs for which the result given by the system agrees with the TREC judges' opinion. Precision is the percentage of [question, answer] pairs estimated by the algorithm as relevant for which the opinion of the TREC judges was the same. Recall is the percentage of the relevant answers that the system also evaluates as relevant.</Paragraph>
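To make the decision rule and the evaluation figures above concrete, the following is a minimal Python sketch. It is an illustration only: REL_FRACTION and MIN_SCORE are placeholder constants (their actual values are not given in this section), the function names are our own, and flooring the relative threshold at the minimum value is just one possible reading of the condition described above, not necessarily the paper's exact procedure.

    REL_FRACTION = 0.5  # assumed fraction of the per-question maximum score
    MIN_SCORE = 0.1     # assumed minimum value for the relative threshold

    def validate(scored_answers):
        """scored_answers: non-empty list of (answer, validity_score) pairs
        for a single question. Returns a dict mapping each answer to
        True (accepted as relevant) or False (considered wrong)."""
        max_score = max(score for _, score in scored_answers)
        # Relative threshold: a fixed fraction of the best score,
        # never allowed to fall below the assumed minimum value.
        threshold = max(REL_FRACTION * max_score, MIN_SCORE)
        return {answer: score >= threshold for answer, score in scored_answers}

    def evaluate(system_judgements, trec_judgements):
        """Both arguments map (question, answer) pairs to True/False.
        Returns (success_rate, precision, recall) as defined in the text."""
        pairs = list(trec_judgements)
        agreements = sum(system_judgements[p] == trec_judgements[p] for p in pairs)
        accepted = [p for p in pairs if system_judgements[p]]   # judged relevant by the system
        relevant = [p for p in pairs if trec_judgements[p]]     # judged relevant by TREC judges
        success_rate = agreements / len(pairs)
        precision = sum(trec_judgements[p] for p in accepted) / len(accepted) if accepted else 0.0
        recall = sum(system_judgements[p] for p in relevant) / len(relevant) if relevant else 0.0
        return success_rate, precision, recall

    # Example: one question with three scored candidate answers (hypothetical values).
    decisions = validate([("Mount Everest", 0.9), ("K2", 0.2), ("Lhotse", 0.05)])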
<Paragraph position="9"> The best results on the 492-question corpus (CCP measure with relative threshold) show a success rate of 81.25%, i.e. in 81.25% of the pairs the system evaluation corresponds to the human evaluation, which confirms the initial working hypotheses. This is 28% above the baseline success rate. Precision and recall are respectively 20-30% and 68-87% above the baseline values. These results demonstrate that the intuition behind the approach is well founded and that the algorithm provides a workable solution for answer validation.</Paragraph>
<Paragraph position="10"> The experiments show that the average difference between the success rates obtained for the named-entity questions (Table 3) and for the full TREC-2001 question set (Table 2) is 5.1%. This means that our approach performs better when the answer entities are well specified.</Paragraph>
<Paragraph position="11"> Another conclusion is that the relative threshold outperforms the absolute threshold on both test sets (by 2.3% on average). However, if the percentage of right answers in the answer set is lower, the effectiveness of this approach may decrease.</Paragraph>
<Paragraph position="12"> The best results on both question sets are obtained by applying CCP. Such non-symmetric formulas might turn out to be more applicable in general. As corrected conditional probability (CCP) is not a classical co-occurrence measure like PMI and MLHR, we may consider its high performance as evidence of the difference between our task and classic co-occurrence mining. Another indication of this is the fact that MLHR and PMI achieve comparable performance, whereas in a classic co-occurrence setting MLHR would be expected to show a much better success rate. It seems that measures specific to question-answer co-occurrence mining still need to be developed.</Paragraph>
</Section>
</Paper>