File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1209_metho.xml
Size: 15,786 bytes
Last Modified: 2025-10-06 14:08:36
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1209"> <Title>Statistical QA - Classifier vs. Re-ranker: What's the difference?</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Statistical Answer Pinpointing </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Answer Modeling </SectionTitle>
<Paragraph position="0"> The answer-pinpointing module gets as input a question q and a set of possible answer candidates $\{a_1, a_2, \ldots, a_A\}$. It outputs one answer $a \in \{a_1, a_2, \ldots, a_A\}$ from the candidate answer set.</Paragraph>
<Paragraph position="1"> We consider two ways of modeling this problem. One approach is the traditional classification view (Ittycheriah, 2001), where we present each question-answer pair to a classifier, which classifies it as either a correct answer (true) or an incorrect answer (false), based on some evidence (features).</Paragraph>
<Paragraph position="2"> In this case, we model $P(c \mid a, \{a_1, \ldots, a_A\}, q)$. Here, $c \in \{true, false\}$ signifies the correctness of the answer a with respect to the question q. The probability $P(c \mid a, \{a_1, \ldots, a_A\}, q)$ for each QA pair is modeled independently of other such pairs. Thus, for the same question, many QA pairs are presented to the classifier as independent events (histories). If the training corpus contains Q questions with A answers for each question, the total number of events (histories) would be equal to $Q \times A$, with two classes (futures) for each event (correct or incorrect answer). Once the probabilities $P(c \mid a, \{a_1, \ldots, a_A\}, q)$ have been computed, the system has to return the best answer. The following decision rule is used: $$\hat{a} = \arg\max_{a \in \{a_1, \ldots, a_A\}} P(true \mid a, \{a_1, \ldots, a_A\}, q)$$ Another way of viewing the problem is as a re-ranking task. This is possible because the QA task requires the identification of only one correct answer, instead of identifying all the correct answers in the collection. In this case, we model $P(a \mid \{a_1, \ldots, a_A\}, q)$. If the training corpus contains Q questions with A answers for each question, the total number of events (histories) would be equal to Q, with A classes (futures). This view requires the following decision rule to identify the answer that seems most promising: $$\hat{a} = \arg\max_{a \in \{a_1, \ldots, a_A\}} P(a \mid \{a_1, \ldots, a_A\}, q)$$ where Q = total number of questions and A = total number of answer chunks considered for each question.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="21" type="sub_section"> <SectionTitle> 2.2 Maximum Entropy formulation </SectionTitle>
<Paragraph position="0"> We use Maximum Entropy to model the problem both as a classifier and as a re-ranker. We define M feature functions $f_m(a, \{a_1, \ldots, a_A\}, q)$, $m = 1, \ldots, M$, that may be useful in characterizing the task. Della Pietra et al. (1995) contains a good description of Maximum Entropy models.</Paragraph>
<Paragraph position="1"> We model the classifier as follows: $$P(c \mid a, \{a_1, \ldots, a_A\}, q) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_{m,c}\, f_m(a, \{a_1, \ldots, a_A\}, q)\right)}{\sum_{c' \in \{true, false\}} \exp\left(\sum_{m=1}^{M} \lambda_{m,c'}\, f_m(a, \{a_1, \ldots, a_A\}, q)\right)}$$ where $\lambda_{m,c}$, $m = 1, \ldots, M$, $c \in \{true, false\}$, are the model parameters.</Paragraph>
<Paragraph position="2"> The decision rule for choosing the best answer is: $$\hat{a} = \arg\max_{a \in \{a_1, \ldots, a_A\}} P(true \mid a, \{a_1, \ldots, a_A\}, q)$$ The above decision rule requires comparison of different probabilities of the form $P(true \mid a, \{a_1, \ldots, a_A\}, q)$. However, these probabilities are modeled as independent events (histories) in the classifier, and hence the training criterion does not make them directly comparable.</Paragraph>
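To make the classifier view concrete, the following is a minimal sketch (not the authors' implementation; the feature values and weights are hypothetical) of scoring each question-answer pair independently with class-specific weights $\lambda_{m,c}$ and then picking the candidate with the highest $P(true \mid a, \{a_1, \ldots, a_A\}, q)$ in a post-processing step:

```python
import math

# Illustrative sketch (not the authors' code) of the classifier view:
# each (question, candidate) pair is scored independently with
# class-specific weights lambda[m][c]; P(true | a, ...) values are then
# compared across candidates in a separate post-processing step.

CLASSES = ("true", "false")

def classifier_prob(features, weights, cls):
    """P(c | a, {a_1..a_A}, q) under a MaxEnt model with per-class weights.

    features: list of M feature values f_m(a, {a_1..a_A}, q)
    weights:  dict mapping class -> list of M weights lambda_{m,c}
    """
    scores = {c: sum(w * f for w, f in zip(weights[c], features)) for c in CLASSES}
    z = sum(math.exp(s) for s in scores.values())  # normalize over the two classes only
    return math.exp(scores[cls]) / z

def pick_answer(candidate_features, weights):
    """Decision rule: argmax over candidates of P(true | a, {a_1..a_A}, q)."""
    return max(range(len(candidate_features)),
               key=lambda i: classifier_prob(candidate_features[i], weights, "true"))

# Toy usage with two candidates and M = 2 features (hypothetical weights).
weights = {"true": [1.2, 0.8], "false": [-1.2, -0.8]}
candidates = [[0.3, 1.0], [1.7, 0.0]]
print(pick_answer(candidates, weights))
```

Because each pair is normalized only over {true, false}, the resulting probabilities are not trained to be comparable across candidates, which is exactly the issue noted above.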
<Paragraph position="3"> For the re-ranker, we model the probability $$P(a \mid \{a_1, \ldots, a_A\}, q) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m\, f_m(a, \{a_1, \ldots, a_A\}, q)\right)}{\sum_{a' \in \{a_1, \ldots, a_A\}} \exp\left(\sum_{m=1}^{M} \lambda_m\, f_m(a', \{a_1, \ldots, a_A\}, q)\right)}$$</Paragraph>
<Paragraph position="4"> Note that for the classifier the model parameters are $\lambda_{m,c}$, whereas for the re-ranker they are $\lambda_m$. This is because for the classifier, each feature function has a different weight associated with each class (future). Hence, the classifier has twice as many model parameters as the re-ranker.</Paragraph>
<Paragraph position="5"> The decision rule for the re-ranker is given by: $$\hat{a} = \arg\max_{a \in \{a_1, \ldots, a_A\}} P(a \mid \{a_1, \ldots, a_A\}, q)$$</Paragraph>
<Paragraph position="6"> The re-ranker makes the probabilities $P(a \mid \{a_1, \ldots, a_A\}, q)$ considered in the decision rule directly comparable against each other by incorporating them into the training criterion itself. Table 1 summarizes the differences between the two models.</Paragraph> </Section>
<Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.3 Feature Functions </SectionTitle>
<Paragraph position="0"> Using the above formulation to model the probability distribution, we need to define the features $f_m$. We use only four basic feature functions for our system; a rough sketch in code follows at the end of this section.</Paragraph>
<Paragraph position="1"> 1. Frequency: It has been observed that the correct answer has a higher frequency in the collection of answer chunks C (Magnini et al., 2002). Hence we count the number of times a potential answer occurs in the IR output and use its logarithm as a feature. This is a positive continuous-valued feature.</Paragraph>
<Paragraph position="2"> 2. Expected Answer Class: Most current QA systems employ some type of answer-class identification module. Thus a question like &quot;When did Bill Clinton go to college?&quot; would be identified as asking about a time (or a time period), and &quot;Where is the sea of tranquility?&quot; would be identified as asking for a location. This feature fires (i.e., it has a value of 1) if the answer class matches the expected answer class derived from the question by the answer identification module. Details of this module are explained in Hovy et al. (2002). This is a binary-valued feature.</Paragraph>
<Paragraph position="3"> 3. Question Word Absent: Usually a correct answer sentence contains a few of the question words. This feature fires if the candidate answer does not contain any of the question words. This is also a binary-valued feature.</Paragraph>
<Paragraph position="4"> 4. Word Match: The sum of the ITF values of the question words that match identically with words in the answer sentence. This is a positive continuous-valued feature.</Paragraph> </Section>
<Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.4 Training </SectionTitle>
<Paragraph position="0"> We train our Maximum Entropy models with the Generalized Iterative Scaling approach (Darroch and Ratcliff, 1972), using YASMET.</Paragraph> </Section> </Section>
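The sketch mentioned above is given here (illustrative only; the helper inputs such as the expected answer class and the ITF table are assumptions, not the paper's API). It shows rough analogues of the four features and the re-ranker's softmax, which normalizes over the candidate set $\{a_1, \ldots, a_A\}$ with a single weight $\lambda_m$ per feature:

```python
import math
from collections import Counter

# Minimal sketch (assumptions, not the authors' implementation) of the
# re-ranker view: one weight per feature, normalized over the candidate
# answers of a question rather than over {true, false}.

def feature_vector(answer, answer_class, question_words, expected_class,
                   answer_sentence_words, all_candidates, itf):
    """Rough analogues of the four features described above.

    question_words, answer_sentence_words: sets of lowercased tokens.
    itf: dict mapping word -> ITF weight (assumed precomputed).
    """
    freq = math.log(1 + Counter(all_candidates)[answer])           # 1. Frequency
    class_match = 1.0 if answer_class == expected_class else 0.0   # 2. Expected Answer Class
    qword_absent = 1.0 if not question_words.intersection(answer.split()) else 0.0  # 3. Question Word Absent
    word_match = sum(itf.get(w, 0.0)                               # 4. Word Match (ITF-weighted)
                     for w in question_words.intersection(answer_sentence_words))
    return [freq, class_match, qword_absent, word_match]

def rerank(candidate_features, lam):
    """P(a | {a_1..a_A}, q) via a softmax over candidates; returns the argmax index."""
    scores = [sum(l * f for l, f in zip(lam, feats)) for feats in candidate_features]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    return max(range(len(probs)), key=lambda i: probs[i])
```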
<Section position="5" start_page="21" end_page="21" type="metho"> <SectionTitle> 3 Evaluation Metric </SectionTitle>
<Paragraph position="0"> The performance of the QA system is highly dependent on the performance of the two individual modules, IR and answer pinpointing; the system performs well only if both are accurate. Hence, we need a good evaluation metric for each of these components individually. One standard metric for IR is recall and precision. We adapt it for QA as follows: it is almost impossible to measure recall because the IR collection is typically large, involving several hundreds of thousands of documents. Hence, we evaluate our IR by the precision measure at the top N segments only. This method is a rather crude approximation of the original recall and precision measures.</Paragraph>
<Paragraph position="1"> Questions with fewer correct answers in the collection would have a lower precision score than questions with many answers.</Paragraph>
<Paragraph position="2"> Similarly, it is unclear how one would evaluate questions with No Answer (NIL) in the collection using this metric; all such questions would have zero precision from the IR collection. The answer-pinpointing module is evaluated by checking whether the answer returned by the system at rank #1 is correct or incorrect with respect to the IR collection and the true answer. Hence, if the IR system fails to return even a single sentence that contains the correct answer for the given question, we do not penalize the answer-pinpointing module. It is again unclear how to evaluate questions with No Answer (NIL); for our experiments we attribute the error to the IR module. Finally, the combined system is evaluated using the standard technique, wherein the answer returned by the system at rank #1 is judged to be either correct or incorrect, and the average is taken over questions.</Paragraph> </Section>
<Section position="6" start_page="21" end_page="21" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.1 Framework: Information Retrieval </SectionTitle>
<Paragraph position="0"> For our experiments, we use the Web search engine AltaVista. For every question, we remove stop-words and present all other question words as a query to the search engine. The top relevant documents are downloaded. We apply a sentence segmenter and identify those sentences that have high ITF word overlap with the given question. The sentences are then re-ranked accordingly, and only the top K sentences (segments) are presented as the output of the IR system.</Paragraph> </Section>
<Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Candidate Answer Extraction </SectionTitle>
<Paragraph position="0"> For a given question, the IR returns the top K segments; for our experiments a segment consists of one sentence. We parse each sentence and obtain a set of chunks, where each chunk is a node of the parse tree. Each chunk is viewed as a potential answer. For our experiments we restrict the number of potential answers to at most 5000. We illustrate this process in Figure 1.</Paragraph>
<Paragraph position="1"> We use the TREC 9 and TREC 10 data sets for training and the TREC 11 data set for testing. We initially apply the IR step as described above and obtain a set of at most 5000 answers per question. For each such answer we use the pattern file supplied by NIST to tag answer chunks as either correct (1) or incorrect (0). This is a very noisy way of tagging data: in some cases, even though the answer chunk is tagged as correct, it may not be supported by the accompanying sentence, while in other cases a correct chunk may be graded as incorrect, since the pattern file is not an exhaustive list of answers.</Paragraph>
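A rough sketch of this pattern-based labeling (the pattern strings and the function names here are hypothetical; the NIST answer patterns are essentially regular expressions) illustrates why the labels are noisy: any chunk matching a pattern is marked correct, regardless of whether the accompanying sentence actually supports it.

```python
import re

# Sketch of the noisy pattern-file labeling described above (format and
# names are assumptions): each answer chunk is marked 1 if it matches any
# of the answer patterns (regexes) for its question, else 0.

def label_chunks(chunks, patterns):
    """Return noisy 0/1 labels for candidate answer chunks."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [1 if any(rx.search(chunk) for rx in compiled) else 0 for chunk in chunks]

# Toy usage with hypothetical patterns for a "When did ...?" question.
patterns = [r"\b1964\b", r"\bfall of 1964\b"]
chunks = ["in 1964", "Georgetown University", "he went to college"]
print(label_chunks(chunks, patterns))   # [1, 0, 0]
```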
<Paragraph position="2"> We set aside 20% of the training data for validation.</Paragraph> </Section>
<Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.2 Classifier vs. Re-Ranker </SectionTitle>
<Paragraph position="0"> We evaluate the performance of the QA system viewed as a classifier (with a post-processing step) and as a re-ranker. In order to do a fair evaluation, we test the performance of the QA system under varying conditions of the output of the IR system. The results are shown in Table 3.</Paragraph>
<Paragraph position="1"> The results should be read in the following way: we use the same IR system, but during each run of the experiment we consider only the top K sentences returned by the IR system, K = {1, 10, 50, 100, 150, 200}. The column &quot;Correct&quot; represents the number of questions the entire QA (IR + re-ranker) system answered correctly. &quot;IR Loss&quot; represents the fraction of questions for which the IR failed completely (i.e., the IR did not return even a single sentence that contains the correct answer). &quot;IR Precision&quot; is the precision of the IR system for the number of sentences considered. Answer-pinpointing performance is based on the metric described above. Finally, the overall score is the score of the entire QA system (i.e., precision at rank #1).</Paragraph>
<Paragraph position="2"> The &quot;Overall Precision&quot; column indicates that the re-ranker clearly outperforms the classifier. It is also instructive to compare the re-ranker's &quot;Overall Precision&quot; with its &quot;Answer-Pinpointing Precision&quot;. For example, in the last row, the re-ranker's &quot;Answer-Pinpointing Precision&quot; is 0.5182 whereas its &quot;Overall Precision&quot; is only 0.34. The difference is due to the poor performance of the IR system (&quot;IR Loss&quot; = 0.344).</Paragraph> </Section>
<Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.3 Oracle IR system </SectionTitle>
<Paragraph position="0"> In order to determine the performance of the answer-pinpointing module alone, we perform a so-called oracle IR experiment. Here, we present to the answer-pinpointing module only those sentences from the IR output that contain an answer (obtained by extracting all sentences judged by human evaluators to contain the correct answer during the TREC 2002 evaluations).</Paragraph>
<Paragraph position="1"> The task of the answer-pinpointing module is to pick out the correct answer from the given collection. We report results in Table 4. In these results too, the re-ranker performs better than the classifier. However, as the results show, there is a lot of room for improvement in the re-ranker, even with a perfect IR system.</Paragraph> </Section> </Section>
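As a sketch of how the columns in Tables 3 and 4 relate (the record format here is an assumption, not the paper's data format), note that the answer-pinpointing module is scored only on questions where the IR returned an answer-bearing sentence, so Overall Precision is roughly Answer-Pinpointing Precision × (1 − IR Loss); e.g., 0.5182 × (1 − 0.344) ≈ 0.34, matching the last row discussed above.

```python
# Illustrative sketch of the three scores and how they interact.
# Each record is assumed to hold two per-question judgments.

def scores(records):
    """records: list of dicts with keys 'ir_has_answer' and 'top1_correct'."""
    total = len(records)
    ir_ok = [r for r in records if r["ir_has_answer"]]
    ir_loss = 1 - len(ir_ok) / total                      # fraction of questions IR missed entirely
    ap_precision = (sum(r["top1_correct"] for r in ir_ok) / len(ir_ok)) if ir_ok else 0.0
    overall = sum(r["top1_correct"] for r in records) / total
    return {"IR Loss": ir_loss,
            "Answer-Pinpointing Precision": ap_precision,
            "Overall Precision": overall}

# Toy example: 4 questions, IR fails on one of them.
recs = [{"ir_has_answer": True,  "top1_correct": 1},
        {"ir_has_answer": True,  "top1_correct": 0},
        {"ir_has_answer": True,  "top1_correct": 1},
        {"ir_has_answer": False, "top1_correct": 0}]
print(scores(recs))   # IR Loss 0.25, AP precision 0.667, Overall 0.5
```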
<Section position="7" start_page="21" end_page="21" type="metho"> <SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> Our experiments clearly indicate that the QA system viewed as a re-ranker outperforms the QA system viewed as a classifier. The difference stems from the following reasons:</Paragraph>
<Paragraph position="1"> 1. The classification training criterion works on a more difficult objective: deciding whether each candidate answer answers the given question, as opposed to finding the best answer for the given question. Hence, the same feature set that works for the re-ranker need not work for the classifier. The feature set used in this problem is not good enough to help the classifier distinguish between correct and incorrect answers for the given question (even though it is good enough for the re-ranker to come up with the best answer).</Paragraph>
<Paragraph position="2"> 2. The comparison of probabilities across different events (histories) for the classifier, during the decision-rule step, is problematic. This is because the probabilities obtained from the classification approach are only a poor estimate of the true probability. The re-ranker, however, makes these probabilities directly comparable by incorporating them into the model itself.</Paragraph>
<Paragraph position="3"> 3. The QA system viewed as a classifier suffers from the problem of a highly unbalanced data set: less than 1% positive examples and more than 99% negative examples (almost 4 million training events). Ittycheriah (2001) and Ittycheriah and Roukos (2002) use a more controlled environment for training their system, with 23% positive and 77% negative examples. They prune out most of the incorrect answers initially in a pre-processing step, using either a rule-based system (Ittycheriah, 2001) or a statistical system (Ittycheriah et al., 2002), and hence obtain a much more manageable distribution in the training phase of the Maximum Entropy model.</Paragraph> </Section> </Paper>