File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-2009_metho.xml
Size: 6,509 bytes
Last Modified: 2025-10-06 14:10:13
<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2009">
<Title>Answering the Question You Wish They Had Asked: The Impact of Paraphrasing for Question Answering</Title>
<Section position="3" start_page="33" end_page="34" type="metho">
<SectionTitle>3 Using Automatic Paraphrasing in Question Answering</SectionTitle>
<Paragraph position="0">We use a generic architecture (Figure 2) that treats a QA system as a black box invoked after a paraphrase generation module, a feature extraction module, and a paraphrase selection module are executed. The preprocessing modules identify a paraphrase of the original question, which could be the question itself, to send as input to the QA system.</Paragraph>
<Paragraph position="1">A key advantage of treating the core QA system as a black box is that the preprocessing modules can easily be applied to improve the performance of any QA system. (In our earlier experiments, we adopted an approach that combined the answers to all paraphrases through voting. These experiments proved unsuccessful: in most cases, the answer to the original question was amplified, both when right and wrong.) We described the paraphrase generation module in the previous section and will discuss the remaining two modules below.</Paragraph>
<Paragraph position="2">Feature Extraction Module. For each possible paraphrase, we compare it against the original question and compute the features shown in Table 1. These are a subset of the features that we have experimented with and have found to be meaningful for the task. All of these features are required in order not to lower the performance with respect to the original question. They are ordered by their relative contributions to the error rate reduction.</Paragraph>
<Paragraph position="3">Table 1. Features comparing a paraphrase against the original question, each with its rationale: (1) IDF sum: the sum of the IDF scores for all terms in the original question and the paraphrase; paraphrases with more informative terms for the corpus at hand should be preferred. (2) IDF-weighted distance: the distance between the vectors of both questions, IDF-weighted; certain paraphrases diverge too much from the original. (3) Answer types: whether the answer types, as predicted by our question analyzer, are the same or overlap; choosing a paraphrase that does not share an answer type with the original question is risky.</Paragraph>
<Paragraph position="4">Paraphrase Selection Module. To select a paraphrase, we used JRip, the Java re-implementation of Ripper (Cohen, 1996), a supervised rule learner in the Weka toolkit (Witten and Frank, 2000).</Paragraph>
<Paragraph position="5">We initially formulated paraphrase selection as a three-way classification problem, attempting to label each paraphrase as &quot;worse&quot; than, the &quot;same&quot; as, or &quot;better&quot; than the original question. Our objective was to replace the original question with a paraphrase labeled &quot;better.&quot; However, the priors for these classes are roughly 30% for &quot;worse,&quot; 65% for &quot;same,&quot; and 5% for &quot;better.&quot; Our empirical evidence shows that successfully pinpointing a &quot;better&quot; paraphrase improves the reciprocal rank for a question by 0.5 on average, while erroneously picking a &quot;worse&quot; paraphrase results in a 0.75 decrease. That is, errors are 1.5 times more costly than successes (and five times more likely). This strongly suggests that a high-precision algorithm is critical for this component to be effective.</Paragraph>
<Paragraph position="6">To increase precision, we took two steps. First, we trained a cascade of two binary classifiers. The first classifier separates &quot;worse&quot; from &quot;same or better,&quot; with a bias toward &quot;worse.&quot; The second classifier separates &quot;worse or same&quot; from &quot;better,&quot; now with a bias toward &quot;better.&quot; Second, we constrain the confidence of the classifier and accept a paraphrase only when the second classifier is 100% confident. These steps are necessary to avoid decreasing performance with respect to the original question, as we show in the next section.</Paragraph>
</Section>
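The feature extraction and cascaded selection steps above can be pictured with a short Python sketch. This is a minimal illustration, not the paper's implementation: it assumes pre-tokenized questions, a precomputed IDF table, answer-type sets already predicted by a question analyzer, and two pre-trained binary classifiers supplied as callables returning a (label, confidence) pair (the paper trains these with JRip in Weka). For simplicity, the selection step accepts the first paraphrase that clears both classifiers; all function and variable names are illustrative.

import math

def idf_sum(question_terms, paraphrase_terms, idf):
    """Sum of the IDF scores for all terms in the original question and the paraphrase."""
    return sum(idf.get(t, 0.0) for t in question_terms) + \
           sum(idf.get(t, 0.0) for t in paraphrase_terms)

def idf_weighted_distance(question_terms, paraphrase_terms, idf):
    """Distance between the IDF-weighted term vectors of the two questions
    (cosine distance is assumed here; the paper does not spell out the formula)."""
    vocab = set(question_terms) | set(paraphrase_terms)
    q = {t: idf.get(t, 0.0) * question_terms.count(t) for t in vocab}
    p = {t: idf.get(t, 0.0) * paraphrase_terms.count(t) for t in vocab}
    dot = sum(q[t] * p[t] for t in vocab)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in p.values()))
    return 1.0 - dot / norm if norm else 1.0

def answer_types_overlap(question_types, paraphrase_types):
    """Whether the predicted answer types of the two questions are the same or overlap."""
    return bool(set(question_types) & set(paraphrase_types))

def select_question(original, paraphrases, extract_features, clf_worse, clf_better):
    """Cascaded, high-precision selection: keep the original question unless a
    paraphrase survives the 'worse' filter and the second classifier labels it
    'better' with full confidence."""
    for paraphrase in paraphrases:
        feats = extract_features(original, paraphrase)
        label1, _ = clf_worse(feats)             # 'worse' vs. 'same-or-better'
        if label1 == "worse":
            continue
        label2, confidence = clf_better(feats)   # 'worse-or-same' vs. 'better'
        if label2 == "better" and confidence >= 1.0:
            return paraphrase
    return original

The 100% confidence gate on the second classifier is what keeps the module from replacing the original question unless the evidence for &quot;better&quot; is unanimous, which matches the high-precision requirement derived from the class priors.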
<Section position="4" start_page="34" end_page="35" type="metho">
<SectionTitle>4 Experimental Results</SectionTitle>
<Paragraph position="0">We trained the paraphrase selection module using our QA system, PIQUANT (Chu-Carroll et al., 2006). Our target corpus is the AQUAINT corpus, employed in the TREC QA track since 2002.</Paragraph>
<Paragraph position="1">As MT engines, we employed Babelfish and Google MT, rule-based systems developed by SYSTRAN and Google, respectively. We adopted different MT engines based on the hypothesis that differences in their translation rules would improve the effectiveness of the paraphrasing module.</Paragraph>
<Paragraph position="2">To measure performance, we trained and tested by cross-validation over 712 questions from the TREC 9 and 10 datasets. We paraphrased the questions using the four possible combinations of MT engines with up to 11 intermediate languages, obtaining a total of 15,802 paraphrases. These questions were then fed to our system and evaluated against the TREC answer keys. Running over the original questions, we obtained a baseline MRR (computed over the top five answers) of 0.345. An oracle run, in which the best paraphrase (or the original question) is always picked, would yield an MRR of 0.48. This potential increase is substantial, considering that a 35% improvement separated the tenth participant from the second in TREC-9. Our three-fold cross-validation using the features and algorithm described in Section 3 yielded an MRR of 0.347. Over the 712 questions, it replaced 14 with paraphrases; two of these replacements improved performance, and the rest left it unchanged. Random selection of a paraphrase, on the other hand, decreased performance to 0.156, clearly showing the importance of selecting a good paraphrase.</Paragraph>
</Section>
</Paper>
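The scores reported in Section 4 are mean reciprocal rank (MRR) computed over each run's top five answers. A minimal sketch of that metric follows; the is_correct judge stands in for the TREC answer keys, and all names are illustrative rather than taken from PIQUANT.

def reciprocal_rank(ranked_answers, is_correct, cutoff=5):
    """Return 1/rank of the first correct answer within the cutoff, or 0.0 if none is found."""
    for rank, answer in enumerate(ranked_answers[:cutoff], start=1):
        if is_correct(answer):
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_question_answers, is_correct, cutoff=5):
    """Average the reciprocal rank over all questions (e.g., the 712 TREC 9 and 10 questions)."""
    if not per_question_answers:
        return 0.0
    return sum(reciprocal_rank(a, is_correct, cutoff) for a in per_question_answers) / len(per_question_answers)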