<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1003"> <Title>A Noisy-Channel Approach to Question Answering</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 A noisy-channel for QA </SectionTitle> <Paragraph position="0"> Assume that we want to explain why &quot;1977&quot; in sentence S in Figure 1 is a good answer for the question &quot;When did Elvis Presley die?&quot; To do this, we build a noisy channel model that makes explicit how answer sentence parse trees are mapped into questions. Consider, for example, the automatically derived answer sentence parse tree in Figure 1, which associates to nodes both syntactic and shallow semantic, named-entity-specific tags. In order to rewrite this tree into a question, we assume the following generative story: 1. In general, answer sentences are much longer than typical factoid questions. To reduce the length gap between questions and answers and to increase the likelihood that our models can be adequately trained, we first make a &quot;cut&quot; in the answer parse tree and select a sequence of words, syntactic, and semantic tags. The &quot;cut&quot; is made so that every word in the answer sentence or one of its ancestors belongs to the &quot;cut&quot; and no two nodes on a path from a word to the root of the tree are in the &quot;cut&quot;. Figure 1 depicts graphically such a cut.</Paragraph> <Paragraph position="1"> 2. Once the &quot;cut&quot; has been identified, we mark one of its elements as the answer string. In Figure 1, we decide to mark DATE as the answer string (A_DATE).</Paragraph> <Paragraph position="2"> 3. There is no guarantee that the number of words in the cut and the number of words in the question match. To account for this, we stochastically assign to every element s i in a cut a fertility according to table n(ph |s i ). We delete elements of fertility 0 and duplicate elements of fertility 2, etc. With probability p we also increment the fertility of an invisible word NULL. NULL and fertile words, i.e.</Paragraph> <Paragraph position="3"> words with fertility strictly greater than 1 enable us to align long questions with short answers. Zero fertility words enable us to align short questions with long answers.</Paragraph> <Paragraph position="4"> 4. Next, we replace answer words (including the NULL word) with question words according to the table t(q</Paragraph> <Paragraph position="6"> 5. In the last step, we permute the question words according to a distortion table d, in order to obtain a well-formed, grammatical question.</Paragraph> <Paragraph position="7"> The probability P(Q |S A ) is computed by multiplying the probabilities in all the steps of our generative story (Figure 1 lists some of the factors specific to this computation.) The readers familiar with the statistical machine translation (SMT) literature should recognize that steps 3 to 5 are nothing but a one-to-one reproduction of the generative story proposed in the SMT context by Brown et al. (see Brown et al., 1993 for a detailed mathematical description of the model and the formula for computing the probability of an alignment and target string given a source string). To simplify our work and to enable us exploit existing off-the-shelf software, in the experiments we carried out in conjunction with this paper, we assumed a flat distribution for the two steps in our The distortion probabilities depicted in Figure 1 are a simplification of the distortions used in the IBM Model 4 model by Brown et al. (1993). 
<Paragraph position="6"> 3 Generating training and testing material </Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Generating training cases </SectionTitle>
<Paragraph position="0"> Assume that the question-answer pair in Figure 1 appears in our training corpus. When this happens, we know that 1977 is the correct answer. To generate a training example from this pair, we tokenize the question, parse the answer sentence, identify the question terms and the answer in the parse tree, and then make a &quot;cut&quot; in the tree that satisfies the following conditions:
a) Terms overlapping with the question are preserved as surface text.
b) The answer is reduced to its semantic or syntactic class, prefixed with the symbol &quot;A_&quot;.
c) Non-leaf nodes that have no question-term or answer offspring are reduced to their semantic or syntactic class.
d) All remaining nodes (leaves) are preserved as surface text.</Paragraph>
<Paragraph position="1"> Condition a) ensures that the question terms can be identified in the sentence. Condition b) helps the model learn answer types. Condition c) brings the sentence closer to the question by compacting portions of the tree that are syntactically far from the question terms and the answer. Finally, condition d) is motivated by the importance of lexical cues around the question terms and the answer. For the question-answer pair in Figure 1, the algorithm above generates a training example that pairs the question &quot;When did Elvis Presley die?&quot; with the flattened answer sequence produced by the cut depicted in Figure 1.</Paragraph>
<Paragraph position="2"> Our algorithm for generating training pairs implements deterministically the first two steps of our generative story. The algorithm is constructed so as to be consistent with our intuition that a generative process that makes the question and answer look as similar as possible is most likely to enable us to learn a useful model. Each question-answer pair results in one training example. It is the examples generated through this procedure that we use to estimate the parameters of our model.</Paragraph> </Section>
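A minimal sketch of the cut procedure behind conditions a)-d), under assumed inputs: the parse tree is a nested (label, children) structure whose leaves are word strings, the question terms and answer words are given as sets, and node labels already carry the syntactic/semantic classes. The tree encoding and label names are hypothetical, not the paper's actual data structures; the test cases of Section 3.2 reuse the same flattening with each eligible node marked as the answer in turn.

```python
def leaves_under(node):
    """Return the set of leaf words dominated by this node."""
    if isinstance(node, str):
        return {node}
    _, children = node
    words = set()
    for child in children:
        words |= leaves_under(child)
    return words

def cut(node, question_terms, answer_words):
    """Flatten an answer parse tree into a training cut per conditions a)-d)."""
    if isinstance(node, str):
        return [node]                          # a)/d): surface text is preserved
    label, children = node
    words = leaves_under(node)
    if words <= answer_words:
        return ["A_" + label]                  # b): the answer becomes its class, prefixed A_
    if not words & (question_terms | answer_words):
        return [label]                         # c): unrelated subtree reduced to its class
    flat = []
    for child in children:                     # otherwise keep descending
        flat += cut(child, question_terms, answer_words)
    return flat

# Toy parse of "Presley died of heart disease in 1977" (labels are illustrative).
tree = ("SNT", [
    ("PERSON", ["Presley"]),
    ("VP", [
        "died",
        ("PP", ["of", ("NP", ["heart", "disease"])]),
        ("PP", ["in", ("DATE", ["1977"])]),
    ]),
])

print(cut(tree, question_terms={"Presley", "died"}, answer_words={"1977"}))
# -> ['Presley', 'died', 'PP', 'in', 'A_DATE']
```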
<Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Generating test cases </SectionTitle>
<Paragraph position="0"> Assume now that the sentence in Figure 1 is returned by an IR engine as a potential candidate for finding the answer to the question &quot;When did Elvis Presley die?&quot; In this case, we do not know what the answer is, so we assume that any semantic/syntactic node in the answer sentence can be the answer, with the exception of the nodes that subsume question terms and stop words. Given a question and a potential answer sentence, we therefore generate an exhaustive set of question-answer test cases, each test case labeling as the answer (A_) a different syntactic/semantic node.</Paragraph>
<Paragraph position="1"> The test cases we consider for the question-answer pair in Figure 1 all pair the question &quot;When did Elvis Presley die?&quot; with a flattened version of the answer sentence in which a different node is marked as the answer.</Paragraph> </Section> </Section>
<Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Training Data </SectionTitle>
<Paragraph position="0"> For training, we use three different data sets. (i) The TREC9-10 set consists of the questions used at TREC 9 and 10. We automatically generate answer-tagged sentences using the TREC 9 and 10 judgment sets, which are lists of answer-document pairs evaluated as either correct or wrong. For every question, we first identify in the judgment sets a list of documents containing the correct answer. For every document, we keep only the sentences that overlap with the question terms and contain the correct answer. (ii) In order to have more variation among the sentences containing the answer, we automatically extended the first data set using the Web. For every TREC9-10 question/answer pair, we used our Web-based IR engine to retrieve sentences that overlap with the question terms and contain the answer. We call this data set TREC9-10Web. (iii) The third data set consists of 2381 question/answer pairs collected from http://www.quiz-zone.co.uk. We used the same method to automatically enhance this set by retrieving from the Web sentences containing answers to the questions. We call this data set Quiz-Zone. Table 1 shows the sizes of the three data sets.</Paragraph>
<Paragraph position="1"> To train our QA noisy-channel model, we apply the algorithm described in Section 3.1 to generate training cases for all QA pairs in the three corpora. To help our model learn that it is desirable to copy answer words into the question, we add to each corpus a list of identical dictionary word pairs w_i - w_i. For each corpus, we use GIZA (Al-Onaizan et al., 1999), a publicly available SMT package that implements the IBM models (Brown et al., 1993), to train a QA noisy-channel model that maps flattened answer parse trees, obtained using the &quot;cut&quot; procedure described in Section 3.1, into questions.</Paragraph> </Section>
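The Web-based expansion in (ii) and (iii) amounts to a simple filter over retrieved sentences: keep those that overlap with the question terms and contain the answer string. The sketch below makes that filter concrete under assumptions the paper does not spell out (a toy stopword list, a one-term overlap threshold, exact string match for the answer); the retrieved sentences here are invented.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "did", "do", "was", "is",
              "when", "where", "who", "what", "how", "to"}

def content_terms(text):
    """Lowercased tokens minus stop words."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOP_WORDS}

def harvest_training_sentences(question, answer, retrieved_sentences, min_overlap=1):
    """Keep sentences that overlap with the question terms and contain the answer."""
    q_terms = content_terms(question)
    kept = []
    for sentence in retrieved_sentences:
        overlap = len(q_terms & content_terms(sentence))
        if overlap >= min_overlap and answer.lower() in sentence.lower():
            kept.append(sentence)
    return kept

# Hypothetical sentences retrieved for one TREC-style question/answer pair.
retrieved = [
    "Elvis Presley died of a heart attack in 1977 at Graceland.",
    "Graceland attracts thousands of visitors every year.",
    "Presley was born in 1935.",
]
print(harvest_training_sentences("When did Elvis Presley die?", "1977", retrieved))
# -> only the first sentence survives the filter
```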
<Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 Test Data </SectionTitle>
<Paragraph position="0"> We used two different data sets for testing. The first set consists of the 500 questions used at TREC 2002; the second set consists of 500 questions that were randomly selected from the Knowledge Master (KM) repository (http://www.greatauk.com). The KM questions tend to be longer and quite different in style from the TREC questions.</Paragraph> </Section>
<Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.3 A noisy-channel-based QA system </SectionTitle>
<Paragraph position="0"> Our QA system is straightforward. It has only two modules: an IR module and an answer-identifier/ranker module. The IR module is the same one we used in our previous TREC participations.</Paragraph>
<Paragraph position="1"> Like the learner, the answer-identifier/ranker module is built on publicly available software: the GIZA package can be configured to automatically compute the probability of the Viterbi alignment between a flattened answer parse tree and a question.</Paragraph>
<Paragraph position="2"> For each test question, we automatically generate a web query and use the top 300 answer sentences returned by our IR engine to look for an answer.</Paragraph>
<Paragraph position="3"> For each question Q and for each answer sentence S_i, we use the algorithm described in Section 3.2 to exhaustively generate all (Q, S_i,A_i,j) pairs, where A_i,j ranges over the candidate answer labelings of sentence S_i. We thus examine all syntactic constituents in a sentence and use GIZA to assess their likelihood of being a correct answer. We select the answer A_i,j that is assigned the highest probability across all the answer sentences in the list retrieved by the IR module. Figure 3 depicts this process.</Paragraph> </Section>
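The answer-identification step is essentially an argmax over model scores. The sketch below shows the shape of that loop; `toy_score`, the candidate tuples, and their labels stand in for the GIZA Viterbi-alignment probabilities and the cuts produced as in Section 3.2, and are not the system's real interfaces.

```python
def rank_candidate_answers(question, candidates, score):
    """Score every (flattened cut, marked answer string) candidate and sort best-first.

    `score(question, cut)` stands in for the probability of the Viterbi alignment
    that GIZA computes for the pair; here it can be any callable returning a float."""
    scored = [(score(question, cut), answer, cut) for cut, answer in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Hypothetical candidates produced by the Section 3.2 procedure for one sentence:
# each entry is (flattened cut, surface string of the node marked as the answer).
candidates = [
    (["PERSON", "died", "PP", "in", "A_DATE"], "1977"),
    (["A_PERSON", "died", "PP", "in", "DATE"], "Presley"),
]

# Toy scorer: prefer cuts whose answer tag is compatible with a "when" question.
def toy_score(question, cut):
    return 1.0 if "when" in question and "A_DATE" in cut else 0.1

question = ["when", "did", "elvis", "presley", "die", "?"]
best_score, best_answer, _ = rank_candidate_answers(question, candidates, toy_score)[0]
print(best_answer, best_score)  # -> 1977 1.0
```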
<Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.4 Experimental Results </SectionTitle>
<Paragraph position="0"> We evaluate the results by automatically computing the mean reciprocal rank (MRR), using the TREC 2002 answer patterns for the TREC 2002 test set and the original answers for the KM test set. Our baseline is a state-of-the-art QA system, QA-base, which was ranked between second and seventh at TREC over the last three years. To ensure a fair comparison, we use the same Web-based IR system in all experiments, with no answer retrofitting. For the same reason, we use the QA-base system with its post-processing module disabled. (This module re-ranks the answers produced by QA-base on the basis of their redundancy, frequency on the web, etc.) Table 2 summarizes the results for the different combinations of training and test sets.</Paragraph>
<Paragraph position="1"> For the TREC 2002 corpus, the relatively low MRRs are due to the small answer coverage of the TREC 2002 patterns. For the KM corpus, the relatively low MRRs are explained by two factors: (i) for this corpus, each evaluation pattern consists of only one string, the original answer; (ii) the KM questions are more complex than TREC questions (e.g., &quot;What piece of furniture is associated with Modred, Percival, Gawain, Arthur, and Lancelot?&quot;).</Paragraph>
<Paragraph position="2"> It is interesting to see that using only the TREC9-10 data for training (system A in Table 2), we are able to beat the baseline when testing on TREC 2002 questions; however, this is not true when testing on KM questions. This can be explained by the fact that the TREC9-10 training set is similar to the TREC 2002 test set, while it is significantly different from the KM test set. We also notice that expanding the training data to TREC9-10Web (system B) and then to Quiz-Zone (system C) improved the performance on both test sets, which confirms that both the variability across answer-tagged sentences (TREC9-10Web) and the abundance of distinct questions (Quiz-Zone) contribute to the diversity of a QA training corpus, and implicitly to the performance of our system.</Paragraph> </Section> </Section>
<Section position="5" start_page="1" end_page="2" type="metho"> <SectionTitle> 5 Framework flexibility </SectionTitle>
<Paragraph position="0"> Another characteristic of our framework is its flexibility. We can easily extend it to incorporate other question-answering resources and techniques that have been employed in state-of-the-art QA systems. In the rest of this section, we assess the impact of such resources and techniques in the context of three case studies.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 5.1 Statistical-based &quot;Reasoning&quot; </SectionTitle>
<Paragraph position="0"> The LCC TREC-2002 QA system (Moldovan et al., 2002) implements a reasoning mechanism for justifying answers. In the LCC framework, questions and answers are first mapped into logical forms. A resolution-based module then proves that the question logically follows from the answer, using a set of axioms that are automatically extracted from the WordNet glosses. For example, to prove the logical form of &quot;What is the age of our solar system?&quot; from the logical form of the answer &quot;The solar system is 4.6 billion years old.&quot;, the LCC theorem prover shows that the atomic formula that corresponds to the question term &quot;age&quot; can be inferred from the atomic formula that corresponds to the answer term &quot;old&quot;, using an axiom that connects &quot;old&quot; and &quot;age&quot;; this axiom exists because the WordNet gloss for &quot;old&quot; contains the word &quot;age&quot;. Similarly, the LCC system can prove that &quot;Voting is mandatory for all Argentines aged over 18&quot; provides a good justification for the question &quot;What is the legal age to vote in Argentina?&quot; because it can establish, through logical deduction using axioms induced from WordNet glosses, that &quot;legal&quot; is related to &quot;rule&quot;, which in turn is related to &quot;mandatory&quot;; that &quot;age&quot; is related to &quot;aged&quot;; and that &quot;Argentine&quot; is related to &quot;Argentina&quot;. It is not difficult to see that these logical relations can be represented graphically as alignments between question and answer terms (see Figure 4).</Paragraph>
<Paragraph position="1"> The exploitation of WordNet synonyms, which is part of many QA systems (Hovy et al., 2001; Prager et al., 2001; Pasca and Harabagiu, 2001), is a particular case of building such alignments between question and answer terms. For example, using WordNet synonymy relations, it is possible to establish a connection between &quot;U.S.&quot; and &quot;United States&quot; and between &quot;buy&quot; and &quot;purchase&quot; in the question-answer pair in Figure 5, thus increasing the confidence that the sentence contains a correct answer. [Figure 4: alignment between &quot;What is the legal age to vote in Argentina?&quot; and &quot;Voting is mandatory for all Argentines aged over 18&quot;. Figure 5: alignment between &quot;What year did the U.S. buy Alaska?&quot; and &quot;In 1867, Secretary of State William H. Seward arranged for the United States to purchase Alaska for 2 cents per acre.&quot;]</Paragraph>
<Paragraph position="2"> The noisy-channel framework we proposed in this paper can approximate the reasoning mechanism employed by LCC and accommodate the exploitation of gloss- and synonymy-based relations found in WordNet. In fact, if we had a very large training corpus, we would expect such connections to be learned automatically from the data. However, since we have a relatively small training corpus available, we rewrite the WordNet glosses into a dictionary by creating word-pair entries that connect every WordNet word to the content words in its glosses. For example, from the word &quot;age&quot; and its gloss &quot;a historic period&quot;, we create the dictionary entries &quot;age - historic&quot; and &quot;age - period&quot;. To exploit synonymy relations, for every WordNet synset we add to our training data all possible combinations of synonym pairs W_i - W_j. Our dictionary creation procedure is a crude version of the axiom extraction algorithm described by Moldovan et al. (2002), and our exploitation of the glosses in the noisy-channel framework amounts to a simplified, statistical version of the semantic proofs implemented by LCC. Table 3 shows the impact of WordNet synonyms (WNsyn) and WordNet glosses (WNgloss) on our system. Adding WordNet synonyms and glosses slightly improved the performance on the KM questions. On the other hand, it is surprising to see that the performance dropped when testing on TREC 2002 questions.</Paragraph> </Section>
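A rough sketch of this dictionary construction. It uses NLTK's WordNet interface purely for illustration; the paper does not say which WordNet API, tokenizer, or stopword list was used, so those details are assumptions here.

```python
# Assumes: pip install nltk, plus a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

STOP_WORDS = {"a", "an", "the", "of", "or", "and", "in", "to", "is", "that", "for", "with"}

def gloss_pairs(word):
    """Word-pair entries linking a word to the content words of its WordNet glosses."""
    pairs = set()
    for synset in wn.synsets(word):
        for token in synset.definition().lower().split():
            token = token.strip(".,;:()")
            if token and token not in STOP_WORDS and token != word:
                pairs.add((word, token))
    return pairs

def synonym_pairs(word):
    """All ordered combinations of synonym pairs from the synsets containing the word."""
    pairs = set()
    for synset in wn.synsets(word):
        lemmas = [name.replace("_", " ") for name in synset.lemma_names()]
        for a in lemmas:
            for b in lemmas:
                if a != b:
                    pairs.add((a, b))
    return pairs

# Exact output depends on the installed WordNet version.
print(sorted(gloss_pairs("age"))[:10])
print(sorted(synonym_pairs("age"))[:10])
```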
<Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 5.2 Question reformulation </SectionTitle>
<Paragraph position="0"> Hermjakob et al. (2002) showed that reformulations (syntactic and semantic) improve the answer-pinpointing process in a QA system. To make use of this technique, we extend our training data set by expanding every question-answer pair with the reformulations of its question. (We are grateful to Ulf Hermjakob for sharing his reformulations with us.) We also expand in a similar way the answer candidates in the test corpus. Using reformulations improved the performance of our system on the TREC 2002 test set, while it was not beneficial for the KM test set (see Table 4). We believe this is explained by the fact that the reformulation engine was fine-tuned on TREC-specific questions, which are significantly different from KM questions.</Paragraph> </Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.3 Exploiting data in structured and semi-structured databases </SectionTitle>
<Paragraph position="0"> Structured and semi-structured databases have proved to be very useful for question-answering systems. Lin (2002) showed through his federated approach that 47% of the TREC-2001 questions could be answered using Web-based knowledge sources. Clarke et al. (2001) obtained a 30% improvement by using an auxiliary database created from web documents as an additional resource. We adopted a different approach to exploiting external knowledge bases.</Paragraph>
<Paragraph position="1"> In our work, we first generated a natural language collection of factoids by mining different structured and semi-structured databases (World Fact Book, Biography.com, WordNet, etc.). The generation is based on manually written question-factoid template pairs, which are applied to the different sources to yield simple natural language question-factoid pairs. One such pair, for example, uses the factoid template &quot;_p died of causeDeath(_p)&quot;. Using extraction patterns (Muslea, 1999), we apply these templates to the World Fact Book database and to biography.com pages to instantiate question and answer-tagged sentence pairs with factoids such as &quot;Jean-Paul Sartre died of a lung ailment.&quot;</Paragraph>
<Paragraph position="2"> These question-factoid pairs are useful both in training and testing. In training, we simply add all these pairs to the training data set. In testing, for every question Q, we select factoids that overlap sufficiently with Q as sentences that potentially contain the answer. For example, given the question &quot;Where was Sartre born?&quot;, we select factoids such as &quot;Jean-Paul Sartre died of a lung ailment.&quot;</Paragraph>
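Both halves of this pipeline, template instantiation and overlap-based factoid selection, fit in a few lines. In the sketch below the record fields, the question templates, and the overlap heuristic are invented for illustration; the actual system relies on manually written template pairs and extraction patterns (Muslea, 1999) applied to the real sources.

```python
# Hypothetical records mined from a biography-style source; the field names are invented.
records = [
    {"p": "Jean-Paul Sartre", "causeDeath": "a lung ailment", "birthplace": "Paris"},
]

# Question-factoid template pairs (the question templates here are invented examples).
templates = [
    ("What did {p} die of?", "{p} died of {causeDeath}."),
    ("Where was {p} born?", "{p} was born in {birthplace}."),
]

def instantiate(records, templates):
    """Yield (question, factoid) pairs by filling the templates with record fields."""
    for record in records:
        for q_tpl, f_tpl in templates:
            yield q_tpl.format(**record), f_tpl.format(**record)

def select_factoids(question, factoids, min_overlap=1):
    """Pick factoids sharing enough non-trivial words with the question."""
    q_words = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    return [f for f in factoids
            if len(q_words & {w.strip("?.,").lower() for w in f.split()}) >= min_overlap]

pairs = list(instantiate(records, templates))
factoids = [factoid for _, factoid in pairs]
print(pairs)
print(select_factoids("Where was Sartre born?", factoids))  # both Sartre factoids overlap
```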
<Paragraph position="3"> Up to now, we have collected about 100,000 question-factoid pairs. We found that these pairs cover only 24 of the 500 TREC 2002 questions. Therefore, in order to evaluate the value of these factoids, we re-ran our system C on these 24 questions, and we also used the question-factoid pairs as the only resource for both training and testing, as described earlier (system D). Table 5 shows the MRRs of systems C and D on the 24 questions covered by the factoids.</Paragraph>
<Paragraph position="4"> It is very interesting to see that system D significantly outperforms system C. This shows that, in our framework, we do not need any additional machinery (question classifiers, answer type identifiers, wrapper selectors, SQL query generators, etc.) in order to benefit from external databases. All we need is a one-time conversion of external structured resources to simple natural language factoids. The results in Table 5 also suggest that collecting natural language factoids is a useful research direction: if we collected all the factoids in the world, we could probably achieve much higher MRR scores on the entire TREC collection.</Paragraph> </Section> </Section> </Paper>