File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1136_metho.xml

Size: 11,459 bytes

Last Modified: 2025-10-06 14:10:25

<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1136">
  <Title>Reranking Answers for Definitional QA Using Language Modeling</Title>
  <Section position="5" start_page="1081" end_page="1081" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> date answers were reranked based on their similarity (TFIDF score) to the centroid vector. Similar techniques were explored in (Blair-Goldensohn et al., 2003). In this paper, we explore the dependence among terms in centroid vector for improving the answer reranking for definitional QA.</Paragraph>
    <Paragraph position="1"> In recent years, language modeling has been widely employed in IR (Ponte and Croft, 1998; Song and Croft, 1998; Miller and Zhai, 1999; Lafferty and Zhai, 2001). The basic idea is to compute the conditional probability P(Q|D), i.e., the probability of generating a query Q given the observation of a document D. The searched documents are ranked in descending order of this probability.</Paragraph>
    <Paragraph position="2"> Song and Croft (1998) proposed a general language model to incorporate word dependence by using bigrams. Srikanth and Srihari (2002) introduced biterm language models similar to the bi-gram model except that the constraint of order in terms is relaxed and improved performance was observed. Gao et al. (2004) presented a new method of capturing word dependencies, in which they extended state-of-the-art language modeling approaches to information retrieval by introducing a dependence structure that learned from training data. Cao et al. (2005) proposed a novel dependence model to incorporate both relationships of WordNet and co-occurrence with the language modeling framework for IR. In our approach, we propose bigram and biterm models to capture the term dependence in centroid vector. Applying language modeling for the QA task has not been widely researched. Zhang D. and Lee (2003) proposed a method using language model for passage retrieval for the factoid QA.</Paragraph>
    <Paragraph position="3"> They trained two language models, in which one was the question-topic language model and the other was passage language model. They utilized the divergence between the two language models to rank passages. In this paper, we focus on reranking answers for definitional questions.</Paragraph>
    <Paragraph position="4"> As other ranking approaches, Xu, et al. (2005) formalized ranking definitions as classification problems, and Cui et al. (2004) proposed soft patterns to rank answers for definitional QA.</Paragraph>
  </Section>
  <Section position="6" start_page="1081" end_page="1084" type="metho">
    <SectionTitle>
3 Reranking Answers Using Language Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1081" end_page="1082" type="sub_section">
      <SectionTitle>
3.1 Model background
</SectionTitle>
      <Paragraph position="0"> In practice, language model is often approximated by N-gram models.</Paragraph>
      <Paragraph position="1"> Unigram:  The unigram model makes a strong assumption that each word occurs independently. The bigram model takes the local context into consideration. It has been proved to work better than the unigram language model in IR (e.g., Song and Croft, 1998).</Paragraph>
      <Paragraph position="2"> Biterm language models are similar to bigram language models except that the constraint of order in terms is relaxed. Therefore, a document containing information retrieval and a document containing retrieval (of) information will be assigned the same generation probability. The biterm probabilities can be approximated using the frequency of occurrence of terms.</Paragraph>
      <Paragraph position="3"> Three approximation methods were proposed in Srikanth and Srihari (2002). The so-called min-Adhoc approximation truly relaxes the constraint of word order and outperformed other two approximation methods in their experiments.</Paragraph>
    </Section>
    <Section position="2" start_page="1082" end_page="1082" type="sub_section">
      <SectionTitle>
3.2 Reranking based on language model
</SectionTitle>
      <Paragraph position="0"> In our approach, we adopt bigram and biterm language models. As a smoothing approach, linear interpolation of unigrams and bigrams is employed. null Given a candidate answer A=t1t2...ti...tn and a bigram or biterm back-off language model OC trained with the ordered centroid, the probability of generating A can be estimated by Equation (4).</Paragraph>
      <Paragraph position="2"> where OC stands for the language model of the ordered centroid and l is the mixture weight combining the unigram and bigram (or biterm) probabilities. After taking logarithm and exponential for Equation (4), we get Equation (5).</Paragraph>
      <Paragraph position="4"> We observe that this formula penalizes verbose candidate answers. This can be alleviated by adding a brevity penalty, BP, which is inspired by machine translation evaluation (Papineni et al., 2001).</Paragraph>
      <Paragraph position="5">  where Lref is a constant standing for the length of reference answer (i.e., centroid vector). LA is the length of the candidate answer. By combining Equation (5) and (6), we get the final scoring function.</Paragraph>
      <Paragraph position="7"/>
    </Section>
    <Section position="3" start_page="1082" end_page="1084" type="sub_section">
      <SectionTitle>
3.3 Parameter estimation
</SectionTitle>
      <Paragraph position="0"> In Equation (7), we need to estimate three parameters: P(ti|OC), P(ti|ti-1, OC) and l .</Paragraph>
      <Paragraph position="1"> For P(ti|OC), P(ti|ti-1, OC), maximum likelihood estimation (MLE) is employed.</Paragraph>
      <Paragraph position="3"> where CountOC(X) is the occurrences of the string X in the ordered centroid and NOC stands for the total number of tokens in the ordered centroid.</Paragraph>
      <Paragraph position="4"> For biterm language model, we use the above mentioned min-Adhoc approximation (Srikanth and Srihari, 2002).</Paragraph>
      <Paragraph position="5">  For unigram, we do not need smoothing because we only concern terms in the centroid vector. Recall that bigram and biterm probabilities have already been smoothed by interpolation. Thel can be learned from a training corpus using an Expectation Maximization (EM) algorithm. Specifically, we estimate l by maximizing the likelihood of all training instances, given the bigram or biterm model:</Paragraph>
      <Paragraph position="7"> affect l . l can be estimated using EM iterative procedure: 1) Initialize l to a random estimate between 0 and 1, i.e., 0.5; 2) Update l using:</Paragraph>
      <Paragraph position="9"> where INS denotes all training instances and |INS |gives the number of training instances which is used as a normalization factor. lj gives  the number of tokens in the jth instance in the training data; 3) Repeat Step 2 until l converges.</Paragraph>
      <Paragraph position="10">  We use the TREC 2004 test set3 as our training data and we set l as 0.4 for bigram model and 0.6 for biterm model according to the experimental results.</Paragraph>
      <Paragraph position="11">  We propose a three-stage approach for answer extraction. It involves: 1) learning a language model from the web; 2) adopting the language model to rerank candidate answers; 3) removing redundancies. Figure 1 shows five main modules. Learning ordered centroid: 1) Query expansion. Definitional questions are normally short (i.e., who is Bill Gates?). Query expansion is used to refine the query intention. First, reformulate query via simply adding clue words to the questions. i.e., for &amp;quot;Who is ...?&amp;quot; question, we add the word &amp;quot;biography&amp;quot;; and for &amp;quot;What is ...?&amp;quot; question, we add the word &amp;quot;is usually&amp;quot;, &amp;quot;refers to&amp;quot;, etc. We learn these clue words using the similar method proposed in (Ravichandran and Hovy, 2002). Second, query a web search engine (i.e., Google4) with reformulated query and learn top-R (we empirically set R=5) most frequent co-occurring terms with the target from returned snippets as query expansion terms; 2) Learning centroid vector (profile). We query Google again with the target and expanded terms learned in the previous step, download top-N (we empirically set N=500 based on the tradeoff between the snippet number and the time complexity) snippets, and split snippets into sentences. Then, we retain the generated sentences that contain the target, denoted as W. Finally, learn top- null occurring terms (stemmed) from W using Equation (15) (Cui et al., 2004) as the centroid vector. (13) )()1)(log()1)(log( )1),(log()( tidfTCounttCount TtCotWeight x+++ += where Co(t, T) denotes the number of sentences in which t co-occurs with the target T, and Count(t) gives the number of sentences containing the word t. We also use the inverse document frequency of t, idf(t) 5, as a measurement of the global importance of the word; 3) Extracting ordered centroid. For each sentence in W, we retain the terms in the centroid vector as the ordered centroid list. Words not contained in the centroid vector will be treated as the &amp;quot;stop words&amp;quot; and ignored.</Paragraph>
      <Paragraph position="12"> E.g., &amp;quot;Who is Aaron Copland?&amp;quot;, the ordered centroid list is shown below(where italics are extracted and put in the ordered centroid list):  1. Today's Highlight in History: On November 14, 1900, Aaron Copland, one of America's leading 20th century composers, was born in New York City. = November 14 1900 Aaron Copland America composer born New York City 2. ...</Paragraph>
      <Paragraph position="13"> Extracting candidate answers: We extract candidates from AQUAINT corpus.</Paragraph>
      <Paragraph position="14"> 1) Querying AQUAINT corpus with the target and retrieve relevant documents; 2) Splitting documents into sentences and ex- null tracting the sentences containing the target. Here in order to improve recall, simple heuristics rules are used to handle the problem of coreference resolution. If a sentence is deemed to contain the target and its next sentence starts with &amp;quot;he&amp;quot;, &amp;quot;she&amp;quot;, &amp;quot;it&amp;quot;, or &amp;quot;they&amp;quot;, then the next sentence is retained.</Paragraph>
      <Paragraph position="15"> Training language models: As mentioned above, we train language models using the obtained ordered centroid for each question.</Paragraph>
      <Paragraph position="16"> Answer reranking: Once the language models and the candidate answers are ready for a given question, candidate answers are reranked based on the probabilities of the language models generating candidate answers.</Paragraph>
      <Paragraph position="17"> Removing redundancies: Repetitive and similar candidate sentences will be removed. Given a reranked candidate answer set CA, redundancy removing is conducted as follows: 5 We use the statistics from British National Corpus (BNC) site to approximate words' IDF,  http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bncreadme.html. null  Step 1: Initially set the result A={}, and get top j=1 element from CA and then add it to A, j=2.</Paragraph>
      <Paragraph position="18"> Step 2: Get the jth element from CA, de null noted as CAj. Compute cosine similarity between CAj and each element i of A, which is expressed as sij. Then let sik=max{s1j, s2j, ..., sij}, if sik &lt; threshold (we set it to 0.75), then add j to the set A.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>