Language Model Information Retrieval with Document Expansion

2 Document Expansion Retrieval Model

2.1 The KL-divergence retrieval model

We first briefly review the KL-divergence retrieval model, on which we will develop the document expansion technique. The KL-divergence model is a representative state-of-the-art language modeling approach to retrieval. It covers the basic language modeling approach (i.e., the query likelihood method) as a special case and can support feedback more naturally.

In this approach, a query and a document are assumed to be generated from a unigram query language model \theta_Q and a unigram document language model \theta_D, respectively. Given a query and a document, we first compute an estimate of the corresponding query model (\hat{\theta}_Q) and document model (\hat{\theta}_D), and then score the document w.r.t. the query based on the KL-divergence of the two models (Lafferty and Zhai, 2001):

    D(\hat{\theta}_Q \| \hat{\theta}_D) = \sum_{w \in V} P(w|\hat{\theta}_Q) \log \frac{P(w|\hat{\theta}_Q)}{P(w|\hat{\theta}_D)},

where V is the set of all the words in our vocabulary. The documents can then be ranked in ascending order of the KL-divergence values.

Clearly, the two fundamental problems in such a model are to estimate the query model and the document model, and the accuracy of these estimates affects the retrieval performance significantly. The estimation of the query model can often be improved by exploiting the local corpus structure in a way similar to pseudo-relevance feedback (Lafferty and Zhai, 2001; Lavrenko and Croft, 2001; Zhai and Lafferty, 2001a). The estimation of the document model is most often done through smoothing with the global collection language model (Zhai and Lafferty, 2001b), though recently there has been some work on using clusters for smoothing (Liu and Croft, 2004). Our work mainly extends the previous work on document smoothing and improves the accuracy of estimation by better exploiting the local corpus structure. We now discuss all of this in detail.
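To make the scoring concrete, the following minimal sketch (an illustration, not code from the paper) ranks smoothed document models against a query model by the negative KL-divergence; the toy models and vocabulary are assumptions of the example.

    import math

    def kl_divergence_score(query_model, doc_model, vocabulary):
        """Return -D(theta_Q || theta_D); ranking by this score in descending
        order is equivalent to ranking by KL-divergence in ascending order."""
        score = 0.0
        for w in vocabulary:
            p_q = query_model.get(w, 0.0)
            if p_q > 0.0:
                # doc_model must be smoothed so that P(w|theta_D) > 0 here.
                score -= p_q * math.log(p_q / doc_model[w])
        return score

    # Toy usage: a two-word query model and two already-smoothed document models.
    query_model = {"language": 0.5, "retrieval": 0.5}
    doc_a = {"language": 0.30, "retrieval": 0.20, "model": 0.50}
    doc_b = {"language": 0.05, "retrieval": 0.05, "model": 0.90}
    vocab = {"language", "retrieval", "model"}
    print(kl_divergence_score(query_model, doc_a, vocab))  # larger: doc_a ranks first
    print(kl_divergence_score(query_model, doc_b, vocab))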
2.2 Smoothing of document models

Given a document d, the simplest way to estimate the document language model is to treat the document as a sample from the underlying multinomial word distribution and use the maximum likelihood estimator, P(w|\hat{\theta}_d) = c(w,d)/|d|, where c(w,d) is the count of word w in document d and |d| is the length of d. However, as discussed in virtually all the existing work on using language models for retrieval, such an estimate is problematic and inaccurate; indeed, it assigns zero probability to any word not present in document d, causing problems in scoring a document with query likelihood or KL-divergence (Zhai and Lafferty, 2001b). Intuitively, such an estimate is inaccurate because the document is a small sample.

To solve this problem, many different smoothing techniques have been proposed and studied, usually involving some kind of interpolation of the maximum likelihood estimate and a global collection language model (Hiemstra and Kraaij, 1998; Miller et al., 1999; Zhai and Lafferty, 2001b). For example, Jelinek-Mercer (JM) and Dirichlet are two commonly used smoothing methods (Zhai and Lafferty, 2001b). JM smoothing uses a fixed parameter \lambda to control the interpolation:

    P(w|\hat{\theta}_d) = (1 - \lambda) P_{ml}(w|d) + \lambda P(w|\theta_C),

while Dirichlet smoothing uses a document-dependent coefficient (parameterized with \mu) to control the interpolation:

    P(w|\hat{\theta}_d) = \frac{c(w,d) + \mu P(w|\theta_C)}{|d| + \mu}.

Here P(w|\theta_C) is the probability of word w given by the collection language model \theta_C, which is usually estimated using the whole collection of documents C:

    P(w|\theta_C) = \frac{\sum_{d \in C} c(w,d)}{\sum_{d \in C} |d|}.
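As an illustration (not the authors' code), the two smoothing formulas can be written down directly from the equations above; the collection model is assumed to assign a probability to every vocabulary word.

    def jm_smoothed_prob(w, doc_counts, doc_len, collection_model, lam):
        """Jelinek-Mercer: fixed-weight interpolation with the collection model."""
        p_ml = doc_counts.get(w, 0) / doc_len
        return (1.0 - lam) * p_ml + lam * collection_model[w]

    def dirichlet_smoothed_prob(w, doc_counts, doc_len, collection_model, mu):
        """Dirichlet prior: the interpolation weight depends on document length."""
        return (doc_counts.get(w, 0) + mu * collection_model[w]) / (doc_len + mu)

Note that under Dirichlet smoothing the collection model receives weight \mu / (|d| + \mu), so longer documents rely more on their own counts.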
2.3 Cluster-based document model (CBDM)

Recently, the cluster structure of the corpus has been exploited to improve language models for retrieval (Kurland and Lee, 2004; Liu and Croft, 2004). In particular, the cluster-based language model proposed in (Liu and Croft, 2004) uses clustering information to further smooth a document model. It divides all documents into K different clusters (K = 1000 in their experiments). Both cluster information and collection information are used to improve the estimate of the document model:

    P(w|\hat{\theta}_d) = \lambda P_{ml}(w|d) + (1 - \lambda) P(w|\hat{\theta}_{L_d}),
    \quad P(w|\hat{\theta}_{L_d}) = \frac{c(w, L_d) + \beta P(w|\theta_C)}{|L_d| + \beta},

where \theta_{L_d} stands for document d's cluster model, L_d is the cluster containing d, and \lambda and \beta are smoothing parameters. In this cluster-based smoothing method, we first smooth a cluster model with the collection model using Dirichlet smoothing, and then use the smoothed cluster model as a new reference model to further smooth the document model using JM smoothing; empirical results show that the added cluster information indeed enhances retrieval performance (Liu and Croft, 2004).
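A minimal sketch of this two-stage smoothing, written against the reconstruction above (cluster counts Dirichlet-smoothed with the collection model, then JM-interpolated with the document); the argument layout is an assumption of the example.

    def cbdm_prob(w, doc_counts, doc_len, cluster_counts, cluster_len,
                  collection_model, lam, beta):
        """Cluster-based document model: smooth the cluster with the collection
        (Dirichlet, prior beta), then the document with the cluster (JM, weight lam)."""
        p_cluster = (cluster_counts.get(w, 0) + beta * collection_model[w]) \
                    / (cluster_len + beta)
        p_ml = doc_counts.get(w, 0) / doc_len
        return lam * p_ml + (1.0 - lam) * p_cluster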
2.4 Document expansion

From the viewpoint of data augmentation, the cluster-based language model can be regarded as expanding a document with more data from the cluster that contains it. This is intuitively better than simply expanding every document with the same collection language model, as in the case of JM or Dirichlet smoothing. Looking at it from this perspective, we see that, as the extra data for smoothing a document model, the cluster containing the document is often not optimal. Indeed, the purpose of clustering is to group similar documents together; hence a cluster model represents well the overall properties of all the documents in the cluster. However, such an average model is often not accurate for smoothing each individual document. We illustrate this problem in Figure 1(a), where we show two documents d and a in cluster D. Clearly the generative model of cluster D is more suitable for smoothing document a than document d. In general, the cluster model is more suitable for smoothing documents close to the centroid, such as a, but is inaccurate for smoothing a document at the boundary, such as d.

To achieve optimal smoothing, each document should ideally have its own cluster centered on the document, as shown in Figure 1(b). This is precisely what we propose: expanding each document with a probabilistic neighborhood around it and estimating the document model based on such a virtual, expanded document. We can then apply any simple interpolation-based method (e.g., JM or Dirichlet) to such a virtual document and treat the word counts given by this virtual document as if they were the original word counts.

The use of neighborhood information is worth more discussion. First of all, neighborhood is not a clearly defined concept. In the narrow sense, only a few documents close to the original one should be included in the neighborhood, while in the wide sense, the whole collection can potentially be included. It is thus a challenge to define the neighborhood concept reasonably. Secondly, the assumption that neighboring documents are sampled from the same generative model as the original document is not completely valid; we probably should not trust them as much as the original document. We solve these two problems by associating a confidence value with every document in the collection, which reflects our belief that the document is sampled from the same underlying model as the original document. When a document is close to the original one, we have high confidence, but when it is farther apart, our confidence fades away. In this way, we construct a probabilistic neighborhood which can potentially include all the documents, each with a different confidence value. We call a language model based on such a neighborhood a document expansion language model (DELM).

Technically, we are looking for a new enlarged document d' for each document d in a text collection, such that the new document d' can be used to estimate the hidden generative model of d more accurately. Since a good d' should presumably be based on both the original document d and its neighborhood N(d), we define a function \phi:

    d' = \phi(d, N(d)).

The precise definition of the neighborhood concept N(d) relies on the distance or similarity between each pair of documents. Here, we simply choose the commonly used cosine similarity, though other choices may also be possible. Given any two documents d_i and d_j, represented as term vectors, the cosine similarity is

    sim(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\| \, \|\vec{d_j}\|}.

To model the uncertainty of the neighborhood, we assign a confidence value \gamma_d(b) to every document b in the collection to indicate how strongly we believe b is sampled from d's hidden model. In general, \gamma_d(b) can be set based on the similarity of b and d: the more similar b and d are, the larger \gamma_d(b) should be. With these confidence values, we construct a probabilistic neighborhood with every document in it, each with a different weight. The whole problem is thus reduced to how to define \gamma_d(b) exactly.

Intuitively, an exponential decay curve can help regularize the influence of remote documents. We therefore want \gamma_d(b) to follow a normal-like distribution centered around d. Figure 2 illustrates the shape of this distribution: the black dots are neighborhood documents centered around d, and their probability values are determined by their distances to the center. Fortunately, we observe that the cosine similarities, which we use to decide the neighborhood, already have roughly this decay shape. We thus use them directly without further transformation, since a transformation would introduce unnecessary parameters. We set \gamma_d(b) by normalizing the cosine similarity scores:

    \gamma_d(b) = \frac{sim(d, b)}{\sum_{b' \in N(d)} sim(d, b')}.

The function \phi serves to balance the confidence between d and its neighborhood N(d) in the model estimation step. Intuitively, a shorter document is a less sufficient sample, hence needs more help from its neighborhood; conversely, a longer one can rely more on itself. We use a parameter \alpha to control this balance. Thus, finally, we obtain a pseudo document d' with the following pseudo term counts:

    c(w, d') = \alpha \, c(w, d) + (1 - \alpha) \sum_{b \in N(d)} \gamma_d(b) \, c(w, b).

We hypothesize that, in general, \theta_d can be estimated more accurately from d' than from d itself, because d' contains more complete information about \theta_d. This hypothesis can be tested by comparing the retrieval results of applying any smoothing method to d with those of applying the same method to d'. In our experiments, we test this hypothesis with both JM smoothing and Dirichlet smoothing.

Note that the proposed document expansion technique is quite general. Indeed, since it transforms the original document into a potentially better expanded document, it can presumably be used together with any retrieval method, including the vector space model. In this paper, we focus on evaluating this technique with the language modeling approach.

Because of the decay shape of the neighborhood and for the sake of efficiency, we do not have to actually use all documents in C \ {d}. Instead, we can safely cut off the documents in the tail and only use the top M closest neighbors for each document. We show in the experiment section that the performance is not sensitive to the choice of M when M is sufficiently large (for example, 100). Also, since document expansion can be done completely offline, it can scale up to large collections.
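The construction of the expanded document can be sketched as follows; this is an illustration of the formulas above rather than the authors' implementation, documents are represented as sparse term-count dictionaries, and the quadratic neighbor search is kept only for clarity (an offline run would precompute the top-M neighbor lists).

    import heapq
    import math

    def cosine(vec_a, vec_b):
        """Cosine similarity between two sparse term-count vectors (dicts)."""
        dot = sum(cnt * vec_b.get(w, 0) for w, cnt in vec_a.items())
        norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
        norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def expand_document(doc_id, docs, alpha=0.5, top_m=100):
        """Build pseudo counts c(w, d') = alpha*c(w,d) +
        (1-alpha) * sum_b gamma_d(b) * c(w,b) over the top-M neighbors of d."""
        d = docs[doc_id]
        sims = [(cosine(d, b), b_id) for b_id, b in docs.items() if b_id != doc_id]
        neighbors = heapq.nlargest(top_m, sims)      # probabilistic neighborhood N(d)
        total_sim = sum(s for s, _ in neighbors) or 1.0
        pseudo = {w: alpha * c for w, c in d.items()}
        for sim, b_id in neighbors:
            gamma = sim / total_sim                  # confidence gamma_d(b)
            for w, c in docs[b_id].items():
                pseudo[w] = pseudo.get(w, 0.0) + (1.0 - alpha) * gamma * c
        return pseudo

The pseudo counts returned here simply replace the original counts c(w, d) in whichever smoothing method (JM or Dirichlet) is applied afterwards.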
3 Experiments

We evaluate the proposed method on six representative TREC data sets (Voorhees and Harman, 2001): AP (Associated Press news 1988-90), LA (LA Times), WSJ (Wall Street Journal 1987-92), SJMN (San Jose Mercury News 1991), DOE (Department of Energy), and TREC8 (the ad hoc data used in TREC8). Table 1 shows the statistics of these data sets.

We choose the first four TREC data sets for performance comparison with (Liu and Croft, 2004). To ensure that the comparison is meaningful, we use identical sources (after all preprocessing). In addition, we use the large data set TREC8 to show that our algorithm can scale up, and we use DOE because its documents are usually short and our previous experience shows that it is a relatively difficult data set.

3.1 Neighborhood document expansion

Our model boils down to a standard query likelihood model when no neighborhood documents are used. We therefore use the two most commonly used smoothing methods, JM and Dirichlet, as our baselines. The results are shown in Table 2, where we report both the mean average precision (MAP) and precision at 10 documents. JM and Dirichlet indicate the standard language models with JM smoothing and Dirichlet smoothing, respectively, and the other two runs are the ones combined with our document expansion. (In the result tables, *, **, and *** indicate that the improvement hypothesis is accepted by the Wilcoxon test at significance levels 0.1, 0.05, and 0.01, respectively.) For both baselines, we tune the parameters (\lambda for JM and \mu for Dirichlet) to be optimal. We then use the same values of \lambda or \mu without further tuning for the document expansion runs, which means that the parameters are not necessarily optimal for the document expansion runs. Despite this disadvantage, we see that the document expansion runs significantly outperform their corresponding baselines, with more than 15% relative improvement on AP. The parameters M and \alpha were set to 100 and 0.5, respectively. To understand the improvement in more detail, we show the precision values at different levels of recall for the AP data in Table 3. Here we see that our method significantly outperforms the baseline at every precision point.

In our model, we introduce two additional parameters: M and \alpha. We examine M here and study \alpha in Section 3.3. Figure 3 shows the performance trend with respect to the value of M: the x-axis is the value of M, and the y-axis is the non-interpolated precision averaged over all 50 queries. We draw two conclusions from this plot: (1) neighborhood information improves retrieval accuracy, and adding more documents leads to better retrieval results; (2) the performance becomes insensitive to M when M is sufficiently large, namely 100. The reason is twofold. First, since the neighborhood is centered around the original document, when M is large, the expansion may be evenly magnified on all term dimensions. Second, the exponentially decaying confidence values reduce the influence of remote documents.

3.2 Comparison with CBDM

In this section, we compare with the CBDM method, using the model that performs best in (Liu and Croft, 2004). [1] We also set the Dirichlet prior parameter \mu = 1000, as mentioned in (Liu and Croft, 2004), to rule out any potential influence of Dirichlet smoothing.

[1] We use exactly the same data, queries, stemming, and all other preprocessing techniques. The baseline results in (Liu and Croft, 2004) are confirmed.

Table 4 shows that our model outperforms CBDM in MAP on all four data sets; the improvement presumably comes from a more principled way of exploiting corpus structure. Given that clustering can at least capture the local structure to some extent, it should not be very surprising that the improvement of document expansion over CBDM is much smaller than that over the baselines. Note that we cannot perform the Wilcoxon test here because the individual query results of CBDM are not available.
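For reference, a per-query significance test of this kind could be run as in the sketch below; the average-precision values are made-up placeholders, and SciPy is an assumption of the illustration, not a tool mentioned in the paper.

    from scipy.stats import wilcoxon

    # Hypothetical per-query average precision for a baseline and a DELM run.
    baseline_ap = [0.21, 0.34, 0.18, 0.40, 0.27, 0.33, 0.15, 0.29]
    delm_ap     = [0.25, 0.36, 0.22, 0.41, 0.31, 0.35, 0.18, 0.30]

    # One-sided test of the improvement hypothesis (DELM better than baseline).
    stat, p_value = wilcoxon(delm_ap, baseline_ap, alternative="greater")
    print(f"W = {stat:.1f}, p = {p_value:.4f}")  # accept at level 0.05 if p < 0.05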
3.3 Impact on short documents

Document expansion is designed to address the insufficient-sampling problem. Intuitively, a short document is a less sufficient sample than a longer one and would therefore need more help from its neighborhood. We design experiments to test this hypothesis. Specifically, we randomly shrink each document in AP88-89 to a certain percentage of its original length. For example, a shrinkage factor of 30% means that each term has a 30% chance of being kept and a 70% chance of being filtered out. In this way, we reduce the original data set to a new one with the same number of documents but a shorter average document length. Table 5 shows the experimental results over document sets with different average document lengths. The results indeed support our hypothesis that document expansion helps short documents more than longer ones: while we obtain a 41% improvement on the 30%-length corpus, the same model yields only a 16% improvement on the full-length corpus.

To understand how \alpha affects performance, we plot the sensitivity curves in Figure 4. The curves all look similar, but the optimal points migrate slightly as the average document length becomes shorter: the 100% corpus is optimal at \alpha = 0.4, but the 30% corpus has to use \alpha = 0.2 to reach its optimum. (All optimal \alpha values are presented in the fourth row of Table 5.)
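The shrinkage procedure can be simulated in a few lines; this sketch assumes documents are simple token lists and is only an illustration of the sampling described above.

    import random

    def shrink_document(tokens, keep_prob, seed=42):
        """Keep each term occurrence independently with probability keep_prob;
        keep_prob=0.3 yields a document of roughly 30% of its original length."""
        rng = random.Random(seed)
        return [t for t in tokens if rng.random() < keep_prob]

    doc = "the quick brown fox jumps over the lazy dog".split()
    print(shrink_document(doc, 0.3))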
3.4 Further improvement with pseudo feedback

Query expansion has proved to be an effective way of utilizing corpus information to improve the query representation (Rocchio, 1971; Zhai and Lafferty, 2001a). It is thus interesting to examine whether our model can be combined with query expansion to further improve retrieval accuracy. We use the model-based feedback proposed in (Zhai and Lafferty, 2001a) and take the top 5 returned documents for feedback. There are two parameters in the model-based pseudo-feedback process: the noise parameter r and the interpolation parameter s. [2] We fix r = 0.9, tune s to its optimal value, and use these values directly in the feedback process combined with our model (which again means that s is probably not optimal in our results). The combination is conducted in the following way: (1) retrieve documents with our DELM method; (2) use the top 5 documents for model-based feedback; (3) use the expanded query model to retrieve documents again with the DELM method.

[2] (Zhai and Lafferty, 2001a) uses different notation; we change it because \alpha has already been used in our own model.

Table 6 shows the experimental results (MAP); indeed, by combining DELM with pseudo feedback, we obtain significant further improvements in performance.

As another baseline, we also tested the algorithm proposed in (Kurland and Lee, 2004). Since that algorithm overlaps with the pseudo-feedback process, it is not easy to combine the two. We implement its best-performing variant, interpolation (labeled "inter."), and show the results in Table 7. Here, we use the same three data sets as in (Kurland and Lee, 2004) and tune the feedback parameters to their optimal values in each experiment. The second-to-last column of Table 7 shows the performance of the interpolation model combined with pseudo feedback and its improvement percentage; the last column gives the z-scores of the Wilcoxon test. The negative z-scores indicate that none of these improvements is significant.