<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1074"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. An Iterative Implicit Feedback Approach to Personalized Search</Title> <Section position="4" start_page="585" end_page="587" type="metho"> <SectionTitle> 2 Iterative Implicit Feedback Approach </SectionTitle> <Paragraph position="0"> We propose a HITS-like iterative approach for personalized search. The HITS (Hyperlink-Induced Topic Search) algorithm, first described by (J. Kleinberg, 1998), was originally used to detect high-score hub and authority web pages. Authority pages are the central web pages in the context of particular query topics.</Paragraph> <Paragraph position="2"> The strongest authority pages consciously do not link to one another -- they can only be linked to by some relatively anonymous hub pages. The mutual reinforcement principle of HITS states that a web page is a good authority page if it is linked to by many good hub pages, and that a web page is a good hub page if it links to many good authority pages. A directed graph is constructed, in which the nodes represent web pages and the directed edges represent hyperlinks. After iterative computation based on the reinforcement principle, each node gets an authority score and a hub score. In our approach, we exploit the relationships between documents and terms in a similar way to HITS. Unseen search results, i.e., results that have been retrieved from the search engine but not yet presented to the user, are considered &quot;authority pages&quot;. Representative terms are considered &quot;hub pages&quot;. Here the representative terms are the terms that are extracted from, and best represent, the implicit feedback information. 
(For instance, there is hardly any other company's Web page linked from &quot;http://www.microsoft.com/&quot;.) Representative terms confer a relevance score to the unseen search results -- specifically, the unseen search results that contain more good representative terms have a higher possibility of being relevant; the representative terms are more representative if they occur in the unseen search results that are more likely to be relevant. Thus, a mutual reinforcement principle also exists between representative terms and unseen search results. By the same token, we construct a directed graph whose nodes indicate unseen search results and representative terms, and whose directed edges represent the occurrence of the representative terms in the unseen search results. Table 1 shows how our approach corresponds to the HITS algorithm.</Paragraph> <Paragraph position="3"> Since we already know that the representative terms are &quot;hub pages&quot; and that the unseen search results are &quot;authority pages&quot;, for the former only hub scores need to be computed, and for the latter only authority scores need to be computed.</Paragraph> <Paragraph position="4"> Finally, after iterative computation based on the mutual reinforcement principle, we can re-rank the unseen search results according to their authority scores, as well as select the representative terms with the highest hub scores to expand the query. 
Below we present how to construct a directed graph to begin with.</Paragraph> <Section position="1" start_page="585" end_page="586" type="sub_section"> <SectionTitle> 2.1 Constructing a Directed Graph </SectionTitle> <Paragraph position="0"> We can view the unseen search results and the representative terms as a directed graph G = (V, E).</Paragraph> <Paragraph position="1"> A sample directed graph is shown in Figure 1, which depicts the occurrence of the representative terms in the unseen search results.</Paragraph> <Paragraph position="2"> The nodes V correspond to the unseen search results (the rectangles in Figure 1) and the representative terms (the circles in Figure 1); a directed edge p -&gt; q ∈ E is weighted by the frequency of occurrence of a representative term p in an unseen search result q (e.g., the numbers on the edges in Figure 1). We say that each representative term only has an out-degree, which is the number of the unseen search results it occurs in, and that each unseen search result only has an in-degree, which is the count of the representative terms it contains. Based on this, we assume that the unseen search results and the representative terms respectively correspond to the authority pages and the hub pages -- this assumption is used throughout the proposed algorithm.</Paragraph> </Section> <Section position="2" start_page="586" end_page="587" type="sub_section"> <SectionTitle> 2.2 A HITS-like Iterative Algorithm </SectionTitle> <Paragraph position="0"> In this section, we present how to initialize the directed graph and how to iteratively compute the authority scores and the hub scores. 
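The weighted graph of Section 2.1 can be sketched in a few lines of Python (a hypothetical illustration, not the authors' code; the term list and the tokenized search results are assumed inputs):

```python
from collections import Counter

def build_graph(terms, results):
    """Build the weighted bipartite graph of Section 2.1.

    terms:   representative terms (the "hub" nodes)
    results: unseen search results, each given as a list of tokens
             (the "authority" nodes)
    Returns W, where W[i][j] is the frequency of term i in result j,
    i.e. the weight of the edge from t_i to r_j (0 means no edge).
    """
    W = [[0] * len(results) for _ in terms]
    for j, result in enumerate(results):
        counts = Counter(result)      # term frequencies within result j
        for i, t in enumerate(terms):
            W[i][j] = counts[t]       # edge weight = tf(t_i, r_j)
    return W

terms = ["mac", "os", "car"]
results = [["mac", "os", "mac", "update"], ["car", "dealer"]]
W = build_graph(terms, results)
# W[0][0] == 2: "mac" occurs twice in the first unseen result
```

Each nonzero entry of W corresponds to one directed edge of G; a term's out-degree is the number of nonzero entries in its row, and a result's in-degree the number of nonzero entries in its column.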
And then, according to these scores, we show how to re-rank the unseen search results and expand the initial query.</Paragraph> <Paragraph position="1"> Initially, each unseen search result of the query is considered equally authoritative, that is, y_j = 1 / |Y| (1)</Paragraph> <Paragraph position="3"> where the vector Y indicates the authority scores of all the unseen search results, and |Y| is the size of that vector. Meanwhile, each representative term, with term frequency tf_j in the history query logs that have been judged related to the current query, obtains its hub score according to the following formulation: x_j = tf_j / Σ_{i=1..|X|} tf_i (2)</Paragraph> <Paragraph position="5"> where the vector X indicates the hub scores of all the representative terms, and |X| is the size of the vector X. The nodes of the directed graph are initialized in this way. Next, we associate each edge with a weight: w(t_i, r_j) = tf(t_i, r_j) (3)</Paragraph> <Paragraph position="7"> where tf(t_i, r_j) indicates the term frequency of the term t_i occurring in the unseen search result r_j, and w(t_i, r_j) is the weight of the edge that links from t_i to r_j.</Paragraph> <Paragraph position="13"> After initialization, the iterative computation of hub scores and authority scores starts.</Paragraph> <Paragraph position="14"> The hub score of each representative term is re-computed based on three factors: the authority score of each unseen search result where this term occurs; the frequency of this term in each unseen search result; and the total occurrences of all the representative terms in each unseen search result. 
The formulation for re-computing hub scores is as follows: x_i = Σ_{r_j ∈ R(t_i)} ( w(t_i, r_j) / Σ_{t_k ∈ T(r_j)} w(t_k, r_j) ) · y_j (4)</Paragraph> <Paragraph position="16"> where y_j is the authority score of an unseen search result r_j, R(t_i) indicates the set of all unseen search results that t_i occurs in, and T(r_j) indicates the set of all representative terms that r_j contains.</Paragraph> <Paragraph position="21"> The authority score of each unseen search result is also re-computed relying on three factors: the hub score of each representative term that this search result contains; the frequency of each representative term in this search result; and the total occurrences of each representative term in all the unseen search results. The formulation for re-computing authority scores is as follows: y_j = Σ_{t_i ∈ T(r_j)} ( w(t_i, r_j) / Σ_{r_l ∈ R(t_i)} w(t_i, r_l) ) · x_i (5)</Paragraph> <Paragraph position="23"> where x_i is the hub score of a representative term t_i, T(r_j) indicates the set of all representative terms that r_j contains, and R(t_i) indicates the set of all unseen search results that t_i occurs in.</Paragraph> <Paragraph position="28"> After re-computation, the hub scores and the authority scores are each normalized to sum to 1. The formulation for normalization is as follows: x_i = x_i / Σ_k x_k and y_j = y_j / Σ_l y_l (6)</Paragraph> <Paragraph position="30"> The iteration, including re-computation and normalization, is repeated until the changes of the hub scores and the authority scores are smaller than a predefined threshold th.</Paragraph> <Paragraph position="31"> Specifically, after each repetition, the change c in the authority scores and hub scores is computed using the following formulation: c = Σ_i |x'_i - x_i| + Σ_j |y'_j - y_j| (7), where x'_i and y'_j are the scores after the current repetition.</Paragraph> <Paragraph position="33"> The iteration stops if c < th. Moreover, the iteration will also stop if the number of repetitions has reached a predefined limit k (e.g. 30). 
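The whole iteration, i.e. initialization by (1) and (2), the two update steps, normalization, and the stopping tests, can be sketched as follows. This is a hypothetical Python illustration, not the authors' code; in particular, it assumes the authority update uses the freshly updated hub scores, as in standard HITS, since the paper does not state the update order.

```python
def iterate(term_tfs, W, k=30, th=1e-6):
    """HITS-like iteration of Section 2.2 (illustrative sketch).

    term_tfs: tf of each representative term in the related query logs
    W:        W[i][j] = frequency of representative term i in unseen result j
    Returns (X, Y): hub scores for the terms, authority scores for the results.
    """
    n_terms, n_results = len(W), len(W[0])
    Y = [1.0 / n_results] * n_results            # (1): equally authoritative
    total_tf = sum(term_tfs)
    X = [tf / total_tf for tf in term_tfs]       # (2): tf-based hub scores
    for _ in range(k):                           # at most k repetitions
        # total occurrences of all terms in each result (column sums)
        col = [sum(W[i][j] for i in range(n_terms)) for j in range(n_results)]
        # total occurrences of each term over all results (row sums)
        row = [sum(W[i][j] for j in range(n_results)) for i in range(n_terms)]
        # hub update: authority of each result the term occurs in, weighted
        # by the term's share of that result's total occurrences
        newX = [sum(Y[j] * W[i][j] / col[j] for j in range(n_results) if col[j])
                for i in range(n_terms)]
        newX = [x / sum(newX) for x in newX]     # normalize hubs to sum to 1
        # authority update: hub score of each contained term, weighted by
        # the result's share of that term's total occurrences
        newY = [sum(newX[i] * W[i][j] / row[i] for i in range(n_terms) if row[i])
                for j in range(n_results)]
        newY = [y / sum(newY) for y in newY]     # normalize authorities
        c = (sum(abs(a - b) for a, b in zip(newX, X))
             + sum(abs(a - b) for a, b in zip(newY, Y)))
        X, Y = newX, newY
        if th > c:                               # converged: change below th
            break
    return X, Y
```

On a small toy graph the scores stabilize after a handful of repetitions; the results are then ranked by Y and the terms by X.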
The procedure of the iteration is shown in Figure 2.</Paragraph> <Paragraph position="34"> As soon as the iteration stops, the top n unseen search results with the highest authority scores are selected and recommended to the user; the top m representative terms with the highest hub scores are selected to expand the original query. Here n is a predefined number (in the PAIR system we set n=3; n is given a small value because using implicit feedback information is sometimes risky). m is determined according to the position of the biggest gap; that is, if the gap between the hub scores of the i-th and the (i+1)-th ranked terms is bigger than the gap between any other two neighboring terms in the top half of the representative terms, then m is given the value i. Furthermore, some of these representative terms (e.g. the top 50% high-score terms) will be used again the next time the iterative algorithm is run, together with some newly incoming terms extracted from the most recent click.</Paragraph> </Section> </Section> <Section position="5" start_page="587" end_page="589" type="metho"> <SectionTitle> 3 Implementation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="587" end_page="587" type="sub_section"> <SectionTitle> 3.1 System Design </SectionTitle> <Paragraph position="0"> In this section, we present our experimental system PAIR, which is an IE Browser Helper Object (BHO) built on the popular Web search engine Google. PAIR has three main modules: the Result Retrieval module, the User Interactions module, and the Iterative Algorithm module. The architecture is shown in Figure 3.</Paragraph> <Paragraph position="1"> The Result Retrieval module runs in the background and retrieves results from the search engine. When the query has been expanded, this module uses the new keywords to continue retrieving. The User Interactions module can handle three types of basic user actions: (1) submitting a query; (2) clicking to view a search result; (3) clicking the &quot;Next Page&quot; link. 
For each of these actions, the system responds with: (a) exploiting and extracting representative terms from implicit feedback information; (b) fetching the unseen search results via the Result Retrieval module; (c) sending the representative terms and the unseen search results to the Iterative Algorithm module. The Iterative Algorithm module implements the HITS-like algorithm described in Section 2. When this module receives data from the User Interactions module, it responds with: (a) iteratively computing the hub scores and authority scores; (b) re-ranking the unseen search results and expanding the original query.</Paragraph> <Paragraph position="2"> Some specific techniques for capturing and exploiting implicit feedback information are described in the following sections.</Paragraph> </Section> <Section position="2" start_page="587" end_page="588" type="sub_section"> <SectionTitle> 3.2 Extract Representative Terms from Query Logs </SectionTitle> <Paragraph position="0"> We judge whether a query log is related to the current query based on the similarity between the query log and the current query text. Here the query log is associated with all documents that the user has selected to view. The form of each query log is as follows: <query text><query time> [clicked documents]* The &quot;clicked documents&quot; consist of the URL, title and snippet of every clicked document. The reason we use the query text of the current query, rather than the search results (including titles, snippets, etc.), to compute the similarity is efficiency. If we had used the search results to determine the similarity, the computation could only start once the search engine had returned the search results. In our method, instead, we can exploit the query logs while the search engine is still retrieving. 
Notice that although our system only utilizes the query logs from the last 24 hours, in practice we could exploit many more, because the computation cost is low and the computation is performed in parallel with the retrieval process.</Paragraph> <Paragraph position="1"> Figure 2. Iterate(T, R, k, th). T: a collection of m terms; R: a collection of n search results; k: a natural number; th: a predefined threshold. Apply (1) to initialize Y; apply (2) to initialize X; apply (3) to initialize W.</Paragraph> <Paragraph position="5"> Figure 4. Results that PAIR recommends to the user. &quot;-3&quot; and &quot;-2&quot; on the right side of some results indicate how their ranks have descended.</Paragraph> <Paragraph position="6"> We use the standard vector space retrieval model (G. Salton and M. J. McGill, 1983) to compute the similarity. If the similarity between any query log and the current query exceeds a predefined threshold, the query log is considered to be related to the current query. Our system will attempt to extract some (e.g. 
30%) representative terms from such related query logs according to the weights computed by applying the following formulation: weight(t_i) = tf_i · idf_i</Paragraph> <Paragraph position="8"> where tf_i and idf_i respectively are the term frequency and inverse document frequency of t_i in the clicked documents of a related query log.</Paragraph> <Paragraph position="11"> This formulation means that a term is more representative if it has a higher frequency as well as a broader distribution in the related query logs.</Paragraph> </Section> <Section position="3" start_page="588" end_page="588" type="sub_section"> <SectionTitle> 3.3 Extract Representative Terms from Immediately Viewed Documents </SectionTitle> <Paragraph position="0"> The representative terms extracted from the immediately viewed documents are determined based on three factors: term frequency in the immediately viewed documents, inverse document frequency in the entire seen search results, and a discriminant value. The formulation is as follows: weight(x_i) = tf_i · idf_i · d_i</Paragraph> <Paragraph position="2"> where tf_i is the term frequency of term x_i in the immediately viewed documents and idf_i is its inverse document frequency in the entire seen search results.</Paragraph> <Paragraph position="4"> The discriminant value d_i is computed using the weighting scheme F2 (S. E. Robertson and K. 
Sparck Jones, 1976), where r is the number of the immediately viewed documents containing term x_i; n is the number of the seen results containing term x_i; R is the number of the immediately viewed documents in the query; and N is the number of the entire seen results.</Paragraph> </Section> <Section position="4" start_page="588" end_page="589" type="sub_section"> <SectionTitle> 3.4 Sample Results </SectionTitle> <Paragraph position="0"> Unlike other systems, which perform result re-ranking and query expansion separately, our system implements these two functions simultaneously and collaboratively -- query expansion provides diversified search results, which must rely on re-ranking to be effective. After the iterative computation of our approach, the system selects the search results with the highest authority scores and recommends them to the user. In Table 2, we show that PAIR successfully re-ranks the unseen search results of &quot;jaguar&quot;, using the immediately viewed documents and the query logs respectively. Simultaneously, some representative terms are selected to expand the original query. In the query &quot;jaguar&quot; (without query logs), we click some results about &quot;Mac OS&quot;, and then the term &quot;Mac&quot; is selected to expand the original query; some results of the new query &quot;jaguar Mac&quot; are recommended to the user with the help of re-ranking, as shown in Figure 4.</Paragraph> </Section> </Section> </Paper>