<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1653">
  <Title>Relevance Feedback Models for Recommendation</Title>
  <Section position="4" start_page="449" end_page="450" type="metho">
    <SectionTitle>
2 Relevance feedback models
</SectionTitle>
    <Paragraph position="0"> The analogy between IR and CF that will be exploited in this paper is as follows (see footnote 2). First, a document in IR corresponds to an item in CF. Both are represented as vectors. A document is represented as a vector of words (bag-of-words) and an item is represented as a vector of user ratings (bag-of-user-ratings). In RF, a user specifies documents that are relevant to his information need. These documents are used by the system to retrieve new relevant documents. (Footnote 2: For example, Breese et al. (1998) used the vector space model to measure the similarity between users in a user-based CF framework. Wang et al. (2005) used a language modeling approach different from ours. These works, however, treated only CF. In contrast with these, our model extends language modeling approaches to incorporate both CF and CBF.)</Paragraph>
    <Paragraph position="1"> In CF, an active user (implicitly) specifies items that he likes. These items are used to search for new items that will be preferred by the active user.</Paragraph>
    <Paragraph position="2"> We use relevance models (Lavrenko and Croft, 2001; Lavrenko, 2004) as the basic framework of our relevance feedback models because (1) they perform relevance feedback well (Lavrenko, 2004) and (2) they can simultaneously handle different kinds of features (e.g., texts in different languages (Lavrenko et al., 2002), or different media such as texts and images (Leon et al., 2003)). These two points are essential in our application.</Paragraph>
    <Paragraph position="3"> We first introduce a multinomial model following the work of Lavrenko (2004). This model is a novel one that extends relevance feedback approaches to incorporate CF. It is like a combination of relevance feedback (Lavrenko, 2004) and cross-language information retrieval (Lavrenko et al., 2002). We then generalize that model to an approximated Polya distribution model that is better suited to CF and CBF. This generalized model is the main technical contribution of this work.</Paragraph>
    <Section position="1" start_page="449" end_page="450" type="sub_section">
      <SectionTitle>
2.1 Preparation
</SectionTitle>
      <Paragraph position="0"> Lavrenko (2004) adopts the method of kernels to estimate probabilities: let d be an item in the database or training data; the probability of item x is estimated as p(x) = (1/M) Σ_d p(x|θ_d), where M is the number of items in the training data, θ_d is the parameter vector estimated from d, and p(x|θ_d) is the conditional probability of x given θ_d. This means that once we have defined a probability distribution p(x|θ) and the method of estimating θ_d from d, we can assign probability p(x) to x and apply language modeling approaches to CF and CBF.</Paragraph>
      <Paragraph position="1"> To begin with, we define the representation of item x as the concatenation of two vectors {w_x, u_x}, where w_x = w_x1 w_x2 ... is the sequence of words (contents) contained in x and u_x = u_x1 u_x2 ... is the sequence of users who have rated x implicitly. We use V_w and V_u to denote the set of words and users in the database.</Paragraph>
      <Paragraph position="2"> The parameter vector θ is also the concatenation of two vectors {o, u}, where o and u are the parameter vectors for V_w and V_u, respectively. The probability of x given θ is defined as p(x|θ) = p_o(w_x|o) p_u(u_x|u).</Paragraph>
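To make the setup of Section 2.1 concrete, the following minimal sketch (not from the paper) treats an item as a pair of token sequences (words and users), derives the counts n(w, w_x) and n(u, u_x), and estimates p(x) as the average of p(x|θ_d) over the M training items. All names, including the helper p_x_given_theta, are illustrative assumptions.

from collections import Counter
from typing import Callable, List, Tuple

Item = Tuple[List[str], List[str]]   # (word tokens w_x, user tokens u_x)

def item_counts(item: Item) -> Tuple[Counter, Counter]:
    # n(w, w_x) and n(u, u_x): occurrence counts of words and users in item x
    words, users = item
    return Counter(words), Counter(users)

def p_x(x: Item, training_items: List[Item],
        p_x_given_theta: Callable[[Item, Item], float]) -> float:
    # p(x) = (1/M) * sum over training items d of p(x | theta_d),
    # where theta_d is whatever parameter vector is estimated from item d
    M = len(training_items)
    return sum(p_x_given_theta(x, d) for d in training_items) / M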
    </Section>
    <Section position="2" start_page="450" end_page="450" type="sub_section">
      <SectionTitle>
2.2 Multinomial model
</SectionTitle>
      <Paragraph position="0"> Our first model assumes that both p_o and p_u follow multinomial distributions. In this case, o(w) and u(u) are the probabilities of word w and user u, respectively.</Paragraph>
      <Paragraph position="1"> Then, p_o(w_x|o) is defined as</Paragraph>
      <Paragraph position="2"> p_o(w_x|o) = ∏_{w ∈ V_w} o(w)^{n(w,w_x)},   (1)</Paragraph>
      <Paragraph position="3"> where n(w,w_x) is the number of occurrences of w in w_x. In this model, we use a linear interpolation method to estimate the probability o_d(w).</Paragraph>
      <Paragraph position="4"> o_d(w) = λ_o P_l(w|w_d) + (1 − λ_o) P_g(w),   (2)</Paragraph>
      <Paragraph position="5"> where P_l(w|w_d) = n(w,w_d) / Σ_{w′} n(w′,w_d), P_g(w) = Σ_d n(w,w_d) / Σ_d Σ_{w′} n(w′,w_d), and λ_o (0 ≤ λ_o ≤ 1) is a smoothing parameter. The estimation of user probabilities goes similarly: letting n(u,u_x) be the number of times user u implicitly rated item x, we define and estimate p_u, λ_u, and u_d in the same way. In summary, we have defined a probability distribution p(x|θ) and the method of estimating θ_d = {o_d, u_d} from d.</Paragraph>
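A minimal sketch of the linearly interpolated estimate o_d(w) in Eq. 2, under the reconstruction above: P_l is the within-item relative frequency and P_g is the collection-wide relative frequency (the exact background model is an assumption). Names are illustrative.

from collections import Counter
from typing import Dict, List

def global_probs(items: List[List[str]]) -> Dict[str, float]:
    # P_g(w): relative frequency of w over all items in the training data
    total = Counter()
    for tokens in items:
        total.update(tokens)
    n = sum(total.values())
    return {w: c / n for w, c in total.items()}

def smoothed_prob(w: str, item_tokens: List[str],
                  p_g: Dict[str, float], lam: float = 0.5) -> float:
    # o_d(w) = lam * P_l(w | w_d) + (1 - lam) * P_g(w)
    counts = Counter(item_tokens)
    p_l = counts[w] / len(item_tokens) if item_tokens else 0.0
    return lam * p_l + (1.0 - lam) * p_g.get(w, 0.0)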
      <Paragraph position="6"> To recommend top-N items, we have to rank items in the database in response to the implicit ratings of active users. We call those implicit ratings query q. It is a set of items and is represented as q = {q_1, ..., q_k}, where q_i is an item implicitly rated by an active user and k is the size of q. We next estimate θ_q = {o_q, u_q}. Then, we compare θ_q and θ_d to rank items by using Kullback-Leibler divergence D(θ_q||θ_d) (Lafferty and Zhai, 2001; Lavrenko, 2004).</Paragraph>
      <Paragraph position="7"> o_q(w) = (1/k) Σ_{i=1..k} o_qi(w),   (3)</Paragraph>
      <Paragraph position="8"> where o_qi(w) is obtained by Eq. 2 (Lavrenko, 2004). However, we found in preliminary experiments that smoothing query probabilities hurt performance in our application. Thus, we use</Paragraph>
      <Paragraph position="9"> o_qi(w) = P_l(w|w_qi) = n(w,w_qi) / Σ_{w′} n(w′,w_qi)   (4)</Paragraph>
      <Paragraph position="10"> instead of Eq. 2 when q_i is in a query.</Paragraph>
      <Paragraph position="11"> Because KL-divergence is a distance measure, we use a score function derived from −D(θ_q||θ_d) to rank items. We use S_q(d) to denote the score of d given q. S_q(d) is derived as follows. (We ignore terms that are irrelevant to ranking items.)</Paragraph>
      <Paragraph position="12"> S(o_qi||o_d) = Σ_w o_qi(w) log( 1 + λ_o P_l(w|w_d) / ((1 − λ_o) P_g(w)) )   (6)</Paragraph>
      <Paragraph position="13"> The summation goes over every word w that is shared by both w_qi and w_d. We define S(u_qi||u_d) similarly. Then, the score of d given q_i is</Paragraph>
      <Paragraph position="14"> S_qi(d) = λ_s S(u_qi||u_d) + (1 − λ_s) S(o_qi||o_d),   (7)</Paragraph>
      <Paragraph position="15"> where λ_s (0 ≤ λ_s ≤ 1) is a free parameter. Finally, the score of d given q is</Paragraph>
      <Paragraph position="16"> S_q(d) = Σ_{q_i ∈ q} S_qi(d).   (8)</Paragraph>
      <Paragraph position="17"> The calculation of S_q(d) can be very efficient because once we cache S_qi(d) for each item pair q_i and d in the database, we can reuse it to calculate S_q(d) for any query q. We further optimize the calculation of top-N recommendations by storing only the top 100 items (neighbors) in decreasing order of S_qi(·) for each item q_i and setting the scores of lower ranked items to 0. (Note that S_qi(d) ≥ 0 holds.) Consequently, we only have to search a small part of the search space without affecting the performance very much. These two types of optimization are common in item-based CF (Sarwar et al., 2001; Karypis, 2001).</Paragraph>
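The two optimizations described above can be sketched as follows: cache the pairwise scores S_qi(d), keep only each item's top-100 neighbors, and score a query by summing the cached per-item scores (Eq. 8). The pairwise score function is passed in; the data structures and names are illustrative assumptions, not the authors' implementation.

from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

def build_neighbor_cache(items: List[str],
                         pair_score: Callable[[str, str], float],
                         top_k: int = 100) -> Dict[str, List[Tuple[str, float]]]:
    # For each item q_i, store its top_k items d by S_qi(d);
    # lower-ranked items are treated as having score 0.
    cache = {}
    for qi in items:
        scored = [(d, pair_score(qi, d)) for d in items if d != qi]
        scored.sort(key=lambda t: t[1], reverse=True)
        cache[qi] = scored[:top_k]
    return cache

def recommend(query: Iterable[str],
              cache: Dict[str, List[Tuple[str, float]]],
              n: int = 10) -> List[Tuple[str, float]]:
    # S_q(d) = sum over q_i in q of the cached S_qi(d); return the top-N items
    # not already in the query.
    seen = set(query)
    total = defaultdict(float)
    for qi in seen:
        for d, s in cache.get(qi, []):
            if d not in seen:
                total[d] += s
    return sorted(total.items(), key=lambda t: t[1], reverse=True)[:n]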
    </Section>
    <Section position="3" start_page="450" end_page="450" type="sub_section">
      <SectionTitle>
2.3 Polya model
</SectionTitle>
      <Paragraph position="0"> Our second model is based on the Polya distribution. We first introduce the (hyper)parameter Θ = {α_o, α_u} and denote the probability of x given Θ as p(x|Θ) = p_o(w_x|α_o) p_u(u_x|α_u). α_o and α_u are the parameter vectors for words and users.</Paragraph>
      <Paragraph position="1"> p_o(w_x|α_o) is defined as follows.</Paragraph>
      <Paragraph position="2"> p_o(w_x|α_o) = ( Γ(Σ_w α_ow) / Γ(Σ_w (α_ow + n_xw)) ) ∏_w Γ(α_ow + n_xw) / Γ(α_ow)   (9)</Paragraph>
      <Paragraph position="3"> where Γ is known as the gamma function, α_ow is a parameter for word w, and n_xw = n(w,w_x). This can be approximated as follows (Minka, 2003).</Paragraph>
      <Paragraph position="4"> p_o(w_x|α_o) ≈ ∏_w o(w)^{ñ(n_xw, α_ow)},  where ñ(n_xw, α_ow) = α_ow (Ψ(n_xw + α_ow) − Ψ(α_ow))   (10)</Paragraph>
      <Paragraph position="5"> Ψ is known as the digamma function and is similar to the natural logarithm. We call Eq. 10 the approximated Polya model or simply the Polya model in this paper.</Paragraph>
      <Paragraph position="6"> Eq. 10 indicates that the Polya distribution can be interpreted as a multinomial distribution over a modified set of counts ñ(·) (Minka, 2003). These modified counts are damped as shown in Fig. 1. When α_ow → ∞, ñ(n_xw, α_ow) approaches n_xw. When α_ow → 0, ñ(n_xw, α_ow) = 0 if n_xw = 0; otherwise it is 1. For intermediate values of α_ow, the mapping ñ damps the original counts.</Paragraph>
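For concreteness, here is a small numerical sketch of the Polya pieces, under the reconstructions of Eqs. 9 and 10 given above (the exact equation forms are assumptions, not copied from the paper): the exact log-likelihood via log-gamma, and the damped count ñ(n, α) = α(Ψ(n + α) − Ψ(α)), whose limiting behavior matches the description in this paragraph.

from collections import Counter
from typing import Dict, List
from scipy.special import gammaln, digamma

def polya_loglik(tokens: List[str], alpha: Dict[str, float]) -> float:
    # Exact Polya (Dirichlet-multinomial) log-likelihood of a token sequence,
    # in the form reconstructed as Eq. 9; alpha must cover the vocabulary.
    counts = Counter(tokens)
    a_sum = sum(alpha.values())
    n_sum = sum(counts.values())
    ll = gammaln(a_sum) - gammaln(a_sum + n_sum)
    for w, n in counts.items():
        ll += gammaln(alpha[w] + n) - gammaln(alpha[w])
    return float(ll)

def damped_count(n: int, alpha: float) -> float:
    # The modified count of the multinomial approximation (Eq. 10 as assumed)
    return float(alpha * (digamma(n + alpha) - digamma(alpha)))

# Limiting behavior described in the text:
#   damped_count(3, 1e6)  is approximately 3   (alpha -> infinity recovers the raw count)
#   damped_count(3, 1e-6) is approximately 1   (alpha -> 0 maps any positive count to 1)
#   damped_count(0, a) == 0 for any a > 0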
      <Paragraph position="7"> Under the approximation of Eq. 10, the estimation of parameters can be understood as the maximum-likelihood estimate of a multinomial distribution from the damped counts ñ(·) (Minka, 2003). Indeed, all we have to do to estimate the parameters for ranking items is replace P_l and P_g from Section 2.2 with</Paragraph>
      <Paragraph position="8"> P_l(w|w_d) = ñ(w,w_d) / Σ_{w′} ñ(w′,w_d),  P_g(w) = Σ_d ñ(w,w_d) / Σ_d Σ_{w′} ñ(w′,w_d),  o_qi(w) = ñ(w,w_qi) / Σ_{w′} ñ(w′,w_qi).</Paragraph>
      <Paragraph position="9"> Then, as in the multinomial model, we can define S(o_qi||o_d) with these probabilities.</Paragraph>
      <Paragraph position="10"> This argument also applies to S(u_qi||u_d).</Paragraph>
      <Paragraph position="11"> The approximated Polya model is a generalization of the multinomial model described in Section 2.2. If we set α_ow and α_uu very large, then the Polya model is identical to the multinomial model.</Paragraph>
      <Paragraph position="12"> By comparing Eqs. 1 and 10, we can see why the Polya model is superior to the multinomial model for modeling the occurrences of words (and users).</Paragraph>
      <Paragraph position="13"> In the multinomial model, if a word with probability p occurs twice, its probability becomes p^2. In the Polya model, the word's probability becomes p^1.5 if, for example, we set α_ow = 1. Clearly, p^2 &lt; p^1.5; therefore, the Polya model assigns a higher probability. In this example, the Polya model assigns probability p to the first occurrence and p^0.5 (&gt; p) to the second. Since words that occur once are likely to occur again (Church, 2000), the Polya model is better suited to model the occurrences of words and users. See Yamamoto and Sadamitsu (2005) for further discussion on applying the Polya distribution to text modeling.</Paragraph>
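As a worked check of this example, using the damped-count formula assumed in the reconstruction of Eq. 10 with α_ow = 1:

ñ(1, 1) = Ψ(2) − Ψ(1) = 1
ñ(2, 1) = Ψ(3) − Ψ(1) = 1 + 1/2 = 1.5

so two occurrences contribute o(w)^1.5 under Eq. 10, compared with o(w)^2 under Eq. 1; the second occurrence multiplies the probability by only p^0.5.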
      <Paragraph position="14"> Zaragoza et al. (2003) applied the Polya distribution to ad hoc IR. They introduced the exact Polya distribution (see Eq. 9) as an extension of the Dirichlet prior method (Zhai and Lafferty, 2001). In contrast, we have introduced a multinomial approximation of the Polya distribution. This approximation allows us to use the linear interpolation method to mix the approximated Polya distributions. Thus, our model is similar to two-stage language models (Zhai and Lafferty, 2002) that combine the Dirichlet prior method and the linear interpolation method. Unlike our model, Zaragoza et al. (2003) found it difficult to mix Polya distributions and did not treat this in their paper.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="450" end_page="454" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We first examined the behavior of the Polya model by varying the parameters. We tied α_ow for every w and α_uu for every u; that is, for any w and u, α_ow = α_o and α_uu = α_u. We then compared the Polya model to an item-based CF method.</Paragraph>
    <Section position="1" start_page="450" end_page="452" type="sub_section">
      <SectionTitle>
3.1 Behavior of Polya model
3.1.1 Dataset
</SectionTitle>
      <Paragraph position="0"> We made a dataset of articles from English Wikipedia to evaluate the Polya model. English Wikipedia is an online encyclopedia that anyone can edit, and it has many registered users. Our aim is to recommend to each user a set of articles that is likely to be of interest to that user. If we can successfully recommend interesting articles, this could be very useful to a wide audience because Wikipedia is very popular. In addition, because wikis are popular media for sharing knowledge, developing effective recommender systems for wikis is important.</Paragraph>
      <Paragraph position="1"> In our Wikipedia dataset, each item (article) x consisted of w_x and u_x. u_x was the sequence of users who had edited x. If users had edited x multiple times, then those users occurred in u_x multiple times. w_x was the sequence of words that were typical of x. To make w_x, we removed stop words and stemmed the remaining words with a Porter stemmer. Next, we identified 100 typical words in each article and extracted only those words (|w_x| ≥ 100 because some of them occurred multiple times). Typicality was measured using the log-likelihood ratio test (Dunning, 1993). We needed to reduce the number of words to speed up our recommender system.</Paragraph>
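A minimal sketch of scoring typical words with Dunning's log-likelihood ratio (G^2), comparing a word's count in one article against the rest of the collection; this is the standard form of the test, and the authors' exact counting scheme is an assumption.

import math

def _ll(k: float, n: float, p: float) -> float:
    # k * log(p) + (n - k) * log(1 - p), with 0 * log(0) treated as 0
    out = 0.0
    if k > 0:
        out += k * math.log(p)
    if n - k > 0:
        out += (n - k) * math.log(1.0 - p)
    return out

def llr(k1: int, n1: int, k2: int, n2: int) -> float:
    # G^2 for counts k1 out of n1 (word in the article) versus
    # k2 out of n2 (word in the rest of the collection)
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (_ll(k1, n1, p1) + _ll(k2, n2, p2)
                  - _ll(k1, n1, p) - _ll(k2, n2, p))

# Example: a word appearing 12 times in a 1,000-token article but only
# 50 times in 10,000,000 remaining tokens receives a large G^2 score.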
      <Paragraph position="2"> To make our dataset, we first extracted 302,606 articles that had more than 100 tokens after the stop words were removed. We then selected the typical words in each article. The implicit rating data were obtained from the histories of users editing these articles. Each rating consisted of {user, article, number of edits}. The size of this original rating data was 3,325,746. From this data, we extracted a dense subset consisting of the users and articles that appeared in at least 25 of the original ratings. We discarded users who had edited more than 999 articles because they were often software robots or system operators, not casual users. The resulting 430,096 ratings consisted of 4,193 users and 9,726 articles. Each user rated (edited) 103 articles on average (the median was 57). The average number of ratings per item was 44 and the median was 36.</Paragraph>
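A minimal sketch of the dense-subset extraction just described; the thresholds come from the text, but whether the filtering was applied once or iterated is not stated, so a single pass is assumed and all names are illustrative.

from collections import Counter
from typing import List, Tuple

Rating = Tuple[str, str, int]   # (user, article, number_of_edits)

def dense_subset(ratings: List[Rating],
                 min_count: int = 25, max_user_items: int = 999) -> List[Rating]:
    # Keep users and articles that appear in at least min_count ratings,
    # then drop users who edited more than max_user_items articles.
    user_n = Counter(u for u, a, c in ratings)
    item_n = Counter(a for u, a, c in ratings)
    kept = [(u, a, c) for u, a, c in ratings
            if user_n[u] >= min_count and item_n[a] >= min_count]
    kept_user_n = Counter(u for u, a, c in kept)
    return [(u, a, c) for u, a, c in kept if kept_user_n[u] <= max_user_items]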
    </Section>
    <Section position="2" start_page="452" end_page="453" type="sub_section">
      <SectionTitle>
3.1.2 Evaluation of Polya model
</SectionTitle>
      <Paragraph position="0"> We conducted a four-fold cross validation of this rating dataset to evaluate the Polya model. We used three-fourths of the dataset to train the model and one-fourth to test it (see footnote 6). (Footnote 6: We needed to estimate the probabilities of users and words. We used only the training data to estimate the probabilities of users. However, we used all 9,726 articles to estimate the probabilities of words because the articles are usually available even when the editing histories of users are not.)</Paragraph>
      <Paragraph position="1"> All users who existed in both the training and test data were used for evaluation. For each user, we regarded the articles in the training data that had been edited by the user as a query and ranked articles in response to it. The ranked top-N articles were then compared to the articles in the test data that were edited by the same user to measure the precisions for the user.</Paragraph>
      <Paragraph position="2"> We used P@N (precision at rank N = the ratio of articles edited by the user among the top-N articles), S@N (success at rank N = 1 if at least one of the top-N articles was edited by the user, else 0), and R-precision (= P@N, where N is the number of articles edited by the user in the test data). These measures for each user were averaged over all users to get the mean precision of each measure. Then, these mean precisions were averaged over the cross-validation repeats. Here, we report the averaged mean precisions with standard deviations. We first report how R-precision varied depending on α (α_o or α_u). α was varied over 10^-5, 0.4, 1.1, 2, 3.3, 5.4, 9, 16.4, 38.8, and 10^5.</Paragraph>
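A minimal sketch of these per-user measures (P@N, S@N, and R-precision); the averaging over users and over cross-validation folds happens outside these helpers, and all names are illustrative.

from typing import List, Set

def precision_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    # P@N: fraction of the top-N recommended articles the user edited in the test data
    top = ranked[:n]
    return sum(1 for a in top if a in relevant) / n

def success_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    # S@N: 1 if at least one of the top-N articles was edited by the user, else 0
    return 1.0 if any(a in relevant for a in ranked[:n]) else 0.0

def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    # R-precision: P@N with N equal to the number of the user's test-data articles
    return precision_at_n(ranked, relevant, len(relevant)) if relevant else 0.0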
      <Paragraph position="3"> The values of ñ(10,α) were approximately 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10, respectively, as shown in Fig. 1. When α = 10^5, the Polya model represents the multinomial model as discussed in Section 2.3. For each value of α, we varied λ (λ_o or λ_u) over 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, and 0.99 to obtain the optimum R-precision. These optimum R-precisions are shown in Fig. 2. In this figure, CBF and CF represent the R-precisions for the content-based and collaborative filtering parts of the Polya model.</Paragraph>
      <Paragraph position="4"> The values of CBF and CF were obtained by setting λ_s = 0 and λ_s = 1 in Eq. 7 (which is applied to the Polya model instead of the multinomial model), respectively. The error bars represent standard deviations.</Paragraph>
      <Paragraph position="5"> We noticed at once that CBF outperformed CF. This is reasonable because the contents of Wikipedia articles should strongly reflect the users' (authors') interests. In addition, each article had about 100 typical words, and this was richer than the average number of users per article (44). This observation contrasts with other work where CBF performed poorly compared with CF, e.g., (Ali and van Stam, 2004).</Paragraph>
      <Paragraph position="6"> Another important observation is that both curves in Fig. 2 are concave. The best R-precisions were obtained at intermediate values of a for both CF and CBF as shown in Table 1.</Paragraph>
      <Paragraph position="7"> When α = 10^5, the Polya model represents the multinomial model as discussed in Section 2.3. Thus, Fig. 2 and Table 1 show that the best R-precisions achieved by the Polya model were better than those obtained by the multinomial model. The improvement was 3.4% for CBF and 17.4% for CF, as shown in Table 1. The improvement for CF was larger than that for CBF. This implies that the occurrences of users are more clustered than those of words. In other words, the degree of repetition in the editing histories of users is greater than that in word sequences. A user who edits an article is likely to edit the article again.</Paragraph>
      <Paragraph position="8"> From Fig. 2 and Table 1, we concluded that the generalization of a multinomial model achieved by the Polya model is effective in improving recommendation performance.</Paragraph>
      <Paragraph position="9"> 3.1.3 Combination of CBF and CF
Next, we show how the combination of CBF and CF improves recommendation performance.</Paragraph>
      <Paragraph position="10"> We set α (α_o and α_u) to the optimum values in Table 1 and varied λ (λ_s, λ_o and λ_u) to obtain the R-precisions for CBF+CF, CBF and CF in Fig. 3.</Paragraph>
      <Paragraph position="11"> The values of CBF were obtained as follows. We first set λ_s = 0 in Eq. 7 to use only CBF scores and then varied λ_o, the smoothing parameter for word probabilities, in Eq. 2. To get the values of CF, we set λ_s = 1 in Eq. 7 and then varied λ_u, the smoothing parameter for user probabilities. The values of CBF+CF were obtained by varying λ_s in Eq. 7 while setting λ_o and λ_u to the optimum values obtained from CBF and CF (see Table 2; the optimum values were λ_s = 0.2, λ_o = 0.01, and λ_u = 0.2). These parameters (λ_s, λ_o and λ_u) were defined in the context of the multinomial model in Section 2.2 and used similarly for the Polya model in this experiment.</Paragraph>
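A minimal sketch of the tuning procedure described here: fix α, sweep one smoothing or mixing parameter over the grid listed in Section 3.1.2, and keep the value with the best R-precision. The evaluate() callback stands in for a full cross-validation run and is an assumption.

from typing import Callable, List, Tuple

LAMBDA_GRID = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
               0.6, 0.7, 0.8, 0.9, 0.95, 0.99]

def best_lambda(evaluate: Callable[[float], float],
                grid: List[float] = LAMBDA_GRID) -> Tuple[float, float]:
    # Return (best lambda, best R-precision) under the given evaluation function
    scores = [(lam, evaluate(lam)) for lam in grid]
    return max(scores, key=lambda t: t[1])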
      <Paragraph position="12"> We can see that the combination was quite effective, as CBF+CF outperformed both CBF and CF. Table 2 shows R-precision, P@N and S@N for N = 5, 10, 15, and 20. These values were obtained by using the optimum values of λ in Fig. 3.</Paragraph>
      <Paragraph position="13"> Table 2 shows the same tendency as Fig. 3. For all values of N, CBF+CF outperformed both CBF and CF. We attribute this effectiveness of the combination to the feature independence of CBF and CF. CBF used words as features and CF used user ratings as features. They are very different kinds of features and thus can provide complementary information. Consequently, CBF+CF can exploit the benefits of both methods. We need to do further work to confirm this conjecture.</Paragraph>
    </Section>
    <Section position="3" start_page="453" end_page="454" type="sub_section">
      <SectionTitle>
3.2 Comparison with a baseline method
</SectionTitle>
      <Paragraph position="0"> We compared the Polya model to an implementation of a state-of-the-art item-based CF method, CProb (Karypis, 2001). CProb has been tested with various datasets and found to be effective in top-N recommendation problems. CProb has also been used in recent work as a baseline method (Ziegler et al., 2005; Wang et al., 2005).</Paragraph>
      <Paragraph position="1"> In addition to the Wikipedia dataset, we used two other datasets for comparison. The first was the 1 million MovieLens dataset. This data consists of 1,000,209 ratings of 3,706 movies by 6,040 users. Each user rated an average of 166 movies (the median was 96). The average number of ratings per movie was 270 and the median was 124. The second was the BookCrossing dataset (Ziegler et al., 2005). This data consists of 1,149,780 ratings of 340,532 books by 105,283 users. From this data, we removed books rated by fewer than 20 users. We also removed users who rated fewer than 5 books. The resulting 296,471 ratings consisted of 10,345 users and 5,943 books. Each user rated 29 books on average (the median was 10). The average number of ratings per book was 50 and the median was 33. Note that in our experiments, we regarded the ratings of these two datasets as implicit ratings, and we regarded the number of occurrences of each rating as one.</Paragraph>
      <Paragraph position="2"> We conducted a four-fold cross validation for each dataset to compare CProb and Polya-CF, which is the collaborative filtering part of the Polya model as described in the previous section.</Paragraph>
      <Paragraph position="3"> For each cross validation repeat, we tuned the parameters of CProb and Polya-CF on the test data to get the optimum R-precisions, in order to compare the best results for these models (see footnote 8). (Footnote 8: For the Wikipedia dataset, Polya-CF has two free parameters (α_u and λ_u). However, for the MovieLens and BookCrossing datasets, Polya-CF has only one free parameter, λ_u, because we regarded the number of occurrences of each rating as one, which means ñ(1, α_u) = 1 for all α_u &gt; 0 (see Fig. 1). Consequently, we do not have to tune α_u. Since the number of free parameters is small, the comparison of performance shown in Table 3 is likely to be reproduced when the parameters are tuned on separate development data instead of test data.) P@N and S@N were calculated with the same parameters. These measures were averaged as described above. R-precision and P@10 are given in Table 3. The maximum standard deviation of these measures was 0.001. We omit the other measures because they had similar tendencies. In Table 3, WP, ML and BX represent the Wikipedia, MovieLens, and BookCrossing datasets.</Paragraph>
      <Paragraph position="4"> In Table 3, we can see that the variation of performance among datasets was greater than that between Polya-CF and CProb.</Paragraph>
      <Paragraph position="5"> Both methods performed best on ML. We think that this is because ML had the densest ratings. The average number of ratings per item was 270 for ML, while that for WP was 44 and that for BX was 50.</Paragraph>
      <Paragraph position="6"> Table 3 also shows that Polya-CF outperformed CProb when the dataset was ML and CProb was better than Polya-CF in the other cases. However, the differences in precision were small. Overall, we can say that the performance of Polya-CF is comparable to that of CProb.</Paragraph>
      <Paragraph position="7"> An important advantage of the Polya model over CProb is that the Polya model can unify CBF and CF in a single language modeling framework while CProb handles only CF. Another advantage of the Polya model is that we can expect to improve its performance by incorporating techniques developed in IR because the Polya model is based on language modeling approaches in IR.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="454" end_page="454" type="metho">
    <SectionTitle>
4 Future work
</SectionTitle>
    <Paragraph position="0"> We want to investigate two areas in our future work. One is parameter estimation, and the other is refinement of the query model.</Paragraph>
    <Paragraph position="1"> We tuned the parameters of the Polya model by exhaustively searching the parameter space guided by R-precision. We also tried to learn α_o and α_u from the training data by using an EM method (Minka, 2003; Yamamoto and Sadamitsu, 2005). However, the estimated parameters were about 0.05, which was too small to give good recommendations.</Paragraph>
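The paper does not spell out the estimation procedure; the following is a hedged sketch of the kind of fixed-point update for a single tied α, in the style of Minka (2003). The exact update form, the tying, and the convergence settings are assumptions; as noted above, such estimates gave values around 0.05 in the authors' experiments.

from typing import Dict, List
from scipy.special import digamma

def estimate_tied_alpha(docs: List[Dict[str, int]], vocab_size: int,
                        alpha: float = 1.0, iters: int = 100) -> float:
    # Fixed-point maximum-likelihood estimate of a tied Polya parameter
    # from per-document word-count dictionaries.
    for _ in range(iters):
        num = sum(digamma(n + alpha) - digamma(alpha)
                  for counts in docs for n in counts.values() if n > 0)
        den = vocab_size * sum(
            digamma(sum(counts.values()) + vocab_size * alpha)
            - digamma(vocab_size * alpha)
            for counts in docs)
        if den <= 0:
            break
        alpha *= num / den
    return alpha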
    <Paragraph position="2"> We need further study to understand the relation between the probabilistic quality (perplexity) of the Polya model and its recommendation quality.</Paragraph>
    <Paragraph position="3"> We approximate the query model by Eq. 3. This allows us to optimize the score calculation considerably. However, it does not consider the interaction among items, which may degrade the quality of probability estimation. We want to investigate more effective query models in our future work.</Paragraph>
  </Section>
</Paper>