<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1650"> <Title>Automatically Assessing Review Helpfulness</Title> <Section position="5" start_page="0" end_page="423" type="metho"> <SectionTitle> 2 Relevant Work </SectionTitle> <Paragraph position="0"> The task of automatically assessing product review helpfulness is related to several broader areas of research: automatic analysis of product reviews, opinion and sentiment analysis, and text classification.</Paragraph> <Paragraph position="1"> In the thriving area of research on automatic analysis and processing of product reviews (Hu and Liu 2004; Turney 2002; Pang and Lee 2005), little attention has been paid to the important task studied here: assessing review helpfulness. Pang and Lee (2005) have studied prediction of product ratings, which may be particularly relevant due to the correlation we find between product rating and the helpfulness of the review (discussed in Section 5). However, a user's overall rating for the product is often already available. Helpfulness, on the other hand, is valuable to assess because it is not explicitly known in current approaches until many users vote on the helpfulness of a review.</Paragraph> <Paragraph position="2"> In opinion and sentiment analysis, the focus is on distinguishing between statements of fact vs. opinion, and on detecting the polarity of sentiments being expressed. Many researchers have worked on various facets of opinion analysis.</Paragraph> <Paragraph position="3"> Pang et al. (2002) and Turney (2002) classified the sentiment polarity of reviews at the document level. Wiebe et al. (1999) classified sentence-level subjectivity using syntactic classes such as adjectives, pronouns and modal verbs as features. Riloff and Wiebe (2003) extracted subjective expressions from sentences using a bootstrapping pattern learning process. Yu and Hatzivassiloglou (2003) identified the polarity of opinion sentences using semantically oriented words.</Paragraph> <Paragraph position="4"> These techniques were applied and examined in different domains, such as customer reviews (Hu and Liu 2004) and news articles (TREC novelty track 2003 and 2004).</Paragraph> <Paragraph position="5"> In text classification, systems typically use bag-of-words models, although there is some evidence of benefits when introducing relevant semantic knowledge (Gabrilovich and Markovitch, 2005). In this paper, we explore the use of some semantic features for review helpfulness ranking. Another potentially relevant classification task is the body of academic and commercial work on detecting email spam messages (see http://www.ceas.cc/ and http://spamconference.org/), which aims to capture a much broader notion of helpfulness. For an SVM-based approach, see (Drucker et al. 1999).</Paragraph> <Paragraph position="6"> Finally, a related area is work on automatic essay scoring, which seeks to rate the quality of an essay (Attali and Burstein 2006; Burstein et al. 2004). The task is important for reducing the human effort required in scoring the large numbers of student essays regularly written for standard tests such as the GRE. The exact scoring approaches developed in commercial systems are often not disclosed. However, more recent work on one of the major systems, e-rater 2.0, has focused on systematizing and simplifying the set of features used (Attali and Burstein 2006). Our choice of features to test was partially influenced by the features discussed by Attali and Burstein.
At the same time, due to differences in the tasks, we did not use features aimed at assessing essay structure, such as discourse structure analysis features. Our observations suggest that even helpful reviews vary widely in their discourse structure. We present the features which we have used below, in Section 3.2.</Paragraph> </Section> <Section position="6" start_page="423" end_page="425" type="metho"> <SectionTitle> 3 Modeling Review Helpfulness </SectionTitle> <Paragraph position="0"> In this section, we formally define the learning task and investigate several features for assessing review helpfulness.</Paragraph> <Section position="1" start_page="423" end_page="423" type="sub_section"> <SectionTitle> 3.1 Task Definition </SectionTitle> <Paragraph position="0"> Formally, given a set of reviews R for a particular product, our task is to rank the reviews according to their helpfulness. We define a review helpfulness function, h, as:</Paragraph> <Paragraph position="1"> h(r ∈ R) = rating+(r) / (rating+(r) + rating-(r))     (1)</Paragraph> <Paragraph position="2"> where rating+(r) is the number of people that will find the review helpful and rating-(r) is the number of people that will find the review unhelpful. For evaluation, we resort to estimates of h from manual review assessments on websites like Amazon.com, as described in Section 4.</Paragraph> </Section> <Section position="2" start_page="423" end_page="425" type="sub_section"> <SectionTitle> 3.2 Features </SectionTitle> <Paragraph position="0"> One aim of this paper is to investigate how well different classes of features capture the helpfulness of a review. We experimented with various features organized in five classes: Structural, Lexical, Syntactic, Semantic, and Meta-data. Below we describe each feature class in turn.</Paragraph> <Paragraph position="1"> Structural Features Structural features are observations of the document structure and formatting. Properties such as review length and average sentence length are hypothesized to relate structural complexity to helpfulness. Also, HTML formatting tags could help in making a review more readable, and consequently more helpful. We experimented with the following features: * Length (LEN): The total number of tokens of the review.</Paragraph> <Paragraph position="2"> * Sentential (SEN): Observations of the sentences, including the number of sentences, the average sentence length, the percentage of question sentences, and the number of exclamation marks.</Paragraph> <Paragraph position="3"> * HTML (HTM): Two features for the number of bold tags <b> and line breaks <br>.</Paragraph> <Paragraph position="4"> Lexical Features Lexical features capture the words observed in the reviews. We experimented with two sets of features: * Unigram (UGR): The tf-idf statistic of each word occurring in a review.</Paragraph> <Paragraph position="5"> * Bigram (BGR): The tf-idf statistic of each bigram occurring in a review.</Paragraph> <Paragraph position="6"> For both unigrams and bigrams, we used lemmatized words from a syntactic analysis of the reviews and computed the tf-idf statistic (Salton and McGill 1983) using the following formula:</Paragraph> <Paragraph position="7"> tf-idf(w) = (count of w in the review / N) × log(number of reviews / number of reviews containing w)</Paragraph> <Paragraph position="8"> where N is the number of tokens in the review.</Paragraph>
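As an illustration of how these lexical features could be computed, here is a minimal sketch in Python. It is our own illustrative code rather than the authors' implementation, and it assumes the standard tf-idf formulation given above, with term frequency normalized by the review length N; the variable names are ours.

```python
import math
from collections import Counter

def tfidf_features(reviews_tokens):
    """Illustrative sketch: one tf-idf vector per review.

    reviews_tokens: list of token lists, one per (lemmatized) review.
    Term frequency is normalized by N, the number of tokens in the review;
    idf is log(#reviews / #reviews containing the term).
    """
    num_reviews = len(reviews_tokens)
    doc_freq = Counter()
    for tokens in reviews_tokens:
        doc_freq.update(set(tokens))  # count each review at most once per term

    vectors = []
    for tokens in reviews_tokens:
        n = len(tokens) or 1
        counts = Counter(tokens)
        vectors.append({w: (c / n) * math.log(num_reviews / doc_freq[w])
                        for w, c in counts.items()})
    return vectors

# Bigram (BGR) features can be built the same way by replacing each token
# list with its list of adjacent token pairs.
```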
<Paragraph position="9"> Syntactic Features Syntactic features aim to capture the linguistic properties of the review. We grouped them into the following feature set: * Syntax (SYN): Includes the percentage of parsed tokens that are open-class (i.e., nouns, verbs, adjectives and adverbs), the percentage of tokens that are nouns, the percentage of tokens that are verbs, the percentage of tokens that are verbs conjugated in the first person, and the percentage of tokens that are adjectives or adverbs. Reviews are analyzed using the Minipar dependency parser (Lin 1994).</Paragraph> <Paragraph position="10"> Semantic Features Most online reviews are fairly short; their sparsity suggests that bigram features will not perform well (which is supported by our experiments described in Section 5.3). Although semantic features have rarely been effective in many text classification problems (Moschitti and Basili 2004), there is reason here to hypothesize that a specialized vocabulary of important words might help with the sparsity. We hypothesized that good reviews will often contain: i) references to the features of a product (e.g., the LCD and resolution of a digital camera), and ii) mentions of sentiment words (i.e., words that express an opinion, such as &quot;great screen&quot;). Below we describe two families of features that capture these semantic observations within the reviews: * Product-Feature (PRF): The features of products that occur in the review, e.g., the capacity of an MP3 player and the zoom of a digital camera. This feature counts the number of lexical matches that occur in the review for each product feature. There is no trivial way of obtaining a list of all the features of a product. In Section 5.1 we describe a method for automatically extracting product features from Pro/Con listings from Epinions.com. Our assumption is that pro/con entries list the features that are important to customers (and hence should be part of a helpful review).</Paragraph> <Paragraph position="11"> * General-Inquirer (GIW): Positive and negative sentiment words describing products or product features (e.g., &quot;amazing sound quality&quot; and &quot;weak zoom&quot;). The intuition is that reviews that analyze product features are more helpful than those that do not. We try to capture this analysis by extracting sentiment words using the publicly available list of positive and negative sentiment words from the General Inquirer.</Paragraph>
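To make the two semantic feature families concrete, the following sketch counts lexical matches of a review against a product-feature lexicon (PRF) and a sentiment-word lexicon (GIW). This is our own illustration, not the paper's code; the example lexicons and the simple whole-phrase matching are assumptions.

```python
def count_lexicon_matches(review_tokens, lexicon):
    """Illustrative sketch: number of occurrences of each lexicon entry
    (single- or multi-word) in a lemmatized, tokenized review."""
    text = " " + " ".join(t.lower() for t in review_tokens) + " "
    counts = {}
    for entry in lexicon:
        needle = " " + " ".join(entry.lower().split()) + " "
        counts[entry] = text.count(needle)
    return counts

# Hypothetical usage: PRF counts against product features mined from
# pro/con lists, GIW counts against General Inquirer sentiment words.
product_features = ["battery life", "zoom", "lcd", "capacity"]
sentiment_words = ["amazing", "weak", "great"]
review = "great zoom and amazing battery life but weak lcd".split()
prf_counts = count_lexicon_matches(review, product_features)
giw_counts = count_lexicon_matches(review, sentiment_words)
```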
<Paragraph position="12"> Meta-data Features Unlike the previous four feature classes, meta-data features capture observations that are independent of the text (i.e., unrelated to linguistic features). We consider the following feature: * Stars (STR): Most websites require reviewers to include an overall rating for the products that they review (e.g., star ratings in Amazon.com). This feature set includes the rating score (STR1) as well as the absolute value of the difference between the rating score and the average rating score given by all reviewers (STR2).</Paragraph> <Paragraph position="13"> We differentiate meta-data features from semantic features since they require external knowledge that may not be available from certain review sites. Nowadays, however, most sites that collect user reviews also collect some form of product rating (e.g., Amazon.com, Overstock.com, and Apple.com).</Paragraph> </Section> </Section> <Section position="7" start_page="425" end_page="425" type="metho"> <SectionTitle> 4 Ranking System </SectionTitle> <Paragraph position="0"> In this paper, we estimate the helpfulness function in Equation 1 using user ratings extracted from Amazon.com, where rating+(r) is the number of unique users that rated the review r as helpful and rating-(r) is the number of unique users that rated r as unhelpful. Reviews from Amazon.com form a gold standard labeled dataset of {review, h(review)} pairs that can be used to train a supervised machine learning algorithm. In this paper, we applied an SVM (Vapnik 1995) package to the features extracted from reviews to learn the function h. Two natural options for learning helpfulness according to Equation 1 are SVM Regression and SVM Ranking (Joachims 2002). Though learning to rank according to helpfulness requires only SVM Ranking, the helpfulness function provides non-uniform differences between ranks in the training set, which SVM Ranking would ignore. Also, in practice, many products have only one review, which can serve as training data for SVM Regression but not SVM Ranking. Furthermore, in large sites such as Amazon.com, when new reviews are written it is inefficient to re-rank all previously ranked reviews. We therefore choose SVM Regression in this paper. We describe the exact implementation in Section 5.1.</Paragraph> <Paragraph position="1"> After the SVM is trained, for a given product and its set of reviews R, we rank the reviews of R in decreasing order of h(r), r ∈ R.</Paragraph> <Paragraph position="2"> Table 1 shows four sample reviews for the iPod Photo 20GB product from Amazon.com, their total number of helpful and unhelpful votes, as well as their rank according to the helpfulness score h from both the gold standard from Amazon.com and using the SVM prediction of our best performing system described in Section 5.2.</Paragraph> </Section>
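A minimal end-to-end sketch of this setup is shown below: compute the gold-standard h(r) from helpfulness votes (Equation 1), fit a regression model, and rank a product's reviews by the predicted score. This is our own illustration; scikit-learn's SVR stands in for the SVM regression package used in the paper, and the feature extraction is left as a stub.

```python
# Illustrative sketch only (assumption: scikit-learn's SVR as a stand-in for
# the SVM regression tool used in the paper; featurize() is a stub).
import numpy as np
from sklearn.svm import SVR

def gold_helpfulness(helpful_votes, unhelpful_votes):
    """Equation 1: h(r) = rating+(r) / (rating+(r) + rating-(r))."""
    total = helpful_votes + unhelpful_votes
    return helpful_votes / total if total > 0 else 0.0

def rank_reviews(model, reviews, featurize):
    """Rank one product's reviews in decreasing order of predicted h(r)."""
    scores = model.predict(np.array([featurize(r) for r in reviews]))
    order = np.argsort(-scores)
    return [reviews[i] for i in order]

# Training (sketch): reviews with at least five helpfulness votes form the
# labeled data {review, h(review)}.
# X = np.array([featurize(r) for r in train_reviews])
# y = np.array([gold_helpfulness(r.helpful, r.unhelpful) for r in train_reviews])
# model = SVR(kernel="rbf").fit(X, y)
```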
<Section position="8" start_page="425" end_page="426" type="metho"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> We empirically evaluate our review model and ranking system, described in Section 3 and Section 4, by comparing the performance of various feature combinations on products mined from Amazon.com. Below, we describe our experimental setup, present our results, and analyze system performance.</Paragraph> <Section position="1" start_page="425" end_page="426" type="sub_section"> <SectionTitle> 5.1 Experimental Setup </SectionTitle> <Paragraph position="0"> We describe below the datasets that we extracted from Amazon.com, the implementation of our SVM system, and the method we used for extracting features of reviews.</Paragraph> <Paragraph position="1"> Extraction and Preprocessing of Datasets We focused our experiments on two product categories from Amazon.com: MP3 Players and Digital Cameras.</Paragraph> <Paragraph position="2"> Using the Amazon Web Services API, we collected reviews associated with all products in the MP3 Players and Digital Cameras categories. For MP3 Players, we collected 821 products and 33,016 reviews; for Digital Cameras, we collected 1,104 products and 26,189 reviews.</Paragraph> <Paragraph position="3"> On most retailer websites, including Amazon.com, duplicate reviews are quite frequent; they skew statistics and can greatly affect a learning algorithm. Looking for exact string matches between reviews is not a sufficient filter since authors of duplicated reviews often make small changes to the reviews to avoid detection. We built a simple filter that compares the distribution of word bigrams across each pair of reviews. A pair is deemed a duplicate if more than 80% of their bigrams match.</Paragraph> <Paragraph position="4"> Also, whole products can be duplicated. For different versions of a product, such as iPods that come in black or white models, reviews on Amazon.com are duplicated between the versions. We filter out complete products for which every review is detected as a duplicate of a review of another product (i.e., only one iPod version is retained).</Paragraph> <Paragraph position="5"> The filtering of duplicate products and duplicate reviews discarded 85 products and 12,097 reviews for MP3 Players and 38 products and 3,692 reviews for Digital Cameras.</Paragraph> <Paragraph position="6"> In order to have accurate estimates for the helpfulness function in Equation 1, we filtered out any review that did not receive at least five user ratings (i.e., reviews for which fewer than five users voted helpful or unhelpful are filtered out). This filtering was performed before duplicate detection and discarded 45.7% of the MP3 Players reviews and 32.7% of the Digital Cameras reviews.</Paragraph> <Paragraph position="7"> Table 2 describes statistics for the final datasets after the filtering steps. 10% of the products for both datasets were withheld as development corpora and the remaining 90% were randomly sorted into 10 sets for 10-fold cross validation.</Paragraph> </Section> </Section> <Section position="9" start_page="426" end_page="428" type="metho"> <SectionTitle> SVM Regression </SectionTitle> <Paragraph position="0"> For our regression model, we deployed the state-of-the-art SVM regression tool SVMlight (Joachims 1999). On the development sets, we tested various kernels, including linear, polynomial (degrees 2, 3, and 4), and radial basis function (RBF). The best performing kernel was RBF, and we report only these results in this paper (performance was measured using Spearman's correlation coefficient, described in Section 5.2). We tuned the RBF kernel parameters C (the penalty parameter) and γ (the kernel width hyperparameter) by performing a full grid search over the 110 combinations of exponentially spaced parameter pairs (C, γ) following (Hsu et al. 2003).</Paragraph>
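For concreteness, here is a sketch of this kind of exponential grid search, using scikit-learn's SVR in place of SVMlight. The grid ranges below (11 values of C by 10 values of γ, i.e., 110 pairs) follow the ranges commonly suggested by Hsu et al. (2003) but are an assumption on our part, and the development-set scoring is simplified to a single Spearman correlation rather than a per-product average.

```python
# Illustrative sketch only: exponential (C, gamma) grid search for an
# RBF-kernel SVR, scored by Spearman correlation on a development split.
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

def grid_search_rbf(X_train, y_train, X_dev, y_dev):
    best_params, best_rho = None, -2.0
    for log2_c in range(-5, 16, 2):        # C = 2^-5, 2^-3, ..., 2^15  (11 values)
        for log2_g in range(-15, 4, 2):    # gamma = 2^-15, ..., 2^3    (10 values)
            model = SVR(kernel="rbf", C=2.0 ** log2_c, gamma=2.0 ** log2_g)
            model.fit(X_train, y_train)
            rho = spearmanr(model.predict(X_dev), y_dev).correlation
            if rho > best_rho:
                best_params, best_rho = (2.0 ** log2_c, 2.0 ** log2_g), rho
    return best_params, best_rho
```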
<Section position="1" start_page="426" end_page="426" type="sub_section"> <SectionTitle> Feature Extraction </SectionTitle> <Paragraph position="0"> To extract the features described in Section 3.2, we preprocessed each review using the Minipar dependency parser (Lin 1994). We used the parser tokenization, sentence breaker, and syntactic categorizations to generate the Length, Sentential, Unigram, Bigram, and Syntax feature sets.</Paragraph> <Paragraph position="1"> In order to count the occurrences of product features for the Product-Feature set, we developed an automatic way of mining references to product features from Epinions.com. On this website, user-generated product reviews include explicit lists of pros and cons, describing the best and worst aspects of a product. For example, for MP3 players, we found the pro &quot;belt clip&quot; and the con &quot;Useless FM tuner&quot;. Our assumption is that the pro/con lists tend to contain references to the product features that are important to customers, and hence their occurrence in a review may correlate with review helpfulness. We filtered out all single-word entries that were infrequently seen (e.g., hold, ever). After splitting and filtering the pro/con lists, we were left with a total of 9,110 unique features for MP3 Players and 13,991 unique features for Digital Cameras.</Paragraph> <Paragraph position="2"> The Stars feature set was created directly from the star ratings given by each author of an Amazon.com review.</Paragraph> <Paragraph position="3"> For each feature measurement f, we applied the standard transformation ln(f + 1) and then scaled each feature to [0, 1], as suggested in (Hsu et al. 2003).</Paragraph> <Paragraph position="4"> We experimented with various combinations of feature sets. Our results tables use the abbreviations presented in Section 3.2. For brevity, we report the combinations which contributed to our best performing system and those that help assess the power of the different feature classes in capturing helpfulness.</Paragraph> </Section> <Section position="2" start_page="426" end_page="427" type="sub_section"> <SectionTitle> 5.2 Ranking Performance </SectionTitle> <Paragraph position="0"> Evaluating the quality of a particular ranking is difficult since certain ranking intervals can be more important than others (e.g., top-10 versus bottom-10). We adopt the Spearman rank correlation coefficient ρ (Spearman 1904) since it is the most commonly used measure of correlation between two sets of ranked data points.</Paragraph> <Paragraph position="1"> For each fold in our 10-fold cross-validation experiments, we trained our SVM system using 9 folds. For the remaining test fold, we ranked each product's reviews according to the SVM prediction (described in Section 4) and computed the ρ correlation between this ranking and the gold standard ranking from the test fold. We used the version of Spearman's correlation coefficient that allows for ties in rankings; see Siegel and Castellan (1988) for more on alternate rank statistics such as Kendall's tau.</Paragraph> <Paragraph position="2"> Although our task definition is to learn review rankings according to helpfulness, as an intermediate step the SVM system learns to predict the absolute helpfulness score for each review. To test the correlation of this score against the gold standard, we computed the standard Pearson correlation coefficient.</Paragraph>
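A compact sketch of this evaluation protocol is given below: the mean per-product Spearman correlation between predicted and gold rankings, plus the Pearson correlation of the raw predicted scores. This is our own illustration; scipy's tie-aware (average-rank) spearmanr matches the tie-handling variant mentioned above, and the function and variable names are ours.

```python
# Illustrative sketch only: per-fold evaluation of ranking (Spearman) and
# absolute-score prediction (Pearson); names are assumptions.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_fold(products, predict):
    """products: list of (gold_scores, feature_matrix) pairs, one per product.
    predict: maps a feature matrix to predicted helpfulness scores."""
    rhos, all_gold, all_pred = [], [], []
    for gold, X in products:
        pred = predict(X)
        if len(gold) > 1:                          # correlation needs >= 2 reviews
            rhos.append(spearmanr(pred, gold).correlation)
        all_gold.extend(gold)
        all_pred.extend(pred)
    mean_spearman = float(np.nanmean(rhos))        # average over products
    pearson = pearsonr(all_pred, all_gold)[0]      # absolute-score correlation
    return mean_spearman, pearson
```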
<Paragraph position="3"> Results show that the highest performing feature combination consisted of the Length, the Unigram, and the Stars feature sets. Table 3 reports the evaluation results for every combination of these features with 95% confidence bounds. Taken alone, none of the three features was significantly better than the others. Examining each pairwise combination, only the combination of Length with Stars outperformed the others. Surprisingly, adding unigram features to this combination had little effect for the MP3 Players. Given our list of features defined in Section 3.2, the helpfulness of reviews is best captured with a combination of the Length and Stars features. Training an RBF-kernel SVM regression model does not necessarily make clear the exact relationship between input and output variables. To investigate this relationship between length and helpfulness, we inspected their Pearson correlation coefficient, which was 0.45. Users indeed tend to find short reviews less helpful than longer ones: out of the 5,247 reviews for MP3 Players that contained more than 1000 characters, the average gold standard helpfulness score was 82%; the 204 reviews with fewer than 100 characters had on average a score of 23%. (Recall that the gold standard is extracted directly from user helpfulness votes on Amazon.com; see Section 4.) The explicit product rating (Stars) is also an indicator of review helpfulness, with a Pearson correlation coefficient of 0.48.</Paragraph> <Paragraph position="4"> The low Pearson correlations of Table 3 compared to the Spearman correlations suggest that we can learn the ranking without perfectly learning the function itself. To investigate this, we tested the ability of SVM regression to recover the target helpfulness score, given the score itself as the only feature. The Spearman correlation for this test was a perfect 1.0. Interestingly, the Pearson correlation was only 0.798, suggesting that the RBF kernel does learn the helpfulness ranking without learning the function exactly.</Paragraph> </Section> <Section position="3" start_page="427" end_page="428" type="sub_section"> <SectionTitle> 5.3 Results Analysis </SectionTitle> <Paragraph position="0"> Table 3 shows only the feature combinations of our highest performing system. In Table 4, we report several other feature combinations to show why we selected certain features and what effect our five feature classes presented in Section 3.2 had.</Paragraph> <Paragraph position="1"> In the first block of six feature combinations in Table 4, we show that the unigram features outperform the bigram features, which seem to be suffering from the data sparsity of the short reviews. Also, unigram features seem to subsume the information carried in our semantic features Product-Feature (PRF) and General-Inquirer (GIW). Although both PRF and GIW perform well as standalone features, when combined with unigrams there is little performance difference (for MP3 Players we see a small but insignificant decrease in performance, whereas for Digital Cameras we see a small but insignificant improvement). Recall that PRF and GIW are simply subsets of review words that are found to be product features or sentiment words. The learning algorithm seems to discover on its own which words are most important in a review and does not use additional knowledge about the meaning of the words (at least not the semantics contained in PRF and GIW).</Paragraph> <Paragraph position="2"> We tested two different versions of the Stars feature: i) the raw star rating, STR1; and ii) the absolute difference between the star rating and the average rating given by all reviewers, STR2. The second block of feature combinations in Table 4 shows that neither is significantly better than the other, so we chose STR1 for our best performing system. Our experiments also revealed that our structural features Sentential and HTML, as well as our syntactic feature set, Syntax, did not show any significant improvement in system performance.</Paragraph> <Paragraph position="3"> In the last block of feature combinations in Table 4, we report the performance of our best performing features (Length, Unigram, and Stars) along with these other features. Though none of these features causes a performance deterioration, none of them significantly improves performance.</Paragraph>
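The comparisons in Table 4 amount to an ablation loop over feature-set combinations; the sketch below shows the general protocol. It is purely illustrative: the feature-set list is a small example rather than the full Table 4, and build_matrices and evaluate_fold are assumed helper functions.

```python
# Illustrative sketch only: evaluate mean Spearman correlation for several
# feature-set combinations under 10-fold cross-validation.
# build_matrices and evaluate_fold are assumed helpers, not from the paper.
from itertools import combinations

FEATURE_SETS = ["LEN", "UGR", "STR1"]   # example subsets, not the full Table 4

def ablation(folds, build_matrices, evaluate_fold):
    """folds: iterable of (train_products, test_products);
    build_matrices(products, active): features restricted to the active sets;
    evaluate_fold(train, test): trains a regressor, returns mean Spearman."""
    results = {}
    for k in range(1, len(FEATURE_SETS) + 1):
        for active in combinations(FEATURE_SETS, k):
            scores = [evaluate_fold(build_matrices(train, active),
                                    build_matrices(test, active))
                      for train, test in folds]
            results[active] = sum(scores) / len(scores)
    return results
```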
</Section> <Section position="4" start_page="428" end_page="428" type="sub_section"> <SectionTitle> 5.4 Discussion </SectionTitle> <Paragraph position="0"> In this section, we discuss the broader implications and potential impacts of our work, and possible connections with other research directions.</Paragraph> <Paragraph position="1"> The usefulness of the Stars feature for determining review helpfulness suggests the need for developing automatic methods for assessing product ratings, e.g., (Pang and Lee 2005).</Paragraph> <Paragraph position="2"> Our findings focus on predictors of helpfulness for reviews of tangible consumer products (consumer electronics). Helpfulness is also solicited and tracked for reviews of many other types of entities: restaurants (citysearch.com), films (imdb.com), open-source software modules (cpanratings.perl.org), and countless others. Our findings on the importance of Length, Unigrams, and Stars may provide a basis of comparison for assessing the helpfulness of reviews of other entity types.</Paragraph> <Paragraph position="3"> Our work represents an initial step in assessing helpfulness. In the future, we plan to investigate other possible indicators of helpfulness such as a reviewer's reputation, the use of comparatives (e.g., more and better than), and references to other products.</Paragraph> <Paragraph position="4"> Taken further, this work may have interesting connections to work on personalization, social networks, and recommender systems, for instance by identifying the reviews that a particular user would find helpful.</Paragraph> <Paragraph position="5"> Our work on helpfulness of reviews also has potential applications to work on automatic generation of review information, by providing a way to assess the helpfulness of automatically generated reviews. Work on generation of reviews includes review summarization and extraction of useful reviews from blogs and other mixed texts.</Paragraph> </Section> </Section> <Section position="10" start_page="428" end_page="429" type="metho"> <SectionTitle> Table 4 </SectionTitle> <Paragraph position="0"> [Table 4 (body not recovered): Spearman correlations of the feature combinations for ranking the reviews of MP3 Players and Digital Cameras on Amazon.com according to helpfulness. The first six lines suggest that unigrams subsume the semantic features; the next two support the use of the raw counts of product ratings (stars) rather than the distance of this count from the average rating; the final six investigate the importance of auxiliary feature sets. 95% confidence bounds are calculated using 10-fold cross-validation.]</Paragraph> </Section> </Paper>