<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1053">
  <Title>Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Classifying Reviews
</SectionTitle>
    <Paragraph position="0"> &amp; Wiebe, 2000; Wiebe, 2000; Wiebe et al., 2001).</Paragraph>
    <Paragraph position="1"> However, although an isolated adjective may indicate subjectivity, there may be insufficient context to determine semantic orientation. For example, the adjective &amp;quot;unpredictable&amp;quot; may have a negative orientation in an automotive review, in a phrase such as &amp;quot;unpredictable steering&amp;quot;, but it could have a positive orientation in a movie review, in a phrase such as &amp;quot;unpredictable plot&amp;quot;. Therefore the algorithm extracts two consecutive words, where one member of the pair is an adjective or an adverb and the second provides context.</Paragraph>
    <Paragraph position="2"> First a part-of-speech tagger is applied to the review (Brill, 1994).3 Two consecutive words are extracted from the review if their tags conform to any of the patterns in Table 1. The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are verbs.4 The second pattern, for example, means that two consecutive words are extracted if the first word is an adverb and the second word is an adjective, but the third word (which is not extracted) cannot be a noun.</Paragraph>
    <Paragraph position="3"> NNP and NNPS (singular and plural proper nouns) are avoided, so that the names of the items in the review cannot influence the classification.</Paragraph>
    <Paragraph position="4">  1. JJ NN or NNS anything 2. RB, RBR, or RBS JJ not NN nor NNS 3. JJ JJ not NN nor NNS 4. NN or NNS JJ not NN nor NNS 5. RB, RBR, or</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
RBS
</SectionTitle>
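    <Paragraph> To make the extraction step concrete, the following is a minimal Python sketch (mine, not the paper's code) of the Table 1 patterns. It assumes tagged is a list of (word, tag) pairs from any Penn-Treebank-style part-of-speech tagger; the paper uses Brill's tagger.

      # Patterns from Table 1: (first-word tags, second-word tags,
      # tags forbidden for the third word, which is never extracted).
      PATTERNS = [
          ({"JJ"},               {"NN", "NNS"},                set()),          # 1
          ({"RB", "RBR", "RBS"}, {"JJ"},                       {"NN", "NNS"}),  # 2
          ({"JJ"},               {"JJ"},                       {"NN", "NNS"}),  # 3
          ({"NN", "NNS"},        {"JJ"},                       {"NN", "NNS"}),  # 4
          ({"RB", "RBR", "RBS"}, {"VB", "VBD", "VBN", "VBG"},  set()),          # 5
      ]

      def extract_phrases(tagged):
          """Return (phrase, tags) pairs for two-word phrases matching Table 1."""
          phrases = []
          for i in range(len(tagged) - 1):
              w1, t1 = tagged[i]
              w2, t2 = tagged[i + 1]
              # Tag of the (unextracted) third word; None at the end of the text.
              t3 = tagged[i + 2][1] if i + 2 in range(len(tagged)) else None
              for first, second, banned_third in PATTERNS:
                  if t1 in first and t2 in second and t3 not in banned_third:
                      phrases.append((w1 + " " + w2, t1 + " " + t2))
                      break
          return phrases

      # Example: "unpredictable steering" matches pattern 1 (JJ NN).
      print(extract_phrases([("unpredictable", "JJ"), ("steering", "NN")]))

Note that proper-noun tags (NNP, NNPS) appear in no pattern, so names of reviewed items are excluded automatically, as described above.</Paragraph>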
    <Paragraph position="0"> VB, VBD, VBN, or VBG anything The second step is to estimate the semantic orientation of the extracted phrases, using the PMI-IR algorithm. This algorithm uses mutual information as a measure of the strength of semantic association between two words (Church &amp; Hanks, 1989). PMI-IR has been empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL), obtaining a score of 74% (Turney, 2001). For comparison, Latent Semantic Analysis (LSA), another statistical measure of word association, attains a score of 64% on the</Paragraph>
    <Paragraph position="2"> Here, p(word1 &amp; word2) is the probability that word1 and word2 co-occur. If the words are statistically independent, then the probability that they co-occur is given by the product p(word1) p(word2). The ratio between p(word1 &amp; word2) and p(word1) p(word2) is thus a measure of the degree of statistical dependence between the words. The log of this ratio is the amount of information that we acquire about the presence of one of the words when we observe the other.</Paragraph>
    <Paragraph position="3"> The Semantic Orientation (SO) of a phrase, phrase, is calculated here as follows:</Paragraph>
    <Paragraph position="5"> The reference words &amp;quot;excellent&amp;quot; and &amp;quot;poor&amp;quot; were chosen because, in the five star review rating system, it is common to define one star as &amp;quot;poor&amp;quot; and five stars as &amp;quot;excellent&amp;quot;. SO is positive when phrase is more strongly associated with &amp;quot;excellent&amp;quot; and negative when phrase is more strongly associated with &amp;quot;poor&amp;quot;.</Paragraph>
    <Paragraph position="6"> PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents). The following experiments use the AltaVista Advanced Search engine5, which indexes approximately 350 million web pages (counting only those pages that are in English). I chose AltaVista because it has a NEAR operator. The AltaVista NEAR operator constrains the search to documents that contain the words within ten words of one another, in either order. Previous work has shown that NEAR performs better than AND when measuring the strength of semantic association between words (Turney, 2001).</Paragraph>
    <Paragraph position="7"> Let hits(query) be the number of hits returned, given the query query. The following estimate of SO can be derived from equations (1) and (2) with</Paragraph>
    <Paragraph position="9"> Equation (3) is a log-odds ratio (Agresti, 1996).</Paragraph>
    <Paragraph position="10"> To avoid division by zero, I added 0.01 to the hits.</Paragraph>
    <Paragraph position="11"> I also skipped phrase when both hits(phrase NEAR &amp;quot;excellent&amp;quot;) and hits(phrase NEAR &amp;quot;poor&amp;quot;) were (simultaneously) less than four. These numbers (0.01 and 4) were arbitrarily chosen. To eliminate any possible influence from the testing data, I added &amp;quot;AND (NOT host:epinions)&amp;quot; to every query, which tells AltaVista not to include the Epinions Web site in its searches.</Paragraph>
    <Paragraph position="12"> The third step is to calculate the average semantic orientation of the phrases in the given review and classify the review as recommended if the average is positive and otherwise not recommended.</Paragraph>
    <Paragraph position="13"> Table 2 shows an example for a recommended review and Table 3 shows an example for a not recommended review. Both are reviews of the Bank of America. Both are in the collection of 410 reviews from Epinions that are used in the experiments in Section 4.</Paragraph>
    <Paragraph position="14">  online experience JJ NN 2.253 low fees JJ NNS 0.333 local branch JJ NN 0.421 small part JJ NN 0.053 online service JJ NN 2.780 printable version JJ NN -0.705 direct deposit JJ NN 1.288 well other RB JJ 0.237 inconveniently located RB VBN -1.541 other bank JJ NN -0.850 true service JJ NN -0.732 Average Semantic Orientation 0.322</Paragraph>
    <Paragraph> (Note: the semantic orientation in Tables 2 and 3 is calculated using the natural logarithm (base e), rather than base 2, because the natural log is more common in the literature on the log-odds ratio. Since all logarithms are equivalent up to a constant factor, the choice of base makes no difference for the algorithm.)</Paragraph>
    <Paragraph> Table 3: An example of the processing of a review that the author has classified as not recommended.

      Extracted Phrase        Part-of-Speech Tags   Semantic Orientation
      little difference       JJ NN                 -1.615
      clever tricks           JJ NNS                -0.040
      programs such           NNS JJ                 0.117
      possible moment         JJ NN                 -0.668
      unethical practices     JJ NNS                -8.484
      low funds               JJ NNS                -6.843
      old man                 JJ NN                 -2.566
      other problems          JJ NNS                -2.748
      probably wondering      RB VBG                -1.830
      virtual monopoly        JJ NN                 -2.050
      other bank              JJ NN                 -0.850
      extra day               JJ NN                 -0.286
      direct deposits         JJ NNS                 5.771
      online web              JJ NN                  1.936
      cool thing              JJ NN                  0.395
      very handy              RB JJ                  1.349
      lesser evil             RBR JJ                -2.288</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> This work is most closely related to Hatzivassiloglou and McKeown's (1997) work on predicting the semantic orientation of adjectives. They note that there are linguistic constraints on the semantic orientations of adjectives in conjunctions. As an example, they present the following three sen- null tences (Hatzivassiloglou &amp; McKeown, 1997): 1. The tax proposal was simple and well-received by the public.</Paragraph>
    <Paragraph position="1"> 2. The tax proposal was simplistic but well-received by the public.</Paragraph>
    <Paragraph position="2"> 3. (*) The tax proposal was simplistic and well-received by the public.</Paragraph>
    <Paragraph position="3">  The third sentence is incorrect, because we use &amp;quot;and&amp;quot; with adjectives that have the same semantic orientation (&amp;quot;simple&amp;quot; and &amp;quot;well-received&amp;quot; are both positive), but we use &amp;quot;but&amp;quot; with adjectives that have different semantic orientations (&amp;quot;simplistic&amp;quot; is negative).</Paragraph>
    <Paragraph position="4"> Hatzivassiloglou and McKeown (1997) use a four-step supervised learning algorithm to infer the semantic orientation of adjectives from constraints  on conjunctions: 1. All conjunctions of adjectives are extracted from the given corpus.</Paragraph>
    <Paragraph position="5"> 2. A supervised learning algorithm combines multiple sources of evidence to label pairs of adjectives as having the same semantic orientation or different semantic orientations. The result is a graph where the nodes are adjectives and links indicate sameness or difference of semantic orientation.</Paragraph>
    <Paragraph position="6"> 3. A clustering algorithm processes the graph structure to produce two subsets of adjectives, such that links across the two subsets are mainly different-orientation links, and links inside a subset are mainly same-orientation links. 4. Since it is known that positive adjectives  tend to be used more frequently than negative adjectives, the cluster with the higher average frequency is classified as having positive semantic orientation.</Paragraph>
    <Paragraph position="7"> This algorithm classifies adjectives with accuracies ranging from 78% to 92%, depending on the amount of training data that is available. The algorithm can go beyond a binary positive-negative distinction, because the clustering algorithm (step 3 above) can produce a &amp;quot;goodness-of-fit&amp;quot; measure that indicates how well an adjective fits in its assigned cluster.</Paragraph>
    <Paragraph position="8"> Although they do not consider the task of classifying reviews, it seems their algorithm could be plugged into the classification algorithm presented in Section 2, where it would replace PMI-IR and equation (3) in the second step. However, PMI-IR is conceptually simpler, easier to implement, and it can handle phrases and adverbs, in addition to isolated adjectives.</Paragraph>
    <Paragraph position="9"> As far as I know, the only prior published work on the task of classifying reviews as thumbs up or down is Tong's (2001) system for generating sentiment timelines. This system tracks online discussions about movies and displays a plot of the number of positive sentiment and negative sentiment messages over time. Messages are classified by looking for specific phrases that indicate the sentiment of the author towards the movie (e.g., &amp;quot;great acting&amp;quot;, &amp;quot;wonderful visuals&amp;quot;, &amp;quot;terrible score&amp;quot;, &amp;quot;uneven editing&amp;quot;). Each phrase must be manually added to a special lexicon and manually tagged as indicating positive or negative sentiment. The lexicon is specific to the domain (e.g., movies) and must be built anew for each new domain. The company Mindfuleye7 offers a technology called Lexant(TM) that appears similar to Tong's (2001) system.</Paragraph>
    <Paragraph position="10"> Other related work is concerned with determining subjectivity (Hatzivassiloglou &amp; Wiebe, 2000; Wiebe, 2000; Wiebe et al., 2001). The task is to distinguish sentences that present opinions and evaluations from sentences that objectively present factual information (Wiebe, 2000). Wiebe et al.</Paragraph>
    <Paragraph position="11"> (2001) list a variety of potential applications for automated subjectivity tagging, such as recognizing &amp;quot;flames&amp;quot; (Spertus, 1997), classifying email, recognizing speaker role in radio broadcasts, and mining reviews. In several of these applications, the first step is to recognize that the text is subjective and then the natural second step is to determine the semantic orientation of the subjective text. For example, a flame detector cannot merely detect that a newsgroup message is subjective, it must further detect that the message has a negative semantic orientation; otherwise a message of praise could be classified as a flame.</Paragraph>
    <Paragraph position="12"> Hearst (1992) observes that most search engines focus on finding documents on a given topic, but do not allow the user to specify the directionality of the documents (e.g., is the author in favor of, neutral, or opposed to the event or item discussed in the document?). The directionality of a document is determined by its deep argumentative structure, rather than a shallow analysis of its adjectives. Sentences are interpreted metaphorically in terms of agents exerting force, resisting force, and overcoming resistance. It seems likely that there could be some benefit to combining shallow and deep analysis of the text.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> Table 4 describes the 410 reviews from Epinions that were used in the experiments. 170 (41%) of the reviews are not recommended and the remaining 240 (59%) are recommended. Always guessing the majority class would yield an accuracy of 59%.</Paragraph>
    <Paragraph position="1"> The third column shows the average number of phrases that were extracted from the reviews.</Paragraph>
    <Paragraph position="2"> Table 5 shows the experimental results. Except for the travel reviews, there is surprisingly little variation in the accuracy within a domain. In addi7 http://www.mindfuleye.com/ tion to recommended and not recommended, Epinions reviews are classified using the five star rating system. The third column shows the correlation between the average semantic orientation and the number of stars assigned by the author of the review. The results show a strong positive correlation between the average semantic orientation and the author's rating out of five stars.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion of Results
</SectionTitle>
    <Paragraph position="0"> A natural question, given the preceding results, is what makes movie reviews hard to classify? Table 6 shows that classification by the average SO tends to err on the side of guessing that a review is not recommended, when it is actually recommended.</Paragraph>
    <Paragraph position="1"> This suggests the hypothesis that a good movie will often contain unpleasant scenes (e.g., violence, death, mayhem), and a recommended movie review may thus have its average semantic orientation reduced if it contains descriptions of these unpleasant scenes. However, if we add a constant value to the average SO of the movie reviews, to compensate for this bias, the accuracy does not improve. This suggests that, just as positive reviews mention unpleasant things, so negative reviews often mention pleasant scenes.</Paragraph>
    <Paragraph position="2">  to this hypothesis. For example, the phrase &amp;quot;more evil&amp;quot; does have negative connotations, thus an SO of -4.384 is appropriate, but an evil character does not make a bad movie. The difficulty with movie reviews is that there are two aspects to a movie, the events and actors in the movie (the elements of the movie), and the style and art of the movie (the movie as a gestalt; a unified whole). This is likely also the explanation for the lower accuracy of the Cancun reviews: good beaches do not necessarily add up to a good vacation. On the other hand, good automotive parts usually do add up to a good automobile and good banking services add up to a good bank. It is not clear how to address this issue. Future work might look at whether it is possible to tag sentences as discussing elements or wholes.</Paragraph>
    <Paragraph position="3"> Another area for future work is to empirically compare PMI-IR and the algorithm of Hatzivassiloglou and McKeown (1997). Although their algorithm does not readily extend to two-word phrases, I have not yet demonstrated that two-word phrases are necessary for accurate classification of reviews.</Paragraph>
    <Paragraph position="4"> On the other hand, it would be interesting to evaluate PMI-IR on the collection of 1,336 hand-labeled adjectives that were used in the experiments of Hatzivassiloglou and McKeown (1997). A related question for future work is the relationship of accuracy of the estimation of semantic orientation at the level of individual phrases to accuracy of review classification. Since the review classification is based on an average, it might be quite resistant to noise in the SO estimate for individual phrases.</Paragraph>
    <Paragraph position="5"> But it is possible that a better SO estimator could produce significantly better classifications.</Paragraph>
    <Paragraph position="6">  The slow, methodical way he spoke. I loved it! It made him seem more arrogant and even more evil.</Paragraph>
    <Paragraph position="7">  Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised.</Paragraph>
    <Paragraph position="8">  Anyone who saw the trailer in the theater over the course of the last year will never forget the images of Japanese war planes swooping out of the blue skies, flying past the children playing baseball, or the truly remarkable shot of a bomb falling from an enemy plane into the deck of the USS Arizona.</Paragraph>
    <Paragraph position="9"> Equation (3) is a very simple estimator of semantic orientation. It might benefit from more sophisticated statistical analysis (Agresti, 1996). One possibility is to apply a statistical significance test to each estimated SO. There is a large statistical literature on the log-odds ratio, which might lead to improved results on this task.</Paragraph>
    <Paragraph position="10"> This paper has focused on unsupervised classification, but average semantic orientation could be supplemented by other features, in a supervised classification system. The other features could be based on the presence or absence of specific words, as is common in most text classification work. This could yield higher accuracies, but the intent here was to study this one feature in isolation, to simplify the analysis, before combining it with other features.</Paragraph>
    <Paragraph position="11"> Table 5 shows a high correlation between the average semantic orientation and the star rating of a review. I plan to experiment with ordinal classification of reviews in the five star rating system, using the algorithm of Frank and Hall (2001). For ordinal classification, the average semantic orientation would be supplemented with other features in a supervised classification system.</Paragraph>
    <Paragraph position="12"> A limitation of PMI-IR is the time required to send queries to AltaVista. Inspection of Equation (3) shows that it takes four queries to calculate the semantic orientation of a phrase. However, I cached all query results, and since there is no need to recalculate hits(&amp;quot;poor&amp;quot;) and hits(&amp;quot;excellent&amp;quot;) for every phrase, each phrase requires an average of slightly less than two queries. As a courtesy to AltaVista, I used a five second delay between queries.8 The 410 reviews yielded 10,658 phrases, so the total time required to process the corpus was roughly 106,580 seconds, or about 30 hours.</Paragraph>
    <Paragraph position="13"> This might appear to be a significant limitation, but extrapolation of current trends in computer memory capacity suggests that, in about ten years, the average desktop computer will be able to easily store and search AltaVista's 350 million Web pages. This will reduce the processing time to less</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Applications
</SectionTitle>
    <Paragraph position="0"> There are a variety of potential applications for automated review rating. As mentioned in the in8 This line of research depends on the good will of the major search engines. For a discussion of the ethics of Web robots, see http://www.robotstxt.org/wc/robots.html. For query robots, the proposed extended standard for robot exclusion would be useful. See http://www.conman.org/people/spc/robots2.html.</Paragraph>
    <Paragraph position="1"> troduction, one application is to provide summary statistics for search engines. Given the query &amp;quot;Akumal travel review&amp;quot;, a search engine could report, &amp;quot;There are 5,000 hits, of which 80% are thumbs up and 20% are thumbs down.&amp;quot; The search results could be sorted by average semantic orientation, so that the user could easily sample the most extreme reviews. Similarly, a search engine could allow the user to specify the topic and the rating of the desired reviews (Hearst, 1992).</Paragraph>
    <Paragraph position="2"> Preliminary experiments indicate that semantic orientation is also useful for summarization of reviews. A positive review could be summarized by picking out the sentence with the highest positive semantic orientation and a negative review could be summarized by extracting the sentence with the lowest negative semantic orientation.</Paragraph>
    <Paragraph position="3"> Epinions asks its reviewers to provide a short description of pros and cons for the reviewed item.</Paragraph>
    <Paragraph position="4"> A pro/con summarizer could be evaluated by measuring the overlap between the reviewer's pros and cons and the phrases in the review that have the most extreme semantic orientation.</Paragraph>
    <Paragraph position="5"> Another potential application is filtering &amp;quot;flames&amp;quot; for newsgroups (Spertus, 1997). There could be a threshold, such that a newsgroup message is held for verification by the human moderator when the semantic orientation of a phrase drops below the threshold. A related use might be a tool for helping academic referees when reviewing journal and conference papers. Ideally, referees are unbiased and objective, but sometimes their criticism can be unintentionally harsh. It might be possible to highlight passages in a draft referee's report, where the choice of words should be modified towards a more neutral tone.</Paragraph>
    <Paragraph position="6"> Tong's (2001) system for detecting and tracking opinions in on-line discussions could benefit from the use of a learning algorithm, instead of (or in addition to) a hand-built lexicon. With automated review rating (opinion rating), advertisers could track advertising campaigns, politicians could track public opinion, reporters could track public response to current events, stock traders could track financial opinions, and trend analyzers could track entertainment and technology trends.</Paragraph>
  </Section>
class="xml-element"></Paper>