<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1017"> <Title>Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Document Classification </SectionTitle> <Paragraph position="0"> To separate documents that contain primarily opinions from documents that report mainly facts, we applied Naive Bayes1, a commonly used supervised machine-learning algorithm. This approach presupposes the availability of at least a collection of articles with pre-assigned opinion and fact labels at the document level; fortunately, Wall Street Journal articles contain such metadata by identifying the type of each article as Editorial, Letter to editor, Business and News. These labels are used only to provide the correct classification labels during training and evaluation, and are not included in the feature space. We used as features single words, without stemming or stopword removal. Naive Bayes assigns a document and assuming conditional independence of the features.</Paragraph> <Paragraph position="1"> Although Naive Bayes can be outperformed in text classification tasks by more complex methods such as SVMs, Pang et al. (2002) report similar performance for Naive Bayes and other machine learning techniques for a similar task, that of distinguishing between positive and negative reviews at the document level. Further, we achieved such high performance with Naive Bayes (see Section 8) that exploring additional techniques for this task seemed unnecessary.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Finding Opinion Sentences </SectionTitle> <Paragraph position="0"> We developed three different approaches to classify opinions from facts at the sentence level. To avoid the need for obtaining individual sentence annotations for training and evaluation, we rely instead on the expectation that documents classified as opinion on the whole (e.g., editorials) will tend to have mostly opinion sentences, and conversely documents placed in the factual category will tend to have mostly factual sentences. Wiebe et al. (2002) report that this expectation is borne out 75% of the time for opinion documents and 56% of the time for factual documents.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Similarity Approach </SectionTitle> <Paragraph position="0"> Our first approach to classifying sentences as opinions or facts explores the hypothesis that, within a given topic, opinion sentences will be more similar to other opinion sentences than to factual sen- null cs.cmu.edu/~mccallum/bow/rainbow.</Paragraph> <Paragraph position="1"> tences. We used SIMFINDER (Hatzivassiloglou et al., 2001), a state-of-the-art system for measuring sentence similarity based on shared words, phrases, and WordNet synsets. To measure the overall similarity of a sentence to the opinion or fact documents, we first select the documents that are on the same topic as the sentence in question. We obtain topics as the results of IR queries (for example, by searching our document collection for &quot;welfare reform&quot;). We then average its SIMFINDER-provided similarities with each sentence in those documents. Then we assign the sentence to the category for which the average is higher (we call this approach the &quot;score&quot; variant). 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Naive Bayes Classifier </SectionTitle> <Paragraph position="0"> Our second method trains a Naive Bayes classifier (see Section 3), using the sentences in opinion and fact documents as the examples of the two categories. The features include words, bigrams, and trigrams, as well as the parts of speech in each sentence. In addition, the presence of semantically oriented (positive and negative) words in a sentence is an indicator that the sentence is subjective (Hatzivassiloglou and Wiebe, 2000). Therefore, we include in our features the counts of positive and negative words in the sentence (obtained with the method of Section 5.1), as well as counts of the polarities of sequences of semantically oriented words (e.g., &quot;++&quot; for two consecutive positively oriented words). We also include the counts of parts of speech combined with polarity information (e.g., &quot;JJ+&quot; for positive adjectives), as well as features encoding the polarity (if any) of the head verb, the main subject, and their immediate modifiers. Syntactic structure was obtained with Charniak's statistical parser (Charniak, 2000). Finally, we used as one of the features the average semantic orientation score of the words in the sentence.</Paragraph> </Section>

<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Multiple Naive Bayes Classifiers </SectionTitle> <Paragraph position="0"> Our designation of all sentences in opinion or factual articles as opinion or fact sentences is an approximation. To address this, we apply an algorithm using multiple classifiers, each relying on a different subset of our features. The goal is to reduce the training set to the sentences that are most likely to be correctly labeled, thus boosting classification accuracy.</Paragraph> <Paragraph position="1"> Given separate sets of features F_1, F_2, ..., F_m, we train separate Naive Bayes classifiers C_1, C_2, ..., C_m, one per feature set. Assuming as ground truth the information provided by the document labels, and that all sentences inherit the status of their document as opinions or facts, we first train C_1 on the entire training set, then use the resulting classifier to predict labels for the training set. The sentences that receive a label different from the assumed truth are then removed, and we train C_2 on the remaining sentences. This process is repeated iteratively until no more sentences can be removed. We report results using five feature sets, starting from words alone and adding in bigrams, trigrams, part-of-speech, and polarity features.</Paragraph> </Section> </Section>
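A minimal sketch of the iterative filtering procedure of Section 4.3, assuming the sentences have already been converted into one feature matrix per feature set (rows aligned across matrices) and that the labels are a NumPy array of document-inherited tags. It uses scikit-learn's MultinomialNB as a generic Naive Bayes stand-in rather than the exact classifier and feature extraction described above.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def iterative_filtering(feature_matrices, labels):
    # feature_matrices: [X_1, ..., X_m], one matrix per feature set F_i;
    # labels: noisy sentence labels inherited from the documents.
    keep = np.ones(len(labels), dtype=bool)   # sentences still in the training set
    classifiers = []
    for X in feature_matrices:                # C_1 on the full set, C_2 on the filtered set, ...
        clf = MultinomialNB()
        clf.fit(X[keep], labels[keep])
        classifiers.append(clf)
        agree = clf.predict(X[keep]) == labels[keep]
        if agree.all():                       # nothing left to remove
            break
        # drop sentences whose predicted label disagrees with the assumed truth
        keep[np.where(keep)[0][~agree]] = False
    return classifiers, keep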
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Identifying the Polarity of Opinion Sentences </SectionTitle> <Paragraph position="0"> Having distinguished whether a sentence is a fact or an opinion, we separate positive, negative, and neutral opinions into three classes. We base this decision on the number and strength of semantically oriented words (either positive or negative) in the sentence.</Paragraph> <Paragraph position="1"> We first discuss how such words are automatically found by our system, and then describe the method by which we aggregate this information across the sentence.</Paragraph>

<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Semantically Oriented Words </SectionTitle> <Paragraph position="0"> To determine which words are semantically oriented, in what direction, and with what strength, we measured their co-occurrence with words from a known seed set of semantically oriented words. The approach is based on the hypothesis that positive words co-occur more than expected by chance, and so do negative words; this hypothesis was validated, at least for strong positive/negative words, in (Turney, 2002). As seed words, we used subsets of the 1,336 adjectives that were manually classified as positive (657) or negative (679) by Hatzivassiloglou and McKeown (1997). In earlier work (Turney, 2002) only singletons were used as seed words; varying their number allows us to test whether multiple seed words have a positive effect on detection performance. We experimented with seed sets containing 1, 20, 100, and over 600 pairs of positive and negative adjectives. For a given seed set size, we denote the set of positive seeds as ADJ_p and the set of negative seeds as ADJ_n. We then calculate a modified log-likelihood ratio L(W_i, POS_j) for a word W_i with part of speech POS_j (j can be adjective, adverb, noun, or verb) as the ratio of its collocation frequency with ADJ_p and ADJ_n within a sentence,</Paragraph> <Paragraph position="1"> L(W_i, POS_j) = log [ (Freq(W_i, POS_j, ADJ_p) + e) / (Freq(W_i, POS_j, ADJ_n) + e) ], where Freq(W_i, POS_j, ADJ) is the number of sentences in which W_i, tagged as POS_j, co-occurs with a member of the seed set ADJ, and e is a small constant added for smoothing.</Paragraph> </Section>

<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Sentence Polarity Tagging </SectionTitle> <Paragraph position="0"> As our measure of semantic orientation across an entire sentence, we used the average of the per-word log-likelihood scores defined in the preceding section.</Paragraph> <Paragraph position="1"> To determine the orientation of an opinion sentence, all that remains is to specify cutoffs t_p and t_n so that sentences whose average log-likelihood score exceeds t_p are classified as positive opinions, sentences with scores lower than t_n are classified as negative opinions, and sentences with in-between scores are treated as neutral opinions. Optimal values for t_p and t_n are obtained from the training data via density estimation: using a small, hand-labeled subset of sentences, we estimate the proportion of sentences that are positive or negative. The values of the average log-likelihood score that correspond to the appropriate tails of the score distribution are then determined via Monte Carlo analysis of a much larger sample of unlabeled training data.</Paragraph> </Section> </Section>
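The word-scoring and sentence-tagging steps of Sections 5.1 and 5.2 can be illustrated with the simplified sketch below. The co-occurrence counting, the smoothing constant eps, and the cutoffs t_pos / t_neg are stand-ins for the exact procedure described above; in particular, the real cutoffs come from the density-estimation and Monte Carlo step, not from fixed defaults.

import math
from collections import defaultdict

def word_orientation_scores(sentences, pos_seeds, neg_seeds, eps=0.5):
    # sentences: lists of (token, POS) pairs; pos_seeds / neg_seeds: seed adjective sets.
    # For each (word, POS), count the sentences in which it co-occurs with a positive
    # or a negative seed, then take the log of the smoothed ratio.
    pos_cooc = defaultdict(float)
    neg_cooc = defaultdict(float)
    for sent in sentences:
        tokens = set(w for w, _ in sent)
        has_pos = any(t in pos_seeds for t in tokens)
        has_neg = any(t in neg_seeds for t in tokens)
        for key in set(sent):
            if has_pos:
                pos_cooc[key] += 1.0
            if has_neg:
                neg_cooc[key] += 1.0
    return {k: math.log((pos_cooc[k] + eps) / (neg_cooc[k] + eps))
            for k in set(pos_cooc) | set(neg_cooc)}

def tag_sentence_polarity(sent, scores, t_pos=0.5, t_neg=-0.5):
    # Average the per-word scores, then apply the cutoffs of Section 5.2.
    # t_pos and t_neg here are placeholders; the paper estimates them from training data.
    vals = [scores.get((w, p), 0.0) for w, p in sent]
    avg = sum(vals) / len(vals) if vals else 0.0
    if avg > t_pos:
        return "positive opinion"
    if t_neg > avg:
        return "negative opinion"
    return "neutral opinion"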
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Data </SectionTitle> <Paragraph position="0"> We used the TREC 8, 9, and 11 collections (http://trec.nist.gov/), which consist of more than 1.7 million newswire articles. The aggregate collection covers six different newswire sources, including 173,252 Wall Street Journal (WSJ) articles from 1987 to 1992.</Paragraph> <Paragraph position="1"> Some of the WSJ articles have structured headings that include Editorial, Letter to editor, Business, and News (2,877, 1,695, 2,009, and 3,714 articles, respectively). We randomly selected 2,000 articles from each category so that our data set was approximately evenly divided between fact and opinion articles. Those articles were used for both document-level and sentence-level opinion/fact classification.</Paragraph> </Section>

<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Evaluation Metrics and Gold Standard </SectionTitle> <Paragraph position="0"> For classification tasks (i.e., classifying between facts and opinions and identifying the semantic orientation of sentences), we measured our system's performance by standard recall and precision. We evaluated the quality of semantically oriented words by mapping the extracted words and labels to an external gold standard. We took the subset of our output containing words that appear in the standard, and measured the accuracy of our output as the portion of that subset that was assigned the correct label.</Paragraph> <Paragraph position="1"> A gold standard for document-level classification is readily available, since each article in our Wall Street Journal collection comes with an article type label (see Section 6). We mapped the article types News and Business to facts, and the article types Editorial and Letter to the Editor to opinions. We cannot automatically select a sentence-level gold standard discriminating between facts and opinions, or between positive and negative opinions. We therefore asked human evaluators to classify a set of sentences between facts and opinions, as well as to determine the type of opinions.</Paragraph> <Paragraph position="2"> Since we have implemented our methods in an opinion question answering system, we selected four different topics (gun control, illegal aliens, social security, and welfare reform). For each topic, we randomly selected 25 articles from the entire combined TREC corpus (not just the WSJ portion); these were articles matching the corresponding topical phrase given above, as determined by the Lucene search engine. From each of these documents we randomly selected four sentences. If a document happened to have fewer than four sentences, additional documents from the same topic were retrieved to supply the missing sentences. The resulting 400 sentences were then interleaved, so that successive sentences came from different topics and documents, and divided into ten 50-sentence blocks. Each block shares ten sentences with the preceding and following block (the last block is considered to precede the first one), so that 100 of the 400 sentences appear in two blocks. Each of ten human evaluators (all with graduate training in computational linguistics) was presented with one block and asked to select a label for each sentence from the following: &quot;fact&quot;, &quot;positive opinion&quot;, &quot;negative opinion&quot;, &quot;neutral opinion&quot;, &quot;sentence contains both positive and negative opinions&quot;, &quot;opinion but cannot determine orientation&quot;, and &quot;uncertain&quot;.</Paragraph>
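The overlapping-block design of the annotation study (ten 50-sentence blocks over 400 interleaved sentences, with a circular 10-sentence overlap between neighboring blocks) can be reproduced with the short sketch below; the function name and the list-of-sentences input are illustrative only.

def make_annotation_blocks(sentences, n_blocks=10, block_size=50, overlap=10):
    # With 400 sentences, 10 blocks of 50, and an overlap of 10, each block shares
    # its last 10 sentences with the next block (the last block wraps around to the
    # first), so exactly 100 sentences receive two judgments.
    step = block_size - overlap              # 40 previously unseen sentences per block
    n = len(sentences)                       # expected: n_blocks * step = 400
    blocks = []
    for i in range(n_blocks):
        idx = [(i * step + j) % n for j in range(block_size)]
        blocks.append([sentences[k] for k in idx])
    return blocks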
<Paragraph position="3"> Since we have one judgment for 300 sentences and two judgments for 100 sentences, we created two gold standards for sentence classification. The first (Standard A) includes the 300 sentences with one judgment and a single judgment for each of the remaining 100 sentences. The second standard (Standard B) contains the subset of the 100 doubly judged sentences for which we obtained identical labels. Statistics of these two standards are given in Table 1. We measured pairwise agreement among the 100 sentences judged by two evaluators as the ratio of sentences that receive a label l from both evaluators to the total number of sentences receiving label l from either evaluator. The agreement across the 100 sentences for all seven choices was 55%; if we group together the five subtypes of opinion sentences, the overall agreement rises to 82%. The low agreement for some labels was not surprising, because there is much ambiguity between facts and opinions. An example of an arguable sentence is &quot;A lethal guerrilla war between poachers and wardens now rages in central and eastern Africa&quot;, which one rater classified as &quot;fact&quot; and another rater classified as &quot;opinion&quot;.</Paragraph> <Paragraph position="4"> Finally, for evaluating the quality of extracted words with semantic orientation labels, we used two distinct manually labeled collections as gold standards. One set consists of the previously described 657 positive and 679 negative adjectives (Hatzivassiloglou and McKeown, 1997). We also used the ANEW list, which was constructed during psycholinguistic experiments (Bradley and Lang, 1999) and contains 1,031 words from all four open classes.</Paragraph> <Paragraph position="5"> As described in (Bradley and Lang, 1999), humans assigned valence scores to each word along dimensions such as pleasure, arousal, and dominance; following heuristics proposed in the psycholinguistics literature, we obtained 284 positive and 272 negative words from the valence scores.</Paragraph> </Section> </Paper>
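As a final illustration, the per-label agreement measure used above (sentences receiving a label l from both evaluators, divided by sentences receiving l from either evaluator) can be computed as in the sketch below. The function name and input format are illustrative, and the aggregation of the per-label values into the overall 55% and 82% figures is not spelled out in the text above.

def per_label_agreement(judgments_a, judgments_b):
    # judgments_a, judgments_b: parallel lists of labels assigned by the two
    # evaluators to the same sentences. For each label, agreement is the number of
    # sentences labeled that way by both, divided by the number labeled that way by either.
    labels = set(judgments_a) | set(judgments_b)
    agreement = {}
    for lab in labels:
        both = sum(1 for a, b in zip(judgments_a, judgments_b) if a == lab and b == lab)
        either = sum(1 for a, b in zip(judgments_a, judgments_b) if a == lab or b == lab)
        agreement[lab] = both / either if either else 0.0
    return agreement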