<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2011"> <Title>Automatic Detection of Opinion Bearing Words and Sentences</Title> <Section position="3" start_page="1" end_page="61" type="metho"> <SectionTitle> 2 Past Computational Studies </SectionTitle> <Paragraph position="0"> There has been a spate of research on identifying sentence-level subjectivity in general and opinion in particular. The Novelty track In the remainder of the paper, we will mostly use &quot;opinion&quot; in place of &quot;valence&quot;. We will no longer discuss Belief, Holder, or Topic.</Paragraph> <Paragraph position="1"> (Soboroff and Harman, 2003) of the TREC-2003 competition included a task of recognizing opinion-bearing sentences (see Section 5.2).</Paragraph> <Paragraph position="2"> Wilson and Wiebe (2003) developed an annotation scheme for so-called subjective sentences (opinions and other private states) as part of a U.S. government-sponsored project (ARDA AQUAINT NRRC) in 2002. They created a corpus, MPQA, containing news articles manually annotated. Several other approaches have been applied for learning words and phrases that signal subjectivity. Turney (2002) and Wiebe (2000) focused on learning adjectives and adjectival phrases and Wiebe et al. (2001) focused on nouns. Riloff et al. (2003) extracted nouns and Riloff and Wiebe (2003) extracted patterns for subjective expressions using a bootstrapping process.</Paragraph> </Section> <Section position="4" start_page="61" end_page="63" type="metho"> <SectionTitle> 3 Data Sources </SectionTitle> <Paragraph position="0"> We developed several collections of opinion-bearing and non-opinion-bearing words. One is accurate but small; another is large but relatively inaccurate. We combined them to obtain a more reliable list. We obtained an additional list from Columbia University.</Paragraph> <Section position="1" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 3.1 Collection 1: Using WordNet </SectionTitle> <Paragraph position="0"> In pursuit of accuracy, we first manually collected a set of opinion-bearing words (34 adjectives and 44 verbs). Early classification trials showed that precision was very high (the system found only opinion-bearing sentences), but since the list was so small, recall was very low (it missed many). We therefore used this list as seed words for expansion using WordNet. Our assumption was that synonyms and antonyms of an opinion-bearing word could be opinion-bearing as well, as for example &quot;nice, virtuous, pleasing, well-behaved, gracious, honorable, righteous&quot; as synonyms for &quot;good&quot;, or &quot;bad, evil, disreputable, unrighteous&quot; as antonyms. However, not all synonyms and antonyms could be used: some such words seemed to exhibit both opinion-bearing and non-opinion-bearing senses, such as &quot;solid, hot, full, ample&quot; for &quot;good&quot;. This indicated the need for a scale of valence strength. If we can measure the 'opinion-based closeness' of a synonym or antonym to a known opinion bearer, then we can determine whether to include it in the expanded set.</Paragraph> <Paragraph position="1"> To develop such a scale, we first created a non-opinion-bearing word list manually and produced related words for it using WordNet.</Paragraph> <Paragraph position="2"> To avoid collecting uncommon words, we started with a basic/common English word list compiled for foreign students preparing for the TOEFL test. From this we randomly selected 462 adjectives and 502 verbs for human annotation. 
</Section>
<Section position="2" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 3.2 Collection 2: WSJ Data </SectionTitle>
<Paragraph position="0"> Experiments with the above set did not provide very satisfactory results on arbitrary text. One reason is that WordNet's synonym connections are simply not extensive enough. However, if we know the relative frequency of a word in opinion-bearing texts compared to non-opinion-bearing texts, we can use this statistical information instead of lexical information. For this, we collected a large amount of data in order to make up for the limitations of Collection 1.</Paragraph>
<Paragraph position="1"> Following the insight of Yu and Hatzivassiloglou (2003), we made the basic and rough assumption that words that appear more often in newspaper editorials and letters to the editor than in non-editorial news articles could be potential opinion-bearing words (even though editorials contain sentences about factual events as well). We used the TREC collection to collect data, extracting all Wall Street Journal documents from it and classifying each as Editorial or non-Editorial based on the occurrence of the keywords &quot;Letters to the Editor&quot;, &quot;Letter to the Editor&quot;, or &quot;Editorial&quot; in its headline. This produced a total of 7053 editorial documents and 166,025 non-editorial documents.</Paragraph>
<Paragraph position="2"> We separated opinion words from non-opinion words by considering their relative frequency in the two collections, expressed as a probability, using SRILM, SRI's language modeling toolkit (http://www.speech.sri.com/projects/srilm/). For every word W occurring in either of the document sets, we computed the probabilities P(W | Editorial) and P(W | nonEditorial). We used Kneser-Ney smoothing (Kneser and Ney, 1995) to handle unknown/rare words.</Paragraph>
<Paragraph position="3"> Having obtained these probabilities, we calculated the score of W as the ratio Score(W) = P(W | Editorial) / P(W | nonEditorial). Score(W) gives an indication of the bias of each word towards editorial or non-editorial texts. We computed scores for 86,674,738 word tokens.</Paragraph>
<Paragraph position="4"> Naturally, words with scores close to 1 were untrustworthy markers of valence. To eliminate these words we applied a simple filter as follows. We divided the Editorial and the non-Editorial collections each into three subsets. For each word in each {Editorial, non-Editorial} subset pair we calculated Score(W). We retained only those words whose scores in all three subset pairs were all greater than 1 or all less than 1; in other words, we kept only words with a repeated bias towards Editorial or non-Editorial. This procedure helped eliminate some of the noisy words, resulting in 15,568 words.</Paragraph>
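The following sketch illustrates the scoring and consistency filter just described. The actual system used SRILM with Kneser-Ney smoothing; here add-one-smoothed unigram probabilities and a round-robin split of the token streams stand in for those choices, so it is an approximation of the procedure rather than a reproduction of it.

```python
# Sketch of Score(W) = P(W | Editorial) / P(W | nonEditorial) plus the
# three-subset consistency filter of Section 3.2 (simplified smoothing/splitting).
from collections import Counter


def unigram_probs(tokens, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total, v = len(tokens), len(vocab)
    return {w: (counts[w] + 1) / (total + v) for w in vocab}


def score_words(editorial_tokens, noneditorial_tokens):
    """Score(W) over the joint vocabulary of both collections."""
    vocab = set(editorial_tokens) | set(noneditorial_tokens)
    p_ed = unigram_probs(editorial_tokens, vocab)
    p_non = unigram_probs(noneditorial_tokens, vocab)
    return {w: p_ed[w] / p_non[w] for w in vocab}


def round_robin_split(tokens, parts=3):
    """Rough stand-in for the paper's subsets (the paper split by document)."""
    return [tokens[i::parts] for i in range(parts)]


def consistent_words(editorial_tokens, noneditorial_tokens, parts=3):
    """Keep words whose bias (Score > 1 or Score < 1) repeats in every subset pair."""
    overall = score_words(editorial_tokens, noneditorial_tokens)
    subset_scores = [score_words(e, n)
                     for e, n in zip(round_robin_split(editorial_tokens, parts),
                                     round_robin_split(noneditorial_tokens, parts))]
    kept = {}
    for w, s in overall.items():
        biases = [sub[w] > 1 for sub in subset_scores if w in sub]
        if len(biases) == parts and (all(biases) or not any(biases)):
            kept[w] = s
    return kept
```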
</Section>
<Section position="3" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 3.3 Collection 3: Columbia Word List </SectionTitle>
<Paragraph position="0"> Simply partitioning WSJ articles into Editorial/non-Editorial is a very crude differentiation. In order to compare the effectiveness of our implementation of this idea with the implementation by Yu and Hatzivassiloglou of Columbia University, we requested their word list, which they kindly provided. Their list contained 167,020 adjectives, 72,352 verbs, 168,614 nouns, and 9884 adverbs. However, these figures are significantly inflated by redundant counting of words with variations in capitalization and punctuation. We merged this list with ours to obtain Collection 4. From the Columbia list, we took only the top 2000 opinion-bearing words and the top 2000 non-opinion-bearing words for the final word list.</Paragraph> </Section>
<Section position="4" start_page="62" end_page="63" type="sub_section"> <SectionTitle> 3.4 Collection 4: Final Merger </SectionTitle>
<Paragraph position="0"> So far, we have classified words as either opinion-bearing or non-opinion-bearing by two different methods. The first method calculates a word's degree of closeness to manually chosen sets of opinion-bearing and non-opinion-bearing words in WordNet and decides its class and strength.</Paragraph>
<Paragraph position="1"> When a word is equally close to both classes, it is hard to decide its subjectivity, and when WordNet does not contain the word or its synonyms, such as the word &quot;antihomosexual&quot;, we fail to classify it.</Paragraph>
<Paragraph position="2"> The second method, classification of words using WSJ texts, is less reliable than the lexical method. However, it does, for example, successfully handle &quot;antihomosexual&quot;. We therefore combined the results of the two methods (Collections 1 and 2), since their different characteristics compensate for each other. We then also added the 4000 words from the Columbia word list, giving a final list of 43,700 words. Since all three lists assign each word a strength between 0 and 1, we simply averaged them and normalized the valence strengths to the range from -1 to +1, with greater opinion valence closer to +1 (see Table 1). Words with a high valence strength in all three collections naturally received a high overall strength, while words on which the three collections disagreed automatically received a weak strength. Table 2 shows the distribution of words according to their sources: Collection 1 (C1), Collection 2 (C2), and Collection 3 (C3).</Paragraph>
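A minimal sketch of this merger is shown below, assuming each collection is represented as a dictionary mapping words to strengths in [0, 1]; the treatment of words missing from a collection (they simply contribute nothing to the average) and the example strengths are assumptions for illustration.

```python
# Sketch of the final merger (Section 3.4): average the per-collection strengths
# and rescale the result from [0, 1] onto [-1, +1], with +1 = strongly opinion-bearing.
def merge_collections(*collections):
    merged = {}
    vocab = set().union(*collections)           # union of all words seen in any list
    for word in vocab:
        strengths = [c[word] for c in collections if word in c]
        avg = sum(strengths) / len(strengths)   # average of the available votes
        merged[word] = 2 * avg - 1              # map [0, 1] onto [-1, +1]
    return merged


# toy usage with hypothetical strengths from Collections 1-3
c1 = {"outrageous": 0.90, "table": 0.10}
c2 = {"outrageous": 0.80, "table": 0.30, "antihomosexual": 0.70}
c3 = {"outrageous": 0.95, "table": 0.20}
print(merge_collections(c1, c2, c3))
```

Averaging also captures the conflict behavior described above: when the three collections disagree, the strengths cancel and the merged value ends up near zero.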
</Section> </Section>
<Section position="5" start_page="63" end_page="64" type="metho"> <SectionTitle> 4 Measuring Sentence Valence </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="63" end_page="63" type="sub_section"> <SectionTitle> 4.1 Two Models </SectionTitle>
<Paragraph position="0"> We are now ready to automatically identify opinion-bearing sentences. We defined several models, combining valence scores in different ways, and eventually kept two. The intuition underlying Model 1 is that sentences in which opinion-bearing words dominate tend to be opinion-bearing, while Model 2 reflects the idea that even one strong valence word is enough. After experimenting with these models, we decided to use Model 2.</Paragraph>
<Paragraph position="1"> How strong is &quot;strong enough&quot;? To determine the cutoff threshold (λ) on the opinion-bearing valence strength of words, we experimented on human-annotated data.</Paragraph> </Section>
<Section position="2" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 4.2 Gold Standard Annotation </SectionTitle>
<Paragraph position="0"> We built two sets of human-annotated sentence subjectivity data. Test set A contains 50 sentences about welfare reform, of which 24 sentences are opinion-bearing. Test set B contains 124 sentences on two topics (illegal aliens and term limits), of which 53 sentences are opinion-bearing. Three humans classified the sentences as either opinion-bearing or non-opinion-bearing. We calculated agreement for each pair of humans and for all three together. Simple pairwise agreement averaged 0.73, but the kappa score was only 0.49.</Paragraph>
<Paragraph position="1"> Table 3 shows the results of experimenting with different combinations of Model 1, Model 2, and several cutoff values. Recall, precision, F-score, and accuracy are defined in the usual way. Generally, as the cutoff threshold increases, fewer opinion markers are included in the lists, and precision increases while recall drops. The best F-score is obtained on Test set A, Model 2, with λ = 0.1 or 0.2 (i.e., being rather liberal).</Paragraph> </Section> </Section> </Paper>
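As a closing illustration of the Model 2 decision rule described in Section 4.1, the sketch below marks a sentence as opinion-bearing as soon as any of its words has a valence strength above the cutoff λ. The lexicon entries, strength values, and whitespace-style tokenizer are hypothetical placeholders, not the paper's actual resources.

```python
# Sketch of Model 2: a sentence is opinion-bearing if at least one of its words
# has merged valence strength above the cutoff lambda (e.g., 0.1 or 0.2).
import re


def is_opinion_bearing(sentence, lexicon, cutoff=0.1):
    """Return True if any word's strength exceeds the cutoff (Model 2 rule)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(lexicon.get(w, 0.0) > cutoff for w in words)


# hypothetical merged lexicon with strengths in [-1, +1]
lexicon = {"outrageous": 0.92, "welfare": 0.05, "reform": 0.02}
print(is_opinion_bearing("The proposal is outrageous.", lexicon))          # True
print(is_opinion_bearing("The welfare reform bill passed.", lexicon))      # False
```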