<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1027"> <Title>Mining WordNet for Fuzzy Sentiment: Sentiment Tag Extraction from WordNet Glosses</Title> <Section position="3" start_page="0" end_page="209" type="metho"> <SectionTitle> 2 The Category of Sentiment as a Fuzzy </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="209" type="sub_section"> <SectionTitle> Set </SectionTitle> <Paragraph position="0"> Some semantic categories have clear membership (e.g., lexical fields (Lehrer, 1974) of color, body parts or professions), while others are much more difficult to define. This prompted the development ofapproaches thatregardthetransition frommembership to non-membership in a semantic category asgradual rather than abrupt (Zadeh, 1987; Rosch, 1978). In this paper we approach the category of sentiment as one of such fuzzy categories where some words -- such as good, bad -- are very central, prototypical members, while other, less central words may be interpreted differently by different people. Thus, as annotators proceed from the core of the category to its periphery, word mem- null bership inthis category becomes moreambiguous, and hence, lower inter-annotator agreement can be expected for more peripheral words. Under the classical truth-conditional approach the disagreement between annotators is invariably viewed as a sign of poor reliability of coding and is eliminated by 'training' annotators to code difficult and ambiguous cases in some standard way. While this procedure leads to high levels of inter-annotator agreement on a list created by a coordinated team of researchers, the naturally occurring differences in the interpretation of words located on the periphery of the category can clearly be seen when annotations by two independent teams are compared. The Table 1 presents the comparison of GI-H4 (General Inquirer Harvard IV-4 list, (Stone et al., 1966)) 2 and HM (from (Hatzivassiloglou and McKeown, 1997) study) lists of words manually annotated with sentiment tags by two different research teams.</Paragraph> </Section> </Section> <Section position="4" start_page="209" end_page="209" type="metho"> <SectionTitle> GI-H4 HM </SectionTitle> <Paragraph position="0"> List composition nouns, verbs, adj., adv.</Paragraph> <Paragraph position="1"> adj. only notations on sentiment tags. The approach to sentiment as a category with fuzzy boundaries suggests that the 21.3% disagreement between the two manually annotated lists reflects a natural variability in human annotators' judgment and that this variability is related to the degree of centrality and/or relative importance of certain words to the category of sentiment. The attempts to address this difference 2The General Inquirer (GI) list used in this study was manually cleaned to remove duplicate entries for words with same part of speech and sentiment. Only the Harvard IV-4 list component of the whole GI was used in this study, since other lists included in GI lack the sentiment annotation. Unless otherwisespecified, weused the fullGI-H4list including the Neutral words that were not assigned Positiv or Negativ annotations.</Paragraph> <Paragraph position="2"> in importance of various sentiment markers have crystallized in two main approaches: automatic assignment of weights based on some statistical criterion ((Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2002; Kim and Hovy, 2004), and others) or manual annotation (Subasic and Huettner, 2001). 
<Paragraph position="6"> Both approaches have their limitations: the first approach produces coarse results and requires large amounts of data to be reliable, while the second approach is prohibitively expensive in terms of annotator time and runs the risk of introducing a substantial subjective bias into the annotations.</Paragraph> <Paragraph position="7"> In this paper we seek to develop an approach for semantic annotation of a fuzzy lexical category and apply it to sentiment annotation of all WordNet words. The sections that follow (1) describe the proposed approach used to extract sentiment information from WordNet entries with the STEP (Semantic Tag Extraction Program) algorithm, (2) discuss the overall performance of STEP on WordNet glosses, (3) outline the method for defining the centrality of a word to the sentiment category, and (4) compare the results of both automatic (STEP) and manual (HM) sentiment annotations to the manually annotated GI-H4 list, which was used as a gold standard in this experiment. The comparisons are performed separately for each of the subsets of GI-H4 that are characterized by a different distance from the core of the lexical category of sentiment.</Paragraph> </Section> <Section position="5" start_page="209" end_page="211" type="metho"> <SectionTitle> 3 Sentiment Tag Extraction from WordNet Entries </SectionTitle> <Paragraph position="0"> Word lists for sentiment tagging applications can be compiled using different methods. Automatic methods of sentiment annotation at the word level can be grouped into two major categories: (1) corpus-based approaches and (2) dictionary-based approaches. The first group includes methods that rely on syntactic or co-occurrence patterns of words in large texts to determine their sentiment (e.g., (Turney and Littman, 2002; Hatzivassiloglou and McKeown, 1997; Yu and Hatzivassiloglou, 2003; Grefenstette et al., 2004) and others). The majority of dictionary-based approaches use WordNet information, especially synsets and hierarchies, to acquire sentiment-marked words (Hu and Liu, 2004; Valitutti et al., 2004; Kim and Hovy, 2004) or to measure the similarity between candidate words and sentiment-bearing words such as good and bad (Kamps et al., 2004).</Paragraph> <Paragraph position="1"> In this paper, we propose an approach to sentiment annotation of WordNet entries that was implemented and tested in the Semantic Tag Extraction Program (STEP). This approach relies both on the lexical relations (synonymy, antonymy and hyponymy) provided in WordNet and on the WordNet glosses. It builds upon the properties of dictionary entries as a special kind of structured text: such lexicographical texts are built to establish semantic equivalence between the left-hand and right-hand parts of the dictionary entry, and are therefore designed to match the components of meaning of the word as closely as possible. They have a relatively standard style, grammar and syntactic structure, which removes a substantial source of noise common to other types of text, and, finally, they have extensive coverage spanning the entire lexicon of a natural language.</Paragraph> <Paragraph position="2"> The STEP algorithm starts with a small set of seed words of known sentiment value (positive or negative). This list is augmented during the first pass by adding the synonyms, antonyms and hyponyms of the seed words supplied in WordNet.</Paragraph> <Paragraph position="3"> This step brings on average a 5-fold increase in the size of the original list, with the accuracy of the resulting list comparable to manual annotations (78%, similar to the HM vs. GI-H4 accuracy). In the second pass, the system goes through all WordNet glosses, identifies the entries that contain in their definitions sentiment-bearing words from the extended seed list, and adds these head words (or rather, lexemes) to the corresponding category: positive, negative or neutral (the remainder). A third, clean-up pass is then performed, partially disambiguating the identified WordNet glosses with Brill's part-of-speech tagger (Brill, 1995), which performs with up to 95% accuracy; this pass eliminates errors introduced into the list by the part-of-speech ambiguity of some words acquired in pass 1 and from the seed list. At this step, we also filter out all words that have been assigned contradictory (both positive and negative) sentiment values within the same run.</Paragraph>
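<Paragraph position="4"> The three passes can be outlined in the following minimal sketch. This is only an illustration of the procedure described above, not the authors' implementation: it assumes NLTK's WordNet interface, substitutes NLTK's default part-of-speech tagger for the Brill tagger, and uses made-up function names and a toy seed list.</Paragraph>

# Illustrative sketch of the three STEP passes (not the original system).
# Requires: nltk, plus the 'wordnet', 'punkt' and 'averaged_perceptron_tagger' data packages.
import nltk
from nltk.corpus import wordnet as wn

def expand_seeds(seeds):
    """Pass 1: grow the seed list through WordNet synonyms, antonyms and hyponyms."""
    expanded = dict(seeds)                # word -> +1 (positive) or -1 (negative)
    for word, polarity in seeds.items():
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                expanded.setdefault(lemma.name().lower(), polarity)      # synonyms keep the polarity
                for ant in lemma.antonyms():
                    expanded.setdefault(ant.name().lower(), -polarity)   # antonyms flip the polarity
            for hypo in synset.hyponyms():                               # hyponyms (mainly nouns/verbs)
                for lemma in hypo.lemmas():
                    expanded.setdefault(lemma.name().lower(), polarity)
    return expanded

def tag_by_glosses(expanded):
    """Pass 2: mark head words whose glosses contain words from the expanded seed list.
    Pass 3 (simplified): POS-filter gloss matches and drop words with contradictory evidence."""
    pos_hits, neg_hits = {}, {}
    for synset in wn.all_synsets(pos=wn.ADJ):
        tagged_gloss = nltk.pos_tag(nltk.word_tokenize(synset.definition()))
        for token, pos in tagged_gloss:
            if not pos.startswith('JJ'):          # keep adjectival matches only (stand-in for Brill)
                continue
            polarity = expanded.get(token.lower(), 0)
            if polarity == 0:
                continue
            bucket = pos_hits if polarity > 0 else neg_hits
            for lemma in synset.lemmas():
                bucket[lemma.name().lower()] = bucket.get(lemma.name().lower(), 0) + 1
    labels = {}
    for word in set(pos_hits) | set(neg_hits):
        if word in pos_hits and word in neg_hits:  # contradictory within the run: filter out
            continue
        labels[word] = 'positive' if word in pos_hits else 'negative'
    return labels

if __name__ == '__main__':
    toy_seeds = {'good': 1, 'beautiful': 1, 'bad': -1, 'ugly': -1}       # hypothetical seed list
    labels = tag_by_glosses(expand_seeds(toy_seeds))
    print(len(labels), 'adjectives labelled; all remaining adjectives are treated as neutral')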
<Paragraph position="5"> The performance of STEP was evaluated using GI-H4 as a gold standard, while the HM list was used as a source of seed words fed into the system. We evaluated the performance of our system against the complete list of 1,904 adjectives in GI-H4, which included not only the words marked as Positiv or Negativ, but also those that were not considered sentiment-laden by the GI-H4 annotators and hence were by default considered neutral in our evaluation. For the purposes of the evaluation we partitioned the entire HM list into 58 non-intersecting seed lists of adjectives. The results of the 58 runs on these non-intersecting seed lists are presented in Table 2. Table 2 shows that the performance of the system exhibits substantial variability depending on the composition of the seed list, with accuracy ranging from 47.6% to 87.5% (Mean = 71.2%, Standard Deviation over 10%).</Paragraph> <Paragraph position="6"> This significant variability in the accuracy of the runs is attributable to the variability in the properties of the seed list words used in these runs. The HM list includes sentiment-marked words in which not all meanings are laden with sentiment, words in which some meanings are neutral, and even words in which such neutral meanings are much more frequent than the sentiment-laden ones. The runs whose seed lists included such ambiguous adjectives labeled many neutral words as sentiment-marked, since these seed words were more likely to be found in WordNet glosses in their more frequent, neutral meaning. For example, run #53 had in its seed list two ambiguous adjectives, dim and plush, which are neutral in most contexts. This resulted in only 52.6% accuracy (18.6% below the average). Run #48, on the other hand, by sheer chance had only unambiguous sentiment-bearing words in its seed list, and thus performed with fairly high accuracy (87.5%, 16.3% above the average).</Paragraph>
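<Paragraph position="7"> The evaluation protocol just described can be summarized in a short sketch. The names run_step, hm_adjectives and gi_h4 are hypothetical placeholders for the tagging procedure, the HM seed material and the gold standard, which are not reproduced here:</Paragraph>

# Sketch of the evaluation set-up: 58 non-intersecting seed lists drawn from HM,
# one STEP run per seed list, accuracy measured against the GI-H4 gold standard.
# run_step, hm_adjectives and gi_h4 are placeholders, not bundled resources.
import statistics

def evaluate_runs(hm_adjectives, gi_h4, run_step, n_runs=58):
    """hm_adjectives: list of (word, label) pairs; gi_h4: dict word -> 'positive'/'negative'/'neutral'."""
    seed_lists = [hm_adjectives[i::n_runs] for i in range(n_runs)]      # non-intersecting partition
    accuracies = []
    for seeds in seed_lists:
        predicted = run_step(dict(seeds))        # returns dict word -> 'positive'/'negative'
        correct = sum(1 for word, gold in gi_h4.items()
                      if predicted.get(word, 'neutral') == gold)        # unlabelled words count as neutral
        accuracies.append(correct / len(gi_h4))
    return statistics.mean(accuracies), statistics.stdev(accuracies)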
<Paragraph position="8"> In order to generate a comprehensive list covering the entire set of WordNet adjectives, the 58 runs were then collapsed into a single set of unique words. Since many of the clearly sentiment-laden adjectives that form the core of the category of sentiment were identified by STEP in multiple runs, their duplicates were counted as a single entry in the combined list; the collapsing procedure therefore resulted in a lower-accuracy (66.5% when GI-H4 neutrals were included) but much larger list of English adjectives marked as positive (n = 3,908) or negative (n = 3,905). The remainder of WordNet's 22,141 adjectives was not found in any STEP run and hence was deemed neutral (n = 14,328).</Paragraph> <Paragraph position="9"> Overall, the system's 66.5% accuracy on the collapsed runs is comparable to the accuracy reported in the literature for other systems run on large corpora (Turney and Littman, 2002; Hatzivassiloglou and McKeown, 1997). In order to make a meaningful comparison with the results reported in (Turney and Littman, 2002), we also evaluated STEP results on positives and negatives only (i.e., the neutral adjectives from the GI-H4 list were excluded) and compared our labels to the remaining 1,266 GI-H4 adjectives. The accuracy on this subset was 73.4%, which is comparable to the number reported by Turney and Littman (2002) for experimental runs on 3,596 sentiment-marked GI words from different parts of speech, using a 2 x 10^9-word corpus to compute pointwise mutual information between the GI words and 14 manually selected positive and negative paradigm words (76.06%).</Paragraph> <Paragraph position="10"> The analysis of STEP performance vs. GI-H4 and of the disagreements between the manually annotated HM and GI-H4 showed that the greatest challenge in sentiment tagging of words lies at the boundary between sentiment-marked (positive or negative) and sentiment-neutral words. The 7% performance gain (from 66.5% to 73.4%) associated with the removal of neutrals from the evaluation set emphasizes the importance of neutral words as a major source of sentiment extraction system errors 4. Moreover, the boundary between sentiment-bearing (positive or negative) and neutral words in GI-H4 accounts for 93% of the disagreements between the labels assigned to adjectives in GI-H4 and HM by the two independent teams of human annotators. The view taken here is that the vast majority of such inter-annotator disagreements are not really errors but a reflection of the natural ambiguity of the words that are located on the periphery of the sentiment category.</Paragraph> <Paragraph position="11"> Footnote 4: This is consistent with the observation by Kim and Hovy (2004), who noticed that, when positives and neutrals were collapsed into a single category opposed to negatives, the agreement between human annotators rose by 12%.</Paragraph> </Section> <Section position="6" start_page="211" end_page="213" type="metho"> <SectionTitle> 4 Establishing the Degree of a Word's Centrality to the Semantic Category </SectionTitle> <Paragraph position="0"> The approach to the category of sentiment as a fuzzy set ascribes to it some specific structural properties. First, as opposed to the words located on the periphery, more central elements of the set usually have stronger and more numerous semantic relations with other category members 5. Second, the membership of these central words in the category is less ambiguous than the membership of more peripheral words.</Paragraph>
<Paragraph position="1"> Thus, we can estimate the centrality of a word in a given category in two ways: (1) through the density of the word's relationships with other words, by enumerating its semantic ties to other words within the field and calculating membership scores based on the number of these ties; and (2) through the degree of word membership ambiguity, by assessing the inter-annotator agreement on the word's membership in this category.</Paragraph> <Paragraph position="2"> Lexicographical entries in dictionaries such as WordNet seek to establish semantic equivalence between the word and its definition, and provide a rich source of human-annotated relationships between words. By using a bootstrapping system such as STEP, which follows the links between words in WordNet to find similar words, we can identify the paths connecting members of a given semantic category in the dictionary. With multiple bootstrapping runs on different seed lists, we can then produce a measure of the density of such ties. The ambiguity measure derived from inter-annotator disagreement can then be used to validate the results obtained from the density-based method of determining centrality.</Paragraph> <Paragraph position="3"> In order to produce a centrality measure, we conducted multiple runs with non-intersecting seed lists drawn from HM. The lists of words fetched by STEP on different runs partially overlapped, suggesting that the words identified by the system many times as bearing positive or negative sentiment are more central to the respective categories. The number of times a word has been fetched across the STEP runs is reflected in the Gross Overlap measure produced by the system. In some cases, there was disagreement between different runs on the sentiment assigned to a word. Such disagreements were addressed by computing a Net Overlap Score for each of the found words: the total number of runs assigning the word a negative sentiment was subtracted from the total of the runs that considered it positive. Thus, the greater the number of runs fetching the word (i.e., the Gross Overlap) and the greater the agreement between these runs on the assigned sentiment, the higher the Net Overlap Score of the word.</Paragraph>
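<Paragraph position="4"> In code, the two overlap measures amount to simple counting over the per-run outputs. The following minimal sketch, with a hypothetical three-run example rather than the paper's data, illustrates the computation:</Paragraph>

# Illustrative computation of Gross and Net Overlap scores from multiple STEP runs
# (a sketch of the scoring idea, not the authors' implementation).
def overlap_scores(runs):
    """runs: list of dicts mapping word -> 'positive' or 'negative', one dict per STEP run."""
    gross, net = {}, {}
    for run in runs:
        for word, label in run.items():
            gross[word] = gross.get(word, 0) + 1                      # how many runs fetched the word
            net[word] = net.get(word, 0) + (1 if label == 'positive' else -1)
    return gross, net

runs = [                                   # hypothetical outputs of three runs
    {'good': 'positive', 'grim': 'negative', 'plush': 'positive'},
    {'good': 'positive', 'grim': 'negative'},
    {'good': 'positive', 'plush': 'negative'},
]
gross, net = overlap_scores(runs)
print(gross['good'], net['good'])      # 3 3: fetched by every run, always positive
print(gross['plush'], net['plush'])    # 2 0: fetched twice with conflicting labels, net score zero

<Paragraph position="5"> Words never fetched by any run keep an implicit score of zero, which is how the bulk of WordNet's adjectives end up in the neutral group in the stratification described below.</Paragraph>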
<Paragraph position="6"> The Net Overlap Scores obtained for each identified word were then used to stratify these words into groups that reflect the positive or negative distance of the words from the zero score. The zero score was assigned to (a) the WordNet adjectives that were not identified by STEP as bearing positive or negative sentiment 6 and to (b) the words with an equal number of positive and negative hits over several STEP runs. The performance measures for each of the groups were then computed to allow the comparison of STEP and human annotator performance on the words from the core and from the periphery of the sentiment category. Thus, for each of the Net Overlap Score groups, both automatic (STEP) and manual (HM) sentiment annotations were compared to the human-annotated GI-H4, which was used as a gold standard in this experiment.</Paragraph> <Paragraph position="7"> Over the 58 runs, the system identified 3,908 English adjectives as positive and 3,905 as negative, while the remainder (14,328) of WordNet's 22,141 adjectives was deemed neutral. Of these, 884 were also found in the GI-H4 and/or HM lists, which allowed us to evaluate STEP performance and HM-GI agreement on the subset of neutrals as well. The graph in Figure 1 shows the distribution of adjectives by Net Overlap Score and the average accuracy/agreement rate for each group.</Paragraph> <Paragraph position="8"> Figure 1 shows that the greater the Net Overlap Score, and hence the greater the distance of the word from the neutral subcategory (i.e., from zero), the more accurate the STEP results and the greater the agreement between the two teams of human annotators (HM and GI-H4). On average, across all categories including the neutrals, the accuracy of STEP vs. GI-H4 was 66.5%, while the human-annotated HM had 78.7% accuracy vs. GI-H4. For the words with a Net Overlap Score of ±7 or greater, both STEP and HM had accuracy around 90%. The accuracy declined dramatically as the Net Overlap Scores approached zero (i.e., the neutrals). In this category, the human-annotated HM showed only 20% agreement with GI-H4, while STEP, which deemed these words neutral rather than positive or negative, performed with 57% accuracy.</Paragraph> <Paragraph position="9"> These results suggest that the two measures of word centrality, the Net Overlap Score based on multiple STEP runs and the inter-annotator agreement (HM vs. GI-H4), are directly related 7. Thus, the Net Overlap Score can serve as a useful tool in the identification of core and peripheral members of a fuzzy lexical category, as well as in the prediction of inter-annotator agreement and system performance on a subgroup of words characterized by a given Net Overlap Score value.</Paragraph> <Paragraph position="10"> Footnote 7: In our sample, the coefficient of correlation between the two was 0.68. The absolute Net Overlap Score on the subgroups 0 to 10 was used in the calculation of the coefficient of correlation.</Paragraph> <Paragraph position="11"> In order to make the Net Overlap Score measure usable in sentiment tagging of texts and phrases, the absolute values of this score should be normalized and mapped onto the standard [0,1] interval. Since the values of the Net Overlap Score may vary depending on the number of runs used in the experiment, such a mapping eliminates the variability in the score values introduced by changes in the number of runs performed. In order to accomplish this normalization, we used the value of the Net Overlap Score as a parameter in the standard fuzzy membership S-function (Zadeh, 1975; Zadeh, 1987). This function maps the absolute values of the Net Overlap Score onto the interval from 0 to 1, where 0 corresponds to the absence of membership in the category of sentiment (in our case, these will be the neutral words) and 1 reflects the highest degree of membership in this category.</Paragraph> <Paragraph position="12"> The function (the standard S-function) can be defined as follows:

\[
S(u; \alpha, \beta, \gamma) =
\begin{cases}
0 & \text{for } u \le \alpha \\
2\left(\dfrac{u-\alpha}{\gamma-\alpha}\right)^{2} & \text{for } \alpha \le u \le \beta \\
1 - 2\left(\dfrac{u-\gamma}{\gamma-\alpha}\right)^{2} & \text{for } \beta \le u \le \gamma \\
1 & \text{for } u \ge \gamma
\end{cases}
\]

where u is the Net Overlap Score for the word and α, β, γ are the three adjustable parameters: α is set to 1, γ is set to 15, and β, which represents the crossover point, is defined as β = (γ + α)/2 = 8. With these values, for example, a word with u = 8 receives a membership degree of 0.5.</Paragraph> <Paragraph position="13"> Defined this way, the S-function assigns the highest degree of membership (= 1) to words that have a Net Overlap Score u ≥ 15. The accuracy vs. GI-H4 on this subset is 100%. The accuracy goes down as the degree of membership decreases and reaches 59% for the values with the lowest degrees of membership.</Paragraph> </Section> </Paper>