<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1121"> <Title>Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis</Title>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Features </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Feature vectors </SectionTitle>
<Paragraph position="0"> We experimented with a range of different feature sets. Most importantly, we wanted to establish whether we would gain any significant advantage in the sentiment classification task by using features based on deep linguistic analysis, or whether surface-based features would suffice.</Paragraph>
<Paragraph position="1"> Previous results in authorship attribution and style classification experiments had indicated that linguistic features contribute to the overall accuracy of the classifiers, although our null hypothesis, based on a review of the relevant literature on sentiment classification, was that we would not gain much by using these features. The surface features we used were lemma unigrams, lemma bigrams, and lemma trigrams.</Paragraph>
<Paragraph position="2"> For the linguistic features, we performed a linguistic analysis of the data with the NLPWin natural language processing system developed at Microsoft Research (an overview can be found in Heidorn 2000). NLPWin provides us with a phrase structure tree and a logical form for each string, from which we can extract an additional set of features: * part-of-speech trigrams * constituent-specific length measures (length of sentence, clauses, adverbial/adjectival phrases, and noun phrases) * constituent structure in the form of context-free phrase structure patterns for each constituent in a parse tree. Example: DECL::NP VERB NP (a declarative sentence consisting of a noun phrase, a verbal head, and a second noun phrase) * part-of-speech information coupled with semantic relations (e.g. &quot;Verb - Subject Noun&quot; indicating a nominal subject to a verbal predicate) * logical form features provided by NLPWin, such as transitivity of a predicate, tense information, etc.</Paragraph>
<Paragraph position="3"> For each of these features, except for the length features, we extract a binary value, corresponding to the presence or absence of that feature in a given document. Using binary values for presence/absence as opposed to frequency values is motivated by the rather extreme brevity of these documents.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Feature reduction </SectionTitle>
<Paragraph position="0"> Feature reduction is an important part of optimizing the performance of a (linear) classifier: as a starting point, the feature vector is reduced to a size that does not exceed the number of training cases. Further reduction of the vector size can lead to further improvements if the features are noisy or redundant.</Paragraph>
<Paragraph position="1"> Reducing the number of features in the feature vector can be done in two different ways: * reduction to the top-ranking n features based on some criterion of &quot;predictiveness&quot; * reduction by elimination of sets of features (e.g. elimination of linguistic analysis features) Experimenting with the elimination of feature sets answers the question of which qualitative sets of features play a significant role in the classification task. Of course these methods can also be combined, for example by eliminating sets of features and then taking the top-ranking n features from the remaining set.</Paragraph>
<Paragraph position="2"> We used both techniques (and their combinations) in our experiments. The measure of &quot;predictiveness&quot; we employed is the log likelihood ratio with respect to the target variable (Dunning 1993).</Paragraph>
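As an illustration of the ranking step just described, the following sketch computes a Dunning-style log likelihood ratio score for each binary feature against the target class and keeps the top n features. The paper provides no code; the Python below, the binary-presence document representation, and the function names are our own assumptions.

import math
from collections import Counter

def llr(k11, k12, k21, k22):
    # Dunning-style log likelihood ratio (G^2) for a 2x2 contingency table:
    # rows = feature present/absent, columns = target class A / class B.
    total = k11 + k12 + k21 + k22
    if total == 0:
        return 0.0
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    g2 = 0.0
    for observed, expected in ((k11, row1 * col1 / total),
                               (k12, row1 * col2 / total),
                               (k21, row2 * col1 / total),
                               (k22, row2 * col2 / total)):
        if observed > 0 and expected > 0:
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

def top_n_features(docs, labels, n=2000):
    # docs: one set of (binary, presence-only) features per document;
    # labels: parallel list of class labels. Returns the n highest-scoring features.
    pos = labels[0]                      # LLR is symmetric, so either class can serve as the "column"
    n_pos = sum(1 for y in labels if y == pos)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for feats, y in zip(docs, labels):
        (df_pos if y == pos else df_neg).update(feats)
    scores = {}
    for f in set(df_pos) | set(df_neg):
        k11, k12 = df_pos[f], df_neg[f]          # documents containing f, per class
        k21, k22 = n_pos - k11, n_neg - k12      # documents without f, per class
        scores[f] = llr(k11, k12, k21, k22)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]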
<Paragraph position="3"> In the experiments described below, n (in the n top-ranked features) ranged from 1000 to 40,000. The different feature set combinations we used were: * &quot;all features&quot; * &quot;no linguistic features&quot; (only word n-grams) * &quot;surface features&quot; (word n-grams, function word frequencies, and POS n-grams) * &quot;linguistic features only&quot; (no word n-grams)</Paragraph> </Section> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> Given the four different rankings associated by users with their feedback, we experimented with two distinct classification scenarios: 1. classification of documents as belonging to category 1 versus category 4; 2. classification of documents as belonging to categories 1 or 2 on the one hand, and 3 or 4 on the other. Two additional scenarios can be envisioned. In the first, two classifiers (&quot;1 versus 2/3/4&quot; and &quot;4 versus 1/2/3&quot;) would be trained and their votes combined through weighted probability voting or other classifier combination methods (Dietterich 1997). A second possibility is to learn a three-way distinction &quot;1 versus 2/3 versus 4&quot;. In this paper we restrict ourselves to scenarios 1 and 2 above. Initial experiments suggest that the combination of two classifiers yields only minimal improvements.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Classification of category 1 versus category 4 </SectionTitle>
<Paragraph position="0"> Figure 1 below illustrates the accuracy of the &quot;1 versus 4&quot; classifier at different feature reduction cutoffs and with different feature sets. The accuracy differences are statistically significant at the 0.99 confidence level, based on the 10-fold cross-validation scenario. Figure 2 and Figure 3 show the F1-measure for target value 4 (&quot;good sentiment&quot;) and target value 1 (&quot;bad sentiment&quot;), respectively. The baseline for this experiment is 50.17% (choosing category 4 as the value for the target feature by default).</Paragraph>
<Paragraph position="1"> Accuracy peaks at 77.5% when the top 2000 features in terms of log likelihood ratio are used, and when the feature set is not restricted, i.e. when these top 2000 features are drawn from linguistic and surface features. We will return to the role of linguistic features in section 4.4.</Paragraph>
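The experimental setup in this section (binary presence features, a linear SVM, 10-fold cross-validation, accuracy and per-class F1) could look roughly like the sketch below. This is a reconstruction under our own assumptions: scikit-learn's LinearSVC stands in for the paper's linear SVM, and the integer labels 1 and 4 are the feedback ratings described above.

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

def cross_validate(docs, labels, n_splits=10):
    # docs: one dict per document mapping feature name -> 1 (binary presence);
    # labels: integer category per document (1 = bad sentiment, 4 = good sentiment).
    vec = DictVectorizer()
    X = vec.fit_transform(docs)
    y = np.array(labels)
    accs, f1_good, f1_bad = [], [], []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1_good.append(f1_score(y[test_idx], pred, pos_label=4))
        f1_bad.append(f1_score(y[test_idx], pred, pos_label=1))
    return np.mean(accs), np.mean(f1_good), np.mean(f1_bad)

In such a setup, feature reduction would be applied to the training portion of each fold, for example with a ranking like the log likelihood ratio sketch in section 3.2.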
<Paragraph position="2"> The F1-measure for both target 4 (Figure 2) and target 1 (Figure 3) exhibits a similar picture: again, we achieve maximum performance by using the top 2000 features from the complete pool of features.</Paragraph>
<Paragraph position="3"> [Figure: Accuracy, 1 versus 4] </Paragraph> </Section>
<Section position="2" start_page="0" end_page="71" type="sub_section"> <SectionTitle> 4.2 Classification of categories 1 and 2 versus 3 and 4 </SectionTitle>
<Paragraph position="0"> Accuracy and F1-measure results for the &quot;1/2 versus 3/4&quot; task are shown in Figure 4, Figure 5, and Figure 6. Again, the accuracy differences are statistically significant. The baseline in this scenario is 56.81% (choosing category 3/4 for the target feature by default). Classification accuracy is lower than in the &quot;1 versus 4&quot; scenario, as can be expected since the fuzzy categories 2 and 3 are included in the training and test data.</Paragraph>
<Paragraph position="1"> As in the &quot;1 versus 4&quot; classification, accuracy is maximal at 69.48% when the top 2000 features from the complete feature set are used.</Paragraph>
<Paragraph position="2"> The F1-measure for the target value 1/2 peaks at the same feature reduction cutoff, whereas the F1-measure for the target value 3/4 benefits from more drastic feature reduction to a set of only the top-ranked 1000 features.</Paragraph>
<Paragraph position="3"> [Figure: Accuracy, 1/2 versus 3/4] </Paragraph> </Section>
<Section position="3" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 4.3 Results compared to human classification </SectionTitle>
<Paragraph position="0"> The numbers reported in the previous sections are substantially lower than results that have been reported on other data sets such as movie or restaurant reviews. Pang et al. (2002), for example, report a maximum accuracy of 82.9% on movie reviews. As we have observed in section 2, the data that we are dealing with here are extremely noisy. Recall that on a random sample of 200 pieces of feedback, even a human evaluator could only assign a sentiment classification to 117 of the documents, the remaining 83 being either balanced in their sentiment, or too unclear or too short to be classifiable at all. In order to assess the performance of our classifiers on &quot;cleaner&quot; data, we used the 117 pieces of customer feedback that the human evaluator could classify as a test set for the best-performing classifier scenario. For that purpose, we retrained both the &quot;1 versus 4&quot; and the &quot;1/2 versus 3/4&quot; classifiers with the top-ranked 2000 features on our data set, with the human-evaluated cases removed from the training set. Results are shown in Table 2; the baseline in this experiment is 77.78% (choosing the &quot;bad&quot; sentiment as the default).</Paragraph>
<Paragraph position="1"> The accuracy of 85.47% achieved in the &quot;1 versus 4&quot; scenario is in line with accuracy numbers reported for less noisy domains.</Paragraph> </Section>
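The held-out evaluation described in section 4.3 could be sketched as follows: remove the human-inspected documents from the training data, retrain the best configuration (top 2000 features, linear classifier), and measure accuracy on the 117 documents the evaluator could classify. This is again an assumed reconstruction (scikit-learn, plus the hypothetical top_n_features helper from the sketch in section 3.2), not the authors' code.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_on_human_set(all_docs, all_labels, human_ids, human_labels, top_n=2000):
    # all_docs: id -> binary feature dict; all_labels: id -> category (1..4);
    # human_ids: the 200 human-inspected documents, removed from training;
    # human_labels: id -> gold label for the 117 documents the evaluator could classify.
    held_out = set(human_ids)
    train_ids = [i for i in all_docs if i not in held_out]
    selected = set(top_n_features([all_docs[i] for i in train_ids],
                                  [all_labels[i] for i in train_ids], n=top_n))
    restrict = lambda d: {f: v for f, v in d.items() if f in selected}
    vec = DictVectorizer()
    X_train = vec.fit_transform([restrict(all_docs[i]) for i in train_ids])
    clf = LinearSVC().fit(X_train, [all_labels[i] for i in train_ids])
    test_ids = sorted(human_labels)              # the 117 classifiable documents
    X_test = vec.transform([restrict(all_docs[i]) for i in test_ids])
    return accuracy_score([human_labels[i] for i in test_ids], clf.predict(X_test))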
<Section position="4" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 4.4 The role of linguistic analysis features </SectionTitle>
<Paragraph position="0"> Figure 1 through Figure 6 also show the effect of eliminating whole feature sets from the training process. A result that came as a surprise to us is that the presence of very abstract linguistic analysis features, based on constituent structure and semantic dependency graphs, improves the performance of the classifiers. The only exception to this observation is the F1-measure for the &quot;good&quot; sentiment case in the &quot;1/2 versus 3/4&quot; scenario (Figure 5), where the different feature sets yield very similar performance across the feature reduction spectrum, with the &quot;no linguistic features&quot; set even outperforming the other feature sets by a very small margin (0.18%). While the improvement may in practice be too small to warrant the overhead of linguistic analysis, it is very interesting from a linguistic point of view that even in a domain as noisy as this one, there seem to be robust stylistic and linguistic correlates with sentiment. Note that in the &quot;1 versus 4&quot; scenario we can achieve a classification accuracy of 74.5% by using only linguistic features (Figure 1), without the use of any word n-gram features (or any other word-based information) at all. This clearly indicates that affect and style are linked in a more significant way than has previously been suggested in the literature.</Paragraph> </Section>
<Section position="5" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 4.5 Relevant features </SectionTitle>
<Paragraph position="0"> Given that linguistic features play a consistent role in the experiments described here, we inspected the models to see which features play a particularly big role, as indicated by their associated weights in the linear SVM. This is particularly interesting in light of the fact that in previous research on sentiment classification, affect lexica or other special semantic resources have served as a source for features (see references in section 1). When looking at the top 100 weighted features in the best classifier (&quot;1 versus 4&quot;), we found an interesting mix of the obvious and the not-so-obvious. Amongst the top 100 are obviously &quot;affect&quot;-charged terms and features. On the other hand, there are many features that carry high weights but are not what one would intuitively think of as typical affect indicators: try the, of, off, ++Univ. We conclude from this inspection of individual features that within a specific domain it is not necessarily advisable to start out with a resource that has been geared towards containing particularly affect-charged terminology. See Pang et al. (2002) for a similar argument. As our numbers and feature sets suggest, there are many terms (and grammatical patterns) associated with sentiment in a given domain that may not fall into a typical affect class.</Paragraph>
<Paragraph position="1"> We believe that these results show that, as with many other classification tasks in the machine learning literature, it is preferable to start without an artificially limited, &quot;hand-crafted&quot; set of features. By using large feature sets derived from the data, and by paring down the number of features through a feature reduction procedure if necessary, relevant patterns can be identified in the data that may not have been obvious to human intuition.</Paragraph> </Section> </Section> </Paper>
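The weight inspection behind section 4.5 can be illustrated with one more sketch: train the linear classifier on the binary-presence features and list the features with the largest absolute weights. As before, scikit-learn's LinearSVC and the function names are our own assumptions, not the authors' tooling.

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def top_weighted_features(docs, labels, k=100):
    # docs: binary-presence feature dicts; labels: category per document.
    # Returns the k features with the largest absolute weight in the linear model,
    # i.e. the kind of per-feature inspection described in section 4.5.
    vec = DictVectorizer()
    X = vec.fit_transform(docs)
    clf = LinearSVC().fit(X, labels)
    weights = clf.coef_.ravel()                      # one weight per feature for a binary task
    names = np.asarray(vec.get_feature_names_out())
    order = np.argsort(-np.abs(weights))[:k]
    return [(names[i], float(weights[i])) for i in order]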