<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0308">
  <Title>Towards a validated model for affective classification of texts</Title>
  <Section position="5" start_page="56" end_page="57" type="metho">
    <SectionTitle>
3 Experiment 1: Distinguishing the four Quadrants
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="56" end_page="56" type="sub_section">
      <Paragraph position="0"> Our hypothesis is that the classi cation of two disjoint sets of moods should yield a classi cation accuracy signi cantly above a baseline of 50%. To verify our hypothesis, we conducted a series of experiments using machine learning to classify weblog posts according to their mood, each class corresponding to one particular quadrant. We used Support Vector Machines (Joachims, 2001) with three basic classic features (unigrams, POS and stems) to classify the posts as belonging to one quadrant or one of the three others. For each classi cation task, we extracted randomly 1000 testing examples, and trained separately with 2000, 4000, 8000 and 16000 examples. In each case, examples were divided equally among positive and negative examples3. The set of features used varied for each of these tasks, they were selected by thresholding each (distinct) training data set, after removing words (unigrams) from the categories poor in affective content (prepositions, determiners, etc.). To qualify as a feature, each unigram, POS or stem had to occur at least three times in the training data. The value of each feature corresponds to its number of occurence in the training examples.</Paragraph>
    </Section>
    <Section position="2" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
3.1 Results
</SectionTitle>
      <Paragraph position="0"> Our hypothesis is that, if the four quadrants depicted in Figure 1 are a suitable arrangement for affective states in the EA space, a classi er should perform signi cantly better than chance (50%).</Paragraph>
      <Paragraph position="1"> Table 1 shows the results for the binary classi cation of the quadrants. In this table, the rst column identi es the classi cation task in the form 'P vs N', where 'P' stands for positive examples and 'N' for negative examples. The 'Random' row shows results for selecting positive and negative examples randomly from all four quadrants. By 3For instance, 1000 = 500 positives from one QUAD-RANT + 500 negatives among the other three QUADRANTS. null micro-averaging accuracy for the classi cation of each quadrant vs all others (rows 10 to 13), we obtain at least 60% accuracy for the four binary classi cations of the quadrants4. The rst six rows show evidence that each quadrant forms a distinctive whole, as the classifer can easily decide between any two of them.</Paragraph>
    </Section>
    <Section position="3" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
3.2 Analysis of Results
</SectionTitle>
      <Paragraph position="0"> We introduce now table 2 that shows two thresholds of signi cance (1% and 5%) for the interpretation of current and coming results. For example, if we have 1000 trials with each trial having a probability of success of 0.5, the likelihood of getting at least 53.7% of the trials right is only 1%.</Paragraph>
      <Paragraph position="1"> This gives us a baseline to see how signi cantly well above chance a classi er performs. The SVM algorithm has linearly separated the data for each quadrant according to lexical and POS content (the features). The most sensible explanation is that the features for each class (quadrant) are semantically related, a piece of information which is relevant for the model (see section 4). It is safe to conclude that the results cannot be allocated to chance, that there is something else at work that explains the 4Micro-averaged accuracy is de ned as:</Paragraph>
      <Paragraph position="3"> where tp stands for true positive , fn for false negative , etc.</Paragraph>
      <Paragraph position="4">  accuracies consistently well above a baseline, and this something else is the typology. These results show that the abstraction offered by the four quadrants in the model seems correct. This is also supported by the observation that the classi er shows no improvements over the baseline if trained over a random selection of examples in the entire space.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="57" end_page="59" type="metho">
    <SectionTitle>
4 Experiment 2: Classification using Semantic Orientation from Association
</SectionTitle>
    <Paragraph position="0"> Semantic Orientation from Association Our next goal is to be able to classify a text according to more than four classes (positive/negative, active/passive), by undertaking multi-category classi cation of texts according to particular regions of the space, (such as 'angry', 'sad', etc.). In order to do that we need a scoring system for each axis. In the following experiments we explore the use of such scores and give some insights into how to transform these scores of affect as measures of affect.</Paragraph>
    <Paragraph position="1"> Using binary classi ers, we have already established that if we look at the lexical contents of weblog posts tagged according to their mood by their author, these mood classes tend to cluster according to a two-dimensional typology de ned by their semantic orientation: positive or negative (evaluation), active or passive (activity). Beyond academic importance, the typology really becomes of practical interest if we can classify the posts using pre-de ned automated scores for both axis.</Paragraph>
    <Paragraph position="2"> One strategy of scoring is to extract phrases, including single words, which are good indicators of subjectivity in texts, and score them according to how they relate or 'associate' to one or the other extremity of each axis. This strategy, called Semantic Orientation (SO) from Association (A) has been used successfully (Turney and Littman, 2003) to classify texts or adjectives of all sorts according to their sentiments (in our typology this corresponds to the evaluation dimension). According to these scores, a text or adjective can be said to have, for example, a more or less positive or negative evaluation. We will use this strategy to go further in the validation of our model of affective states by scoring also the activity dimension; to our knowledge, this is the rst time this strategy is employed to get (text) scores for dimensions other than evaluation. In SO-A, we score the strength of the association between an indicator from the text and a set of positive or negative words (the paradigms Pwords and Nwords) capturing the very positive/active or negative/passive semantic orientation of the axis poles. To get the SO-A of a text, we sum over positive scores for indicators positively related to Pwords and negatively related to Nwords and negative scores for indicators positively related to Nwords and negatively related to Pwords. In mathematical terms, the SO-A of a text is:</Paragraph>
    <Paragraph position="4"> where ind stands for indicator. Note that the quantity of Pwords must be equal to Nwords.</Paragraph>
    <Paragraph position="5"> To compute A, (Kamps et al. , 2004) focus on the use of lexical relations de ned in Word-Net5 and de ne a distance measure between two terms which amounts to the length of the shortest path that connects the two terms. This strategy is interesting because it constrains all values to belong to the [-1,+1] range, but can be applied only to a nite set of indicators and has yet to be tested for the classi cation of texts. (Turney and Littman, 2003) use Pointwise Mutual Information - Information Retrieval (PMI-IR); PMI-IR operates on a wider variety of multi-words indicators, allowing for contextual information to be taken into account, has been tested extensively on different types of texts, and the scoring system can be potentially normalized between [-1,+1], as we will soon see. PMI (Church and Hanks, 1990) between two phrases is de ned as:</Paragraph>
    <Paragraph position="7"> PMI is positive when two phrases tend to co-occur and negative when they tend to be in a complementary distribution. PMI-IR refers to the fact  that, as in Informtion Retrieval (IR), multiple occurrences in the same document count as just one occurrence: according to (Turney and Littman, 2003), this seems to yield a better measure of semantic similarity, providing some resistance to noise. Computing probabilities using hit counts from IR, this yields to a value for PMI-IR of:</Paragraph>
    <Paragraph position="9"> where N is the total number of documents in the corpus. We are going to use this method for computing A in SO-A, which we call SO-PMI-IR. The con guration depicted in the remaining of this section follows mostly (Turney and Littman, 2003).</Paragraph>
    <Paragraph position="10"> Smoothing values (1/N and 1) are chosen so that PMI-IR will be zero for words that are not in the corpus, two phrases are considered NEAR if they co-occur within a window of 20 words, and log2 has been replaced by logn, since the natural log is more common in the literature for log-odds ratio and this makes no difference for the algorithm.</Paragraph>
    <Paragraph position="11"> Two crucial aspects of the method are the choice of indicators to be extracted from the text to be classi ed, as well as the sets of positive and negative words to be used as paradigms for the evaluation and activity dimensions. The ve part-of-speech (POS) patterns from (Turney, 2002) were used for the extraction of indicators, all involving at least one adjective or adverb. POS tags were acquired with TreeTagger (Schmid, 1994)6. Ideally, words used as paradigms should be context insensitive, i.e their semantic orientation is either always positive or negative. The adjectives good, nice, excellent, positive, fortunate, correct, superior and bad, nasty, poor, negative, unfortunate, wrong, inferior were used as near pure representations of positive and negative evaluation respectively, while fast, alive, noisy, young and slow, dead, quiet, old as near pure representations of active and passive activity (Summers, 1970).</Paragraph>
    <Paragraph position="12"> Departing from (Turney and Littman, 2003), who uses the Alta Vista advanced search with approximately 350 millions web pages, we used the Waterloo corpus7, with approximately 46 millions pages. To avoid introducing confusing heuristics, we stick to the con guration described above, but (Turney and Littman, 2003) have experimented with different con guation in computing SO-PMI-</Paragraph>
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
4.1 The Typology and SO-PMI-IR
</SectionTitle>
      <Paragraph position="0"> We now use the typology with an automated scoring method for semantic orientation. The results are presented in the form of a Confusion Matrix (CM). In this and the following matrices, the topleft cell indicates the overall accuracy8, the POSitive (ACTive) and NEGative (PASsive) columns represent the instances in a predicted class, the P/T column (where present) indicates the average number of patterns per text (blog post), E/P indicates the average evaluation score per pattern and A/P indicates the average activity score per pattern. Each row represents the instances in an actual class9.</Paragraph>
      <Paragraph position="1"> First, it is useful to get a clear idea of how the SO-PMI-IR experimental setup we presented compares with (Turney and Littman, 2003) on a human-annotated set of words according to their evaluation dimension: the General Inquirer (GI, (Stone, 1966)) lexicon is made of 3596 words (1614 positives and 1982 negatives)10. Table 3 summarizes the results. (Turney and Littman,  Littman, 2003) 2003) reports an accuracy of 82.8% while classifying those words, while our experiment yields an accuracy of 76.4% for the same words. Their results show that their classi er errs very slightly towards the negative pole (as shown by the accuracies of both predicted classes) and has a very balanced distribution of the word scores (as shown by the almost equal but opposite in signs values of E/Ps). This is some evidence that the paradigm words are appropriate as near pure representations of positive and negative evaluation. By contrast,  ble 3, our classi er classi ed 59.3% of the 1614 positive instances as positive and 40.7% as negative, with an average score of 1.5 per pattern.</Paragraph>
      <Paragraph position="2"> 10Note that all moods in the typology present in the GI have the same polarity for evaluation in both, which is some evidence in favour of the typology.</Paragraph>
      <Paragraph position="3">  our classi er appears to be more strongly biased towards the negative pole, probably due to the use of different corpora. This bias11should be kept in mind in the interpretation of the results to come. The second experiment focuses on the words from the typology. Table 4 shows the results. The  value of 1 under P/T re ects the fact that the experiment amounts, in practical terms, to classifying the annotation of the post (a single word). For the evaluation dimension, there is another shift towards the negative pole of the axis, which suggests that words in the typology are distributed not exactly as shown on gure 1, but instead appear to have a true location shifted towards the negative pole. The activity dimension also appear to have a negative (i.e passive) bias. There are two main possible reasons for that: words in the typology should be shifted towards the passive pole (as in the evaluation case), or the paradigm words for the passive pole are not pure representations of the extremity of the pole 12.</Paragraph>
      <Paragraph position="4"> Having established that our classi er has a negative bias for both axes, we now turn to the classi cation of the quadrants per se. In the next section, we used SO-PMI-IR to classify 1000 randomnly selected blog posts from our corpus, i.e 250 in each of the four quadrants. Some of these posts were found to have no pattern and were therefore not classi ed, which means that less than 1000 posts were actually classi ed in each experiment.</Paragraph>
      <Paragraph position="5"> We also report on the classi cation of an important subcategory of these moods called the Big Six emotions.</Paragraph>
      <Paragraph position="6"> 11Bias can be introduced by the use of a small corpus, inadequate paradigm words or typology. In practice, a quick x for neutralizing bias would be to normalize the SO-PMI-IR values by subtracting the average. This work aims at tuning the model to remove bias introduced by unsound paradigm words or typology.</Paragraph>
      <Paragraph position="7"> 12At the time of experimenting, we were not aware of an equivalent of the GI to independently verify our paradigm words for activity, but one reviewer pointed out such a resource, see http://www.wjh.harvard.edu/ ~inquirer/spreadsheet_guide.htm.</Paragraph>
    </Section>
  </Section>
</Paper>