<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2081">
  <Title>Whose thumb is it anyway? Classifying author personality from weblog text</Title>
  <Section position="4" start_page="627" end_page="627" type="metho">
    <SectionTitle>
2 Background: traits and language
</SectionTitle>
    <Paragraph position="0"> Cattell's pioneering work led to the isolation of 16 primary personality factors, and later work on secondary factors led to Costa and McCrae's five-factor model, closely related to the 'Big Five' models emerging from lexical research (Costa and McCrae, 1992). Each factor gives a continuous dimension for personality scoring. These are: Extraversion; Neuroticism; Openness; Agreeableness; and Conscientiousness (Matthews et al., 2003). Work has also investigated whether scores on these dimensions correlate with language use (Scherer, 1979; Dewaele and Furnham, 1999).</Paragraph>
    <Paragraph position="1"> Building on the earlier work of Gottschalk and Gleser, Pennebaker and colleagues secured significant results using the Linguistic Inquiry and Word Count text analysis program (Pennebaker et al., 2001). This primarily counts relative frequencies of word-stems in pre-defined semantic and syntactic categories. It shows, for instance, that high Neuroticism scorers use: more first person singular and negative emotion words; and fewer articles and positive emotion words (Pennebaker and King, 1999).</Paragraph>
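The word-count approach described above can be illustrated with a short sketch. The category word lists below are hypothetical stand-ins: LIWC's actual dictionaries are far larger, proprietary, and match word-stems rather than only whole words.

```python
import re

# Hypothetical mini-dictionaries for illustration only; the real LIWC
# categories (Pennebaker et al., 2001) are much larger and stem-based.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine"},
    "negative_emotion": {"sad", "hate", "worry", "awful"},
    "article": {"a", "an", "the"},
}

def category_rates(text):
    """Relative frequency of each category's words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1
    return {cat: sum(t in words for t in tokens) / total
            for cat, words in CATEGORIES.items()}

rates = category_rates("I worry that the house is awful. I hate my job.")
```

A high-Neuroticism profile in this scheme would show elevated `first_person_singular` and `negative_emotion` rates and a depressed `article` rate, per the findings cited above.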
    <Paragraph position="2"> So, can a text classifier trained on such features predict the author personality? We know of only one published study: Argamon et al. (2005) focussed on Extraversion and Neuroticism, dividing Pennebaker and King's (1999) population into the top- and bottom-third scorers on a dimension, and discarding the middle third. For both dimensions, using a restricted feature set, they report binary classification accuracy of around 58%: an 8% absolute improvement over their baseline. Although mood is more malleable, work on it is also relevant (Mishne, 2005). Using a more typical feature set (including n-grams of words and parts-of-speech), the best mood classification accuracy was 66%, for 'confused'. At a coarser grain, moods could be classified with accuracies of 57% (active vs. passive), and 60% (positive vs. negative).</Paragraph>
    <Paragraph position="3"> So, Argamon et al. used a restricted feature set for binary classification on two dimensions: Extraversion and Neuroticism. Given this, we now pursue three questions. (1) Can we improve performance on a similar binary classification task? (2) How accurate can classification be on the other dimensions? (3) How accurate can multiple (three-way or five-way) classification be?</Paragraph>
  </Section>
  <Section position="5" start_page="627" end_page="628" type="metho">
    <SectionTitle>
3 The weblog corpus
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="627" end_page="627" type="sub_section">
      <SectionTitle>
3.1 Construction
</SectionTitle>
      <Paragraph position="0"> A corpus of personal weblog ('blog') text has been gathered (Nowson, 2006). Participants were recruited directly via e-mail to suitable candidates, and indirectly by word-of-mouth: many participants wrote about the study in their blogs. Participants were first required to answer sociobiographic and personality questionnaires. The personality instrument has specifically been validated for online completion (Buchanan, 2001). It was derived from the 50-item IPIP implementation of Costa and McCrae's (1992) revised NEO personality inventory; participants rate themselves on 41 items using a 5-point Likert scale. This provides scores for Neuroticism, Extraversion, Openness, Agreeableness and Conscientiousness.</Paragraph>
      <Paragraph position="1"> After completing this stage, participants were requested to submit one month's worth of prior weblog postings. The month was pre-specified so as to reduce the effects of an individual choosing what they considered their 'best' or 'preferred' month. Raw submissions were marked-up using XML so as to automate extraction of the desired text. Text was also marked-up by post type, such as purely personal, commentary reporting of external matters, or direct posting of internet memes such as quizzes. The corpus consisted of 71 participants (47 females, 24 males; average ages 27.8 and 29.4, respectively) and only the text marked as 'personal' from each weblog, approximately 410,000 words. To eliminate undue influence of particularly verbose individuals, the size of each weblog file was truncated at the mean word count plus 2 standard deviations.</Paragraph>
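The truncation step can be sketched as follows. The paper does not say whether population or sample standard deviation was used, so population SD is assumed here; in practice the text itself, not just the count, would be cut at the threshold.

```python
import statistics

def truncate_counts(word_counts):
    """Cap each weblog's word count at the corpus mean plus 2 standard
    deviations, limiting the influence of unusually verbose authors.
    Assumption: population SD (statistics.pstdev)."""
    mean = statistics.mean(word_counts)
    cap = mean + 2 * statistics.pstdev(word_counts)
    return [min(c, cap) for c in word_counts]

counts = [200] * 9 + [2000]      # one very verbose author (toy data)
capped = truncate_counts(counts)  # outlier cut back to mean + 2 SD
```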
    </Section>
    <Section position="2" start_page="627" end_page="628" type="sub_section">
      <SectionTitle>
3.2 Personality distribution
</SectionTitle>
      <Paragraph position="0"> It might be thought that bloggers are more Extravert than most (because they express themselves in public); or perhaps that they are less Extravert (because they keep diaries in the first place). In fact, plotting the Extraversion scores for the corpus authors gives an apparently normal distribution; and the same applies for three other dimensions. However, scores for Openness to experience are not normally distributed. Perhaps bloggers are more Open than average; or perhaps there is response bias. Without a comparison sample of matched non-bloggers, one cannot say, and Openness is not discussed further in this paper.</Paragraph>
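A simple way to screen scores for departure from normality is to examine sample skewness; the paper does not state which normality check was applied, so this stdlib-only sketch is purely illustrative.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: near 0 for a symmetric (e.g. normal)
    distribution, large and positive for a right-skewed one."""
    m = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * sd ** 3)

random.seed(0)
# Toy data standing in for trait scores; not the corpus distributions.
normal_like = [random.gauss(50.0, 10.0) for _ in range(1000)]
right_skewed = [random.expovariate(1.0) for _ in range(1000)]
```

On the corpus, four trait distributions would pass such a screen while Openness would not, matching the observation above.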
    </Section>
  </Section>
  <Section position="6" start_page="628" end_page="630" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We are thus confined to classifying on four personality dimensions. However, a number of other variables remain: different learning algorithms can be employed; authors in the corpus can be grouped in several ways, leading to various classification tasks; and more or less restricted linguistic feature sets can be used as input to the classifier.</Paragraph>
    <Section position="1" start_page="628" end_page="628" type="sub_section">
      <SectionTitle>
4.1 Algorithms
</SectionTitle>
      <Paragraph position="0"> Support Vector Machines (SVM) appear to work well for binary sentiment classification tasks, so Argamon et al. (2003) and Pang and Lee (2005) consider One-vs-All or All-vs-All variants of SVM to permit multiple classification. Choice of algorithm is not our focus, but it remains to be seen whether SVM outperforms Naïve Bayes (NB) for personality classification. Thus, we will use both on the binary Tasks 1 to 3 (defined in section 4.2.1), for each of the personality dimensions, and each of the manually-selected feature sets (Levels I to IV, defined in section 4.3). Whichever performs better overall is then reported in full, and used for the multiple Tasks 4 to 7 (defined in section 4.2.2). Both approaches are applied as implemented in the WEKA toolkit (Witten and Frank, 1999) and use 10-fold cross-validation.</Paragraph>
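The NB-plus-cross-validation setup can be sketched from scratch. The paper used WEKA's implementations with defaults, so the multinomial model, add-one smoothing, and the toy, clearly separable data below are all illustrative assumptions rather than the actual experimental pipeline.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes with add-one smoothing (a from-scratch
    stand-in for the WEKA implementation used in the paper)."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    return vocab, class_counts, word_counts

def predict_nb(model, doc):
    vocab, class_counts, word_counts = model
    total = sum(class_counts.values())
    scores = {}
    for y, cc in class_counts.items():
        denom = sum(word_counts[y].values()) + len(vocab)
        lp = math.log(cc / total)
        for w in doc:
            lp += math.log((word_counts[y][w] + 1) / denom)
        scores[y] = lp
    return max(scores, key=scores.get)

def cross_val_accuracy(docs, labels, k=10, seed=0):
    """k-fold cross-validation, as used throughout the experiments."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held = set(fold)
        train_idx = [i for i in idx if i not in held]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_nb(model, docs[i]) == labels[i]
                       for i in fold)
    return correct / len(docs)

# Toy, trivially separable 'blogs' (hypothetical data, not the corpus):
high_e = [["party", "friends", "fun"]] * 10
low_e = [["quiet", "book", "home"]] * 10
acc = cross_val_accuracy(high_e + low_e, ["H"] * 10 + ["L"] * 10)
```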
    </Section>
    <Section position="2" start_page="628" end_page="629" type="sub_section">
      <SectionTitle>
4.2 Tasks
</SectionTitle>
      <Paragraph position="0"> For any blog, we have available the scores, on continuous scales, of its author on four personality dimensions. But for the classifier, the task can be made more or less easy, by grouping authors on each of the dimensions. The simplest tasks are, of course, binary: given the sequence of words from a blog, the classifier simply has to decide whether the author is (for instance) high or low in Agreeableness. Binary tasks vary in difficulty, depending on whether authors scoring in the middle of a dimension are left out, or not; and if they are left out, what proportion of authors are left out.</Paragraph>
      <Paragraph position="1"> More complex tasks will also vary in difficulty depending on who is left out. But in the cases considered here, middle authors are now included.</Paragraph>
      <Paragraph position="2"> For a three-way task, the classifier must decide if an author is high, medium or low; and those authors known to score between these categories may, or may not, be left out. In the most challenging five-way task, no-one is left out. The point of considering such tasks is to gradually approximate the most challenging task of all: continuous rating.</Paragraph>
      <Paragraph position="3">
4.2.1 Binary classification tasks
In these task variants, the goal is to classify authors as either high or low scorers on a dimension:
1. The easiest approach is to keep the high and low groups as far apart as possible: high scorers (H) are those whose scores fall above 1 SD above the mean; low scorers (L) are those whose scores fall below 1 SD below the mean.</Paragraph>
      <Paragraph position="4"> 2. Task-1 creates distinct groups, at the price of excluding over 50% of the corpus from the analysis. To include more of the corpus, parameters are relaxed: the high group (HH) includes anyone whose score is above .5 SD above the mean; the low group (LL) is similarly placed below.</Paragraph>
      <Paragraph position="5"> 3. The most obvious task (but not the easiest) arises from dividing the corpus in half about the mean score. This creates high (HHH) and low (LLL) groups, covering the entire population. Inevitably, some HHH scorers will actually have scores much closer to those of LLL scorers than to other HHH scorers.</Paragraph>
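The binary groupings share one mechanism: a symmetric cut at mean ± width × SD. A sketch, assuming sample SD (the paper does not specify which estimator was used):

```python
import statistics

def binary_groups(scores, width=1.0):
    """Split authors into high and low groups at mean +/- width * SD,
    discarding those in between. width=1.0 gives Task-1's H/L groups;
    width=0.5 gives Task-2's HH/LL groups; width=0.0 approximates
    Task-3's split about the mean."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    high = [s for s in scores if s > m + width * sd]
    low = [s for s in scores if m - width * sd > s]
    return high, low

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy trait scores
```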
      <Paragraph position="6"> These sub-groups are tabulated in Table 1, giving the size of each group within each trait. Note that in Task-2, the standard-deviation-based divisions contain very nearly the top third and bottom third of the population for each dimension. Hence, Task-2 is closest in proportion to the division used by Argamon et al.</Paragraph>
      <Paragraph position="8"> Table 1: Sub-group sizes by author numbers. N = Neuroticism; E = Extraversion; A = Agreeableness; C = Conscientiousness.</Paragraph>
      <Paragraph position="9">
4.2.2 Multiple classification tasks
4. Takes the greatest distinction between high (H) and low (L) groups from Task-1, and adds a medium group, but attempts to reduce the possibility of inter-group confusion by including only the smaller medium (m) group omitted from Task-2. Not all subjects are therefore included in this analysis. Since the three groups to be classified are completely distinct, this should be the easiest of the four multiple-class tasks.</Paragraph>
      <Paragraph position="10">
5. Following Task-4, this uses the most distinct high (H) and low (L) groups, but now considers all remaining subjects medium (M).
6. Following Task-2, this uses the larger high (HH) and low (LL) groups, with all those in between forming the medium (m) group.</Paragraph>
      <Paragraph position="11"> 7. Using the distinction between the high and low groups of Task-5 and -6, this creates a 5-way split: highest (H), relatively high (h), medium (m), relatively low (l) and lowest (L). With the greatest number of classes, this task is the hardest.</Paragraph>
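Task-7's five-way assignment follows directly from the thresholds above; the treatment of exact boundary values in this sketch is an assumption, since the paper does not specify it.

```python
def five_way(score, mean, sd):
    """Task-7's 5-way split: H above mean + SD, h between mean + .5 SD
    and mean + SD, m in the middle band, l and L mirrored below."""
    if score > mean + sd:
        return "H"
    if score > mean + 0.5 * sd:
        return "h"
    if mean - sd > score:
        return "L"
    if mean - 0.5 * sd > score:
        return "l"
    return "m"

labels = [five_way(s, 0.0, 1.0) for s in (-1.5, -0.7, 0.0, 0.7, 1.5)]
```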
      <Paragraph position="12"> These sub-groups are tabulated in Table 2, giving the size of each group within each trait.</Paragraph>
      <Paragraph position="14"/>
    </Section>
    <Section position="3" start_page="629" end_page="630" type="sub_section">
      <SectionTitle>
4.3 Feature selection
</SectionTitle>
      <Paragraph position="0"> There are many possible features that can be used for automatic text classification. These experiments use essentially word-based bi- and trigrams. It should be noted, however, that some generalisations have been made: all proper nouns were identified via CLAWS tagging using the WMatrix tool (Rayson, 2003), and replaced with a single marker (NP1); punctuation was collapsed into a single marker (&lt;p&gt;); and additional tags correspond to non-linguistic features of blogs: for instance, &lt;SOP&gt; and &lt;EOP&gt; were used to mark the start and end of individual blog posts.</Paragraph>
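The normalisation step might look like the sketch below. In the paper, tags came from CLAWS via WMatrix; here they are supplied by hand, and the punctuation test is a simplification.

```python
import re

def normalise(tagged_tokens):
    """Apply the generalisations described above to (token, CLAWS tag)
    pairs: proper nouns become NP1, punctuation collapses to <p>.
    Sketch only; tags are assumed given, not computed."""
    out = []
    for token, tag in tagged_tokens:
        if tag.startswith("NP"):            # CLAWS proper-noun tags
            out.append("NP1")
        elif re.fullmatch(r"[^\w\s]+", token):
            out.append("<p>")
        else:
            out.append(token.lower())
    return out

tokens = normalise([("Sarah", "NP1"), ("went", "VVD"),
                    ("home", "NN1"), (".", "PUN")])
```

N-grams over the normalised stream then yield features like [NP1 &lt;p&gt; NP1] in Table 4.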
      <Paragraph position="1"> Word n-gram approaches provide a large feature space with which to work. But in the general interest of computational tractability, it is useful to reduce the size of the feature set. There are many automatic approaches to feature selection, exploiting, for instance, information gain (Quinlan, 1993). However, 'manual' methods can offer principled ways of both reducing the size of the set and avoiding overfitting. We therefore explore the effect of different levels of restriction on the feature sets, and compare them with automatic feature selection. The levels of restriction are as follows:
I The least restricted feature set consists of the n-grams most commonly occurring within the blog corpus. The feature set for each personality dimension is therefore drawn from the same pool. The difference lies in the number of features selected: the size of the set will match that of the next level of restriction.</Paragraph>
      <Paragraph position="2"> II The next set includes only those n-grams which were distinctive for the two extremes of each personality trait. Only features with a corpus frequency ≥ 5 are included. This allows accurate log-likelihood G2 statistics to be computed (Rayson, 2003). Distinct collocations are identified via a three-way comparison between the H and L groups in Task-1 (see section 4.2.1) and a third, neutral group.</Paragraph>
      <Paragraph position="3"> This neutral group contains all those individuals who fell in the medium group (M) for all four traits in the study; the resulting group was of comparable size to the H and L groups for each trait. Hence, this approach selects features using only a subset of the corpus. N-gram software was used to identify and count collocations within a sub-corpus (Banerjee and Pedersen, 2003). For each feature found, its frequency and relative frequency are calculated. This permits relative frequency ratios and log-likelihood comparisons to be made between High-Low, High-Neutral and Low-Neutral. Only features that prove distinctive for the H or L groups with a significance of p &lt; .01 are included in the feature set.</Paragraph>
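The G2 statistic for one pairwise comparison (e.g. High vs Low) can be computed as below, following the standard two-corpus log-likelihood formulation; the exact implementation used in the paper is not shown, so treat this as a sketch.

```python
import math

def log_likelihood_g2(a, n1, b, n2):
    """Log-likelihood G2 (Rayson, 2003) for a feature occurring a times
    in a corpus of n1 tokens versus b times in a corpus of n2 tokens.
    With 1 degree of freedom, G2 above 6.63 corresponds to p < .01."""
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# 10 occurrences per 1000 tokens in one group vs 1 per 1000 in the other:
g2 = log_likelihood_g2(10, 1000, 1, 1000)
```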
      <Paragraph position="4"> III The next set takes into account the possibility that, for a group used in Level-II, an n-gram may be used relatively frequently, but only because a small number of authors in the group use it very frequently, while others in the same group do not use it at all. To enter the Level-III set, an n-gram meeting the Level-II criteria must also be used by at least 50% of the individuals within the sub-group for which it is reported to be distinctive.</Paragraph>
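The Level-III criterion reduces to a simple coverage filter; representing each author as the set of n-grams in their blog is an implementation assumption.

```python
def passes_coverage(ngram, author_ngram_sets, threshold=0.5):
    """Level-III check: keep an n-gram only if at least `threshold`
    of the authors in its distinctive sub-group use it at least once."""
    users = sum(ngram in s for s in author_ngram_sets)
    return users / len(author_ngram_sets) >= threshold

# Toy sub-group of four authors (hypothetical data):
group = [{"last night"}, {"last night"}, {"cool"}, {"cool"}]
```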
      <Paragraph position="5"> IV While Level-III guards against excessive individual influence, it may abstract too far from the fine-grained variation within a personality trait. The final manual set therefore includes only those n-grams that meet the Level-II criteria with p &lt; .001, meet the Level-III criteria, and also correlate significantly (p &lt; .05) with individual personality trait scores.</Paragraph>
      <Paragraph position="6"> V Finally, it is possible to allow the n-gram feature set to be selected automatically during training. The set to be selected from is the broadest of the manually filtered sets, those n-grams that meet the Level-II criteria. The approach adopted is to use the defaults within the WEKA toolkit: Best First search with the CfsSubsetEval evaluator (Witten and Frank, 1999).</Paragraph>
      <Paragraph position="7"> Thus, a key question is when, if ever, a 'manual' feature selection policy outperforms the automatic selection carried out under Level-V. Levels-II and -III are of particular interest, since they contain features derived from a subset of the corpus. Since different sub-groups are considered for each personality trait, the feature sets which meet the increasingly stringent criteria vary in size. Table 3 contains the size of each of the four manually-determined feature sets for each of the four personality traits. Note again that the number of n-grams selected from the most frequent in the corpus for Level-I matches the size of the set for Level-II. In addition, the features automatically selected are task-dependent, so the Level-V sets vary in size; here, the Table shows the number of features selected for Task-2. To illustrate the types of n-grams in the feature sets, Table 4 contains four of the most significant n-grams from Level-IV for each personality class.</Paragraph>
    </Section>
    <Section position="4" start_page="630" end_page="630" type="sub_section">
      <SectionTitle>
Low High
</SectionTitle>
      <Paragraph position="0"> [was that] [this year]
N [NP1 &lt;p&gt; NP1] [to eat] [&lt;p&gt; after] [slowly &lt;p&gt;] [is that] [and buy] [point in] [and he]
E [last night &lt;p&gt;] [cool &lt;p&gt;] [it the] [&lt;p&gt; NP1] [is to] [to her] [thank god] [this is not]
A [have any] [&lt;p&gt; it is] [have to] [&lt;p&gt; after] [turn up] [not have] [a few weeks] [by the way]
C [case &lt;p&gt;] [&lt;p&gt; i hope] [okay &lt;p&gt;] [how i] [the game] [kind of]
Table 4: n-grams from the Level-IV set.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="630" end_page="631" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> For each of the 60 binary classification tasks (1 to 3), the performance of the two approaches was compared. Naïve Bayes outperformed Support Vector Machines on 41/60, with 14 wins for SVM and 5 draws. With limited space available, we therefore discuss only the results for NB, and use NB for Task-4 to -7. The results for the binary tasks are displayed in Table 5. Those for the multiple tasks are displayed in Table 6. Baseline is the majority classification. The most accurate performance of a feature set for each task is highlighted in bold, while the second most accurate is marked in italic.
Table 5: Binary classification tasks. Raw % accuracy for 4 personality dimensions, 3 tasks, and 5 feature selection policies.</Paragraph>
  </Section>
  <Section position="8" start_page="631" end_page="632" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Let us consider the results as they bear in turn on the three main questions posed earlier: Can we improve on Argamon et al.'s (2005) performance on binary classification for the Extraversion and Neuroticism dimensions? How accurately can we classify on the four personality dimensions? And how does performance on multiple classification compare with that on binary classification? Before addressing these questions, we note the relatively good performance of NB compared with 'vanilla' SVM on the binary classification tasks.</Paragraph>
    <Paragraph position="1"> We also note that automatic selection generally outperforms 'manual' selection; however, overfitting is very likely when examining just 71 data points. Therefore, we do not discuss the Level-V results further.</Paragraph>
    <Section position="1" start_page="631" end_page="632" type="sub_section">
      <SectionTitle>
6.1 Extraversion and Neuroticism
</SectionTitle>
      <Paragraph position="0"> The first main question relates to the feature sets chosen, because the main issue is whether word n-grams can give reasonable results on the Extraversion and Neuroticism classification tasks. Of the current binary classification tasks, Task-2 is most closely comparable to Argamon et al.'s. Here, the best performance for Extraversion was returned by the 'manual' Level-II feature set, closely followed by Level-III. The accuracy of 74.5% represents a 23.4% absolute improvement over baseline (45.8% relative improvement; we report relative improvement over baseline because baseline accuracies vary between tasks). The best performance for Neuroticism was returned by Level-IV. The accuracy of 83.6% represents a 30.4% absolute improvement over baseline (57.1% relative improvement).
Table 6: Multiple classification tasks. Raw % accuracy for 4 personality dimensions, 4 tasks, and 5 feature selection policies.</Paragraph>
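The relative-improvement figures quoted here follow directly from the absolute figures and can be checked:

```python
def relative_improvement(accuracy, absolute_gain):
    """Relative improvement over a majority-class baseline, where
    baseline = accuracy - absolute_gain (all values in percent)."""
    baseline = accuracy - absolute_gain
    return 100.0 * absolute_gain / baseline

# The two Task-2 figures quoted above:
extraversion = relative_improvement(74.5, 23.4)  # baseline 51.1%
neuroticism = relative_improvement(83.6, 30.4)   # baseline 53.2%
```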
      <Paragraph position="1"> First, although we did not use systemic-functional linguistic features, we did test n-grams selected according to more or less strict policies.</Paragraph>
      <Paragraph position="2"> So, considering the manual policies, it seems that Level-IV was the best-performing set for Neuroticism. This might be expected, given that Level-IV potentially overfits, allowing features to be derived from the full corpus. However, in spite of this, Level-II proved best for Extraversion. Secondly, in classifying an individual as high or low on some dimension, Argamon et al. had (for some of their materials) 500 words from that individual, whereas we had approximately 5000 words. The availability of more words per individual is likely to help greatly in training. Additionally, a greater volume of text increases the chances that a long-term 'property' such as personality will emerge.</Paragraph>
    </Section>
    <Section position="2" start_page="632" end_page="632" type="sub_section">
      <SectionTitle>
6.2 Binary classification of all dimensions
</SectionTitle>
      <Paragraph position="0"> The second question concerns the relative ease of classifying the different dimensions. Across each of Task-1 to -3, we find that classification accuracies for Agreeableness and Conscientiousness tend to be higher than those for Extraversion and Neuroticism. In all but two cases, the automatically generated feature set (V) performs best. Putting this to one side, of the manually constructed sets, the unrestricted set (I) performs worst, often below the baseline, while Level-IV is best on each Neuroticism task.</Paragraph>
      <Paragraph position="1"> Overall, II and III are better than IV, although the difference is not large.</Paragraph>
      <Paragraph position="2"> As tasks increase in difficulty--as high and low groups become closer together, and the left-out middle shrinks--performance drops. But accuracy is still respectable.</Paragraph>
    </Section>
    <Section position="3" start_page="632" end_page="632" type="sub_section">
      <SectionTitle>
6.3 Beyond binary classification
</SectionTitle>
      <Paragraph position="0"> The final question is about how classification accuracy suffers as the classification task becomes more subtle. As expected, we find that as we add more categories, the tasks are harder: compare the results in the Tables for Task-1, -5 and -7. And, as with the binary tasks, if fewer mid-scoring individuals are left out, the task is typically harder: compare results for Task-4 and -5. It does seem that some personality dimensions respond to task difficulty more robustly than others. For instance, on the hardest task, the best Extraversion classification accuracy is 10.9% absolute over the baseline (32.2% relative), while the best Agreeableness accuracy is 30.4% absolute over the baseline (77.2% relative). It is notable that the feature set which returns the best results (bar the automatic set V) tends to be Level-II, except for Neuroticism on Task-6, where Level-IV considerably outperforms the other sets.</Paragraph>
      <Paragraph position="1"> A supplementary question is how the best classifiers compare with human performance on this task. Mishne (2005) reports that, for general mood classification on weblogs, the accuracy of his automatic classifier is comparable to human performance. There are also general results on human personality classification performance in computer-mediated communication, which suggest that at least some dimensions can be accurately judged even when computer-mediated.</Paragraph>
      <Paragraph position="2"> Vazire and Gosling (2004) report that for personal websites, relative accuracy of judgment was, in descending order: Openness &gt; Extraversion &gt; Neuroticism &gt; Agreeableness &gt; Conscientiousness.</Paragraph>
      <Paragraph position="3"> Similarly, Gill et al. (2006) report that for personal e-mail, Extraversion is more accurately judged than Neuroticism. The current study does not have a set of human judgments to report. For now, it is interesting to note that the performance profile for the best classifiers, on the simplest tasks, appears to diverge from the general human profile, instead ranking on raw accuracy: Agreeableness &gt; Conscientiousness &gt; Neuroticism &gt; Extraversion.</Paragraph>
    </Section>
  </Section>
</Paper>