<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1088">
  <Title>Linguistic correlates of style: authorship classification with deep linguistic analysis features</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Features
</SectionTitle>
    <Paragraph position="0"> All linguistic features have been automatically extracted using the NLPWin system (for an overview see Heidorn 2000). Note that this system produces partial constituent analyses for sentences even if no spanning parse can be found. The only exception are sentences of more than 50 words which do not result in any assignment of linguistic structure.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Length features
</SectionTitle>
      <Paragraph position="0"> We measure average length of sentences, nounphrases, adjectival/adverbial phrases, and subordinate clauses per document.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Function word frequencies
</SectionTitle>
      <Paragraph position="0"> We measure the frequencies of function word lemmas as identified by the NLPWin system. In order to be maximally &amp;quot;content-independent&amp;quot;, we normalized all personal pronouns to an artificial form &amp;quot;perspro&amp;quot; in order to not pick up on &amp;quot;she&amp;quot; or &amp;quot;he&amp;quot; frequencies which would be linked to the gender of characters in the works of fiction rather than author style. The number of observed function word lemmas is 474.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Part-of-speech trigrams
</SectionTitle>
      <Paragraph position="0"> We extract part-of-speech (POS) trigrams from the documents and use the frequencies of these trigrams as features. The NLPWin system uses a set of 8 POS tags. 819 different POS trigrams are observed in the data.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Syntactic productions
</SectionTitle>
      <Paragraph position="0"> The parses provided by the NLPWin system allow us to extract context-free grammar productions for each sentence, similar to the features in Baayen et al. (1996). Examples of common productions are:</Paragraph>
      <Paragraph position="2"> For each observed production, we measure the per-document frequency of the productions. 15.443 individual productions (types) occurred in our data, the total number of production tokens is 618.500.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Semantic information
</SectionTitle>
      <Paragraph position="0"> We extract two kinds of information from the semantic dependency graphs produced by the NLPWin system: binary semantic features and semantic modification relations. Examples of semantic features are number and person features on nouns and pronouns, tense and aspectual features on verbs, and subcategorization features (indicating realized as opposed to potential subcategorization) on verbs. There is a total of 80 such semantic features.</Paragraph>
      <Paragraph position="1"> Semantic modification relations are represented in a form where for each node A in a semantic graph the POS of A, the POS of all its n daughters  are given. Some common modification structures are illustrated below:  As with the previously discussed features, we measure per-document frequency of the observed modification structures. There are a total of 9377 such structures.</Paragraph>
      <Paragraph position="2"> 3.6 n-gram frequency features The use of word n-gram frequencies is not appropriate for style classification tasks since these features are not sufficiently content-independent. In our experiments, for example, they could pick up on nouns referring to events or locations that are part of the story told in the work of fiction at hand. We included these features in our experiments only as a point of comparison for the purely &amp;quot;form-based&amp;quot; features. In order to prevent the most obvious content-dependency in the word n-gram frequency features, we normalized proper nouns to &amp;quot;NAME&amp;quot; and singular personal pronouns to &amp;quot;Perspro&amp;quot;. null</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.7 Feature selection
</SectionTitle>
      <Paragraph position="0"> While the total number of observed syntactic and semantic patterns is very high, most of the patterns occur only very few times, or even only once.</Paragraph>
      <Paragraph position="1"> In order to eliminate irrelevant features, we employed a simple frequency cutoff, where the frequency of a pattern that occurs less than n times is not included as a feature.</Paragraph>
      <Paragraph position="2"> 4 The machine learning technique: Support vector machines For our experiments we have used support vector machines (SVMs), a machine learning algorithm that constructs a plane through a multi-dimensional hyperspace, separating the training cases into the target classes. SVMs have been used successfully in text categorization and in other classification tasks involving highly dimensional feature vectors (e.g. Joachims 1998, Dumais et al.</Paragraph>
      <Paragraph position="3"> 1998). Diederich et al. (2003) have applied support vector machines to the problem of authorship attribution. For our experiments we have used John  tool (Platt 1999). In the absence of evidence for the usefulness of more complicated kernel functions in similar experiments (Diederich et al. 2003), we used linear SVMs exclusively.</Paragraph>
    </Section>
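The SMO tool itself is not shown here; as an illustrative stand-in rather than the paper's actual setup, a linear SVM over per-document frequency vectors can be trained with scikit-learn (the feature values and author labels below are toy data):
```python
# Illustrative stand-in for the linear-SVM setup: scikit-learn's
# LinearSVC is a comparable linear SVM, not the SMO tool the paper
# used. Documents are dicts of per-document feature frequencies.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

docs = [{"perspro": 0.12, "the": 0.06},
        {"perspro": 0.03, "the": 0.09},
        {"perspro": 0.11, "the": 0.05}]
authors = ["Charlotte", "Anne", "Charlotte"]

vec = DictVectorizer()
X = vec.fit_transform(docs)          # documents -> sparse vectors
clf = LinearSVC().fit(X, authors)    # linear kernel only, as in sec. 4
print(clf.predict(vec.transform([{"perspro": 0.10, "the": 0.06}])))
```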
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> All results discussed in this section should be interpreted against a simple baseline accuracy achieved by guessing the most frequent author (Charlotte). That baseline accuracy is 45.8%. All accuracy differences have been determined to be statistically significant at the .99 confidence level.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Feature sets in isolation
</SectionTitle>
      <Paragraph position="0"> Classification accuracy using the different feature sets (POS trigram frequencies, function word frequencies, syntactic features, semantic features) are shown in Figure 1. The four length features discussed in section 3.1 yielded a classification accuracy of only 54.85% and are not shown in</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Feature sets combined
</SectionTitle>
      <Paragraph position="0"> The combination of all feature sets yields a much increased classification accuracy across frequency thresholds as shown in Figure 2. Combining all features, including length features, consistently outperforms all other scenarios. Restricting features to those that only utilize shallow linguistic analysis, such as the POS trigram features and the function word frequency features reduces accuracy by about one percent. Interestingly, the use of syntactic and semantic features alone yields classification accuracy below the other feature combinations. In combination, though, these features contribute strongly to the overall accuracy.</Paragraph>
      <Paragraph position="1"> Semantic features which constitute the most abstract and linguistically sophisticated class, add to the accuracy of the classifier. This is evidenced by comparing the top two lines in Figure 2 which show the accuracy using all features, and the accuracy using all features except the semantic features. Also included in Figure 2 is the accuracy obtainable by using &amp;quot;content-dependent&amp;quot; bigram and trigram frequency features. As stated above, these features are not adequate for style assessment purposes since they pick up on content, whereas style assessment needs to abstract away from content and measure the form of linguistic expression. It is noteworthy, however, that the true stylistic and &amp;quot;content-independent&amp;quot; features produce a classification accuracy that outperforms the ngram features by a wide margin.</Paragraph>
      <Paragraph position="2"> Precision and recall numbers using all features with a frequency threshold of 75 (which yields the highest accuracy at 97.57%) are shown in Table 1.</Paragraph>
      <Paragraph position="3">  Table 2 shows error reduction rates for the addition of deep linguistic analysis features to the &amp;quot;shallow&amp;quot; baseline of function word frequencies and POS trigrams.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Number of features and frequency threshold
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the number of features at each frequency cutoff. The total number of style-related features ranges from 6018 at a frequency cutoff of at least 5 observed instances to 546 at a frequency cutoff of 500. The size of these feature vectors is at the high end of what has typically been reported in the literature for similar experiments: for example, Argamon-Engelson et al. (1998) use feature vectors of size 1185 for newspaper style detection, Finn and Kushmerick (2003) use 36 POS features and 152 text statistics features for detection of &quot;objective&quot; and &quot;subjective&quot; genre, and Koppel et al. (2004) use 130 features for authorship verification.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> We believe that the results presented in the previous section allow a number of interesting conclusions for research into automatic style and authorship assessment. First, in our experiments the addition of deep linguistic analysis features increases classification accuracy.</Paragraph>
    <Paragraph position="1"> From a linguistic perspective this is no surprise: it is clear that matters of linguistic form are those that can be captured by a syntactic and to some extent by a semantic analysis (as long as the semantic analysis is not so abstract that it completely abstracts away from any form properties of the sentence). It was less clear, though, whether an automatic language analysis system can be reliable enough to provide the necessary feature functions.</Paragraph>
    <Paragraph position="2"> This has been categorically denied in some of the literature (e.g. Stamatos et al. 2000). These statements, however, did not take into account that as long as a language analysis system is consistent in the errors it makes, machine learning techniques can pick up on correlations between linguistic features and style even though the label of a linguistic feature (the &amp;quot;quality&amp;quot; it measures) is mislabeled. Secondly, we would like to emphasize that the results we have achieved are not based on deliberate selection of a small set of features as likely candidates for correlation with style. We have selected sets of features to be included in our experiments, but whether or not an individual feature plays a role was left to the machine learning technique to decide. Ideally, then, we would pass any number of features to the classifier algorithm and expect it to select relevant features during the training process. While this is possible with a large number of training cases, a smaller number of training cases poses a limit to the number of features that should be used to achieve optimal classification accuracy and prevent overfitting. In order to prevent overfitting it is desirable to reduce the vector size to a number that does not exceed the number of training cases. Support vector machines are very robust to overfitting, and in our experiments we find that classification results were quite robust to feature vectors with up to 4 times the size of the training set. However, it is still the case that optimal accuracy is achieved where the size of the feature vector comes close to the training sample (at a frequency cutoff of 75 for the vector containing all sets of features).</Paragraph>
    <Paragraph position="3"> We also examined the features that carried high weights in the SVMs. Among the most highly weighted features we found a mix of different feature types. Below is a very small sample from the top-weighted features (recall that all features measure frequency):  * verbal predicates with a pronominal subject and a clausal object In order to determine whether our results hold on sample documents of smaller size, we conducted a second round of experiments where document length was scaled down to five sentences per document. This yielded a total of 5767 documents, which we subjected to the same 80/20 split and 5fold cross-validation as in the previous experiments. Results as shown in Table 4 are very encouraging: using all features, we achieve a maximum classification accuracy of 85%. As in our previous experiments, removing deep linguistic analysis features degrades the results.</Paragraph>
    <Paragraph position="4">  It should also be clear that simple frequency cutoffs are a crude way of reducing the number of features. Not every frequent feature is likely to be discriminative (in our example, it is unlikely that a period at the end of a sentence is discriminative), and not every infrequent feature is likely to be nondiscriminative. In fact, hapax legomena, the single occurrence of a certain lexeme has been used to discriminate authors. Baayen et al. (1996) also have pointed out the discriminatory role of infrequent syntactic patterns. What we need, then, is a more sophisticated thresholding technique to restrict the feature vector size. We have begun experimenting with log likelihood ratio (Dunning 1993) as a thresholding technique.</Paragraph>
    <Paragraph position="5"> To assess at least anecdotally whether our results hold in a different domain, we also tested on sentences from speeches of George Bush Jr. and Bill Clinton (2231 sentences from the former, 2433 sentences from the latter). Using document samples with 5 sentences each, 10-fold cross-validation and a frequency cutoff of 5, we achieved 87.63% classification accuracy using all features, and 83.00% accuracy using only shallow features (function word frequencies and POS trigrams).</Paragraph>
    <Paragraph position="6"> Additional experiments with similar methodology are under way for a stylistic classification task based on unedited versus highly edited documents within the technical domain.</Paragraph>
  </Section>
class="xml-element"></Paper>