<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1088">
  <Title>Linguistic correlates of style: authorship classification with deep linguistic analysis features</Title>
  <Section position="7" start_page="0" end_page="0" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have shown that the use of deep linguistic analysis features in authorship attribution can yield a significant reduction in error rate over the use of shallow linguistic features such as function word frequencies and part of speech trigrams. We have furthermore argued that by using a modern machine learning technique that is robust to large feature vectors, combining different feature sets yields optimal results. Reducing the number of features (i.e. the number of parameters to be estimated by the learning algorithm) by frequency cutoffs to be in the range of the number of training cases produced good results, although it is to be expected that more intelligent thresholding techniques such as log likelihood ratio will further increase performance. These results hold up even if document size is reduced to only five sentences.</Paragraph>
    <Paragraph position="1"> We believe that these results show that the common argument of the &amp;quot;unreliability&amp;quot; of automatic linguistic processing used for feature extraction for style assessment is not as strong as it seems. As long as the errors introduced by a parser are systematic, a machine learning system presented with a large number of features can still learn relevant correlations.</Paragraph>
    <Paragraph position="2"> Areas for further research in this area include experimentation with additional authorship and style classification tasks/scenarios, experiments with different thresholding techniques and possibly with additional linguistic feature sets.</Paragraph>
    <Paragraph position="3"> Additionally, we plan to investigate the possibility of training different classifiers, each of which contains features from one of the four major feature sets (function word frequencies, POS trigram frequencies, syntactic production frequencies, semantic feature frequencies), and maximally n such features where n is the number of training cases.</Paragraph>
    <Paragraph position="4"> The votes from the ensemble of four classifiers could then be combined with a number of different methods, including simple voting, weighted voting, or &amp;quot;stacking&amp;quot; (Dietterich 1998).</Paragraph>
  </Section>
</Paper>