<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1088"> <Title>Linguistic correlates of style: authorship classification with deep linguistic analysis features</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Data </SectionTitle> <Paragraph position="0"> To test our approach to authorship identification, we used texts by Anne, Charlotte and Emily Bronte. We chose these authors because doing so keeps differences in gender, education and historical style to a minimum, which lets us focus on authorship identification, and because electronic versions of several lengthy texts by these authors are readily available. The texts we used were: Charlotte Bronte: Jane Eyre, The Professor; Anne Bronte: The Tenant of Wildfell Hall,</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Agnes Grey; Emily Bronte: Wuthering Heights </SectionTitle> <Paragraph position="0"> For each of the three authors we collected all sentences from those titles and randomized their order. The total number of sentences per author is 13220 for Charlotte, 9263 for Anne and 6410 for Emily. From these sets of sentences we produced artificial documents of 20 sentences each. We split the resulting 1441 documents 80/20 into training and test sets, which yields 1153 documents for training and 288 for test. All numbers reported in this paper are based on 5-fold cross validation.</Paragraph> </Section> </Section> </Paper>
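<!--
The document construction described in Section 2 could be sketched in Python as follows.
This is a minimal sketch, not the authors' code: the function names make_documents and
train_test_split, the fixed seeds, and the choice to drop leftover sentences that do not
fill a 20-sentence document are all assumptions, so document counts need not match the
reported 1441 exactly.

```python
import random

DOC_LEN = 20  # sentences per artificial document, as stated in the text


def make_documents(sentences, author, doc_len=DOC_LEN, rng=None):
    """Shuffle one author's sentences and cut them into fixed-size documents."""
    if rng is None:
        rng = random.Random(0)  # assumption: fixed seed for reproducibility
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    # Assumption: sentences left over after the last full document are dropped.
    n_docs = len(shuffled) // doc_len
    return [
        (author, shuffled[i * doc_len:(i + 1) * doc_len])
        for i in range(n_docs)
    ]


def train_test_split(docs, test_fraction=0.2, rng=None):
    """Shuffle the pooled documents and return (train, test) for an 80/20 split."""
    if rng is None:
        rng = random.Random(0)  # assumption: fixed seed for reproducibility
    docs = list(docs)
    rng.shuffle(docs)
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]
```

Each author's documents would be built separately with make_documents and pooled before
calling train_test_split; repeating the split over five disjoint test portions is one way
to obtain the 5-fold cross validation mentioned in the text.
-->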