<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2053">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Towards the Orwellian Nightmare: Separation of Business and Personal Emails</Title>
  <Section position="4" start_page="407" end_page="407" type="intro">
    <SectionTitle>
2 Introduction to the Corpus
</SectionTitle>
    <Paragraph position="0"> Enron's email was made public on the Web by FERC (the Federal Energy Regulatory Commission) during a legal investigation of Enron Corporation. The release covers 92 percent of the staff's emails, because some messages were deleted &quot;as part of a redaction effort due to requests from affected employees&quot;. The dataset comprised 619,446 messages from 158 users in 3,500 folders. However, the raw dataset turned out to suffer from various data integrity problems, and several attempts were made to clean and prepare it for research purposes. The dataset used in this project is the March 2, 2004 version prepared at Carnegie Mellon University, acquired from http://www.cs.cmu.edu/~enron/. This version was reduced to 200,399 emails by removing some folders from each user. The removed folders, such as &quot;discussion threads&quot; and &quot;all documents&quot;, were machine generated and contained duplicate emails.</Paragraph>
    <Paragraph position="1"> There are on average 757 emails per user across the 158 users, but the count for an individual user ranges from one to 100,000. There are 30,091 threads, spanning 123,091 emails. The dataset does not include attachments. Invalid email addresses were replaced with &quot;user@enron.com&quot;. When no recipient was specified, the address was replaced with &quot;no_address@enron.com&quot; (Klimt and Yang, 2005).</Paragraph>
  </Section>
</Paper>