
<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0506">
  <Title>Stacking classifiers for anti-spam filtering of e-mail. Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis</Title>
  <Section position="2" start_page="0" end_page="321" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial email, or &amp;quot;spam&amp;quot;, floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.</Paragraph>
    <Paragraph position="1"> Introduction This paper presents an empirical evaluation of stacked generalization, a scheme for combining automatically induced classifiers, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization.</Paragraph>
    <Paragraph position="2"> The increasing popularity and low cost of e-mail have intrigued direct marketers to flood the mailboxes of thousands of users with unsolicited messages, advertising anything, from vacations to get-rich schemes. These messages, known as spam or more formally Unsolicited Commercial E-mail, are extremely annoying, as they clutter mailboxes, prolong dial-up connections, and often expose minors to unsuitable content (Cranor &amp; Lamacchia, 1998).</Paragraph>
    <Paragraph position="3"> Legal and simplistic technical countermeasures, like blacklists and keyword-based filters, have had a very limited effect so far.  The success of machine learning techniques in text categorization (Sebastiani, 2001) has recently led to alternative, learning-based approaches (Sahami, et al. 1998; Pantel &amp; Lin, 1998; Drucker, et al. 1999). A classifier capable of distinguishing between spam and non-spam, hereafter legitimate, messages is induced from a manually categorized learning collection of messages, and is then used to identify incoming spam e-mail. Initial results have been promising, and experiments are becoming more systematic, by exploiting recently introduced benchmark corpora, and cost-sensitive evaluation measures (Gomez Hidalgo, et al. 2000; Androutsopoulos, et al. 2000a, b, c).</Paragraph>
    <Paragraph position="4"> Stacked generalization (Wolpert, 1992), or stacking, is an approach for constructing classifier ensembles. A classifier ensemble, or committee, is a set of classifiers whose individual decisions are combined in some way to classify new instances (Dietterich, 1997).</Paragraph>
    <Paragraph position="5"> Stacking combines multiple classifiers to induce a higher-level classifier with improved performance. The latter can be thought of as the president of a committee with the ground-level classifiers as members. Each unseen incoming message is first given to the members; the president then decides on the category of the  Consult www.cauce.org, spam.abuse.net, and www.junkemail.org.</Paragraph>
    <Paragraph position="6"> message by considering the opinions of the members and the message itself. Ground-level classifiers often make different classification errors. Hence, a president that has successfully learned when to trust each of the members can improve overall performance.</Paragraph>
    <Paragraph position="7"> We have experimented with two ground-level classifiers for which results on a public benchmark corpus are available: a Naive Bayes classifier (Androutsopoulos, et al. 2000a, c) and a memory-based classifier (Androutsopoulos, et al. 2000b; Sakkis, et al. 2001). Using a third, memory-based classifier as president, we investigated two versions of stacking and two different cost-sensitive scenarios. Overall, our results indicate that stacking improves the performance of the ground-level classifiers, and that the performance of the resulting anti-spam filter is acceptable for real-life applications. Section 1 below presents the benchmark corpus and the preprocessing of the messages; section 2 introduces cost-sensitive evaluation measures; section 3 provides details on the stacking approaches that were explored; section 4 discusses the learning algorithms that were employed and the motivation for selecting them; section 5 presents our experimental results followed by conclusions.</Paragraph>
    <Paragraph position="8"> 1 Benchmark corpus and preprocessing Text categorization has benefited from public benchmark corpora. Producing such corpora for anti-spam filtering is not straightforward, since user mailboxes cannot be made public without considering privacy issues. A useful public approximation of a user's mailbox, however, can be constructed by mixing spam messages with messages extracted from spam-free public archives of mailing lists. The corpus that we used, Ling-Spam, follows this approach (Androutsopoulos, et al. 2000a, b; Sakkis, et al. 2001). It is a mixture of spam messages and messages sent via the Linguist, a moderated list about the science and profession of linguistics. The corpus consists of 2412 Linguist messages and 481 spam messages.</Paragraph>
    <Paragraph position="9"> Spam messages constitute 16.6% of Ling-Spam, close to the rates reported by Cranor and LaMacchia (1998), and Sahami et al. (1998).</Paragraph>
    <Paragraph position="10"> Although the Linguist messages are more topic-specific than most users' e-mail, they are less standardized than one might expect. For example, they contain job postings, software availability announcements and even flame-like responses. Moreover, recent experiments with an encoded user mailbox and a Naive Bayes (NB) classifier (Androutsopoulos, et al. 2000c) yielded results similar to those obtained with Ling-Spam (Androutsopoulos, et al. 2000a).</Paragraph>
    <Paragraph position="11"> Therefore, experimentation with Ling-Spam can provide useful indicative results, at least in a preliminary stage. Furthermore, experiments with Ling-Spam can be seen as studies of anti-spam filtering of open unmoderated lists.</Paragraph>
    <Paragraph position="12"> Each message of Ling-Spam was converted into a vector</Paragraph>
    <Paragraph position="14"/>
    <Paragraph position="16"> forms of the same word as different attributes, a lemmatizer was applied, converting each word to its base form.</Paragraph>
    <Paragraph position="17"> To reduce the dimensionality, attribute selection was performed. First, words occurring in less than 4 messages were discarded. Then, the Information Gain (IG) of each candidate  The attributes with the m highest IG-scores were selected, with m corresponding to the best configurations of the ground classifiers that have been reported for Ling-Spam (Androutsopoulos, et al. 2000a; Sakkis, et al. 2001); see Section 4.</Paragraph>
  </Section>
class="xml-element"></Paper>