<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1027">
  <Title>Virtual Examples for Text Classification with Support Vector Machines</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Corpus-based supervised learning is now a standard approach to achieve high-performance in natural language processing. However, the weakness of supervised learning approach is to need an annotated corpus, the size of which is reasonably large.</Paragraph>
    <Paragraph position="1"> Even if we have a good supervised-learning method, we cannot get high-performance without an annotated corpus. The problem is that corpus annotation is labor intensive and very expensive. In order to overcome this, several methods are proposed, including minimally-supervised learning methods (e.g., (Yarowsky, 1995; Blum and Mitchell, 1998)), and active learning methods (e.g., (Thompson et al., 1999; Sassano, 2002)). The spirit behind these methods is to utilize precious labeled examples maximally. null Another method following the same spirit is one using virtual examples (artificially created examples) generated from labeled examples. This method has been rarely discussed in natural language processing. In terms of active learning, Lewis and Gale (1994) mentioned the use of virtual examples in text classification. They did not, however, take forward this approach because it did not seem to be possible that a classifier created virtual examples of documents in natural language and then requested a human teacher to label them.</Paragraph>
    <Paragraph position="2"> In the field of pattern recognition, some kind of virtual examples has been studied. The first report of methods using virtual examples with Support Vector Machines (SVMs) is that of Sch&amp;quot;olkopf et al. (1996), who demonstrated significant improvement of the accuracy in hand-written digit recognition (Section 3). They created virtual examples from labeled examples based on prior knowledge of the task: slightly translated (e.g., 1 pixel shifted to the right) images have the same label (class) of the original image. Niyogi et al. (1998) also discussed the use of prior knowledge by creating virtual examples and thereby expanding the effective training set size.</Paragraph>
    <Paragraph position="3"> The purpose of this study is to explore the effectiveness of virtual examples in NLP, motivated by the results of Sch&amp;quot;olkopf et al. (1996). To our knowledge, use of virtual examples in corpus-based NLP has never been studied so far. It is, however, important to investigate this approach by which it is expected that we can alleviate the cost of corpus annotation. In particular, we focus on virtual examples with Support Vector Machines, introduced by Vapnik (1995). The reason for this is that SVM is one of most successful machine learning methods in NLP.</Paragraph>
    <Paragraph position="4"> For example, NL tasks to which SVMs have been applied are text classification (Joachims, 1998; Dumais et al., 1998), chunking (Kudo and Matsumoto, 2001), dependency analysis (Kudo and Matsumoto, 2002) and so forth.</Paragraph>
    <Paragraph position="5"> In this study, we choose text classification as a first case of the study of virtual examples in NLP because text classification in real world requires minimizing annotation cost, and it is not too complicated to perform some non-trivial experiments. Moreover, there are simple methods, which we propose, to generate virtual examples from labeled examples (Section 4). We show how virtual examples can improve the performance of a classifier with SVM in text classification, especially for small training sets.</Paragraph>
  </Section>
class="xml-element"></Paper>