Virtual Examples for Text Classification with Support Vector Machines
Manabu Sassano
Fujitsu Laboratories Ltd.
4-1-1, Kamikodanaka, Nakahara-ku,
Kawasaki 211-8588, Japan
sassano@jp.fujitsu.com
Abstract
We explore how virtual examples (artifi-
cially created examples) improve perfor-
mance of text classification with Support
Vector Machines (SVMs). We propose
techniques to create virtual examples for
text classification based on the assump-
tion that the category of a document is un-
changed even if a small number of words
are added or deleted. We evaluate the proposed
methods on the Reuters-21578 test set
collection. Experimental results show that virtual
examples improve the performance of
text classification with SVMs, especially
for small training sets.
1 Introduction
Corpus-based supervised learning is now a standard approach to achieving high performance in natural language processing. However, the weakness of the supervised learning approach is that it needs an annotated corpus of reasonably large size. Even if we have a good supervised-learning method, we cannot get high performance without an annotated corpus. The problem is that corpus annotation is labor intensive and very expensive. In order to overcome this, several methods have been proposed, including minimally-supervised learning methods (e.g., (Yarowsky, 1995; Blum and Mitchell, 1998)), and active learning methods (e.g., (Thompson et al., 1999; Sassano, 2002)). The spirit behind these methods is to make maximal use of precious labeled examples.
Another method following the same spirit is one using virtual examples (artificially created examples) generated from labeled examples. This method has rarely been discussed in natural language processing. In the context of active learning, Lewis and Gale (1994) mentioned the use of virtual examples in text classification. They did not, however, pursue this approach because it did not seem feasible for a classifier to create virtual examples of documents in natural language and then ask a human teacher to label them.
In the field of pattern recognition, some kinds of virtual examples have been studied. The first report of methods using virtual examples with Support Vector Machines (SVMs) is that of Schölkopf et al. (1996), who demonstrated a significant improvement in accuracy in hand-written digit recognition (Section 3). They created virtual examples from labeled examples based on prior knowledge of the task: slightly translated (e.g., 1 pixel shifted to the right) images have the same label (class) as the original image. Niyogi et al. (1998) also discussed the
use of prior knowledge by creating virtual examples
and thereby expanding the effective training set size.
The purpose of this study is to explore the effectiveness of virtual examples in NLP, motivated by the results of Schölkopf et al. (1996). To our knowledge, the use of virtual examples in corpus-based NLP has not been studied so far. It is, however, important to investigate this approach, since it is expected to alleviate the cost of corpus annotation. In particular, we focus on virtual examples with Support Vector Machines, introduced by Vapnik (1995). The reason for this is that the SVM is one of the most successful machine learning methods in NLP. For example, NLP tasks to which SVMs have been applied include text classification (Joachims, 1998; Dumais et al., 1998), chunking (Kudo and Matsumoto, 2001), and dependency analysis (Kudo and Matsumoto, 2002).
In this study, we choose text classification as a first case study of virtual examples in NLP because text classification in the real world requires minimizing annotation cost, and it is not too complicated to perform some non-trivial experiments. Moreover,
there are simple methods, which we propose, to gen-
erate virtual examples from labeled examples (Sec-
tion 4). We show how virtual examples can improve
the performance of a classifier with SVM in text
classification, especially for small training sets.
2 Support Vector Machines
In this section we give some theoretical definitions
of SVMs. Assume that we are given the training data
$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l), \quad \mathbf{x}_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\}.$$
The decision function $g$ in the SVM framework is defined as:
$$g(\mathbf{x}) = \mathrm{sgn}(f(\mathbf{x})) \qquad (1)$$
$$f(\mathbf{x}) = \sum_{i=1}^{l} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b \qquad (2)$$
where $K$ is a kernel function, $b \in \mathbb{R}$ is a threshold, and $\alpha_i$ are weights. Besides, the weights $\alpha_i$ satisfy the following constraints:
$$\forall i: \; 0 \le \alpha_i \le C \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,$$
where $C$ is a misclassification cost. The vectors $\mathbf{x}_i$ with non-zero $\alpha_i$ are called Support Vectors. For linear SVMs, the kernel function $K$ is defined as:
$$K(\mathbf{x}_i, \mathbf{x}) = \mathbf{x}_i \cdot \mathbf{x}.$$
In this case, Equation 2 can be written as:
$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \qquad (3)$$
where $\mathbf{w} = \sum_{i=1}^{l} y_i \alpha_i \mathbf{x}_i$. To train an SVM is to find $\alpha_i$ and $b$ by solving the following optimization problem:
$$\text{maximize} \quad \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{subject to} \quad \forall i: \; 0 \le \alpha_i \le C \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0.$$
The solution gives an optimal hyperplane, which is a decision boundary between the two classes. Figure 1 illustrates an optimal hyperplane and its support vectors.
Figure 1: Hyperplane (solid) and Support Vectors [legend: negative example, positive example, support vector]
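As a small illustration of Equations 1 to 3, the following Python sketch evaluates the decision function for a linear kernel in both the dual form (Equation 2) and the primal form (Equation 3). The support vectors, weights, and threshold are made-up toy values, not the output of a trained SVM, and mapping a tie at zero to +1 is an assumption of this sketch.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Linear kernel K(u, v) = u . v."""
    return sum(a * b for a, b in zip(u, v))


def f_dual(x: Vector, xs: Sequence[Vector], ys: Sequence[int],
           alphas: Sequence[float], b: float) -> float:
    """Equation 2: f(x) = sum_i y_i * alpha_i * K(x_i, x) + b."""
    return sum(y * a * dot(xi, x) for xi, y, a in zip(xs, ys, alphas)) + b


def weight_vector(xs: Sequence[Vector], ys: Sequence[int],
                  alphas: Sequence[float]) -> List[float]:
    """w = sum_i y_i * alpha_i * x_i (valid for the linear kernel only)."""
    n = len(xs[0])
    return [sum(y * a * xi[k] for xi, y, a in zip(xs, ys, alphas))
            for k in range(n)]


def f_primal(x: Vector, w: Vector, b: float) -> float:
    """Equation 3: f(x) = w . x + b."""
    return dot(w, x) + b


def g(fx: float) -> int:
    """Equation 1: g(x) = sgn(f(x)); a tie at 0 is mapped to +1 (assumption)."""
    return 1 if fx >= 0 else -1


# Toy values for illustration only (hypothetical, not a trained model).
xs = [(1.0, 0.0), (0.0, 1.0)]
ys = [+1, -1]
alphas = [0.5, 0.5]  # satisfies sum_i alpha_i * y_i = 0
b = 0.0
w = weight_vector(xs, ys, alphas)
```

For any input, `f_dual` and `f_primal` should agree up to floating-point error, which is the point of Equation 3: with a linear kernel the classifier can be stored as a single weight vector instead of a list of support vectors.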
3 Virtual Examples and Virtual Support
Vectors
Virtual examples are generated from labeled examples.¹ Based on prior knowledge of a target task, the label of a generated example is set to the same value as that of the original example.
¹ We discuss here only virtual examples which are generated from labeled examples. We do not consider examples, the labels of which are not known.
For example, in hand-written digit recognition, virtual examples can be created on the assumption that the label of an example is unchanged even if the example is shifted by one pixel in the four principal directions (Schölkopf et al., 1996; DeCoste and Schölkopf, 2002).
Figure 2: Hyperplane and Virtual Examples [label in figure: Virtual Examples]
Virtual examples that are generated from support vectors are called virtual support vectors (Schölkopf et al., 1996). Reasonable virtual support vectors are
expected to give a better optimal hyperplane. As-
suming that virtual support vectors represent natu-
ral variations of examples of a target task, the de-
cision boundary should be more accurate. Figure 2
illustrates the idea of virtual support vectors. Note
that after virtual support vectors are given, the hy-
perplane is different from that in Figure 1.
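The one-pixel-shift idea from digit recognition can be sketched in a few lines of Python. This is only an illustration of the general principle, assuming an image stored as a list of pixel rows and background pixels filled with 0; the function names `shift` and `virtual_examples` are hypothetical and do not come from the cited work.

```python
from typing import List, Tuple

Image = List[List[int]]


def shift(image: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Return a copy of `image` translated by (dx, dy) pixels.

    Pixels shifted outside the frame are dropped; vacated pixels get `fill`.
    """
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = image[r][c]
    return out


def virtual_examples(image: Image, label: int) -> List[Tuple[Image, int]]:
    """One-pixel shifts in the four principal directions, each keeping the label."""
    directions = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, down, up
    return [(shift(image, dx, dy), label) for dx, dy in directions]
```

Applying `virtual_examples` only to the support vectors of a trained classifier, rather than to every training image, would give virtual support vectors in the sense described above.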
4 Virtual Examples for Text Classification
For text classification, we assume the following:
Assumption 1 The category of a document is un-
changed even if a small number of words are added
or deleted.
This assumption is reasonable. In typical cases of
text classification most of the documents usually
contain two or more keywords which may indicate
the categories of the documents.
Following Assumption 1, we propose two meth-
ods to create virtual examples for text classification.
One method is to delete some portion of a document.
The label of a virtual example is given from the orig-
inal document. The other method is to add a small
number of words to a document. The words to be
added are taken from documents, the label of which
is the same as that of the document. Although one
can invent various methods to create virtual exam-
ples based on Assumption 1, we propose here very
simple ones.
Document Id   Feature Vector ($\mathbf{x}$)          Label ($y$)
1             $(f_1, f_2, f_3, f_4, f_5)$            $+1$
2             $(f_2, f_4, f_5, f_6)$                 $+1$
3             $(f_2, f_3, f_5, f_6, f_7)$            $+1$
4             $(f_1, f_3, f_8, f_9, f_{10})$         $-1$
5             $(f_1, f_8, f_{10}, f_{11})$           $-1$
Table 1: Example of Document Set
Before describing our methods, we describe the text representation which we used in this study. We tokenize a document into words, downcase them and then remove stopwords, where the stopword list of freeWAIS-sf² is used. Stemming is not performed. We adopt binary feature vectors where word frequency is not used.
Now we describe the two proposed methods: GenerateByDeletion and GenerateByAddition. Assume that we are given a feature vector (a document) $\mathbf{x}$ and that $\mathbf{x}'$ is a vector generated from $\mathbf{x}$. The GenerateByDeletion algorithm is:
1. Copy $\mathbf{x}$ to $\mathbf{x}'$.
2. For each binary feature $f$ of $\mathbf{x}'$, if $\mathrm{rand}() \le t$ then remove the feature $f$, where $\mathrm{rand}()$ is a function which generates a random number from 0 to 1, and $t$ is a parameter to decide how many features are deleted.
For example, suppose that we have a set of documents as in Table 1. Some possible virtual examples generated from Document 1 by the GenerateByDeletion algorithm are $(f_2, f_3, f_4, f_5, +1)$, $(f_1, f_3, f_4, +1)$, or $(f_1, f_2, f_4, f_5, +1)$.
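The two steps of GenerateByDeletion can be sketched in Python as follows. This is a minimal sketch, assuming a binary feature vector is stored as a set of feature names; the function name `generate_by_deletion` and the optional `rng` argument (added so that runs can be made repeatable) are choices of this illustration, not part of the original description.

```python
import random
from typing import FrozenSet, Optional


def generate_by_deletion(x: FrozenSet[str], t: float,
                         rng: Optional[random.Random] = None) -> FrozenSet[str]:
    """Copy x to x', then remove each feature f of x' whenever rand() <= t."""
    rng = rng or random.Random()
    # Features are visited in sorted order so a seeded rng gives repeatable output.
    return frozenset(f for f in sorted(x) if not (rng.random() <= t))


# Document 1 from Table 1; the virtual example keeps the original label +1.
doc1 = frozenset({"f1", "f2", "f3", "f4", "f5"})
virtual_example = (generate_by_deletion(doc1, t=0.05), +1)
```

With a small `t`, most calls return the document unchanged or with a single feature removed, so it can take several draws to obtain a virtual example that actually differs from its source.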
On the other hand, the GenerateByAddition algorithm is:
1. Collect from a training set documents, the label of which is the same as that of $\mathbf{x}$.
2. Concatenate all the feature vectors (documents) to create an array $a$ of features. Each element of $a$ is a feature which represents a word.
3. Copy $\mathbf{x}$ to $\mathbf{x}'$.
4. For each binary feature $f$ of $\mathbf{x}'$, if $\mathrm{rand}() \le t$ then select a feature randomly from $a$ and put it to $\mathbf{x}'$.
² Available at http://ls6-www.informatik.uni-dortmund.de/ir/projects/freeWAIS-sf/
Category Name   Training   Test
earn            2877       1087
acq             1650       719
money-fx        538        179
grain           433        149
crude           389        189
trade           369        117
interest        347        131
ship            197        89
wheat           212        71
corn            181        56
Table 2: Number of Training and Test Examples
For example, when we want to generate a virtual example from Document 2 in Table 1 by the GenerateByAddition algorithm, first we create an array $a = (f_1, f_2, f_3, f_4, f_5, f_2, f_4, f_5, f_6, f_2, f_3, f_5, f_6, f_7)$. In this case, some possible virtual examples by GenerateByAddition are $(f_1, f_2, f_4, f_5, f_6, +1)$, $(f_2, f_3, f_4, f_5, f_6, +1)$, or $(f_2, f_4, f_5, f_6, f_7, +1)$. An example such as $(f_2, f_4, f_5, f_6, f_{10}, +1)$ is never generated from Document 2 because there are no positive documents which have $f_{10}$.
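The four steps of GenerateByAddition can be sketched in Python in the same style. This is a sketch under assumptions: documents are stored as sets of feature names, the loop in step 4 makes one random draw per feature of the original document (one reading of "for each binary feature"), and the names `generate_by_addition`, `Example`, and `table1` are hypothetical.

```python
import random
from typing import FrozenSet, List, Optional, Sequence, Tuple

Example = Tuple[FrozenSet[str], int]  # (binary features, label)


def generate_by_addition(x: FrozenSet[str], label: int,
                         training_set: Sequence[Example], t: float,
                         rng: Optional[random.Random] = None) -> FrozenSet[str]:
    """Add to a copy of x features drawn from documents with the same label."""
    rng = rng or random.Random()
    # Steps 1-2: concatenate the features of all same-label documents into a.
    # Repeated features are kept, so frequent words are drawn more often.
    a: List[str] = [f for feats, y in training_set if y == label
                    for f in sorted(feats)]
    # Step 3: copy x to x'.
    x_new = set(x)
    # Step 4: one draw per feature of the original x (assumption of this sketch).
    for _ in sorted(x):
        if a and rng.random() <= t:
            x_new.add(rng.choice(a))
    return frozenset(x_new)


# The document set of Table 1.
table1: List[Example] = [
    (frozenset({"f1", "f2", "f3", "f4", "f5"}), +1),
    (frozenset({"f2", "f4", "f5", "f6"}), +1),
    (frozenset({"f2", "f3", "f5", "f6", "f7"}), +1),
    (frozenset({"f1", "f3", "f8", "f9", "f10"}), -1),
    (frozenset({"f1", "f8", "f10", "f11"}), -1),
]
doc2, label2 = table1[1]
virtual_example = (generate_by_addition(doc2, label2, table1, t=0.05), label2)
```

Because the array is built only from documents sharing the label of Document 2, a feature such as f10 can never be added here, matching the example in the text.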
5 Experimental Results and Discussion
5.1 Test Set Collection
We used the Reuters-21578 dataset³ to evaluate the proposed methods. The dataset has several splits of a training set and a test set. We used here the “ModApte” split, which is the most widely used in the literature on text classification. This split has 9,603 training examples and 3,299 test examples. More than 100 categories are in the dataset. We use, however, only the 10 most frequent categories. Table 2 shows the 10 categories and the number of training and test examples in each of the categories.
³ Available from David D. Lewis’s page: http://www.daviddlewis.com/resources/testcollections/reuters21578/
5.2 Performance Measures
We use F-measure (van Rijsbergen, 1979; Lewis and Gale, 1994) as the primary performance measure to evaluate the result. F-measure is defined as:
$$\text{F-measure} = \frac{(1 + \beta^2)\, p\, q}{\beta^2 p + q} \qquad (4)$$
where $p$ is precision, $q$ is recall, and $\beta$ is a parameter which decides the relative weight of precision and recall. Precision $p$ and recall $q$ are defined as:
$$p = \frac{\text{number of positive and correct outputs}}{\text{number of positive outputs}}$$
$$q = \frac{\text{number of positive and correct outputs}}{\text{number of positive examples}}$$
In Equation 4, usually $\beta = 1$ is used, which means it gives equal weight to precision and recall.
When we evaluate the performance of a classifier on a multi-category dataset, there are two ways to compute F-measure: macro-averaging and micro-averaging (Yang, 1999). The former first computes the F-measure for each category and then averages them, while the latter first computes precision and recall over all the categories pooled together and uses them to calculate the F-measure.
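The difference between the two averaging schemes can be made concrete with a short Python sketch of Equation 4. The per-category counts below are invented for illustration and are not results from the experiments. Note also one assumption: this sketch reports an undefined precision or recall (zero denominator) as 0, whereas the paper treats the F-measure as not computable in that case.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """p = tp / (tp + fp), q = tp / (tp + fn)."""
    # Assumption of this sketch: an undefined ratio is reported as 0.0.
    p = tp / (tp + fp) if tp + fp else 0.0
    q = tp / (tp + fn) if tp + fn else 0.0
    return p, q


def f_measure(p: float, q: float, beta: float = 1.0) -> float:
    """Equation 4: (1 + beta^2) p q / (beta^2 p + q)."""
    denom = beta ** 2 * p + q
    return (1 + beta ** 2) * p * q / denom if denom else 0.0


def macro_f(counts: Dict[str, Counts], beta: float = 1.0) -> float:
    """Compute the F-measure per category, then average the scores."""
    scores = [f_measure(*precision_recall(*c), beta=beta) for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f(counts: Dict[str, Counts], beta: float = 1.0) -> float:
    """Pool the counts over all categories, then compute one F-measure."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(*precision_recall(tp, fp, fn), beta=beta)


# Hypothetical counts: one frequent, easy category and one rare, hard one.
toy_counts = {"earn": (90, 10, 10), "corn": (1, 0, 9)}
```

On these toy counts the micro-average is dominated by the frequent category, while the macro-average is pulled down by the rare one, which is why low-frequency categories matter much more for the macro-averaged score.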
5.3 SVM setting
Through our experiments we used our original SVM tools, the algorithm of which is based on SMO (Sequential Minimal Optimization) by Platt (1999). We used linear SVMs and set the misclassification cost $C$ to 0.016541, which is $1/(\text{the average of } \mathbf{x} \cdot \mathbf{x})$, where $\mathbf{x}$ is a feature vector in the 9,603 size training set. For simplicity, we fixed $C$ through all the experiments. We built a binary classifier for each of the 10 categories shown in Table 2.
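The choice of the misclassification cost can be sketched as follows. For binary feature vectors, the dot product of a vector with itself is simply the number of features present, so the rule reduces to one over the average number of distinct features per document. The function name `default_cost` and the set-based representation are assumptions of this sketch; the toy input is the document set of Table 1, not the Reuters data.

```python
from typing import FrozenSet, Sequence


def default_cost(training_vectors: Sequence[FrozenSet[str]]) -> float:
    """C = 1 / (average of x . x) over the training set.

    For a binary vector stored as a set of features, x . x == len(x).
    """
    average = sum(len(x) for x in training_vectors) / len(training_vectors)
    return 1.0 / average


# Feature sets of the five documents in Table 1 (sizes 5, 4, 5, 5, 4).
toy_vectors = [
    frozenset({"f1", "f2", "f3", "f4", "f5"}),
    frozenset({"f2", "f4", "f5", "f6"}),
    frozenset({"f2", "f3", "f5", "f6", "f7"}),
    frozenset({"f1", "f3", "f8", "f9", "f10"}),
    frozenset({"f1", "f8", "f10", "f11"}),
]
toy_cost = default_cost(toy_vectors)  # 1 / 4.6
```

Read in the other direction, the reported value of 0.016541 would correspond to an average of roughly 60.5 distinct non-stopword features per training document, since 1/0.016541 is about 60.46.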
5.4 Results
First, we carried out experiments using GenerateByDeletion and GenerateByAddition separately to create virtual examples, where a virtual example was created per Support Vector. We did not generate virtual examples from non support vectors. We set the parameter $t$ to 0.05 for GenerateByDeletion and GenerateByAddition for all the experiments.⁴
⁴ We first tried $t = 0.01$, $0.05$, and $0.10$ with GenerateByDeletion using the 9603 size training set. The value $t = 0.05$ yielded best micro-average F-measure for the test set. We used the same value also for GenerateByAddition.
To build an SVM with virtual examples we use the following steps:
1. Train an SVM.
2. Extract Support Vectors.
3. Generate virtual examples from the Support Vectors.
4. Train another SVM using both the original labeled examples and the virtual examples.
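The four steps above can be sketched as a small Python driver. Since the SVM tools used in the paper are not described at the code level, the trainer is passed in as a callable; its interface (returning a model together with the indices of the support vectors) and the generator signature are hypothetical choices of this sketch.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Example = Tuple[FrozenSet[str], int]  # (binary features, label)
# Hypothetical trainer interface: returns (model, indices of the support vectors).
Trainer = Callable[[Sequence[Example]], Tuple[object, Sequence[int]]]
# Hypothetical generator interface: (support vector, training set) -> virtual example.
Generator = Callable[[Example, Sequence[Example]], Example]


def train_with_virtual_svs(train: Trainer, examples: Sequence[Example],
                           generators: Sequence[Generator]
                           ) -> Tuple[object, List[Example]]:
    """Train, extract support vectors, generate virtual examples, retrain."""
    # Step 1: train an SVM on the labeled examples.
    _, sv_indices = train(examples)
    # Step 2: extract the support vectors.
    support_vectors = [examples[i] for i in sv_indices]
    # Step 3: one virtual example per support vector per generator.
    virtual = [gen(sv, examples) for sv in support_vectors for gen in generators]
    # Step 4: train another SVM on the original plus the virtual examples.
    model, _ = train(list(examples) + virtual)
    return model, virtual
```

Passing one generator gives one virtual example per Support Vector. Passing a deletion-based and an addition-based generator together gives two per Support Vector, and listing each of them twice gives four, assuming each call makes fresh random draws.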
We evaluated the performance of the two methods depending on the size of a training set. We created subsamples by selecting randomly from the 9603 size training set. We prepared seven sizes: 9603, 4802, 2401, 1200, 600, 300, and 150.⁵ Micro-average F-measures of the two methods are shown in Table 3. We see from Table 3 that both methods give better performance than the original SVM. The smaller the number of examples in the training set is, the larger the gain is. For the 9603 size training set, the gain of GenerateByDeletion is 0.75 (= 90.17 − 89.42), while for the 150 size set, the gain is 6.88 (= 60.16 − 53.28). These results suggest that the smaller training sets do not contain enough variety of examples to give an accurate decision boundary, and therefore the effect of virtual examples is larger for the smaller training sets. It is reasonable to conclude that GenerateByDeletion and GenerateByAddition generated good virtual examples for the task and this led to the performance gain.
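The gain at every training set size can be recomputed from the first two rows of Table 3 with a few lines of Python. The numbers are copied from Table 3; only the variable names are introduced here.

```python
# Micro-average F-measures from Table 3.
sizes = [9603, 4802, 2401, 1200, 600, 300, 150]
original_svm = [89.42, 86.58, 81.69, 77.24, 71.08, 64.44, 53.28]
by_deletion = [90.17, 88.62, 84.45, 81.11, 75.32, 70.11, 60.16]

# Gain of GenerateByDeletion over the original SVM, per training set size.
gains = {n: round(d - o, 2) for n, o, d in zip(sizes, original_svm, by_deletion)}

for n in sizes:
    print(f"{n:>5} examples: gain {gains[n]:.2f}")
```

The gains grow at every step as the training set shrinks, from 0.75 at 9603 examples to 6.88 at 150 examples.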
After we found that the two simple methods to generate virtual support vectors were effective, we examined a combined method which uses both GenerateByDeletion and GenerateByAddition. Two virtual examples are generated per Support Vector. The performance of the combined method is also shown in Table 3. The performance gain of the combined method is larger than that with either GenerateByDeletion or GenerateByAddition.
Furthermore, we carried out another experiment with a combined method to create two virtual examples with GenerateByDeletion and GenerateByAddition respectively. That is, four virtual examples were generated from a Support Vector. The performance of that setting is shown in Table 3. The best result is achieved by the combined method to create four virtual examples per Support Vector.
⁵ Since we selected samples randomly, some smaller training sets of low frequent categories may have had few or even zero positive examples.
Figure 3: Micro-Average F-Measure versus Number of Examples in the Training Set [x-axis: Number of Examples in Training Set, 100 to 10000; y-axis: Micro-average F-measure (beta = 1); curves: SVM + 4 Virtual SVs Per SV, SVM]
Figure 4: Macro-Average F-Measure versus Number of Examples in the Training Set. For the smaller training sets F-measures cannot be computed because the precisions are undefined. [x-axis: Number of Examples in Training Set, 100 to 10000; y-axis: Macro-average F-measure (beta = 1); curves: SVM + 4 Virtual SVs per SV, SVM]
For the rest of this section, we limit our discussion
to the comparison of the results of the original SVM
and SVM with four virtual examples per SV (SVM
with 4 VSVs). The learning curves of the original
SVM and SVM with 4 VSVs are shown in Figures 3
and 4. It is clear that SVM with 4 VSVs outper-
forms the original SVM considerably in terms of
both micro-average F-measure and macro-average
F-measure. SVM with 4 VSVs achieves a given
level of performance with roughly half of the labeled
examples which the original SVM requires. One might suppose that the improvement of F-measure is realized simply because the recall gets highly improved while the error rate increases. We plot changes of the error rate for 32990 tests (3299 tests for each of the 10 categories) in Figure 5. SVM with 4 VSVs still outperforms the original SVM significantly.⁶
                                           Number of Examples in Training Set
Method                                     9603   4802   2401   1200   600    300    150
Original SVM                               89.42  86.58  81.69  77.24  71.08  64.44  53.28
SVM + 1 VSV per SV (GenerateByDeletion)    90.17  88.62  84.45  81.11  75.32  70.11  60.16
SVM + 1 VSV per SV (GenerateByAddition)    90.00  88.51  84.48  81.14  75.33  69.59  60.04
SVM + 2 VSVs per SV (Combined)             90.27  89.33  86.27  83.59  77.44  72.81  64.22
SVM + 4 VSVs per SV (Combined)             90.45  89.69  87.12  84.97  79.16  73.25  65.05
Table 3: Comparison of Micro-Average F-measure of Different Methods. “VSV” means virtual SV.
Figure 5: Error Rate versus Number of Examples in the Training Set [x-axis: Number of Examples in Training Set, 100 to 10000; y-axis: Error Rate (%); curves: SVM + 4 Virtual SVs per SV, SVM]
⁶ We have done the significance test which is called “p-test” in (Yang and Liu, 1999), requiring significance at the 0.05 level. Although at the 9603 size training set the improvement of the error rate is not statistically significant, in all the other cases the improvement is significant.
The performance changes for each of the 10 categories are shown in Tables 4 and 5. SVM with 4 VSVs is better than the original SVM for almost all the categories and all the sizes except for “interest” and “wheat” at the 9603 size training set. For low-frequency categories such as “ship”, “wheat” and “corn”, the classifiers of the original SVM perform poorly. There are many cases where they never output ‘positive’, i.e. the recall is zero. This suggests that the original SVM fails to find a good hyperplane due to the imbalanced training sets which have very few positive examples. In contrast, SVM with 4 VSVs yields better results for such harder cases.
6 Conclusion and Future Directions
We have explored how virtual examples improve the
performance of text classification with SVMs. For
text classification, we have proposed methods to cre-
ate virtual examples on the assumption that the label
of a document is unchanged even if a small num-
ber of words are added or deleted. The experimen-
tal results have shown that our proposed methods
improve the performance of text classification with
SVMs, especially for small training sets. Although
the proposed methods are not readily applicable to
NLP tasks other than text classification, it is notable
that the use of virtual examples, which has been very
little studied in NLP, is empirically evaluated.
In the future, it would be interesting to employ
virtual examples with methods to use both labeled
and unlabeled examples (e.g., (Blum and Mitchell,
1998; Nigam et al., 1998; Joachims, 1999)). The
combined approach may yield better results with a
small number of labeled examples. Another interesting direction would be to develop methods to create virtual examples for other NLP tasks (e.g., named entity recognition, POS tagging, and parsing).
We believe we can use prior knowledge on these
tasks to create effective virtual examples.

References
Avrim Blum and Tom Mitchell. 1998. Combining la-
beled and unlabeled data with co-training. In Proceed-
ings of the 11th COLT, pages 92–100.
Dennis DeCoste and Bernhard Schölkopf. 2002. Training invariant support vector machines. Machine Learning, 46:161–190.
Susan Dumais, John Platt, David Heckerman, and
Mehran Sahami. 1998. Inductive learning algorithms
and representations for text categorization. In Pro-
ceedings of the ACM CIKM International Conference
on Information and Knowledge Management, pages
148–155.
Thorsten Joachims. 1998. Text categorization with sup-
port vector machines: Learning with many relevant
features. In Proceedings of the European Conference
on Machine Learning, pages 137–142.
Thorsten Joachims. 1999. Transductive inference for
text classification using support vector machines. In
Proceedings of the 16th International Conference on
Machine Learning, pages 200–209.
Taku Kudo and Yuji Matsumoto. 2001. Chunking with
support vector machines. In Proceedings of NAACL
2001, pages 192–199.
Taku Kudo and Yuji Matsumoto. 2002. Japanese depen-
dency analysis using cascaded chunking. In Proceed-
ings of CoNLL-2002, pages 63–69.
David D. Lewis and William A. Gale. 1994. A sequential
algorithm for training text classifiers. In Proceedings
of the Seventeenth Annual International ACM-SIGIR
Conference on Research and Development in Informa-
tion Retrieval, pages 3–12.
Kamal Nigam, Andrew McCallum, Sebastian Thrun, and
Tom Mitchell. 1998. Learning to classify text from
labeled and unlabeled documents. In Proceedings of
the Fifteenth National Conference on Artificial Intelli-
gence (AAAI-98), pages 792–799.
Partha Niyogi, Federico Girosi, and Tomaso Poggio.
1998. Incorporating prior information in machine
learning by creating virtual examples. In Proceedings
of IEEE, volume 86, pages 2196–2207.
John C. Platt. 1999. Fast training of support vec-
tor machines using sequential minimal optimization.
In Bernhard Schölkopf, Christopher J.C. Burges, and
Alexander J. Smola, editors, Advances in Kernel Meth-
ods: Support Vector Learning, pages 185–208. MIT
Press.
Manabu Sassano. 2002. An empirical study of active
learning with support vector machines for Japanese
word segmentation. In Proceedings of ACL-2002,
pages 505–512.
Bernhard Schölkopf, Chris Burges, and Vladimir Vapnik. 1996. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J.C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks – ICANN’96, Springer Lecture Notes in Computer Science, Vol. 1112, pages 47–52.
Cynthia A. Thompson, Mary Elaine Califf, and Raymond
J. Mooney. 1999. Active learning for natural
language parsing and information extraction. In Pro-
ceedings of the Sixteenth International Conference on
Machine Learning, pages 406–414.
C.J. van Rijsbergen. 1979. Information Retrieval. But-
terworths, 2nd edition.
Vladimir N. Vapnik. 1995. The Nature of Statistical
Learning Theory. Springer-Verlag.
Yiming Yang and Xin Liu. 1999. A re-examination of
text categorization methods. In Proceedings of SIGIR-99,
22nd ACM International Conference on Research
and Development in Information Retrieval, pages 42–
49.
Yiming Yang. 1999. An evaluation of statistical ap-
proaches to text categorization. Journal of Informa-
tion Retrieval, 1(1/2):67–88.
David Yarowsky. 1995. Unsupervised word sense dis-
ambiguation rivaling supervised methods. In Proceed-
ings of ACL-1995, pages 189–196.
