Learning with Unlabeled Data for Text Categorization Using Bootstrapping  
and Feature Projection Techniques 
Youngjoong Ko 
Dept. of Computer Science, Sogang Univ. 
Sinsu-dong 1, Mapo-gu 
Seoul, 121-742, Korea 
kyj@nlpzodiac.sogang.ac.kr 
Jungyun Seo 
Dept. of Computer Science, Sogang Univ. 
Sinsu-dong 1, Mapo-gu 
Seoul, 121-742, Korea 
    seojy@ccs.sogang.ac.kr 
 
Abstract 
A wide range of supervised learning algorithms has been applied to text categorization. However, these approaches share a problem: they require a large, often prohibitive, number of labeled training documents for accurate learning. Generally, acquiring class labels for training data is costly, while gathering a large quantity of unlabeled data is cheap. We therefore propose a new automatic text categorization method that learns from unlabeled data alone, using a bootstrapping framework and a feature projection technique. In our experiments, the proposed method achieved performance reasonably comparable to that of a supervised method. If our method is used in a text categorization task, building text categorization systems becomes significantly faster and less expensive.
1 Introduction 
Text categorization is the task of classifying 
documents into a certain number of pre-defined 
categories. Many supervised learning algorithms 
have been applied to this area. These algorithms 
today are reasonably successful when provided 
with enough labeled or annotated training 
examples. Representative algorithms include Naive Bayes (McCallum and Nigam, 1998), Rocchio (Lewis et al., 1996), k-Nearest Neighbors (kNN) (Yang et al., 2002), TCFP (Ko and Seo, 2002), and Support Vector Machines (SVM) (Joachims, 1998).
However, the supervised learning approach has 
some difficulties. One key difficulty is that it 
requires a large, often prohibitive, amount of labeled training data for accurate learning. Since a
labeling task must be done manually, it is a 
painfully time-consuming process. Furthermore, 
since the application area of text categorization has 
diversified from newswire articles and web pages 
to E-mails and newsgroup postings, it is also a 
difficult task to create training data for each 
application area (Nigam et al., 1998). In this light, 
we consider learning algorithms that do not require 
such a large amount of labeled data. 
While labeled data are difficult to obtain, 
unlabeled data are readily available and plentiful. 
Therefore, this paper advocates using a 
bootstrapping framework and a feature projection 
technique with just unlabeled data for text 
categorization. The input to the bootstrapping 
process is a large amount of unlabeled data and a 
small amount of seed information to tell the learner 
about the specific task. In this paper, we consider 
seed information in the form of title words 
associated with categories. In general, since 
unlabeled data are much less expensive and easier 
to collect than labeled data, our method is useful 
for text categorization tasks involving online data sources such as web pages, E-mails, and
newsgroup postings.  
To automatically build up a text classifier with 
unlabeled data, we must solve two problems: how to automatically generate labeled training documents (machine-labeled data) from only title words, and how to handle incorrectly labeled documents in the machine-labeled data. This paper provides solutions for both problems. For the first, we employ the bootstrapping framework. For the second, we use the TCFP classifier, which is robust to noisy data (Ko and Seo, 2004).
How can labeled training data be automatically created from unlabeled data and title words? At first glance, unlabeled data seem to offer no information for building a text classifier, because they lack the most important information: their category. Thus we must assign a class to each document in order to use supervised learning approaches. Since text categorization is a task based on pre-defined categories, we know the categories into which documents will be classified. Knowing the categories means that we can choose at least one representative title word for each category. This is the starting point of our proposed method. By carrying out a bootstrapping task from these title words, we can finally obtain labeled training data.
Suppose, for example, that we are interested in classifying newsgroup postings, specifically those in the 'Autos' category. First, we can select 'automobile' as a title word and automatically extract keywords ('car', 'gear', 'transmission', 'sedan', and so on) using co-occurrence information. In our method, we use a context (a sequence of 60 words) as the unit of meaning for bootstrapping from title words; its size generally lies between that of a sentence and that of a document. We then extract core contexts that include at least one of the title words or keywords. We call them centroid-contexts because they are regarded as contexts carrying the core meaning of each category. From the centroid-contexts, we can obtain many words that contextually co-occur with the title words and keywords: 'driver', 'clutch', 'trunk', and so on. These are words in first-order co-occurrence with the title words and keywords. To gather more vocabulary, we extract contexts that are similar to centroid-contexts according to a similarity measure; they contain words in second-order co-occurrence with the title words and keywords. We finally construct the context-cluster of each category as the combination of its centroid-contexts and the contexts selected by the similarity measure. Using the context-clusters as labeled training data, a Naive Bayes classifier can be built. Since the Naive Bayes classifier can label every unlabeled document with a category, we can finally obtain labeled training data (machine-labeled data).
When the machine-labeled data are used to learn a text classifier, another difficulty arises: they contain more incorrectly labeled documents than manually labeled data. Thus we develop and employ the TCFP classifier, which is robust to noisy data.
The rest of this paper is organized as follows. Section 2 reviews previous work. Sections 3 and 4 explain the proposed method in detail. Section 5 is devoted to the analysis of the empirical results. The final section presents conclusions and future work.
 
2 Related Work
In general, related approaches for using unlabeled 
data in text categorization have taken two directions: one builds classifiers from a combination of
labeled and unlabeled data (Nigam, 2001; Bennett 
and Demiriz, 1999), and the other employs 
clustering algorithms for text categorization 
(Slonim et al., 2002). 
Nigam studied an Expectation Maximization (EM)
technique for combining labeled and unlabeled 
data for text categorization in his dissertation. He 
showed that the accuracy of learned text classifiers 
can be improved by augmenting a small number of 
labeled training data with a large pool of unlabeled 
data.  
Bennett and Demiriz achieved small improvements on some UCI data sets using SVMs. Their approach assumes that decision boundaries lie between classes in low-density regions of instance space, and that the unlabeled examples help find these areas.
Slonim suggested clustering techniques for 
unsupervised document classification. Given a 
collection of unlabeled data, he attempted to find 
clusters that are highly correlated with the true 
topics of documents by unsupervised clustering 
methods. In his paper, Slonim proposed a new 
clustering method, the sequential Information 
Bottleneck (sIB) algorithm. 
 
3 The Bootstrapping Algorithm for Creating 
Machine-labeled Data 
The bootstrapping framework described in this 
paper consists of the following steps. Each module 
is described in the following sections in detail. 
 
1. Preprocessing: Contexts are separated from unlabeled documents, and content words are extracted from them.
2. Constructing context-clusters for training:
- Keywords of each category are created.
- Centroid-contexts are extracted and verified.
- Context-clusters are created by a similarity measure.
3. Learning a classifier: A Naive Bayes classifier is learned using the context-clusters.
 
3.1 Preprocessing 
The preprocessing module has two main roles: 
extracting content words and reconstructing the 
collected documents into contexts. We use the Brill 
POS tagger to extract content words (Brill, 1995).  
Generally, the supervised learning approach with labeled data regards a document as the unit of meaning. But since we can use only the title words and unlabeled data, we define a context as the unit of meaning and employ it to bootstrap the meaning of each category. In our system, we regard a sequence of 60 content words within a document as a context. To extract contexts from a document, we use a sliding window technique (Maarek et al., 1991). The window slides from the first word of the document to the last, with a window size of 60 words and an interval of 30 words between windows. Therefore, the final output of preprocessing is a set of context vectors, each represented by the content words of its context.
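To make the procedure concrete, the sliding-window extraction can be sketched in Python as follows; the function takes a document's content words as an already-extracted list (e.g., the output of the POS tagger), and the function name is ours, not part of any released code:

def extract_contexts(words, window=60, interval=30):
    """Slide a 60-word window over a document's content words in
    30-word steps and return the resulting contexts."""
    contexts = []
    for start in range(0, len(words), interval):
        context = words[start:start + window]
        if context:
            contexts.append(context)
        if start + window >= len(words):
            break  # the window has reached the end of the document
    return contexts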
 
3.2 Constructing Context-Clusters for 
Training 
At first, we automatically create keywords from the title word of each category using co-occurrence information. Then centroid-contexts are extracted using the title word and keywords; they contain at least one of the title words or keywords. Finally, we gain more information about each category by assigning the remaining contexts, which contain no title word or keyword, to each context-cluster using a similarity measure technique.
3.2.1 Creating Keyword Lists 
The starting point of our method is that we have title words and collected documents. A title word can convey the main meaning of each category, but by itself it may be insufficient to represent the category for text categorization. Thus we need to find words that are semantically related to the title word, and we define them as the keywords of each category.
The score of semantic similarity between a title word, T, and a word, W, is calculated by the cosine metric as follows:

 
sim(T, W) = \frac{\sum_{i=1}^{n} t_i \times w_i}{\sqrt{\sum_{i=1}^{n} t_i^2} \times \sqrt{\sum_{i=1}^{n} w_i^2}}    (1)
 
where t_i and w_i represent the occurrence (binary value: 0 or 1) of words T and W in the i-th document respectively, and n is the total number of documents in the collection. This method calculates the similarity score between words based on the degree of their co-occurrence in the same documents.
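For concreteness, formula 1 can be computed directly from binary occurrence vectors. A minimal sketch, assuming each document is given as a set (or other container) of its words:

import math

def cooccurrence_sim(title_word, word, documents):
    """Cosine similarity of binary document-occurrence vectors (formula 1)."""
    t = [1 if title_word in doc else 0 for doc in documents]
    w = [1 if word in doc else 0 for doc in documents]
    dot = sum(ti * wi for ti, wi in zip(t, w))
    # For binary values, t_i^2 == t_i, so each norm reduces to sqrt of the sum.
    norm = math.sqrt(sum(t)) * math.sqrt(sum(w))
    return dot / norm if norm else 0.0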
Since keywords for text categorization must have the power to discriminate between categories as well as similarity to the title words, we assign each word to the keyword list of the category whose title word yields the maximum similarity score, and we recalculate the score of the word in that category using the following formula:
 
Score(W, c_{max}) = sim(T_{max}, W) + (sim(T_{max}, W) - sim(T_{secondmax}, W))    (2)
 
where T_max is the title word with the maximum similarity score to the word W, c_max is the category of the title word T_max, and T_secondmax is the title word with the second highest similarity score to the word W.
This formula means that a word with high 
ranking in a category has a high similarity score 
with the title word of the category and a high 
similarity score difference with other title words. 
We sort the words assigned to each category by the calculated score in descending order and then choose the top m words as the keywords of that category. Table 1 shows the keywords (top 5) for each category in the WebKB data set.
 
Table 1. The list of keywords in the WebKB data set

Category | Title Word | Keywords
course   | course     | assignments, hours, instructor, class, fall
faculty  | professor  | associate, ph.d, fax, interests, publications
project  | project    | system, systems, research, software, information
student  | student    | graduate, computer, science, page, university
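Keyword selection (formula 2) then amounts to assigning each word to the category of its most similar title word and rescoring it by the margin over the runner-up. A sketch reusing cooccurrence_sim above; title_words maps each category to its title word, and the data structures are assumptions of this sketch:

def create_keyword_lists(title_words, vocabulary, documents, m=5):
    """Assign each word to its best category and keep the top-m (formula 2)."""
    keyword_lists = {category: [] for category in title_words}
    for w in vocabulary:
        # Rank title words by their similarity to w (assumes >= 2 categories).
        sims = sorted(((cooccurrence_sim(t, w, documents), category)
                       for category, t in title_words.items()), reverse=True)
        (sim_max, c_max), (sim_second, _) = sims[0], sims[1]
        # Score(W, c_max) = sim(T_max, W) + (sim(T_max, W) - sim(T_secondmax, W))
        keyword_lists[c_max].append((sim_max + (sim_max - sim_second), w))
    # Sort by score in descending order and keep the top-m words per category.
    return {c: [w for _, w in sorted(pairs, reverse=True)[:m]]
            for c, pairs in keyword_lists.items()}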
 
3.2.2 Extracting and Verifying Centroid-Contexts 
We choose contexts containing a keyword or a title word of a category as centroid-contexts. Among them, some contexts may not carry good features of the category even though they include its keywords. To rank the importance of centroid-contexts, we compute an importance score for each centroid-context. First of all, the weight W_ij of word w_i in the j-th category is calculated using the Term Frequency (TF) within the category and the Inverse Category Frequency (ICF) (Cho and Kim, 1997) as follows:
 
W_{ij} = TF_{ij} \times ICF_i = TF_{ij} \times (\log(M) - \log(CF_i))    (3)
 
where CF_i is the number of categories that contain w_i, and M is the total number of categories.
Using the word weights W_ij calculated by formula 3, the score of a centroid-context S_k in the j-th category c_j is computed as follows:
 
Score(S_k, c_j) = \frac{W_{1j} + W_{2j} + \cdots + W_{Nj}}{N}    (4)

where N is the number of words in the centroid-context.
As a result, we obtain a set of words in first-
order co-occurrence from centroid-contexts of each 
category. 
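A small sketch of formulas 3 and 4; the per-category term frequencies tf and the category frequencies cf are assumed to be precomputed dictionaries:

import math

def tficf_weight(word, category, tf, cf, num_categories):
    """W_ij = TF_ij * ICF_i = TF_ij * (log(M) - log(CF_i))  (formula 3)."""
    return tf[category][word] * (math.log(num_categories) - math.log(cf[word]))

def centroid_context_score(context, category, tf, cf, num_categories):
    """Score(S_k, c_j): average TF-ICF weight of the context's words (formula 4)."""
    weights = [tficf_weight(w, category, tf, cf, num_categories) for w in context]
    return sum(weights) / len(weights)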
3.2.3 Creating Context-Clusters 
We gather the second-order co-occurrence 
information by assigning remaining contexts to the 
context-cluster of each category. For the assigning 
criterion, we calculate similarity between 
remaining contexts and centroid-contexts of each 
category. To this end, we employ the similarity measure technique of Karov and Edelman (1998). In our method, a part of this technique is adapted for our purpose, and the remaining contexts are assigned to each context-cluster by the revised technique.
 
1) Measurement of word and context similarities 
As similar words tend to appear in similar contexts, 
we can compute the similarity by using contextual 
information. Words and contexts play 
complementary roles. Contexts are similar to the 
extent that they contain similar words, and words 
are similar to the extent that they appear in similar 
contexts (Karov and Edelman, 1998). This 
definition is circular. Thus it is applied iteratively 
using two matrices, WSM and CSM. 
Each category has a word similarity matrix WSM_n and a context similarity matrix CSM_n. In each iteration n, we update WSM_n, whose rows and columns are labeled by all content words encountered in the centroid-contexts of each category and the input remaining contexts. In that matrix, the cell (i,j) holds a value between 0 and 1, indicating the extent to which the i-th word is contextually similar to the j-th word. We also keep and update CSM_n, which holds similarities among contexts. The rows of CSM_n correspond to the remaining contexts and the columns to the centroid-contexts. In this paper, the number of input contexts per row and column in CSM is limited to 200, considering execution time and memory allocation, and the number of iterations is set to 3.
To compute the similarities, we initialize WSM_n to the identity matrix. The following steps are iterated until the changes in the similarity values are small enough:
1. Update the context similarity matrix CSM_n, using the word similarity matrix WSM_n.
2. Update the word similarity matrix WSM_n, using the context similarity matrix CSM_n.
2) Affinity formulae 
To simplify the symmetric iterative treatment of 
similarity between words and contexts, we define 
an auxiliary relation between words and contexts 
as affinity.  
Affinity formulae are defined as follows (Karov 
and Edelman, 1998): 
 
aff_n(W, X) = \max_{W_i \in X} sim_n(W, W_i)    (5)

aff_n(X, W) = \max_{X_j \ni W} sim_n(X, X_j)    (6)
In the above formulae, n denotes the iteration number, and the similarity values are defined by WSM_n and CSM_n. Every word has some affinity to a context, and a context can be represented by a vector indicating the affinity of each word to it.
 
3) Similarity formulae 
The similarity of W_1 to W_2 is the average affinity to W_2 of the contexts that include W_1, and the similarity of a context X_1 to X_2 is a weighted average of the affinity of the words in X_1 to X_2. The similarity formulae are defined as follows:
 
sim_{n+1}(X_1, X_2) = \sum_{W \in X_1} weight(W, X_1) \cdot aff_n(W, X_2)    (7)

sim_{n+1}(W_1, W_2) = 1    if W_1 = W_2; otherwise
sim_{n+1}(W_1, W_2) = \sum_{X \ni W_1} weight(X, W_1) \cdot aff_n(X, W_2)    (8)
The weights in formula 7 are computed to reflect global frequency, log-likelihood factors, and part of speech, as in (Karov and Edelman, 1998). Each weight in formula 8 is the reciprocal of the number of contexts that contain W_1, so the weights sum to 1.
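One round of the iteration (formulas 5-8) can be sketched as follows. For readability, the sketch uses nested dictionaries for WSM and CSM, represents contexts as word lists, and substitutes uniform weights for the full weighting scheme described above:

def word_affinity(word, context, wsm):
    """aff_n(W, X) = max over W_i in X of sim_n(W, W_i)  (formula 5)."""
    return max(wsm[word][wi] for wi in context)

def context_affinity(x, word, contexts_of, csm):
    """aff_n(X, W) = max over contexts X_j containing W of sim_n(X, X_j)  (formula 6)."""
    return max(csm[x][xj] for xj in contexts_of[word])

def context_sim(x1_words, x2_words, wsm):
    """sim_{n+1}(X1, X2): weighted average affinity of X1's words to X2 (formula 7)."""
    weight = 1.0 / len(x1_words)  # uniform weights in this sketch
    return sum(weight * word_affinity(w, x2_words, wsm) for w in x1_words)

def word_sim(w1, w2, contexts_of, csm):
    """sim_{n+1}(W1, W2): average affinity to W2 of W1's contexts (formula 8)."""
    if w1 == w2:
        return 1.0
    xs = contexts_of[w1]  # ids of the contexts that contain W1
    return sum(context_affinity(x, w2, contexts_of, csm) for x in xs) / len(xs)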
 
4) Assigning remaining contexts to a category 
We determine the similarity value of each remaining context to each category using the following method:
sim(X, c_i) = \operatorname{aver}_{S_j \in CC_{c_i}} sim(X, S_j), \quad c_i \in C    (9)
 
In formula 9, i) X is a remaining context, ii) C = \{c_1, c_2, ..., c_m\} is the category set, and iii) CC_{c_i} = \{S_1, ..., S_n\} is the centroid-context set of category c_i.
Each remaining context is assigned to the category with the maximum similarity value. But there may exist noisy remaining contexts that do not belong to any category. To remove them, we set a dropping threshold using the normal distribution of similarity values as follows (Ko and Seo, 2000):
 
\max_{c_i \in C} \{ sim(X, c_i) \} \geq \mu + \theta\sigma    (10)
 
where i) X is a remaining context, ii) \mu is the average of the similarity values sim(X, c_i) for c_i \in C, iii) \sigma is the standard deviation of these similarity values, and iv) \theta is a numerical value corresponding to the threshold (%) in the normal distribution table.
Finally, a remaining context is assigned to the context-cluster of the category with the maximum similarity value when that value exceeds the dropping threshold. In this paper, we empirically use a 15% threshold value, chosen in an experiment on a validation set.
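A sketch of the assignment step (formulas 9 and 10); context_sim(X, S) stands for the converged similarity values from CSM above, and theta=1.04 is our illustrative z-value for a 15% upper tail of the normal distribution:

import statistics

def assign_remaining_contexts(centroid_sets, remaining, context_sim, theta=1.04):
    """Assign each remaining context to its best category, dropping noisy ones."""
    clusters = {c: list(centroids) for c, centroids in centroid_sets.items()}
    for x in remaining:
        # sim(X, c_i): average similarity to the centroid-contexts of c_i (formula 9)
        sims = {c: sum(context_sim(x, s) for s in centroids) / len(centroids)
                for c, centroids in centroid_sets.items()}
        mu = statistics.mean(sims.values())
        sigma = statistics.pstdev(sims.values())
        best = max(sims, key=sims.get)
        # Dropping threshold: keep X only if max sim >= mu + theta * sigma (formula 10)
        if sims[best] >= mu + theta * sigma:
            clusters[best].append(x)
    return clusters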
3.3 Learning the Naive Bayes Classifier Using 
Context-Clusters 
In the previous section, we obtained labeled training data: the context-clusters. Since the training data are labeled at the context unit, we employ a Naive Bayes classifier, because it can be built by estimating word probabilities in a category rather than in a document. That is, unlike other classifiers, the Naive Bayes classifier does not require labeled data at the document unit.
We use the Naive Bayes classifier with minor modifications based on Kullback-Leibler divergence (Craven et al., 2000). We classify a document d_i according to the following formula:
 
 
P(c_j | d_i; \hat{\theta}) = \frac{P(c_j | \hat{\theta}) \, P(d_i | c_j; \hat{\theta})}{P(d_i | \hat{\theta})} \approx P(c_j | \hat{\theta}) \prod_{t=1}^{|V|} P(w_t | c_j; \hat{\theta})^{N(w_t, d_i)}

\propto \frac{\log P(c_j | \hat{\theta})}{n} + \sum_{t=1}^{|V|} P(w_t | d_i; \hat{\theta}) \log \frac{P(w_t | c_j; \hat{\theta})}{P(w_t | d_i; \hat{\theta})}    (11)
 
 
where i) n is the number of words in document d_i, ii) w_t is the t-th word in the vocabulary, and iii) N(w_t, d_i) is the frequency of word w_t in document d_i.
Here, Laplace smoothing is used to estimate the probability of word w_t in class c_j and the probability of class c_j as follows:
 
P(w_t | c_j; \hat{\theta}) = \frac{1 + N(w_t, G_{c_j})}{|V| + \sum_{t=1}^{|V|} N(w_t, G_{c_j})}    (12)

P(c_j | \hat{\theta}) = \frac{1 + |G_{c_j}|}{|C| + \sum_{c_i} |G_{c_i}|}    (13)
 
where N(w_t, G_{c_j}) is the number of times word w_t occurs in the context-cluster G_{c_j} of category c_j.
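A compact sketch of the resulting classifier (formulas 11-13); context_clusters maps each category to the list of words in its context-cluster, and all names are this sketch's own:

import math
from collections import Counter

def train_naive_bayes(context_clusters):
    """Estimate P(w_t|c_j) and P(c_j) with Laplace smoothing (formulas 12-13)."""
    vocab = {w for words in context_clusters.values() for w in words}
    counts = {c: Counter(words) for c, words in context_clusters.items()}
    total_words = sum(len(words) for words in context_clusters.values())
    # |G_c| is taken here as the word count of the cluster (an assumption).
    prior = {c: (1 + len(words)) / (len(context_clusters) + total_words)
             for c, words in context_clusters.items()}
    cond = {c: {w: (1 + counts[c][w]) / (len(vocab) + sum(counts[c].values()))
                for w in vocab}
            for c in context_clusters}
    return prior, cond, vocab

def classify(doc_words, prior, cond, vocab):
    """Score each class with the KL-divergence form of formula 11."""
    known = [w for w in doc_words if w in vocab]
    n = len(known)
    freq = Counter(known)
    scores = {}
    for c in prior:
        score = math.log(prior[c]) / n
        for w, k in freq.items():
            p_doc = k / n  # P(w_t | d_i)
            score += p_doc * math.log(cond[c][w] / p_doc)
        scores[c] = score
    return max(scores, key=scores.get)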
 
4 Using a Feature Projection Technique for 
Handling Noisy Data of Machine-labeled 
Data 
We finally obtained labeled data at the document unit: machine-labeled data. Now we can learn text classifiers using them. But since the machine-labeled data are created automatically by our method, they generally include far more incorrectly labeled documents than human-labeled data. Thus we employ a feature projection technique in our method. By the properties of the feature projection technique, a classifier (the TCFP classifier) can be made robust to noisy data (Ko and Seo, 2004). As seen in our experimental results, TCFP showed the highest performance among conventional classifiers when using machine-labeled data.
 
The TCFP classifier and its robustness to noisy data
Here, we briefly describe the TCFP classifier, which uses the feature projection technique (Ko and Seo, 2002; 2004). In this approach, classification knowledge is represented as sets of projections of the training data onto each feature dimension. The classification of a test document is based on the votes of the features of that test document. That is, the final prediction score is calculated by accumulating the voting scores of all features.
First of all, we must calculate the voting ratio of each category for every feature. Since elements with a high TF-IDF value in the projections of a feature are more useful classification criteria for that feature, we use only elements with TF-IDF values above the average TF-IDF value for voting. The selected elements participate in proportional voting with importance equal to the TF-IDF value of each element. The voting ratio of each category c_j for a feature t_m is calculated by the following formula:
 
 
r(c_j, t_m) = \frac{\sum_{t_m(l) \in I_m} y(c_j, t_m(l)) \cdot w(t_m, \vec{d}_l)}{\sum_{t_m(l) \in I_m} w(t_m, \vec{d}_l)}    (14)
 
In formula 14, w(t_m, \vec{d}_l) is the weight of term t_m in document d_l, I_m denotes the set of elements selected for voting, and y(c_j, t_m(l)) \in \{0, 1\} is an indicator function: if the category of element t_m(l) is equal to c_j, its value is 1; otherwise, its value is 0.
Next, since each feature votes separately on the feature projections, contextual information is missing. Thus we calculate the co-occurrence frequencies of features in the training data and modify the TF-IDF values of two terms t_i and t_j in a test document by the co-occurrence frequency between them; terms with a high co-occurrence frequency obtain higher term weights.
Finally, the voting score of each category c_j for the m-th feature t_m of a test document d is calculated by the following formula:
 
vs(c_j, t_m) = tw(t_m, d) \cdot r(c_j, t_m) \cdot \log(1 + \chi^2(t_m))    (15)
 
where tw(t_m, d) denotes the term weight modified by the co-occurrence frequency, and \chi^2(t_m) denotes the calculated \chi^2 statistic of t_m.
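Putting formulas 14 and 15 together, the voting stage of TCFP can be sketched as follows; the projections, modified term weights tw, and chi-square statistics are assumed to be precomputed, and each projection I_m is a list of (category, TF-IDF weight) elements selected for voting:

import math
from collections import defaultdict

def voting_ratio(category, elements):
    """r(c_j, t_m): proportional vote of a category on one feature (formula 14)."""
    total = sum(weight for _, weight in elements)
    votes = sum(weight for c, weight in elements if c == category)
    return votes / total if total else 0.0

def classify_tcfp(doc_features, projections, categories, tw, chi2):
    """Accumulate vs(c_j, t_m) = tw(t_m, d) * r(c_j, t_m) * log(1 + chi2(t_m))."""
    scores = defaultdict(float)
    for t_m in doc_features:
        elements = projections.get(t_m, [])
        for c in categories:
            scores[c] += tw[t_m] * voting_ratio(c, elements) * math.log(1 + chi2[t_m])
    return max(scores, key=scores.get)  # category with the highest accumulated score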
 
 
