A Differential LSI Method for Document Classification
Liang Chen
Computer Science Department
University of Northern British Columbia
Prince George, BC, Canada V2N 4Z9
chenl@unbc.ca
Naoyuki Tokuda
R & D Center, Sunflare Company
Shinjuku-Hirose Bldg., 4-7 Yotsuya
Shinjuku-ku, Tokyo, Japan 160-0004
tokuda n@sunflare.co.jp
Akira Nagai
Advanced Media Network Center
Utsunomiya University
Utsunomiya, Tochigi, Japan 321-8585
anagai@cc.utsunomiya-u.ac.jp
Abstract
We have developed an effective probabilistic classifier for document classification by introducing the concept of differential document vectors and DLSI (differential latent semantic index) spaces. A simple posteriori calculation using the intra- and extra-document statistics demonstrates the advantage of the DLSI space-based probabilistic classifier over the popularly used LSI space-based classifier in classification performance.
1 Introduction
This paper introduces a new efficient supervised document classification procedure: given a number of labeled documents preclassified into a finite number of appropriate clusters in a database, the classifier developed will assign any new document introduced to the most appropriate of these clusters, exploiting what it has acquired in the learning stage.
The vector space model is widely used in document classification, where each document is represented as a vector of terms. To represent a document by a document vector, we assign weights to its components, usually evaluating the frequency of occurrence of the corresponding terms. Standard pattern recognition and machine learning methods are then employed for document classification (Li et al., 1991; Farkas, 1994; Svingen, 1997; Hyotyniemi, 1996; Merkl, 1998; Benkhalifa et al., 1999; Iwayama and Tokunaga, 1995; Lam and Low, 1997; Nigam et al., 2000).
In view of the inherent flexibility embedded within any natural language, a staggering number of dimensions seems to be required to represent the feature space of any practical document collection comprising a huge number of terms. Before a speedy classification algorithm can be developed (Schütze and Silverstein, 1997), the first problem to be resolved is a dimensionality reduction scheme enabling the documents' terms to be projected onto a smaller subspace.
Like the eigen-decomposition method extensively used in image processing and image recognition (Sirovich and Kirby, 1987; Turk and Pentland, 1991), the latent semantic indexing (LSI) method has proved to be a most efficient method of dimensionality reduction in document analysis and extraction, providing a powerful tool for the classifier (Schütze and Silverstein, 1997) when introduced into document retrieval, with good performance confirmed by empirical studies (Deerwester et al., 1990; Berry et al., 1999; Berry et al., 1995). The LSI method has also demonstrated its efficiency for automated cross-language document retrieval in which no query translation is required (Littman et al., 1998).
In this paper, we will show that exploiting both the distances to, and the projections onto, the LSI space improves the performance as well as the robustness of the document classifier. To do this, we introduce, as the major vector space, the differential LSI (or DLSI) space, which is formed from the differences between normalized intra- and extra-document vectors and normalized centroid vectors of clusters, where intra- and extra-document refer to documents included within or outside of the given cluster respectively. The new classifier sets up a Bayesian posteriori probability function for the differential document vectors based on their projections onto the DLSI space and their distances to the DLSI space; the document category with the highest probability is then selected. A similar approach is taken by Moghaddam and Pentland for image recognition (Moghaddam and Pentland, 1997; Moghaddam et al., 1998).
We may summarize the specific features introduced into the new document classification scheme based on the concept of the differential document vector and the DLSI vectors:

1. Exploiting both the characteristic distance of the differential document vector to the DLSI space and the projection of the differential document vector onto the DLSI space, which we believe denote the differences in word usage between the document and a cluster's centroid vector, the differential document vector is capable of capturing the relation between the particular document and the cluster.

2. A major problem of the context-sensitive semantics of natural language, related to synonymy and polysemy, can be dampened by the major-space projection method embodied in the LSI spaces used.

3. A maximum of the posteriori likelihood function, making use of the projection of the differential document vector onto the DLSI space and the distance to the DLSI space, provides a consistent computational scheme for evaluating the degree of reliability with which the document belongs to the cluster.
The rest of the paper is arranged as follows: Section 2 describes the main algorithm for setting up the DLSI-based classifier. A simple example is computed for comparison with the results of the standard LSI-based classifier in Section 3. The conclusion is given in Section 4.
2 Main Algorithm
2.1 Basic Concepts
A term is defined as a word or a phrase that appears in at least two documents. We exclude so-called stop words such as "a", "the", "of" and so forth. Suppose we select and list the terms that appear in the documents as $t_1, t_2, \ldots, t_m$.

For each document $j$ in the collection, we assign each of the terms a weight, giving a real vector $(a_{1j}, a_{2j}, \ldots, a_{mj})^T$, with $a_{ij} = l_{ij} \cdot g_i$, where $l_{ij}$ is the local weighting of the term $t_i$ in document $j$, indicating the significance of the term in the document, while $g_i$ is a global weight over all the documents, a parameter indicating the importance of the term in representing the documents. Local weights could be either raw occurrence counts, boolean, or logarithms of occurrence counts. Global ones could be no weighting (uniform), domain specific, or entropy weighting. Both local and global weights have been thoroughly studied in the literature (Raghavan and Wong, 1986; Luhn, 1958; van Rijsbergen, 1979; Salton, 1983; Salton, 1988; Lee et al., 1997), and will not be discussed further in this paper. An example is given below:
$$l_{ij} = \log(1 + f_{ij}) \quad\text{and}\quad g_i = 1 + \frac{1}{\log N}\sum_{j=1}^{N} p_{ij}\log p_{ij},$$

where $p_{ij} = f_{ij}/n_i$, $n_i$ is the total number of times that term $t_i$ appears in the collection, $f_{ij}$ the number of times the term $t_i$ appears in the document $j$, and $N$ the number of documents in the collection. The document vector $(a_{1j}, a_{2j}, \ldots, a_{mj})^T$ can be normalized as $(c_{1j}, c_{2j}, \ldots, c_{mj})^T$ by the following formula:

$$c_{ij} = \frac{a_{ij}}{\sqrt{\sum_{k=1}^{m} a_{kj}^2}}. \qquad (1)$$
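For concreteness, the following sketch (in Python with NumPy; the function name and array layout are our own choices, not part of the method) computes entropy-weighted, length-normalized document vectors from a raw term-frequency matrix, following the formulas above.

import numpy as np

def normalized_doc_vectors(f):
    """f[i, j] = raw count of term t_i in document j (an m-by-N array).
    Returns the matrix of normalized document vectors c_ij of equation (1)."""
    m, N = f.shape
    n = f.sum(axis=1, keepdims=True)                 # n_i: collection counts
    p = np.divide(f, n, out=np.zeros((m, N)), where=n > 0)  # p_ij = f_ij / n_i
    plogp = np.where(p > 0, p * np.log(p), 0.0)      # p_ij log p_ij (<= 0)
    g = 1.0 + plogp.sum(axis=1) / np.log(N)          # entropy global weight g_i
    a = np.log(1.0 + f) * g[:, None]                 # a_ij = l_ij * g_i
    norms = np.linalg.norm(a, axis=0, keepdims=True)
    return np.divide(a, norms, out=np.zeros((m, N)), where=norms > 0)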
The normalized centroid vector $C = (c_1, c_2, \ldots, c_m)^T$ of a cluster can be calculated in terms of the normalized vectors as $c_i = s_i \big/ \sqrt{\sum_{j=1}^{m} s_j^2}$, where $(s_1, s_2, \ldots, s_m)^T$ is the mean vector of the member documents of the cluster, which are normalized as $T_1, T_2, \ldots, T_k$; i.e., $(s_1, s_2, \ldots, s_m)^T = \frac{1}{k}\sum_{j=1}^{k} T_j$. We can always take $C$ itself as a normalized vector of the cluster.
A differential document vector is defined as $T_i - T_j$, where $T_i$ and $T_j$ are normalized document vectors satisfying some criteria as given above.

A differential intra-document vector $D_I$ is a differential document vector $T_i - T_j$, where $T_i$ and $T_j$ are two normalized document vectors of the same cluster.

A differential extra-document vector $D_E$ is a differential document vector $T_i - T_j$, where $T_i$ and $T_j$ are two normalized document vectors of different clusters.

The differential term by intra-document matrix $D_I$ and the differential term by extra-document matrix $D_E$ are respectively defined as matrices whose columns comprise differential intra- and extra-document vectors.
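As a sketch of these definitions, the helper below builds $D_I$ and $D_E$ from the normalized document vectors of each cluster. The pairing of each document with the centroid of the next cluster is one arbitrary choice of extra-document pairs (the paper leaves the selection open); with two documents per cluster it reproduces the pairing used in the example of Section 3.

def centroid(T):
    """Normalized centroid C of a cluster, from the (m, k) matrix of its
    normalized member document vectors."""
    s = T.mean(axis=1)                        # mean vector (s_1, ..., s_m)^T
    return s / np.linalg.norm(s)

def differential_matrices(clusters):
    """clusters: list of (m, k_c) arrays of normalized document vectors.
    Returns the differential term by intra-/extra-document matrices."""
    C = [centroid(T) for T in clusters]
    intra, extra = [], []
    for c, T in enumerate(clusters):
        # keep at most k_c - 1 intra vectors per cluster, to avoid linear
        # dependency among the columns (cf. the note in Section 2.3.1)
        for j in range(T.shape[1] - 1):
            intra.append(T[:, j] - C[c])
        # pair one document with the centroid of a different cluster
        extra.append(T[:, -1] - C[(c + 1) % len(clusters)])
    return np.column_stack(intra), np.column_stack(extra)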
2.2 The Posteriori Model

Any differential term by document $m$-by-$n$ matrix $D$, say of rank $r \le q = \min(m, n)$, whether it is a differential term by intra-document matrix $D_I$ or a differential term by extra-document matrix $D_E$, can be decomposed by SVD into a product of three matrices, $D = U\Sigma V^T$, such that $U$ (the left singular matrix) and $V$ (the right singular matrix) are $m$-by-$q$ and $n$-by-$q$ column-orthonormal matrices respectively, with the first $r$ columns of $U$ and $V$ being the eigenvectors of $DD^T$ and $D^T D$ respectively. Here $\Sigma$ is the so-called singular matrix, expressed by $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_q)$, where the $\sigma_i$ are the nonnegative square roots of the eigenvalues of $DD^T$, with $\sigma_i > 0$ for $i \le r$ and $\sigma_i = 0$ for $i > r$.

The diagonal elements of $\Sigma$ are sorted in decreasing order of magnitude. To obtain a new reduced matrix $\Sigma_k$, we simply keep the $k$-by-$k$ leftmost-upper corner matrix ($k < r$) of $\Sigma$, deleting the other terms; we similarly obtain the two new matrices $U_k$ and $V_k$ by keeping the leftmost $k$ columns of $U$ and $V$ respectively. The product of $U_k$, $\Sigma_k$ and $V_k^T$ provides a reduced matrix $D_k$ which approximately equals $D$.

How to choose an appropriate value of $k$, the reduced number of dimensions retained from the original matrix, depends on the type of application. Generally we choose $k \approx 100$ for $1000 \le n \le 3000$, and the corresponding $k$ is normally smaller for the differential term by intra-document matrix than for the differential term by extra-document matrix, because the differential term by extra-document matrix normally has more columns than the differential term by intra-document matrix.
Each differential document vector $x$ has a projection onto the $k$-dimensional subspace spanned by the $k$ columns of $U_k$. The projection can easily be obtained as $U_k^T x$.
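In NumPy, the rank-$k$ reduction and the projection are immediate (a sketch; np.linalg.svd already returns the singular values in decreasing order):

def truncated_svd(D, k):
    """D ~ U_k Sigma_k V_k^T, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def project(U_k, x):
    """Projection U_k^T x of a differential document vector x."""
    return U_k.T @ x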
Noting that the mean $\bar{x}$ of the differential intra- (extra-) document vectors is approximately $0$, we may assume that the differential vectors formed follow a high-dimensional Gaussian distribution, so that the likelihood of any differential vector $x$ is given by

$$P(x|D) = \frac{\exp\left(-\frac{1}{2}d(x)\right)}{(2\pi)^{n/2}\,|S|^{1/2}},$$

where $d(x) = x^T S^{-1} x$, and $S$ is the covariance matrix of the distribution computed from the training set, expressed as $S = \frac{1}{n} DD^T$.
Since the $\sigma_i^2$ constitute the eigenvalues of $DD^T$, we have $\Sigma^2 = U^T DD^T U$, and thus

$$d(x) = n\,x^T (DD^T)^{-1} x = n\,x^T U \Sigma^{-2} U^T x = n\,y^T \Sigma^{-2} y,$$

where $y = U^T x = (y_1, y_2, \ldots, y_q)^T$. Because $\Sigma$ is a diagonal matrix, $d(x)$ can be written in the simpler form

$$d(x) = n\sum_{i=1}^{r} \frac{y_i^2}{\sigma_i^2}.$$
It is most convenient to estimate it as

$$\hat{d}(x) = n\left(\sum_{i=1}^{k} \frac{y_i^2}{\sigma_i^2} + \frac{1}{\rho}\sum_{i=k+1}^{r} y_i^2\right),$$

where $\rho = \frac{1}{r-k}\sum_{i=k+1}^{r} \sigma_i^2$. In practice, the $\sigma_i$ ($i > k$) could be estimated by fitting a function (say, $1/i$) to the available $\sigma_i$ ($i \le k$), or we could simply let $\rho = \sigma_{k+1}^2/2$, since we only need to compare relative probabilities. Because the columns of $U$ are orthogonal vectors, $\sum_{i=k+1}^{r} y_i^2$ can be estimated as $\|x\|^2 - \sum_{i=1}^{k} y_i^2$. Thus, the likelihood function $P(x|D)$ can be estimated by

$$\hat{P}(x|D) = \frac{n^{r/2}\,\exp\left(-\frac{n}{2}\sum_{i=1}^{k}\frac{y_i^2}{\sigma_i^2}\right)\cdot\exp\left(-\frac{n\,\epsilon^2(x)}{2\rho}\right)}{(2\pi)^{r/2}\,\prod_{i=1}^{k}\sigma_i\cdot\rho^{(r-k)/2}}, \qquad (2)$$

where $y = U_k^T x$, $\epsilon^2(x) = \|x\|^2 - \sum_{i=1}^{k} y_i^2$, $\rho = \frac{1}{r-k}\sum_{i=k+1}^{r}\sigma_i^2$, and $r$ is the rank of the matrix $D$. In practice, $\rho$ may be chosen as $\sigma_{k+1}^2/2$, and $n$ may be substituted for $r$. Note that in equation (2), the term $\sum_{i=1}^{k} y_i^2/\sigma_i^2$ describes the projection of $x$ onto the DLSI space, while $\epsilon(x)$ approximates the distance from $x$ to the DLSI space.
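A direct transcription of equation (2) is sketched below; following the "in practice" remarks above, it substitutes $n$ for $r$ and $\sigma_{k+1}^2/2$ for $\rho$. No numerical safeguards are included (for large $n$ one would work in log space).

def likelihood(x, U, s, k, n):
    """Estimated likelihood P(x|D) of equation (2).
    U, s: singular vectors/values of the differential matrix D
    (s in decreasing order, with at least k+1 entries);
    k: retained dimensions; n: number of columns of D (used for r)."""
    y = U[:, :k].T @ x                        # projection onto the DLSI space
    eps2 = float(x @ x - y @ y)               # squared distance to the space
    rho = s[k] ** 2 / 2.0                     # rho = sigma_{k+1}^2 / 2
    num = (n ** (n / 2.0)
           * np.exp(-n / 2.0 * np.sum(y ** 2 / s[:k] ** 2))
           * np.exp(-n * eps2 / (2.0 * rho)))
    den = ((2.0 * np.pi) ** (n / 2.0)
           * np.prod(s[:k]) * rho ** ((n - k) / 2.0))
    return num / den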
When both $P(x|D_I)$ and $P(x|D_E)$ have been computed, the Bayesian posteriori function can be computed as

$$P(D_I|x) = \frac{P(x|D_I)\,P(D_I)}{P(x|D_I)\,P(D_I) + P(x|D_E)\,P(D_E)},$$

where $P(D_I)$ is set to $1/n_c$, with $n_c$ the number of clusters in the database, while $P(D_E)$ is set to $1 - P(D_I)$. ($P(D_I)$ can also be set to the average number of recalls divided by the number of clusters in the database if we do not require that the clusters be non-overlapping.)
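The combination is then a one-liner (a sketch, using the uniform prior $P(D_I) = 1/n_c$):

def posteriori(p_intra, p_extra, n_c):
    """Bayesian posteriori P(D_I|x) with P(D_I) = 1/n_c."""
    p_I = 1.0 / n_c
    return p_intra * p_I / (p_intra * p_I + p_extra * (1.0 - p_I))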
2.3 Algorithm
2.3.1 Setting up the DLSI Space-Based Classifier

1. By preprocessing the documents, identify the terms, taking both words and noun phrases and discarding stop words.
2. Construct the system terms by setting up the term list as well as the global weights.

3. Normalize the document vectors of all the collected documents, as well as the centroid vectors of each cluster.
4. Construct the differential term by intra-document matrix $D_I^{m\times n_I}$, such that each of its columns is a differential intra-document vector. (For a cluster with $s$ elements, we may include at most $s-1$ differential intra-document vectors in $D_I$, to avoid linear dependency among the columns.)

5. Decompose $D_I$, by an SVD algorithm, into $D_I = U_I\Sigma_I V_I^T$ ($\Sigma_I = \mathrm{diag}(\sigma_{I,1}, \sigma_{I,2}, \ldots)$), followed by the composition of $D_{I,k_I} = U_{I,k_I}\Sigma_{I,k_I}V_{I,k_I}^T$, which approximates $D_I$ in terms of an appropriate $k_I$; then evaluate the likelihood function

$$\hat{P}(x|D_I) = \frac{n_I^{\,r_I/2}\,\exp\left(-\frac{n_I}{2}\sum_{i=1}^{k_I}\frac{y_i^2}{\sigma_{I,i}^2}\right)\cdot\exp\left(-\frac{n_I\,\epsilon^2(x)}{2\rho_I}\right)}{(2\pi)^{r_I/2}\,\prod_{i=1}^{k_I}\sigma_{I,i}\cdot\rho_I^{(r_I-k_I)/2}}, \qquad (3)$$

where $y = U_{I,k_I}^T x$, $\epsilon^2(x) = \|x\|^2 - \sum_{i=1}^{k_I} y_i^2$, $\rho_I = \frac{1}{r_I-k_I}\sum_{i=k_I+1}^{r_I}\sigma_{I,i}^2$, and $r_I$ is the rank of the matrix $D_I$. In practice, $r_I$ may be set to $n_I$, and $\rho_I$ to $\sigma_{I,k_I+1}^2/2$, if both $n_I$ and $m$ are sufficiently large.

6. Construct the differential term by extra-document matrix $D_E^{m\times n_E}$, such that each of its columns is a differential extra-document vector.

7. Decompose $D_E$, by the SVD algorithm, into $D_E = U_E\Sigma_E V_E^T$ ($\Sigma_E = \mathrm{diag}(\sigma_{E,1}, \sigma_{E,2}, \ldots)$); then, with a proper $k_E$, define $D_{E,k_E} = U_{E,k_E}\Sigma_{E,k_E}V_{E,k_E}^T$ to approximate $D_E$. We now define the likelihood function as

$$\hat{P}(x|D_E) = \frac{n_E^{\,r_E/2}\,\exp\left(-\frac{n_E}{2}\sum_{i=1}^{k_E}\frac{y_i^2}{\sigma_{E,i}^2}\right)\cdot\exp\left(-\frac{n_E\,\epsilon^2(x)}{2\rho_E}\right)}{(2\pi)^{r_E/2}\,\prod_{i=1}^{k_E}\sigma_{E,i}\cdot\rho_E^{(r_E-k_E)/2}}, \qquad (4)$$

where $y = U_{E,k_E}^T x$, $\epsilon^2(x) = \|x\|^2 - \sum_{i=1}^{k_E} y_i^2$, $\rho_E = \frac{1}{r_E-k_E}\sum_{i=k_E+1}^{r_E}\sigma_{E,i}^2$, and $r_E$ is the rank of the matrix $D_E$. In practice, $r_E$ may be set to $n_E$, and $\rho_E$ to $\sigma_{E,k_E+1}^2/2$, if both $n_E$ and $m$ are sufficiently large.

8. Define the posteriori function

$$P(D_I|x) = \frac{P(x|D_I)\,P(D_I)}{P(x|D_I)\,P(D_I) + P(x|D_E)\,P(D_E)}, \qquad (5)$$

where $P(D_I)$ is set to $1/n_c$, with $n_c$ the number of clusters in the database, and $P(D_E)$ is set to $1 - P(D_I)$.
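Steps 1-8 can be collected into a small setup routine; the sketch below reuses the helpers sketched in earlier sections (centroid, differential_matrices, likelihood, posteriori), whose names are ours rather than the paper's.

class DLSIClassifier:
    """Holds the cluster centroids and the SVD factors of D_I and D_E."""

    def __init__(self, clusters, k_I, k_E):
        self.C = [centroid(T) for T in clusters]        # step 3
        D_I, D_E = differential_matrices(clusters)      # steps 4 and 6
        self.U_I, self.s_I, _ = np.linalg.svd(D_I, full_matrices=False)
        self.U_E, self.s_E, _ = np.linalg.svd(D_E, full_matrices=False)
        self.k_I, self.k_E = k_I, k_E                   # steps 5 and 7
        self.n_I, self.n_E = D_I.shape[1], D_E.shape[1]
        self.n_c = len(clusters)                        # for P(D_I) = 1/n_c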
2.3.2 Automatic Classification by the DLSI Space-Based Classifier

1. A document vector is set up by generating the terms as well as their frequencies of occurrence in the document, so that a normalized document vector $N$ is obtained for the document from equation (1).

For each of the clusters of the database, repeat the procedure of items 2-4 below.

2. Using the document to be classified, construct a differential document vector $x = N - C$, where $C$ is the normalized vector giving the center or centroid of the cluster.

3. Calculate the intra-document likelihood function $P(x|D_I)$, and the extra-document likelihood function $P(x|D_E)$, for the document.

4. Calculate the Bayesian posteriori probability function $P(D_I|x)$.

5. Select the cluster having the largest $P(D_I|x)$ as the recall candidate.
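Classification then follows items 2-5 exactly (a sketch continuing the DLSIClassifier class above):

    def classify(self, N):
        """Index of the cluster with the largest P(D_I | N - C)."""
        scores = []
        for C in self.C:
            x = N - C                                   # differential vector
            p_i = likelihood(x, self.U_I, self.s_I, self.k_I, self.n_I)
            p_e = likelihood(x, self.U_E, self.s_E, self.k_E, self.n_E)
            scores.append(posteriori(p_i, p_e, self.n_c))
        return int(np.argmax(scores))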
3 Examples and Comparison
3.1 Problem Description
We demonstrate our algorithm by means of numerical examples below. Suppose we have the following 8 documents in the database:
$T_1$: Algebra and Geometry Education System.
$T_2$: The Software of Computing Machinery.
$T_3$: Analysis and Elements of Geometry.
$T_4$: Introduction to Modern Algebra and Geometry.
$T_5$: Theoretical Analysis in Physics.
$T_6$: Introduction to Elements of Dynamics.
$T_7$: Modern Alumina.
$T_8$: The Foundation of Chemical Science.
And we know in advance that they belong to 4 clusters, namely, $T_1, T_2 \in C_1$; $T_3, T_4 \in C_2$; $T_5, T_6 \in C_3$; and $T_7, T_8 \in C_4$, where $C_1$ belongs to the computer-related field, $C_2$ to Mathematics, $C_3$ to Physics, and $C_4$ to Chemical Science. We will show below, as an example, how we set up the classifier to classify the following new document:

$N$: "The Elements of Computing Science."
We should note that a conventional matching method of "common" words does not work in this example, because the words "compute" and "science" in the new document appear in $C_1$ and $C_4$ separately, while the word "elements" occurs in both $C_2$ and $C_3$ simultaneously, giving no indication of the appropriate classification candidate simply by counting the "common" words among documents. We will now set up the DLSI-based classifier and the LSI-based classifier for this example.
First, we can easily set up the document vectors of the database, giving the term by document matrix by simply counting the frequencies of occurrence; we can then further obtain the normalized form, as in Table 1.
The document vector for the new document $N$ is given by $(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$, and in normalized form by $(0, 0, 0, 0, 0.57735, 0, 0, 0.57735, 0, 0, 0, 0, 0, 0, 0.57735, 0, 0, 0)^T$.
3.2 DLSI Space-Based Classifier
The normalized form of the centroid of each cluster
is shown in Table 2.
Following the procedure of the previous section, it is easy to construct both the differential term by intra-document matrix and the differential term by extra-document matrix. Let us denote the differential term by intra-document matrix by $D_I^{18\times4} = (T_1 - C_1,\ T_3 - C_2,\ T_5 - C_3,\ T_7 - C_4)$ and the differential term by extra-document matrix by $D_E^{18\times4} = (T_2 - C_2,\ T_4 - C_3,\ T_6 - C_4,\ T_8 - C_1)$ respectively. Note that the $T_i$'s and $C_i$'s can be found in the matrices shown in Tables 1 and 2.
Now that we know $D_I$ and $D_E$, we can decompose them by the SVD algorithm into $D_I = U_I \Sigma_I V_I^T$ and $D_E = U_E \Sigma_E V_E^T$, where $\Sigma_I = \mathrm{diag}(\sigma_{I,1}, \ldots, \sigma_{I,4})$ and $\Sigma_E = \mathrm{diag}(\sigma_{E,1}, \ldots, \sigma_{E,4})$ collect the corresponding singular values in decreasing order, and

$$U_I = \begin{pmatrix}
0.25081 & 0.0449575 & -0.157836 & -0.428217 \\
0.130941 & 0.172564 & 0.143423 & 0.0844264 \\
-0.240236 & 0.162075 & -0.043428 & 0.257507 \\
-0.25811 & -0.340158 & -0.282715 & -0.166421 \\
-0.237435 & -0.125328 & 0.439997 & -0.15309 \\
0.300435 & -0.391284 & 0.104845 & 0.193711 \\
0.0851724 & 0.0449575 & -0.157836 & 0.0549164 \\
0.184643 & -0.391284 & 0.104845 & 0.531455 \\
-0.25811 & -0.340158 & -0.282715 & -0.166421 \\
0.135018 & 0.0449575 & -0.157836 & -0.0904727 \\
0.466072 & -0.391284 & 0.104845 & -0.289423 \\
-0.237435 & -0.125328 & 0.439997 & -0.15309 \\
0.296578 & 0.172564 & 0.143423 & -0.398707 \\
-0.124444 & 0.162075 & -0.043428 & -0.0802377 \\
-0.25811 & -0.340158 & -0.282715 & -0.166421 \\
-0.237435 & -0.125328 & 0.439997 & -0.15309 \\
0.0851724 & 0.0449575 & -0.157836 & 0.0549164 \\
-0.124444 & 0.162075 & -0.043428 & -0.0802377
\end{pmatrix},$$

$$V_I = \begin{pmatrix}
0.465291 & 0.234959 & -0.824889 & 0.218762 \\
-0.425481 & -2.12675\times10^{-9} & 1.6628\times10^{-9} & 0.904967 \\
-0.588751 & 0.733563 & -0.196558 & -0.276808 \\
0.505809 & 0.637715 & 0.530022 & 0.237812
\end{pmatrix},$$

$$U_E = \begin{pmatrix}
0.00466227 & -0.162108 & 0.441095 & 0.0337051 \\
-0.214681 & 0.13568 & 0.0608733 & -0.387353 \\
0.0265475 & -0.210534 & -0.168537 & -0.529866 \\
-0.383378 & 0.047418 & -0.195619 & 0.0771912 \\
0.216445 & 0.397068 & 0.108622 & 0.00918756 \\
0.317607 & -0.147782 & -0.27922 & 0.0964353 \\
0.12743 & 0.0388027 & 0.150228 & -0.240946 \\
0.27444 & -0.367204 & -0.238827 & -0.0825893 \\
-0.383378 & 0.047418 & -0.195619 & 0.0771912 \\
-0.0385053 & -0.38153 & 0.481487 & -0.145319 \\
0.19484 & -0.348692 & 0.0116464 & 0.371087 \\
0.216445 & 0.397068 & 0.108622 & 0.00918756 \\
-0.337448 & -0.0652302 & 0.351739 & -0.112702 \\
0.069715 & 0.00888817 & -0.208929 & -0.350841 \\
-0.383378 & 0.047418 & -0.195619 & 0.0771912 \\
0.216445 & 0.397068 & 0.108622 & 0.00918756 \\
0.12743 & 0.0388027 & 0.150228 & -0.240946 \\
0.069715 & 0.00888817 & -0.208929 & -0.350841
\end{pmatrix},$$

$$V_E = \begin{pmatrix}
0.200663 & 0.901144 & -0.163851 & 0.347601 \\
-0.285473 & -0.0321555 & 0.746577 & 0.600078 \\
0.717772 & -0.400787 & -0.177605 & 0.540952 \\
-0.60253 & -0.162097 & -0.619865 & 0.475868
\end{pmatrix}.$$
We now choose the number $k$ in such a way that $\sigma_k - \sigma_{k+1}$ remains sufficiently large. Let us choose $k_I = k_E = 1$ and $k_I = k_E = 3$ to test the classifier. Now, using equations (3), (4) and (5), we can calculate $P(x|D_I)$, $P(x|D_E)$ and finally $P(D_I|x)$ for each differential document vector $x = N - C_i$ ($i = 1, 2, 3, 4$), as shown in Table 3. The $C_i$ having the largest $P(D_I|N - C_i)$ is chosen as the cluster to which the new document $N$ belongs. Because both $n_I$ and $n_E$ are actually quite small, we may here set $\rho_I = \frac{1}{r_I-k_I}\sum_{i=k_I+1}^{r_I}\sigma_{I,i}^2$ and $\rho_E = \frac{1}{r_E-k_E}\sum_{i=k_E+1}^{r_E}\sigma_{E,i}^2$. The last row of Table 3 clearly shows that cluster $C_2$, that is, "Mathematics", is the best possibility regardless of whether the parameters $k_I = k_E = 1$ or $k_I = k_E = 3$ are chosen, demonstrating the robustness of the computation.
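The toy example can be replayed end to end with the sketches of Section 2 (raw occurrence counts with uniform global weights, as in Table 1; the minimal tokenizer below is our own simplification). With two documents per cluster, the differential_matrices sketch reproduces the $D_I$ and $D_E$ given above.

docs = ["Algebra and Geometry Education System",
        "The Software of Computing Machinery",
        "Analysis and Elements of Geometry",
        "Introduction to Modern Algebra and Geometry",
        "Theoretical Analysis in Physics",
        "Introduction to Elements of Dynamics",
        "Modern Alumina",
        "The Foundation of Chemical Science"]
stop = {"a", "the", "of", "and", "to", "in"}
terms = sorted({w for d in docs for w in d.lower().split() if w not in stop})

def doc_vector(text):
    words = [w for w in text.lower().split() if w not in stop]
    v = np.array([float(words.count(t)) for t in terms])
    return v / np.linalg.norm(v)

T = [doc_vector(d) for d in docs]
clusters = [np.column_stack(pair) for pair in
            ((T[0], T[1]), (T[2], T[3]), (T[4], T[5]), (T[6], T[7]))]
clf = DLSIClassifier(clusters, k_I=1, k_E=1)
print(clf.classify(doc_vector("The Elements of Computing Science")))
# expected output: 1, i.e. the Mathematics cluster C_2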
3.3 LSI Space-Based Classifier
As we have already explained in the Introduction, the LSI-based classifier works as follows: first, employ an SVD algorithm on the term by document matrix to set up an LSI space; the classification is then completed within the LSI space.
Using the LSI-based classifier, our experiment shows that it returns $C_3$, namely "Physics", as the most likely cluster to which the document $N$ belongs. This is obviously a wrong result.
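For reference, a minimal LSI-based cosine classifier of the kind described above might look as follows; the paper does not spell out its LSI configuration, so the choice of $k$ and the use of centroid cosine similarity in the LSI space are our assumptions.

def lsi_classify(X, clusters, q, k=2):
    """X: (m, N) matrix of normalized document vectors; clusters: list of
    column-index lists, one per cluster; q: normalized vector to classify."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    fold = lambda v: U[:, :k].T @ v              # fold into the LSI space
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    qk = fold(q)
    scores = [cos(fold(X[:, idx].mean(axis=1)), qk) for idx in clusters]
    return int(np.argmax(scores))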
3.4 Conclusion of the Example
For this simple example, the DLSI space-based approach finds the most reasonable cluster for the document "The Elements of Computing Science", while the LSI approach fails to do so.
4 Conclusion and Remarks
We have made use of the differential vectors of two normalized vectors, rather than the mere scalar cosine of the angle between the two vectors, in the document classification procedure, providing a more effective document classifier. Obviously, the concept of differential intra- and extra-document vectors embeds a richer meaning than the mere scalar measure of cosine, focusing on the characteristics of each document, and the new classifier demonstrates improved and more robust performance in document classification compared with the LSI-based cosine approach. Our model considers both the projections and the distances of the differential vectors to the DLSI spaces, improving the adaptability of the conventional LSI-based method to the unique characteristics of individual documents, a common weakness of global projection schemes including the LSI. The simple experiment demonstrates convincingly that our model outperforms the standard LSI space-based approach. Just as LSI has a cross-language ability, the DLSI method should also be applicable to the classification of documents in multiple languages. We have tested our method on larger collections of texts; we will give details of the results elsewhere.

References
M. Benkhalifa, A. Bensaid, and A. Mouradi. 1999.
Text categorization using the semi-supervised fuzzy c-
means algorithm. In 18th International Conference of
the North American Fuzzy Information Processing So-
ciety, pages 561–565.
Michael W. Berry, Susan T. Dumais, and G. W. O’Brien.
1995. Using linear algebra for intelligent information
retrieval. SIAM Rev., 37:573–595.
Michael W. Berry, Zlatko Drmac, and Elizabeth R. Jes-
sup. 1999. Matrices, vector spaces, and information
retrieval. SIAM Rev., 41(2):335–362.
Scott Deerwester, Susan T. Dumais, George W. Furnas,
Thomas K. Landauer, and Richard Harshman. 1990.
Indexing by latent semantic analysis. Journal of the
American Society for Information Science, 41(6):391–
407.
Jennifer Farkas. 1994. Generating document clusters us-
ing thesauri and neural networks. In Canadian Con-
ference on Electrical and Computer Engineering, vol-
ume 2, pages 710–713.
H. Hyotyniemi. 1996. Text document classification
with self-organizing maps. In STeP ’96 - Genes, Nets
and Symbols. Finnish Artificial Intelligence Confer-
ence, pages 64–72.
M. Iwayama and T. Tokunaga. 1995. Hierarchical Bayesian clustering for automatic text classification.
In Proceedings of the Fourteenth International Joint
Conference on Artificial Intelligence, volume 2, pages
1322–1327.
Wai Lam and Kon-Fan Low. 1997. Automatic document
classification based on probabilistic reasoning: Model
and performance analysis. In Proceedings of the IEEE
International Conference on Systems, Man and Cyber-
netics, volume 3, pages 2719–2723.
D. L. Lee, Huei Chuang, and K. Seamons. 1997. Docu-
ment ranking and the vector-space model. IEEE Soft-
ware, 14(2):67–75.
Wei Li, Bob Lee, Frank Krausz, and Kenan Sahin. 1991.
Text classification by a neural network. In Proceed-
ings of the Twenty-Third Annual Summer Computer
Simulation Conference, pages 313–318.
M. L. Littman, Fan Jiang, and Greg A. Keim. 1998.
Learning a language-independent representation for
terms from a partially aligned corpus. In Proceedings
of the Fifteenth International Conference on Machine
Learning, pages 314–322.
H. P. Luhn. 1958. The automatic creation of literature
abstracts. IBM Journal of Research and Development,
2(2):159–165, April.
D. Merkl. 1998. Text classification with self-organizing
maps: Some lessons learned. Neurocomputing, 21(1-
3):61–77.
B. Moghaddam and A. Pentland. 1997. Probabilistic vi-
sual learning for object representation. IEEE Trans.
Pattern Analysis and Machine Intelligence, 19(7):696–
710.
B. Moghaddam, W. Wahid, and A. Pentland. 1998.
Beyond eigenfaces: Probabilistic matching for face
recognition. In The 3rd IEEE Int’l Conference on
Automatic Face & Gesture Recognition, Nara, Japan,
April.
Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, May.
V. V. Raghavan and S. K. M. Wong. 1986. A criti-
cal analysis of vector space model for information re-
trieval. Journal of the American Society for Informa-
tion Science, 37(5):279–87.
Gerard Salton. 1983. Introduction to Modern Informa-
tion Retrieval. McGraw-Hill.
Gerard Salton. 1988. Term-weighting approaches in
automatic text retrieval. Information Processing and
Management, 24(5):513–524.
Hinrich Schütze and Craig Silverstein. 1997. Projections
for efficient document clustering. In Proceedings of
SIGIR’97, pages 74–81.
L. Sirovich and M. Kirby. 1987. Low-dimensional pro-
cedure for the characterization of human faces. Jour-
nal of the Optical Society of America A, 4(3):519–524.
Borge Svingen. 1997. Using genetic programming for
document classification. In John R. Koza, editor, Late
Breaking Papers at the 1997 Genetic Programming
Conference, pages 240–245, Stanford University, CA,
USA, 13–16 July. Stanford Bookstore.
M. Turk and A. Pentland. 1991. Eigenfaces for recogni-
tion. Journal of Cognitive Neuroscience, 3(1):71–86.
C. J. van Rijsbergen. 1979. Information retrieval. But-
terworths.
