Very Low-Dimensional Latent Semantic Indexing for Local Query Regions
Yinghui Xu Kyoji Umemura
Toyohashi Unversity of Technology Dept. of Information and Computer Sciences
1-1, Hibarigaoka, Toyohashi, Aichi,Japan
xyh@ss.ics.tut.ac.jp umemura@tutics.tut.ac.jp
Abstract
In this paper, we focus on performing
LSI on very low SVD dimensions. The
results show that there is a nearly linear
surface in the local query region. Using
low-dimensional LSI on local query re-
gion we can capture such a linear surface,
obtain much better performance than
VSM and come comparably to global
LSI. The surprisingly small requirements
of the SVD dimension resolve the com-
putation restrictions. Moreover, on the
condition that several relevant sample
documents are available, application of
low-dimensional LSI to these documents
yielded comparable IR performance to
local RF but in a different manner.
1 Introduction
The increasing size of searchable text collection
poses a great challenge to performing the informa-
tion retrieval (IR) task. Latent Semantic Index-
ing (LSI) is an enhancement of the familiar Vector
Model of IR. It satis es the IR task through discov-
ering corpus-wide word relationship based on co-
occurrence analysis of a whole collection. LSI has
been successfully applied to various document col-
lections and has achieved favorable results, some-
times outperforming VSM (Dumais, 1996). How-
ever, the principal challenges to applying LSI to
large data collections are the cost of computing and
storing SVD.
Local analysis of the information in a set of top-
ranked documents for the query is one promising
way to solve the computationally demanding IR task
for a large collection. To solve the computational
complexity of LSI, David Hull introduced one inter-
esting method, local LSI, for routing problems(Hull,
1994). The basic idea is: apply the SVD to a set of
documents known to be relevant to the query; then
all the documents in the collection can be folded into
the reduced space of those relevant documents. By
concentrating on the local space around the query re-
sults, we may be able to compute using  exible and
ef cient LSI algorithms.
In this paper we put much emphasis on local
dimensionality analysis of the local query regions
 lled with relevant documents. In ideal experimen-
tal cases, local LSI involves only the documents
known to be relevant to the query. To our surprise,
in most of our experiments, local LSI obtains its best
IR performance using just one or two SVD dimen-
sions. These interesting results moved us to try per-
forming local LSI with one or two SVD dimensions
on the top return sets of VSM in ad-hoc IR experi-
ments. We found that this worked surprisingly well.
In a practical setting, local LSI may be regarded as a
variation of pseudo relevance feedback (RF). There-
fore, the comparative results with local RF are pro-
vided in this paper as well. The experiments show
that local LSI with one or two SVD dimensions can
contribute to expanding the query information in a
manner different from traditional local RF.
This paper is organized as follows: Section 2 re-
views existing related techniques. Section 3 de-
scribes the implementation architecture of the ex-
periments and gives the experiment results. Section
4 explains the result and points out characteristic of
the local LSI. Section 5 draws the conclusions.
2 Related works
2.1 Latent Semantic Indexing
Latent semantic indexing (Berry et al., 1999) is one
kind of vector-based query-expansion methods that
use neither terms nor documents as the orthogo-
nal basis of a semantic space. Instead, it computes
the most signi cant orthogonal dimensions in the
term-document matrix of the corpus, via SVD, and
projects documents into the low rank subspace thus
found. LSI then computes semantic similarity based
on the proximity among projected vectors.
LSI uses SVD to factor the term-document
training matrix A into three factors: A =
U§V T = Udiag( 1; 2;¢¢¢; n)V T Where
U = (u1;u2;¢¢¢;um) 2 <m£m and V =
(v1;v2;¢¢¢;vn) 2 <n£n are unitary matrices (i.e.
UTU = I;V TV = I), whose columns are the
left and the right singular vectors of A respectively,
§2<m£n is a diagonal matrix whose diagonal ele-
ments are non-negative and arranged in descending
order ( 1 ‚  2 ‚ ¢¢¢ ‚  k), and p = min(m;n).
The values  1; 2;¢¢¢; p are known as the singular
values of A, and are the square roots of the eigen-
values of AAT and ATA.Suppose the rank of A is
r, then r • p and only  1 ‚  2 ‚ ¢¢¢ ‚  r are
positive, while the remaining (p-r), if r<p, singular
values are zero. In LSI retrieval, researchers are only
concerned with the  rst r singular values of A. LSI
uses the structure from SVD to obtain the reduced-
dimension form of the training matrix A as its  la-
tent semantic space. Notation for k • r, de nes the
reduced-dimension form of A to be A = U§V T =
Udiag( 1; 2;¢¢¢; k;0;¢¢¢;0)V T . That is, Ak is
obtained by discarding the r-k least signi cant sin-
gular values and the corresponding left and right
singular vectors of A (since they are now mul-
tiplied by zeros). Then, the  rst k columns of
U that correspond to the k largest singular values
of A together constitute the projection matrix for
LSI: Sim(~d;~q) = (ATk ~d) † (ATk ~q). Analogous to
VSM, the vector representation of a document is
the weighted sum of the vector representation of
its constituent terms. For document vector di and
query vector qi, ATk ~d and ATK~q are now the LSI vec-
tor representations of that document and query, re-
spectively, in the reduced-dimension vector space.
This process is known as  folding in documents (or
queries) into the training space. Actually, LSI as-
sumes that the semantic associations among terms
can be found through this one-step analysis of their
statistical usage in the collection, and they are im-
plicitly stored in the singular vectors computed by
SVD.
2.2 Relevance Feedback
A feedback query creation algorithm developed by
Rocchio (Rocchio, 1971) in the mid-1960s has, over
the years, proven to be one of the most successful
pro le learning algorithms. The algorithm is based
upon the fact that if the relevance for a query is
known, an optimal query vector will maximize the
average query-document similarity for the relevant
documents, and will simultaneously minimize the
average query-document similarity for non-relevant
documents. Rocchio shows that an optimal query
vector is the difference vector among the centroid
vectors for the relevant and non-relevant documents.
~Qo = 1R PD2Rel: ~D¡ 1N¡R PD=2Rel: ~D where R is
the number of relevant documents, and N is the to-
tal number of documents in the collection. Also, all
negative components of the resulting optimal query
are assigned a zero weight. To maintain focus of the
query, researchers have found that it is useful to in-
clude the original user-query in the feedback query
creation process. Also, coef cients have been intro-
duced in Rocchio’s formulation, which control the
contribution of the original query, the relevant docu-
ments, and the non-relevant documents to the feed-
back query. These modi cations yield the follow-
ing query reformulation function: ~Qn = fi£ ~Qo +
fl £ 1R P
D2rel
~D ¡  £ 1N¡R P
D=2rel
~D In this paper,
the experiment results based on the local RF were
performed for comparing with the results of Local
LSI. The terms in the query are reweighted using
the Rocchio formula with fi : fl :  = 1 : 1 : 0.
As for the local information relevant to the query,
they were obtained by extracting several top-ranked
documents through the VSM retrieving process in
the experiments. Jiang has ever used the similar
experiments (VSM+LSI) for Local LSI in his pa-
per  Approximate Dimension Reduction at NTCIR 
(Fan and Littmen, 2000).
document set (corpus)topic set (query)
    term (m) X doc (n) matrix
1. <docid - document vector>
2. <termid -term vector>   Qids (          )
query vector(   )
preprocess
1. stopword removal.
2. porter’s stemming.
3. smart "ltc" tw
doci, docj, ... ,dock
doc1   vector
:       :
docn  vector
creating local query region
through the identified documents
SVD
reduced feature space (          )
organized by singular vectors
 L Q S X W
 G R F  Y H F W R U
 L Q S X W
 T X H U \  Y H F W R U \ H V
query vector set
Qid1  (t1, t2, ... tr)
:
Qidk  (t1, t2, ... tr)
output the score-list for Qids
and continuefor next query
score table
Qid  Docid  score
 Q R
relevant sets of Qids
local LSI
local
KA
nkji ≤≤ ,,,1  "
ks≤≤1
sq
 G
Ks≤
( ) ( ) ( )local localT Ts k kscore qid q A d A=  G G  <
Figure 1: Implementation architecture
3 Experiment Set-up and Results
3.1 Implementation Architecture
Figure 1 shows overall the architecture of the exper-
iment. The procedure is described as follows:
1. Indexing the document collection and query
sets
2. Given a query, retrieving some document by
the relevant sets. In some cases the relevant sets
are derived from the known relevant documents
and in other cases we regarded the top returned
documents as the relevant sets.
3. performing the singular value decomposition
on documents identi ed in 2.
4. Only a few dimensions for the LSI are retained.
5. Projecting the document vectors and the query
vector into the user-cared feature space, and
then using the standard Cosine measure to get
the  nal score for this query.
6. Back to the step 2 and continue the analysis on
the next query in the same way.
Step 1 is pre-processing procedure for IR system.
Only the tag removal, upper case characters trans-
verse, stoplist removal and Porter’s stemming were
adopted in this phase (Frakes and Baeza-Yates,
1992). Next, the smart  ltc term weighting scheme
(Salton and McGill, 1983) was used to compute the
entries of the term document matrix for the collec-
tion and entries of the query vector. The second step
can be regarded as  lter container. In this paper,
the three kinds of routine schemes were performed.
In the  rst case, the local space for each query was
represented simply by all document vectors, which
have already been judged to be relevant (appearing
in the relevant judgment  le). We note that although
it is an ideal case, it may form a useful upper bound
on performance. In the second case, we assume the
condition that the user provides a reasonable num-
ber of relevant documents. In the third case, the lo-
cal space for each query was built on the top return
sets of VSM. The use of the top returned items from
VSM is similar to blind feedback or pseudo RF.
3.2 Characteristic of test collection
There are three test collections in our experiments.
Two of them, Cran eld and Medlars, are small. The
third one is a large-scale test collection, NACSIS.
The Cran eld corpus consists of 1,400 documents
on aerodynamics and 225 queries, while Medlars
consists of 1,033 medical abstracts and 30 queries.
Although these two collections are very small, they
were used extensively in the past by IR researchers.
As for the NACSIS test collection for the IR 1 & 2
(NTCIR 1&NTCIR 2) (Kando, 2001), these docu-
ments are abstracts of academic papers presented at
meetings hosted by 65 Japanese scientists and lin-
guists. In our experiments, the English Monolingual
IR was performed. This collection consists of ap-
proximately 320,000 English documents in NTCIR-
1 and NTCIR-2.
3.3 Local Routine Experiments (Ideal Case)
We  rst present the experimental results on the ideal
condition. The document vectors already judged to
be relevant to the query were used. SVD calcula-
tion are performed on the local region organized by
Table 1: Results on the Cran., Med. and NTCIR are shown in terms of ave. precision, precision at document
cutoff of 10. Results of the local LSI experiment based on three different SVD dimensions were provided.
Cran eld Medlars NTCIR (E-E) (D run)
K Avr. P-R R-p K Avr. P-R R-p K Avr. P-R R-p
VSM - 0.4148 0.3885 - 0.5306 0.5359 - 0.212 0.2277
+0% +0% +0% +0% +0% +0%
G. 200 0.4543 0.4180 80 0.6680 0.6648 - - -
LSI +9% +0% +8% +0% +26% +0% +25% +0% - - -
1 0.8833 0.8243 1 0.8946 0.8139 1 0.6997 0.6508
+113% +95% +112% +97% +69% +34% +52% +22% +230% +186%
L. 2 0.8607 0.8185 2 0.8769 0.8035 2 0.7062 0.6314
LSI +108% +90% +108% +96% +67% +32% +52% +20% +233% +177%
3 0.8585 0.8102 3 0.8726 0.8019 3 0.6934 0.6293
+107% +90% +108% +96% +68% +30% +51% +20% +228% +176%
these relevant documents with respect to its query.
The IR performance of VSM and global LSI were
regarded as the baseline for comparison. As for the
NTCIR collection, English-English Monolingual IR
was performed and we only extracted the  D (De-
scription)  eld of the topic as the query. Due to
its large size, only the result of VSM is the base-
line. Additionally, to observe the in uences of SVD
factors on the IR performance for local LSI exper-
iments, results based on LSI dimension from 1 to
3 were also provided for comparison. As we ex-
pected, the majority of experimental studies are di-
rected towards obtaining better solutions for the lo-
cal routine LSI method. In table 1, K represents
the SVD dimension for LSI analysis. As for the k
value of global LSI, it is the parameter by which
LSI yields the best IR performance. The improve-
ment in the average precision of local routine LSI
is 113, 69 and 233 percent better than that of VSM
on Cran eld, Medlars and NTCIR test collections
respectively. The improvement in average precision
of local routine LSI is 95 and 34 percent better than
that of global LSI on Cran eld with 200 SVD di-
mensions and Medlars with 80 SVD dimensions re-
spectively. Moreover, in the case of SVD factors
equal to 1, we obtain the best IR performance among
all cases on the Cran eld and Medlars. While the
NTCIR collection obtained its best IR performance
with 2 SVD dimensions, there is only a slight dif-
ference between the case with 1 singular vector and
the case with 2. Such small numbers caught our at-
tention, since they indicate that there is a nearly lin-
ear surface in the local region and that the dominant
SVD dimensions can capture such surface and yield
a good IR performance for local LSI analysis.
To clarify how the local LSI space in uences IR
performance, we projected the document vectors
onto the extracted local routine LSI space and  g-
ured out the distribution in  gure 2. The data of
plots are based on one query from the Medlars col-
lection. Only the largest singular vector was used
for the left plot, and the two largest were used for the
right. Based on the plots, we  nd that these dimen-
sions do not vary signi cantly for the non-relevant
documents, Thus, they tend to cluster around the
origin. On the other hand, the relevant document
space illustrates that local SVD factors are designed
to capture their structure.
Since the pre-judged set of documents is generally
not available for the ad-hoc query, In this paper, to
investigate the ef ciency of local LSI using very low
dimensions, we continue to do some experiments us-
ing different numbers of relevant documents, which
were selected from the relevant judgment  le. The
comparative results based on four cases in which the
SVD factors equal 1, 2 and 3, respectively, were
shown in Table 2. The second column is the condi-
tion, which means that the number of relevant doc-
uments belonging to the analyzing object (query)
should exceed the value in table. Column 3  #qry 
0 200 400 600 800 1000 1200
0.0
0.1
0.2
0.3
0.4
0.5
0.6
Doc. ID
i
n
n
e
r
 
p
r
o
d
u
c
t
 
v
a
l
u
e
 
the distribution of inner product of document vector 
with the largest singular vector
non-relevant document
query
relevant document
0.0 0.1 0.2 0.3 0.4 0.5 0.6
-0.6
-0.4
-0.2
0.0
0.2
0.4
s
e
c
o
n
d
 
f
a
c
t
o
r
largest factor
 non-relevant document
 query
 relevant document
Figure 2: Medlars: document collection distribution after represented by the Local region singular vectors.
For the left  gure the X-axis is doc.ID and Y are the inner products of the doc vector with the largest singular
vectors. X and Y coordinates on the right are the inner product of the document vectors with the  rst and
second largest singular vector, respectively.
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Medlars
A
v
e
.
 
p
r
e
c
i
s
i
o
n
-
r
e
c
a
l
l
local LSI (s=20,k=1)
global LSI (k=80)
VSM
0.0 0.2 0.4 0.6 0.8 1.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Cranfield
A
v
e
.
 
p
r
e
c
i
s
i
o
n
-
r
e
c
a
l
l
local LSI (s=3,k=2)
global LSI (k=200)
VSM
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
NTCIR
A
v
e
.
 
p
r
e
c
i
s
i
o
n
-
r
e
c
a
l
l
Local LSI (s=3,k=2)
VSM
Figure 3: the Ave. precision-recall comparison plots between the best run of local LSI with the baseline
VSM and global LSI.
indicates the numbers of queries in the test collec-
tion which satisfy the condition appearing in the sec-
ond column. The fourth column gives the parameter
indicating the number of relevant documents to be
used for creating the local space of the correspon-
dent query. As we expected, local LSI using one or
two SVD dimensions built from the  rst two singu-
lar vectors resulted in the best IR performance in the
partially ideal experiments. The comparison of the
results was shown in the Table 2.
We know that the most important step in LSI is
the phase of SVD. It requires O(k £ nz2) to  nd
the k leading eigenvectors. The parameter nz is the
non-zero entries of the term-by-document matrix.
These requirements are unacceptably high for doc-
ument data sets as the non-zero entries number tens
of thousands. According to the LSI analyzing proce-
dure, it includes the SVD phase and the subsequent
projecting treatment. For global LSI, the compu-
tation complexity can be evaluated by: O(nz2k +
#qry £ k2 £ nz2 £ qnz2) k = (100 » 300)
While our approach can be estimated by: O(#qry£
[(nz2lockloc)+(k2loc £nz2 £qnz2)]) k = 1 or 2
In the above equation,  nzloc represents the non-
zero entries of the local query region.  qnz are
non-zero entities in the query vector. The value of
 nzloc varies with the number of known relevant
documents. Note that the difference between these
two equations shows clearly that local LSI on small
SVD dimensions is much easier to compute than
global LSI. According to our observation, it is par-
ticularly fast when computing only the largest sin-
gular value.
Based on the above experiments, the interesting
results and the power of the two largest singular vec-
tors prompted us to try putting the local LSI with one
or two singular dimensions into the practical experi-
ments. In this paper, we used the simplest and most
ef cient VSM method as the initial retrieval step for
extracting the relevant information around the query.
We assume that the top-ranked documents obtained
by VSM are relevant documents. The details are in-
troduced in section 3.4.
3.4 Ad-hoc local LSI experiment
In this experiment, we note that using the top re-
turned items from VSM is sometimes called blind
feedback or pseudo RF. Hence, we borrow the idea
of local RF. The expanded query representation was
obtained by combining the original query vector
with its projecting result on the local SVD dimen-
sions. The equation for expanding the scheme is as
follows: ~qnew = ~qori +Alock (Alock )T~qori In the equa-
tion:
Alock = Ulock §lock (V lock )T
Sim(~d;~qnew)= ~d†(~qori +Ulock (§lock )2(Ulock )T~qori)
As for the parameter k, representing the SVD dimen-
sionality of the local region, we set its value equal to
1 or 2 in this experiment. At  rst, to show that local
LSI on small dimensions works well is a practical
case clearly, we gave the comparable plots between
local LSI with the baseline VSM and global LSI.
The 11ppt. average precision recall plots of local
LSI were  gured out for the three test collections in
 gure 3. The symbol s in the  gure represents the
sample size and k represents the SVD dimension.
To our satisfactions, local LSI based query expan-
sion method does much better than VSM and more
closely approaches the global LSI.
Next, to investigate the effectiveness of low di-
mensional LSI on local query region in restructur-
ing the user cared information space, local RF with
Rocchio’s weights fi : fl :  = 1 : 1 : 0, as in
Xu and Croft (Xu and Croft, 2000), was used for
comparison. Both of them were used on the same
sample documents. The difference between them is
a twofold one. First, the standard RF formula shown
in section 2.2 make use of weighting parameters for
query expansion, while our approach does not. Sec-
ondly, different combination object was used. The
local RF experiment performed in this paper makes
use of the centroid of the top s returned document
vectors. In our approach, we combine the original
query vector with its projecting results on the low
local SVD space.
Table 3 shows these results in terms of varying
feedback size with one or two SVD dimensions. The
 rst column  sample size in the table is the value of
s according to which we would select the top rank
documents . We see that local LSI outperforms lo-
cal RF for most combinations of sample size and
one or two SVD dimensions in the experiment on
Medlars. The best run on Medlars using local LSI
is 8.4% better than the best in local RF. As for the
best run on Cran eld and on NTCIR, local LSI got
comparable results with the local RF. In the exper-
iments, we note that with the increasing of sample
size, the precision of local LSI decreased more than
that of local RF. Based on our analysis, there are two
reasons for this. First, In the VSM based local LSI
experiments, we assume that the top s documents
from the initial retrieval by VSM are relevant, al-
though that assumption does not always hold. In the
case where the dominant components of the top s
return sets are non-relevant, the maintained SVD di-
mensions would deviate from the orientation that we
preferred. This will in uence the following projec-
tion procedure greatly. The average precision-recall
results of VSM on Cran eld and NTCIR is 0.38 and
0.21, respectively. Neither one is ideal. The second
factor is the characteristic of the test collection. The
number of relevant documents for query sets ranges
from 2 to 40 and from 3 to 170 for the Cran eld
and NTCIR, respectively. With such wide range of
query sets, some queries don’t have enough relevant
documents for this strategy to be feasible. There-
fore, from the experiment results, it is still reason-
able for us to believe that if several relevant sample
documents of a query are available, low-dimensional
local LSI will be able to achieve comparable perfor-
mance to local RF.
4 Analysis and discussion
4.1 Local dimensions
One important variable for LSI retrieval is the num-
ber of dimensions in the reduced space. In this pa-
per, we found that one or two SVD dimensions are
able to represent the structure of the local region
that corresponds to the user’s interests. The  rst two
largest singular vectors will represent the two major
Table 2: Ave. precision-recall comparing results
based on different SVD factors.
Coll. Cond. #qry #sel. SVD Ave.
Rel. fact. P-R
1 0.6857
10 2 0.6667
Cran. >15 27 3 0.6654
(#rel) 1 0.5749
5 2 0.5692
3 0.5641
1 0.7945
10 2 0.8007
Med. >15 25 3 0.7952
(#rel) 1 0.7160
5 2 0.7142
3 0.7137
1 0.3899
10 2 0.3987
NTCIR >15 23 3 0.3967
(#rel) 1 0.2917
5 2 0.2913
3 0.2883
interests. The local SVD dimensions built on them
have the ability to absorb the interests of a query and
have no interest in the non-relevant information. It
indicates that there is near linear surface in the local
query region. That is why local LSI works well on
small dimensions, especially on the condition that
there is only one dominant interests in the query. Of
course, in cases where there is much noisy informa-
tion in the local region, the SVD dimension may fail
to satisfy the true needs of the user. Finally, based
on the experiments in this paper, we would like to
point out that for performing SVD on a particular
local query region, the requirement of the SVD di-
mension should not be demanding. In our opinion,
2 or less is suf cient to obtain ideal IR performance.
4.2 Size of local region
The size of a local region is also one important pa-
rameter for local LSI. We did not do much analysis
on how to determine the best size of the local region
for local LSI. In the absence of any clear guidelines
now, we merely offer some suggestions and an anal-
ysis. The local region should be large enough so that
it will contain more relevant information. However,
there are also several reasons why the local region
should not be too large. Adding a large number of
non-relevant documents of marginal value will only
increase the number of LSI factors needed to de-
scribe the local region without improving their qual-
ity, and this will only degrade the IR performance.
Therefore, as for the size of the local region, it is a
tradeoff. According to the experimental results and
the analysis in section 3.4, since the local LSI does
well on one or two SVD dimensions, so as to avoid
in uences of non-relevant information brought by
more involved documents, it is better to restrict the
size of a local region below 30. In the experiments
on Medlars, local LSI produced its best run at 20 top
return documents. Of course, the threshold for the
size of local region should be collection-dependent
and experiment-determined. It may also be possible
to set the threshold by the performance of the initial
retrieval method, but we have not yet analyzed this.
4.3 Advantages
Finally, we would like to point out the advantages
of low-dimensional LSI analysis for local query re-
gion. Our results compared with VSM and global
LSI show clearly that local LSI with low dimensions
performs much better than VSM under some sample
sets and achieves the comparable IR performance to
global LSI. Additionally, because the largest singu-
lar vectors are essential for retrieval performance on
the local query regions, local LSI approaches the
computational complexity of global LSI by using
such small SVD dimensions. Despite the fact that
local LSI has increased the cost of separate SVD
computation for each query, the relative modest re-
quirements of SVD dimension make it feasible for
large scale IR task.
Compared with the local RF method, both the lo-
cal LSI and local RF achieve better results by pro-
viding high-centralized relevant information in the
local region. Provided that relevant sample docu-
ments are used with the same number, local RF is
able to make use of the combination of document
vectors and a heuristic procedure to improve IR per-
formance, while local LSI makes use of SVD to
extract the useful information from the information
space. In some sense, this SVD method is more
comprehensive than local RF.
Table 3: comparative results of Local LSI and Local
relevance feedback on the local region organized by
the return sets of VSM on Med., Cran. and NTCIR,
respectively. The SVD dimension value for the local
LSI is the one from which the best IR performance
was obtained at the speci c sample size.
#ss. svd 11 ppt. Ave. P-R R-p
(s) fac. LLSI LRF LLSI LRF
3 2 0.5858 0.5977 0.5760 0.5816
5 1 0.6417 0.6243 0.6300 0.6198
10 1 0.6577 0.6152 0.6431 0.6093
20 1 0.6764 0.6044 0.6393 0.5845
30 1 0.6598 0.5854 0.6246 0.5699
40 2 0.6514 0.5776 0.6157 0.5722
#ss. svd 11 ppt. Ave. P-R R-p
(s) fac. LLSI LRF LLSI LRF
3 2 0.4524 0.4528 0.4206 0.4186
5 2 0.4443 0.4403 0.4203 0.4145
10 2 0.4357 0.4327 0.3981 0.3988
20 2 0.3993 0.4269 0.3571 0.3870
30 2 0.3782 0.4252 0.3345 0.3915
40 2 0.3464 0.4232 0.3106 0.3873
#ss. svd 11 ppt. Ave. P-R R-p
(s) fac. LLSI LRF LLSI LRF
3 2 0.2367 0.2346 0.2380 0.2297
5 2 0.2292 0.2302 0.2341 0.2347
10 2 0.2119 0.2249 0.2205 0.2404
20 2 0.1728 0.2110 0.1800 0.2203
30 2 0.1458 0.2026 0.1575 0.2208
40 2 0.1404 0.1978 0.1470 0.2171
5 Conclusion and future work
In this paper, the results show that very low-
dimensional LSI on the local query region performs
IR task well. Such small dimensional requirements
of local LSI make it more attractive, enabling us to
better address the computation complexity. We can
perform the low-dimensional LSI on several known
relevant document spaces to obtain signi cant im-
provements in retrieval performance. Moreover,
provided that several relevant sample documents are
available, local LSI using small dimensions obtains
results comparable to the local RF although in a dif-
ferent manner. Our future work will:
1. Continue to study the optimal size of local re-
gion for local LSI so as to automatically deter-
mine it.
2. Find a more ef cient initial retrieval method
for obtaining high quality sample sets of each
query.
Acknowledgement
This work was supported in The 21st Century COE
Program  Intelligent Human Sensing , from the
ministry of Education, Culture,Sports,Science and
Technology.

References

M. W. Berry, Zlatko Drmax, and Elizabeth R. Jessup.
1999. Matrix, vector space, and information retrieval
(technical report). SIAM Review, 41:335 362.

S. T. Dumais. 1996. Using for information  ltering:
Trec-3 experiments. In In Donna K. Harman, editor,
The 3rd Text Retrieval Conference (TREC-3), pages
282 291. Department of Commerce, National Institute
of Standards and Technology.

J. Fan and M. L. Littmen. 2000. Approximate dimension
equalization in vector based information retrieval. In
Proceedings of the Seventeenth International Confer-
ence on Machine Learning. Morgan-Kauffman.

W. B. Frakes and R. Baeza-Yates. 1992. Information
retrieval - Data Structure Algorithms. Prentice Hall,
Englewood Cliffs, New Jersey 07632.

D. Hull. 1994. Improving text retrieval for the routing
problem using latent semantic indexing. In In pro-
ceedings of the 17th Annual International ACM SIGIR
Conference on Research and Development in Informa-
tion Retrieval, pages 282 291. Association for com-
puting Machnery.

N. Kando. 2001. Clir syetem evaluation at ntcir work-
shop. National Information (NII) Japan.

J. Rocchio. 1971. Relevance feedback in information
retrieval. In The Smart Retrieval System-Experiments
in Automatic Document Processing, pages 313 323.
Englewood Cliffs, NJ, 1971, Prentice-Hall, Inc.

G. Salton and M. J. McGill. 1983. Introduction to Mod-
ern Information Retrieval. McGraw-Hill, New York,
NY.

J. Xu and W. B. Croft. 2000. Improving the effec-
tiveness of informational retrieval with local context
analysis. ACM Transactions on Information Systems
(TOIS), 18(1).
