Information Retrieval Capable of Visualization and High Precision
Qing Ma1,2 and Kousuke Enomoto1
1Ryukoku University / 2NICT, Japan
qma@math.ryukoku.ac.jp
Masaki Murata and Hitoshi Isahara
NICT, Japan
{murata,isahara}@nict.go.jp
Abstract
We present a neural-network based self-organizing approach that
enables visualization of information retrieval while at the same
time improving its precision. In computer experiments,
two-dimensional documentary maps in
which queries and documents were
mapped in topological order accord-
ing to their similarities were created.
The ranking of the results retrieved us-
ing the maps was better than that of
the results obtained using a conven-
tional TFIDF method. Furthermore, the
precision of the proposed method was
much higher than that of the conven-
tional TFIDF method when the process
was focused on retrieving highly rel-
evant documents, suggesting that the
proposed method might be especially
suited to information retrieval tasks in
which precision is more critical than re-
call.
1 Introduction
Information retrieval (IR) has long been studied [e.g., (Menzel, 1966)].
Several kinds of basic retrieval models have been proposed (Salton and
Buckley, 1988), and a number of improved IR systems based on these
models have been developed by adopting various NLP techniques [e.g.,
(Evans and Zhai, 1996; Mitra et al., 1997; Mandara et al., 1998;
Murata et al., 2000)]. However, a breakthrough technique that
surpasses the TFIDF-weighted vector space model, currently the main
approach to IR, has not yet been invented, and IR is still relatively
imprecise. Presenting a large number of retrieval results to users in
a visual and intelligible form also remains a challenge.
Our aim is to develop a high-precision, visual
IR system that consists of two phases. The first
phase is carried out using conventional IR tech-
niques in which a large number of related docu-
ments are gathered from newspapers or websites
in response to a query. In the second phase the
visualization of the retrieval results and picking
are performed. The visualization process clas-
sifies the query and retrieval results and places
them on a two-dimensional map in topological
order according to the similarity between them.
To improve the precision of the retrieval process,
the picking process involves further selection of a
small number of highly relevant documents based
on the classification results produced by the visu-
alization process.
This paper presents a new approach that uses the self-organizing map
(SOM) proposed by Kohonen (Kohonen, 1997) for this second IR phase
(see footnote 1). To enable the second phase to be slotted into a
practical IR system as described above, visualization and picking
should be carried out for a single query and set of related documents.
In this paper, however, for the purpose of evaluating the proposed
system, we used correct answer data consisting of multiple queries and
related documents, as used in the 1999 IR contest IREX (Murata et al.,
2000). The procedure of the second IR phase in this paper is therefore
as follows. Given a set of queries and related documents, a
documentary map is first automatically created through
self-organization. This map provides visible and continuous retrieval
results in which all queries and documents are placed in topological
order according to their similarity (see footnote 2). The documentary
map provides users with an easy method of finding documents related to
their queries and also enables them to see the relationships between
documents with regard to the same query, or even the relationships
between documents across different queries. In addition, the documents
related to a query can be ranked by simply calculating the Euclidean
distances between the point of the query and the points of the
documents in the map and then choosing the N closest documents in
ranked order as the retrieval results for each query. If a small N is
set, then the retrieval results are limited to the most highly
relevant documents, thus improving the retrieval precision.

Footnote 1: There have been a number of studies of SOMs for data
mining and visualization [e.g., (Kohonen et al., 2000)] since WEBSOM
was developed in 1996. To our knowledge, however, these works mainly
focused on confirming the capabilities of SOMs in self-organization
and/or visualization. In this study, we slot SOM-based processing into
a practical IR system, which enables visualization of IR while at the
same time improving its precision. Another feature that distinguishes
our study from others is that we performed comparative studies with
TFIDF-based IR methods, the major approach to IR in the NLP field.

Computer experiments showed that meaningful two-dimensional
documentary maps could be created. The ranking of the results
retrieved using the map was better than that of the results obtained
using a conventional TFIDF method. Furthermore, the precision of the
proposed method
was much higher than that of the conventional
TFIDF method when the retrieval process focused
on retrieving the most highly relevant documents,
which indicates that the proposed method might
be particularly useful for picking the best docu-
ments, thus greatly improving the IR precision.
2 Self-organizing documentary maps
and ranking related documents
A SOM can be visualized as a two-dimensional array of nodes on which a
high-dimensional input vector can be mapped in an orderly manner
through a learning process. After the learning, a meaningful nonlinear
coordinate system for different input features is created over the
network. This learning process is competitive and unsupervised and is
called a self-organizing process.

Footnote 2: For a specific query, other queries and documents in the
map are considered to be irrelevant (i.e., documents unrelated to the
query). This map is therefore equivalent to a map consisting of one
query and related and unrelated documents, which will be adopted in
the practical IR system that we aim to develop.

Self-organizing documentary maps are ones in
which given queries and all related documents
in the collection are mapped in order of similar-
ity, i.e., queries and documents with similar con-
tent are mapped to (or best-matched by) nodes
that are topographically close to one another, and
those with dissimilar content are mapped to nodes
that are topographically far apart. Ranking is the procedure of
ordering the documents related to each query by calculating, on the
map, the Euclidean distances between the point of the query and the
points of the documents and choosing the N closest documents as the
retrieval result.
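The ranking step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `rank_documents` and the example coordinates are hypothetical, and the map positions are assumed to be the (row, column) coordinates of the best-matching nodes.

```python
import numpy as np

def rank_documents(query_pos, doc_positions, n):
    """Return the indices of the n documents closest to the query on the map."""
    query_pos = np.asarray(query_pos, dtype=float)
    doc_positions = np.asarray(doc_positions, dtype=float)
    # Euclidean distance between the query node and every document node.
    distances = np.linalg.norm(doc_positions - query_pos, axis=1)
    # A stable sort keeps the original order of documents at equal distance.
    order = np.argsort(distances, kind="stable")
    return order[:n].tolist()

# Hypothetical map coordinates for four documents and one query at (0, 0).
doc_positions = [(0, 1), (5, 5), (1, 1), (10, 2)]
top2 = rank_documents((0, 0), doc_positions, n=2)
```

Choosing a small `n` keeps only the documents mapped nearest to the query, which is the picking step that the approach relies on for high precision.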
2.1 Data
The queries are those used in a dry run of the
1999 IREX contest and the documents relating to
the queries are original Japanese newspaper arti-
cles used in the contest as the correct answers. In
this study, only nouns (including Japanese verbal
nouns) were selected for use.
2.2 Data coding
Suppose we have a set of queries:
$$Q = \{Q_i \ (i = 1, \cdots, q)\}, \tag{1}$$
where $q$ is the total number of queries, and a set of documents:
$$A = \{A_{ij} \ (i = 1, \cdots, q; \ j = 1, \cdots, a_i)\}, \tag{2}$$
where $a_i$ is the total number of documents related to $Q_i$. For
simplicity, where there is no need to distinguish between queries and
documents, we use the same term "documents" and the same notation
$D_i$ to represent either a query $Q_i$ or a document $A_{ij}$. That
is, we define a new set
$$D = \{D_i \ (i = 1, \cdots, d)\} = Q \cup A \tag{3}$$
which includes all queries and documents. Here, $d$ is the total
number of queries and documents, i.e.,
$$d = q + \sum_{i=1}^{q} a_i. \tag{4}$$
Each document, $D_i$, can then be defined by the set of nouns it
contains as
$$D_i = \{\mathrm{noun}^{(i)}_1, w^{(i)}_1, \cdots, \mathrm{noun}^{(i)}_{n_i}, w^{(i)}_{n_i}\}, \tag{5}$$
where $\mathrm{noun}^{(i)}_k$ ($k = 1, \cdots, n_i$) are all different
nouns in the document $D_i$ and $w^{(i)}_k$ is a weight representing
the importance of $\mathrm{noun}^{(i)}_k$ ($k = 1, \cdots, n_i$) in
document $D_i$. The weights are computed by their tf or tfidf values.
That is,
$$w^{(i)}_j = \mathrm{tf}^{(i)}_j \ \text{or} \ \mathrm{tf}^{(i)}_j \, \mathrm{idf}_j. \tag{6}$$
In the case of using tf, the weights are normalized such that
$$w^{(i)}_1 + \cdots + w^{(i)}_{n_i} = 1. \tag{7}$$
Also, when using the Japanese thesaurus, Bun-
rui Goi Hyou (The National Institute for Japanese
Language, 1964) (BGH for short), synonymous
nouns in the queries are added to the sets of
nouns from the queries shown in Eq. (5) and their
weights are set to be the same as those of the orig-
inal nouns.
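As a rough illustration of Eqs. (5)-(7) and of the thesaurus expansion, the following Python sketch computes noun weights for one document. The paper does not state its exact idf formula, so `log(n_docs / doc_freq)` is an assumption here, and the function names and the tiny `thesaurus` dictionary are hypothetical stand-ins (the real BGH is not used).

```python
import math
from collections import Counter

def tf_weights(nouns):
    """tf weights normalized to sum to 1, as in Eq. (7)."""
    counts = Counter(nouns)
    total = sum(counts.values())
    return {noun: count / total for noun, count in counts.items()}

def tfidf_weights(nouns, doc_freq, n_docs):
    """tf * idf weights; the idf form log(N / df) is an assumption."""
    counts = Counter(nouns)
    return {noun: count * math.log(n_docs / doc_freq[noun])
            for noun, count in counts.items()}

def add_synonyms(weights, thesaurus):
    """Add each noun's synonyms with the same weight as the original noun."""
    expanded = dict(weights)
    for noun, weight in weights.items():
        for synonym in thesaurus.get(noun, ()):
            # Keep the existing weight if the synonym is already present.
            expanded.setdefault(synonym, weight)
    return expanded

w = tf_weights(["tax", "reform", "tax", "budget"])
expanded = add_synonyms(w, {"tax": ["levy"]})  # hypothetical thesaurus entry
t = tfidf_weights(["tax", "tax", "budget"], {"tax": 2, "budget": 4}, n_docs=4)
```

Note that a noun occurring in every document receives a tfidf weight of zero under this idf form, so it contributes nothing to the similarity computed later.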
Suppose we have a correlative matrix whose element $d_{ij}$ is some
metric of correlation, or a similarity distance, between the documents
$D_i$ and $D_j$; i.e., the smaller the $d_{ij}$, the more similar the
two documents. We can then code document $D_i$ with the elements in
the $i$-th row of the correlative matrix as
$$V(D_i) = [d_{i1}, d_{i2}, \cdots, d_{id}]^T. \tag{8}$$
The vector $V(D_i) \in \mathbb{R}^d$ is the input to the SOM.
Therefore, the method to compute the similarity distance $d_{ij}$ is
the key to creating the maps. Note
that the individual dij of vector V(Di) only re-
flects the relationships between a pair of docu-
ments when they are considered independently.
To establish the relationships between the doc-
ument Di and all other documents, representa-
tions such as vector V(Di) are required. Even
if we have these high-dimensional vectors for
all the documents, it is still difficult to estab-
lish their global relationships. We therefore need
to use an SOM to reveal the relationships be-
tween these high-dimensional vectors and repre-
sent them two-dimensionally. In other words, the
role of the SOM is merely to self-organize vec-
tors; the quality of the maps created depends on
the vectors provided.
In computing the similarity distance dij be-
tween documents, we take two factors into ac-
count: (1) the larger the number of common
nouns in two documents, the more similar the two
documents should be (i.e., the shorter the simi-
larity distance); (2) the distance between any two
queries should be based on their application to the
IR processing; i.e., by considering the procedure
used to rank the documents relating to each query
from the map. For this reason, the document-
similarity distance between queries should be set
to the largest value. To satisfy these two factors,
$d_{ij}$ is calculated as follows:
$$d_{ij} = \begin{cases}
1 & \text{if both } D_i \text{ and } D_j \text{ are queries} \\
1 - \dfrac{|C_{ij}|}{|D_i| + |D_j| - |C_{ij}|} & \text{if not the case mentioned above and } i \neq j \\
0 & \text{if } i = j
\end{cases} \tag{9}$$
where $|D_i|$ and $|D_j|$ are values (the numbers of elements) of sets
of documents $D_i$ and $D_j$ defined by Eq. (5) and $|C_{ij}|$ is the
value of the intersection $C_{ij}$ of the two sets $D_i$ and $D_j$.
$|C_{ij}|$ is therefore some metric of document similarity (the
inverse of the similarity distance $d_{ij}$) between documents $D_i$
and $D_j$, which is normalized by $|D_i| + |D_j| - |C_{ij}|$. Before
describing the methods
for computing them, we first rewrite the definition
of documents given by Eq. (5) for Di and Dj as
follows.
$$D_i = \{(c_1, w^{(i)}_{c_1}, \cdots, c_l, w^{(i)}_{c_l}), (n^{(i)}_1, w^{(i)}_1, \cdots, n^{(i)}_{m_i}, w^{(i)}_{m_i})\}, \tag{10}$$
and
$$D_j = \{(c_1, w^{(j)}_{c_1}, \cdots, c_l, w^{(j)}_{c_l}), (n^{(j)}_1, w^{(j)}_1, \cdots, n^{(j)}_{m_j}, w^{(j)}_{m_j})\}, \tag{11}$$
where $c_k$ ($k = 1, \cdots, l$) are the common nouns of documents
$D_i$ and $D_j$ and $n^{(i)}_k$ ($k = 1, \cdots, m_i$) and $n^{(j)}_k$
($k = 1, \cdots, m_j$) are nouns of documents $D_i$ and $D_j$ which
differ from each other. By comparing Eq. (5) and Eqs. (10) and (11),
we know that $l + m_i = n_i$ and $l + m_j = n_j$. Thus, $|D_i|$ (or
$|D_j|$) of Eq. (9) can be calculated as follows.
$$|D_i| = \sum_{k=1}^{l} w^{(i)}_{c_k} + \sum_{k=1}^{m_i} w^{(i)}_k. \tag{12}$$
For calculating $|C_{ij}|$, on the other hand, since the weights (of
either common or different nouns) generally differ between two
documents, we devised four methods which are expressed as follows.
Method A:
$$|C_{ij}| = \sum_{k=1}^{l} \max(w^{(i)}_{c_k}, w^{(j)}_{c_k}). \tag{13}$$
Method B:
$$|C_{ij}| = \sum_{k=1}^{l} \frac{w^{(i)}_{c_k} + w^{(j)}_{c_k}}{2}. \tag{14}$$
Method C:
$$|C_{ij}| = \begin{cases}
\sum_{k=1}^{l} \max(w^{(i)}_{c_k}, w^{(j)}_{c_k}) & \text{if one is a query and the other is a document} \\
\sum_{k=1}^{l} \dfrac{w^{(i)}_{c_k} + w^{(j)}_{c_k}}{2} & \text{if both are documents}
\end{cases} \tag{15}$$
Method D:
$$|C_{ij}| = \begin{cases}
\sum_{k=1}^{l} \max(w^{(i)}_{c_k}, w^{(j)}_{c_k}) & \text{if one is a query and the other is a document} \\
\sum_{k=1}^{l} \min(w^{(i)}_{c_k}, w^{(j)}_{c_k}) & \text{if both are documents}
\end{cases} \tag{16}$$
Note that we need not consider the case where both are queries for
calculating $|C_{ij}|$ because this has been considered independently
as shown by Eq. (9).
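To make Eqs. (8), (9) and (12)-(16) concrete, here is a minimal Python sketch that builds the correlative matrix whose rows are the SOM inputs. It assumes each document is held as a dictionary mapping nouns to weights; the function names and example documents are hypothetical, and the guard for a zero denominator is our own assumption rather than something the paper specifies.

```python
import numpy as np

def set_size(doc):
    """|D_i| of Eq. (12): the sum of all noun weights in the document."""
    return sum(doc.values())

def common_size(di, dj, method, one_is_query):
    """|C_ij| of Eqs. (13)-(16), summed over the nouns shared by both."""
    pairs = [(di[c], dj[c]) for c in di.keys() & dj.keys()]
    if method == "A" or (method in ("C", "D") and one_is_query):
        return sum(max(a, b) for a, b in pairs)
    if method in ("B", "C"):
        return sum((a + b) / 2 for a, b in pairs)
    if method == "D":
        return sum(min(a, b) for a, b in pairs)
    raise ValueError(f"unknown method: {method}")

def distance_matrix(docs, is_query, method="C"):
    """Correlative matrix of Eq. (9); row i is the SOM input V(D_i) of Eq. (8)."""
    d = len(docs)
    dist = np.zeros((d, d))  # the diagonal stays 0 (case i = j)
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            if is_query[i] and is_query[j]:
                dist[i, j] = 1.0  # query pairs get the largest distance
                continue
            c = common_size(docs[i], docs[j], method, is_query[i] or is_query[j])
            denom = set_size(docs[i]) + set_size(docs[j]) - c
            # Assumption: treat two empty documents as maximally distant.
            dist[i, j] = 1.0 - c / denom if denom > 0 else 1.0
    return dist

# Hypothetical toy data: two queries followed by two documents.
docs = [
    {"tax": 0.5, "reform": 0.5},
    {"election": 1.0},
    {"tax": 0.5, "budget": 0.5},
    {"tax": 0.25, "vote": 0.75},
]
is_query = [True, True, False, False]
dist = distance_matrix(docs, is_query, method="D")
```

In this toy example the two documents share only the noun "tax", so method D gives them a distance of 6/7, while the first query is at distance 2/3 from each of them.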
3 Experimental Results
3.1 Data
Six queries $Q_i$ ($i = 1, \cdots, q$, $q = 6$) and 433 documents
$A_{ij}$ ($i = 1, \cdots, q$, $q = 6$, $j = 1, \cdots, a_i$, and
$\sum_{i=1}^{q} a_i = 433$) used in the dry run of the 1999 IREX
contest were used for our experiments. The distribution of these
documents, i.e., the number $a_i$ ($i = 1, \cdots, q$, $q = 6$) of
documents related to each query, is shown in Table 1.

Table 1: Distribution of documents used in the experiments
a_1   a_2   a_3   a_4   a_5   a_6   sum_{i=1}^{6} a_i
80    89    42    108   49    65    433

It should be noted that the proposed approach is intended to serve as
the second phase of a practical IR system, by which point only a small
number of related documents (say, below 1,000, or even below 500) will
have been collected. The scale of this experiment is therefore a
practical one.
3.2 SOM
We used a SOM with a 40×40 two-dimensional array. Since the total
number $d$ of queries and documents to be mapped was 439, i.e.,
$d = q + \sum_{i=1}^{6} a_i = 439$, the number of dimensions of the
input, $n$, was 439. In the ordering phase, the number of learning
steps $T$ was set at 10,000, the initial value of the learning rate
$\alpha(0)$ at 0.1, and the initial radius of the neighborhood
$\sigma(0)$ at 30. In the fine adjustment phase, $T$ was set at
15,000, $\alpha(0)$ at 0.01, and $\sigma(0)$ at 5. The initial
reference vectors $m_i(0)$ consisted of random values between 0 and
1.0.
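The two training phases can be sketched as follows. This is a minimal online SOM written for illustration, not the implementation used in the experiments: the paper does not give the decay schedules or the neighborhood function, so a linearly decaying learning rate, a radius decaying linearly to 1, and a Gaussian neighborhood are all assumptions. The toy sizes are chosen only so that the sketch runs quickly.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid position (row, col) of the node whose reference vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_phase(weights, data, steps, alpha0, sigma0, rng):
    """Run one training phase in place (assumed schedules, see lead-in)."""
    rows, cols, _ = weights.shape
    # Grid coordinates of every node, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(steps):
        frac = 1.0 - t / steps
        alpha = alpha0 * frac                # learning rate decays to 0
        sigma = 1.0 + (sigma0 - 1.0) * frac  # radius decays to 1
        x = data[rng.integers(len(data))]    # pick a random input vector
        bmu = best_matching_unit(weights, x)
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
        weights += alpha * h[..., None] * (x - weights)
    return weights

# Toy stand-in for the 439 x 439 correlative matrix and the 40 x 40 map.
rng = np.random.default_rng(0)
data = rng.random((20, 20))
weights = rng.random((8, 8, 20))  # reference vectors drawn from [0, 1)
train_phase(weights, data, steps=200, alpha0=0.1, sigma0=6.0, rng=rng)   # ordering
train_phase(weights, data, steps=300, alpha0=0.01, sigma0=2.0, rng=rng)  # fine adjustment
positions = [best_matching_unit(weights, v) for v in data]
```

With the settings reported above, the same calls would use a 40×40×439 weight array, 10,000 steps with $\alpha(0) = 0.1$ and $\sigma(0) = 30$ for ordering, and 15,000 steps with $\alpha(0) = 0.01$ and $\sigma(0) = 5$ for fine adjustment.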
3.3 Results
We first performed a preliminary experiment and analysis to determine
which of the four methods shown in Eqs. (13)-(16) was the optimal one
for calculating $|C_{ij}|$. Table 2 shows the IR precision, i.e., the
precision of the ranking results obtained from the self-organized
documentary maps created using the four methods. The IR precision was
calculated as follows.
$$P = \frac{1}{q} \sum_{i=1}^{q} \frac{\#\,\text{related to } Q_i \text{ in the retrieved } a_i \text{ documents}}{a_i}, \tag{17}$$
where $q$ is the total number of queries, # means number, and $a_i$ is
the total number of documents related to $Q_i$ as shown in Table 1.
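Eq. (17) can be written as a short Python function. The data layout (dictionaries keyed by query id) and the name `ir_precision` are assumptions made for this sketch.

```python
def ir_precision(ranked, relevant):
    """Eq. (17): mean over queries of the fraction of related documents
    found among the top a_i retrieved, where a_i = len(relevant[query])."""
    total = 0.0
    for query, related in relevant.items():
        a_i = len(related)
        retrieved = ranked[query][:a_i]
        total += sum(1 for doc in retrieved if doc in related) / a_i
    return total / len(relevant)

# Hypothetical example with two queries.
relevant = {"Q1": {"a", "b"}, "Q2": {"c"}}
ranked = {"Q1": ["a", "x", "b"], "Q2": ["c", "a"]}
p = ir_precision(ranked, relevant)
```

Because exactly $a_i$ documents are retrieved for each query, precision and recall coincide in this measure.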
Table 2: IR precision for the four methods for calculating |C_ij|
Weight   Method A   Method B   Method C   Method D
tf       0.33       0.20       0.41       0.45
tfidf    0.85       0.76       0.91       0.78

In the case of using tf values as weights of nouns, method B obviously
did not work. Although the similarity distance between queries was
forcibly set to the largest value, all six queries were
mapped in almost the same position, thus produc-
ing the poorest result. We consider the reason for
this was as follows. In general, the number of
words in a query is much smaller than the num-
ber of words in the documents, and the number
of queries is much smaller than the number of
documents collected. As described in section 2,
each query was defined by a vector consisting of
all similarities between the query and five other
queries and all documents in the collection. We
think that using the average weights of words ap-
pearing in the queries and documents to calculate
the similarities between queries and documents,
as in method B, tends to produce similar vectors
for the queries. All of these query vectors are then
mapped to almost the same position. With coding
method A, because the larger of the two weights of a query and a
document is used, the same problem could also arise in practice. There
were no essential differences between coding methods C and D, which
were almost equally precise. Neither of these methods has the
shortcomings described above for methods A and B. However, when tfidf
values were used as the weights of the nouns, even methods A and B
worked quite well. Therefore, if we use tfidf values as the weights of
the nouns, we may use any of the four methods. Based on this analysis
and the preliminary experimental result that methods D and C had the
highest precision when tf and tfidf values, respectively, were used as
the weights of the nouns, we used methods C and D for calculating
$|C_{ij}|$ in all the remaining experiments.
Table 3 shows the IR precision obtained using various methods. From
this table we can see that the proposed method in the case of SOM
(w=tfidf, C), i.e., using method C for calculating $|C_{ij}|$, using
tfidf values as the weights of nouns, and not using the Japanese
thesaurus (BGH); in the case of SOM (w=tfidf, D), i.e., using method
D, using tfidf values, and not using the BGH; and in the case of SOM
(w=tfidf, C, BGH), i.e., using method C, using tfidf values, and using
the BGH, produced the highest, second highest, and third highest
precision, respectively, of all the methods including the conventional
TFIDF method. When the BGH was used, however, the IR precision of the
proposed method dropped, whereas that of the conventional TFIDF method
improved. The lower precision of the proposed method when using the
BGH might be due to the calculation of the denominator of Eq. (9);
this will be investigated in a future study.

Table 3: IR precision obtained using various methods
Method                   Precision
TFIDF                    0.67
TFIDF (BGH)              0.75
SOM (w=tf, D)            0.45
SOM (w=tfidf, C)         0.91
SOM (w=tfidf, C, BGH)    0.77
SOM (w=tfidf, D)         0.78
SOM (w=tfidf, D, BGH)    0.73

Table 4: IR precision for top N related documents
N    TFIDF   TFIDF (BGH)   SOM (w=tf, D)   SOM (w=tfidf, C)   SOM (w=tfidf, C, BGH)   SOM (w=tfidf, D)   SOM (w=tfidf, D, BGH)
10   0.83    0.88          0.75            1.0                0.97                    1.0                0.97
20   0.79    0.86          0.68            0.99               0.95                    0.98               0.97
30   0.73    0.84          0.62            0.99               0.94                    0.97               0.91
40   0.71    0.82          0.58            0.98               0.90                    0.97               0.87

Table 4 shows the IR precision obtained using
various methods when the retrieval process is fo-
cused on the top N related documents. From this
table we can see that the IR precision of the pro-
posed method, no matter whether the BGH was
used or not, or whether method C or D was used
for calculating jCijj, was much higher than that
of the conventional TFIDF method when the pro-
cess was focused on retrieving the most relevant
documents. This result demonstrated that the pro-
posed method might be especially useful for pick-
ing highly relevant documents, thus greatly im-
proving the precision of IR.
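The top-N evaluation reported in Table 4 can be sketched as below. The paper does not spell out the formula for this measure, so averaging the per-query precision at N over all queries is an assumption, as are the function name and the example data.

```python
def precision_at_n(ranked, relevant, n):
    """Mean over queries of the fraction of the top n results that are related.

    Assumption: each query has at least n ranked documents.
    """
    scores = []
    for query, related in relevant.items():
        top = ranked[query][:n]
        scores.append(sum(1 for doc in top if doc in related) / n)
    return sum(scores) / len(scores)

# Hypothetical example with two queries.
relevant = {"Q1": {"a", "b"}, "Q2": {"c", "d"}}
ranked = {"Q1": ["a", "x", "b", "y"], "Q2": ["c", "d", "x", "y"]}
p2 = precision_at_n(ranked, relevant, n=2)
```

Unlike Eq. (17), the cutoff here is the same for every query, which is why the measure rewards methods that place related documents at the very top of the ranking.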
Figure 1 shows the top-left area of a self-organized documentary map
obtained using the proposed method in the case of SOM (w=tfidf, D)
(see footnote 3). From this map, we can see that query $Q_4$ and its
related documents $A_{4*}$ (where * denotes an Arabic numeral), and
$Q_2$ and its related documents $A_{2*}$, were mapped in positions
near each other. Similar results were obtained for the other queries,
which were not mapped in the area of the figure. This map provides
visible and continuous retrieval results in which all queries and
documents are placed in topological order according to their
similarities. The map provides an easy way of finding documents
related to queries and also shows the relationships between documents
with regard to the same query and even the relationships between
documents across different queries.

Figure 1: Top-left area of self-organized documentary map

Footnote 3: Note that the map obtained using the proposed method in
the case of SOM (w=tfidf, C), which had the highest IR precision, was
better than this.

Finally, it should be noted that each map, consisting of 400 to 500
documents, was obtained in 10 minutes using a personal computer with a
3 GHz Pentium 4 CPU.
4 Conclusion
This paper described a neural-network based self-organizing approach
that enables information retrieval to be visualized while improving
its precision. This approach can be put to practical use by slotting
it into an IR system as the second-phase processor. Computer
experiments of practical scale showed that two-dimensional
documentary maps in which queries and documents are
similarities can be created and that the ranking
of the results retrieved using the created maps
is better than that produced using a conventional
TFIDF method. Furthermore, the precision of the
proposed method was much higher than that of
the conventional TFIDF method when the pro-
cess was focused on retrieving the most relevant
documents, suggesting that the proposed method
might be especially suited to information retrieval
tasks in which precision is more important than
recall.
In future work, we first plan to re-confirm the
effectiveness of using the BGH and to further im-
prove the IR accuracy of the proposed method.
We will then begin developing a practical IR sys-
tem capable of visualization and high precision
using a two-phase IR procedure. In the first phase,
a large number of related documents are gath-
ered from newspapers or websites in response to
a query presented using conventional IR; the sec-
ond phase involves visualization of the retrieval
results and picking the most relevant results.
References
H. Menzel. 1966. Information needs and uses in science and
technology. Annual Review of Information Science and Technology, 1,
pp. 41-69.
G. Salton and C. Buckley. 1988. Term-weighting approaches in
automatic text retrieval. Information Processing & Management,
24(5), pp. 513-523.
D. A. Evans and C. Zhai. 1996. Noun-phrase analysis in unrestricted
text for information retrieval. ACL'96, pp. 17-24.
M. Mitra, C. Buckley, A. Singhal, and C. Cardie. 1997. An analysis of
statistical and syntactic phrases. RIAO'97, pp. 200-214.
R. Mandara, T. Tokunaga, and H. Tanaka. 1998. The use of WordNet in
information retrieval. COLING-ACL'98 Workshop: Usage of WordNet in
Natural Language Processing Systems, pp. 31-37.
M. Murata, Q. Ma, K. Uchimoto, H. Ozaku, M. Uchiyama, and H. Isahara.
2000. Japanese probabilistic information retrieval using location
and category information. IRAL'2000.
T. Kohonen. 1997. Self-organizing maps. Springer, 2nd Edition.
T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero,
and A. Saarela. 2000. Self organization of a massive document
collection. IEEE Trans. Neural Networks, 11(3), pp. 574-585.
The National Institute for Japanese Language. 1964. Bunrui Goi Hyou
(Japanese Thesaurus). Dainippon-tosho.
