Searching for Topics in a Large Collection of Texts
Martin Holub    Jiří Semecký    Jiří Diviš
Center for Computational Linguistics
Charles University, Prague
{holub|semecky}@ufal.mff.cuni.cz
jiri.divis@atlas.cz
Abstract
We describe an original method that
automatically finds specific topics in a
large collection of texts. Each topic is
first identified as a specific cluster of
texts and then represented as a virtual
concept, which is a weighted mixture of
words. Our intention is to employ these
virtual concepts in document indexing.
In this paper we show some preliminary
experimental results and discuss direc-
tions of future work.
1 Introduction
In the field of information retrieval (for a detailed
survey see e.g. (Baeza-Yates and Ribeiro-Neto,
1999)), document indexing and representing documents
as vectors belong among the most successful
techniques. Within the framework of the
well known vector model, the indexed elements
are usually individual words, which leads to high
dimensional vectors. However, several approaches
try to reduce the high dimensionality of the
vectors in order to improve the effectiveness
of retrieval. The most famous is probably
the method called Latent Semantic Indexing (LSI),
introduced by Deerwester et al. (1990), which em-
ploys a specific linear transformation of original
word-based vectors using a system of “latent se-
mantic concepts”. Two other approaches that
inspired us, namely (Dhillon and Modha, 2001)
and (Torkkola, 2002), are similar to LSI but
differ in how they project the document vectors
into a lower-dimensional space.
Our idea is to establish a system of “virtual
concepts”, which are linear functions represented
by vectors, extracted from automatically discov-
ered “concept-formative clusters” of documents.
Briefly speaking, concept-formative clusters are
semantically coherent, specific sets of documents,
which represent specific topics. This idea
was originally proposed by Holub (2003), who
hypothesizes that concept-oriented vector models
of documents based on indexing virtual concepts
could improve the effectiveness of both automatic
comparison of documents and their matching with
queries.
The paper is organized as follows. In section 2
we formalize the notion of concept-formative clus-
ters and give a heuristic method of finding them.
Section 3 first introduces virtual concepts in a
formal way and shows an algorithm to construct
them; then some experiments are shown. In
section 4 we compare our model with another
approach and give a brief survey of some open
questions. Finally, a short summary is given in
section 5.
2 Concept-formative clusters
2.1 Graph of a text collection
Let $D = \{d_1, d_2, \ldots, d_n\}$ be a collection of text
documents; $n$ is the size of the collection. Now
suppose that we have a function $\mathit{sim}(d_i, d_j) =
\mathit{sim}(d_j, d_i) \in [0, 1]$, which gives a degree of
document similarity for each pair of documents.
Then we represent the collection as a graph.

Definition: A labeled graph $G$ is called a graph of
collection $D$ if $G = (D, E)$, where

$E = \{\,\{d_i, d_j\} : i \neq j,\ \mathit{sim}(d_i, d_j) \geq w_0\,\}$

and each edge $e = \{d_i, d_j\} \in E$ is labeled by the
number $w(e) = \mathit{sim}(d_i, d_j)$, called the weight of $e$;
$w_0 > 0$ is a given document similarity threshold
(i.e. a threshold weight of edge).
Now we introduce some terminology and necessary
notation. Let $G = (D, E)$ be a graph of collection
$D$. Each subset $X \subseteq D$ is called a cut of $G$;
$\bar{X}$ stands for the complement $D \setminus X$. If $U, V \subseteq D$
are disjoint cuts then

- $E(U) = \{e : e \in E,\ e \subseteq U\}$ is the set of edges within cut $U$;
- $w(U) = \sum_{e \in E(U)} w(e)$ is called the weight of cut $U$;
- $E(U, V) = E(U \cup V) \setminus (E(U) \cup E(V))$ is the set of edges between cuts $U$ and $V$;
- $w(U, V) = \sum_{e \in E(U,V)} w(e)$ is called the weight of the connection between cuts $U$ and $V$;
- $\bar{w} = w(D) / \binom{n}{2}$ is the expected weight of an edge in graph $G$;
- $\bar{w}(X) = \bar{w} \cdot \binom{|X|}{2}$ is the expected weight of cut $X$;
- $\bar{w}(X, \bar{X}) = \bar{w} \cdot |X| \cdot (n - |X|)$ is the expected weight of the connection between cut $X$ and the rest of the collection;
- each cut $X$ naturally splits the collection into three disjoint subsets $D = X \cup \Gamma_X \cup \Delta_X$, where $\Gamma_X = \{d \in D \setminus X : E(\{d\}, X) \neq \emptyset\}$ and $\Delta_X = D \setminus (X \cup \Gamma_X)$.
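To make these definitions concrete, here is a minimal Python sketch of the graph quantities above, assuming documents are indexed by integers and sim is any symmetric similarity function in [0, 1]; the function names (build_graph, weight, connection) are illustrative, not from the paper.

    from itertools import combinations

    def build_graph(docs, sim, w0):
        # E: edges between all pairs whose similarity reaches the threshold w0
        return {frozenset((i, j)): sim(docs[i], docs[j])
                for i, j in combinations(range(len(docs)), 2)
                if sim(docs[i], docs[j]) >= w0}

    def weight(edges, U):
        # w(U): total weight of edges lying entirely within cut U
        return sum(wt for e, wt in edges.items() if e <= U)

    def connection(edges, U, V):
        # w(U, V): total weight of edges running between cuts U and V
        return sum(wt for e, wt in edges.items()
                   if len(e & U) == 1 and len(e & V) == 1)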
2.2 Quality of cuts
Now we formalize the property of “being
concept-formative” by a positive real function
called the quality of cut. A high value of quality
means that the cut is both specific and extensive.

A cut $X$ is called specific if (i) the weight
$w(X)$ is relatively high and (ii) the connection
between $X$ and the rest of the collection,
$w(X, \bar{X})$, is relatively small. The first property
is called compactness of cut, defined as
$\mathrm{com}(X) = w(X)/\bar{w}(X)$, while the other is
called exhaustivity of cut, defined as
$\mathrm{exh}(X) = \bar{w}(X, \bar{X})/w(X, \bar{X})$. Both functions
are positive.
Thus, the specificity of cut $X$ can be formalized
by the following formula:

$\left( \frac{w(X)}{\bar{w}(X)} \right)^{\lambda_1} \cdot \left( \frac{\bar{w}(X, \bar{X})}{w(X, \bar{X})} \right)^{\lambda_2}$

— the greater this value, the more specific the
cut $X$; $\lambda_1$ and $\lambda_2$ are positive parameters, which
are used for balancing the two factors.

The extensity of cut $X$ is defined as the positive
function $\mathrm{ext}(X) = \log_{s_0} |X|$, where $s_0$ is a
threshold size of cut.
Definition: The total quality of cut $Q(X)$ is a
positive real function composed of all the factors
mentioned above and is defined as

$Q(X) = \mathrm{com}(X)^{\lambda_1} \cdot \mathrm{exh}(X)^{\lambda_2} \cdot \mathrm{ext}(X)^{\lambda_3},$

where the three lambdas are parameters whose
purpose is balancing the three factors.
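A small sketch of the quality function under these definitions might look as follows; it reuses weight and connection from the earlier sketch, and the default λ values and logarithm base s0 are placeholders, not the trained values.

    import math

    def quality(edges, X, n, lambdas=(1.0, 1.0, 1.0), s0=2):
        # Q(X) = com(X)^λ1 · exh(X)^λ2 · ext(X)^λ3, assuming X has at
        # least two documents, some internal edges and some edges leaving it
        lam1, lam2, lam3 = lambdas
        w_bar = sum(edges.values()) / (n * (n - 1) / 2)   # expected edge weight
        exp_within = w_bar * len(X) * (len(X) - 1) / 2    # w̄ · C(|X|, 2)
        exp_between = w_bar * len(X) * (n - len(X))       # w̄ · |X| · (n − |X|)
        com = weight(edges, X) / exp_within               # compactness
        exh = exp_between / connection(edges, X, set(range(n)) - X)
        ext = math.log(len(X), s0)                        # extensity
        return com ** lam1 * exh ** lam2 * ext ** lam3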
To be concept-formative, a cut (i) must have a
sufficiently high quality and (ii) must be locally
optimal.
2.3 Local optimization of cuts
A cut $X \subseteq D$ is called locally optimal regarding
quality function $Q$ if each cut $X'$ which is
only a small modification of the original $X$ does
not have greater quality, i.e. $Q(X') \leq Q(X)$.
Now we describe a local search procedure
whose purpose is to optimize any input cut $X$;
if $X$ is not locally optimal, the output of the
Local Search procedure is a locally optimal
cut $X^*$ which results from the original $X$ as its
local modification. First we need the following
definition:

Definition: The potential of document $d \in D$ with
respect to cut $X \subseteq D$ is a real function
$\pi(d, X): D \times P(D) \to \mathbb{R}$ defined as

$\pi(d, X) = Q(X \cup \{d\}) - Q(X \setminus \{d\}).$
The Local Search procedure is described in
Fig. 1. Note that

1. Local Search gradually generates a sequence
of cuts $X^{(0)}, X^{(1)}, X^{(2)}, \ldots$ so that
Input:  the graph of text collection G;
        an initial cut X^(0) ⊆ D.
Output: a locally optimal cut X*.
Algorithm:
    t := 0
loop:
    d := argmax_{x ∈ Γ_{X^(t)}} π(x, X^(t))
    if π(d, X^(t)) > 0 then {
        X^(t+1) := X^(t) ∪ {d};  t := t + 1
        goto loop
    }
    d := argmin_{x ∈ X^(t)} π(x, X^(t))
    if π(d, X^(t)) < 0 then {
        X^(t+1) := X^(t) \ {d};  t := t + 1
        goto loop
    }
    X* := X^(t)
end

Figure 1: The Local Search Algorithm
(i) $Q(X^{(i-1)}) < Q(X^{(i)})$ for $i \geq 1$, and
(ii) cut $X^{(i)}$ always arises from $X^{(i-1)}$ by
adding one document to it or taking one
document away from it;

2. since the quality of modified cuts cannot
increase infinitely, a finite $k \geq 0$ necessarily
exists such that $X^{(k)}$ is locally optimal, and
consequently the program stops no later than
after the $k$-th iteration;

3. each output cut $X^*$ is locally optimal.
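A direct, unoptimized transcription of Fig. 1 into Python could look like this; quality is the sketch from section 2.2, and an efficient implementation would update $Q$ incrementally rather than recompute it from scratch.

    def local_search(edges, X0, n):
        # Greedily add the neighbour with the highest potential while it is
        # positive, then remove the member with the lowest potential while
        # it is negative; stop when neither step improves Q.
        def potential(d, X):               # π(d, X) = Q(X ∪ {d}) − Q(X \ {d})
            return quality(edges, X | {d}, n) - quality(edges, X - {d}, n)

        def neighbours(X):                 # Γ_X
            return {d for e in edges if e & X for d in e if d not in X}

        X = set(X0)
        while True:
            gamma = neighbours(X)
            if gamma:
                d = max(gamma, key=lambda x: potential(x, X))
                if potential(d, X) > 0:
                    X.add(d)
                    continue
            d = min(X, key=lambda x: potential(x, X))
            if potential(d, X) < 0:
                X.remove(d)
                continue
            return X                       # X is locally optimal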
Now we are ready to define concept-formative
clusters precisely:

Definition: A cut $X \subseteq D$ is called a
concept-formative cluster if

(i) $Q(X) > Q_0$, where $Q_0$ is a threshold quality,
and
(ii) $X = X^*$, where $X^*$ is the output of the
Local Search algorithm.

The whole procedure for finding concept-formative
clusters consists of two basic stages:
first, a set of initial cuts is found within the whole
collection, and then each of them is used as a seed
for the Local Search algorithm, which locally
optimizes the quality function $Q$.
Note that $\lambda_1, \lambda_2, \lambda_3$ are crucial parameters,
which strongly affect the whole process of searching
and consequently also the character of the
resulting concept-formative clusters. We have
optimized their values by a sort of machine learning,
using a small manually annotated collection
of texts. When the optimized $\lambda$-parameters are
used, the Local Search procedure tries to simulate
the behavior of a human annotator who finds
topically coherent clusters in a training collection.
The task of $\lambda$-optimization leads to a system of
linear inequalities, which we solve via linear
programming; as there is no scope for this issue here,
we cannot go into details.
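The paper gives no details of these inequalities, but purely as an illustration of the general technique, a system $A\lambda \leq b$ can be handed to an off-the-shelf LP solver; the constraint matrix below is invented for the example and is not the authors' actual system.

    import numpy as np
    from scipy.optimize import linprog

    # Toy system A·λ ≤ b standing in for the annotator-derived inequalities;
    # any feasible point is acceptable, so the objective is constant (zero).
    A = np.array([[-1.0,  0.0,  0.0],    # −λ1 ≤ −0.1, i.e. λ1 ≥ 0.1
                  [ 0.0, -1.0,  0.0],    # λ2 ≥ 0.1
                  [ 0.0,  0.0, -1.0],    # λ3 ≥ 0.1
                  [ 1.0,  1.0,  1.0]])   # λ1 + λ2 + λ3 ≤ 3
    b = np.array([-0.1, -0.1, -0.1, 3.0])

    res = linprog(c=np.zeros(3), A_ub=A, b_ub=b, bounds=[(None, None)] * 3)
    print(res.x)                         # one feasible (λ1, λ2, λ3)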
3 Virtual concepts
In this section we first show that concept-formative
clusters can be viewed as fuzzy sets. In
this sense, each concept-formative cluster can be
characterized by a membership function. Fuzzy
clustering allows for some ambiguity in the data,
and its main advantage over hard clustering is
that it yields much more detailed information
on the structure of the data (cf. (Kaufman and
Rousseeuw, 1990), chapter 4).
Then we define virtual concepts as linear func-
tions which estimate degree of membership of
documents in concept-formative clusters. Since
virtual concepts are weighted mixtures of words
represented as vectors, they can also be seen as
virtual documents representing specific topics that
emerge in the analyzed collection.
Definition: The degree of membership of a document
$d \in D$ in a concept-formative cluster $X \subseteq D$
is a function $\mu(d, X): D \times P(D) \to \mathbb{R}$. For
$d \in X \cup \Gamma_X$ we define $\mu(d, X) = \exp(\alpha \cdot \pi(d, X))$,
where $\alpha > 0$ is a constant. For $d \in \Delta_X$ we define
$\mu(d, X) = 0$.
The following holds true for any concept-formative
cluster $X$ and any document $d$:

- $\mu(d, X) > 1$ iff $d \in X$;
- $\mu(d, X) \in (0, 1)$ iff $d \in \Gamma_X$.
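A sketch of this membership function, again reusing the earlier quality sketch; alpha here is an arbitrary placeholder for the constant $\alpha$.

    import math

    def membership(edges, d, X, n, alpha=1.0):
        # μ(d, X): exp(α · π(d, X)) for d in X ∪ Γ_X, and 0 for d in Δ_X
        gamma = {x for e in edges if e & X for x in e if x not in X}
        if d in X or d in gamma:
            pi = (quality(edges, X | {d}, n)
                  - quality(edges, X - {d}, n))
            return math.exp(alpha * pi)
        return 0.0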
Now we formalize the notion of virtual concepts.
Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \in \mathbb{R}^m$ be vector
representations of documents $d_1, d_2, \ldots, d_n$, where
Input:  pairs (x_1, μ(d_1)), ..., (x_N, μ(d_N)), where x_1, ..., x_N ∈ R^m;
        k ... maximal number of words in the output concept;
        err_0 ... quadratic residual error threshold.
Output: c* ∈ R^m ... output concept;
        err* ... quadratic residual error;
        k* ... number of words in the output concept.
Algorithm:
    J := ∅;  err* := +∞
    while (|J| < k  and  err* > err_0) do {
        err* := +∞
        for each i ∈ {1, ..., m} \ J do {
            c := output of MLR({(x_l, μ(d_l))}_{l=1..N}, J ∪ {i})
            err := Σ_{l=1..N} (μ(d_l) − c · x_l)²
            if err < err* then { c* := c;  i* := i;  err* := err }
        }
        J := J ∪ {i*}
    }
    k* := |J|
end

Figure 2: The Greedy Regression Algorithm
$m$ is the number of indexed terms. We look for
a vector $\mathbf{c}_X \in \mathbb{R}^m$ such that

$\mathbf{c}_X \cdot \mathbf{x}_i \approx \mu(d_i, X)$

approximately holds for any $i \in \{1, \ldots, n\}$. This
vector $\mathbf{c}_X$ is then called the virtual concept
corresponding to concept-formative cluster $X$.
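Once a virtual concept is available, using it for indexing reduces to a dot product; a trivial sketch (names are illustrative):

    import numpy as np

    def concept_scores(doc_vectors, c_X):
        # doc_vectors: (n × m) matrix of document vectors; c_X: the concept.
        # Each score c_X · x_i approximates the membership μ(d_i, X).
        return np.asarray(doc_vectors) @ np.asarray(c_X)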
The task of finding virtual concepts can be
solved using the Greedy Regression Algorithm
(GRA), originally suggested by Semecký (2003).
3.1 Greedy Regression Algorithm
The GRA is directly based on multiple linear
regression (see e.g. (Rice, 1994)). The GRA works
in iterations and gradually increases the number of
non-zero elements in the resulting vector, i.e. the
number of words with non-zero weight in the
resulting mixture, so this number can be explicitly
restricted by a parameter. This feature of the GRA
has been designed for the sake of generalization,
in order not to overfit the input sample.
The input of the GRA consists of (i) a sample
set of document vectors with the corresponding
values of $\mu(d, X)$, (ii) a maximum number of
non-zero elements, and (iii) an error threshold.
The GRA, which is described in Fig. 2, requires
a procedure for solving multiple linear regression
(MLR) with a limited number of non-zero
elements in the resulting vector. Formally,
$\mathrm{MLR}(\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}, J)$ gets on input

- a set of $N$ vectors $\mathbf{x}_i \in \mathbb{R}^m$;
- a corresponding set of $N$ values $y_i \in \mathbb{R}$ to be approximated; and
- a set of indexes $J \subseteq \{1, \ldots, m\}$ of the elements which are allowed to be non-zero in the output vector.

The output of the MLR is the vector

$\mathbf{c}^* = \arg\min_{\mathbf{c}} \sum_{i=1}^{N} (\mathbf{c} \cdot \mathbf{x}_i - y_i)^2,$

where each considered $\mathbf{c} = (c_1, \ldots, c_m)$ must
fulfill $c_i = 0$ for any $i \in \{1, \ldots, m\} \setminus J$.
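The following compact NumPy sketch mirrors Fig. 2, using an unconstrained least-squares fit on the selected columns as the MLR step; the paper's Java/JAMA implementation differs, and all names and defaults here are illustrative.

    import numpy as np

    def gra(doc_matrix, y, k, err0=0.0):
        # doc_matrix: (N × m) sample document vectors; y: target values
        # μ(d_l, X); k: maximal number of words; err0: error threshold.
        N, m = doc_matrix.shape
        J, c_star, err_star = [], np.zeros(m), np.inf
        while len(J) < k and err_star > err0:
            best = None
            for i in range(m):
                if i in J:
                    continue
                cols = J + [i]
                # the MLR step, restricted to the columns J ∪ {i}
                coef, *_ = np.linalg.lstsq(doc_matrix[:, cols], y, rcond=None)
                err = float(np.sum((doc_matrix[:, cols] @ coef - y) ** 2))
                if best is None or err < best[0]:
                    best = (err, i, coef)
            err_star, i_star, coef = best
            J.append(i_star)                 # keep the best new term
            c_star = np.zeros(m)
            c_star[J] = coef
        return c_star, err_star, len(J)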
Implementation and time complexity

For solving multiple linear regression we use the
public-domain Java package JAMA (2004), developed
by The MathWorks and NIST. The computation
of the inverse matrix is based on the LU
decomposition, which makes it faster (Press et al., 1992).

As for the asymptotic time complexity of the
GRA, it is in $O(k \cdot m \cdot C_{\mathrm{MLR}})$, where $C_{\mathrm{MLR}}$
is the complexity of the MLR, since the outer loop
runs at most $k$ times and the inner loop always
runs nearly $m$ times. The MLR substantially
consists of matrix multiplications in dimension
$N \times k$ and a matrix inversion in dimension
$k \times k$. Thus the complexity of the MLR is in
$O(k^2 \cdot N + k^3) = O(k^2 \cdot N)$ because $k < N$.
So the total complexity of the GRA is in
$O(m \cdot N \cdot k^3)$.
To reduce this high computational complexity,
we pre-select terms using a heuristic method
based on linear programming. Then the GRA does
not need to deal with high-dimensional vectors
in $\mathbb{R}^m$, but works with vectors in dimension
$m' \ll m$. Although the acceleration is only
linear, the required time has been reduced more
than ten times, which is practically significant.
3.2 Experiments
The experiments reported here were done on a
small experimental collection of several tens of
thousands of Czech documents. The texts were
articles from two different newspapers and one
journal. Each document was morphologically
analyzed and lemmatized (Hajič, 2000) and then
indexed and represented as a vector. We indexed
only the lemmas of nouns, adjectives, verbs,
adverbs and numerals whose document frequency
was greater than 10 and less than 20,000; the
number of indexed terms $m$ was then also in the
tens of thousands. The cosine similarity was used
to compute the document similarity, with a fixed
threshold $w_0$ determining the edges of the graph
of the collection.

We computed a set of concept-formative
clusters and then approximated the corresponding
membership functions by virtual concepts.
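For completeness, the similarity function used here is the standard cosine measure over term vectors; a routine sketch:

    import numpy as np

    def cosine_sim(x, y):
        # sim(d_i, d_j) as the cosine of the angle between term vectors
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / denom) if denom else 0.0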
The first thing we observed was that the
quadratic residual error systematically and
progressively decreases with each GRA iteration.
Moreover, the words in the virtual concepts are
intelligible for humans and strongly suggest the
topic. An example is given in Table 1.
words in the concept
Czech lemma      literal transl.      k = 5    k = 10
bosenský         Bosnian                •        •
Srb              Serb                   •        •
UNPROFOR         UNPROFOR               •        •
OSN              UN                     •        •
Sarajevo         Sarajevo               •        •
muslimský        Muslim (adj)           —        •
odvolat          withdraw               —        •
srbský           Serbian                —        •
generál          general (n)            —        •
list             paper                  —        • (negative)

Table 1: Two virtual concepts (k = 5 and k = 10)
corresponding to cluster #318; • marks the words
with non-zero weight in each concept.
Another example is cluster #19, focused on
“pension funds”, which was approximated
(k = 20) by the following words (literally
translated):

pension⁺ (adj), pension⁺ (n), fund⁺, additional insurance⁺,
inheritance⁺, payment⁻, interest⁺ (n), dealer⁺, regulation⁻,
lawsuit⁺, August⁻ (adj), measure⁻ (n), approve⁺,
increase⁺ (v), appreciation⁺, property⁺, trade⁻ (adj),
attentively⁺, improve⁺, coupon⁻ (adj).

(The signs after the words indicate their positive
or negative weights in the concept.) Figure 3
shows the approximation of this cluster by a
virtual concept.
Figure 3: The approximation of the membership
function corresponding to cluster #19 by a virtual
concept (the number of words in the concept k = 5).
4 Discussion
4.1 Related work
A similar approach to searching for topics and em-
ploying them for document retrieval has been re-
cently suggested by Xu and Croft (2000), who,
however, try to employ the topics in the area of
distributed retrieval.
They use document clustering, treat each cluster
as a topic, and then define topics as probability
distributions of words. They use the
Kullback-Leibler divergence, with some modification,
as a distance metric to determine the closeness of a
document to a cluster. Although our virtual
concepts cannot be interpreted as probability
distributions, in this respect the two approaches
are quite similar.
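For comparison, here is a plain (eps-smoothed) Kullback-Leibler divergence between word distributions, as a stand-in for Xu and Croft's modified measure, whose exact form is not reproduced here:

    import numpy as np

    def kl_divergence(p_doc, p_cluster, eps=1e-9):
        # D(p‖q) = Σ p·log(p/q); eps-smoothing keeps the logarithm finite.
        p = np.asarray(p_doc, dtype=float) + eps
        q = np.asarray(p_cluster, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))   # smaller = closer topic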
The substantial difference is in the clustering
method used. Xu and Croft chose the K-Means
algorithm “for its efficiency”. In contrast to
this hard clustering algorithm, (i) our method is
consistently based on empirical analysis of a text
collection and does not require an a priori given
number of topics; (ii) in order to induce permeable
topics, our concept-formative clusters are not
disjoint; (iii) the specificity of our clusters is
driven by training samples given by a human.
Xu and Croft suggest that retrieval based on
topics may be more robust than the classic vector
technique: document ranking against a query is
based on statistical correlation between query
words and words in a document; since a document
is a small sample of text, its statistics are often
too sparse to reliably predict how likely it is that
the document is relevant to the query. For a topic,
in contrast, much more text is available and the
statistics are more stable. By excluding clearly
unrelated topics, many of the non-relevant
documents can be avoided during retrieval.
4.2 Future work
As our work is still in progress, there are some
open questions, which we will concentrate on in
the near future. Three main issues are (i) evaluation,
(ii) parameter setting (which is closely
connected to the previous one), and (iii) an efficient
implementation of the crucial algorithms (the
current implementation is still experimental).
As for the evaluation, we are building a manually
annotated test collection with which we want
to test the capability of our model to estimate
inter-document similarity, in comparison with the
classic vector model and the LSI model. So far
we have been working with a Czech collection
because we also test the impact of morphology and
some other NLP methods developed for Czech.
The next step will be evaluation on the English
TREC collections, which will enable us to rigorously
evaluate whether our model really helps to
improve IR tasks.
The evaluation will also give us criteria for
parameter setting. We expect that a positive value
of the error threshold $\mathit{err}_0$ will significantly
accelerate the computation without loss of quality,
but finding the right value must be based on the
evaluation. As for the most important parameters
of the GRA (i.e. the size of the sample set $N$ and
the number of words in a concept $k$), these should
be set so that the resulting concept is a good
membership estimator also for documents not
included in the sample set.
5 Summary
We have designed and implemented a system that
automatically discovers specific topics in a text
collection. We intend to employ it in document
indexing. The main directions for our future work are
thorough evaluation of the model and optimization
of the parameters.
Acknowledgments
This work has been supported by the Ministry of
Education, project Center for Computational Lin-
guistics (project LN00A063).
References
Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto.
1999. Modern Information Retrieval. ACM Press /
Addison-Wesley.
Scott C. Deerwester, Susan T. Dumais, Thomas K. Lan-
dauer, George W. Furnas, and Richard A. Harshman.
1990. Indexing by latent semantic analysis. JASIS,
41(6):391–407.
Inderjit S. Dhillon and D. S. Modha. 2001. Concept
decompositions for large sparse text data using clus-
tering. Machine Learning, 42(1/2):143–175.
Jan Hajič. 2000. Morphological tagging: Data vs. dic-
tionaries. In Proceedings of the 6th ANLP Confer-
ence, 1st NAACL Meeting, pages 94–101, Seattle.
Martin Holub. 2003. A new approach to concep-
tual document indexing: Building a hierarchical sys-
tem of concepts based on document clusters. In
M. Aleksy et al. (eds.): ISICT 2003, Proceedings
of the International Symposium on Information and
Communication Technologies, pages 311–316. Trin-
ity College Dublin, Ireland.
JAMA. 2004. JAMA: A Java Matrix Package. Public-
domain, http://math.nist.gov/javanumerics/jama/.
Leonard Kaufman and Peter J. Rousseeuw. 1990.
Finding Groups in Data. John Wiley & Sons.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. 1992. Numerical Recipes in C. Second
edition, Cambridge University Press, Cambridge.
John A. Rice. 1994. Mathematical Statistics and Data
Analysis. Second edition, Duxbury Press, Califor-
nia.
Jiří Semecký. 2003. Semantic word classes extracted
from text clusters. In 12th Annual Conference
WDS 2003, Proceedings of Contributed Papers.
MATFYZPRESS, Prague.
Kari Torkkola. 2002. Discriminative features for doc-
ument classification. In Proceedings of the Interna-
tional Conference on Pattern Recognition, Quebec
City, Canada, August 11–15.
Jinxi Xu and W. Bruce Croft. 2000. Topic-based lan-
guage models for distributed retrieval. In W. Bruce
Croft (ed.): Advances in Information Retrieval,
pages 151–172. Kluwer Academic Publishers.
