<?xml version="1.0" standalone="yes"?> <Paper uid="P04-2003"> <Title>Searching for Topics in a Large Collection of Texts</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the field of information retrieval (for a detailed survey see e.g. (Baeza-Yates and Ribeiro-Neto, 1999)), document indexing and the representation of documents as vectors are among the most successful techniques. Within the framework of the well-known vector model, the indexed elements are usually individual words, which leads to high-dimensional vectors. However, several approaches try to reduce the dimensionality of the vectors in order to improve retrieval effectiveness. The most famous is probably Latent Semantic Indexing (LSI), introduced by Deerwester et al. (1990), which applies a specific linear transformation to the original word-based vectors using a system of &quot;latent semantic concepts&quot;. Two other approaches that inspired us, namely (Dhillon and Modha, 2001) and (Torkkola, 2002), are similar to LSI but differ in how they project the document vectors into a lower-dimensional space.</Paragraph> <Paragraph position="1"> Our idea is to establish a system of &quot;virtual concepts&quot;, which are linear functions represented by vectors, extracted from automatically discovered &quot;concept-formative clusters&quot; of documents. In short, concept-formative clusters are semantically coherent and specific sets of documents that represent specific topics. This idea was originally proposed by Holub (2003), who hypothesizes that concept-oriented vector models of documents based on indexing virtual concepts could improve the effectiveness of both the automatic comparison of documents and their matching with queries.</Paragraph> <Paragraph position="2"> The paper is organized as follows. 
In section 2 we formalize the notion of concept-formative clusters and give a heuristic method for finding them.</Paragraph> <Paragraph position="3"> Section 3 first introduces virtual concepts formally and presents an algorithm for constructing them, and then reports some experiments. In section 4 we compare our model with another approach and briefly survey some open questions. Finally, section 5 gives a short summary.</Paragraph> </Section> </Paper>