<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2024">
<Title>Information Retrieval Capable of Visualization and High Precision</Title>
<Section position="2" start_page="0" end_page="138" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Information retrieval (IR) has long been studied [e.g., (Menzel, 1966)], and several kinds of basic retrieval models have been proposed (Salton and Buckley, 1988). A number of improved IR systems based on these models have been developed by adopting various NLP techniques [e.g., (Evans and Zhai, 1996; Mitra et al., 1997; Mandara, et al., 1998; Murata, et al., 2000)]. However, an epoch-making technique that surpasses the TFIDF-weighted vector space model, the main approach to IR at present, has not yet been invented, and IR is still relatively imprecise. There are also challenges in presenting a large number of retrieval results to users in a visual and intelligible form.</Paragraph>
<Paragraph position="1"> Our aim is to develop a high-precision, visual IR system that consists of two phases. The first phase is carried out using conventional IR techniques, in which a large number of related documents are gathered from newspapers or websites in response to a query. In the second phase, the retrieval results are visualized and picking is performed.
The visualization process classifies the query and retrieval results and places them on a two-dimensional map in topological order according to the similarity between them.</Paragraph>
<Paragraph position="2"> To improve the precision of the retrieval process, the picking process then selects a small number of highly relevant documents based on the classification results produced by the visualization process.</Paragraph>
<Paragraph position="3"> This paper presents a new approach that uses the self-organizing map (SOM) proposed by Kohonen (Kohonen, 1997) for this second IR phase (see footnote 1).</Paragraph>
<Paragraph position="4"> Footnote 1: There have been a number of studies applying SOM to data mining and visualization [e.g., (Kohonen, et al., 2000)] since WEBSOM was developed in 1996. To our knowledge, however, these works mainly focused on confirming the capabilities of SOM in self-organization and/or visualization. In this study, we slot the SOM-based processing into a practical IR system that enables visualization of the retrieval results while at the same time improving retrieval precision. Another feature that distinguishes our study from others is that we performed comparative studies with TFIDF-based IR methods, the major approach to IR in the NLP field.</Paragraph>
<Paragraph position="5"> To enable the second phase to be slotted into a practical IR system as described above, visualization and picking should be carried out for a single query and set of related documents. In this paper, however, for the purpose of evaluating the proposed system, we used correct answer data consisting of multiple queries and related documents, as used in the 1999 IR contest IREX (Murata, et al., 2000). The procedure of the second IR phase in this paper is therefore as follows. Given a set of queries and related documents, a documentary map is first automatically created through self-organization.
This map provides visible and continuous retrieval results in which all queries and documents are placed in topological order according to their similarity. The documentary map gives users an easy way to find documents related to their queries. It also lets them see the relationships between documents with regard to the same query, or even the relationships between documents across different queries. In addition, the documents related to a query can be ranked by simply calculating the Euclidean distances between the point of the query and the points of the documents in the map, and then choosing the N closest documents in ranked order as the retrieval results for that query. If a small N is set, the retrieval results are limited to the most highly relevant documents, thus improving the retrieval precision.</Paragraph>
<Paragraph position="6"> Computer experiments showed that meaningful two-dimensional documentary maps could be created and that the ranking of the results retrieved using the map was better than that of the results obtained using a conventional TFIDF method. Furthermore, the precision of the proposed method was much higher than that of the conventional TFIDF method when the retrieval process focused on retrieving the most highly relevant documents. This indicates that the proposed method might be particularly useful for picking the best documents, thus greatly improving the IR precision.</Paragraph>
</Section>
</Paper>