File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/e99-1034_intro.xml
Size: 2,945 bytes
Last Modified: 2025-10-06 14:06:51
<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1034"> <Title>Finding content-bearing terms using term similarities</Title> <Section position="3" start_page="243" end_page="243" type="intro"> <SectionTitle> 3 Discussion </SectionTitle> <Paragraph position="0"> In this paper, we have formulated the hypothesis that query terms which are good descriptors of the information need tend to be more similar to each other. We have proposed a method to verify whether the hypothesis holds in practice, and presented preliminary investigations on the CACM collection which seem to confirm it. However, many further investigations remain to be carried out on larger collections, involving more elaborate similarity measures using weights, different contexts (paragraphs, sentences), and phrases as well as single words. Experiments are ongoing on a subset of the TREC collection (200 Mb), and preliminary results seem to confirm the hypothesis. We hope that investigations on this larger test collection will yield better results, since similarities computed on larger data sets are statistically more reliable.</Paragraph> <Paragraph position="1"> In a way, this work can be related to word sense disambiguation. This problem has already been addressed in the field of information retrieval, but word sense disambiguation has been shown to be of limited utility there (Krovetz and Croft, 1992). Here the problem is not to determine the correct sense of a word, but rather to determine the usefulness of a query term for retrieval. However, it would be interesting to see whether techniques developed for word sense disambiguation (e.g. Yarowsky, 1992) could be adapted to determine the usefulness of a query term for retrieval.</Paragraph> <Paragraph position="2"> From our preliminary investigations, it seems that similarities can be used as both positive and negative evidence that a term is useful for retrieval. 
The other part of our work is to devise a technique for using this pattern to improve term weighting and, ultimately, retrieval effectiveness. While simple techniques (e.g. clustering) might work and will be tried, we doubt that they will suffice, because every relationship between query terms should be taken into account, which leads to very complex interactions. We are currently developing a model in which the probability of the state (content/noisy) of a term is determined by uncertain inference, using a technique for representing and handling uncertainty called Probabilistic Argumentation Systems (Kohlas and Haenni, 1996). In the near future, this model will be implemented and tested against simpler models. If the model predicts the state of each query term reasonably well, this information can be used to refine the weighting of query terms and lead to better information retrieval.</Paragraph> </Section></Paper>