File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/w98-0705_intro.xml

Size: 7,189 bytes

Last Modified: 2025-10-06 14:06:44

<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0705">
  <Title>Indexing with WordNet synsets can improve text retrieval</Title>
  <Section position="2" start_page="0" end_page="38" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Text retrieval deals with the problem of finding all the relevant documents in a text collection for a given user query. A large-scale semantic database such as WordNet (Miller, 1990) seems to have great potential for this task. There are at least two obvious reasons: * It offers the possibility of discriminating word senses in documents and queries. This would prevent matching spring in its &quot;metal device&quot; sense with documents mentioning spring in the sense of springtime, and retrieval accuracy could thereby be improved.</Paragraph>
    <Paragraph position="1"> * WordNet provides the chance of matching semantically related words. For instance, spring, fountain, outflow, outpouring, in the appropriate senses, can be identified as occurrences of the same concept, 'natural flow of ground water'. And beyond synonymy, WordNet can be used to measure semantic distance between occurring terms to get more sophisticated ways of comparing documents and queries.</Paragraph>
    <Paragraph position="2"> However, the general feeling within the information retrieval community is that dealing explicitly with semantic information does not significantly improve the performance of text retrieval systems. This impression is founded on the results of some experiments measuring the role of Word Sense Disambiguation (WSD) for text retrieval, on the one hand, and some attempts to exploit the features of WordNet and other lexical databases, on the other. In (Sanderson, 1994), word sense ambiguity is shown to produce only minor effects on retrieval accuracy, apparently confirming that query/document matching strategies already perform an implicit disambiguation. Sanderson also estimates that if explicit WSD is performed with less than 90% accuracy, the results are worse than not disambiguating at all. In his experimental setup, ambiguity is introduced artificially in the documents by substituting randomly chosen pairs of words (for instance, banana and kalashnikov) with artificially ambiguous terms (banana/kalashnikov). While his results are very interesting, it remains unclear, in our opinion, whether they would be corroborated with real occurrences of ambiguous words. There is also another minor weakness in Sanderson's experiments. When he &quot;disambiguates&quot; a term such as spring/bank to get, for instance, bank, he has performed only a partial disambiguation, as bank can be used in more than one sense in the text collection.</Paragraph>
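The artificial-ambiguity setup described above can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not Sanderson's actual procedure: the function names make_pseudo_words and add_ambiguity, the seed parameter, and the example tokens are all hypothetical, and the only element taken from the text is the idea of replacing both members of a randomly chosen word pair with one joint pseudo-word.

```python
import random


def make_pseudo_words(vocabulary, n_pairs, seed=0):
    """Randomly pair up words and map each member to a joint pseudo-word.

    Hypothetical helper: the pairing policy is an assumption for illustration.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(vocabulary), 2 * n_pairs)
    mapping = {}
    for i in range(0, len(chosen), 2):
        first, second = chosen[i], chosen[i + 1]
        pseudo = first + "/" + second
        mapping[first] = pseudo
        mapping[second] = pseudo
    return mapping


def add_ambiguity(tokens, mapping):
    """Replace every mapped token with its pseudo-word; leave others untouched."""
    return [mapping.get(token, token) for token in tokens]


# Fixed mapping mirroring the banana/kalashnikov example in the text.
mapping = {"banana": "banana/kalashnikov", "kalashnikov": "banana/kalashnikov"}
doc = ["he", "ate", "a", "banana"]
ambiguous_doc = add_ambiguity(doc, mapping)
```

In this toy, resolving the pseudo-word back to banana restores only the word form. If banana itself had several senses in the collection, they would remain conflated, which is the partial-disambiguation weakness noted above.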
    <Paragraph position="3"> Besides disambiguation, many attempts have been made to exploit WordNet for text retrieval purposes.</Paragraph>
    <Paragraph position="4"> Two aspects have mainly been addressed: the enrichment of queries with semantically related terms, on the one hand, and the comparison of queries and documents via conceptual distance measures, on the other.</Paragraph>
    <Paragraph position="5"> Query expansion with WordNet has been shown to be potentially relevant for enhancing recall, as it permits matching relevant documents that might not contain any of the query terms (Smeaton et al., 1995). However, it has produced few successful experiments.</Paragraph>
    <Paragraph position="6"> For instance, (Voorhees, 1994) manually expanded 50 queries over a TREC-1 collection (Harman, 1993) using synonymy and other semantic relations from WordNet 1.3. Voorhees found that the expansion was useful with short, incomplete queries, and rather useless for complete topic statements (where other expansion techniques worked better). For short queries, there remained the problem of selecting the expansions automatically; doing it badly could degrade retrieval performance rather than enhance it.</Paragraph>
    <Paragraph position="8"> In (Richardson and Smeaton, 1995), a combination of rather sophisticated techniques based on WordNet, including automatic disambiguation and measures of semantic relatedness between query/document concepts, resulted in a drop of effectiveness. Unfortunately, the effects of WSD errors could not be distinguished from the accuracy of the retrieval strategy.</Paragraph>
    <Paragraph position="9"> However, in (Smeaton and Quigley, 1996), retrieval on a small collection of image captions - that is, on very short documents - is reasonably improved using measures of conceptual distance between words based on WordNet 1.4. Previously, captions and queries had been manually disambiguated against WordNet. The reason for such success is that with very short documents (e.g. boys playing in the sand) the chance of finding the original terms of the query (e.g. of children running on a beach) is much lower than for average-size documents (which typically include many phrasings for the same concepts). These results are in agreement with (Voorhees, 1994), but the question remains of whether conceptual distance matching would scale up to longer documents and queries. In addition, the experiments in (Smeaton and Quigley, 1996) only consider nouns, while WordNet offers the chance to use all open-class words (nouns, verbs, adjectives and adverbs).</Paragraph>
    <Paragraph position="10"> Our essential retrieval strategy in the experiments reported here is to adapt a classical vector-model-based system, using WordNet synsets as the indexing space instead of word forms. This approach combines two benefits for retrieval: one, that terms are fully disambiguated (this should improve precision); and two, that equivalent terms can be identified (this should improve recall). Note that query expansion does not satisfy the first condition, as the terms used to expand are words and, therefore, are in turn ambiguous. On the other hand, plain word sense disambiguation does not satisfy the second condition, as equivalent senses of two different words are not matched.</Paragraph>
    <Paragraph position="11"> Thus, indexing by synsets yields maximum matching and minimum spurious matching, which makes it a good starting point for studying text retrieval with WordNet.</Paragraph>
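The contrast between word-form indexing and synset indexing can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not the system used in the experiments: the SYNSET_OF table, its sense tags and its synset identifiers are invented for this example (they are not real WordNet identifiers), tokens are assumed to be already disambiguated, and raw term counts with cosine similarity stand in for whatever weighting a real vector-model system would apply.

```python
import math
from collections import Counter

# Hypothetical hand-made sense inventory: (word, sense tag) maps to a synset id.
# The ids are invented for illustration; they are not real WordNet identifiers.
SYNSET_OF = {
    ("spring", "water"): "syn_natural_flow_of_ground_water",
    ("fountain", "water"): "syn_natural_flow_of_ground_water",
    ("spring", "device"): "syn_elastic_metal_device",
    ("spring", "season"): "syn_springtime",
}


def word_vector(tagged_tokens):
    """Index by word form, ignoring the sense tags."""
    return Counter(word for word, _sense in tagged_tokens)


def synset_vector(tagged_tokens):
    """Index by synset id; tokens are assumed to be already disambiguated."""
    return Counter(SYNSET_OF[token] for token in tagged_tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = [("spring", "water")]
doc_fountain = [("fountain", "water")]
doc_device = [("spring", "device")]

# Word-form indexing: misses the synonym and matches the wrong sense.
word_scores = (
    cosine(word_vector(query), word_vector(doc_fountain)),
    cosine(word_vector(query), word_vector(doc_device)),
)

# Synset indexing: matches the synonym and rejects the wrong sense.
synset_scores = (
    cosine(synset_vector(query), synset_vector(doc_fountain)),
    cosine(synset_vector(query), synset_vector(doc_device)),
)
```

With word forms, the query scores zero against the fountain document and matches the metal-device document. With synset ids the pattern reverses, which is the recall and precision effect described above.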
    <Paragraph position="12"> Given this approach, our goal is to test two main issues which, to our knowledge, are not clearly answered by the experiments mentioned above: * Abstracting from the problem of sense disambiguation, what potential does WordNet offer for text retrieval? In particular, we would like to extend experiments with manually disambiguated queries and documents to average-size texts.</Paragraph>
    <Paragraph position="13"> * Once the potential of WordNet is known for a manually disambiguated collection, we want to test the sensitivity of retrieval performance to disambiguation errors introduced by automatic WSD.</Paragraph>
    <Paragraph position="14"> This paper reports on our first results answering these questions. The next section describes the test collection that we have produced. The experiments are described in Section 3, and the last section discusses the results obtained.</Paragraph>
  </Section>
</Paper>