<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1104"> <Title>Semantic Indexing using WordNet Senses</Title> <Section position="3" start_page="35" end_page="37" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> There are three main approaches reported in the literature regarding the incorporation of semantic information into IR systems: (1) conceptual indexing, (2) query expansion and (3) semantic indexing. The first is based on ontological taxonomies, while the last two make use of Word Sense Disambiguation algorithms.</Paragraph> <Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 2.1 Conceptual indexing </SectionTitle> <Paragraph position="0"> The use of concepts for document indexing is a relatively new trend within the IR field. Concept matching is a technique that has been used in limited domains, such as the legal field, where conceptual indexing has been applied by (Stein, 1997). The FERRET system (Mauldin, 1991) is another example of how concept identification can improve IR systems.</Paragraph> <Paragraph position="1"> sets, called synsets. A synset is associated with a particular sense of a word, and thus we use sense-based and synset-based interchangeably.</Paragraph> <Paragraph position="2"> To our knowledge, the most intensive work in this direction was performed by Woods (Woods, 1997) at Sun Microsystems Laboratories. He creates custom-built ontological taxonomies based on subsumption and morphology for the purpose of indexing and retrieving documents. Comparing the performance of the system that uses conceptual indexing with the performance obtained using classical retrieval techniques showed increased performance and recall. He also defines a new measure, called success rate, which indicates whether a question has an answer in the top ten documents returned by a retrieval system. 
The success rate obtained in the case of conceptual indexing was 60%, compared with a maximum of 45% obtained using other retrieval systems. This is a significant improvement and shows that semantics can have a strong impact on the effectiveness of IR systems.</Paragraph> <Paragraph position="3"> The experiments described in (Woods, 1997) refer to small collections of text, such as the Unix manual pages (about 10MB of text). But, as shown in (Ambroziak, 1997), this is not a limitation; conceptual indexing can be successfully applied to much larger text collections, and even used in Web browsing.</Paragraph> </Section> <Section position="2" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 2.2 Query Expansion </SectionTitle> <Paragraph position="0"> Query expansion has been shown to have positive effects in retrieving relevant information (Lu and Keefer, 1994). The purpose of query expansion can be either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to the words from the original query, while in the latter case the expansion procedure adds completely new terms.</Paragraph> <Paragraph position="1"> There are two main techniques used in expanding an original query. The first one relies on a Machine Readable Dictionary: (Moldovan and Mihalcea, 2000) and (Voorhees, 1994) make use of WordNet to enlarge the query so that it includes words which are semantically related to the concepts from the original query. The basic semantic relation used in their systems is the synonymy relation. This technique requires the disambiguation of the words in the input query, and it was reported that this method can be useful if the sense disambiguation is highly accurate. 
The other technique for query expansion is to use relevance feedback, as used in SMART (Buckley et al., 1994).</Paragraph> </Section> <Section position="3" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 2.3 Semantic indexing </SectionTitle> <Paragraph position="0"> The use of word senses in the process of document indexing is a much-debated topic. The basic idea is to index word meanings, rather than words taken as lexical strings. A survey of the efforts to incorporate WSD into IR is presented in (Sanderson, 2000). Experiments performed by different researchers led to various, sometimes contradictory results. Nevertheless, the conclusion which can be drawn from all these experiments is that a highly accurate Word Sense Disambiguation algorithm is needed in order to obtain an increase in the performance of IR systems.</Paragraph> <Paragraph position="1"> Ellen Voorhees (Voorhees, 1998) (Voorhees, 1999) tried to resolve word ambiguity in the collection of documents, as well as in the query, and then compared the results obtained with the performance of a standard run. Even though she used different weighting schemes, the overall results showed a degradation in IR effectiveness when word meanings were used for indexing. Still, as she pointed out, the precision of the WSD technique has a dramatic influence on these results. She states that a better WSD can lead to an increase in IR performance.</Paragraph> <Paragraph position="2"> A rather &quot;artificial&quot; experiment in the same direction of semantic indexing is provided in (Sanderson, 1994). 
He uses pseudo-words to test the utility of disambiguation in IR.</Paragraph> <Paragraph position="3"> A pseudo-word is an artificially created ambiguous word, such as &quot;banana-door&quot; (pseudo-words were first introduced in (Yarowsky, 1993) as a means of testing WSD accuracy without the costs associated with the acquisition of sense-tagged corpora). Different levels of ambiguity were introduced in the set of documents prior to indexing. The conclusion drawn was that WSD has little impact on IR performance, to the point that only a WSD algorithm with over 90% precision could help IR systems.</Paragraph> <Paragraph position="4"> The reasons for the results obtained by Sanderson have been discussed in (Schutze and Pedersen, 1995). They argue that the use of pseudo-words does not always provide an accurate measure of the effect of WSD on IR performance. It is shown that in the case of pseudo-words, high-frequency word types have the majority of senses of a pseudo-word, i.e. the word ambiguity is not realistically modeled. Moreover, (Schutze and Pedersen, 1995) performed experiments which showed that semantics can actually help retrieval performance. They reported an increase in precision of up to 7% when sense-based indexing is used alone, and up to 14% for combined word-based and sense-based indexing.</Paragraph> <Paragraph position="5"> One of the largest studies regarding the applicability of word semantics to IR is reported by Krovetz (Krovetz and Croft, 1993), (Krovetz, 1997). When talking about word ambiguity, he collapses both the morphological and semantic aspects of ambiguity, and refers to them as polysemy and homonymy. He shows that word senses should be used in addition to word-based indexing, rather than indexing on word senses alone, basically because of the uncertainty involved in sense disambiguation. 
He has extensively studied the effect of lexical ambiguity on IR; the experiments described provide a clear indication that word meanings can improve the performance of a retrieval system.</Paragraph> <Paragraph position="6"> (Gonzalo et al., 1998) performed experiments in sense-based indexing: they used the SMART retrieval system and a manually disambiguated collection (Semcor). It turned out that indexing by synsets can increase recall by up to 29% with respect to word-based indexing. Part of their experiments was the simulation of a WSD algorithm with error rates of 5%, 10%, 20%, 30% and 60%: they found that error rates of up to 10% do not substantially affect precision, and a system with WSD errors below 30% still performs better than a standard run. The results of their experiments are encouraging, and show that an accurate WSD algorithm can significantly help IR systems. We propose here a system which tries to combine the benefits of word-based and synset-based indexing. Both words and synsets are indexed in the input text, and the retrieval is then performed using either one or both of these sources of information. The key to our system is a WSD method for open text.</Paragraph> </Section> </Section> </Paper>