File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/e99-1013_metho.xml
Size: 7,885 bytes
Last Modified: 2025-10-06 14:15:21
<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1013"> <Title>Complementing WordNet with Roget's and Corpus-based Thesauri for Information Retrieval</Title> <Section position="4" start_page="97" end_page="97" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> Experiments were carried out on the TREC-7 Collection, which consists of 528,155 documents and 50 topics (Voorhees and Harman, to appear 1999).</Paragraph> <Paragraph position="1"> TREC is currently the de facto standard test collection in the information retrieval community.</Paragraph> <Paragraph position="2"> Table 1 shows topic-length statistics, Table 2 shows document statistics, and Figure 4 shows an example topic.</Paragraph> <Paragraph position="3"> We use the title, description, and combined title+description+narrative of these topics. Note that in the TREC-7 collection the description contains all terms in the title section.</Paragraph> <Paragraph position="4"> For our baseline, we used SMART version 11.0 (Salton, 1971) as the information retrieval engine with the lnc.ltc weighting method. SMART is an information retrieval engine based on the vector space model, in which term weights are calculated from term frequency, inverse document frequency, and document length normalization.</Paragraph> <Paragraph position="5"> Automatic indexing of a text in the SMART system involves the following steps: of a word are normalized to the same stem.</Paragraph> <Paragraph position="6"> The SMART system uses a variant of the Lovins method to apply simple rules for suffix stripping. * Weighting: The term (word and phrase) vector thus created for a text is weighted using tf, idf, and length normalization considerations. 
Table 3 gives the average of non-interpolated precision using SMART without expansion (baseline), expansion using only WordNet, expansion using only the corpus-based syntactic-relation-based thesaurus, expansion using only the corpus-based co-occurrence-based thesaurus, and expansion using combined thesauri. For each method we also give the relative improvement over the baseline. We can see that the combined method significantly outperforms the isolated use of each type of thesaurus.</Paragraph> </Section> <Section position="5" start_page="97" end_page="98" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> In this section we discuss why our method using WordNet is able to improve information retrieval performance. The three types of thesaurus we used have different characteristics. Automatically constructed thesauri add not only new terms but also new relationships not found in WordNet. If two terms often co-occur in a document, then those two terms are likely to bear some relationship.</Paragraph> <Paragraph position="1"> The reason why we should not rely only on automatically constructed thesauri is that some relationships may be missing from them. For example, consider the words colour and color. These words certainly share the same context, but would never appear in the same document, at least not with a frequency recognized by a co-occurrence-based method. In general, different words used to describe similar concepts may never be used in the same document, and are thus missed by co-occurrence methods. However, their relationship may be found in WordNet, Roget's, and the syntactically-based thesaurus.</Paragraph> <Paragraph position="2"> One may ask why we included Roget's Thesaurus here, given that it is almost identical in nature to WordNet. The reason is to provide more evidence in the final weighting method. 
Including Roget's as part of the combined thesaurus is better than not including it, although the improvement is not significant (4% for title, 2% for description and 0.9% for all terms in the query). One reason is that the coverage of Roget's is very limited.</Paragraph> <Paragraph position="3"> A second point is our weighting method. The advantages of our weighting method can be summarized as follows: * the weight of each expansion term considers the similarity of that term to all terms in the ocean remote sensing.</Paragraph> <Paragraph position="4"> Narrative: Documents discussing the development and application of spaceborne ocean remote sensing in oceanography, seabed prospecting and mining, or any marine science activity are relevant. Documents that discuss the application of satellite remote sensing in geography, agriculture, forestry, mining and mineral prospecting or any land-bound science are not relevant, nor are references to international marketing or promotional advertizing of any remote-sensing technology. Synthetic aperture radar (SAR) employed in ocean remote sensing is relevant. term.</Paragraph> <Paragraph position="5"> * the weight of an expansion term also depends on its similarity within all types of thesaurus.</Paragraph> <Paragraph position="6"> Our method can accommodate polysemy, because an expansion term taken from a sense different from that of the original query term is given a very low weight. The reason for this is that the weighting method depends on all query terms and all of the thesauri. For example, the word bank has many senses in WordNet. Two such senses are the financial institution and river edge senses. In a document collection relating to financial banks, the river sense of bank will generally not be found in the co-occurrence-based thesaurus because of a lack of articles talking about rivers. 
Even if, with small probability, some documents in the collection talk about rivers, a query containing the finance sense of bank would tend to have its other terms concerned with finance and not rivers. Thus rivers would have a relationship only with the term bank and no relationship with the other terms in the original query, resulting in a low weight.</Paragraph> <Paragraph position="7"> Since our weighting method depends on both the query in its entirety and similarity over the three thesauri, wrong-sense expansion terms are given a very low weight.</Paragraph> </Section> <Section position="6" start_page="98" end_page="99" type="metho"> <SectionTitle> 6 Related Research </SectionTitle> <Paragraph position="0"> Smeaton (1995) and Voorhees (1994; 1988) proposed expansion methods using WordNet. Our method differs from theirs in that we enrich the coverage of WordNet using two methods of automatic thesaurus construction, and we weight the expansion terms so that the method can accommodate polysemy.</Paragraph> <Paragraph position="1"> Although Stairmand (1997) and Richardson (1995) proposed the use of WordNet in information retrieval, they did not use WordNet in the query expansion framework.</Paragraph> <Paragraph position="2"> Our syntactic-relation-based thesaurus is based on the method proposed by Hindle (1990), although Hindle did not apply it to information retrieval. Hindle only extracted subject-verb and object-verb relations, while we also extract adjective-noun and noun-noun relations, in the manner of Grefenstette (1994), who applied his syntactically-based thesaurus to information retrieval with mixed results. Our system improves on Grefenstette's results since we factor in thesauri which contain hierarchical information absent from his automatically derived thesaurus. 
Our weighting method follows the Qiu and Frei (1993) method, except that Qiu and Frei used it to expand terms from a single automatically constructed thesaurus and did not consider the use of more than one thesaurus.</Paragraph> <Paragraph position="3"> This paper is an extension of our previous work (Mandala et al., to appear 1999), in which we did not consider the effects of using Roget's Thesaurus as one piece of evidence for expansion and used the Tanimoto coefficient as the similarity coefficient instead of mutual information.</Paragraph> </Section> </Paper>