<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1013">
  <Title>Complementing WordNet with Roget's and Corpus-based Thesauri for Information Retrieval</Title>
  <Section position="3" start_page="94" end_page="97" type="intro">
    <SectionTitle>
2 Thesauri
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
2.1 WordNet
</SectionTitle>
      <Paragraph position="0"> In WordNet, words are organized into taxonomies where each node is a set of synonyms (a synset) representing a single sense. There are four different taxonomies based on distinct parts of speech, with many relationships defined within each. In this paper we use only the noun taxonomy with its hyponymy/hypernymy (or is-a) relations, which relate more general and more specific senses (Miller, 1988). Figure 1 shows a fragment of the WordNet taxonomy.</Paragraph>
      <Paragraph position="1"> The similarity between words w1 and w2 is defined via the shortest path from each sense of w1 to each sense of w2, as below (Leacock and Chodorow, 1988; Resnik, 1995): sim(w1, w2) = max[-log(Np / (2D))], where Np is the number of nodes in path p from w1 to w2 and D is the maximum depth of the taxonomy.</Paragraph>
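As a concrete illustration, the path-based measure can be sketched as follows. The miniature is-a hierarchy, its depth D, and the single-sense-per-word simplification are assumptions for illustration only; the actual measure runs over the WordNet noun taxonomy and maximizes over all sense pairs.

```python
import math

# Hypothetical miniature is-a taxonomy (child -> parent), standing in for
# the WordNet noun hierarchy.
PARENT = {
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
    "artifact": "entity", "dog": "canine", "canine": "animal",
    "animal": "entity",
}
D = 4  # assumed maximum depth of this toy taxonomy

def path_to_root(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def num_path_nodes(a, b):
    """Np: number of nodes on the shortest path from a to b."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors = set(pa)
    for steps_down, node in enumerate(pb):
        if node in ancestors:
            # nodes from a up to the common ancestor, plus nodes down to b
            return pa.index(node) + steps_down + 1
    return None

def sim(w1, w2):
    np_ = num_path_nodes(w1, w2)
    return -math.log(np_ / (2.0 * D)) if np_ else 0.0
```

With this data, sim("car", "truck") exceeds sim("car", "dog"), since the former pair shares the nearby ancestor "vehicle" and the path between them is shorter.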
    </Section>
    <Section position="2" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
2.2 Roget's Thesaurus
</SectionTitle>
      <Paragraph position="0"> In Roget's Thesaurus (Chapman, 1977), words are classified according to the ideas they express, and these categories of ideas are numbered in sequence. The terms within a category are further organized by part of speech (nouns, verbs, adjectives, adverbs, prepositions, conjunctions, and interjections). Figure 2 shows a fragment of a Roget category.</Paragraph>
      <Paragraph position="1"> In this case, our similarity measure treats all the words in Roget as features. A word w possesses the feature f if f and w belong to the same Roget category. The similarity between two words is then defined as the Dice coefficient of the two feature vectors (Lin, 1998).</Paragraph>
      <Paragraph position="3"> sim(w1, w2) = 2|R(w1) ∩ R(w2)| / (|R(w1)| + |R(w2)|), where R(w) is the set of words that belong to the same Roget category as w.</Paragraph>
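A minimal sketch of this measure, with a hypothetical three-category table standing in for Roget's Thesaurus:

```python
# Illustrative category table; in the real measure these are Roget
# categories, and a word may appear in several of them.
CATEGORIES = [
    {"car", "auto", "vehicle", "truck"},
    {"car", "train", "travel"},
    {"dog", "canine", "animal"},
]

def R(w):
    """All words that share at least one category with w."""
    related = set()
    for cat in CATEGORIES:
        if w in cat:
            related |= cat
    return related

def sim(w1, w2):
    """Dice coefficient of the two feature sets R(w1) and R(w2)."""
    r1, r2 = R(w1), R(w2)
    if not r1 or not r2:
        return 0.0
    return 2.0 * len(r1 & r2) / (len(r1) + len(r2))
```

For this table, sim("car", "truck") is 0.8, while words with no shared category, such as "car" and "dog", score 0.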
    </Section>
    <Section position="3" start_page="94" end_page="96" type="sub_section">
      <SectionTitle>
2.3 Corpus-based Thesaurus
</SectionTitle>
      <Paragraph position="0"> This method is based on the assumption that a pair of words that frequently occur together in the same document are related to the same subject.</Paragraph>
      <Paragraph position="1"> Therefore, word co-occurrence information can be used to identify semantic relationships between words (Schutze and Pederson, 1997; Schutze and Pederson, 1994). We use mutual information as a tool for computing similarity between words. Mutual information compares the probability of the co-occurrence of words a and b with the independent probabilities of occurrence of a and b (Church and Hanks, 1990).</Paragraph>
      <Paragraph position="3"> I(a, b) = log [ P(a, b) / (P(a) P(b)) ], where the probabilities P(a) and P(b) are estimated by counting the number of occurrences of a and b in documents and normalizing over the size of the vocabulary in the documents. The joint probability P(a, b) is estimated by counting the number of times that word a co-occurs with b, also normalized over the size of the vocabulary.</Paragraph>
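The estimation above can be sketched as follows; the toy corpus is hypothetical, and the normalization over vocabulary size follows the description in the text:

```python
import math

# Hypothetical corpus: each document is its set of terms.
DOCS = [
    {"stock", "market", "trade"},
    {"stock", "market"},
    {"dog", "park"},
    {"stock", "trade"},
]
VOCAB_SIZE = len(set().union(*DOCS))

def mutual_information(a, b, docs=DOCS, vocab_size=VOCAB_SIZE):
    """I(a, b) = log(P(a, b) / (P(a) P(b))), counts normalized by vocabulary size."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_ab == 0:
        return float("-inf")  # the pair never co-occurs
    p_a, p_b, p_ab = n_a / vocab_size, n_b / vocab_size, n_ab / vocab_size
    return math.log(p_ab / (p_a * p_b))
```

Terms that co-occur more often than independence predicts, such as "stock" and "market" here, receive a positive score.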
      <Paragraph position="4">  In contrast to the previous method, this approach attempts to gather term relations on the basis of linguistic relations rather than document co-occurrence statistics. Words appearing in similar grammatical contexts are assumed to be similar, and are therefore classified into the same class (Lin, 1998; Grefenstette, 1994; Grefenstette, 1992; Ruge, 1992; Hindle, 1990).</Paragraph>
      <Paragraph position="5"> First, all the documents are parsed using the Apple Pie Parser. The Apple Pie Parser is a natural language syntactic analyzer developed by Satoshi Sekine at New York University (Sekine and Grishman, 1995). The parser is a bottom-up probabilistic chart parser which finds the parse tree with the best score by way of the best-first search algorithm. Its grammar is a semi-context-sensitive grammar with two non-terminals that was automatically extracted from the Penn Tree Bank syntactically tagged corpus developed at the University of Pennsylvania. The parser generates a syntactic tree in the manner of Penn Tree Bank bracketing. Figure 3 shows a parse tree produced by this parser.</Paragraph>
      <Paragraph position="6"> The main technique used by the parser is best-first search. Because the grammar is probabilistic, it is enough to find only the one parse tree with the highest probability. During the parsing process, the parser keeps the unexpanded active nodes in a heap, and always expands the active node with the best probability.</Paragraph>
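The control strategy just described can be sketched with a priority queue; the grammar, scores, and goal test below are hypothetical stand-ins for the parser's chart operations:

```python
import heapq

def best_first(start, expand, is_goal):
    """Best-first search: always pop the node with the highest probability.

    expand(node) yields (prob, child) pairs; probabilities multiply along
    a derivation. Priorities are negated because heapq is a min-heap.
    """
    heap = [(-1.0, start)]
    while heap:
        neg_p, node = heapq.heappop(heap)  # most probable active node
        if is_goal(node):
            return node, -neg_p
        for prob, child in expand(node):
            heapq.heappush(heap, (neg_p * prob, child))
    return None, 0.0
```

Because the most probable active node is expanded first, the first goal popped from the heap carries the best overall probability, so the search can stop at the first complete parse.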
      <Paragraph position="7"> Unknown words are treated in a special manner. If the tagging phase of the parser finds an unknown word, it uses a list of parts of speech defined in the parameter file. This information was collected from the Wall Street Journal corpus, part of which was used for training and the rest for testing. The parser also keeps separate lists for information such as special suffixes like -ly, -y, -ed, -d, and -s.</Paragraph>
      <Paragraph position="8"> [Figure 2: a fragment of Roget's category 9, Relation: relation, bearing, reference, connection, cognation; correlation c. 12; analogy; similarity c. 17; affinity, homology, alliance, homogeneity, association; approximation c. (nearness) 197; filiation c. (consanguinity) 11; interest; relevancy c. 23; dependency, relationship, relative position.] The accuracy of this parser is reported as a parseval recall of 77.45% and a parseval precision of 75.58%.</Paragraph>
      <Paragraph position="9"> Using the above parser, the following syntactic structures are extracted:  * Subject-Verb: a noun is the subject of a verb.  * Verb-Object: a noun is the object of a verb.  * Adjective-Noun: an adjective modifies a noun.</Paragraph>
      <Paragraph position="10"> * Noun-Noun: a noun modifies a noun.</Paragraph>
      <Paragraph position="11"> Each noun has a set of verbs, adjectives, and nouns that it co-occurs with, and for each such relationship a mutual information value is calculated:</Paragraph>
      <Paragraph position="13"> * Isub(vi, nj) = log [ (fsub(vi, nj)/Nsub) / ((fsub(nj)/Nsub)(f(vi)/Nsub)) ], where fsub(vi, nj) is the frequency of noun nj occurring as the subject of verb vi, fsub(nj) is the frequency of the noun nj occurring as the subject of any verb, f(vi) is the frequency of the verb vi, and Nsub is the number of subject clauses.</Paragraph>
      <Paragraph position="14"> * Iobj(vi, nj) = log [ (fobj(vi, nj)/Nobj) / ((fobj(nj)/Nobj)(f(vi)/Nobj)) ], where fobj(vi, nj) is the frequency of noun nj occurring as the object of verb vi, fobj(nj) is the frequency of the noun nj occurring as the object of any verb, f(vi) is the frequency of the verb vi, and Nobj is the number of object clauses.</Paragraph>
      <Paragraph position="16"> * Iadj(ai, nj) = log [ (f(ai, nj)/Nadj) / ((fadj(nj)/Nadj)(f(ai)/Nadj)) ], where f(ai, nj) is the frequency of noun nj occurring as the argument of adjective ai, fadj(nj) is the frequency of the noun nj occurring as the argument of any adjective, f(ai) is the frequency of the adjective ai, and Nadj is the number of adjective clauses.</Paragraph>
      <Paragraph position="18"> * Inoun(ni, nj) = log [ (f(ni, nj)/Nnoun) / ((fnoun(nj)/Nnoun)(f(ni)/Nnoun)) ], where f(ni, nj) is the frequency of noun nj occurring as the argument of noun ni, fnoun(nj) is the frequency of the noun nj occurring as the argument of any noun, f(ni) is the frequency of the noun ni, and Nnoun is the number of noun clauses.</Paragraph>
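The four measures share one shape; the subject-verb case can be sketched as follows, with the hypothetical list of (verb, subject-noun) pairs standing in for relations extracted from the parses. The other three relations follow the same pattern with their own pair lists.

```python
import math

def i_sub(v, n, pairs):
    """Isub(vi, nj) computed from a list of (verb, subject-noun) pairs."""
    n_sub = len(pairs)                                   # Nsub: subject clauses
    f_vn = sum(1 for vi, nj in pairs if vi == v and nj == n)  # fsub(vi, nj)
    f_n = sum(1 for _vi, nj in pairs if nj == n)         # fsub(nj)
    f_v = sum(1 for vi, _nj in pairs if vi == v)         # f(vi)
    if f_vn == 0:
        return float("-inf")  # the pair never occurs together
    return math.log((f_vn / n_sub) / ((f_n / n_sub) * (f_v / n_sub)))
```

A verb-noun pair that occurs more often than its marginal frequencies predict gets a positive score, which is what the similarity computation in the next step relies on.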
      <Paragraph position="19"> The similarity sim(w1, w2) between two words w1 and w2 can then be computed as follows:</Paragraph>
      <Paragraph position="21"/>
    </Section>
    <Section position="4" start_page="96" end_page="97" type="sub_section">
      <SectionTitle>
Expansion Method
</SectionTitle>
      <Paragraph position="0"> A query q is represented by the vector q = (q1, q2, ..., qn), where each qi is the weight of each search term ti contained in query q. We used SMART version 11.0 (Salton, 1971) to obtain the initial query weights using the ltc formula, as below:</Paragraph>
      <Paragraph position="2"> qik = ((log(tfik) + 1) * log(N/nk)) / sqrt(Σk' [(log(tfik') + 1) * log(N/nk')]^2), where tfik is the occurrence frequency of term tk in query qi, N is the total number of documents in the collection, and nk is the number of documents to which term tk is assigned.</Paragraph>
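Assuming the standard SMART ltc scheme (log tf times idf, cosine-normalized), the weighting can be sketched as:

```python
import math

def ltc_weights(tf, N, n):
    """SMART ltc weights, assuming the standard scheme.

    tf: term -> raw frequency in the query; N: number of documents in the
    collection; n: term -> number of documents containing the term.
    Returns term -> cosine-normalized weight in [0, 1].
    """
    raw = {t: (math.log(f) + 1.0) * math.log(N / n[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}
```

The cosine normalization is what guarantees the property used next: every initial query-term weight lies between 0 and 1.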
      <Paragraph position="3"> Using the above weighting method, the weight of initial query terms lies between 0 and 1. On the other hand, the similarity in each type of thesaurus does not have a fixed range. Hence, we apply the following normalization strategy to each type of thesaurus to bring the similarity value into the range [0, 1].</Paragraph>
      <Paragraph position="5"> The similarity value between two terms in the combined thesauri is defined as the average of their similarity value over all types of thesaurus.</Paragraph>
      <Paragraph position="6"> The similarity between a query q and a term tj can be defined as follows: simqt(q, tj) = Σ_{ti ∈ q} qi * sim(ti, tj), where the value of sim(ti, tj) is taken from the combined thesauri as described above.</Paragraph>
      <Paragraph position="7"> With respect to the query q, all the terms in the collection can now be ranked according to their simqt. Expansion terms are terms tj with a high simqt(q, tj).</Paragraph>
      <Paragraph position="8"> The weight(q, tj) of an expansion term tj is defined as a function of simqt(q, tj): weight(q, tj) = simqt(q, tj) / Σ_{ti ∈ q} qi, where 0 &lt; weight(q, tj) &lt; 1.</Paragraph>
      <Paragraph position="9"> The weight of an expansion term thus depends both on all the terms appearing in a query and on the similarity between the terms, and it ranges from 0 to 1. Mathematically, it can be interpreted as the weighted mean of the similarities between the term tj and all the query terms, with the weights of the original query terms serving as the weighting factors of those similarities (Qiu and Frei, 1993).</Paragraph>
      <Paragraph position="10"> Therefore the query q is expanded by adding the following query: qee = (a1, a2, ..., ar), where aj is equal to weight(q, tj) if tj belongs to the top r ranked terms, and 0 otherwise. The resulting expanded query is: qexpanded = q ∘ qee, where ∘ is defined as the concatenation operator. The method above can accommodate polysemy, because an expansion term taken from a different sense than the original query term is given a very low weight.</Paragraph>
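The whole expansion step can be sketched end to end; the similarity function, vocabulary, and query below are hypothetical, with sim standing in for the combined-thesaurus similarity:

```python
def expand_query(query, sim, vocab, r):
    """Rank candidate terms by simqt, weight the top r, and concatenate.

    query: term -> initial ltc weight; sim(a, b): combined-thesaurus
    similarity; vocab: candidate terms; r: number of expansion terms.
    """
    total = sum(query.values())  # Σ qi, the normalizer for weight(q, t)
    # simqt(q, t) = Σ_i qi * sim(ti, t) over candidate terms not in q
    simqt = {t: sum(qi * sim(ti, t) for ti, qi in query.items())
             for t in vocab if t not in query}
    top = sorted(simqt, key=simqt.get, reverse=True)[:r]
    expansion = {t: simqt[t] / total for t in top}   # weight(q, t)
    return {**query, **expansion}  # concatenation of q and q_ee
```

Terms similar to a different sense of a query word accumulate little similarity mass across the whole query, so they fall outside the top r or receive a very low weight, which is the polysemy-handling property noted above.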
    </Section>
  </Section>
</Paper>