File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/80/c80-1089_metho.xml
Size: 13,858 bytes
Last Modified: 2025-10-06 14:11:19
<?xml version="1.0" standalone="yes"?> <Paper uid="C80-1089"> <Title>tistics, Bruckman b, Fershl I, Schmetterer - Vienna Physics Verlagdeg</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> A CONCEPTUAL FRAMEWORK FOR AUTOMATIC AND DYNAMIC THESAURUS UPDATING IN INFORMATION RETRIEVAL SYSTEMS M.F. BRUANDET Laboratoire IMAG </SectionTitle> <Paragraph position="0"> B.P. 53X, 38041 GRENOBLE Cedex (France)</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> This paper presents a methodology for automatic thesaurus construction to support document search. The aim is to develop classes for specific topics (for a given corpus) without a priori semantic information. Information contained in the thesaurus leads to new search formulations via automatic and/or user feedback. Although theoretical, this presentation is oriented toward a database implementation.</Paragraph> <Paragraph position="1"> Preliminary remarks Different strategies used in Information Retrieval Systems must be developed to increase &quot;recall&quot; and &quot;precision&quot; 8,9. The classic one is the construction of a thesaurus. A thesaurus is usually defined as a set of terms (called descriptors) and a set of relations between these terms.</Paragraph> <Paragraph position="2"> This study addresses an information retrieval system that uses an inverted file (a bitmap in which each keyword points to the set of documents containing that keyword). To formulate a request, the user defines a set of keywords and boolean operators on this set (as in, for example, the MISTRAL, GOLEM-PASSAT and STAIRS systems). When a document is entered into the database, a module (e.g. PIAF) 4,5 generates stems from the data (several grammatical variants of the same word are reduced to a canonical form). 
We call this form an item.</Paragraph> <Paragraph position="3"> Thesaurus construction in the context of local documents Our objective is to find a method for constructing non-hierarchical relations and for defining item clusters from these relations.</Paragraph> <Paragraph position="4"> It should be stressed that this methodology can be used efficiently only on homogeneous collections of texts. For this purpose, we consider only a subset of the database: the local set of all documents returned by a given query. The local clustering method makes use of the common occurrences of items within a certain &quot;neighborhood&quot;; this method has been studied by R. ATTAR and A.S. FRAENKEL (in &quot;Local feedback in full-text retrieval&quot;) 1.</Paragraph> <Paragraph position="5"> Let $D_\ell$ be the local set of documents retrieved from a given query and $T_\ell$ the set of items contained in $D_\ell$. We define a metric function which is inversely proportional to the distance between items in the same sentence. Each item is defined by its coordinates (DN, SN, IN), where DN is the document number, SN the sentence number and IN the item number within a sentence.</Paragraph> <Paragraph position="6"> For any item $t \in T_\ell$, let $w_t(i)$ be the coordinate of the $i$th occurrence of $t$.</Paragraph> <Paragraph position="7"> For any couple $(s,t) \in T_\ell \times T_\ell$, we define $d = |w_t(i) - w_s(j)|$, the distance between the $i$th occurrence of $t$ and the $j$th occurrence of $s$.</Paragraph> <Paragraph position="8"> In fact</Paragraph> <Paragraph position="10"> where the summation is over all occurrences $i$ and $j$ of $s$ and $t$.</Paragraph> <Paragraph position="12"> In order to normalize the function, we take $\mu_R(s,t)$, a normalization of $b(s,t)$, where $f(t)$ is the number of occurrences of $t$ in all local documents $D_\ell$</Paragraph> <Paragraph position="14"> Through this function, we obtain for an item $s$ a reference vector $R_s$, which is a list of items $t$ related to $s$ such that $\mu_R(s,t)$ is greater than or equal to a threshold $\varepsilon$. 
These values form an eigenvector $E_{R_s}$.</Paragraph> <Paragraph position="15"> Taking into account new local information in thesaurus updating Without excluding the search for hierarchical relations (specific or generic) in the thesaurus, we try to build a set or group of items that share a notion of &quot;similarity&quot; or &quot;liaison&quot;. This thesaurus is built as the answers returned by the Information Retrieval System are analysed. It must be structured so that updating is dynamic and automatic; the implementation has not yet been studied. The main problem of updating is to take into account &quot;liaisons&quot;, &quot;proximities&quot; or &quot;similarities&quot; between the items already registered in the thesaurus and the new liaisons found after a new query. For any query, we obtain a set of items related to $s$. Let $R_s$ be the previous reference vector ($\mu_{R_s}$ its associated function) and $R'_s$ the newly calculated vector ($\mu_{R'_s}$ its associated function).</Paragraph> <Paragraph position="16"> A new reference vector may be calculated from $R_s$ and $R'_s$ using two functions $m(s,t)$ and $M(s,t)$ (formulas (4) and (5)). The function $M$ involves all the items $t$ related to $s$ in $R_s$ or in $R'_s$, whether or not they appear in both (see Table I). The function $m$ allows us to consider only the items which are both in $R_s$ and in $R'_s$ (see Table I).</Paragraph> <Paragraph position="17"> One might consider $m$ and $M$ to be respectively the intersection and union of the items $t$ related to $s$. Table I: the above functions $m$ and $M$ (formulas (4), (5)); strongest bindings between items. Any association between $s$ and $t$ is meaningful only with regard to the &quot;binding strength&quot;, that is to say the value of the association function.</Paragraph> <Paragraph position="18"> Use of the functions m and M for thesaurus construction and updating For an item $x$, only the items related to $x$ in several local contexts must be considered in the thesaurus. 
Thus, it is necessary to keep records of the initial queries in a pseudo-thesaurus. This pseudo-thesaurus registers, for any item $x$, the set of items related to $x$ in one or more local contexts.</Paragraph> <Paragraph position="19"> Let be</Paragraph> <Paragraph position="21"> for $x$ belonging to the set of items $T$ ($T = \cup T_\ell$).</Paragraph> <Paragraph position="22"> For an item $x$ of $T_\ell$, three reference vectors (and their associated functions) can be obtained: $R_x$, $PS_x$ and $T_x$, which are the sets of items $t$ related to $x$ respectively considered in the treated local context, in one or</Paragraph> <Paragraph position="23"> more local contexts kept in the pseudo-thesaurus, and in the global context kept in the thesaurus.</Paragraph> <Paragraph position="24"> These sets can be void, so several cases can be encountered: 1) $PS_x$ and $T_x$ are not void. The updating process is performed in three steps: Step 1 : In order to know whether the newly calculated liaisons in $R_x$ already exist in another local context, we compare $R_x$ and $PS_x$.</Paragraph> <Paragraph position="25"> Only the common items of these two reference vectors are considered, and we form a temporary reference vector $P_x$ using the function $m$ (formula (4)). In $P_x$ only items from $R_x$ which were previously related to $x$ in at least one context are retained. The stronger connections are decreased (see Table I) because we can suppose they are only</Paragraph> <Paragraph position="27"> the eigenvector $E_{T_x}$ (of $T_x$) is modified using the function $m$ (formula (4)) ; (ii) if the items $t$ in $T_x$ are different from those occurring in $P_x$, then a new reference vector $T_x$ is constructed by combining the values of the functions $\mu_{T_x}$ and $\mu_{PS_x}$ using $M$ (formula (5)). 
</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Remarks : </SectionTitle> <Paragraph position="0"> - We do not calculate the new association function between two items for $T_x$ with $m$ (formula (4)), because with $m$ we do not introduce new items related to $x$ in the thesaurus when new items appear in several local contexts.</Paragraph> <Paragraph position="1"> - The function $M$ uses both the common and the non-common items and introduces in the thesaurus the new items which are related to $x$ in at least two local contexts.</Paragraph> <Paragraph position="2"> Step 3 : Pseudo-thesaurus updating The pseudo-thesaurus updating must take into account the new items occurring in $R_x$. The new association function for $PS_x$ is calculated from the association function $\mu_{R_x}$ and the old association function $\mu_{PS_x}$ using $M$ (formula (5)).</Paragraph> <Paragraph position="4"> This case corresponds to the situation where $x$ has never appeared in any local context. We create the reference vectors $PS_x$ in the pseudo-thesaurus and $R_x$ with the association function $\mu_{R_x}$ ($PS_x = R_x$). No information about $x$ is kept in the thesaurus.</Paragraph> <Paragraph position="6"> This corresponds to the case where $x$ has already appeared in only one local context. If $R_x \neq \emptyset$, then we can build the initial reference vector $T_x$ in the thesaurus. We use the association function $m$ (formula (4)), calculated from the values of the association functions $\mu_{R_x}$ and $\mu_{PS_x}$ (respectively contained in the eigenvectors $E_{R_x}$ and $E_{PS_x}$). The present experimentation exhibits, among the items related to $x$ in $T_x$ (initial step), local synonyms, some global synonyms and many parasitic items. After a few thesaurus updatings the values of the association function for parasitic items rapidly decrease, and the values for local and global synonyms increase. It is clear that such a thesaurus becomes reliable only after a large number of queries. 
In such a situation new updating procedures might be considered so that new parasitic items are not introduced in $T_x$ (thus breaking the stability of $T_x$).</Paragraph> <Paragraph position="7"> Global treatment of thesaurus Let $T$ be the large set of items registered in the thesaurus. In order to classify $T$ (i.e. to split $T$ into classes of similar items), we consider the couple of reference vectors $T_x$ and $T_y$ (so $E_{T_x}$ and</Paragraph> <Paragraph position="9"> Let $r(x,y)$ be a similarity measure :</Paragraph> <Paragraph position="11"> range is [0,1].</Paragraph> <Paragraph position="12"> We can use an association matrix (i.e. a term-term matrix) between items and find a partition of $T$ into equivalence classes. However, this method hardly applies to a great many items and does not seem realistic for a large-scale dictionary (6000 or 10000 items, for example), which is common in the information retrieval field. To overcome this drawback, we may try to build up the global association matrix from the local ones.</Paragraph> <Paragraph position="13"> Some ideas have been suggested 2 using fuzzy set theory 6, 13, but these are still theoretical approaches.</Paragraph> <Paragraph position="14"> Feedback query processing A number of papers address feedback query processing \],v, \]2 and our approach is similar. We intend to adopt the following strategy, though we lack practical results showing a better &quot;score&quot; on queries.</Paragraph> <Paragraph position="15"> After a query we therefore have a set $R_s$ of items related to $s$ (for each $s \in T_\ell$) and a partition of $T_\ell$ into equivalence classes $F_j$. In the thesaurus we might have both a set $T_s$ (items related to $s$) and a partition of the global set $T$ into equivalence classes $C_i$. Several strategies can be used; they are detailed in another paper 4. We can use only the local context, only the global context, or both global and local contexts. 
We summarize some of the solutions below : 1) use of only global context A query is automatically generated with $t$ instead of $x$ when $t$ belongs to the reference vector $T_x$ and $\mu_{T_x}(x,t)$ is greater than or equal to a threshold $\varepsilon$. If the user agrees, a new query is generated with $t$ when $x$ and $t$ are equivalent in the thesaurus. 2) use of both local and global context When an item $t$ is considered as &quot;similar&quot; to $x$ both in the local context ($R_x$) and in the global context ($T_x$) and $\mu_{R_x}(x,t)$ N $\mu_{T_x}(x,t)$, $t$ automatically replaces $x$ in the query. When $R_x$ and $T_x$ have common items, we can propose to the user new queries with an item $t$ appearing in $T_x$ but not in $R_x$ ($\mu_{T_x}(x,t) \geq \varepsilon$).</Paragraph> <Paragraph position="16"> As previously mentioned, we can use the same strategy with the local equivalence classes $F_j$ and the global equivalence classes $C_i$ (automatic feedback query processing with $x \in C_i \cap F_j$, and under user control with $x \in C_i$ but $x \notin C_i \cap F_j$ and $C_i \cap F_j \neq \emptyset$).</Paragraph> <Paragraph position="17"> In this last case, we can expect global synonymies to retrieve new documents that were originally left out.</Paragraph> <Paragraph position="18"> From the previous analysis, it seems that the best strategies should be those using both local and global contexts, but this needs to be verified. 
Conclusions We conclude from the present experimentation on a small number of French texts that the thesaurus updating method should give horizontal thesaurus relations.</Paragraph> <Paragraph position="19"> Moreover, unexpected relations between items should appear in the thesaurus, that is, associations which strongly reflect the content of the corpus and which could not be established and enhanced a priori.</Paragraph> <Paragraph position="20"> The methodology presented above does not exclude further intervention on the thesaurus to refine semantic information in particular cases, such as modifying values of the association function for some items or enriching the definition of synonyms. Our next goal for such a design of the thesaurus is twofold : 1) we wish to make non-boolean queries possible through the use of fuzzy keywords and a subsequent improvement of dialogue ; 2) we wish to cluster documents with a dynamic indexing mechanism.</Paragraph> </Section> </Section> </Paper>