<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1090">
  <Title>On the Use of Term Associations in Automatic Information Retrieval</Title>
  <Section position="2" start_page="380" end_page="381" type="metho">
    <SectionTitle>
2. Associative Text Processing Methods
A) Thesaurus Operations
</SectionTitle>
    <Paragraph position="0"> A thesaurus is a word grouping device which provides a hierarchical and/or a clustered arrangement of the vocabulary for certain sub-ject areas. Thesauruses are used in text processing for three main purposes \[2\]: a) as authority lists where the thesaurus normalizes the indexing vocabulary by distinguishing terms that are allowed as content identifiers from the remainder of the vocabulary; null b) as grouping devices where the vocabulary is broken down into classes of related, or synonymous terms, as in the traditional RogetWs thesaurus; c) as term hierarchies where more general terms are broken down into groups of narrower terms, that may themselves be broken down further into still narrower groups.</Paragraph>
    <Paragraph position="1"> When a thesaurus is available for a particular subject area, each term found in a document can be used as an entry point into the thesaurus, and additional (synonymous or hierarchically related) terms included in the same thesaurus class as the original can be supplied. Such a thesaurus operation normalizes the vocabulary and provides additional opportunities for matches between query and document vocabularies. The vocabulary expansion tends to enhance the search recall (the proportion of relevant materials actually retrieved as a result of a search process).</Paragraph>
    <Paragraph position="2"> When the subject area is narrowly circumscribed and knowledgeable subject experts are available, useful thesaurus arrangements can he manually constructed by human experts that may provide substantial enhancements in retrieval effectiveness. Table 1 shows the average search precision (the proportion of retrieved materials actually relevant) obtained at certain fixed recall points for a collection of 400 documents in engineering used with 17 search requests. In that case, the performance of a manually constructed thesaurus (the Harris Three thesaurus) is compared with a content analysis system based on weighted word stems extracted from document and query texts. The output of Table 1 shows that at the high recall end of performance range, the thesaurus provides much better zetrieval output than the word stem process.</Paragraph>
    <Paragraph position="3">  While the use of thesauruses is widely advocated as a means for normalizing the vocabulary of document texts, no consensus exists about the best way of constructing a useful thesaurus. It was hoped early on, that thesaurses could be built automatically by studying the occurrence characteristics of the terms in the documents, and grouping into common thesaurus classes those terms that co-occur sufficiently often in the text of the documents: \[4\] &amp;quot;the statistical material that may be required in the manual compilation of dictionaries and thesauruses may be derived from the original texts in any desired form and degree of detail.&amp;quot; Later is was recognized that thesauruses constructed by using the occurrence characteristics of the vocabulary in the documents of&amp;quot; a collection do not in fact provide generally valid paradigmatic temn relations, but identify instead locally valid syntagmatic relations derivable from the particular document environment. \[5\] To utilize the conventional paradigmatic te~l relations existing in particular sub-ject areas, the vocabulary arrangements must effectively be constructed by subject experts using largely ad-hoc procedures made up for each particular occasion. The thesaurus method is therefore not generally usable in operational environments.</Paragraph>
    <Paragraph position="4"> B) Automatic Term Associations While generally valid thesauruses are difficult to build, locally valid f~Km~di~aILig/l can be generated automatically by making use of similarity measurements between pairs of te~mls based, for example, on the number of documents in which the terms co-occur in the documents of a collectiondeg The number of common sentences in which a pair of words can be found may also be taken into account, as well as some measure of proximity between the words in the various texts. Using similarity measurements between word pairs, term association maps can be constructed, and these may be displayed and used by the search personnel to formulate useful query statements, and to obtain expanded document representations. \[6,7\]  A sample assignment of terms to documents is Shown in the matrix of Fig. l(a). Fig. l(b) shows the corresponding document-term graph where a line between term T. and document D.</Paragraph>
    <Paragraph position="5"> represents the correspondin~ term assignmen~ appears in Fig. l(b). Given the assignment of Figs. l(a) and (b). term associations may be derived by grouping sets of terms appearing in term4J4 similar contexts. For example, and T_ may be grouped because these te appea~ jointly in documents D I and D2; similarity terms T 1 and T 6 appear in aocuments D 1 and D 4. The grouping operations may be used to obtain a term association map of the kind shown in Fig. l(c), where associated temDs are joined by lines on the map.</Paragraph>
    <Paragraph position="6"> The operations of a typical associativ~ system are illustrated in Fig. 2.</Paragraph>
    <Paragraph position="7"> \[7\] The original query words are listed on the left-hand side of Fig. 2, and the derived associated terms are shown on the right. The value of a given associated term--for example, &amp;quot;Inter u national Organizations&amp;quot;-- is computed as the sum of the term association values between the given term and all origilml query tez~s (0.5 for ~IUnited Nations&amp;quot; plus 0.4 for &amp;quot;Pan American Union&amp;quot; in the example of Fig. 2). Finally the retrieval value of a document is computed as the sum of the common term association values for all matching terms that are present in both queries and documents. Many variations are possible of the basic scheme illustrated in Fig. 2; in each case, the hope is that valid term associations would make it possible to achieve a greater degree of congruence between document representations and query formulations.</Paragraph>
  </Section>
  <Section position="3" start_page="381" end_page="383" type="metho">
    <SectionTitle>
(Fig. 2 residue: sample association values -- UNITED NATIONS 1.5, LEAGUE OF NATIONS 0.5, PAN. AM. UNION 1.5, INT. COOPERATION 0.8)
</SectionTitle>
    <Paragraph position="0"> PAN. AM. UNION 1.5 INT. COOPERATION 0.8  In practice, it is found that the use of term associations can improve the search recall by providing new matches between the term assigned to queries and documents that were not available in the original query and document. In addition, the search precision can also be enhanced by reinforcing the strength of already existing term matches. \[5\] Unfortunately, the experimental evidence indicates that only about 20 percent of automatically derived associations between pairs of terms are semantically significant; the associative indexing process does not therefore provide guaranteed advantages in retrieval effectiveness.</Paragraph>
    <Paragraph position="1"> Table 2 shows a typical evaluation output for a collection of 400 documents in engineering used with 17 search requests. The output of Table 2 shows that the automatic term associations provide an increase in average search precision only at the high recall end of the performance range. Overall, the average search precision decreases by 13 percent for the col- null More recently other vocabulary expansion experiments have been conducted using associated terms derived by statistical term co-occurrence criteria. \[9-10\] Once again, the evaluation results were disappointing: \[10\] hOur results on query expansion using the NPL data are disappointing. We have not been able to achieve any significant improvements over nonexpansion. We have repeated previous experiments in which the query was expanded, and the resulting set of search terms then weighted... Once again the results have been conflicting..,&amp;quot; The conclusions derived from the available evidence indicate that the vocabulary expansion techniques which add to the existing content identifiers related terms specified in a thesaurus, or derived by term co-occurrence measurements, do not provide methods for improving retrieval effectiveness. Generally valid thesauruses for large subject areas are difficult to generate and the automatic term co-occurrence procedures do not offer adequate quality control. Efforts to enhance the recall performance of search systems must therefore be based on different techniques designed to generate indexing vocabularies of broader scope, including especially word stem generation and suffix truncation methods.</Paragraph>
    <Paragraph position="3"> The vocabulary expansion methods described up to now are designed principally to improve search recall. Search precision may be enhanced by using narrow indexing vocabularies consisting largely of JL&amp;I~B p~ replacing the normally used single terms. Thus &amp;quot;computer science&amp;quot; or &amp;quot;computer programming&amp;quot; could replace a broader term such as 'tcaleu1~tor&amp;quot; or &amp;quot;computer&amp;quot;. The recognition and aesig u ent of term phrases poses much the same probems as the previously described generation of term associations and the expansion of indexing vocabularies. In particular, an accurate determination of useful te~,l phrases, and the rejection of extraneous phrases, must he based on syntactic analyses of query and document texts suppl~nented by semen= tic components valid for the subject areas under consideration. Unfortunately, complete linguistic analyses of topic areas of reasonable scope are unavailable for reasons of efficiency as well as effectiveness. In practice, it is then necessary to fall back on simpler phrase generation methods in which phrases are identified as sequences of co-occurring terms with appropriate statistical and/or syntactic properties. In such simple phrase generation environments quality control is, however, difficult to achieve.</Paragraph>
    <Paragraph position="4"> The following phrase generation methods are of main interest: a) statistical methods where each phrase \]~ (the main phrase component) bas a stated minimal occurrence frequency in the texts under consideration, and each phrase exhibits another stated minimal occurrence frequency, and the distance in number of intervening words between phrase heads and phrase components is limited to a stated number of words; b) a simple syntactic pattern matching method where a dictionary search method is used to assign syntactic indicators to the text elements, and phrases are then recognized as sequences of words exhibiting certain previously established patterns of syntactic indications (e.g. adjective-noun-noun, or preposition-adjective-noun); \[11-12\] c) a more complete syntactic analysis method supplemented if possible by appropriate semantic restrictions to control the variety of permitted syntactic phrase constructions for the available texts. \[13,14\] When statistical phrase generation methods are used, a large number of useful phrases can in fact be identified, together unfortunately with a large number of improper phrases that are difficult to reject on formal grounds. For example, given a query text such as &amp;quot;h~nophilia and christmas disease, especially in regard to the specific complication of pseudotumor formation (occurrence, pathogenesis, treatment, prognosis)' it is easy to produce correct phrase combinations such as &amp;quot;christmas disease&amp;quot; and &amp;quot;pseudotumor formation&amp;quot;. At the same time the statist:ical phrase formation process produces inappropriate patterns such as &amp;quot;formation occurrence '~ and &amp;quot;complication formation&amp;quot;deg \[15\] Overall a statistical phrase formation process will be of questionable usefulness.</Paragraph>
    <Paragraph position="5"> Table 3 shows a comparison of the average search precision results for certain fixed recall values between a standard single term indexing system, and a system where the single terms are supplemented by statistically determined phrase combinations. The output of Table 3 for four different document collections in computer science (CACM), documentation (CISI)~ medicine (MED) and aeronautics (CRAN) shows that the phrase process affords modest average i~provements for three collections out of four.</Paragraph>
    <Paragraph position="6"> \[15\] However, the improvement is not guaranteed, and is in any case limited to a few percentage points in the average precision.</Paragraph>
    <Paragraph position="7"> The evaluation results available for tile syntax-based methods are not much more encouraging. \[16\] The basic syntactic analysis approach must be able to cope with ordinary word ambiguities (Imllp base, army base, baseball base), the recognition of distinct syntactic constructs with identical meanings, discourse problems exceeding sentence boundaries such as pronoun referents from one sentence to another,  and the difficulties of interpreting many complex meaning units in ordinary texts. An illustration of the latter kind is furnished by the phrase &amp;quot;high frequency transistor oscillator&amp;quot;, where it is important to avoid the interpretation &amp;quot;high frequency transistor&amp;quot; while admitting &amp;quot;transistor oscillator&amp;quot; and &amp;quot;high frequency oscillator&amp;quot;. A sophisticated syntactic analysis system with substantial semantic components was unable in that case to reject the extraneous interpretations &amp;quot;frequency transistor oscillators which are high (tall)&amp;quot; and &amp;quot;frequency oscillators using high (tall) transistors&amp;quot;. \[17\] In addition to the problems inherent in the language analysis component of a phrase indexing system, a useful text processing component must also deal with phrase classification, that is the recognition of syntactically distinct patterns that are semantically identical (&amp;quot;computer programs&amp;quot;, &amp;quot;instruction sets for computers&amp;quot;, &amp;quot;programs for calculating machines&amp;quot;). The phrase classification problem itself raises complex problems that are not close to solution.</Paragraph>
    <Paragraph position="8"> \[18\] In summary, the use of complex identifying units and term associations in automatic text processing environments is currently hampered by difficulties of a fundamental nature. The basic theories needed to construct useful term grouping schedules and thesauruses valid for particular subject areas are not sufficiently developed. As a result, the effectiveness of associative retrieval techniques based on term grouping and vocabulary expansion leaves something to be desired. The same is true of the syntactic and semantic language analysis theories used to generate a large proportion of the applicable complex content descriptions and phrases, and to reject the majority of extraneous term combinations.</Paragraph>
    <Paragraph position="9"> The question arises whether any retrieval situations exists in which it is useful to go beyond the basic single term text analysis methodology, consisting of the extraction of single terms from natural language query and document texts. This question is examined in the remaining section of this note.</Paragraph>
  </Section>
  <Section position="4" start_page="383" end_page="384" type="metho">
    <SectionTitle>
3. The Usefulness of Complex Text Processing
</SectionTitle>
    <Paragraph position="0"> Three particular text processing situations can be identified where term association technlques have proved to be useful. The first one is the well-known ~ ~ process where initial search operations are conducted with preliminary query formulations obtained from the user population. Following the retrieval of certain stored text items, the user is asked to respond by furnishing relevance assessments for some of the previously retrieved items; these relevance assessments are then used by the system to construct new, improved query formulations which may furnish additional, hopefully improved, retrieval output. In particular, the query statements are altered by adding terms extracted from previously retrieved items that were identified as relevant to the user's purposes, while at the same time removing query terms included in previously retrieved items designated as nonrelevant.</Paragraph>
    <Paragraph position="1"> The relevance feedback methodology represents an associative retrieval technique, since new query terms are obtained from certain designated documents that hopefully are related to the originally available formulations.</Paragraph>
    <Paragraph position="2"> Relevance feedback techniques have been used with vector queries formulated as sets of search terms \[9, 19-20\], and more recently with Boolean queries. \[21\] The effectiveness of the feedback procedure has never been questioned.</Paragraph>
    <Paragraph position="3"> Table 4 shows typical evaluation output for four different document collections in terms of average search precision at ten recall points (from a recall of 0.I to a recall of 1.0 in steps of 0.I) averaged over the stated number of user queries. The output of Table 4 applies to Boolean queries with binary weighted terms. \[21\] The improvements in retrieval precision due to the user feedback process ranges from 22% to 110% for a single search iteration. When the feedback process is repeated three timesj the improvement in search precision increases to 63% to 207%, Evidently, the user relevance information which applies to particular queries at particular times makes it possible to find a sufficient number of interesting term associations to substantially improve the retrieval output.</Paragraph>
    <Paragraph position="4"> A second possibility for generating improved retrieval output consists in limiting the analysis effort to the ~ ~ 19~ /~ instead of the document texts. In a recent study, term phrases were first extracted from natural language query texts using a simple, manually controlled, syntactic analysis process. These query phrases were then recognized in document texts bY a rough pattern  matching procedure distinguishing pairs and triples of terms occurring in the same phrases of documents, and pairs and triples of terms occurring in the same sentences of documents. \[22\] Whenever a phrase match is obtained between a query phrase and a document text, the retrieval weight of the document is appropriately increased.</Paragraph>
    <Paragraph position="5"> An evaluation of such a manually controlled syntactic phrase recognition system based on query statement analysis reveals that substantial improvements in retrieval effectiveness are obtainable for the phrase assignments, compared with the single term alternatives. Table 5 shows average search precision values at five recall levels for 25 user queries used with the CACM collection in computer science. \[22\] On average the query analysis system raises the search precision by 32 percent.</Paragraph>
    <Paragraph position="6">  The special processing described up to now is user related in the sense that user query formulations and user relevance assessments are utilized to improve the retrieval procedures.</Paragraph>
    <Paragraph position="7"> The last possibility for the use of complex information descriptions consists in incorporating stored JgRO_~ renresentations covering particular subject areas to enhance the descriptions of document and query content. \[23-25\] Various theories of knowledge representation are current, including for example, models based on the use of frames representing events and descriptions of interest in a given subject.</Paragraph>
    <Paragraph position="8"> Frames designating particular entities may be represented by tabular structures, with open Hslots&amp;quot; filled with attributes of the entities, or values of attributes. Relationships between frames are expressed by using attributed that are themselves represented by other frames, and by adding links between frames. Frame operations can also be introduced to manipulate the knowledge structure when new facts or entities become known, or when changes occur in item relationships, There is some evidence that when the knowledge base needed to analyze the available texts is narrowly circumscribed and limited in scope, useful frame structures can in fact be intellectually prepared to enhance the retrieval operations. \[26\] However, when the needed topic area is not of strictly limited scope, the construction of useful knowledge bases is much less straightforward and the knowledge-based processing techniques become of limited effectiveness. It has been suggested that: in these circumstances, the system user himself might help in building the knowledge structures. \[27\] While this remains a possibility, it is hard to imagine that untrained users can lay out the subject knowledge of interest in particular areas and specify concept relationships such as synonyms, generalizations, instantiations, and cross-references with sufficient accuracy. In any case, no examples exist at the present time where user constructed knowledge bases have proved generally valid for different collections in particular subject areas. In fact, the situation appears much the same as it was thirty years ago: it seems quite easy to build locally valid te~ association systems by ad-hoc means; these tools fail however in somewhat different environments, and do not furnish reliable means for improving text processing systems in general. null For the foreseeable future, text processing systems using complex information identifications and term associations must therefore be limited to narrowly restricted topic areas, or must alternatively be based on simple user inputs, such as ~ocument relevance data, t:hat can be furnished by untrained users without undue hardship.</Paragraph>
  </Section>
class="xml-element"></Paper>