<?xml version="1.0" standalone="yes"?> <Paper uid="E85-1017"> <Title>for Two Collections</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> ON THE REPRESENTATION OF QUERY TERM RELATIONS BY SOFT BOOLEAN OPERATORS </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="116" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The language analysis component in most text retrieval systems is confined to a recognition of noun phrases of the type normally included in back-of-the-book indexes, and an identification of related terms included in a preconstructed thesaurus of quasi-synonyms. Even such a restricted language analysis is fraught with difficulties because of the well-known problems in the analysis of compound nominals, and the hazards and cost of constructing word synonym classes valid for large text samples.</Paragraph> <Paragraph position="1"> In this study an extended (soft) Boolean logic is used for the formulation of information retrieval queries; this logic is capable of representing both the use of compound noun phrases and the inclusion of synonym constructions in the query statements. The operations of the extended Boolean logic are described, and evaluation output is included to demonstrate the effectiveness of the extended logic compared with that of ordinary text retrieval systems.</Paragraph> <Paragraph position="2"> 1. Linguistic Approaches in Information Retrieval It is possible to classify the various automatic text processing systems by the depth and type of linguistic analysis needed for their operations. Sophisticated language understanding components are believed to be essential to carry out automatic text transformations such as text abstracting and text translation. 
\[1,14,24\] Complete language understanding systems are also needed in automatic question-answering where direct responses to user queries are automatically generated by the system. \[11\] On the other hand, relatively less sophisticated language analysis systems may be adequate for bibliographic information retrieval, where references as opposed to direct answers are retrieved in response to user queries. \[21\] In bibliographic retrieval, the content of individual documents is normally represented by sets of key words, or key phrases, and only a few specified term relationships are recognized using preconstructed dictionaries or thesauruses.</Paragraph> <Paragraph position="3"> Department of Computer Science, Cornell University, Ithaca, New York 14853. This study was supported in part by the National Science Foundation under grant IST 83-16166.</Paragraph> <Paragraph position="4"> Even in this relatively simplified environment one does not normally undertake a linguistic analysis of any scope. In fact, syntactic and semantic analysis have been used in bibliographic information retrieval only under special circumstances: to analyze query phrases \[22\], to process structured text samples of a certain kind \[7,15\], or finally to process texts in severely restricted topic areas. \[2\] Where special conditions do not obtain, the preferred approach in information retrieval has been to use statistical or probabilistic criteria for the generation of the content identifiers assigned to documents and search queries. Obviously, not all terms are equally useful for content identification. 
According to the term discrimination theory, the following criteria are of importance in this connection \[16\]: a) terms which occur with high frequency in the documents of a collection are not preferred for content representation because such terms are too broad to distinguish the documents from each other; b) terms which occur with very low frequency in the collection are also not optimal, because such terms affect only a very small fraction of documents; c) the best terms tend to be low-to-medium frequency entities which can be produced by taking single terms that exhibit the required frequency characteristics; alternatively, it is possible to obtain medium frequency entities by refining high frequency terms, thereby rendering them narrower, or by broadening low frequency terms.</Paragraph> <Paragraph position="5"> In many operational information situations, the term broadening and narrowing operations are effectively carried out by using formulations in which the terms are connected by Boolean operators. The use of Boolean logic in retrieval is discussed in more detail in the remainder of this note.</Paragraph> </Section> <Section position="3" start_page="116" end_page="120" type="metho"> <SectionTitle> 2. Extended Boolean Logic in Information Retrieval </SectionTitle> <Paragraph position="0"> It is customary to express information search requests by using Boolean formulas that include the operators and, or, and not. Of particular interest in a linguistic context are the and and or operators:</Paragraph> <Paragraph position="2"> The and-operator is a device for specifying a compulsory phrase where all terms in the and-clause must be present to affect the retrieval operation. Thus a query statement such as &quot;information and retrieval&quot; is used to represent the compound nominals &quot;information retrieval&quot;, or &quot;retrieval of information&quot;. 
The and-operator is used as a refining device since a broad term such as &quot;information&quot; is made more specific when it is incorporated in an and-clause.</Paragraph> <Paragraph position="3"> The or-operator, on the other hand, is a device for specifying a group of synonymous terms, or alternatively, a thesaurus class of terms in which all terms are treated as coequal. That is, any one term in an or-clause will cause retrieval of the corresponding document, and each term is assumed to be as good as any other term.</Paragraph> <Paragraph position="4"> The or-operator is a broadening device because each or-clause has a broader scope than any individual clause component.</Paragraph> <Paragraph position="5"> While the logical operators and and or are used universally in retrieval environments, the assumptions of Boolean logic are not verified in normal text processing environments. Strict synonyms occur relatively rarely in query formulations or in the texts of documents, so that the normal or-clause does not reflect a practical situation. In fact, it should be possible to make distinctions between more or less important terms in an or-clause; furthermore, or-clauses should be usable to represent collections of loosely related terms instead of only strict synonyms. Analogously, it should be possible to relax the compulsory nature of the phrase components included in an and-clause, and distinctions ought to be introducible between phrase components of greater or lesser importance.</Paragraph> <Paragraph position="6"> In summary, the uncertain (fuzzy) nature of the term relationships which obtain in natural language is not reflected by the rules of ordinary Boolean logic. \[25\] Instead a relaxed type of logic is needed which is capable of broadening or narrowing the term units, while also providing for distinctions in term importance and for the specification of fuzzy or soft term relationships. 
Such an extended logical system was introduced recently with the following main properties: \[17-18\] a) The extended logic system distinguishes among more or less important terms in both queries and documents by using weights, or importance indicators attached to the terms. Thus instead of terms A and B, the system processes terms (A,a) and (B,b) respectively, where a and b designate the weights of terms A and B.</Paragraph> <Paragraph position="8"> b) The extended system simulates the linguistic characteristics of more or less strict synonyms, by attaching a p-value to each or-operator that specifies the degree of strictness of the corresponding operator.</Paragraph> <Paragraph position="9"> The higher the p-value attached to an operator, the closer is the interpretation of that operator in accordance with the rules of ordinary Boolean logic. On the other hand, the smaller the p-value, the more relaxed is the interpretation of the or-operator.</Paragraph> <Paragraph position="10"> c) The extended system also simulates the linguistic characteristics of more or less strict phrase attachment, by using a p-value for each and-operator. The higher the p-value, the more similar the corresponding operator will be to the compulsory Boolean and. Correspondingly, the smaller the p-value, the more relaxed is the interpretation of the and-operator.</Paragraph> <Paragraph position="11"> d) The extended system (unlike the ordinary Boolean system) provides ranked output of the stored documents in presumed decreasing order of importance of a given item with respect to a query. In addition, the extended system provides much better retrieval output than systems based on conventional Boolean logic. Experimentally, improvements of 100 to 200 percent in retrieval effectiveness have been noted for the extended logic over the conventional Boolean system. \[17,18\] It is not possible in the present context to furnish the details of the operation of the extended logic system. 
The following results are, however, relatively easy to prove: \[17\] a) When p-values equal to infinity are used, the extended system produces results identical to those of the conventional Boolean logic systems; b) When the p-values are reduced from infinity, the distinctions between phrase components (and) and synonym specification (or) become more and more blurred; c) When p reaches its lower limit of 1, the distinction between and and or operators is completely lost, and the system reduces the queries (A and B) and (A or B) to a system with terms (A,B), without any relationship specification between terms A and B.</Paragraph> <Paragraph position="12"> Using linguistic analogues, the following examples illustrate the operations of the extended logic system. The p-value attached to operators is shown in each case as an exponent: i) (A and^∞ B) interpreted as ALL OF (A,B) (strict phrase); ii) (A and^3 B) interpreted as MOST OF (A,B) (fuzzy phrase); iii) (A and^1 B) interpreted as SET (A,B) (more matching terms are worth more than fewer matching terms); iv) (A or^1 B) identical to (A and^1 B) interpreted as SET (A,B); v) (A or^3 B) interpreted as SOME OF (A,B) (fuzzy synonym); vi) (A or^∞ B) interpreted as ONE OF (A,B) (strict synonym). 3. Experimental Results The operations of the extended logic system are illustrated by using a collection of 3204 computer science articles (titles and abstracts) originally published in the Communications of the ACM (the CACM collection), and a collection of 1460 articles in library science obtained from the Institute for Scientific Information (the CISI collection). Table 1 shows average performance figures for 7 selected queries used with CACM, and 4 selected queries for CISI. The performance in Table 1 is stated in terms of the search precision at various recall points averaged over the set of search requests in use. 
\[19\] The data of Table 1 indicate that the conventional Boolean searches (p = ∞, Boolean) produce by far the worst performance for both collections. Performance improvements between 100 and 200 percent are obtained by relaxing the interpretation of the Boolean operators (that is, by using lower p-values). A distinction must be made between taking into account only single term matches (p-values are equal to 1), and giving extra weight to term phrase matches (A and B and ...), and to synonym set matches (A or B or ...), when p-values higher than 1 must be used. The results of Table 1 show that for the CACM queries the best overall policy is a complete softening of the Boolean operators down to p = 1. Evidently not many of the quasi-Boolean phrases included in the CACM queries were also present in the document abstracts. For the ISI queries, on the other hand, a 154 percent improvement is produced when p = 1; when the phrase combinations are given extra weight, the improvement in performance jumps to 164 percent for p = 2, and to 182 percent when and- and or-operators are given different values (p_and = 2.5 and p_or = 1.5, respectively).</Paragraph> <Paragraph position="13"> These phenomena are further illustrated in the output of Tables 2 and 3. The comparison between query CACM Q5 and Document 756 is outlined in Table 2. No abstract was available for document 756; hence only the title words could be used in the query-document comparison. As the example shows, only the term &quot;editing&quot; was present in both document title and query. This explains why the single term match (p = 1) produces the best output rank of 5 for this document. Obviously, the sample document is not retrievable by the pure Boolean search (p = ∞) as demonstrated by the simulated retrieval rank of 1667 out of 3204 CACM documents.</Paragraph> <Paragraph position="14"> Table 3 shows an example where matching phrases make a substantial difference in the retrieval results. 
The matched phrases in Document 1410 are given a double underline in Table 3, whereas matched single terms have a single underline. The output of Table 3 shows that when the single terms alone are considered, document 1410 is retrieved with a rank of 53 in response to query ISI Q33. When the phrase matches are given extra weight (p = 2, or p_and = 5, p_or = 2), the retrieval rank improves to 2 and 7, respectively. These results demonstrate that the conventional Boolean logic does not adequately reflect the tentative and uncertain nature of the relations between terms in the language. When a relaxed interpretation of Boolean logic is used, the correspondence with the fuzzy nature of linguistic relations is much greater and dramatic improvements in term matching and hence retrieval effectiveness are obtained.</Paragraph> <Paragraph position="15"> 4. Relationship of Extended Boolean Model with The extended Boolean system is based on the use of certain term relationships--notably term phrases and synonymous constructions. These relations are, however, interpreted flexibly, reflecting the uncertain nature of term relations in the language. In the extended system, soft Boolean queries are easy to formulate, and methods exist for a completely automatic formulation of the soft queries, given only some basic information about user needs. \[20\] Analogously, initial queries may be automatically reformulated, following an initial search operation, based on information obtained from the user about the relevance of previously retrieved documents. \[18\] The current development may then be related to other retrieval models that incorporate term relations, and to systems with advanced user interfaces. Term relations of a statistical or probabilistic nature are included in the probabilistic retrieval model; more general linguistic relations are used in systems that include a natural language analyzer. 
In the probabilistic retrieval system, the documents are ranked in decreasing order of the probabilistic expression P(x|rel)/P(x|nonrel) where P(x|rel) and P(x|nonrel) represent the occurrence probabilities of an item x in the relevant and non-relevant document subsets, respectively. \[23\]</Paragraph> <Paragraph position="17"> (Table 1 legend, fragment: ... taken into account, weighted terms; p = 2, some and and or combinations taken into account, weighted terms.)</Paragraph> <Paragraph position="19"> (Table 3, partially legible.) ISI Q33 query (natural language): Retrieval systems providing the automated transmission of information to the user from a distance. Document 1410: The use of (illegible) to provide rapid transfer of (illegible) has great appeal. Because of a growing interest in the applicability of this technology to (illegible), a grant was provided to the Institute of Library Research to conduct an experiment in equipment in a working library situation.</Paragraph> <Paragraph position="20"> The feasibility of (illegible) for interlibrary use was explored. (Illegible) is provided on the performance, cost, and utility of (illegible) for libraries.</Paragraph> <Section position="1" start_page="119" end_page="120" type="sub_section"> <SectionTitle> Retrieval Ranks for Doc 1410 </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> (Table 3 caption: Illustration for Phrase Matching Process.) The required occurrence probabilities of the various documents depend on the occurrence probabilities in the respective document subsets of the individual terms x_i, x_j, x_k, etc. When term relationships are to be used, the occurrence probabilities must also be available for term pairs--for example, P(x_ij|rel) and P(x_ij|nonrel); for term triples, P(x_ijk|rel) and P(x_ijk|nonrel); and so on, for higher order term combinations.</Paragraph> <Paragraph position="3"> Unfortunately, the experiences accumulated with the probabilistic retrieval model show that enough information is rarely available in practical situations to render possible an accurate estimation of the needed probabilities. In practice, it then becomes necessary to avoid the use of term dependencies by assuming that all terms occur independently. The probabilistic model is then effectively equivalent to a vector processing system that does not include any term relations. \[3\] When linguistic analysis methods are used to analyze query and document content, it is in theory possible to provide a precise representation of query and document content by including a great variety of term relations in the search and retrieval operations. In particular, complex indexing units such as noun and prepositional phrases might then be assigned to the information items for content representation. Unfortunately, a complete treatment of noun phrases by automatic means remains elusive in view of the multiplicity of different term relations that are expressible by noun and prepositional phrases. An automatic recognition of semantically equivalent noun phrases of the kind needed for the construction of classification schedules is also exceedingly difficult. For practical purposes, the use of term relations that is theoretically possible in the probabilistic and language-based retrieval models is thus of questionable help in general retrieval situations where topic areas and linguistic complexities are not severely restricted. 
The Boolean model which includes only a general phrase (denoted by the Boolean and) and a general synonym relation (denoted by the Boolean or) may not therefore represent an intolerable simplification when measured against the realistically possible, alternative methodologies.</Paragraph> <Paragraph position="4"> Considering now the user-system interfaces that have been designed for use in information retrieval, the following types of development may be distinguished.</Paragraph> <Paragraph position="5"> a) The use of minicomputer-based file accessing methods providing simple access to specific data bases, or to specific file catalogs. Such systems are often menu-driven and offer a conversational style, permitting the user to consult a given term classification or thesaurus, and to browse through the documents corresponding to a given query formulation. \[4,6\]</Paragraph> <Paragraph position="6"> b) The construction of large, sophisticated systems designed to provide unified interface methods to a variety of data bases implemented on a single retrieval facility, or to data bases available on a multiplicity of different retrieval systems. \[12,13\] A common command language may then be provided by the interface system, in addition to tutorial and help provisions, or even diagnostic procedures able to detect, and possibly to correct, questionable search strategies.</Paragraph> <Paragraph position="7"> c) The use of interface methods based on fancy graphic displays that make it possible to exhibit vocabulary schedules, command sequences, and messages that may be helpful during the course of the search operations. \[5,10\]</Paragraph> <Paragraph position="8"> d) The simulation of automatic &quot;search experts&quot; that are able to translate arbitrary queries in natural language by using stored knowledge bases for query analysis and search purposes. Such automatic experts may perform the work normally assigned to human search intermediaries, in the sense that a 
conversational dialog system ascertains user requirements and chooses search strategies corresponding to particular user needs. \[8,9\] In each case the automatic interface system is designed to help the user to access a possibly unfamiliar retrieval system and to pick a useful search strategy. The operational retrieval system that actually performs the searches is normally not modified by the interface system. The extended Boolean system described in this note differs from these other developments because the conventional search system is actually modified by replacing a complete Boolean match by a fuzzy query-document comparison system. Furthermore, the burden placed on the user during the query construction process is kept as small as possible.</Paragraph> <Paragraph position="9"> The minicomputer-based facilities and the fancy graphic display systems may be used in conjunction with the extended Boolean processing, since the two types of developments are somewhat independent of each other. The same is true of the systems that provide common interfaces to multiple data bases. The retrieval expert capable of interacting with the user in natural language may not be usable in practical situations for some years to come, unless severe restrictions are imposed on the topic areas under consideration, and the freedom of formulating the search requests. An interface system of more limited scope may be more effective under current circumstances than the automated &quot;expert&quot; of the future.</Paragraph> </Section> </Section> </Paper>
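The limiting cases stated in Section 2 (infinite p reproduces ordinary Boolean logic; p = 1 erases the difference between and and or) can be checked numerically. The sketch below assumes the p-norm similarity formulation usually associated with the extended Boolean model. This note does not state the formulas, so the function names `p_norm_or` and `p_norm_and`, the argument layout, and the sample weights are illustrative assumptions and not the authors' implementation.

```python
import math


def p_norm_or(pairs, p):
    """Soft OR over (query_weight, document_weight) pairs.

    Assumed p-norm form: ((sum a^p * d^p) / (sum a^p)) ** (1/p),
    with document weights d in [0, 1] and p at least 1.
    """
    if math.isinf(p):
        # Limit as p grows without bound: the largest weighted term dominates.
        return max(a * d for a, d in pairs) / max(a for a, _ in pairs)
    num = sum((a ** p) * (d ** p) for a, d in pairs)
    den = sum(a ** p for a, _ in pairs)
    return (num / den) ** (1.0 / p)


def p_norm_and(pairs, p):
    """Soft AND: one minus the p-norm distance from the all-terms-present point."""
    if math.isinf(p):
        # Limit as p grows without bound: the worst-matching term dominates.
        return 1.0 - max(a * (1.0 - d) for a, d in pairs) / max(a for a, _ in pairs)
    num = sum((a ** p) * ((1.0 - d) ** p) for a, d in pairs)
    den = sum(a ** p for a, _ in pairs)
    return 1.0 - (num / den) ** (1.0 / p)


# Hypothetical document: term A present (weight 1.0), term B absent (weight 0.0);
# both query weights are set to 1.0.
doc = [(1.0, 1.0), (1.0, 0.0)]
for p in (1, 2, 3, math.inf):
    print(p, round(p_norm_and(doc, p), 3), round(p_norm_or(doc, p), 3))
```

With these inputs the two operators coincide at p = 1 (both give 0.5) and separate into the strict Boolean values 0 and 1 at infinite p, which matches results a) and c) of Section 2; intermediate p-values fall between the two extremes.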