<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1030">
  <Title>&quot;Retrieving Records from a Gigabyte of text on a</Title>
  <Section position="3" start_page="0" end_page="144" type="metho">
    <SectionTitle>
OVERVIEW
</SectionTitle>
    <Paragraph position="0"> A typical (full-text) information retrieval (IR) task is to select documents from a database in response to a user's query, and to rank these documents according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their content, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. A subsequent search process then attempts to match preprocessed user queries against term-based representations of documents, in each case determining a degree of relevance between the two that depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of content for both the documents and the queries.</Paragraph>
    <Paragraph position="1"> In term-based representation, a document (as well as a query) is transformed into a collection of weighted terms, derived directly from the document text or indirectly through thesauri or domain maps.</Paragraph>
    <Paragraph position="2"> The representation is anchored on these terms, and thus their careful selection is critical. Since each unique term can be thought of as adding a new dimension to the representation, it is equally critical to weigh the terms properly against one another so that the document is placed at the correct position in the N-dimensional term space. Our goal here is to have documents on the same topic placed close together, while those on different topics are placed sufficiently apart. Unfortunately, we often do not know how to compute term weights. The statistical weighting formulas based on term distribution within the database, such as tf.idf, are far from optimal, and the assumptions of term independence which are routinely made are false in most cases. This situation is even worse when single-word terms are intermixed with phrasal terms and term independence becomes harder to justify.</Paragraph>
    <Paragraph position="3"> The simplest word-based representations of content, while relatively better understood, are usually inadequate since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In the retrieval experiments with the training TREC database, we noticed that both joint and venture were dropped from the list of terms by the system because their idf (inverted document frequency) weights were too low. In large databases, such as TIPSTER, the use of phrasal terms is not just desirable, it becomes necessary.</Paragraph>
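The effect described above can be sketched numerically. The snippet below is illustrative only: the document frequencies are invented, and the cutoff is hypothetical, but it shows how common words like joint and venture fall below an idf threshold while the phrasal term survives.

```python
import math

# Hypothetical document frequencies in a WSJ-like collection.
N_DOCS = 100_000
doc_freq = {"joint": 55_000, "venture": 48_000, "joint venture": 900}

def idf(term):
    # Standard inverse document frequency: log(N / df).
    return math.log(N_DOCS / doc_freq[term])

# Terms whose idf falls below a cutoff are dropped from the term list,
# mirroring how "joint" and "venture" were discarded while the phrase
# "joint venture" was kept. The cutoff value is an assumption.
IDF_CUTOFF = 1.0
kept = [t for t in doc_freq if idf(t) > IDF_CUTOFF]
```

Here `kept` contains only the phrasal term, since both single words occur in roughly half the collection.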
    <Paragraph position="4"> The challenge is to obtain "semantic" phrases, or "concepts", which would capture underlying semantic uniformity across various surface forms of expression. Syntactic structures are often reasonable indicators of content, certainly better than 'statistical phrases' -- where words are grouped solely on the basis of physical proximity (e.g., "college junior" is not the same as "junior college") -- however, the creation of compound terms makes the term matching process more complex since, in addition to the usual problems of lexical meaning, one must deal with structure (e.g., "college junior" is the same as "junior in college"). In order to deal with structure, the parser's output needs to be "normalized" or "regularized" so that complex terms with the same or closely related meanings would indeed receive matching representations. One way to regularize syntactic structures is to transform them into operator-argument form, or at least head-modifier form, as will be further explained in this paper. In effect, therefore, we aim at obtaining a semantic representation. This result has been achieved to a certain extent in our work thus far.</Paragraph>
    <Paragraph position="5"> Do we indeed need to parse? Our recent results indicate that some of the critical semantic dependencies can in fact be obtained without the intermediate step of syntactic analysis, directly from a lexical-level representation of text. We have applied our noun phrase disambiguation method directly to word sequences generated using part-of-speech information, and the results were most promising. At this time we have no data on how these results compare to those obtained via parsing.</Paragraph>
    <Paragraph position="6"> No matter how we eventually arrive at the compound terms, we hope they will let us capture more accurately the semantic content of a document. It is certainly true that compound terms such as South Africa, or advanced document processing, when found in a document, give us a better idea about the content of that document than isolated word matches. What happens, however, if we do not find them in a document? This situation may arise for several reasons: (1) the term/concept is not there, (2) the concept is there but our system is unable to identify it, or (3) the concept is not explicitly there, but its presence can be inferred using general or domain-specific knowledge. This is certainly a serious problem, since we now attach more weight to concept matching than to isolated word matching, and missing a concept can reflect more dramatically on the system's recall. The inverse is also true: finding a concept where it really is not present makes an irrelevant document more likely to be highly ranked than with a single-word based representation. Thus, while the rewards may be greater, the risks increase as well.</Paragraph>
    <Paragraph position="7"> One way to deal with this problem is to allow the system to fall back on partial matches and single-word matches when concepts are not available, and to use query expansion techniques to supply missing terms. Unfortunately, thesaurus-based query expansion is usually quite ineffective, unless the subject domain is sufficiently narrow and the thesaurus sufficiently domain-specific. For example, the term natural language may be considered to subsume a term denoting a specific human language, e.g., English. Therefore, a query containing the former may be expected to retrieve documents containing the latter. The same can be said about language and English, unless language is in fact a part of the compound term programming language, in which case the association language - Fortran is appropriate. This is a problem because (a) it is a standard practice to include both simple and compound terms in document representation, and (b) term associations have thus far been computed primarily at the word level (including fixed phrases), and therefore care must be taken when such associations are used in term matching. This may prove particularly troublesome for systems that attempt term clustering in order to create "meta-terms" to be used in document representation. In the remainder of this paper we discuss particulars of the present system and some of the observations made while processing TREC-4 data. While this description is meant to be self-contained, the reader may want to refer to previous TREC papers by this group for more information about the system.</Paragraph>
  </Section>
  <Section position="4" start_page="144" end_page="144" type="metho">
    <SectionTitle>
OVERALL DESIGN
</SectionTitle>
    <Paragraph position="0"> Our information retrieval system consists of a traditional statistical backbone (NIST's PRISE system \[2\]) augmented with various natural language processing components that assist the system in database processing (stemming, indexing, word and phrase clustering, selectional restrictions), and translate a user's information request into an effective query. This design is a careful compromise between purely statistical non-linguistic approaches and those requiring rather accomplished (and expensive) semantic analysis of data, often referred to as 'conceptual retrieval'.</Paragraph>
    <Paragraph position="1"> In our system the database text is first processed with a fast syntactic parser. Subsequently certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.) after which they are used to transform a user's request into a search query.</Paragraph>
    <Paragraph position="2"> The user's natural language request is also parsed, and all indexing terms occurring in it are identified. Certain highly ambiguous, usually single-word terms may be dropped, provided that they also occur as elements in some compound terms. For example, "natural" is deleted from a query already containing "natural language" because "natural" occurs in many unrelated contexts: "natural number", "natural logarithm", "natural approach", etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, "unlawful activity" is added to a query (TREC topic 055) containing the compound term "illegal activity" via a synonymy link between "illegal" and "unlawful". After the final query is constructed, the database search follows, and a ranked list of documents is returned. In TREC-4, automatic query expansion has been limited to the routing runs, where we refined our version of massive expansion using relevance information with respect to the training database. Query expansion via the automatically generated domain map was not used in the official ad-hoc runs. Full details of the TTP parser have been described in the TREC-1 report \[8\], as well as in other works \[6,7\], \[9,10,11,12\].</Paragraph>
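The query-construction step described above can be sketched as follows. This is a minimal illustration, not the system's actual lexicon or similarity table: the synonym entry and the containment test are assumptions standing in for the admissible similarity relations.

```python
# Hypothetical similarity table: one synonymy link, as in the TREC topic
# 055 example from the text.
synonyms = {"illegal": "unlawful"}

def build_query(terms):
    # Single-word terms are dropped when some compound term contains them;
    # compounds are expanded through (assumed) synonymy links.
    compounds = [t for t in terms if " " in t]
    query = list(compounds)
    for t in terms:
        if " " not in t and not any(t in c.split() for c in compounds):
            query.append(t)  # keep single words only if no compound covers them
    for c in compounds:
        words = c.split()
        for i, w in enumerate(words):
            if w in synonyms:
                query.append(" ".join(words[:i] + [synonyms[w]] + words[i + 1:]))
    return query

q = build_query(["natural language", "natural", "illegal activity"])
# "natural" is dropped; "unlawful activity" is added.
```

With the sample input, the ambiguous single word disappears and the synonym-expanded compound joins the query.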
    <Paragraph position="3"> As in TREC-3, we used a randomized index splitting mechanism which creates not one but several balanced sub-indexes. These sub-indexes can be searched independently and the results can be merged meaningfully into a single ranking.</Paragraph>
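The index-splitting idea can be sketched as below. This is a toy illustration under stated assumptions: documents are assigned at random to K balanced sub-indexes, each is "searched" independently (scoring is stubbed out with a placeholder function), and the per-index results are merged by score into one ranking.

```python
import random

def split_index(doc_ids, k, seed=0):
    # Randomized, balanced split: shuffle, then deal out round-robin.
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    return [ids[i::k] for i in range(k)]

def merged_search(sub_indexes, score):
    # Search each sub-index independently, then merge into one ranking.
    results = []
    for sub in sub_indexes:
        results.extend((score(d), d) for d in sub)
    return [d for s, d in sorted(results, reverse=True)]

subs = split_index(range(10), 3)
ranked = merged_search(subs, score=lambda d: d % 7)  # placeholder scorer
```

The sub-index sizes differ by at most one document, and the merged ranking covers the whole collection.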
  </Section>
  <Section position="5" start_page="144" end_page="145" type="metho">
    <SectionTitle>
LINGUISTIC TERMS
</SectionTitle>
    <Paragraph position="0"> Syntactic phrases extracted from TTP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. In the TREC experiments reported here we extracted head-modifier word and fixed-phrase pairs only. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants \[5\] for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process.</Paragraph>
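The normalization of different surface forms into one pair can be sketched as follows. This assumes the parser's output has already been reduced to (relation, head, modifier) tuples; the relation names and the sample analyses are illustrative, not the actual TTP output format.

```python
# The four pair types listed in the text, as hypothetical relation labels.
PAIR_RELATIONS = {"noun+left_adjunct", "noun+right_adjunct",
                  "verb+object", "subject+verb"}

def extract_pairs(dependencies):
    # Normalize each qualifying dependency into a "head+modifier" term,
    # so that different surface forms yield the same indexing unit.
    pairs = set()
    for rel, head, modifier in dependencies:
        if rel in PAIR_RELATIONS:
            pairs.add(f"{head}+{modifier}")
    return pairs

# "information retrieval system" and "retrieval of information from
# databases" both reduce to the same pair (assumed analyses).
p1 = extract_pairs([("noun+left_adjunct", "retrieve", "information")])
p2 = extract_pairs([("noun+right_adjunct", "retrieve", "information")])
```

The point of the regularization is exactly this collapse: both fragments match on retrieve+information.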
    <Paragraph position="1"> The notorious ambiguity of nominal compounds remains a serious difficulty in obtaining head-modifier pairs of highest accuracy. In order to cope with this, the pair extractor looks at the distribution statistics of the compound terms to decide whether the association between any two words (nouns and adjectives) in a noun phrase is both syntactically valid and semantically significant. For example, we may accept language+natural and processing+language from natural language processing as correct, however, case+trading would make a mediocre term when extracted from insider trading case. On the other hand, it is important to extract trading+insider to be able to match documents containing phrases insider trading sanctions act or insider trading activity.</Paragraph>
    <Paragraph position="2"> Proper names, of people, places, events, organizations, etc., are often critical in deciding the relevance of a document. Since names are traditionally capitalized in English text, spotting them is relatively easy, most of the time. It is important that all names recognized in text, including those made up of multiple words, e.g., South Africa or Social Security, are represented as tokens, and not broken into single words, e.g., South and Africa, which may turn out to be different names altogether by themselves. On the other hand, we need to make sure that variants of the same name, e.g., U.S. President Bill Clinton and President Clinton, are indeed recognized as such, with a degree of confidence.</Paragraph>
    <Paragraph position="3"> One simple method, which we use in our system, is to represent a compound name dually, as a compound token and as a set of single-word terms. This way, if a corresponding full name variant cannot be found in a document, its component word matches can still add to the document score.</Paragraph>
  </Section>
  <Section position="6" start_page="145" end_page="145" type="metho">
    <SectionTitle>
TERM WEIGHTING ISSUES
</SectionTitle>
    <Paragraph position="0"> Finding a proper term weighting scheme is critical in term-based retrieval since the rank of a document is determined by the weights of the terms it shares with the query. One popular term weighting scheme, known as tf.idf, weights terms in proportion to their inverted document frequency scores and to their in-document frequencies (tf). The in-document frequency factor is usually normalized by the document length; that is, it is more significant for a term to occur 5 times in a short 20-word document than to occur 10 times in a 1000-word article.</Paragraph>
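A minimal length-normalized tf.idf following this description (one common formulation among several; the exact normalization used by PRISE is not specified here):

```python
import math

def tf_idf(tf, doc_len, df, n_docs):
    # In-document frequency normalized by document length, times idf.
    return (tf / doc_len) * math.log(n_docs / df)

# 5 occurrences in 20 words outweigh 10 occurrences in 1000 words.
short = tf_idf(tf=5, doc_len=20, df=100, n_docs=10_000)
long_ = tf_idf(tf=10, doc_len=1000, df=100, n_docs=10_000)
```

The short document's term scores more than twenty times higher, exactly the behavior the normalization is meant to produce.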
    <Paragraph position="1"> In our post-TREC-2 experiments we changed the weighting scheme so that the phrases (but not the names which we did not distinguish in TREC-2) were more heavily weighted by their idf scores while the in-document frequency scores were replaced by logarithms multiplied by sufficiently large constants.</Paragraph>
    <Paragraph position="2"> In addition, the top N highest-idf matching terms (simple or compound) were counted more toward the document score than the remaining terms. This 'hot spot' retrieval option is discussed in the next section. Schematically, these new weights for phrasal and highly specific terms are obtained using the following formula, while weights for most of the single-word terms remain unchanged: weight(T_i) = (C1*log(tf) + C2*alpha(N,i))*idf. In the above, alpha(N,i) is 1 for i &lt; N and is 0 otherwise. The alpha(N,i) factor realizes our notion of "hot spot" matching, where only the top N matches are used in computing the document score. This creates an effect of "locality", somewhat similar to that achieved by passage-level retrieval. In TREC-3, where this weighting scheme was fully deployed for the first time, it proved very useful for sharpening the focus of long, frequently convoluted queries. In TREC-3, where the query length ranged from 20 to 100+ valid terms, setting N to 15 or 20 (including phrasal concepts) typically led to a precision gain of about 20%. In TREC-4, the average query length is less than 10 terms, which we considered too short for using locality matching, and this part of the weighting scheme was in effect unused in the official runs. This turned out to be a mistake: when we reran the TREC-4 experiments after the conference, we found that our results improved visibly when the locality part of the weighting scheme was restored.</Paragraph>
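A sketch of a document score under this formula. The constants C1, C2 and N are illustrative placeholders (the paper does not give their values), and the per-term (tf, idf) inputs are assumed:

```python
import math

C1, C2, TOP_N = 10.0, 50.0, 2  # hypothetical constants

def doc_score(matches):
    # matches: list of (tf, idf) pairs for terms shared with the query.
    # Rank by idf; only the top-N matches receive the C2 bonus, i.e.
    # alpha(N, i) = 1 for the first N terms and 0 otherwise.
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    score = 0.0
    for i, (tf, idf) in enumerate(ranked):
        alpha = 1.0 if i in range(TOP_N) else 0.0
        score += (C1 * math.log(tf) + C2 * alpha) * idf
    return score
```

With tf = 1 throughout (so the log term vanishes), only the two highest-idf matches contribute, which is the "locality" effect described above.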
    <Paragraph position="3"> Changing the weighting scheme for compound terms, along with other minor improvements (such as expanding the stopword list for topics), has led to an overall increase in precision of 20% to 25% over our baseline results in TREC-3.</Paragraph>
  </Section>
  <Section position="7" start_page="145" end_page="146" type="metho">
    <SectionTitle>
SUMMARY OF RESULTS
</SectionTitle>
    <Paragraph position="0"> The bulk of the text data used in TREC-4 had been previously processed for TREC-3 (about 3.3 GBytes). Routing experiments involved some additional new text (about 500 MBytes), which we processed through our NLP module. The parameters of this process were essentially the same as in TREC-3, and the interested reader is referred to our TREC-3 paper. Two types of retrieval have been done: (1) new topics 201-250 were run in the ad-hoc mode against the Disk-2&amp;3 database (actually, only 49 topics were used in evaluation, since relevance judgements were unavailable for topic 201 due to an error), and (2) topics 3-191 (a selection of 50 topics in this range), previously used in TREC-1 to TREC-3, were run in the routing mode against the Disk-1 database plus the new data, including material from Federal Register, IR Digest and Internet newsgroups. In each category 2 official runs were performed, with different setups of the system's parameters. Massive query expansion has been implemented as an automatic feedback mode using known relevance judgements for these topics with respect to the TREC-3 database.</Paragraph>
    <Paragraph position="1"> Summary statistics for routing runs are shown in Tables 1 and 2. In general, we can note a substantial improvement in performance when phrasal terms are used, especially in ad-hoc runs. Looking back at TREC-2 and TREC-3, one may observe that these improvements appear to be tied to the length and specificity of the query: the longer the query, the more improvement from linguistic processes. This can be seen by comparing the improvement over baseline for automatic ad-hoc runs (very short queries), for manual runs (longer queries), and for semi-interactive runs (yet longer queries). In addition, our TREC-3 results (with long and detailed queries) showed a 20-25% improvement in precision attributed to NLP, as compared to 10-16% in TREC-4. At this time we are unable to explain the much smaller improvements in routing evaluations: while the massive query expansion definitely works, NLP has a hard time topping these improvements.</Paragraph>
  </Section>
</Paper>