<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1044">
  <Title>Building Effective Queries In Natural Language Information Retrieval</Title>
  <Section position="3" start_page="299" end_page="301" type="metho">
    <SectionTitle>
2. STREAM-BASED INFORMATION
RETRIEVAL MODEL
</SectionTitle>
    <Paragraph position="0"> Our NLIR system encompasses several statistical and natural language processing (NLP) techniques for robust text analysis. These has been organized together into a &amp;quot;stream model&amp;quot; in which alternative methods of document indexing are strung together to perform in parallel. Stream indexes are built using a mixture of different indexing approaches, term extracting and weighting strategies, even different search engines. The final results are produced by merging ranked lists of documents obtained from searching all stream indexes with appropriately preprocessed queries, i.e., phrases for phrase stream, names for names stream, etc. The merging process weights contributions from each stream using a combination that was found the most effective in training runs. This allows for an easy combination of alternative retrieval and routing methods, creating a meta-search strategy which maximizes the contribution of each stream. Both Cornell's SMART version 11, and NIST's Prise search engines were used as basic engines .2 Our NLIR system employs a suite of advanced natural language processing techniques in order to assist the sta- null tistical retrieval engine in selecting appropriate indexing terms for documents in the database, and to assign them semantically validated weights. The following term extraction methods have been used; they correspond to the indexing streams in our system.</Paragraph>
    <Paragraph position="1">  1. Eliminate stopwords: original text words minus certain no-content words are used to index documents.</Paragraph>
    <Paragraph position="2"> 2. Morphological stemming: we normalize across morphological word variants (e.g., &amp;quot;proliferation&amp;quot;, &amp;quot;proliferate&amp;quot;, &amp;quot;proliferating&amp;quot;) using a lexicon-based stemmer.</Paragraph>
    <Paragraph position="3"> 3. Phrase extraction: we use various shallow text processing techniques, such as part-of-speech tagging, phrase boundary detection, and word co-occurrence metrics to identify stable strings of words, such as &amp;quot;joint venture&amp;quot;.</Paragraph>
    <Paragraph position="4"> 4. Phrase normalization: we identify &amp;quot;head+modifier&amp;quot;  pairs in order to normalize across syntactic variants such as &amp;quot;weapon proliferation&amp;quot;, &amp;quot;proliferation of weapons&amp;quot;, &amp;quot;proliferate weapons&amp;quot;, etc. into &amp;quot;weapon+proliferate&amp;quot;.</Paragraph>
    <Paragraph position="5"> 5. Proper names: we identify proper names for indexing, including people names and titles, location names, organization names, etc.</Paragraph>
    <Paragraph position="6"> Among the advantages of the stream architecture we may include the following: * stream organization makes it easier to compare the contributions of different indexing features or representations. For example, it is easier to design experiments which allow us to decide if a certain representation adds information which is not contributed by other streams.</Paragraph>
    <Paragraph position="7"> * it provides a convenient testbed to experiment with algorithms designed to merge the results obtained using different IR engines and/or techniques.</Paragraph>
    <Paragraph position="8"> * it becomes easier to fine-tune the system in order to obtain optimum performance * it allows us to use any combination of IR engines without having to modify their code at all.</Paragraph>
    <Paragraph position="9"> While our stream architecture may be unique among IR systems, the idea of combining evidence from multiple sources has been around for some time. Several researchers have noticed in the past that different systems may have similar performance but retrieve different documents, thus suggesting that they may  2. SMART version 11 is freely available, unlike the more advanced version 12.</Paragraph>
    <Paragraph position="10"> complement one another. It has been reported that the use of different sources of evidence increases the performance of a system (see for example, Callan et al., 1995; Fox et a1.,1993; Saracevic &amp; Kantor, 1988).</Paragraph>
    <Paragraph position="11"> 3. STREAMS USED IN NLIR SYSTEM</Paragraph>
    <Section position="1" start_page="299" end_page="301" type="sub_section">
      <SectionTitle>
3.1 Head-Modifier Pairs Stream
</SectionTitle>
      <Paragraph position="0"> Our most linguistically advanced stream is the head+modifier pairs stream. In this stream, documents are reduced to collections of word pairs derived via syntactic analysis of text followed by a normalization process intended to capture semantic uniformity across a variety of surface forms, e.g., &amp;quot;information retrieval&amp;quot;, &amp;quot;retrieval of information&amp;quot;, &amp;quot;retrieve more information&amp;quot;, &amp;quot;information that is retrieved&amp;quot;, etc. are all reduced to &amp;quot;retrieve+information&amp;quot; pair, where &amp;quot;retrieve&amp;quot; is a head or operator, and &amp;quot;information&amp;quot; is a modifier or argument. null The pairs stream is derived through a sequence of processing steps that include:  noun phrases.</Paragraph>
      <Paragraph position="1"> Syntactic phrases extracted from TIP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. It should be noted that the parser's output is a predicate-argument structure centered around main elements of various phrases. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its fight adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. This also gives the pair-based representation sufficient flexibility to effectively capture content elements even in complex expressions. Long, complex phrases are similarly decomposed into collections of pairs, using corpus statistics to resolve structural ambiguities.</Paragraph>
    </Section>
    <Section position="2" start_page="301" end_page="301" type="sub_section">
      <SectionTitle>
3.2 Linguistic Phrase Stream
</SectionTitle>
      <Paragraph position="0"> We used a regular expression pattern matcher on the part-of-speech tagged text to extract noun groups and proper noun sequences. The major rules we used are:  1. a sequence of modifiers (vbnlvbgljj) followed by at least one noun, such as: &amp;quot;cryonic suspend&amp;quot;, &amp;quot;air traffic control system&amp;quot;; 2. proper noun(s) modifying a noun, such as: &amp;quot;u.s. citizen&amp;quot;, &amp;quot;china trade&amp;quot;; 3. proper noun(s) (might contain '&amp;'), such as: &amp;quot;warren  commission&amp;quot;, &amp;quot;national air traffic controller&amp;quot;. In these experiments, the length of phrases was limited to maximum 7 words.</Paragraph>
      <Paragraph position="1"> sion, whereas lnc.ntc slightly sacrifices the average precision, but gives better recall (see Buckley, 1993). We used also a plain text stream. This stream was obtained by indexing the text of the documents &amp;quot;as is&amp;quot; without stemming or any other processing and running the unprocessed text of the queries against that index. Finally, some experiments involved the fragments stream. This was the result of spliting the documents of the STEM stream into fragments of constant length (1024 characters) and indexing each fragment as if it were a different document. The queries used with this stream were the usual stem queries. For each query, the resulting ranking was filtered to keep, for each document, the highest score obtained by the fragments of that document.</Paragraph>
    </Section>
    <Section position="3" start_page="301" end_page="301" type="sub_section">
      <SectionTitle>
3.3 Name Stream
</SectionTitle>
      <Paragraph position="0"> Proper names, of people, places, events, organizations, etc., are often critical in deciding relevance of a document. Since names are traditionally capitalized in English text, spotting them is relatively easy, most of the time. Many names are composed of more than a single word, in which case all words that make up the name are capitalized, except for prepositions and such, e.g., The United States of America. It is important that all names recognized in text, including those made up of multiple words, e.g., South Africa or Social Security, are represented as tokens, and not broken into single words, e.g., South and Africa, which may turn out to be different names altogether by themselves. On the other hand, we need to make sure that variants of the same name are indeed recognized as such, e.g., U.S. President Bill Clinton and President Clinton, with a degree of confidence. One simple method, which we use in our system, is to represent a compound name dually, as a compound token and as a set of single-word terms. This way, if a corresponding full name variant cannot be found in a document, its component words matches can still add to the document score. A more accurate, but arguably more expensive method would be to use a substring comparison procedure to recognize variants before matching.</Paragraph>
    </Section>
    <Section position="4" start_page="301" end_page="301" type="sub_section">
      <SectionTitle>
3.4 Other Streams used
</SectionTitle>
      <Paragraph position="0"> The stems stream is the simplest, yet, it turns out, the most effective of all streams, a backbone in our multi-stream model. It consists of stemmed non-stop single-word tokens (plus hyphenated phrases). Our early experiments with multi-stream indexing using SMART suggested that the most effective weighting of this stream is lnc.ltc, which yields the best average preci-Table 1 shows relative performance of each stream tested for this evaluation. Note that the standard stemmed-word representation (stems stream) is still the most efficient one, but linguistic processing becomes more important in longer queries. In this evaluation, the short queries are one-sentence search directives such as the following: What steps are being taken by governmental or even private entities world-wide to stop the smuggling of aliens. The long queries, on the other hand, contain substantially more text as the result of full-text expansion described in section 5 below.</Paragraph>
      <Paragraph position="1"> TABLE 1. How different streams perform relative to one another (ll-pt avg. Prec) short long</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="301" end_page="302" type="metho">
    <SectionTitle>
4. STREAM MERGING STRATEGY
</SectionTitle>
    <Paragraph position="0"> The results obtained from different streams are list of documents ranked in order of relevance: the higher the rank of a retrieved document, the more relevant it is presumed to be. In order to obtain the final retrieval result, ranking lists obtained from each stream have to be combined together by a process known as merging or fusion.</Paragraph>
    <Paragraph position="1"> The final ranking is derived by calculating the combined relevance scores for all retrieved documents. The following are the primary factors affecting this process:  1. document relevancy scores from each stream 2. retrieval precision distribution estimates within ranks from various streams, e.g., projected precision between ranks 10 and 20, etc.; 3. the overall effectiveness of each stream (e.g. measured as average precision on training data) 4. the number of streams that retrieve a particular document, and 5. the ranks of this document within each stream.</Paragraph>
    <Paragraph position="2">  Generally, a more effective stream will more effect on shaping the final ranking. A document which is retrieved at a high rank from such a stream is more likely to end up ranked high in the final result. In addition, the performance of each stream within a specific range of ranks is taken into account. For example, if phrases stream tends to pack relevant documents into top 10-20 retrieved documents (but not so much into 110) we would give premium weights to the documents found in this region of phrase-based ranking, etc. Table  relationships, etc. Unfortunately, an average search query does not look anything like this, most of the time. It is more likely to be a statement specifying the semantic criteria of relevance. This means that except for the semantic or conceptual resemblance (which we cannot model very well as yet) much of the appearance of the query (which we can model reasonably well) may be, and often is, quite misleading for search purposes.</Paragraph>
    <Paragraph position="3"> Where can we get the right queries? In today's information retrieval systems, query expansion usually pertains content and typically is limited to adding, deleting or re-weighting of terms. For example, content terms from documents judged relevant are added to the query while weights of all terms are adjusted in order to reflect the relevance information. Thus, terms occurring predominantly in relevant documents will have their weights increased, while those occurring mostly in non-relevant documents will have their weights decreased. This process can be performed automatically using a relevance feedback method, e.g., Roccio's (1971), with the relevance information either supplied manually by the user (Harman, 1988), or otherwise guessed, e.g. by assuming top 10 documents relevant, etc. (Buckley, et al., 1995). A serious problem with this content-term expansion is its limited ability to capture and represent many important aspects of what makes some documents relevant to the query, including particular term co-occurrence patterns, and other hardto-measure text features, such as discourse structure or stylistics. Additionally, relevance-feedback expansion depends on the inherently partial relevance information, which is normally unavailable, or unreliable.</Paragraph>
    <Paragraph position="4"> Other types of query expansions, including general purpose thesauri or lexical databases (e.g., Wordnet) have been found generally unsuccessful in information retrieval (cf. Voorhees &amp; Hou, 1993; Voorhees, 1994) Note that again, long text queries benefit more from linguistic processing.</Paragraph>
  </Section>
  <Section position="5" start_page="302" end_page="304" type="metho">
    <SectionTitle>
5. QUERY EXPANSION EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="302" end_page="303" type="sub_section">
      <SectionTitle>
5.1 Why query expansion?
</SectionTitle>
      <Paragraph position="0"> The purpose of query expansion in information retrieval is to make the user query resemble more closely the documents it is expected to retrieve. This includes both content, as well as some other aspects such as composition, style, language type, etc. If the query is indeed made to resemble a &amp;quot;typical&amp;quot; relevant document, then suddenly everything about this query becomes a valid search criterion: words, collocations, phrases, various An alternative to term-only expansion is a full-text expansion which we tried for the first time in TREC-5.</Paragraph>
      <Paragraph position="1"> In our approach, queries are expanded by pasting in entire sentences, paragraphs, and other sequences directly from any text document. To make this process efficient, we first perform a search with the original, unexpanded queries (short queries), and then use top N (I 0, 20) returned documents for query expansion. These documents are not judged for relevancy, nor assumed relevant; instead, they are scanned for passages that contain concepts referred to in the query. Expansion material can be found in both relevant and non-relevant documents, benefitting the final query all the same. In fact, the presence of such text in otherwise non-relevant documents underscores the inherent limitations of distribution-based term reweighting used in relevance feed- null back. Subject to some further &amp;quot;fitness criteria&amp;quot;, these expansion passages are then imported verbatim into the query. The resulting expanded queries undergo the usual text processing steps, before the search is run again.</Paragraph>
      <Paragraph position="2"> Full-text expansion can be accomplished manually, as we did initially to test feasibility of this approach, or automatically, as we tried in later with promising results. We first describe the manual process focussing on guidelines set forth in such a way as to minimize and streamline human effort, and lay the ground for eventual automation. We then describe our first attempt at automated expansion, and discuss the results from both.</Paragraph>
      <Paragraph position="3"> The initial evaluations indicate that queries expanded manually following the prescribed guidelines are improving the system's performance (precision and recall) by as much as 40%. This appear to be true not only for our own system, but also for other systems: we asked other groups participating in TREC-5 to run search using our expanded queries, and they reported nearly identical improvements. At this time, automatic text expansion produces less effective queries than manual expansion, primarily due to a relatively unsophisticated mechanism used to identify and match concepts in the queries.</Paragraph>
    </Section>
    <Section position="2" start_page="303" end_page="303" type="sub_section">
      <SectionTitle>
5.2 Guidelines for manual query expansion
</SectionTitle>
      <Paragraph position="0"> We have adopted the following guidelines for query expansion. They were constructed to observe realistic limits of the manual process, and to prepare ground for eventual automation.</Paragraph>
      <Paragraph position="1">  1. NLIR retrieval is run using the 50 original &amp;quot;short&amp;quot; queries.</Paragraph>
      <Paragraph position="2"> 2. Top 10 documentss retrieved by each query are retained for expansion. We obtain 50 expansion sub-collections, one per query.</Paragraph>
      <Paragraph position="3"> 3. Each query is manually expanded using phrases,  sentences, and entire passages found in any of the documents from this query's expansion subcollection. Text can both added and deleted, but care is taken to assure that the final query has the same format as the original, and that all expressions added are well-formed English strings, though not necessarily well-formed sentences. A limit of 30 minutes per query in a single block of time is observed. 4. Expanded queries are sent through all text processing steps necessary to run the queries against multiple stream indexes.</Paragraph>
      <Paragraph position="4"> 5. Rankings from all streams are merged into the final result.</Paragraph>
      <Paragraph position="5"> There are two central decision making points that affect the outcome of the query expansion process following the above guidelines. The first point is how to locate text passages that are worth looking at -- it is impractical, if not downright impossible to read all 10 documents, some quite long, in under 30 minutes. The second point is to actually decide whether to include a given passage, or a portion thereof, in the query. To facilitate passage spotting, we used simple word search, using key concepts from the query to scan down document text. Each time a match was found, the text around (usually the paragraph containing it) was read, and if found &amp;quot;fit&amp;quot;, imported into the query. We experimented also with various &amp;quot;pruning&amp;quot; criteria: passages could be either imported verbatim into the query, or they could be &amp;quot;pruned&amp;quot; of &amp;quot;obviously&amp;quot; undesirable noise terms. In evaluating the expansion effects on query-by-query basis we have later found that the most liberal expansion mode with no pruning was in fact the most effective. This would suggest that relatively self-contained text passages, such as paragraphs, provide a balanced representation of content, that cannot be easily approximated by selecting only some words.</Paragraph>
    </Section>
    <Section position="3" start_page="303" end_page="304" type="sub_section">
      <SectionTitle>
5.3 Automatic Query Expansion
</SectionTitle>
      <Paragraph position="0"> Queries obtained through the full-text manual expansion proved to be overwhelmingly better than the original search queries, providing as much as 40% precision gain. These results were sufficiently encouraging to motivate us to investigate ways of performing such expansions automatically.</Paragraph>
      <Paragraph position="1"> One way to approximate the manual text selection process, we reasoned, was to focus on those text passages that refer to some key concepts identified in the query, for example, &amp;quot;alien smuggling&amp;quot; for query 252 below. The key concepts (for now limited to simple noun groups) were identified by either their pivotal location within the query (in the Title field), or by their repeated occurrences within the query Description and Narrative fields. As in the manual process, we run a &amp;quot;short&amp;quot; query retrieval, this time retaining 100 top documents retrieved by each query. An automated process then scans these 100 documents for all paragraphs which contain occurrences, including some variants, of any of the key concepts identified in the original query. The paragraphs are subsequently pasted verbatim into the query. The original portion of the query may be saved in a special field to allow differential weighting. Finally, the expanded queries were run to produce the final result.</Paragraph>
      <Paragraph position="2">  The above, clearly simplistic technique has produced some interesting results. Out of the fifty queries we tested, 34 has undergone the expansion. Among these 34 queries, we noted precision gains in 13, precision loss in 18 queries, with 3 more basically unchanged.</Paragraph>
      <Paragraph position="3"> However, for these queries where the improvement did occur it was very substantial indeed: the average gain was 754% in 11-pt precision, while the average loss (for the queries that lost precision) was only 140%. Overall, we still can see a 7% improvement on all 50 queries (vs. 40%+ when manual expansion is used).</Paragraph>
      <Paragraph position="4"> Our experiments show that selecting the right paragraphs from documents to expand the queries can dramatically improve the performance of a text retrieval system. This process can be automated, however, the challenge is to devise more precise automatic means of &amp;quot;paragraph picking&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="304" end_page="305" type="metho">
    <SectionTitle>
6. SUMMARY OF RESULTS
</SectionTitle>
    <Paragraph position="0"> In this section we summarize the results obtained from query expansion and other related experiments.</Paragraph>
    <Paragraph position="1"> An automatic run means that there was no human intervention in the process at any time. A manual run means that some human processing was done to the queries, and possibly multiple test runs were made to improve the queries. A short query is derived using only one section of a TREC-5 topic, namely the DESCRIPTION field. A full query is derived from any or all fields in the original topic. A long query is obtained through our full-text expansion method (manual, or automatic). An example TREC-5 query is show below; note that the Description field is what one may reasonably expect to be an initial search query, while Narrative provides some further explanation of what relevant material may look like. The Topic field provides a single concept of interest to the searcher; it was not permitted in the short queries.</Paragraph>
    <Paragraph position="2">  &lt;top&gt; &lt;num&gt; Number: 252 &lt;title&gt; Topic: Combating Alien Smuggling &lt;desc&gt; Description:  What steps are being taken by governmental or even private entities world-wide to stop the smuggling of aliens.</Paragraph>
    <Paragraph position="3"> &lt;narr&gt; Narrative: To be relevant, a document must describe an effort being made (other than routine border patrols) in any country of the world to prevent the illegal penetration of aliens across borders.</Paragraph>
    <Paragraph position="4"> &lt;/top&gt; Table 3 summarizes selected runs performed with our NLIR system on TREC-5 database using queries 251 through 300. Table 4 gives the performance of Cornell's (now Sabir Inc.) SMART system version 12, using advanced Lnu.ltu term weighting scheme, and query expansion through automatic relevance feedback (rel.fbk), on the same database and with the same queries. Sabir used our long queries to obtain long query run. Note the consistently large improvements in retrieval precision attributed to the expanded queries.</Paragraph>
  </Section>
class="xml-element"></Paper>