<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1046">
  <Title>The Text REtrieval Conferences (TRECs)</Title>
  <Section position="3" start_page="374" end_page="376" type="metho">
    <SectionTitle>
2. THE TASKS
2.1 The Main Tasks
</SectionTitle>
    <Paragraph position="0"> All four TREC conferences have centered around two main tasks based on traditional information retrieval modes: a &amp;quot;routing&amp;quot; task and an &amp;quot;adhoc&amp;quot; task. In the routing task it is assumed that the same questions are always being asked, but that new data is being searched.</Paragraph>
    <Paragraph position="1"> This task is similar to that done by news clipping services or by library profiling systems. In the adhoc task, it is assumed that new questions are being asked against a static set of data. This task is similar to how a researcher might use a library, where the collection is known, but where the questions likely to be asked are unknown.</Paragraph>
    <Paragraph position="2"> In TREC the routing task is represented by using known topics and known relevant documents for those topics, but new data for testing. The training for this task is shown in the left-hand column of Figure 1. The  participants are given a set of known (or training) topics, along with a set of documents, including known relevant documents for those topics. The topics consist of natural language text describing a user's information need (see section 3.3 for details). The topics are used to create a set of queries (the actual input to the retrieval system) which are then used against the training documents. This is represented by Q1 in the diagram. Many sets of Q1 queries might be built to help adjust systems to this task, to create better weighting algorithms, and in general to prepare the system for testing. The results of this training are used to create Q2, the routing queries to be used against the test documents (testing task shown on the middle column of Figure 1).</Paragraph>
    <Paragraph position="3"> The 50 routing topics for testing are a specific subset of the training topics (selected by NIST). In TREC-3 the routing topics corresponded to the TREC-2 adhoc topics, i.e., topics 100-150. The test documents for TREC-3 were the documents on disk 3 (see section 3.2).</Paragraph>
    <Paragraph position="4"> Although this disk had been part of the general training data, there were no relevance judgments for topics 100-150 made on this disk of documents. This lessthan-optimal testing was required by the last-minute unavailability of new data.</Paragraph>
    <Paragraph position="5"> In TREC-4 a slightly different methodology was used to select the routing topics and test data. Because of the difficulty in getting new data, it was decided to select the new data first, and then select topics that matched the data. The ready availability of more Federal Register documents suggested the use of topics that tended to find relevant documents in the Federal Register. Twenty-five of the routing topics were picked using this criteria. This also created a subcoUection of the longer, more structured Federal Register documents for later use in the research community. The second set of 25 routing topics was selected to build a subeollection in the domain of computers. The testing documents for the computer issues were documents from the Intemet, plus part of the Ziff coUection.</Paragraph>
    <Paragraph position="6"> The adhoc task is represented by new topics for known documents. This task is shown on the right-hand side of Figure 1, where the 50 new test topics are used to create Q3 as the adhoc queries for searching against the training documents. Fifty new topics (numbers 150-200) were generated for TREC-3, with fifty additional new topics created for TREC-4 (numbers 201-250). The known documents used in TREC-3 were on disks 1 and 2, and those used in TREC-4 were on disks 2 and 3. Sections 3.2 and 3.3 give more details about the documents used and the topics that were created. The results from searches using Q2 and Q3 are the official test results sent to NIST for the routing and ad-hoc tasks.</Paragraph>
    <Paragraph position="7">  In addition to clearly defining the tasks, other guidelines are provided in TREC. These guidelines deal with the methods of indexing and knowledgebase construction and with the methods of generating the queries from the supplied topics. In general, they are constructed to reflect an actual operational environment, and to allow as fair as possible separation among the diverse query construction approaches. Three generic categories of query construction were defined, based on the mount and kind of manual intervention used.</Paragraph>
    <Paragraph position="8">  1. Automatic (completely automatic query construction) null 2. Manual (manual query construction) 3. Interactive (use of interactive techniques to con- null struct the queries) The participants were able to choose between two levels of participation: Category A, full participation, or Category B, full participation using a reduced dataset (1/4 of the full document set). Each participating group was provided the data and asked to turn in either one or two sets of results for each topic. When two sets of results were sent, they could be made using different methods of creating queries, or different methods of searching these queries. Groups could choose to do the routing task, the adhoc task, or both, and were asked to submit the top 1000 documents retrieved for each topic for evaluation.</Paragraph>
    <Section position="1" start_page="375" end_page="376" type="sub_section">
      <SectionTitle>
2.2 The Tracks
</SectionTitle>
      <Paragraph position="0"> One of the goals of TREC is to provide a common task evaluation that allows cross-system comparisons.</Paragraph>
      <Paragraph position="1"> This has proven to be a key strength in TREC. The second major strength is the loose definition of the two main tasks allowing a wide range of experiments. The addition of secondary tasks (tracks) in TREC-4 combines these strengths by creating a common evaluation  for tasks that are either related to the main tasks, or are a more focussed implementation of those tasks.</Paragraph>
      <Paragraph position="2"> Five formal tracks were run in TREC-4: a multilingual track, an interactive track, a database merging track, a &amp;quot;confusion&amp;quot; track, and a filtering track. In TREC-3, four out of the five tracks were run as preliminary investigations into the feasibility of running formal tracks in TREC-4.</Paragraph>
      <Paragraph position="3"> The multilingual track represents an extension of the adhoc task to a second language (Spanish). An informal Spanish test was run in TREC-3, but the data arrived late and few groups were able to take part. In TREC~ the track was made official and 10 groups took part.</Paragraph>
      <Paragraph position="4"> There were about 200 megabytes of Spanish data (the El Norte newspaper from Monterey, Mexico), and 25 topics. Groups used the adhoc task guidelines, and submitted the top 1000 documents retrieved for each of the 25 Spanish topics.</Paragraph>
      <Paragraph position="5"> The interactive track focusses the adhoc task on the process of doing searches interactively. It was felt by many groups that TREC uses evaluation for a batch retrieval environment rather than the more common interactive environments seen today. However there are few tools for evaluating interactive systems, and none that seem appropriate to TREC. The use of the interactive query construction method in TREC-3 demonstrated interest in using interactive search techniques, so a formal track was formed for TREC-4. The interactive track has a double goal of developing better methodologies for interactive evaluation and investigating in depth how users search the TREC topics. Eleven groups took part in this track in TREC-4. A subset of the adhoc topics was used, and many different types of experiments were run. The common thread was that all groups used the same topics, performed the same task(s), and recorded the same information about how the searches were done. Task 1 was to retrieve as many relevant docnments as possible within a certain timeframe. Task 2 was to construct the best query possible.</Paragraph>
      <Paragraph position="6"> The database merging task also represents a focussing of.the adhoc task. In this case the goal was to investigate techniques for merging results from the various TREC subcoUections (as opposed to treating the collections as a single entity). Several groups tried these techniques in TREC-3 and it was decided to form a track in this area for TREC-4. There were 10 subcollections defined corresponding to the various dates of the data, i.e. the three different years of the Wall Street Journal, the two different years of the AP newswire, the two sets of Ziff documents (one on each disk), and the three single subcollections (the Federal Register, the San Jose Mercury News, and the U.S. Patents). The 3 participating groups ran the adhoc topics separately on each of the 10 subcollections, merged the results, and submitted these results, along with a baseline run treating the subcollections as a single collection.</Paragraph>
      <Paragraph position="7"> The &amp;quot;confusion&amp;quot; track represents an extension of the current tasks to deal with corrupted data such as would come from OCR or speech input. This was a new track proposed during the TREC-3 conference. The track followed the adhoc task, but using only the category B data. This data was randomly corrupted at NIST using character deletions, substitutions, and additions to create data with a 10% and 20% error rate (i.e., 10% or 20% of the characters were affected). Note that this process is neutral in that it does not model OCR or speech input.</Paragraph>
      <Paragraph position="8"> Four groups used the baseline and 10% corruption level: only two groups tried the 20% level.</Paragraph>
      <Paragraph position="9"> The filtering track represents a variation of the current routing track. For several years some participants have been concerned about the definition of the routing task.</Paragraph>
      <Paragraph position="10"> and a few groups experimented in TREC-3 with an alternative definition of routing. In TREC-4 the track was formalized. It used the same topics, training documents, and test documents as the routing task. The difference was that the results submitted for the filtering runs were unranked sets of documents satisfying three &amp;quot;utility function&amp;quot; criteria. These criteria were designed to approximate a high precision run, a high recall ran, and a &amp;quot;balanced&amp;quot; run. For more details on this track see the paper &amp;quot;The TREC-4 Filtering Track&amp;quot; by David Lewis (in the TREC-4 proceedings).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="376" end_page="380" type="metho">
    <SectionTitle>
3. THE TEST COLLECTION (ENGLISH)
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="376" end_page="376" type="sub_section">
      <SectionTitle>
3.1 Introduction
</SectionTitle>
      <Paragraph position="0"> Like most traditional retrieval collections, there are three distinct parts to this collection -- the documents.</Paragraph>
      <Paragraph position="1"> the questions or topics, and the relevance judgments or &amp;quot;right answers.&amp;quot;</Paragraph>
    </Section>
    <Section position="2" start_page="376" end_page="378" type="sub_section">
      <SectionTitle>
3.2 The Documents
</SectionTitle>
      <Paragraph position="0"> The documents were distributed on CD-ROMs with about 1 gigabyte of data on each, compressed to fit. For TREC-3 and TREC-4, disks 1, 2 and 3 were all available as training material (see Table 3). In TREC-3.</Paragraph>
      <Paragraph position="1"> disks 1 and 2 were also used for the adhoc task, and disk 3 for the routing task. In TREC-4, disks 2 and 3 were used for the adhoc task, and new data (also shown in Table 3) was used for the routing task. The following shows the actual contents of each of the three CD-ROMs (disks 1, 2, and 3).</Paragraph>
      <Paragraph position="2">  Table 3 shows some basic document collection statistics. Although the collection sizes are roughly equivalent in megabytes, there is a range of document lengths across collections, from very short documents (DOE) to very long fiR), Also, the range of document lengths within a collection varies. For example, the documents  from the AP are similar in length, but the WSJ, the ZIFF and especially the FR documents have much wider range of lengths within their collections.</Paragraph>
      <Paragraph position="3"> The documents are uniformly formatted into SGML, with a DTD included for each collection to allow easy parsing.</Paragraph>
    </Section>
    <Section position="3" start_page="378" end_page="379" type="sub_section">
      <SectionTitle>
3.3 The Topics
</SectionTitle>
      <Paragraph position="0"> In designing the TREC task, there was a conscious decision made to provide &amp;quot;user need&amp;quot; statements rather than more traditional queries. Two major issues were involved in this decision. First, there was a desire to allow a wide range of query construction methods by keeping the topic (the need statement) distinct from the query (the actual text submitted to the system). The second issue was the ability to increase the amount of information available about each topic, in particular to include with each topic a clear statement of what criteria make a document relevant.</Paragraph>
      <Paragraph position="1"> Sample TREC-1/TREC-2 topic  &lt;top&gt; &lt;head&gt; 7~pster Topic Description &lt;num&gt; Number: 066 &lt;dora&gt; Domain: Science and Technology &lt;title&gt; Topic: Natural Language Processing &lt;desc&gt; Description: Document will identi y a type of natural language processing technology which is being developed or marketed in the U.S.</Paragraph>
      <Paragraph position="2"> &lt;narr&gt; Narrative:  A relevant document will identi~ a company or institution developing or marketing a natural language processing technology, identify the technology, and identi~ one or more features of the company's product. null  &lt;con&gt; Concept(s): 1. natural language processing 2. translation, language, dictionary, font 3. software applications &lt;fac&gt; Factor(s): &lt;nat&gt; Nationality: U.S.</Paragraph>
      <Paragraph position="4"> Each topic is formatted in the same standard method to allow easier automatic construction of queries.</Paragraph>
      <Paragraph position="5"> Besides a beginning and an end marker, each topic has a number, a short title, and a one-sentence description.</Paragraph>
      <Paragraph position="6"> There is a narrative section which is aimed at providing a complete description of document relevance for the assessors. Each topic also has a concepts section with a list of concepts related to the topic. This section is designed to provide a mini-knowledgebase about a topic such as a real searcher might possess. Additionally each topic can have a definitions section and/or a factors section. The definition section has one or two of the definitions critical to a human understanding of the topic.</Paragraph>
      <Paragraph position="7"> The factors section is included to allow easier automatic query building by listing specffic items from the narrative that constrain the documents that are relevant. Two particular factors were used in the TREC-1/TREC-2 topics: a time factor (current, before a given date, etc.) and a nationality factor (either involving only certain countries or excluding certain countries).</Paragraph>
      <Paragraph position="8"> The new (adhoc) topics used in TREC-3 reflect a slight change in direction. Whereas the TREC-1/TREC-2 topics were designed to mimic a real user's need, and were written by people who are actual users of a retrieval system, they were intended to represent long-standing information needs for which a user might be willing to create elaborate topics. This made them more suited to the routing task than to the adhoc task, where users are likely to ask much shorter questions. The adhoc topics used in TREC-3 (topics 151-200) are not only much shorter, but also are missing the complex structure of the earlier topics. In particular the concepts field has been removed because it was felt that real adhoc questions would not contain this field, and because inclusion of the field discouraged research into techniques for expansion of &amp;quot;too short&amp;quot; user need expressions. The shorter topics do not create a problem for the muting task, as experience in TREC-1 and 2 has shown that the use of the training documents allows a shorter topic (or no topic at all).</Paragraph>
      <Paragraph position="9">  &lt;num&gt; Number: 168 &lt;title&gt; Topic: Financing AMTRAK &lt;desc&gt; Description: A document will address the ngle of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).</Paragraph>
      <Paragraph position="10"> &lt;narr&gt; Narrative: A relevant document must pro- null vide information on the government's responsibility to make AMTRAK an economically viable enti~. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.</Paragraph>
      <Paragraph position="11"> In addition to being shorter, the new topics were written by the same group of people that did the relevance judgments (see next section). Specifically, each of the new topics (numbers 151-200) was developed from a genuine need for information brought in by the assessors. Each assessor constructed his/her own topics from some initial statements of interest, and performed all the relevance assessments on these topics (with a few exceptions).</Paragraph>
      <Paragraph position="12"> However, participants in TREC-3 felt that the topics were still too long compared with what users normally submit to operational retrieval systems. Therefore the TREC-4 topics were made even shorter. Only one field was used (i.e. there is no title field and no narrative field).</Paragraph>
      <Paragraph position="13"> Sample TREC..4 Topic &lt;nura&gt; Number: 207 &lt;desc&gt; What are the prospects of the Quebec separatists achieving independence from the rest of Canada? Table 4 gives the average number of terms in the title, description, narrative, and concept fields (all three fields for TREC-1 and TREC-2, no concept field in TREC-3, and only a description field in TREC-4). As can be seen, the topics are indeed much shorter, particularly in</Paragraph>
    </Section>
    <Section position="4" start_page="379" end_page="380" type="sub_section">
      <SectionTitle>
3.4 The Relevance Judgments
</SectionTitle>
      <Paragraph position="0"> The relevance judgments are of critical importance to a test collection. For each topic it is necessary to compile a list of relevant documents; hopefully as comprehensive a list as possible. All four TRECs have used the pooling method \[3\] to assemble the relevance assessments. In this method a pool of possible relevant documents is created by taking a sample of documents selected by the various participating systems. This sampie is then shown to the human assessors. The particular sampling method used in TREC is to take the top 100 documents retrieved in each submitted run for a given topic and merge them into the pool for assessment. This is a valid sampling technique since all the systems used ranked retrieval methods, with those documents most likely to be relevant returned first.</Paragraph>
      <Paragraph position="1"> A measure of the effect of pooling can be seen by examining the overlap of retrieved documents. Table 5 shows the statistics from the merging operations in the four TREC conferences. For example, in TREC-1 and TREC-2 the top 100 documents from each run (33 runs in TREC-1 and 40 runs in TREC-2) could have produced a total of 3300 and 4000 documents to be judged (for the adhoc task). The average number of documents actually judged per topic (those that were unique) was 1279 (39%) for TREC-1 and 1106 (28%) for TREC-2.</Paragraph>
      <Paragraph position="2"> Note that even though the number of runs has increased by more than 20% (adhoc), the number of unique documents found has actually dropped. The percentage of relevant documents found, however, has not changed much. The more accurate results going from TREC-1 to TREC-2 mean that fewer nonrelevant documents are being found by the systems. This trend continued in TREC-3, with a major drop (particularly for the routing task) that reflects increased accuracy in rejecting nonrelevant documents. In TREC-4, the trend was reversed.</Paragraph>
      <Paragraph position="3"> In the case of the adhoc task (including most of the track runs also), there is a slight increase in the percentage of unique documents found, probably caused by the wider variety of expansion terms used by the systems to compensate for the lack of a narrative section in the topic. A larger percentage increase is seen in the routing task, due to fewer runs being pooled, i.e., a higher percentage of documents is likely to be unique. Also the TREC-4 routing task was more difficult, both  because of the long Federal Register documents and because there was a mismatch of the testing data to the training data (for the computer topics). Both these factors led to less accurate filtering of nonrelevant documents. null The total number of relevant documents found has dropped with each TREC, and that drop has been caused by a deliberate tightening of the topics each year to better guarantee completeness of the relevance judgments (see below for more details on this).</Paragraph>
      <Paragraph position="4">  Evaluation of retrieval results using the assessments from this sampfing method is based on the assumption that the vast majority of relevant documents have been found and that documents that have not been judged can be assumed to be not relevant. A test of this assumption was made using TREC-2 results, and again during the TREC-3 evaluation. In both cases, a second set of 100 documents was examined from each system, using only a sample of topics and systems in TREC-2, and using all topics and systems in TREC-3.</Paragraph>
      <Paragraph position="5"> For the TREC-2 completeness tests, a median of 21 new relevant documents per topic was found (11% increase in total relevant documents). This averages to 3 new relevant documents found in the second 100 documents for each system, and this is a high estimate for all systems since the 7 runs sampled for additional judgments were from the better systems. Similar results were found for the more complete TREC-3 testing, with a median of 30 new relevant documents per topic for the adhoc task, and 13 new ones for the routing task. This averages to well less than one new relevant document per run, since 48 runs from all systems were used in the adhoc test (49 runs in the routing test). These tests show that the levels of completeness found during the TREC-2 and TREC-3 testing are quite acceptable for this type of evaluation.</Paragraph>
      <Paragraph position="6"> The number of new relevant documents found was shown to be correlated with the original number of relevant documents. Table 6 shows the breakdown for the 50 adhoc topics in TREC-3. The median of 30 new relevant documents occurs for a topic with 122 original relevant documents. Topics with many more relevant documents initially tend to have more new ones found, and this has led to a greater emphasis on using topics with fewer relevant documents.</Paragraph>
      <Paragraph position="7">  initial number of relevant documents In addition to the completeness issue, relevance judgments need to be checked for consistency. In each of the TREC evaluations, each topic was judged by a single assessor to ensure the best consistency of judgment. Some testing of this consistency was done after TREC-2, when a sample of the topics and documents was rejudged by a second assessor. The results showed an average agreement between the two judges of about 80%. In TREC-4 all the adhoc topics had samples re judged by two additional assessors, with the results being about 72% agreement among all three judges, and 88% agreement between the initial judge and either one of the two additional judges. This is a remarkably high level of agreement in relevance assessment, and probably is due to the general lack of ambiguity in the topics.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="380" end_page="382" type="metho">
    <SectionTitle>
4. EVALUATION
</SectionTitle>
    <Paragraph position="0"> An important component of TREC was to provide a common evaluation fonun. Standard recall/precision figures have been calculated for each TREC system, along with some single-value evaluation measures. New for TREC-3 was a histogram for each system showing performance on each topic. In general, more emphasis has been placed on a &amp;quot;per topic analysis' in an effort to get beyond the problems of averaging across topics.</Paragraph>
    <Paragraph position="1">  Work has been done, however, to find statistical differences among the systems (see paper &amp;quot;A Statistical Analysis of the TREC-3 Data&amp;quot; by Jean Tague-Sutcliffe and James Blustein in the TREC-3 proceedings.) Additionally charts have been published in the proceedings that consolidate information provided by the systems describing features and system timing, and allowing some primitive comparison of the amount of effort needed to produce the results.</Paragraph>
    <Section position="1" start_page="381" end_page="381" type="sub_section">
      <SectionTitle>
4.1 Definition of Recall/Precision
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows typical recall/precision curves. The x axis plots the recall values at fixed levels of recall, where Recall = number of relevant items retrieved total number of relevant items in collection The y axis plots the average precision values at those given recall values, where precision is calculated by Precision = number of relevant items retrieved total number of items retrieved These curves represent averages over the 50 topics.</Paragraph>
      <Paragraph position="1"> The averaging method was developed many years ago \[4\] and is well accepted by the infomaation retrieval community. The curves show system performance across the full range of retrieval, i.e., at the early stage of retrieval where the highly-ranked documents give high accuracy or precision, and at the final stage of retrieval where there is usually a low accuracy, but more complete retrieval. The use of these curves assumes a ranked output from a system. Systems that provide an unranked set of documents are known to be less effective and therefore were not tested in the TREC program. The curves in figure 2 show that system A has a much higher precision at the low recall end of the graph and therefore is more accurate. System B however has higher precision at the high recall end of the curve and therefore will give a more complete set of relevant documents, assuming that the user is willing to look further in the ranked list.</Paragraph>
    </Section>
    <Section position="2" start_page="381" end_page="382" type="sub_section">
      <SectionTitle>
4.2 Single-Value Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> In addition to recall/precision curves, there are 2 single-value measures used in TREC.</Paragraph>
      <Paragraph position="1"> The first measure, the non-interpolated average precision, corresponds to the area under an ideal (noninterpolated) recall/precision curve. To compute this average, a precision average for each topic is first calculated. This is done by computing the precision after every retrieved relevant document and then averaging  these precisions over the total number of retrieved relevant documents for that topic. These topic averages are then combined (averaged) across all topics in the appropriate set to create the non-interpolated average precision for that set.</Paragraph>
      <Paragraph position="2"> The second measure used is an average of the precision for each topic after I00 documents have been retrieved for that topic. This measure is useful because it reflects a clearly comprehended retrieval point. It took on added importance in the TREC environment because only the top 100 documents retrieved for each topic were actually assessed. For this reason it produces a guaranteed evaluation point for each system.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="382" end_page="392" type="metho">
    <SectionTitle>
5. RESULTS
5.1 Introduction
</SectionTitle>
    <Paragraph position="0"> One of the important goals of the TREC conferences is that the participating groups freely devise their own experiments within the TREC task. For some groups this means doing the routing and/or adhoc task with the goal of achieving high retrieval effectiveness performance. For other groups, however, the goals are more diverse and may mean experiments in efficiency, unusual ways of using the data, or experiments in how &amp;quot;users&amp;quot; would view the TREC paradigm.</Paragraph>
    <Paragraph position="1"> The overview of the results discusses the effectiveness of the systems and analyzes some of the similarities and differences in the approaches that were taken. It points to some of the other experiments run in TREC-3 where results cannot be measured completely using recall/precision measures, and discusses the tracks in TREC-4.</Paragraph>
    <Paragraph position="2"> In all cases, readers are referred to the system papers in the TREC-3 and TREC-4 proceedings for more details.</Paragraph>
    <Section position="1" start_page="382" end_page="385" type="sub_section">
      <SectionTitle>
5.2 TREC-3 Adhoc Results
</SectionTitle>
      <Paragraph position="0"> The TREC-3 adhoc evaluation used new topics (topics 151-200) against two disks of training documents (disks 1 and 2). A dominant feature of the adhoc task in TREC-3 was the removal of the concepts field in the topics (see more on this in the discussion of the topics, section 3.3) Many of the participating groups designed their experiments around techniques to expand the shorter and less &amp;quot;rich&amp;quot; topics.</Paragraph>
      <Paragraph position="1"> There were 48 sets of results for adhoc evaluation in TREC-3, with 42 of them based on runs for the full data set. Of these, 28 used automatic construction of queries, 12 used manual construction, and 2 used interactive construction.</Paragraph>
      <Paragraph position="2"> Figure 3 shows the recall/precision curves for the 6 TREC-3 groups with the highest non-interpolated average precision using automatic construction of queries. The runs are ranked by the average precision and only one run is shown per group (both official Cornell runs would have qualified for this set).</Paragraph>
      <Paragraph position="3"> A short summary of the techniques used in these runs shows the breadth of the approaches. For more details on the various runs and procedures, please see the cited paper in the TREC-3 proceedings.</Paragraph>
      <Paragraph position="4"> cityal -- City University, London (&amp;quot;Okapi at TREC-3&amp;quot; by S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu and M. Gatford) used a probabilistic term weighting scheme similar to that used in TREC-2, but expanded the topics by up to 40 terms (average around 20) automatically selected from the top 30 documents retrieved. They also used dynamic passage retrieval in addition to the whole document retrieval in their final ranking.</Paragraph>
      <Paragraph position="5">  W. Brace Croft and Daniel W. Nachbar) used a version ofprobabilistic weighting that allows easy combining of evidence (an inference net). Their basic term weighting formula (and query processing) was simplified from that used in TREC-2, and they also used passage retrieval and whole document information in their ranking. The topics were expanded by 30 phrases that were automaticaUy selected from a phrase &amp;quot;thesaurus&amp;quot; that had been previously built automatically from the entire corpus of documents.</Paragraph>
      <Paragraph position="6"> CrnlEA -- ComeU University (&amp;quot;Automatic Query Expansion Using SMART: TREC-3 by Chris Buckley, Gerard Salton, James Allan and Amit Singhal) used the vector-space SMART system, with term weighting similar to that done in TREC-2. The top 30 documents were used in a Rocchio relevance feedback technique to massively expand (500 terms + 10 phrases) the topics. No passage retrieval was done in this run; the second Cornell run (CrnlLA) used their local/global weighting schemes (with no topic expansion).</Paragraph>
      <Paragraph position="8"> used in document ranking, but only minimal topic expansion was used, with that expansion based on pre-constructed general-purpose synonym classes for abbreviations and other exact synonyms.</Paragraph>
      <Paragraph position="9"> pircsl - Queens College, CUNY (&amp;quot;TREC-3 Ad-Hoc, Routing Retrieval and Thresholding Experiments using PIRCS&amp;quot; by K.L. Kwok, L. Grunfeld and D.D. Lewis) used a spreading activation model on subdocuments (550-word chunks). Topic expansion was done by allowing activation from the top 6 documents in addition to the terms in the original topic. The highest 30 terms were chosen, with an average of 11 of those not in the original topic.</Paragraph>
      <Paragraph position="10"> ETHO02 -- Swiss Federal Institute of Technology (ETH) (&amp;quot;Improving a Basic Retrieval Method by Links and Passage Level Evidence&amp;quot; by Daniel Knaus, Elke Mittendorf and Peter Schauble) used a completely new method in TREC-3 based on combining information from three very different retrieval techniques. The three techniques are a vector-space system, a passage retrieval method using a Hidden Markov model, and a &amp;quot;topic expansion&amp;quot; method based on document links generated automaticaUy from analysis of common phrases.</Paragraph>
      <Paragraph position="11"> The dominant new themes in the automatic adhoc runs are the use of some type of term expansion beyond the terms contained in the &amp;quot;less rich&amp;quot; (TREC-3) topics, and some form of passage or subdocument retrieval element. Note that term expansion is mostly a recall device; adding new terms to a topic increases the chances of matching the wide variation of terms usually found in relevant documents. But adding terms also increases the &amp;quot;noise&amp;quot; factor, so accuracy may need to be improved via a precision device, and hence the use of passages, subdocuments, or more local weighting.</Paragraph>
      <Paragraph position="12"> Two main types of term expansion were used by these top groups: term expansion based on a pre-constructed thesaurus (for example the INQUERY PhraseFinder) and term expansion based on selected terms from the top X documents (as done by City, ComeU, and PIRCS).</Paragraph>
      <Paragraph position="13"> Both techniques worked well. The top 3 runs (cityal, 1NQI01, and CrnlEA) have excellent performance (see Figure 3) in the &amp;quot;middle&amp;quot; recall range (30 to 80%), with this performance likely coming from the query expansion. null The use of the top 30 documents as a source of terms, as opposed to using the entire corpus, should be sensitive to the quality of the documents in this initial set. Notably, for 6 of the 8 topics in which the INQI01 run was superior (a 20% or more improvement in average precision) to the cityal run, the 1NQ101 run was also superior to the CrnlEA run. These topics tended to have  for Passage Retrieval and Topic Expansion fewer relevant documents, but also tended to be topics for which the systems bringing terms in manually (such as by manually selecting from a thesaurus or outside sources) also did well.</Paragraph>
      <Paragraph position="14"> Another factor in topic expansion is the number of terms being added to the topics. The average number of terms in the queries is widely varied, with the City group averaging around 50 terms (20 terms from expansion), the INQUERY system using around 100 terms on average, and the Comell system using 550 terms on average. This huge variation seemed to have little effect on results, largely because each group found the level of topic expansion appropriate for their retrieval techniques. The cityal run tended to &amp;quot;miss&amp;quot; more relevant documents than the CrnlEA run (7 topics were seriously hurt by this problem), but was better able to rank relevant documents within the 1000 document cutoff so that more relevant documents appeared in the top 100 documents. This better ranking could have happened because of the many fewer terms that were used, or could be caused by the use of passage retrieval in the City run.</Paragraph>
      <Paragraph position="15"> The use of passages or subdocuments to reduce the noise effect of large documents has been used for several years in the PIRCS system. City, INQUERY and Cornell all did many experiments for TREC-3 to first determine the correct length of a passage, and then to find the appropriate use of passages in their ranking schemes. INQUERY and Cornell use overlapped passages of fixed length (200 words) as compared to City's non-overlapped passages of 4 to 30 paragraphs in length. All three systems use information from passages and whole documents retrieved rather than passage retrieval alone. (Cornell's version of this is called local/global weighting.) Both INQUERY and City combined the passage retrieval with query expansion; Cornell did two separate runs.</Paragraph>
      <Paragraph position="16"> The westpl run did not use topic expansion, although a rrfixture of passages and whole documents was used in the final ranking of documents. The performance has suffered for this in the middle recall range. West Publishing used their production system to see how far it differed from the research systems and therefore did not want to use more radical topic expansion methods.</Paragraph>
      <Paragraph position="17"> Additionally they used a shortened topic (title + description + first sentence of narrative) because it was more similar in length to the topics submitted by their users. The INQI01 run had 18 topics with superior performance to the westpl run, mostly because of new relevant documents being retrieved to the top 1000 document set. The westpl nan was superior to the INQIO1 run for l I topics, mostly caused by better ranking for those topics.</Paragraph>
      <Paragraph position="18"> The pircsl system used both passage retrieval (subdocuments) and topic expansion. This system used far fewer top documents for expansion (the top 6 as opposed to the top 30), and this may have hurt performance. There were 22 topics in which the INQIO1 run was superior to the pircs2 nan, and these were mostly because of missed relevant documents. Even though both systems added about the same number of expansion terms, using only the top 6 documents as a source of terms for spreading activation might have provided too much focussing of the concepts.</Paragraph>
      <Paragraph position="19"> The ETHO01 run used both topic expansion and passages, in addition to a baseline vector-space system.</Paragraph>
      <Paragraph position="20"> Both the topic expansion and the passage determination were completely new (untried) techniques; additionally there are known difficulties in combining multiple methods. In comparison to the ComeU expansion results (CrnlEA), the main problems appear to be missed relevant documents for all 17 of the topics where the Cornell results were superior. The ETH results were superior for 8 topics, mostly because of better ranking.</Paragraph>
      <Paragraph position="21"> Clearly this is a very promising approach and more experimentation is needed.</Paragraph>
      <Paragraph position="22"> Table 7 shows a breakdown of improvements from expansion and passage retrieval that combines information from the non-official runs given in the individual papers. In general groups seem to be getting about 20% improvement over their own baselines (less for ETH and PIRCS), with that improvement coming in different percentages from passage retrieval or expansion, depending on the specific retrieval techniques being used.</Paragraph>
      <Paragraph position="23">  Figure 4 shows the recall/precision curves for the 6 TREC-3 groups with the highest non-interpolated average precision using manual construction of queries. A short summary of the techniques used in these runs follows. Again, for more details on the various runs and procedures, see the cited papers in the TREC-3 proceedings. null INQI02 -- University of Massachusetts at Amherst.</Paragraph>
      <Paragraph position="24"> This run is a manual modification of the INQIO1 run, with strict rules for the modifications to only allow removal of words and phrases, modification of weights, and addition of proximity restrictions.</Paragraph>
    </Section>
    <Section position="2" start_page="385" end_page="387" type="sub_section">
      <SectionTitle>
Brkly7 - University of California, Berkeley (&amp;quot;Experi-
</SectionTitle>
      <Paragraph position="0"> ments in the Probabilistic Retrieval of Full Text Documents&amp;quot; by William S. Cooper, Aitao Chert and Fredric C. Gey) is a modification of the Brkly6 run, with that modification being the manual expansion of the queries by adding synonyms found from other sources. The Brkly6 run uses a logistic regression model to combine information from 6 measures of document relevancy based on term matches and term distribution. The coefficients were learned from the training data in a manner similar to that done in TREC-2, but the specific set of measures used has been expanded and modified for TREC-3. No passage retrieval was done.</Paragraph>
      <Paragraph position="1"> ASSCTV1 - Mead Data Central, Inc (&amp;quot;Query Expansion/Reduction and its Impact on Retrieval Effectiveness&amp;quot; by X. Allan Lu and Robert B Keefer) is also a manual expansion of queries using an associative thesaurus built from the TREC data. The retrieval system used in ASSCTV1 is the SMART system.</Paragraph>
      <Paragraph position="2"> VTc2s2 -- Virginia Tech (&amp;quot;Combination of Multiple Searches&amp;quot; by Joseph A. Shaw and Edward A. Fox) used a combination of multiple types of queries, with 2 types of natural language vector-space queries and 3 types of manually constructed P-Norm (soft Boolean) queries.</Paragraph>
      <Paragraph position="3"> pircs2 - Queens College, CUNY. This run is a modification of the base PIRCS system to use manually constructed soR Boolean queries.</Paragraph>
      <Paragraph position="4"> rutfuaI - Rutgers University (&amp;quot;Decision Level Data Fusion for Routing of Documents in the TREC3 Context: A Best Cases Analysis of Worst Case Results&amp;quot; by Paul B. Kantor) used data fusion methods to combine the retrieval ranks from three different retrieval schemes all using the INQUERY system. Two of the schemes used Boolean queries (one with ranking and one without) and the third used the same queries without operatots. null The three dominant themes in the runs using  manually constructed queries are manual modification of automatically generated queries (INQI02), manual expansion of queries (Brkly7 and ASSCTV1) and combining of multiple retrieval techniques or queries. Three runs can be compared to a &amp;quot;baseline&amp;quot; run to check the effects of manual versus automatic query construction.</Paragraph>
      <Paragraph position="5"> INQi02, the manually modified version of 1NQ101, had a 15% improvement in average precision over INQi01, and 17 topics that were superior in performance for the manual system (as opposed to only 3 for the automatic system). An analysis of those topics shows that many more relevant documents were in the top 1000 documents and the top 100 documents, probably caused by manually eliminating much of the noise that was producing higher ranks for nonrelevant documents. This noise elimination could have happened because many spurious terms had been manually removed from the queries (INQI02 had an average of about 30 terms as opposed to nearly 100 terms in 1NQI01), or could have come from the use of the proximity operators.</Paragraph>
      <Paragraph position="6"> The Brkly7 run, a manually expanded version of Brkly6, used about the same number of terms as the INQI02 run (around 36 terms on average), but the terms had been manually pulled from multiple sources (as opposed to editing an automatic expansion as done by INQUERY). The improvement from Brkly6 to Brkly7 is a 34% gain in average precision, with 25 topics having superior performance in the manually expanded run.</Paragraph>
      <Paragraph position="7"> Note however that there was no topic expansion done in the automatic Brkly6 run, so this improvement represents the results of a good manual topic expansion over no expansion at all.</Paragraph>
      <Paragraph position="8"> The INQUERY system outperforms the Berkeley system by 14% in average precision, with much of that difference coming in the high recall end of the graph (see Figure 4). This is consistent with the difference in their topic expansion techniques in that the automatic expansion (even manually edited) is likely to bring in terms that users might not select from &amp;quot;non-focussed&amp;quot; sources. The ASSCTV1 nan also represents a manual expansion effort, but using a pre-built thesaurus as opposed to using textual sources for the expansion. The topics were expanded to create a query averaging around 135 terms and then were run using the default Cornell SMART system. A comparison of the automatically expanded CrnlEA run and the manually expanded ASS-CTVI run shows minimal difference in average precision, but superior performance in 18 of the topics for the manual expansion (as opposed to only 10 of the topics having superior performance for the automatic Cornell run). In both cases, the improvements come from finding more relevant documents because of the expansions, but different expansion methods help different topics.</Paragraph>
      <Paragraph position="9"> The pircs2 run is a manual query version of the base-line PIRCS system. A soft Boolean query is created from the topic, but no topic expansion is done. There is minimal difference in average precision between the two PIRCS runs, but more topics show superior performance for the soft Boolean query pircs2 run (8 superior topics versus 4 superior topics for the topic expansion pircsl run). It is not clear whether this difference comes from the increased precision of the soft Boolean approach or from the relatively poor performance of the PIRCS term expansion results.</Paragraph>
      <Paragraph position="10"> In TREC-3, as opposed to TRECs 1 and 2, the manual query construction methods perform better than their automatic counterparts. The removal of some of the topic structure (the concepts) has allowed differences to appear that could not be seen in earlier TRECs. Since topic expansion was necessary to produce top scores, the superiority of the manual expansion over no expansion in the Berkeley runs should not be surprising. Less clear is why the manual modifications in the INQI02 run showed superior performance to the automatic nan with no modifications. The likely explanation is that the automatic term expansion methods are relatively uncontrolled in TREC-3 and manual intervention plays an important role.</Paragraph>
      <Paragraph position="11"> The last two groups in the top six systems using manual query construction used some form of combination of retrieval techniques. The Virginia Tech group (VTc2s2) combined the results of up to 5 different types of query constraction (3 P-Norms with different P values and 2 vector-space, one short and one manually expanded) to create their results. They used a simple combination method (adding all the similarity values) and tested various combinations of query types. Their best result combined only two of the query types, one a P-Norm and one a vector-space. A series of additional runs (see paper for details) confirmed that the best method was to combine the results of the best two query techniques (the &amp;quot;long&amp;quot; vector-space and the P=2 P-Norm). They concluded that improvements from combining results only occurred when the input techniques were sufficiently different.</Paragraph>
      <Paragraph position="12"> Although the Rutgers group (rutfual) used more elaborate combining techniques, they came to the same conclusion. Combining different retrieval techniques offers improvements over a single technique (over 30% for the Virginia Tech group), but the input techniques need to be more varied to get further improvements.</Paragraph>
      <Paragraph position="13"> But the more varied the individual techniques, the more  need for elaborate combining methods such as used in the rutfual run. The automatic ETHO01 run best exempli.fies the direction needed here; first getting &amp;quot;good&amp;quot; performance for three very different but complementary techniques and then discovering the best ways of combining results.</Paragraph>
      <Paragraph position="14"> Several comments should be made with respect to the overall adhoc recall/precision averages. First, the better results are very similar and it is unlikely that there is any statistical difference between them. The Scheffe&amp;quot; tests run by Jean Tague-Sutcliffe (see paper &amp;quot;A Statistical Analysis of the TREC-3 Data&amp;quot; by Jean Tague-Sutcliffe and James Blustein in the TREC-3 proceedings) show that the top 20 category A runs (manual and automatic mixed) are all statistically equivalent at the a=0.05 level. This lack of system differentiation comes from the very wide performance variation across topics (the cross-topic variance is much greater than the cross-system variance) and points to the need for more research into how to statistically characterize the TREC results.</Paragraph>
      <Paragraph position="15"> As a second point, it should be noted that these adhoc results represent significant improvements over TREC-2. Figure 5 shows the top three systems in TREC-3 and the top three systems in TREC-2. This improvement was unexpected as the removal of the concepts section seemed likely to cause a considerable performance drop (up to 30% was predicted). Instead the advance of topic expansion techniques caused major improvements in performance with less &amp;quot;user&amp;quot; input (the concepts). Because of the different sets of topics involved, the exact amount of improvement cannot be computed. However the CorneU group has run older systems (those used in TREC-1 and TREC-2) against the TREC-3 topics. This shows an improvement of 20% for their expansion run (CrnlEA) over the TREC-2 system, and this is likely to be typical for many of the systerns this year.</Paragraph>
    </Section>
    <Section position="3" start_page="387" end_page="392" type="sub_section">
      <SectionTitle>
5.3 TREC-4 Adhoc Results
</SectionTitle>
      <Paragraph position="0"> The TREC-4 adhoc evaluation used new topics (topics 201-250) against two disks of training documents (disks 2 and 3). A dominant feature of the adhoc task in TREC-4 was the much shorter topics (see more on this in the discussion of the topics, section 3.3). Many groups tried their automatic query expansion methods on the shorter topics (with good success); other groups also did manual query construction experiments to contrast these methods for the very short topics.</Paragraph>
      <Paragraph position="1"> There were 39 sets of results for adhoc evaluation in TREC-4, with 33 of them based on runs for the full data set. Of these, 14 used automatic construction of queries,</Paragraph>
      <Paragraph position="3"> and 19 used manual construction. All of the category B groups used automatic construction of the queries.</Paragraph>
      <Paragraph position="4"> Figure 6 shows the recall/precision curves for the 6 TREC-4 groups with the highest non-interpolated average precision using automatic construction of queries.</Paragraph>
      <Paragraph position="5"> The runs are ranked by the average precision and only one run is shown per group (both official Cornell runs would have qualified for this set).</Paragraph>
      <Paragraph position="6"> A short summary of the techniques used in these runs shows the breadth of the approaches and the changes in approach from TREC-3. For more details on the various runs and procedures, please see the cited papers in the TREC-4 proceedings.</Paragraph>
      <Paragraph position="7"> CrnlEA -- Comell University (&amp;quot;New Retrieval Approaches Using SMART: TREC-4&amp;quot; by Chris Buckley, Amit Singhal, Mandar Mitra, (Gerald Salton)) used the SMART system, but with a non-cosine length norrealization method. The top 20 documents were used to locate 50 terms and 10 phrases for expansion, as contrasted with using the top 30 documents to massively expand (500 terms + 10 phrases) the topics as in TREC-3. This change in expansion techniques was mostly due to the major change in the basic algorithm.</Paragraph>
      <Paragraph position="8"> However, additional care was taken not to overexpand the very short topics. Work has continued at Comell in improving their radical new matching algorithm, and further information can be found in \[5\].</Paragraph>
      <Paragraph position="9"> pircsl -- Queens College, CUNY (&amp;quot;TREC-4 Ad-Hoc, Routing Retrieval and Filtering Experiments using PIRCS&amp;quot; by K.L. Kwok and L. Gmnfeld) used a spreading activation model on subdocuments (550-word chunks). It was expected that this type of model would be particularly affected by the shorter topics, and experiments were run trying several methods of topic expansion. For this automatic run, expansion was done by selecting 50 terms from the top 40 subdocuments in addition to the terms in the original topic. Several other experiments were made using manual modifications/expansions of the topics and these are reported with the manual adhoc results. The experiments with short topics has continued and further results can be seen in \[6\].</Paragraph>
      <Paragraph position="10"> cityal -- City University, London (&amp;quot;Okapi at TREC-4&amp;quot; by S.E. Robertson, S. Walker, M.M. Beaulieu, M. Gatford and A, Payne&amp;quot;) used a probabilistic term weighting scheme similar to that used in TREC-3. An average of 20 terms were automatically selected from the top 50 documents retrieved (only initial and final passages of these documents were used for term selection). The use of passages seemed to have httle effect. This run was a  base run for theft experiments in manual query editing.</Paragraph>
      <Paragraph position="11"> INQ201 -- University of Massachusetts at Amherst (&amp;quot;Recent Experiments with INQUERY&amp;quot; by James Allan, Lisa Bellesteros, James P. Callan, W. Bruce Croft and Zhihong Lu) used a version of probabilistic weighting that allows easy combining of evidence (an inference net). Their basic term weighting formula underwent a major change between TREC-3 and TREC-4 that combined the TREC-3 INQUERY weighting with the OKAPI (City University) weighting. They also used passage retrieval as in TREC-3, but found it detrimental in TREC-4. The topics were expanded by 30 phrases that were automatically selected from a phrase &amp;quot;thesaurus&amp;quot; (InFinder) that had previously been built automatically from the entire corpus of documents. Expansion did not work as well as in TREC-3, and additional work comparing the use of InFinder and the use of the top documents for expansion is reported in \[7\].</Paragraph>
      <Paragraph position="12"> siemsl -- Siemens Corporate Research (&amp;quot;Siemens  Merging&amp;quot; by Ellen M. Voorhees) used the SMART retrieval strategies from TREC-3 in this run (their base run for the database merging track). The standard vector normalization was used, and query expansion was done using the Rocchio method to select up to 100 terms and 10 phrases from the top 15 documents retrieved.</Paragraph>
      <Paragraph position="13"> citri2 -- RMIT experimented with combining different term weighting functions into similarity measures. The best of these measures combined the standard cosine measure with the OKAPI measure. No topic expansion was done for this run.</Paragraph>
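One simple way to realize the kind of combination described above is a weighted sum of the two per-document similarity scores after putting them on comparable scales. The mixing weight and the max-normalization below are purely illustrative assumptions; the RMIT paper should be consulted for the combination actually used.

```python
def combined_similarity(cosine_score, okapi_score, weight=0.5):
    """Illustrative fusion of two similarity measures by a weighted sum.
    Both scores are assumed to be pre-normalized to comparable ranges."""
    return weight * cosine_score + (1.0 - weight) * okapi_score

# Rank a few documents by the fused score (invented raw scores)
scores = {"doc1": (0.42, 7.1), "doc2": (0.35, 9.4), "doc3": (0.50, 5.2)}
max_okapi = max(o for _, o in scores.values())
fused = {d: combined_similarity(c, o / max_okapi) for d, (c, o) in scores.items()}
print(sorted(fused, key=fused.get, reverse=True))
```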
      <Paragraph position="14"> It is interesting to note that many of the systems did critical work on their term weighting/similarity measures between TREC-3 and TREC-4. Three of the top 6 runs were the results of major revisions in the basic ranking algorithms, revisions that were the outcome of extensive analysis work on previous TREC results. At Cornell they investigated the problems with using cosine normalization on the long documents in TREC. This investigation resulted in a completely new term weighting/similarity strategy that performs well for all lengths of documents. The University of Massachusetts examined the issue of dealing with terms having a high frequency in documents (which is also related to document length). The result of their investigation was a term weighting algorithm that combined the OKAPI algorithm (City University) for high-frequency terms with the old INQUERY algorithm for lower-frequency terms.</Paragraph>
      <Paragraph position="15"> The work at RMIT (the citri2 run) was part of their ongoing effort to test various term weighting schemes.</Paragraph>
      <Paragraph position="16"> These experiments in more sophisticated term weighting and matching algorithms are yet another step in the adaptation of retrieval systems to a full-text environment. The issue of long documents, with their higher-frequency terms, means that the algorithms originally built for abstract-length documents need rethinking. This did not happen in earlier TRECs because the problem seemed less important than, for example, discovering automatic query expansion methods in TREC-3.</Paragraph>
      <Paragraph position="17"> The dominant new feature in TREC-4 was the very short topics. These topics were much shorter than any previous TREC topics (an average reduction from 107 terms in TREC-3 to 16 terms in TREC-4). In general the participating groups took two approaches: 1) they used roughly the same techniques that they would have on the longer topics, and 2) most of them tried some investigative manual experiments. Of the 6 runs shown in Figure 6, two runs (INQ201 and citya1) used a similar number and source of expansion terms as for the longer queries. The SMART group (CrnlEA) used many fewer terms because of their new algorithms. The pircs1 run was a result of more expansion, but this was due to corrections of problems in TREC-3 as opposed to changes needed for the shorter topics. The run from Siemens (siems1) was made as a baseline for database merging, and therefore had less expansion. There was no expansion in the citri2 run.</Paragraph>
      <Paragraph position="18"> Figure 7 compares the results between TREC-3 and TREC-4 for 4 of the groups that did well in each evaluation. As expected, all groups had worse performance. The performance for City University, where similar algorithms were used in TREC-3 and TREC-4, dropped by 36%. A similar drop (34%) was true for the INQUERY results, even though the new algorithm resulted in an almost 5% improvement in results (for the TREC-4 topics). Whereas the Cornell results represented a major improvement in performance over the TREC-3 algorithms, their overall performance still dropped by 14%.</Paragraph>
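The percentage drops quoted above are relative changes in average precision between the two evaluations; for concreteness, the arithmetic is shown below with hypothetical numbers.

```python
def relative_change(old_avg_prec, new_avg_prec):
    """Relative change in (non-interpolated) average precision, in percent."""
    return 100.0 * (new_avg_prec - old_avg_prec) / old_avg_prec

# Hypothetical values: a fall from 0.400 to 0.256 is a 36% drop
print(round(relative_change(0.400, 0.256), 1))  # -36.0
```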
      <Paragraph position="19"> This points to several issues that need further investigation in TREC-5. First, experiments must still continue on the shorter topics, since this represents the typical initial input query. The results from the shorter topics may be so poor that the top documents provide misleading expansion terms. This was a major concern in TREC-3 and analysis of this issue is clearly needed.</Paragraph>
      <Paragraph position="20"> The fact that passage retrieval, which provided substantial improvement of results in TREC-3, did not help with the shorter TREC-4 topics indicates that other types of &amp;quot;noise&amp;quot; control may be needed for short topics. It may be that the statistical &amp;quot;clues&amp;quot; presented by these shorter topics are simply not enough to provide good retrieval performance and that better human-aided systems need to be tested.</Paragraph>
      <Paragraph position="23"> However, the manual systems also suffered major drops in performance (see Figure 8). This leads to a second issue, i.e., a need for further investigation into the causes of the generally poorer performance in the TREC-4 adhoc task. It may be that the narrative section of the topic is necessary to make the intent of the user clear to both the manual query builder and the automatic systems. The fact that machine performance mirrored human performance in TREC-4 makes the decrease in automatic system performance more acceptable, but still requires further analysis into why both types of query construction were so affected by the very short topics.</Paragraph>
      <Paragraph position="24"> Figure 9 shows the recall/precision curves for the 6 TREC-4 groups with the highest non-interpolated average precision using manual construction of queries. A short summary of the techniques used in these runs follows. Again, for more details on the various runs and procedures, see the cited papers in the TREC-4 proceedings. The first of these runs used a two-level searching scheme in which the documents are first ranked via coarse-grain methods, and then the resulting subset is further refined. There are thesaurus tools available for expansion, and this run was the result of many experiments into such issues as term groupings and assignment of term strengths.</Paragraph>
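The two-level scheme described above is a cheap coarse ranking followed by finer re-scoring of the surviving subset. The sketch below shows only that control flow with stand-in scoring functions; the actual coarse and fine methods of that system are not described here in enough detail to reproduce.

```python
def two_level_search(query_terms, docs, coarse_cutoff=1000, final_cutoff=10):
    """Two-level retrieval: rank everything with a cheap coarse score,
    then re-rank only the surviving subset with a finer (costlier) score."""

    def coarse_score(doc):
        # Stand-in coarse pass: count of distinct query terms present
        words = set(doc.split())
        return sum(1 for t in query_terms if t in words)

    def fine_score(doc):
        # Stand-in fine pass: length-normalized term-frequency match
        # (a real system would do full weighting, phrases, thesaurus expansion, ...)
        words = doc.split()
        return sum(words.count(t) for t in query_terms) / (len(words) or 1)

    coarse = sorted(docs, key=coarse_score, reverse=True)[:coarse_cutoff]
    return sorted(coarse, key=fine_score, reverse=True)[:final_cutoff]

docs = ["oil spill cleanup", "stock market news", "oil oil tanker spill report"]
print(two_level_search(["oil", "spill"], docs, coarse_cutoff=2, final_cutoff=2))
```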
      <Paragraph position="26"> pircs2 -- Queens College, CUNY (&amp;quot;TREC-4 Ad-Hoc, Routing Retrieval and Filtering Experiments using PIRCS&amp;quot; by K.L. Kwok and L. Grunfeld) is a manual modification of the automatic queries in pircs1. The modification was to replicate words (this increases the weight) and to add a few associated words (an average of 1.73 words per query, or at most 3 content words).</Paragraph>
      <Paragraph position="27"> The simple replication of words led to a 12% increase in performance; adding the associated words (the pircs2 run) raised this to a 30% improvement over the initial automatic query.</Paragraph>
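Replicating a query word raises its effective weight under most tf-based query weighting schemes, which is why the manual replication above helps. The sketch below shows that effect with a simple tf-style query weight; it illustrates the principle only and is not the PIRCS weighting itself.

```python
from collections import Counter

def query_weights(query_tokens):
    """Simple tf-style query weights: repeated tokens get proportionally
    more weight, so manual replication boosts a term's influence."""
    counts = Counter(query_tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

original = "oil spill cleanup".split()
modified = "oil oil oil spill cleanup booms".split()   # replicate "oil", add a word
print(query_weights(original))
print(query_weights(modified))
```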
      <Paragraph position="28"> uwgcl1 -- University of Waterloo (&amp;quot;Shortest Substring Ranking (MultiText Experiments for TREC-4)&amp;quot; by Charles L.A. Clarke, Gordon V. Cormack, and Forbes J. Burkowski) used queries that were manually built in a special query language called GCL. This query language uses Boolean operators and proximity constraints to create intervals of text that satisfy specific conditions. The ranking algorithms rely on combining the results of increasingly less restrictive queries until the  only allow removal of words and phrases, modification of weights, and addition of proximity restrictions. This type of manual modification increased overall average precision by 21%. The same types of modification gained only 15.5% in TREC-3.</Paragraph>
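The interval-based queries described above rank documents by finding text extents that satisfy Boolean and proximity constraints; a central primitive is locating short spans that contain all required terms. The sketch below computes the shortest such span for one document with a standard sliding-window approach; it is a simplification for illustration, not the MultiText group's actual shortest-substring machinery, and the example document is invented.

```python
from collections import defaultdict

def shortest_cover(tokens, required):
    """Length of the shortest contiguous span of tokens containing every
    required term (float('inf') if no such span exists). Shorter covers
    suggest a tighter, more relevant match."""
    required = set(required)
    counts = defaultdict(int)
    have, left, best = 0, 0, float("inf")
    for right, tok in enumerate(tokens):
        if tok in required:
            counts[tok] += 1
            if counts[tok] == 1:
                have += 1
        while have == len(required):
            best = min(best, right - left + 1)
            if tokens[left] in required:
                counts[tokens[left]] -= 1
                if counts[tokens[left]] == 0:
                    have -= 1
            left += 1
    return best

doc = "the oil tanker ran aground and the spill spread toward the coast".split()
print(shortest_cover(doc, ["oil", "spill"]))  # prints 7: shortest span with both terms
```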
    </Section>
  </Section>
</Paper>