<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1031">
  <Title>THE TEXT RETRIEVAL CONFERENCES (TRECS)</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE TEXT RETRIEVAL CONFERENCES (TRECS)
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="241" type="metho">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Phase III of the TIPSTER project included three workshops for evaluating document detection (information retrieval) projects: the fifth, sixth and seventh Text REtrieval Conferences (TRECs). This work was co-sponsored by the National Institute of Standards and Technology (NIST), and included evaluation not only of the TIPSTER contractors, but also of many information retrieval groups outside of the TIPSTER project. The conferences were run as workshops that provided a forum for participating groups to discuss their system results on the retrieval tasks done using the TIPSTER/TREC collection. As with the first four TRECs, the goals of these workshops were: * to encourage research in text retrieval based on large test collections; * to increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas; * to speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems; * to increase the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems; and * to serve as a showcase for state-of-the-art retrieval systems for DARPA and its clients.</Paragraph>
    <Paragraph position="1"> For each TREC, NIST provides a test set of documents and questions. Participants run their retrieval systems on the data, and return to NIST a list of the retrieved top-ranked documents. NIST pools the individual results, judges the retrieved documents for correctness, and evaluates the results. The TREC cycle ends with a workshop that is a forum for participants to share their experiences. The most recent workshop in the series, TREC-7, was held at NIST in November 1998.</Paragraph>
    <Paragraph position="2"> The number of participating systems has grown from 25 in TREC-1 to 38 in TREC-5 (Table 1), 51 in TREC-6 (Table 2), and 56 in TREC-7 (Table 3).</Paragraph>
    <Paragraph position="3"> The groups include representatives from 16 different countries and 32 companies.</Paragraph>
    <Paragraph position="4"> TREC provides a common test set to focus research on a particular retrieval task, yet actively encourages participants to do their own experiments within the umbrella task. The individual experiments broaden the scope of the research that is done within TREC and make TREC more attractive to individual participants. This marshaling of research efforts has succeeded in improving the state of the art in retrieval technology, both in the level of basic performance (see Figure 1) and in the ability of these systems to function well in diverse environments, such as retrieval in a filtering operation or retrieval against multiple languages.</Paragraph>
    <Paragraph position="5"> Each of the TREC conferences has centered around two main tasks: the routing task (not run in TREC-7) and the ad hoc task (these tasks are described in more detail in Section 2). In addition, starting in TREC-4 a set of &amp;quot;tracks&amp;quot; or tasks that focus on particular subproblems of text retrieval was introduced. These tracks include tasks that concentrate on a specific part of the retrieval process (such as the interactive track, which focuses on user-related issues), or tasks that tackle research in related areas, such as the retrieval of spoken &amp;quot;documents&amp;quot; from news broadcasts. The graph in Figure 1 shows that retrieval effectiveness has approximately doubled since the beginning of TREC. This means, for example, that retrieval engines that could retrieve three good documents within the top ten documents retrieved in 1992 are now likely to retrieve six good documents in the top ten documents retrieved for the same search. The figure plots retrieval effectiveness for one well-known retrieval engine, the SMART system of Cornell University. The SMART system has consistently been one of the more effective systems in TREC, but other systems are comparable with it, so the graph is representative of the increase in effectiveness for the field as a whole. Researchers at Cornell ran the version of SMART used in each of the seven TREC conferences against each of the seven ad hoc test sets (Buckley, Mitra, Walz, &amp; Cardie, 1999). Each line in the graph connects the mean average precision scores produced by each version of the system for a single test. For each test, the TREC-7 system has a markedly higher mean average precision than the TREC-1 system. The recent decline in the absolute scores reflects the evolution towards more realistic, and difficult, test questions, and also possibly a dilution of effort because of the many tracks being run in TRECs 5, 6, and 7.</Paragraph>
    <Paragraph position="6"> The seven TREC conferences represent hundreds of retrieval experiments. The Proceedings of each conference captures the details of the individual experiments, and the Overview paper in each Proceedings summarizes the main findings of each conference. A special issue on TREC-6 will be published in Information Processing and Management (Voorhees, in press), which includes an Overview of TREC-6 (Voorhees &amp; Harman, in press) as well as an analysis of the TREC effort by Sparck Jones (in press).</Paragraph>
  </Section>
  <Section position="3" start_page="241" end_page="244" type="metho">
    <SectionTitle>
2 THE TASKS
</SectionTitle>
    <Paragraph position="0"> Each of the TREC conferences has centered around two main tasks, the routing task and the ad hoc task.</Paragraph>
    <Paragraph position="1"> In addition, starting in TREC-4 a set of &amp;quot;tracks,&amp;quot; tasks that focus on particular subproblems of text retrieval, was introduced. This section describes the goals of the two main tasks. Details regarding the tracks are given in Section 6.</Paragraph>
    <Section position="1" start_page="241" end_page="243" type="sub_section">
      <SectionTitle>
2.1 The Routing Task
</SectionTitle>
      <Paragraph position="0"> The routing task in the TREC workshops investigates the performance of systems that use standing queries to search new streams of documents. These searches are similar to those required by news clipping services and library profiling systems. A true routing environment is simulated by training on topics with known relevant documents and then testing with those questions on a completely new document set.</Paragraph>
      <Paragraph position="1"> The training for the routing task is shown in the left-hand column of Figure 2. Participants are given a set of topics and a document set that includes known relevant documents for those topics. The topics consist of natural language text describing a user's information need (see sec. 3.2 for details). The topics are used to create a set of queries (the actual input to the retrieval system) that are then used against the training documents. This is represented by Q1 in the diagram. Many Q1 query sets might be built to help adjust the retrieval system to the task, to create better weighting algorithms, and to otherwise prepare the system for testing. The result of the training is query set Q2, routing queries derived from the routing topics and run against the test documents.</Paragraph>
      <Paragraph position="2"> The testing phase of the routing task is shown in the middle column of Figure 2. The output of running Q2 against the test documents is the official test result for the routing task.</Paragraph>
    </Section>
    <Section position="2" start_page="243" end_page="243" type="sub_section">
      <SectionTitle>
2.2 The Ad Hoc Task
</SectionTitle>
      <Paragraph position="0"> The ad hoc task investigates the performance of systems that search a static set of documents using new topics. This task is similar to how a researcher might use a library--the collection is known but the questions likely to be asked are not known. The right-hand column of Figure 2 depicts how the ad hoc task is accomplished in TREC. Participants are given a document collection consisting of approximately 2 gigabytes of text and 50 new topics. The set of relevant documents for these topics in the document set is not known at the time the participants receive the topics. Participants produce a new query set, Q3, from the ad hoc topics and run those queries against the ad hoc documents. The output from this run is the official test result for the ad hoc task.</Paragraph>
    </Section>
    <Section position="3" start_page="243" end_page="244" type="sub_section">
      <SectionTitle>
2.3 Task Guidelines
</SectionTitle>
      <Paragraph position="0"> In addition to the task definitions, TREC participants are given a set of guidelines outlining acceptable methods of indexing, knowledge base construction, and generating queries from the supplied topics. In general, the guidelines are constructed to reflect an actual operational environment and to allow fair comparisons among the diverse query construction approaches. The allowable query construction methods in TRECs 5, 6, and 7 were divided into automatic methods, in which queries are derived completely automatically from the topic statements, and manual methods, which include queries generated by all other methods. This definition of manual query construction methods permitted users to look at individual documents retrieved by the ad hoc queries and then reformulate the queries based on the documents retrieved.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="244" end_page="246" type="metho">
    <SectionTitle>
3 THE TEST COLLECTIONS
</SectionTitle>
    <Paragraph position="0"> Like most traditional retrieval test collections, there are three distinct parts to the collections used in TREC: the documents, the questions or topics, and the relevance judgments or &amp;quot;right answers.&amp;quot; This section describes each of these pieces for the collections used in the main tasks in TRECs 5, 6, and 7. Many of the tracks have used the same data, or data constructed in a similar manner but in a different environment, such as in multiple languages or using different guidelines (such as high precision searching).</Paragraph>
    <Section position="1" start_page="244" end_page="244" type="sub_section">
      <SectionTitle>
3.1 Documents
</SectionTitle>
      <Paragraph position="0"> TREC documents are distributed on CD-ROM's with approximately 1 GB of text on each, compressed to fit. Table 3.1 shows the statistics for all the English document collections used in TREC. TREC-5 used disks 2 and 4 for the ad hoc testing, while TRECs 6 and 7 used disks 4 and 5 for ad hoc testing. The FBIS on disk 5 (FBIS-1) was used for testing in the TREC-5 routing task and for training in the TREC-6 routing task, with new FBIS (FBIS-2) being used for testing in TREC-6. There was no routing task in TREC-7.</Paragraph>
      <Paragraph position="1"> Documents are tagged using SGML to allow easy parsing (see Fig. 3). The documents in the different datasets have been tagged with identical major structures, but they have different minor structures. The philosophy in the formatting at NIST is to leave the data as close to the original as possible. No attempt is made to correct spelling errors, sentence fragments, strange formatting around tables, or similar faults.</Paragraph>
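To make the document format concrete, the sketch below pulls documents out of a TREC-style SGML file. It assumes the standard DOC, DOCNO, and TEXT tags; the minor tags vary by dataset, as noted above, and the file path and encoding are illustrative choices rather than anything specified by TREC.

# Minimal sketch of extracting documents from a TREC-style SGML file.
# Tag names (DOC, DOCNO, TEXT) follow the usual TREC markup; minor tags
# differ between collections, as the paper notes.
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

def parse_trec_file(path):
    """Yield (docno, text) pairs from one TREC SGML file."""
    with open(path, encoding="latin-1") as f:   # encoding is an assumption
        data = f.read()
    for doc in DOC_RE.findall(data):
        docno = DOCNO_RE.search(doc)
        body = " ".join(TEXT_RE.findall(doc))   # some documents have several TEXT blocks
        if docno:
            yield docno.group(1), body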
    </Section>
    <Section position="2" start_page="244" end_page="245" type="sub_section">
      <SectionTitle>
3.2 Topics
</SectionTitle>
      <Paragraph position="0"> In designing the TREC task, there was a conscious decision made to provide &amp;quot;user need&amp;quot; statements rather than more traditional queries. Two major issues were involved in this decision. First, there was a desire to allow a wide range of query construction methods by keeping the topic (the need statement) distinct from the query (the actual text submitted to the system). The second issue was the ability to increase the amount of information available about each topic, in particular to include with each topic a clear statement of what criteria make a document relevant.</Paragraph>
      <Paragraph position="1"> The topics used in TREC-1 and TREC-2 (topics 1-150) were very detailed, containing multiple fields and lists of concepts related to the subject of the topics. The ad hoc topics used in TREC-3 (151-200) were much shorter and did not contain the complex structure of the earlier topics. Nonetheless, participants in TREC-3 felt that the topics were still too long compared with what users normally submit to operational retrieval systems. Therefore the TREC-4 topics (201-250) were made even shorter: a single field consisting of a one sentence description of the information need. Figure 4 gives a sample topic from each of these sets.</Paragraph>
      <Paragraph position="2"> One of the conclusions reached in TREC-4 was that the much shorter topics caused both manual and automatic systems trouble, and that there were issues associated with using short topics in TREC that needed further investigation (Harman, 1996). Accordingly, the TREC-5 ad hoc topics re-introduced the title and narrative fields, making the topics similar in format to the TREC-3 topics. TREC-6 and TREC-7 topics used this same format, as shown in Figure 5. While having the same format as the TREC-3 topics, on average the later topics are shorter (contain fewer words) than the TREC-3 topics. Table 3.2 shows the lengths of the various sections in the TREC topics as they have evolved over the 7 TRECs.</Paragraph>
      <Paragraph position="3"> Since TREC-3, the ad hoc topics have been created by the same person (or assessor) who performed the relevance assessments for that topic. Each assessor comes to NIST with ideas for topics based on his or her own interests, and searches the ad hoc collection (looking at approximately 100 documents per topic) to estimate the likely number of relevant documents per candidate topic. NIST personnel select the final 50 topics from among these candidates, based on having both a reasonable range of estimated number of relevant documents across topics and on balancing the load across assessors.</Paragraph>
    </Section>
    <Section position="3" start_page="245" end_page="246" type="sub_section">
      <SectionTitle>
3.3 Relevance Assessments
</SectionTitle>
      <Paragraph position="0"> Relevance judgments are of critical importance to a test collection. For each topic it is necessary to compile a list of relevant documents--as comprehensive a list as possible. All TRECs have used the pooling method (Sparck Jones &amp; van Rijsbergen, 1975) to assemble the relevance assessments. In this method a pool of possible relevant documents is created by taking a sample of documents selected by the various participating systems. This pool is then shown to the human assessors. The particular sampling method used in TREC is to take the top 100 documents retrieved in each submitted run for a given topic and merge them into the pool for assessment. This is a valid sampling technique since all the systems used ranked retrieval methods, with those documents most likely to be relevant returned first. On average, an assessor judges approximately 1500 documents per topic.</Paragraph>
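The pooling step itself is simple to state in code. The sketch below is illustrative: the run and topic data structures are assumptions, and the pool depth of 100 follows the sampling rule described above.

# Illustrative sketch of the TREC pooling step: for each topic, take the
# top-ranked documents from every submitted run, merge them, and remove
# duplicates to form the pool given to the assessors.
def build_pool(runs, topic, depth=100):
    """runs: dict mapping run name to {topic: [docno, ...] in rank order}."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking.get(topic, [])[:depth])
    return sorted(pool)   # order is irrelevant; each document is judged once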
      <Paragraph position="1"> Given the vital role relevance judgments play in a test collection, it is important to assess the quality of the judgments created in TREC. In particular, both the completeness and the consistency of the relevance judgments are of interest. Completeness measures the degree to which all the relevant documents for a topic have been found; consistency measures the degree to which the assessor has marked all the &amp;quot;truly&amp;quot; relevant documents relevant and the &amp;quot;truly&amp;quot; irrelevant documents irrelevant.</Paragraph>
      <Paragraph position="2"> The completeness of the TREC relevance judgments has been investigated both at NIST (Harman, 1996) and independently at the Royal Melbourne Institute of Technology (RMIT) (Zobel, 1998). Both studies found that the completeness for most topics is adequate, though topics with many relevant documents are likely to have yet more relevant documents that have not been found through pooling.</Paragraph>
      <Paragraph position="3"> For this reason, NIST has deliberately chosen more tightly focused topics in recent TRECs. Both studies also found that any lack of completeness did not bias the results of particular systems. Indeed, the RMIT study showed that systems that did not contribute documents to the pool can still be evaluated fairly with the resulting judgments.</Paragraph>
      <Paragraph position="4"> The consistency of the TREC judgments was investigated at NIST by obtaining multiple independent assessments for a set of topics and evaluating systems using each of the different judgment sets (Voorhees, 1998). The study confirmed that the comparative results for different runs remain stable despite changes in the underlying judgments. Taken together, these studies validate the use of the TREC collections for retrieval research.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="246" end_page="249" type="metho">
    <SectionTitle>
4 EVALUATION
</SectionTitle>
    <Paragraph position="0"> An important element of TREC is to provide a common evaluation forum. &lt;num&gt; Number: 051 &lt;dom&gt; Domain: International Economics &lt;title&gt; Topic: Airbus Subsidies &lt;desc&gt; Description: Document will discuss government assistance to Airbus Industrie, or mention a trade dispute between Airbus and a U.S. aircraft producer over the issue of subsidies.</Paragraph>
    <Paragraph position="1"> &lt;narr&gt; Narrative: A relevant document will cite or discuss assistance to Airbus Industrie by the French, German, British or Spanish government(s), or will discuss a trade dispute between Airbus or the European governments and a U.S. aircraft producer, most likely Boeing Co. or McDonnell Douglas Corp., or the U.S. government, over federal subsidies to Airbus.</Paragraph>
    <Paragraph position="2"> &lt;con&gt; Concept(s): 1. Airbus Industrie 2. European aircraft consortium, Messerschmitt-Boelkow-Blohm GmbH, British Aerospace PLC, Aerospatiale, Construcciones Aeronauticas S.A. 3. federal subsidies, government assistance, aid, loan, financing 4. trade dispute, trade controversy, trade tension 5. General Agreement on Tariffs and Trade (GATT) aircraft code 6. Trade Policy Review Group (TPRG) 7. complaint, objection 8. retaliation, anti-dumping duty petition, countervailing duty petition, sanctions &lt;num&gt; Number: 168 &lt;title&gt; Topic: Financing AMTRAK &lt;desc&gt; Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK). &lt;narr&gt; Narrative:  A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.</Paragraph>
    <Paragraph position="3">  &lt;num&gt; Number: 207 &lt;desc&gt; What are the prospects of the Quebec separatists achieving independence from the rest of Canada?  &lt;num&gt; Number: 312 &lt;title&gt; Hydroponics &lt;desc&gt; Description: Document will discuss the science of growing plants in water or some substance other than soil.</Paragraph>
    <Paragraph position="4"> &lt;narr&gt; Narrative:  A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydroponics, advantages over standard soil agricultural practices, or the approach of suspending roots in a humid enclosure and spraying them periodically with a nutrient solution to promote plant growth.</Paragraph>
    <Paragraph position="5"> A standard evaluation package, called trec_eval, is used to evaluate each of the submitted runs. trec_eval was developed by Chris Buckley at Cornell University and is available by anonymous ftp from ftp.cs.cornell.edu in the pub/smart directory. TREC reports a variety of recall- and precision-based evaluation measures for each run to give a broad picture of the run.</Paragraph>
    <Paragraph position="6"> Since TREC-3 there has been a histogram for each system showing performance on each topic. In general, more emphasis has been placed in later TRECs on a &amp;quot;per topic analysis&amp;quot; in an effort to get beyond the problems of averaging across topics. Work has been done, however, to find statistical differences among the systems (see the paper &amp;quot;A Statistical Analysis of the TREC-3 Data&amp;quot; by Jean Tague-Sutcliffe and James Blustein in the TREC-3 proceedings). Additionally, charts have been published in the proceedings that consolidate information provided by the systems describing features and system timing, allowing some primitive comparison of the amount of effort needed to produce the results.</Paragraph>
    <Paragraph position="7"> Figure 4 shows two typical recall/precision curves. The x axis plots a fixed set of recall levels, where Recall = (number of relevant items retrieved) / (total number of relevant items in the collection). The y axis plots precision values at the given recall level, where Precision = (number of relevant items retrieved) / (total number of items retrieved). These curves represent averages over the 50 topics. The averaging method was developed many years ago (Salton &amp; McGill, 1983) and is well accepted by the information retrieval community. The curves show system performance across the full range of retrieval, i.e., at the early stage of retrieval where the highly-ranked documents give high accuracy or precision, and at the final stage of retrieval where there is usually a low accuracy, but more complete retrieval. The use of these curves assumes a ranked output from a system. Systems that provide an unranked set of documents are known to be less effective and therefore were not tested in the TREC program.</Paragraph>
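Such curves are usually built by interpolating precision at a fixed set of recall levels. The sketch below uses the common rule of taking, at each level, the maximum precision observed at that recall or higher; the 11-point levels and the data structures are assumptions for illustration.

# Sketch of an interpolated recall/precision curve: precision at recall level r
# is the maximum precision observed at any recall point at or above r.
def recall_precision_curve(ranked_docnos, relevant, levels=None):
    if levels is None:
        levels = [i / 10.0 for i in range(11)]       # 0.0, 0.1, ..., 1.0
    total_rel = len(relevant)
    points = []                                      # (recall, precision) after each relevant hit
    hits = 0
    for rank, doc in enumerate(ranked_docnos, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total_rel, hits / rank))
    curve = []
    for r in levels:
        precisions = [p for rec, p in points if rec >= r]
        curve.append((r, max(precisions) if precisions else 0.0))
    return curve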
    <Paragraph position="8"> The curves in Figure 4 show that system A has a much higher precision at the low recall end of the graph and therefore is more accurate. System B however has higher precision at the high recall end of the curve and therefore will give a more complete set of relevant documents, assuming that the user is willing to look further in the ranked list.</Paragraph>
    <Paragraph position="9"> The single-valued evaluation measure most frequently used in TREC is the mean (non-interpolated) average precision. The average precision for a single topic is the mean of the precision obtained after each relevant document is retrieved (using zero as the precision for relevant documents that are not retrieved). The mean average precision for a run consisting of multiple topics is the mean of the average precision scores of each of the individual topics in the run. The average precision measure has a recall component in that it reflects the performance of a retrieval run across all relevant documents, and a precision component in that it weights documents retrieved earlier more heavily than documents retrieved later. Geometrically, mean average precision is the area underneath a non-interpolated recall-precision curve.</Paragraph>
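A minimal sketch of the measure as defined above follows; the run and relevance-judgment structures are illustrative and are not the trec_eval input formats.

# Non-interpolated average precision: the mean of the precision values observed
# at each relevant document's rank, with zero contributed by relevant documents
# that are never retrieved (handled by dividing by the total number of relevant docs).
def average_precision(ranked_docnos, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docnos, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """run: {topic: ranked docnos}; qrels: {topic: set of relevant docnos}."""
    scores = [average_precision(run.get(t, []), rel) for t, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0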
  </Section>
  <Section position="6" start_page="249" end_page="257" type="metho">
    <SectionTitle>
5 RETRIEVAL RESULTS
</SectionTitle>
    <Paragraph position="0"> One of the important goals of the TREC conferences is that the participating groups freely devise their own experiments within the TREC task(s). For some groups, particularly the groups new to TREC, this means doing the ad hoc and/or routing task with the goal of achieving high retrieval effectiveness performance. Other groups use TREC as an opportunity to run experiments especially tuned to their own environment, either taking part in the organized tracks or performing associated tasks that can be evaluated easily within the TREC framework. The experimental work performed for TRECs 5, 6, and 7 is therefore both too broad and too extensive to be summarized within this paper. What is presented is some analysis of the trends within the ad hoc and routing tasks, plus a summary of the various tracks that have been run in these three TRECs. In all cases, readers are referred to the full TREC proceedings for papers from the various groups that give more details of their experiments.</Paragraph>
    <Section position="1" start_page="249" end_page="255" type="sub_section">
      <SectionTitle>
5.1 The Ad Hoc Results
</SectionTitle>
      <Paragraph position="0"> The basic TREC ad hoc paradigm has presented three major challenges to search engine technology from the beginning. The first is the vast scale-up in terms of number of documents to be searched, from several megabytes of documents to 2 gigabytes of documents. This system engineering problem occupied most systems in TREC-1, and has continued to be the initial work for most new groups entering TREC.</Paragraph>
      <Paragraph position="1"> The second challenge is that these documents are mostly full-text and therefore much longer than most algorithms in TREC-1 were designed to handle. The document length issue has resulted in major changes to the basic term weighting algorithms, starting in TREC-2. The third challenge has been the idea that a test question or topic contains multiple fields, each representing either facets of a user's question or the various lengths of text that question could be represented in. The particular fields, and the lengths of these fields, have changed across the various TRECs, resulting in different research issues as the basic environment has changed.</Paragraph>
      <Paragraph position="2"> Because TREC-1 required significant system rebuilding by most participating groups due to the huge increase in the size of the document collection, the TREC-1 results should be viewed as only very preliminary, reflecting severe time constraints. TREC-2 occurred in August of 1993, less than 10 months after the first conference, and the TREC-2 results can be seen both as a validation of the earlier experiments on the smaller test collections and as an excellent baseline for the more complex experimentation that has taken place in later TRECs.</Paragraph>
      <Paragraph position="3"> Table 5.1 summarizes the ad hoc task across the 6 TRECs that have occurred since 1992. It illustrates some of the common issues that have affected all groups, and also shows the initial use and subsequent spread of some of the now-standard techniques that have emerged from TREC.</Paragraph>
      <Paragraph position="4"> Five different research areas are shown in the table, with research in many of these areas triggered by changes in the TREC evaluation environment. For example, the use of subdocuments or passages was caused by the initial difficulties in handling full text documents, particularly excessively long ones. The use of better term weighting, including correct length normalization procedures, made this technique less used in TRECs 4 and 5, but it resurfaced in TREC-6 to facilitate better input to relevance feedback.</Paragraph>
      <Paragraph position="5"> The first research area shown in the table is that of term weighting. Most of the initial participants in TREC used term weighting that had been developed and tested on very small test collections with short documents (abstracts). Many of these algorithms were modified to handle longer documents in simple ways; however, some algorithms were not amenable to this approach, resulting in some new fundamental research. The group from the Okapi system, City University, London (Robertson, Walker, Hancock-Beaulieu, &amp; Gatford, 1994) decided to experiment with a completely new term weighting algorithm that was both theoretically and practically based on term distribution within longer documents. By TREC-3 this algorithm had been &amp;quot;perfected&amp;quot; into the BM25 algorithm now in use by many of the systems in TRECs 5, 6 and 7. Continuing along this same row in Table 5.1, three other systems (the SMART system from Cornell (Singhal, Buckley, &amp; Mitra, 1996), the PIRCS system from CUNY (Kwok, 1996), and the INQUERY system from the University of Massachusetts (Allan, Ballesteros, Callan, Croft, &amp; Lu, 1996)) changed their weighting algorithms in TREC-4 based on analysis comparing their old algorithms to the new BM25 algorithm. By TREC-5 many of the groups had adopted these new weighting algorithms, with the early adopters being those systems with similar structural models.</Paragraph>
      <Paragraph position="6"> TREC-6 saw even further expansion of the use of these new weighting algorithms (alternatively called the Okapi/SMART algorithm, or the Cornell implementation of the Okapi algorithm). In particular, many groups adapted these algorithms to new models, often involving considerable experimentation to find the correct fit. For example, IRIT (Boughanem &amp; Soulé-Dupuy, 1998) modified the Okapi algorithm to fit a spreading activation model, IBM (Brown &amp; Chong, 1998) modified it to deal with unigrams and trigrams, and the Australian National University (Hawking, Thistlewaite, &amp; Craswell, 1998) and the University of Waterloo (Cormack, Clarke, Palmer, &amp; To, 1998) used it in conjunction with various types of proximity measures. Of major note is the fact that City University also ran major experiments (Walker, Robertson, Boughanem, Jones, &amp; Sparck Jones, 1998) with the BM25 weighting algorithm in TREC-6, including extensive exploration of the various existing parameters, and addition of some new ones involving the use of non-relevant documents! It could be expected that 6 years of term weighting experiments would lead to a convergence of the algorithms. However, a snapshot of the top 8 systems in TREC-7 (see Table 5.1) shows that these systems are derived from many models and use different term weighting algorithms and similarity measures. Of particular note here is that new models and term weighting algorithms are still being developed, and that these are competitive with the more established methods. This applies both to new variations on old weighting algorithms, such as the double log tf weighting from AT&amp;T (Singhal, Choi, Hindle, Lewis, &amp; Pereira, 1999), and to more major variations such as the new weighting algorithm from TNO (Hiemstra &amp; Kraaij, 1999), and the completely new retrieval model from BBN (Miller, Leek, &amp; Schwartz, 1999).</Paragraph>
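For reference, a textbook-style sketch of a BM25-type weight is shown below. It is not the exact City University implementation, and the k1 and b values are the commonly quoted defaults rather than TREC-tuned settings.

# Textbook-style rendering of an Okapi BM25 weight: an idf component combined
# with a saturating, length-normalized term frequency component.
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, num_docs,
               k1=1.2, b=0.75):
    """doc_tf: {term: frequency in this document}; df: {term: document frequency}."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score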
      <Paragraph position="7"> The second new technique, started back in TREC-2 (the second line of Table 5.1), was the use of smaller sections of documents, called subdocuments, by the PIRCS system at the City University of New York (Kwok &amp; Grunfeld, 1994). Again this issue was forced by the difficulty of using the PIRCS spreading activation model for documents having a wide variety of lengths.</Paragraph>
      <Paragraph position="8"> By TREC-3 many of the groups were also using subdocuments, or passages, to help with retrieval. But, as mentioned before, TRECs 4 and 5 saw far less use of this technique as many groups dropped the use of passages due to minimal added improvements in performance. TREC-6 saw a revival in the use of passages, but generally only for specific uses. Whereas the PIRCS system continued to use 550-word subdocuments for all its processing, most systems used passages only in the topic expansion phase. The Australian National University (Hawking et al., 1998) worked with &amp;quot;hot spots&amp;quot; of 500 characters surrounding the original topic terms to locate new expansion terms. AT&amp;T (Singhal, 1998) used overlapping windows of 50 words to help rerank the top 50 documents before selecting the final documents for use in expansion. The University of Waterloo (Cormack et al., 1998) used passages of maximum length 64 words to select expansion terms, whereas Verity (Pedersen, Silverstein, &amp; Vogt, 1998) used their automatic summarizer for this purpose. Two groups (Lexis-Nexis (Lu, Meier, Rao, Miller, &amp; Pliske, 1998) and MDS (Fuller et al., 1998)) performed major experiments in the use of passages, particularly when employed in conjunction with other methods as input to data fusion. This diverse use of passages continued in TREC-7, with passages clearly becoming one of the standard tools for experimentation.</Paragraph>
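A simple way to produce such passages is to slide a fixed-length word window over the document, as sketched below. The 50-word size echoes the overlapping windows mentioned above, while the half-window step is an illustrative choice, not a value reported by any group.

# Sketch of splitting a document into overlapping fixed-length word windows
# for passage-level evidence; size and step are illustrative parameters.
def passage_windows(tokens, size=50, step=25):
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(0, len(tokens) - step, step)]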
      <Paragraph position="9"> The query expansion/modification techniques shown in the third and fourth lines of Table 5.1 were started when the topics were substantially shortened in TREC-3. As described in section 3.2, the format of the topics was modified to remove a valuable source of keywords: the concept section. In the search for some technique that would automatically expand the topic, several groups revived an old technique of assuming that the top retrieved documents are relevant, and then using them in relevance feedback.</Paragraph>
      <Paragraph position="10"> This technique, which had not worked on smaller collections, turned out to work very well in the TREC environment.</Paragraph>
      <Paragraph position="11"> By TREC-6 almost all groups were using variations on expanding queries using information from the top retrieved documents (often called pseudo-relevance feedback). There are many parameters needed for success here, such as how many top documents to use for mining terms, how many terms to select, and how to weight those terms. There has been general convergence on some of these parameters. Table 5.1 shows the characteristics of the expansion tools used in the top 8 systems in TREC-7. The second column gives the basic expansion model, with the vector-based systems using the Rocchio expansion and other systems using expansion models more suitable to their retrieval model. For example, the Local Context Analysis (LCA) method developed by the INQUERY group (Xu &amp; Croft, 1996) has been successfully used by other groups. The third column shows the number of top-ranked documents (P if passages were used), and the number of terms added from these documents. It should be noted that these numbers are more similar than in earlier TRECs, although they are still being investigated by new systems adopting these techniques as there can be subtle differences between systems that strongly influence parameter selection. The fourth column shows the source of the documents being mined for terms, which has generally moved to the use of as much information as possible, i.e. all the TREC disks as opposed to only those being used for testing purposes.</Paragraph>
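The following is a minimal Rocchio-style pseudo-relevance feedback sketch of the kind described above: the top-ranked documents are assumed relevant and their strongest terms are added to the query. The vector representation and all parameter values are illustrative, since, as noted, groups tune these quite differently.

# Minimal pseudo-relevance feedback sketch (Rocchio-style, without a
# non-relevant component): reweight the original query and add the
# highest-weighted new terms from the top-ranked documents.
from collections import Counter

def expand_query(query_weights, ranked_doc_vectors, num_docs=10, num_terms=20,
                 alpha=1.0, beta=0.5):
    """query_weights and each doc vector: {term: weight}. Returns an expanded query."""
    feedback_docs = ranked_doc_vectors[:num_docs]
    centroid = Counter()
    for vec in feedback_docs:
        for term, w in vec.items():
            centroid[term] += w / len(feedback_docs)
    # Reweight the original query terms, then append the strongest new terms.
    expanded = {t: alpha * w + beta * centroid.get(t, 0.0)
                for t, w in query_weights.items()}
    new_terms = [t for t, _ in centroid.most_common() if t not in query_weights]
    for term in new_terms[:num_terms]:
        expanded[term] = beta * centroid[term]
    return expanded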
      <Paragraph position="12"> TRECs 5, 6, and 7 saw many additional experiments in the query expansion area. The Open Text Corporation (Fitzpatrick &amp; Dent, 1997) gathered terms for expansion by looking at relevant documents from past topics that were loosely similar to the TREC-5 topics. Several groups (Lu, Ayoub, &amp; Dong, 1997; Namba, Igata, Horai, Nitta, &amp; Matsui, 1999) have tried clustering the top retrieved documents in order to more accurately select expansion terms, and in TREC-6 three groups (City University, AT&amp;T, and IRIT) successfully got information from negative feedback, i.e. using non-relevant documents to modify the expansion process.</Paragraph>
      <Paragraph position="13"> TREC-7 contained even more experiments in automatic query expansion, such as the group (Mandala, Tokunaga, Tanaka, Okumura, &amp; Satoh, 1999) that compared the use of three different thesauri for expansion (WordNet, a simple co-occurrence thesaurus, and an automatically built thesaurus using predicate-argument structures). Of particular note is the AT&amp;T (Singhal et al., 1999) investigation into &amp;quot;conservative enrichment&amp;quot; to avoid the additional noise caused by using larger corpora (all five disks) for query expansion.</Paragraph>
      <Paragraph position="14"> Groups that build their queries manually also looked into better query expansion techniques starting in TREC-3 (see the fourth line of Table 5.1). At first these expansions involved using other sources to manually expand the initial query. However, the rules governing manual query building changed in TREC-5 to allow unrestricted interactions with the systems.</Paragraph>
      <Paragraph position="15"> This change caused a major evolution in the manual query expansion, with most systems not only manually expanding the initial queries, but then looking at retrieved documents in order to further expand the queries, much in the manner that users of these systems could operate. Two types of experiments were notable in TREC-5: those that could be labelled as &amp;quot;manual exploration&amp;quot; runs and those that involved a more complex type of human-machine interaction.</Paragraph>
      <Paragraph position="16"> The first type is exemplified by the GE group (Strzalkowski et al., 1997), where the task was to ask users to pick out phrases and sentences from the retrieved documents to add to the query, in hopes that this process could be imitated by automatic methods. The CLARITECH group (Milic-Frayling, Evans, Tong, &amp; Zhai, 1997) is a good example of the second type of manual TREC-5 runs. They examined a multi-stage process of query construction, where the goal was to investigate better sets of tools that allow users to improve their queries, including different sources for suggestions of expansion terms and also various levels of user-added constraints to the expansion process.</Paragraph>
      <Paragraph position="17"> Many of the manual experiments seen in both TREC-6 and TREC-7, however, hark back to the simpler scenario of having users edit the automatically-generated query, or having users select documents to be used in automatic relevance feedback. Several of the groups had specific user strategies that they tested.</Paragraph>
      <Paragraph position="18"> GE Corporate R&amp;D/Rutgers University (Strzalkowski, Lin, &amp; Perez-Carballo, 1998) used automatically-generated summaries of the top 30 documents retrieved as sources of manually-selected terms and phrases.</Paragraph>
      <Paragraph position="19"> CLARITECH Corp. (Evans, Huettner, Tong,  Jansen, &amp; Bennett, 1999) performed a user experiment measuring the difference in performance between two presentation modes: a ranked list vs a clustered set of documents.</Paragraph>
      <Paragraph position="20"> University of Toronto (Bodner &amp; Chignell, 1999) used their dynamic hypertext model to build the queries.</Paragraph>
      <Paragraph position="21"> One group used manual relevance feedback as opposed to automatic feedback from the top 20 documents.</Paragraph>
      <Paragraph position="22"> The final line in Table 5.1 shows some of the other areas that have seen concentrated research in the ad hoc task. Data fusion has been used in TREC by many groups in various ways, but has increased in complexity over the years. For example, a project involving four teams led by Tomek Strzalkowski has continued the investigation of merging results from multiple streams of input using different indexing methods (Strzalkowski et al., 1997, 1998, 1999).</Paragraph>
      <Paragraph position="23"> In TREC-6, several groups such as Lexis-Nexis (Lu et al., 1998) and MDS (Fuller et al., 1998) used multiple stages of data fusion, including merging results from different term weighting schemes, various mixtures of documents and passages, and different query expansion schemes.</Paragraph>
      <Paragraph position="24"> The INQUERY system from the University of Massachusetts has worked in all TRECs to automatically build more structure into their queries, based on information they have &amp;quot;mined&amp;quot; from the topics (Brown, 1995). Starting in TREC-5, there have been experiments by other groups to use more information from the initial topic. Lexis-Nexis (Lu et al., 1997) used the inter-term distance between nouns in the topic. Several other groups have made use of term proximity features (the Australian National University and the University of Waterloo among them) or look for clues that would suggest a need for more emphasis on certain topic terms. TREC-7 had two additional groups working with the use of term co-occurrence and proximity as alternative methods for ranking (see (Braschler, Wechsler, Mateev, Mittendorf, &amp; Schäuble, 1999) and (Nakajima, Takaki, Hirao, &amp; Kitauchi, 1999)).</Paragraph>
      <Paragraph position="25"> A final theme that has continued throughout all the TREC conferences has been the investigation of the use of phrases in addition to single terms. This has long been a topic for research in the information retrieval community, with generally unsuccessful results. However, there was initially hope that the use of phrases in these much larger collections would become critical, and almost all groups have experimented with phrases. In general these experiments have been equally unsuccessful.</Paragraph>
      <Paragraph position="26"> The fourth column of Table 5.1 shows the widespread use of phrases in addition to single terms in TREC-7, but the minimal improvement from their use. The biggest improvement reported in the papers was 3.6% from the INQUERY group at the University of Massachusetts (Allan et al., 1999). Whereas most of the other groups are also using phrases, many did not bother to test for differences due to minimal results in earlier years. Cornell/SabIR reported a 7.7% improvement in TREC-6, but this is the improvement on top of the initial baseline, not the improvement after expansion. Private conversations with several of these groups indicate that these improvements are likely to be much less if measured after expansion. As is often the case, these minimal changes in the averages cover a wide variation in phrase performance across topics. A special run by the Okapi group (many thanks) showed less than a 1% average difference in performance, but 19 topics were helped by phrases, 14 were hurt, and the rest unchanged. Whereas the benefit of phrases is not proven, they are likely to remain a permanent tool in the retrieval systems in a manner similar to the earlier adoption of stemming.</Paragraph>
      <Paragraph position="27"> It is interesting to note that many of these groups are using different phrase &amp;quot;gathering&amp;quot; techniques. The Okapi group has a manually-built phrase list with synonym classes that has slowly grown over the years based on mostly past TREC topics. The automatically-produced INQUERY phrase list was new for TREC-6 (Allan et al., 1998), the Cornell list was basically unchanged from early TRECs, and the BBN list was based on a new bigram model.</Paragraph>
      <Paragraph position="28"> The creation of two formal topic lengths in TREC-5 has inspired many experiments comparing results using those different topic lengths, and the addition of a formal &amp;quot;title&amp;quot; in TREC-6 increased these investigations. Table 5.1 shows the results (official and unofficial as reported in the papers) of the top 8 TREC-7 groups showing their use of different topic parts. The second column gives the various topic parts used by each group (T = title, D = description, N = narrative). The third column gives the average precision using only the description and title. The fourth and fifth columns give the corresponding performance of the systems using either only the title or using the full topic (all topic parts).</Paragraph>
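A hypothetical helper illustrating the three query lengths being compared is sketched below. It assumes a topic statement marked up with the title, desc, and narr tags shown in the sample topics earlier; the parsing details are an assumption for illustration.

# Hypothetical helper building title-only, title+description, and full-topic
# queries from a TREC-style topic statement.
import re

FIELDS = ("title", "desc", "narr")

def parse_topic(text):
    parts = {}
    for field in FIELDS:
        match = re.search(r"<%s>(.*?)(?=<\w+>|$)" % field, text, re.DOTALL)
        parts[field] = match.group(1).strip() if match else ""
    return parts

def query_variants(topic_text):
    t = parse_topic(topic_text)
    return {
        "title only": t["title"],
        "title + description": " ".join([t["title"], t["desc"]]),
        "full topic": " ".join([t["title"], t["desc"], t["narr"]]),
    }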
      <Paragraph position="29"> Note that most of the best runs use the full topic.</Paragraph>
      <Paragraph position="30"> However, there is now a smaller performance difference between runs that use the full topic and runs that use only the title and description sections than was seen in earlier TRECs. This is most likely due to improved query expansion methods, but could be due to variations across topic sets. It should be noted that the improvement going to the full topic is only 1% for several groups. The decrease in performance using only the title is more marked, ranging from 4%</Paragraph>
      <Paragraph position="31"> to 22%. The TREC-7 title results should be a truer measure of the effects of using the title only than TREC-6, where the descriptions were often missing key terms. However, it is not clear how representative these titles are with respect to very short user inputs and therefore title results should best be viewed as how well these systems could perform on very short, but very good user input.</Paragraph>
      <Paragraph position="32"> Looking at individual topic results shows a less consistent picture. Table 5.1 shows the number of topics that had the best performance from among a group's three runs using different input lengths. Not only is there a wide variation across topics, there is also a wide variation across systems in that topics that work best at a particular length for one group did not necessarily work best at that length for the other groups.</Paragraph>
    </Section>
    <Section position="2" start_page="255" end_page="257" type="sub_section">
      <SectionTitle>
5.2 The Routing Results
</SectionTitle>
      <Paragraph position="0"> The routing evaluation used a specifically selected subset of the training topics against a new set of test documents, but there have always been difficulties in locating appropriate testing data for the routing task.</Paragraph>
      <Paragraph position="1"> TREC-3 was forced to re-use some of the training data, and TREC-4 performed routing tests using the Federal Register (with new data) for 25 of the topics, and using training data and &amp;quot;net trash&amp;quot; for testing the other 25 topics. This situation was clearly not ideal and for TREC-5 NIST held back decisions on the routing topics until a new data source could be found.</Paragraph>
      <Paragraph position="2"> When the FBIS data became available, it was decided to pick topics that had many relevant documents in the Associated Press data, on the assumption that the FBIS data would be similar to AP. Because of delays in getting and processing the data, this assumption could not be checked out, and problems arose that will be discussed later.</Paragraph>
      <Paragraph position="3"> It should be noted that the routing task in TREC has always served two purposes. The first is its intended purpose: to test systems in their abilities to use training data to build effective filters or profiles. The second purpose, which has become equally important in the more recent TRECs, is to serve as a learning environment for more effective retrieval techniques in general. Groups use the relevance judgments to explore the characteristics of relevant documents, such as which features are most effective to use for retrieval or how to best merge results from multiple queries. This is more profitable than simply using the previous TREC results in a retrospective manner because of the use of completely new testing data for evaluation.</Paragraph>
      <Paragraph position="4"> A focus on using the training data as a learning environment was particularly prevalent in TREC-5.</Paragraph>
      <Paragraph position="5"> Cornell (Buckley, Singhal, &amp; Mitra, 1997) used the relevant and non-relevant documents for investigations of Rocchio feedback algorithms, including more complex processes of expansion and weighting. The University of Waterloo (Clarke &amp; Cormack, 1997) interactively searched the training data for co-occurring substrings and GE (Strzalkowski et al., 1997) ran major experiments in data fusion to test their new stream-based architecture. In each of these cases the experiments are assumed to lead to better ways of doing the routing task, and also to new approaches for the ad hoc task.</Paragraph>
      <Paragraph position="6"> Three experimental themes dominate most routing experiments in TREC-5. The first is the discovery of optimal features (usually single terms) for use in the query or filter. The Okapi system from City University, London (Beaulieu et al., 1997) continued its experiments in repeatedly trying various combinations of terms to discover the optimal set, but for TREC-5 used subsets of the training data. The University of California at Berkeley (Gey, Chen, He, Xu, &amp; Meggs, 1997) concentrated on further investigations of the use of the chi-square discrimination measure to locate large numbers of good terms, and the Swiss Federal Institute of Technology (ETH) (Ballerini et al., 1997) tried three different feature selection methods, including the chi-square method, the RSV (Okapi) method, and a new method, the U measure. Xerox (Hull et al., 1997) also investigated a new feature selection method, the binomial likelihood ratio test.</Paragraph>
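As an illustration of this style of feature selection, the sketch below scores terms with the usual 2x2 chi-square statistic over the judged training documents. It shows the general technique rather than the exact variant used by any particular group, and the data structures are assumptions.

# Chi-square term selection over routing training data: score each term by how
# strongly its presence is associated with relevance, using the 2x2 contingency form.
def chi_square(term, rel_docs, nonrel_docs):
    """rel_docs / nonrel_docs: lists of term sets, one set per judged document."""
    a = sum(1 for d in rel_docs if term in d)        # relevant, contains term
    b = sum(1 for d in nonrel_docs if term in d)     # non-relevant, contains term
    c = len(rel_docs) - a                            # relevant, lacks term
    d = len(nonrel_docs) - b                         # non-relevant, lacks term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(vocabulary, rel_docs, nonrel_docs, k=50):
    scored = sorted(vocabulary,
                    key=lambda t: chi_square(t, rel_docs, nonrel_docs),
                    reverse=True)
    return scored[:k]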
      <Paragraph position="7"> The second theme was the use of co-occurring term pairs in the training data to &amp;quot;expand&amp;quot; the query. Four groups experimented with locating and incorporating co-occurring pairs of terms, including the INQUERY group from the University of Massachusetts in both TREC-4 and TREC-5 (Allan et al., 1996, 1997), and Cornell University in TREC-5 (Buckley et al., 1997). As mentioned before, Waterloo interactively looked for word-pairs or co-occurring strings to manually add to their query. ETH used the OKAPI RSV values to formally motivate a series of experiments using co-occurring terms within different portions of the document (within sentence, within paragraph, etc.) as different methods of constructing queries. These multiple representations of the query were then linearly combined, with the parameters for that combination discovered using logistic regression on the training data.</Paragraph>
      <Paragraph position="8"> The third theme in the routing experiments was the continuing effort to use only subsets of the training data. The number of judged documents per topic is on the order of 2000 or more, and this can be computationally difficult for complex techniques. Efficiency has motivated CUNY experiments (the PIRCS system) since TREC-3, where they tried using only the &amp;quot;short&amp;quot; documents for training. In TREC-5 this group (Kwok &amp; Grunfeld, 1997) used genetic algorithms to select the optimal set of training documents. Cornell (in TREC-5) used a new &amp;quot;query zone&amp;quot; technique to subset the training documents so that not all non-relevant documents were used for training. The goal was not just improved efficiency, but also improved effectiveness in that training was more concentrated on documents that the Cornell system was likely to retrieve.</Paragraph>
      <Paragraph position="9"> There is another issue that suggests the use of subsets: the problem of overfitting the queries/methods to the training data. This was specifically emphasized in the City system, where they used different subsets of the training data for locating features, and used combinations of runs for their final results. Xerox used subsets to reduce overfitting, with their subsets based on finding documents within a &amp;quot;local zone&amp;quot; to the query (a predecessor to the query zoning technique used by Cornell). The Xerox paper provides more discussion of the overfitting problem and suggests some additional techniques to avoid it.</Paragraph>
      <Paragraph position="10"> As in the ad hoc task, there is a heavy adoption rate across groups for successful techniques. For the ad hoc task these techniques revolve around better ways of handling the initial topic, or use of the top X documents for relevance feedback. Because of the existence of training data in routing, the routing experiments have generally not used the topic itself heavily, but constructed queries mainly based on the training data. The success of these techniques therefore revolves around how well the test data matches the training data, and also on how tuned the techniques are to the particular training data.</Paragraph>
      <Paragraph position="12"> The TREC-5 routing task used Associated Press documents as training data and FBIS material for test data. Whereas the types of documents are similar, the domains of the documents did not always match. For some topics there was a good match of training and test data, but for others the match was very poor, and very few relevant documents were found for those topics. Four topics had zero relevant documents in the test set, and an additional six topics had only one or two relevant documents. Additionally there was a serious mismatch on the number of relevant documents for a topic in the training data and in the test data. Even after dropping the four topics with no relevant documents from the evaluation, the results are still heavily affected by the mismatch. The overall results for TREC-5 were not better than for TREC-4 (or TREC-3).</Paragraph>
      <Paragraph position="13"> In TREC-6 an attempt was made to have a close match between the training and test data. Since the TREC-5 routing task had used a document stream from the Foreign Broadcast Information Service (FBIS) as its test set, a new stream of FBIS documents was selected as the TREC-6 test set. The TREC-6 routing topics consisted of 38 topics used in TREC-5 that had at least 5 relevant documents in the original FBIS stream, plus nine new topics (that had minimal training data on the original FBIS stream).</Paragraph>
      <Paragraph position="14"> The histogram in Figure 7 shows that the training and test data do have similar numbers of relevant documents for most topics.</Paragraph>
      <Paragraph position="15"> The following describes the experiments that were run by the 8 top-performing systems in the TREC-6 routing task.</Paragraph>
      <Paragraph position="16">  * AT&amp;T Labs Research (Singhal, 1998) added the machine learning technique of boosting to the query refinement phase of the Cornell TREC-5 routing algorithm (which includes the use of word pairs, DFO optimization, and query zones).</Paragraph>
      <Paragraph position="17"> * City University, London (Walker et al., 1998) explored iterative methods of term weighting with the goal of avoiding overfitting.</Paragraph>
      <Paragraph position="18"> * Cornell/SaBIR Research (Buckley, Mitra, Walz, &amp; Cardie, 1998) also used a variant of the basic Cornell TREC-5 routing approach, adding SuperConcepts to the routing query.</Paragraph>
      <Paragraph position="19"> * Queens College, CUNY (Kwok, Grunfeld, &amp; Xu, 1998) combined results from five separate component runs; this combined result is superior to each of the individual components.</Paragraph>
      <Paragraph position="20"> * University of Waterloo (Cormack et al., 1998) interactively refined a set of Boolean queries into a single tiered Boolean query for each topic.</Paragraph>
      <Paragraph position="21"> * Claritech Corporation (Milic-Frayling, Zhai, Tong, Jansen, &amp; Evans, 1998) explored the benefits of using different term selection methods in different parts of the query refinement process.</Paragraph>
      <Paragraph position="22"> For this run they developed different queries using different term selection strategies and then, for each topic, selected the query that performed the best on the training data.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="257" end_page="258" type="metho">
    <SectionTitle>
* MSI/IRIT/SIG/CERISS (Boughanem &amp; Soulé-
</SectionTitle>
    <Paragraph position="0"> Dupuy, 1998) continued their work with a spreading activation model by expanding queries with the top 30 terms from relevance backpropagation. * The ETH group (... &amp; Schäuble, 1998) also performed a combination run where one component run selected query words and phrases based on the U-measure. The best mean average precision for a routing run in TREC-6 was .420, a 9% improvement over TREC-5's best of .386. However, given that the TREC-6 task was designed to use a homogeneous data set whereas the TREC-5 test data were different from the training data, a greater improvement was expected. At this point, it is unclear why the difference was not greater. It is possible that while the numbers of relevant documents in the training and test set are comparable, the relevant documents in each set don't &amp;quot;look like&amp;quot; each other. However, this is unlikely since both sets of documents come from a common source. It is also possible that the mismatch between training and test sets is not as significant a factor as was thought.</Paragraph>
    <Paragraph position="1"> Another hypothesis suggested by (Singhal, 1998) is that the relevance judgments are less consistent for routing than they are for the ad hoc task, and that this inconsistency prevents the machine learning methods that are prevalent in the task from performing well. Since some routing topics have been used many times, and therefore have relevance judgments spanning many years, the judgments are likely to be less consistent than for the ad hoc task. It may be instructive to explore the stability of the routing techniques in the face of different relevance judgments, especially given that real user judgments are known to be extremely volatile (Schamber, 1994).</Paragraph>
    <Paragraph position="2"> Because of operational constraints on the overall TREC program, it was decided to pursue further investigations in routing within the very closely related filtering track. For this reason there was no routing task in TREC-7, but there was a routing option in the filtering track.</Paragraph>
  </Section>
</Paper>