<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1070">
  <Title>OVERVIEW OF THE SECOND TEXT RETRIEVAL CONFERENCE (TREC-2)</Title>
  <Section position="3" start_page="351" end_page="354" type="metho">
    <SectionTitle>
2. TREC-2 RESULTS
2.1 Introduction
</SectionTitle>
    <Paragraph position="0"> In general the TREC-2 results showed significant improvements over the TREC-1 results. Many of the original TREC-1 groups were able to &amp;quot;complete&amp;quot; their system rebuilding and tuning tasks. The results for TREC-2 therefore can be viewed as the &amp;quot;best first-pass&amp;quot; that most groups can accomplish on this large amount of data. The adhoc results in particular represent baseline results from the scaling-up of current algorithms to large test collections. The better systems produced similar results, results that are comparable to those seen using these algorithms on smaller test collections.</Paragraph>
    <Paragraph position="1"> The routing results showed even more improvement over TREC-1 routing results. Some of this improvement was due to the availability of large numbers of accurate relevance judgments for training (unlike TREC-1), but most of the improvements came from new research by participating groups into the best ways of using the training data. null All references in this section are papers in the TREC-2 proceedings (Harman 1994b).</Paragraph>
    <Section position="1" start_page="351" end_page="353" type="sub_section">
      <SectionTitle>
2.2 Adhoc Results
</SectionTitle>
      <Paragraph position="0"> The adhoc evaluation used new topics (101-150) against the two disks of training documcnts (disks 1 and 2).</Paragraph>
      <Paragraph position="1"> There were 44 sets of results for adhoc evaluation in TREC-2, with 32 of them based on runs for the full data set. Of these, 23 used automatic construction of qucrics, 9 used manual construction, and 2 used feedback.</Paragraph>
      <Paragraph position="2"> Figure 1 shows the recall/precision curves for the six  TREC-2 groups with the highest non-interpolated average precision using automatic construction of queries. The results marked &amp;quot;INQ001&amp;quot; are the INQUERY system from the University of Massachusetts (see Croft, Callan &amp; Broglio paper). This system uses probabilistic term weighting and a probabilistic inference net to combine various topic and document features. The results marked &amp;quot;dortQ2&amp;quot;, &amp;quot;Brkly3&amp;quot; and &amp;quot;cmlL2&amp;quot; are all based on the use of the Cornell SMART system, but with important variations. The &amp;quot;crnlL2&amp;quot; run is the basic SMART system from Comell University (see Buckley, Allan &amp; Salton paper), but using less than optimal term weightings (by mistake).</Paragraph>
      <Paragraph position="3"> The &amp;quot;dortQ2&amp;quot; results from the University of Dortmund come from using polynomial regression on the training data to find weights for various pre-set term features (see Fuhr, Pfeifer, Bremkamp, Pollmann &amp; Buckley paper).</Paragraph>
      <Paragraph position="4"> The &amp;quot;Brkly3&amp;quot; results from the University of California at Berkeley come from performing logistic regression analysis to learn optimal weighting for various term frequency measures (see Cooper, Chen&amp; Gey paper). The &amp;quot;CLARTA&amp;quot; system from the CLARIT Corporation expands each topic with noun phrases found in a thesaurus that is automatically generated for each topic (see Evans &amp; Lefferts paper). The &amp;quot;Isiasm&amp;quot; results are from Bellcore (see Dumais paper). This group uses latent semantic indexing to create much larger vectors than the more traditional vector-space models such as SMART. The run marked &amp;quot;lsiasm&amp;quot; represents only the base SMART pre-processing results, however. Due to processing errors the &amp;quot;improved&amp;quot; LSI run produced unexpectedly poor results.</Paragraph>
      <Paragraph position="5"> Figure 2 shows the recall/precision curve for the six TREC-2 groups with the highest non-interpolated average precision using manual construction of queries. It should be noted that varying amounts of manual intervention were used. The results marked &amp;quot;INQ002&amp;quot;, &amp;quot;siems2&amp;quot;, and &amp;quot;CLARTM&amp;quot; are automatically-generated queries with manual modifications. The &amp;quot;INQ002&amp;quot; results reflect various manual modifications made to the &amp;quot;INQ001&amp;quot; queries, with those modifications guided by strict rules. The &amp;quot;siems2&amp;quot; results from Siemens Corporate Research, Inc. (see Voorhees paper) are based on the use of the Comell SMART system, but with the topics manually modified (the &amp;quot;not&amp;quot; phases removed). These results were meant to be the base run for improvements using WordNet, but the improvements did not materialize. The &amp;quot;CLARTM&amp;quot; resuits represent manual weighting of the query terms, as opposed to the automatic weighting of the terms that was used in &amp;quot;CLARTA&amp;quot;. The results marked &amp;quot;Vtcms2&amp;quot;, &amp;quot;Cn-Qst2&amp;quot;, and &amp;quot;TOPIC2&amp;quot; are produced from queries constructed completely manually. The &amp;quot;Vtcms2&amp;quot; results are from Virginia Tech (see Fox &amp; Shaw paper) and show the effects of combining the results from SMART vector-space queries with the results from manually-constructed soft Boolean P-Norm type queries. The &amp;quot;CnQst2&amp;quot; results, from ConQuest Software (see Nelson paper), use a very large general-purpose semantic net to aid in constructing better queries from the topics, along with sophisticated morphological analysis of the topics. The results marked &amp;quot;TOPIC2&amp;quot; are from the TOPIC system by Verity Corp.</Paragraph>
      <Paragraph position="6"> (see Lehman &amp; Reid paper) and reflect the use of an expert system working off specially-constructed knowledge bases to improve performance.</Paragraph>
      <Paragraph position="7"> Several comments can be made with respect to these ad-hoc results. First, the better results (most of the automatic results and the three top manual results) are very similar and it is unlikely that there is any statistical differences between them. There is clearly no &amp;quot;best&amp;quot; method, and the fact that these systems have very different approaches to retrieval, including different term weighting schemes, different query construction methods, and different similarity match methods implies that there is much more to be learned about effective retrieval techniques. Additionally, whereas the averages for the systems may be similar, the systems do better on different topics and retrieve different subsets of the relevant documents.</Paragraph>
      <Paragraph position="8"> A second point that should be made is that the automatic query construction methods continue to perform as well as the manual construction methods. Two groups (the INQUERY system and the CLARIT system) did explicit comparision of manually-modified queries vs those that were not modified and concluded that manual modification provided no benefits. The three sets of results based on completely manually-generated queries had even poorer performance than the manually-modified queries. Note that this result is specific to the very rich TREC topics; it is not clear that this will hold for the short topics normally seen in other retrieval environments.</Paragraph>
      <Paragraph position="9"> As a final point, it should be noted that these adhoc results  represent significant improvements over the results from TREC-1. Figure 5 (after the routing results) shows a comparison of results for a typical system in TREC-1 and TREC-2. Some of this improvement is due to improved evaluation, but the difference between the curve marked &amp;quot;TREC-I&amp;quot; and the curve marked &amp;quot;TREC-2 looking at top 200 only&amp;quot; shows significant performance improvement.</Paragraph>
      <Paragraph position="10"> Whereas this improvement could represent a difference in topics (the TREC-1 curve is for topics 51-100 and the TREC-2 curves are for topics 101-150), the TREC-2 topics are generally felt to be more difficult and therefore this improvement is likely to be an understatement of the actual improvements.</Paragraph>
      <Paragraph position="11"> Very few groups worked with less than the full document collection. The system from New York University (see Strzalkowski &amp; Carballo paper) reflects a very intensive use of natural language processing (NLP) techniques, including a parse of the documents to help locate syntactic phrases, context-sensitive expansion of the queries, and other NLP improvements on statistical techniques. In interests of space this graph is not shown; please refer to the paper by this group in this proceedings.</Paragraph>
    </Section>
    <Section position="2" start_page="353" end_page="354" type="sub_section">
      <SectionTitle>
2.3 Routing Results
</SectionTitle>
      <Paragraph position="0"> The routing evaluation used a subset of the training topics (topics 51-100 were used) against the new disk of test documents (disk 3). There were 40 sets of results for routing evaluation, with 32 of them based on runs for the full data set. Of the 32 systems using the full data set, 23 used automatic construction of queries, and 9 used manual construction.</Paragraph>
      <Paragraph position="1"> Figure 3 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using automatic construction of the routing queries. Again three systems are based on the Cornell SMART system. The plot marked &amp;quot;crnlCl&amp;quot; is the actual SMART system, using the basic Rocchio relevance feed-back algorithms, and adding many terms (up to 500) from the relevant training documents to the terms in the topic.</Paragraph>
      <Paragraph position="2"> The &amp;quot;dortPl&amp;quot; results come from using a probabilistically-based relevance feedback instead of the vector-space algorithm, and adding only 20 terms from the relevant documents to each query. These two systems have the best routing results. The &amp;quot;Brkly5&amp;quot; system uses logistic regression on both the general frequency variables used in their adhoc approach and on the query-specific relevance data available for training with the routing topics. The results marked &amp;quot;cityr2&amp;quot; are from City University, London (see Robertson, Walker, Jones, Hancock-Beaulieu &amp; Gafford paper). This group automatically selected variable numbers of terms (10-25) from the training documents for each topic (the topics themselves were not used as term sources), and then used traditional probabilistic reweighting to weight these terms. The &amp;quot;INQ003&amp;quot; results also use probabilistic reweighting, but use the topic terms, expanded by 30 new terms per topic from the training documents. The results marked &amp;quot;lsir2&amp;quot; are more latent semantic indexing results from Belicore. This run was made by creating a filter of the singular-value decomposition vector sum or centroid of all relevant documents for a topic (and ignoring the topic itself).</Paragraph>
      <Paragraph position="3"> Figure: 4 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using manual construction of the routing queries. The results marked &amp;quot;INQ004&amp;quot; are from the INQRY system using an inferential combination of the &amp;quot;INQ003&amp;quot; queries and manually modified queries created from the topic. The &amp;quot;trw2&amp;quot; results represent an adaptation of the TRW Fast Data Finder pattern matching system to allow use of term weighting (see Mettler paper). The queries were manually constructed and the term weighting was learned from the training data. The &amp;quot;gecrdl&amp;quot; results from GE Research and Development Center (see Jacobs paper) also come from manually-constructed queries, but using a general-purpose lexicon and the training data to suggest input to the Boolean pattern matcher.</Paragraph>
      <Paragraph position="4"> The results marked &amp;quot;CLARTM&amp;quot; are similar to the &amp;quot;CLARTM&amp;quot; adhoc results except that the training documents were used as the source for thesaurus building, as opposed to using the top set of retrieved documents. The &amp;quot;rutcombx&amp;quot; results from Rutgers University (see Belkin, Kantor, Cool &amp; Quatrain paper) come from combining 5 sets of manually-generated Boolean queries to optimize performance for each topic. The results marked &amp;quot;TOPIC2&amp;quot; are from the TOPIC system and reflect the use of an expert system working off specially-constructed knowledge bases to improve performance.</Paragraph>
      <Paragraph position="5"> As was the case with the adhoc topics, the automatic query construction methods continue to perform as well as, or in this case, better than the manual construction methods. A comparision of the two INQRY runs illustrates this point and shows that all six results with manually-generated queries perform worse than the six runs with automatically-generated queries. The availability of the training data allows an automatic tuning of the queries that would be difficult to duplicate manually without extensive analysis.</Paragraph>
      <Paragraph position="6"> Unlike the adhoc results, there are two runs (&amp;quot;crnlCl&amp;quot; and &amp;quot;dortPl&amp;quot;) that are clearly better than the others, with a significant difference between the &amp;quot;cmlCl&amp;quot; results and the &amp;quot;dortPl&amp;quot; results and also significant differences between these results and the rest of the automatically-generated query results. In particular the use of so many terms (up to 500) for query expansion by the Cornell group was one of the most interesting findings in TREC-2 and represents a departure from past results (see Buckley, Allan, &amp;  Salton paper for more on this).</Paragraph>
      <Paragraph position="7"> As a final point, it should be noted that the routing results also represent significant improvements over the results from TREC-1. Figure 6 shows a comparison of results for a typical system in TREC-1 and TREC-2. Some of this improvement is due to improved evaluation, but the difference between the curve marked &amp;quot;TREC-I&amp;quot; and the curve marked &amp;quot;TREC-2 looking at top 200 only&amp;quot; shows significant performance improvement. There is more improvement for the routing results than for the adhoc resuits due to better training data (mostly non-existent for TREC-1) and to major efforts by many groups in new routing algorithm experiments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>