<?xml version="1.0" standalone="yes"?> <Paper uid="X93-1005"> <Title>DOCUMENT DETECTION OVERVIEW</Title> <Section position="6" start_page="11" end_page="14" type="concl"> <SectionTitle> 5. EVALUATION METRICS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 5.1 Recall/Precision Curves </SectionTitle> <Paragraph position="0"> Standard recall/precision figures were calculated for each TIPSTER and TREC system, and tables and graphs of the results were provided. Figure 2 shows typical recall/precision curves. The x axis plots fixed levels of recall, where

Recall = (number of relevant items retrieved) / (total number of relevant items in collection)

The y axis plots the average precision values at those recall levels, where

Precision = (number of relevant items retrieved) / (total number of items retrieved)

These curves represent averages over the 50 topics. The averaging method was developed many years ago [2] and is well accepted by the information retrieval community.</Paragraph> <Paragraph position="1"> It was therefore used unchanged for the TIPSTER evaluation. The curves show system performance across the full range of retrieval, i.e., at the early stage of retrieval, where the highly ranked documents give high accuracy or precision, and at the final stage of retrieval, where accuracy is usually low but retrieval is more complete. Note that the use of these curves assumes ranked output from a system. Systems that provide an unranked set of documents are known to be less effective and therefore were not tested in the TIPSTER/TREC programs.</Paragraph> <Paragraph position="2"> The curves in figure 2 show that system A has much higher precision at the low-recall end of the graph and is therefore more accurate. System B, however, has higher precision at the high-recall end of the curve and will therefore give a more complete set of relevant documents, assuming that the user is willing to look further down the ranked list.</Paragraph> </Section> <Section position="2" start_page="11" end_page="13" type="sub_section"> <SectionTitle> 5.2 Recall/Fallout Curves </SectionTitle> <Paragraph position="0"> A second set of curves was calculated using the recall/fallout measures, where recall is defined as before and fallout is defined as

Fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in collection)

Note that recall has the same definition as the probability of detection and that fallout has the same definition as the probability of false alarm, so the recall/fallout curves are also the ROC (Relative Operating Characteristic) curves used in signal processing. A sample set of curves corresponding to the recall/precision curves is shown in figure 3. These curves show the same order of performance as the recall/precision curves and are provided as an alternative way of viewing the results. The present version of the curves is experimental, as the curve creation is particularly sensitive to scaling (what range is used for calculating fallout). The high-precision performance does not show well in figure 3; the high-recall performance dominates the curves.</Paragraph> <Paragraph position="1"> Whereas the recall/precision curves show the retrieval system results as they might be seen by a user (since precision measures the accuracy of each retrieved document as it is retrieved), the recall/fallout curves emphasize the ability of these systems to screen out non-relevant material. In particular, the fallout measure shows the discrimination power of these systems on a large document collection. Since recall/precision measures give no indication of collection size, the recall and precision of a system on a 1,400-document collection could be the same as those of a system on a million-document collection, but the discrimination power required for the million-document collection would obviously be much greater. This may not have been a problem on the smaller collections, but the discrimination power of systems on TIPSTER-sized collections is very important.</Paragraph>
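<Paragraph position="2"> To make these definitions concrete, the following minimal sketch (Python; it is not part of the original TIPSTER evaluation software, and the function and variable names are illustrative) computes recall, precision, and fallout after each retrieved document of a single ranked list, which is the per-topic information from which the curves above are plotted; averaging such values over the 50 topics gives the reported curves.

# Sketch: recall, precision, and fallout down a ranked list for one topic.
# "ranked_docs" and "relevant" are hypothetical inputs, not TIPSTER data structures.

def recall_precision_fallout(ranked_docs, relevant, collection_size):
    """Yield (rank, recall, precision, fallout) after each retrieved document."""
    total_relevant = len(relevant)
    total_nonrelevant = collection_size - total_relevant
    relevant_retrieved = 0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            relevant_retrieved += 1
        nonrelevant_retrieved = rank - relevant_retrieved
        recall = relevant_retrieved / total_relevant
        precision = relevant_retrieved / rank
        fallout = nonrelevant_retrieved / total_nonrelevant
        yield rank, recall, precision, fallout

# Example: a ten-document ranking over a hypothetical 1000-document collection.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
rel = {"d3", "d9", "d2", "d5"}
for rank, r, p, f in recall_precision_fallout(ranking, rel, collection_size=1000):
    print(f"rank {rank:2d}  recall {r:.2f}  precision {p:.2f}  fallout {f:.4f}")

Note how the fallout values depend directly on the collection size, which is why the recall/fallout curves reflect discrimination power on large collections while recall/precision does not. </Paragraph>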
</Section> <Section position="3" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 5.3 Single-Value Evaluation Measures </SectionTitle> <Paragraph position="0"> In addition to these recall/precision and recall/fallout curves, three single-value measures were often used in TIPSTER. The first two measures are precision averages across the curves, and the third is the precision at a particular cutoff in the number of documents retrieved.</Paragraph> <Paragraph position="1"> One of the averages, the non-interpolated average precision, combines the average precision for each topic, with that topic average computed by taking the precision after every retrieved relevant document. The final average corresponds to the area under an ideal (non-interpolated) recall/precision curve.</Paragraph> <Paragraph position="2"> The second precision average (the 11-point precision average) averages across interpolated precision values (which makes it somewhat less accurate). It is calculated by averaging the precision at each of the 11 standard recall points on the curve (0.0, 0.1, ..., 1.0) for each topic. Often this average is stated as an improvement over some baseline average 11-point precision.</Paragraph> <Paragraph position="3"> The third measure is an average over topics of the precision after 100 documents have been retrieved for each topic. This measure is useful because it involves no interpolation and reflects a clearly comprehended retrieval point. It took on added importance in the TIPSTER environment because only the top 100 documents retrieved for each topic were actually assessed; for this reason it provides a guaranteed evaluation point for each system.</Paragraph>
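<Paragraph position="4"> As an informal illustration of these three measures, the sketch below computes them for a single topic (Python; names are illustrative, and the interpolation rule shown, taking the maximum precision at or above each standard recall level, is the usual convention rather than necessarily the exact TIPSTER implementation). Each per-topic value would then be averaged over the 50 topics to give the reported figure.

# Sketch of the three single-value measures for one topic.
# "ranked_docs" is a ranked list of document ids, "relevant" a set of relevant ids,
# and "recall_precision_points" a list of (recall, precision) pairs such as those
# produced by the sketch in Section 5.2.  All names are hypothetical.

def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision: mean of the precision values observed
    after each retrieved relevant document (unretrieved relevant documents count as 0)."""
    relevant_retrieved = 0
    precisions = []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            relevant_retrieved += 1
            precisions.append(relevant_retrieved / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def eleven_point_average(recall_precision_points):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0.  Interpolated
    precision at level r is taken as the maximum precision at any recall >= r."""
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        candidates = [p for r, p in recall_precision_points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)

def precision_at(ranked_docs, relevant, cutoff=100):
    """Precision after `cutoff` documents have been retrieved."""
    retrieved = ranked_docs[:cutoff]
    return sum(1 for doc in retrieved if doc in relevant) / cutoff
</Paragraph>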
</Section> <Section position="4" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 5.4 Problems with Evaluation </SectionTitle> <Paragraph position="0"> Since this was the first time that such a large collection of text had been used in evaluation, there were some problems in using the existing methods of evaluation. The major problem concerned a thresholding effect caused by an inability to evaluate ALL documents retrieved by a given system.</Paragraph> <Paragraph position="1"> For the TIPSTER 12-month evaluation and TREC-1, the groups were asked to send in only the top 200 documents retrieved by their systems. This artificial document cutoff is relatively low, and for most topics systems did not retrieve all the relevant documents within the cutoff. All documents retrieved beyond the 200 were considered non-relevant by default, and therefore the recall/precision curves became inaccurate after about 40% recall on average. The 18-month TIPSTER evaluation used a cutoff of 500 documents, and the TIPSTER 24-month evaluation and TREC-2 used the top 1000 documents. Figure 4 shows the difference in the curves produced by these evaluation thresholds, including a curve for no threshold (similar to the way evaluation has been done on the smaller collections).</Paragraph> <Paragraph position="2"> These curves show that the use of a 1000-document cutoff has mostly resolved the thresholding problem.</Paragraph> <Paragraph position="3"> Two more issues in evaluation have become important.</Paragraph> <Paragraph position="4"> The first issue involves the need for more statistical evaluation. As will be seen in the results, the recall/precision curves are often close, and there is a need to check whether there are truly any statistically significant differences between two systems' results or between two sets of results from the same system. This problem is currently under investigation in collaboration with statistical groups experienced in the evaluation of information retrieval systems.</Paragraph> <Paragraph position="5"> The second issue involves getting beyond the averages to better understand system performance. Because of the huge number of documents and the long topics, it is very difficult to perform failure analysis, or any other type of analysis of the results, to better understand the retrieval processes being tested. Without a better understanding of underlying system performance, it will be hard to consolidate research progress. Some preliminary analysis of per-topic performance was provided for the TIPSTER 24-month evaluation and TREC-2, and more attention will be given to this problem in the future.</Paragraph> </Section> </Section> </Paper>