<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1066">
<Title>A LAYERED APPROACH TO NLP-BASED INFORMATION RETRIEVAL</Title>
<Section position="7" start_page="401" end_page="402" type="evalu">
<SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> Evaluation has two primary goals in commercial work. First, is the software robust enough and accurate enough to satisfy paying customers? Second, is a proposed change or new feature an improvement or a step backward? Customers are more concerned with precision, because they do not like to see matches they cannot explain. Precision above about 80% eliminated the majority of customer complaints about accuracy.</Paragraph>
<Paragraph position="1"> Oddly enough, customers are quite willing to make excuses for bad system behavior, explaining away implausible matches, once they have been convinced of the system's basic accuracy. They rarely test recall, since it is rare either for them to know which pictures are available or to enter successive related queries and compare the match sets. Complaints about recall in the initial stages of system development came from suppliers, who wanted to ensure that their own pictures could be retrieved reliably. To test recall as well as precision in a controlled environment, in the early phase of development a test set of 1200 images was created and manually matched by a photo researcher against queries submitted by other photo researchers. The process was time-consuming and frustratingly imprecise: it was difficult to score, since matches can be partial, and it was hard to determine how much credit to assign for, say, a 70% match that seemed more like a 90% match to the human researcher. Precision tests on the live (500,000-image) PNI system were much easier to evaluate, since the system was more likely to have the images requested. For example, while a database containing no little girls in red shirts will offer up girls with any kind of shirt and anything red, a comprehensive database will bury those imperfect matches beneath the more highly ranked, more accurate matches. Ultimately, precision was tested on 50 queries on the full system; any bad match, or any partial match ranked above a more complete match, was counted as a miss, and only the top 20 images were rated. Recall was tested on a 50-image subset created by limiting such non-NLP criteria as image orientation and photographer. Precision was 89.6% and recall was 92%.</Paragraph>
<Paragraph position="2"> In addition, precision was tested by comparing query results for each new feature added (e.g., &quot;Does noun phrase syntax do us any good? What rankings work best?&quot;). It was also tested with series of related queries, to check, for example, whether penguins swimming retrieved the same images as swimming penguins. Recall was tested with further related queries and for each new feature, and, more formally, in comparison to keyword searches and to Excalibur's RetrievalWare. Major testing occurred when the database contained 30,000 images, and again at 150,000. At 150,000, one major result was that WordNet senses were rearranged into frequency order, based on the senses hand-tagged by captioners for the initial 150,000 images.</Paragraph>
<Paragraph position="3"> In one of our retrieval tests, the combination of noun phrase syntax and name recognition improved recall by 18% at a fixed precision point. While we have not yet attempted to test the two capabilities separately, it does appear that name recognition played a larger role in the improvement than did noun phrase syntax. This is in accord with previous literature on the contributions of noun phrase syntax (Lewis, 1992; Lewis and Croft, 1990).</Paragraph>
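As a concrete illustration of the scoring rule used in these precision tests, the following is a minimal sketch in Python. It is not code from the PNI system: the function name, the 'good'/'partial'/'bad' judgment labels, and the cutoff parameter are assumptions introduced here for exposition. It counts a result as a miss if it is a bad match, or if it is a partial match ranked above a more complete match, and scores only the top k results; Section 4.1 below uses a closely related &quot;crossing match&quot; variant of the same idea.

    # Illustrative sketch only -- not the actual PNI evaluation code.
    # labels: human judgments for one query's ranked results, in rank order,
    # each one of 'good' (complete match), 'partial', or 'bad'.
    def precision_at_k(labels, k=20):
        top = labels[:k]
        misses = 0
        for i, label in enumerate(top):
            if label == 'bad':
                misses += 1
            elif label == 'partial' and 'good' in top[i + 1:]:
                misses += 1  # a partial match ranked above a more complete match
        return (len(top) - misses) / len(top)

    # Example: a bad result at rank 7 among otherwise good results, with the top 10
    # scored, gives 9/10 = 90% precision.
    print(precision_at_k(['good'] * 6 + ['bad'] + ['good'] * 3, k=10))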
<Section position="1" start_page="401" end_page="402" type="sub_section">
<SectionTitle> 4.1 Does Manual Sense-Tagging Improve Precision? </SectionTitle>
<Paragraph position="0"> Preliminary experiments were performed on two subcorpora, one with WordNet senses manually tagged and the other completely untagged. The corpora are not strictly comparable: since the photos are different, the correct answers are different in each case. Nonetheless, since each corpus includes over 20,000 pictures, there should be enough data to provide interesting comparisons, even at this preliminary stage. Certain other measures have been taken to ensure that the test is as useful as possible within the constraints given; these are described below. Results are consistent with those shown in Voorhees (1994).</Paragraph>
<Paragraph position="1"> Only precision is measured here, since the principal effect of tagging is on precision: untagged irrelevant captions are likely to show up in the results, but lack of tagging will not cause correct matches to be missed. Only crossing matches are scored as bad. That is, if Match 7 is incorrect but Matches 8, 9, and 10 are correct, then the score is 90% precision. If, on the other hand, Match 7 is incorrect and Matches 8, 9, and 10 are also incorrect, there is no precision penalty, since we want and expect partial matches to follow the good matches.</Paragraph>
<Paragraph position="2"> Only the top ten matches are scored. There are three reasons for this. First, scoring hundreds or thousands of matches is impractical. Second, in actual usage, no one will care whether Match 322 is better than Match 321, whereas incongruities in the top ten will matter very much. Third, since the threshold is set at 50%, some of the matches are by definition only &quot;half right.&quot; Raising the threshold would increase perceived precision but provide less insight into system performance.</Paragraph>
<Paragraph position="3"> Eleven queries scored better in the sense-tagged corpus, while only two scored better in the untagged corpus. The remainder scored the same in both corpora. In terms of precision, the sense-tagged corpus scored 99% while the untagged corpus scored 89% (both figures are artificially inflated, but in parallel, since only crossing matches are scored as bad).</Paragraph>
</Section>
</Section>
</Paper>