<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1040"> <Title>APPENDIX: SAMPLE DATA DOCUMENT TEXT: *RECORD*</Title> <Section position="11" start_page="208" end_page="209" type="concl"> <SectionTitle> 7. SUMMARY OF RESULTS </SectionTitle> <Paragraph position="0"> The preliminary series of experiments with the CACM-3204 collection of computer science abstracts showed a consistent improvement in performance: the average precision increased from 32.8% to 37.1% (a 13% increase), while the normalized recall went from 74.3% to 84.5% (a 14% increase), in comparison with the statistics of the base system. This improvement is a combined effect of the new stemmer, compound terms, term selection in queries, and query expansion using filtered similarity relations. The choice of similarity relation filter has been found critical in improving retrieval precision through query expansion.</Paragraph> <Paragraph position="1"> We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.</Paragraph> <Paragraph position="2"> Slightly simplified here. The filter was most effective at σ = 0.57.</Paragraph> <Paragraph position="3"> It should also be pointed out that only about 1.5% of all similarity relations originally generated from CACM-3204 were found admissible after filtering, contributing only 1.2 expansions on average per query. It is quite evident that significantly larger corpora are required to produce more dramatic results. A detailed summary is given in Table 1 below.</Paragraph> <Paragraph position="4"> These results, while modest by IR standards, are significant for another reason as well. 
They were obtained without any manual intervention into the database or queries, and without using any other information about the database except for the text of the documents (i.e., not even the hand-generated keyword fields enclosed with most documents were used). Lewis and Croft (1990), and Croft et al. (1991) report results similar to ours, but they take advantage of Computer Reviews categories manually assigned to some documents. The purpose of this research is to explore the potential of automated NLP in dealing with large-scale IR problems, and not necessarily to obtain the best possible results on any particular data collection. One of our goals is to point to a feasible direction for integrating NLP into traditional IR (Strzalkowski and Vauthey, 1991; Grishman and Strzalkowski, 1991).</Paragraph> <Paragraph position="5"> 14 K.L. Kwok (private communication) has suggested that the low percentage of admissible relations might be similar to the phenomenon of 'tight clusters' which, while meaningful, are so few that their impact is small. 15 A sufficiently large text corpus is 20 million words or more. This has been partially confirmed by experiments performed at the University of Massachusetts (B. Croft, private communication).</Paragraph> </Section> </Paper>
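The reported gains and the filtered-expansion step can be illustrated with a minimal sketch. The 0.57 threshold and the precision/recall figures come from the text above; everything else here (the toy similarity table, term names, and the `expand_query` helper) is invented for illustration and is not the paper's implementation.

```python
# Sanity-check the reported relative improvements (figures from the paper).
base_prec, new_prec = 32.8, 37.1
base_rec, new_rec = 74.3, 84.5
assert round((new_prec - base_prec) / base_prec * 100) == 13  # "a 13% increase"
assert round((new_rec - base_rec) / base_rec * 100) == 14     # "a 14% increase"

# Hypothetical sketch of query expansion via filtered similarity relations.
# Only the 0.57 threshold mirrors the paper; the data below is made up.
SIM_THRESHOLD = 0.57  # filter value reported as most effective

# similarity relations: (term, related_term) -> similarity score
similarity = {
    ("retrieval", "search"): 0.81,
    ("retrieval", "storage"): 0.42,   # below threshold: filtered out
    ("compiler", "translator"): 0.66,
}

def expand_query(terms):
    """Add related terms whose similarity relation passes the filter."""
    expanded = list(terms)
    for (a, b), score in similarity.items():
        if score >= SIM_THRESHOLD and a in terms and b not in expanded:
            expanded.append(b)
    return expanded

print(expand_query(["retrieval", "compiler"]))
# -> ['retrieval', 'compiler', 'search', 'translator']
```

With an aggressive filter like this, most candidate relations are discarded, which matches the paper's observation that only about 1.5% of the generated relations survived, yielding few expansions per query.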