File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/93/h93-1069_abstr.xml

Size: 3,368 bytes

Last Modified: 2025-10-06 13:47:46

<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1069">
  <Title>Document retrieval and text retrieval</Title>
  <Section position="2" start_page="0" end_page="347" type="abstr">
    <SectionTitle>
2. Text retrieval
</SectionTitle>
    <Paragraph position="0"> However, a new situation has arisen with the availability of machine-readable full text. For text retrieval (TR), NLP that provides more sophisticated indexing may be needed, because more discrimination within large files of long texts is required, or may be desired, because more focusing is possible. This suggests the more NLP the better, but whether it should serve better-motivated simple indexing or more complex representation has yet to be determined.</Paragraph>
    <Paragraph position="1"> Given past experience, and the need for flexibility in the face of uncertainty, a sound approach appears to be to maintain overall simplicity but to allow for more complex indexing descriptors than single terms, derived through NLP and NL-flavoured, e.g. simple phrases or predications. These would simply be coordinated to form descriptions but, more importantly, would be statistically selected and weighted. To obtain the reduced descriptions still needed to emphasise important text content, text-locational or statistical information could be exploited. To support indexing and, more critically, searching, a terminological apparatus, again of a simple NL-oriented kind providing term substitutes or collocates and again statistically controlled, could be valuable. Searching should allow the substitution or relaxation of elements and relations in complex terms, again with weighting, especially via feedback. This whole approach would emphasise the NL of the texts while recognising the statistical properties of large files and long documents. The crux is thus to demonstrate that linguistically-constrained terms are superior to e.g. co-locational ones.</Paragraph>
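As a rough illustration of the approach outlined above, the following Python sketch weights phrase descriptors statistically and relaxes a query phrase to its constituent words when the whole phrase is absent from a document. It is a minimal sketch under stated assumptions, not the method of any particular system: adjacent word pairs stand in for NLP-derived phrases (so they are really co-locational, not linguistically constrained), the weighting is a plain inverse document frequency, and the function names and the relax parameter are hypothetical.

```python
import math
from collections import Counter


def descriptors(text):
    """Single words plus adjacent-word pairs (a crude stand-in for NLP-derived phrases)."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def idf_weights(docs):
    """Inverse document frequency weight for every descriptor seen in the collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each descriptor once per document.
        df.update(set(descriptors(doc)))
    return {d: math.log(n / count) + 1.0 for d, count in df.items()}


def score(query_phrases, doc, weights, relax=0.5):
    """Sum the weights of matched phrases; relax unmatched phrases to their words.

    relax is an assumed down-weighting factor applied when only the
    constituent words of a phrase, not the phrase itself, occur in the document.
    """
    doc_set = set(descriptors(doc))
    total = 0.0
    for phrase in query_phrases:
        phrase = phrase.lower()
        if phrase in doc_set:
            total += weights.get(phrase, 0.0)
        else:
            # Relaxation: credit each constituent word found, at reduced weight.
            for word in phrase.split():
                if word in doc_set:
                    total += relax * weights.get(word, 0.0)
    return total
```

A document containing the whole phrase then scores above one containing only some of its words, which in turn scores above one containing none. Feedback-driven reweighting, which the text also mentions, is not modelled here.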
    <Paragraph position="2"> Heavy testing is needed to establish performance for the suggested approach, given the many factors affecting retrieval systems: both environment variables, e.g. document type, subject domain, user category, and system parameters, e.g. description exhaustivity, language specificity, weighting formula. There are also different evaluation criteria, performance measures, and application methods to consider. Proper testing is hard (and costly), since it requires large collections, of requests as much as documents, with relevance assessments, and implies fine-grained comparisons within a grid of system contexts and design options.</Paragraph>
    <Paragraph position="3"> Various approaches along the lines suggested, as well as simpler DR-derived ones, are being investigated within ARPA TREC. The TREC experiments are important as the largest retrieval tests to date, with an earnest evaluation design, as well as being TR tests on the grand scale. But any conclusions drawn from them must be treated with caution, since the TREC queries are highly honed and are for standing interests (matching a document against many requests, not vice versa), with tightly specified response needs. TREC is not typical of many retrieval situations, notably the 'wants to read about' one, so any results obtained, especially good ones relying on collection tailoring, may not be generally applicable, and other tests are mandatory.</Paragraph>
  </Section>
</Paper>