<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1039"> <Title>Session 5b. Information Retrieval</Title> <Section position="1" start_page="0" end_page="203" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> As this is the first time there has been a session on information retrieval at a DARPA Speech and Natural Language Workshop, it seems appropriate to provide a more detailed introduction to this topic than would normally appear. The term &quot;information retrieval&quot; refers to a particular application rather than a particular technique, that application being the location of information in a (usually) large amount of (relatively) unstructured text. This could be done by constructing a filter to pull useful information from a continuous stream of text, as in building an intelligent router or a library profiling system. Alternatively, the text could be archived newspapers, online manuals, or electronic card catalogs, with the user constructing an ad-hoc query against this information. In both cases, the system must locate information relevant to the ad-hoc query or filter accurately and completely, using techniques efficient enough to process often huge amounts of incoming text or very large archives.</Paragraph> <Paragraph position="1"> The currently-used Boolean retrieval systems grew out of the practice, more than 100 years old, of building cumulative indices, with early mechanical devices enabling people to join two index terms using AND's and OR's.
This mechanism was adapted to computers, and although today's commercial retrieval systems are much more sophisticated, they have not gone beyond the Boolean model.</Paragraph> <Paragraph position="2"> Boolean systems are difficult for naive or intermittent users to operate, and even skilled searchers find these systems limiting.</Paragraph> <Paragraph position="3"> The widespread use of computers in the 1960s and the availability of online text made possible some innovative and extensive research in new information retrieval techniques [5, 3]. This work has continued, with new models being proposed, many experimental techniques being tried, and some implementation and testing of these systems in real-world environments. For an excellent summary of various models and techniques, see [1], and for a discussion of implementation issues, see [2]. The major archival publications in the area of information retrieval are 1) Information Processing and Management, Pergamon Press; 2) Journal of the American Society for Information Science; and 3) the annual proceedings of the ACM SIGIR conference, available from ACM Press.</Paragraph> <Paragraph position="4"> Most of this research has involved statistical techniques applied to the text, with the goal being to match a user's query (or a filter) against the text in such a manner as to produce a ranked list of titles (or documents), ranked by the probability that a document is relevant to the query or filter. The use of statistical techniques rather than natural language techniques comes from the need to handle relatively large amounts of text, and the (supposed) lack of need to completely understand text in order to retrieve from it. For a survey of the use of natural language procedures in information retrieval, see [4].</Paragraph> <Paragraph position="5"> The statistical techniques have proven successful in laboratories, and generally retrieve at least some relevant documents at high precision levels.
The performance figure often quoted for these systems is 50% precision at 50% recall, roughly equivalent to the performance of Boolean systems in the hands of skilled searchers. Unfortunately, this performance has not seen major improvement recently, although improvements continue in related parts of information retrieval, such as interfaces and efficiency. Two explanations are often given for this lack of improvement. The first is that the currently available test collections are too small for many of the proposed techniques to perform properly; the second is that more sophisticated techniques are needed, including some natural language techniques.</Paragraph> <Paragraph position="6"> The DARPA TIPSTER and TREC programs address both of these issues, building a much larger test collection (4 gigabytes of text) and supporting a range of techniques, including sophisticated statistical techniques and efficient natural language techniques.</Paragraph> <Paragraph position="7"> Results from these projects will be reported in the future. The four papers in this session all apply natural language techniques to information retrieval, and illustrate some of the important ways that natural language processing can improve information retrieval.</Paragraph> </Section></Paper>