<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1017">
  <Title>The Smart/Empire TIPSTER IR System</Title>
  <Section position="3" start_page="0" end_page="107" type="intro">
    <SectionTitle>
4. Context-Dependent Multi-Document Summarization
</SectionTitle>
    <Paragraph position="0"> The goal of our research in this area is to provide a short summary of an entire group of related documents, containing only the query-related portions.</Paragraph>
    <Paragraph position="1"> Taken as a whole, our research aims to increase end-user efficiency in each of the above tasks by reducing the amount of text that the user must peruse in order to get the desired useful information.</Paragraph>
    <Paragraph position="2"> We attack each task through a combination of statistical and linguistic approaches. The proposed statistical approaches extend existing methods in IR by performing statistical computations within the context of another query or document. The proposed linguistic approaches build on existing work in information extraction and rely on a new technique for trainable partial parsing. In short, our integrated approach uses both statistical and linguistic sources to identify selected relationships among important terms in a query or text. The relationships are encoded as TIPSTER annotations [7]. We then use the extracted relationships: (1) to discard or reorder retrieved texts (for high-precision text retrieval); (2) to locate redundant information (for near-duplicate document detection); and (3) to generate coherent synopses (for context-dependent text summarization).</Paragraph>
    <Paragraph position="3"> An end-user scenario that takes advantage of the efficiency opportunities offered by our research might proceed as follows: 1. The user submits a natural language query to the retrieval system, asking for a high-precision search. This search will attempt to retrieve fewer documents than a normal search, but of higher quality, so that far fewer non-useful documents will need to be examined.</Paragraph>
    <Paragraph position="4"> 2. The documents in the result set will be clustered so that closely related documents are grouped.</Paragraph>
    <Paragraph position="5"> Duplicate documents will be clearly marked so the user will not have to look at them at all.</Paragraph>
    <Paragraph position="6"> Near-duplicate documents will also be clearly marked. When the user examines a document marked as a near-duplicate of a previously examined document, the new material in it will be emphasized in color so that it can be quickly perused, while the duplicate material can be ignored.</Paragraph>
    <Paragraph position="7"> 3. Long documents can be automatically summarized, within the context of the query, so that perhaps only 20% of the document will be presented. This 20% summary would include the material that made the system decide the document was useful, as well as other material designed to set the context for the query-related material. 4. If the user wishes, an entire cluster of documents can be summarized. The user can then decide whether to look at any of the individual documents. This multi-document summary will once again be query-related.</Paragraph>
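The following Python sketch illustrates the idea behind step 3 of the scenario under strong simplifying assumptions; it is not the Smart/Empire method. It ranks sentences by plain word overlap with the query and keeps roughly 20% of them in document order. The function names `tokenize` and `query_biased_summary`, the `ratio` parameter, and the naive sentence splitter are all hypothetical, and the sketch makes no attempt to add the context-setting material described above.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for real IR term weighting."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_biased_summary(document: str, query: str, ratio: float = 0.2) -> list[str]:
    """Return roughly `ratio` of the sentences, favoring query overlap.

    Hypothetical helper: sentences are ranked by the number of query
    terms they share, and the selected ones are returned in document order.
    """
    # Naive sentence split: a run of non-terminators plus an optional terminator.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", document)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return []

    query_terms = tokenize(query)
    # Always keep at least one sentence, even for very short documents.
    budget = max(1, round(ratio * len(sentences)))

    # Rank sentence indices by overlap with the query.
    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -len(tokenize(sentences[i]).intersection(query_terms)),
    )

    # Restore document order for the selected sentences.
    chosen = sorted(ranked[:budget])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    doc = (
        "Cats sleep a lot. The retrieval system ranks documents. "
        "Dogs bark. Summaries save reading time. Birds fly."
    )
    for sentence in query_biased_summary(doc, "document retrieval system"):
        print(sentence)
```

Returning the selected sentences in their original order, rather than in score order, is a deliberate choice in this sketch: it keeps the extract readable as a shortened version of the source text.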
    <Paragraph position="8"> One key result of our TIPSTER efforts is the development of TRUESmart, a Toolbox for Research in User Efficiency. TRUESmart is a set of tools and data supporting researchers in the development of methods for improving user efficiency for state-of-the-art information retrieval systems. TRUESmart allows the integration of system components for high-precision retrieval, duplicate detection, and context-dependent summarization; it includes a simple graphical user interface (GUI) that supports each of these tasks in the context of the end-user scenario described above. In addition, TRUESmart aids system evaluation and analysis by highlighting important term relationships identified by the underlying statistical and linguistic language processing algorithms.</Paragraph>
    <Paragraph position="9"> The rest of the paper presents TRUESmart and its underlying IR and NLP components. Section 2 first provides an overview of the Smart IR system and the Empire Natural Language Processing (NLP) system. Section 3 describes the TRUESmart toolbox. To date, we have used TRUESmart to support our work in high-precision retrieval and context-dependent document summarization. We describe our results in these areas in Sections 4 and 5, using the TRUESmart interface to illustrate the algorithms developed and their contribution to the end-user scenario described above. Section 6 summarizes our work in duplicate detection and describes how the TRUESmart interface will easily be extended to support this task and to include linguistic term relationships in addition to statistical term relationships. We conclude with a summary of the potential advantages of our overall approach.</Paragraph>
  </Section>
</Paper>