<?xml version="1.0" standalone="yes"?> <Paper uid="X98-1007"> <Title>The Cornell TIPSTER Phase III Project</Title> <Section position="3" start_page="0" end_page="37" type="metho"> <SectionTitle> NEAR-DUPLICATE DETECTION </SectionTitle> <Paragraph position="0"> The goal of the research in this area is to devise a system that reduces the amount of duplicated information that the user sees. Current retrieval systems may return several versions of a document in which the differences result from changes made to the metafile, from word variations in the body (e.g., multiple authors using different words to describe the same event), or from an update with new information added. While other retrieval-enhancement algorithms have been developed to identify exact and near duplicates, this research effort investigates methods for processing documents that are similar but do not necessarily contain the same terms. For example: if a user has seen document X and the first two paragraphs of unseen document Y contain only information that is the same as or similar to information in X, the system would process Y to &quot;hide&quot; the duplicated data, showing the user only the unique paragraphs. An operational version of this system could be integrated with an agency's retrieval system to remove exact and near duplicates from the retrieval system's hit list. This application would probably benefit most Intelligence Community analysts.</Paragraph> <Paragraph position="1"> Other types of users may be interested in capturing all duplicates, near-duplicates and similar documents.
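The paragraph-hiding behavior described above can be sketched as follows. This is a minimal illustration assuming a simple bag-of-words cosine-similarity test with a hypothetical threshold; the paper does not specify the actual Cornell method:

```python
from collections import Counter
import math

def term_vector(text):
    """Bag-of-words term vector for a paragraph (lowercased tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hide_duplicated(seen_doc, new_doc, threshold=0.8):
    """Return only those paragraphs of new_doc that are not similar
    to any paragraph of a document the user has already seen."""
    seen_vecs = [term_vector(p) for p in seen_doc]
    unique = []
    for p in new_doc:
        v = term_vector(p)
        if all(cosine(v, s) < threshold for s in seen_vecs):
            unique.append(p)
    return unique
```

Applied to a seen document X and an unseen document Y whose first paragraph repeats X, only Y's unique paragraphs are returned, mirroring the hiding behavior described above.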
In this scenario, a user tasked to process a collection of documents consistently would want to identify all duplicate and similar documents for processing in the same or a similar manner.</Paragraph> <Paragraph position="2"> If the near-duplicate detection effort is successful, the resulting system would provide the user with this identification capability.</Paragraph> </Section> <Section position="4" start_page="37" end_page="37" type="metho"> <SectionTitle> CONTEXT-DEPENDENT SUMMARIZATION </SectionTitle> <Paragraph position="0"> The third research area continues the objective of reducing the amount of text that the user must read by presenting summaries of long documents in lieu of the full documents. The summarization software will either provide a short summary for each document in a collection or one summary for an entire group of related documents. If the collection contains disparate documents, the Cornell approach uses a preliminary step to group related documents and then applies the summarization algorithms to each group.</Paragraph> <Paragraph position="1"> Summarization will be done in the context of the query. The Cornell system will capture only those features relevant to the user's information need. This is distinguishable from a generic summary, which would capture the salient items of the entire document without regard to any particular search query or information need. For example: suppose the target document contains information on political profiles, military status, weapons proliferation issues, and economic changes for a country of interest. A good generic summary would contain the essential elements of all four topics.
If the user's only interest is the second topic, as reflected in the user-defined query, then a good context-dependent summary would contain only those elements that are relevant to military status.</Paragraph> <Paragraph position="2"> In a combined approach to summarization, users will have the option of generic summaries or query-based summaries for each document in a collection, or a cross-document summary for the entire collection. (See \[3\] for details on generic summarization.)</Paragraph> </Section> </Paper>