<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1023">
  <Title>Improving Robust Domain Independent Summarization</Title>
  <Section position="4" start_page="0" end_page="171" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Summarization is the problem of presenting the most important information contained in one or more documents. The research described here focuses on multi-lingual summarization (MLS).</Paragraph>
    <Paragraph position="1"> Summaries of documents are produced in Spanish, Japanese, English and Russian using the same basic summarization engine.</Paragraph>
    <Paragraph position="2"> The core summarization problem is to take a single text and produce a shorter text in the same language that contains all the main points of the input. We are taking a robust, graded approach to building the core engine, incorporating statistical, syntactic and document structure analyses, among other techniques. We have developed a system design that allows both the summarization process and the necessary information about the languages being processed to be parameterized.</Paragraph>
    <Paragraph position="3"> Document structure analysis (Salton &amp; Singhal 94, Salton et al. 95) is important for extracting the topic of a text. In a statistical analysis, for example (Paice 90, Paice &amp; Jones 93), titles and subtitles would be given greater weight than the body of the text. Similarly, the introduction and conclusion of the text itself and of each section are more important than other paragraphs, and the first and last sentences of each paragraph are more important than the others. The applicability of these heuristics depends on the style adopted in a particular domain and on the language: the stylistic structure and the presentation of arguments vary significantly across genres and languages. Structure analysis must therefore be tailored to a particular type of text in a particular language. In the MINDS system, document structure analysis involves several subtasks. In order to allow a multitude of techniques to contribute to sentence selection, the core engine adopts a flexible method: each technique scores the sentences in a document, and the sentences are then ranked by combining the different scores.</Paragraph>
    <Paragraph position="4"> Text-structure based heuristics provide the main method for ranking and selecting sentences in a document. These are supplemented by word frequency analysis methods.</Paragraph>
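The scoring scheme described above (each technique scores every sentence, and the scores are combined into one ranking) can be illustrated with a minimal sketch. It is not the MINDS implementation: the function names, the weights, the stopword list and the particular position and frequency formulas are all assumptions chosen for illustration, and the input is assumed to be already split into paragraphs and sentences.

```python
import re
from collections import Counter

# Hypothetical weights: structure heuristics dominate and word frequency
# supplements them. The source gives no actual values.
WEIGHTS = {"position": 0.7, "frequency": 0.3}

# Tiny illustrative stopword list (an assumption, English only).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "it"}


def tokenize(sentence):
    """Return lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def position_scores(paragraphs):
    """Structure heuristic: 0.5 for being in the first or last paragraph,
    plus 0.5 for being the first or last sentence of a paragraph."""
    scores = []
    last_par = len(paragraphs) - 1
    for p, sentences in enumerate(paragraphs):
        last_sent = len(sentences) - 1
        for s in range(len(sentences)):
            score = 0.0
            if p in (0, last_par):
                score += 0.5
            if s in (0, last_sent):
                score += 0.5
            scores.append(score)
    return scores


def frequency_scores(paragraphs):
    """Word-frequency heuristic: mean document frequency of the words in a
    sentence, normalized so the best sentence scores 1.0."""
    flat = [s for par in paragraphs for s in par]
    counts = Counter(w for s in flat for w in tokenize(s))
    raw = []
    for s in flat:
        words = tokenize(s)
        raw.append(sum(counts[w] for w in words) / len(words) if words else 0.0)
    top = max(raw, default=0.0)
    return [r / top if top else 0.0 for r in raw]


def summarize(paragraphs, n):
    """Combine the per-technique scores, keep the n best sentences, and
    return them in their original document order."""
    flat = [s for par in paragraphs for s in par]
    pos = position_scores(paragraphs)
    freq = frequency_scores(paragraphs)
    combined = [WEIGHTS["position"] * a + WEIGHTS["frequency"] * b
                for a, b in zip(pos, freq)]
    ranked = sorted(range(len(flat)), key=lambda i: combined[i], reverse=True)
    return [flat[i] for i in sorted(ranked[:n])]
```

Because each technique only has to return one score per sentence, a further technique could be added as another scoring function and another entry in the weight table without changing the ranking step. Language-specific choices such as the stopword list and the position rules are kept as separate parameters for the same reason.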
    <Paragraph position="5"> The core engine is designed so that additional resources, such as lexical and other knowledge bases or text processing and MT engines, can be incorporated into the overall multi-engine MINDS system as they become available from other ongoing research efforts. The most promising components are part-of-speech tagging, anaphora resolution, and semantic methods that allow concept identification to supplement word frequency analysis. Part-of-speech tagging has already been used to reduce sentence length by stripping out &quot;superfluous&quot; words and phrases. The other methods will be used to maintain document coherence and to improve sentence selection and reduction.</Paragraph>
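As a rough illustration of sentence reduction driven by part-of-speech tags, the sketch below drops tokens whose tag belongs to a removable set. It assumes the sentence has already been tagged with Penn Treebank style tags. The choice of removable tags, the protected-word list and the function name are hypothetical, since the text above does not say which words and phrases are treated as superfluous.

```python
# Hypothetical set of removable tags: adverbs (RB, RBR, RBS) and
# interjections (UH). Which categories count as "superfluous" is an
# assumption, not something the source specifies.
DROPPABLE_TAGS = {"RB", "RBR", "RBS", "UH"}

# Negations are usually tagged RB but change the meaning, so keep them.
PROTECTED_WORDS = {"not", "n't", "never", "no"}


def reduce_sentence(tagged_tokens, droppable=DROPPABLE_TAGS):
    """Return the sentence with droppable-tag tokens removed.

    tagged_tokens is a list of (word, tag) pairs produced by some tagger.
    """
    kept = [
        word
        for word, tag in tagged_tokens
        if tag not in droppable or word.lower() in PROTECTED_WORDS
    ]
    return " ".join(kept)


tagged = [
    ("The", "DT"), ("engine", "NN"), ("very", "RB"), ("quickly", "RB"),
    ("ranks", "VBZ"), ("sentences", "NNS"),
]
print(reduce_sentence(tagged))
```

Protecting negations shows why a purely tag-based filter needs care: removing every adverb would turn "does not rank" into "does rank".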
    <Paragraph position="6"> In this paper we describe the architecture and performance of the current system and our plans for incorporating new NLP methods.</Paragraph>
  </Section>
</Paper>