<?xml version="1.0" standalone="yes"?> <Paper uid="X98-1023"> <Title>Improving Robust Domain Independent Summarization</Title> <Section position="5" start_page="171" end_page="175" type="metho"> <SectionTitle> 2 MINDS - Multi-Lingual Interactive Document Summarization </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="171" end_page="171" type="sub_section"> <SectionTitle> 2.1 Background </SectionTitle> <Paragraph position="0"> The need for summarization tools is especially strong if the source text is in a language different from the one(s) in which the reader is most fluent.</Paragraph> <Paragraph position="1"> Interactive summarization of multilingual documents is a very promising approach to improving productivity and reducing costs in large-scale document processing. It addresses the scenario where an analyst is trying to filter through a large set of documents to decide quickly which documents deserve further processing. This task is more difficult and expensive when the documents are in a foreign language in which the analyst may not be as fluent as he or she is in English. The task is even more difficult when the documents are in several different languages. For example, the analyst's task may be to filter through newspaper articles in many different languages published on a particular day to generate a report on different nations' reactions to a current international event, such as a nuclear test on the previous day. This last task is currently infeasible for a single analyst, unless he or she understands each one of those languages, since machine translation (MT) of entire documents cannot yet meet the requirements of such a task. Multi-lingual summarization (MLS) introduces the possibility of translating only the summary, rather than the entire document, into the target language (i.e., English). We hope that MLS and MT can benefit from one another, since summarization spares MT from having to translate entire texts and also spares a user from having to read through an entire document produced by an MT system.</Paragraph> </Section> <Section position="2" start_page="171" end_page="173" type="sub_section"> <SectionTitle> 2.2 Overview </SectionTitle> <Paragraph position="0"> The MINDS system is a multilingual, domain-independent summarization system, which is able to summarize documents written in English, Japanese, Russian and Spanish. The system is intended to be rapidly adaptable to new languages and genres by adjusting a set of parameters. A summarization system for Turkish has just been added to the system. This required about one programmer day of effort, mostly spent in preprocessing the language resources used by the system.</Paragraph> <Paragraph position="2"> The types of summarization information used are also intended to be adjustable by a user &quot;on the fly&quot;, to allow the tuning of the summarizer's output based on the length of summary needed, the type of document structure, and the topic focus.</Paragraph> <Paragraph position="3"> The MINDS summarization system is composed of four stages. First we have an Input Process stage, whose main function is to extract the relevant text from the document in UNICODE encoding. The second stage is a Document Structuring Stage, where paragraph and sentence recognition and word tokenization are performed. All the information about the document structure is stored in a &quot;Document Object&quot; that will be used in the Summarization-Translation stage. 
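As a rough illustration only, a &quot;Document Object&quot; of this kind might be sketched in Python roughly as follows; the class and field names here are hypothetical and are not taken from the MINDS implementation:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Sentence:
    words: List[str]           # word tokens from the Document Structuring Stage
    score: float = 0.0         # filled in later by the Summarization-Translation Stage

@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)

@dataclass
class DocumentObject:
    language: str              # e.g. 'en' or 'ja', set by language recognition
    source_encoding: str       # original encoding, needed again by the Output Process
    paragraphs: List[Paragraph] = field(default_factory=list)
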
In the Summarization-Translation Stage, the text is summarized using sentence extraction techniques, where sentence scoring and ranking are mainly based on text-structure-based heuristics, supplemented by word frequency analysis methods and, in some cases, by information from a Name Recognition module.</Paragraph> <Paragraph position="5"> Once the summary is ready in the original language, MINDS uses MT engines from other ongoing CRL projects to translate the summary into English. The final stage is the Output Process, which generates the summary output form: SGML, HTML, or Plain text. This may also involve conversion from UNICODE back to the original encoding of the document.</Paragraph> </Section> <Section position="3" start_page="173" end_page="173" type="sub_section"> <SectionTitle> 2.3 Input Process Stage </SectionTitle> <Paragraph position="0"> In the input stage, MINDS can accept documents written in different languages and codesets: currently English, Japanese, Russian, Turkish and Spanish. Also, the documents can be in different formats, such as SGML, HTML, E-mail or Plain text. A parsing stage identifies the document's format, selects and applies the appropriate parser, and extracts the relevant text from the document.</Paragraph> <Paragraph position="1"> Once we have the text to be summarized, a language recognition module determines the language in which the document is written and the text encoding used in the document. Given the encoding of the document, the text is converted to UNICODE, and all the rest of the processing is carried out on the UNICODE version of the text.</Paragraph> </Section> <Section position="4" start_page="173" end_page="174" type="sub_section"> <SectionTitle> 2.4 Document Structuring Stage </SectionTitle> <Paragraph position="0"> After the text to be summarized is available in UNICODE encoding, its structure needs to be determined. This is the job of the Document Structuring Stage. In this stage, three tokenization steps are performed. The first step identifies the paragraphs in the document. The second step identifies sentences within each paragraph. Identifying sentence boundaries in many languages requires a list of abbreviations for the language. Languages such as Chinese and Japanese have an unambiguous &quot;stop&quot; character and thus do not present this problem. Finally, word tokenization is carried out to identify individual words in each sentence. Here, Chinese and Japanese, which do not use spaces between words, require some segmentation method to be applied.</Paragraph> <Paragraph position="1"> The current system actually uses two-character pairs (bi-grams) for all its calculations for Japanese. These bi-grams are produced starting at every character position in the document.</Paragraph> <Paragraph position="2"> All the structuring information is stored in a &quot;Document Object&quot;, which is the main data structure of the system, holding all the information generated during processing. 
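As a purely illustrative sketch of the Japanese character bi-grams described above (the function name is ours and is not part of MINDS):

def character_bigrams(text: str):
    """Produce overlapping two-character units, one starting at every
    character position, as used in place of word tokens for Japanese."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# For example, character_bigrams("東京都内") yields ["東京", "京都", "都内"].
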
After the tokenization stage is complete, and depending on the lexical resources available for each language, other stages are performed, such as Morphology, Proper Name Recognition and Tagging.</Paragraph> </Section> <Section position="5" start_page="174" end_page="175" type="sub_section"> <SectionTitle> 2.5 Summarization-Translation Stage </SectionTitle> <Paragraph position="0"> In the Summarization-Translation Stage, the importance of each sentence in the document is determined using a scoring procedure, which assigns scores to the sentences according to their position in the document structure and according to the occurrences of key-words in the sentence, i.e., words that belong to the set of most frequent words in the document that are not in a &quot;stop list&quot; (the most frequent words in a language are considered irrelevant). We make the assumption that these key-words represent or identify the main concepts in the document; therefore, if a sentence contains several of them, its score should be high so that it can be selected as part of the summary. It is important to note here that we need a &quot;stop list&quot; for each language considered in the summarization system. Also, if a Proper Name Recognition module is available for a specific language, we use the information about person names, organization names, places and dates to contribute to the scores of sentences.</Paragraph> <Paragraph position="1"> At this point, if the lexical resources are available, an optional sentence length reduction can be carried out using information from a tagging stage.</Paragraph> <Paragraph position="2"> This sentence length reduction includes the elimination of adjectives from noun phrases, keeping only the head noun in a noun phrase, eliminating adverbs from verb phrases, and eliminating most of the prepositional phrases. However, if a word selected for elimination is a key-word, a proper noun, the name of a place, a date or a number, the word is kept in the sentence. If this word happens to be in a prepositional phrase, then the prepositional phrase is kept in the sentence.</Paragraph> <Paragraph position="3"> Once the scoring process is done, the sentences are ranked and a summary is generated using the sentences with the highest scores that together do not exceed a predetermined percentage of the document's length. 
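A minimal sketch of this kind of frequency-based scoring and length-bounded selection is given below; it omits the positional heuristics and the Name Recognition contribution, assumes each sentence is a list of word tokens, and uses function names of our own choosing:

from collections import Counter

def score_sentences(sentences, stop_list, top_n=10):
    """Score each sentence by how many of the document's most frequent
    non-stop-list words (the key-words) it contains."""
    freq = Counter(w.lower() for s in sentences for w in s
                   if w.lower() not in stop_list)
    keywords = {w for w, _ in freq.most_common(top_n)}
    return [sum(1 for w in s if w.lower() in keywords) for s in sentences]

def select_summary(sentences, scores, max_fraction=0.25):
    """Rank sentences by score and keep the highest-scoring ones whose
    combined length stays within a fraction of the document's length."""
    budget = max_fraction * sum(len(s) for s in sentences)
    chosen, used = [], 0
    for i in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        if used + len(sentences[i]) > budget:
            continue
        chosen.append(i)
        used += len(sentences[i])
    return [sentences[i] for i in sorted(chosen)]  # restore document order
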
This summary is written in the document's original language, so a machine translation system is used to produce an English version of the summary.</Paragraph> </Section> <Section position="6" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 2.6 Output Process </SectionTitle> <Paragraph position="0"> At this point in the summarization process, we have a version of the document's summary in the original language and a version in English, both encoded using UNICODE and in plain text format.</Paragraph> <Paragraph position="1"> The Output Process stage takes these two versions of the summary and converts the one written in the original language to the original encoding of the document (identified by the Language Recognition module); it then converts the version in English from UNICODE to &quot;8859_1&quot; (ISO Latin-1).</Paragraph> <Paragraph position="2"> After the summaries are in the proper output encoding, the system generates the summary in one of the following formats: SGML, HTML, E-mail or Plain text, according to the user's specification or to system parametrization. For example, if the summarization system is being used for web delivery, the output format will be HTML by default.</Paragraph> </Section> </Section> <Section position="6" start_page="175" end_page="176" type="metho"> <SectionTitle> 3 Extending the Summarization Capability </SectionTitle> <Paragraph position="0"> Our goal is to improve the usability and flexibility of the summarization system, while still retaining robustness. This is one of the main reasons why we favor the sentence selection method rather than approaches based on deep analysis and generation (Beale 94, Carlson & Nirenburg 90).</Paragraph> <Paragraph position="1"> Though much disparaged for lack of readability, cohesion, etc., systems based on the sentence selection method performed well in the recent Tipster summarization evaluation. In fact, the readability as assessed by the evaluators was as high for summaries of about 30% of the document length as it was for the original documents. We are developing summarization techniques based on information extraction and text generation. These will not give very good coverage, because of their domain specificity, but they do offer advantages, particularly in the area of cross-document summarization.</Paragraph> <Paragraph position="2"> Our experiments for English have shown that the inclusion of other language processing techniques can indeed increase the flexibility and performance of the summarizer. In particular, proper name recognition, co-reference resolution, part-of-speech tagging and partial parsing can all contribute to the performance of the system.</Paragraph> <Paragraph position="3"> The use of proper names allows the summaries to be weighted towards sections of the documents discussing specific individuals or organizations rather than more general topics. For the production of informative summaries, rather than indicative summaries, this may be an important capability. This technique was used to produce summaries evaluated using a &quot;question and answer&quot; methodology at the Tipster evaluation, and it achieved high performance there.</Paragraph> <Paragraph position="4"> We have not yet incorporated co-reference resolution methods in our system, but it would seem that readability could be improved by the ability to replace pronouns with their referents. It remains to be seen, however, whether sufficient accuracy can be achieved to support this method. 
In cases like this, where an error may be critical for a user of the system, we would normally mark the fact that the text had been added by the system.</Paragraph> <Paragraph position="5"> Part-of-speech tagging and phrase recognition allow us to carry out certain kinds of text compaction. This is particularly important when very short summaries (10%) of short documents are required.</Paragraph> <Paragraph position="6"> Our experiments with this kind of compaction have shown reductions of about 1/3 of the summary size, with some loss of readability. A single-sentence example shows the usefulness of this technique.</Paragraph> <Section position="1" start_page="175" end_page="175" type="sub_section"> <SectionTitle> Original Sentence </SectionTitle> <Paragraph position="0"> Browning-Ferris Industries Inc. was denied a new permit for a huge Los Angeles-area garbage dump, threatening more than $1 billion in future revenue and the value of a $100 million investment.</Paragraph> </Section> <Section position="2" start_page="175" end_page="176" type="sub_section"> <SectionTitle> Shortened Sentence </SectionTitle> <Paragraph position="0"> Browning-Ferris Industries Inc. was denied a permit for a Los Angeles-area dump, threatening more than $1 billion in revenue and the value of a $100 million investment.</Paragraph> <Paragraph position="1"> We hope eventually to have sentence reduction in place for all the languages we process, and that this will also improve the readability of MT output by allowing it to process significantly simplified input.</Paragraph> </Section> </Section> </Paper>