<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1026">
  <Title>Automated Text Summarization and the SUMMARIST System</Title>
  <Section position="3" start_page="0" end_page="198" type="metho">
    <SectionTitle>
1. THE NATURE OF SUMMARIES
</SectionTitle>
    <Paragraph position="0"> Early experimentation in the late 1950's and early 60's suggested that text summarization by computer was feasible though not straightforward (Luhn, 59; Edmundson, 68). The methods developed then were fairly unsophisticated, relying primarily on surface level phenomena such as sentence position and word frequency counts, and focused on producing extracts (passages selected from the text, reproduced verbatim) rather than abstracts (interpreted portions of the text, newly generated).</Paragraph>
    <Paragraph position="1"> After a hiatus of some decades, the growing presence of large amounts of online text--in corpora and especially on the Web--renewed the interest in automated text summarization. During these intervening decades, progress in Natural Language Processing (NLP), coupled with great increases of computer memory and speed, made possible more sophisticated techniques, with very encouraging results. In the late 1990's, some relatively small research investments in the US (not more than 10 projects, including commercial efforts at Microsoft, Lexis-Nexis, Oracle, SRA, and TextWise, and university efforts at CMU, NMSU, UPenn, and USC/ISI) over three or four years have produced several systems that exhibit potential marketability, as well as several innovations that promise continued improvement. In addition, several recent workshops, a book collection, and several tutorials testify that automated text summarization has become a hot area.</Paragraph>
    <Paragraph position="2"> However, when one takes a moment to study the various systems and to consider what has really been achieved, one cannot help being struck by their underlying similarity, by the narrowness of their focus, and by the large numbers of unknown factors that surround the problem. For example, what precisely is a summary? No-one seems to know exactly. In our work, we use summary as the generic term and define it as follows: A summary is a text that is produced out of one or more (possibly multimedia) texts, that contains (some of) the same information of the original text(s), and that is no longer than half of the original text(s).</Paragraph>
    <Paragraph position="3"> To clarify the picture a little, we follow and extend (Sp~irck Jones, 97) by identifying the following aspects of variation. Any summary can be characterized by (at least) three major classes of characteristics: Invut: characteristics of the source text(s) Source size: single-document vs. multi-document: A single-document summary derives from a single input text (though the summarization process itself may employ information compiled earlier from other texts). A multi-document summary is one text that covers the content of more than one input text, and is usually used only when the input texts are thematically related.</Paragraph>
    <Paragraph position="4"> Specificity: domain-specific vs. general: When the input texts all pertain to a single domain, it may be appropriate to apply domain-specific summarization techniques, focus on specific content, and output specific formats, compared to the general case. A domain-specific summary derives from input text(s) whose theme(s) pertain to a single restricted domain. As such, it can assume less term ambiguity, idiosyncratic word and grammar usage, specialized formatting, etc., and can reflect them in the summary.</Paragraph>
    <Paragraph position="5">  A general-domain summary derives from input text(s) in any domain, and can make no such assumptions.</Paragraph>
    <Paragraph position="6"> Genre and scale: Typical input genres include newspaper articles, newspaper editorials or opinion pieces, novels, short stories, non-fiction books, progress reports, business reports, and so on. The scale may vary from book-length to paragraphlength. Different summarization techniques may apply to some genres and scales and not others.</Paragraph>
    <Paragraph position="7"> Output: characteristics of the summary as a text Derivation: Extract vs. abstract: An extract is a collection of passages (ranging from single words to whole paragraphs) extracted from the input text(s) and produced verbatim as the summary. An abstract is a newly generated text, produced from some computer-internal representation that results after analysis of the input.</Paragraph>
    <Paragraph position="8"> Coherence: fluent vs. disfluent: A fluent summary is written in full, grammatical sentences, and the sentences are related and follow one another according to the rules of coherent discourse structure. A disfluent summary is fragmented, consisting of individual words or text portions that are either not composed into grammatical sentences or not composed into coherent paragraphs.</Paragraph>
    <Paragraph position="9"> Partiality: neutral vs. evaluative: This characteristic applies principally when the input material is subject to opinion or bias. A neutral summary reflects the content of the input text(s), partial or impartial as it may be. An evaluative summary includes some of the system's own bias, whether explicitly (using statements of opinion) or implicitly (through inclusion of material with one bias and omission of material with another).</Paragraph>
    <Paragraph position="10"> Conventionality: fixed vs. floating: A fixedsituation summary is created for a specific use, reader (or class of reader), and situation. As such, it can conform to appropriate in-house conventions of highlighting, formatting, and so on. A floatingsituation summary cannot assume fixed conventions, but is created and displayed in a variety of settings to a variety of readers for a variety of purposes.</Paragraph>
    <Paragraph position="11"> Purpose: characteristics of the summary usage Audience: Generic vs. query-oriented: A generic summary provides the author's point of view of the input text(s), giving equal import to all major themes in it. A query-oriented (or user-oriented) summary favors specific themes or aspect(s) of the text, in response to a user's desire to learn about just those themes in particular. It may do so explicitly, by highlighting pertinent themes, or implicitly, by omitting themes that do not match the user's interests.</Paragraph>
    <Paragraph position="12"> Usage: Indicative vs. informative: An indicative summary provides merely an indication of the principal subject matter or domain of the input text(s) without including its contents. After reading an informative summary, one can explain what the input text was about, but not necessarily what was contained in it. An informative summary reflects (some of) the content, and allows one to describe (parts of) what was in the input text.</Paragraph>
    <Paragraph position="13"> Expansiveness: Background vs. just-the-news: A background summary assumes the reader's prior knowledge of the general setting of the input text(s) content is poor, and hence includes explanatory material, such as circumstances of place, time, and actors. A just-the-news summary contains just the new or principal themes, assuming that the reader knows enough background to interpret them in context.</Paragraph>
    <Paragraph position="14"> At this time, apart from early work by Sp~irck Jones and students, such as (Tait and Sp~irck Jones, 83), we know of few linguistic or computational studies of these and other aspects of summaries; the work by (Van Dijk and Kintsch, 83) and (Endres-Niggemeyer, 97) focus on the psycholinguistic aspects of humans when they create summaries. We believe that the typology of summaries is a fruitful area for further study, both by linguists performing text analysis and by computational linguists trying to create techniques to create summaries conforming to one or more of the characteristics listed above. A better understanding of the types of summary will facilitate the construction of techniques and systems that better serve the various purposes of summarization in general.</Paragraph>
    <Paragraph position="15"> Our own work is computational. Over the past two years, under the TIPSTER program, we have been developing the text summarization system SUMMARIST (Hovy and Lin, 98; Lin, 98). Our goal is to investigate the nature of text summarization, using SUMMARIST both as a research tool and as an engine to produce summaries for people upon demand.</Paragraph>
    <Paragraph position="16"> In this paper, we first describe the architecture of SUMMARIST and provide details on the evaluated results of several of its modules in Sections 3, 4, and 5. Finally, since the evaluation of summaries (and of summarization) is a little-understood business, we describe some preliminary experiments in this regard in Section 6.</Paragraph>
  </Section>
  <Section position="4" start_page="198" end_page="204" type="metho">
    <SectionTitle>
2. SUMMARIST
</SectionTitle>
    <Paragraph position="0"> The goal of SUMMARIST is to create summaries of arbitrary text in English and selected other languages (Hovy and Lin, 98). By eschewing language-specific methods for the relatively surface-level processing, it is possible to create a multi-lingual summarizer fairly easily. Eventually, however, SUMMARIST will include language-specific techniques of parsing and semantic analysis, and will combine robust NLP processing (using Information Retrieval and statistical techniques) with symbolic world knowledge (embodied in the concept thesaurus SENSUS (Knight and Luk, 94; Hovy, 98), derived from WordNet (Miller et al., 90) and augmented by dictionaries and similar resources) to overcome the problems endemic to either approach alone. These problems arise because existing robust NLP methods tend to operate at the word level, and hence miss concept-level generalizations (which are provided by symbolic world knowledge), while on the other hand symbolic knowledge is too difficult to acquire in large enough scale to provide adequate coverage and robustness. For high-quality yet robust summarization, both aspects are needed.</Paragraph>
    <Paragraph position="1"> In order to maintain functionality while we experiment with new aspects, and since not all kinds of summary require the same processing steps, we have adopted a very open, modular design. Since it is still under development, not all the modules of SUMMARIST are equally mature.</Paragraph>
    <Paragraph position="2"> To create extracts, one needs procedures to identify the most important passages in the input text. To create abstracts, the core procedure is a process of interpretation. In this step, two or more topics are fused together to form a third, more succinctly stated, one. (We define topic as a particular subject that we write about or discuss.) This step must occur after the identification step. Finally, to produce the summary, a concluding step of sentence generation is needed. Thus SUMMARIST is structured according to the following 'equation':</Paragraph>
    <Paragraph position="4"> For identification, the goal is to filter the input to retain only the most important, central, topics. Once they have been identified, they can simply be output, to form an extract. Typically, topic identification can be achieved using various complementary techniques. This stage of SUMMARIST is by far the most developed; making it, at present, an extract-only summarizer. See Section 3.</Paragraph>
    <Paragraph position="5"> For interpretation, the goal is to perform compaction through re-interpreting and fusing the extracted topics into more succinct ones. This is necessary because abstracts are usually much shorter than their equivalent extracts. All the variations of fusion are yet to be discovered, but they include at least simple concept generalization (he ate pears, apples, and bananas---+ he ate fruit) and script identification (he sat down, read the menu, ordered, ate, and left----~ he visited the restaurant). We discuss interpretation in Section 4.</Paragraph>
    <Paragraph position="6"> For generation, the goal is to produce the output summary. In the case of extracts, generation is a null step, but in the case of abstracts, the generator has to reformulate the extracted and fused material into a coherent, densely phrased, new text. The modules planned for SUMMARIST are described in Section 5.</Paragraph>
    <Paragraph position="7"> Prior to topic identification, the system preprocesses the input text. This stage converts all inputs into a standard format we call SNF (Summarist Normal Form).</Paragraph>
    <Paragraph position="8"> Preprocessing includes tokenizing (to read non-English texts and output word-segmented tokens); part-of-speech tagging (the tagger is based on Brill's (1992) part-of-speech tagger); demorphing (to find root forms of each input token, using a modification of WordNet's (Miller et al., 90) demorphing program); phrasing (to find collocations and multi-word phrases, as recorded in WordNet); token frequency counting; tf.idf weight calculation (to calculate the tf.idf weight (Salton, 88) for each input token, and rank the tokens accordingly); and query relevance calculation (to record with each sentence the number of demorphed content words in the user's query that also appear in that sentence).</Paragraph>
    <Paragraph position="9"> An example text in Indonesian, preprocessed into SNF, is shown in Figures l(a) and l(b). Figure l(a) indicates that the text contained 1618 characters, and that it had been processed by the following modules: tokenization and part of speech tagging, title treatment, demorphing, WordNet categorization and common word identification, tf.idf computation, and OPP processing (see Section 3.1).</Paragraph>
    <Paragraph position="10"> It also records the top-scoring words in the text, together with their scores, as given by the modules computing term frequency (tf_keywords), tf.idf, and the OPP. The field opp_rule shows the most important sentence positions as 0 (the title); sentence 1; sentences 2 and 3 (tied), in that order. Figure l(b) contains the processed text itself, one word per line, with the features added to each word by various modules. The features include paragraph and sentence number (pno and sno), part of speech (pos, empty for Indonesian), common word indicator (cwd), presence of word in title (ttl), morphology (mph), WordNet count (wnc), word frequency in text (frq), and tfidfand OPP scores (see Sections 3.3 and 3.1 resp.).</Paragraph>
    <Paragraph position="11"> 3. Phase 1: Topic Identification Summarization systems that perform topic identification only produce extract summaries; these include the current  operational version of SUMMARIST, as well as the systems of (Aone et al., 98; Strzalkowski et al., 98; Bagga and Baldwin, 98; and Mitra et al., 97).</Paragraph>
    <Paragraph position="12">  hadapan 3. vide 3. { saya 2. agung 2. digambarkan 2. {jawaban 2. &gt; &lt;*tfidf_keywords=lewinsky, 35.0881kesaksian, 32.448{kongres,27.7181video,16.8941 dem krat 3.859 hadapan 2.2 9 digambarkan .838 senat r .262{pembaruan .45 &gt; &lt;*opp rule=p:0,111,212,413,4 s:-,-&gt; &lt;*opp keywords=kongres,26.9171kesaksian, 25.6671demokrat,16.33311ewinsky, 14.167 1 senat r 2.667{sar nkan .667 vid 9.333{screen 9. {the 9. {had pan 8.9 7&gt; Figure 1 (a). Indonesian text: preamble, after preprocessing.  We assume that a text can have many (sub)-topics, and that the topic extraction process can be parameterized in at least two ways: first, to include more or fewer topics to produce longer or shorter summaries, and second, to include only topics relating to the user's expressed interests.</Paragraph>
    <Paragraph position="13"> Typically, topic identification can be achieved using various complementary techniques, including those based on stereotypical text structure, cue words, high-frequency indicator phrases, and discourse structure. Modules for all of these have been completed or are under construction for SUMMARIST. In processing, each module assigns a numeric score to each sentence. When all modules are done, the Topic Id Integration Module combines their scores to produce the overall ranking. The final result is the top-ranked n% of sentences as its final result, where n is specified by the user.</Paragraph>
    <Section position="1" start_page="200" end_page="200" type="sub_section">
      <SectionTitle>
3.1 Position Module
</SectionTitle>
      <Paragraph position="0"> The Position Module is based on the well-known fact that certain genres exhibit such structural and expositional regularity that one can reliably locate important sentences in certain fixed positions in the text. In early studies, Luhn (1959) and Edmundson (1968) identified several privileged positions, such as first and last sentences.</Paragraph>
      <Paragraph position="1"> We generalized their results (Lin and Hovy, 97), developing a method for automatically identifying the sentence positions most likely to yield good summary sentences. The training phase of this method calculates the yield of each sentence position by comparing the similarity between human-created abstracts and the contents of the sentence in each ordinal position in the corresponding texts. By summing over a large collection of text-abstract pairs from the same corpus and appropriately normalizing, we create the Optimal Position Policy (OPP), a ranked list that indicates in what ordinal positions in the text the high-topic-bearing sentences tend to occur. We tested this method on two corpora: the Ziff-Davis texts (13,000 newspaper articles announcing computer products) and a set of several thousand. Wall Street Journal newspaper articles. For the Ziff-Davis corpus we found the OPP to be \[T1, P2S1, P3S1, P4S1, PIS1, P2S2, {P3S2, P4S2, P5S1, P1S2}, P6S1 .... \] i.e., the title (T1) is the most likely to bear topics, followed by the first sentence of paragraph 2, the first sentence of paragraph 3, etc. (Paragraph 1 is invariably a teaser sentence in this corpus.) In contrast, for the Wall Street Journal, we found the OPP to be \[T1,P1S1,P1S2 .... \] We evaluated the OPP method in various ways. One measured coverage, the fraction of the (human-supplied) keywords that are included verbatim in the sentences selected under the policy. (A random selection policy would extract sentences with a random distribution of topics; a good position policy would extract rich topic-bearing sentences.) We measured the effectiveness of an OPP by taking cumulatively more of its sentences: first just the title, then the title plus P2S1, and so on.</Paragraph>
      <Paragraph position="2"> Summing together the multi-word contributions in the top ten sentence positions, 10-sentence extracts (approx. 15% of a typical Ziff-Davis text) intersected with 95% of the corresponding human abstracts.</Paragraph>
      <Paragraph position="3"> In addition to the OPP itself, we created OPP keywords, by counting the number of times each open-class word appeared in an OPP-selected sentence, and sorting them by frequency. Any other sentence with a high number of these keywords can also be rewarded with an appropriate score.</Paragraph>
      <Paragraph position="4"> In operation, the Position Module simply selects an appropriate OPP for the input text, assigns a score to each sentence in order of the OPP, and then computes the OPP keyword list for the text. It then assigns additional scores to sentences according to how many OPP keywords they contain. These scores can be seen in Figure l(b), in the item opp=x,y on each line. The first number provides the global OPP score (the score of this word, summed over the whole text) and the second score the local OPP (the score of this sentence in the OPP).</Paragraph>
    </Section>
    <Section position="2" start_page="200" end_page="201" type="sub_section">
      <SectionTitle>
3.2 Cue Phrase Module
</SectionTitle>
      <Paragraph position="0"> In pioneering work, Baxendale (1958) identified two sets of phrases--bonus phrases and stigma phrases--that tend to signal when a sentence is a likely candidate for inclusion in a summary and when it is definitely not a candidate, respectively. Bonus phrases such as &amp;quot;in summary&amp;quot;, &amp;quot;in conclusion&amp;quot;, and superlatives such as &amp;quot;the best&amp;quot;, &amp;quot;the most important&amp;quot; can be good indicators of important content. During processing, the Cue Phrase Module simply rewards each sentence containing a cue phrase with an appropriate score (constant per cue pfirase) and penalizes those containing stigma phrases.</Paragraph>
      <Paragraph position="1"> Unfortunately, cue phrases are genre dependent. For example, &amp;quot;Abstract&amp;quot; and &amp;quot;in conclusion&amp;quot; are more likely to occur in scientific literature than in newspaper articles.</Paragraph>
      <Paragraph position="2"> Given this genre-dependence, the major problem with cue phrases is identifying them. A natural method is to identify high-yield sentences in texts (compared to their human-made abstracts) and then to identify common phrases in those sentences. A careful study on the automated collection of cue phrases is reported in (Teufel and Moens, 98).</Paragraph>
      <Paragraph position="3">  In the context of SUMMARIST, we have tried several methods of acquiring cue phrases. In one experiment, we manually compiled a list of cue phrases from a training corpus of paragraphs that themselves were summaries of texts. In this corpus, sentences containing phrases such as &amp;quot;this paper&amp;quot;, &amp;quot;this article&amp;quot;, &amp;quot;this document&amp;quot;, and &amp;quot;we conclude&amp;quot; fairly reliably reflected the major content of the paragraphs. This indicated to us the possibility of summarizing a summary.</Paragraph>
      <Paragraph position="4"> In another experiment, we examined methods to automatically generate cue phrases (Liu and Hovy, in prep.). We examined various counting methods, all of them comparing the ratios of occurrence densities of words in summaries and in the corresponding texts in various ways, and then extracted the words showing the highest increase in occurrence density between text and associated abstract. Finally, we searched for frequent concatenations of such privileged words into phrases.</Paragraph>
      <Paragraph position="5"> While we found no useful phrases in a corpus of 1,000 newspaper articles, we found the following in 87 articles on Computational Linguistics: Method 1 Method 2 $1 phrase $2 phrase 11.50 multiling, natural lang. 3.304 in this paper 8.500 paper presents the 2.907 this paper we 7.500 paper gives 2.723 base on the 6.000 paper presents 2.221 a set of 6.000 now present 2.192 the result of 5.199 this paper presents 2.000 the number of 4.555 paper describes 1.896 in order to In method 1, S1 = wc~, the total number of words occurring in the summary that co-occur with the word w in any sentence, normalized by the total number of words. In method 2, $2 = w c~ * df/D, where D is the total number of training documents and df is the number of documents in which the word being counted appears.</Paragraph>
      <Paragraph position="6"> The Cue Phrase Module was not applied in the example in Figure l(b), since we have not trained cue phrases for Indonesian.</Paragraph>
    </Section>
    <Section position="3" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
3.3 Topic Signature Module
</SectionTitle>
      <Paragraph position="0"> In a straightforward application of word counting, one might surmise that words that occur most frequently in the text may possibly indicate important material.</Paragraph>
      <Paragraph position="1"> Naturally, one has to rule out closed-class words such as &amp;quot;the&amp;quot; and &amp;quot;in&amp;quot;. A common method is to create a list of words using ~.idf, a measure that rewards words for being relatively frequent--much more frequent in one text than on average, across the corpus. This method, pioneered a decade ago (Salton, 88), is used in IR systems to achieve query term expansion.</Paragraph>
      <Paragraph position="2"> The same idea can be used in topic identification. On the assumption that semantically related words tend to cooccur, one can construct word families and then count the frequency of word families instead of individual words.</Paragraph>
      <Paragraph position="3"> A frequent word family will indicate the importance of its common semantic notion(s) in the text. To implement this idea, we define a Topic Signature as a topic word (the head) together with a list of associated (keyword weight) pairs. Each topic signature represents a semantic concept using word co-occurrence patterns. We describe in Section 4.2 how we automatically build Topic Signatures and plan to use them for topic interpretation.</Paragraph>
      <Paragraph position="4"> For use in topic identification, we created a Topic Signature for each of five groups of 200 documents, drawn from five domains. When performing topic identification for a document, the Topic Id Signature Module scanned each sentence, assigning to each word that occurred in a signature the weight of that keyword in the signature. Each sentence then received a signature score equal to the total of all signature word scores it contained, normalized by sentence length. This score indicated the relevance of the sentence to the signature topic.</Paragraph>
      <Paragraph position="5"> Since we have no signatures for Indonesian word families, no signature score appears in Figure l(b).</Paragraph>
      <Paragraph position="6"> However, the tf.idf score of each word (comparing the frequencies of each term in the text and across a collection of Indonesian texts) appears in the item tfidf=x.</Paragraph>
    </Section>
    <Section position="4" start_page="201" end_page="204" type="sub_section">
      <SectionTitle>
3.4 Discourse Structure Module
</SectionTitle>
      <Paragraph position="0"> A new module that uses discourse structure is under construction for SUMMARIST. This module, being built by Daniel Marcu, is an extension of his Ph.D. work (Marcu, 97). It is based on the fact that texts are not simply flat lists of sentences; they have a hierarchical structure, one in which certain clauses are more important than others. After parsing the hierarchical structure of an input text and then identifying the important clauses in this structure, Marcu discards unimportant clauses and retains only the most important ones, still bound together within the discourse structure, and hence still forming a coherent text.</Paragraph>
      <Paragraph position="1"> To produce the text's discourse structure, Marcu adapted Rhetorical Structure Theory (Mann and Thompson, 88), which postulates approximately 25 relations that bind clauses (or groups of clauses) together if they exhibit certain semantic and pragmatic properties. These relations are signaled by so-called cue phrases such as &amp;quot;but&amp;quot; and &amp;quot;however&amp;quot; (for the relation Contrast), &amp;quot;in order to&amp;quot; and &amp;quot;because&amp;quot; (for the relation Cause), &amp;quot;then&amp;quot;  and &amp;quot;next&amp;quot; (for Sequence), and so on. Most relations are binary, having a principal component (the Nucleus) and a subsidiary one (the Satellite). Relations can be nested recursively; a text is only coherent if all its clauses can be linked together, first in local subtrees and then in progressively larger ones, under a single overarching relation. Marcu uses a constraint satisfaction algorithm to assemble all the trees that legally organize the input text, and then employs several heuristics to prefer one tree over the others.</Paragraph>
      <Paragraph position="2"> To produce an extract summary of the input text, Marcu simply discards the least salient material, in order, by traversing the discourse structure top-down, following Satellite links only.</Paragraph>
      <Paragraph position="3"> Marcu's subsequent this work combines the discourse structure paradigm with several other surface-based methods, including cue phrases, the discourse tree shape, title words, position, and so on (Marcu, 98a). Using an automated coefficient learning method, he finds that the best linear combination of values for all these methods.</Paragraph>
      <Paragraph position="4"> Evaluation shows that the resulting extracts approximate human performance on both newspaper articles and  modules, each sentence has been assigned several different scores. Some method is required to combine these scores into a single score, so that the most important topic-bearing sentence can be ranked first. However, it is not immediately clear how the various scores should be combined for the best result. Various approaches have been described in the literature. Most of them employ some sort of combination function, in which coefficients assign various weights to the individual scores, which are then summed. (Kupiec et al., 95) and (Aone et al., 97) employ the Expectation Maximization algorithm to derive coefficients for their systems.</Paragraph>
      <Paragraph position="5"> Initially, we implemented for SUMMARIST a straightforward linear combination function, in which we specified the coefficients manually, by experimentation.</Paragraph>
      <Paragraph position="6"> This hand tuning method is good for getting a feeling of how various modules can affect the SUMMARIST output, but it does not guarantee consistent performance over a large collection. As we found in the formal TIPSTER-SUMMAC evaluation of various summarization systems (Firmin Hand and Sundheim, 98), the results of this function were decidedly non-optimal, and did not show the potential and power of the system! Since consistent performance and graceful degradation are very important for SUMMARIST, alternative combination functions were needed.</Paragraph>
      <Paragraph position="7"> In subsequent work, we tested two automated methods of creating better combination functions. These methods assumed that a set of texts with their ideal summaries are available, and that information contained in the ideal summaries can be reliably recovered from their corresponding original texts. The sentences in the original texts that were also included in the summaries we called target sentences.</Paragraph>
      <Paragraph position="8"> Unfortunately, we know of no large corpus of texts with truly ideal summaries. Therefore, as training data, we used a portion of the results of the TIPSTER-SUMMAC summarization evaluation dry run, annotated to indicate the relevance and popularity of each sentence, as aggregated over the ratings of several systems (Baldwin, 98). This collection contains 403 summaries containing 4,830 training instances/sentences, which are judged as relevant to TREC topics 110, 132, 138, 141, and 151.</Paragraph>
      <Paragraph position="9"> (Note that contributions from topics are not uniform.</Paragraph>
      <Paragraph position="10"> Specific (topic/contribution) numbers are 110/226, 132/122, 138/50, 141/42, and 151/63.) Since the sentence scores are based on the consensus votes of the summaries resulting from six different experimental summarization systems participating in the dry run, no single sentence relevance judgement is available to construct a truly ideal summary set. Thus the consensus selected target sentences should be called 'pseudo-ideal sentences'.</Paragraph>
      <Paragraph position="11"> Please refer to (Firmin Hand and Sundheim, 98) for a more detail description of the TIPSTER-SUMMAC dry run evaluation procedure and setup.</Paragraph>
      <Paragraph position="12"> We then employed two methods to automatically learn the combination function(s) that identified in each training text the most target sentences.</Paragraph>
      <Paragraph position="13"> The first method is a decision tree learning algorithm based on C4.5 (Quinlan, 86). Each module's outcome is used as a feature in the learning space. The normalized score '(from 0 to 1 inclusive) of each module for each sentence is used as its feature value. A feature vector is a six-tuple: (TTL: vl, TF: v2, TFIDF: v3, SIG: v4, OPP: v5, QRY: v6). TTL indicates the score from the title module; TF, the term frequency module; TF1DF, the ~.idf module; SIG, the topic signature module; OPP, the position module; and QRY, the query signature module.</Paragraph>
      <Paragraph position="14"> All sentences included in the ideal summary of a text are positive examples, all others are negative examples.</Paragraph>
      <Paragraph position="15"> To fully utilize limited training data, we followed the standard decision tree training and validation procedure  and conducted a 5-way cross-validation test. The algorithm generated a tree of 1,611 nodes, of which the top (most informative) questions pertain to the query signature, term frequency, overlap with title, and OPP.</Paragraph>
      <Paragraph position="16"> Compared with the manually built function, the decision tree is considerably better. With the linear combination function, SUMMARIST used to score 33.02% (Recall and Precision) on an unseen test set of 82 dry-run texts. On the same data, SUMMARIST now scores 58.07% (Recall and Precision) in the 5-way cross-validation test. This represents an improvement of 25%. It is important to understand that this 58.07% score should not be interpreted as how frequently SUMMARIST produces and recovers relevant summaries. This figure is obtained by evaluating against every sentence contained in the pseudo-ideal summaries. Thus it is simply a measure of how well SUMMARIST correctly recovers the pseudo-ideal sentences. The final judgement has to be made by human analysts who judge the sentences extracted by SUMMARIST as a whole, a judgement that unfortunately we cannot carry out on a large scale with our limited staff. However, the figure is still a valuable performance indicator. No summaries can be called good summaries if they do not contain any summary-worthy sentences.</Paragraph>
      <Paragraph position="17"> For the second method, we followed the same setup as the decision tree training method mentioned above but implemented a 6-node perceptron as the learning engine.</Paragraph>
      <Paragraph position="18"> Training it on the same data produced results within 1% of the decision tree.</Paragraph>
      <Paragraph position="19"> When measuring overall summarization performance, we conjecture that the performance of SUMMARIST with the decision tree, tested in a relevance judgement setting such as the TIPSTER-SUMMAC evaluation, should lie somewhere in the 70% range. As explained above, using the decision tree, SUMMARIST's performance increased by 25% (= 57% of its initial score). Adding this improvement to its score of 49% in the Ad Hoc normalized best summary category (Firmin Hand and Sundheim, 98) places it with that range. We look forward to new evaluations like SUMMAC.</Paragraph>
      <Paragraph position="20"> We have recently trained a new decision tree, using different data. The new training data derives from the Question and Answer summary evaluation data provided by TIPSTER-SUMMAC. The principal difference between the (Baldwin, 98) data and the new Q&amp;A data is that the latter contains essential text fragments (phrases, clauses, and sentences) which must be included in summaries to answer some TREC topics. These fragments are judged by two human judges and are thus much more accurate training data. SUMMARIST trained on the Q&amp;A data should therefore perform better than the version trained on the older data.</Paragraph>
      <Paragraph position="21"> An example extract summary of the Indonesian text in Figure 1, using the latest decision tree as combination function, appears in Figure 2. A detailed description of the aforementioned training experiments and the improved combination function appears in (Lin, in prep.).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="204" end_page="206" type="metho">
    <SectionTitle>
4. Phase 2: Topic Interpretation
</SectionTitle>
    <Paragraph position="0"> The second phase of summarization--going from extract to abstract--is considerably more complex than the first, for it requires that the system have recourse to world knowledge. Without knowledge of the world, no system can fuse together the topics extracted to produce a smaller number of topics to form an abstract. Yet one cannot simply ignore abstraction: extracts are not adequate for many tasks. In one study, (Marcu, 98b) counted how many clauses had to be extracted from a text in order to fully contain all the material included in a human abstract of that text. Working with a newspaper corpus of 10 texts and 14 judges, he found a compression factor of 2.76--in that genre, extracts are almost three times as long (counting words) as their corresponding abstracts! Results of this kind indicate the need for summarization systems to further process extracted material: to remove redundancies, rephrase sentences to pack material more densely, and, importantly, to merge or fuse related topics into more 'general' ones using world knowledge.</Paragraph>
    <Paragraph position="1"> The major problem encountered in the abstraction process is the acquisition of such world knowledge. In this section we describe two experiments performed in the context of SUMMARIST that investigate topic interpretation.</Paragraph>
    <Section position="1" start_page="204" end_page="204" type="sub_section">
      <SectionTitle>
4.1. Concept Counting and the Wavefront
</SectionTitle>
      <Paragraph position="0"> One of the most straightforward examples of topic fusion is concept generalization: John bought some apples, pears, and oranges.</Paragraph>
      <Paragraph position="1"> ---y John bought some fruit.</Paragraph>
      <Paragraph position="2"> Using a concept generalization taxonomy called WordNet (Miller et al., 90), we have developed a method to recognize that apple, pear, etc., can be summarized as fruit. The idea is simple. We first identify the WordNet equivalent of each topic word in the text, and then locate an appropriate generalization concept.</Paragraph>
      <Paragraph position="3"> To identify an appropriate generalization concept, we need to count frequencies of occurrence of concepts (after all, apple and pear can equally well be generalized as plant-product or physical-object). Our algorithm (Lin 95) first counts the number of occurrences of each content word in the text, and assigns that number to the word's associated concept in WordNet. It then propagates all these weights upward, assigning to each node the sum of its weight plus all its childrens' weights. Next, it proceeds back down, deciding at each node whether to stop or to continue downward. The algorithm stops when the node is an appropriate generalization of its children; that is, when its weight derives so equally from two or more of its children that no child is the clear majority contributor to its weight. To find the wavefront, we define a concept's weight to be the sum of the frequency of occurrence of the concept C plus the weights of all its subconcepts. We then define the concept frequency ratio between a concept and its subconcepts: R ~ MAX(sum of all children of C) SUM(sum of all children of C) This criterion selects the most specific generalization of a set of concepts as their fuser.</Paragraph>
      <Paragraph position="4"> This algorithm can be extended to identify successive layers of fuser concepts. As described in (Lin 95), the first set of fuser concepts the algorithm locates need not be the last; one can continue downward, in order to locate more specific generalizations. Each stopping frontier we call an interesting wavefront. By repeating the wavefront location process until it reaches the leaf concepts of the hierarchy, the algorithm derives a set of interesting wavefronts. From all the interesting wavefronts, one can choose the most general one below a certain depth D to ensure a good balance of generality and specificity. For WordNet, we found D=6, by experimentation.</Paragraph>
      <Paragraph position="5"> To evaluate the results of this type of fusion, we selected 26 articles about new computer products from BusinessWeek (1993-94) of average 750 words each. For each text we extracted the eight sentences containing the most interesting concepts using the wavefront technique, and compared them to the contents of a professional's abstracts of these 26 texts from an online service. We developed several weighting and scoring variations and tried various ratio and depth parameter settings for the algorithm. We also implemented a random sentence selection algorithm as a baseline comparison.</Paragraph>
      <Paragraph position="6"> The results were promising, though not startling.</Paragraph>
      <Paragraph position="7"> Average recall (R) and precision (P) values over the three scoring variations were R=0.32 and P=0.35, when the system produces extracts of 8 sentences. In comparison, the random selection method scored R=O.18 and P=0.22 in the same experimental setting. These values show that semantic knowledge can help enable improvements over traditional IR word-based techniques. However, the limitations of WordNet are serious drawbacks: it contains almost no domain-specific knowledge.</Paragraph>
    </Section>
    <Section position="2" start_page="204" end_page="206" type="sub_section">
      <SectionTitle>
4.2 Interpretation using Topic Signatures
</SectionTitle>
      <Paragraph position="0"> Before addressing the problem of world knowledge acquisition head-on, we decided to investigate what type of knowledge would be useful for topic interpretation.</Paragraph>
      <Paragraph position="1"> After all, one can spend a lifetime acquiring knowledge in  just a small domain. How little knowledge does one need to enable effective concept fusion? Our idea, again, was simple. We would collect a set of words that were typically associated with a target word, and then, during interpretation, replace the occurrence of the related words by the target word. For example, we would replace joint instances of table, menu, waiter, order, eat, pay, tip, and so on, by the single phrase restaurant-visit, in producing an indicative summary. We thus defined a topic signature as a family of related words, as follows:</Paragraph>
      <Paragraph position="3"> where head is the target word and each w~ is an related word with association strength s~.</Paragraph>
      <Paragraph position="4"> As described in (Lin 97), we constructed signatures automatically from a set of 30,000 texts from the 1987 Wall Street Journal (WSJ) corpus. The paper's editors have classified each text into one of 32 classes. Within the texts of each class, we counted the occurrences of each content word (demorphed to remove plurals, etc.), relative to the number of times they occur in the whole corpus, using the standard oSidfmethod. We then selected the top-scoring 300 terms for each category and created a signature with the category name as its head. The top terms of four example signatures are shown in Figure 3.</Paragraph>
      <Paragraph position="5"> It is quite easy to determine the identity of the signature head just by inspecting the top few signature words.</Paragraph>
      <Paragraph position="6"> In order to evaluate the quality of the signatures formed by the algorithm, we evaluated the effectiveness of each signature by seeing how well it served as a selection criterion on texts. While this is not our intended use of signatures, document categorization is a well-known task with enough results in the literature to give us a sense of the performance of our methods. As data we used a set of 2,204 previously unseen WSJ news articles from 1988.</Paragraph>
      <Paragraph position="7"> For each test text, we created a single-text 'document signature', again using tf.idf, and then matched this document signature against the category signatures. The closest match provided the class into which the text was categorized. We tested several matching functions, including a simple binary match (count 1 if a term match occurs; 0 otherwise); curve-fit match (minimize the difference in occurrence frequency of each term between document and concept signatures), and cosine match (minimize the cosine angle in the hyperspace formed when each signature is viewed as a vector and each word frequency specifies the distance along the dimension for that word). These matching functions all provided approximately the same results. The values for Recall and Precision (R=0.7566 and P=0.6931) are encouraging and compare well with recent IR results (TREC, 95).</Paragraph>
      <Paragraph position="8"> Current experiments are investigating the use of contexts smaller than a full text to create more accurate signatures.  These results are encouraging enough to allow us to continue with topic signatures as the vehicle for a first approximation to world knowledge, as useful for topic interpretation.</Paragraph>
      <Paragraph position="9"> Considerable subsequent experimentation (Hovy and Junk, in prep.) with a variety of methods, including Latent Semantic Analysis, pairwise signatures, etc., indicates that the most promising method of creating signatures is Z 2. We are now busy creating a large number of signatures in an attempt to overcome the world knowledge acquisition problem.</Paragraph>
      <Paragraph position="10"> 5. Phase 3: Summary Generation We have devoted no effort to adapting an existing language generator or developing a new one for SUMMARIST. It has scarcely been necessary, since SUMMARIST is principally an extract-only system at present. However, we envisage the language generation needs of summarization systems in general, and SUMMARIST in particular, to require two major steps.</Paragraph>
      <Paragraph position="11"> The mieroplanner: The task of a microplanner, in general, is to convert a discourse-level specification of a sequence of topics into a list of specifications at the level of one or a few clauses at a time. This involves making choices for several semi-independent aspects, including sentence length, internal sentence organization (order of preposition phrases, active or passive mood, etc.), identification of the principal theme and focus units, selection of the main verb and other important words, and so on. In the context of summarization, the microplanner's task is to ensure that the information selected by the topic identification module and (possibly) fused by .the topic interpretation module is phrased compactly and as briefly as possible though still in a grammatical sentence.</Paragraph>
      <Paragraph position="12"> The microplanner can be built to perform its work at two levels: the textual level, in which its input is a list of sentences or sentence fragments, and its output is a compacted list of sentences, and the representational level, in which its input is couched in an abstract notation (whether more or less explicitly syntactic depends on the implementation), and its output is a fairly syntactic abstract specification of each sentence.</Paragraph>
      <Paragraph position="13"> In the former case, the output is more or less directly readable by a human, while in the latter, the output has to be converted into grammatical sentences by the sentence generator.</Paragraph>
      <Paragraph position="14"> Microplanning is an area still largely unexplored by computational linguists. The work that has been done, including some of the major systems (Nirenburg et al., 89; Rambow and Korelsky, 92; Hovy and Wanner, 96), is not really appropriate to the specific compactionrelated needs of summarization. The most relevant study on microplanning for summarization is (McKeown and Radev, 95).</Paragraph>
      <Paragraph position="15"> The sentence generator: The task of a sentence generator (often called realizer) is to convert a fairly detailed specification of one or a few clause-sized units into a grammatical sentence. A number of relatively easy to use sentence generators is available to the research community, including Penman (Penman, 88), FUF/SURGE (Elhadad, 92), RealPro (Lavoie and Rambow, 97), and NITROGEN (Langkilde and Knight, 98). We plan to employ one or more of these in SUMMARIST.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="206" end_page="207" type="metho">
    <SectionTitle>
6. Summary Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="206" end_page="207" type="sub_section">
      <SectionTitle>
6.1 Two Basic Measures
</SectionTitle>
      <Paragraph position="0"> How can you evaluate the quality of a summary? We have found no literature on this fascinating question. Indeed, many anecdotes and experiences lead one to believe that the uses of summaries are so task-specific and user-oriented that no objective measurement is possible. When the inter-judge scoring variability is higher than half the average score, as small tests relating to summaries occasionally have suggested, then perhaps there is no hope.</Paragraph>
      <Paragraph position="1"> However, it is possible to develop some general guidelines and approaches, and from them to develop some approximations to summarization evaluation. We give a very rough sketch here of some work performed in the context of SUMMARIST in early 1997; more details are in (Hovy, in prep.).</Paragraph>
      <Paragraph position="2"> It is obvious that to be a summary, the summary must obey two requirements: * it must be shorter than the original input text; * it must contain (some of) the same information as the original, and not other, new, information.</Paragraph>
      <Paragraph position="3">  One can then define two measures to capture the extent to which a summary S conforms to these requirements with regard to a text T:  However we choose to measure the length and the information content, we can say that a good summary is one in which CR is small (tending to zero) while RR is large (tending to unity). We can characterize summarization systems and/or text types by plotting the ratios of the summaries produced under varying conditions. For</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="207" end_page="207" type="metho">
    <SectionTitle>
RR
</SectionTitle>
    <Paragraph position="0"> example, Figure 4(a) shows a fairly normal growth curve: as the summary gets longer (grows along the x axis, which measures CR), it contains more information (grows also along the y axis, which measures RR), until it is just as long at the original text and contains the same information. In contrast, Figure 4(b) shows a curve with a very desirable bend: at some special point, the addition of just a little more material to the summary adds a disproportionately large amount more of information. Figure 4(c) shows another desirable behavior: initially, all the important material is included in the summary; as it grows, the new material is less interesting.</Paragraph>
  </Section>
class="xml-element"></Paper>