<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2020">
  <Title>Topic-Focused Multi-document Summarization Using an Approximate Oracle Score</Title>
  <Section position="5" start_page="152" end_page="153" type="metho">
    <SectionTitle>
3 The Oracle Score
</SectionTitle>
    <Paragraph position="0"> Recently, a crisp analysis of the frequency of content words used by humans relative to the high frequency content words that occur in the relevant documents has yielded a simple and powerful summarization method called SumBasic (Nenkova and Vanderwende, 2005). SumBasic produced extract summaries which performed nearly as well as the best machine systems for generic100wordsummaries, asevaluatedinDUC 2003 and 2004, as well as the Multi-lingual Summarization Evaluation (MSE 2005).</Paragraph>
    <Paragraph position="1"> Instead of using term frequencies of the corpus to infer highly likely terms in human summaries, we propose to directly model the set of terms (vocabulary) that is likely to occur in a sample of human summaries. We seek to estimate the probability that a term will be used by a human summarizer to first get an estimate of the best possible extract and later to produce a statistical model for an extractive summary system. While the primary focus of this work is &amp;quot;task oriented&amp;quot; summaries, we will also address a comparison with SumBasic and other systems on generic multi-document summaries for the DUC 2004 dataset in Section 8.</Paragraph>
    <Paragraph position="2"> Our extractive summarization system is given a topic, t, specified by a text description. It then evaluates each sentence in each document in the set to determine its appropriateness to be included in the summary for the topic t.</Paragraph>
    <Paragraph position="3"> We seek a statistic which can score an individual sentence to determine if it should be included as a candidate. We desire that this statistic take into account the great variability that occurs in the space of human summaries on a given topic t. One possibility is to simply judge a sentence based upon the expected fraction of the &amp;quot;human summary&amp;quot;-terms that it contains. We posit an oracle, which answers the question &amp;quot;Does human summary i contain the term t?&amp;quot; By invoking this oracle over the set of terms and a sample of human summaries, we can readily compute the expected fraction of human summary-terms the sentence contains. To model the variation in human summaries, we use the oracle to build a probabilistic model of the space of human abstracts. Our &amp;quot;oracle score&amp;quot; will then compute the expected number of summary terms a sentence contains, where the expectation is taken from the space of all human summaries on the topic t.</Paragraph>
    <Paragraph position="4"> We model human variation in summary generation with a unigram bag-of-words model on the terms. In particular, consider P(t|t) to be the probability that a human will select term t in a summary given a topic t. The oracle score for a sentence x, o(x), can then be defined in terms of</Paragraph>
    <Paragraph position="6"> where |x |is the number of distinct terms sentence x contains, T is the universal set of all terms used in the topic t and x(t) = 1 if the sentence x contains the term t and 0 otherwise. (We affectionally refer to this score as the &amp;quot;Average Jo&amp;quot; score, as it is derived the average uni-gram distribution of terms in human summaries.) While we will consider several approximations to P(t|t) (and, correspondingly, o), we first explore the maximum-likelihood estimate of P(t|t) given by a sample of human summaries. Suppose we are given h sample summaries generated independently. Let cit(t) = 1 if the i-th summary contains the term t and 0 otherwise. Then the</Paragraph>
    <Paragraph position="8"> cit(t).</Paragraph>
    <Paragraph position="9"> We define ^o by replacing P with ^P in the definition of o. Thus, ^o is the maximum-likelihood estimate for o, given a set of h human summaries. Given the score ^o, we can compute an extract summary of a desired length by choosing the top scoring sentences from the collection of documents until the desired length (250 words) is obtained. We limit our selection to sentences which have 8 or more distinct terms to avoid selecting incomplete sentences which may have been tagged by the sentence splitter.</Paragraph>
    <Paragraph position="10"> Before turning to how well our idealized score, ^o, performs on extract summaries, we first define  thescoringmechanismusedtoevaluatethesesummaries. null</Paragraph>
  </Section>
  <Section position="6" start_page="153" end_page="153" type="metho">
    <SectionTitle>
4 ROUGE
</SectionTitle>
    <Paragraph position="0"> The state-of-the-art automatic summarization evaluation method is ROUGE (Recall Oriented Understudy for Gisting Evaluation, (Hovy and Lin 2002)), an n-gram based comparison that was motivated by the machine translation evaluation metric, Bleu (Papineni et. al. 2001). This system uses a variety of n-gram matching approaches, some of which allow gaps within the matches as well as more sophistcated analyses. Surprisingly, simple unigram and bigram matching works extremely well. For example, at DUC 05, ROUGE-2 (bigram match) had a Spearman correlation of 0.95 and a Pearson correlation of 0.97 when compared with human evaluation of the summaries for responsiveness (Dang 2005). ROUGE-n for matching n[?]grams of a summary X against h model human summaries is given by:</Paragraph>
    <Paragraph position="2"> where Xn(i) is the count of the number of times the n-gram i occurred in the summary and Mn(i,j) is the number of times the n-gram i occurred in the j-th model (human) summary.</Paragraph>
    <Paragraph position="3"> (Note that for brevity of notation, we assume that lemmatization (stemming) is done apriori on the terms.) When computing ROUGE scores, a jackknife procedure is done to make comparison of machine systems and humans more amenable. In particular, if there are k human summaries available for a topic, then the ROUGE score is computed for a human summary by comparing it to the remaining k [?] 1 summaries, while the ROUGE score for a machine summary is computed against all k sub-sets of size k [?] 1 of the human summaries and taking the average of these k scores.</Paragraph>
  </Section>
  <Section position="7" start_page="153" end_page="154" type="metho">
    <SectionTitle>
5 The Oracle or Average Jo Summary
</SectionTitle>
    <Paragraph position="0"> We now present results on the performance of the oracle method as compared with human summaries. We give the ROUGE-2 (R2) scores as well as the 95% confidence error bars. In Figure 1, the human summarizers are represented by the letters A-H, and systems 15, 17, 8, and 4 are the top performing machine summaries from DUC 05. The letter &amp;quot;O&amp;quot; represents the ROUGE-2 scores for extract summaries produced by the oracle score, ^o. Perhaps surprisingly, the oracle producedextractswhichperformedbetterthanthehu- null man summaries! Since each human only summarized 10 document clusters, the human error bars are larger. However, even with the large error bars, we observe that the mean ROUGE-2 scores for the oracle extracts exceeds the 95% confidence error bars for several humans.</Paragraph>
    <Paragraph position="1"> While the oracle was, of course, given the unigram term probabilities, its performance is notable on two counts. First, the evaluation metric scored on 2-grams, while the oracle was only given unigram information. In a sense, optimizing for ROUGE-1 is a &amp;quot;sufficient statistic&amp;quot; scoring at  the human level for ROUGE-2. Second, the humans wrote abstracts while the oracle simply did extracting. Consequently, the documents contain sufficient text to produce human-quality extract summaries as measured by ROUGE. The human performance ROUGE scores indicate that this approach is capable of producing automatic extractive summaries that produce vocabulary comparable to that chosen by humans. Human evaluation (which we have not yet performed) is required to determine to what extent this high ROUGE-2 performance is indicative of high quality summaries for human use.</Paragraph>
    <Paragraph position="2"> The encouraging results of the oracle score naturally lead to approximations, which, perhaps, will give rise to strong machine system performance. Our goal is to approximate P(t|t), the probability that a term will be used in a human abstract. In the next section, we present two approaches which will be used in tandem to make this approximation.</Paragraph>
  </Section>
  <Section position="8" start_page="154" end_page="156" type="metho">
    <SectionTitle>
6 Approximating P(t|τ)
</SectionTitle>
    <Paragraph position="0"> We seek to approximate P(t|t) in an analogous fashion to the maximum-likelihood estimate ^P(t|t). To this end, we devise methods to isolate a subset of terms which would likely be included in the human summary. These terms are gleaned from two sources, the topic description and the collection of documents which were judged relevanttothetopic. Theformerwillgiverisetoquery terms and the latter to signature terms.</Paragraph>
    <Section position="1" start_page="154" end_page="154" type="sub_section">
      <SectionTitle>
6.1 Query Term Identification
</SectionTitle>
      <Paragraph position="0"> A set of query terms is automatically extracted from the given topic description. We identified individual words and phrases from both the &lt;topic&gt; (Title) tagged paragraph as well as whichever of the &lt;narr&gt; (Narrative) Set d408c: approximate, casualties, death, human, injury, number, recent, storms, toll, total, tropical, years Set d436j: accidents, actual, causes, damage, events, injured, killed, prevent, result, train, train wrecks, trains, wrecks  tagged paragraphs occurred in the topic description. We made no use of the &lt;granularity&gt; paragraph marking. We tagged the topic description using the POS-tagger, NLProcessor (http://www.infogistics.com/posdemo.htm), and any words that were tagged with any NN (noun), VB (verb), JJ (adjective), or RB (adverb) tag were included in a list of words to use as query terms. Table 1 shows a list of query terms for our two illustrative topics.</Paragraph>
      <Paragraph position="1"> Thenumberofquerytermsextractedinthisway ranged from a low of 3 terms for document set d360f to 20 terms for document set d324e.</Paragraph>
    </Section>
    <Section position="2" start_page="154" end_page="154" type="sub_section">
      <SectionTitle>
6.2 Signature Terms
</SectionTitle>
      <Paragraph position="0"> The second collection of terms we use to estimate P(t|t) are signature terms. Signature terms are the terms that are more likely to occur in the document set than in the background corpus. They are generally indicative of the content contained in the collection of documents. To identify these terms,weusethelog-likelihoodstatisticsuggested by Dunning (Dunning 1993) and first used in summarization by Lin and Hovy (Hovy and Lin 2000).</Paragraph>
      <Paragraph position="1"> The statistic is equivalent to a mutual information statistic and is based on a 2-by-2 contingency table of counts for each term. Table 2 shows a list of signature terms for our two illustrative topics.</Paragraph>
    </Section>
    <Section position="3" start_page="154" end_page="156" type="sub_section">
      <SectionTitle>
6.3 An estimate of P(t|τ)
</SectionTitle>
      <Paragraph position="0"> To estimate P(t|t), we view both the query terms and the signature terms as &amp;quot;samples&amp;quot; from idealized human summaries. They represent the terms that we would most likely see in a human summary. As such, we expect that these sample terms may approximate the underlying set of human summary terms. Given a collection of query terms and signature terms, we can readily estimate our target objective, P(t|t) by the following:</Paragraph>
      <Paragraph position="2"> Set d408c: ahmed, allison, andrew, bahamas, bangladesh, bn, caribbean, carolina, caused, cent, coast, coastal, croix, cyclone, damage, destroyed, devastated, disaster, dollars, drowned, flood, flooded, flooding, floods, florida, gulf, ham, hit, homeless, homes, hugo, hurricane, insurance, insurers, island, islands, lloyd, losses, louisiana, manila, miles, nicaragua, north, port, pounds, rain, rains, rebuild, rebuilding, relief, remnants, residents, roared, salt, st, storm, storms, supplies, tourists, trees, tropical, typhoon, virgin, volunteers, weather, west, winds, yesterday.</Paragraph>
      <Paragraph position="3"> Set d436j: accident, accidents, ammunition, beach, bernardino, board, boulevard, brake, brakes, braking, cab, car, cargo, cars, caused, collided, collision, conductor, coroner, crash, crew, crossing, curve, derail, derailed, driver, emergency, engineer, engineers, equipment, fe, fire, freight, grade, hit, holland, injured, injuries, investigators, killed, line, locomotives, maintenance, mechanical, miles, morning, nearby, ntsb, occurred, officials, pacific, passenger, passengers, path, rail, railroad, railroads, railway, routes, runaway, safety, san, santa, shells, sheriff, signals, southern, speed, station, train, trains, transportation, truck, weight, wreck  where st(t)=1 if t is a signature term for topic t and 0 otherwise and qt(t) = 1 if t is a query term for topic t and 0 otherwise.</Paragraph>
      <Paragraph position="4"> More sophisticated weightings of the query and signature have been considered; however, for this paper we limit our attention to the above elementary scheme. (Note, in particular, a psuedorelevance feedback method was employed by (Conroy et. al. 2005), which gives improved performance.) null Similarly, we estimate the oracle score of a sentence's expected number of human abstract terms as</Paragraph>
      <Paragraph position="6"> where |x |is the number of distinct terms that sentence x contains, T is the universal set of all terms and x(t) = 1 if the sentence x contains the term t and 0 otherwise.</Paragraph>
      <Paragraph position="7"> Forboththeoraclescoreandtheapproximation, we form the summary by taking the top scoring sentences among those sentences with at least 8 distinct terms, until the desired length (250 words fortheDUC05data)isachievedorexceeded. (The threshold of 8 was based upon previous analysis of the sentence splitter, which indicated that sentences shorter than 8 terms tended not be be well formed sentences or had minimal, if any, content.) If the length is too long, the last sentence chosen is truncated to reach the target length.</Paragraph>
      <Paragraph position="8"> Figure 2 gives a scatter plot of the oracle score o and its approximation oqs for all sentences with at least 8 unique terms. The overall Pearson correlation coefficient is approximately 0.70. The correlation varies substantially over the topics. Figure 3 gives a histogram of the Pearson correlation coefficients for the 50 topic sets.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="156" end_page="157" type="metho">
    <SectionTitle>
7 Enhancements
</SectionTitle>
    <Paragraph position="0"> In the this section we explore two approaches to improvethequalityofthesummary, linguisticpreprocessing (sentence trimming) and a redundancy removal method.</Paragraph>
    <Section position="1" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
7.1 Linguistic Preprocessing
</SectionTitle>
      <Paragraph position="0"> We developed patterns using &amp;quot;shallow parsing&amp;quot; techniques, keying off of lexical cues in the sentences after processing them with the POS-tagger.</Paragraph>
      <Paragraph position="1"> We initially used some full sentence eliminations along with the phrase eliminations itemized below; analysis of DUC 03 results, however, demonstrated that the full sentence eliminations were not useful.</Paragraph>
      <Paragraph position="2"> The following phrase eliminations were made,  for these eliminations. Comparison of two runs in DUC 04 convinced us of the benefit of applying these phrase eliminations on the full documents, prior to summarization, rather than on the selected sentences after scoring and sentence selection had been performed. See (Conroy et. al. 2004) for details on this comparison.</Paragraph>
      <Paragraph position="3"> After the trimmed text has been generated, we then compute the signature terms of the document sets and recompute the approximate oracle scores. Note that since the sentences have usually had some extraneous information removed, we expect some improvement in the quality of the signature terms and the resulting scores. Indeed, the median ROUGE-2 score increases from 0.078 to 0.080.</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="157" type="sub_section">
      <SectionTitle>
7.2 Redundancy Removal
</SectionTitle>
      <Paragraph position="0"> The greedy sentence selection process we described in Section 6 gives no penalty for sentences which are redundant to information already contained in the partially formed summary. A method for reducing redundancy can be employed. One popular method for reducing redundancy is maximum marginal relevance (MMR) (2). Based on previous studies, we have found that a pivoted QR, a method from numerical linear algebra, has some advantages over MMR and performs somewhat better.</Paragraph>
      <Paragraph position="1"> Pivoted QR works on a term-sentence matrix formed from a set of candidate sentences for inclusion in the summary. We start with enough sentences so the total number of terms is approximately twice the desired summary length. Let B be the term-sentence matrix with Bij = 1 if sentence j contains term i.</Paragraph>
      <Paragraph position="2"> The columns of B are then normalized so their 2-norm (Euclidean norm) is the corresponding approximate oracle score, i.e. oqs(bj), where bj is thej-thcolumnofB.Wecallthisnormalizedterm sentence matrix A.</Paragraph>
      <Paragraph position="3"> Given a normalized term-sentence matrix A, QR factorization attempts to select columns of A in the order of their importance in spanning the subspace spanned by all of the columns. The standard implementation of pivoted QR decomposition is a &amp;quot;Gram-Schmidt&amp;quot; process. The first r sentences (columns) selected by the pivoted QR are used to form the summary. The number r is chosen so that the summary length is close to the target length. A more complete description can be found in (Conroy and O'Leary 2001).</Paragraph>
      <Paragraph position="4"> Note, that the selection process of using the pivoted QR on the weighted term sentence matrix will first choose the sentence with the highest opq score as was the case with the greedy selection process. Its subsequent choices are affected by previous choices as the weights of the columns are decreased for any sentence which can be approximated by a linear combination of the current set of selected sentences. This is more general than simply demanding that the sentence have small overlap with the set of previous chosen sentences as  Approximations ^o vs. Humans and Peers would be done using MMR.</Paragraph>
    </Section>
  </Section>
</Paper>