<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4001">
  <Title>Using N-Grams to Understand the Nature of Summaries</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Using N-gram Sequences to
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Characterize Summaries
</SectionTitle>
      <Paragraph position="0"> Our approach to characterizing summaries is much simpler than what Jing has described and is based on the following idea: if human-written summaries are extractive, then we should expect to see long spans of text that have been lifted from the source documents to form a summary.</Paragraph>
      <Paragraph position="1"> Note that this holds under the assumptions made by Jing's model of operations that are performed by human summarizers. In the examples of operations given by Jing, we notice that long n-grams are preserved (designated by brackets), even in the operations mostly likely to disrupt the original text:  Jing considers a sentence to have been generated from scratch if fewer than half of the words were composed of terms coming from the original document.</Paragraph>
      <Paragraph position="2">  The range in potential gains is due to possible variations in summary length.</Paragraph>
      <Paragraph position="3"> Sentence Reduction: Document sentence: When it arrives sometime next year in new TV sets, the V-chip will give parents a new and potentially revolutionary device to block out programs they don't want their children to see. Summary sentence: [The V-chip will give parents a] [device to block out programs they don't want their children to see.] Syntactic Transformation: Document sentence: Since annoy.com enables visitors to send unvarnished opinions to political and other figures in the news, the company was concerned that its activities would be banned by the statute. Summary sentence: [Annoy.com enables visitors to send unvarnished opinions to political and other figures in the news] and feared the law could put them out of business.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Sentence Combination:
</SectionTitle>
      <Paragraph position="0"> Document sentence 1: But it also raises serious questions about the privacy of such highly personal information wafting about the digital world.</Paragraph>
      <Paragraph position="1"> Document sentence 2: The issue thus fits squarely into the broader debate about privacy and security on the Internet, whether it involves protecting credit card numbers or keeping children from offensive information.</Paragraph>
      <Paragraph position="2"> Summary sentence: [But it also raises] the issue of [privacy of such] [personal information] and this issue hits the nail on the head [in the broader debate about privacy and security on the Internet.]</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Data and Experiments
</SectionTitle>
      <Paragraph position="0"> For our experiments we used data made available from the 2001 Document Understanding Conference (DUC), an annual large-scale evaluation of summarization systems sponsored by the National Institute of Standards and Technology (NIST). In this corpus, NIST has gathered documents describing 60 events, taken from the Associated Press, Wall Street Journal, FBIS San Jose Mercury, and LA Times newswires. An event is described by between 3 and 20 separate (but not necessarily unique) documents; on average a cluster contains 10 documents. Of the 60 available clusters, we used the portion specifically designated for training, which contains a total of 295 documents distributed over 30 clusters.</Paragraph>
      <Paragraph position="1"> As part of the DUC 2001 summarization corpus, NIST also provides four hand-written summaries of different lengths for every document cluster, as well as 100-word summaries of each document. Since we wished to collectively compare single-document summaries against multi-document summaries, we used the 100-word multi-document summaries for our analysis. It is important to note that for each cluster, all summaries (50, 100, 200 and 400-word multi-document and 100-word per-document) have been written by the same author. NIST used a total of ten authors, each providing summaries for 3 of the 30 topics. The instructions provided did not differ per task; in both single and multi-document scenarios, the authors were directed to use complete sentences and told to feel free to use their own words (Over, 2004).</Paragraph>
      <Paragraph position="2"> To compare the text of human-authored multi-document summaries to the full-text documents describing the events, we automatically broke the documents into sentences, and constructed a minimal tiling of each summary sentence. Specifically, for each sentence in the summary, we searched for all n-grams that are present in both the summary and the documents, placing no restrictions on the potential size of an n-gram. We then covered each summary sentence with the ngrams, optimizing to use as few n-grams as possible (i.e. favoring n-grams that are longer in length). For this experiment, we normalized the data by converting all terms to lowercase and removing punctuation.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> On average, we found the length of a tile to be 4.47 for single-document summaries, compared with 2.33 for multi-document summaries. We discovered that 61 out of all 1667 hand-written single-document summary sentences exactly matched a sentence in the source document, however we did not find any sentences for which this was the case when examining multi-document summaries.</Paragraph>
      <Paragraph position="1"> We also wanted to study how many sentences are fully tiled by phrases coming from exactly one sentence in the document corpus, and found that while no sentences from the multi-document summaries matched this criteria, 7.6% of sentences in the single-document summaries could be tiled in this manner. When trying to tile sentences with tiles coming from only one document sentence, we found that we could tile, on average, 93% of a single-document sentence in that manner, compared to an average of 36% of a multi-document sentence.</Paragraph>
      <Paragraph position="2"> This suggests that for multi-document summarization, we are not seeing any instances of what can be considered single-sentence compression. Table 1 summarizes the findings we have presented in this section.</Paragraph>
      <Paragraph position="3">  Figure 1 shows the relative frequency with which a summary sentence is optimally tiled using tile-sizes up to 25 words in length in both the single and multi-document scenarios. The data shows that the relative frequency with which a single-document summary sentence is optimally tiled using n-grams containing 3 or more words is consistently higher compared to the multi-document case. Not shown on the histogram (due to insufficient readability) is that we found 379 tiles (of approximately 86,000) between 25 and 38 words long covering sentences from single-document summaries.</Paragraph>
      <Paragraph position="4"> No tiles longer than 24 words were found for multi-document summaries.</Paragraph>
      <Paragraph position="5"> In order to test whether tile samples coming from tiling of single-document summaries and multi-document summaries are likely to have come from the same underlying population, we performed two one-tailed unpaired t-tests, in one instance assuming equal variances, and in the other case asssuming the variances were unequal. For these statistical significance tests, we randomly sampled 100 summary sentences from each task, and extracted the lengths of the n-grams found via minmal tiling. This resulted in the creation of a sample of 551 tiles for single-document sentences and 735 tiles for multi-document sentences.</Paragraph>
      <Paragraph position="6"> For both tests (performed with a=0.05), the P-values were low enough (0.00033 and 0.000858, respectively) to be able to reject the null hypothesis that the average tile length coming from single-document summaries is the same as the average tile length found in multi-document summaries. We chose to use a one-tailed P-value because based on our experiments we already suspected that the single-document tiles had a larger mean.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> Our experiments show that when writing multi-document summaries, human summarizers do not appear to be cutting and pasting phrases in an extractive fashion. On average, they are borrowing text around the bigram level, instead of extracting long sequences of words or full sentences as they tend to do when summarizing a single document. The extent to which human summarizers form extractive summaries during single and multi-document summarization was found to be different at a level which is statistically significant.</Paragraph>
    <Paragraph position="1"> These findings are additionally supported by the fact that automatic n-gram-based evaluation measures now being used to assess predominately extractive multi-document summarization systems correlate strongly with human judgments when restricted to the usage of unigrams and bigrams, but correlate weakly when longer n-grams are factored into the equation (Lin &amp; Hovy, 2003). In the future, we wish to apply our method to other corpora, and to explore the extent to which different summarization goals, such as describing an event or providing a biography, affect the degree to which humans employ rewriting as opposed to extraction.</Paragraph>
    <Paragraph position="2"> Despite the unique requirements for multi-document summarization, relatively few systems have crossed over into employing generation and reformulation (McKeown &amp; Radev, 1995, Nenkova, et al. 2003). For the most part, summarization systems continue to be based on sentence extraction methods. Considering that humans appear to be generating summary text that differs widely from sentences in the original documents, we suspect that approaches which make use of generation and reformulation techniques may yield the most promise for multi-document summarization. We would like to empirically quantify to what extent current summarization systems reformulate text, by applying the techniques presented in this paper to system output.</Paragraph>
    <Paragraph position="3"> Finally, the potential impact of our findings with respect to recent evaluation metrics should not be overlooked. Caution must be given when employing automatic evaluation metrics based on the overlap of n-grams between human references and system summaries.</Paragraph>
    <Paragraph position="4"> When reference summaries do not contain long n-grams drawn from the source documents, but are instead generated in the author's own words, the use of a large number of reference summaries becomes more critical.</Paragraph>
  </Section>
class="xml-element"></Paper>