<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1101">
  <Title>Improving Summarization Performance by Sentence Compression - A Pilot Study</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A Noisy-Channel Model for Sentence
Compression
</SectionTitle>
    <Paragraph position="0"> Knight and Marcu (K&amp;M) (2000) introduced two sentence compression algorithms, one based on the noisy-channel model and the other decision-based.</Paragraph>
    <Paragraph position="1"> We use the noisy-channel model in our experiments since it is able to generate a list of ranked candidates, while the decision-based is not.</Paragraph>
    <Paragraph position="2"> * Source model P(s) - The compressed sentence language model. This would assign low probability to short sentences with undesirable features, for example, ungrammatical or  too short.</Paragraph>
    <Paragraph position="3"> * Channel model P(t  |s) - Given a compressed sentence s, the channel model assigns the probability of an original sentence, t, which could have been generated by s.</Paragraph>
    <Paragraph position="4"> * Decoder - Given the original sentence t, find the best short sentence s generated from t, i.e. maximizing P(s  |t). This is equivalent to maximizing P(t  |s)*P(s).</Paragraph>
    <Paragraph position="5">  We used K&amp;M's sentence compression algorithm as it was and did not retrain on new corpus. We also adopted the compression length-adjusted log probability to avoid the tendency of selecting very short compressions. Figure 1 shows a list of compressions for the sentence &amp;quot;In Louisiana, the hurricane landed with wind speeds of about 120 miles per hour and caused severe damage in small coastal centres such as Morgan City, Franklin and New Iberia.&amp;quot; ranked according to their length-adjusted log-probability.</Paragraph>
    <Paragraph position="6">  based multi-document summarization system. It is among the top two performers in DUC 2001 and 2002 (Over and Liggett, 2002). It consists of three main components: * Content Selection - The goal of content selec null tion is to identify important concepts mentioned in a document collection. NeATS computes the likelihood ratio l (Dunning, 1993) to identify key concepts in unigrams, bigrams, and trigrams, and clusters these concepts in order to identify major subtopics within the main topic. Each sentence in the document set is then ranked, using the key concept structures. These n-gram key concepts are called topic signatures (Lin and Hovy 2000). We used key n-grams to rerank compressions in our experiments.</Paragraph>
    <Paragraph position="7"> * Content Filtering - NeATS uses three different filters: sentence position, stigma words, and maximum marginal relevancy. Sentence position has been used as a good content filter since the late 60s (Edmundson, 1969). We apply a simple sentence filter that only retains the 10 lead sentences. Some sentences start with stigma words such as conjunctions, quotation marks, pronouns, and the verb &amp;quot;say&amp;quot; and its derivatives usually cause discontinuity in summaries. We simply reduce the scores of these sentences to demote their ranks and avoid including them in summaries of small sizes. To Number of Words Adjusted Log-Prob Raw Log-Prob Sentence  ana, the hurricane landed with wind speeds of about 120 miles per hour and caused severe damage in small coastal centres such as Morgan City, Franklin and New Iberia.&amp;quot; address the redundancy problem, we use a simplified version of CMU's MMR (Goldstein et al., 1999) algorithm. A sentence is added to the summary if and only if its content has less than X percent overlap with the summary.</Paragraph>
    <Paragraph position="8"> * Content Presentation - To ensure coherence of the summary, NeATS pairs each sentence with an introduction sentence. It then outputs the final sentences in their chronological order.</Paragraph>
    <Paragraph position="9"> We ran NeATS to generate summaries of different sizes that were used as our test bed. The topic signatures created in the process were used to rerank compressions. We describe the automatic evaluation metric used in our experiments in the next section. null</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Unigram Co-Occurrence Metric
</SectionTitle>
    <Paragraph position="0"> In a recent study (Lin and Hovy, 2003a), we showed that the recall-based unigram co-occurrence automatic scoring metric correlates highly with human evaluation and has high recall and precision in predicting the statistical significance of results comparing with its human counterpart. The idea is to measure the content similarity between a system extract and a manual summary using simple n-gram overlap. A similar idea called IBM BLEU score has proved successful in automatic machine translation evaluation (NIST, 2002; Papineni et al., 2001). For summarization, we can express the degree of content overlap in terms of n-gram matches as the following equation:  They are typically either sentences or elementary discourse units as defined by Marcu (1999). Countmatch null (n-gram) is the maximum number of n-grams co-occurring in a system extract and a model unit. Count(n-gram) is the number of n-grams in the model unit. Notice that the average n-gram coverage score, C n , as shown in equation 1, is a recall-based metric, since the denominator of equation 1  is the sum total of the number of n-grams occurring in the model summary instead of the system summary and only one model summary is used for each evaluation. In summary, the unigram co-occurrence statistics we use in the following sections are based on the following formula:</Paragraph>
    <Paragraph position="2"> 1/(j-i+1). Ngram(1, 4) is a weighted variable length n-gram match score similar to the IBM BLEU score; while Ngram(k, k), i.e. i = j = k, is simply the average k-gram co-occurrence score C k . In this study, we set i = j = 1, i.e. unigram co-occurrence score.</Paragraph>
    <Paragraph position="3"> With an automatic scoring metric defined, we describe the experimental setup in the next section.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experimental Designs
</SectionTitle>
    <Paragraph position="0"> As stated in the introduction, we aim to investigate the effectiveness of sentence compression on over-all system performance. If we can have a lossless compression function that compresses a given sentence to a minimal length and still retains the most important content of the sentence then we would be able to pack more information content into a fixed size summary. Figure 2 illustrates this effect &lt;multi size=&amp;quot;225&amp;quot; docset=&amp;quot;d19d&amp;quot; org-size=&amp;quot;227&amp;quot; comp-size=&amp;quot;227&amp;quot;&gt; Lawmakers clashed on 06/23/1988 over the question of counting illegal aliens in the 1990 Census, debating whether following the letter of the Constitution results in a system that is unfair to citizens. The forum was a Census subcommittee hearing on bills which would require the Census Bureau to figure out whether people are in the country legally and, if not, to delete them from the counts used in reapportioning seats in the House of Representatives. Simply put, the question was who should be counted as a person and who, if anybody, should not. The point at issue in Senate debate on a new immigration bill was whether illegal aliens should be counted in the process that will reallocate House seats among states after the 1990 census. The national head count will be taken April 1, 1990. In a blow to California and other states with large immigrant populations, the Senate voted on 09/29/1989 to bar the Census Bureau from counting illegal aliens in the 1990 population count. At stake are the number of seats in Congress for California, Florida, New York, Illinois, Pennsylvania and other states that will be reapportioned on the basis of next year's census. Federal aid to states also is frequently based on population counts, so millions of dollars in grants and other funds made available on a per capita basis would be affected.</Paragraph>
    <Paragraph position="1">  &lt;multi size=&amp;quot;225&amp;quot; docset=&amp;quot;d19d&amp;quot; org-size=&amp;quot;227&amp;quot; comp-size=&amp;quot;98&amp;quot;&gt; Lawmakers clashed over question of counting illegal aliens Census debating whether results. Forum was a Census hearing, to delete them from the counts. Simply put question was who should be counted and who, if anybody, should not. Point at issue in debate on an immigration bill was whether illegal aliens should be counted. National count will be taken April 1, 1990. Senate voted to bar Census Bureau from counting illegal aliens. At stake are number of seats for California New York. Aid to states is frequently based on population counts, so millions would be affected.</Paragraph>
    <Paragraph position="2">  graphically. For document AP900424-0035, which consists of 23 sentences or 417 words, we generate the full permutation set of sentence extracts, i.e., all possible 100+-5, 150+-5, and 200+-5 words extracts.</Paragraph>
    <Paragraph position="3"> The 100+-5 words extract at average compression ratio of 0.76 has most of its unigram co-occurrence score instances (18,284/61,762 [?] 30%) falling within the interval between 0.40 and 0.50, i.e., the expected performance of an extraction-based system would be between 0.40 and 0.50. The 150+-5 words extract at lower compression ratio of 0.64 has most of its instances between 0.50 and 0.60 (115,240/377,933 [?] 30%) and the 200+-5 words extract at compression ratio of 0.52 has most of its instances between 0.70 and 0.80 (212,116/731,819 [?] 29%). If we can compress 150 or 200-word summaries into 100 words and retain their important content, we would be to achieve an average 30% to 50% increase in performance.</Paragraph>
    <Paragraph position="4"> The question is: can an off-the-shelf sentence compression algorithm such as K&amp;M's noisy-channel model achieve this? If the answer is yes, then how much performance gain can be achieved? If not, are there other ways to use sentence compression to improve system performance? To improve system performance? To answer these questions, we conduct the following experiments</Paragraph>
    <Paragraph position="6"> topic sets and generate summaries of size: 100, 120, 125, 130, 140, 150, 160, 175, 200, 225, 250, 275, 300, 325, 350, 375, and 400.</Paragraph>
    <Paragraph position="7"> (2) Run K&amp;M's sentence compression algorithm over all summary sentences (run KM). For each summary sentence, we have a set of candidate compressions.</Paragraph>
    <Paragraph position="8"> See Figure 1 for example.</Paragraph>
    <Paragraph position="9"> (3) Rerank each candidate compression set using different scoring methods: a. Rerank each candidate compression set using topic signatures (run SIG).</Paragraph>
    <Paragraph position="10"> b. Rerank each candidate compression set using combination of KM and SIG scores using linear interpolation of topic signature score (SIG) and K&amp;M's log-probability score (KM). We use the following formula in this experiment:  (green) indicates runs on the column that are significantly better than runs on the row; dark gray indicates significantly worse.</Paragraph>
    <Paragraph position="12"> l is set to 2/3 (run SIGKMa).</Paragraph>
    <Paragraph position="13"> c. Rerank each candidate compression set using SIG score first and then KM is used to break ties (run SIGKMb).</Paragraph>
    <Paragraph position="14"> d. Rerank each candidate compression set using unigram co-occurrence score against manual references.</Paragraph>
    <Paragraph position="15"> This gives the upper bound for the K&amp;M's algorithm applied to the output generated by NeATS (run ORACLE).</Paragraph>
    <Paragraph position="16"> (4) Select the best compression combination. For a given length constraint, for example 100 words, we produce the final result by selecting a compressed summary across different summary sizes for each topic that fits the length limit (&lt;= 100+-5 words), and output them as the final summary. For example, we found that a 227-word summary for topic D19 could be compressed to 98 words using the topic signature reranking method. The compressed summary would then be selected as the final summary for topic D19. Figure 3 shows the original 227-word summary and Figure 4 shows its compressed version.</Paragraph>
    <Paragraph position="17"> There were 30 test topics in DUC 2001 and each topic contained about 10 documents. For each topic, four summaries of approximately 50, 100, 200, and 400 words were created manually as the 'ideal' model summaries. We used the set of 100-word manual summaries as our references in our experiments. An example manual summary is shown in Figure 5. We report results of these experiments in the next section.</Paragraph>
  </Section>
class="xml-element"></Paper>