<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1072">
  <Title>Improving Summaries by Revising Them</Title>
  <Section position="5" start_page="559" end_page="564" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluation of text summarization and other such NLP technologies where there may be many acceptable outputs, is a difficult task. Recently, the U.S. government conducted a large-scale evaluation of summarization systems as part of its TIPSTER text processing program (Mani et al. 1999), which included both an extrinsic (relevance assessment) evaluation, as well as an intrinsic (coverage of key ideas) evaluation. The test set used in the latter (Q&amp;:A) evaluation along with several automatically scored measures of informativeness has been reused in evaluating the informativeness of our revision component.</Paragraph>
    <Section position="1" start_page="559" end_page="559" type="sub_section">
      <SectionTitle>
4.1 Background: TIPSTER Q&amp;A
Evaluation
</SectionTitle>
      <Paragraph position="0"> In this Q&amp;A evaluation, the summarization system, given a document and a topic, needed to produce an informative, topic-related summary that contained the correct answers found in that document to a set of topic-related questions.</Paragraph>
      <Paragraph position="1"> These questions covered &amp;quot;obligatory&amp;quot; information that has to be provided in any document judged relevant to the topic. The topics chosen (3 in all) were drawn from the TREC (Harman and Voorhees 1996) data sets. For each topic, 30 relevant TREC documents were chosen as the source texts for topic-related summarization. The principal tasks of each Q&amp;A evaluator were to prepare the questions and answer keys and to score the system summaries. To construct the answer key, each evaluator marked off any passages in the text that provided an answer to a question (example shown in Table 1).</Paragraph>
      <Paragraph position="2"> Two kinds of scoring were carried out. In the first, a manual method, the answer to each question was judged Correct, Partially Correct, or Missing based on guidelines involving a human comparison of the summary of a document against the set of tagged passages for that question in the answer key for that document.</Paragraph>
      <Paragraph position="3"> The second method of scoring was an automatic method. This program 7 took as input a key file and a summary to be scored, and returns an informativeness score on four different metrics.</Paragraph>
      <Paragraph position="4"> The key file includes tags identifying passages in the file which answer certain questions. The scoring uses the overlap measures shown in Table 2 s. The automatically computed V4 thru V7 informativeness scores were strongly correlated with the human-evaluated scores (Pearson r &gt; .97, ~ &lt; 0.0001). Given this correlation, we decided to use these informativeness measures.</Paragraph>
    </Section>
    <Section position="2" start_page="559" end_page="561" type="sub_section">
      <SectionTitle>
4.2 Revision Evaluation:
Informativeness
</SectionTitle>
      <Paragraph position="0"> To evaluate the revised summaries, we first converted each summary into a weighting function which scored each full-text sentence in the summary's source in terms of its similarity to the most similar summary sentence. The weight of a source document sentence s given a sum7The program was reimplemented by us for use in the revision evaluation.</Paragraph>
      <Paragraph position="1"> S Passage matching here involves a sequential match with stop words and punctuation removed.</Paragraph>
      <Paragraph position="2">  computer networks by nonauthorized personnel. Narrative : Illegal entry into sensitive computer networks is a serious and potentially menacing problem. Both 'hackers' and foreign agents have been known to acquire unauthorized entry into various networks. Items relative this subject would include but not be limited to instances of illegally entering networks containing information of a sensitive nature to specific countries, such as defense or technology information, international banking, etc. Items of a personal nature (e.g. credit card fraud, changing of college test scores) should not be considered relevant.</Paragraph>
    </Section>
    <Section position="3" start_page="561" end_page="563" type="sub_section">
      <SectionTitle>
Questions
</SectionTitle>
      <Paragraph position="0"> full credit if the text spans for all tagged key passages are found in their entirety in the summary full credit if the text spans for all tagged key passages are found in their entirety in the summary; haft credit if the text spans for all tagged key passages are found in some combination of full or truncated form in the summary full credit if the text spans for all tagged key passages are found in some combination of full or truncated form in the summary percentage of credit assigned that is commensurate with the extent to which the text spans for tagged key passages are present in the summary  initial drafts. E -- elimination, A - aggregation. A, E, and A+E are shown in the order V4, V5, V6, and V7.</Paragraph>
      <Paragraph position="1"> &lt;sl&gt; Researchers today tried to trace a &amp;quot;virus&amp;quot; that infected computer systems nationwide, &lt;Q4&gt; slowing machines in universities, a NASA and nuclear weapons lab and other federal research centers linked by a Defense Department computer network. &lt;/q4&gt; &lt;s3&gt; Authorities said the virus, which &lt;FROM S16&gt; &lt;Q3&gt; the virus infected only unclassified computers &lt;/Q3&gt; and &lt;FROM $15&gt; &lt;Q3&gt; the virus affected the unclassified, non-secured computer systems &lt;/q3&gt; (and which &lt;FROM S19&gt; &lt;Q4&gt; the virus was %nainly just slowing down systems ) and slowing data &amp;quot;, &lt;/Q4&gt; apparently &lt;q4&gt; destroyed no data but temporarily halted some research. &lt;/Q4&gt; &lt;s14&gt;. The computer problem also was discovered late Wednesday at the &lt;q3&gt; Lawrence Livermore National Laboratory in Livermore, Calif. &lt;/Q3&gt; &lt;s15&gt; &lt;s20&gt; &amp;quot;the developer was clearly a very high order hacker,&amp;quot;, &lt;FROM $25&gt; &lt;QI&gt; a graduate student &lt;/QI&gt; &lt;Q2&gt; who made making a programming error in designing the virus,causing the program to replicate faster than expected &lt;/q2&gt; or computer buff, said  face) and deleted (italics) spans. Sentence &lt;s&gt; and Answer Key &lt;Q&gt; tags are overlaid. mary is the match score of s's best-matching summary sentence, where the match score is the percentage of content word occurrences in s that are also found in the summary sentence.</Paragraph>
      <Paragraph position="2"> Thus, we constructed an idealized model of each summary as a sentence extraction function.</Paragraph>
      <Paragraph position="3"> Since some of the participants truncated and occasionally mangled the source text (in addition, Penn carried out pronoun expansion), we wanted to avoid having to parse and apply revision rules to such relatively ill-formed material.</Paragraph>
      <Paragraph position="4"> This idealization is highly appropriate, for each of the summarizers considered 9 did carry out sentence extraction; in addition, it helps level the playing field, avoiding penalization of individual summarizers simply because we didn't cater to the particular form of their summary.</Paragraph>
      <Paragraph position="5"> Each summary was revised by calling the revision program with the full-text source, the original compression rate of the summary, and 9TextWise, which extracted named entities rather than passages, was excluded.</Paragraph>
      <Paragraph position="6">  the summary weighting function (i.e., with the weight for each source sentence). The 630 revised summaries (3 topics x 30 documents per topic x 7 participant summaries per document) were then scored against the answer keys using the overlap measures above. The documents consisted of AP, Wall Street Journal, and Financial Times news articles from the TREC (Harman and Voorhees 1996) collection.</Paragraph>
      <Paragraph position="7"> The rules used in the system are very general, and were not modified for the evaluation except for turning off most of the reference adjustment rules, as we wished to evaluate that component separately. Since the answer keys typically do not contain names of commentators, we wanted to focus the algorithm away from such names (otherwise, it would aggregate information around those commentators). As a result, special rules were written in the revision rule language to detect commentator names in reported speech (&amp;quot;X said that ..&amp;quot;, &amp;quot;X said ...&amp;quot;, &amp;quot;, said X..&amp;quot;, &amp;quot;, said X..&amp;quot;, etc.), and these names were added to a stoplist for use in entityhood and coreference tests during regular revision rule application.</Paragraph>
      <Paragraph position="8"> Figure 3 shows percentage of losses, maintains, and wins in informativeness against the initial draft (i.e., the result of applying the compression to the sentence weighting function).</Paragraph>
      <Paragraph position="9"> Informativeness using V7 is measured by V71deg normalized for compression as: sl nV7 = V7 * (1 - ~-~) (1) where sl is summary length and sO is the source length. This initial draft is in itself not as informative as the original summary: in all cases except for Penn on 257, the initial draft either maintains or loses informativeness compared to the original summary.</Paragraph>
      <Paragraph position="10"> As Figure 3 reveals (e.g., for nVT), revising the initial draft using elimination rules only (E) results in summaries which are less informative than the initial draft 65% of the time, suggesting that these rules are removing informative material. Revising the initial draft using aggregation rules alone (A), by contrast, results in more informative summaries 47% of the time, and equally informative summaries another 13% 1degV7 computes for each question the percentage of its answer passages completely covered by the summary.</Paragraph>
      <Paragraph position="11"> This normalization is extended similarly for V4 thru V6.</Paragraph>
      <Paragraph position="12"> of the time. This is due to aggregation folding in additional informative material into the initial draft when it can. Inspection of the output summaries, an example of which is shown in Figure 4, confirms the folding in behavior of aggregation. Finally, revising the initial draft using both aggregation and elimination rules (ATE) does no more than maintain the informativeness of the initial draft, suggesting A and E are canceling each other out. The same trend is observing for nV4 thru nV6, confirming that the relative gain in informativeness due to aggregation is robust across a variety of (closely related) measures. Of course, if the revised summaries were instead radically different in wording from the original drafts, such informativeness measures would, perhaps, fall short.</Paragraph>
      <Paragraph position="13"> It is also worth noting the impact of aggregation is modulated by the current control strategy; we don't know what the upper bound is on how well revision could do given other control regimes. Overall, then, while the results are hardly dramatic, they are certainly encouraging zl.</Paragraph>
    </Section>
    <Section position="4" start_page="563" end_page="564" type="sub_section">
      <SectionTitle>
4.3 Revision Evaluation: Readability
</SectionTitle>
      <Paragraph position="0"> Inspection of the results of revision indicates that the syntactic well-formedness revision criterion is satisfied to a very great extent. Improper extraction from coordinated NPs is an issue (see Figure 4), but we expect additional revision rules to handle such cases. Coherence disfiuencies do occur; for example, since we don't resolve possessive pronouns or plural definites, we can get infelicitous revisions like &amp;quot;A computer virus, which entered ,their computers through ARPANET, infected systems from MIT.&amp;quot; Other limitations in definite NP coreference can and do result in infelicitous reference adjustments. For one thing, we don't link definites to proper name antecedents, resulting in inappropriate indefinitization (e.g., &amp;quot;Bill Gates ... *A computer tycoon&amp;quot;). In addition, the &amp;quot;same head word&amp;quot; test doesn't of course address inferential relationships between the definite NP and its antecedent (even when the antecedent is explicitly mentioned), again resulting in inappropriate indefinitization (e.g., &amp;quot;The program ....a developer ~', and &amp;quot;The developer 11 Similar results hold while using a variety of other compression normalization metrics.</Paragraph>
      <Paragraph position="1">  ... An anonymous caller said .a very high order hacker was a graduate student&amp;quot;).</Paragraph>
      <Paragraph position="2"> To measure fluency without conducting an elaborate experiment involving human judgmentsl we fell back on some extremely coarse measurea based on word and sentence length computed by the (gnu) unix program style (Cherry 1981). The FOG index sums the average sentence length with the percentage of words over 3 syllables, with a &amp;quot;grade&amp;quot; level over 12 indicating difficulty for the average reader.</Paragraph>
      <Paragraph position="3"> The Kincaid index, intended for technical text, computes a weighted sum of sentence length and word length. As can be seen from Table 3, there is a slight but significant lowering of scores on both metrics, revealing that according to these metrics revision is not resulting in more complex text. This suggests that elimination rather than aggregation is mainly responsible for this.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>