<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1041">
  <Title>Headline Generation Based on Statistical Translation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> Zero level-Model: The system was trained on approximately 25,000 news articles from Reuters dated between 1/Jan/1997 and 1/Jun/1997. After punctuation had been stripped, these contained about 44,000 unique tokens in the articles and slightly more than 15,000 tokens in the headlines.</Paragraph>
    <Paragraph position="1"> Representing all the pairwise conditional probabilities for all combinations of article and headline words3 added significant complexity, so we simplified our model further and investigated the effectiveness of training on a more limited vocabulary: the set of all the words that appeared in any of the headlines.4 Conditional probabilities for words in the headlines that also appeared in the articles were computed. As discussed earlier, in our zero-level model, the system was also trained on bigram transition probabilities as an approximation to the headline syntax. Sample output from the system using this simplified model is shown in Figures 1 and 3.</Paragraph>
    <Paragraph position="2">  zero-level model, that we have discussed so far, works surprisingly well, given its strong independence assumptions and very limited vocabulary. There are problems, some of which are most likely due to lack of sufficient training data.5 Ideally, we should want to evaluate the system's performance in terms both of content selection success and realization quality. However, it is hard to computationally evaluate coherence and phrasing effectiveness, so we have, to date, restricted ourselves to the content aspect, which is more amenable to a quantitative analysis. (We have experience doing much more laborious human eval3This requires a matrix with 660 million entries, or about 2.6GB of memory. This requirement can be significantly reduced by using a threshold to prune values and using a sparse matrix representation for the remaining pairs. However, inertia and the easy availability of the CMU-Cambridge Statistical Modeling Toolkit - which generates the full matrix - have so far conspired to prevent us from exercising that option.</Paragraph>
    <Paragraph position="3"> 4An alternative approach to limiting the size of the mappings that need to be estimated would be to use only the top a82 words, where a82 could have a small value in the hundreds, rather than the thousands, together with the words appearing in the headlines. This would limit the size of the model while still allowing more flexible content selection.</Paragraph>
    <Paragraph position="4"> 5We estimate that approximately 100MB of training data would give us reasonable estimates for the models that we would like to evaluate; we had access to much less.</Paragraph>
    <Paragraph position="5">  Mideast advisers, including Secretary of State Madeleine Albright and U.S. peace envoy Dennis Ross, in preparation for a session with Israel Prime Minister Benjamin Netanyahu tomorrow. Palestinian leader Yasser Arafat is to meet with Clinton later this week. Published reports in Israel say Netanyahu will warn Clinton that Israel can't withdraw from more than nine percent of the West Bank in its next  scheduled pullback, although Clinton wants a 12-15 percent pullback.</Paragraph>
    <Paragraph position="6"> 1: clinton -6 0 2: clinton wants -15 2 3: clinton netanyahu arafat -21 24 4: clinton to mideast peace -28 98 5: clinton to meet netanyahu arafat -33 298 6: clinton to meet netanyahu arafat is null and system generated output using the simplest, zero-level, lexical model. Numbers to the right are log probabilities of the string, and search beam size, respectively.</Paragraph>
    <Paragraph position="7"> uation, and plan to do so with our statistical approach as well, once the model is producing summaries that might be competitive with alternative approaches.) After training, the system was evaluated on a separate, previously unseen set of 1000 Reuters news stories, distributed evenly amongst the same topics found in the training set. For each of these stories, headlines were generated for a variety of lengths and compared against the (i) the actual headlines, as well as (ii) the sentence ranked as the most important summary sentence. The latter is interesting because it helps suggest the degree to which headlines used a different vocabulary from that used in the story itself.6 Term over- null cal model for content selection on 1000 Reuters news articles. The headline length given is that a which the overlap between the terms in the target headline and the generated summary was maximized. The percentage of complete matches indicates how many of the summaries of a given length had all their terms included in the target headline.</Paragraph>
    <Paragraph position="8"> lap between the generated headlines and the test standards (both the actual headline and the summary sentence) was the metric of performance.</Paragraph>
    <Paragraph position="9"> For each news article, the maximum overlap between the actual headline and the generated headline was noted; the length at which this overlap was maximal was also taken into account. Also tallied were counts of headlines that matched completely - that is, all of the words in the generated headline were present in the actual headline - as well as their lengths. These statistics illustrate the system's performance in selecting content words for the headlines. Actual headlines are often, also, ungrammatical, incomplete phrases. It is likely that more sophisticated language models, such as structure models (Chelba, 1997; Chelba and Jelinek, 1998), or longer n-gram models would lead to the system generating headlines that were more similar in phrasing to real headlines because longer range dependencies shelf Carnegie Mellon University summarizer, which was the top ranked extraction based summarizer for news stories at the 1998 DARPA-TIPSTER evaluation workshop (Tip, 1998). This summarizer uses a weighted combination of sentence position, lexical features and simple syntactical measures such as sentence length to rank sentences. The use of this summarizer should not be taken as a indicator of its value as a testing standard; it has more to do with the ease of use and the fact that it was a reasonable candidate.  summary sentences, respectively, of the article. Using Part of Speech (POS) and information about a token's location in the source document, in addition to the lexical information, helps improve performance on the Reuters' test set.</Paragraph>
    <Paragraph position="10"> could be taken into account. Table 1 shows the results of these term selection schemes. As can be seen, even with such an impoverished language model, the system does quite well: when the generated headlines are four words long almost one in every five has all of its words matched in the article s actual headline. This percentage drops, as is to be expected, as headlines get longer.</Paragraph>
    <Paragraph position="11"> Multiple Selection Models: POS and Position As we mentioned earlier, the zero-level model that we have discussed so far can be extended to take into account additional information both for the content selection and for the surface realization strategy. We will briefly discuss the use of two additional sources of information: (i) part of speech (POS) information, and (ii) positional information. null POS information can be used both in content selection - to learn which word-senses are more likely to be part of a headline - and in surface realization. Training a POS model for both these tasks requires far less data than training a lexical model, since the number of POS tags is much smaller. We used a mixture model (McLachlan and Basford, 1988) - combining the lexical and the POS probabilities - for both the content selection and the linearization tasks.</Paragraph>
    <Paragraph position="12"> Another indicator of salience is positional information, which has often been cited as one of the most important cues for summarization by ex- null 1: clinton -23.27 2: clinton wants -52.44 3: clinton in albright -76.20 4: clinton to meet albright -105.5 5: clinton in israel for albright -129.9 6: clinton in israel to meet albright -158.57 (a) System generated output using a lexical + POS model. 1: clinton -3.71 2: clinton mideast -12.53 3: clinton netanyahu arafat -17.66 4: clinton netanyahu arafat israel -23.1 5: clinton to meet netanyahu arafat -28.8 6: clinton to meet netanyahu arafat israel -34.38 (b) System generated output using a lexical + positional model.</Paragraph>
    <Paragraph position="13"> 1: clinton -21.66 2: clinton wants -51.12 3: clinton in israel - 58.13 4: clinton meet with israel -78.47 5: clinton to meet with israel -87.08 6: clinton to meet with netanyahu arafat -107.44 (c) System generated output using a lexical + POS + posi null augmented lexical models. Numbers to the right are log probabilities of the generated strings under the generation model.</Paragraph>
    <Paragraph position="14">  the evaluation, but which are semantically equivalent, together with some &amp;quot;equally good&amp;quot; generated headlines that were counted as wrong in the evaluation. traction (Hovy and Lin, 1997; Mittal et al., 1999). We trained a content selection model based on the position of the tokens in the training set in their respective documents. There are several models of positional salience that have been proposed for sentence selection; we used the simplest possible one: estimating the probability of a token appearing in the headline given that it appeared in the 1st, 2nd, 3rd or 4th quartile of the body of the article. We then tested mixtures of the lexical and POS models, lexical and positional models, and all three models combined together. Sample output for the article in Figure 3, using both lexical and POS/positional information can be seen in Figure 4. As can be seen in Table 2,7 Although adding the POS information alone does not seem to provide any benefit, positional information does. When used in combination, each of the additional information sources seems to improve the overall model of summary generation.</Paragraph>
    <Paragraph position="15"> Problems with evaluation: Some of the statistics that we presented in the previous discussion suggest that this relatively simple statistical summarization system is not very good compared to some of the extraction based summarization systems that have been presented elsewhere (e.g., (Radev and Mani, 1997)). However, it is worth emphasizing that many of the headlines generated by the system were quite good, but were penalized because our evaluation metric was based on the word-error rate and the generated headline terms did not exactly match the original ones. A quick manual scan of some of the failures that might have been scored as successes 7Unlike the data in Table 1, these headlines contain only six words or fewer.</Paragraph>
    <Paragraph position="16"> in a subjective manual evaluation indicated that some of these errors could not have been avoided without adding knowledge to the system, for example, allowing the use of alternate terms for referring to collective nouns. Some of these errors are shown in Table 3.</Paragraph>
  </Section>
class="xml-element"></Paper>