<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1202"> <Title>Using Thematic Information in Statistical Headline Generation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Veracity of Generated Summaries </SectionTitle> <Paragraph position="0"> Berger and Mittal (2000) describe limitations to the generation of headlines by recycling words from the article. One such limitation is that the proposition expressed by the generated summary is not guaranteed to reflect the information in the source text. As an example, they present two sentences of differing meaning which use the same words. We present their example in Example 1, which illustrates the case in which the subject and object are swapped.</Paragraph> <Paragraph position="1"> The dog bit the postman. The postman bit the dog.</Paragraph> <Paragraph position="2"> Example 1. An example of different propositions presented in two sentences which use the same words.</Paragraph> <Paragraph position="3"> However, we believe that the veracity of the generated sentence, with respect to the original document, is affected by a more basic problem than variation in word order. Because words from any part of a source document can be combined probabilistically, there is a possibility that words can be used together out of context. We refer to this as Out-of-Context error. Figure 1 presents an example of a generated headline which wrongly reports stock price movement. It also presents the actual context in which the re-used word 'rebound' appeared.</Paragraph> <Paragraph position="4"> Generated headline: &quot;singapore stocks shares rebound&quot; Actual headline: &quot;Singapore shares fall, seen higher after holidays.&quot; Original context of use of 'rebound': &quot;Singapore shares closed down below the 2,200 level on Tuesday but were expected to rebound immediately after Chinese Lunar New Year and Muslim Eid Al-Fitr holidays, ...&quot; Figure 1. An example of an Out-of-Context error due to a word being re-used out of context.</Paragraph> <Paragraph position="5"> Out-of-Context errors arise due to limitations in the two criteria for selecting words mentioned in Section 1. While, for selection purposes, a word is scored according to its goodness as a candidate summary word, word order is determined by a notion of grammaticality, modelled probabilistically using n-grams of lexemes.</Paragraph> <Paragraph position="6"> However, the semantic relationship implied by probabilistically placing two words next to each other, for example an adjective and a noun, might be suspect. As the name &quot;Out-of-Context&quot; suggests, this is especially true if the words were originally used in non-contiguous and unrelated contexts. This limitation in the word selection criteria can be characterized as being due to a lack of long-distance relationship information.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Our Approach to &quot;Encouraging Truth&quot; </SectionTitle> <Paragraph position="0"> In response to this limitation, we explore the use of a matrix operation, Singular Value Decomposition (SVD), to guide the selection of words. Although our approach still does not guarantee factual correctness with respect to the source document, it has the potential to alleviate the Out-of-Context problem by improving the criteria for selecting words for inclusion in the generated sentence, considering the original contexts in which words were used.
With these improved criteria, we hope to &quot;encourage truth&quot; by incorporating long-distance relationships between words. Conceptually, SVD provides an analysis of the data which describes the relationship between the distribution of words and sentences. This analysis includes a grouping of sentences based on similar word distributions, which correspond to what we will refer to here as the main themes of the document. (Footnote 1: Theme is a term that is used in many ways by many researchers, and generally without any kind of formal definition. Our use of the term here is akin to the notion that underlies work on text segmentation, where sentences naturally cluster in terms of their 'aboutness'.) By incorporating this information into the word selection criteria, the generated sentence will &quot;gravitate&quot; towards a single theme. That is, it will tend to use words from that theme, reducing the chance that words are placed together out of context.</Paragraph> <Paragraph position="1"> By reflecting the content of the main theme, the summary may be informative (Borko, 1975).</Paragraph> <Paragraph position="2"> That is, the primary piece of information within the source document might be included within the summary. However, it would be remiss of us to claim that this quality of the summary is guaranteed. In general, the generated summaries are at least useful to gauge what the source text is about, a characteristic described by Borko as being indicative.</Paragraph> <Paragraph position="3"> Figure 2 presents the summary generated using SVD for the same test article presented in Figure 1. In this case, the summary is informative: not only are we told that the article is about a stock market, but the movement in price is also correctly determined.</Paragraph> <Paragraph position="4"> Generated headline using SVD: &quot;singapore shares fall&quot; Figure 2. The headline generated using the SVD-based word selection criterion. The movement in share price is correct.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Related Work </SectionTitle> <Paragraph position="0"> As the focus of this paper is on statistical single-sentence summarisation, we will not review preceding work which generates summaries greater in length than a sentence. We direct the reader to Paice (1990) for an overview of summarisation based on sentence extraction.</Paragraph> <Paragraph position="1"> Examples of recent systems include Kupiec et al. (1995) and Brandow et al. (1995). For examples of work in producing abstract-like summaries, see Radev and McKeown (1998), which combines work in information extraction and natural language processing.</Paragraph> <Paragraph position="2"> Hybrid methods for abstract-like summarisation which combine statistical and symbolic approaches have also been explored; see, for example, McKeown et al. (1999), Jing and McKeown (1999), and Hovy and Lin (1997).</Paragraph> <Paragraph position="3"> Statistical single sentence summarisation has been explored by a number of researchers (see, for example, Witbrock and Mittal, 1999; Zajic et al., 2002). We build on the approach employed by Witbrock and Mittal (1999), which we describe in more detail in Section 5.</Paragraph> <Paragraph position="4"> Interestingly, in the work of Witbrock and Mittal (1999), the selection of words for inclusion in the headline is decided solely on the basis of corpus statistics and does not use statistical information about the distribution of words in the document itself.
Our work differs in that we utilise an SVD analysis to provide information about the document to be summarised, specifically its main theme.</Paragraph> <Paragraph position="5"> Discourse segmentation for sentence extraction summarisation has been studied in work such as Boguraev and Neff (2000) and Gong and Liu (2001). The motivation behind discovering segments in a text is that a sentence extraction summary should choose the most representative sentence for each segment, resulting in a comprehensive summary. In the view of Gong and Liu (2001), segments form the main themes of a document. They present a theme interpretation of the SVD analysis, as it is used for discourse segmentation, upon which our use of the technique is based. However, Gong and Liu use SVD for creating sentence extraction summaries, not for generating a single sentence summary by re-using words.</Paragraph> <Paragraph position="6"> In work subsequent to Witbrock and Mittal (1999), Banko et al. (2000) describe the use of information about the position of words within four quarters of the source document. The headline candidacy score of a word is weighted by its position in one of these quarters. We interpret this use of position information as a means of guiding the generation of a headline towards the central theme of the document, which for news articles typically occurs in the first quarter.</Paragraph> <Paragraph position="7"> SVD potentially offers a more general mechanism for handling the discovery of the central themes and their positions within the document.</Paragraph> <Paragraph position="8"> Jin et al. (2002) have also examined a statistical model for headlines in the context of an information retrieval application. Jin and Hauptmann (2001) provide a comparison of a variety of learning approaches used by researchers for modelling the content of headlines, including the Iterative Expectation-Maximisation approach, the K-Nearest Neighbours approach, a term vector approach and the approach of Witbrock and Mittal (1999).</Paragraph> <Paragraph position="9"> In this comparison, the approach of Witbrock and Mittal (1999) fares favourably, ranking second after the term vector approach to title word retrieval (see Jin and Hauptmann, 2001, for details). However, while it performs well, the term vector approach Jin et al. (2002) advocate does not explicitly try to model the way a headline will usually discuss the main theme, and may thus be subject to the Out-of-Context problem.</Paragraph> <Paragraph position="10"> Finally, for completeness, we mention the work of Knight and Marcu (2000), who examine single sentence compression. Like Witbrock and Mittal (1999), they couch summarisation as a noisy channel problem. Under this framework, the summary is a noise-less source of information and the full text is the noisy result. However, in contrast to our approach, Knight and Marcu (2000) handle parse trees instead of the raw text. Their system learns how to simplify parse trees of sentences extracted from the document to be summarised, to uncover the original noise-less forms.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Generating a Single Sentence Summary </SectionTitle> <Paragraph position="0"> In this section, we describe our approach to single sentence summarisation. As mentioned earlier, our approach is based on that of Witbrock and Mittal (1999); it differs in the way we score words for inclusion in the headline. Section 5.1 presents our re-implementation of Witbrock and Mittal's (1999) framework and introduces the Content Selection strategy they employ.
Section 5.2 describes our extension using SVD, resulting in two alternative Content Selection strategies.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Searching for a Probable Headline </SectionTitle> <Paragraph position="0"> We re-implemented the work described in Witbrock and Mittal (1999) to provide a single sentence summarisation mechanism. For full details of their approach, we direct the reader to their paper (Witbrock and Mittal, 1999). A brief overview of our implementation of their algorithm is presented here.</Paragraph> <Paragraph position="1"> Conceptually, the task is twofold. First, the system must select n words from a news article that best reflect its content. Second, the best (grammatical) word ordering of these n words must be determined. Witbrock and Mittal (1999) label these two tasks as Content Selection and Realisation. Each of these criteria is scored probabilistically, where the probability is estimated from corpus statistics collected beforehand.</Paragraph> <Paragraph position="2"> To estimate the Content Selection probability for each word, we use the Maximum Likelihood Estimate (MLE). In an offline training stage, the system counts the number of times a word is used in a headline, with the condition that it also occurs in the corresponding news article. To form the probability, this frequency data is normalised by the number of times the word is used in articles across the whole corpus. We refer to this particular content selection strategy as the Conditional probability.</Paragraph> <Paragraph position="3"> The Realisation criterion is determined simply by the use of bigram statistics, which are again collected over a training corpus during the training stage. The MLE of the probability of word sequences is calculated using these bigram statistics. Bigrams model the grammaticality of a word given the preceding word that has already been chosen.</Paragraph> <Paragraph position="4"> It should be noted that both the Content Selection and Realisation criteria influence whether a word is selected for inclusion in the headline. For example, a preposition might poorly reflect the content of a news article and score a low Content Selection probability.</Paragraph> <Paragraph position="5"> However, given the context of the preceding word, it may be the only likely choice.</Paragraph> <Paragraph position="6"> In both the training stage and the headline generation stage, the system employs the same preprocessing. The preprocessing, which mirrors that used by Witbrock and Mittal (1999), replaces XML markup tags and punctuation (except apostrophes) with whitespace. In addition, the remaining text is transformed into lower case to make string matching case insensitive. The system performs tokenisation by using whitespace as a word delimiter.</Paragraph> <Paragraph position="7"> In Witbrock and Mittal's (1999) approach, the headline generation problem reduces to finding the most probable path through a bag of words provided by the source document, essentially a search problem. They use the beam search variety of the Viterbi algorithm (Forney, 1973) to efficiently search for the headline. In our implementation, we provided the path length as a parameter to this search mechanism. In addition, we used a beam size of 20.</Paragraph>
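To make the estimation procedure concrete, the following Python fragment gives a minimal sketch of how the Conditional (Content Selection) probability and the bigram Realisation probability could be collected by MLE. It is illustrative only: the variable names are ours, the corpus is assumed to be an iterable of (article_tokens, headline_tokens) pairs already preprocessed as described above, and each word is counted at most once per document, which is one plausible reading of the normalisation step.

    from collections import Counter, defaultdict

    def train_statistics(corpus):
        # corpus: iterable of (article_tokens, headline_tokens) pairs (illustrative assumption).
        headline_and_article = Counter()   # word seen in a headline and in its article
        article_occurrences = Counter()    # word seen in an article
        bigram_counts = defaultdict(Counter)

        for article_tokens, headline_tokens in corpus:
            headline_vocab = set(headline_tokens)
            for w in set(article_tokens):
                article_occurrences[w] += 1
                if w in headline_vocab:
                    headline_and_article[w] += 1
            # Bigram statistics over headlines for the Realisation model.
            for prev, curr in zip(headline_tokens, headline_tokens[1:]):
                bigram_counts[prev][curr] += 1

        def content_selection_prob(w):
            # Conditional probability: P(w used in the headline | w occurs in the article).
            return headline_and_article[w] / article_occurrences[w] if article_occurrences[w] else 0.0

        def realisation_prob(prev, curr):
            # Bigram MLE: P(curr | prev), estimated from headline word sequences.
            total = sum(bigram_counts[prev].values())
            return bigram_counts[prev][curr] / total if total else 0.0

        return content_selection_prob, realisation_prob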
<Paragraph position="8"> To use the Viterbi algorithm to search for a path, the probability of adding a new word to an existing path is computed by combining the Content Selection probability, the Realisation probability and the probability of the existing path, which is recursively defined. The component probabilities are combined by taking the logarithm of each and adding them together. The Viterbi algorithm sorts the paths according to the path probabilities, directing the search towards the more probable word sequences first. The use of repeated words in the path is not permitted.</Paragraph> </Section>
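As a concrete illustration of this search procedure, the sketch below shows one way such a beam search could be written in Python: path scores are sums of log probabilities, the path length is fixed in advance, the beam size is 20, and repeated words are disallowed. The estimator names content_prob and realisation_prob refer to the illustrative sketch above; this is an outline under those assumptions, not a reconstruction of the actual implementation.

    import math

    def generate_headline(bag_of_words, length, content_prob, realisation_prob, beam_size=20):
        # Beam-search sketch over the article's bag of words.
        beams = [(0.0, [])]                      # (log probability, words on the path)
        for _ in range(length):
            candidates = []
            for log_p, path in beams:
                for w in bag_of_words:
                    if w in path:                # repeated words are not permitted
                        continue
                    p_sel = content_prob(w)
                    # For the first word there is no bigram context; we assume a uniform start
                    # here (a start-of-headline token could be counted during training instead).
                    p_real = realisation_prob(path[-1], w) if path else 1.0
                    if p_sel <= 0.0 or p_real <= 0.0:
                        continue
                    candidates.append((log_p + math.log(p_sel) + math.log(p_real), path + [w]))
            if not candidates:
                break
            # Keep only the most probable partial headlines.
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        return " ".join(beams[0][1])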
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Using Singular Value Decomposition for Content Selection </SectionTitle> <Paragraph position="0"> As an alternative to the Conditional probability, we examine the use of SVD in determining the Content Selection probability. Before we outline the procedure for basing this probability on SVD, we will first outline our interpretation of the SVD analysis, based on that of Gong and Liu (2001). Our description is not intended to be a comprehensive explanation of SVD, and we direct the reader to Manning and Schutze (2000) for a description of how SVD is used in information retrieval.</Paragraph> <Paragraph position="1"> Conceptually, when used to analyse documents, SVD can discover relationships between word co-occurrences in a collection of text. For example, in the context of information retrieval, this provides one way to retrieve additional documents that contain synonyms of query terms, where synonymy is defined by similarity of word co-occurrences. By discovering patterns in word co-occurrences, SVD also provides information that can be used to cluster documents based on similarity of themes.</Paragraph> <Paragraph position="2"> In the context of single document summarisation, we require SVD to cluster sentences based on similarities of themes. The SVD analysis provides a number of related pieces of information about how words and sentences relate to these themes. One such piece of information is a matrix of scores, indicating how representative each sentence is of each theme. Thus, for a sentence extraction summary, Gong and Liu (2001) would pick the top n themes and, for each of these themes, use this matrix to choose the sentence that best represents it.</Paragraph> <Paragraph position="3"> For single sentence summarisation, we assume that the theme of the generated headline will match the most important theme of the article.</Paragraph> <Paragraph position="4"> The SVD analysis orders its presentation of themes starting with the one that accounts for the greatest variation between sentences. The SVD analysis provides another matrix, which scores how well each word relates to each theme. Given a theme, the scores for each word, contained in a column vector of that matrix, can then be normalised to form a probability. The remainder of this section provides a more technical description of how this is done.</Paragraph> <Paragraph position="5"> To begin with, we segment a text into sentences. Our sentence segmentation preprocessing is quite simple and based on the heuristics found in Manning and Schutze (2000). After removing stopwords, we then form a terms-by-sentences matrix, A. Each column of A represents a sentence. Each row represents the usage of a word in the various sentences. Thus the frequency of word t in sentence s is stored in the cell A_ts. This gives us a t x s matrix, where t >> s. That is, we expect the lexicon size of a particular news article to exceed the number of sentences.</Paragraph> <Paragraph position="6"> For such a matrix, the SVD of A provides the right-hand side of the following equation: A = U S V^T, where U is a t x r matrix, S is an r x r matrix, and V is an s x r matrix. The dimension r is the rank of A, and is less than or equal to the number of columns of A, in this case s. The matrix S is a diagonal matrix with interesting properties, the most important of which is that the diagonal is sorted by size. The diagonal values indicate the variation across sentences for a particular theme, where each theme is represented by a separate diagonal element. The matrix V indicates how representative each sentence is of each theme. Similarly, the matrix U indicates how related each word is to the themes. A diagram of this is presented in Figure 3.</Paragraph> <Paragraph position="7"> Before describing how we use each of these matrices, it is useful to outline what SVD is doing geometrically. Each sentence, a column in the matrix A, can be thought of as an object in t-dimensional space. SVD uncovers the relations between dimensions. For example, in the case of text analysis, it would discover relationships between words such as synonyms.</Paragraph> <Paragraph position="8"> In a trivial extreme of this case, where two sentences differ only by a synonym, SVD would ideally discover that the two synonyms have very similar word co-occurrences. In the analysis matrices U, S and V, the redundant dimensions corresponding to these highly similar words might be removed, resulting in a reduced number of dimensions, r, required to represent the sentences.</Paragraph> <Paragraph position="9"> Figure 3. The SVD matrices as they relate to single sentence summarisation.</Paragraph> <Paragraph position="10"> Of the resulting matrices, V is an indication of how each sentence relates to each theme, indicated by a score. Thus, following Gong and Liu (2001), a plausible candidate for the most important sentence is found by taking the first column vector of V (which has s elements) and finding the element with the highest value. This sentence will be the one which is most representative of the theme. The index of that element is the index of the sentence to extract. However, our aim is not to extract a sentence but to utilise the theme information. The U matrix of the analysis provides information about how well words correspond to a particular theme.</Paragraph> <Paragraph position="11"> We examine the first column of the U matrix, sum the elements, and then normalise each element by the sum to form a probability. This probability, which we refer to as the SVD probability, is then used as the Content Selection probability in the Viterbi search algorithm.</Paragraph> <Paragraph position="12"> As an alternative to using the SVD probability and the Conditional probability in isolation, a Combined probability is calculated using the harmonic mean of the two. The harmonic mean was used in case the two component probabilities differed consistently in their respective orders of magnitude. Intuitively, when calculating a combined probability, this evens out the influence of each component probability.</Paragraph>
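As an illustration of these last two steps, the following sketch (using numpy) shows one way the SVD probability and the Combined probability might be computed. It is a sketch under stated assumptions rather than the actual implementation: the function and variable names are ours, and, since the text does not discuss the arbitrary signs of singular vectors, the sketch simply takes magnitudes before normalising.

    import numpy as np

    def svd_content_probs(sentences, vocabulary):
        # sentences: list of token lists (stopwords removed); vocabulary: list of distinct
        # words fixing the row order of A. Both names are illustrative, not from the paper.
        index = {w: i for i, w in enumerate(vocabulary)}
        A = np.zeros((len(vocabulary), len(sentences)))
        for s, tokens in enumerate(sentences):
            for w in tokens:
                A[index[w], s] += 1          # A[t, s] = frequency of word t in sentence s

        # A = U S V^T; the first column of U scores each word against the dominant theme.
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        first = np.abs(U[:, 0])              # singular-vector signs are arbitrary
        probs = first / first.sum()
        return {w: probs[i] for w, i in index.items()}

    def combined_prob(p_conditional, p_svd):
        # Harmonic mean of the Conditional and SVD probabilities.
        if p_conditional <= 0.0 or p_svd <= 0.0:
            return 0.0
        return 2.0 * p_conditional * p_svd / (p_conditional + p_svd)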
<Paragraph position="13"> To summarise, we end up with three alternative strategies for estimating the Content Selection probability: the Conditional probability, the SVD probability and the Combined probability.</Paragraph> </Section> </Section> </Paper>