<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1124"> <Title>Improving summarization through rhetorical parsing tuning</Title> <Section position="4" start_page="207" end_page="211" type="metho"> <SectionTitle> 3 An enhanceddiscourse-based </SectionTitle> <Paragraph position="0"> framework for text summarization</Paragraph> <Section position="1" start_page="207" end_page="207" type="sub_section"> <SectionTitle> 3.1 Introduction </SectionTitle> <Paragraph position="0"> There are two ways in which one can integrate a discourse-based measure of textual saliency, such as that described above, with measures of saliency that are based on cohesion, position, similarity with the title, etc. The simplest way is to compute a probability distribution of the importance of textual units according to the discourse method and to combine it with all probability distributions produced by the other heuristics. In such an approach, the discourse heuristic is just one of the n heuristics that are employed by a system. Obtaining good summaries amounts then to determining a good way of combining the implemented heuristics.</Paragraph> <Paragraph position="1"> Overall, a summarization system that works along the lines described above still treats texts as flat sequences of textual units, although the discourse method internally uses a more sophisticated representation. The shortcoming of such an approach is that it still permits the selection of textual units that do not play a central role in discourse. For example, if the text to be summarized consists only of units 7 and 8 in text (!), it may be possible that the combination of the position, title, and discourse heuristics will yield a higher score for unit 7 than for unit 8, although unit 8 is the nucleus of the text and expresses what is important. Unfortunately, if we interpret text as a flat sequence of units, the rhetorical relation and the nuclearity assignments with respect to these units cannot be appropriately exploited.</Paragraph> <Paragraph position="2"> A more complex way to integrate discourse, cohesion, position, and other summarization-based methods is to consider that the structure of discourse is the most important factor in determining saliency, an assumption supported by experiments done by Mani et al. (1998). In such an approach, we no longer interpret texts as flat sequences of textual units, but as tree structures that reflect the nuclearity and rhetorical relations that characterize each textual span. When discourse is taken to be central to the interpretation of text, obtaining good summaries amounts to finding the &quot;best&quot; discourse interpretations. In the rest of the paper, we explore this approach.</Paragraph> </Section> <Section position="2" start_page="207" end_page="209" type="sub_section"> <SectionTitle> 3.2 Criteria for measuring the &quot;goodness&quot; of </SectionTitle> <Paragraph position="0"> discourse structures In order to find the 'best' discourse interpretations, i.e., the interpretations that yield summaries that are most similar to summaries generated manually, we considered seven metrics, which we discuss below.</Paragraph> <Paragraph position="1"> The clustering-based metric. A common assumption in the majority of current text theories is that good texts exhibit a well-defined topical structure. 
<Paragraph position="1"> Overall, a summarization system that works along the lines described above still treats texts as flat sequences of textual units, although the discourse method internally uses a more sophisticated representation. The shortcoming of such an approach is that it still permits the selection of textual units that do not play a central role in discourse. For example, if the text to be summarized consists only of units 7 and 8 in text (1), it may be possible that the combination of the position, title, and discourse heuristics will yield a higher score for unit 7 than for unit 8, although unit 8 is the nucleus of the text and expresses what is important. Unfortunately, if we interpret text as a flat sequence of units, the rhetorical relation and the nuclearity assignments with respect to these units cannot be appropriately exploited.</Paragraph> <Paragraph position="2"> A more complex way to integrate discourse, cohesion, position, and other summarization-based methods is to consider that the structure of discourse is the most important factor in determining saliency, an assumption supported by experiments done by Mani et al. (1998). In such an approach, we no longer interpret texts as flat sequences of textual units, but as tree structures that reflect the nuclearity and rhetorical relations that characterize each textual span. When discourse is taken to be central to the interpretation of text, obtaining good summaries amounts to finding the &quot;best&quot; discourse interpretations. In the rest of the paper, we explore this approach.</Paragraph> </Section> <Section position="2" start_page="207" end_page="209" type="sub_section"> <SectionTitle> 3.2 Criteria for measuring the &quot;goodness&quot; of </SectionTitle> <Paragraph position="0"> discourse structures In order to find the &quot;best&quot; discourse interpretations, i.e., the interpretations that yield summaries that are most similar to summaries generated manually, we considered seven metrics, which we discuss below.</Paragraph> <Paragraph position="1"> The clustering-based metric. A common assumption in the majority of current text theories is that good texts exhibit a well-defined topical structure. In our approach, we assume that a discourse tree is &quot;better&quot; if it exhibits a high-level structure that matches as closely as possible the topical boundaries of the text for which that structure is built.</Paragraph> <Paragraph position="2"> In order to capture this intuition, when we build discourse trees, we associate with each node of a tree a clustering score. For the leaves, this score is 0; for the internal nodes, the score is given by the similarity between the immediate children. The similarity is computed using a traditional cosine metric, in the style of Hearst (1997). We consider that a discourse tree A is &quot;better&quot; than another discourse tree B if the sum of the clustering scores associated with the nodes of A is higher than the sum of the clustering scores associated with the nodes of B.</Paragraph> <Paragraph position="3"> The marker-based metric. Naturally occurring texts use a wide range of discourse markers, which signal coherence relations between textual spans of various sizes. We assume that a discourse structure should explicitly reflect as many as possible of the discourse relations that are signaled by discourse markers. In other words, we assume that a discourse structure A is better than a discourse structure B if A uses more explicitly signaled rhetorical relations than B.</Paragraph> <Paragraph position="4"> The rhetorical-clustering-based metric. The clustering-based metric discussed above computes an overall similarity between two textual spans. However, in the discourse formalization proposed by Marcu (1996; 1997c), it is assumed that whenever a discourse relation holds between two textual spans, that relation also holds between the salient units (nuclei) associated with those spans. We extend this observation to similarity as well, by introducing the rhetorical-clustering-based metric, which measures the similarity between the salient units associated with two spans. For example, the clustering-based score associated with the root of the tree in figure 1 measures the similarity between spans [1,6] and [7,10]. In contrast, the rhetorical-clustering-based score associated with the root of the same tree measures the similarity between units 2 and 8, which are the salient units that pertain to spans [1,6] and [7,10], respectively. In the light of the rhetorical-clustering-based metric, we consider that a discourse tree A is &quot;better&quot; than another discourse tree B if the sum of the rhetorical-clustering scores associated with the nodes of A is higher than the sum of the rhetorical-clustering scores associated with the nodes of B.</Paragraph>
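The two similarity-driven metrics above can be made concrete with a small sketch. The tree representation, the bag-of-words cosine, and the assumption that internal nodes are binary are simplifications introduced here for illustration, not the paper's implementation.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (term-frequency vectors)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

class Node:
    """A (binary) discourse-tree node.  A leaf stores the words of one textual
    unit; an internal node stores its two children and the salient units
    (nuclei) promoted to it, each unit being a list of words."""
    def __init__(self, words=None, children=(), salient=()):
        self.words = list(words or [])
        self.children = list(children)
        self.salient = [list(u) for u in salient]

    def span_words(self):
        if not self.children:
            return self.words
        return [w for child in self.children for w in child.span_words()]

def clustering_score(node):
    """Leaves score 0; an internal node scores the cosine similarity of its
    immediate children; the tree's score is the sum over all nodes."""
    if not node.children:
        return 0.0
    left, right = node.children
    own = cosine(left.span_words(), right.span_words())
    return own + sum(clustering_score(child) for child in node.children)

def rhetorical_clustering_score(node):
    """Like clustering_score, but compares only the salient units promoted
    from each child rather than the full spans."""
    if not node.children:
        return 0.0
    left, right = node.children
    own = cosine([w for u in left.salient for w in u],
                 [w for u in right.salient for w in u])
    return own + sum(rhetorical_clustering_score(child) for child in node.children)
```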
<Paragraph position="5"> The shape-based metric. The only disambiguation metric that we used in our previous work (Marcu, 1997b) was the shape-based metric, according to which the &quot;best&quot; trees are those that are skewed to the right. The explanation for this metric is that text processing is, essentially, a left-to-right process. In many genres, people write texts so that the most important ideas go first, both at the paragraph and at the text levels.1 The more text writers add, the more they elaborate on the text that went before: as a consequence, incremental discourse building consists mostly of expansion of the right branches. According to the shape-based metric, we consider that a discourse tree A is &quot;better&quot; than another discourse tree B if A is more skewed to the right than B (see Marcu (1997c) for a mathematical formulation of the notion of skewedness).</Paragraph> <Paragraph position="6"> The title-based metric. A variety of systems assume that important sentences in a text use words that occur in the title. We measure the similarity between each textual unit and the title by applying a traditional cosine metric. We compute a title-based score for each discourse structure by computing the similarity between the title and the units that are promoted as salient in that structure. The intuition that we capture in this way is that a discourse structure should be constructed so that it promotes as close to the root as possible the units that are similar to the title. According to the title-based metric, we consider that a discourse structure A is &quot;better&quot; than a discourse structure B if the title-based score of A is higher than the title-based score of B.</Paragraph> <Paragraph position="7"> The position-based metric. Research in summarization (Baxendale, 1958; Edmundson, 1968; Kupiec et al., 1995; Lin and Hovy, 1997) has shown that, in genres with stereotypical structure, important sentences are often located at the beginning or end of paragraphs/documents.</Paragraph> <Paragraph position="8"> Our position-based metric captures this intuition by assigning a positive score to each textual unit that belongs to the first two or last sentences of the first three or last two paragraphs. We compute a position-based score for each discourse structure by averaging the position-based scores of the units that are promoted as salient in that discourse structure. The intuition that we capture in this way is that a discourse structure should be constructed so that it promotes as close to the root as possible the units that are located at the beginning or end of a text. According to the position-based metric, we consider that a discourse structure A is &quot;better&quot; than a discourse structure B if the position-based score of A is higher than the position-based score of B.</Paragraph> <Paragraph position="9"> The connectedness-based metric. A heuristic that is often employed by current summarization systems is that of considering important the most highly connected entities in more or less elaborate semantic structures (Skorochodko, 1971; Hoey, 1991; Salton and Allan, 1995; Mani and Bloedorn, 1998; Barzilay and Elhadad, 1997). We implement this heuristic by computing the average cosine similarity of each textual unit in a text with respect to all the other units. We associate a connectedness-based score with each discourse structure by averaging the connectedness-based scores of the units that are promoted as salient in that discourse structure. As in the case of the other metrics, we consider that a discourse structure A is &quot;better&quot; than a discourse structure B if the connectedness-based score of A is higher than the connectedness-based score of B.</Paragraph> <Paragraph position="10"> 1 In fact, journalists are trained to employ this &quot;pyramid&quot; approach to writing consciously (Cumming and McKercher, 1994).</Paragraph>
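The three metrics that score the units promoted as salient in a structure can be sketched in the same style, reusing the cosine helper from the previous sketch. The index-based interface (promoted units given as indices into the list of textual units) is an assumption made here for illustration.

```python
def title_based_score(promoted_units, units, title_words):
    """Similarity between the title and the units promoted as salient in a
    discourse structure; promoted_units are indices into units, and each
    unit is a list of words."""
    bag = [w for i in promoted_units for w in units[i]]
    return cosine(bag, title_words)

def position_based_score(promoted_units, positional_units):
    """Average positional score of the promoted units; positional_units is
    the set of unit indices falling in the first two or last sentences of
    the first three or last two paragraphs."""
    if not promoted_units:
        return 0.0
    return sum(1.0 for i in promoted_units if i in positional_units) / len(promoted_units)

def connectedness_based_score(promoted_units, units):
    """Average, over the promoted units, of each unit's mean cosine
    similarity with all the other units of the text."""
    if not promoted_units:
        return 0.0
    def average_similarity(i):
        others = [j for j in range(len(units)) if j != i]
        return sum(cosine(units[i], units[j]) for j in others) / len(others) if others else 0.0
    return sum(average_similarity(i) for i in promoted_units) / len(promoted_units)
```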
<Paragraph position="11"> 4 Combining heuristics</Paragraph> </Section> <Section position="3" start_page="209" end_page="209" type="sub_section"> <SectionTitle> 4.1 The approach </SectionTitle> <Paragraph position="0"> As we have already mentioned, discourse parsing is ambiguous in the same way sentence parsing is: the rhetorical parsing algorithm often derives more than one discourse structure for a given text. Each of the seven metrics listed above favors a different discourse interpretation.</Paragraph> <Paragraph position="1"> For the purpose of this paper, we assume that the &quot;best&quot; discourse structures are given by a linear combination of the seven metrics. Hence, along the lines described in section 3.2, we associate with each discourse structure a clustering-based score s_clust, a marker-based score s_mark, a rhetorical-clustering-based score s_rhet-clust, a shape-based score s_shape, a title-based score s_title, a position-based score s_pos, and a connectedness-based score s_con; and we assume that the best tree of a text is the one that corresponds to the discourse structure D that has the highest score s(D). The score s(D) is computed as shown in (3), where w_clust, ..., w_con are weights associated with each metric.</Paragraph> <Paragraph position="2"> s(D) = w_clust s_clust + w_mark s_mark + w_rhet-clust s_rhet-clust + w_shape s_shape + w_title s_title + w_pos s_pos + w_con s_con (3)</Paragraph> <Paragraph position="3"> To avoid data skewedness, the scores that correspond to each metric are normalized to values between 0 and 1.</Paragraph> <Paragraph position="4"> Given the above formulation, our goal is to determine combinations of weights that yield discourse structures that, in turn, yield summaries that are as close as possible to those generated by humans. In discourse terms, this amounts to using empirical summarization data for discourse parsing disambiguation.</Paragraph> </Section>
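A minimal sketch of the scoring scheme in (3) is given below. The min-max normalization across candidate structures is an assumption made here; the paper only states that each metric's scores are normalized to values between 0 and 1.

```python
METRICS = ["clust", "mark", "rhet_clust", "shape", "title", "pos", "con"]

def minmax_normalize(values):
    """Rescale one metric's raw scores (one per candidate tree) to [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def best_discourse_structure(candidates, weights):
    """candidates: one dict per candidate discourse structure, mapping each
    metric name in METRICS to its raw score; weights: one weight per metric.
    Returns the index of the structure D with the highest combined score
    s(D), as in equation (3)."""
    normalized = {m: minmax_normalize([c[m] for c in candidates]) for m in METRICS}
    def s(i):
        return sum(weights[m] * normalized[m][i] for m in METRICS)
    return max(range(len(candidates)), key=s)
```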
<Section position="4" start_page="209" end_page="209" type="sub_section"> <SectionTitle> 4.2 Corpora used in the study </SectionTitle> <Paragraph position="0"> In order to evaluate the appropriateness for summarization of each of the heuristics, we have used two corpora: a corpus of 40 newspaper articles from the TREC collection (Jing et al., 1998) and a corpus of five articles from Scientific American (Marcu, 1997a).</Paragraph> <Paragraph position="1"> Five human judges selected sentences to be included in 10% and 20% summaries of each of the articles in the TREC corpus (see (Jing et al., 1998) for details). For each of the 40 articles and for each cutoff figure (10% and 20%), we took the set of sentences selected by at least three human judges as the &quot;gold standard&quot; for summarization. In our initial experiments, we noticed that the rhetorical parsing algorithm needed more than 1 minute in order to automatically generate summaries for seven of the 40 articles in the TREC corpus, which were highly ambiguous from a discourse perspective. In order to enable a better employment of training techniques that are specific to machine learning, we partitioned the TREC collection into two subsets. The first subset contained 15 documents: this subset included the seven documents for which our summarization algorithm required extensive computation and eight other documents that were selected randomly. The second subset contained the remaining 25 documents, for which our algorithm could generate summaries sufficiently fast. For the purpose of this paper, we will refer to the collection of 25 articles as the &quot;training corpus&quot; and to the collection of 15 articles as the &quot;test corpus&quot;. However, the reader should not take the denotations associated with these referents literally, because the partitioning was not performed randomly. Rather, the reader should see the partitioning only as a means for accelerating the process that determines combinations of heuristics that yield the best summarization results for all the texts in the corpus.</Paragraph> <Paragraph position="2"> The second corpus consisted of five Scientific American texts whose elementary textual units (clause-like units) were labelled by 13 human judges as being very important, somewhat important, or unimportant (see (Marcu, 1997c) for the details of the experiment).</Paragraph> <Paragraph position="3"> For each of the five texts, we took as the gold standard for summarization the set of textual units that at least seven judges agreed were very important.</Paragraph> <Paragraph position="4"> We automatically built discourse structures for the texts in the two corpora using various combinations of weights, and we compared the summaries derived from these structures with the gold standards. The comparison employed traditional recall and precision figures: recall reflects the percentage of gold-standard units that were also identified by the program, and precision reflects the percentage of the units identified by the program that belong to the gold standard.</Paragraph> <Paragraph position="5"> For both corpora, we attempted to mimic as closely as possible the summarization tasks carried out by human judges. For the TREC corpus, we automatically extracted summaries at 10% and 20% cutoffs; for the Scientific American corpus, we automatically extracted summaries that reflected the lengths of the summaries on which human judges agreed.</Paragraph> </Section>
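The evaluation setup described above reduces to two small operations, sketched here under assumed interfaces (judges' selections and system output given as sets of unit identifiers); this is an illustration, not the scripts used in the study.

```python
from collections import Counter

def gold_standard(judge_selections, min_votes):
    """Units selected by at least min_votes judges (3 of 5 for the TREC
    sentences, 7 of 13 'very important' labels for Scientific American)."""
    votes = Counter(u for selection in judge_selections for u in set(selection))
    return {u for u, v in votes.items() if v >= min_votes}

def recall_precision(program_units, gold_units):
    """Recall: fraction of the gold-standard units that the program found.
    Precision: fraction of the program's units that are in the gold standard."""
    program_units, gold_units = set(program_units), set(gold_units)
    hits = len(program_units & gold_units)
    recall = hits / len(gold_units) if gold_units else 0.0
    precision = hits / len(program_units) if program_units else 0.0
    return recall, precision
```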
<Section position="5" start_page="209" end_page="211" type="sub_section"> <SectionTitle> 4.3 Appropriateness for summarization of the </SectionTitle> <Paragraph position="0"> individual heuristics The TREC corpus. Initially, we evaluated the appropriateness for text summarization of each of the seven heuristics at both 10% and 20% cutoffs for the collection of texts in the TREC corpus. By assigning in turn the value 1 to each of the seven weights, while the other six weights were assigned the value 0, we estimated the appropriateness of using each individual metric for text summarization.</Paragraph> <Paragraph position="1"> Tables 1 and 2 show the recall and precision figures that pertain to discourse structures that were built for the TREC corpus, in order to evaluate the appropriateness for text summarization of each of the seven metrics at 10% and 20% cutoffs, respectively. For a better understanding of the impact of each heuristic, tables 1 and 2 also show the recall and precision figures associated with the human judges and with two baseline algorithms.</Paragraph> <Paragraph position="2"> [Table 1 caption fragment: &quot;... for text summarization in the TREC corpus -- the 10% cutoff.&quot;]</Paragraph> <Paragraph position="3"> [Table 2 caption fragment: &quot;... for text summarization in the TREC corpus -- the 20% cutoff.&quot;]</Paragraph> <Paragraph position="4"> The recall and precision figures for the human judges were computed by taking the average recall and precision of the summaries built by each human judge individually when compared with the gold standard. These recall and precision figures can be interpreted as summarization upper bounds for the collection of texts that they characterize. Since each judge contributed to the derivation of the gold standards, the recall and precision figures that pertain to human judges are biased: they are probably higher than the figures that would characterize an outsider to the experiment.</Paragraph> <Paragraph position="5"> The recall and precision figures that pertain to the baseline algorithms are computed as follows: the lead-based algorithm assumes that important units are located at the beginning of texts; the random-based algorithm assumes that important units can be selected randomly.</Paragraph> <Paragraph position="6"> The results in table 1 show that, for newspaper articles, the title- and position-based metrics are the best individual metrics for distinguishing between discourse trees that are appropriate for generating 10% summaries and discourse trees that are not. Interestingly, none of these heuristics taken in isolation is better than the lead-based algorithm. In fact, the results in table 1 show that there is almost no quantitative difference in terms of recall and precision between summaries generated by the lead-based algorithm and summaries generated by humans.</Paragraph> <Paragraph position="7"> We were so puzzled by this finding that we investigated the issue further: by scanning the collection of 40 articles, we came to believe that since most of them are very short and simple, they are inappropriate as a testbed for summarization research. To estimate the validity of this belief, we focused our attention on a subset of 10 articles that seemed to use a more sophisticated writing style and that did not straightforwardly follow the pyramid-based approach; each of these 10 articles used the word &quot;computer&quot; at least once. When we evaluated the performance of the lead-based algorithm on this subset, we obtained figures of 66.00% recall and 43.66% precision at the 10% cutoff. This result suggests that as soon as more sophisticated texts are considered, the performance of the lead-based algorithm decreases significantly even within the newspaper genre.</Paragraph> <Paragraph position="8"> The results in table 2 show that, for newspaper articles, the shape-based metric is the best individual metric for distinguishing between discourse trees that are appropriate for 20% summaries and discourse trees that are not. Still, the shape-based heuristic is not better than the lead-based algorithm.</Paragraph>
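For reference, the two baseline extractors whose figures appear in tables 1 and 2 admit a very small sketch; the function names and the unit-count interface are assumptions made here for illustration.

```python
import random

def lead_baseline(n_units, summary_size):
    """Lead-based baseline: take the first summary_size units of the text."""
    return list(range(min(summary_size, n_units)))

def random_baseline(n_units, summary_size, seed=None):
    """Random baseline: draw summary_size units uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_units), min(summary_size, n_units)))
```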
<Paragraph position="9"> The Scientific American corpus. When we evaluated the appropriateness for text summarization of the heuristics at both clause and sentence levels for the collection of texts in the Scientific American corpus, we obtained a very different picture of the weight configurations that yielded the highest recall and precision figures.</Paragraph> <Paragraph position="10"> A close analysis of the results in table 3 shows that, for Scientific American articles, the clustering-, rhetorical-clustering-, and shape-based metrics are the best individual metrics for distinguishing between discourse trees that are good for clause-based summarization and discourse trees that are not.</Paragraph> <Paragraph position="11"> The results in table 4 show that, for Scientific American articles, the shape-based metric is the best individual metric for distinguishing between discourse trees that are appropriate for sentence-based summarization and discourse trees that are not. Surprisingly, the title-, position-, and connectedness-based metrics underperform even the random-based metric.</Paragraph> <Paragraph position="12"> In contrast with the results that pertain to the TREC corpus, the lead-based algorithm performs significantly worse than human judges for the texts in the Scientific American corpus, despite the Scientific American texts being shorter than those in the TREC collection.</Paragraph> <Paragraph position="13"> [Table 3 caption fragment: &quot;... for text summarization in the Scientific American corpus ...&quot;]</Paragraph> <Paragraph position="14"> [Table 4 caption fragment: &quot;... for text summarization in the Scientific American corpus ... the sentence case.&quot;]</Paragraph> <Paragraph position="15"> Discussion. Overall, the recall and precision figures presented in this section suggest that no individual heuristic consistently guarantees success across different text genres. Moreover, the figures suggest that, even within the same genre, the granularity of the textual units that are selected for summarization and the overall length of the summary affect the appropriateness of a given heuristic.</Paragraph> <Paragraph position="16"> By focusing only on the human judgments, we notice that the newspaper genre yields a higher consistency than the Scientific American genre with respect to what humans believe to be important. Also, the results in this section show that humans agree better on important sentences than on important clauses; and that, within the newspaper genre, they agree better on what is very important (the 10% summaries) than on what is somewhat important (the 20% summaries).</Paragraph> </Section> <Section position="6" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 4.4 Learning the best combinations of heuristics </SectionTitle> <Paragraph position="0"> The individual applications of the metrics suggest which heuristics are appropriate for summarizing texts that belong to the text genres of the two corpora. In addition to this assessment, we were also interested in finding combinations of heuristics that yield good summaries. To this end, we employed a simple learning paradigm, which we describe below.</Paragraph> <Paragraph position="1"> In the framework that we proposed in this paper, finding a combination of metrics that is best for summarization amounts to finding a combination of weights w_clust, ..., w_con that maximizes the recall and precision figures associated with automatically built summaries.
The algorithm shown in figure 2 performs a greedy search in the seven-dimensional space defined by the weights, using an approach that mirrors that proposed by Selman, Levesque, and Mitchell (1992) for solving propositional satisfiability problems.</Paragraph> <Paragraph position="2"> The algorithm initially assigns to each member of the vector of weights W_max a random value in the interval [0, 1]. This assignment corresponds to a point in the n-dimensional space defined by the weights. The program then attempts NoOfSteps times to move incrementally, in the n-dimensional space, along a direction that maximizes the F-measure of the recall and precision figures that pertain to the automatically built summaries. The F-measure is computed as shown in (4), below.</Paragraph> <Paragraph position="3"> F = 2PR / (P + R), where P denotes precision and R denotes recall. (4)</Paragraph> <Paragraph position="4"> The F-measure always takes values between the values of recall and precision, and is higher when recall and precision are closer.</Paragraph> <Paragraph position="5"> For each point W_t, the program computes the F-value of the recall and precision figures of the summaries that correspond to all the points in the neighborhood of W_t that are at distance Δw along each of the seven axes (lines 6-10 in figure 2). From the set of 14 points that characterize the neighborhood of the current configuration W_t, the algorithm randomly selects (line 12) one of the weight configurations that yielded the maximum F-value (line 11). In line 13, the algorithm moves in the n-dimensional space to the position that characterizes the configuration of weights that was selected on line 12. After NoOfSteps iterations, the algorithm updates the configuration of weights W_max, such that it reflects the combination of weights that yielded the maximal F-value of the recall and precision figures (line 15 in figure 2). The algorithm repeats this process NoOfTries times, in order to increase the chance of finding a maximum that is not local.</Paragraph> <Paragraph position="6"> Since the lengths of the summaries that we automatically extracted were fixed in all cases, we chose to look for configurations of weights that maximized the F-value of the recall and precision figures. However, one can also use the algorithm in figure 2 to find configurations of weights that maximize only the recall or only the precision figure.</Paragraph>
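As a reading aid, the search procedure just described can be sketched as follows. This is a minimal hill-climbing sketch of the idea behind figure 2, not a reproduction of the figure: the clamping of weights to [0, 1], the way the best configuration is tracked during the walk, and the f_value callback (assumed to run the summarizer with a given weight configuration and return the F-measure of equation (4) against the gold standard) are all assumptions made here.

```python
import random

def f_measure(recall, precision):
    """Equation (4): harmonic mean of recall and precision."""
    return 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)

def learn_weights(f_value, metrics, no_of_tries=10, no_of_steps=50, delta_w=0.05, seed=0):
    """Greedy search over the seven-dimensional weight space: random restarts,
    and at each step a move of +/- delta_w along one axis towards the
    neighbouring configuration with the highest F-value."""
    rng = random.Random(seed)
    best_w, best_f = None, float("-inf")
    for _ in range(no_of_tries):
        # Random starting point in [0, 1]^7.
        w = {m: rng.random() for m in metrics}
        for _ in range(no_of_steps):
            # Evaluate the 14 neighbours at distance delta_w along each axis.
            neighbours = []
            for m in metrics:
                for sign in (+1, -1):
                    candidate = dict(w)
                    candidate[m] = min(1.0, max(0.0, candidate[m] + sign * delta_w))
                    neighbours.append(candidate)
            scored = [(f_value(c), c) for c in neighbours]
            top = max(f for f, _ in scored)
            # Move to one of the maximal neighbours, chosen at random.
            w = rng.choice([c for f, c in scored if f == top])
            if top > best_f:
                best_f, best_w = top, dict(w)
    return best_w, best_f
```

</Section> </Section> </Paper>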