<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-3002">
  <Title>Sentence Fusion for Multidocument News Summarization</Title>
  <Section position="3" start_page="298" end_page="300" type="metho">
    <SectionTitle>
2. Framework for Sentence Fusion: MultiGen
</SectionTitle>
    <Paragraph position="0"> Sentence fusion is the central technique used within the MultiGen summarization system. MultiGen takes as input a cluster of news stories on the same event and produces a summary which synthesizes common information across input stories. An example of a MultiGen summary is shown in Figure 1. The input clusters are automatically produced from a large quantity of news articles that are retrieved by Newsblaster from 30 news sites each day.</Paragraph>
    <Paragraph position="1"> In order to understand the role of sentence fusion within summarization, we overview the MultiGen architecture, providing details on the processes that precede sentence fusion and thus, the input that the fusion component requires. Fusion itself is discussed in the subsequent sections of the article.</Paragraph>
    <Paragraph position="2"> MultiGen follows a pipeline architecture, shown in Figure 2. The analysis component of the system, Simfinder (Hatzivassiloglou, Klavans, and Eskin 1999) clusters sentences of input documents into themes, groups of sentences that convey similar information (Section 2.1). Once themes are constructed, the system selects a subset of the groups to be included in the summary, depending on the desired compression Figure 1 An example of MultiGen summary as shown in the Columbia Newsblaster Interface. Summary phrases are followed by parenthetical numbers indicating their source articles. The last sentence is extracted because it was repeated verbatim in several input articles.</Paragraph>
    <Paragraph position="3">  Computational Linguistics Volume 31, Number 3 Figure 2 MultiGen architecture.</Paragraph>
    <Paragraph position="4"> length (Section 2.2). The selected groups are passed to the ordering component, which selects a complete order among themes (Section 2.3).</Paragraph>
    <Section position="1" start_page="299" end_page="300" type="sub_section">
      <SectionTitle>
2.1 Theme Construction
</SectionTitle>
      <Paragraph position="0"> The analysis component of MultiGen, Simfinder, identifies themes, groups of sentences from different documents that each say roughly the same thing. Each theme will ultimately correspond to at most one sentence in the output summary, generated by the fusion component, and there may be many themes for a set of articles. An example of a theme is shown in Table 1. As the set of sentences in the table illustrates, sentences within a theme are not exact repetitions of each other; they usually include phrases expressing information that is not common to all sentences in the theme. Information that is common across sentences is shown in the table in boldface; other portions of the sentence are specific to individual articles. If one of these sentences were used as is to represent the theme, the summary would contain extraneous information. Also, errors in clustering might result in the inclusion of some unrelated sentences. Evaluation involving human judges revealed that Simfinder identifies similar sentences with 49.3% precision at 52.9% recall (Hatzivassiloglou, Klavans, and Eskin 1999). We will discuss later how this error rate influences sentence fusion.</Paragraph>
      <Paragraph position="1"> To identify themes, Simfinder extracts linguistically motivated features for each sentence, including WordNet synsets (Miller et al. 1990) and syntactic dependencies, such as subject-verb and verb-object relations. A log-linear regression model is used to combine the evidence from the various features into a single similarity value. The model was trained on a large set of sentences which were manually marked for similarity. The output of the model is a listing of real-valued similarity values on sentence pairs. These similarity values are fed into a clustering algorithm that partitions the sentences into closely related groups.</Paragraph>
      <Paragraph position="2"> Table 1 Theme with corresponding fusion sentence.</Paragraph>
      <Paragraph position="3">  1. IDF Spokeswoman did not confirm this, but said the Palestinians fired an antitank missile at a bulldozer.</Paragraph>
      <Paragraph position="4"> 2. The clash erupted when Palestinian militants fired machine guns and antitank missiles at a bulldozer that was building an embankment in the area to better protect Israeli forces. 3. The army expressed &amp;quot;regret at the loss of innocent lives&amp;quot; but a senior commander said troops had shot in self-defense after being fired at while using bulldozers to build a new embankment at an army base in the area.</Paragraph>
      <Paragraph position="5"> Fusion sentence: Palestinians fired an antitank missile at a bulldozer.</Paragraph>
      <Paragraph position="6">  Barzilay and McKeown Sentence Fusion for Multidocument News Summarization</Paragraph>
    </Section>
    <Section position="2" start_page="300" end_page="300" type="sub_section">
      <SectionTitle>
2.2 Theme Selection
</SectionTitle>
      <Paragraph position="0"> To generate a summary of predetermined length, we induce a ranking on the themes and select the n highest.</Paragraph>
      <Paragraph position="1">  This ranking is based on three features of the theme: size measured as the number of sentences, similarity of sentences in a theme, and salience score. The first two of these scores are produced by Simfinder, and the salience score is computed using lexical chains (Morris and Hirst 1991; Barzilay and Elhadad 1997) as described below. Combining different rankings further filters common information in terms of salience. Since each of these scores has a different range of values, we perform ranking based on each score separately, then induce total ranking by summing ranks from individual categories: Rank (theme) = Rank (Number of sentences in theme) + Rank (Similarity of sentences in theme) + Rank (Sum of lexical chain scores in theme) Lexical chains--sequences of semantically related words--are tightly connected to the lexical cohesive structure of the text and have been shown to be useful for determining which sentences are important for single-document summarization (Barzilay and Elhadad 1997; Silber and McCoy 2002). In the multidocument scenario, lexical chains can be adapted for theme ranking based on the salience of theme sentences within their original documents. Specifically, a theme that has many sentences ranked high by lexical chains as important for a single-document summary is, in turn, given a higher salience score for the multidocument summary. In our implementation, a salience score for a theme is computed as the sum of lexical chain scores of each sentence in a theme.</Paragraph>
    </Section>
    <Section position="3" start_page="300" end_page="300" type="sub_section">
      <SectionTitle>
2.3 Theme Ordering
</SectionTitle>
      <Paragraph position="0"> Once we filter out the themes that have a low rank, the next task is to order the selected themes into coherent text. Our ordering strategy aims to capture chronological order of the main events and ensure coherence. To implement this strategy in MultiGen, we select for each theme the sentence which has the earliest publication time (theme time stamp). To increase the coherence of the output text, we identify blocks of topically related themes and then apply chronological ordering on blocks of themes using theme time stamps (Barzilay, Elhadad, and McKeown 2002). These stages produce a sorted set of themes which are passed as input to the sentence fusion component, described in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="300" end_page="305" type="metho">
    <SectionTitle>
3. Sentence Fusion
</SectionTitle>
    <Paragraph position="0"> Given a group of similar sentences--a theme--the problem is to create a concise and fluent fusion of information, reflecting facts common to all sentences. (An example of a fusion sentence is shown in Table 1.) To achieve this goal we need to identify phrases common to most theme sentences, then combine them into a new sentence.</Paragraph>
    <Paragraph position="1"> At one extreme, we might consider a shallow approach to the fusion problem, adapting the &amp;quot;bag of words&amp;quot; approach. However, sentence intersection in a set-theoretic sense produces poor results. For example, the intersection of the first two sentences 2 Typically, Simfinder produces at least 20 themes given an average Newsblaster cluster of nine articles. The length of a generated summary typically does not exceed seven sentences.</Paragraph>
    <Paragraph position="2">  Computational Linguistics Volume 31, Number 3 from the theme shown in Table 1 is (the, fired, antitank, at, a, bulldozer). Besides its being ungrammatical, it is impossible to understand what event this intersection describes. The inadequacy of the bag-of-words method to the fusion task demonstrates the need for a more linguistically motivated approach. At the other extreme, previous approaches (Radev and McKeown 1998) have demonstrated that this task is feasible when a detailed semantic representation of the input sentences is available. However, these approaches operate in a limited domain (e.g., terrorist events), where information extraction systems can be used to interpret the source text. The task of mapping input text into a semantic representation in a domain-independent setting extends well beyond the ability of current analysis methods. These considerations suggest that we need a new method for the sentence fusion task. Ideally, such a method would not require a full semantic representation. Rather, it would rely on input texts and shallow linguistic knowledge (such as parse trees) that can be automatically derived from a corpus to generate a fusion sentence.</Paragraph>
    <Paragraph position="3"> In our approach, sentence fusion is modeled after the typical generation pipeline: content selection (what to say) and surface realization (how to say it). In contrast to that involved in traditional generation systems in which a content selection component chooses content from semantic units, our task is complicated by the lack of semantics in the textual input. At the same time, we can benefit from the textual information given in the input sentences for the tasks of syntactic realization, phrasing, and ordering; in many cases, constraints on text realization are already present in the input.</Paragraph>
    <Paragraph position="4"> The algorithm operates in three phases:  Content selection occurs primarily in the first phase, in which our algorithm uses local alignment across pairs of parsed sentences, from which we select fragments to be included in the fusion sentence. Instead of examining all possible ways to combine these fragments, we select a sentence in the input which contains most of the fragments and transform its parsed tree into the fusion lattice by eliminating nonessential information and augmenting it with information from other input sentences. This construction of the fusion lattice targets content selection, but in the process, alternative verbalizations are selected, and thus some aspects of realization are also carried out in this phase. Finally, we generate a sentence from this representation based on a language model derived from a large body of texts.</Paragraph>
    <Section position="1" start_page="301" end_page="305" type="sub_section">
      <SectionTitle>
3.1 Identification of Common Information
</SectionTitle>
      <Paragraph position="0"> Our task is to identify information shared between sentences. We do this by aligning constituents in the syntactic parse trees for the input sentences. Our alignment process differs considerably from alignment for other NL tasks, such as machine translation, because we cannot expect a complete alignment. Rather, a subset of the subtrees in one sentence will match different subsets of the subtrees in the others. Furthermore, order across trees is not preserved, there is no natural starting point for alignment, and there are no constraints on crosses. For these reasons we have developed a bottom-up local multisequence alignment algorithm that uses words and phrases as anchors for matching. This algorithm operates on the dependency trees for pairs of input sen- null Barzilay and McKeown Sentence Fusion for Multidocument News Summarization tences. We use a dependency-based representation because it abstracts over features irrelevant for comparison such as constituent ordering. In the subsections that follow, we describe first how this representation is computed, then how dependency subtrees are aligned, and finally how we choose between constituents conveying overlapping information.</Paragraph>
      <Paragraph position="1"> In this section we first describe an algorithm which, given a pair of sentences, determines which sentence constituents convey information appearing in both sentences. This algorithm will be applied to pairwise combinations of sentences in the input set of related sentences.</Paragraph>
      <Paragraph position="2"> The intuition behind the algorithm is to compare all constituents of one sentence to those of another and select the most similar ones. Of course, how this comparison is performed depends on the particular sentence representation used. A good sentence representation will emphasize sentence features that are relevant for comparison, such as dependencies between sentence constituents, while ignoring irrelevant features, such as constituent ordering. A representation which fits these requirements is a dependency-based representation (Melcuk 1988). We first detail how this representation is computed, then describe a method for aligning dependency subtrees.</Paragraph>
      <Paragraph position="3"> 3.1.1 Sentence Representation. Our sentence representation is based on a dependency tree, which describes the sentence structure in terms of dependencies between words.</Paragraph>
      <Paragraph position="4"> The similarity of the dependency tree to a predicate-argument structure makes it a natural representation for our comparison.</Paragraph>
      <Paragraph position="5">  This representation can be constructed from the output of a traditional parser. In fact, we have developed a rule-based component that transforms the phrase structure output of Collins's (2003) parser into a representation in which a node has a direct link to its dependents. We also mark verb-subject and verb-node dependencies in the tree.</Paragraph>
      <Paragraph position="6"> The process of comparing trees can be further facilitated if the dependency tree is abstracted to a canonical form which eliminates features irrelevant to the comparison. We hypothesize that the difference in grammatical features such as auxiliaries, number, and tense has a secondary effect when the meaning of sentences is being compared.</Paragraph>
      <Paragraph position="7"> Therefore, we represent in the dependency tree only nonauxiliary words with their associated grammatical features. For nouns, we record their number, articles, and class (common or proper). For verbs, we record tense, mood (indicative, conditional, or infinitive), voice, polarity, aspect (simple or continuous), and taxis (perfect or none). The eliminated auxiliary words can be re-created using these recorded features. We also transform all passive-voice sentences to the active voice, changing the order of affected children.</Paragraph>
      <Paragraph position="8"> While the alignment algorithm described in Section 3.1.2 produces one-to-one mappings, in practice some paraphrases are not decomposable to words, forming one-to-many or many-to-many paraphrases. Our manual analysis of paraphrased sentences (Barzilay 2003) revealed that such alignments most frequently occur in pairs of noun phrases (e.g., faculty member and professor) and pairs including verbs with particles (e.g., stand up, rise). To correctly align such phrases, we flatten subtrees containing noun phrases and verbs with particles into one node. We subsequently determine matches between flattened sentences using statistical metrics.</Paragraph>
      <Paragraph position="9"> 3 Two paraphrasing sentences which differ in word order may have significantly different trees in phrase-based format. For instance, this phenomenon occurs when an adverbial is moved from a position in the middle of a sentence to the beginning of a sentence. In contrast, dependency representations of these sentences are very similar.</Paragraph>
      <Paragraph position="10">  Computational Linguistics Volume 31, Number 3 Figure 3 Dependency tree of the sentence The IDF spokeswoman did not confirm this, but said the Palestinians fired an antitank missile at a bulldozer on the site. The features of the node confirm are explicitly marked.</Paragraph>
      <Paragraph position="11"> An example of a sentence and its dependency tree with associated features is shown in Figure 3. (In figures of dependency trees hereafter, node features are omitted for clarity.) 3.1.2 Alignment. Our alignment of dependency trees is driven by two sources of information: the similarity between the structure of the dependency trees and the similarity between lexical items. In determining the structural similarity between two trees, we take into account the types of edges (which indicate the relationships between nodes). An edge is labeled by the syntactic function of the two nodes it connects (e.g., subjectverb). It is unlikely that an edge connecting a subject and verb in one sentence, for example, corresponds to an edge connecting a verb and an adjective in another sentence. The word similarity measures take into account more than word identity: They also identify pairs of paraphrases, using WordNet and a paraphrasing dictionary. We automatically constructed the paraphrasing dictionary from a large comparable news corpus using the co-training method described in Barzilay and McKeown (2001). The dictionary contains pairs of word-level paraphrases as well as phrase-level paraphrases. null  Several examples of automatically extracted paraphrases are given in Table 2.</Paragraph>
      <Paragraph position="12"> During alignment, each pair of nonidentical words that do not comprise a synset in  Barzilay and McKeown Sentence Fusion for Multidocument News Summarization Table 2 Lexical paraphrases extracted by the algorithm from the comparable news corpus. (auto, automobile), (closing, settling), (rejected, does not accept), (military, army), (IWC, International Whaling Commission), (Japan, country), (researching, examining), (harvesting, killing), (mission-control office, control centers), (father, pastor), (past 50 years, four decades), (Wangler, Wanger), (teacher, pastor), (fondling, groping), (Kalkilya, Qalqilya), (accused, suspected), (language, terms), (head, president), (U.N., United Nations), (Islamabad, Kabul), (goes, travels), (said, testified), (article, report), (chaos, upheaval), (Gore, Lieberman), (revolt, uprising), (more restrictive local measures, stronger local regulations) (countries, nations), (barred, suspended), (alert, warning), (declined, refused), (anthrax, infection), (expelled, removed), (White House, White House spokesman Ari Fleischer), (gunmen, militants) WordNet is looked up in the paraphrasing dictionary; in the case of a match, the pair is considered to be a paraphrase.</Paragraph>
      <Paragraph position="13"> We now give an intuitive explanation of how our tree similarity function, denoted by Sim, is computed. If the optimal alignment of two trees is known, then the value of the similarity function is the sum of the similarity scores of aligned nodes and aligned edges. Since the best alignment of given trees is not known a priori, we select the maximal score among plausible alignments of the trees. Instead of exhaustively traversing the space of all possible alignments, we recursively construct the best alignment for trees of given depths, assuming that we know how to find an optimal alignment for trees of shorter depths. More specifically, at each point of the traversal we consider two cases, shown in Figure 4. In the first case, two top nodes are aligned with each other, and their children are aligned in an optimal way by applying the algorithm to shorter trees. In the second case, one tree is aligned with one of the children of the top node of the other tree; again we can apply our algorithm for this computation, since we decrease the height of one of the trees.</Paragraph>
      <Paragraph position="14"> Before giving the precise definition of Sim, we introduce some notation. When T is a tree with root node v,weletc(T) denote the set containing all children of v.</Paragraph>
      <Paragraph position="15"> For a tree T containing a node s,thesubtreeofT which has s as its root node is denoted</Paragraph>
      <Paragraph position="17"> Tree alignment computation. In the first case two tops are aligned, while in the second case the top of one tree is aligned with a child of another tree.</Paragraph>
      <Paragraph position="18">  Computational Linguistics Volume 31, Number 3 Given two trees T and T prime with root nodes v and v prime , respectively, the similar- null ), capture mappings in which the top of one tree is aligned with one of the children of the top node of the other tree (the bottom of Figure 4). The maximization in the NodeCompare formula searches for the best possible alignment for the child nodes of the given pair of nodes and is defined by  ) is the set of all possible matchings between A and A prime , and a matching (between A and A prime )isasubsetm of A x A prime such that for any two distinct elements (a, a prime ), (b, b prime ) [?] m,botha negationslash= b and a prime negationslash= b prime . In the base case, when one of the trees has depth one, NodeCompare(T, T prime ) is defined to be NodeSimilarity(v, v prime ).</Paragraph>
      <Paragraph position="19">  The similarity score NodeSimilarity(v, v prime ) of atomic nodes depends on whether the corresponding words are identical, paraphrases, or unrelated. The similarity scores for pairs of identical words, pairs of synonyms, pairs of paraphrases, and edges (given in Table 3) are manually derived using a small development corpus. While learning of the similarity scores automatically is an appealing alternative, its application in the fusion context is challenging because of the absence of a large training corpus and the lack of an automatic evaluation function.  The similarity of nodes containing flattened subtrees,  such as noun phrases, is computed as the score of their intersection normalized by the length of the longest phrase. For instance, the similarity score of the noun phrases antitank missile and machine gun and antitank missile is computed as a ratio between the score of their intersection antitank missile (2), divided by the length of the latter phrase (5).</Paragraph>
      <Paragraph position="20"> The similarity function Sim is computed using bottom-up dynamic programming, in which the shortest subtrees are processed first. The alignment algorithm returns the similarity score of the trees as well as the optimal mapping between the subtrees of input trees. The pseudocode of this function is presented in the Appendix. In the resulting tree mapping, the pairs of nodes whose NodeSimilarity positively contributed to the alignment are considered parallel. Figure 5 shows two dependency trees and their alignment.</Paragraph>
      <Paragraph position="21"> As is evident from the Sim definition, we are considering only one-to-one node &amp;quot;matchings&amp;quot;: Every node in one tree is mapped to at most one node in another tree. This restriction is necessary because the problem of optimizing many-to-many alignments 5 Our preliminary experiments with n-gram-based overlap measures, such as BLEU (Papineni et al. 2002) and ROUGE (Lin and Hovy 2003), show that these metrics do not correlate with human judgments on the fusion task, when tested against two reference outputs. This is to be expected: As lexical variability across input sentences grows, the number of possible ways to fuse them by machine as well by human also grows. The accuracy of match between the system output and the reference sentences largely depends on the features of the input sentences, rather than on the underlying fusion method.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="305" end_page="312" type="metho">
    <Paragraph position="0"> [Footnote 6: Pairs of phrases that form an entry in the paraphrasing dictionary are compared as pairs of atomic entries.] The subtree flattening performed during the preprocessing stage aims to minimize the negative effect of the restriction on alignment granularity.</Paragraph>
    <Paragraph position="1"> Another important property of our algorithm is that it produces a local alignment.</Paragraph>
    <Paragraph position="2"> Local alignment maps local regions with high similarity to each other rather than creating an overall optimal global alignment of the entire tree. This strategy is more meaningful when only partial meaning overlap is expected between input sentences, as in typical sentence fusion input. Only these high-similarity regions, which we call intersection subtrees, are included in the fusion sentence.</Paragraph>
    <Section position="1" start_page="306" end_page="309" type="sub_section">
      <SectionTitle>
3.2 Fusion Lattice Computation
</SectionTitle>
      <Paragraph position="0"> Fusion lattice computation is concerned with combining intersection subtrees. During this process, the system will remove phrases from a selected sentence, add phrases from other sentences, and replace words with the paraphrases that annotate each node. Among the many possible combinations of subtrees, we are interested only in those combinations which yield semantically sound sentences and do not distort the information presented in the input sentences. We cannot explore every possible combination, since the lack of semantic information in the trees prohibits us from assessing the quality of the resulting sentences. In fact, our early experimentation with generation from constituent phrases (e.g., NPs, VPs) demonstrated that it was difficult to ensure that semantically anomalous or ungrammatical sentences would not be generated. Instead, we select a combination already present in the input sentences as a basis and transform it into a fusion sentence by removing extraneous information and augmenting the fusion sentence with information from other sentences. The advantage of this strategy is that, when the initial sentence is semantically correct and the applied transformations aim to preserve semantic correctness, the resulting sentence is a semantically correct one. Our generation strategy is reminiscent of Robin and McKeown's (1996) earlier work on revision for summarization, although Robin and McKeown used a three-tiered representation of each sentence, including its semantics and its deep and surface syntax, all of which were used as triggers for revision.</Paragraph>
      <Paragraph position="1"> The three steps of the fusion lattice computation are as follows: selection of the basis tree, augmentation of the tree with alternative verbalizations, and pruning of  denote the number of nodes in the second tree. We assume that the branching factor of a parse tree is bounded above by a constant. The function NodeCompare is evaluated only once on each node pair. Therefore, it is evaluated n  times totally. Each evaluation is computed in constant time, assuming that values of the function for node children are known. Since we use memoization, the total time of the procedure is O(n  Computational Linguistics Volume 31, Number 3 Figure 5 Two dependency trees and their alignment tree. Solid lines represent aligned edges. Dotted and dashed lines represent unaligned edges of the theme sentences.</Paragraph>
      <Paragraph position="2"> the extraneous subtrees. Alignment is essential for all the steps. The selection of the basis tree is guided by the number of intersection subtrees it includes; in the best case, it contains all such subtrees. The basis tree is the centroid of the input sentences-the sentence which is the most similar to the other sentences in the input. Using the alignment-based similarity score described in Section 3.1.2, we identify the centroid by computing for each sentence the average similarity score between the sentence and the rest of the input sentences, then selecting the sentence with the highest score. Next, we augment the basis tree with information present in the other input sentences. More specifically, we add alternative verbalizations for the nodes in the basis tree and the intersection subtrees which are not part of the basis tree. The alternative verbalizations are readily available from the pairwise alignments of the basis tree with other trees in the input computed in the previous section. For each node of the basis tree, we record all verbalizations from the nodes of the other input trees aligned with a given node. A verbalization can be a single word, or it can be a phrase, if a node represents a noun compound or a verb with a particle. An example of a fusion lattice, augmented  Barzilay and McKeown Sentence Fusion for Multidocument News Summarization Figure 6 A basis lattice before and after augmentation. Solid lines represent aligned edges of the basis tree. Dashed lines represent unaligned edges of the basis tree, and dotted lines represent insertions from other theme sentences. Added subtrees correspond to sentences from Table 1. with alternative verbalizations, is given in Figure 6. Even after this augmentation, the fusion lattice may not include all of the intersection subtrees. The main difficulty in subtree insertion is finding an acceptable placement; this is often determined by syntactic, semantic, and idiosyncratic knowledge. Therefore, we follow a conservative insertion policy. Among all the possible aligned sentences, we insert only subtrees whose top node aligns with one of the nodes in a basis tree.</Paragraph>
      <Paragraph position="3">  We further constrain the insertion procedure by inserting only trees that appear in at least half of the sentences of a theme. These two  Computational Linguistics Volume 31, Number 3 constituent-level restrictions prevent the algorithm from generating overly long, unreadable sentences.</Paragraph>
      <Paragraph position="4">  Finally, subtrees which are not part of the intersection are pruned off the basis tree. However, removing all such subtrees may result in an ungrammatical or semantically flawed sentence; for example, we might create a sentence without a subject. This overpruning may happen if either the input to the fusion algorithm is noisy or the alignment has failed to recognize similar subtrees. Therefore, we perform a more conservative pruning, deleting only the self-contained components which can be removed without leaving ungrammatical sentences. As previously observed in the literature (Mani, Gates, and Bloedorn 1999; Jing and McKeown 2000), such components include a clause in the clause conjunction, relative clauses, and some elements within a clause (such as adverbs and prepositions). For example, this procedure transforms the lattice in Figure 6 into the pruned basis lattice shown in Figure 7 by deleting the clause the clash erupted and the verb phrase to better protect Israeli forces. These phrases are eliminated because they do not appear in the other sentences of the theme and at the same time their removal does not interfere with the well-formedness of the fusion sentence. Once these subtrees are removed, the fusion lattice construction is completed.</Paragraph>
    </Section>
    <Section position="2" start_page="309" end_page="312" type="sub_section">
      <SectionTitle>
3.3 Generation
</SectionTitle>
      <Paragraph position="0"> The final stage in sentence fusion is linearization of the fusion lattice. Sentence generation includes selection of a tree traversal order, lexical choice among available alternatives, and placement of auxiliaries, such as determiners. Our generation method utilizes information given in the input sentences to restrict the search space and then chooses among remaining alternatives using a language model derived from a large text collection. We first motivate the need for reordering and rephrasing, then discuss our implementation.</Paragraph>
      <Paragraph position="1"> For the word-ordering task, we do not have to consider all the possible traversals, since the number of valid traversals is limited by ordering constraints encoded in the fusion lattice. However, the basis lattice does not uniquely determine the ordering: The placement of trees inserted in the basis lattice from other theme sentences is not restricted by the original basis tree. While the ordering of many sentence constituents is determined by their syntactic roles, some constituents, such as time, location and manner circumstantials, are free to move (Elhadad et al. 2001). Therefore, the algorithm still has to select an appropriate order from among different orders of the inserted trees.</Paragraph>
      <Paragraph position="2"> The process so far produces a sentence that can be quite different from the extracted sentence; although the basis sentences provides guidance for the generation process, constituents may be removed, added in, or reordered. Wording can also be modified during this process. Although the selection of words and phrases which appear in the basis tree is a safe choice, enriching the fusion sentence with alternative verbalizations has several benefits. In applications such as summarization, in which the length of the produced sentence is a factor, a shorter alternative is desirable. This goal can be achieved by selecting the shortest paraphrase among available alternatives.</Paragraph>
      <Paragraph position="3"> Alternate verbalizations can also be used to replace anaphoric expressions, for instance, 9 Furthermore, the preference for shorter fusion sentences is further enforced during the linearization stage because our scoring function monotonically decreases with the length of a sentence.</Paragraph>
      <Paragraph position="4">  A pruned basis lattice.</Paragraph>
      <Paragraph position="5"> when the basis tree contains a noun phrase with anaphoric expressions (e.g., his visit) and one of the other verbalizations is anaphora-free. Substitution of the latter for the anaphoric expression may increase the clarity of the produced sentence, since frequently the antecedent of the anaphoric expression is not present in a summary. Moreover, in some cases substitution is mandatory. As a result of subtree insertions and deletions, the words used in the basis tree may not be a good choice after the transformations, and the best verbalization might be achieved by using a paraphrase of them from another theme sentence. As an example, consider the case of two paraphrasing verbs with different subcategorization frames, such as tell and say. If the phrase our correspondent is removed from the sentence Sharon told our correspondent that the elections were delayed . . . , a replacement of the verb told with said yields a more readable sentence.</Paragraph>
      <Paragraph position="6"> The task of auxiliary placement is alleviated by the presence of features stored in the input nodes. In most cases, aligned words stored in the same node have the same feature values, which uniquely determine an auxiliary selection and conjugation. However, in some cases, aligned words have different grammatical features, in which case the linearization algorithm needs to select among available alternatives.</Paragraph>
      <Paragraph position="7">  Computational Linguistics Volume 31, Number 3 Linearization of the fusion sentence involves the selection of the best phrasing and placement of auxiliaries as well as the determination of optimal ordering. Since we do not have sufficient semantic information to perform such selection, our algorithm is driven by corpus-derived knowledge. We generate all possible sentences  from the valid traversals of the fusion lattice and score their likelihood according to statistics derived from a corpus. This approach, originally proposed by Knight and Hatzivassiloglou (1995) and Langkilde and Knight (1998), is a standard method used in statistical generation. We trained a trigram model with Good-Turing smoothing over 60 megabytes of news articles collected by Newsblaster using the second version CMU-Cambridge Statistical Language Modeling toolkit (Clarkson and Rosenfeld 1997).</Paragraph>
      <Paragraph position="8"> The sentence with the lowest length-normalized entropy (the best score) is selected as the verbalization of the fusion lattice. Table 4 shows several verbalizations produced by our algorithm from the central tree in Figure 7. Here, we can see that the lowest-scoring sentence is both grammatical and concise.</Paragraph>
      <Paragraph position="9"> Table 4 also illustrates that entropy-based scoring does not always correlate with the quality of the generated sentence. For example, the fifth sentence in Table 4--Palestinians fired antitank missile at a bulldozer to build a new embankment in the area--is not a well-formed sentence; however, our language model gave it a better score than its well-formed alternatives, the second and the third sentences (see Section 4 for further discussion). Despite these shortcomings, we preferred entropy-based scoring to symbolic linearization. In the next section, we motivate our choice.</Paragraph>
      <Paragraph position="10">  system (Barzilay, McKeown, and Elhadad 1999), we performed linearization of a fusion dependency structure using the language generator FUF/SURGE (Elhadad and Robin 1996). As a large-scale linearizer used in many traditional semantic-to-text generation systems, FUF/SURGE could be an appealing solution to the task of surface realization. Because the input structure and the requirements on the linearizer are quite different in text-to-text generation, we had to design rules for mapping between dependency structures produced by the fusion component and FUF/SURGE input. For instance, FUF/SURGE requires that the input contain a semantic role for prepositional phrases, such as manner, purpose,orlocation, which is not present in our dependency representation; thus we had to augment the dependency representation with this information. In the case of inaccurate prediction or the lack of relevant semantic information, the linearizer scrambles the order of sentence constituents, selects wrong prepositions, or even fails to generate an output. Another feature of the FUF/SURGE system that negatively influences system performance is its limited ability to reuse phrases readily available in the input, instead of generating every phrase from scratch. This makes the generation process more complex and thus prone to error.</Paragraph>
      <Paragraph position="11"> While the initial experiments conducted on a set of manually constructed themes seemed promising, the system performance deteriorated significantly when it was applied to automatically constructed themes. Our experience led us to believe that transformation of an arbitrary sentence into a FUF/SURGE input representation is similar in its complexity to semantic parsing, a challenging problem in its own right. Rather than refining the mapping mechanism, we modified MultiGen to use a statis10 Because of the efficiency constraints imposed by Newsblaster, we sample only a subset of 20,000 paths. The sample is selected randomly.</Paragraph>
      <Paragraph position="12">  tical linearization component, which handles uncertainty and noise in the input in a more robust way.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="312" end_page="318" type="metho">
    <SectionTitle>
4. Sentence Fusion Evaluation
</SectionTitle>
    <Paragraph position="0"> In our previous work, we evaluated the overall summarization strategy of MultiGen in multiple experiments, including comparisons with human-written summaries in the Document Understanding Conference (DUC)  evaluation (McKeown et al. 2001; McKeown et al. 2002) and quality assessment in the context of a particular information access task in the Newsblaster framework (McKeown et al. 2002). In this article, we aim to evaluate the sentence fusion algorithm in isolation from other system components; we analyze the algorithm performance in terms of content selection and the grammaticality of the produced sentences. We first present our evaluation methodology (Section 4.1), then we describe our data (Section 4.2), the results (Section 4.3), and our analysis of them (Section 4.4).</Paragraph>
    <Section position="1" start_page="312" end_page="314" type="sub_section">
      <SectionTitle>
4.1 Methods
</SectionTitle>
      <Paragraph position="0"> paring an automatically generated sentence with a reference sentence. The reference sentence was produced by a human (hereafter the RFA), who was instructed to generate a sentence conveying information common to many sentences in a theme. The RFA was not familiar with the fusion algorithm. The RFA was provided with the list of theme sentences; the original documents were not included. The instructions given to the RFA included several examples of themes with fusion sentences generated by the authors. Even though the RFA was not instructed to use phrases from input sentences, the sentences presented as examples reused many phrases from the input sentences.</Paragraph>
      <Paragraph position="1"> We believe that phrase reuse elucidates the connection between input sentences and a resulting fusion sentence. Two examples of themes, reference sentences, and system outputs are shown in Table 5.</Paragraph>
      <Paragraph position="2">  automatically computed inputs which reflect the accuracy of the existing preprocessing tools. For this reason, the test data were selected randomly from material collected by Newsblaster. To remove themes irrelevant for fusion evaluation, we introduced two 11 DUC is a community-based evaluation of summarization systems organized by DARPA.  #1 Four people including an Islamic cleric have been detained in Pakistan after a fatal attack on a church on Christmas Day.</Paragraph>
      <Paragraph position="3"> #2 Police detained six people on Thursday following a grenade attack on a church that killed three girls and wounded 13 people on Christmas Day.</Paragraph>
      <Paragraph position="4"> #3 A grenade attack on a Protestant church in Islamabad killed five people, including a U.S. Embassy employee and her 17-year-old daughter.</Paragraph>
      <Paragraph position="5">  additional filters. First, we excluded themes that contained identical or nearly identical sentences (with cosine similarity higher than 0.8). When processing such sentences, our algorithm reduces to sentence extraction, which does not allow us to evaluate the generation abilities of our algorithm. Second, themes for which the RFA was unable to create a reference sentence were also removed from the test set. As mentioned above, Simfinder does not always produce accurate themes,  and therefore, the RFA could choose not to generate a reference sentence if the theme sentences had too little in common. An example of a theme for which no sentence was generated is shown in Table 6. As a result of this filtering, 34% of the sentences were removed. 4.1.3 Baselines. In addition to the system-generated sentence, we also included in the evaluation a fusion sentence generated by another human (hereafter, RFA2) and three baselines. (Following the DUC terminology, we refer to the baselines, our system, and the RFA2 as peers.) The first baseline is the shortest sentence among the theme sentences, which is obviously grammatical, and it also has a good chance of being representative of common topics conveyed in the input. The second baseline is produced by a simplification of our algorithm, where paraphrase information is omitted during the alignment process. This baseline is included to capture the contribution of paraphrase information to the performance of the fusion algorithm. The third baseline consists of the basis sentence. The comparison with this baseline reveals the contribution of the insertion and deletion stages in the fusion algorithm. The comparison against an RFA2 sentence provides an upper bound on the performance of the system and baselines. In addition, this comparison sheds light on the human agreement on this task.</Paragraph>
      <Paragraph position="6"> 12 To mitigate the effects of Simfinder noise in MultiGen, we induced a similarity threshold on input trees--trees which are not similar to the basis tree are not used in the fusion process.  Barzilay and McKeown Sentence Fusion for Multidocument News Summarization Table 6 An example of noisy Simfinder output.</Paragraph>
      <Paragraph position="7"> The shares have fallen 60% this year.</Paragraph>
      <Paragraph position="8"> They said Qwest was forcing them to exchange their bonds at a fraction of face value--between 52.5% and 82.5%, depending on the bond--or else fall lower in the pecking order for repayment in case Qwest went broke.</Paragraph>
      <Paragraph position="9"> Qwest had offered to exchange up to $12.9 billion of the old bonds, which carried interest rates between 5.875% and 7.9%.</Paragraph>
      <Paragraph position="10"> The new debt carries rates between 13% and 14%.</Paragraph>
      <Paragraph position="11"> Their yield fell to about 15.22% from 15.98%.</Paragraph>
      <Paragraph position="12">  tence along with the corresponding reference sentence. The judge also had access to the original theme from which these sentences were generated. The order of the presentation was randomized across themes and peer systems. Reference and peer sentences were divided into clauses by the authors. The judges assessed overlap on the clause level between reference and peer sentences. The wording of the instructions was inspired by the DUC instructions for clause comparison. For each clause in the reference sentence, the judge decided whether the meaning of a corresponding clause was conveyed in a peer sentence. In addition to 0 score for no overlap and 1 for full overlap, this framework allows for partial overlap with a score of 0.5. From the overlap data, we computed weighted recall and precision based on fractional count (Hatzivassiloglou and McKeown 1993). Recall is a ratio of weighted clause overlap between a peer and a reference sentence, and the number of clauses in a reference sentence. Precision is a ratio of weighted clause overlap between a peer and a reference sentence, and the number of clauses in a peer sentence.</Paragraph>
      <Paragraph position="13">  grammatical (3), partially grammatical (2), and not grammatical (1). The judge was instructed to rate a sentence in the grammatical category if it contained no grammatical mistakes. Partially grammatical included sentences that contained at most one mistake in agreement, articles, and tense realization. The not grammatical category included sentences that were corrupted by multiple mistakes of the former type, by erroneous component order or by the omission of important components (e.g., subject).</Paragraph>
      <Paragraph position="14"> Punctuation is one issue in assessing grammaticality. Improper placement of punctuation is a limitation of our implementation of the sentence fusion algorithm that we are well aware of.</Paragraph>
      <Paragraph position="15">  Therefore, in our grammaticality evaluation (following the DUC procedure), the judge was asked to ignore punctuation.</Paragraph>
    </Section>
    <Section position="2" start_page="314" end_page="315" type="sub_section">
      <SectionTitle>
4.2 Data
</SectionTitle>
      <Paragraph position="0"> To evaluate our sentence fusion algorithm, we selected 100 themes following the procedure described in the previous section. Each set varied from three to seven sentences, 13 We were unable to develop a set of rules which works in most cases. Punctuation placement is determined by a variety of features; considering all possible interactions of these features is hard. We believe that corpus-based algorithms for automatic restoration of punctuation developed for speech recognition applications (Beeferman, Berger, and Lafferty 1998; Shieber and Tao 2003) could help in our task, and we plan to experiment with them in the future.</Paragraph>
      <Paragraph position="1">  Computational Linguistics Volume 31, Number 3 with 4.22 sentences on average. The generated fusion sentences consisted of 1.91 clauses on average. None of the sentences in the test set were fully extracted; on average, each sentence fused fragments from 2.14 theme sentences. Out of 100 sentence, 57 sentences produced by the algorithm combined phrases from several sentences, while the rest of the sentences comprised subsequences of one of the theme sentences. (Note that compression is different from sentence extraction.) We included these sentences in the evaluation, because they reflect both content selection and realization capacities of the algorithm.</Paragraph>
      <Paragraph position="2"> Table 5 shows two sentences from the test corpus, along with input sentences. The examples are chosen so as to reflect good- and bad-performance cases. Note that the first example results in inclusion of the essential information (the fact that bodies were found, along with time and place) and leaves out details (that it was a remote location or how many miles west it was, a piece of information that is in dispute in any case). The problematic example incorrectly selects the number of people killed as six, even though this number is not repeated and different numbers are referred to in the text. This mistake is caused by a noisy entry in our paraphrasing dictionary which erroneously identifies &amp;quot;five&amp;quot; and &amp;quot;six&amp;quot; as paraphrases of each other.</Paragraph>
    </Section>
    <Section position="3" start_page="315" end_page="315" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> Table 7 shows the length ratio, precision, recall, F-measure, and grammaticality score for each algorithm. The length ratio of a sentence was computed as the ratio of its output length to the average length of the theme input sentences.</Paragraph>
    </Section>
    <Section position="4" start_page="315" end_page="318" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> The results in Table 7 demonstrate that sentences manually generated by the second human participant (RFA2) not only are the shortest, but are also closest to the reference sentence in terms of selected information. The tight connection  between sentences generated by the RFAs establishes a high upper bound for the fusion task. While neither our system nor the baselines were able to reach this level of performance, the fusion algorithm clearly outperforms all the baselines in terms of content selection, at a reasonable level of compression. The performance of baseline 1 and baseline 2 demonstrates that neither the shortest sentence nor the basis sentence is an adequate substitution for fusion in terms of content selection. The gap in recall between our system and baseline 3 confirms our hypothesis about the importance of paraphrasing information for the fusion process. Omission of paraphrases causes an 8% drop in recall due to the inability to match equivalent phrases with different wording. Table 7 also reveals a downside of the fusion algorithm: Automatically generated sentences contain grammatical errors, unlike fully extracted, human-written sentences. Given the high sensitivity of humans to processing ungrammatical sentences, one has to consider the benefits of flexible information selection against the decrease in readability of the generated sentences. Sentence fusion may not be a worthy direction to pursue if low grammaticality is intrinsic to the algorithm and its correction requires 14 We cannot apply kappa statistics (Siegel and Castellan 1988) for measuring agreement in the content selection task since the event space is not well-defined. This prevents us from computing the probability of random agreement.</Paragraph>
      <Paragraph position="1">  Evaluation results for a human-crafted fusion sentence (RFA2), our system output, the shortest sentence in the theme (baseline 1), the basis sentence (baseline 2), and a simplified version of our algorithm without paraphrasing information (baseline 3).</Paragraph>
      <Paragraph position="2">  knowledge which cannot be automatically acquired. In the remainder of the section, we show that this is not the case. Our manual analysis of generated sentences revealed that most of the grammatical mistakes are caused by the linearization component, or more specifically, by suboptimal scoring of the language model. Language modeling is an active area of research, and we believe that advances in this direction will be able to dramatically boost the linearization capacity of our algorithm.</Paragraph>
      <Paragraph position="3">  mistakes in content selection and surface realization. Note that in some cases multiple errors are entwined in one sentence, which makes it hard to distinguish between a sequence of independent mistakes and a cause-and-effect chain. Therefore, the presented counts should be viewed as approximations, rather than precise numbers. We start with the analysis of the test set and continue with the description of some interesting mistakes that we encountered during system development.</Paragraph>
      <Paragraph position="4"> Mistakes in Content Selection. Most of the mistakes in content selection can be attributed to problems with alignment. In most cases (17), erroneous alignments missed relevant word mappings as a result of the lack of a corresponding entry in our paraphrasing resources. At the same time, mapping of unrelated words (as shown in Table 5) was quite rare (two cases). This performance level is quite predictable given the accuracy of an automatically constructed dictionary and limited coverage of WordNet. Even in the presence of accurate lexical information, the algorithm occasionally produced suboptimal alignments (four cases) because of the simplicity of our weighting scheme, which supports limited forms of mapping typology and also uses manually assigned weights.</Paragraph>
      <Paragraph position="5"> Another source of errors (two cases) was the algorithm's inability to handle many-to-many alignments. Namely, two trees conveying the same meaning may not be decomposable into the node-level mappings which our algorithm aims to compute.</Paragraph>
      <Paragraph position="6"> For example, the mapping between the sentences in Table 8 expressed by the rule X denied claims by Y - X said that Y's claim was untrue cannot be decomposed into smaller matching units. At least two mistakes resulted from noisy preprocessing (tokenization and parsing).</Paragraph>
      <Paragraph position="7"> In addition to alignment, overcutting during lattice pruning caused the omission of three clauses that were present in the corresponding reference sentences. The sentence Conservatives were cheering language is an example of an incomplete sentence derived from the following input sentence: Conservatives were cheering language in the final version  Computational Linguistics Volume 31, Number 3 Table 8 A pair of sentences which cannot be fully decomposed.</Paragraph>
      <Paragraph position="8"> Syria denied claims by Israeli Prime Minister Ariel Sharon . . .</Paragraph>
      <Paragraph position="9"> The Syrian spokesman said that Sharon's claim was untrue . . .</Paragraph>
      <Paragraph position="10"> that ensures that one-third of all funds for prevention programs be used to promote abstinence. The omission of a relative clause was possible because some sentences in the input theme contained the noun language without any relative clauses.</Paragraph>
      <Paragraph position="11"> Mistakes in Surface Realization. Grammatical mistakes included incorrect selection of determiners, erroneous word ordering, omission of essential sentence constituents, and incorrect realization of negation constructions and tense. These mistakes (42) originated during linearization of the lattice and were caused either by incompleteness of the linearizer or by suboptimal scoring of the language model. Mistakes of the first type are caused by missing rules for generating auxiliaries given node features. An example of this phenomenon is the sentence The coalition to have play a central role, which verbalizes the verb construction will have to play incorrectly. Our linearizer lacks the completeness of existing application-independent linearizers, such as the unification-based FUF/SURGE (Elhadad and Robin 1996) and the probabilistic Fergus (Bangalore and Rambow 2000). Unfortunately, we were unable to reuse any of the existing large-scale linearizers because of significant structural differences between input expected by these linearizers and the format of a fusion lattice. We are currently working on adapting Fergus for the sentence fusion task.</Paragraph>
      <Paragraph position="12"> Mistakes related to suboptimal scoring were the most common (33 out of 42); in these cases, a language model selected ill-formed sentences, assigning a worse score to a better sentence. The sentence The diplomats were given to leave the country in 10 days illustrates a suboptimal linearization of the fusion lattice. The correct linearizations--The diplomats were given 10 days to leave the country and The diplomats were ordered to leave the country in 10 days--were present in the fusion lattice, but the language model picked the incorrect verbalization. We found that in 27 cases the optimal verbalizations (in the authors' view) were ranked below the top-10 sentences ranked by the language model. We believe that more powerful language models that incorporate linguistic knowledge (such as syntax-based models) can improve the quality of generated sentences.</Paragraph>
      <Paragraph position="13">  we also regularly track the quality of generated summaries on Newsblaster's Web page. We have noted a number of interesting errors that crop up from time to time that seem to require information about the full syntactic parse, semantics, or even discourse. Consider, for example, the last sentence from a summary entitled Estrogen- null was created by sentence fusion and clearly, there is a problem. Certainly, there was a study finding the risk of dementia in women who took one type of combined hormone pill, but it was not the government study which was abruptly halted last summer. In looking at the two sentences from which this summary sentence was drawn, we can see that there is a good amount of overlap between the two, but the component does not have enough information about the referents of the different terms to know that two different  An example of wrong reference selection. Subscripts in the generated sentence indicate the theme sentence from which the words were extracted.</Paragraph>
      <Paragraph position="14"> #1 Last summer, a government study was abruptly halted after finding an increased risk of breast cancer, heart attacks, and strokes in women who took one type of combined hormone pill.</Paragraph>
      <Paragraph position="15"> #2 The most common form of hormone replacement therapy, already linked to breast cancer, stroke, and heart disease, does not improve mental functioning as some earlier studies suggested and may increase the risk of dementia, researchers said on Tuesday.  studies are involved and that fusion should not take place. One topic of our future work (Section 6) is the problem of reference and summarization.</Paragraph>
      <Paragraph position="16"> Another example is shown in Table 10. Here again, the problem is reference. The first error is in the references to the segments.Thetwousesofsegments in the first source document sentence do not refer to the same entity and thus, when the modifier is dropped, we get an anomaly. The second, more unusual problem is in the equation of Clinton/Dole, Dole/Clinton,andClinton and Dole.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>