<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1057"> <Title>A Noisy-Channel Model for Document Compression</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Document Compression </SectionTitle> <Paragraph position="0"> The document compression task is conceptually simple. Given a document D = w1 w2 ... wn, our goal is to produce a new document D' by &quot;dropping&quot; words wi from D. In order to achieve this goal, we extend the noisy-channel model proposed by Knight & Marcu (2000).1 (1: A number of other systems use the outputs of extractive summarizers and repair them to improve coherence (DUC, 2001; DUC, 2002). Unfortunately, none of these seems flexible enough to produce in one shot good summaries that are simultaneously coherent and grammatical.)</Paragraph> <Paragraph position="1"> Their system compressed sentences by dropping syntactic constituents, but could be applied to entire documents only on a sentence-by-sentence basis. As discussed in Section 1, this is not adequate because the resulting summary may contain many compressed sentences that are irrelevant. In order to extend Knight & Marcu's approach beyond the sentence level, we need to &quot;glue&quot; sentences together in a tree structure similar to that used at the sentence level. Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) provides us this &quot;glue.&quot; The tree in Figure 1 depicts the RST structure of Text (1). In RST, discourse structures are non-binary trees whose leaves correspond to elementary discourse units (EDUs), and whose internal nodes correspond to contiguous text spans. Each internal node in an RST tree is characterized by a rhetorical relation. For example, the first sentence in Text (1) provides BACKGROUND information for interpreting the information in sentences 2 and 3, which are in a CONTRAST relation (see Figure 1).
Each relation holds between two adjacent non-overlapping text spans called NUCLEUS and SATELLITE. (There are a few exceptions to this rule: some relations, such as LIST and CONTRAST, are multinuclear.) The distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer's purpose than the satellite.</Paragraph> <Paragraph position="2"> Our system is able to analyze both the discourse structure of a document and the syntactic structure of each of its sentences or EDUs. It then compresses the document by dropping either syntactic or discourse constituents.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Noisy-Channel Model </SectionTitle> <Paragraph position="0"> For a given document D, we want to find the summary text S that maximizes P(S|D). Using Bayes' rule, we flip this so we end up maximizing P(D|S)P(S). Thus, we are left with modelling two probability distributions: P(D|S), the probability of a document D given a summary S, and P(S), the probability of a summary. We assume that we are given the discourse structure of each document and the syntactic structure of each of its EDUs.</Paragraph> <Paragraph position="1"> The intuitive way of thinking about this application of Bayes' rule, referred to as the noisy-channel model, is that we start with a summary S and add &quot;noise&quot; to it, yielding a longer document D.
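In log space, this maximization amounts to picking the candidate summary with the highest combined channel and source score. The sketch below is purely illustrative: the candidate set and the two score tables are invented stand-ins for the paper's actual models.

```python
import math

def select_summary(candidates, log_p_d_given_s, log_p_s):
    """Pick the summary S maximizing log P(D|S) + log P(S)."""
    return max(candidates, key=lambda s: log_p_d_given_s[s] + log_p_s[s])

# Hypothetical scores: the channel prefers "short", the source prefers "long".
channel = {"short": math.log(0.9), "long": math.log(0.2)}
source = {"short": math.log(0.1), "long": math.log(0.5)}

best = select_summary(["short", "long"], channel, source)
print(best)
```

Note that a summary strongly favored by one model can still lose overall, since the decoder trades the two scores off against each other.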
The noise added in our model consists of words, phrases, and discourse units.</Paragraph> <Paragraph position="2"> For instance, given the document &quot;John Doe has secured the vote of most democrats,&quot; we could add words to it (namely the word &quot;already&quot;) to generate &quot;John Doe has already secured the vote of most democrats.&quot; We could also choose to add an entire syntactic constituent, for instance a prepositional phrase, to generate &quot;John Doe has secured the vote of most democrats in his constituency.&quot; These are both examples of sentence expansion as used previously by Knight & Marcu (2000).</Paragraph> <Paragraph position="3"> Our system, however, also has the ability to expand on a core message by adding discourse constituents. For instance, it could decide to add another discourse constituent to the original summary &quot;John Doe has secured the vote of most democrats&quot; by CONTRASTing the information in the summary with the uncertainty regarding the support of the governor, thus yielding the text: &quot;John Doe has secured the vote of most democrats. But without the support of the governor, he is still on shaky ground.&quot; As in any noisy-channel application, there are three parts that we have to account for if we are to build a complete document compression system: the channel model, the source model, and the decoder.</Paragraph> <Paragraph position="4"> We describe each of these below.</Paragraph> <Paragraph position="5"> The source model assigns to a string the probability P(S), the probability that the summary S is good English. Ideally, the source model should disfavor ungrammatical sentences and documents containing incoherently juxtaposed sentences.</Paragraph> <Paragraph position="6"> The channel model assigns to any document/summary pair a probability P(D|S). This models the extent to which D is a good expansion of S.
For instance, if S is &quot;The mayor is now looking for re-election,&quot; D1 is &quot;The mayor is now looking for re-election. He has to secure the vote of the democrats,&quot; and D2 is &quot;The mayor is now looking for re-election. Sharks have sharp teeth,&quot; we expect P(D1|S) to be higher than P(D2|S) because D1 expands on S by elaboration, while D2 shifts to a different topic, yielding an incoherent text.</Paragraph> <Paragraph position="7"> The decoder searches through all possible summaries of a document D for the summary S that maximizes the posterior probability P(D|S)P(S).</Paragraph> <Paragraph position="8"> Each of these parts is described below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Source model </SectionTitle> <Paragraph position="0"> The job of the source model is to assign a score P(S) to a compression independent of the original document. That is, the source model should measure how good English a summary is (independent of whether it is a good compression or not). Currently, we use a bigram measure of quality (trigram scores were also tested but failed to make a difference), combined with non-lexicalized context-free syntactic probabilities and context-free discourse probabilities, giving P(S) = P_bigram(S) · P_synt(S) · P_disc(S).
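A minimal sketch of this factored source score, in log space: the syntax and discourse PCFG components are stubbed out here, and a toy add-one-smoothed bigram model over an invented six-word corpus stands in for the real language model.

```python
import math
from collections import Counter

# Toy bigram model with add-one smoothing over a tiny invented corpus.
corpus = "the mayor is looking for re-election".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def bigram_lp(tokens):
    """Smoothed bigram log-probability of a token sequence."""
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )

def source_score(tokens, syn_lp=0.0, disc_lp=0.0):
    """log P(S) = log P_bigram(S) + log P_synt(S) + log P_disc(S).
    The PCFG terms are placeholders; a real system would compute them
    from parsed syntax and discourse trees."""
    return bigram_lp(tokens) + syn_lp + disc_lp

print(source_score("the mayor is looking".split()))
```

As expected of a language model, attested word orders score higher than unattested ones, which is what lets the source model penalize disfluent compressions.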
It would be better to use a lexicalized context-free grammar, but that was not possible given the decoder used.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Channel model </SectionTitle> <Paragraph position="0"> The channel model is allowed to add syntactic constituents (through a stochastic operation called constituent-expand) or discourse units (through another stochastic operation called EDU-expand).</Paragraph> <Paragraph position="1"> Both of these operations are performed on a combined discourse/syntax tree called the DS-tree. The DS-tree for Text (1) is shown in Figure 1 for reference. Suppose we start with the summary S = &quot;The mayor is looking for re-election.&quot; A constituent-expand operation could insert a syntactic constituent, such as &quot;this year,&quot; anywhere in the syntactic tree of S. A constituent-expand operation could also add single words: for instance, the word &quot;now&quot; could be added between &quot;is&quot; and &quot;looking,&quot; yielding D = &quot;The mayor is now looking for re-election.&quot; The probability of inserting a word is based on the syntactic structure of the node into which it is inserted.</Paragraph> <Paragraph position="2"> Knight and Marcu (2000) describe in detail a noisy-channel model that explains how short sentences can be expanded into longer ones by inserting and expanding syntactic constituents (and words). Since our constituent-expand stochastic operation simply reimplements Knight and Marcu's model, we do not focus on it here; we refer the reader to (Knight and Marcu, 2000) for the details.</Paragraph> <Paragraph position="3"> In addition to adding syntactic constituents, our system is also able to add discourse units.
Consider the summary S = &quot;John Doe has already secured the vote of most democrats in his constituency.&quot; Through a sequence of discourse expansions, we can expand upon this summary to reach the original text. A complete discourse expansion process that would occur starting from this initial summary to generate the original document is shown in Figure 2.</Paragraph> <Paragraph position="4"> In this figure, we can follow the sequence of steps required to generate our original text, beginning with our summary S. First, through an operation D-Project (&quot;D&quot; for &quot;Discourse&quot;), we increase the depth of the tree, adding an intermediate NUC=SPAN node. This projection adds a factor of P(Nuc=Span → Nuc=Span | Nuc=Span) to the probability of this sequence of operations (as shown under the arrow).</Paragraph> <Paragraph position="5"> We are now able to perform the second operation, D-Expand, with which we expand on the core message contained in S by adding a satellite that evaluates the information presented in S. This expansion adds the probability of performing the expansion (called the discourse expansion probability, P_D-exp). An example discourse expansion probability, written P(Nuc=Span → Nuc=Span Sat=Eval | Nuc=Span → Nuc=Span), reflects the probability of adding an evaluation satellite onto a nuclear span.</Paragraph> <Paragraph position="6"> The rest of Figure 2 shows some of the remaining steps needed to produce the original document, each step labeled with the appropriate probability factors. The probability of the entire expansion is then the product of all the listed probabilities, combined with the appropriate probabilities from the syntax side.
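Assuming each expansion step contributes one independent probability factor, the channel score for a single expansion path can be sketched as below. The rule names and probability values are invented for illustration; they are not estimates from the paper's corpus.

```python
import math

# One hypothetical path of expansion operations from summary S to document D.
# Each step pairs a human-readable rule label with its (invented) probability.
expansion_path = [
    ("D-Project: Nuc=Span -> Nuc=Span | Nuc=Span", 0.6),
    ("D-Expand: add Sat=Evaluation under Nuc=Span", 0.3),
    ("Constituent-expand: insert 'already' into the VP", 0.1),
]

def channel_log_prob(path):
    """log P(D|S): sum of the log-probabilities of every expansion step,
    i.e. the log of the product of the step probabilities."""
    return sum(math.log(p) for _, p in path)

print(round(channel_log_prob(expansion_path), 4))
```

Working in log space avoids underflow when a path contains many low-probability steps, which matters for long documents.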
In order to produce the final score P(D|S) for a document/summary pair, we multiply together each of the expansion probabilities in the path leading from S to D.</Paragraph> <Paragraph position="7"> For estimating the parameters of the discourse models, we used an RST corpus of 385 Wall Street Journal articles from the Penn Treebank, which we obtained from LDC. The documents in the corpus range in size from 31 to 2124 words, with an average of 458 words per document. Each document is paired with a discourse structure that was manually built in the style of RST. (See (Carlson et al., 2001) for details concerning the corpus and the annotation process.) From this corpus, we were able to estimate the parameters of a discourse PCFG using standard maximum likelihood methods.</Paragraph> <Paragraph position="8"> Furthermore, 150 documents from the same corpus are paired with extractive summaries at the EDU level. Human annotators were asked which EDUs were most important; suppose that in the example DS-tree (Figure 1) the annotators marked the second and fifth EDUs (the starred ones). These stars are propagated up, so that any discourse unit that has a descendant considered important is also considered important. From these annotations, we could deduce that, to compress a NUC=CONTRAST that has two children, NUC=SPAN and SAT=EVALUATION, we can drop the evaluation satellite. Similarly, we can compress a NUC=CONTRAST that has two children, SAT=CONDITION and NUC=SPAN, by dropping the first discourse constituent. Finally, we can compress the ROOT deriving into SAT=BACKGROUND NUC=SPAN by dropping the SAT=BACKGROUND constituent.
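Such deductions can be turned into expansion probabilities by relative frequency. A sketch on invented observations, where each pair records a compressed node label and the expanded form it was observed to derive:

```python
from collections import Counter

# Hypothetical (compressed form, expanded form) observations read off
# the EDU-level annotations; the counts below are invented.
observations = [
    ("Nuc=Span", "Nuc=Span Sat=Evaluation"),
    ("Nuc=Span", "Nuc=Span Sat=Evaluation"),
    ("Nuc=Span", "Sat=Condition Nuc=Span"),
    ("Root", "Sat=Background Nuc=Span"),
]

counts = Counter(observations)
totals = Counter(lhs for lhs, _ in observations)

# Relative-frequency (maximum likelihood) estimate of each expansion:
# count of the expansion divided by the count of its compressed form.
probs = {pair: c / totals[pair[0]] for pair, c in counts.items()}
print(probs[("Nuc=Span", "Nuc=Span Sat=Evaluation")])
```

Because each compressed form's expansions are normalized against the same denominator, the probabilities conditioned on a given node label sum to one.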
We keep counts of each of these examples and, once collected, normalize them to get the discourse expansion probabilities.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Decoder </SectionTitle> <Paragraph position="0"> The goal of the decoder is to combine P(S) with P(D|S) to get P(S|D). There is a vast number of potential compressions of a large DS-tree, but we can efficiently pack them into a shared-forest structure, as described in detail by Knight & Marcu (2000). Each entry in the shared-forest structure has three associated probabilities: one from the source syntax PCFG, one from the source discourse PCFG, and one from the expansion-template probabilities described in Section 3.2. Once we have generated a forest representing all possible compressions of the original document, we want to extract the best (or the n-best) trees, taking into account both the expansion probabilities of the channel model and the bigram, syntax, and discourse PCFG probabilities of the source model. Thankfully, such a generic extractor has already been built (Langkilde, 2000).</Paragraph> <Paragraph position="1"> For our purposes, the extractor selects the trees with the best combination of LM and expansion scores after performing an exhaustive search over all possible summaries. It returns a list of such trees, one for each possible length.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 System </SectionTitle> <Paragraph position="0"> The system works in a pipelined fashion, as shown in Figure 3. The first step along the pipeline is to generate the discourse structure. To do this, we use the decision-based discourse parser described by Marcu (2000).2
Once we have the discourse structure, we send each EDU off to a syntactic parser (Collins, 1997). (2: The discourse parser achieves good f-scores for elementary-unit identification, hierarchical span identification, nuclearity identification, and relation tagging; see Marcu (2000) for the exact figures.) The syntax trees of the EDUs are then merged with the discourse tree in the forest generator to create a DS-tree similar to that shown in Figure 1. From this DS-tree, we generate a forest that subsumes all possible compressions. This forest is then passed on to the forest-ranking system used as the decoder (Langkilde, 2000).</Paragraph> <Paragraph position="1"> The decoder gives us a list of possible compressions, one for each possible length. Example compressions of Text (1) are shown in Figure 4, together with their respective log-probabilities.</Paragraph> <Paragraph position="2"> In order to choose the &quot;best&quot; compression at any possible length, we cannot rely only on the log-probabilities, lest the system always choose the shortest possible compression. To compensate for this, we normalize by length. In practice, however, simply dividing the log-probability by the length of the compression is insufficient for longer documents. Experimentally, we found that a reasonable metric was, for a compression of length n, to divide each log-probability by n^1.2. This is the job of the length chooser from Figure 3, which enabled us to choose a single compression for each document for use in evaluation. (In Figure 4, the compression chosen by the length selector is italicized; it was the shortest one.3)</Paragraph> </Section> </Paper>