A Noisy-Channel Model for Document Compression
Hal Daumé III and Daniel Marcu
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
{hdaume,marcu}@isi.edu
Abstract
We present a document compression sys-
tem that uses a hierarchical noisy-channel
model of text production. Our compres-
sion system first automatically derives the
syntactic structure of each sentence and
the overall discourse structure of the text
given as input. The system then uses a sta-
tistical hierarchical model of text produc-
tion in order to drop non-important syn-
tactic and discourse constituents so as to
generate coherent, grammatical document
compressions of arbitrary length. The sys-
tem outperforms both a baseline and a
sentence-based compression system that
operates by simplifying sequentially all
sentences in a text. Our results support
the claim that discourse knowledge plays
an important role in document summariza-
tion.
1 Introduction
Single document summarization systems proposed
to date fall within one of the following three classes:
Extractive summarizers simply select and present
to the user the most important sentences in
a text — see (Mani and Maybury, 1999;
Marcu, 2000; Mani, 2001) for comprehensive
overviews of the methods and algorithms used
to accomplish this.
Headline generators are noisy-channel probabilis-
tic systems that are trained on large corpora
of ⟨Headline, Text⟩ pairs (Banko et al., 2000;
Berger and Mittal, 2000). These systems pro-
duce short sequences of words that are indica-
tive of the content of the text given as input.
Sentence simplification systems (Chandrasekar et
al., 1996; Mahesh, 1997; Carroll et al., 1998;
Grefenstette, 1998; Jing, 2000; Knight and
Marcu, 2000) are capable of compressing long
sentences by deleting unimportant words and
phrases.
Extraction-based summarizers often produce out-
puts that contain non-important sentence fragments.
For example, the hypothetical extractive summary
of Text (1), which is shown in Table 1, can be com-
pacted further by deleting the clause “which is al-
ready almost enough to win”. Headline-based sum-
maries, such as that shown in Table 1, are usually
indicative of a text’s content but not informative,
grammatical, or coherent. By repeatedly applying a
sentence-simplification algorithm one sentence at a
time, one can compress a text; yet, the outputs gen-
erated in this way are likely to be incoherent and
to contain unimportant information. When summa-
rizing text, some sentences should be dropped alto-
gether.
Ideally, we would like to build systems that have
the strengths of all three of these classes of approaches.
The “Document Compression” entry in Table 1
shows a grammatical, coherent summary of Text (1),
which was generated by a hypothetical document
compression system that preserves the most impor-
tant information in a text while deleting sentences,
phrases, and words that are subsidiary to the main
message of the text. Obviously, generating coher-
ent, grammatical summaries such as that produced
by the hypothetical document compression system
in Table 1 is not trivial because of many conflicting
Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, July 2002, pp. 449-456.
Type of       Hypothetical output                                 Output contains   Output is   Output is
summarizer                                                        only important    coherent    grammatical
                                                                  info
Extractive    John Doe has already secured the vote of most       -                 -           ✓
summarizer    democrats in his constituency, which is already
              almost enough to win. But without the support
              of the governor, he is still on shaky ground.
Headline      mayor vote constituency governor                    ✓                 -           -
generator
Sentence      The mayor is now looking for re-election. John Doe  -                 -           ✓
simplifier    has already secured the vote of most democrats
              in his constituency. He is still on shaky ground.
Document      John Doe has secured the vote of most democrats.    ✓                 ✓           ✓
compressor    But he is still on shaky ground.
Table 1: Hypothetical outputs generated by various types of summarizers.
goals.¹ The deletion of certain sentences may result
in incoherence and information loss. The deletion of
certain words and phrases may also lead to ungram-
maticality and information loss.
(1) The mayor is now looking for re-election. John Doe
    has already secured the vote of most democrats in his
    constituency, which is already almost enough to win.
    But without the support of the governor, he is still on
    shaky ground.
In this paper, we present a document compression
system that uses hierarchical models of discourse
and syntax in order to simultaneously manage all
these conflicting goals. Our compression system
first automatically derives the syntactic structure of
each sentence and the overall discourse structure of
the text given as input. The system then uses a sta-
tistical hierarchical model of text production in or-
der to drop non-important syntactic and discourse
units so as to generate coherent, grammatical doc-
ument compressions of arbitrary length. The system
outperforms both a baseline and a sentence-based
compression system that operates by simplifying se-
quentially all sentences in a text.
2 Document Compression
The document compression task is conceptually
simple. Given a document $D = \langle w_1 w_2 \ldots w_n \rangle$, our
goal is to produce a new document $D'$ by “dropping”
words $w_i$ from $D$. In order to achieve this goal, we
¹ A number of other systems use the outputs of extractive
summarizers and repair them to improve coherence (DUC,
2001; DUC, 2002). Unfortunately, none of these seems flexible
enough to produce in one shot good summaries that are
simultaneously coherent and grammatical.
extend the noisy-channel model proposed by Knight
& Marcu (2000). Their system compressed sentences
by dropping syntactic constituents, but could
be applied to entire documents only on a sentence-
by-sentence basis. As discussed in Section 1, this
is not adequate because the resulting summary may
contain many compressed sentences that are irrele-
vant. In order to extend Knight & Marcu’s approach
beyond the sentence level, we need to “glue” sen-
tences together in a tree structure similar to that used
at the sentence level. Rhetorical Structure Theory
(RST) (Mann and Thompson, 1988) provides us with this
“glue.”
The tree in Figure 1 depicts the RST structure
of Text (1). In RST, discourse structures are non-
binary trees whose leaves correspond to elementary
discourse units (EDUs), and whose internal nodes
correspond to contiguous text spans. Each internal
node in an RST tree is characterized by a rhetor-
ical relation. For example, the first sentence in
Text (1) provides BACKGROUND information for inter-
preting the information in sentences 2 and 3, which
are in a CONTRAST relation (see Figure 1). Each re-
lation holds between two adjacent non-overlapping
text spans called NUCLEUS and SATELLITE. (There are
a few exceptions to this rule: some relations, such
as LIST and CONTRAST, are multinuclear.) The dis-
tinction between nuclei and satellites comes from
the empirical observation that the nucleus expresses
what is more essential to the writer’s purpose than
the satellite.
Our system is able to analyze both the discourse
structure of a document and the syntactic structure
of each of its sentences or EDUs. It then compresses
the document by dropping either syntactic or dis-
course constituents.
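As a rough illustration of the structures just described, the following Python sketch represents the discourse tree of Text (1) with nucleus and satellite labels and then drops every satellite. The class name `Node`, the function `drop_satellites`, and the drop-all-satellites rule are assumptions made for illustration only; the system described in this paper decides what to drop with a statistical model, not with this fixed rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A discourse-tree node: a leaf (EDU) carries text, an internal node carries children."""
    label: str                      # e.g. "Nuc=Span", "Sat=Evaluation"
    text: Optional[str] = None      # set only for leaves (EDUs)
    children: List["Node"] = field(default_factory=list)

    @property
    def is_satellite(self) -> bool:
        return self.label.startswith("Sat=")


def drop_satellites(node: Node) -> List[str]:
    """Return the EDU texts left when every satellite is dropped (illustrative rule only)."""
    if node.text is not None:
        return [node.text]
    kept: List[str] = []
    for child in node.children:
        if not child.is_satellite:
            kept.extend(drop_satellites(child))
    return kept


# Discourse tree of Text (1), following the relations named in the paper.
text1 = Node("Root", children=[
    Node("Sat=Background", text="The mayor is now looking for re-election."),
    Node("Nuc=Span", children=[
        Node("Nuc=Contrast", children=[
            Node("Nuc=Span", text="John Doe has already secured the vote of most democrats in his constituency,"),
            Node("Sat=Evaluation", text="which is already almost enough to win."),
        ]),
        Node("Nuc=Contrast", children=[
            Node("Sat=Condition", text="But without the support of the governor,"),
            Node("Nuc=Span", text="he is still on shaky ground."),
        ]),
    ]),
])

nuclei = drop_satellites(text1)
```

Under this simplistic rule, `nuclei` holds only the two nuclear EDUs; the leftover comma after "constituency" hints at why syntactic compression is needed alongside discourse compression.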
3 A Noisy-Channel Model
For a given document $D$, we want to find the
summary text $S$ that maximizes $P(S \mid D)$. Using
Bayes rule, we flip this so we end up maximizing
$P(D \mid S) P(S)$. Thus, we are left with modelling two
probability distributions: $P(D \mid S)$, the probability of
a document $D$ given a summary $S$, and $P(S)$, the
probability of a summary. We assume that we are
given the discourse structure of each document and
the syntactic structures of each of its EDUs.
The intuitive way of thinking about this application
of Bayes rule, referred to as the noisy-channel
model, is that we start with a summary $S$ and add
“noise” to it, yielding a longer document $D$. The
noise added in our model consists of words, phrases
and discourse units.
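The Bayes-rule step can be written out as a short derivation. The symbol $S^{*}$ for the selected summary is notation introduced here for illustration; the denominator $P(D)$ is dropped because it does not depend on $S$.

```latex
\begin{align*}
S^{*} &= \arg\max_{S} P(S \mid D) \\
      &= \arg\max_{S} \frac{P(D \mid S)\, P(S)}{P(D)} \\
      &= \arg\max_{S} P(D \mid S)\, P(S)
\end{align*}
```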
For instance, given the document “John Doe has
secured the vote of most democrats.” we could add
words to it (namely the word “already”) to gener-
ate “John Doe has already secured the vote of most
democrats.” We could also choose to add an en-
tire syntactic constituent, for instance a prepositional
phrase, to generate “John Doe has secured the vote
of most democrats in his constituency.” These are
both examples of sentence expansion as used previ-
ously by Knight & Marcu (2000).
Our system, however, also has the ability to ex-
pand on a core message by adding discourse con-
stituents. For instance, it could decide to add another
discourse constituent to the original summary “John
Doe has secured the vote of most democrats” by
CONTRASTing the information in the summary with
the uncertainty regarding the support of the gover-
nor, thus yielding the text: “John Doe has secured
the vote of most democrats. But without the support
of the governor, he is still on shaky ground.”
As in any noisy-channel application, there are
three parts that we have to account for if we are to
build a complete document compression system: the
channel model, the source model and the decoder.
We describe each of these below.
The source model assigns to a string the probability
$P(S)$, the probability that the summary $S$
is good English. Ideally, the source model
should disfavor ungrammatical sentences and
documents containing incoherently juxtaposed
sentences.
The channel model assigns to any document/summary
pair a probability $P(D \mid S)$.
This models the extent to which $D$ is a good
expansion of $S$. For instance, if $S$ is “The
mayor is now looking for re-election.”, $D_1$ is
“The mayor is now looking for re-election.
He has to secure the vote of the democrats.”
and $D_2$ is “The mayor is now looking for
re-election. Sharks have sharp teeth.”, we
expect $P(D_1 \mid S)$ to be higher than $P(D_2 \mid S)$
because $D_1$ expands on $S$ by elaboration,
while $D_2$ shifts to a different topic, yielding an
incoherent text.
The decoder searches through all possible summaries
of a document $D$ for the summary
$S$ that maximizes the posterior probability
$P(D \mid S) P(S)$.
Each of these parts is described below.
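To make the division of labor between the three parts concrete, here is a minimal Python sketch. The candidate summaries and all probability values are invented for illustration, and the names `candidates`, `log_score`, and `decode` are hypothetical; a real decoder searches a packed forest rather than a hand-written dictionary.

```python
import math

# Hypothetical candidates for one document D: each summary S has a made-up
# source probability P(S) and a made-up channel probability P(D | S).
candidates = {
    "John Doe has secured the vote of most democrats.": {"p_source": 0.020, "p_channel": 0.10},
    "John Doe secured vote democrats.": {"p_source": 0.001, "p_channel": 0.30},
    "The mayor is now looking for re-election.": {"p_source": 0.030, "p_channel": 0.01},
}


def log_score(entry):
    """log P(D | S) + log P(S), the quantity the decoder maximizes."""
    return math.log(entry["p_channel"]) + math.log(entry["p_source"])


def decode(cands):
    """Return the candidate summary with the highest combined score."""
    return max(cands, key=lambda s: log_score(cands[s]))


best = decode(candidates)
```

With these invented numbers, the ungrammatical candidate is penalized by the source model and the off-topic candidate by the channel model, so the first candidate wins.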
3.1 Source model
The job of the source model is to assign a score
$P(S)$ to a compression independent of the original
document. That is, the source model should measure
how good English a summary is (independent of
whether it is a good compression or not). Currently,
we use a bigram measure of quality (trigram scores
were also tested but failed to make a difference),
combined with non-lexicalized context-free syntactic
probabilities and context-free discourse probabilities,
giving $P(S) = P_{\mathrm{bigram}}(S) \cdot P_{\mathrm{PCFG}}(S) \cdot P_{\mathrm{DPCFG}}(S)$.
It would be better to use a lexicalized
context-free grammar, but that was not possible
given the decoder used.
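The product of the three source-model factors becomes a sum in log space. The sketch below is a toy version under stated assumptions: the two-sentence training corpus is invented, add-one smoothing is a choice made here (the paper does not say how its bigram model was smoothed), and the syntax and discourse PCFG log-probabilities are passed in as plain numbers rather than computed from trees.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])                  # counts of bigram histories
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = {w for tokens in sentences for w in tokens} | {"</s>"}
    return unigrams, bigrams, len(vocab)


def bigram_logprob(tokens, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability (the smoothing is an assumption)."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def source_logprob(tokens, model, syntax_logprob, discourse_logprob):
    """log P(S) = log P_bigram(S) + log P_PCFG(S) + log P_DPCFG(S)."""
    return bigram_logprob(tokens, *model) + syntax_logprob + discourse_logprob


# Tiny invented corpus, used only to show the mechanics.
corpus = [
    "the mayor is looking for re-election".split(),
    "the mayor is on shaky ground".split(),
]
model = train_bigram(corpus)
fluent = bigram_logprob("the mayor is looking".split(), *model)
scrambled = bigram_logprob("looking is mayor the".split(), *model)
```

Even on this tiny corpus the bigram factor prefers the word order it has seen (`fluent > scrambled`), which is the role the source model plays in disfavoring ungrammatical compressions.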
3.2 Channel model
The channel model is allowed to add syntactic
constituents (through a stochastic operation called
constituent-expand) or discourse units (through an-
other stochastic operation called EDU-expand).
Both of these operations are performed on a com-
bined discourse/syntax tree called the DS-tree. The
DS-tree for Text (1) is shown in Figure 1 for refer-
ence.
[Figure 1 (tree diagram). Recoverable discourse structure; EDUs marked * are
the starred ones:
Root
  Sat=Background: “The mayor is now looking for re-election.”
      (drawn with its syntactic parse; the other EDUs are drawn as text only)
  Nuc=Span
    Nuc=Contrast
      Nuc=Span *: “John Doe has already secured the vote of most democrats in his constituency,”
      Sat=Evaluation: “which is already almost enough to win.”
    Nuc=Contrast
      Sat=Condition: “But without the support of the governor,”
      Nuc=Span *: “he is still on shaky ground.”]
Figure 1: The discourse (full)/syntax (partial) tree for Text (1).
Suppose we start with the summary $S$ = “The
mayor is looking for re-election.” A constituent-expand
operation could insert a syntactic constituent,
such as “this year” anywhere in the syntactic
tree of $S$. A constituent-expand operation could
also add single words: for instance the word “now”
could be added between “is” and “looking,” yielding
$D$ = “The mayor is now looking for re-election.”
The probability of inserting this word is based on
the syntactic structure of the node into which it’s
inserted.
Knight and Marcu (2000) describe in detail a
noisy-channel model that explains how short sen-
tences can be expanded into longer ones by inserting
and expanding syntactic constituents (and words).
Since our constituent-expand stochastic operation
simply reimplements Knight and Marcu’s model, we
do not focus on it here. We refer the reader
to (Knight and Marcu, 2000) for the details.
In addition to adding syntactic constituents, our
system is also able to add discourse units. Consider
the summary $S$ = “John Doe has already secured the
vote of most democrats in his constituency.” Through
a sequence of discourse expansions, we can expand
upon this summary to reach the original text. A complete
discourse expansion process that would occur
starting from this initial summary to generate the
original document is shown in Figure 2.
In this figure, we can follow the sequence of
steps required to generate our original text, beginning
with our summary $S$. First, through an operation
D-Project (“D” for “D”iscourse), we increase
the depth of the tree, adding an intermediate
NUC=SPAN node. This projection adds a factor of
$P(\text{Nuc=Span} \rightarrow \text{Nuc=Span} \mid \text{Nuc=Span})$ to the probability
of this sequence of operations (as is shown under
the arrow).
We are now able to perform the second operation,
D-Expand, with which we expand on the core message
contained in $S$ by adding a satellite which evaluates
the information presented in $S$. This expansion
adds the probability of performing the expansion
(called the discourse expansion probability, $P_{DE}$).
An example discourse expansion probability, written
$P(\text{Nuc=Span} \rightarrow \text{Nuc=Span Sat=Eval} \mid \text{Nuc=Span} \rightarrow \text{Nuc=Span})$,
reflects the probability of adding an evaluation
satellite onto a nuclear span.
The rest of Figure 2 shows some of the remaining
steps to produce the original document, each step
labeled with the appropriate probability factors. Then,
the probability of the entire expansion is the product
of all those listed probabilities combined with
the appropriate probabilities from the syntax side of
things. In order to produce the final score $P(D \mid S)$
for a document/summary pair, we multiply together
each of the expansion probabilities in the path leading
from $S$ to $D$.
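Multiplying the expansion probabilities along a path is usually done as a sum of logs to avoid underflow. The following sketch shows this for a partial, hypothetical path; the rule strings mirror the factors named in the text, but the numeric values and the names `EXPANSION_PROB` and `channel_logprob` are invented for illustration, and the syntax-side factors are left out.

```python
import math

# Hypothetical expansion probabilities, keyed by (expanded rule, compressed form).
# The numbers are made up; only the shape of the computation matters here.
EXPANSION_PROB = {
    ("Nuc=Span -> Nuc=Span", "Nuc=Span"): 0.40,                             # D-Project
    ("Nuc=Span -> Nuc=Span Sat=Evaluation", "Nuc=Span -> Nuc=Span"): 0.10,  # D-Expand
    ("Root -> Sat=Background Nuc=Span", "Root -> Nuc=Span"): 0.25,          # D-Expand
}


def channel_logprob(steps, table):
    """Sum the log-probabilities of the expansion steps on a path from S to D."""
    return sum(math.log(table[step]) for step in steps)


path = [
    ("Nuc=Span -> Nuc=Span", "Nuc=Span"),
    ("Nuc=Span -> Nuc=Span Sat=Evaluation", "Nuc=Span -> Nuc=Span"),
    ("Root -> Sat=Background Nuc=Span", "Root -> Nuc=Span"),
]
log_p = channel_logprob(path, EXPANSION_PROB)
p = math.exp(log_p)  # product of the three factors, up to floating-point rounding
```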
[Figure 2 (sequence of tree diagrams). Starting from the single EDU “John Doe
has already secured the vote of most democrats in his constituency,” labeled
Nuc=Span, a sequence of D-Project and D-Expand steps rebuilds the full
discourse tree of Text (1). The probability factors that appear in the figure are:
  P(Nuc=Span -> Nuc=Span | Nuc=Span)
  P(Nuc=Span -> Nuc=Span Sat=evaluation | Nuc=Span -> Nuc=Span)
  P(Nuc=Span -> Nuc=Contrast | Nuc=Span)
  P(Nuc=Span -> Nuc=Contrast Nuc=Contrast | Nuc=Span -> Nuc=Contrast)
  P(Nuc=Contrast -> Nuc=Span | Nuc=Contrast)*
  P(Nuc=Contrast -> Sat=condition Nuc=Span | Nuc=Contrast -> Nuc=Span)
  P(Root -> Sat=Background Nuc=Span | Root -> Nuc=Span)]
Figure 2: A sequence of discourse expansions for Text (1) (with probability factors).
For estimating the parameters for the discourse
models, we used an RST corpus of 385 Wall Street
Journal articles from the Penn Treebank, which we
obtained from LDC. The documents in the corpus
range in size from 31 to 2124 words, with an average
of 458 words per document. Each document
is paired with a discourse structure that was manually
built in the style of RST. (See (Carlson et al.,
2001) for details concerning the corpus and the
annotation process.) From this corpus, we were able
to estimate parameters for a discourse PCFG using
standard maximum likelihood methods.
Furthermore, 150 documents from the same corpus
are paired with extractive summaries on the EDU
level. Human annotators were asked which EDUs
were most important; suppose in the example DS-
tree (Figure 1) the annotators marked the second
and fifth EDUs (the starred ones). These stars are
propagated up, so that any discourse unit that has
a descendant considered important is also considered
important. From these annotations, we could
deduce that, to compress a NUC=CONTRAST that has
two children, NUC=SPAN and SAT=EVALUATION, we
can drop the evaluation satellite. Similarly, we can
compress a NUC=CONTRAST that has two children,
SAT=CONDITION and NUC=SPAN by dropping the first
discourse constituent. Finally, we can compress the
ROOT deriving into SAT=BACKGROUND NUC=SPAN by
dropping the SAT=BACKGROUND constituent. We keep
counts of each of these examples and, once col-
lected, we normalize them to get the discourse ex-
pansion probabilities.
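The count-and-normalize step can be sketched in a few lines of Python. The observation list below is invented, and normalizing over all full rules that share the same compressed form is an assumption made here to match the conditional notation used for the expansion probabilities; the function name `estimate_expansion_probs` is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_expansion_probs(observations):
    """Relative-frequency estimate of P(full rule | compressed rule).

    observations: iterable of (compressed_rule, full_rule) pairs, one per
    annotated example in which the full rule was compressed to the short one.
    """
    pair_counts = Counter(observations)
    totals = Counter()
    for (compressed, _full), count in pair_counts.items():
        totals[compressed] += count
    probs = defaultdict(dict)
    for (compressed, full), count in pair_counts.items():
        probs[compressed][full] = count / totals[compressed]
    return dict(probs)


# Invented observations in the spirit of the examples in the text.
observations = [
    ("Nuc=Contrast -> Nuc=Span", "Nuc=Contrast -> Nuc=Span Sat=Evaluation"),
    ("Nuc=Contrast -> Nuc=Span", "Nuc=Contrast -> Nuc=Span Sat=Evaluation"),
    ("Nuc=Contrast -> Nuc=Span", "Nuc=Contrast -> Sat=Condition Nuc=Span"),
    ("Nuc=Contrast -> Nuc=Span", "Nuc=Contrast -> Nuc=Span"),
    ("Root -> Nuc=Span", "Root -> Sat=Background Nuc=Span"),
]
probs = estimate_expansion_probs(observations)
```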
3.3 Decoder
The goal of the decoder is to combine $P(S)$ with
$P(D \mid S)$ to get $P(S \mid D)$. There are a vast number
of potential compressions of a large DS-tree, but
we can efficiently pack them into a shared-forest
structure, as described in detail by Knight & Marcu
(2000). Each entry in the shared-forest structure has
three associated probabilities, one from the source
syntax PCFG, one from the source discourse PCFG
and one from the expansion-template probabilities
described in Section 3.2. Once we have generated a
forest representing all possible compressions of the
original document, we want to extract the best (or
the $n$-best) trees, taking into account both the
expansion probabilities of the channel model and the
bigram and syntax and discourse PCFG probabilities
of the source model. Thankfully, such a generic
extractor has already been built (Langkilde, 2000).
For our purposes, the extractor selects the trees with
the best combination of language-model (LM) and
expansion scores after performing an exhaustive search
over all possible summaries. It returns a list of such
trees, one for each possible length.
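The shape of the decoder's output (one best candidate per length) can be illustrated with a brute-force search over a very short input. This sketch deliberately replaces the packed shared forest with plain enumeration of word subsequences, which is only feasible for a handful of words, and `toy_score` with its `WEIGHT` table is a hypothetical stand-in for the real combination of source and channel scores.

```python
from itertools import combinations


def all_compressions(tokens):
    """Yield every subsequence of tokens (2**n of them, including the empty one)."""
    n = len(tokens)
    for k in range(n + 1):
        for keep in combinations(range(n), k):
            yield [tokens[i] for i in keep]


def best_per_length(tokens, score):
    """Map each nonzero length to its highest-scoring compression."""
    best = {}
    for cand in all_compressions(tokens):
        if not cand:
            continue
        s = score(cand)
        if len(cand) not in best or s > best[len(cand)][0]:
            best[len(cand)] = (s, cand)
    return best


# Hypothetical per-word weights standing in for the real model score.
WEIGHT = {"John": 3.0, "Doe": 3.0, "has": 1.0, "already": 0.2, "secured": 2.5, "votes": 2.0}


def toy_score(cand):
    return sum(WEIGHT[w] for w in cand)


tokens = "John Doe has already secured votes".split()
best = best_per_length(tokens, toy_score)
```

For six tokens the enumeration visits 64 subsequences; the count doubles with every added word, which is why a packed forest and a generic extractor are used in practice.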
4 System
The system developed works in a pipelined fashion
as shown in Figure 3. The first step along the
pipeline is to generate the discourse structure. To
do this, we use the decision-based discourse parser
described by Marcu (2000).² Once we have the discourse
structure, we send each EDU off to a syntactic
parser (Collins, 1997). The syntax trees of
the EDUs are then merged with the discourse tree
in the forest generator to create a DS-tree similar to
that shown in Figure 1. From this DS-tree we generate
a forest that subsumes all possible compressions.
This forest is then passed on to the forest ranking
system which is used as decoder (Langkilde, 2000).
The decoder gives us a list of possible compressions,
for each possible length. Example compressions of
Text (1) are shown in Figure 4 together with their
respective log-probabilities.
² The discourse parser achieves an f-score of 38.2 for EDU
identification, 50.0 for identifying hierarchical spans, 39.9 for
nuclearity identification and 23.4 for relation tagging.
[Figure 3 (block diagram): Input Document -> Discourse Parser -> Syntax Parser ->
Forest Generator -> Decoder -> Length Chooser -> Output Summary]
Figure 3: The pipeline of system components.
In order to choose the “best” compression at
any possible length, we cannot rely only on the
log-probabilities, lest the system always choose the
shortest possible compression. In order to compensate
for this, we normalize by length. However, in
practice, simply dividing the log-probability by the
length of the compression is insufficient for longer
documents. Experimentally, we found that a reasonable
metric was, for a compression of length $n$, to divide
each log-probability by $n^{1.2}$. This was the job of
the length chooser from Figure 3, and enabled us
to choose a single compression for each document,
which was used for evaluation. (In Figure 4, the
compression chosen by the length chooser is italicized
and was the shortest one.³)
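The length normalization can be sketched as follows. The exponent 1.2 comes from the text; everything else is an assumption of this sketch, including the candidate triples (which are invented rather than taken from Figure 4), the function names, and in particular the rule that the candidate with the highest normalized score is the one selected.

```python
def normalized_score(logprob, length, exponent=1.2):
    """Divide a log-probability by length**exponent (1.2 is the value given in the text)."""
    return logprob / (length ** exponent)


def choose_length(candidates, exponent=1.2):
    """Pick one candidate from (length, logprob, text) triples.

    Selecting the maximum normalized score is an assumption of this sketch.
    """
    return max(candidates, key=lambda c: normalized_score(c[1], c[0], exponent))


# Hypothetical (length, log-probability, compression) triples.
candidates = [
    (5, -60.0, "short"),
    (10, -100.0, "medium"),
    (20, -300.0, "long"),
]
chosen = choose_length(candidates)
```

With raw log-probabilities the shortest candidate would always win here; after normalization the middle candidate is selected, which is the kind of correction the length chooser is meant to provide.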
5 Results
For testing, we began with two sets of data. The
first set is drawn from the Wall Street Journal (WSJ)
portion of the Penn Treebank and consists of 16 documents,
each containing between 41 and 87 words.
The second set is drawn from a collection of student
compositions and consists of 5 documents, each
containing between 64 and 91 words. We call this
set the MITRE corpus (Hirschman et al., 1999). We
would have liked to run evaluations on longer documents.
Unfortunately, the forests generated even for
relatively small documents are huge. Because there
are an exponential number of summaries that can be
generated for any given text,⁴ the decoder runs out
of memory for longer documents; therefore, we selected
shorter subtexts from the original documents.
³ This tends to be the case for very short documents, as the
compressions never get sufficiently long for the length
normalization to have an effect.
We used both the WSJ and Mitre data for evaluation
because we wanted to see whether the performance
of our system varies with text genre. The
Mitre data consists mostly of short sentences (average
document length from Mitre is 6 sentences),
quite in contrast to the typically long sentences in
the Wall Street Journal articles (average document
length from WSJ is 3.25 sentences).
For purposes of comparison, the Mitre data was
compressed using five systems:
Random: Drops random words (each word has a
50% chance of being dropped); this is the baseline.
Hand: Hand compressions done by a human.
Concat: Each sentence is compressed individually;
the results are concatenated together, using
Knight & Marcu’s (2000) system here for com-
parison.
EDU: The system described in this paper.
Sent: Because syntactic parsers tend not to work
well parsing just clauses, this system merges
together leaves in the discourse tree which are
in the same sentence, and then proceeds as de-
scribed in this paper.
The Wall Street Journal data was evaluated on the
above five systems as well as two additions. Since
the correct discourse trees were known for these
data, we thought it wise to test the systems using
these human-built discourse trees, instead of the
automatically derived ones. The two additional
systems were:
PD-EDU: Same as EDU except using the perfect
discourse trees, available from the RST corpus
(Carlson et al., 2001).
PD-Sent: The same as Sent except using the perfect
discourse trees.
⁴ In theory, a text of $n$ words has $2^n$ possible compressions.
len   log prob    best compression
  8   -118.9060   Mayor is now looking which is enough.
 13   -137.1010   The mayor is now looking which is already almost enough to win.
 16   -147.5970   The mayor is now looking but without support, he is still on shaky ground.
 18   -160.4310   Mayor is now looking but without the support of governor, he is still on shaky ground.
 22   -176.1990   The mayor is now looking for re-election but without the support of the governor, he is
                  still on shaky ground.
 28   -239.9490   The mayor is now looking which is already almost enough to win. But without the support
                  of the governor, he is still on shaky ground.
Figure 4: Possible compressions for Text (1).
Six human evaluators rated the systems according to
three metrics. The first two, presented together to
the evaluators, were grammaticality and coherence;
the third, presented separately, was summary quality.
Grammaticality was a judgment of how good
the English of the compressions was; coherence
included how well the compression flowed (for
instance, anaphors lacking an antecedent would lower
coherence). Summary quality, on the other hand,
was a judgment of how well the compression retained
the meaning of the original document. Each
measure was rated on a scale from 1 (worst) to 5
(best).
We can draw several conclusions from the evaluation
results shown in Table 2 along with average
compression rate (Cmp, the length of the compressed
document divided by the original length).⁵
First, it is clear that genre influences the results.
Because the Mitre data contained mostly short sentences,
the syntax and discourse parsers made fewer
errors, which allowed for better compressions to be
generated. For the Mitre corpus, compressions obtained
starting from discourse trees built above the
sentence level were better than compressions obtained
starting from discourse trees built above the
EDU level. For the WSJ corpus, compressions obtained
starting from discourse trees built above the
sentence level were more grammatical, but less coherent
than compressions obtained starting from discourse
trees built above the EDU level. Choosing the
manner in which the discourse and syntactic representations
of texts are mixed should be influenced by
the genre of the texts one is interested in compressing.
⁵ We did not run the system on the MITRE data with perfect
discourse trees because we did not have hand-built discourse
trees for this corpus.
                     WSJ                          Mitre
          Cmp   Grm   Coh   Qual      Cmp   Grm   Coh   Qual
Random    0.51  1.60  1.58  2.13      0.47  1.43  1.77  1.80
Concat    0.44  3.30  2.98  2.70      0.42  2.87  2.50  2.08
EDU       0.49  3.36  3.33  3.03      0.47  3.40  3.30  2.60
Sent      0.47  3.45  3.16  2.88      0.44  4.27  3.63  3.36
PD-EDU    0.47  3.61  3.23  2.95      -     -     -     -
PD-Sent   0.48  3.96  3.65  2.84      -     -     -     -
Hand      0.59  4.65  4.48  4.53      0.46  4.97  4.80  4.52
Table 2: Evaluation Results
The compressions obtained starting from per-
fectly derived discourse trees indicate that perfect
discourse structures help greatly in improving coher-
ence and grammaticality of generated summaries. It
was surprising to see that the summary quality was
affected negatively by the use of perfect discourse
structures (although the effect was not statistically significant). We
believe this happened because the text fragments we
summarized were extracted from longer documents.
It is likely that had the discourse structures been built
specifically for these short text snippets, they would
have been different. Moreover, there was no compo-
nent designed to handle cohesion; thus it is to be ex-
pected that many compressions would contain dan-
gling references.
Overall, all our systems outperformed both the
Random baseline and the Concat system, which
empirically shows that discourse has an important
role in document summarization. We performed
t-tests on the results and found that on the Wall Street
Journal data, the differences in score between the
Concat and Sent systems for grammaticality and
coherence were statistically significant at the 95%
level, but the difference in score for summary quality
was not. For the Mitre data, the differences in score
between the Concat and Sent systems for grammati-
cality and summary quality were statistically signif-
icant at the 95% level, but the difference in score for
coherence was not. The score differences for gram-
maticality, coherence, and summary quality between
our systems and the baselines were statistically sig-
nificant at the 95% level.
The results in Table 2, which can also be assessed
by inspecting the compressions in Figure 4,
show that, in spite of our success, we are still far
away from human performance levels. An error that
our system makes often is that of dropping complements
that cannot be dropped, such as the phrase
“for re-election”, which is the complement of “is
looking”. We are currently experimenting with
lexicalized models of syntax that would prevent our
compression system from dropping required verb
arguments. We are also considering methods for scaling
up the decoder to handle documents of more realistic
length.
Acknowledgements
This work was partially supported by DARPA-ITO
grant N66001-00-1-9814, NSF grant IIS-0097846,
and a USC Dean Fellowship to Hal Daumé III.
Thanks to Kevin Knight for discussions related to
the project.
References
Michele Banko, Vibhu Mittal, and Michael Witbrock.
2000. Headline generation based on statistical trans-
lation. In Proceedings of the 38th Annual Meeting of
the Association for Computational Linguistics (ACL–
2000), pages 318–325, Hong Kong, October 1–8.
Adam Berger and Vibhu Mittal. 2000. Query-relevant
summarization using FAQs. In Proceedings of the
38th Annual Meeting of the Association for Computa-
tional Linguistics (ACL–2000), pages 294–301, Hong
Kong, October 1–8.
Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski.
2001. Building a discourse-tagged corpus in the
framework of rhetorical structure theory. In Pro-
ceedings of the 2nd SIGDIAL Workshop on Discourse
and Dialogue, Eurospeech 2001, Aalborg, Denmark,
September.
John Carroll, Guido Minnen, Yvonne Canning, Siobhan
Devlin, and John Tait. 1998. Practical simplification
of English newspaper text to assist aphasic readers. In
Proceedings of the AAAI-98 Workshop on Integrating
Artificial Intelligence and Assistive Technology.
R. Chandrasekar, Christy Doran, and Srinivas Bangalore.
1996. Motivations and methods for text simplifica-
tion. In Proceedings of the Sixteenth International
Conference on Computational Linguistics (COLING
’96), Copenhagen, Denmark.
Michael Collins. 1997. Three generative, lexicalized
models for statistical parsing. In Proceedings of the
35th Annual Meeting of the Association for Compu-
tational Linguistics (ACL–97), pages 16–23, Madrid,
Spain, July 7-12.
DUC. 2001. Proceedings of the First Document Understanding
Conference (DUC-2001), New Orleans, LA, September.
DUC. 2002. Proceedings of the Second Document Understanding
Conference (DUC-2002), Philadelphia, PA, July.
Gregory Grefenstette. 1998. Producing intelligent tele-
graphic text reduction to provide an audio scanning
service for the blind. In Working Notes of the AAAI
Spring Symposium on Intelligent Text Summarization,
pages 111–118, Stanford University, CA, March 23-
25.
L. Hirschman, M. Light, E. Breck, and J. Burger. 1999.
Deep read: A reading comprehension system. In Pro-
ceedings of the 37th Annual Meeting of the Association
for Computational Linguistics.
H. Jing. 2000. Sentence reduction for automatic text
summarization. In Proceedings of the First Annual
Meeting of the North American Chapter of the Asso-
ciation for Computational Linguistics NAACL-2000,
pages 310–315, Seattle, WA.
Kevin Knight and Daniel Marcu. 2000. Statistics-based
summarization — step one: Sentence compression.
In The 17th National Conference on Artificial Intelli-
gence (AAAI–2000), pages 703–710, Austin, TX, July
30th – August 3rd.
Irene Langkilde. 2000. Forest-based statistical sentence
generation. In Proceedings of the 1st Annual Meeting
of the North American Chapter of the Association for
Computational Linguistics, Seattle, Washington, April
30–May 3.
Kavi Mahesh. 1997. Hypertext summary extraction for
fast document browsing. In Proceedings of the AAAI
Spring Symposium on Natural Language Processing
for the World Wide Web, pages 95–103.
Inderjeet Mani and Mark Maybury, editors. 1999. Ad-
vances in Automatic Text Summarization. The MIT
Press.
Inderjeet Mani. 2001. Automatic summarization.
William C. Mann and Sandra A. Thompson. 1988.
Rhetorical structure theory: Toward a functional the-
ory of text organization. Text, 8(3):243–281.
Daniel Marcu. 2000. The Theory and Practice of Dis-
course Parsing and Summarization. The MIT Press,
Cambridge, Massachusetts.
