Towards Statistical Paraphrase Generation: Preliminary Evaluations of
Grammaticality
Stephen Wan¹,² Mark Dras¹ Robert Dale¹
¹ Centre for Language Technology
Division of Information and Communication Sciences
Macquarie University
Sydney, NSW 2113
swan,madras,rdale@ics.mq.edu.au
Cécile Paris²
² Information and Communication Technologies
CSIRO
Sydney, Australia
Cecile.Paris@csiro.au
Abstract
Summary sentences are often para-
phrases of existing sentences. They
may be made up of recycled fragments
of text taken from important sentences
in an input document. We investigate
the use of a statistical sentence gener-
ation technique that recombines words
probabilistically in order to create new
sentences. Given a set of event-related
sentences, we use an extended version
of the Viterbi algorithm which employs
dependency relation and bigram proba-
bilities to find the most probable sum-
mary sentence. Using precision and
recall metrics for verb arguments as a
measure of grammaticality, we find that
our system performs better than a bi-
gram baseline, producing fewer spuri-
ous verb arguments.
1 Introduction
Human authored summaries are more than just
a list of extracted sentences. Often the sum-
mary sentence is a paraphrase of a sentence in the
source text, or else a combination of phrases and
words from important sentences that have been
pieced together to form a new sentence. These
sentences, referred to as Non-Verbatim Sentences,
can replace extracted text to improve readability
and coherence in the summary.
Consider the example in Figure 1 which
presents an alignment between a human authored
summary sentence and a source sentence. The
Summary Sentence:
Every province in the country, except one, endured sporadic fighting, looting
or armed banditry in 2003.
Source Sentence:
However, as the year unfolded, every province has been subjected to fighting,
looting or armed banditry, with the exception of just one province (Kirundo,
in northern Burundi).
Figure 1: An aligned summary and source sen-
tence.
text is taken from a corpus of Humanitarian Aid
Proposals1 produced by the United Nations for
the purpose of convincing donors to support a re-
lief effort.
The example illustrates that sentence extraction
alone cannot account for the breadth of human-
authored summary sentences. This is supported by
evidence presented in (Jing and McKeown, 1999)
and (Daumé III and Marcu, 2004).
Moving towards the goal of abstract-like auto-
matic summary generation challenges us to con-
sider mechanisms for generating non-verbatim
sentences. Such a mechanism can usefully be
considered as automatically generating a para-
phrase.2 We treat the problem as one in which a
new and previously unseen summary sentence is
to be automatically produced given some closely
related sentences extracted from a source text.
Following on from (Witbrock and Mittal,
1999), we use and extend the Viterbi algorithm
(Forney, 1973) for the purposes of generating
non-verbatim sentences. This approach treats
1These are available publicly at
http://www.reliefweb.com.
2Paraphrase here includes sentences generated in an In-
formation Fusion task (Barzilay et al., 1999).
sentence generation as a search problem. Given
a set of words (taken from some set of sentences
to paraphrase), we search for the most likely se-
quence given some language model. Intuitively,
we want the generated string to be grammatical
and to accurately reflect the content of the source
text.
Within the Viterbi search process, each time we
append a word to the partially generated sentence,
we consider how well it attaches to a dependency
structure. The focus of this paper is to evaluate
whether or not a series of iterative considerations
of dependency structure results in a grammatical
generated sentence. Previous preliminary evalu-
ations (Wan et al., 2005) indicate that the gen-
erated sequences contain less fragmented text as
measured by an off-the-shelf dependency parser;
more fragments would indicate a grammatically
problematic sentence.
However, while encouraging, such an evalu-
ation says little about what the actual sentence
looks like. For example, such generated text
might only be useful if it contains complete
clauses. Thus, in this paper, we use the precision
and recall metric to measure how many generated
verb arguments, as extracted from dependency re-
lations, are correct.
The remainder of this paper is structured as fol-
lows. Section 2 provides an overview introducing
our approach. In Section 3, we briefly illustrate
our algorithm with examples. A brief survey of
related work is presented in Section 4. We present
our grammaticality experiments in Section 5. We
conclude with further work in Section 6.
2 An Overview of our Approach to
Statistical Sentence Generation
One could characterise the search space as being
a series of nested sets. The outermost set would
contain all possible word sequences. Within this,
a smaller set of strings exhibiting some semblance
of grammaticality might be found, though many
of these might be gibberish. Further nested sets
are those that are grammatical, and within those,
the set of paraphrases that are entailed by the in-
put text.
However, given that we limit ourselves to sta-
tistical techniques and avoid symbolic logic, we
cannot make any claim of strict entailment. We
Original Text
A military transporter was scheduled to take off in the afternoon from Yokota
air base on the outskirts of Tokyo and fly to Osaka with 37,000 blankets .
Mondale said the United States, which has been flying in blankets and is
sending a team of quake relief experts, was prepared to do more if Japan
requested .
United States forces based in Japan will take blankets to help earthquake
survivors Thursday, in the U.S. military’s first disaster relief operation in
Japan since it set up bases here.
Our approach with Dependencies
6: united states forces based in blankets
8: united states which has been flying in blankets
11: a military transporter was prepared to osaka with 37,000 blankets
18: mondalesaidtheafternoon from yokota airbaseon theunited stateswhich
has been flying in blankets
20: mondale said the outskirts of tokyo and is sending a military transporter
was prepared to osaka with 37,000 blankets
23: united states forces based in the afternoon from yokota air base on the
outskirts of tokyo and fly to osaka with 37,000 blankets
27: mondale said the afternoon from yokota air base on the outskirts of tokyo
and is sending a military transporter was prepared to osaka with 37,000 blan-
kets
29: united states which has been flying in the afternoon from yokota air base
on the outskirts of tokyo and is sending a team of quake relief operation in
blankets
31: united states which has been flying in the afternoon from yokota air base
on the outskirts of tokyo and is sending a military transporter was prepared to
osaka with 37,000 blankets
34: mondalesaidtheafternoon from yokota airbaseon theunited stateswhich
has been flying in the outskirts of tokyo and is sending a military transporter
was prepared to osaka with 37,000 blankets
36: united states which has been flying in japan will take off in the after-
noon from yokota air base on the outskirts of tokyo and is sending a military
transporter was prepared to osaka with 37,000 blankets
Figure 2: A selection of example output. Sen-
tences are prefixed by their length.
thus propose an intermediate set of sentences
which conserve the content of the source text
without necessarily being entailed. These are re-
ferred to as the set of verisimilitudes, of which
properly entailed sentences are a subset. The aim
of our choice of features and our algorithm exten-
sion is to reduce the search space from gibberish
strings to that of verisimilitudes. While generat-
ing verisimilitudes is our end goal, in this paper,
we are concerned principally with the generation
of grammatical sentences.
To do so, the extension adds an extra feature
propagation mechanism to the Viterbi algorithm
such that features are passed along a word se-
quence path in the search space whenever a new
word is appended to it. Propagated features are
used to influence the choice of subsequent words
suitable for appending to a partially generated
sentence. In our case, our feature is a depen-
dency structure of the word sequence correspond-
ing to the search path. Our present dependency
representation is based on that of (Kittredge and
Mel’cuk, 1983). However, it contains only the
head and modifier of a relation, ignoring relation-
ship labels for the present.
Algorithmically, after appending a word to a
path, a dependency structure of the partially gen-
erated string is obtained probabilistically. Along
with bigram information, the long-distance con-
text of dependency head information of the pre-
ceding word sequence will be useful in generat-
ing better sentences by filtering out all words that
might, at a particular position in the string, lead
to a spurious dependency relation in the final sen-
tence. Example output is presented in Figure 2.
As the dependency “parsing” mechanism is linear3
and is embedded within the Viterbi algorithm, the
result remains a polynomial-time algorithm.
By examining surface-syntactic dependency
structure at each step in the search, resulting sen-
tences are likely to be more grammatical. This
marriage of models has been tested in other fields
such as speech recognition (Chelba and Jelinek,
1998) with success. Although it is an impover-
ished representation of semantics, considering de-
pendency features in our application context may
also serendipitously assist verisimilitude genera-
tion.
3 The Extended Viterbi Algorithm:
Propagating Dependency Structure
In this section, we present an overview of the
main features of our algorithm extension. We di-
rect the interested reader to our technical paper
(Wan et al., 2005) for full details.
The Viterbi algorithm (for a comprehensive
overview, see (Manning and Schütze, 1999)) is
used to search for the best path across a network
of nodes, where each node represents a word in
the vocabulary. The best sentence is a string of
words, each one emitted by the corresponding vis-
ited node on the path.
Arcs between nodes are weighted using a com-
bination of two pieces of information: a bigram
probability corresponding to that pair of words;
and a probability corresponding to the likelihood
of a dependency relation between that pair of
words. Specifically, the transition probability
3The parse is thus not necessarily optimal, in the sense of
guaranteeing the most likely parse.
defining these weights is the average of the depen-
dency transition probability and the bigram prob-
ability.
To simplify matters in this evaluation, we
assume that the emission probability is always
one. The emission probability is interpreted
as being a Content Selection mechanism that
chooses words that are likely to be in a summary.
Thus, in this paper, each word has an equally
likely chance of being selected for the sentence.
Transition Probability is defined as:

    p_trans(w_{t+1} | w_t) = average( p_bigram(w_{t+1} | w_t), p_dtrans(w_{t+1} | w_t) )

where

    p_bigram(w_{t+1} | w_t) = count(w_t, w_{t+1}) / count(w_t)

The second function, p_dtrans, is the focus of this
paper and discussed in Section 3.1.
Emission Probability (for this paper, always set to
1):

    p_emit(w) = 1

Path Probability is defined recursively as:

    p_path(w_0, ..., w_{t+1}) = p_trans(w_{t+1} | w_t) × p_emit(w) × p_path(w_0, ..., w_t)
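To make the combination concrete, the following Python sketch shows how
these quantities could be computed when extending a path. It is a minimal
illustration only: the function and argument names (p_bigram, p_trans,
p_path, bigram_counts, and so on) are our own choices rather than the
identifiers of the system described here, and the dependency transition
probability is passed in as a callable (it is defined in Section 3.1).

    def p_bigram(prev, word, bigram_counts, unigram_counts):
        # p_bigram(w_{t+1} | w_t) = count(w_t, w_{t+1}) / count(w_t)
        if unigram_counts.get(prev, 0) == 0:
            return 0.0
        return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

    def p_trans(prev, word, bigram_counts, unigram_counts, p_dtrans):
        # Transition probability: the average of the bigram probability and
        # the dependency transition probability for the same word pair.
        return 0.5 * (p_bigram(prev, word, bigram_counts, unigram_counts)
                      + p_dtrans(prev, word))

    def p_path(prev_path_prob, prev, word, bigram_counts, unigram_counts, p_dtrans):
        # Path probability, with the recursion unrolled one step:
        # transition x emission x probability of the path so far.
        # The emission probability is fixed at 1 in this paper.
        p_emit = 1.0
        return (p_trans(prev, word, bigram_counts, unigram_counts, p_dtrans)
                * p_emit * prev_path_prob)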
In the remaining subsections, we present an
example-based discussion of how dependency-
based transitions are used, and a discussion of
how the dependency structure of the unfolding
path is maintained and propagated within the
search process.
3.1 Word Selection Using Dependency
Transitions
Given two input sentences “The relief workers
distributed food to the hungry.” and “The UN
workers requested medicine and blankets.”, the
task is to generate a single sentence that contains
material from these two sentences. As in (Barzi-
lay et al., 1999), we assume that the sentences
stem from the same event and thus, references can
be fused together.
Imagine also that bigram frequencies have been
collected from a relevant UN Humanitarian cor-
pus. Figure 3 presents bigram probabilities and
two sample paths through the lattice. The path
could follow one of two forks after encountering
Graph nodes:
w1 is workers
w2 is distributed
w3 is food
w4 is blankets
E is the end-of-sentence state
[Word lattice over these nodes, showing two paths from workers through
distributed to the end-of-sentence state: one via food and one via blankets.]
Figure 3: Two search paths. One is consistent
with the input text, the other is not. Assume that
the probabilities are taken from a relevant corpus
such that p_bigram(blankets | distributed) is not zero.
the word distributed, since the corpus may have
examples of the word pairs distributed food and
distributed blankets. Since both food and blankets
can reach the end-of-sentence state, both might
conceivably be generated by considering just n-
grams. However, only one is consistent with the
input text.
To encourage the generation of verisimilitudes,
we check for a dependency relation between blan-
kets and distributed in the input sentence. As no
evidence is found, we score this transition with
a low weight. In contrast, there is evidence for
the alternative path since the input text does con-
tain a dependency relation between food and dis-
tributed.
In reality, multiple words might still conceiv-
ably be modified by future words, not just the im-
mediately preceding word. In this example, dis-
tributed is the root of a dependency tree struc-
ture representing the preceding string. However,
any node along the rightmost root-to-leaf branch
of the dependency tree (that represents the par-
tially generated string) could be modified. This
dependency structure is determined statistically
using a probabilistic model of dependency rela-
tions. To represent the rightmost branch, we use a
stack data structure (referred to as the head stack)
whereby older stack items correspond to nodes
closer to the root of the dependency tree.
The probability of the dependency-based transi-
tion is estimated as follows:
    p_dtrans(w_{t+1} | w_t) ≈ max_{x ∈ HS(w_t)} p( dep(w_{t+1}, x) )

where HS(w_t) is the head stack of the partially
generated string ending in w_t, and p(dep(w_{t+1}, x))
is inspired by and closely resembles the probabilistic
functions in (Collins, 1996).
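As a rough illustration of this estimate, the sketch below scores a
candidate word against every item on the head stack attached to the current
path and keeps the best relation probability, considering the candidate as
either head or modifier of the stack item. The helper dep_prob stands in
for a Collins-style relation probability estimated from the input cluster;
both it and the function name are hypothetical, illustrative choices.

    def p_dtrans(word, head_stack, dep_prob):
        # Dependency transition probability: the best dependency-relation
        # probability between the candidate word and any item on the head
        # stack, in either direction (word as head, or word as modifier).
        if not head_stack:
            return 0.0
        return max(max(dep_prob(item, word), dep_prob(word, item))
                   for item in head_stack)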
After selecting and appending a new word, we
update this representation containing the govern-
ing words of the extended string that can yet be
modified. The new path is then annotated with
this updated stack.
3.2 Maintaining the Head Stack
There are three possible alternative outcomes to
the head stack update mechanism. Given a head
stack representing the dependency structure of the
partially generated sentence and a new word to
append to the search path, the first possibility is
that the new word has no dependency relation to
any of the existing stack items, in which case we
simply push the new word onto the stack. For
the second and third cases, we check each item
on the stack and keep a record only of the best
probable dependency between the new word and
the appropriate stack item. The second outcome,
then, is that the new word is the head of some
item on the stack. All items up to and including
that stack item are popped off and the new word is
pushed on. The third outcome is that it modifies
some item on the stack. All stack items up to (but
not including) the stack item are popped off and
the new word is pushed on.
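The three cases can be summarised in the sketch below, in which the stack
is a Python list whose front holds the older, root-side items. The helper
best_attachment is hypothetical: it is assumed to return None when there is
no evidence of a dependency relation with any stack item, or otherwise the
index of the best-scoring item together with whether the new word heads or
modifies it.

    def update_head_stack(head_stack, new_word, best_attachment):
        # Update the stack of still-modifiable governing words after
        # appending `new_word` to the partially generated sentence.
        stack = list(head_stack)  # copy, so each search path keeps its own stack
        attachment = best_attachment(stack, new_word)
        if attachment is None:
            # Case 1: no dependency relation with any stack item; just push.
            stack.append(new_word)
            return stack
        index, role = attachment
        if role == 'head':
            # Case 2: the new word governs the item at `index`; pop every
            # item up to and including it, then push the new word.
            stack = stack[:index]
        else:
            # Case 3: the new word modifies the item at `index`; pop every
            # item above it (but not the item itself), then push the new word.
            stack = stack[:index + 1]
        stack.append(new_word)
        return stack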
We now step through the generation of the sen-
tence “The UN relief workers distributed food to
the hungry” which is produced by the exploration
of one path in the search process. Figure 4 shows
how the head stack mechanism updates and prop-
agates the stack of governing words as we append
words to the path to produce this string.
We first append the determiner the to the new
string and push it onto the empty stack. As dic-
tated by a high n-gram probability, the word UN
follows. However, there is no evidence of a rela-
tion with the preceding word, so we simply push
it on the stack. Similarly, relief is appended and
also pushed on the stack.
When we encounter the word workers we find
evidence that it governs each of the preceding
Graph nodes:
w1 is The        w6 is food
w2 is UN         w7 is to
w3 is relief     w8 is the
w4 is workers    w9 is hungry
w5 is distributed
E is the end-of-sentence state
[The path w1 ... w9, E is shown with the contents of the head stack
propagated beneath each word as it is appended.]
Figure 4: Propagating the head stack feature
along the path.
three words. The modifiers are popped off and
workers is pushed on. Skipping ahead, the tran-
sition distributed food has a high bigram probabil-
ity and evidence for a dependency relation exists.
This results in a strong overall path probability as
opposed to the alternative fork in Figure 3. Since
distributed can still be modified in the future by
words, it is not popped off. The word food is
pushed onto the stack as it too can still be modi-
fied.
The sentence could end there. Since we multi-
ply path, transition and emission probabilities to-
gether, longer sentences will have a lower prob-
ability and will be penalised. However, we can
choose to continue the generation process to pro-
duce a longer sentence. The word to modifies dis-
tributed. To prevent crossing dependencies, food
is popped off the stack before pushing to. Ap-
pending the rest of the words is straightforward.
4 Related Work
In recent years, there has been a steady stream of
research in statistical text generation (see Langk-
ilde and Knight (1998), and Bangalore and Ram-
bow (2000)). These approaches begin with a rep-
resentation of sentence semantics that has been
produced by a content planning stage. Compet-
ing realisations of the semantic representation are
ranked using an n-gram model. Our approach dif-
fers in that we do not start with a semantic repre-
sentation. Rather, we paraphrase the original text,
searching for the best word sequence and depen-
dency tree structure concurrently.
Summarization researchers have also studied
the problem of generating non-verbatim sen-
tences: see (Jing and McKeown, 1999), (Barzi-
lay et al., 1999) and more recently (Daumé III
and Marcu, 2004). Jing uses an HMM for learn-
ing alignments between summary and source sen-
tences. Daumé III also provides a mechanism
for sub-sentential alignment but allows for align-
ments between multiple sentences. Both ap-
proaches provide models for later recombining
sentence fragments. Our work differs primar-
ily in granularity. Using words as a basic unit
potentially offers greater flexibility in pseudo-
paraphrase generation; however, like any ap-
proach that recombines text fragments, it incurs
additional problems in ensuring that the generated
sentence reflects the information in the input text.
In work describing summarisation as transla-
tion, Knight and Marcu (2002)
also combine syntax models to help rank the
space of possible candidate translations. Their
work differs primarily in that they search over a
space of trees representing the candidate trans-
lations and we search over a space of word se-
quences which are annotated by corresponding
trees.
5 Evaluation
In this section, we describe two small experiments
designed to evaluate whether a dependency-
based statistical generator improves grammatical-
ity. The first experiment uses a precision and re-
call styled metric on verb arguments. We find that
our approach performs significantly better than
the bigram baseline. The second experiment ex-
amines the precision and recall statistics on short
and long distance verb arguments. We now de-
scribe these two experiments in more detail.
5.1 Improvements in Grammaticality: Verb
Argument Precision and Recall
In this evaluation, we want to know what advan-
tages a consideration of input text dependencies
affords, compared to just using bigrams from the
input text. Given a set of sentences which has
been clustered on the basis of similarity of event,
the system generates the most probable sentence
by recombining words from the cluster.4 The aim
of the evaluation is to measure improvements in
grammaticality. To do so, we compare our depen-
dency based generation method against a bigram
model baseline.
Since verbs are crucial in indicating the gram-
maticality of a clause, we examine the verb argu-
ments of the generated sentence. We use a recall
and precision metric over verb dependency rela-
tions and compare generated verb arguments with
those from the input text. For any verbs included
in the generated summary, we count how many
generated verb-argument relations can be found
amongst the input text relations for that verb. A
relation match consists of an identical head, and
also an identical modifier. Since word order in
English is vital for grammaticality, a matching re-
lation must also preserve the relative order of the
two words within the generated sentence.
The precision metric is as follows:

    precision = count(matched-verb-relations) / count(generated-verb-relations)

The corresponding recall metric is defined as:

    recall = count(matched-verb-relations) / count(source-text-verb-relations)
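A minimal sketch of how these scores might be computed is shown below. The
triple encoding of a relation (head verb, modifier, and whether the head
precedes the modifier) is our own illustration of the matching criteria: an
identical head, an identical modifier, and the same relative word order.

    def verb_argument_scores(generated_relations, source_relations):
        # Each relation is a (head_verb, modifier, head_precedes_modifier) triple.
        generated = set(generated_relations)
        source = set(source_relations)
        matched = generated & source
        precision = len(matched) / len(generated) if generated else 0.0
        recall = len(matched) / len(source) if source else 0.0
        return precision, recall

    # Example: one matched relation out of two generated and three source relations.
    p, r = verb_argument_scores(
        [("distributed", "food", True), ("distributed", "workers", True)],
        [("distributed", "food", True), ("distributed", "workers", False),
         ("distributed", "blankets", True)],
    )
    # p == 0.5, r == 1/3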
The data for our evaluation cases is taken from
the information fusion data collected by (Barzi-
lay et al., 1999). This data is made up of news
articles that have first been grouped by topic, and
then component topic sentences further clustered
by similarity of event. We use 100 sentence clus-
ters and on average there are 4 sentences per clus-
ter.
Each sentence cluster forms an evaluation case
for which the task is to generate a single sentence.
For each evaluation case, the baseline method and
our method generate a set of answer strings, from
1 to 40 words in length.
For each cluster, sentences are parsed
using the Connexor dependency parser
(www.connexor.com) to obtain dependency
relations used to build dependency models for
that cluster. In the interests of minimising con-
flating factors in this comparison, we similarly
4This sentence could be an accurate replica of an original
sentence, or a non-verbatim sentence that fuses information
from various input sentences.
[Plot: Precision of Verb Arguments across Sentence Lengths; x-axis: Sentence
Length (3 to 38); y-axis: Precision; series: Baseline and System.]
Figure 5: Verb-Argument Relation Precision
scores for generated output compared to a bigram
baseline
train bigram language models on the input cluster
of text. This provides both the bigram baseline
and our system with the best possible chance
of producing a grammatical sentence given the
vocabulary of the input cluster. Note that the
baseline is a difficult one to beat because it is
likely to reproduce long sequences from the
original sentences of the input cluster. However,
the exact regurgitation of input sentences is not
necessarily the outcome of the baseline generator
since, for each cluster, bigrams from multiple
sentences are combined into a single model.
We do not use any smoothing algorithms for
dependency counts in this evaluation at the
present time. Thus, given the sparseness arising
from a small set of sentences, our dependency
probabilities tend towards boolean values. For
both our approach and the baseline, the bigrams
are smoothed using Katz’s back-off method.
5.1.1 Results and Discussion
Figure 5 shows the average precision score
across sentence lengths. That is, for each sentence
length, there are 100 instances whose precisions
are averaged. As can be seen, the system almost
always achieves a higher precision than the base-
line. As expected, precision decreases as sentence
length increases.
Our approach is designed to minimise the num-
ber of spurious dependency relations generated in
the resulting sentence. As this is typically mea-
sured by precision scores, recall scores are less in-
[Plot: Recall of Verb Arguments across Sentence Lengths; x-axis: Sentence
Length (3 to 38); y-axis: Recall; series: Baseline and System.]
Figure 6: Verb-Argument Relation Recall scores
for generated output compared to a bigram base-
line
[Plot: Precision of Adjacent Verb Arguments across Sentence Lengths; x-axis:
Sentence Length (3 to 38); y-axis: Precision; series: Baseline and System.]
Figure 7: Adjacent Verb-Argument Relation Pre-
cision scores for generated output compared to a
bigram baseline
[Plot: Precision of Long Distance Verb Arguments across Sentence Lengths;
x-axis: Sentence Length (3 to 38); y-axis: Precision; series: Baseline and
System.]
Figure 8: LongDistance Verb-Argument Relation
Precision scores forgenerated output compared to
a bigram baseline
teresting as a measure of the generated sentence.
However, for completeness, they are presented in
Figure 6. Results indicate that our system was
indistinguishable from the baseline. This is un-
surprising as our approach is not designed to in-
crease the retrieval of dependency relations from
the source text.
Using a two-tailed Wilcoxon test (alpha = 0.05),
we find that the differences in precision
scores are significant for most sentence lengths
except lengths 17 and 32. The failure to reject the
null hypothesis5 for these lengths is interpreted as
idiosyncratic in our data set. In the case of the
recall scores, differences are not significant.
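For reference, a paired two-tailed Wilcoxon signed-rank test of this kind
can be run as in the sketch below, here written with SciPy (the paper does
not name the software it used) over the per-instance precision scores at a
single sentence length.

    from scipy.stats import wilcoxon

    def precision_difference_is_significant(system_scores, baseline_scores, alpha=0.05):
        # Paired two-tailed Wilcoxon signed-rank test over per-instance
        # precision scores for one sentence length (100 instances per length).
        _, p_value = wilcoxon(system_scores, baseline_scores)
        return p_value < alpha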
The results support the claim that a
dependency-based statistical generator im-
proves grammaticality by reducing the number
of spurious verb-argument dependency relations.
It is also possible to treat dependency precision
as being a superficial measure of content conser-
vation between the generated sentence and the
input sentences. Thus, it can also be seen as a
poor measure of how well the summary captures
the source text.
5.2 Examining Short and Long Distance
Verb Arguments
Intuitively, one would expect the result from the
first experiment to be reflected in both short (i.e.
adjacent) and long distance verb dependencies.
To test this intuition, we examined the precision
and recall statistics for the two types of depen-
dencies separately. The same experimental setup
is used as in the first experiment.
The results for adjacent (short) dependencies
echo that of the first experiment. The precision
results for adjacent dependencies are presented in
Figure 7. Again, our system performs better than
the baseline in terms of precision. Our system is
indistinguishable in recall performance from the
baseline. Due to space constraints, we omit the
recall graph. Using the same significance test as
before, we find that the differences in precision
are generally significant across sentence lengths.
That our approach should achieve a better pre-
cision for adjacent relations supports the claim
of improved grammaticality. The result resonates
5That is, the means of scores by our system and the base-
line are not different.
well with the earlier finding that sentences gener-
ated by the dependency-based statistical genera-
tor contain fewer instances of fragmented text. If
this is so, one would expect that a parser is able to
identify more of the original intended dependen-
cies.
The results for the long distance verb argument
precision and recall tests are slightly different.
Whilst the graph of precision scores, presented in
Figure 8, shows our system often performing bet-
ter than the baseline, this difference is not signif-
icant. As expected, the recall scores between our
system and the baseline are on par and we again
omit the results.
This result is interesting because one would ex-
pect that what our approach offers most is the
ability to preserve long distance dependencies
from the input text. However, long distance rela-
tions are fewer in number than adjacent relations,
which account for approximately 70% of depen-
dency relations (Collins, 1996). As the generator
still does not produce perfect text, if the interme-
diate text between the head and modifier of a long
distance relation contains any grammatical errors,
the parser will obviously have difficulty in iden-
tifying the original intended relation. Given that
there are fewer long distance relations, the pres-
ence of such errors quickly reduces the perfor-
mance margin for the precision metric and hence
no significant effect is detected. We expect that
as we fine-tune the probabilistic models, the pre-
cision of long distance relations is likely to im-
prove.
6 Conclusion and Future Work
In this paper, we presented an extension to the
Viterbi algorithm which selects words in the
string that are likely to result in probable depen-
dency structures. In a preliminary evaluation
using precision and recall of dependency rela-
tions, we find that it improves grammaticality
over a bigram model. In future work, we in-
tend to re-introduce the emission probabilities to
model content selection. We also intend to use
corpus-based dependency relation statistics and
we would like to compare the two language mod-
els using perplexity. Finally, we would like to
compare our system to that described in (Barzi-
lay et al., 1999).
References
Srinivas Bangalore and Owen Rambow. 2000. Ex-
ploiting a probabilistic hierarchical model for gen-
eration. In Proceedings of COLING, Universität
des Saarlandes, Saarbrücken, Germany.
Regina Barzilay, Kathleen R. McKeown, and Michael
Elhadad. 1999. Information fusion in the context
of multi-document summarization. In Proceedings
of ACL, Morristown, NJ, USA.
Ciprian Chelba and Fred Jelinek. 1998. Exploiting
syntactic structure for language modelling. In Pro-
ceedings of ACL-COLING, Montreal, Canada.
Michael John Collins. 1996. A new statistical parser
based on bigram lexical dependencies. In Arivind
Joshi and Martha Palmer, editors, Proceedings of
ACL, San Francisco.
Hal Daumé III and Daniel Marcu. 2004. A phrase-
based hmm approach to document/abstract align-
ment. In Proceedings of EMNLP 2004, Barcelona,
Spain.
G. David Forney. 1973. The Viterbi algorithm. Pro-
ceedings of The IEEE, 61(3):268–278.
Hongyan Jing and Kathleen McKeown. 1999. The de-
composition of human-written summary sentences.
In Research and Development in Information Re-
trieval.
Richard I. Kittredge and Igor Mel’cuk. 1983. To-
wards a computable model of meaning-text rela-
tions within a natural sublanguage. In The Proceed-
ings of IJCAI.
Kevin Knight and Daniel Marcu. 2002. Summa-
rization beyond sentence extraction: a probabilis-
tic approach to sentence compression. Artif. Intell.,
139(1):91–107.
Irene Langkilde and Kevin Knight. 1998. The practi-
cal value of N-grams in generation. In Proceedings
of INLG, New Brunswick, New Jersey.
Christopher D. Manning and Hinrich Schütze. 1999.
Foundations of Statistical Natural Language Pro-
cessing. The MIT Press, Cambridge, Mas-
sachusetts.
Stephen Wan, Robert Dale, Mark Dras, and Cécile
Paris. 2005. Searching for grammaticality and con-
sistency: Propagating dependencies in the Viterbi
algorithm. In The Proceedings of EWNLG, Ab-
erdeen, Scotland.
Michael J. Witbrock and Vibhu O. Mittal. 1999.
Ultra-summarization (poster abstract): a statisti-
cal approach to generating highly condensed non-
extractive summaries. In The Proceedings of SI-
GIR, New York, NY, USA.
