Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 145-152, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Kernel-based Approach for Automatic Evaluation of Natural Language
Generation Technologies: Application to Automatic Summarization
Tsutomu Hirao
NTT Communication Science Labs.
NTT Corp.
hirao@cslab.kecl.ntt.co.jp
Manabu Okumura
Precision and Intelligence Labs.
Tokyo Institute of Technology
oku@pi.titech.ac.jp
Hideki Isozaki
NTT Communication Science Labs.
NTT Corp.
isozaki@cslab.kecl.ntt.co.jp
Abstract
In order to promote the study of automatic summarization and translation, we need an accurate automatic evaluation method that is close to human evaluation. In this paper, we present an evaluation method based on convolution kernels that measure the similarities between texts considering their substructures. We conducted an experiment using automatic summarization evaluation data developed for the Text Summarization Challenge 3 (TSC-3). A comparison with conventional techniques shows that our method correlates more closely with human evaluations and is more robust.
1 Introduction
Automatic summarization, machine translation, and
paraphrasing have attracted much attention recently.
These tasks include text-to-text language genera-
tion. Evaluation workshops are held in the U.S.
and Japan, e.g., the Document Understanding Con-
ference (DUC)1, NIST Machine Translation Evalu-
ation2 as part of the TIDES project, the Text Sum-
marization Challenge (TSC)3 of the NTCIR project,
and the International Workshop on Spoken Lan-
guage Translation (IWSLT)4.
These evaluation workshops employ human eval-
uations, which are essential in terms of achieving
1http://duc.nist.gov
2http://www.nist.gov/speech/tests/mt/
3http://www.lr.titech.ac.jp/tsc
4http://www.slt.atr.co.jp/IWSLT2004
high-quality evaluation results. However, human evaluations require a huge effort and the cost is considerable. Moreover, we cannot automatically evaluate a new system even if we use the corpora built for these workshops, and we cannot conduct re-evaluation experiments.
To cope with this situation, there is a particular
need to establish a high quality automatic evalua-
tion method. Once this is done, we can expect great
progress to be made on natural language generation.
In this paper, we propose a novel automatic
evaluation method for natural language generation
technologies. Our method is based on the Ex-
tended String Subsequence Kernel (ESK) (Hirao
et al., 2004b) which is a kind of convolution ker-
nel (Collins and Duffy, 2001). ESK allows us to
calculate the similarities between a pair of texts tak-
ing account of word sequences, their word sense se-
quences and their combinations.
We conducted an experimental evaluation using
automatic summarization evaluation data developed
for TSC-3 (Hirao et al., 2004a). The results of the
comparison with ROUGE-N (Lin and Hovy, 2003;
Lin, 2004a; Lin, 2004b), ROUGE-S(U) (Lin, 2004b;
Lin and Och, 2004) and ROUGE-L (Lin, 2004a;
Lin, 2004b) show that our method correlates more
closely with human evaluations and is more robust.
2 Related Work
Automatic evaluation methods for automatic sum-
marization and machine translation are grouped into
two classes. One is the longest common subse-
quence (LCS) based approach (Hori et al., 2003;
Lin, 2004a; Lin, 2004b; Lin and Och, 2004). The
other is the N-gram based approach (Papineni et al.,
Table 1: Components of vectors corresponding to S1 and S2. Bold subsequences are common to S1 and S2. (The table lists every subsequence of length two or less derived from S1 and S2, together with its weight in each sentence; subsequences with gaps are discounted by powers of the decay parameter λ. For example, "Becoming", "is", "my", SPACEMAN, DREAM, SPACEMAN-is and is-my occur in both sentences with weight 1, while "cosmonaut" occurs only in S1 and "astronaut" only in S2.)
2002; Lin and Hovy, 2003; Lin, 2004a; Lin, 2004b;
Soricut and Brill, 2004).
Hori et al. (2003) proposed an automatic eval-
uation method for speech summarization based on
word recognition accuracy. They reported that their
method is superior to BLEU (Papineni et al., 2002)
in terms of the correlation between human assess-
ment and automatic evaluation. Lin (2004a; 2004b)
and Lin and Och (2004) proposed an LCS-based au-
tomatic evaluation measure called ROUGE-L. They
applied ROUGE-L to the evaluation of summariza-
tion and machine translation. The results showed
that the LCS-based measure is comparable to N-
gram-based automatic evaluation methods. How-
ever, these methods tend to be strongly influenced
by word order.
Various N-gram-based methods have been pro-
posed since BLEU, which is now widely used for the
evaluation of machine translation. Lin et al. (2003)
proposed a recall-oriented measure, ROUGE-N,
whereas BLEU is precision-oriented. They reported
that ROUGE-N performed well as regards automatic
summarization. In particular, ROUGE-1, i.e., uni-
gram matching, provides the best correlation with
human evaluation. Soricut and Brill (2004) proposed
a unified measure. They integrated a precision-
oriented measure with a recall-oriented measure by
using an extension of the harmonic mean formula. It
performs well in evaluations of machine translation,
automatic summarization, and question answering.
However, N-gram-based methods have a critical
problem: they cannot consider co-occurrences with
gaps, although the LCS-based method can deal with
them. Therefore, Lin and Och (2004) introduced
skip-bigram statistics for the evaluation of machine
translation. However, they did not consider longer
skip-n-grams such as skip-trigrams. Moreover, their
method does not distinguish between bigrams and
skip-bigrams.
3 Kernel-based Automatic Evaluation
The above N-gram-based methods correlate closely with human evaluations. However, we think some skip-n-grams (n ≥ 3) are also useful. In this paper, we employ the Extended String Subsequence Kernel (ESK), which considers both n-grams and skip-n-grams. In addition, the ESK allows us to add word senses to each word. The use of word senses enables flexible matching even when paraphrasing is used.
The ESK is a kind of convolution kernel (Collins
and Duffy, 2001). Convolution kernels have recently
attracted attention as a novel similarity measure in
natural language processing.
3.1 ESK
The ESK is an extension of the String Subsequence
Kernel (SSK) (Lodhi et al., 2002) and the Word Se-
quence Kernel (WSK) (Cancedda et al., 2003).
The ESK receives two node sequences as inputs and maps each of them into a high-dimensional vector space. The kernel's value is simply the inner product of the two vectors in the vector space. In order to discount long skip-n-grams, the decay parameter λ is introduced.
We explain the computation of the ESK's value using the two sentences S1 and S2 shown below. In the example, word senses are shown in braces.

S1: Becoming a cosmonaut:{SPACEMAN} is my great dream:{DREAM}

S2: Becoming an astronaut:{SPACEMAN} is my ambition:{DREAM}

In this case, "cosmonaut" and "astronaut" share the same sense {SPACEMAN}, and "dream" and "ambition" also share the same sense {DREAM}. We can use WordNet for English and Goitaikei (Ikehara et al., 1997) for Japanese.

Table 1 shows the subsequences derived from S1 and S2 and their weights. Note that the subsequence length is two or less. From the table, there are fifteen subsequences⁵ that are common to S1 and S2; ESK(S1, S2) is the sum, over these common subsequences, of the products of their weights in S1 and S2, which is a polynomial in the decay parameter λ. For reference, there are three unigrams, one bigram, zero trigrams and three skip-bigrams common to S1 and S2.
Formally, the ESK is defined as follows, where T and U are node sequences:

  ESK(T, U) = \sum_{m=1}^{d} \sum_{i,j} K_m(t_i, u_j)    (1)

  K_m(t_i, u_j) =
    \begin{cases}
      val(t_i, u_j)                            & \text{if } m = 1 \\
      K'_{m-1}(t_i, u_j) \cdot val(t_i, u_j)   & \text{otherwise}
    \end{cases}    (2)

Here, d is the upper bound of the subsequence length, t_i is the i-th node of T, and u_j is the j-th node of U. The function val(p, q) returns the number of attributes common to the given nodes p and q. K'_m(t_i, u_j) is defined as follows:

  K'_m(t_i, u_j) =
    \begin{cases}
      0                                                  & \text{if } j = 1 \\
      \lambda K'_m(t_i, u_{j-1}) + K''_m(t_i, u_{j-1})   & \text{otherwise}
    \end{cases}    (3)

K''_m(t_i, u_j) is defined as follows:

  K''_m(t_i, u_j) =
    \begin{cases}
      0                                                  & \text{if } i = 1 \\
      \lambda K''_m(t_{i-1}, u_j) + K_m(t_{i-1}, u_j)    & \text{otherwise}
    \end{cases}    (4)
⁵ Bold subsequences in Table 1.
Finally, we define the similarity measure between T and U by normalizing the ESK. This similarity can be regarded as an extension of the cosine measure.

  Sim_{ESK}(T, U) = \frac{ESK(T, U)}{\sqrt{ESK(T, T) \cdot ESK(U, U)}}    (5)
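For concreteness, the recursion in equations (1)-(4) and the normalization of equation (5) can be sketched in Python. This is an illustrative reimplementation, not the authors' code; the representation of each node as a set of attributes (surface word plus an optional sense tag) and the default parameter values are assumptions of this sketch.

```python
from functools import lru_cache
from math import sqrt

def esk(t, u, d=2, lam=0.5):
    """Extended String Subsequence Kernel, equations (1)-(4).

    t, u: sequences of attribute sets (e.g. {word, SENSE}).
    d: upper bound on subsequence length; lam: gap decay parameter.
    Indices are 1-based, as in the paper.
    """
    def val(i, j):
        # Number of attributes shared by node t_i and node u_j.
        return len(t[i - 1] & u[j - 1])

    @lru_cache(maxsize=None)
    def k(m, i, j):                      # K_m(t_i, u_j), eq. (2)
        if m == 1:
            return float(val(i, j))
        return kp(m - 1, i, j) * val(i, j)

    @lru_cache(maxsize=None)
    def kp(m, i, j):                     # K'_m(t_i, u_j), eq. (3)
        if j == 1:
            return 0.0
        return lam * kp(m, i, j - 1) + kpp(m, i, j - 1)

    @lru_cache(maxsize=None)
    def kpp(m, i, j):                    # K''_m(t_i, u_j), eq. (4)
        if i == 1:
            return 0.0
        return lam * kpp(m, i - 1, j) + k(m, i - 1, j)

    # Eq. (1): sum over all subsequence lengths and node pairs.
    return sum(k(m, i, j)
               for m in range(1, d + 1)
               for i in range(1, len(t) + 1)
               for j in range(1, len(u) + 1))

def sim_esk(t, u, d=2, lam=0.5):
    """Normalized, cosine-like similarity, eq. (5)."""
    return esk(t, u, d, lam) / sqrt(esk(t, t, d, lam) * esk(u, u, d, lam))

# S1/S2 from the running example; sense tags in upper case.
s1 = [{"Becoming"}, {"a"}, {"cosmonaut", "SPACEMAN"}, {"is"},
      {"my"}, {"great"}, {"dream", "DREAM"}]
s2 = [{"Becoming"}, {"an"}, {"astronaut", "SPACEMAN"}, {"is"},
      {"my"}, {"ambition", "DREAM"}]
print(sim_esk(s1, s2))   # similarity in (0, 1]
```

On two identical two-word sequences with d = 2 the sketch returns 3 (two common unigrams plus one common bigram), and a single intervening word discounts the gapped pair by one factor of λ, which matches the weighting described for Table 1.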
3.2 Automatic Evaluation based on ESK
Suppose S is a system output consisting of n sentences and R is a human-written reference consisting of m sentences; s_i is a sentence in S and r_j is a sentence in R. We define two scoring functions for automatic evaluation. First, we define a precision-oriented measure as follows:

  P_{ESK}(S, R) = \frac{1}{n} \sum_{i=1}^{n} \max_{1 \le j \le m} Sim_{ESK}(s_i, r_j)    (6)

Symmetrically, we define a recall-oriented measure as follows:

  R_{ESK}(S, R) = \frac{1}{m} \sum_{j=1}^{m} \max_{1 \le i \le n} Sim_{ESK}(s_i, r_j)    (7)

Finally, we define a unified measure, i.e., an F-measure, as follows:

  F_{ESK}(S, R) = \frac{(1 + \beta^2) \cdot R_{ESK}(S, R) \cdot P_{ESK}(S, R)}{R_{ESK}(S, R) + \beta^2 \cdot P_{ESK}(S, R)}    (8)

β is a cost parameter balancing P_{ESK} and R_{ESK}, and its value is selected depending on the evaluation task. Since a summary should not miss important information given in the human reference, recall is more important than precision. Therefore, a large β will yield good results.
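The scoring functions of equations (6)-(8) are straightforward to sketch given any sentence-similarity function. In the sketch below, `overlap` is a hypothetical stand-in similarity used only to make the example runnable; the paper plugs Sim_ESK of equation (5) into the `sim` argument.

```python
def precision_esk(system, reference, sim):
    """Eq. (6): average best-match similarity over system sentences."""
    return sum(max(sim(s, r) for r in reference) for s in system) / len(system)

def recall_esk(system, reference, sim):
    """Eq. (7): average best-match similarity over reference sentences."""
    return sum(max(sim(s, r) for s in system) for r in reference) / len(reference)

def f_esk(system, reference, sim, beta=3.0):
    """Eq. (8): F-measure; recall dominates for beta > 1."""
    p = precision_esk(system, reference, sim)
    r = recall_esk(system, reference, sim)
    if r + beta ** 2 * p == 0:
        return 0.0
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

# Hypothetical stand-in similarity for illustration only:
# Jaccard overlap of word sets, not the paper's Sim_ESK.
def overlap(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

print(f_esk(["a b c"], ["a b d", "x y"], overlap))
```

The default beta of 3 is an arbitrary illustration of the paper's point that a large β (recall-heavy weighting) suits summarization evaluation.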
3.3 Extension for Multiple References
When multiple human references (correct answers) are available, we define a simple function for multiple references as follows:

  F^{multi}_{ESK}(S, \mathcal{R}) = \frac{1}{|\mathcal{R}|} \sum_{u=1}^{|\mathcal{R}|} F_{ESK}(S, R_u)    (9)

Here, equation (9) gives the average score. \mathcal{R} indicates a set of references; \mathcal{R} = \{R_1, \ldots, R_L\}.
4 Experimental Evaluation
To confirm and discuss the effectiveness of our method, we conducted an experimental evaluation using the TSC-3 multiple-document summarization evaluation data and our additional data.
4.1 Task and Evaluation Metrics in TSC-3
The task of TSC-3 is multiple document summariza-
tion. Participants were given a set of documents
about a certain event and required to generate two
different length summaries for the entire document
set. The lengths were about 5% and 10% of the total
number of characters in the document set, respec-
tively. Thirty document sets were provided for the official run evaluation. There were ten participating systems, one of which was provided by the TSC organizers as a baseline system.
The evaluation metric follows DUC’s SEE eval-
uation scheme (Harman and Over, 2004). For each
document set, one human subject makes a reference
summary and uses it as a basis for evaluating ten
system outputs. This human evaluation procedure
consists of the following steps:
Step 1 For each reference sentence r_j (∈ R), repeat Steps 2 and 3.

Step 2 For r_j, the human assessor finds the most relevant sentence set S′ in the system output.

Step 3 The assessor assigns a score e(r_j, S′) ∈ {0.0, 0.1, ..., 1.0}, where 1.0 means perfect, in terms of how much of the content of r_j can be reproduced by using only the sentences in S′.

Step 4 Finally, the evaluation score of output S for reference R is defined as E(R, S) = \sum_j e(r_j, S′) / |R|, where |R| is the number of sentences in R.

The final score of a system is calculated by applying the above procedure to every topic and normalizing by the number of topics, i.e., \sum_{k=1}^{30} E(R_k, S_k) / 30.

When multiple references \mathcal{R} = \{R_1, \ldots, R_L\} are available, the scores are given as follows: E^{multi}(\mathcal{R}, S) = \sum_u E(R_u, S) / |\mathcal{R}|.
4.2 Variation of Human Assessors
In TSC-3’s official run evaluation, system outputs
were compared with one human written reference
summary for each topic. There were five topic sets
and five human assessors (A-E in Table 2) for each
topic set.
Before using the one human-written reference summary as the gold-standard reference, we prepared two additional human summaries for each topic set in order to examine variations among human assessors.
Table 2: The relationship between topics and reference summary creators, i.e., human assessors. E(A) indicates subject A's evaluation score for all systems for the corresponding topics.

topic-ID   R_1    R_2    R_3    R_ave
1 - 6      E(A)   E(E)   E(C)   mean(E(A), E(E), E(C))
7 - 12     E(B)   E(A)   E(D)   mean(E(B), E(A), E(D))
13 - 18    E(C)   E(B)   E(E)   mean(E(C), E(B), E(E))
19 - 24    E(D)   E(C)   E(A)   mean(E(D), E(C), E(A))
25 - 30    E(E)   E(D)   E(B)   mean(E(E), E(D), E(B))
Table 3: Correlations between human judgments: Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ).

short          r                             ρ
        R_1   R_2   R_3   R_ave      R_1   R_2   R_3   R_ave
R_1     1.00  .968  .902  .988       1.00  .976  .697  .988
R_2      -    1.00  .910  .996        -    1.00  .733  .988
R_3      -     -    1.00  .914        -     -    1.00  .758
R_ave    -     -     -    1.00        -     -     -    1.00

long
R_1     1.00  .908  .822  .964       1.00  .964  .939  .964
R_2      -    1.00  .963  .987        -    1.00  .952  1.00
R_3      -     -    1.00  .931        -     -    1.00  .932
R_ave    -     -     -    1.00        -     -     -    1.00
Therefore, we obtained three reference summaries and three sets of evaluation results for each topic set (Table 2). Moreover, we prepared a unified evaluation result, R_ave, calculated as the average of the three human scores.

The relationship between topics and human assessors is shown in Table 2. For example, subject B generates summaries and evaluates all systems for topics 7-12, 13-18 and 25-30 in R_1, R_2, and R_3, respectively. Note that each human subject, A to E, was a retired professional journalist; that is, they shared a common background.
Table 3 shows the Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ) for the human subjects. The results show that every pair has a high correlation. Therefore, changing the human subject has little influence as regards creating references and evaluating system summaries.
The evaluation by human subjects is stable. This re-
sult agrees with DUC’s additional evaluation results
(Harman and Over, 2004). However, the behavior
of the correlations between humans with different
backgrounds is uncertain. The correlation might be
fragile if we introduce a human subject whose back-
ground is different from the others.
4.3 Compared Automatic Evaluation Methods
We compared our method with the WSK-based method, ROUGE-N, ROUGE-S(U), and ROUGE-L, described below. We used only content words to calculate the ROUGE scores because the correlation coefficients decreased if we did not remove function words.
WSK-based method
We use the WSK instead of the ESK in equations (6)-(8).
ROUGE-N
ROUGE-N is an N-gram-based evaluation measure defined as follows (Lin, 2004b):

  ROUGE-N(S, R) = \frac{\sum_{gram_N \in R} Count_{match}(gram_N)}{\sum_{gram_N \in R} Count(gram_N)}    (10)

Here, Count(gram_N) is the number of occurrences of an N-gram in the reference, and Count_match(gram_N) denotes the number of N-gram co-occurrences between a system output and the reference.
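As a sketch, the recall-oriented count ratio of equation (10) can be written as follows; this is an illustrative implementation with clipped match counts, not Lin's original script, and the tokenization is assumed to be done by the caller.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of contiguous N-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system_tokens, reference_tokens, n=2):
    """Eq. (10): matched N-grams over total reference N-grams."""
    ref = ngrams(reference_tokens, n)
    sys_ = ngrams(system_tokens, n)
    # Clip each match at the reference count so repeats cannot inflate the score.
    match = sum(min(count, sys_[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return match / total if total else 0.0
```

For example, `rouge_n("a b c".split(), "a b d".split(), 2)` shares one of the reference's two bigrams and scores 0.5.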
ROUGE-S
ROUGE-S is an extension of ROUGE-2 defined as follows (Lin, 2004b):

  ROUGE-S(S, R) = \frac{(1 + \beta^2) \cdot R_{skip2}(S, R) \cdot P_{skip2}(S, R)}{R_{skip2}(S, R) + \beta^2 \cdot P_{skip2}(S, R)}    (11)

where R_{skip2} and P_{skip2} are defined as follows:

  R_{skip2}(S, R) = \frac{Skip2(S, R)}{\text{\# of skip-bigrams in } R}    (12)

  P_{skip2}(S, R) = \frac{Skip2(S, R)}{\text{\# of skip-bigrams in } S}    (13)

Here, the function Skip2 returns the number of skip-bigrams that are common to R and S.
ROUGE-SU
ROUGE-SU is an extension of ROUGE-S, which includes unigrams as a feature, defined as follows (Lin, 2004b):

  ROUGE-SU(S, R) = \frac{(1 + \beta^2) \cdot R_{SU}(S, R) \cdot P_{SU}(S, R)}{R_{SU}(S, R) + \beta^2 \cdot P_{SU}(S, R)}    (14)

where R_{SU} and P_{SU} are defined as follows:

  R_{SU}(S, R) = \frac{SU(S, R)}{\text{(\# of skip-bigrams + \# of unigrams) in } R}    (15)

  P_{SU}(S, R) = \frac{SU(S, R)}{\text{(\# of skip-bigrams + \# of unigrams) in } S}    (16)

Here, the function SU returns the number of skip-bigrams and unigrams that are common to R and S.
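A minimal sketch of the skip-bigram statistics behind equations (11)-(13); measuring the maximum skip distance as the number of intervening words is an assumption of this illustration, and β defaults to 1 here purely for simplicity.

```python
from itertools import combinations
from collections import Counter

def skip_bigrams(tokens, max_gap=None):
    """All ordered word pairs; max_gap bounds the number of
    intervening words (None = unlimited skip distance)."""
    return Counter((tokens[i], tokens[j])
                   for i, j in combinations(range(len(tokens)), 2)
                   if max_gap is None or j - i - 1 <= max_gap)

def rouge_s(system_tokens, reference_tokens, beta=1.0, max_gap=None):
    """Eqs. (11)-(13): F-measure over common skip-bigrams."""
    sys_sb = skip_bigrams(system_tokens, max_gap)
    ref_sb = skip_bigrams(reference_tokens, max_gap)
    # Skip2(S, R): clipped count of skip-bigrams shared by both sides.
    common = sum(min(c, sys_sb[g]) for g, c in ref_sb.items())
    if common == 0:
        return 0.0
    p = common / sum(sys_sb.values())
    r = common / sum(ref_sb.values())
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```

Note that, as the surrounding text observes, every pair here is weighted equally: unlike the ESK, there is no decay factor penalizing wide gaps.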
ROUGE-L
ROUGE-L is an LCS-based evaluation measure defined as follows (Lin, 2004b):

  ROUGE-L(S, R) = \frac{(1 + \beta^2) \cdot R_{lcs}(S, R) \cdot P_{lcs}(S, R)}{R_{lcs}(S, R) + \beta^2 \cdot P_{lcs}(S, R)}    (17)

where R_{lcs} and P_{lcs} are defined as follows:

  R_{lcs}(S, R) = \frac{1}{m} \sum_{j} LCS_{\cup}(r_j, S)    (18)

  P_{lcs}(S, R) = \frac{1}{n} \sum_{j} LCS_{\cup}(r_j, S)    (19)

Here, LCS_∪(r_j, S) is the LCS score of the union longest common subsequence between reference sentence r_j and S; m and n are the numbers of words contained in R and S, respectively.
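For a single reference sentence, the LCS machinery of equations (17)-(19) reduces to the following sketch; the union-LCS over multiple reference sentences is omitted for brevity, so this is a simplification rather than the full measure.

```python
def lcs_len(x, y):
    """Classic dynamic-programming longest-common-subsequence length."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(system_tokens, reference_tokens, beta=1.0):
    """Single-sentence form of eqs. (17)-(19)."""
    lcs = lcs_len(reference_tokens, system_tokens)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference_tokens)   # eq. (18), one reference sentence
    p = lcs / len(system_tokens)      # eq. (19)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```

Because the LCS allows arbitrary gaps but requires in-order matching, this measure rewards preserved word order, which is the sensitivity to word order discussed in Section 2.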
The multiple-reference versions of ROUGE-N, -S, -SU and -L (RN_multi, RS_multi, RSU_multi and RL_multi) can be defined in accordance with equation (9).
4.4 Evaluation Measures
We evaluate automatic evaluation methods by using Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ). Since we have ten systems, we make a vector x = (x_1, x_2, ..., x_i, ..., x_10) from the results of an automatic evaluation. Here, x_i = \sum_{k=1}^{30} f(R_k, S_{ik}) / 30, where R_k indicates the reference for the k-th topic, S_{ik} is the i-th system's output for the k-th topic, and f indicates an automatic evaluation function such as F_ESK, F_WSK, ROUGE-N, ROUGE-S, ROUGE-SU or ROUGE-L. Next, we make another vector y = (y_1, y_2, ..., y_i, ..., y_10) from the human evaluation results. Here, y_i = \sum_{k=1}^{30} E(R_k, S_{ik}) / 30. Finally, we compute r and ρ between x and y.⁶
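The two correlation coefficients used throughout this section can be computed without external libraries; this sketch omits midrank averaging for ties, which a full Spearman implementation would require.

```python
def pearson(x, y):
    """Pearson's correlation coefficient r between equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation rho: Pearson's r over ranks.
    Ties get arbitrary order here (no midranks), a simplification."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Pearson's r measures linear agreement between the automatic scores and the human scores, while Spearman's ρ measures only whether the two rank the ten systems in the same order.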
4.5 Evaluation Results and Discussions
Table 4 shows the evaluation results obtained with Pearson's correlation coefficient (r). Table 5 shows the evaluation results obtained with Spearman's rank correlation coefficient (ρ).

⁶ When using multiple references, the functions f and E used for making the vectors x and y are replaced by f_multi and E_multi, respectively.

Table 4: Results obtained with Pearson's correlation coefficient. "stop" indicates with stop word exclusion; "case" indicates without stop word exclusion.

                      short                                          long
                  R_1        R_2        R_3        R_ave        R_1        R_2        R_3        R_ave
                  stop case  stop case  stop case  stop case    stop case  stop case  stop case  stop case
ROUGE-1           .965 .884  .931 .888  .937 .879  .956 .906    .906 .876  .919 .916  .897 .891  .918 .948
ROUGE-2           .943 .960  .836 .880  .861 .906  .904 .937    .886 .930  .788 .941  .834 .616  .856 .929
ROUGE-3           .906 .936  .759 .814  .786 .846  .862 .900    .873 .909  .717 .849  .826 .431  .844 .885
ROUGE-4           .878 .914  .725 .752  .729 .794  .837 .871    .850 .890  .651 .787  .836 .292  .836 .865
ROUGE-L           .919 .777  .789 .683  .875 .867  .898 .852    .917 .840  .861 .812  .847 .829  .910 .848
ROUGE-S(inf)      .934 .914  .805 .888  .872 .938  .867 .917    .812 .863  .744 .954  .709 .547  .757 .900
ROUGE-S(9)        .929 .935  .783 .899  .808 .917  .856 .939    .840 .903  .735 .951  .730 .617  .787 .927
ROUGE-S(4)        .936 .943  .802 .891  .839 .917  .877 .940    .876 .920  .778 .945  .814 .663  .840 .932
ROUGE-SU(inf)     .934 .914  .805 .887  .872 .937  .867 .917    .811 .864  .743 .954  .707 .547  .756 .900
ROUGE-SU(9)       .926 .938  .765 .890  .789 .906  .845 .936    .829 .904  .705 .948  .701 .586  .766 .925
ROUGE-SU(4)       .930 .945  .772 .865  .810 .889  .861 .927    .868 .921  .730 .928  .785 .620  .818 .925
F_ESK(d=2, β=2)     .942       .927       .921       .957         .941       .957       .967       .969
F_ESK(d=2, β=3)     .929       .943       .928       .965         .939       .962       .959       .967
F_ESK(d=3, β=2)     .939       .923       .919       .962         .926       .954       .953       .966
F_ESK(d=3, β=3)     .927       .933       .920       .964         .920       .947       .904       .949
F_ESK(d=4, β=2)     .921       .900       .897       .955         .900       .932       .890       .946
F_ESK(d=4, β=3)     .909       .900       .888       .950         .892       .921       .819       .922
F_WSK(d=2, β=2)     .939       .900       .897       .942         .931       .923       .936       .939
F_WSK(d=2, β=3)     .928       .921       .909       .958         .932       .939       .950       .950
F_WSK(d=3, β=2)     .938       .902       .886       .947         .924       .921       .934       .944
F_WSK(d=3, β=3)     .928       .922       .895       .960         .920       .929       .919       .942
F_WSK(d=4, β=2)     .929       .896       .873       .947         .910       .913       .908       .938
F_WSK(d=4, β=3)     .918       .915       .879       .956         .903       .913       .865       .925

The tables show results obtained with and without stop word exclusion for the entire ROUGE family. For ROUGE-S and ROUGE-SU, we use three variations following (Lin, 2004b): maximum skip distances of 4, 9 and infinity.⁷ In addition, we examine β = 2 and β = 3 for the ESK-based and WSK-based methods. The decay parameter λ for F_ESK and F_WSK is set at 0.5. We will discuss these parameter values in Section 4.6.
From the tables, ROUGE-N's r and ρ decrease monotonically with N when we exclude stop words. In most cases, the performance is improved by including stop words for N (≥ 2). There is a large difference between ROUGE-1 and ROUGE-4. The ROUGE-S family is comparable to the ROUGE-SU family, and their performance is close to that of ROUGE-1 without stop words and ROUGE-2 with stop words. ROUGE-L is better than both ROUGE-3 and ROUGE-4 but worse than ROUGE-1 or ROUGE-2.

On the other hand, F_ESK's correlation coefficients (r) do not change very much with respect to d. Even if d is set at 4, we can obtain good correlations. The behavior of the rank correlation coefficients (ρ) is similar to the above. The difference between the ROUGE family and our method is particularly large for long summaries. By setting d = 2, our method gives good results. The optimal β varies across the data sets; however, the difference between β = 2 and β = 3 is small.

For ρ, our method outperforms the ROUGE family except on R_1. By contrast, we can see that d = 3 or d = 4 provided the best results. The differences between our method and the ROUGE family are larger than for r.

For both r and ρ, when multiple references are available, our method outperforms the ROUGE family.

⁷ For ROUGE-S and ROUGE-SU we use β = 1, 2, and 3. However, there is little difference among the correlation coefficients regardless of β because the number of words in the reference and the number of words in the system output are almost the same.
Although ROUGE-1 sometimes provides better results than our method for short summaries, it has a critical problem: ROUGE-1 disregards word sequences, making it easy to cheat. For instance, we can easily obtain a high ROUGE-1 score by using a sequence of high Inverse Document Frequency (IDF) words. Such a summary is incomprehensible and meaningless, but it obtains a good ROUGE-1 score comparable to those of the top TSC-3 systems. By contrast, it is difficult to cheat the other members of the ROUGE family or our method.
Our evaluation results imply that F_ESK is robust
Table 5: Results obtained with Spearman's rank correlation coefficient. "stop" indicates with stop word exclusion; "case" indicates without stop word exclusion.

                      short                                          long
                  R_1        R_2        R_3        R_ave        R_1        R_2        R_3        R_ave
                  stop case  stop case  stop case  stop case    stop case  stop case  stop case  stop case
ROUGE-1           .988 .964  .842 .891  .842 .855  .927 .903    .818 .830  .903 .806  .867 .855  .842 .915
ROUGE-2           .927 .976  .770 .794  .855 .842  .879 .903    .721 .891  .721 .855  .794 .648  .818 .903
ROUGE-3           .879 .927  .588 .697  .818 .818  .867 .927    .758 .842  .636 .745  .806 .564  .709 .855
ROUGE-4           .818 .879  .721 .697  .745 .745  .867 .867    .685 .794  .564 .612  .830 .455  .709 .758
ROUGE-L           .927 .830  .661 .600  .806 .818  .879 .806    .842 .770  .576 .612  .636 .709  .879 .697
ROUGE-S(inf)      .939 .939  .673 .818  .794 .818  .818 .927    .770 .879  .636 .818  .697 .527  .709 .867
ROUGE-S(9)        .879 .952  .600 .745  .721 .794  .733 .939    .758 .806  .576 .806  .673 .564  .745 .855
ROUGE-S(4)        .891 .964  .600 .794  .794 .794  .794 .939    .709 .842  .576 .770  .770 .733  .758 .842
ROUGE-SU(inf)     .939 .939  .673 .818  .794 .818  .818 .927    .770 .879  .636 .818  .697 .553  .709 .867
ROUGE-SU(9)       .879 .964  .600 .745  .721 .794  .745 .939    .745 .806  .576 .758  .612 .564  .745 .903
ROUGE-SU(4)       .879 .988  .600 .745  .721 .770  .794 .903    .758 .855  .576 .794  .709 .612  .794 .842
F_ESK(d=2, β=2)     .952       .879       .855       .939         .842       .927       .903       .903
F_ESK(d=2, β=3)     .952       .915       .891       .939         .855       .903       .903       .903
F_ESK(d=3, β=2)     .964       .867       .867       .976         .818       .927       .879       .879
F_ESK(d=3, β=3)     .964       .891       .915       .976         .758       .903       .709       .891
F_ESK(d=4, β=2)     .927       .830       .867       .952         .661       .903       .733       .915
F_ESK(d=4, β=3)     .927       .842       .842       .988         .588       .903       .673       .891
F_WSK(d=2, β=2)     .976       .794       .830       .952         .818       .867       .806       .891
F_WSK(d=2, β=3)     .952       .842       .830       .952         .818       .867       .794       .903
F_WSK(d=3, β=2)     .976       .794       .818       .939         .806       .855       .733       .879
F_WSK(d=3, β=3)     .976       .879       .855       .952         .806       .818       .794       .915
F_WSK(d=4, β=2)     .964       .794       .818       .939         .806       .855       .697       .915
F_WSK(d=4, β=3)     .964       .867       .855       .976         .745       .855       .770       .915
Table 6: Best scores for each data set.

Pearson's correlation coefficient (r)
Length      R_1         R_2         R_3         R_ave
short       .945        .946        .933        .967
(d, λ, β)   (2,0.7,2)   (2,0.7,4)   (2,0.1,3)   (2,0.7,3)
long        .941        .962        .971        .972
(d, λ, β)   (2,0.6,2)   (2,0.6,3)   (2,0.7,2)   (2,0.8,2)

Spearman's rank correlation coefficient (ρ)
Length      R_1         R_2         R_3         R_ave
short       .964        .915        .915        .988
(d, λ, β)   (3,0.9,4)   (2,0.3,4)   (3,0.5,3)   (4,0.7,4)
long        .855        .927        .915        .939
(d, λ, β)   (2,0.8,4)   (3,0.5,2)   (2,0.5,4)   (2,0.8,3)
for d and the length of the summary, and that it correlates closely with human evaluation results. Moreover, it offers no trivial way of obtaining a good score. These are significant advantages over the ROUGE family. In addition, our method outperformed the WSK-based method in most cases. This result confirms the effectiveness of semantic information and the significant advantage of the ESK.
4.6 Effects of Parameters
Our method has three parameters: n, λ, and β. In this section, we discuss the effects of these parameters. Figure 1 shows Pearson's correlation coefficient r and Spearman's rank correlation coefficient ρ for various λ and β values on one of the data sets. Note that we set n to 2 in the figure because the tendency is similar for the other values of n. From Fig. 1, we can see that β = 1 is not good. With automatic summarization, 'precision' is not necessarily a good evaluation measure because highly redundant summaries may obtain a very high precision. On the other hand, 'recall' is not good when a system's output is redundant. Therefore, treating 'precision' and 'recall' equally does not give a good evaluation measure. The figure shows that several β values, including 5, are good for r, and that several values, including infinity, are good for ρ. Moreover, the figure shows a significant difference between λ = 1 and the other values of λ. This implies an advantage of our method over ROUGE-S and ROUGE-SU, which cannot apply a decay factor to skip-n-grams.
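The role of the decay factor λ can be illustrated with a simple skip-bigram overlap score. This is only a sketch of the idea, not the paper's ESK (which additionally matches word senses through the kernel computation): each matched skip bigram is weighted by λ raised to the number of skipped words, so λ = 1 recovers plain ROUGE-S-style counting while smaller λ discounts long-distance pairs.

```python
from collections import defaultdict

def skip_bigrams(tokens, decay):
    """Collect skip bigrams, weighting each by decay**gap (gap = words skipped)."""
    weights = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            gap = j - i - 1
            weights[(tokens[i], tokens[j])] += decay ** gap
    return weights

def skip_bigram_similarity(cand, ref, decay=0.5):
    """Weighted skip-bigram overlap between a candidate and a reference.

    The geometric-mean normalization here is one simple choice, not the
    paper's exact formulation.
    """
    c, r = skip_bigrams(cand, decay), skip_bigrams(ref, decay)
    overlap = sum(min(c[k], r[k]) for k in c if k in r)
    norm = (sum(c.values()) * sum(r.values())) ** 0.5
    return overlap / norm if norm else 0.0
```

With decay=1.0 every skip bigram counts fully regardless of gap, mimicking ROUGE-S; with decay<1.0 the pair ("a", "b") extracted from "a x b" contributes only decay**1.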
From Fig. 1, we can also see that ρ is more sensitive to β than r is. Here, several β values, including infinity, obtained the best results, and β = 1 was again the worst. This result indicates that we have to set the parameter values properly for different tasks. For these β values, λ does not greatly affect the correlation over the middle of its range.
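The role of β matches the standard weighted F-measure combining precision and recall. The following is a sketch of that standard formula (the paper's exact combination of kernel-based precision and recall may differ): β = 1 treats precision and recall equally, larger β emphasizes recall, and β → ∞ reduces to recall alone.

```python
def f_beta(precision, recall, beta):
    """Weighted F-measure: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta = 1 gives the harmonic mean of P and R; beta -> infinity
    returns recall alone.
    """
    if beta == float("inf"):
        return recall
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1.0 + b2) * precision * recall / denom if denom else 0.0
```

For example, `f_beta(0.2, 0.8, float("inf"))` returns 0.8, the recall; this is why β = 1 (equal weighting) performs poorly when precision alone is inflated by redundant summaries.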
Table 6 shows the best results obtained when we examined all parameter combinations; the best parameter settings are shown in brackets. For r, n = 2 provides the best result, and mid-range λ with β = 2 or 3 is good in most cases. On the other hand, the best settings for ρ vary with
Figure 1: Correlation coefficients for various values of λ and β. (Left panel: Pearson's correlation coefficient vs. λ; right panel: Spearman's rank correlation coefficient vs. λ; curves for β = 1, 2, 3, 4, 5, and ∞.)
the data set. β = ∞ is not always good for ρ.
In short, we can see that the decay parameter for skips is significant and that long skip-n-grams are effective, especially for ρ.
These results show that our method has an advantage over the ROUGE family. In addition, our method is robust and performs sufficiently well even if close attention is not paid to the parameters.
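The two figures of merit used throughout the experiments, Pearson's correlation coefficient r and Spearman's rank correlation coefficient ρ, can be computed as follows. This is a minimal sketch that ignores tied values in the ranking step:

```python
def pearson(xs, ys):
    """Pearson's correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's r computed over ranks.

    Ties are not averaged here, which a full implementation would handle.
    """
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))
```

In the evaluation above, xs would hold the automatic scores assigned to the systems and ys the corresponding human evaluation scores.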
5 Conclusion
In this paper, we described an automatic evaluation method based on the ESK, which measures the similarity between texts on the basis of sequences of words and word senses. Our experiments showed that our method is comparable to the ROUGE family for short summaries and outperforms it for long summaries. In order to show that our method is language independent, we will conduct an experimental evaluation using DUC's evaluation data. We believe that our method will also be useful for other natural language generation tasks. We are now planning to apply our method to the evaluation of machine translation.
References
N. Cancedda, E. Gaussier, C. Goutte, and J-M. Renders. 2003.
Word Sequence Kernels. Journal of Machine Learning Re-
search, 3(Feb):1059–1082.
M. Collins and N. Duffy. 2001. Convolution Kernels for Nat-
ural Language. In Proc. of Neural Information Processing
Systems (NIPS’2001).
D. Harman and P. Over. 2004. The Effects of Human Variation
in DUC Summarization Evaluation. In Proc. of Workshop
on Text Summarization Branches Out, pages 10–17.
T. Hirao, T. Fukusima, M. Okumura, C. Nobata, and H. Nanba.
2004a. Corpus and Evaluation Measures for Multiple Docu-
ment Summarization with Multiple Sources. In Proc. of the
COLING, pages 535–541.
T. Hirao, J. Suzuki, H. Isozaki, and E. Maeda. 2004b.
Dependency-based Sentence Alignment for Multiple Docu-
ment Summarization. In Proc. of the COLING, pages 446–
452.
C. Hori, T. Hori, and S. Furui. 2003. Evaluation Methods
for Automatic Speech Summarization. In Proc. of the Eu-
rospeech2003, pages 2825–2828.
S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa,
K. Ogura, Y. Ooyama, and Y. Hayashi. 1997. Goi-Taikei
– A Japanese Lexicon (in Japanese). Iwanami Shoten.
C-Y. Lin and E. Hovy. 2003. Automatic Evaluation of Sum-
maries Using N-gram Co-occurrence Statistics. In Proc. of
the NAACL/HLT, pages 150–157.
C-Y. Lin and F.J. Och. 2004. Automatic Evaluation of Machine
Translation Quality Using Longest Common Subsequence
and Skip-Bigram Statistics. In Proc. of the ACL, pages 606–
613.
C-Y. Lin. 2004a. Looking for a Few Good Metrics: ROUGE and its
Evaluation. In Proc. of the NTCIR Workshops, pages 1–8.
C-Y. Lin. 2004b. ROUGE: A Package for Automatic Evalua-
tion of Summaries. In Proc. of Workshop on Text Summa-
rization Branches Out, pages 74–81.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and
C. Watkins. 2002. Text Classification using String Kernels.
Journal of Machine Learning Research, 2(Feb):419–444.
K. Papineni, S. Roukos, T. Ward, and W-J. Zhu. 2002. BLEU:
a Method for Automatic Evaluation of Machine Translation.
In Proc. of the ACL, pages 311–318.
R. Soricut and E. Brill. 2004. A Unified Framework for Auto-
matic Evaluation using N-gram Co-occurrence Statistics. In
Proc. of the ACL, pages 614–621.
