Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, pages 9–16, Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
On the Subjectivity of Human Authored Short Summaries
BalaKrishna Kolluru Yoshihiko Gotoh
University of Sheffield, Department of Computer Science
Sheffield S1 4DP, United Kingdom
{b.kolluru, y.gotoh}@dcs.shef.ac.uk
Abstract
We address the issue of human subjec-
tivity when authoring summaries, aiming
at a simple, robust evaluation of machine
generated summaries. Applying a cross
comprehension test on human authored
short summaries from broadcast news, the
level of subjectivity is gauged among four
authors. The instruction set is simple, thus
leaving ample room for subjectivity. However,
the approach is robust because the test does not
use absolute scores, relying instead on relative
comparison, which effectively alleviates the
subjectivity. Finally, we illustrate the application of
the above scheme when evaluating the in-
formativeness of machine generated sum-
maries.
1 Introduction
Subjectivity plays an important role when remov-
ing the unwanted or redundant information for sum-
marising a document. Human beings tend to dis-
agree on what should be a ‘one good summary’
(Mani, 2001). This is probably because every indi-
vidual, whilst arriving at a summary, looks at things
from a different perspective. Guided by various
factors such as educational background, profession,
personal interests and experience, an individual de-
cides whether a certain aspect is worth being in-
cluded in a summary. What might seem relevant
to one person could be deemed redundant by an-
other when reading the same story, thus account-
ing for more than one ‘correct’ summary. The is-
sue of subjectivity gains prominence as the compres-
sion ratio increases, i.e., the shorter the summary, the
larger the number of ‘correct’ summaries (Lin and
Hovy, 2003b). This is because assimilation of
seemingly important content takes priority while
redundant information is discarded, and that choice
is highly subjective.
Although subjectivity reflects an individual’s
thoughts, there will also be some information com-
monly observed in different summaries of the same
story. Stated otherwise, words in a summary may
vary, phrases may vary, and often the grammatical
structure may not be the same, but a certain degree
of information may be common across summaries.
To what degree is information uniform across dif-
ferent summaries? How much subjectivity is there?
How do we account for similar information stated
using different words, expressions, or grammatical
structure when comparing summaries? How does
this help when gauging the informativeness? Does
the subjectivity cause any adverse effects when eval-
uating summaries? It is these questions that we aim
to address in this paper.
Let us assume that the atomic facts of a summary
account for its relevance. Then, a simple question
that elicits any one of these atomic facts represents
a benchmark for assessing its informativeness. We
wish to evaluate the quality of a summary in terms of
atomic facts commonly observed in, or subjectively
discarded from, assorted human authored short sum-
maries. In our quest to quantify the subjectivity, we
devise a cross comprehension test along the lines
of (Hirschmann et al., 1999) for extracting atomic
contents. The comprehension test is modelled on
a question-answer style framework. ‘Crossing’ the
model turns out to be an effective scheme for mea-
suring the divergence among multiple summaries.
Questions are prepared by the subject who wrote
the original summary (Section 3). Their answers
should be derived by reading the summary alone.
Summary-questionnaire pairs are then swapped in
such a way that any summary is paired with ques-
tions written by other subjects (Section 4). The num-
ber of questions that cannot be answered by reading
the summary accounts for the subjectiveness of the
author (Section 5). Finally, we address how the cross
comprehension test can be used for evaluating ma-
chine generated summaries (Section 6).
2 Related Work
There have been a number of studies concerned with
collating and analysing of human authored sum-
maries, with the aim of producing and evaluating
machine generated summaries. A phrase weighting
process called the ‘pyramid method’ was described
in (Nenkova and Passonneau, 2004). They exploited
the frequency of the same (similar) information that
was in multiple summaries of the same story. It was
referred to as a summarisation content unit (SCU).
Increasing stability of pyramid scores was observed
as the pyramid grew larger. The authors concluded,
however, that the initial creation of the pyramid was
a tedious task because a large number of SCUs had
to be hand annotated.
In (Van Halteren and Teufel, 2003), the co-
occurrence of atomic information elements, called
factoids, was examined whilst analysing 50 different
summaries of two stories. A candidate summary was
compared with the reference using factoids in or-
der to measure the informativeness. The authors ob-
served that from a wide selection of factoids only a
small number were included in all summaries. From
a pool of factoids, approximately 30% were taken to
build a consensus summary that could be used as a
‘gold standard’.
Summary evaluation has been recognised as a
sensitive, non-trivial task. In (Radev and Tam, 2003)
the relative utility was calculated based on a signif-
icance ranking assigned to each sentence. A word
network based summary evaluation scheme was pro-
posed in (Hori et al., 2003), where the accuracy was
weighted by the posterior probability of the manual
summaries in the network. Significantly, they
surmised that their criterion was independent of the
variations in hand summaries.
A regression analysis in (Hirohata et al., 2005)
concluded that objective evaluations were more
effective than subjective approaches. Although their
experiments were concerned with presentation
speech, the results do have
a universal appeal.
Another notable development in the field is the
n-gram co-occurrence matching technique as
proposed in (Lin and Hovy, 2003a). Their tool, ROUGE,
compares the number of n-gram matches between a
reference and a machine generated summary. Re-
cently, ROUGE was piloted for evaluation of sum-
maries from newspaper/newswire articles (Over and
Yen, 2004). ROUGE simulated the manual evaluation
well for that task, although it remains unclear how
well it carries over to other tasks.
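The n-gram co-occurrence idea can be sketched minimally as follows (this is not the ROUGE implementation itself; the function names are ours):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, candidate, n=1):
    """Fraction of reference n-grams also present in the candidate,
    with clipped counts, in the spirit of ROUGE-n recall."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For instance, at the unigram level, `ngram_recall("senate to vote on nato expansion", "senate to vote on the nato bill")` yields 5/6, since five of the six reference words recur in the candidate.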
To some extent, the work described in this paper
is close to that of (Nenkova and Passonneau, 2004)
and (Van Halteren and Teufel, 2003). We analyse
human authored summaries associating human sub-
jectivity with their unique interpretation of stories.
We consider their effect when evaluating machine
generated summaries.
3 Production of Human Authored Short
Summaries
Our aim is to investigate an effective, robust ap-
proach to summary evaluation. In this paper, we
identify and quantify the aspect of human subjec-
tivity while authoring short summaries. To this end,
four subjects produced a short summary (approxi-
mately 100 characters, or 15 words) for broadcast
news stories given a simple instruction set. This
summary is referred to as a ‘one line’ summary be-
cause it corresponds approximately to the average
sentence length for this data set.
3.1 Author Profiles
Four summary authors are briefly profiled below:
Subject A. A linguist by profession, a polyglot out
of interest, and an author by hobby. This subject is
fluent in English, Spanish and French; English being
the first language. The subject is trained to write
summaries and translations.
Subject B. A manager by qualification and a poly-
glot by necessity; English is a second language. This
subject was trained in making presentations and doc-
umentation. We hoped to benefit from the synergy
of both fields for summary production.
Subject C. A physicist by qualification and cur-
rently working towards a PhD in speech recognition.
English is the first language. In addition, this subject
has an interest in theatre and drama, thus is exposed
to literature and related fields.
Subject D. Working on research in multiparty meet-
ings as a post doctoral fellow. English is the first lan-
guage for this subject, who also has experience of
meeting summarisation.
All subjects are educated to at least graduate level,
and are fluent in English. It was expected that
they could produce summaries of good quality with-
out detailed instruction or further training. A simple
instruction set (discussed later) was given, leaving
wide room for interpretation about what might be
included in the summary. Hence subjectivity was
promoted.
3.2 Data
The human subjects worked on a small subset of
American broadcast news stories from the TDT-2
corpus (Cieri et al., 1999). They were used for NIST
TDT evaluations and the TREC-8 and TREC-9 spo-
ken document retrieval evaluations. Each program
in the corpus contained 7 to 8 news stories on aver-
age, spanning 30 minutes as broadcast, which might
be reduced to 22 minutes once advertisement breaks
were removed. A set of 51 hand transcriptions was
manually selected from the corpus. The average
length was 487 words in 25 sentences per transcrip-
tion.
3.3 Instructions
Summary production. A simple instruction was
given to the subjects in order to arrive at a summary:
• Each summary should contain about 100 char-
acters, possibly in the subject’s own words.
As the news stories ranged from 16 to 84 sentences,
subjects would have to prioritise information that
could be included in their ‘one line’ summary. The
instruction implicitly encouraged the subjects to put
as much important information as possible into a
summary, while maintaining a good level of fluency.
It was also a flexible instruction so that subjects were
able to use their own expressions when necessary.
After completion of the task, they commented that
this instruction made them experiment with differ-
ent words to shorten or expand the information they
wanted to include. For example, an earthquake
disaster could be expressed in different ways:
8000+ feared dead? ... or
thousands of people killed? ... or
a lot of people are believed to be dead?
Another feature of this instruction was the amount
of generalisation that a subject was likely to use. For
example, a subject could say
US Senate to decide on tobacco bill
but, given the length constraints, it could instead be
Senate to vote on bill, hiking tobacco price
while adding extra information, but omitting specific
details.
Questionnaire production. When producing sum-
maries, subjects were aware that they also had to
prepare questions with the following instructions:
• A questionnaire may consist of 2–4 questions;
• An answer must be found in the particular sum-
mary, without reading the entire story;
• Yes / no questions should not be used;
• The summary may roughly be reconstructed
from the question-answer set.
Each fact might be questioned in such a way that the
particular summary could be recovered. Ideally we
would expect each question to elicit a precise infor-
mation point chosen for the summary — e.g., who
did it, when did it happen, what was the cause? The
question-answer set enabled us to gauge the most
relevant information as decided by the subjects, so
that their subjectiveness became apparent.
3.4 Full Sample
A ‘one line’ summary-questionnaire pair was pro-
duced for 51 broadcast news stories by each of the
four subjects. The statistics in Table 1 show the av-
erage number of words and characters for each sum-
mary. It is observed that Subjects A (6.1 characters /
word) and C (5.8) tended to use longer words than B
Subject   #words   #characters   #questions
A         16       113           3.7
B         17       99            3.5
C         12       81            2.4
D         21       131           3.0
Table 1: This table shows the average number of
words and characters for each summary, and the av-
erage number of questions per summary.
(4.9) and D (5.3). The table also shows how the av-
erage number of questions varies between subjects.
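Statistics of this kind can be reproduced from a subject's summaries along these lines (a sketch; the paper does not state whether spaces were counted as characters, so we exclude them here):

```python
def summary_stats(summaries):
    """Average word count, character count (spaces excluded),
    and characters per word over a set of summaries."""
    n = len(summaries)
    avg_words = sum(len(s.split()) for s in summaries) / n
    avg_chars = sum(len(s.replace(" ", "")) for s in summaries) / n
    return avg_words, avg_chars, avg_chars / avg_words
```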
Table 2 shows a full sample. The complete news
story is found in the Appendix. The difference be-
tween the four summaries can be clearly observed.
One noticeable aspect is the amount of abstraction
preferred by various subjects. Both Subjects A and
D fully utilised words from the news story and made
a small amount of abstraction. In particular, Sub-
ject A chose to pick out a person (‘Fisher’) who
conducted the study, while D opted for specifics of
the study (‘dopamine’ — a responsible chemical).
On the other hand, Subjects B and C have rendered
their interpretation of the story in their own expres-
sions. They have produced a highly abstracted sum-
mary reflecting the sense of the story while ignoring
the specifics — nevertheless they were very different
from each other. All four summaries happen to be of
good quality, however it is the sheer divergence in
the words, the expressions and subjective interpreta-
tion that is striking.
Word usage among the subjects is also interest-
ing — e.g., ‘visual images’ as against ‘physical
traits’; similarly ‘inner feelings’ as against ‘chem-
istry’. Such expressions and idioms are open for in-
terpretation, making it difficult to quantify the infor-
mativeness of any summary.
There also exist many factual news stories among
the 51 test stories. It is left for a future study to
compare between factual and non-factual news, in
particular about the amount of abstraction.
4 Cross Comprehension Test
Each question can extract a relevant answer from the
particular summary by the same author. If a ques-
tion set were applied to a different summary, some
answers may be discernible whereas others may not.
The cross comprehension test achieves this by swap-
Subject A
Summary:
Fisher’s study claims we seek partners using unconscious
love maps; women prefer status, men go for physical traits.
Questions:
1. Who is the author of this study?
2. What claim does the researcher make concerning our
method for seeking a sexual partner?
3. What do women look for in men?
4. What do men go for?
Subject B
Summary:
Internal feelings of love between men and women are
unique; external features depend on culture.
Questions:
1. What are unique?
2. What is this topic about?
3. What differs between men and women?
4. Why does it differ?
Subject C
Summary:
Culture and chemistry both play a role in the science of
romance.
Questions:
1. What is being discussed?
2. What are the factors affecting the particular event?
Subject D
Summary:
Men are turned on by visual images and women are more
focused on someone’s character traits, based on dopamine.
Questions:
1. What do women look for in men?
2. What do men look for in women?
3. What is the chemical that controls attraction?
Table 2: Summary-questionnaire pairs produced
from broadcast news stories by four subjects.
ping a summary-questionnaire pair, i.e., each sum-
mary was paired with questions produced by differ-
ent authors. Figure 1 illustrates the way it works.
A single judge examines whether each question
can be answered by reading a swapped summary.
The judge is a person different from the four sum-
mary authors. Further, if the answer is found, it may
be relevant, partially relevant, or totally irrelevant to
the one expected by the author. Thus, the decision is
made from the following four options:
relevant: a relevant answer is found — the answer
is deemed to be relevant if it conveys the same
meaning as expected by the author even if a dif-
ferent expression is used;
partially relevant: an answer is partially relevant;
Figure 1: The cross comprehension test swaps
summary-questionnaire pairs between subjects. For
example, a summary by Subject A may be ques-
tioned by those set by Subjects B, C, and D.
irrelevant: an answer is found, but is totally differ-
ent from that expected by the author.
not found: no answer is found.
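The swapping scheme and the four-way decision can be sketched as follows (the data structures and names are ours, not from the paper):

```python
from itertools import permutations

# The four options available to the judge
LABELS = ("relevant", "partially relevant", "irrelevant", "not found")

def crossed_pairs(subjects):
    """All (summary author, question author) pairs with distinct
    authors, i.e. every summary meets every other subject's questions."""
    return list(permutations(subjects, 2))
```

With four subjects this yields twelve crossed pairs, i.e. three question sets per summary, and no summary is ever judged against its own author's questions.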
Sample (re-visited). Table 3 shows the summary
and questions crossed from the sample in Table 2.
For example, when the ‘one line’ summary authored
by Subject A is matched with Subject B’s questions,
corresponding answers may be
1. ?;
2. seeking partners;
3. women prefer status, men go for physical traits;
4. unconscious love maps.
We may thus conclude answers are ‘not found’, ‘rel-
evant’, ‘irrelevant’, and ‘partially relevant’ because,
from Table 2, actual answers sought by B were
1. internal feelings;
2. love between men and women;
3. external features;
4. cultural reason.
Compensating ill-framed questions. We are
aware that not all ‘one line’ summaries were well
written. For example, it may be difficult to reach
the expected answer (‘external features’) for Question
3 by Subject B (‘What differs between men and women?’)
by reading the summary from the same subject.
Moreover, subjects occasionally set a question that
could not be answered properly by reading the par-
ticular summary alone. By crossing the summary-
questionnaire pair, ill-framed questions are effec-
tively compensated, because they are equally posed
to all candidate summaries.
Judgement difficulty. One potential problem in
this scheme is the difficulty a judge may face when
choosing from the four options. A judge’s decision
can also be affected by subjectivity. Our assump-
tions are that (1) because there are only four options,
there is less room for the subjectivity in comparison
Summary by Subject A:
Fisher’s study claims we seek partners using unconscious
love maps; women prefer status, men go for physical traits.
Questions by Subject B:
1. What are unique? (N)
2. What is this topic about? (R)
3. What differs between men and women? (I)
4. Why does it differ? (P)
Questions by Subject C:
1. What is being discussed? (R)
2. What are the factors affecting the particular event? (R)
Questions by Subject D:
1. What do men look for in women? (R)
2. What do women look for in men? (R)
3. What is the chemical that controls attraction? (N)
Table 3: What if the summary by Subject A is ques-
tioned by Subjects B, C, or D? (R), (P), (I), and (N)
after each question indicate the answer is relevant,
partially relevant, irrelevant, and not found.
to the summary writing task, and that (2) a decision
between ‘relevant’ and ‘partially relevant’ and one
between ‘irrelevant’ and ‘not found’ are both not
very important because the former two are roughly
associated with commonly shared information and
the latter two correspond to the subjective part. Al-
though the following section shows results by a sin-
gle judge, we are currently conducting the same ex-
periments using multiple judges in order to quantify
our assumptions.
5 Evaluation Results
Each of the four ‘one line’ summaries from the 51
broadcast news stories were evaluated using three
sets of ‘crossed’ questions.
5.1 Summary Relevance
Figure 2(a) shows, when paired with questions by
other subjects, how many answers could be found
in a candidate summary. The figure indicates that
summaries authored by the different subjects con-
tained ‘relevant’ information for less than half (47%
overall average for four subjects) of questions. The
number goes up slightly (61%) if ‘partially relevant’
answers are included. The number of answers that
were ‘not found’ indicates the level of subjectivity
for this ‘summary writing’ exercise; more than one
third (35%) of information that one subject thought
Figure 2: Summary relevance was measured when evaluated against questions by other subjects, while
questionnaire relevance was calculated when evaluated against summaries by other subjects.
was the most important was discarded by the oth-
ers. We surmise that ‘irrelevant’ answers were also
caused by the subjectivity; occasionally authors ar-
rived at contradictory summaries of the same story
due to its ambiguous nature. In such cases, ques-
tions were produced from that author’s subjective
view, and they certainly affected the relevance of a
summary by the other subject.
Another notable outcome of this experiment is
that the number of answers found ‘relevant’, ‘par-
tially relevant’ or ‘irrelevant’ was 71%, 61%, 54%
and 73% for Subjects A, B, C, and D, respec-
tively. This seems roughly proportional to the av-
erage length of summaries by each subject (113, 99,
81, and 131 characters, respectively). The longer the
summary, the more information one can write in the
summary. It is thus hypothesised that only the sum-
mary length matters for finding the ‘relevant’ infor-
mation in summaries. Looking at this outcome from
a different perspective, there is no evidence that one
author was more subjective than the others.
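Proportions such as those reported above can be obtained by tallying the judge's labels per subject, roughly as follows (a sketch with invented variable names):

```python
def relevance_breakdown(labels):
    """Proportion of each judgement label among the crossed
    questions posed to one subject's summaries."""
    total = len(labels)
    return {label: labels.count(label) / total for label in set(labels)}
```

Summing the 'relevant' and 'partially relevant' shares then gives the combined figure discussed in the text.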
5.2 Questionnaire Relevance
Figure 2(b) shows, when paired with summaries by
other subjects, how many candidate questions could
be answered. It is based on the same evaluation as
2(a), but observed from the different angle. Ap-
proximately the same number (55–59%) of ‘rele-
vant’, and ‘partially relevant’ answers were found
for Subjects A, B, and D. However, it was much
higher (80%) for Subject C. The reason seems to be
that this subject frequently set questions that might
accept a wide range of answers, while other sub-
jects tended to frame questions that required more
specific information in the summary; e.g., Subject
C’s ‘what is being discussed?’ was a general ques-
tion that was more likely to have some answer than
Subject B’s question ‘what differs between men and
women?’.
5.3 Discussion
The overall number of ‘relevant’ and ‘partially rele-
vant’ answers found by the cross comprehension test
was just over 61% for four subjects. This accounts
for the amount of information that was agreed by all
the subjects as important. For more than one third of
summary contents, subjects had different opinions
about whether they should be in their ‘one line’ sum-
maries, resulting in categories such as ‘irrelevant’ or
‘not found’. Occasionally these categories resulted
from ill-framed questions, but such questions were
infrequent. In most cases, they were caused
by the subjectivity of a different individual.
We noted earlier that only the summary length
matters and there is no evidence that one author was
more subjective than the others. It is probably be-
cause, given a clear instruction about the summary
length (i.e., roughly 100 characters for this task),
there is an upper bound for the amount of infor-
mation that anyone can fit into the summary, while
maintaining fluency. When the summary is short,
one has to make a serious decision about which im-
portant information should go into a summary, and
the decision often reflects one’s subjective thoughts.
Our argument is that, assuming the subjects’ best effort,
the amount of subjectivity was controlled by the
summary length constraints rather than an individ-
ual’s nature.
[Figure 3: a machine generated summary alongside human
authored summaries X and Y, each evaluated against human
authored question sets X and Y.]
Figure 3: Evaluation of machine generated sum-
maries by the cross comprehension test.
The diversity of summaries caused by individual
subjectivity may be alleviated by carefully drafting
an instruction set. However it probably results in
a large list of instructions, and the drafting process
certainly will not be straightforward. Further, it is
not likely that we can ever completely remove the
subjectivity from human work. Indeed, if subjec-
tivity disappeared from human authored summaries
through well crafted instructions, that would amount
to turning human activity into a mechanical process,
rather than having a machine simulate human work.
A non-trivial problem of the approach may be the
amount of human effort needed for evaluation. Pro-
duction of summary-questionnaire pairs may not be
difficult, as it is based on a simple instruction set and
even accepts ill-framed questions, but it still requires
human time. On the other hand, a judge’s role is the
most critical — it is labour intensive, and the effect
of potentially subjective judgement needs to be stud-
ied.
Although certainly not flawless, the cross com-
prehension test has its own advantage. A simple
instruction set is effective; it encourages authors to
make their best effort to put as much information
into a short summary. Most importantly, the test
is robust; it sometimes causes ill-framed questions,
but they can be compensated by relative comparison
achieved by crossing summary-questionnaire pairs.
6 Evaluation of Machine Generated
Summaries
The objective of this evaluation is to measure the in-
formation content of machine generated summaries
using a human authored summary as a yardstick.
Although often very subjective, a human
summary can still serve as a reference if we do not
treat it as a ‘gold standard’.
The cross comprehension test of machine gener-
ated and human authored summaries is illustrated in
Machine generated summary:
senate to vote to approve the expansion of north atlantic
treaty organisation to bigger nato means us obligations
Summary by Subject B:
US Senate to decide on NATO expansion; US assesses
bigger NATO more arms deal but poor ties with Russia.
Questions by Subject D:
1. What is happening to the NATO?
2. Who sees this move as a threat?
3. Who is bearing the main cost?
Table 4: Evaluation of machine and human authored
summaries using questions by a different subject.
Figure 3. Questions are set by a different author
from the one who wrote the summary. A human au-
thored summary may still be the best summary in
many respects, but it will no longer be considered
perfect. One may target the relevance level of the
human summary (e.g., 61% for the ‘one line’ sum-
mary task from the broadcast news stories) for auto-
matic summarisation research.
Table 4 shows one example from those with which
we are currently experimenting. Answers sought by
Subject D were ‘expansion’, ‘Russian’, and ‘Ameri-
can taxpayers’, respectively. Given this question set,
answers are ‘relevant’, ‘relevant’, and ‘not found’
for the summary by Subject B, and answers found in
the machine generated summary are ‘relevant’, ‘not
found’, and ‘not found’, respectively.
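One way to turn the category labels from Table 4 into a single comparable score is to give half credit to partially relevant answers (this weighting is our assumption; the paper reports raw category counts):

```python
# Hypothetical credit per judgement category
CREDIT = {"relevant": 1.0, "partially relevant": 0.5,
          "irrelevant": 0.0, "not found": 0.0}

def answerability(labels):
    """Share of crossed questions answerable from a summary."""
    return sum(CREDIT[l] for l in labels) / len(labels)

# Judgements from the Table 4 example
human_score = answerability(["relevant", "relevant", "not found"])     # 2/3
machine_score = answerability(["relevant", "not found", "not found"])  # 1/3
```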
7 Conclusion
In this paper, we have presented the issue of hu-
man subjectivity when authoring summaries, with
regard to producing a simple, robust evaluation of
machine generated summaries. Applying the cross
comprehension test on human authored ‘one line’
summaries from broadcast news stories, we gauged
the level of subjectivity among four authors. The
instruction set was simple, thus there was enough
room for subjectivity. However, the approach was
robust because the test did not use absolute scores,
instead relying on relative comparison, effectively
alleviating the subjectivity. We also showed the ap-
proach to evaluating machine generated summaries.
The experiment using this scheme is currently un-
derway.
Acknowledgement. This work was funded by UK
EPSRC grant GR/R42405, Statistical Summarisa-
tion of Spoken Language (S3L).

References
C. Cieri, D. Graff, M. Liberman, N. Martey, and S. Strassel.
1999. The TDT-2 text and speech corpus. DARPA Broadcast
News Workshop, Herndon, VA.
M. Hirohata, Y. Shinnaka, K. Iwano, and S. Furui. 2005.
Sentence extraction-based presentation summarization tech-
niques and evaluation metrics. ICASSP, Philadelphia.
L. Hirschmann, J. Burger, D. Palmer, and P. Robinson. 1999.
Evaluating content extraction from audio source. ESCA
Workshop: Accessing Information in Spoken Audio, Cam-
bridge.
C. Hori, T. Hori, and S. Furui. 2003. Evaluation method for
automatic speech summarization. Eurospeech, Geneva.
C. Lin and E. Hovy. 2003a. Automatic evaluation of sum-
maries using n-gram co-occurrence statistics. HLT-NAACL,
Edmonton.
C. Lin and E. Hovy. 2003b. The potential and limitations of au-
tomatic sentence extraction for summarization. HLT-NAACL
Workshop on Automatic Summarization, Edmonton.
I. Mani. 2001. Automatic Summarization. John Benjamins Pub-
lishing Company.
A. Nenkova and R. Passonneau. 2004. Evaluating content
selection in summarization: The pyramid method. HLT-
NAACL, Boston.
P. Over and J. Yen. 2004. An introduction to DUC 2004: Intrin-
sic evaluation of generic news text summarization systems.
DUC Workshop, Boston.
D. Radev and D. Tam. 2003. Summarization evaluation via
relative utility. CIKM, New Orleans.
H. Van Halteren and S. Teufel. 2003. Examining the consensus
between human summaries: Initial experiments with factoid
analysis. HLT-NAACL Workshop on Automatic Summariza-
tion, Edmonton.
