How to Appreciate the Quality of Automatic Text Summarization? 
Examples of FAN and MLUCE Protocols and their Results on SERAPHIN 
• ~ Jean-Luc Minel *, Sylvame Nugler ** and G6raid Plat *** 
CAMS-CNRS 
96 Boulevard Raspail 
75006 Pans 
mmel@cams.msh-paris.fr 
** EDF DER/GRETS 
1 Avenue du G6n~rai de Gaulle 
92000 Clamart 
\[Sylvaine.Nugier, Gerald.Pmt~der.edfgdf.fr 
Abstract 
For the SERAPHIN project, we set up two assessment 
protocols m order to be able to more accurately assess the 
• quahty of abstracts - the FAN protocol and the MLffCE 
protocol, for which we provide the results The FAH 
protocol assesses the legibdtty of an abstract, independently 
from the source text The MLUCE protocol ls designed to 
allow users of automatic abstracts to assess then" quality 
These protocols were applied to a corpus of 27 texts which 
vaned m length from between three and twelve pages These 
texts were randomly chosen from EDF archives They 
include both sclenUfic and general press articles, extracts 
from books, and internal EDP notes The results of the FAlq 
protocol demonstrate the &fficulty of using surface lmgmsuc 
m&cators to assess the. quality of an abstract, the results of 
the MLUCE protocol illustrate the importance of user 
expectations 
1 lntroductmn 
The SERAPHIN system produces abstracts usmg an 
alternative approach, the contextual exploration method 
(Descl~ et ai 97), based on the pinpointing of lmgmsttc 
indications m order to ldenUfy 0 certain structurmg 
mformaUon, l0 causal arguments and arguments by cause 
(Jacklewlcz 96), ill) different def'mmg wordings (Cartier 97) 
The abstracts are made up of sentences extracted from the 
source text, and to which semantic labels have been attached 
(Bern et al 96), representing the salient points of the source 
text from the author's point of view The size of the 
abstract ss lnmted to 20% of the source text 
The assessment of abstracts has oRen been approached from 
the computer documentatmn angle (Salton 89), particularly 
by using crRena such as system's recall and system's 
precision The mare problem with these criteria is the 
posmlatmn of the existence of a user request, expressed in 
the form of a combmauon of describers It can be seen that 
this hypothesis does not generally correspond to the reahty 
of using abstracts 
Indeed, the reader of the abstract has already selected the 
source document as being one which belongs to his field of 
mterest, and he ~s looking to obtain the most accurate 
understanding possible of the content of the document Thts 
Is why, wlthm the SERAPHIH project, we set up two 
assessment protocols, FAN and MLUCE, whtch are designed 
to better assess the quahty of abstracts . The purpose of the 
present amcle ~s not the evaluation of the SERAPHIH 
system Rself ~, but "how to evaluate" the quality of 
automattc text suraraanzatmn system 
At the outset, we wished to assess 50 texts, but the cost, 
m terms of reading tune, forced us to reduce this objective 
These two protocols were apphed to a corpus of 27 texts 
which vaned m length from between three and twelve pages 
These texts were randomly chosen from EDF archives 
They melude beth scienufic and general press articles, 
extracts from books, and internal EDF notes 
2 The FAN Protocol 
This protocol anus to assess the quahty of an abstract 
independently from the source text and the reformation It 
contains Assessment was therefore camed out by two 
jurors, who were not specialL~tS m the fields concerned, who 
read the 27 abstracts without hawng seen the source texts It 
takes approximately 10 minutes to assess an abstract The 
assessment grid contams 4 criteria, which are described below 
Crzterwn 1 Number of Anaphora Deprtved of Re.l~rents 
Given that an abstract Is created from sentences from the 
source text, It Is pusslble that a sentence contains an 
anaphora whose referent does not belong to the extracted 
sentence We must not forget that SERAPHIN does not 
detect referents, R detects what it considers to be indications 
of potentml anaphora wRhm sentence Pl, out of a closed hst 
Ot, thts, that, etc ), and then apphes a simple heunsm which 
revolves selecting the preceding sentence P0 which contains 
the potential anaphora On the one hand, thts heurnsm may 
prove to be msufficsent (no exphctt referent, the referent is 
located• m a sentence further up m the text), and on the 
other hand this heunsm is not apphed to sentence P0 (m 
order to avotd selec~ng sentences based on cnterta that are 
i this ~lll be the objectof another report 
25 
not "semantic") We believe this crRenon to be a 
determining factor with regard to the legtbthty of the 
abstract, nevertheless the jurors raised several problems We 
will illustrate these via a number of examples 
Thus, m the following sentence, 
According to thls researcher, It ts a movement Which zs 
characterlsed by a deswe to go backwards, a deswe for a 
state m which the obstruction between subject and object no 
longer exists (Text N~) 
the anaphonc term is deprived of its referent but the 
legibility and coherency of the abstract are not really altered, 
because, m this textual context, the name of the researcher ~s 
not considered .to be an important piece of information 
Nevertheless, m order to restrict • the effects of 
interpretation, this case was considered to be an anaphora 
deprwed of a referent 
In the following sentence, . .. 
One could say that these first contacts between EDF and the 
resident created a precedent whtch was notvery favourable 
wath regard to estabhshmg peaceful relationships between 
EDF and the local populatton ln, mg near the hne (Text 
N*9) 
the potentially anaphonc term these may be interpreted as 
referring either to contacts described earher m the text, or to 
' a chronological account which takes on meaning as one reads 
the entire text, and, m part, at the end of the sentence in 
question In the text under consideration, it Is the second 
interpretation which ts correct, but because the juror &d not 
have access to the source text, we treated this case as an 
anaphora deprived of a referent 
Crlterron 2 Rupture of textual segments organtsed by 
Linear lntegratton Markers 
Various studies on textual lmgmsttcs (CharoUes 89, Adam, 
90) have underlined the utdtty of locating lmgmstac markers 
m order to identify the dtscurswe organisataons which go 
beyond the sentence itself SERAPHIN identifies linear 
integration markers (MIL) from within a closed list (on the 
one hand, on the other ham~ firs@, secondly, etc ) m order 
to rebmld textual segments m the abstracts produced Thus, 
ff sentence P is selected, sentences Pi, P2, Pn which are 
linked to P via MIL's wdl also be selected This selection 
may fall, . either because the MILts not recogmsed (absent 
from the list,, elhpsts, spelling mistake, etc ), or because the 
maximum size of the abstract has been reached We should- 
stress both the fact that the jurors can only detect ruptures in 
the textual segments, and not thetr completeness, and that 
argumentation connectors (such as zndeea~ furthermore, etc ) 
are not considered to be MIL's This decision was taken 
after a preliminary validation by ten or so readers, and was 
confirmed by the jurors SERAPHIN uses a special symbol 
\[ \] to show that two sentences are not adjacent m the 
source text, such as m the following example 
It ts easy to imagine the huge number of texts and wrztten 
documents that ts produced by a company hke EDF 
\[Y Of course, all "language productwn" may appear to come 
fi'om an abstract source, a umque source .(Text N°6) 
This symbol avoids the problem of the reader mistakenly 
reconstructing argumentation chains 
Criterion 3 Presence of "tautologtcal" sentences :. 
A sentence is considered to be tautological, if the 
information It provides is completely independent of the 
source text, as m the following example 
Predicting the future is a difficult and uncertain exercise 
(Text N°4) 
We were trying to detect abstracts whlch, although 
optimal from the point of view of the two criteria above, 
had merely been created from very general sentences, as ts 
often the case with certmn abstractsby authors In fact, this 
criterion is far too dependent on the knowledge that the 
reader has of the subject of the text, and the results of the 
protocol show that It Is not pertinent 
Crtterton 4 Legzbzhty of the abstract 
This crRermn, whose values are Very Back Mediocre, 
Gooa~ Very Good, Is an overall appreciation of the abstract 
Although they are highly subjectave, the "scores" given by 
the jurors varied very little, with just the two exceptions set 
out m table 1 below 
Juror Jl Juror J2 
Text N ° 20 Mechocre Very Good 
Text N°22 Good Very Bad . 
Table 1 "Dwergence of assessment between jurors 
Text N°20 (TF1, Le Grand Bluff) Is a chronological 
description of the pnvatlsatmn of the television channel 
TF1 The successmn of events, the number of players 
revolved, metaphors such as "It ts at the foot of the wall that 
one judges the br:cldayer'; make the readmg of the abstract 
(and of the source text) difficult unless the reader has a good 
understanding of the subject concerned (exceptionally, juror 
2 happened to know this subject well) 
Text N022 (Credlbdzty of command control systems 
concepts and tools) ~ a highly technical text outside the 
experience of the two jurors 
For the presentation of the results (Tables 2 and 3) we 
systematically chose the lowest "score" 
Interpreting the results of the FAN Protocol 
We cross-referenced the legibihty criterion with criteria 1 
and 2 Contrary to our anginal hypothesis, there is no 
correlation between the two Indeed, the comments made by 
the two jurors show that overall readmg of the abstract 
• allows them to overcome any locahsed lack of understanding 
caused by the absence of anaphonc referents This 
conclusion must nevertheless be nuanced by the fact that 
there is a hmited number of mistakes m the abstracts that 
were analysed Assessmg abstracts simply on the basis of 
surface hngmstic mdlcators, and without calling upon the 
knowledge that the jurors may have of the subject concerned, 
remains a difficult problem 
\] Nommtakes \] Onemlstake \] Two mistakes \] Three mistakes \] 
26 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
l 
I 
Crtterton 1 '13" Texts 5 Texts 6 Texts 3 Texts. 
(Anaphora) 48 % 19 % 22. % 11% 
Criterion 2 ...i ::,. 10 Texts 13 Texts 4 Texts 0 Texts 
(MIL) 37 % 48 % 15 %-,. 0 % 
Crltermn 3 25 Texts 2 Texts 0 Texts 0 Texts 
("Tautology") 92 % 8 % 0 %. 0 % 
Table 2 Synthesis of the results for the first three criteria 
Very Good Good Medmere V~ry Bad 
Crltermn 4 7 Texts 13 Texts 5 Texts 2 Texts 
(LeglbflRy) 26 % 48 % 19 % 7.% 
Table 3 Syntbesls of the results for the legibdlty criterion 
3 The MLUCE Protocol 
3.1 ObjecUves 
The arm of this protocol ts to enable potentml users to assess 
the quahty of automatic abstracts In this case, an abstract, 
and therefore its quahty, wdl not be defined m any absolute 
way, but rather m terms of the ways m which it can be used 
For example, tf a person is looking for the "proven" zdea 
wlthm a text, he wall need to understand the different stages 
of the argument, on the other hand, if he only wishes to 
observe any stmultaneous occurrences of two themes wRhm 
the same text, he no longer needs to have the arguments 
The assessment of quahty therefore depends on what one 
wishes to use the abstract for (Path et al, 1961) It ts 
therefore important to pre-def'me one or more uses of the 
abstract, and for each def'mRton to accurately measure the 
"dlstance" between the source text and its abstract We 
selected twoapphcatmns for automatlc abstracts winch were 
of partlcular interest to EDF 
* Apphcatton 1 the abstract is a tool which allows one to 
decide whether or not to read the source text 
• Apphcatwn 2 the abstract is a support for writing a 
synthesis of a written document 
The MLUCE protocol therefore alms at "measuring" how 
a given abstract meets these two oblectwes 
In sectmn 3 4, we set out the results on SERAPHIN, but 
this protocol was apphed, without much tmportance, to the 
RAFI system (Lehmam 95) 
3.2 Experiment procedure 
The procedure selected for assessmg a "summar\]smg" system 
has to be precise, complete and unambiguous, m order to 
- restrict the "reader effect" as much as possible - m other 
words, to hm\]t the vartatlon m assessmentbetween readers, 
due to then" different fields of experhse, different cultures or 
varying archetypes of abstracts, 
- hmR the influence of the order m whtch the texts are read, 
- be.adapted to all types of text, and to take their differences 
rote account 
A pdot was requn.ed m order to adjust the procedure to the 
above requ~ments For the SERAPHIN assessment, we set 
up a jury of four readers 
- a qualified French language teacher, specmlLsed m teaching 
abstract and synthesis techmques, 
- a documentary researcher, working in a documentary umt, 
- two users, wRh different training backgrounds 
For the first stage of the assessment, we gave each 
member of the jury seven texts, then. abstracts, and a hst of 
mstructlons explaining the approach to be used (the "jurors" 
each had different texts Each reader-assessor then had to 
- read the documents m a pre-defined order (firstly, all the 
abstracts, then all the source texts), 
- fill m the reader's sheet (attached to each document) as he 
went along, 
- give Ins overall opmmn on the "comparison sheets" 
provided for thts purpose 
The second stage of the assessment revolved analysing 
the sheets returned by the readers 
The-whole experiment (defmamn of procedure, and the 
actual assessment) lasted a total of eight months 
3.3 Criteria retained 
On the basis of the experiments camed out by Borko et al 
(1975), Edmunson (1969), Mathls et al (1973), Payne 
(1964) on the assessment of the quahty of "abstracts", and 
m terms of the apphcat~ons defined (section 3 1), we set four 
criteria, and for each criterion we estabhshed the means of 
assessing it 
Apphcatlon I 
For the first apphcatton, the criteria defined in MLUCE are 
designed to assess the utdlty of the abstract as a statable 
dec~ston-makmg tool for the reader These criteria must 
allow one to judge whether the abstract contains the 
mformauon requwed to be able to decide whether or not to 
read the source text 
In order to do so, we wall say that the abstract must allow 
usto 
* Identify the field or nature of the source text Each reader 
fills m two grids (one for the source text and one for the 
27 
abstract) which show the fields or natures of the texts 
sctenufic or techmcal, pohtlcal, sociological, polemical, 
general, prospective, retrospective, situational or state-of- 
the-art 
• check the presence of the essentml Ideas Each reader 
underlines the ideas m text Tt which he feels to be essential, 
and checks that they are present m abstract R~ 
• avmd parasmc ideas Each reader highhghts sentences In 
R~ which should not be m RI, and the sentences m abstract Rt 
which are cut off from the context (essential ideas that have 
been cut short) 
Apphcauon 2 
For the second application, the cntena defined m MLUCE 
• are designed to assess the utthty of the abstract as a support 
for writing a synthesis of a written document 
In order to do so, we will say that the abstract must allow 
us to 
• identify the fie!d or nature of the source text (criterion 
ldenUcal to application I) 
• check the presence of the essentml ideas (cntermn identical 
to apphcatmn 1) 
• hlghhght the logical lmkmg of ideas Each reader fills in 
two grids (one for the Source text and one for the abstract) m 
which the following argumentation links appear cause 
implying consequence, consequence ~mphes cause, 
proposmon of a solution, from particular to general, from 
general to pamcular, motivated juxtaposmon of facts, hstmg 
of facts, confrontatmn He then states whether the idea ~s 
"proven" in each of the documents he has read Finally, he 
assesses whether the abstract is clear, fmrly clear, not very 
clear or incomprehensible 
3.4. Results on SERAPHIN and interpretation 
Identification of the subject (table 4) 
The texts submitted to the reader-assessors may cover 
several fields and be of several chfferent natures, which ts why 
the total number of texts shown m table 4 is greater than the 
number of texts studied (number stadled = 27) 
The categories hsted in table 4 were not explicitly defined 
to the jurors, ttts therefore possible that there is a certain 
amount of "subjectivity" m the categonsatlon of the texts 
Nevertheless, we supposed that each reader could lmphcitly 
and continuously m time divide the texts rote the categories 
proposed 
number of source texts for which 
' the abstract . 
respects the 
subject or 
scientific or 9 
technical 
political 6 
8 sociological 
polemical 
general 
prospective 
retrospective 
situational err 
state of the / 
art J 
Table 4 
• cl0es not respect: 
the subject or 
10 
2 
1 0 
3 8 
2 4 
15 
Identification of the subject 
number of 
i abstracts 
close to 2 
the text 
fmrly close to. 11 
the text 
relatively 10 
different from 
the text 
well away from 4 
the text 
Table 5 Presence of the essential ideas 
Presence of essent*al ideas (table 5) 
The result of hlghhghtmg the "essential xdeas" m the source 
text, and of the reader marking the "parasmc ideas" that 
appear m the abstract, are grouped together m order to 
define a "proximity" mdlcator This mdieator ~ defined m 
the following way 
- we will define an abstract as being close to the text ff more 
than 75% of the sentences which make it up are among the 
essential ideas (highlighted) and less than 10% are parasmc 
ideas, 
-we will define an abstract as being fairly close to the text If 
between 50% and 75% of the sentences which make it up are 
among the essential ideas (highhghted) and less than 10% are 
parasmc Ideas, 
- we will define an abstract as being rela#vely different from 
the text If between 25% and 50% of the sentences which 
28 ¸ 
I 
I 
! 
I 
I 
make it up are among the essential Ideas (hlghhghted) and 
less than i 0% are parasttlc ideas, • 
- we wzll define an abstract as being Well away from the text m 
all other cases 
Highhghtmg the logical sequence of arguments (table 6) 
We have supposed that a text was written, by hm author, m a 
precise mm (the "proven" idea) We •have ldenUfied 8 types 
of argumentaUon lmks which allowed the authors to 
construct their demonstratmn (rows m table 6) Like when 
identifying a field, the texts submRted to the jurors may link 
several types of argument, which is why the total number of 
texts shown m table 6 is greater than the number of texts 
studied (readers were asked, where necessary, to gwe detads 
of the order m which the different types appeared over the 
whole of the text) 
cause nnplymg 
consequence 
consequence 
unphes cause 
proposltmn of 
a solutton 
from 
parUcular to 
general 
if tom general 
to pamcular 
motzvated 
juxtaposRton 
of facts 
hstmg of facts 
confrontatzon 
number of source texts for 
whzch the abstract 
respects the chain does not respect 
of argument the chain 
4 7 
1 3 
6 5 
1 2 
0 3 
10 
2 0 
1 2 
Table 6 htghhghtmg the logical sequence of arguments 
Table 6 should be compared wzth table 4 Indeed, we have 
nouced, wtth regard to the texts studied, the absence m the 
abstract of the source of the argument that ,sm the original 
document Thus, a text whtch uses a gtven theory, leads to 
an abstract m whtch no theoretical base for the argument is 
given Thin might explain the bad performance of the 
"sctenttfic" or "prospective" texts and the sequences of 
"cause ~mphes consequence" or "consequence Imphes cause" 
types 
Quahty of the abstract (tables 7 and 8) 
After having determmed the field, the jurors noted the 
logical argumentatton sequencing, stated the "proven" tdea 
of the abstract, and filled m a grid m order to give their 
overall lmpress~on of the quahty of the abstract 
number of 
abstracts 
clear 6 
fmrly clear 6 
not very clear 10 
mcomprehens~bl 
e 
Table 7 quahty of ~o abstract 
number of abstracts judged to be 
clear faxrly not mcompr 
clear very e- 
clear henslble 
scientific or techmcal \] 0 5 10 4 
pohtzcal 2 5 4 2 
socmlogzcal 2 4 3 1 
I 
polemical 0 1 0 0 
general 0 1 0 0 
prospectzve 0 5 6 0 
,.,, , , 
retrospecOve 1 1 2 2 
sztuaUonal or 0 8 8 4 
state-of-the- 
art 
Table 8 quahty of the abstract m terms of fields or natures 
of the source toxt 
4 Conclusion 
A text-by-text comparison of the results Of the legxbdRy 
criterion (Very Bad, Medtocre, Gooc~ Very Good)m the 
FAN protocol, and the results of the quahty of the abstract 
(Incomprehensible, Not Very Clear. Fawly Clear, Clear) m 
the MLUCE protocol, shows very httle convergence Apart 
from two excepUons, MLUCE ts always more demanding 
than FAN Here we hlghhght the differences m assessment 
between a user who reads an abstract m order to find answers 
to speczfic questmns, and a reader who =s not trying to assess 
the mformat~on content of the same abstract 
The quahty of an abstract depends on what the user 
expects from tt, and only an in-sltu assessment wall allow one 
to really assess the performance of a "summansmg" system 
Followmg this expenmant with these two protocols, the 
mstallaUon of any such procedure would appear to be 
extremely expensive - not to mention the fact that =t would 
reqmre "user expectatmns" to be defined, and the related 
assessment cnterm to be formahsed 
29 
I 
However, the FAN and MLUCE protocols, when apphed to a 
test corpus, which remains to be defined, may nevertheless 
serve as a base on which to compare systems whtch 
summartse wa sentence extraction 
Acknowledgements 
We would hke to thank our six jurors, not only for hawng 
accepted to take pan m thts experiment, but also for thetr 
comments on certain aspects of the abstracts 

References 
Adam J M (1990), l~l~ments de hngmsttque textuelle, 
Mardaga, Liege 

Bern, J Career E, Descl~s, J-P, Jaclaewtcz, A Mmel, J-L 
(1996) Fdtrage Automattque de textss, In Natural 
Language Processing and lndustrml Apphcauons, (pp 28- 
35), Umverstt6 de Moncton, H-B, Canada 

Borko, H & Bernler, C L (1975) Abstractmg concept and 
methods San Diego, New York, Berkeley Academic 
Press, 250 p 

Cartier, E (1997) La D~finmon ses formes d'expresslon, 
son contenu et sa valeur darts les texte, th~se en cours 
Umverstt6 de Parts Sorbonne. Pans 

Charolles, M (1989) Marquages lmgutstiques et r6sum6 de 
textes, m CHAROLLES M et PETITJEAH A \[6ds\], Le 
rdaumd de texte , aspects hngutsttques, s~mwttques, 
psychohngmsttques et automatzques, Co|loque mternauonal 
de lmgutsuque organts6 par les Umverslt~s de Metz et Hancy 
II \[12-13-14 sept 1989\], Khncksteck 

Descl~s, J-P Cartter, E, Jacktewtcz A Mmel, J-L (1996) 
Textual Processing and Contextual ExploraUon Method In 
CONTEXT 97 (pp 189-197), Umversldade Federal do Rio 
de Janelro, BAsal 

Edmunson, H P (1969) Newmethods m automatic 
abstracting Journal of assoctatwn for computing 
machmery, 16(2), (pp 264--285) 

Jaclaewlcz, A (1996) La notion de cause pour le filtrage de 
phrases nnportantes d'un texte m Natural Language 
Processing and Industrial Apphcatwns, (pp 136-141), 
Umverslt6 de Moncton, N-B, Canada 

Le Roux, D, Mmel, J L & Bern, J (1994) SERAPHIIq 
project Fzrst European Conference of Cognmve $aence in 
Industry, Luxembourg, 28-30 septembre 1994 

Lehmam, A (1995) Le r6sum6 des textes techmques et 
sclenUfiques, aspects hngutstlques et computattonnels, 
ThOse de doctorat, Umverslt6 de Nancy 2 

Mathts, B A, Rush, J A & Young, C A (19'73) 
Improvement of automattc abstract by use of structural 
analysts Journal of the Amerlcan society for mformatwn 
sczence, 24(2), (pp 101-109) 

Mike, S Itoh, E (1994) A full-text retrieval system with a 
dynarmc abstract generauon functton ,m SIGIR 94, Dubhn, 
pp 152-161 

Nugter S & Ptat G (1996) Evaluauon du prototype de 
r~sum6 SERAPHIlq protocole et premters r6sultat, Note 
Interne EDF, HH-52/96/008 

Payne, D (1964) Automatic abstracting evaluatton support 
American Instttute for research Ptttsburgh, Penn AD 
431 910 

Rath, G J, Restock, A & Savage, T R (1961) The 
formatton of abstract by selectton of sentences, Sentence 
selection by man and machme, The rehabdlty of people m 
selectmg sentences American documentatwn, 12(2), (pp 
139--143) 

Sabah, G (1988) L'mtelhgence amfictelle et le langage, 
revr6sentatlon des connatssances, Hermes C, Pans 

Sal'ton, G (1989), Automatic text precessmg the 
transformauon, analysts and retrieval of reformation by 
computer, Ad&son Weshley Publ Coinp 
