Goal-Directed Approach for Text Summarization 
Ryo Ochitani, Yoshio Nakao, Fumihito Nishino 
. Fujitsu Laboratories Lmuted 
4-1-1 Karmkodanaka, Nakahara, 
Kawasakh Japan 211-88 
och~@flab, fu 3 ~tsu. co. j p, nakao@flab, fu 3 it su. co. 3P, nlszno@flab, fu 3 Itsu. co. 3P 
Abstract 
The information to InClude m a summary 
vanes depending on the author's mtentmn 
and the use of the summary To create the 
best summaries, the appropriate goals of 
the extracting process should be set and a 
guide should be outlined that instructs the 
system how to meet the tasks 
The approach described m thin report m 
intended to be a basic archltecture to ex- 
tract a set of concme sentences that are in- 
dicated or predlcted by goals and contexts 
To evaluate a sentence, the sentence selec- 
tion algorithm simply measures the mfor- 
matlveness of each sentence by comparing 
with the determined goals, and the algo- 
rlthm extracts a set of the hlghest scored 
bentences by repeat apphcatmn of thin com- 
parmon 
Thin approach m apphed m the summary of 
newspaper artlcles The headhnes are used 
as the goals Also the method to extract 
charactenstlc sentences by using property 
mformatlon of text is shown 
In thls experiment m whlch Japanese news 
articles are summarized, the sunlmarles 
consmt of about 30% of the original text 
On avelage, thin method extracts 50% 
less text than the slmple tltle-keyword 
method 
1 Introduction 
Summaly requnements (such as length and content) 
vary widely, depending on from, subject, and situa- 
tion of use For example, even sevelal sentences may 
seem too long fol news reticles obtained from a net- 
wolk Snmlarly, as ~holt as possible summaries wall 
be desuable to preview sites in a web browsel, when 
a huge number of lesults are retrieved from search 
engmeo 
To extract a short summar} for this kind of pur- 
pose, an extract coveuug all topics in the text wall 
be too long Using small number of sentences to 
extrapolate the contents of the entlre text will be 
adequate for an efficlent prevmw To include the 
intended polnts and charactermtlc mformatmn m a 
short summary, the mechamsm to detect the pur- 
pose of the summary and select the sentences that 
match the goals m needed in the summanzatmn pro- 
cess 
In thin report, an algorithm that helps reahze such 
• a goal and context lnformatton oriented summanza- 
tmn system m described The algorithm evaluates 
the informativeness of each sentence m a text and 
selects a small number of sentences, mcludmg effec- 
tive mformatmn One of the apphcatmns of thin al- 
gorithm m shown in the expellment on the sentence 
extraction from the newspaper articles and market 
surveys The experimental system uses headhnes 
and htles as the goals of the sentence selection, and 
the lesults ale shorter and more effective than the 
simple tltle-keywold method (Pmce, 90) 
The results of the cmrent simple experiment are 
based on the word matching that as the goal pro- 
cessmg However, the experiments should include 
plocE, stag of the following structural goals, the con- 
cept level matching that uses the thesaurus, and the 
topic detection flora the text 
2 The Goal-Directed Summarization 
Sunnnanes n/thls system may differ from the gen- 
eral notlon of a sunmlaly that covers all toplcs de- 
scribed in the ongmal text A summary m defined 
as a set of extlacted sentences that gives some idea 
to the leader of the contents of a text, the reader m 
able to determine whether the text ms wolth reading 
ol not based on the smnmary Under thin defimtlon, 
a sunnnary m effective if the extract Includes the au- 
thor's intentlon or leqmred mformatlon of the reader 
by the fewest numbel of sentences posslble These 
infomlatlon should be included and satisfied by ex- 
tracted sentences ale called the 'goals' The summa- 
rizatlon plocess Is graded by the goals m called 'goal- 
dnected' Figure 1 shows the system archltecture of 
a general goal dlrected summarlzatmn system Thls 
47 
ExtDm~ $0u~ 
I 
.... Goal Detect~m , L __J 
I i _ _ _ _L __ 
I~'mabve~ss 1 
Figure 1 System Architecture 
system consists of a goal detection and sentence se- 
lectmn process by Informativeness evaluation 
The 'goal-directed' method may be sound over- 
stated, because the current experimental system 
handles only the headhnes,' htles and some text 
property expressions However, the 'goal-directed' } 
method is named, as the first step toward real|zing } 
a context based summarmatmn system 
Figure 2 
3 Sentence Selection Algorithm 
The sentence selectmn algonthm calculates the 'm- 
formahveness' for each sentence m a document The 
measurement represents the strength of relatmn be- 
tween the goals, sentences, and the richness ofmfor- 
matron m a document These var|ables are defined 
by the following three numerical values 
1 Number of dtfferent sentence expressmns related to 
the goals 
2 Total number of sentence expressmns related to the 
goals 
3 Total number of sentence express|ons being not re- 
lated to the goals 
The order of these measurements defines their 
precedence The first measurement is given the high- 
est prmnty Sentences that sahsfy many of the goals 
are conmdered more mformahve Both the first and 
second values above represent the amount of tarot- 
matron included m a sentence The third measure- 
ment indicates the amount of mformatmn m a sen- 
tence and roughly simulates the contained amount 
of explanation or descnpt!on about the goal 
The sentence select|on algorithm (shown m Fig- 
ure 2) relates the highest scored sentences by the in- 
formativeness measurement The measurements are 
repeatedly evaluated until all the goals are related 
to the sentences or all relatmns are found 
4 , Goal Detection 
Tins system is designed to be built into the text pre- 
vtew menu of a word processor or the query results 
hstmg of a document retneve system Thus, the 
contents of a document are unpredictable and the 
system needs to work m real time This hmltahon 
reqmres the system handles rather rumple mforma- 
tmn For example, the word list compiled from the 
:headhnes m used as the.goals when processing news 
48 
All goals are ~pven m the goal hst 
All sentences of the source text axe given m the sentence 
hst 
while(goal emsts m the goal hst) { .. .. 
The mfonnahveneas measm:ements axe apphed 
to each sentence m the sentence list 
if (the sentence(or sentences) with max.unum 
informativeness exists) { 
The sentence m and removed from 
the sentence hst, and added into the 
extract hst 
The goals related to the sentence axe 
removed from the goal hst 
} e~e{ 
The algonthm stops 
Algonthm of the mformahve selection 
Simple tltle-keyword Informativeness 
lectmn 
Extractm ~ Number of 
Rate Arhdes 
100% 2,237 
90% - 1,083 
80% - 1,758 
70% - 1,642 
60% - 1,441 
' 50% - 1,250 
40% - 1,027 
30% - 813 
20% - 654 
10% - 501 
- 10% 218 
0% 938 
Total 13,562\] 
Average 64% 
Medtan 70% 
Kate Number of 
Artmles 
16 5% 450 
8 0% 43 
13 0% 186 
12 1% 359 
10 6% 587 
9 2% 944 
7 6% 1,506 
6 0% 2,061 
4 8% 2,765 
3 7% 2,673 
1 6% 1,050 
6 9% 938 
100% 13,562 4 
32% 
27% 
Se- 
Kate 
33% 
0 3% 
1 4% 
2 7% 
4 3% 
7 0% 
II 1% 
15 2% 
20.4% 
19 7% 
7 7% 
6 9% 
100% 
Table 1 Extraction rates of newspaper arhcles 
arhcles The htle words are used to extract a text 
from a report These simple word hsts may be too 
simple and a httle inadequate as goals 
Goal-dtrected summanzahon includes the pro- 
cessmg of the structural reformation This includes 
the concept level goal detechon using thesaurus, 
document structure, and structural mformahon m 
the titles (sechon, subsechon ) 
5 Experiments- 
The first experiment is summary for 13,562 news- 
paper arhcles and 62 monthly market survey report 
arhcles Both texts are m Japanese The calcu- 
lated extrachon rates based on the total number of 
I 
I 
| 
|, 
! 
! 
I, 
II 
i 
if 
i 
i 
I 
Extract\]c 
Rate 
100% 
90% 
80% - 
70% - 
60% - 
50% - 
40% - 
30% - 
20% - 
lo% - 
- 1o% 
o% 
"Total 
Average 
Me&an 
Simple tltle-keyword 
L Number of Rate 
. Art\]des 
' '..:i " 2 3 2% 
2 3 2% 
e 9 7% 
5 81% 
7 11 3% 
3 4 8% 
11 17 7% 
12 21.0% 
8 13 0% 
4 6 5% 
0 0% 
1 1 6% 
62 49~ 100% 
43% 
Infonnatlveness Se- 
lectlon 
Number of Rate 
Artxdes 
o 0% 
o 0% 
o o% 
o o% 
o 0% 
1 I 16% 
,3 4 8% 
0 0% 
5 8 O% 
10 16 1% 
42 67.7% 
1 1 6% 
62 11~ lOO% 
7% 
Table 2 Extractmn rates of computer business sur- 
vey reports 
Method Average extraction rates 
Informativeness selection 8% . 
Simple tltle-keyword 41% 
Simple ffrequency-keyword .33% 
Table 3 Average Extractmn rates of Enghsh news 
art\]des 
characters I are hsted m Table 1 
On average, the length of a summarized text by 
this system shows 50% of the length by the snnple 
t\]tle-keyword method The most frequent compres- 
sion rate m the results of the rumple tltle-keyword 
method Is 100% (the entire text) By using the m- 
formatwe selectmn, the rate falls between 20% to 
30% 
Table 2 hats the results of the computer business 
survey reports In thin case, the differences between 
the rates are larger than the newspaper results The 
text of these business reports \]s longer than the 
newspaper articles 
These experiments are mostly of Japanese docu- 
ments Only a few results~ for Enghsh documents 
are avadable Table 3 hsts the results of the extract- 
mg summaries of Enghsh news articles In thin case, 
the extractmn rates are calculated based on the to- 
tal number of words 2 The nature of this system 
makes evaluating the contents dd~cult and no clear 
solutmn can be obtained 
The evaluation methods m (Salton and Allan, 93) 
and (Kuplec eta l, 95) apphed to their system are 
using only intrinsic lnformatmn m a source text 
Salton measures the smnlar\]ty between a summary 
1 charc~cte~s tn a ~ummar~ 
c~Gro, ct~r$ tn G te~rt 2 tuoFG,g t~l. G sulrnlrltQr~ 
words $n a t~t 
and an omgmal text Kuplec compares extracts with 
manually coded summaries If the priority of refor- 
mation of a text is equal and mformatweness can be 
Calculated umformly, these evaluations are statable 
However, a priority m affected by the context 
Detenmnmg the appropnatenees of the results 
was difficult Thus, the extracts were randomly cho- 
sen and the inappropriateness was analyzed for 87 
newspaper articles 11 market report articles 
Obvious errors were found m 17 summaries (16 
news articles, one report ) These errors were mainly 
caused by the fadure of synonyms .of the tltle- 
keywords and words m a sentence (e x, dead body, 
and corpse) to match The other summaries in- 
cluded enough reformation to extrapolate the con- 
tents of the or.lgmal texts Thus, 80% of the sum- 
mattes contained enough reformation to serve as a 
preview 
In a news article, the leading paragraph should 
be a good summary of the article Therefore, the 
extracts of thin system and the lead paragraphs of 
news articles were compared Among all news ar- 
ticles, 70% of extracts from fins system included 
sentences from lead paragraphs and 50% of the ex- 
tracts included only the lead paragraphs Thus, the 
system algorithm naturally selected more sentences 
from lead paragraphs than other parts of a news ar- 
ticle 
Next, the appropriateness and compactness of the 
text between the lead paragraphs and extracts of 
tins system were compared the news data Inap- 
propriate results were found to be 4% higher m the 
extracts Double the number of extracts were more 
compact than the lead paragraph All of the report 
data of the extratlts were shorter than the leading 
paragraphs Thus, extracts from this system are re- 
garded as being better than leading paragraphs 
In the expemnent described above on news arti- 
cles, the goals were taken from the headlines and 
titles Also, some external source can serve as the 
goals of a summary If summaries are used to com- 
pare the text contents, text properties (such as tf tdf 
scores) can be used to create the goals of the sum- 
mary 
For example, the extracts wall include &stmctlve 
reformation \]f words with high tf ldf scores are gwen 
The extracts wall show the common mformatxon of 
text \]f words with high document frequencies are 
given Figure 3 shows the results of fins experiment 
using small number of the specfllcatlons documents 
of hard dmk drives 
As shown m Figure 3(a), the high tf \]df words de- 
terlmne the sentences describing the dmtmctlve fea- 
tures of the hard disk that are to be selected Figure 
3(b) shows that the words with high document fre- 
quencies are used to select the common reformation 
about the general specfficatmns 
49 
(a) .Eztrochon by tf ,dr property 
Words w,th h,gh t ff sdJ scores 
DEs, DMs, F6632A, H, path configuratmn, 
MB, GB, path, RANK, F6493, F6429G 
Summary by the hsgh t f sdy words 
Flemble configuration The F1700B has a four 
path configuration (connechon path to a mag- 
nehc chsks) as a standard feature 
In ad&tion, m the F1700B, the path to the 
channel and the paths to the magnetic dmk 
unit can be increased independently, so a flex- 
ible configuration can be found to smt the sys- 
tem environment 
High speed data transfer Data transfer rate be- 
tween host is high speed 3 0 MB/sec or 4 5 
MB/sec F1700B + F6425G/H, or F6427G/H, 
or F6429G/H has to be sold as a subsystem 
(b) Extractson by document frequency property 
Words tosth the hsghest document frequency . 
table, page, m3, contents, width, weight, tem- 
perature, power consumption, KVA, height, 
heat chsmpation, frequency, dunenston, depth, 
mr flow 
Summary by the hsgh df words 
Width 1,040 
Dtmenaon(mm) Depth 815 Height 1,690 
Weight (Kg) 
Frequency 50/60Hz +/- I0 
1 6(2 2) Heat chsslpatIon ( ) includes 512MB 
cache 
780(1,240) 1,240(1,700) 1,320(1,780) 
930(1,400) 1,240(1,700) A\]x tio, w(m3/nnn) 
Temperature 15 - 32 degrees cenhgrade (When 
controlled) Environment 
Figure 3 Summary examples using the properties 
'~f0~ the text classification. " - 
6 Discussion 
This experiment only demonstrates a small part of 
goal-directed summarization. Many subjects still 
need to be tested 
1 Using of the thesaurus 
Most fmlures in processing news articles were 
caused by synonyms (such as 'corpse' and 'dead 
body', 'fishery' and 'fisherman') to be matched 
Most of these errors can be corrected by using 
the thesaurus 
2 Processing the structured goals 
To summarize structured documents (such as 
manuals) the hierarchical structure of the sec- 
tions and subsections can be used to create 
goals These goals may control the inheritance 
of sub-goals to be satisfied m the substructure 
(such as, the 'preface' section ) 
3 Resolving the anaphonc expression 
Fewer problems than the English sentence ex- 
trachon occurred, because Japanese text was 
mostly the subject of experiment and the text 
less contains the anaphonc expression 
However, person and company names In news 
articles are often abbreviated and shortened 
Resolving these, abbreviated and shortened ex- 
presslons are needed to Increase readablhty 
4 Control of the summary length 
Because the mare purpose of this system is to 
offer concme information for prevlewmg docu- 
ment contents, the length of output cannot be 
directly controlled If the length needs to be 
varmd, some methods to extend the results may 
be added as post-pr0cessing The method to 
find sentence relations (such as leeyacal cohesion) 
may be suitable to find sentence chmns with re- 
lated topics 
5 Evaluation method 
The evaluation of extracts cannot be simply de- 
fined Extracts cannot be evaluated without 
context For objective evaluation, measuring 
the effect (e x, the time of prrevlewmg) may be 
realistic 
7 Conclusion 
This report is about the sentence extraction experi- 
ment using the 'informativeness' evaluation method 
The evaluation of the extracted summaries shows the 
system selects smaller sets of sentences than the sim- 
ple title-keyword method without losing reformation 
content Enough Information is extracted for pre- 
viewing document contents 
The cu~ent system may be too simple to be re- 
garded as a 'goal directed' However, this exper- 
iment shows, the efficiency of the generated sum- 
maries is improved, even when a snnple words list 
IS used as the goal of the selection process m the 
system 

References 
Juhan Kuplec, Jan Pedersen and Francme Chen 1995 
A Ttmnable Document Summarizer, In ACM SI- 
GIR'95, pages 68-73 

Chrm D Pmce 1990 Constructing hterature Abstracts 
by Computer Techmques and Prospects In\]ormatson 
Processing f~ Management, Vol 26, No 1, pages 171.- 
186 

Geraxd Salton and James Allan 1993 Selective Text 
Utilization and Text Traversal In Hyperte~t'93, pages 
131-143 
