The Automated Acquisition of Topic Signatures for Text 
Summarization 
Chin-Yew Lin and Eduard Hovy 
Information S(:i(umes Illstitute 
University of Southern California 
Marina del Rey, CA 90292, USA 
{ cyl,hovy } C~isi.edu 
Abstract 
In order to produce, a good summary, one has 
to identify the most relevant portions of a given 
text. We describe in this t)at)er a method for au- 
tomatically training tel)it, signatures--sets of related 
words, with associated weights, organized around 
head topics and illustrate with signatm'es we cre- 
;tt.ed with 6,194 TREC collection texts over 4 se- 
lected tot)ics. We descril)e the l)ossible integration 
of' tolli(: signatures with ontoh)gies and its evaluaton 
on an automate(l text summarization system. 
1 Introduction 
This t)aper describes the automated (:reation of what 
we call topic signatures, constructs that can I)lay a 
central role. in automated text summarization and 
information retrieval. ToI)ic signatures can lie used 
to identify the t)resence of a (:omph~x conce.pt a 
concept that consists of several related coinl)onents 
in fixed relationships. \]~.c.sta'uvant-'uisit, for examph~, 
invoh,es at h,ast the concel)ts lltCgFIt, t'.(tt, pay, and 
possibly waiter, all(l Dragon Boat PcstivaI (in Tat- 
wan) involves the Ct)llC(!l)t,S cal(tlzt'lt,s (a talisman to 
ward off evil), rnoza (something with the t)ower of 
preventing pestilen(:e and strengthening health), pic- 
tures of Ch, un9 Kuei (a nemesis of evil spirits), eggs 
standing on end, etc. Only when the concepts co- 
occur is one licensed to infer the comph:x concept; 
cat or moza alone, for example, are not sufficient. At 
this time, we do not c.onsider the imerrelationships 
among tile concepts. 
Since many texts may describe all the compo- 
nents of a comI)lex concept without ever exI)lic- 
itly mentioning the mlderlying complex concel/t--a 
tol)ic--itself, systems that have to identify topic(s), 
for summarization or information retrieval, require 
a method of infcu'ring comt)h'x concelltS fl'om their 
component words in the text. 
2 Related Work 
In late 1970's, \])e.long (DeJong, 1982) developed a 
system called I"tIUMP (Fast Reading Understand- 
ing and Memory Program) to skim newspaper sto- 
ries and extract the main details. FRUMP uses 
a data structure called sketchy script to organize 
its world knowh'dge. Each sketchy script is what 
FRUMI ) knows al)out what can occur in l)articu- 
lar situations such as denmnstrations, earthquakes, 
labor strike.s, an(t so on. FRUMP selects a t)artic- 
ular sketchy script based on clues to styled events 
in news articles. In other words, FRUMP selects an 
eml)t3 ~ t(uni)late 1 whose slots will be tilled on the fly 
as t"F\[UMP reads a news artMe. A summary is gen- 
erated })ased on what has been (:al)tured or filled in 
the teml)Iate. 
The recent success of infornmtion extractk)n re- 
search has encore'aged the FI{UM1 ) api)roach. The 
SUMMONS (SUMMarizing Online News artMes) 
system (McKeown and Radev, 1999) takes tem- 
l)late outputs of information extra(:tion systems de- 
velofmd for MUC conference and generating smn- 
maries of multit)le news artMes. FRUMP and SUM- 
MONS both rely on t/rior knowledge of their do- 
mains, th)wever, to acquire such t)rior knowledge 
is lal)or-intensive and time-consuming. I~)r exam-- 
l)le, the Unive.rsity of Massa(:husetts CIRCUS sys- 
l.enl use(l ill the MUC-3 (SAIC, 1998) terrorism do- 
main required about 1500 i)erson-llours to define ex- 
traction lmtterns 2 (Rilotf, 1996). In order to make 
them practical, we need to reduce the knowhxlge en- 
gineering bottleneck and iml)rove the portability of 
FI{UMI ) or SUMMONS-like systems. 
Since the worhi contains thousands, or perhal)s 
millions, of COml)lex (:on(:et)ts , it is important; to be 
able to learn sketchy scripts or extraction patterns 
automatically from corpora -no existing knowledge 
base contains nearly enough information. (Rilotf aim 
Lorenzen, 1999) 1)resent a system AutoSlog-TS that 
generates extraction i)atterns and learns lexical con- 
straints automatically fl'om t)rec\]assified text to al- 
leviate the knowledge engineering I)ottleneck men- 
tioned above. Although Riloff al)plied AutoSlog-TS 
l\Ve viewed sketchy s(:lil)tS and teml)lates as equivalent 
(ollstrllctS ill the sense that they sl)ecil ~, high level entities 
and relationships for specific tot)its. 
2Aii extra(:l;iOll pattt!rlk is essentially ;t case fraine contains 
its trigger word, enabling conditions, variable slots, and slot 
constraints. CIRCUS uses a database of extraction patterns 
to t~alSe texts (l{ilolI', 1996). 
495 
to text categorization and information extraction, 
the concept of relevancy signatures introduced by 
her is very similar to the topic si.qnatures we pro- 
posed in this paper. Relevancy signatures and topic 
signatures arc both trained on preclassitied docu- 
ments of specific topics and used to identify the 
presence of the learned topics in previously unseen 
documents. The main differences to our approach 
are: relevancy signatures require a parser. They are 
sentence-based and applied to text categorization. 
On the contrary, topic signatures only rely on cor- 
pus statistics, arc docmnent-based a and used in text 
smnmarization. 
In the next section, we describe the automated 
text smmnarization system SUMMARIST that we 
used in the experiments to provide the context of 
discussion. We then define topic signatures and de- 
tail the procedures for automatically constructing 
topic signatures. In Section 5, we give an overview 
of the corpus used in the evaluation. In Section 6 we 
present the experimental results and the possibility 
of enriching topic signatures using an existing ontol- 
ogy. Finally, we end this paper with a conclusion. 
3 SUMMARIST 
SUMMARIST (How and Lin, 1999) is a system 
designed to generate summaries of multilingual in- 
put texts. At this time, SUMMARIST can process 
English, Arabic, Bahasa Indonesia, Japanese, Ko- 
rean, and Spanish texts. It combines robust natural 
 processing methods (morl)hologieal trans- 
formation and part-of-speech tagging), symbolic 
world knowledge, and information retrieval tech- 
niques (term distribution and frequency) to achieve 
high robustness and better concept-level generaliza-- 
tion. 
The core of SUMMARIST is based on the follow- 
ing 'equation!: 
summarization = topic identification + 
topic interpretation + generation. 
These three stages are: 
Topic Identifieatlon: Identify the most imtmrtant 
(central) topics of the texts. SUMMARIST 
uses positional importance, topic signature, and 
term frequency. Importance based on discourse 
structure will be added later. This is tile most 
developed stage in SUMMARIST. 
Topic Interpretation: ~i~-) fllse concepts such as 
waiter, menu, and food into one generalized 
concept restaurant, we need more than the sin> 
pie word aggregation used in traditional infor- 
mation retrieval. We have investigated concept 
aWe would like to use only the relevant parts of documents 
to generate topic signatures in the future, qkext segmentation 
algorithms such as TextTiling (Ilearst, 1997) can be used to 
find subtopic segments in text. 
ABCNEWS.cona : Delay in Handing Flight 990 \['robe to FBI 
N'I'SI3 Cllaitnlan JarlleS tlall says Egyptian clfficials Iv811l to I,~view restllts 
of tile investigation intcl lhe crasll of llggyptAir Flight 990 before tile case 
i~ lurlled over Ic, tile Fi31, 
Ntlv. IG - U S. ilxvestigl~lo\[~ lLppear to be leatlillg iiIore thgll eveF low~trd 
tile possibility that one of the cc~-pilot~ of EgyptAir Flight 990 may have 
de\[iheralely crashed tile plane last Ilaflllth, killing all 217 people on board. 
flail'ever. US. officials say tile National Tran~por'tation Safety Board will 
delay transferring tile invegtigalion of the Oct 31 crash to tilt: FI31 - the 
agency that wotlid lead i~ criminal probe - for at least tt few days. to MIow 
Egyptian experts to review evidence ill tile case. 
gttsl)iciotl~ of foul play were raised after investigators listening to rt tape 
ftolll lilt cockpit voice recorder isolated a religious prayer or statelllellt 
made by tile co-pilot just before tile plane's autopilot was turned off 
slid the plane began its initial plunge into tile Atlantic Ocean off Mas- 
srtchttsett$' Nalltucket \[sialld. 
Over tile' past week. after mucil effort, tile NTSJB and tile Navy succeeded 
ill Iocatillg the plane's two "black boxes," th~ cockpit voice recorder and 
lhe flight data recorder. 
The tape indicates tllat shortly after the plane leveled ~ff at its cruising 
altitude of as,000 feet, tile cl~ief pilot of tile aircraft left the plane's 
cockpit, leaving one of tile twc~ co-pilots nIolle tilere as the aircraft began 
its descent. 
Figure 1: A Nov. 16 1999 ABC News page sumnmry 
generated by SUMMARIST. 
counting and topic signatures to tackle tile fll- 
sion problem. 
Summary Generation: SUMMARIST can pro- 
duce keyword and extract type summaries. 
Figure 1 shows an ABC News page summary 
about EgyptAir Flight 990 by SUMMARIST. SUM- 
MARIST employs several different heuristics in tile 
topic identification stage to score terms and sen- 
tences. The score of a sentence is simply the sum 
of all the scores of content-bearing terms in the sen- 
tence. These heuristics arc implemented in separate 
modules using inputs from preprocessing modules 
such as tokenizer, part-of-speech tagger, morpholog- 
ical analyzer, term frequency and tfidf weights cal- 
culator, sentence length calculator, and sentence lo- 
cation identifier. \Ve only activate the position mod- 
ule, tile tfidfmodule, and the. topic signature module 
for comparison. We discuss the effectiveness of these 
modules in Section 6. 
4 Topic Signatures 
Before addressing the problem of world knowledge 
acquisition head-on, we decided to investigate what 
type of knowledge would be useflfl for summariza- 
tion. After all, one can spend a lifetime acquir- 
ing knowledge in just a small domain. But what 
is tile minimum amount of knowledge we need to 
enable effective topic identification ms illustrated by 
the restaurant-visit example? Our idea is simple. 
We would collect a set of terms 4 that were typi- 
cally highly correlated with a target concept from a 
preclassified corpus such as TREC collections, and 
then, during smnmarization, group the occurrence of 
the related terms by the target concept. For exam- 
pie, we would replace joint instances of table, inert'u, 
waiter, order, eat, pay, tip, and so on, by the single 
phrase restaurant-visit, in producing an indicative 
4Terms can be stemmed words, bigrams, or trigrams. 
496 
sulnlllary. \Ve thus defined a tot)it signat.ure as a 
family of related terms, as follows: 
~I'S = { topic, sifl~zutu.rc. } 
= {topic,< (t,,wl),...,(t,,,w,,) >} (1) 
where topic is the target concet)t and .,d.q)zat~Lrc is 
a vector of related ternls. Each t, is an term ldghly 
correlated to topic with association weight w/. The 
number of related terms 7z can tie set empirically 
according to a cutot\[ associated weight. We. describe 
how to acquire related terms and their associated 
weights in the next section. 
4.1 Signature Term Extraction and Weight 
Estimation 
()n the assumption that semantically related terms 
tend to co-occur, on(' can construct topic signa- 
tures fl'om preclassified text using the X 2 test, mu-. 
tual information, or other standard statistic tests 
and infornlation-theoreti(: measures. Instead of X '2, 
we use likclih.ood ratio (Dunniug, 1993) A, sin(:e A 
i,; more apI)rot)riate for si/arse data than X 2 test 
and the quantity -21o9A is asymi)t(/tically X ~ dis- 
tril)ute(15. Therefore, we Call (leterndnc the (:onti- 
(lence level for a specific -21o9A value l/y looking ut) 
X :~ (tistril)ution table and use tlm value to sel(,,ct an 
at)i)rot)riate cutoff associated weight. 
We have documents l)\['e.classitied into a :;('~t, "R. of 
relevant texts and a set ~. of nonrelewmt texl;s for a 
given topic. Assuming the following two hyl)othe,'~es: 
ttypothesis 1 (Ifl): t'(~Pvlti) = P = P('PvltT/), i.e. 
the r(.,lewmcy of a d()(:|lment is in(teI)en(hmt, of 
ti. 
I\]\[ypothesis 2 (tt2): I'('Pv\[ti) == lh ~ 1)'2 - 
t)('Pvlt, i), i.e. th(~ t)r(.':;(;n(:(~ of t i indi(:~Lt(.~.'; strong 
r(~levan(:y ~ssunling \]h >> 1)2 • 
and the following 2-10=2 contingency tabl(;: 
where Ol~ is the fiequency of term ti occurring in 
the. l'elev;tnt set, 012 is the \[r(!qu(nlcy of Lerm t i t)c- 
curring in the \]lollrelevallt, set, O21 is the fle(lllell(:y 
of tt;rnl \[i¢ ti occurring in the rtdevant set, O._,~ is 
the fl'equ(mcy of term l.i ¢ ti o(:curring in the non- 
l'elevaiit seL. 
-kssmning a l)inomial distril)ution: 
C;) b(~; ,,., :/.) = :,:~(1 - .~:)(" ") (2) 
5This assumes |hal the ratio is between the inaximuni like> 
\[ihood est, im&t.(! over a .qll})part of l;}l(! i)alatlllC't(~r sl)a(:(~ ;tll(\] l.h(! 
lllaxillUllll likelihood (}sI.i|II}tlA~ ov(!r the (Hltill! i)al'aillt~tt!r si);t(:e. 
Set! (Manning ;tnd Sch/itze, I999) t)ag, es 172 l.o 175 for d(!t.ails. 
then the likelihood for HI is: 
L(H~) = b(Ot~; 0~ + Ou,,p)b(O:,~; 0:,, + Om,,p) 
and for //2 is: 
L(H2) = D(OI 1; O11 Jr" (')12, Pl )b(O21; (')21 Jr- (,)22,1)2) 
The -2log,\ value is then computed ms follows: 
1.(f/1 ) 
m --21o 9 -- 
L(it 2 ) 
b(O 11 ; OI 1 + O12, P)IJ(021 : O21 + 022 , P) --21o 9 
1'((-)1 l ; ()11 + O1'-), P I )h(O21 ; O21 q- ()22 , P2 ) 
: --2((Oll +021)lor_Jp+(()12+022)lo9(1--l~)-- (,~1) 
(¢)lllo'JPl+Ol21og(l "t'1)+0211ogp2-~0221o0(1-f~2))) 
-- '.2.,~' x (~'i(7~)- ;~(~19-)) (4) 
= 2,'v x Z(P~; T) (5) 
whel'e N = Olt -F O12 -1- O21 -I- 022 is the total llum-. 
her of t, ernl occurrence, in the corpus, 7/('/~) is the 
entropy of terms over relevant and nonrelevant sets 
of documents, 7/('felt ) is the entropy of a given term 
OVel" relev;inL ~/nd nonl'(.qevallt sets of doellinellLS, ~tll(1 
Z('R.; T) i:; the inutual information between docu- 
ment relevancy and a given t('.rm. Equation 5 indi- 
cates that mutual inforntation 6 is an e(tuiwdent mea- 
sur(.' t() lik(.qiho(id ratio when we assume a binomial 
distribution and a 2-by-2 ('ontingency table. 
To crest(; topic .~dgnature for a given tot)ic , we: 
1. (:lassify doctunents as relevant or nonrclcwmt 
according t() tile given topic 
2. comt)ut.e the -21oflA wdue using Equation 3 for 
each Lcrm in the document colle(:Lion 
"{. rank t, erms according 1o their -2lo9~ value 
4. select a c(mfid(mce le, vel fi'om the A;: (listril)utiotl 
table; (letermin(~ the cutotf associated weight, 
mid the numl)(n" of t(nms to he included iIl the 
signatures 
5 The Corpus 
The training data derives Kern the Question and 
Answering summary evahmtion data provided l)y 
TIPSTEI/.-SUMMAC (Mani et al., 1998) that is a 
sttbset of the TREC collectioliS. The TREC data is a 
collection of texts, classified into various topics, used 
for formal ewduaLions of information retrieval sys- 
tems in a seri(~s of annual (:omparisons. This data set: 
contains essential text fragnients (phrases, (:Iausos, 
iuld sentences) which must 1)e included in SUllltIlarios 
to ~tnswer some TI{EC tel)its. These fl'agments are 
each judged 1)y a hmnan judge. As described in Se(:- 
tion 3, SUMMAI~IST employs several independent 
nlo(hlles to assign a score to each SelltA:llCe~ and Chell 
COlll})illeS the st.'or(.'.% L() decide which sentences to ex- 
tract from the input text;. ()n0. can gauge the efticacy 
(>l'he lllll\[lla} inrormalion is defined according to chapter 2 
of ((;over and Thomas, i991) and is not tile i)airwis(~ mutual 
inforlnalion us(!d in ((;hur(:h and llanks, 1990). 
497 
TREC Topic Da~crlption 
(nunQ Number: 151 
(title} Topic: Co, ping with overcrowded prisons 
(dese} Deserilltioll: 
The doeullaent will provide inf,~rnlation ol~ jail and prison overcrowdiuK 
and how irlmates are forced to cope with th,~se conditions; or it will 
reveal plan~ to relieve tile overcrowded ¢ollditlon. 
(nart) Narrative: 
A relevant docunaent will describe scene~ of overcro~vdillg that have 
becolne all too crlllllllOll ill jails and prisons arottnd the country, Tile 
document will identify how inmates are forced to cope with those over- 
crowded condition~, and/or what tile Cclrreetional Systelll is doing, or 
phlnning to do, to alleviate tile crowded collditioll. (/top) 
Test Questions 
QI Me'hat are name and/or location of tile correction faeililies 
where the reported ~vercrowding exists? 
Q2 x~Vhat negative experiences have there been at tile overcrowded 
facilities (whether or not tile)" are thought to have been caused 
by the overcrowdlng)? 
Q3 What measures have been taken/plaiailed/recommended (etc.) 
to aeconllnod~te more illlllaIes zlt penal facilities, e.g., doublillg 
tip, Ile~y COllStructlon? 
Q,I ~,Vhat measures have been taken/planned/rec~mnlel,ded (etc.} 
to reduce tile lltllllber of Dew illli\]ate$, e.g., moratoriums 
on admisMon, alternative penalties, programe to reduce 
crime/recldivism? 
Q5 What measures have been taken/planned/recommended (etc.) 
to reduce tile number of existing inmates at an overcrc~wded 
facility, e.g.. granting early release, trnnsfering to uncrowded 
facilities? 
Sample Answer Keys 
(DOCNO) AP891027-0063 (/DOCNO) 
(FILEID) AP-NR- 10-27-89 0615EDT(/FILEID) 
(IST_LINE)r a PM-ChainedInmates 10-27 0335(/IST.LINE) 
(2ND-LINE)PM-Chained lnmates,0344 (/2ND_LINE) 
(IIEAD) lnmates Chained to 1.Vails in 13Mtimore Police 
Stations(/llEAD) 
(DATELINE)BALTIMOITIE (AP) (/DATELINE} ('tEXT) 
(Q,q)prisoner~ are kept chained to the wall~ of local police lcJekup~ for 
as long as three days at a tln~e I)ecattse of overcrowding ill regular jell 
cells, police said.(/Q3} 
Overcrowding at the (Q1)lqMtlrnore County Detention Center(/Q1) 
h~ forced pnllee tn .. (/TEXT) 
Table 1: TREC topic description for topic 151, test 
questions expected to be answered by relewmt doc- 
uments, and a smnple document with answer key, s. 
of each module by comparing, for ditferent amounts 
of extraction, how many :good' sentences the module 
selects by itself. We rate a sentence as good simply 
if it also occurs in the ideal human-made extract, 
and measure it using combined recall and precision 
(F-score). We used four topics r of total 6,194 doc- 
uments from the TREC collection. 138 of them are 
relevant documents with TIPSTER-SUMMAC pro- 
vided answer keys for the question and answering 
evaluation. Model extracts are created automati- 
cally from sentences contailfing answer keys. Table 
1 shows TREC topic description for topic 151, test 
questions expected to be answered by relevant doc- 
uments s, and a sample relevant document with an- 
swer keys markup. 
7These four topics are: 
topic 151: Overcrowded Prisons, 1211 texts, 85 relevant; 
topic 257: Cigarette Consumption, 1727 texts, 126 relevant; 
topic 258: Computer Security, 1701 texts, 49 relevant; 
topic 271: Solar Power, 1555 texts, 59 relevant. 
SA relevant: document only needs to answer at least one of 
the five questions. 
6 Experimental Results 
In order to assess the utility of topic signatures in 
text sununarization, we follow the procedure de- 
scribed at the end of Section 4.1 to create topic 
signature for each selected TREC topic. Docu- 
ments are separated into relevant and nom'elevant 
sets according to their TREC relevancy judgments 
for each topic. We then run each document through 
a part-of-speech tagger and convert each word into 
its root form based on the \\h)rdNct lexical database. 
We also collect individual root word (unigram) fi'e- 
quency, two consecutive non-stopword 9 (bigram) fi'e- 
quency, and three consecutive non-stopwords (tri- 
gram) fi'equeney to facilitate the computation of the 
-21ogA value for each term. We expect high rank- 
ing bigram and trigram signature terms to be very 
informative. We set the cutoff associated weight at 
10.83 with confidence level ~t = 0.001 by looking up 
a X 2 statistical table. 
Table 2 shows the top 10 unigrmn, bigram, and tri- 
gram topic signature terms for each topic m. Several 
conclusions can be drawn directly. Terms with high 
-21ogA are indeed good indicators for their corre- 
sponding topics. The -2logA values decrease as the 
number of words in a term increases. This is rea- 
sonable, since longer terms usually occur less often 
than their constituents. However, bigram terms are 
more informative than nnigrant terms as we can ob- 
serve: jail//prison overervwding of topic 151, tobacco 
industry of topic 257, computer security of topic 258, 
and solar e'n, ergy/imwer of topic 271. These mLto- 
matically generated signature terms closely resemble 
or equal the given short TREC topic descriptions. 
Although trigram terms shown in the table, such 
as federal court order, philip morris 7~r, jet propul.. 
sion laboratory, and mobile telephone s:qstem are also 
meaningflfl, they do not demonstrate the closer term 
relationship among other terms in their respective 
topics that is seen in tlm bigram cases. We expect 
that more training data can improve tile situation. 
We notice that the -2logA values for topic 258 
are higher than those of the other three topics. As 
indicated by (Mani et al., 1998) the majority of rel- 
evant documents for topic 258 have the query topic 
as their main theme; while the others mostly have 
the query topics as their subsidiary themes. This 
implies that it is too liberal to assume all the terms 
in relevant documents of the other three topics are 
relevant. We plan to apply text segmentation algo- 
rithms such as TextTiling (Hearst, t997) to segment 
documents into subtopic units. We will then per- 
form the topic signature creation procedure only on 
tile relevant units to prevent inchlsion of noise terms. 
9\,Ve use the stopword list supplied with the SMAIIT re- 
trieval system. 
l°q'he -2logA values are not comparable across ngram cat- 
egories, since each ngraln category has its own sample space. 
498 
Topic 
I :ll h~l alll -21,~gX \]li~lallI -21,,9X 
jail t)3L I)1,1 e()tH~t 2, jail Dit) 27:1 
c+,lllll} .IIJN ~21 eaely le+\]+.~lSt • N,'~ :{t;\] 
,)v,.),'~,,wdln~; :¢12:1.I~ ~tal,. Dl~.,,n 7.1 R72 
illln¢lt," 2:tl 7d5 stal," 1,) i~, ,n,.i 67 ,3(~t~ 
~h+.lif\[ IF, I .ilo ,1:~ 3 fill," l;l t(;2") 
stale 151 9tt~ iail rl%l'lctr~%vI\]lrld I;1 ~\[i 
I}l l~tllil'l II ~" I ";~' C(,tlt I +,l,i"t t{ll.O!)l} 
i+tl-s,,rl 1,17, 3t),i h..al jail 56 tit+ 
Cl)y 133177 pll.,)D ()vcylcl,,ivthll~ 55 37: +, 
,,v,.r,.r,,wd,+d 12N I)t)S i-(*lllt :l\[ facilit 3 52 9o9 
10 Signatttro Torms of Tupic 151 Ovorcrowdod Prisons 
"I'I I~I alIl -21,,~11\ 
f-,It.l~tl c,,ul~ <,ltt,.l -I.', :),;11 
C,,lllp\]y c+,ll~('lll ,\]+'c\[+++" 3,5 12L 
+l,.kali+ i'i)iIi,\[~' +h,'ll\[\[ \[15 121 
~,11 i,) tl;,nk :;.5 L21 
j,)o tl;lllk IH)li5 :~.',.1'21 
pll~C,ll,'r c+)ll:ll~ lail :~5 121 
91:if,. \])ll~t)ll i21)llltl~ ~N t).l\[\] 
t put plis+,ll .2t~ :t-II 
c+~uuly jaiL ~l;ll,. 2t~31 l 
h,,hl l,~e~l jail 2d :l I I 
Topic 10 Sigllnttlre Ternls of Topic 257 - (ligar~tt~ Co|lslll)l|)tlo|l 
l:ni<rtun 21ogX I+i ~.rarll -- 2/,,f/.\ 'i ri4~am - 21,,~A 
clgrtl,.It¢+ .171; \[}:iN ~tlb:xt'c+) LIt(\]ll~ll~ ~il 7)iN I,hilip IIl,)lli+ tjl 2.~ ~DSI 
l()htlcc~) :ll;l 017 hn t-lg/ll,-llt- t;7 t2}I Ir)l\]ilIlalls beDs~,ll h<.(\[~f. 211 ~)t~\[l 
sIIIOkill~ 28.t 19~ philip tll(,l\[i~ 5t ()7;~ \[1111~ L'ikll('e\[ d'+\[llll 22 21.t 
~nl~,ke 15913.1 clarxl<'t1,: %,'at t80.t5 qtt irilln cll~ '.2'1 IIS 
I,~lhlllall¢ 156.)375 tolhlllllll~ itlY,'lIlatlt)llgkl -t.t .13.1 qtl qtt fi~ln 21 -IlS 
,,~ha I.tS 372 tobac¢,) elll()k.~ 112){}I bll b\[i bll 20 22ti 
s,~ila 12)i .121 ~il patrick t0. t55 c+)llstlllll}lloll bn clgar,.lte 2022d 
Illtll 113 ~+1~) cl~at+'ll~" c~lllpallV :ID \[$1)D ~\[t+gtt alll.r\[iCttll ~llI,lk¢'(}llt 20226 
alll,)k('l 10.1 Ii0 ('elll lllalk+'l 36223 \[llll~ Callt'e\[ ht:gl\[I 2(~ 22{i 
b\[~t 79.90:1 ?~IN illt+ll+il++t :1t;.22:1 illa\[ay~iall +illk~\[tl>,ll+e t'4)lllpillIV 2(I.22t\] 
Topic I0 Signatur. "I'~r)ns of Topic 258 -- Co)nputor Security 
I ~llial /lilt "2Ionia It i+/,r alll 21,QIX "I'1 i~ratn --21o9, X 
(:+lllll)lltOr 115!~ :151 C4,lllpIlll't ael'tlllly 213331 )el Iltt/pll\[~i()ll \]ltht)hlt(lly \[~ ~5.t 
virus 927.G7-1 ~\[;idtlgtl," sltl\[tl'lll 17~ 5NN Illh.'ll I lilt) 9R 85,t 
hacker 867.377 FOlllptlt,'t +ySlelll 1-16.32~ C,+ltl++\]l IllliVet~il~,' ~ladll\[tle 7}) IJNI 
in,)l rl+ +i+;+~ 2i13'.! ) l,.~+-arch c,+ulte\[ l;i2 .l I :i lawtellcl" b,'rk,' *'j~ lall,)l al+,l ,,. 79.0N \[ 
c,,rn,'ll 3P+5 6+4 c,:,ml,ut,'r wrus 12~k033 I~+,++, jet ptOl,tlL+iot+ 79.0~1 
unlv+'l¢ity 31)5.95~ corneli UlXiVeleity 1(1~4 7-tl Ul;iV,q+i)y ~;radulxt,. +txldt. lll 79U~1 
+ysl+'lll 290.3"17 Iltl(:l,P;ll %t++npl)ll 107.283 lawtllle,+ liv+:rtn(.te Iigtli()llal i\]\[) l\[I;~ 
I/tb,.ralL:)ly 2N7 521 inilitary t'(,lllp,ll.:r 106.522 liv~qlll,)l¢" ilu) i,maL luboralory {59195 
\[ab 225.51); vitu~ plo~t<'llll 1U6 522 c,)lllpllI(~r S,~CUlily eXpetl 66.19G 
mecLaly 128.515 %vesl ~etlllall 82 2\[0 ~ecu\[it?,' cenl{~\[ 13ethesda -19423 
Topic 10 Signature Terlns of Topic 271 Solar Power 
I'lligtaln --21oqX tiigt ~ltn --2logX "I'r i ~;r hill --21o!~A 
solar -1S-l.315 e,~la~ eltetl4y 2{Di 521 divi~i,m Inullipl,~ acress 31 3-17 
)tlazdit :10Pt 0IY) s<,lal l,tlw,'t 9,1 210 nl,)bil,: l,,l,-ph,,n,. #civic,, 313.17 
le,) 271;.932 ('hri~tiall aid 8fi.211 blillsh It.cllllilll)R}' g\[,,llp 23510 
itJtLiItlll 2.5N.71):"+ l++,a S3'Sl,*III 711 5:{5 elllI}l heiNht llXile 23+5111 
pax+lh,,n 2133 81 I ill++tlllt. Ieit'j)lll)lle (115;l+i I'illllllCilll Ilackill+; IlJdllllll 22i+51(1 
i)(~tltld 12/,121 iti,liunl pl,,j,.cl 112.697 ~l,~llal Inr)hil, + sal,'llite 23 511J 
t,lw~r 12G.35:1 leili <+,,,d- 61.~111 halldlleld IIled~il," t,'leph,,ll,: '23510 
\[,,,,k,,ut 125 .ll3t; scie.llc,, palk ~'>.1 NS{) ill(,hile ~atellll., .v>tetn 23 510 
iil\[lllilSRl 1O9728 ~()llkl t'illlt'l'lltlilIl,l 51t ~5{} Illlltlllvlill igidillnl I>l,lject 23,510 
hc,ydsh,tl 7N :173 l)p slllal ?+1 ;/17 activt- s+,la\[ *ystern 15673 
Tattle 2: Top 10 signat.m'e t.erm.~; of mfigram, bigram, and trigram for fore" TREe t.opics. 
6.1 Comparing Summary Extraction 
Effectiveness Using Topic Signatures+ 
TFII)t", and Bas(,line Algorithms 
In orde)" I() (~vahla(.(~ the (d\[+:ct.iv(,im.~s nf l(>l)i(: .~dgna- 
l;lll'(~S llS(~(\] ill SlllllIIN/l'y (~Xtl'it(:t;iOll, W{, ~ CtIllll)~ll'(~ +flit! 
Sllltllll~tl'y StHII~011('CS ex(~ract,(~d 1)y the tol)ic si~Ilil\[lll'0, 
module+, basulin(.' module, and tfidf lnothll(~s with lm- 
nta'n annot, at(~(l lllo(lo,\] Sllllllll}ll'ios. \VC III(+~}/SIlI'(+ + l;h(; 
l)c'rfl)rmanc(~' using a c()ml)ined umasure of l'n'call (I~) 
and pr(~cisi(m (P), F. F-score is defined by: 
I"-- (1 +H'2)I'l? where 
/3-'P + I~ 
t ) 
2\7.) ,. 
f'~rln 
fVln 
~\,, 
# of .sc,tcncc.~ c:rtratcd th,t olso 
atqwar in. tim model ,s.mn)¢lr!l 
# of sc+lt(!ncc,s i11 tim nlo,h:l .~um.tav!l 
# of ,s('./Itclwcs c:rlv¢lclcd t)1,t ll*c .Sll.Slcln 
rclatic'c iml,ortancc of l~ aml 1:' 
(6) 
(7) 
\Ve as.~um(~ (,(lual importance of re(:all iIIld preci- 
sion aim set H to 1 in our (+,Xl)('+rimtml;s. The Imselitm 
(I)ositi(m) module scores ('at:h S(!llt(':llC{} hy its I)osi- 
ti(>n in the text. The first sent(race gets the high- 
esc s(:ortL the last S(HIt(H1Co the lowest. The l)as(~liIl(~ 
method is eXlmCted to lm ('.f\[ectiv(~ for news gem'e. 
The tfidf module assigns a score t.o a tt++rllI ti at:cord- 
ing to the product; of its flequc, ncy within a dot:- 
lllll(Hlt .j (tfij) and its illV(~I'S(} doctmmnt t?equoncy 
(idfi lo.q ,~). N is the total mmfl)or of document.s 
in the (:()rlms and dfj is the, numl)er of (Io(:HnloAll;.q 
(:OlH:nining term ti. 
The topic sigjlla(.lll(++ module sciliis each ,q(~llt;(H1C(~: 
assigning to ('ach word that occurs in a topic signa- 
(ure thu weigh(, of that, keyword in t.hc' tol)ic signa- 
tltl'tL Eit{'h s(++llt(,+ItC(~ Ill(ill l'(:c(:ive.q a topic signature 
score equal to tlm total of all signature word scores it 
(:Olllailis, normalizc'd 1) 3' the. highest sentence score. 
This s(:ol( 3 indical.es l;h(~ l'(!l(wall(:(~ of l.h(; S(!llt.t~n(:(! to 
t, lw sigmmlre topic. 
SU.~\[.MAt/IST In'oduced (!xttat:ts of tlm samu 
l(~xI.q sui)aralely for each ,,lodul0, for a s(~l'i(,s of ex- 
tracts ranging from ()cX; to 100% of the. original l;(}xI. 
Althottgh many rel<want docttments are avaita})l+, 
for each t01>ic, Ollly SOlll0 o\[ \[h0111 htlv(~ allSWOl kc!y 
499 
markut)s. The mnnber of documents with answer 
keys are listed in the row labeled: "# of Relevant 
Does Used in Training". To ensure we utilize all 
the available data and conduct a sound evaluation, 
we perform a three-fold (:ross validation. We re- 
serve one-third of documents as test set, use the rest 
as training set, and ret)eat three times with non- 
overlapl)ing test set. Furthernmre, we use only uni- 
gram topic signatures fin" evaluation. 
The result is shown in Figure 2 and TaMe 3. We 
find that the topic signature method outperforms 
the other two methods and the tfidfmethod performs 
poorly. Among 40 possibh,, test points fl)r four topics 
with 10% SUmlnary length increment (0% means se- 
lect at least one sentence) as shown in Table 3, the 
topic signature method beats the baseline method 
34 times. This result is really encouraging and in- 
dicates that the topic signature method is a worthy 
addition to a variety of text summarization methods. 
6.2 Enriching Topic Signatures Using 
Existing Ontologies 
We have shown in the previous sections that topic 
signatures can be used to al)I)roximate topic iden- 
tification at the lexieal level. Although the au- 
tomatically acquired signature terms for a specific 
topic seem to 1)e bound by unknown relationships as 
shown in Table 2, it is hard to image how we can 
enrich the inherent fiat structure of tol)ie signatures 
as defined in Equation 1 to a construct as complex 
as a MUC template or script. 
As discussed in (Agirre et al., 2000), we propose 
using an existing ontology such as SENSUS (Knight 
and Luk, 1994) to identify signature term relations. 
The external hierarchical framework can be used to 
generalize topic signatures and suggest richer rep- 
resentations for topic signatures. Automated entity 
recognizers can be used to (:lassify unknown enti- 
ties into their appropriate SENSUS concept nodes. 
We are also investigating other approaches to attto- 
matieally learn signature term relations. The idea 
mentioned in this paper is just a starting point. 
'7 Conclusion 
In this paI)er we l)resented a t)rocedure to automati- 
(:ally acquire topic signatures and valuated the eflk~c- 
tiveness of applying tol)i(: signatures to extract tot)i(: 
relevant senten(:es against two other methods. The 
tot)ie signature method outt)erforms the baseline and 
the tfidfmethods for all test topics. Topic signatures 
can not only recognize related terms (topic identifi- 
(:ation), but grout) related terms togetlmr under one 
target concept (topic interpretation). 'IbI)i(: identi- 
fication and interpretation are two essential steps in 
a typical automated text summarization system as 
we l)resent in Section 3. 
'\]))pic: signatures (:an also been vie.wed as an in- 
verse process of query expansion. Query expansion 
intends to alleviate the word mismatch ln'oblenl in 
infornmtion retrieval, since documents are normally 
written in different vocabulary, ttow to atttomati- 
(ally identify highly e(nrelated terms and use them 
to improve information retrieval performance has 
been a main research issue since late 19611's. Re- 
cent advances in the query expansion (Xu and Croft, 
1996) can also shed some light on the creation of 
topic signatures. Although we focus the ltse of topic 
signatures to aid text summarization in this paper, 
we plan to explore the possibility of applying topic 
signatures to perform query expansion in the future. 
The results reported are encouraging enough to allow us to continue with topic signatures as the vehicle for a first approximation to world knowledge. We are now busy creating a large number of signatures to overcome the world knowledge acquisition problem and use them in topic interpretation.
8 Acknowledgements
We thank the anonymous reviewers for very useful suggestions. This work is supported in part by DARPA contract N66001-97-9538.
[Figure 2 plot omitted: F-measure (y-axis) vs. summary length 0.00-1.00 (x-axis) for all four topics.]

Figure 2: F-measure vs. summary length for all four topics. Topic signature clearly outperforms tf.idf and baseline, except for the case of topic 258, where performance for the three methods is roughly equal.
[Table 3 omitted: for each topic (251, 257, 258, 271), rows for the baseline, tf.idf, and topic signature methods at summary lengths from 10% to 100%; the individual values are not recoverable from the extracted text.]

Table 3: F-measure performance difference compared to the baseline method, in percentage. Columns indicate different summary lengths relative to full-length documents. Values in the baseline rows are F-measure scores. Values in the tf.idf and topic signature rows are performance increase or decrease divided by their corresponding baseline scores, shown in percentage.

References

Eneko Agirre, Olatz Ansa, Eduard Hovy, and David Martinez. 2000. Enriching very large ontologies using the WWW. In Proceedings of the Workshop on Ontology Construction of the European Conference on AI (ECAI).

Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information and lexicography. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL-90), pages 76-83.

Thomas Cover and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons.

Gerald DeJong. 1982. An overview of the FRUMP system. In Wendy G. Lehnert and Martin H. Ringle, editors, Strategies for Natural Language Processing, pages 149-176. Lawrence Erlbaum Associates.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19:61-74.

Marti Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23:33-64.

Eduard Hovy and Chin-Yew Lin. 1999. Automated text summarization in SUMMARIST. In Inderjeet Mani and Mark T. Maybury, editors, Advances in Automatic Text Summarization, chapter 8, pages 81-94. MIT Press.

Kevin Knight and Steve K. Luk. 1994. Building a large knowledge base for machine translation. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-94).

Inderjeet Mani, David House, Gary Klein, Lynette Hirschman, Leo Obrst, Thérèse Firmin, Michael Chrzanowski, and Beth Sundheim. 1998. The TIPSTER SUMMAC text summarization evaluation final report. Technical Report MTR98W0000138, The MITRE Corporation.

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Kathleen McKeown and Dragomir R. Radev. 1999. Generating summaries of multiple news articles. In Inderjeet Mani and Mark T. Maybury, editors, Advances in Automatic Text Summarization, chapter 24, pages 381-389. MIT Press.

Ellen Riloff and Jeffrey Lorenzen. 1999. Extraction-based text categorization: Generating domain-specific role relationships automatically. In Tomek Strzalkowski, editor, Natural Language Information Retrieval. Kluwer Academic Publishers.

Ellen Riloff. 1996. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence Journal, 85, August.

SAIC. 1998. Introduction to information extraction. http://www.muc.saic.com.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-11.