Salience-based Content Characterisation of Text Documents

Branimir Boguraev
Apple Research Laboratories
Apple Computer, Inc.
bkb@research.apple.com

Christopher Kennedy
Department of Linguistics
University of California, Santa Cruz
kennedy@ling.ucsc.edu
Abstract

Traditionally, the document summarisation task has been tackled either as a natural language processing problem, with an instantiated meaning template being rendered into coherent prose, or as a passage extraction problem, where certain fragments (typically sentences) of the source document are deemed to be highly representative of its content, and thus delivered as meaningful "approximations" of it. Balancing the conflicting requirements of depth and accuracy of a summary, on the one hand, and document and domain independence, on the other, has proven a very hard problem. This paper describes a novel approach to content characterisation of text documents. It is domain- and genre-independent, by virtue of not requiring an in-depth analysis of the full meaning. At the same time, it remains closer to the core meaning by choosing a different granularity of its representations (phrasal expressions rather than sentences or paragraphs), by exploiting a notion of discourse contiguity and coherence for the purposes of uniform coverage and context maintenance, and by utilising a strong linguistic notion of salience, as a more appropriate and representative measure of a document's "aboutness".
1 Capsule overviews

The majority of techniques for "summarisation", as applied to average-length documents, fall within two broad categories: those that rely on template instantiation and those that rely on passage extraction.

Work in the former framework traces its roots to some pioneering research by DeJong [7] and Tait [29]; more recently, the DARPA-sponsored TIPSTER programme ([2])--and, in particular, the message understanding conferences (MUC, e.g. [6] and [1])--have provided fertile ground for such work, by placing the emphasis of document analysis on the identification and extraction of certain core entities and facts in a document, which are "packaged" together in a template. There are shared intuitions among researchers that generation of smooth prose from this template would yield a summary of the document's core content; recent work, most notably by McKeown and colleagues (cf. [21]), focuses on making these intuitions more concrete.
While providing a rich context for research in generation, this framework requires an analysis front end capable of instantiating a template to a suitable level of detail. Given the current state of the art in text analysis in general, and of semantic and discourse processing in particular (Sparck Jones, [27] and [28], discusses the depth of understanding required for constructing true summaries), work on template-driven, knowledge-based summarisation to date is hardly domain- or genre-independent.
The alternative framework largely escapes this constraint, by viewing the task as one of identifying certain passages (typically sentences) which, by some metric, are deemed to be the most representative of the document's content. The technique dates back at least to the 50s (Luhn, [17]), but it is relatively recently that these ideas have been filtered through research with strongly pragmatic constraints: for instance, what kinds of documents are optimally suited for being "abstracted" in such a way (e.g. Preston and Williams [23], Rau et al. [25]); how to derive more representative scoring functions (e.g. for complex documents, such as multi-topic ones, Salton et al. [26], or where training from professionally prepared abstracts is possible, Kupiec et al. [15]); what heuristics might be developed for improving readability and coherence of "narratives" made up of discontiguous source document chunks (Paice [22]); or with optimal presentations of such passage extracts, aimed at retaining some sense of larger and/or global context (Mahesh [18]).

The cost of avoiding the requirement for a language-aware front end is the complete lack of intelligence--or even context-awareness--at the back end: the validity, and utility, of sentence- or paragraph-sized extracts as representations of the document content is still an open question (Rau [24]), especially with the recent wave of commercial products announcing built-in "summarisation" (by extraction) features (Caruso [4]).¹
In this work, we take an approach which might be construed as striving for the best of both worlds. We use linguistically-intensive techniques to identify highly salient phrasal units across the entire span of the document, capable of functioning as topic stamps. The set of topic stamps, presented in ways which both retain local and reflect global context, is what we call salience-based content characterisation, or a capsule overview, of the document.

¹ Also at http://www.nytimes.com/library/cyber/digicom/012797digicom.html

A capsule overview is not a summary, in that it does not attempt to convey document content as a sequence of sentences. It is, however, a semi-formal (normalised) representation of the document, derived after a process
of data reduction over the original text. Indeed, by adopting finer granularity of representation (below that of sentence), we consciously trade in "readability" (or narrative coherence) for tracking of detail.² In particular, we seek to characterise a document's content in a way which is representative of the full flow of the narrative; this is in contrast to passage extraction methods, which typically highlight only certain fragments (an unavoidable consequence of the compromises necessary when the passages are sentence-sized).

A capsule overview is not a fully instantiated meaning template either. A primary consideration in our work is that content characterisation methods apply to any document source or type. This emphasis on domain independence translates into a processing model which stops short of a fully instantiated semantic representation. Similarly, the requirement for efficient, and scalable, technology necessitates operating from a shallow syntactic base; thus our procedures are designed to circumvent the need for a comprehensive parsing engine. Not having to rely upon parsing components typically seeking to deliver in-depth, full, syntactic analysis of text makes it possible to generate capsule summaries for a variety of documents, up to and including real data from unfamiliar domains or novel genres.

For us, a capsule overview is instead a coherently presented list of those linguistic expressions which refer to the most prominent objects mentioned in the discourse--its topic stamps--and provide further specification of the relational contexts in which they appear. The intuitions underlying our approach can be illustrated with the following news article:³
PRIEST IS CHARGED WITH POPE ATTACK

*A Spanish priest* was charged here today with attempting to murder the Pope. *Juan Fernandez Krohn*, aged 32, was arrested after *a man armed with a bayonet* approached the Pope while he was saying prayers at Fatima on Wednesday night.

According to the police, *Fernandez* told the investigators today that *he* trained for the past six months for the assault. *He* was alleged to have claimed the Pope 'looked furious' on hearing the *priest's* criticism of his handling of the church's affairs. If found guilty, the *Spaniard* faces a prison sentence of 15-20 years.
There are a number of reasons why the title, "Priest Is Charged with Pope Attack", is a highly representative abstraction of the core content of the article. It encapsulates the essence of what the story is about: there are two actors, identified by their most prominent characteristics; one of them has been attacked by the other; the perpetrator has been charged; there is an implication of malice to the act. The title brings the complete set of salient facts together, in a thoughtfully composed statement, designed to be brief yet informative. Whether a present day natural language analysis program can derive--without being primed of a domain and genre--the information required to generate such a summary is arguable (this is assuming, of course, that generation techniques could, in their own right, do the planning and delivery of such a concise and information-packed message). However, part of the task of delivering accurate content characterisation is being able to identify the components of this abstraction (e.g., "priest", "pope", "attack", "charged with"). It is from these components that, eventually, a message template would begin to be constructed.

It is also precisely these components, viewed as phrasal units with certain discourse properties, that a capsule overview should locate and present as a characterisation of the content of a text document. Our strategy is to mine a document for the most salient--and, by hypothesis, the most representative--phrasal units, as well as the relational expressions they are associated with, with the goal of establishing the kind of core content specification that is captured by the title of this example.
The remainder of this paper is organised as follows. Given the importance we assign to phrasal identification, we outline in Section 2 the starting point for this work: research on terminology identification, and extending this to non-technical domains. In particular, we focus on the problems that base-line terminology identification encounters when applied to an open-ended range of text documents, and outline a set of extensions required for adapting it to the goal of core content identification. Essentially, these boil down to formalising and implementing an operational notion of salience which can be used to impose an ordering on phrasal units according to the topical prominence of the objects they refer to; this is discussed in Section 3. Section 4 illustrates the processes involved in topic identification and construction of capsule overviews by example. We close by positioning this work within the space of summarisation techniques.
2 Phrasal identification for content characterisation

The identification and extraction of technical terminology is, arguably, one of the better understood and most robust NLP technologies within the current state of the art of phrasal analysis. What is particularly interesting for us is the fact that the linguistic properties of technical terms lead to the definition of computational procedures, capable of term identification across a wide range of technical prose, while maintaining their quality regardless of document domain and type. Since topic stamps are essentially phrasal units with certain discourse properties--they manifest a high degree of salience within contiguous discourse segments--we define the task of content characterisation as one of identifying phrasal units with lexico-syntactic properties similar to those of technical terms and with discourse properties which signify their status as "most prominent". In Section 3, we show how these discourse properties are computable as a function of the grammatical distribution of the phrase. Below we discuss the potential of terminology identification for content characterisation.

² A list of topic stamps is, by itself, not a coherent summary; however, by employing appropriately designed presentation metaphors--aiming, overall, to retain contextual cues associated with topic stamps in context--our topic stamps are more contentful than just a list of (noun or verb) phrases. This paper focuses on the linguistic processes underlying the automatic identification and extraction of topic stamps and their organisation within capsule overviews. The issues of the right presentation metaphor and operational environment(s) for use of topic stamps-based capsule overviews are subject of a different discussion.
³ Adapted from an example of S. Nirenburg.
2.1 Technical terminology: strengths and limitations

One of the best defined procedures for technical terminology identification is that developed by Justeson and Katz [10], who focus on multi-word noun phrases occurring in continuous texts. A study of the linguistic properties of these constituents--preferred phrase structures, behaviour towards lexicalisation, contraction patterns, and certain discourse properties--leads to the formulation of a robust and domain-independent algorithm for term identification. Justeson and Katz's TERMS algorithm accomplishes high levels of coverage; it can be implemented within a range of underlying NLP technologies (e.g. morphologically enhanced lexical look-up [10], part-of-speech tagging [5], or syntactic parsing [20]); and it has strong cross-linguistic application (see, for instance, [3]). Most importantly for our purposes, the algorithm is particularly useful for generating a "first cut" towards a broad characterisation of the content of the document.
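The kind of filter the TERMS algorithm applies can be sketched roughly as follows (a minimal illustration of ours, not Justeson and Katz's implementation; the single-letter tag encoding, the function name, and the cutoff values are assumptions). Candidate multi-word terms are spans whose part-of-speech sequence matches ((A|N)+ | (A|N)* N P (A|N)*) N and which recur in the text:

```python
import re
from collections import Counter

def jk_terms(tagged_tokens, min_freq=2, max_len=6):
    """Sketch of a Justeson-Katz-style term filter.

    `tagged_tokens` is a list of (word, tag) pairs, where each tag is one
    of 'A' (adjective), 'N' (noun), 'P' (preposition) or 'x' (other).
    A span is a candidate term if its tag string matches
    ((A|N)+ | (A|N)* N P (A|N)*) N, and it is kept only if it occurs at
    least `min_freq` times.
    """
    words = [w.lower() for w, _ in tagged_tokens]
    tags = "".join(t for _, t in tagged_tokens)
    pattern = re.compile(r"((A|N)+|(A|N)*NP(A|N)*)N")
    counts = Counter()
    for start in range(len(tags)):
        # Multi-word only: spans of at least two tokens.
        for end in range(start + 2, min(start + max_len, len(tags)) + 1):
            if pattern.fullmatch(tags[start:end]):
                counts[" ".join(words[start:end])] += 1
    return [(term, n) for term, n in counts.most_common() if n >= min_freq]
```

Run over a short tagged text, a filter like this surfaces recurring noun compounds such as "stochastic neural net" while discarding spans that do not end in a noun or occur only once.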
Conventional uses of technical terminology are most commonly identified with text indexing, computational lexicology, and machine-assisted translation. Less common is the use of technical terms as a representation of the topical content of a document. This is to a large extent an artifact of the accepted view--at least in an information retrieval context--stipulating that terms of interest are the ones that distinguish documents from each other; almost by definition, these are not the terms which are representative of the "aboutness" of a document.

Still, it is clear that a program like TERMS is a good starting point for distilling representative lists. For example, [10, appendix] presents several term sets: "stochastic neural net", "joint distribution", "feature vector", "covariance matrix", "training algorithm", and so forth, accurately characterise a document as belonging to the statistical pattern classification domain; "word sense", "lexical knowledge", "lexical ambiguity resolution", "word meaning", "semantic interpretation", "syntactic realization", and so forth assign, equally reliably, a document to the lexical semantics domain.

Such lists are representative; unfortunately, they can easily become overwhelming. Conventionally, volume is controlled by promoting terms with higher frequencies. This, however, is a very weak metric for our purposes; it also does not scale down well for texts which are smaller than typical instances of technical prose or scientific articles--such as news stories, press releases, or web pages. The notion of technical term needs appropriate extensions, so that it applies not just to scientific prose, but to an open-ended set of document types and genres. Below we address this issue by discussing how a basic term set can be enriched in order to convey a more refined picture of content.
2.2 Extended phrasal analysis

As noted above, without the closed nature of technical domains and documentation, it is not clear what use can be made of term sets derived from arbitrary texts. Certainly we cannot even talk of "technical terms" in the narrower sense assumed by the TERMS algorithm. The question is whether similar phrase identification technology generates phrase sets which can be construed as broadly characteristic of the topical content of a document, in the same way in which a term set can be viewed as characteristic of the domain to which technical prose belongs. In other words, the question concerns the wider applicability of linguistic processing targeted at term identification, relation extraction, and object cross-classification. Can a set of phrases derived in this way provide a representational base which enables rapid, compact, and accurate appreciation of the information contained in an arbitrarily chosen document? Three problems arise when "vanilla" term sets are considered as the basis for a content characterisation task.

Undergeneration. For a set of phrases to be truly representative of document content, it must provide an exhaustive description of the entities discussed in the document. That is, it ought to contain not just those expressions which satisfy the strict phrasal definition of "technical term", but rather every expression which mentions a participant in the events described in the text. Phrasal analysis must therefore be extended to include pronouns and reduced descriptions, in addition to the more complex nominals which correspond to true technical terms.

Overgeneration. Relaxation of the canonical phrasal definition of technical term leads to information overload. When applied to a document without regard to domain or genre, a system which extracts phrases on the basis of relaxed canonical terminology constraints will typically generate a term set far larger than a user can absorb without cognitive overhead. At the same time, the set may contain several distinct phrasal units which refer to the same discourse object. Without some means of resolving anaphoric relations, these crucial connections will be lost.

Differentiation. Finally, while a list of terms may be topical for the particular source document in which they occur, other documents within the same domain are likely to yield similar, overlapping sets of terms. Unacceptably, this might result in two documents containing the same or similar terms being classified as "about the same thing", when in fact they might focus on completely different subtopics within the general domain they share.

Although we approach these problems in slightly different ways, the solutions are interconnected, and it is their interaction that is crucial to the derivation of capsule overviews from extended phrasal analyses. The exact mechanisms involved in the processing are described in more detail in Section 3; here we outline the modifications and extensions to traditional term identification technology which address the above problems.

First, undergeneration is resolved by implementing a suitable generalisation--and relaxation--of the notion of a term, so that identification and extraction of phrasal units involves a procedure essentially like TERMS [10],
but which results in an extended phrase set, containing an exhaustive listing of the objects mentioned in the text. Second, overgeneration is resolved through reduction of the extended phrase set in two ways. The extended phrase set is transformed, through the application of an anaphora resolution procedure (see Section 3 below, and Kennedy and Boguraev [13], [14]), into a set of expressions which uniquely identify the objects referred to in the text (hereafter a referent set).

However, the data reduction arising from distilling the extended phrase set down to a smaller referent set is still not enough. In order to eliminate cognitive overload for the user, the referent set must be further reduced to a small, coherent, and easily absorbed listing of just those expressions which identify the most important objects in the text. An intuitive and straightforward means of accomplishing this involves ranking the members of the referent set according to a measure of the prominence, or importance, in the text of the objects they refer to. Such a ranking not only provides the basis for identifying topic stamps; it also solves the third problem above, that of differentiation. Although two related documents may instantiate the same term sets, if the documents are concerned with different topics, then the relative importance of the terms in the two documents will differ as a function of differences in use and grammatical distribution. The underlying intuition is that term sets can be differentiated in two ways: lexically, by virtue of containing different terms, or hierarchically, by virtue of the ordering of their members. Ordered term sets, in the latter case, provide distinct characterisations of documents, even if the overall lexical make-up of the term sets is similar. Given a formalised notion of "importance", we can generate a coherent set of topic stamps from an undifferentiated referent set, while overcoming the lack of coherence inherent in unordered term sets.

The challenge, then, is to define a suitable selection procedure, operating over a larger set of phrasal units than that generated by a typical term identification algorithm (including not only all terms, but term-like phrases, as well as their variants, reduced forms, and anaphoric references), making informed choices about the degree to which each phrase is representative of the text as a whole, and presenting its output in a form which retains contextual information for each phrase. The key to normalising the content of a document to a small set of distinguished, and discriminating, phrasal units is being able to establish a containment hierarchy of phrases (which would eventually be exploited for capsule overview presentation at different levels of granularity), and being able to make refined judgements concerning the degree of relevance of each unit, within its own (local) discourse segment. In other words, we need to be able to filter a term set in such a way that those expressions which are most representative of the content of the document are selected as topic stamps. The next section describes the process of constructing exactly this type of "importance-based" ranking by building on and extending a crucial feature of the anaphora resolution procedure used to generate the referent set: salience.
3 Salience-based content characterisation

Salience is a measure of the relative prominence of objects in discourse: objects with high salience are the focus of attention; those with low salience are at the periphery. In an effort to resolve the problems facing a term-based approach to content characterisation, we have developed a procedure which uses a salience feature as the basis for the type of "ranking by importance" of referents discussed above, and ultimately for topic stamp identification. By determining the salience of the members of a referent set, an ordering can be imposed which, in connection with an appropriate choice of threshold value, provides the basis for a reduction of the entire term set to only those terms which identify the most prominent participants in the discourse. This reduced set of terms, in combination with relational information of the sort discussed in the previous section and folded into an appropriate presentation metaphor, may then be presented as a characterisation of a document's content. Crucially, this analysis satisfies the requirements mentioned above: it is concise, it is coherent, and it does not introduce the cognitive overload associated with a full-scale term analysis.

This strategy for scaling up the phrasal analysis provided by standard term identification technology has at its core the utilisation of a crucial feature of discourse structure: the prominence, over some segment of text, of particular referents--something that is missing from the traditional technology for "bare" terminology identification. Below we describe the core details of our technology. First, we explain more concretely what we mean by "segment of text", why segments are important, and how they are determined. Second, we present a method for determining salience which, when applied to arbitrary sets of phrasal units, generates an ordering that accurately represents the relative prominence of the objects referred to in a document. We also describe what linguistic information, available through scalable and robust identification technologies, can be leveraged to inform such a notion of salience. Finally, we give an overview of a linguistic processing environment which, while carrying out these tasks, remains open-ended with respect to the language, domain, style and genre of the texts we want to be able to handle.
3.1 Discourse segmentation

The example in Section 1 illustrates the importance of discourse segmentation. As it happens, the title in this case works as an overview of the content of the passage because the text itself is fairly short. As a text increases in length, the "completeness" of a short description as a characterisation of content deteriorates. If the intention is to use concise descriptions consisting of one or two topical phrases (topic stamps), plus modificational and relational information, as the primary information-bearing units for capsule overviews, then it follows that texts longer than (roughly) one to three paragraphs must be broken down into smaller units or segments.

The approach to segmentation we adopt implements a similarity-based algorithm along the lines of the one developed by Hearst [8], which identifies topically coherent sections of text using a lexical similarity measure. In the final presentation of results, each segment is associated with a concise, phrase-based description of its content without loss of accuracy. The set of such descriptions, ordered according to linear sequencing of the segments in the text, may then be used as the basis for a capsule overview. The problem of content characterisation of a large text, then, is reduced to the problem of finding topic stamps for each segment in the document.
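A similarity-based segmenter in this spirit can be sketched as follows (an illustrative approximation of Hearst's approach only; the block size, the depth cutoff, and all names are assumptions of ours, not details of the component actually used here). Adjacent blocks of tokens are compared with a cosine measure over their bags of words, and gaps where similarity dips sharply relative to the surrounding peaks are proposed as segment boundaries:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def segment_boundaries(tokens, block=20, cutoff=0.15):
    """Return token offsets of proposed segment breaks.

    The text is cut into fixed-size blocks; each inter-block gap gets a
    'depth' score measuring how far similarity drops there relative to
    the highest similarity before and after the gap.  Deep dips are
    topic shifts.
    """
    blocks = [Counter(tokens[i:i + block]) for i in range(0, len(tokens), block)]
    sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
    boundaries = []
    for i, s in enumerate(sims):
        left = max(sims[:i + 1])   # best similarity at or before this gap
        right = max(sims[i:])      # best similarity at or after this gap
        depth = (left - s) + (right - s)
        if depth > cutoff:
            boundaries.append((i + 1) * block)
    return boundaries
```

On a text whose vocabulary shifts abruptly from one topic to another, the single deep dip in inter-block similarity is reported as the segment break.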
3.2 Local salience

As noted in Section 2.2, the set of expressions generated by extended phrasal analysis typically contains a number of anaphoric expressions--pronouns, reduced descriptions, etc.--which must be resolved. Our anaphora resolution algorithm is based on a procedure developed by Lappin and Leass [16], and is described in detail in Kennedy and Boguraev [13], [14]; in essence, it develops an adaptation for deriving reliable interpretation from considerably shallower linguistic analysis of the input. We make the simplifying assumption that every phrase identified by extended phrasal analysis constitutes a "mention" of a participant in the discourse (see Mani and MacMillan [19] for discussion of the notion of "mention" in the context of proper names interpretation). Coreference is represented by equivalence classes of nominals, where each equivalence class corresponds to a unique referent in the discourse. The set of such equivalence classes constitutes the referent set discussed above.
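The equivalence-class representation can be sketched with a small union-find structure (a hypothetical illustration; the class and method names are ours, and in the actual system the coreference links come from the anaphora resolution procedure described below):

```python
class ReferentSet:
    """Coreference as equivalence classes of mentions.

    Each class stands for one unique discourse referent; resolving an
    anaphor merges its class with that of its antecedent.
    """

    def __init__(self):
        self.parent = {}

    def add(self, mention):
        # A new mention starts out as its own singleton class.
        self.parent.setdefault(mention, mention)

    def find(self, m):
        # Follow parent links to the class representative.
        while self.parent[m] != m:
            m = self.parent[m]
        return m

    def link(self, anaphor, antecedent):
        # Merge the anaphor's class into the antecedent's class.
        self.add(anaphor)
        self.add(antecedent)
        self.parent[self.find(anaphor)] = self.find(antecedent)

    def classes(self):
        # The referent set: one list of mentions per unique referent.
        out = {}
        for m in self.parent:
            out.setdefault(self.find(m), []).append(m)
        return list(out.values())
```

Applied to the Section 1 example, linking "Juan Fernandez Krohn" and "he" back to "a Spanish priest" collapses three mentions into a single referent, while "the Pope" remains a class of its own.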
However, anaphora resolution is important not only for reducing the extended phrase set; it also plays a crucial role in the identification of topic stamps. The reason this is so is that it is based on a strict definition of the notion of salience. Roughly speaking, an antecedent for an anaphoric expression is located by first eliminating all impossible candidate antecedents, then ranking the remaining candidates according to a local salience measure, and selecting the most salient candidate as the antecedent. Local salience is a function of how a candidate satisfies a set of grammatical, syntactic, and contextual parameters. Following Lappin and Leass, we refer to these constraints as "salience factors". Individual salience factors are associated with numerical values, as follows:⁴
SENT: 100 iff the expression is in the current sentence
CNTX: 50 iff the expression is in the current discourse segment
SUBJ: 80 iff the expression is a subject
EXST: 70 iff the expression is in an existential construction
POSS: 65 iff the expression is a possessive
ACC: 50 iff the expression is a direct object
DAT: 40 iff the expression is an indirect object
OBLQ: 30 iff the expression is the complement of a preposition
HEAD: 80 iff the expression is not contained in another phrase
ARG: 50 iff the expression is not contained in an adjunct
The local salience of a candidate is the sum of the values of the salience factors that are satisfied by some member of the equivalence class to which the candidate belongs (note that values may be satisfied at most once by each member of the class).
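Using the factor values above, this computation can be sketched as follows (a minimal illustration; representing each mention as the set of factor names it satisfies is a modelling assumption of ours, which by construction enforces the "at most once per member" condition):

```python
# Salience factor weights, copied from the table above.
FACTORS = {"SENT": 100, "CNTX": 50, "SUBJ": 80, "EXST": 70, "POSS": 65,
           "ACC": 50, "DAT": 40, "OBLQ": 30, "HEAD": 80, "ARG": 50}

def local_salience(equivalence_class):
    # An equivalence class is a list of mentions; each mention is the set
    # of factor names it satisfies, so each factor counts once per mention.
    return sum(sum(FACTORS[f] for f in mention)
               for mention in equivalence_class)
```

For example, a subject mention which heads its own phrase in the current sentence and segment satisfies {SENT, CNTX, SUBJ, HEAD, ARG} and contributes 360; a later possessive mention in the same sentence adds a further 215.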
One important aspect of these numerical values is that they impose a relational structure among the salience factors; crucially, as observed by Lappin and Leass, such a structure reflects the relative ranking of the factors. This is justified both linguistically, as a consequence of the role played by the functional hierarchy in determining anaphoric relations (see e.g. Keenan and Comrie [12]), and by experimental results (see Lappin and Leass [16], Kennedy and Boguraev [13], [14] for discussion).

An important feature of local salience is that it is variable: the salience of a referent decreases and increases according to the frequency with which new members are added to the equivalence class to which it belongs. When an anaphoric link is established, the anaphor is added to the equivalence class to which its antecedent belongs, and the salience of the class is boosted accordingly. If a referent ceases to be mentioned in the text, however, its local salience is incrementally decreased.
3.3 Discourse salience

Consider again the news article discussed in Section 1. Intuitively, the reason why "priest" is at the focus of the title is that there are no less than eight references to the same actor in the body of the story (these are marked by italicising them in the example); moreover, these references occur in prominent syntactic positions: five are subjects of main clauses, two are subjects of embedded clauses, and one is a possessor. Similarly, the reason why "Pope" is the secondary object of the title is that he also receives multiple mentions (five), but these references tend to occur in less prominent positions (two are direct objects).

In order to generate such a broad picture of the prominence of referents across a discourse, we maintain a measure of the salience of referents both in the text as a whole, and in the discourse segments in which they occur. This is accomplished through an elaboration of the local salience computation described above, which interprets the same conditions with respect to a non-decreasing discourse salience value.

Local salience, because of its variability, provides a realistic representation of the antecedent space for an anaphor. In contrast, discourse salience reflects the distributional properties of a referent as the text story unfolds. This non-decreasing salience measure underlies a detailed representation of discourse structure which, when overlayed onto the results of discourse segmentation, gives a coherent representation of the topical prominence of particular referents in specific segments of text. Specifically, it becomes the basis for exactly the type of importance-based ranking of referents discussed in Section 2.2. Using this ordering, we define the topic stamps for a segment S to be the n highest ranked referents in S (where n is a scalable value).
⁴ Our salience factors mirror those used by Lappin and Leass, with the exception of POSS, which is sensitive to possessive expressions, and CNTX, which is sensitive to the discourse segment in which a candidate appears.
4 , Example 
The operatxonal coml>onents tocontent charactensaUon 
described here fall m the following categories dmcourse 
segmentation, phrasal analysm (of nonunal expressions 
and relaUons), anaphora resolution and generation of the 
referent set, calculatton of d~scourrse sahence and ranking 
of referents by segment, tdenttficat~on of topic stamps, 
and enriching topic stamps wtth relatwnal context(s) 
Some of the fimct~onahty follows duectly from terau- 
nology ~dent~ficatton, m parttcular, both relation tden- 
tff~cattton and extended phrasal analys~s are camed out 
by running a phrasal grammar over a stream of text to- 
kans tagged for morphologtcal, syntactic, and grammat- 
~ca.l fimct~on, thin ts m addtt~on to a grammar mmmg 
for terms and, generally6 referents (Base level lmgmsttc 
analys~s m pmwded by the LINGSOFT supertagger, \[11\] ) 
The later, more semant~cally-mtensxve algorithms are de- 
scribed m some detail m \[13\] and \[14\] 
We illustrate the procedure by highlighting certain aspects of a capsule overview of a recent Forbes article ([9]). The document is of medium-to-large size (approximately four pages in print), and focuses on the strategy of Gilbert Amelio (Apple Computer's CEO) concerning a new operating system for the Macintosh. Too long to quote here in full, the following passage from the beginning of the article contains the first, second and third segments, as identified by the discourse segmentation component described in Section 3.1 (cf. [8]); in the example below, segment boundaries are marked by extra vertical space.
"ONE DAY everything Bill Gates has sold you up to now, whether it's Windows 95 or Windows 97, will become obsolete," declares Gilbert Amelio, the boss at Apple Computer. "Gates is vulnerable at that point. And we want to make sure we're ready to come forward with a superior answer."
Bill Gates vulnerable? Apple would swoop in and take Microsoft's customers? Ridiculous! Impossible! In the last fiscal year Apple lost $816 million; Microsoft made $2.2 billion. Microsoft has a market value thirty times that of Apple.
Outlandish and grandiose as Amelio's idea sounds, it makes sense for Apple to think in such big, bold terms. Apple is in a position where standing pat almost certainly means slow death.
It's a bit like a patient with a probably terminal disease deciding to take a chance on an untested but promising new drug. A bold strategy is the least risky strategy. As things stand, customers and outside software developers alike are deserting the company. Apple needs something dramatic to persuade them to stay aboard. A radical redesign of the desktop computer might do the trick. If they think the redesign has merit, they may feel compelled to get on the bandwagon lest it leave them behind.

Lots of "ifs," but you can't accuse Amelio of lacking vision. Today's desktop machines, he says, are ill-equipped to handle the coming power of the Internet. Tomorrow's machines must accommodate rivers of data, multimedia and multitasking (juggling several tasks simultaneously).
We're past the point of upgrading, he says. Time to scrap your operating system and start over. The operating system is the software that controls how your computer's parts (memory, disk drives, screen) interact with applications like games and Web browsers. Once you've done that, buy new applications to go with the reengineered operating system.

Amelio, 53, brings a lot of credibility to this task. His resume includes both a rescue of National Semiconductor from near-bankruptcy and 16 patents, including one for coinventing the charge-coupled device.
But where is Amelio going to get this new operating system? From Be, Inc., in Menlo Park, Calif., a half-hour's drive from Apple's Cupertino headquarters, a hot little company founded by ex-Apple visionary Jean-Louis Gassee. Its BeOS, now undergoing clinical trials, is that radical redesign in operating systems that Amelio is talking about. Married to hardware from Apple and Apple cloners, the BeOS just might be a credible competitor to Microsoft's Windows, which runs on IBM-compatible hardware.
The capsule overview was automatically generated by a fully implemented, and operational, system, which incorporates all of the processing components identified above. The relevant sections of the overview (for the three segments of the passage quoted) are as follows:
1  APPLE would swoop in
   take MICROSOFT's customers?
   APPLE lost $816 million
   MICROSOFT made $2.2 billion
   MICROSOFT has a market value
   APPLE is in a position
   APPLE needs something dramatic
2  Today's DESKTOP MACHINES, he says, are ill-equipped
   Tomorrow's MACHINES must accommodate
   scrap your OPERATING SYSTEM
   OPERATING SYSTEM is the software
3  AMELIO brings credibility
   HIS resume includes both
   AMELIO is going to get this NEW OPERATING SYSTEM?
   radical redesign in OPERATING SYSTEMS
   AMELIO is talking about
The division of this passage into segments, and the segment-based assignment of topic stamps, exemplifies a capsule overview's "tracking" of the underlying coherence of a story. The discourse segmentation component recognizes shifts in topic: in this example, the shift from discussing the relation between Apple and Microsoft, to some remarks on the future of desktop computing, to a summary of Amelio's background and plans for Apple's operating system. Layered on top of segmentation are the topic stamps themselves, in their relational contexts, at a phrasal level of granularity.
The first segment sets up the discussion by positioning Apple opposite Microsoft in the marketplace and focusing on their major products, the operating systems. The topic stamps identified for this segment, APPLE and MICROSOFT, together with their local contexts, are both indicative of the introductory character of the opening paragraphs and highly representative of the gist of the first segment. Note that the apparent uninformativeness of some relational contexts, for example "APPLE is in a position", does not pose a serious problem. An adjustment of the granularity, at capsule overview presentation time, reveals the larger context in which the topic stamp occurs (e.g., a sentence), which in turn inherits the high topicality ranking of its anchor: "APPLE is in a position where standing pat almost certainly means slow death".
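This presentation-time granularity adjustment can be sketched as a simple context expansion: given a topic stamp's phrasal context, recover the enclosing sentence. The naive period-based sentence splitting below is purely illustrative; the actual system works from its own linguistic analysis of the source.

```python
def expand_context(stamp_context, text):
    """Expand a topic stamp's phrasal context to the full sentence
    containing it; fall back to the phrase itself if no match."""
    for sent in text.split(". "):
        if stamp_context in sent:
            return sent.strip()
    return stamp_context

text = ("Apple is in a position where standing pat almost certainly "
        "means slow death. Microsoft made $2.2 billion.")
print(expand_context("Apple is in a position", text))
# Apple is in a position where standing pat almost certainly means slow death
```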
For the second segment of the sample, OPERATING SYSTEM and DESKTOP MACHINES have been identified as representative. The set of four phrases illustrated provides an encapsulated snapshot of the segment, which introduces Amelio's views on coming challenges for desktop machines and the general concept of an operating system. Again, even if some of these are somewhat under-specified, more detail is easily available by a change in granularity, which reveals the definitional nature of the even larger context: "The OPERATING SYSTEM is the software that controls how your computer's parts ..."
The third segment of the passage exemplified above is associated with the stamps GILBERT AMELIO and NEW OPERATING SYSTEM. The reasons, and linguistic rationale, for the selection of these particular noun phrases as topical are essentially identical to the intuition behind "priest" and "Pope" being the central topics of the example in Section 1. The computational justification for the choices lies in the extremely high values of salience, resulting from taking into account a number of factors: co-referentiality between "Amelio" and "Gilbert Amelio"; co-referentiality between "Amelio" and "his"; syntactic prominence of "Amelio" (as a subject) promoting topical status higher than, for instance, "Apple" (which appears in adjunct positions); high overall frequency (four, counting the anaphor, as opposed to three for "Apple", even though the two get the same number of text occurrences in the segment); and a boost in global salience measures, due to "priming" effects of both referents for "Gilbert Amelio" and "operating system" in the prior discourse of the two preceding segments. Even if we are unable to generate a single phrase summary in the form of, say, "Amelio seeks a new operating system", the overview for the closing segment comes close; arguably, it is even better than any single phrase summary.
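How such factors might combine into a single score can be sketched as a weighted sum over all coreferential mentions of a referent, in the spirit of Lappin and Leass [16]. The factor names and weights below are illustrative assumptions only, not the system's actual values; the point is that coreference lets every mention (including anaphors like "his") contribute to one referent's total.

```python
# Hypothetical factor weights, loosely modelled on Lappin and Leass.
WEIGHTS = {
    "sentence_recency": 100,  # mentioned in the current sentence
    "subject": 80,            # syntactic prominence as grammatical subject
    "head_noun": 80,
    "non_adverbial": 50,
}

def salience(mentions):
    """Sum weighted factor hits over all coreferential mentions of
    a referent, so coreference boosts a single referent's score."""
    return sum(WEIGHTS[f] for m in mentions for f in m)

# Toy mention set for the referent "Gilbert Amelio":
amelio = [{"sentence_recency", "subject"},  # "Gilbert Amelio", as subject
          {"subject"},                      # "Amelio", as subject
          {"non_adverbial"}]                # the anaphor "his"
print(salience(amelio))  # 310
```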
As the discussion of this example illustrates, a capsule overview is derived by a process which facilitates partial understanding of the text by the user. The final set of topic stamps is designed to be representative of the core of the document content. It is compact, as it is a significantly cut-down version of the full list of identified terms. It is highly informative, as the terms included in it are the most prominent ones in the document. It is representative of the whole document, as a separate topic tracking module effectively maintains a record of where and how referents occur in the entire span of the text. As the topics are, by definition, the primary content-bearing entities in a document, they offer an accurate approximation of what that document is about.
5 Related and future work 
Our framework clearly attempts to balance the conflicting requirements of the two primary approaches to the document summarisation task. By design, we target any text type, document genre, and domain of discourse, and thus compromise by forgoing in-depth analysis of the full meaning of the document. On the other hand, our content characterisation procedure remains closer to the core meaning than the approximations offered by traditional passage extraction algorithms, with certain sentence- or paragraph-sized passages deemed indicative of content by means of similarity scoring metrics.
By choosing a phrasal granularity of representation, rather than sentence- or paragraph-based, we can obtain a more refined view into highly relevant fragments of the source; this also offers a finer-grained control for adjusting the level of detail in capsule overviews. Exploiting a notion of discourse contiguity and coherence for the purposes of full source coverage and continuous context maintenance ensures that the entire text of the document is uniformly represented in the overview. Finally, by utilising a strong linguistic notion of salience, the procedure can build a richer representation of the discourse objects, and exploit this for informed decisions about their prominence, importance, and ultimately topicality; salience thus becomes central to deriving a strong sense of a document's "aboutness".
At present, salience calculations are driven from contextual analysis and syntactic considerations focusing on discourse objects and their behaviour in the text. Given the power of our phrasal grammars, however, it is conceivable to extend the framework to identify, explicitly represent, and similarly rank, higher order expressions (e.g. events, or properties of objects). This may not ultimately change the appearance of a capsule overview; however, it will allow for even more informed judgements about the relevance of discourse entities. More importantly, it is a necessary step towards developing more sophisticated discourse processing techniques (such as those discussed in Sparck Jones [28]), which are ultimately essential for the automatic construction of true summaries.
Currently, we analyse individual documents; unlike McKeown and Radev [21], there is no notion of calculating salience across the boundaries of more than one document, even if we were to know in advance that they are somehow related. However, we are experimenting with using topic stamps as representation and navigation "labels" in a multi-document space; we thus plan to fold in awareness of document boundaries (as an extension to tracking the effects of discourse segment boundaries within a single document). Even though the approach presented here can be construed, in some sense, as a type of passage extraction, it is considerably less exposed to problems like pronouns out of context, or discontinuous sentences presented as contiguous passages (cf. Paice [22]). This is a direct consequence of the fact that we employ anaphora resolution to construct a discourse model with explicit representation of objects, and use syntactic criteria to extract coherent phrasal units. For the same reason, topic stamps are quantifiably adequate content abstractions: see Kennedy and Boguraev [13] for evaluation of the anaphora resolution algorithm; we are also in the process of designing a user study to determine the utility, from a usability point of view, of capsule overviews as defined here.
Recent work in summarisation has begun to focus more closely on the utility of document fragments with granularity below that of a sentence. Thus McKeown and Radev [21] pro-actively seek, and use to great leverage, certain cue phrases which denote specific rhetorical and/or inter-document relationships. Mahesh [18] uses phrases as "sentence surrogates", in a process called sentence simplification; his rationale is that, with hypertext, a phrase can be used as a place-holder for the complete sentence, and/or is more conveniently manipulated, compared to a sentence. Even in passage extraction work, notions of multi-word expressions have found use as one of several features driving a statistical classifier scoring sentences for inclusion in a sentence-based summary (Kupiec et al. [15]). In all of these examples, the use of a phrase is somewhat peripheral to the fundamental assumptions of the particular approach; more to the point, it is a different kind of object that the summary is composed from (a template, in the case of [21]), or that the underlying machinery is seeking to identify (sentences, in the case of [18] and [15]). In contrast, our adoption
of phrasal expressions as the atomic building blocks for capsule overviews is central to the design: it drives the entire analysis process, and is the underpinning for our discourse representation.
References

[1] Advanced Research Projects Agency. Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, 1993. Software and Intelligent Systems Technology Office.

[2] Advanced Research Projects Agency. TIPSTER Text Program: Phase I, Fredericksburg, Virginia, 1993.

[3] D. Bourigault. Surface grammatical analysis for the extraction of terminological noun phrases. In 14th International Conference on Computational Linguistics, Nantes, France, 1992.

[4] D. Caruso. New software summarizes documents. The New York Times, January 27, 1997.

[5] I. Dagan and K. Church. Termight: identifying and translating technical terminology. In Proceedings of 4th Conference on Applied NLP, Stuttgart, Germany, 1995.

[6] Defense Advanced Research Projects Agency. Fourth Message Understanding Conference (MUC-4), McLean, Virginia, 1992. Software and Intelligent Systems Technology Office.

[7] G. DeJong. An overview of the FRUMP system. In W. Lehnert and M. Ringle, editors, Strategies for Natural Language Parsing, pp. 149-176. Lawrence Erlbaum Associates, Hillsdale, NJ, 1982.

[8] M. Hearst. Multi-paragraph segmentation of expository text. In 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994.

[9] N. Hutheesing. Gilbert Amelio's grand scheme to rescue Apple. Forbes Magazine, December 16, 1996.

[10] J. S. Justeson and S. M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9-27, 1995.

[11] F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, 1995.

[12] E. Keenan and B. Comrie. Noun phrase accessibility and universal grammar. Linguistic Inquiry, 8:62-100, 1977.

[13] C. Kennedy and B. Boguraev. Anaphora for everyone: pronominal anaphora resolution without a parser. In Proceedings of COLING-96, Copenhagen, Denmark, 1996.

[14] C. Kennedy and B. Boguraev. Anaphora in a wider context: tracking discourse referents. In W. Wahlster, editor, Proceedings of ECAI-96, Budapest, Hungary, 1996. John Wiley and Sons.

[15] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference, pp. 68-73, Seattle, Washington, 1995.

[16] S. Lappin and H. Leass. An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535-561, 1994.

[17] H. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159-165, 1958.

[18] K. Mahesh. Hypertext summary extraction for fast document browsing. In Proceedings of AAAI Spring Symposium on NLP for WWW, pp. 95-104, Stanford, CA, 1997.

[19] I. Mani and T. MacMillan. Identifying unknown proper names in newswire text. In B. Boguraev and J. Pustejovsky, editors, Corpus Processing for Lexical Acquisition, pp. 41-59. MIT Press, Cambridge, MA, 1996.

[20] M. M. McCord. Slot grammar: a system for simpler construction of practical natural language grammars. In R. Studer, editor, Natural Language and Logic: International Scientific Symposium, Lecture Notes in Computer Science, pp. 118-145. Springer Verlag, 1990.

[21] K. McKeown and D. Radev. Generating summaries of multiple news articles. In Proceedings of the 18th Annual International ACM SIGIR, pp. 74-82, Seattle, Washington, 1995.

[22] C. D. Paice. Constructing literature abstracts by computer: techniques and prospects. Information Processing and Management, 26:171-186, 1990.

[23] K. Preston and S. Williams. Managing the information overload: new automatic summarization tools are good news for the hard-pressed executive. Physics in Business, 1994.

[24] L. Rau. Conceptual information extraction and information retrieval from natural language input. In Proceedings of RIAO-88, Conference on User-oriented, Content-Based, Text and Image Handling, pp. 424-437, Cambridge, MA, 1988.

[25] L. Rau, R. Brandow, and K. Mitze. Domain-independent summarization of news. In Summarizing Text for Intelligent Communications, pp. 71-75, Dagstuhl, Germany, 1994.

[26] G. Salton, A. Singhal, C. Buckley, and M. Mitra. Automatic text decomposition using text segments and text themes. In Seventh ACM Conference on Hypertext, Washington, D.C., 1996.

[27] K. Sparck Jones. Discourse modelling for automatic text summarising. Technical Report 290, University of Cambridge Computer Laboratory, 1993.

[28] K. Sparck Jones. What might be in a summary? In Knorz, Krause, and Womser-Hacker, editors, Information Retrieval 93: Von der Modellierung zur Anwendung, pp. 9-26, Universitätsverlag Konstanz, 1993.

[29] J. Tait. Automatic Summarising of English Texts. PhD thesis, University of Cambridge Computer Laboratory, Cambridge, UK, 1983. Technical Report 47.