Ranking Text Units According to Textual Saliency, Connectivity 
and Topic Aptness 
Antonio Sanfilippo* 
LINGLINK 
Anite Systems 
13 rue Robert Stumper 
L-2557 Luxembourg 
Abstract 
An efficient use of lexical cohesion is described 
for ranking text units according to their contri- 
bution in defining the meaning of a text (textual 
saliency), their ability to form a cohesive sub- 
text (textual connectivity) and the extent and 
effectiveness to which they address the different 
topics which characterize the subject matter of 
the text (topic aptness). A specific application 
is also discussed where the method described is 
employed to build the indexing component of a 
summarization system to provide both generic 
and query-based indicative summaries. 
1 Introduction 
As information systems become a more inte- 
gral part of personal computing, it appears 
clear that summarization technology must be 
able to address users' needs effectively if it is 
to meet the demands of a growing market in 
the area of document management. Minimally, 
the abridgement of a text according to a user's 
needs involves selecting the most salient por- 
tions of the text which are topically best suited 
to represent the user's interests. This selec- 
tion must also take into consideration the de- 
gree of connectivity among the chosen text por- 
tions so as to minimize the danger of produc- 
ing summaries which contain poorly linked sen- 
tences. In addition, the assessment of textual 
saliency, connectivity and topic aptness must 
be computed efficiently enough so that summarization
can be conveniently performed on-line.
The goal of this paper is to show how these objectives
can be achieved through a conceptual indexing technique
based on an efficient use of lexical cohesion.
* This work was carried out within the Information
Technology Group at SHARP Laboratories of Europe,
Oxford, UK. I am indebted to Julian Asquith, Jan IJdens,
Ian Johnson and Victor Poznański for helpful comments
on previous versions of this document. Many thanks also
to Stephen Burns for internet programming support,
Ralf Steinberger for assistance in dictionary conversion,
and Charlotte Boynton for editorial help.
2 Background 
Lexical cohesion has been widely used in text 
analysis for the comparative assessment of 
saliency and connectivity of text fragments. 
Following Hoey (1991), a simple way of com- 
puting lexical cohesion in a text is to segment 
the text into units (e.g. sentences) and to count
non-stop words¹ which co-occur in each pair of
distinct text units, as shown in Table 2 for the 
text in Table 1. Text units which contain a 
greater number of shared non-stop words are 
more likely to provide a better abridgement of 
the original text for two reasons: 
• the more often a word with high informa- 
tional content occurs in a text, the more 
topical and germane to the text the word 
is likely to be, and 
• the greater the number of times two text 
units share a word, the more connected 
they are likely to be. 
Text saliency and connectivity for each text unit 
is therefore established by summing the num- 
ber of shared words associated with the text 
unit. According to Hoey, the number of links 
(e.g. shared words) across two text units must 
be above a certain threshold for the two text 
units to achieve a lexical cohesion rank. For ex- 
ample, if only individual scores greater than 2 
¹ Non-stop words can be intuitively thought of as
words which have high informational content. They usually
exclude words with a very high frequency of occurrence,
especially closed class words such as determiners,
prepositions and conjunctions (Fox, 1992).
#1# Apple Looking for a Partner 
#2# NEW YORK (Reuter) - Apple is actively 
looking for a friendly merger partner, 
according to several executives close 
to the company, the New York Times 
said on Thursday. 
#3# One executive who does business with 
Apple said Apple employees told him 
the company was again in talks with 
Sun Microsystems, the paper said. 
#4# On Wednesday, Saudi Arabia's Prince 
Alwaleed Bin Talal Bin Abdulaziz Al
Saud said he owned more than five 
percent of the computer maker's stock, 
recently buying shares on the open 
market for a total of $115 million. 
#5# Oracle Corp Chairman Larry Ellison 
confirmed on March 27 he had formed an 
independent investor group to gauge 
interest in taking over Apple. 
#6# The company was not immediately 
available to comment. 
Table 1: Sample text with numbered text units
Text units    Words shared                        Score
#1# #2#       Apple, look, partner                3
#1# #3#       Apple, Apple                        2
#1# #4#                                           0
#1# #5#       Apple                               1
#1# #6#                                           0
#2# #3#       Apple, Apple, executive, company    4
#2# #4#                                           0
#2# #5#       Apple                               1
#2# #6#       company                             1
#3# #4#                                           0
#3# #5#       Apple, Apple                        2
#3# #6#       company                             1
#4# #5#                                           0
#4# #6#                                           0
#5# #6#                                           0
Table 2: Measuring lexical cohesion in text unit pairs.
are taken into account, the final scores and con- 
sequent ranking order computable from Table 2 
are: 
• first: text unit #2# (final score: 7);
• second: text unit #3# (final score: 4), and 
• third: text unit #1# (final score: 3). 
A text abridgement can be obtained by select- 
ing text units in ranking order according to the 
text percentage specified by the user. For example,
a 35% abridgement of the text in Table 1 would result
in the selection of text units #2# and #3#.
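The basic computation reviewed above can be illustrated with a minimal sketch. The word lists are hand-normalized assumptions (only the words needed to reproduce Table 2, plus a few others), and the names `count_links`, `cohesion_scores` and `abridge` are hypothetical; selecting units by rounding the ratio of unit counts is likewise an assumption, not Hoey's procedure.

```python
from itertools import combinations

# Hand-normalized content words for the text units of Table 1 (assumed).
UNITS = {
    1: ["Apple", "look", "partner"],
    2: ["Apple", "look", "friendly", "merger", "partner", "executive", "company"],
    3: ["executive", "business", "Apple", "Apple", "employee", "company", "talk"],
    4: ["prince", "computer", "maker", "stock", "share", "market"],
    5: ["chairman", "investor", "group", "interest", "Apple"],
    6: ["company", "comment"],
}

def count_links(unit_a, unit_b):
    """Count matching word pairs over all cross-unit word combinations."""
    return sum(1 for a in unit_a for b in unit_b if a == b)

def cohesion_scores(units, threshold):
    """Sum, per unit, the pair scores that exceed the threshold."""
    scores = {uid: 0 for uid in units}
    for i, j in combinations(sorted(units), 2):
        links = count_links(units[i], units[j])
        if links > threshold:
            scores[i] += links
            scores[j] += links
    return scores

def abridge(scores, ratio):
    """Pick the top-ranked units; the unit-count rounding is an assumption."""
    k = max(1, round(ratio * len(scores)))
    ranked = sorted(scores, key=lambda uid: scores[uid], reverse=True)
    return ranked[:k]

scores = cohesion_scores(UNITS, threshold=2)
print(scores)                 # {1: 3, 2: 7, 3: 4, 4: 0, 5: 0, 6: 0}
print(abridge(scores, 0.35))  # [2, 3]
```

The nested loops in `count_links` and the loop over all unit pairs make the cost of evaluating every cross-unit word combination for every pair explicit.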
As Hoey points out, additional techniques 
can be used to refine the assessment of lexi- 
cal cohesion. A typical example is the use of 
thesaurus functions such as synonymy and hy- 
ponymy to extend the notion of word sharing 
across text units, as exemplified in Hirst and St- 
Onge (1997) and Barzilay and Elhadad (1997) 
with reference to WordNet (Miller et al., 1990). 
Such an extension may improve on the assess- 
ment of textual saliency and connectivity thus 
providing better generic summaries, as argued 
in Barzilay and Elhadad (1997). 
There are basically two problems with the 
uses of lexical cohesion for summarization re- 
viewed above. First, the basic algorithm re- 
quires that (i) all unique pairwise permutations 
of distinct text units be processed, and (ii) all 
cross-sentence word combinations be evaluated 
for each such text unit pair. The complexity of
this algorithm will therefore be $O(n^2 \cdot m^2)$ for
n text units in a text and m words in a text
unit of average length in the text at hand. This
estimate may get worse as conditions such as
synonymy and hyponymy are checked for each
word pair to extend the notion of lexical cohesion,
e.g. using WordNet as in Barzilay and Elhadad
(1997). Consequently, the approach may
not be suitable for on-line use with longer input 
texts. Secondly, the use of thesauri envisaged 
in both Hirst and St-Onge (1997) and Barzi- 
lay and Elhadad (1997) does not address the 
question of topical aptness. Thesaural relations 
such as synonymy and hyponymy are meant to 
capture word similarity in order to assess lexical 
cohesion among text units, and not to provide a 
thematic characterization of text units.² Consequently,
it will not be possible to index and
retrieve text units in terms of topic aptness according
to users' needs. In the remaining part
of the paper, we will show how these concerns
of efficiency and thematic characterization can
be addressed with specific reference to a system
performing generic and query-based indicative
summaries.
² Notice incidentally that such thematic characterization
could not be achieved using thesauri such as WordNet,
since WordNet does not provide an arrangement
of synonym sets into classes of discourse topics (e.g.
finance, sport, health).
3 An Efficient Method for 
Computing Lexical Cohesion 
The method we are about to describe comprises 
three phases: 
• a preparatory phase where the input 
text undergoes a number of normalizations 
so as to facilitate the process of assessing 
lexical cohesion; 
• an indexing phase where the sharing of 
elements indicative of lexical cohesion is as- 
sessed for each text unit, and 
• a ranking phase where the assessment of 
lexical cohesion carried out in the indexing 
phase is used to rank text units. 
3.1 Preparatory Phase 
During the preparatory phase, the text under- 
goes a number of normalizations which have the 
purpose of facilitating the process of computing 
lexical cohesion, including: 
• removal of formatting commands 
• text segmentation, i.e. breaking the input 
text into text units 
• part-of-speech tagging 
• recognition of proper names 
• recognition of multi-word expressions 
• removal of stop words 
• word tokenization, e.g. lemmatization. 
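A few of the normalizations listed above can be sketched as follows. This is a toy illustration only: the stop list and the lemma table are tiny hypothetical stand-ins, the function names are assumptions, and part-of-speech tagging, proper name recognition and multi-word recognition are deliberately left out.

```python
import re

# Tiny illustrative resources (assumed); a real system would use a full
# stop list and a proper lemmatizer.
STOP_WORDS = {"a", "an", "the", "is", "was", "for", "to", "of", "on", "in", "with", "not"}
LEMMAS = {"looking": "look", "executives": "executive"}

def segment(text):
    """Break text into units after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def normalize(unit):
    """Lowercase, drop stop words and map word forms to lemmas."""
    tokens = re.findall(r"[A-Za-z]+", unit.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

text = "Apple is actively looking for a friendly merger partner. The company was not available."
for unit in segment(text):
    print(normalize(unit))
```

Lowercasing everything is a simplification that would conflict with proper name recognition in a fuller pipeline.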
3.2 Indexing Phase 
In providing a solution for the efficiency prob- 
lem, our aim is to compute lexical cohesion for 
all text units in a text without having to pro- 
cess all cross-sentence word combinations for all 
unique and distinct pair-wise text unit permu- 
tations. To achieve this objective, we index 
each text unit with reference to each word oc- 
curring in it and reverse-index each such word 
with reference to all other text units in which 
the word occurs, as shown in Table 3 for text 
unit #2#. The sharing of words can then be 
measured by counting all occurrences of iden- 
tical text units linked to the words associated 
with the "head" text unit (#2# in Table 3), as 
shown in Table 4. By repeating the two operations
described above for each text unit in the
text shown in Table 1, we will obtain a table of
lexical cohesion links equivalent to that shown
in Table 2.

        < Apple     {#1#, #3#, #3#, #5#} >
        < company   {#3#, #6#} >
#2#     < executive {#3#} >
        < look      {#1#} >
        < partner   {#1#} >
Table 3: Text unit #2# and its words with pointers
to the other text units in which they occur.

        #1#   #3#   #5#   #6#
#2#     3     4     1     1
Table 4: Total number of lexical cohesion links
which text unit #2# has with all other text units
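The indexing step can be sketched with a reverse index from words to the text units in which they occur. The word lists are the same hand-normalized assumption used to reproduce Table 2, and `build_index` and `links_from` are hypothetical names rather than the names used in the implementation.

```python
from collections import Counter, defaultdict

# Hand-normalized content words for the text units of Table 1 (assumed).
UNITS = {
    1: ["Apple", "look", "partner"],
    2: ["Apple", "look", "friendly", "merger", "partner", "executive", "company"],
    3: ["executive", "business", "Apple", "Apple", "employee", "company", "talk"],
    4: ["prince", "computer", "maker", "stock", "share", "market"],
    5: ["chairman", "investor", "group", "interest", "Apple"],
    6: ["company", "comment"],
}

def build_index(units):
    """Reverse-index each word to every unit occurrence (duplicates kept)."""
    index = defaultdict(list)
    for uid, words in units.items():
        for word in words:
            index[word].append(uid)
    return index

def links_from(head, units, index):
    """Count, per other unit, the occurrences reachable from the head's words."""
    links = Counter()
    for word in units[head]:
        for other in index[word]:
            if other != head:
                links[other] += 1
    return links

index = build_index(UNITS)
links = links_from(2, UNITS, index)
print(dict(links))          # {1: 3, 3: 4, 5: 1, 6: 1}
print(sum(links.values()))  # 9
```

Units that share no word with the head unit (here #4#) are never visited, and no word-by-word comparison between two units is performed.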
According to this method, we are still pro- 
cessing pair-wise permutations of text units to 
collect lexical cohesion links as shown in Ta- 
ble 4. However, there are two important differ- 
ences with the original algorithm. First, non- 
cohesive text units are not taken into account 
(e.g. the pair #2#-#4# in the example un- 
der analysis); therefore, on average the number 
of text unit permutations will be significantly 
smaller than that processed in the original algorithm.
With reference to the text in Table 1,
for example, we would be processing 7 fewer text unit
permutations, which is over 41% of the number
of text unit permutations which need computing
according to the original algorithm, as
shown in Table 2. Secondly, although pair-wise
text unit combinations are still processed, we
avoid doing so for all cross-sentence word permutations.
Consequently, the complexity of the
algorithm is $O(n^2 \cdot m)$ for n text units in a text
and m words in a text unit of average length
in the text, as compared to $O(n^2 \cdot m^2)$ for the
original algorithm.³
³ A further improvement yet would be to avoid counting
lexical cohesion links per text unit as in Table 4,
and just sum all text unit occurrences associated with
reverse-indexed words in structures such as those in
Table 3, e.g. the lexical cohesion score for text unit
#2# would simply be 9. This would remove the need
to process pair-wise text unit permutations for the
assessment of lexical cohesion links, thus bringing the
complexity down to $O(n \cdot m)$. Such a further step,
however, would preempt the possibility of excluding lexical
cohesion scores for text unit pairs which are below a
given threshold.
• Let
  - TRSH be the lexical cohesion threshold
  - TU be the current text unit
  - LC_TU be the current lexical cohesion score
    of TU (i.e. LC_TU is the count of tokenized
    words TU shares with some other text unit)
  - CLevel be the level of the current lexical
    cohesion score, calculated as the difference
    between LC_TU and TRSH
  - Score be the lexical cohesion score previously
    assigned to TU (if any)
  - Level be the level for the lexical cohesion
    score previously assigned to TU (if any)
• Then
  - if LC_TU = 0, then do nothing
  - else, if the scoring structure
    (Level, TU, Score) exists, then
    * if Level > CLevel, then do nothing
    * else, if Level = CLevel, then the new
      scoring structure is
      (Level, TU, Score + LC_TU)
    * else, if CLevel > 0, then
      • if Level > 0, then the new scoring
        structure is (1, TU, Score + LC_TU)
      • else, if Level < 0, then the new scoring
        structure is (1, TU, LC_TU)
    * else the new scoring structure is
      (CLevel, TU, LC_TU)
  - else
    * if CLevel > 0, then create the scoring
      structure (1, TU, LC_TU)
    * else create the scoring structure
      (CLevel, TU, LC_TU)
Table 5: Method for ranking text units according
to lexical cohesion scores.
3.3 Ranking Phase 
Each text unit is ranked with reference to the 
total number of lexical cohesion scores collected, 
such as those shown in Table 4. The objective 
of such a ranking process is to assess the im- 
port of each score and combine all scores into 
a rank for each text unit. In performing this 
assessment, provisions are made for a threshold
which specifies the minimal number of links
required for text units to be lexically cohesive,
following Hoey's approach (see §2). The procedure
outlined in Table 5 describes the scoring
methodology adopted. Ranking a text unit according
to this procedure involves adding the
lexical cohesion scores associated with the text
unit which are either
• above the threshold, or
• below the threshold and of the same magnitude.

• Constant values
  - TRSH = 2
  - TU = #2#
• Scoring text unit #2#
  - Lexical cohesion with text unit #6#
    * LC_TU = 1
    * CLevel = -1 (i.e. LC_TU - TRSH)
    * no previous scoring structure
    * current scoring structure: (-1, #2#, 1)
  - Lexical cohesion with text unit #5#
    * LC_TU = 1
    * CLevel = -1
    * previous scoring structure: (-1, #2#, 1)
    * current scoring structure: (-1, #2#, 2)
  - Lexical cohesion with text unit #3#
    * LC_TU = 4
    * CLevel = 2
    * previous scoring structure: (-1, #2#, 2)
    * current scoring structure: (1, #2#, 4)
  - Lexical cohesion with text unit #1#
    * LC_TU = 3
    * CLevel = 1
    * previous scoring structure: (1, #2#, 4)
    * final scoring structure: (1, #2#, 7)
Table 6: Ranking text unit #2# for lexical cohesion.
If the threshold is 0, then there is a single level 
and the final score is the sum of all scores. Suppose,
for example, we are ranking text unit #2#
with reference to the scores in Table 4 with a
lexical cohesion threshold of 2. In this case we
apply the ranking procedure in Table 5 to each
score in Table 4, as shown in Table 6. Following
this procedure for all text units in Table 1, we
will obtain the ranking in Table 7.
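The ranking procedure of Table 5 can be sketched as a small update function applied to each link score in turn. The pair scores are copied from Table 2; the names `update` and `rank_units` are hypothetical, and the treatment of a previous level of exactly 0 when the current level is positive (handled here like a negative level) is an assumption, since Table 5 does not spell that case out.

```python
# Link scores per text unit pair, copied from Table 2 (zero scores omitted).
PAIR_LINKS = {
    (1, 2): 3, (1, 3): 2, (1, 5): 1, (2, 3): 4,
    (2, 5): 1, (2, 6): 1, (3, 5): 2, (3, 6): 1,
}

def update(structure, lc, trsh):
    """Apply one step of the Table 5 procedure.

    structure is None or a (level, score) pair for the current text unit.
    """
    if lc == 0:
        return structure
    clevel = lc - trsh
    if structure is None:
        return (1, lc) if clevel > 0 else (clevel, lc)
    level, score = structure
    if level > clevel:
        return structure
    if level == clevel:
        return (level, score + lc)
    if clevel > 0:
        # Assumption: a previous level of 0 is handled like a negative one.
        return (1, score + lc) if level > 0 else (1, lc)
    return (clevel, lc)

def rank_units(unit_ids, pair_links, trsh):
    """Fold all link scores into one structure per unit, then sort."""
    structures = {uid: None for uid in unit_ids}
    for (i, j), lc in pair_links.items():
        structures[i] = update(structures[i], lc, trsh)
        structures[j] = update(structures[j], lc, trsh)
    lowest = (float("-inf"), 0)  # units without any link are ranked last
    ordered = sorted(unit_ids, key=lambda u: structures[u] or lowest, reverse=True)
    return [(u, structures[u]) for u in ordered]

for unit, structure in rank_units(range(1, 7), PAIR_LINKS, trsh=2):
    print(unit, structure)
```

With a threshold of 2 the units come out in the order #2#, #3#, #1#, #5#, #6#, #4#, with the levels and scores of Table 7.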
4 Assessing Topic Aptness 
When used with a dictionary database provid- 
ing information about the thematic domain of 
words (e.g. business, politics, sport), the same 
method can be slightly modified to compute lex- 
ical cohesion with reference to discourse topics 
rather than words. Such an application makes 
Rank   Text unit   Level   Score
1st    #2#          1      7
2nd    #3#          1      4
3rd    #1#          1      3
4th    #5#          0      2
5th    #6#         -1      2
6th    #4#          -      0
Table 7: Ranking for all text units in the text
shown in Table 1.
WORD_POS     CODE   EXPLANATION
company_n    F      Finance & Business
             MI     Military (the armed forces)
             SCG    Scouting & Girl Guides
             TH     Theatre
partner_n    DA     Dance & Choreography
             F      Finance & Business
             MGE    Marriage, Divorce,
                    Relationships & Infidelity
             TG     Team Games
Table 8: Fragment of dictionary database providing
subject domain information.
it possible to detect the major topics of a docu- 
ment automatically and to assess how well each 
text unit represents these topics. 
In our implementation, we used the "subject 
domain codes" provided in the machine read- 
able version of CIDE (Cambridge International 
Dictionary of English (Procter, 1995)). Table 8 
provides an illustrative example of the infor- 
mation used. Both the indexing and ranking 
phases are carried out with reference to subject 
domain codes rather than words. 
As shown in Table 9 for text unit #1#, the in- 
dexing procedure provides a record of the sub- 
ject domain codes occurring in each text unit; 
each such subject code is reverse-indexed with 
reference to all other text units in which the 
subject code occurs. In addition, a record of 
which word originates which cohesion link is 
kept for each text unit index. The main func- 
tion of keeping track of this information is to 
avoid counting lexical cohesion links generated 
by overlapping domain codes which relate to the 
same word -- for words associated with more 
than one code. Such provision is required in or- 
der to avoid, or at least reduce the chances of, 
counting codes which are out of context, that is 
codes which relate to senses of the word other 
than the intended sense. For example, the word 
partner occurring in the first two text units of 
the text in Table 1 is associated with four different
subject codes pertaining to the domains
of Dance (DA), Finance (F), Marriage (MGE) and
Team Games (TG), as shown in Table 8. However,
only the Finance reading is appropriate in
the given context. If we count the cohesion links
generated by partner we would therefore count
three incorrect cohesion links. By excluding all
four cohesion links, the inclusion of contextually
inappropriate cohesion links is avoided. Needless
to say, we will also throw away the correct
cohesion link (F in this case). However, this loss
can be redressed if we also compute lexical cohesion
links generated from shared words across
text units as discussed in §2, and combine the
results with the lexical cohesion ranks obtained
with subject domain codes.

               < F   {#2#-partner, #3#-company, #6#-company} >
#1#-partner    < MGE {#2#-partner} >
               < TG  {#2#-partner} >
Table 9: Text unit #1# and its subject domain
codes with pointers to the other text units in
which they occur.

               #3#       #6#
#1#-partner    1         1
               F         F
               company   company
Table 10: Total number of lexical cohesion links
induced by subject domain codes for text unit
#1#.
The lexical cohesion links for text unit #1# 
will therefore be scored as shown in Table 10, 
where associations between link scores and rele- 
vant codes as well as the words generating them 
are maintained. As can be observed, only the 
appropriate code expansion F (Finance) for the 
words partner and company is taken into ac- 
count. This is simply because F is the only code 
shared by the two words (see Table 8). 
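Indexing on subject domain codes can be sketched in the same way as indexing on words. The sketch below is an illustration under assumptions: the code assignments are restricted to those shown in Tables 8 and 11 (the actual CIDE entries may carry more codes), only the words bearing those codes are listed per text unit, and `build_code_index` and `code_links` are hypothetical names. Links between two occurrences of the same word are skipped, as described above.

```python
from collections import defaultdict

# Code assignments restricted to those shown in Tables 8 and 11 (assumed).
WORD_CODES = {
    "partner": {"DA", "F", "MGE", "TG"},
    "company": {"F", "MI", "SCG", "TH"},
    "executive": {"BZ"},
    "business": {"BZ"},
    "market": {"BZ"},
    "interest": {"BZ"},
}

# Only the words of Table 1 that carry one of the codes above.
UNITS = {
    1: ["partner"],
    2: ["partner", "executive", "company"],
    3: ["executive", "business", "company"],
    4: ["market"],
    5: ["interest"],
    6: ["company"],
}

def build_code_index(units, word_codes):
    """Reverse-index each code to the (unit, word) occurrences bearing it."""
    index = defaultdict(list)
    for uid, words in units.items():
        for word in words:
            for code in word_codes.get(word, ()):
                index[code].append((uid, word))
    return index

def code_links(units, word_codes):
    """Collect, per code, the ordered unit pairs linked by different words."""
    index = build_code_index(units, word_codes)
    pairs = defaultdict(list)
    for head, words in units.items():
        for word in words:
            for code in sorted(word_codes.get(word, ())):
                for other, other_word in index[code]:
                    # Skip the head unit itself and links from the same word.
                    if other != head and other_word != word:
                        pairs[code].append((head, other))
    return pairs

pairs = code_links(UNITS, WORD_CODES)
for code in sorted(pairs):
    print(code, len(pairs[code]), pairs[code])
```

Under these assumptions only the codes BZ and F produce links, with 16 and 10 ordered text unit pairs respectively; codes carried by a single word, such as DA or MI, produce none.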
As mentioned earlier, lexical cohesion links 
induced by subject domain codes can be used
to rank text units using the procedure shown in 
Table 5. Other uses include providing a topic 
profile of the text and an indication of how well 
each text unit represents a given topic. For example,
the code BZ (Business & Commerce) is
associated with the words:
• executive occurring once in text units #2#
and #3#;
• business occurring once in text unit #3#;
• market occurring once in text unit #4#, and
• interest occurring once in text unit #5#.

                #2#              #3#                        #4#           #5#
#2#-executive                    1 BZ business              1 BZ market   1 BZ interest
#3#-executive                                               1 BZ market   1 BZ interest
#3#-business    1 BZ executive                              1 BZ market   1 BZ interest
#4#-market      1 BZ executive   2 BZ executive, business                 1 BZ interest
#5#-interest    1 BZ executive   2 BZ executive, business   1 BZ market
Table 11: Lexical cohesion links relating to code
BZ.

CODES   TEXT UNIT PAIRS
BZ      2-3 2-4 2-5 3-4 3-5 3-2 3-4 3-5
        4-2 4-3 4-3 4-5 5-2 5-3 5-3 5-4
F       1-2 1-3 1-6 2-1 2-3 2-6 3-1 3-2 6-1 6-2
FA      2-5 5-2
IV      4-5 5-4
CN      3-4 4-3
Table 12: Subject domain codes and the text
unit pairs they relate.
After calculating the lexical cohesion links for 
all text units following the method illustrated 
in Tables 9-10 for text unit #1#, the links scored 
for the code BZ will be as shown in Table 11. By 
repeating this operation for all codes for which 
there are lexical cohesion scores -- F, FA, IV 
and CN for the text under analysis -- we could 
then count all text unit pairs which each code 
relates, as shown in Table 12. The relations be- 
tween subject domain codes and text unit pairs 
in Table 12 can subsequently be turned into per- 
centage ratios to provide a topic/theme profile 
of the text as shown in Table 13. 
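The conversion into percentage ratios amounts to dividing the number of text unit pairs per code by the total number of pairs. A minimal sketch, assuming the pair counts read off Table 12 and a hypothetical function name `topic_profile`:

```python
# Number of text unit pairs per subject domain code, counted from Table 12.
PAIR_COUNTS = {"BZ": 16, "F": 10, "FA": 2, "IV": 2, "CN": 2}

def topic_profile(pair_counts):
    """Express each code's share of all text unit pairs as a percentage."""
    total = sum(pair_counts.values())
    return {code: 100 * n / total for code, n in pair_counts.items()}

print(topic_profile(PAIR_COUNTS))
# {'BZ': 50.0, 'F': 31.25, 'FA': 6.25, 'IV': 6.25, 'CN': 6.25}
```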
By keeping track of the links among text 
units, relevant codes and their originating 
words, it is also possible to retrieve text units 
on the basis of specific subject domain codes 
or specific words. When retrieving on specific
words, there is also the option of expanding the
word into subject domain codes and using these
to retrieve text units. The retrieved text units
can then be ordered according to the ranking
order previously computed.

50%      BZ   Business & Commerce
31.25%   F    Finance & Business
6.25%    IV   Investment & Stock Markets
6.25%    FA   Overseas Politics &
              International Relations
6.25%    CN   Communications
Table 13: Topic profile of document in Table 1,
according to the distribution of subject domain
codes across text units shown in Table 12.
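Retrieval by code or by word can be sketched as a lookup followed by reordering. The mappings below are assumptions derived from Tables 7, 8 and 12 (the word-to-unit and word-to-code mappings are reduced to a single example word), and all function names are hypothetical.

```python
# Text units related by each code (from Table 12) and rank order (Table 7).
CODE_UNITS = {
    "BZ": {2, 3, 4, 5}, "F": {1, 2, 3, 6},
    "FA": {2, 5}, "IV": {4, 5}, "CN": {3, 4},
}
RANKING = [2, 3, 1, 5, 6, 4]

# Reduced example mappings (assumed): units containing a word, codes of a word.
WORD_UNITS = {"partner": {1, 2}}
WORD_CODES = {"partner": {"DA", "F", "MGE", "TG"}}

def order_by_rank(unit_ids):
    """Order retrieved units according to the precomputed ranking."""
    return [u for u in RANKING if u in unit_ids]

def retrieve_by_codes(codes):
    """Retrieve every unit related by at least one of the given codes."""
    hits = set()
    for code in codes:
        hits |= CODE_UNITS.get(code, set())
    return order_by_rank(hits)

def retrieve_by_word(word, expand=False):
    """Retrieve on the word itself or on its subject domain codes."""
    if expand:
        return retrieve_by_codes(WORD_CODES.get(word, set()))
    return order_by_rank(WORD_UNITS.get(word, set()))

print(retrieve_by_codes(["BZ"]))                 # [2, 3, 5, 4]
print(retrieve_by_word("partner"))               # [2, 1]
print(retrieve_by_word("partner", expand=True))  # [2, 3, 1, 6]
```

Expanding partner into its codes also brings in text units #3# and #6#, which contain company rather than partner.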
5 Applications, Extensions and 
Evaluation 
An implementation of this approach to lexical 
cohesion has been used as the driving engine of 
a summarization system developed at SHARP 
Laboratories of Europe. The system is designed 
to handle requests for both generic and query-based
indicative summaries. The level-based
differentiation of text units obtained through
the ranking procedure discussed in §3.3 is used
to select the most salient and best connected
portion of text units in a text corresponding to
the summary ratio requested by the user. In
addition, the user can display a topic profile of 
the input text, as shown in Table 13 and choose 
whichever code(s) s/he is interested in, specify a 
summary ratio and retrieve the wanted portion 
of the text which best represents the topic(s) 
selected. Query-based summaries can also be
issued by entering keywords; in this case there
is the option of expanding keywords into codes
and using these to issue a summary query.
The method described can also be used to develop
a conceptual indexing component for information
retrieval, following Dobrov et al. (1997).
Because an attempt is made to prune contextually
inappropriate sense expansions of words,
the present method may help reduce the ambiguity
problem.
Possible improvements of this approach can 
be implemented taking into account additional 
ways of assessing lexical cohesion such as: 
• the presence of synonyms or hyponyms
across text units (Hoey, 1991; Hirst and
St-Onge, 1997; Barzilay and Elhadad, 1997);
• the presence of lexical cohesion established 
with reference to lexical databases offer- 
ing a semantic classification of words other 
than synonyms, hyponyms and subject do- 
main codes; 
• the presence of near-synonymous words 
across text units established by using a 
method for estimating the degree of seman- 
tic similarity between word pairs such as 
the one proposed by Resnik (1995); 
• the presence of anaphoric links across text 
units (Hoey, 1991; Boguraev & Kennedy, 
1997), and 
• the presence of formatting commands as in- 
dicators of the relevance of particular types 
of text fragments. 
To evaluate the utility of the approach to 
lexical cohesion developed for summarization, 
a test suite was created using 41 Reuters news
stories and related summaries (available at
http://www.yahoo.com/headlines/news/),
by annotating each story with best summary
lines. In one evaluation experiment, the summary
ratio was set at 20% and generic summaries
were obtained for the 41 texts. On average,
60% of each summary contained best summary
lines. The ranking method used in this evalu- 
ation was based on combined lexical cohesion 
scores based on lemmas and their associated 
subject domain codes given in CIDE. Summary 
results obtained with the Autosummarize 
facility in Microsoft Word 97 were used as 
baseline for comparison. On average, only 
30% of each summary in Word 97 contained 
best summary lines. In future work, we hope 
to corroborate these results and to extend 
their validity with reference to query-based 
indicative summaries using the evaluation 
framework set within the context of SUMMAC 
(Automatic Text Summarization Conference, 
see http://www.tipster.org/).
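The evaluation measure can be sketched as follows, under one reading of the figures above: for each story, the share of selected summary units that were annotated as best summary lines, averaged over all stories. The sample data are purely hypothetical and the function names are assumptions.

```python
def best_line_share(summary, best_lines):
    """Share of the selected summary units annotated as best summary lines."""
    if not summary:
        return 0.0
    return sum(1 for unit in summary if unit in best_lines) / len(summary)

def mean_share(cases):
    """Average the per-story share over (summary, best_lines) pairs."""
    return sum(best_line_share(s, b) for s, b in cases) / len(cases)

# Hypothetical stories: selected units and annotated best summary lines.
cases = [
    ([2, 3], {2}),           # one of two selected units is a best line
    ([1, 4, 5], {1, 4, 5}),  # all selected units are best lines
]
print(mean_share(cases))  # 0.75
```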

References 
Barzilay, R. and M. Elhadad (1997) Using
Lexical Chains for Text Summarization.
In I. Mani and M. Maybury (eds) Intelligent
Scalable Text Summarization, Proceedings
of a Workshop Sponsored by the
Association for Computational Linguistics,
Madrid, Spain.
Boguraev, B. and C. Kennedy (1997) Salience-based
Content Characterization of Text
Documents. In I. Mani and M. Maybury
(eds) Intelligent Scalable Text Summarization,
Proceedings of a Workshop Sponsored
by the Association for Computational
Linguistics, Madrid, Spain.
Dobrov, B., N. Loukachevitch and T. Yudina
(1997) Conceptual Indexing Using Thematic
Representation of Texts. In The 6th
Text Retrieval Conference (TREC-6).
Fox, C. (1992) Lexical Analysis and Stoplists.
In W. Frakes and R. Baeza-Yates (eds) Information
Retrieval: Data Structures & Algorithms.
Prentice Hall, Upper Saddle River,
NJ, USA, pp. 102-130.
Hirst, G. and D. St-Onge (1997) Lexical
Chains as Representation of context for the
detection and correction of malapropism.
In C. Fellbaum (ed) WordNet: An electronic
lexical database and some of its applications.
MIT Press, Cambridge, Mass.
Hoey, M. (1991) Patterns of Lexis in Text.
OUP, Oxford, UK.
Miller, G., R. Beckwith, C. Fellbaum, D.
Gross and K. Miller (1990) Introduction
to WordNet: An on-line lexical
database. International Journal of Lexicography,
3(4):235-312.
Procter, P. (1995) Cambridge International
Dictionary of English. CUP, London.
Resnik, P. (1995) Using information content
to evaluate semantic similarity in a
taxonomy. In IJCAI-95.
