7- 
Statistical methods for retrieving most significant paragraphs in newspaper articles 
Jos~ Abrafos 
Departamento de Inform~ca. Faculdade de 
Ci~nclas e Tecnologla / UNL 
2825 Monte da Capanca, Portugal 
• jea@dt fct uni pt 
. Gabriel Pereira Lopes 
• ~ Departamento de Inform~Uca, Faculdadede 
Cl~ncms e Tecnologta I UNL 
2825 Monte da Capanca, Portugal 
gpl@& fct unl pt 
Abstract 
Retrieving a most stgulficant paragraph m a 
newspaper arUcle can act as a kind of surnmanzatmn It 
can gwe the human reader some hints on the contents 
of the arucle and help him to decide whether It deseei'ves 
a full readmg or not It may also act as a filter for a 
robust natural language understanding system, to 
extract relevant mformatton from that paragraph m 
order to enable conceptual mformauon retrieval 
Talang a newspaper arUcle and a base corpus, word 
co-occurrences w3th higher resolving power are 
~dent~fied These co-occurrences are used to estabhsh 
hnks between the paragraphs of the arUcle The 
paragraph which presents the larger number of hnks tO 
other paragraphs ~s considered a most slgmficant one 
Though designed and tested for the Portuguese 
language, the staUshcal nature of our proposal should 
ensure ns portabtlny to other languages 
1. Introduction 
The advantages of using stattsucal methods when 
dealing w~th large volumes of text are known Namely, 
thelr capabdny of facing any kind of subjects, without 
feanng the most baroque syntacucal structures, and 
always produ~ng an answer whlch, though varying m 
habthty, ts always more useful than "fad" 
The scope of the present work Is the use of 
stat|st|cal methods to remeve a most ssgn~ficant 
paragraph sn a newspaper amcle The method we 
propose nught help a reader m getung a qmck ghmpse 
of the contents of a newspaper and dccldmg whlch 
articles deserve a full reading It can besldes facthtate 
searches through journalmttc text bases But we are also 
interested on pruning the amount of text to be 
automatically processedfor robust understanding of 
natural language Thls wdl enable conceptual based 
document representation and conceptual mformat~on 
retrieval (Mauldm 1991) 
The process Is based on rcmeving the 
co-occurrences wlth hlgher resolving power m each 
document, using them to estabhsh hnks between 
paragraphs, and selecting the paragraph with more 
hnks to other paragraph s 
Tests performed vdth the support of a base corpus of 
about 500 thousand words were able to identify a most 
slgn!ficant paragraph m 7 out of I0 newspaper arucles 
We present, m annex, the results of some experiments 
concerning one of the arucles 
2. Antecedents 
An Idea borrowed from Information Retrieval, ts 
that a term will be so more relevant m a document the 
more frequently n occurs m that document, and the less 
frequently It occurs m a base corpus 
Maarek (1992), followmg other authors, considers 
that using paws of words as an indexing umt ~s more 
adequate to mformauon retrieval than usmg single 
words IntmUvely, n is planslble to adnut that, for 
mstanee, the pmr \[rile system\] ts far more mformauve 
than the words file and .~stem taken m lsolatton 
Maarek alms at remeving pmrs of lextcally related 
words In Enghsh, 98% of the lexlcal relations occur 
between words within a span of 5 words m a sentence. 
s e, the window to consider when extracting words 
related to word w, should span from postttuon w-5 to 
w+5 Maarek also defines the resolwng power of a parr 
m a document d as 
P = ~'Pd log Pc 
where Pd is the observed probabshty of appearance of 
the pan" m document d, Pc the observed probabdny of 
the pmr recorpus, and -log Pc the quantity of 
mformauon assocmted to the pmr It Is easdy seen that 
p wall be h|gher, the higher the frequency of the pmr m 
the document and the lower sts frequency m the corpus, 
which agrees wlth the sdea presented at the begmnmg 
of this sectton 
Church and Hanks (1990) propose the apphcatlon 
of the concept of mutual mformatton 
e(x,y) 
~,(x.y) = hog2 ecx)e(y) 
51 
to the retrieval, ro a corpus, of pairs of lextcally related 
words They alsoconslder a word span of :e5 words and 
observe that "roterestrog" pmr, s generally present a 
mutual mformatxon above 3 
Salton and.Allan (1995) foc~as on paragraph level 
Each paragraph Is represented by a weighed vector, 
where each element is a term (typically. word stems, 
a_f~r excluchng those in a stop hsO The weight of each 
term reflects (as usual) posmve~y its frequency in the 
document and negatively its frequency m the corpus 
Usrog a roeasure of smulanty between vectors and 
applying a sumlanty threshold, one can define which 
paragraphs are linked They then constder of central 
tmportance the paragraph with the largest number of 
conneottons to other paragraphs 
The idea underl3ang the present work was to 
integrate these 3 approaches and to apply the resulting 
roethod to newspaper articles, w~th the purpose of 
retnewng, ro each article, a roost mgmficant paragraph 
3. The proposed approach 
As stated before, the method of Church and Hanks 
identifies pmrs of lexlcally related words So, for 
instance, the pair \[conselho seguranfa\] (security 
conned), with an assocmted mutual mforn~uon of 5.3, 
can be considered as a potential mdexang term, while 
the pan" \[para a\] (to the), though 63 tunes roere 
frequent ro our corpus, having a mutual roformauon of 
0.7, can be excluded We have then a erttenon for 
exclusion, that dispenses with the need for stop hsts, 
and that alms at assunng the exlstence of a leracal 
relation between the words of the rematrong pairs 
But not all pans of lexlcally related words are good 
rodexmg terms of a document The pair should also 
meet the reqmrement of being relevant m the 
considered document The method of Maarek proposes 
a measure of the resolvrog power of each pair ro the 
concerned document, thus enabling the selection, 
among all the poten.al indexing terms, of those that 
are relevant m each document For rostanco, \[estados 
umdos\] (united states) has a hxgh mutual roformatmn 
(8 1) but it can be of little relevance m an article about 
the hberatton of prisoners by the Serbs of Sarajevo 
(p=0007) The experiments earned out point to a 
threshold of the resolwng power around 0 01 We 
consider as relevant ro a document only the pairs vath a 
resolvrog power above this threshold 
When the same pmr occurs ro chfferent paragraphs 
of the same document, hnks can be estabhshed between 
those paragraphs At flus point, we only consider pairs 
that were not excluded ro prewous steps (mutual 
roformatton > 3 and resolwng power > 0 01) Though, 
each hnk Is not hnuted to pairs of words In fact, the 
52 
wider the hnk, the higher its relevance After 
processing a document, we often get overlapping pmrs 
For instance, m an amcle where the expression dos tr~s 
antJgos behgerantea (of the three former contenders) ts 
used repeatedly, the foll0vang pmrs were retrieved 
\[tr~s behgerantes\] \[an~gos behgerantes\] \[dos 
behgerantes\] 1 
By ohserwng the overlap of these pmrs ro the very 
document, a single hnk can be retrieved, m the form of 
the tuple \[dos trOs antlgos behgerantes\] 
Adaptmg the roethod of Salton and Allan, we can 
formulate the hypothesis that the paragraph vath the 
larger number of hnks to other paragraphs would be of 
central impox~tance in the document 
In summary, the steps of the proposed method are 
*m a base corpus, compute the frequency of each 
word and the frequency of each co-occurrence, 
consadenng a window spanrong from posihon 14,-5 
to w+5, 
*to each document c~mpute, smuIarly, the 
frequency of each word and each co-occurrence, 
*exclude, from the co-¢r.e~m'ences \]dent:fled m the 
document, those presenting a mutual mformatlon 
or a resolving power under the defined thresholds 
(I(x,y) < 3 or p < 0 01), 
• take the selected pans and group the overlapping 
ones, the resulting tuples (pairs and groups of 
pairs) occmTmg repeatedly ro different paragraphs 
estabhsh hnks between those paragraphs, 
*hypothetically, the paragraph presenting a larger 
number of hnk~ to other paragraphs wall be of 
central ,mportanco in the document 
It should be noted that this proposal, compared to 
Salton and Allan's, has the advantages (at least ro 
theory) of avotchng the use, always arbitrary, of stop 
hsts 2, and of basing the calculations exclusively on the 
tuples that are relevant ro the document, instead of 
using the heavy vectors containing all the terms of each 
paragraph We don't have, so far, enough data to make 
any clmm about the comparative quahty of the links 
1 pairs \[tr~s antlgos\] \[dos an#gos\], though considered relevant, 
didn't score enough mutuat reformat=on to be selected 
2 the relevance of a word depends on the context, so, we prefer 
not to a pnon exclude any word, by sandtng it to a stop kst In 
fact, some of the tuplas we retheved as relevant include words 
that would otherw=se be pad of such a Ist An example m the 
pmr \[n~o ahnhados\] (nonaligned) where the word n,~o (not) 
though quite significant tn context, would be excluded wa stop 
#st . 
i 
I 
|j 
:! 
I 
i 
I 
i 
1 
I 
4. Applying the proposal 
The base corpus was uuually bruit vath news from 
Lusa news agency, m a total of 216 319 words Later, 
news from "0 P6bllco" newspaper (about 90 000 
w~ds) and more news from Lusa were added, and the 
total reached 537 085 words The consequences of tins 
enlargement will be chscussed m the next secUon 
Experunents were made vath 10 articles from "O 
Pdbhco", that chdn't belong to the corpus 
Both the corpus and the documents were subjected 
to a very elementary pre-processmg, wluch basically 
6onslsted of 
• convemng all uppercase letters to lowercase 
* convemng all numbers to NUMERO (NUMBER) 3 
• ehmmaUng all non-letter characters 
Words or co-occurrences not present m the corpus, 
if occumng m a document, would lead, respectively m 
the computatlon of mutual mformatlon or resolwng 
power, to ¢hwdmg by 0 or to log2 0 To prevent 
sltuatton, in such cases, and only for calculatlon 
purposes, the document is added to the corpus By 
doing so, though, the mutual mformatmn becomes 
overestamated For instance, the parr \[ha eslav6ma\] (m 
slavoma) occurs 3 tnnes in an article As eslav6ma 
doesn't occur m the corpus, the artacle m hdded to the 
corpus, for calculatmn purposes only concernmg tlus 
pair The result is the presuppositmn that, despite the 
qmte low frequencies of eslav6ma and \[na eslav6ma\], 
almost every tune the word eslav6ma occurs it IS 
preceded by ha, the mutual mfc~naUon of the parr 
being then artificially raised 
To overcome this overest~maUon, 2 adthUonal 
mutual mformauon thresholds were defined 
*tf one of the words (or both) doesn't occur m the 
cOrpus, it must be I(x,y) > 10, 
•. * if both words occur in the corpus but they never 
co-occur, it must be I(x,y) > 8 
These lurers are not defimtlve They were suggested 
by the experiments camed out, which were though too 
few to ensure their defimuon with certainty 
" Theamclesanalyzed m those experiments are in 
average 500 words long Pre-processmg and frequency 
calculations are obtmned through gawk commands 
(Umx) The calculaUons of mutual lnformatmn, 
resolwng power and the filtenng of co.ocoxrrences 
through these criteria are implemented in C 
3 the choee of reducing all numbers to NUMERO has to do wdh 
• the kind of documents under study, ,n texts about law, for 
instance, the ~stmctmn between Law 12/86 and Law 47/95 
may be important 
53 
Nevertheless, gwen the experimental nature of the 
system, optlm~zaUon was no mmn concern Searches m 
the file contammg the co-occurrences of the corpus 
(22 MB) are sequenual, this source of mefficien~. 
being only palhated by prevmously sorting the" 
co-oocurrenc~s by dscreasmg order of probablhty In 
what concerns the arUcle presented m Annex A (441 
words), pre-processmg, calculatmn of freque~cles and 
sorting takes about 5 seconds The calenlaUons 
revolved m selecting and somng co-ecru-fences take 
~about 8 minutes 4 By the charactenst~c~ of the 
lmplementatmn, tins last tlme m (hre~y propo~onal, 
among other factors, to the amount of words m the 
corpus and to the amount of unknown words that occur 
m the document 
Out of the 10 arucles that were analyzed, the 
method we propose achieved the ~denttficaUon of the 
most slgmficant paragraph in 7 and was clearly 
n~staken m 1 In the remaining 2 articles, the~e doesn't 
seem to be, mtmUvely, a most representative paragraph 
Thls lntultmn m the evaluatton of the results is 
necessarily subjecUve 
N0twlthstanchng the very small number of arttcles 
involved in this test, it may be ~mous to compare our 
results vath those that would be obtmned by just 
picking up the 1 st paragraph of each amcle, or even 
both the la and the 2~ paragraphs 
# of articles 
removes a most 
slsmt cent § 
extstance of a most 
slgnd~cant § is not clear 
the § retrieved is not a 
most slgmficant one 
our 
proposal 
7 
2 
1=§+ 
I ~ § 2~d§ 
5 6 
2 2 
3 2 
5. Discussion of the results 
The proposed method ignores a series of basic 
questmns, namely 
Lemmatization 
All the calculauons are made vathout any attempt of 
umfymg plural forms ruth singular forms, dtfferent 
conjugations of a same verb, etc Nevertheless, it 
doesn't look clear that new hnks, obtained by grouping 
words that, though shanng a common stem, were m 
fact used m chsUnct forms, wdl necessarily mcrease the 
performance of the system Would it make sense to 
unify tribunal de famlha (court that deals with famdy 
cases) vath tribunal fanuhar (farmhar court)? And 
4 tzmes measured m a DECstabon 5000/200 
dwfltos do homem (human rights) vnth dwezto dos 
homens 0aw of men)~ 
Anaphora resolution • ~ 
Though the umficauon of the anaphor with the 
antecedent, m most cases, makes obvtously sense, 
anaphora resolutzon would reqmre a complete analysis 
of the text, totally outside the scope of this proposal 
Curiously, m the only experiment that was made of full 
anaphora ~ resoluUon, the number of hnks between 
paragraphs substanUally increased, but the paragraph 
retrieved as most sigmficant - the first- was no longer 
the one obtmned by mtumon - the second s -(refer to 
results m the annex) 
Unification of synonyms, hyponyms, hyperonyms 
The same arguments presented about lemmaUzaUon 
can apply here The experiment of umfymg lmUals 
vmh full names - e g ONU ~ NafOes Umdas 
(UN ~ Umted NatJons) - simple to do with the help of 
a thesaurus, gave s~xmlar results to those of appl3ang 
anaphora resoluUon 
Size of the co-o~currence window 
The wmdow spanning from posluon w-5 to w+5, 
defined for Enghsh language, may be not the most 
adequate to Portuguese No further expertments were 
performed vath other sizes of windows 
Indexing terms 
The resolving power criterion rams at assuqng that 
the selected co-occurrences are relevant m the 
document being analyzed A manual mdeyang could, 
nevertheless, choose other terms, pess~bly even foreign 
• to the document In fact, in an amcle describing a coup 
there may be references to derrube de governantes 
(overthrowmg of rulers), tomada do poder (tahng the 
corpora are though qmte small Nothing m&cates that 
the results would stand a more substanttal increase of 
the corpus 
We also tried to find out how far estabhslimg links 
could help in identifying a structure of the text The 
structures obtmned, by connecting lteratlvely each new 
paragraph to the one wRh more hnks m common, are 
not conclustve In some cases they are close to a 
posstble mtmUve structure of the text, while m other 
they dtverge considerably The structare obtained for 
the text m annex was among the most plausible 
6. Conclusion 
The methodology we propose integrates the 
concepts of mutual mformat~on associated to a pmr of 
words, resolwng power of that paw m a document and 
estabhshmg of links between paragraphs of a 
document, wRh the purpose of retrieving a most 
representattve paragraph 
The methods we use are pureIy staustacal 
Nevertheless, notw~thstan&ng their s~mphclty, the 
rough stmphficauons referred m the prevaous section 
and the extguousness of the corpus, the results seem 
quite Interesting The habihty of these results is though 
hn'nted by the amount of tests that weze performed and 
by an evaluation based on the mtmUon of the authors 
Probably, an increase of the corpus and the 
refinement of the process wtth some, even elementary, 
hngmsnc cntena, would benefit the performance 
Though designed and tested for the Portuguese 
language, the stattst~cal nature of tlus methodology 
should ensure its portabflRy to other languages 
power), vothout any explicit 
expression golpe de estado (coup) 
We present, lfi annex,, the results of processing a 
document using the miual corpus (216 319 words) and 
the augmented one (537 085 words) In what concerns 
the co-occurrences that were selected as estabhshmg 
hnks, one can notice the excluson of \[da ONU\] (of the 
UH) m the 2nd case (m the 1st case it already presented 
a mutual mformatmn very close to the threshold) All 
the other selected co-occurrences remain, and their 
ordenng m terms of resolving power m also preserved 
The paragraph retrieved as central ~s the same Both 
5 in fact, in thin article, central reformation seems to concentrate. 
the 2 mSal paragraphs, the 2rid rederahng most of the 
mformabon introduced by the 1st The 2rid refers the 2 actors 
(UN and NATO) whose achons vail be analysed latter This 
may suggest some preference agmnst the 1st Anyway, each 
one of these 2 paragraphs can be consclered as 
representatwe of the text 
occurrence of the ,.~.... References 
Church, K and Hanks, P (1990) Word assocmUon 
norms, mutual mfunnaUon, and ' lexacography. 
Computatlonal Lmgmsttcs, 16 (I), p 22-29 
Maarek, Y (1992) Automatically constructmg snnple 
help systems from natural language representaUon s 
In P Jacobs Ed, Text-based mtelhgent systems 
current research and practice m mformatton 
extraction and retrteval, Lawrence Erlbaum 
Assocaates Pubhshers, Hdlsdale, New 3ersey, p 
243-256 
Manldm, M (1991) Conceptual mformatwn retrieval 
, a case stu@ m adaptatn, e partml parsing, Kluwer 
Acadermc Press, Dordrecht 
Salton, G and Allan, J (1995) SelecUve text 
utthzatton and text traversal Internatmnal Journal of 
Human-Computer Studtes, 43, p 483-497 
54 
.Annex A - Results of the experiments, relative to one of the target newspaper articles 
Full text of the amcle, the selected co-occurrences (relative to the larger corpus) are underhned 
Caoacetes azms vao ser protegldos pela NATO 
ONU aprova nussilo na Eslav6ma 
0 Conselho de SeLmranca da ONU declchu 
segunda-fen-a ~ noRe estabelecer urea adnumstrac~o 
transRdna, apolada por urea opera~o de manutenq~o 
da paz, na regi,5o da Eslav6ma. Oriental, t~ltlmo 
terrR6no no interior das fronte~ras adnnmstrat~vas da 
Crodcm mnda ¢ontrolado polos mdependentmtas 
~rwos 
Para a mms~o, corn a dura¢~io prevtsta de um ano, 
-viio ~ dlspombdizados lmcmlmente at6 cmco n~l 
,capacete.S azms, corn regras de envol,dmento hnutado 
mas ClUe poder~o beneficlar de uma protecc~o da 
NATO Esta operag~o, j~l dessgnada "AdrmmstrafAo 
TransR6na das Nacf)es Umdas para a Eslav6ma 
~e__~_.g_~" (U AES), fol aprovada pot" unanmudade 
pelos 15 membros do Conselh.o. de Seguranca 
Trata-se da pnme~ra dec~o concreta para a 
aphca~io do piano de paz destmado a esta sensfvel 
reg~o da Croftcta que faz frontetra corn ~'Votvodma 
s~xwa, ap6s o acordo do passado dm 12 de Novembro 
entre a Cro~cm e representantes dos s6rwos locals, 
conclufdo ~ margem des conversac~.s de Dayton 
(Estados Umdos) sobre a B6sma-Herzegovma 
A admm~strafAo transR6na da Esla~dnm Oriental - 
atravessada pelo Dantibm, Inmte natural entre a S6rna 
e a Cr~ma e o grande elxo fluvml da reg~io, mclmndo 
nas hga~.s coma Hungna - vm set confia~ ao 
&plomata norte-amencano Jacques Klein, ant~go 
oficml da For~a Adrea dos EUA e que se tomant numa 
(0 Pdbhco, 17Jan96) 
esp~e de "governador" deste f6ml temt6ao, cerca de. 
cmco pot cento da superffcm da Cro~tcla Na quahdade 
de adnnmstrador provzs6no, Kl~n_ possm autondade 
m~ma sobre as componentcs c~vd e nnhtar da nuss~io 
da ONU 
Desde meadcs de Dezembro que a ONU e a NATO 
decidtram "repartn" a sua mtervenf~o na ex-federa~o 
jugoslava A Ahan~a Atlgutlca desempenha 
actualmente urea funq~o determmante na B6snta, ao 
&ngtr a operaf~to "Esfor¢o Concertado', enquanto a 
ONU mant6m o comando das operacj3es na Crcdcta e 
Maced6ma FEsta decls~o do Conselho de Se~ranfa p~ 
oficmlmente termo tt Racassada Operac~o das Hacd3es 
Umdas para o Restabeleclmento da Confianqa na 
Crodcm (ONURC), cujos cfcctzvos forram rettrados na 
sua quase totahdade durante o ano passado, na 
sequSn~a das ofenstvas nuhtares croatas na EslavGnm 
Ocldental, em Mmo, e na Krajma, em Agosto 
Hcs termos do acordo assmado em Dayton, eesta 
regtlo dever~ ficar totahnente desmflRanzada 30 dtas 
ap~ a mstala~o efecuva no terreno da for~a da ONU, 
e prev~,se urn perfodo de transzf~o roAmmo de dora 
aries, finde o qual a regl~o dever~i regressar ao controlo 
efectlvo da Crro4iem Para os especmhstas mdRares 
Naf~eS Umdas envolwdos nesta operaf.~o, a tarefa mats 
&fled conmstmi em convencer os cerca de 20 mfl 
nuhcaancs sh'wos fortemente armados a entregarem as 
suas armas e acestarem o regresso da autcndad¢ de 
Zagreb ~t regl~o 
55 
Co-occurrences repeated m the do~ment, wlth I(x,y) ~ 3 (posslble links) 
p I(x,y) fd(x,y) f=(x) f=(y) fg(x,y) x y 
0 031037 3 001719 5 4474 99 82 da onu 
0025119 11 344917 3 0 10 0 eslavdma oriental 
0 025119 *9 703371 3 52 0 0 adrmmstra~o transR6na 
0 025119 *3 887126 3 1758 0 0 na eslavdma 
0022752 5 355113 3 151 70 10 conselho seguranCa 
0 020184 4 854552 3 1006 55 37 das na~es 
0 019714 8 907315 3 55 77 47 na~es umdas 
0 019371 4 967027 3 1006 77 56 das umdas 
0017277 10076007 2 16 5 0 cro~la s6mos 
0 017277 *4 257744 2 . 61 74 0 rcgl~o deve~ 
0017277 *4 107468 2 1006 0 0 das eslav6ma 
0 017277 *3 266078 2 561 16 0 entre cro,~cla 
0 015837 12 078945 2 6 10 6 capacetes azms 
0 015168 4 065371 2 73 354 10 v~o ser 
0 026902 10 807156 - 3 0 36 0 eslavdma 
0 026902 *9 554176 3 143 0 0 adrmmstraqao 
0 026902 *3 889618 3 4352 0 0 na 
0 023081 5 309673 3 32.5 175 21 conselho 
0 020268 5 054372 3 2351 121 88 das 
0 020018 9 199451 3 121 151 100 na~Oes 
0 019778 5 095578 3 2351 151 113 das 
0 019371 - 6 565658 2 54 21 1 croatia 
0 018465 *4 193060 2 2351 0 0 das 
0 015589 12 012423 2 18 26 18 capacetes 
0 015387 3 817080 2 169 947 21 v~o 
of the UN 
East Slavoma 
transttory 
admmzstratmn 
m Slavoma 
Security Councd 
of the Natwns 
Umted Natmns 
of the Umted 
Croatia Serbs 
regmn should 
of Slavoma 
between Croatza 
blue helmets 
wdl be 
corpus 216 319 words 
oriental . 
transR6na 
eslavdma 
seguranqa 
na¢6es 
umdas 
umdas 
s6rvlos 
eslav6ma 
aZUlS 
Set 
East Slavoma 
. transttory 
admmlstratton 
m Slavoma . 
Security Councd 
of the Nalzons 
Umted Nations 
of the Umted 
Croa~a Serbs 
of Slavoma 
blue helmets 
will be 
corpus 537 085 words 
p - resolving power 
I(×,y) - mutual mfonnauon 
fa(x,y) - frequency of parr x,y m the document 
* pmrs ohmmatcd by ovcresUmate adjustment (of section 4) 
• f~(x)- frequency of wordx m the corpus 
fc(y) - frequency of word y m the corpus 
f~(x,y) - frequency of pmr x,y m the corpus 
56 
Matrixes shovang the number of hnks between paragraphs, and structures obtmned by connecting each new 
parfigraph to the preceding one voth more hnks m common The connecuons used m the construcuon of the 
structure (cf secUon 5) are slgnalcd m bold face .~= i ..... . 
without anaphora resoluUon 
§ 1 2 3 4 5 6 Y~ 
1 2 1 1 1 0 5 
2 2 • 0 1 2 1 6 
3 1 0 ,,. 0 0 011 
4 1 1 0-: 002 
5 I 2 0 0 I 4 
6 0 1 0 O 1 2 
• da ONU ~ das Naf~es Umdas 
(of the UN ~=) of the Umted Natmns) 
§ 1 2 3 4 5 6 Z 
1 ~i3 1 2 2 1 9 
2 3l 02218 
3 1 0 .1 0 0 i0 .I 
4 2 2 0 I 1il 6 
5 2 2 0 1 1 6 
6 1 1 0 1 1! 4 
with full anaphora resoluuon 
§ 1 2 3 4 5 6 
'i 1 5 3 5 4 4 19 2 1 4 3 3 16 
3 3 1 I" 2 3 3 12 
4 5 4! 2 .,.~ 2 4 17 
5 4 3l 3 2 ,, 2 14 
643i342 ..... 16 
mtmttve structure 
I 
2 3 6 
4 5 
vathout anaphora rcsoluuon 
1 
2 3 
4 
I 
6 
ONU ~ ~ Na~es 
Um~ 
1 
2 3 
4 5 
I 
6 
with full anaphora resolutmn 
! 
2 3 4 5 
I 
6 
57 

References
Church, K and Hanks, P (1990) Word assocmUon
norms, mutual mfunnaUon, and ' lexacography.
Computatlonal Lmgmsttcs, 16 (I), p 22-29

Maarek, Y (1992) Automatically constructmg snnple
help systems from natural language representaUon s
In P Jacobs Ed, Text-based mtelhgent systems
current research and practice m mformatton
extraction and retrteval, Lawrence Erlbaum
Assocaates Pubhshers, Hdlsdale, New 3ersey, p
243-256

Manldm, M (1991) Conceptual mformatwn retrieval
, a case stu@ m adaptatn, e partml parsing, Kluwer
Acadermc Press, Dordrecht

Salton, G and Allan, J (1995) SelecUve text
utthzatton and text traversal Internatmnal Journal of
Human-Computer Studtes, 43, p 483-49.
