File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2166_metho.xml
Size: 7,470 bytes
Last Modified: 2025-10-06 14:14:22
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2166"> <Title>Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences</Title> <Section position="3" start_page="0" end_page="986" type="metho"> <SectionTitle> 2 A System for Generating Text Abstracts </SectionTitle> <Paragraph position="0"> Kupiec et al. (?) present the results of a study where 80% of the sentences in man-made abstracts were &quot;close sentence matches&quot;, i.e., they were &quot;either extracted verbatim from the original or with minor modifications&quot; (p.70). Therefore, we argue that it is not only an easy way but indeed an appropriate one for an automatic system to choose a number of the most relevant sentences and present 1By &quot;satisfying&quot; we mean at least indicative for the content of ~he respective text, if not also informative about it.</Paragraph> <Paragraph position="1"> these as a &quot;text; abstract;&quot; to the user. ~ We further argue that; coherence, although certainly desirable, is imi)ossible without a large scale knowledge based 1;ext mldersl;an(ling syst;em which would not only slow down dm l)erformance signiticantly but necessarily could not be domain inde,1)endent.</Paragraph> <Paragraph position="2"> Our design goal was to use as simple and efflcleat an algorithm as t)ossibh',, avoiding &quot;hem(stics&quot; and &quot;fe, al;ures&quot; emph)yed by other systems (e.g., (?)) wlfich may be hell)tiff in a specific text domain but would have to be redesigned whenever it were ported to a new domain, a In this respect, our system can be compared with the approach of (?) wit() also t)resent an abstracting system for general domain texts. However, whereas their focus is on the evaluation of abstracl; readability (as stand-alone texts), ours is rather on abstract relevance. A flirther difference is the (non-standard) method of tf*idfweight ('ah:ulation timy are using for their system.</Paragraph> <Paragraph position="3"> Our sysl;em was deveh)ped in C+.t-, using libraries for dealing with texts marke(l ut) in SGML format. The algorithm performs the following sl;et)s: 4 1. Take an arl;Me fl'om the corl)uS 5 and lmild a word weight; matrix for all contellt words across all sentences (l;f*idf (:omputal;ion, where the idf-vahms ttte r(> trieved fl'om a preconqmted file). (; Iligit fre(tuency closed class words (like A, THE, ON etc.) are excluded via a stop list file.</Paragraph> <Paragraph position="4"> 2. Determine the sentence weights for all senten(:es in tim arl;Me: Compltt;e the sum over 2Clem'ly, there will be less (:oherence than in a man-made abstract, but, the extracted passages can t)e presented in a way which indicates their relative position in tim text, thus avoiding a possil)ly wrong inti)ression of adjacency.</Paragraph> <Paragraph position="5"> aln fact,, it t,urned out that fact,ors which couhl 1)e thought of as %l)ecitic for newspaper articles&quot;, su(:h as increased weights for title words or sentences in the beginning, did not have a sign(titan( eriect (m the sys|;elll~s per\['orntance,.</Paragraph> <Paragraph position="6"> 4Due to space limitations, we cannot, give all tilt; details here. The reader is ref('xred t,O (?) for there information on this algorithm, various odter nte, thotls that were tested and their respective result,s. (Tiffs paper can I)e el)rained Kern t,im author's heine 1)age whose URL is: ht tp://www.h:l.cmu.e(lu/~zechner/klaus.htnfl.) '~'We used the Daily Telegral)h Corpus which comprises approx. 44.000 articles (15 million words). (~tt*idf=term frequency in a document (tfk) times t,he logarithm of the nunlber of documents in a collecl;ion (N), divi(led I)y the IlnI\[lber of do(;untents where this term oc(:nrs (Ilk): tfk * log ~_N This formula n k yields a high numl)er for words which are frequent in one dneument but; api)e.ar in very few documents only; hence, they can be considered a.s &quot;indicative&quot; fbr the respective document.</Paragraph> <Paragraph position="7"> all tf*idf-values of the (:on(eat words 7 for each sentence, s 3. Sort the sentences according to l;heir weights and extract the N highest weighted sentences in text order to yield (,he abstract of the docllHleltt. null To r(~thtce the size of the vocabulary, our system (;()nv(',rts every word to Ul)I)er (:ase and (runt:ales words after the sixth character. This is also rout:it faster than a word stemming algorithm which has to perh)rm a inorphological analysis. For our experiment;s, the, amount of new ambiguities thereby introduced did not cause specific problems for tim system.</Paragraph> <Paragraph position="8"> For the test set, we (:host', 6 articles fl'om the corires whi(:h are (:los(; t;o tim gh)bal cortms a,verage of \] 7 senl;en(:es per ardcl(;; these artich',s (:ontain approx. 550 words alt(l 22 sentences on the, average (range: 19 23). All these artMes are at)out a single topic, i)robably becmme of our choi(:e al)out a ret)resenl;ative text, lengdL We (lo not address ttm issue of multi-topicality here; however, it is well-known that texts with more (hall olle tel)it are. hm'd to deal wit;it for all kinds of Ill. systeltlS. E.g., the ANES system, described i)y ('?), tries to i(lenl;iily l;hese texts beforehand to 1)e ex(:luded fl'om abstracl;ing.</Paragraph> <Paragraph position="9"> The system's rllll-til\[te ()It a SUN St)arc workstal;ion (UNIX, SUN OS 4.1.3) is approx. 3 seconds for an article of th(; test, set.</Paragraph> </Section> <Section position="4" start_page="986" end_page="987" type="metho"> <SectionTitle> 3 Experiment: Abstracts as </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="986" end_page="987" type="sub_section"> <SectionTitle> Extracts Generated by Human Subjects </SectionTitle> <Paragraph position="0"> In order to bc able to ewfluate the quality of tim abstracts t)rodueed by our system, we, conducted an experiment where we asked 13 human subjects to choose the &quot;most relevant 5-7 sentences&quot; from the six articles Dom the test set. 9 \]b t;~cilitate their (;ask, the subjects should first give each of the sentences in an artMe a &quot;relewmce score&quot; from l (&quot;barely relewmt&quot;)to 5 (&quot;highly relevant;&quot;) and finally choose tit(', trust scored sentences for th(;ir abstracts. The subjects were all native speakers of English (since we used an Englistl cortms) and were. paid for their task. Compared l;o about 3 set:ends for the machine system, the hmnans nee(h;d rThis provides a bias towards longer sentences. Experiment,s with methods that normalized for sentence length yiehled worse results, so dtis bias appears to be apI)roI)riate.</Paragraph> <Paragraph position="1"> SWords in the title and/or appearing in t,ln! first,/last few sent,enees (:all be given Inore weight by tneans of an editable parame.l;e.r tilt;. It turns out,, however, that, these weights do not, lead to an improvement, of the syst,em's performance.</Paragraph> <Paragraph position="2"> 9This number corresponds in fact, well to the observation of (Y) that, the opt,ilnal smnmary length is bet;ween 20% and 30% of the original document length.</Paragraph> <Paragraph position="3"> about 8 minutes (two orders of magnitude more time) for determining the most relevant sentences for an article.</Paragraph> </Section> </Section> class="xml-element"></Paper>