<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0710">
  <Title>Sentence extraction as a classification task</Title>
  <Section position="4" start_page="58" end_page="59" type="metho">
    <SectionTitle>
2 Sentence selection as classification
</SectionTitle>
    <Paragraph position="0"> In Kupiec et al.'s experiment, the gold standard sentences are those summary sentences that can be aligned with sentences in the source texts. Once the alignment has been carried out, the system tries to determine the characteristic properties of aligned sentences according to a number of features, viz. presence of particular cue phrases, location in the text, sentence length, occurrence of thematic words, and occurrence of proper names. Each document sentence receives scores for each of the features, resulting in an estimate of the sentence's probability to also occur in the summary. This probability is calculated as follows:</Paragraph>
    <Paragraph position="2"> P(s ∈ S | F1, ..., Fk) = P(s ∈ S) · ∏_{j=1}^{k} P(Fj | s ∈ S) / ∏_{j=1}^{k} P(Fj), where P(s ∈ S | F1, ..., Fk) is the probability that sentence s in the source text is included in summary S, given its feature values; P(s ∈ S) the compression rate (constant); P(Fj | s ∈ S) the probability of the feature-value pair occurring in a sentence which is in the summary; P(Fj) the probability that the feature-value pair occurs unconditionally; k the number of feature-value pairs; and Fj the j-th feature-value pair. Assuming statistical independence of the features, P(Fj | s ∈ S) and P(Fj) can be estimated from the corpus. Evaluation relies on cross-validation: the model is trained on a training set of documents, holding one document out at a time (the current test document). The model is then used to extract candidate sentences from the test document, allowing evaluation of precision (sentences selected correctly over total number of sentences selected) and recall (sentences selected correctly over alignable sentences in the summary). Since from any given test text as many sentences are selected as there are alignable sentences in the summary, precision and recall are always the same. Kupiec et al. report that precision of the individual heuristics ranges between 20-33%; the highest cumulative result (44%) was achieved using the paragraph, fixed phrases and length cut-off features.</Paragraph>
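To make the combination step concrete, the scoring rule above can be sketched in a few lines of Python. This is an illustrative reimplementation, not Kupiec et al.'s code: the function and variable names, the toy probabilities, and the use of log-space (which leaves the ranking unchanged) are our own choices.

```python
import math

def kupiec_score(feature_values, p_summary, p_feat_given_summary, p_feat):
    """Score one sentence: P(s in S | F1..Fk) under the independence
    approximation, computed in log-space.

    feature_values       : list of feature-value pairs for the sentence
    p_summary            : P(s in S), the compression rate (a constant)
    p_feat_given_summary : dict mapping feature-value -> P(Fj | s in S)
    p_feat               : dict mapping feature-value -> P(Fj)
    """
    log_p = math.log(p_summary)
    for fv in feature_values:
        log_p += math.log(p_feat_given_summary[fv]) - math.log(p_feat[fv])
    return log_p

# Toy example with two features (illustrative numbers only):
p_given_s = {("cue", "+"): 0.6, ("loc", "peripheral"): 0.5}
p_uncond  = {("cue", "+"): 0.2, ("loc", "peripheral"): 0.25}
s1 = kupiec_score([("cue", "+"), ("loc", "peripheral")], 0.05, p_given_s, p_uncond)
s2 = kupiec_score([("cue", "+")], 0.05, p_given_s, p_uncond)
assert s1 > s2  # matching more summary-typical features ranks a sentence higher
```

Since the constant P(s ∈ S) is shared by all sentences, it does not affect the ranking; only the per-feature likelihood ratios do.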
  </Section>
  <Section position="5" start_page="59" end_page="291" type="metho">
    <SectionTitle>
3 Our experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
3.1 Data and gold standards
</SectionTitle>
      <Paragraph position="0"> Our corpus is a collection of 202 papers from different areas of computational linguistics, with summaries written by the authors.1 The average length of the summaries is 4.7 sentences; the average length of the documents 210 sentences. We semi-automatically marked up the following structural information: title, summary, headings, paragraph structure and sentences. Tables, equations, figures, captions, references and cross references were removed and replaced by place holders. (Footnote 1: Converted from LaTeX source into HTML in order to extract raw text and minimal structure automatically, then transformed into our SGML format with a perl script, and manually corrected. Data collection took place col-)</Paragraph>
      <Paragraph position="2"> We decided to use two gold standards: * Gold standard A: Alignment. Gold standard sentences are those occurring in both author summary and source text, in line with Kupiec et al.'s gold standard. * Gold standard B: Human Judgement.</Paragraph>
      <Paragraph position="3"> Gold standard sentences are non-alignable source text sentences which a human judge identified as relevant, i.e. indicative of the contents of the source text. Exactly how many human-selected sentence candidates were chosen was the human judge's decision. Alignment between summary and document sentences was assisted by a simple surface similarity measure (longest common subsequence of non-stoplist words). Final alignment was decided by a human judge; the criterion was similarity of semantic contents of the compared sentences. The following sentence pair illustrates a direct match:</Paragraph>
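The surface similarity measure is described only as the longest common subsequence of non-stoplist words. A minimal sketch of such a measure follows; the placeholder stoplist, whitespace tokenization, and normalization by the shorter sentence are our assumptions, not details given in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (standard DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

STOPLIST = {"a", "an", "the", "of", "in", "is", "to"}  # placeholder, not the paper's list

def surface_similarity(sent1, sent2):
    """LCS of non-stoplist words, normalized by the shorter sentence length."""
    t1 = [w for w in sent1.lower().split() if w not in STOPLIST]
    t2 = [w for w in sent2.lower().split() if w not in STOPLIST]
    if not t1 or not t2:
        return 0.0
    return lcs_length(t1, t2) / min(len(t1), len(t2))
```

A measure like this only proposes candidate pairs; as the text notes, the final alignment decision rests with a human judge.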
      <Paragraph position="4"> Summary: In understanding a reference, an agent determines his confidence in its adequacy as a means of identifying the referent. Document: An agent understands a reference once he is confident in the adequacy of its (inferred) plan as a means of identifying the referent. Our data show an important difference with Kupiec et al.'s data: we have significantly lower alignment rates. Only 17.8% of the summary sentences in our corpus could be automatically aligned with a document sentence with a certain degree of reliability, and only 3% of all summary sentences are identical matches with document sentences. We created three different sets of training material. Training set 1: The 40 documents with the highest rate of overlap; 84% of the summary sentences could be semi-automatically aligned with a document sentence. Training set 2: 42 documents from the year 1994 were arbitrarily chosen out of the remaining 163 documents and semi-automatically aligned. They showed a much lower rate of overlap; only 36% of summary sentences could be mapped into a document sentence. Training set 3: 42 documents from the year 1995 were arbitrarily chosen out of the remaining documents and semi-automatically aligned. Again, the overlap was rather low: 42%. Training set 123: Conjunction of training sets 1, 2 and 3. The average document length is 194 sentences, the average summary length is 4.7 sentences. A human judge provided a mark-up of additional abstract-worthy sentences for these 3 training sets (124 documents). The remaining 78 documents remain as unseen test data. Figure 1 shows the composition of gold standards for our training sets. Gold standard sentences for training set 1 consist of an approximately balanced mixture of aligned and human-selected candidates, whereas training set 2 contains three times as many human-selected as aligned gold standard sentences, training set 3 even four times as many. Each document in training set 1 is associated with an average of 7.75 gold standard sentences (A+B), compared to an average of 7.07 gold standard sentences in training set 2, and an average of 9.14 gold standard sentences in training set 3.</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="2" start_page="59" end_page="291" type="sub_section">
      <SectionTitle>
3.2 Heuristics
</SectionTitle>
      <Paragraph position="0"> We employed 5 different heuristics: 4 of the methods used by Kupiec et al. (1995), viz. the cue phrase method, location method, sentence length method and thematic word method, and another well-known method in the literature, viz. the title method. 1. Cue phrase method: The cue phrase method seeks to filter out meta-discourse from subject matter. We advocate the cue phrase method as our main method because of the additional 'rhetorical' context these meta-linguistic markers make available. This context of the extracted sentences - along with their propositional content - can be used to generate more flexible abstracts. We use a list of 1670 negative and positive cues and indicator phrases or formulaic expressions, 707 of which occur in our training sets. For simplicity and efficiency, these cue phrases are fixed strings. Our cue phrase list was manually created by a cycle of inspection of extracted sentences, identification of as yet unaccounted-for expressions, addition of these expressions to the cue phrase list, and possibly inclusion of overlooked abstract-worthy sentences in the gold standard. Cue phrases were manually classified into 5 classes, which we expected to correspond to the likelihood that a sentence containing the given cue is included in the summary: a score of -1 means 'very unlikely', +3 means 'very likely to be included in a summary'.2 We found it useful to assist the decision process with corpus frequencies: for each cue phrase, we compiled its relative frequency in the gold standard sentences and in the overall corpus. If a cue phrase proved general (i.e. it had a high relative corpus frequency) and distinctive (i.e. it had a high frequency within the gold standard sentences), we gave it a high score, and included other phrases that are syntactically and semantically similar to it into the cue list. We scanned the data and found the following tendencies: * Certain communicative verbs are typically used to describe the overall goals; they occur frequently in the gold-standard sentences (argue, propose, develop and attempt). Others are predominantly used for describing communicative sub-goals (detailed steps and subarguments) and should therefore be in a different equivalence class (prove, show and conclude). Within the class of communicative verbs, tense and mode seem to be relevant for abstract-worthiness: verbs in past tense or present perfect (as used in the conclusion) are more likely to refer to global achievements/goals, and thus to be included in the summary; in the body of the text, present and future forms tend to be used to introduce sub-tasks. (Footnote 2: We experimented with larger and smaller numbers of classes, but obtained best results with the 5-way distinction.) * Genre-specific nominal phrases like 'this paper' are more distinctive when they occur at the beginning of the sentence (as an approximation to subject/topic position) than their non-subject counterparts. * Explicit summarisation markers like 'in sum' or 'concluding' did occur frequently, but quite unexpectedly almost always in combination with communicative sub-tasks; they were therefore less useful at signalling abstract-worthy material. Sentences in the source text are matched against expressions in the list. Matching sentences are classified into the corresponding class, and sentences not containing cue phrases are classified as 'neutral' (score 0). Sentences with competing cue phrases are classified as members of the class with the higher numerical score, unless one of the competing classes is negative. Sentences occurring directly after headings like Introduction or Results are valuable indicators of the general subject area of papers. Even though one might argue that this property should be handled within the location method, we perceive this information as meta-linguistic (and thus logically belonging to the cue phrase method); thus, these sentences receive a prior score of +2 ('likely to occur in a summary'). In a later section, we show how this method performs on unseen data of the same kind (viz. texts in the genre of computational linguistics research papers of about 6-8 pages long). Even though the cue phrase method is well tuned to these data, we are aware that the list of phrases we collected might not generalize to other genres; some kind of automation seems desirable to assist a possible adaptation. 2. Location method: Paragraphs at the start and end of a document are more likely to contain material that is useful for a summary, as papers are organized hierarchically. Paragraphs are also organized hierarchically, with crucial information at the beginning and the end of paragraphs. Therefore, sentences in document-peripheral paragraphs should be good candidates, and even more so if they occur in the periphery of the paragraph. Our algorithm assigns non-zero values only to sentences which are in document-peripheral sections; sentences in the middle of the document receive a 0 score. The algorithm is sensitive to prototypical headings (Introduction); if such headings cannot be found, it uses a fixed range of paragraphs (first 7 and last 3 paragraphs). Within these document-peripheral paragraphs, the values 'i_f' and 'm' (for paragraph initial-or-final and paragraph-medial sentences, respectively) are assigned. 3. Sentence length method: All sentences under a certain length (current threshold: 15 tokens including punctuation) receive a 0 score, all sentences above the threshold a 1 score. Kupiec et al. mention this method as useful for filtering out captions, titles and headings. In our experiment, this was not necessary, as our format encodes headings and titles as such, and captions are removed. As expected, it turns out that the sentence length method is our least effective method. 4. Thematic word method:
This method tries to identify key words that are characteristic for the contents of the document. It concentrates on non-stoplist words which occur frequently in the document, but rarely in the overall collection. In theory, sentences containing (clusters of) such thematic words should be characteristic for the document. We use a standard term-frequency*inverse-document-frequency (tf*idf) measure:</Paragraph>
      <Paragraph position="2"> tf*idf(w) = f(d,w) · log(N / df(w)), where f(d,w) is the frequency of word w in the document, df(w) the number of documents containing word w, and N the number of documents in the collection. The 10 top-scoring words are chosen as thematic words; sentence scores are then computed as a weighted count of thematic words in the sentence, normalized by sentence length. The 40 top-rated sentences get score 1, all others 0. 5. Title method: Words occurring in the title are good candidates for document-specific concepts. The title method score of a sentence is the mean frequency of title word occurrences (excluding stoplist words). The 18 top-scoring sentences receive the value 1, all other sentences 0. We also experimented with taking words occurring in all headings into account (these words were scored according to the tf*idf method) but received better results for title words only. Figure 3: First experiment. Difference between unseen and seen data, training set 3, gold standards A+B.</Paragraph>
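A minimal sketch of the thematic word heuristic under the tf*idf weighting described above. The fallback for unseen document frequencies and the exact form of the length normalization are our assumptions; the cutoffs (10 words, 40 sentences) follow the text.

```python
import math
from collections import Counter

def thematic_words(doc_tokens, doc_freq, n_docs, k=10):
    """Rank words by tf*idf(w) = f(d, w) * log(N / df(w)) and keep the top k.

    doc_tokens: non-stoplist tokens of one document
    doc_freq  : word -> number of documents in the collection containing it
    n_docs    : N, total number of documents in the collection
    """
    tf = Counter(doc_tokens)
    # .get(w, 1) guards against unseen words; a real system would smooth properly.
    scores = {w: f * math.log(n_docs / doc_freq.get(w, 1)) for w, f in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def sentence_theme_score(sentence_tokens, thematic):
    """Count of thematic words in the sentence, normalized by sentence length
    (one reading of 'weighted count ... normalized by sentence length')."""
    if not sentence_tokens:
        return 0.0
    return sum(1 for w in sentence_tokens if w in thematic) / len(sentence_tokens)

# Toy collection: 'parser' is frequent here but rare elsewhere, so it wins.
doc = ["parser", "parser", "parser", "grammar", "grammar", "corpus"]
df = {"parser": 2, "grammar": 50, "corpus": 10}
top = thematic_words(doc, df, n_docs=100, k=1)
assert top == ["parser"]
```

The 40 sentences with the highest such scores would then receive the binary feature value 1, all others 0.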
    </Section>
    <Section position="3" start_page="291" end_page="291" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> Training and evaluation took place as in Kupiec et al.'s experiment. As a baseline we chose sentences from the beginning of the source text, which obtained a recall and precision of 28.0% on training set 123. This from-top baseline (which is also used by Kupiec et al.) is a more conservative baseline than random order: it is more difficult to beat, as prototypical document structure places a high percentage of relevant information in the beginning. Note that the contribution of a method cannot be judged by the individual precision/recall for that method.3 For example, the sentence length method (method 3), with a recall and precision over the baseline, contributes hardly anything to the end result, whereas the title method (method 5), which is below the baseline if regarded individually, performs much better in combination with methods 1 and 2 than method 3 does (67.3% for heuristics 1, 2 and 5; not to be seen from this table). The reason for this is the relative independence of the methods: if method 5 identifies a successful candidate, it is less likely that this candidate has also been identified by method 1 or 2. Method 4 (tf*idf) decreased results slightly in some of the experiments, but not in the (Footnote 3: All figures in tables are precision percentages.)</Paragraph>
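Because the extractor always selects exactly as many top-ranked sentences per document as that document has gold standard sentences, precision equals recall by construction. A small sketch of that evaluation step (all names illustrative):

```python
def evaluate_document(ranked_ids, gold_ids):
    """Select the top-n ranked sentences, n = number of gold sentences;
    precision then equals recall because |selected| == |gold|."""
    n = len(gold_ids)
    selected = set(ranked_ids[:n])
    correct = len(selected & set(gold_ids))
    precision = correct / n  # |selected| == n by construction
    recall = correct / n
    assert precision == recall
    return precision

# Sentences ranked [3, 1, 7, 2]; gold standard is {1, 2}: top-2 pick is {3, 1},
# so one of two selections is correct.
assert evaluate_document([3, 1, 7, 2], {1, 2}) == 0.5
```

Averaging this value over held-out documents gives the cross-validated figures reported in this section.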
      <Paragraph position="2"> (Figure caption fragment: heuristic and combination, gold standards A+B.) experiments with our final/largest training set 123, where it led to a (non-significant) increase. We also checked how much precision and recall decrease for unseen data. This decrease applies only to the cue phrase method, because the other heuristics are fixed and would not change by seeing more data. After the manual mark-up of gold standard sentences and additions to the cue phrase list for training set 3, we treated training set 3 as if it was unseen: we used only those 1423 cue phrases for extraction that were compiled from training sets 1 and</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="291" end_page="291" type="metho">
    <SectionTitle>
2 A comparison of this 'unseen' result to the end
</SectionTitle>
    <Paragraph position="0"> result (Figure 3) shows that our cue phrase list, even though hand-crafted, is robust and general enough for our purposes; it generalizes reasonably well to texts of a similar kind. Figure 4 shows mean precision and recall for our different training sets for three different extraction methods: a combination of all 5 methods ('comb'), the best single heuristic ('cue'), and the baseline ('base'). We used both gold standards A+B. These results reconfirm the usefulness of Kupiec et al.'s method of heuristic combination: the method increases precision for the best method by around 20%. It is worth pointing out that this method produces very short excerpts, with compressions as high as 2-5%, and with a precision equal to the recall. Thus this is a different task from producing long excerpts, e.g. with a compression of 25%, as usually reported in the literature. Using this compression, we achieved a recall of 96.0% (gold standard A), 98.0% (gold standard B) and 97.3% (gold standards A+B) for training set 123. For comparison, Kupiec et al. report a 85% recall. In order to see how the different gold standards contribute to the results, we used only one gold standard (A or B) at a time for training and for extraction. Figure 5 summarizes the results. Looking at gold standard A, we see that training set 1 is the only training set which obtains a recall that is comparable to Kupiec et al.'s. Incidentally, training set 1 is also the only training set that is comparable to Kupiec et al.'s data with respect to alignability. The bad performance of training sets 2 and 3 under evaluation with gold standard A is not surprising, as there are too few aligned gold standard sentences to train on: 50% of the documents in these training sets contain no or only one aligned sentence. (Figure caption fragment: standard on precision and recall, as a function of compression.) Overall, performance seems to correspond to the ratio of gold standard sentences to source text sentences, i.e. the compression of the task.4 The dependency between precision/recall and compression is depicted in Figure 6. Taking both gold standards into account increases performance considerably compared to either of the gold standards alone, because of the lower compression. As we don't have training sets with exactly the same number of gold standard A and B sentences, we cannot directly compare the performance, but the graph is suggestive of a similar behaviour of both gold standards. The results for training set 123 fall between the results of the individual training sets (symbolized by the large symbols). (Figure caption fragment: material on precision and recall, gold standards A+B.) From this second experiment we conclude that for our task, there is no difference between gold standard A and B. The crucial factor that precision and recall depend on is the compression of the task. In order to evaluate the impact of the training material on precision and recall, we computed each possible pair of training and evaluation material (cf. Figure 7). In this experiment, all documents of the training set are used to train the model; this model is then evaluated against each document in the test set, and the mean precision and recall is reported. Importantly, in this experiment none of the other documents in the test set is used for training. These experiments show a surprising uniformity within test sets: overall extraction results for each training set are very similar. Training on different data does not change the statistical model much. In most cases, extraction for each training set worked best when the model was trained on the training set itself, rather than on more data. Thus, the difference in results between individual training sets is not an effect of data sparseness at the level of heuristics combination. We conclude from this third experiment that improvement in the overall results can primarily be achieved by improving single heuristics, and not by providing more training data for our simple statistical model.</Paragraph>
  </Section>
  <Section position="7" start_page="291" end_page="291" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> Comparing our experiment to Kupiec et al.'s, the most obvious difference is the difference in data. Our texts are likely to be more heterogeneous, coming from areas of computational linguistics with different methodologies and thus having an argumentative, experimental, or implementational orientation. Also, as they are not journal articles, they are not heavily edited. There is also less of a prototypical article structure in computational linguistics than in experimental disciplines like chemical engineering. This makes our texts more difficult to extract from. The major difference, however, is that we use summaries which are not written by trained abstractors, but by the authors themselves. In only around 20% of documents in our original corpus, sentence selection had been used as a method for summary generation, whereas professional abstractors rely more heavily and systematically on sentences in the source text when creating their abstracts. Using aligned sentences as gold standard has two main advantages. First, it makes the definition of the gold standard less labour-intensive. Second, it provides a higher degree of objectivity: it is a much simpler task for a human judge to decide if two sentences convey the same propositional content, than to decide if a sentence is qualified for inclusion in a summary or not. However, using alignment as the sole definition for gold standard implies that a sentence is only a good extraction candidate if its equivalent occurs in the summary, an assumption we believe to be too restrictive. Document sentences other than the aligned ones might have been similar in quality to the chosen sentences, but will be trained on as a negative example with Kupiec et al.'s method. Kupiec et al. also recognize that there is not only one optimal excerpt, and mention Rath et al.'s (1961) research which implies that the agreement between human judges is rather low. We argue that it makes sense to complement aligned sentences with manually determined supplementary candidates. This is not solely motivated by the data we work with but also by the fact that we envisage a different task than Kupiec et al. (who use the excerpts as indicative abstracts). We see the extraction of a set of sentences as an intermediate step towards the eventual generation of more flexible and coherent abstracts of variable length. For this task, a whole range of sentences other than just the summary sentences might qualify as good candidates for further processing.5 One important subgoal is the reconstruction of approximated document structure (cf. rhetorical structure, as defined in RST (Mann et al., 1992)). One of the reasons why we concentrated on cue phrases was that we believe that cue phrases are an obvious and easily accessible source of rhetorical information. Another important question was if there were other properties following from the main difference between our training sets, alignability. Are documents with a high degree of alignability inherently more suitable for abstraction by our algorithm? (Footnote 5: This is mirrored by the fact that in our gold standards, the number of human-selected sentence candidates outweighed aligned sentences by far.) It might be suspected that alignability is correlated with a better internal structure of the papers, but our experiments suggest that, for the purpose of sentence extraction, this is either not the case or not relevant. Our results show that our training sets 1, 2 and 3 behave very similarly under evaluation, taking aligned gold standards or human-selected gold standards into account. The only definite factor influencing the results was the compression rate. With respect to the quality of abstracts, this implies that the strategy which authors use for summary generation - be it sentence selection or complete re-generation of the summary from semantic representation - is a matter of authorial choice and not an indicator of style, text quality, or any aspect that our extraction program is particularly sensitive to. This means that Kupiec et al.'s method of classificatory sentence selection is not restricted to texts which have high-quality summaries created by human abstractors. We claim that adding human-selected gold standards will be more useful for the generation of flexible and coherent abstracts than training on just a fixed number of author-provided summary sentences would allow.</Paragraph>
  </Section>
</Paper>