<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1072"> <Title>The Automated Acquisition of Topic Signatures for Text Summarization</Title> <Section position="4" start_page="495" end_page="495" type="metho"> <SectionTitle> 3 SUMMARIST </SectionTitle> <Paragraph position="0"> SUMMARIST (Hovy and Lin, 1999) is a system designed to generate summaries of multilingual input texts. At this time, SUMMARIST can process English, Arabic, Bahasa Indonesia, Japanese, Korean, and Spanish texts. It combines robust natural language processing methods (morphological transformation and part-of-speech tagging), symbolic world knowledge, and information retrieval techniques (term distribution and frequency) to achieve high robustness and better concept-level generalization. The core of SUMMARIST is based on the following 'equation': summarization = topic identification + topic interpretation + generation.</Paragraph> <Paragraph position="1"> These three stages are: Topic Identification: Identify the most important (central) topics of the texts. SUMMARIST uses positional importance, topic signature, and term frequency. Importance based on discourse structure will be added later. This is the most developed stage in SUMMARIST.</Paragraph> <Paragraph position="2"> Topic Interpretation: To fuse concepts such as waiter, menu, and food into one generalized concept restaurant, we need more than the simple word aggregation used in traditional information retrieval. We have investigated concept counting and topic signatures to tackle the fusion problem.[3] [Footnote 3: We would like to use only the relevant parts of documents to generate topic signatures in the future. Text segmentation algorithms such as TextTiling (Hearst, 1997) can be used to find subtopic segments in text.]</Paragraph> <Paragraph position="3"> Summary Generation: SUMMARIST can produce keyword and extract type summaries.</Paragraph> <Paragraph position="4"> [Figure 1: An ABCNEWS.com story about the EgyptAir Flight 990 crash investigation and the summary generated by SUMMARIST.]</Paragraph> <Paragraph position="5"> Figure 1 shows an ABC News page summary about EgyptAir Flight 990 by SUMMARIST. SUMMARIST employs several different heuristics in the topic identification stage to score terms and sentences. The score of a sentence is simply the sum of all the scores of content-bearing terms in the sentence. These heuristics are implemented in separate modules using inputs from preprocessing modules such as the tokenizer, part-of-speech tagger, morphological analyzer, term frequency and tfidf weights calculator, sentence length calculator, and sentence location identifier. We only activate the position module, the tfidf module, and the topic signature module for comparison. We discuss the effectiveness of these modules in Section 6.</Paragraph> </Section>
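Section 3 above defines a sentence's score under a given heuristic module as the sum of the scores of the content-bearing terms it contains. A minimal sketch of that scheme, assuming tokenized sentences and a per-term score table produced by one module (the function names and the stopword filter are our illustration, not part of SUMMARIST):

```python
from typing import Dict, List

def score_sentence(sentence_terms: List[str],
                   term_scores: Dict[str, float],
                   stopwords: set) -> float:
    """Sum the module's scores of the content-bearing terms in a
    sentence; stopwords are treated as non-content-bearing."""
    return sum(term_scores.get(t, 0.0)
               for t in sentence_terms if t not in stopwords)

def rank_sentences(sentences: List[List[str]],
                   term_scores: Dict[str, float],
                   stopwords: set) -> List[int]:
    """Return sentence indices ordered by descending score, so the
    top of the list is what an extract-type summary would keep."""
    scores = [score_sentence(s, term_scores, stopwords) for s in sentences]
    return sorted(range(len(sentences)), key=lambda i: -scores[i])
```

The same scaffolding serves any of the modules (position, tfidf, or topic signature); only the contents of term_scores change.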
<Section position="5" start_page="495" end_page="496" type="metho"> <SectionTitle> 4 Topic Signatures </SectionTitle> <Paragraph position="0"> Before addressing the problem of world knowledge acquisition head-on, we decided to investigate what type of knowledge would be useful for summarization. After all, one can spend a lifetime acquiring knowledge in just a small domain. But what is the minimum amount of knowledge we need to enable effective topic identification, as illustrated by the restaurant-visit example? Our idea is simple.</Paragraph> <Paragraph position="1"> We would collect a set of terms[4] that were typically highly correlated with a target concept from a preclassified corpus such as the TREC collections, and then, during summarization, group the occurrences of the related terms by the target concept. For example, we would replace joint instances of table, menu, waiter, order, eat, pay, tip, and so on, by the single phrase restaurant-visit in producing an indicative summary. We thus define a topic signature as a family of related terms, as follows:

TS = {topic, signature} = {topic, <(t1, w1), (t2, w2), ..., (tn, wn)>}    (1)

where topic is the target concept and signature is a vector of related terms. Each ti is a term highly correlated to topic with association weight wi. The number of related terms n can be set empirically according to a cutoff associated weight. We describe how to acquire related terms and their associated weights in the next section.</Paragraph>
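As an illustration of Equation 1, a topic signature is naturally held as a head concept plus a weight-sorted term vector truncated at the cutoff associated weight. This sketch, with a hypothetical restaurant-visit fragment and invented weights, is ours rather than the paper's:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TopicSignature:
    """TS = {topic, <(t1, w1), ..., (tn, wn)>}: a target concept and
    a vector of related terms with association weights."""
    topic: str
    signature: List[Tuple[str, float]]  # sorted by descending weight

    def truncate(self, cutoff: float) -> "TopicSignature":
        """Set n empirically by dropping terms below the cutoff weight."""
        kept = [(t, w) for (t, w) in self.signature if w >= cutoff]
        return TopicSignature(self.topic, kept)

# Hypothetical weights for illustration only.
restaurant = TopicSignature("restaurant-visit",
                            [("waiter", 92.3), ("menu", 81.7),
                             ("tip", 54.2), ("eat", 40.1), ("table", 12.9)])
print(restaurant.truncate(cutoff=40.0).signature)
```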
<Section position="1" start_page="496" end_page="496" type="sub_section"> <SectionTitle> 4.1 Signature Term Extraction and Weight Estimation </SectionTitle> <Paragraph position="0"> On the assumption that semantically related terms tend to co-occur, one can construct topic signatures from preclassified text using the χ² test, mutual information, or other standard statistical tests and information-theoretic measures. Instead of χ², we use the likelihood ratio λ (Dunning, 1993), since λ is more appropriate for sparse data than the χ² test and the quantity -2 log λ is asymptotically χ² distributed.[5] Therefore, we can determine the confidence level for a specific -2 log λ value by looking it up in a χ² distribution table, and use that value to select an appropriate cutoff associated weight. [Footnote 5: This assumes that the ratio is taken between the maximum likelihood estimate over a subpart of the parameter space and the maximum likelihood estimate over the entire parameter space. See (Manning and Schütze, 1999), pages 172 to 175, for details.]</Paragraph> <Paragraph position="1"> We have documents preclassified into a set R of relevant texts and a set R̄ of nonrelevant texts for a given topic. Assume the following two hypotheses:

Hypothesis 1 (H1): P(R|ti) = p = P(R|t̄i), i.e. the relevancy of a document is independent of ti.

Hypothesis 2 (H2): P(R|ti) = p1 ≠ p2 = P(R|t̄i), i.e. the presence of ti indicates strong relevancy, assuming p1 >> p2,

and the following 2-by-2 contingency table:

            relevant set    nonrelevant set
  ti            O11              O12
  t ≠ ti        O21              O22

where O11 is the frequency of term ti occurring in the relevant set, O12 is the frequency of term ti occurring in the nonrelevant set, O21 is the frequency of terms t ≠ ti occurring in the relevant set, and O22 is the frequency of terms t ≠ ti occurring in the nonrelevant set.</Paragraph> <Paragraph position="2"> Assuming a binomial distribution:

b(k; n, x) = C(n, k) x^k (1 - x)^(n-k)    (2)

the likelihood ratio statistic is

-2 log λ = -2 log [L(H1) / L(H2)]    (3)

where the likelihoods of the two hypotheses are

L(H1) = b(O11; O11 + O12, p) b(O21; O21 + O22, p)
L(H2) = b(O11; O11 + O12, p1) b(O21; O21 + O22, p2)    (4)

with maximum likelihood estimates p = (O11 + O21)/N, p1 = O11/(O11 + O12), and p2 = O21/(O21 + O22). Expanding the logarithms yields

-2 log λ = 2N (H(R) - H(R|T)) = 2N I(R; T)    (5)

where N = O11 + O12 + O21 + O22 is the total number of term occurrences in the corpus, H(R) is the entropy of terms over relevant and nonrelevant sets of documents, H(R|T) is the entropy of a given term over relevant and nonrelevant sets of documents, and I(R; T) is the mutual information between document relevancy and a given term. Equation 5 indicates that mutual information[6] is an equivalent measure to the likelihood ratio when we assume a binomial distribution and a 2-by-2 contingency table. [Footnote 6: The mutual information is defined according to chapter 2 of (Cover and Thomas, 1991) and is not the pairwise mutual information used in (Church and Hanks, 1990).]</Paragraph> <Paragraph position="3"> To create a topic signature for a given topic, we:

1. classify documents as relevant or nonrelevant according to the given topic;
2. compute the -2 log λ value using Equation 3 for each term in the document collection;
3. rank terms according to their -2 log λ values;
4. select a confidence level from the χ² distribution table; determine the cutoff associated weight and the number of terms to be included in the signatures.</Paragraph> </Section> </Section>
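Steps 2 and 3 of this procedure follow directly from Equations 2-4. A minimal sketch (function names and the toy counts are ours) that computes -2 log λ for each term from its contingency counts and ranks the terms:

```python
import math

def _log_b(k: int, n: int, x: float) -> float:
    """log b(k; n, x) without the C(n, k) factor, which is the same
    under H1 and H2 and so cancels in the likelihood ratio."""
    if x <= 0.0 or x >= 1.0:
        # At the boundary of the parameter space, 0*log(0) is taken as 0.
        return 0.0 if k == 0 or k == n else float("-inf")
    return k * math.log(x) + (n - k) * math.log(1.0 - x)

def neg2_log_lambda(o11: int, o12: int, o21: int, o22: int) -> float:
    """-2 log lambda for one term: o11/o12 are its frequencies in the
    relevant/nonrelevant sets, o21/o22 those of all other terms."""
    n = o11 + o12 + o21 + o22
    p = (o11 + o21) / n          # H1: relevancy independent of the term
    p1 = o11 / (o11 + o12)       # H2: P(relevant | term)
    p2 = o21 / (o21 + o22)       # H2: P(relevant | other terms)
    log_h1 = _log_b(o11, o11 + o12, p) + _log_b(o21, o21 + o22, p)
    log_h2 = _log_b(o11, o11 + o12, p1) + _log_b(o21, o21 + o22, p2)
    return -2.0 * (log_h1 - log_h2)

# Toy counts: 'inmate' is concentrated in the relevant set, while 'the'
# is spread in proportion to the set sizes and scores near zero.
counts = {"inmate": (120, 15, 9880, 99985), "the": (600, 6000, 9400, 94000)}
ranked = sorted(counts, key=lambda t: -neg2_log_lambda(*counts[t]))
print(ranked)  # ['inmate', 'the']
```

A χ² table lookup at the chosen confidence level (step 4) then gives the cutoff on these -2 log λ values.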
<Section position="6" start_page="496" end_page="497" type="metho"> <SectionTitle> 5 The Corpus </SectionTitle> <Paragraph position="0"> The training data derives from the Question and Answering summary evaluation data provided by TIPSTER-SUMMAC (Mani et al., 1998), which is a subset of the TREC collections. The TREC data is a collection of texts, classified into various topics, used for formal evaluations of information retrieval systems in a series of annual comparisons. This data set contains essential text fragments (phrases, clauses, and sentences) which must be included in summaries to answer some TREC topics. These fragments are each judged by a human judge. As described in Section 3, SUMMARIST employs several independent modules to assign a score to each sentence, and then combines the scores to decide which sentences to extract from the input text. One can gauge the efficacy of each module by comparing, for different amounts of extraction, how many 'good' sentences the module selects by itself. We rate a sentence as good simply if it also occurs in the ideal human-made extract, and measure it using combined recall and precision (F-score). We used four topics[7] with a total of 6,194 documents from the TREC collection; 138 of them are relevant documents with TIPSTER-SUMMAC provided answer keys for the question and answering evaluation. Model extracts are created automatically from sentences containing answer keys. Table 1 shows the TREC topic description for topic 151, test questions expected to be answered by relevant documents[8], and a sample relevant document with answer key markup.</Paragraph> <Paragraph position="1"> [Table 1: TREC topic description for topic 151, test questions expected to be answered by relevant documents, and a sample document with answer keys.

(title) Topic: Coping with overcrowded prisons
(desc) Description: The document will provide information on jail and prison overcrowding and how inmates are forced to cope with these conditions; or it will reveal plans to relieve the overcrowded condition.
(narr) Narrative: A relevant document will describe scenes of overcrowding that have become all too common in jails and prisons around the country. The document will identify how inmates are forced to cope with those overcrowded conditions, and/or what the correctional system is doing, or planning to do, to alleviate the crowded condition. (/top)
... to reduce the number of new inmates, e.g., moratoriums on admission, alternative penalties, programs to reduce crime/recidivism?
Q5: What measures have been taken/planned/recommended (etc.) to reduce the number of existing inmates at an overcrowded facility, e.g., granting early release, transferring to uncrowded ...
Sample document fragment: "(Q3) prisoners are kept chained to the walls of local police lockups for as long as three days at a time because of overcrowding in regular jail cells, police said. (/Q3) Overcrowding at the (Q1) Baltimore County Detention Center (/Q1) has forced police to ..." (/TEXT)]</Paragraph> </Section>
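Since module quality is measured above by combined recall and precision against the ideal human-made extract, the evaluation reduces to an F-score over sets of selected sentences; a minimal sketch (identifying sentences by their index in the document is our assumption):

```python
from typing import Set

def f_score(selected: Set[int], model_extract: Set[int]) -> float:
    """F1 of a module's extract against the model extract; a sentence
    is 'good' if it also occurs in the ideal human-made extract."""
    if not selected or not model_extract:
        return 0.0
    good = len(selected & model_extract)
    if good == 0:
        return 0.0
    precision = good / len(selected)
    recall = good / len(model_extract)
    return 2 * precision * recall / (precision + recall)

# E.g., extracting sentences {0, 3, 7} against model extract {0, 5, 7}
# gives precision = recall = 2/3, so F1 = 2/3.
print(f_score({0, 3, 7}, {0, 5, 7}))
```
</Paper>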