<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2140">
  <Title>DIASUMM: Flexible Summarization of Spontaneous Dialogues in Unrestricted Domains</Title>
  <Section position="5" start_page="968" end_page="968" type="metho">
    <SectionTitle>
CAN'T DO ANYTHING ABOUT IT SO
</SectionTitle>
    <Paragraph position="0"> after: we lose / i can't do anything about it</Paragraph>
    <Section position="1" start_page="968" end_page="968" type="sub_section">
      <SectionTitle>
2.4 Lack of topic boundaries
</SectionTitle>
      <Paragraph position="0"> CALLHOME speech data is multi-topical but does not include markup for paragraphs, nor any topic-informative headers. Typically, we find about 5-10 different topics within a 10-minute segment of a dialogue, i.e., the topic changes about every 1-2 minutes in these conversations. To facilitate browsing and summarization, we thus have to discover topically coherent segments automatically. This is done using a TextTiling approach, adapted from (Hearst, 1997) (section 6).</Paragraph>
    </Section>
    <Section position="2" start_page="968" end_page="968" type="sub_section">
      <SectionTitle>
2.5 Speech recognizer errors
</SectionTitle>
      <Paragraph position="0"> Last but not least, we face the problem of imperfect word accuracy of speech recognizers, particularly when dealing with spontaneous speech over a large vocabulary and over a low bandwidth channel,</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="968" end_page="968" type="metho">
    <SectionTitle>
such as the CALLHOME databases which we mainly
</SectionTitle>
    <Paragraph position="0"> used for development, testing, and evaluation of our system. Current recognizers typically exhibit word error rates for these corpora in the order of 50%. In DIASUMM's information condensation component, the relevance weights of speaker turns can be adjusted to take into account their word confidence scores from the speech recognizer. That way we can reduce the likelihood of extracting passages with a larger amount of word misrecognitions (Zechner and Waibel, 2000). In this paper, however, the focus will be exclusively on results of our evaluations on human generated transcripts. No information from the speech recognizer nor from the acoustic signal (other than inter-utterance pause durations) is used. We are aware that in particular prosodic information may be of help for tasks such as the detection of sentence boundaries, speech acts, or topic boundaries (Hirschberg and Nakatani, 1998; Shriberg et al., 1998; Stolcke et al., 2000), but the investigation of the integration of this additional source of information is beyond the scope of this paper and left for future work.</Paragraph>
  </Section>
  <Section position="7" start_page="968" end_page="969" type="metho">
    <SectionTitle>
3 System Architecture
</SectionTitle>
    <Paragraph position="0"> The global system architecture of DIASUMM is a pipeline of the following four major components:</Paragraph>
    <Paragraph position="2"> information condensation. A fifth component is added at the end for the purpose of telegraphic reduction, so that we can maximize the information content in a given amount of space. The system architecture is shown in Figure 1. It also indicates the three major types of summaries which can be generated by DIASUMM: TRANS ("transcript"): not using the linking and clean-up components; CLEAN: using the main four components; TELE ("telegraphic" summary): additionally, using the telegraphic reduction component.</Paragraph>
    <Paragraph position="3"> The following sections describe the components of DIASUMM in more detail.</Paragraph>
  </Section>
  <Section position="8" start_page="969" end_page="969" type="metho">
    <SectionTitle>
4 Turn Linking
</SectionTitle>
    <Paragraph position="0"> The two main objectives of this component are: (i) to form turns which contain a set of full (and not partial) clauses; and (ii) to form turn-pairs in cases where we have a question-answer pair in the dialogue. To achieve the first objective, we scan the input for adjacent turns of one speaker and link them together if their time-stamp distance is below a pre-specified threshold θ. If the threshold is too small, we don't get most of the (logical) turn continuations across utterance boundaries; if it is too large, we run the risk of "skipping" over short but potentially relevant fragments of the speaker on the other channel. We experimented with thresholds between 0.0 and 2.0 seconds and determined a local performance maximum around θ = 1.0.</Paragraph>
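The time-stamp linking rule above can be sketched as follows. This is a minimal illustration, not the DIASUMM code; the `(speaker, start, end, text)` turn representation is an assumption:

```python
from typing import List, Tuple

Turn = Tuple[str, float, float, str]  # (speaker, start_time, end_time, text)

def link_turns(turns: List[Turn], theta: float = 1.0) -> List[Turn]:
    """Merge adjacent turns of the same speaker whose time-stamp
    distance (gap to the previous turn's end) is below theta seconds.
    The paper reports a local performance maximum around theta = 1.0."""
    linked: List[Turn] = []
    for spk, start, end, text in turns:
        if linked and linked[-1][0] == spk and start - linked[-1][2] < theta:
            pspk, pstart, _pend, ptext = linked[-1]
            linked[-1] = (pspk, pstart, end, ptext + " " + text)
        else:
            linked.append((spk, start, end, text))
    return linked
```

A larger theta merges more aggressively across utterance boundaries but risks skipping over short intervening turns on the other channel, exactly the trade-off described above.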
    <Paragraph position="1"> For the second objective, to form turn-pairs which comprise a question-answer information exchange between two dialogue participants, we need to detect wh- and yes-no-questions in the dialogue. We tested two approaches: (a) a HMM based speech act (SA) classifier (Ries, 1999) and (b) a set of part-of-speech (POS) based rules. The SA classifier was trained on dialogues which were manually annotated for speech acts, using parts of the SWITCHBOARD corpus (Godfrey et al., 1992) for English and CALLHOME for Spanish. The corresponding answers for the detected questions were hypothesized in the first turn with a different speaker, following the question-turn. Table 1 shows the results of these experiments for 5 English and 5 Spanish CALLHOME dialogues, compared to a baseline of randomly assigning n question speech acts, n being the number of question-turns marked by human annotators. We report F1-scores, where F1 = 2PR/(P+R), with P = precision and R = recall.</Paragraph>
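The evaluation metric just defined can be computed directly; this small helper restates the standard F1 definition (harmonic mean of precision and recall) used throughout the paper's tables:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2PR / (P + R); defined as 0.0 when both P and R are 0."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```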
    <Paragraph position="2"> We note that while the results for the SA-classifier and the rule-based approach are very similar for English, the rule-based approach yields better results for Spanish. The much higher random baseline for Spanish can be explained by the higher incidence of questions in the Spanish data (14.9% vs. 5.3% for English).</Paragraph>
  </Section>
  <Section position="9" start_page="969" end_page="970" type="metho">
    <SectionTitle>
5 Clean-up Filter
</SectionTitle>
    <Paragraph position="0"> The clean-up component is a sequence of modules which serve the purposes of (a) rendering the transcripts more readable, (b) simplifying the input for subsequent components, and (c) avoiding unwanted bias for relevance computations (see section 2). All this has to happen without losing essential information that could be relevant in a summary. While other work (Heeman et al., 1996; Stolcke et al., 1998) was concerned with building classifiers that can detect and possibly correct various speech disfluencies, our implementation is of a much simpler design. It does not require as much manually annotated training data and uses individual components for every major category of disfluency. [1: While we have not yet numerically evaluated the performance of this component, its output is deemed very natural to read by system users. Since the focus and goals of this component are somewhat different than previous work in that area, meaningful comparisons are hard to make.]</Paragraph>
    <Paragraph position="1"> Single or multiple word repetitions, fillers (e.g., "uhm"), and discourse markers without semantic content (e.g., "you know") are removed from the input, some short forms are expanded (e.g., "we'll" → "we will"), and frequent word sequences are combined into a single token (e.g., "a lot of" → "a_lot_of").</Paragraph>
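A minimal sketch of such a clean-up pass follows. The word lists here are illustrative stand-ins only; the paper's actual filler, discourse-marker, and contraction inventories are not given in this section:

```python
# Illustrative inventories (assumptions, not the paper's actual lists).
FILLERS = {"uh", "uhm", "um"}
DISCOURSE_MARKERS = ("you know", "i mean")
CONTRACTIONS = {"we'll": "we will", "i'm": "i am"}
MULTIWORDS = {"a lot of": "a_lot_of"}

def clean_up(utterance: str) -> str:
    """Expand short forms, drop fillers/discourse markers, fuse frequent
    word sequences, and remove immediate single-word repetitions."""
    text = utterance.lower()
    for phrase, repl in CONTRACTIONS.items():
        text = text.replace(phrase, repl)
    for phrase in DISCOURSE_MARKERS:
        text = text.replace(phrase, "")
    for phrase, token in MULTIWORDS.items():
        text = text.replace(phrase, token)
    tokens = [t for t in text.split() if t not in FILLERS]
    # drop immediate repetitions ("go go" -> "go")
    deduped = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    return " ".join(deduped)
```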
    <Paragraph position="2"> Longer turns are segmented into short clauses, which are defined as consisting of at least a subject and an inflected verbal form. While (Stolcke and Shriberg, 1996) use n-gram models for this task, and (Gavalda et al., 1997) use neural networks, we decided to use a rule-based approach (using word and POS information), whose performance proved to be comparable with the results in the cited papers (F1 &gt; 0.85, error &lt; 0.05). For several of the clean-up filter's components, we make use of Brill's POS tagger (Brill, 1994). For English, we use a modified version of Brill's original tag set, and the tagger was adapted and retrained for spoken language corpora (CALLHOME and SWITCHBOARD) (Zechner, 1997). For Spanish, we created our own tag set, derived from the LDC lexicon and from the CRATER project (Leon, 1994), and trained the tagger on manually annotated CALLHOME dialogues. Furthermore, a POS based shallow chunk parser (Zechner and Waibel, 1998) is used to filter out likely candidates for incomplete clauses due to speech repair or interruption by the other speaker.</Paragraph>
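On pre-tagged input, a subject-plus-inflected-verb segmentation rule could look roughly like this. This is a toy stand-in: the Penn-style tag sets and the single lookahead pattern are assumptions, not the paper's full rule set:

```python
from typing import List, Tuple

SUBJECT_TAGS = {"PRP", "NN", "NNP"}      # assumed subject-bearing tags
VERB_TAGS = {"VBD", "VBP", "VBZ", "MD"}  # assumed inflected-verb tags

def segment_clauses(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Split a POS-tagged turn into clauses: a new clause starts when a
    subject candidate is immediately followed by an inflected verb."""
    clauses, current = [], []
    for i, (word, tag) in enumerate(tagged):
        starts_clause = (
            current
            and tag in SUBJECT_TAGS
            and i + 1 < len(tagged)
            and tagged[i + 1][1] in VERB_TAGS
        )
        if starts_clause:
            clauses.append(current)
            current = []
        current.append(word)
    if current:
        clauses.append(current)
    return clauses
```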
  </Section>
  <Section position="10" start_page="970" end_page="970" type="metho">
    <SectionTitle>
6 Topic Segmentation
</SectionTitle>
    <Paragraph position="0"> Since CALLHOME dialogues are always multi-topical, segmenting them into topical units is an important step in our summarization system. This allows us to provide "signature" information (frequent content words) about every topic to the user as a help for faster browsing and accessing the data. Furthermore, the subsequent information condensation component can work on smaller parts of the dialogue and thus operate more efficiently.</Paragraph>
    <Paragraph position="1"> Following (Boguraev and Kennedy, 1997; Barzilay and Elhadad, 1997) who use TextTiling (Hearst, 1997) for their summarization systems of written text, we adapted this algorithm (its block comparison version) for speech data: we choose turns to be minimal units and compute block similarity between blocks of k turns every d turns. We use 9 English and 15 Spanish CALLHOME dialogues, manually annotated for topic boundaries, to determine the optimum values for a set of TextTiling parameters and at the same time to evaluate the accuracy of this algorithm. To do this, we ran an n-fold cross-validation ("jack-knifing") where all dialogues but one are used to determine the best parameters ("train set") and the remaining dialogue is used as a held-out data set for evaluation ("test set"). This process is repeated n times and average results are reported. Table 2 shows the set of parameters which worked best for most dialogues and Table 3 shows the evaluation results of the cross-validation experiment. F1-scores improve by 18-24% absolute over the random baseline for unseen and by 23-35% for seen data, the performance for English being better than for Spanish. These results, albeit achieved on a quite different text genre, are well in line with the results in (Hearst, 1997) who reports an absolute improvement of about 20% over a random baseline for seen data.</Paragraph>
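The adapted block-comparison step can be sketched as follows: cosine similarity over term frequencies, with turns as minimal units. Boundary selection from local minima of the score sequence, smoothing, and the full TextTiling parameter set are omitted here:

```python
import math
from collections import Counter

def block_similarity(block_a, block_b):
    """Cosine similarity between term-frequency vectors of two blocks,
    each block being a list of turns (whitespace-tokenized strings)."""
    ca = Counter(w for turn in block_a for w in turn.split())
    cb = Counter(w for turn in block_b for w in turn.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def tile_scores(turns, k=2, d=1):
    """Score each candidate boundary by comparing the k turns before it
    with the k turns after it, every d turns. Low scores (local minima)
    suggest topic boundaries."""
    return [(i, block_similarity(turns[i - k:i], turns[i:i + k]))
            for i in range(k, len(turns) - k + 1, d)]
```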
  </Section>
  <Section position="11" start_page="970" end_page="971" type="metho">
    <SectionTitle>
7 Information Condensation
</SectionTitle>
    <Paragraph position="0"> The information condensation component is the core of our system. Its purpose is to determine weights for terms and turns (or linked turn-pairs) and then to rank the turns according to their relevance within each topical segment of the dialogue.</Paragraph>
    <Paragraph position="1"> For term-weighting, tf*idf-inspired formulae (Salton and Buckley, 1990) are used to emphasize words which are in the "middle range" of frequency in the dialogue and do not appear in a stop list.[3] For turn-ranking, we use a version of the "maximal marginal relevance" (MMR) algorithm (Carbonell and Goldstein, 1998), where emphasis is given to turns which contain many highly weighted terms for the current segment ("salience") and are sufficiently dissimilar to previously ranked turns (to minimize redundancy).</Paragraph>
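A greedy MMR-style ranking over one segment might be sketched like this. Set overlap stands in for the weighted term-vector similarity, and the `lam` trade-off value and salience weights are assumptions, not the paper's tuned parameters:

```python
def mmr_rank(turn_vectors, salience, lam=0.7):
    """Greedy MMR ranking. turn_vectors: dict turn_id -> set of terms
    (a crude stand-in for tf*idf vectors); salience: dict turn_id ->
    relevance weight. lam trades salience against novelty."""
    def overlap(a, b):
        # Jaccard overlap as a simple redundancy measure.
        return len(a & b) / len(a | b) if a | b else 0.0

    ranked, remaining = [], set(turn_vectors)
    while remaining:
        def mmr(t):
            redundancy = max((overlap(turn_vectors[t], turn_vectors[r])
                              for r in ranked), default=0.0)
            return lam * salience[t] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

With a high `lam` the ranking follows salience; lowering it pushes near-duplicate turns (like turn 2 below, which repeats turn 1's terms) down the list.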
    <Paragraph position="2"> For 9 English and 14 Spanish dialogues, the "most relevant" turns were marked by human coders. We ran a series of cross-validation experiments to (a) optimize the parameters of this component related to tf*idf and MMR computation and to (b) determine [3: For English, our stop list comprises 557 words; for Spanish, 831 words.]</Paragraph>
    <Paragraph position="3"> how well this information condensing component can match the human relevance annotations.</Paragraph>
    <Paragraph position="4"> Summarization results are computed using 11-pt-avg precision scores for ranked turn lists, where the maximum precision of the list of retrieved turns is averaged in the 11 evenly spaced intervals between recall=[0,0.1), [0.1,0.2), ..., [1.0,1.1) (Salton and McGill, 1983).[4] Table 4 shows the results from these experiments. Similar to other experiments in the summarization literature (Mani et al., 1998), we find a wide performance variation across different texts.</Paragraph>
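The 11-pt-avg precision score described above can be computed as follows, taking interpolated precision at the 11 recall points 0.0, 0.1, ..., 1.0. The boolean-list input format is an assumption for the sketch:

```python
def eleven_pt_avg_precision(ranked_relevant, total_relevant):
    """ranked_relevant: booleans down the ranked turn list, True where
    that turn is in the human 'most relevant' set. Interpolated
    precision (max precision at or beyond each recall level) is
    averaged over the 11 evenly spaced recall points."""
    precisions, recalls = [], []
    hits = 0
    for i, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
        precisions.append(hits / i)
        recalls.append(hits / total_relevant)
    points = []
    for j in range(11):
        r = j / 10
        attained = [p for p, rec in zip(precisions, recalls) if rec >= r]
        points.append(max(attained) if attained else 0.0)
    return sum(points) / 11
```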
  </Section>
  <Section position="12" start_page="971" end_page="971" type="metho">
    <SectionTitle>
8 Telegraphic Reduction
</SectionTitle>
    <Paragraph position="0"> The purpose of this component is to maximize information in a fixed amount of space. We shorten the output of the summarizer to a "telegraphic style"; that way, more information can be included in a summary of k words (or n bytes). Since we only use shallow methods for textual analysis that do not generate a dependency structure, we cannot use complex methods for text reduction as described, e.g., in (Jing, 2000). Our method simply excludes words occurring in the stop list from the summary, except for some highly informative words such as "I" or "not".</Paragraph>
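A minimal sketch of this reduction follows. The stop list here is a toy stand-in (the paper's English list has 557 words), and the exemption set is described in the text only by example:

```python
STOP_LIST = {"the", "a", "of", "is", "to", "and", "i", "not"}  # toy list
KEEP = {"i", "not"}  # highly informative words exempt from removal

def telegraphic(text: str) -> str:
    """Telegram-style reduction: drop stop-list words except exempt ones."""
    return " ".join(w for w in text.lower().split()
                    if w not in STOP_LIST or w in KEEP)
```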
    <Paragraph position="1"> 9 User Interface and System
Since we want to enable interactive summarization which allows a user to browse through a dialogue quickly to search for information he is interested in, we have integrated our summarization system into a JAVA-based graphical user interface ("Meeting Browser") (Bett et al., 2000). This interface also integrates the output of a speech recognizer (Yu et al., 1999), and can display a wide variety of information about a conversation, including speech acts, dialogue games, and emotions.</Paragraph>
    <Paragraph position="2"> For summarization, the user can determine the size of the summary and which topical segments he wants to have displayed. He can also focus the summary on particular content words ("query-based summary") or exclude words from consideration ("dynamic stop list expansion"). Summarizing a 10 minute segment of a CALLHOME dialogue with our system takes on average less than 30 seconds on a 167 MHz 320 MB Sun Ultra1 workstation. [4: We are aware that this annotation and evaluation scheme is far from optimal: it neither reflects the fact that turns are not necessarily the best units for extraction nor that the 11-pt-avg precision score is not optimally suited for the summarization task. We thus have recently developed a new word-based method for annotation and evaluation of spontaneous speech (Zechner, 2000).]</Paragraph>
  </Section>
  <Section position="13" start_page="971" end_page="972" type="metho">
    <SectionTitle>
10 Human Study
10.1 Experiment Setup
</SectionTitle>
    <Paragraph position="0"> In order to evaluate the system as a whole, we conducted a study with humans in the loop to be able to compare three types of summaries (TRANS, CLEAN, TELE, see section 3) with the full original transcript.</Paragraph>
    <Paragraph position="1"> We address these two main questions in this study: (i) how fast can information be identified using different types of summaries? (ii) how accurately is the information preserved, comparing different types of summaries? We did not only ask the user "narrow" questions for a specific piece of information -- along the lines of the Q-A-evaluation part of the SUMMAC conference (Mani et al., 1998) -- but also very "global", non-specific questions, tied to a particular (topical) segment of the dialogue.</Paragraph>
    <Paragraph position="2"> The experiment was conducted as follows: Subjects were given 24 texts each, accompanied by either a generic question ("What is the topic of the discussion in this text segment?") or three specific questions (e.g., "Which clothes did speaker A buy?"). The texts were drawn from five topical segments each from five English CALLHOME dialogues.[5] They have four different formats: (a) full transcripts (i.e., the transcript of the whole segment) (FULL); (b) summary of the raw transcripts (without linking and clean-up) (TRANS); (c) cleaned-up summary (using all four major components of our system) (CLEAN); and (d) telegram summary (derived from (c), using also the telegraphic reduction component) (TELE).</Paragraph>
    <Paragraph position="3"> The texts of formats (b), (c), and (d) were generated to have the same length: 40% of (a), i.e., we use a 60% reduction rate. All these formats can be accompanied by either a generic or three specific questions; hence there are eight types of tasks for each of the 24 texts.</Paragraph>
    <Paragraph position="4"> We divided the subjects into eight groups such that no subject had to perform more than one task on the same text, and we distributed the different tasks evenly for each group. Thus we can make unbiased comparisons across texts and tasks.</Paragraph>
    <Paragraph position="5"> The answer accuracy vs. a pre-defined answer key was manually assessed on a 6 point discrete scale between 0.0 and 1.0.</Paragraph>
    <Paragraph position="6"> 10.2 Results and Discussion
Of the 27 subjects taking part in this experiment, we included 24 subjects in the evaluation; 3 subjects were excluded who were extreme outliers with respect to average answer time or score (not within μ ± 2 stddev).</Paragraph>
    <Paragraph position="7"> From the results in Table 5 we observe the following trends with respect to answer accuracy and response time: [5: One of the 25 segments was set aside for demonstration purposes.]</Paragraph>
    <Paragraph position="8"> * generic questions ("indicative summaries", the task being to identify the topic of a text): The two cleaned-up summaries took about the same time to process but had lower accuracy scores than the version directly using the transcript.</Paragraph>
    <Paragraph position="9"> * specific questions ("informative summaries", the task being to find specific information in the text): (1) The accuracy advantage of the raw transcript summaries (TRANS) over the cleaned-up versions (CLEAN) is only small (not statistically significant: t=0.748).[7] (2) There is a superiority of the TELE-summary to both other kinds (TELE is significantly more accurate than CLEAN for p &lt; 0.05).</Paragraph>
    <Paragraph position="10"> From this we conjecture that our methods for customization of the summaries to spoken dialogues are mostly relevant for informative, but not so much for indicative summarization. We think that other methods, such as lists of signature phrases, would be more effective to use for the latter purpose.</Paragraph>
    <Paragraph position="11"> Table 6 shows the answer accuracy for the three different summary types relative to the accuracy of the full transcript texts of the same segments ("relative answer accuracy"). We observe that the relative accuracy reduction for all summaries is markedly lower than the reduction of text size: all summaries were reduced from the full transcripts by 60%, whereas the answer accuracy only drops between 9% (TRANS) and 24% (CLEAN) for the generic questions, [7: In fact, in 2 of 5 dialogues, the CLEAN summary scores are higher than those of the TRANS summaries.]</Paragraph>
    <Paragraph position="12"> and between 20% (TELE) and 29% (CLEAN) for the specific questions. This proves that our system is able to retain most of the relevant information in the summaries.</Paragraph>
    <Paragraph position="13"> As for average answer times, we see a marked reduction (30%) for all summaries compared to the full texts in the generic case; for the specific case, the time reduction is somewhat smaller (15%-25%).</Paragraph>
    <Paragraph position="14"> One shortcoming of the current system is that it operates on turns (or turn-pairs) as minimal units for extraction. In future work, we will investigate possibilities to reduce the minimal units of extraction to the level of clauses or sentences, without giving up the idea of linking cross-speaker information.
11 Summary and Future Work
We have presented a summarization system for spoken dialogues which is constructed to address key differences of spoken vs. written language, dialogues vs. monologues, and multi-topical vs. mono-topical texts. The system cleans up the input for speech disfluencies, links turns together into coherent information units, determines topical segments, and extracts the most relevant pieces of information in a user-customizable way. Evaluations of major system components and of the system as a whole were performed. The results of a user study show that with a summary size of 40%, between 71% and 91% of the information of the full text is retained in the summary, depending on the type of summary and the types of questions being asked.</Paragraph>
    <Paragraph position="15"> We are currently extending the system to be able to handle different levels of granularity for extraction (clauses, sentences, turns). Furthermore, we plan to investigate the integration of prosodic information into several components of our system.</Paragraph>
  </Section>
</Paper>