DIASUMM: Flexible Summarization of 
Spontaneous Dialogues in Unrestricted Domains 
Klaus Zechner and Alex Waibel 
LaJlguage 'l~chnologies Institute 
Carnegie Mellon University 
5000 Forbes Avenue 
Pittsbm:gh, PA 115213, USA 
{zechner., waibel}@cs, cmu. edu 
Abstract 
In this paper, we present a summa.rization system 
for spontaneous dialogues which consists of a novel 
multi-stage architectm'e. It is specifically aimed at 
addressing issues related to tlle nature of the l;exts 
being spoken vs. written and being diMogical vs. 
monologica.l. The system is embedded in a. graph- 
ical user interface ~md was developed and tested on 
transcripts of recorded telephone conversations in 
English and Spanish (CAI,LHOMI,;). 
1 Introduction 
Summa.rization of written docmnents has recently 
O' been a. focus for much research in NI,t ~ ( ~.o., (Mani 
and 1Vlasq~ury , 1997; AAAI, 1998; Mani el. al., 1998; 
ACL, 2000), to nanle some of tile Inajol: events in 
this field ill the past few years). Ilowever, very lit- 
tle a.ttention has been given so far to the summa- 
riza.tion of spol, r('~n , even less of conversa- 
lions vs. monologic'al texts. We believe tha.t sum- 
mariza.tion of speech will bccoJne increasingly more 
important, a.s the ~ml(mnt of online audio daLa. grows 
and demand for r~tl)id browsing, skimming, a.nd a.e- 
cess of speech data increases. Another application 
which particulm:ly pertains to our interest in spo-- 
ken dialogue summarization would be the generation 
of meeting minutes for archival purposes a.nd/or to 
update l)a.rticil)a.nts .joining a.t la.ter stages on qm' 
progress of the conversa.tion so far. 
Sunmmrization of dialogues within limilcd do- 
mains ha.s been attempted within the context of 
the VERBMOBII, pl:ojcct ("protocol generation", 
(Alexandersson and Poller, 1998)) or by SRI's MIMI 
summarizer (Kameyama et ~d., 1996). l{ecent work 
on spoken  summarization in unrestricted 
domains has focused ahnost exclusively on Broad- 
cast News, mostly due to the spoken hmguage track 
of recent TREC evaluations (Oarofolo et al., 1997; 
Garotblo et al., 1999). (Waibel et a.1., 1(.)98) describe 
a Meeting Browser where summaries earl be gener- 
ated using technology established for written texts. 
(Va.lenza. el. M., 1999) go one step further and incof 
pora.te knowledge from the speech recognizer (con- 
fidence scores) into their summarization system, as 
well. 
We a.rgue that the nature of spoken dialogues, to- 
gether with their textual representations as speech 
recognizer hypotheses, requires a. set of specific al> 
proa.ches to make summarization feasible for this 
text genre. 
As a demonstrable proof of concept, we present 
the multi-stage a.rchitecture of the summa.rization 
system I)IASUMM which can flexibly deal with spo- 
ken di,dogues in English and Spa.nish, without any 
restrictions of domahl. Since it cannot rely on a.ny 
domain specific knowledge base, it uses shallow sta- 
tisticaJ approaches and presents (possibly modified) 
ca:lracts from the original text. as summa.ry. 
We. present results of several evaluations of our 
system using human transcripts of spontaneous tele- 
phone conversations in English and Spanish from the 
(~,AI,LIIOME corl)/ls ((LI)C), 1996), in particular the 
accura.cy of the topic segmentation and in\[brmat.ion 
condensing components (sections (5 and 7). Also, Ibr 
I.he purpose of a global evaluation, a user study was 
l~ei:%i:med which a.ddresscd in\[or\]nation access t.inJe 
a.nd a.ccura.ey of retaine.d information eompa.ring dif- 
ferent versions of summaries (section 10). 
This paper is organized as follows: In the next sec- 
tion, we provide, a.n ow;rview a.bout t}ie in,till issues 
Ibr summa.rization of Sl)oken dialogues and indicate 
I;hc "~l)l)roaches we, are taking in our system. We 
then present the system a.rchitecture (section 3), fol- 
lowed by a. detailed description of the readier building 
blocks (sections <1 to 8). After a. brief elmra.cteriza- 
tion of the (2 UI (section 9) we describe a user study 
for global system evaluation in section 10. We con- 
clude the pa.per with a smmnary and a brief outlook 
in section 11. 
2 Issues and Approaches: Overview 
In this section, we give a,n overview about the main 
issues that a.ny sunmmrizat;ion system for spoken di- 
a.logues has to address mid indica.te the approach we 
are taking for each of these in I)IASUMM. 
In a generM sense, when dealing with written 
texts, usually there is plenty of information avail- 
able which can be used lbr the purpose of summa- 
968 
rization, such as capitalization, i)un(-tuation ~narks, 
t,itles, passage head(rs, i)aragral)h boundaries, or 
other ,nark-ul)S. (hfforl.mud.ely, however, ,,onc (.)f 
this holds for :q)ccch data whh:h arrives as a stream 
of word l,ok('w; from ;I recognizer, (:ut iuto "utt(.q'- 
antes" by using a silence heuristi('. 
2.1. Lack of clause. 1)Oulldaries 
One of the mosl. serious issues is the lack el senten(:e 
or clause boundaries in spoken dialogues whi(:h ix 
particularly problemati(: .;in(:e scnten(:es, clauses, or 
l)aragral)hs a.re (.:onsidercd the "minimal re,its" in 
virtually all existil,g summarization systcu,s. \'Vheu 
humans speak, they so,lletillles pause durinq a 
(:\]a.use, and not always at. l.he eml of a claus(', whi(:h 
means that the outl)ut of a r(;coguizer (whi(:h us,t- 
ally uses some silelme-heuristics to cut the segments) 
frequently does nol real,eli Iogi(:al sep, l,en(:e or clause 
boundaries, l,ooking at five I';nglish (~A,,I,HOM,,: (li- 
alogues with an average ii/11111)(".1' of :{20 iltl\[.('3'a,l('.c~.q 
eat.h, we find on average 30 such "(:ontinuations" of 
logical clauses over automa.ti(:ally detcrmiued a(:ous- 
tit" segment I)ounda.ries. lu a smmnary, this can 
cause a. r(;du(:tion in coh(,,ren(:c and r<~dability of 
the outlmt. 
We address this issue I)y linking adjac(;nt tm'ns 
of th(; smue sl)eaker together if the silence between 
them ix less than a given col,sl.\[/llt (se(;tioll d). 
2.2 Distrilml;c.d int'(n'matioll 
Siuce we have multi-pari,y conversations as o\])l)oscd 
to Inonologi('al texts, sonmtimcs the cru(:ial in\['or- 
matiou is found in a question-auswer-l)air , i.e., it 
involv('s more than oue Sl)eaker; extracting ouly the 
question or only the auswer wo,ld be meaningless 
in ma.ny cases. We found that on average about 
10% el' the speaker turns belong to such question- 
answer l)airs in five examined English (~AIA,IIOME 
dialogues. Often, either the question or the answer 
ix very shoI:t and does not contain any words with 
high relevan(:c. In order not to "lose" these short 
tutus at a later stage, when only the n~ost, relevant 
turns are extracted, we link them to the matching 
question/answer ahead of/.ime, using two different 
methods to detect questions aud their answers (sec- 
tion 4). 
2.3 Distluent speech 
Speech disfluencies in spontaneous convers,ttions -- 
such as fillers, repetitions, repairs, or unfinished 
clauses -- can make transcril)ts (and summary ex- 
tracts) quite ha.rd to read and also introduce all tin- 
wanted bias to relevance computations (e.g., word 
repetitions would cause a higher word count tbr the 
repeated content words; words in untinished clauses 
would be included in the word count.) 
'l'o alleviate this problem, we employ a clean-up 
tilter pipeline, which eliminates liller words and ,:el)- 
el.it.ions, and segments the tm'ns into short clauses 
(sectiou 5). \Ve also remove incomplete clauses, typ- 
ically sentem:c-iuitial repairs, at this stage of our 
'.syst¢lu. This "clea.niug-up" serves two main pur- 
1)oscs: (i) it. im:rea~cs tim readabilit3~ (for the fiually 
(;xtracl.cd segments); and (ii)it. ~nakcs the text more 
tractable by subsequent modules. 
The following exalnl)le com\])arcs a turn before and 
after t.he clean-up component: 
before: I MEAN WE LOSE WE LOSE I CAN'T I 
CAN'T DO ANYTHING ABOUT IT SO 
after: we lose / i can't do anything 
about it 
2.4 Lack of tel)i(" l)oundaries 
(;AI,I,IIOME s\])c'e(;h data is lll/llti-to\])ica\] I)tlt does 
uot include mark<q) \['or pa.ragral)hs, nor al,y tolfie- 
inforlJ,ative headers. Tyl)ically, we lind about 5 I0 
(.lilt'erent topics within a 10-mimd;e segment of a di-- 
ah)gue, i.e., the. topic changes about every 1 2 min- 
utes in these conversations. To facilitate browsing 
and smHtlmrization, we thus have to discover topi- 
(:ally coherent, segl,lents automatically. This is done 
using a TextTiling approach, adapted t'ron~ (l\]earst, 
\]997) (section (i). 
2.5 Speech. reeoglfizer errors 
Imst but not least, we face t.he l)roblcm of iml)er- 
t'e(:t word a(:cura(:y of sl)eech recognizers, l)articu- 
larly when (h'.a~ling with Sl)OUl.a\]mous st)eech over a 
large vo(:al)uhu'y aud over a low I);mdwi(Ith (:hamJe\], 
SIIC\]I \[~S l,h(~ (',AI,I,IIOME ({at;tl)asc's which we Juainly 
used for develol)lnent , testing, and evaluatiou of our 
syste/n. (hu'r(mt recognizers tyl)ically exhibit word 
error rates \['or l,hese (:orl)ora ill the order of 50%. In 
I)IASUMM's hfl'ormation condensation component, 
the relevaucc weights of speaker ttlr,ls (:all be ad- 
justed to take into acc.omd, their word confidence 
scores from 1.111; sl)eech recognizer. That way we can 
reduce the likelihood of extra.eting passages with a 
larger amount of word lnisreeognitions (Zeclmer and 
\Vaibel, 201111). lu this 1)aper, however, the focus will 
be exclusively on results of our evaluations on hu- 
man generated transcripts. No information from the 
speech recognizer nor from the acoustic signal (other 
than inter-utterance pause durations) are used. We 
are aware that in particular prosodic information 
may be of help for tasks such as the detection of 
sentence boundaries, speech acts, or topic bound- 
aries (l\]irschberg ~md Nakatani, 1998; Shriberg et 
al., 1998; Stolcke et al., 2000), but the investigation 
of the integration of this additional source of infer 
marion is beyond the scope of this pal)er and lel't tbr 
future work. 
3 System Architecture 
The global system architecture of I)IASUMM is a 
1)ipeline of the tbllowing lbur major components: 
969 
inputtor \] 
CLEAN ~ Turn Linking 
and TELE ! 
i 
\] Clean-up Filter ! 
I i 
\] 
J 
input for. Topic Segmentation 
TRANS 
i l 
Information Condensation ~ TRANS 
i L 
1 71- - - - \]7 7- ~ CLEAN 
Telegraphic Reduction TELE 
Fignre 1: System architecture 
turn linking; clean-up filter; topic segmentation; and 
information condensation. A. fifth component is 
added a.t the end for the purpose of telegraphic re- 
duction, so that we can maximize the information 
content in a given amount of space. The system ar- 
chitecture is shown in Figure 1. It also indicates the 
three major types of smnmaries which can be gener- 
ated by l)Ia SUMM: 'P\]~ANS ("transcript"): not using 
the linking and clean-up components; CLEAN: ris- 
ing the main four components; 'I'EI,E ("telegraphic" 
summary): additionally, using the telegraphic reduc- 
tion component. 
The following sections describe the components of 
DIASUMM ill more detail. 
4 Turn Linking 
The two main objectives of this component are: (i) 
to form turns which contain a set of full (and not 
partial) clauses; and (ii) to forln turn-pairs in cases 
where we have a question-answer pair in the dia- 
logue. 
To achieve the first objective, we scan the input for 
adjacent turns of one speaker and link them together 
if their time-stamp distance is below a pre-specified 
threshold 0. If the threshold is too small, we don't 
get most of the (logical) turn continuations across 
utterance boundaries, if it is too large, we run the 
risk of "skipping" over short but potentiMly relevant 
Daglnents of the speaker on the other channel. We 
experimented with thresholds between 0.0 and 2.0 
seconds and determined a local performance maxi- 
mum around 0 = 1..0. 
For the second objective, to form turn-pairs which 
comprise a question-answer information exchange 
between two dialogue participants, we need to detect 
wh- and yes-uo-questions in the dialogue. We tested 
\] English \] Spanish 
Annotated l)ata 
turns 1603 1185 
Wh-questions /12 78 
yes-no-questions /t3 98 
questions total 85 (5.3%) 176 (14.9%) 
Automatic Detection Results (F1) 
SA classifier 
POS rules 
raudom baseline 
0.24 0.22 
0.22 0.37 
0.02 0.13 
Tahle 1: Q-A-pair distribution in the data and ex- 
pel'imental results for automatic Q-A-detection 
two approa.ches: (a) a ItMM based speech a.ct (SA) 
classifier (\]/Jes, \] 999) and (b) a set of part-of-speech 
(POS) based rules. The SA classifier was trained oll 
dialogues which were manually annotated for speech 
acts, using parts of the SWITCIIBOARI) corpus (God- 
frey et al., 1992) for Fmglish and CALLIIOMF, for 
Spanish. The corresponding answers for the de- 
tected questions were hypothesized in the first turn 
with a. different sl)eaker , following the question-turn. 
Table 1 shows the results of these experiments for 5 
English and 5 Spanish CAI,L\]IOME dialogues, corn- 
payed to a baseline of randomly assigning n question 
speech acts, n being the number of question-turns 
marked by human a.nnotal~ors. We report Fl-seores, 
where F1 - ~ with P=preeision and /g--recall. 
We note that while the results \[br the SA-classifier 
and the rule-based approach are very similar for En- 
glish, the rule-based apl~roach yields better results 
tbr Spanish. The much higher random baseline for 
Spanish can be explained by the higher incidence of 
questions in the Spanish data (14.9°/(, vs. 5.3% for 
English). 
5 Clean-up Filter 
The clean-up component is a sequence of modules 
which serve the purposes of (a) rendering the tran- 
scripts more readable, (b) simplifying the input for 
subsequent components, and (c) avoiding unwanted 
bias for relevance computations (see section 2). All 
this has to happen without losing essential informa- 
tion that could be relevant in a summary. While 
other work (\]\]eeman et al., 1996; Stolcke et al., 1998) 
was concerned with building classifiers that can de- 
tect and possibly correct wn:ious speech disfluencies, 
our implementntion is of a much simpler design. It 
does not require as much lnanual annota.ted train- 
ing data and uses individual components for every 
major category of disfluency.1 
t While we have not yet numerically evaluated the perfof 
mance of this component, its output is deemed very natural to 
read by system users. Since the focus and goals of this contpo- 
nent are somewhat different than l)reviotts work in that area, 
meaningful comparisons are hard to make. 
970 
Single or multiple word repetitions, fillers (e.g., 
"uhm"), and discourse markers without semantic 
content (e.g., "you know") a.re removed fl:om the in- 
put, some short forms axe expanded (e.g., "we'll" 
-+ "we will"), a.nd fl'cquent word sequences are 
combined into a single token (e.g., % lot of" -+ 
"a_lot_of"). 
Longer tm'ns are segmented into shorl clauses, 
which are defined a.s consisting of at least a. sub- 
ject and a.n inIlectcd verbal form. While (Stolcke 
and Shriberg, 1996) use n-gram models for this task, 
and (C~awald~t et al., 1997) use neura.l networks, we 
decided to use a. rule-based approach (using word 
a,nd POS information), whose performa.nce proved 
to be compat'able with the results in the cited \])~- 
pets (1,'~ > 0.85, error < 0.05). ~ 
leo, . several of tile clea.n-up filter's components, we 
ina.ke use of Brill's POS ta.gger (Ih:ill, I,(),qd). For 
Fmglish, we use ~t modified version of Brilt's original 
t~g set, and the tagger was adapted and retra.ined for 
Sl)oken langua.ge eorl)ora, (CAIAAIOME a.lKl SWITCll- 
tlOalU)) (Zechner, 1997). For S1)anish, we crea.ted 
our own tag set., derived from the l,l)C lexicon and 
front the CI{ATEI/. project (LeOn, 1994), and trained 
the tagger on ma.nua.lly annotated (~;AI,I,IIOME dia- 
logues, l!'urthernlore, a. POS based sha.lk)w chunk 
parser (Zechner a.nd Wa.ibel, 1998) is used to fill.('.,' 
(,tit. likely ca.ndidates for incomplete, clauses dne to 
speech repair or interrul)tion by the other Slleaker. 
6 Topic Segmentation 
,~illce CAI,I,IIOME dialogues are a.lways multi-topica.I, 
segmenting them into tOl)ical units is an important 
:;tel) in our summa.riza.tion system. '.l'his allows us 
to l)rovi(le "signature?' information (frcqllenl; coil- 
tent words) about every topic to the user as a. hell) 
for faster 1)rowsing and accessing the dat.a., l,'ur- 
thel:more, the subsequent informa.tio, condensation 
COI\]l\])Ollent ca.ll ~,VolYk on smaller parts of the diaJogue 
a.nd thus opera.re more ellieiently. 
Following (l{oguraev and Ii{cnnedy, 1997; Ba.rzi- 
la.y and Elhadad, 1997) who use 'l'extTiling (llcarst, 
1997) for their summa.riza.tion systems of written 
text, we adapted this algorithm (it.s block compar- 
ison version) R)r sl)eech data: we choose turns to 
be minimal units a.nd compute block simila.rity be- 
tween l)locl(s of k turns every d turns. We use 9 
English and 15 Spanish @ALI,tIOMI,; dialogues, man- 
ually annota.ted for topic bounda.ries, to determine 
the optinmm wdues for a set of TextTiling pm:am- 
eters and ~t. the same time to eva.lua.te the accu- 
racy of this algorithm. '.re do this, we ran a.n n-R)ld 
cross-wdidation (".jack-l~nifing") where ~dl dia.logues 
but one are used to determine the 1)est parameters 
"train set") m,d the remaining dia.logue is used as 
2'\]'lie COIIIIIDA'isoII W~:tS (\[OllC OI1 t.he S~-tllle <latat set as used 
m (Gav;ddh ctal., 1997). 
English Spanish 
blocksize k 25 15 
sample distance d 2 2 
rounds of smoothing r 2 l 
smoothing width s 2 \] 
'l.'able 2: OptimM 'l>xt'.l.'iling pa.rameters for English 
and Spanish CAI,IAIOME dialogues 
nmnber of dbdogues 
r~mdom baseline 
test set avg. (%nseen data") 
train set a~vg. ("seen dat£') 
English Spanish 
9 15 
0.34 0.35 
0.58 0.53 
0.69 0.58 
'l'~d)le 3: Topic segmenta.tion results for English and 
Spa.nish CAI,IAIOMI,: dialogues (Fl-Scores) 
a held-out d~ta. set for eva.luation ("test set"). This 
process is rcpea.ted n times and average results are 
reported. Ta.ble 2 shows the set of p~u:ameters which 
worked best for most diak)gues ~md 'Fable 3 shows 
tile eva.hm.tion results of the cross-validation exper- 
iment. /,'~-scores improve I)y 18-2d% absohtte over 
the random baseline for unseen a.nd by 23 35% for 
seen data., the performance for E\]@ish being better 
than for Spanish. 'l'hese results, albeit achieved on 
a. quite different text genre, are well in line with the 
results in (llea.rst, 1997) who reports a.n absolute im- 
provement of a, bout :20% over a, random baseline for 
seen data. 
7 Information Condensation 
The informa,tion condensa, tion COml)onent is the core 
o\[' our sysl,en:~, lilts pUrl)OSe is to determine weights 
for terms and turns (or linked turn-i)airs ) and then 
to rank the turns a.ccording to their relewmce within 
each topical segment of the dialogue. 
For term-weighting, lf*idf-insl)ired formula.e 
(Sa.lton and Buckley, 1990) are used to empha.size 
words which are in the "middle range" of fl:equency 
in the dialogue a.nd do not a.pl)eat: in a. stop list. :~ 
For turn--ranking, we use a version of the "maximal 
n,argina.l relevance" (MMI{) algorithm (Ca.rbonell 
and Goldstein, 1998), where emphasis is given to 
liurns which conta.in ma.ny highly weighted terms tot" 
the current segment ("sa.lience") a.nd are sutficiently 
dissimila.r to previously ranked turns (to minimize 
redunda.ncy). 
For 9 English and l d Spanish dialogues, the "most 
relevant" turns were nmrl~ed lay hmnan coders. We 
ran a. series of cross-validation experiments to (a,) op- 
timize the parameters of this component related to 
tJ'*idf a.nd MMR computa,tion and to (b) deterlnine 
31,'or l,;nglish, our stop list comprises 557 words, for Span- 
ish, 831 words. 
971 
how well this information condensing component can 
match tile human relewmce annotations. 
Summarization results are comlmted using 1 l-pt- 
avg precision scores t`or ranked turn lists where the 
maximum precision of the list of retrieved turns 
is averaged in the 11 evenly spaced intervals be- 
tween recall=\[0,0.1),\[0.1,0.2), ... \[1.0,1.:1)(Salton 
and McGill, 1.983). 4 Table 4 shows the results from 
these experiments. Similar to other experiments in 
the summarization literature (Ma.ni et a.l., 1998), we 
find a wide performance variation across different 
texts. 
8 Telegraphic Reduction 
The purpose of this component is to maximize infor- 
mation in a tixed amount of space. We shorten the 
OUClmt of the summarizer to a "telegraphic style"; 
that way, more inrorma.tion can be included in a 
summary of k words (02: n bytes). Since we only 
use shallow methods for textual analysis that do 
not generate a. dependency structure, we cannot use 
complex methods for text reduction as described, 
e.g., in (Jing, 2000). Our method simply excludes 
words occurring in the stop list fl:om the summary, 
except for some highly inforlnative words such as 'T' 
or ~11ot ~ . 
9 User Interface and System 
Perforlnance 
Since we want to enable interactive summarization 
which a.llows ~ user to browse through a dialogue 
qnickly Co search for information he is interested 
in, we have integrated our summarization system 
into a 3AVA-based graphical user interface ("Meet- 
ing Browser") (Bert et al., 2000). This interface also 
integrates the output of a speech recognizer (Yu et 
al., 1.999), and can display a wide variety of infer 
1nation about a conversation, including speech acts, 
dialogue games, and emotions. 
For sumlnarization, the user can determine the 
size of the summary and which topical segments 
he wants to have displayed. Ite can also rocus 
the summary on particular content words ("query- 
based summary") or exclude words from considera- 
tion ("dynamic stop list expansion"). 
Smmnarizing a 10 minute segment of a CALL- 
hOME dialogue with our system takes on average less 
than 30 seconds on a 167 MHz 320 MB Sun Ultral 
workstation.S 
4 We are aware that this annotation and evaluation scheme 
is far fl'om optlmah it does neither reflect the fact that turns 
are not necessarily the best units for extraction nor that the 
11-pt-avg precision score is not optimally suited for the sum- 
marization task. We thus have recently developed a new 
word-based method for annotation and evaluation of spon- 
taneous speech (Zechner, 2000). 
5The average was computed over five English dialogues. 
10 Human Study 
1(1.1 Experiment Set;up 
Ill order to ewduate the system as a. whole, we con- 
ducted a study with humans in the loop to 1)e able Co 
colnpare three types of summaries (TITANS, CLEAN, 
TELE, see section 3) with the fllll original transcript. 
We address these two main questions in this study: 
(i) how fast can information be identified using dif- 
ferent types of summaries? (ii) how accurately is the 
information preserved, comparing different types of 
summaries? 
We did not only ask the user "narrow" questions 
for a specific piece of information -- along the lines 
of the Q-A-evaluation part. of the SUMMAC confer- 
ence (Mani eC a.l., 1998) -- but also very "global", 
non-specific questions, tied Co a. parCicular (topical) 
segment of the dialogue. 
The experiment was conducted as follows: Sub- 
jeers were given 24 texts each, aceompa.nied by either 
a generic question ("What is the topic of the discus- 
sion in this text segment?") or three specitic ques- 
tions (e.g., "Which clothes did speaker A buy.'?"). 
The texts were drawn from five topical segments 
each rrom five English CAIAAIOME dialogues. (; They 
have four difl>rent formats: (a) fldl transcripts (i.e., 
the transcript of the whole segment) (FULL); (b) 
summa.ry of the raw transcripts (without linking and 
clea.n--up) ('rll.aNS); (c) cleaned-up summary (using 
all four major components of our systenl) (C,I,I,;AN); 
and (d) telegram suln21\]a, ry (derived rron\] (c), using 
also Cite Celegraphic reduct.ion component) (TI';LE). 
'l'he texts or for,,,a.t,, (b), (c), a.nd (d) were gener- 
ated 1;o have the saaue length: 40% of (a), i.e., we 
use a 60% reduction rate. All these formats can 
be accotnpanied by either a. generic or three specitic 
questions: hence there are eight types of tasks for 
each of the 24: texts. 
We divided the subjects in eight groups such that 
no subject had to l)erform more than one task on 
the same text and we distributed the different Casks 
evenly \['or each group. Thus we cau make unbiased 
comparisons across texts and tasks. 
The answer accuracy vs. a pre-defined answer key 
was manually assessed on a 6 point discrete scale 
between 0.0 and 1.0. 
10.2 ll,esults and Discussion 
Of the 27 subjects taking part in this experiment, 
we included 24 subjects iu the evaluation; 3 sub- 
jects were excluded who were extreme outliers with 
respect to average answer time or score (not within 
/* + -2sCddev). 
From the results in Table 5 we observe the fol- 
lowing trends with respect to answer accuracy and 
response time: 
SOne of the 25 segments was set aside for demonstration 
purposes. 
972 
English Spanish 
nun+her of dialogues 9 14 
turns t)er dialogue ma.rked ;ts relevant I)y human coders 12% 25% 
I l-pt-a.vg precision (average over t.ol)i(:a.l segnlent.s) 0.45 0.5.0 
score variation between (liak)gues 0.2 0.49 0.15 0.8 
TM)Ie 4: Smmnarization result;s for English and S1)anisll (I~AI,I,IIOME 
I,'ornmt tra ns (:lea.n \] tele 
'\]'ime vs. A(:c. Tin.: \] Ae(. 'l'ime \[ A( C. I Time \[ Ac( 
generic (q = 72) 
specific (q = 216) 
L full Time~ Ace. 
I 0._(,. 1D.): s 
Tin,-~ec. 
-%~ \[ 07739 
'l'M)le 5: Average a.nsw('r times (i,, sect a.nd a.ccuracy scores (\[0.0-1.0\]) over eight dilferent tasks (number of 
subjects=2d; q:=mmd)er of questions l)er task type). 
summary l,ype 
generic / indicative 
speci\[ic / informative 
\[ !)/).s I    wci.,l \] Lr Ls 
1 ?t .0 
'l'able 6: Ilela.tive answer accuracies in % for dill'~,rent 
Sl)l\]llll~/ri(~S 
* ge~w'ric questions ("indicative summarie,s", the 
task being to identi\[y the topic o\[' a text): The 
tWO cleaned u D StlllnFla,ries tool( M)out, the same 
Lime to in;ocess I)ui. had lower a eeura('y scores 
than tim v(;rsion directly u:dug the trans(:ril)l.. 
* spcc~/ir quest.ions ("ilfl'orlnal.ive sunllllaries", 
the (.ask being Io lilM Sl)ecilie intLrllml ion in t\]l(~ 
re×t): (I) The accuracy advant, age of the raw 
I,ranscripl, sun lmaries ('I'R, A NS) over the clealled 
u\]) versions (CLlCAN) is only small (,oZ :;Latis- 
tica.lly signitieant: L:-0.748) 7. (2) 'l'her(" is a 
sui)eriority of the 'l'l,;lA,,-StllnlHary to t)o(;h otJmr 
kinds ('rFLI.: is significa.nlJy more ;,iCCtllX/|l(2 (h~-/.ll 
CLEAN \[()r 1) "~ 0.0~r)). 
l,'rom this w(; conjecture thai. our methods for (:us- 
tomizaJ.ion of the summaries to spoken dialogues is 
mostly relewmt for inJ'ormativc, but llot so tUll(;h 
for indi,:,tivc smmmu'ization. We drink that el.her 
methods, such as lists of signature l)hrases would l)e 
n tor0 effective to use lbr the \]al;tcr \[mrl)ose. 
'l)dtle 6 shows the answer accuracy for the three 
different smmnary tyl)es relative 1;o the accuracy of 
tile fldl transcripl, texts of l, he sa.me segmenl,s (':rela- 
tive ~mswer a.ccm:acy"). We, observe that; tit(: r('l~d;ive 
accuracy reduction for all smnn\]aries is markedly 
lower than the reduction of tc'xt size: all sunmmries 
were reduced from the full transcripts l)y 60%, 
whereas tile answer a(:(:uracy only drops between 9% 
(TITANS) a,tld 24% (CI,EAN) l()l" the generic quest, ions, 
7111 \['DA;\[,, ill 2, of 5 dialogues. I,\]m CI,I.1AN SIIllllllD, l'y scores 
m:e higher tllall th<>se of the 'I'IIANS summaries. 
and between 20% ('rF, l,l~,) and 29% (CI,F, AN) fOl: the 
speci\[ic questions. This proves that our systeln is 
able to retain most of the relevant information in 
tim summaries. 
As for average' answer times, we see a. ma.rked re- 
duction (3()0{,) of all sunmm.ries coulparcd to the full 
texts in l,hc .qcneric case; for the SlmCific ease, the 
time reduction is sonlewhat sma.ller (l 5% 25%). 
One shortcoming of the current, system is thai; it 
oper~d;es on turns (or \[;tlrll-pa.irs) as minimal units 
\['or extraction, tn \[Stture work, we will investigate 
possil)ilities to reduce the minimal units ot7 extrac-- 
l.ion l.o tim level of chmses or sent.<m<:es, wilhoul, giv 
like; Ul) the idea of linking cross-slxmker information. 
1 1 Summary and lglture Work 
\Ve have presented a sunmmrization sysl,e~ for six) 
ken dialogues which is constructed to address key 
difl)renees of spolcen vs. written langua.ge, dia.logues 
vs. monologues, and inul|.i-topical vs. mono-topical 
texts. The system cleans up the input for speech 
disfluencies, links t.urns together into coherent in- 
formation units, determines tOlfica.l segments, and 
extracts the most relevant pieces of informal, ion in 
a user-customiza.ble way. I~;vahml,ions of major sys- 
tem (:Oral)Orients and of t.he systeJn as a. whole were 
1)erfornmd. 'l'hc results of a user sl, udy show that 
with a. sutmna ry size of d0%, between 71% and 911% 
of the inlbrma.tion of the fill\] text is ret.a.ined in the 
summary, depending on tile type of summary and 
tim Lyl)('s of quest, ions being asked. 
\¥c' are currently extending the system to be able 
to ha.ndle different levels of granularity for extract;ion 
(clauses, sentences, turns), leurthermore, we plan to 
investigate the, integration of l)rosodic information 
into several (-onq)onents of our system. 
12 Acknowledgements 
We wa.nt, l,o tha.nk the almotators for their ell'errs aim 
Klaus Hies for providing l.he automatic speech a(:t 
973 
tagger. We appreciate comments and suggestions 
t¥om Alon Lavie, Marsal Gawtld~/, Jade Goldstein, 
Thomas MacCracken, and the &llonymotls l:eviewers 
on earlier drafts of this paper. 
This work was funded in part by the VEf{BMOBI1, 
project of the Federal Republic of Oerma,ny, ATR - 
Interpreting Telecommunications Research L~l)ora- 
tories of Japan, and the US l)epartment of l)efense. 

References 

AAAI, editor. 1998. Proceedin9s of the AAAI-98 Spring 
Symposium on Intelligent Te.vt Summarization, Stan\]ord, 
CA. 

ACL. 2000. Proceedings of thc ANLP/NAACL-2000 Work- 
shop on Automatic Summarization, Seattle, WA, May. 

Jan Alexaudersson and Peter Poller. 1998. Towards mul- 
tilingual protocol generation for spontaneous speech dia- 
logues. In Proceedings of the INLG-98, Niagara-on-the- 
lahc, Canada, ilugust. 

f{cgina Barzilay and Michael Elhadad. 1997. Using lexical 
chains for text summarization. In ilCL/EACL-97 Work- 
shop on Intelligent and Scalable Te.vt Summarization. 

Michael Bert, l{alph Gross, llua Yu, Xiaojin Zhu, Yue Pan, 
Jie Yang, and Alex Waibel. 2000. Multimodal meeting 
tracker. In Proceedings o\] the Conference on Content- 
Based Multimedia Information Access, IHAO-2000, Paris, 
l<7'ance, April. 

Braniinir Boguraev and Chrlstol)hcr I(cnnedy. 1997. 
Salience-based characterisation of text documents. In 
A CID/EA CL- 97 Workshop on Intelligent and Scalable Text 
Summarization. 

Eric Brill. 1994. Some advances in transforlnation-I)~ed part 
of speech tagging. In Proceeedings o.f AAAI-9/~. 

Jaime Carbonell mid Jade Goldstein. 1998. The use of MMR, 
diversity-based reranking for reordering docunlents and 
producing summaries. In Proceedings o.f the 21st ACM- 
SIGIJg International Co,florence on Research and Devel- 
opment in lnJormation ll.ctrieval, Melbour~;c, Australia. 

Johl\] S. Garofolo, Ellen M. Voorhees, Vincent M. Stanford, 
and l(aren Sparck .\]ones. \] 997. TI{I\]C-6 1997 spoken doc- 
IllllellL retriewfl track overview and results. In Proceed- 
in9s o.\[ the 1997 "17H¢C-6 Conference, Gaithe'rsburg, MI), 
November, pages 83 -91. 

John S. Garofolo, Ellen M. Voorhees, Cedric G. P. Auzanne, 
and Vincent M. Stratford. 1999. Spoken doculnent re- 
trieval: 1998 evaluation aud investigation of new inetrics. 
In Proceedings of the ESCA workshop: Accessing informa- 
tion in spoken audio, pages 1-7. Camloridge, UK, April. 

Morsel Gawddh, Klaus Zechner, and Gregory Aist. 1997. 
Iligh perforlnauce segnlentation of spontaneous speech us- 
ing part of speech and trigger word infornmtion. In Pro- 
eeedin9 s of the 5th ANLP Conference, Washington DO, 
pages 12-15. 

J. J. Godfrey, E. C. ltolliman, and J. Mcl)mfiel. 1992. 
SWITCttBOARD: telephone speech corpus for research mid 
development. In Proceedings of the IUASSP-92, vohnne 1, 
pages 517-520. 

Martl A. IIearst. 1997. TextTiling: Segmenting text into 
multi-paragraph subtopic passages. Computational Lin.- 
guistics, 2311):33-64, March. 

Peter A. IIeeman, Ieyung he Loken-Khn, and James 1:. Allen. 
1996. Oombining the detection and correction of speech 
repairs. In Proceedin9s of ICSLP-96. 

Julia Ilirsehberg mid Christine Nakatmfi. 1998. Acoustic 
indicators of topic segmentation. In Proceedings o.f the 
ICSLP-98, Sydney, Australia. 

IIongyan Jing. 2000. Sentence reduction for automatic text 
sum,narlzation. In Proceedings of ANIH~-NAA CL-2000, 
Seattle, WA, May, pages 310-;315. 

Megumi Kameyama, Goh Kawai, and isao Arima. 1996. A 
real-tinie systcni for summarizing human-human sponta- 
neous spoken dialogues. Ill Proceedings of the ICSLP-96, 
pages 681-684. 

Linguistic Data Consortium (LDC). 1996. CallHome alld 
CallFriend LVCSR databases. 

Fernando S~nchez \[,edn. 1994. Spanish l.agset for tile 
CI~.ATIBR project, http://xxx.lanl.gov/cinp-lg/9406023. 

lndetjeet Mani and Mark Maybury, editors. 1997. Proceed- 
in gs of the A CL/ICA CL '97 Workshop on Intelligent Scal- 
able Text Summarization, Madrid, Spain. 

\]ndet:ieet Mani, I)avid ltouse, Gary Klein, l,ynette 
Hirschman, Leo Obrst, Therese Firmin, Michael 
Chrzanowsld, and lJeth Sundheim. 1998. The 'I'\]P- 
STER SUMMAC text summarization evaluation. Mitre 
Technical Report MTIi 98W0000138, October 1998. 

Klaus liles. 1999. ItMM and neural network based speech 
act detectiou. \]n Proceedings o\] the ICASSP-99, Phoenix, 
Arizona, March. 

Gerard Salton and Chris Buckley. 1990. \]?lexlble text match- 
ing for information retrieval. 'Pcchnical report, Cornell 
University, Department of Computer Science, TR. 90-1158, 
September. 

Germ'd Salton and Michael J. McGill. 1983. Introduction to 
Modern Information ltetrieval. McO,'aw IIill, q\~kyo etc. 

Elizabeth Shriberg, Rebecca Bates, Andreas Stolcke, Paul 
q)*ylor, Daniel aurafsky, Klaus f{ies, Noah Coccaro, l{achel 
Martin, Marie Meteer, and Carol Van Ess-Dykema. 1998. 
(Jan prosody aid the automatic classification of dialog acts 
in conversational speech? Lan9aa9 e and Speech, ,1113- 
4):439 487. 

Andrew,s Stolcke and l~lizabeth Shriberg. 1996. Automatic 
linguistic segmentation of conversational speech. In Pro- 
ceedings o\] the I6'SL\]~-96, pages 1005-1008. 

Andreas Stolcke, Elizabeth Shriberg, Rebecca Bates, Marl 
Ostendorf, Dilek IIakkani, Madelei,m Plauche, (JSkhan 
Tfir, and Yu tin. 1998. Automatic detection of sentence 
1ooundm:ies and disfluencies based on recognized words. In 
Proceedings of the ICSLP-98, Sydney, Australia, Decen> 
bet, volunm 5, pages 2247--2250. 

Andreas 8tolcke, ISlizabeth Shriberg, l)ilek IIakkani-Tfir, and 
GSkhan q'fir. 2000. Prosody-based automatic segmenta- 
tion of speech into sentences and topics. Speech Comn~u- 
"nhcatio'a., 32(1-2). 

l/obin Valenza, 3kmy l~obinson, Marianne l\]ickcy, and l{oger 
Tucker. 199,(/. Sunnnarisation of Sl)oken audio through in- 
forniatiou extraction, tn Proceedings o,f the /'TSCA work- 
shop: Aceessin.9 i~fformatio'n i~ spoken audio, pages 111 
116. C.2ambridge, UK, April. 

Alex Waibel, Michael Belt, and Michael Finke. 1998. Meet- 
ing browser: Tracking and summarizillg meetings, in Pro- 
ceedings of the DARPA Broadcast News l/Vo'rkshop. 

Hue Yu, Michael Finke, and Alex Waibel. 1999. Progress 
ill atltonlatic meeting transcril)tion. \]n Proceedings qf 
EUI~OSI'EECI1-99, Budapest, lhm9ary, September. 

Klaus Zeehner and Alex \¥aibel. 1998. Using chunk based 
partial parsing of spontaneous speech in unrestricted do- 
mains for reducing word error rate in speech recognition. 
In Proceedings of COLING-A CL 98, \]WIontreal, Canada. 

Klaus Zechner and Alex Waibel. 2000. Minimizing word error 
rate in textual suinnlaries of spoken lmiguage. \]u Procced- 
ings o\] the First Meeting o.f the North American Chapter o.f 
the Association for Computational Linguistics, NAACL- 
2000, Seattle, WA, April/May, pages 186-193. 

Klaus Zechner. 1997. Building chunk level represen- 
tations for spontmmous speech in unrestricted do- 
mains: The CHUNI';Y system and its al)plication to 
reranking N-best lists of a speech recognizer. Mas- 
ter's thesis (project report), Oh/I_U, available fl'om: 
http ://wuu. es. emu. edu/-zechner/publicat ions. html. 

Klaus Zechner. 2000. A word-based annota- 
tion and evaluation scheme for summariza- 
tion of Sl)ontancoIJs speech. Awfilablc fi'oni http://www.cs.¢,,,,.eduFzechner/pubiications.i,1:ml. 
