Activity detection for information access
to oral communication
Klaus Ries and Alex Waibel
{ries|ahw}@cs.cmu.edu
Interactive Systems Labs, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Interactive Systems Labs, Universität Karlsruhe, Fakultät für Informatik, 76128 Karlsruhe, Germany
http://www.is.cs.cmu.edu/ http://werner.ira.uka.de
ABSTRACT
Oral communication is ubiquitous and carries important information, yet it is also time consuming to document. Given the development of storage media and networks, one could simply record and store a conversation for documentation. The question is, however, how an interesting piece of information would be found in a large database. Traditional information retrieval techniques use a histogram of keywords as the document representation, but oral communication may offer additional indices such as the time and place of the rejoinder and the attendance. An alternative index could be the activity, such as discussing, planning, informing, story-telling, etc. This paper addresses the problem of the automatic detection of those activities in meeting situations and everyday rejoinders. Several extensions of this basic idea are being discussed and/or evaluated: Similar to activities, one can define subsets of a larger database and detect those automatically, which is shown on a large database of TV shows. Emotions and other indices such as the dominance distribution of speakers might be available on the surface and could be used directly. Despite the small size of the databases used, some results about the effectiveness of these indices can be obtained.
Keywords
activity, dialogue processing, oral communication, speech,
information access
 We would like to thank our lab, especially Klaus Zechner,
Alon Lavie and Lori Levin for their discussions and support.
We would also like to thank our sponsors at DARPA. Any
opinions, findings and conclusions expressed in this material
are those of the authors and may not reflect the views of
DARPA, or any other party.
1. INTRODUCTION
Information access to oral communication is becoming an interesting research area since recording, storing and transmitting large amounts of audio (and video) data is feasible today. While written information is often available electronically (especially since it is typically entered on computers), oral communication is usually only documented by constructing a new document in written form such as a transcript (court proceedings) or minutes (meetings). Oral communications are therefore a large untapped resource, especially if no corresponding written documents are available and the cost of documentation using traditional techniques is considered high: Tutorial introductions by a senior staff member might be worthwhile for many newcomers to attend, office meetings may contain information relevant for others and should be reproducible, informal and formal group meetings may be interesting but not fully documented. In essence the written form is already a reinterpretation of the original rejoinder. Such a reinterpretation is used to
• extract and condense information
• add or delete information
• change the meaning
• cite the rejoinder
• relate rejoinders to each other
Reinterpretation is a time consuming, expensive and optional step, and written documentation combines the reinterpretation and documentation steps in one 1. If however reinterpretation is not necessary or unwanted, a system which produces audiovisual records is superior. If reinterpretation is wanted or needed, a system using audiovisual records may be used to improve the reinterpretation by adding all audiovisual data and the option to go back to the unaltered original. Whether reinterpretation is done or not, it is crucial to be able to navigate effectively within an audiovisual document and to find a specific document.
1 The most important exception is the literal courtroom transcript; however, one could argue that even transcripts are reinterpretations since they omit information present in the audio channel such as emotions, hesitations, the use of slang and certain types of heteroglossia, accents and so forth. This is specifically true if transcription machines are used which restrict the transcriber to standard orthography.
[Figure 1: three-level hierarchy. Database level: Meetings, TV shows, Lectures, Speeches. Meeting level: "Data collection, last Wednesday"; "Dialogue detection, with Hans"; "Language modeling tutorial for Tim". Segment level: "Willie is sick"; "Need new coding scheme"; "Personal stuff"; "Setting up the new hard drive".]
Figure 1: Information access hierarchy: Oral communications take place in very different formats and the first step in the search is to determine the database (or sub-database) of the rejoinder. The next step is to find the specific rejoinder. Since rejoinders can be very long, the rejoinder has to be segmented and a segment has to be selected.
While keywords are commonly used in information access to written information, the use of other indices such as style is still uncommon (but see Kessler et al. (1997); van Bretan et al. (1998)). Oral communication is richer than written communication since it is an interactive real time accomplishment between participants, may involve speech gestures such as the display of emotion and is situated in space and time. Bahktin (1986) characterizes a conversation by topic, situation and style. Information access to oral communication can therefore make use of indices that pertain to the oral nature of the discourse (Fig. 2). Indices other than topic (represented by keywords) increase in importance since browsing audio documents is cumbersome, which makes the common interactive retrieval strategy "query, browse, reformulate" less effective. Finally the topic may not be known at all or may not be that relevant for the query formulation, for example if one just wants to be reminded what was being discussed the last time a person was met. Activities are suggested as an alternative index and are a description of the type of interaction. It is common to use "action-verbs" such as story-telling, discussing, planning, informing, etc. to describe activities 2. Items similar to activities have been shown to be directly retrievable from autobiographic memory (Herrmann, 1993) and are therefore indices that are available to participants of the conversation. Other indices may be very effective but not available: The frequency of the word "I" in the conversation, the histogram of word lengths or the histogram of pitch per participant.
In Fig. 1 the information access hierarchy is introduced, which allows one to understand the problem of information access to oral communication at different levels. In Ries (1999) we have shown that the detection of general dialogue genre (database level in Fig. 1) can be done with high accuracy if a number of different example types have been annotated; in Ries et al. (2000) we have shown that it is hard but not impossible to distinguish activities in personal phone calls (segment level in Fig. 1). In this paper we will address activities in meetings and other types of dialogues and show that these activities can be distinguished using certain features and a neural network based classifier (Sec. 2, segment level in Fig. 1). The concept of information retrieval assessment using information theoretic measures is applied to this task (Sec. 3). Additionally we will introduce a level somewhat below the database level in Fig. 1 that we call "sub-genre" and we have collected a large database of TV shows that are automatically classified for their show type (Sec. 4). We also explore whether there are other indices similar to activities that could be used and we present results on emotions in meetings (Sec. 5).
2 The definition of activities such as planning may vary vastly across general dialogue genres, for example compare a military combat situation with a mother child interaction. However it is often possible to develop activities and dialogue typologies for a specific dialogue genre. The related problem of general typologies of dialogues is still far from being settled and action-verbs are just one potential categorization (Fritz and Hundschnur, 1994).
[Figure 2: features grouped under the three properties Situation, Style and Topic. Features shown: speaker, time, location, related rejoinders, parts of speech, emotion, overlap between speakers, semantics / pragmatics, keywords.]
Figure 2: Bahktin's characterization of dialogue: Bahktin (1986) describes a discourse along the three major properties style, situation and topic. Current information retrieval systems focus on the topical aspect which might be crucial in written documents. Furthermore, since thorough text analysis is still a hard problem, information retrieval has mostly used keywords to characterize topic. Many features that could be extracted are therefore ignored in a traditional keyword based approach.
2. ACTIVITY DETECTION
We are interested in the detection of activities that are
described by action verbs and have annotated those in two
databases:
Meetings: have been collected at Interactive Systems Labs at
    CMU (Waibel et al., 1998) and a subset of 8 meetings
    has been annotated. Most of the meetings are by the
    data annotation group itself and are fairly informal in
    style. The participants are often well acquainted and
    meet each other frequently outside of these meetings.
Santa Barbara (SBC): is a corpus released by the LDC
    and 7 out of 12 rejoinders have been annotated.
The annotator has been instructed to segment the rejoinders into units that are coherent with respect to their topic and activity and to annotate them with an activity which follows the intuitive definition of the action-verb such as discussing, planning, etc. Additionally an activity annotation manual containing more specific instructions has been available (Ries et al., 2000; Thymé-Gobbel et al., 2001) 3. The list of tags and the distribution can be seen in Tab. 1. The set of activities can be clustered into "interactive" activities of equal contribution rights (discussion, planning), one person being active (advising, information giving, story-telling), interrogations and all others.

Activity        SBC   Meeting
Discussion       35        58
Information      25        23
Story-telling    24        10
Planning          7        19
Undetermined      5         8
Advising          5        17
Not meeting       3         2
Interrogation     2         1
Evaluation        1         0
Introduction      0         1
Closing           0         1
Table 1: Distribution of activity types: Both databases contain a lot of discussing, informing and story-telling activities; however, the meeting data contains a lot more planning and advising.
Measure        Meeting         SBC             CallHome
               all    inter    all    inter    Spanish
κ              0.41   0.51     0.49   0.56     0.59
Mutual inf.    0.35   0.25     0.65   0.32     0.61
Table 2: Intercoder agreement for activities: The meeting dialogues and Santa Barbara corpus have been annotated by a semi-naive coder and the first author of the paper. The κ-coefficient is determined as in Carletta et al. (1997) and mutual information measures how much one label "informs" the other (see Sec. 3). For CallHome Spanish 3 dialogues were coded for activities by two coders and the result seems to indicate that the task was easier.
Both datasets have been annotated not only by a semi-naive annotator but also by the first author of the paper. The results for the κ-statistic (Carletta et al., 1997) and mutual information between the coders can be seen in Tab. 2. The intercoder agreement would be considered moderate but compares approximately to the agreement on transactions reported by Carletta et al. (1997) (κ = 0.59), especially for the interactive activities and CallHome Spanish.
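As an illustration of the agreement statistic, the following is a minimal sketch of a chance-corrected κ for two coders. It assumes that both coders labelled the same sequence of units and that chance agreement is computed from each coder's own label distribution (Cohen's variant); the exact variant used for Tab. 2 may differ, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def kappa(labels_a, labels_b):
    """Chance-corrected agreement (Cohen's kappa) between two coders."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of units with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both coders labelled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example (not data from the paper).
coder_1 = ["discussion", "discussion", "planning",
           "information", "story-telling", "discussion"]
coder_2 = ["discussion", "planning", "planning",
           "information", "information", "discussion"]
print(round(kappa(coder_1, coder_2), 3))
```

Because the annotators in this study also chose the segment boundaries, real data would first have to be aligned to common units before such a function applies.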
3 In contrast to (Ries et al., 2000; Thymé-Gobbel et al., 2001) the "consoling" activity has been eliminated and an "informing" activity has been introduced for segments where one or more members of the rejoinder give information to the others. Additionally an "introducing" activity was added to account for an introduction of people or topics at the beginning of meetings.

Feature                          all            interactive
                                 SBC    meet    SBC    meet
baseline                         32.7   41.1    50.5   54.6
dialogue acts per channel        28.1   37.6    47.7   56.7
dialogue acts                    28.0   36.2    46.7   65.3
words                            38.3   39.7    53.3   54.6
dominance                        32.7   44.7    64.5   58.2
style                            24.3   35.5    53.3   58.9
style + words                    42.1   38.3    52.3   57.5
dominance + words                41.1   41.1    52.3   58.9
dominance + style + words        42.1   39.7    53.3   60.3
dialogue acts + words            42.1   37.6    57.0   61.0
dialogue acts + style + words    39.3   40.4    57.9   61.0
Wordnet                          37.4   37.6    46.7   52.5
Wordnet + words                  49.5   39.0    53.3   57.5
first author                     59.8   57.9    73.8   72.7
Table 3: Activity detection: Activities are detected on the Santa Barbara Corpus (SBC) and the meeting database (meet) either without clustering the activities (all) or clustering them according to their interactivity (interactive) (see Sec. 2 for details).

For classification a neural network was trained that uses the softmax function as its output and KL-divergence as the error function. The network connects the input directly to the output units. Hidden units have not been used since they did not yield improvements on this task. The network was trained using RPROP with momentum (Riedmiller and Braun, 1993) and corresponds to an exponential model (Nigam et al., 1999). The momentum term can be interpreted as a Gaussian prior with zero mean on the network weights. It is the same architecture that we used previously (Ries et al., 2000) for the detection of activities on CallHome Spanish. Although some feature sets could be trained using the iterative scaling algorithm if no hidden units are being used, the training times weren't high enough to justify the use of the less flexible iterative scaling algorithm. The features used for classification are
words: the 50 most frequent word / part of speech pairs
    are used directly, all other pairs are replaced by their
    part of speech 4.
stylistic features: adapted from Biber (1988); they contain
    mostly syntactic constructions and some word classes.
Wordnet: a total of 40 verb and noun classes (so called
    lexicographer's classes (Fellbaum, 1998)) are defined and
    a word is replaced by the most frequent class over all
    possible meanings of the word.
dialogue acts: such as statements, questions, backchannels,
    ... are detected using a language model based detector
    trained on Switchboard similar to Stolcke et al. (2000) 5.
dominance: described as the distribution of the speaker
    dominance in a conversation. The distribution is
    represented as a histogram and speaker dominance is
    measured as the average dominance of the dialogue acts
    (Linell et al., 1988) of each speaker. The dialogue acts are
    detected and the dominance is a numeric value assigned
    to each dialogue act type. Dialogue act types that
    restrict the options of the conversation partners have
    high dominance (questions), dialogue acts that signal
    understanding (backchannels) carry low dominance.
First author: The activities used for classification are those
    of the semi-naive coder. The "first author" row of Tab. 3
    describes the "accuracy" of the first author with respect
    to the naive coder.
4 Klaus Zechner trained an English part of speech tagger on Switchboard that has been used. The tagger uses the code by Brill (1994).
5 The model was trained to be very portable and therefore the following choices were taken: (a) the dialogue model is context-independent and (b) only the parts of speech are taken as the input to the model plus the 50 most likely word/part of speech types.
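The dominance feature can be illustrated with a minimal sketch. The numeric dominance values per dialogue act type, the number of histogram bins and all names below are hypothetical choices made for illustration; the values actually used follow Linell et al. (1988) and are not reproduced here.

```python
from collections import defaultdict

# Hypothetical dominance scores in [0, 1]: acts that restrict the partner's
# options score high, acts that merely signal understanding score low.
DOMINANCE = {"question": 1.0, "statement": 0.5, "backchannel": 0.0}


def speaker_dominance(acts):
    """acts: list of (speaker, dialogue_act_type) pairs for one segment.

    Returns the average dominance of the dialogue acts of each speaker.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for speaker, act in acts:
        totals[speaker] += DOMINANCE[act]
        counts[speaker] += 1
    return {s: totals[s] / counts[s] for s in totals}


def dominance_histogram(acts, bins=4):
    """Normalized histogram of per-speaker dominance on equal-width bins."""
    scores = speaker_dominance(acts)
    hist = [0.0] * bins
    for value in scores.values():
        index = min(int(value * bins), bins - 1)  # value 1.0 goes to last bin
        hist[index] += 1.0 / len(scores)
    return hist


# Toy segment: speaker A only asks questions, B mostly backchannels.
segment = [("A", "question"), ("B", "statement"), ("A", "question"),
           ("B", "backchannel"), ("B", "backchannel")]
print(dominance_histogram(segment))
```

A segment in which one speaker interrogates another yields mass at both ends of the histogram, whereas a discussion among equals concentrates it in the middle bins.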
The detection of interactive activities works fairly well using the dominance feature on SBC, which is natural since the relative dominance of speakers should describe what kind of interaction is exhibited. The dialogue act distribution on the other hand works fairly well on the more homogeneous meeting database, where there is a better chance to see generalizations from more specific dialogue based information. Overall the combination of more than one feature is really important: word level, Wordnet and stylistic information seem to be able to improve the result in combination, although they only sometimes provide good features by themselves. The meeting data is also more difficult, which might be due to its informal style.
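The classifier architecture used in this section (inputs connected directly to softmax outputs, no hidden units) can be sketched as follows. This is only an illustration under stated assumptions: plain batch gradient descent with an L2 penalty stands in for RPROP with momentum, the penalty plays the role of the zero-mean Gaussian prior on the weights, and the learning rate, epoch count and toy data are hypothetical. With one-hot targets, minimizing KL-divergence is equivalent to minimizing cross-entropy.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """weights[k][j] connects input j directly to output class k."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def train(data, n_classes, n_features, epochs=200, rate=0.5, l2=0.01):
    """data: list of (feature_vector, class_index) pairs."""
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        # Gradient of the L2 penalty (zero-mean Gaussian prior on weights).
        grad = [[l2 * w for w in row] for row in weights]
        for x, y in data:
            probs = predict(weights, x)
            for k in range(n_classes):
                # Gradient of the cross-entropy with respect to the score.
                err = probs[k] - (1.0 if k == y else 0.0)
                for j in range(n_features):
                    grad[k][j] += err * x[j] / len(data)
        for k in range(n_classes):
            for j in range(n_features):
                weights[k][j] -= rate * grad[k][j]
    return weights


# Toy data: [bias, feature 1, feature 2] with two classes.
toy = [([1.0, 1.0, 0.0], 0), ([1.0, 0.9, 0.1], 0),
       ([1.0, 0.0, 1.0], 1), ([1.0, 0.1, 0.9], 1)]
model = train(toy, n_classes=2, n_features=3)
print(predict(model, [1.0, 1.0, 0.0]))
```

In practice the feature vectors would be the histograms of words, stylistic features, Wordnet classes, dialogue acts or dominance values for a segment.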
3. INFORMATION ACCESS ASSESSMENT
Assuming a probabilistic information retrieval model, a query $r$ (in our example an activity) predicts a document $d$ with the probability $q(d|r) = q(r|d)\,q(d)/q(r)$. Let $p(d,r)$ be the real probability mass distribution of these quantities. The probability mass function $q(r|d)$ is estimated on a separate training set by a neural network based classifier 6. The quantity we are interested in is the reduction in expected coding length of the document using the neural network based detector 7:
\[ -E_p\left[\log \frac{q(D)}{q(D|R)}\right] \approx H(R) - E_p\left[\log \frac{1}{q(R|D)}\right] \]
The two expectations correspond exactly to the measures in Tab. 5, the first represents the baseline, the second the one for the respective classifier. In more standard information theoretic notation this quantity may be written as:
\[ H(R) - \left( H_p(R|D) + D\big(p(r|d)\,\|\,q(r|d)\big) \right) \]
This equivalence is not extremely useful though since the quantities in parentheses can't be estimated separately. For the small meeting database and SBC however no entropy reductions could be obtained. On the larger databases, on the other hand, entropy reductions could be obtained ($\approx 0.5$ bit on the CallHome Spanish database (Ries et al., 2000), $\approx 1$ bit for the sub-database detection problem in Sec. 4).
6 All quantities involving the neural net $q(r|d)$ have been determined using a round robin approach such that the network is trained on a separate training set.
7 Since estimating $q(d)$ is simple we may assume that $q(d) \approx \sum_r p(d,r)$.
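The entropy reduction defined above can be estimated from held-out data as the difference between the label entropy $H(R)$ and the average of $\log 1/q(R|D)$. The sketch below assumes that $H(R)$ is estimated from the empirical label frequencies of the evaluated segments and that each prediction is a dictionary mapping labels to probabilities; both the data layout and the function names are hypothetical.

```python
import math
from collections import Counter


def label_entropy_bits(labels):
    """Empirical entropy H(R) of the label distribution, in bits (baseline)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def cross_entropy_bits(labels, predicted):
    """Average of log2(1 / q(r|d)) over held-out segments.

    predicted[i] maps each label to the classifier probability q(r|d_i).
    """
    return -sum(math.log2(q[r]) for r, q in zip(labels, predicted)) / len(labels)


def entropy_reduction_bits(labels, predicted):
    """H(R) minus the cross-entropy of the classifier, in bits."""
    return label_entropy_bits(labels) - cross_entropy_bits(labels, predicted)


# Toy example: an uninformative classifier yields no reduction.
truth = ["discussion", "planning", "discussion", "planning"]
uniform = [{"discussion": 0.5, "planning": 0.5}] * 4
print(entropy_reduction_bits(truth, uniform))
```

A classifier that assigns probability zero to the correct label makes the cross-entropy unbounded, so real predictions would need smoothing; a negative result means the classifier is worse than the baseline, which is consistent with no reduction being obtained on the small databases.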
Another option is to assume that the labels of one coder are part of $D$. If the query by the other coder is $R$, we are interested in the reduction of the document entropy given the query. If we furthermore assume that $H(R|D) = H(R|R')$, where $R'$ is the activity label embedded in $D$:
\[ H(D) - H(D|R) = H(R) - H(R|D) = MI(R;R') \]
Tab. 2 shows that the labels of the semi-naive coder and the first author only inform each other by 0.25 to 0.65 bits. However, since all constraints are important to apply, it might be important to include manual annotations to be matched by a query or in a graphical presentation of the output results.
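The mutual information between the labels of two coders can be estimated from co-occurrence counts. The following minimal sketch uses a plug-in estimate over aligned label sequences; it is an illustration under that assumption, with hypothetical names, and such estimates tend to be optimistic on samples as small as the ones used here.

```python
import math
from collections import Counter


def mutual_information_bits(labels_a, labels_b):
    """Plug-in estimate of MI(R; R') in bits from two aligned label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    marg_a = Counter(labels_a)
    marg_b = Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (marg_a[a] * marg_b[b]))
    return mi


# Identical labellings of two equally frequent classes share exactly 1 bit.
print(mutual_information_bits(["x", "y", "x", "y"], ["x", "y", "x", "y"]))
```

The same function applies to the correlation between activity labels and meeting identity discussed in this section, by passing the meeting identifier of each segment as the second sequence.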
Another interesting question to consider is whether the
activity is correlated with the rejoinder or not. This ques-
tion is important since a correlation of the activity with the
rejoinder would mean that the indexing performance of ac-
tivities needs to be compared to other indices that apply to
rejoinders such as attendance, time and place (for results on
the correlation with rejoinders see Waibel et al. (2001)). The
correlation can be measured using the mutual information
between the activity and the meeting identity. The mutual information is moderate for SBC (≈ 0.67 bit) and much lower for the meetings (≈ 0.20 bit). This also corresponds to our intuition since some of the rejoinders in SBC belong to very distinct dialogue genres while the meeting database is homogeneous. The conclusion is that activities are useful for navigation in a rejoinder if the database is homogeneous and they might be useful for finding conversations in a more heterogeneous database.
Type        #    Type        #    Type         #
Talk      344    Edu        25    Finance      8
News      217    Sci        24    Religious    5
Sitcom     97    Series     24    Series-Old   3
Soap       87    Cartoon    23    Infotain     3
Game       46    Movies     22    Music        2
Law        32    Crafts     17    Horror       1
Sports     32    Specials   15
Drama      31    Comedy      9
Table 4: TV show types: The distribution of show types in a large database of TV shows (1067 shows) that has been recorded over a period of a couple of months until April 2000 in Pittsburgh, PA.
4. DETECTION OF SUB-DATABASES
We set up an environment for TV shows that continuously records the subtitles with timestamps from one TV channel; the channel was switched every other day. At the same time the TV program was downloaded from http://tv.yahoo.com/ to obtain programming information including the genre of the show. Yahoo assigns primary and secondary show types, and unless the combination of primary/secondary show type is frequent enough the primary show type is used (Tab. 4). The TV show database has the advantage that we were able to collect a large and varied database with little effort. The same classifier as in Sec. 2 has been used; however, dialogue acts have not been detected since the data contains a lot of noise, is not necessarily conversational and speaker identities can't be determined easily. Detection results for TV shows can be seen in Tab. 5. It may be noted that adding a lot of keywords does improve the detection result but not so much the entropy. It may therefore be assumed that there is a limited dependence between topic and genre, which isn't really a surprise since there are many shows with weekly sequels and there may be some true repeats.
Feature                                accuracy   entropy
Wordnet, stylistic        words
baseline                                 32.2       3.31
•                                        50.9       2.73
•                         50             62.2       2.33
• •                       50             60.0       2.29
• •                                      61.2       2.28
•                                        56.9       2.41
•                         50             61.5       2.25
                          50             61.3       2.35
                          250            62.7       2.17
                          500            66.0       2.14
• •                       500            64.9       2.13
                          5000           67.2       2.08
Table 5: Show type detection: Using the neural network described in Sec. 2 the show type was detected. If there is a number in the word column the word feature is being used. The number indicates how many word/part of speech pairs are in the vocabulary in addition to the parts of speech.
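The word feature with a limited vocabulary can be sketched as follows: the most frequent word/part of speech pairs are kept and every other pair backs off to its part of speech. The data layout (lists of (word, tag) tuples), the tag names and the function names are hypothetical; the tagger itself is not part of the sketch.

```python
from collections import Counter


def build_vocabulary(tagged_corpus, size=50):
    """Return the `size` most frequent (word, tag) pairs of the corpus."""
    counts = Counter(pair for utterance in tagged_corpus for pair in utterance)
    return {pair for pair, _ in counts.most_common(size)}


def word_features(utterance, vocabulary):
    """Keep frequent (word, tag) pairs, back off to the tag for all others."""
    return [f"{w}/{t}" if (w, t) in vocabulary else t for w, t in utterance]


# Toy corpus (not data from the paper), with a vocabulary of a single pair.
corpus = [[("i", "PRP"), ("think", "VBP"), ("so", "RB")],
          [("i", "PRP"), ("agree", "VBP")]]
vocab = build_vocabulary(corpus, size=1)
print(word_features(corpus[1], vocab))
```

Counting the resulting tokens per segment, for example with `Counter(word_features(...))`, gives the histogram that serves as classifier input; larger vocabulary sizes correspond to the 250, 500 and 5000 rows of Tab. 5.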
5. EMOTION AND DOMINANCE
Emotions are displayed in a variety of gestures, some of which are oral and may be detected via automated methods from the audio channel (Polzin, 1999). Using only verbal information the emotions happy, excited and neutral can be detected on the meeting database with 88.1% accuracy while always picking neutral yields 83.6%. This result can be improved to 88.6% by adding pitch and power information. While these experiments were conducted at the utterance level, emotions can be extended to topical segments. For that purpose the emotions of the individual utterances are entered in a histogram over the segment and the vectors are clustered automatically. The resulting clusters roughly correspond to a "neutral", "a little happy" and "somewhat excited" segment. Using the classifier for emotions on the word level the segment can be classified automatically into categories with 83.3% accuracy while the baseline is 68.9%. The entropy reduction by automatically detected emotional activities is ≈ 0.3 bit 8. A similar attempt can be made for dominance (Linell et al., 1988) distributions: Dominance is easy to understand for the user of an information access system and it can be determined automatically with high accuracy.
8 A similar classification result for emotions on the utterance level has been obtained by just using the laughter vs. non-laughter tokens of the transcript as the input. This may indicate that (a) the index should really be the amount of laughter in the conversational segment and that (b) emotions might not be displayed very overtly in meetings. These results however would require a wider sampling of meeting types to be generally acceptable.
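The extension of utterance-level emotions to segments can be illustrated with a small sketch: the emotion labels of the utterances in a segment are turned into a normalized histogram, which is then assigned to the nearest cluster. The clustering method is not specified in this paper, so the centroids below are hypothetical values chosen only to mirror the three clusters named above, and nearest-centroid assignment is an assumption.

```python
from collections import Counter

EMOTIONS = ("neutral", "happy", "excited")

# Hypothetical cluster centroids over (neutral, happy, excited) proportions.
CENTROIDS = {
    "neutral": [0.90, 0.05, 0.05],
    "a little happy": [0.60, 0.35, 0.05],
    "somewhat excited": [0.50, 0.10, 0.40],
}


def emotion_histogram(utterance_emotions):
    """Normalized histogram of utterance-level emotions over one segment."""
    if not utterance_emotions:
        raise ValueError("segment contains no utterances")
    counts = Counter(utterance_emotions)
    total = len(utterance_emotions)
    return [counts[e] / total for e in EMOTIONS]


def nearest_cluster(vector, centroids=CENTROIDS):
    """Name of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda name: sq_dist(vector, centroids[name]))


segment = ["neutral", "neutral", "happy", "excited"]
print(emotion_histogram(segment))
print(nearest_cluster(emotion_histogram(["neutral"] * 5)))
```

Given the observation in footnote 8, a histogram over laughter versus non-laughter tokens could be substituted for the emotion labels without changing the rest of the sketch.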
6. CONCLUSION AND FUTURE WORK
It has been shown that activities can be detected and that they may be efficient indices for access to oral communication. Overall it is easy to make high level distinctions with automated methods, while fine-grained distinctions are hard to make even for humans; on the other hand automatic methods are still able to model some aspects of them (Fig. 3). To obtain a reduction in entropy a relatively large database such as CallHome Spanish is required (120 dialogues). Alternatives to activities might be emotional and dominance distributions that are easier to detect and that may be natural to understand for users. If activities are only used for local navigation support within a rejoinder one could also visualize them by displaying the dialogue act patterns for each channel on a time line.
The author has also observed that topic clusters and activities are largely independent in the meeting domain, resulting in orthogonal indices. Since activities are intuitive for naive users and may be remembered, it can be assumed that users would be able to make use of these constraints. Ongoing work includes the use of speaker activity for dialogue segmentation and further assessment of features for information access. Overall the methods presented here and the ongoing work are improving the ability to index oral communication. It should be noted that some of the techniques presented lend themselves to implementations that don't require (full) speech recognition: Speaker identification and dialogue act identification may be done without an LVCSR system, which would lower the computational requirements and lead to a more robust system.
Figure 3: Detection accuracy summary: The detection of high-level genre as exemplified by the differentiation of corpora can be done with high accuracy using simple features (Ries, 1999). Similarly it was fairly easy to discriminate between male and female speakers on Switchboard (Ries, 1999). Discriminating between sub-genres such as TV show types (Sec. 4) can be done with reasonable accuracy. However it is a lot harder to discriminate between activities within one conversation for personal phone calls (CallHome) (Ries et al., 2000) or for general rejoinders (Santa Barbara) and meetings (Sec. 2).
References
M. M. Bahktin. Speech Genres and other late Essays, chapter Speech Genres. University of Texas Press, Austin, 1986.
D. Biber. Variation across speech and writing. Cambridge University Press, 1988.
E. Brill. A report on recent progress in transformation based error-driven learning. In DARPA Workshop, 1994.
J. Carletta, A. Isard, S. Isard, J. C. Kowtko, G. Doherty-Sneddon, and A. H. Anderson. The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1):13-31, March 1997.
C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
G. Fritz and F. Hundschnur. Handbuch der Dialoganalyse. Niemeyer, Tuebingen, 1994.
D. J. Herrmann. Autobiographical memory and the validity of retrospective reports, chapter The validity of retrospective reports as a function of the directness of retrieval processes, pages 21-31. Springer, 1993.
B. Kessler, G. Nunberg, and H. Schütze. Automatic detection of genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Meeting of the European Chapter of the Association for Computational Linguistics, pages 32-38. Morgan Kaufmann Publishers, San Francisco CA, 1997. URL http://xxx.lanl.gov/abs/cmp-lg/9707002.
P. Linell, L. Gustavsson, and P. Juvonen. Interactional dominance in dyadic communication: a presentation of initiative-response analysis. Linguistics, 26:415-442, 1988.
K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999. URL http://www.cs.cmu.edu/~lafferty/.
T. Polzin. Detecting Verbal and Non-Verbal Cues in the Communication of Emotion. PhD thesis, Carnegie Mellon University, November 1999.
M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proc. of the IEEE Int. Conf. on Neural Networks, pages 586-591, 1993.
K. Ries. Towards the detection and description of textual meaning indicators in spontaneous conversations. In Proceedings of the Eurospeech, volume 3, pages 1415-1418, Budapest, Hungary, September 1999.
K. Ries, L. Levin, L. Valle, A. Lavie, and A. Waibel. Shallow discourse genre annotation in CallHome Spanish. In Proceedings of the International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, May 2000.
A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3), September 2000.
A. Thymé-Gobbel, L. Levin, K. Ries, and L. Valle. Dialogue act, dialogue game, and activity tagging manual for Spanish conversational speech. Technical report, Carnegie Mellon University, 2001. In preparation.
van Bretan, J. Dewe, A. Hallberg, J. Karlgren, and N. Wolkert. Genres defined for a purpose, fast clustering, and an iterative information retrieval interface. In Eighth DELOS Workshop on User Interfaces in Digital Libraries, Långholmen, pages 60-66, October 1998.
A. Waibel, M. Bett, and M. Finke. Meeting browser: Tracking and summarising meetings. In Proceedings of the DARPA Broadcast News Workshop, 1998.
A. Waibel, M. Bett, F. Metze, K. Ries, T. Schaaf, T. Schultz, H. Soltau, H. Yu, and K. Zechner. Advances in automatic meeting record creation and access. In ICASSP, Salt Lake City, Utah, USA, 2001. To appear.
