Proceedings of the Analyzing Conversations in Text and Speech (ACTS) Workshop at HLT-NAACL 2006, pages 42–49,
New York City, New York, June 2006. c©2006 Association for Computational Linguistics
Topic Segmentation of Dialogue 
 
Jaime Arguello Carolyn Rosé 
Language Technologies Institute Language Technologies Institute 
Carnegie Mellon University Carnegie Mellon University 
Pittsburgh, PA 15217 Pittsburgh, PA 15217 
jarguell@andrew.cmu.edu cprose@cs.cmu.edu 
 
 
Abstract 
We introduce a novel topic segmentation 
approach that combines evidence of topic 
shifts from lexical cohesion with linguistic 
evidence such as syntactically distinct fea-
tures of segment initial and final contribu-
tions.  Our evaluation shows that this hy-
brid approach outperforms state-of-the-art 
algorithms even when applied to loosely 
structured, spontaneous dialogue.  Further 
analysis reveals that using dialogue ex-
changes versus dialogue contributions im-
proves topic segmentation quality. 
1 Introduction 
In this paper we explore the problem of topic 
segmentation of dialogue. Use of topic-based mod-
els of dialogue has played a role in information 
retrieval (Oard et al., 2004), information extraction 
(Baufaden, 2001), and summarization (Zechner, 
2001), just to name a few applications. However, 
most previous work on automatic topic segmenta-
tion has focused primarily on segmentation of ex-
pository text. This paper presents a survey of the 
state-of-the-art in topic segmentation technology. 
Using the definition of topic segment from (Pas-
sonneau and Litman, 1993) applied to two different 
dialogue corpora, we present an evaluation includ-
ing a detailed error analysis, illustrating why ap-
proaches designed for expository text do not gen-
eralize well to dialogue.  
We first demonstrate a significant advantage of 
our hybrid, supervised learning approach called 
Museli, a multi-source evidence integration ap-
proach, over competing algorithms. We then ex-
tend the basic Museli algorithm by introducing an 
intermediate level of analysis based on Sinclair and 
Coulthard’s notion of a dialogue exchange (Sin-
clair and Coulthard, 1975). We show that both our 
baseline and Museli approaches obtain a signifi-
cant improvement when using perfect, hand-
labeled dialogue exchanges, typically in the order 
of 2-3 contributions, as the atomic discourse unit in 
comparison to using the contribution as the unit of 
analysis. We further evaluate our success towards 
automatic classification of exchange boundaries 
using the same Museli framework.  
2 Defining Topic 
In the most general sense, the challenge of topic 
segmentation can be construed as the task of find-
ing locations in the discourse where the focus 
shifts from one topic to another. Thus, it is not pos-
sible to address topic segmentation of dialogue 
without first addressing the question of what a 
“topic” is. We began with the goal of adopting a 
definition of topic that meets three criteria. First, it 
should be reproducible by human annotators. Sec-
ond, it should not rely heavily on domain-specific 
knowledge or knowledge of the task structure. Fi-
nally, it should be grounded in generally accepted 
principles of discourse structure.  
The last point addresses a subtle, but important, 
criterion necessary to adequately serve down-
stream applications using our dialogue segmenta-
tion. Topic analysis of dialogue concerns itself 
mainly with thematic content. However, bounda-
ries should be placed in locations that are natural 
turning points in the discourse. Shifts in topic 
should be readily recognizable from surface char-
acteristics of the language. 
With these goals in mind, we adopted a defini-
tion of “topic” that builds upon Passonneau and 
Litman’s seminal work on segmentation of mono-
logue (Passonneau and Litman, 1993).  They found 
that human annotators can successfully accomplish 
a flat monologue segmentation using an informal 
notion of speaker intention. 
42
Dialogue is inherently hierarchical in structure. 
However, a flat segmentation model is an adequate 
approximation. Passonneau and Litman’s pilot 
studies confirmed previously published results 
(Rotondo, 1984) that human annotators cannot re-
liably agree on a hierarchical segmentation of 
monologue. Using a stack-based hierarchical 
model of discourse, Flammia (1998) found that 
90% of all information-bearing dialogue turns re-
ferred to the discourse purpose at the top of the 
stack.  
We adopt a flat model of topic segmentation 
based on discourse segment purpose, where a shift 
in topic corresponds to a shift in purpose that is 
acknowledged and acted upon by both conversa-
tional participants. We place topic boundaries on 
contributions that introduce a speaker’s intention to 
shift the purpose of the discourse, while ignoring 
expressed intentions to shift discourse purposes 
that are not taken up by the other participant. We 
adopt the dialogue contribution as the basic unit of 
analysis, refraining from placing topic boundaries 
within a contribution. This decision is analogous to 
Hearst’s (Hearst, 1994, 1997) decision to shift the 
TextTiling induced boundaries to their nearest ref-
erence paragraph boundary.  
We evaluated the reproducibility of our notion 
of topic segment boundaries by assessing inter-
coder reliability over 10% of the corpus (see Sec-
tion 5.1).  Three annotators were given a 10 page 
coding manual with explanation of our informal 
definition of shared discourse segment purpose as 
well as examples of segmented dialogues.  Pair-
wise inter-coder agreement was above 0.7 for all 
pairs of annotators. 
3 Previous Work 
Existing topic segmentation approaches can be 
loosely classified into two types: (1) lexical cohe-
sion models, and (2) content-oriented models.  The 
underlying assumption in lexical cohesion models 
is that a shift in term distribution signals a shift in 
topic (Halliday and Hassan, 1976). The best known 
algorithm based on this idea is TextTiling (Hearst, 
1997). In TextTiling, a sliding window is passed 
over the vector-space representation of the text. At 
each position, the cosine correlation between the 
upper and lower regions of the sliding window is 
compared with that of the peak cosine correlation 
values to the left and right of the window.  A seg-
ment boundary is predicted when the magnitude of 
the difference exceeds a threshold.    
One drawback to relying on term co-occurrence 
to signal topic continuity is that synonyms or re-
lated terms are treated as thematically-unrelated. 
One proposed solution to this problem is Latent 
Semantic Analysis (LSA) (Landauer and Dumais, 
1997). Two LSA-based algorithms for segmenta-
tion are described in (Foltz, 1998) and (Olney and 
Cai, 2005). Foltz’s approach differs from 
TextTiling mainly in its use of an LSA-based vec-
tor space model. Olney and Cai address a problem 
not addressed by TextTiling or Foltz’s approach, 
which is that cohesion is not just a function of the 
repetition of thematically-related terms, but also a 
function of the presentation of new information in 
reference to information already presented. Their 
orthonormal basis approach allows for segmenta-
tion based on relevance and informativity.  
Content-oriented models, such as (Barzilay and 
Lee, 2004), rely on the re-occurrence of patterns of 
topics over multiple realizations of thematically 
similar discourses, such as a series of newspaper 
articles about similar events. Their approach util-
izes a hidden Markov model where states corre-
spond to topics and state transition probabilities 
correspond to topic shifts. To obtain the desired 
number of topics (states), text spans of uniform 
length (individual contributions, in our case) are 
clustered. Then, state emission probabilities are 
induced using smoothed cluster-specific language 
models. Transition probabilities are induced by 
considering the proportion of documents in which 
a contribution assigned to the source cluster (state) 
immediately precedes a contribution assigned to 
the target cluster (state). Following an EM-like 
approach, contributions are reassigned to states 
until the algorithm converges. 
4 Overview of Museli Approach 
We cast the segmentation problem as a binary 
classification problem where each contribution is 
classified as NEW_TOPIC if it introduces a new 
topic and SAME_TOPIC otherwise. In our hybrid 
Museli approach, we combined lexical cohesion 
with features that have the potential to capture 
something about the linguistic style that marks 
shifts in topic. Table 1 lists our features.  
 
 
 
43
Feature Description 
Lexical  
Cohesion 
Cosine correlation of adjacent 
regions in the discourse. Term 
vectors of adjacent regions are 
stemmed and stopwords are re-
moved. 
Word-
unigram 
Unigrams in previous and cur-
rent contributions 
Word-bigram Bigrams in previous and current 
contributions 
Punctuation Punctuation of previous and cur-
rent contributions. 
Part-of-
Speech (POS)  
Bigram 
POS-Bigrams in previous and 
current contributions.  
Time  
Difference 
Time difference between previ-
ous and current contribution, 
normalized by: 
(X – MIN)/ (MAX – MIN), 
where X corresponds to this time 
difference and MIN & MAX are 
with respect to the whole corpus. 
Content  
Contribution 
Binary-valued, is there a non-
stopword term in the current 
contribution? 
Contribution 
Length 
Number of words in the current 
contribution, normalized by:  
(X – MIN) / (MAX – MIN). 
Previous 
Agent
1
Binary-valued, was the speaker 
of the previous contribution the 
student or the tutor? 
Table 1. Museli Features. 
 
  We found that using a Naïve Bayes classifier 
with an attribute selection wrapper using the chi-
square test for ranking attributes performed better 
than other state-of-the-art machine learning algo-
rithms on our task, perhaps because of the evi-
dence integration oriented nature of the problem.  
We conducted our evaluation using 10-fold cross-
validation, being careful not to include instances 
from the same dialogue in both the training and 
test sets on any fold to avoid biasing the trained 
model with idiosyncratic communicative patterns 
associated with individual dialogue participants.  
To capitalize on differences in conversational 
behavior between participants assigned to different 
                                                 
1
  The current contribution’s agent is implicit in the fact that 
we learn separate models for each agent-role (student & tutor). 
roles in the conversation (i.e., student and tutor), 
we learn separate models for each role. This deci-
sion is motivated by observations that participants 
with different speaker-roles, each with different 
goals in the conversation, introduce topics with a 
different frequency, introduce different types of 
topics, and may introduce topics in a different style 
that displays their status in the conversation. For 
instance, a tutor may be more likely to introduce 
new topics with a contribution that ends with an 
imperative. A student may be more likely to intro-
duce new topics with a contribution that ends with 
a wh-question. Dissimilar agent-roles also occur in 
other domains such as Travel Agent and Customer 
in flight booking scenarios. 
Using the complete set of features enumerated 
above, we perform feature selection on the training 
data for each fold of the cross-validation sepa-
rately, training a model with the top 1000 features, 
and applying that trained model to the test data.  
Examples of high ranking features output by our 
chi-squared feature selection wrapper confirm our 
intuition that initial and final contributions of a 
segment are marked differently. Moreover, the 
highest ranked features are different for our two 
speaker-roles. Some features highly-correlated 
with student-initiated segments are am_trying, 
should, what_is, and PUNCT_question, which re-
late to student questions and requests for informa-
tion. Some features highly-correlated with tutor-
initiated segments include ok_lets, do, see_what, 
and BEGIN_VERB (the POS of the first word in 
the contribution is VERB), which characterize im-
peratives, and features such as now, next, and first, 
which characterize instructional task ordering. 
5 Evaluation   
We evaluate Museli in comparison to the best 
performing state-of-the-art approaches, demon-
strating that our hybrid Museli approach out-
performs all of these approaches on two different 
dialogue corpora by a statistically significant mar-
gin (p < .01), in one case reducing the probability 
of error, as measured by P
k 
(Beeferman et al., 
1999), to about 10%. 
5.1 Experimental Corpora 
We used two different dialogue corpora from the 
educational domain for our evaluation. Both cor-
pora constitute of dialogues between a student and 
44
a tutor (speakers with asymmetric roles) and both 
were collected via chat software.  The first corpus, 
which we call the Olney & Cai corpus, is a set of 
dialogues selected randomly from the same corpus 
Olney and Cai obtained their corpus from (Olney 
and Cai, 2005). The dialogues discuss problems 
related to Newton’s Three Laws of Motion. The 
second corpus, the Thermo corpus, is a locally col-
lected corpus of thermodynamics tutoring dia-
logues, in which tutor-student pairs work together 
to solve an optimization task. Table 2 shows cor-
pus statistics from both corpora. 
 
 Olney & Cai 
 Corpus 
Thermo 
Corpus 
#Dialogues 42 22 
Conts./Dialogue 195.40 217.90 
Conts./Topic 24.00 13.31 
Topics/Dialogue 8.14 16.36 
Words/Cont. 28.63 5.12 
Student Conts. 4113 1431 
Tutor Conts. 4094 3363 
Table 2. Evaluation Corpora Statistics  
 
Both corpora seem adequate for attempting to 
harness systematic differences in how speakers 
with asymmetric roles may initiate or close topic 
segments. The Thermo corpus is particularly ap-
propriate for addressing the research question of 
how to automatically segment natural, spontaneous 
dialogue. The exploratory task is more loosely 
structured than many task-oriented domains inves-
tigated in the dialogue community, such as flight 
reservation or meeting scheduling. Students can 
interrupt with questions and tutors can digress in 
any way they feel may benefit the completion of 
the task. In the Olney and Cai corpus, the same 10 
physics problems are addressed in each session and 
the interaction is almost exclusively a tutor initia-
tion followed by student response, evident from the 
nearly equal number of student and tutor contribu-
tions. 
5.2 Baseline Approaches 
We evaluate Museli against the following four 
algorithms: (1) Olney and Cai (Ortho), (2) Barzilay 
and Lee (B&L), (3) TextTiling (TT), and (4) Foltz.  
As opposed to the other baseline algorithms, 
(Olney and Cai, 2005) applied their orthonormal 
basis approach specifically to dialogue, and prior 
to this work, report the highest numbers for topic 
segmentation of dialogue. Barzilay and Lee’s ap-
proach is the state of the art in modeling topic 
shifts in monologue text. Our application of B&L 
to dialogue attempts to harness any existing and 
recognizable redundancy in topic-flow across our 
dialogues for the purpose of topic segmentation.  
We chose TextTiling for its seminal contribution 
to monologue segmentation. TextTiling and Foltz 
consider lexical cohesion as their only evidence of 
topic shifts. Applying these approaches to dialogue 
segmentation sheds light on how term distribution 
in dialogue differs from that of expository mono-
logue text (e.g. news articles). The Foltz and Ortho 
approaches require a trained LSA space, which we 
prepared the same way as described in (Olney and 
Cai, 2005). Any parameter tuning for approaches 
other than our Museli was computed over the en-
tire test set, giving baseline algorithms the maxi-
mum advantage.  
In addition to these approaches, we include 
segmentation results from three degenerate ap-
proaches: (1) classifying all contributions as 
NEW_TOPIC (ALL), (2) classifying no contribu-
tions as NEW_TOPIC (NONE), and (3) classifying 
contributions as NEW_TOPIC at uniform intervals 
(EVEN), separated by the average reference topic 
length (see Table 2). 
As a means for comparison, we adopt two 
evaluation metrics: P
k
 and f-measure. An extensive 
argument in support of P
k
’s robustness (if k is set 
to ½ the average reference topic length) is pre-
sented in (Beeferman, et al. 1999).  P
k 
measures the 
probability of misclassifying two contributions a 
distance of k contributions apart, where the classi-
fication question is are the two contributions part 
of the same topic segment or not?  P
k 
is the likeli-
hood of misclassifying two contributions, thus 
lower P
k
 values are preferred over higher ones. It 
equally captures the effect of false-negatives and 
false-positives and favors predictions that that are 
closer to the reference boundaries. F-measure pun-
ishes false positives equally, regardless of their 
distance to reference boundaries.  
5.3 Results 
Table 3 shows our evaluation results.  Note that 
lower values of P
k
 are preferred over higher ones. 
The opposite is true of F-measure.  In both cor-
pora, the Museli approach performed significantly 
better than all other approaches (p < .01).  
 
45
 Olney and Cai 
Corpus 
Thermo Corpus 
 P
k
F P
k
F 
NONE 0.4897 -- 0.4900 -- 
ALL 0.5180 -- 0.5100 -- 
EVEN 0.5117 -- 0.5131 -- 
TT 0.6240 0.1475 0.5353 0.1614 
B&L 0.6351 0.1747 0.5086 0.1512 
Foltz 0.3270 0.3492 0.5058 0.1180 
Ortho 0.2754 0.6012 0.4898 0.2111 
Museli 0.1051 0.8013 0.4043 0.3693 
Table 3. Results on both corpora 
5.4 Error Analysis 
Results for all approaches are better on the Ol-
ney and Cai corpus than the Thermo corpus. The 
Thermo corpus differs profoundly from the Olney 
and Cai corpus in ways that very likely influenced 
the performance. For instance, in the Thermo cor-
pus each dialogue contribution is on average 5 
words long, whereas in the Olney and Cai corpus 
each dialogue contribution contains an average of 
28 words. Thus, the vector space representation of 
the dialogue contributions is more sparse in the 
Thermo corpus, which makes shifts in lexical co-
herence less reliable as topic shift indicators.   
In terms of P
k
, TextTiling (TT) performed worse 
than the degenerate algorithms. TextTiling meas-
ures the term overlap between adjacent regions in 
the discourse. However, dialogue contributions are 
often terse or even contentless. This produces 
many islands of contribution-sequences for which 
the local lexical coherence is zero. TextTiling 
wrongly classifies all of these as starts of new top-
ics. A heuristic improvement to prevent TextTiling 
from placing topic boundaries at every point along 
a sequence of contributions failed to produce a sta-
tistically significant improvement. 
The Foltz and the Ortho approaches rely on LSA 
to provide strategic semantic generalizations capa-
ble of detecting shifts in topic. Following (Olney 
and Cai, 2005), we built our LSA space using dia-
logue contributions as the atomic text unit.  In cor-
pora such as the Thermo corpus, however, this may 
not be effective due to the brevity of contributions. 
Barzilay and Lee’s algorithm (B&L) did not 
generalize well to either dialogue corpus. One rea-
son could be that probabilistic methods, such as 
their approach, require that reference topics have 
significantly different language models, which was 
not true in either of our evaluation corpora. We 
also noticed a number of instances in the dialogue 
corpora where participants referred to information 
from previous topic segments, which consequently 
may have blurred the distinction between the lan-
guage models assigned to different topics. 
6 Dialogue Exchanges  
Although results are reliably better than our 
baseline algorithms in both corpora, there is much 
room for improvement, especially in the more 
spontaneous Thermo corpus. We believe that an 
improvement can come from a multi-layer segmen-
tation approach, where a first pass segments a dia-
logue into dialogue exchanges and a second classi-
fier assigns topic shifts based on exchange initial 
contributions. Dialogue is hierarchical in nature. 
Topic and topic shift comprise only one of the 
many lenses through which dialogue behaves in 
seemingly structured ways. Thus, it seems logical 
that exploiting more fine-grained sub-parts of dia-
logue than our definition of topic might help us do 
better at predicting shifts in topic.  One such sub-
part of dialogue is the notion of dialogue exchange, 
typically between 2-3 contributions. 
Stubbs (1983) motivates the definition of an ex-
change with the following observation. In theory, 
there is no limit to the number of possible re-
sponses to the clause “Is Harry at home?”. How-
ever, constraints are imposed on the interpretation 
of the contribution that follows it: yes or no. Such a 
constraint is central to the concept of a dialogue 
exchange. Informally, an exchange is made from 
an initiation, for which the possibilities are open-
ended, followed by dialogue contributions that are 
pre-classified and thus increasingly restricted. A 
contribution is part of the next exchange when the 
constraint on its communicative act is lifted.  
Sinclair and Coulthard (1975) introduce a more 
formal definition of exchange with their Initiative-
Response-Feedback or IRF structure. An initiation 
produces a response and a response happens as 
direct consequence to an initiation. Feedback 
serves to close an exchange. Sinclair and Coulthard 
posit that if exchanges constitute the minimal unit 
of interaction, IRF is a primary structure of interac-
tive discourse in general.  
To measure the benefits of exchange boundaries 
in detecting topic shift in dialogue, we coded the 
Thermo corpus with exchanges following Sinclair 
46
and Coulthard’s IRF structure. The coder who la-
beled dialogue exchanges had no knowledge of our 
definition of topic or our intention to do topic-
analyses of the corpus. Any correlation between 
exchange boundaries and topic boundaries is not a 
bias introduced during the hand-labeling process.     
7 Topic Segmentation with Exchanges 
In our corpus, as we believe is true in domain-
general dialogue, knowledge of an exchange-
boundary increases the probability of a topic-
boundary significantly. One way to quantify this 
relation is with the following observation. In our 
experimental Thermo corpus, there are 4794 dia-
logue contributions, 360 topic shifts, and 1074 ex-
change shifts. Using maximum likelihood estima-
tion, the likelihood of being correct if we say that a 
randomly chosen contribution is a topic shift is 
0.075 (# topic shifts / # contributions). However, 
the likelihood of being correct if we have prior 
knowledge that an exchange-shift also occurs in 
that contribution is 0.25. Thus, knowledge that the 
contribution introduces a new exchange increases 
our confidence that it also introduces a new topic. 
More importantly, the probability that a contribu-
tion does not mark a topic shift, given that it does 
not mark an exchange-shift, is 0.98. Thus, ex-
changes show great promise in narrowing the 
search-space of tentative topic shifts. 
In addition to possibly narrowing the space of 
tentative topic-boundaries, exchanges are helpful 
in that they provide more coarse-grain building 
blocks for segmentation algorithms that rely on 
term-distribution as a proxy for dialogue coher-
ence, such as TextTiling (Hearst, 1994, 1997), the 
Foltz algorithm (Foltz, 1998), Orthonormal Basis 
(Olney and Cai, 2005), and Barzilay and Lee’s 
content modeling approach (Barzilay and Lee, 
2004).  At the heart of all these approaches is the 
assumption that a change in term distribution sig-
nals a shift in topic. When applied to dialogue, the 
major weakness of these approaches is that contri-
butions are often times contentless: terse and ab-
sent of thematically meaningful terms. Thus, a 
more coarse-grained discourse unit is needed.  
8 Barzilay and Lee with Exchanges 
Barzilay and Lee (2004) offer an attractive 
frame work for constructing a context-specific 
Hidden Markov Model (HMM) of topic drift. In 
our initial evaluation, we used dialogue contribu-
tions as the atomic discourse unit. Using contribu-
tions, our application of Barzilay and Lee’s algo-
rithm for segmenting dialogue fails at least in part 
because the model learns states that are not the-
matically meaningful, but instead relate to other 
systematic phenomena in dialogue, such as fixed 
expressions and discourse cues. Figure 1 shows the 
cluster (state) size distribution in terms of the per-
centage of the total discourse units (exchanges vs. 
contributions) in the Thermo corpus assigned to 
each cluster. In the horizontal axis, clusters (states) 
are sorted by size from largest to smallest.  
 
% of Total Discourse Units per Cluster
(clusters sorted by size, largest-to-smallest)
0%
10%
20%
30%
40%
50%
60%
70%
80%
1234567891011213141516
Cluster Rank 
%
 o
f
 
D
i
sco
u
r
se
 U
n
i
t
s 
i
n
 C
l
u
s
t
e
r
CONTRIBUTIONS EXCHA NGES
 
Figure 1. Exchanges produce a more evenly dis-
tributed cluster size distribution. 
   
The largest cluster contains 70% of all contribu-
tions in the corpus. The second largest cluster only 
generates 10% of the contributions. In contrast, 
when using exchanges as the atomic unit, the clus-
ter size distribution is less skewed and corresponds 
more closely to a topic analysis performed by a 
domain expert.  In this analysis, the number of de-
sired cluster (states), which is an input to the algo-
rithm, was set to 16, the same number identified in 
a domain expert’s analysis of the Thermo corpus. 
Examples of such topics include high-level ones 
such as greeting, setup initialization, and general 
thermo concepts, as well as task-specific ones like 
sensitivity analysis and regeneration. 
A closer examination of the clusters (states) con-
firms our intuition that systematic topic-
independent phenomena in dialogue, coupled with 
the terse nature of contributions in spontaneous 
dialogue, leads to an overly skewed cluster size 
distribution. Examining the terms with the highest 
emission probabilities, the largest states contain 
47
topical terms like cycle, efficiency, increase, qual-
ity, plot, and turbine intermixed with terms like 
think, you, right, make, yeah, fine, and ok. Also the 
sets of topical terms in these larger states do not 
seem coherent with respect to the expert induced 
topics. This suggests that thematically ambiguous 
fixed expressions blur the distinction between the 
different topic-centered language models, produc-
ing an overly heavy-tailed cluster size distribution. 
One might argue that a possible solution to this 
problem would be to remove these fixed expres-
sions as part of pre-processing. However, that re-
quires knowledge of the particular domain and 
knowledge of the interaction style characteristic to 
the context. We believe that a more robust solution 
is to use exchanges as the atomic unit of discourse. 
9 Evaluation with Exchanges 
To show the value of dialogue exchanges in 
topic segmentation, in this section we re-formulate 
our problem from classifying contributions into 
NEW_TOPIC and SAME_TOPIC to classifying 
exchange initial contributions into NEW_TOPIC 
and SAME_TOPIC. For all algorithms, we con-
sider only predictions that coincide with hand-
coded exchange initial contributions. We show 
that, except for our own Museli approach, using 
exchange boundaries improves segmentation qual-
ity across all algorithms (p < .05) when compared 
to their respective counterparts that ignore ex-
changes. Using exchanges gives the Museli ap-
proach a significant advantage based on F-measure 
(p < .05), but only a marginally significant advan-
tage based on P
k.
 These results confirm our intui-
tion that what gives our Museli approach an advan-
tage over baseline algorithms is its ability to har-
ness the lexical, syntactic, and phrasal cues that 
mark shifts in topic. Given that shift-in-topic corre-
lates highly with shift-in-exchange, these features 
are discriminatory in both respects.  
 Of the degenerate strategies in section 5.2, only 
ALL lends itself to our reformulation of the topic 
segmentation problem. For the ALL heuristic, we 
classify all exchange initial contributions into 
NEW_TOPIC.  This degenerate heuristic alone 
produces better results than all algorithms classify-
ing utterances (Table 4). In our implementation of 
TextTiling (TT) with exchanges, we only consider 
predictions on contributions that coincide with ex-
change initial contributions, while ignoring predic-
tions made on contributions that do not introduce a 
new exchange. Consistent with our evaluation 
methodology from Section 5, we optimized the 
window size using the entire corpus and found an 
optimal window size of 13 contributions. Without 
exchanges, the optimal window size was 6 contri-
butions. The higher optimal window-size hints to 
the possibility that by using exchange initial con-
tributions an approach based on lexical cohesion 
may broaden its horizon without losing precision. 
 
 Thermo Corpus 
(Contributions) 
Thermo Corpus 
(Exchanges) 
 P
k
F P
k
F 
NONE 0.4900 -- N/A -- 
ALL 0.5100 -- 0.4398 0.3809 
EVEN 0.5132 -- N/A -- 
TT 0.5353 0.1614 0.4328 0.3031 
B&L 0.5086 0.1512 0.3817 0.3840 
Foltz 0.5058 0.1180 0.4242 0.3296 
Ortho 0.4898 0.2111 0.4398 0.3813 
Museli 0.4043 0.3693 0.3737 0.3897 
Table 4. Results using perfect exchange boundaries 
 
In this version of B&L, we use exchanges to 
build the initial clusters (states) and the final 
HMM. B&L with exchanges significantly im-
proves over B&L with contributions, in terms of 
both P
k 
and F-measure (p < .005) and significantly 
improves over our ALL heuristic (where all ex-
change initial contributions introduce a new topic) 
in terms of P
k 
(p < .0005). Thus, its use of ex-
changes goes beyond merely narrowing the space 
of possible NEW_TOPIC contributions: it also 
uses these more coarse-grained discourse units to 
build a more thematically-motivated topic model.  
Foltz’s and Olney and Cai’s (Ortho) approach 
both use an LSA space trained on the dialogue 
corpus. Instead of training the LSA space with in-
dividual contributions, we train the LSA space us-
ing exchanges. We hope that by training the space 
with more contentful text units LSA might capture 
more topically-meaningful semantic relations.  In 
addition, only exchange initial contributions where 
used for the logistic regression training phase. 
Thus, we aim to learn the regression equation that 
best discriminates between exchange initial contri-
butions that introduce a topic and those that do not. 
Both Foltz and Ortho improve over their non ex-
change counterparts, but neither improves over the 
ALL heuristic by a significant margin.  
48
For Museli with exchanges, we tried both train-
ing the model using only exchange initial contribu-
tions, and applying our previous model to only ex-
change initial contributions. Training our models 
using only exchange initial contributions produced 
slightly worse results. We believe that the reduc-
tion of the amount of training data prevents our 
models from learning good generalizations. Thus, 
we trained our models using contributions (as in 
Section 5) and consider predictions only on ex-
change initial contributions. The Museli approach 
offers a significant advantage over TT in terms of 
P
k
 and F-measure. Using perfect-exchanges, it is 
not significantly better than Barzilay and Lee. It is 
significantly better than Foltz’s approach based on 
F-measure and significantly better than Olney and 
Cai based on P
k
 (p < .05). 
These experiments used hand coded exchange 
boundaries.  We also evaluated our ability to 
automatically predict exchange boundaries.  On the 
Thermo corpus, Museli was able to predict ex-
change boundaries with precision = 0.48, recall = 
0.62, f-measure = 0.53, and P
k
 = 0.14. 
10 Conclusions and Current Directions 
In this paper we addressed the problem of auto-
matic topic segmentation of spontaneous dialogue.  
We demonstrated with an empirical evaluation that 
state-of-the-art approaches fail on spontaneous dia-
logue because term distribution alone fails to pro-
vide adequate evidence of topic shifts in dialogue.   
We have presented a supervised learning algo-
rithm for topic segmentation of dialogue called 
Museli that combines linguistic features signaling a 
contribution’s function with local context indica-
tors. Our evaluation on two distinct corpora shows 
a significant improvement over the state-of-the-art 
algorithms. We have also demonstrated that a sig-
nificant improvement in performance of state-of-
the-art approaches to topic segmentation can be 
achieved when dialogue exchanges, rather than 
contributions, are used as the basic unit of dis-
course.  We demonstrated promising results in 
automatically identifying exchange boundaries. 
Acknowledgments 
This work was funded by Office of Naval Re-
search, Cognitive and Neural Science Division; 
grant number N00014-05-1-0043. 
 
References  
Regina Barzilay and Lillian Lee. 2004. Catching the 
drift: Probabilistic Content Models, with Applica-
tions to Generation and Summarization. In Proceed-
ings of HLT-NAACL, 113 - 120.  
Doug Beeferman, Adam Berger, John D. Lafferty. 1999.  
Statistical Models for Text Segmentation. Machine 
Learning, 34 (1-3): 177-210. 
Narjès Boufaden, Guy Lapalme, Yoshua Bengio. 2001. 
Topic Segmentation: A first stage to Dialog-based In-
formation Extraction. In Proceedings of NLPRS.  
Giovanni Flammia. 1998. Discourse Segmentation of 
Spoken Dialogue, PhD Thesis. Massachusetts Insti-
tute of Technology.  
Peter Foltz, Walter Kintsch, and Thomas Landauer. 
1998. The measurement of textual cohesion with 
LSA. Discourse Processes, 25, 285-307. 
Michael Halliday and Ruqaiya Hasan. 1976. Cohesion 
in English. London: Longman. 
Marti Hearst. 1997. TextTiling: Segmenting Text into 
Multi-Paragragh Subtopic Passages. Computational 
Linguistics, 23(1), 33 – 64.   
Thomas Landauer and Susan Dumais. A Solution to 
Plato’s Problem: The Latent Semantic Analysis of 
Acquisition, Induction, and Representation of 
Knowledge. Psychological Review, 104, 221-240.  
Douglas Oard, Bhuvana Ramabhadran, and Samuel 
Gustman. 2004. Building an Information Retrieval 
Test Collection for Spontaneous Conversational 
Speech. In Proceedings of SIGIR. 
Andrew Olney and Zhiqiang Cai. 2005. An Orthonor-
mal Basis for Topic Segmentation of Tutorial Dia-
logue. In Proceedings of HLT/EMNLP. 971-978. 
Rebecca Passonneau and Diane Litman. 1993. Inten-
tion-Based Segmentation: Human Reliability and 
Correlation with Linguistic Cues. In Proceedings of 
ACL, 148 – 155.  
John Rotondo, 1984, Clustering Analysis of Subject 
Partitions of Text. Discourse Processes, 7:69-88 
John Sinclair and Malcolm Coulthard. 1975. Towards 
an Analysis of Discourse: the English Used by 
Teachers and Pupils. Oxford University Press.  
Michael Stubbs. 1983. Discourse Analysis. A Sociolin-
guistic Analysis of Natural Language. Basil Black-
well.  
Klaus Zechner. 2001. Automatic Summarization of Spo-
ken Dialogues in Unrestricted Domains. Ph.D. The-
sis. Carnegie Mellon University
49
