Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 144–151,
Sydney, July 2006. ©2006 Association for Computational Linguistics
An Analysis of Quantitative Aspects in the Evaluation of Thematic
Segmentation Algorithms
Maria Georgescul
ISSCO/TIM, ETI
University of Geneva
1211 Geneva, Switzerland
maria.georgescul@eti.unige.ch
Alexander Clark
Department of Computer Science
Royal Holloway University of London
Egham, Surrey TW20 0EX, UK
alexc@cs.rhul.ac.uk
Susan Armstrong
ISSCO/TIM, ETI
University of Geneva
1211 Geneva, Switzerland
susan.armstrong@issco.unige.ch
Abstract
We consider here the task of linear the-
matic segmentation of text documents, by
using features based on word distributions
in the text. For this task, a typical and of-
ten implicit assumption in previous stud-
ies is that a document has just one topic
and therefore many algorithms have been
tested and have shown encouraging results
on artificial data sets, generated by putting
together parts of different documents. We
show that evaluation on synthetic data is
potentially misleading and fails to give an
accurate evaluation of the performance on
real data. Moreover, we provide a criti-
cal review of existing evaluation metrics in
the literature and we propose an improved
evaluation metric.
1 Introduction
The goal of thematic segmentation is to iden-
tify boundaries of topically coherent segments
in text documents. Giving a rigorous definition
of the notion of topic is difficult, but the task
of discourse/dialogue segmentation into thematic
episodes is usually described by invoking an “in-
tuitive notion of topic” (Brown and Yule, 1998).
Thematic segmentation also relates to several no-
tions such as speaker’s intention, topic flow and
cohesion.
Since it remains unclear what mental representations humans use
to recognise a coherent text, different surface markers
(Hirschberg and Nakatani, 1996; Passonneau and Litman, 1997)
and external knowledge sources (Kozima and Furugori, 1994)
have been exploited for the purpose of automatic thematic
segmentation. Halliday and Hasan (1976) claim that the text
meaning is realised through certain language resources and they
refer to these resources by the term cohesion. The major classes
of such text-forming resources identified by Halliday and Hasan
(1976) are: substitution, ellipsis, conjunction, reiteration and
collocation. In this paper, we examine one form of
lexical cohesion, namely lexical reiteration.
Following some of the most prominent discourse
theories in the literature (Grosz and Sidner,
1986; Marcu, 2000), a hierarchical representation
of the thematic episodes can be proposed. The
basis for this is the idea that topics can be re-
cursively divided into subtopics. Real texts ex-
hibit a more intricate structure, including ‘seman-
tic returns’ by which a topic is suspended at one
point and resumed later in the discourse. However,
we focus here on a reduced segmentation prob-
lem, which involves identifying non-overlapping
and non-hierarchical segments at a coarse level of
granularity.
Thematic segmentation is a valuable initial
tool in information retrieval and natural language
processing. For instance, in information ac-
cess systems, smaller and coherent passage re-
trieval is more convenient to the user than whole-
document retrieval and thematic segmentation has
been shown to improve the passage-retrieval per-
formance (Hearst and Plaunt, 1993). In cases such
as collections of transcripts there are no headers
or paragraph markers. Therefore a clear separa-
tion of the text into thematic episodes can be used
together with highlighted keywords as a kind of
‘quick read guide’ to help users to quickly navi-
gate through and understand the text. Moreover
automatic thematic segmentation has been shown
to play an important role in automatic summarization
(Mani, 2001), anaphora resolution and
discourse/dialogue understanding.
In this paper, we concern ourselves with the task
of linear thematic segmentation and are interested
in finding out whether different segmentation sys-
tems can perform well on artificial and real data
sets without specific parameter tuning. In addi-
tion, we will refer to the implications of the choice
of a particular error metric for evaluation results.
This paper is organized as follows. Section 2
and Section 3 describe various systems and, re-
spectively, different input data selected for our
evaluation. Section 4 presents several existing
evaluation metrics and their weaknesses, as well
as a new evaluation metric that we propose. Sec-
tion 5 presents our experimental set-up and shows
comparisons between the performance of different
systems. Finally, some conclusions are drawn in
Section 6.
2 Comparison of Systems
Combinations of different features (derived, for example,
from linguistic and prosodic information) have
been explored in previous studies such as Galley et
al. (2003) and Kauchak and Chen (2005). In
this paper, we selected for comparison three sys-
tems based merely on the lexical reiteration fea-
ture: TextTiling (Hearst, 1997), C99 (Choi, 2000)
and TextSeg (Utiyama and Isahara, 2001). In the
following, we briefly review these approaches.
2.1 TextTiling Algorithm
The TextTiling algorithm was initially developed
by Hearst (1997) for segmentation of exposi-
tory texts into multi-paragraph thematic episodes
having a linear, non-overlapping structure (as re-
flected by the name of the algorithm). TextTiling
is widely used as a de-facto standard in the eval-
uation of alternative segmentation systems, e.g.
(Reynar, 1998; Ferret, 2002; Galley et al., 2003).
The algorithm can briefly be described by the fol-
lowing steps.
Step 1 includes stop-word removal, lemmatiza-
tion and division of the text into ‘token-sequences’
(i.e. text blocks having a fixed number of words).
Step 2 determines a score for each gap between
two consecutive token-sequences, by computing
the cosine similarity (Manning and Schütze, 1999)
between the two vectors representing the frequen-
cies of the words in the two blocks.
Step 3 computes a ‘depth score’ for each token-
sequence gap, based on the local minima of the
score computed in step 2.
Step 4 consists in smoothing the scores.
Step 5 chooses from any potential boundaries
those that have the scores smaller than a certain
‘cutoff function’, based on the average and stan-
dard deviation of score distribution.
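Steps 2 and 3 can be sketched in Python as follows. This is only an illustration under our own assumptions, not Hearst's implementation: the function names, the token-sequence size `w`, and the rule of climbing to the nearest peak on each side of a gap are hypothetical choices, and the smoothing (step 4) and cutoff (step 5) are omitted.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gap_scores(tokens, w=20):
    """Step 2: similarity at each gap between consecutive token-sequences."""
    blocks = [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
    return [cosine(blocks[g - 1], blocks[g]) for g in range(1, len(blocks))]

def depth_scores(scores):
    """Step 3: depth of each gap relative to the nearest peak on each side."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1          # climb leftwards while the score does not drop
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1         # climb rightwards while the score does not drop
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths

# Toy input (hypothetical): two blocks about "a", then two blocks about "b".
tokens = ["a"] * 40 + ["b"] * 40
scores = gap_scores(tokens, w=20)
depths = depth_scores(scores)
```

In this toy input the deepest valley is at the middle gap, which is where the vocabulary changes.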
2.2 C99 Algorithm
The C99 algorithm (Choi, 2000) makes a linear
segmentation based on a divisive clustering strat-
egy and the cosine similarity measure between any
two minimal units. More exactly, the algorithm
consists of the following steps.
Step 1: after the division of the text into minimal
units (in our experiments, the minimal unit is an
utterance; see footnote 1), stop words are removed and a
stemmer is applied.
Step 2: a similarity matrix $S_{m \times m}$ is constructed,
where m is the number of utterances and an element $s_{ij}$ of
the matrix corresponds to the cosine similarity between the vectors
representing the frequencies of the words in the i-th utterance
and the j-th utterance.
Step 3: a ‘rank matrix’ $R_{m \times m}$ is computed by
determining, for each pair of utterances, the number of
neighbors in $S_{m \times m}$ with a lower similarity value.
In the final step, the location of thematic bound-
aries is determined by a divisive top-down cluster-
ing procedure. The criterion for division of the
current segment B into b1,...bm subsegments is
based on the maximisation of a ‘density’ D, com-
puted for each potential repartition of boundaries
as
\[ D = \frac{\sum_{k=1}^{m} \mathrm{sum}_k}{\sum_{k=1}^{m} \mathrm{area}_k}, \]
where sum_k and area_k refer to the sum of ranks
and the area of the k-th segment in B, respectively.
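The rank matrix and one divisive step can be sketched as follows. This is a minimal illustration rather than Choi's code: the neighbourhood `radius`, the function names and the representation of a segmentation as a list of exclusive segment end indices are our assumptions.

```python
def rank_matrix(S, radius=1):
    """Replace each similarity by the number of neighbours with a lower value."""
    m = len(S)
    R = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for a in range(max(0, i - radius), min(m, i + radius + 1)):
                for b in range(max(0, j - radius), min(m, j + radius + 1)):
                    if (a, b) != (i, j) and S[a][b] < S[i][j]:
                        R[i][j] += 1
    return R

def density(R, ends):
    """D = sum of ranks inside the segments / sum of segment areas.
    `ends` lists the exclusive end index of every segment, e.g. [2, 4]."""
    total_sum = total_area = 0
    start = 0
    for end in ends:
        total_sum += sum(R[a][b] for a in range(start, end)
                         for b in range(start, end))
        total_area += (end - start) ** 2
        start = end
    return total_sum / total_area

def best_split(R, ends):
    """One divisive step: add the single boundary that maximises D."""
    best = None
    for p in range(1, len(R)):
        if p in ends:
            continue
        candidate = sorted(ends + [p])
        d = density(R, candidate)
        if best is None or d > best[0]:
            best = (d, candidate)
    return best

# Toy similarity matrix (hypothetical): utterances 0-1 and 2-3 form two topics.
S = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.9],
     [0.1, 0.1, 0.9, 1.0]]
R = rank_matrix(S)
split = best_split(R, [4])
```

Repeating `best_split` on its own output gives the top-down procedure; a stopping criterion is needed when the number of segments is not given and is left out here.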
2.3 TextSeg Algorithm
The TextSeg algorithm (Utiyama and Isahara,
2001) implements a probabilistic approach to de-
termine the most likely segmentation, as briefly
described below.
The segmentation task is modeled as a problem
of finding the minimum cost C(S) of a segmenta-
tion S. The segmentation cost is defined as:
\[ C(S) \equiv -\log \Pr(W \mid S)\,\Pr(S), \]
where W = w_1 w_2 ... w_n represents the text consisting
of n words (after applying stop-word removal and stemming)
and S = S_1 S_2 ... S_m is a potential segmentation of W
into m segments. The probability Pr(W|S) is defined using
Laplace's law, while the definition of the probability Pr(S)
is chosen in a manner inspired by information theory.

Footnote 1: Occasionally within this document we employ the term
utterance to denote either a sentence or an utterance in its
proper sense.

A directed graph G is defined such that a path
in G corresponds to a possible segmentation of
W. Therefore, the thematic segmentation pro-
posed by the system is obtained by applying a dy-
namic programming algorithm for determining the
minimum cost path in G.
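The minimum-cost search can be sketched as a dynamic program over sentence ends. The sketch rests on assumptions that the description above does not spell out: we take the Laplace estimate of a word in a segment to be (f + 1)/(n_i + V), with f its count in the segment, n_i the segment length and V the vocabulary size, and we take −log Pr(S) to be log n per segment. All names are hypothetical.

```python
import math
from collections import Counter

def segment(sentences):
    """Return the boundaries (as sentence indices) of the minimum-cost segmentation."""
    words = [w for s in sentences for w in s]
    n, vocab = len(words), len(set(words))
    m = len(sentences)

    def cost(a, b):
        # -log Pr(W_i|S) under the assumed Laplace estimate, plus the
        # assumed per-segment prior cost log n.
        counts = Counter(w for s in sentences[a:b] for w in s)
        size = sum(counts.values())
        data = sum(f * math.log((size + vocab) / (f + 1))
                   for f in counts.values())
        return data + math.log(n)

    # best[b]: cheapest cost of segmenting sentences[0:b];
    # back[b]: start of the last segment on that cheapest path.
    best = [0.0] + [math.inf] * m
    back = [0] * (m + 1)
    for b in range(1, m + 1):
        for a in range(b):
            c = best[a] + cost(a, b)
            if c < best[b]:
                best[b], back[b] = c, a

    bounds, b = [], m
    while b > 0:
        b = back[b]
        if b > 0:
            bounds.append(b)
    return sorted(bounds)

# Toy text (hypothetical): four sentences on one topic, four on another.
sentences = [["cat", "dog"]] * 4 + [["stock", "bond"]] * 4
boundaries = segment(sentences)
```

Each pair (a, b) plays the role of an edge in the graph G, so the loop is a shortest-path computation on a directed acyclic graph.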
3 Input Data
When evaluating a thematic segmentation system
for an application, human annotators should pro-
vide the gold standard. The problem is that the
procedure of building such a reference corpus is
expensive. That is, the typical setting involves an
experiment with several human subjects, who are
asked to mark thematic segment boundaries based
on specific guidelines and their intuition. The
inter-annotator agreement provides the reference
segmentation. This expense can be avoided by
constructing a synthetic reference corpus by con-
catenation of segments from different documents.
Therefore, the use of artificial data for evaluation
is a general trend in many studies, e.g. (Ferret,
2002; Choi, 2000; Utiyama and Isahara, 2001).
In our experiment, we used artificial and real
data, i.e. the algorithms have been tested on the
following data sets containing English texts.
3.1 Artificially Generated Data
Choi (2000) designed an artificial dataset, built by
concatenating short pieces of texts that have been
extracted from the Brown corpus. Any test sample
from this dataset consists of ten segments. Each
segment contains the first n sentences (where 3 ≤
n ≤ 11) of a randomly selected document from
the Brown corpus. From this dataset, we randomly
chose for our evaluation 100 test samples, where
the length of a segment varied between 3 and 11
sentences.
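The construction of one such test sample can be sketched as follows. The function name is hypothetical, and drawing ten distinct documents (rather than sampling with replacement) is our assumption.

```python
import random

def make_sample(documents, n_segments=10, min_len=3, max_len=11, seed=0):
    """Concatenate the first n sentences of randomly chosen documents.
    Returns the sentences and the reference boundaries, where a boundary b
    means that a new segment starts at sentence index b."""
    rng = random.Random(seed)
    sentences, ends = [], []
    for doc in rng.sample(documents, n_segments):
        n = rng.randint(min_len, max_len)   # 3 <= n <= 11
        sentences.extend(doc[:n])
        ends.append(len(sentences))
    return sentences, ends[:-1]             # the end of the text is not a boundary

# Hypothetical stand-in for the Brown corpus: 20 documents of 15 sentences.
docs = [[f"doc{d}-sent{s}" for s in range(15)] for d in range(20)]
sample, ref_bounds = make_sample(docs)
```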
3.2 TDT Data
One of the commonly used data sets for topic seg-
mentation emerged from the Topic Detection and
Tracking (TDT) project, which includes the task
of story segmentation, i.e. the task of segmenting
a stream of news data into topically cohesive sto-
ries. As part of the TDT initiative several datasets
of news stories have been created. In our evalua-
tion, we used a subset of 28 documents randomly
selected from the TDT Phase 2 (TDT2) collection,
where a document contains an average of 24.67
segments.
3.3 Meeting Transcripts
The third dataset used in our evaluation contains
25 meeting transcripts from the ICSI-MR corpus
(Janin et al., 2004). The entire corpus contains
high-quality close talking microphone recordings
of multi-party dialogues. Transcriptions at word
level with utterance-level segmentations are also
available. The gold standard for thematic segmentations
has been kindly provided by Galley et
al. (2003) and has been chosen by considering the
agreement between at least three human annota-
tions. Each meeting is thus divided into contigu-
ous major topic segments and contains an average
of 7.32 segments.
Note that thematic segmentation of meeting
data is a more challenging task as the thematic
transitions are subtler than those in TDT data.
4 Evaluation Metrics
In this section, we will look in detail at the error
metrics that have been proposed in previous stud-
ies and examine their inadequacies. In addition,
we propose a new evaluation metric that we con-
sider more appropriate.
4.1 Pk Metric
Passonneau and Litman (1996) and Beeferman et al. (1999)
pointed out that the standard evaluation metrics of precision
and recall are inadequate for thematic segmentation because
these metrics do not account for how far a hypothesized
boundary (i.e. a boundary found by the automatic procedure)
is from a reference boundary (i.e. a boundary found in the
reference data). It is desirable, however, that an algorithm
that places a boundary just one utterance away from the
reference boundary be penalized less than an algorithm that
places a boundary two (or more) utterances away from it.
Hence Beeferman et al. (1999) proposed a new metric, called
PD, that allows for a slight vagueness in where boundaries lie.
More specifically, Beeferman et al. (1999) define PD
as follows (see footnote 2):
\[ P_D(ref, hyp) = \sum_{1 \le i \le j \le N} D(i,j)\,[\delta_{ref}(i,j) \oplus \delta_{hyp}(i,j)]. \]
N is the number of words in the reference data.
The function δref(i,j) is evaluated to one if the
two reference corpus indices specified by its pa-
rameters i and j belong in the same segment, and
zero otherwise. Similarly, the function δhyp(i,j)
is evaluated to one, if the two indices are hypothe-
sized by the automatic procedure to belong in the
same segment, and zero otherwise. The ⊕ opera-
tor is the XNOR function ‘both or neither’. D(i,j)
is a “distance probability distribution over the set
of possible distances between sentences chosen
randomly from the corpus”. In practice, a distri-
bution D having “all its probability mass at a fixed
distance k” (Beeferman et al., 1999) was adopted
and the metric PD was thus renamed Pk.
In the framework of the TDT initiative, Allan
et al. (1998) give the following formal definition
of Pk and its components:
\[ P_k = P_{Miss} \cdot P_{seg} + P_{FalseAlarm} \cdot (1 - P_{seg}), \]
where:
\[ P_{Miss} = \frac{\sum_{i=1}^{N-k} [\delta_{hyp}(i,i+k)] \cdot [1 - \delta_{ref}(i,i+k)]}{\sum_{i=1}^{N-k} [1 - \delta_{ref}(i,i+k)]}, \]
\[ P_{FalseAlarm} = \frac{\sum_{i=1}^{N-k} [1 - \delta_{hyp}(i,i+k)] \cdot [\delta_{ref}(i,i+k)]}{\sum_{i=1}^{N-k} \delta_{ref}(i,i+k)}, \]
and Pseg is the a priori probability that in
the reference data a boundary occurs within an
interval of k words. Therefore Pk is calculated by
moving a window of a certain width k, where k is
usually set to half of the average number of words
per segment in the gold standard.
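The definition above translates almost literally into code. In this sketch a segmentation is a set of positions b, each meaning that a boundary lies between unit b and unit b+1 (units are numbered from 1); this representation, the function names, and the empirical estimate of Pseg as the fraction of windows that contain a reference boundary are our assumptions.

```python
def same_segment(bounds, i, j):
    """delta(i, j): 1 if no boundary separates units i and j (i < j), else 0."""
    return 0 if any(i <= b < j for b in bounds) else 1

def p_k(ref, hyp, n, k):
    """Pk = PMiss * Pseg + PFalseAlarm * (1 - Pseg) over windows i = 1..n-k."""
    miss_num = miss_den = fa_num = fa_den = 0
    for i in range(1, n - k + 1):
        d_ref = same_segment(ref, i, i + k)
        d_hyp = same_segment(hyp, i, i + k)
        miss_num += d_hyp * (1 - d_ref)
        miss_den += 1 - d_ref
        fa_num += (1 - d_hyp) * d_ref
        fa_den += d_ref
    p_miss = miss_num / miss_den if miss_den else 0.0
    p_fa = fa_num / fa_den if fa_den else 0.0
    p_seg = miss_den / (n - k)   # assumed empirical estimate of Pseg
    return p_miss * p_seg + p_fa * (1 - p_seg)

# Toy case (hypothetical): 10 units, one reference boundary after unit 5.
perfect = p_k({5}, {5}, n=10, k=2)
no_boundaries = p_k({5}, set(), n=10, k=2)
```

With this estimate of Pseg the expression reduces to the fraction of windows on which the two segmentations disagree about whether units i and i+k are in the same segment.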
Pevzner and Hearst (2002) highlighted several
problems of the Pk metric. We illustrate below
what we consider the main problems of the Pk
metric, based on two examples.
Let r(i,k) be the number of boundaries be-
tween positions i and i + k in the gold standard
segmentation and h(i,k) be the number of bound-
aries between positions i and i+k in the automatic
hypothesized segmentation.
• Example 1: If r(i,k) = 2 and h(i,k) = 1
then obviously a missing boundary should
be counted in Pk, i.e. PMiss should be increased.
• Example 2: If r(i,k) = 1 and h(i,k) = 2
then obviously PFalseAlarm should be increased.

Footnote 2: Let ref be a correct segmentation and hyp be a
segmentation proposed by a text segmentation system. We will keep
this notation in the equations introduced below.

However, considering the first example, we obtain
δref(i,i+k) = 0 and δhyp(i,i+k) = 0, since each
segmentation places at least one boundary inside the window,
and consequently PMiss is not increased. Likewise, in the
second example we obtain δref(i,i+k) = 0 and
δhyp(i,i+k) = 0, so PFalseAlarm is not increased either.
In (TDT, 1998), a slightly different definition
is given for the Pk metric: the definitions of the
miss and false alarm probabilities are replaced with:
\[ P'_{Miss} = \frac{\sum_{i=1}^{N-k} [1 - \Omega_{hyp}(i,i+k)] \cdot [1 - \delta_{ref}(i,i+k)]}{\sum_{i=1}^{N-k} [1 - \delta_{ref}(i,i+k)]}, \]
\[ P'_{FalseAlarm} = \frac{\sum_{i=1}^{N-k} [1 - \Omega_{hyp}(i,i+k)] \cdot [\delta_{ref}(i,i+k)]}{\sum_{i=1}^{N-k} \delta_{ref}(i,i+k)}, \]
where:
\[ \Omega_{hyp}(i,i+k) = \begin{cases} 1, & \text{if } r(i,k) = h(i,k), \\ 0, & \text{otherwise.} \end{cases} \]
We will refer to this new definition of Pk by
P'k. Taking the definition of P'k and the first
example above, we obtain δref(i,i+k) = 0 and
Ωhyp(i,i+k) = 0, and thus P'Miss is correctly
increased. However, for the case of example 2 we obtain
δref(i,i+k) = 0 and Ωhyp(i,i+k) = 0, involving no
increase of P'FalseAlarm and an erroneous increase of P'Miss.
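The behaviour of both definitions on the two examples can be checked mechanically. The helper below is a hypothetical illustration that evaluates, for a single window with r reference boundaries and h hypothesized boundaries, the numerator terms of the four probabilities defined above.

```python
def window_terms(r, h):
    """Numerator contributions of one window to Pk and P'k."""
    d_ref = int(r == 0)    # delta_ref(i, i+k): no reference boundary in the window
    d_hyp = int(h == 0)    # delta_hyp(i, i+k): no hypothesized boundary in the window
    omega = int(r == h)    # Omega_hyp(i, i+k)
    return {
        "Pk_miss": d_hyp * (1 - d_ref),
        "Pk_fa": (1 - d_hyp) * d_ref,
        "P'k_miss": (1 - omega) * (1 - d_ref),
        "P'k_fa": (1 - omega) * d_ref,
    }

example_1 = window_terms(r=2, h=1)   # a boundary is missed
example_2 = window_terms(r=1, h=2)   # a boundary is added
```

Both examples yield the same four values: Pk records nothing, and P'k records a miss in either case, which is wrong for example 2.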
4.2 WindowDiff Metric
Pevzner and Hearst (2002) propose an alternative
metric called WindowDiff. Keeping the notation
r(i,k) and h(i,k) introduced in subsection 4.1,
WindowDiff is defined as:
\[ WindowDiff = \frac{\sum_{i=1}^{N-k} [\,|r(i,k) - h(i,k)| > 0\,]}{N-k}. \]
Similar to both Pk and P'k, WindowDiff is
also computed by moving a window of fixed size
across the test set and penalizing the algorithm's
misses or erroneous boundary detections. However,
unlike Pk and P'k, WindowDiff takes into account
how many boundaries fall within the window and
penalizes “how many discrepancies occur between the
reference and the system results” rather than “determining
how often two units of text are incorrectly labeled
as being in different segments” (Pevzner and
Hearst, 2002).
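A minimal sketch of WindowDiff follows. As an assumption of ours, a segmentation is a set of positions b meaning "boundary between unit b and unit b+1", and r(i,k) counts the boundaries b with i ≤ b < i+k; the function names are hypothetical.

```python
def count(bounds, i, k):
    """Number of boundaries between positions i and i+k."""
    return sum(1 for b in bounds if i <= b < i + k)

def window_diff(ref, hyp, n, k):
    """Fraction of windows in which the two boundary counts differ."""
    differing = sum(1 for i in range(1, n - k + 1)
                    if count(ref, i, k) != count(hyp, i, k))
    return differing / (n - k)

# Toy case (hypothetical): 10 units, reference boundary after unit 5.
exact = window_diff({5}, {5}, n=10, k=2)
near_miss = window_diff({5}, {6}, n=10, k=2)
```

A boundary displaced by one unit is penalized only in the two windows that contain exactly one of the two boundaries.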
Our critique concerning WindowDiff is that
misses are less penalised than false alarms;
we argue this as follows. WindowDiff can be
rewritten as:
\[ WindowDiff = WD_{Miss} + WD_{FalseAlarm}, \]
where:
\[ WD_{Miss} = \frac{\sum_{i=1}^{N-k} [r(i,k) > h(i,k)]}{N-k}, \]
\[ WD_{FalseAlarm} = \frac{\sum_{i=1}^{N-k} [r(i,k) < h(i,k)]}{N-k}. \]
Hence both misses and false alarms are weighted
by 1/(N−k).
Note that, on the one hand, there are indeed (N−k)
equiprobable possibilities to have a false alarm
in an interval of k units. On the other hand, however,
the total number of equiprobable possibilities
to have a miss in an interval of k units is
smaller than (N−k) since it depends on the number
of reference boundaries (i.e. we can have a
miss in the interval of k units only if in that interval
the reference corpus contains at least one boundary).
Therefore misses, being weighted by 1/(N−k),
are less penalised than false alarms.
Let Bref be the number of thematic boundaries
in the reference data. Let’s say that the refer-
ence data contains about 20% boundaries and 80%
non-boundaries from the total number of potential
boundaries. Therefore, since there are relatively
few boundaries compared with non-boundaries, a
strategy introducing no false alarms, but introduc-
ing a maximum number of misses (i.e. k · Bref
misses) can be judged as being around 80% cor-
rect by the WindowDiff measure. On the other
hand, a segmentation with no misses, but with a
maximum number of false alarms (i.e. (N − k)
false alarms) is judged as being 100% erroneous
by the WindowDiff measure. That is, misses and
false alarms are not equally penalised.
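As a purely illustrative calculation (the numbers are hypothetical and not taken from our data), take N = 100 units, k = 2, and reference boundaries after every fifth unit, so that B_ref = 19. Each reference boundary then falls inside exactly k windows and no window contains two of them, while the all-boundaries segmentation puts two boundaries in every window:

```latex
\[
\mathrm{WindowDiff}(\text{no boundaries}) = \frac{k \cdot B_{ref}}{N-k} = \frac{38}{98} \approx 0.39,
\qquad
\mathrm{WindowDiff}(\text{all boundaries}) = \frac{N-k}{N-k} = \frac{98}{98} = 1 .
\]
```

With k = 1 the first value drops to 19/99 ≈ 0.19, which corresponds to the roughly 80% correct figure mentioned above.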
Another issue regarding WindowDiff is that it is
not clear “how does one interpret the values pro-
duced by the metric” (Pevzner and Hearst, 2002).
4.3 Proposal for a New Metric
In order to address the inadequacies of Pk and
WindowDiff, we propose a new evaluation metric,
defined as follows:
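\[ Pr_{error} = C_{miss} \cdot Pr_{miss} + C_{fa} \cdot Pr_{fa}, \]
where C_miss (0 ≤ C_miss ≤ 1) is the cost of a miss, C_fa
(0 ≤ C_fa ≤ 1) is the cost of a false alarm,
\[ Pr_{miss} = \frac{\sum_{i=1}^{N-k} \Theta_{ref,hyp}(i,k)}{\sum_{i=1}^{N-k} \Delta_{ref}(i,k)}, \]
\[ Pr_{fa} = \frac{\sum_{i=1}^{N-k} \Psi_{ref,hyp}(i,k)}{N-k}, \]
\[ \Theta_{ref,hyp}(i,k) = \begin{cases} 1, & \text{if } r(i,k) > h(i,k), \\ 0, & \text{otherwise,} \end{cases} \]
\[ \Psi_{ref,hyp}(i,k) = \begin{cases} 1, & \text{if } r(i,k) < h(i,k), \\ 0, & \text{otherwise,} \end{cases} \]
\[ \Delta_{ref}(i,k) = \begin{cases} 1, & \text{if } r(i,k) > 0, \\ 0, & \text{otherwise.} \end{cases} \]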
Prmiss could be interpreted as the probability
that the hypothesized segmentation contains fewer
boundaries than the reference segmentation in an
interval of k units (see footnote 3), conditioned on the
fact that the reference segmentation contains at least one
boundary in that interval. Analogously, Prfa is
the probability that the hypothesized segmentation
contains more boundaries than the reference
segmentation in an interval of k units.
For certain applications where misses are more
important than false alarms or vice versa, the
Prerror can be adjusted to tackle this trade-off via
the Cfa and Cmiss parameters. In order to have
Prerror ∈ [0,1], we suggest that Cfa and Cmiss
be chosen such that Cfa + Cmiss = 1. By choosing
Cfa = Cmiss = 1/2, the penalization of misses and
false alarms is thus balanced. In consequence, a
strategy that places no boundaries at all is penal-
ized as much as a strategy proposing boundaries
everywhere (i.e. after every unit). In other words,
both such degenerate algorithms will have an error
rate Prerror of about 50%. The worst algorithm,
penalised as having an error rate Prerror of 100%
when k = 2, is the algorithm that places bound-
aries everywhere except the places where refer-
ence boundaries exist.
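The proposed metric can be sketched as below, together with the two degenerate strategies just discussed. The representation of a segmentation as a set of positions b (a boundary between unit b and unit b+1), the function names and the toy reference are our assumptions.

```python
def count(bounds, i, k):
    """Number of boundaries between positions i and i+k."""
    return sum(1 for b in bounds if i <= b < i + k)

def pr_error(ref, hyp, n, k, c_miss=0.5, c_fa=0.5):
    """Prerror = Cmiss * Prmiss + Cfa * Prfa."""
    misses = false_alarms = ref_windows = 0
    for i in range(1, n - k + 1):
        r, h = count(ref, i, k), count(hyp, i, k)
        misses += r > h           # Theta_ref,hyp(i, k)
        false_alarms += r < h     # Psi_ref,hyp(i, k)
        ref_windows += r > 0      # Delta_ref(i, k)
    pr_miss = misses / ref_windows if ref_windows else 0.0
    pr_fa = false_alarms / (n - k)
    return c_miss * pr_miss + c_fa * pr_fa

# Toy reference (hypothetical): 100 units, a boundary after every fifth unit.
n, k = 100, 2
ref = set(range(5, 100, 5))
err_none = pr_error(ref, set(), n, k)             # no boundaries at all
err_all = pr_error(ref, set(range(1, n)), n, k)   # a boundary after every unit
err_ref = pr_error(ref, ref, n, k)                # the reference itself
```

On this toy reference both degenerate segmentations receive the same error rate of 50%, whereas WindowDiff would score them very differently.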
5 Results
5.1 Test Procedure
For the three datasets we first performed two
common preprocessing steps: common words are
eliminated using the same stop-list and the remaining
words are stemmed using Porter's algorithm
(Porter, 1980). Next, we ran the three segmenters
described in Section 2, employing the default
values for all system parameters and letting the
systems estimate the number of thematic boundaries.

Footnote 3: A unit can be either a word or a sentence / an utterance.

We also considered the fact that C99 and
TextSeg algorithms can take into account a fixed
number of thematic boundaries. Even if the num-
ber of segments per document can vary in TDT
and meeting reference data, we consider that in a
real application it is impossible to provide to the
systems the exact number of boundaries for each
document to be segmented. Therefore, we ran C99
and TextSeg algorithms (for a second time), by
providing them only the average number of seg-
ments per document in the reference data, which
gives an estimation of the expected level of seg-
mentation granularity.
Four additional naive segmentations were also
used for evaluation, namely: no boundaries,
where the whole text is a single segment; all
boundaries, i.e. a thematic boundary is placed af-
ter each utterance; random known, i.e. the same
number of boundaries as in gold standard, distrib-
uted randomly throughout text; and random un-
known: the number of boundaries is randomly
selected and boundaries are randomly distributed
throughout text. Each of the segmentations was
evaluated with Pk, P'k, WindowDiff and Prerror, as
described in Section 4.
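The four naive segmentations can be generated as in the sketch below. The function name and the uniform choice of the number of boundaries for random unknown are our assumptions; a boundary b again means "after utterance b".

```python
import random

def naive_segmentations(n_units, ref_bounds, seed=0):
    """Build the N0, All, RK and RU baselines for a text of n_units utterances."""
    rng = random.Random(seed)
    candidates = list(range(1, n_units))          # every possible boundary
    n_random = rng.randint(0, len(candidates))    # assumed uniform for RU
    return {
        "N0": set(),                              # no boundaries
        "All": set(candidates),                   # boundary after each utterance
        "RK": set(rng.sample(candidates, len(ref_bounds))),   # random known
        "RU": set(rng.sample(candidates, n_random)),          # random unknown
    }

# Hypothetical reference: 50 utterances with three boundaries.
baselines = naive_segmentations(50, {10, 20, 30})
```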
5.2 Comparative Performance of
Segmentation Systems
The results of applying each segmentation algorithm
to the three distinct datasets are summarized
in Figures 1, 2 and 3. Percent error values
are given in the figures and we used the following
abbreviations: WD to denote the WindowDiff error
metric; TextSeg_KA to denote the TextSeg algorithm
(Utiyama and Isahara, 2001) when the average
number of boundaries in the reference data
was provided to the algorithm; C99_KA to denote
the C99 algorithm (Choi, 2000) when the average
number of boundaries in the reference data
was provided to the algorithm; N0 to denote the
algorithm proposing a segmentation with no boundaries;
All to denote the algorithm proposing the
degenerate segmentation all boundaries; RK to
denote the algorithm that generates a random known
segmentation; and RU to denote the algorithm that
generates a random unknown segmentation.
5.2.1 Comparison of System Performance
from Artificial to Realistic Data
From the artificial data to the more realistic
data, we expect more noise and thus a steady
degradation of the algorithms, but as our experiments
show, a reversal of the assessment can appear.
More exactly: as can be seen from Figure 1, both the
C99 and TextSeg algorithms significantly
outperformed the TextTiling algorithm on the artificially
created dataset, when the number of segments
was determined by the systems. A comparison
between the error rates given in Figure 1 and
Figure 2 shows that C99 and TextSeg follow
a similar trend: their performance decreases
significantly on TDT data, but they still give better
results than TextTiling on TDT data. When
comparing the systems by Prerror, C99 has similar
performance to TextTiling on meeting data
(see Figure 3). Moreover, when assessment is
done by using WindowDiff, Pk or P'k, both C99
and TextSeg came out worse than TextTiling on
meeting data. This demonstrates that rankings
obtained when evaluating on artificial data are
different from those obtained when evaluating on
realistic data. An alternative interpretation can be
given by taking into account that the degenerate
no boundaries segmentation has an error rate of
only 30% by the WindowDiff, Pk and P'k metrics
on meeting data. That is, we could interpret that
all three systems give completely wrong segmentations
on meeting data (due to the fact that topic
shifts are subtler and not as abrupt as in TDT and
artificial data). Nevertheless, we tend to adopt the
first interpretation, given the weaknesses of Pk, P'k
and WindowDiff (where misses are less penalised
than false alarms), as discussed in Section 4.
5.2.2 The Influence of the Error Metric on
Assessment
By following the quantitative assessment given
by the WindowDiff metric, we observe that the
algorithm labeled N0 is three times better than
the algorithm All on meeting data (see Figure 3),
while the same algorithm N0 is considered only
two times better than All on the artificial data (see
Figure 1). This verifies the limitation of the Win-
dowDiff metric discussed in Section 4.
The four error metrics described in detail in
Section 4 have shown that the effect of knowing
the average number of boundaries on C99 is positive
when testing on meeting data. However, if we
want to take into account all four error metrics,
it is difficult to draw definite conclusions regarding
the influence of knowing the average number
of boundaries on the TextSeg and C99 algorithms.
For example, when tested on TDT data, C99_KA
seems to work better than C99 by the Pk and P'k
metrics, while the WindowDiff metric gives a
contradictory assessment.

Error rate (%)
Metric    TextTiling        C99    TextSeg     C99_KA TextSeg_KA         N0        All         RK         RU
Pk             34.75      11.01       7.89         10       7.15      44.12       55.5      47.71      52.51
P'k             35.1      13.21       8.55      10.94       7.87      44.13      99.58      48.85      80.84
WD             35.73      13.58       9.21      11.34       8.59       43.1      99.59      48.89      80.63
Pr_error       33.33        9.1       7.71       9.34       6.87      49.87      49.79      41.61      45.01
Figure 1: Error rates of the segmentation systems on artificial data, where k = 42 and Pseg = 0.44.

Error rate (%)
Metric    TextTiling        C99    TextSeg     C99_KA TextSeg_KA         N0        All         RK         RU
Pk              40.7      21.36      13.97      18.83      11.33      36.02      63.93      37.03      60.04
P'k            44.92       29.5      20.37      27.69       21.4      36.04        100      45.28      89.93
WD             44.76      36.28       30.3      40.26      31.46      46.69        100      53.75      91.92
Pr_error       34.09      25.69      25.62      27.17      21.05      49.96         50      44.89      48.31
Figure 2: Error rates of the segmentation systems on TDT data, where k = 55 and Pseg = 0.3606.

6 Conclusions
By comparing the performance of three systems
for thematic segmentation on different kinds of
data, we address two important issues in a quan-
titative evaluation. Strong emphasis was put on
the kind of data used for evaluation and we have
demonstrated experimentally that evaluation on
synthetic data is potentially misleading. The sec-
ond major issue addressed in this paper concerns
the choice of a valuable error metric and its side
effects on the evaluation assessment.
Acknowledgments
This work is supported by the Interactive
Multimodal Information Management project
(http://www.im2.ch/). Many thanks to Andrei
Popescu-Belis and the anonymous reviewers for
their valuable comments. We are grateful to the
International Computer Science Institute (ICSI),
University of California for sharing the data with
us. We also wish to thank Michael Galley who
kindly provided us the thematic annotations of
ICSI data.

References
James Allan, Jaime Carbonell, George Doddington,
Jonathan Yamron, and Yiming Yang. 1998. Topic
Detection and Tracking Pilot Study: Final Re-
port. In DARPA Broadcast News Transcription and
Understanding Workshop, pages 194–218, Lands-
downe, VA. Morgan Kaufmann.
Doug Beeferman, Adam Berger, and John Lafferty.
1999. Statistical Models for Text Segmentation.
Machine Learning, 34(Special Issue on Natural Lan-
guage Learning):177–210.
Gillian Brown and George Yule. 1998. Discourse
Analysis. (Cambridge Textbooks in Linguistics),
Cambridge.
Freddy Choi. 2000. Advances in Domain Independent
Linear Text Segmentation. In Proceedings of the 1st
Conference of the North American Chapter of the
Association for Computational Linguistics, Seattle,
USA.
Olivier Ferret. 2002. Using Collocations for Topic
Segmentation and Link Detection. In The 19th In-
ternational Conference on Computational Linguis-
tics, Taipei, Taiwan.
Michael Galley, Kathleen McKeown, Eric Fosler-Lussier,
and Hongyan Jing. 2003. Discourse Segmentation
of Multi-Party Conversation. In Annual
Meeting of the Association for Computational
Linguistics, pages 562–569.
Barbara J. Grosz and Candace L. Sidner. 1986. At-
tention, Intentions and the Structure of Discourse.
Computational Linguistics, 12:175–204.
Michael A. K. Halliday and Ruqaiya Hasan. 1976. Co-
hesion in English. Longman, London.
Marti Hearst and Christian Plaunt. 1993. Subtopic
Structuring for Full-Length Document Access.
In Proceedings of the 16th Annual International
ACM/SIGIR Conference, pages 59–68, Pittsburgh,
Pennsylvania, United States.
Marti Hearst. 1997. TextTiling: Segmenting Text into
Multi-Paragraph Subtopic Passages. Computational
Linguistics, 23(1):33–64.
Julia Hirschberg and Christine Nakatani. 1996.
A Prosodic Analysis of Discourse Segments in
Direction-Giving Monologues. In Proceedings of
the 34th Annual Meeting on Association for
Computational Linguistics, pages 286–293, Santa Cruz,
California.
Adam Janin, Jeremy Ang, Sonali Bhagat, Rajdip
Dhillon, Jane Edwards, Javier Macias-Guarasa, Nel-
son Morgan, Barbara Peskin, Elizabeth Shriberg,
Andreas Stolcke, Chuck Wooters, and Britta Wrede.
2004. The ICSI Meeting Project: Resources and Re-
search. In ICASSP 2004 Meeting Recognition Work-
shop (NIST RT-04 Spring Recognition Evaluation),
Montreal.
David Kauchak and Francine Chen. 2005. Feature-
based segmentation of narrative documents. In Pro-
ceedings of the ACL Workshop on Feature Engi-
neering for Machine Learning in Natural Language
Processing, pages 32–39, Ann Arbor, MI, USA.
Hideki Kozima and Teiji Furugori. 1994. Segmenting
Narrative Text into Coherent Scenes. Literary and
Linguistic Computing, 9:13–19.
Inderjeet Mani. 2001. Automatic Summarization.
John Benjamins Pub Co.
Chris Manning and Hinrich Schütze. 1999.
Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge, MA, USA.
Daniel Marcu. 2000. The Theory and Practice of
Discourse Parsing and Summarization. MIT Press
Cambridge, MA, USA.
Rebecca J. Passonneau and Diane J. Litman. 1996.
Empirical Analysis of Three Dimensions of Spoken
Discourse: Segmentation, Coherence and Linguistic
Devices.
Rebecca J. Passonneau and Diane J. Litman. 1997.
Discourse Segmentation by Human and Automated
Means. Computational Linguistics, 23(1).
Lev Pevzner and Marti Hearst. 2002. A Critique and
Improvement of an Evaluation Metric for Text
Segmentation. Computational Linguistics, 28(1):19–36.
Martin Porter. 1980. An Algorithm for Suffix
Stripping. Program, 14:130–137.
Jeffrey Reynar. 1998. Topic Segmentation: Algorithms
and Applications. Ph.D. thesis, University of Penn-
sylvania.
TDT. 1998. The Topic Detection and Tracking - Phase
2 Evaluation Plan. Available from World Wide Web:
http://www.nist.gov/speech/tests/tdt/tdt98/index.htm.
Masao Utiyama and Hitoshi Isahara. 2001. A Statisti-
cal Model for Domain-Independent Text Segmenta-
tion. In ACL/EACL, pages 491–498.
