A Statistical Model for Domain-Independent Text Segmentation
Masao Utiyama and Hitoshi Isahara
Communications Research Laboratory
2-2-2 Hikaridai Seika-cho, Soraku-gun,
Kyoto, 619-0289 Japan
mutiyama@crl.go.jp and isahara@crl.go.jp
Abstract
We propose a statistical method that
finds the maximum-probability seg-
mentation of a given text. This method
does not require training data because
it estimates probabilities from the given
text. Therefore, it can be applied to
any text in any domain. An experi-
ment showed that the method is more
accurate than or at least as accurate as
a state-of-the-art text segmentation sys-
tem.
1 Introduction
Documents usually include various topics. Identi-
fying and isolating topics by dividing documents,
which is called text segmentation, is important
for many natural language processing tasks, in-
cluding information retrieval (Hearst and Plaunt,
1993; Salton et al., 1996) and summarization
(Kan et al., 1998; Nakao, 2000). In informa-
tion retrieval, users are often interested in par-
ticular topics (parts) of retrieved documents, in-
stead of the documents themselves. To meet such
needs, documents should be segmented into co-
herent topics. Summarization is often used for a
long document that includes multiple topics. A
summary of such a document can be composed
of summaries of the component topics. Identifi-
cation of topics is the task of text segmentation.
A lot of research has been done on text seg-
mentation (Kozima, 1993; Hearst, 1994; Oku-
mura and Honda, 1994; Salton et al., 1996; Yaari,
1997; Kan et al., 1998; Choi, 2000; Nakao, 2000).
A major characteristic of the methods used in this
research is that they do not require training data
to segment given texts. Hearst (1994), for exam-
ple, used only the similarity of word distributions
in a given text to segment the text. Consequently,
these methods can be applied to any text in any
domain, even if training data do not exist. This
property is important when text segmentation is
applied to information retrieval or summarization,
because both tasks deal with domain-independent
documents.
Another application of text segmentation is
the segmentation of a continuous broadcast news
stream into individual stories (Allan et al., 1998).
In this application, systems relying on supervised
learning (Yamron et al., 1998; Beeferman et al.,
1999) achieve good performance because there
are plenty of training data in the domain. These
systems, however, cannot be applied to domains
for which no training data exist.
The text segmentation algorithm described in
this paper is intended to be applied to the sum-
marization of documents or speeches. Therefore,
it should be able to handle domain-independent
texts. The algorithm thus does not use any train-
ing data. It requires only the given documents for
segmentation. It can, however, incorporate train-
ing data when they are available, as discussed in
Section 5.
The algorithm selects the optimum segmen-
tation in terms of the probability defined by a
statistical model. This is a new approach for
domain-independent text segmentation. Previous
approaches usually used lexical cohesion to seg-
ment texts into topics. Kozima (1993), for exam-
ple, used cohesion based on the spreading activa-
tion on a semantic network. Hearst (1994) used
the similarity of word distributions as measured
by the cosine to gauge cohesion. Reynar (1994)
used word repetition as a measure of cohesion.
Choi (2000) used the rank of the cosine, rather
than the cosine itself, to measure the similarity of
sentences.
The statistical model for the algorithm is de-
scribed in Section 2, and the algorithm for ob-
taining the maximum-probability segmentation is
described in Section 3. Experimental results are
presented in Section 4. Further discussion and our
conclusions are given in Sections 5 and 6, respec-
tively.
2 Statistical Model for Text
Segmentation
We first define the probability of a segmentation
of a given text in this section. In the next section,
we then describe the algorithm for selecting the
most likely segmentation.
Let $W = w_1 w_2 \cdots w_n$ be a text consisting of
$n$ words, and let $S = S_1 S_2 \cdots S_m$ be a segmentation
of $W$ consisting of $m$ segments. Then the
probability of the segmentation $S$ is defined by:

  $\Pr(S \mid W) = \dfrac{\Pr(W \mid S)\Pr(S)}{\Pr(W)}$.  (1)

The most likely segmentation $\hat{S}$ is given by:

  $\hat{S} = \mathrm{argmax}_S \; \Pr(W \mid S)\Pr(S)$,  (2)

because $\Pr(W)$ is constant for a given text $W$.
The definitions of $\Pr(W \mid S)$ and $\Pr(S)$ are
given below, in that order.
2.1 Definition of $\Pr(W \mid S)$
We define a topic by the distribution of words in
that topic. We assume that different topics have
different word distributions. We further assume
that different topics are statistically independent
of each other. We also assume that the words
within the scope of a topic are statistically inde-
pendent of each other given the topic.
Let $n_i$ be the number of words in segment $S_i$,
and let $w^i_j$ be the $j$-th word in $S_i$. If we define
$W_i$ as

  $W_i = w^i_1 w^i_2 \cdots w^i_{n_i}$,

then $W = W_1 W_2 \cdots W_m$ and $n = \sum_{i=1}^{m} n_i$ hold.
This means that $S_i$ and $W_i$ correspond to each
other.
Under our assumptions, $\Pr(W \mid S)$ can be decomposed
as follows:

  $\Pr(W \mid S) = \Pr(W_1 W_2 \cdots W_m \mid S)$
             $= \prod_{i=1}^{m} \Pr(W_i \mid S)$
             $= \prod_{i=1}^{m} \Pr(W_i \mid S_i)$
             $= \prod_{i=1}^{m} \prod_{j=1}^{n_i} \Pr(w^i_j \mid S_i)$.  (3)
Next, we define $\Pr(w^i_j \mid S_i)$ as:

  $\Pr(w^i_j \mid S_i) \equiv \dfrac{f_i(w^i_j) + 1}{n_i + k}$,  (4)

where $f_i(w^i_j)$ is the number of words in $W_i$ that
are the same as $w^i_j$ and $k$ is the number of different
words in $W$. For example, if $W = W_1 W_2$, where
$W_1 = \mathrm{a\;b\;a\;b\;a}$ and $W_2 = \mathrm{c\;c\;c\;d\;c\;c}$, then
$f_1(\mathrm{a}) = 3$, $f_1(\mathrm{b}) = 2$, $f_2(\mathrm{c}) = 5$, $f_2(\mathrm{d}) = 1$, and $k = 4$.
Equation (4) is known as Laplace's law (Manning
and Schütze, 1999).
$f_i(w^i_j)$ can be defined as:

  $f_i(w^i_j) \equiv g(w^i_j \mid w^i_1 w^i_2 \cdots w^i_{n_i})$  (5)

for

  $g(w^i_j \mid w^i_1 w^i_2 \cdots w^i_{n_i}) \equiv \sum_{l=1}^{n_i} \delta(w^i_l, w^i_j)$,  (6)

where $\delta(w^i_l, w^i_j) = 1$ when $w^i_l$ and $w^i_j$ are the
same word and $\delta(w^i_l, w^i_j) = 0$ otherwise. For
example, $g(\mathrm{a} \mid \mathrm{b\;a\;b\;a}) = \delta(\mathrm{b},\mathrm{a}) + \delta(\mathrm{a},\mathrm{a}) + \delta(\mathrm{b},\mathrm{a}) + \delta(\mathrm{a},\mathrm{a}) = 0 + 1 + 0 + 1 = 2$.
Equations (5) and (6) are used in Section 3 to
describe the algorithm for finding the maximum-
probability segmentation.
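To make Equations (4)-(6) concrete, the following is a minimal Python sketch of the within-segment word probability; the function names (`count_in_segment`, `word_prob`) are our own illustrative choices, not part of the paper.

```python
def count_in_segment(word, segment):
    """g(w | segment) of Equation (6): the number of tokens in the
    segment that are the same word as w (delta is the equality test)."""
    return sum(1 for w in segment if w == word)

def word_prob(word, segment, k):
    """Pr(w | S_i) by Laplace's law, Equation (4).
    k is the number of different words in the whole text W."""
    n_i = len(segment)
    return (count_in_segment(word, segment) + 1) / (n_i + k)

# The example from the text: W1 = "a b a b a", W2 = "c c c d c c", k = 4.
W1 = "a b a b a".split()
W2 = "c c c d c c".split()
k = len(set(W1 + W2))              # 4 different words in W
print(word_prob("a", W1, k))       # (3 + 1) / (5 + 4) = 0.444...
print(word_prob("d", W2, k))       # (1 + 1) / (6 + 4) = 0.2
```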
2.2 Definition of $\Pr(S)$
The definition of $\Pr(S)$ can vary depending on
our prior information about the possibility of segmentation
$S$. For example, we might know the
average length of the segments and want to incorporate
it into $\Pr(S)$.
Our assumption, however, is that we do not
have such prior information. Thus, we have to
use some uninformative prior probability.
We define $\Pr(S)$ as

  $\Pr(S) \equiv n^{-m}$.  (7)

Equation (7) is determined on the basis of its description
length,1 $l(S)$; i.e.,

  $\Pr(S) = 2^{-l(S)}$,  (8)

where $l(S) = m \log n$ bits.2 This description
length is derived as follows:
Suppose that there are two people, a sender and
a receiver, both of whom know the text to be segmented.
Only the sender knows the exact segmentation,
and he/she should send a message so
that the receiver can segment the text correctly.
To this end, it is sufficient for the sender to send
$m$ integers, i.e., $n_1, n_2, \ldots, n_m$, because these
integers represent the lengths of the segments and
thus uniquely determine the segmentation once
the text is known.
A segment length $n_i$ can be encoded using $\log n$
bits, because $n_i$ is a number between 1 and $n$.
The total description length for all the segment
lengths is thus $m \log n$ bits.3
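As a concrete check of Equations (7) and (8): for a text of $n = 1000$ words divided into $m = 5$ segments, $l(S) = 5 \log 1000 \approx 49.8$ bits, so $\Pr(S) = 2^{-l(S)} = 1000^{-5} = n^{-m}$. (The values $n = 1000$ and $m = 5$ are ours, chosen only for illustration.)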
Generally speaking, $\Pr(S)$ takes a large value
when the number of segments is small. On the
other hand, $\Pr(W \mid S)$ takes a large value when the
number of segments is large. If only $\Pr(W \mid S)$ is
used to segment the text, then the resulting segmentation
will have too many segments. By using
both $\Pr(S)$ and $\Pr(W \mid S)$, we can get a reasonable
number of segments.
3 Algorithm for Finding the
Maximum-Probability Segmentation
To find the maximum-probability segmentation
$\hat{S}$, we first define the cost of segmentation $S$ as

  $C(S) \equiv -\log \Pr(W \mid S)\Pr(S)$,  (9)
1Stolcke and Omohundro use description length priors
to induce the structure of hidden Markov models (Stolcke
and Omohundro, 1994).
2‘log’ denotes the logarithm to the base 2.
3We have used $(m/2) \log n$ as $l(S)$ before. But we use
$m \log n$ in this paper, because it is easily interpreted as a
description length and the experimental results obtained by
using $m \log n$ are slightly better than those obtained by using
$(m/2) \log n$. An anonymous reviewer suggests using a Poisson
distribution whose parameter is $n/m$, the average length
of a segment (in words), as prior probability. We leave it
for future work to compare the suitability of various prior
probabilities for text segmentation.
and we then minimize $C(S)$ to obtain $\hat{S}$, because

  $\hat{S} = \mathrm{argmax}_S \; \Pr(W \mid S)\Pr(S) = \mathrm{argmin}_S \; C(S)$.  (10)
$C(S)$ can be decomposed as follows:

  $C(S) = -\log \Pr(W \mid S)\Pr(S)$
      $= -\sum_{i=1}^{m} \sum_{j=1}^{n_i} \log \Pr(w^i_j \mid S_i) - \log \Pr(S)$
      $= -\sum_{i=1}^{m} \sum_{j=1}^{n_i} \log \dfrac{f_i(w^i_j) + 1}{n_i + k} + m \log n$
      $= \sum_{i=1}^{m} c(w^i_1 w^i_2 \cdots w^i_{n_i}; n, k)$,  (11)
where

  $c(w^i_1 w^i_2 \cdots w^i_{n_i}; n, k) \equiv \sum_{j=1}^{n_i} \log \dfrac{n_i + k}{f_i(w^i_j) + 1} + \log n$.  (12)
We further rewrite Equation (12) in the form
of Equation (13) below by using Equation (5)
and replacing $n_i$ with $|w^i_1 w^i_2 \cdots w^i_{n_i}|$, where
$|\mathit{words}|$ is the length of $\mathit{words}$, i.e., the number
of word tokens in $\mathit{words}$. Equation (13) is used to
describe our algorithm in Section 3.1:

  $c(w^i_1 w^i_2 \cdots w^i_{n_i}; n, k) = \sum_{j=1}^{|w^i_1 w^i_2 \cdots w^i_{n_i}|} \log \dfrac{|w^i_1 w^i_2 \cdots w^i_{n_i}| + k}{g(w^i_j \mid w^i_1 w^i_2 \cdots w^i_{n_i}) + 1} + \log n$.  (13)
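A direct transcription of Equation (12) might look as follows; this is a sketch that reuses `count_in_segment` from the earlier snippet (logs are base 2, per footnote 2), and the name `segment_cost` is again our own.

```python
import math

def segment_cost(segment, n, k):
    """c(w^i_1 ... w^i_{n_i}; n, k) of Equations (12)/(13): the negative
    log probability of the words in the segment under Laplace's law,
    plus the log n term that prices one segment in the prior."""
    n_i = len(segment)
    cost = sum(
        math.log2((n_i + k) / (count_in_segment(w, segment) + 1))
        for w in segment
    )
    return cost + math.log2(n)
```

Counting each segment's words once with `collections.Counter` would avoid the quadratic rescan; the version above stays close to the notation of Equation (13).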
3.1 Algorithm
This section describes an algorithm for finding the
minimum-cost segmentation. First, we define the
terms and symbols used to describe the algorithm.
Given a text $W = w_1 w_2 \cdots w_n$ consisting of
$n$ words, we define $g_i$ as the position between
$w_i$ and $w_{i+1}$, so that $g_0$ is just before $w_1$ and $g_n$ is
just after $w_n$.
Next, we define a graph $G = \langle V, E \rangle$, where $V$
is a set of nodes and $E$ is a set of edges. $V$ is
defined as

  $V = \{ g_i \mid 0 \le i \le n \}$  (14)

and $E$ is defined as

  $E = \{ e_{ij} \mid 0 \le i < j \le n \}$,  (15)

where the edges are ordered; the initial vertex and
the terminal vertex of $e_{ij}$ are $g_i$ and $g_j$, respectively.
An example of $G$ is shown in Figure 1.
We say that $e_{ij}$ covers $w_{i+1} w_{i+2} \cdots w_j$.
This means that $e_{ij}$ represents a segment
$w_{i+1} w_{i+2} \cdots w_j$. Thus, we define the cost $c_{ij}$ of
edge $e_{ij}$ by using Equation (13):

  $c_{ij} = c(w_{i+1} w_{i+2} \cdots w_j; n, k)$,  (16)

where $k$ is the number of different words in $W$.
Given these definitions, we describe the algorithm
to find the minimum-cost segmentation, or
maximum-probability segmentation, as follows:
Step 1. Calculate the cost $c_{ij}$ of edge $e_{ij}$ for $0 \le i < j \le n$ by using Equation (16).
Step 2. Find the minimum-cost path from $g_0$ to $g_n$.
Algorithms for finding the minimum-cost path in
a graph are well known. An algorithm that can
provide a solution for Step 2 is a simpler version
of the algorithm used to find the maximum-probability
solution in Japanese morphological
analysis (Nagata, 1994). Therefore, a solution can
be obtained by applying a dynamic programming
(DP) algorithm.4 DP algorithms have also been
used for text segmentation by other researchers
(Ponte and Croft, 1997; Heinonen, 1998).

4A program that implements the algorithm described in
this section is available at
http://www.crl.go.jp/jt/a132/members/mutiyama/softwares.html.
The path thus obtained represents the
minimum-cost segmentation in $G$ when edges
correspond with segments. In Figure 1, for
example, if $e_{01} e_{13} e_{35}$ is the minimum-cost path,
then $[w_1][w_2\,w_3][w_4\,w_5]$ is the minimum-cost
segmentation.
The algorithm automatically determines the
number of segments. But the number of segments
can also be specified explicitly by specifying the
number of edges in the minimum-cost path.
The algorithm allows the text to be segmented
anywhere between words; i.e., all the positions
between words are candidates for segment bound-
aries. It is easy, however, to modify the algorithm
so that the text can only be segmented at partic-
ular positions, such as the ends of sentences or
paragraphs. This is done by using a subset of $E$
in Equation (15). We use only the edges whose
initial and terminal vertices are candidate bound-
aries that meet particular conditions, such as be-
ing the ends of sentences or paragraphs. We then
obtain the minimum-cost path by doing Steps 1
and 2. The minimum-cost segmentation thus ob-
tained meets the boundary conditions. In this pa-
per, we assume that the segment boundaries are at
the ends of sentences.
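Steps 1 and 2 amount to a standard shortest-path dynamic program over $G$. Below is a minimal sketch under the assumptions of the earlier snippets (`segment_cost` is the hypothetical function sketched after Equation (13)); the `candidates` argument implements the restriction of $E$ to particular boundary positions, such as sentence ends, described above.

```python
def segment_text(words, candidates=None):
    """Steps 1 and 2: find the minimum-cost path from g_0 to g_n by
    dynamic programming.  words is the text W as a list of tokens;
    candidates optionally restricts boundary positions (e.g., to
    sentence-final positions).  Returns boundary positions, with
    0 and len(words) included."""
    n = len(words)
    k = len(set(words))                      # number of different words in W
    allowed = set(range(n + 1)) if candidates is None else set(candidates) | {0, n}

    INF = float("inf")
    best = [INF] * (n + 1)                   # best[i]: min cost of a path g_0 -> g_i
    back = [0] * (n + 1)                     # back[j]: initial vertex of the last edge
    best[0] = 0.0
    for j in range(1, n + 1):                # terminal vertex g_j
        if j not in allowed:
            continue
        for i in range(j):                   # edge e_ij covers words[i:j]
            if i not in allowed or best[i] == INF:
                continue
            cost = best[i] + segment_cost(words[i:j], n, k)
            if cost < best[j]:
                best[j], back[j] = cost, i
    bounds = [n]                             # recover the path from g_n back to g_0
    while bounds[-1] != 0:
        bounds.append(back[bounds[-1]])
    return bounds[::-1]
```

Fixing the number of segments, as mentioned above, would add a dimension to `best` for the number of edges used; precomputing the edge costs of Step 1 (or prefix word counts) would avoid recomputing `segment_cost` from scratch for every edge.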
3.2 Properties of the segmentation
Generally speaking, the number of segments ob-
tained by our algorithm is not sensitive to the
length of a given text, which is counted in words.
In other words, the number of segments is rela-
tively stable with respect to variation in the text
length. For example, the algorithm divides a
newspaper editorial consisting of about 27 sentences
into 4 to 6 segments, while it divides a
long text consisting of over 1000 sentences into
10 to 20 segments. Thus, the number of segments
is not proportional to text length.
This is due to the term $m \log n$ in Equation (11).
The value of this term increases as the number of
words increases. The term thus suppresses the di-
vision of a text when the length of the text is long.
This stability is desirable for summarization,
because summarizing a given text requires select-
ing a relatively small number of topics from it.
If a text segmentation system divides a given text
into a relatively small number of segments, then
a summary of the original text can be composed
by combining summaries of the component seg-
ments (Kan et al., 1998; Nakao, 2000). A finer
segmentation can be obtained by applying our
algorithm recursively to each segment, if neces-
sary.5
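As a rough sketch of this recursion, using the hypothetical `segment_text` function from Section 3.1:

```python
def recursive_segment(words, depth=2):
    """Re-apply segment_text to each segment, up to a fixed depth,
    to obtain a finer segmentation."""
    bounds = segment_text(words)
    if depth <= 1:
        return bounds
    finer = []
    for s, e in zip(bounds, bounds[1:]):
        sub = recursive_segment(words[s:e], depth - 1)
        finer.extend(s + b for b in sub[:-1])   # drop each sub-final boundary
    finer.append(len(words))
    return finer
```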
5We segmented various texts without rigorous evaluation
and found that our method is good at segmenting a text into a
relatively small number of segments. On the other hand, the
method is not good at segmenting a text into a large number
of segments. For example, the method is good at segmenting
a 1000-sentence text into 10 segments. In such a case,
the segment boundaries seem to correspond well with topic
boundaries. But, if the method is forced to segment the same
text into 50 segments by specifying the number of edges in
the minimum-cost path, then the resulting segmentation often
contains very small segments consisting of only one or
two sentences. We found empirically that segments obtained
by recursive segmentation were better than those obtained by
minimum-cost segmentation when the specified number of
segments was somewhat larger than that of the minimum-cost
path, whose number of segments was automatically determined
by the algorithm.
[Figure 1 is a diagram: nodes $g_0, g_1, \ldots, g_5$ are the positions surrounding the words $w_1, \ldots, w_5$, and edges such as $e_{01}$, $e_{13}$, $e_{14}$, $e_{35}$, and $e_{45}$ connect them.]
Figure 1: Example of a graph.
4 Experiments
4.1 Material
We used publicly available data to evaluate our
system. This data was used by Choi (2000) to
compare various domain-independent text seg-
mentation systems.6 He evaluated C99 (Choi,
2000), TextTiling (Hearst, 1994), DotPlot (Reynar,
1998), and Segmenter (Kan et al., 1998) by
using the data and reported that C99 achieved the
best performance among these systems.
The data description is as follows: “An artificial
test corpus of 700 samples is used to assess
the accuracy and speed performance of segmentation
algorithms. A sample is a concatenation of
ten text segments. A segment is the first $n$ sentences
of a randomly selected document from the
Brown corpus. A sample is characterised by the
range $n$.” (Choi, 2000) Table 1 gives the corpus
statistics.
Range of $n$   3–11   3–5   6–8   9–11
# samples      400    100   100   100

Table 1: Test corpus statistics. (Choi, 2000)
Segmentation accuracy was measured by the
probabilistic error metric $P_k$ proposed by Beeferman
et al. (1999).7 Low $P_k$ indicates high accuracy.

6The data is available from
http://www.cs.man.ac.uk/~choif/software/C99-1.2-release.tgz.
We used naacl00Exp/data/{1,2,3}/{3-11,3-5,6-8,9-11}/*,
which is contained in the package, for our experiment.

7Let $\mathit{ref}$ be a correct segmentation and let $\mathit{hyp}$ be a segmentation
proposed by a text segmentation system. Then the
number $P_k(\mathit{ref}, \mathit{hyp})$ “is the probability that a randomly
chosen pair of words a distance of $k$ words apart is inconsistently
classified; that is, for one of the segmentations the pair
lies in the same segment, while for the other the pair spans
a segment boundary” (Beeferman et al., 1999), where $k$ is
chosen to be half the average reference segment length (in
words).
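For reference, the following is a small sketch of this metric under the definition in footnote 7; segmentations are passed as lists of segment lengths, and the function name `p_k` is our own.

```python
def p_k(ref_lens, hyp_lens):
    """P_k of Beeferman et al. (1999): the probability that two words
    k apart are inconsistently classified by the two segmentations.
    k is half the average reference segment length (in words)."""
    def seg_ids(lens):
        # Assign each word position the index of its segment.
        return [s for s, length in enumerate(lens) for _ in range(length)]
    ref, hyp = seg_ids(ref_lens), seg_ids(hyp_lens)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    k = max(1, len(ref) // len(ref_lens) // 2)
    positions = range(len(ref) - k)
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in positions)
    return errors / len(positions)
```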
4.2 Experimental procedure and results
The sample texts were preprocessed (i.e., punctuation
and stop words were removed, and the remaining
words were stemmed) by a program using
the libraries available in Choi's package. The
texts were then segmented by the systems listed
in Tables 2 and 3. The segmentation boundaries
were placed at the ends of sentences. The segmentations
were evaluated by applying an evaluation
program in Choi's package.
The results are listed in Tables 2 and 3. U00 is
the result for our system when the numbers of segments
were determined by the system. U00(b) is
the result for our system when the numbers of segments
were given beforehand.8 C99 and C99(b)
are the corresponding results for the systems described
in Choi's paper (Choi, 2000).9
        3–11    3–5     6–8     9–11    Total
U00     11%††   13%††   6%††    6%††    10%††
C99     13%     18%     10%     10%     13%
prob    7.9E-5  4.9E-3  2.5E-5  7.5E-8  9.7E-12

Table 2: Comparison of $P_k$: the numbers of segments
were determined by the systems.
In these tables, the symbol “††” indicates that
the difference in $P_k$ between the two systems is
statistically significant at the 1% level, based on
a one-sided $t$-test of the null hypothesis of equal
means.
8If two segmentations have the same cost, then our systems
arbitrarily select one of them; i.e., the systems select
the segmentation processed previously.

9The results for C99(b) in Table 3 are slightly different
from those listed in Table 6 of Choi's paper (Choi, 2000).
This is because the original results in that paper were based
on 500 samples, while the results in our Table 3 were based
on 700 samples (Choi, personal communication).
         3–11    3–5    6–8     9–11    Total
U00(b)   10%††   9%     7%††    5%††    9%††
C99(b)   12%     11%    10%     9%      11%
prob     2.7E-4  0.080  2.3E-3  1.0E-4  6.8E-9

Table 3: Comparison of $P_k$: the numbers of segments
were given beforehand.
The probability of the null hypothesis
being true is displayed in the row indicated by
“prob”. The column labels, such as “3–5”, indicate
that the numbers in the column are the averages
of $P_k$ over the corresponding sample texts.
“Total” indicates the averages of $P_k$ over all the
text samples.
These tables show statistically that our system
is more accurate than or at least as accurate as
C99. This means that our system is more accurate
than or at least as accurate as previous domain-independent
text segmentation systems, because
C99 has been shown to be more accurate than previous
domain-independent text segmentation systems.10

10Speed performance is not our main concern in this paper.
Our implementations of U00 and U00(b) are not optimal.
However, U00 and U00(b), which are implemented in
C, run as fast as C99 and C99(b), which are implemented in
Java (Choi, 2000), due to the difference in programming languages.
The average run times for a sample text were between
one and two seconds for each system on a Pentium III
750-MHz PC with 384-MB RAM running RedHat Linux 6.2.
5 Discussion
5.1 Evaluation
Evaluation of the output of text segmentation sys-
tems is difficult because the required segmenta-
tions depend on the application. In this paper, we
have used an artificial corpus to evaluate our sys-
tem. We regard this as appropriate for comparing
relative performance among systems.
It is important, however, to assess the perfor-
mance of systems by using real texts. These
texts should be domain independent. They should
also be multi-lingual if we want to test the mul-
tilinguality of systems. For English, Klavans et
al. describe a segmentation corpus in which the
texts were segmented by humans (Klavans et al.,
1998). But there are no such corpora for other
languages. We are planning to build a segmen-
tation corpus for Japanese, based on a corpus
of speech transcriptions (Maekawa and Koiso,
2000).
5.2 Related work
Our proposed algorithm finds the maximum-
probability segmentation of a given text. This
is a new approach for domain-independent text
segmentation. A probabilistic approach, however,
has already been proposed by Yamron et al. for
domain-dependent text segmentation (broadcast
news story segmentation) (Yamron et al., 1998).
They trained a hidden Markov model (HMM),
whose states correspond to topics. Given a word
sequence, their system assigns each word a topic
so that the maximum-probability topic sequence
is obtained. Their model is basically the same as
that used for HMM part-of-speech (POS) taggers
(Manning and Schütze, 1999), if we regard topics
as POS tags.11 Finding topic boundaries is equiv-
alent to finding topic transitions; i.e., a continuous
topic or segment is a sequence of words with the
same topic.
Their approach is indirect compared with our
approach, which directly finds the maximum-
probability segmentation. As a result, their model
can not straightforwardly incorporate features
pertaining to a segment itself, such as the average
length of segments. Our model, on the other hand,
can incorporate this information quite naturally.
Suppose that the length of a segment $x$ follows
a normal distribution $N(x; \mu, \sigma)$, with a mean of
$\mu$ and standard deviation of $\sigma$ (Ponte and Croft,
1997). Then Equation (13) can be augmented to

  $c(w^i_1 w^i_2 \cdots w^i_{n_i}; n, k, \mu, \sigma, \alpha, \beta, \gamma)$
    $= \alpha \sum_{j=1}^{|w^i_1 w^i_2 \cdots w^i_{n_i}|} \log \dfrac{|w^i_1 w^i_2 \cdots w^i_{n_i}| + k}{g(w^i_j \mid w^i_1 w^i_2 \cdots w^i_{n_i}) + 1} + \beta \log n + \gamma \log \dfrac{1}{N(|w^i_1 w^i_2 \cdots w^i_{n_i}|; \mu, \sigma)}$,  (17)

where $\alpha + \beta + \gamma = 1$. Equation (17) favors segments
whose lengths are similar to the average
length (in words).

11The details are different, though.
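Continuing the earlier sketches, Equation (17) could be prototyped as follows; the weights `alpha`, `beta`, and `gamma` are the $\alpha$, $\beta$, $\gamma$ of the equation, `normal_density` stands for $N(x; \mu, \sigma)$, and the function names are again our own.

```python
import math

def normal_density(x, mu, sigma):
    """N(x; mu, sigma): the normal density used as a segment-length prior."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def segment_cost_with_length_prior(segment, n, k, mu, sigma,
                                   alpha, beta, gamma):
    """Equation (17): Equation (13) reweighted by alpha and beta, plus
    gamma times -log of a normal prior on the segment length
    (alpha + beta + gamma = 1)."""
    n_i = len(segment)
    word_term = sum(
        math.log2((n_i + k) / (count_in_segment(w, segment) + 1))
        for w in segment
    )
    length_term = math.log2(1.0 / normal_density(n_i, mu, sigma))
    return alpha * word_term + beta * math.log2(n) + gamma * length_term
```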
Another major difference from their algorithm
is that our algorithm does not require training data
to estimate probabilities, while their algorithm
does. Therefore, our algorithm can be applied to
domain-independent texts, while their algorithm
is restricted to domains for which training data
are available. It would be interesting, however,
to compare our algorithm with their algorithm for
the case when training data are available. In such
a case, our model should be extended to incor-
porate various features such as the average seg-
ment length, clue words, named entities, and so
on (Reynar, 1999; Beeferman et al., 1999).
Our proposed algorithm naturally estimates the
probabilities of words in segments. These prob-
abilities, which are called word densities, have
been used to detect important descriptions of
words in texts (Kurohashi et al., 1997). This
method is based on the assumption that the den-
sity of a word is high in a segment in which the
word is discussed (defined and/or explained) in
some depth. It would be interesting to apply our
method to this application.
6 Conclusion
We have proposed a statistical model for domain-
independent text segmentation. This method finds
the maximum-probability segmentation of a given
text. The method has been shown to be more
accurate than or at least as accurate as previous
methods. We are planning to build a segmenta-
tion corpus for Japanese and evaluate our method
against this corpus.
Acknowledgements
We thank Freddy Y. Y. Choi for his text segmen-
tation package.

References

James Allan, Jaime Carbonell, George Doddington,
Jonathan Yamron, and Yiming Yang. 1998. Topic
detection and tracking pilot study final report. In
Proc. of the DARPA Broadcast News Transcription
and Understanding Workshop.

Doug Beeferman, Adam Berger, and John Lafferty.
1999. Statistical models for text segmentation. Ma-
chine Learning, 34(1-3):177-210.

Freddy Y. Y. Choi. 2000. Advances in domain independent linear text segmentation. In Proc. of
NAACL-2000.

Marti A. Hearst and Christian Plaunt. 1993. Subtopic
structuring for full-length document access. In
Proc. of the Sixteenth Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 59-68.

Marti A. Hearst. 1994. Multi-paragraph segmentation
of expository text. In Proc. of ACL '94.

Oskari Heinonen. 1998. Optimal multi-paragraph text
segmentation by dynamic programming. In Proc.
of COLING-ACL '98.

Min-Yen Kan, Judith L. Klavans, and Kathleen R.
McKeown. 1998. Linear segmentation and segment significance. In Proc. of WVLC-6, pages 197-205.

Judith L. Klavans, Kathleen R. McKeown, Min-Yen
Kan, and Susan Lee. 1998. Resources for the eval-
uation of summarization techniques. In Proceed-
ings of the 1st International Conference on Lan-
guage Resources and Evaluation (LREC), pages 899-902.

Hideki Kozima. 1993. Text segmentation based on
similarity between words. In Proc. of ACL '93.

Sadao Kurohashi, Nobuyuki Shiraki, and Makoto Na-
gao. 1997. A method for detecting important de-
scriptions of a word based on its density distribution
in text (in Japanese). IPSJ (Information Processing
Society of Japan) Journal, 38(4):845-854.

Kikuo Maekawa and Hanae Koiso. 2000. Design of
spontaneous speech corpus for Japanese. In Proc. of
International Symposium: Toward the Realization
of Spontaneous Speech Engineering, pages 70-77.

Christopher D. Manning and Hinrich Schütze. 1999.
Foundations of Statistical Natural Language Processing.
The MIT Press.

Masaaki Nagata. 1994. A stochastic Japanese morphological analyzer using a forward-DP backward-A* n-best search algorithm. In Proc. of COLING '94, pages 201-207.

Yoshio Nakao. 2000. An algorithm for one-page summarization of a long text based on thematic hierarchy detection. In Proc. of ACL 2000, pages 302-309.

Manabu Okumura and Takeo Honda. 1994. Word
sense disambiguation and text segmentation based
on lexical cohesion. In Proc. of COLING-94.

Jay M. Ponte and W. Bruce Croft. 1997. Text segmentation by topic. In Proc. of the First European
Conference on Research and Advanced Technology
for Digital Libraries, pages 120-129.

Jeffrey C. Reynar. 1994. An automatic method of
finding topic boundaries. In Proc. of ACL-94.

Jeffrey C. Reynar. 1998. Topic segmentation: Algo-
rithms and applications. Ph.D. thesis, Computer
and Information Science, University of Pennsylvania.

Jeffrey C. Reynar. 1999. Statistical models for topic
segmentation. In Proc. of ACL-99, pages 357-364.

Gerard Salton, Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Automatic text decomposition
using text segments and text themes. In Proc. of
Hypertext '96.

Andreas Stolcke and Stephen M. Omohundro. 1994.
Best-first model merging for hidden Markov model
induction. Technical Report TR-94-003, ICSI,
Berkeley, CA.

Yaakov Yaari. 1997. Segmentation of expository texts
by hierarchical agglomerative clustering. In Proc.
of the Recent Advances in Natural Language Processing.

J. P. Yamron, I. Carp, S. Lowe, and P. van Mulbregt. 1998. A hidden Markov model approach
to text segmentation and event tracking. In Proc. of
ICASSP-98.
