Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in NLP, pages 32–39,
Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
Feature-Based Segmentation of Narrative Documents
David Kauchak
Palo Alto Research Center and
University of California, San Diego
San Diego, CA 92093
dkauchak@cs.ucsd.edu
Francine Chen
Palo Alto Research Center
3333 Coyote Hill Rd.
Palo Alto, CA 94304
fchen@parc.com
Abstract
In this paper we examine topic segmen-
tation of narrative documents, which are
characterized by long passages of text
with few headings. We  rst present results
suggesting that previous topic segmenta-
tion approaches are not appropriate for
narrative text. We then present a feature-
based method that combines features from
diverse sources as well as learned features.
Applied to narrative books and encyclope-
dia articles, our method shows results that
are signi cantly better than previous seg-
mentation approaches. An analysis of in-
dividual features is also provided and the
bene t of generalization using outside re-
sources is shown.
1 Introduction
Many long text documents, such as magazine arti-
cles, narrative books and news articles contain few
section headings. The number of books in narrative
style that are available in digital form is rapidly in-
creasing through projects such as Project Gutenberg
and the Million Book Project at Carnegie Mellon
University. Access to these collections is becom-
ing easier with directories such as the Online Books
Page at the University of Pennsylvania.
As text analysis and retrieval moves from retrieval
of documents to retrieval of document passages, the
ability to segment documents into smaller, coherent
regions enables more precise retrieval of meaningful
portions of text (Hearst, 1994) and improved ques-
tion answering. Segmentation also has applications
in other areas of information access, including docu-
ment navigation (Choi, 2000), anaphora and ellipsis
resolution, and text summarization (Kozima, 1993).
Research projects on text segmentation have fo-
cused on broadcast news stories (Beeferman et al.,
1999), expository texts (Hearst, 1994) and synthetic
texts (Li and Yamanishi, 2000; Brants et al., 2002).
Broadcast news stories contain cues that are indica-
tive of a new story, such as  coming up , or phrases
that introduce a reporter, which are not applicable to
written text. In expository texts and synthetic texts,
there is repetition of terms within a topical segment,
so that the similarity of  blocks of text is a useful
indicator of topic change. Synthetic texts are created
by concatenating stories, and exhibit stronger topic
changes than the subtopic changes within a docu-
ment; consequently, algorithms based on the simi-
larity of text blocks work well on these texts.
In contrast to these earlier works, we present a
method for segmenting narrative documents. In this
domain there is little repetition of words and the seg-
mentation cues are weaker than in broadcast news
stories, resulting in poor performance from previous
methods.
We present a feature-based approach, where the
features are more strongly engineered using linguis-
tic knowledge than in earlier approaches. The key to
most feature-based approaches, particularly in NLP
tasks where there is a broad range of possible feature
sources, is identifying appropriate features. Select-
ing features in this domain presents a number of in-
teresting challenges. First, features used in previous
methods are not suf cient for solving this problem.
We explore a number of different sources of infor-
mation for extracting features, many previously un-
used. Second, the sparse nature of text and the high
32
cost of obtaining training data requires generaliza-
tion using outside resources. Finally, we incorporate
features from non-traditional resources such as lexi-
cal chains where features must be extracted from the
underlying knowledge representation.
2 Previous Approaches
Previous topic segmentation methods fall into three
groups: similarity based, lexical chain based, and
feature based. In this section we give a brief
overview of each of these groups.
2.1 Similarity-based
One popular method is to generate similarities be-
tween blocks of text (such as blocks of words,
sentences or paragraphs) and then identify section
boundaries where dips in the similarities occur.
The cosine similarity measure between term vec-
tors is used by Hearst (1994) to de ne the simi-
larity between blocks. She notes that the largest
dips in similarity correspond to de ned boundaries.
Brants et al. (2002) learn a PLSA model using EM
to smooth the term vectors. The model is parame-
terized by introducing a latent variable, representing
the possible  topics . They show good performance
on a number of different synthetic data sets.
Kozima and Furugori (1994) use another similar-
ity metric they call  lexical cohesion . The  cohe-
siveness of a pair of words is calculated by spread-
ing activation on a semantic network as well as word
frequency. They showed that dips in lexical cohe-
sion plots had some correlation with human subject
boundary decisions on one short story.
2.2 Lexical Chains
Semantic networks de ne relationships be-
tween words such as synonymy, specializa-
tion/generalization and part/whole. Stokes et al.
(2002) use these relationships to construct lexical
chains. A lexical chain is a sequence of lexicograph-
ically related word occurrences where every word
occurs within a set distance from the previous word.
A boundary is identi ed where a large numbers of
lexical chains begin and end. They showed that
lexical chains were useful for determining the text
structure on a set of magazine articles, though they
did not provide empirical results.
2.3 Feature-based
Beeferman et al. (1999) use an exponential model
and generate features using a maximum entropy se-
lection criterion. Most features learned are cue-
based features that identify a boundary based on the
occurrence of words or phrases. They also include a
feature that measures the difference in performance
of a  long range vs.  short range model. When
the short range model outperforms the long range
model, this indicates a boundary. Their method per-
formed well on a number of broadcast news data
sets, including the CNN data set from TDT 1997.
Reynar (1999) describes a maximum entropy
model that combines hand selected features, includ-
ing: broadcast news domain cues, number of content
word bigrams, number of named entities, number of
content words that are WordNet synonyms in the left
and right regions, percentage of content words in the
right segment that are  rst uses, whether pronouns
occur in the  rst  ve words, and whether a word
frequency based algorithm predicts a boundary. He
found that for the HUB-4 corpus, which is composed
of transcribed broadcasts, that the combined feature
model performed better than TextTiling.
Mochizuki et al. (1998) use a combination of lin-
guistic cues to segment Japanese text. Although a
number of cues do not apply to English (e.g., top-
ical markers), they also use anaphoric expressions
and lexical chains as cues. Their study was small,
but did indicate that lexical chains are a useful cue
in some domains.
These studies indicate that a combination of fea-
tures can be useful for segmentation. However,
Mochizuki et al. (1998) analyzed Japanese texts, and
Reynar (1999) and Beeferman et al. (1999) evalu-
ated on broadcast news stories, which have many
cues that narrative texts do not. Beeferman et al.
(1999) also evaluated on concatenated Wall Street
Journal articles, which have stronger topic changes
than within a document. In our work, we examine
the use of linguistic features for segmentation of nar-
rative text in English.
3 Properties of Narrative Text
Characterizing data set properties is the  rst step
towards deriving useful features. The approaches
in the previous section performed well on broad-
33
Table 1: Previous approaches evaluated on narrative
data from Biohazard
Word Sent. Window
Model Error Error Diff
random 0.486 0.490 0.541
TextTiling 0.481 0.497 0.526
PLSA 0.480 0.521 0.559
cast news, expository and synthetic data sets. Many
properties of these documents are not shared by nar-
rative documents. These properties include: 1) cue
phrases, such as  welcome back and  joining us 
that feature-based methods used in broadcast news,
2) strong topic shifts, as in synthetic documents cre-
ated by concatenating newswire articles, and 3) large
data sets such that the training data and testing data
appeared to come from similar distributions.
In this paper we examine two narrative-style
books: Biohazard by Ken Alibek and The Demon
in the Freezer by Richard Preston. These books are
segmented by the author into sections. We manu-
ally examined these author identi ed boundaries and
they are reasonable. We take these sections as true
locations of segment boundaries. We split Biohaz-
ard into three parts, two for experimentation (exp1
and exp2) and the third as a holdout for testing. De-
mon in the Freezer was reserved for testing. Biohaz-
ard contains 213 true and 5858 possible boundaries.
Demon has 119 true and 4466 possible boundaries.
Locations between sentences are considered possi-
ble boundaries and were determined automatically.
We present an analysis of properties of the book
Biohazard by Ken Alibek as an exemplar of nar-
rative documents (for this section, test=exp1 and
train=exp2). These properties are different from pre-
vious expository data sets and will result in poor per-
formance for the algorithms mentioned in Section 2.
These properties help guide us in deriving features
that may be useful for segmenting narrative text.
Vocabulary The book contains a single topic with a
number of sub-topics. These changing topics, com-
bined with the varied use of words for narrative doc-
uments, results in many unseen terms in the test set.
25% of the content words in the test set do not oc-
cur in the training set and a third of the words in the
test set occur two times or less in the training set.
This causes problems for those methods that learn
a model of the training data such as Brants et al.
(2002) and Beeferman et al. (1999) because, with-
out outside resources, the information in the training
data is not suf cient to generalize to the test set.
Boundary words Many feature-based methods rely
on cues at the boundaries (Beeferman et al., 1999;
Reynar, 1999). 474 content terms occur in the  rst
sentence of boundaries in the training set. Of these
terms, 103 occur at the boundaries of the test set.
However, of those terms that occur signi cantly at
a training set boundary (where signi cant is de-
termined by a likelihood-ratio test with a signi -
cance level of 0.1), only 9 occur at test boundaries.
No words occur signi cantly at a training boundary
AND also signi cantly at a test boundary.
Segment similarity Table 1 shows that two
similarity-based methods that perform well on syn-
thetic and expository text perform poorly (i.e., on
par with random) on Biohazard. The poor perfor-
mance occurs because block similarities provide lit-
tle information about the actual segment boundaries
on this data set. We examined the average similarity
for two adjacent regions within a segment versus the
average similarity for two adjacent regions that cross
a segment boundary. If the similarity scores were
useful, the within segment scores would be higher
than across segment scores. Similarities were gener-
ated using the PLSA model, averaging over multiple
models with between 8 and 20 latent classes. The
average similarity score within a segment was 0.903
with a standard deviation of 0.074 and the average
score across a segment boundary was 0.914 with a
standard deviation of 0.041. In this case, the across
boundary similarity is actually higher. Similar val-
ues were observed for the cosine similarities used by
the TextTiling algorithm, as well as with other num-
bers of latent topics for the PLSA model. For all
cases examined, there was little difference between
inter-segment similarity and across-boundary simi-
larity, and there was always a large standard devia-
tion.
Lexical chains Lexical chains were identi ed as
synonyms (and exact matches) occurring within
a distance of one-twentieth the average segment
length and with a maximum chain length equal to
the average segment length (other values were ex-
34
amined with similar results). Stokes et al. (2002)
suggest that high concentrations of lexical chain be-
ginnings and endings are indicative of a boundary
location. On the narrative data, of the 219 over-
all chains, only 2 begin at a boundary and only 1
ends at a boundary. A more general heuristic iden-
ti es boundaries where there is an increase in the
number of chains beginning and ending near a possi-
ble boundary while also minimizing chains that span
boundaries. Even this heuristic does not appear in-
dicative on this data set. Over 20% of the chains
actually cross segment boundaries. We also mea-
sured the average distance from a boundary and the
nearest beginning and ending of a chain if a chain
begins/ends within that segment. If the chains are a
good feature, then these should be relatively small.
The average segment length is 185 words, but the
average distance to the closest beginning chain is 39
words away and closest ending chain is 36 words
away. Given an average of 4 chains per segment,
the beginning and ending of chains were not concen-
trated near boundary locations in our narrative data,
and therefore not indicative of boundaries.
4 Feature-Based Segmentation
We pose the problem of segmentation as a classi -
cation problem. Sentences are automatically iden-
ti ed and each boundary between sentences is a
possible segmentation point. In the classi cation
framework, each segmentation point becomes an ex-
ample. We examine both support vector machines
(SVMlight (Joachims, 1999)) and boosted decision
stumps (Weka (Witten and Frank, 2000)) for our
learning algorithm. SVMs have shown good per-
formance on a variety of problems, including nat-
ural language tasks (Cristianini and Shawe-Taylor,
2000), but require careful feature selection. Classi -
cation using boosted decisions stumps can be a help-
ful tool for analyzing the usefulness of individual
features. Examining multiple classi cation meth-
ods helps avoid focusing on the biases of a particular
learning method.
4.1 Example Reweighting
One problem with formulating the segmentation
problem as a classi cation problem is that there are
many more negative than positive examples. To dis-
courage the learning algorithm from classifying all
results as negative and to instead focus on the posi-
tive examples, the training data must be reweighted.
We set the weight of positive vs. negative exam-
ples so that the number of boundaries after testing
agrees with the expected number of segments based
on the training data. This is done by iteratively ad-
justing the weighting factor while re-training and re-
testing until the predicted number of segments on the
test set is approximately the expected number. The
expected number of segments is the number of sen-
tences in the test set divided by the number of sen-
tences per segment in the training data. This value
can also be weighted based on prior knowledge.
4.2 Preprocessing
A number of preprocessing steps are applied to the
books to help increase the informativeness of the
texts. The book texts were obtained using OCR
methods with human correction. The text is pre-
processed by tokenizing, removing stop words, and
stemming using the Inxight LinguistiX morpholog-
ical analyzer. Paragraphs are identi ed using for-
matting information. Sentences are identi ed using
the TnT tokenizer and parts of speech with the TnT
part of speech tagger (Brants, 2000) with the stan-
dard English Wall Street Journal n-grams. Named
entities are identi ed using  nite state technology
(Beesley and Karttunen, 2003) to identify various
entities including: person, location, disease and or-
ganization. Many of these preprocessing steps help
provide salient features for use during segmentation.
4.3 Engineered Features
Segmenting narrative documents raises a number of
interesting challenges. First, labeling data is ex-
tremely time consuming. Therefore, outside re-
sources are required to better generalize from the
training data. WordNet is used to identify words that
are similar and tend to occur at boundaries for the
 word group feature. Second, some sources of in-
formation, in particular entity chains, do not  t into
the standard feature based paradigm. This requires
extracting features from the underlying information
source. Extracting these features represents a trade-
off between information content and generalizabil-
ity. In the case of entity chains, we extract features
that characterize the occurrence distribution of the
35
entity chains. Finally, the  word groups and  entity
groups feature groups generate candidate features
and a selection process is required to select useful
features. We found that a likelihood ratio test for sig-
ni cance worked well for identifying those features
that would be useful for classi cation. Throughout
this section, when we use the term  signi cant we
are referring to signi cant with respect to the likeli-
hood ratio test (with a signi cance level of 0.1).
We selected features both a priori and dynami-
cally during training (i.e., word groups and entity
groups are selected dynamically). Feature selection
has been used by previous segmentation methods
(Beeferman et al., 1999) as a way of adapting bet-
ter to the data. In our approach, knowledge about
the task is used more strongly in de ning the fea-
ture types, and the selection of features is performed
prior to the classi cation step. We also used mutual
information, statistical tests of signi cance and clas-
si cation performance on a development data set to
identify useful features.
Word groups In Section 3 we showed that there are
not consistent cue phrases at boundaries. To general-
ize better, we identify word groups that occur signif-
icantly at boundaries. A word group is all words that
have the same parent in the WordNet hierarchy. A
binary feature is used for each learned group based
on the occurrence of at least one of the words in the
group. Groups found include months, days, tempo-
ral phrases, military rankings and country names.
Entity groups For each entity group (i.e. named
entities such as person, city, or disease tagged by the
named entity extractor) that occurs signi cantly at
a boundary, a feature indicating whether or not an
entity of that group occurs in the sentence is used.
Full name The named entity extraction system
tags persons named in the document. A rough
co-reference resolution was performed by group-
ing together references that share at least one to-
ken (e.g.,  General Yury Tikhonovich Kalinin and
 Kalinin ). The full name of a person is the longest
reference of a group referring to the same person.
This feature indicates whether or not the sentence
contains a full name.
Entity chains Word relationships work well when
the documents have disjoint topics; however, when
topics are similar, words tend to relate too easily. We
propose a more stringent chaining method called en-
tity chains. Entity chains are constructed in the same
fashion as lexical chains, except we consider named
entities. Two entities are considered related (i.e. in
the same chain) if they refer to the same entity. We
construct entity chains and extract features that char-
acterize these chains: How many chains start/end at
this sentence? How many chains cross over this sen-
tence/previous sentence/next sentence? Distance to
the nearest dip/peak in the number of chains? Size
of that dip/peak?
Pronoun Does the sentence contain a pronoun?
Does the sentence contain a pronoun within 5 words
of the beginning of the sentence?
Numbers During training, the patterns of numbers
that occur signi cantly at boundaries are selected.
Patterns considered are any number and any number
with a speci ed length. The feature then checks if
that pattern appears in the sentence. A commonly
found pattern is the number pattern of length 4,
which often refers to a year.
Conversation Is this sentence part of a conversa-
tion, i.e. does this sentence contain  direct speech ?
This is determined by tracking beginning and end-
ing quotes. Quoted regions and single sentences be-
tween two quoted regions are considered part of a
conversation.
Paragraph Is this the beginning of a paragraph?
5 Experiments
In this section, we examine a number of narra-
tive segmentation tasks with different segmentation
methods. The only data used during development
was the  rst two thirds from Biohazard (exp1 and
exp2). All other data sets were only examined after
the algorithm was developed and were used for test-
ing purposes. Unless stated otherwise, results for the
feature based method are using the SVM classi er.1
5.1 Evaluation Measures
We use three segmentation evaluation metrics that
have been recently developed to account for  close
but not exact placement of hypothesized bound-
aries: word error probability, sentence error prob-
ability, and WindowDiff. Word error probability
1SVM and boosted decision stump performance is similar.
For brevity, only SVM results are shown for most results.
36
Table 2: Experiments with Biohazard
Word Sent. Window Sent err
Error Error Diff improv
Biohazard
random (sent.) 0.488 0.485 0.539   -
random (para.) 0.481 0.477 0.531 (base)
Biohazard
exp1 → holdout 0.367 0.357 0.427 25%
exp2 → holdout 0.344 0.325 0.395 32%
3x cross validtn. 0.355 0.332 0.404 24%
Train Biohazard
Test Demon 0.387 0.364 0.473 25%
(Beeferman et al., 1999) estimates the probability
that a randomly chosen pair of words k words apart
is incorrectly classi ed, i.e. a false positive or false
negative of being in the same segment. In contrast to
the standard classi cation measures of precision and
recall, which would consider a  close hypothesized
boundary (e.g., off by one sentence) to be incorrect,
word error probability gently penalizes  close hy-
pothesized boundaries. We also compute the sen-
tence error probability, which estimates the proba-
bility that a randomly chosen pair of sentences s sen-
tences apart is incorrectly classi ed. k and s are cho-
sen to be half the average length of a section in the
test data. WindowDiff (Pevzner and Hearst, 2002)
uses a sliding window over the data and measures
the difference between the number of hypothesized
boundaries and the actual boundaries within the win-
dow. This metric handles several criticisms of the
word error probability metric.
5.2 Segmenting Narrative Books
Table 2 shows the results of the SVM-segmenter on
Biohazard and Demon in the Freezer. A baseline
performance for segmentation algorithms is whether
the algorithm performs better than naive segment-
ing algorithms: choose no boundaries, choose all
boundaries and choose randomly. Choosing all
boundaries results in word and sentence error proba-
bilities of approximately 55%. Choosing no bound-
aries is about 45%. Table 2 also shows the results
for random placement of the correct number of seg-
ments. Both random boundaries at sentence loca-
tions and random boundaries at paragraph locations
are shown (values shown are the averages of 500
random runs). Similar results were obtained for ran-
dom segmentation of the Demon data.
Table 3: Performance on Groliers articles
Word Sent. Window
Error Error Diff
random 0.482 0.483 0.532
TextTile 0.407 0.412 0.479
PLSA 0.420 0.435 0.507
features (stumps) 0.387 0.400 0.495
features (SVM) 0.385 0.398 0.503
For Biohazard the holdout set was not used dur-
ing development. When trained on either of the de-
velopment thirds of the text (i.e., exp1 or exp2) and
tested on the test set, a substantial improvement is
seen over random. 3-fold cross validation was done
by training on two-thirds of the data and testing on
the other third. Recalling from Table 1 that both
PLSA and TextTiling result in performance simi-
lar to random even when given the correct number
of segments, we note that all of the single train/test
splits performed better than any of the naive algo-
rithms and previous methods examined.
To examine the ability of our algorithm to perform
on unseen data, we trained on the entire Biohaz-
ard book and tested on Demon in the Freezer. Per-
formance on Demon in the Freezer is only slightly
worse than the Biohazard results and is still much
better than the baseline algorithms as well as previ-
ous methods. This is encouraging since Demon was
not used during development, is written by a differ-
ent author and has a segment length distribution that
is different than Biohazard (average segment length
of 30 vs. 18 in Biohazard).
5.3 Segmenting Articles
Unfortunately, obtaining a large number of narrative
books with meaningful labeled segmentation is dif-
 cult. To evaluate our algorithm on a larger data set
as well as a wider variety of styles similar to narra-
tive documents, we also examine 1000 articles from
Groliers Encyclopedia that contain subsections de-
noted by major and minor headings, which we con-
sider to be the true segment boundaries. The articles
contained 8,922 true and 102,116 possible bound-
aries. We randomly split the articles in half, and
perform two-fold cross-validation as recommended
by Dietterich (1998). Using 500 articles from one
half of the pair for testing, 50 articles are randomly
selected from the other half for training. We used
37
Table 4: Ave. human performance (Hearst, 1994)
Word Sent. Window
Error (%) Error (%) Diff (%)
Sequoia 0.275 0.272 0.351
Earth 0.219 0.221 0.268
Quantum 0.179 0167 0.316
Magellan 0.147 0.147 0.157
a subset of only 50 articles due to the high cost of
labeling data. Each split yields two test sets of 500
articles and two training sets. This procedure of two-
fold cross-validation is performed  ve times, for a
total of 10 training and 10 corresponding test sets.
Signi cance is then evaluated using the t-test.
The results for segmenting Groliers Encyclope-
dia articles are given in Table 3. We compare
the performance of different segmentation models:
two feature-based models (SVMs, boosted deci-
sion stumps), two similarity-based models (PLSA-
based segmentation, TextTiling), and randomly se-
lecting segmentation points. All segmentation sys-
tems are given the estimated number of segmenta-
tion points based based on the training data. The
feature based approaches are signi cantly2 better
than either PLSA, TextTiling or random segmenta-
tion. For our selected features, boosted stump per-
formance is similar to using an SVM, which rein-
forces our intuition that the selected features (and
not just classi cation method) are appropriate for
this problem.
Table 1 indicates that the previous TextTiling and
PLSA-based approaches perform close to random
on narrative text. Our experiments show a perfor-
mance improvement of >24% by our feature-based
system, and signi cant improvement over other
methods on the Groliers data. Hearst (1994) ex-
amined the task of identifying the paragraph bound-
aries in expository text. We provide analysis of this
data set here to emphasize that identifying segments
in natural text is a dif cult problem and since cur-
rent evaluation methods were not used when this
data was initially presented. Human performance
on this task is in the 15%-35% error rate. Hearst
asked seven human judges to label the paragraph
2For both SVM and stumps at a level of 0.005 us-
ing a t-test except SVM TextTile-WindowDiff (at 0.05)
and stumps TextTile-WindowDiff and SVM/stumps PLSA-
WindowDiff (not signi cantly different)
Table 5: Feature occurrences at boundary and non-
boundary locations
boundary non-boundary
Paragraph 74 621
Entity groups 44 407
Word groups 39 505
Numbers 16 59
Full name 2 109
Conversation 0 510
Pronoun 8 742
Pronoun ≤ 5 1 330
boundaries of four different texts. Since no ground
truth was available, true boundaries were identi ed
by those boundaries that had a majority vote as a
boundary. Table 4 shows the average human perfor-
mance for each text. We show these results not for
direct comparison with our methods, but to highlight
that even human segmentation on a related task does
not achieve particularly low error rates.
5.4 Analysis of Features
The top section of Table 5 shows features that are
intuitively hypothesized to be positively correlated
with boundaries and the bottom section shows nega-
tively correlated. For this analysis, exp1 from Alibek
was used for training and the holdout set for testing.
There are 74 actual boundaries and 2086 possibly
locations. Two features have perfect recall: para-
graph and conversation. Every true section bound-
ary is at a paragraph and no section boundaries are
within conversation regions. Both the word group
and entity group features have good correlation with
boundary locations and also generalized well to the
training data by occurring in over half of the positive
test examples.
The bene t of generalization using outside re-
sources can be seen by comparing the boundary
words found using word groups versus those found
only in the training set as in Section 3. Using word
groups triples the number of signi cant words found
in the training set that occur in the test set. Also, the
number of shared words that occur signi cantly in
both the training and test set goes from none to 9.
More importantly, signi cant words occur in 37 of
the test segments instead of none without the groups.
38
6 Discussion and Summary
Based on properties of narrative text, we proposed
and investigated a set of features for segmenting nar-
rative text. We posed the problem of segmentation
as a feature-based classi cation problem, which pre-
sented a number of challenges: many different fea-
ture sources, generalization from outside resources
for sparse data, and feature extraction from non-
traditional information sources.
Feature selection and analyzing feature interac-
tion is crucial for this type of application. The para-
graph feature has perfect recall in that all boundaries
occur at paragraph boundaries. Surprisingly, for cer-
tain train/test splits of the data, the performance of
the algorithm was actually better without the para-
graph feature than with it. We hypothesize that the
noisiness of the data is causing the classi er to learn
incorrect correlations.
In addition to feature selection issues, posing the
problem as a classi cation problem loses the se-
quential nature of the data. This can produce very
unlikely segment lengths, such as a single sentence.
We alleviated this by selecting features that capture
properties of the sequence. For example, the entity
chains features represent some of this type of infor-
mation. However, models for complex sequential
data should be examined as possible better methods.
We evaluated our algorithm on two books and
encyclopedia articles, observing signi cantly bet-
ter performance than randomly selecting the correct
number of segmentation points, as well as two pop-
ular, previous approaches, PLSA and TextTiling.
Acknowledgments
We thank Marti Hearst for the human subject perfor-
mance data and the anonymous reviewers for their
very helpful comments. Funded in part by the Ad-
vanced Research and Development Activity NIMD
program (MDA904-03-C-0404).

References
Doug Beeferman, Adam Berger, and John Lafferty.
1999. Statistical models for text segmentation. Ma-
chine Learning, 34:177 210.
Kenneth R. Beesley and Lauri Karttunen. 2003. Finite
State Morphology. CSLI Publications, Palo Alto, CA.
Thorsten Brants, Francine Chen, and Ioannis Tsochan-
taridis. 2002. Topic-based document segmentation
with probabilistic latent semantic analysis. In Pro-
ceedings of CIKM, pg. 211 218.
Thorsten Brants. 2000. TnT  a statistical part-of-speech
tagger. In Proceedings of the Applied NLP Confer-
ence.
Freddy Choi. 2000. Improving the ef ciency of speech
interfaces for text navigation. In Proceedings of IEEE
Colloquium: Speech and Language Processing for
Disabled and Elderly People.
Nello Cristianini and John Shawe-Taylor. 2000. An In-
troduction to Support Vector Machines. Cambridge
University Press.
Thomas Dietterich. 1998. Approximate statistical tests
for comparing supervised classi cation learning algo-
rithms. Neural Computation, 10:1895 1923.
Marti A. Hearst. 1994. Multi-paragraph segmentation of
expository text. In Meeting of ACL, pg. 9 16.
Thorsten Joachims, 1999. Advances in Kernel Methods -
Support Vector Learning, chapter Making large-Scale
SVM Learning Practical. MIT-Press.
Hideki Kozima and Teiji Furugori. 1994. Segmenting
narrative text into coherent scenes. In Literary and
Linguistic Computing, volume 9, pg. 13 19.
Hideki Kozima. 1993. Text segmentation based on sim-
ilarity between words. In Meeting of ACL, pg. 286 
288.
Hang Li and Kenji Yamanishi. 2000. Topic analysis us-
ing a  nite mixture model. In Proceedings of Joint
SIGDAT Conference of EMNLP and Very Large Cor-
pora, pg. 35 44.
Hajime Mochizuki, Takeo Honda, and Manabu Okumura.
1998. Text segmentation with multiple surface lin-
guistic cues. In COLING-ACL, pg. 881 885.
Lev Pevzner and Marti Hearst. 2002. A critique and
improvement of an evaluation metric for text segmen-
tation. Computational Linguistics, pg. 19 36.
Jeffrey Reynar. 1999. Statistical models for topic seg-
mentation. In Proceedings of ACL, pg. 357 364.
Nicola Stokes, Joe Carthy, and Alex Smeaton. 2002.
Segmenting broadcast news streams using lexical
chains. In Proceedings of Starting AI Researchers
Symposium, (STAIRS 2002), pg. 145 154.
Ian H. Witten and Eibe Frank. 2000. Data Mining:
Practical machine learning tools with Java implemen-
tations. Morgan Kaufmann.
