A Maximum Entropy Approach to Identifying Sentence 
Boundaries 
Jeffrey C. Reynar and Adwait Ratnaparkhi* 
Department of Computer and Information Science 
University of Pennsylvania 
Philadelphia, Pennsylvania, USA
{jcreynar, adwait}@unagi.cis.upenn.edu
Abstract 
We present a trainable model for identify- 
ing sentence boundaries in raw text. Given 
a corpus annotated with sentence boundaries, our model
learns to classify each occurrence of ., ?, and ! as
either a valid or invalid sentence boundary. The training
procedure requires no hand-crafted rules, lexica,
part-of-speech tags, or domain-specific
information. The model can therefore be 
trained easily on any genre of English, and 
should be trainable on any other Roman- 
alphabet language. Performance is compa- 
rable to or better than the performance of 
similar systems, but we emphasize the sim- 
plicity of retraining for new domains. 
1 Introduction 
The task of identifying sentence boundaries in text 
has not received as much attention as it deserves. 
Many freely available natural language processing 
tools require their input to be divided into sentences, 
but make no mention of how to accomplish this (e.g. 
(Brill, 1994; Collins, 1996)). Others perform the 
division implicitly without discussing performance 
(e.g. (Cutting et al., 1992)). 
At first glance, it may appear that using a short
list of sentence-final punctuation marks, such as .,
?, and !, is sufficient. However, these punctuation
marks are not used exclusively to mark sentence
breaks. For example, embedded quotations may contain
any of the sentence-ending punctuation marks, and . is
used as a decimal point, in e-mail addresses, to
indicate ellipsis, and in abbreviations. Both ! and ?
are somewhat less ambiguous but appear in proper names
and may be used multiple times for emphasis to mark a
single sentence boundary.
* The authors would like to acknowledge the support
of ARPA grant N66001-94-C-6043, ARO grant DAAH04-94-G-0426
and NSF grant SBR89-20230.
Lexically-based rules could be written and excep- 
tion lists used to disambiguate the difficult cases 
described above. However, the lists will never be 
exhaustive, and multiple rules may interact badly 
since punctuation marks exhibit absorption proper- 
ties. Sites which logically should be marked with 
multiple punctuation marks will often only have one 
((Nunberg, 1990) as summarized in (White, 1995)). 
For example, a sentence-ending abbreviation will
most likely not be followed by an additional period
if the abbreviation already contains one (e.g. note
that D.C is followed by only a single . in The
president lives in Washington, D.C.).
As a result, we believe that manually writing rules 
is not a good approach. Instead, we present a solu- 
tion based on a maximum entropy model which re- 
quires a few hints about what information to use and 
a corpus annotated with sentence boundaries. The 
model trains easily and performs comparably to sys- 
tems that require vastly more information. Training 
on 39441 sentences takes 18 minutes on a Sun Ultra 
Sparc and disambiguating the boundaries in a single 
Wall Street Journal article requires only 1.4 seconds. 
2 Previous Work 
To our knowledge, there have been few papers about 
identifying sentence boundaries. The most recent 
work will be described in (Palmer and Hearst, To
appear). There is also a less detailed description of
Palmer and Hearst's system, SATZ, in (Palmer and
Hearst, 1994).¹ The SATZ architecture uses either
a decision tree or a neural network to disambiguate
sentence boundaries. The neural network achieves
98.5% accuracy on a corpus of Wall Street Journal
articles using a lexicon which includes part-of-speech
(POS) tag information. By increasing the quantity
of training data and decreasing the size of their test
corpus, Palmer and Hearst achieved performance of
98.9% with the neural network. They obtained similar
results using the decision tree. All the results we
will present for our algorithms are on their initial,
larger test corpus.
¹ We recommend these articles for a more comprehensive
review of sentence-boundary identification work
than we will be able to provide here.
In (Riley, 1989), Riley describes a decision-tree
based approach to the problem. His performance on
the Brown corpus is 99.8%, using a model learned
from a corpus of 25 million words. Liberman and
Church suggest in (Liberman and Church, 1992)
that a system could be quickly built to divide
newswire text into sentences with a nearly negligible
error rate, but do not actually build such a system.
3 Our Approach 
We present two systems for identifying sentence
boundaries. One is targeted at high performance
and uses some knowledge about the structure of English
financial newspaper text which may not be applicable
to text from other genres or in other languages.
The other system uses no domain-specific
knowledge and is aimed at being portable across English
text genres and Roman alphabet languages.
Potential sentence boundaries are identified by
scanning the text for sequences of characters separated
by whitespace (tokens) containing one of the
symbols !, . or ?. We use information about the token
containing the potential sentence boundary, as
well as contextual information about the tokens immediately
to the left and to the right. We also conducted
tests using wider contexts, but performance
did not improve.
We call the token containing the symbol which 
marks a putative sentence boundary the Candidate. 
The portion of the Candidate preceding the potential
sentence boundary is called the Prefix and the
portion following it is called the Suffix. The system 
that focused on maximizing performance used the 
following hints, or contextual "templates": 
• The Prefix 
• The Suffix 
• The presence of particular characters in the Pre- 
fix or Suffix 
• Whether the Candidate is an honorific (e.g.
Ms., Dr., Gen.)
• Whether the Candidate is a corporate designator
(e.g. Corp., S.p.A., L.L.C.)
• Features of the word left of the Candidate 
• Features of the word right of the Candidate 
The templates specify only the form of the information.
The exact information used by the
maximum entropy model for the potential sentence
boundary marked by . in Corp. in Example 1 would
be: PreviousWordIsCapitalized, Prefix=Corp,
Suffix=NULL, PrefixFeature=CorporateDesignator.
(1) ANLP Corp. chairman Dr. Smith resigned. 
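The Candidate, Prefix and Suffix definitions above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the name `find_candidates` and the tuple layout are hypothetical, and tokens are taken to be whitespace-delimited as described in the text.

```python
BOUNDARY_MARKS = ".?!"

def find_candidates(text):
    """Return (token_index, prefix, mark, suffix) for every potential boundary.

    Tokens are whitespace-delimited. Each occurrence of ., ? or ! inside a
    token yields one entry: the Prefix is the part of the token before the
    mark and the Suffix is the part after it.
    """
    results = []
    for index, token in enumerate(text.split()):
        for position, char in enumerate(token):
            if char in BOUNDARY_MARKS:
                results.append((index, token[:position], char, token[position + 1:]))
    return results

example = "ANLP Corp. chairman Dr. Smith resigned."
for entry in find_candidates(example):
    print(entry)
```

Under this reading, a token such as D.C. produces two entries, one for each period, so each mark is classified separately.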
The highly portable system uses only the identity
of the Candidate and its neighboring words, and a
list of abbreviations induced from the training data.²
Specifically, the "templates" used are: 
• The Prefix 
• The Suffix 
• Whether the Prefix or Suffix is on the list of 
induced abbreviations 
• The word left of the Candidate
• The word right of the Candidate 
• Whether the word to the left or right of the 
Candidate is on the list of induced abbreviations 
The information this model would use for Example 1
would be: PreviousWord=ANLP, FollowingWord=chairman,
Prefix=Corp, Suffix=NULL,
PrefixFeature=InducedAbbreviation.
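As a rough sketch of how this information might be assembled for one candidate, the following Python function builds the values listed for Example 1. The function name, the dictionary layout and the way the abbreviation list is matched (Prefix plus the candidate mark) are assumptions made for illustration; the templates that test the neighboring words and the Suffix against the abbreviation list are omitted for brevity.

```python
def portable_features(tokens, index, position, abbreviations):
    """Build a subset of the highly portable system's information.

    tokens: whitespace-delimited tokens of the text
    index: index of the Candidate token
    position: offset of the candidate mark inside that token
    abbreviations: set of induced abbreviations, stored with their periods
    """
    candidate = tokens[index]
    prefix = candidate[:position]
    suffix = candidate[position + 1:]
    features = {
        "PreviousWord": tokens[index - 1] if index > 0 else "NULL",
        "FollowingWord": tokens[index + 1] if index + 1 < len(tokens) else "NULL",
        "Prefix": prefix or "NULL",
        "Suffix": suffix or "NULL",
    }
    # Assumption: the Prefix counts as "on the list" when the Prefix plus the
    # candidate mark matches a stored abbreviation such as "Corp.".
    if prefix + candidate[position] in abbreviations:
        features["PrefixFeature"] = "InducedAbbreviation"
    return features

tokens = "ANLP Corp. chairman Dr. Smith resigned.".split()
features = portable_features(tokens, 1, 4, {"Corp.", "Dr."})
print(features)
```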
The abbreviation list is automatically produced
from the training data, and the contextual questions
are also automatically generated by scanning
the training data with question templates. As a result,
no hand-crafted rules or lists are required by
the highly portable system and it can be easily retrained
for other languages or text genres.
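The abbreviation induction step could look roughly like the sketch below. It assumes, purely for illustration, that the annotated training data is available as one string per sentence and that the annotated boundary is the final character of each sentence; the function name `induce_abbreviations` is hypothetical. Any whitespace-delimited token containing a period other than that final boundary is collected.

```python
def induce_abbreviations(sentences):
    """Collect tokens containing a . that is not a sentence boundary.

    sentences: list of strings, one annotated sentence per string.
    Simplification: the last character of each sentence is treated as the
    annotated boundary, so closing quotes after the period are not handled.
    """
    abbreviations = set()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            is_last = i == len(tokens) - 1
            # Ignore the boundary mark itself when inspecting the final token.
            body = token[:-1] if is_last else token
            if "." in body:
                abbreviations.add(token)
    return abbreviations

training = [
    "ANLP Corp. chairman Dr. Smith resigned.",
    "He moved to Washington, D.C.",
]
induced = induce_abbreviations(training)
print(sorted(induced))
```

Note that this criterion also collects tokens such as decimal numbers, since it looks only at the position of the period and not at the kind of token.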
4 Maximum Entropy
The model used here for sentence-boundary detection
is based on the maximum entropy model
used for POS tagging in (Ratnaparkhi, 1996). For
each potential sentence boundary token (., ?, and
!), we estimate a joint probability distribution p
of the token and its surrounding context, both of
which are denoted by c, occurring as an actual
sentence boundary. The distribution is given by:
$$p(b, c) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(b, c)}$$
where $b \in \{no, yes\}$, where
the $\alpha_j$'s are the unknown parameters of the model,
and where each $\alpha_j$ corresponds to an $f_j$, or a feature.
Thus the probability of seeing an actual sentence
boundary in the context c is given by p(yes, c).
² A token in the training data is considered an abbreviation
if it is preceded and followed by whitespace, and
it contains a . that is not a sentence boundary.
                              WSJ      Brown
Sentences                     20478    51672
Candidate Punctuation Marks   32173    61282
Accuracy                      98.8%    97.9%
False Positives               201      750
False Negatives               171      506
Table 1: Our best performance on two corpora.
The contextual information deemed useful for
sentence-boundary detection, which we described
earlier, must be encoded using features. For example,
a useful feature might be:
$$f_j(b, c) = \begin{cases} 1 & \text{if } Prefix(c) = Mr \text{ and } b = no \\ 0 & \text{otherwise} \end{cases}$$
This feature will allow the model to discover that the
period at the end of the word Mr. seldom occurs as
a sentence boundary. Therefore the parameter corresponding
to this feature will hopefully boost the
probability p(no, c) if the Prefix is Mr. The parameters
are chosen to maximize the likelihood of the
training data using the Generalized Iterative Scaling
(Darroch and Ratcliff, 1972) algorithm.
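To make the role of such features and their parameters concrete, the sketch below trains a toy model with a simplified iterative-scaling update in the style of Generalized Iterative Scaling. Several assumptions differ from the system described here and are made only to keep the example short: it estimates the conditional form p(b|c) rather than the joint distribution, it omits the correction feature that full GIS uses to keep feature sums constant, and the training events and function names are hypothetical.

```python
import math
from collections import defaultdict

OUTCOMES = ("yes", "no")

def conditional(weights, predicates):
    """Return {outcome: p(outcome | context)} for a log-linear model."""
    scores = {
        b: math.exp(sum(weights.get((q, b), 0.0) for q in predicates))
        for b in OUTCOMES
    }
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}

def train(events, iterations=100):
    """Fit weights (log alpha_j) with a simplified GIS-style update.

    events: list of (predicates, outcome) pairs from annotated text.
    A feature is a (predicate, outcome) pair seen in the training data.
    """
    observed = defaultdict(float)  # empirical feature counts
    for predicates, outcome in events:
        for q in predicates:
            observed[(q, outcome)] += 1.0
    # Scaling constant: the largest number of predicates active in a context.
    gis_constant = max(len(predicates) for predicates, _ in events)
    weights = {feature: 0.0 for feature in observed}
    for _ in range(iterations):
        expected = defaultdict(float)  # feature counts under the model
        for predicates, _ in events:
            probs = conditional(weights, predicates)
            for b in OUTCOMES:
                for q in predicates:
                    if (q, b) in weights:
                        expected[(q, b)] += probs[b]
        for feature in weights:
            ratio = observed[feature] / expected[feature]
            weights[feature] += math.log(ratio) / gis_constant
    return weights

# Hypothetical toy events: Mr. never ends a sentence, Corp. sometimes does.
events = (
    [({"Prefix=Mr"}, "no")] * 3
    + [({"Prefix=resigned"}, "yes")] * 3
    + [({"Prefix=Corp"}, "no")] * 2
    + [({"Prefix=Corp"}, "yes")]
)
weights = train(events)
print(conditional(weights, {"Prefix=Mr"}))
print(conditional(weights, {"Prefix=Corp"}))
```

On this toy data the weight of the (Prefix=Mr, no) feature keeps growing, pushing p(no|c) towards 1 for that Prefix, while the estimate for Prefix=Corp settles at the observed proportion of two non-boundaries in three occurrences.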
The model also can be viewed under the Maximum
Entropy framework, in which we choose a distribution
p that maximizes the entropy H(p)
$$H(p) = -\sum_{b, c} p(b, c) \log p(b, c)$$
under the following constraints:
$$\sum_{b, c} p(b, c) f_j(b, c) = \sum_{b, c} \tilde{p}(b, c) f_j(b, c), \quad 1 \le j \le k$$
where $\tilde{p}(b, c)$ is the observed distribution of sentence-boundaries
and contexts in the training data. As a
result, the model in practice tends not to commit
towards a particular outcome (yes or no) unless it
has seen sufficient evidence for that outcome; it is
maximally uncertain beyond meeting the evidence.
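The notion of being maximally uncertain can be illustrated with a small calculation. The sketch below is a simplification: it computes the entropy over the two outcomes for a single fixed context rather than over the joint distribution p(b, c), and the probabilities are hypothetical.

```python
import math

def entropy(distribution):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in distribution if p > 0)

uniform = entropy([0.5, 0.5])  # no evidence for either outcome
skewed = entropy([0.9, 0.1])   # strong evidence for one outcome
print(uniform, skewed)
```

The uniform distribution has the larger entropy, so absent any constraint that forces a preference, the entropy-maximizing choice leaves yes and no equally likely.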
All experiments use a simple decision rule to classify
each potential sentence boundary: a potential
sentence boundary is an actual sentence boundary if
and only if p(yes|c) > 0.5, where
$$p(yes \mid c) = \frac{p(yes, c)}{p(yes, c) + p(no, c)}$$
and where c is the context including the potential
sentence boundary.
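The decision rule can be sketched directly from the product form of the model. In the sketch below the feature name and the parameter value are hypothetical, chosen only to show the arithmetic, and the normalization constant is left out because it cancels in the ratio.

```python
def is_sentence_boundary(active_yes, active_no, alpha):
    """Apply the decision rule p(yes|c) > 0.5 for one context c.

    active_yes / active_no: features f_j with f_j(yes, c) = 1 and
    f_j(no, c) = 1 respectively; alpha maps each feature to its parameter.
    """
    joint_yes = 1.0
    for feature in active_yes:
        joint_yes *= alpha[feature]
    joint_no = 1.0
    for feature in active_no:
        joint_no *= alpha[feature]
    p_yes = joint_yes / (joint_yes + joint_no)
    return p_yes > 0.5, p_yes

# Hypothetical parameter value, chosen only for illustration.
alpha = {"Prefix=Mr&b=no": 8.0}
decision, p_yes = is_sentence_boundary([], ["Prefix=Mr&b=no"], alpha)
print(decision, round(p_yes, 3))
```

With a parameter of 8 on the no side and no active feature on the yes side, p(yes|c) is 1/9, so the period after Mr is not labeled as a sentence boundary.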
5 System Performance 
We trained our system on 39441 sentences (898737 
words) of Wall Street Journal text from sections 
00 through 24 of the second release of the Penn 
Treebank³ (Marcus, Santorini, and Marcinkiewicz,
1993). We corrected punctuation mistakes and erroneous
sentence boundaries in the training data.
Performance figures for our best performing system,
which used a hand-crafted list of honorifics and corporate
designators, are shown in Table 1. The first
test set, WSJ, is Palmer and Hearst's initial test
data and the second is the entire Brown corpus. We
present the Brown corpus performance to show the
importance of training on the genre of text on which
testing will be performed. Table 1 also shows the
number of sentences in each corpus, the number of
candidate punctuation marks, the accuracy over potential
sentence boundaries, the number of false positives
and the number of false negatives. Performance
on the WSJ corpus was, as we expected, higher than
performance on the Brown corpus since we trained
the model on financial newspaper text.
³ We did not train on files which overlapped with
Palmer and Hearst's test data, namely sections 03, 04,
05 and 06.
Possibly more significant than the system's per- 
formance is its portability to new domains and lan- 
guages. A trimmed down system which used no 
information except that derived from the training 
corpus performs nearly as well, and requires no re- 
sources other than a training corpus. Its perfor- 
mance on the same two corpora is shown in Table 2. 
Test Corpus   Accuracy   False Positives   False Negatives
WSJ           98.0%      396               245
Brown         97.5%      1260              265
Table 2: Performance on the same two corpora, using
the highly portable system.
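The accuracy figures can be checked against the error counts with a short calculation. The sketch assumes that accuracy is the fraction of candidate punctuation marks classified correctly, and that the candidate counts reported for the two corpora in Table 1 (32173 and 61282) apply unchanged to the highly portable system; the function name is illustrative.

```python
def accuracy(candidates, false_positives, false_negatives):
    """Fraction of candidate punctuation marks classified correctly."""
    return 1.0 - (false_positives + false_negatives) / candidates

# Candidate counts from Table 1; error counts from Table 2.
wsj = accuracy(32173, 396, 245)
brown = accuracy(61282, 1260, 265)
print(f"WSJ {wsj:.1%}  Brown {brown:.1%}")
```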
Since 39441 training sentences is considerably
more than might exist in a new domain or a language
other than English, we experimented with the
quantity of training data required to maintain performance.
Table 3 shows performance on the WSJ
corpus as a function of training set size using the best
performing system and the more portable system.
As can be seen from the table, performance degrades
as the quantity of training data decreases, but even
with only 500 example sentences performance is better
than the baselines of 64.0% if a sentence boundary
is guessed at every potential site and 78.4% if
only token-final instances of sentence-ending punctuation
are assumed to be boundaries.
                  Number of sentences in training corpus
                  500     1000    2000    4000    8000    16000   39441
Best performing   97.6%   98.4%   98.0%   98.4%   98.3%   98.3%   98.8%
Highly portable   96.5%   97.3%   97.3%   97.6%   97.6%   97.8%   98.0%
Table 3: Performance on Wall Street Journal test data as a function of training set size for both systems.
6 Conclusions 
We have described an approach to identifying sen- 
tence boundaries which performs comparably to 
other state-of-the-art systems that require vastly 
more resources. For example, Riley's performance
on the Brown corpus is higher than ours, but his system
is trained on the Brown corpus and uses thirty
times as much data as our system. Also, Palmer
& Hearst's system requires POS tag information,
which limits its use to those genres or languages for
which there are either POS tag lexica or POS tag
annotated corpora that could be used to train automatic
taggers. In comparison, our system does not
require POS tags or any supporting resources beyond
the sentence-boundary annotated corpus. It
is therefore easy and inexpensive to retrain this system
for different genres of text in English and text in
other Roman-alphabet languages. Furthermore, we
showed that a small training corpus is sufficient for 
good performance, and we estimate that annotating 
enough data to achieve good performance would re- 
quire only several hours of work, in comparison to 
the many hours required to generate POS tag and 
lexical probabilities. 
7 Acknowledgments 
We would like to thank David Palmer for giving us 
the test data he and Marti Hearst used for their
sentence detection experiments. We would also like 
to thank the anonymous reviewers for their helpful 
insights. 

References 

Brill, Eric. 1994. Some advances in transformation-based part-of-speech tagging. In Proceedings of 
the Twelfth National Conference on Artificial Intelligence, volume 1, pages 722-727. 

Collins, Michael. 1996. A new statistical parser 
based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, June.

Cutting, Doug, Julian Kupiec, Jan Pedersen, and
Penelope Sibun. 1992. A practical part-of-speech 
tagger. In Proceedings of the Third Conference on 
Applied Natural Language Processing, pages 133-140, Trento, Italy, April. 

Darroch, J. N. and D. Ratcliff. 1972. Generalized 
Iterative Scaling for Log-Linear Models. The Annals
of Mathematical Statistics, 43(5):1470-1480.

Liberman, Mark Y. and Kenneth W. Church. 1992. 
Text analysis and word pronunciation in text-to-speech synthesis. In Sadaoki Furui and M. Mohan 
Sondhi, editors, Advances in Speech Signal Processing. Marcel Dekker, Incorporated, New York.

Marcus, Mitchell, Beatrice Santorini, and Mary Ann 
Marcinkiewicz. 1993. Building a large annotated 
corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Nunberg, Geoffrey. 1990. The Linguistics of Punctuation. Number 18 in CSLI Lecture Notes. University of Chicago Press.

Palmer, David D. and Marti A. Hearst. 1994. Adaptive sentence boundary disambiguation. In Proceedings of the 1994 conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany, October.

Palmer, David D. and Marti A. Hearst. To appear. 
Adaptive multilingual sentence boundary disam- 
biguation. Computational Linguistics. 

Ratnaparkhi, Adwait. 1996. A maximum entropy
model for part-of-speech tagging. In Conference
on Empirical Methods in Natural Language Processing, pages 133-142, University of Pennsylvania, May 17-18.

Riley, Michael D. 1989. Some applications of 
tree-based modelling to speech and language. In 
DARPA Speech and Language Technology Workshop, pages 339-352, Cape Cod, Massachusetts.

White, Michael. 1995. Presenting punctuation. In 
Proceedings of the Fifth Annual Workshop on 
Natural Language Generation, pages 107-125, Leiden, The Netherlands. 
