Experiments in Unsupervised 
Entropy-Based Corpus Segmentation 
Andrd Kempe 
Xerox Research Centre Europe - Grenoble Laboratory 
6 chemin de Maupertuis - 38240 Meylan - France 
andre, kempe@xrce, xerox, corn 
http://ggla, xrce. xerox, corn/research/mitt 
Abstract 
The paper presents an entropy-based approach to 
segment a corpus into words, when no additional 
information about the corpus or the language, and 
no other resources such as a lexicon or grammar 
are available. To segment the corpus, the algorithm 
searches for separators, without knowing a priory by 
which symbols they are constituted. Good results 
can be obtained with corpora containing "clearly 
perceptible" separators such as blank or new-line. 
1 Introduction 
The paper presents an approach to segment a cor- 
pus into words, based on entropy. We assume that 
the corpus is not annotated with additional informa- 
tion, and that we have no information whatsoever 
about the corpus or the language, and no linguistic 
resources such as a lexicon or grammar. Such a situ- 
ation may occur e.g. if there is a (sufficiently large) 
corpus of an unknown or unidentified language and 
alphabet. 1 Based on entropy, we search for separa- 
tors, without knowing a priory by which symbols or 
sequences of symbols they are constituted. 
Over the last decades, entropy has frequently been 
used to segment corpora \[Wolff, 1977, Alder, 1988, 
Hutchens and Alder, 1998, among many others\]. 
and it is commonly used with compression tech- 
niques. Harris \[1955\] proposed an approach for seg- 
menting words into morphemes that, although it did 
not use entropy, was based on an intuitively similar 
concept: Every symbol of a word is annotated with 
the count of all possible successor symbols given the 
.substring that ends with the current symbol, and 
with the count of all possible predecessor symbols 
I Such a corpus can be electronically encoded with arbi- 
trarily defined symbol codes. 
given the tail of the word that starts with the cur- 
rent symbol. Maxima in these counts are used to 
segment the word into morphemes. 
All steps of the present approach will be described 
on the example of a German corpus. In addition, we 
will give results obtained on modified versions of this 
corpus, and on an English corpus. 
2 The Approach 
2.1 The Corpus 
We assume that any corpus C can be described by 
the expression: 
c = S. T IS+ T\]* 3, (1) 
There must be at least one token T ("word") which 
is a string of one or more symbols s : 
T = s+ (2) 
Different tokens T must be separated form each 
other by one or more separators S which are strings 
of zero or more symbols s : 
S = s* (3) 
Separators can consist of blanks, new-line, or "'real" 
symbols. They can also be empty strings. 
2.2 Recoding the Corpus 
We will describe the approach on the example of a 
German corpus. 
First, all symbols s (actually all character codes) 
of the corpus are recoded by strings of "'visible" 
ASCII characters. For example: 2 
2In this example, \ denotes that the current line is not 
finished yet but rather continues on the next Ihle. 
! 
! 
! 
! 
Fiir Instandsetzung und Neubau der\ 
Kanalisation diirften in 
den n~ichsten zehn Jahren Betr~ige in \ 
MilliardenhShe ausgegeben werden. 
Allein in den alten Bundesl~indern miissen bis \ 
zur Jahrhundertwende die 
Kom.munen km des insgesamt km langen \ 
Kanal- und 
Leitungsnetzes sanieren. 
is recoded as:a 
Fiir BLInst andset zungBLundBLNe 
u bau BL der BLK an alisat ionBL diir f 
t en BLi n NLden BLn ~ichs ten BL ze h n 
BL J ah ren BLB e t r~igeBLi nBL M illi a 
rdenhSheBLausgegebenBLwerden 
• NL Allei n BLinBL d enBLalt enBL B u 
ndesl£nder n BLmiisse nBL bis BL z u 
r BL J ahr hunder twendeBLdieNLKo 
m munenBL kmBLdesBLinsgesamt 
BL km BL l an g e nBL K anal- BLund NL 
Leit ungsnet zesBLsanieren. BL 
If the language and the alphabet are unknown 
or unidentified, the symbols of the corpus can be 
encoded by arbitrarily defined ASCII strings. 
2.3 Information and Entropy 
We estimate probabilities of symbols of the corpus 
using a 3rd order Markov model based on maximum 
likelihood. The probability of a symbol s with re- 
spect to this model M and to a context c can be 
estimated by: 
f(s, M, c) p(slM, c) = (4) 
\](U,c) 
The inIormation of a symbol s with respect to the 
model M and to a context c is defined by: 
I(s\]U, c) = -log2 p(slU, c) (5) 
Intuitively, information can be considered as the sur- 
prise of the model about the symbol s after having 
seen the context c. The more the symbol is unex- 
pected from the model's experience, the higher is the 
value of information \[Shannon and Weaver, 1949\]• 
The entropy of a context c with respect to this 
model M expresses the expected value of informa- 
tion, and is defined by: 
g(Af, c) = Zp(s\]M,c ) I(slM, c) (6) 
sEE 
Monitoring entropy and information across a cor- 
pus shows that maxima often correspond with word 
3Note that blanks become "BL" and new-lines become 
"NL". 
boundaries \[Alder, 1988, Hutchens and Alder, 1998, 
among many others\]. 
More exactly, maxima in left-to-right entropy 
HLR and information ILR often mark the end of 
a separator string S, and maxima in right-to-left 
entropy Hn  and information IR~ often mark the 
beginning of a separator string, as can be seen in 
Figure 1. Here, an information value is assigned to 
every symbol. This value expresses the information 
of the symbol in a given left or right context. An 
entropy value is assigned between every two sym- 
bols. It expresses the model's uncertainty after hav- 
ing seen the left or right context, but not yet the 
symbol• 
When going from left to right, all end of a sep- 
arator, is often marked by a maximum in entropy 
because the next word to the right can start with 
almost any symbol, and the model has no "idea" 
what it will be. There is also a maximum in infor- 
mation because the first symbol of the word is (more 
or less) unexpected; the model has no particular ex- 
pectation. 
Similarly, when going from right to left, a begin- 
ning of a separator is often marked by a maximum 
in entropy because the word next to the left call end 
with almost any symbol. There is also a maximum 
in information because the last symbol of the word 
is (more or less) unexpected; the model has no par- 
ticular expectation. 
Usually, there is no maximum at a beginning of 
a separator, when going from left to right, and no 
maximum at a separator ending, when going from 
right to left, because words often have "typical" be- 
ginnings or endings, e.g. prefixes or suffixes• 
This means, when we come from inside a word to 
the beginning or end of this word then the model 
will anticipate a separator, and since the number 
of alternative separators is usually small, the model 
will not be "surprised" to see a particular one. On 
the other side, when we come from inside a separator 
to the beginning or end of this separator, although 
the model will expect a word, it will be "surprised" 
about any particular word because the uumber of 
alternative beginnings or endings of words is large. 
It also may be observed that the maxima in one 
direction are bigger then the maxima in the other 
direction due to the fact that a particular language 
may have e.g. stronger constraints on endings than 
on beginnings of words: A language may employ 
suffixes with most words in a corpus, which limits 
the number of endings, but rarely use prefixes, which 
allows a word to start with almost any symbol• 
/ 6 "l- Entropy 
I,R 
I 
2 
0 ii, |o |l,|,, o1.,o,,,o,,,,,,, o,,,,o,,,,.o,,,,,o,o,.lo.,,,o,.i.,.=.,,,o.o,*,.,o,o ,,,' 
6 ~ ~ Entropy RL 
| I, 2 
F*r{-}lnlstandset~'l.ng nnd Ne,ba. dt, r Kanallslntion-d*rften.in*den.n*chst,.n-~,ehn Jahr 
I l} ~} l} I 1} 1} 1} 1" l} l} I" 1} l} I 
6 ~lnfornlation LR ~ A _ " A 
2 
O : 1: 
! 
2 
Figure 1: Entropy and information across a section of a German corpus 
2.4 Thresholds 
Not all maxima correspond with word boundaries. 
Hutchens and Alder \[1998\] apply a threshold of 
0.5 log2 \[1~\]\[ to select among all maxima, those that 
represent boundaries. 
The present approach uses two thresholds that 
are based on the corpus data and contain no other 
factors: The first threshold van is the average of all 
values of the particular function, HLR, HI{L, ILR, or 
II{L, across the corpus. The second threshold Vm,~ is 
the average of all maxima of the particular function. 
All graphs of Figure 1 contain both thresholds (as 
dotted lines). 
To decide whether a value v of HLa, HRL, ILR, 
or II{n should be considered as a boundary, we use 
the four functions: 
bo(v) : v>T.,~ (7) 
b,(v) : v ~vatt (S) 
b2(v) : ismax(v) (9) 
b3(v) : ~ismin(v) (10) 
2.5 Detection of Separators 
To find a separator, we are looking for a strong 
boundary to serve as a beginning or end of the sep- 
arator. In the current example, we have chosen as a 
criterion for strong boundaries: 
(bo(h) A  2(h)) A (bi(i) V b3(i)) = (11) 
((h > rma=(H)) A ismax(h)) 
A ((i > Tan(I)) V -~isrnin(I)) 
Here H and I mean either HLR and ILR if we are 
looking for the end of a separator, or HRL and Inn if 
we are looking for the beginning of a separator. The 
variables h and i denote values of these functions at 
the considered point. 
Once a strong boundary is found, we search for a 
weak boundary to serve as an ending that matches 
the previously found beginning, or to serve as a be- 
ginning that matches the previously found ending. 
For weak boundaries, we use the criterion: 
(51 (h) A b2(h)) A (bz(i) V b3(i)) = (12) 
((h > Tall(H)) A ismax(h)) 
A ((i > raU(I)) V ~ismin(I)) 
If a matching pair of boundaries, i.e. a beginning 
and an end of a separator, are found, the separator 
9 
i 
is marked. In Figure 1 this is visualized by \] for 
empty and { } for non-empty separators. 
The search for a weak boundary that matches a 
strong one is stopped (without success) either after 
a certain distance 4 or at a breakpoint. For example, 
if we have the beginning of a separator and search 
for a matching end then the occurrence of another 
beginning will stop the search. As a criterion for a 
breakpoint we have chosen: 
(bl (h) A b~(h) ) V (bl (i) A b2(i) ) = 
((h > Tau(H)) A ismax(h)) 
V ((i > ran(I)) A ismax(i)) 
(13) 
If the search for a matching point has been 
stopped for either reason, we need to decide whether 
the initially found strong boundary should be 
marked despite the missing match. It will only be 
marked if it is an obligatory boundary. Here we ap- 
ply the criterion: 
(bo(h) A b2(h)) A (bo(i) A b2(i)) = 
((h > rma=(H)) A ismax(h)) 
A ((i >_ rrnax(I)) A ismax(i)) 
(14) 
In Figure 1 these unmatched obligatory boundaries 
are visualized by {u or }u. 
Each of the four criteria, for strong bound- 
aries, weak boundaries, break points, and obligatory 
boundaries, can be built of any of the four functions 
boo to b30 (eq.s 7' to 10). 
2.6 Validation of Separators 
All separator strings that have a matching beginning 
and end marker are collected and counted. 
Alias fsepar fcontezt 
95 103 1 484 
h 10 841 850 
45 475 621 
3 464 450 
697 360 
E 1 271 281 
1 328 241 
3 223 199 
h 6 793 160 
3 306 126 
11o 545 119 
Ill 162 I08 
Table 1: 
ftotal Separator 
115 619 BL 
20 011 NL 
1 024 152 
11 637 t BL 
3 096 
5 736 . BL 
17 053 e BL 
48 136 s 
138 769 e 
32 049 e r 
4 110 . NL 
1 372 t NL 
Separators from a German corpus 
(truncated list) 
4In the example, the maximal separator length is set to 6. 
This seems sufficient because we found no separators longer 
than 3 so far (Tables 1 to 5). 
Table 1 shows such separators collected from the 
German example corpus. Column 5 contains the 
strings that constitute the separators, column 2 
shows the count of these strings as separators, col- 
unm 3 says in how may different contexts s the sep- 
arators occurred, colunm 4 shows the total count 
of the strings in the corpus, and column 1 contains 
aliases furtheron used to denote the separators. In 
Table 1 all separators are sorted with respect to col- 
umn 3. From these separators we retain those that 
are above a defined threshold relative to the number 
of different contexts of the top-most separator. In 
all examples throughout this article, we are using a 
relative threshold of 0.5, which means in this case 
(Table 1) that the top-most two separators, "BL" 
and "NL" that occur ifi 1484 and 850 different con- 
texts respectively, are retained. 6 
In the corpus, all separators that have been re- 
tained (Table 1) and that have at least one detected 
boundary (Fig. 1), are validated and marked. For 
the above corpus section this leads to: 
Fiir Iolnstandsetzung Iound IoNeu 
bau Ioder IoKanalisation Jodiirften 
10inNLden 6n~ichsten 6zehn 10Jahr 
en 6Betr~ige 10in IoMilliardenhSh 
e 10ausgegeben 10werden. hAlleln 
loin 6den Ioalten IoBundesl~indern 
Iomfissen ~bis ~zur JoJahrhundert. 
wende JodieNLKommunen 6km'10de 
s 10insgesamtBLkm ~langen \[0Kan 
al-BLund h LeitungsnetzesBLsani 
eren. BL 
2.7 Recall of Separators 
For the above corpus we measured a recall of 86.0 % 
for both blank (BL) and new-line (NL) together (Ta- 
ble 2). 
Alias Separator Recall rio.n, t flo.~t 
t0 BL 88.6 % 102 412 115 619 
~I NL 71.2 % 14 254 20 011 
All 86.0 % 116 666 135 630 
Table 2: Recall of separators from a German 
corpus 
Due to the approach, the precision for BL and NL 
is 100 %. A string which is different from BL and 
NL cannot be marked as a separator in the above 
example. If empty string separators were admitted, 
the precision would decrease. 
5As context of a separator, we consider the preceding and 
the following symbol. 
6In the Tables 1 to 5 the retained separator strings are 
separated by a horizontal line form the others. 
10 
3 More Examples 
We applied the approach to modified versions of the 
above mentioned German corpus and to an English 
corpus. 
3.1 German with Empty String Sep- 
arators 
For this experiment, we remove all original separa- 
tors, "BL" and "NL", from the above German cor- 
pus: 
FiirInstandsetzungundNeubaude 
r Kanalisationdfirftenindenn$ch 
stenzehnJ ahrenBetr~igein Millia 
rdenhSheausgegebenwerden. All 
eininden alt enBundesl~indernmfi 
ssenbiszurJahrhundertwendedie 
Kommunenkmdesinsgesamtkmla 
ngenKanal- undLeitungsnetzess 
anieren. 
From this corpus, we collected the separators in 
Table 3 
Alias IseVar Icontezt Itotal Separator 
110 969 1 257 888 522 
II 6 355 580 33 335 e n 
5 872 466 54 278 t 
7 975 407 138 769 e 
5 661 374 32 158 e r 
916 345 11 178 
3 063 306 48 136 s 
b 5O5 297 3 096 
399 206 4 566 t e n 
621 189 15 836 t e 
Table 3: Separators from a German corpus 
without blanks and new-line (truncated list) 
and obtained the result: 
FiirIn lost 10andsetz lou 10ng lound 
loNeu lobau loder loKanalis loation 
diirf loten loinden lon~ich losten loze 
hnJahren loBetr~ig \]oe loin loMillia 
rd \[oen IohSh loeaus ~ge \[0geben low 
erden Io. All loe loin loind loen loalte 
n loBun lod loes lol~ind loer lonmiisse 
n lobiszur \[0Jahrh lound loer lotwen 
de \[0die loKommunen lokrnde los loin 
s logesarn \[0t lokm lolangen loKanal- 
und loLei lot loungs lonetzessan loie 
ren lo. 
3.2 German with Modified Separa- 
tors 
For the next experiment, we chmlged all original sep- 
arators, "BL" and "NL", in the above German cor- 
pus into a string from the list { ........ , "- -", ,,#,,, -# #,,, ,,__ _,,, ,,# #,, ,,_ _,, ~ :7 
Fiir---Instandset z ung# #und--N 
eubauder-Kanalisat ion--diir fte 
n #in # # den---n~ic hs ten # # zehn- 
- JahrenBetr~ige-in-- Milliar denh 
5he#ausgegeben##werden .... A 
llein # # in--denalten- B u n desliin 
dern--miissen #bis# #zur--- Jahr 
hundertwende##die--Kommune 
nkm-des--insgesamt#km##lang 
en---Kanal-# # und--Leit ungsne 
tzessanieren. - 
From this corpus, we collected the separators in 
Table 4 
Alias \]se~ar 
127 490 
h 4 966 3 875 
h 3 516 
h 6 876 
4 494 
3 699 
b 5 161 
378 
335 
Ito 1 588 In 
1 555 
fcontezt ftotal Separator 
844 1 108 920 
618 33 907 # # 
591 17 247 --- 
532 68 198 -- 
292 
268 
265 
227 
189 
179 
170 
140 
138 769 
32 059 
48 136 
54 278 
2 437 
3 732 
32 909 
41 035 
e 
er 
S 
t .## 
e n 
a 
Table 4: Separators from a German corpus with 
modified separators (truncated list) 
and obtained the result: 
Fiir 6In 6st 6and los loetz loung Io# 
# lound loNeu lobau loder-K loa loon lo 
al loisation lo--diirf loten#in h den 
h lon~ich losten \[1 zehn \[3 IoJahrenBe 
tr~ige-in h loMilliard loenhSh loe# 
loaus loge logeb loen lo## lowerden 
lo. hAIlein lo## 6in hd loenal loten 
-Bundesliindern h lomiiss Ioen#bi 
s I1 lozur hJahrhund loer lotwende 
lo##die hKommun Ioen lokm-des lo 
-- loins logesamt#km## lolangen 
\[2 \[0Kanal lo-##und hL loeit loungs 
lonetz loe lossan loieren.- 
7The replacement of every blank and new-line w~s done 
by rotation through the list. 
11 
3.3 English Corpus 
On an English corpus where all original separators 
have been preserved: s 
In the days when the spinning-wheels \ 
hummed busily in the 
farmhouses and even great ladies clothed \ 
in silk and 
thread lace had their toy spinning-wheels \ 
of polished oak there might be seen in \ 
districts far away among the lanes 
or deep in the bosom of the hills certain \ 
pallid undersized 
men who by the side of the brawny \ 
country-folk looked 
like the remnants of a disinherited race. 
we measured the information and entropy shown in 
Figure 2, collected the separators in Table 5, 
Alias fsepa~ fconte=t 
164 744 1 082 
\[l 7 700 602 
h 19 983 374 
h 2 689 323 
k 4 009 223 
734 216 
6 039 171 
b 357 167 
646 154 
b 489 99 
ho 1 932 96 
\[I l 180 96 
ftotal Separator 
181 518 BL 
20 000 NL 
1 096 301 
5 895 . BL 
32 576 e BL 
I 253 . NL. 
105 185 e 
647 ' NL. 
2 876 . NL 
17 789 t BL 
71 247 a 
1 742 - 
Table 5: Separators from an English corpus 
(truncated list) 
and obtained the result: 
In \[othe iodays iowhen \[othe \[ospinn 
ing-wheels \[ohummed bbusily ~in 
\[othe h farmhouses ~and \[oeven log 
teat 1oladies \[oclothed loin Iosilk \[o 
andNLthread \[olace \[ohad ~their \]a 
toy iospinning-wheelsBLof iapolis 
fled \[ooak \[othere \[omight lobe \[osee 
n loin \[odistricts \]ofar 10awayBLam 
ong \[othe \[olanesNLor \[odeep bin 10t 
he ~)bosom \[oof \[othe \[ohills \[0certa 
in \[0pallid \[0undersized \[lmen \[owh 
o \[0by \[othe \[osideBLof \[othe \[0braw 
ny \[0country-folk \[01ooked hlike \[0 
the \[0remnantsBLof ~a \[0disinheri 
ted \[orace. 
Sln this example, \ denotes that the current line is not 
finished yet but rather continues on the next line" 
4 Conclusion and Future In- 
vestigations 
The paper attempted to show that entropy and in- 
formation can be used to segment a corpus into 
words, when no additional knowledge about the cor- 
pus or the language, and no other resources such as 
a lexicon or grammar are available. 
To segment the corpus, the algorithm searches for 
separators, without knowing a priory by which sym- 
bols or sequences of symbols they are constituted. 
Good results were obtained with a German 
and an English corpus with "clearly perceptible" 
separators (blank and new-line). Precision and 
recall decrease if the original separators of these 
corpora are removed or changed into a set of 
different co-occurring separators. 
So far, only separators and their frequencies have 
been taken into account. Future investigations may 
include: 
• the use of frequencies of tokens and their differ- 
ent alternative contexts, to validate these to- 
kens and the adjacent separators, and 
• a search for criteria (based on the corpus it- 
self and on the obtained result) to evaluate the 
"quality" of segmentation, thus enabling a self- 
optimizing approach. 
Acknowledgements 
Many thanks to the anonymous reviewers of my ar- 
ticle and to my colleagues. 
12 
t Entropy T.R 
'l'=.l=l,=l===l=ll=llll==I Jl°,=,l==| 
e--- Entropy RI, 
= ~ |l w i= i* ii iim ill i i= f ~ J | i| = Ill , , , , * D =l~ ~ = = , J ,ll = II ~ = III J I ~ I ~ I t I I = , ~l 
In{ }! h \['{.}d a y sl-jwh ,. nl.it h,.|-lS p i, n, ng-wh ,-~" I~1 }h ,,rmw d .{,, { }bus ilul,-.(\] in-{ }th ,-.{ } far mh o,, s,.{u s{ }a, ,1{-},. v ,. n{. }~. r 
t Information I,R 
Figure 2: Entropy and information across a section of an English corpus 

References 

\[Alder, 1988\] Mike Alder. Stochastic grammatical 
inference. Master's thesis, University of Westeru 
Australia, 1988. 

\[Harris, 1955\] Zellig S. Harris. From phoneme to 
morpheme. Language, 31(2):190-222, 1955. 

\[Hutchens and Alder, 1998\] Jason L. Hutchens and 
Michael D. Alder. Finding structure via com- 
pression. In David M. W. Powers, editor, NeM- 
LaP3/CoNLL98: Joint Conference on New Meth- 
ods in Natural Language Processing and Compu- 
tational Natural Language Processing Learning, 
pages 79-82, Sydney, Australia, 1998. Association 
for Computational Linguistics. 

\[Shannon and Weaver, 1949\] Claude E. Shannon 
and Warren Weaver. The Mathematical Theory 
of Communication. University of Illinois Press, 
1949. 

\[Wolff, 1977\] J. G. Wolff. The discovery of segments 
in natural language. British Journal o.f Psychol- 
ogy, 68:97-106, 1971. 
