Unsupervised Learning of Word Boundary with 
Description Length Gain 
Chunyu Kit t$ 
Dept. of Chinese, Translation and Linguistics 
City University of Hong Kong t 
ctckit@cityu, edu. hk 
Yorick Wilks t 
Department of Computer Science 
University of Sheffield t 
yorick@dcs, shef. ac. uk 
Abstract 
This paper presents an unsupervised approach to 
lexical acquisition with the goodness measure de- 
scription length gain (DLG) formulated following 
classic information theory within the minimum de- 
scription length (MDL) paradigm. The learning 
algorithm seeks for an optimal segmentation of an 
utterance that maximises the description length 
gain from the individual segments. The resultant 
segments show a nice correspondence to lexical 
items (in particular, words) in a natural language 
like English. Learning experiments on large-scMe 
corpora (e.g., the Brown corpus) have shown the 
effectiveness of both the learning algorithm and the 
goodness measure that guides that learning. 
1. Introduction 
Detecting and handling unknown words properly 
has become a crucial issue in today's practical nat- 
ural language processing (NLP) technology. No 
matter how large the dictionary that is used in 
a NLP system, there can be many new words in 
running/real texts, e.g., in scientific articles, news- 
papers and Web pages, that the dictionary does 
not include. Many such words are proper names 
and special terminology that provide critical infor- 
mation. It is unreliable to rest on delimiters such 
as white spaces to detect new lexical units, be- 
cause many basic lexical items contain one or more 
spaces, e.g., as in "New York", "Hong Kong" and 
"hot dog". It appears that unsupervised learning 
techniques are necessary in order to alleviate the 
problem of unknown words in the NLP domain. 
There have been a number of studies on lex- 
ical acquisition from language data of different 
types. Wolff attempts to infer word bound- 
aries from artificially-generated natural language 
sentences, heavily relying on the co-occurrence 
frequency of adjacent characters \[Wolff1975, 
Wolff 1977\]. Nevill-Manning's text compression 
program Sequitur can also identify word bound- 
aries and gives a binary tree structure for an 
identified word \[Nevill-Mmming 1996\]. de Mar- 
cken explores unsupervised lexical acquisition 
from Enghsh spoken and written corpora and 
from a Chinese written corpus \[de Marken 1995: 
de Marken 1996\]. 
In this paper, we present all unsuper- 
vised approach to lexical acquisition within the 
minimum description length (MDL) paradigm 
\[Rissanen 1978, Rissanen 1982\] \[Rissanen 1989\], 
with a goodness measure, namely, the descrip- 
tion length gain (DLG), which is formulated in 
\[Kit 1998\] following classic information theory 
\[Shannon 1948, Cover and Thomas 1991\]. This 
measure is used, following the MDL princi- 
ple, to evaluate the goodness of identifying a 
(sub)sequence of characters in a corpus as a lex- 
ical item. In order to rigorously evaluate the ef- 
fectiveness of this unsupervised learning approach, 
we do not limit ourselves to the detection of un- 
known words with respect to ally given dictionary. 
Rather, we use it to perform unsupervised lexi- 
cal acquisition from large-scale English text cor- 
pora. Since it is a learning-via-compression ap- 
proach, the algorithm can be further extended to 
deal with text compression and, very likely, other 
data sequencing problems. 
The rest of the paper is organised as follows: Sec- 
tion 2 presents the formulation of the DLG mea- 
! 
! 
! 
! 
p! 
! 
! 
! 
! 
! 
| 
! 
! 
! 
! 
! 
! 
sure in terms of classic information theory; Sec- 
tion 3 formulates the learning algorithm within the 
MDL framework, which aims to achieve an opti- 
mal segmentation of the given corpus into lexical 
items with regard to the DLG measure; Section 
4 presents experiments and discusses experimental 
results with respect to previous studies; and finally, 
the conclusions of the paper are given in Section 5. 
2. Description Length Gain 
Kit defines the description length of a corpus X = 
xlx2""xn, a sequence of linguistic tokens (e.g., 
characters, words, POS tags), as the Shannon- 
Fano code length for the corpus \[Kit 1998\]. FoP 
s from X. As an extracted s is supposed to be ap- 
pended to the modified corpus by a string concate- 
nation, as shown in (2), the original corpus can be 
easily recovered by a transformation that reverses 
the extraction, i.e., replacing all r's in X\[r --r s\] 
with the string s. 
It is worth noting that we can achieve the pur- 
pose of calculating DL(X\[r --¢ s\] (9 s) without 
carrying out the string substitution operations 
throughout the original corpus. The calculation 
can be based on the token count change involved 
in the substitution operatious to derive the new 
corpus X\[r -+ s\] (9 s, as follows: 
lowing classic information theory \[Shannon 1948, DL(X\[r -+ s\] (9 s) = ~ a;'(x)log d(x) (4) 
i Cover and Thomas 1991\], it can be formulated in n' xEVu{r} 
terms of token counts in the corpus as below for 
empirical calculation: where d(x) is the new count ofx in the new corpus 
and n ' is the new corpus length. The new counts 
I DL(X) = n\[-I(X) and the new length are, straightforwardly, 
= 
e(s) ifx = r; 
zeV d(x) = c(x) - c(s)cs(x) + cs(x) otherwise. I = -~-~c(x)log c(x~) (I) 
• IXl n' = n - c(s)lsl + c(s) + Isl + 1 
(5) 
I where V is the set of distinct tokens (i.e., the vo- where c(x) and cs(x) are the counts of in the x 
cabulary) in X and c(x) is the count of x in X. original corpus X and in the string s, respectively. 
Accordingly, the description length gain (DLG) A key problem in this straightforward calcula- 
t from identifying a (sub)sequence s = sis2.." sk in tion is that we need to derive the count c(s) for 
the corpus X as a segment or chunk, which is ex- all possible string s's in the original corpus X, be- 
_ pected to have a nice correspondence to a linguis- cause during the lexical learning process it is nec- 
i tically significant unit (e.g., a lexical item such as to consider all fragments (i.e., all n-grams) essary 
a word, or a syntactic phrase), is formulated as in the corpus in order to select a set of good cam 
i DLG(seX) = DL(X)- DL(X\[r--+ s\] (9 s) (2) didates for lexical items. Kit and Wilks provide 
an efficient method for deriving n-ga'ams of any 
where r is an index, X\[r --+ s\] represents the resul- length and their counts from large-scale corpora 
tant corpus by the operation of replacing all occur- \[Kit and Wilks 1998\]. It has been adopted as the 
fences s r through out (in words, operational implementation un- of with X other basis for the of the 
we extract a rule r --+ s from X) and (9 represents supervised lexical acquisition algorithm that is to 
the concatenation of two strings (e.g., X\[r -+ s\] be reported in the next sections. 
and s) with a delimiter inserted in between. It is 
straightforward that the average DLG for extract- 3. Learning Algorithm 
ing an individual s from X is Given an utterance U = totl".tn as a string 
I of some linguistic tokens (e.g., characters, words, DLG(s) 
aDLG(s) c(s) (3) POS tags), the unsupervised lexical acquisition al- 
gorithm seeks for an optimal segmentation OS(U) 
This average DLG is an estimation of the compres- over the string U such that the sum of the compres- 
sion effect of extracting an individual instance of sion effect over the segments is maximal. Formally | 
BSs\[j\] 
to tt t2 ...... ti "- tj tj+l "" tk "" tn 
(A) An illustration for the Viterbi segmentation 
0pSeg(U = tit2-" in) 
For k-- 0,1,2,---,n do 
Initialise OS\[k\] = ¢; 
For j = k- I,-.-,0 do 
If c(\[tj+,---tk\]) <2, break; 
If DLG(0S\[j\] (0 {\[tj+l "'" tk\]}) > DLG(0S\[k\]) 
then OS\[k\] = OS\[j\] I@ {\[tj+l"" tk\]} 
The final result: OS\[n\]. 
(B) The Viterbi segmentation algorithm 
Figure 1: The Viterbi algorithm for optimal seg- 
mentation, with an illustration 
put, it looks for 
os(u) = 
k 
arg max ~_, aDLG(s,) (6) 
sl...sk s.t. U=sl+..-+sl,, z=l 
where 0 < k _< n, + represents a string concate- 
nation and aDLG(s,) is the average DLG for each 
instance of the string s, in the original corpus, as 
defined in (3) above. 
Based on this description length gain calcula- 
tion, a Viterbi algorithm is formulated to search 
for the optimal segmentation over an utterance U 
that fulfils (6). It is presented in Figure 1 with an 
illustration. The algorithm uses a list of intermedi- 
ate variables OS\[0\], OS\[1\],..-, OS\[n\], each OSs\[i\] 
stores the optimal segmentation over tot1 ... ti (for 
i = 0, 1.2,-..,n). A segmentation is an ordered 
set (or list) of adjacent segments. The sign ~ rep- 
resents an ordered set union operation. The DLG 
over a list of seg~mnts, e.g., DLG(OS\[j\]), is de- 
3 
fined as the sum of all segments' DLGs in the set: 
DLG(OS\[j\])= ~ DLG(s) (7) 
seOS\[i\] 
Notice that the algorithm has a bias against the 
extraction of a single token as a rule, due to the 
fact that a single token rule bears a negative DLG. 
When j = k - 1, OS\[j\] ~ \[tj+,..-tk\] becomes 
OS\[k - 1\] ~ {\[t~\]}, which is less preferable than 
OS\[k - 1\] t~ {tk}. The difference between the de- 
notations \[tk\] and tk is that the former indicates 
that the string tk is extracted from the corpus as 
the right-hand side of a rule (a deterministic CFG 
rule), which results in a negative DLG; whereas 
the latter treats tk as an individual token instead 
of a segment, which has a zero DLG. 
It is worth noting that the breaking condition 
c(\[ tj ... tk\]) < 2 in the inner loop in the algo- 
rithm is an empirical condition. Its main purpose 
is to speed up the algorithm by avoiding fruitless 
iterations on strings of count 1. According to our 
observation in experiments, learning without this 
breaking condition leads to exactly the s.ame.re- 
sult on large-scale corpora but the speed is many 
times slower. Strings with a count c = 1 can be 
skipped in the learning, because they are all long 
strings with a negative DLG*and none of them can 
become a good segment that contributes a positive 
compression effect to the entire segmentation of the 
ISince extracting a string \[t,...t~.\] of count 1 as a 
rule does not change any token's count in the new corpus 
C\[r -4 t, • .. tk\] (9 t, ... tk), except the new non-terminal r 
and the delimiter ~, whose counts become 1 (i.e., c(r) = 
c(\[t,.., tk\]) = 1 and c(~) = 1) after the extraction.- Thus, 
DLG(\[t,...t#\]) = DL(C) - DL(C\[r -4 t,...t#\] % t,...tk) 
= -- Z c(t) log 2 c(t) Z 
tEV tE %"U{r.~9 } 
c(t) ~ c(t', = - ~ c(t)(log= I-~ - log~ ) + ~ c(0 log=,. ICl -~ 
tEV tE{r,~} 
ICl + 2 . ^, 1 
= - ~ c(0 log= -TSF + z,og2 ICl + 2 
fEV 
= -ICl log= ICl + 2 2 logdlCl + 2) ICl 
= -(ICl + 2)log2(ICl + 2)) + ICl log= ICl 
<0 
! 
I 
I 
utterance. Rather, they can be broken into shorter 
segments with a positive DLG. 
Time complexity analysis also shows that this 
breaking condition can speed up the algorithm sig- 
nificantly. Without this condition, the time com- 
plexity of the algorithm is O(n2). With it, the 
complexity is bounded by O(mn), where m is the 
maximal common prefix length of sub-strings (i.e., 
n-grams) in the corpus. Accordingly, the average 
time complexity of the algorithm is O(an): where a 
is the average common prefix length in the corpus, 
which is much smaller than m. 
4. Experiments 
We have conducted a series of lexical acquisition 
experiments with the above algorithm on large- 
scale English corpora, e.g., the Brown corpus 
\[Francis and Kucera 1982\] and the PTB WSJ cor- 
pus \[Marcus et al. 1993\]. Below is the segmenta- 
tion result on the first few sentences in the Brown 
corpus: 
\[the\] \[_fulton_county\] \[_grand_jury\] \[_said_\] \[friday_\] 
\[an\] \[_investigation_of\] \[_atlanta\] \[_'s_\] \[recent\] 
\[_primary_\] \[election\] \[_produced\] \[_' '_no\] \[_evidence\] 
\[_' '_\] \[that_any\] \[_irregularities\] \[_took_place_\] \[@\] 
\[_the_jury\] \[_further\] \[_said_\] \[in_term\] I-el \[nd_\] 
\[present\] \[ments\] \[_that_\] \[the_city_\] \[executive\] 
\[_committee\] \[_ ,_which_had\] \[_over-all_\] \[charge_of\] 
\[_the_election\] \[_, _' ' _\] \[deserves\] \[_the_\] \[praise\] 
\[_and_\] \[thanks\] \[_of_the_c\] \[ity_of_\] \[atlanta\] \[_''_\] 
\[for\] \[_the_manner_in_which\] \[_the_election\] \[_was\] 
\[_conducted_\] \[@\] \[_the\] \[_september\] \[-\] \[october_\] 
\[term\] \[_jury\] \[_had_been_\] \[charge\] \[d_by_\] \[fulton_\] 
\[superior_court\] \[_judge\] \[_dur\] \[wood_\] \[py\] \[e_to\] 
\[_investigat\] \[e\] \[_reports_of_\] \[possible\] \[_" _\] 
\[irregularities\] \[_ ' ' _\] \[in_the\] \[_hard-\] \[fought_\] 
\[primary\] \[_which_was_\] \[w\] \[on_by_\] \[mayor\] \[-\] \[nominat\] 
\[e_\] \[iv\] Jan_allen_\] \[jr\] \[_..\] \[_' '_\] \[only_a\] \[_relative\] 
\[_handful_of\] \[_such_\] \[reports\] \[_,as\] \[_received\] 
\[_' '_, _\] \[the_jury\] \[_said_,_" _\] \[considering_the\] 
\[_widespread\] \[_interest_in_\] \[the_election\] \[_,_\] 
\[the_number_of_\] \[vo\] \[ters_and_\] \[the_size_of\] 
\[_this_c\] \[ity_' '_\] \[@\] \[_the_jury_said\] \[_it_did\] 
\[_find\] \[_that_many_of\] \[_georgia_' s\] \[_registration\] 
\[_and_\] \[election\] \[_laws\] \[_"_\] \[are_\] \[out\] \[mode\] \[d_\] 
\[or\] \[_inadequate\] \[_and_often_\] \[ambiguous\] \[. "_\] \[@\] 
\[_it\] \[_recommended\] \[_that_\] \[fulton\] \[_legislators_\] 
\[act\] \[_' '_\] \[to_have_the\] Is\] \[e_lavs\] \[_studied_\] 
\[and_\] \[revi\] \[sed_to\] \[_the_end_of_\] \[moderniz\] ling\] 
\[_and_improv\] ling_them\] \[_"_\] \[@\] \[_the\] \[_grand_jury\] 
\[_commented_\] \[on\] \[_a_number_of_\] \[other_\] \[top\] \[ics_,_\] 
\[among_them\] \[_the_atlant\] \[a\] \[_and_\] \[fulton_county\] 
\[_purchasing\] \[_department\] Is_which_\] \[it\] \[_said\] \[_' '_\] 
\[are_well\] \[_operated_\] land_follow\] \[_generally.\] 
\[accepted_\] \[practices\] \[_which_in\] lure_to\] \[_the_best\] 
\[_interest\] \[_of_both\] \[_government\] Is_' '_\] \[@\] 
4 
where uppercase letters are converted to lowercase 
ones, the spaces are visualised by all underscore 
and the full-stops are all replaced by (@'s. 
Although a space is not distinguished from any 
other characters for the learner, we have to rely 
on the spaces to judge the correctness of a word 
boundary prediction: a predicted word boundary 
immediately before or after a space is judged as 
correct. But we also have observed that this cri- 
terion overlooks many meaningful predictions like 
"-.-charge\] \[d_by-..", "---are_outmode\] \[d_.-." 
and ".--government\] \[s...:'. If this is taken into 
account, the learning pcrformance is evidently bet- 
ter than the precision and recall figures reported 
in Table 1 below. 
Interestingly, it is observed that n-gram counts 
derived from a larger volume of data can signifi- 
cantly improve the precision but decrease the recall 
of the word boundary prediction. The correlation 
betwee, the volume of data used tbr deriving n- 
gram counts and the change of precision and recall 
is shown in Table 1. The effectiveness of the un- 
supervised learning is evidenced by the fact that 
its precision and recall are, respectively, ~tll tl~ree 
times as high as the precision and recall by random 
guessing. The best learning performance, in terms 
of both precision and recall, in the experiments is 
o the one with 79.33% precision and 63.01~ recall, 
obtained from the experiment on the e,ltire Brown 
corpus. 
Table 1: The correlation between corpus size (mil- 
lion char.) and precision/recall 
It is straightforwardly understandable that the 
increase of data volume leads to a significant in- 
crease of precision in the learning, because pre- 
diction based on more data is more reliable. The 
reason for the drop of recall is that when the vol- 
ume of data increases, more multi-word strings 
have a higher compression effect (than individual 
words) and, consequently: they are learned by the 
learner as lexical items, e.g., \[fulton_county\], 
\[grand_jury\] and \[_took_place\]. If the credit 
in such nmlti-word lexical items is counted, the re- 
call nmst be much better than the one in Table 
1. Of course, this also reflects a limitation of the 
learning algorithm: it only conducts an optimal 
segmentation instead of a hierarchical chunking on 
an utterance. 
The precision and recall reported above is not 
a big surprise. To our knowledge, however, it 
is the first time that the performance of unsu- 
pervised learning of word boundaries is exam- 
ined with the criteria of both precision and re- 
call. Unfortunately, this performance can't be 
compared with any previous studies, for sev- 
eral reasons. One is that the learning re- 
sults of previous studies are not presented in 
a comparable manner, for example, \[Wolff 1975, 
Wolff 1977\] and \[Nevill-Manning 1996\], as noted 
by \[de Marken 1996\] as well. Another is that the 
learning outcomes are different. For example, the 
output of lexical learning from an utterance (as a 
character sequence) in \[Nevill-Manning 1996\] and 
\[de Marken 1995, de Marken 1996\] is a hierarchi- 
cal chunking of the utterance. The chance to hit 
the correct words in such chunking is obviously 
many times higher than that in a flat segmenta- 
tion. The hierarchical chunking leads to a recall 
above 90% in de Marken's work. Interestingly, 
however, de Marken does not report the preci- 
sion, which seems too low, therefore meaningless, 
to report, because the learner produces so many 
chunks. 
5. Conclusions and Future Work 
We have presented an unsupervised learning algo- 
rithm for lexical acquisition based on the goodness 
measure description length gain formulated follow- 
ing information theory. The learning algorithnl fol- 
lows the essence of the MDL principle to search for 
the optimal segmentation of an utterance that has 
the maximal description length gain (and there- 
fore approaches the minimum description length 
of the utterance). Experiments on word boundary 
prediction with large-scale corpora have shown the 
effectiveness of the learning algorithm. 
For the time being, however, we are unable to 
compare the learning performance with other re- 
searchers' previous work, simply because they do 
not present the performance of their learning al- 
gorithms in terms of the criteria of both precision 
and recall. Also, our algorithm is significantly sim- 
pler, in that it rests on n-gram counts only, instead 
of any more complicated statistical data or a more 
sophisticated training algorithm. 
Our future work will focus on the investi- 
gation into two aspects of the lexical learning 
with the DLG measure. First, we will incor- 
porate tile expectation-maximization (EM) algo- 
rithm \[Dempster et al. 1977\] into our lexical learn- 
ing to see how nmch performance can be improved. 
Usually, a more sophisticated learning algorithm 
leads to a better learning result. Second, we will 
explore the hierarchical chunking with the DLG 
measure. We are particularly interested to know 
how nmch nmre compression effect can be further 
squeezed out by hierarchical chunking from a text 
corpus (e.g., the Brown corpus) and how much im- 
provement in the recall can be achieved. 
Acknowledgements 
The first author gratefully acknowledges a Univer- 
sity of Sheffield Research Scholarship awarded to 
him that enables him to undertake this work. We 
wish to thank two anonymous reviewers for their 
invaluable comments, and thank Hamish Cun- 
ningham, Ted Dunning, Rob Gaizauskas, Randy 
LaPolla, Steve Renals, Jonathon Webster and 
many other colleagues for various kinds of help and 
useful discussions. 

References 

\[Cover and Thomas 1991\] Cover~ T. M., and J. A. 
Thomas, 1991. Elements of Information Theory. 
John Wiley and Sons, Inc.. New York. 

\[de Marken 1995\] de Marken, C. 1995. Tile Unsu- 
pervised Acquisition of a Lexicon from Contin- 
uous Speech. Technical Report A.I. Memo No. 
1558, AI Lab., MIT. Cambridge, Massachusetts. 

\[de Marken 1996\] de Marken, C. 1996. Unsuper- 
vised Language Acquisition. Ph.D. dissertation, 
MIT, Cambridge, Massachusetts. 

\[Dempster et al. 1977\] Dempster, A. P., N. M. 
Laird, and D. B. Rubin. 1977. Maximum like- 
lihood from incomplete data via the EM algo- 
rithm. Journal of the Royal Statistical Society, 
39(B):1-38. 

\[Francis and Kucera 1982\] Francis, W. N., and H. 
Kucera. 1982. Frequency Analysis o\] English Us- 
age: Lexical and Grammar. Houghton-Mifflin, 
Boston. 

\[Kit 1998\] Kit, C. 1998. A goodness measure 
for phrase learning via compression with the 
MDL principle. In The ESSLLI-98 Student 
Session, Chapter 13, pp.175-187. Aug. 17-28, 
Saarbriiken. 

\[Kit and Wilks 1998\] Kit, C., and Y. Wilks. 1998. 
The Virtual Corpus approach to deriving n-gram 
statistics front large scale corpora. In C. N. 
Huang (ed.), Proceedings of 1998 International 
Conference on Chinese Information Processing, 
pp.223-229. Nov. 18-20, Beijing. 

\[Li and Vit£nyi 1993\] Li, M., and P. M. B. 
Vit£nyi. 1993. Introduction to Kolmogorov Com- 
plexity and its Applications. Springer-Verlag, 
New York. Second edition, 1997. 

\[Marcus et al. 1993\] Marcus, M., B. Santorini and 
M. Marcinkiewicz. 1993. Building a large an- 
notated corpus of English: The Penn Treebank. 
Computational Linguistics, 19(2):313-330. 

\[Nevill-Manning 1996\] Nevill-Manning, C. G. In- 
ferring Sequential Structure. Ph.D. dissertation, 
University of Waikato, New Zealand. 

\[Powers 1997\] Powers, D. M. W. Unsupervised 
learning of linguistic structure: an empirical 
evaluation. International Journal of Corpus Lin- 
guistics: 2(1):91-131. 

\[Rissanen 1978\] Rissanen, J. 1978. Modelling by 
shortest data description. Automatica, 14:465-471. 

\[Rissanen 1982\] Rissanen, J. 1982. A universal 
prior for integers and estimation by mil~imum 
description length. Ann. Statist.. 11:416-431. 

\[Rissanen 1989\] Rissanen, J. 1989. 
Com-plexity in Statistical Inquiry. 
Stochastic World Scientific, N.J. 

\[Shannon 1948\] Shannon, C. 1948. A mathemati- 
cal theory of communication. Bell System Tech- 
nical Journal, 27:379-423, 623-656. 

\[Solomonoff 1964\] Solomonoff, R. J. 1964. A for- 
real theory of inductive inference, part 1 and 2. 
Information Control, 7:1-22, 224-256. 

\[Stolcke 1994\] Stolcke, A. 1994. Bayesian Learn- 
ing of Probabilistic Language Models. Ph.D. dis- 
sertation, UC Berkeley, CA. 

\[Vit~nyi and Li 1996\] VitgLnyi, P. M. B., and M. 
Li. 1996. Minimum Description Length Induc- 
tion, Bayesianism, and Kohnogorov Complexity. 
Manuscript, CWI, Amsterdam. 

\[Wolff 1975\] Wolff, J. G. An algorithm fbr the 
segmentation of an artificial language analogue. 
British Journal of Psychology, 66:79-90. 

\[Wolff 1977\] Wolff, J. G. The discoverj, of seg- 
ments in natural language. British Journal of 
Psychology, 68:97-106. 

\[Wolff 1982\] Wolff, J. G. Language acquisition, 
data compression and generalization. Language 
and Communication, 2:57-89. 
