SEGMENTING SPEECH WITHOUT A LEXICON: 
THE ROLES OF PHONOTACTICS AND SPEECH SOURCE 
Timothy Andrew Cartwright and Michael R. Brent 
Department of Cognitive Science 
The Johns Hopkins University
3400 North Charles Street
Baltimore, MD 21218, USA
Internet: cat@mail.cog.jhu.edu
Abstract 
Infants face the difficult problem of segmenting continuous
speech into words without the benefit of a fully developed
lexicon. Several sources of information in speech might help
infants solve this problem, including prosody, semantic
correlations and phonotactics. Research to date has focused on
determining to which of these sources infants might be
sensitive, but little work has been done to determine the
potential usefulness of each source. The computer simulations
reported here are a first attempt to measure the usefulness of
distributional and phonotactic information in segmenting
phoneme sequences. The algorithms hypothesize different
segmentations of the input into words and select the best
hypothesis according to the Minimum Description Length
principle. Our results indicate that while there is some useful
information in both phoneme distributions and phonotactic
rules, the combination of both sources is most useful.
INTRODUCTION 
Infants must learn to recognize certain sound sequences as
being words; this is a difficult problem because normal speech
contains no obvious acoustic divisions between words. Two
sources of information that might aid speech segmentation are
distribution (the phoneme sequence in cat appears frequently in
several contexts including thecat, cats and catnap, whereas the
sequence in catn is rare and appears in restricted contexts)
and phonotactics (cat is an acceptable syllable in English,
whereas pcat is not). While evidence exists that infants are
sensitive to these information sources, we know of no
measurements of their usefulness. In this paper, we attempt to
quantify the usefulness of distribution and phonotactics in
segmenting speech. We found that each source provided some
useful information for speech segmentation, but the combination
of sources provided substantial information. We also found that
child-directed speech was much easier to segment than
adult-directed speech when using both sources.
To date, psychologists have focused on two aspects of the
speech segmentation problem. The first is the problem of
parsing continuous speech into words given a developed lexicon
to which incoming sounds can be matched; both psychologists
(e.g., Cutler & Carter, 1987; Cutler & Butterfield, 1992) and
designers of speech-recognition systems (e.g., Church, 1987)
have examined this problem. However, the problem we examined is
different--we want to know how infants segment speech before
knowing which phonemic sequences form words. The second aspect
psychologists have focused on is the problem of determining the
information sources to which infants are sensitive. Primarily,
two sources have been examined: prosody and word stress.
Results suggest that parents exaggerate prosody in
child-directed speech to highlight important words (Fernald &
Mazzie, 1991; Aslin, Woodward, LaMendola & Bever, in press) and
that infants are sensitive to prosody (e.g., Hirsh-Pasek et
al., 1987). Word stress in English fairly accurately predicts
the location of word beginnings (Cutler & Norris, 1988; Cutler
& Butterfield, 1992); Jusczyk, Cutler and Redanz (1993)
demonstrated that 9-month-olds (but not 6-month-olds) are
sensitive to the common strong/weak word stress pattern in
English. Sensitivity to native-language phonotactics in
9-month-olds was recently reported by Jusczyk, Friederici,
Wessels, Svenkerud and Jusczyk (1993). These studies
demonstrated infants' perceptive abilities without
demonstrating the usefulness of infants' perceptions.
How do children combine the information they perceive from
different sources? Aslin et al. speculate that infants first
learn words heard in isolation, then use distribution and
prosody to refine and expand their vocabulary; however, Jusczyk
(1993) suggests that sound sequences learned in isolation
differ too greatly from those in context to be useful. He goes
on to say, "just how far information in the sound structure of
the input can
ples, we see that Hypothesis 1 uses 48 characters and
Hypothesis 2 uses 75. However, this simplistic method is
inefficient; for instance, the lengths of lexical indices are
arbitrary with respect to properties of the words themselves
(e.g., in Hypothesis 2, there is no reason why /jul/ was
assigned the index '10'--length two--instead of '9'--length
one). Our system improves upon this simple size metric by
computing sizes based on a compact representation motivated by
information theory.
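The two character counts can be checked with a short sketch. It is not the code used for the simulations: the name description_size is hypothetical, the IPA symbols are replaced by ASCII stand-ins (D for ð, @ for ə, I for ɪ), and one separator character (comma or semicolon) is assumed after every index in the encoded sample.

```python
# Character-count size metric for the two introductory hypotheses.
# Assumptions: ASCII stand-ins for IPA (D = eth, @ = schwa, I = lax i);
# lexicon order fixes each word's index; every index in the encoded
# sample is followed by exactly one separator character.

LEXICON_1 = ["du", "D@", "kIti", "laIk", "si", "ju"]
SAMPLE_1 = ["du ju si D@ kIti", "si D@ kIti", "du ju laIk D@ kIti"]

LEXICON_2 = ["aIk", "du", "duj", "D@k", "@k", "@kIt",
             "i", "iD", "Iti", "jul", "siD", "us"]
SAMPLE_2 = ["duj us iD @kIt i", "siD @k Iti", "du jul aIk D@k Iti"]


def description_size(lexicon, segmented_utterances):
    """Count characters (excluding spaces) in a lexicon plus encoded sample."""
    index = {word: str(i) for i, word in enumerate(lexicon, start=1)}
    # Lexicon: digits of each index plus the characters of each word.
    lexicon_chars = sum(len(index[w]) + len(w) for w in lexicon)
    sample_chars = 0
    for utterance in segmented_utterances:
        codes = [index[w] for w in utterance.split()]
        # Digits of each index, plus one separator after each index.
        sample_chars += sum(len(c) for c in codes) + len(codes)
    return lexicon_chars + sample_chars


if __name__ == "__main__":
    print(description_size(LEXICON_1, SAMPLE_1))
    print(description_size(LEXICON_2, SAMPLE_2))
```

Under these counting assumptions the two hypotheses come out at 48 and 75 characters, matching the figures in the text.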
We imagine hypotheses represented as a string of ones and
zeros. This binary string must represent not only the lexical
entries, their indices (called code words) and the coded
sample, but also overhead information specifying the number of
items coded and their arrangement in the string (information
implicitly given by spacing and spatial placement in the
introductory examples). Furthermore, the string and its
components must be self-delimiting, so that a decoder could
identify the endpoints of components by itself. The next
section describes the binary representation and the length
formulae derived from it in detail; readers satisfied with the
intuitive descriptions presented so far should skip ahead to
the Phonotactics subsection.
Representation and Length Formulae 
The representation scheme described below is based on
information theory (for more examples of coding systems, see,
e.g., Li & Vitányi, 1993 and Quinlan & Rivest, 1989). From this
representation, we can derive a formula describing its length
in bits. However, the discrete form of the formula would not
work well in practice for our simulations. Instead, we use a
continuous approximation of the discrete formula; this
approximation typically involves dropping the ceiling function
from length computations. For example, we sometimes use a
self-delimiting representation for integers (as described in Li
& Vitányi, pp. 74-75). In this representation, the number of
bits needed to code an integer x is given by
$$\ell^{(2)}(x) = 1 + \lceil \log_2(x+1) \rceil + 2 \lceil \log_2 \lceil \log_2(x+1) \rceil \rceil$$
However, we use the following approximation:
$$\ell^{(2)}(x) = 1.5 + \log_2(x+1) + 2 \log_2(\log_2(x+2) + 0.5)$$
Using the discrete formula, the difference between
$\ell^{(2)}(126)$ and $\ell^{(2)}(127)$ is zero, while the
difference between $\ell^{(2)}(127)$ and $\ell^{(2)}(128)$ is
one bit; using the continuous formula, the difference between
$\ell^{(2)}(126)$ and $\ell^{(2)}(127)$ is 0.0156, while the
difference between $\ell^{(2)}(127)$ and $\ell^{(2)}(128)$ is
0.0155. We found it easier to interpret the results using a
continuous function, so in the following discussion, we will
only present the approximate formulae.
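The comparison between the discrete and continuous integer-length formulae can be reproduced with a few lines of Python. This is only an illustrative sketch; the function names ell2_discrete and ell2_continuous are our own labels for the two formulae given above.

```python
import math


def ceil_log2(k):
    """Exact ceiling of log2(k) for an integer k >= 1 (avoids float rounding)."""
    return (k - 1).bit_length()


def ell2_discrete(x):
    """Discrete length: 1 + ceil(log2(x+1)) + 2*ceil(log2(ceil(log2(x+1))))."""
    inner = ceil_log2(x + 1)
    return 1 + inner + 2 * ceil_log2(inner)


def ell2_continuous(x):
    """Continuous approximation: 1.5 + log2(x+1) + 2*log2(log2(x+2) + 0.5)."""
    return 1.5 + math.log2(x + 1) + 2 * math.log2(math.log2(x + 2) + 0.5)


if __name__ == "__main__":
    for a, b in [(126, 127), (127, 128)]:
        print(a, b,
              ell2_discrete(b) - ell2_discrete(a),
              round(ell2_continuous(b) - ell2_continuous(a), 4))
```

The discrete version jumps by a whole bit between 127 and 128 and not at all between 126 and 127, whereas the continuous version changes by nearly the same small amount at each step.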
The lexicon lists words (represented as phoneme sequences)
paired with their code words^1. For example:
Word    Code Word
ðə      [the]
kæt     [cat]
kɪti    [kitty]
si      [see]
In the binary representation, the two columns are represented
separately, one after the other; the first column is called the
word inventory column; the second column is called the code
word inventory column.
In the word inventory column (see Figure 1a for a schematic),
the list of lexical items is represented as a continuous string
of phonemes, without separators between words (e.g.,
ðəkætkɪtisi...). To mark the boundaries between lexical items,
the phoneme string is preceded by a list of integers
representing the lengths (in phonemes) of each word. Each
length is represented as a fixed-length, zero-padded binary
number. Preceding this list is a single integer denoting the
length of each length field; this integer is represented in
unary, so that its length need not be known in advance.
Preceding the entire column is the number of lexical entries n
coded as a self-delimiting integer.
The length of the representation of the integer n is given by
the function
$$\ell^{(2)}(n) \qquad (1)$$
We define $\mathrm{len}(w_i)$ to be the number of phonemes in
word $w_i$. If there are p total unique phonemes used in the
sample, then we represent each phoneme as a fixed-length bit
string of length $\mathrm{len}(p) = \log_2 p$. So, the length
of the representation of a word $w_i$ in the lexicon is the
number of phonemes in the word times the length of a phoneme:
$\mathrm{len}(p) \cdot \mathrm{len}(w_i)$. The total length of
all the words in the lexicon is the sum of this formula over
all lexical items:
$$\sum_{i=1}^{n} \mathrm{len}(p) \cdot \mathrm{len}(w_i) = \mathrm{len}(p) \sum_{i=1}^{n} \mathrm{len}(w_i) \qquad (2)$$
As stated above, the length fields used to divide the phoneme
string are fixed-length. In each field is an integer between
one and the number of phonemes in the longest word. Since
representing integers between one and x takes $\log_2 x$ bits,
the length of each field is:
$$\log_2\left(\max_{i=1 \ldots n} \mathrm{len}(w_i)\right)$$
^1 Code words are represented by square brackets, so [x] means
'the code word corresponding to x'.
bootstrap the acquisition of other levels [of linguistic
organization] remains to be determined." In this paper, we
measure the potential roles of distribution, phonotactics and
their combination using a computer-simulated learning
algorithm; the simulation is based on a bootstrapping model in
which phonotactic knowledge is used to constrain the
distributional analysis of speech samples.
While our work is in part motivated by the above research,
other developmental research supports certain assumptions we
make. The input to our system is represented as a sequence of
phonemes, so we implicitly assume that infants are able to
convert from acoustic input to phoneme sequences; research by
Kuhl (e.g., Grieser & Kuhl, 1989) suggests that this assumption
is reasonable. Since sentence boundaries provide information
about word boundaries (the end of a sentence is also the end of
a word), our input contains sentence boundaries; several
studies (Bernstein-Ratner, 1985; Hirsh-Pasek et al., 1987;
Kemler Nelson, Hirsh-Pasek, Jusczyk & Wright Cassidy, 1989;
Jusczyk et al., 1992) have shown that infants can perceive
sentence boundaries using prosodic cues. However, Fisher and
Tokura (in press) found no evidence that prosody can accurately
predict word boundaries, so the task of finding words remains.
Finally, one might question whether infants have the ability we
are trying to model--that is, whether they can identify words
embedded in sentences; Jusczyk and Aslin (submitted) found that
7 1/2-month-olds can do so.
The Model 
To gain an intuitive understanding of our model, consider the
following speech sample (transcription is in IPA):
Orthography:     Do you see the kitty?
                 See the kitty?
                 Do you like the kitty?
Transcription:   dujusiðəkɪti
                 siðəkɪti
                 dujulaɪkðəkɪti
There are many different ways to break this sample into
putative words (each particular segmentation is called a
segmentation hypothesis). Two such hypotheses are:
Segmentation 1:  du ju si ðə kɪti
                 si ðə kɪti
                 du ju laɪk ðə kɪti
Segmentation 2:  duj us ið əkɪt i
                 sið ək ɪti
                 du jul aɪk ðək ɪti
Listing the words used by each segmentation hypothesis yields
the following two lexicons:
Segmentation 1
1 du     3 kɪti    5 si
2 ðə     4 laɪk    6 ju
Segmentation 2
1 aɪk    5 ək      9 ɪti
2 du     6 əkɪt    10 jul
3 duj    7 i       11 sið
4 ðək    8 ið      12 us
Note that Segmentation 1, the correct hypothesis, yields a
compact lexicon of frequent words whereas Segmentation 2 yields
a much larger lexicon of infrequent words. Also note that a
lexicon contains only the words used in the sample--no words
are known to the system a priori, nor are any carried over from
one hypothesis to the next. Given a lexicon, the sample can be
encoded by replacing words with their respective indices into
the lexicon:
Encoded Sample 1: 1, 6, 5, 2, 3;
                  5, 2, 3;
                  1, 6, 4, 2, 3;
Encoded Sample 2: 3, 12, 8, 6, 7;
                  11, 5, 9;
                  2, 10, 1, 4, 9;
Our simulation attempts to find the hypothesis 
that minimizes the combined sizes of the lexicon 
and encoded sample. This approach is called the 
Minimum Description Length (MDL) paradigm 
and has been used recently in other domains to 
analyze distributional information (Li & Vitányi,
1993; Rissanen, 1978; Ellison, 1992, 1994; Brent,
1993). For reasons explained in the next section,
the system converts these character-based
representations to compact binary representations,
using the number of bits in the binary string as a
measure of size.
Phonotactic rules can be used to restrict the segmentation
hypothesis space by preventing word boundaries at certain
places; for instance, /kætspɔz/ ("cat's paws") has six internal
segmentation points (k ætspɔz, kæ tspɔz, etc.), only two of
which are phonotactically allowed (kæt spɔz and kæts pɔz). To
evaluate the usefulness of phonotactic knowledge, we compared
results between phonotactically constrained and unconstrained
simulations.
SIMULATION DETAILS 
To use the MDL principle, as introduced above,
we search for the smallest-sized hypothesis. We
must have some well-defined method of measuring
hypothesis sizes for this method to work. A
simple, intuitive way of measuring the size of a
hypothesis is to count the number of characters used
to represent it. For example, counting the characters
(excluding spaces) in the introductory exam-
Figure 1: Schematic diagrams for components of the representation:
(a) word inventory column, (b) code word inventory column, (c) sample.
To be fully self-delimiting, the width of a field must be
represented in a self-delimiting way; we use a unary
representation--i.e., write an extra field consisting of only
'1' bits followed by a terminating '0'. There are n fields (one
for each word), plus the unary prefix, so the combined length
of the fields plus prefix (plus terminating zero) is:
$$1 + (n + 1) \log_2\left(\max_{i=1 \ldots n} \mathrm{len}(w_i)\right) \qquad (3)$$
The total length of the word inventory column representation is
the sum of the terms in (1), (2) and (3).
The code word inventory column of the lexicon (see Figure 1b
for a schematic) has a representation nearly identical to that
of the previous column, except that code words are listed
instead of phonemic words--the length fields and unary prefix
serve the same purpose of marking the divisions between code
words.
The sample can be represented most compactly by assigning short
code words to frequent words, reserving longer code words for
infrequent words. To satisfy this property, code words are
assigned so that their lengths are frequency-based; the length
of the code word for a word of frequency $f(w)$ will not be
greater than:
$$\mathrm{len}([w]) = \log_2 \frac{\sum_{i=1}^{n} f(w_i)}{f(w)} = \log_2 \frac{m}{f(w)}$$
where m is the total number of word tokens in the sample. The
total length of the code word list is the sum of the code word
lengths over all lexical entries:
$$\sum_{i=1}^{n} \mathrm{len}([w_i]) = \sum_{i=1}^{n} \log_2 \frac{m}{f(w_i)} \qquad (4)$$
As in the word inventory column (described above), the length
of each code word is represented in a fixed-length field. Since
the least frequent word will have the longest code word (a
property of the formula for $\mathrm{len}([w_i])$), the longest
possible code word comes from a word of frequency one:
$$\log_2 \frac{m}{1} = \log_2 m$$
Since the fields contain integers between one and this number,
we define the length of a field to be:
$$\log_2(\log_2 m)$$
As above, we represent the width of a field in unary, so there
are a total of n + 1 elements of this size (n fields plus the
unary representation of the field width). The combined length
of the fields plus prefix (and terminating zero) is:
$$1 + (n + 1) \log_2(\log_2 m) \qquad (5)$$
The total length of the code word inventory column
representation is the sum of the terms in (4) and (5).
Finally, the sequence of words which form the sample (see
Figure 1c for a schematic) is represented as the number of
words in the sample (m) followed by the list of code words.
Since code words are used as compact indices into the lexicon,
the original sample could be reconstructed completely by
looking up each code word in this list and replacing it with
its phoneme sequence from the lexicon. The code words we
assigned to lexical items are self-delimiting (once the set of
codes is known), so there is no need to represent the
boundaries between code words.
The length of the representation of the integer m is given by
the function
$$\ell^{(2)}(m) \qquad (6)$$
The length of the representation of the sample is computed by
summing the lengths of the code words used to represent the
sample. We can simplify this description by noting that the
combined length of all occurrences of a particular code word
$[w_i]$ is $f(w_i) \cdot \mathrm{len}([w_i])$ since there are
$f(w_i)$ occurrences of the code word in the sample. So, the
length of the encoded sample is the sum of this formula over
all words in the lexicon:
$$\sum_{i=1}^{n} f(w_i) \cdot \mathrm{len}([w_i]) = \sum_{i=1}^{n} f(w_i) \log_2 \frac{m}{f(w_i)} \qquad (7)$$
The total length of the sample is given by adding the terms in
(6) and (7). The total length of the representation of the
entire hypothesis is the sum of the representation lengths of
the word inventory column, the code word inventory column and
the sample.
This system of computing hypothesis sizes is efficient in the
sense that elements are thought of as being represented
compactly and that code words are assigned based on the
relative frequencies of words. The final evaluation given to a
hypothesis is an estimate of the minimal number of bits
required to transmit that hypothesis. As such, it permits
direct comparison between competing hypotheses; that is, the
shorter the representation of some hypothesis, the more
distributional information can be extracted and, therefore, the
better the hypothesis.
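As a rough illustration of how the terms (1) through (7) combine, the following sketch computes the total length of a segmentation hypothesis. It is our own reading of the formulae rather than the original simulation code: the names ell2 and hypothesis_bits are hypothetical, each character is treated as one phoneme, IPA symbols are replaced by ASCII stand-ins (D, @, I), and the sample is assumed to contain at least three word tokens so that log2(log2 m) is defined.

```python
import math
from collections import Counter


def ell2(x):
    """Continuous approximation of the self-delimiting integer length."""
    return 1.5 + math.log2(x + 1) + 2 * math.log2(math.log2(x + 2) + 0.5)


def hypothesis_bits(utterances):
    """Description length in bits of a segmentation hypothesis.

    `utterances` is a list of strings whose words are separated by spaces.
    Assumes one character per phoneme and at least three word tokens.
    """
    tokens = [w for utt in utterances for w in utt.split()]
    freq = Counter(tokens)                       # f(w) for each lexical item
    n, m = len(freq), len(tokens)                # lexicon size, sample size
    p = len({ph for w in freq for ph in w})      # unique phonemes
    longest = max(len(w) for w in freq)
    code_len = {w: math.log2(m / f) for w, f in freq.items()}

    word_inventory = (ell2(n)                                        # (1)
                      + math.log2(p) * sum(len(w) for w in freq)     # (2)
                      + 1 + (n + 1) * math.log2(longest))            # (3)
    code_inventory = (sum(code_len.values())                         # (4)
                      + 1 + (n + 1) * math.log2(math.log2(m)))       # (5)
    sample = (ell2(m)                                                # (6)
              + sum(f * code_len[w] for w, f in freq.items()))       # (7)
    return word_inventory + code_inventory + sample


SEGMENTATION_1 = ["du ju si D@ kIti", "si D@ kIti", "du ju laIk D@ kIti"]
SEGMENTATION_2 = ["duj us iD @kIt i", "siD @k Iti", "du jul aIk D@k Iti"]

if __name__ == "__main__":
    print(hypothesis_bits(SEGMENTATION_1))
    print(hypothesis_bits(SEGMENTATION_2))
```

On the introductory example, the correct segmentation receives the shorter description under this sketch, as the MDL argument predicts.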
Phonotactics 
Phonotactic knowledge was given to the system as a list of
licit initial and final consonant clusters of English words^2;
this list was checked against all six samples so that the list
was maximally permissive (e.g., the underlined consonant
cluster in explore could be divided as ek-splore or eks-plore).
In those simulations which used the phonotactic knowledge, a
word boundary could not be inserted when doing so would create
a word-initial or word-final consonant cluster not on the list
or would create a word without a vowel. For example (from an
actual sample--corresponds to the utterance, "Want me to help
baby?"):
Sample:           wantmituhelpbebi
Valid Boundaries: want.mi.t.u.help.be.bi
In the second line, those word boundaries that are
phonotactically legal are marked with dots. The boundary
between /w/ and /a/ is illegal because /w/ by itself is not a
legal word in English; the boundary between /a/ and /n/ is
illegal because /ntm/ is not a valid word-initial consonant
cluster; the boundary between /m/ and /i/ is illegal because
/ntm/ is also not a valid word-final consonant cluster; the
boundary between /p/ and /b/ is legal because /lp/ is a valid
word-final cluster and /b/ is a valid word-initial cluster.
Note that using the phonotactic constraints reduces the number
of potential word boundaries from fifteen to six in this
example.
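The boundary test can be sketched as follows. The cluster lists here are deliberately tiny and hypothetical, chosen only to cover this one utterance; the list actually given to the system is not reproduced in the text. The vowel set and the function names are likewise assumptions made for illustration.

```python
# Toy phonotactic filter for the utterance "wantmituhelpbebi".
# All three tables below are hypothetical and cover only this example.
VOWELS = set("aeiu")
INITIAL_CLUSTERS = {"", "w", "m", "t", "h", "b"}   # licit word-initial clusters
FINAL_CLUSTERS = {"", "nt", "t", "lp"}             # licit word-final clusters


def has_vowel(s):
    return any(ch in VOWELS for ch in s)


def trailing_consonants(s):
    """Maximal run of consonants at the end of s."""
    i = len(s)
    while i > 0 and s[i - 1] not in VOWELS:
        i -= 1
    return s[i:]


def leading_consonants(s):
    """Maximal run of consonants at the start of s."""
    i = 0
    while i < len(s) and s[i] not in VOWELS:
        i += 1
    return s[:i]


def valid_boundaries(utterance):
    """Positions k such that a boundary between utterance[k-1] and utterance[k] is legal."""
    legal = []
    for k in range(1, len(utterance)):
        left, right = utterance[:k], utterance[k:]
        if not (has_vowel(left) and has_vowel(right)):
            continue  # would create a word without a vowel
        if (trailing_consonants(left) in FINAL_CLUSTERS
                and leading_consonants(right) in INITIAL_CLUSTERS):
            legal.append(k)
    return legal


def mark_boundaries(utterance):
    """Render the utterance with a dot at every legal boundary."""
    cuts = set(valid_boundaries(utterance))
    return "".join(("." if k in cuts else "") + ch
                   for k, ch in enumerate(utterance))


if __name__ == "__main__":
    print(mark_boundaries("wantmituhelpbebi"))
```

With these toy lists the sketch marks the same six boundaries as the example above; a realistic cluster inventory would be much larger and could license different boundaries.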
After the system inserts a new word boundary, it updates the
list of remaining valid insertion points--adding a point may
cause nearby points to become unusable due to the restriction
that every word must have a vowel. For example (corresponding
to the utterance "green and"):
^2 In phonological terms, the syllable onsets permitted at word
beginnings and syllable codas permitted at word ends. Some
languages (including English) have different sets of onsets and
codas for word-internal and word-boundary positions--we use the
word-boundary set.
Before: gri.n.ænd
After:  grin
        ænd
After the segmentation of /grin/ and /ænd/, the potential
boundary between /i/ and /n/ becomes invalid because inserting
a word boundary there would produce a word with no vowel (/n/).
Inputs and Simulations 
Two speech samples from each of three subjects were used in the
simulations: in one sample a mother was speaking to her
daughter and in the other, the same mother was speaking to the
researcher. The samples were taken from the CHILDES database
(MacWhinney & Snow, 1990) from studies reported in Bernstein
(1982). Each sample was checked for consistent word spellings
(e.g., 'ts was changed to its), then was transcribed into an
ASCII-based phonemic representation^3. The transcription system
was based on IPA and used one character for each consonant or
vowel; diphthongs, r-colored vowels and syllabic consonants
were each represented as one character. For example, "boy" was
written as bT, "bird" as bRd and "label" as lebL. For purposes
of phonotactic constraints, syllabic consonants were treated as
vowels. Sample lengths were selected to make the number of
available segmentation points nearly equal (about 1,350) when
no phonotactic constraints were applied; child-directed samples
had 498-536 tokens and 153-166 types, adult-directed samples
had 443-484 tokens and 196-205 types. Finally, before the
samples were fed to the simulations, divisions between words
(but not between sentences) were removed.
The space of possible hypotheses is vast^4, so some method of
finding a minimum-length hypothesis without considering all
hypotheses is necessary. We used the following method: first,
evaluate the input sample with no segmentation points added;
then evaluate all hypotheses obtained by adding one or two
segmentation points; take the shortest hypothesis found in the
previous step and evaluate all hypotheses obtained by adding
one or two more segmentation points; continue this way until
the sample has been segmented into the smallest possible units
and report the shortest hypothesis ever found. Two variants of
this simulation were used: (1) DIST-FREE was free of any
phonotactic restrictions on the hypotheses it could form (DIST
refers to the measurement of distributional information),
whereas (2) DIST-PHONO used the phonotactic restrictions
described above. Each simulation was run on each sample, for a
total of twelve DIST runs.
^3 The transcription method ensured the identical transcription
of all occurrences of a word.
^4 For our samples, unconstrained by phonotactics, there are
about $2^{1350} \approx 2.5 \times 10^{406}$ hypotheses.
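The search procedure just described can be sketched generically. The code below is an illustration under stated assumptions, not the original implementation: a hypothesis is modelled as a set of boundary positions, greedy_search is a hypothetical name, and the cost function in the usage example is a toy stand-in for the description length (it simply counts disagreements with a fixed target set).

```python
from itertools import combinations


def greedy_search(points, cost):
    """Greedy search over sets of segmentation points.

    Starting from no boundaries, repeatedly evaluate every hypothesis that
    adds one or two of the remaining points to the current one, move to the
    cheapest of them, and remember the cheapest hypothesis seen overall.
    """
    current = frozenset()
    best, best_cost = current, cost(current)
    remaining = set(points)
    while remaining:
        ordered = sorted(remaining)
        candidates = [current | {p} for p in ordered]
        candidates += [current | {p, q} for p, q in combinations(ordered, 2)]
        current = min(candidates, key=cost)
        remaining -= current
        current_cost = cost(current)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost


# Toy usage: the "cost" is the number of disagreements with a target set.
TARGET = {2, 5, 7}


def toy_cost(hypothesis):
    return len(hypothesis ^ TARGET)


if __name__ == "__main__":
    print(greedy_search(range(1, 10), toy_cost))
```

In the simulations the cost would be the description length of the hypothesis, and for DIST-PHONO the candidate points would be restricted to phonotactically legal boundaries. Each step evaluates a number of hypotheses quadratic in the number of remaining points, which is what keeps the search tractable despite the size of the full hypothesis space.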
Finally, two other simulations were run on each sample to
measure chance performance: (1) RAND-FREE inserted random
segmentation points and reported the resulting hypothesis,
(2) RAND-PHONO inserted random segmentation points where
permitted by the phonotactic constraints. Since the RAND
simulations were given the number of segmentation points to add
(equal to the number of segmentation points needed to produce
the natural English segmentation), their performance is an
upper bound on chance performance. In contrast, the DIST
simulations must determine the number of segmentation points to
add using MDL evaluations. The results for each RAND simulation
are averages over 1,000 trials on each input sample.
RESULTS 
Each simulation was scored for the number of cor- 
rect segmentation points inserted, as compared to 
the natural English segmentation. From this scor- 
ing, two values were computed: recall, the per- 
cent of all correct segmentation points that were 
actually found; and accuracy, the percent of the 
hypothesized segmentation points that were actually
correct. In terms of hits, false alarms and
misses, we have:
$$\text{recall} = \frac{\text{hits}}{\text{hits} + \text{misses}}$$
$$\text{accuracy} = \frac{\text{hits}}{\text{hits} + \text{false alarms}}$$
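For concreteness, the two measures can be computed from sets of boundary positions as in the sketch below. The function name score and the example positions are hypothetical, and both sets are assumed to be non-empty.

```python
def score(hypothesized, correct):
    """Return (recall, accuracy) in percent for a set of hypothesized boundaries."""
    hypothesized, correct = set(hypothesized), set(correct)
    hits = len(hypothesized & correct)          # correct points that were found
    misses = len(correct - hypothesized)        # correct points that were not found
    false_alarms = len(hypothesized - correct)  # hypothesized points that are wrong
    recall = 100.0 * hits / (hits + misses)
    accuracy = 100.0 * hits / (hits + false_alarms)
    return recall, accuracy


if __name__ == "__main__":
    # Two of four correct boundaries found, plus one false alarm.
    print(score({4, 6, 7}, {4, 6, 8, 12}))
```

In this made-up example half of the correct boundaries are recovered (recall of 50%), and two of the three hypothesized boundaries are correct (accuracy of about 67%).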
Results are given in Table 1. Note that there is a trade-off
between recall and accuracy--if all possible segmentation
points were added, recall would be 100% but accuracy would be
low; likewise, if only one segmentation point was added between
two words, accuracy would be 100% but recall would be low.
Since our goal is to correctly segment speech, accuracy is more
important than finding every correct segmentation. For example,
deciding 'littlekitty' is a word is less disastrous than
deciding 'li', 'tle', 'ki' and 'ty' are all words, because
assigning meaning to 'littlekitty' is a reasonable first try at
learning word-meaning pairs, whereas trying to assign separate
meanings to 'li' and 'tle' is problematic.
The performance of DIST-PHONO on child-directed speech shows
that this system goes a long way toward solving the
segmentation problem. However, comparing the average
performances of simulations is also useful. The effect of
phonotactic information can be seen by comparing the average
performances of RAND-FREE and RAND-PHONO, since the only
difference between them is the addition of phonotactic
constraints on segmentations in the latter. Clearly
phonotactic constraints are useful, as both recall and accuracy
improve. A similar comparison between RAND-FREE and DIST-FREE
shows that distributional information alone also improves
performance. Note in all the results of DIST-FREE that using
distributional information alone favors recall over accuracy;
in fact, the segmentation hypotheses produced by DIST-FREE have
most words broken into single phoneme units with only a handful
of words remaining intact. Two comparisons are needed to show
that the combination of distributional and phonotactic
information performs better than either source alone:
DIST-PHONO compared to RAND-PHONO, to see the effect of adding
distributional analysis to phonotactic constraints, and
DIST-PHONO compared to DIST-FREE, to see the effect of adding
phonotactic constraints to distributional analysis. The former
comparison shows that the sources combined are more useful than
phonotactic information alone. The latter comparison is less
obvious--the trade-off between recall and accuracy seems to
have reversed, with no clear winner^5. Data on discovered word
types helps make this comparison: DIST-FREE found 12% of the
words with 30% accuracy and DIST-PHONO found 33% of the words
with 50% accuracy. Whereas the segmentation point data are
inconclusive, word type data demonstrate that combining
information sources is more useful than using distributional
information alone.
There is no obvious difference in performance between child-
and adult-directed speech, except in DIST-PHONO (combined
information sources) in which the difference is striking:
accuracy remains high and recall rate more than triples for
child-directed speech. This difference is again supported by
word type data: 14% recall with 30% accuracy for adult-directed
speech, 56% recall with 65% accuracy for child-directed speech.
DISCUSSION 
Our technique segments continuous speech into words using only
distributional and phonotactic information more effectively
than one might expect--up to 66% recall of segmentation points
with 92% accuracy on one sample, which yields 58% recall of
word types with 67% accuracy (the relatively low type accuracy
is mitigated by the fact that most incorrect words are
meaningful concatenations of correct words--e.g., 'thekitty').
^5 The higher accuracy of DIST-PHONO is a good sign.
Furthermore, the minimum of the recall/accuracy pair is greater
in DIST-PHONO than in DIST-FREE and the maximum of the
recall/accuracy pair is also greater in DIST-PHONO than in
DIST-FREE.
Table 1: Results for all simulations averaged over individual speech samples
                                      Simulation
Target    Measure       RAND-FREE   RAND-PHONO   DIST-FREE   DIST-PHONO
Adult     % Recall        25.1        39.5         95.5        22.5
          % Accuracy      28.9        50.5         36.0        92.0
Child     % Recall        23.4        40.2         79.9        72.3
          % Accuracy      26.7        51.7         37.4        88.3
Average   % Recall        24.3        39.9         88.0        46.4
          % Accuracy      27.8        51.1         36.6        89.2
This finding confirms the idea that distribution and
phonotactics are useful sources of information that infants
might use in discovering words (e.g., Jusczyk et al., 1993b).
In fact, it helps explain infants' ability to learn words from
parental speech: these two sources alone are useful and infants
have several others, like prosody and word stress patterns,
available as well. It also suggests that semantics and isolated
words need not play as central a role as one might think (e.g.,
Jusczyk, 1993, downplayed the utility of words in isolation).
It is difficult, if not impossible given currently available
methods, to determine which sources of information are
necessary for infants to segment speech and learn words; only
this sort of indirect evidence is available to us.
The results show a difference between adult- and child-directed
speech, in that the latter is easier to segment given both
distribution and phonotactics. This lends quantitative support
to research which suggests that motherese differs from normal
adult speech in ways possibly useful to the language-learning
infant (Aslin et al.). In fact, the factors making motherese
more learnable might be elucidated using this technique:
compare the results of several different models, each
containing a different factor or combination of factors,
looking for those in which a substantial performance difference
exists between child- and adult-directed speech.
Our model uses phonotactic constraints as absolute requirements
on the structure of individual words; this implies that
phonotactics have been learned prior to attempts at
segmentation. We must therefore show that phonotactics can
indeed be learned without access to a lexicon--without such a
demonstration, we are trapped in circular reasoning. Gafos and
Brent (1994) demonstrate that phonotactics can be learned with
high accuracy from the same unsegmented utterances we used in
our simulations. In general, two methods exist for combining
information sources in the MDL paradigm: one is to have
absolute requirements on plausible hypotheses (like our
phonotactic constraints)--these requirements must be
independently learnable; the other method of combination is to
include an information source in the internal representation of
hypotheses (like our distributional information)--all
components of the representation are learned simultaneously
(see Ellison, 1992, for an example of multiple components in a
representation).
We would like to extend the system by using a more detailed
transcription system. We expect that this would help the system
find word boundaries for reasons detailed in Church (1987)--in
brief, that allophonic variation may be quite useful in
predicting word boundaries. Another simpler extension of this
research will be to increase the length of the speech samples
used. Finally, we will try the current system on samples from
other languages, to make sure this method generalizes
appropriately.
This research program will provide complementary
evidence supporting hypotheses about
the sources of information infants use in learning
their native languages. Until now, research has
focused on demonstrations of infants' sensitivity to
various sources; we have begun to provide
quantitative measures of the usefulness of those sources.

References 
Richard N. Aslin, Julide Z. Woodward, Nicholas P. 
LaMendola, and Thomas G. Bever. In press.
Models of word segmentation in fluent ma- 
ternal speech to infants. In Morgan & De- 
muth (Eds.), Signal to Syntax: Bootstrapping 
from Speech to Syntax in Early Acquisition. 
Erlbaum, Hillsdale, NJ. 
Nan Bernstein. 1982. Acoustic study of mothers'
speech to language-learning children: An
analysis of vowel articulatory characteristics.
Unpublished doctoral dissertation, Boston
University.
Nan Bernstein-Ratner. 1985. Cues which mark 
clause-boundaries in mother-child speech. Paper
presented at the meeting of the American
Speech-Language Hearing Association, Wash- 
ington DC. 
Michael R. Brent. 1993. Minimal generative
explanations: A middle ground between neurons
and triggers. In Proceedings of the 15th An- 
nual Conference of the Cognitive Science So- 
ciety, pages 28-36, Boulder, Colorado. 
Kenneth Church. 1987. Phonological parsing and 
lexical retrieval. Cognition, 25:53-69. 
Anne Cutler, and Sally Butterfield. 1992.
Rhythmic cues to speech segmentation: Evidence
from juncture misperception. Journal of
Memory & Language, 31:218-236.
Anne Cutler, and David M. Carter. 1987. The
predominance of strong initial syllables in the
English vocabulary. Computer Speech and 
Language, 2:133-142. 
Anne Cutler, and D. G. Norris. 1988. The role 
of strong syllables in segmentation for lexical 
access. Journal of Experimental Psychology: 
Human Perception and Performance, 14:113-
121. 
T. Mark Ellison. 1992. The Machine Learning of 
Phonological Structure. Unpublished doctoral 
dissertation, University of Western Australia. 
T. Mark Ellison. In press. The iterative learning 
of phonological rules. Computational
Linguistics.
Anne Fernald, and Claudia Mazzie. 1991.
Prosody and focus in speech to infants and
adults. Developmental Psychology, 27:209-221.
Cindy Fisher, and H. Tokura. In press. Acoustic
cues to clause boundaries in speech to infants:
Cross-linguistic evidence. In Morgan & Demuth
(Eds.), Signal to Syntax: Bootstrapping
from Speech to Syntax in Early Acquisition.
Erlbaum, Hillsdale, NJ.
Adamantios Gafos, and Michael R. Brent. 1994. 
Learning syllable structure without word
boundaries. Paper presented at the 1994
Stanford Child Language Research Forum,
Stanford, CA. 
DiAnne Grieser, and Patricia K. Kuhl. 1989. The
categorization of speech by infants: Support
for speech-sound prototypes. Developmental
Psychology, 25:577-588.
Kathy Hirsh-Pasek, Deborah G. Kemler Nelson,
Peter W. Jusczyk, K. Wright Cassidy, B.
Druss, and L. Kennedy. 1987. Clauses are
perceptual units for young infants. Cognition,
26:269-286.
Peter W. Jusczyk. 1993. Discovering sound
patterns in the native language. In Proceedings of
the 15th Annual Conference of the Cognitive
Science Society, pages 49-60, Boulder,
Colorado.
Peter W. Jusczyk, and Richard N. Aslin. Submitted
for publication. Recognition of familiar
patterns in fluent speech by 7 1/2-month-old 
infants. 
Peter W. Jusczyk, Anne Cutler, and Nancy J.
Redanz. 1993. Infants' preference for the 
predominant stress patterns of English words. 
Child Development, 64:675-687. 
Peter W. Jusczyk, Angela D. Friederici,
Jeanine M. Wessels, Vigdis Y. Svenkerud, and
A. M. Jusczyk. 1993. Infants' sensitivity to
the sound patterns of native language words.
Journal of Memory & Language, 32:402-420.
Peter W. Jusczyk, Kathy Hirsh-Pasek, Deborah G.
Kemler Nelson, Lori J. Kennedy,
Amanda Woodward, and Julie Piwoz. 1992. 
Perception of acoustic correlates of major 
phrasal units by young infants. Cognitive Psy- 
chology, 24:252-293. 
Deborah G. Kemler Nelson, Kathy Hirsh-Pasek,
Peter W. Jusczyk, and K. Wright Cassidy.
1989. How the prosodic cues in motherese
might assist language learning. Journal of
Child Language, 16:55-68.
Ming Li, and Paul Vitányi. 1993. An Introduction
to Kolmogorov Complexity and its
Applications. Springer-Verlag, New York, NY.
Brian MacWhinney, and C. Snow. 1990. The
Child Language Data Exchange System: An
update. Journal of Child Language, 17:457-472.
J. R. Quinlan, and R. L. Rivest. 1989. Inferring
decision trees using the minimum description 
length principle. Information and Computation,
80:227-248. 
J. Rissanen. 1978. Modeling by shortest data de- 
scription. Automatica, 14:465-471. 
