Acquiring a Lexicon from Unsegmented Speech 
Carl de Marcken 
MIT Artificial Intelligence Laboratory 
545 Technology Square, NE43-804 
Cambridge, MA, 02139, USA 
cgdemarc@ai.mit.edu 
Abstract 
We present work-in-progress on the ma- 
chine acquisition of a lexicon from sen- 
tences that are each an unsegmented phone 
sequence paired with a primitive represen- 
tation of meaning. A simple exploratory 
algorithm is described, along with the di- 
rection of current work and a discussion 
of the relevance of the problem for child 
language acquisition and computer speech 
recognition. 
1 Introduction 
We are interested in how a lexicon of discrete words 
can be acquired from continuous speech, a prob- 
lem fundamental both to child language acquisition 
and to the automated induction of computer speech 
recognition systems; see (Olivier, 1968; Wolff, 1982; 
Cartwright and Brent, 1994) for previous computa- 
tional work in this area. For the time being, we ap- 
proximate the problem as induction from phone se- 
quences rather than acoustic pressure, and assume 
that learning takes place in an environment where 
simple semantic representations of the speech intent 
are available to the acquisition mechanism. 
For example, we approximate the greater problem 
as that of learning from inputs like 
Phon. Input: /~raebltslne~ b~W t/ 
Sem. Input: { BOAT A IN RABBIT THE BE } 
(The rabbit's in a boat.) 
where the semantic input is an unordered set of iden- 
tifiers corresponding to word paradigms. Obviously 
the artificial pseudo-semantic representations make 
the problem much easier: we experiment with them 
as a first step, somewhere between learning language 
"from a radio" and providing an unambiguous tex- 
tual transcription, as might be used for training a 
speech recognition system. 
Our goal is to create a program that, after train- 
ing on many such pairs, can segment a new phonetic 
utterance into a sequence of morpheme identifiers. 
Such output could be used as input to many gram- 
mar acquisition programs. 
2 A Simple Prototype 
We have implemented a simple algorithm as an ex- 
ploratory effort. It maintains a single dictionary, a 
set of words. Each word consists of a phone sequence 
and a set of sememes (semantic symbols). Initially, 
the dictionary is empty. When presented with an 
utterance, the algorithm goes through the following 
sequence of actions: 
• It attempts to cover ("parse") the utterance 
phones and semantic symbols with a sequence 
of words from the dictionary, each word offset a 
certain distance into the phone sequence, with 
words potentially overlapping. 
• It then creates new words that account for un- 
covered portions of the utterance, and adjusts 
words from the parse to better fit the utterance. 
• Finally, it reparses the utterance with the old 
dictionary and the new words, and adds the new 
words to the dictionary if the resulting parse 
covers the utterance well. 
Occasionally, the program removes rarely-used 
words from the dictionary, and removes words which 
can themselves be parsed. The general operation of 
the program should be made clearer by the follow- 
ing two examples. In the first, the program starts 
with an empty dictionary, early in the acquisition 
process, and receives the simple utterance/nina/{ 
NINA } (a child's name). Naturally, it is unable to 
parse the input. 
Utterance: 
Words: 
Unparsed: 
Mismatched: 
Phones Sememes /nina/ { JINA } 
/nina/ { NINA } 
From the unparsed portion of the sentence, the 
program creates a new word, /nina/ { NINA }. It 
then reparses 
Phones Sememes 
Utterance: /nina/ { NINA } Words: /nine/ { sISA } 
Unparsed: 
Mismatched: 
311 
Having successfully parsed the input, it adds the 
new word to the dictionary. Later in the acquisition 
process, it encounters the sentence you kicked off ~he sock, 
when the dictionary contains (among other 
words) /yu/ { YOU }, /~a/ { THE }, and /rsuk/ { 
SOCK }. 
Utterance: 
Words: 
Unparsed: 
Mismatched: 
Phones Sememes 
/yukIkt~f~sak/ { KiCK YOU OFF 
SOCK THE } /y./ { 
YOU } 
I~1 { THE } /rs~k/ { sock } 
kIkt~f { KICK OFF } 
r 
The program creates the new word /kIkt~f/ { 
KICK OFF } to account for the unparsed portion of 
the input, and/suk/{ SOCK} to fix the mismatched 
phone. It reparses, 
Phones 
Utterance: /yukIkt3f5~sak/ 
Words: !yu/ 
/klkt~f/ /a~/ 
/s~k/ /rs~k/ unused 
Unparsed: 
Mismatched: 
Sememes 
{ KICK YOU OFF 
SOCK THE } 
{ You } 
{ KICK OFF } 
{ THE } 
{ SOCK } { SOCK } 
On this basis, it adds/kIkt~f/{ KICK OFF } and 
/sak/ { SOCK } to the dictionary. /rsuk/ { SOCK 
}, not used in this analysis, is eventually discarded 
from the dictionary for lack of use. /klkt~f/{ KICK 
OFF } is later found to be parsable into two sub- 
words, and also discarded. 
One can view this procedure as a variant of the 
expectation-maximization (Dempster et al., 1977) 
procedure, with the parse of each utterance as the 
hidden variables. There is currently no preference 
for which words are used in a parse, save to mini- 
mize mismatches and unparsed portions of the input, 
but obviously a word grammar could be learned in 
conjunction with this acquisition process, and used 
as a disambiguation step. 
3 Tests and Results 
To test the algorithm, we used 34438 utterances 
from the Childes database of mothers' speech to chil- 
dren (MacWhinney and Snow, 1985; Suppes, 1973). 
These text utterances were run through a publicly 
available text-to-phone engine. A semantic dictio- 
nary was created by hand, in which each root word 
from the utterances was mapped to a correspond- 
ing sememe. Various forms of a root ("see", "saw", 
"seeing") all map to the same sememe, e.g., SEE . 
Semantic representations for a given utterance are 
merely unordered sets of sememes generated by tak- 
ing the union of the sememe for each word in the 
utterance. Figure 1 contains the first 6 utterances 
from the database. 
We describe the results of a single run of the al- 
gorithm, trained on one exposure to each of the 
34438 utterances, containing a total of 2158 differ- 
ent stems. The final dictionary contains 1182 words, 
where some entries are different forms of a com- 
mon stem. 82 of the words in the dictionary have 
never been used in a good parse. We eliminate these 
words, leaving 1100. Figure 2 presents some entries 
in the final dictionary, and figure 3 presents all 21 
(2%) of the dictionary entries that might be reason- 
ably considered mistakes. 
Phones /yu/ 
/~// 
/.st/ 
It,ll /d./ 
/e/ 
/It/ 
/ax/ 
/in/ 
/wi/ 
Sememes Phones Sememes 
{ YOU } /bik/ { BEAK } 
{ THE } /we/ { wAY } { WHAT } /hi/ { HEY } 
{ TO } /brik/ { BREAK } { DO } /f, vg3/ { FINGER } 
{ A } Ikisl { KISS } { IT } /tap/ { TOP } 
{ I } /k~ld/ { CALL } { 
IS } l~gz/ { EGG } { WE } 
/eng/ { THING } 
Figure 2: Dictionary entries. The left 10 are the 
10 words used most frequently in good parses. The 
right 10 were selected randomly from the 1100 en- 
tries. 
/iv/{ BE } /z~/ { YOU } 
/iv/{ DO } 
Hi,./{ SHE BE } /shappin/ { HAPPEN } 
/t I { NOT } /skatt/ 
{ BOB SCOTT } 
/nidahz/ { NEEDLE BE } IsAmOl 
{ SOMETHING } 
Innpi~/{ sNooPy } I*oI 
{ WILL } I""I 
{ AT ZOO } /don/ { DO } 
/sdf/{ YOU } /~/{ 
BE } 
/smAd/ { MUD } 
/~r~/{ BE } 
Idontl { DO NOT } /watarOiz/ { 
WHAT BE THESE } 
/wathappind/ { WHAT HAPPEN} /dran^63wiz/ 
{ DROWN OTHERWISE } 
Figure 3: All of the significant dictionary errors. 
Some of them, like /J'iz/ are conglomerations that 
should have been divided. Others, like/t/, /wo/, 
and /don/ demonstrate how the system compen- 
sates for the morphological irregularity of English 
contractions. The /I~/problem is discussed in the 
text; misanalysis of the role of/I~/ also manifests 
itself on something. 
The most obvious error visible in figure 3 is the 
suffix -ing (/I~/), which should be have an empty se- 
meme set. Indeed, such a word is properly hypothe- 
sized but a special mechanism prevents semantically 
empty words from being added to the dictionary. 
Without this mechanism, the system would chance 
312 
Sentence 
this is a book. 
what do you see in the book? 
how many rabbits? 
how many? 
one rabbit. 
what is the rabbit doing? 
Phones 
/bIslzebuk/ 
/watduyusilnb~buk/ 
/hat~menirabhlts/ 
/hatlmeni/ 
/w^nrabblt/ 
/watlzb~rabbItdulD / 
Sememes 
{ THIS BE A'B00K ) 
{ WHAT DO YOU SEE IS THE BOOK } 
{ HOW MANY RABBIT } 
{ HOW MANY } 
{ ONE RABBIT } 
{ WHAT BE THE RABBIT DO } 
Figure 1: The first 6 utterances from the Childes database used to test the algorithm. 
upon a new word like ring,/rig/, use the/I~/{} to 
account for most of the sound, and build a new word 
/r/{ RINa } to cover the rest; witness something in 
figure 3. Most other semantically-empty affixes (plu- 
ral/s/for instance) are also properly hypothesized 
and disallowed, but the dictionary learns multiple 
entries to account for them (/eg/ "egg" and /egz/ 
"eggs"). The system learns synonyms ("is", "was", 
"am", ...) and homonyms ("read", "red" ; "know", 
"no") without difficulty. 
Removing the restriction on empty semantics, and 
also setting the semantics of the function words a, 
an, the, that and of to {}, the most common empty 
words learned are given in figure 4. The ring prob- 
lem surfaces: among other words learned are now 
/k/{ CAR } and/br/{ BRI/IG }. To fix such prob- 
lems, it is obvious more constraint on morpheme 
order must be incorporated into the parsing pro- 
cess, perhaps in the form of a statistical grammar 
acquired simultaneously with the dictionary. 
Word Source 
l~v/ {} -~,,g I~I {} 
the 
/o/{} ? /r/{} 
uo./yo., Is/{) 
plur~ -~ 
It/ {) is/'s 
Word Source 
/wo/{} ? 
/el {} a /an/{} 
.. 
/~,,/{} o/ /z/ 
{} plural -s 
Figure 4: The most common semantically empty 
words in the final dictionary. 
4 Current Directions 
The algorithm described above is extremely simple, 
as was the input fed to it. In particular, 
• The input was phonetically oversimplified, each 
word pronounced the same way each time it oc- 
curred, regardless of environment. There was 
no phonological noise and no cross-word effects. 
• The semantic representations were not only 
noise free and unambiguous, but corresponded 
directly to the words in the utterance. 
To better investigate more realistic formulations 
of the acquisition problem, we are extending our 
coverage to actual phonetic transcriptions of speech, 
by allowing for various phonological processes and 
noise, and by building in probabilistic models of 
morphology and syntax. We are further reducing 
the information present in the semantic input by 
removing all function word symbols and merging 
various content symbols to encompass several word 
paradigms. We hope to transition to phonemic in- 
put produced by a phoneme-based speech recognizer 
in the near future. 
Finally, we are instituting an objective test mea- 
sure: rather than examining the dictionary directly, 
we will compare segmentation and morpheme- 
labeling to textual transcripts of the input speech. 
5 Acknowledgements 
This research is supported by NSF grant 9217041- 
ASC and AR.PA under the ttPCC program. 

References 
Timothy Andrew Cartwright and Michael R. Brent. 
1994. Segmenting speech without a lexicon: Evi- 
dence for a bootstrapping model of lexical acqui- 
sition. In Proc. of the 16th Annual Meeting of the 
Cognitive Science Society, IIillsdale, New Jersey. 
A. P. Dempster, N. M. Liard, and D. B. Rubin. 1977. 
Maximum liklihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical 
Society, B(39):1-38. 
B. MacWhinney and C. Snow. 1985. The child lan- 
guage data exchange system. Journal of Child 
Language, 12:271-296. 
Donald Cort Olivier. 1968. Stochastic Grammars 
and Language Acquisition Mechanisms. Ph.D. 
thesis, Harvard University, Cambridge, Mas- 
sachusetts. 
Patrick Suppes. 1973. The semantics of children's 
language. American Psychologist. 
J. Gerald Wolff. 1982. Language acquisition, data 
compression and generalization. Language and 
Communication, 2(1):57-89. 
