Machine-Readable Dictionaries in Text-to-Speech 
Systems 
Judith L. Klavans~and Evclyne Tzoukcrmann * 
"~Colnmbia University, l)epartment of Computer Science, New York, New York 10027 
klavans@cs.columhia.edu 
~* A.T.&T. Bell Laboratories, 600 Monntain Avenue, Murray llill, N.J. 07974 
evelyne@research.att.com 
Abstract 
This paper presents the results of an experiment 
usiug machine-readable dictionaries (Mill)s) and 
corpora for building concatenativc units for text 
to speech (T'PS) systems. Theoretical questions 
concerning the nature of t)honemic data in dic- 
tionaries are raised; phonemic dictionary data is 
viewed as a representative corpus over which to 
extract n- gram phonemic frequencies in the lan- 
guage. Dictionary data are compared to corpus 
data, and phoneme inventories arc evaluated for 
coverage. A methodology is defined to compute 
I)honemic n-grams for incorporation into a TTS 
system. 
1 Introduction 
The majority of speech synthesis systems use 
two techniques: concatenation and formant- 
synthesis. Building a comprehensive and intelli- 
gible concatenative-based speech synthesis system 
relies heavily on the successfid choice of concate- 
native units. Our results contribute to the t~sk of 
developing an eificient and elfective methodology 
for reducing the potentially large set of concaten- 
live units to a manageable size, and to chosing the 
optimal set for recording and storage. 
The paper is aimed primarily at two audiences: 
one consists of those concerned with research on 
the automatic use of MR.D data; the other are 
TTS system designers who require linguistic and 
lcxicographic resources to improve and streamline 
system-building. Issues of morphological analysis 
and generation, as well as stress assigmnent based 
on dictiona.ry data, are discussed. 
2 Using MRDs in Text to 
Speech 
Several problems are addressed in this paper; 
one concerns tile subtle comple×itics and idiosyn- 
crasies ilwolved iu parsing dictionaries and ex- 
tracting data. Added to this is the lack of consis- 
tency both within the same dictionary and across 
dictionaries which often requires ad hoc proce- 
dures for each- resource. Another issue relates to 
tile structure of the modules of a TTS system, 
specifically ill the grapheme-to-phoneme compo- 
nent; dictionary lookup depends on several factors 
including size, machine power and storage, factors 
that have important consequences for the extrac- 
tion ofconcatenative nnits. Another consideration 
concerns tile nature of the language itself: a lan- 
guage with irregular graphcme4o-phoneme map- 
ping and lexically determined stress assignment 
(such as English) benefits rnost from the large ex- 
ception list which a dictionary can provide, There 
is also the practical issue of dictionary availabil- 
ity, and of pronunciation field accuracy within an 
available dictionary. Thus, decisions on the use 
of MRD data depend on many factors, and can 
significantly impact efficiency and accuracy of a 
speech system. 
Since a dictionary entry consists of several 
fields of information, naturally, each will bc use- 
rid for different applications \[1\]. Among the stan- 
dard fields are prommciation, etymology, sub- 
jcct field notes, definition fields, synonym and 
antonym cross references, semantic and syntactic 
comments, run-on forms, conjugational class and 
inflectional information where relevant, and trans- 
lation for the I)ilingual dictionaries. Each of these 
fields has proven usefifl for different applications, 
such as for building semantic taxouomies \[3\], \[13\] 
and machine translation \[12\]. The most directly 
useflfl for TTS is the pronunciation field \[4\], \[11\]. 
Equally usefifl for TTS, but less dir6ctly acces- 
971 
sible, are data from run-on fields, conjugational 
class information, and part-of-speech. 1 
To illustrate, the following partial entries from 
Webster's Seventh (W7) \[15\] illustrate typical pro- 
nunciation, definition, and run-on fields: 
(l) ha.yen/'h.~-v0n/ n 1: IIAnBOR, POLO' 2 : a 
place of safety : ASYLUM haven vt 
(2) bi.son/'brs-on, 't>iz-/ n ... 
(3) ho.m,,.ge.neous /-'j~-ne-0s,-ny0s/ ... 
(4) den.tic.u.late/den-'tik-y0-1~t/ or 
den.tic.u.lat.ed/-,lat-od/ adj 
The entry for "haven" contains one fnll pronun- 
ciation. The entry for "bison" has one alterna- 
tive, but the user must figure out that the /on/ 
should be appended after 't)\]z-/, as in the first pro- 
mmciation, in order to obtain the correct varia- 
tion. Correct pronunciation for "homogeneous" 
relies on the pronunciation of the previous en- 
try, "homogeneity" , and requires the user to sep- 
arate and bring the prefix "homo-" from one en- 
try to another. To complicate matters, the alter- 
native pronunciation for the suffix /n6.-as/-nyas/ 
must also be correctly interpreted by the user. Fi- 
nally, "dentieulate" has a morphologically related 
run-on form "denticulated" in tile early part of 
the entry, and the pronunciation of that run-on is 
related to the main entry, but the user must de- 
cide how to strip and append the given syllables. 
2 While these types of reasoning are not difficult 
for humans, for whom the dictionary was written, 
they are quite difficult for programs, and thus are 
not straighforward to perform automatically. 
2.1 Using the MRD pronunciation 
field 
Extracting the prommciation field from an MRD is 
one of the most obvious uses of a dictionary. Nev- 
ertheless, parsing dictionaries in general can be a 
very complex operation (\[16\]) and even the extrac- 
tion of one field, such as prommciation, can pose 
problems. Similar to W7, in the Robert French 
dictionary \[9\], which contains about 89,000 entries, 
several pronunciations can be given for a head- 
word and the choice of one must be made. More- 
over, because of the rich morphology of French 
1 Notice, however, that the fifll Collins Spanish-English 
dictionary \[7\], as opposed to the other bilinguals, does not 
contain any prommciatlon information. Although this is 
rather surprising taking into account that the smaller ver- 
si ...... h as the paperback and g .... (\[8\], \[lO\]) do 1 ..... 
phonetic field, it could be attributed to the fact that pro- 
mmciation miles in Spanish are relatively predictable. 
\[2\] reports on the need to resyllablfy entries already syl- 
labified in LDOCE \[18\], since syllable boundaries for writ- 
ten forms usually reflect hyphenation conventions, rather 
than phonologically motivated syllabification conventions 
necessary for pronunciation. 
which has a rough ratio of eight morphologically 
inflected words for one baseform, Robert lists only 
the non-inflected forms of the lexical entries. Itow- 
ever, if pronunciation varies during inflection of 
nouns and adjectives, the pronunciation field re- 
flects that variation which makes the information 
difficult to extract automatically. For example, in 
(5) and (6), one needs to know the nature of the 
rule to apply in order to relate both forms of the 
adjective. 
(5) blanc, blanche/bl~, blbJ'/adj, et n. 
(6) vif, rive/vif, viv/adj, et n. 
In (5), the masculine/bl~/is obtained by remov- 
ing the phoneme /J'/ from the feminine /bl~,j'/ 
(blanche, "white" ). In (6), the form masculine 
fornr/vif/("sharp, qnick") is formed by stripping 
the affix /ve/ and substituting the phoneme /f/. 
Notice that tile rules are different in nature, the 
first being a addition/deletion relation, and the 
second being a substitution. 
In this project, the dictionary pronunciation 
field was used to start building the phonetic inven- 
tory of a speech synthesis system. For the French 
TTS system \[?\], the set of diphones was estab- 
lished by taking most of the thirty-flve phonemes 
for French and coupling them with each other (352 
= 1225 pairs). Then, the diphones were extracted 
from the pronunciation field for headwords in the 
Robert dictionary. A program was written to 
search through the dictionary phonetic field and 
select the longest word where the phoneme pairs 
would be in mid-syllable position. For example, 
the phonemic pair/lo/was found in the pronun- 
ciation field/zoolo3ik/corresponding to the head- 
word zoologiquc "zoologic." 
Out of 1225 phonemic pairs, 874 words were 
fonnd with at least one occurence of the pair. 
The pair \[headword_orth, headword_phon\] was ex- 
tracted and headword_orth was placed in a carrier 
sentence for recording. For instance, the speaker 
would utter the following sentence: "C'est zo- 
ologique que je dis" where "C'est ... que je dis" 
is the carrier sentence. Due to the lack of explicit 
inflectional information for nmms and adjectives, 
only the non-inflected forms of the entries were ex- 
tracted during dictionary lookup for building tile 
diphone table. Similarly for verbs, only the infini- 
tive forms were used since the dictionary does not 
list the inflected forms as headwords. This exem- 
plifies the most simple way to use pronunciation 
field data, which we have completed. A pronun- 
ciation list of around 85,796 phonetic words was 
obtained from the original list of ahnost 89,000 en- 
tries, i.e. 96% of the entries. The remaining 4% 
consist primarily of prefixes and suffixes which are 
listed in the dictionary without pronunciations, 
972 
and which should not be used in isolation in arty 
ease. 
2.2 Using the MRD for morphology 
Even though an MRI) may not list complete in- 
tlectional paradigms, it contains useful inflectional 
information. For example in the Collins Spanish- 
English dictionary, verb entries are listed with an 
index pointing to the conjugation chess and table, 
listed at the end of the dictionary. Using this infer 
mation, a finite-state transducer for morphological 
analysis and generation was built for Spanish \[20\]. 
From the original list of over 50,000 words, a few 
million words have been generated. These forms 
can then be used as tile input to the grapheme- 
to-phoneme conversion module, in ;t Spanish TTS 
system. 
2.3 Using Run-on's 
A run-on is defined as a morphological variant of 
a headword, included in the entry. Run-on's are 
problematic data in MRI)s \[16\], and they can be 
found nearly anywhere in the entry. In example 
(4), the run-on occurs at the beginning of the en- 
try, and consists of a fitll form with suffix. More 
commonly, run-on's occur towards the end of the 
entry, and tend to consist of predictable suttixa- 
tion, that is, class II or neutral suttixes \[19\] , such 
as :hess, dy, or -er, ~s in: 
(7) sharp adj .... sharp.ly adv sharp.hess n 
(8) suc.ces.sion n .... suc.ces.sion.al adj 
snc.ces.sion.al.ly adv 
In cases where stress is changed with class I non- 
neutral sultixes, a separate prououneiation is given 
as in: 
(9) gy.ro.scope /'ji-ra-,skSp/ n .... 
gy.ro.scop.ie /ji-ra-'sk~p-ik/ adj- 
gy.ro.s('ol,.i.cal.ly/d-k(a-)le/ adv 
The run-on form with part-of-speech is given in.- 
side the entry, so it could be used for morphologi.- 
eel analysis, tIowever, since proton|elation is usu- 
ally predictable from the headword (i.e. there is 
usually no stress change, and if there is a change, 
this is explicitly indicated) the run-on pronuncia- 
tion often consists of a truncated form, requiring 
some logic for reconstruction of the entire pronun- 
ciation. Again, this may be obvious to the human 
user, but rather complex to tigure out by l)rogram. 
'l'hus, the run-on may be nsefld for Inorpl|ology, 
but is not ms useful h)r automatic pronunciation 
extraction. 
3 Methodology and Results 
3.1 Collecting Data 
As stated al)ove, out of ahnost 89,000 headwords 
in the dictionary, 874 phonemic pairs,which repre- 
sents 71% of the total, were found. This is due to 
the fact that (a) the lookup occurs only on non: 
inflected words, thus a limited sample of the lan- 
guage, (b) because the dictionary consists of a list 
of isolated words, it does not aceonnt for inter: 
word boundary phenomena. Sitme French liaison 
plays such all important role in tile phonology of 
French, a look at phonetic data from a corpus mnst 
be giw'n in order to achieve fnll coverage. A per 
tion of the llansard French corpus (over 2.3 million 
words) wa.s used h)r this purpose. Graplmme-to- 
phoneme software \[14\] was utilized in order to con- 
vert l"rench orthography into phonemes. For the 
sake of comparison, both the phonetic transcrip: 
tion from the corpus and the one from the MRI) 
were converted into a unique set of i)honemes. 
Typical outt>ut front the dictionary looks like: 
ABACA \[abaka\] n. m. 
ABASOURI)IR \[abazuRdiR,\]; \[abasuRdiR\] 
AI~ASOUR1)ISSANT, ANTI" \[abazu RdisA, At 
AI~NI"I'EUt{, EUSE \[abat8R, 7z\] n. 
ABCE'S \[absE; apsE\] n. m. 
ABDOMINAL, ALE, AUX \[abd>minal, el adj. 
ABI)OMINO- 
ABDUCTION \[abdyksjO\] n. f. 
A small sample of the tlansard followed by tile 
ascii transcription is shown below: 
l)re'sident de la Compagnie d'Ame',mgement du 
barreau de. 
X Monsieur X I)e'pute' anrien Ministre Pre'sident 
du Conseil. 
prezidA d& la kOpaNi d amenaZmA dy bare d& 
iks m&sju dis depyte Asjl ministr prezid dy kOsEj 
As all experhnent, we compared triphones ex- 
tracted fronl dictionary data and corpora. A 
greedy algorithm 3 to locate the most common 
coocurrences between ortlmgraphy and transcrip- 
tion was run on the data sets. A sample of the. 
corpus and dictionary results are given in the Ta- 
ble below. The table shows in the leftmost two 
coh|mns the top twenty triphones and occurring 
frequencies extracted from the Hansard corpus, 
whereas the righthand columns show dictionary 
results. Notice the discrepancy between tlmse 
lists; for the top twenty triphones, there are only 
3 We thank Jan van Smlt.en for providing this software. 
9Z~ 
two overlaps, sjO and jO*. The levels of common- 
ality between the triphones of the tIansard and the 
dictionary (5% of commonality for the top 100 tri- 
phones and 15% of commonality for the top 1000 
triphones) is interesting to observe. 
Hansard data 
54580 ~O 38745 
53948 jO* 38707 
47339 par 35052 
44328 asj 30389 
44065 prL 39722 
43288 tr~ 29093 
41356 &la 28784 
40877 put 26766 
39122 ~mA 25997 
38707 d&l 25378 
*z& 
set 
k>m 
*mE 
re* 
&pr ist 
Eat 
HIA* 
Robert data 
3636 mA* 1725 
3324 ik* 1554 
3223 jO* 1492 
2823 sjO 1462 
2597 te* 1405 
2202 *de 1391 
2105 5sj 1389 
2086 EFt* 1376 
2067 aZ* 1341 
1789 ist 1321 
Table 1: Twenty most frequent triphones 
The preliminary results indicate that the coar- 
ticulatory effects derived from the corpus data will 
be usefnl, in particular for languages like French 
where liaison plays a major role. This remains to 
be tested in the TTS system. 
3.2 Related Work 
Although the statistical analysis of MRDs has fo- 
cussed primarily on definitions and translations, 
\[5\] used the prommciation field as data. A dic- 
tionary of over 110,000 entries containing 51,219 
common words and 59,625 proper nouns, \[17\] was 
used for selecting candidate units that were fur- 
ther utilized in the set of concatenative units (di- 
phones, triphones, and longer milts) for synthe- 
sis. The phonemic string was split according 
to ten language-dependent segmentation princi- 
ples. For example, the word "abacus" \['ab-o-kos\] 
was first transformed into cuttable units as fol- 
lows: \[#'a,'~b,bo,ok,ko,os,s#\]. Once each dictio- 
nary word was split, the duplicates were removed 
and the remaining units formed the set of con- 
catenative units. At the end of this operation, a 
rather long list was obtained that was pruned by 
methods such as reduction of secondary and pri- 
mary stress into one stress in order to keep only 
one +stress/-stress distinction. Techniques were 
shown that allow the selection of a minimal set 
of word pairs for inter-word junctures; every can- 
didate unit inside and across word sequence was 
included. The same strategy was replicated on the 
Collins Spanish-English dictionary by \[6\]. In this 
fashion, the dictionary was used as a sample of the 
language in the sense that it assnmes that most of 
the phonemic combinations of the language were 
present. 
i81H 
ite 
je* 
8111" 
bl* 
Mi 
abl 
tik 
st* 
4 Limitations of MRDs 
The most straightforward way, but in the long 
run not the nlost flexible, is to parse the phonetic 
information out of the prommciation field. The 
)ronunciation field information can generally be 
~onsulted by a TTS system within the grapheme- 
;o-phoneme module. Additional rules for pro- 
:esses such as inter-word assimilation, juncture, 
md prosodic contouring need to be added, since 
solated word pronunciation couhl already be ban- 
tied by look-up table. Although appealing, there 
~re two major drawbacks to this approach: 
(a) dictionary pronunciation fields are often not 
)honetically fine-grained enough for acceptable 
speech output. For example, the pronunciation 
for "inquest" is given ill W7 as /'in-,kwest/, but 
of course the nasal will assimilate in place to the 
velar, giving /i0-kwest/. Without assimilation, 
the perceptual cffect is of two words: "in quest" 
and would be misleading. Again, the human user 
will a.ssimilatc naturally, but a text to speech sys- 
tem must figure out such details, since artieulatory 
ease is not a factor in most synthesis systems. One 
way to solve this problem is to impose such assim- 
ilation on input from the pronunciation field by 
a set of post-processing rules. Although this so- 
lution wouht be correct in the majority of cases, 
blanket application of such rules is not always ap- 
propriate for lexical exceptions. For example, as- 
similation is optional for words like "uncaring", 
in this case related to the morphological structure 
of the lexical item. A TTS system will proba- 
bly already have snch rules since they are inher- 
ent in the graphemc-to-phoneme approach. Thus, 
it could be argued that there is no need for the 
dictionary prommciation, since with a complete 
and comprehensive grapheme-to-phoneme conver- 
sion system, a list which requires post-processing 
is simply inadequate and unnecessary. Tiffs is the 
approach taken, for example, by \[14\], who makes 
use of small word lists (the main dictionary being 
25K stored forms) and several affix tables to recog- 
nize graphemic forms, which arc then transformed 
into phonemic reprcsentations; 
(b) only a small percentage of possible words 
are listed with prommciations in a dictionary. For 
example, Wcbster's Sevcnth contains about 70,000 
headwords, but is missing words like "computer- 
ize" and "computerization" since they came into 
frequent use in the language after the 1963 publi- 
cation date. Two solutions to this problem present 
themselves. One is to expand the word list from 
tile dictionary to include run-on's, as illustrated in 
examples (3) and (4), and discussed in Section 2.3. 
The other is to build a morphological generator, 
974 
using headwords, part of sl)eech , and other in- 
formation as input, discussed in Section 2.2 that 
would be invoked when the word does not tigure 
in tl,e headword list. 
5 Final Remarks 
Although limitations ('lcarly constrain the use of 
MRI)s in TTS, we have demonstrated in this pa- 
per that it is more cost eflqcient to post process 
underspecilie(l dictionary information such as in- 
flection, pronunciation, and part-of-speech, rather 
than generate rules from scratch to arrive at the 
same end point. For speech synthesis, thc data 
is not always perfect, and often must be post- 
processed. This paper h~us demonstrated ways we 
have successfully used dictionary data in 'FTS sys- 
tems, ways wc have post-processed data to make 
it morc useful, and ways data Camlot bc easily 
post-processed or used. 
Of course, for any TTS system, the power of 
the dictionary data can be found at the lexical, 
t)hrmqal, and idiom level. Although any word 
list such ,-Ls a dictionary is by definition closed, 
whereas language is open-ended, dictionary data 
has proven to be usefid from both a theoretical 
and practical point of view. 
References 
\[1\] Branimir \]loguraev, Roy Byrd, Judith Klavans, 
anti Mary Neff. From machine readable dictionar- 
ies to a lexical knowledge base. Detroit, Michi- 
gan, 1989. First International l,exical Acquisition 
Workshop. 
\[2\] David Carter. I,doce and speech recognition. 
in Branimir Boguraev and Ted llriseoe, ed- 
itors, Computational Lexicography fl~r Natural 
Language Processing, chapter 6, pages 135--152. 
Longman, Burnt tlill, llarlow, Essex, 1989. 
\[3\] Martin S. Chodorow, Roy J. Byrd, and George E. 
Iteidorn. Extracting scmantic hierarchies from a 
large on-line dictionary. In Procccdinqs of the 23rd 
Annual Meeting of the Association for Computa- 
tional Linguistics, pages 299 304. Association for 
Computational Linguistics, 1985. 
\[4\] Paul Cohen. Spelling to sound conversion for text 
to speech. 1982. 
\[5\] John Coleman. Computation of candidate syn- 
thesis units. In 112~22-930719-07TM, Mnrray llill, 
N.J., USA, 19.(13. 'technical Memorandum, AT& 
Bell Lahoratories. 
\[6\] John Coleman and Pilar Prieto. Accurate pro- 
nunciation rules for american spanish text-to- 
speech. In 11222-930719-06TM, Murray Ilill, 
N.J., USA, 19!13. q~chnical Memorandum, AT& 
Bell l,aboratories. 
\[7\] Collins Spanish Dictionary: Spanish-English. 
Collins Publishers, Glasgow, 1989. 
\[8\] P-II. Cousin, L. Sinclair, J-F. Allain, and C. E. 
Love. The Collins Paperback French Dictionary: 
t4"eneh- English. English-French. Collins Publish- 
ers, l,ondon, 1989. 
\[9\] Main l)uval et al. Robert Encyclopedic Dictionary 
(CD-ROM). \[Iachettc, Paris, 1992. 
\[10\] M. (;onzMes. Collins Gem Spanish Dictionary: 
l'i'ench-English. English-French. Harper ColliNs 
lhlblishers, London, 1990. 
\[11\] Judith Klavans and Sara Basson. l)ocnmentation 
of letter to sound components of the WALR.US text 
to spe(;eh system. 1984. 
\[12\] Judith l(lavans and Evclyne Tzoukermann. The 
bicord system: (~omt)ining lexical information 
front bilingual corpora and machine readable die-. 
tionarles. In I¥ocecdings of the 131h Interna- 
tional Confcrencc on Computational Linguistics, 
llelsinki, Finland, 1990. 
\[13\] Judilh L. Klavans, Martin S. Chodorow, and 
Nina Wacholder. From dictionary to knowledge 
base via l~txonomy. (;entre for the New Oxford 
Fnglish I)ictionary and q~xt Research: Electronic 
Texl |leseareh, University of Waterloo, Canada, 
1990. l)rocecdings of the Sixth Conference of the 
University of Waterloo. 
\[14\] 1". Marty. Trois systbmes inforlnatiques de 
transcription I)hondtique et graph6mique. Lc 
l'¥ancais Modcrnc, LX, 2:179 197, 1992. 
\[15\] Mcrriam. Wcbstcr's Seventh New Collegiate Dic- 
tionary. G.&~ C. Mcrriam, Springficld, M~s., 
1963. 
\[16\] M. Netf and B. Boguraev. Dictionaries, dictio- 
nary gl'ammars and dictionary entry parsing. In 
t'rocecdings of the 271h Annual Meeting of the 
Association for Computational Linguistics, Van- 
couver, Canada, 1989. Association for Computa- 
tional l,inguistics. 
\[17\] Olive Joe P. and Mark Y. l,iberman. A set ofcon- 
catenative units for speech synthesis. In In J. J. 
Wolf and 1). II. Klatt, editors, Speech Commu- 
nication t)apcrs Prcscntcd at the 971h Mccting of 
tin Acoustical Society of America, pages 515-518, 
New York: American Institute of Physics, 1979. 
\[18\] Paul Procter, editor. Longman Dictionary of 
Contemporary English. l,ongman Group, Burnt 
llill, ttarlow, l!'ssex: Longnran, 1978. 
\[19\] F, lisabcth O. Selkirk. The Syntax of Words. MIT 
Press, Cambridge, Mass., 1982. 
\[20\] l",velync '\['zonkcrnutnn and Mark Y. l,iberman. 
A finite-state morphological processor for span- 
ish. In Procccdings of ColinggO, \]Iclsinki, l"inland, 
1990. International Conference on Computational 
Linguistics. 
9Z5 
