Unification Phonology: Another look at "synthesis-by-rule" 
John Coleman 
Experimental Phonetics Laboratory 
Department of Language and Linguistic Science
University of York
Heslington
YORK
YO1 5DD
United Kingdom
e-mail: JANET%UK.AC.YORK.VAX::JSC1
Transformational grammars and "synthesis-by-rule"

Most current text-to-speech systems (e.g. Allen et al.
1987; Hertz 1981, 1982, forthcoming; Hertz et al. 1985)
are, at heart, unconstrained string-based transformational
grammars. Generally, text-to-speech programs are
implemented as the composition of three non-invertible
mappings:
1. grapheme to phoneme mapping (inverse spelling
rules + exceptions dictionary)
2. phoneme to allophone mapping (pronunciation rules)
3. allophone to parameter mapping (interpolation rules)
For example:
    [ph]                   pit
    [p']  <--  /p/  <--    sip
    [p-]                   spit
    allophones <-- phonemes <-- graphemes
    h denotes strong release of breath (aspiration)
    ' denotes slight/weak aspiration
    - denotes no aspiration
These mappings are usually defined using rules of
the form A → B / C __ D, e.g. (1), usually called
"context-sensitive", but which in fact define
unrestricted rewriting systems, since B may be the empty
string (Gazdar 1987). It should be recalled that "if
all we can say about a grammar of a natural language
is that it is an unrestricted rewriting system, we have
said nothing of any interest" (Chomsky 1963:360).
(1)       p → p- / s __
     else p → ph / __ V
          (where V is any vowel symbol)
     else p → p'
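The ordered rules in (1) can be sketched as a small Python function. This is only an illustration of how such a rule block is typically applied to a phoneme string: the function name `realise_p` and the toy vowel set are assumptions, not part of any system cited here.

```python
VOWELS = set("aeiou")  # assumption: a toy vowel inventory

def realise_p(phonemes: str) -> list[str]:
    """Apply the ordered rules in (1) to every /p/ in a phoneme string."""
    out = []
    for i, seg in enumerate(phonemes):
        if seg != "p":
            out.append(seg)
            continue
        prev_seg = phonemes[i - 1] if i > 0 else ""
        next_seg = phonemes[i + 1] if i + 1 < len(phonemes) else ""
        if prev_seg == "s":          # p -> p- / s __
            out.append("p-")
        elif next_seg in VOWELS:     # else p -> ph / __ V
            out.append("ph")
        else:                        # else p -> p'
            out.append("p'")
    return out

print(realise_p("pit"), realise_p("sip"), realise_p("spit"))
```

The "else" ordering matters: in "spit" the /p/ also precedes a vowel, but the first rule has already applied.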
Often, of course, grammars made with rules of this
type may be (contingently) quite restricted. For
instance, if the rules apply in a fixed order without
cyclicity, they may be compiled into a finite-state
transducer (Johnson 1972). But in general there
is no guarantee that a program which implements
such a grammar will halt. This would be pretty
disastrous in speech recognition, and is undesirable
even in generation-based applications, such as
text-to-speech. However, this has not prevented the
appearance of a number of "linguistic rule compilers"
such as Van Leeuwen's (1987, 1989) and Hertz's systems.
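The halting problem mentioned above can be illustrated with a toy rewriting loop. The sketch below is hypothetical (the function `rewrite` and its step budget are assumptions made for illustration); it shows one rule set that halts and one that would cycle forever were it not for the budget.

```python
def rewrite(s, rules, max_steps=20):
    """Repeatedly apply the first applicable (lhs, rhs) rule at its leftmost match.
    Returns (string, halted); halted is False if the step budget ran out."""
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)
                break
        else:
            return s, True   # no rule applies: the derivation has halted
    return s, False

print(rewrite("sandwich", [("nd", "n")]))             # one elision, then halts
print(rewrite("burnt", [("ur", "ru"), ("ru", "ur")])) # two metatheses undo each other
```

Nothing in the rule formalism itself rules out the second rule set; only an external restriction (such as fixed ordering without cyclicity) does.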
The basic operations of a transformational grammar
(deletion, insertion, permutation, and copying)
are apparently empirically instantiated by such
well-established phonological phenomena as elision,
epenthesis, metathesis, assimilation and coarticulation.
Copying (i): Assimilation
e.g. 1  ran          [n]
        ran quickly  [ŋ]
Rule: n → ŋ / __ {k, g}
[ŋ] denotes back-of-tongue (velar) nasal closure
e.g. 2  sandwich  [samwitʃ]
Rule: n → m / __ {p, b, w etc.}
Copying (ii): Coarticulation
e.g. keep  [k̟]
     cool  [kʷ]
     cart  [k̠]
Rules: k → k̟ / __ V[-back], and similarly for the
rounded and retracted variants
subscript + denotes advanced articulation
superscript w denotes lip-rounding
subscript bar denotes retracted articulation
Insertion: Epenthesis
e.g. mince  [mints]
     pence  [pents]
Rule: ns → nts
Deletion: Elision
e.g. sandwich  [sanwitʃ]
Rule: nd → n
Permutation: Metathesis
e.g. burnt  [brunt]
Rule: ur → ru
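For concreteness, the four transformational operations can be written as string rewrites. The following is a minimal sketch using Python regular expressions; the ASCII stand-ins ("N" for the velar nasal, "S" for the final fricative of "sandwich") and the names `RULES` and `apply_rule` are assumptions made for the example.

```python
import re

# Each rule is (pattern, replacement) over a plain ASCII transcription.
RULES = {
    "assimilation": (r"n(?=[kg])", "N"),   # copying:     n -> N / __ {k, g}
    "epenthesis":   (r"ns", "nts"),        # insertion:   ns -> nts
    "elision":      (r"nd", "n"),          # deletion:    nd -> n
    "metathesis":   (r"ur", "ru"),         # permutation: ur -> ru
}

def apply_rule(name, s):
    """Rewrite every match of the named rule in the string s."""
    pattern, replacement = RULES[name]
    return re.sub(pattern, replacement, s)

print(apply_rule("assimilation", "rankwikli"))
print(apply_rule("epenthesis", "mins"))
print(apply_rule("elision", "sandwitS"))
print(apply_rule("metathesis", "burnt"))
```

Each rule destroys part of its input string, which is exactly the property the declarative account below avoids.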
The problems inherent in this approach are many:
1. Deletion rules can make Context-Sensitive grammars
undecidable (Salomaa 1973:83, Levelt 1976:243,
Lapointe 1977:228, Berwick and Weinberg 1984:127).
2. Non-monotonicity makes for computational
complexity.
3. There is no principled¹ way of limiting the domain
of rule application to specific linguistic domains,
such as syllables.
4. Using sequences as data-structures is really only
plausible if all speech parameters change with
more-or-less equal regularity.

¹ N.B. The use of labelled brackets to delimit domains is
a completely unrestricted mechanism for partitioning strings.
e.g. "chip"

                 Syllable
                /        \
           Onset          Rime
             |           /    \
         Affricate   Nucleus   Coda
          /     \       |        |
     Closure Friction Vowel   Closure
        t  ...  sh  ...  i  ...  p

Figure 1: Richer structure in phonological representations
In partial recognition of some of these problems,
phonologists have been attempting to reconstruct the
transformational component as the epiphenomenal
result of several interacting general "constraints".
Numerous such "constraints" and "principles" have
been proposed, such as the Well-Formedness Condition
(Goldsmith 1976 and several subsequent formulations),
the Obligatory Contour Principle (Leben
1973), Cyclicity (Kaisse and Shaw 1985, Kiparsky
1985), Structure-Preservation (Kiparsky 1985), the
Elsewhere Condition (Kiparsky 1973) etc. While
this line of research is in some respects conceptually
cleaner than primitive transformational grammars,
there has been no demonstration that a
"principle"-based phonology is indeed more restrictive
than primitive transformational phonology in any
computationally relevant dimension.
A declarative model of speech

For the last few years, I have been developing a
"synthesis-by-rule" program which does not employ such
string-to-string transformations (Coleman and Local 1987
forthcoming; Local 1989 forthcoming; Coleman 1989).
The basic hypothesis of this (and related) research is
that there is a trade-off between the richness of the
rule component and the richness of the representations
(Anderson 1985). According to this hypothesis,
transformational phonology needs to use transformations
because its data structure, strings, is too simple.
Consequently, it ought to be possible to simplify
considerably, or even eliminate completely, the
transformational rule component by using more elaborate
data structures than just well-ordered sequences of
letters or feature-vectors. For instance, if we use
graphs (fig. 1) to represent phonological objects, then
instead of copying, we can implement harmony phenomena
using the structure-sharing technique.

    Tongue-back: ANY
          |
    Tongue-tip: CLOSURE
          |
    Nasality: +

    Tongue-back: CLOSURE           Tongue-back: CLOSURE
          |                                 |
    Tongue-tip: ANY        <=>     Tongue-tip: CLOSURE
          |                             /        \
    Nasality: -                 Nasality: +   Nasality: -
          k                          ŋ             k

Figure 2: Declarative characterisation of assimilation
Incorporating richer data-structures allows many if
not all rewriting rules to be abandoned, to the extent
that the transformational rewrite-rule mechanism
can be ditched, along with the problems it brings.
Consider how the "processes" discussed above can
be given a declarative (or "configurational") analysis.
Allophony can be regarded as the different interpretation
of the same element in different structural contexts,
rather than as involving several slightly different
phonological objects instantiating each phoneme.

      Onset      Coda       Onset
        |          |        /   \
        p          p       s     p
      [ph]       [p']          [p-]
    Aspirated  Slightly    Unaspirated
               aspirated
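One way to picture this is as a function that reads the phonetic value of /p/ off its structural position and leaves the representation untouched. The sketch below is illustrative only; the dictionary layout of a syllable and the name `interpret_p` are assumptions.

```python
def interpret_p(syllable):
    """Return the phonetic interpretation of each /p/ from its position alone.
    The syllable structure itself is never modified."""
    result = []
    for position in ("onset", "coda"):
        segs = syllable.get(position, [])
        for i, seg in enumerate(segs):
            if seg != "p":
                continue
            if position == "coda":
                result.append("p'")      # slightly aspirated
            elif i > 0 and segs[i - 1] == "s":
                result.append("p-")      # unaspirated after onset s
            else:
                result.append("ph")      # aspirated
    return result

pit = {"onset": ["p"], "nucleus": ["i"], "coda": ["t"]}
sip = {"onset": ["s"], "nucleus": ["i"], "coda": ["p"]}
spit = {"onset": ["s", "p"], "nucleus": ["i"], "coda": ["t"]}
print(interpret_p(pit), interpret_p(sip), interpret_p(spit))
```

There is a single phonological object "p" throughout; only its interpretation varies with context.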
Assimilation can also be modelled non-destructively 
by unification (fig. 2). 
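A minimal sketch of this idea follows, using the feature names of fig. 2. It is not the YorkTalk implementation: the functions `unify` and `share_place`, and the choice to group the two tongue features under a "place" key, are assumptions made for illustration.

```python
ANY = "ANY"  # underspecified value, compatible with any other value

def unify(fs1, fs2):
    """Unify two flat feature structures without modifying either."""
    out = dict(fs1)
    for key, value in fs2.items():
        current = out.get(key, ANY)
        if current == ANY:
            out[key] = value
        elif value != ANY and value != current:
            raise ValueError(f"clash on {key}: {current} vs {value}")
    return out

def share_place(first, second):
    """Make two adjacent elements share one unified place specification."""
    place = unify(first["place"], second["place"])   # raises if incompatible
    return ({"place": place, "nasality": first["nasality"]},
            {"place": place, "nasality": second["nasality"]})

n = {"place": {"tongue_back": ANY, "tongue_tip": "CLOSURE"}, "nasality": "+"}
k = {"place": {"tongue_back": "CLOSURE", "tongue_tip": ANY}, "nasality": "-"}
nasal, stop = share_place(n, k)
print(nasal["place"], nasal["place"] is stop["place"])
```

The inputs `n` and `k` are left intact, and the result contains one place structure referenced twice rather than two copies, which is the sense in which nothing is rewritten.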
Coarticulation is simple to model if parametric phonetic
representations may be glued together in parallel,
rather than simply concatenated. Consonants
may then be overlaid over vowels, rather than simply
concatenated to them (Öhman 1966, Perkell 1969,
Gay 1977, Mattingly 1981, Fowler 1983). If required,
this analysis can also be implemented in the phonological
component, using graphs of the 'overlap' relation
(Griffen 1985, Bird and Klein 1990), e.g.:

       ii          uu          aa
      /  \        /  \        /  \
     k    p      k    l      k    t
    [k̟]         [kʷ]        [k̠]
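The overlay idea can be sketched with parameter tracks represented as lists of frame values. The parameter names, the frame values and the helper names below are hypothetical; the point is only that the consonant specifies the parameters it cares about and the vowel shows through everywhere else.

```python
def overlay(base, patch, start=0):
    """Return a copy of base with patch laid over it from frame index start.
    Assumes the patch fits inside the base track."""
    out = list(base)
    out[start:start + len(patch)] = patch
    return out

def overlay_tracks(vowel, consonant, start=0):
    """Overlay a consonant's tracks on a vowel's, parameter by parameter."""
    return {name: overlay(track, consonant.get(name, []), start)
            for name, track in vowel.items()}

# Hypothetical tracks for "cool": the vowel spans the whole syllable.
vowel_uu = {"lip_rounding": [1.0] * 4, "tongue_back_closure": [0.0] * 4}
k = {"tongue_back_closure": [1.0, 1.0]}   # says nothing about the lips

print(overlay_tracks(vowel_uu, k))
```

During the two frames of velar closure the lips are already rounded, so the rounded variant of [k] falls out of the overlay without any copying rule.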
It is now common to analyse epenthesis, not as the
insertion of a segment into a string, but as due to
minor variations in the temporal coordination of
independent parameters (Jespersen 1933:54, Anderson
1976, Mohanan 1986, Browman and Goldstein 1986)
(fig. 3).

    Closure    Friction              Closure    Friction
       |          |         <=>      /     \      /
    Nasality  Non-nasality      Nasality   Non-nas.
       n          s                 n     t     s

Figure 3: Declarative characterisation of epenthesis

    Closure    Non-clo.              Closure    Non-clo.
     /     \     /          <=>         |          |
    Nasality  Non-nas.              Nasality   Non-nas.
       n    d    w                      n          w

Figure 4: Declarative characterisation of elision
It has been demonstrated (Fourakis 1980, Kelly and
Local 1989) that epenthetic elements are not phonetically
identical to similar non-epenthetic elements.
The transformational analysis, however, holds that
the phonetic implementation of a segment is dependent
on its features, not its derivational history
('a [t] is a [t] is a [t]'), and thus incorrectly predicts
that an epenthetic [t] should be phonetically identical
to any other [t].
Elision is the inverse of epenthesis, and is thus in
some sense "the same" phenomenon, taking the
"unelided" form as more primitive than the "elided"
form, a decision which is entirely meaningless in the
declarative account (fig. 4).
Metathesis is another instance of "the same"
phenomenon, i.e. different temporal synchronisation of
an invariant set of elements. Epenthesis, Elision and
Metathesis may all be regarded as instances of the
more general phenomenon of non-significant variability
in the timing of parallel events.
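The following toy model illustrates the point for the "mince" case. Two independent parameters (oral closure and nasality) each end at their own time, and the apparent segments are just the runs of frames in which the parameter combination is constant. The frame counts, labels and function names are assumptions made for the example.

```python
from itertools import groupby

def label(closed, nasal):
    """Name the percept produced by one combination of parameter values."""
    if closed:
        return "n" if nasal else "t"
    return "s~" if nasal else "s"   # "s~": friction with residual nasality

def segments(closure_end, nasality_end, total=10):
    """Collapse frame-by-frame labels into a sequence of apparent segments."""
    labels = [label(t < closure_end, t < nasality_end) for t in range(total)]
    return [key for key, _ in groupby(labels)]

print(segments(5, 5))   # closure and nasality end together
print(segments(6, 4))   # nasality ends slightly before the closure is released
```

No element is inserted or deleted between the two calls; only the relative timing of the same two events differs.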
Figure 5: Phrase structure grammar of English phoneme strings

Word[+inflected] → Word[-inflected] Inflection
    (Inflection, e.g. cat+s)
Word[-inflected] → Word[-inflected] Word[-inflected]
    (Compounding, e.g. black+bird)
Word[-Latinate] → Prefix*[-Latinate] Word[-inflected] Suffix*[-Latinate]
Word[+Latinate] → Stress[+Latinate] ∘ Morphology[+Latinate]
    (∘ denotes complete constituent overlap)
Stress[+Latinate] → Non_final_feet Foot
Non_final_feet → Foot[+initial] Foot*
Foot → Syllable[+heavy] (Syllable[-heavy])
Foot → Syllable Syllable[-heavy] Syllable[-heavy]
Morphology[+Latinate] → Prefix*[+Latinate] Stem[+Latinate] Suffix*[+Latinate]
Syllable[αheavy] → (Onset) Rime[αheavy]
Onset[αvoi] → Affricate[αvoi]
Onset[-voi] → Aspirate
Onset[αvoi] → (Obstruence[αvoi]) (Glide)
Obstruence[-voi] → ([s]), Closure    (either order)
    Constraint: in onsets, [s] < Closure
Rime[αheavy] → Nucleus[αheavy] (Coda[αheavy])
Nucleus → Peak Offglide
etc.
[Diagram: u and r in alternative temporal alignments]
As well as these relatively low-level phonological 
phenomena, work in Metrical Phonology (Church 
1985) and Dependency Phonology (Anderson and 
Jones 1974) has shown how stress assignment, a 
paradigm example for transformational phonology, 
can be given a declarative analysis. 
Overview of text-to-parameter conversion in 
the YorkTalk system 
1. Each symbol in the input text string is trans- 
lated into a column-vector of distinctive pho- 
netic features (nasal, vowel, tongue-back, etc.) 
Sequences of letters are thus translated into se- 
quences of feature-structures. 
2. The sequence of feature-structures is parsed. 
This process translates the sequence into a 
directed graph representing the phonological 
constituent structure of the utterance. 
3. The phonological structure is traversed and an 
interpretation function applied at each node to 
derive a phonetic parameter matrix. 
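Step 1 of this overview amounts to a table lookup. As a rough sketch, with an invented four-symbol table (the feature values shown are assumptions, not the YorkTalk feature set):

```python
ANY = "ANY"

# Hypothetical symbol-to-feature table using the feature names mentioned above.
FEATURES = {
    "p": {"vowel": False, "nasal": False, "tongue_back": ANY},
    "k": {"vowel": False, "nasal": False, "tongue_back": "CLOSURE"},
    "n": {"vowel": False, "nasal": True,  "tongue_back": ANY},
    "i": {"vowel": True,  "nasal": False, "tongue_back": ANY},
}

def to_feature_structures(text):
    """Translate each input symbol into its own copy of a feature structure."""
    return [dict(FEATURES[symbol]) for symbol in text]

for structure in to_feature_structures("pin"):
    print(structure)
```

The output is still a sequence at this stage; it is the parser in step 2 that replaces the sequence with a graph.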
Parsing is done using a Phrase Structure Grammar
of phoneme strings. A very simplified version of such
a grammar is given in fig. 5. I have implemented several
such grammars so far, including a DCG implementation
and a PATR-II-like implementation. With one or
two simple extensions to the grammar formalism, it
is also possible to parse re-entrant (e.g. ambisyllabic)
structures and other overlapping structures, such as
those arising from bracketing "paradoxes". The
resulting graphs are thus not trees, but directed acyclic
graphs.
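To give a flavour of the parsing step, here is a deliberately tiny hand-written parser for just two rules of the kind shown in fig. 5, Syllable → (Onset) Rime and Rime → Nucleus (Coda). It is a sketch under strong assumptions: a toy vowel set, one syllable per input, and a plain tree as output rather than the re-entrant graphs described above. It is neither the DCG nor the PATR-II-like implementation.

```python
VOWELS = set("aeiou")  # assumption: toy vowel inventory

def parse_syllable(phonemes):
    """Parse one syllable: Syllable -> (Onset) Rime, Rime -> Nucleus (Coda)."""
    i = 0
    while i < len(phonemes) and phonemes[i] not in VOWELS:
        i += 1                       # leading consonants form the onset
    j = i
    while j < len(phonemes) and phonemes[j] in VOWELS:
        j += 1                       # following vowels form the nucleus
    if i == j:
        raise ValueError("no nucleus found")
    if any(ch in VOWELS for ch in phonemes[j:]):
        raise ValueError("more than one syllable")
    rime = ("Rime", ("Nucleus", phonemes[i:j]))
    if phonemes[j:]:
        rime += (("Coda", phonemes[j:]),)
    tree = ("Syllable",)
    if phonemes[:i]:
        tree += (("Onset", phonemes[:i]),)
    return tree + (rime,)

print(parse_syllable("spit"))
```

Once such a structure exists, every later stage can refer to "the onset" or "the coda" directly instead of to positions in a string.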
In computational syntactic theory, one of the main 
uses for the parse-tree of a string is to direct the 
construction of a compositional (Fregean) semantic 
interpretation, according to the rule-to-rule hypoth- 
esis (Bach 1976). In the YorkTalk system, the same 
approach is employed to assign a phonetic interpre- 
tation to the phonological representation. A sec- 
ond, theory-internal motivation for constructing rich 
parse-graphs of the phonemic string is that it enables 
the phoneme string to be discarded completely, thus 
liberating the phonetic interpretation function from 
the sequentiality and other undesirable properties of
segmental strings.
After the phonological graph has been constructed
by the parser, a head-first graph-traversal algorithm
maps the (partial) phonological category of each
node into equations describing the time-dependent
motion of the synthesis parameters for specified
intervals of time. These parametric time-functions are
finally instantiated with actual numbers representing
times, in order to derive a complete matrix of
(parameter, value) pairs.
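The two stages described here (traversal, then instantiation) might be sketched as follows. Everything concrete in the example is hypothetical: the node layout, the parameter name "f1", the time spans and the functions are invented to show the control flow, not taken from YorkTalk.

```python
def traverse(node, tracks):
    """Head-first traversal: collect (start, end, function) pieces per parameter,
    visiting a node's head daughter before its other daughters."""
    start, end = node["span"]
    for param, func in node.get("exponents", {}).items():
        tracks.setdefault(param, []).append((start, end, func))
    if "head" in node:
        traverse(node["head"], tracks)
    for daughter in node.get("others", []):
        traverse(daughter, tracks)
    return tracks

def instantiate(tracks, times):
    """Evaluate the time-functions at actual times; later pieces overlay earlier ones."""
    matrix = []
    for t in times:
        row = {}
        for param, pieces in tracks.items():
            for start, end, func in pieces:
                if start <= t < end:
                    row[param] = func(t)
        matrix.append((t, row))
    return matrix

# Hypothetical syllable: the rime (head) sets a steady value, the onset overlays a rise.
rime = {"span": (0, 300), "exponents": {"f1": lambda t: 300}}
onset = {"span": (0, 100), "exponents": {"f1": lambda t: 200 + t}}
syllable = {"span": (0, 300), "head": rime, "others": [onset]}

print(instantiate(traverse(syllable, {}), [50, 150]))
```

Because heads are interpreted first, non-head constituents are laid over them, which mirrors the overlaying of consonants on vowels discussed earlier.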
As well as being computationally "clean", this
method of synthesis has the additional merit of being
genuinely non-segmental in (at least) two respects:
there are no segments in the phonological representations,
and there is no cross-parametric segmentation
in the phonetic representations. The resulting speech
does not manifest the discontinuities and rapid
cross-parametric changes which often cause clicks, pops,
and the other disfluencies which typify some synthetic
speech. On the contrary, the speech is fluent,
articulate and very human-like. When the model
is wrong in some respect, it sounds like a speaker
of a different language or dialect, or someone with
dysfluent speech. For all these reasons, the YorkTalk
model is attracting considerable interest in the
speech technology industry and research community, a
circumstance which I hope will promote a widespread
change of approach to computational phonology in
future.

References 

[1] Allen, J., S. Hunnicutt and D. Klatt. 1987. From
Text to Speech: The MITalk System. Cambridge
University Press.

[2] Anderson, J. and C. Jones. 1974. Three theses
concerning phonological representations. Journal
of Linguistics 10, 1-26.

[3] Anderson, S. R. 1976. Nasal Consonants and the
Internal Structure of Segments. Language 52(2),
326-344.

[4] Anderson, S. R. 1985. Phonology in the Twentieth
Century. University of Chicago Press.

[5] Bach, E. 1976. An extension of classical
transformational grammar. Problems in Linguistic
Metatheory, Proceedings of the 1976 Conference
at Michigan State University, 183-224.

[6] Berwick, R. C. and A. S. Weinberg. 1984.
The Grammatical Basis of Linguistic Performance:
Language Use and Acquisition. Cambridge,
Massachusetts: M. I. T. Press.

[7] Bird, S. and E. Klein. 1990. Phonological
Events. To appear in Journal of Linguistics 26(1).

[8] Browman, C. P. and L. Goldstein. 1986. Towards
an articulatory phonology. Phonology
Yearbook 3, 219-252.

[9] Chomsky, N. 1963. Formal Properties of Grammars.
In R. D. Luce, R. R. Bush and
E. Galanter, eds. Handbook of Mathematical
Psychology Vol. II. New York: John Wiley.

[10] Church, K. 1985. Stress Assignment in Letter
to Sound Rules for Speech Synthesis. In 23rd
Annual Meeting of the Association for Computational
Linguistics Proceedings.

[11] Coleman, J. S. and J. K. Local. 1987 forthcoming.
Monostratal Phonology and Speech Synthesis.
To appear in C. C. Mock and M. Davies
(eds.) In press. Studies in Systemic Phonology.
London: Francis Pinter.

[12] Coleman, J. S. 1989. The Phonetic Interpretation
of Headed Phonological Structures Containing
Overlapping Constituents. ms. (Currently
submitted to Phonology)

[13] Fourakis, M. S. 1980. A Phonetic Study of Sonorant
Fricative Clusters in Two Dialects of English.
Research in Phonetics 1, Department of
Linguistics, Indiana University.

[14] Fowler, C. A. 1983. Converging Sources of Evidence
on Spoken and Perceived Rhythms of
Speech: Cyclic Production of Vowels in Monosyllabic
Stress Feet. Journal of Experimental
Psychology: General 112(3), 386-412.

[15] Gay, T. 1977. Articulatory Movements in VCV
Sequences. Journal of the Acoustical Society of
America 62, 182-193.

[16] Gazdar, G. 1987. COMIT ==> * PATR II. In
TINLAP 3: Theoretical Issues in Natural Language
Processing 3. Position Papers. 39-41.
Association for Computational Linguistics.

[17] Goldsmith, J. 1976. Autosegmental Phonology.
Indiana University Linguistics Club.

[18] Griffen, T. D. 1985. Aspects of Dynamic Phonology.
Amsterdam studies in the theory and history
of linguistic science. Series 4: Current issues
in linguistic theory, vol. 37. Benjamins.

[19] Hertz, S. R. 1981. SRS text-to-phoneme rules: a
three-level rule strategy. Proceedings of ICASSP
81, 102-105.

[20] Hertz, S. R. 1982. From text to speech with
SRS. Journal of the Acoustical Society of America
72(4), 1155-1170.

[21] Hertz, S. R., Kadin, J. and Karplus, K. 1985.
The Delta rule development system for speech
synthesis from text. Proceedings of the IEEE
73(11), 1589-1601.

[22] Hertz, S. R. forthcoming. The Delta programming
language: an integrated approach to non-linear
phonology, phonetics and speech synthesis.
In J. Kingston and M. Beckman, eds. Papers
in Laboratory Phonology I: Between the Grammar
and Physics of Speech. Cambridge University
Press.

[23] Jespersen, O. 1933. Essentials of English Grammar.
London: George Allen and Unwin.

[24] Johnson, C. D. 1972. Formal Aspects of Phonological
Description. Mouton.

[25] Kaisse, E. and P. Shaw. 1985. On the Theory of
Lexical Phonology. Phonology Yearbook 2, 1-30.

[26] Kelly, J. and J. K. Local. 1989. Doing Phonology.
Manchester University Press.

[27] Kiparsky, P. 1973. "Elsewhere" in Phonology.
Indiana University Linguistics Club.

[28] Kiparsky, P. 1985. Some Consequences of Lexical
Phonology. Phonology Yearbook 2, 82-136.

[29] Lapointe, S. 1977. Recursiveness and deletion.
Linguistic Analysis 3, 227-265.

[30] Leben, W. 1973. Suprasegmental Phonology.
Ph.D. dissertation, M. I. T.

[31] Levelt, W. J. M. 1976. Formal grammars and
the natural language user: a review. In A. Marzollo,
ed. Topics in Artificial Intelligence. CISM
courses and lecture notes no. 256. Springer.

[32] Local, J. K. 1989. Modelling assimilation in
non-segmental rule-free synthesis. To appear in
D. R. Ladd and G. Docherty, eds. Papers in
Laboratory Phonology II. Cambridge University
Press.

[33] Mattingly, I. G. 1981. Phonetic Representations
and Speech Synthesis by Rule. In T. Myers,
J. Laver and J. Anderson, eds. The Cognitive
Representation of Speech. North-Holland.

[34] Mohanan, K. P. 1986. The Theory of Lexical
Phonology. D. Reidel.

[35] Öhman, S. E. G. 1966. Coarticulation in
VCV Utterances: Spectrographic Measurements.
Journal of the Acoustical Society of
America 39, 151-168.

[36] Perkell, J. S. 1969. Physiology of Speech Production:
Results and Implications of a Quantitative
Cineradiographic Study. Cambridge,
Massachusetts: M. I. T. Press.

[37] Salomaa, A. 1973. Formal Languages. New
York: Academic Press.

[38] Van Leeuwen, H. C. 1987. Complementation
introduced in linguistic rewrite rules. Proceedings
of the European Conference on Speech Technology
1987 Vol. 1, 292-295.

[39] Van Leeuwen, H. C. 1989. A development tool
for linguistic rules. Computer Speech and Language
3, 83-104.
