1. Introduction 
In this paper, I discuss the role of phonology in the mo- 
delling of speech processing. It will be argued that recent 
models of nonlinear representations in phonology should be 
put to use in speech processing systems (SPS). Models ol 
phonology aim at the reconstruction of the phonological 
knowledge that speakers possess and utilize in speech prow 
cessing. The most important function of phonology in SPS 
is, therefore, to put constraints on what can be expected 
in the speech stream. A second, more specific function re- 
lates to the particular emphasis ot the phonological models 
mentioned above and outlined in § g: It has been realized 
that many SPS do not make sufficient use of the supraseg- 
mental aspects of the speech signal. But it is precisely in 
the domain of prosody where nonlinear phonology has made 
important progress in our insight into the phonological com- 
ponent of language. 
From the phonetic point of view, phonological knowledge 
is higher level knowledge just as syntactic or semantic in- 
formation. But since phonological knowledge is in an obvi- 
ous way closer to the phonetic domain than syntax or se- 
mantics, it is even more surprising that phonological know- 
ledge has been rarely applied systematically in SPS. 
2. Prosodic factors in the variability of speech 
One claim of this paper is that the proper use of phono- 
logy is one key to the successful handling of variability in 
speech. In (l), five versions of a common greeting in Ger- 
man are transcribed in a fairly close manner. 
(1) Guten Morgen a. \[,gu:ton 'm0e.gon\] 
b. \[,gu:t 9 'm~gn\] 
c. \[,gun 'm.~(e)gp\] 
d. \[,g~ 'm~(e)@ 
e. \[n mS~r3\] 
The version (la) is certainly overly careful even for speak- 
ers of the standard language in highly controlled situations. 
But it is precisely in front of the-~ignorant--computer, 
that speakers might revert to a speech mode as the one in 
(Is). It has been noted that speakers talking to a SPS turn 
to careful, hyper-correct speech when repeating utterances 
that the system did not understand (Vaissi~re 1985: 204). If 
a system does not have the means for representing this ve- 
ry explicit form of speech, talking like in (la) is no help for 
the system; in fact, it is even harder to understand 
THE ROLE OF PHONOLOGY IN SPEECH PROCESSING 
Richard Wiese 
Seminar f(ir Allgemeine Sprachwissenschaft 
Universitfit D~isseldorf 
D-4000 D0sseldorf, FRG 
for the system than the less careful speech. The SPS will 
almost necessarily fail to analyze the utterance although 
the speaker has made considerable effort to make himself 
understood. 
On the other side of the scale oI variability, there is re- 
duction, assimilation and even deletion ol sounds, which 
makes speech processing extremely difficult. (lb) might be 
the normative standard language version. Compared to 
(la), the nasal consonants carry the syllabicity of the un- 
stressed syllables. Also the r-sound will be more or less vo- 
calized, and the final nasal will be assimilated to the plo- 
sive. (lc) and (ld) show further reductions in the segmental 
material. I assume that the various processes occur roughly 
in the steps given although nothing hinges on that. it is im- 
portant, however, that the suprasegmental information is 
quite stable over all the reductions in the segmental mate- 
rial. (Is) to (lc) show the same number of syllables (as do 
d and e), and all versions share the same stress pattern. 
The unstressed syllables are the ones that disappear first, 
the syllable with secondary stress is reduced in (le). 
Ti~e conCluSloi\] is that ceductio,,s and omissioi,s h, speed-, 
are such that as much as possible is kept of the supraseg- 
mental structure. Apart from this aspect, the example de- 
monstrates a major problem for a SPS: The signal for what 
is regarded as one utterance can be, even in the abstract 
form given in (1), highly variable and context-dependent. 
It is important to realize that phonology since its begin- 
nings aims at the extraction of the relevant information 
from the speech stream. The concept of distinctive vs. 
predictable and accidental features is a cornerstone for all 
phonological theories. To see how this could be relevant for 
a SPS, we have to look at the structure of such a system. 
3. The structure of a speech processin s~_~s~e_m 
SPS analyze or synthesize speech in order to relate the 
speech signal to an utterance representation (text). The 
text could consist oi the orthographic form of words or 
some other form closer to the representation of words in 
a mental lexicon. It is common for advanced SPS, however, 
to define an intermediate representation between the raw 
signal and the text. This representation, a symbolic code 
for phonetic categories, stands halfway between the unana- 
lyzed signal and the textual or lexical representation. The 
broad structure of a SPS can therefore be depicted as (2). 
608 
(2) Signal 
Symbolic Represen Lation 
.t Tex{ual or lexical Representation 
As a first pass, the symbolic representation can be: seen 
as a phonetic transcription, exemplified in (1). This reveals 
its intermediate nature: I~ codi:\[ies properties o~ the speech 
signal into discrete phonetic categories, but it also contains 
idiosyncratic teatures that are not part of the lexical re- 
preseneations or el the representation el the utterance. 
The role ol the symbolic representation in SPS can be 
illustrated as \[allows. In speech recognition, it serves as a 
meeting-poinl: :for the two I<inds of procedures called upon 
in systems of this kind. For bottom-up analysis of d~e sig- 
nal, results art.* outputted as pieces of fl~e symbolic repre- 
sentation. For top-down procedures, i.e., hypotheses about 
what might occur in the signal, the output is again some 
piece of the representation. The requirements and possibi- 
lities :\[or bottom-up and top-down analysis define to a large 
extent which criteria tl~e symbolic representation has to 
meet: Whereas the signal is highly speaker-dependent, the 
symbolic representation is not. On the other hand, while a 
lexieal representation of a word would not include predic~- 
able phonetic information, the phonetic transcription as a 
symbolic representation would contain information of this 
I<ind. In speech synthesis, lexical representations can first 
be translated into a phonetic representation which is then 
transformed into signal components. This two-step procedure 
for the adjustment of the phonetic forms to context influ- 
ences such as assimilation between adjacent words can pos- 
sibly very efficient, H lexical representations are mapped 
directly onto speech signals, it is hard to see how adjust- 
ments of this sort can be performecl systematically. 
I have been deliberately vague about the nature of the 
symbolic representation, because there are various proposals 
to this question. A number ol units have been used and dis- 
cussed as the elementary recognition or synthesis units, e.g., 
the phone, the diphone, the phoneme, the demi-syllable, and 
the syllable. The basic requirement for a symbolic represen- 
tation in a general-purpose SPS would be that it is able to 
denote as much information as can be extracted from the 
signal or be deduced from the lexical representation. Thus, 
if the system can compute the occurrence o£ an allophonic 
variant of some sound, then this allophone should be repre- 
sentable in the symbolic representation. Similarly, if it is 
detected that two syllables are present in the signal, this 
fact should be encoded in the representation. 
These considerations lead to the conclusion that the sym- 
bolic representation might be richer as is often assumed in 
existing systems. We will now show that phonological theory 
can help Lo define an adequate symbolic representation 
which is both a (:ode for expressing phonetic categories anti 
a model el the phonological knowledge of the language user. 
tlt__Solrl ~ re,cent developtr~z!\]its !n ~ L\]hoI~oloKy 
There is a long tradition in phonology to distinguish I)etween 
segmental and suprasegmental features° Segmental features 
are those o:\[ tile inclividual segment; suprasegmental ones 
belong to a domain larger than one segment. 
But it is by no means clear in advance where a feature 
stands in this classification. To give an example, segments 
are el:Lea speci\[ied by the feature syllabic . A segment is 
syllabic if iL stands in Lhe peal< of the syllable. Thus, in (3a) 
all the segments marked with a vertical line are syllabic, 
all others are not. 
(3) a.\[mt3wner~on\] b. \[ntone~n3 intonation 
i i , ,i f r J~, 
But here, there are other pronunciations of the same word 
with different syllabic elements, such as (3b). What remains 
constant is that for each syllable there is exactly one sylla- 
bic peal<. This suggests that syllabicity is not a segmental 
feature but suprasegrnental. 
In this chapter, three examples are used to introduce 
some aspects of recent models in phonology. The examples 
are ambisyllabicity, vowel length and stress patterns; the 
constructs to deal with these are the syllable-node, Lhe CV- 
tier and the metrical tree. 
Z~.l. Am~itz__~llable structure 
There is a common notation to marl< syllable-boundaries by 
some symbol inserted into the segment string. But recent 
work on the syllable (such as l(iparsl<y 1979, Clements & 
Keyser 1983) has assigned to the syllable a more important 
role than iust a boundary notion. That syllables are not just 
boundaries can be shown by the phenomenon of ambisyllabi- 
city, which occurs in a number of languages. 
It is well-known that in German words as Mitte or lassen 
the intervocalic consonants are a part of both syllables o1 
each word. In view of this fact, it becomes a rather arbi- 
trary and unmotivated decision to insert a syllable-boundary. 
But the syllable division and the ambisyllabic nature of some 
consonants can be naturaliy denoted if the syllable is given 
a hierarchial character. The notation for Mitre would then 
be as in (4), with '~ ' denoting the syllable node. 
\[ m r/~t "a \] 
"the segments and the syllable nodes appear on different 
rows or 'tiers' of the representation. This does away with 
the concept of the phonetic representation as a unilinear 
609 
string. Elements on the different tiers are connected by 
'association lines'. In the unmarked case, association is 
one-to-one, but in the case of an ambisyllabic segment as- 
sociation, association is one-to-many, as demonstrated by 
the /t/ in (#). 
#.2. Vowel length and the CV-tier 
The syllable is probably more complex than is assumed in 
O). This can be illustrated by the facts of vowel length. In 
German, which has contrastive vowel length, it appears that 
long vowels take up the space of a short vowel plus a con- 
sonant or of a diphthong (two short vowels). This is shown, 
e.g.~ by the fact that the maximal number of final conso- 
nants is 4 in a word with a short vowel (Herbst), but 3 in 
a word with a long vowel (Obst). To give formal recognition 
to the idea that a long vowel uses two positions in the syl- 
lable, although it is only one segment, yet another tier can 
be introduced into the syllable, called the CV-tier. It con- 
sists only of the elements C and V, where V denotes the 
syllabic nucleus of the syllable and C a consonantal position 
in the syllable. A syllable, then, is of the form (5); the ma- 
ximal number of C-positions has to be determined for each 
language. The fact noted above that every syllable has ex- 
actly ol~esyllabic nucleus can be expressed by letting V be 
an obligatory constituent of the syllable in the schema (5). 
(5) e 
... C V C... 
We have now a new formalism to express (phonological!) 
length not as a segmental feature such as long but as an 
association between the segmental tier and the CV~tier. 
The minimal pair Fall 'fall' vs. fahl 'pale' would be given 
the structural representation (6). With a given number of 
consonants following the V-position, the system also ex- 
plains the fact that long vowels allow one consonant less 
in the syllable than short vowels. 
(6) o 
C V C C V C C 
t \/ Ef la 1 \] \] \[~ a 1 \[3 
By treating phonological length as an association between 
tiers, l do not imply that all durational aspects of speech 
can be handled this way. There are other important timing 
relations between segments that determine intelligibility 
and naturalness of synthetic speech (s,ee Huggins 1980). 
These have to be represented by other means, but are 
(partly) effects of the prosodic structure. Well-known ex- 
amples include phrase-final lengthening and stress- timing 
vs. syllable timing. 
610 
4.3. Stress patterns and the metrical tree 
Moving up one or two levels in the prosodic hierarchy, 
there is the fact that strings of syllables occur in distinct 
accentuation patterns. It is part of the phonological compe- 
tence of speakers of a language to be able to judge the ac- 
centual weight of a syllable with respect to its neighbouring 
syllables. In metrical models, this competence is formally 
expressed by metrical trees with syllables as terminal nodes. 
To give an example, the adjective dberfldssig 'superfluous' 
has the highest degree of prominence on the first syllable, 
and the third syllable is relatively stronger than the last 
one. If a binary tree such as (8) is constructed over the 
syllables, and the nodes are labelled 's' (strong) and 'w' 
(weak), these accentual relations can be expressed easily 
and adequately. Syllabic and segmental detail is ignored in 
the examples here. 
(8)/.x (9) a. b. "~--s w 
S W S W ~/ 
(5(5 O C5 G O GO O C5 O O O O O 
iiberflLissig dog house university regulations 
Interpreting accent as an abstract pattern over larger 
units has several advantages. It is, e.g., possible to give 
simple configurations as accent patterns for certain types 
of constructions. Compounds consisting of two words (in 
English as well as German) can be assigned the accent pat- 
tern ,/",. , independently of its internal accent pattern. S W 
(g) and (9) illustrate the situation. As (9b) shows, word-inter- 
nal accent relations can become quite complex. This is not 
the point to discuss how trees of this kind are constructed, 
nor can we present alternatives that 
A set of difficult questions arises 
ual patterns of this kind are realized 
that the metrical tree itself is quite 
have been suggested, 
if we ask how accent- 
phonetically. Notice 
uninformative in this 
respect. But this may turn out to be an advantage, since it 
are is clear that there a number of phonetic parameters corre- 
lating with accent. Intensity, length, Fo-movement, and vo- 
wel quality have all been identified as potential factors. 
But it may even be the case that listeners perceive an ac- 
cent for which there is no cue in the signal. This is not so 
surprising, if accent is part of the phonological competence, 
and if at least some word-internal accents do not carry 
much information. Given that this is roughly a true picture 
of the situation, then it is a good strategy to have rather 
abstract accent representations which can be realized pho- 
netically in a very flexible manner--and sometimes not at 
all. 
5. Some consequences for ~_ssi~ 
It is sometimes asked in speech processing work what 
should be the recognition or synthesis unit of SPS. The sur- 
vey o\[ i)honological theory in § t~. reveals this to be a pseu- 
do-question, fhere are hierarchies of units, and as far as 
they participate in phonological/phonetic processes, they 
e are real and shouldbused in SPS. Therefore, the symbolic 
representation intermediate between the acoustic signal 
and the 2inal representation of the utterance (see (1)) should 
be richer in structure than is generally assumed. It is not a 
string o:~ units, but a multi-layered system of units. Some 
ingredients o:f this representation have been introduced a- 
bove. 
l:f prosodic information including the syllable is so impor- 
tant for speech processing, one might conclude that the use 
of a higher level unit such as the demi-syllable or the syl- 
lable is strongly recommended. But a consideration of some 
results of the morphology-phonology interaction shows this 
to be a precipitated conclusion. 
Very often, wordinternal morpheme boundaries do not 
match syllable boundaries, if the phonetic information for 
the words (1o1, , and bus would be stored as the syllable tem- 
plates \[dog\] and \[bAs\], there would have to be additional 
templates for the plural :forms \[dogz7 and \[b,~s\]l,\[s\]z\]. But 
plural Iorrnation in English is a very regular process, con- 
sisting of the affixation of a segment and a few rules de- 
pending on the nature ol the final segment of the stem. 
Only if this segmental inlormation is available to the sys- 
tem, a general algorithm ;for plural formation can work. 
Taking syllables as unanalyzable wholes would mean the 
spelling out of each plural Iorm in the lexicon, thus nearly 
doubling tbe number of lexical representations. There are 
numerous similar examples in the morphology ol languages 
like English and German. 
In particular, there seem to be the following advantages 
in using a muLti-linear representation of tbe kind sl<etched 
above. First, the representations derived from prosodic the- 
ories almost torce the utilization o:\[ all kinds of inforrnation 
in the speech signal, especially suprasegmental inlormation. 
This leads to a higher degree of predictability for segments. 
Take the example ol word boundary detection, which is a 
crucial task :for all SPS :~or connected speech, l)ifferent 
languages have different domains ot syllabification. In some 
languages, e.g. English and German, the lexical word is the 
regular domain for syllabilication. (Clitics, such as it's or 
auI'm (from auI dem) are the main exceptions.) But this 
is by no means a universal rule. In Mandarin Chinese, there 
is a good correlation between morphemes and syllables, 
which holds just as well as the one between words and syl- 
lables in English. In French, on the other hand, the domain 
for syllabification is a larger unit, say, the intonational 
phrase. It is the implementation of tbis kind of knowledge 
that mal<es it possible :for a SPS to utilize information about 
syllable boundaries for the detection ot word boundaries. 
Secondly, the handling ol both interspeaker and intra- 
speaker variation requires a framework in which the phone- 
tic representation includes extensive prosodic structure. 
First, the rules governing variable speech (including fast- 
speech rules) are largely prosody dependent, as was illus- 
trated h~ (1). An adequate :formalization of the rules is thus 
only possible on the basis of prosodic representations. Se- 
cond, extracting the relevant phonetic cues from the signal 
becomes easier if prosodic parameters are taken into ao- 
count as Iully as possible. Both vowel and consonant recog- 
nition is improved by taking into account Fo-values in the 
local context. 
I have not addressed the computational side of the re-. 
presentational problem. It might be argued that a multiline- 
ar representation of the kind envisaged here is much harder 
to compute and represent in an actual SPS. But intelligent 
systems are quite able to deal with hierarchical or heterar- 
chical objects of different kinds. Also, Woods (1985: 332) 
mentions the possibility of using cascaded ATNs for speech 
processing. Interlocking chains of ATNs could apply to re- 
cognize features, to bundle features into segments, to build 
syllables from segments~ to combine syllables into words 
and to derive stress patterns for these words. 
The general picture of a SPS assumed in this paper is 
that of a I<nowledge-based, intelligent system. I would like 
to stress that the phonological component is only compo- 
nent in such a system. But it is perhaps a component whose 
potential value has not been fully explored. 
References 

Clements, G.N. & S.3. Keyser (1983) CV-Phonology. A Ge- 
nerative Theory of the Syllable. Cambridge, Mass.: MIT- 
Press. 

Huggins, A.W.F. (1978) 'Speech timing and intelligibility.' 
In: Requin, J. (ed.): Attention and Performance VII. Hills- 
dale, N.J.: Erlbaum. 

Kiparsky, P. (1979) 'Metrical structure is cyclic.' Linguistic 
Inquiry 10, p. t~21-t*#l. 

Vaissiere, J. (1985) 'Speech recognition: A tutorial.' In: 
Fallside, F. & W.A. Woods (eds.) Computer Speech Pro- 
cessing. Englewood Clif:\[s, N.3.~' Prentice I-tall~ t5. 191-292. 

Woods, W.A. (198.5) 'Language Processing for Speech Under- 
standing.' in: Fallside, F. & W.A. Woods (eds.): Computer 
Speech Processing. Englewood Cliffs, N.J.: Prentice Hall, 
p. 305.-33g. 
