Probabilistic Context-Free Grammars for Phonology
Karin Müller
Department of Computational Linguistics
University of Saarland, Germany
kmueller@coli.uni-sb.de
Abstract
We present a phonological probabilistic context-
free grammar, which describes the word and syl-
lable structure of German words. The grammar
is trained on a large corpus by a simple super-
vised method, and evaluated on a syllabification
task achieving 96.88% word accuracy on word to-
kens, and 90.33% on word types. We added rules
for English phonemes to the grammar, and trained
the enriched grammar on an English corpus. Both
grammars are evaluated qualitatively showing that
probabilistic context-free grammars can contribute
linguistic knowledge to phonology. Our formal ap-
proach is multilingual, while the training data is
language-dependent.
1 Introduction
In this paper, we present an approach to supervised
learning and automatic detection of syllable struc-
ture. The primary goal of the paper is to show that
probabilistic context-free grammars can be used to
gain substantial phonological knowledge about syl-
lable structure. Beyond an evaluation of the trained
model on a real-world task documenting the perfor-
mance of the model, we focus on an extensive qual-
itative evaluation.
In contrast to other approaches which work with
syllable structures extracted from a pronunciation
dictionary, our approach focuses on the probabil-
ity of use of certain syllable structures. Among
other approaches that deal with syllable structure,
there are example-based approaches (Hall (1992),
Wiese (1996), Féry (1995), Kenstowicz (1994),
Morelli (1999)), symbolic approaches (Belz,
2000), connectionist phonotactic models (Stoianov
and Nerbonne, 1998), stochastic models de-
scribing partial structures (Pierrehumbert (1994),
Coleman and Pierrehumbert (1997)), or applica-
tion-based approaches for syllabification (Van den
Bosch, 1997) or text-to-speech systems (Kiraz and
Möbius, 1998).
Our method builds on two resources. The first one
is a large written text corpus, which is looked up in
a pronunciation dictionary resulting in a large tran-
scribed and syllabified corpus. The second resource
is a manually written context-free grammar describ-
ing German and English syllable structure. We
encode the assumption (similar to Goldsmith (1995)) that the phonological material that can occur in onsets or codas may differ depending on the syllable position: word-initial, word-medial, word-final, or in monosyllabic words.
We train the context-free grammar for German on
the transcribed and syllabified training corpus with a
simple supervised training method (Müller, 2001a).
The main idea of the training method is that after a
grammar transformation step, the grammar together
with a parser can predict syllable boundaries of un-
known phoneme strings. The trained model is eval-
uated on a syllabification task showing a high preci-
sion on a test corpus. We exemplify that the method
can be easily transferred to related languages (here
English) by adding rules for missing phonemes to
the grammar. In a qualitative evaluation, we compare German and English syllable structure by interpreting the probability weights of the preterminal grammar rules.
[Published in: Morphological and Phonological Learning: Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, pp. 70-80. Association for Computational Linguistics.]
In sum, we aim to show that our method (i) mod-
els all possible words of a language, (ii) models how
likely certain structures are to be used (in comparison to
pure dictionary-based approaches), (iii) yields good
results in an application-oriented evaluation, (iv) is
able to disambiguate competing structures, (v) can
be easily applied to other languages, (vi) produces
mathematically well-defined models.
The paper is organized as follows. We present our
method in Section 2, the experiments in Section 3,
and our evaluation in Section 4. In Section 5, we
discuss the results, and in Section 6, we conclude.
2 Method
We build on the novel approach of Müller (2001a)
which aims to combine the advantages of treebank
and bracketed corpora training. In general, this ap-
proach consists of four steps: (i) writing a (sym-
bolic i.e. non-probabilistic) context-free phonologi-
cal grammar with syllable boundaries, (ii) training
this grammar on a large automatically transcribed
and syllabified corpus, (iii) transforming the result-
ing probabilistic phonological grammar by dropping
the syllable boundaries, and (iv) predicting syllable
boundaries of unseen phoneme strings by choosing
their most probable phonological tree according to
the transformed probabilistic grammar.
The advantages of this approach are that simple and efficient supervised training on bracketed
corpora can be used (the brackets guarantee that
all syllabified words of the training corpus receive
only one single analysis), and that raw phoneme
strings can be parsed and syllabified after the gram-
mar transformation.
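Since the brackets guarantee a single analysis per training word, the supervised step reduces to counting rule occurrences and normalizing per left-hand side. A minimal sketch in Python (the rule names and counts below are hypothetical illustrations, not taken from our grammar):

```python
from collections import defaultdict

def estimate_pcfg(rule_counts):
    """Relative-frequency estimation: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, rhs), c in rule_counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in rule_counts.items()}

# Hypothetical counts read off a bracketed (syllabified) training corpus:
counts = {
    ("On.ini.1.1", ("f",)): 90,   # /f/ as a single word-initial onset consonant
    ("On.ini.1.1", ("v",)): 10,   # /v/ in the same position
}
probs = estimate_pcfg(counts)     # P(On.ini.1.1 -> f) = 0.9
```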
Preserving these advantages, our approach dif-
fers in several important details. First, we write a
more advanced phonological grammar for German,
yielding a more fine-grained probabilistic model of
syllable structure. Second, it is easily possible to
enrich our phonological grammar by adding gram-
mar rules for missing phonemes to adapt our phono-
logical grammar to other languages (here English).
Third, in addition to an evaluation on a real-world
task (syllabification for German), we qualitatively
evaluate the resulting probabilistic versions of our
phonological grammar for German and English.
[Figure 1: parse tree. Word expands into Syl.ini and Syl.fin; the initial syllable (no onset) into Rhyme.ini with Nucleus.ini (a) and Coda.ini → Cod.ini.1.1 (p); the final syllable into Onset.fin → On.fin.1.1 (f) and Rhyme.fin with Nucleus.fin (a) and Coda.fin → Cod.fin.1.1 (l).]
Figure 1: Syllable structure of the word “Abfall” (/ap.fal/)
according to our phonological grammar for German.
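Read off as a rule sequence, the tree in Figure 1 corresponds to the following derivation (our reading of the figure, with labels following its X.pos.i.n naming scheme):

```python
# Rule applications for "Abfall" /ap.fal/ according to Figure 1 (our reading):
abfall_derivation = [
    ("Word", ["Syl.ini", "Syl.fin"]),
    ("Syl.ini", ["Rhyme.ini"]),                   # initial syllable has no onset
    ("Rhyme.ini", ["Nucleus.ini", "Coda.ini"]),
    ("Nucleus.ini", ["a"]),
    ("Coda.ini", ["Cod.ini.1.1"]),
    ("Cod.ini.1.1", ["p"]),
    ("Syl.fin", ["Onset.fin", "Rhyme.fin"]),
    ("Onset.fin", ["On.fin.1.1"]),
    ("On.fin.1.1", ["f"]),
    ("Rhyme.fin", ["Nucleus.fin", "Coda.fin"]),
    ("Nucleus.fin", ["a"]),
    ("Coda.fin", ["Cod.fin.1.1"]),
    ("Cod.fin.1.1", ["l"]),
]

# Collecting the terminals (lowercase single symbols) yields the phoneme string:
terminals = [rhs[0] for _, rhs in abfall_derivation
             if len(rhs) == 1 and rhs[0].islower()]
```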
Our phonological grammar divides a word into
syllables, which are in turn rewritten by onset, nu-
cleus, and coda. Furthermore, the phonological
grammar differentiates between monosyllabic and
polysyllabic words. In polysyllabic words, the syl-
lables are divided into syllables appearing word-
initially, word-medially, and word-finally. Addition-
ally, the grammar distinguishes between consonant
clusters of different sizes (ranging from one to five
consonants), as well as between consonants occur-
ring in different positions within a cluster. Figure 1
displays the structure of the German word “Abfall”
(waste) according to our phonological grammar.
In the following sections, we especially focus on the
rewriting rules involving phonemic terminal nodes:
X.pos.i.n → c and Y.pos → v. The rules of the first type bear three of the above-mentioned features for a consonant c inside an onset or a coda (X = On, Cod), namely: the position of the syllable in the word (pos = ini, med, fin, one), the cluster size (n ∈ {1, ..., 5}), and the position of the consonant within the cluster (i ∈ {1, ..., n}). Obviously, vowels or diphthongs v of a nucleus (Y = Nucleus) do not need the position and size features (n and i). The probabilities of these
phonological rules (after supervised training) are ex-
actly the basis for our description and evaluation of
the syllable parts in Section 4.
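The naming scheme of these preterminals can be made explicit with a small helper (a sketch assuming dot-joined labels as in Figure 1):

```python
def preterminal(part, pos, i=None, n=None):
    """Build a preterminal label: X.pos.i.n for onset/coda consonants,
    Y.pos for nuclei.
    part: 'On', 'Cod', or 'Nucleus'
    pos:  'ini', 'med', 'fin', or 'one' (syllable position in the word)
    i:    position of the consonant within the cluster
    n:    cluster size"""
    if part == "Nucleus":
        return f"{part}.{pos}"        # nuclei carry no size/position features
    return f"{part}.{pos}.{i}.{n}"
```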
3 Experiments
In the following, we describe two experiments. The
first experiment investigates syllable structure for
German. In the second experiment, we generalize
and apply the method to another language (English).
Experiment with German data.
First, we manually write a phonological grammar
for German consisting of 2,394 context-free rules.
If compared to the most successful grammar con-
structed by Müller (2001a), our grammar is enriched
with an additional feature: the size of the onsets
and codas. Second, we extracted a training cor-
pus of 2,127,798 words (3,961,982 syllables) from
a German newspaper, the Stuttgarter Zeitung, and
an additional corpus of 242,047 words for test-
ing. All words are looked up in the German part
of the CELEX (Baayen et al., 1993) yielding tran-
scribed and syllabified corpora. As phoneme set,
we used the symbols from the English and Ger-
man SAMPA alphabet (Wells, 1997). In contrast
to Müller (2001a), we did not investigate smaller
training corpora, since we are interested in maximal
phonological knowledge about internal word struc-
ture. Third, we train the phonological context-free
grammar on the training corpus using the supervised
method presented in Section 2. Additionally, to account for events not occurring in the training data, we use the smoothing procedure implemented in the LoPar system (Schmid, 2000), which assigns positive probabilities to unseen rules.
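The effect of smoothing can be illustrated with a simple add-lambda scheme over the rules of one left-hand side; this is an illustration only, not LoPar's actual procedure, and the counts are hypothetical:

```python
def add_lambda_smooth(counts_by_lhs, lam=0.5):
    """Add-lambda smoothing over the rules of one left-hand side: every rule,
    seen or unseen, receives a positive probability. (A simplification for
    illustration; LoPar's actual smoothing procedure may differ.)"""
    total = sum(counts_by_lhs.values()) + lam * len(counts_by_lhs)
    return {rhs: (c + lam) / total for rhs, c in counts_by_lhs.items()}

# Hypothetical counts; "x" was never seen as a word-initial onset consonant:
onset_counts = {"f": 9, "v": 1, "x": 0}
probs = add_lambda_smooth(onset_counts)   # probs["x"] > 0 after smoothing
```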
Experiment with English data.
In this experiment, we show that our method can be
easily applied to other languages. We create a sec-
ond training corpus of the same size of 2,123,081
words from the British National Corpus. The words
are looked-up in the English part of the CELEX.
Furthermore, the context-free grammar is extended
by rules for all possible English phonemes. This
means, we add preterminal rules for phonemes not
occurring in German words, e.g., rules for the apico-
dental phoneme /D/ (appearing in the word this).
This (semi-automatic) procedure yields an English
phonological grammar consisting of 4,418 rules,
which is trained on the new corpus.
4 Evaluation
First, we evaluate on a syllabification task for Ger-
man. Second, we analyze linguistically the errors
made by the German phonological parser on the
evaluation corpus (word types). Third, and more im-
portant, we concentrate on a qualitative evaluation of
syllable structure for German and English.
4.1 Evaluation on Syllabification
The resulting probabilistic phonological grammar
of German (Section 3) is evaluated on a syllabifi-
cation task by comparing the maximum-probability
parses of all raw phoneme strings in the test corpus
(242,047 word tokens, 24,735 word types) with their
annotated bracketed variants. As evaluation mea-
sure, we used “word accuracy” which computes the
rate of words with all predicted syllable brackets ex-
actly matching the annotated syllable brackets. The
evaluation shows that our phonological grammar for
German achieves 96.88% word accuracy on word to-
kens, and 90.33% on word types.
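The measure can be stated compactly; the syllabified strings below are hypothetical examples, with “.” marking a syllable boundary:

```python
def word_accuracy(predicted, gold):
    """Rate of words whose predicted syllable boundaries all exactly
    match the annotated ones."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical parser output vs. annotation:
pred = ["ap.fal", "lam.pe", "mons.ter"]
gold = ["ap.fal", "lam.pe", "mon.ster"]
acc = word_accuracy(pred, gold)   # 2/3: the third word has a wrong boundary
```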
Error Analysis
We analyze the results of the German phonological
parser on the evaluation corpus consisting of word
types. Out of 24,735 word types, 2,391 contained wrongly predicted syllable boundaries. We analyzed every tenth of the incorrect words, i.e., 239 items, in which we found 243 errors; since more than one error can occur in a single word, the number of errors exceeds the number of items. We find that 72.42% of the errors are consonants incorrectly assigned to the onset, whereas only 27.57% are consonants wrongly assigned to the coda. The tendency
that more errors occur in predicting the onset agrees
with the findings of M¨uller (2001b). The main errors
appear when a /t/, /R/, or an /n/ is predicted to be an onset consonant. The main errors in predicting coda consonants occur with /k/, /s/, and /p/.
If we investigate the errors made on the linguis-
tic level, the most frequent error occurs with word
boundaries (98 cases). However, a further error
source is found with syllable boundaries occurring
in conjunction with prefixes and suffixes. The most
frequent error appears with syllable boundaries af-
ter prefixes like /ver-/, /er-/, and /un-/, whereas with
suffixes the most frequent error appears with /-lich/.
Foreign words, which are subject to different phono-
tactic constraints, seem to be a minor source of er-
rors (20 cases). Thus, we can see that most of the
errors are found in conjunction with morphological
English initial medial final monosyl sum
0C onset 191,888 17,630 32,350 453,117 694,985
1C onset 407,380 246,802 556,954 852,866 2,064,002
2C onset 88,692 58,888 96,604 122,295 366,479
3C onset 3,332 3,577 5,368 4,511 16,788
4C onset - 1 16 - 17
sum 691,292 326,898 691,292 1,432,789 3,142,271
0C coda 485,968 257,595 122,058 435,247 1,300,868
1C coda 196,409 65,594 440,176 777,347 1,479,526
2C coda 8,872 3,706 112,597 206,941 332,116
3C coda 42 3 16,361 13,014 29,420
4C coda 1 - 100 240 341
5C coda - - - - -
sum 691,292 326,898 691,292 1,432,789 3,142,271
German initial medial final monosyl sum
0C onset 294,428 54,388 50,508 273,407 672,731
1C onset 658,986 577,358 963,600 687,429 2,887,373
2C onset 127,632 103,353 75,721 68,050 374,756
3C onset 10,879 7,160 2,096 6,987 27,122
4C onset - - - - -
sum 1,091,925 742,259 1,091,925 1,035,873 3,961,982
0C coda 633,196 489,576 236,905 170,845 1,530,522
1C coda 404,616 223,590 708,651 645,447 1,982,304
2C coda 48,399 26,807 130,712 200,472 406,390
3C coda 4,988 2,255 14,945 16,028 38,216
4C coda 726 31 712 3,078 4,547
5C coda - - - 3 3
sum 1,091,925 742,259 1,091,925 1,035,873 3,961,982
Table 1: Frequency of occurrence of onsets and codas for English (left) and German (right), displayed for different complexities
(ranging from 0 to 5 consonants) and different positions of the syllable in the word (initial, medial, final, monosyl).
English word initial word medial word final monosyl
3: 0.024 0.013 0.009 0.014
9 <0.001 - - -
A: 0.030 0.012 0.005 0.017
E 0.126 0.105 0.030 0.044
E@ 0.008 0.002 0.003 0.010
I 0.166 0.282 0.367 0.171
I@ 0.009 0.010 0.019 0.003
O 0.072 0.026 0.006 0.082
O: 0.028 0.021 0.014 0.041
OI 0.002 0.004 0.001 0.004
U 0.028 0.020 0.005 0.015
U@ <0.001 0.003 0.001 <0.001
V 0.069 0.021 0.008 0.029
Y <0.001 - - -
& 0.070 0.032 0.008 0.095
@ 0.167 0.268 0.386 0.220
@U 0.037 0.020 0.022 0.029
aI 0.036 0.030 0.022 0.042
aU 0.015 0.003 0.014 0.014
eI 0.039 0.077 0.031 0.069
i: 0.045 0.017 0.027 0.029
u: 0.018 0.024 0.012 0.064
German word initial word medial word final monosyl
2: 0.008 0.008 0.002 <0.001
9 0.010 0.003 <0.001 <0.001
@ 0.082 0.180 0.693 -
A~: 0.001 0.091 <0.001 -
E 0.131 0.015 0.023 0.061
E: 0.011 0.129 0.005 0.001
I 0.064 0.030 0.049 0.163
O 0.051 - 0.008 0.052
OI - 0.008 <0.001 -
OY 0.020 - 0.001 0.001
O~ - <0.001 - -
O~: <0.001 <0.001 <0.001 -
U 0.047 0.033 0.044 0.082
Y 0.019 0.012 0.002 0.002
a 0.107 0.092 0.036 0.103
a: 0.066 0.056 0.032 0.037
aI 0.105 0.048 0.026 0.060
aU 0.031 0.009 0.008 0.051
e: 0.075 0.054 0.010 0.171
eI <0.001 - - -
i: 0.054 0.135 0.022 0.124
o: 0.061 0.052 0.018 0.021
u: 0.026 0.027 0.007 0.039
y: 0.023 0.010 0.003 0.021
Table 2: Nuclei for English (left) and German (right).
entities. This suggests that a further morphological level could help to disambiguate syllabification alternatives.
4.2 Qualitative Evaluation
The evaluation is carried out for both English and
German. First, we compare the complexity of
words, syllables, and syllable parts. Second, we
analyze the probabilities of grammar rules involv-
ing phonemic terminal nodes. Unfortunately, due to
space constraints and the large size of our derived
probabilistic phonological grammars, only prelimi-
nary results can be presented.
Table 1 displays the occurrence frequencies of on-
sets and codas (of different size and different syl-
lable position), counted on the basis of the train-
ing corpora for English and German (see Section 3).
The following three complexity analyses are carried
out on the basis of this table.
Word complexity. German words tend to be
more complex than English words. The German
training corpus comprises 48.7% 1 monosyllabic words, whereas the English training corpus comprises 67.4%. The high frequency of occurrence
of monosyllabic words justifies the separate treat-
ment of those words.
Syllable complexity. German syllables usually
consist of onset and rhyme. An onset is observed in
initial syllables (73%) 2, in medial syllables (94%),
in final syllables (94%), and in monosyllabic words
(73%). A coda is found in initial syllables (42%), in
medial syllables (34%), in final syllables (78%), and
1(1,035,873 / 2,127,798)=48.7%
2((658,986 + 127,632 + 10,879) / 1,091,925) = 73%
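The footnoted percentages can be recomputed directly from Table 1 and the corpus sizes given in Section 3:

```python
# Monosyllabic share of the German training corpus (footnote 1):
monosyl_rate = 1_035_873 / 2_127_798                      # ~0.487

# Share of German word-initial syllables whose onset has 1-3 consonants
# (footnote 2), from the Table 1 counts:
onset_rate = (658_986 + 127_632 + 10_879) / 1_091_925     # ~0.73
```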
English initial medial final monosyl
D 0.002 0.002 0.022 0.250
N <0.001 <0.001 <0.001 -
S 0.006 0.017 0.073 0.019
T 0.011 0.006 0.007 0.008
Z <0.001 0.004 0.010 -
b 0.068 0.025 0.042 0.072
d 0.084 0.059 0.075 0.024
f 0.057 0.052 0.029 0.055
g 0.018 0.016 0.018 0.017
h 0.048 0.004 0.003 0.063
j 0.019 0.001 0.003 0.032
k 0.113 0.046 0.052 0.032
l 0.048 0.097 0.095 0.034
m 0.087 0.078 0.057 0.045
n 0.042 0.091 0.054 0.035
p 0.080 0.065 0.037 0.021
r 0.098 0.080 0.065 0.020
s 0.114 0.097 0.067 0.052
t 0.033 0.125 0.185 0.105
v 0.026 0.080 0.052 0.003
w 0.034 0.002 0.012 0.105
z 0.003 0.041 0.031 <0.001
German initial medial final monosyl
N - 0.003 0.021 -
R 0.040 0.074 0.082 0.009
S 0.011 0.024 0.033 0.004
Z <0.001 0.001 <0.001 -
b 0.096 0.056 0.062 0.031
d 0.073 0.072 0.106 0.454
f 0.115 0.052 0.029 0.099
g 0.098 0.103 0.065 0.016
h 0.062 0.031 0.016 0.013
j 0.030 0.005 0.001 0.011
k 0.064 0.032 0.023 0.015
l 0.042 0.121 0.076 0.011
m 0.078 0.055 0.042 0.072
n 0.037 0.078 0.120 0.076
p 0.024 0.018 0.006 0.002
s - 0.017 0.030 -
t 0.026 0.116 0.184 0.007
v 0.115 0.055 0.019 0.064
x <0.001 0.009 0.036 -
z 0.078 0.069 0.039 0.107
Table 3: Onsets consisting of 1 consonant for English (left) and German (right)
English initial medial final monosyl
D 0.010 <0.001 <0.001 0.019
N 0.043 0.018 0.138 0.006
S 0.011 0.012 0.007 0.002
T 0.005 <0.001 <0.001 0.007
Z - <0.001 <0.001 <0.001
b 0.019 <0.001 <0.001 0.002
d 0.022 0.008 0.106 0.068
f 0.037 0.001 0.003 0.102
g 0.022 0.008 <0.001 0.003
h - - - <0.001
k 0.118 0.176 0.020 0.029
l 0.074 0.105 0.119 0.046
m 0.111 0.066 0.023 0.052
n 0.435 0.426 0.166 0.168
p 0.014 0.039 0.005 0.015
r* <0.001 <0.001 0.178 0.123
s 0.026 0.080 0.046 0.038
t 0.031 0.035 0.056 0.162
v 0.005 0.016 0.016 0.026
x <0.001 - <0.001 <0.001
z 0.009 0.002 0.108 0.124
German initial medial final monosyl
N 0.011 0.014 0.055 0.002
R 0.356 0.315 0.183 0.271
S <0.001 <0.001 0.004 0.001
b - - - <0.001
f 0.028 0.020 0.005 0.036
k 0.038 0.046 0.021 0.009
l 0.067 0.082 0.026 0.026
m 0.030 0.020 0.038 0.050
n 0.269 0.287 0.511 0.273
p 0.031 0.019 0.001 0.008
s 0.084 0.091 0.047 0.156
t 0.040 0.018 0.055 0.064
v <0.001 - - -
x 0.041 0.082 0.047 0.096
Table 4: Codas consisting of 1 consonant for English (left) and German (right).
in monosyllabic words (83%).
English syllables. An onset is observed in ini-
tial syllables (72.2%), in medial syllables (94%), in
final syllables (95%), and in monosyllabic words
(68.3%). A coda is found in initial syllables
(29.7%), in medial syllables (21.2%), in final syl-
lables (82.3%), and in monosyllabic words (69%).
Onset and coda complexity. English and Ger-
man syllables prefer simple onsets and codas.
English onsets. For both initial and medial sylla-
bles, a single consonant is found (80%), two conso-
nants (18%), and three consonants (less than 1%).
For both final syllables and monosyllabic words,
one consonant is observed (85%), two consonants
(13%), and three consonants (less than 0.9%).
German onsets. For both initial and medial sylla-
bles, one consonant is found (82%), two consonants
(15%), and three consonants (1%). For both mono-
syllabic words and final syllables, a single consonant
is found (90%), two consonants (7%), and three con-
sonants (less than 1%).
English codas. For both initial and medial syllables,
one consonant is observed (95%), and two conso-
nants (5%). For both final syllables and monosyl-
labic words, one consonant is found (77%), two con-
sonants (20%), and three consonants (2%).
German codas. For both initial and medial syllables,
one consonant is observed (88%), two consonants
(10%), and three consonants (about 1%). In final
syllables, one consonant is found (82.8%), two con-
sonants (15.2%), three consonants (1.7%), and four
consonants (0.08%). In monosyllabic words, one
consonant occurs (74.6%), two consonants (23.1%
), three consonants (1.8%), and four consonants
(0.3%).
Nuclei. Table 2 displays the nuclei found for En-
glish and German. The symbol “-” indicates that
the phoneme has not been observed in the training
corpus. However, the corresponding grammar rules
receive a (very small) positive probability according
to our smoothing procedure. Note that we do not display those phonemes in the tables which are marked with “-” for all positions. Moreover, the symbol “<0.001” indicates a probability of less than 0.001 (i.e., an occurrence frequency of less than 0.1%).
English nuclei. The most likely nuclei in initial syllables are /@, I, E, O, &, V/ (16.7%, 16.6%, 12.6%, 7.2%, 7%, 6.9%), in medial syllables /I, @, E, eI/ (28.2%, 26.8%, 10.5%, 7.7%), in final syllables /@, I/ (38.6%, 36.7%), and in monosyllabic words /@, I, &, O, eI, u:/ (22%, 17.1%, 9.5%, 8.2%, 6.9%, 6.4%). Furthermore, in monosyllabic words, 33.6% of the nuclei are long vowels/diphthongs, 29.1% in initial syllables, 23.6% in medial syllables, and 18% in final syllables.
German nuclei. In initial syllables, the most probable nuclei are /E, a, aI, @, e:, a:, I, o:, i:, O/ (13.1%, 10.7%, 10.5%, 8.2%, 7.5%, 6.6%, 6.4%, 6.1%, 5.4%, 5.1%), in medial syllables /@, i:, I, a, E, a:, e:, o:/ (18%, 13.5%, 12.9%, 9.2%, 9.1%, 5.6%, 5.4%, 5.2%), in final syllables /@, I, U, a/ (69.3%, 4.9%, 4.4%, 3.6%), and in monosyllabic words /e:, I, i:, a, U, E, aI, O, aU/ (17.1%, 16.3%, 12.4%, 10.3%, 8.2%, 6.1%, 6%, 5.2%, 5.1%). Generally, we can observe that long vowels/diphthongs are more likely in monosyllabic words (52.6%) than in all other syllable positions (48.3% in initial syllables, 42% in medial syllables, and 13.4% in final syllables).
Mono-consonantal onsets and codas. Table 3
and Table 4 display the onsets and codas consisting
of 1 consonant.
German onsets. The most probable consonants in initial syllables are /f, v, g, b, z, m, d, k, h/ (11.5%, 11.5%, 9.8%, 9.6%, 7.8%, 7.8%, 7.3%, 6.4%, 6.2%), in medial syllables /l, t, g, n, R, d, z/ (12.1%, 11.6%, 10.3%, 7.8%, 7.4%, 7.2%, 6.9%), in final syllables /t, n, d, R, l/ (18.4%, 12%, 10.6%, 8.2%, 7.6%), and in monosyllabic words /d, z, f, n, m/ (45.4%, 10.7%, 9.9%, 7.6%, 7.2%).
English onsets. In initial syllables, the most probable consonants are /s, k, r, m, d, p/ (11.4%, 11.3%, 9.8%, 8.7%, 8.4%, 8%), in medial syllables /t, s, l, n, r, v/ (12.5%, 9.8%, 9.7%, 9.1%, 8%, 8%), in final syllables /t, l, d, S, s, r/ (18.5%, 9.5%, 7.5%, 7.3%, 6.7%, 6.5%), and in monosyllabic words /D, t, w, b, h/ (25%, 10.5%, 10.5%, 7.2%, 6.3%).
German codas. In initial syllables, the most likely consonants are /R, n/ (35.6%, 26.9%), in medial syllables /R, n/ (31.5%, 28.7%), in final syllables /n, R/ (51.1%, 18.3%), and in monosyllabic words /n, R, s, x/ (27.3%, 27.1%, 15.6%, 9.6%).
English codas. In initial syllables, the most dominant consonants are /n, k, m/ (43.5%, 11.8%, 11.1%), in medial syllables /n, k, l/ (42.6%, 17.6%, 10.5%), in final syllables /r*, n, N, l, z, d/ (17.8%-10.6%), and in monosyllabic words /n, t, z, r*, f/ (16.8%-10.2%).
Onset and coda clusters. Clusters of 2-3 con-
sonants are displayed and described in Table 5 and
Table 7 (onsets), and in Table 6 and Table 8 (co-
das). Due to space constraints, we omitted to display
clusters of 4-5 consonants, but our analysis can be
found elsewhere (M¨uller, to appear 2002). Clusters
with more than 5 consonants have not been found
in our corpora. Furthermore, no onsets comprising 4 consonants occur for German, and no codas comprising 5 consonants occur for English. Last, for German, there is only one 5-consonant coda cluster, /Rnsts/, appearing in words like “Ernsts”, the genitive case of the proper name “Ernst”.
5 Discussion
Like Müller (2001a), we presented a method for
prediction of syllable boundaries using phonolog-
ical probabilistic context-free grammars for Ger-
man. However, our approach performs slightly bet-
ter (96.88% word accuracy on word tokens ver-
sus 96.49%). Van den Bosch (1997) reports a word
error rate of 2.22% on English syllabification us-
ing inductive learning. Due to the feature “clus-
ter size”, which was not used by Müller (2001a),
we are able to give an extensive qualitative evalu-
ation of syllable structure considering syllable po-
sitions, as well as the complexities of consonant
clusters, and the position of a consonant within
a cluster. Since our approach is multilingual
(only training is language dependent), we evalu-
ated two languages (German and English) show-
ing that probabilistic context-free grammars add lin-
guistic knowledge to phonology. In contrast to
theoretical approaches, we focus on syllable struc-
tures that are preferred in a certain language. The-
oretical phonotactic approaches such as Hall (1992), Wiese (1996), and Féry (1995) describe possible syllable structures of German, and Kenstowicz (1994) and Morelli (1999) those of English. There are many more
approaches dealing with syllable structure. For in-
stance, Kiraz and Möbius (1998) develop multilin-
gual syllabification models on the basis of a pro-
nunciation dictionary. Partial English syllable struc-
ture is described by Pierrehumbert (1994), who also
used a dictionary. A more general model was intro-
duced by Müller et al. (2000), who used a cluster-
ing algorithm to induce English and German sylla-
ble classes. However, the approach treats the onsets
and codas as one string. In our method, we describe
in more detail the internal structure of onsets and
codas. Our model has several advantages. (i) We
believe that the syllable structure of all words oc-
curring in a certain language can be described by
an elaborated context-free grammar. (ii) Moreover,
using probabilistic context-free grammars, alterna-
tive syllabifications of phoneme strings can be dis-
ambiguated. (iii) Our model is able to analyze un-
expected phoneme strings. For instance, the onset
/bRd/ of the proper name “Brdaric” (/bRda:RItS/) is not allowed according to German phonotactics. Table 7 correctly displays that /b/ occurs neither as a first, nor /R/ as a second, nor /d/ as a third consonant in initial triconsonantal German onset clusters. Due to the smoothing procedure, our model syllabifies this name as [bRda:][RItS] although the onset /bRd/ has never been observed in the German training corpus. (iv) The syllable structure of nonsense
words can be predicted. The model can be exploited
in two ways: first, it predicts the most probable syl-
labification, second, it can be used to model the lex-
ical decision task. An example of English words
is mentioned by Pierrehumbert (1994). She compared “bistro” (/bIstr@U/) as a possible word with a good word “bimplo” (/bImpl@U/) and a bad word “bilflo” (/bIlfl@U/). The four possible syllabifications of “bimplo” are [bIm][pl@U], [bI][mpl@U], [bImp][l@U], and [bImpl][@U]. The most probable syllable structure for “bimplo” is [bIm][pl@U] (1.8403e-13). For the disyllabic word “bilflo”, the model assigns the highest probability to a syllable boundary between /l/ and /fl/. Out of the four possible syllabifications for the real word “bistro”, [bIstr][@U] is the most probable syllable structure. Although the triconsonantal cluster should rather be an onset cluster than a coda cluster, the model prefers /str/ as a cluster, whereas a syllable boundary is added between /m/ and /pl/, and between /l/ and /fl/. A further example mentioned in the literature is /brIk/, /blIk/, and /bnIk/. The first one is a possible word of English, which receives a probability of 1.16e-08; the second, the non-occurring word /blIk/, receives 7.2391e-09; and the non-English word /bnIk/, 4.3249e-09. Thus, the least probable one is the non-English word, followed by the non-occurring one, and the highest probability is assigned to the real word brick.
Beside the good performance of the current mod-
els in applications, further improvements of the
present approach can possibly be achieved by em-
bedding more prior phonotactic knowledge. For
example, it might be useful to model the distribu-
tion of a consonant dependent on the previous one
(Menzel, 2001). In future work, we will inves-
tigate this issue by using head-lexicalized proba-
bilistic context-free grammars, like those suggested
by Carroll and Rooth (1998), where the consonant
cluster /StR/ would be analyzed as:
X.r.1.3[S] → S X.r.2.3[t]
X.r.2.3[t] → t X.r.3.3[R]
X.r.3.3[R] → R
Here, the lexical choice events express the desired
phonotactic feature. Moreover, it would be interest-
ing to incorporate the stress feature.
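The dependency in question can be illustrated with a plain first-order (bigram) model over consonant positions in a cluster; this toy sketch is not a head-lexicalized PCFG, and the example clusters are hypothetical:

```python
from collections import defaultdict

def bigram_cluster_model(clusters):
    """First-order model of consonant clusters: P(c_i | c_{i-1}), the kind of
    phonotactic dependency a head-lexicalized PCFG could capture."""
    counts = defaultdict(lambda: defaultdict(int))
    for cluster in clusters:
        prev = "<start>"
        for c in cluster:
            counts[prev][c] += 1
            prev = c
    return {p: {c: n / sum(d.values()) for c, n in d.items()}
            for p, d in counts.items()}

# Hypothetical SAMPA onset clusters:
model = bigram_cluster_model([["S", "t", "R"], ["S", "t"], ["S", "p", "R"]])
# model["S"]["t"] = 2/3, model["t"]["R"] = 1.0
```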
Our linguistic evaluation of the errors shows that most errors of the phonological parser occur in conjunction with morphological phenomena like prefixes, suffixes, and word boundaries. This suggests that a further morphological layer could improve word accuracy (see also Meng (2001)).
6 Conclusion
We introduced a multilingual approach to proba-
bilistic modeling of syllable structure using proba-
bilistic context-free grammars. We exemplified our
approach for two languages, German and English.
Evaluation on a syllabification task shows a small
improvement in word accuracy rate compared to
other state-of-the-art systems for German. Addi-
tionally, we presented an extensive qualitative eval-
uation of German and English syllable, onset, and
coda structure (Section 4), showing which structures are preferred. However, the presented work
is a starting point of analyzing in detail the huge
amount of data with regard to phonological regular-
ities. For instance, we found evidence that in conso-
nant clusters sonorous consonants (like /R/, /l/) are
more restricted to appear next to the nucleus than
less sonorous consonants. Clearly, this is promis-
ing work for future investigations. We believe that
our method can be easily transferred to a variety
of other languages, and aim in future work at em-
bedding more fine-grained phonotactic constraints
into our grammar.

References
Harald R. Baayen, Richard Piepenbrock, and H. van Rijn.
1993. The CELEX lexical database—Dutch, English,
German. LDC. Univ. of Pennsylvania.
Anja Belz. 2000. Multi-syllable phonotactic modelling.
In Proc. of SIGPHON.
Glenn Carroll and Mats Rooth. 1998. Valence induction
with a head-lexicalized PCFG. In Proc. of EMNLP.
John Coleman and Janet Pierrehumbert. 1997. Stochas-
tic Phonological Grammars and Acceptability. In
Proc. of SIGPHON.
Carolin Féry. 1995. Alignment, syllable and metrical
Structure in German. SFS-report, Univ. of Tübingen.
John A. Goldsmith. 1995. Phonological Theory. In
Handbook of Phonological Theory.
Tracy Hall. 1992. Syllable structure and syllable related
processes in German. Niemeyer, Tübingen.
M. Kenstowicz. 1994. Phonology in Generative Gram-
mar. Blackwell, Cambridge, MA.
George Anton Kiraz and Bernd Möbius. 1998. Mul-
tilingual Syllabification Using Weighted Finite-State
Transducers. In Proc. 3rd ESCA Workshop on Speech
Synthesis (Jenolan Caves), pages 59–64.
Helen Meng. 2001. A hierarchical lexical
representation for bi-directional spelling-to-
pronunciation/pronunciation-to-spelling generation.
Speech Communication, 33:213–239.
Wolfgang Menzel. 2001. Personal communication.
Frida Morelli. 1999. The Phonotactics and Phonology of
Obstruent Clusters in Optimality Theory. Ph.D. thesis,
University of Maryland.
Karin Müller, Bernd Möbius, and Detlef Prescher. 2000.
Inducing Probabilistic Syllable Classes Using Multi-
variate Clustering. In Proc. of ACL.
Karin Müller. 2001a. Automatic Detection of Syllable
Boundaries Combining the Advantages of Treebank
and Bracketed Corpora Training. In Proc. of ACL.
Karin Müller. 2001b. Probabilistic Context-Free Gram-
mars for Syllabification and Grapheme-to-Phoneme
Conversion. In Proc. of EMNLP, Pittsburgh, PA.
Karin Müller. to appear 2002. Probabilistic Syllable
Modeling Using Supervised and Unsupervised Learn-
ing Methods. Ph.D. thesis, University of Stuttgart.
Janet Pierrehumbert. 1994. Syllable structure and word
structure. In Phonological Structure and Phonetic
Form. University Press, Cambridge.
Helmut Schmid. 2000. LoPar. Design and Implementa-
tion. IMS, University of Stuttgart.
Ivelin Stoianov and John Nerbonne. 1998. Explor-
ing Phonotactics with Simple Recurrent Networks. In
Proc. of Comp. Linguistics in the Netherlands.
Antal Van den Bosch. 1997. Learning to Pronounce
Written Words: A Study in Inductive Language Learn-
ing. Ph.D. thesis, Univ. Maastricht, Maastricht, The
Netherlands.
John Wells. 1997. Spoken language reference materials.
In Handbook of Standards and Resources for Spoken
Language Systems. Mouton de Gruyter, Berlin.
Richard Wiese. 1996. The Phonology of German.
Clarendon Press, Oxford.
