Paradigmatic Cascades: a Linguistically Sound Model of 
Pronunciation by Analogy 
Francois Yvon 
ENST and CNRS, URA 820 
Computer Science Department 
46 rue Barrault - F 75 013 Paris 
yvon@inf.enst.fr
Abstract 
We present and experimentally evaluate a 
new model of pronunciation by analogy: 
the paradigmatic cascades model. Given a 
pronunciation lexicon, this algorithm first 
extracts the most productive paradigmatic 
mappings in the graphemic domain, and 
pairs them statistically with their corre- 
late(s) in the phonemic domain. These 
mappings are used to search and retrieve 
in the lexical database the most promising 
analogs of unseen words. We finally apply 
to the analogs' pronunciations the correlated 
series of mappings in the phonemic domain 
to get the desired pronunciation. 
1 Motivation 
Psychological models of reading aloud traditionally 
assume the existence of two separate routes for con- 
verting print to sound: a direct lexical route, which 
is used to read familiar words, and a dual route rely- 
ing upon abstract letter-to-sound rules to pronounce 
previously unseen words (Coltheart, 1978; Coltheart 
et al., 1993). This view has been challenged by 
a number of authors (e.g. (Glushko, 1981)), who 
claim that the pronunciation process of every word, 
familiar or unknown, could be accounted for in a 
unified framework. These single-route models cru- 
cially suggest that the pronunciation of unknown 
words results from the parallel activation of similar 
lexical items (the lexical neighbours). This idea has 
been tentatively implemented both into various sym- 
bolic analogy-based algorithms (e.g. (Dedina and 
Nusbaum, 1991; Sullivan and Damper, 1992)) and 
into connectionist pronunciation devices (e.g. (Sei- 
denberg and McClelland, 1989)). 
The basic idea of these analogy-based models is 
to pronounce an unknown word x by recombin- 
ing pronunciations of lexical items sharing common 
subparts with x. To illustrate this strategy, Dedina 
and Nusbaum show how the pronunciation of 
the sequence lop in the pseudo-word blope is analo- 
gized with the pronunciation of the same sequence 
in sloping. As there exists more than one way to re- 
combine segments of lexical items, Dedina and Nusbaum's 
algorithm favors recombinations including 
large substrings of existing words. In this model, 
the similarity between two words is thus implicitly 
defined as a function of the length of their common 
subparts: the longer the common part, the better 
the analogy. 
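The implicit similarity these recombination models rely on can be sketched as follows (an illustrative Python fragment of our own; it merely enumerates the shared substrings, it is not the original implementation):

```python
def shared_substrings(x, lexical_item, min_len=2):
    """Substrings of x (length >= min_len) also occurring in the
    lexical item: the overlap recombination models build on."""
    found = set()
    for i in range(len(x)):
        for j in range(i + min_len, len(x) + 1):
            if x[i:j] in lexical_item:
                found.add(x[i:j])
    return found

# the sequence "lop" of the pseudo-word "blope" recurs in "sloping";
# longer shared substrings would be preferred by such a model:
assert "lop" in shared_substrings("blope", "sloping")
```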
This conception of analogical processes has an im- 
portant consequence: it offers, as Damper and 
Eastmond (1996) state, "no 
principled way of deciding the orthographic neighbours 
of a novel word which are deemed to influ- 
ence its pronunciation (...)". For example, in the 
model proposed by Dedina and Nusbaum, any word 
having a common orthographic substring with the 
unknown word is likely to contribute to its pronun- 
ciation, which increases the number of lexical neighbours 
far beyond acceptable limits (in the case of 
blope, this neighbourhood would contain every En- 
glish word starting in bl, or ending in ope, etc). 
From a computational standpoint, implement- 
ing the recombination strategy requires a one-to- 
one alignment between the lexical graphemic and 
phonemic representations, where each grapheme is 
matched with the corresponding phoneme (a null 
symbol is used to account for the cases where the 
lengths of these representations differ). This align- 
ment makes it possible to retrieve, for any graphemic 
substring of a given lexical item, the corresponding 
phonemic string, at the cost however of an unmoti- 
vated complexification of lexical representations. 
In comparison, the paradigmatic cascades model 
(PCP for short) promotes an alternative view of 
analogical processes, which relies upon a linguisti- 
cally motivated similarity measure between words. 
The basic idea of our model is to take advantage 
of the internal structure of "natural" lexicons. In 
fact, a lexicon is a very complex object, whose ele- 
ments are intimately tied together by a number of 
fine-grained relationships (typically induced by mor- 
phological processes), and whose content is severely 
restricted, on a language-dependent basis, by a com- 
plex of graphotactic, phonotactic and morphotac- 
tic constraints. Following e.g. (Pirrelli and Fed- 
erici, 1994), we assume that these constraints sur- 
face simultaneously in the orthographical and in 
the phonological domain in the recurring pattern 
of paradigmatically alternating pairs of lexical items. 
Extending the idea originally proposed in (Federici, 
Pirrelli, and Yvon, 1995), we show that it is possible 
to extract these alternation patterns, to associate 
alternations in one domain with the related alterna- 
tion in the other domain, and to construct, using this 
pairing, a fairly reliable pronunciation procedure. 
2 The Paradigmatic Cascades Model 
In this section, we introduce the paradigmatic cas- 
cades model. We first formalize the concept of a 
paradigmatic relationship. We then go through the 
details of the learning procedure, which essentially 
consists in an extensive search for such relationships. 
We finally explain how these patterns are used in the 
pronunciation procedure. 
2.1 Paradigmatic Relationships and 
Alternations 
The paradigmatic cascades model crucially relies 
upon the existence of numerous paradigmatic rela- 
tionships in lexical databases. A paradigmatic re- 
lationship involves four lexical entries a, b, c, d, and 
expresses that these forms are involved in an ana- 
logical (in the Saussurian (de Saussure, 1916) sense) 
proportion: a is to b as c is to d (hereafter ab- 
breviated as a : b = c : d, see also (Lepage and 
Shin-Ichi, 1996) for another utilization of this kind 
of proportions). Morphologically related pairs pro- 
vide us with numerous examples of orthographical 
proportions, as in: 
reactor : reaction = factor : faction (1) 
Considering these proportions in terms of ortho- 
graphical alternations, that is in terms of partial 
functions in the graphemic domain, we can see that 
each proportion involves two alternations. The first 
one transforms reactor into reaction (and factor 
into faction), and consists in exchanging the suffixes 
or and ion. The second one transforms reactor into 
factor (and reaction into faction), and consists in 
exchanging the prefixes re and f. These alternations 
are represented on figure 1. 
Figure 1: An Analogical Proportion 
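The two alternations of proportion (1) can be illustrated as follows (a minimal Python sketch; the helper names are ours, not the paper's):

```python
def exchange_suffix(word, old, new):
    """Partial function exchanging suffix old for new (None if inapplicable)."""
    return word[: len(word) - len(old)] + new if word.endswith(old) else None

def exchange_prefix(word, old, new):
    """Partial function exchanging prefix old for new (None if inapplicable)."""
    return new + word[len(old):] if word.startswith(old) else None

# the first alternation exchanges the suffixes -or / -ion:
assert exchange_suffix("reactor", "or", "ion") == "reaction"
assert exchange_suffix("factor", "or", "ion") == "faction"
# the second exchanges the prefixes re- / f-:
assert exchange_prefix("reactor", "re", "f") == "factor"
assert exchange_prefix("reaction", "re", "f") == "faction"
```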
Formally, we define the notion of a paradigmatic 
relationship as follows. Given Σ, a finite alphabet, 
and L, a finite subset of Σ*, we say that (a, b) ∈ L × L 
is paradigmatically related to (c, d) ∈ L × L iff there 
exist two partial functions f and g from Σ* to Σ*, 
where f exchanges prefixes and g exchanges suffixes, 
and: 

f(a) = c and f(b) = d (2) 
g(a) = b and g(c) = d (3) 

f and g are termed the paradigmatic alternations 
associated with the relationship a : b =_{f,g} c : d. 
The domain of an alternation f will be denoted by 
dom(f). 
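A direct reading of this definition can be sketched as follows (illustrative Python; as a simplification, only the alternations built on the maximal common prefix and suffix of the pair are tried, which suffices for the examples above):

```python
def common_prefix(x, y):
    """Longest common prefix pref(x, y)."""
    i = 0
    while i < min(len(x), len(y)) and x[i] == y[i]:
        i += 1
    return x[:i]

def common_suffix(x, y):
    """Longest common suffix suff(x, y)."""
    i = 0
    while i < min(len(x), len(y)) and x[len(x) - 1 - i] == y[len(y) - 1 - i]:
        i += 1
    return x[len(x) - i:] if i else ""

def is_related(a, b, c, d):
    """a : b = c : d, with g built on pref(a, b) and f on suff(a, c)."""
    u = common_prefix(a, b)
    if not u:
        return False
    va, vb = a[len(u):], b[len(u):]
    w = common_suffix(a, c)
    pa = a[: len(a) - len(w)] if w else a
    pc = c[: len(c) - len(w)] if w else c
    def g(x):  # suffix exchange: replace final va with vb
        return x[: len(x) - len(va)] + vb if x.endswith(va) else None
    def f(x):  # prefix exchange: replace initial pa with pc
        return pc + x[len(pa):] if x.startswith(pa) else None
    return f(a) == c and f(b) == d and g(a) == b and g(c) == d

assert is_related("reactor", "reaction", "factor", "faction")
assert not is_related("reactor", "reaction", "dog", "cat")
```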
2.2 The Learning Procedure 
The main purpose of the learning procedure is to 
extract from a pronunciation lexicon, presumably 
structured by multiple paradigmatic relationships, 
the most productive paradigmatic alternations. 
Let us start with some notations. Given G a 
graphemic alphabet and P a phonetic alphabet, a 
pronunciation lexicon L is a subset of G* × P*. The 
restriction of L to G* (respectively P*) will be noted 
L_G (resp. L_P). Given two strings x and y, pref(x, y) 
(resp. suff(x, y)) denotes their longest common pre- 
fix (resp. suffix). For two strings x and y having a 
non-empty common prefix (resp. suffix) u, f_{x,y} (resp. 
g_{x,y}) denotes the function which transforms x into y: 
as x = uv and y = ut, f_{x,y} substitutes a final v 
with a final t. ε denotes the empty string. 
Given L, the learning procedure searches L_G for 
every 4-uple (a, b, c, d) of graphemic strings 
such that a : b =_{f,g} c : d. Each match increments 
the productivity of the related alternations f and 
g. This search is performed using a slightly 
modified version of the algorithm presented in (Fed- 
erici, Pirrelli, and Yvon, 1995), which applies to ev- 
ery word x in L_G the procedure detailed in table 1. 
In fact, the properties of paradigmatic relation- 
ships, notably their symmetry, make it possible to reduce dra- 
matically the cost of this procedure, since not all 
GETALTERNATIONS(x) 
1  D(x) ← {y ∈ L_G | (t = pref(x, y)) ≠ ε} 
2  for y ∈ D(x) 
3  do 
4     P(x, y) ← {(z, t) ∈ L_G × L_G | z = f_{x,y}(t)} 
5     if P(x, y) ≠ ∅ 
6     then 
7        IncrementCount(f^s_{x,y}) 
8        IncrementCount(f^p_{x,t}) 

Table 1: The Learning Procedure 
4-uples of strings in L_G need to be examined during 
that stage. 
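A simplified version of this search can be sketched as follows (illustrative Python; unlike the paper's procedure, it only groups word pairs by the suffix swap linking them, without checking the prefix alternation of the witnessing pair):

```python
from collections import defaultdict

def learn_suffix_alternations(lexicon_g):
    """Group word pairs by the suffix swap linking them; a swap realized
    by at least two distinct pairs witnesses a proportion a : b = c : d."""
    pairs_by_swap = defaultdict(set)
    words = sorted(lexicon_g)
    for i, x in enumerate(words):
        for y in words[i + 1:]:
            k = 0
            while k < min(len(x), len(y)) and x[k] == y[k]:
                k += 1
            if k == 0:
                continue              # no common prefix: skip the pair
            pairs_by_swap[(x[k:], y[k:])].add((x, y))   # maximal prefix
    return {swap: pairs for swap, pairs in pairs_by_swap.items()
            if len(pairs) >= 2}       # keep swaps observed at least twice

alts = learn_suffix_alternations({"reactor", "reaction", "factor", "faction"})
assert alts == {("ion", "or"): {("faction", "factor"), ("reaction", "reactor")}}
```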
For each graphemic alternation, we also record 
its correlated alternation(s) in the phonological 
domain, and accordingly increment their productiv- 
ity. For instance, assuming that factor and reactor 
respectively receive the pronunciations /fæktər/ and 
/ri:æktər/, the discovery of the relationship ex- 
pressed in (1) will lead our algorithm to record that 
the graphemic alternation f → re correlates in the 
phonemic domain with the alternation /f/ → /ri:/. 
Note that the discovery of phonemic correlates does 
not require any sort of alignment between the or- 
thographic and the phonemic representations: the 
procedure simply records the changes in the phone- 
mic domain when the alternation applies in the 
graphemic domain. 
At the end of the learning stage, we have in hand 
a set A = {Ai} of functions exchanging suffixes or 
prefixes in the graphemic domain, and for each Ai 
in A: 
(i) a statistical measure p_i of its productivity, de- 
fined as the likelihood that the transform of a 
lexical item be another lexical item: 

p_i = |{x ∈ dom(A_i) and A_i(x) ∈ L_G}| / |dom(A_i)|   (4) 
(ii) a set {B_{i,j}}, j ∈ {1 ... n_i}, of correlated func- 
tions in the phonemic domain, and a statistical 
measure p_{i,j} of their conditional productivity, 
i.e. of the likelihood that the phonetic alterna- 
tion B_{i,j} correlates with A_i. 
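The productivity (4) can be computed as follows for a suffix-exchanging alternation (illustrative Python; dom(A_i) is approximated by the lexicon words the swap applies to):

```python
def productivity(old_suffix, new_suffix, lexicon_g):
    """p_i of eq. (4): among the lexicon words the swap applies to,
    the fraction whose image is itself a lexical item."""
    domain = [x for x in lexicon_g if x.endswith(old_suffix)]
    if not domain:
        return 0.0
    hits = sum(1 for x in domain
               if x[: len(x) - len(old_suffix)] + new_suffix in lexicon_g)
    return hits / len(domain)

lex = {"reactor", "reaction", "factor", "faction", "tractor"}
# -or -> -ion applies to reactor, factor, tractor; two images are lexical:
assert abs(productivity("or", "ion", lex) - 2 / 3) < 1e-9
```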
Table 2 gives the list of the phonological correlates 
of the alternation which consists in adding the suffix 
ly, corresponding to a productive rule for deriving 
adverbs from adjectives in English. If the first lines 
of table 2 are indeed "true" phonemic correlates of 
the derivation, corresponding to various classes of 
adjectives, a careful examination of the last lines re- 
veals that the extraction procedure is easily fooled 
alternation              example 
x → x-/li:/              good 
x-/t/ → x-/ədli:/        marked 
x-/əl/ → x-/əli:/        equal 
x-/əl/ → x-/li:/         capable 
x → x-/i:/               cool 
x-/i:n/ → x-/enli:/      clean 
x-/ɪd/ → x-/aɪdli:/      id 
x-/ɪv/ → x-/aɪli:/       live 
x-/θ/ → x-/ðli:/         loath 
x → x-/laɪ/              imp 
x-/ɪr/ → x-/ɜ:li:/       ear 
x-/ɒn/ → x-/əʊnli:/      on 

Table 2: Phonemic correlates of x → x-ly 
by accidental pairs like imp-imply, on-only or ear- 
early. A simple pruning rule was used to get rid of 
these alternations on the basis of their productivity, 
and only alternations which were observed at least 
twice were retained. 
It is important to realize that A makes it possible to specify 
lexical neighbourhoods in L_G: given a lexical entry 
x, its nearest neighbour is simply f(x), where f is 
the most productive alternation applying to x. Lex- 
ical neighbourhoods in the paradigmatic cascades 
model are thus defined with respect to the locally 
most productive alternations. As a consequence, 
the definition of neighbourhoods implicitly incorpo- 
rates a great deal of linguistic knowledge extracted 
from the lexicon, especially regarding morphological 
processes and phonotactic constraints, which makes 
it much more relevant for grounding the notion of anal- 
ogy between lexical items than, say, any neighbour- 
hood based on the string edit distance. 
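This definition of the nearest neighbour can be sketched as follows (illustrative Python; representing alternations as suffix swaps tagged with their productivity is our own simplification):

```python
def nearest_neighbour(x, alternations):
    """The nearest neighbour of x is f(x), for f the most productive
    alternation applying to x; alternations are (old, new, p) swaps."""
    applicable = [(old, new, p) for old, new, p in alternations
                  if x.endswith(old)]
    if not applicable:
        return None
    old, new, p = max(applicable, key=lambda a: a[2])
    return x[: len(x) - len(old)] + new

alts = [("ope", "oping", 0.4), ("ope", "op", 0.2)]
assert nearest_neighbour("slope", alts) == "sloping"
```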
2.3 The Pronunciation of Unknown Words 
Suppose now that we wish to infer the pronunciation 
of a word x, which does not appear in the lexicon. 
This goal is achieved by exploring the neighbour- 
hood of x defined by A, in order to find one or several 
analogous lexical entry(ies) y. The second stage of 
the pronunciation procedure is to adapt the known 
pronunciation of y, and derive a suitable pronuncia- 
tion for x: the idea here is to mirror in the phonemic 
domain the series of alternations which transform x 
into y in the graphemic domain, using the statistical 
pairing between alternations that is extracted dur- 
ing the learning stage. The complete pronunciation 
procedure is represented on figure 2. 
Let us examine carefully how these two aspects of 
the pronunciation procedure are implemented. The 
first stage is to find a lexical entry in the neighbour- 
hood of x defined by A. 

Figure 2: The pronunciation of an unknown word 
The basic idea is to generate A(x), defined 
as {A_i(x), for A_i ∈ A, x ∈ dom(A_i)}, which con- 
tains all the words that can be derived from x us- 
ing a function in A. This set, better viewed as a 
stack, is ordered according to the productivity of the 
A_i: the topmost element in the stack is the nearest 
neighbour of x, etc. The first lexical item found in 
A(x) is the analog of x. If A(x) does not contain 
any known word, we iterate the procedure, using 
x′, the top-ranked element of A(x), instead of x. 
This expands the set of possible analogs, which is 
accordingly reordered, etc. This basic search strat- 
egy, which amounts to the exploration of a deriva- 
tion tree, is extremely resource-consuming (every 
expansion stage typically adds about a hundred 
new virtual analogs), and is, in theory, not guar- 
anteed to terminate. In fact, the search problem is 
equivalent to the problem of parsing with an unre- 
stricted Phrase Structure Grammar, which is known 
to be undecidable. 
We have evaluated two different search strategies, 
which implement various ways to alternate between 
expansion stages (the stack is expanded by gener- 
ating the derivatives of the topmost element) and 
matching stages (elements in the stack are looked 
for in the lexicon). The first strategy implements a 
depth-first search of the analog set: each time the 
topmost element of the stack is searched, but not 
found, in the lexicon, its derivatives are immediately 
generated, and added to the stack. In this approach, 
the position of an analog in the stack is assessed as a 
function of the "distance" between the original word 
x and the analog y = A_{i_k}(A_{i_{k-1}}(... A_{i_1}(x))), accord- 
ing to: 

d(x, y) = ∏_{l=1}^{k} p_{i_l}   (5) 
The search procedure is stopped as soon as an ana- 
log is found in L_G, or else when the distance be- 
tween x and the topmost element of the stack, which 
decreases monotonically (∀i, p_i < 1), falls below a 
pre-defined threshold. 
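The depth-first strategy, with the score (5) and this stopping criterion, can be sketched as follows (illustrative Python; the (kind, old, new, productivity) tuple representation of alternations and all word forms are our own):

```python
import heapq

def apply_alt(word, alt):
    """Apply one alternation (prefix or suffix exchange) if possible."""
    kind, old, new, p = alt
    if kind == "suffix" and word.endswith(old):
        return word[: len(word) - len(old)] + new
    if kind == "prefix" and word.startswith(old):
        return new + word[len(old):]
    return None

def depth_first_analog(x, alternations, lexicon_g, threshold=1e-3):
    """Pop the best-scored derivative; if unknown, push its own
    derivatives scored by d(x, y) = product of productivities (eq. 5);
    stop on a lexical hit or when the best score falls below threshold."""
    heap = [(-1.0, x)]            # max-heap via negated scores
    seen = {x}
    while heap:
        neg_d, y = heapq.heappop(heap)
        if -neg_d < threshold:
            return None           # stopping criterion: score too low
        if y in lexicon_g:
            return y              # analog found
        for alt in alternations:
            z = apply_alt(y, alt)
            if z is not None and z not in seen:
                seen.add(z)
                heapq.heappush(heap, (neg_d * alt[3], z))
    return None

alts = [("prefix", "bl", "sl", 0.5), ("suffix", "e", "ing", 0.4)]
assert depth_first_analog("blope", alts, {"sloping"}) == "sloping"
```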
The second strategy implements a kind of com- 
promise between depth-first and breadth-first explo- 
ration of the derivation tree, and is best understood 
if we first look at a concrete example. Most alter- 
nations substituting one initial consonant are very 
productive, in English as in many other languages. 
Therefore, a word starting with, say, a p is very likely 
to have a very close derivative where the initial p 
has been replaced by, say, an r. Now suppose that 
this word starts with pl: the alternation will de- 
rive an analog starting with rl, and will assess it 
with a very high score. This analog will, in turn, 
derive many more virtual analogs starting with rl, 
once its suffixes have been substituted during 
another expansion phase. This should be avoided, 
since there are in fact very few words starting with 
the prefix rl: we would therefore like these words to 
be very poorly ranked. The second search strategy 
has been devised precisely to cope with this problem. 
The idea is to rank the stack of analogs according 
to the expectation of the number of lexical deriva- 
tives a given analog may have. This expectation is 
computed by summing up the productivities of all 
the alternations that can be applied to an analog y 
according to: 
Σ_{i | y ∈ dom(A_i)} p_i   (6) 
This ranking will necessarily assess any analog start- 
ing in rl with a low score, as very few alternations 
will substitute its prefix. However, the computation 
of (6) is much more complex than (5), since it re- 
quires examining a given derivative before it can be 
positioned in the stack. This led us to bring for- 
ward the lexical matching stage: during the expan- 
sion of the topmost stack element, all its derivatives 
are looked for in the lexicon. If several derivatives 
are simultaneously found, the search procedure halts 
and returns more than one analog. 
The expectation (6) does not decrease as more 
derivatives are added to the stack; consequently, 
it cannot be used to define a stopping criterion. 
The search procedure is therefore stopped when 
all derivatives up to a given depth (2 in our ex- 
periments) have been generated, and unsuccessfully 
looked for in the lexicon. This termination criterion 
is very restrictive, in comparison to the one imple- 
mented in the depth-first strategy, since it makes it 
impossible to pronounce very long derivatives, for 
which a significant number of alternations need to 
be applied before an analog is found. An example is 
the word synergistically, for which the "breadth- 
first" search terminates unsuccessfully, whereas the 
depth-first search manages to retrieve the "analog" 
energy. Nonetheless, the results reported hereafter 
have been obtained using this "breadth-first" strat- 
egy, mainly because this search was associated with 
a more efficient procedure for reconstructing pronun- 
ciations (see below). 
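The expectation (6) can be computed as follows (illustrative Python, reusing the tuple representation of alternations introduced above; the word shapes are those of the pl/rl example):

```python
def expectation(y, alternations):
    """Sum of the productivities of all alternations applying to y
    (eq. 6): an estimate of the number of lexical derivatives of y."""
    total = 0.0
    for kind, old, new, p in alternations:
        if (kind == "suffix" and y.endswith(old)) or \
           (kind == "prefix" and y.startswith(old)):
            total += p
    return total

alts = [("prefix", "p", "r", 0.6), ("prefix", "pl", "cl", 0.4),
        ("suffix", "e", "ing", 0.3)]
# an analog starting in rl- matches few alternations, hence a low rank:
assert expectation("plope", alts) > expectation("rlope", alts)
```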
Various pruning procedures have also been imple- 
mented in order to control the exponential growth of 
the stack. For example, one pruning procedure de- 
tects the most obvious derivation cycles, which gen- 
erate the same derivatives in loops; another prun- 
ing procedure tries to detect commuting alterna- 
tions: substituting the prefix p, and then the suffix 
s, often produces the same analog as when the alter- 
nations apply in the reverse order, etc. More de- 
tails regarding implementational aspects are given 
in (Yvon, 1996b). 
If the search procedure returns an analog y = 
A_{i_k}(A_{i_{k-1}}(... A_{i_1}(x))) in L, we can build a pronun- 
ciation for x, using the known pronunciation φ(y) 
of y. For this purpose, we use our knowledge 
of the B_{i,j}, for i ∈ {i_1 ... i_k}, and generate ev- 
ery possible transform of φ(y) in the phonological 
domain: {B^{-1}_{i_k,j_k}(B^{-1}_{i_{k-1},j_{k-1}}(... (φ(y))))}, with j_l in 
{1 ... n_{i_l}}, and order this set using some function of 
the p_{i,j}. The top-ranked element in this set is the 
pronunciation of x. Of course, when the search fails, 
this procedure fails to propose any pronunciation. 
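This reconstruction step can be sketched as follows (illustrative Python; the pseudo-phonemic ASCII strings stand in for actual phonetic symbols):

```python
def reconstruct(analog_pron, steps):
    """Apply, in reverse order, the inverse phonemic correlates of the
    graphemic alternations used to reach the analog; each step is a
    (kind, old, new) prefix or suffix exchange in the phonemic domain."""
    pron = analog_pron
    for kind, old, new in reversed(steps):
        if kind == "prefix" and pron.startswith(old):
            pron = new + pron[len(old):]
        elif kind == "suffix" and pron.endswith(old):
            pron = pron[: len(pron) - len(old)] + new
        else:
            return None   # the correlate does not apply: no pronunciation
    return pron

# The graphemic alternation re- -> f- correlates with /ri:/ -> /f/; its
# inverse derives a "re-" pronunciation from the "f-" analog's one:
assert reconstruct("fakt@r", [("prefix", "f", "ri:")]) == "ri:akt@r"
```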
In fact, the results reported hereafter use a slightly 
extended version of this procedure, where the pro- 
nunciations of more than one analog are used for 
generating and selecting the pronunciation of the un- 
known word. The reason for using multiple analogs 
is twofold: first, it obviates the risk of being wrongly 
influenced by one very exceptional analog; second, 
it enables us to model conspiracy effects more accu- 
rately. Psychological models of reading aloud indeed 
assume that the pronunciation of an unknown word 
is not influenced by just one analog, but rather by 
its entire lexical neighbourhood. 
3 Experimental Results 
3.1 Experimental Design 
We have evaluated this algorithm on two different 
pronunciation tasks. The first experiment consists 
in inferring the pronunciation of the 70 pseudo-words 
originally used in Glushko's experiments, which have 
been used as a test-bed for various other pronun- 
ciation algorithms, and allow for a fair head-to- 
head comparison between the paradigmatic cascades 
model and other analogy-based procedures. For 
this experiment, we have used the entire nettalk 
(Sejnowski and Rosenberg, 1987) database (about 
20 000 words) as the learning set. 
The second series of experiments is intended to 
provide a more realistic evaluation of our model in 
the task of pronouncing unknown words. We have 
used the following experimental design: 10 pairs of 
disjoint (learning set, test set) are randomly selected 
from the nettalk database and evaluated. In each 
experiment, the test set contains about one tenth 
of the available data. A transcription is judged to 
be correct when it matches exactly the pronuncia- 
tion listed in the database at the segmental level. 
The number of correct phonemes in a transcription 
is computed on the basis of the string-to-string edit 
distance with the target pronunciation. For each 
experiment, we measure the percentage of phoneme 
and words that are correctly predicted (referred to 
as correctness), and two additional figures, which are 
usually not significant in the context of the evaluation 
of transcription systems. Recall that our algorithm, 
unlike many other pronunciation algorithms, is likely 
to remain silent. In order to take this aspect into ac- 
count, we measure in each experiment the number 
of words that can not be pronounced at all (the si- 
lence), and the percentage of phonemes and words 
that are correctly transcribed amongst those words 
that have been pronounced at all (the precision). The 
average values for these measures are reported here- 
after. 
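The three figures can be computed as follows (a Python sketch of our own; None marks a word the algorithm stayed silent on, and the sample pronunciations are illustrative):

```python
def evaluate(predictions, targets):
    """Per-word correctness, silence rate, and precision (correctness
    restricted to the words actually pronounced)."""
    n = len(targets)
    silent = sum(1 for p in predictions if p is None)
    correct = sum(1 for p, t in zip(predictions, targets)
                  if p is not None and p == t)
    correctness = correct / n
    silence = silent / n
    precision = correct / (n - silent) if n > silent else 0.0
    return correctness, silence, precision

preds = ["slopIN", None, "fakt@r", "wrong"]
golds = ["slopIN", "tImpani", "fakt@r", "right"]
assert evaluate(preds, golds) == (0.5, 0.25, 2 / 3)
```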
3.2 Pseudo-words 
All but one of the pseudo-words in Glushko's test set could 
be pronounced by the paradigmatic cascades algo- 
rithm, and amongst the 69 pronunciations suggested 
by our program, only 9 were incorrect (that is, were 
not proposed by human subjects in Glushko's ex- 
periments), yielding an overall correctness of 85.7%, 
and a precision of 87.3%. 
An important property of our algorithm is that it 
makes it possible to identify precisely, for each pseudo-word, 
the lexical entries that have been analogized, i.e. 
whose pronunciation was used in the inferential pro- 
cess. Looking at these analogs, it appears that three 
of our errors are grounded on very sensible analo- 
gies, and provide us with pronunciations that seem 
at least plausible, even if they were not suggested in 
Glushko's experiments. These were pild and bild, 
analogized with wild, and pomb, analogized with 
tomb. 
These results compare favorably with the per- 
formances reported for other pronunciation by anal- 
ogy algorithms ((Damper and Eastmond, 1996) re- 
ports very similar correctness figures), especially if 
one remembers that our results have been obtained 
without resorting to any kind of pre-alignment be- 
tween the graphemic and phonemic strings in the 
lexicons. 
3.3 Lexical Entries 
This second series of experiments is intended to 
provide us with more realistic evaluations of the 
paradigmatic cascades model. Glushko's pseudo- 
words have been built by substituting the initial 
consonant of existing monosyllabic words, and con- 
stitute therefore an over-simplistic test-bed. The 
nettalk dataset contains plurisyllabic words, com- 
plex derivatives, loan words, etc., and makes it possible to test 
the ability of our model to learn complex morpho- 
phonological phenomena, notably vocalic alterna- 
tions and other kinds of phonologically conditioned 
root allomorphy, that are very difficult to learn. 
With this new test set, the overall performances 
of our algorithm averages at about 54.5% of en- 
tirely correct words, corresponding to a 76% per 
phoneme correctness. If we keep the words that 
could not be pronounced at all (about 15% of the 
test set) apart from the evaluation, the per word and 
per phoneme precision improve considerably, reach- 
ing respectively 65% and 93%. Again, these pre- 
cision results compare relatively well with the re- 
sults achieved on the same corpus using other self- 
learning algorithms for grapheme-to-phoneme tran- 
scription (e.g. (van den Bosch and Daelemans, 1993; 
Yvon, 1996a)), which, unlike ours, benefit from 
the knowledge of the alignment between graphemic 
and phonemic strings. Table 3 summarizes the per- 
formance (in terms of per word correctness, si- 
lence, and precision) of various other pronunciation 
systems, namely PRONOUNCE (Dedina and Nus- 
baum, 1991), DEC (Torkolla, 1993), SMPA (Yvon, 
1996a). All these models have been tested using ex- 
actly the same evaluation procedure and data (see 
(Yvon, 1996b), which also contains an evaluation per- 
formed on a French database, suggesting that this 
learning strategy effectively applies to other lan- 
guages). 
System      corr.   prec.   silence 
DEC         56.67   56.67    0 
SMPA        63.96   64.24    0.42 
PRONOUNCE   56.56   56.75    0.32 
PCP         54.49   63.95   14.80 

Table 3: A Comparative Evaluation 
Table 3 pinpoints the main weakness of our model, 
that is, its significant silence rate. The careful ex- 
amination of the words that cannot be pronounced 
reveals that they are either loan words, which are 
very isolated in an English lexicon, and for which 
no analog can be found; or complex morphological 
derivatives for which the search procedure is stopped 
before the existing analog(s) can be reached. Typical 
examples are: synergistically, timpani, hangdog, 
oasis, pemmican, to list just a few. This suggests 
that the words which were not pronounced are not 
randomly distributed. Instead, they mostly belong 
to a linguistically homogeneous group, the group of 
foreign words, which, for lack of better evidence, 
should better be left silent, or processed by another 
pronunciation procedure (for example a rule-based 
system (Coker, Church, and Liberman, 1990)), than 
incorrectly analogized. 
Some complementary results finally need to be 
mentioned here, in relation to the size of lexical 
neighbourhoods. In fact, one of our main goals was 
to define in a sensible way the concept of a lexical 
neighbourhood: it is therefore important to check 
that our model manages to keep this neighbourhood 
relatively small. Indeed, if this neighbourhood can 
be quite large (typically 50 analogs) for short words, 
the number of analogs used in a pronunciation aver- 
ages at about 9.5, which proves that our definition 
of a lexical neighbourhood is sufficiently restrictive. 
4 Discussion and Perspectives 
4.1 Related works 
A large number of procedures aiming at the auto- 
matic discovery of pronunciation "rules" have been 
proposed over the past few years: connectionist 
models (e.g. (Sejnowski and Rosenberg, 1987)), tra- 
ditional symbolic machine learning techniques (in- 
duction of decision trees, k-nearest neighbours) e.g. 
(Torkolla, 1993; van den Bosch and Daelemans, 
1993), as well as various recombination techniques 
(Yvon, 1996a). In these models, orthographical cor- 
respondences are primarily viewed as resulting from 
a strict underlying phonographical system, where 
each grapheme encodes exactly one phoneme. This 
assumption is reflected by the possibility of align- 
ing graphemic and phonemic strings on a one-to-one basis, 
and these models indeed use this kind of 
alignment to initiate learning. Under this view, the 
orthographical representation of individual words is 
strongly dependent on their phonological forms, on a 
word by word basis. The main task of a machine- 
learning algorithm is thus mainly to retrieve, on 
a statistical basis, these grapheme-phoneme corre- 
spondences, which are, in languages like French or 
433 
English, accidentally obscured by a multitude of ex- 
ceptional and idiosyncratic correspondences. There 
is undoubtedly strong historical evidence support- 
ing the view that the orthographical systems of most 
European languages developed from such a phono- 
graphical system, and languages like Spanish or Ital- 
ian still offer examples of that kind of very regular 
organization. 
Our model, which extends the proposals of (Coker, 
Church, and Liberman, 1990), and more recently, 
of (Federici, Pirrelli, and Yvon, 1995), entertains a 
different view of orthographical systems. Even if 
we acknowledge the mostly phonographical organiza- 
tion of, say, French orthography, we believe that the 
multiple deviations from a strict grapheme-phoneme 
correspondence are best captured in a model which 
somehow weakens the assumption of a strong de- 
pendency between orthographical and phonological 
representations. In our model, each domain has its 
own organization, which is represented in the form 
of systematic (paradigmatic) set of oppositions and 
alternations. In both domains, however, this orga- 
nization is subject to the same paradigmatic prin- 
ciple, which makes it possible to represent the re- 
lationships between orthographical and phonologi- 
cal representations in the form of a statistical pair- 
ing between alternations. Using this model, it be- 
comes possible to predict correctly the outcome in 
the phonological domain of a given derivation in the 
orthographic domain, including patterns of vocalic 
alternations, which are notoriously difficult to model 
using a "rule-based" approach. 
4.2 Achievements 
The paradigmatic cascades model offers an origi- 
nal and new framework for extracting information 
from large corpora. In the particular context of 
grapheme-to-phoneme transcription, it provides us 
with a more satisfying model of pronunciation by 
analogy, which: 
• gives a principled way to automatically learn 
local similarities that implicitly incorporate a 
substantial knowledge of the morphological pro- 
cesses and of the phonotactic constraints, both 
in the graphemic and the phonemic domain. 
This has allowed us to precisely define and iden- 
tify the content of lexical neighbourhoods; 
• achieves a very high precision without resorting 
to pre-aligned data, and detects automatically 
those words that are potentially the most dif- 
ficult to pronounce (especially foreign words). 
Interestingly, the ability of our model to pro- 
cess data which are not aligned makes it directly 
applicable to the reverse problem, i.e. phoneme- 
to-grapheme conversion. 
• is computationally tractable, even if extremely 
resource-consuming in the current version of 
our algorithm. The main trouble here comes 
from isolated words: for these words, the search 
procedure wastes a lot of time examining a very 
large number of very unlikely analogs, before re- 
alizing that there is no acceptable lexical neigh- 
bour. This aspect definitely needs to be im- 
proved. We intend to explore several directions 
to improve this search: one possibility is to use 
a graphotactical model (e.g. an n-gram model) in 
order to make the pruning of the derivation tree 
more effective. We expect such a model to bias 
the search in favor of short words, which are 
better represented than very long derivatives. 
Another possibility is to tag, during the learning 
stage, alternations with one or several morpho- 
syntactic labels expressing morphotactical re- 
strictions: this would restrict the domain of an 
alternation to a certain class of words, and ac- 
cordingly reduce the expansion of the analog 
set. 
4.3 Perspectives 
The paradigmatic cascades model achieves quite sat- 
isfactory generalization performances when evalu- 
ated in the task of pronouncing unknown words. 
Moreover, this model provides us with an effective 
way to define the lexical neighbourhood of a given 
word, on the basis of "surface" (orthographical) local 
similarities. It remains however to be seen how this 
model can be extended to take into account other 
factors which have been proven to influence analogi- 
cal processes. For instance, frequency effects, which 
tend to favor the more frequent lexical neighbours, 
need to be properly modeled, if we wish to give a 
more realistic account of the human performance in 
the pronunciation task. 
In a more general perspective, the notion of simi- 
larity between linguistic objects plays a central role 
in many corpus-based natural language processing 
applications. This is especially obvious in the con- 
text of example-based learning techniques, where the 
inference of some unknown linguistics property of a 
new object is performed on the basis of the most 
similar available example(s). The use of some kind 
of similarity measure has also demonstrated its effec- 
tiveness to circumvent the problem of data sparse- 
ness in the context of statistical language modeling. 
In this context, we believe that our model, which
is precisely capable of detecting local similarities in
lexicons, and of performing, on the basis of these simi-
larities, a global inferential transfer of knowledge, is
especially well suited for a large range of NLP tasks.
Encouraging results on the task of learning the En-
glish past-tense forms have already been reported in
(Yvon, 1996b), and we intend to continue to test this
model on various other potentially relevant applica-
tions, such as morpho-syntactical "guessing", part-
of-speech tagging, etc.

References 
Coker, Cecil H., Kenneth W. Church, and Mark Y. 
Liberman. 1990. Morphology and rhyming: two 
powerful alternatives to letter-to-sound rules. In 
Proceedings of the ESCA Conference on Speech 
Synthesis, Autrans, France. 
Coltheart, Max. 1978. Lexical access in simple read- 
ing tasks. In G. Underwood, editor, Strategies 
of information processing. Academic Press, New 
York, pages 151-216. 
Coltheart, Max, Brent Curtis, Paul Atkins, and 
Michael Haller. 1993. Models of reading aloud: 
dual route and parallel distributed processing ap- 
proaches. Psychological Review, 100:589-608. 
Damper, Robert I. and John F. G. Eastmond. 1996. 
Pronouncing text by analogy. In Proceedings of
the seventeenth International Conference on Com- 
putational Linguistics (COLING'96), pages 268- 
273, Copenhagen, Denmark. 
de Saussure, Ferdinand. 1916. Cours de Linguis-
tique Générale. Payot, Paris.
Dedina, Michael J. and Howard C. Nusbaum. 1991. 
PRONOUNCE: a program for pronunciation by 
analogy. Computer Speech and Language, 5:55-64.
Federici, Stefano, Vito Pirrelli, and François Yvon.
1995. Advances in analogy-based learning: false 
friends and exceptional items in pronunciation by 
paradigm-driven analogy. In Proceedings of the
IJCAI'95 workshop on 'New Approaches to Learning
for Natural Language Processing', pages 158-163, 
Montreal. 
Glushko, Robert J. 1981. Principles for pronouncing
print: the psychology of phonography. In A. M. 
Lesgold and C. A. Perfetti, editors, Interactive 
Processes in Reading, pages 61-84, Hillsdale, New 
Jersey. Erlbaum. 
Lepage, Yves and Shin-Ichi Ando. 1996. Saussurian
analogy: a theoretical account and its applica-
tion. In Proceedings of the seventeenth Interna-
tional Conference on Computational Linguistics
(COLING'96), pages 717-722, Copenhagen, Den-
mark.
Pirrelli, Vito and Stefano Federici. 1994. "Deriva- 
tional" paradigms in morphonology. In Proceed- 
ings of the sixteenth International Conference on 
Computational Linguistics (COLING'94), Kyoto, 
Japan. 
Seidenberg, Mark S. and James L. McClelland. 1989.
A distributed, developmental model of word
recognition and naming. Psychological Review,
96:523-568.
Sejnowski, Terrence J. and Charles R. Rosenberg.
1987. Parallel networks that learn to pronounce
English text. Complex Systems, 1:145-168.
Sullivan, K. P. H. and Robert I. Damper. 1992. Novel-
word pronunciation within a text-to-speech sys-
tem. In Gérard Bailly and Christian Benoît, edi-
tors, Talking Machines, pages 183-195. North Hol-
land.
Torkkola, Kari. 1993. An efficient way to learn
English grapheme-to-phoneme rules automati-
cally. In Proceedings of the International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), volume 2, pages 199-202, Minneapo-
lis, April.
van den Bosch, Antal and Walter Daelemans. 1993. 
Data-oriented methods for grapheme-to-phoneme 
conversion. In Proceedings of the European Chap- 
ter of the Association for Computational Linguis- 
tics (EACL), pages 45-53, Utrecht. 
Yvon, François. 1996a. Grapheme-to-phoneme
conversion using multiple unbounded overlapping 
chunks. In Proceedings of the conference on New 
Methods in Natural Language Processing (NeM- 
LaP II), pages 218-228, Ankara, Turkey. 
Yvon, François. 1996b. Prononcer par analogie :
motivations, formalisations et évaluations. Ph.D.
thesis, École Nationale Supérieure des Télécom-
munications, Paris.
