Incorporating Metaphonemes in a Multilingual Lexicon 
Carole Tiberius and Lynne Cahill 
Information Technology Research Institute 
University of Brighton 
Brighton, UK 
{Carole.Tiberius, Lynne.Cahill}@itri.brighton. ac .uk 
Abstract English 
bed 
This paper describes a framework for multilingual /bEd/ 
inheritance-based lexical representation which al- rib 
lows sharing of information across s at /rib/ 
all levels of linguistic description. The paper fo- hand 
cuses on phonology. It explores the possibility /h{nd/ 
of establishing a phoneme inventory for a group cat 
of s in which -specific phonemes /k{t/ 
function as "allophones" of newly defined recta- 
phonemes. Dutch, English, and German were taken 
as a test bed and their vowel phoneme inventories 
were studied. The results of the cross-linguistic 
analysis are presented in this paper. The paper con- 
cludes by showing how these metaphonelnes can be 
incorporated in a multilingual lexicon. 
1 Introduction 
This paper describes a framework for multilingual 
inheritance-based lexical representation which aP 
lows sharing of information across (related) hm- 
guages at all levels of linguistic description. Most 
work on multilingual lexicons up to now has as- 
sumed mouolingual lexicons linked only at the level 
of semantics (MUI_TILEX 1993; Copestake et al. 
1992). Cahill and Gazdar (1999) show that this 
approach might be appropriate for unrelated lan- 
guages, as for example English and Japanese, but 
that it makes it impossible to capture useful gener- 
alisations about related s - such as English 
and German. Related s share many linguis- 
tic characteristics at all levels of description - syn- 
tax, morphology, phonology, etc. - not just seman- 
tics. For instance, words which come fl'om a single 
root have very similar orthographic and phonologi- 
cal forms. Compare English, Dutch, and German1: 
IThe lranscriptions are taken from CELEX (Baayen et al. 
1995) and use tile SAMPA phonetic alphabet (Wells 1989). 
Dutch Gernmn 
bed Bett 
/bEt/ /bEt/ 
rib Rippe 
/rip/ /rip@/ 
hand ltand 
/hAnt/ /hant/ 
kat Katze 
/kAt/ /kats @/ 
Most differences can be attributed to dil'ferent 
orthographic conventions and regular phonological 
changes (e.g. final devoicing in Dutch and German). 
The English/{I, the Dutch/AI, and the German/a/ 
in the last two exmnples, are even virtually the same. 
They have slightly different realisations but they are 
phonologically non-distinctive, i.e. if the Dutch/A/ 
were substituted by the English/{/in Dutch, the re- 
sult would not be a different word, but it would sim- 
ply sound like a different accent. 
Cahill and Gazdar (I 999) describe an architecture 
for nmltilingual lexicons which aims to encode and 
exploit lexical similarities between closely related 
s. This architecture has been successfully 
applied in the PolyLex project 2 to define a trilingual 
lexicon for Dutch, English, and German sharing 
morphological, phonological, and lnorphophono- 
logical information between these s. 
in this paper, we will take the Polykex fiame- 
work as our basis. We will focus on the phono- 
logical similarities between related hmguages and 
we will extend the PolyLex approach by capturing 
cross-linguistic phoneme correspondences, such as 
the/{/-/A/-/a/correspondence mentioned above 3. 
First, we will discuss how a phoneme inventory 
can be defined for a group of s - l)utch, 
2http://www.cogs. susx. ac.uk/lab/nlp/polylex/ 
3We believe the approach would be even more beneficial if 
exlended to a featural level, but for tile present purposes we 
conline ourselves to the segmental level. 
1126 
English, and German. Then, we will explain tile 
multilingual architecture used in PolyLex. Finally, 
we will explore how these cross-linguistic phoneme 
correspoudences can be integrated into tile multilin- 
gual frmnework. 
2 A Metaphoneme Inventory 
In this section we describe how a phoneme inven- 
tory can be defined for a group of s in 
which -specific phonemes flmction its "al- 
lophones" of newly defined metaphonemes. We will 
restrict ourselves to the vowel phonemes of l)utch, 
English, and Gerlnan. If we know, for example, 
that words which are realised with an /{I in En- 
glish are usually realised with an/A/in I)utch, and 
an/a/in German (as in hand/h{nd/ versus/hAnt/ 
w:rsus/hant/, cal/k{t/versus/kArl versus/kats(@/, 
elc.), we might be able to generalise over these three 
hmguage-specific phonemes and introduce a meta- 
phoneme, e.g. I{Aa\], which captures this generali- 
sation. 
To give an impression of the distribution of the 
different vowel phonemes across l)utch, English, 
and German, their vowel charts (K6nig and van der 
Auwera 1994; Wells 1989) were merged into one 
big vowel chart containing all the vowel phonemes 
of these three hmguages. 4, The resulting char| is 
given iu tigure 15: 
I;ronl I:hlck 
2: \ 
e \ 9 .~: 
:\ \ v 
I ..... a: \____ _ ) AA: 
a: A 
I1." 
IJ 
if; 
O: 
o: i - l)ulch 
\[~: i - English 
i -Gerrnan 0 
(2 
Figure 1: Vowel phonemes in Dutch, English, and 
German 
This figure shows which vowel phonemes are re- 
atised in which  (e.g./{/occurs in English, 
but not in l)utch and German), but it does not tell us 
4phonemes that only occur in loanwords were not i~lcluded 
a.'; s adapt loanwords to different degrees to their own 
phonetic syslem. 
5The w)wels are described along the three dimensions of 
w)wel quality: \[high\], \[back\], and \[round\]. The rounded w~wels 
are/y,y:,Y,Y,2:,2:,9,O,O,O, O:,o:,o:,u,u:,tr:,U, UI. 
anything about cross-linguistic phoneme correspon- 
deuces. Knowing that Dutch and German both have 
a phoneme/o:/, does not mean that they are cross- 
linguistically non-distinctive. 
qb find cross-linguistic phoneme correspon- 
deuces, we followed O'Connor's (1973) strategy 
for establishing phonelne conespondences between 
difl'erent accents, identifying phonemes of one ac- 
cent with those el' another: 
"How are we to decide whether to equate 
phoneme X with phoneme A or with 
phoneme D? We can do so only on the 
basis el' the words in which they occur: 
if X and A both occur in a large number 
of words common to both accents we link 
them together as representing the same 
point on the pattern, if, on the other hand, 
X shares more words with D than with A, 
we linkXandD. \[...\] Even so, ifXand 
D occur in a very similar word-set and X 
and A do not, then it is much more reveal- 
ing to equate X and D than X and A." 
(O'Connor 1973, p. 186) 
We extended O'Connor's strategy and applied it 
to a group of (closely) related hmguages sharing 
a colnmou word stock - in our case a sllbset of 
the West Gmmanic s sharing worcls with 
a common Germanic origin. We compiled a list 
of g00 (mono- and disyllabic) Germanic cognates, 
looked up the transcriptions in the CELEX database 
(Baayen el al. 1995), and then mapped words con- 
tainiug a palticular vowel in one hmguage onto its 
cognates in the other two hmguages to see how this 
particular vowel was realised in tile other two lan- 
guages. This process was repeated for all the vow- 
els, for all three s. 
A few examples of tile results we obtained for En- 
glish vowels are included below c'. 
As can be seen fl'om these v, tlaere is some vari- 
ation in the closeness of the correspondences. The 
vowel set/{/-/A/-/a/, as we anticipated at the out- 
set, does turn out to be a wflid correspondence. The 
set associated with English/i:/, on the other hand, 
is less clearcut, as there are several possible cor- 
r'The remaining correspondence tables are available at 
http://www, itri .bton.ac.uk/~Carole.'i~iberius/ 
mphon, html 
7Note that the total number o1' words is not always exactly 
the same in all lhree hmguages. This is because for some words 
the con'esponding phonemic transcription was not found. 
1127 
English 
{ 37 
Dutch Gernmn 
A 27 a 22 
a: 3 a: 3 
E 2 E 3 
} 2 I 2 
o: 2 e: 1 
u: 1 O 1 
o: 1 
u: 1 
l: l 
total 37 total 35 
Table 1: Correspondences for English/{/ 
in hand/h{nd/vs/hAnt/vs/hant/. 
words as 
English 
i: 65 
Dutch German 
a: 14 a: 12 
o: 11 i: 8 
e: 9 ai 7 
i: 8 e: 5 
u: 7 y: 5 
I 5 au 5 
E 4 I 5 
EI 3 o: 4 
l: 2 a 3 
/I i E 3 
A 1 u: 3 
O 2 
E: 1 
Y 1 
I: \] 
total 65 total 65 
Table 2: Colxespondences for English/i:/words as 
in meal/mi:l/vs/ma:l/vs/ma:l/and deep/di:p/vs 
/di:p/vs/ti:ff. 
responding vowel phonemes in the other two lan- 
guages. If we consider the correspondences from 
the starting point of one of the other s, the 
results are slightly different. For instance, English 
/A:/ corresponds strongly to Dutch/A/, but Dutch 
/A/ corresponds ahnost equally to English/(/and 
/A:/. Further investigation is required to ascertain 
how many of these cases can be further generalised 
by recourse to phonological or phonotactic proper- 
ties of the words in question. Currently the mapping 
from metaphoneme to (-specific) phoneme 
requires reference only to the . For a more 
English 
A: 31 
Dutch German 
A 19 a 15 
a: 4 a: 5 
E 4 E 5 
O 2 e: 2 
e: 1 E: 1 
El 1 U 1 
Y 1 
ai 1 
total 31 total 31 
Table 3: Correspondences for English/A:/words as 
in heart/hA:T/vs/hArt/vs/hart/. 
Dutch 
A 77 
English German 
{ 25 a 53 
A: 17 a: 9 
ell 10 E 6 
O: 8 I 3 
Q 4 ai 1 
@U 4 e: 1 
u: 2 
E 2 
3: 2 
i: 1 
I 1 
aI 1 
total 77 total 73 
Table 4: Correspondences for Dutch/A/words as in 
hand (hand) and hart (hem't). 
sophisticated analysis, phonological and phonotac- 
tic information would need to be considered as well. 
Howcvel; even at the present level of analysis, the 
metaphoneme principle can be helpful in the mul- 
tilingual lexical structure proposed, as we now dis- 
CUSS. 
3 The multilingual inheritance lexicon 
In this section, we will explore the sharing of phono- 
logical information in the lexical entries of a mul- 
tilingual inheritance-based lexicon. We focus on 
phonology rather than orthography as phonology is 
nearer to primary  use (i.e. spoken lan- 
guage), it can be used as input for hyphenation rules, 
spelling correction, and it is essential as the level of 
symbolic representation for speech synthesis (MUD 
TILEX 1993). 
1128 
We will take the multilingual architecture of 
PolyLex as our starting point. First, we will describe 
the PolyLex arclaitecture. Then, we will show how 
phonological information can be shared in the lexi- 
cal entries. 
PolyLex detines a multilingual inheritance-based 
lexicon for l)utch, English and German. It is 
implemented in DATR, an inheritance-based lexi- 
cal knowledge representation formalism (Evans and 
Gazdar 1996). The rationale of inheritance-based 
lexicons requires information to be pushed as far up 
the hierarchy as it can go, generalising as much as 
possible. In a multilingual lexicon, this means that 
information which is common to several s 
is stated at higher points in the hierarchy than that 
which is unique to just one of the s. In 
addition, Polykex makes use of orthogonal multiple 
inheritance which allows a node in the hierarchy to 
inherit different kinds of information (e.g. seman- 
tics, morphology, phonology, syntax) fi'om different 
parent nodes. In this papen we are just interested in 
the phonological hierarchy. 
Polykex assumes a contemporary phonological 
fralnework in which all lexical entries are detined 
as having a phonological structure consisting of a 
sequence of structured syllables, a syllable consist- 
ing o1' an onset (the initial consonant cluster, which 
might be split up into onset 1, onset 2, etc.) and a 
rhylne. The rhyme consists of a peak (the vowel) 
and a coda (the final consonant cluster, which might 
bc split up into coda 1, coda 2, etc.). This struc- 
ture is defined at the top el' the hierarchy, and ap- 
plies by default to all words. Only the relevant val- 
ues for onset, peal<, and coda have to be defined at 
the individual lexical entries (see Cahill and Gazdar 
1!)97). Following PolyLex we will concentrate on a 
segmental phonelnic representation. An example of 
the lexical entry gram as it would be represented in 
PolyLex, is shown in figure 2. 
The multilingual phonological entry for gram, is 
delined by sharing identical segments occnrring in 
the majority of the -specific entries (/gr{m/ 
-/xrAm/-/gram/). That is, onset 1 is/g/, onset 2 is 
/1"/, and coda is/m/. 
English and German can inherit all the informa- 
tion fiom the common part except for the value of 
their peak, which is respectively /{/ and /a/. In 
Dutch, the value of the peak has to be specified as 
being/A/, plus we will have to override the wdue 
for the first onset to get \[xrAm\]. 
This example misses the generalisation that the 
Peak = { 
Collln'lon 
MGram: 
Onset 1 = g 
Onset 2 = r 
Coda = Ill 
English Dutch German 
Onset I = x 
l'cak = A Peak = a 
Syllable 
Onset Rhyme ® /X 
Peak Coda 
© ® 
Figure 2: A multilingual inheritance lexicon with- 
out metaphonemes 
English/{/, the Dutch/A/, and the German/a/are 
phonologically non-distinctive. For each lexical ca- 
try where English uses/{/, l)utch/A/, and German 
/a/, the value for peak has to be specitied in the 
-specific parts. By using the metaphoneme 
I{Aal instead, this information needs to be speci- 
fied only once. The resulting multilingual phonemic 
representation for gram is given in ligure 3. 
M_,;, ...... Coma,on ll~ 
()I1SCl 2 =I" ~ 
Peak = {Aa ~ f~ 
Coda =~;~ 
English l)ulch German 
()llSCI "- X 
Figure 3: A multilingual inheritance lexicon with 
iYletaphonelneS 
All the information has now been pushed up as 
far as it can go, capturing as many generalisations 
as possible. The information that \]{Aa\] results ill 
an/{/in English, an/A/in Dutch, and an/a/in Ger- 
man is specified only at the top level. The - 
specitic boxes are almost empty, except for the value 
of the first onset in Dutch. The reason for this is 
that as yet we have only defined cross-linguistic 
phoneme correspondences for vowels, not for con- 
sonants. We do, howevm, suspect that the Dutch/x/ 
is phonologically non-distinctive fi'om the German 
and English /g/. Further research defining cross- 
linguistic phoneme correspondences for consonants 
1129 
will have to confirm this. 
It is a fundamental feature of this account that 
the inherited information is only default informa- 
tion which can be overridden. Thus, it is not re- 
quired that metaphoneme correspondences are com- 
plete and we may choose to use a metaphoneme 
even if one of the s uses a different vowel 
in some words. The definitions can be overridden 
in exactly the same way as the onset definition in 
Dutch in the example above. So if we consider the 
vowel correspondences in table 1, we can see that 
of the 35 words which have cognates in all three 
s, 27 can be defined as having the meta- 
phoneme \[{Aa I in the common lexical entry (those 
for which both English and l)utch have the corre- 
sponding vowels). Five of these will require a sep- 
arate vowel defined for Gerlnan, while the remain- 
der will need separate vowel definitions for all three 
s. 
Given this, we can see that economy of rep- 
resentation can be achieved even in cases where 
the vowel correspondences are far from conclusive. 
Even if only half or fewer of the Dutch words, for 
example, have the same vowel in cognates for which 
the English words have the same vowel, this still 
means that those half can be defined without the 
need for the -specific vowel to be defined. 
Another feature of the metaphoneme principle 
that differentiates it from the phonemic principle 
is that there is no requirement for biuniqueness. 
A phoneme in a  can be a realisation of 
morn than one metaphoneme. This means that we 
can define a metaphoneme I{Aa\[ as well as another, 
IA:Aal. Each of these will then be used in different 
common lexical entries. This can be used as an al- 
ternative to phonological/phonotactic conditioning 
or in addition to it, for just those cases where there 
is more than one correspondence but no obvious 
phonologicai/phonotactic conditioning for the deci- 
sion between phonemes. 
4 Conclusion 
In this paper, we have discussed the concept of 
metaphonemes. Metaphonemes are cross-linguistic 
phoneme correspondences such as the English/{/, 
the Dutch/A/, and the German/a/correspondence 
inentioned above. At the lnultilingual level, the 
realisation of the metaphoneme is conditioned by 
the choice of . At the lower monolingual 
level its realisation as an allophone of a particular 
phoneme is conditioned by the phonological envi- 
romnent. As such, a metaphoneme is a generalisa- 
tion of a generalisation. 
We have shown how a metaphoneme inventory 
can be defined for a group of s and that 
incorporating these cross-linguistic phoneme corre- 
spondences in a multilingual inheritance lexicon in- 
creases the number of generalisations that can be 
captured. Calculations on the syllable inventories of 
Dutch, English, and German in the CELEX database 
show that the introduction of metaphonemes in- 
creases the amount of sharing at the syllable level 
by about 25%. 
Another benefit of introducing metaphonemes is 
improved robustness in NLP systems. Knowledge 
about cross-linguistic commonalities can help to 
provide grounds for making an "intelligent" guess 
when a lexical item for a particular  is not 
present. 
Tiffs research has concentrated on cross-linguistic 
vowel phoneme correspondences. Similar research 
will be done for consonants. 

References 

Baayen, 1t., R. Piepenbrock and It. van Rijn. 1995. The CELEX 
Lexical Database, Release 2 (CD-ROM). Linguistic l)ata 
Consortium, University of Pennsylvania, Philadelphia, PA. 

Cahill, L. and G, Gazdar. 1997. "The inllectional phonology of 
Gerlnan adjectives, determiners and pronouns", In Linguis- 
tics, 35.2, pp.211-245. 

Cahill, L. and G. Gazdar. 1999. "The Polykex architecture: 
multilingual lexicons for related s", In 7)witement 
Automatique des l~mgues, 40:2, pp.5-23. 

Copestake, A., B. Jones, A. Sanfilippo, H. Rodriguez, P. 
Vossen, S. Montemagni, and E. Marinai. 1992. "Multi- 
lingual Lexical Representation". ESPRIT BRA-3030 AC- 
QUILEX Working Paper N ° 043. 

Evans, R. and G. Gazdar. 1996. "DATR: A Language for Lexi- 
cal Knowledge Representation", In Contlmtational Linguis- 
tics, Vol. 22-2, pp.167-216. 

Kfnig, E. and J. van der Auwera (eds.) 1994. The Germanic 
lJtnguages, Routledge, London. 

MULTILEX, 1993. "MLEX,I Standards for a Multi functional 
Lexicon", Final Report, CAP GEMINI INNOVATION for 
tt~e MULTILEX Consortium, Paris. 

O'Connor, J.l). 1973. Phonetics, Pelican Books, Great Britain. 

Wells, J. 1989. "Computer-coded phonemic notation of indi- 
vidual s of lhe European Community", In Journal 
qf the International Phonetic Association, 19:1, pp.31-54. 
