Computational dialectology in Irish Gaelic 
Brett Kessler* 
Department of Linguistics 
Stanford University 
Stanford CA 94305-2150 USA 
kessler@csli.stanford.edu 
Abstract 
Dialect groupings can be discovered ob- 
jectively and automatically by cluster 
analysis of phonetic transcriptions such 
as those found in a linguistic atlas. The 
first step in the analysis, the computa- 
tion of linguistic distance.between each 
pair of sites, can be computed as Leven- 
shtein distance between phonetic strings. 
This correlates closely with the much 
more laborious technique of determining 
and counting isoglosses, and is more ac- 
curate than the more familiar metric of 
computing Hamming distance based on 
whether vocabulary entries match. In 
the actual clustering step, traditional ag- 
glomerative clustering works better thaxl 
the top-down technique of partitioning 
around medoids. When agglomerative 
clustering of phonetic string comparison 
distances is applied to Gaelic, reasonable 
dialect boundaries are obtained, corre- 
sponding to national and (within Ire- 
land) provincial boundaries. 
1 Introduction 
Defining dialects is one of the first tasks that lin- 
guists need to pursue when approaching a lan- 
guage. Knowing the dialect areas helps one allo- 
cate resources in language research and has impli- 
cations for language learners, publishers, broad- 
casters, educators, and language planners. Un- 
fortunately, dialect definition can be a time- 
consuming and ill-defined process. The traditional 
approach has been to plot isoglosses, delineating 
regions where the same word is used for the same 
concept, or perhaps the same pronunciation for 
the same phoneme. But isoglosses are frustrat- 
ing. The first problem, as Gaston Paris noted 
(apud Durand, 1889:49), is that isoglosses rarely 
coincide. At best, isoglosses for different features 
approach each other, forming vague bundles; at 
*I thank Martin Kay, Paul Kiparsky, Tom Wasow, 
and a reviewer for helpful comments. 
worst, isoglosses may cut across each other, de- 
scribing completely contradictory binary divisions 
of the dialect area. That is, language may vary ge- 
ographically in many dimensions, but the require- 
ments we usually impose require that a specific 
site be placed in a unique dialect. Traditional di- 
alectological methodology gives little guidance as 
to how to perform such reduction to one dimen- 
sion. 
A second problem is that many isoglosses do not 
neatly bisect the language area. Often variants 
do not neatly line up on two sides of a line, but 
are intermixed haphazardly. More importantly, 
for some sites information may be lacking, or the 
question is simply not applicable. When compar- 
ing how various sites pronounce the first conso- 
nant of a particular word, it is meaningless to ask 
that question if the site does not use that word. So 
the isogloss is incomplete and cannot be meaning- 
fully compared with isoglosses based on different 
sets of sites. 
The third problem is that most languages have 
dialect continua, such that the speech of one com- 
munity differs little from the speech of its neigh- 
bouts. Even though the cumulative effects of such 
differences may be great when one considers the 
ends of the continua (such as southern Italian ver- 
sus northern French), still it seems arbitrary to 
draw major dialect boundaries between two vil- 
lages with very similar speech patterns. Such co- 
nundrums led Paris and others to conclude that 
the dialect boundary, and therefore the very no- 
tion of dialect, is an ill-defined concept. 
More recently, the field of dialectometry, as in- 
troduced by S~guy (1971, 1973), has addressed 
these issues by developing several techniques for 
summarizing and presenting variation along mul- 
tiple dimensions. They replace isoglosses with a 
distance matrix, which compares each site directly 
with all other sites, ultimately yielding a single 
figure that measures the linguistic distances be- 
tween each pair of sites. There is however no firm 
agreement on just how to compute the distance 
matrices. Sfiguy's earliest work (1971) was based 
on lexical correspondences: sites differed in the 
extent to which they used different words for the 
60 
same concept. S~guy (1973), Philps (1987), and 
Durand (1989) use some combination of lexical, 
phonological, and morphological data. Babitch 
(1988) described the dialectal distances in Aca- 
dian villages by the degree to which their fishing 
terminology varied. Babitch and Lebrun (1989) 
did a similar analysis based on the varying pro- 
nunciation of/r/. Elsie (1986) grouped the Gaelic 
dialects on the basis of whether the vocabulary 
matched. Ebobisse (1989) grouped the Sawabantu 
languages of Cameroon by whether phonologi- 
cal correspondences in matching vocabulary items 
were complete, partial, or lacking. There seems to 
be a certain bias in favour of working with lexical 
correspondences, which is understandable, since 
deciding whether two sites use the same word for 
the same concept is perhaps one of the easiest lin- 
guistic judgements to make. The need to figure 
out such systems as the comparative phonology of 
various linguistic sites can be very time-consuming 
and fraught with arbitrary choices. 
Not all dialectometrists agree on the wisdom of 
delineating dialect areas. Sdguy (1973:18) insisted 
that the concept of dialect boundaries was mean- 
ingless, and his emphasis on the gradience of lan- 
guage similarity has been widely maintained. Bu't 
those who do look for firna dialect affiliations (such 
as Babitch and Ebobisse) use bottom-up agglom- 
erative techniques. The two linguistically closest 
sites are grouped into one dialect, and thence- 
forth treated as a unit. The process continues 
recursively until all sites are grouped into one su- 
perdialect embracing the entire language area un- 
der consideration. This yields a binary tree. But 
Kaufman and Rousseeuw (1990:44) suggest that 
when the emphasis in a clustering problem is on 
the top-level clusters--here, finding the two main 
dialects--then such bottom-up methods, which 
can potentially introduce error at each of several 
steps, are less reliable than top-down partitioning 
methods. Perhaps past researchers have used in- 
ferior bottom-up techniques simply because the 
necessary algorithms are computationally more 
tractable. Comparing all possible pairs of sites is a 
O(N 2) problem, 1 whereas considering all possible 
two-way partitions of tile dialect area is 0(2 N). 
The current state of dialectometry thus presents 
two main questions which constitute the method- 
ological focus of this paper. The first deals with 
distance matrices. Is there a way to build ac- 
curate distance matrices that minimize editorial 
decisions without discarding relevant data? My 
research suggests that this may be done by us- 
ing string distances computed directly on pho- 
netic transcriptions, and that this is better than 
restricting the study to lexical comparisons. The 
second deals with clustering. Do bottom-up or 
1The overall algorithm is O(N 3) since for each ue~v 
group one must compute the distances between it and 
each of the other sites or groups. 
top-down techniques work best? My conclusion 
is that the traditional bottom-up technique works 
better than a typical top-down method. These 
conclusions are based partly on an analysis of 
the mathematical properties of the clusters them- 
selves, partly on how well they correlate with 
analyses based on more traditional isogloss tech- 
niques, and partly on how well they compare with 
previously-published descriptions of dialects in a 
specific language, Irish Gaelic. 
At one time the Gaelic languag~ group was spo- 
ken throughout Ireland, from where it spread to 
the Isle of Man and to much of Scotland. Cur- 
rently fully native use of Gaelic is limited to a few 
discontiguous areas in the westernmost reaches of 
Ireland and Scotland. In the case of Ireland, ev- 
eryone agrees that Gaelic is nowadays found in 
three main dialects: that of Ulster, that of Con- 
nacht, and that of Munster (0 Siadhail, 1989). 
But several questions are raised that are less eas- 
ily answered. Do the three provinces separate out 
so neatly for intrinsic linguistic reasons, or simply 
because their speakers have become so widely sep- 
arated from each other geographically as speakers 
in intervening areas have adopted English? Does 
the language of Connacht naturally group with 
that of Ulster or with that of Munster? And look- 
ing beyond Ireland, many have commented that 
the language of Ulster in general is similar to that 
of Scotland. Are Irish, Manx, and Scottish Gaelic 
considered three separate languages for intrinsic 
linguistic reasons, or because they are spoken in 
different countries? To a large extent, dialectolo- 
gists have found these questions difficult to an- 
swer because they accepted Paris's conundrum. 
For 6 Siadhail, the ultimate scientific justifica- 
tion ill adopting the three-dialect account is the 
fact that the Gaeltacht (Irish-speaking territory) 
is so fragmented nowadays that it no longer forms 
a continuum. O Cuiv (1951:4-49) felt that there 
can be no dialect boundaries because transitions 
are gradual. Elsie (1986:240) considers a dialect to 
be an area where all communities are linguistically 
more similar to each other than any community is 
to any site outside the dialect. Such notions pro- 
vide a very firm, absolute notion of dialecthood: 
a set of communities either constitutes a dialect 
area, or it does not. But as the dialectometrists 
have shown, other notions of clustering are equally 
scientific and may more accurately correspond to 
intuitive notions of what it means to be a dialect. 
2 Data 
The data for nay study were taken from Wagner 
1958. Wagner administered a questionnaire to na- 
tive speakers of Irish Gaelic in 86 sites. 2 Most of 
the informants were over seventy years old and 
2The atlas also maps for Kilkenny some informa- 
tion gathered from another source. 
61 
had not spoken Irish since their youth, The at- 
las is therefore an approximate reconstruction of 
the linguistic landscape of the turn of the century, 
when the Gaeltacht was more continuous. Wag- 
ner also presents material from the Isle of Man and 
seven sites in Scotland. The mapped entries are 
presented in a very narrow phonetic transcription 
based on the International Phonetic Alphabet. 
Volume 1 of Wagner 1958 consists of 300 maps, 
plotting about 370 concepts. I used the first 51 
concepts, or about 4500 different string tokens, as 
part of an ongoing project to enter all of the atlas 
into machine readable format. These 51 concepts 
were represented by 312 different Gaelic words or 
phrases, whose stems derived from 171 different 
etymons. 
3 Methodology and results 
3.1 Distance matrices 
To form a baseline for comparison, I analysed the 
distribution of each of the 51 plotted concepts, 
finding a total of 3,337 features by which two or 
more sites differed. For example, for the concept 
'sell', I identified two sets, one using the word 
d(ol (most sites in Ireland), and one using the 
word creic (Rathlin Island, the Isle of Man, and 
Scotland). The dialects partitioned in a different 
way according to how much stress they placed on 
the verb relative to the pronoun in 'I sold' (even 
stress in Dunlewy and four of the Scottish sites, 
else extra stress on the verb). Not all partitions 
covered the entire map. In this example, only 
sites that used the word d(ol were compared on 
the basis of whether a schwa developed in the se- 
quence \[i:l\]. In some cases the divisions were more 
than two-way: for example, Wagner distinguishes 
whether the final consonant in creic is unpalatal- 
ized, palatalized, or slightly palatalized. Distance 
between sites was determined by counting 0 when- 
ever two sites were in the same set and 1 whenever 
two sites were in contrasting sets, then taking the 
mean. This baseline approach corresponds for- 
mally to determining distance by the number of 
isoglosses that separate sites, which is in principle 
the traditional technique. 
This baseline was compared to several other ap- 
proaches. The etymon identity metric averaged 
the number of times the sites agreed in using 
words whose stem had the same ultimate deriva- 
tion. For example, the dialects differed as to 
whether they used some form of bull- or damh- 
for the word 'bullock'. Etymon identity is one 
of the more common approaches in dialectom- 
etry; Elsie for example used it in his study of 
the Gaelic dialects (1986). Closely related is the 
idea of word identity, where the words are not 
counted the same unless all of their morphemes 
agree. In this analysis, sites that used some form 
of the word bulldn, with the suffix -dn, were dis- 
tinguished from those using the suffix -6g. 
Another set of approaches for computing dis- 
tance was based on the phonetics. This com- 
puted the Levenshtein distance between phonetic 
strings. The Levenshtein distance is the cost of 
the least expensive set of insertions, deletions, or 
substitutions that would be needed to transform 
one string into the other (Sankoff and Kruskal, 
1983). The simplest technique used was phone 
string comparison. In this approach, all opera- 
tions cost 1 unit. Thus in comparing the forms 
\[AL:i\] and \[aLi\] for eallaigh 'cattle', the (minimal) 
distance was 2, for the substitutions \[a\]/\[A\] and 
\[L:\]/\[L\]. (For this measure, diacritics such as the 
length mark ':' were counted as part of the letter, 
and different diacritics were adjudged to make for 
different letters.) A pair of unrelated words like 
\[AL:i\] and \[khruh\] (for crodh, another word for 'cat- 
tie') would get a much larger score, 5. 
In the above technique, very small phonetic 
differences, such as that between a moderately 
palatalized and a very palatalized \[t\], count the 
same as major differences, such as that between 
a \[t\] and an \[e\]. It would seem to be more ac- 
curate to assign a greater distance to substitu- 
t.ions involving greater phonetic distinctions. Un- 
fortunately I know of no comprehensive study on 
the differences between phones, at least not for 
all 277 contrasts made by Wagner. Instead I dis- 
tinguished them on the basis of twelve phonetic 
features that systematically account for all of the 
distinctions in Wagner's inventory: nasality, stric- 
ture, laterality, articulator, glottis, place, palatal- 
ization, rounding, length, height, strength, and 
syllabicity. The features were given discrete ordi- 
nal values scaled between 0 and 1, the exact values 
being arbitrary. For example, place took on the 
values glottal=O, uvular=0.1, postvelar=0.2, ve- 
lar=0.3, prevelar=O.~, palatal=0.5, alveolar=0.7, 
dental=0.8, and labial=l. The distance between 
any two phones was judged to be the difference 
between the feature values, averaged across all 
twelve features. These distances were used instead 
of uniform 1-unit costs in computing Levenshtein 
distance. The resulting metric was called feature 
string comparison. 
It could be argued that it is meaningless to com- 
pare the phonetics of different words, as in the case 
of eallaigh vs. crodh mentioned above. Therefore 
the feature string comparison was also computed 
only for pairs of citations that used the same word, 
so that \[AL:i\] vs. \[aLi\] would be compared, but 
\[AL:i\] vs. \[khruh\] would be ignored. The differ- 
ent approaches are called all-word vs. same-word 
feature string comparisons. 
All of these distance matrices were compared 
with the isogloss matrix, to see which of them 
gives results closest to that base method. I used 
two different methods of comparison, Pearson's p 
computed between all corresponding cells in the 
62 
Table 1: Correlation of distance matrices to the 
isogloss distance matrix 
p K~ 
Phone string comparison .95 .76 
Feature string comparison 
-- Ml-word .92 .70 
-- same-word .91 .69 
Etymon identity .85 .61 
Word identity .84 .63 
two matrices, and 
Kc = Average(sign((Xij - Xik)(Yij - Yik)) 
which is a derivative of Kendall's r that Dietz 
(1983) empirically found particularly accurate as 
a test statistic for comparing distance matrices, a 
Table 1 shows that the two measures give paral- 
lel results. More importantly, it shows that the 
approaches based on string comparisons of the 
phonetic transcriptions correlate more highly with 
the isogloss approach than do the word or etymon 
identity measures. Furthermore, comparing whole 
phone letters works better than the more sophi.s- 
ticated technique of comparing features, and re- 
stricting comparison to pairs based on the same 
words does not make the latter any better. 
Of course, I do not expect that this technique 
using flat 1-unit costs will prove superior to all 
methods that are more sensitive to phonetic de- 
tails. Feature comparison may work better if fea- 
tures were weighted differentially, or if the numeric 
values they assume were assigned less arbitrarily, 
or if the Manhattan-style distance computation 
were replaced by some formula that did not as- 
sume that the features are independent of each 
other. An ideal comparison would be based on 
data telling how likely it is for the one phone to 
turn into the other in the course of normal lan- 
guage change. In the method described here, Is\] 
is adjudged closer to \[g\] than to \[h\]. But \[s\] of- 
ten changes into \[hi in the world's languages, and 
so the pair should have a small distance; whereas 
the change of \[s\] to \[g\] has never occurred to my 
knowledge, and so should have a very large dis- 
tance. The unfortunate fact that such ideal data 
are lacking is compensated for by the fact that the 
inexpensive phone-string comparison employed in 
this study performs quite well. 
3.2 Clustering techniques 
The traditional agglomerative technique for clus- 
tering has been described above. There is some 
3That is, for each site i, one considers all other 
pairs of sites, j and k, and asks whether the linguistic 
difference between i and j is greater or less than that 
between i and k. One counts 1 if the answer is the 
same for both distance matrices, -1 if it is different. 
//'c is the average of these numbers. 
variation in how the distance between two clus- 
ters is measured. For this study I used the average 
distance between all pairs of elements that are in 
different clusters. I compared agglomeration to a 
top-down method that Kaufman and Rousseeuw 
(1990) call partitioning around medoids. This 
model reduces the 0(2 Jr) intractability of top- 
down approaches discussed above by dramatically 
reducing the number of binary partitions that are 
considered. Here one seeks to divide the sites into 
two groups by finding the two representative sites 
(the medoids) around which all the other sites 
cluster in such a way as to give the least average 
distance between the sites and their representa- 
tives. This is therefore a O(N 3) algorithm, com- 
parable in efficiency to agglomeration. The pro- 
cess was repeated recursively on each dialect. 
One way of measuring how well a binary clus- 
tering technique works for dialect grouping is to 
compare for each site i its average dissimilarity 
from the other sites in the same dialect, a(i), with 
its average dissimilarity from the sites in the other 
dialect, b(i). Kaufman and Rousseeuw (1990:83- 
86) define the statistic s(i) to be 1 - a(i)/b(i) if 
a(i) is less than b(i), otherwise b(i)/a(i) - 1. The 
statistic thus ranges from 1 (perfect fit) to -1 (site 
i would perfectly fit in the other group). Plotting 
this statistic gives a silhouelte by which the eye 
call judge how well classified each site is. Aver- 
aging this statistic across all sites gives an idea of 
how felicitous the overall clustering is, ~. 
Figures 1-2 present the silhouette for clustering 
the isogloss distance matrix by partitioning. This 
analysis produces a large dialect which groups to- 
gether the sites in Munster, Scotland, the Isle of 
Man, and almost all sites in Connacht, as well 
as Rathlin Island in Ulster; and another which 
groups together all the other sites in Ulster, as 
well as County Cavan in Connacht. Although the 
Ulster group is fairly tight, with an g of 0.41, the 
other group has a more anemic ~ of 0.25, with 
the sites outside of Munster and Southern Con- 
nacht being indifferently classified. The weighted 
for both groups comes out at 0.29. By com- 
parison, Figures 3-4 show what happens when 
the traditional agglomerative technique is used. 
The dialects of Scotland and the Isle of Man 
form a cluster with a great deal of internal di- 
versity ($ = 0.12), and all the sites in Ireland 
form another cluster averaging ~ = 0.37, with only 
Rathlin Island being indifferently classified. The 
weighted average is 0.35, which is superior to that 
of the partitioning technique. 
The same comparative results obtain for almost 
all of the distance measuring techniques. Table 2 
shows that the ~ for the first binary split is usually 
appreciably higher for agglomeration than it is for 
partitioning. This result suggests not any inferi- 
ority of top-down techniques in general--applying 
the g statistic to all binary partitions would by 
63 
**** Kilkenny, Kilkenny, Leinster, Ireland 
**** Lough Attorick, Galway, Connacht, Ireland 
*** Doolin, Clare, Munster, Ireland 
*** Fanore, Clare, Munster, Ireland 
*** Clear Island, Cork, Munster, Ireland 
*** Skibbereen, Cork, Munster, Ireland 
*** Kinvra, Galway, Connacht, Ireland 
*** Coomhola, Cork, Munster, Ireland 
*** Cloghane, Kerry, Munster, Ireland 
*** Kilgarvan, Kerry, Munster, Ireland 
*** Waterville, Kerry, Munster, Ireland 
*** Killorglin, Kerry, Munster, Ireland 
*** Glandore, Cork, Munster, Ireland 
*** Carraroe, Galway, Connacht, Ireland 
*** Dursey Sound, Cork, Munster, Ireland 
*** Careeny, Galway, Connacht, Ireland 
*** BMlymacoda, Cork, Munster, Ireland 
*** Newbridge, Galway, Connacht, Ireland 
*** Kilsheelan, Wateriord, Munster, Ireland 
*** Conakilty, Cork, Munster, Ireland 
*** Craughwell, Galway, Connacht, Ireland 
-*** Coolea, Cork, Munster, Ireland 
*** Rosmuck, Galway, Connacht, Ireland 
*** Glenflesk, Kerry, Munster, Ireland 
*** Mount Melleray, Waterford, Munster, Ireland 
*** Cornamona, Galway, Connacht, Irel,~nd 
*** Dunquin, Kerry, Munzter, Ireland 
*** Tourmakeady, Mayo, Connacht, Ireland 
*** Ring, Waterford, Munster, Ireland 
*** Laughanbeg, Galway, Connacht, Ireland 
*** Emlaghmore, Galway, Connacht, Ireland 
** Lauragh, Kerry, Munster, Ireland 
** Kilbaha, Clare, Munster, Ireland 
** Moycullen, Galway, Connacht, Ireland 
** Sliabh gCua, Waterford, Munster, Ireland 
** Kilmovee, Mayo, Connacht, Ireland 
** Goatenbridge, Tipperary, Munster, Ireland 
** Colmanstown, Galway, Connacht, Ireland 
** Inisheer, Galway, Connacht, Ireland 
** Glentrasna, Galway, Connacht, Ireland 
** Letterfrack, Galway, Connacht, Irela, nd 
** Angliham, Galway, Connacht, Ireland 
** Annaghdown, Galway, Connacht, Ireland 
** Lough NMooey, Galway, Connacht, Ireland 
** Sonnagh, Galway, Connacht, Ireland 
** C~.rna, Galway, Connacht, Ireland 
** Louisburgh, Mayo, Connacht, Ireland 
** Inishmaan, Galway, Connacht, Ireland 
** Camderry, Galway, Connacht, Ireland 
** Ceathrtl an TMrbh, Roscommon, Connacht, Ire. 
** Carnmore, Galway, Connacht, Ireland 
** Ballycastle, Mayo, Connacht, Ireland 
** Cashel, Galway, Connacht, Ireland 
** Tobercurry, Sligo, Connacht, Ireland 
** BMlyglunin, Galway, Connacht, Ireland 
** Aclare, Sligo, Connacht, Ireland 
** Belderg, Mayo, Connacht, Ireland 
* Dohooma, Mayo, Connacht, Ireland 
* Portacloy, Mayo, Connacht, Ireland 
* Blacksod, Mayo, Connacht, Ireland 
* Kintyre, Argyll, Scotland 
Isle of Man 
Slievenakilla, Leitrim, Connacht, Ireland 
Inveraray, Argyll, Scotland 
Arran, Bute, Scotland 
Curraun Peninsula, Mayo, Connacht, Ireland 
Achill, Mayo, Connacht, Ireland 
LochMsh, Ross and Cromarty, Scotland 
Assynt, Sutherland, Scotland 
Carloway, Lewis, Ross and Cromarty, Scotland 
Benbecula, Inverness, Scotland 
BMlyconnell, Sligo, Connacht, Ireland 
Rathlin Island, Antrim, Ulster, Ireland 
Figure 1: Silhouette for the first top-level binary 
dialect grouping computed on the isogloss dis- 
tance matrix via partitioning. Stars represent rel- 
ative s(i). Locations in Ireland are cited by local- 
ity, county, province, and country. 
***** Kildarragh, Donegal, Ulster, Ireland 
**** Creeslough, Donegal, Ulster, Ireland 
**** Glenvar, Donegal, Ulster, Ireland 
**** Loughanure, Donegal, Ulster, Ireland 
**** Letterm&caward, Donegal, Ulster, Ireland 
**** Beflaght, Donegal, Ulster, Ireland 
**** Kingarroo, Donegal, Ulster, Ireland 
**** Croaghs, Donegal, Ulster, Ireland 
=*** Aranmore, Donegal, Ulster, Ireland 
**** Gortahork, Donegal, Ulster, Ireland 
**** Downings, Donegal, Ulster, Ireland 
**** Tory Island, Donegal, Ulster, Ireland 
**** Dunlewy, Donegal, Ulster, Ireland 
*** RannMast, Donegal, Ulster, Ireland 
*** Meenacharvy, Donegal, Ulster, Ireland 
*** Teelin, Donegal, Ulster, Ireland 
*** Ardara, Donegal, Ulster, Ireland 
*** Ballyhooriskey, Donegal, Ulster, Ireland 
*** Creggan, Tyrone, Ulster, Ireland 
*** Clonmany, Donegal, Ulster, Ireland 
** Omeath, Louth, Ulster, Ireland 
* Glangevlin, Cavan, Connacht, Ireland 
Figure 2: Silhouette for the second dialect group- 
ing computed on the isogloss distance matrix via 
partitioning. 
* Carloway, Lewis, Ross and Cromarty, Scotland 
* Benbecula, Inverness, Scotland 
* Assynt, Sutherland, Scotland 
* Inveraray, Argyll, Scotland 
* LochMsh, Ross and Cromarty, Scotland 
* Kintyre, Argyll, Scotland 
* Arran, Bute, Scotland 
Isle of Man 
Figure 3: Silhouette for isogloss dialect grouping 
using agglomerative clustering, first group. 
defiuition reveal the optimal split--nor of parti- 
tioning around medoids in general. Rather, it 
appears that the assumption behind this parti- 
tioning heuristic, that a site will be closer to the 
medoid of its own group than to the medoid of 
the other group, often fails to hold true in dialee- 
tology. The lack of clean breaks between dialects 
and the fact that dialects of the same language 
may vary greatly in diameter (i.e., maximal intra- 
group distances) both mean that the assumption 
will often be invalid. 
3.3 Gaelic dialects 
Since agglomeration is the better clustering tech- 
nique, the best dialect analysis should be ob- 
tained by agglomerating the isogloss matrix. The 
best automated approximation should be agglom- 
Table 2: Statistic g for the top-level binary dialect 
division, comparing partitioning around medoids 
and agglomeration for the different distance ma- 
trices. 
Isoglosses 
Phone string comparison 
Feature string comparison 
---- all-word 
---- same-word 
Etymon identity 
Word identity 
Part. Aggl. 
0.287 0.345 
0.185 0.322 
0.252 0.353 
0.219 0.401 
0.363 0.478 
0.370 0.309 
64 
**** Colmanstown, Galway, Connacht, Ireland 
**** Moycullen, Galway, Connavht, Ireland 
**** Ceathrd an Tairbh, Roscommon, Connacht, Ire. 
**** Ballyconnell, Sligo, Connacht, Ireland 
**** Carnmore, Galway, Connacht, Ireland 
**** Annaghdown, Galway, Connacht, Ireland 
**** Lough Nafooey, Galway, Connacht, Ireland 
**** Glentrasna, Galway, Connacht, Ireland 
**** Ballycastle, Mayo, Connacht, Ireland 
**** Carraroe, Galway, Connacht, Ireland 
**** Cornamona, Galway, Connacht, Ireland 
**** Aclare, Sligo, Connacht, Ireland 
**** Emla:ghmore, Galway, Connacht, Ireland 
**** Dohooma, Mayo, Connacht, Ireland 
**** BMlyglunin, Galway, Connacht, Ireland 
**** Sonnagh, Galway, Connacht, Ireland 
**** Cashel, Galway, Connacht, Ireland 
**** Curraun Peninsula, Mayo, Connacht, Ireland 
**** Craughwell, Galway, Connacht, Ireland 
'~*** Belderg, Mayo, Connacht, Ireland 
**** Portacloy, Mayo, Connacht, Ireland 
**** Angliham, Galway, Connacbt, Ireland 
**** Blacksod, Mayo, Connacht, Ireland 
**** Laughanbeg, Galway, Connacht, Ireland 
**** Kinvra, Galway, Connacht, Ireland 
**** Coomhola, Cork, Munster, Ireland 
**** Camderry, Galway, Connacht, Ireland 
**** Cloghane, Kerry, Munster, Ireland 
**** Rosmuck, Galway, Connacht, Ireland 
**** Newbridge, Galway, Connacht, Ireland 
**** Lough Attorick, Galway, Connacht, Ireland 
**** Kilmovee, Mayo, Connacht, Ireland 
**** Achill, Mayo, Connacht, Ireland 
*** Dursey Sound, Cork, Munster, Ireland 
*** Tourmakeady, Mayo, Connacht, Ireland 
*** Letterfrack, Galway, Connacht, Ireland 
*** Clear Island, Cork, Munster, Ireland 
*** Skibbereen, Cork, Munster, Ireland 
*** Glandore, Cork, Munster, Ireland 
*** Tobercurry, Sligo, Connacht, Ireland 
*** Glenflesk, Kerry, Munster, Ireland 
*** Fanore, Clare, Munster, Ireland 
*** Careeny, Galway, Connacht, Ireland 
*** Doolin, Clare, Munster, Ireland 
*** Killorglin, Kerry, Munster, Ireland 
*** Dunquin, Kerry, Munster, Ireland 
*** Louisburgh, Mayo, Connacht, Ireland 
*** Kilsheelan, Waterford, Munster, Ireland 
*** Waterville, Kerry, Munster, Ireland 
*** Kilgarvan, Kerry, Munster, Ireland 
*** C~rna, Galway, Connacht, Ireland 
*** Ballymacoda, Cork, Munster, Ireland 
*** Conakilty, Cork, Munster, Ireland 
*** Kilbaha, Clare, Munster, Ireland 
*** Coolea, Cork, Munster, Ireland 
*** Lauragh, Kerry, Munster, Ireland 
*** Glangevlin, Cavan, Connacht, Ireland 
*** Sliabh gCua, Waterford, Munster, Ireland 
*** Mount Melleray, Waterford, Munster, Ireland 
*** Kingarroo, Donegal, Ulster, Ireland 
*** Goatenbridge, Tipperary, Munster, Ireland 
*** Downings, Donegal, Ulster, Ireland 
*** Croaghs, Donegal, Ulster, Ireland 
*** Inishmaan, Galway, Connacht, Ireland 
*** Glenvar, Donegal, Ulster, Ireland 
*** Ring, Waterford, Munster, Ireland 
*** bettermacaward, Donegal, Ulster, Ireland 
*** Kildarragh, Donegal, Ulster, Ireland 
*** Inisheer, Galway, Connacht, Ireland 
*** Creeslough, Donegal, Ulster, Ireland 
*** Gortahork, Donegal, Ulster, Ireland 
*** Eleflaght, Donegal, Ulster, Ireland 
*** Kilkenny, Kilkenny, Leinster, Ireland 
** Ardara, Donegal, Ulster, Ireland 
** Rannafast, Donegal, Ulster, Ireland 
** Aranmore, Donegal, Ulster, Ireland 
** Dunlewy, Donegal, Ulster, Ireland 
** Meenacharvy, Donegal, Ulster, Ireland 
** boughanure, Donegal, Ulster, Ireland 
** Slievenakilla, Leitrim, Connacht, Ireland 
** BMlyhooriskey, Donegal, Ulster, Ireland 
** Omeath, Louth, Ulster, Ireland 
** Creggan, Tyrone, Ulster, Ireland 
** Clonmany, Donegal, Ulster, Ireland 
** Teelin, Donegal, Ulster, Ireland 
** Tory Island, Donegal, Ulster, Ireland 
Rathlin Island, Antrim, Ulster, Ireland 
Figure 4: Silhouette by agglomeration, Irish 
group. 
erating the distance matrix computed by pho- 
netic string comparison, and indeed the top-level 
topologies produced by both techniques are vir- 
tually identical. Both group into one loosely- 
connected entity all the sites in Scotland, and into 
another all the sites in Ireland. The isogloss ap- 
proach groups Manx as a cousin of the Scottish 
dialects, and the phonetic approach makes it a 
cousin of the Irish dialects, but in both cases the 
s of Manx is very small (less than 0.06), making it 
essentially intermediate between the two groups. 
Thus the popular notion that Scottish, Irish and 
Manx Gaelic are distinct entities is well supported 
by the analyses. Both analyses group Rathlin Is- 
land very weakly with the rest of Irish, but the s 
for Rathlin is so low (less than 0.09) that its group- 
ing too is essentially arbitrary. This aligns with 
the fact that authorities disagree as to whether it 
is a dialect of Irish (as does Wagner) or of Scot- 
tish (O'Rahilly 1932:191). Except for Rathlin Is- 
land, both methods group the Irish sites into one 
group containing all the sites in Ulster, and an- 
other, Southern, group, which itself breaks into 
a group containing all the sites in Connacht and 
one containing all the sites in Munster. 4 Both 
methods agree on how the 87 sites are distributed 
among these dialects. This three-way division ac- 
cords with the universal perception that Ulster, 
Connacht and Munster form the three major di- 
alect groups. The special status of Ulster con- 
tradicts the position of O'Rahilly (1932:18) that 
Connacht groups with Ulster to form a Northern 
dialect over against Munster. But it agrees with 
Elsie's finding (1986:255) that that province is lex- 
icostatistically more remote from Connacht and 
Munster than those two are from each other. Fur- 
thermore, Hindley reports (1990:63) that speakers 
of other dialects often switch off radio broadcasts 
in Ulster Irish, 'which is very distinctive'. 
Thus the classification of the major Gaelic di- 
alects gives the same general results by both dis- 
tance metrics, if one discounts Manx and Rathlin 
Island Gaelic, which are flagged as indifferent in 
both schemes. It is encouraging that the resultant 
dialect areas are continuous, align with traditional 
provincial boundaries, and agree with commonly 
accepted taxonomies. However, dialect groupings 
at narrower levels, such as the exact subgrouping 
of the major provincial dialects, are at this point 
unstable. This is no doubt to be explained by the 
fairly small number of mapped concepts on which 
the distance metrics are based (51). 5 As language 
4The one site ill Co. Cavan is intermediate between 
tile Ulster and the Southern group. Wagner also gives 
two sites in Leinster. The more southern site, in 
Kilkenny, groups with the Southern group, and the 
more northern site, in Co. Louth, groups with Ulster, 
and indeed the county used to be considered part of 
that province. 
5S~guy (1973) cites empirical research suggesting 
65 
differences get smaller, one expects that more data 
will be required in order to elucidate them. 
4 Conclusions 
This experiment shows that an automatic proce- 
dure can reliably group a language into its di- 
Meet areas, starting from nothing more than pho- 
netic transcriptions as commonly found in linguis- 
tic surveys. Accurate distance matrices, corre- 
sponding highly to those obtained by the tedious 
uncovering of thousands of isoglosses, can be ob- 
tained by averaging the Levenshtein distance be- 
tween phonetic strings, weighting equally all inser- 
tion, deletion, and substitution operations on the 
constituent phones. This turns out to be quite 
a bit more precise than the common technique 
of measuring distances by judging etymon iden- 
tity, and requires even less editorial work. That 
phonetic comparison is more precise is not par- 
ticularly surprising, since etymon identity ignores 
a wealth of phonetic, phonological, and morpho- 
logical data, whereas comparing phones has the 
side effect of also counting higher-level variation: 
if words differ in morphemes, their phonetic dif- 
ference is going to be high. As for clustering the 
sites into dialect areas, the familiar bottom-up ag- 
glomeration method proves superior to top-down 
partitioning around medoids. 
Of course simply knowing the dialect areas is 
not the last word in dialectology. There remain 
such essential problems as identifying the differ- 
ing linguistic structures that characterize the di- 
alects, and discovering their history. But all of 
these tasks will be greatly facilitated by a quick 
and accurate grouping of the dialects. 
References 
Babitch, Rose Mary (1988). "The areal structure 
of three syntagmatic variables in the terminol- 
ogy of Acadian fishermen." In Papers from the 
Tenth Annual Meeting of the Atlantic Provinces 
Linguistic Association, pages 121-137, Univer- 
sity College of North Wales, August 1987. Mul- 
tilingual Matters, Clevedon, Avon. 
Babitch, Rose Mary, and Lebrun, Eric (1989). 
"Dialectometry as computerized agglomerative 
hierarchical classification analysis." Journal of 
English Linguistics, 22:83-90. 
Dietz, E. Jacquelin (1983). "Permutation tests 
for association between two distance matrices." 
Systematic Zoology, 32:21-26. 
Durand, J.-P. (1889). "Notes de philologie rouer- 
gate, 18". Revue des langues romanes, 33:47- 
84. 
that general dialectometry requires about a hundred 
concepts. 
Ebobisse, Carl (1989). "Dialectom&rie lexi- 
cale des parlers sawabantu." Journal of West 
African Languages, 19,2:57-66. 
Elsie, Robert (1986). Dialect relationships in 
Goidelic: A study in Celtic dialectology. IIel- 
mut Buske, Hamburg. 
Hindley, Reg (1990). The death of the Irish lan- 
guage. Routledge, London. 
Kaufman, Leonard, and Rousseeuw, Peter J. 
(1990). Finding groups in data: An introduc- 
tion to cluster analysis. John Wiley & Sons, 
New York. 
() Cufv, Brian (1951). Irish dialects and Irish 
speaking districts. Dublin Institute for Ad- 
vanced Studies. 
O'Rahilly, Thomas Francis (1972). Irish dialects 
past and present. Dublin Institute for Advanced 
Studies. 
6 Siadhail, Miche~l (1989). Modern Irish: Gram- 
matical structure and dialectal variation. Cam- 
bridge University Press, Cambridge, Mass. 
Philps, Dennis (1987). "La relation entre distance 
linguistique et distance spatiale dans l'Atlas lin- 
guistique de la Gascogne." In Acres du Premier 
Congrds b~ternational de l'Association Interna- 
tionale d'Etudes Occitanes, 551-569. Westfield 
College, University of London. 
Saakoff, David, and Kruskal, Joseph B., eds. 
(1983). Time warps, string edits, and macro- 
molecules: The theory and practice of sequence 
comparison. Addison-Wesley, Reading, Mass. 
S@guy, Jean (1971). "La relation entre la distance 
spatiale et la distance lexieale." Revue de lin- 
guistique romane, 35:335:-357. 
S@guy, Jean (1973). "La dialectom&rie dans 
l'Atlas linguistique de la Cascogne." Revue de 
linguistique romane, 37:1-24. 
Wagner, tIeinrich (1958). Linguistic atlas and 
survey of Irish dialects. Dublin Institute for 
Advanced Studies, 1958-1969.4 vol. 
66 
