ENRICO CAMPANILE - ANTONIO ZAMPOLLI
PROBLEMS IN COMPUTERIZED HISTORICAL LINGUISTICS: 
THE OLD CORNISH LEXICON * 
This work represents an attempt to utilize the computer in solving 
problems in historical linguistics. 
The corpus upon which it operates is not a language but a recently 
published etymological dictionary of Old Cornish. 1 Any observations 
regarding the scarcity or inaccuracy of the data utilized are, therefore, 
irrelevant, as far as the present paper is concerned. 
As the dictionary in question was compiled according to the usual 
methods employed with such works, a detailed explanation of metho- 
dology is unnecessary. It should also be noted that Old Cornish is 
known only through glosses to Latin words, and that in this case
« Cornish gloss » is equivalent to « Cornish word ».
With the help of the computer, we have attempted to solve the 
following problems: 
a) To establish the percentage of words with and without Indo- 
European etymology in the Cornish lexicon. (Let us stress that this 
study concerns not a language but an etymological lexicon; hence, the 
presence or absence of Indo-European etymology should not be con- 
strued as a definitive characteristic of a Cornish word. Such statistics 
are, in fact, relevant only to the present state of research on the subject). 
b) To establish the degree of certainty concerning the material 
of Indo-European etymology. 
c) To evaluate the extent of the connection between elements of 
Indo-European etymology existing in the Cornish lexicon and the 
other Indo-European linguistic groups according to the degree of 
certainty of each individual etymology. 
d) To establish, on the basis of existing etymological studies of 
Old Cornish, the lines future research should follow. 
* The computational part of this research has been conducted by A. Zampolli, 
the historical linguistics part by E. Campanile. 
1 E. CAMPANILE, Profilo etimologico del cornico antico, Pisa, 1974. Also in SSL,
(1973), p. 1.
The reader will observe that the first problem is purely statistical
(though it has an obvious diachronic premise), that the second aims 
at attaining qualitative data (though they are expressed quantitatively), 
that the third concerns the area of Indo-European dialectology, and that 
the fourth has its own specific heuristic and methodological significance.
In order to accomplish these goals, the contents of the etymological
dictionary were put on cards, each of which contained the following
entries:
a) a non-Cornish word (with an indication of the language to 
which it belongs); 
b) the Cornish word related in the dictionary to the item under a) ; 
c) the type of relationship existing between item a) and item b) ; 
and whether this relationship is affirmed, denied or uncertain;
d) the indication that item b) is or is not a nominal compound 
(this being the only type of compound found in Old Cornish); 
e) in the event that item b) is a nominal compound, a break- 
down of the elements contained in it; 2 
f) the page from which the foregoing material was taken. 
With regard to item c), the possible types of relationships have 
been described (see below) according to the information supplied, 
either explicitly or implicitly, by the etymological dictionary and have 
been rated according to the following numerical system: 
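1 = the relationship between the two words is etymologically certain
2 = the relationship between the two words is very probable
3 = the relationship between the two words is probable
4 = the relationship between the two words is doubtful
5 = the relationship between the two words is not very probable
6 = the relationship between the two words is improbable
7 = the relationship between the two words is non-existent
8 = the Cornish word was borrowed from item a)
80 = the Cornish word is a calque from item a)
82 = a relationship exists between the Cornish word and item a),
but the nature of the relationship cannot be determined exactly
(that is, whether it is a matter of kinship or loan). 3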
2 Every element has been given either in the Cornish form (if it is attested elsewhere
in the text or if it is not attested only because of lack of documentation), or in
the common Celtic form or in the Indo-European form; certain diacritic signs indicate
which possibility has been chosen.
3 The distinction between borrowed words and co-radicals is that provided by the
etymological dictionaries and handbooks of historical linguistics. Since the difference
SOME EXPERIMENTS IN HISTORICAL COMPUTATIONAL LINGUISTICS 163 
9 = the Celtic co-radical of the Cornish word (this rating prevails
over ratings 1, 2 and 3 because the prime object of the present
research is Indo-European etymology rather than the Celtic
connections of Cornish).
To these eleven ratings will be added that of 0, which will not indicate
the relationship between Cornish and non-Cornish words, as in the
case of the other ratings, but will serve instead to distinguish the
non-Cornish words (actually, Cymric) which, due to the various vicissitudes
of the handwritten tradition, have crept into the authentic Cornish
glosses and which, as such, do not form part of the present study.
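The card layout and the rating scale described above can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the names Card and RATINGS and all field names are hypothetical, the rating labels are paraphrased from the list above, and nothing here is meant to reproduce the program actually used for this research.

```python
from dataclasses import dataclass
from typing import Optional

# Rating codes as listed in the text (labels paraphrased; names are hypothetical).
RATINGS = {
    0: "Cymric word intruded among the Cornish glosses (excluded)",
    1: "etymologically certain",
    2: "very probable",
    3: "probable",
    4: "doubtful",
    5: "not very probable",
    6: "improbable",
    7: "non-existent",
    8: "Cornish word borrowed from item a)",
    80: "Cornish word is a calque from item a)",
    82: "relationship exists, kinship or loan undetermined",
    9: "Celtic co-radical of the Cornish word",
}


@dataclass(frozen=True)
class Card:
    language: str                        # item a): language of the non-Cornish word
    word: str                            # item a): the non-Cornish word itself
    rating: int                          # item c): one of the codes in RATINGS
    cornish: str                         # item b): the Cornish word
    page: int                            # item f): page of the dictionary
    compound_part: Optional[str] = None  # item d): "=" or "-" for compounds
    analysis: Optional[str] = None       # item e): breakdown of the compound

    def __post_init__(self):
        # Reject any rating that is not part of the scale described in the text.
        if self.rating not in RATINGS:
            raise ValueError(f"unknown rating: {self.rating}")


# Two of the cards derived from the entry FORN.
loan = Card("lat.", "furnus", 8, "forn", 47)
cognate = Card("cim.", "ffwrn", 9, "forn", 47)
print(RATINGS[loan.rating], "/", RATINGS[cognate.rating])
```

A frozen record is used here only because a card, once written, is never modified by the later counting steps.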
The following items, taken from the etymological dictionary, and 
their ratings illustrate the preceding principles: 
FORN gl. fornax l. clibanus 920. Come il bret. fo(u)rn (ant. bret. gufor(n) gl.
clibani), il cimr. ffwrn e l'irl. sorn, è prestito dal lat. furnus. HV, 179; VG,
221; LH, 274; VB, 190.
FRIIC gl. nasus 30. Formazione in -IC (con originario valore, forse, diminutivo),
da compararsi con bret. fri « naso ». Non è da escludersi un rapporto
con formazioni (originariamente onomatopeiche) in *sr- designanti il russare
e il naso; cf. gr. ῥέγχω, arm. ṙngunk' etc. IEW, 1002.
FROT gl. alueus 737. Identico a bret. froud « torrente », cimr. ffrwd « corrente »,
irl. sruth (gen. srotha) « fiume, corrente », gall. Φρουδις (leggi Φρουτυς), tutti
da *sprutu-. Ma il confronto con lit. spriaũnas « fresco », ted. spröde « secco »
non è semanticamente convincente. Il termine sopravvive anche nell'ital.
dial. froda « torrente » (REW, 3545). VG, 35; Pokorny, Celtica 3, 1956, 308;
LH, 541; Meid, IF 65, 1960, 39; IEW, 994. 4
between the two concepts exists only as a chronological distinction, the problem is,
therefore, irrelevant. Cf. V. PISANI, Parenté linguistique, in « Lingua », (1952), p. 3 (or
Saggi di linguistica storica, Torino, 1959, p. 29) and Variazioni sul problema indoeuropeo, in
Lingue e culture, Brescia, 1969, p. 21.
4 FORN gl. fornax l. clibanus 920. Like the Breton fo(u)rn (OBr. gufor(n) gl.
clibani), the Cymr. ffwrn and the Irish sorn, it was borrowed from the Latin furnus. HV,
179; VG, 221; LH, 274; VB, 190.
FRIIC gl. nasus 30. Formation in -IC (originally, perhaps, diminutive), comparable
to Breton fri « nose ». A kinship with formations, originally onomatopoeic, in *sr-
which designate both snoring and the nose cannot be excluded; cf. Gr. ῥέγχω, Arm. ṙngunk'
etc. IEW, 1002.
FROT gl. alueus 737. Identical to Bret. froud « brook », Cymr. ffrwd « stream », Irish
sruth (gen. srotha) « river, stream », Gaul. Φρουδις (read Φρουτυς), all from *sprutu-. But
the comparison with Lith. spriaũnas « cool », Germ. spröde « dry » is not semantically
convincing. The term also survives in Italian (dial.) froda « brook » (REW, 3545). VG, 35;
Pokorny, Celtica 3, 1956, 308; LH, 541; Meid, IF 65, 1960, 39; IEW, 994.
These three paragraphs gave the following 15 cards: 
item a)            comp. 5  rating  item b)  analysis 6  page
br. fo(u)rn                 9       forn                 47
a. br. gufor(n)             9       forn                 47
cim. ffwrn                  9       forn                 47
irl. sorn                   9       forn                 47
lat. furnus                 8       forn                 47
br. fri                     9       friic                47
gr. ῥέγχω                   3       friic                47
arm. ṙngunk'                3       friic                47
br. froud                   9       frot                 47
cim. ffrwd                  9       frot                 47
irl. sruth                  9       frot                 47
gall. Φρουτυς               9       frot                 47
lit. spriaũnas              5       frot                 47
ted. spröde                 5       frot                 47
ital. dl. froda             1       frot                 47
All the words with an index of 0 were eliminated prior to the ope- 
ration. The analysis of compounds was found to be a particular prob- 
lem. When the rating was carried out, the section of the compound 
with a kinship with the non-Cornish word a) was indicated (and hence 
a numerical rating was given). For example, the following paragraph: 
HEWUIL gl. uigil 401. Composto dal prefisso celt. *so- « bene, buono » (ant.
bret. ho-, hu-, he-, ant. cimr. hi-, he-, hu-, irl. su-, so-) simile ma non identico
al scr. su-, gr. ὑ- (in ὑγιής da *su-gʷii̯ēs « che vive bene ») e da *guil « veglia »
(= cimr. gŵyl « festa », bret. goel « id. », irl. féil « id. », tutti dal tardo lat.
uīlia, per uigilia). HV, 140; VG, 214; LH, 463 e 659. 7
yielded the following 14 cards: 
a. br. ho-        =  9  hewuil  (°°so °guil)  64
a. br. hu-        =  9  hewuil  (°°so °guil)  64
5 Column reserved for information concerning nominal compounds.
6 Column reserved for the analysis of nominal compounds. 
7 HEWUIL gl. vigil 401. Composed of the Celtic prefix *so- « well, good » (Old
Bret. ho-, hu-, he-, Old Cymr. hi-, he-, ho-, hu-, Irish su-, so-), similar but not identical
to Scr. su-, Gr. ὑ- (in ὑγιής from *su-gʷii̯ēs « that lives well ») and of *guil « vigil » (=
Cymr. gŵyl « feast », Bret. goel « id. », Irish féil « id. », all from late Latin uīlia, equal to
vigilia). HV, 140; VG, 214; LH, 463 and 659.
a. br. he-        =  9  hewuil  (°°so °guil)  64
a. cim. hi-       =  9  hewuil  (°°so °guil)  64
a. cim. ho-       =  9  hewuil  (°°so °guil)  64
a. cim. hu-       =  9  hewuil  (°°so °guil)  64
irl. su-          =  9  hewuil  (°°so °guil)  64
irl. so-          =  9  hewuil  (°°so °guil)  64
scr. su-          =  5  hewuil  (°°so °guil)  64
gr. ὑ-            =  5  hewuil  (°°so °guil)  64
cim. gŵyl         -  9  hewuil  (°°so °guil)  64
br. goel          -  9  hewuil  (°°so °guil)  64
irl. féil         -  9  hewuil  (°°so °guil)  64
lat. volg. uīlia  -  8  hewuil  (°°so °guil)  64
(Note: in the preceding table, the sign = indicates that the kinship
of word (a) is with the first part of the Cornish compound; the sign
- indicates that the kinship is with the second part; the sign °° indicates
that the given form of the first member of the dissolved compound
is referable to the common Celtic period; and the sign ° indicates that
the word does not happen to be attested).
But, from the point of view of historical linguistics, it is evident
that, while guil has not been attested as an autonomous form merely
because no documentation happens to be available on the subject,
he- existed (and always has existed) only as a member of a compound.
Consequently, while guil could possibly be included among the autonomous
lexical elements of our text, he- could only be found among the
morphemes. And finally, the compound hewuil, as a creation of the
Cornish (or Celtic) age, has no precise equivalents in other Indo-European
languages, and any equivalents that happen to exist may be considered
a priori only the result of chance.
The task of analyzing compounds is further complicated by the 
presence of words (the Latin credere, for instance) that from a diachro- 
nic point of view are compounds while from a synchronic point of 
view they are not. 
For the reasons just stated, we decided to eliminate the compounds 
from the present analyses and to make them the object of a separate 
study. 
Thus, in addition to the words with a rating of 0, entries containing
the signs = and/or - have also been discarded.
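The two exclusions just described amount to a simple filter over the cards. The sketch below assumes, purely for illustration, that each card is held as a tuple (word, compound sign, rating, Cornish word); this tuple layout, the function name and the sample entry rated 0 are hypothetical and do not come from the dictionary.

```python
def keep_for_analysis(cards):
    """Drop cards rated 0 and cards marked as members of a nominal compound.

    Each card is assumed to be (word, compound_sign, rating, cornish),
    with compound_sign equal to "=", "-" or None.
    """
    return [
        card for card in cards
        if card[2] != 0 and card[1] not in ("=", "-")
    ]


cards = [
    ("lat. furnus", None, 8, "forn"),    # simple word: kept
    ("a. br. ho-", "=", 9, "hewuil"),    # first member of a compound: dropped
    ("cim. gŵyl", "-", 9, "hewuil"),     # second member of a compound: dropped
    ("cim. xxx", None, 0, "yyy"),        # hypothetical Cymric intrusion: dropped
]
print(keep_for_analysis(cards))
```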
After the words with a rating of 0 and the nominal compounds 
were discarded, the surviving Cornish material consisted of 745 ele- 
ments that, in relation to our first problem, were subdivided in the 
following way. 
words of Indo-European etymology 8          284    38 %
words borrowed from other languages 9       254    34 %
calques from other languages 10               0     0 %
uncertain kind of kinship 11                  0     0 %
words without Indo-European etymology 12    207    28 %
                                            745   100 %
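The subdivision above follows from the set of kinship indices carried by each Cornish word. The sketch below is one possible reading, not the procedure actually programmed: the function names are hypothetical, and the order in which the tests are applied is an assumption, since the text does not say how a word carrying both a positive index and an index of 8 was counted. Words with none of the indices 1, 2, 3, 8, 80 or 82 are placed in the last class.

```python
def classify(indices):
    """Assign a Cornish word to one of the five classes of the table."""
    found = set(indices)
    if found & {1, 2, 3}:
        return "Indo-European etymology"
    if 8 in found:
        return "borrowed"
    if 80 in found:
        return "calque"
    if 82 in found:
        return "uncertain kind of kinship"
    return "without Indo-European etymology"


def percentages(counts):
    """Round each class count to a whole percentage of the total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Indices carried by the three sample words forn, friic and frot.
print(classify([9, 9, 9, 9, 8]))         # forn
print(classify([9, 3, 3]))               # friic
print(classify([9, 9, 9, 9, 5, 5, 1]))   # frot

# Counts taken from the table above.
table = {"Indo-European": 284, "borrowed": 254, "calque": 0,
         "uncertain": 0, "without": 207}
print(percentages(table))
```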
With regard to the second problem, the 284 words of Indo-Euro- 
pean etymology were divided according to the degree of probability. 
The breakdown is as follows: 
words of certain etymology 13           238    84 %
words of very probable etymology 14      23     8 %
words of probable etymology 15           23     8 %
                                        284   100 %
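The degree of certainty of a word is the best (lowest) positive index it carries, as the definitions of the three classes imply. A minimal sketch follows; the function name is hypothetical and the percentages are simply recomputed from the counts given above.

```python
def certainty(indices):
    """Return 1, 2 or 3 (the best positive index), or None if there is none."""
    positive = [i for i in indices if i in (1, 2, 3)]
    return min(positive) if positive else None


print(certainty([9, 3, 3]))               # friic: probable
print(certainty([9, 9, 9, 9, 5, 5, 1]))   # frot: certain

# Shares of the 284 words of Indo-European etymology, from the counts above.
for label, n in (("certain", 238), ("very probable", 23), ("probable", 23)):
    print(label, round(100 * n / 284), "%")
```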
In order to solve the third problem, all the entries containing non- 
Cornish words correlated to one of the 284 Cornish words of Indo- 
European etymology were taken into consideration. These entries 
(742 in all) were subdivided into 17 groups according to the linguistic
kinship of the language to which the word in item (a) belongs: 
1 = Tocharian A and B
2 = Sanskrit, Avestan, Persian (Aryan group)
3 = Armenian
8 Words carrying at least one kinship index of 1, 2 or 3.
9 Words carrying a kinship index of 8. The indices 80 and 82 are found only among
the nominal compounds.
10 Words carrying a kinship index of 80. This is found only among compounds.
11 Words carrying a kinship index of 82. This is found only among compounds.
12 Words carrying a kinship index of 4 and/or 5 and/or 6 and/or 7 (possibly together with a 9).
13 Words carrying at least one kinship index of 1.
14 Words carrying no index of 1 and at least one index of 2.
15 Words carrying no indices of 1 or 2 and at least one index of 3.
4 = Hittite
5 = Phrygian
6 = Greek
7 = Macedonian
8 = Illyrian
9 = Old Slavonic, Old Czech, Russian, Ukrainian, Serbian, Middle
Bulgarian (Slavonic group)
10 = Albanian
11 = Lithuanian (old and modern), Old Prussian, Lettish (Baltic
group)
12 = Old English, Middle English, Danish, Old Icelandic, Old Gutnish,
Dialectal Norwegian, Swedish, Old High German, Middle
High German, German (modern), Longobard, Gothic, Burgundian
(Germanic group)
13 = Ligurian
14 = Oscan, Umbrian
15 = Latin, Vulgar Latin, Medieval Latin, Italian, Dialectal Italian,
Old French, Catalan, Old Spanish, Spanish (Latin and neo-Latin
group)
16 = Breton, Middle Breton, Old Breton, Cymric, Middle Cymric,
Old Cymric, Old Irish, Modern Irish, Ogamic, Scottish, Gaulish,
Galatian, Latin-Gaulish, Latin-British, Vannetais (Celtic group)
17 = Finnish, Vogulian
The reader will notice that not all Indo-European languages are 
represented here. This is due to the fact that not all Indo-European 
languages are represented in the etymological dictionary that provided 
the material for the present work. On the other hand, there are two 
non-Indo-European languages in group 17 because one Cornish word 
is thought to have a kinship with non-Indo-European words. 
Each of the 742 words has an etymological kinship with Cornish 
words that is either certain (rating 1), very probable (rating 2) or pro- 
bable (rating 3). These words were arranged into linguistic groups 
with the rank of 1 going to the group that had at least one exponent 
with a rating of 1, the rank of 2 going to the group with at least 
one exponent with a rating of 2, and the rank of 3 to the group with 
neither rating. Here are the results: 
          r. 1  r. 2  r. 3  tot.   %r. 1   %r. 2   %r. 3   % of tot.
GR. 1       12     1     0    13   0.9231  0.0769  0.0     0.0175
GR. 2       95     9     9   113   0.8407  0.0796  0.0796  0.1523
GR. 3       24     0     3    27   0.8889  0.0     0.1111  0.0364
GR. 4       11     0     1    12   0.9167  0.0     0.0833  0.0162
GR. 5        0     0     0     0   0.0     0.0     0.0     0.0
GR. 6       95     6     8   109   0.8716  0.0550  0.0734  0.1469
GR. 7        0     1     0     1   0.0     1.0     0.0     0.0013
GR. 8        1     0     0     1   1.0     0.0     0.0     0.0013
GR. 9       45     4     2    51   0.8824  0.0784  0.0392  0.0687
GR. 10      11     3     2    16   0.6875  0.1875  0.1250  0.0216
GR. 11      71     9     5    85   0.8353  0.1059  0.0588  0.1146
GR. 12     154     8     8   170   0.9059  0.0471  0.0471  0.2291
GR. 13       1     0     0     1   1.0     0.0     0.0     0.0013
GR. 14      10     0     0    10   1.0     0.0     0.0     0.0135
GR. 15     108     6    12   126   0.8571  0.0476  0.0952  0.1698
GR. 16       3     1     3     7   0.4286  0.1429  0.4286  0.0094
GR. 17       0     0     0     0   0.0000  0.0     0.0000  0.0
TOT.       641    48    53   742   0.0002  0.0
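Each row of the table can be recomputed from its three counts and the grand total of 742 entries. The sketch below shows the arithmetic for two rows; the function name is hypothetical and the rounding to four decimal places is chosen to match the figures printed in the table.

```python
def row_shares(r1, r2, r3, grand_total):
    """Return (tot, %r.1, %r.2, %r.3, % of tot.) for one linguistic group."""
    tot = r1 + r2 + r3
    if tot == 0:
        # A group with no entries has no meaningful shares.
        return (0, 0.0, 0.0, 0.0, 0.0)
    return (
        tot,
        round(r1 / tot, 4),
        round(r2 / tot, 4),
        round(r3 / tot, 4),
        round(tot / grand_total, 4),
    )


print(row_shares(95, 9, 9, 742))    # group 2, Aryan
print(row_shares(154, 8, 8, 742))   # group 12, Germanic
```

The first three shares are relative to the group itself, while the last is relative to all 742 entries, which is why it is the figure quoted when groups are compared with one another.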
It will be observed that the Celtic group appears to be poorly represented.
This is because the existing etymological kinship with Cornish words
is normally expressed by the rating 9 (and not, therefore, 1, 2 or 3),
while the rank of 1, 2 or 3 has been attributed only to those words
which, though still within the Celtic group, are part of other linguistic
traditions (words of the Celtic substratum in the Romance languages,
for example).
An analogous operation was then carried out with all the material
having a rating of 4, 5, 6 or 7 (that is, negative etymologies). The
results are as follows:
          r. 4  r. 5  r. 6  r. 7  tot.   %r. 4   %r. 5   %r. 6   %r. 7   % of tot.
GR. 1        0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
GR. 2        3     4     9    10    26   0.1154  0.1538  0.3462  0.3846  0.1300
GR. 3        0     0     5     1     6   0.0     0.0     0.8333  0.1667  0.0300
GR. 4        0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
GR. 5        0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
GR. 6        3     5     8    11    27   0.1111  0.1852  0.2963  0.4074  0.1350
GR. 7        0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
GR. 8        0     0     0     1     1   0.0     0.0     0.0     1.0000  0.0050
GR. 9        2     0     2     1     5   0.4000  0.0     0.4000  0.2000  0.0250
GR. 10       0     2     0     0     2   0.0     1.0000  0.0     0.0     0.0100
GR. 11       2     5     2     6    15   0.1333  0.3333  0.1333  0.4000  0.0750
GR. 12       1     6     5     5    17   0.0588  0.3529  0.2941  0.2941  0.0850
GR. 13       0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
GR. 14       0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
GR. 15       2     5    15    30    52   0.0385  0.0962  0.2885  0.5769  0.2600
GR. 16       2     6     8    33    49   0.0408  0.1224  0.1633  0.6735  0.2450
GR. 17       0     0     0     0     0   0.0     0.0     0.0     0.0     0.0
TOT.        15    33    54    98   200
From all this material it was possible to draw the following
conclusions:
1) In the Cornish lexicon there are 254 (34 %) lexical loan-words,
there are 284 (38 %) words with Indo-European etymologies, and
207 (28 %) without any known etymology.
2) The vast majority of the words with an Indo-European etymology
(238 out of 284 = 84 %) have an etymology that is certain,
as far as is known at the present state of research on the subject.
Another 16 % have etymologies that are either very probable (23; 8 %) or
merely probable (23; 8 %).
3) With regard to etymological kinships with non-Celtic Indo-European
linguistic groups, the closest connections are with Germanic
(0.2291), Latin (0.1698), Indo-Aryan (0.1523), Greek (0.1469) and
Baltic (0.1146). Such results appear to be extremely important
in that they confirm the innovative character of the occidental lexicon
(kinships with Germanic, Latin and, at least in part, Baltic) existing side
by side with the preservation of archaic elements in lateral areas (kinship
with Indo-Aryan). They also show strong kinships with the central area
of the Indo-European world (Greek and, at least in part, Baltic) which
have yet to be adequately assessed.
4) The highest percentages of now unacceptable relationships
suggested by scholars in the past are those with Latin (0.2600), Greek
(0.1350) and Indo-Aryan (0.1300). This, together with the fact that
these same groups have also yielded a very high percentage of acceptable
etymologies, suggests that these areas have been exhausted. As
a working hypothesis, new etymological comparisons ought now to be
considered particularly with Germanic and Baltic, which combine a
high yield with a more tolerable percentage of acknowledged errors
(0.0850 and 0.0750 respectively).
5) Of the 745 Cornish words which have supplied the material
for the present study, as many as 671, almost 90 %, bear at least an
index of 9; that is, have one or more Celtic co-radicals. This confirms
the « compact » character of the Celtic lexicon.
Moreover, there are Cornish words which have one or more indices
of 9 to the exclusion of any other index (139; 18 %). These
are words that have co-radicals exclusively in the Celtic world. On
the heuristic level, this verification gives rise to a question that is at 
the same time a working hypothesis: are they substratum words? 
The same question and the same working hypothesis also arise with 
the words where one or more indices of 9 accompany the indices 
4, 5, 6, 7: these are words with Celtic co-radicals that were formerly
thought to be of Indo-European etymology, an attribution now rejected
in the dictionary. There are 60 of them.
Our analysis, therefore, seems to suggest, too, that future linguistic 
research will find rich material for substratum studies in Cornish and,
more generally, in Celtic. 
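Conclusions 4) and 5) can both be restated as simple tests on the data. The sketch below is an illustration only: the function names are hypothetical, and the two thresholds of 0.10 are assumptions chosen so as to reproduce the authors' selection of groups, not values stated in the text. The shares are copied from the two tables (share of the 742 accepted entries, share of the 200 rejected entries).

```python
def substratum_candidate(indices):
    """True for a word with Celtic co-radicals (9) and, apart from these,
    at most negative indices (4 to 7)."""
    found = set(indices)
    return 9 in found and found <= {9, 4, 5, 6, 7}


# group: (share of accepted entries, share of rejected entries)
SHARES = {
    "Indo-Aryan": (0.1523, 0.1300),
    "Greek": (0.1469, 0.1350),
    "Baltic": (0.1146, 0.0750),
    "Germanic": (0.2291, 0.0850),
    "Latin": (0.1698, 0.2600),
}


def promising(shares, min_yield=0.10, max_error=0.10):
    """Groups with a high yield and a tolerable share of rejected etymologies.

    Both thresholds are hypothetical.
    """
    return sorted(
        group for group, (accepted, rejected) in shares.items()
        if accepted >= min_yield and rejected <= max_error
    )


print(substratum_candidate([9, 9]))      # only Celtic co-radicals
print(substratum_candidate([9, 5, 7]))   # Celtic co-radicals plus rejected etymologies
print(substratum_candidate([9, 1]))      # has a certain Indo-European etymology
print(promising(SHARES))
```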
