Electronic Dictionaries and Linguistic Analysis of Italian Large Corpora 
Slmonetta Vletn and Anmbale Eha 
24 March, 1999 
Word Count 2 429 
Abstract 
In thts paper we wdl show how Itahan electronic dictionaries have been budt within the methodological 
framework of Lexicon-grammar We wdl see the structure of electromc d~ctlonanes of simple and 
compound words, and we wdl show how to analyse texts employing these hngmst~c tools within INTEX. 
a morphological analyser Finally, we wdl show how electromc grammars (budt w~th INTEX) interact 
with dlctlonarles and allow recogmctlon of sequences of slmple and compound words m large corpora 
0. Introduction 
We present the system of Itahan morphological dictionaries (the DELI system) which has been developed 
at the Department of Commumcatl0n Science of the Umverslty of Salerno We wdl see how these 
dictionaries can be employed m order to index a text Finally, we w~ll examine the construction of local 
grammars whlch, interacting with dictionaries, allow precise tagging of sequences of words 
1. The Italian Electronic Dictionaries 
The DELI system contains several electronic dictionaries of simple words and of compound words The 
electromc dictionary of s~mple words, named DELAS, contains about 100 000 Itahan entries to which an 
alphanumencal code has been assigned This code refers to the grammatlcal category of the word and to 
~ts mflect~onal paradigm What follows is an example of the DELAS d~ct~onary 
dottore N80 
cortese A79 
amare V3 
dl PREP 
lentamente AVV 
the noun (N) dottore is given above m mascuhne singular canomc form The adjective (A) cortese is m 
the mascuhne smgular canomc form Verbs (V) are listed m the mflmtwe form, as amare Those items 
which do not inflect are assigned a code indicating only the grammatical category, as above shown for the 
preposition dt and the adverb lentamente The numerical code refers to specific mflechonal algorithms 
For example, code 80, assoclated to nouns, corresponds to the endings 
N80 
ms ~ mp fp 
-e -essa -i -esse 
thus indicating that all nouns as dottore, i e campzone, professore, etc, inflect by adding to the root -e for 
the mascuhne smgular (ms), -essa for the feminine singular (fs), -l for the mascuhne plural (mp) and -esse 
for the feminine plural (fp) On the other hand, adjectives encoded A79, as cortese but also trtbale, etc, 
can be described by the following reflectional model 
91 
Ii 
A79 
ms fs mp fp 
-e -e -i -1 
m which the mascuhne and feminine singular (ms and Is) on one side, and the mascuhne and feminine 
plural (rap and fp) on the other, correspond to the homographic forms cortese and cortest 
The algorithm for verbs is more complicated since it has to refer to 40 forms referring to simple tenses 
Therefore, verbs like amare, abbandonare, tmparare, etc, are encoded as V3 which corresponds to the 
following endings and grammatical values 
V3 
md/pr(3o,31,3a,31amo,3ate,3ano) 
lmp(3avo,3avl,3ava,3avamo,3avate,3avano) 
pass r(3at,3astt,3b,3ammo,3aste,3arono) 
fut s(3erb,3eral,3erL3eremo,3erete,3eranno) 
lmperat(-,3a,31,31amo,3ate,3tno) 
cong/pr(31,31,31,3mmo,3mte,3mo) 
imp(3assl,3assl,3asse,3asslmo,3aste,3assero) 
cond/pr(3erel,3erestl,3erebbe,3eremmo,3ereste,3erebbero) 
part/pr(3ante,3antl) 
pass(3ato,3ata,3at~,3ate) 
ger/pr(3ando) 
For example, the first line of the list above mdIcates that In order to form the Indlcatwe Present (md/pres) 
of the verb amare, It IS necessary to delete the last three characters of the infinitive form of the verb and 
add o for the first singular person, t for the second singular person, a for the third singular person, zamo 
for the first plural, ate for the second plural, and finally ano for the thlrd plural 
Using DELAS and its mflecnon codes, a software allows to automatically generate all mflected forms 
The result will be the electromc dictionary of reflected forms of Italian simple words, named DELAF, 
which has the followmg structure 
ama,amare V3 Imper2s 
ama,amare V3 IndPres3s 
amal,amare V3 IndPassls 
amammo,amare V3 IndPass I p 
amando,amare V3 Ger 
amano,amare V3 IndPres3p 
amante,amare V3 PartPres ms ts 
amantl,amare V3 PartPres mp fp 
amare V3 Inf 
amarono,amare V3 IndPass3p 
amasse,amare V3 Conglmp3s 
amassero,amale V3 Conglmp3p 
amass1 amare V3 Conglmp ls 
amassl amare V3 Conglmp2s 
amasslmo,amare V3 Conglmp 1 p 
amaste,amare V3 Conglrnp2p 
amaste,amare V3 IndPass2p 
amastl,amare V3 IndPass2s 
amata,amare V3 PartPass fs 
amate,amare V3 Imper2p 
amate,amare V3 PartPass fp 
92 
amate,amare V3 IndPres2p 
amau,amare V3 PartPass mp 
amato,amare V3 PartPass ms 
amava,amare V3 IndImp3s 
amavamo,amare V3 IndImp 1 p 
amavano,amare V3 IndImp3p 
amavate,amare V3 IndImp2p 
amavl,arnare V3 IndImp2s 
amavo,amare V3 IndImp I s 
amer~,amare V3 IndFut3s 
ameral,amare V3 IndFut2s 
ameranno,amare V3 IndFut3p 
amerebbe,amare V3 CondPres3s 
amerebbero,amare V3 CondPres3p 
amerm,amare V3 CondPres 1 s 
ameremmo,amare V3 CondPtes 1 p 
ameremo,amare V3 IndFutlp 
amereste,amare V3 CondPres2p 
ame~estl,amare V3 CondPres2s 
amerete,amare V3 IndFut2p 
amerb,amare V3 IndFut ! s 
aml,amare V3 Imper3s 
arm,amar e V3 CongPres I s 
amt,amare V3 CongPres2s 
aml,amare V3 CongPres3s 
aml,amare V3 IndPres2s 
ammmo,amare V3 Imperlp 
ammmo,amare V3 CongPres 1 p 
ammmo,amare V3 IndPres lp 
ammte,amare V3 CongPres2p 
ammo,amare V3 Imper3p 
ammo,amare V3 CongPres3p 
amb,amare V3 IndPass3s 
amo,amare V3 IndPres Is 
cortese,cortese A79 fs 
cortese A79 ms 
cortesl,cortese A79 fp 
cortesl,cortese A79 mp 
dl,dl PREP 
dottore N80 ms 
dottoressa,dottore NS0 fs 
dottoresse,dottore N80 fp 
dotton,dottore NS0 mp 
lentamente,lentamente AVV 
The DELAS-DELAF dlctlonanes contaln simple words whlch are formally defined as sequences oJ 
characters between blank spaces or separators On the other hand, dictionaries of compound words 
contain those words which are formally defined as sequences of words, they contain spaces or separators 
Compound words are constrained sequences of words which can have either a metaphorical meaning as 
cavaUo dl battagha, which means "something at which somebody particularly excels, somebody's 
favourlte p~ece" or a "neutral" or techmcal meaning as carta dt credtto, m Enghsh credit card Compound 
words are constrained sequences of words, since the subst~tutmn of lex~cal elements within the sequence 
w~th synonyms most of the time produces unacceptable compounds, as the following examples show 
(cavallo + *puledro) dl battagha 
cavallo dl (battagha + combattlmento) 
(carta + *tessera) d~ credzto 
carta dl (credlto + *fido) 
Furthermore, it is not possible to change the second occurrence of the noun in the plural form 
*cavallo d~ battaghe 
*carta dl credltl 
The only morphologlcal variation can be apphed to the head of the whole compound, that is the fHst noun 
occurrence 
cavalh dl battagha 
carte d~ cred~to 
The electromc dlctmnary of compound words, named DELAC, contains about 50 000 Italian entries to 
which grammatical codes have been assigned These codes refer to the grammatical category to which the 
1tern belongs and to its internal structure and morphological behavlour In the following examples of the 
DELAC d~ctlonary 
cavallo d~ battagha,N+NPN ms-+ 
carta dl credzto,N+NPN fs-+ 
pesce spada,N+NN ms-+ 
casa madre,N+NN fs-+ 
agente speclale,N+NA ms++ 
alta carlca N+AN fs-+ 
the compound ~tems are followed by a symbol of part of speech (N), the separator "+" ~s followed by the 
internal structure of the compound The first two items are formed by a Noun, a Preposmon and a Noun 
(NPN), the third and fourth items are formed by two nouns (NN), the fifth item ts formed by a Noun and 
an AdJective (NA)I whale the last item is formed by an Adjective and a Noun (AN) Columns are followed 
by the gender and number the examples are either mascuhne singular (ms) or feminine sxngular (fs) 
93 
Finally, two marks indicate above morphological variations In gender and number If the varlatmns are 
accepted then the mark zs "+", ff they are not accepted the mark ~s "-" The internal structure defines the 
element of the compound which inflects compounds which belong to the class NPN inflect the first noun, 
compounds whichbelong to the classes NA and AN reflect both elements, while compounds which 
belong to the NN class can either reflect both nouns, as hngua madre(fs) - hngue madrt(fp), or only the 
first noun, as pesce spada(ms) - pesct spada(mp) Once we codify the morphological behawour of 
compound nouns m such a way, a set of computatmnal routines allows to automatically generate the 
DELACF, that Is the electromc dlctmnanes of reflected forms of Italian compound nouns, which has the 
following format 
agente specmle N+NA fs++ 
agente specmle N+NA ms++ 
agentl specmh,agente specmle N+NA fp++ 
agentl specmh,agente speclale N+NA mp++ 
alta canca N+AN fs-+ 
alte canche,alta carlca N+AN fp-+ 
carta dl credlto N+NPN fs-+ 
carte dl cred~to,carta d~ credlto N+NPN fp-+ 
casa madre N+NN fs-+ 
case madn,casa madre N+NN fp-+ 
cavalh dl battaglla,cavallo dl battagha N+NPN mp-+ 
cavallo dl battagha N+NPN ms-+ 
pesce spada N+NN ms-+ 
pescl spada,pesce spada N+NN mp-+ 
2. The Automatic Morpho-lexical Analysis 
Once dlctmnarles of simple and compound words are built, it is possible to apply them to texts by means 
of the programme INTEX of morpho-lexlcal analysis Th~s software has been developed by Max 
Sdberztem and allows to load electromc dictionaries of simple and of compound words structured m the 
way shown above INTEX applies both dictionaries to a text and builds the d~ctmnary of that text which 
wdl contain not only simple words but also all compound nouns present in the text This step allows to 
recogmze and hlghhght within a text all compounds 
- Che stone dlcono ~ - chiedo - Io non so mente So che Im ha un negoz~o, senza llnsegna lununosa Ma non so 
nemmeno dov'~ 
Me 1o sptega E' un negozlo dI pellamt, vahge e artlcoh da vlagglo Non ~ sulla pmzza della stazmne ma m una vm 
laterale, vlcmo al passagg~o a hvello dello scalo rnerc~ 
and also to build a frequency hst for them 
1 amcoh da vlagglo 
l msegna lummosa 
1 passagm a hvelto 
1 scalo merc~ 
Such an lndexatlon Is extremely reliable for the management of technical and scientific documentation 
Technical documents contain a lot of terminology which Includes mostly compound nouns INTEX gives 
us the possibility of loading more than one dictionary, so, the user can build not only a DELACF for 
generic compounds but also specialized dictionaries of compounds belonging to various fields such as 
Economy, Engineering, Computer Science, and so on It is then possible to analyze technical texts on the 
base of such dictionaries The following text in which compounds have been hlghhghted is an example 
Q/, 
drawn fron an article of the Itahan economics newspaper tl Sole 24 ore (the wlaole article contains 84 
hnes) 
Pohtica economica., anno zero E l'assenza di impegm precis~ e credibih contro l'mflazione continua a tenere 
IFI tenslone i mercati 
Ne~ mesi che hanno preceduto la svalutaz~one della hra e' stato npetutamente e autorevolmente affermato 
che la stabiht~ del camblo rappresentava l'asse portante d~ tutta la pohtlca econo.mica ~tahana La h.nea di 
condotta segmta nelle prime settimane dal Governo Amato era sembrata coerente con tale enuncmzione 
1 tl nconoscimento dell'autonomm della Banca d'Itaha, 
2 I'affermaz~one della pnont~t assoluta dell'ob~etUvo antHnflaz~omsUco, 
3 d blocco delle contrattaz~om salarlah nel settore pubbhco, 
4 l'accordo di fine lugho con i slndacaU 
I pnm~ due punt~ erano Iondamentah e complementan, m quanto l'autonomia della Banca centrale nceveva 
una prec~sa carattenzzazlone (rafforzata dalla prospemva dl adeslone all'umone monetana europea) dalla 
pr!0nth assoluta dell'ob~ett~voantHnflaziomstico E' bene sottolmeare che tale Pr!Onth assoluta valeva anche 
nei confronti dell'obietuvo di nsanamento della finanza pubbhca Essa doveva m sostanza essere mtesa ne~ 
seguent~ termm~ a) la Banca centrale avrebbe rmpettato target dt crescita monetana coerentl con gh 
oblettivl mflazlomstici annuncmtl (cosa che non era fino allora avvenuta, nernmeno dopo l'adeslone alia 
banda nstretta dello Sme, che pure avrebbe dovuto comportare un accrescmto ngore della p0ht~ca 
monetana), b) nel far czb essa non si sarebbe curata degh effetti di breve periodo di tale condotta su~ tass._._.!~ 
dl lnteresse, e qum& anche sulla finanza pubbhca, c) tl Governo avrebbe adottato con la massima urgenza 
provve&mentl dl nsanamento della finanza pubbhca, senza tuttavta fare ncorso a m~sure che Incldessero 
sull'm&ce del prezz~ al consumo Quest~ nchmmi hanno ormm sapore stonco, ma sono uUh per megho 
mettere a fuoco la situaz~one attuale In parucolare la prima affermaz~one (l'essere c~oe' la stabd~tb, del 
carabao asse portante di tutta la pohuca economica) ci sembra ancora p~enamente vahda rasse portante non 
c'e' pda e con esso e' sparita anche la pohtica econom~ca Schematlcamente sembra che esistano due 
alternative posslbdl dl pohtlca economica La prima (che nproduce m sostanza, pur helle con&zlom 
mo&ficate, la hnea pre-svalutazlone tratteggmta sopra) potrebbe arucolarsi nei seguenu termini a) 
aggmstamento fiscale come ob~ett~vo della mass~ma urgenza, b) naffermaz~one del ruolo autonomo della 
Banca centrale, degh lmpegm assuntt m vista del mercato umco (m part~colare m matena di hbera 
c~rcolaztone dei cap~tah), c) mass~mo contemmento deUe spmte, mflazlomsuche denvanti dalla svalutazione 
della hra e rmffermazione dl un obletUvo antHnflaziomstlco preciso (tradotto m un target ngoroso e 
tmpegnatwo, e qumdt cred~bde, d~ cresctta monetana) Part~colarmente importante quest'ulttmo punto, m 
quanto non e'affatto mewtabde che gh effett~ della svalutaz~one s~ traducano lntegralmente m una 
mflaztomst~ca aggtuntiva Se la cresc~ta monetarm e' tenuta sotto controllo, e graz~e agh effetti restnttiv~ 
della manovra d~ bflanclo, la svalutaz~one pub tradurs~ anz~che' m un fattore mflaz~omstlco, m uno d~ 
mutamento dei prezz~ lelatlvi La seconda lmea punta mvece a un abbassamento de~ tass~ dt mtetesse e a 
impart~re stimoh espansiw all'economia (entramb~ gh obtett~v~ potrebbero rlcevere una motwaz~one 
aggluntwa grazle al solhevo che, nel breve penodo, potrebbe~o portare alia tmanza pubbhca) Una tale 
hnea nchiederebbe certamente I'accantonamento,almeno temporaneo, di qualslasl ob~ettlVO ant~- 
mflaziomstico E' anz~ probablle che i suo~ effettt p~fa sigmficauv~ e durevoh sarebbero quelh prodott~ 
dall'mflaz~one sulla dastnbuztone del red&to, e soprattutto della ncchezza, e sul valore reale dello stock del 
deb~to pubbhco Sl notl chela seconda hnea e' certamente mcompat~bde con un r~torno m temp~ brew a un 
regime di cambio fisso 
A frequency hst which contams terminological compound nouns give us the posstbthty to Immediately 
understand the specific content of this article Such an index prowdes a p~cture of the content of the text 
(the mdex which follows was budt on the whole article) 
8 pohttca econom~ca 
5 finanza pubbhca 
4 pohtlca monetana 
4 cresctta monetana 
3 prlonth assoluta 
3 tass~ d~ mteresse 
3 Banca centrale 
3 asse portante 
2 valore reale 
2 stabtl~th det camb~o 
2 svalutazlone della hra 
2 manovra dl bdanclo 
95 
2 tasso d~ inflazlone 1 contrattazmm salanah 
1 deblto pubbhco I Polmcaeconom,ca 
I abbassamento de~ tassl I hnea dl condotta 
1 dlstnbuzlone del reddlto 1 Banca d'Itaha 
i dffferenzmh dl mteresse 1 hbera mrcolazlone 
1 tltoll pubbhcl 1 spmte mflazmmstlche 
1 cambm flsso 1 spmta mflazmnlstlca 
I stabdlzzamone del cambm I mercato umco 
1 premm dl nschm 1 indlce de~ prezz~ 
l settore pubbhco 1 prezzl al consumo 
1 umone monetana 
3. Local Grammars 
Electronic dlcttonartes give us the posslblhty of recognlzmg wlthm texts words and sequences of words as 
defined by dictionaries INTEX allows to recogntze combmatmns of simple and compound words, thanks 
to the interaction between dlctmnarles and grammars INTEX contains a tool whtch allows to construct 
local grammars on the model of fimte state automata These grammars can be based not only on words but 
also on the non-terminal symbols contained in the dictionaries For example, m order to identify all 
compound nouns followed by an adjective, which agrees in gender and number with them, we construct 
the following grammar 
~ N~+NPN ~>~ N~+NA ms> \[~L-~ 
If we apply such a grammar to a text, INTEX will hlghhght all occurrences of th~s pattetn and 
subsequently construct concordances for that pattern 
trovare espressmne sm in accrescmu dffferenzlah dl interesse reah, sm In un deprezzamen 
utta la pohtlca econom~ca ~tahana La hnea dt condotta segmta helle prime setumane dal Gov 
ppresentava l'asse portante d~ tutta la polmca econormca ltahana La hnea dl candotta segul 
ant~-mflazmmsuca Se questa e' la polmca economica ltahana Sl sarebbe tentitl dl dire 
zmne s~ traducano mtegralmente in una spmta lnflazmmst~ca agglunttva Se la cresmta monet 
apttah), c) masslmo contemmento delle spmte mflazlomstmhe denvant~ dalla svalutazmne de 
vl sufflclentemeute sald~ per tollerare tasst d~ mteresse penahzzantt a qualche asta,accompag 
una scelta realmente ~mpegnat~va 2 II tasso dt mflazmne programmato e' stato portato dal 3, 
rzata dalla prospemva dl adeslone all'umone monetarla europea) dalla prmntg assoluta dell' 
96 
Hence, electromc dxctlonanes on one side, and the posslblhty of construct,.ng grammars whxch interact 
with dlcuonanes on the other gave us the posslbdlty of automatically analyzing large corpora, consldenng 
not only words but also sequences of words 

Bibliography 

Gross M (1988), Grammatre transformatwneUe du francazs Syntaxe de l'adverbe Pans Cant116ne 

Gross, M & M Sdberztem (edd), Acres des Preml~res Journges INTEX (21-22 mars 1996), Pans 
L A D L., Unlversxtd de Pans 7 -- 

Eha, A (1984), Le verbe ltahen Les complettves dans les phrases a un complement, Fasano-Pans 
Schena-Nlzet. 

Sdberztem M (1993), Dtcttonnatres ~lectromques et analyse automattque de textes Le syst~me 
INTEX Pans" Masson. 
