DETECTING PATTERNS IN A LEXICAL DATA BASE 
Nicoletta Calzolari 
Dipartimento di Linguistica - Universita' di Pisa 
Istituto di Linguistica Computazionale del CNR 
Via della Faggiola 32 
50100 Pisa - Italy 
ABSTRACT 
In a well-structured Lexical Data Base, a
number of relations among lexical entries can be
interactively evidenced. The present article
examines hyponymy, as an example of paradigmatic 
relation, and "restriction" relation, as a 
syntagmatic relation. The theoretical results of 
their implementation are illustrated. 
I INTRODUCTION 
In previous papers it has been pointed out
that in a well-structured Lexical Data Base it
becomes possible to detect automatically, and to
evidence through interactive queries, a number of
morphological, syntactic, or semantic
relationships between lexical entries, such as
synonymy, hyponymy, hyperonymy, derivation,
case-argument, lexical field, etc.
The present article examines hyponymy, as an
example of paradigmatic relation, and what can be
called "restriction or modification" relation, as
a syntagmatic relation. By restriction or
modification relation, I mean that part of a
so-called "aristotelian" definition which has the
function of linking the "genus" and the
"differentia specifica".
When evidenced in a lexicon, the hyponymy
relation produces hierarchical trees partitioning
the lexicon in many semantically coherent
subsets. These trees are not created once and
for all, but it is important that they are
procedurally activated at the query moment.
While evidencing the second relation
considered, one can investigate whether it
is possible to discover any correlation between
lexical or grammatical features in definitions
and particular kinds of "definienda", and thus
try to answer questions such as the following:
"Are there any connections between these
restriction relations and the fundamental ways of
definition, i.e. the criterial parameters by
which people define things?"
For both relations, the paper presents the 
different procedures by which they are
automatically recognized and extracted from the 
natural language definitions, the degree of 
reliability of their automatic labeling, the use 
of these labels in interactive queries on the 
lexical data base, and finally the theoretical 
results of their implementation in a 
Machine-Dictionary. 
II THE LANGUAGE OF DEFINITIONS AS A SUBLANGUAGE 
I am trying to develop and exploit the idea of
considering the language of dictionary 
definitions as a particular sublanguage within 
natural language. This perspective cannot 
obviously be adopted for subject matter 
restrictions in definitions, but only for the 
purpose of the text, i.e. the specific 
communicative goal. From this restriction on the 
purpose of the text, certain lexico-grammatical 
restrictions do result, which prove to be very 
useful. 
As to the restrictions on the lexical richness
of definitions, these are not due to the fact
that they relate to a specific domain of
discourse, but only to the property of closure
(although not satisfied at 100%) that the
defining vocabulary should in principle be
simpler and more restricted than the defined set
of lemmas, i.e. the former should be a proper
subset of the latter.
This kind of quantitative restriction on the 
vocabulary of definitions would not be of any 
interest in itself, if it were not accompanied by 
other kinds of constraints both on a) the 
lexical, and on b) the grammatical side. 
a) From the frequency list of the words used 
in definitions (about 800,000 word-occurrences, 
and 75,000 word-types), it appears in fact that 
some words have a much greater importance than in 
normal language, as evidenced by a comparison 
with the data of the Lessico di Frequenza della
Lingua Italiana Contemporanea (Bortolini et al.,
1971). These are the defining generic terms 
which are traditionally used by lexicographers, 
such as ACT, EFFECT, PERSON, OBJECT, WHO, 
PROCESS, CAUSE, etc. It is not by chance that 
these same concepts are of relevance in many 
Artificial Intelligence systems. 
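The comparison described in a) can be sketched as a ratio of relative frequencies. The following is only a minimal illustration: the function name, the add-one smoothing, the threshold, and all the counts are assumptions made for the example, not data or procedures taken from the Italian Lexical Data Base or from the frequency lexicon.

```python
def overrepresented_words(def_counts, ref_counts, min_ratio=2.0):
    """Rank words that are relatively more frequent in definitions
    than in a reference frequency list (hypothetical helper)."""
    def_total = sum(def_counts.values())
    ref_total = sum(ref_counts.values())
    ranked = []
    for word, count in def_counts.items():
        def_rate = count / def_total
        # Add-one smoothing (an assumption of this sketch) so that words
        # absent from the reference list do not cause a division by zero.
        ref_rate = (ref_counts.get(word, 0) + 1) / (ref_total + 1)
        ratio = def_rate / ref_rate
        if ratio >= min_ratio:
            ranked.append((word, ratio))
    # Most over-represented words first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Invented toy counts, for illustration only.
definition_counts = {"atto": 50, "persona": 30, "casa": 5, "di": 915}
reference_counts = {"atto": 10, "persona": 40, "casa": 50, "di": 9899}

generic_terms = [w for w, _ in overrepresented_words(definition_counts,
                                                     reference_counts)]
```

With these toy counts, only the generic defining terms pass the threshold, while an ordinary content word and a function word do not.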
b) Not only single words, or classes of words, 
are particularly relevant in the defining 
sublanguage. There are also lexical patterns and 
syntactic patterns which occur with great 
frequency, and which play a very special role in 
defining sentences. 
The combination of these constraints can be
and actually is very useful, when trying to
exploit the information contained in definitions,
and when transforming an archive of natural
language definitions into a knowledge base
structured as a network. Some important parts of
knowledge are in fact already retrievable in
interactive mode from the Italian Lexical Data
Base, which has recently been restructured.
Analyses on large corpora of definitions,
carried out on many dictionaries (Amsler, 1980;
Calzolari, 1983a, 1983b; Michiels, Noel, 1982)
have in fact shown that the definitions
sublanguage displays several regularities of
lexical and syntactic occurrences and patterns.
These general lexical classes and the classes of
recurrent patterns can be more or less easily
captured for instance by pattern-matching rules,
and if possible characterized with formal rules.
III HYPONYMY RELATION
Hyponymy is the most important relation to be
evidenced in a lexicon. Due to its taxonomic
nature, it gives the lexicon, when implemented, a 
particular hierarchical structure: its result is 
obviously not a tree, but many tangled 
hierarchies (Amsler, 1980). 
Instead of evidencing and labelling this
relation by hand, I have tried to characterize it
procedurally. The procedure which automatically
coded (with a precision of more than 90%
calculated on a random sample of 2000
definitions) true superordinates in all the
definitions (approx. 185,000 for 103,000 lemmas)
was based almost exclusively on the position of
the "genus" term at the beginning of the
definitional phrases, giving Nouns, Verbs, and
Adjectives as superordinates of defined entries
of the same lexical category. Ad hoc subroutines
solved exceptional cases where a) quantifiers or
other modifiers preceded the genus term (e.g.
aletta ---> piccolo gruppo di penne dietro
l'angolo dell'ala), or b) more than one genus was
present in the definition (e.g. assordare --->
attutire, smorzarsi detto di suono), or c) a
prepositional phrase, usually of locative type,
was at the beginning of the phrase (e.g. piazzato
---> nel rugby, calcio al pallone collocato sul
terreno).
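The positional heuristic and its three exceptional cases can be illustrated with a small sketch. This is not the procedure actually used on the data base: the word lists, the function name, and the rule that a one-word segment followed by a comma signals a coordinated genus are all assumptions introduced here for illustration.

```python
# Hypothetical word lists: the paper does not enumerate them.
LEADING_MODIFIERS = {"piccolo", "piccola", "grande", "ogni", "qualsiasi"}
LOCATIVE_PREPOSITIONS = {"nel", "nello", "nella", "nei", "nelle", "in"}


def genus_candidates(definition):
    """Return the word(s) found in genus position of a definition."""
    segments = [s.split() for s in definition.split(",") if s.strip()]
    # Case c): drop a leading prepositional phrase closed by a comma.
    if len(segments) > 1 and segments[0][0].lower() in LOCATIVE_PREPOSITIONS:
        segments = segments[1:]
    candidates = []
    for words in segments:
        # Case a): skip modifiers that precede the genus term.
        while len(words) > 1 and words[0].lower() in LEADING_MODIFIERS:
            words = words[1:]
        candidates.append(words[0].lower())
        # Case b): a one-word segment is read as one of several coordinated
        # genus terms, so scanning continues; a longer segment ends the list.
        if len(words) > 1:
            break
    return candidates


print(genus_candidates("nel rugby, calcio al pallone collocato sul terreno"))
print(genus_candidates("attutire, smorzarsi detto di suono"))
print(genus_candidates("piccolo gruppo di penne dietro l'angolo dell'ala"))
```

In the last example the sketch returns the first word after the adjective; whether the original subroutines went further and skipped the whole quantifying expression is not stated in the text, so that behaviour is left open here.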
Even though the first immediate purpose of 
this procedure is of a classificational nature, the
ultimate goal is the extraction and formalization 
of the most relevant relationship between lexical 
items which is implicitly stored in any standard 
printed dictionary. It is in fact now possible 
to retrieve in the lexical data base not only all
the definitions in which any possible word-form 
appears, together with the defined lemmas (e.g. 
SUONO appears in 328 definitions), but also to 
retrieve on-line, if desired, only the 
definitions in which the given word-form is used 
as a superordinate, therefore with the list of 
its hyponyms (e.g. the same word SUONO is used as 
superordinate of only 65 words, i.e. of a subset 
of the preceding set containing MUSICA, RUMORE,
SQUILLO, SUSSURRO, etc.).
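The difference between the two kinds of retrieval can be sketched with two small indexes. The record layout and the function name are hypothetical, and the first two definitions are invented toy data; only the third definition is quoted from the examples above.

```python
from collections import defaultdict

# Toy records: (lemma, definition, coded superordinate).
# The definitions of "squillo" and "rumore" are invented for illustration.
ENTRIES = [
    ("squillo", "suono acuto e breve", "suono"),
    ("rumore", "suono confuso e irregolare", "suono"),
    ("assordare", "attutire, smorzarsi detto di suono", "attutire"),
]


def build_indexes(entries):
    """Index lemmas by every word of their definition and by their genus."""
    by_word = defaultdict(set)
    by_genus = defaultdict(set)
    for lemma, definition, genus in entries:
        for token in definition.replace(",", " ").lower().split():
            by_word[token].add(lemma)
        by_genus[genus].add(lemma)
    return by_word, by_genus


by_word, by_genus = build_indexes(ENTRIES)
# All lemmas whose definition contains the word-form:
print(sorted(by_word["suono"]))
# Only the lemmas for which the word-form is the superordinate:
print(sorted(by_genus["suono"]))
```

The second set is by construction a subset of the first, which mirrors the relation between the two SUONO queries described above.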
The query-language so far implemented for the
lexical data base therefore permits the retrieval
of information on this hierarchical relation,
identifying on-line the allowable
interconnections within the entire lexicon. The
links produced can be analyzed, evaluated, and,
if necessary, interactively corrected.
From explorations on the trees thus obtained,
we can also try to set up classes and subclasses
of superordinates, on the basis of the upper
nodes to which many other nodes are connected as
descendants. Only as an example, the
identification criterion for the noun-class
"SET-OF" containing INSIEME, GRUPPO, COLLEZIONE,
COMPLESSO, AGGREGATO, etc., among the set of
noun-superordinates, is the fact that they are 
linked one to the other in the tree which results 
from querying the data base. Their hyponyms will 
obviously be for the most part collective nouns. 
The identification of word-classes like this
one leads to the next step in the formalization
of the hyponymy relation, which will consist in
the attachment of a label indicating a semantic
class to these sets of superordinates. It will
thus be possible to retrieve, for example, all
the nouns generically definable as "SET-OF",
independently of the particular word denoting a
set used in definitions. Since it is already
possible to trace these chains of hyponyms going
upwards or downwards for more than one level, one
can immediately ask whether, for example,
MASSERIA belongs to the set of collectives even
if it is defined as MANDRIA, because MANDRIA is
defined as BRANCO, which is in turn defined as
INSIEME, which finally is one of the nouns
belonging to the class "SET-OF".
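The upward tracing of such a chain can be sketched as a search over superordinate links. The dictionary layout and the function name are assumptions of this example; the links and the class members are those named in the text. Each lemma maps to a set of superordinates, since the hierarchies are tangled rather than strict trees, and visited nodes are recorded so that circular definitions do not cause an endless loop.

```python
# Superordinate links taken from the example in the text.
SUPERORDINATES = {
    "masseria": {"mandria"},
    "mandria": {"branco"},
    "branco": {"insieme"},
}
# Members of the "SET-OF" class listed in the text.
SET_OF = {"insieme", "gruppo", "collezione", "complesso", "aggregato"}


def chain_to_class(lemma, superordinates, class_members):
    """Return one chain from lemma up to a member of the class, or None."""
    stack = [[lemma]]
    seen = set()
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in class_members:
            return path
        if node in seen:
            continue  # already explored: guards against circular definitions
        seen.add(node)
        for parent in sorted(superordinates.get(node, ())):
            stack.append(path + [parent])
    return None


print(chain_to_class("masseria", SUPERORDINATES, SET_OF))
```

A lemma with no path to the class, or one caught in a definitional circle, yields None.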
IV RESTRICTION RELATION 
Even though some refinements are still
required in order to improve the reliability of
the automatic recovery of chains of ISA-related
terms, this kind of structural relation within
the lexicon, that is hyponymy, is at a good stage
of implementation in the Italian lexical data
base.
Much still remains to be done as far as other
very interesting relationships between the
entries are concerned. I am now considering what
could be called "restriction or modification"
relation, since its purpose is to restrict or
modify the meaning of the genus term. It is
exemplified in the following definitions by the
words in italics:
stannite ---> calcopirite contenente stagno
arricciolare ---> modellare a forma di ricciolo
risonatore ---> dispositivo atto a generare
risonanza
I wish to evaluate what could be done with 
respect to this kind of relation, starting from 
the available definitional data. One of the 
first aims of this lexicological research is to
analyze, by means of computational tools, and to
use the information contained in the different
definitional formats and structures. The
implementation of a number of procedures which
convert the natural language information conveyed
by definitions into processable formats, made up
of structured relational links between lexical
items or classes of lexical items, is now taken
into consideration.
These formats can be made traceable, e.g. in an
Information Retrieval system on definitions like
the one actually implemented, on the entire
corpus, for the taxonomic part of the lexical
structure. But these formatted relational
structures can also be used as starting points
for a computationally exploitable reorganization
of the definitional content. One of the
characteristics of the definitional sublanguage,
i.e. the presence of recurrent patterns (such as
proprio di, relativo a, prodotto da, originario
di, etc.), makes it possible, at least in certain
cases, to produce a constant mapping from certain
variable types of more frequently detected
definitional phrases to constant underlying
relational structures.
Using rather simple pattern-matching 
procedures, some classes and subclasses of
definitions can be separated, and a small number 
of simpler types of definitions have already been 
converted into a formalized coded format also 
with regard to this restriction relation. A new 
virtual Relation is thus added to the original 
data base. The distinguished elements of a 
number of simple natural language patterns are 
mapped into some general structured information 
formats. Up to now, some of the definitions 
displaying the following restriction relations 
have been treated: 
REL.FORM (e.g. a forma di)
REL.PROV (e.g. provvisto di)
REL.APT (e.g. atto a)
and the corresponding relational links generated. 
Among the lexical variants of REL.PROV there
are fornito di, dotato di, munito di, pieno di,
ricco di, etc.; while REL.FORM groups the
following variants of a different type: in forma
di, che ha (la) forma (di), di forma, di forma
simile a (quella di), sotto forma di, avente forma
di, etc. It is thus possible, for example, to
retrieve, among the 1271 definitions in which the
word FORMA appears, only those defining something
as "having the shape of something else". The
implementation of these links makes it possible to
produce another kind of partitioning within the
lexical system, and permits a better investigation
of the internal structure of words.
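The mapping from lexical variants to a constant relation label can be sketched with regular expressions. This is only an illustration of the pattern-matching idea: the regular expressions, the treatment of gender and number endings, the function name, and the choice to take the rest of the definition as the argument of the link are assumptions, not the rules actually implemented.

```python
import re

# One pattern per relation, covering the variants listed in the text.
RELATION_PATTERNS = {
    "REL.FORM": re.compile(
        r"\b(?:(?:a|in|sotto) forma di"
        r"|che ha (?:la )?forma(?: di)?"
        r"|avente forma di"
        r"|di forma(?: simile a(?: quella di)?)?)\b",
        re.IGNORECASE,
    ),
    "REL.PROV": re.compile(
        r"\b(?:(?:provvist|fornit|dotat|munit|pien)[oaie]"
        r"|ricc(?:o|a|hi|he)) di\b",
        re.IGNORECASE,
    ),
    "REL.APT": re.compile(r"\batt[oaie] a\b", re.IGNORECASE),
}


def restriction_links(lemma, definition):
    """Return (lemma, relation, argument) triples found in a definition."""
    links = []
    for relation, pattern in RELATION_PATTERNS.items():
        match = pattern.search(definition)
        if match:
            # Assumption: what follows the pattern is the argument.
            argument = definition[match.end():].strip()
            links.append((lemma, relation, argument))
    return links


print(restriction_links("arricciolare", "modellare a forma di ricciolo"))
print(restriction_links("risonatore", "dispositivo atto a generare risonanza"))
```

A definition that contains none of the listed variants, such as "calcopirite contenente stagno", simply produces no link under these three relations.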
A procedure of the kind exemplified above, 
based on pattern-matching, is possible for a good 
number of definition types; for example, with a 
different format, for many adjectives:
Adj ---def---> REL.X {NP | VP}
where several groups of definitions are found to 
share a common underlying structure in terms of 
the restriction relation involved, in spite of 
other lexical and syntactic differences. 
V FUTURE PERSPECTIVES 
A comparison with the definitional corpora of 
other dictionaries, also of other languages, will 
certainly prove to be useful in establishing the 
set of the most general or primitive Relations, 
used for definition in lexicographical practice,
often overlapping with the primitive Relations 
stated in many AI systems. These relations, 
mapped into a formal link in the data base, can 
then be paraphrased in each language, in the 
standard language. 
The data base structure envisaged makes it possible
both to maintain at a lower level (the starting 
level), and to eliminate at an upper level, many 
peculiarities and variations in the linguistic 
expression of the same or of similar concepts or 
relations; their effect is to facilitate the 
comprehension by the users of the printed 
dictionary, inhibiting however immediate 
comprehension by procedural routines in the 
mechanical processing of dictionary data. 
By applying similar methods of automatic 
conversion and mapping into suitable formats, as 
extensively as possible throughout the lexicon, 
many definitional expressions can be submitted to
an attempt at standardization, thus achieving
greater precision, which gives a considerable
improvement when performing, for example,
information retrieval operations on the content 
of a dictionary. 
This more structured, but, in another sense,
simplified version of definitions, which also 
accounts for their relational nature, provides an 
excellent basis for testing and studying the 
"knowledge of the world" which underlies the 
structure of a dictionary. 
VI REFERENCES
Alinei, M., La Struttura del Lessico, Bologna: Il
Mulino, 1974.
Amsler, R.A., The Structure of the
Merriam-Webster Pocket Dictionary, Ph.D.
Thesis, Department of Computer Science,
University of Texas, Austin, Texas, 1980.
Bortolini, U., Tagliavini, C., Zampolli, A.,
Lessico di Frequenza della Lingua Italiana
Contemporanea, Milano: Garzanti, 1972.
Calzolari, N., "Towards the organization of
lexical definitions on a data base
structure", COLING82 Abstracts, ed. by E.
Hajičová, Prague: Charles University, 1982,
61-64.
Calzolari, N., "Lexical definitions in a
computerized dictionary", Computers and
Artificial Intelligence, II(1983a)3, 225-233.
Calzolari, N., "Semantic links and the
dictionary", in Proceedings of the 6th
International Conference on Computers and the
Humanities, ed. by S.K. Burton, D.D. Short,
Rockville (Maryland): Computer Science
Press, 1983b, 47-50.
Calzolari, N., Ceccotti, M.L., "Organizing a
large scale lexical database dictionary",
Actes du Congrès Informatique et Sciences
Humaines, Liège: L.A.S.L.A., 1981, 155-163.
Clark, E.V., Clark, H.H., "When nouns surface as
verbs", Language, 55(1979)4, 767-811.
Evens, M.W., Litowitz, B.E., Markowitz, J.A.,
Smith, R.N., Werner, O., Lexical-Semantic
Relations: a Comparative Survey, Edmonton,
Alberta: Linguistic Research Inc., 1980.
Findler, N.V. (ed.), Associative Networks, New
York: Academic Press, 1979.
Hendrix, G.G., "Natural-language interface",
Proceedings of the Workshop 'Applied
Computational Linguistics in Perspective',
American Journal of Computational
Linguistics, 8(1982)2, 56-61.
Michiels, A., Mullenders, J., Noël, J.,
"Exploiting a large data base by Longman",
COLING80: Proceedings of the 8th
International Conference on Computational
Linguistics, Tokyo, 1980, 374-382.
Michiels, A., Noël, J., "Approaches to thesaurus
production", COLING82: Proceedings of the
Ninth International Conference on
Computational Linguistics, ed. by J. Horecký,
Amsterdam: North-Holland, 1982, 227-232.
Nagao, M., Tsujii, J., Ueda, Y., Takiyama, M.,
"An attempt to computerize dictionary data
bases", COLING80: Proceedings of the 8th
International Conference on Computational
Linguistics, Tokyo, 1980, 534-542.
Quillian, M.R., "Semantic memory", in Semantic
Information Processing, ed. by M. Minsky,
Cambridge (Mass.): MIT Press, 1968, 227-270.
Smith, R.N., "On defining adjectives: part III",
Dictionaries, the Journal of the Dictionary
Society of North America, Winter, (1981)3,
28-38.
Smith, R.N., Maxwell, E., "An English dictionary
for computerized syntactic and semantic
processing", in Computational and
Mathematical Linguistics, ed. by A. Zampolli,
N. Calzolari, Firenze: Olschki, 1977, 303-322.
Walker, D.E., Amsler, R.A., Proposal to the
National Science Foundation on an
Invitational Workshop on Machine-Readable
Dictionaries, SRI, 1982 (mimeo).
Zingarelli, N., Vocabolario della lingua
italiana, Bologna: Zanichelli, 1971.
