<?xml version="1.0" standalone="yes"?>
<Paper uid="C73-2005">
  <Title>NICOLETTA CALZOLARI- LAURA PECCHIA- ANTONIO ZAMPOLLI* WORKING ON THE ITALIAN MACHINE DICTIONARY: A SEMANTIC APPROACH</Title>
  <Section position="4" start_page="0" end_page="0" type="concl">
    <SectionTitle>
56 NICOLETTA CALZOLARI - LAURA PECCHIA - ANTONIO ZAMPOLLI
</SectionTitle>
    <Paragraph position="0"> the corresponding linguistic theory, constituting, it could be said, the computational &quot;transcription&quot; of it. The rapid evolution of the theories, the models and the formal apparatus requires a continual updating of the corresponding computational systems, which does not seem very easy to achieve, at least in practice. Furthermore, the generative-transformational schools whose theories are usually incorporated in these systems have so far described only isolated regions of the linguistic structure, aiming to verify the adequacy of descriptive methods rather than to describe a language coherently and exhaustively. As a consequence, anyone wishing to use the results of their research in a computational system would face a set of isolated observations distributed over different regions of a language, not systematically linked to each other, and separated by regions that are so far unexplored.</Paragraph>
    <Paragraph position="1"> On the other hand, however, the analytical methods produced by the generative-transformational theories have revealed a very efficient heuristic power, and have considerably increased the precision and subtlety of the observations. The number of new phenomena that have been revealed has grown notably in the last 20 years.</Paragraph>
    <Paragraph position="2"> Faced with this situation, LDV researchers may adopt positions ranging between two alternatives.</Paragraph>
    <Paragraph position="3"> The first position is usually characterized as the rise of a linguistic &quot;computational paradigm&quot;, which is distinct from, if not directly in contrast with, the generative-transformational paradigm, and tends to count the computational aspect among the principal characteristics of a linguistic theory. The conviction is expressed that the primary &quot;focus&quot; of linguistic research must be shifted from the description of competence as a set of formal abstract mechanisms towards simulation-like studies of the processes which underlie the production and comprehension of utterances. The &quot;natural language understanding&quot; computational system could constitute a powerful experimental and heuristic tool for the study of the complexity and the constraints of these processes, making it easier to bring out the mechanisms of interaction between the components involved in them.</Paragraph>
    <Paragraph position="4"> The scantiness of the results obtained so far (some of the devotees of this approach have likened the situation to that of medieval alchemy as opposed to modern chemistry) makes it impossible to formulate even a summary judgement. Nevertheless, it is quite clear that this type of research is restricted, and will probably remain restricted for some time, to the consideration of extremely limited language subsets.</Paragraph>
    <Paragraph position="5"> The second position seems to prefer, in the present state of linguistic theory, a systematic examination of the data to the immediate construction of a formal global model. Obviously, this does not mean rejecting abstractions or notions (e.g. those of transformation or of componential analysis) whose theoretical status may vary with the global evolution of the theory itself, but which have proved to be experimental devices of extraordinary efficiency in the analysis. The complex formal mechanisms proposed by the generative-transformational school are not implemented in a computational system as a representation of a &quot;language theory&quot;, but some of their characteristics (form of the rules, relationship between the rules, etc.) are used to store, handle and organize the data accumulated in the inductive phase of the research.</Paragraph>
    <Paragraph position="6"> LDV essentially offers two complementary contributions to this approach. Firstly, it supplies techniques which permit the automatic handling of the data. Secondly, LDP studies algorithms which permit the data to be structured conveniently, organizing them so that their regularity, diversity, correlations, etc., can be brought out without making this organization dependent on the &quot;a priori&quot; choice of a general global theoretical model.</Paragraph>
    <Paragraph position="7"> The inventories of linguistic units recorded in &quot;machine readable form&quot; must be considered within this framework, in particular those lexical inventories in which each lexical unit is supplied with an explicit, suitably coded representation of its linguistic behaviour.</Paragraph>
    <Paragraph position="8"> In addition, the use of a lexical inventory would facilitate the definition of the degree of exhaustivity of the descriptions and the evaluation of the extension of the phenomena studied. (The term 'extension' must obviously be understood here not as frequency of appearance in texts but as frequency of appearance in the system.) 15
15 The information is often represented by binary matrices in which a line corresponds to a lexical unit and a column to a specific linguistic property. This organization obviously facilitates the identification of identical or similar configurations, the verification of the coherence between the contents of interrelated columns, etc. (see JOSSELSON, 1969). The work of M. Gross (1975) and his group on the construction of a grammatical lexicon of French certainly constitutes the most important example. Furthermore, the role which the lexicon and its description have assumed within the most recent developments of the generative-transformational school (Bresnan, etc.) should not be neglected.</Paragraph>
    <Paragraph position="9"> At the same time, it seems that the time has come to systematize and make publicly available the linguistic data accumulated in machine readable form (texts, dictionaries, descriptions, rules, etc.) and the computational tools (software packages, integrated systems, mid level and high level languages for LDP, etc.) produced in different institutes of different countries in different ways, but on the basis of similar methodological assumptions and of a general common body of knowledge.</Paragraph>
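The binary matrices described in footnote 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the property names, the lexical units and their values are invented for the example and are not taken from the DMI or from the works cited.

```python
from collections import defaultdict

# Hypothetical columns (linguistic properties); not taken from the source.
PROPERTIES = ("animate", "countable", "takes_complement")

# Hypothetical rows: one binary line per lexical unit.
MATRIX = {
    "cane": (1, 1, 0),
    "gatto": (1, 1, 0),
    "acqua": (0, 0, 0),
    "voglia": (0, 0, 1),
}


def identical_configurations(matrix):
    """Group lexical units that share exactly the same line of the matrix."""
    groups = defaultdict(list)
    for unit, row in matrix.items():
        groups[row].append(unit)
    # Keep only configurations shared by at least two units.
    return {row: sorted(units) for row, units in groups.items() if len(units) >= 2}


def differing_columns(matrix, unit_a, unit_b):
    """Name the properties on which two lexical units differ."""
    pairs = zip(PROPERTIES, matrix[unit_a], matrix[unit_b])
    return [name for name, a, b in pairs if a != b]
```

With the invented data, `identical_configurations(MATRIX)` groups `cane` and `gatto` under the line `(1, 1, 0)`, which is the kind of identical configuration the footnote says such a layout makes easy to find.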
    <Paragraph position="10"> It is within this framework, and not only for applied and operational purposes, that since 1968 (ZAMPOLLI, 1968) I have promoted the construction of the DMI as one of the principal projects of the newly constituted Dr..</Paragraph>
    <Paragraph position="11"> The project described in the following pages by N. Calzolari and  L. Pecchia is an original development in the field of semantics along these general planning lines.</Paragraph>
    <Paragraph position="12"> 2. TOWARDS A FORMALIZATION OF LEXICAL DEFINITIONS 2.1. Preliminary steps.</Paragraph>
    <Paragraph position="13">  This part of the article describes an attempt to formalize all the noun-definitions in the Italian Machine Dictionary (DMI). The definitions recorded in the DMI were taken from the Zingarelli Dictionary (1970) after having undergone a first process of normalization and shortening. Part of the normalization process was to classify the Zingarelli definitions into 9 different types and to mark each of these with a particular code.</Paragraph>
    <Paragraph position="14"> The main types of definitions are: 1) the relational (coded as 1), which is composed of a) a fixed part representing a function, and b) a variable part, the basis; 2) the synonymous (coded as 2), which is made up of one or more single words which are referred to for an explanation of the meaning considered; 3) the one per 'genus et differentia' (coded as 3), which is made up of a) a fixed word considered as a classifier (the 'generic part' of the definition), and b) a descriptive or predicative phrase of the classifier (the 'specific part' of the definition).</Paragraph>
    <Paragraph position="15"> The framework of our research is typical of componential analysis, according to which even that which appears to be &quot;a list of basic irregularities&quot; (BLOOMFIELD, 1933, p. 162), i.e. the lexicon, could become a well-structured and therefore formalizable set, in other words, a system. The idea for the analysis first came to us from the theory of componential analysis, but we have attempted to expand its field of application which, up to now, has been restricted to well-structured sets, as shown in the work of componential anthropologists (the domain of kinship terms), or to lemmas isolated from the rest of the lexicon (the well-known example of Katz: 'bachelor'). Our intention has been to extend the application of this theory to all nouns of the Italian lexicon. We are helped in this by the great quantity of material at our disposal in the DMI. As we are well aware of the limitations of componential analysis, we have used it only as a tool, not as an end, in achieving our purpose.</Paragraph>
    <Paragraph position="16"> From the entire corpus of lemmas and definitions in the DMI we have excluded those lemmas and definitions which are marked as archaic or rare. We have analyzed, up to now, all those definitions classed under codes 3 and 5, i.e. those with one generic and one specific part. These are the most numerous groups of definitions. After this selection, the total number of lexical items on which we are actually working is 28,873, of which 20,453 are monosemic and 8,420 polysemic; the total number of their definitions is 44,051. We have worked on this corpus of lemmas and definitions using programs and checks of different kinds, working in two main directions which will be discussed later in more detail. Firstly, we have extracted a considerable number of markers which would be assigned to the highest possible number of lemmas. Secondly, we have started an analysis of prepositions, of prepositional groups and of other syntagms which can be considered as grammatical in a very generic sense. These syntagms have been chosen because they satisfy, simultaneously, the following two criteria: a) that of occurring with a high frequency in the definitions; b) that of showing well defined semantic relations existing between noun and noun, or between verb and noun, or between noun and proposition.</Paragraph>
    <Paragraph position="17"> 2.2. Markers.</Paragraph>
    <Paragraph position="18"> In the first phase of our work, the aim was to extract a certain number of 'markers', starting mainly from the definitions; in other words, working in an inductive way. We obtained the first basic working elements from a check of the frequency-list of the forms found in the corpus of noun-definitions. This list helped us to make a first, purely provisional inventory of lemmas which might be used as 'markers'. Then, by looking up the concordances of these definitions, we were able to test the validity of these basic elements. In fact we have ascertained that the most frequent lemmas in the set of noun-definitions (i.e., the lexical entries most likely to be used as 'semantic markers') almost always appear in context in a generic sense and in the first position, only occasionally assuming a specific sense in other positions. It is also relevant that, as expected, once syntactic words such as prepositions, conjunctions, articles, etc. are excluded, the highest frequency-indexes pertain to the grammatical category of nouns.</Paragraph>
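The inductive step described above, a frequency-list of forms checked against their position in the definition, can be sketched as follows. This is a minimal sketch and not the program actually used for the DMI: the stop list of syntactic words is an assumption, the third sample definition is invented, and the code counts surface forms rather than lemmas.

```python
from collections import Counter

# Assumed stop list of syntactic words; a real one would be much longer.
SYNTACTIC_WORDS = {"a", "di", "del", "la", "una", "in", "al", "allo", "e"}


def candidate_markers(definitions, top_n):
    """Return (form, frequency, times_in_first_position) for the top forms."""
    frequency = Counter()
    first_position = Counter()
    for text in definitions:
        forms = [w for w in text.lower().split() if w not in SYNTACTIC_WORDS]
        frequency.update(forms)
        if forms:
            first_position[forms[0]] += 1
    return [(f, n, first_position[f]) for f, n in frequency.most_common(top_n)]


SAMPLE = [
    "Strumento atto a rendere orizzontale una retta",
    "Strumento atto a catturare mosche",
    "Strumento musicale a corde",  # invented definition, for illustration
    "Macchina del panificio atta a tagliare la pasta in pezzi",
]
```

On this small sample `candidate_markers(SAMPLE, 1)` returns `[("strumento", 3, 3)]`: the most frequent form is also the one that always opens the definition, which mirrors the observation made on the concordances.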
    <Paragraph position="19"> We shall use the name 'markers' to refer to these most frequent lemmas, but there is a difference between our 'markers' and the markers referred to in componential analysis: although our markers function as markers usually do, i.e., they describe a meaning or part of it, they remain essentially lemmas. It is thus not necessary to use a metalanguage different from the language which is being described; the elements of the lexicon can be given a metalinguistic function.</Paragraph>
    <Paragraph position="20"> These markers have been grouped into lists on the basis of different semantic criteria such as synonymy, antonymy, etc. We have also made a distinction between markers behaving as one-place predicates and markers behaving as two- or n-place predicates.</Paragraph>
    <Paragraph position="21"> A first group of 450 semantic markers was extracted and matched by a program with the generic part of all the definitions. We have verified that 22,146 definitions out of 47,291 were covered, in their generic part, by these markers. This first part of the work is described in more detail in CALZOLARI, MORETTI (1976).</Paragraph>
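The matching of markers against the generic part of the definitions amounts to a simple coverage count. The sketch below assumes that each definition has already been split into a (generic part, specific part) pair; the function names and the small marker set are hypothetical.

```python
def generic_coverage(definitions, markers):
    """Count definitions whose generic part is one of the markers."""
    covered = sum(1 for generic, _specific in definitions if generic in markers)
    return covered, len(definitions)


def percentage(covered, total):
    """Express a coverage count as a percentage with one decimal."""
    return round(100 * covered / total, 1)


# Generic and specific parts taken from the examples quoted in this paper.
DEFINITIONS = [
    ("STRUMENTO", "atto a rendere orizzontale una retta"),
    ("ATTREZZATURA", "atta al carico e allo scarico di materiali"),
    ("MACCHINA", "del panificio atta a tagliare la pasta in pezzi"),
]
MARKERS = {"STRUMENTO", "MACCHINA"}  # hypothetical marker list
```

Applied to the figures reported above, `percentage(22146, 47291)` gives 46.8, i.e. the first group of markers covered a little under half of the definitions.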
    <Paragraph position="22"> As the work continued, through further additions or substitutions of semantic markers which were provided by the literature on this subject, or resulted from our intuition, or came from successive analyses of the corpus, we have covered 40,135 definitions with 407 markers.</Paragraph>
    <Paragraph position="23"> We have ascertained that, in almost every case, the generic part of the definitions of the DMI (and therefore of the Zingarelli) gives the word whose level is immediately higher with respect to that of the defined lemma (considering a hierarchical classification moving from the more specific to the more general, i.e. from a greater to a smaller intension). This homogeneity in the definitions justifies the method we have adopted to refer all the lemmas back to the markers.</Paragraph>
    <Paragraph position="24"> In practice, for the lemmas not covered by markers after the first matching procedure, i.e. for those lemmas which are defined, in their generic part, by words which are too specific to be used as markers, we have established chains which refer back to more and more generic words until at least one marker is reached. In order to construct these chains we used a procedure to convert all the lemmas into numbers: this seemed to be the simplest way to keep in main storage the great quantity of data we had to work with. Using a program which works on these numbers, we have simulated a path for each lemma. This path starts from the lemma itself, and the program examines the generic part of the definition of the lemma. The program checks whether this generic part is included in the list of markers, and in any case treats this generic part itself as a lemma to be defined and looks for its definition; the procedure continues in this way until the generic part of a definition is found to be a marker without any other more generic marker above it. By this procedure, 91% of the definitions have been traced back to the markers, i.e. 40,135 out of 44,051.</Paragraph>
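The path simulated for each lemma can be sketched as a walk up a table that maps every lemma to the generic part of its definition. This is a simplified reading of the procedure, not the original program: it works on strings rather than on numeric codes, it assumes one generic part per lemma, and the entries marked as assumed below are invented to make the chain longer than one step.

```python
def chain_to_marker(lemma, generic_of, markers):
    """Follow generic parts upward; return the chain and the markers met."""
    chain = [lemma]
    current = lemma
    # Stop when a word has no definition or when the next word was already
    # visited (a guard against circular definitions).
    while current in generic_of and generic_of[current] not in chain:
        current = generic_of[current]
        chain.append(current)
    found = [word for word in chain[1:] if word in markers]
    return chain, found


GENERIC_OF = {
    "ARCHIPENDOLO": "STRUMENTO",  # from the examples in this paper
    "SPEZZATRICE": "MACCHINA",    # from the examples in this paper
    "MACCHINA": "STRUMENTO",      # assumed, for illustration
    "STRUMENTO": "OGGETTO",       # assumed, for illustration
}
MARKERS = {"MACCHINA", "STRUMENTO", "OGGETTO"}  # hypothetical marker list
```

Here `chain_to_marker("SPEZZATRICE", GENERIC_OF, MARKERS)` yields the chain SPEZZATRICE, MACCHINA, STRUMENTO, OGGETTO; the last element of the second return value is the most generic marker reached, and an empty list means the lemma could not be traced back to any marker.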
    <Paragraph position="25"> By means of these chains, we have given the noun-dictionary a resemblance to a tree-structure. This tree-structure has been formed using the definitions of the DMI for almost all the lemmas; the hierarchical structure we have given to the markers has, on the contrary, been partly taken from the definitions, and partly imposed by us according to the traditional rules of class inclusion.</Paragraph>
    <Paragraph position="26"> In setting up these chains (see Fig. 1), we discovered that some chains of definienda and definientia are circular, e.g. PARTE is defined in the DMI as PEZZO, and PEZZO as PARTE (see also CALZOLARI, 1977).</Paragraph>
    <Paragraph position="27"> In the example given in Fig. 1, the asterisk indicates the presence of at least one marker in the chain; the first number indicates the length of the chain; the second the length of the chain if it is circular; the third the distance between the two identical lemmas in the circular chain.</Paragraph>
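Circular chains of this kind can be detected by remembering the position at which each lemma was first met. The sketch below is one possible reading of the figures described above, not the original program; the PARTE and PEZZO entries come from the text, while FRAMMENTO is an invented lemma used only to show a chain that enters the circle from outside.

```python
def circular_chain(lemma, generic_of):
    """Return (chain, distance); distance is None for a non-circular chain."""
    chain = [lemma]
    position = {lemma: 0}
    current = lemma
    while current in generic_of:
        current = generic_of[current]
        chain.append(current)
        if current in position:
            # Same lemma met twice: the chain is circular. The distance is
            # the number of steps between its two occurrences.
            return chain, len(chain) - 1 - position[current]
        position[current] = len(chain) - 1
    return chain, None


GENERIC_OF = {
    "PARTE": "PEZZO",      # from the text
    "PEZZO": "PARTE",      # from the text
    "FRAMMENTO": "PEZZO",  # assumed, for illustration
}
```

For the invented lemma, `circular_chain("FRAMMENTO", GENERIC_OF)` returns the chain FRAMMENTO, PEZZO, PARTE, PEZZO with a distance of 2 between the two occurrences of PEZZO.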
    <Paragraph position="28"> It has been possible, using these chains, to assemble the entire dictionary around some essential cores of more inclusive meanings. These cores are the tops of the trees, and from there thick branches lead off to the more particular and specific levels of the lexicon. The final data concerning the number and depth of the chains are shown in Table 1.</Paragraph>
    <Paragraph position="29"> Moreover, for every marker (see Fig. 2), we have counted the number of times it occurs in all the chains (second column), and the number of times it appears in the chains which stop at the first marker (third column). In both of these cases, we have computed separately the occurrences of the marker at all the levels (1st, 2nd, 3rd, etc.). As far as the structure of the definitions is concerned, we wanted to start the analysis again from the definitions themselves (not trying to test some preconceived structures), with a careful check of the corpus of definitions.</Paragraph>
    <Paragraph position="30"> We have extracted prepositions, and prepositional or grammatical syntagms, on the basis of a frequency criterion, placing together under the term 'locution' or 'prepositional syntagm' (even if this term is not a very exact one) expressions of this kind: a forma di (in the form of); dal colore (of colour); provvisto di (provided with); munito di (furnished with); in contrasto con (in opposition to); consistente in (consisting in); simile a (similar to); originario di (originating from); che serve per (which serves for/as); etc.</Paragraph>
    <Paragraph position="31"> These phrases, which we will call, arbitrarily, 'prepositional syntagms', have been divided into various categories. This subdivision was made possible through an introspective examination of the associations of analogous meanings. The criterion was the identification of recurring semantic functions which have a similar meaning, even though these functions are expressed lexically and/or syntactically in completely different ways.</Paragraph>
    <Paragraph position="32"> One example of such grouped functions is the category SCOPO (aim), for which we have identified the following set of lexicalizations (when necessary, with the relevant inflection): tendente a (tending to); diretto a (aimed at); volto a (directed to); con lo scopo di (with the purpose of); a scopo di (for the purpose of); che ha lo scopo di (which has the purpose of); che mira a (which aims at); chi mira a (who aims at); mirante a (aiming at); rivolto a (turned to); per conseguimento di (for achieving); etc.</Paragraph>
    <Paragraph position="33"> We have grouped these lists of prepositions and prepositional syntagms into files on the basis of their affinity of meaning. This has been possible through the analysis of the functions and of the different possibilities of their expression, following inductive and deductive methods. The validity of these associations of meaning, made intuitively, was afterwards verified empirically: various procedures for extracting the definitions in which each function appears provided the material to be analyzed for this check. For instance, in the analysis of various relations, such as those we called ATTITUDINE (aptitude), COLORE (colour), FORMA (form), CONTENUTO (content), ORIGINE (origin), SCOPO (aim), USO (use), SOMIGLIANZA (similarity), COMPOSTO (composed of), MUNITO (furnished with), RELATIVO A (relative to), the check of all the definitions in which elements of the corresponding lists appear has shown the validity (about 80-90%) of the groupings made on the basis of our intuition. In addition, from this careful examination of different groups of definitions, we obtained some data which allowed us to formulate some interesting considerations.</Paragraph>
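The grouping of lexicalizations under functions, and the extraction of the definitions in which each function appears, can be sketched as a lookup table plus a phrase search. This is a minimal sketch under stated assumptions: the SCOPO lexicalizations are quoted from the text, the assignment of "a forma di" to FORMA and of "simile a" to SOMIGLIANZA is our assumption, inflected variants are ignored, and the sample definition is invented.

```python
# Lexicalizations quoted in the text, grouped under their function.
FUNCTIONS = {
    "SCOPO": ["con lo scopo di", "a scopo di", "che mira a", "mirante a"],
    "FORMA": ["a forma di"],         # assumed grouping
    "SOMIGLIANZA": ["simile a"],     # assumed grouping
}


def functions_in(definition):
    """List the functions whose lexicalizations occur in a definition."""
    # Pad with spaces so that only whole-word phrase matches are accepted.
    padded = " " + " ".join(definition.lower().split()) + " "
    found = []
    for function, lexicalizations in FUNCTIONS.items():
        if any(" " + lex + " " in padded for lex in lexicalizations):
            found.append(function)
    return found


def extract_by_function(definitions, function):
    """Select the definitions to be checked by hand for one function."""
    return [d for d in definitions if function in functions_in(d)]
```

For the invented definition "Oggetto a forma di disco simile a una ruota", `functions_in` returns FORMA and SOMIGLIANZA; `extract_by_function` then supplies the material for the manual check of each grouping.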
    <Paragraph position="34"> We have observed, for instance, that the definitional structure based on the relation ATTITUDINE (aptitude) has a quantitatively high homogeneity of application with respect to the lemmas in whose definitions the relation is used. In fact, in 50% of the definitions in which this relation appears, it is applied to lemmas whose generic part, i.e. whose main semantic marker, is included in the list of homogeneous markers we have called STRUMENTO (instrument) (see Fig. 3). Examples of the recurring generic parts with a high frequency are: Mec-</Paragraph>
    <Paragraph position="36">
(Flint-lock = Tool apt to cause ignition)
ARCHIPENDOLO = Strumento atto a rendere orizzontale una retta (Plumb-line = Instrument apt to make a straight line horizontal)
CARICATORE = Attrezzatura atta al carico e allo scarico di materiali (Loader = Machinery apt to load and unload materials)
SPEZZATRICE = Macchina del panificio atta a tagliare la pasta in pezzi (Cutter = Machine of the bakery apt to cut the dough into pieces)
It is interesting to point out the way in which a certain definition structure can be frequently associated with a certain kind of marker. Other definition structures linked to other functions can make it possible to delimit, within the lexicon, sufficiently homogeneous semantic fields. Since these associations between markers and functions occur in several groups of definitions, we think that this 'marker-relation' correspondence is not random, but is established for semantic reasons of affinity at a syntagmatic level. It seems possible for us to assert, at this point, that some markers show a preferential selection of certain types of defining relations rather than others, and vice versa. If this hypothesis is tested extensively on the lexicon, it can help in reaching a formalization of the semantic information which is in the DMI. We think that a more complete formalization, in comparison to that obtained by the simple hierarchical organization of the markers, can be achieved by also identifying the other kinds of relations which are different from the hierarchical one.
Functions such as those described above will allow: a) the linking of markers: for example, the pertinence relation PARTE (part) makes it possible to link the markers PERSONA (person), UOMO (man), DONNA (woman), with a set of markers such as MANO (hand), CAPELLI (hair), BOCCA (mouth), TESTA (head), CAPO (head), etc.; and/or b) the joining of the generic to the specific part of the definitions: for example, in the definition of ACCHIAPPAMOSCHE = Strumento atto a catturare mosche (Fly-swatter = instrument apt to catch flies) the function SCOPO (aim), in its lexicalization ATTO A (apt to), links the marker STRUMENTO (instrument) to its specification.</Paragraph>
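The 'marker-relation' correspondence discussed above can be measured as the share of definitions using a given relation whose main marker belongs to a given list of markers. The sketch below assumes that each definition has already been reduced to a (marker, relation) pair; the sample pairs and the content of the STRUMENTO list are invented for illustration and do not reproduce the DMI counts.

```python
from collections import Counter


def relation_share(analysed, relation, marker_class):
    """Share of definitions using `relation` whose marker is in `marker_class`."""
    markers = Counter(marker for marker, rel in analysed if rel == relation)
    total = sum(markers.values())
    if total == 0:
        return 0.0
    in_class = sum(n for marker, n in markers.items() if marker in marker_class)
    return in_class / total


# Invented (marker, relation) pairs, for illustration only.
ANALYSED = [
    ("STRUMENTO", "ATTITUDINE"),
    ("ATTREZZATURA", "ATTITUDINE"),
    ("MACCHINA", "ATTITUDINE"),
    ("PERSONA", "ATTITUDINE"),
    ("STRUMENTO", "FORMA"),
]
STRUMENTO_LIST = {"STRUMENTO", "ATTREZZATURA", "MACCHINA"}  # assumed content
```

On the invented pairs the share for ATTITUDINE is 0.75; on the real corpus the same kind of count is what underlies the 50% figure reported for this relation.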
    <Paragraph position="37"> For the final structure of the definitions, we think that the markers can either be considered as n-place predicates joined to their arguments by these various types of functions, or as nodes of a semantic network linked to the specific part of the definitions, i.e. the other nodes, by arcs which express these various types of functions.</Paragraph>
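The second of the two representations, a semantic network whose arcs carry the functions, can be sketched with a list of labelled arcs. This is an illustrative data structure only: the arc label IS_A for the hierarchical link and the direction chosen for the PARTE arcs are assumptions, while the lemmas and functions are those quoted in the text.

```python
from collections import namedtuple

# One arc of the network: source node, function label, target node.
Arc = namedtuple("Arc", ["source", "function", "target"])

NETWORK = [
    # From the text: ACCHIAPPAMOSCHE = Strumento atto a catturare mosche.
    Arc("ACCHIAPPAMOSCHE", "IS_A", "STRUMENTO"),  # IS_A is an assumed label
    Arc("ACCHIAPPAMOSCHE", "SCOPO", "catturare mosche"),
    # From the text: PARTE links MANO and TESTA to PERSONA.
    Arc("MANO", "PARTE", "PERSONA"),
    Arc("TESTA", "PARTE", "PERSONA"),
]


def linked_by(network, function):
    """Extract all pairs of nodes joined by arcs of one function."""
    return [(arc.source, arc.target) for arc in network if arc.function == function]
```

A call such as `linked_by(NETWORK, "PARTE")` collects every pair joined by the same function, which is the kind of extraction that a unified inventory of functions is meant to make possible.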
    <Paragraph position="38"> Such relations can be used as the starting point in the study of the use of prepositions and prepositional syntagms in the Italian language and, particularly, in the language of vocabulary definitions.</Paragraph>
    <Paragraph position="39"> Unifying these functions is also of great help in structuring the definitions at a higher level of formalization, and assists greatly in the extraction of all the data linked by the same function.</Paragraph>
    <Paragraph position="40"> We have also noticed that some types of sentence-structure occur more frequently in the definitions. Besides considering the functions in isolation, we have been working on a quantitative examination of the various possible combinations of these functions with one another; the aim is also to identify the kinds of sentence-structures most frequently used by lexicographers in the compilation of dictionaries. A practical goal for us is to work further towards unifying the definitions, by leading them back, as far as possible, to the more frequent and common structures.</Paragraph>
    <Paragraph position="41"> 2.4. Perspectives.</Paragraph>
    <Paragraph position="42"> Our research had a number of different aims but was principally directed towards the lexicographic aspect. This aspect consists in an attempt to analyze the defining method adopted by the Italian lexicographic tradition as shown by the Zingarelli. This analysis has been developed in two different stages: 1) An analysis of the terminology used in the definitions, through the extraction of markers. We have seen that, among the most frequent lemmas in the definitions (i.e. among those words whose extension is greater or, in other words, whose intension is smaller), we find those words considered as markers by the literature on this subject.</Paragraph>
    <Paragraph position="43"> 2) A check of the definitions considered from the point of view of their structure. This emphasized the very high frequency of certain types of functional syntagms as being more suitable in compiling definitions. It will be interesting to have a comparative examination with dictionaries of other languages.</Paragraph>
    <Paragraph position="44"> The semantic aspect is very closely related to the lexicographic aspect of this study. Our aim was to give a hierarchical type of organization, even if provisional, to the large set of Italian nouns at our disposal. In doing so, we have taken what in our opinion is the first step towards a decomposition of a meaning into distinctive markers, i.e. the attribution, as main semantic marker, of the lemma which is at an immediately higher level in a hierarchical scale. Many hierarchical scales can be identified in the lexicon, or more precisely among the meanings of the lexical items.</Paragraph>
    <Paragraph position="45"> We have also begun, through the study of prepositional functions, the second step in the decomposition of a meaning into markers: the linking of markers with other markers, the identification of the different kinds of relations which exist among markers, and of those relations which exist between primary and secondary markers expressed respectively by the generic and the specific part of the definitions. There is also an important practical aspect of this work: that of making the definitions of the DMI more uniform from a semantic point of view. This is achieved by indicating the semantic uniformities which are latent under the different lexicalizations of the same markers or of identical relations, and by reducing these diversities of lexical forms to one single symbol reflecting their uniformity. This will make the DMI easier to consult.</Paragraph>
    <Paragraph position="46"> This work should also be of relevance, at a future date, in connection with an analysis of the verb which takes into consideration the above-mentioned analyses of the noun, at first at the level of selectional restrictions and, later, at the level of &quot;knowledge of the world&quot;. Thus, we feel that our work can provide a first step towards a future use of the DMI in syntactic and semantic analyses of the Italian language.</Paragraph>
  </Section>
</Paper>