File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/87/e87-1006_metho.xml
Size: 27,750 bytes
Last Modified: 2025-10-06 14:11:54
<?xml version="1.0" standalone="yes"?> <Paper uid="E87-1006"> <Title>A GENERATIVE GRAMMAR APPROACH FOR THE MORPHOLOGIC AND MORPHOSYNTACTIC ANALYSIS OF ITALIAN</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> A GENERATIVE GRAMMAR APPROACH FOR THE MORPHOLOGIC AND MORPHOSYNTACTIC ANALYSIS OF ITALIAN </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="35" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> A morphologic and morphosyntactic analyzer for the Italian language has been implemented in VM/Prolog 131 at the IBM Romc Scientific Center as part of a project on text understanding.</Paragraph> <Paragraph position="1"> Aim of this project is the development of a prototype which analyzes short narrative texts (press agency news) and gives a formal representation of their &quot;meaning&quot; as a set of first order logic expressions. Question answering features are also provided.</Paragraph> <Paragraph position="2"> The morphologic analyzer processes every word by means of a context free grammar, in order to obtain its morphologic and syntactic characteristics.</Paragraph> <Paragraph position="3"> It also performs a morphosyntactic analysis to recognize fixed and variable sequences of words such as idioms, date cxpressi{~ns, compound tenses of verbs and comparative and superlative form~ of adjectives.</Paragraph> <Paragraph position="4"> The lexicon is stored in a relational data base under thc control of SQL/DS \[2\], while the endings of the grammar are stored in thc workspace as Proiog facts.</Paragraph> <Paragraph position="5"> A friendly interface written in GDDM \[11 allows the uscr to introduce on line the missing lemmata, in order to directly ulxlatc thc dictionary.</Paragraph> <Paragraph position="6"> Introduction About thirty years ago, the development of decripting tccniques made computer scientists be involved for the first time in the field of Linguistics, especially in automatic translation matters.</Paragraph> <Paragraph position="7"> The failure of most of these projects contributed to a general sensibilization towards natural language problems, and gave rise to a variety of formal theories for their treatment.</Paragraph> <Paragraph position="8"> In the last few years, one of the main research objectives ix-came the design of systems able to acquire knowledge directly from fcxts. using natural language as an interface between man and machine. At the IBM Rome Scientific Center a system has been developed for processing Italian texts. The task of the system is to * analyze short narrative texts (press agency news) on a restricted domain (Economics and Finance), * give the formal representation of their &quot;meaning&quot; as a set of first order logic expressions, stored in a knowledge base, * consult this knowledge base in order to answer any qucstlon about the contents of analyzed texts.</Paragraph> <Paragraph position="9"> The system consists of: * a mmphologie analyzer based on a context-free logic grammar with the &quot;word&quot; as axiom and its possible components as terminal nodes. It.includes a lexic9n of about 7000 elementary lemmata, structured in a table of a relational data base under the control of SQL/DS.</Paragraph> <Paragraph position="10"> * a morphosyntaetic analyzer realized by three regular grammars, recognizing respectively compound tenses of verbs (e,g. ha.~ been signed), comparative and superlative forms of adjectives (e.g. Ihe most interesting) and compound numbers (e.g. three billions .~64 millions 234.000). This module reduces the number of possible syntactic relations among the words of the sentence in order to simplify the task of the syntax.</Paragraph> <Paragraph position="11"> * a syntactic parser developed by means of a meta-analyzcr \[6\[ which aUows to write production rules for attribute gntmmars, and generates from these the corresponding top-down parser. A grammar has been written to describe the fragment of Italian consider.~l.</Paragraph> <Paragraph position="12"> * a semantic la'oe~sm * based on the Conceptual Graphs formal;sin \[10\] and provided, with a semantic dictionary containing at present about 350 concepts. Its task is to solve syntactic ambiguities and recognize semantic relations between the words of the sentence 191.</Paragraph> <Paragraph position="13"> This paper deals in particular with the structure of the lexicon adopted in tht: system and with the morhologic and morhosynlactic analyzer.</Paragraph> <Paragraph position="14"> In this system the morphology and the lexicon are strictly combined; for this reason this lexicon does not contain semanlic information. In the approach of Alinei \[4\], on the contrary, lexicon structures contain semantic information in order to describe every word also in te~qns of its &quot;meaning&quot; Another possible approach is the one adopted by Zampolli who developed a frequency lexicon of Italian language at tile Computational Linguistic Institute in Pisa \[5\]. The lexicon realized by ZampoUi's working group containes morphologie hints in order to guide directly the analysis of every word, without the support of a morphologic p~ rser.</Paragraph> <Paragraph position="15"> in most of the works referring to English language morphology is considered onl) as a part of the syntactic parser. On the contrary. Italian morpho'ogy requires to be previously analyzed because it is more complex: there are more rules than in English and these rides present many exceptions.</Paragraph> <Paragraph position="16"> For this reason, in the last few years Italian researchers began to face systematically these problems beside a purely linguistic conlcxk A procedural approach is the one followed by Stock in the development of a morphologlc analyzer realized for lhe &quot;Wednesday2&quot; parser I 11\[.</Paragraph> <Paragraph position="17"> A different approach makes use of formal grammars to describe the rules of Italian morphology. This morhologic analyzer is based on a context free grammar describing the logic rules for the word generation. Other two morphologic systems have been developed according to the ATN formalism (Augmcuted Transition Network). The fast one has been realized at the CNR Institute of I'is~ by Morreale, Campagnola and MugeUesi, as a research tool for teaching Italian morphology, with applications in automatic processinC/ of natural language and knowledge representation 18\]. The second one has been realized by Delmonte, Mian, Omologo and Satta, as part of a system for the development of a reading machine for blind people. 171.</Paragraph> <Paragraph position="18"> In the first section of this paper there is a brief discussion atx, ut morphologic problems and about the possible approaches to their solution.</Paragraph> <Paragraph position="19"> The next section describes the structure adopted for the lexicon and the other sets of data.</Paragraph> <Paragraph position="20"> The third section deals with a preanalyzer, which simplifies the work of morphologie analysis by recognizing standard sequences of words, as idioms and date expressions.</Paragraph> <Paragraph position="21"> In the fourth section the morphologic analyzer is described and in the last one the morphosyntactic analyzer, both realized by means of context free grammars.</Paragraph> <Paragraph position="22"> The problem The aim of morphology is to retrieve from every analyzed word the lemma it derives from, its syntactic category (e.g. verb, ..un, adjective, conjunction .... ) and its morphologic catego~ (e.g. masculine, singular, indicative .... ).</Paragraph> <Paragraph position="23"> A possible approach to the problem is to store in a data base a list of all the declined forms for every lemma of the language, as well as their morphologic, syntactic and semantic characteristics. The size of such a list would be enormous, because a common dictionary contains about 50000-100000 lemmata and each lemma gives rise to several derived words and each word may be declined in different ways.</Paragraph> <Paragraph position="24"> Such a large data base is hard to enter and to update, and it is limited by the fixed size of its words list.</Paragraph> <Paragraph position="25"> In Italian, the creation of words is a generative proces~ ~hat follows several roles like, for instance: In English, rules like composition or cliticization are not strictly morphologlc, because they often involve more than a word. In Italian, on the contrary, they modify the single word, producing new words like, for instance: (to wrap in paper) These rules make the set of Italian words potentially unlimiled, and sometimes make insufficient even a common dictionary.</Paragraph> <Paragraph position="26"> A different approach takes two different lists: one containing the lemmata of the language and the other the logic rules of derivations, from which all the correct Italian words can be produced starting from the lemmata.</Paragraph> <Paragraph position="27"> These rules can be easily described by means of a context-free grammar, in which every &quot;word&quot; results from the concatenation of the &quot;stem&quot; of a lemma with alterations, affixes, endings and enelities. This grammar can both generate from a given lemma all the current Italian words deriving from it and analyze a given word by giving all the possible lemmata it derives from.</Paragraph> <Paragraph position="28"> The backtracking mechanism of Prolog directly allows to obtain all the solutions.</Paragraph> <Paragraph position="29"> This morphologic analyzer can also provide further information about some linguistic peculiarities, like, for instance: compound names modal verbs altered names pelle-rossa (red-skin), which has as plural peUi-rosse.</Paragraph> <Paragraph position="30"> which take another verb as object (1 can go) foglia (leaf) can be altered in fogli-olina (leaf-let), whose meaning is piccola foglia (small leaf).</Paragraph> <Paragraph position="31"> Data structure A correct morphologie analysis requires not only knowledgc on the language lemmata, but also on the word components as alterations, affixes, endings and enclitics. This information might hc represented in form of Prolog facts. In this way, data mighl be directly accessed by the program, because the homogeneity of their structure. The disadvantage is a performance degradation when the size of data increases, since Prolog is not provided with efficient search algorithms.</Paragraph> <Paragraph position="32"> Hence it seemed convenient to draw a distinction between data: on one hand the set of lemmata, and on the other the sets of affixes, alterations, endings and enclitics. The former (which is the most relevant and needs to be continuously updated), has been struclurcd as a relational data base table, managed by the SQI,/DS. The advantage is that this system is directly accessible from VM/Prolog (the string containing the query is processed by SQI., which returns the answer as a Prolog list). The latter (which have fixed lenghl and are not so large), have been stored in the Prolog workspace i, f, rm of Prolog facts.</Paragraph> <Paragraph position="33"> The set of lemmata is a table with five attributes: 1. the fu'st is the lemma.</Paragraph> <Paragraph position="34"> 2. the second is the stem (the invariable part of the lemma): this is the access key in the table.</Paragraph> <Paragraph position="35"> 3. the third is the name of the &quot;class of endings&quot; associated with every lemma. A class of endings is the set of all the endings related to a given class of words. For example, each of the regular verbs of the first conjugation has the same endings; hence there exists a class named dv_leonjug containing all and only these endings. Generally each irregular verb is related to different classes of endings: andare (to go), for example, admits two different stems, vad (go) and and (went); so there exist two subclasses of endings named respectively dvl andare and dr2 andare.</Paragraph> <Paragraph position="36"> 4. the fourth attribute is the syntactic category of the lemma: Ior example, the information that to have is an auxiliary transitive verb.</Paragraph> <Paragraph position="37"> 5. the fifth is an integer identifying the type of analysis Iobc performed: I the analysis can be performed completely 2 the lemma can neither be altered nor affixed (this is the case for example of prepositions and conjunctions) 3 only the longest analysis of the lemma is considered (this is the case of the false alterated nouns: mattino (morning) is not a little matto (mad), such as in english outlet is not a little out!) lemma I stem ending dam synt=categ label matte matt da_bello adj.qualific. 1 mattino mattin dn_oggctto noun.common 3 di di --- prep.simple 2 andare vad dv 1 _andare v.intran.simple 1 andare and I dv2. andar(c) v.intran.simple I The other sets of data are contained in the Prolog workspace and are structured as tables of a relational data base.</Paragraph> <Paragraph position="38"> The set of the classes of endings is a table with three attributes: l.</Paragraph> <Paragraph position="39"> 2.</Paragraph> <Paragraph position="40"> 3.</Paragraph> <Paragraph position="41"> the first is the name of the class and it is the access key in the table.</Paragraph> <Paragraph position="42"> the second is one of the endings belonging to the class the third is the morphologic category associated with the ending: for example, the class dn..oggetto contains the two endings which are used in order to inlleet all the masculine nouns behaving like the word oggetto (object): o for the singular (oggett-o), and i for the plural (oggett-O.</Paragraph> <Paragraph position="43"> eading..da~ ending morph_categ dn_oggctto o mas.sing.</Paragraph> <Paragraph position="44"> dn_oggetto i mas.phir.</Paragraph> <Paragraph position="45"> The affixes can be divided in la'eflxcs preceding the stem of the lemma, and suffixes following the stem of the lemma.</Paragraph> <Paragraph position="46"> The prefixes are simply listed by means of a one attribute table. In this way it is not necessary to list the prefixed words in the lexicon: they are obtained by chaining the prefix with the original word. For example, from the verb to handle with the prefix re we obtain the verb to rehandle. Morphologlc and syntactic characteristics remain the same; for the verbs only, the prefixed verb differs sometimes from the previous one in the syntactic atlribules (transitive/intransitive, simple/modal).</Paragraph> <Paragraph position="47"> The set of suffixes is a table with four attributes: the first is the suffix itself the second is the stem of the suffix (the access key to the table) the third is the ending class of the suffix the fourth is the syntactic class of the suffix. Suffixcs, in fact, differently from prefixes, changes both morphologic and syntactic characteristics of the original word: they change verbs into names or adjectives (deverba/suff'oces), names into verbs or adjectives (denominal suffixes), adjectives into verbs or names (deadje:tival suffixes). The first attribute is chained to the stem of the original lemma in order to obtain the derived lemma: for example, from the stem of the lemrna mattino (morning), which is a noun, with the suffix iero, we obtain the new lemma mattin-iero (early rising), which is an adjective, and from the second stem of the lemma andare (to go), which is a verb, with the suffix amento, we obtain the new lemma and-amento (walking), which is a noun.</Paragraph> <Paragraph position="48"> 1. the first is the stem of the alteration (the access key in the tablc l 2. the second is the ending class of the alteration 3. the third is the semantic type of the alteration. Alterations change the morphologic and semantic characteristics of the altered word, but not its syntactic cathegory: for example, the lemma easa (house) can be altered in casina (little house), easona (big house), easaeeia (ugly house), and so on: stem endinLda.~ seman categ in da belle diminutive on dn_cosa augmentative acc da_~bio pejorative The cnclitics are pronouns linked to the ending of a verb: for example va li&quot; (go there) can be expressed also in the form vaeei (ci is the C/nclitic, the c is duplicated according with a phonetic rule). The set of the enclitics is a table with two attributes: the first is the maclitic (this is the access key to the table) and the second is the morphologlc characteristic of the encfitic. The analy-zer divides the verb from the enclitic, so that it becomes a different word, taking the morphologlc characteristic stated in the table and the syntactic category of pronoun.</Paragraph> <Paragraph position="49"> Other two sets of data have been defined in order to handle fixed sequences of words, such as proper names and idioms.</Paragraph> <Paragraph position="50"> The set of the most common italian idioms has been structured as a table with two attributes: the first one is the idiom itself, while the second is the syntactic category of the idiom. In this way it is possible to recognize the idiom without performing the analysis of each of the component words. For example, di mode che (in such a way as) is an idiom used in the role of a conjunction, and a mane a matzo (little by little) is used in the role of an adverb. The set of proper names belonging to the context of Economics and Finance is a table with three attributes: the first is the proper name, the second its syntactic category and the third its moq~hologic The preanalyzer simplifies the work of analysis recognizing all the &quot;fixed&quot; sequences of words in the sentence.</Paragraph> <Paragraph position="51"> Fixed sequences of words arc, for example, idioms like in such a way as. To analyze this sequence of words it is not necessary to know that in is a preposition, such is an adjective, a an article, and so on: the only useful information is that this sequence takes the role of conjunction. Other fixed sequences of words are proper names: it is necessary to know, for example, that Montepolimeri Montedi.wn or Vittorio Ripa di Meana are single entities.</Paragraph> <Paragraph position="52"> Idioms and proper names are recognized by means of a pattern matching algorithm: the comparison is made between the lll|,tll sentence and the first attribute of the tables of idioms and proper names. When the comparison fails, backtracking evaluates another hypothesis. Every recogniz~ed sequence of words is written on an appropriate fde and then removed from the input sentence. Date expressions, as lunedi' 13 agosto (monday, august tile /3rd), arc considered as single entities, in order to simplify the work of syntax. They are recognized by means of a context-free grammar, whose axiom is the &quot;date': Numbers are recognized by the library function numb(*) and by means of a context-free grammar translating strings into numbers. In this way it is possible to evaluate in the same way expressions such as 1352 and milletreeentoeinquantadue (one thousand three hundred and fifty two).</Paragraph> <Paragraph position="53"> The morphologic analyzer This is the main module of the whole system. Its task is to analyse each element (word) of the list received from the preanalyser and to produce for every form analyzed the list of all its characteristics: I. the lemma it derives from 2. its syntactic characteristics 3. its morphoiogic characteristics (none for invariable words) 4. the list of alterations (possibly empty) 5. the list of enclitics (possibly empty).</Paragraph> <Paragraph position="54"> For example the form sono (the ist sing. and the 3rd plur. person of the present indicative of essere, to be), after the analysi~ is represented by the list: ( S ono.</Paragraph> <Paragraph position="55"> (V. int ran. aux. ind. pres. act. 1. sing. es s ere. n i 1 ). (v. int ran. aux. ind. pres. act. 3. plur. essere, nil ). nil) Every Italian word is made up by a fundamental nuclc,s, tile stem (two for the compound names). This is preceded by one or more prefixes, and followed by one or more suffixes and alterati,,ns, by an ending and, as far as the verbs are concerned, by one or more enclitics.</Paragraph> <Paragraph position="56"> This structure has been described by means of a context-free grammar in which the &quot;word&quot; is the axiom and all its comlxmcnts the endings.</Paragraph> <Paragraph position="57"> tlere are some example of words analyzed with this grammar: muraglione (high wall) tour is the stem of the word muro (wall) agl is the stem of the suffix aglia i-on on is the stem of the alteration one (augmentative): the i is an euphonic vowel e is the ending of the singular.</Paragraph> <Paragraph position="58"> port is the stem of the verb portare (to carry) at is the ending of the past participle of the verb or is the stem of the deverbal suffix ore e is the ending of the masculine singular.</Paragraph> <Paragraph position="59"> prefix tr! port ending sufflx T~L I I .oL</Paragraph> <Paragraph position="61"> ridandoglido (giving h to him/her again) rl is the prefix (R means again) d is the stem of the verb dare (to give) ando is the ending of the present tense of gerund of the verb glie is the first enclitic (it means to ~tim~he,): e is an The compound nouns are not reported in the lexicon: they arc derived from &quot;the two component lemmmata. Their plural is made according to the following set of rules: The task of this part of the morphology is to: reeoguize all the &quot;well-formed&quot; words of Italian language. The analyzer parses the words from left to right, splitting them into elementary parts: prefix(es), the stem(s) of the appropriate lemma(ta) of derivation (retrieved from a restricted dictionary reporting only the &quot;elementary lemmata') suffix(es), alteration(s), ending(s), enclitic(s). Each hypothesis is checked by verifying that all the conditions for a right composition of those parts are satisfied.</Paragraph> <Paragraph position="62"> 2. submit every word not recognized to the user, who can state wether: (r) the word is really wrong, because of - an orthographic error: for example squola instead of scuola (school).</Paragraph> <Paragraph position="63"> - a composition error: for example serviziazione is wrong as 'iazione' is a deverbal suffix and 'serviz&quot; is the stem of the noun 'servizio' (service) and the corresponding verb does not exist.</Paragraph> <Paragraph position="64"> a the word derives from a lemma which is not reported in the lexicon. In this case the user can recall a graphic interface, allowing him/her to update directly the lexicon.</Paragraph> <Paragraph position="65"> 3. perform, if requested by the user, an inspection in the list of the &quot;currently used&quot; words. In this way, for example, the user knows that coton-~eio (cotton-mill) and coton-iera are two well-formed Italian words, but that only the first one is commonly used. The morphosyntactic analyzer The aim of the morphosyntactic analyzer is to perform the analysis of the contiguous words in the sentence, in order to recognize regular structures such as compound tenses of verbs and comparative and superlative forms of adjectives.</Paragraph> <Paragraph position="66"> Compound tenses of verbs are described by means of a regular grammar, whose rules are applied any time the analyzer finds in the sentence the past participle of the verb. These rules arc:</Paragraph> </Section> <Section position="3" start_page="35" end_page="36" type="metho"> <SectionTitle> I C0MP:ZNSZ 2 COMP TENSE 3 REM 4 REM 5 REM </SectionTitle> <Paragraph position="0"> When a rule is successfully applied the morphologic categories of the verbs are changed and the attribute 'active'/'passive' can bc specified correctly. For example, after the morphosyntactic analysis.</Paragraph> <Paragraph position="1"> the phrase io suno chiamato (I'm called) ((io.</Paragraph> <Paragraph position="2"> (pron. pets. 1. sing. io. nil).</Paragraph> <Paragraph position="3"> nil).</Paragraph> <Paragraph position="4"> ( s ono.</Paragraph> <Paragraph position="5"> (v. intran, aux. ind. pres. act. 1. sing. essere, ni I ). (v. int ran. aux. ind. pres. act. 3. plur. essere, ni \] ). nil).</Paragraph> <Paragraph position="6"> (ehiamato.</Paragraph> <Paragraph position="7"> (v. tran. sire. part. past. act. mas. sing. chiamare, ni I ). nil).</Paragraph> <Paragraph position="8"> nil) becomes ((io.</Paragraph> <Paragraph position="9"> (pron. pers. 1. s ing. io. nil).</Paragraph> <Paragraph position="10"> nil).</Paragraph> <Paragraph position="11"> ( sono_chiamato.</Paragraph> <Paragraph position="12"> (v. tran. s \]an. pass. ind. pres. 1. sing. chiamare, ni I ). nil).</Paragraph> <Paragraph position="13"> nil).</Paragraph> <Paragraph position="14"> in which only the fu-st analysis of the word &quot;sono&quot; has been taken, as the number of the auxiliary verb must correspond to the nu,nber of the past participle. The form is passive, as &quot;chiamare&quot; (to call) is a transitive verb (the auxiliary verb for the active form is to have). In this case morphosyntactic analysis has solved an ambiguity: only an interpretation will be analyzed by syntax. The following figure shows the task of the grammar, applied any time the parser finds the past participle of a verb in the sentence. (r) If the verb is transitive the parser looks at the word BF.FORE the verb: - if the word is a tense of the verb to be, the resulting verb is SIMPLE PASSIVE (the rules applied are the 2nd and the 4th); - if the word is a tense of the verb to have, the resulting verb is COMPOUND ACTIVE (the rule applied is the lst). u If the verb is intransitive the parser looks at the word AF'I'FR the verb: - if it is the past participle of another verb the resulting vcrh is COMPOUND PASSIVE (the rules appfied are the 2nd and the 3rd); - otherwise it is COMPOUND ACTIVE (the rules applied arc the 2nd and the 5th).</Paragraph> <Paragraph position="15"> pIIIT IMATImlq8 l i '-- i I &quot;- i 2,4 1 2.3 2.8 The grammar for the comparative and supcrlativc forms of adjectives is applied any time the analyzer finds thc words piu' (more), meno (less) followed by a qualificative adjective. In this way it is possible to recognize and to distinguish expressions like piu' interessante (more interesting) and il pin' interessante (the most interesting). Remark that in English there is the use of more, most to make cleat the distinction between the comparativc and the superlative form of the adjective.</Paragraph> <Paragraph position="16"> form of adjectives In the same manner it is possible to recognize mixed numeric expressions like three billions 564 millions 234000 and to cwduate thcrn into their equivalent numeric form (3564234000). The talcs arc applied any time the analyzer finds the words miliardi (billions), milioni (millions) in the sentence.</Paragraph> <Paragraph position="17"> This approach presents the advantage of a higher flexibilily in the analysis of words. Moreover such a method has requested a strong initial effort in the formalization of the rules (with all their exceptions) for the morphologic treatment of words, but has largely simplified the work of classification of every Italian word. The lexicon stores about 7000 elementary lemmata, derived from a list of about 20000 different Italian forms. They correspond to about 15000 ordinary lemmata (entries of a common dictionary).</Paragraph> </Section> class="xml-element"></Paper>