It Would Be Much Easier If WENT Were GOED 
Dan TUFIS 
Institute for Computer Technique and Informatics 
8-10, Miciurin Bd., 71316 Bucharest 1, Romania 
Tel. 653390, Telex 1189t-icpci-r 
ABSTRACT 
The paper proposes a paradigmatic approach to 
morphological knowledge acquisition. It ad- 
dresses the problem of learning from examples 
rules for word-forms analysis and synthesis. 
These rules, established by generalizing the train- 
ing data sets, are effectively used by a built-in in- 
terpreter which acts consequently as a 
morphological processor within the architecture 
of a natural language question-answering system. 
The PARADIGM system has no a priori knowledge 
which should restrict it to a particular natural lan- 
guage, but instead builds up the morphological 
rules based only on the examples provided, be 
they in Romanian, English, French, Russian, SIo- 
vak and the like. 
1. INTRODUCTION 
For highly Inflexlonal languages, encoding all 
word forms Into a word lexicon (declarative mor- 
phology approach) appears to be a poor solution 
not only due to a great redundancy (which for 
some languages is prohibitive (Japplnen,t983)) 
but also with respect to some theoretical aspects 
(as for Instance descriptional adequacy 
(Wehrli,1985)). 
Within an Inflexional morphology environment 
(Alshawl, t985), we propose a procedural ap- 
proach based on automatically acquired flexlo- 
nlng paradigms. The paradigmatic model is 
Incorporated Into an experimental system called 
PARADIGM, which Is Intended to (partially) re- 
place the acquisition and modelling part of our 
MORPHO lexicon management system 
(Tufis,1987a) incorporated by the lURES environ- 
ment for building natural language applications 
(Tufts, 1985). 
tioned motivations. With respect to the second 
one maybe it is worth saying that when PARA- 
DIGM lacks appropriate or complete knowledge It 
is supposed to act the same way a child or a 
foreign speak partner does. That is, for instance, 
to say "goed" or "womens" ff the corresponding ir- 
regularity is unknown. The reason for such a de- 
cision stems mainly from our attempt to study in 
parallel with the implementation, the effectiveness 
of language learning based on Informal examples. 
From the linguistic engineering point of view, the 
purpose of the system is stated very pragmati- 
cally, that is to ease and speed up as much as 
possible the building of language specific mor- 
phological knowledge bases without (too much) 
help from theoretical morphologists, experts on 
the language concerned. It is not always easy to 
find appropriate written material, not to speak 
about human experts, presenting in a rigorous 
manner (as imposed by computer applications) 
the rules and peculiarities of word structuring In 
different languages. 
PARADIGM was conceived to overcome, at least 
partially, these difficulties and to provide a handy 
tool for Immediate verification of specific rules va- 
lidity. As the general situation is with learning sys- 
tems, different copies of PARADIGM may be usecl 
in parallel and finally merge the Individually de- 
veloped knowledge bases. This Is beneflclaly not 
only with respect to the development time but also 
with respect to the linguistic coverage. 
Architecturally, PARADIGM was Influenced by 
DISCIPLE ('recuci, 1988) in the sense that the be- 
havioral dichotomy "apprentice-expert" was Incor- 
porated into Its Implementation. However, due to 
the more specific task, the technical solutions 
adopted in PARADIGM for knowledge acquisition 
are different, being problem oriented. 
The aim of our work is twofold: to obtain a sound 
linguistic toOl for word-forms analysis and syn- 
thesis (which In case of highly lnflexlonal lan- 
guages is by no means a trivial task) and to 
provide for a psychologically motivated behaviour 
of such a system in dealing with unknown words. 
In the following, we shall dwell on the technical is- 
sues connected to the first of the above two men- 
2. DEFINITIONS 
2.1 MORPHOLOGICAL MODEL 
We call a morphological model the tuple: 
MM = (C,SC,M,V, F1,F2,F3,P) where C is a set of 
categories: C = {cl ..... c*}; SC Is a set of sub-ca- 
tegories of the categories in C: SC = {scl ..... scJ}; 
- 145 - 
M is a set of features of the sub-categories in SC: 
M = { m l ..... n~ }; V is a set of values which features 
can take: V = {vl ..... vm}; Ft Is a function defined 
on C, taking values in the power set of SC: 
Ft: C--- > PS(SC); F2 is a function defined on SC, 
taking values in the power set of M: 
F2: SC--- > PS(M); F3 is a function defined on M, 
taking values in the power set of V: F3: 
M---> PS(V); P is a subset of the Cartesian pro- 
duct C x SC x P(M) x P(V) so that Vpi = (ci,so,Mi,Vt) 
E P the following are true: cl~ C & so E SC & Mi c M 
& VIcV & scl~Fl(ci) & Ml=F2(sci) & 
Mi = {rr~l ..... mik}& Vi = {vii ..... VIk} & Vq~ \[1,k| 
viq~ F3(miq) P is called the paradigmatic ftexio- 
ning space of the morphological model MM. For 
instance a point of P in a certain MM might be: 
(noun common-noun 
(gender number case articulation) 
(masculine plural genitive definite)) 
2.2. THEMATIC FAMILIES 
We call a thematic family ('rF) the set of all word- 
forms of a given (lemma) word, obtained by gram- 
matical Inflecting: TF= {W1 ..... Wm}. Let us 
consider a TF to be always lexicograpHIcally 
sorted. Let <X> denote an arbitrary string of 
characters and < X > < Y > the string obtained by 
concatenating the substrings <X > and <Y >. 
We say that a TF Is regular iff there Is a q-letter 
substring < Rq > called root, common to all the m 
words of TF so that: 
la)Vi<Wi> ETF, <Wi> = <Rq> <ei> 
lb) < Rq > is the longest substrlng with the 
property 1 a) 
lc) q > = low-limit (an Integer varying from lan- 
guage to language) 
I d) VIe \[2,m-1 \] the subsets { < W1 > ..... < Wi > } 
and {<Wl+l> ..... <Win>} give the roots 
<Rql> and <Rq2> with <Rql > being a sub- 
string of < Rqa > or at most equal to < Rq2 >. 
The remainlng part of a word in TF alter remov- 
ing the root is called an ending (we use the term 
'ending' to Include both deslnences and suffixes). 
The list of all endlngs obtained from a TF Is called 
a paradigmatic endings family (PEF). 
A thematic family is called partial regular if there 
Is a partition of TF = {TF1 ,TF2 ..... TFk} so that: 
2a) lJTFi =TF & Vl,J(l~,J) TFin TF)= £~ 
2b) VI (TFi is regular & CARD(TFI) > 1). 
According to the above definition, a partial-regu- 
lar TF will be characterized by k roots. A thematic 
family which is neither regular nor partial-regular 
Is called Irregular. 
In the following, in order to simplify notations, 
when referring to strings of characters, we use an- 
gular brackets only if we need to outline a compo- 
sition/decomposition of a word-form. 
A central notion of our approach is that of flex- 
toning paradigm. Its meaning is similar to that 
used by most of the morphologists. 
We define a flexlonlng paradigm Q as a list of 
pairs: Q = {(el pl)(e2 p2)...(ek ~)} where 'e' are 
endings extracted from a thematic family (irre- 
spective of their regularity) and 'pl' are appropri- 
ate points in P (the appropriateness will be 
revealed in the fourth chapter). 
2.3 UNINTERPRETED LEXICON 
Let LS be a set of words obtained fromthe union 
of K thematic families, called a lexical stock: 
LS=TF1uTF2LJ...uTFk. We call an uninter- 
prated lexicon of the word stock LS a set 
UL = {R1,R2 ..... Rp} so that for any i~ \[1,p\] Ri is a 
root of a certain TFi in LS. The mapping 
h LS --- > PS(UL x P) Is called an Interpretation of 
an UL within a morphological model MM (recall 
that P Is a paradigmatic flexloning space of a cer- 
tain MM). Let us observe that I mapping allows a 
word to be ambiguously interpreted, which is 
quite natural at the level of isolated word-form ana- 
lysis. Such a common ambiguity, for Instance, is 
figured out by the Romanlan word "modul", which 
may stand either for the unarttculated nomina- 
tive/accusative form of "modul" (module) or for the 
articulated nominative/accusative form of "mod" 
(mode, manner). The I mapping abstracts the pro- 
cess of word-forms analysis. The abstraction of 
the reverse process, the generation of word- 
forms, Is represented by the mapping G defined 
as follows: G: Ul.xP --- > LS. As opposed to I, G 18 
a univoque function, that Is for a given root and a 
specific point in the paradigmatic flexioning point 
P, a unique word-form will result. 
3. BUILDING A MORPHOLOGICAL MODEL 
To build a morphological model the designer 
starts by speclfiying the categories of interest In 
his/her application. The traditional categories are 
NOUN, ADJECTIVE, VERB, PRONOUN and so 
forth, but by no means this categorlal system Is 
146 - 
obligatory (for instance one might think of using 
semantically flavoured categories such as OB- 
JECT, PROPERTY, ACTION, STATE, ANAPHOR 
etc). 
For each defined category in C, the designer will 
be asked to provide the desired sub-categories 
(for instance COMMON-NOUN and PROPER- 
NOUN for NOUN). This activity Is equivalent In the 
formal model to defining the SC subset and the F1 
function. Further, for each sub-category In SC the 
system asks the designer to enter the specific fea- 
tures along which the Inflexlonal behaviour of the 
words gets relevant. With Romanlan language for 
instance, while number, case and enclitic articu- 
lation are relevant for COMMON-NOUN, for fe- 
minine PROPER-NOUN only the case Is 
significant (but this is not always true: the feminine 
proper-nouns ending In a consonant, whatever 
their etimology, do not flexate at all). By entering 
all sub-category-features associations, the system 
is implicitly provided with the M set and F2 func- 
tion. Finally, for each feature in M, the designer 
will be asked to define the possible values the cur- 
rent feature may take (e.g. 'singular' and 'plural' 
for the 'number' feature). When the list of features 
Is exhausted, the system has already learnt the V 
set and F3 mapping. At this point, the activity of 
the designer Is theoretically finished and it is the 
system itself which wlU generate, based on these 
definitions, the paradigmatic flexloning space (P), 
thus accomplishing the MM internal repre- 
sentation. From this Internal representation, the 
system generates for each defined sub-category 
a graphic tabular menu (we call it an Acquisition 
Scenario AS) partlally filled In. The only blank col- 
umn in an AS is called WORD-FORM column and 
is accessible for writing in by the trainer (tutor) of 
the system. Each line In an AS Is filled (except the 
last field corresponding to the WORD-FORM col- 
umn) by the Information uniquely Identifying e 
point in P. 
4. KNOWLEDGE ACQUISITION 
When the tutor chooses a defined sub-category 
of a category in C to be exemplified, the system 
answers by displaying the associated acquisition 
scenario. What the tutor Is asked to do is to fill In 
the blanks the WORD-FORM column with the In. 
flected forms of the thematic word. Each word 
form must obey the restrictions Imposed by the 
combination of the feature values displayed on the 
line which the tutor is writing In. 
Once the WORD-FORM column of the current 
AS completely filled In, the root detection phase is 
activated. The word-forms are ordered lexlco- 
graphically with the provision that the Initial asso- 
ciation: ((ci sci Mi Vi) W0 will be remembered. In 
the general case of a partially regular TF contain- 
Ing n word-forms the result of the root detection 
phase is represented as follows: 
LA = (((ei...ek)rootO...((em...en) rooU)). 
The n endings In the above Ilst Inherit the mor- 
phological features which are associated with the 
word forms which they were extracted from. That 
ts if the word Wi --- rootJ + ei was associated with 
pj = (cj scl Mj Vj) then ei would also be associated 
with pj. As a result, a possible new flexloning para- 
digm appears: Q = ((el pl)(e2 p2)...(en pn)). While 
pi are all distinct, this is not obligatory the case for 
the endings. The Q paradigm is looked for In a list 
of already known paradigms and if not found 
there, is marked for Interning. To interne a para- 
digm means to integrate it into an associative 
structure (a discrimination tree) appropriate for 
word-form morphological analysis (see further). 
With the generation of word forms, the above rep- 
resentation is very suitable (Tufts,t988). A para- 
digm is interned immediately after Its marking only 
if it was learnt from a regular TF. Otherwise, this 
process is delayed until the roots of the TF are pro- 
cessed. The discrimination tree internally repre- 
sents all the known endings and their 
morphological feature values. Its nodes are la- 
belled with letters appearing in different endings. 
A proper ending is represented by the concaten- 
ation of the letters labelling the nodes along a cer- 
tain path, starting from a terminal node towards 
the root of the tree. Due to the retrograde strate- 
gy used in our system, possible endings which are 
searched for In a word from right to the left, are 
checked in the discrimination tree from top to bot- 
tom. This explains the ordering of the label letters 
in the tree. A terminal node (T-node) Is not obliga- 
tory a leaf node because of the possibility of Inclu- 
sion of one ending Into a longer one (the reverse 
is always true). All T-nodes are associated with the 
paradigmatic information specific to the ending 
which they stand for. This hdormatlon is repre- 
sented by a list of pairs: ((Q1 pQ(Q2 p2)...(Qm pro)) 
where Qi are paradigm Identifiers and p~ are 
(identifiers for) points in the paradigmatic flexlo- 
ning space P. If a T-node (hence an ending) has 
associated more than a single pair (Q p) it Is called 
extrinsically ambiguous. Another type of ambi- 
guity Is induced by the endings containing shor- 
ter embedded endings. We call such endings 
intrinsically ambiguous. Let us suppose two en- 
dings < X > and < Y • so that < X • may be writ- 
ten as <Z><Y>. In case of a word-form 
< A • < Z • < Y • without additional information 
one cannot definitely decide if the word should be 
- 147- 
segmented as <A> <Z> + <Y>or as 
<A> +<Z><Y>. For both types of ambi- 
guities there are sound methods of resolution if 
the decision procedure has access to the root lex- 
icon or to some syntactic rules. 
Anyway, for Intrinsically ambiguous cases, our 
system has found out that for Romanlan, in almost 
all cases the segmentation with the longest ending 
is the correct one. For extrinsically ambiguous en- 
dings, the system uses some statistics, extracted 
from the training data, which proved to be valu- 
able. For Instance, the system updates for each 
paradigm, a so-called local counter with each new 
thematic family behaving according to that para- 
digm. The value of this counter, specific for each 
paradigm is considered in sorting the Interpreta- 
tions of an ending :((Q1 pl)...(Qr pr)). According 
to this sorting, an Interpretation (QI pl) is con- 
sldered more likely than another one (Qi pJ), If In 
the lexicon there are more roots "belonging" to Qi 
than to Qj. This preference heuristics does not 
take into account the frequency of the words in 
running texts but only their paradigmatic classifi- 
cation. We plan to introduce the "dynamic 
counters" which are supposed to provide qualita- 
tive estimation based on word-forms frequences. 
It is clear that In order to provide valuable pref- 
erences, the values of the static/dynamic counters 
must result from large sets of examples and run- 
ning texts. This requirement may be fulfilled by 
using in parallel many PARADIGM incarnations 
and finally by merging their outputs. It is Import- 
ant to note that the.preference ~eurlstics we talk 
about are intended only for getting a plausability 
ordering criterion for the possible interpretations 
of an ending or segmentations of an word-form. It 
means that no interpretation variant is rejected at 
this level, so that if a preferred (according to the 
preference heuristics) interpretation or segmenta- 
tion was wrong, the correct one may still be found. 
Roots processing and eventually paradigms 
modification or absorbtlon (see further) are based 
on some similarity criteria. If no similarity is de- 
tected between the roots of a TF, the correspond- 
Ing paradigm, if marked as new, is Interned as it 
was initially constructed. But if the roots are simi- 
lar, the system tries to reduce differences between 
them, either by modifying the inflexlonal paradigm 
or by inferring rules for root modification. The first 
approach is generally taken If the differences be- 
tween roots appear at their boundary with the en- 
dings. The second method Is usually tried In case 
of differences appearing inside the roots. The simi- 
larity criteria are declaratlvely specified, so that It 
Is easy to modify, augment or adapt them to spe- 
cific needs. The notion of similarity, as used in our 
approach, Is very simple. We have developed a 
similarity description language In which one may 
describe the conditlons under which two strings 
are to be considered similar. With the current ver- 
sion of the system, we use only three simple simi- 
larity rules: 
sl) <X>?<Y> == <X><Y> 
s2) <X>I = <X>? 
s3) <X>?<Y>I == <X>?<Y>? 
In the above rules the metasymbol < X > stands 
for an arbitrary string, the question mark for zero 
or one letter, the exclamation mark for exactly one 
letter and == for the similarity relation. Their read- 
ings are: 
rst) two strings are similar If they differ by at 
most one embedded letter (calculatoAr =, calcu- 
lator); 
rs2) two strings are similar ff they are the same 
• except the last letter of one or both of them (coplL 
=, cop,); 
rs3) two strings are similar if they differ by at 
most one embedded letter and by the last letter of 
one or both of them (fereAstrA == ferestrE). 
Actually, the similarity description language is 
more powerful than it is suggeste~. For instance, 
one may impose restrictions on an <X > con- 
structlon such as minimal or exact number of 
characters in X, prosodic restrictions such as 
presence or absence of accent, a.s.o. If two roots 
are similar, the system attempts to generalize their 
similarity beyond the particular TF currently pro- 
cessed. The simllarlty between two roots is 
necessary but not sufficient for making a generali- 
zation. What is needed, Is an explanation, In terms 
of morphological features, accounting for root 
modification. This explanation, if found, will be 
used as a precondition for the root modification 
rule to be synthesized. The explanation Justifies 
the difference between the two roots (of the same 
TF), and consists of dlscrlminant descriptions (in 
terms of morphological features) of the endings 
associated with them. In the current version of the 
system, it looks for the morphological features 
which have the same value for all the word-forms 
obtainable from the first root and another different 
value for all word-forms derived from the second 
root. For instance with the similar roots 'copil' and 
'copll' (child), the system d~covered that all the 
forms in singular are produced by the first root 
while the second generates all the plural forms. 
- 148 - 
Using this fact, the system built the following rule, 
entering only one root (cop,) In the lexicon: 
"If a root X behaves according to the paradigm 
Q39 and its last letter Is T then In all plural forms 
T must be replaced by the letter"T. 
The "generative" flavour of this rule should not 
be misleading: that is, one must not infer that it is 
good only for generation. The same rule applies 
to analysis: 
"If a root was discovered according to the para- 
digm Q39 and its last letter was T, the root may 
be recorded in the lexicon with its T replaced by 
the letter T". 
As more data sets are provided the rules may 
be generalized further in order to cover the new 
cases. 
We said before that the internalization of a 
marked paradigm was delayed until the roots of a 
partial TF were processed. As we shall see in the 
example below, the delay is justified by the possi- 
bility to alter the initial endings (hence the para- 
digm) In order to minimize the differences 
between the considered roots. A paradigm modi- 
fication may appear if the last letter from each of 
the roots taken into account is transferred in front 
of all their corresponding endings (recall the LA 
list in the beginning of this chapter). If the system 
finds no feature-based Justification for root modi- 
fication and ff the difference between the roots is 
given by their last letters, it decides to transfer 
these 'laulty" letters into the appropriate endings, 
thus "regularizing" the TF. As a side-effect the In- 
Itial paradigm is modified and in case the new one 
is already known the decision is considered sound 
and the older paradigm is forgotten. If the new 
paradigm is not known to the system then both 
paradigms (the initial and the modified ones) are 
kept until further evidence will allow the system to 
choose among them. If no such evidence is ob- 
tained in favour of one or another paradigm, it will 
be the task of the knowledge base designer to de- 
cide on the matter. 
Let us follow on an example the process of learn- 
ing a root modification rule. Consider that the 
trainer provided the thematic family for the the- 
matic word "fereastre" (window). The root detec- 
tion process will generate the following 
segmentations: 
fereastra + O (0 stands for the null ending) 
fereastra + 
ferestre + i 
ferestre + E~ 
ferestre + le 
ferestre + O 
ferestre + Ior 
ferestre + E~ 
There are identified two roots: 'fereastra' and 
'ferestre'. According to the rule s3) they are simi- 
lar, with < X > and < Y > bound to 'fere' and '$tr' 
respectively. The fault letters are associated with 
their appearance context: > e I a I s <, > r I a I and 
> r I e I. The notations are interpreted as follows: 
"> e" = = an 'e' preceded by some other letters; 
"lal" = = the 'a' fault letter; 
"s <" = = an's' followed by some other letters; 
">rial" = = the final 'a' preceded by 'r'; 
"> r le I" = = the final 'e' preceded by 'r'. 
The first decision made in order to minimize the 
differences between the two roots is to transfer the 
last character of them into endings, thus resulting 
the segmentatior,s: 
fereastr + a 
fereastr + a 
ferestr ÷ ei 
ferestr + • 
ferestr +ele 
ferestr + e 
ferestr + elor 
ferestr + e 
A second step towards difference ellimination Is 
to consider the deletion of the 'a' letter between 
< fare > and < sir >. But because this operation 
does not contribute to paradigm modification it 
must be generalized (if possible) as a rule for root 
modification. By Inspecting the morphological 
features of the word-forms, the system finds out 
that the root 'fereastr' is characterized by the fea- 
ture values: feminine, singular and nom-acc, while 
the root 'ferestr' Is characterized in all its appear- 
ances only by the 'feminine' feature. Because 'fe- 
minine' value Is common to all word-forms of the 
thematic family, it is considered irrelevant with re- 
spect to roof modification. Moreover, no word- 
- 149 - 
form derivable from the 'ferestr' root has attached 
the "singular + nom-acc" feature values combina- 
tion. Therefore, this is taken as a possible condi- 
tion for the faulty letter deletion and the 
synthesized rule is the following: 
RMRt){<X> = >elats<& PARA- 
DIGM='PO0007' & NUMBER='stngular' & 
CASE ='nom-acc'} = = > { -~ \[NUMBER = 'sin- 
gular' & CASE = 'nom-acc'\] = = > >es< ) 
The reading of this rule is: "If a root of a word- 
form which flexions according to the POD007 para- 
digm, in singular and nom-acc, contains the 
embedded string "eas", then for all combinations 
of morphological features not containing both sin- 
gular and nom-acc values, the 'eas' string is re- 
placed by 'es'". 
Let us notice that the rule is more specific than 
it should be, Imposing that all eligible words be- 
have according to the P00007 paradigm and re- 
quiring the letter's' to follow the dlphtong 'ea'. But 
the system cannot infer more from this single 
example. If provided with another example, let's 
say 'ceapa' (onion), with a similar behaviour the 
system synthesizes a rule very alike to RMRI: 
RMR2){<X> = >elalp<& PARA- 
DIGM='P00007' & NUMBER= 'singular' & 
CASE ='nom-acc'} = = > { ~ \[NUMBER = 'sin- 
gular' & CASE ='nom-acc'\] = = > >ep< } 
The only difference between RMR1 and RMR2 
is the condition that the diphtong 'ea' must be fol- 
lowed by 's' and 'p' respectively. By considering 
this condition a particular one, the system drops 
it and obtains a more general rule subsuming both 
previous ones: 
RMR3){<X> = >elal<& PARA- 
DIGM='P00007' & NUMBER='slnguler' & 
CASE ='nom-acc'} = = > {-~ \[NUMBER = 'sin- 
gular' & CASE = 'nom-acc'\] = = > >e< } 
The rule RMR3 is still too specific. The process- 
Ing of the thematic family for the word 'sears' (eve- 
ning) produces a further generalization of RMR3. 
Firstly, the system generates the following rule: 
RMR4){<X> = >elsl<& PARA- 
DIGM='P00008' & NUMBER= 'singular' & 
CASE = 'nom-acc'} = = > {-- \[NUMBER = 'sin- 
gular' & CASE = 'nom-acc'\] = = > > • < } 
The difference between RMR3 and RMR4 is 
made by the restriction that the flexioning para- 
digms are required to be P00007 Instead of 
P00008. To generalize these rules, the system in- 
vestigates the feature values of the two Involved 
paradigms. Their common properties are 
SC =COMMON-NOUN, GENDER = FEMININE, 
so the system is able to propose a new rule sub- 
suming the RMR3 and RMR4 rules: 
RMRS){<X> = >sial < &SUB-CA- 
TEGORY ='common-noun' & GENDER = 'fe- 
minine' & NUMBER = 'singular' & 
CASE ='nom-acc'} = = > {-, \[NUMBER = 'sin- 
gular' & CASE ='nom-acc'\] = = > >es< } 
Because generalization correctness over incom- 
plete data cannot be guaranteed, each syn- 
thesized rule has two associated lists, one of them 
containing positive examples (Initially only the 
prototype root which generated the rule) and the 
other one containing exceptions (initially empty). 
A similar point of view, that is attaching exception 
lists to general rules, may be found in (Bear,1988). 
The roots are entered into the root lexicon. For 
partial regular thematic families, the two or more 
roots are linked together bidirectionally. The first 
of them, in lexicographic order, is attached to the 
relevant common morpho-lexical information: 
paradigm name end the label for the semantic de- 
scription. This information is Inherited by all linked 
roots. There is also root specific morphological In- 
formation such as selectional restrictions and 
phonemic patterns. The selectional restrictions 
are contributed by the system and they refer to 
the constraints to be satisfied In order that a root 
be selected in a word-form generation. For the 
regular modifying roots, links to the rules they 
obey and the position(s) In the root where letter 
Insertion or deletion Is to be performed are also 
recorded In this field. 
The lexicon building side-effect of the tutorial 
sessions Is not the main Interest of the research 
reported here (for this purpose we developed the 
MORPHO lexicon management system 
(Tufts, 1987a)).This feature was Implemented only 
for testing the PARADIGM system in learning and 
using learnt knowledge. Also, we were Interested 
in experimenting some generetlon strategies at 
the level of morphology (for instance choosing the 
least ambiguous or the more common used root 
from a synonimy set - see (Tufis,1988)). it was 
possible, in this way, to test the functionality of 
PARADIGM without coupling it to MORPHO, oper- 
ation which would have required a greater pro- 
gramming effort. The embedding of PARADIGM 
into MORPHO is planned for the Immediate future. 
- 150 - 
At the end of the system's apprenticeship Is ac- 
tivated a processing phase which we call the para- 
dlgmatic absorbtlon. A paradigm Q1 may be 
absorbed Into another paradigm Q2 iff: 
abl) they describe the same subcategory, 
ab2) for each ending 'eli' from Q1 and the corre- 
sponding ending 'ea' from Q2 the following are 
true: 
'eli' is a suffix of 'e21': < e2i > : < x > < eti > and 
the < x > preffix in 'e2i' exists as a suffix in all the 
roots In the lexicon which, from the flexlonlng 
point of view, behave according to QI. 
The implementation of paradigms absorbtlon Is 
computatlonally motivated: firstly by decreasing 
the number of paradigms, the search space Is nar- 
rowed and consequently word-form processing 
time Improved; secondly, by lengthening the en- 
dings, they become more discriminating and 
therefore the ambiguity Is reduced. In Romanlan 
the case Is that the longer an ending, the less am- 
biguous Its Interpretation. For instance the 'i' en- 
ding has 19 possible Interpretations (in our 
model), while the ending 'ulul' has only one. We 
think that this is a general property with inflexlo- 
nal languages and therefore we consider paradig- 
matic absorbtion not to be specific for Romanlan. 
The paradigmatic absorbtion limits both types 
of ambiguity discussed eadler: Intrinsically (due 
to different possibilities of a word segmentation) 
and extrinsically (due to different Interpretations 
an ending may have). 
In order to obtain a complete morphological 
knowledge in a relatively short time, PARADIGM 
is accompanied by a merging utility program, 
(partially) able to unify two or more knowledge 
bases developed with different copies of the sys- 
tem. 
5. FINAL REMARKS 
One of our eadler goals, some years ago, was 
to establish, by manual procedures, a reasonable 
set of flexionlng paradigms for Romanlan, In order 
to implement a reliable morphological processor, 
general enough to cover the requirements of tech- 
nical texts. The task was taken by seven col- 
leagues with different backgrounds (linguists, 
logicians, engineers and mathematicians ) and 
lasted for almost half an year (Crlstea,1982). I 
used the examples from the then written material, 
in order to test the PARADIGM system. While dif- 
ferently organized, the equivalent (in linguistic 
coverage) knowledge base was obtained in a four- 
hour session. Moreover, the number of paradigms 
discovered by PARADIGM was smaller (97 para- 
digms versus 123). The rest were absorbed with- 
out any loss. By rul~ning test data on the manually 
discovered knowledge base and on the PARA- 
DIGM acquired knowledge base we noted up to 
10% improvement in analysis time. In hypothesis- 
Ing the lexical status and morphological features 
of the unknown words, based only on endings 
analysis, the PARADIGM generated knowledge 
base was also batter. 
A morphological knowledge base for Russian 
and another one for Spanish are under develop- 
ment. Experiments have also been made with 
French, Slovak and Hungarian. In the near future, 
we plan to develop the system in two Important di- 
rections: 
- learning compound word-forms rules (procUtic 
articulation of nouns and adjectives, verb com- 
pound tenses, degrees of comparison for adjec- 
tives); 
- learning lexical affixes (that is meaning mod- 
ifying preffixes and suffixes (Tufts,1988)). 
Related work is reported in (Wohtke, t986), 
(Trost,1986) but they are either concerned with 
English (not a very exciting language from the 
morphological point of view) or address gener- 
ation or analysis only. The very popular two-level 
morphology model of Koskennlemi (1983) in- 
tended primarily to derivatlonal morphology is, 
from our point of view, too expensive for a gram- 
matlcal oriented processing. 
Recent work reported in (Goertz, 1988), (Woth- 
ks', 1986), (Zock, 1988) share some points with our 
approach. 
We consider that the main contributions of our 
work stem from the following features: 
- freedom In defining the categorlal system for 
the model; 
- Independence of a specific natural language, 
provided it is within our "root + ending" approach; 
- applicability of the synthesized rules both In 
analysing and generating word forms; 
- possibility of rapid development of morphologi- 
cal knowledge bases, by merging the results of 
many parallel acquisition sessions; 
~__~ - 151 - 
- duality of system behaviour (apprentice - ex- 
pert) which allows Immediate check of the ac- 
quired knowledge; 
- low level of linguistic competence required to 
the trainers. 
ACKNOWLEDGEMENTS 
The main part of the PARADIGM system, initially 
implemented in TC-LISP (Tufls,1987b) for PDP1 l- 
like computers, was transferred on IBM-PC under 
GCLISP, during a reseach stay at the International 
Basic Laboratory on Artificial Intelligence of the In- 
stitute of Technical Cybernetics of the Slovak 
Academy of Sciences In Bratlslava. I would like to 
thank to all the people responsible for my research 
stay in the International Laboratory and especially 
to Dr.Josef Miklosko, Dr.Pavol Hrivlk, Dr.Karol 
Richter and Dr.Rudolf Fiby. I would also like to 
thank to Leos Tovarek who kindly helped me on 
the Slovak language testing. 
REFERENCES 
Alshawi H., Boguraev B., Briscoe T. 1985; -To- 
wards a Dictionary Support for Real Time Parsing. 
Proceedings of the 2-nd Conference of ECACL, 
Geneva,1985, 171-178. 
Bear J., 1988; - Morphology with Two-Level 
Rules and Negative Rule Features. Proceedings 
of COLING'88, Budapest, 1988, 28-31. 
Crlstea D., Curteanu N., Mihaescu P., 1982; -Re- 
search in Natural Language Communication with 
Computers, Final Report to A.I.Cuza University - 
ICI Contract,1982 
Gortz G., Paullus D.,1988; -A Finite State Ap- 
proach to German Verb Morphology. Proceed- 
Ings of COLING'89, Budapest, 1988. 
Jappinen H, Lehtola A., Nellmarkka E., Ylilam- 
ml M., 1983; - Knowledge Engineering Approach 
to Morphological Analysis. Proceedings of the 
First Conference of ECACL, Piss, 1983, 49-51. 
Koskennleml K., 1963; -Two Level Model for Mor- 
phological Analysis. Proceedings of IJCAI'83, 
Karlsruhe, 1983, 683-685. 
Tecuci G., 1988; - DISCIPLE-l: A Theory Meth- 
odology and System for Learning Experts Knowl- 
edke. PhD Thesis, University of Paris-Sud, 1988. 
Trost H., Buchberger E., 1986; - Towards the 
Automatic Acquisition of Lexical Data. Proceed- 
ings of COLING'86, Bonn, 1986, 387-389. 
Tufts D., Crlstea D., 1985;-lURES: A Human En- 
gineering Approach to Natural Language Ques- 
tion Answering. In Bibel W., Petkoff 
B.(eds):Artiflclal Intelligence: Methodology, Sys- 
tems, Applications. North Holland, Elsevier, 1985, 
177-184. 
Tufts D., Dumltrescu C., 1987; - MORPHO: A 
Lexicon Management System. Reference Manual, 
ITCI, 1987 (In Romanlan). 
Tufts D., Tecucl G.,Cristea D.I 1987; - LISP (vol 
2.):TC-LISP for Minis. AI Systems Implemented in 
TC-LISP (lURES, QUERNAL, DISCIPLE), Techni- 
cal Publishing House, Bucharest, 1987(in Roman- 
ian). 
Tufts D., 1988; - Analysis and Generation of 
Words, Based on Automatically Acquired Mor- 
phological Knowledge. Research Report, Interna- 
tional Basic Laboratory UTK, Bratislava,1986. 
Wehrll E., 1985; - Design and Implementation of 
a Lexical Data Base. Proceedings of the 2-nd Con- 
ference of ECACL Geneva, 1985, 146-153. 
Wothke K., 1986; - Machine Learning of Morpho- 
logical Rules by Generalization and Analogy. Pro- 
ceedings of COLING'86, Bonn, 1986, 283-289 
Zernlk U., 1988; - Language Acquisltlor~: Coping 
with Lexical Gaps. Proceedings of COLING'88, 
Budapest, 1988, 796-800. 
Zock M., Francopoulo G.,Laronl A., 1988; - Lan- 
guage Learning as Problem Solving. Proceedings 
of COLING'88, Budapest, 1988, 806-811. 
- 152- 
