Synergy of ~ and _.mor_phology in automatic parsing 
of French language with a minim .urn of data 
Feasibility study of the method 
Jacques Vergne Pascale Pages 
Inalco Paris 
We intend to present in this paper a parsing method of French 
language whose particularities are: a multi-level approach: syntax and 
morphology working simultaneously, the use of string pattern matching 
and the absence of dlcUonary. We want here to evaluate the 
feasibility of the method rather than to present an operationnal system. 
I General ebiectives; 
We intend to demonstrate that it is possible to parse texts with very few 
data: onlv determil2~.~j3_ts, indefinite and numeral adiectives, con!unction&o, nd 
r)reDositions, that in such an analysis, the consecutive application of 
morphology and syntax is insufficient, but that the use of .~YJ3.t¢~ and 
m_gr~gy simultaneously is very efficient, end at last that the notion of a 
grammatical or lexical category of the word is not attached to the word, but 
deoends on the local situation of the word in the senten~..~. 
/I Commen_~_U.p~n objectives_: 
A. Comparison with the classical parsing strategies: 
In nearly every contemporary automatic parsing system, we have a 
chronology of several steps, one step by linguistic level: the 
moroholoeical steQ, then the .&yntactical ~;tep, and then eventually the semantic 
£zt_O..E. The morphological step uses an exhaustive dictionary of forms or 
sometimes of laminas. 
But we know that the human understanding uses simultaneous information 
coming from all levels. Some parsers now begin to use two levels 
simultaneously: for example the system ASCOF (SFB t00 Sarrebrt3ck, see 
reference) uses semantic information coming from semantic nets in the 
syntactical analysis and semantic constraints in the syntactical rules. 
We propose to use morphology and syntax simultaneously in a 
parser with no dictionary. 
B. Assumptions and theorlcal aspect: 
By "parsing" we mean to Lg..q.0_gj3j¢~ adjectives and nouns with their 
gender and number, verbs in the infinitive or in the present participle and the 
adverbs derived from adjectives, to Erg~ the lexicon of the text and 
to ~Lel.ez£\[\]/~ syntactical relations between words. 
The \[oossible al~Dfications of such a parser are: automatic indaxation, and 
also as a first step in every system which uses a parser: inquiry system, 
automatic translation, filling knowledge bases from texts, etc .... particularly 
when the parser must work in open semantics. 
The main practical interests are to avoid consultino an exhaustive 
In the first step of the parsing and to Process the neolooisms 
exactly the same wav as the other words. Dictionaries are expensive to 
and in fact are never completely updated; consulting a dictionary 
produces many artefact-ambiguities, locally to the word, which are 
cancelled as soon as the immediate context is examined; these ambiguities 
produce a combination-explosion and much redundant processing. 
There are also some theorical aspects. If the Jg.EJ_q.~ category can be 
deduced with no dictionary, is this category really lexicel ? We could rather 
name it a contextg..~ or ~ category or even a ~\[D_qJLOJ3. : a word 
gets its category from the flow of the text and the dictionary gives the 
categories. The ~ category then is only the reoularitv of the function. 
Further, any word can potentially h~we any category (perhaps more in English 
than in French). Claude Hag6ge: "co sent des ~.O~.tJg.~, non des parties clu 
discours, qu'il convient d'abord de poser" (L'homme de paroles 1985 page 
137). A more general theorical aspect lies in the use of computer as an 
experimental tool which allows now to consider linguistics as an 
experimental science and to use the experimental method to test linguistic 
theories. 
III The mean_~: 
A, General method: 
The main method is pattern matching. The principle is the following: the 
to recognize is compared to ~ set of pattorr,c, until a match is found. 
But what exactly is a .~ in a natural language? 
The classical terms are "mg.E~&\[9.gJL" for the word, and "&VJ3.t6~" for the 
sentence. We could say that morphology is the shape of the word and syntax 
the shape of the sentence; and more, we propose here to fill conceptually the 
gap between the level morphology-word and the level syntax-sentence: 
4- ~L.QLa2\[ = shape of the text. 
We can also remember that our habit of the ~Lrj_U,9..D. word properly 
delimited by spaces er punctuation rnakes us forget that the ~ string is a 
continuum that we cut while understanding: we use simultaneously 
morphological, syntactical, semantic and pragmatical information with which 
we make deductions, inferences, deadlocks and use intuition (about the word 
see Tesni6re page 27, § 11 to 15). 
It is possible to classify the pattern matching methods in two categories: 
the statistic and the structural methods (see Miclet and Fu). More 
precisely, we use here a string pattern matching method: every word of 
the sentence is replaced by its category, coded by a character, and question 
marks for unknown words: 
r#lectricit6 c6r#brale --> d?? (d=determiner) <the cerebral electricity> 
maladies mentales et l#sions c#r#brales --> ??c?? (c=coordination) 
<mental illnesses and cerebral lesions> 
Let us call this string the pattern by word that will be used in the 
grammar and in the parsing, 
The information used in the parsing is composed of three types of data: a 
~,LELe_~J.£~ of the words in finite number (about ~_0JLoJLn~), morphological 
deduction rules for each word, a set of eatterns of the noun phrase for pattern 
matching, and of course thetexHo ana!vse itself. 
The 1irst steo of this st.u._d.£ is :lhe noun or ereoositional #hre~g. The 
following steps are the recognition of these phrases in the sentence, and the 
whole parsing of the sentence. 
This work is implemented on Apple Macintosh, and the programming 
language is Pascal UCSD which is suitable to develop such a parser whose 
algorithms are rarely recursive. 
B. The small lexicon: 
It contains about 80 forms (not lemmas): determiners (articles, 
possessive, demontrative and indefinite adjectives), prepositions, 
coordinations, some punctuation signs (considered as words). These words are 
the .a_EqbPE~g..D._qJ.O~ for pattern matching. 
269 
But we realize that it is impossible to keep the general position to have 
only the words in finite number in this lexicon, and that the problem becomes 
pragmatic: what is the minimum of data necessary to recognize correctly the 
other words ? We have added the first numeral adjectives, indefinite 
adjectives, some very current adjectives often placed before the noun (petit, 
autre, m~me ), adverb not derived from adjective (bien, real, tr~s ). 
Every form has its possible categories, eventually gender and number; the 
list of the possible categories can be open: a form can have another category 
in a particular sentence: le blen et le real, le la de ma clarinette . 
For example, et can be: 
- conjunction which coordinates adjectives (b): 
valour Iocalisatrice etpronostique --> ??b? <localizing and prognosal value> 
- conjunction which coordinates noun phrases (c): 
maladies mentales et lSsions c~rSbralas --> ??o?? 
<mental illnesses and cerebral lesions> 
- conjunction which coordinates nouns (e): 
crSation et renouvellement lexicaux -->?a?? <lexical creation and renewing> 
- conjunction which coordinates prepositional phrases (C): 
I'influence de I'inductance at de la capacit6 --> d?pd?Opd? 
<the influence of inductance and capacity> 
We have distinguished two categories of preposition according to the 
"attraction" between the two np: high altraction: le syst~me d'unit~s <the 
unit system> (sort of compound nouns), or low attraction: un chat sur un toit 
<a cat on a roof> (facultative "circumstant"), whence two kinds of 
prepositional phrases: \[E\[~ to the np : dl de en (o) or #.~\[fim~ to the 
np: ,~ de en sur clans chez vers (p), 
So de can be: 
- internal preposition (o) in: 
une th#orie de la morphog#n#se --> d?od? <a morphogenasis theory> 
le syst#me d' unit#s internationnal -->d?o?? <the international unit system> 
- external preposition (p) in: 
de ranimal & rhomme --> pd?pd? <fromanimaltohuman> 
- preposition (q) in: 
Ins diff#rents moyens de faire lea mesures --> d??qid? 
<the different ways to make measures> 
C. The morphological approach: 
Our attitude is to explore all possibilities to extract, to deduce 
Information from the mere morphology of the word, without 
dictionaries, information which can be used ~t ~,NV time of the s~ 
For example, lot us observe the words ending with -Ire) : -icltd -ivit6 
-abi/It6 -ibillt6 -ubi/It(~ -arlt6 -a/lt~ ; we have a regular 
alternation adjective/noun: ~lectr" ~lg.~ / 6/ectrj£jJ~, combatjl / combatjyj.~, 
portb#L~_ /portb#~_t~, particu/j.~ /particu/arit~ ; from these endings, we 
can deduce that the word means a quality (semantic aspect) and is a singular 
feminine noun (category). 
On the semantic opposite, endings as -ification, -isation suggest an 
action, for example: class'Ej_q_fitj_~ comes from the noun class(e) + suffixe 
-ification , national~ comes from the adjective national + suffixe 
-isation , climatisflrLOj2 <air-conditioning> comes from the noun climat + 
-isation ; these words have been derived on the same way, with the same 
semantic aspect: the suffixe -is- + -er (verbal ending) = .tsar or -is- + 
-ation (noun ending) = -isation has the property to make a verb or a noun 
which expresses an action, from adjectives (national) or nouns (cfimat); 
words ending with -ification or -isation are always feminine nouns. 
In some cases, at first sight, the morphology does not give reliable 
information: a word ending by -ement can be an adverb (derived from 
adjective) or a masculine noun: for example lachement <slackly> is adverb 
and rel&chement <slackening> masculine noun, but a more precise study 
brings the following information: -Scent ==> adverb except 3 roots: 
agr#ment complSment increment and except the word #lSnrent ; -Oment 
==> adverb: assidOment; -ublement -iblement -ivement ==> adverb derived 
from an adjective:indissolublement visiblement h#tivement ; .oment .rment 
-gment ==> noun: moment sarment fragment ; -issement -ionnement ==> 
noun derived from a verb: vagissement positionnement, 
At last, as far as neological production uses these elements and these rules 
to create new words, ~Q.Q.LQg\[.~,ZLE# analysed exactly as the other werd~ (see 
Guilbert and Kokourek). 
These morphological properties of each word are the second kind of 
arlchorln~ nolnt~ for pattern matching, 
D. The grammar: 
1. the grammar of the complex noun or prepositional phrase: 
The phase is considered as a three level hierarchical structure (finite 
number of levels): the grammar is not recursive (on that point joining 
Tesnlbre and leaving Chomsky): phrase = complex noun or prepositional 
phrase which is composed of simple noun phrases which are composed of 
words or ".&ggJ_uJ\[DE\[#~" words. 
A oomolex noun D.br.~,&e (cnp) is: 
* either a simple noun phrase alone (G=snp) 
- or a train of simple noun phrases separated by: an expernal 
preposition (p=de, clans, pour), or a conjunction co-ordinating sap (c=et , 
ou ), or a conjunction co-ordinating prepositional phrase (C=et , ou) and 
followed by a preposition (Cp=et de , ou avac), or a preposition preceding an 
infinitive (q=de, ~, pour ) or a present participle (r=-ant). 
These snp have between them relations of subordination or co-ordination. 
2. the grammar of the simple noun phrase: 
A ~ is a train of words obeying two types of constraints: 
- ~Ey.QL&EEg..~, when the phrase agrees with a dependency tree 
- EI.OE.QtLQJQgJ_0~, which is usually named gender.number agreement 
A .~#,ttern of a snD is a horizontal Dro!ection of a sub-etemma of_#, 
canonical etemma. Let us remember that a stomma (word introduced by 
Tesni~re) is a dependency tree. The canonical stemma represents an 
abstract of ~~. 
a sub-stemma is: - either the unchanged canonical stemma, 
- or the canonical stemma without a leaf, 
- or a sub-stemma without a leaf. 
A stemma is a ~.9_~\[Lri\]gj\].~ram: the vertical dimension of the 
hierachical levels and the horizontal dimension of the written words; a stemma 
can be horizontally projected to obtain the one dimensional train of words 
as they are written. Here is an example of a possible canonical stemma of the 
snp, and its projection: 
I I 
dot adv ady adj noun adv adv adj adv adv adj 
dependency relations: adj ........... > noun (depends on) 
projection relations: dot ..... >det (is projected as) 
There is a snp pattern for every sub-stemma of the canonical stemma. The 
three canonical sternmas now used are equivalent to about 2000 
rewriting rules. 
The "aggluUnatlon" rules are applied JZIE\[.~L~312 from right to left 
and are the bottora-up aspect of the algorithm: 
- every adjective can be an agglutinated adjective (A) as the result of 
the co-ordination of several adjectives: ?b?->A Bb?->A ==> ?=adjectives 
- in the forms: noun de noun or: noun ,t noun, we have JAL~.ES\[ 
~ositional_ phr~e..~, working like ~ (B), which are included in the 
snp, and processed as.~ in the parsing of the snp: o?->B od?->B 
==> ?=noun. 
We recognize here Tesni~re's "translation" concept: the "translation" e{ 
Jb_e..Eg3Lg_LEt~ (see Tasni0re pages 443 and seq.). The prepositional 
phrases which can be considered as adjectives are not only preceded by: de & 
en, but potentially by every preposition, for example in: 
t\]g.~ par domalna ou lexicographique 
usual ou solon la thSorle 
\[Ag_b.EgJ2#. heuristique clans los graphes appliqu~e ~ la reconnaissance 
We can remark here that t~co-ordinated obie(:ts, have fundamentally 
Be same functlorl that is fDJ.tQ.Lt._b_g.t_A~~ 
We shall now deduce linguistic information from the form of the snp. For 
example, if we analyse the form: un \[unknown.word\] (d?), we can deduce that 
this unknown word is a masculine singular noun for two reasons: it matches the 
pattern: determiner - noun, and the whole snp inherits its gender and number 
from the determiner. Here is an ambiguous case: \[unknown word 1\] \[unknown 
word 2\] (??); we have here three solutions: either noun - adjective , or 
adjective - noun, or noun - apposed noun.. It is often possible to decide by a 
morphological study of each word: 
I'~lectrlclt6 c~r~Jbrale --> d?? and Iclt6 ==> singular feminine noun 
==> noun - ad!ective 
une nouvelle conception --> d?? and tlon ==> singular feminine noun 
==> .~jg.DJL~ 
lee andes alpha --> d?? and no number agreement 
=> noun - a~xx)sed noun 
In the texts now processed (content tables of scientific books), we have 
noun - adiectiy.#, in about 97 % eases, probably because French is a centrifugal 
270 
language: the governor first, then the dependants (see Wagner et Pinchon page 
155, Tesnibre pages 33 and 147). For example, la linguistique infornratique 
and rinformatique linguistique which are morphologically ambiguous, are both 
understood by a native speaker as a form: no~_0_z_&d~. 
If we choose to obtain only one analysis to get one deduction, we must have 
an order of trial, barring syntactical or morphological impossibility: this order 
is now: noun - adiective, then adjective - noun, then noun - apposed noun, 
with three stemmas tried in this order. 
3. some p~rsing difficulties: 
Wrong deductions upon the category come in the cases we have several 
possible analysises aod whorl morphology does not implies the category: 
o what does et coordinate ? 
wdeur \[(Iocalisatrice) et (pronostique)\] noun \[(adjective) el (adjective)\] 
\[(valour Iocalisatrice) et (pronostique)\] \[(snp) et (noun)\] 
two possible analysises according to whether of co-ordinates two snp or two 
adjectives; then orenostique can be analysed as noun or adjective; 
- if the pattern is ?? without any possible morphological deduction, the form 
nourt - adjective will be choased, and that rnay be wrong in some rare cases. 
But at the end of the parsing of the text, the lexicon is extracted and it is 
possible to consull it to reparse the ambiguous phrases. 
A. How syntax and morphology work together: 
In such a pars~rr, the parsed language implies a parsing strategy: in French, 
syntax gives more information than morphology; for example, in English 
rnarphology is peer and syntax becomes more important, in German, 
morphology is richer because of declensions and file three genders. So, in 
French, etbA~j2~.. \[O.g~iUi~ e.~l b~'z~Z ~t aj£a__n ~s o ~L0 i~\[i~h t ~by~tp h el ogy: 
. at the beginning, we look if it is possible to deduce its category, gender 
and number, and the deduction is marked sure or not sure, for example: 
qcit6 ==> feminin singular noun, sure (61ectricitd) 
-ement ==> masc. singular noun (enregistremont) or adverb (pumment) 
-ant ==> present participle, not sure (concerrrant or passant) 
• in the study of each snp, every category and some geoders and numbers 
are known and the gender-number agreement is verified, for example: 
-at and adjective (deduced by syntax) ==> masculine singular (principal) 
-ives and adjective ==> feminine plural (qualitatives) 
If a snp does not agree in goader and number, the analysis fails and the 
next stomma is tested. 
B. General ease: 
First, some n~placoments are made in the phrase submitted to the analysis, 
for example: space inserted after the apostrophes to isolate /' or d' as 
one word, autour de --> autour.de (one word), du o-> de le , des --> de los, 
au --> ",) le , aux -.> # los . 
Then for each word, the lexicon is consulted, and if not found, the first 
morphological study is made (see above), whence the set of the possible 
categories of each word; this set is classed in the order of trial. 
Then the ~h.£L6Lh~Lg_J~_rl:Ls~ t)~ is made from tire 
combinations of the possible categories of each word, and from contextual 
constraints of each letter of the pattern; these constraints are as severe as 
possible to reduce the number of combinations as much and as soon as possible: 
for example, for the phrase: 6volution de 1'61octro..enc6phalodramn?e d'un 
malade attaint de paralysie g6n6rale salon los effete du traitement, the 
n umber of possible patterns is reduced from 1250 to 8. 
\]hen, each pattern is tested until the first successful analysis, except if 
there are possible adverbs, infinitives or present participle. In that case, a 
measure of the quality of the analysis is made to get only tile best analysis. 
The test of ol~a pattern is made in the following way: the pattern by 
snp is calculated: I'electricit& c6r6brale --> d?? --> G (G=snp); maladies 
mentales et 16starts du cerveau --> ??c?od? --> GcG (o=prepesition internal 
to snp); /'activation par fermeture des yeux --> d?p?od? .-> GpG (the 
activation by closing eyes) (p=preposition external to snp). 
We verify that this pattern by snp carl constitute a cnp (top-down aspect). 
The patterns by snp may be for example: G (snp) GcG (co-ordinated snp) GpG 
(sub.ordinated snp) GrG (two snp separed by a present participle). 
We try to apply tile .&gg\[g.t~3.tLg_qJM..L~. (bottom-up aspect: see above). 
\]'hen we study each snp: - we test if it is possible to find a match with one 
of the three sternmas tested in the order: ~r~.g.~._tL~, then d.~ 
.O.#J, tB, then ##g.B._-....~, whence a deduced or confirmed category 
(noun or adjective) for every question mark; - we test if we have a 
gender-number agreement between the governing noun and its eventual 
depending determinant and adjectives; this is done by a set intersection 
algorithm and by getting gender and number of the determinants from the 
lexicon, and by a morphological study (see above) of adjectives and nouns 
whose category h\[~s just been deduced. 
At any moment, if a constraint is not satisfied, the test of this pattern is 
stopped and the next one is tested. 
A bracketed phrase gives the history of the analysis. 
Co A parsing example: 
valour Iocallsatrlc~ et pr'onostlquo 
process by ward: 
? valour 
? Iocalisatrice 
bcC et co-ordinates adjective (b), snp (c) or prep. phrases (C) 
? pronostique 
C is impossible because et is not followed by a preposition 
possible pattsrns: ??b? ??c? 
test of ??b? 
calculation of the pattern by snp: ??b? -> G possible cornplex phrase 
agglutination: appScabie rule: ?b? --> A ( co=ordinated adjectives) 
tocalisatrice : ?= adjective 
pronostique :'?= adjective 
bracketed structure: valour +A(Ioealisatrice +at +pronostique ) 
new pattern: ?A and end of agglutination 
study of the single snp: 
syntactical constraint: ?A matches with the stamina 1 (noun-adjective) 
valour : ?==noun 
morphological constraints: 
valour : singular ( by morphological study ) 
Iocalisatrice : feminine singular (-trice by morphological study ) 
pronestique : singular ( by morphological study ) 
gender-number agreement: ferninine singular 
this snp is correct 
and of course the cnp is correct: 
wdeur >neun/f/s can be adjective elsewhere 
Iocalisatnce >odj/f/s carl be noun elsewhere, qualifies valour 
pronostiquc >adj/f/s can be noun elsewhere, qualifies valeur 
bracketed structure: G( valour +A(Iocalisatrice +at +pronostique ) ) 
if we ask for all possible analysises, we get also: 
G( wdeur +loealisatrice ).~ et + G( pronostique ) 
V Conclusion: 
In the texts now processed, tables of contents arrd diagrams in scientific 
books and articles (about 10 000 words), the .Lg_COgg_~£..p~f £;~e£Lor~es As 
c.eqrzeccUo.99 °Z~, and tile ~n.o~f \]t3Aj~gx_i is £o~eqfly_ex4r-q~£ted., but the 
deduction of the hierarchy of the snp and of relations between snp cannot be 
realisod only by using syntactical o1: morphological data because bAQI~(_" and 
prag~at~ information is lacking. 
3he original assumptions are w\]rifiod: 
- it is passible to deduce categories of words by erring pattern 
nratchhrg, with rio dictionary and :wJt~3AL~..v_.~\[6, by simultaneous 
Use of ~¢J.tLc~L and \[t\]_~_p~\[)lo_~LL~.L Inforlnatlon. 
• . the concept of category is really a functional concept. 
References

Blower, F6aeyrol, Rltzke, Stogentrltt ASCOF - a modular multilevel 
syste/n far French-German translation 1985 Computational Linguistics, 
special issue Slocum (ed.) 

K.S. Fu Syntactic pattern recognition and applications 1982 Prentice-Hall 

Louis Gullbert La cr6afivit6 lexicale 1975 Larousse Paris 

Claude Hag~go La atructure des langues 1982 Qua sais*je? PIJF Paris 
L 'homme de paroles 1985 Fayard Paris 

Rostlslav Kokourek La ;angue h'angaise de la technique et de la science 
1982 BrandstetterVerlag Wiesbaden 

Laurent Mlclet M6thodes structurelles pour la reconnaissance des formes 
1984 Eyrolles P~ris 

Pascale Pagbs Analyse motphologique automatique du frangais, extraction 
des verbes et raise en valour merpho-s6mantique de la d6rivatioa 1984 
th6se de doctorat de 3 ibm° cycle en traitement automatique des langues 
Inalco Universit~ de Paris III 

Patrlce Pognan Analyse automatique du t~:h6que 1979 Universit6 de Paris III 

Lucian Tesnl~re EI6ments de syntaxe structurale 1982 Klincksieck Paris 

Jacques Vergne Symbiose de la syntaxe ot de la morphologie dans I'analyse 
automatique du frangais avec un minimum de donn#es 1986 TA Informations 
Paris 

R.L. Wagner et J. Plnohon Grammairo du frangais classique et moderne 
1962 Hachette Paris 
