First Results of a French Linguistic
Development Environment
L. Bouchard (GIREIL), L. Emirkanian (GIREIL), D. Estival (ISSCO),
C. Fay-Varnier (CRIN), C. Fouqueré (LIPN), G. Prigent (CNET-Lannion),
P. Zweigenbaum (INSERM-U194)
1 Introduction: EGL 
The EGL (Environnement de Génie Linguistique) project started in 1989,
with the proposal to create a linguistic software development
environment containing a computational treatment of French grammar.¹
Its three main objectives were to allow research groups working in NLP:
• to develop and test both general French grammars and specific
  linguistic analyses for that language,
• to test new parsers and to compare several parsers in a uniform
  setting, and
• to have at their disposal an analyzer/generator for French, easy to
  maintain and to port to other domains.
¹ The EGL project involves 6 different partners:
• GIREIL: Université du Québec à Montréal, Département Math-Info,
  Montréal, Québec, Case Postale 8888 - Succursale A - H3C 3P8, CANADA.
  <lhb@mips1.info.uqam.ca>, <le@mips1.uqam.ca>
• ISSCO: Université de Genève, 54 rte des Acacias, CH-1227 Genève.
  <estival@divsun.unige.ch>
• CRIN: Campus Scientifique, BP 239, F-54506 Vandoeuvre-lès-Nancy Cedex.
  <Christine.Fay@loria.fr>
• LIPN: Université Paris-Nord, F-93430 Villetaneuse.
  <cf@lipn.univ-paris13.fr>
• CNET-Lannion: Route de Trégastel, BP 40, F-22301 Lannion Cedex.
  <prigentg@lannion.cnet.fr>
• INSERM-U194: 91 Bd de l'Hôpital, F-75634 Paris Cedex 13.
  <zweig@frsim51.bitnet>
It was supported by the Association pour la Coopération Culturelle et
Technique and by the French Programme PRC Communication Homme-Machine.
Development of the GPSG grammar of French was also supported by grants
from the SSRC of Canada (grant #410-89-1469) and the FCAR of Quebec
(grants #89-EQ-4213 and #92-ER-1198).
Independently of a particular application, the environment must be
usable both as a component in a system making use of an existing
syntactic database, and as a development environment for new syntactic
treatments of the language. The first phase of the EGL project was
partly based on a critical evaluation of existing work (in particular
GDE [1]), and defined a general architecture with the following modules:
• parser,
• basic grammar,
• test-suite database,
• lexicon,
• development management tools,
• graphical utilities.
The initial grammatical formalism chosen was that of unification-based
grammar, and three main linguistic frameworks are taken into account in
EGL: GPSG [11], LFG [16] and FUG [17]. The parser is based on the
general principle of a chart; different analyzers for the different
formalisms can be integrated into the system by making reference to
that model and by including specific methods for the types of objects
they manipulate. The basic analyzer is a revised version of the GDE
parser [8]; two LFG parsers are being integrated, and a FUG parser is
planned.
The French test-suite and the grammar are both already fairly well
developed. The basic grammar provided with the environment is the
keystone of the whole system. It allows using the environment directly
and without further work, and also serves as a testbench for the
computational solutions to linguistic problems.
The test-suite serves as a guideline for the coverage of
(system-provided or user-defined) grammars, to test whether they accept
an independently established corpus of written sentences which
exemplify the main linguistic problems and phenomena of the language.
Actes de COLING-92, Nantes, 23-28 août 1992 / Proc. of COLING-92, Nantes, Aug. 23-28, 1992
While defining a French lexicon was not one of the main objectives of
the project, having a lexicon is an unavoidable requirement for testing
grammars and analyzers, and the treatment of lexical information became
an important component of the work. The need to access a single lexicon
required a study of the normalization of lexical information, which led
to interesting questions about the reusability of syntactic features.
Defining development management tools turned out to pose challenging
theoretical problems. The History component keeps track of grammar
development and modification, and is complementary to the Coherence
component, which validates a state of the grammar. The Generation
component allows the linguist to test limit cases in the grammar, both
from the point of view of analysis complexity and in order to check
overgeneration.
We start our description with the module making the system usable as a
development tool for linguistic software, i.e. the set of graphical
utilities for the visual representation of the grammar, the analysis
process and the results.
2 User environment 
EGL lets the user parameterize execution and control commands, explore
their results, and visualize and edit lexical and syntactic knowledge.
In contrast with earlier approaches such as [4], we think that user
interface standards are now sufficiently mature to allow reasonably
portable software to be developed, and most of these functions are part
of a graphical user interface running under X-window Motif. The EGL
graphical user interface is best illustrated with the parsing tools,
which are directed towards both the grammar developer and the parser
developer. The user can select a sentence, control parser execution,
and explore the results. During parsing, the user can display the chart
and watch it evolve dynamically. The agenda of awaiting chart tasks can
also be displayed and manipulated. This allows the parser developer to
experiment manually with chart parsing strategies before integrating
them into the parser.
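To make the role of the agenda concrete, the following Python sketch shows a small agenda-driven chart recognizer for a plain context-free grammar. It is an illustration only and not the EGL parser: the function name chart_parse, the toy grammar and lexicon, and the depth_first switch are all assumptions introduced here. The only point is that the order in which pending tasks leave the agenda is the parsing strategy, which is what a developer manipulating the agenda by hand would be varying.

```python
from collections import deque

# Toy grammar and lexicon, invented for this sketch (not the EGL grammar).
RULES = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP"))]
LEXICON = {"la": ["Det"], "linguiste": ["N"], "grammaire": ["N"], "teste": ["V"]}

def chart_parse(words, rules, lexicon, goal="S", depth_first=False):
    """Return True if `words` can be analysed as `goal`.

    An edge is (start, end, category, needed), where `needed` is the tuple
    of categories still to be found; an empty tuple marks a complete edge.
    Right-hand sides are assumed to be non-empty.
    """
    chart = set()
    # The agenda initially holds one complete edge per lexical category.
    agenda = deque((i, i + 1, cat, ())
                   for i, word in enumerate(words)
                   for cat in lexicon.get(word, ()))
    while agenda:
        # The strategy is just the order in which tasks leave the agenda.
        edge = agenda.pop() if depth_first else agenda.popleft()
        if edge in chart:
            continue
        chart.add(edge)
        start, end, cat, needed = edge
        if not needed:
            # Bottom-up proposal: start every rule whose first daughter is cat.
            for parent, rhs in rules:
                if rhs[0] == cat:
                    agenda.append((start, end, parent, tuple(rhs[1:])))
            # Extend active edges that end here and are waiting for cat.
            for s, e, c, need in list(chart):
                if need and e == start and need[0] == cat:
                    agenda.append((s, end, c, need[1:]))
        else:
            # Extend this active edge with complete edges already in the chart.
            for s, e, c, need in list(chart):
                if not need and s == end and c == needed[0]:
                    agenda.append((start, e, cat, needed[1:]))
    return (0, len(words), goal, ()) in chart
```

For instance, chart_parse("la linguiste teste la grammaire".split(), RULES, LEXICON) succeeds under both agenda orders, while "teste la linguiste" is rejected; only the order in which edges enter the chart differs between the two strategies.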
After parsing, the grammar developer can display the relevant
structures (derivation trees, feature structures, rules used, etc.) and
navigate through them. The whole user interface behaves as a structure
inspector, or hypertext-style browser, with displays and links tailored
to the linguistic needs and habits of the user.
3 Development Management Tools 
Besides the test suite elaborated for the project, three validation
tools contribute to grammar development: the History, Coherence and
Generation components. As the test suite and the History components are
described in detail elsewhere [5], we will spend more time on the
Coherence and Generation components. They are both based upon a
formalism which is common to GPSG, LFG and FUG, and thus able to
include all the data and constraints of those three frameworks. In this
way, EGL goes beyond previous projects such as [8, 7] and provides a
common tool for various frameworks.
A grammar consists of four sets (category, (ID-)rule, LP-rule and
metarule).² Each set includes both data and principles. A principle is
a constraint that must apply everywhere and which defines the
admissible data.
A category (I, F, A) is represented as:³
• A categorial identifier I, which is a symbol identifying the
  category.
• A formula F, which defines constraints applicable to the category.
  These are deduced from the rule that generated the category, or from
  principles. The allowed predicates are: standard ⊃, constrained ⊃c,
  default ⊃d deduction; standard =, constrained =c, default =d
  unification; negation ¬, conjunction ∧, and disjunction ∨.
• An attribute-value structure A. A value may be atomic or complex
  (itself an attribute-value structure). It can be declared explicitly
  (with constants) or implicitly (referring to another value in the
  structure, thus allowing data sharing).
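As an illustration of the attribute-value structure A, the Python sketch below unifies two structures represented as nested dictionaries. It is a simplified reading and not the EGL formalism: it covers only atomic and complex values under standard unification, and it leaves out data sharing as well as the constrained and default variants. The function name unify and the feature names in the usage note are assumptions made for the example.

```python
def unify(a, b):
    """Unify two attribute-value structures.

    Complex values are dicts, atomic values are any other Python value.
    Returns the merged structure, or None when two atomic values clash.
    The inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for attribute, value in b.items():
            if attribute in result:
                merged = unify(result[attribute], value)
                if merged is None:
                    return None          # clash somewhere below this attribute
                result[attribute] = merged
            else:
                result[attribute] = value  # attribute only present in b
        return result
    # Atomic values (or an atom against a complex value) must be identical.
    return a if a == b else None
```

Under these assumptions, unify({"AGR": {"GEN": "fem"}}, {"AGR": {"NUM": "plu"}, "CAT": "N"}) merges the two descriptions, whereas two structures carrying different GEN values fail to unify. An omitted attribute behaves like a variable, since it is compatible with any value.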
Local trees stem from rewrite rules,⁴ constrained by LP-rules and
principles.⁵ Precedence constraints can be stated inside a rule, in its
right-hand side, as well as in a principle, via precedence rules. This
expressive power (allowing "formalism mixing") facilitates grammar
development. Two examples:
LFG (rewrite rule):
(P,,) → {1,(NP,$0.SUJ = $1,)} ∧ {2,(VP,$0 = $2,[TRANS -])}
GPSG (default constraint):
(,,[V +, N -]) ⊃d (,,[VFORM V, PASS -])

² An (ID-)rule is a regular expression constructed from an Immediate
Dominance rule with Linear Precedence constraints.
³ Each element of a structure or a category can be omitted; in that
case, it is considered a variable.
⁴ These are themselves defined with metarules.
⁵ This is the way to express the Foot Feature Principle, Head Feature
Convention, etc. of GPSG.
The main problem in the Coherence component is that of satisfiability:
is there any valid parse with the user's grammar? Besides
satisfiability, some questions are of great interest from a linguistic
point of view, e.g. sufficiency and necessity of all the data. A
grammar must be structurally coherent, and we say that a grammar is
coherent iff it satisfies:
• non-cyclicity: there is no cyclic point.
• non-redundancy: A is redundant w.r.t. B in a grammar S iff S-A has
  the same strong generative capacity as S-B.
• non-superfluity: A is superfluous in S iff S and S-A have the same
  strong generative capacity.
• accessibility-coaccessibility: data is accessible (resp.
  coaccessible) iff used at least once in generation (resp. a parse).
We have shown [2] that cyclicity, redundancy and superfluity are
subproblems of accessibility: an accessibility algorithm can be used as
a necessary condition for the three other problems. In a context-free
grammar, linguistic coherence can be tested locally. Therefore, a first
pass applies to a context-free part of the grammar (without data
sharing nor nonmonotonic atomic formulas). A second, global, pass uses
label propagation, where labels are defined by constraints. We are also
investigating a clique method to treat accessibility in a tractable way
[9, 2].
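For the context-free part, one standard way to approximate these two notions is to compute which symbols are reachable from the start symbol and which symbols can derive a string of terminals. The Python sketch below follows that textbook reading; it is an assumption about what a first pass could look like, not the algorithm of [2], and the names productive, reachable and unused_rules as well as the toy grammar in the usage note are hypothetical.

```python
def productive(rules, terminals):
    """Nonterminals that can derive some string of terminals."""
    good = set(terminals)
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for lhs, rhs in rules:
            if lhs not in good and all(symbol in good for symbol in rhs):
                good.add(lhs)
                changed = True
    return good - set(terminals)

def reachable(rules, start):
    """Symbols that occur in some derivation starting from `start`."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for lhs, rhs in rules:
            if lhs == current:
                for symbol in rhs:
                    if symbol not in seen:
                        seen.add(symbol)
                        stack.append(symbol)
    return seen

def unused_rules(rules, terminals, start):
    """Rules that can never take part in a complete derivation."""
    usable = productive(rules, terminals) | set(terminals)
    # First drop rules mentioning a symbol that derives nothing ...
    live = [(lhs, rhs) for lhs, rhs in rules
            if lhs in usable and all(symbol in usable for symbol in rhs)]
    # ... then drop rules whose left-hand side cannot be reached any more.
    reached = reachable(live, start)
    return [rule for rule in rules
            if rule not in live or rule[0] not in reached]
```

With the rules S → NP VP, NP → det n, VP → v NP, AP → adj and VP → v XP, and the terminals det, n, v and adj, the sketch reports AP → adj (never reached from S) and VP → v XP (XP derives nothing) as unused. The order of the two filters matters: removing unproductive rules first can make further symbols unreachable.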
The inputs to the Generation component are the following constraints:
• on the grammar: specification of obligatory, forbidden or
  cooccurrent rules,
• on terminal nodes: specification of complex structures that determine
  terminal node types,
• on initial structures: specification of incomplete parse trees.
These parameterizations were easily included into the formalism, but
problems occur with the algorithm itself, which chart algorithms are
insufficient to deal with. Three agendas take care of post-modification
of nodes in incomplete trees, thus extending Shieber's algorithm
[21, 18].
4 Linguistic Descriptions 
4.1 Grammar
The development of the GPSG grammar for French can be traced through
three steps.
First, we implemented a demonstration grammar [12], patterned after the
English grammar described in the GDE User Manual [8]. In terms of
coverage, this French grammar can handle some simple questions, which
required the definition of two additional metarules. In terms of
grammar-writing style, following a suggestion of [22, pp. 115-119], we
define the person feature in terms of two binary features, EGO and PTC
(participant). Finally, agreement is a much more pervasive phenomenon
in French than in English, and many more cases must be taken into
account: adjective/noun, determiner/noun, adjectival predicate, and the
past participle.
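The decomposition of person into two binary features can be pictured with the small Python sketch below. The text does not spell out the value assignments, so the table is an assumption: first person is taken to be [EGO +, PTC +], second person [EGO -, PTC +] and third person [EGO -, PTC -]. The names PERSON and persons_matching are likewise hypothetical.

```python
# Assumed encoding of grammatical person with two binary features.
PERSON = {
    1: {"EGO": "+", "PTC": "+"},   # speaker, hence also a participant
    2: {"EGO": "-", "PTC": "+"},   # addressee: participant but not speaker
    3: {"EGO": "-", "PTC": "-"},   # neither speaker nor participant
}

def persons_matching(constraint):
    """Persons whose feature values satisfy a partial specification."""
    return sorted(person for person, features in PERSON.items()
                  if all(features[name] == value
                         for name, value in constraint.items()))
```

Under this assumed encoding, a rule that mentions only [PTC +] covers first and second person together, and [EGO -] covers second and third person, which a single three-valued person feature could not express without a disjunction.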
As a second step, we developed a GPSG-based French grammar along the
lines of the English grammar described in [15]. Although the linguistic
coverage is similar in both of them, the French grammar is only loosely
patterned after the English one. Its development was broken into
subtasks according to the types of constituents encountered (AP, NP,
VP ...) as well as to the types of specific linguistic problems to be
accounted for (e.g. agreement, comparatives and coordination). In
general, the rules in our grammar are driven by lexical information: we
thus model our computational grammar on the results of current
linguistic theory.
Our treatment of agreement is fairly complete [13]. For example, we can
handle complex color adjectives (des robes vert bouteille,
"bottle-green dresses"), predicate APs (les robes sont vertes, "the
dresses are green"), and past participles (les étudiantes que les
policiers ont matraquées, "the students that the police beat up").
The treatment of VPs is extensive [14] and includes the positioning of
clitics [3] and of negation. Lexical VI items are used to handle
complex tenses and the positioning of negation and certain adverbs. We
strived to minimize the number of lexical ID-rules and tackle the
problem of "categorial distortion" [20] (in particular, the grammar can
account for complement subcategorization alternations in a systematic
way).
The treatment of NPs was found to cause more serious problems. Although
we were able to pattern our treatment of modifiers after [15], that of
specifiers is more problematic [19]. It has rapidly become clear that
semantic information is necessary for a satisfactory solution. Thus,
the third step is to enrich our morpho-syntactic grammar with a
semantic component [6].
4.2 Lexicon 
A lexical database is obviously necessary to perform any test on
grammars and parsers. Defining a French lexicon within the GPSG
formalism was not one of our goals but, in parallel to the syntactic
database, we had to construct a lexicon couched in a formalism
compatible with different grammars and with enough coverage to be
useful. Like the grammar provided with the environment, this lexicon
can be taken as is, or be replaced by the users. We eventually settled
on (automatically) transforming the information present in an already
existing dictionary (the CNET lexicon) to serve as the lexical
database.⁶
4.3 Normalizing Lexical Information 
In building a linguistic environment which is 
both French specific and usable by separate users 
with independently built systems, we knew that 
these would require lexical information to be pre- 
sented in different ways. However, with the as- 
sumption that all of the lexical information nec- 
essary for the various syntactic analyses is actu- 
ally present in the lexicon provided with EGL, 
we make the hypothesis that the content of this 
information is common to the various systems. 
Since an increasing number of grammatical 
formalisms put a large part of the linguistic de- 
scription in the lexicon, we are interested in the 
nature and complexity of lexical entries, in the 
division of information between grammar and 
lexicon, in the representation of the syntactic in- 
formation in the lexicon, as well as in the use of 
lexical information in the grammar. Normalizing
this information thus became an important part 
of the linguistic aspect of the project: the features in the
pre-existing lexicon had to be transformed to serve as the basis for a
"neutral" lexicon, which must be usable by grammars not written in the
same framework as that of the CNET.
⁶ The CNET lexicon has more than 55000 entries defined with 200
keywords. The lexicon is transformed into minimal automata with
quasi-linear time complexity for access. The compactness of the
automata allows them to be resident in core memory.
First, a correspondence was established between the syntactic and
morpho-syntactic features of the CNET lexicon and the features required
in systems created by members of the project: the GIREIL grammar; the
LN-2-3 grammar (INSERM); the ELU grammar (ISSCO). From the list of
features used by each of them, we extracted those that pertain to the
lexicon. We only considered attributes required by the grammars at the
lexical level, thus discarding the features which represent information
that can only be evaluated during processing, i.e. which cannot be
present in a lexical entry (e.g. VEUT-AUX-COMPOSE on a complex verbal
form for LN-2-3, or REL on a nominal form in ELU). Since all three
systems adopt to some extent a lexicalist approach and include a large
amount of syntactic information in the lexicon, this division required
a detailed interpretation of their internal workings.
Conversely, although morphological analysis 
is most often performed in a separate component 
(i.e. inflected forms do not constitute separate 
lexical entries), morphological information is included in our
normalization, because that information must be present on the lexemes
serving as starting points for the syntactic analysis.
We then put in correspondence the lexical fea- 
tures of the various systems; here again, it was 
necessary to interpret the way they are actually 
used (e.g. in the representation of reflexive con- 
structions). The normalization of the morpho- 
syntactic features required in these three gram- 
mars can now be extended to other grammatical 
analyses through the more general list of features 
established for the mapping which allows each 
system to recover in the lexicon the information 
it needs to perform an analysis. 
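The mapping idea can be pictured as one neutral entry projected onto the feature inventory of each client grammar. The Python sketch below is purely illustrative: the entry, the attribute names, the two systems called system_a and system_b and their feature names are all invented for the example and do not reproduce the CNET, GIREIL, LN-2-3 or ELU feature sets.

```python
# Hypothetical neutral entry; attribute names and values are invented.
NEUTRAL_ENTRY = {"lemma": "laver", "cat": "verb", "reflexive": True}

# Hypothetical per-system tables:
# neutral attribute -> (local feature name, mapping of neutral values).
MAPPINGS = {
    "system_a": {
        "cat": ("CAT", {"verb": "V", "noun": "N"}),
        "reflexive": ("REFL", {True: "+", False: "-"}),
    },
    "system_b": {
        "cat": ("pos", {"verb": "v", "noun": "n"}),
        "lemma": ("stem", {}),
    },
}

def project(entry, mapping):
    """Recover from a neutral entry the features one system asks for."""
    local = {}
    for attribute, (name, values) in mapping.items():
        if attribute in entry:
            # Values without an explicit translation are passed through.
            local[name] = values.get(entry[attribute], entry[attribute])
    return local
```

Here project(NEUTRAL_ENTRY, MAPPINGS["system_a"]) yields the CAT and REFL features of the first hypothetical system, while the second system obtains only pos and stem from the same entry. Each system ignores the neutral attributes it has no use for, which is the sense in which a single lexicon can serve several grammars.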
5 Conclusion 
While French has been the object of relatively 
extensive research in computational linguistics, 
no extensive formal description of that language 
has been integrated in a linguistically motivated 
development environment. The EGL project is 
part of a growing trend towards a wider linguistic 
coverage coupled with greater flexibility. 
Designing a linguistic development environment requires making some
fundamental choices about the grammatical formalism, and the evaluation
of competing formalisms depends on assumptions imposed by the task at
hand (complexity, determinism, performance degradation in case of
unforeseen input, use and integration of semantic information). The use
of NL as a medium for communication between man and machine renders
desirable the adaptability of an NLP system to various linguistic
formalisms. However, if automatic information processing projects now
more often include an NL component, that component is generally
"closed" and unmodifiable: few systems are designed to provide the
syntactic analysis of natural language texts or to be usable in various
contexts.⁷ In EGL, several of the modules may be reused outside of the
grammatical formalism chosen for our own linguistic description. This
basic requirement of system design can have important consequences when
we want to tailor the system to applications where the linguistic
domain is limited, which is the case in most natural language interface
applications. As a design tool, EGL makes it possible to see
simultaneously and to manipulate easily each of its components.
⁷ The systems developed in France which have been studied in [10] are
all concerned with more than the syntactic treatment of the language.

References
[1] Baldy, B. and A. de Sousa (1989) ALVEY : une étude informatique
    pour la compréhension des mécanismes de l'analyse syntaxique axés
    sur la théorie des Grammaires Syntagmatiques Généralisées. Rapport
    de Recherche, LIPN.
[2] Belabbas, A. (1991) Cohérence des grammaires décrivant le Langage
    Naturel. Rapport de DEA, LIPN.
[3] Bès, G. (1988) "Clitiques et constructions topicalisées dans une
    grammaire GPSG du français". In G. Bès & C. Fuchs eds. Lexique et
    paraphrase pp. 55-81. Lille: Presses universitaires de Lille.
[4] Boguraev, B., J. Carroll, T. Briscoe and C. Grover (1988) "Software
    Support for Practical Grammar Development." Proceedings of the 12th
    International Conference on Computational Linguistics (COLING),
    Budapest, pp. 54-57.
[5] Bouchard, L. H., L. Emirkanian, D. Estival, C. Fay-Varnier, C.
    Fouqueré, G. Prigent and P. Zweigenbaum (1992) "EGL: a French
    Linguistic Development Environment". Natural Language Processing
    and its Applications, Avignon 92.
[6] Bouchard, L. H. and L. Emirkanian (1991) Semantic Interpretation in
    the Grammar Development Environment. Working Paper, GIREIL, UQAM.
[7] Carpenter, B. and C. Pollard (1991) "Inclusion, Disjointness and
    Choice: The Logic of Linguistic Classification." Proceedings of the
    29th Annual Meeting of the Association for Computational
    Linguistics, Berkeley.
[8] Carroll, J., B. Boguraev, C. Grover and T. Briscoe (1988) A
    Development Environment for Large Natural Language Grammars. Tech.
    Report 127, Computer Laboratory, University of Cambridge.
[9] Dechter, R. and J. Pearl (1989) "Tree Clustering for Constraint
    Networks." Artificial Intelligence, 38 (3), pp. 353-366.
[10] Fay-Varnier, C., C. Fouqueré, G. Prigent et P. Zweigenbaum (1991)
    Comparaison de systèmes d'analyse syntaxique du français : Données
    et Commentaires. Journées Nationales du PRC-CHM, Toulouse.
[11] Gazdar, G., E. Klein, G. Pullum and I. Sag (1985) Generalized
    Phrase Structure Grammar. Cambridge: Harvard University Press.
[12] GIREIL (1990a) "Brève description de la grammaire "pochoir" du
    français". Rapport de recherche. UQAM.
[13] GIREIL (1990b) "Grammaire minimale de l'accord". Rapport de
    recherche. UQAM.
[14] GIREIL (1991) "La structure du syntagme verbal en français".
    Rapport de recherche. UQAM.
[15] Grover, C., T. Briscoe, J. Carroll and B. Boguraev (1989) The
    Alvey Natural Language Tools Grammar (Second Release). Tech. Report
    162. Computer Laboratory, University of Cambridge.
[16] Kaplan, R. and J. Bresnan (1982) "Lexical-functional grammar: A
    formal system for grammatical representation". In The Mental
    Representation of Grammatical Relations. J. Bresnan, ed. Cambridge:
    MIT Press.
[17] Kay, M. (1982). "Parsing in Functional Unification Grammar". In
    Natural Language Parsing, D. Dowty, L. Karttunen and A. Zwicky,
    eds. Cambridge: Cambridge University Press.
[18] Le Barzic, J.P. (1991) Génération paramétrée multi-formalisme.
    Rapport de DEA, LIPN.
[19] Milner, J.-C. (1978) De la syntaxe à l'interprétation : Quantités,
    insultes, exclamations. Paris: Editions du Seuil.
[20] Milner, J.-C. (1989) Introduction à une science du langage. Paris:
    Editions du Seuil.
[21] Shieber, S., G. van Noord, R.C. Moore, and F.C.N. Pereira (1989).
    "A Semantic-Head-Driven Generation Algorithm for Unification-Based
    Formalisms". Proceedings of the 27th Annual Meeting of the
    Association for Computational Linguistics, Vancouver, pp. 7-17.
[22] Tesnière, L. (1988) Eléments de syntaxe structurale. (Deuxième
    édition revue et corrigée. Cinquième impression). Paris:
    Klincksieck.
