ALEP-Based 
Distributed Grammar Engineering 
Axel Theofilidis 
IAI 
Martm-Luther-~,tr. 14 
66111 Saarbr/icken, Germany 
axel~ia£, un±- sb. de 
Paul Schmidt 
University of Mainz, FASK Germersheim 
An der Universit~t 2 
76726 Germersheim, Germany 
schmidtp@usun2, fask. uni-mainz, de 
Abstract 
Starting from a clarification concerning 
the notion of distributed grammar engi- 
neering, we present options of distributed 
grammar engineering as are supported by 
the ALEP grammar development platform 
and as were instantiated in the LS-GRAM 
project. The notion of distributed gram- 
mar engineering being grounded in the 
concepts of grammar modularization and 
grammar module integration, we focus on 
ALEP features providing operational sup- 
port for these two concepts in terms of 
both data storage and internal classifi- 
cation of grammar code. We conclude 
with an indication of the major benefits 
of ALEP facilities for distributed gram- 
mar engineering and derive some general 
desiderata from this. 
1 Two Notions of Distributed 
Grammar Engineering 
According to our understanding, two notions of 
distributed grammar engineering (DGE) should be 
distinguished: (i) DGE meaning that tasks which 
jointly contribute to the development and mainte- 
nance of a grammatical resource are distributed over 
different persons working, perhaps, at different sites; 
(ii) DGE meaning that different domains of linguis- 
tic information, different layers of linguistic descrip- 
tion are distributed over different processing (sub- 
)tasks. 
Irrespective of this difference in notion, there is 
a fundamental presupposition common to both no- 
tions of DGE: that grammatical resources bear a 
modular structure, and that grammar modules be- 
ing distributed over different authors or processing 
devices can neatly be integrated to form a coherent 
grammatical resource, respectively to support a co- 
herent chain of processing tasks. Thus, as shown 
in Figure 1, the idea of DGE is firmly grounded 
in the concepts of modularization and integration: 
once grammar modularization is provided and inte- 
gration of grammar modules is ensured, the option 
of DGE falls off as a side-effect. 
grammar 
integration ~ modularization 
distribution 
Figure 1: DGE grounded in the concepts of modu- 
larization and integration 
In the following two sections we will present 
options of grammar modularization and grammar 
module integration as are supported in the ALEP 
grammar development platform both in terms of 
data storage (section 2) and in terms of internal 
classification of grammar code (section 3). We will 
indicate how these options were made use of to sup- 
port DGE in the LS-GRAM project, 1 in the context 
of which broad-coverage grammatical resources for 
nine EU-languages were developed. Though no op- 
erational concept of DGE has been implemented at 
the multi-lingual level in this project, each particu- 
lar grammar was collaboratively written by several 
authors which, in some cases, were even located at 
different sites? thus giving rise to a real need for 
DGE support. 
1LRE 61029: Large-Scale Grammars for EU 
Languages 
2The Dutch grammar was developed at SST, Utrecht, 
and KUL, Leuven; the German grammar at IAI, 
Saarbrficken, and IMS, Stuttgart; the Italian grammar 
at DIMA, Torino, and ILC, Pisa; the Spanish grammar 
at FBG and UPF, both Barcelona. 
84 • 
2 Grammar Modularization and 
Integration in terms of Data 
Storage 
The ALEP platform realizes an object-oriented envi- 
ronment. As such it assumes storage of data at two 
levels: the file level and the object level. At the file 
level, data (grammar code) may be distributed over 
an arbitrary number of files, each of them contain- 
ing a particular type of data (e.g. type and feature 
declarations, lexical entries, phrase structure rules). 
A phrase structure grammar, for instance, may be 
distributed over several files each of them containing 
a set of rules accounting for a particular domain of 
constituency (e.g. 'S', 'VP', 'NP') or for some par- 
ticular type of construction (e.g. apposition, coordi- 
nation, extraposition). Similarly, a lexicon may be 
distributed over several files along dimensions such 
as part-of-speech category or sub-language. 
At the object level, on the other hand (cf. (Groe- 
nendijk, 1994)), an arbitrary number of files consti- 
tuting a coherent grammar module may be grouped 
into an object being defined in terms of the ALEP 
User Language (AUL) 3 and stored in the ALEP Ob- 
ject Repository (AOR). Figures 2 and 3 show sam- 
ple (but partial) AUL objects representing a phrase 
structure component and a declarations component 
of a grammar. 
lg_ps_rules I 
lowv \[name 
nowv \[.owner 
rule_files 
str\] 
i~j 
loc 
(file-infolbTe 
\file_info \[base 
decl_ref 
'GRAM//RULES//STR//'\] str_xp 
J' 
'GRAM//RULES/STR//'\] 
str_sent J ' 
\[,ypo \] / n°wV \[o me dec\] 
tnowv" nowvLOWner iai jj 
Figure 2: Sample A UL object representing a phrase 
structure component 
The object of type I lg_ps_rules bears the list-type 
feature 'rule_files', with each of the 'fileAnfo' fea- 
ture structures pointing to a file containing a set 
of phrase structure rules. 4 In addition, an object of 
type ~ is referred to by the feature 'decl_ref'. 
3A typed feature structure notation similar to the lin- 
guistic formalism supplied with ALEP. 
4A UNIX directory path and a file name are given as 
the vMues of the attributes 'loc' and 'base'. 
85 
nowv \[name dec\] nowv/owner iaij 
( r,o  
decl_files file_info I.base 
'ORAM //DEOLS //  PES //'\] \ typ_syn J'~ 
Figure 3: Sample A UL object representing a decla- 
rations component 
This object (shown in Figure 3) represents the dec- 
larational basis of the respective phrase structure 
component. It refers, in its turn, to a list of files con- 
taining a coherent set of declarations (e.g. type and 
feature or macro declarations) upon which a phrase 
structure component or any other grammar compo- 
nent (e.g. a lexicon component) may be based. 
A coherent set of grammar components may, in 
turn, be grouped (i.e. integrated) into a higher- 
level object of type lglingware_group\] represent- 
ing a complete grammar (or sub-grammar). This is 
shown in Figure 4 illustrating the modular and, at 
the same time, hierarchical style of data structuring 
characteristic of ALEP. 5 
Presuming a principled distribution of grammat- 
ical data over files (reflecting, for instance, differ- 
ent types of grammatical phenomena or different do- 
mains of grammatical description), a whole range of 
specialized grammars or grammar components may 
be configured at the object level according to par- 
ticular grammar development or maintenance tasks. 
Objects may be defined, for instance, which incre- 
mentally extend the core coverage of a grammar by 
new domains of linguistic description. This is illus- 
trated in Figure 5, where two \[lg_ps_rules\] objects 
are shown, both of which share the files containing 
the basic phrase structure rules dealing with senten- 
tim (file 'str_sent') and non-sentential (file 'str_xp') 
constructions, but which extend this set of files once 
by a file accounting for coordinated structure and, in 
the second case, by a file accounting for paragraph 
structure. 
Based on this kind of definition of specialized 
grammars or grammar components, grammar de- 
velopment and maintenance tasks being related to 
particular domains of linguistic decription may eas- 
ily be assigned to, and distributed over, different 
persons possibly working at different sites. Physi- 
cal distribution (exchange) of grammar components 
is conveniently supported by the ALEP 'Export' 
5Objects of typo I lg_lox ules I and L lg_tlm_r°lesl 
represent lexicon, respectively two-level morphology, 
components. 
lgJingwa,re_group \] 
I'g  ex-,:u'es I I'g-p s-ru'esl I 'g-t m-rulos I 
~~ GRAM/ 
DECLS'~ ~ "~ RULES/ 
x LEX/ STR/ TLM/ 
l\ l\ l\ 
1 \ / % 1 \ 
l \ l %. 1 \ 
l %. l \ l \ 
Figure 4: Data storage at the AOR object level and the UNIX file level 
lg_ps_rules 
\[lOWV 
name str_coor \] 
owner axel I 
versionW°rkarea" ~lsg_de, i i)J 
nowv 
Ig_ps_rules 
\[lOWV 
name str_para I 
owner thierry / 
versionW°rkarea ~lsg_de, ims)\] 
nOWV 
.../str..sent .../str_xp .../str_coor .../str_para 
Figure 5: Object-based definition of specialized grammar components 
and 'Import' functionality, where 'Export' performs 
a (UNIX) 'tar' operation on a selected set of ob- 
jects, and 'Import' performs an 'untar' operation on 
an 'Export'-created tar-file, asserting all respective 
objects to the AOR as well as creating all directo- 
ries and files being referred to by these objects (of. 
(Groenendijk, 1994), chapter 3). 
Management of larger sets of objects being iter- 
atively exchanged between persons or sites is well 
supported by a number of features being assigned to 
every object, such as the 'comment' feature, which 
allows to encode a comment string with every ob- 
ject, and, most importantly, the object identifica- 
tion feature 'nowv' which requires every object to 
be assigned a unique combination of (object) name, 
(object) ownership, (object) workarea, and (object) 
version (cf. (Groenendijk, 1994), chapter 4), thus 
allowing to keep track of distributedness in gram- 
mar writing. In the illustration given in Figure 5, 
for instance, the 'nowv' feature indicates that Axel 
obviously is the person who elaborates on the sec- 
ond version of a phrase structure component cover- 
ing coordinated structure as part of the German LS- 
GRAM resources developed at IAI, whereas Thierry 
is the person who elaborates on the third version of 
a phrase structure component covering paragraph 
structure as part of the German LS-GRAM re- 
sources developed at IMS. 
This style of distributed grammar development 
has been a standard practice in the LS-GRAM 
project. Thus, for instance, it has been a typical 
approach to have the morphological and the syn- 
tactic grammar components developed by different 
persons and, as in the case of the Italian or Spanish 
grammar, even at different sites. 
After each cycle of distributed grammar develop- 
ment based on the definition of specialized grammars 
or grammar components, re-integration of the var- 
ious grammar modules can easily be performed at 
the level of data storage by defining respective new 
86 
I lg-"ngware-gr°up" I I 'g ngware-gr°up b I \[lg_"ngware_group  
1 I I I I I 1 
.../lex_core .../lex_finc .../lex_econ 
(= finance) (= economy) 
.../str_core .../str_apps .... /str_coor 
(= apposition) (= coordination 
Figure 6: Object-based sub-grammar configuration 
objects. Grammar modules being established at the 
file level and spread over different objects represent- 
ing specialized grammar components can be merged 
into one object representing a full coverage gram- 
mar component. Such full coverage grammar com- 
ponents may, in turn, be grouped into a higher level 
object of type I lg_lingware_group \] representing a full 
coverage grammar. 6 
Interesting to note is that the style of modular- 
ization and integration of grammars supported in 
ALEP at the level of data storage not only conve- 
niently supports DGE (as we hope to have shown), 
but also a high degree of flexibility in configuring 
(and re-configuring) grammars according to specific 
(and changing) demands of different application sce- 
narios. This is illustrated in Figure 6, showing how 
sub-grammars with varying coverage can be con- 
figured at the object level based on a fine-grained 
modularization of grammar components along di- 
mensions such as sub-language for lexicon compo- 
nents, or types of syntactic constructions for phrase 
structure components. 
Grammar Modularization and 
Integration in terms of Data 
Classification 
Besides in terms of data storage, ALEP also sup- 
ports grammar modularization and integration in 
terms of internal classification of grammar code 
based on the notion of specifiers. Specifiers are desig- 
nated feature structures 7 which serve the purpose of 
encoding membership of a rule in one (or more than 
one) class of rules and, thus, realize the notion of a 
rule classifier. By specifying rules to be members of 
6Though the process of merging objects is not yet 
functionally supported in the ALEP environment, it 
is considered a trivial task to integrate a respective 
functionality. 
7'designated' in that they are picked out by a special 
feature path declaration 
particular classes of rules, grammars are internally 
(and, in that, multi-dimensionally) partitioned into 
sub-grammars (cf. (Simpkins, 1994a), chapters 5 
and 7). 
Specifier-based grammar partitions may be estab- 
lished along two basic dimensions illustrated in Fig- 
ure 7: along a vertical dimension, grammar par- 
titions may be established according to different 
types of processing operations to apply; lexical en- 
tries, for instances, may be specified to be applied 
during word segmentation (= two-level based mor- 
phographemic analysis), during analysis (= pars- 
ing), or during refinement only (the operation of re- 
finement will be explicated below). Along a horizon- 
tal dimension, on the other hand grammar partitions 
may be established according to different types of 
structural units being involved in the parsing oper- 
ation; structure rules may be specified to be applied 
only when parsing morphemes to words, words to 
sentences, or sentences to paragr~,phs. 
The main effect that can be obtained by an 
intelligent specifier-based grammar partitioning is 
in terms of increased performance efficiency: By 
specifier-based grammar partitioning, access of rules 
during execution of some processing (sub-)task can 
be restricted to a sub-grammar being identified via 
a particular instance of the specifier feature struc- 
ture and encoding only as much information as is 
relevant to the respective processing task. 
Irrespective of performance support, however, 
specifier-based grammar partitions constitute an op- 
erational concept of grammar modularization that 
can be multiply exploited in DGE. In that, the 
ALEP 'Refine' operation plays a crucial role. 'Re- 
fine' is a monotonically operating feature-decoration 
algorithm which (re-)applies structure rules and lex- 
ical entries to consolidated structure trees as are ob- 
tained from analysis (or synthesi, 0. Important with 
regard to DGE, as we will see shortly, is that 'Re- 
fine' may be executed an arbitrary number of times 
in succession. 
The set of rules applied by the: 'Refine' operation 
87 
sent-to-para 
word-to-sent 
morph-to-word 
- Oegmen 3 -- C-. yse  -- -- 
Figure 7: Grammar partitioning along a vertical and a horizontal dimension 
must constitue a complete grammar which is unifi- 
able (partially identical perhaps) with the grammar 
that was applied by the preceding operation. In 
that, however, 'Refine' will produce an effect only 
by application of rules which add some information 
(feature decoration!) compared to the corresponding 
rules that were applied by the preceding operation. 
By a systematic distribution (based on a vertical 
grammar partition scheme) of different domains of 
linguistic information over an analysis grammar and 
one, or more, 'Refine' grammars, the presumed mod- 
ularity of linguistic knowledge is (monotonically) 
fleshed out at the level of grammar engineering. 
Thus, for instance, it has been a typical practice 
in LS-GRAM to distribute syntactic and semantic 
information over an analysis and a refinement gram- 
mar respectively; by this, parsing will not be affected 
by ambiguities residing in the semantic domain (cf., 
for instance, (Schmidt et al., 1996a) and (Schmidt 
et al., 1996b)). 
1 
Canalyse~ 
! 
! 
syntax 
semantics 
Figure 8: Grammar partitioning according to differ- 
ent domains of linguistic information 
More importantly with regard to DGE, however, 
is that grammar modules being distributed over 
the processing operations of analysis and refine- 
ment and being delimited according to different do- 
mains of linguistic information, can simultaneously 
be distributed over different authors according to 
their specific expertise. The degree of this kind of 
distributedness can be significantly increased by a 
still finer-grained modularization of grammatical re- 
sources assuming multiple application of the 'Refine' 
operation, as illustrated in Figure 9. Thus an ac- 
count of syntax can be distributed over a grammar 
module supporting shallow parsing and a grammar 
module performing syntactic filtering based on the 
refinement operation; different aspects of semantics 
and pragmatics may futhermore be distributed over 
distinct grammar modules being successively applied 
during further refinement stages: 
~eflne~ 
Oeflnek~ 
1 
t 
Creflne,.D 
l 
syntax1 : shallow parsing 
syntax2: syntactic filtering 
semantics1: linking theory 
semantics2: lexical semantics 
pragmaticsl : register & style 
pragmaticsz : implicature 
Figure 9: Grammar partitioning according to differ- 
ent domains of linguistic information 
However, if the only effect of specifier-based gram- 
mar partitioning was that of distributing differ- 
ent domains of linguistic information over grammar 
modules to be applied during successive process- 
ing stages, the same effect could also be obtained 
by simply assuming distinct grammar modules in 
terms of data storage (i.e. at the file and object 
level of data storage). But, in terms of both gram- 
88 
\[:o.o.O: r\] 
gramphen \[extrp : 
g_phen " " " 
core y spec spec 
l \[appo~ n\] 
gramphen \[::t °'d... y 
g_phen 
\[.core Y 
\[:ppo d n\] 
gramphen \[?xtrp ; 
core g_pheny 
spec 
Figure 10: Specifier feature structures establishing a cross-classificatory grammar partition 
mar development and maintenance tasks it is, in 
fact, an advantage of the specifier-based approach 
to grammar modularization that logically related 
grammar code may be stored in one and the same 
file, though it will be applied, in effect, at differ- 
ent processing stages. An even bigger advantage 
is that the specifier-based approach to grammar 
modularization supports a multi-dimensional, cross- 
classificatory partitioning of grammars which is not 
possible in terms of data storage, unless at the cost 
of redundantly duplicating grammar code. In terms 
of specifier feature structures as those shown in Fig- 
ure 10, for instance, grammar rules are simultane- 
ously assigned to the class of rules establishing a 
core-coverage grammar and to one of several classes 
of rules accounting for specific grammatical phenom- 
ena, such as apposition, coordination or extraposi- 
tion. 
Based on specifier feature structures establishing 
a cross-classificatory partition scheme for grammars, 
specialized grammar modules can be both uniquely 
identified by full specification, and integrated by un- 
derspecification, of specifier features. Thus, for in- 
stance, by reference to the underspecified feature 
structure shown in Figure 11, all grammar modules 
are called except for those dealing with extraposi- 
tion. 
\[gramphela \[extrp n\]\] spec \[ g_pnen 
Figure 11: Underspecified specifier feature structure 
effecting grammar module integration 
4 Conclusions and Desiderata 
The mechanisms and devices described in the pre- 
vious sections constitute a first step towards remov- 
ing a decisive bottleneck in large-scale distributed 
grammar engineering s. The bottleneck is that the 
bigger and the more complex a grammar becomes 
the more difficult it is to extend it, improve it, or 
8We think that large-scale grammar engineering must 
be distributed in general. 
adapt it. Grammars tend to become huge, incom- 
prehensible monolithic blocks which are more and 
more difficult to maintain, with distributed develop- 
ment becoming impossible. In the following we will 
make some general, summarizing points about what 
we think can be derived from the kind of facilities 
provided by ALEP for DGE in general. In this, we 
will mainly focus on benefits of specifier-based parti- 
tioning, addressing the issues of testing, distribution, 
maintenance and deployment, as well as touching on 
the issue of monotonic vs. non-monotonic grammar 
development. In section 4.2 we conclude with deriv- 
ing some general desiderata for DGE. 
4.1 Benefits for Large-Scale Grammar 
Engineering 
Testing: A major advantage of the specifier facil- 
ity lies in the fact that testing (and thus develop- 
ment) becomes easier as specific modules (that may 
be under reconstruction) may be tested separately 
by plugging together just the relevant modules in 
terms of reference to the approriate specifiers. Thus, 
it is possible to plug together a core grammar with 
a coordination module in order to test coordination 
while leaving aside other phenomena, such as ex- 
traposition, which may be irrelevant at the given 
stage of grammar development. Then, gradually, 
more modules may be added (one by one) in order 
to explore the interaction of these modules. 
Distribution: It is obvious that, as far as distri- 
bution of grammar development is concerned (in the 
sense that different linguists develop the same gram- 
mar), facilities as described are a minimum. Based 
on mechanisms of grammar partitioning, grammars 
can be developed in a distributed fashion in that one 
linguist may work on developing a treatment of ap- 
positions, building on a given core grammar, while 
another one is working on a treatment of coordina- 
tion, building on the same core grammar without 
unwanted interference. 
Maintenance: From what has been said about 
testing and distribution, it is obvious that a gram- 
mar existing in a modular form as supported by 
the facilities described is better maintainable than 
a monolithic one. 
Deployment: In the same way grammar mod- 
ules can be plugged together for testing and, more 
89 
generally, for development purposes, grammar mod- 
ules can also be plugged together in order to be de- 
ployed in different application scenarios with distinct 
requirements in terms of linguistic coverage or depth 
of analysis. That is, one could well envisage that, 
for particular applications, specific sets of grammar 
modules are plugged together by reference to the ap- 
propriate specifiers. 
Monotonic vs. Non-Monotonic Grammar 
Development: The specifier facility, as introduced 
so far, suggests that the optimal way of proceeding 
in grammar development (also independently of this 
facility) is to proceed monotonically, i.e. by sim- 
ply adding different modules accounting for specific 
domains of linguistic description (such as those ex- 
emplified by coordination and extraposition) to al- 
ready available modules. However, this is an unreal- 
istic requirement as the treatment of new phenom- 
ena hardly ever consists of simply adding (monoton- 
ically) sets of new rules to an existing grammar. It 
is often required to revise existing modules (even the 
core grammar) in the light of requirements deriving 
from the treatment of new phenomena. But even in 
this case the modularization achieved by specifier- 
based partitioning of grammars is the condition for 
performing the required changes efficiently and un- 
der optimal control, since the consequences of such 
changes can be studied module by module. 
As a final point, it should be mentioned that par- 
titioning of grammars by approriate specifiers sup- 
ports adaptation of a grammar to new theoretical 
insights about the nature of human language that 
may become necessary by new developments in the- 
oretical linguistics. Specifiers thus contribute con- 
siderably to the 'reusability' of grammars. 
4.2 Desiderata 
The options of grammar modularization provided in 
ALEP both in terms of file and object-based data 
storage and in terms of specifier-based classification 
of grammar code, constitute a good basis for sup- 
porting DGE at an operational level. As for support 
for grammar module integration, however, we still 
see the need for add-on functionality complementing 
operational support for grammar modularization. 
When thinking of grammar module integration in 
terms of data storage, one has to bear in mind that, 
in itself, integration of grammar modules at the level 
of data storage does not yet entail an integrated 
grammar. Integration of grammar modules presup- 
poses integration in terms of linguistic specifications 
meaning that each particular grammar module must 
be integratable at least wrt. a common declarational 
basis (type and feature theory, macros etc.) and 
some implemented notion of a core grammar. 
In both respects, problems with integration may 
occur during the course of distributed grammar de- 
velopment, where different persons simultanously 
elaborate on different grammar modules. Interfer- 
ences may occur, for instance, due to the fact that 
work on a specific grammar module requires parts 
of the declarational basis to be modified which are 
shared by other modules. Interference may also oc- 
cur due to the fact that a specific grammar module 
being developed by one author feeds in information 
into modules developed by other authors, with these 
modules, in turn, being supposed to thread this in- 
formation to still other modules. To get hold of such 
interferences, sophisticated versioning control func- 
tionality is required providing for automatic creation 
of version protocols (encoding modifications, author- 
ship etc.), for comparing or merging parallel versions 
of some grammar module, and for checking informa- 
tion paths across grammar modules. 
As for the concept of specifier-based grammar 
partitioning, we consider it desirable to provide a 
direct linking to the data storage layer by provid- 
ing the option of defining objects not only in terms 
of a reference to files, but also in terms of a ref- 
erence to specifier information. The idea is that 
the data being represented by an object can be se- 
lected (and automatically stored in appropriate files) 
based on a full or partial specification of the spec- 
ifier feature structure. By implementation of this 
idea, specifier-based grammar partitions of arbitrary 
grain-size could become physically manifest at the 
object-level of data storage making them immedi- 
ately accessible to object-level support of DGE such 
as the 'Export'/'Import' functionality. 

References 
Marius Groenendijk. 1994. Environment Tools 
Guide. ALEP-2 - Guide to the ALEP User In- 
terface Tools. CEC, Luxembourg. 
Paul Schmidt, Sibylle Rieder, Axel Theofilidis, 
Thierry Declerck. 1996a. Final Documen- 
tation of the German LS-GRAM Lingware 
(LRE 61029, Deliverable DC-WP6e (German)). 
IAI, Saarbriicken (http://www.iai.uni-sb.de/LS- 
GRAM). 
Paul Schmidt, Sibylle Rieder, Axel Theofilidis, 
Thierry Declerck. Lean Formalisms, Linguis- 
tic Theory, and Applications. Grammar Devel- 
opment in ALEP. 1996b. In Proceedings of the 
16th International Conference on Computational 
Linguistics (COLING-96), pages 286-291, Copen- 
hagen, Denmark. 
Neil K. Simpkins. 1994a. ET-6/1 Linguistic Formal- 
ism. ALEP-2 - User Guide. CEC, Luxembourg. 
Neil K. Simpkins. 1994b. Linguistic Development 
and Processing. ALEP-2 - User Guide. CEC, 
Luxembourg. 
