Lean Formalisms~ Linguistic Theory~ and Applications. 
Grammar Development in ALEP. 
Paul Schmidt & Axel Theofilidis 
& Sibylle Rieder 
IAI 
Martin-Luther-Str. 14 
D-66 111 Saarbrficken 
{paul,axel,sibylle} @iai.uni-sb.de 
Abstract 
This paper describes results achieved in a 
project which addresses the issue of how the 
gap between unification-based grammars as a 
scientific concept and real world applications 
can be narrowed down 1. Application-oriented 
grammar development has to take into ac- 
count the following parameters: Efficiency: 
The project chose a so called 'lean' formal 
ism, a term-encodable language providing effi- 
cient term unification, ALEP. Coverage: The 
project adopted a corpus-based approach. 
Completeness: All modules needed from text 
handling to semantics must be there. The 
paper reports on a text handling compo- 
nent, Two Level morphology, word structure, 
phrase structure, semantics and the interfaces 
between these components. Mainstream ap- 
proach: The approach claims to be main- 
stream, very much indebted to HPSG, thus 
based on the currently most prominent and 
recent linguistic theory. The relation (and 
tension) between these parameters are de- 
scribed in this paper. 
1 Introduction 
Applications on the basis of unification-based gram- 
mars (UG) so far are rather rare to say the least. 
Though the advantages of UGs are obvious in that 
properties such as monotonicity, declarativity, per- 
spicouity are important for maintaining and easily 
extending grammars, their popularity (despite 15 
years of history) is still restricted to the academia. 
This paper reports of a project, LS-GRAM, which 
tried to make a step further in bringing UGs closer 
to applications. 
Application-oriented grammar development has to 
take into account the following parameters: 
• Efficiency: A major problem of UGs is (lack- 
ing) efficiency. For the LS-GRAM project this 
1This project is LS-GRAM, sponsored by the Com- 
mission of the European Union under LRE 61029. 
Thierry Declerek 
IMS, University of Stuttgart 
Azenbergstr. 12 
D-70174 Stuttgart 
thierry@ims.uni-stuttgart.de 
led to the decision to use a so called 'lean' for- 
nrMism, ALEP, providing efficient term unifi- 
cation. 'Leanness' means that computation- 
ally expensive formal constructs are sacrificed 
to gain efficiency. Though this is at the cost of 
expressiveness, it is claimed that by 'leanness' 
'hnguistic felicity' does not suffer. 
• Coverage: Most grammar development 
projects are not based on an investigation of 
real texts, but start from 'the linguists" text 
book'. This is different here in that a corpus- 
based approach to granlmar development has 
been adopted which is the implementation of 
the sinlple principle that if a grainnlar is sup- 
posed to cover reM texts, that the coverage of 
these texts has to be determined first. The 
was a corpus investigation in the in the begin- 
ning, in the course of which tools have been 
used and developed which allow for automatic 
and semi-automatic determination of linguistic 
phenomena. 
• Conlpleteness: All modules needed from text 
handling to semantics had to ve developed. 
This is why this paper does not focus on one 
single topic, but tries to represent the ulajor 
achievements of the whole of the system. The 
paper reports on a text handling component, 
Two Level morphology, word structure, phrase 
structure, semantics and and very importan{ly 
the interaction of these components. 
• MMnstream approach: None-the-less, the ap- 
proach we adopted clainls to be mainstream, 
very much indebted to HPSG, thus based on 
the currently most prominent and recent lin- 
guistic theory. 
The relation (and tension) between these parame- 
ters is the topic of this paper. 
First, we will show, how a corpus-investigation 
estabfished the basis for the coverage, second, 
how various phenomena deternlined by corpus- 
investigation are treated in text handhng (TH), 
third, how the linguistic modules, Two Level Mor- 
phology (TLM), word and phrase structure, the lex- 
icons look hke. The last section is devoted to the 
286 
efficiency and performance of the system. Figures 
are given which prove that the system is not so far 
from real applications 2 
2 Corpus-Investigation with the 
MPRO-System 
The project started with a corpus investigation. It 
consisted of 150 newspaper articles from the Ger- 
man 'Die ZEIT'. They are descriptive texts from the 
domain of economy. They were investigated auto- 
matically by the (non-statistical) 'MPRO' tagger. 
'MPRO' provides the attachment of rich linguis- 
tic information to the words. In addition, 'MPRO' 
provides a built-in facility for resolving categorial 
ambiguities on the basis of homograph reductions 
and a facility for handling unknown words which 
are written on a file. Encoding the missing stems 
(which were very few) ensured complete tagging of 
the corpus. 
'MPRO' also provides a facility for searching syn- 
tactic structures in corpora. A detailed analysis 
on the internal structure of main clauses, subor- 
dinate clauses, verbal clusters, clausal topoi (e.g. 
structure of Vorfeld and Nachfeld), NPs, PPs, APs, 
CARDPs, coordinate structures, occurrence of ex- 
pletives, pronominals and negation occurring in the 
corpus was made which then guided grammar de- 
velopment. 
Another major result of the corpus investigation 
was that most sentences coutMn so called 'messy 
details', brackets, figures, dates, proper names, ap- 
positions. Most sentences contain compounds. 
In generM, most of the known linguistic phenom- 
ena occur in all known variations. Description has 
to be done in great &tail (all frames, all syntactic 
realizations of frames). (Long distant) discontinu- 
ities popular in theoretical linguistics did not play 
a role. In order to give a 'general tlavour' of the 
corpus-investigation one noteworthy result should 
be reported: 25% of the number of words occur in 
NPs of the structure \[ DET (A) (A) N \]. But 'A' 
and 'N' are of a complex and unexpected nature: 
• A: (name) - er: Dortmund-er, Schweiz-er 
® (figure) - A: 5-j/£hrig, ffinfj~hrig, 68-prozentig 
• N: Ex-DDR-Monopolisten (Hyphenated com- 
pounding, including names and abbreviations). 
The corpus-investigation guided the grantmar de- 
velopnrent. A.o. it showed the necessity to devlop 
a TH component and separate out specific phenom- 
ena from the treatment in the grannnar. (This was 
also necessary from an efficiency point of view). 
2It should be mentioned that we are referring to the 
German grammar built in the LS-GRAM project. For 
other languages similar system exist. 
3 Text Handling 
The ALEP platform provides a TH component 
which allows "pre-processing" of inputs, it con- 
verts a number of formats among them 'Latex'. 
Then the ASCII text first goes through SGML- 
based tagging: Convertion to an EDIF (Eurotra 
Document Interchange Format) format, then para- 
graph recognition, sentence recognition and word 
recognition. The output of these processes consists 
in the tagging of the recognized elements: 'P' for 
paragraphs, 'S' for sentences, 'W' for words (in case 
of morphological analysis, the tag 'M' is provided 
for morphemes) and 'PT' for punctuation signs as 
exelnplified in (1). 
(1) <P> 
<S> 
<W>Jolm</W> 
<W>sleeps</W> 
<PT>.</PT> 
</s> 
</P> 
In the default case, this is the information which is 
input to the TH-LS component (Text-Handling to 
Linguistic Structure) component. ALEP provides a 
facility (tsls-rules) which allows the grammar writer 
to identify information which is to flow from the TtI 
to the linguistic processes. We will show how this 
facility can be used for all efficient and consistent 
treatnlent of all kinds of 'messy details'. 
The TH component of the ALEP platform also 
foresees the integration of user-defined tags. The 
tag (USR) is used if the text is tagged by a user- 
defined tagger. An example of an integration of a 
user-defined tagger between the sentence recogni- 
tion level and the word recognition level of the TH 
tool is given below. 
The tagger for 'messy details' has been integrated 
into the German grammar and has been adapted 
for the following patterns: 
1 'Quantities' (a cardinal number followed by an 
amount and a currency name (e.g. "16,7 Mil- 
lionen Dollar")) 
2 Percentages (12%, 12 Prozent, zwSlf Prozent) 
3 Dates (20. Januar 1996) 
4 Acronyms and abbreviations ('Dasa', 'CDU', 
'GmbH', etc.). 
5 Prepositional contractions (zum, zur, am etc.) 
• Appositions: Prof. Dr. Robin Cooper 
We will examplify the technique for 'quantities'. 
Recursive patterns are described in the program- 
ruing language 'awk'. (2) defines cardinal numbers 
(in letters). 
287 
(2) two-to-ten = "((\[Zz\]wei) l(\[Dd\]rei)l 
(\[Vv\]ier) I(\[Ff\] Inf) I 
( \[Ss\] echs) \] ( \[Ss\] leben) \] 
( \[Aa\] cht) l ( \[Nn\] ean) I ( \[Zz\] ehn) )" 
eleven-to-nineteen = . ... 
t went y-t o-ninety 
etc. 
card = "((\[Ee\]inl"two-to-ten") 
(und"twenty-to-ninety") ?I " .... ) 
On the basis of these variables other variables can 
be defined such as in (3). 
(3) range = "("number"l"card")" 
amount =" ("Millionen" I"Milliaxden") 
currency=" ("Mark", "DM", "Dollar")" 
curmeasure=" ("amount"??"currency" ?)" 
quantity =" ("range .... curmeasure")" 
The following inputs are automatically recognized 
as 'quantities': 
"Zweiundzwanzig Dollar" 
"Sechsundzwanzig Milliarden" 
"Dreiundvierzig Milliarden DM" 
This treatment of regular expressions also means a 
significant improvement of efficiency because there 
is only one string whereas the original input con- 
sisted of five items ("vierzig bis ffinfzig Milharden 
Dollar"): "vierzig_bis_fuenfzig_Milliarden_Dollar". 
(4) gives an exalnple for information flow from TH 
to linguistic structure: 
(4) id:{ 
spec => spec:{ 
lu => TYPE}, 
sign => sign:{ 
string => string: { 
first => \[ STRING I REST\], 
rest => REST}, 
synsem => synsem:{ 
syn => SYN => syn:{ 
constype => morphol:{ 
lemma => VAL, 
rain => yes } } } } }, 
'USR',\['TYPE' => TYPE, 
'VAL' => VAL\], STRING ). 
The feature 'TYPE' bears the variable TYPE (in 
our case: "quantities"). The feature 'VAL' repre- 
sents the original input (e.g.: "ffinfzig Milliarden 
Dollar") and the variable STRING represents the 
output string of the tagged input (in this case: 
"fuenfzig_Milharden_Dollar"). This value is co- 
shared with the value of the "string" feature of the 
lexicon entry. The definition of such generic entries 
in the lexicon allows to keep the lexicon smaller 
but also to deal with a potentially infinite number 
of words. 
These strategies were extended to the other phe- 
nomena. The TH component represents a pre- 
processing component which covers a substantial 
part of what occurs in real texts. 
4 The Linguistic Modules 
4.1 Two Level Morphology (TLM) 
The TLM component deals with most major mor- 
phographemic variations occurring in German, such 
as umlautung, ss-fl alternation, schwa-instability. 
We will introduce this component by way of ex- 
emplification. (5) represents the treatment of'e'-'i' 
umlautung as occurring in German verbs like 'gebe', 
'gibst', referring to (Trostg0). 
An ALEP TL rule comes as a four to five place 
PROLOG term, the first argument being the rule 
name, the second a TL description, the third (rep- 
resented by the anonymous variable ) a specifier 
feature structure, the fourth a typed feature struc- 
ture constraining the application of the rule and 
a fifth allowing for linking variables to predeflned 
character sets. 
(5) tlm_rule( 
umlautE_no, 
\[\] \[el \[\] <=> \[\] \['E'\] \[\], 
syn:{ 
constype => stem:{ 
phenol => pho:{ 
umlaut => no } } } ). 
tim_rule( 
umlautE_i_yes, 
\[3 \[i\] \[C\] <=> \[\] \['E'\] \[\], 
syn:{ 
constype => stem:{ 
phenol => pho:{ 
umlaut => (yes,i)} } }, 
\[C in consonants\] ). 
In order to understand the treatment one has to 
bear in mind that there exists the lexical entry in 
(6) to which the TL rules above map to. ,6,\[ ,\]\] 
syn\] cons phenol \[umlaut none&i 
The morphologically relevant information is en- 
coded in 'cons'. It contains two features, 'lemma' 
which encodes the abstract nlorpheme with a cap- 
ital 'E' (this is the basis for a treatment according 
to ((Trost90)) and the feature 'phenol' which en- 
codes phonologically relevant information, in this 
case whether ulnlautung is available or not. 
The values for 'phenol' is a boolean conjunction 
from two sets: sl ={none, no, yes} and s2 = {e, 
i\]. 
The treatment consists of mapping a surface 'e' to a 
lexical 'E' in case the constraint which is expressed 
288 
as a feature structure in the fourth argument holds. 
It says that for the first rule 'e' is nrapped on 'E' if 
the feature 'umlaut' has a value which is 'no'. This 
applies to (6). This handles cases such as 'geb-e'. 
The second rule maps 'i' 'E' if 'umlaut' has the 
value 'yes & i'. This also holds in cases like 'gib-st'. 
One would expect according to (Trost90) that only 
two values 'no' and 'yes' are used. The change has 
been done for merely esthetic reasons. The '2nd 
pers sing' morpheme 'st' e.g. requires an 'umlaut 
= yes' stenr, if the stem is capable of ulnlautung, 
at all which is the case for 'gibst'. In case the stem 
cannot have umlautung (as for 'kommst') 'st' also 
attaches. This makes that uon-unflautung stems 
have to be left unspecified for umlautung, as other- 
wise 'st' could not attach. 'st' can now be encoded 
fox' 'umlaut = no'. 
4.2 Lexicon 
f,exical information is distributed over three lexi- 
cons: 
• A TL lexicon which contains information rele- 
vant for seglnentation, cxdusively. 
• A syntax lexicon. 
• A semantic lexicon. 
The distribution of information over three lexicons 
has a sinrple reason, namely avoiding lexical anrbi- 
guities at places where they cannot be resolved or 
where they have inrpact on eificiency. So, e.g. the 
verbal suffix 't' has lots of interpretations: '3rd pets 
sing', 2nd pers pl', preterite and more. These ambi- 
guities are NOT introduced in the TLM lexicon as 
the only effect would be that a great nunlber of scg- 
mentations would go into syntactic analysis. Only 
then these ambiguities could be resolved on the ba- 
sis of morphotactic intorlnation. A similar situa- 
tion holds on syntactic level. There is no point in 
nmltiplying syntactic entries by their semantic am- 
biguities and make all of these entries available for 
analysis. It would result in a desaster ibr efficiency. 
Semantic reading distinctions thus are put into the 
(semantic) refinement lexicon. We would like to in- 
troduce lexical information for the preposition 'in' 
by way of illustration. 
TL-Entry for 'in': 
(7) 'string \[inl_ \] 
,,,od,,,,,,,,ut ,,ojjjj 
The nrorphological information is encoded in the 
'cons' feature. 
Analysls-Entries for 'in': 
Prepositions \]nay occur in 'strongly bound PPs 
where they are functional elenrents (semantically 
empty, but syntactically relevant). This is encoded 
in (8). A PP headed by a functor cannot be an ad- 
junct (rood=none). The head-dtr chosen by 'in' is 
an NPacc. The mother of such a construction also 
comes out as NPacc which is encoded in 'projects'. 
The major reason for such a treatment lies in the 
fact that it allows for a unified treatment of all 
functional elements like inflectional affixes, com- 
plenrentizers, auxiliaries, infinitival zu, functional 
prepositions etc..). 
(8) 
I c~t / . . \[selects NPacc 
\[. \[rood oo et 
L suDca~ func \[projects NPacc 
(9) is the entry for 'in' as a head of a PP subcate- 
gorizing for an NPacc. 
(9) 
\[ '' y 'c't'Subc tL sub t\[c°' ps<- cc>\] 
Semantic entries for qrlh 
Prepositions need (semantically) different entries 
depending on whether the.p heads a PP which is a 
conlplentent or &it adjunct. 
qn' as complement: 
l subj < > 
The content of a PP is a relational psoa. 
'in' as Adjunct: 
I-qu o s 11 
synlcat "" m°dl'"lc°nt/. . \[pso., \[211// V °-c°n~ Lrest4 z\] \]// 
L r-lm°a A / 
Lsubcat I comp~ < .. I sem \[cont \[4\] > / 
"" c°~'t rd-c°=t \[r~str rrex i~ \]'z" < \[,,~gi \[4\]\] t J > 
r_l)SOa 
289 
The preposition puts its content on a restriction list. 
It composes the restriction list with the restriction 
list of the modified item. Quants and the psoa are 
copied. 
4.3 Word Structure and Phrase Structure 
(PS) 
Both the word structure and the phrase structure 
conlponent are based on the same snlall set of bi- 
nary schenlata closely related to HPSG. In the sys- 
tenl described here they exist as nlacros and they 
are spelt out in category-specific word and phrase 
structure rules. (Efficiency is the major reason, as 
underspecified syntax rules are very inefficient). 
Such a schema is e.g. the following head-comp- 
schema. 
Mother: 
synsem 
\[I c t he'°L Jl\]\] 
syn ,subcat \[compls \[5\] Lsubj\[~\] 
Lsem \[3\] 
Head-dtr: 
ksem \[3\] 
Comp-dtr: 
\[synsem \[J\]\] 
subcat \[compls < \[41115\] > 
\[subj \[2\] 
Head information is propagated from head-dtr to 
mother, so is semantic information. 'subact' infor- 
marion is structured slightly differently as in HPSG 
to allow for one innovation wrt HPSG which is our 
treatment of functional elements. 
Mother: 
SYN \[\] I CONSTR funct_att \] 
L / 
SEM ~ \] 
Head-dtr: 
HEAD I BASE °\[- -....° oj\] 
Functor: l F "EAD  II 
The functor macro is highly general. It shows the 
functor treatment applied in the German gram- 
mar, namely that the functor selects its head-dtr, 
combines itself with the head-dtr and projects the 
mother. More specifically: The functor-dtr, indi- 
cated by the value 'funct' of the attribute 'subcat' 
shares the value of the attribute 'selects' with the 
'synsem' value of the head-dtr and its 'projects' 
value with the 'syn' attribute of the mother. The 
'head' value is shared between head-dtr and mother, 
the 'base' value in addition between head-dtr and 
functor. The subcategorization is shared between 
head-dtr and mother. 
The difference to the head-comp schema is that 
head information comes from the functor, also the 
semantics. 'subcat' is inherited from the head-dtr. 
The powerful mechanism comes in by the subcat- 
feature of the functor which allows for a very de- 
tailed specification of the information to be pro- 
jected. 
The PS component covers all categories, especially 
all clausal categories (main clauses, subordinate 
clauses, relatives), NPs, APs, ADVPs, PPs. 
5 'Efficiency' and Performance 
In this section we would like to address the topic of 
efficiency. A number of points contributing specifi- 
cally to efficiency should be summarized here. 
• ALEP is designed to support efficiency as far as 
the formalism ('lean' approach) is concerned. 
Formal constructs known to be computation- 
ally expensive are not available 3. 
• Refinement (mentioned al- 
ready) is a monotonic appfication of phrase 
structure rules and lexical entries to further 
featurize (flesh-out with features) a finguis6c 
structure, established in analysis. 
If Q1 is the finguistic output structure of the 
analysis, then Q~ is the output structure of're- 
finenlent' if Q1 subsumes Q2, i.e. every local 
tree in Q= and every lexical entry in Q~ is sub- 
sumed by a corresponding local tree and a cor- 
responding lexical entry in Q1. 
Any non-deterministic backtracking algorithm 
(depth-first) is badly effected by ambiguities 
as it has to redo repeatedly large amounts 
of work. In terms of lingware development 
this means that lexical ambiguities have to be 
avoided for analysis. As on the other hand lexi- 
calist theories result in an abundance of lexical 
ambiguities, 'refinement' is a relief. Optimal 
distribution of inforlnation over analysis and 
refinement results in a gain of efficiency by sev- 
eral factors of magnitude. 
,, Head selections: ALEP allows for user-defined 
parsing head declarations as "the most appro- 
3It should have been shown in the precious sections 
that felicitous descriptions are possible anyway. 
290 
priate choice of head relations is grammar de- 
pendent" (Alshtl), p.318. On the basis of the 
user-defined head relations the reflexive transi- 
tive closure over head relations is calculated. It 
has to be made sure that the derived relations 
are as compact as possible. Optimal choice of 
head relations pays off in a gain in efficiency 
by several factors of magnitude. 
• Keys: Keys are values of attributes within 
linguistic descriptions defined by path decla- 
rations. Keys allow for indexation and effi- 
cient retrieval of rules and lexical entries. This 
beconres extremely relevant for larger-scale re- 
sources. A key declaration which the grammar 
developer may do identifies the atomic value 
which is to serve as a key. Optimal keys again 
result in a substantial gain in efficiency. 
• Last not least tuning the grammars with a view 
oil efficiency has contributed to tile current 
performance of the system. 
In the following we would like to give some actual 
figures which may illustrate performance. These 
figures are not meant to be an exact measurement 
as exact measurenrents are not available, in or- 
der to give an indication it may be said that ALL 
the phenomena which increase indeterminism in 
a grammar of German are covered: All forms of 
the articles ('die', 'der') and homomorphous rela- 
tive pronouns, all readings of verbs ( all frames, 
all syntactic realizations of complements), seman- 
tic readings, prepositions and honu)lnorphous pre- 
fixes, PPs as nominal adjuncts, as preadjectival 
complements, as adjuncts to adverbs, as VP ad- 
juncts, valent nouns (with optional complementa- 
tion), all readings of Gerlnan 'sein', coordination, 
N -~ N combinations, relatives, Nachfeld. 
One result of the corpus investigation was that 95% 
of the sentences in the corpus have between 5 and 
40 words. The grammar is able to parse sentences 
with up to 40 words in 120 sees. The following are 
corpus examples containing time-consmning parse 
problems. 
Input: In den Wochen vor Weihnaehten konnte der 
stolze Vorsitzende der zu Daimler-Benz 
gehoerenden Deutsche Aerospace AG ein 
Jahresergebnis, das alle Erwartungen 
uebertraf, verkuenden. 
(Comment: In the weeks before X-mas the proud 
head of the Deutsche Aerospace AG which belongs 
to Daimler-Benz could announce an annual 
statement of accounts which exceeds all 
expectations. ) 
total RWordSeg RLiftAna Refine 
sol : i 34. 450 0.380 34.070 0.000 
Input: Dieser Erfolg ueberrascht in zwei 
Hinsichten. 
(Comment: This success is surprising in two 
respects,) 
total RWprdSeg RLiftAna Refine 
sol : 1 1.910 0.130 1.780 0.000 
For industrial purposes this may still be too slow, 
but we think that the figures show that the system 
is not so far away front reality. 
6 Conclusions 
This paper was about the following aspects of ling- 
ware development: 
e Linguistic felicity and leanness. 
• Leanness and efficiency. 
• Methods for large-scale grammar development. 
• 'Ilolistic' approach. 
We can summarize briefly: 
• ALEP provides all modules and tools from text 
handling to discourse processing (the latter not 
introduced here). The lingware created is es: 
pecially interesting ill that it provides an inte- 
grated design for all the modules. 
• The formalism for lingware development is 
lean, but it provides sufficient means to sup- 
port mainstream felicitous linguistic descrip- 
tions. 
• Efficiency is good 
compared to other unification-based systems. 
It is not yet ready for immediate commercial 
applications, but it is neither very far away. 
• The corpus-based approach to grammar devel- 
opment is the only realistic way to get closer 
to a coverage that is interesting from all appli- 
cation point of view. 
ALEP is a promising platform for development of 
large-scale application-oriented grammars. 
References 
Iliyan Alshawi, Arnold D J, Backofen R, Carter D 
M, Lindop J, Netter K, Pulman S G, Tsujii J and 
Uszkoreit H, (1991), Eurotra ET6/I: Rule Formal- 
i,~m and Virtual Machine Design Study (Final Re- 
port), CEC 1991. 
H. Trost: The application of two-level nrorphology 
to non-concatenative German morphology. Procecd- 
ing.s of COLING-90, Helsinki, 1990, vol.2, 371-376. 
291 
