The Use of Instrumentation in Grammar Engineering 
Norbert BrSker 
Eschenweg 3, 69231 Rauenberg 
Abstract 
This paper explores the usefltllmss of a technique 
from software engineering, (:ode instrumentation, 
tbr the developlnent of large-scale natural  
grammars, hltbrlnation about the usage of graln- 
mar rules in test and corpus sentences is used to 
ilnprove grammar and testsuite, as well as adapting 
a grammar to a specific genre. Results show that less 
than half of a large-coverage grmnmar for German 
is actually tested by two large testsuites, and that 
10 30% of testing time is redundant. This method- 
ology applied can be seen as a re-use of grammar 
writing knowledge for testsuite compilation. The 
construction of genre-specific grmmnars results in 
peribrmance gains of a factor of four. 
1 Introduction 
The field of Computational Linguistics (CL) has 
1ooth moved towards applications and towards large 
data sets. These developments (:all for a rigorous 
methodology for creating so-called lingware: linguis- 
tic data such as lexica, grammars, tree-banks, as 
well as software processing it. Experience fl'om Soft;- 
ware Engineering has shown that the earlier deficien- 
cies are detected, the less costly their correction is. 
Rather than being a post-development effort, quality 
ewduation must be an integral part of development 
to make the construction of lingware more eifieient 
(e.g., cf. (EAGLES, 1996) fbr a general evahmtion 
fl'amework and (Ciravegna et al., 1998) for the ap- 
plication of a particular software design nlethodol- 
ogy to linguistic engineering). This paper presents 
the adaptation of a particular Software Engineering 
(SE) method, illstrmnentation, to Grmmnar Engi- 
neering (GE). Instrumentation allows to determine 
which test item exercises a certain piece of (software 
or grammar) code. 
The paper first describes the use of instrumenta- 
tion in SE, then discusses possible realizations in uni- 
fication grammars, and finally presents two classes 
of al)l)lications. 
2 Software Instrumentation 
Systematic software testing requires a match be- 
tween the test subject (module or comt)lete system) 
and a test suite (collection of test items, i.e., sam- 
ple input). This match is usually computed as the 
percentage of code items exercised by the test suite. 
Depending oll the definition of a code item, vari- 
ous measures are employed, tbr example (cf. (Itet- 
zel, 1988) and (EAGLES, 1996, Appendix B) ibr 
overviews): 
statement coverage percentage of single state- 
ments exercised 
branch coverage percentage of arcs exercised in 
control tlow graph; subsumes statement cover- 
age 
path coverage t)ercentage of 1)aths exercised from 
start to end in control flow graph; subsmues 
branch coverage; impractical due to large (often 
infinite) number of paths 
condition coverage percentage of (simple or ag- 
gregate) conditions evaluated to both true and 
false (on different test items) 
Testsuites are constructed to maximize the tar- 
geted measure. A test run yields information about 
the code items not exercised, allowing the improve- 
ment of the testsuite. 
The measures are autonmtically obtained by in- 
strumentation: The test subject is extended by code 
which records the code items exercised during pro- 
cessing. Afl;er l)rocessing the testsuite, the records 
are used to comlmte the lneasures. 
3 Grammar Instrumentation 
Measures from SE cannot silnl)ly be transferred to 
unification grmmnars, because the structure of (im- 
perative) programs is different fl:om (declarative) 
grmnmars. Nevertheless, the structure of a grmnmar 
(formalism) allows to define measures very similar to 
those employed in SE. 
constraint coverage is the quotient 
# constraints exercised 
Tco n --- # constraint in gralnlnar 
118 
where a (-onsi;raint; may \])e either a 1)hrasc- 
structure or an equational COllSl.l'ailll;; del)('al(l-. 
ing o11 the formalisln. 
disjunction coverage is the quotient 
# disjulml.ions covere(1 
5/'(lis = # disiun('t.ions in grammar 
where a disjunction is (:onsidel:ed (:over(~(l when 
all its all:ernative (li@mcl;s have been set)aral;ely 
exercised. It en(;omlmSSes (:onstraint coverage. 
Optional eonstituenl, s an(l equal;ions have to be 
treated as a disjuncgion of the consgrain(, and 
an empty constraini; (cf. Fig.2 for an examl)le ). 
interaction coverage is the quotient 
-j/: disjuncl; (-omt)inai;ions exercise(1 
)inl' = # legal di@m('l. (:oml)inati()ns 
where a (lisjunct Colnbinal;ion is a ('omI)h'l;e sel. 
of choi('es in l;he (lisjun(:tfions which yiehls a well- 
forlned grmmnal;i('at sl;ru(;l;ure. 
As with path coverage, the set of legal (li@mct 
Colnbination typically is intinite (hie to rc('ur- 
sion. A solution from SE is to restri(:t 1;111! use 
of recursive rules to a fixed llUllll)er ()f (;;ts(',s, for 
exaint)le not using t.hc rule at all, and using il; 
ollly OllCe. 
The goal of insl;rllnlenl;al;ion is 1;o el)lain inf()rma- 
li()li a})()lll, which test cases (~xer(:ise wlfi(:lt gl';'/llll11,:|l' 
(-onstraint.s. One way 1;o re(:or(1 lifts infornmlion is 
to exlend l,he parsing alg()rithm. Another way is 
~o use 1:he gralmnar formalisln il.qelf Io i(lc,,l.ify lhe 
disjun(:l;s, l)el)elMing on the (!xl)ressivits- ()f l;he f()r- 
realism used, th(; following 1)ossil)ilil;ies exisl:: 
atomic features Assmning a uni(lue mmfl)ering 
of (tisjuncts, an annotal;ion ()f ghe form 
DISJUNCT-nn = + can be used for marking. To 
delx;rmine whether a (-ertain disjun(:l; was use(t 
in consl~ru(;til~g a sohttion, one only nee(Is to 
check whether the associate(l feal;m'e occurs (at 
some level of embedding) in the solut.i(m. 
set-valued features If set-valued f(~al;ures are 
availal)le, one can use a sel;-valued fl~a- 
lure DISJUNCTS to co\]le(;i; ai.onli(" sym- 
1)ols tel)resenting one disjunct each: 
DISJUNCT-nn ~ DISJUNCTS, whi('h might, 
ease |;he collection of exereise(l (lisjuncl, s. 
multiset of symbols To recover the number of 
times a disjunct is used, one needs I;o leav(; the 
uniiication l)aradignl, l)ecause it is very difficult; 
1;o counl; wiLh unitical;iol~ grammars. \Ve have 
use(l a special feal;ure of our gramntar (levelol)- 
ment ei:vironment: Folh)wing the LI"G sl)iril; ()f 
(lifl'erent l)roje(:tions, it t)r()vi(les a t)r()je(:t.ioll for 
VP -~> 
VP ~- 
v +=1"; 
NP? $= (I" OBJ); 
PP* { ;=(t(mE); 
I $~ (T ADJUNCT); }. 
Figure 1: Sample Rule 
V 
{ e 
\]NI' 
(~ 
IP1)+ 
+=I"; 
\])ISJUNCT-001 C_ o.; 
+= (t oBJ) 
DISJUNCT-002 { o*; } 
\])ISJUNCT-003 E o*; 
{ $= (I" OBL) 
DIS,\]UNCT-004 C o,; 
I .l.e (t ADJUNCT) 
DIS,JUNCT-005 c o,;} }. 
Figure 2: Instl'mnenl.ed rule 
synfl)olic marl:s, whit:h is fornmlly equivalenl. 1o 
a nmll.isel, of s3"ml)ols associate(t wit.h the con> 
l)lel.e s()lut, ion (stru(:~,ural embedding I)\]ays no 
role; see (Franl¢ et al., \].998) f()r al)l)lical.ions). 
In I.his way, we can ('ollecl; fronl l.he reel; node of 
c'ach solution the set of all (tisjun(:ls exer(:ised, 
Ix)gel;her wil;h a usage eount.. 
Consider the LFG granunar rule in Fig.1. ~ Con- 
sl.raint (:()verage would require tesl; items such tlmt 
every (:alegory in t.he VP is exer(:ised; a sequence of 
V NP PP would sutli(:e for this measure, l)isjun(:lion 
('overage also requires 11) t.ake lh(! (unpty (lisjun('ls 
into a(:(tOUlll.: NP ;m(l PP are Ol)l.ional , s() i;hal, four 
ilems are neexh~(l 1.o achiewe full (lisjuncl.ion c()ver- 
age on 1.he phrase sl.ru(;lm'e imrl. of l he rule. 1)/1o \[.() 
l he (li@m(ti(m ill l.he PP retool alien, 1Ave more t.esl. 
items are requh'e(l (;o achieve full (lisjuncl;ion cover- 
age on Lhe (-Oml)h!t(; rule. Fig.2 shows lhe rule from 
Fig.1 with insta'ument, ation. 
4 Grmnmar and Testsuite 
Improvement 
l'laditionally, a tests,rite is used 1.o hill)rove (or 
mainl;ain) a gramnmr's quality (in terms of (:over- 
age an(l overgenerali(m). Using insi;rumenl.al;ion, 
one may exten(1 this usage l)y looldng for sources of 
1 Although the saml)h! rule m'e in the format of I,FG, noth- 
ing of the mc'th()d()logy relies (m the choice of linguistic (n" 
computal.ional 1)aradignL The notation: ?/*/+ represent ot)- 
tionality/iteration including/exchtding zero occurrences on 
categories, e rel)resents the eml)ty string. Annotations to 
a cat(.'gory sl)ecify equality (=) o1" ,'~(!t membershi t) (C) of fea- 
ture values, or non-existel,ce of i~aturcs (~); they are terlni- 
nat(!d l)y a s(,micolon (;). Disjunclions are given in I)ra(:(!s 
({-" "1-'' })' I" (4-) ""' ,,.,t,~x',,,'i,,bl°.~ ,','prc.~,.,ti,,g t.l,,, f,,~,- 
lure st.ruclur(! corresponding t.o the mother ((laugh/.er) of th.e 
rule. o, (for optimalil.y) represents the sent.ence's multi-set 
valued .%,mbolic projc'ction. Com,nents are enclosed in quo- 
lati(m marks ("... "). Cf. (K:tplan and Bresnan, 1982) for an 
intro(lucti()n to 1,1,*14 notation. 
119 
overgeneration (cf. Sec.4.3), and may also improve 
the quality of the testsuite, in terms of coverage (of. 
See.4.1) and economy (el. See.4.2). 
Complementing other work on testsuite construc- 
tion (cf. Sec.4.4), I will assume that a. grammar 
is already available, and that a testsuite has to be 
constructed or extended. While one may argue that 
grmnmar and testsuite should be developed ill paral- 
lel, such that the coding of a new gralmnar disjunct 
is accompanied by the addition of suitable test cases, 
and vice versa, this is seldom the case. Apart from 
the existence of grmnmars which lack a testsuite, 
there is the more principled obstacle of the evolu- 
tion of the grmnmar, leading to states where previ- 
ously necessary rules silently loose their useflflness, 
because their flmction is taken over by some other 
rules, structured differently. This is detectable by 
instrumentation, as discussed in See.4.1. 
On the other hand, once there is a testsuite, it has 
to be used economically, avoiding redundant tests. 
Sec.4.2 shows that there are different levels of re- 
dundancy in a testsuite, dependent on tile specific 
grammar used. Reduction of this redundancy can 
speed Ul) the test; activity, and give a clearer picture 
of the grammar's pertbrmance. 
4.1 Testsulte Completeness 
If the disjunction coverage of a testsuite is 1 for some 
grammar, the testsuite is complete w.r.t, this gram- 
mar. Such a testsuite can l'eliably be used to mon- 
itor changes in the gramlnar: Any reduction ill the 
grammar's coverage will show Ul) ill the failure of 
some test case (for negative test cases, cf. Sec.4.3). 
If the testsuite is not complete, instrumentation 
can i(lentii\[y disjuncts which are not exercised. These 
might be either (i) approl)riate, but tmtested, dis- 
juncts calling for the addition of a test case, or (it) ill- 
appropriate disjuncts, for which a grammatical test 
case exercising them cannot be constructed. 
Checking completeness of our local testsuite of 
1.787 items, we found that only 1456 out of 3730 
grammar disjuncts in our German grammar were 
tested, yielding T, tis = O.39 (the TSNLP testsuite 
containing 1093 items tests only 1081 disjuncts, 
yielding Tdi, = 0.28). 2 Fig.3 shows an example 
of a gap in our testsuite (there are no examples of 
circulnpositions), while Fig.4 shows an inal)l)roppri- 
ate disjunct thus discovered (the category ADVadj 
has been eliminated in the lexicon, but not in all 
rules). Another error class is illustrated by Fig.5, 
which shows a disiunct that can never be used due 
to an LFG coherence violation; tile grmnmar is ill- 
consistent here. a 
2There are, of course, m~parsed but grammatical test cases 
in both testsuites, which have not been taken into account 
in these figures. This exl)lains the dill'ere,lee to the overall 
number of 1582 items in the German TSNLP testsuite. 
3Tcst cases using a free dative pronoun may be in the test- 
PPstd =~ P1)rae $=$; 
NPstd $= (I" OBJ); 
{ e DISJUNCT-011 C o.; 
I Pcireum $=?; 
DISJUNCT-012 C o* 
"mmsed disjulmt"; } 
Figure 3: Appropriate untested disjunct 
ADVP { { e DISJUNCT-021 C o*; 
I ADVadj $=$ 
DISJUNCT-022 E o* 
"unused disjunct"; } 
ADVstd $=1" 
DISJUNCT-023 C o* 
"unused dis iunct"; } 
I .,. ). 
Figure 4: hmppropriate disjunct 
4.2 Testsuite Economy 
Besides being coml)lete , a testsuite must be econom- 
ical, i.e., contain as few items as 1)ossible. Instru- 
nmntation can identify redundant test cases, where 
re(lundaney can be defined in three ways: 
similarity There is a set of other test cases which 
jointly exercise all disjunct which the test case 
under consideration exercises. 
equivalence There is a single test case which ex- 
ercises exactly the same combination(s) of dis- 
juncts. 
strict equivalence There is a single test case 
which is equivalent to and, additionally, exer- 
cises the disjunets exactly as oft(m as, the test 
case under consideration. 
Fig.6 shows equivalent test cases found in our 
testsuite: Example 1 illustrates the distinction be- 
tween equivalence and strict, equivalence; the test 
cases contain different numbers of attributive adjec- 
tives. Example 2 shows that our grammar does not 
make any distinction between adverbial usage and 
secondary (subject or object) predication. 
The reduction we achieved in size and processing 
time is shown in Table 1, which contains measure- 
lnents for a test run containing only tile 1)arseable 
test cases, one without equivalent test cases (for ev- 
ery set of equivalent test cases, one was arbitrar- 
ily selected), and one without similar test cases. 
The last was constructed using a siml)le heuristic: 
Starting with the sentence exercising the most dis- 
juncts, working towards sentences relying on fewer 
disjuncts, a sentence was selected only if it exercised 
a disjunct wtfich no previously selected sentence ex- 
ercised. Assulning that a disjnnct working correctly 
suite, but receive no analysis since the grmmnatical fimction 
FREEDAT is not defined as such in the configuration section. 
120 
VPargs =5 ,., 
I PRONstd 
I ... }. 
$= (T FI(EEDAT) 
($ CASE) = dat 
($ PI{ON-TYPE) = 1)ers 
~(q- ()>.J2) 
DISJUNCT-041 C- o* 
"unused disjun(;t" ; 
Figure 5: hlconsist,ent, disjunct 
1 tin guter alter Wein 
tin guter all;er trockener \Vein 
'a good old (dry) wine.' 
2 Er ifit das Schnitzel roh. 
Er iBt das Schnitzel na(;kt, 
Er it.It das Schnitzel s(:hnell. 
'lie cats the ,~chnitzel nakcd/ra.lv/q.uictcIy.' 
Figure 6: Sets of equivaleul; t,est cases 
once will work corre('tly more than ()11(;(~: we di(1 nol; 
(:onsider st.rict, equivalence. 
We envisage the following use of this redundancy 
detec|:ion: '1'here ch~al'ly ar(; linguist;i(: r(~asolls l;o dis- 
t.inguish all 1;est cases ill (~xaml)le 2, s() l;hcy (:almol 
simply be delel;cd from the t(~'st;suit, e. Ilath(.'r, t.heir 
equivalence indicates that. the grammar is not 3,eli 
perfect. (or never will be, it' it, remains l/urely syn- 
tactic). Such equivalences couhl be int,erl)reted as 
a r(mfinder which linguistic distinclions need to lie 
incorl)orated into the grammar. Thus, 1;his lev(q ()f 
r('(hm(lancy may drive your grammar d(w(~h)l)ment 
agenda. The h~vcI of c(tuivalellc(~ C;/ll l)e t;/k(}li its 
a limited int(wacl;ion lesl: '\]'h('s(' lesl: ('as('s rcl)r(> 
scnl; one~ (-()ml)h'.lx~' s(~lecl;ion of grammar disiml(:l,s , 
and (given l,hc grammar) lhere is nolhing we can 
gain 1)y checking a test case if an equivalenl; one was 
tested. Thus, this level of redundan('y may 1)e used 
for ensuring the quality of gramlnar changes prior 
t,o their incorporation into the t)roducl~ion version of 
t,he grammar. The level of similarit.y (:onl;ains much 
le.~ss l,est cases, and does not t,esl, any (systenml, ic) 
intera(:tion 1)et.ween disjuncts. Thus, it may 1)1; use(1 
during (levch/1)ment. as a quick ru\]e-ofthunll) 1)rote - 
dure detecting serious errors only. 
test relative runtime relative 
eases size (set) nmtime 
TSNLP testsuite 
l)arseable 1093 100% 1537 100% 
no equivalents 783 71% 665.3 43% 
no similar cases 214 19% 128.5 8% 
local testsnite 
\])arseable 1787 100% 1213 100% 
no equivalents 1600 89% 899.5 74% 
no similar cases 331 18% 175.0 ld% 
~1~d)le 1: Reduction of ~l~stsuites 
l)er Test t'~illC leicht. 
\])ie schhdL, n. 
Man s(:htat~n. 
Dieser schlafen. 
Ich schhffen. 
\])er schlafen. 
\]crier schlafen. 
l)erienige schhffen. 
.lener schlafen. 
Keiner schlafen. 
\])erselbe schlaflm. 
Er schlafen. 
Irgendjemand schlafen. 
Dieselbe schlat~n. 
Das schlafen. 
Eines schlafen. 
Jede schlafen. 
Dieses schlaf'en. 
Eine schlafen. 
Meins schlafen. 
Dasjenige s(;hlafen. 
Jedes schlaflm. 
Die.ienige schlat'en. 
Jenes schlafen. 
Keines schlafen. 
Dasselbe schlafen. 
Figure 7: S(mi;ence~s relying on SUSl)icious di@mct 
4.3 Sources of Overgeneration 
To cont,rol overgenel'a.tion, al)l)ropriately marked un- 
grammati('al sentences are iml)(n'tant in every test- 
suite, lnsl;rulnentation as 1)rol)osed here only looks 
at successful parses, but. can sgi\]l l)e aI)l)lied in this 
C()lll;(?xt: If ~/11 llllgfalllIll~/l;ieal t.est. (;ase recuives all 
analysis, insl;rumeld;at, ion informs us a})ouI, t,he dis- 
julmtS used in the incorrect, analysis. One of these 
(lis.juncts must lie incorrect, or the sentence would 
not. have receiv(xt a solution. We exploit, this infor- 
mati(m by aecumulat.ioll across the entire l;est suite~ 
looking tot (lisjuncts t,hat al)t)ear in mmsually high 
1)report.ion in l)arseable mlgranmmtical test. cases. 
In t:his rammer, six grammar disjuncts are singled 
oul. \])y the l)arseal)h~ mlgramlnat,ical t.est cases in 
th(~ TSNLI ) t(,sIsuite. The rues1 l)rominen|; di@m(:t. 
al)l)ears in 26 senl(~n(;(> (list.e(t in Fig.7), of which 
lhe top left group is in(l(>d grmmnali('al and t h(~ 
rest fall int.() Fw(/ (:lasses: A partial V1 ) with object 
NP, inlert)reted as an imt/(n'at,iv(~ sentence (1)el;tom 
left), and a weird interaction with the tokenizcr in- 
correctly" handling cal)it.alization (right. groul)). 
15tr fl'om being conclusive, t,hc similarity of these 
s(nlt.ences derived from a suspicious grammar dis- 
junct, and the ('lear relation of the senten(-es to only 
tw(/exact.ly Sl)ceifial)le graminar errors make it 1)lau- 
sil)le that this approach is very i)rolnising ill detect- 
ing the sources of ovcrgener~tion. 
4.4 Other Al)l)roaches to Tcstsuite 
Construction 
The delicacy of testsuite construction is acknowl- 
edged in (EAGLES, 1996, I).37). Although t.here 
are a mnnber of eflbrts to construct reusable test- 
suites, none has to my knowledge exl)lored how ex- 
ist.ing grammars can l)e exl)loited. 
Starting wit.h (Flickinger el; al., 1987), |;(;si.suites 
have l)ecll (\[rawn 111) fix)In a linguistic viewpoint, in- 
for'm, cd by \[lhc I study of linguistie,s and \[reflecting\] 
the 9'ram'm, atical issues that linguists h, avc concerned 
them,selves with, (Flickinger et al., 1987, p.4). A1- 
121 
though the question is not explicitly addressed in 
(Balkan, 1994), all the testsuites reviewed there also 
seem to follow the same methodology. The TSNLP 
project (Lehmann and Oepen, 1996) and its succes- 
sor DiET (Netter et al., 1998), which built large nml- 
tilingual testsuites, likewise fall into this category. 
The use of corpora (with various levels of mmota- 
tion) has been studied, but the reconmmndations are 
that much manual work is required to turn cori)us 
examples into test cases (e.g., (Balkan and Fouvry, 
1.995)). The reason given is that corpus sentences 
neither contain linguistic 1)henomena in isolation, 
nor do they contain systematic variation. Corpora 
thus are used only as an inspiration. 
(Oepen and Flicldnger, 1998) stress the inter- 
dependence between application and testsuite, but 
don't comment on the relation between grammar 
and testsuite. 
5 Genre Adaptation 
A different al~t)lication of instrumentation is the tai- 
loring of a general grammar to specific genres. All- 
purpose grammars are 1)lagued by lexical and struc- 
tural aml)iguity that leads to overly long mmtimes. 
If this ambiguity could be limited, parsing efficiency 
would iml)rove. Instrunmnting a general grammar 
allows to automatically derive specialized subgrmn- 
mars based on sample corpora. This setup has sev- 
eral advantages: The larger the overlap between gel> 
res, the larger the portion of grammar development 
work that can be recycled. The all-lmrpose grammar 
is linguistically ltlore interesting, because it requires 
an integrated concept, as oI)posed to several sepa- 
rate genre-specific grammars. 
i will discuss two ways of improving the efficiency 
of parsing a sub, given an all-purpose uni- 
fication gramnmr. The first consists in deleting un- 
used disjuncts, while the second uses a staged pars- 
ing process. The experiments are only sketched, 
to indicate the apl)licability of the instrumentation 
technique, and not to directly compete with other 
proposals on grmnnmr specialization. For example, 
the work reported in (Rwner and Smnuelsson, 1994; 
Samuelsson, 1994) diifers from the one presented be- 
low ill several aspects: They induce a grammar from 
a treebank, while I propose to mmotate the gram- 
mar based on all solutions it produces. No criteria 
for tree decomposition and category specialization 
are needed here, and the standard parsing algorithm 
can be used. On the other hand, the efficiency gains 
are not as big as those reported by (Rayner and 
Salnuelsson, 1994). 
5.1 Restricting the Grammar 
Given a large sample of a genre, instrunmntation al- 
lows you to determine the likely constructions of that 
genre. Elinfinating unused disjuncts allows faster 
Descriptor Content Coverage 
HC-DE 
WHB 
NEWS 
NEWS-SC 
Copier/Printer User Manual 89% 
Car Maintenance Instructions 76% 
News (5-30 words per sentence) 42% 
Verb-tinal subclauses from News 75% 
Table 2: Corpora used for adaptation 
parsing due to a smaller gralmnar. An experiment 
was conducted with several corpora as detailed in 
Table 2. There was some eft'oft to cover the corpus 
HC-DE, but no grammar development based on the 
other corpora. The NEWS-SC corpus is part the 
corl)uS of verb-final sentences used by (Beil et al., 
1999). 
A training set of 1000 sentences froln each cor- 
pus was parsed with an instrumented base gram- 
mar. From the parsing results, the exercised gram- 
mar disjuncts were extracted and used to construct 
a corl)us-specific reduced grammar. The reduced 
grammars were then used to parse a test; set of an- 
other 1000 sentences Dora each corpus. TaMe 3 
shows the lmrt'ornmnce ilnprovement on the corpora: 
It gives the size of the grammars in terms of the 
number of rules (with regular expression right-hand 
sides and feature annotation), the number of arcs 
(corresponding to unary or binary rules with dis- 
junctive feature annotation), and the number of dis- 
juncls (unary or binary rules with tmique feature 
annotation). The number of mismatches counts the 
sentences for which the solution(s) obtained differed 
fl'om those obtained with the base gramnmr, while 
the number of additi(ms counts the selltellces which 
{lid not receiw; a 1)arse with the base grannnar due 
to resource limitations (runtinle or memory), but re- 
ceived one with the reduced granmmr. The other 
cohnnns give timings to l~rocess the total corlms, 
and the longest and average processing time per sen- 
ten(e; time is in seconds. The last cohmm gives the 
average nmnber of solutions per sentence. 
Due to the sampling of a genre, the grammars 
obtained can only be approximate. To deternfine 
the relation of the smnple size to the quality of the 
grmnmar obtained, the coverage of random fragment 
gram'mars was measured in the tbllowing way: Ran- 
domly select a nmnber of sentences fl'om the to- 
tal corpus, construct (in the same way as described 
aloove for the reduced grammar) a fragment grmn- 
mar, and deternfine its coverage on the test set fl'om 
the respective corpus. The graphs in Fig.8 show 
how the coverage and runtime relate to the number 
of sentences on which the fragnmnt granunars are 
based. The leftmost data point (x value 0) describes 
the performance of the reduced gramlnar on the 
training set, while the rightmost data point describes 
its perfbrmance on the test set. The data points in 
\])etween represent fl:agment grammars based on as 
122 
o~ ~S qk- ~ n 
g 
"Y~ ~,~ . 
ca.. bid o? 
m g 
~a 
bb 
Corl)us IIC-\])E 
1)ase grammar 185 3669 1.1564 n/~t l:/a \[ 7692.d >300 7.1 10.1 
rcducc'd graminar (938) 112 960 3739 0 1 J 2089.4 162.7 1.9 17.6 
195 3728 \] 1606 11/}t 1<128.!) >300.3 1.5 
53<1 3072 1 <144.2 11.3 0.4 
Corpus \VIIB 
1)ase grammar 
r(xlu(:ed glU/llllll}ll" (559) 
Tal)le 3: \]'erli)rman('e of r(.'(hleed grallllllars 
lOO \[ q 
00 
60 
° t~ 
-- ~ IIC\[~)E 
NEWS-SC - 
WIIB 
k A I 
200 300 400 500 000 700 
nLm/berolsanloncesusedforgfainmarconstruction 
3000 - 
2500 - 
2000 
1500 ~: 
1ooo 
501 " 
0 
HC.~E 
W)IB - 
NEWS~SC ' 
1 O0 200 300 400 500 600 700 800 
nut::be) of sentences used lot g\[arnmar construction 
Figure 8: l)erf()rnmnc(~ ()f fl';tgllient; graltllnars 
IlU/ll 3" ,%elll:ellces .:is giv(}n \])y (h(} x axis vahl(!. 
'.File result, s rel)orlx'(1 here r(~l)l'(LSOlll, (,he minimal 
l)el'l~)l:lllallc(; g;Iill duo Lo (;lit; t':t(;l; 1;tl;II, LII(, COllS(;l'llC- 
l;ion of reduced ~/11(l t'lU/~lll(}ll{, ~I'}tllIIlI}U'S life lI()J, 
based on (.he corre('l, solul;i(ms for the (,raining ,qell- 
l;ellce,q, })Ill; l';tgh(~d ' 011 all solulions l)rodu(:ed 1)y (;he 
base grammar. The (:OllSt, rucl;ion of a lart~e-s(:ale 
(;reel)ank with manually veriiie(l solutions is un(h!r 
way but has nol; )'el. 1)rogjresse(l far enough (;() serve 
as input for this ext)erimeld;. Even with this systenl- 
atic, but (:urable error, (;lie reduction reduces overall 
processing by a factor of four. The mmd)er of solu- 
tions is constant becaus(~ only unused disjuncts are 
eliminated; this will change if the treebank solutions 
are used (;o construct l;he redu(:od gl'~lllllllat'. 
5.2 Staged Parsing 
Even eliminat, ing only unlil:ely disjunets necessarily 
redllces L\]Io coverage of the gramnmr. A sequence of 
l)arsing stages allows one to profit front a small and 
fast; granmmr as well as from a large and slow one. 
S~age(t l)arsing applies difl'erent grammars one after 
the other to the inlmt, m:(;il one yields a solution, 
which terminates the l)rocess. In our case, a gram- 
mar of sl;age 'l~, q- 1 in(:ludes the grammar of stag0 t~, 
1)ttl; this nee(1 not be t:he case in gener;d. 
"1'() r(}(lu(x' the v;u'ial)iliLy for an (}Xl)(;rimenL: I as- 
,SlllIIO (;}ll.'ee s{;,:/.~(}s: Tit('. :\[irst, ill('hld('s frequcnl;ly used 
di,~jun('I;s, Idle s{)COll(i illfFt)qll(}llt di@m(:ts, alt(l l:h0 
thir(1 unu.~ed disjuncts. This ensur(?,~ (;he fllll (x)vt,r- 
age of the base grammar, \]ml; allows lo focus on fre- 
(lu(m(. con.sl:ru(q,ions in th(, first parsing stage. The 
t)rt)(:t;dure is similar as \])cfore: l"rom (.he solutiollS 
of a Lraining sol., ;t staged .qIYt?Itlltitl" iS construc.lx:d. 
()urrel~tly, exl)erimenI;s are l)erforlned (;o dei;ermine 
a llseflll detini(;ion of 'frequellt, ly used'. Indel)endent 
from the ac(,ual performance gains finally obtained, 
the apl)lication of instrulnentation allows a system- 
atie exploration of the possible configurations. 
5.3 Other approaches to grammar 
adal)tation 
(I{ayner and Samuelsson, 1994; Ilayner and Carter, 
1996; Sanmelsson, 1994) present a grammar Sl)eeial- 
izal;ion (,eclmique for unification gran:Inars. Fronl a 
tl'eebanl: of the sublanguagc, they induce a special- 
ized gramnlar using fewer 're, acre ~"ltlc,s" which col 
respond to the application of several original rules. 
They report an average speed-ul) of 55 for only the 
parsing phase (taking lexical lookup into accomlt, 
the sl)eed-up fael;or was only 6 10). I)ue to (:he 
(lerival.iOll ()f J;\]le ~rallllll~/r frOllt a corl)llS Sample, 
123 
they observed a decrease ill recall of 7.3% and an 
increase of precision of 1.6%. Tile differences to the 
approach described here are clear: Starting from the 
grammar, rather than from a treebank, we annotate 
tile rules, rather than inducing them from scratch. 
We do not need criteria for tree decomposition and 
category specialization, and we can use the standard 
parsing algorithm. On the other hand, the efficiency 
gains are not as big as those reported by (Rayner 
and Carter, 1996) (but note that we cannot measure 
ilarsing times alone, so we need to coral)are to their 
speed-up factor of 10). And we did not (yet) start 
from a treebank, but froln the raw set of solutions. 
6 Conclusion 
I have 1)resented the adaptation of code instrmnenta- 
tion to Grammar Engineering, discussing measures 
and iml)lementations, and sketching several applica- 
tions together with preliminary results. 
The main application is to iml)rove grammar and 
testsuite by exl)loring the relation between both of 
them. Viewed this way, testsuite writing can ben- 
efit from grammar developnlent because both de- 
scribe the syntactic constructions of a natural lan- 
guage. Testsuites systematically list; these construc- 
tions, while grammars give generative procedures 
to construct them. Since there are currently many 
more grammars than testsuites, we may re-use the 
work that has gone into the grmnmars for the im- 
provement of testsuites. 
Other al)l)lications of instrumentation are possi- 
1)le; genre adal)tation was discussed in some depth. 
On a more general level, one may ask whether other 
methods fl'om SE may fruitflflly al)ply to GE as well, 
1)ossibly in modified form. For example, the static 
analysis of programs, e.g., detection of unreachable 
code, could also be applied for grammar develop- 
ment to detect unusable rules. 

References 

L. Balkan and F. Fouvry. 1995. Corpus-based test 
suitc generation. TSNLP-WP 2.2, University of 
Essex. 

L. et al. Balkan. 1994. Test Suite Dcsign Annotation 
Scheme. TSNLP-WP2.2, University of Essex. 

F. Beil, G. Carroll, D. Prescher, S. Riezler, and 
M. Rooth. 1999. Inside-outside estimation of a 
lexicalized PCFG for german. In Proc. 37th Annual Mecting of the ACL. Maryland. 

F. Ciravegna, A. Lavelli, D. Petrelli, and F. Pianesi. 
1998. Developing  reesources and al)pli- 
cations with GEPPETTO. In Prec. 1st Int'l Conf. 
on Language Resources and Evaluation, pages 
619 625. Granada/Spain, 28-30 May 1998. 

EAGLES. 1996. Evaluation of Natural Language 
Processing Systems. Final Report EAG-EWG- 
PR.2. 

D. Flickinger, J. Nerbonne, I. Sag, and T. Wa- 
sow. 1987. Toward Evaluation of NLP Systems. 
Hewlett-Packard La.boratories, Pale Alto/CA. 

A. h'ank, T.H. King, J. Kuhn, and J. Maxwell. 
1998. Ol)timality theory style constraint rank- 
ing ill large-scale LFG grammar. In Prec. of 
the LFG98 Confcrcnce. Brisbane/AUS, Aug 1998, 
CSLI Online Publications. 

W.C. Hetzel. 1988. The complete 9uide to s@ware 
testing. QED Infommtion Sciences, Inc. Welles- 
ley/MA 02181. 

R.M. Kaplan and J. Bresnan. 1982. Lexical- 
functional grmmnar: A formal system for gram- 
matical rel/resentation. In J. Bresnan and R.M. 
Kaplan, editors, The Mcntal Representation of 
Grammatical Rclations, pages 173-281. Cam- 
bridge, MA: MIT l?ress. 

S. Lehmann and S. Oepen. 1996. Tsnlp - test suites 
for natural  processing. In Proc. 16th, 
International Conference on Computational Linguistics, pages 
711-716. Copenhagcn/DK. 

K. Netter, S. Armstrong, T. Kiss, J. Klein, and 
S. Lehman. 1998. Diet; - diagnostic and eval- 
uation tools for :all) applications. In Prec. 1st 
Int'l Conf. on Language I~esov, rces and Eval'ua- 
tion, pages 573 579. Granada/Spain, 28-30 May 
1998. 

S. Oct)ca and D.P. Flickinger. 1998. Towards sys- 
tematic grmnmar profiling:test suite techn. 10 
years afro. Journal of Computer Speech and Lan- 
guage , 12:411 435. 

M. Rayner and D. Carter. 1996. Fast parsing using pruning and grammar specialization. In Proc. 34th Annual Meeting of thc ACL. Santa Cruz, 
USA. 

M. IXayner and C. Sanmelsson. 1994. Corpus-based 
grmmnar specia.lization for fast analysis. In M.- 
S. AgnSs, H. Alshawi, I. Btrean, D. Carter, and 
K. Ceder, editors, Spohen Langv, age Translator: 
First- Year Report, pages 41-54. Report CRC-043, 
Cmnbridge/UK: SllI International. 

C. Samuelsson. 1994. Grammar specialization 
through entropy thresholds. In Proc. 32nd Annual 
Meeting of the ACL. 
