Improving Testsuites via Instrumentation 
Norbert BrSker 
Eschenweg 3 
69231 Rauenberg 
Germany 
norbert, broeker@sap, com 
Abstract 
This paper explores the usefulness of a technique 
from software engineering, namely code instrumen- 
tation, for the development of large-scale natural 
 grammars. Information about the usage 
of grammar rules in test sentences is used to detect 
untested rules, redundant test sentences, and likely 
causes of overgeneration. Results show that less 
than half of a large-coverage grammar for German is 
actually tested by two large testsuites, and that i0- 
30% of testing time is redundant. The methodology 
applied can be seen as a re-use of grammar writing 
knowledge for testsuite compilation. 
1 Introduction 
Computational Linguistics (CL) has moved towards 
the marketplace: One finds programs employing CL- 
techniques in every software shop: Speech Recogni- 
tion, Grammar and Style Checking, and even Ma- 
chine Translation are available as products. While 
this demonstrates the applicability of the research 
done, it also calls for a rigorous development 
methodology of such CL application products. 
In this paper,lI describe the adaptation of a tech- 
nique from Software Engineering, namely code in- 
strumentation, to grammar development. Instru- 
mentation is based on the simple idea of marking 
any piece of code used in processing, and evaluating 
this usage information afterwards. The application 
I present here is the evaluation and improvement of 
grammar and testsuites; other applications are pos- 
sible. 
1.1 Software Engineering vs. Grammar 
Engineering 
Both software and grammar development are simi- 
lar processes: They result in a system transforming 
some input into some output, based on a functional 
specification (e.g., cf. (Ciravegna et al., 1998) for the 
application of a particular software design method- 
ology to linguistic engineering). Although Grammar 
1The work reported here was conducted during my time 
at the Institut fiir Maschinelle Sprachverarbeitung (IMS), 
Stuttgart University, Germany. 
Engineering usually is not based on concrete specifi- 
cations, research from linguistics provides an infor- 
mal specification. 
Software Engineering developed many methods to 
assess the quality of a program, ranging from static 
analysis of the program code to dynamic testing of 
the program's behavior. Here, we adapt dynamic 
testing, which means running the implemented pro- 
gram against a set of test cases. The test cases 
are designed to maximize the probability of detect- 
ing errors in the program, i.e., incorrect conditions, 
incompatible assumptions on subsequent branches, 
etc. (for overviews, cf. (Hetzel, 1988; Liggesmeyer, 
1990)). 
1.2 Instrumentation in Grammar 
Engineering 
How can we fruitfully apply the idea of measuring 
the coverage of a set of test cases to grammar devel- 
opment? I argue that by exploring the relation be- 
tween grammar and testsuite, one can improve both 
of them. Even the traditional usage of testsuites to 
indicate grammar gaps or overgeneration can profit 
from a precise indication of the grammar rules used 
to parse the sentences (cf. Sec.4). Conversely, one 
may use the grammar to improve the testsuite, both 
in terms of its coverage (cf. Sec.3.1) and its economy 
(cf. Sec.3.2). 
Viewed this way, testsuite writing can benefit from 
grammar development because both describe the 
syntactic constructions of a natural . Test- 
suites systematically list these constructions, while 
grammars give generative procedures to construct 
them. Since there are currently many more gram- 
mars than testsuites, we may re-use the work that 
has gone into the grammars for the improvement of 
testsuites. 
The work reported here is situated in a large coop- 
erative project aiming at the development of large- 
coverage grammars for three s. The gram- 
mars have been developed over years by different 
people, which makes the existence of tools for navi- 
gation, testing, and documentation mandatory. Al- 
though the sample rules given below are in the for- 
mat of LFG, nothing of the methodology relies on 
325 
VP~V $=T; 
NP?$= (I" OBJ); 
PP* {$= (T OBL); 
156 ($ ADJUNCT);}. 
Figure 1: Sample Rule 
the choice of linguistic or computational paradigm. 
2 Grammar Instrumentation 
Measures from Software Engineering cannot be sim- 
ply transferred to Grammar Engineering, because 
the structure of programs is different from that of 
unification grammars. Nevertheless, the structure 
of a grammar allows the derivation of suitable mea- 
sures, similar to the structure of programs; this is 
discussed in Sec.2.1. The actual instrumentation of 
the grammar depends on the formalism used, and is 
discussed in Sec.2.2. 
2.1 Coverage Criteria 
Consider the LFG grammar rule in Fig. 1. 2 On first 
view, one could require of a testsuite that each such 
rule is exercised at least once. ~rther thought will 
indicate that there are hidden alternatives, namely 
the optionality of the NP and the PP. The rule can 
only be said to be thoroughly tested if test cases exist 
which test both presence and absence of optional 
constituents (requiring 4 test cases for this rule). 
In addition to context-free rules, unification gram- 
mars contain equations of various sorts, as illus- 
trated in Fig.1. Since these annotations may also 
contain disjunctions, a testsuite with complete rule 
coverage is not guaranteed to exercise all equation 
alternatives. The phrase-structure-based criterion 
defined above must be refined to cover all equation 
alternatives in the rule (requiring two test cases for 
the PP annotation). Even if we assume that (as, 
e.g., in LFG) there is at least one equation associ- 
ated with each constituent, equation coverage does 
not subsume rule coverage: Optional constituents 
introduce a rule disjunct (without the constituent) 
that is not characterizable by an equation. A mea- 
sure might thus be defined as follows: 
disjunct coverage The disjunct coverage of a test- 
suite is the quotient 
number of disjuncts tested 
Tdis = number of disjuncts in grammar 
2Notation: ?/*/+ represent optionality/iteration includ- 
ing/excluding zero occurrences on categories. Annotations 
to a category specify equality (=) or set membership (6) of 
feature values, or non-existence of features (-1); they are ter- 
minated by a semicolon ( ; ). Disjunctions are given in braces 
({... I-.. }). $ ($) are metavariables representing the feature 
structure corresponding to the mother (daughter) of the rule. 
Comments are enclosed in quotation marks ("... "). Cf. (Ka- 
plan and Bresnan, 1982) for an introduction to LFG notation. 
where a disjunct is either a phrase-structure al- 
ternative, or an annotation alternative. Op- 
tional constituents (and equations, if the for- 
malism allows them) have to be treated as a 
disjunction of the constituent and an empty cat- 
egory (cf. the instrumented rule in Fig.2 for an 
example). 
Instead of considering disjuncts in isolation, one 
might take their interaction into account. The most 
complete test criterion, doing this to the fullest ex- 
tent possible, can be defined as follows: 
interaction coverage The interaction coverage of 
a testsuite is the quotient 
number of disjunct combinations tested 
Tinter = number of legal disjunct combinations 
There are methodological problems in this cri- 
terion, however. First, the set of legal com- 
binations may not be easily definable, due to 
far-reaching dependencies between disjuncts in 
different rules, and second, recursion leads to 
infinitely many legal disjunct combinations as 
soon as we take the number of usages of a dis- 
junct into account. Requiring complete inter- 
action coverage is infeasible in practice, similar 
to the path coverage criterion in Software En- 
gineering. 
We will say that an analysis (and the sentence 
receiving this analysis) relies on a grammar disjunct 
if this disjunct was used in constructing the analysis. 
2.2 Instrumentation 
Basically, grammar instrumentation is identical to 
program instrumentation: For each disjunct in a 
given source grammar, we add grammar code that 
will identify this disjunct in the solution produced, 
iff that disjunct has been used in constructing the 
solution. 
Assuming a unique numbering of disjuncts, an an- 
notation of the form DISJUNCT-nn = + can be used 
for marking. To determine whether a certain dis- 
junct was used in constructing a solution, one only 
needs to check whether the associated feature oc- 
curs (at some level of embedding) in the solution. 
Alternatively, if set-valued features are available, 
one can use a set-valued feature DISJUNCTS to col- 
lect atomic symbols representing one disjunct each: 
DISJUNCT-nn 6 DISJUNCTS. 
One restriction is imposed by using the unification 
formalism, though: One occurrence of the mark can- 
not be distinguished from two occurrences, since the 
second application of the equation introduces no new 
information. The markers merely unify, and there is 
no way of counting. 
326 
VP=~V S-----t; 
{ e DISJUNCT-001 E o*; 
I NP $= ($ OBJ) 
DISJUNCT-002 E o*; } 
{ e DISJUNCT-003 E o*; 
I PP+{$= ($ OBL) 
DISJUNCT-004 E o.; 
I SE (J" ADJUNCT) 
DISJUNCT-005 E o*;}. } 
Figure 2: Instrumented rule 
Therefore, we have used a special feature of our 
grammar development environment: Following the 
LFG spirit of different representation levels associ- 
ated with each solution (so-called projections), it 
provides for a multiset of symbols associated with 
the complete solution, where structural embedding 
plays no role (so-called optimality projection; see 
(Frank et al., 1998)). In this way, from the root 
node of each solution the set of all disjuncts used 
can be collected, together with a usage count. 
Fig. 2 shows the rule from Fig.1 with such an in- 
strumentation; equations of the form DISJUNCT-nnE 
o* express membership of the disjunct-specific atom 
DISJUNCT-nn in the sentence's multiset of disjunct 
markers. 
2.3 Processing Tools 
Tool support is mandatory for a scenario such as 
instrumentation: Nobody will manually add equa- 
tions such as those in Fig. 2 to several hundred rules. 
Based on the format of the grammar rules, an algo- 
rithm instrumenting a grammar can be written down 
easily. 
Given a grammar and a testsuite or corpus to com- 
pare, first an instrumented grammar must be con- 
structed using such an algorithm. This instrumented 
grammar is then used to parse the testsuite, yielding 
a set of solutions associated with information about 
usage of grammar disjuncts. Up to this point, the 
process is completely automatic. The following two 
sections discuss two possibilities to evaluate this in- 
formation. 
3 Quality of Testsuites 
This section addresses the aspects of completeness 
("does the testsuite exercise all disjuncts in the 
grammar?") and economy of a testsuite ("is it min- 
imal?"). 
Complementing other work on testsuite construc- 
tion (cf. Sec.5), we will assume that a grammar 
is already available, and that a testsuite has to be 
constructed or extended. While one may argue that 
grammar and testsuite should be developed in paral- 
lel, such that the coding of a new grammar disjunct 
is accompanied by the addition of suitable test cases, 
and vice versa, this is seldom the case. Apart from 
the existence of grammars which lack a testsuite, 
and to which this procedure could be usefully ap- 
plied, there is the more principled obstacle of the 
evolution of the grammar, leading to states where 
previously necessary rules silently loose their use- 
fulness, because their function is taken over by some 
other rules, structured differently. This is detectable 
by instrumentation, as discussed in Sec.3.1. 
On the other hand, once there is a testsuite, you 
want to use it in the most economic way, avoiding 
redundant tests. Sec.3.2 shows that there are dif- 
ferent levels of redundancy in a testsuite, dependent 
on the specific grammar used. Reduction of this re- 
dundancy can speed up the test activity, and give a 
clearer picture of the grammar's performance. 
3.1 Testsuite Completeness 
If the disjunct coverage of a testsuite is 1 for some 
grammar, the testsuite is complete w.r.t, this gram- 
mar. Such a testsuite can reliably be used to mon- 
itor changes in the grammar: Any reduction in the 
grammar's coverage will show up in the failure of 
some test case (for negative test cases, cf. Sec.4). 
If there is no complete testsuite, one can - via in- 
strumentation - identify disjuncts in the grammar 
for which no test case exists. There might be either 
(i) appropriate, but untested, disjuncts calling for 
the addition of a test case, or (ii) inappropriate dis- 
juncts, for which one cannot construct a grammat- 
ical test case relying on them (e.g., left-overs from 
rearranging the grammar). Grammar instrumenta- 
tion singles out all untested disjuncts automatically, 
but cases (i) and (ii) have to be distinguished man- 
ually. 
Checking completeness of our local testsuite of 
1787 items, we found that only 1456 out of 3730 
grammar disjuncts ir~ our German grammar were 
tested, yielding Tdis = 0.39 (the TSNLP testsuite 
containing 1093 items tests only 1081 disjuncts, 
yielding Tdis = 0.28). 3 Fig.3 shows an example 
of a gap in our testsuite (there are no examples of 
circumpositions), while Fig.4 shows an inapproppri- 
ate disjunct thus discovered (the category ADVadj 
has been eliminated in the lexicon, but not in all 
rules). Another error class is illustrated by Fig.5, 
which shows a rule that can never be used due to an 
LFG coherence violation; the grammar is inconsis- 
tent here. 4 
3There are, of course, unparsed but grammatical test cases 
in both testsuites, which have not been taken into account 
in these figures. This explains the difference to the overall 
number of 1582 items in the German TSNLP testsuite. 4Test cases using 
a free dative pronoun may be in the test- 
suite, but receive no analysis since the grammatical function 
FREEDAT is not defined as such in the configuration section. 
327 
PPstd=> Pprae 4=$; 
NPstd 4 = (1" OBJ); 
{ e DISJUNCT-011 E o*; 
I Pcircum4=1`; 
DISJUNCT-012 E o. 
"unused disjunct" ; } 
Figure 3: Appropriate untested disjunct 
ADVP=~ { { e DISJUNCT-021 E o*; 
I ADVadj 4=1` 
DISJUNCT-022 E o* 
"unused disjunct" ; } 
ADVstd 4=1" 
DISJUNCT-023 E o, 
"unused disjunct" ; } 
I .,. } 
Figure 4: Inappropriate disjunct 
3.2 Testsuite Economy 
Besides being complete, a testsuite must be econom- 
ical, i.e., contain as few items as possible without 
sacrificing its diagnostic capabilities. Instrumenta- 
tion can identify redundant test cases. Three criteria 
can be applied in determining whether a test case is 
redundant: 
similarity There is a set of other test cases which 
jointly rely on all disjunct on which the test case 
under consideration relies. 
equivalence There is a single test case which relies 
on exactly the same combination(s) of disjuncts. 
strict equivalence There is a single test case 
which is equivalent to and, additionally, relies 
on the disjuncts exactly as often as, the test 
case under consideration. 
For all criteria, lexical and structural ambiguities 
must be taken into account. Fig.6 shows some equiv- 
alent test cases derived from our testsuite: Exam- 
ple 1 illustrates the distinction between equivalence 
and strict equivalence; the test cases contain differ- 
ent numbers of attributive adjectives, but are nev- 
ertheless considered equivalent. Example 2 shows 
that our grammar does not make any distinction be- 
tween adverbial usage and secondary (subject or ob- 
ject) predication. Example 3 shows test cases which 
should not be considered equivalent, and is discussed 
below. 
The reduction we achieved in size and processing 
time is shown in Table 1, which contains measure- 
ments for a test run containing only the parseable 
test cases, one without equivalent test cases (for ev- 
ery set of equivalent test cases, one was arbitrar- 
VPargs ~ { ... 
I PRONstd4= (1" FREEDAT) 
(4 CASE) -- dat 
(4 PRON-TYPE) = pers 
-~(t OBJ2) 
DISJUNCT-041 E o* 
"unused disjunct" ; 
I ... }. 
Figure 5: Inconsistent disjunct 
1 ein guter alter Wein 
ein guter alter trockener Wein 
'a good old (dry) wine' 
2 Er ifit das Schnitzel roh. 
Er iBt das Schnitzel nackt. 
Er ii3t das Schnitzel schnell. 
'He eats the schnitzel naked/raw/quickly.' 
3 Otto versucht oft zu lachen. 
Otto versucht zu lachen. 
'Otto (often) tries to laugh.' 
Figure 6: Sets of equivalent test cases 
ily selected), and one without similar test cases. 
The last was constructed using a simple heuristic: 
Starting with the sentence relying on the most dis- 
juncts, working towards sentences relying on fewer 
disjuncts, a sentence was selected only if it relied on 
a disjunct on which no previously selected sentence 
relied. Assuming that a disjunct working correctly 
once will work correctly more than once, we did not 
consider strict equivalence. 
We envisage the following use of this redundancy 
detection: There clearly are linguistic reasons to dis- 
tinguish all test cases in example 2, so they cannot 
simply be deleted from the testsuite. Rather, their 
equivalence indicates that the grammar is not yet 
perfect (or never will be, if it remains purely syn- 
tactic). Such equivalences could be interpreted as 
%- 
TSNLP testsuite 
parseable 1093 100% 1537 100% 
no equivalents 783 71% 665.3 43% 
no similar cases 214 19% 128.5 8% 
3561 
local testsuite 
parseable 1787 100% 1213 100% 
no equivalents 1600 89% 899.5 74% 
no similar cases 331 18% 175.0 14% 
5480 
Table 1: Reduction of Testsuites 
328 
a reminder which linguistic distinctions need to be 
incorporated into the grammar. Thus, this level of 
redundancy may drive your grammar development 
agenda. The level of equivalence can be taken as 
a limited interaction test: These test cases repre- 
sent one complete selection of grammar disjuncts, 
and (given the grammar) there is nothing we can 
gain by checking a test case if an equivalent one was 
tested. Thus, this level of redundancy may be used 
for ensuring the quality of grammar changes prior 
to their incorporation into the production version of 
the grammar. The level of similarity contains much 
less test cases, and does not test any (systematic) 
interaction between disjuncts. Thus, it may be used 
during development as a quick rule-of-thumb proce- 
dure detecting serious errors only. 
Coming back to example 3 in Fig.6, building 
equivalence classes also helps in detecting grammar 
errors: If, according to the grammar, two cases are 
equivalent which actually aren't, the grammar is in- 
correct. Example 3 shows two test cases which are 
syntactically different in that the first contains the 
adverbial oft, while the other doesn't. The reason 
why they are equivalent is an incorrect rule that as- 
signs an incorrect reading to the second test case, 
where the infinitival particle "zu" functions as an 
adverbial. 
4 Negative Test Cases 
To control overgeneration, appropriately marked un- 
grammatical sentences are important in every test- 
suite. Instrumentation as proposed here only looks 
at successful parses, but can still be applied in this 
context: If an ungrammatical test case receives an 
analysis, instrumentation informs us about the dis- 
juncts used in the incorrect analysis. One (or more) 
of these disjuncts must be incorrect, or the sentence 
would not have received a solution. We exploit this 
information by accumulation across the entire test 
suite, looking for disjuncts that appear in unusu- 
ally high proportion in parseable ungrammatical test 
cases. 
In this manner, six grammar disjuncts are singled 
out by the parseable ungrammatical test cases in the 
TSNLP testsuite. The most prominent disjunct ap- 
pears in 26 sentences (listed in Fig.7), of which group 
1 is really grammatical and the rest fall into two 
groups: A partial VP with object NP, interpreted 
as an imperative sentence (group 2), and a weird 
interaction with the tokenizer incorrectly handling 
capitalization (group 3). 
Far from being conclusive, the similarity of these 
sentences derived from a suspicious grammar dis- 
junct, and the clear relation of the sentences to only 
two exactly specifiable grammar errors make it plau- 
sible that this approach is very promising in reducing 
overgeneration. 
1 Der Test fg.llt leicht. 
Die schlafen. 
3 Man schlafen . 
Dieser schlafen . 
Ich schlafen . 
Der schlafen. 
Jeder schlafen. 
Derjenige schlafen . 
Jener schlafen . 
Keiner schlafen . 
Derselbe schlafen. 
Er schlafen. 
Irgendjemand schlafen . 
Dieselbe schlafen . 
Das schlafen . 
Eines schlafen. 
Jede schlafen . 
Dieses schlafen. 
Eine schlafen . 
Meins schlafen. 
Dasjenige schlafen. 
Jedes schlafen . 
Diejenige schlafen. 
Jenes schlafen. 
Keines schlafen . 
Dasselbe schlafen. 
Figure 7: Sentences relying on a suspicious disjunct 
5 Other Approaches to Testsuite 
Construction 
Although there are a number of efforts to construct 
reusable large-coverage testsuites, none has to my 
knowledge explored how existing grammars could be 
used for this purpose. 
Starting with (Flickinger et al., 1987), testsuites 
have been drawn up from a linguistic viewpoint, "in- 
\]ormed by \[the\] study of linguistics and \[reflecting\] 
the grammatical issues that linguists have concerned 
themselves with" (Flickinger et al., 1987, , p.4). Al- 
though the question is not explicitly addressed in 
(Balkan et al., 1994), all the testsuites reviewed there 
also seem to follow the same methodology. The 
TSNLP project (Lehmann and Oepen, 1996) and 
its successor DiET (Netter et al., 1998), which built 
large multilingual testsuites, likewise fall into this 
category. 
The use of corpora (with various levels of annota- 
tion) has been studied, but even here the recommen- 
dations are that much manual work is required to 
turn corpus examples into test cases (e.g., (Balkan 
and Fouvry, 1995)). The reason given is that cor- 
pus sentences neither contain linguistic phenomena 
in isolation, nor do they contain systematic varia- 
tion. Corpora thus are used only as an inspiration. 
(Oepen and Flickinger, 1998) stress the inter- 
dependence between application and testsuite, but 
don't comment on the relation between grammar 
and testsuite. 
6 Conclusion 
The approach presented tries to make available the 
linguistic knowledge that went into the grammar for 
development of testsuites. Grammar development 
and testsuite compilation are seen as complemen- 
tary and interacting processes, not as isolated mod- 
ules. We have seen that even large testsuites cover 
only a fraction of existing large-coverage grammars, 
329 
and presented evidence that there is a considerable 
amount of redundancy within existing testsuites. 
To empirically validate that the procedures out- 
lined above improve grammar and testsuite, careful 
grammar development is required. Based on the in- 
formation derived from parsing with instrumented 
grammars, the changes and their effects need to be 
evaluated. In addition to this empirical work, instru- 
mentation can be applied to other areas in Gram- 
mar Engineering, e.g., to detect sources of spurious 
ambiguities, to select sample sentences relying on a 
disjunct for documentation, or to assist in the con- 
struction of additional test cases. Methodological 
work is also required for the definition of a practical 
and intuitive criterion to measure limited interaction 
coverage. 
Each existing grammar development environment 
undoubtely offers at least some basic tools for com- 
paring the grammar's coverage with a testsuite. Re- 
grettably, these tools are seldomly presented pub- 
licly (which accounts for the short list of such refer- 
ences). It is my belief that the thorough discussion 
of such infrastructure items (tools and methods) is 
of more immediate importance to the quality of the 
lingware than the discussion of open linguistic prob- 
lems. 

References 

L. Balkan and F. Fouvry. 1995. Corpus-based test suite generation. TSNLP-WP 2.2, University of Essex. 

L. Balkan, S. Meijer, D. Arnold, D. Estival, and K. Falkedal. 1994. Test Suite Design Annotation Scheme. TSNLP-WP2.2, University of Essex. 

F. Ciravegna, A. Lavelli, D. Petrelli, and F. Pianesi. 1998. Developing  reesources and applications with geppetto. In Proc. 1st International Conference on Language Resources and Evaluation, pages 619-625. Granada/Spain, 28-30 May 1998. 

D. Flickinger, J. Nerbonne, I. Sag, and T. Wasow. 1987. Toward Evaluation of NLP Systems. Hewlett-Packard Laboratories, Palo Alto/CA. 

A. Frank, T.H. King, J. Kuhn, and J. Maxwell. 1998. Optimality theory style constraint ranking in large-scale lfg gramma. In Proc. of the LFG98 Conference. Brisbane/AUS, Aug 1998, CSLI On-line Publications. 

W.C. Hetzel. 1988. The complete guide to software testing. QED Information Sciences, Inc. Wellesley/MA 02181. 

R.M. Kaplan and J. Bresnan. 1982. Lexical-functional grammar: A formal system for grammatical representation. In J. Bresnan and R.M. Kaplan, editors, The Mental Representation of Grammatical Relations, pages 173-281. Cambridge, MA: MIT Press. 

S. Lehmann and S. Oepen. 1996. TSNLP - Test Suites for Natural Language Processing. In Proc. 16th International Conference on Computational Linguistics, pages 711-716. Copenhagen/DK. 

P. Liggesmeyer. 1990. Modultest und Modulverij~kation. Angewandte Informatik 4. Mannheim: BI Wissenschaftsverlag. 

K. Netter, S. Armstrong, T. Kiss, J. Klein, and S. Lehman. 1998. Diet - diagnostic and evaluation tools for nlp applications. In Proc. 1st Int'l Con/. on Language Resources and Evaluation, pages 573-579. Granada/Spain, 28-30 May 1998. 

S. Oepen and D.P. Flickinger. 1998. Towards systematic grammar profiling:test suite techn. 10 years after. Journal of Computer Speech and Language, 12:411-435.
