TSNLP --- Test Suites for Natural Language Processing 
Sabine Lehmann, Stephan Oepen,
Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein,
Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin,
Hervé Compagnion, Judith Baur, Lorna Balkan, Doug Arnold

ISSCO, Université de Genève
54, route des Acacias, CH-1227 Genève
+41-22-705 7[?] 33

DFKI GmbH, CL Department
Stuhlsatzenhausweg 3, D-66123 Saarbrücken
+49-681-302 52 82
Abstract 
The growing language technology industry needs measurement
tools to allow researchers, engineers, managers, and customers
to track development, evaluate and assure quality, and assess
suitability for a variety of applications.
The TSNLP (Test Suites for Natural Language Processing)
project¹ has investigated various aspects of the construction,
maintenance and application of systematic test suites as
diagnostic and evaluation tools for NLP applications. The paper
summarizes the motivation and main results of TSNLP: besides
the solid methodological foundation of the project, TSNLP has
produced substantial (i.e. larger than any existing general
test suites) multi-purpose and multi-user test suites for three
European languages together with a set of specialized tools
that facilitate the construction, extension, maintenance,
retrieval, and customization of the test data.
The publicly available results of TSNLP represent a valuable
linguistic resource that has the potential of providing a
wide-spread pre-standard diagnostic and evaluation tool for
both developers and users of NLP applications.
1 Background and Motivation
Evaluation of NLP applications plays an increasingly important
role in both the academic and industrial NLP communities. Two
tools traditionally used for evaluating and testing NLP systems
are test suites and test corpora. The two can be seen as serving
complementary purposes (see Dauphin et al. (1995a)): in contrast
to text corpora, whose main advantage is that they reflect
naturally occurring data, the key properties of test suites are
(i) systematicity, (ii) control over data, (iii) inclusion of
negative data, and (iv) exhaustivity.
¹The project was started in December 1993 and completed in
March 1996. Most of the project results (documents,
bibliography, test data, and software) as well as on-line access
to the test suite database and email addresses of the project
members can be obtained through the world-wide web from the
TSNLP home page 'http://tsnlp.dfki.uni-sb.de/tsnlp/'.
The TSNLP project was funded within the Linguistic Research and
Engineering (LRE) programme of the European Commission (DG XIII)
under research grant LRE-62-089 and by the Swiss Federal
Government.
CL/MT Group, University of Essex
Wivenhoe Park, Colchester CO4 3SQ, UK
+44-1206-872086

Aerospatiale France, Common Research Center
12, rue Pasteur, BP 76, F-92152 Suresnes Cedex
+33-1-46973061
Among the main motivations for the TSNLP project were the lack
of general guidelines for the test suite construction, of
adequate and comprehensive test material, and of appropriate
tools. The resulting duplication of effort among test suite
developers obviously leads to a waste of time and resources. In
addition, one of the main conclusions of a study of existing
test suites conducted during the first stage of the project
(Estival et al. (1994)) was that the reusability of existing
test suites is severely hampered by their lack of structure and
annotations. Indeed, despite the pioneering efforts of
Flickinger et al. (1987) and Nerbonne et al. (1993), most of the
existing test suites were written for some specific system or
simply enumerate a number of interesting examples and, thus, do
not meet the demand for large, systematic, well-documented,
highly-structured and annotated collections of linguistic
material, which is now required by a growing number of NLP
applications. The TSNLP test suite addresses these demands and
provides powerful tools for the construction and manipulation
of the test data.
On the one hand, since every NLP system (whether commercial or
under development) exhibits specific features which make it
unique, and every user (or developer) of an NLP system has
specific needs and requirements, the TSNLP approach is based on
the assumption that, in order to yield informative and
interpretable results, any test suite used for an actual test or
evaluation must be specific (at least to some degree) to the
system and the user. On the other hand, since testing or
evaluating NLP systems is performed for a variety of purposes,
the TSNLP approach is also guided by the need to provide test
material which is easily reusable. To achieve these two goals of
specificity and reusability, the traditional notion of a test
suite as a monolithic set of test items has been abandoned in
favour of the notion of a database in which test items are
stored together with a rich inventory of associated linguistic
and non-linguistic annotations.
Thus, the test suite database serves as a virtual (or meta)
test suite that provides the means to extract the relevant
subset of the test data suitable for some specific task. Using
the explicit structure of the data and the TSNLP annotations,
the database engine allows searching and retrieving data from
the virtual test suite, thereby creating a concrete test suite
instance according to arbitrary linguistic and extra-linguistic
constraints. Since, additionally, there are tools provided for
the maintenance and extension of the test suite database, the
TSNLP virtual test suite approach is an essential innovation
leading the way to a new generation of highly-structured
reusable test suites.
2 Test Suite Design and Methodology 
Based on a survey of existing test suites and an 
analysis of the diagnostic and evaluation require- 
ments of both NL technology developers and users, 
TSNLP has developed the methodology for the con- 
struction of core test data, that is, test items re- 
flecting central language phenomena and that are 
applicable to a wide range of applications, includ- 
ing parsers, grammar checkers, and controlled lan- 
guage checkers (Balkan et al. (1996)). 
The TSNLP methodology is designed to optimize 
(i) control over test data, (ii) progressivity, and
(iii) systematicity. These are necessary qualities 
for an adequate, reusable test suite, which are dif- 
ficult to find in test corpora. The methodology 
also addresses the specific goals of TSNLP to pro- 
duce multi-purpose, multi-user, and multilingual 
test suites. 
Control over test data What makes test 
suites valuable in comparison to corpora is that 
they can focus on specific linguistic phenomena 
and that each phenomenon can be presented both 
in isolation and controlled combinations in which 
as many linguistic parameters as possible are be- 
ing kept under control. This is particularly the 
case when a phenomenon is illustrated by system- 
atic variation over the parameters used to describe 
this phenomenon, while all other parts of the test 
items remain constant. 
Vocabulary is an aspect of the test data that 
needs to be controlled. TSNLP achieves this by re-
stricting the vocabulary in size as well as in do-
main. Categorially and semantically ambiguous 
words are avoided where possible and only includ- 
ed when ambiguity is explicitly tested for. 
Additionally, TSNLP attempts to control the in- 
teraction of phenomena by keeping the test items 
as small as possible. Therefore, a number of guide- 
lines for this purpose (such as use declarative sen- 
tences and avoid modifiers and adjuncts) is pro- 
vided. 
Progressivity Progressivity is the principle
of starting from simple test items and increasing
their complexity. In TSNLP, this aspect is addressed
by requiring that each test item focuses only on a
single phenomenon (or rather subphenomenon or even
feature) which distinguishes it from all other test
items. This principle not only ensures systematicity
during the test data construction but also allows
test data users to apply the test data in a
progressive order obtained from the special attribute
presupposition in the phenomena classification. Thus,
the precise identification of the coverage of a
system and of its deficiencies is rendered easier.
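
As a minimal sketch of how a progressive order could be obtained from the presupposition attribute, the phenomena can be treated as a dependency graph and sorted topologically. The first three entries follow the sample phenomenon shown in Figure 1; the last entry and the function name `progressive_order` are hypothetical, and nothing here is claimed to reflect the actual TSNLP tools.

```python
from graphlib import TopologicalSorter

# Phenomenon -> set of phenomena it presupposes.
# The last entry is a made-up phenomenon added for illustration only.
presupposes = {
    "C_Agreement": set(),
    "NP_Agreement": set(),
    "C_Complementation_subj(NP)_V": {"C_Agreement", "NP_Agreement"},
    "C_Complementation_subj(NP)_V_obj(NP)": {"C_Complementation_subj(NP)_V"},
}


def progressive_order(table):
    """Return the phenomena so that each one follows everything it presupposes."""
    return list(TopologicalSorter(table).static_order())


order = progressive_order(presupposes)
for name in order:
    print(name)
```

Applying test items in this order means that a failure on a later phenomenon is less likely to be caused by an untested, more basic one.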
Systematicity Systematicity refers to the 
depth of coverage of a test suite, with respect to 
both well-formed and ill-formed items. System- 
aticity in TSNLP is achieved for well-formed items 
by the explicit classification of test items accord- 
ing to phenomena and sub-phenomena. Negative 
test data permits testing for overgeneration as well
as for coverage. Ill-formed items are derived from
well-formed ones by systematic variation of the pa- 
rameters through the application of one (or more) 
of four operations, namely: 
• REPLACEMENT (e.g. change of person)
(French) L'ingénieur vient.
(French) *L'ingénieur viens.
• ADDITION (e.g. of an object NP)
(German) Der Manager arbeitet.
(German) *Der Manager arbeitet den Vortrag.
• DELETION (e.g. of an obligatory complement)
(German) Der Manager hält den Vortrag.
(German) *Der Manager hält.
• PERMUTATION (e.g. inverting word order)
(English) He saw the boy.
(English) *He the boy saw.
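
The four operations can be sketched as functions over token lists. This is only an illustration of the idea: the function names and the token-index interface are assumptions, not part of the TSNLP tools, and the sketch ignores the annotations that accompany each item.

```python
def replace(tokens, index, new_token):
    """REPLACEMENT: swap one token, e.g. change the person of the verb."""
    return tokens[:index] + [new_token] + tokens[index + 1:]


def add(tokens, index, new_tokens):
    """ADDITION: insert extra tokens, e.g. an unlicensed object NP."""
    return tokens[:index] + list(new_tokens) + tokens[index:]


def delete(tokens, start, end):
    """DELETION: drop tokens[start:end], e.g. an obligatory complement."""
    return tokens[:start] + tokens[end:]


def permute(tokens, new_order):
    """PERMUTATION: reorder tokens according to a list of source indices."""
    return [tokens[i] for i in new_order]


# The negative examples from the list above, derived from the positive ones.
print(" ".join(replace(["L'ingénieur", "vient"], 1, "viens")))
print(" ".join(add(["Der", "Manager", "arbeitet"], 3, ["den", "Vortrag"])))
print(" ".join(delete(["Der", "Manager", "hält", "den", "Vortrag"], 3, 5)))
print(" ".join(permute(["He", "saw", "the", "boy"], [0, 2, 3, 1])))
```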
In general, the systematicity of test data was
greatly enhanced through the use of special- 
purpose tools in the data construction and vali- 
dation process (see section 5 below). 
Multilinguality Multilinguality is achieved
in the TSNLP test suites by covering the same
range of phenomena in English, French and German,
and adopting the same classification for these
phenomena in the three languages. Furthermore,
the choice of related terminology for the categorial
and structural description contributes to the
comparability and consistency of the test items (see
section 4 for details).
Documentation To enhance the usability 
and extensibility of TSNLP results, a three-volume
user guide is under preparation providing clear in- 
structions for the assessment of the methodology, 
test data, and tools developed. 
3 TSNLP Annotation Schema
A detailed annotation schema was designed for the
test data which does not presuppose a specific
linguistic theory, a particular evaluation situation or
application type.
Test data and annotations in TSNLP test suites
are organized at four distinct representational
levels:
• Core Data The core of the test data consists of
the individual test items together with all general,
categorial and structural information that is
independent of a token phenomenon or application.
Besides the actual input string, annotations at this
level include (i) bookkeeping and documentation
information (author, date, id number), (ii) the item
format, its length, category and well-formedness
code, (iii) the (morpho-)syntactic categories and
string positions of the lexical and phrasal elements
constituting the test item, and (iv) an
(underspecified) representation of its functional
structure. Encoding a dependency or functor-argument
graph rather than a phrase structure tree allows
generalizations over potentially controversial phrase
structure configurations and, thus, avoids imposing a
specific constituent structure but still can be
mapped onto one.
• Phenomenon-Related Data Based on a hierarchical
classification of linguistic (and extra-linguistic)
phenomena (e.g. verb valency as a subtype of general
complementation), each phenomenon is identified by a
phenomenon id and by its supertype(s). Interaction
with other phenomena as well as the phenomena which
must be presupposed are also given. In addition, the
(syntactic) parameters which are relevant for the
phenomenon (e.g. the number and type of complements
in the case of verb valency) are described.
Individual test items can be assigned to one or
several phenomena and annotated according to the
corresponding parameters.
• Test Sets Test items can optionally be grouped
into test sets. A test set is a group of test items
containing typically one positive example and one or
more negative examples. The relation between positive
and negative test items has been one of the most
challenging questions in designing test data and, as
has been mentioned, is based on the systematic
variation of phenomenon-specific parameters.
• User and Application Parameters Information
that typically correlates with the use of a test
suite for different types of evaluation and for
different applications (e.g. ratings of frequency or
relevance for a particular domain) is factored from
the remainder of the data into user & application
profiles. As part of the customization process users
of the TSNLP test suites are encouraged to extend
this part of the test suite database and add whatever
(formal or informal) information is necessary for
their specific requirements.
In addition to the parts of the annotation schema
that follow a formal specification, there is room for
textual comments at the various levels to accommodate
information that cannot or need not be formalized.
Test Item
  item id: [illegible]   author: issco   date: jan-95
  register: formal   format: none   origin: invented
  difficulty: 1   wellformedness: 1   category: S
  input: L'ingénieur vient.   length: 3
  comment:
Analysis
  position     instance      category     function     domain
  [illegible]  L'ingénieur   [illegible]  subj         [illegible]
  [illegible]  vient         [illegible]  [illegible]  [illegible]
Phenomenon
  phenomenon id: [illegible]   author: issco   date: jan-95
  name: C_Complementation_subj(NP)_V
  supertypes: C_Complementation
  presupposition: C_Agreement, NP_Agreement
  restrictions: neutral   interaction: none   purpose: test
  comment: intransitive verb (valency: 1)
Figure 1: Sample instance of the TSNLP annotation
schema for one test item: the annotations are given
in tabular form for the test item, analysis, and
phenomenon levels.
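
To make the layering of the schema concrete, the levels can be sketched as plain record types. This is an illustrative data model only: the class and field names are assumptions loosely modelled on the attribute names in Figure 1, not the actual TSNLP database relations, and the user and application profile level is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phenomenon:
    # Phenomenon-related data: name, supertypes, presupposed phenomena.
    name: str
    supertypes: List[str] = field(default_factory=list)
    presupposition: List[str] = field(default_factory=list)


@dataclass
class Item:
    # Core data: the input string plus general annotations.
    text: str
    wellformed: bool
    category: str
    author: str
    date: str
    phenomena: List[Phenomenon] = field(default_factory=list)


@dataclass
class ItemSet:
    # Test set: typically one positive item and one or more negative items.
    positive: Item
    negative: List[Item] = field(default_factory=list)


valency = Phenomenon(
    name="C_Complementation_subj(NP)_V",
    supertypes=["C_Complementation"],
    presupposition=["C_Agreement", "NP_Agreement"],
)
good = Item("L'ingénieur vient.", True, "S", "issco", "jan-95", [valency])
# Hypothetical pairing: the negative item is the REPLACEMENT example from section 2.
bad = Item("L'ingénieur viens.", False, "S", "issco", "jan-95", [])
test_set = ItemSet(positive=good, negative=[bad])
print(test_set.positive.text, [n.text for n in test_set.negative])
```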
4 Test Data Construction
Following the TSNLP test suite guidelines
(Estival et al. (1994)) and using the annotation
schema sketched above, the construction of test
data was based on a classification of the (syntactic)
phenomena to be covered. From judgements on the
linguistic relevance and frequency for the individual
languages, the following list of core phenomena
for TSNLP was compiled:
• complementation;
• agreement;
• modification;
• diathesis;
• modality, tense, and aspect;
• sentence and clause types;
• word order;
• coordination;
• negation; and
• extragrammatical (e.g. parentheticals and
temporal expressions).
A further sub-classification of phenomena is
made according to the relevant syntactic domains
in which a phenomenon occurs (e.g. sentences (S),
clauses (C), noun phrases (NP) et al.). Figure 2
gives an overview of the test material available.
For each of the three languages some 5000 test
items are provided. Therefore, TSNLP has already
achieved a substantially broader and deeper coverage
than previous general-purpose test suites (the
still very popular Hewlett-Packard test suite, for
instance, has a coverage of 3000 test items for
English only).
In order to enforce consistency of annotations
across the three languages, canonical lists of the
categories and functions used in the description of
categorial and dependency structure were established
(see Lehmann et al. (1996)). The dimensions chosen in
the classification attempt to avoid the
presupposition of very specific assumptions of a
particular theory of grammar (or of a language), and
rather try to capture those distinctions that seem to
be relevant across the set of TSNLP core phenomena.

[Figure 2 (table): number of grammatical vs.
ungrammatical test items per phenomenon for English,
French, and German. Rows: C_Complementation,
C_Agreement, C_Modification, NP_Complementation,
NP_Agreement, NP_Modification, Diathesis, Tense Aspect
Modality, Sentence Types, Coordination, Negation, Word
Order, Extragrammatical, Total. The cell values are
not recoverable from the source.]
Figure 2: Status of the TSNLP data (December
1995): relevance and breadth of individual phenomena
present language-specific variation (the numbers
given are for grammatical vs. ungrammatical items).
Individual phenomena are often further sub-classified
according to phenomenon-internal dimensions.
5 Test Suite Technology 
Because the test data construction proper as well
as the customization and application of a general-
purpose test suite to a specific NLP system or domain
are laborious, cost-intensive and error-prone
tasks, TSNLP put strong emphasis on supplying
suitable special-purpose tools to facilitate both the
development as well as usage of the TSNLP test data
(Oepen et al. (1996a) give an overview).
5.1 Test Data Construction 
To ease the time-consuming test data construction
and to reduce erratic variations in filling in the
TSNLP annotation schema, a graphical test suite
construction tool (tsct) was implemented. The
tool instantiates the annotation schema (see section
3) as a form-based input mask and provides
for (limited) consistency checking of the field
values. Additionally, tsct allows reusing previously
constructed and annotated data, as quite often
when constructing a series of test items it can be
easier to duplicate and adapt a similar item rather
than produce annotations from scratch. For some
of the test data a DCG-based test suite generation
tool (Arnold et al. (1994)) was deployed to
automatically produce systematically varied (i.e.
both grammatical and ungrammatical) test items
together with some part of the annotations.
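
The idea of automatically producing systematically varied items together with part of their annotations can be illustrated with a much simpler device than a DCG. The sketch below enumerates all subject and verb combinations from a tiny hand-written lexicon and derives the well-formedness code from the agreement features. The lexicon, the feature labels, and the dictionary keys are all assumptions made for illustration; this is not the generation tool described in Arnold et al. (1994).

```python
from itertools import product

# Tiny illustrative lexicon: (surface form, agreement feature).
SUBJECTS = [("He", "3sg"), ("They", "3pl")]
VERBS = [("sleeps", "3sg"), ("sleep", "3pl")]


def generate_items():
    """Enumerate every subject/verb pair and annotate it with a wf code."""
    items = []
    for (subj, subj_agr), (verb, verb_agr) in product(SUBJECTS, VERBS):
        items.append({
            "input": f"{subj} {verb}.",
            "wf": int(subj_agr == verb_agr),  # 1 = grammatical, 0 = ungrammatical
            "phenomenon": "C_Agreement",
        })
    return items


for item in generate_items():
    print(item["wf"], item["input"])
```

Because the variation is exhaustive over the chosen parameters, every positive item is accompanied by the negative items that differ from it in exactly the agreement feature.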
5.2 Test Data Maintenance and Retrieval 
To implement the TSNLP virtual test suite approach
(see section 1), the test data is mounted on
a relational database to satisfy the following key
desiderata:
• usability: to facilitate the application of the
methodology, technology, and test data developed
in TSNLP to a wide variety of diagnosis and
evaluation purposes for different applications by
developers or users with varied backgrounds;
• suitability: to meet the specific necessities of
storing and maintaining natural language test
data (e.g. in string processing) and to provide
maximally flexible interfaces;
• adaptability and extensibility: to enable
and encourage users of the database to add test
data and annotations according to their needs
without changes to the underlying data model;
and
• portability and simplicity: to make the results
of TSNLP available on several different
hard- and software platforms and easy to use.

[Figure 3 (diagram): client and application programs
(e.g. a graphical browser) access the database kernel
(server), which consists of a library of interface
functions on top of the database inference engine.]
Figure 3: Sketch of the modular tsdb1 design: the
database kernel is separated from client programs
through a layer of interface functions.
To account for the potentially different requirements
of NLP developers and users and in order
to provide suitable interfaces to human test suite
users as well as to external application programs, a
dual database implementation was carried out: (i)
while a proprietary implementation (called tsdb1)
allowed the fine-tuning of both the query language
and interfaces, (ii) a second version (tsdb2) builds
on a commercial database product and, thus, is
compliant to common industry standards allowing
(industrial) users of the TSNLP test suite to acquire
on-site technical support where necessary.
The tsdb1 implementation is a small and efficient
relational database engine in ANSI C. It was
designed with an open and documented interface
layer (see figure 3) that enables test suite users to
bidirectionally link an application being tested to
the database and run automated retrieve, process,
and compare cycles. Diagnostic results obtained
can be stored in the database as part of the user &
application profile for use in continuous progress
evaluation (section 6 gives an example).
An ASCII-based command shell interprets a
simplified SQL-style query language and provides
editing, completion, and command and query result
history. A network database server gives remote
(though read-only) access to the test data.
For the alternative implementation tsdb2 the
competitively priced database package Microsoft
FoxPro was deployed because it is available for
both Apple Macintosh and personal computers
running MS Windows² and has a very wide distribution.
The database provides convenient graphical
browsing and editing of the data (using pull-down
menus for finite domain fields; see figure 4)
as well as standard import and export facilities to
exchange data with external applications.

Figure 4: Screen dump of the tsdb2 test item window;
the underlying relational database allows parallel
browsing and editing of multiple relations.
5.3 Query and Retrieval: An Example
To illustrate the capacity and flexibility of the
TSNLP annotation schema in conjunction with a
relational database retrieval engine, a query example
in the simplified SQL-like query language interpreted
by tsdb1 together with an informal English
paraphrase is given:³
• find all grammatical test items that are associated
with the phenomenon of clausal (i.e. subject
verb) agreement and have pronominal subjects:
    select i-id i-input
      where i-wf = 1 &
            p-name = "C_Agreement" &
            a-function = "subj" &
            a-category ~ "^PRON"
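
For readers more familiar with standard SQL, a rough equivalent of this query can be sketched with Python's built-in sqlite3 module. The three-table layout, the column names, the sample rows, and the category label `PRON_pers` are assumptions made for this illustration; the actual tsdb relations are documented in the TSNLP user manual. The anchored pattern `^PRON` is approximated here by the prefix match `LIKE 'PRON%'`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (i_id INTEGER PRIMARY KEY, i_input TEXT, i_wf INTEGER);
CREATE TABLE analysis (i_id INTEGER, a_function TEXT, a_category TEXT);
CREATE TABLE item_phenomenon (i_id INTEGER, p_name TEXT);
""")
# Invented sample data: two pronominal subjects, one full NP subject.
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
    (1, "He sleeps.", 1),
    (2, "He sleep.", 0),
    (3, "The manager sleeps.", 1),
])
conn.executemany("INSERT INTO analysis VALUES (?, ?, ?)", [
    (1, "subj", "PRON_pers"),
    (2, "subj", "PRON_pers"),
    (3, "subj", "NP"),
])
conn.executemany("INSERT INTO item_phenomenon VALUES (?, ?)", [
    (1, "C_Agreement"),
    (2, "C_Agreement"),
    (3, "C_Agreement"),
])

# Grammatical items for clausal agreement whose subject is pronominal.
rows = conn.execute("""
SELECT i.i_id, i.i_input
FROM item AS i
JOIN item_phenomenon AS p ON p.i_id = i.i_id
JOIN analysis AS a ON a.i_id = i.i_id
WHERE i.i_wf = 1
  AND p.p_name = 'C_Agreement'
  AND a.a_function = 'subj'
  AND a.a_category LIKE 'PRON%'
ORDER BY i.i_id
""").fetchall()
print(rows)
```

Only the first sample item satisfies all four conditions: the second is ungrammatical and the third has a non-pronominal subject.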
6 Customization and Testing
To validate the TSNLP annotation methodology,
test data, and tools, the project results have been
tested against three different application types,
viz. a commercial grammar checker for French,
a controlled language checker (SECC) for English
and a parser (the PAGE system developed at DFKI)
for German. As in this setup the evaluation
situations ranged from user-level black box evaluation
of a commercial product to glass box diagnosis of
a research prototype under development (the DFKI
system), a number of interesting results were
obtained on both the adequacy of the TSNLP approach
as well as the quality of the systems being tested.

²Building on the popular database package MS Access,
another implementation of the test suite database
is currently being developed. This version will provide
a similar functionality to tsdb2 and be available for the
MS Windows world.
³Additional sample queries and more details on
the database schema (including relation and attribute
names) can be found in Oepen et al. (1996b)
and on the TSNLP World-Wide Web home page
http://tsnlp.dfki.uni-sb.de/tsnlp/.
French Grammar Checker The real life
evaluation scenario (i.e. the diagnostic evaluation
of a commercial NLP product) enabled Aerospatiale
to give a precise account of the type of information
obtainable from the use of TSNLP.
The following major performance characteristics
were revealed:
• TSNLP ill-formed test items are frequently not
detected as such.
• The system performs well on (both well-formed
and ill-formed) test items illustrating the
phenomenon of agreement, in clauses as well as in
noun phrases.
• The system does not master the phenomenon
of complementation, especially not in adjectival
phrases.
• Sentential test items produce better results than
subsentential ones.
• The analysis capabilities of the system are limited
(19% of the TSNLP test items were not fully
analysed).
The interpretation of the results produced by
the system and the comparison with the linguistic
information provided in the TSNLP annotations
led to an identification of the major shortcomings
of the system in terms of systematicity, lexical
and morpho-syntactic deficiencies, and interference
with other system components.
English Controlled Language Checker
Essex tested the controlled language checker SECC
(Adriaens (1994)). Like Aerospatiale, Essex was
mostly in a black box situation with respect to the
system, except that they had access to the controlled
grammar language descriptions (but not to
the system rules). The testing involved the writing
of a large number of customised test items, due
to the fact that many CL rules are lexically based,
whereas the core test suite concentrates on syntactic
phenomena. The testing proved very valuable
in highlighting deficiencies in the system
performance, as well as in the rule descriptions and gave
pointers to the possible source of those errors.
German Parser In connecting the German
TSNLP test suite to the DFKI PAGE parser⁴ both
the test data as well as the TSNLP technology were
validated. Building on the C version of the TSNLP
database (tsdb1), a bidirectional interface to the
application was established allowing the instantiation
of a DFKI user & application profile for the
storage of application-specific data (including
performance measures and a semantic specification of
the expected output).

⁴The DFKI PAGE (Platform for Advanced Grammar
Engineering) system is a state-of-the-art NL core
engine and grammar engineering platform; it is in active
use at several international research institutions,
primarily for HPSG-style grammar development for German,
English, Japanese, and Italian.
The seamless coupling between the test suite
and the NL system allows running fully automated
retrieve, process, and compare cycles in the
continuous progress evaluation of the grammar and
software such that after making changes to the
system the impact on coverage and performance
can be determined in an overnight batch job. The
TSNLP test data and database technology proved to
be a highly adequate tool for glass-box diagnostic
evaluation; besides, the testing experience provided
valuable feedback for both the test suite and
the application tested (Dauphin et al. (1995b)).
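
A retrieve, process, and compare cycle of this kind can be sketched in a few lines. The sketch assumes that each retrieved item carries its input string and well-formedness code and that the system under test can be wrapped as a function returning whether it accepts the input. The names `run_cycle`, `summarize`, and `toy_parse` are hypothetical, and the toy parser merely stands in for a real system such as PAGE.

```python
def run_cycle(items, parse):
    """Retrieve each item, process it, and compare with the annotation."""
    results = []
    for item in items:                              # retrieve
        accepted = bool(parse(item["input"]))       # process
        results.append({
            "id": item["id"],
            "wf": item["wf"],
            "accepted": accepted,
            "correct": accepted == bool(item["wf"]),  # compare
        })
    return results


def summarize(results):
    """Coverage on well-formed items, overgeneration on ill-formed items."""
    good = [r for r in results if r["wf"]]
    bad = [r for r in results if not r["wf"]]
    coverage = sum(r["accepted"] for r in good) / len(good) if good else 0.0
    overgeneration = sum(r["accepted"] for r in bad) / len(bad) if bad else 0.0
    return coverage, overgeneration


def toy_parse(text):
    # Stand-in for a real parser: accepts anything ending in "sleeps.".
    return text.endswith("sleeps.")


items = [
    {"id": 1, "input": "He sleeps.", "wf": 1},
    {"id": 2, "input": "He sleep.", "wf": 0},
    {"id": 3, "input": "They sleep.", "wf": 1},
    {"id": 4, "input": "They sleeps.", "wf": 0},
]
results = run_cycle(items, toy_parse)
print(summarize(results))
```

Storing `results` after every change to the grammar, for example in a user and application profile, is what makes progress comparable from one run to the next.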
7 Conclusion and Future Work 
The TSNLP project has laid the foundations for
building large scale reference data for diagnostic
and evaluation purposes. The project has produced
a substantial set of test items for three different
languages, which are based on a systematic
and controlled methodology, comprehensively
annotated, and embedded in an environment
allowing for easy access and maintenance of the
data. The approach has been successfully tested
against commercial and research NLP applications
and components.
However, while this work can be seen as an im- 
portant step in the right direction, we are very 
well aware of future developments which will be es-
sential for a widespread acceptance of the system
in a broad user community. These developments
comprise amongst others further extensions of the
test data (possibly taking into account aspects of 
morphology and discourse), customization tools, 
which support the adaptation of the test data to 
specific domains and applications, as well as tools 
and methods which relate the isolated test items to 
corpora in order to determine their frequency and 
relevance. While the members of the project will 
continue this work, outside developers and users of 
NLP applications are invited to contribute to these 
resources which can become a reference standard 
only if they are truly public domain. 
Acknowledgements 
In its initial specification and in the early phase of
the project, TSNLP was greatly inspired by the
conceptual and administrative contributions of Siety
Meijer of University of Essex. Additionally,
substantial parts of the implementation work at DFKI
and the University of Essex have been carried out
by Tom Fettig, Fred Oberhauser, and Martin Rondell.
We especially want to thank Roger Havenith,
the TSNLP project officer at DG XIII, for his help
throughout the project and the two external
reviewers, Dan Flickinger and John Nerbonne, for
their constructive comments and suggestions.
References 
Adriaens, Geert. 1994. SECC: Simplified English
Checker and Style Correction in an MT Framework.
In Proceedings of the Language Engineering
Convention. Paris.
Arnold, Doug, Martin Rondell, and Frederik Fouvry.
1994. Design and Implementation of Test Suite
Tools. Report to LRE 62-089 D-WP5.1. University
of Essex, UK.
Balkan, Lorna, Frederik Fouvry, and Sylvie Regnier-
Prost (editors). 1996. TSNLP User Manual. Volume
1: Background, Methodology, Customization, and
Testing. Technical report. University of Essex, UK.
Dauphin, Eva, Veronika Lux, Sylvie Regnier-Prost
(principal authors), Doug Arnold, Lorna Balkan,
Frederik Fouvry, Judith Klein, Klaus Netter,
Stephan Oepen, Dominique Estival, Kirsten
Falkedal, and Sabine Lehmann. 1995a. Checking
Coverage against Corpora. Report to LRE 62-089
D-WP3.2. University of Essex, UK.
Dauphin, Eva, Veronika Lux, Sylvie Regnier-Prost,
Lorna Balkan, Frederik Fouvry, Kirsten Falkedal,
Stephan Oepen (principal authors), Doug Arnold,
Judith Klein, Klaus Netter, Dominique Estival,
Kirsten Falkedal, and Sabine Lehmann. 1995b.
Testing and Customisation of Test Items. Report to LRE
62-089 D-WP4. University of Essex, UK.
Estival, Dominique, Kirsten Falkedal, Lorna Balkan,
Siety Meijer, Sylvie Regnier-Prost, Klaus Netter,
and Stephan Oepen. 1994. Survey of Existing Test
Suites. Report to LRE 62-089 D-WP1. University
of Essex, UK.
Flickinger, Daniel, John Nerbonne, Ivan A. Sag, and
Thomas Wasow. 1987. Toward Evaluation of NLP
Systems. Technical report. Hewlett-Packard
Laboratories. Distributed at the 24th Annual Meeting
of the Association for Computational Linguistics (ACL).
Lehmann, Sabine, Dominique Estival, Kirsten
Falkedal, Hervé Compagnion, Lorna Balkan,
Frederik Fouvry, Judith Baur, and Judith Klein.
1996. TSNLP User Manual. Volume 3: Test Data
Documentation. Technical report. Istituto Dalle
Molle per gli Studii Semantici e Cognitivi (ISSCO)
Geneva, Switzerland.
Nerbonne, John, Klaus Netter, Kader Diagne, Ludwig
Dickmann, and Judith Klein. 1993. A Diagnostic
Tool for German Syntax. Machine Translation
8:85-107.
Oepen, Stephan, Frederik Fouvry, Klaus Netter, Tom
Fettig, and Fred Oberhauser. 1996a. TSNLP User
Manual. Volume 2: Core Test Suite Technology.
Technical report. Deutsches Forschungszentrum
für Künstliche Intelligenz (DFKI) Saarbrücken,
Germany.
Oepen, Stephan, Klaus Netter, and Judith Klein.
1996b. TSNLP Test Suites for Natural Language
Processing. In Linguistic Databases, ed. John
Nerbonne. CSLI Lecture Notes. Center for the Study of
Language and Information. forthcoming.
