Beyond Skeleton Parsing: Producing a Comprehensive 
Large-Scale General-English Treebank With Full 
Grammatical Analysis 
Ezra Black, Stephen Eubank 
Hideki Kashioka, David Magerman* 
AT\]R, Interpreting Telecommunications 
Laboratories 
2-2 Hikaridai, Seika-cho 
Soraku-gun, Kyoto, Japan 619-02 
Roger Garside, Geoffrey Leech 
I)epts of Computing 
and l,inguistics 
University of Lancaster, 
Bailrigg, Lancaster LA1 4YT, UK 
rgg©comp, lanes, ac. uk 
{black,eubank,kashioka}@atr.itl.co.jp G.LeechOcentl.lancs.ac.uk 
1 Introduction 
A treebank is a body of natural language text 
which has been grammatically annotated by hand, 
in terms of some previously-established scheme of 
grammatical analysis. Treebanks have been used 
within the field of natural language processing as 
a source of training data for statistical part og 
speech taggers (Black et al., 1992; Brill, 1994; 
Merialdo, 1994; Weischedel et al., 1993) and for 
statistical parsers (Black et al., 1993; Brill, 1993; 
aelinek et al., 1994; Magerman, 1995; Magerman 
and Marcus, 1991). 
In this article, we present the AT'R/Lancaster 
7'reebauk of American English, a new resource 
tbr natural-language-, processing research, which 
has been prepared by Lancaster University (UK)'s 
Unit for Computer Research on the English Lan- 
guage, according to specifications provided by 
ATR (Japan)'s Statistical Parsing Group. First 
we provide a "static" description, with (a) a dis- 
cussion of the mode of selection and initial pro- 
cessing of text for inclusion in the treebank, and 
(b) an explanation of the scheme of grammatical 
annotation we then apply to the text. Sec.ond, we 
supply a "process" description of the treebank, in 
which we detail the physical and computational 
mechanisms by which we have created it. Finally, 
we lay out plans for the further development of 
this new treebank. 
All of the features of the ATR/Lancaster Tree- 
bank that are described below represent a radi- 
cal departure from extant large-scale (Eyes and 
Leech, 1993; Garside and McEnery, 1993; Marcus 
et al., 1993) treebanks. We have chosen in this ar- 
ticle to present our treebank in some detail, rather 
than to compare and contrast it with other tree- 
banks. But the major differences between this and 
earlier treebanks can easily be grasped via a corn- 
*Current alfiliation: Renaissance Technologies 
Corp., 25 East Loop Road, Suite 211, Stony 
Brook, NY 11776 USA; Consultant, ATR Interpret- 
ing Telecommunications Laboratories, 3-12/94 
parison of the descriptions below with those of the 
sources just, cited. 
2 General Description of the 
Treebank 
2.1 Document Selection and 
Preprocessing 
The ATR/Lancaster Treebank consists of approx- 
imately 730,000 words of grammatically-analyzed 
text divided into roughly 950 documents ranging 
in length ffmn about 30 to about 3600 words. 
The idea informing the selection of documents 
for inclusion in this new treebank was to pack 
into it the maximum degree of document variation 
along many different scales---document length, 
subject area, style, point of view, etc. -but with- 
out establishing a single, predetermined classifica- 
tion of the included documentsJ Differing pur- 
poses for which the treebank might be utilized 
may favor differing groupings or classifications of 
its component documents. Overall., the rationale 
for seeking to take as broad as possible a sample of 
current standard American English, is to support 
the parsing and tagging of unconstrained Amer- 
ican English text by providing a training corpus 
which includes documents fairly similar to almost 
any input which might arise. 
Documents were obtained from three sources: 
the Internet; optically-scanned hardcopy "occa- 
sional" documents (restaurant take out menus; 
flmdraising letters; utility bills); and purchase 
from commercial or academic vendors. To illus- 
trate the diverse nature of the documents included 
in this treebank, we list, in Table 1, titles of nine 
typical documents. 
In general, and as one might expect, the doc- 
uments we have used were written in the early 
to mid 1990s, in the United States, in "Standard" 
American English. However, there are fairly many 
1as was done, by contrast, in the Brown Corpus 
(Kucer~t and Francis, 1967). 
107 
Empire Szechuan Flier (Chinese take out food) 
Catalog of Guitar Dealer 
UN Chart, er: Chapters 1 5 
Airplane Exit-Row Seating: Passenger Information Sheet 
Bicycles: How 31'<) Trackstand 
Gow~'rnment: US Goals at G7 
Shoe Store Sale Flier 
Hair Loss t{,ernedy Brochure 
Cancer: Ewing's Sarcoma Patient Infbrmation 
'Fable 1: Nine Typical Docnments From A'ft{/I,an<:aster Treebank 
exceptions: documents written by Captain John 
Smith of Plymouth Plantation (1600s), by Ben- 
jamin Franklin (1700s), by Americans writing in 
periods throughout the 1800s and 1900s; docu- 
ments written in Australian, British, Canadian, 
and Indian English; and docnments featuring a. 
range of dialects and regional wtrieties of cur= 
rent American English. A smattering of such 
documents is included because within standard 
English, these linguistic varieties are sometimes 
quoted or otherwise utilized, and so they should 
be represented. 
As noted abow=', each document within the trek- 
bank is classified along many different axes, in or- 
der to support a large variety of different task 
specific groupings of the documents. Each docu- 
ment is classifed according to tone, style, linguistic 
level, point of view, physical description of doc- 
ument, geographical background of author, etc. 
Sample values for these attributes are: "friendly", 
"dense", "literary", %echnical", "how-to guide", 
and "American South", respectively. To convey 
domain information, one or more Dewey Decimal 
System three digit classifiers are associated with 
each document. For instance, for the cv o\[' a f>hys - 
iologist, Dewey 612 and 616 (Medical Sciences: 
\]lumen Physiology; Diseases) were chosen. On 
a more mundane, "bookkeeping" level, values for 
text title, author, publication date, text source, 
etc. are recorded as well. 
An SGML like markup language is used to cap- 
lure a variety of organizational level facts about 
each document, such as LIST structure; TITLEs 
and CAPTIONs; and even more recondite events 
such as POEM and IMAGE. HIGltLl(?,II'\]'ing of 
words and phrases is recorded, along with the w~- 
riety of highlighting: italics, boldface, large font, 
e~c. Spelling errors and, where essential, other ty- 
pographical lapses, are scrupulously recorded and 
then corrected. 
Tokenization (i.e. word splitting: Edward's 
--+ Edward's) and sentence spli~ting (e.g. tie 
said, "Hi there. Long time no see." ~ (Sen- 
tence.l:) Be said, (Sentence.2:) "Hi there. (Sen- 
tence.3:) Long time no see.") are performed by 
hand according to predetermined policies. Hence 
the treebank provides the resource of multifarious 
correct instances of word and sentence sI>litting. 
2.2 Scheme of Grammatical Annotation 
tlcretofore, all existing large =scale treebanks have 
employed the gra.nmnatical analysis technique of 
skeleton parsin(\] (Eyes and Leech, 1993; Garside 
and McEnery, 1993; Marcus et el., 1993), 2 in 
which only a partial, relatively sket<'hy, grammat- 
ical analysis of each sentence in the treebank is 
provided, a In contrast, the AT\[g/Lancaster Tree- 
bank assigns to each of its sentences a full and 
(:omplete grammatical analysis with respect to a 
very detailed, very comprehensive broad coverage 
grammar of English. Moreover, a very large, 
highly del;ailed part of speech tagset is used to 
label each word of each sentence with its syntac- 
tic a~,d semantic categories. The result is an ex- 
tremely specific and informative syntactic and se- 
mantic diagram of every sentence in the treebank. 
This shift fi'om skeleton parsing based tree- 
banks to a treebank providing flfll, detailed gram- 
matical analysis resolves a set of problems, de- 
tailed in (Black, 1994), involved in using skeleton 
parsing based treebanks as a means of initializ- 
ing training statistics for probabilistic grammars 
(Black et el., 1993). Briefly, the tirst of these prob- 
lems, which applies even where the grammar be- 
ing trained has been induced from the training 
treebank (Sherman el; al., 1990), is thai; the syn- 
tactic sketchiness of a skeleton ~ parsed treebank 
leads a statistical training algorithm to overcount, 
in some circumstances, and in other cases to un- 
~The 1995 release Penn Treebank adds flmctionM 
intormation to some nonterminals (Marcus et al., 1994), 
but with its rudimentary (roughly 45 tag) 
tagset, its non detailed internal analysis of noun con> 
pounds and NPs more generally, its lack of seman- 
tic categorization of words and phrases, etc., it ar- 
guably remains a skeleton parsed treebank, albeit an 
enriched one. 
aA ditfercnt kind of partial parse- crucially, 
one generated automatically and not by hand- 
characterizes the "treebank" produced by processing 
the 200 million word Birmingham \[?'niversity (UK) 
Bank of-English text corpus with the dependency 
grammar-based ENGCG lfelsinki Parser (Karlsson et 
el., 11995). 
108 
dercotlnl, instances of rule firings it+ trainil,g data 
(treel)a, nk) pars(s, and thus 1,o incorrecl,ly esti- 
niatc rtth> probal)ilitics. The second I>rol>leut is 
that where the gramniar being l, raino(l is more 
detailed syntactically (,hail I,\[le sl,:ehq.Oli parsing 
based trainiilg I, reelm.nk, the training corptts radi- 
(:ally tltl(1Cq'l)el'\['orlllS ill il,s crucial job of speci\['yiilg 
correct parses For training i>url)osrs (l+lack+ 1!)gd). 
It) addition to resolvhig gramtna,r t, rahling 
prol)len:ls, our Trocl)atik l-ir<-ivides a tneatis o\[' 
training non grmmnar based parsi.,g t.Iroc(>dures 
(Brill, 1993; Jelinek ctal., t994; Mag;erlnalt, 1995) 
at, new, higher l('v<'ls of gI'~Lttltll;/t,i('al detail and 
(~Oll l i) l+C h('qD-iiv(~l icss. 
'l'r('eIH/,llk S(?lltCll('(~s ;a.r(~ \[-iarso(I in {CI'IIIS O\[' i,\]1(" 
/1 7'/~ I¢?Lqlish (Trammar, whose charn('lerist,ics w<' 
will bri<qly d(>scribe. 
'Fhe (;Nmmlar's t)a.rt of SF,<>(;c\]l <.aSset r('s(qJ> 
I)les llie 17!) Lag (',laws I, agsot (l<w<.l(.Il)(~<l I>y 
lY(:I{I,;1, (Eyes and I,cc<'\[i, 19!);l): bul with tiullpr- 
'.)us maj<)r a.,l<l nlinor di\[f'(!r<mc<!s. ()no ntajor dif 
t'er<~nc(~, for inst,a.c<:, is <,hal, I,he ATII. lags('< (:al>- 
lures the (lifft~r<',c(' t>etwc'(m e.g. <<wall <:ov<wi,f~ 
wh('r(> +'('ov(~ril~+g" is a l(~xica\]izc<t ,,(>ul/ cry<lifts in 
-inS, and %1,<" cov('ring o\[' all l-i('l.s", wh<'r<' +'('ovcr- 
i.g" is a verbal llOtlll. In (:laws pl+;ICl, iCe~ I>ot.h arc 
NNI (sillgular conmlon n,:)u.). The A'I'I{ i.gsel 
innovatx~s the lag 1.y\[>c NV\'(; for verl)a\] nouns. 
Anoth<:r n:tjor difl'<'x<'tw<> is t.h<: (ling\[(,.l) us(. <)\[ 
"sul)('atcgorizati(>n', (".g. VI)t+~I,()B,I for (Io,ibl<. 
ol)jecl, v(>rt)s (t<mch Bill l,atin, el(+), 
Each v<+rl>, tl(-illt\], a(lj(~ctiv<, a.<l adverb it, 
the A'I'R tagsct includes a se.ianlic I:~b<'\], 
('hos('.. from 42 n<)uli/a<lj<:ctiv<ja<lv<.'l) <';at.<> 
gorics an<l 2+) verl)/vert)al (t;l+l.(+gjI'(-il'i<>S, S<>tll(' 
ov(+,r\]ap ('xist, ing b(~lwe('n t, hes(' cat<'gory se/,s. 
'|'hese s<>tu.~l.nl.i(: (:al;(>go|+i('s are in\[.(~.(i('({ fbr (*,'+9 
"%'la~+dard American lCngli.sD" Icml, i. a~\] do- 
m, ai~. Sani\])l(~ <;al,(~gories in(hid(': +'phys 
ical.a.l, tt:ib,m? + (t,o,,t,s/adjectiv<~s/adv(~rbs), "ai= 
ter" (verbs/vcrl~als), and "intcrp(+rsonal.;,ct." 
( ,,out,s / a<lj<+cl:iv<~s / adv<,rl>s / v(+rhs /v<..I;als ) . 'lh<>y 
wcrc d<weh-ip<'d by the A'I'R graulmaria, a,,d the. 
prov<'.n and r(~li.(.(I via <lay in <lay ottl lagging 
for six mouths at ATI{, by two huu,ul "+lrcc 
bankers", l\[len \['or \['our ttlonl, hs at, I,nticast,er I)y 
five l,r(>el)auk<~rs, with daily int,eract,ions ;llllOlig 
I,r<~el)aukcrs, and I-iei:~w~en i h<" l.r('el>ank(q's and i;h(, 
ATR g~ra li iIitariali. 
\[1' we ignore 1,he seltiaall.ic F, orlion o\[" A'I'I~ lags, 
llie t/-tgsel, cont.ains 165 (tifl'(q'ent l.ags. In('lu(I 
big the S('liia.iil, ic cai,egories iii the tag.',, i.\[lere are 
rougllly 2700 i,ags. As is l, li<> ('as(~ in I\]le (ll;tws 
t.at~sct. , so (:ailed "ditl.<)Lags" (:;Ill I)C ('r(>nl.<'(l I)ase(l 
Oil a\[ni<)sl. ;iliy l,;I..~ of t.hc lagset, 17)r t h<' l)urll(-is~' 
o1' lal>ellhig tiiul\[.iwor<l (,Xl)l'(>ssioils. \["or il/Sl;illCO, 
"will o' the wisp" is lal>clle<i ;is a ,4 word si\]lgtllar 
COililtiOil liC-illD. 'l'liis process <:;tl\] ad<l <'olisidcral>lc 
IiIlll-il-iers 0\[' Lags (,o I,}l(! ahoy(" tot, als. 
,<"Jelil,ell(;c~s ill I, tie Tre<>lm.nk ar(~ l>ars(xl wil;li 
resl-iecl, t,o the A7'I7 lgnglish (Tran+mar. The 
(',rammar, a feature based context fl:cc phrase 
st.ructtJre gratmnar, is related to the IBM t:,nglish 
(h'ammar as l>ul)lished in (Blac.k ctal., 1993), but 
differs tuorc: l'rolN, the IBM (h'atimlar than our 
l, agset does t'roln the (',laws tagsel,. For instance, 
the notion of "numntol\]iC" has no al>plicatioi~ to 
the ATI{ (~rallllllar; l, ho ATR Gratl\]l\[lal" has 67 
features and I\]O0 rules, whereas the IBM (~ram+ 
mar had 40 \[>al, tn'es aud '750 rules, c:t+c. 
'l'he i>reciscly corrcc.t parse (;~s pre st+ecificd by 
a human "trecbanker") figures among the parses 
I-iroduced for any given sentence by tile A'I'R 
(~ralnlnar, roughly !)0% of the time, \['or l;ext, o1' 
the unconsl,rained, wide open SOl'l, that, tim Tree- 
DaHk ix <'onil>OSCd of. The job of the treebattkers is 
l.o local.<' this exact; l>arse, \['or each s(mt.<m(:e, and 
add it to tlp 'l'recl>ar~k. 
\["i~tlre l shows |,wo Salll\[:.\[(+ parsc({ SOlil,ellC+c+s 
/'rolll l.he A'll't Treebank (aud originally l'rom a 
(thitwse I,ake oul, fOOd \[li(>r). l~ecatlse il, is inf'or- 
nial,ive t,o know which of the 1 lO0 rtlles is used nt 
a givou lr(,o no(h'+ and sitice the particular "tlon- 
lernlinal <'at, egory" associated wilh any .lode of 
llw l, ree is alwa.ys rccoveralfle, 4 nodos are labelled 
with ATIL (~ratnnlar rule nantes ratl>t + t.ha,, as is 
lll()l+e tlSllal+ with llOltl.(~rlttillal iHUllCS. 
3 Producing tile Treebank 
Ill ibis I>art of t.\]le article, we Liirll t'rolll "what," t,o 
"l,)w', a,d discuss the nlcchaiiistns by which the 
A' 1' I{ / I,ail('asl er ' Fr<'etm.ul~ was produced+ 
3.1 The Software Ilackbone GWBTooI: 
A Trecbanker's Workstation 
(a\¥B'l'ool is n Mol,if based X-Windows appli(:a- 
ti(>,t which allows the treel>anker to int.era{:t with 
the A'I'I{ l%glish Gra.mmar in order to produce 
\[,he lltOsl accllrate t,reel:.alik ill {,he shorl.csL amotuH, 
o\[' t.illl<'. 
The i,reebauking process begins ill the Treel>ank 
I:,dh,or sol'ten oF the treel>anker's worksta, tion with 
a list o\[ + scnl,enccs lagged wii,h pat'l, ol'-st>eech cal.e 
gorics. 'l'he 1,1'~>N)a.nkc'r sch;cl,s a SCtll;eltcc. \['rOtll the 
list, for proccsshig. Next, with tit<+ cli<'k o1" a bttl:. 
lot<, t ho Tl'eNmnk l'Rlitor graphically displays tit<> 
l>arse \[bt'<>st+ \[br 1.he s<:lll,ellC(' ill ~1 IlIC,/ISO-SCI\]SiI, iVC 
I'arse Tr<+r wi,dow (Figure 2). Each node dis 
i)\]ay<'(l t'el+r(+s(>nts a const, ituel+lt, in l,h<+ parse \['()rest. 
A shaded cons<it<tent no<h: itidicates that there ar<: 
all ernatiw+ alialyses of' thai, constii,uent, only <)it<! 
ol + which is displayed, lay clicking t, lic right+ niottsc 
l)ul3on Oll ;I shaded node, t.hc: l.r<~cbanl,:cr can dis- 
play a pOl)Ul> nicnu listhlg tho all,el'nat\]w+ analy- 
s('s, atly of which <:atl b(> disl>laycd l>y s<tecl,hlg t, he 
al)l>ropriat,e ill(Hill il.etU. (',lickhig i, he h>f't ntouse 
I)tll,l,Oll (.Ill ~-t COIISl,iI, IICII{, l\]OdC pOpS tip a Willdow 
lisl,htg tho fea,t, llt'e wthles for l, ha, l, const, it+llCnL. 
411, is contained in the rule t/;i,tne it, stir. 
109 
<S id="39" count=8> 
<HIGH r endit ion="it alic"> 
\[start \[quo (_( \[sprpd23 \[sprime2 \[ibbar2 Jr2 Please_RRCONCESSIVE r2\] ibbar2\] 
\[sc3 \[v4 Mention_VVIVERBAL-ACT \[nbar4 \[dl this_DDl dl\] 
\[nla coupon_NNIDOCUMENT nla\] nbar4\] \[fal when_CSWHEN 
\[vl ordering_VVGINTER-ACT vl\] fall v4\] sc3\] sprime2\] sprpd23\] )_) quo\] start\] 
</HIGH> 
</S> 
<S id="48" count=S> 
<HIGH rendition ="large"> 
\[start \[sprpd22 \[coord3 \[cc3 \[ccl OI~_CCOR ccl\] cc3\] 
\[nbarl3 \[d3 ONE_MCIWORD d3\] \[jl FREE_JJSTATUS j l\] \[n4 \[nla FANTAIL_NNIANIMAL nla\] 
\[nla SHRIMPS_NNIFOOD nla\] n4\] nbarl3\] coord3\] sprpd22\] start\] 
</HIGH> 
</S> 
Figure 1: Two ATR/Lancaster Treebank Sentences (8 words, italicized; 5 words, large font) from Chinese 
Take-Out Food Flier 
V~: i vbar2 ~prl.olJ .I 
A 
A 
,,lsuBsT,, EI 
A 
pos = v 
barnum = two 
n_sem = substance 
number = V5 
tense_aspect = pres 
v sem = send 
v type = main_verb 
vp_type = aux 
Figure 2: The GWBTool Treebanker's Workstation Parse Window display, showing the parse forest for 
an example sentence. On the far right, the feature values of the VBAR2 constituent, indicating that 
the constituent is an auxiliary verb phrase (bar level 2) containing a present-tense verb phrase with 
noun semantics SUBSTANCE and verb semantics SEND. The fact that the number feature is variable 
(NUMBER=V5) indicates that the number of the verb phrase is not specified by the sentence. The 
shaded nodes indicate where there are alternative parses. 
ii0 
The Treebank Editor also displays the number 
of parses in the parse forest. If the parse forest 
is unmanageably large, the treebanker can par- 
tially bracket the sentence and, again with the 
click of a button, see the parse forest containing 
only those parses which are consistent with the 
partial bracketing (i.e. which do not have any 
constituents which violate the constituent bound- 
aries in the partial bracketing). Note that the 
treebanker need not specify any labels in the par- 
tial bracketing, only constituent boundaries. The 
process described above is repeated until the tree- 
banker can narrow the parse forest down to a sin- 
gle correct parse. Crucially, for experienced Lan- 
caster treebankers, the number of such iterations 
is, by now, normally none or one. 
3.2 Two-Stage Part-Of-Speech Tagging 
Part-of-speech tags are assigned in a two-stage 
process: (a) one or more potential tags are as- 
signed automatically using the Claws HMM tag- 
ger (?); (b) the tags are corrected by a treebanker 
using a special-purpose X-windows-based editor, 
Xanthippe. This displays a text segment and, tbr 
each word contained therein, a ranked list of sug- 
gested tags. The analyst can choose among these 
tags or, by clicking on a panel of all possible tags, 
insert a tag not in the ranked list. 
The automatic tagger inserts only the syntactic 
part of the tag. To insert the semantic part of the 
tag, Xanthippe presents a panel representing all 
possible semantic continuations of the syntactic 
part of the tag selected. 
'lbkenization, sentence-splitting, and spelt 
checking are carried out according to rule by the 
treebankers themselves (see 2.1 above). However, 
the Claws tagger performs basic and preliminary 
tokenization and sentence-splitting, for optional 
correction using the Xanthippe editor. Xanthippe 
retains control at all times during the tag correc- 
tion process, for instance allowing the insertion 
only of tags valid according to the ATR. Gram- 
mar. 
3.3 The Annotation Process 
Initially a file consists of a header detailing the 
file name, text title, author, etc., and the text 
itself, which may be in a variety of formats; it; may 
('ontain HTML mark-up, and files vary in the way 
in which, for example, emphasis is represented. 
The first stage of processing is a scan of the text to 
establish its format and, for large files, to delimit 
a sample to be annotated. 
The second stage is the insertion of SGML 
like mark-up. As with the tagging I)rocess, this 
is done by an automatic procedure with manual 
correction, using microemacs with a special set of 
nlacros. 
Third, the tagging process described in section 
3.2 is carried out. The tagged text is then ex- 
tracted into a file for parsing via GWBTool (See 
3.1.1). 
The final stage is merging the parsed and tagged 
text with all the annotation (SGML-like mark- 
up, header information) for return to ATR. 
3.4 Staff Training; Output Accuracy 
Even though all Treebank parses are guaranteed 
to be acceptable to the ATR Grammar, insuring 
consistency and accuracy of output has required 
considerable planning and effort. Of all the parses 
output for a sentence being treebanked, only a 
small subset are appropriate choices, given the 
sentence's meaning in the document in which it 
occurs. The five Lancaster treebankers had to un- 
dergo extensive training over a long period, to un- 
derstand the manifold devices of the ATR Gram- 
mar expertly enough to make the requisite choices. 
This training was affected in three ways: a week 
of classroom training was followed by four months 
of daily email interaction between the treebankers 
and the creator of tile ATR Grammar; and once 
this training period ended, daily Lancaster/ATR 
email interaction continued, as well as constant 
consultation among the treebankers themselves. 
A body of documentation and lore was developed 
and frequently referred to, concerning how all se- 
mantic and certain syntactic aspects of the tagset, 
as well as various grammar rules, are to be applied 
and interpreted. (This material is organized via 
a menu system, and updated at least weekly.) A 
searchable version of files annotated to date, and 
a list of past tagging decisions, ordered by word 
and by tag, are at the treebankers' disposal. 
In addition to tile constant dialogue between 
the treebankers and the ATR grammarian, Lan- 
caster output was sampled periodically at ATR, 
hand-corrected, and sent back to the treebankers. 
In this way, quality control, determination of out- 
put accuracy, and consistency control were han- 
dled conjointly via the twin methods of sample 
correction and constant treebanker/grammarian 
dialogue. 
With regard both to accuracy and consistency 
of output analyses, individual treebanker abilities 
clustered in a fortunate manner. Scoring of thou- 
sands of words of sample data over time revealed 
that three of the five treebankers had parsing error 
rates (percentage of sentences parsed incorrectly) 
of 7%, 10%, and 14% respectively, while the other 
two treebankers' error rates were 30% and 36% 
respectively. Tagging error rates (percentage of 
all tags that were incorrect), similarly, were 2.3%, 
1.7%, 4.0%, 7.3% and 3.6%. Expected parsing er- 
ror rate worked out to 11.9% for the first three, 
but 32.0% for the other two treebankers; while 
expected tagging error rates were 2.9% and 6.1% 
respectively. 5 
5Almost all t~gging errors were semantic. 
iii 
What is fortnnate about this clustering of abil- 
it, ies is that the less able treebankers were also 
much less prolific than the others, producing only 
25% of the total treel)ank. Therefore, we are 
provisionally excluding this 25% of the treebank 
(about 180,000 words) fi'om use fbr parser train- 
ing, though we are experimeating with the use of 
the entire treebank (expected tagging error rate: 
3.9%) for tagger training. Finally, parsing and 
tagging consistency among the first, three tree- 
bankers appears high. 
4 Conclusion 
Within the next two years, we intend to produce 
Version 2 o\[' our Treebank, in which the 25% of 
the treebank that is currently suitable for t, rainiug 
taggers but not parsers, is fully corrected/~ 
Over the next several years, the A'\['R/Lancaster 
Treebank of American English will form the ba- 
sis for the research of A'l'l{'s Statistical Parsing 
Group in statistical parsing, part of speech tag- 
ging, and related fields. 
References 
E. Black, F. Jelinek, J. Lai\['e, rty, 1{,. Mercer, S. 
l{oukos. 1992. l)ecision tree models applied to 
the labelling of text with parts of speech. In 
Proceedings, DARPA Speech a~d Natural Lan- 
.quagc Workshop, Arden llouse, Morgan Kauf- 
man Publishers. 
E. Black, IL Ga.rside, and G. l,eech, Editors. 
1993. Statistically Dr'iron Computer Gram- 
mars Of l';nglish: The IBM/Lanca,sler Ap- 
proach, l{.odopi Editions. Amsterdam. 
E. I~lack. 199d. An experiment in customizing the 
Lancaster Treebank. In Oostdijk and de Ilaan, 
1 !)9/1, pages 159-168. 
E. Brill. 1993. A utomatic grammar induclion and 
parsing fl'ee text: A fransf'ormation I)ased ap- 
proach. In Proceedi'ngs, 31sl Annual Mceli'og oJ 
lk<- Associalio~ for (/o'mpulalional Linguislics, 
Association for Computational Linguistics. 
E. Brill. 1994. Some Advances in 
TransR)rmation Based Part of Speech Tagging. 
In Proceedings of lkc Twelfll~ National ConJ'cr- 
encc o~ Arlificial h~telligencc, pages 722-727, 
Seattle, Washington. American Association for 
Artificial Intelligence. 
E. Eyes and G. Leech. 1993. Syntactic Annol.a- 
lion: I,inguistic Aspects of (h:ammatical Tag- 
ging and Skeleton I)arsing. ('hapter 3 of Itlack 
et. al. 1993. 
c'Scv(m tenths of ~his 25% is already correct, so 
that lhe task involved is re l)axsing 30% of' 25% (- 
7.5%) of the trceba.nk. 
IL Garside and A. McEnery. 1993. Treebank- 
ing: The Compilation of'a Corpus of Skeleton 
Parsed Sentences. C'hapter 2 of Black el;. al. 
1993. 
F. ,lelinek, J. l,afferty, 1). Magennan, R. Met 
cer, A. Ratnaparkhi, S. R.oukos. 1994. Deci- 
sion Tree Parsing using a Hidden Derivation 
Model. In Proceedings, ARPA Workshop on 
Human Language Technology, pages 260-2(55, 
Plainsboro, New Jersey, AI{PA. 
F. Karlsson, A. Voutilainen, J. Heikkila, aml 
A. Anttila. 1995. Constraint Grammar: A 
Language Independent System for Parsing Un- 
restricted Text. Mouton de Gruyter: Berlin and 
New York. 
11. Kucera and W. N. Francis. 1967. Compltla- 
lional Analysis of Presenl- Day American l~%- 
qlish. 13rown University Press. Providence, RI. 
I). M. Magerman and M. P. Marcus. 1991. Pearl: 
A Probabilistic Chart Parser. In Proceedinqs, 
l'Juropean A CL Conference, March 1991, Berlin, 
Germany. 
l). M. Magerman. 1995. Statistical Decision Tree 
Models for Parsing. In Proceedings, 33rd A~- 
nual Mecling of lhe Association for' Compula- 
tional Linguistics, pages 276 283, (,%mhridge, 
Massachusetts, Association for (k)mputational 
Linguistics. 
M. 1 ). Marcus, B. Santorini, and M. A. 
Marcinkiewicz. 1993. Building a Large Anno- 
tated (:orpus of English: The Pe\]m Treebank. 
Computalional Ling,6slics, 19.2:313-330. 
M. Mar(ms, G. Kim, M. A. Marcinkiewicz, R. 
Maclntyre, A. Bies, M. Ferguson, K. Katz, and 
B. Schasberger. 1994. The Penn Treel)ank: An- 
notating Predicate Argument Structure. Pro- 
ceedinqs, ARPA lluman Language "l>chnology 
Workshop, Morgan Kaufinann Publishers Inc., 
San Francisco. 
1~. Merialdo. 1!)94. Tagging English Text with 
a Probabilistic Model. Compulalional Li'ngu~s- 
ties, 20.2:155- 171. 
N. ()ostdi.jl~ and P. de lla.an, l'3ditors. 199d. 
Cor7ms Ilased l~cscarch Into Language: h~ kou- 
our@ Yau Aarls. Rodol)i l~;ditions. Amster 
dal//. 
I{. A. Sharmau, F. ,Icliuek, and 1{. Mercer. 1990. 
Generating a (~rammar for Statistical Train- 
ing. In Proceedin/is, DA RPA £'peeck aud /v%t- 
ural Lan.quagc Workshop, Hidden Valley, Penn- 
sylvania. 
l{. Weischedel, M. Meteer, 1{. Schwartz, I,. 
l{amshaw, and J. I)almuc('i. 1993. Coping 
with Ambiguity and iJnknown Words through 
Probabilistic Models. (7ompalational Liuguis- 
lies, 19.2:359-382. 
112 
