<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1020">
  <Title>Beyond Skeleton Parsing: Producing a Comprehensive Large-Scale General-English Treebank With Full Grammatical Analysis</Title>
  <Section position="3" start_page="0" end_page="108" type="metho">
    <SectionTitle>
2 General Description of the Treebank
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="107" type="sub_section">
      <SectionTitle>
2.1 Document Selection and Preprocessing
</SectionTitle>
      <Paragraph position="0"> The ATR/Lancaster Treebank consists of approximately 730,000 words of grammatically-analyzed text divided into roughly 950 documents ranging in length ffmn about 30 to about 3600 words.</Paragraph>
      <Paragraph position="1"> The idea informing the selection of documents for inclusion in this new treebank was to pack into it the maximum degree of document variation along many different scales---document length, subject area, style, point of view, etc. -but without establishing a single, predetermined classification of the included documentsJ Differing purposes for which the treebank might be utilized may favor differing groupings or classifications of its component documents. Overall., the rationale for seeking to take as broad as possible a sample of current standard American English, is to support the parsing and tagging of unconstrained American English text by providing a training corpus which includes documents fairly similar to almost any input which might arise.</Paragraph>
      <Paragraph position="2"> Documents were obtained from three sources: the Internet; optically-scanned hardcopy &amp;quot;occasional&amp;quot; documents (restaurant take out menus; flmdraising letters; utility bills); and purchase from commercial or academic vendors. To illustrate the diverse nature of the documents included in this treebank, we list, in Table 1, titles of nine typical documents.</Paragraph>
      <Paragraph position="3"> In general, and as one might expect, the documents we have used were written in the early to mid 1990s, in the United States, in &amp;quot;Standard&amp;quot; American English. However, there are fairly many  exceptions: documents written by Captain John Smith of Plymouth Plantation (1600s), by Benjamin Franklin (1700s), by Americans writing in periods throughout the 1800s and 1900s; documents written in Australian, British, Canadian, and Indian English; and docnments featuring a.</Paragraph>
      <Paragraph position="4"> range of dialects and regional wtrieties of cur= rent American English. A smattering of such documents is included because within standard English, these linguistic varieties are sometimes quoted or otherwise utilized, and so they should be represented.</Paragraph>
      <Paragraph position="5"> As noted abow=', each document within the trekbank is classified along many different axes, in order to support a large variety of different task specific groupings of the documents. Each document is classifed according to tone, style, linguistic level, point of view, physical description of document, geographical background of author, etc.</Paragraph>
      <Paragraph position="6"> Sample values for these attributes are: &amp;quot;friendly&amp;quot;, &amp;quot;dense&amp;quot;, &amp;quot;literary&amp;quot;, %echnical&amp;quot;, &amp;quot;how-to guide&amp;quot;, and &amp;quot;American South&amp;quot;, respectively. To convey domain information, one or more Dewey Decimal System three digit classifiers are associated with each document. For instance, for the cv o\[' a f&gt;hys iologist, Dewey 612 and 616 (Medical Sciences: \]lumen Physiology; Diseases) were chosen. On a more mundane, &amp;quot;bookkeeping&amp;quot; level, values for text title, author, publication date, text source, etc. are recorded as well.</Paragraph>
      <Paragraph position="7"> An SGML like markup language is used to caplure a variety of organizational level facts about each document, such as LIST structure; TITLEs and CAPTIONs; and even more recondite events such as POEM and IMAGE. HIGltLl(?,II'\]'ing of words and phrases is recorded, along with the w~riety of highlighting: italics, boldface, large font, e~c. Spelling errors and, where essential, other typographical lapses, are scrupulously recorded and then corrected.</Paragraph>
      <Paragraph position="8"> Tokenization (i.e. word splitting: Edward's --+ Edward's) and sentence spli~ting (e.g. tie said, &amp;quot;Hi there. Long time no see.&amp;quot; ~ (Sentence.l:) Be said, (Sentence.2:) &amp;quot;Hi there. (Sentence.3:) Long time no see.&amp;quot;) are performed by hand according to predetermined policies. Hence the treebank provides the resource of multifarious correct instances of word and sentence sI&gt;litting.</Paragraph>
    </Section>
    <Section position="2" start_page="107" end_page="108" type="sub_section">
      <SectionTitle>
2.2 Scheme of Grammatical Annotation
</SectionTitle>
      <Paragraph position="0"> tlcretofore, all existing large =scale treebanks have employed the gra.nmnatical analysis technique of skeleton parsin(\] (Eyes and Leech, 1993; Garside and McEnery, 1993; Marcus et el., 1993), 2 in which only a partial, relatively sket&lt;'hy, grammatical analysis of each sentence in the treebank is provided, a In contrast, the AT\[g/Lancaster Tree-bank assigns to each of its sentences a full and (:omplete grammatical analysis with respect to a very detailed, very comprehensive broad coverage grammar of English. Moreover, a very large, highly del;ailed part of speech tagset is used to label each word of each sentence with its syntactic a~,d semantic categories. The result is an extremely specific and informative syntactic and semantic diagram of every sentence in the treebank.</Paragraph>
      <Paragraph position="1"> This shift fi'om skeleton parsing based tree-banks to a treebank providing flfll, detailed grammatical analysis resolves a set of problems, detailed in (Black, 1994), involved in using skeleton parsing based treebanks as a means of initializing training statistics for probabilistic grammars (Black et el., 1993). Briefly, the tirst of these problems, which applies even where the grammar being trained has been induced from the training treebank (Sherman el; al., 1990), is thai; the syntactic sketchiness of a skeleton ~ parsed treebank leads a statistical training algorithm to overcount, in some circumstances, and in other cases to un~The 1995 release Penn Treebank adds flmctionM intormation to some nonterminals (Marcus et al., 1994), but with its rudimentary (roughly 45 tag) tagset, its non detailed internal analysis of noun con&gt; pounds and NPs more generally, its lack of semantic categorization of words and phrases, etc., it arguably remains a skeleton parsed treebank, albeit an enriched one.</Paragraph>
      <Paragraph position="2"> aA ditfercnt kind of partial parse- crucially, one generated automatically and not by handcharacterizes the &amp;quot;treebank&amp;quot; produced by processing the 200 million word Birmingham \[?'niversity (UK) Bank of-English text corpus with the dependency grammar-based ENGCG lfelsinki Parser (Karlsson et el., 11995).</Paragraph>
      <Paragraph position="3">  dercotlnl, instances of rule firings it+ trainil,g data (treel)a, nk) pars(s, and thus 1,o incorrecl,ly estiniatc rtth&gt; probal)ilitics. The second I&gt;rol&gt;leut is that where the gramniar being l, raino(l is more detailed syntactically (,hail I,\[le sl,:ehq.Oli parsing based trainiilg I, reelm.nk, the training corptts radi(:ally tltl(1Cq'l)el'\['orlllS ill il,s crucial job of speci\['yiilg correct parses For training i&gt;url)osrs (l+lack+ 1!)gd). It) addition to resolvhig gramtna,r t, rahling prol)len:ls, our Trocl)atik l-ir&lt;-ivides a tneatis o\[' training non grmmnar based parsi.,g t.Iroc(&gt;dures (Brill, 1993; Jelinek ctal., t994; Mag;erlnalt, 1995) at, new, higher l('v&lt;'ls of gI'~Lttltll;/t,i('al detail and  lY(:I{I,;1, (Eyes and I,cc&lt;'\[i, 19!);l): bul with tiullpr'.)us maj&lt;)r a.,l&lt;l nlinor di\[f'(!r&lt;mc&lt;!s. ()no ntajor dif t'er&lt;~nc(~, for inst,a.c&lt;:, is &lt;,hal, I,he ATII. lags('&lt; (:al&gt;lures the (lifft~r&lt;',c(' t&gt;etwc'(m e.g. &lt;&lt;wall &lt;:ov&lt;wi,f~ wh('r(&gt; +'('ov(~ril~+g&amp;quot; is a l(~xica\]izc&lt;t ,,(&gt;ul/ cry&lt;lifts in -inS, and %1,&lt;&amp;quot; cov('ring o\[' all l-i('l.s&amp;quot;, wh&lt;'r&lt;' +'('ovcri.g&amp;quot; is a verbal llOtlll. In (:laws pl+;ICl, iCe~ I&gt;ot.h arc  wcrc d&lt;weh-ip&lt;'d by the A'I'R graulmaria, a,,d the.</Paragraph>
      <Paragraph position="4"> prov&lt;'.n and r(~li.(.(I via &lt;lay in &lt;lay ottl lagging for six mouths at ATI{, by two huu,ul &amp;quot;+lrcc bankers&amp;quot;, l\[len \['or \['our ttlonl, hs at, I,nticast,er I)y five l,r(&gt;el)auk&lt;~rs, with daily int,eract,ions ;llllOlig I,r&lt;~el)aukcrs, and I-iei:~w~en i h&lt;&amp;quot; l.r('el&gt;ank(q's and i;h(, ATR g~ra li iIitariali.</Paragraph>
      <Paragraph position="5"> \[1' we ignore 1,he seltiaall.ic F, orlion o\[&amp;quot; A'I'I~ lags, llie t/-tgsel, cont.ains 165 (tifl'(q'ent l.ags. In('lu(I big the S('liia.iil, ic cai,egories iii the tag.',, i.\[lere are rougllly 2700 i,ags. As is l, li&lt;&gt; ('as(~ in I\]le (ll;tws t.at~sct. , so (:ailed &amp;quot;ditl.&lt;)Lags&amp;quot; (:;Ill I)C ('r(&gt;nl.&lt;'(l I)ase(l Oil a\[ni&lt;)sl. ;iliy l,;I..~ of t.hc lagset, 17)r t h&lt;' l)urll(-is~' o1' lal&gt;ellhig tiiul\[.iwor&lt;l (,Xl)l'(&gt;ssioils. \[&amp;quot;or il/Sl;illCO, &amp;quot;will o' the wisp&amp;quot; is lal&gt;clle&lt;i ;is a ,4 word si\]lgtllar COililtiOil liC-illD. 'l'liis process &lt;:;tl\] ad&lt;l &lt;'olisidcral&gt;lc IiIlll-il-iers 0\[' Lags (,o I,}l(! ahoy(&amp;quot; tot, als. ,&lt;&amp;quot;Jelil,ell(;c~s ill I, tie Tre&lt;&gt;lm.nk ar(~ l&gt;ars(xl wil;li resl-iecl, t,o the A7'I7 lgnglish (Tran+mar. The (',rammar, a feature based context fl:cc phrase st.ructtJre gratmnar, is related to the IBM t:,nglish (h'ammar as l&gt;ul)lished in (Blac.k ctal., 1993), but differs tuorc: l'rolN, the IBM (h'atimlar than our l, agset does t'roln the (',laws tagsel,. For instance, the notion of &amp;quot;numntol\]iC&amp;quot; has no al&gt;plicatioi~ to the ATI{ (~rallllllar; l, ho ATR Gratl\]l\[lal&amp;quot; has 67 features and I\]O0 rules, whereas the IBM (~ram+ mar had 40 \[&gt;al, tn'es aud '750 rules, c:t+c.</Paragraph>
      <Paragraph position="6"> 'l'he i&gt;reciscly corrcc.t parse (;~s pre st+ecificd by a human &amp;quot;trecbanker&amp;quot;) figures among the parses I-iroduced for any given sentence by tile A'I'R (~ralnlnar, roughly !)0% of the time, \['or l;ext, o1' the unconsl,rained, wide open SOl'l, that, tim Tree-DaHk ix &lt;'onil&gt;OSCd of. The job of the treebattkers is l.o local.&lt;' this exact; l&gt;arse, \['or each s(mt.&lt;m(:e, and add it to tlp 'l'recl&gt;ar~k.</Paragraph>
      <Paragraph position="7"> \[&amp;quot;i~tlre l shows |,wo Salll\[:.\[(+ parsc({ SOlil,ellC+c+s /'rolll l.he A'll't Treebank (aud originally l'rom a (thitwse I,ake oul, fOOd \[li(&gt;r). l~ecatlse il, is inf'ornial,ive t,o know which of the 1 lO0 rtlles is used nt a givou lr(,o no(h'+ and sitice the particular &amp;quot;tlonlernlinal &lt;'at, egory&amp;quot; associated wilh any .lode of llw l, ree is alwa.ys rccoveralfle, 4 nodos are labelled with ATIL (~ratnnlar rule nantes ratl&gt;t + t.ha,, as is lll()l+e tlSllal+ with llOltl.(~rlttillal iHUllCS.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="108" end_page="109" type="metho">
    <SectionTitle>
3 Producing the Treebank
</SectionTitle>
    <Paragraph position="0"> Ill ibis I&gt;art of t.\]le article, we Liirll t'rolll &amp;quot;what,&amp;quot; t,o &amp;quot;l,)w', a,d discuss the nlcchaiiistns by which the A' 1' I{ / I,ail('asl er ' Fr&lt;'etm.ul~ was produced+</Paragraph>
    <Section position="1" start_page="108" end_page="109" type="sub_section">
      <SectionTitle>
3.1 The Software Backbone GWBTool: A Treebanker's Workstation
</SectionTitle>
      <Paragraph position="0"> A Trecbanker's Workstation (a\Y=B'l'ool is n Mol,if based X-Windows appli(:ati(&gt;,t which allows the treel&gt;anker to int.era{:t with the A'I'I{ l%glish Gra.mmar in order to produce \[,he lltOsl accllrate t,reel:.alik ill {,he shorl.csL amotuH, o\[' t.illl&lt;'.</Paragraph>
      <Paragraph position="1"> The i,reebauking process begins ill the Treel&gt;ank I:,dh,or sol'ten oF the treel&gt;anker's worksta, tion with a list o\[ + scnl,enccs lagged wii,h pat'l, ol'-st&gt;eech cal.e gorics. 'l'he 1,1'~&gt;N)a.nkc'r sch;cl,s a SCtll;eltcc. \['rOtll the list, for proccsshig. Next, with tit&lt;+ cli&lt;'k o1&amp;quot; a bttl:. lot&lt;, t ho Tl'eNmnk l'Rlitor graphically displays tit&lt;&gt; l&gt;arse \[bt'&lt;&gt;st+ \[br 1.he s&lt;:lll,ellC(' ill ~1 IlIC,/ISO-SCI\]SiI, iVC I'arse Tr&lt;+r wi,dow (Figure 2). Each node dis i)\]ay&lt;'(l t'el+r(+s(&gt;nts a const, ituel+lt, in l,h&lt;+ parse \['()rest. A shaded cons&lt;it&lt;tent no&lt;h: itidicates that there ar&lt;: all ernatiw+ alialyses of' thai, constii,uent, only &lt;)it&lt;! ol + which is displayed, lay clicking t, lic right+ niottsc l)ul3on Oll ;I shaded node, t.hc: l.r&lt;~cbanl,:cr can display a pOl)Ul&gt; nicnu listhlg tho all,el'nat\]w+ analys('s, atly of which &lt;:atl b(&gt; disl&gt;laycd l&gt;y s&lt;tecl,hlg t, he al)l&gt;ropriat,e ill(Hill il.etU. (',lickhig i, he h&gt;f't ntouse I)tll,l,Oll (.Ill ~-t COIISl,iI, IICII{, l\]OdC pOpS tip a Willdow lisl,htg tho fea,t, llt'e wthles for l, ha, l, const, it+llCnL. 411, is contained in the rule t/;i,tne it, stir.</Paragraph>
      <Paragraph position="2">  an example sentence. On the far right, the feature values of the VBAR2 constituent, indicating that the constituent is an auxiliary verb phrase (bar level 2) containing a present-tense verb phrase with noun semantics SUBSTANCE and verb semantics SEND. The fact that the number feature is variable (NUMBER=V5) indicates that the number of the verb phrase is not specified by the sentence. The shaded nodes indicate where there are alternative parses.</Paragraph>
      <Paragraph position="3"> ii0 The Treebank Editor also displays the number of parses in the parse forest. If the parse forest is unmanageably large, the treebanker can partially bracket the sentence and, again with the click of a button, see the parse forest containing only those parses which are consistent with the partial bracketing (i.e. which do not have any constituents which violate the constituent boundaries in the partial bracketing). Note that the treebanker need not specify any labels in the partial bracketing, only constituent boundaries. The process described above is repeated until the treebanker can narrow the parse forest down to a single correct parse. Crucially, for experienced Lancaster treebankers, the number of such iterations is, by now, normally none or one.</Paragraph>
    </Section>
    <Section position="2" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
3.2 Two-Stage Part-Of-Speech Tagging
</SectionTitle>
      <Paragraph position="0"> Part-of-speech tags are assigned in a two-stage process: (a) one or more potential tags are assigned automatically using the Claws HMM tagger (?); (b) the tags are corrected by a treebanker using a special-purpose X-windows-based editor, Xanthippe. This displays a text segment and, tbr each word contained therein, a ranked list of suggested tags. The analyst can choose among these tags or, by clicking on a panel of all possible tags, insert a tag not in the ranked list.</Paragraph>
      <Paragraph position="1"> The automatic tagger inserts only the syntactic part of the tag. To insert the semantic part of the tag, Xanthippe presents a panel representing all possible semantic continuations of the syntactic part of the tag selected.</Paragraph>
      <Paragraph position="2"> 'lbkenization, sentence-splitting, and spelt checking are carried out according to rule by the treebankers themselves (see 2.1 above). However, the Claws tagger performs basic and preliminary tokenization and sentence-splitting, for optional correction using the Xanthippe editor. Xanthippe retains control at all times during the tag correction process, for instance allowing the insertion only of tags valid according to the ATR. Grammar. null</Paragraph>
    </Section>
    <Section position="3" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
3.3 The Annotation Process
</SectionTitle>
      <Paragraph position="0"> Initially a file consists of a header detailing the file name, text title, author, etc., and the text itself, which may be in a variety of formats; it; may ('ontain HTML mark-up, and files vary in the way in which, for example, emphasis is represented.</Paragraph>
      <Paragraph position="1"> The first stage of processing is a scan of the text to establish its format and, for large files, to delimit a sample to be annotated.</Paragraph>
      <Paragraph position="2"> The second stage is the insertion of SGML like mark-up. As with the tagging I)rocess, this is done by an automatic procedure with manual correction, using microemacs with a special set of nlacros.</Paragraph>
      <Paragraph position="3"> Third, the tagging process described in section 3.2 is carried out. The tagged text is then extracted into a file for parsing via GWBTool (See 3.1.1).</Paragraph>
      <Paragraph position="4"> The final stage is merging the parsed and tagged text with all the annotation (SGML-like markup, header information) for return to ATR.</Paragraph>
    </Section>
    <Section position="4" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
3.4 Staff Training; Output Accuracy
</SectionTitle>
      <Paragraph position="0"> Even though all Treebank parses are guaranteed to be acceptable to the ATR Grammar, insuring consistency and accuracy of output has required considerable planning and effort. Of all the parses output for a sentence being treebanked, only a small subset are appropriate choices, given the sentence's meaning in the document in which it occurs. The five Lancaster treebankers had to undergo extensive training over a long period, to understand the manifold devices of the ATR Grammar expertly enough to make the requisite choices. This training was affected in three ways: a week of classroom training was followed by four months of daily email interaction between the treebankers and the creator of tile ATR Grammar; and once this training period ended, daily Lancaster/ATR email interaction continued, as well as constant consultation among the treebankers themselves.</Paragraph>
      <Paragraph position="1"> A body of documentation and lore was developed and frequently referred to, concerning how all semantic and certain syntactic aspects of the tagset, as well as various grammar rules, are to be applied and interpreted. (This material is organized via a menu system, and updated at least weekly.) A searchable version of files annotated to date, and a list of past tagging decisions, ordered by word and by tag, are at the treebankers' disposal.</Paragraph>
      <Paragraph position="2"> In addition to tile constant dialogue between the treebankers and the ATR grammarian, Lancaster output was sampled periodically at ATR, hand-corrected, and sent back to the treebankers.</Paragraph>
      <Paragraph position="3"> In this way, quality control, determination of output accuracy, and consistency control were handled conjointly via the twin methods of sample correction and constant treebanker/grammarian dialogue.</Paragraph>
      <Paragraph position="4"> With regard both to accuracy and consistency of output analyses, individual treebanker abilities clustered in a fortunate manner. Scoring of thousands of words of sample data over time revealed that three of the five treebankers had parsing error rates (percentage of sentences parsed incorrectly) of 7%, 10%, and 14% respectively, while the other two treebankers' error rates were 30% and 36% respectively. Tagging error rates (percentage of all tags that were incorrect), similarly, were 2.3%, 1.7%, 4.0%, 7.3% and 3.6%. Expected parsing error rate worked out to 11.9% for the first three, but 32.0% for the other two treebankers; while expected tagging error rates were 2.9% and 6.1% respectively. 5  it, ies is that the less able treebankers were also much less prolific than the others, producing only 25% of the total treel)ank. Therefore, we are provisionally excluding this 25% of the treebank (about 180,000 words) fi'om use fbr parser training, though we are experimeating with the use of the entire treebank (expected tagging error rate: 3.9%) for tagger training. Finally, parsing and tagging consistency among the first, three treebankers appears high.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>