<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1092"> <Title>ANNOTATING 200 MILLION WORDS: THE BANK OF ENGLISH PROJECT</Title> <Section position="4" start_page="565" end_page="565" type="metho"> <SectionTitle> 3 THE LEXICON </SectionTitle> <Paragraph position="0"> Filtering produces a list of all tokenised word-forms in the input text which are not included in the current ENGTWOL lexicon. The most common types are taken under closer scrutiny. It has to be decided whether these are genuine word forms or non-words (e.g. misspellings).</Paragraph> <Paragraph position="1"> At the beginning, I used several days to update the lexical module for a new batch of text, but experience and increased coverage of the lexicon have diminished the time needed for this task considerably.</Paragraph> <Paragraph position="2"> I have added words above a certain frequency routinely to the ENGTWOL lexicon. The frequency is not fixed but determined by practical considerations. For instance, when the data contain a great deal of duplication (as in the BBC material, owing to the repetitive nature of daily broadcasting), simple token frequency is a poor indicator of what is a suitable item to add to the lexicon. However, sampling methods have not been developed to optimise the size of the lexicon, because it is not crucial for the present purpose.</Paragraph> <Paragraph position="3"> My lexical practices differ somewhat from the updating procedure documented in [Voutilainen, 1994]. If our aim is to supply every word in running text with all proper morphological and syntactic readings, we cannot deprive frequent nonstandard words (e.g. hm, veggie, wanna) of their obvious morphological readings, because this might cause the whole sentence to be misanalysed. Since prescriptive considerations were not taken into account in the design of ENGTWOL, many entries marked as 'informal' or 'slang' in conventional dictionaries were added to the lexicon. 
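The filtering step described at the start of this section, together with the frequency cut-off, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `filter_unknown`, the parameter `min_freq`, the whitespace tokeniser and the toy lexicon (a plain set of word-forms) are all hypothetical and are not part of ENGTWOL, whose lexicon lookup involves full morphological analysis.

```python
from collections import Counter


def filter_unknown(tokens, lexicon, min_freq=10):
    """Return (word_form, count) pairs for forms missing from the lexicon.

    Hypothetical helper: `lexicon` is a plain set of lower-cased word-forms,
    standing in for a real morphological lexicon lookup.
    """
    # Count only those word-forms that the lexicon does not recognise.
    counts = Counter(t.lower() for t in tokens if t.lower() not in lexicon)
    # Keep forms at or above the frequency cut-off, most frequent first.
    return [(w, n) for w, n in counts.most_common() if n >= min_freq]


# Toy data, invented for the example.
lexicon = {"the", "fish", "is", "a"}
tokens = "the veggie fish is a veggie wanna veggie".split()
candidates = filter_unknown(tokens, lexicon, min_freq=2)
print(candidates)
```

The cut-off is a parameter rather than a constant because, as noted above, raw token counts are a poor guide when the data contain a great deal of duplication.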
I have also included highly domain-specific entries in the lexicon if they were frequent enough in certain types of data, especially when heuristics might produce erroneous or incomplete analyses for the word in question (e.g. species of fish which have the same form in singular and plural: brill, chub, garfish) [footnote 1]. One advantage of including all frequent graphical words in the lexicon is that ENGTWOL filtering of incoming texts produces output which can be more reliably dealt with by automatic means. When all frequent nonstandard and even foreign words are listed in the lexicon, the output can be used in a straightforward way for generating new entries.</Paragraph> <Paragraph position="4"> The procedure of adding new entries to the lexicon goes as follows: first, all words are classified according to the part-of-speech they belong to. Second, new entries in the ENGTWOL format are generated automatically from these word-lists using ready-made tools presented in [Voutilainen, 1994]. Lists of new entries are carefully checked, and additional features (such as transitivity and complementation features for verbs) are supplied manually. [Footnote 1: The default category of morphological heuristics is a singular noun. In the case of a potential plural form (s-ending), an underspecified tag SG/PL is given.]</Paragraph> <Paragraph position="5"> In describing the items, I have relied mainly on Collins COBUILD Dictionary (1987) and Collins English Dictionary (1991), which have been available to us in electronic form. But when the usage and distribution seem to be unclear, I have generated an on-line concordance directly from the corpus. Since I have dealt with words which have a frequency of, say, at least 10 tokens in the corpus, this method seems to be quite reliable.</Paragraph> <Paragraph position="6"> We cannot detect errors in the lexicon during the initial filtering phase. 
Once a certain string has had one or more entries in the lexicon, it is not present in the output of the filtering, and other potential uses might not be added to the lexicon [footnote 2]. And frequent errors tend to get corrected, since all incorrect analyses detected during the manual inspection are corrected directly in the lexicon.</Paragraph> <Paragraph position="7"> The ENGTWOL lexicon which is used in the Bank analyses contains approximately 75,000 entries. Morphological analysis caters for all inflected forms of the lexical items. The coverage of the lexicon before updating is between 97% and 98% of all word-form tokens in running text. Appendix A presents the number of additional lexical entries generated from each batch of data. The cumulative trend shows that a very small number of new entries is needed when analysing the latter half of the corpus.</Paragraph> <Paragraph position="8"> Morphological heuristics is applied after ENGTWOL analysis as a separate module (by Voutilainen, Tapanainen). It assigns reliable analyses to words which were not included in the lexicon. This also contributes to the fact that lexicon updating will be a minor task in the future.</Paragraph> </Section> <Section position="5" start_page="565" end_page="566" type="metho"> <SectionTitle> 4 ENGCG DISAMBIGUATION AND SYNTAX </SectionTitle> <Paragraph position="0"> English Constraint Grammar is a rule-based morphological and dependency-oriented surface syntactic analyser of running English text.</Paragraph> <Paragraph position="1"> Morphological disambiguation of multiple part-of-speech and other inflectional tags is carried out before syntactic analysis. Morphological disambiguation reached a mature level well before the beginning of this project 
(see evaluation in [Voutilainen, 1992]).</Paragraph> <Paragraph position="2"> The morphological disambiguation rules (some 1100 in the present grammar) were written by Atro Voutilainen.</Paragraph> <Paragraph position="3"> The Bank data is analysed using both 'grammar-based' and 'heuristic' disambiguation rules. This leaves less morphological ambiguity (below 3%), although the error rate is still extremely low (below 0.5%). [Footnote 2: Although missing entries are possible to find indirectly, e.g. -ing and -ed forms in the filtering output indicate that the base form is not described in the lexicon as a verb.]</Paragraph> <Paragraph position="4"> 4.1 Current state of ENGCG syntax. The first version of ENGCG syntax was written by Arto Anttila [Anttila, 1994]. At the beginning of the Bank project, new Constraint Grammar Parser implementations for syntactic mapping and disambiguation were written by Pasi Tapanainen. These have been tested during the first months of this project. Some adjustment to the syntax was needed to cater for new specifications, e.g. in rule application order.</Paragraph> <Paragraph position="5"> I have tested all constraints extensively with different types of text from the Bank. I have revised almost all syntactic rules and written new ones. The current ENGCG parser uses 282 syntactic mapping rules, 492 syntactic constraints and 204 heuristic syntactic constraints. The mapping rules should be the most reliable, since they attach all possible syntactic alternatives to the morphologically disambiguated output. Syntactic rules prune contextually inappropriate syntactic tags, or accept just one contextually appropriate tag. Syntactic and heuristic rule components are formally similar, but they differ in reliability. It is possible not to use heuristic rules at all if one aims at maximally error-free output, but the 
cost is an increase in ambiguity.</Paragraph> <Paragraph position="6"> During the project, the quality of syntax has improved considerably. The current error rate, when parsing new unrestricted running text, is approximately 7%, i.e., 7 words out of 100 get the wrong syntactic code. But the ambiguity rate is still fairly high, 16.4% in a 0.5 million word sample, which means that 16 words out of 100 still have more than one morphological or syntactic alternative. Much of the remaining ambiguity is of the prepositional attachment type. This particular type of ambiguity accounts for approximately 20% of all remaining ambiguity. More heuristic rules are needed for pruning the remaining ambiguities. Of course, many of the remaining ambiguities (especially PP attachment) are genuine and should be retained.</Paragraph> <Paragraph position="7"> The speed of the whole system used in morphological and syntactic annotation is about 400 words per second on a Sun SPARCstation 10/30.</Paragraph> <Section position="1" start_page="566" end_page="566" type="sub_section"> <SectionTitle> 4.2 Developing the syntax </SectionTitle> <Paragraph position="0"> Facilities for the fast compilation of a parser with a new rule file and the speed of the analysis make a very good environment for the linguist to test new constraints.</Paragraph> <Paragraph position="1"> A special debugging version of the parser can be used for testing purposes. The debugging version takes fully disambiguated ENGCG texts as input. Ideally, every rule is tested against a representative sample from a corpus. This would set the requirement that the test corpus should be made of large random samples. 
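The error and ambiguity rates quoted above can be made concrete with a small Python sketch that compares parser output with a manually disambiguated test corpus. This is a sketch under assumptions, not the project's evaluation tool: the function name `evaluate`, the data layout (one set of remaining tags per word, one correct tag per word) and the tag labels are invented for the illustration and are not actual ENGCG syntactic codes.

```python
def evaluate(parsed, gold):
    """Compare parser output with a manually disambiguated test corpus.

    parsed: list of sets, the tags left on each word (assumed format)
    gold:   list of strings, the single correct tag for each word
    Returns (error_rate, ambiguity_rate) as percentages of all words.
    """
    if len(parsed) != len(gold):
        raise ValueError("parsed and gold must be aligned word by word")
    if not gold:
        raise ValueError("empty test corpus")
    # A word is an error when the correct tag has been discarded.
    errors = sum(1 for tags, g in zip(parsed, gold) if g not in tags)
    # A word is ambiguous when more than one alternative remains.
    ambiguous = sum(1 for tags in parsed if len(tags) > 1)
    total = len(gold)
    return 100.0 * errors / total, 100.0 * ambiguous / total


# Toy five-word sample with illustrative labels only.
parsed = [{"SUBJ"}, {"MAINV"}, {"OBJ"}, {"ADVL", "POSTMOD"}, {"OBJ"}]
gold = ["SUBJ", "MAINV", "OBJ", "ADVL", "PCOMPL"]
error_rate, ambiguity_rate = evaluate(parsed, gold)
print(error_rate, ambiguity_rate)
```

Under these definitions a word that keeps several alternatives, one of them correct, counts as ambiguous but not as an error, which is why dropping the heuristic rules trades a lower error rate for more ambiguity. The figures are only as representative as the sample they are computed on.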
However, it is time-consuming to prepare manually large amounts of corrected and disambiguated data, even from ENGCG output. Therefore, a very large test corpus is beyond the scope of this project.</Paragraph> <Paragraph position="2"> The current syntactic test corpus contains approximately 30,000 words. It is large enough for testing reliable syntactic rules, but if we want to rate the acceptability of heuristic syntactic rules, a larger syntactic corpus would be necessary. The test corpus consists of 16 individual text samples from the Bank of English data. The texts have been chosen so that they take text type variation into account. All samples but one are continuous, unedited subparts of the corpus.</Paragraph> <Paragraph position="3"> It seems worthwhile to continue preparing a disambiguated corpus from selected pieces of text. Once new data is received, it is expedient to add a representative sample from it to the test corpus. A manually disambiguated test corpus constitutes a very straightforward documentation of the applied parsing scheme (as described in [Sampson, 1987]).</Paragraph> </Section> </Section> <Section position="6" start_page="566" end_page="566" type="metho"> <SectionTitle> 5 CONCLUSION </SectionTitle> <Paragraph position="0"> The analysing system has reached a mature stage, where all technical problems seem to be solved. We have developed methods for dealing with the data, with a considerable degree of automatisation. ENGCG has proved to be a 
fast and accurate rule-based system for analysing unrestricted text.</Paragraph> <Paragraph position="1"> Writing and documenting ENGCG syntax will be the main concern during the following months. Our part of the project will be completed by March, 1995.</Paragraph> <Paragraph position="2"> It is possible that the whole 200-million-word corpus will be analysed afresh near the end of the project. This would put to use all the improvements made during the two-year period and would guarantee a maximal degree of uniformity and the overall accuracy of the annotated corpus.</Paragraph> </Section> </Paper>