CLAWS4: THE TAGGING OF THE BRITISH NATIONAL CORPUS 
Geoffrey Leech, Roger Garside and Michael Bryant 
UCREL, Lancaster University, UK 
1 INTRODUCTION 
The main purpose of this paper is to describe 
the CLAWS4 general-purpose grammatical tagger, 
used for the tagging of the 100-million-word British 
National Corpus, of which c.70 million words have 
been tagged at the time of writing (April 1994).1
We will emphasise the goals of (a) general-purpose
adaptability, (b) incorporation of linguistic knowledge
to improve quality and consistency, and (c)
accuracy, measured consistently and in a linguistically
informed way.
The British National Corpus (BNC) consists of
c.100 million words of English written texts and
spoken transcriptions, sampled from a comprehensive
range of text types. The BNC includes 10
million words of spoken language, c.45% of which
is impromptu conversation (see Crowdy, forthcoming).
It also includes an immense variety of written
texts, including unpublished materials. The grammatical
tagging of the corpus has therefore required
the 'super-robustness' of a tagger which can adapt 
well to virtually all kinds of text. The tagger also has 
had to be versatile in dealing with different tagsets 
(sets of grammatical category labels-- see 3 below) 
and accepting text in varied input formats. For the 
purposes of the BNC, the tagger has been required
both to accept and to output text in a corpus-oriented
TEI-conformant mark-up definition known as CDIF
(Corpus Document Interchange Format), but within
this format many variant formats (affecting, for
example, segmentation into words and sentences)
can be readily accepted. In addition, CLAWS allows
variable output formats: for the current tagger,
these include (a) a vertically-presented format
suitable for manual editing, and (b) a more compact
horizontally-presented format often more suitable
for end-users. Alternative output formats are
also allowed with (c) so-called 'portmanteau tags',
i.e. combinations of two alternative tags, where the
tagger calculates there is insufficient evidence for
safe disambiguation, and (d) with simplified 'plain
text' mark-up for the human reader. (See Tables 1
and 2 for examples of output formats.)

1The BNC is the result of a collaboration, supported
by the Science and Engineering Research Council (SERC
Grant No. GR/F99847) and the UK Department of Trade
and Industry, between Oxford University Press (lead partner),
Longman Group Ltd., Chambers Harrap, Oxford
University Computer Services, the British Library and
Lancaster University. We thank Elizabeth Eyes, Nick
Smith, and Andrew Wilson for their valuable help in the
preparation of this paper.
CLAWS4, the BNC tagger,2 incorporates many
features of adaptability such as the above. It also
incorporates many refinements of linguistic analysis
which have been built up over 14 years, particularly in
the construction and content of the idiom-tagging
component (see 2 below). At the same time, there
are still many improvements to be made: the claim
that 'you can put together a tagger from scratch in
a couple of months' (recently heard at a research
conference) is, in our view, absurdly optimistic.
2 THE DESIGN OF THE GRAMMATICAL TAGGER (CLAWS4)
The CLAWS4 tagger is a successor of the CLAWS1
tagger described in outline in Marshall (1983), and
more fully in Garside et al (1987), and has the same
basic architecture. The system (if we include input
and output procedures) has five major sections:

(a) segmentation of text into word and sentence
units

(b) initial (non-contextual) part-of-speech assignment
[using a lexicon, word-ending list, and
various sets of rules for tagging unknown
items]
2CLAWS4 has been written by Roger Garside, with 
CLAWS adjunct software written by Michael Bryant. 
(c) rule-driven contextual part-of-speech assignment

(d) probabilistic tag disambiguation [Markov process]

[(c') second pass of (c)]

(e) output in intermediate form.
The intermediate form of text output is the form 
suitable for post-editing (see 1 above; also Table 1), 
which can then be converted into other formats ac- 
cording to particular output needs, as already noted. 
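The conversion from the intermediate, vertically-presented form into other formats can be pictured with a small sketch. This is our illustration only, not the BNC software: the two-field record layout is a simplification of the Table 1 format, and the horizontal target follows the `word&TAG;` convention of Table 2.

```python
# Minimal sketch: converting a simplified vertical (word-per-line)
# tagging into the compact horizontal 'word&TAG;' form.
# The two-field records are an illustrative simplification.

def vertical_to_horizontal(records):
    """Join 'word TAG' records into a single horizontal line."""
    tokens = []
    for rec in records:
        word, tag = rec.split()
        tokens.append(f"{word}&{tag};")
    return " ".join(tokens)

records = ["I PNP", "can VM0", "just AV0", "take VVI"]
print(vertical_to_horizontal(records))
# I&PNP; can&VM0; just&AV0; take&VVI;
```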
The pre-processing section (a) is not trivial, since, 
in any large and varied corpus, there is a need 
to handle unusual text structures (such as those 
of many popular and technical magazines), less 
usual graphic features (e.g. non-roman alphabetic 
characters, mathematical symbols), and features of 
conversation transcriptions: e.g. false starts, incom- 
plete words and utterances, unusual expletives, un- 
planned repetitions, and (sometimes multiple) over- 
lapping speech. 
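As an illustration only (the actual pre-processor is far more elaborate, and must also cope with mark-up and the speech phenomena just listed), the basic word and sentence segmentation of plain running text can be sketched as:

```python
import re

# Minimal sketch of section (a): splitting raw running text into
# sentence units and then word units. A real pre-processor must also
# handle mark-up, non-roman characters, and transcription features.

def segment(text):
    """Return a list of sentences, each a list of word tokens."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # keep clitics like n't / 's attached via the optional apostrophe part
    return [re.findall(r"\w+(?:'\w+)?|[^\w\s]", s) for s in sentences]

print(segment("She left. Didn't she?"))
# [['She', 'left', '.'], ["Didn't", 'she', '?']]
```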
Sections (b) and (d) apply essentially a Hidden
Markov Model (HMM) to the assignment and disambiguation
of tags. But the intervening section (c)
has become increasingly important as CLAWS4 has
developed the need for versatility across a range of
text types. This task of rule-driven contextual part-of-speech
assignment began in 1981 as an 'idiom-tagging'
program for dealing, in the main, with
parts of speech extending over more than one orthographic
word (e.g. complex prepositions such as
according to and complex conjunctions such as so
that). In the more fully developed form it now has,
this section utilises several different idiom lexicons
dealing, for example, with (i) general idioms such
as as much as (which is on one analysis a single
coordinator, and on another analysis, a word sequence),
(ii) complex names such as Dodge City
and Mrs Charlotte Green (where the capital letter
alone would not be enough to show that Dodge and
Green are proper nouns), (iii) foreign expressions
such as annus horribilis.
These idiom lexicons (with over 3000 entries in
all) can match on both tags and word-tokens, employing
a regular expression formalism at the level
both of the individual item and of the sequence
of items. Recognition of unspecified words with
initial capitals is also incorporated. Conceptually,
each entry has two parts: (a) a regular-expression-based
'template' specifying a set of conditions on
sequences of word-tag pairs, and (b) a set of tag assignments
or substitutions to be performed on any
sequence matching the set of conditions in (a). Examples
of entries from each of the above kinds of
idiom lexicon entry are:

(i) did/do/does, ([XX0/AV0/ORD])2, [VVB] VVI

(ii) Monte/Mount/Mt NP0, ([WIC])2 NP0, [WIC] NP0

(iii) ad AV021 AJ021, hoc AV022 AJ022
Explanatory note:

(a) Symbolic formalism:

Let TT be any tag, and let ww be any word.
Let n, m be arbitrary integers. Then:

ww TT represents a word and its associated tag
, separates a word from its predecessor
[TT] represents an already assigned tag
[WIC] represents an unspecified word with a Word Initial Capital
TT/TT means 'either TT or TT'; ww/ww means 'either ww or ww'
ww TT-TT represents an unresolved ambiguity between TT and TT
TT* represents a tag with * marking the location of unspecified characters
([TT])n represents the number of words (up to n) which may optionally intervene at a given point in the template
TTnm represents the 'ditto tag' attached to an orthographic word to indicate it is part of a complex sequence (e.g. so that is tagged so CJS21, that CJS22). The variable n indicates the number of orthographic words in the sequence, and m indicates that the current word is in the mth position in that sequence.
(b) Examples of word tags (in the C5 'basic' tagset):

AJ0 adjective (unmarked)
AV0 adverb (unmarked)
CJS subordinating conjunction
NP0 proper noun
ORD ordinal number
VVB finite base form of lexical verb
VVI infinitive of lexical verb
XX0 negative marker: not or n't
(c) Explanation of the three rules above:

Rule (i) ensures that following a finite form of do and
(optionally) up to two adverbs, negators or ordinals,
a base form of the verb is tagged as an infinitive.

Rule (ii) ensures that in complex names such as Monte
Alegre, Mount Pleasant, Mount Palomar Observatory,
Mt Rushmore National Memorial, all the
words with word-initial caps are tagged as proper
nouns.

Rule (iii) ensures that the Latin expression ad hoc is
tagged as a single word, either an adjective or an
adverb.
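The effect of a template like rule (i) can be sketched in a few lines. This is our illustration only, not the CLAWS4 formalism: the matching here is hand-coded rather than driven by the regular-expression templates above, and the tag VDD (past tense of do) is assumed from the C5 conventions.

```python
# Minimal sketch of idiom rule (i): after a finite form of "do" and
# up to two optional XX0/AV0/ORD words, retag a following VVB as VVI.

def apply_do_rule(tagged):
    """tagged: list of (word, tag) pairs; returns a corrected copy."""
    out = list(tagged)
    for i, (word, _tag) in enumerate(out):
        if word.lower() not in ("did", "do", "does"):
            continue
        j = i + 1
        skipped = 0
        # allow up to two intervening negators, adverbs or ordinals
        while j < len(out) and skipped < 2 and out[j][1] in ("XX0", "AV0", "ORD"):
            j += 1
            skipped += 1
        if j < len(out) and out[j][1] == "VVB":
            out[j] = (out[j][0], "VVI")
    return out

sent = [("did", "VDD"), ("not", "XX0"), ("really", "AV0"), ("carry", "VVB")]
print(apply_do_rule(sent))
# [('did', 'VDD'), ('not', 'XX0'), ('really', 'AV0'), ('carry', 'VVI')]
```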
We have also now moved to a more complex,
two-pass application of these idiomlist entries. It is
possible, on the first pass, to specify ambiguous output
of an idiom assignment (as is necessary, e.g., for
as much as, mentioned earlier), so that this can then
be input to the probabilistic disambiguation process
(d). On the second pass, however, after probabilistic
disambiguation, the idiom entry is deterministic
in both its input and output conditions, replacing
one or more tags by others. In effect, this last kind
of idiom application is used to correct a tagging error
arising from earlier procedures. For example, a
not uncommon result from Sections (a)-(d) is that
the base form of the verb (e.g. carry) is wrongly
tagged as a finite present tense form, rather than
an infinitive. This can be retrospectively corrected
by replacing VVB (= finite base form) by VVI (=
infinitive) in appropriate circumstances.
While the HMM-type process employed in Sections
(b) and (d) affirms our faith in probabilistic
methods, the growing importance of the contextual
part-of-speech assignment in (c) and (c') demonstrates
the extent to which it is important to transcend
the limitations of the orthographic word as
the basic unit of grammatical tagging, and also to
adopt non-probabilistic solutions selectively. The
term 'idiom-tagging' originally used was never particularly
appropriate for these sections, which now
handle more generally the interdependence between
grammatical and lexical processing which NLP systems
ultimately have to cope with, and are also able
to incorporate parsing information beyond the range
of the one-step Markov process (based on tag bigram
frequencies) employed in (d).3 Perhaps the
term 'phraseological component' would be more
appropriate here. The need to combine probabilistic
and non-probabilistic methods in tagging has been
widely noted (see, e.g., Voutilainen et al. 1992:14).

3We have experimented with a two-step Markov process
model (using tag trigrams), and found little benefit
over the one-step model (using tag bigrams).
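The one-step (bigram) Markov process of section (d) can be sketched as a Viterbi search over candidate tags. This is a toy sketch, not CLAWS4's implementation: the tag names follow the C5 conventions (AT0 is assumed here for the article), and the lexical and transition probabilities are invented values, not the system's actual resources.

```python
# Minimal sketch of bigram (one-step Markov) tag disambiguation by
# Viterbi search. Probabilities are invented toy values.

def viterbi(words, candidates, lex_p, trans_p):
    """Return the most probable tag sequence under a bigram model."""
    # paths: best (probability, tag_sequence) ending in each tag
    paths = {"<s>": (1.0, [])}
    for w in words:
        new_paths = {}
        for prev, (p, seq) in paths.items():
            for tag in candidates[w]:
                q = p * trans_p.get((prev, tag), 1e-6) * lex_p.get((w, tag), 1e-6)
                if tag not in new_paths or q > new_paths[tag][0]:
                    new_paths[tag] = (q, seq + [tag])
        paths = new_paths
    return max(paths.values())[1]

candidates = {"the": ["AT0"], "can": ["VM0", "NN1"]}
lex_p = {("the", "AT0"): 1.0, ("can", "VM0"): 0.7, ("can", "NN1"): 0.3}
trans_p = {("<s>", "AT0"): 0.5, ("AT0", "NN1"): 0.4, ("AT0", "VM0"): 0.01}
print(viterbi(["the", "can"], candidates, lex_p, trans_p))
# ['AT0', 'NN1'] -- after a determiner, the noun reading of "can" wins
```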
3 EXTENDING ADAPTABILITY: 
SPOKEN DATA AND TAGSETS 
The tagging of 10 million words of spoken data
(including c.4.6 million words of conversation)
presents particular challenges to the versatility of
the system: renderings of spoken pronunciations
such as 'avin' (for having) cause difficulties, as do
unplanned repetitions such as I er, mean, I mean,
I mean to go. Our solution to the latter problem
has been to recognize such repetitions by a special
procedure, and to disregard, in most cases, the repeated
occurrences of the same word or phrase for
the purposes of tagging. It has become clear that
the CLAWS4 resources (lexicon, idiomlists, and tag
transition matrix), developed for written English,
need to be adapted if certain frequent and rather
consistent errors in the tagging of spoken data are
to be avoided (words such as I, well, and right are
often wrongly tagged, because their distribution in
conversation differs markedly from that in written
texts). We have moved in this direction by allowing
CLAWS4 to 'slot in' different resources according
to the text type being processed, e.g. by providing
a separate supplementary lexicon and idiomlist for
the spoken material. Eventually, probabilistic analysis
of the tagged BNC will provide the necessary
information for adapting datastructures at run time
to the special demands of particular types of data,
but there is much work to be done before this potential
benefit of having tagged a large corpus is
realised.
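The repetition procedure described above can be sketched as follows. This is our illustration only, not the CLAWS4 code: it simply looks for an immediately repeated word sequence and disregards the earlier copy, keeping the final occurrence for tagging.

```python
# Minimal sketch of recognising unplanned repetitions in speech
# transcriptions (e.g. "I mean I mean I mean to go"): flag the
# earlier copy of a repeated sequence and keep the final one.

def mark_repetitions(tokens, max_len=3):
    """Return tokens with immediately repeated sequences removed."""
    keep = [True] * len(tokens)
    i = 0
    while i < len(tokens):
        # prefer the longest repeated sequence at this position
        for n in range(max_len, 0, -1):
            if tokens[i:i + n] and tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                for k in range(i, i + n):
                    keep[k] = False  # disregard the earlier copy
                i += n - 1
                break
        i += 1
    return [t for t, ok in zip(tokens, keep) if ok]

print(mark_repetitions("I mean I mean I mean to go".split()))
# ['I', 'mean', 'to', 'go']
```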
The BNC tagging takes place within the context 
of a larger project, in which a major task (undertaken 
by OUCS at Oxford) is to encode the texts in a TEI- 
conformant mark-up (CDIF). Two tagsets have been 
employed: one, more detailed than the other, is 
used for tagging a 2-million-word Core Corpus (an 
epitome of the whole BNC), which is being post- 
edited for maximum accuracy. Thus tagsets, like 
text formats and resources, are among the features 
which are task-definable in CLAWS4. In general, 
the system has been revised to allow many adaptive 
decisions to be made at run time, and to render it 
suitable for non-specialist researchers to use. 
4 ERROR RATES AND WHAT 
THEY MEAN 
Currently, judged in terms of major categories,4 the
system has an error-rate of approximately 1.5%, and
leaves c.3.3% ambiguities unresolved (as portmanteau
tags) in the output. However, it is all too easy
to quote error rates without giving enough information
to enable them to be properly assessed.5 We
believe that any evaluation of the accuracy of automatic
grammatical tagging should take account of
a number of factors, some of which are extremely
difficult to measure:
4.1 Consistency 
It is necessary to measure tagging practice against
some standard of what is an appropriate tag for a
given word in a given context. For example, is horrifying
in a horrifying adventure, or washing in a
washing machine, an adjective, a noun, or a verb
participle? Only if this is specified independently,
by an annotation scheme, can we feel confident in
judging where the tagger is 'correct' or 'incorrect'.
For the tagging of the LOB Corpus by the earliest
version of CLAWS, the annotation scheme was
published in some detail (Johansson et al 1986). We
are working on a similar annotation scheme document
(at present a growing in-house document) for
the tagging of the BNC.
4The error rate and ambiguity rate are less favourable
if we take account of errors and ambiguities which occur
within major categories. E.g. the portmanteau tag NP0-NN1
records confidently that a word is a noun, but not
whether it is a proper or common noun. If such cases are
added to the count, then the estimated error rate rises to
1.78%, and the estimated ambiguity rate to 4.60%.
5One reasonable attempt to evaluate the competing accuracy
of different taggers is that in Voutilainen et al
(1992:11-13), where it is argued, on the basis of the
tagging of sample written texts, that the performance of
the Helsinki constraint grammar parser ENGCG is superior
to that of CLAWS1 (the earliest version of CLAWS,
completed in 1983), which is in turn somewhat superior
to Church's Parts tagger. While recognizing that the
accuracy of the Helsinki system is impressive, we note
also that the method of evaluation (in terms of 'precision'
and 'recall') employed by Voutilainen et al is not easy
to compare with the method employed here. Further,
a strict attempt at measuring comparability would have
to take fuller account of the 'consistency' and 'quality'
criteria we mention, and of the need to compare across
a broader range of texts, spoken and written. This issue
cannot be taken further in this paper, but will hopefully
be the basis of future research.
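The headline rates quoted in this section, and the stricter within-category counting described in footnote 4, can be computed mechanically from post-edited output. A minimal sketch, assuming simple parallel lists of tagger output and post-edited 'gold' tags, with '-' joining portmanteau alternatives as in the tag NP0-NN1:

```python
# Minimal sketch of computing an error rate and an ambiguity rate,
# where portmanteau tags such as NP0-NN1 count as unresolved
# ambiguities rather than as errors (unless neither alternative
# matches the post-edited tag).

def rates(tagged, gold):
    """tagged/gold: parallel lists of tags; returns (error, ambiguity)."""
    errors = ambiguous = 0
    for t, g in zip(tagged, gold):
        if "-" in t:
            ambiguous += 1
            if g not in t.split("-"):
                errors += 1  # neither portmanteau alternative is right
        elif t != g:
            errors += 1
    n = len(gold)
    return errors / n, ambiguous / n

out  = ["AT0", "NP0-NN1", "VVB", "PRP"]
gold = ["AT0", "NP0",     "VVD", "PRP"]
print(rates(out, gold))
# (0.25, 0.25) -- one outright error, one unresolved ambiguity
```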
4.2 Size of Tagset 
It might be supposed that tagging with a finer-grained
tagset which contains more tags is more
likely to produce error than tagging with a smaller
and cruder tagset. In the BNC project, we have used
a tagset of 58 tags (the C5 tagset) for the whole corpus,
and in addition we have used a larger tagset
of 138 tags (the C6 tagset)6 for the Core Corpus of
2 million words. The evidence so far is that this
makes little difference to the error rate. But size
of tagset must, in the absence of more conclusive
evidence, remain a factor to be considered.
4.3 Discriminative Value of Tags

The difficulty of grammatical tagging is directly related
to the number of words for which a given tag
distinction is made. This measure may be called
'discriminative value'. For example, in the C5
tagset, one tag (VDI) is used for the infinitive of
just one verb -- to do -- whereas another tag (VVI)
is used for the infinitive of all lexical verbs. On the
other hand, VDB is used for finite base forms of
to do (including the present tense, imperative, and
subjunctive), whereas VVB is used for finite base
forms of all lexical verbs. It is clear the tags VDI
and VDB have a low discriminative value, whereas
VVI and VVB have a high one -- since there are
thousands of lexical verbs in English. It is also
clear that a tagset of the lowest possible discriminative
value -- one which assigned a single tag to
each word and a single word to each tag -- would
be utterly valueless.
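The 'discriminative value' measure can be estimated directly from a lexicon. A minimal sketch, assuming a toy lexicon mapping words to their candidate C5 tags (the entries below are illustrative, not the CLAWS4 lexicon):

```python
from collections import Counter

# Minimal sketch of 'discriminative value': the number of word types
# over which a given tag operates, counted from a toy lexicon.

def discriminative_value(lexicon):
    """Count how many word types each tag applies to."""
    counts = Counter()
    for word, tags in lexicon.items():
        for tag in set(tags):
            counts[tag] += 1
    return counts

lexicon = {
    "do":    ["VDB", "VDI"],
    "carry": ["VVB", "VVI"],
    "take":  ["VVB", "VVI"],
    "put":   ["VVB", "VVI"],
}
dv = discriminative_value(lexicon)
print(dv["VDI"], dv["VVI"])
# 1 3 -- VDI covers a single verb, VVI covers many lexical verbs
```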
4.4 Linguistic Quality

This is a very elusive, but crucial, concept. How
far are the tags in a particular tagset valuable, by
criteria either of linguistic theory/description, or
of usefulness in NLP? For example, the tag VDI,
mentioned in c. above, appears trivial, but it can be
argued that this is nevertheless a useful category
for English grammar, where the verb do (unlike its
equivalent in most other European languages) has a
very special function, e.g. in forming questions and
negatives. On the other hand, if we had decided to
assign a special tag to the verb become, this would
have been more questionable. Linguistic quality is,
on the face of it, determined only in a judgemental
manner. Arguably, in the long term, it can be determined
only by the contribution a particular tag distinction
makes to success in particular applications,
such as speech recognition or machine-aided translation.
At present, this issue of linguistic quality is
the Achilles' heel of grammatical tagging evaluation,
and we must note that without judgement on
linguistic quality, evaluation in terms of b. and c. is
insecurely anchored.

6The tagset figures exclude punctuation tags and portmanteau
tags.
It seems reasonable, therefore, to lump criteria
b.-d. together as 'quality criteria', and to say that
evaluation of tagging accuracy must be undertaken
in conjunction with (i) consistency [How far has the
annotation scheme been consistently applied?], and
(ii) quality of tagging [How good is the annotation
scheme?].7 Error rates are useful interim indications
of success, but they have to be corroborated
by checking, if only impressionistically, in terms
of qualitative criteria. Our work, since 1980, has
been based on the assumption that qualitative criteria
count, and that it is worth building 'consensual'
linguistic knowledge into the datastructures used by
the tagger, to make sure that the tagger's decisions
are fully informed by qualitative considerations.
REFERENCES 
Crowdy, S. (forthcoming). The BNC Spoken Corpus.
In G. Leech, G. Myers and J. Thomas (Eds.).
Spoken English on Computer. London: Longman.

Garside, R., G. Leech and G. Sampson (Eds).
(1987). The Computational Analysis of English:
A Corpus-based Approach. London: Longman.

Johansson, S., E. Atwell, R. Garside and G. Leech.
(1986). The Tagged LOB Corpus: User's Manual.
Bergen: Norwegian Computing Centre for
the Humanities.

Marshall, I. (1983). Choice of grammatical word-class
without global syntactic analysis: tagging
words in the LOB Corpus. Computers and the
Humanities, 17, 139-50.
7An example of a consistency issue is: How consistently
is Time [the name of a magazine] tagged in the
corpus? Is it tagged always NP0 (as a proper noun), or
always NN1 (as a common noun), or sometimes NP0 and
sometimes NN1? An example of a quality issue is: Is it
worth distinguishing between proper nouns and common
nouns, anyway?
Voutilainen, A., J. Heikkilä and A. Anttila.
(1992). Constraint Grammar of English: A
Performance-Oriented Introduction. University
of Helsinki: Department of General Linguistics.
Table 1: Tagged Spoken Data from the BNC -- Vertical Format

0000203 070 I 03 PNP
0000203 080 can 03 [VM0/100] NN1%/0
0000203 090 just 03 [AV0/100] AJ0%/0
0000203 100 take 98 VVI
0000204 010 note 03 [NN1/99] VVB/1
0000204 020 of 03 PRF
0000204 030 any 03 [DT0/100] AV0%/0
0000204 040 other 03 [AJ0/99] NN1@/1
0000204 050 er 03 UNC
0000204 060 personal 03 AJ0
0000204 070 pension 03 [NN1/100] VVB@/0
0000204 071 , 03 ,
0000204 080 not 03 XX0
0000204 090 personal 03 AJ0
0000204 100 pension 03 [NN1/100] VVB@/0
0000204 101 , 03 ,
0000204 110 any 03 [DT0/97] AV0%/3
0000204 120 erm 03 UNC
0000205 010 other 03 [AJ0/98] NN1@/2
0000205 020 insurance 03 NN1
0000205 030 you > 03 PNP
0000205 031 've < 03 VHB
0000205 040 got 98 VVN
0000205 041 , 03 ,
0000205 050 just 03 [AV0/100] AJ0%/0
0000205 060 put 03 [VVB/66] VVD@/22 VVN@/13
0000205 070 it 03 PNP
0000205 080 on 03 [AVP@/62] PRP/38
0000205 090 there 03 [AV0/100] EX0/0
0000205 100 and 97 CJC
0000205 101 , 96 ,
0000205 110 and 96 CJC
0000205 120 that > 97 DT0
0000205 121 's < 97 [VBZ/100] VHZ@/0
0000206 001 <unclear> 01 NULL
0000206 002 </u> 01 NULL
0000207 001 *'22;2679;u 01 NULL
Table 2: Tagged Spoken Data from the BNC -- Horizontal Format

<s c="0000201 002" n=00061>
<ptr t=P13> That&DT0;'s&VBZ; what&DTQ; <ptr t=P14> I&PNP; was&VBD;
told&VVN; to&TO0; bring&VVI; and&CJC; that&DT0;'s&VBZ; what&DTQ;
I&PNP; have&VHB; brought&VVN;.&PUN;
</u>
<u id=D0027 who=W0000>
<s c="0000203 002" n=00062>
Yeah&ITJ;,&PUN; I&PNP;'m&VBB;,&PUN; I&PNP;'ve&VHB; got&VVN;
another&DT0; form&NN1;,&PUN; I&PNP; can&VM0; just&AV0; take&VVI;
note&NN1; of&PRF; any&DT0; other&AJ0; er&UNC; personal&AJ0;
pension&NN1;,&PUN; not&XX0; personal&AJ0; pension&NN1;,&PUN; any&DT0;
erm&UNC; other&AJ0; insurance&NN1; you&PNP;'ve&VHB; got&VVN;,&PUN;
just&AV0; put&VVB; it&PNP; on&AVP-PRP; there&AV0; and&CJC;,&PUN;
and&CJC; that&DT0;'s&VBZ; <unclear>
</u>
</u>
