GRAMMATICAL ANALYSIS BY COMPUTER OF THE LANCASTER-OSLO/BERGEN
(LOB) CORPUS OF BRITISH ENGLISH TEXTS.
Andrew David Beale 
Unit for Computer Research on the English Language 
Bowland College, University of Lancaster 
Bailrigg, Lancaster, England LA1 4YT.
ABSTRACT 
Research has been under way at the 
Unit for Computer Research on the English
Language at the University of Lancaster, 
England, to develop a suite of computer 
programs which provide a detailed 
grammatical analysis of the LOB corpus, 
a collection of about 1 million words of 
British English texts available in 
machine readable form. 
The first phase of the project,
completed in September 1983, produced a 
grammatically annotated version of the 
corpus giving a tag showing the word 
class of each word token. Over 93 per 
cent of the word tags were correctly 
selected by using a matrix of tag pair 
probabilities and this figure was upgraded 
by a further 3 per cent by retagging 
problematic strings of words prior to 
disambiguation and by altering the 
probability weightings for sequences of 
three tags. The remaining 3 to 4 per
cent were corrected by a human post-editor. 
The system was originally designed to 
run in batch mode over the corpus but we 
have recently modified procedures to run 
interactively for sample sentences typed 
in by a user at a terminal. We are 
currently extending the word tag set and 
improving the word tagging procedures to 
further reduce manual intervention. A 
similar probabilistic system is being 
developed for phrase and clause tagging. 
THE STRUCTURE AND PURPOSE
OF THE LOB CORPUS.
The LOB Corpus (Johansson, Leech and 
Goodluck, 1978), like its American 
English counterpart, the Brown Corpus
(Kucera and Francis, 1964; Hauge and
Hofland, 1978), is a collection of 500
samples of British English texts, each
containing about 2,000 word tokens. The
samples are representative of 15
different text categories: A. Press
(Reportage); B. Press (Editorial); 
C. Press (Reviews); D. Religion; E. 
Skills and Hobbies; F. Popular Lore;
G. Belles Lettres, Biography, Memoirs,
etc.; H. Miscellaneous; J.
Learned and Scientific; K. General 
Fiction; L. Mystery and Detective 
Fiction; M. Science Fiction; N. 
Adventure and Western Fiction, Romance 
and Love Story; R. Humour. There are 
two main sections, informative prose and 
imaginative prose, and all the texts 
contained in the corpus were printed in
a single year (1961). 
The structure of the LOB corpus was 
designed to resemble that of the Brown 
corpus as closely as possible so that 
a systematic comparison of British and 
American written English could be made. 
Both corpora contain samples of texts 
published in the same year (1961) so 
that comparisons are not distorted by 
diachronic factors. 
The LOB corpus is used as a database 
for linguistic research and language 
description. Historically, different 
linguists have been concerned to a
greater or lesser extent with the use of 
corpus citations, to some degree, at 
least, because of differences in the 
perceived view of the descriptive 
requirements of grammar. Jespersen 
(1909-49), Kruisinga and Erades (1911)
gave frequent examples of citations from 
assembled corpora of written texts to 
illustrate grammatical rules. Work on 
text corpora is, of course, very much 
alive today. Storage, retrieval and
processing of natural language text is a 
more efficient and less laborious task 
with modern computer hardware than it 
was with hand-written card files but 
data capture is still a significant 
problem (Francis, 1980). The forthcoming 
work, A Comprehensive Grammar of the 
English Language (Quirk, Greenbaum,
Leech, and Svartvik, 1985) contains many
citations from both LOB and Brown 
Corpora. 
A GRAMMATICALLY ANNOTATED VERSION
OF THE CORPUS
Since 1981, research has been directed 
towards writing programs to grammatically 
annotate the LOB corpus. From 1981-83,
the research effort produced a version of 
the corpus with every word token labelled 
by a grammatical tag showing the word 
class of each word form. Subsequent 
research has attempted to build on the 
techniques used for automatic word
tagging by using the output from the word 
tagging programs as input to phrase and 
clause tagging and by using probabilistic 
methods to provide a constituent analysis 
of the LOB corpus. 
The programs and data files used for
word tagging were developed from work done
at Brown University (Greene and Rubin,
1971). Staff and research associates at 
Lancaster undertook the programming in 
PASCAL while colleagues in Oslo revised 
and extended the lists used by Greene and 
Rubin (op.cit.) for word tag assignment.
Half of the corpus was post-edited at 
Lancaster and the other half at the 
Norwegian Computing Centre for the 
Humanities. 
How word tagging works. 
The major difficulties to be
encountered with word tagging of written 
English are the lack of distinctive 
inflectional or derivational endings and 
the large proportion of word forms that 
belong to more than one word class. 
Endings such as -able, -ly and -ness are
graphic realizations of morphological
units indicating word class, but they 
occur infrequently for the purposes of 
automatic word tag assignment; the 
reader will be able to establish 
exceptions to rules assigning word classes 
to words with these suffixes, because the 
characters do not invariably represent 
the same morphemes. 
The solution we have adopted is to use 
a look up procedure to assign one or more 
potential tags to each input word. The
appropriate word tag is then selected for
words with more than one potential tag
by calculating the probability of the
tag's occurrence given neighbouring
potential tags. 
Potential word tag assignment.
In cases where more than one potential
tag is assigned to the input word, the
tags represent word classes of the word
without taking the syntactic environment
into account. A list of one to five word
final characters, known as the
'suffixlist', is used for assignment of
appropriate word class tags to as many
word types as possible. A list of full
word forms, known as the 'wordlist', is
used for exceptions to the suffixlist, 
and, in addition, word forms that occur 
more than 50 times in the corpus are 
included in the wordlist, for speed of 
processing. The term 'suffixlist' is 
used as a convenient name, and the reader 
is warned that the list does not 
necessarily contain word final morphs; 
strings of between one and five word 
final characters are included if their 
occurrence as a tagged form in the Brown
corpus merits it. 
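The look-up procedure described above can be sketched roughly as follows. This is an illustration only, not the project's PASCAL code: the entries in WORDLIST and SUFFIXLIST, the tag names, the DEFAULT_TAGS fallback and the longest-ending-first search order are all assumptions made for the example.

```python
# Hypothetical miniature lists; the real wordlist and suffixlist are far larger.
WORDLIST = {
    "the": ["AT"],
    "run": ["VB", "NN"],
}
SUFFIXLIST = {
    "ness": ["NN"],
    "able": ["JJ"],
    "ly": ["RB", "JJ"],
    "s": ["NNS", "VBZ"],
}
DEFAULT_TAGS = ["NN", "VB", "JJ"]  # assumed fallback for unmatched words


def potential_tags(word):
    """Return the candidate tags for a word, ignoring its context."""
    form = word.lower()
    # Full word forms (exceptions and frequent words) take priority.
    if form in WORDLIST:
        return list(WORDLIST[form])
    # Otherwise try word-final strings of five characters down to one.
    for length in range(min(5, len(form)), 0, -1):
        ending = form[-length:]
        if ending in SUFFIXLIST:
            return list(SUFFIXLIST[ending])
    return list(DEFAULT_TAGS)
```

A word that receives more than one tag here is left ambiguous; choosing among the candidates is the job of the disambiguation stage.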
The 'suffixlist' used by Greene and
Rubin (op.cit.) was substantially revised 
and extended by Johansson and Jahr (1982) 
using reverse alphabetical lists of 
approximately 50,000 word types of the 
Brown Corpus and 75,000 word types of 
both Brown and LOB corpora. Frequency 
lists specifying the frequency of tags
for word endings consisting of 1 to 5
characters were used to establish the 
efficiency of each rule. Johansson and 
Jahr were guided by the Longman
Dictionary of Contemporary English (1978)
and other dictionaries and grammars
including Quirk, Greenbaum, Leech and
Svartvik (1972) in identifying tags for
each item in the wordlist. For the
version used for Lancaster-Oslo/Bergen
word tagging (1985), the suffixlist was 
expanded to about 7~90 strings of word 
final characters, the wordlist consisted 
of about 7,000 entries and a total of 
135 word tag types were used. 
Potential tag disambiguation.
The problem of resolving lexical
ambiguity for the large proportion of
English words that occur in more than one
word class (BLOW, CONTACT, HIT, LEFT,
RA2~, RUN, REFUSE, ROSE, WALK, WATCH ...)
is solved, whenever possible, by examining
the local context. Word tag selection
for homographs in Greene and Rubin (op.
cit.) was attempted by using 'context
frame rules', an ordered list of 5,300 
rules designed to take into account the 
tags assigned to up to two words 
preceding or following the ambiguous 
homograph. The program was 77 per cent
successful but several errors were due to
appropriate rules being blocked when
adjacent ambiguities were encountered
(Marshall, 1983: 140). Moreover, about 
80 per cent of rule application took 
just one immediately neighbouring tag 
into account, even though only a quarter 
of the context frame rules specified 
only one immediately neighbouring tag. 
To overcome these difficulties, 
research associates at Lancaster have 
devised a transition probability matrix 
of tag pairs to compute the most probable 
tag for an ambiguous form given the 
immediately preceding and following tags. 
This method of calculating one-step
transition probabilities is suitable for 
disambiguating strings of ambiguously 
tagged words because the most likely path 
through a string of ambiguously tagged 
words can be calculated. 
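As a rough sketch of how a most likely path through a string of ambiguously tagged words might be computed from tag-pair values, consider the following. The function name, the start marker '^', the floor value and every figure in the example matrix are assumptions made for illustration; they are not the project's program or its data.

```python
def best_tag_path(candidates, transition, start="^", floor=1e-6):
    """Pick the most likely tag sequence for a string of ambiguous words.

    candidates: list of lists of potential tags, one list per word.
    transition: dict mapping (previous_tag, tag) to a tag-pair value.
    """
    # best[tag] = (score of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for tags in candidates:
        step = {}
        for tag in tags:
            score, path = max(
                ((prev_score * transition.get((prev, tag), floor), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical tag-pair values for a three-word string.
matrix = {
    ("^", "AT"): 0.5,
    ("AT", "NN"): 0.6, ("AT", "VB"): 0.01,
    ("NN", "VBZ"): 0.3, ("NN", "NNS"): 0.1, ("VB", "NNS"): 0.2,
}
print(best_tag_path([["AT"], ["NN", "VB"], ["VBZ", "NNS"]], matrix))
# prints ['AT', 'NN', 'VBZ']
```

Only the best-scoring path into each candidate tag is kept at each word, so the work grows with the length of the string rather than with the number of possible tag combinations.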
The likelihood of a tag being selected 
in context is also influenced by likeli- 
hood markers which are assigned to 
entries with more than one tag in the 
lists. Only two markers, '@' and '%', 
are used, '@' notionally indicating
that the tag is correct for the
associated form on fewer than 1 in 10
occasions, '%' notionally indicating that
the tag occurs on fewer than 1 in 100
occasions. The word tag disambiguation
program uses these markers to reduce the
probability of the less likely tags
occurring in context; '@' results in the
probability being halved, '%' results in 
the probability being divided by eight. 
Hence tags marked with '@' or '%' are 
only selected if the context indicates 
that the tag is very likely. 
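The effect of the two markers can be illustrated with a short sketch. The divisors 2 and 8 come from the text above; the function names, the tags and the contextual probabilities in the usage lines are hypothetical.

```python
MARKER_DIVISOR = {"@": 2, "%": 8}  # divisors as stated in the text


def weighted(probability, marker=None):
    """Scale a contextual probability by the entry's likelihood marker."""
    return probability / MARKER_DIVISOR.get(marker, 1)


def select_tag(candidates):
    """candidates: list of (tag, contextual_probability, marker) triples."""
    return max(candidates, key=lambda c: weighted(c[1], c[2]))[0]


# A tag marked '@' loses to an unmarked rival unless context favours it strongly.
print(select_tag([("NN", 0.3, None), ("VB", 0.5, "@")]))  # prints NN
print(select_tag([("NN", 0.1, None), ("VB", 0.9, "%")]))  # prints VB
```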
Error analysis. 
At several stages during design and 
implementation of the tagging software, 
error analysis was used to improve various 
aspects of the word tagging system. 
Error statistics were used to amend the 
lists, the transition matrix entries and 
even the formula used for calculating 
transition probabilities (originally this 
was the frequency of potential tag A 
followed by potential tag B divided by 
the frequency of A. Subsequently, it was 
changed to the frequency of A followed by 
B divided by the product of the frequency 
of A and the frequency of B (Marshall, 1983: l~w~ff)). 
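In symbols, the two versions of the formula can be written as below. The notation is ours rather than the paper's: f(A) is the frequency of tag A, and f(A, B) is the frequency of tag A immediately followed by tag B.

```latex
\[
  T_{\text{original}}(A, B) = \frac{f(A, B)}{f(A)},
  \qquad
  T_{\text{revised}}(A, B) = \frac{f(A, B)}{f(A)\, f(B)}
\]
```

The revised value also divides by the frequency of B, so a following tag is no longer favoured merely because it is common overall.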
Error analysis indicated that the one- 
step transition method for word tag 
disambiguation was very successful, but 
it was evident that further gains could be 
made by including a separate list of a 
small set of sequences of words such as 
according to, as well as, and so as to
which were retagged prior to word tag
disambiguation. Another modification
was to include an algorithm for altering 
the values of sequences of three tags, 
such as constructions with an intervening 
adverb or simple co-ordinated 
constructions such that the two words on 
either side of a co-ordinating conjunction 
contained the same tag where a choice was 
available. 
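The retagging of listed word sequences before disambiguation might look something like the sketch below. The three sequences are those named in the text; the tags assigned to them, the IDIOMS table layout and the function name are hypothetical.

```python
# Hypothetical idiom entries: word sequence -> tag assigned to each word.
IDIOMS = {
    ("according", "to"): ["IN", "IN"],
    ("as", "well", "as"): ["CC", "CC", "CC"],
    ("so", "as", "to"): ["TO", "TO", "TO"],
}


def retag_idioms(words, candidates):
    """Overwrite candidate tags wherever a listed word sequence occurs."""
    result = [list(tags) for tags in candidates]
    lowered = [w.lower() for w in words]
    i = 0
    while i < len(words):
        for sequence, tags in IDIOMS.items():
            n = len(sequence)
            if tuple(lowered[i:i + n]) == sequence:
                # Each word of the sequence now has exactly one tag.
                for offset, tag in enumerate(tags):
                    result[i + offset] = [tag]
                i += n - 1  # skip the rest of the matched sequence
                break
        i += 1
    return result
```

Because the matched words are left with a single tag each, the disambiguation stage treats them as unambiguous context for their neighbours.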
No value in the matrix was allowed to
fall to zero: a minimum positive value
was provided for even extremely
unlikely tag co-occurrences; this allowed
at least some kind of analysis for unusual 
or eccentric syntax and prevented the 
system from grinding to a halt when 
confronted with a construction that it 
did not recognize. 
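A minimal sketch of such a floor is given here. The constant MIN_VALUE is an assumed figure, since the text does not state the actual minimum, and the function name and matrix layout are likewise illustrative.

```python
MIN_VALUE = 1e-6  # assumed floor; the actual minimum is not given in the text


def transition_value(matrix, prev_tag, tag):
    """Look up a tag-pair value, never returning zero."""
    return max(matrix.get((prev_tag, tag), 0.0), MIN_VALUE)
```

With a floor in place, the product of values along any path stays positive, so some tag sequence can always be ranked above the others.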
Once these refinements to the suite of 
word tagging programs were made, the 
corpus was word-tagged. It was estimated
that the number of manual post-editing 
interventions had been reduced from about 
230,000 required for word tagging of the 
Brown corpus to about 35,000 required 
for the LOB corpus (Leech, Garside and
Atwell, 1983: 36). The method achieves 
far greater consistency than could be 
attained by a human, were such a person 
able to labour through the task of 
attributing a tag to every word token in 
the corpus. 
A record of decisions made at the post- 
editing stage was kept for the purpose of 
recording the criteria for judging 
whether tags were considered to be correct 
or not (Atwell, 1982b). 
Improving word tagging. 
Work currently being undertaken at 
Lancaster includes revising and extending 
the word tag set and improving the suite 
of programs and data files required to 
carry out automatic word tagging. 
Revision of the word tag set. 
The word tag set is being revised so 
that, whenever possible, tags are 
mnemonic such that the characters chosen 
for a tag are abbreviations of the 
grammatical categories they represent. 
This criterion for word tag improvement 
is solely for the benefit of human 
intelligibility and in some cases, 
because of conflicting criteria of 
distinctiveness and brevity, it is not 
always possible to devise clearly 
mnemonic tags. For instance, nouns and 
verbs can be unequivocally tagged by the 
first letter abbreviations 'N' and 'V', 
but the same cannot be said for articles, 
adverbs and adjectives. These categories 
are represented by the tags 'AT', 'RR', 
and 'JJ'. 
It was decided, on the grounds of 
improving mnemonicity, to change 
representation of the category of number 
in the tag set. In the old tag set, 
singular forms of articles, determiners, 
pronouns and nouns were unmarked, and 
plural forms had the same tags as the 
singular forms but with 'S' as the end 
character denoting plural. As far as 
mnemonicity is concerned, this is 
confusing, especially to someone 
uninitiated in the refinements of LOB 
tagging. In the new tag set, number is 
now marked by having 'I' for singular 
forms, 'P' for plural forms and no number 
character for nouns, articles and 
determiners which exhibit no singular or 
plural morphological distinctiveness (COD,
...).
It is desirable, both for the purposes
of human intelligibility and for
mechanical processing, to make the tagging
system as hierarchized as possible. In
the old tag set modal verbs, and forms of
the verbs BE, DO and HAVE were tagged as
'M*', 'B*', 'D*', and 'H*' (where '*'
represents any of the characters used for
these tags denoting subclasses of each
tag class). In the new word tag set,
these have been recoded 'VM*', 'VB*',
'VD*', 'VH*', to show that they are, in
fact, verbs, and to facilitate verb
counting in a frequency analysis of the
tagged corpus; 'VV*' is the new tag for
lexical verbs.
It has been taken as a design principle
of the new tag set that, wherever possible,
subcategories and supercategories should
be retrieved by referring to the
character position in the string of
characters making up a tag, major word
class coding being denoted by the initial
character(s) of the tag and subsequent
characters denoting morpho-syntactic
subcategories.
Hierarchization of the new tag set is
best exemplified by pronouns. 'P*' is a
pronoun, as distinct from other tag
initial characters, such as 'N*' for
noun, 'V*' for verb and so on. 'PP*'
is a personal pronoun, as distinct from
'PN*', an indefinite pronoun; 'PPI*'
is a first person personal pronoun: I,
we, us, as distinct from 'PPY*',
'PPH*' and 'PPX*' which are second,
third person and reflexive pronouns;
'PPIS*' is a first person subject
personal pronoun: I and we, as distinct
from first person object personal pronouns,
me and us, denoted by 'PPIO*'; finally
'PPISI:' is the first person singular
subject personal pronoun, I (the colon
is used to show that the form must have
an initial capital letter).
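Because major classes are coded in the initial character(s) of a tag, supercategories can be retrieved by simple prefix matching. The sketch below shows how verb counting in a frequency analysis might exploit this. The function names and the sample tag strings are illustrative assumptions, not an extract from the actual tag set.

```python
from collections import Counter


def count_by_prefix(tags, length=1):
    """Count tags by their first `length` characters (the supercategory)."""
    return Counter(tag[:length] for tag in tags)


def in_category(tag, prefix):
    """True if the tag belongs to the category named by the prefix."""
    return tag.startswith(prefix)


# Illustrative tag strings only.
sample = ["VV", "VB", "VM", "PPIS", "PPIO", "PN", "NN"]
print(count_by_prefix(sample)["V"])     # prints 3
print(count_by_prefix(sample, 2)["PP"])  # prints 2
```

Under the old scheme a verb count would have needed an explicit list of unrelated tag names; with hierarchical tags one prefix test covers modal verbs, BE, DO, HAVE and lexical verbs alike.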
The third criterion for revising and
enlarging the word tag set is to improve
and extend the linguistic categorisation.
For instance, a tag for the category of
predicative adjective, 'JA', has been
introduced for adjectives like ablaze,
adrift and afloat, in addition to the
already existing distinction between
attributive and ordinary adjectives,
marked 'JB' as distinct from 'JJ'.
There is an essential distributional
restriction on subclasses of adjectives
occurring only attributively or
predicatively, and it was considered
appropriate to annotate this in the tag set
in a consistent manner. The attributive
category has been introduced for
comparative adjectives, 'JBR', (UPPER,
OUTER ...) and superlative adjectives,
'JBT', (UTMOST, UTTERMOST ...).
As a further example of improving the
linguistic categorization without
affecting the proportion of correctly
tagged word forms, consider the word ONE.
In the old tagging system, this word
was always assigned the tag 'CDI'.
This is unsatisfactory, even though ONE
is always assigned the tag it is supposed
to receive, because ONE is not simply
a singular cardinal number. It can be a
singular impersonal pronoun, One is often
surprised by the reaction of ...,
or a singular common noun, He wants this
one, contrasting, for instance, with the
plural form He wants those ones. It is
therefore appropriate for ONE to be
assigned 3 potential tags, 'CDI', 'PNI',
and 'NNI', one of which is to be selected
by the transition probability procedure.
Revision of the programs and data files. 
Revision of the word tag set has 
necessitated extensive revision of the 
word- and suffixlists. The transition 
matrix will be adapted so that the 
corpus can be retagged with tags from 
the new word tag set. In addition, 
programs are being revised to reduce the 
need for special pre-editing and input 
format requirements. In this way, it will 
be possible for the system to tag
English texts other than the LOB corpus
without pre-editing.
Reducing Pre-editing. 
For the 1983 version of the tagged
corpus, a pre-editing stage was carried
out partly by computer and partly by a
human pre-editor (Atwell, 1982a). As part
of this stage, the computer automatically
reduced all sentence-initial capital
letters and the human pre-editor
recapitalized those sentence initial characters
that began proper nouns. We are now
endeavouring to cut out this phase so that
the automatic tagging suite can process
input text in its normal orthographic
form as mixed case characters.
Sentence boundaries were explicitly
marked, as part of the input requirements
to the tagging procedures, and since
the word class of a word with an initial
capital letter is significantly affected
by whether it occurs at the beginning
of a sentence, it was considered
appropriate to make both sentence
boundary recognition and word class
assignment of words with a word initial
capital automatic. All entries in the
word list now appear entirely in lower 
case and words which occur with different 
tags according to initial letter status 
(board, march, may, white ...) are 
assigned tags according to a field
selection procedure: the appropriate tags 
are given in two fields, one for the 
initial upper case form (when not acting 
as the standard beginning-of-sentence 
marker) and the other for the initial 
lower case form. The probability of tags 
being selected from the alternative lists 
is weighted according to whether the form 
occurs at the beginning of the sentence 
or elsewhere. 
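The field selection procedure can be pictured with the following sketch. The two-field entry layout follows the description above, but the tags, the weights and the function name are hypothetical; in particular, the text does not say what weighting is actually applied.

```python
# Hypothetical entries: (tags for initial-capital form, tags for lower-case form).
WORDLIST = {
    "march": (["NP"], ["VB", "NN"]),
    "white": (["NP"], ["JJ", "NN"]),
    "may": (["NP"], ["MD"]),
}


def candidate_fields(token, sentence_initial):
    """Return (tag, weight) pairs drawn from the two fields of an entry."""
    upper_field, lower_field = WORDLIST[token.lower()]
    if not token[0].isupper():
        upper_w, lower_w = 0.0, 1.0
    elif sentence_initial:
        # The capital may only mark the sentence start (assumed weights).
        upper_w, lower_w = 0.3, 0.7
    else:
        # A capital elsewhere suggests the upper case reading (assumed weights).
        upper_w, lower_w = 0.9, 0.1
    pairs = [(t, upper_w) for t in upper_field]
    pairs += [(t, lower_w) for t in lower_field]
    return [(t, w) for t, w in pairs if w > 0]
```

The weights would then be combined with the contextual probabilities, so that a sentence-initial "March" can still be resolved as a verb or a noun when the neighbouring tags point that way.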
Knut Hofland estimated a success rate 
of about 94.3 per cent without pre-editing
(Leech, Garside and Atwell, 1983: 36). 
Hence, the success rate only drops by 
about 2 per cent without pre-editing. 
Nevertheless, the problems raised by words 
with tags varying according to initial 
capital letter status need to be solved 
if the system is to become completely 
automatic and capable of correct tagging 
of standard text. 
Constituent analysis.
The high success rate of word tag 
selection achieved by the one-step 
probability disambiguation procedure 
prompted us to attempt a similar method 
for the more complex tasks of phrase and 
clause tagging. The paper by Garside and 
Leech in this volume deals more fully with 
this aspect of the work. 
Rules and symbols for providing 
a constituent analysis of each of the
sentences in the corpus are set out in a
Case-law Manual (Sampson, 1984) and a
series of associated documents give the
reasoning for the choice of rules and
symbols (Sampson, 1983 - ). Extensive
tree drawing was undertaken while the
Case-law Manual was being written, partly
to establish whether high-level tags and
rules for high-level tag assignment
needed to be modified in the light of the 
enormous variety and complexity of 
ordinary sentences in the corpus, and 
partly to create a databank of manually 
parsed samples of the LOB corpus, for the 
purposes of providing a first- 
approximation of the statistical data 
required to disambiguate alternative 
parses. 
To date, about 35,000 words (1,500
sentences) have been manually parsed and
keyed into an ICL VME 2900 machine. We
are presently aiming for a tree bank of
about 50,000 words of evenly distributed
samples taken from different corpus
categories representing a cross-section
of about 5 per cent of the word tagged
corpus.
The future. 
It should be made clear to the reader 
that several aspects of the research 
are cumulative. For instance, the 
statistics derived from the tagged Brown 
corpus were used to devise the one-step 
probability program for word tag 
disambiguation. Similarly, the word 
tagged LOB corpus is taken as the input 
to automatic parsing. 
At present, we are attempting to
provide constituent structures for the 
LOB corpus. Many of these constructions 
are long and complex; it is notoriously 
difficult to summarise the rich variety 
of written English, as it actually occurs
in newspapers and books, by using a 
limited set of rewrite rules. Initially, 
we are attempting to parse the LOB 
corpus using the statistics provided by 
the tree bank and subsequently, after 
error analysis and post-editing, 
statistics of the parsed corpus can be 
used for further research. 
ACKNOWLEDGEMENTS
The work described by the author of 
this paper is currently supported by 
Science and Engineering Research Council
Grant GRICI~7700. 
REFERENCES
Abbreviation:
ICAME = International Computer Archive
of Modern English.
Atwell, E.S. (1982a). LOB Corpus Tagging
Project: Manual Pre-edit Handbook.
Unpublished document: Unit for
Computer Research on the English
Language, University of Lancaster.
(1982b). LOB Corpus Tagging Project:
Manual Post-edit Handbook. (A
mini-grammar of LOB Corpus English,
examining the types of error commonly
made during automatic (computational)
analysis of ordinary written English).
Unpublished document: Unit for
Computer Research on the English
Language, University of Lancaster.
Francis, W.N. (1980). 'A tagged corpus -
problems and prospects', in Studies
in English linguistics for Randolph
Quirk (1980) edited by S. Greenbaum,
G.N. Leech and J. Svartvik, 192-209.
London: Longman.
Greene, B.B. and Rubin, G.M. (1971).
'Automatic Grammatical Tagging of
English', Providence, R.I.:
Department of Linguistics, Brown
University.
Hauge, J. and Hofland, K. (1978).
Microfiche version of the Brown
University Corpus of Present-Day
American English. Bergen: NAVFs EDB-
Senter for Humanistisk Forskning.
Jespersen, O. (1909-49). A Modern English
Grammar on Historical Principles,
Munksgaard.
Johansson, S. (1982) (editor). Computer
Corpora in English language research.
Bergen: Norwegian Computing Centre
for the Humanities.
Johansson, S. and Jahr, M-C. (1982).
'Grammatical Tagging of the LOB Corpus:
Predicting Word Class from Word
Endings', in S. Johansson (1982), 118-
Johansson, S., Leech, G. and Goodluck, H.
(1978). Manual of information to
accompany the Lancaster-Oslo/Bergen Corpus
of British English, for use with
digital computers. Unpublished document:
Department of English, University of
Oslo.
Kruisinga, E. and Erades, P.A. (1911).
An English Grammar. Noordhoff.
Kucera, H. and Francis, W.N. (1964,
revised 1971 and 1979). Manual of
Information to accompany A Standard Corpus
of Present-Day Edited American
English, for use with Digital
Computers. Providence, Rhode Island:
Brown University Press.
Leech, G.N., Garside, R., and Atwell, E.
(1983). 'Recent Developments in the
use of Computer Corpora in English
Language Research', Transactions of the
Philological Society, 23-40.
Longman Dictionary of Contemporary English
(1978). London: Longman.
Marshall, I. (1983). 'Choice of
Grammatical Word-Class without Global
Syntactic Analysis: Tagging Words in
the LOB Corpus', Computers and the
Humanities, Vol. 17, No. 3, 139-150.
Quirk, R., Greenbaum, S., Leech, G.N.
and Svartvik, J. (1972). A Grammar of
Contemporary English. London: Longman.
(1985). A Comprehensive Grammar of the
English Language. London: Longman.
Sampson, G.R. (1984). UCREL Symbols and
Rules for Manual Tree-Drawing.
Unpublished document: Unit for Computer
Research on the English Language,
University of Lancaster.
(1983 -). Tree Notes I - XIV.
Unpublished documents: Unit for
Computer Research on the English
Language, University of Lancaster.
