A PROBABILISTIC APPROACH TO GRAMMATICAL ANALYSIS 
OF WRITTEN ENGLISH BY COMPUTER.
Andrew David Beale, 
Unit for Computer Research on the English Language,
University of Lancaster, Bowland College, 
Bailrigg, Lancaster, England LA1 4YT.
ABSTRACT 
Work at the Unit for Computer Research 
on the English Language at the
University of Lancaster has been directed 
towards producing a grammatically 
annotated version of the Lancaster-Oslo/
Bergen (LOB) Corpus of written British 
English texts as the preliminary stage in
developing computer programs and data 
files for providing a grammatical 
analysis of unrestricted English text.
From 1981-83, a suite of PASCAL 
programs was devised to automatically 
produce a single level of grammatical 
description with one word tag representing 
the word class or part of speech of each 
word token in the corpus. Error analysis 
and subsequent modification to the system 
resulted in over 96 per cent of word 
tags being correctly assigned 
automatically. The remaining 3 to 4 per
cent were corrected by human post-editors.
Work is now in progress to devise a
suite of programs to provide a 
constituent analysis of the sentences in 
the corpus. So far, sample sentences 
have been automatically assigned phrase 
and clause tags using a probabilistic 
system similar to word tagging. It is 
hoped that the entire corpus will 
eventually be parsed. 
THE LOB CORPUS 
The LOB Corpus (Johansson, Leech and 
Goodluck, 1978) is a collection of 500 
text samples, each containing about 
2,000 word tokens of written British 
English published in a single year (1961).
The 500 text samples fall into 15 
different text categories representing 
a variety of styles such as press 
reporting, science fiction, scholarly and 
scientific writing, romantic fiction and 
religious writing. There are two main 
sections: informative prose and imaginative 
prose. The corpus contains just over 1 
million word tokens in all. 
Preparation of the LOB corpus in
machine readable form began at the 
Department of Linguistics and Modern 
English Language at the University of 
Lancaster in the early 1970s under the 
direction of G.N. Leech. Work was 
transferred, in 1977, to the Department 
of English at the University of Oslo, 
Norway and the Norwegian Computing Centre 
for the Humanities at Bergen. Assembly 
of the corpus was completed in 1978. 
The LOB Corpus was designed to be a
British English equivalent of the
Standard Corpus of Present-Day Edited
American English, for use with Digital
Computers, otherwise known as the Brown
Corpus (Kučera and Francis, 1964; Hauge
and Hofland, 1978). The year of
publication of all text samples (1961)
and the division into 15 text categories
are the same for both corpora for the
purposes of a systematic comparison of 
British and American natural language and 
for collaboration between researchers 
at the various universities. 
Word Tagging of the LOB Corpus.
The initial method devised for
automatic word tagging of the LOB corpus 
can be represented by the following 
simplified schematic diagram: 
WORD FORMS --> POTENTIAL WORD TAG
ASSIGNMENT (for each word in isolation) 
--> TAG SELECTION (of words in context) 
--> TAGGED WORD FORMS 
Sample texts from the corpus are 
input to the tagging system which then 
performs essentially two main tasks: 
firstly, one or more potential tags and, 
where appropriate, probability markers, 
are assigned to each input word by a 
look up procedure that matches the input 
form against a list of full word forms, 
or, by default, against a list of one to 
five word final characters, known as the 
'suffixlist'; subsequently, in cases
where more than one potential tag has
been assigned, the most probable tag is
selected by using a matrix of one-step
transition probabilities giving the
likelihood of one word tag following
another (Marshall, 1983: 141ff).
The tag selection procedure 
disambiguates the word class membership 
of many common English words (such as 
CONTACT, SHOW, TALK, TELEPHONE, WATCH and
WHISPER). Moreover, the method is
suitable for disambiguating strings of 
adjacent ambiguities by calculating the 
most likely path through a sequence of 
alternative one-step transition 
probabilities. 
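The two stages described above can be illustrated with a minimal sketch. It is not the PASCAL suite itself: the word list, suffix list, tag names and transition values below are invented for illustration, and the floor value for unseen transitions is an assumption. The sketch assigns candidate tags by full-form look up with a suffix fallback, then picks the most likely path through the candidates using one-step transition probabilities.

```python
import math

# Toy data, invented for illustration; not taken from the LOB system.
WORDLIST = {"the": ["AT"], "show": ["NN", "VB"], "talks": ["NNS", "VBZ"]}
SUFFIXLIST = {"ing": ["VBG", "NN"], "ly": ["RB"], "s": ["NNS", "VBZ"]}
DEFAULT_TAGS = ["NN", "VB", "JJ"]

# P(tag | previous tag); "^" marks the start of the sentence.
TRANSITIONS = {
    ("^", "AT"): 0.5, ("^", "NN"): 0.2, ("^", "VB"): 0.1,
    ("AT", "NN"): 0.6, ("AT", "VB"): 0.01,
    ("NN", "VBZ"): 0.3, ("NN", "NNS"): 0.1,
    ("VB", "NNS"): 0.2, ("VB", "VBZ"): 0.01,
}
FLOOR = 1e-4  # assumed probability for transitions missing from the table


def candidate_tags(word):
    """Stage 1: full-form look up, then the longest matching suffix (5 to 1 chars)."""
    w = word.lower()
    if w in WORDLIST:
        return WORDLIST[w]
    for n in range(min(5, len(w)), 0, -1):
        if w[-n:] in SUFFIXLIST:
            return SUFFIXLIST[w[-n:]]
    return DEFAULT_TAGS


def select_tags(words):
    """Stage 2: most likely tag path under one-step transition probabilities."""
    # best[tag] = (log probability of the best path ending in tag, that path)
    best = {"^": (0.0, [])}
    for word in words:
        new_best = {}
        for tag in candidate_tags(word):
            score, path = max(
                (prev_score + math.log(TRANSITIONS.get((prev, tag), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]


print(select_tags(["The", "show", "talks"]))
```

With these toy values the ambiguous words "show" and "talks" are resolved together, because the score of each candidate depends on the best path leading to it rather than on the word in isolation.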
Error analysis of the method (Marshall, 
op. cit.: 143) showed that the system was
over 93 per cent successful in assigning 
and selecting the appropriate tag in 
tests on the running text of the LOB
corpus. But it became clear that this 
figure could be improved by retagging 
problematic sequences of words prior to 
word tag disambiguation and, in addition, 
by altering the probability weightings of 
a small set of sequences of three tags, 
known as 'tag triples' (Marshall, op.
cit.: 147). In this way, the system
makes use of a few heuristic procedures
in addition to the one-step probability
method to automatically annotate the input
text. 
We have recently devised an interactive 
version of the word tagging system so that 
users may type in test sentences at a 
terminal to obtain tagged sentences in 
response. Additionally, we are 
substantially extending and modifying the 
word tag set. The programs and data files 
used for automatic word tagging are being 
modified to reduce manual intervention 
and to provide more detailed
subcategorizations.
Phrase and Clause Tagging. 
The success of the probabilistic model 
for word tagging prompted us to devise 
a similar system for providing a 
constituent analysis. Input to the 
constituent analysis module of the system 
is at present taken to be LOB text with 
post-edited word tags, the output from 
the word tagging system. We envisage 
an interactive system for the future. 
A separate set of phrase and clause 
tags, known as the hypertag set, has been 
devised for this purpose. A hypertag 
consists of a single capital letter 
indicating a general phrase or clause 
category, such as 'N' for noun phrase or 
'F' for finite verb clause. This 
initial capital letter may be followed 
by one or more lower-case letters 
representing subcategories within the 
general hypertag class. For instance, 
'Na' is a noun phrase with a subject 
pronoun head, 'Vzb' is a verb phrase with 
the first word in the phrase inflected 
as a third person singular form and the 
last word being a form of the verb BE. 
Strict rules on the permissible
combinations of subcategory symbols have
been formulated in a Case Law Manual
(Sampson, 1984) which provides the rules
and symbols for checking the output of 
the automatic constituent analysis. The 
detailed distinctions made by the 
subcategory symbols are devised with the 
aim of providing helpful information for 
automatic constituent analysis and, for 
the time being, many subcategory symbols 
are not included in the output of the 
present system. (For the current set of 
hypertags and subcategory symbols, see 
Appendix A). 
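The shape of a hypertag, as described above, can be sketched in a few lines. The function name is hypothetical, and treating every lower-case letter as a separate subcategory symbol is an assumption (the Case Law Manual's rules on permissible combinations are not modelled, nor are the tag suffixes used for co-ordination).

```python
import re

# One capital letter for the general category, then zero or more lower-case
# subcategory letters, e.g. 'N', 'Na', 'Vzb'.
HYPERTAG = re.compile(r"^([A-Z])([a-z]*)$")


def split_hypertag(tag):
    """Return (general category, list of subcategory letters)."""
    match = HYPERTAG.match(tag)
    if match is None:
        raise ValueError(f"not a hypertag: {tag!r}")
    return match.group(1), list(match.group(2))


print(split_hypertag("Vzb"))  # ('V', ['z', 'b'])
print(split_hypertag("N"))    # ('N', [])
```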
The procedures for parsing the corpus 
may be represented in the following
simplified schematic diagram:
WORD TAGGED CORPUS --> T-TAG ASSIGNMENT
(PARTIAL PARSE) --> BRACKET CLOSING AND
T-TAG SELECTION --> CONSTITUENT ANALYSIS
Phrasal and clausal categories and
boundaries are assigned on the basis of 
the likelihood of word tag pairs opening, 
closing or continuing phrasal and clausal 
constituencies. This first part of the 
parsing procedure is known as T-tag 
assignment. A table of word tag pairs 
(with, in some cases, default values) is 
used to assign a string of symbols, known 
as a T-tag, representing parts of the 
constituent structure of each sentence. 
The word tag pair input stage of parsing 
resembles the word- or suffixlist look up 
stage in the word tagging system.
Subsequently, the most likely string of 
T-tags, representing the most probable 
parse, is selected by using statistical 
data giving the likelihood of the 
immediate dominance relations of 
constituents. Other procedures, which I 
will deal with later, are incorporated 
into the system, but, in very broad 
outline, the automatic constituent 
analysis system resembles word tagging 
in that potential categories (and 
boundaries) are first assigned and later 
disambiguated by calculating the most 
likely path through the alternative 
choices. 
In the case of word tagging, the word 
tagged Brown corpus enabled us to derive 
word tag adjacency statistics for 
potential word tag disambiguation. But 
no parsed corpus exists yet for the 
purposes of deriving statistics for
disambiguating parsing information. 
A sample databank of constituent 
structures has therefore been manually 
compiled for initial trials of T-tag 
assignment and disambiguation. 
The Tree Bank 
When the original set of hypertags and
rules was devised, G.R. Sampson began the 
task of drawing tree diagrams of the 
constituent analysis of sample sentences 
on computer print-outs of the word tagged
version of the corpus. As tree drawing 
proceeded, amendments and extensions to 
the rules for tree drawing and the 
inventory of hypertags were proposed, on 
the basis of problems encountered by the 
linguist in providing a satisfactory 
grammatical analysis of the constructions 
in the corpus. The rationale for the 
original set of rules and symbols, and 
of subsequent modifications, is documented 
in a set of Tree Notes (Sampson, 1983 - ). 
So far, about 1,500 complete sentences 
have been manually parsed according to the 
rules described in the Case Law Manual 
and these structures have been keyed into
an ICL VME 2900 machine which represents
them in bracketed notation as four fields
of data on each record of a serial file.
The fields or columns of data are:- (1)
a reference number, (2) a word token of
sample text, (3) the word tag for the
word and (4) a field of hypertags and
brackets showing the constituency-level 
status of each word token. 
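A record of this kind might be modelled as follows. This is only a sketch: the tab separator, the reference numbers, the word tags and the bracket strings in the sample are all invented for illustration, since the physical layout of the serial file is not given here.

```python
from dataclasses import dataclass


@dataclass
class TreeBankRecord:
    reference: str      # (1) reference number
    word: str           # (2) word token of sample text
    word_tag: str       # (3) word tag for the word
    constituency: str   # (4) hypertags and brackets


def parse_record(line, sep="\t"):
    """Split one record into its four fields (the separator is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return TreeBankRecord(*fields)


def bracket_balance(records):
    """Opening minus closing brackets over a run of records; 0 for a complete tree."""
    text = "".join(r.constituency for r in records)
    return text.count("[") - text.count("]")


# Invented sample records, not taken from the tree bank.
SAMPLE = [
    "0001\tthe\tATI\t[S[N",
    "0002\tshow\tNN\tN]",
    "0003\tended\tVBD\t[V]S]",
]
records = [parse_record(line) for line in SAMPLE]
print(bracket_balance(records))  # 0
```

A balance check of this sort is one simple way to catch keying errors when trees are amended by hand.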
Any amendments to the rules and symbols 
for hypertagging necessitate corresponding 
amendments to the tree structures in the 
tree databank. 
The Case Law Manual. 
The Case Law Manual (Sampson, 1984) is
a document that summarizes the rules and
symbols for tree drawing as they were
originally decided and subsequently
modified after problems encountered by the
linguist in working through samples of
the word tagged corpus. I will only give
a brief sketch of the principles contained
in the Case Law Manual in this paper.
Any sequence in the word tagged corpus 
marked as a sentence is given a root 
hypertag, 'S'. Between 'S' and the word 
tag level of analysis, all constituents 
perceived by the linguist to be 
consisting of more than one word and, in 
some cases, single word constituents, 
are labelled with the appropriate 
hypertag. Any clause or sentence tag 
must dominate at least one phrase tag 
but otherwise unary branching is generally 
avoided. 
Form takes precedence over function 
so that, for instance, 'in fact' is
labelled as a prepositional phrase rather
than as an adverbial phrase. No attempt 
is made to show any paraphrase 
relationships. Putative deleted or 
transposed elements are, in general, not 
referred to in the Case Law Manual, the 
exceptions to this general principle 
being in the treatment of some co- 
ordinated constructions and in the 
analysis of constructions involving what 
transformational grammarians call 
unbounded movement rules (Sampson, 1984: 2).
The sentences in the LOB corpus present 
the linguist with the enormously rich 
variety of English syntactic constructions 
that occurs in newspapers, books and 
journals; and they also force issues - 
such as how to incorporate punctuation 
into the parsing scheme, how to deal with 
numbered lists and dates in brackets - 
issues which, although present and 
familiar in ordinary written language, 
are not generally, if at all, accounted 
for in current formalized grammars. 
T-TAG ASSIGNMENT 
A T-tag is part of the constituent 
structure immediately dominating a 
word tag pair, together with any 
closures of constituents that have been 
opened, and left unclosed, by previous 
word tag pairs. Originally, it was 
decided to start the parsing process by 
using a table of all the possible 
combinations of word tag pairs, each with 
its own T-tag output. Rules of this 
sort may be exemplified as follows:- 
(N)   CS - IN = Y[P
(N+1) VBN - JJ = J]N : V][J : V][N
(N+2) - RB = T J : Y][R
(N+3) VBG - RP = Y N : Y][R
A word tag pair, to the left of the 
equals sign, is accepted as the input
to the rule which, by look-up, assigns 
a T-tag or string of T-tag options 
(separated by colons) as alternative 
possible analyses for the input tag pair. 
In example (N), a subordinating 
conjunction followed by a preposition 
indicates that a prepositional phrase 
is to be opened as daughter of the 
previous constituent (denoted by the 
'wild card' hypertag 'Y'); in example
(N+1), a past participle form of a verb
followed by an adjective indicates
three options:
a. either close a previously opened 
adjective phrase and continue an 
already opened noun phrase or 
b. close a previously opened verb 
phrase and open an adjective 
phrase or 
c. close a previously opened verb 
phrase and open a noun phrase 
constituent. 
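The look-up step and the three options just listed can be sketched as a small table plus a decoder. The bracket notation used here ('X]' closes an X, '[X' opens an X, a bare letter continues an open X) is an assumed reading of the T-tag strings, and the function name is hypothetical.

```python
import re

# Rule table keyed by (first word tag, second word tag).
# Options are alternatives for the same tag pair.
T_TAG_RULES = {
    ("VBN", "JJ"): ["J]N", "V][J", "V][N"],
}

# 'X]' closes X, '[X' opens X, a bare 'X' continues an already open X.
TOKEN = re.compile(r"([A-Z])\]|\[([A-Z])|([A-Z])")


def decode(t_tag):
    """Turn a T-tag string into a list of (action, hypertag) steps."""
    actions = []
    for close, open_, cont in TOKEN.findall(t_tag):
        if close:
            actions.append(("close", close))
        elif open_:
            actions.append(("open", open_))
        else:
            actions.append(("continue", cont))
    return actions


for option in T_TAG_RULES[("VBN", "JJ")]:
    print(option, decode(option))
```

Decoding the three options gives, in order, "close J, continue N", "close V, open J" and "close V, open N", matching options a to c.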
In this way, the constituent analysis 
begins by an examination of the 
immediately local context and a
considerable proportion of information 
about correct parsing structure is 
obtained by considering the sequence of 
adjacent word tag pairs in the input 
string. In some cases, surplus
information is supplied about hypertag choices
which later has to be discarded by T-tag
selection; in other cases, word tag
pairs do not provide sufficient clues for
appropriate constituent boundary
assignment. Word tag pair input should
therefore be thought of as producing an 
incomplete tree structure with surplus 
alternative paths, the remaining task 
being to complete the parse by filling in 
the gaps and selecting the appropriate 
path where more than one has been 
assigned. 
Cover Symbols.
For the purposes of T-tag look up,
word tag categories have been conflated
where it is considered unnecessary to
match the input against distinct word 
tags; often, the initial part of a 
T-tag closes the previous constituent, 
whatever the identity of the constituent 
is, and specification of rules for every 
distinct pair of word tags is redundant. 
This prevents T-tag assignment requiring 
an unwieldy 133 * 133 matrix. 
The more general word tag categories 
are known as cover symbols. These 
usually contain part of a word tag 
string of characters with an asterisk 
replacing symbols denoting the redundant 
subclassifications. (See Appendix B for 
a list of cover symbols.) 
Three stages of T-tag assignment. 
T-tag assignment is now divided into 
three look-up procedures: (1) pairs of
word tags, (2) pairs of cover symbols,
(3) single word tags or cover symbols,
preceded or followed by an unspecified 
tag. Each procedure operates in an 
order designed to deal with exceptional 
cases first and most general cases last. 
For instance, if no rules in (1) and (2) 
are invoked by an input pair of tags, 
where the second input tag denotes some 
form of verb, then the default rule
'- VB = Y][V' is invoked such that any tag
followed by any form of verb closes
the constituent left open by a previous
T-tag look-up rule (where 'Y' is a symbol
denoting any hypertag). Subsequently,
a verb phrase is opened.
If the first tag of the input pair
denotes a form of the verb BE, then the
rule 'BE - VB = Y V' in procedure (2) is
invoked. Finally, if the first tag of
the input pair is 'JJR', denoting a
comparative adjective, and the second
tag is 'VBN', denoting the past
participle form of a verb, then the rule
'JJR - VBN = Y J' in (1) is invoked.
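The ordering of the three procedures, from most specific to most general, can be sketched as a chain of table look-ups. The three rules are the ones quoted in the text (as far as they can be read); the table names, the prefix-based `cover` function and the use of `None` for an unspecified tag are assumptions made for this sketch.

```python
def cover(tag):
    """Map a word tag to a cover symbol (assumed, prefix-based mapping)."""
    if tag.startswith("BE"):
        return "BE*"
    if tag.startswith("VB"):
        return "VB*"
    if tag.startswith("JJ"):
        return "J*"
    return tag


WORD_TAG_PAIRS = {("JJR", "VBN"): ["Y J"]}       # procedure (1)
COVER_PAIRS = {("BE*", "VB*"): ["Y V"]}          # procedure (2)
SINGLE_DEFAULTS = {(None, "VB*"): ["Y][V"]}      # procedure (3); None = any tag


def assign_t_tag(first, second):
    """Try exceptional rules first and the most general rules last."""
    if (first, second) in WORD_TAG_PAIRS:
        return WORD_TAG_PAIRS[(first, second)]
    c1, c2 = cover(first), cover(second)
    if (c1, c2) in COVER_PAIRS:
        return COVER_PAIRS[(c1, c2)]
    for key in ((None, second), (None, c2), (first, None), (c1, None)):
        if key in SINGLE_DEFAULTS:
            return SINGLE_DEFAULTS[key]
    return []


print(assign_t_tag("JJR", "VBN"))  # rule from procedure (1)
print(assign_t_tag("BEZ", "VBN"))  # rule from procedure (2)
print(assign_t_tag("NN", "VBD"))   # default rule from procedure (3)
```

Because the more specific tables are consulted first, the pair JJR, VBN never reaches the general verb default even though its second tag is a verb form.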
The T-tag table was initially 
constructed by linguistic intuition and 
subsequently keyed into the ICL VME 2900
machine. Comparison of results with
sections of samples from the tree bank
enables a more empirical validation of
the entries by checking the output of the
T-tag look up procedure against samples
of the corpus that have been manually
parsed according to the rules contained
in the Case Law Manual.
Where alternative T-tags are assigned
for any word or cover tag pair, the 
options are entered in order of 
probability and unlikely options are 
marked with the token '@'. This 
information can be used for adjusting 
probability weightings downwards in 
comparison of alternative paths through 
potential parse trees. 
Reducing T-tag options. 
Some procedures are incorporated into 
T-tag assignment which serve to reduce 
the explosive combinatorial possibilities 
of a long partial parse with several 
T-tag options. Sometimes, T-tag options 
can be discarded immediately after T-tag
assignment because adjacent T-tag 
information is incompatible; a T-tag 
that closes a constituency level that 
has not previously been opened is not a 
viable alternative. In cases where 
adjacent T-tags are compatible, the 
assignment program collapses common 
elements at either end of the options 
and the optional elements are enclosed
within curly brackets, separated by
one or more colons. Here is the
representation in cover symbols and
alternative constituent structures of the
sentence, "Their offering last night
differed little from their earlier act 
on this show a week or so ago. " (LOB 
reference: C04 80 001 - 81 081). Cover
symbols and word tags appear in angle
brackets:
[The bracketed string is not legible in this copy. In sentence order, the
cover symbols read: <DT*> <N*> <AP*> <N*> <VB*> <R*> <IN> <DT*> <J*> <N*>
<IN> <DT*> <N*> <DT*> <N*> <CC> <P*> <R*> <.*>.]
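The two reductions described in this section, discarding options that close a constituent which is not open and collapsing what the surviving options have in common, can be sketched as follows. The function names are hypothetical, only labelled closures of the form 'X]' are checked, and the stack of open constituents is assumed to be available from the preceding T-tags.

```python
import os
import re


def closes(t_tag):
    """Hypertags that a T-tag closes, e.g. 'V][J' closes 'V'."""
    return re.findall(r"([A-Z])\]", t_tag)


def viable(t_tag, open_stack):
    """Reject a T-tag that closes a constituent which is not currently open."""
    stack = list(open_stack)
    for category in closes(t_tag):
        if not stack:
            return False
        if category != "Y" and stack[-1] != category:  # 'Y' matches any hypertag
            return False
        stack.pop()
    return True


def collapse(options):
    """Factor out shared leading and trailing material of the options."""
    prefix = os.path.commonprefix(options)
    rest = [o[len(prefix):] for o in options]
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    middle = [r[:len(r) - len(suffix)] for r in rest]
    if len(set(middle)) == 1:
        return prefix + middle[0] + suffix
    return prefix + "{" + ":".join(middle) + "}" + suffix


options = ["J]N", "V][J", "V][N"]
surviving = [o for o in options if viable(o, ["S", "V"])]
print(surviving)            # ['V][J', 'V][N']
print(collapse(surviving))  # V][{J:N}
```

With a verb phrase as the innermost open constituent, the option that closes an adjective phrase is dropped, and the two remaining options share everything except the hypertag that is opened.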
Gaps in the analysis. 
Since the T-tag selection phase of the 
system does not insert constituents, it 
follows that any gaps in the analysis 
produced by T-tag look up must be filled 
before the T-tag selection stage. By 
intuition or by checking the output of 
T-tag assignment against the same samples
contained in the tree bank, rules have 
been incorporated into T-tag assignment 
to insert additional T-tag data after 
look up but before probability analysis. 
When T-tag look up produces [P[N]
(open prepositional phrase, open and close 
noun phrase), a further rule is 
incorporated that closes the prepositional 
phrase immediately after the noun phrase. 
Similarly, a preposition tag followed by 
a wh-determiner (e.g. with whom, to which,
by whatever, etc.) indicates that a finite
clause should be opened between the
previous two word tags (whatever precedes 
the preposition and the preposition 
itself). 
Rules of this sort, which we call 
"heuristic rules", could be dealt with by 
including extra entries in the T-tag 
look up table, but since the constituency 
status is more clearly indicated by 
sequences of more than two tags, it is 
considered appropriate, at this stage, to 
include a few rules to overwrite the 
output from T-tag look up, in the same way 
that heuristics such as 'tag triples' 
and a procedure for adjusting probability
weightings were included in the word 
tagging system, prior to word tag 
selection, to deal with awkward cases 
there. 
Long distance dependencies. 
Genitive phrases and co-ordinated 
constructions are particularly problematic. 
For instance, in The Queen of England's
Palace, T-tag look up is not, at present,
able to establish that a potential
genitive phrase has been encountered 
until the apostrophe is reached. We 
know that a genitive constituent might be 
closed according to whether the potential 
genitival constituent contains more than 
one word. Consequently a procedure must 
be built in to establish where the genitive 
constituent should be opened, if at all. 
Co-ordinated constructions present similar 
problems.
T-TAG SELECTION AND BRACKET CLOSING 
It is the task of the final phase of 
the parser to fill in any remaining 
closing brackets in the appropriate places 
and calculate the most probable tree 
structure given the various T-tag options. 
The bracket closing procedure works 
backwards through the T-tag string, 
selecting unclosed constituents, 
constructing possible subtrees and 
assigning each a probability, using 
immediate dominance probability 
statistics. Each of the possible closing 
structures is incorporated into the 
calculation for the next unclosed 
constituent; the bracket closing procedure 
works its way up and down constituency 
levels until the root node, 'S', has 
been reached and the most probable 
analysis calculated. 
T-tag options are treated in a similar 
manner to bracket closing; probabilities 
are calculated for the alternative 
structures and the most likely one is 
selected. 
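The selection step can be illustrated by scoring complete candidate structures and keeping the best. This sketch does not reproduce the backward bracket closing procedure; it only shows the comparison of alternatives by immediate dominance probability. All probabilities, the floor value and the two candidate trees are invented for illustration.

```python
import math

# Invented immediate dominance probabilities P(daughter sequence | mother).
ID_PROB = {
    ("S", ("N", "V")): 0.4,
    ("N", ("DT", "NN")): 0.5,
    ("N", ("DT", "NN", "P")): 0.1,
    ("P", ("IN", "N")): 0.9,
    ("V", ("VBD",)): 0.6,
}
FLOOR = 1e-4  # assumed score for a daughter sequence missing from the table


def label(node):
    """A leaf is a word tag string; a subtree is (hypertag, [children])."""
    return node if isinstance(node, str) else node[0]


def tree_log_prob(node):
    """Sum of log P(daughters | mother) over every subtree."""
    if isinstance(node, str):
        return 0.0
    mother, children = node
    p = ID_PROB.get((mother, tuple(label(c) for c in children)), FLOOR)
    return math.log(p) + sum(tree_log_prob(c) for c in children)


def select(candidates):
    """Return the candidate structure with the highest probability."""
    return max(candidates, key=tree_log_prob)


# Two candidate structures for the tag string DT NN IN DT NN VBD.
INNER_NP = ("N", ["DT", "NN"])
PP = ("P", ["IN", INNER_NP])
ATTACHED = ("S", [("N", ["DT", "NN", PP]), ("V", ["VBD"])])
DETACHED = ("S", [("N", ["DT", "NN"]), PP, ("V", ["VBD"])])

print(select([DETACHED, ATTACHED]) is ATTACHED)  # True
```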
Immediate dominance probabilities.
A program has been devised to record 
the distinct immediate dominance 
relationships in the tree bank for each 
hypertag; the number of permissible 
sequences of hypertags or word tags that 
any hypertag can dominate is stored in a
statistics file. In initial trials,
this was the databank used for selecting
the most likely parse, but because the
tree bank was not sufficiently large
to provide the appropriate analysis
for structures that, by chance, were not
yet included in the tree bank, other
methods for calculating probabilities were
tried out.
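Collecting immediate dominance statistics from a tree bank can be sketched as below. The tree representation, the function names and the two-sentence toy bank are assumptions for illustration; the relative-frequency estimate is one plausible use of the stored counts. The last line shows the sparse-data problem noted above: a daughter sequence that happens not to occur in the bank receives zero probability.

```python
from collections import Counter, defaultdict


def label(node):
    """A leaf is a word tag string; a subtree is (hypertag, [children])."""
    return node if isinstance(node, str) else node[0]


def count_dominance(trees):
    """counts[mother][daughter sequence] over every subtree in the bank."""
    counts = defaultdict(Counter)

    def visit(node):
        if isinstance(node, str):
            return
        mother, children = node
        counts[mother][tuple(label(c) for c in children)] += 1
        for child in children:
            visit(child)

    for tree in trees:
        visit(tree)
    return counts


def probability(counts, mother, daughters):
    """Relative frequency of a daughter sequence under a given mother."""
    total = sum(counts[mother].values())
    return counts[mother][tuple(daughters)] / total if total else 0.0


# Invented toy tree bank of two sentences.
BANK = [
    ("S", [("N", ["DT", "NN"]), ("V", ["VBD"])]),
    ("S", [("N", ["NN"]), ("V", ["VBD"]), ("N", ["DT", "NN"])]),
]
counts = count_dominance(BANK)
print(probability(counts, "N", ["DT", "NN"]))  # 2 of the 3 noun phrases
print(probability(counts, "N", ["JJ", "NN"]))  # 0.0: unseen in the bank
```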
At present, daughter sequences are 
split into consecutive pairs and the 
probability of a particular option is 
calculated by multiplying probabilities 
of pairs of daughter constituents for 
each subtree. This method prevents 
sequences not accounted for in the tree 
bank from being rejected. Sample 
sentences have been successfully parsed 
using this method, but we acknowledge that 
further work is required. One problem 
created by the method is that, because 
probabilities are multiplied, there is a 
bias against long strings. It is 
envisaged that normalization factors, 
which would take account of the depth of 
the tree, would counterbalance the 
distortion created by multiplication of 
probabilities. 
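The pairwise method and the bias it introduces can be seen in a small sketch. The probabilities are invented, the padding of each daughter sequence with edge markers is an assumption, and the normalization shown (a geometric mean per pair) is only a stand-in: the normalization factors envisaged above, which would take account of tree depth, are not specified.

```python
import math

# Invented probabilities of one daughter immediately following another under
# a given mother; "^" and "$" mark the edges of the sequence (an assumption).
PAIR_PROB = {
    ("N", "^", "DT"): 0.5, ("N", "DT", "JJ"): 0.3, ("N", "JJ", "JJ"): 0.2,
    ("N", "JJ", "NN"): 0.6, ("N", "DT", "NN"): 0.6, ("N", "NN", "$"): 0.7,
}
FLOOR = 1e-3  # assumed probability for a pair missing from the table


def pairs(daughters):
    """Split a daughter sequence into consecutive pairs."""
    padded = ["^", *daughters, "$"]
    return list(zip(padded, padded[1:]))


def pair_product(mother, daughters):
    """Multiply the probabilities of consecutive daughter pairs."""
    return math.prod(PAIR_PROB.get((mother, a, b), FLOOR) for a, b in pairs(daughters))


def normalized(mother, daughters):
    """Geometric mean per pair: a stand-in normalization, not the paper's."""
    return pair_product(mother, daughters) ** (1.0 / len(pairs(daughters)))


short = ["DT", "NN"]
long_ = ["DT", "JJ", "JJ", "NN"]
print(pair_product("N", short), pair_product("N", long_))
print(normalized("N", short), normalized("N", long_))
```

The longer sequence scores far lower on the raw product simply because more factors below one are multiplied together; the per-pair mean narrows the gap, and a sequence never seen as a whole still receives a non-zero score.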
CONCLUSION 
We have found that the success rate 
for grammatically annotating the LOB
corpus using probabilistic techniques 
for lexical disambiguation is surprisingly 
high and we have consequently endeavoured 
to apply similar techniques to provide a 
constituent analysis. 
Corpus data provides us with the rich 
variety of extant English constructions
that are the real test of the grammarian's 
and the computer programmer's skill in 
devising an automatic parsing system. 
The present method provides an analysis, 
albeit a fallible one, for any input 
sentence and therefore the success rate of 
the tagging scheme can be assessed and 
and, where appropriate, improved.
ACKNOWLEDGEMENTS
The author of this paper is one member 
of a team of staff and research 
associates working at the Unit for 
Computer Research on the English Language
at the University of Lancaster. The 
reader should not assume that I have 
contributed any more than a small part of 
the total work described in the paper. 
Other members of the team are R. Garside, 
G. Sampson, G. Leech (joint directors); 
F.A. Leech, B. Booth, S. Blackwell. 
The work described in this paper is 
currently supported by Science and 
Engineering Research Council Grant 
GR/C/47700. 
REFERENCES
Hauge, J. and Hofland, K. (1978).
Microfiche version of the Brown University
Corpus of Present-Day American English.
Bergen: NAVF's EDB-Senter for
Humanistisk Forskning.
Johansson, S., Leech, G. and Goodluck, H.
(1978). Manual of information to
accompany the Lancaster-Oslo/Bergen
Corpus of British English, for use with
digital computers. Unpublished
document: Department of English,
University of Oslo.
Kučera, H. and Francis, W.N. (1964, revised
1971 and 1979). Manual of Information
to accompany A Standard Corpus of
Present-Day Edited American English,
for use with Digital Computers.
Providence, Rhode Island: Brown
University Press.
Marshall, I. (1983). 'Choice of Grammatical
Word-Class without Global Syntactic
Analysis: Tagging Words in the LOB
Corpus', Computers and the Humanities,
Vol 17, No. 3, 139-150.
Sampson, G.R. (1984). UCREL Symbols and
Rules for Manual Tree-Drawing.
Unpublished document: Unit for Computer
Research on the English Language,
University of Lancaster.
Sampson, G.R. (1983 - ). Tree Notes I-XIV.
Unpublished documents: Unit for Computer
Research on the English Language,
University of Lancaster.
APPENDIX A 
Hypertags and Subscripts. 
The initial capital letter of each
hypertag represents a general constituent 
class and subsequent lower case letters 
represent subcategories of the 
constituent class. The reader is warned 
that, in some cases, one lower case 
letter occurring after a capital letter 
has a different meaning to the same 
letter occurring after a different capital 
letter. 
A     As-clause
D     Determiner phrase
Dq    beginning with a wh-word
Dqv   beginning with a wh-ever word
E     Existential THERE
F     Finite-verb clause
Fa    Adverbial clause
Fc    Comparative clause
Ff    Antecedentless relative clause
Fn    Nominal clause
Fr    Relative clause
Fs    Semi-co-ordinating clause
G     Germanic genitive phrase
J     Adjective phrase
Jq    beginning with a wh-word
Jqv   beginning with a wh-ever word
Jr    Comparative adjective phrase
Jx    with a measured gradable
L     Verbless clause
M     Number phrase
Mf    Fractional number phrase
Mi    with ONE as head
N     Noun phrase
Na    with subject pronoun head
Nc    with count noun head
Ne    Emphatic reflexive pronoun
Nf    Foreign expression or formula
Ni    IT occurring with extraposition
Nj    with adjective head
Nm    with mass noun head
Nn    with proper name head
No    with object pronoun head
Np    Plural noun phrase
Nq    beginning with a wh-word
Nqv   beginning with a wh-ever word
Ns    Singular noun phrase
Nt    Time
Nu    with abbreviated unit noun head
Nx    premodified by a measure expression
P     Prepositional phrase
Po    beginning with OF
Pq    with wh-word nominal
Pqv   with wh-ever word nominal
Ps    Stranded preposition
R     Adverbial phrase
Rq    beginning with a wh-word
Rqv   beginning with a wh-ever word
Rr    Comparative adverb phrase
Rx    with a measured gradable
S     Sentence
Si    Interpolation
Sq    Direct quotation
T     Non-finite-verb clause
Tb    Bare non-finite-verb clause
Tf    FOR-TO clause
Tg    with -ing participle as head
Ti    with to-infinitive head
Tn    with past participle head
Tq    Infinitival indirect question
U     Exclamation or Grammatical Isolate
V     Verb phrase
Vb    ending with a form of the verb BE
Ve    containing NOT
Vg    beginning with an -ing participle
Vi    with infinitive head
Vm    beginning with AM
Vn    beginning with a past participle
Vo    Separate verb operator
Vp    Passive verb phrase
Vr    Separate verb remainder
Vz    with distinctive 3rd person tense
W     WITH clause
X     NOT separate from the verb
Y     'Wild card'
,=    Tag suffixes for co-ordinated constructions and 'idiom phrases'
APPENDIX B 
Cover Symbols 
AB* Pre-qualifier or pre-quantifier
(quite, rather, such, all, half, both ...).
AP* Post-determiner (only, other, little,
much, few, several, many, next, last ...).
BE* Grammatical forms of the verb BE
(be, were, was, being, am, been,
are, is).
CD* Cardinal (one, two, 3, 1954-60).
DO* Grammatical forms of the verb DO
(do, did, does).
DT* Determiner or Article (this, the,
any, these, either, neither, a, no;
including pre-nominal possessive
pronouns, her, your, my, our ...).
HV* Grammatical forms of the verb HAVE
(have, had (past tense), having,
had (past participle), has).
J* Adjective (including attributive,
comparative and superlative
adjectives: enormous, tantamount,
worse, brightest ...).
N* Noun (including formulae, foreign
words, singular common nouns, with
or without word initial capitals,
abbreviated units of measurement,
singular proper nouns, singular
locative nouns with word initial
capitals, singular titular nouns
with word initial capitals,
singular adverbial nouns and
letters of the alphabet).
P* Pronoun (none, anyone, everything,
anybody, me, us, you, it, him, her,
them, hers, yours, mine, ours,
myself, themselves ...).
P*A Subject Pronoun (I, we, he, she,
they).
R* Adverb (including comparative,
superlative and nominal adverbs:
delicately, better, least,
indoors, now, then,
to-day, here ...).
RI* Adverb which can also be a particle
or a preposition (above,
between, near, across, on, about,
back, out ...).
VB* Verb form (base form, past tense,
present participle, past
participle, 3rd person singular
forms).
WD* Wh-determiner (which, what,
whichever ...).
WP* Wh-pronoun (who, whoever, whosoever,
whom, whomever, whomsoever ...).
*S Plural form (of common nouns, 
abbreviated units of measurement, 
locative nouns, titular nouns, 
adverbial nouns, post determiners 
and cardinal numbers). 
*$ Genitive form (of singular and
plural common nouns, locative 
nouns with word initial capitals, 
titular nouns with word initial 
capitals, adverbial nouns, ordinals, 
adverbs, abbreviated units of 
measurement, nominal pronouns, 
post-determiners, cardinal numbers, 
determiners and wh-pronouns). 
