Towards a cross-linguistic tagset 
Drs. Jan Cloeren 
TOSCA Research Group for Corpus Linguistics 
Dept. of Language ~ Speech, University of Nijmegen 
NL-6525 HT Nijmegen, The Netherlands 
e-malh cloeren@lett.kun.nl 
April 29, 1993 
Abstract 
With the spread of large quantities of corpus 
data, the need has arisen to develop some stan- 
dard not only for the format of'interchange of 
text (an issue which has already been taken 
up by the Text Encoding Inititiave), but also 
for any information added in some subsequent 
stage of (linguistic) enrichment. The research 
community has much to gain by such stan- 
dardization since it will enable researchers to 
e~ectively access and therefore make optimal 
use of the results of previous work on a corpus. 
This paper provides some direction of thought 
as to the development of a standardized tagset. 
We focus on a minimal tagset, i.e. a tagset con- 
raining information about wordclasses. We in- 
vestigate what criteria should be met by such 
a tagset. On the basis of an investigation and 
comparison of ten different tagse~s that have 
been used over the years for the (wordclass) 
tagging of corpora, we arrive at a proposal for 
a cross-linguistic minima\] tagset for Germanic 
languages I . 
Part I 
Introduction 
The last few years there has been an increasing 
interest in the use of corpus data, especially by 
those working in the field of natural language 
processing (NLP). This development can in 
part be ascribed to the fact that the scale on 
which such data are becoming available has 
increased: as publishing houses, industry, etc. 
switched to an electronic format for the in- 
1. In this paper the focus is on British English. Amer- 
lean English, Dutc21 and German. 
terchange of texts this meant a dramatic in- 
crease in the amount of text that was readily 
available to anyone interested, while develop- 
ments in hardware and in software have made 
it possible to manipulate large quantities of 
data (more) effectively. 
Earlier corpus-based approaches proved to 
be laborious undertakings. Before the 'real' 
work could begin one would have to go through 
the painstaking process of designing a cor- 
pus, gaining permission from publishers to use 
(part(s) of) texts and somehow making the 
texts computer-readable. The size of the cor- 
pora was very much determined by such fac- 
tors as cost (in terms of time and money to 
be invested) and availability of data. Corpora 
generally were compiled for particular research 
purposes, such as the investigation of a partic- 
ular variety of the language. Typical examples 
here are the Brown Corpus and the Lancaster- 
Oslo/Bergen (LOB) Corpus which were both 
compiled with the intention of representing a 
cross-section of the language (American En- 
glish and British English respectively). Com- 
pilers of corpora would each adopt his/her 
own conventions for representing the textual 
data. In a similar fashion the further process- 
ing of the data, the enrichment of (basically) 
raw text with some kind of linguistic informa- 
tion (wordclass information, syntactic, seman- 
tic and/or even pragmatic information) would 
largely depend on the time and money avail- 
able, and also on the particular interests of the 
researchers involved. 
Recently, with the increase in the amount 
of data that are becoming available, the at- 
tention of corpus compilers has been drawn to 
the need for a common interchange format for 
texts in order to make these data more readily 
30 
accessible for third parties. The Text Encoding 
Inititiative (TEI) has undertaken to develop a 
standard for the marking up of texts that is 
based on the Standard Generalized Markup 
Language (SGML). So far we have not yet 
reached the point where any serious attempts 
are being made to standardize the linguistic 
information that is being added in successive 
stages of linguistic enrichment. Instead, re- 
searchers from different backgrounds and with 
different beliefs are working their own turf ex- 
ploring a variety of methods for the enrich- 
ment of corpora, ranging from purely stochas- 
tic to strictly rule-based approaches, which 
seem to be in competition with each other. In 
view of the current state of the art in corpus 
analysis, the amount of work that has already 
been done in the area of tagging corpora for 
wordclass information and the experience that 
has been gained in the process, it would appear 
that by now the time has come to start think- 
ing about developing some sort of standard for 
the encoding of wordclass information. 
In the remainder of this paper we focus 
on the design of a minimal tagset, i.e. a 
tagset containing information about word- 
classes, that will provide a common basis for 
the wordclass tagging of texts written in Ger- 
manic languages. 
We first compare a number of tagsets that 
have been or are being used for the tagging of 
wordclass information in prominent corpora. 
On the basis of this comparison we arrive at a 
number of criteria that a standardised (mini- 
real) tagset should meet. Finally put forward a 
proposal for a basic tagset that may be applied 
cross-linguistically. 
Part II 
Tagsets: a comparison 
In order to give the reader some idea of 
the kind of (wordclass) information covered 
in various tagsets, a comparison is made of 
the tagsets employed in ten corpora 2 the 
Brown Corpus (Ku~era and Francis, 1967), the 
Lancaster-Oslo/Bergen Corpus (3ohansson et 
al., 1978), the SUSANNE Corpus (Sampson, 
2. Information regarding a detailed characterization 
of the corpora including size, design, research context 
and method of processing can be found in the corre- 
sponding literature 
forthcoming), the Penn Treebank (Santorini, 
1991), the IBM-Lancaster Corpus (Black et 
al., 1993), the British National Corpus (Leech, 
1993), the ICE Corpus (Greenbaum, 1991), 
the Tosca Corpus (Oostdijk, 1991), the Dutch 
Eindhoven Corpus (uit den Boogaart, 1975), 
and the German IDS-Mannheim Corpus (Neu- 
mann, 1987). 
As a first step in making a comparison of 
the ten tagsets employed in the wordclsss tag- 
ging of the above corpora we start by distin- 
guishing the wordclass categories that these 
tagsets have in common. There are nine cate- 
gories that are (in one fashion or another) in- 
cluded in each of the tagsets: noun, pronoun, 
article/determiner, adjective, adverb, preposi- 
tion, conjunction, verbal form, and interjec- 
tion. Apart from these nine categories we dis- 
tinguish one other category, which by defini- 
tion is open-ended since it is intended to cover 
all tags that do not fit into any of the other 
categories. We shall refer to this category as 
Vopen category". Next, we make an inventory 
of the different tags employed by each of the 
tagsets. As a result we obtain lists 3 of tags 
for each of the wordclass categories we distin- 
guished. For example, Table 1 lists the various 
tags as they occur in the ten tagsets under 
consideration for the tagging of conjunctions 
(incl. connectives). 
There appears to be a great deal of overlap: 
while actual tags may differ (cf. CO, CI, 70 for 
conjunctions in general, in the ICE, Brown and 
Eindhoven tagsets respectively), there seems 
to be some consensus as to the kind of infor- 
marion one wants to encode or tag. Before go- 
ing into this, however, we take a closer look at 
both the format and the nature of the tags. 
Starting with the oldest tagset included in 
our comparison, the Brown tagset, we see that 
this is in a sense a very 'fiat' coding scheme. It 
was developed with the intention of encoding 
genera\] wordclass information. The tags con- 
sist of character sequences which encode the 
wordclass category of a word and occasionally 
extended information such as form, aspect or 
case. Thus for example we find JJ for adjective 
and CC for coordinating conjunction, but also 
VB for base form of verb, VBD for past tense 
of verb, VBG for present participle, etc. 
The LOB tagset, but also the tagsets 
3. These lists are rather sizeable and have therefore 
not been included. They are available, however, from 
the author (via e-mail). 
31 
Conjunctions 
Code Interpretation Corpus 
CO 
CI 
70 
CON3 
COAP 
COCO 
C3C 
CC 
CCnn 
CCB 
CF 
ABX 
DTX 
LE 
LEnn 
CON 
CON 
72 
COSU 
C3S 
CS 
CSnn 
BCS 
71 
73 
74 
C3T 
CST 
CSA 
CSN 
CSW 
conjunction 
conjunction 
conjunction 
conjunct 
appositive conjunction 
coordinative conjunction 
coordinating conjunction 
coordinating conjunction 
part of CC 
coordinating conjunction "but" 
semi-coordinating conjunction "yet" 
double conjunction (both ...and) 
double conjunction (either ... or) 
leading co-ordinator( both, both_and) 
part of LE 
connective 
connective 
comparative connectives 
subordinating conjunction 
subordinating conjunction 
subordinating conjunction 
part of CS 
before subord.conjuction(even(if)) 
subordinating conjunction 
subord, conj. matrix sent. wordorder 
introductory part of conjunctive groups 
conjunction nthat" 
conjunction "that" 
conjunction "as" 
conjunction "than" 
conjunction "whether" 
ICE 
Brown 
Eindhoven 
Tosca 
TOSC& 
TOSCa 
BNC 
Brown, LOB, PENN,SUS 
IBML 
SUS 
IBML 
IBML 
Brown, LOB 
Brown, LOB 
SUS, IBML 
SUS 
Tosca 
ICE 
Eindhoven 
Tosca 
BNC 
Brown, LOB, SUS,IBML 
SUS 
IBML 
Eindhoven 
Eindhoven 
Eindhoven 
BNC 
IBML 
IBML 
IBML 
IBML 
Table 1: conjunctions 
of Penn Treebank, IBM-Lancaster and the 
British National Corpus very much follow 
the Brown tagset. Of these tagsets, the LOB 
tagset is of course closest to the Brown tagset, 
since the corpus was compiled with the inten- 
tion of comparing American and British En- 
glish from the same year (1961). 
The Penn Treebank tagset appears to be 
a reduced version of the Brown/LOB tagset. 
The reduction becomes manifest in that some 
tags are less detailed. For example, where 
Brown/LOB distinguish between possessive 
pronoun, persona\] pronoun, and reflexive pro- 
noun and with each of these has a further 
subclassification (e.g. PPIA for personal pro- 
noun, first person singular nominative; PP1AS 
for persona\] pronoun, first person plural nom- 
inative; PPIO for persona\] pronoun, first per- 
son singular accusative; etc.), Penn Treebank 
only has one tag, PP, to cover all persona\], 
possessive, and reflexive pronouns. All in all 
the derivation of the Penn Treebank from the 
Brown/LOB tagset is fairly straightforward. A 
minor deviation occurs in the tagging of gen- 
itive markers in the case of nouns. Here ac- 
cording to the Penn Treebank tagging scheme 
genitive markers are assigned their own tag, 
which is not the case with the Brown/LOB 
tagging scheme. 
The various tagging schemes that have 
emerged from the cooperation between IBM 
and the Lancaster-based Unit for Computer 
32 
Research on the English Language 4 all can 
be placed within what might be referred to 
as the Brown/LOB tradition in tagging. For 
our discussion we single out one particular 
tagging scheme, called CLAWS2a. This tag- 
ging scheme was developed to encode the in- 
formation needed at wordclass level in order to 
be able to parse computer manuals. Although 
this tagging scheme clearly shows some re- 
semblance to the Brown/LOB tagging scheme, 
it stands out in that with certain categories 
(such as verb and noun) there is a lot of extra 
detail that we do not find elsewhere. For ex- 
ample, tags such as noun o\[sty\]e and noun o£ 
organization are unique for this tagset. When 
we look at the format of the tags we find that 
there may be up to five characters per tag, 
while the leftmost characters relate the more 
general wordclass information and the charac- 
ters that occur towards the right specify fur- 
ther detail. Compound lexical items can be en- 
coded by means of tags that extend over more 
than one word. For example, while II would 
be the tag for simple preposition, the complex 
preposition in spite of is tagged I131 I132 I133. 
The tagset that has been developed within 
the framework of the British National Corpus 
(BNC) contains only some 60 different tags s 
The reason for this must be sought in the 
fact that the BNC intends to incorporate a 
very large amount of text (approx. 100 mil- 
lion words). The grammatical (i.e. wordclass) 
tagging of the corpus is to be carried out au- 
tomaticaily. For reasons of efflclency and also 
to increase the success rate of the tagging it 
was decided to have a rather small tagset. A 
comparison of the BNC tagging scheme to the 
one used in the IBM-Lancaster Corpus shows 
that the two tagging schemes are closely re- 
late& Both can be characterized as 'fiat' tag- 
ging schemes. The major difference between 
the two, apart from their sizes, appears to be 
that the BNC uses more mnemonic abbrevia- 
tions for otherwise similar tags 6 ; for example 
general adjectives get the label AJO instead of 
JJ. 
4. Black et al. (1993). 
~. The British National Corpus is a joint undertaking 
by Oxford University Press, Longman, and V~'.& R. 
Chambers, the universities of Lancaster and Oxford, 
and the British Library. 
6. This was probably done in the hope to improve the 
readlbilhy of the various tags. For example, the tag JJ 
for adjective is replaced by the tag AJO for general 
~ljective. 
So far we have been looking at the tagging 
schemes employed by five English language 
corpora. Turning away from those for a mo- 
ment and shifting our attention to the German 
and the Dutch corpora, we see that the tagging 
schemes employed in these corpora do not dif- 
fer all that much from what we have already 
seen with the English corpora. Again 'fiat' tag- 
ging schemes are found, while the wordclass 
categories that are distinguished largely coin- 
cide with the English wordclasses. As with the 
tagging schemes for the English language cor- 
pora, the tagging schemes for the German and 
Dutch corpora each have their own degree of 
detail. 
The 12-million-word Mannheim Corpus was 
tagged automatically for both wordclass infor- 
mation and syntactic information by means of 
the SATAN parser ~. Neither level of analysis 
includes much detail. The wordclass tagging 
is rather rudimentary and includes only the 
most basic wordclass information needed for 
the syntactic analysis. Only occasionally in- 
formation is added about prepositions that oc- 
cur as collocates of other words and about the 
case that prepositions require for their com- 
plements. 
The tagging of the relatively small Eind- 
hoven Corpus is rather detailed. The tags con- 
sist of three digit codes, where the first digit in- 
dicates the wordclass, the second digit supplies 
information on the subclass, and the third 
digit carries additional information of various 
kinds, such as verb aspect, person, number, 
etc. 
Returning now to the tagging schemes era- 
ployed for the tagging of English language cor- 
pora, we find that in the case of the three 
remaining tagging schemes under considera- 
tion - the SUSANNE tagging scheme, the 
ICE tagset, and the Tosca tagset - they have 
opted for more hierarchically structured tag- 
ging schemes. Each of these tagging schemes 
encodes highly detailed wordclass information 
in a systematic fashion by introducing some 
sort of 'additional feature(s)' slots in their 
tags. The SUSANNE tagging scheme distin- 
guishes as many as 352 distinct tags. Closer 
examination of these tags however learns us 
7. The Mannheim Corpus has been compiled by the 
Irmtitut f/h" Deutsche Sprache (ID5), the German na- 
tional institute for research on the German language. 
The SATAN parser was developed at the University of 
5aarbr,~cken for the purpose of machine translation. 
33 
that at the basis of these 352 items there are 
some 70 major wordclasses, while the addition 
of more featurelike information such as num- 
ber, person, case, etc. leads to a total of 352 
tags. Something similar appears to be the case 
when we consider the Tosca tagset and its less 
detailed derivative, the ICE tagset. Again tags 
for major wordclass categories have been ex- 
tended so as to include additional feature in= 
formation. 
In summary, having compared the different 
tagging schemes that have been or are being 
employed in the tagging of wordclass informa- 
tion in ten different corpora, we find that 
• there appears to be some consensus as 
to what to tag as far as wordclass infor- 
mation is concerned; 
• even the format of the tags in different 
tagging schemes does not differ all that 
much; 
• the kind of information does not vary too 
much from one language to another, in 
other words it appears feasible to have 
a tagset that could be applied cross- 
linguistically. 
As to the last point, we must observe that the 
idea of having a tagset that can be applied 
cross-linguistically is not at all new. For exam- 
ple, the tagset employed in the multi-lingua/ 
ESPRIT-860 project \[Kesselheim, 1986\] was 
intended to comprise all the tags that would 
be required for the tagging of wordclass infor- 
marion in each of the EC languages. Since this 
meant that not only Germanic languages were 
included (Dutch, English, German), but also 
Romance languages (French, Italian, Spanish) 
and Greek, the tagset could be expected to be 
truly cross-linguistic. Unfortunately, however, 
examination of the tagset shows that it is more 
of a collection of various tags that any of the 
languages might at some stage require, rather 
than a thoroughly designed minimal interlin- 
gual (or cross-linguistic) tagset. It is therefore 
all the more surprising to find that certain 
wordclasses in German, Dutch and English are 
not accounted for s. 
8. For example, there appears to be no wordclnss tag 
for interjections for German and Dutch, nor is there a 
tag for particl~ in English or in Dutch. 
Part III 
Criteria for a minimal 
tagset 
Having compared the tagsets of ten prominent 
corpora we now arrive at a point where we 
should reflect a little on the criteria that will 
have to apply to a tagset to make it cross- 
linguistic and generally applicable. If we take 
a closer look at the results of our study de- 
scribed above we can conclude that there are 
mainly three types of criteria in relation to the 
development of a tagset; criteria of a linguis- 
tic nature, criteria for the data format of the 
labels and terminological criteria for the label 
names. Therefore the following points deserve 
special attention: 
• coverage of major word classes 
• addition of relevant feature information 
• format of the tags 
. representation of the labels 
Let us begin with the linguistic criteria. The 
major word classes should minimally be cov- 
ered by the tagset. During our projection of 
the different tagsets we came to the conclu- 
sion that there are 12 main word classes 9 that 
can be distinguished. As we said earlier, it be- 
came obvious that across the different tagsets 
variation as far as the main classes are con- 
cerned was not very large. The same was true 
even for the two non-English tagsets. 
The specification of feature information, how- 
ever, turned out to be more problematic than 
the determination of word classes, because the 
tagsets differed strongly in their degree of de- 
tail and/or their research context. From a lin- 
guistic point of view, there is a great variety 
of feature information. First of all, there are 
subclassiflcations of the major word classes, 
such as the distinction between common and 
proper nouns or the different types of pro- 
noun. Furthermore, in relation to verbs, we 
can think of additional feature information 
about the degree of subcategorisation (tran- 
sitivity). Although this appeared not to be a 
very common feature in the tagsets examined, 
it can be of great importance to grammar- 
based syntactic corpus analysis 1°. Another tel- 
evant linguistic criterion is to facilitate the in- 
clusion of morphological information, at least 
9. see following section 
10. see Oostdljk (1991) 
34 
for verbs, nouns and adjectives. Features con- 
taining this information are for example: num- 
ber (singular, plural), person (first, second, 
third), gender (masculine, feminine, neuter), 
degree of comparison (positive, comparative, 
superlative) and, especially for the analysis of 
German texts, case information (nominative, 
genitive, dative, accustive). From the tagsets 
under survey it appeared that morphological 
information is desirable, but there seems to 
be no agreement on what sort of information 
should be added. This leads us to a following 
criterion that is important for the development 
of our tagset: the format of the tags. 
From our study it has become clear that 
there is some degree of agreement on major 
word class information and to some extent also 
on their subclassification. However, with re- 
spect to additional feature information we can 
see a sort of growing disagreement as the de- 
tail of specification increases. This has to do 
with the various purposes people may have in 
relation to the enrichment of corpora. On the 
other hand language-specific items play a role 
in this context too, for example prepositions 
in postposition in Dutch n. So the format of 
the tags has to account for two major aspects: 
• hierarchical structuring of information 
• flexibility in relation to the encoding of 
special features and/or language specific 
items 
Hierarchical structuring implies the ordering 
of information within a specific label. From left 
to right the degree of detail wiLl increase, so 
that at the beginning we will find the major 
word class label, followed by a subclassiflca- 
tion, and then several additional features. 
With ttexibi\]it¥ we mean that the label format 
should be open so that no researcher is lim- 
ited by our tagset in adding special features 
that, according to his opinion and/or research 
aims, are useful or even essential. On the other 
hand researchers who want to make use of ba- 
sic word classes only, need only concern them- 
selves with parts of the tags and can ignore 
the other parts. 
In our attempt to develop a cross-linguistic 
tagset we make use of a hierarchical data-field 
oriented coding scheme. The hierarchy in the 
labels is represented as follows: the level of de- 
tail increases from left to right and the dif- 
ferent entries are separated by one or more 
11. Hij loopt her boa in. (engl.: He runs into the forest.) 
unique dehmiters. This way of coding also 
enables researchers to convert the format of 
the labels in a relatively easy way for their 
individual needs. For example in the ICE.- 
project, where the tags form the input of a 
two-level grammar, labels with the format de- 
scribed above can be automatically converted 
into two-level tags. So this way of coding seems 
to be attractive linguisticalJy (levels of descrip- 
tion) as well as formally (interchangeability). 
Finally we have to determine how the la- 
bels should be represented. Generally we can 
distinguish between two ways of coding - ei- 
ther a completely numeric label or a mnemonic 
letter-digit sequence. Although for reasons of 
readability mnemonic labels are preferable, 
numeric labels can be used as well, since it is 
not too difficult to transform one weLl-defined 
form into another. The advantage of numeric 
labels is that they are relatively compact and 
therefore can be stored more efficiently. 
Focussing on the mnemonic way of coding, 
the question is: what terminology can best be 
used for the labels. From the tagsets exam- 
ined we can conclude that there is a commonly 
accepted linguistic terminology with respect 
to word-class information, also from a cross- 
linguistic point of view. In order to provide 
codes as mnemonic as possible we are of the 
opinion that this terminology should also be 
included in our tagset. 
In the following section we present a first 
step towards a cross-linguistic tagset as a re- 
sult of our comparative study and illustrate 
the different major word classes, their possi- 
ble subclassification, additional feature infor- 
mation as well as the way in which this infor- 
mation can be coded. 
Part IV 
A basic tagset for 
Germanic languages 
In this section we present a basic cross- 
linguistic tagset for Germanic languages. 
Those familiar with the coding scheme 
adopted in the ESPRIT-S60 project will find a 
number of similarities with this scheme. Thus, 
as we mentioned above, we have adopted the 
idea of a hierarchical data-field oriented coding 
scheme. Moreover the scheme allows for over- 
35 
Word Class Categories 
noun 
pronoun 
article 
adjective 
numeral 
verb 
adverb 
preposition 
conjunction 
particle 
interjection 
formulate expression 
Table 2: word classes 
as well as underspecification 1~. Unlike the ES- 
PRIT tagging scheme, however, it is strictly 
datafield oriented in order to allow automatic 
format and addition of extra feature informa- 
tion. The different datafields are always sepa- 
rated by a # symbol, which functions as a de- 
limiter. The hierarchical structure of the tags 
becomes clear when one looks at the examples 
we give in order to illustrate the tagging of the 
different wordclasses. In addition to what has 
been described above, the tagset also includes 
so-caUed ditto-tags which make it possible to 
account for lexical items that consist of more 
than one word, such as compound nouns or 
complex prepositions. Ditto-tags take the form 
of two numbers separated by a slash, where 
the first number indicates the current part of 
a compound item, while the second number in- 
dicates the total number of words that make 
up the compound. For example, the three word 
complex preposition in spite ofis tagged as fol- 
lows: 
in PRP#...#I/3, spite PRP#...#2/3, of 
PRP#...:~3/3. 
Our basic tagset for the encoding of word class 
information in corpora comprises twelve ma- 
jor word class categories and an additional 
category which is intended to accommodate 
language-specific items. The twelve major cat- 
egories that are distinguished are listed in Ta- 
ble 2. 
The additional category we shall refer to as 
open. Each of the word class categories is dis- 
cussed below. 
Perhaps the categories particle, interjection 
and £ormulaic expression are not so familiar to 
some readers. For this reason we give a more 
detailed description when presenting them. 
12. see Kesselheim (1986) 
I. 1~oun 
Wordclass N 
Subclass corn (common) 
prop (proper) 
Additional info: 
number sg (singular) 
plu (plural) 
gender masc (masculine) 
fem (feminine) 
neut (neuter) 
case nora (nominative) 
gen (genitive) 
dat (dative) 
ace (accusative) 
compound 1/2, 2/2, etc. 
Two examples: 
N#com#plu# 
refers to a single plural common noun, 
for example house. 
N#com~plu~acc~fem 
refers to the same word class as above 
but contains extra case (accusative) and 
gender (feminine) information, for ex- 
ample German Hguser x3 
2. Pronoun 
Wordclass PN 
Subclass per (personal) 
pos (possessive) 
ref (reflexive) 
dem (demonstrative) 
int (interrogative) 
tel (relative) 
ind (indefinite) 
rec (reciprocal) 
Additional info: 
number sg (singular) 
plu (plural) 
person 1 (first) 
2 (second) 
3 (third) 
gender masc (masculine) 
fern (feminine) 
neut (neuter) 
case nora (nominative) 
gen (genitive) 
dat (dative) 
acc (accusative) 
13. engl. houses 
36 
compound i/2, 2/2, etc. 
For example: 
PN#per#sg#1#nom 
refers to a first person singular nomina- 
tive personal pronoun (/). 
PN#rec:#l/2# 
refers to the first part of a reciprocal pro- 
noun, for example each other. 
3. Article 
Wordclass ART 
Subclass def (definite) 
ind (indefinite) 
In addition, features such as number, 
case and gender can be added. 
For example: 
ART#def#sg#acc#masc# 
refers to a definite singular accusative 
masculine article, e.g. german den. 
4. Adjective 
Wordclass ADJ 
Degree pos (positive) 
corn (comparative) 
sup (superlative) 
Subclassifieation of adjectives seems to 
be a complicated matter since many 
of the subdivisions we found are deter- 
mined by the research context. For ex- 
ample, we found attributive adjectives 
(main, chief), nominal adjectives and 
even semantically superlative adjectives. 
So at this stage we are not able to pro- 
vide some consistent subclassification. 
Again, additional features such as num- 
ber, case, gender, ditto (as described 
above) and form (for example English: 
-ed, -ing) can be added, for example: 
ADJ #gen#pos#acc#masc# 
refers to a definite genera/ positive ac- 
cusative masculine adjective, for exam- 
ple German ~ntelligenten 
5. Numeral 
Wordclass NUM 
Subclass ord (ordinal) 
crd (cardinal) 
frac (fractional) 
6. 
7. 
8. 
Features like number and case can be 
added. 
For example: 
NUM#ord# 
refers to an ordinal numeral, for example 
Verb 
Wordclass V 
Subclass 1 (lexical) 
a (auxiliary) 
Form inf (infinitive) 
part (participle 
Tense pres (present) 
past (past) 
Mood ind (indicative) 
subj (subjunctive) 
Subcat different forms of transitivity 
such as intransitive, copular, 
transitive, etc. 
Again features as number, person and 
compounding can be added. 
For example: 
V:#l#3#past:~ 
refers to a lexical verb (third person, 
past tense), for example went. 
Adverb 
Wordclass ADV 
Degree pos (positive) 
corn (comparative) 
sup (superlative) 
Features for compounding can be added. 
A general subdassification of adverbs is 
very hard to establish because of the se- 
mantle nature of such subdivisions. 
For example: 
ADV#pos# 
refers to a positive adverb, for example 
earJy 
Preposition 
Worddass PRP 
Subclass conj (conjunctive) 
adv (adverbial) 
post (postposition, for Dutch) 
phra (phrasal) 
gen (general) 
Features indicating the case a preposi- 
tion requires for its complement, as well 
37 
as ones for compounding can be added. 
For example: 
PRP#phra#dat# 
refers to a phrasal preposition (required 
by a prepositional verb) that combines 
with a complement with dative case. 
9. Conjunction (incl. connectives) 
Wordclass CON 
Subclass sub (subordinating) 
cot (coordinating) 
con (connective) 
Also, ditto-tags can be added. 
For example: 
CON#sub# 
refers to a subordinating conjunction, 
for example because 
10. Particle, Interjection, Formula 
Wordclass PRT (particle) 
INT (interjection) 
FOR (formulaic expression) 
.Interjections are normally referred to as 
words that do not enter into syntactic 
relations and that do not have a clear 
morphological structure. Very often they 
are of a onomatopoeic nature. Examples 
of interjections are: aha, .hm, wow, past, 
oops. 
Formu/a/c Expressions are fixed expres- 
sions used as formulalc reactions in a cer- 
taln dialogue contexts. Examples are: a\]/ 
the best; excuse me; dank u weI; Danke, 
gut. 
Particles are morpholog- 
ically fixed words that do not belong to 
any of the word classes described above 
and that can function in many ways in 
a sentence, for example as introducing 
element of the subject of an infinitival 
clause (for example: I am waiting for the 
meeting to begin), or they function as 
fixed answers to questions (for example: 
yes, no, ja 14 
Ditto-tags can be applied to the d.ifl~er- 
ent elements of the tagged item. 
For example: Good FOR#1/2# Morn- 
ins FOR#2/2# 
14. for more detailed information •bout particles, see 
Engel (1988) 
11. Open 
Wordclass : O (open) 
The subclasses to be distinguished 
within this wordclass category may vary, 
depending on the specific language the 
tagset is used for. For English the geni- 
tive marker belongs in this category; the 
same goes for the German verb particle. 
For example: 
O#GM# 
refers to a genitive marker. 
Part V 
Conclusion 
In this paper we have sketched the way in 
which linguistic enrichment of corpora could 
be standardized. We have reported on our ef- 
forts to standardize the word class tags. In ad- 
dition, we compared the tag sets of a number 
of prominent corpora. The differences between 
these sets encouraged us to proceed towards a 
standardized cross-linguistic tagset. This set 
could contribute to improved access and ex- 
change of analyzed corpora. In addition to a 
standardized tagset it might be interesting to 
determine if and how a standard annotation 
of linguistic information on higher levels of de- 
scription (syntax, semantics, pragmatics) can 
be established. 
We are working on these issues and we hope 
to encourage other (academic and industrial) 
researchers in the field of corpus linguistics to 
participate in the discussion about common 
guidelines for the linguistic annotation of cor- 
pora in the future. 
Part VI 

References 
• Aarts, J. and Th. van den Heuvel (1985): 
Computational tools for the syntactic 
analysis of corpora, in: Linguistics 23: 
303-35. 
• Black, E., 1t. Car- 
side and G. Leech (1987): Statistically- 
Driven Computer Grammars o£ English: 
The IBM-Lancaater Approach. Internal 
report. 
38 
• Boogaart, P.C. uit den (1975): Woord- 
Requenties. Utrecht: Oosthoek, Schel- 
terns & Holkema. 
• Engel, U. (1988): Deutsche Grammatik. 
Heidelberg: Julius Groos Verlag. 
• Greenbaum, S. (1991): The development 
of the International Corpus of English. 
in: A~jmer~ K. and B. Altenberg (1991): 
English Corpus Linguistics: 83-91. Lon- 
don: Longman. 
• Greenbaum, S. (1992): The ICE Tagset 
Manual. London: University College 
London. 
• 3ohansson, S. et al. (1978): Manual olin- 
formation to Accompany the Lancaster- 
Oslo/Bergen Corpus o\[ British English , 
t'or Use with digital Computers. Oslo: 
Dept. of English, University of Oslo. 
• Kesselheim (1986): Coding for word- 
classes, in: ESPB.IT-860 Final report 
\[BU-WKL-0576\]. 
• Ku~era, H. and W.N. Francis (1987): 
Computational Analysis of Present-Day 
American English. Providence, Rhode 
Island: Brown University Press. 
• Leech, G. (1993): 100 million words of 
English, in: English Today 33: 9-15. 
• Marcus, M.P. and B. Santorini (1991): 
Building Very Large Natural Language 
Corpora: The Penn Trecbank. CIS- 
report: University of Pennsylvania. 
• Neumann, R. (1987): Die grammatis- 
che Erschliessung der Mannheimer Kor- 
pora. Mannheim: Institut fuer Deutsche 
Sprache. Internal report. 
• Oostdijk, N. (1991): Corpus Linguistics 
and the Automatic Analysis o/' English. 
Amsterdam: B.odopi. 
• Sampson, G. (1992): The Susanne Cor- 
pus. E-mail report. 
• Sampson, G. (forthcoming): English t'or 
the Computer. Oxford: Oxford Univer- 
sity Press. 
• Santorini, B. (1991): Part-of-Speed 
Tagging Guidelines for the Penn Tree- 
bank Project. Internal report. 
•TEI (1991): List of Comon Morphologi- 
cal Features. TEI document AIIW2. 
