Proceedings of EACL '99 
Exploring the Use of Linguistic Features in Domain and Genre 
Classification 
Maria Wolters¹ and Mathias Kirsten²
¹Inst. f. Kommunikationsforschung u. Phonetik, Bonn; wolters@ikp.uni-bonn.de
²German Natl. Res. Center for IT (GMD), AiS.KD, St. Augustin; mathias.kirsten@gmd.de
Abstract 
The central questions are: How useful 
is information about part-of-speech fre- 
quency for text categorisation? Is it fea-
sible to limit word features to content
words for text classification? This is
examined for 5 domain and 4 genre clas- 
sification tasks using LIMAS, the Ger- 
man equivalent of the Brown corpus. Be- 
cause LIMAS is too heterogeneous, nei- 
ther question can be answered reliably 
for any of the tasks. However, the re- 
sults suggest that both questions have 
to be examined separately for each task 
at hand, because in some cases, the ad- 
ditional information can indeed improve 
performance. 
1 Introduction 
The greater the amounts of text people can ac- 
cess and have to process, the more important effi- 
cient methods for text categorisation become. So 
far, most research has concentrated on content- 
based categories. But determining the genre of 
a text can also be very important, for example 
when having to distinguish an EU press release 
on the introduction of the euro from a newspaper 
commentary on the same topic. 
The results of e.g. (Lewis, 1992; Yang and Ped- 
ersen, 1997) indicate that for good content clas- 
sification, we basically need a vector which con- 
tains the most relevant words of the text. Using 
n-grams hardly yields significant improvements, 
because the dimension of the document represen- 
tation space increases exponentially. But do word- 
based vectors also work well for genre detection? 
Or do we need additional linguistically motivated 
features to capture the different styles of writing 
associated with different genres? 
In this paper, we present a pilot study based 
on a set of easily computable linguistic features, 
namely the frequency of part-of-speech (POS) 
tags, and a corpus of German, LIMAS (Glas, 
1975), which contains a wide range of different 
genres. LIMAS is described briefly in Sec. 3, while
sections 2 and 4 motivate the choice of features. 
The text categorisation experiments are described 
in Sec. 5. 
2 Linguistic Cues to Genre 
2.1 What is genre? 
The term "genre" is more frequent in philology 
and media studies than in mainstream linguistics 
(Swales, 1990, p.38). When it is not used synony- 
mously with the terms "register" or "style", genre 
is defined on the basis of non-linguistic criteria. 
For example, (Biber, 1988) characterises genres in 
terms of author/speaker purpose, while text types 
classify texts on the basis of text-internal criteria. 
Swales phrases this more precisely: Genres are 
collections of communicative events with shared 
communicative purposes which can vary in their 
prototypicality. These communicative purposes 
are determined by the discourse community which 
produces and reads texts belonging to a genre. 
But how can we extract the communicative pur-
pose of a given text? First of all, we need to
define the genres we want to detect. The defi- 
nitions which were used in this study are sum- 
marised in section 3.1. If we assume that the 
culture-specific conventions which form the ba- 
sis for assigning a given text to a certain genre 
are reflected in the style of the text, and if that 
style can be characterised quantitatively as a ten- 
dency to favour certain linguistic options over oth- 
ers (Herdan, 1960), we can then proceed to search 
for linguistic features which both discriminate well 
between our genres and can also be computed reli- 
ably from unannotated text. Potential sources for 
such options are comparative genre studies (Biber, 
1988), authorship attribution research (Holmes, 
1998; Forsyth and Holmes, 1996), content analy- 
sis (Martindale and MacKenzie, 1995), and quan- 
titative stylistics (Pieper, 1979). For the last step, 
classification, we need a robust statistical method 
which should preferably work well on sparse and 
noisy data. This aspect will be discussed in more 
detail in section 5. 
In their paper on genre categorization, (Kessler 
et al., 1997) take a somewhat different approach. 
They classify texts according to generic facets. 
Those facets express distinctions that "answer to 
certain practical interests" (p. 33). The "brow" 
facet roughly corresponds to register, and the 
"narrative" facet is taken from text type theory, 
while the "genre" facet most closely correspond to 
our usage of the term. 
2.2 Choice of features 
There are two basic types of features: ratios and 
frequencies. Typical ratios are the type/token ra- 
tio, sentence length (in words per sentence), or 
word length (in characters per word). More elab-
orate ratios which have been found to be useful in 
quantitative stylistics (Ross and Hunter, 1994) are 
e.g. the ratio of determiners to nouns or that of 
auxiliaries to VP heads. 
The most common features to be counted are 
words, or, more precisely, word stems. While 
most text categorisation research focusses on con- 
tent words, function words have proved valuable 
in authorship attribution. The rationale behind 
this is that authors monitor their use of the most 
frequent words less carefully than that of other 
words. But this is not the reason why function 
words might prove to be useful in genre analy- 
sis. Rather, they indicate dimensions such as per- 
sonal involvement (heavy use of first and second 
person pronouns), or argumentativity (high fre- 
quency of specific conjunctions). Content anal- 
ysis counts the frequency of words which belong 
to certain diagnostic classes, for example ag-
gressivity markers. The frequency of other
linguistic features such as part-of-speech (POS),
noun phrases, or infinitive clauses, has been ex- 
amined selectively in quantitative stylistics. In his 
comparative analysis of written and spoken genres 
in English, Biber (1988) lists an impressive
array of 67 linguistically motivated features which 
can be extracted reliably from text. However, he 
sometimes relies heavily on the fixed word order of 
English for their computation, which makes them 
difficult to transfer to a language with a more flex- 
ible word order, such as German. (Karlgren and 
Cutting, 1994) report good results in a genre clas-
sification task based on a subset of these features, 
while (Kessler et al., 1997) show that a prudent 
selection of cues based on words, characters, and 
ratios can perform at least equally well. 
In our paper, we explore a hybrid approach. 
Starting from the classical information retrieval 
representation of texts as vectors of word frequen- 
cies (Salton and McGill, 1983), we explore how 
performance is affected if we include (a Python
sketch of the resulting hybrid representation
follows the list):

- function word frequencies. For example, texts
  which aim at generalisable statements may
  contain more indefinite articles and pronouns
  and fewer definite articles.

- POS frequencies. (This essentially condenses
  information implicitly available in the word
  vector.) For example, nominal style should
  lead to a higher frequency of nouns, whereas
  descriptive texts may show more adjectives
  and adverbials than others.
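The following is a minimal sketch of this hybrid representation, not the original implementation. It assumes tokenisation, lemmatisation, and tagging have happened upstream; the names hybrid_vector, tagged, vocabulary, and tagset are our own.

    # A sketch of the hybrid representation described above: relative
    # lemma frequencies over a fixed vocabulary, optionally extended
    # with relative POS-tag frequencies. All identifiers here are
    # illustrative assumptions, not part of the original system.
    from collections import Counter

    def hybrid_vector(tagged, vocabulary, tagset, use_pos=True):
        """tagged: list of (lemma, pos_tag) pairs for one document."""
        n = max(len(tagged), 1)
        lemma_freq = Counter(lemma for lemma, _ in tagged)
        pos_freq = Counter(tag for _, tag in tagged)
        # Relative lemma frequencies: the classical word vector.
        vec = [lemma_freq[w] / n for w in vocabulary]
        if use_pos:
            # Condensed POS information: relative frequency of each tag.
            vec += [pos_freq[t] / n for t in tagset]
        return vec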
Note that we do not experiment with sophisti- 
cated feature selection strategies, which might be 
worthwhile for the POS information (cf. Sec. 4). 
POS frequency information is the only higher- 
level linguistic information which is encoded ex- 
plicitly. Most current POS-taggers are reliable 
enough (at least for English) for their output to 
be used as the basis for a classification, whereas 
robust, reliable parsers are hard to find. Another 
source of information would have been the posi- 
tion of a word in a sentence, but incorporating 
this would have led to substantially larger feature
spaces and will be left to future work. Seman- 
tic classes were not examined, because defining, 
building, fine-tuning, and maintaining such word 
lists can be an arduous task (cf. e.g. (Klavans and 
Kan, 1998)), which should therefore only be un- 
dertaken for corpora with both well-defined and 
well-represented genres, where inherently fuzzy 
class boundaries are less likely to counteract the 
effect of careful feature selection. 
3 The LIMAS corpus of German 
Since our focus is on genre detection, we decided 
not to use common benchmark collections such 
as Reuters¹ and OHSUMED² because they are
rather homogeneous with respect to genre.

¹http://www.research.att.com/lewis/reuters21578.html
²ftp://medir.ohsu.edu/pub/ohsumed
LIMAS is a comprehensive corpus of contem- 
porary written German, modelled on the Brown 
corpus (Kučera and Francis, 1967) and collected
in the early 1970s. It consists of 500 sources with 
around 2000 words each. It has been completely 
tagged with POS tags using the MALAGA sys- 
tem (Beutel, 1998). MALAGA is based on the 
STTS tagset for German, which consists of 54 cat-
egories (Schiller et al., 1995). The corpus has al-
ready been used for text classification by (von der
Grün, 1999).
Since the corpus is rather heterogeneous, we de- 
fined two sets of tasks, one based on the full cor- 
pus (CL), the other based on all texts from the 
categories law, politics, and economy (LPE) (104 
sources in all). In the LPE experiments, empha- 
sis was on searching for good parameters for the 
various learning algorithms as well as on the con- 
tribution of POS and punctuation information to 
classification accuracy. The experiments on the 
complete corpus, on the other hand, focus more 
on composition of the feature vectors. 
3.1 Genre Classes 
LIMAS is based on the 33 main categories of 
the Deutsche Bibliographie (German bibliogra- 
phy). Each of the bibliography's categories is rep- 
resented according to its frequency in the texts 
published in 1970/1971, so that the corpus can be 
considered representative of the written German 
of that time (Bergenholtz and Mugdan, 1989). 
Furthermore, the corpus designers took care to 
cover a wide range of genres within each subcat- 
egory. As a result, groups of more than 10 doc- 
uments taken from LIMAS will be rather hetero- 
geneous. For example, press reports can be taken 
from broadsheets or tabloids, they can be com- 
mentaries, news reports, or reviews of cultural 
events. 
Many of the main categories correspond to 
domains such as "mathematics" or "history". 
Although not evident from the category label, 
genre distinctions can also be quite important 
for domain classification, because some domains 
have developed specific genres for communication 
within the associated community. There are three 
such domain categories in our experiments, poli- 
tics (P), law (L), and economy (E). Two further 
categories are academic texts from the humani- 
ties (H) and from the field of science and technol- 
ogy (S). In the LPE corpus, this distinction is col- 
lapsed into "academic" (A), the set of all scholarly 
texts in the corpus. Four categories are based on 
genre only. On one hand, we have press texts (N), 
and more specifically NH, press texts from high 
quality broadsheets and magazines, on the other 
hand, fiction (F) and FL, a low-quality subset of 
F. For LPE, we defined a category D consisting 
of articles from quality broadsheets. Table 1 gives 
an overview of the categories and the number of 
documents in each category for each corpus. In 
all subsequent experiments, we assume as base- 
line the classification accuracy which we get when 
all documents are assigned to the majority class.
The baselines are specified in Tab. 1.

          L     P     E     H     S
CL n      20    44    40    109   72
CL acc.   96    91.2  92    78    85.6

          F     FL    N     NH
CL n      60    26    53    30
CL acc.   88    94.8  89.4  94

          L     P     E     A     D
LPE n     20    43    40    45    26
LPE acc.  80    58.7  61.5  56.7  75

Table 1: Number of documents n in each category
and classification accuracy acc. if each document
is judged not to belong to that category.
4 Validating the Features 
If the frequency of POS features does not vary 
significantly between categories, adding such in- 
formation increases both random variation in the
data and its dimensionality. To check for
this, we conducted a series of non-parametric tests 
on CL for each POS tag. 
In addition, binary classification trees were 
grown on the complete set of documents for each 
category, and the structure of the tree was subse- 
quently examined. Classification trees basically 
represent an ordered series of tests. Each tree 
node corresponds to one test, and the order of 
the tests is specified by the tree's branches. All 
tests are binary. The outcome of a test higher up 
in the tree determines which test to perform next. 
A data item which reaches a leaf is assigned the 
class of the majority of the items which reached 
it during training. The trees were grown using 
recursive partitioning; the splitting criterion was 
reduction in deviance. Using the Gini index led 
to larger trees and higher misclassification rates. 
Since the primary purpose of the trees was not 
prediction of unseen, but analysis of seen data, 
they were not pruned. There were no separate 
test sets. 
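The following sketch reproduces this tree analysis under stated assumptions: the paper does not name its software, so scikit-learn is substituted here, and its "entropy" criterion stands in for reduction in deviance (sklearn offers no deviance criterion).

    # A sketch of the tree analysis, assuming scikit-learn. Trees are
    # left unpruned and fitted to all seen data, as in the paper; the
    # "entropy" criterion approximates reduction in deviance.
    from sklearn.tree import DecisionTreeClassifier, export_text

    def grow_analysis_tree(X, y, feature_names):
        tree = DecisionTreeClassifier(criterion="entropy")  # no pruning
        tree.fit(X, y)
        # Print the ordered series of binary tests for manual inspection.
        return export_text(tree, feature_names=feature_names)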
We tested, for 12 categories and all STTS POS
tags, whether the distribution of a tag differs sig-
nificantly between documents in a given category
and documents not in that category. These cate-
gories consist of the nine defined in Sec. 3 plus
the content-based domains history (HI) and reli-
gion (R), and texts from tabloids and similar pub-
lications (PL).
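The paper specifies non-parametric tests but not which one; the Mann-Whitney U test is one standard choice for this two-sample setting and is used below purely as an illustration (the function name and inputs are our own).

    # freqs: one relative tag frequency per document;
    # in_cat: booleans flagging membership in the category under test.
    from scipy.stats import mannwhitneyu

    def tag_distribution_differs(freqs, in_cat, alpha=0.05):
        inside = [f for f, c in zip(freqs, in_cat) if c]
        outside = [f for f, c in zip(freqs, in_cat) if not c]
        _, p = mannwhitneyu(inside, outside, alternative="two-sided")
        return p < alpha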
Choice of Feature Values: The value of a fea- 
ture is its relative frequency in a given text. The 
frequencies were standardised using z-scores, so 
that the resulting random variables have a mean of 
0 and a variance of 1. The z-scores were rounded 
down to the next integer, so that all features 
whose frequency does not deviate greatly from the 
mean have a value of 0. Z-scores were computed 
on the basis of all documents to be compared. 
This makes sense if we view style as deviation from 
a default, and such defaults should be computed 
relative to the complete corpus of documents used, 
not relative to specific classification tasks. 
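A minimal sketch of this feature-value scheme, assuming NumPy. "Rounded down to the next integer" is read here as truncation toward zero, which matches the stated effect that values within one standard deviation of the mean map to 0 (a plain floor would send small negative deviations to -1 instead).

    import numpy as np

    def standardised_values(freqs):
        """freqs: (n_documents, n_features) array of relative frequencies."""
        std = freqs.std(axis=0)
        std[std == 0] = 1.0           # guard against constant features
        z = (freqs - freqs.mean(axis=0)) / std   # z-scores over all docs
        return np.trunc(z).astype(int)           # |z| < 1 becomes 0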
Results: In general, only 7 of all 54 tags show 
significant differences in distribution for more 
than half of the categories, and the actual differ- 
ences are far smaller than a standard deviation. 
However, for most tasks, there are at least 15 POS 
tags with characteristic distributions, so that in- 
cluding POS frequency information might well be 
beneficial. 
The four most important content word classes 
are VVFIN (finite forms of full verbs), NN 
(nouns), ADJD (adverbial adjectives), and ADJA 
(attributive adjectives). Importance is measured 
by the number of significant differences in dis- 
tribution. A higher incidence of VVFIN char- 
acterises F, FL, and NL, whereas texts from 
academia or about politics and law show signif- 
icantly less VVFIN. The difference between the 
means is around 0.2 for F and FL, and below 0.1 
for the rest. (Numbers relate to the z-scores). 
Note that we cannot claim that more VVFIN 
means fewer nouns (NN): scholarly texts show both
less VVFIN and fewer NN than the rest of the cor-
pus. For adjectives, we find that academic texts 
are significantly richer in ADJA (differences be- 
tween 0.02-0.04), while FL contains more adver- 
bial adjectives (difference 0.04). 
But function words can be equally important in- 
dicators, especially personal pronouns, which are 
usually part of the stop word list. They are sig- 
nificantly less frequent in academic texts and cat- 
egories E, L, NH, and P, and more frequent in 
fiction, NL, and R. Again, all differences are at or 
below 0.1. A lower frequency of personal pronouns 
can indicate both less interpersonal involvement 
and shorter reference chains. 
Other valuable categories are, for example, 
pronominal adverbs (PAV) and infinitives of auxil- 
iary verbs (VAINF), where the difference between 
the means usually lies between 0.2 and 0.4 for sig- 
nificant differences. (We restrict ourselves to dis- 
cussing these in more detail for reasons of space.) 
Pronominal adverbs such as "deswegen" (because 
of this) are especially frequent in texts from law 
and science, both of which tend to contain texts 
of argumentative types. The frequency of infini- 
tives of auxiliaries reflects both the use of passive 
voice, which is formed with the auxiliary "wer-
den" in German, and the use of present perfect or
pluperfect tense (auxiliary "haben"). In this cor-
pus, texts from the domains of law and economy 
contain more VAINF than others. 
The potential meaning of common punctuation 
marks is quite clear: the longer the sentences an 
author constructs, the fewer full stops and the 
more commata and subordinating conjunctions we
find. However, the frequency of full stops is dis- 
tinctive only for four categories: L, E, and H have 
significantly fewer full stops, NL has significantly 
more. We also find significantly more commata 
in fiction than in non-fiction. Possible sources for
this are infinitive clauses and lists of adjectives. 
With regard to the trees, we examined only 
those splits that actually discriminate well be- 
tween positive and negative examples with less 
than 40% false positives or negatives. We will 
not present our analyses in detail, but illus- 
trate the type of information provided by such 
trees with the category F. For this category, 
PPER, KOMMA, PTKZU ("to" before infinitive), 
PTKNEG (negation particle), and PWS (substi-
tuting interrogative pronoun) discriminate well in 
the tree. In the case of PTKZU and PTKNEG, 
this difference in distribution is conditional: it was
not observed in the significance tests and surfaced 
only through the tree experiments. 
5 Text Categorisation Experiments 
For our categorisation experiments, we chose a 
relational k-nearest-neighbour (k-NN) classifier, 
RIBL (Emde and Wettschereck, 1996; Bohnebeck
et al., 1998), and two feature-based k-NN algo-
rithms, learning vector quantisation (LVQ) (Ko-
honen et al., 1996) and IBL(-IG) (Daelemans
et al., 1997; Aha et al., 1991). The reason for
choosing k-NN-based approaches is that such al-
gorithms have been very successful in text cate-
gorisation (Yang, 1997).
We first ran the experiments, which were mainly
exploratory in character, on the LPE corpus, and
then on the complete corpus.
In the LPE-experiments, we distinguished six 
feature sets: CW, CWPOS, CWPP, WS, WS- 
POS, and WSPP, where CW stands for content 
word lemmata, WS for all lemmata, POS for POS 
information, and PP for POS and punctuation in- 
formation. 
In the CL-experiments, we focussed not on the
potential contribution of punctuation features to
the results, but on the type of lemma from
which the features were derived. We again ex- 
plored 6 feature sets, CW, CWPOS, WS, WSPOS, 
FW, and FWPOS, where FW stands for function 
word lemmata. Punctuation was included in con- 
ditions WS, WSPOS, FW, and FWPOS, but not 
in CW and CWPOS. In addition to feature type, 
we also varied the length of the feature vectors. 
In the following subsections, we outline our gen- 
eral method for feature selection and evaluation 
and give a brief description of the algorithms used. 
We then report on the results of the two suites of 
experiments. 
5.1 Feature Selection 
The set of all potential features is large - there are 
more than 29000 lemmata in the LPE corpus, and 
more than 80000 in the full corpus. 
In a first step, we excluded for the LPE corpus
all lemmata occurring less than 5 times in the texts,
and for the CL corpus, all lemmata occurring in 
less than 10 sources, which left us with 4857 lem- 
mata for LPE and 5440 lemmata and punctuation 
marks for CL. We then determined the relevance 
of each of these lemmata for a given classifica- 
tion task by their gain ratio (Yang and Pedersen, 
1997). From this ranked list of lemmata, we con- 
structed the final feature sets. 
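A sketch of the gain-ratio ranking follows, assuming binary presence/absence of each lemma per document; the exact discretisation used in the paper is not spelled out, so this is an illustration only.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def gain_ratio(present, labels):
        """present: per-document booleans for one lemma;
        labels: per-document class labels for the given task."""
        n = len(labels)
        cond_h, split_h = 0.0, 0.0
        for value in (True, False):
            subset = [l for p, l in zip(present, labels) if p == value]
            if not subset:
                continue
            w = len(subset) / n
            cond_h += w * entropy(subset)   # conditional entropy
            split_h -= w * math.log2(w)     # split information
        gain = entropy(labels) - cond_h     # information gain
        return gain / split_h if split_h else 0.0

Lemmata are then ranked by this score, and the top of the list forms the final feature set.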
5.2 The Algorithms 
RIBL: RIBL is a k-NN classification algo- 
rithm where each object is represented as a set 
of ground facts, which makes encoding highly 
structured data easier. The underlying first- 
order logic distance measure is described in 
(Emde and Wettschereck, 1996; Bohnebeck et 
al., 1998). Features were not weighted be- 
cause using Kononenko's Relief feature weight- 
ing (Kononenko, 1994) did not significantly af- 
fect performance in preliminary experiments. 
The input for RIBL consists of three relations,
lemma(di,lemma,v), pos(di,POS-Tag,v), and doc-
ument(di), with di the document index and v the
standardised frequency, rounded to the next inte-
ger value. In the CL experiments, the lemma tag
covers both real lemmata and punctuation marks;
in LPE, punctuation marks had a separate pred-
icate. Relations with a feature value of 0 are
omitted, reducing the size of the input consider- 
ably. For these features, a true relational repre- 
sentation is not necessary, but that might change 
for more complex features such as syntactic rela- 
tions. 
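A sketch of this ground-fact input format; the predicate names follow the text, but the concrete serialisation is an assumption.

    def ribl_facts(doc_id, lemma_values, pos_values):
        """lemma_values, pos_values: dicts mapping lemma/POS tag to the
        rounded, standardised frequency; zero-valued features are omitted,
        as described above."""
        facts = [f"document({doc_id})."]
        facts += [f"lemma({doc_id},{lem},{v})."
                  for lem, v in lemma_values.items() if v != 0]
        facts += [f"pos({doc_id},{tag},{v})."
                  for tag, v in pos_values.items() if v != 0]
        return facts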
IBL: IBL stores all training set vectors in an 
instance base. New feature vectors are assigned 
the class of the most similar instance. We use the
Euclidean distance metric for determining nearest
neighbours. All experiments were run with (IBL-
IG) or without (IBL) weighting the contribution 
of each feature with its gain ratio. 
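A minimal sketch of this classifier, assuming NumPy (not the original code): nearest neighbour under Euclidean distance, with optional per-feature weighting by gain ratio for the IBL-IG condition.

    import numpy as np

    def ibl_classify(train_X, train_y, x, weights=None):
        # weights = gain ratios for IBL-IG; None for plain IBL.
        w = np.ones(train_X.shape[1]) if weights is None else weights
        dists = np.sqrt((w * (train_X - x) ** 2).sum(axis=1))
        return train_y[int(np.argmin(dists))]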
LVQ: LVQ also classifies incoming data based 
on prototype vectors. However, the prototypes 
are not selected, but interpolated from the training 
data so as to maximise the accuracy of a nearest- 
neighbour classifier based on these vectors. Dur- 
ing learning, the prototypes are shifted gradually 
towards members of the class they represent and 
away from members of different classes. There 
are three main variants of the algorithm, two of
which only modify codebook vectors at the deci- 
sion boundary between classes. 
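A sketch of the basic LVQ1 update underlying these variants, assuming NumPy; codebook initialisation and the learning-rate schedule (the part OLVQ1 optimises, cf. Sec. 5.3.2) are omitted.

    import numpy as np

    def lvq1_step(codebook, cb_labels, x, y, alpha):
        """Shift the nearest prototype toward x if the labels match,
        away from x otherwise."""
        i = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
        sign = 1.0 if cb_labels[i] == y else -1.0
        codebook[i] += sign * alpha * (x - codebook[i])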
5.3 LPE-Experiments 
5.3.1 Procedure 
From the complete set of documents, we con- 
structed three pairs of training and test sets for 
training the feature classifiers. The test sets are 
mutually disjoint; each of them contains 5 posi-
tive and 5 negative examples. The corresponding 
training sets contain the remaining 95 documents. 
For RIBL, test set performance is determined us- 
ing leave-one-out cross validation. Feature vectors 
contained either 100, 500, or 1000 lemma features.
On the basis of test set performance, we deter- 
mined precision, recall, and accuracy. Instead of 
determining the recall/precision breakeven point as
in (Joachims, 1998) or average precision over differ-
ent recall values as in (Yang, 1997), we provide 
both values to determine which type of error an 
algorithm is more susceptible to. Tab. 2 summa- 
rizes the results. 
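As a reminder of the two error types involved, the measures derive from binary confusion counts; a trivial helper (our own, for illustration):

    def precision_recall(tp, fp, fn):
        # Reporting both values (rather than a breakeven point)
        # shows which error type an algorithm is more prone to.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall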
5.3.2 Algorithm-specific results
Condition IBL-IG resulted in significantly 
higher precision (+0.5%) than IBL, but lower re- 
call and accuracy (difference not significant). The 
number of neighbouring vectors was also varied 
(k = 1, 3, 5, 7). For precision, recall, and accuracy,
best results were achieved with k = 3. A pure 
nearest-neighbour approach led to classifying all 
examples as negative. The number of neighbours 
k was also varied for RIBL. Contrary to IBL, it
performs best for k = 1. 
For the LVQ runs, we used the variant OLVQ1.
In this algorithm, one codebook vector is adapted 
at a time; the rate of codebook vector adaptation 
is optimised for fast convergence. The resulting 
codebook was not tuned afterwards to avoid over- 
fitting. We varied both the number of codebook 
vectors (10, 20, 50, 90) and the initialisation proce-
dure: during one set of runs, each class receives 
the same number of vectors, during the other, 
the number of codebook vectors is proportional to 
class size. Performance increases if codebook vec-
tors are assigned proportionally to each class and
deteriorates with the number of codebook vectors,
a clear sign of overfitting.

Task  Alg.   Prec.   Recall  FN        FS
A     RIBL   92.9    94.05   100       wspos
      IBL    75      75      1000      ws*
      LVQ    99.67   100     500       cwpos
E     RIBL   97.59   77.18   500       ws
      IBL    75      75      1000      all
      LVQ    100     100     1000      all
L     RIBL   95.45   100     100       wspos
      IBL    75      75      100/1000  all
      LVQ    100     100     100       ws*
N     RIBL   100     100     100       wspos
      IBL    75      75      100       all
      LVQ    100     100     100       all
P     RIBL   96.93   89.09   500       ws
      IBL    75      75      100/1000  all
      LVQ    100     100     100       ws*

Table 2: Test set performance averaged over all
runs for each task and for the best combination
of feature set (FS) and number of features (FN),
precision and recall having equal weight.
Key: all: ws/wspos/wspp/cw/cwpos/cwpp; cw*:
cw/cwpos/cwpp; ws*: ws/wspos/wspp
LVQ achieves a performance ceiling of 100%
precision and recall on all tasks except the genre
task A. The low average performance of IBL
is due to bad results for k = 1; for higher k, IBL 
performs as well as LVQ. Overall, performance de- 
creases with increasing number of features. IBL is 
rather robust regarding the choice of feature set. 
LVQ tends to perform better on data sets derived 
from both content and function words, with the 
exception of task A. Because of the ceiling effect, 
it almost never matters if the additional linguistic 
features are included or not. Recall is significantly 
better than precision for most tasks. 
RIBL shows the greatest variation in perfor- 
mance. Although it performs fairly well, Tab. 2 
shows differences of up to -5% on precision and 
-23% on recall. Overall, ws-based feature sets 
outperform cw-based ones. Performance declines 
sharply with the number of features. POS fea- 
tures almost always have a clear positive effect on 
recall (on average +28%, cw* and +16%, ws*), 
but an even larger negative effect on precision
(-38%, cw* and -39%, ws*), which only shows for 500
and 1000 lemma features. Lemma and POS fre- 
quency information apparently conflict, with POS 
frequency leading to overgeneralization. Maybe 
semantic features describe the class boundaries 
more adequately. They may be covered implic- 
itly in large vectors containing lemmata from that 
class. For 100 lemma features, where the represen-
tation is extremely sparse, we find that including 
POS information does indeed boost performance, 
especially for the two genre tasks, as we would 
have predicted. 
5.4 CL Experiments 
5.4.1 Procedure 
In this set of experiments, RIBL and IBL were 
both evaluated using leave-one-out cross valida- 
tion. The performance of LVQ is reported on 
the basis of ten-fold cross validation for reasons 
of computing time. Training and test sets were 
also constructed somewhat differently. The test 
set contained the same proportion of positive ex- 
amples as the training set. If we had balanced 
the test set as above, this would have resulted in 
4 pairs of sets instead of 10, and much smaller 
test sets, because some classes, such as L, are 
very small. This problem was not so grave for the 
LPE experiments because of the ceiling effect and 
the small size of the complete data set; therefore,
we did not rerun the corresponding experiments. 
Furthermore, the number of codebook vectors for 
LVQ was now varied between 10, 50, 100, and 200 
in order to take into account the increased train- 
ing set sizes. 
5.4.2 Results 
The results on the larger corpus differ substan-
tially from those on the smaller corpus. It is far
easier to determine if a text belongs to one of the 
three major domains covered in a corpus than to 
assign a text to a minor domain which covers only 
4% of the complete corpus. If the class itself is not 
considerably more homogeneous (with respect to 
the classifier used) than the rest of the corpus, 
this will be a difficult task indeed. Our results sug- 
gest that the classes were indeed not homogeneous 
enough to ensure reliable classification. The rea- 
son for this is that LIMAS was designed to be as 
representative as possible, and consequently to be 
as heterogeneous as possible. This explains why 
we never achieved 100% precision and recall on 
any data set again. In fact, results became much 
worse, and varied a lot depending mainly on the
type of classifier and the task. Again, if classes are 
very inhomogeneous, any change in the way sim- 
ilarity between data items is computed can have 
strong effects on the composition of the neighbour- 
hood, and the erratic behaviour observed here is a 
vivid testimony of this. We therefore chose not to 
present general summaries, but to document some 
typical patterns of variation. 
Parameter settings: LVQ gives best results in 
terms of both precision and recall for even initial- 
isation of codebook vectors, which makes sense 
because the number of positive examples has now 
become rather small in comparison to the rest of 
the corpus. A good codebook size appears to be 
50 vectors. 
            H               S
         50     200     50      200
CW       65.2   33.6    42.24   47.15
CWPOS    65.2   29.5    42.24   47.15
FW       19.6   54      59.79   17.3
FWPOS    19.6   54      74.4    17.3
WSPOS    88.3   100     62.45   45.9
WS       56.6   68      62.45   45.9

Table 3: Average LVQ results (precision) for cate-
gories H and S with 50 and 200 lemma features;
50 codebook vectors, even initialization.
For RIBL, restricting the size of the relevant 
neighbourhood to 1 or 2 gives by far the best re- 
sults in terms of both precision and recall, but not 
in terms of accuracy - the negative effect of false 
positives is too strong. 
IBL is also sensitive to the size of the neigh- 
bourhood; again, precision and recall are highest 
for k = 1. For this size, incorporating information
gain into the distance measure leads to a clear de- 
crease in performance. 
Overall performance: Unsurprisingly, perfor- 
mance in terms of precision and recall is rather 
poor. Average LVQ performance under the best 
parameter settings in terms of precision and re- 
call only improves on the baseline for two cate-
gories: H (baseline 78%, accuracy for feature set
WSPOS 88%) and FL (feature sets CW and CW-
POS, baseline 94%, accuracy 95%). Under matched
conditions (same genre, same feature set, same 
number of features, optimal settings), IBL and 
RIBL both perform significantly worse than LVQ, 
which can interpolate between data points and so 
smooth out at least some of the noise. For exam- 
ple, IBL accuracy on task H is 69.1% for both WS
and WSPOS, while accuracy on FL never much 
exceeds 92% and thus remains just below baseline. 
RIBL performs best on FL for condition CWPOS, 
but even then accuracy is only 90%. 
Size of Feature Vector: The number of fea- 
tures used did not significantly affect the perfor- 
mance of IBL. For LVQ, both precision and re- 
call decrease sharply as the number of features 
increases (average precision for 50 lemma features 
29.5%, for 200 24.8%; average recall for 50 9.1%, 
for 200 7.1%). But this was not the case for all 
genres, as Tab. 3 shows. The categories H and 
S are chosen for comparison because they are the 
largest. For H, the precision under conditions CW 
and CWPOS decreases, all others increase; for S, 
it is exactly the other way around. 
Composition of feature vectors: Another 
lesson of Tab. 3 is that the effect of the com- 
position of the feature vectors can vary depend- 
ing both on the task and on the size of the fea- 
ture vector. The dramatic fall in precision for 
condition FWPOS, category S, shows that very 
clearly. Here, additional function word informa- 
tion has blurred the class boundaries, whereas for 
H, it has sharpened them considerably. Because of 
the large amount of noise in the results, we would 
be very hesitant to identify any condition as op- 
timal or indeed claim that our hypotheses about 
the role of POS information or content vs. func- 
tion words could be verified. However, what these 
results do confirm is that sometimes, comparing 
different representations might well pay off, as we 
have seen in the case of task H, where WSPOS 
indeed emerges as the optimal feature set choice.
6 Conclusion 
In this paper, we examined different linguistically 
motivated inputs for training text classification al- 
gorithms, focussing on domain- and genre-based 
tasks. 
The most clear-cut result is the influence of the 
training corpus on classifier performance. If we 
want general-purpose classifiers for large genres or 
collections of genres, "small" representative cor- 
pora such as LIMAS will in the end provide too 
little training material, because the emphasis is 
on capturing the extent of potential variation in 
a language, and less on providing sufficient num- 
bers of prototypical instances for text categorisa- 
tion algorithms. In addition, genre boundaries are 
notoriously fuzzy, and if this inherent variability 
is compounded by sparse data, we indeed have 
a problem, as Sec. 5.4 showed. Therefore, fur- 
ther work into genre classification should focus on 
well-defined genres and corpora large enough to 
contain a sufficient number of prototypical docu- 
ments. In our opinion, further investigations into 
the utility of linguistic features for text categori-
sation tasks should best be conducted on such cor-
pora. 
Our results neither support nor refute the hy- 
potheses advanced in Sec. 2. However, note that 
in some cases, the additional non-content word 
information did indeed improve performance (cf. 
Tab. 3), so that such representations should at 
least be experimented with before settling on con- 
tent words. 
Acknowledgements 
We would like to thank Stefan Wrobel, Thomas 
Portele, and two anonymous reviewers for their 
comments. All statistical analyses were con- 
ducted with R (http://www.ci.tuwien.ac.at/R). 
Oliver Lorenz added the POS tags to LIMAS. 
References 
D. Aha, D. Kibler, and M. Albert. 1991. 
Instance-based learning algorithms. Machine 
Learning, 6:37-66. 
H. Bergenholtz and J. Mugdan. 1989. Zur Kor- 
pusproblematik in der Computerlinguistik. In 
I. Bátori, W. Lenders, and W. Putschke, edi-
tors, Handbuch Computerlinguistik. de Gruyter,
Berlin/New York. 
B. Beutel. 1998. Malaga User Manual.
http://www.linguistik.uni-erlangen.de/Malaga.de.html.
D. Biber. 1988. Variation across Speech and 
Writing. Cambridge University Press, Cam- 
bridge. 
U. Bohnebeck, T. Horvath, and S. Wrobel. 1998. 
Term comparisons in first-order similarity mea- 
sures. In Proc. 8th Intl. Conf. Ind. Logic Progr., 
pages 65-79. 
W. Daelemans, A. van den Bosch, and T. Weijters. 
1997. IGTree: Using trees for compression and
classification in lazy learning algorithms. AI 
Review, 11:407-423. 
W. Emde and D. Wettschereck. 1996. Relational 
instance based learning. In Proc. 13th Intl. 
Conf. Machine Learning, pages 122-130. 
R.S. Forsyth and D. Holmes. 1996. Feature-
finding for text classification. Literary and Lin-
guistic Computing, 11:163-174. 
R. Glas. 1975. Das LIMAS-Korpus, ein Textkor-
pus für die deutsche Gegenwartssprache. Lin-
guistische Berichte, 40:63-66.
G. Herdan. 1960. Type-token mathematics: a 
textbook of mathematical linguistics. Mouton, 
The Hague. 
D. Holmes. 1998. The evolution of stylometry in 
humanities scholarship. Literary and Linguis-
tic Computing, 13:111-117. 
T. Joachims. 1998. Text categorization with Sup- 
port Vector Machines: Learning with many rel- 
evant features. Technical Report LS-8 23, Dept. 
of Computer Science, Dortmund University. 
J. Karlgren and D. Cutting. 1994. Recognizing
text genres with simple metrics using discrimi- 
nant analysis. In Proc. COLING Kyoto. 
B. Kessler, G. Nunberg, and H. Schütze. 1997.
Automatic classification of text genre. In Proc.
35th ACL/8th EACL Madrid, pages 32-38.
J. Klavans and Min-Yen Kan. 1998. Role of verbs 
in document analysis. In Proc. COLING/ACL
Montréal.
T. Kohonen, J. Kangas, J. Laaksonen, and 
K. Torkkola. 1996. LVQ-PAK - the learning 
vector quantization package v. 3.0. Technical 
Report A30, Helsinki University of Technology. 
I. Kononenko. 1994. Estimating attributes: Anal- 
ysis and extensions of relief. In Proc. 7th Europ. 
Conf. Machine Learning, pages 171-182.
H. Kučera and W. Francis. 1967. Frequency anal-
ysis of English usage: lexicon and grammar. 
Houghton Mifflin, Boston. 
D. Lewis. 1992. Feature selection and feature ex- 
traction for text categorization. In Proc. Speech 
and Natural Language Workshop, pages 212- 
217. Morgan Kaufmann.
C. Martindale and D. MacKenzie. 1995. On the 
utility of content analysis in author attribution: 
The Federalist. Computers and the Humanities, 
29:259-270. 
U. Pieper. 1979. Über die Aussagekraft statistis-
cher Methoden für die linguistische Stilanalyse.
Narr, Tübingen.
D. Ross and D. Hunter. 1994. µ-EYEBALL:
An interactive system for producing stylistic de- 
scriptions and comparisons. Computers and the 
Humanities, 28:1-11. 
G. Salton and M.J. McGill. 1983. Introduction 
to Modern Information Retrieval. McGraw-Hill,
New York. 
A. Schiller, S. Teufel, and C. Thielen. 1995. 
Guidelines für das Tagging deutscher Textcor-
pora mit STTS. Technical report, IMS
Stuttgart/Seminar f. Sprachwiss. Tübingen.
J. Swales. 1990. Genre Analysis. Cambridge Uni- 
versity Press, Cambridge. 
A. von der Grün. 1999. Wort-, Morphem- und Al-
lomorphhäufigkeit in domänenspezifischen Kor-
pora des Deutschen. Master's thesis, Insti-
tute of Computational Linguistics, University
of Erlangen-Nürnberg.
Y. Yang and J. Pedersen. 1997. A comparative 
study on feature selection in text categorization. 
In Proc. 14th ICML. 
Y. Yang. 1997. An evaluation of statistical ap- 
proaches to text categorization. Technical Re- 
port CMU-CS-97-127, Dept. of Computer Sci- 
ence, Carnegie Mellon University. 
