RECOGNIZING TEXT GENRES WITH SIMPLE METRICS USING DISCRIMINANT ANALYSIS
JUSSI KARLGREN
jussi@sics.se
Swedish Institute of Computer Science
Box 1263, S-164 28 KISTA, Stockholm, Sweden
DOUGLASS CUTTING
cutting@apple.com
Apple Computer
Cupertino, CA 95014, USA
Abstract 
A simple method for categorizing texts into pre-determined text genre
categories using the standard statistical technique of discriminant
analysis is demonstrated with application to the Brown corpus.
Discriminant analysis makes it possible to use a large number of
parameters that may be specific for a certain corpus or information
stream, and combine them into a small number of functions, with the
parameters weighted on the basis of how useful they are for
discriminating text genres. An application to information retrieval is
discussed.
Text Types 
There are different types of text. Texts "about" the same thing may be
in differing genres, of different types, and of varying quality. Texts
vary along several parameters, all relevant for the general information
retrieval problem of matching reader needs and texts. Given this
variation, in a text retrieval context, the problems are (i) identifying
genres, and (ii) choosing criteria to cluster texts of the same genre,
with predictable precision and recall. This should not be confused with
the issue of identifying topics, and choosing criteria that discriminate
one topic from another. Although not orthogonal to genre-dependent
variation, the variation that relates directly to content and topic is
along other dimensions. Naturally, there is co-variance. Texts about
certain topics may only occur in certain genres, and texts in certain
genres may only treat certain topics; most topics do, however, occur in
several genres, which is what interests us here.
Douglas Biber has studied text variation along several parameters, and
found that texts can be considered to vary along five dimensions. In his
study, he clusters features according to covariance, to find underlying
dimensions (1989). We wish to find a method for identifying easily
computable parameters that rapidly classify previously unseen texts in
general classes and along a small set (smaller than Biber's five) of
dimensions, such that they can be explained in intuitively simple terms
to the user of an information retrieval application. Our aim is to take
a set of texts that has been selected by some sort of crude semantic
analysis such as is typically performed by an information retrieval
system and partition it further by genre or text type, and to display
this variation as simply as possible in one or two dimensions.

Experiment 1      Experiment 2      Experiment 3 (Brown categories)
I. Informative    1. Press          A. Press: reportage
                                    B. Press: editorial
                                    C. Press: reviews
                  4. Misc           D. Religion
                                    E. Skills and Hobbies
                                    F. Popular Lore
                                    G. Belles Lettres, etc.
                  2. Non-fiction    H. Gov. doc. & misc.
                                    J. Learned
II. Imaginative   3. Fiction        K. General Fiction
                                    L. Mystery
                                    M. Science Fiction
                                    N. Adv. & Western
                                    P. Romance
                                    R. Humor
Table 1: Categories in the Brown Corpus
Method 
We start by using features similar to those first investigated by Biber,
but we concentrate on those that are easy to compute assuming we have a
part-of-speech tagger (Cutting et al., 1992; Church, 1988), such as
third person pronoun occurrence rate as opposed to 'general hedges'
(Biber, 1989). More and more of Biber's features will be available with
the advent of more proficient analysis programs, for instance if
complete surface syntactic parsing were performed before categorization
(Voutilainen & Tapanainen, 1993).
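Several parameters of the kind listed in Table 2 need no tagger at all.
The following is a minimal sketch of how a few of them might be computed;
the function name, the sentence splitter, the pronoun lists, and the
scaling of the type/token ratio to a percentage are all assumptions made
for illustration, not the definitions used in the study.

```python
import re

# Assumed word lists; the study does not spell out its pronoun inventories.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours"}


def simple_metrics(text):
    """Compute a few surface metrics that need no part-of-speech tagger."""
    # Crude sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    lower = [w.lower() for w in words]
    n_words = len(words)
    n_sents = len(sentences)
    return {
        "character_count": len(text),
        "sentence_count": n_sents,
        "long_word_count": sum(1 for w in words if len(w) > 6),
        "words_per_sentence": n_words / n_sents if n_sents else 0.0,
        "chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Distinct word forms per 100 running words (assumed scaling).
        "type_token_ratio": 100.0 * len(set(lower)) / n_words if n_words else 0.0,
        "first_person_count": sum(1 for w in lower if w in FIRST_PERSON),
        "second_person_count": sum(1 for w in lower if w in SECOND_PERSON),
        "therefore_count": lower.count("therefore"),
        "i_count": words.count("I"),
    }
```

Counts that depend on word class, such as adverb, noun, or present
participle counts, would still require a tagger and are left out of this
sketch.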
We then use discriminant analysis, a technique from descriptive
statistics. Discriminant analysis takes a set of precategorized
individuals and data on their variation on a number of parameters, and
works out a set of discriminant functions which distinguish between the
groups. These functions can then be used to predict the category
memberships of new individuals based on their parameter scores
(Tatsuoka, 1971; Mustonen, 1965).
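To make the idea concrete, here is a minimal sketch of the two-group
case in the style of Fisher's linear discriminant: one linear function
whose weights come from the group means and the pooled within-group
covariance. It is not the procedure of any particular statistics
package; the function names, the midpoint threshold (which assumes equal
priors), and the toy data in the usage note are assumptions.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(groups, means):
    """Within-group covariance matrix pooled over all groups."""
    d = len(means[0])
    cov = [[0.0] * d for _ in range(d)]
    for group, m in zip(groups, means):
        for row in group:
            diff = [x - mu for x, mu in zip(row, m)]
            for i in range(d):
                for j in range(d):
                    cov[i][j] += diff[i] * diff[j]
    dof = sum(len(g) for g in groups) - len(groups)
    return [[v / dof for v in r] for r in cov]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_two_group_discriminant(group_a, group_b):
    """Return (weights, threshold) of a linear discriminant function."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    cov = pooled_covariance([group_a, group_b], [mean_a, mean_b])
    weights = solve(cov, [a - b for a, b in zip(mean_a, mean_b)])
    midpoint = [(a + b) / 2 for a, b in zip(mean_a, mean_b)]
    threshold = sum(w * x for w, x in zip(weights, midpoint))
    return weights, threshold


def classify(weights, threshold, row):
    """Assign a new individual to group "A" or "B" by its score."""
    score = sum(w * x for w, x in zip(weights, row))
    return "A" if score > threshold else "B"
```

With rows of parameter scores for two precategorized groups, for
instance `fit_two_group_discriminant([[10.0, 2.0], [12.0, 3.0], [11.0, 1.0], [13.0, 2.0]], [[2.0, 8.0], [3.0, 9.0], [1.0, 7.0], [2.0, 10.0]])`,
the returned weights and threshold can be passed to `classify` to place
a previously unseen row in one of the two groups.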
Evaluation
For data we used the Brown corpus of English text samples of uniform
length, categorized in several categories as seen in table 1. We ran
discriminant analysis on the texts in the corpus using several different
features as seen in table 2. We used the SPSS system for statistical
data analysis, which has as one of its features a complete discriminant
analysis (SPSS, 1990). The discriminant function extracted from the data
by the analysis is a linear combination of the parameters. To categorize
a set into N categories, N - 1 functions need to be determined. However,
if we are content with being able to plot all categories on a
two-dimensional plane, which probably is what we want to do, for ease of
exposition, we only use the two first and most significant functions.

Variable                        Range
Adverb count                    19 - 157
Character count                 7601 - 12143
Long word count (> 6 chars)     168 - 838
Preposition count               151 - 433
Second person pronoun count     0 - 89
"Therefore" count               0 - 11
Words per sentence average      8.2 - 53.2
Chars / sentence average        34.6 - 266.3
First person pronoun count      0 - 156
"Me" count                      0 - 30
Present participle count        6 - 101
Sentence count                  40 - 236
Type / token ratio              14.3 - 53.0
"I" count                       0 - 120
Character per word average      3.8 - 5.8
"It" count                      1 - 53
Noun count                      243 - 75:l
Present verb count              0 - 79
"That" count                    :1 - 72
"Which" count                   0 - 40
Table 2: Parameters for Discriminant Analysis

Category           Items   Errors
I. Informative     374     16 (4 %)
II. Imaginative    126      6 (5 %)
Total              500     22 (4 %)
Table 3: Categorization in Two Categories
2 categories 
In the case of two categories, only one function is necessary for
determining the category of an item. The function classified 478 cases
correctly and misclassified 22, out of the 500 cases, as shown in table
3 and figure 1.
4 categories 
Using the three functions extracted, 366 cases were correctly
classified, and 134 cases were misclassified, out of the 500 cases, as
can be seen in table 4 and figure 2. "Miscellaneous", the most
problematic category, is a loose grouping of different informative
texts. The single most problematic subset of texts is a subset of
eighteen non-fiction texts labeled "learned/humanities". Sixteen of them
were misclassified, thirteen as "miscellaneous".
[Figure 1: Distribution, 2 Categories. Plot of the cases in groups 1 and
2 along the single discriminant function (horizontal axis from -2.0 to
2.0), with the group centroids marked.]
Category          Items   Errors
2. Non-fiction    110      28 (25 %)
3. Fiction        126       6 (5 %)
4. Misc.          176      68 (47 %)
Total             500     134 (27 %)
Table 4: Categorization in Four Categories
[Figure 2: Distribution, 4 Categories. Territorial map of groups 1 to 4
on the first two discriminant functions (axes from -2.0 to 2.0); *
indicates a group centroid.]
15 (or 10) categories
Using the fourteen functions extracted, 258 cases were correctly
classified and 242 cases misclassified out of the 500 cases, as shown in
table 5. Trying to distinguish between the different types of fiction is
expensive in terms of errors. If the fiction subcategories were
collapsed there would only be ten categories, and the error rate for the
categorization would improve as shown in the "revised total" record of
the table. The "learned/humanities" subcategory is, as before,
problematic: only two of the eighteen items were correctly classified.
The others were most often misclassified as "Religion" or "Belles
Lettres".
Validation of the Technique 
It is important to note that this experiment does not claim to show how
genres in fact differ. What we show is that this sort of technique can
be used to determine which parameters to use, given a set of them. We
did not use a test set disjoint from the training set, and we do not
claim that the functions we had the method extract from the data are
useful in themselves. We discuss how well this method categorizes a set
of texts, given a set of categories, and given a set of parameters.
The error rates climb steeply with the number of categories tested for
in the corpus we used. This may have to do with how the categories are
chosen and defined. For instance, distinguishing between different types
of fiction by formal or stylistic criteria of this kind may just be
something we should not attempt: the fiction types are naturally defined
in terms of their content, after all.
The statistical technique of factor analysis can be used to discover
categories, as Biber has done. The problem with using automatically
derived categories is that even if they are in a sense real, meaning
that they are supported by data, they may be difficult to explain to the
unenthusiastic layman if the aim is to use the technique in retrieval
tools.
Other criteria that should be studied are second and higher order
statistics on the respective parameters. Certain parameters probably
vary more in certain text types than others, and they may have a skewed
distribution as well. This is not difficult to determine, although the
standard methods do not support automatic determination of standard
deviation or skewness as discrimination criteria. Together with the
investigation of several hitherto untried parameters, this is a next
step.
Readability Indexing 
Not unrelated to the study of genre is the study of readability, which
aims to categorize texts according to their suitability for assumed sets
of assumed readers. There is a wealth of formulae to compute
readability. Most commonly they combine easily computed text measures,
typically average or sampled average sentence length combined with
similarly computed word length, or incidence of words not on a specified
"easy word list" (Chall, 1948; Klare, 1963). In spite of Chall's
warnings about injudicious application to writing tasks, readability
measurement has naively come to be used as a prescriptive metric of good
writing as a tool for writers, and has thus come into some disrepute
among text researchers. Our small study confirms the basic findings of
the early readability studies: the most important factors of the ones we
tested are word length, sentence length, and different derivatives of
these two parameters. As long as readability indexing schemes are used
in descriptive applications they work well to discriminate between text
types.
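The ingredients that such readability formulae combine are simple to
compute. The sketch below computes only the raw measures named above
(average sentence length, average word length, and the share of words
not on an easy word list) and deliberately applies no published formula
or coefficients. The function name, the tokenization, and the tiny word
list in the usage note are assumptions for illustration.

```python
import re


def readability_ingredients(text, easy_words):
    """Return the raw measures that readability formulae typically combine."""
    # Crude sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words or not sentences:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "pct_not_easy": 0.0,
        }
    hard = [w for w in words if w not in easy_words]
    return {
        # Words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Percentage of running words missing from the easy word list.
        "pct_not_easy": 100.0 * len(hard) / len(words),
    }
```

A call such as `readability_ingredients("The cat sat. The physician deliberated.", {"the", "cat", "sat"})`
returns the three measures for a two-sentence text; used descriptively,
such values could serve as further parameters for the discriminant
analysis.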
Application 
The technique shows practical promise. The territorial maps shown in
figures 1, 2, and 3 are intuitively useful tools for displaying what
type a particular text is, compared with other existing texts. The
technique demonstrated above has an obvious application in information
retrieval, for picking out interesting texts, if content based methods
select a too large set for easy manipulation and browsing (Cutting et
al., 1992).
In any specific application area it will be unlikely that the text
database to be accessed will be completely free form. The texts under
consideration will probably be specific in some way. General text types
may be useful, but quite probably there will be a domain or
field-specific text typology. In an envisioned application, a user will
employ a cascade of filters starting with filtering by topic, and
continuing with filters by genre or text type, and ending with filters
for text quality, or other tentative finer-grained qualifications.
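The envisioned cascade can be sketched as a chain of predicates applied
in order. Everything named here is hypothetical: the record fields
(`topics`, `genre`, `quality`), the helper names, and the idea that
genre and quality labels are already attached to each text, for example
by a classifier of the kind described above.

```python
def cascade(texts, *filters):
    """Apply filters in order; each keeps only the texts it accepts."""
    for keep in filters:
        texts = [t for t in texts if keep(t)]
    return texts


# Hypothetical filter factories, one per stage of the cascade.
def about_topic(topic):
    """First stage: keep texts that treat the requested topic."""
    return lambda t: topic in t["topics"]


def of_genre(genre):
    """Second stage: keep texts assigned to the requested genre."""
    return lambda t: t["genre"] == genre


def min_quality(level):
    """Last stage: keep texts whose quality score reaches the level."""
    return lambda t: t["quality"] >= level
```

A call such as `cascade(docs, about_topic("whales"), of_genre("press"), min_quality(0.5))`
narrows a topic-based selection first by genre and then by quality, in
the order the paragraph above describes.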
The IntFilter Project 
The IntFilter Project at the departments of Computer and Systems
Sciences, Computational Linguistics, and Psychology at Stockholm
University is at present studying texts on the USENET News conferencing
system. The project at present studies texts which appear on several
different types of USENET News conferences, and investigates how well
the classification criteria and categories that experienced USENET News
users report using (IntFilter, 1993) can be used by a newsreader system.
To do this the project applies the method described here. The project
uses categories such as "query", "comment", "announcement", "FAQ", and
so forth, categorizing them using parameters such as different types of
length measures, form word content, quote level, percentage quoted text
and other USENET News specific parameters.
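Two of the parameters just mentioned, quote level and percentage of
quoted text, could be computed along the following lines. This is a
sketch under a stated assumption: quoted lines are taken to be those
beginning with one or more ">" markers, a common convention in news
articles that is not specified in this paper. The function names are
likewise hypothetical.

```python
def quote_level(line):
    """Count leading '>' markers, ignoring spaces between them."""
    level = 0
    for ch in line:
        if ch == ">":
            level += 1
        elif ch != " ":
            break
    return level


def quote_metrics(article_body):
    """Return percentage of quoted lines and deepest quote level."""
    # Blank lines carry no text and are left out of the percentage.
    lines = [ln for ln in article_body.splitlines() if ln.strip()]
    if not lines:
        return {"pct_quoted": 0.0, "max_quote_level": 0}
    levels = [quote_level(ln) for ln in lines]
    quoted = sum(1 for lv in levels if lv > 0)
    return {
        "pct_quoted": 100.0 * quoted / len(lines),
        "max_quote_level": max(levels),
    }
```

The resulting values can be added to the parameter set alongside length
measures before running the discriminant analysis.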
Acknowledgements 
Thanks to Hans Karlgren, Gunnel Källgren, Geoff Nunberg, Jan Pedersen,
and the Coling referees, who all have contributed with suggestions and
methodological discussions.
Category                               Items   Errors        Miss
A. Press: reportage                     44      11 (25 %)    F
B. Press: editorial                     27       8 (30 %)    A
C. Press: reviews                       17       4 (24 %)    B
D. Religion                             17       8 (47 %)    G
E. Skills and Hobbies                   36      17 (47 %)    J
F. Popular Lore                         48      32 (67 %)    ¢,1~:
G. Belles Lettres, Biographies etc.     75      49 (65 %)    D,B,A
H. Government documents & misc.         30       9 (30 %)    J
J. Learned                              80      32 (40 %)    H,D,G,F
K. General Fiction                      29      16 (55 %)    fiction
L. Mystery                              24      12 (50 %)    -"-
M. Science Fiction                       6       1 (17 %)    -"-
N. Adventure and Western                29      18 (62 %)    -"-
P. Romance                              29      22 (76 %)    -"-
R. Humor                                 9       3 (33 %)    -"-
Total                                  500     242 (48 %)
Fiction (From previous table)          126       6 (5 %)
Revised total                          500     178 (35 %)
Table 5: Categorization in 15 Categories
[Figure 3: Distribution, 15 Categories. Territorial map of the Brown
categories, labeled by category letter, on the first two discriminant
functions; * indicates a group centroid.]
References
Douglas Biber 1989. "A typology of English texts", Linguistics,
27:3-43.
Jeanne S. Chall 1948. Readability, Ohio State Univ.
Kenneth Church 1988. "A Stochastic Parts of Speech and Noun Phrase
Parser for Unrestricted Text", Procs. 2nd ANLP, Austin.
Douglass Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun 1992.
"A Practical Part-of-Speech Tagger", Procs. 3rd ANLP, Trento.
Douglass Cutting, D. Karger, Jan Pedersen, and John Tukey 1992.
"Scatter/Gather: A Cluster-based Approach to Browsing Large Document
Collections", Procs. SIGIR '92.
IntFilter 1993. Working Papers of the IntFilter Project, available by
gopher from dsv.su.se:/pub/IntFilter.
George R. Klare 1963. The Measurement of Readability, Iowa Univ. Press.
W. N. Francis and F. Kučera 1982. Frequency Analysis of English Usage,
Houghton Mifflin.
Seppo Mustonen 1965. "Multiple Discriminant Analysis in Linguistic
Problems", Statistical Methods in Linguistics, 4:37-44.
M. M. Tatsuoka 1971. Multivariate Analysis, New York: John Wiley & Sons.
Atro Voutilainen and Pasi Tapanainen 1993. "Ambiguity resolution in a
reductionistic parser", Procs. 6th European ACL, Utrecht.
SPSS 1990. The SPSS Reference Guide, Chicago: SPSS Inc.
