A SYSTEM FOR CREATING AND
MANIPULATING GENERALIZED WORDCLASS
TRANSITION MATRICES FROM LARGE
LABELLED TEXT-CORPORA
Wilfried Bloemberg
Institute of Phonetics 
University of Nijmegen 
P.O. Box 9103 
6500 HD Nijmegen 
The Netherlands 
Michael Kesselheim
Institut für Allgemeine Elektrotechnik und Akustik
Ruhr-Universität Bochum
Universitätsstrasse 150
D-4630 Bochum
West-Germany 
ABSTRACT 
This paper deals with the training phase of a Markov-type
linguistic model that is based on transition probabilities
between pairs and triplets of syntactic categories. To
determine the optimal level of detail for a set of syntactic
classes we developed a system that uses a set-theoretical
formalism to define such sets and has some measures to
compare and optimize them individually.
In section two we describe the optimization problem (in
terms of prediction, information and economy requirements)
and our approach to its solution. Section three introduces the
system that will assist a linguist in handling the prediction
and economy criteria and in the last section we present some
sample results that can be achieved with it.
1. INTRODUCTION
The context in which we started developing the system
described in this paper is the ESPRIT project #860,
'Linguistic Analysis of the European Languages', which deals
with seven European languages.
The main objective of the project is to provide a language
independent software environment for dealing with the
linguistic phase of a number of applications in the realm of
office automation such as high quality, natural sounding
text-to-speech conversion for unlimited vocabularies,
automatic speech recognition for large vocabularies, and
omni-font optical character reading including automatic
reading of handwriting.
The decision on what type of linguistic model to be used
in the project was made at an early stage. It was decided to
aim at a probabilistic positional grammar (a Markov-type
grammar) based on transition probabilities of pairs and
triplets of syntactic categories. The use of Markov-type
models immediately incurs the necessity of defining training
texts.
We started out with training corpora of approximately
100,000 words of official EEC publications, that were
available in all languages of the community. The training
consists of building a number of data structures. The first is
a lexicon of all words that occur in the text, with their
attendant probability of occurrence and all possible
wordclasses. The second structure is formed by two and three
dimensional matrices describing the transition probabilities
between pairs or triplets, respectively, of wordclasses.
Clearly, the probabilities specified depend on the choice of
syntactic categories along the dimensions. One of the major
problems with a Markovian approach is to determine the
optimal level of detail of the wordclasses for each dimension.
In this paper we will describe a software system that helps
linguists in carrying out experiments aimed at finding an
'optimal' system of wordclasses.
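The two training structures described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `train`, the toy corpus and the single-letter labels are invented for illustration, and sentence boundaries are ignored for brevity.

```python
from collections import Counter, defaultdict

def train(tagged_words):
    """Build a lexicon and pair-transition probabilities.

    tagged_words: list of (word, wordclass) pairs in text order.
    """
    total = len(tagged_words)
    word_counts = Counter(word for word, _ in tagged_words)
    classes = defaultdict(set)
    for word, wordclass in tagged_words:
        classes[word].add(wordclass)
    # Lexicon entry: (probability of occurrence, possible wordclasses).
    lexicon = {w: (n / total, classes[w]) for w, n in word_counts.items()}

    labels = [wordclass for _, wordclass in tagged_words]
    pair_counts = Counter(zip(labels, labels[1:]))
    first_counts = Counter(labels[:-1])
    # P(second | first), estimated as a relative frequency.
    transitions = {
        (a, b): n / first_counts[a] for (a, b), n in pair_counts.items()
    }
    return lexicon, transitions

# Invented toy data, not taken from the project corpora.
corpus = [("the", "D"), ("old", "A"), ("man", "N"),
          ("the", "D"), ("man", "N"), ("man", "V")]
lexicon, transitions = train(corpus)
```

The three-dimensional case follows the same pattern with triples of consecutive labels instead of pairs.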
2. MARKOW ANALYSIS OF LARGE CORPORA 
AND WORDCLASS SYSTEMS 
The problem of finding a suitable wordclass set for statistical
disambiguation of syntactic labelling may be formulated
more precisely and formally as follows:
Find a set of wordclass labels (with gross wordclass and
complex information) that can label each word of a language and
1. is minimal in the number of labels (economy
requirement)
2. provides high predictive power for adjacent
wordclasses in a chain. A formal way to do this is by
minimizing the average entropy of N-dimensional transition
probabilities for subsequent labels in sentences, e.g.
reduced to the two-dimensional case, to minimize:
$$E = -\sum_{j=1}^{n} \sum_{i=1}^{n} P(t_j \mid t_i) \, \log P(t_j \mid t_i)$$
with:
$\sum$ : summation symbol
$n$ : number of labels in the system
$i, j$ : indices running from 1 to $n$
$P(a \mid b)$ : conditional probability of 'a' given 'b'
(prediction requirement) 
3. is maximal in the amount of information about each
labelled word, e.g. for syntactic analysis or
disambiguation of alternative graphemic hypotheses.
(information requirement)
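The entropy in requirement 2 can be computed from a table of pair counts, as in the following hedged sketch. It follows the two-dimensional formula as printed (an unweighted double sum); the base-2 logarithm, the function name `transition_entropy` and the toy counts are assumptions, since the text does not fix them.

```python
import math

def transition_entropy(counts):
    """Entropy of pair transitions, following the printed formula.

    counts[(i, j)] is the number of times label j followed label i.
    Zero cells are simply absent, so 0 * log 0 never has to be evaluated.
    """
    row_totals = {}
    for (first, _), n in counts.items():
        row_totals[first] = row_totals.get(first, 0) + n
    entropy = 0.0
    for (first, _), n in counts.items():
        p = n / row_totals[first]      # P(t_j | t_i)
        entropy -= p * math.log2(p)    # log base 2 is an assumption
    return entropy

# Invented toy counts: 'D' is followed by 'N' or 'A' equally often,
# 'A' is always followed by 'N'.
counts = {("D", "N"): 2, ("D", "A"): 2, ("A", "N"): 4}
entropy = transition_entropy(counts)
```

In this toy case only the row for 'D' contributes, because the successor of 'A' is fully predictable.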
To find an exact solution to this problem is difficult - if
not impossible, because of
- the dimensionality of the optimization problem (given the
large number of wordclasses needed to obtain useful
parsing results)
- the difficulty to define a unique starting set of
wordclasses for an optimization
- the dependence of a possible finite solution on the
analysed corpus
Our approach to this problem is to start from a very
detailed hierarchical wordclass system including complex
information. The degree of detail can be reduced by means of
the notion of "cover symbols" that form partitionings of the
original system. Cover symbols and wordclasses not
accounted for by cover symbols are called 'labels'. Initially,
cover symbols will be created by combining wordclass
symbols for related classes - e.g. the classes "verb, 1. person
singular indicative present active" and "verb, 1. person
singular conjunctive present active" giving a cover symbol
"verb, 1. person singular present active". At a later stage other
cover symbols can be created by combining and excluding
wordclass symbols and already existing cover symbols. In
the optimization process different sets of labels are created
subsequently and compared by measures related to either of
the criteria mentioned.
A user working in the optimization process needs
measures to compare the significance of individual labels within a
given set and to estimate the usefulness of joining labels into
new, more comprehensive cover symbols. As one measure
for criterion two we use the entropy directly in a global and
diagnostic way. Additionally a number of measures have
been defined that are related to entropy and give more
specific information on the performance of individual labels.
Given a text in which to each word a label has been
assigned that is:
1. the basic wordclass, if this has not been defined as
belonging to a cover symbol
2. the applicable cover symbol otherwise
and given a 2D-matrix that contains relative frequencies
of transitions from any label (wordclass or cover symbol) to
any other label in the text, then some useful measures are
- the branching factor for a given label, that tells how many
different labels actually followed/preceded it in an
analysed text.
- the variance of the transition probabilities in a
row/column of the matrix, that indicates how much the strength
of connections from the label to surrounding labels varies
as analysed from a text.
- the correlation between different rows/columns of the
matrix, that gives information about how similarly the
labels behave in a general right/left context, i.e. how
much information will be lost by combining two labels
into a new cover symbol.
- the relative frequency of a given label, that indicates the
relative labelling relevance within a given system.
The measures defined here for a 2D-matrix can be
applied to a 3D-matrix in a similar way, e.g. the correlation
between two labels in the same matrix dimension then means
correlating the numbers of two planes.
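Three of the measures listed above (branching factor, variance and row correlation) can be sketched for the 2D case as follows. The dictionary representation, the function names and the toy matrix are assumptions made for illustration; the paper does not specify how the measures are implemented, and the correlation used here is the ordinary Pearson coefficient.

```python
import math

def row(matrix, label, labels):
    """Transition probabilities from `label` to every label, zeros included."""
    return [matrix.get((label, other), 0.0) for other in labels]

def branching_factor(matrix, label, labels):
    """Number of different labels that actually followed `label`."""
    return sum(1 for value in row(matrix, label, labels) if value > 0)

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def correlation(xs, ys):
    """Pearson correlation of two rows; 0.0 if either row is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Invented toy matrix: 'A' and 'B' are both always followed by 'N'.
labels = ["A", "B", "N"]
matrix = {("A", "N"): 1.0, ("B", "N"): 1.0, ("N", "A"): 0.5, ("N", "B"): 0.5}
row_a, row_b = row(matrix, "A", labels), row(matrix, "B", labels)
```

Here the rows for 'A' and 'B' are identical, so their correlation is 1 and joining them into one cover symbol would lose no information about the right context.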
3. EMMA: AN EDITOR FOR MATRICES FROM
MARKOW ANALYSIS
In order to assist linguists in their task of designing an
optimal set of wordclasses we designed a tool called EMMA:
Editor for Matrices from Markov Analysis. The most
important design considerations in implementing the system are:
1. [...] develop cover symbol sets, analyse matrices and [...]
2. [...] extended help is available at every point
3. [...] it is also a tool for experienced users: they can
create their own command files by themselves or use the
logging facility.
EMMA is split into two logical parts, though they are
closely related. In the first part a user can create a set of
cover symbols. A set-theoretical formalism has been defined
for specifying cover symbols in a hierarchical way:
recursively sets of labels may be put into lists, then such lists be
excluded from other lists to specify the final set of
wordclasses contained in a certain cover symbol (see appendix for
notation).
Labels can be defined for each dimension (called
"scope") of a transition matrix separately, i.e. one can define
a specific cover symbol only for e.g. the first position in a
transition pair or triple. After a set of cover symbols has
been defined a consistency check is made, to ensure that no
wordclass symbol belongs to more than one cover symbol.
A set of cover symbol definitions is called a "mapping".
A mapping has to be consistent but not necessarily complete,
i.e. not every wordclass must belong to some cover symbol.
Different sets of mappings can be merged together as long as
they stay consistent.
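The consistency requirement on mappings can be made concrete with a short sketch. A mapping is represented here as a dictionary from cover symbol to a set of wordclasses; this representation, the function names and the example groupings are assumptions, not a description of EMMA's internals.

```python
def check_consistent(mapping):
    """Return {wordclass: cover symbol}; raise if a wordclass has two owners."""
    owner = {}
    for cover, members in mapping.items():
        for wordclass in members:
            if wordclass in owner:
                raise ValueError(
                    f"{wordclass} belongs to both {owner[wordclass]} and {cover}"
                )
            owner[wordclass] = cover
    return owner

def merge(first, second):
    """Merge two mappings, refusing any result that is inconsistent."""
    for cover in first.keys() & second.keys():
        if first[cover] != second[cover]:
            raise ValueError(f"{cover} is defined differently in the two mappings")
    merged = {**first, **second}
    check_consistent(merged)
    return merged

# Hypothetical groupings; the cover symbol names are for illustration only.
verbs = {"ZVERB": {"V0001T..##", "V0043T..##"}}
nouns = {"ZNOUN": {"N00..S.F##"}}
clash = {"ZMISC": {"V0001T..##"}}
owner = check_consistent(merge(verbs, nouns))
```

A mapping need not be complete: a wordclass that is absent from `owner` simply keeps its own name as its label, e.g. `owner.get("D00##N.F##", "D00##N.F##")`.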
In the second part of the system a user can create and
manipulate transition probability matrices with the help of a
mapping. Matrices can be created from labelled text: in this
case the system will subsume wordclasses to their respective
cover symbols, and wordclasses not belonging to any cover
symbol will extend the matrix. In this way the analysed text
is not restricted with respect to the number of wordclasses. A
second way to create matrices is from calculation on other
matrices. Cover symbols can be defined interactively, and the
new matrix belonging to the new mapping can be computed.
To handle these matrices a data structure has been designed,
based on the sparseness of the matrices. It fulfils two
requirements: it is sufficiently fast for retrieval of data in an
interactive environment and it can manipulate extremely large
matrices (largest so far 750 x 750 x 750).
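One simple way to exploit sparseness is to store only the non-zero cells, keyed by label tuples. The sketch below builds such a structure from a labelled text while subsuming wordclasses under their cover symbols. The dictionary-based storage is an assumption; the paper does not describe the actual data structure, and the labels and the cover symbol 'ZDET' are invented.

```python
from collections import Counter

def build_sparse_matrix(wordclasses, owner, order=3):
    """Count label n-tuples, storing only cells that actually occur.

    wordclasses: wordclass symbols in text order.
    owner: {wordclass: cover symbol}; unmapped wordclasses keep their own
           name and so extend the matrix with a label of their own.
    order: 2 for a pair matrix, 3 for a triple matrix.
    """
    labels = [owner.get(wc, wc) for wc in wordclasses]
    return Counter(
        tuple(labels[k:k + order]) for k in range(len(labels) - order + 1)
    )

# Hypothetical labelled text and mapping.
text = ["D00", "A00", "N00", "P00", "D00", "N00"]
owner = {"D00": "ZDET"}
matrix = build_sparse_matrix(text, owner)
```

For comparison, a dense 750 x 750 x 750 array would have 421,875,000 cells, whereas a structure of this kind grows only with the number of distinct triples observed.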
[...] of cover symbols and matrices in addition to the
computation of the measures related to entropy. For these
purposes the system includes a powerful mechanism to access
matrices and related mappings for analysis and editing. One
may take a number of labels from a dimension of a matrix,
declare them a set with a new name and define a submatrix by
specifying such sets in the different dimensions. This
submatrix may then be processed selectively by display, statistic,
change and quantization procedures.
In the statistics part information on sparseness and the
highest and lowest transition probabilities in matrices or
submatrices may be gathered. Correlations of transition
frequencies between labels may be calculated for a certain numerical
range only. List, change and quantization
commands may be specified for a numerical range of frequencies
in the submatrix. This ensures that one may access certain
"frequency layers" in the matrix, which is an essential
operation for viewing very large matrices with only a few percent
of the entries non-zero.
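Selecting a "frequency layer" of a submatrix can be sketched as a filter over the non-zero cells. The function name, the `scopes` argument and the toy counts below are illustrative assumptions; they are not EMMA commands.

```python
def frequency_layer(matrix, low, high, scopes=None):
    """Return the cells whose count lies in the closed range [low, high].

    matrix: dict mapping label tuples to counts (sparse storage).
    scopes: optional list with one entry per dimension; each entry is a
            set of admitted labels, or None to admit every label.
    """
    selected = {}
    for key, count in matrix.items():
        if not low <= count <= high:
            continue
        if scopes is not None and any(
            allowed is not None and label not in allowed
            for label, allowed in zip(key, scopes)
        ):
            continue
        selected[key] = count
    return selected

# Invented counts for a triple matrix.
matrix = {
    ("C00", "D00", "N00"): 120,
    ("C00", "A00", "N00"): 3,
    ("P00", "D00", "N00"): 45,
}
layer = frequency_layer(matrix, 10, 200, scopes=[{"C00"}, None, None])
```

Because only stored cells are visited, the cost depends on the number of non-zero entries rather than on the nominal size of the matrix.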
If a user eventually finds that the labels in one
dimension of a submatrix could be included into a new cover
symbol, he/she may specify this directly and the overall matrix
together with its mapping will be transformed into a new
smaller one. Different matrices may be merged as long as the
related mappings are compatible in an analytic sense: cover
symbols in one mapping must be either disjunct from the
ones in the other mapping or in subset relation.
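Both operations in this paragraph have a direct set-theoretic reading, sketched below on count matrices. The function names and toy data are assumptions. Counts rather than probabilities are collapsed, on the assumption that probabilities would be re-estimated from the summed counts afterwards.

```python
def collapse(matrix, scope, members, cover):
    """Replace every label in `members` by `cover` in one dimension.

    Cells that become identical are summed, so the result is smaller.
    scope is a zero-based dimension index.
    """
    result = {}
    for key, count in matrix.items():
        if key[scope] in members:
            key = key[:scope] + (cover,) + key[scope + 1:]
        result[key] = result.get(key, 0) + count
    return result

def compatible(first, second):
    """Cover symbols must be pairwise disjoint or in subset relation."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for a in first.values()
        for b in second.values()
    )

# Invented pair matrix with two verb labels in the first scope.
matrix = {("V1", "N"): 2, ("V2", "N"): 3, ("N", "V1"): 1}
smaller = collapse(matrix, 0, {"V1", "V2"}, "ZV")
```

Note that 'V1' in the second position is left untouched, since cover symbols are defined per scope.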
4. SOME EXAMPLE RESULTS
The partners within the consortium have just started the
development of the optimal wordclass systems. Therefore, in
this paper we will restrict ourselves to the presentation of a
small number of examples that should convey the flavour of
the kind of information that can be derived with the system.
The data in the examples are derived from an office text
in German ([?]0,000 words) and the same text in Dutch
(100,000 words) labelled with the ESPRIT-wordclass system
(ca. 250 wordclasses for German and 104 for Dutch were
actually used). The symbols used in the examples can be
interpreted as:
'P': preposition, 'D': determiner,
'N': noun, 'A': adjective,
'C': conjunction, 'B': [illegible],
'M02': date
'#': the subclass cannot be specified for the wordclass
in question
'.': the subclass is specifiable, but has not been
specified
Example 1:
If a user works on a 3D-matrix with the matrix editor and
considers inclusion of all conjunctions into one cover symbol
in the first scope, but wants to leave the most frequent labels
out, he/she will look e.g. at a part of the matrix by a
command
DISPLAY C.........;;
which will give a display of only those parts of the matrix
where a conjunction stands in the first position of the Markov
chain.
Let us assume that the most frequent labels are
C00#######, C02..##### and 'all labels C01 but without
C01..#####', then he/she could define the cover symbol
'ZCON' for scope 1 in the following way:
ZCON  = C......... ! _ZCEX;
_ZCEX = ( C00#######, C02#######,
          C01....... ! C01..##### );
with: '()' the list operator
'!' the exception operator
'_ZCEX' a local name
With the help of this new cover symbol we can transform
the matrix accordingly.
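The set arithmetic behind the ZCON definition in Example 1 can be traced with a few lines of Python. Two things here are assumptions: that '.' in a pattern matches any character at that position, and the small wordclass inventory, which is invented. The list operator is read as set union and '!' as set difference.

```python
def matches(pattern, wordclass):
    """Assumed semantics: '.' in the pattern matches any single character."""
    return len(pattern) == len(wordclass) and all(
        p == "." or p == c for p, c in zip(pattern, wordclass)
    )

def expand(pattern, inventory):
    """All wordclasses of the inventory covered by the pattern."""
    return {wc for wc in inventory if matches(pattern, wc)}

# Hypothetical inventory of ten-character wordclass symbols.
inventory = {
    "C00#######", "C02#######", "C01..#####", "C01SU#####",
    "C01..S.F##", "C03#######", "D00##N.F##",
}

# _ZCEX = ( C00#######, C02#######, C01....... ! C01..##### );
zcex = (
    expand("C00#######", inventory)
    | expand("C02#######", inventory)
    | (expand("C01" + "." * 7, inventory) - expand("C01..#####", inventory))
)
# ZCON = C......... ! _ZCEX;
zcon = expand("C" + "." * 9, inventory) - zcex
```

With this inventory, ZCON ends up containing the conjunction labels that were not excluded as most frequent, while the determiner label is never touched.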
Example 2:
Listing of two most frequent wordclass
triples within German corpus
......................................
D00##N.F## A00.....## N00..S.F## 660
P00####### D00##N.F## N00..S.F## 1310
This is the well-known determiner-adjective-noun phrase
and the preposition-determiner-noun phrase. The numbers
indicate the frequency with which the triples occur in the
training text.
Example 3: Statistics
Some symbols in first position of a chain
.........................................
symbol      scope  relfreq  branching  stddev
                            factor
A17.....##  1      0.00006  0/1        0.030612
B09#######  1      0.00399  0/28       0.238650
C00#######  1      0.02771  0/105      1.298851
D01##S.M##  1      0.00260  0/17       0.348807
The very low standard deviation of the label A17.....## casts
considerable doubt upon its significance; it will probably be
included into a cover symbol. The label C00#######, on the
other hand, will probably deserve to be given a class of its
own.
Example 4:
Correlations between symbols in scope 1 
....................................... 
V0001T..## V0043T..## 0.000 
V00.0...## V29.0...## 0.838 
M02####### B02####### 0.908 
The labels M02####### and B02####### have a high
correlation and are therefore candidates to be put into the
same cover symbol. But before doing this one has to
determine the significance of such an operation by checking the
standard deviation, branching factor and the relative
frequency. Also the third criterion as defined in section two has
to be taken into account.
Example 5: 
Entropy of symbols in scope 1 derived 
from the Dutch corpus 
..................................... 
ZVERB   2.675
ZNOUN   2.371
ZADJEC  1.830
ZADVER  2.609
ZPRONO  1.799
ZPREP   1.870
ZCONJ   2.481
ZMISCE  2.564
This table has been derived from the Dutch corpus after
definition of cover symbols for the main word classes. The
entropies of these cover symbols are low compared to the
maximum we encountered. Certainly this set of cover
symbols is too small to fulfill the information requirement for e.g.
disambiguation of alternative graphemic forms.
APPENDIX I: SYNTAX OF COVER SYMBOL
DEFINITIONS
The grammar is in BN-form, where:
'{}' means optionality,
'|' alternative,
'<' and '>' nonterminal,
informal descriptions are between double quotes.
<Definition>    = <CS-notation> '=' <CS> ';' |
                  <CSA-notation> '=' <CS> ';'
<CS>            = <Symbol list> {'!' <Symbol list>}
<Symbol list>   = <Prim> | '(' <Primlist> ')'
<Primlist>      = <Prim> | <Primlist> ',' <Primlist>
<Prim>          = <CS> | <WCL-notation> |
                  <CSA-notation> |
                  <CS-constraint>
<CSA-notation>  = '_' <CS-notation>
<CS-notation>   = "valid cover symbol notation"
<WCL-notation>  = "valid wordclass symbol notation"
<CS-constraint> = "constraint use of CS-notation"
In order to support order in the cover symbol definitions
cover symbols that are to be included into other cover
symbols (i.e. they have only auxiliary function, but will not occur
in a map) are notated differently from cover symbols that
will occur in a map: auxiliaries have a name preceded by a
'_'.
Additional notations are used in a textual definition to
specify the scope for subsequently defined cover symbols.
Cover symbol definition files may include other cover
symbol definition files by a C-like "#include" command,
with the following constraints:
- definitions are not allowed to be directly or indirectly
recursive.
- cover symbols used in the map can only be excluded
from other cover symbols (not included, otherwise the
mapping would be inconsistent). This gives the
constraint use of cover symbol notations within a cover
symbol definition. E.g. in an expression Z1 =
<exp1>!(<exp2>!<exp3>), the cover symbol set becomes
inconsistent, if another cover symbol Z2 occurs included
in <exp1> or <exp3>.
- cover symbols occurring on the right side of a definition
must be defined in the same file.
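A definition in this notation can be read by a small recursive-descent parser, sketched below in Python. It is an illustration, not the EMMA parser. The left-recursive <Primlist> rule is rewritten as a comma-separated sequence, each list element is parsed as a <CS>, the token rules are assumptions, and the tuple-based syntax tree is invented for this sketch.

```python
import re

PUNCT = set("=!(),;")

def parse_definition(text):
    """Parse one '<name> = <CS> ;' definition into a nested tuple tree."""
    tokens = re.findall(r"[=!(),;]|[^\s=!(),;]+", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def cs():
        # <CS> = <Symbol list> {'!' <Symbol list>}, read left to right.
        node = symbol_list()
        while peek() == "!":
            take("!")
            node = ("except", node, symbol_list())
        return node

    def symbol_list():
        # <Symbol list> = <Prim> | '(' <Primlist> ')'
        if peek() == "(":
            take("(")
            items = [cs()]
            while peek() == ",":
                take(",")
                items.append(cs())
            take(")")
            return ("list", items)
        tok = take()
        if tok in PUNCT:
            raise SyntaxError(f"unexpected {tok!r}")
        return ("name", tok)

    name = take()
    if name in PUNCT:
        raise SyntaxError(f"unexpected {name!r}")
    take("=")
    body = cs()
    take(";")
    if pos != len(tokens):
        raise SyntaxError("trailing input after ';'")
    return name, body

tree = parse_definition("ZCON = C......... ! _ZCEX;")
```

The resulting tree distinguishes plain names, lists and exceptions; checking the constraints listed above (no recursion, no inclusion of map-level cover symbols) would be a separate pass over such trees.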
[Figure: INFORMATION FLOW IN THE EMMA MARKOW ANALYSIS SYSTEM.
Processes: ANALYSE TEXT, EDIT MATRIX, TEXT EDITOR, GENERATE
INITIAL MAPPING FILE. Files: verticalized & labeled text,
matrix file, mapping file, 2nd matrix, 2nd map, improved
matrix file, improved mapping file, derived cover symbol file,
cover symbol definition file, initial mapping file.]
