Unsupervised Learning of a Rule-based Spanish Part of 
Speech Tagger 
Chinatsu Aone and Kevin Hausman
Systems Research and Applications Corporation (SRA)
4300 Fair Lakes Court
Fairfax, VA 22033
}Lo I1 cc((i\]Sl'{i,. (:o i i i ~ h ~1,1 is i / 1311 ((1~ s l-iL co l i l 
Abstract 
This paper describes a Spanish Part-of-Speech (POS) tagger which
applies and extends Brill's algorithm for unsupervised learning of
rule-based taggers (Brill, 1995). First, we discuss our general
approach, including extensions we made to the algorithm in order to
handle unknown words and parameterize learning and tagging options.
Next, we report and analyze our experimental results using different
parameters. Then, we describe our "hybrid" approach, which was
necessary in order to overcome a fundamental limitation in Brill's
original algorithm. Finally, we compare our tagger with Hidden Markov
Model (HMM)-based taggers.
1 Introduction
We have developed a Spanish Part-of-Speech (POS) Tagger which applies
and extends Brill's algorithm for unsupervised learning (Brill, 1995)
to create a set of rules that reduce the ambiguity of POS tags on
words. We have chosen an unsupervised learning algorithm because it
does not require a large POS-tagged training corpus. Since there was
no POS-tagged Spanish corpus available to us and since creating a
large hand-tagged corpus is both costly and prone to inconsistency,
the decision was also a practical one. We have decided to develop a
rule-based tagger because such a tagger learns a set of declarative
rules and also because we wanted to compare it with Hidden Markov
Model (HMM)-based taggers.
We extended Brill's algorithm in several ways. First, we extended it
to handle unknown words in the training and test texts. Second, we
parameterized learning and tagging options. Finally, we experimented
with a "hybrid" solution, where we used a very small number of
hand-disambiguated texts during training to overcome a fundamental
limitation in the learning algorithm.
2 Components
Our Spanish POS tagger consists of three components: the Initial
State Annotator, the Learner, and the Rule Tagger, each of which is
described below.
2.1 Initial State Annotator 
This component is used to assign all possible POS tags to a given
Spanish word. It consists of lexicon lookup, morphological analysis,
and unknown word handling. The Spanish POS tag set used in this work
consists of the following tags: ADJ, ADV, BE (form of ser or estar),
CLOCK-TIME, COLON, COMMA, CONJ, DATE, DET, HAVE (form of haber),
HYPHEN, LETTER, LPAREN, MODEL, MULTIPLIER, N, NUMBER, P, PERIOD,
PREFIX, PRO, PROPN, QUESMARK, QUOTE, ROMAN, RPAREN, SEMICOLON, SLASH,
SUBCONJ, SUFFIX, THERE (haber used in "there" constructions),
WHDET ([...]), WHNP (que), WHPP (donde), and V (see Table 3).
2.1.1 Lexicon Lookup and Morphological Analysis
Unlike Brill's English tagger experiment described in (Brill, 1995),
no large POS-tagged Spanish corpus was available to us from which a
large lexicon could be derived. As a result, we decided to parse the
on-line Collins Spanish-English Dictionary¹ and derived a large
lexicon from it (about 45,000 entries). We used only the open-class
entries from this lexicon, and then augmented it with irregular verb
forms and a number of closed-class words. Our morphological analyzer
uses a set of rewrite rules to strip off and/or modify word endings
to find root forms of words.
2.1.2 Unknown Word Handling
Since the lexicon and morphological analysis will not cover every
single word that can appear in a text, an attempt is made at this
stage to classify unknown words. Any word which did not get assigned
one or more parts-of-speech in the lookup/morphology phase is
examined for certain traits often indicative of particular
parts-of-speech. This task is similar to what was done by the
guessers for the HMM-based French and German taggers (Chanod and
Tapanainen, 1995; Feldweg, 1995).
¹ We have obtained a license to the dictionary.
For example, words ending in the letters 
"mente" are assigned the tag of ADV (adverb). 
Those words ending in "ando" or "endo" are as- 
signed the tag V-CONTINUOUS-NIL (continuous 
form of the verb). Table 1 shows a list of unknown 
word handling rules. 
Table 1: Unknown Word Handling Rules
Heuristics                  POS tag
num > 1600 & < 2100         DATE
roman numeral 1-9           ROMAN
-ando, -endo                V-CONTINUOUS-NIL
-ido, -ado, -ida, -ada      V-PERFECT-NIL
-er, -ir, -ar               V-NIL-NIL
-erse, -irse, -arse         V-NIL-NIL-CLITIC
-ción, -idad, -izaje        N
-mente                      ADV
-able                       ADJ
capitalized                 PROPN
Performing these simple checks reduces the number of unknowns in our
test set of 17,639 words from 737 (4.2%) to 158 (0.9%). The remaining
unknowns are assigned a set of ambiguous open-class tags of N, V,
ADJ, and ADV so that they can be disambiguated by the Learner.
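The heuristics in Table 1 can be sketched as a small guesser. This is only an illustration, not the system's actual code: the function name `guess_tags`, the order in which the checks are tried (longer endings first, capitalization last), and the regular expression for roman numerals are all assumptions, since the paper gives the rules but not their precedence.

```python
import re

# Suffix heuristics transcribed from Table 1. The ordering is an
# assumption: longer, more specific endings are tried first.
SUFFIX_RULES = [
    (("erse", "irse", "arse"), "V-NIL-NIL-CLITIC"),
    (("ando", "endo"), "V-CONTINUOUS-NIL"),
    (("ido", "ado", "ida", "ada"), "V-PERFECT-NIL"),
    (("ción", "idad", "izaje"), "N"),
    (("mente",), "ADV"),
    (("able",), "ADJ"),
    (("er", "ir", "ar"), "V-NIL-NIL"),
]
OPEN_CLASS = {"N", "V", "ADJ", "ADV"}
# Roman numerals 1-9 (hypothetical pattern; the paper does not give one).
ROMAN = re.compile(r"^(?:I{1,3}|IV|V|VI{1,3}|IX)$")


def guess_tags(word: str) -> set:
    """Guess POS tags for a word that lexicon lookup and morphology missed."""
    if word.isdigit() and 1600 < int(word) < 2100:
        return {"DATE"}
    if ROMAN.match(word):
        return {"ROMAN"}
    lower = word.lower()
    for suffixes, tag in SUFFIX_RULES:
        if lower.endswith(suffixes):  # str.endswith accepts a tuple
            return {tag}
    if word[:1].isupper():
        return {"PROPN"}
    # Still unknown: leave all open-class tags for the Learner to reduce.
    return set(OPEN_CLASS)
```

For instance, `guess_tags("rápidamente")` yields `{"ADV"}`, while a word matching none of the heuristics keeps the four open-class tags.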
2.2 Learner 
The Learner takes as input ambiguously tagged 
texts produced by the Initial State Annotator, and 
tries to learn a set of rules that will reduce the 
ambiguity of the tags. Output is a file of rules in 
the following form: 
context = C : P1 | ... | Pi | ... | Pn → Pi, where
context is one of PREVWORD, NEXTWORD, PREVTAG or NEXTTAG,²
C is a word or tag,
P1, ..., Pi, ..., Pn are the ambiguous parts-of-speech to be reduced,
Pi is the part-of-speech that replaces P1, ..., Pi, ..., Pn.
Here are some examples taken from the actual learned rules:
* NEXTWORD = DE : P|N → N
* PREVWORD = EN : DET|ADV → DET
* PREVTAG = DET : V|N → N
* NEXTTAG = SUBCONJ : BE|V → V
² PREVWORD = previous word, PREVTAG = previous tag.
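To make the rule format concrete, the following sketch parses one rule line into a small record. The `Rule` class, the `parse_rule` function, and the acceptance of an ASCII `->` in place of the arrow are illustrative assumptions; the paper does not describe the Learner's actual file syntax beyond the form shown above.

```python
from dataclasses import dataclass

CONTEXTS = {"PREVWORD", "NEXTWORD", "PREVTAG", "NEXTTAG"}


@dataclass(frozen=True)
class Rule:
    context: str          # PREVWORD, NEXTWORD, PREVTAG or NEXTTAG
    value: str            # the word or tag C
    ambiguous: frozenset  # the tags P1..Pn to be reduced
    target: str           # the tag Pi that replaces them


def parse_rule(text: str) -> Rule:
    """Parse 'CONTEXT = C : P1|...|Pn -> Pi' (ASCII arrow assumed)."""
    left, target = text.replace("→", "->").split("->")
    condition, tags = left.split(":")
    context, value = (part.strip() for part in condition.split("="))
    ambiguous = frozenset(t.strip() for t in tags.split("|"))
    target = target.strip()
    if context not in CONTEXTS:
        raise ValueError(f"unknown context type: {context}")
    if target not in ambiguous:
        # A reduction rule may only keep one of the tags already present.
        raise ValueError("target must be one of the ambiguous tags")
    return Rule(context, value, ambiguous, target)
```

The check that the target is a member of the ambiguous set mirrors the definition above: a rule only ever narrows a word's tag set and never introduces a new tag.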
The Learner applies Brill's algorithm for unsu- 
pervised learning to try to reduce the ambiguity of 
the tags in the input corpus. The following steps 
are taken: 
1. The Learner examines each ambiguously tagged word and creates a
set of contexts for the word. Two of these contexts will be PREVWORD
and NEXTWORD. The remainder consist of PREVTAG and NEXTTAG contexts
as required by the tag(s) on the preceding and following words. For
example, if the word preceding the ambiguously tagged word is
ambiguously tagged with two tags, then the Learner must generate two
PREVTAG contexts.
2. An attempt is made to find unambiguously tagged words in the
corpus that are tagged with one and only one of the tags on the
ambiguously tagged word. For example, if the word in question has
both N and V tags, then the Learner would search for words with only
an N tag or only a V tag.
3. If such a word is found, the contexts of that word are examined to
determine if there is an overlap between them and the contexts
generated for the ambiguously tagged word. One issue for this
determination is how much ambiguity should be tolerated in the
context of the unambiguously tagged word. For example, if one of the
possible contexts is PREVTAG=N and the word preceding the
unambiguously tagged word has both N and V tags, should the context
apply? To permit various approaches to be tried, we extended the
Learner to accept a parameter (i.e., freedom) that determines how
much ambiguity will be accepted on the context words for the context
to match.
4. If a context matches for this unambiguously tagged word, the count
of unambiguously tagged words with the particular part of speech
occurring in that context is incremented.
5. After the entire corpus is examined, each of these possible
reduction rules (of the form "Change the tag of a word from X to Y in
the context C where $Y \in X$") is ranked according to the following.
First, for each tag $Z \in X$, $Z \neq Y$, the Learner computes:
$\frac{\mathit{freq}(Y)}{\mathit{freq}(Z)} \cdot \mathit{incontext}(Z, C)$, where
freq(Y) = number of occurrences of words unambiguously tagged with Y,
freq(Z) = number of occurrences of words unambiguously tagged with Z,
incontext(Z, C) = number of times a word unambiguously tagged with Z
occurs in context C.
The tag Z that gives the highest score from this formula is saved as
R. Then the score for a particular transformation is
$\mathit{incontext}(Y, C) - \frac{\mathit{freq}(Y)}{\mathit{freq}(R)} \cdot \mathit{incontext}(R, C)$
6. If the highest-ranked transformation is not positive, the Learner
is done. Otherwise, the highest-ranked transformation is appended to
the list of transformations learned. The Learner then searches this
list for the transformation that will result in the most reduction of
ambiguity (which will always be the latest rule learned) and applies
it. This process continues until no further reduction of ambiguity is
possible. Here, we also extended the Learner to accept a different
parameter (i.e., l-tagfreedom) that determines how much ambiguity
will be accepted on a word that is used for context during ambiguity
reduction, that is, when the Learner has found a rule and is applying
it to the training text. Note that specifying too small a value for
this parameter can cause the Learner to go into an endless loop, as
restricting the valid contexts may have the effect of nullifying the
just-learned rule.
7. The Learner then returns to step 1 to begin the process again.
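The ranking in step 5 can be sketched as follows, assuming the counts gathered in steps 1 to 4 are available as two counters. The names `score_rule`, `freq`, and `incontext`, and the decision to skip rival tags that never occur unambiguously (to avoid dividing by zero), are assumptions made for this illustration.

```python
from collections import Counter


def score_rule(y, tagset, context, freq, incontext):
    """Score the rule 'reduce tagset to y in context'.

    freq[t]               -- words unambiguously tagged t
    incontext[(t, ctx)]   -- those words that occur in context ctx
    Returns (score, R), where R is the strongest rival tag (or None).
    """
    # Rival tags Z in X with Z != Y; tags with no unambiguous
    # occurrences are skipped (an assumption, not from the paper).
    rivals = [z for z in tagset if z != y and freq[z] > 0]
    if not rivals:
        return float(incontext[(y, context)]), None

    def weighted(z):
        # freq(Y) / freq(Z) * incontext(Z, C)
        return freq[y] / freq[z] * incontext[(z, context)]

    r = max(rivals, key=weighted)
    return incontext[(y, context)] - weighted(r), r


# Toy counts: N is twice as frequent as V overall, and far more
# frequent than V after a determiner.
ctx = ("PREVTAG", "DET")
freq = Counter({"N": 100, "V": 50})
incontext = Counter({("N", ctx): 40, ("V", ctx): 2})
print(score_rule("N", {"N", "V"}, ctx, freq, incontext))  # (36.0, 'V')
print(score_rule("V", {"N", "V"}, ctx, freq, incontext))  # (-18.0, 'N')
```

With these toy counts the rule "PREVTAG = DET : V|N → N" scores 40 - (100/50) * 2 = 36, while the competing reduction to V scores 2 - (50/100) * 40 = -18 and would be rejected in step 6 as non-positive.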
2.3 Rule Tagger
This component reads tagged texts produced by the Initial State
Annotator and rules produced by the Learner and applies the learned
rules to the tagged texts to reduce the ambiguity of the tags. We
extended the Rule Tagger to have two possible modes of operation
(i.e., best-rule-first and learned-sequence modes controlled by the
seq parameter) for using the learned rules to reduce ambiguity:
1. The Rule Tagger can use an algorithm similar to that used in step
6 of the Learner. Each possible reduction rule is examined against
the text to determine which rule results in the greatest reduction of
ambiguity.
2. The Rule Tagger can use a sequential application of the learned
rules in the order that the rules were learned. After each rule has
been applied in sequence, all of the rules preceding it are
re-applied to take advantage of ambiguity reductions made by the
latest rule applied.
The Rule Tagger allows one to specify, as in the Learner, how much
ambiguity will be tolerated for a context to match. For example, one
can be very restrictive and require that a tag context (e.g.,
PREVTAG=N) match only an unambiguously tagged word (in this case, a
word with only an N tag). This parameter (i.e., r-tagfreedom)
specifies the maximum ambiguity allowed on a context word for a
context tag to match: 1 requires that the context word be
unambiguously tagged, 2 requires that there be no more than two tags
on the word, and so on.
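The two modes and the r-tagfreedom check can be sketched as below. This is a simplified illustration rather than the actual Rule Tagger: the token representation (a word paired with a set of tags), the requirement that a word's tag set equal the rule's ambiguous set exactly, the upper-casing of context words, and all function names are assumptions.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    context: str          # PREVWORD, NEXTWORD, PREVTAG or NEXTTAG
    value: str            # context word or tag C
    ambiguous: frozenset  # tag set to be reduced
    target: str           # tag that replaces the set


def context_matches(rule, tokens, i, tagfreedom):
    """Does the rule's context hold for the token at index i?"""
    j = i - 1 if rule.context.startswith("PREV") else i + 1
    if not 0 <= j < len(tokens):
        return False
    word, tags = tokens[j]
    if rule.context.endswith("WORD"):
        return word.upper() == rule.value
    # Tag context: the tag must be present, and the context word may
    # carry at most `tagfreedom` tags (1 = unambiguous only).
    return rule.value in tags and len(tags) <= tagfreedom


def sites(rule, tokens, tagfreedom):
    return [i for i, (_, tags) in enumerate(tokens)
            if tags == rule.ambiguous
            and context_matches(rule, tokens, i, tagfreedom)]


def apply_rule(rule, tokens, tagfreedom):
    for i in sites(rule, tokens, tagfreedom):
        tokens[i] = (tokens[i][0], {rule.target})


def reduction(rule, tokens, tagfreedom):
    """Number of tags the rule would remove from the text."""
    return len(sites(rule, tokens, tagfreedom)) * (len(rule.ambiguous) - 1)


def tag_best_rule_first(rules, tokens, tagfreedom=2):
    tokens = list(tokens)
    while rules:
        best = max(rules, key=lambda r: reduction(r, tokens, tagfreedom))
        if reduction(best, tokens, tagfreedom) == 0:
            break  # no rule reduces ambiguity any further
        apply_rule(best, tokens, tagfreedom)
    return tokens


def tag_learned_sequence(rules, tokens, tagfreedom=2):
    tokens = list(tokens)
    for k, rule in enumerate(rules):
        apply_rule(rule, tokens, tagfreedom)
        for earlier in rules[:k]:  # re-apply all preceding rules
            apply_rule(earlier, tokens, tagfreedom)
    return tokens
```

In this sketch, a determiner that is itself ambiguous between DET and PRO licenses the rule "PREVTAG = DET : V|N → N" when r-tagfreedom is 2 but not when it is 1.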
3 Experiments and Results
For training and testing of the tagger, we have randomly picked
articles from a large (274MB) "El Norte" Mexican newspaper corpus,
and separated them into the training and test sets. The test set
(17,639 words) was tagged manually for comparison against the
system-tagged texts. For training, we partitioned the development set
into several different-sized sets in order to see the effects of
training corpus sizes. The breakdown can be found in Table 2.
Table 2: Ambiguously tagged Training sets
Set       Words
Tiny      1322 words
Small     3066 words
Medium    5591 words
Full      12795 words
If one randomly picks one of the possible tags on each word in the
test set, the accuracy is 78.0% (78.6% with the simple verb tag set).
The average POS ambiguity per word is 1.52 (1.49) including
punctuation tags and 1.58 (1.56) excluding punctuation tags. For
comparison, the accuracy of Brill's unsupervised English tagger was
95.1% using 120,000-word Penn Treebank texts. His initial state
tagging accuracy was 90.7%, which is considerably higher than our
Spanish case (78.6%).
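The paper does not say how the random-choice baseline was measured. One way to obtain such a figure, shown here purely as an assumption, is to compute the expected accuracy of a uniform random pick: each word contributes 1/k if the correct tag is among its k candidate tags, and 0 otherwise. The function name and data layout are hypothetical.

```python
def random_baseline(tokens):
    """tokens: list of (candidate_tags, gold_tag) pairs.

    Returns (expected accuracy of a uniform random pick,
             average number of candidate tags per word).
    """
    n = len(tokens)
    expected = sum(1 / len(tags) for tags, gold in tokens if gold in tags) / n
    ambiguity = sum(len(tags) for tags, _ in tokens) / n
    return expected, ambiguity


toy = [
    ({"DET"}, "DET"),
    ({"N", "V"}, "N"),
    ({"P", "SUBCONJ"}, "P"),
    ({"N"}, "N"),
]
print(random_baseline(toy))  # (0.75, 1.5)
```

Note that the expected accuracy is the mean of 1/k, not 1 divided by the mean ambiguity, which is why a baseline near 78% is compatible with an average ambiguity near 1.5: most words carry a single tag, and the ambiguity is concentrated in a minority of words.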
3.1 Effect of Tag Set
Our first set of experiments tests the effect of the POS tag
complexity. We used both the simple verb tag set (5 tags) and the
complex verb tag set (42 tags), which is shown in Table 3, where *
can be either 1SG, 2SG, 3SG, 1PL, 2PL, or 3PL. In the case of the
simple verb tag set, tense, person and number information is
discarded, leaving only a "V" tag and the lower four tags in the
table.
The scores with the simple verb tag set for different sizes of
training sets are found in Table 4, and those with the complex verb
tag set in Table 5. For these two experiments, the Learner was set to
have a tight restriction on using context for learning (i.e., the
freedom parameter was set to 1) and a loose restriction on context
for applying the learned rules (i.e., l-tagfreedom 10). The Rule
Tagger was given a moderately-tight restriction on using context for
reduction rule application (i.e., r-tagfreedom 2).
Table 3: Complex Verb Tag Set
V-CONDITIONAL-*
V-FUTURE-*
V-IMPERFECT-*
V-IMPERFECT-SUBJUNCTIVE-RA-*
V-IMPERFECT-SUBJUNCTIVE-SE-*
V-PRESENT-*
V-PRESENT-SUBJUNCTIVE-*
V-PRETERIT-*
V-NIL-NIL
V-CONTINUOUS-NIL
V-PERFECT-NIL
V-NIL-NIL-CLITIC
Table 4: Ambiguously tagged texts, Simple Verbs
Set       # of rules learned   Score
Tiny      131                  82.5%
Small     211                  91.5%
Medium    287                  91.8%
Full      434                  83.0%
(none)    0                    78.6%
Table 5: Ambiguously tagged texts, Complex Verbs
Set       # of rules learned   Score
Tiny      125                  81.7%
Small     212                  89.6%
Medium    323                  90.3%
Full      564                  90.2%
(none)    0                    78.0%
In general, the scores are slightly higher using the simple verb tag
set over the complex verb tag set (91.8% vs. 90.3% for the "Medium"
corpus). This behavior is most likely due to the fact that some verb
tense/person/number combinations cannot easily be distinguished from
context, so the Learner was unable to find a rule that would
disambiguate them.
As can be seen from the tables, performance increased as the size of
the learning set increased up to the "Medium" set, where the score
levelled off. With very small learning sets, the system was unable to
find sufficient examples of phenomena to produce reduction rules with
good coverage.
One surprising data point in the simple verb tag set experiments was
the "Full" score, which dropped almost 9% from the "Medium" score.
After analyzing the results more closely, it was found that the
Learner had learned a very specific rule regarding the reduction of
preposition/subordinate-conjunction combinations late in the learning
process. The learned rule was:
PREVTAG = N : P|SUBCONJ → SUBCONJ
This rule was learned late in the learning process when most
P/SUBCONJ pairs had already been reduced. However, as one can see
from the context of the rule, it will apply in a large number of
cases in a text. The Rule Tagger notes this and applies the rule
early, thus incorrectly changing many P/SUBCONJ pairs to SUBCONJ and
reducing the accuracy of the tagging. Since this phenomenon never
occurred in any of the other learning runs, one can see that the
learning process can be heavily influenced by the choice of input
texts.
3.2 Effect of Rule Application Parameters
The next tests performed involved using rules generated above and
changing parameters to the Rule Tagger to see how the scores would be
influenced. In the following test, we used the simple verb tag set
rules but varied the r-tagfreedom parameter and the seq parameter.
The results can be found in Table 6.
Table 6: Ambiguously tagged texts, Simple Verbs
Set       R-Tagfreedom   Score (best-rule-first)   Score (learned-sequence)
Tiny      1              82.7%                     80.2%
          2              82.5%                     80.6%
          3              82.1%                     80.5%
          4              81.9%                     80.5%
Small     1              90.1%                     89.8%
          2              91.5%                     89.9%
          3              91.5%                     89.9%
          4              91.5%                     89.9%
Medium    1              90.5%                     90.6%
          2              91.8%                     90.5%
          3              91.8%                     90.5%
          4              91.8%                     90.5%
Full      1              82.4%                     79.8%
          2              83.0%                     80.0%
          3              81.7%                     80.0%
          4              81.5%                     80.0%
Although the variations are slight, the best value for the
r-tagfreedom parameter seems to be at an ambiguity level of 2. It
seems that the strategy of reducing the ambiguity as quickly as
possible (best-rule-first) is better than following the ordering of
the rules by the Learner. This may well be due to the fact that the
ordering of the rules as produced by the Learner is dependent on the
training texts. Since the test set was a different set of texts, the
ordering of the rules was not as applicable to them as to the
training texts, and so the tagging performance suffered.
3.3 Effect of Hand-tagged Texts
After examining the results from the above experiments, we realized
that some of the closed-class words in Spanish are almost always
ambiguous (e.g., prepositions are usually ambiguous between PREP and
SUBCONJ, and determiners between DET and PRO). This means that the
Learner will never learn a rule to disambiguate these closed-class
cases because there will rarely be unambiguous contexts in the
training texts tagged by the Initial State Annotator. That is, unlike
open-class words, we will not find new unambiguous closed-class words
in texts precisely because there is only a closed set of them. Thus,
we decided to introduce a small number of hand-tagged texts into the
training set given to the Learner. Since the hand-tagged texts have
"correct" examples of various phenomena, the Learner should be able
to find good examples in them to learn from.
For our tests, we defined four sets of hand-tagged texts that we
added to the "Small" (3066 words) set of ambiguously tagged texts.
The breakdown is in Table 7.
Table 7: Hand-tagged Training sets
Set       Words
Small     218 words
Medium    588 words
Large     906 words
Full      1791 words
Again, the Learner was set to have a tight restriction on using
context for learning (freedom 1) and a loose restriction on context
for applying the learned rules (l-tagfreedom 10). The Rule Tagger was
given a moderately-tight restriction on using context for reduction
rule application (r-tagfreedom 2). The best-rule-first mode of the
Rule Tagger was used.
The results, as shown in Table 8, are slightly better than when using
only ambiguously tagged texts. It is interesting to note that the
higher accuracy was achieved with fewer rules. In fact, all
experiments resulted in learning a little over 200 rules.
Table 8: Ambiguous/Unambiguous Texts, Simple Verbs
Set       # of rules learned   Score
Small     [...]                [...]
Medium    211                  92.1%
Large     205                  [...]
Full      202                  92.1%
(none)    [...]                [...]
In addition to the experiments above, we wanted to know if the
introduction of hand-tagged texts into the "Full" ambiguously tagged
set would improve its rather low score (cf. Table 4). We performed an
experiment using simple verb tags, the "Full" ambiguously tagged
texts, and the "Full" hand-tagged texts. The results were 422 rules
learned with a score of 92.1%, which tied with the "Small"
ambiguously tagged set for achieving the highest accuracy of all of
the learning/tagging runs, a full 13.5% higher than using no
learning.
4 Problems and Possible Improvements
Although our Spanish POS tagger performed reasonably well, achieving
an improvement of 13.5% in accuracy over randomly picking tags, there
were several problems that prevented the system from reaching an even
higher score.
4.1 Learning Problem
As discussed in Section 3.3, ambiguous closed-class words (e.g.,
prepositions, determiners, etc.) cannot be reduced when there are no
unambiguous examples of them in the training texts. This is prevalent
in Spanish, where most prepositions can also be subordinate
conjunctions, determiners can be pronouns, etc. A few hand-tagged
texts are required to learn good rules for reducing the ambiguity on
these words. It is possible, however, that such texts can be
disambiguated only for their always-ambiguous closed-class words but
not for unambiguous closed-class words or open-class words. Such an
experiment, similar to selective sampling discussed in Dagan and
Engelson (Dagan and Engelson, 1995), would be useful in the future
because, if it is true, it will reduce the cost of manual tagging
considerably.
4.2 Lexicon Problem
Problems that became apparent as we ran more tests were the
incompleteness and mistakes in the lexicon. While the lexicon,
derived from the Collins Spanish-English dictionary, was quite rich
in words, its tag set did not always match the tag definitions we
employed. For example, our tag set distinguishes proper nouns (PROPN)
and nouns (N), whereas the Collins dictionary marked both as nouns
(N). We have added our existing proper name lists to the lexicon to
partially solve this problem, but the lists are currently limited to
location names and people's first names.
We also found several mistakes in the Collins definitions (e.g.,
several adverbs ending in "-mente" were classified as adjectives).
Although we fixed these mistakes as we noticed them, it is difficult
to know how many such errors still remain in the lexicon.
It turned out that the incompleteness of the lexicon was another
fundamental problem for Brill's unsupervised learning algorithm. That
is, when
the lexicon does not list all the possible tags for a 
word, the tagger is very likely to make a mistake. 
This is because the learner is trained to reduce 
the ambiguity of possible tags of a word (say N, 
V, ADJ tags), but if the lexicon lists only a sub- 
set of the possible tags (say N and V tags), the 
system will never learn to assign an ADJ tag even 
when the word is used as an adjective. 
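Because reduction rules only remove tags from the set the lexicon proposes, the lexicon puts a hard ceiling on accuracy: a token whose correct tag is missing from its candidate set can never be tagged correctly. The following sketch measures that ceiling on a hand-tagged sample; the function name and the data layout are assumptions for illustration.

```python
def accuracy_ceiling(tokens):
    """tokens: list of (candidate_tags_from_lexicon, gold_tag) pairs.

    Returns the fraction of tokens whose gold tag is still reachable,
    i.e. the best accuracy any set of reduction rules could achieve.
    """
    reachable = sum(1 for tags, gold in tokens if gold in tags)
    return reachable / len(tokens)


sample = [
    ({"N", "V"}, "ADJ"),    # lexicon lacks ADJ: can never be right
    ({"N", "V"}, "N"),
    ({"V"}, "PROPN"),       # proper noun missing from the lexicon
    ({"DET"}, "DET"),
]
print(accuracy_ceiling(sample))  # 0.5
```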
This type of problem was observed frequently when words are ambiguous
between proper nouns and some other parts-of-speech, such as "Flores
(ADJ/PROPN)," "Lozano (ADJ/PROPN)," "van (V/PROPN),"³ "Serra
(V/PROPN)," etc., because not all the proper nouns are in the
lexicon.
The problems described above did not occur in Brill's experiments
because he derived the lexicon from a POS-tagged corpus and used the
untagged version of the same corpus for training and testing. Thus,
he used an "optimal" lexicon which contains all the words with only
parts-of-speech which appeared in the corpus. In addition, in such a
corpus, rarely used POS tags of a word are less likely to occur, and
words are less likely to be ambiguous. Thus, in a sense, his
"unsupervised learning" experiments did take advantage of a large
POS-tagged corpus.
5 Related Works 
It is very difficult to compare performances between taggers when
accuracy depends on quality of corpora and lexicons, and maybe on
characteristics of languages. But in this section, we compare our
tagger with Hidden Markov Model-based taggers.
A more widely used algorithm for unsupervised learning of a POS
tagger is the Hidden Markov Model (HMM). Cutting et al. (Cutting et
al., 1992) and Merialdo (Merialdo, 1994) used HMMs to learn English
POS taggers, while Chanod and Tapanainen (Chanod and Tapanainen,
1995), Feldweg (Feldweg, 1995), and León and Serrano (León and
Serrano, 1995) ported the Xerox tagger (Cutting et al., 1992) to
French, German, and Spanish respectively. One of the drawbacks of an
HMM-based approach is that laborious manual tuning of symbol and
transition biases is necessary to achieve high accuracy. Without
tuned biases, the German Xerox tagger achieved 85.89% while the
French Xerox tagger achieved 87% accuracy. After one man-month of
tuning biases, the accuracy of the French tagger increased to 96.8%.
One could derive such biases from a corpus, as discussed in
(Merialdo, 1994), but it unfortunately requires a tagged corpus.
The best accuracy of the Spanish Xerox tagger was 91.51% for the
reduced tag set (174 tags) with the base accuracy (i.e., no training)
of 88.98%, while the best accuracy of our tagger is currently 92.1%
for the simple tag set (39 tags) with the base accuracy of 78.6%. The
lower base accuracy in our experiment is probably due to the large
number of entries in the Collins dictionary.
³ It can be a part of a last name as in "van Mahler", but also is an
inflected form of "ir".
6 Summary 
Our Spanish Part of Speech Tagger is a successful implementation and
extension of Brill's unsupervised learning algorithm that reduces the
ambiguity of part-of-speech tags on words in Spanish texts.
The system requires few, if any, hand-tagged texts to bootstrap
itself. Rather, it merely requires a Spanish lexicon and
morphological analyzer that can tag words with all their possible
parts-of-speech. Given that the system performs at approximately 92%
accuracy even with the aforementioned problems and with the inclusion
of unknown words, we would expect that this system could achieve
better results, approaching those of similar English-language POS
taggers, when these problems are rectified.
References 
Eric Brill. 1995. Unsupervised learning of disambiguation rules for
part of speech tagging. In Proceedings of the 3rd Workshop on Very
Large Corpora.
Jean-Pierre Chanod and Pasi Tapanainen. 1995. Tagging French -
comparing a statistical and a constraint-based method. In Proceedings
of the EACL-95.
D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. 1992. A Practical
Part-of-Speech Tagger. In Proceedings of the Third Conference on
Applied Natural Language Processing.
Ido Dagan and Sean P. Engelson. 1995. Selective Sampling in Natural
Language Learning. In Proceedings of the IJCAI Workshop on New
Approaches to Learning for Natural Language Processing.
Helmut Feldweg. 1995. Implementation and Evaluation of a German HMM
for POS Disambiguation. In Proceedings of the EACL SIGDAT Workshop.
Fernando Sánchez León and Amalio F. Nieto Serrano. 1995. Development
of a Spanish version of the Xerox tagger. In Proceedings of the XI
Congreso de la Sociedad Española para el Procesamiento del Lenguaje
Natural (SEPLN '95).
Bernard Merialdo. 1994. Tagging English Text with a Probabilistic
Model. Computational Linguistics, 20(2).
