Unsupervised Learning of Word-Category Guessing Rules 
Andrei Mikheev 
HCRC Language Technology Group 
University of Edinburgh 
2 Buccleuch Place 
Edinburgh EH8 9LW, Scotland, UK 
Andrei.Mikheev@ed.ac.uk
Abstract 
Words unknown to the lexicon present a 
substantial problem to part-of-speech tag- 
ging. In this paper we present a technique 
for fully unsupervised statistical acquisi- 
tion of rules which guess possible parts- 
of-speech for unknown words. Three com- 
plementary sets of word-guessing rules are 
induced from the lexicon and a raw cor- 
pus: prefix morphological rules, suffix mor- 
phological rules and ending-guessing rules. 
The learning was performed on the Brown
Corpus data; the resulting rule-sets show
highly competitive performance and are
compared with the state of the art.
1 Introduction 
Words unknown to the lexicon present a substan- 
tial problem to part-of-speech (POS) tagging of real- 
world texts. Taggers assign a single POS-tag to a 
word-token, provided that it is known what parts- 
of-speech this word can take on in principle. So, first 
words are looked up in the lexicon. However, 3 to 
5% of word tokens are usually missing in the lex- 
icon when tagging real-world texts. This is where 
word-POS guessers take their place: they analyse
word features, e.g. a word's leading and
trailing characters, to figure out its possible POS cat-
egories. A set of rules which, on the basis of the ending
characters of unknown words, assigns them sets
of possible POS-tags is supplied with the Xerox tag-
ger (Kupiec, 1992). A similar approach was taken
in (Weischedel et al., 1993) where an unknown word 
was guessed given the probabilities for an unknown 
word to be of a particular POS, its capitalisation fea- 
ture and its ending. In (Brill, 1995) a system of rules 
which uses both ending-guessing and more morpho- 
logically motivated rules is described. The best of 
these methods are reported to achieve 82-85% of 
tagging accuracy on unknown words, e.g. (Brill, 
1995; Weischedel et al., 1993). 
The major topic in the development of word-POS
guessers is the strategy which is to be used for the 
acquisition of the guessing rules. A rule-based tag- 
ger described in (Voutilainen, 1995) is equipped with 
a set of guessing rules which has been hand-crafted 
using knowledge of English morphology and intu- 
ition. A more appealing approach is an empiri- 
cal automatic acquisition of such rules using avail- 
able lexical resources. In (Zhang&Kim, 1990) a 
system for the automated learning of morphologi- 
cal word-formation rules is described. This system 
divides a string into three regions and from train- 
ing examples infers their correspondence to under- 
lying morphological features. Brill (Brill, 1995) out- 
lines a transformation-based learner which learns 
guessing rules from a pre-tagged training corpus. 
A statistical-based suffix learner is presented in 
(Schmid, 1994). From a pre-tagged training cor- 
pus it constructs the suffix tree where every suf- 
fix is associated with its information measure. Al- 
though the learning process in these and some other 
systems is fully unsupervised and the accuracy of 
obtained rules reaches current state-of-the-art, they 
require specially prepared training data -- a pre- 
tagged training corpus, training examples, etc. 
In this paper we describe a new fully automatic 
technique for learning part-of-speech guessing rules. 
This technique does not require specially prepared 
training data and employs fully unsupervised statis- 
tical learning using the lexicon supplied with the tag- 
ger and word-frequencies obtained from a raw cor- 
pus. The learning is implemented as a two-staged 
process with feedback. First, setting certain param- 
eters a set of guessing rules is acquired, then it is 
evaluated and the results of evaluation are used for 
re-acquisition of a better tuned rule-set. 
2 Guessing Rules Acquisition 
As was pointed out above, one of the requirements in 
many techniques for automatic learning of part-of- 
speech guessing rules is specially prepared training 
data -- a pre-tagged training corpus, training ex- 
amples, etc. In our approach we decided to reuse 
the data which come naturally with a tagger, viz. 
the lexicon. Another source of information which 
is used and which is not prepared specially for the 
task is a text corpus. Unlike other approaches we 
don't require the corpus to be pre-annotated but 
use it in its raw form. In our experiments we used 
the lexicon and word-frequencies derived from the 
Brown Corpus (Francis&Kucera, 1982). There are 
a number of reasons for choosing the Brown Cor- 
pus data for training. The most important ones are 
that the Brown Corpus provides a model of general 
multi-domain language use, so general language reg- 
ularities can be induced from it, and second, many 
taggers come with data trained on the Brown Cor- 
pus which is useful for comparison and evaluation. 
This, however, by no means restricts the described 
technique to that or any other tag-set, lexicon or 
corpus. Moreover, despite the fact that the train-
ing is performed on a particular lexicon and a par-
ticular corpus, the obtained guessing rules are supposed
to be domain- and corpus-independent; the only
training-dependent feature is the tag-set in use.
The acquisition of word-POS guessing rules is a
three-step procedure which includes the rule extrac- 
tion, rule scoring and rule merging phases. At the 
rule extraction phase, three sets of word-guessing 
rules (morphological prefix guessing rules, morpho- 
logical suffix guessing rules and ending-guessing 
rules) are extracted from the lexicon and cleaned 
from coincidental cases. At the scoring phase, each 
rule is scored in accordance with its accuracy of 
guessing and the best scored rules are included into 
the final rule-sets. At the merging phase, rules which 
have not scored high enough to be included into the 
final rule-sets are merged into more general rules, 
then re-scored and depending on their score added 
to the final rule-sets. 
2.1 Rule Extraction Phase 
2.1.1 Extraction of Morphological Rules. 
Morphological word-guessing rules describe how 
one word can be guessed given that another word is 
known. For example, the rule: [un (VBD VBN) (JJ)]
says that prefixing the string "un" to a word, which
can act as a past form of a verb (VBD) and a participle
(VBN), produces an adjective (JJ). For instance, by
applying this rule to the word "undeveloped", we
first segment the prefix "un" and if the remaining
part "developed" is found in the lexicon as (VBD
VBN), we conclude that the word "undeveloped" is
an adjective (JJ). The first POS-set in a guessing
rule is called the initial class (I-class) and the POS-
set of the guessed word is called the resulting class
(R-class). In the example above (VBD VBN) is the
I-class of the rule and (JJ) is the R-class.
In English, as in many other languages, morpho- 
logical word formation is realised by affixation: pre- 
fixation and suffixation. Although sometimes the af- 
fixation is not just a straightforward concatenation 
of the affix with the stem 1, the majority of cases 
clearly obey simple concatenative regularities. So, 
we decided first to concentrate only on simple con- 
catenative cases. There are two kinds of morpho- 
logical rules to be learned: suffix rules (Aˢ) -- rules
which are applied to the tail of a word, and prefix
rules (Aᵖ) -- rules which are applied to the begin-
ning of a word. For example:
¹Consider an example: try - tried.
Aˢ: [ed (NN VB) (JJ VBD VBN)]
says that if by stripping the suffix "ed" from an 
unknown word we produce a word with the POS-class 
(NN VB), the unknown word is of the class (JJ
VBD VBN). This rule works, for instance, for [book
→ booked], [water → watered], etc. To extract such rules
a special operator ∇ is applied to every pair of words
from the lexicon. It tries to segment an affix by left- 
most string subtraction for suffixes and rightmost 
string subtraction for prefixes. If the subtraction 
results in an non-empty string it creates a morpho- 
logical rule by storing the POS-class of the shorter 
word as the I-class and the POS-class of the longer
word as the R-class. For example: 
[booked (JJ VBD VBN)] ∇ [book (NN VB)] →
Aˢ: [ed (NN VB) (JJ VBD VBN)]
[undeveloped (JJ)] ∇ [developed (VBD VBN)] →
Aᵖ: [un (VBD VBN) (JJ)]
The ∇ operator is applied to all possible lexicon-
entry pairs and if a rule produced by such an applica- 
tion has already been extracted from another pair, 
its frequency count (f) is incremented. Thus two 
different sets of guessing rules -- prefix and suffix 
morphological rules together with their frequencies 
-- are produced. Next, from these sets of guess- 
ing rules we need to cut out infrequent rules which 
might bias the further learning process. To do that 
we eliminate all the rules with the frequency f less
than a certain threshold θ.² Such filtering reduces
the rule-sets more than tenfold and does not leave 
clearly coincidental cases among the rules. 
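The extraction step can be sketched in a few lines of Python; a minimal illustration only, in which the lexicon is assumed to be a plain word-to-POS-class mapping and the function name and threshold default are ours, not the paper's (the quadratic pairwise loop is kept for clarity, not efficiency):

```python
from collections import defaultdict

def extract_morph_rules(lexicon, min_freq=2):
    """Apply the nabla operator to every lexicon-entry pair: if one word
    extends another by a non-empty affix, emit a rule
    (affix, I-class, R-class) and count how often each rule is seen."""
    suffix_rules = defaultdict(int)
    prefix_rules = defaultdict(int)
    words = list(lexicon)
    for long_w in words:
        for short_w in words:
            if len(long_w) <= len(short_w):
                continue
            i_class = lexicon[short_w]   # POS-class of the shorter word
            r_class = lexicon[long_w]    # POS-class of the longer word
            # leftmost string subtraction -> suffix rule
            if long_w.startswith(short_w):
                suffix_rules[(long_w[len(short_w):], i_class, r_class)] += 1
            # rightmost string subtraction -> prefix rule
            if long_w.endswith(short_w):
                prefix_rules[(long_w[:-len(short_w)], i_class, r_class)] += 1
    # cut out infrequent, likely coincidental rules
    keep = lambda rules: {r: f for r, f in rules.items() if f >= min_freq}
    return keep(suffix_rules), keep(prefix_rules)
```

With the lexicon entries from the examples above, the "ed" suffix rule is extracted twice (book/booked, water/watered) and survives the frequency cut, while the single "un" observation is filtered out.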
2.1.2 Extraction of Ending Guessing Rules. 
Unlike morphological guessing rules, ending- 
guessing rules do not require the main form of an 
unknown word to be listed in the lexicon. These 
rules guess a POs-class for a word just on the ba- 
sis of its ending characters and without looking up 
its stem in the lexicon. Such rules are able to cover 
more unknown words than morphological guessing 
rules but their accuracy will not be as high. For 
example, an ending-guessing rule 
Aᵉ: [ing - (JJ NN VBG)]
says that if a word ends with "ing" it can be an 
adjective, a noun or a gerund. Unlike a morphologi- 
cal rule, this rule does not ask to check whether the 
substring preceding the "ing"-ending is a word with
a particular POS-tag. Thus an ending-guessing rule 
looks exactly like a morphological rule apart from 
the I-class which is always void.
To collect such rules we set the upper limit on 
the ending length equal to five characters and thus 
collect from the lexicon all possible word-endings of 
length 1, 2, 3, 4 and 5, together with the POs-classes 
of the words where these endings were detected to 
appear. This is done by the operator Δ. For ex-
ample, from the word [different (JJ)] the Δ operator
will produce five ending-guessing rules: [t - (JJ)]; [nt
- (JJ)]; [ent - (JJ)]; [rent - (JJ)]; [erent - (JJ)]. The Δ
operator is applied to each entry in the lexicon in the 
²Usually we set this threshold quite low: 2-4.
way described for the ∇ operator of the morpholog-
ical rules and then infrequent rules with f < θ are
filtered out. 
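The Δ collection step can be sketched as follows; the helper name is ours, and the five-character cap and the requirement that an ending be a proper substring of the word follow the text:

```python
from collections import defaultdict

def extract_ending_rules(lexicon, max_len=5, min_freq=1):
    """Delta operator: collect all word-endings of length 1..max_len
    together with the POS-class of each word they appear in.  The
    I-class of an ending-guessing rule is always void."""
    counts = defaultdict(int)
    for word, pos_class in lexicon.items():
        # an ending must be shorter than the word itself
        for n in range(1, min(max_len, len(word) - 1) + 1):
            counts[(word[-n:], pos_class)] += 1
    # filter out infrequent, likely coincidental endings
    return {rule: f for rule, f in counts.items() if f >= min_freq}
```

Applied to the single entry [different (JJ)], this yields exactly the five rules of the example: the endings "t", "nt", "ent", "rent" and "erent", each paired with (JJ).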
2.2 Rule Scoring Phase 
Of course, not all acquired rules are equally good 
as plausible guesses about word-classes: some rules 
are more accurate in their guessings and some rules 
are more frequent in their application. So, for every 
acquired rule we need to estimate whether it is an 
effective rule which is worth retaining in the final 
rule-set. For such estimation we perform a statistical 
experiment as follows: for every rule we calculate 
the number of times this rule was applied to a word 
token from a raw corpus and the number of times it 
gave the right answer. Note that the task of the rule 
is not to disambiguate a word's POS but to provide 
all and only possible POSs it can take on. If the rule 
is correct in the majority of times it was applied it 
is obviously a good rule. If the rule is wrong most 
of the times it is a bad rule which should not be 
included into the final rule-set. 
To perform this experiment we take one-by-one 
each rule from the rule-sets produced at the rule ex- 
traction phase, take each word token from the cor- 
pus and guess its POS-set using the rule if the rule 
is applicable to the word. For example, if a guess- 
ing rule strips a particular suffix and a current word 
from the corpus does not have this suffix, we classify 
these word and rule as incompatible and the rule as 
not applicable to that word. If the rule is applicable 
to the word we perform look-up in the lexicon for 
this word and then compare the result of the guess 
with the information listed in the lexicon. If the 
guessed POS-set is the same as the POS-set stated in 
the lexicon, we count it as success, otherwise it is 
failure. The value of a guessing rule, thus, closely
correlates with its estimated proportion of success
p̂, which is the proportion of all positive outcomes
(x) of the rule application to the total number of the
trials (n), which are, in fact, attempts to apply the
rule to all the compatible words in the corpus. We
also smooth p̂ so as not to have zeros in positive or
negative outcome probabilities: p̂ = (x + 0.5)/(n + 1)
The p̂ estimate is a good indicator of rule accuracy.
However, it frequently suffers from large estimation 
error due to insufficient training data. For example, 
if a rule was detected to work just twice and the total
number of observations was also two, its estimate p̂ is
very high (1, or 0.83 for the smoothed version) but
clearly this is not a very reliable estimate because 
of the tiny size of the sample. Several smoothing 
methods have been proposed to reduce the estima- 
tion error. For different reasons all these smoothing 
methods are not very suitable in our case. In our 
approach we tackle this problem by calculating the 
lower confidence limit π_L for the rule estimate. This
can be seen as the minimal expected value of p̂ for
the rule if we were to draw a large number of sam-
ples. Thus with certain confidence α we can assume
that if we used more training data, the rule estimate
p̂ would be no worse than the π_L limit. The lower
confidence limit π_L is calculated as:
π_L = p̂ - z_(1-α)/2 * s_p = p̂ - z_(1-α)/2 * sqrt(p̂(1 - p̂)/n)
This function favours the rules with higher esti- 
mates obtained over larger samples. Even if one 
rule has a high estimate but that estimate was ob- 
tained over a small sample, another rule with a lower 
estimate but over a large sample might be valued 
higher. Note also that since p̂ itself is smoothed we
will not have zeros in positive (p̂) or negative (1 - p̂)
outcome probabilities. This estimation of the rule 
value in fact resembles that used by (Tzoukermann 
et al., 1995) for scoring POS-disambiguation rules for
the French tagger. The main difference between the 
two functions is that there the z value was implic- 
itly assumed to be 1 which corresponds to the con- 
fidence of 68%. A more standard approach is to 
adopt a rather high confidence value in the range 
of 90-95%. We adopted 90% confidence for which 
z_(1-0.90)/2 = z_0.05 = 1.65. Thus we can calculate
the score for the ith rule as: p̂_i - 1.65 * sqrt(p̂_i(1 - p̂_i)/n_i)
Another important consideration for scoring a 
word-guessing rule is that the longer the affix or end- 
ing of the rule the more confident we are that it is 
not a coincidental one, even on small samples. For 
example, if the estimate for the word-ending "o" was 
obtained over a sample of 5 words and the estimate 
for the word-ending "fulness" was also obtained over 
a sample of 5 words, the later case is more represen- 
tative even though the sample size is the same. Thus 
we need to adjust the estimation error in accordance 
with the length of the affix or ending. A good way 
to do that is to divide it by a value which increases 
along with the increase of the length. After several 
experiments we obtained: 
score_i = p̂_i - 1.65 * sqrt(p̂_i(1 - p̂_i)/n_i) / (1 + log(|S_i|))
When the length of the affix or ending is 1 the 
estimation error is not changed since log(l) is 0. For 
the rules with the affix or ending length of 2 the es- 
timation error is reduced by 1 + log(2) = 1.3, for 
the length 3 this will be 1 + log(3) = 1.48, etc. The
longer the length the smaller the sample which will
be considered representative enough for a confident 
rule estimation. Setting the threshold θ_s at a cer-
tain level lets only the rules whose score is higher
than the threshold be included into the final rule-
sets. The method for setting up this threshold is 
based on empirical evaluations of the rule-sets and 
is described in Section 3. 
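The whole scoring step, combining the smoothed estimate with the length-adjusted confidence term, can be sketched as follows (the function name is ours; the logarithm is base 10, which matches the 1 + log(2) = 1.3 figure above):

```python
import math

Z90 = 1.65  # z_(1-0.90)/2 = z_0.05, i.e. 90% confidence

def rule_score(x, n, affix_len):
    """Score a rule observed to succeed x times in n trials.
    The success proportion is smoothed as (x + 0.5) / (n + 1) and the
    estimation error is shrunk by 1 + log10(affix/ending length)."""
    p = (x + 0.5) / (n + 1.0)             # smoothed proportion of success
    err = math.sqrt(p * (1.0 - p) / n)    # standard error of the estimate
    return p - Z90 * err / (1.0 + math.log10(affix_len))
```

For x = 2, n = 2 this reproduces the smoothed estimate of 0.83 mentioned above, while the subtracted error term keeps such tiny-sample rules from outscoring rules with the same proportion attested over larger samples.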
2.3 Rule Merging Phase 
Rules which have scored lower than the threshold
θ_s can be merged into more general rules which, if
scored above the threshold are also included into the 
final rule-sets. We can merge two rules which have 
scored below the threshold and have the same affix 
(or ending) and the initial class (I-class).³ The score of
the resulting rule will be higher than the scores of 
the merged rules since the number of positive ob- 
servations increases and the number of the trials re- 
mains the same. After a successful application of 
the merging, the resulting rule substitutes the two 
merged ones. To perform such rule-merging over 
a rule-set, first, the rules which have not been in- 
cluded into the final set are sorted by their score 
and best-scored rules are merged first. This is done
recursively until the score of the resulting rule
exceeds the threshold, in which case it is added
to the final rule-set. This process is applied until no
merges can be done to the rules which have scored 
below the threshold. 
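One merge step can be sketched under an assumed tuple representation for a rule, (affix, I-class, R-class, positives x, trials n); the bookkeeping follows the text: the R-classes are unioned, the positive observations add up, and the number of trials stays the same, so the merged rule always scores at least as high as either input:

```python
def merge_rules(rule_a, rule_b):
    """Merge two below-threshold rules sharing an affix and I-class.
    Returns the merged rule, whose score can then be recomputed and
    checked against the threshold."""
    affix_a, i_a, r_a, x_a, n_a = rule_a
    affix_b, i_b, r_b, x_b, n_b = rule_b
    # merging is only defined for the same affix, I-class and trial count
    assert affix_a == affix_b and i_a == i_b and n_a == n_b
    merged_r = tuple(sorted(set(r_a) | set(r_b)))   # union of R-classes
    return (affix_a, i_a, merged_r, x_a + x_b, n_a)  # positives add up
```

After a successful merge the result replaces the two input rules, exactly as described above.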
3 Direct Evaluation Stage 
There are two important questions which arise at 
the rule acquisition stage: how to choose the scoring
threshold θ_s, and what is the performance of the rule-
sets produced with different thresholds. The task of 
assigning a set of POS-tags to a word is actually quite
similar to the task of document categorisation where 
a document should be assigned with a set of descrip- 
tors which represent its contents. The performance 
of such assignment can be measured in: 
recall - the percentage of POSs which were assigned
correctly by the guesser to a word;
precision - the percentage of POSs the guesser as-
signed correctly over the total number of POSs it
assigned to the word;
coverage - the proportion of words which the
guesser was able to classify, but not necessarily cor-
rectly.
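These three measures can be computed, micro-averaged over words, roughly as follows. This is a sketch under our own assumptions: guess_fn stands for any guesser returning a POS-set or None, and uncovered words are excluded from the precision and recall counts:

```python
def evaluate_guesser(guess_fn, lexicon):
    """Word recall, precision and coverage of a guesser against a
    reference lexicon (micro-average over covered words)."""
    correct = assigned = reference = covered = 0
    for word, true_pos in lexicon.items():
        guessed = guess_fn(word)
        if not guessed:
            continue                             # word not covered
        covered += 1
        correct += len(set(guessed) & set(true_pos))
        assigned += len(set(guessed))
        reference += len(set(true_pos))
    recall = correct / reference if reference else 0.0
    precision = correct / assigned if assigned else 0.0
    coverage = covered / len(lexicon)
    return recall, precision, coverage
```

For the corpus-based variant described below, each word's counts would additionally be weighted by its corpus frequency before averaging.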
In our experiments we measured word precision 
and word recall (micro-average). There were two 
types of data in use at this stage. First, we eval- 
uated the guessing rules against the actual lexicon: 
every word from the lexicon, except for closed-class 
words and words shorter than five characters,⁴ was
guessed by the different guessing strategies and the 
results were compared with the information the word 
had in the lexicon. In the other evaluation experi- 
ment we measured the performance of the guessing 
rules against the training corpus. For every word we 
computed its metrics exactly as in the previous ex- 
periment. Then we multiplied these results by the 
corpus frequency of this particular word and aver- 
aged them. Thus the most frequent words had the 
greatest influence on the aggregate measures.
First, we concentrated on finding the best thresh- 
olds θ_s for the rule-sets. To do that, for each rule-set
produced using different thresholds we recorded the 
three metrics and chose the set with the best aggre- 
gate. In Table 1 some results of that experiment 
are shown. The best thresholds were detected: for 
ending rules - 75 points, for suffix rules - 60, and for 
³For ending-guessing rules this is always true, so only
the ending itself counts. 
⁴The actual size of the filtered lexicon was 47,659 en-
tries out of 53,015 entries of the original lexicon.
prefix rules - 80. One can notice a slight difference 
in the results obtained over the lexicon and the cor- 
pus. The corpus results are better because the train- 
ing technique explicitly targeted the rule-sets to the 
most frequent cases of the corpus rather than the 
lexicon. On average, ending-guessing rules were de-
tected to cover over 96% of the unknown words. The
precision of 74% can roughly be interpreted thus:
for words which take on three different POSs in their
POS-class, the ending-guessing rules will assign four,
but in 95% of the times (recall) the three required
POSs will be among the four assigned by the guess.
In comparison with the Xerox word-ending guesser 
taken as the base-line model we detect a substantial 
increase in the precision by about 22% and a welcome
increase in coverage by about 6%. This means that
the Xerox guesser creates more ambiguity for the
disambiguator, assigning five instead of three POSs
in the example above. It can also handle 6% fewer
unknown words which, in fact, might decrease its
performance even further. In comparison with the
ending-guessing rules, the morphological rules have 
much better precision and hence better accuracy of 
guessing. Virtually every word which can be
guessed by the morphological rules is guessed ex-
actly right (97% recall and 97% precision). Not
surprisingly, the coverage of morphological rules is 
much lower than that of the ending-guessing ones - 
for the suffix rules it is less than 40% and for the 
prefix rules about 5-6%. 
After obtaining the optimal rule-sets we per- 
formed the same experiment on a word-sample which 
was not included into the training lexicon and cor- 
pus. We gathered about three thousand words from 
the lexicon developed for the Wall Street Journal 
corpus⁵ and collected frequencies of these words in
this corpus. In this experiment we obtained simi-
lar metrics apart from the coverage which dropped 
about 0.5% for Ending 75 and Xerox rule-sets and 
7% for the Suffix 60 rule-set. This, actually, did not 
come as a surprise, since many main forms required 
by the suffix rules were missing in the lexicon. 
In the next experiment we evaluated whether the 
morphological rules add any improvement if they are 
used in conjunction with the ending-guessing rules. 
We also evaluated in detail whether a conjunctive 
application with the Xerox guesser would boost the 
performance. As in the previous experiment we mea- 
sured the precision, recall and coverage both on the 
lexicon and on the corpus. Table 2 demonstrates 
some results of this experiment. The first part of 
the table shows that when the Xerox guesser is ap- 
plied before the E75 guesser we measure a drop in 
the performance. When the Xerox guesser is applied 
after the E75 guesser, no significant changes to the per-
formance are noticed. This actually proves that the
E75 rule-set fully supersedes the Xerox rule-set. The
second part of the table shows that the cascading 
application of the morphological rule-sets together 
with the ending-guessing rules increases the over- 
⁵These words were not listed in the training lexicon.
all precision of the guessing by a further 5%. This 
makes the improvements against the base-line Xerox 
guesser 28% in precision and 7% in coverage. 
4 Tagging Unknown Words 
The direct evaluation of the rule-sets gave us the 
grounds for the comparison and selection of the best 
performing guessing rule-sets. The task of unknown 
word guessing is, however, a subtask of the overall 
part-of-speech tagging process. Thus we are mostly 
interested in how the advantage of one rule-set over 
another will affect the tagging performance. So, we 
performed an independent evaluation of the impact 
of the word guessers on tagging accuracy. In this 
evaluation we tried two different taggers. First, we 
used a tagger which was a C++ re-implementation
of the Lisp-implemented HMM Xerox tagger de-
scribed in (Kupiec, 1992). The other tagger was the 
rule-based tagger of Brill (Brill, 1995). Both of the 
taggers come with data and word-guessing compo- 
nents pre-trained on the Brown Corpus.⁶ This, ac-
tually gave us the search-space of four combinations: 
the Xerox tagger equipped with the original Xe- 
rox guesser, Brill's tagger with its original guesser, 
the Xerox tagger with our cascading P80+S60+E75
guesser and Brill's tagger with the cascading guesser. 
For words which failed to be guessed by the guess- 
ing rules we applied the standard method of classi- 
fying them as common nouns (NN) if they are not 
capitalised inside a sentence and proper nouns (NP) 
otherwise. As the base-line result we measured the 
performance of the taggers with all known words on 
the same word sample. 
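The standard fallback mentioned above amounts to a few lines; a sketch in which the sentence-position flag is our assumption about how the capitalisation test is supplied:

```python
def fallback_guess(word, sentence_initial=False):
    """Classify a word no guessing rule fired on: proper noun (NP)
    if capitalised inside a sentence, common noun (NN) otherwise."""
    if word[:1].isupper() and not sentence_initial:
        return ("NP",)
    return ("NN",)
```

Sentence-initial capitalised words are deliberately not treated as proper nouns, since their capitalisation carries no information.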
In the evaluation of tagging accuracy on unknown 
words we pay attention to two metrics. First we 
measure the accuracy of tagging solely on unknown 
words: 
UnknownScore = CorrectlyTaggedUnknownWords / TotalUnknownWords
This metric gives us the exact measure of how the
tagger has done on unknown words. In this case,
however, we do not account for the known words
which were mis-tagged because of the guessers. To
put a perspective on that aspect we measure the
overall tagging performance:
TotalScore = CorrectlyTaggedWords / TotalWords
Since the Brown Corpus model is a general lan- 
guage model, it, in principle, does not put restric- 
tions on the type of text it can be used for, although 
its performance might be slightly lower than that of 
a model specialised for this particular sublanguage. 
Here we want to stress that our primary task was not 
to evaluate the taggers themselves but rather their 
performance with the word-guessing modules. So we 
did not worry too much about tuning the taggers for 
the texts and used the Brown Corpus model instead. 
We tagged several texts of different origins, except 
from the Brown Corpus. These texts were not seen 
at the training phase which means that neither the 
6Since Brill's tagger was trained on the Penn tag-set 
(Marcus et al., 1993) we provided an additional mapping. 
taggers nor the guessers had been trained on these 
texts and they naturally had words unknown to the 
lexicon. For each text we performed two tagging 
experiments. In the first experiment we tagged the 
text with the Brown Corpus lexicon supplied with 
the taggers and hence had only those unknown words 
which naturally occur in this text. In the second ex- 
periment we tagged the same text with the lexicon 
which contained only closed-class⁷ and short⁸ words.
This small lexicon contained only 5,456 entries out 
of 53,015 entries of the original Brown Corpus lex- 
icon. All other words were considered as unknown 
and had to be guessed by the guessers. 
We obtained quite stable results in these experi- 
ments. Here is a typical example of tagging a text of 
5970 words. This text was detected to have 347 un- 
known words. First, we tagged the text by the four 
different combinations of the taggers with the word- 
guessers using the full-fledged lexicon. The results 
of this tagging are summarised in Table 3. When us- 
ing the Xerox tagger with its original guesser, 63 un- 
known words were incorrectly tagged and the accu- 
racy on the unknown words was measured at 81.8%. 
When the Xerox tagger was equipped with our cas- 
cading guesser its accuracy on unknown words in- 
creased by almost 9%, up to 90.5%. The same situa-
tion was detected with Brill's tagger which in general 
was slightly more accurate than the Xerox one.⁹ The
cascading guesser performed better than Brill's orig- 
inal guesser by about 8% boosting the performance 
on the unknown words from 84.5%¹⁰ to 92.2%. The
accuracy of the taggers on the set of 347 unknown 
words when they were made known to the lexicon 
was detected at 98.5% for both taggers. 
In the second experiment we tagged the same text 
in the same way but with the small lexicon. Out of 
5,970 words of the text, 2,215 were unknown to the 
small lexicon. The results of this tagging are sum- 
marised in Table 4. The accuracy of the taggers 
on the 2,215 unknown words when they were made 
known to the lexicon was much lower than in the 
previous experiment -- 90.3% for the Xerox tagger 
and 91.5% for Brill's tagger. Naturally, the perfor- 
mance of the guessers was also lower than in the 
previous experiment plus the fact that many "semi- 
closed" class adverbs like "however", "instead", etc., 
were missing in the small lexicon. The accuracy of 
the tagging on unknown words dropped by about 
5% in general. The best results on unknown words 
were again obtained with the cascading guesser (86%-
87.45%) and Brill's tagger again did better than the
Xerox one by 1.5%.
Two types of mis-taggings caused by the guessers 
⁷Articles, prepositions, conjunctions, etc.
⁸Shorter than 5 characters.
⁹This, however, was not an entirely fair comparison
because of the differences in the tag-sets in use by the 
taggers. The Xerox tagger was trained on the original 
Brown Corpus tag-set which makes more distinctions be- 
tween categories than the Penn Brown Corpus tag-set. 
¹⁰This figure agrees with the 85% quoted by Brill
(Brill, 1994).
occurred. The first type is when guessers provided
broader POS-classes for unknown words and the tag- 
ger had difficulties with the disambiguation of such 
broader classes. This is especially the case with the 
"ing" words which, in general, can act as nouns, ad- 
jectives and gerunds and only direct lexicalization 
can restrict the search space, as in the case with 
the word "going" which cannot be an adjective but 
only a noun and a gerund. The second type of mis- 
tagging was caused by wrong assignments of POSs by
the guesser. Usually this is the case with irregular 
words like, for example, "cattle" which was wrongly 
guessed as a singular noun (NN) but in fact is a 
plural noun (NNS). 
5 Discussion and Conclusion 
We presented a technique for fully unsupervised 
statistical acquisition of rules which guess possible 
parts-of-speech for words unknown to the lexicon. 
This technique does not require specially prepared 
training data and uses for training the lexicon and 
word frequencies collected from a raw corpus. Us- 
ing these training data three types of guessing rules 
are learned: prefix morphological rules, suffix mor- 
phological rules and ending-guessing rules. To select 
best performing guessing rule-sets we suggested an 
evaluation methodology, which is solely dedicated to 
the performance of part-of-speech guessers. 
Evaluation of tagging accuracy on unknown words 
using texts unseen by the guessers and the taggers 
at the training phase showed that tagging with the 
automatically induced cascading guesser was consis- 
tently more accurate than previously quoted results 
known to the author (85%). The cascading guesser 
outperformed the guesser supplied with the Xerox 
tagger by about 8-9% and the guesser supplied with 
Brill's tagger by about 6-7%. Tagging accuracy on 
unknown words using the cascading guesser was de- 
tected at 90-92% when tagging with the full-fledged 
lexicon and 86-88% when tagging with the closed- 
class and short word lexicon. When the unknown 
words were made known to the lexicon the accu- 
racy of tagging was detected at 96-98% and 90-92% 
respectively. This makes the accuracy drop caused 
by the cascading guesser to be less than 6% in gen- 
eral. Another important conclusion from the evalua- 
tion experiments is that the morphological guessing 
rules do improve the guessing performance. Since 
they are more accurate than ending-guessing rules 
they are applied before ending-guessing rules and 
improve the precision of the guessings by about 5%. 
This, actually, results in about 2% higher accuracy 
of tagging on unknown words. 
The acquired guessing rules employed in our cas- 
cading guesser are, in fact, of a standard nature 
and in that form or another are used in other POS- 
guessers. There are, however, a few points which 
make the rule-sets acquired by the presented here 
technique more accurate: 
• the learning of such rules is done from the lex- 
icon rather than tagged corpus, because the 
guesser's task is akin to the lexicon lookup; 
• there is a well-tuned statistical scoring proce- 
dure which accounts for rule features and fre- 
quency distribution; 
• there is an empirical way to determine an opti- 
mum collection of rules, since acquired rules are 
subject to rigorous direct evaluation in terms of 
precision, recall and coverage; 
• rules are applied cascadingly using the most ac- 
curate rules first. 
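The cascading application order can be sketched as follows; the toy rules and POS-classes below are illustrative assumptions only, not the induced rule-sets themselves:

```python
# Sketch of cascading rule application: try the more accurate
# morphological rule-sets first, then ending-guessing rules, then
# fall back to a default open-class guess. All rules here are toys.

def apply_rules(word, rules):
    """Return the POS-class of the first rule that matches, else None."""
    for matches, pos_class in rules:
        if matches(word):
            return pos_class
    return None

prefix_rules = [(lambda w: w.startswith("un") and len(w) > 5, {"JJ", "VBN"})]
suffix_rules = [(lambda w: w.endswith("ness"), {"NN"})]
ending_rules = [(lambda w: w.endswith("ed"), {"VBD", "VBN", "JJ"}),
                (lambda w: w.endswith("s"), {"NNS", "VBZ"})]

def cascading_guess(word, default=frozenset({"NN", "JJ", "VB"})):
    for rules in (prefix_rules, suffix_rules, ending_rules):
        guess = apply_rules(word, rules)
        if guess is not None:
            return guess
    return default

print(cascading_guess("happiness"))  # matched by a suffix rule
```

Because each unknown word stops at the first rule-set that fires, the more precise morphological analyses pre-empt the broader but noisier ending guesses.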
One of the most important issues in the induction 
of guessing rule-sets is the choice of the right data for 
training. In our approach, guessing rules are extracted 
from the lexicon and the actual corpus frequencies 
of word-usage then allow for discrimination between 
rules which are no longer productive (but have left 
their imprint on the basic lexicon) and rules that are 
productive in real-life texts. Thus the major factor 
in the learning process is the lexicon. Since guessing 
rules are meant to capture general language regular- 
ities, the lexicon should be as general as possible (list- 
ing all possible POSs for a word) and as large as possi- 
ble. The corresponding corpus should include most 
of the words from the lexicon and be large enough 
to obtain reliable estimates of the word-frequency distri- 
bution. Our experiments with the lexicon and word 
frequencies derived from the Brown Corpus, which 
can be considered a general model of English, re- 
sulted in guessing rule-sets which proved to be do- 
main and corpus independent 11, producing similar 
results on test texts of different origin. 
Although in general the performance of the cas- 
cading guesser is only 6% worse than the lookup of a 
general-language lexicon, there is room for improve- 
ment. First, in the extraction of the morphological 
rules we did not attempt to model non-concatenative 
cases. In English, however, most letter mutations 
occur at the last letter of the base word, so it is 
possible to account for them. Our next goal, therefore, 
is to extract morphological rules with a one-letter 
mutation at the end, which would account for cases 
like "try - tries", "reduce - reducing", "advise - 
advisable". We expect this to increase the coverage of 
the suffix morphological rules and hence contribute 
to the overall guessing accuracy. Another avenue for improvement 
is to provide the guessing rules with the probabilities 
of emission of POSs from their resulting POS-classes. 
This information can be compiled automatically and 
also might improve the accuracy of tagging unknown 
words. 
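Returning to the first of these improvements, the proposed one-letter-mutation analysis could be sketched roughly as follows; the lexicon, suffix list and function names are illustrative assumptions, not the paper's actual rule format:

```python
from string import ascii_lowercase

# Sketch: analysing a word as stem + suffix while allowing one letter
# mutation at the end of the stem, as in "try" - "tries" or
# "reduce" - "reducing". Lexicon and suffixes below are toy examples.

def mutation_analyses(word, lexicon, suffixes):
    """Yield (stem, suffix) pairs licensed by the lexicon."""
    for suffix in suffixes:
        if not word.endswith(suffix) or len(word) <= len(suffix) + 1:
            continue
        base = word[:-len(suffix)]
        if base in lexicon:              # plain concatenative case
            yield base, suffix
        for letter in ascii_lowercase:
            # last stem letter substituted: "tri" -> "try" (tries)
            sub = base[:-1] + letter
            if sub != base and sub in lexicon:
                yield sub, suffix
            # stem's final letter dropped: "reduc" -> "reduce" (reducing)
            if base + letter in lexicon:
                yield base + letter, suffix

lexicon = {"try", "reduce", "advise"}
print(list(mutation_analyses("tries", lexicon, ["es", "ing", "able"])))
# [('try', 'es')]
```

Such analyses would let a suffix morphological rule fire even when the mutated stem, rather than the surface prefix of the word, is what appears in the lexicon.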
The described rule-acquisition and evaluation 
methods are implemented as a modular set of C++ 
and AWK tools, and the guesser is easily extendable 
to sub-language-specific regularities and retrainable 
for new tag-sets and other languages, provided that 
these languages have affixational morphology. Both 
the software and the produced guessing rule-sets are 
available by contacting the author. 
11 but tag-set dependent 
6 Acknowledgements 
Some of the research reported here was funded 
as part of EPSRC project IED4/1/5808 "Integrated 
Language Database". I would also like to thank 
Chris Brew for helpful discussions on the issues re- 
lated to this paper. 

References 
E. Brill 1994. Some Advances in Transformation- 
Based Part of Speech Tagging. In Proceedings of 
the Twelfth National Conference on Artificial In- 
telligence (AAAI-94), Seattle, WA. 
E. Brill 1995. Transformation-based error-driven 
learning and natural language processing: a case 
study in part-of-speech tagging. In Computational 
Linguistics, 21(4), pp. 543-565. 
W. Francis and H. Kucera 1982. Frequency Analysis 
of English Usage. Houghton Mifflin, Boston. 
J. Kupiec 1992. Robust Part-of-Speech Tagging Us- 
ing a Hidden Markov Model. In Computer Speech 
and Language. 
M. Marcus, M.A. Marcinkiewicz, and B. Santorini 
1993. Building a Large Annotated Corpus of En- 
glish: The Penn Treebank. In Computational Lin- 
guistics, 19(2), pp. 313-329. 
H. Schmid 1994. Part of Speech Tagging with Neu- 
ral Networks. In Proceedings of the International 
Conference on Computational Linguistics, pp. 172- 
176, Kyoto, Japan. 
E. Tzoukermann, D.R. Radev, and W.A. Gale 1995. 
Combining Linguistic Knowledge and Statistical 
Learning in French Part of Speech Tagging. In 
EACL SIGDAT Workshop, pp. 51-59, Dublin, Ire- 
land. 
A. Voutilainen 1995. A Syntax-Based Part-of- 
Speech Analyser. In Proceedings of the Sev- 
enth Conference of the European Chapter of the As- 
sociation for Computational Linguistics (EACL), 
pp. 157-164, Dublin, Ireland. 
R. Weischedel, M. Meteer, R. Schwartz, L. Ramshaw 
and J. Palmucci 1993. Coping with ambiguity and 
unknown words through probabilistic models. In 
Computational Linguistics, 19(2), pp. 359-382. 
Byoung-Tak Zhang and Yung-Taek Kim 1990. Mor- 
phological Analysis and Synthesis by Automated 
Discovery and Acquisition of Linguistic Rules. 
In Proceedings of the 13th International Confer- 
ence on Computational Linguistics, pp. 431-435, 
Helsinki, Finland. 
