!1 
II 
II 
II 
II 
II 
II 
I! 
II 
II 
II 
II 
II 
Induction of a Stem Lexicon for Two-level Morphological Analysis 
Erika F. de Lima 
Institute for Natural Language Processing 
Stuttgart University 
Azenbergstr. 12 
70174 Stuttgart, Germany 
delima~ims, uni-stuttgart, de 
Abstract 
A method is described to automatically ac- 
quire from text corpora a Portuguese stem lex- 
icon for two-level morphological analysis. It 
makes use of a lexical transducer to generate 
all possible stems for a given unknown inflected 
word form, and the EM algorithm to rank al- 
ternative stems. 
1 Motivation 
Morphological analysis is the basis for most natural 
language processing tasks. Hand-coded lists used 
in morphological processing are expensive to create 
and maintain. A procedure to automatically induce 
a stem lexicon from text corpora would enable the 
creation, verification and update of broad-coverage 
lexica which reflect evolving usage and are less sub- 
ject to lexical gaps. Such a procedure would also 
be applicable to the acquisition of domain-specific 
vocabularies, given appropriate corpora. 
In the following, a method is described to au- 
tomatically generate a stem lexicon for two-level 
morphological analysis (Koskenniemi, 1983). The 
method, which was implemented and tested on a 
newspaper corpus of Brazilian Portuguese, is appli- 
cable to other languages as well. 
2 Method 
The learning algorithm consists of a procedure which 
attempts to determine the stem and part of speech 
for each (unknown) inflected form in its input. For 
instance, given the inflected form recristalizaf~es 
('recrystallizations'), the procedure induces that 
cristal ('crystal') is a noun, and adds it to the set 
of learned stems. 
The system makes use of a two-level processor- 
PC-KIMMO (Antworth, 1990)-to generate a set of 
putative stems for each inflected form in its input. 
(For a detailed account of the PC-KIMMO two-level 
framework, see (Antworth, 1990).) In order to mor- 
phologically analyze its input, the processor makes 
use of a set of two-level rules, a lexicon contain- 
ing inflectional as well as derivational affixes, and 
a unification-based word grammar. No stem lexi- 
con is provided to the system. In the word grammar 
and lexical transducer, a stem is defined to be a non- 
empty arbitrary sequence of characters. 
The current system contains 102 two-level rules, 
accounting for plural formation, e.g., cristal ('crys- 
tal') - cristais ('crystals'), diminutive and augmenta- 
tive formation, e.g., casa ('house') - casinha ('house- 
DIM'), feminine formation, e.g., alemao ('German- 
MASC') - alema ('German-FEM'), superlative for- 
mation pag~o ('pagan') - pananissimo ('pagan- 
SUP'), verbal stem alternation, e.g., dormir ('to 
sleep') - durmo ('sleep-IP-SG-PRES'), and deriva- 
tional forms, e.g., forum ('forum') - \]orense ('foren- 
sic'). The a~xes lexicon consists of 511 entries, of 
which 236 are inflectional and 275 derivational. The 
unification-based word grammar consists of 14 rules 
to account for prefixation, suffixation, and inflection. 
Each word parse tree produced for an inflected 
form yields a putative stem and its part of speech 
through the constraints provided by the grammar 
and affix lexicon. For instance, given the unknown 
inflected form cristalizar ('crystallize'), and the con- 
stralnt that the suffix izar ('ize') may only be applied 
to nouns or adjectives to form a verb, the system in- 
duces that the string cristal ('crystal') is possibly a 
nominal or adjectival stem. 
Since a stem is defined to be an arbitrary non- 
empty string, a parse forest is usually produced 
for each inflected form, yielding a set of putative 
stems, each corresponding to one parse tree. In or- 
der to establish the correct stem for an inflected 
form, the learning procedure attempts to combine 
the accumulated evidence provided by related word 
forms, i.e., word forms sharing a common stem. For 
instance, the word recristalizaf5es ('recrystalliza- 
de Lima 267 Induction of a Stem Lexicon 
Erika F. de Lima (1998) Induction of a Stem Lexicon for Two-level Morphological Analysis. In D.M.W. Powers (ed.) 
NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 267-268. 
tions') shares a common stem with related words 
such as cristal ('crystal'), cristalino ('crystalline- 
MASC'), cristaliza ('crystallize-3P-SG-PRES'), etc. 
The EM algorithm, used to assign a probability to 
each stem in a set, makes use of this fact to deter- 
mine the most probable stem. 
3 EM Algorithm 
The system uses the expectation maximization (EM) 
algorithm (Dempster, Laird, and Rubin, 1977) to as- 
sign probabilities to each stem in a set, given all sets 
obtained for a given corpus of inflected word forms. 
In the current setting, the algorithm is defined as 
follows. 
Algorithm. Let S be a set of sterns. Further, let S 
be a finite set of nonempty subsets of p(S), and let 
So = Uxes X. For each stem x in So: 
Initialization: 
co(x) = Ex~s(I(z, x). gc(x)) 
Step k + 1: 
ck+~(z) = ck(=) + Ex~s(Pk(=, X). gc(X)) 
Where Pc is a function from S to the natural num- 
bers mapping a set X to the number of times it was 
produced for a given corpus C of inflected forms, and 
I, Pk, and Pk are functions defined as follows: 
I: Sx~(S) -,\[0,1\] 
{ I-~1 ifxEX 
(x, X) ~ 0 else 
Pk : s x p(s)-~ \[o, 1\] 
( p~_,.~4r.L_ if zex and IXl>l 
(=,x) l "(') el,e 
Pk : S --+\[0,1\] X ~ ~(_..~.~_L_ 
~ESo 
A stem x is considered to be best in the set X at 
the iteration k if x E X and p~ (x) is an absolute 
maximum in U~x Pk(~). 
In the experiment described in the next section, 
a set of stems was considered disambiguated if it 
contained a best set at the final iteration; the final 
number of iterations was set empirically. 
4 Results 
The method described in the previous sections was 
applied to a newspaper corpus of Brazilian Por- 
tuguese containing 50,099 inflected word types. The 
system produced a total of 2,333,969 analysis (puta- 
tive stems) for these words. Of the 50,099 stem sets, 
33,683 contained a best stem. 
In order to measure the recall rate of the learn- 
ing algorithm, a random set of 1,000 inflected word 
types used as input to the system was obtained, and 
their stems manually computed. The recall rate is 
given by the number of stems learned by the system, 
divided by the total number of stems, or 42,3%. The 
low recall rate is due partially to the fact that not 
all sets produced by the system contained a best 
stem. The system produced partial disambiguation 
for 15,814 of the original 50,099 sets, e.g., after the 
final iteration, there was a proper subset of stems 
with maximal probability, but no absolute maxi- 
mum. A large number of partial disambiguations 
involved sets containing a stem considered to be 
both an adjective and a noun, e.g., {AJ stem, N 
stem}. This reflects the fact that very often Por- 
tuguese words are noun-adjective homographs, and 
assignment to one category cannot be made based 
on the morphological evidence alone. If the system 
were to consider partial disambiguation as well, the 
recall rate could be significantly improved. 
In order to evaluate the precision of the learning 
algorithm, a random set of 1,000 stems produced by 
the system was compared to the judgements of a sin- 
gle judge. The precision of the system is given by the 
number of correct learned stems divided by the total 
number of learned stems, or 70.4%. A small percent- 
age of errors was due to the fact that closed-class 
words were assigned open-class word categories. A 
closed-class word lexicon would eliminate these er- 
rors. Spelling errors are another source of errors. 
Taking frequency of occurrence into account would 
alleviate this problem. By far the largest percentage 
of errors was due to the fact that the system was 
not able to correctly segment stems, mostly due tO 
incorrect prefixation. In order to improve precision, 
the system should make use of not only of the stem 
provided by each parse tree, but take the structure 
itself into account in order to correctly determine 
the stem boundaries. 

References 
Antworth, Evan L. 1990. PC-KIMMO: a two-level 
processor for morphological analysis. Summer In- 
stitute of Linguistics, Dallas. 
Dempster, A.P., N.M. Laird, and D.B. Rubin. 1977. 
Maximum likelihood from indomplete data via 
the EM algorithm. J.R.Statis. Soc. B, 39:1-38. 
Koskenniemi, Kimmo. 1983. Two-level morphol- 
ogy: a general computational model \]or word-\]or'm 
recognition and production. University of Helsinki 
Department of General Linguistics, Helsinki. 
