Practical Bootstrapping of Morphological Analyzers 
Kemal Oflazer 1,2
1Department of Computer Engineering
Bilkent University
Bilkent, Ankara, 06533, Turkey
ko@crl.nmsu.edu
Sergei Nirenburg 2
2Computing Research Laboratory
New Mexico State University
Las Cruces, NM, 88003
sergei@crl.nmsu.edu
Abstract 
This paper presents a semi-automatic technique for 
developing broad-coverage finite-state morphological 
analyzers for any language. It consists of three 
components: elicitation of linguistic information from
humans, a machine learning bootstrapping scheme and 
a testing environment. The three components are ap- 
plied iteratively until a threshold of output quality is 
attained. The initial application of this technique is 
for morphology of low-density languages in the context 
of the Expedition project at NMSU CRL. This elicit- 
build-test technique compiles lexical and inflectional 
information elicited from a human into a finite state 
transducer lexicon and combines this with a sequence 
of morphographemic rewrite rules that is induced us- 
ing transformation-based learning from the elicited ex- 
amples. The resulting morphological analyzer is then 
tested against a test suite, and any corrections are fed 
back into the learning procedure that builds an im- 
proved analyzer. 
Introduction 
The Expedition project is devoted to fast "ramp-up" 
of machine translation systems from less studied, 
so-called "low-density" languages into English. One 
of the components that must be acquired and built 
during this process is a morphological analyzer for 
the source low-density language. Since we expect 
that the source language informant will not be 
well-versed in computational linguistics in general 
or in recent approaches to building morphological 
analyzers (e.g., \[Koskenniemi, 1983\], \[Antworth, 1990\],
\[Karttunen et al., 1992\], \[Karttunen, 1994\]) and the 
operation of state-of-the-art finite state tools (e.g., 
\[Karttunen, 1993\], \[Karttunen and Beesley, 1992\],
\[Karttunen et al., 1996\]) in particular, the generation 
of the morphological analyzer component has to be 
accomplished semi-automatically. The user
must be guided through a knowledge elicitation proce- 
dure for the knowledge required for the morphological 
analyzer. This is accomplished using the elicitation 
component of Expedition, the Boas system. As this 
task is not easy, we expect that the development of 
the morphological analyzer will be an iterative process, 
whereby the human informant will revise and/or 
refine the information previously elicited based on the 
feedback from test runs of the nascent analyzer.
The work reported in this paper describes the use 
of machine learning in the process of building and re- 
fining morphological analyzers. The main use of ma- 
chine learning in our current approach is in the au- 
tomatic learning of formal rewrite or replace rules for 
morphographemic changes from the examples provided
by the informant. This subtask of accounting for such 
phenomena is perhaps one of the more complicated as- 
pects of building an analyzer and by automating it we 
expect to gain a certain improvement in productivity. 
There have been a number of studies on induc- 
ing morphographemic rules from a list of inflected 
words and a root word list. Johnson \[1984\] presents 
a scheme for inducing phonological rules from surface 
data, mainly in the context of studying certain aspects 
of language acquisition. The premise is that languages 
have a finite number of alternations to be handled by 
morphographemic rules and a fixed number of contexts 
in which they appear; so if there is enough data, phono- 
logical rewrite rules can be generated to account for 
the data. Rules are ordered by some notion of "surfaci-
ness", and at each stage the most surfacy rule (the rule
with the most transparent context) is selected. Golding
and Thompson\[1985\] describe an approach for inducing 
rules of English word formation from a given corpus 
of root forms and the corresponding inflected forms. 
The procedure described there generates a sequence of 
transformation rules,1 each specifying how to perform
a particular inflection.
1Not in the sense it is used in transformation-based learning \[Brill, 1995\].
More recently, Theron and Cloete \[1997\] have presented
a scheme for obtaining two-level morphology
rules from a set of aligned segmented and surface pairs. 
They use the notion of string edit sequences assum- 
ing that only insertions and deletions are applied to a 
root form to get the inflected form. They determine 
the root form associated with an inflected form (and 
consequently the suffixes and prefixes) by exhaustively 
matching against all root words. The motivation is that 
"real" suffixes and prefixes will appear often enough in 
the corpus of inflected forms, so that, once frequently 
occurring suffixes and prefixes are identified, one can 
then determine the segmentation for a given inflected 
word by choosing the segmentation with the most fre- 
quently occurring affix segments and considering the 
remainder to be the root. While this procedure seems 
to be reasonable for a small root word list, the potential 
for "noisy" or incorrect alignments is quite high when 
the corpus of inflected forms is large and the proce- 
dure is not given any prior knowledge of possible seg- 
mentations. As a result, selecting the "correct" seg- 
mentation automatically becomes quite nontrivial. An 
additional complication is that allomorphs show up as 
distinct affixes and their counts in segmentations are 
not accumulated, which might lead to actual segmen- 
tations being missed due to fragmentation. The rule 
induction is not via a learning scheme: aligned pairs 
are compressed into a special data structure and traver- 
sals over this data structure generate morphographemic 
rules. Theron and Cloete have experimented with plu- 
ralization in Afrikaans, and the resulting system has 
shown about 94% accuracy on unseen words. 
Goldsmith \[1998\] has used an unsupervised learning 
method based on the minimum description length prin- 
ciple to learn the "morphology" of a number of lan- 
guages. What is learned is a set of "root" words and 
affixes, and common inflectional pattern classes. The 
system requires just a corpus of words in a language. In 
the absence of any root word list to use as a scaffolding, 
the shortest forms that appear frequently are assumed 
to be roots, and observed surface forms are then either 
generated by concatenative affixation of suffixes or by 
rewrite rules.2 Since the system has no notion of what
the roots and their part of speech values really are, and 
what morphological information is encoded by the af- 
fixes, these need to be retrofitted manually by a human 
(if one is building a morphological analyzer) who would 
have to weed through a large number of noisy rules. We 
feel that this approach, while quite novel, can be used 
to build real-world morphological analyzers only after 
substantial modifications are made. 
2Some of which may not make sense, but are necessary
to account for the data: for instance, a rule like "insert a word
final y after the root eas" is used to generate easy.
This paper is organized as follows: The next section 
very briefly describes the Boas project of which this 
work is a part. The subsequent sections describe the 
details of the approach, the morphological analyzer ar- 
chitecture, and the induction of morphographemic rules 
along with explanatory examples. Finally, we provide 
some conclusions and ideas for future work. 
The BOAS Project 
Boas \[Nirenburg, 1998, Nirenburg and Raskin, 1998\] is 
a semi-automatic knowledge elicitation system that 
guides a team of two people through the process of de-
veloping the static knowledge sources for a moderate-
quality, broad-coverage MT system from any "low-
density" language into English. Boas contains knowl-
edge about human language and means of realization of
its phenomena in a number of specific languages and is, 
thus, a kind of a "linguist in the box" that helps non-
professional acquirers with the task. In the spirit of the
goal-driven, "demand-side" approach to computational 
applications of language processing \[Nirenburg, 1996\], 
the process of acquiring this knowledge has been split 
into two steps: (i) acquiring the descriptive, declarative 
knowledge about a language and, (ii) deriving opera- 
tional knowledge (content for the processing engines) 
from this descriptive knowledge. A typical elicitation 
interaction screen of Boas is shown in Figure 1. 
An important aspect that we strive to achieve regard- 
ing these descriptive and operational pieces of informa- 
tion, be it elicited from human informants or acquired 
via machine learning is that they should be transpar- 
ent and human readable, and where necessary human 
maintainable and extendable, contrary to opaque and 
uninterpretable representations acquired by various sta- 
tistical learning paradigms. 
Before proceeding any further we would also like to 
state the aims and limitations of our approach. Our 
main goal is to significantly expedite the development
of a morphological analyzer. It is clear that for inflec- 
tional languages where each root word can be associated 
with a finite number of word forms, one can, with a lot 
of work, generate a list of word forms with associated 
morphological features encoded, and use this as a look- 
up table to analyze word forms in input texts. This is, 
however, something we would like to avoid, as it is time 
consuming, expensive and error-prone. We would prefer 
attempting to capture general morphophonological and 
morphographemic phenomena, and lexicon abstractions 
(say as inflectional paradigms) using an example driven 
technique, and essentially reduce the acquisition pro- 
cess to one of just assigning root or citation forms to 
one of these lexicon abstractions, with the automatic 
generation process to be described, doing the rest of 
Figure 1: A sample Boas elicitation screen
the work. This process will still be imperfect, as we ex- 
pect human informants to err in making their paradigm 
abstractions, and overlook details or exceptions. So, the 
whole process will be an iterative one, with convergence 
to a wide-coverage analyzer coming slowly at the be- 
ginning (where morphological phenomena and lexicon 
abstractions are being defined and tested), but signifi- 
cantly speeding up once wholesale root form acquisition 
starts. Since the generation of the operational content
(data files to be used by the morphological analyzer en-
gine) from the elicited descriptions is expected to take
a few minutes, feedback on operational performance can 
be provided very fast. There are also ways to utilize a 
partially acquired morphological analyzer to aid in the 
acquisition of open class root or citation forms. 
Human languages have many diverse morphological 
phenomena and it is not our intent at this point to have 
a universal architecture that can accommodate any and 
all phenomena. Rather, we propose a modular and ex- 
tensible architecture that can accommodate additional 
functionality in future incarnations of Boas. We also 
intend to limit the morphological processing to process- 
ing single tokens and deal with multi-token phenomena 
such as partial or full word reduplications with addi- 
tional machinery that we do not discuss here. 
The Elicit-Build-Test Paradigm 
In this paper we concentrate on operational content in 
the context of building a morphological analyzer. To 
determine this content, we integrate the information 
provided by the informant with automatically derived 
information. The whole process is an iterative one as il- 
lustrated in Figure 2, whereby the information elicited 
is transformed into operational data required by the 
generic morphological analyzer engine3 and the result-
ing analyzer is tested on a test corpus.4 Any discrep-
ancies between the output of the analyzer and the test
corpus are then analyzed and potential sources of er- 
rors are given as feedback to the elicitation process. 
Currently, this feedback is limited to morphographemic 
processes. 
The box in Figure 2 labeled Morphological Analyzer
Generation is the main component which takes
in the information elicited and generates a series
of regular expressions for describing the morphological
lexicon and morphographemic rules. The morphographemic
rules, describing changes in spelling as a
result of affixation operations, are induced from the examples
provided, by using transformation-based learning
\[Brill, 1995, Satta and Henderson, 1997\]. The result
is an ordered set of contextual replace or rewrite
rules, much like those used in phonology. We then use
error-tolerant finite state recognition \[Oflazer, 1996\] to
perform "reverse spelling correction" for identifying the
erroneous words the finite state analyzer accepts that
are (very) close to the correct words in the test corpus
that it rejects. The resulting pairs are then aligned, and
the resulting mismatches are identified and logged for
feedback purposes.
3We currently use XRCE finite state tools as our target
environment \[Karttunen et al., 1996\].
4The test corpus is also elicited independently, either from
the human informant or compiled from any on-line resources
for the language in question.
Morphological Analyzer Architecture 
We adopt the general approach advocated by Kart- 
tunen \[1994\] and build the morphological analyzer as 
the combination of several finite state transducers some 
of which are constructed directly from the elicited in- 
formation while others are constructed from the output 
of the machine learning stage. Since the combination of 
the transducers is computed at compile time, there are 
no run time overheads. The basic architecture of the 
morphological analyzer is depicted in Figure 3. The 
components of this generic architecture are as follows: 
The analyzer consists of the union of transducers each 
of which implements the morphological analysis process
for one paradigm. Each such transducer is the compo- 
sition of a number of components. These components 
are (from bottom to top) described below: 
1. The bottom component is an ordered sequence 
of morphographemic rules that are learned via 
transformation-based learning from the examples for 
the inflectional paradigm provided by the human in- 
formant. The rules are then composed into one finite 
state transducer \[Kaplan and Kay, 1994\]. 
2. The root and morpheme lexicon contains the root 
words and the affixes. We currently assume that 
all affixation is concatenative and that the lexi- 
con is described by a regular expression of the sort 
[ Prefixes ]* [ Roots ] [ Suffixes ]*.5
3. The morpheme to surfacy feature mapping essentially
maps morphemes to feature names but retains some 
encoding of the surface morpheme. Thus, allomorphs 
that encode the same feature would be mapped to 
different "surfacy" features. 
4. The lexical and surfacy constraints specify any conditions
to constrain the possibly overgenerating morphotactics
of the root and morpheme lexicon. These
constraints can be encoded using the root morphemes
and the surfacy features generated by the previous
mapping. The use of surfacy features enables reference
to zero morphemes which otherwise could not
have been used. For instance, if in some paradigm a
certain prefix does not co-occur with a certain suffix,
or always occurs with some other suffix, or if a certain
root/lemma of that paradigm has exceptional behavior
with respect to one or more of the affixes, or if the
allomorph that goes with a certain root depends on
the properties of the root, these are encoded at this
level as a finite state constraint.
5. The surfacy feature to feature mapping module maps
the surfacy representation of the affixes to symbolic
feature names; as a result, no surface information
remains except for the lemma or the root word. Thus,
for instance, allomorphs that encode the same feature
and map to different surfacy features, now map to the
same feature symbol.
6. The feature constraints specify any constraints
among the symbolic features. This is an alternative
functionality to that provided by lexical and surfacy
constraints to constrain morphotactics, but at this
level one refers to and constrains features as opposed
to surfacy features. This may provide a more natural
or convenient abstraction, especially for languages
with long distance morphotactic constraints.
These six finite state transducers are composed to yield
the transducer for the paradigm, and the union of
the resulting transducers produces one (possibly large)
transducer for morphological analysis where surface
strings applied at the lower side produce all possible
analyses at the upper side.
5We currently assume that we have at most one prefix
and at most one suffix, but this is not a fundamental limitation.
On the other hand, elicitation of complex morphotactics
for an agglutinative language like Turkish or Finnish
requires a more sophisticated elicitation machinery.
Figure 2: The Elicit-Build-Test Paradigm for Bootstrapping a Morphological Analyzer
[Flow diagram with the boxes: Start; Human Elicitation Process; Description of Morphology (paradigms, examples, exceptions, etc.); Morphological Analyzer Generation; Content for Morphological Analyzer Engine (lexicons, morphographemic rules); Corpus Compilation; Test Corpus; a testing step (MA Engine, Test Engine) reporting errors and omissions back to elicitation.]
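The layering described above can be illustrated with a small Python sketch. It is not the finite state implementation: it models only two of the mapping layers and one lexical/surfacy constraint with dictionaries and functions, for a hypothetical English plural paradigm. The morphemes (+s, +es), the surfacy feature names (+Pl_s, +Pl_es) and the sibilant constraint are all assumptions introduced for illustration.

```python
# Hypothetical plural paradigm; every symbol name here is an illustrative assumption.
MORPHEME_TO_SURFACY = {"+s": "+Pl_s", "+es": "+Pl_es"}   # allomorphs stay distinct
SURFACY_TO_FEATURE = {"+Pl_s": "+Pl", "+Pl_es": "+Pl"}   # allomorphs collapse

SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def surfacy_constraint(root, surfacy):
    """Lexical and surfacy constraint: the allomorph depends on the root."""
    expected = "+Pl_es" if root.endswith(SIBILANT_ENDINGS) else "+Pl_s"
    return surfacy == expected


def analyze(root, suffix):
    """Map a segmented (root, suffix) pair to lemma+feature analyses."""
    surfacy = MORPHEME_TO_SURFACY.get(suffix)
    if surfacy is None or not surfacy_constraint(root, surfacy):
        return []                      # ruled out by the lexicon or a constraint
    return [root + SURFACY_TO_FEATURE[surfacy]]


print(analyze("box", "+es"))  # the +es allomorph is licensed after a sibilant
print(analyze("box", "+s"))   # overgenerated combination, filtered out
```

The point of the intermediate surfacy layer is visible in `surfacy_constraint`: the constraint can tell the two allomorphs apart, while the final analysis only shows the single feature +Pl.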
Information Elicited from Human 
Informants 
The Boas environment elicits morphological informa- 
tion by asking the informant a series of questions about 
the paradigms of inflection. A paradigm abstracts to- 
gether lemmas (or root words) that essentially behave 
the same with respect to inflection, and captures infor- 
mation about the morphological features encoded and 
forms realizing these features, from which additional in- 
formation can be extracted. It is assumed that all lem- 
mas that belong to the same paradigm take the same 
set of inflectional affixes. It is expected that the roots 
and/or the affixes may undergo systematic or idiosyn- 
cratic morphographemic changes. It is also assumed 
that certain lemmas in a given paradigm may behave
in some exceptional way (for instance, contrary to all
other lemmas, a given lemma may not have one of the
inflected forms). A paradigm description also provides
the full inflectional patterns for one characteristic or 
distinguished lemma belonging to the paradigm, and 
additional examples for any other lemmas whose inflec- 
tional forms undergo nonstandard morphographemic 
Figure 3: General Architecture of the Morphological Analyzer
[Diagram: a stack of finite state transducers mapping a surface form (e.g., happiest) at the bottom to Lemma+Morphological Features (e.g., happy+Adj+Super) at the top.]
changes. If necessary, any lexical and feature con- 
straints can be encoded. Currently the provisions we 
have for such constraints are limited to writing regular 
expressions (albeit at a much higher level), but captur- 
ing such constraints using a more natural language (e.g., 
\[Ranta, 1998\]) can be stipulated for future versions. 
Preprocessing
The information elicited from the human informant is 
captured as a text file. The root word and the in- 
flection examples for the distinguished lemma are pro- 
cessed with an alignment algorithm to determine how 
the given root word aligns with each inflected form so 
that the edit distance is minimum. Once such align- 
ments are performed, the segments in the inflected form 
that are before and after the root alignment points 
are considered to be the prefixes and suffixes of the 
paradigm. These are then associated with the features 
given with the inflected form. 
Let us provide a simple example from a Russian verb 
inflection paradigm. The following information about 
the distinguished lemma in the paradigm is provided:6
ROOT rez Verb          LEMMA rezat'
FORM rezat' Inf        FORM reZ' Impsg
FORM reZ'te Imppl      FORM reZu Pres1sg
FORM reZeS Pres2sg     FORM reZet Pres3sg
FORM reZem Pres1pl     FORM reZete Pres2pl
FORM reZut Pres3pl     FORM rezali PastPl
FORM rezalo PastNsg    FORM rezala PastFsg
FORM rezal PastMsg
The alignment produces the following suffix feature
pairs for the suffix lexicon and morpheme to feature
mapping transduction:
+at' -> +Inf        +' -> +Impsg        +'te -> +Imppl
+u -> +Pres1sg      +eS -> +Pres2sg     +'et -> +Pres3sg
+em -> +Pres1pl     +ete -> +Pres2pl    +ut -> +Pres3pl
+ali -> +PastPl     +alo -> +PastNsg    +ala -> +PastFsg
+al -> +PastMsg
6Upper case characters and the single quote symbol encode
specific Russian characters. The transliteration is not
conventional.
We then produce the following segmentations to be 
used by the learning stage discussed in the next section. 
It should be noted that we (can) use the lemma form as the
morphological stem, so that the analysis we generate 
will have the lemma. Thus, some of the rules learned 
later will need to deal with this. 
(rezat'+at', rezat') 
(rezat'+'te, reZ'te) 
(rezat'+et, reZet) 
(rezat'+ete, reZete) 
(rezat'+ali, rezali) 
(rezat'+ala, rezala) 
(rezat'+', reZ')
(rezat'+eS, reZeS) 
(rezat'+em, reZem) 
(rezat'+ut, reZut) 
(rezat'+alo, rezalo) 
(rezat'+al, rezal) 
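The affix extraction step of the preprocessing can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the alignment algorithm actually used: it searches by brute force for the substring of the inflected form that is closest to the root in plain Levenshtein distance, and treats what precedes and follows it as prefix and suffix. The function names and the tie-breaking (prefer a span of the same length as the root, then the leftmost one) are our own choices.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute; unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def split_affixes(root, form):
    """Return (prefix, suffix) around the span of `form` closest to `root`."""
    best = None
    for start in range(len(form) + 1):
        for end in range(start, len(form) + 1):
            key = (edit_distance(root, form[start:end]),
                   abs(len(root) - (end - start)), start)
            if best is None or key < best[0]:
                best = (key, form[:start], form[end:])
    return best[1], best[2]


# A few of the forms elicited for the distinguished lemma (root "rez").
FORMS = [("rezat'", "Inf"), ("reZ'", "Impsg"), ("reZ'te", "Imppl"),
         ("reZu", "Pres1sg"), ("reZem", "Pres1pl"), ("rezali", "PastPl"),
         ("rezal", "PastMsg")]

suffix_features = {split_affixes("rez", form)[1]: feature for form, feature in FORMS}
for suffix, feature in suffix_features.items():
    print(f"+{suffix} -> +{feature}")
```

The same function also handles a prefix, e.g. `split_affixes("happy", "unhappiest")` separates `un` and `est` around the modified root.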
Learning Segmentation and 
Morphographemic Rules 
The lemma and suffix information elicited and extracted
as summarized above are used to construct
regular expressions for the lexicon component of each
paradigm.7 The example segmentations like those
above are fed into the learning module to induce mor- 
phographemic rules. 
7The result of this process is a script for the XRCE finite
state tool xfst. Large scale lexicons can be more efficiently
compiled by the XRCE tool lexc. We currently do not generate
lexc scripts, but it is trivial to do so.
Figure 4: Transformation-based learning of morphographemic
rules
Generating Candidate Rules from Examples 
The preprocessing stage yields a list of pairs of segmented
lexical forms and surface forms. The segmented
forms have the roots/lemmas and affixes, and
the affix boundaries are marked by the + symbol. This
list is then processed by a transformation-based learning
paradigm \[Brill, 1995, Satta and Henderson, 1997\]
as illustrated in Figure 4. The basic idea is that we con- 
sider the list of segmented words as our input and find 
transformation rules (expressed as contextual rewrite 
rules) to incrementally transform it into the list of sur- 
face forms. The transformation we choose at every iter- 
ation is the one that makes the list of segmented forms 
closest to the list of surface forms. 
The first step in the learning process is an initial 
alignment of pairs using a standard dynamic program- 
ming scheme. The only constraints in the alignment are 
that a + in the segmented lexical form is always aligned 
with an empty string on the surface side (notated by a 
0), and that a consonant (vowel) on one side is aligned 
with a consonant (vowel) or 0 on the other side. The 
alignment is also constrained by the fact that it should 
correspond to the minimum edit distance between the 
original lexical and surface forms.8 From this point on,
we will use a simple example from English to clarify our 
points. 
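The constrained alignment just described can be sketched as a small dynamic program in Python. This is an illustration under assumptions, not the implementation used in the system: the vowel set (with y counted as a vowel) and the tie-breaking order in the backtrace are our own choices, and ties between equally cheap alignments are resolved arbitrarily, as the text allows. The two pairs used below are the English examples introduced in the following paragraph.

```python
VOWELS = set("aeiouy")  # assumption: y patterns with the vowels


def align(lex, surf):
    """Align a segmented lexical form with a surface form.

    '+' may only be aligned with the empty string (written '0'); a consonant
    (vowel) may only be paired with a consonant (vowel) or with '0'.
    Returns two equal-length strings padded with '0'.
    """
    def pair_cost(a, b):
        if a == "+" or (a in VOWELS) != (b in VOWELS):
            return None                      # this pairing is not allowed
        return 0 if a == b else 1

    n, m = len(lex), len(surf)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i:                            # lexical symbol against '0'
                options.append(cost[i - 1][j] + 1)
            if j:                            # '0' against surface symbol
                options.append(cost[i][j - 1] + 1)
            if i and j:
                pc = pair_cost(lex[i - 1], surf[j - 1])
                if pc is not None:
                    options.append(cost[i - 1][j - 1] + pc)
            cost[i][j] = min(options, default=0)

    up, low, i, j = [], [], n, m             # backtrace from the end
    while i or j:
        if i and cost[i][j] == cost[i - 1][j] + 1:
            up.append(lex[i - 1]); low.append("0"); i -= 1
        elif j and cost[i][j] == cost[i][j - 1] + 1:
            up.append("0"); low.append(surf[j - 1]); j -= 1
        else:
            up.append(lex[i - 1]); low.append(surf[j - 1]); i -= 1; j -= 1
    return "".join(reversed(up)), "".join(reversed(low))


def count_errors(pairs):
    """Number of aligned positions where the two sides differ."""
    return sum(a != b for lex, surf in pairs for a, b in zip(*align(lex, surf)))


pairs = [("un+happy+est", "unhappiest"), ("shop+ed", "shopped")]
for lex, surf in pairs:
    print(*align(lex, surf), sep="\n")
print(count_errors(pairs))
```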
We assume that we have the pairs (un+happy+est, 
unhappiest) and (shop+ed, shopped) in our example 
base. We align these and determine the total number of 
"errors" in the segmented forms that we have to fix to 
make all match the corresponding surface forms. The 
initial alignment produces the aligned pairs: 
un+happy+est    shop0+ed
un0happi0est    shopp0ed
with a total of 5 errors. From each segmented pair we
generate rewrite rules of the sort9
u -> l || LeftContext _ RightContext ;
where u(pper) is a symbol in the segmented form, and
l(ower) is a symbol in the surface form. Rules are
generated only from those aligned symbol pairs which
are different. LeftContext and RightContext are simple
regular expressions describing contexts in the segmented
side (up to some small length), taking into account
also the word boundaries. For instance, from the
first aligned-pair example, this procedure would generate
rules such as (depending on the amount of left and
right context allowed)
y -> i || p _                  y -> i || p _ + e
y -> i || p _ + e s            y -> i || p _ + e s t
y -> i || p _ + e s t #        y -> i || p p _ + e
...
+ -> 0 || # u n _              + -> 0 || # u n _ h a p
+ -> 0 || _ e s t              ...
+ -> 0 || _ e s t #            ...
+ -> 0 || p p y _ e s t #
8We choose one if there are multiple legitimate alignments.
9We use the XRCE Finite State Tools regular expression
syntax \[Karttunen et al., 1996\]. For the sake of readability,
we will ignore the escape symbol (%) that should precede
any special characters (e.g., +) used in these rules.
The # symbol denotes a word boundary, to capture 
any word initial and final phenomena. The segmenta- 
tion rules (+ -> 0) require at least some minimal left 
or right context (usually longer than the minimal con- 
text for other rules for more accurate segmentation de- 
cisions). We also disallow contexts that consist only 
of a morpheme boundary, as such contexts are usu- 
ally not informative. It should also be noted that these 
are rules that transform a segmented form into a sur- 
face form (contrary to what may be expected for anal- 
ysis.) This lets us capture situations where multiple 
segmented forms may map to the same surface form, 
which would be the case when the language has mor- 
phological ambiguity. Thus, in a reverse look-up a given 
surface form may be interpreted in multiple ways if ap-
plicable.10
Since we have many examples of aligned pairs, it is 
likely that a given rule will be generated from many 
pairs. For instance, if the pairs (stop+ed, stopped) 
and (trip+ed, tripped) were also in the list, the gemination
rule 0 -> p || p _ + e d (along with certain
others) will also be generated from these examples. We
count how many times a rule is generated and associate
this number with the rule as its promise, meaning that
it promises to fix this many "errors" if it is selected to 
apply to the current list of segmented forms. 
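Candidate generation and promise counting can be sketched in Python as below. This is a simplified illustration, not the system's code: rules are represented as tuples (upper, lower, left context, right context), the aligned pairs are given directly, and two details are our own reading of the text, namely that a context made up only of + symbols is discarded and that segmentation rules need at least two context symbols. The helper names are hypothetical.

```python
from collections import Counter


def candidate_rules(aligned_lex, aligned_surf, max_ctx=3):
    """Generate (upper, lower, left, right) rules from one aligned pair."""
    rules = []
    for k, (u, l) in enumerate(zip(aligned_lex, aligned_surf)):
        if u == l:
            continue                                   # nothing to fix here
        # Contexts come from the segmented side, without the '0' padding.
        left_all = "#" + aligned_lex[:k].replace("0", "")
        right_all = aligned_lex[k + 1:].replace("0", "") + "#"
        for i in range(min(max_ctx, len(left_all)) + 1):
            for j in range(min(max_ctx, len(right_all)) + 1):
                left = left_all[len(left_all) - i:]
                right = right_all[:j]
                if (left + right).strip("+") == "":
                    continue                           # empty or boundary-only context
                if u == "+" and i + j < 2:
                    continue                           # segmentation needs more context
                rules.append((u, l, left, right))
    return rules


def show(rule):
    u, l, left, right = rule
    return f"{u} -> {l} || {' '.join(left)} _ {' '.join(right)}"


ALIGNED = [("shop0+ed", "shopp0ed"), ("stop0+ed", "stopp0ed"), ("trip0+ed", "tripp0ed")]

promise = Counter()
for lex, surf in ALIGNED:
    promise.update(candidate_rules(lex, surf))

gemination = ("0", "p", "p", "+ed")
print(show(gemination), promise[gemination])
```

With these three pairs the gemination rule is generated once per pair, while a rule with the longer left context `o p` is generated only from the first two.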
Generalizing Rules If information regarding
phoneme/grapheme classes in addition to consonant
and vowel classes is available, such as SIBILANTS = {s, x, z},
LABIALS = {b, m, ...}, HIGHVOWELS = {u, i, ...}, etc., it is
possible to generate more general rules. Such rules can
cover more cases, and the number of rules induced will
typically be smaller and cover more unseen cases. For
instance, in addition to a rule like 0 -> p || p _ + e,
the rules
0 -> p || CONSONANTS _ + e
0 -> p || p _ + VOWELS
0 -> p || LABIALS _ + e
0 -> p || CONSONANTS _ + VOWELS
can be generated where symbols such as CONSONANTS
or LABIALS stand for regular expressions denoting the
union of relevant symbols in the alphabet. The promise
scores of the generalized rules are found by adding the
promise scores of the original rules generating them. It
should also be noted that generalization will increase
substantially the number of candidate rules to be considered
during each iteration, but this is hardly a serious
issue, as the number of examples one would have per
paradigm would be quite small. The rules learned in
the process would be the most general set of rules that
do not conflict with the evidence in the examples.
10However, the learning procedure may fail to fix all errors
if among the examples there are cases where the same
segmented form maps to two different surface forms (generation
ambiguity).
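A minimal Python sketch of this generalization step follows. It assumes rules are tuples (upper, lower, left context, right context) with contexts given as tuples of symbols, and it assumes small hand-written symbol classes; the class memberships and the second input rule (a gemination before + i, as might come from a pair such as shop+ing, shopping) are hypothetical and only serve to show how promise scores add up.

```python
from collections import Counter
from itertools import product

# Assumed symbol classes; a real paradigm would elicit these from the informant.
CLASSES = {
    "CONSONANTS": set("bcdfghjklmnpqrstvwxz"),
    "VOWELS": set("aeiouy"),
    "LABIALS": set("bmp"),
}


def options(symbol):
    """A context symbol can stay as it is or be replaced by any class containing it."""
    return [symbol] + [name for name, members in CLASSES.items() if symbol in members]


def generalize(promise):
    """Sum the promise of every original rule into each rule that generalizes it."""
    general = Counter()
    for (u, l, left, right), score in promise.items():
        for ctx in product(*(options(s) for s in left + right)):
            g_left, g_right = ctx[:len(left)], ctx[len(left):]
            general[(u, l, g_left, g_right)] += score
    return general


promise = Counter({
    ("0", "p", ("p",), ("+", "e")): 3,   # e.g. from shop+ed, stop+ed, trip+ed
    ("0", "p", ("p",), ("+", "i")): 1,   # hypothetical, e.g. from shop+ing
})
general = generalize(promise)
for rule, score in general.most_common(3):
    print(rule, score)
```

Rules whose right context was generalized to VOWELS collect the promise of both original rules, which is why they outrank the originals.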
Selecting Rules At each iteration all the rules along 
with their promise scores are generated from the cur- 
rent state of the example pairs. The rules generated 
are then ranked based on their promise scores with the 
top rule having the highest promise. Among rules with 
the same promise score, we rank more general rules 
higher with generality being based on context subsump- 
tion. However, all the segmentation rules go to the 
bottom of the list, though within this group rules are 
still ranked based on decreasing promise and context 
generality. The reasoning for treating the segmenta- 
tion rules separately and later in the process, is that 
affixation boundaries constitute contexts for any mor- 
phographemic changes and they should not be elimi- 
nated if there are any (more) morphographemic phe- 
nomena to process. 
Starting with the top ranked rule we test each rule on 
the segmented component of the pairs using the finite 
state engine, to see how much the segmented forms are 
"fixed". The first rule that fixes as many "errors" as it
promises to fix gets selected and is added to the list of
rules generated, in order.11
The complete procedure for rule learning can now be
given as follows:
- Align surface and segmented forms;
- Compute total Error;
- while (Error > 0) {
    - Generate all possible rewrite rules
      (subject to context size limits);
    - Rank Rules;
    - while (there are more rules and
             a rule has not yet been selected) {
        - Select the next rule;
        - Tentatively apply rule to
          all the segmented forms;
        - Re-align the resulting segmented
          forms with the corresponding
          surface forms to see
          how many "errors" have
          been fixed;
        - If the number fixed is equal to
          what the rule promised to fix
          select this rule; }
    - Commit the changes
      performed by the rule and
      save alignments;
    - Reduce Error by the promise
      score of the selected rule; }
11Note that a rule may actually clobber other places, since
context checking is done only on the segmented form side,
and what it delivers may be different from what it promises,
as promise scores are also dependent on the surface side.
This procedure eventually generates an ordered sequence
of two groups of rewrite rules. The first group of
rules are for any morphographemic phenomena in the 
given set of examples, and the second group of rules 
handle segmentation. All these rules are composed in 
the order generated to construct the Morphographemic 
Rules transducer at the bottom of each paradigm (see 
Figure 3). 
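The learning loop can be sketched end to end in Python for a toy example. This is a simplified stand-in under several assumptions, not the system's implementation: rules are applied with a small string function instead of the finite state engine, generality is approximated by context length rather than context subsumption, class generalization is left out, the error count is recomputed rather than decremented, and the loop simply stops if no rule delivers what it promises. All function names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiouy")  # assumption: y patterns with the vowels


def align(lex, surf):
    """Constrained minimum-edit alignment; '0' marks an empty slot."""
    def pair_cost(a, b):
        if a == "+" or (a in VOWELS) != (b in VOWELS):
            return None
        return 0 if a == b else 1
    n, m = len(lex), len(surf)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            opts = [cost[i - 1][j] + 1] if i else []
            if j:
                opts.append(cost[i][j - 1] + 1)
            if i and j and pair_cost(lex[i - 1], surf[j - 1]) is not None:
                opts.append(cost[i - 1][j - 1] + pair_cost(lex[i - 1], surf[j - 1]))
            cost[i][j] = min(opts, default=0)
    up, low, i, j = [], [], n, m
    while i or j:
        if i and cost[i][j] == cost[i - 1][j] + 1:
            up.append(lex[i - 1]); low.append("0"); i -= 1
        elif j and cost[i][j] == cost[i][j - 1] + 1:
            up.append("0"); low.append(surf[j - 1]); j -= 1
        else:
            up.append(lex[i - 1]); low.append(surf[j - 1]); i -= 1; j -= 1
    return "".join(reversed(up)), "".join(reversed(low))


def total_errors(pairs):
    return sum(a != b for lex, surf in pairs for a, b in zip(*align(lex, surf)))


def candidates(alex, asurf, max_ctx):
    rules = []
    for k, (u, l) in enumerate(zip(alex, asurf)):
        if u == l:
            continue
        left_all = "#" + alex[:k].replace("0", "")
        right_all = alex[k + 1:].replace("0", "") + "#"
        for i in range(min(max_ctx, len(left_all)) + 1):
            for j in range(min(max_ctx, len(right_all)) + 1):
                left, right = left_all[len(left_all) - i:], right_all[:j]
                if (left + right).strip("+") and not (u == "+" and i + j < 2):
                    rules.append((u, l, left, right))
    return rules


def apply_rule(rule, lex):
    """Apply u -> l || left _ right everywhere, checking contexts on the input."""
    u, l, left, right = rule
    s, out = "#" + lex + "#", []
    for k, ch in enumerate(s):
        if u == "0":                                   # insertion before s[k]
            if k and s[:k].endswith(left) and s[k:].startswith(right):
                out.append(l)
            out.append(ch)
        elif ch == u and s[:k].endswith(left) and s[k + 1:].startswith(right):
            out.append("" if l == "0" else l)
        else:
            out.append(ch)
    return "".join(out)[1:-1]


def learn(examples, max_ctx=2):
    pairs, learned = list(examples), []
    while total_errors(pairs) > 0:
        promise = Counter()
        for lex, surf in pairs:
            promise.update(candidates(*align(lex, surf), max_ctx))
        # Segmentation rules last; then higher promise; then shorter context.
        ranked = sorted(promise, key=lambda r: (r[0] == "+", -promise[r],
                                                len(r[2]) + len(r[3])))
        before = total_errors(pairs)
        for rule in ranked:
            trial = [(apply_rule(rule, lex), surf) for lex, surf in pairs]
            if before - total_errors(trial) == promise[rule]:
                pairs, chosen = trial, rule            # commit this rule
                break
        else:
            break                                      # no rule kept its promise
        learned.append(chosen)
    return learned, pairs


rules, final = learn([("shop+ed", "shopped"), ("stop+ed", "stopped")])
print(rules)
```

On these two pairs the sketch first selects a gemination rule and only then a segmentation rule, mirroring the two groups of rules described above. With so little evidence the gemination rule comes out very general (it doubles p in any context), which is consistent with the remark that the most general rules not contradicted by the examples are preferred.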
Identifying Errors and Providing Feedback 
Once the Morphographemic Rules transducers are com-
piled and composed with the lexicon transducer that is
generated automatically from the elicited information,
we obtain the analyzer as the union of the individual 
transducers for each paradigm. It is now possible to 
test this transducer against a test corpus and to see if 
there are any surface forms in the test corpus that are 
not recognized by the generated analyzer. Our inten- 
tion is to identify and provide feedback about any minor 
problems that are due to a lack of examples that cover 
certain morphographemic phenomena, or to an error in 
associating a given lemma with a paradigm. 
Our approach here is as follows: we use the resulting
morphological analyzer with an error-tolerant finite
state recognizer engine \[Oflazer, 1996\]. For any
(correct) word in the test corpus that is not recognized,
we try to find words recognized by the analyzer that
are (very) close to the rejected word by error-tolerant
recognition, performing essentially a reverse spelling
correction. If the rejection is due to a small number (1
or 2) of errors, the erroneous words recognized by the
recognizer are aligned with the corresponding correct
words from the test corpus. These aligned pairs can
then be analyzed to see what the problems may be.
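The "reverse spelling correction" step can be approximated as follows. This is only a sketch: it searches an explicit list of recognized words with plain Levenshtein distance, whereas the engine cited above searches the transducer itself. The function names and the word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def near_misses(word, recognized, max_errors=2):
    """Return (candidate, distance) pairs for recognized words that lie
    within max_errors edits of the rejected word, closest first."""
    hits = [(cand, edit_distance(word, cand)) for cand in recognized]
    return sorted((h for h in hits if 0 < h[1] <= max_errors),
                  key=lambda h: (h[1], h[0]))


# Hypothetical stand-in for the set of strings the analyzer accepts.
recognized = ["pisZut", "pisZete", "rezali"]
print(near_misses("piSut", recognized))   # [('pisZut', 2)]
```

A rejected corpus word with no candidate inside the threshold is left unflagged, matching the stated intent to report only minor problems.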
An Example 
The examples generated from the above Russian
paradigm will induce the following rules, coded using
XRCE notation and composed with the .o. operator
(\[..\] indicates the empty string): 12
\[t -> \[..1 II _ ' + \] .o. 
Ca-> C..\] II _ " + \] .o. 
\[z-> \[..\] II _ ' + \] .o. 
\[' -> z II _+a\] .o. 
\[' -> Z II _ + e \] .o. 
\[' -> Z II + u \] .o. 
\[' -> \[..\] ~1 _ + ' \] .o. 
\[..\] -> Z 11 + ' \] .o. 
\[' -> \[..\] II 7- + _ e \] .o. 
\[+-> \[..\] II _ ' # \] .o. 
\[+ -> \[..\] II _ u # \] .o. 
\[+ -> \[..\] II _ e S # \] .o. 
\[+ -> \[..\] II _ a 1 # \] .o. 
\[+-> \[..\] II _ era# \] .o. 
\[+ -> \[..\] II _ e t # \] .o. 
\[+ -> \[..3 II _ u t # \] .o. 
C+ -> \[..\] II _ a t ' # \] .o. 
\[+ -> \[..\] II _ ' t e # \] .o. 
\[+ -> \[..\] II _ a 1 a # \] .o. 
\[+ -> \[..\] II _ a 1 i # \] .o. 
\[+ -> \[..\] II _ a 1 o # \] .o. 
\[+-> \[. • \] II _ e t e # \] 
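To see how such an ordered cascade behaves, the following sketch applies five of the rules above in sequence, in the generation direction (segmented form to surface form). It rests on assumptions: the segmented input rezat'+u is a guess at what the lexicon side delivers, regular-expression lookahead stands in for transducer composition, and the tuple layout and function names are hypothetical.

```python
import re

# (target, replacement, left context, right context); "#" marks the word end.
RULES = [
    ("t", "", "", "'+"),    # t -> [..] || _ ' +
    ("a", "", "", "'+"),    # a -> [..] || _ ' +
    ("z", "", "", "'+"),    # z -> [..] || _ ' +
    ("'", "Z", "", "+u"),   # ' -> Z    || _ + u
    ("+", "", "", "u#"),    # + -> [..] || _ u #
]


def apply_rule(form, rule):
    """Rewrite every occurrence of target found in the given context."""
    target, replacement, left, right = rule
    pattern = ((f"(?<={re.escape(left)})" if left else "")
               + re.escape(target)
               + (f"(?={re.escape(right)})" if right else ""))
    return re.sub(pattern, lambda m: replacement, form)


def generate(segmented, rules=RULES):
    """Apply the rules in order, each to the output of the previous one."""
    form = segmented + "#"
    for rule in rules:
        form = apply_rule(form, rule)
    return form.rstrip("#")


print(generate("rezat'+u"))   # reZu
print(generate("pisat'+u"))   # pisZu: the z-deletion rule never fires
```

The second call shows the overspecialization at work: with no z to delete, the stem keeps its s and still receives the Z that was learned only from rezat'.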
Note that since we require that the analyses contain
the verbal lemmas, a number of rules deal with the
lemma marker +at'. These rules, when composed with
the lexicon, will, for example, output
rezat'+Verb Par2 +Impsg 
in response to input reZu. Now, pisat' is a verb 
that was included in this paradigm, and running the 
corpus containing inflected forms of pisat' through 
the error-tolerant analyzer and subsequent alignment 
would raise the following flags (among others): 
Morp. -> pisZut pisZete pisZte piszali piszalo
File  -> piS0ut piS0ete piS0te pis0ali pis0alo
which indicate a consistent problem due either to a 
wrong paradigm selection for this verb or the lack of 
examples that would describe the s -> S alternation.
Since only examples from one verb were given, some of 
the rules were specialized to fixing the phenomena in 
those examples, which explains the spurious z/Z in the 
inflected forms of pisat'. Adding such examples for 
the verb to the example base or defining a new paradigm 
for this other verb in the next round solves these prob- 
lems. 
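The aligned flags shown above can be reproduced with a short sketch. It assumes that 0 is the padding symbol and uses `difflib.SequenceMatcher` for the alignment, which is not guaranteed to be a minimum-edit alignment in general; the function name is hypothetical.

```python
from difflib import SequenceMatcher


def align(produced, expected, pad="0"):
    """Pad both strings with `pad` so that mismatched spans line up."""
    matcher = SequenceMatcher(None, produced, expected, autojunk=False)
    top, bottom = [], []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        width = max(i2 - i1, j2 - j1)
        top.append(produced[i1:i2].ljust(width, pad))
        bottom.append(expected[j1:j2].ljust(width, pad))
    return "".join(top), "".join(bottom)


# Analyzer output versus the form found in the test file.
print(align("pisZut", "piSut"))     # ('pisZut', 'piS0ut')
print(align("piszali", "pisali"))   # ('piszali', 'pis0ali')
```

Columns where the two rows differ point at the offending symbols, here the spurious z/Z and the missing s -> S alternation.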
12 This example does not involve rule generalization.
Performance Issues. The process of generating a
morphological analyzer, once the descriptive data is
given, is very fast. Each paradigm can be processed
within seconds on a fast workstation, including the few
dozen iterations of rule learning from the examples.
A new version of the analyzer can be generated within
minutes and tested very rapidly on any test data. Thus,
none of the processes described in this paper constitutes 
a bottleneck in the elicitation process. 
Summary and Conclusions 
We have presented the highlights of our approach for 
automatically generating finite state morphological an- 
alyzers from information elicited from human infor- 
mants. Our approach uses transformation-based learn- 
ing to induce morphographemic rules from examples 
and combines these rules with the lexicon information 
elicited to compile the morphological analyzer. There 
are other opportunities for using machine learning in 
this process. For instance, one of the important issues 
in wholesale acquisition of open class items is that of de- 
termining which paradigm a given lemma or root word 
belongs to. From the examples given during the acqui- 
sition phase it is possible to induce a classifier that can 
perform this selection to aid the informant. 
We believe that this approach to machine learning of
a natural language processor, which involves a human
informant in an elicit-generate-test loop and uses
scaffolding provided by the human informant in machine
learning, is a very viable approach that avoids the noise
and opaqueness of other induction schemes. Our current
work involves using similar principles to induce (light) 
syntactic parsers in the Boas framework. 
Acknowledgements 
This research was supported in part by Contract
MDA904-97-C-3976 from the US Department of De- 
fense. We also thank XRCE for providing the finite 
state tools. 

References 

\[Antworth, 1990\] Evan L. Antworth. PC-KIMMO: A 
two-level processor for Morphological Analysis. Sum- 
mer Institute of Linguistics, Dallas, Texas, 1990. 

\[Brill, 1995\] Eric Brill. Transformation-based error- 
driven learning and natural language processing: A 
case study in part-of-speech tagging. Computational 
Linguistics, 21(4):543-566, December 1995. 

\[Golding and Thompson, 1985\] Andrew Golding and 
Henry S. Thompson. A morphology component for 
language programs. Linguistics, 23, 1985.

\[Goldsmith, 1998\] John Goldsmith. Unsupervised
learning of the morphology of a natural lan- 
guage. Unpublished Manuscript, 1998. 

\[Johnson, 1984\] Mark Johnson. A discovery proce- 
dure for certain phonological rules. In Proceedings 
of 10th International Conference on Computational
Linguistics-COLING'84, 1984. 

\[Kaplan and Kay, 1994\] Ronald M. Kaplan and Martin 
Kay. Regular models of phonological rule systems. 
Computational Linguistics, 20(3):331-378, Septem- 
ber 1994. 

\[Karttunen and Beesley, 1992\] Lauri Karttunen and 
Kenneth R. Beesley. Two-level rule compiler. Technical
Report, XEROX Palo Alto Research Center,
1992. 

\[Karttunen et al., 1992\] Lauri Karttunen, Ronald M. 
Kaplan, and Annie Zaenen. Two-level morphology
with composition. In Proceedings of the 15th International
Conference on Computational Linguistics, volume 1,
pages 141-148, Nantes, France, 1992. International
Committee on Computational Linguistics.

\[Karttunen et al., 1996\] Lauri Karttunen, Jean-Pierre
Chanod, Gregory Grefenstette, and Anne Schiller. 
Regular expressions for language engineering. Nat- 
ural Language Engineering, 2(4):305-328, 1996. 

\[Karttunen, 1993\] Lauri Karttunen. Finite-state lexicon
compiler. XEROX Palo Alto Research Center,
Technical Report, April 1993.

\[Karttunen, 1994\] Lauri Karttunen. Constructing lexical
transducers. In Proceedings of the 16th International
Conference on Computational Linguistics,
volume 1, pages 406-411, Kyoto, Japan, 1994. Inter- 
national Committee on Computational Linguistics. 

\[Koskenniemi, 1983\] Kimmo Koskenniemi. Two-level 
morphology: A general computational model for 
word form recognition and production. Publication 
No: 11. Department of General Linguistics, Univer- 
sity of Helsinki, 1983. 

\[Nirenburg and Raskin, 1998\] Sergei Nirenburg and 
Victor Raskin. Universal grammar and lexis for quick 
ramp-up of MT systems. In Proceedings of First International
Conference on Language Resources and
Evaluation, 1998. 

\[Nirenburg, 1996\] Sergei Nirenburg. Supply-side and 
demand-side lexical semantics. In Proceedings of the 
Workshop on Breadth and Depth of Semantic Lexi- 
cons at the 34th Annual Meeting of the Association 
for Computational Linguistics, 1996. 

\[Nirenburg, 1998\] Sergei Nirenburg. Project Boas: "A 
Linguist in a Box" as a multi-purpose language re- 
source. In Proceedings of COLING'98, 1998. 

\[Oflazer, 1996\] Kemal Oflazer. Error-tolerant finite- 
state recognition with applications to morphological 
analysis and spelling correction. Computational Lin- 
guistics, 22(1):73-90, March 1996. 

\[Ranta, 1998\] Aarne Ranta. A multilingual natural lan- 
guage interface to regular expressions. In Lauri Kart- 
tunen and Kemal Oflazer, editors, Proceedings of 
International Workshop on Finite State Methods in 
Natural Language Processing, FSMNLP'98, 1998. 

\[Satta and Henderson, 1997\] Giorgio Satta and 
John C. Henderson. String transformation learning.
In Proceedings of ACL/EACL '97, 1997.

\[Theron and Cloete, 1997\] Pieter Theron and Ian 
Cloete. Automatic acquisition of two-level morpho- 
logical rules. In Proceedings of 5th Conference on 
Applied Natural Language Processing, 1997. 
