Robust, Applied Morphological Generation .... ......... ..... . _ 
Guido Minnen John Carroll Darren Pearce 
Cognitive and Computing Sciences 
University of Sussex 
Brighton BN1 9QH, UK 
{fir stname, lastname } @cogs. susx. ac. uk 
Abstract 
In practical natural  generation sys- 
tems it is often advantageous to have a separate 
component that deals purely with morphologi- 
cal processing. We present such a component: a 
fast and robust morphological generator for En- 
glish based on finite-state techniques that gen- 
erates a word form given a specification of the 
lemma, part-of-speech, and the type of inflec- 
tion required. We describe how this morpholog- 
ical generator is used in a prototype system for 
automatic simplification of English newspaper 
text, and discuss practical morphological and 
orthographic issues we have encountered in gen- 
eration of unrestricted text within this applica- 
tion. 
1 Introduction 
Most approaches to natural  generation 
(NLG) ignore morphological variation during 
word choice, postponing the computation of the 
actual word forms to be output to a final stage, 
sometimes termed 'linearisation'. The advan- 
tage of this setup is that the syntactic/lexical 
realisation component does not have to consider 
all possible word forms corresponding to each 
lemma (Shieber et al., 1990). In practice, it is 
advantageous to have morphological generation 
as a postprocessing component that is separate 
from the rest of the NLG system. A benefit 
is that since there are no competing claims on 
the representation framework from other types 
of linguistic and non-linguistic knowledge, the 
developer of the morphological generator is fl'ee 
to express morphological information in a per- 
spicuous and elegant manner. A further bene- 
fit is that localising morphological knowledge in 
a single component facilitates more systematic 
and reliable updating. From a software engi- 
neering perspective, modularisntion is likely to 
reduce system development costs and increase 
system reliability. As an individual module, 
the morphological generator will be more easily 
shareable between several different NLG appli- 
cations, and integrated into new ones. Finally, 
such a generator can be used on its own in other 
types of applications that do not contain a stan- 
dard NLG syntactic/lexical realisation compo- 
nent, such as text simplification (see Section 3). 
In this paper we describe a fast and robust 
generator for the inflectional morphology of En- 
glish that generates a word form given a speci- 
fication of a lemma, a part-of-speech (PoS) la- 
bel, and an inflectional type. The morphologi- 
cal generator was built using data from several 
large corpora and machine readable dictionar- 
ies. It does not contain an explicit lexicon or 
word-list, but instead comprises a set of mor- 
phological generalisations together with a list of 
exceptions for specific (irregular) word forms. 
This organisation into generalisations and ex- 
ceptions can save time and effort in system de- 
velopment since the addition of new vocabu- 
lary that has regular morphology does not re- 
quire any changes to the generator. In addition, 
the generalisation-exception architecture can be 
used to specify--and also override--preferences 
in cases where a lemma has more than one pos- 
sible surface word form given a particular inflec- 
tional type and PoS label. 
The generator is packaged up as a Unix 'fil- 
ter', making it easy to integrate into applica- 
tions. It is based on efficient finite-state tech- 
niques, and is implemented using the widely 
available Unix Flex utility (a reimplementation 
of the AT&T Unix Lex tool) (Levine et al., 
1992). The generator is freely available to the 
NLG research comnmnity (see Section 5 below). 
The paper is structured ms follows. Section 2 
describes the morphological generator and eval- 
201 
uates its accuracy. Section 3 outlines how the (1) {h}+"s+s_.N" 
generator is put ..to use in.a prototy.p~.system for.:.: ........... :...: ~-.:=..{a=e..tnxnfnp_~ord_:form (1, !~es"-) ).; } 
automatic simplification of text, and discusses 
a number of practical morphological and ortho- 
graphic issues that we have encountered. Sec- 
tion 4 relates our work to that of others, and we 
conclude (Section 5) with directions for future 
work. 
2 Morphological Generation 
2.1 The Generator 
The morphological generator covers the produc- 
tive English affixes s for the plural form of nouns 
and the third person singular present tense of 
verbs, and ed for the past tense, en for the 
past participle, and ing for the present partici- 
ple forms of verbs. 1 The generator is imple- 
mented in Flex. 
The standard use of Flex is to construct 
'scanners', programs that recognise lexical pat- 
terns in text (Levine et al., 1992). A Flex 
description--the high-level description of a 
scanner that Flex takes as input--consists of a 
set of 'rules': pairs of regular expression pat- 
terns (which Flex compiles into deterministic 
finite-state automata (Aho et al., 1986)), and 
actions consisting of arbitrary C code. Flex cre- 
ates as output a C program which at run-time 
scans a text looking for occurrences of the reg- 
ular expressions. Whenever it finds one, it exe- 
cutes the corresponding C code. Flex is part of 
the Berkeley Unix distribution and as a result 
Flex programs are very portable. The standard 
version of Flex works with any ISO-8559 char- 
acter set; Unicode support is also available. 
The morphological generator expects to re- 
ceive as input a sequence of tokens of the form 
lemma+inflection_label, where lemma specifies 
tim lemma of the word form to be generated, 
inflection specifies the type of inflection (i.e. s, 
ed~ cn or ing), and label specifies the PoS of the 
word form. The PoS labels follow the same pat- 
tern as in the Lancaster CLAWS tag sets (Gar- 
side et al., 1987; Burnard, 1995)~ with noun tags 
starting with N, etc. The symbols + and _ are 
delimiters. 
An example of a morphological generator rule 
is given in (1). 
~\Ve do not currently cover comparative and superla- 
tive forms of adjectives or adverbs since their productiv- 
ity is much less predictable. 
The left-hand side of the rule is a regular expres- 
sion. The braces signify exactly one occurrence 
of an element of the character set abbreviated 
by the symbol h; we assume here that h abbre- 
viates the upper and lower case letters of the al- 
phabet. The next symbol + specifies that there 
..... must. be a..sequence of one or.=more characters, 
each belonging to the character set abbreviated 
by h. Double quotes indicate literal character 
symbols. The right-hand side of the rule gives 
the C code to be executed when an input string 
matches the regular expression. When the Flex 
rule matches the input address+s_N, for exam- 
ple, the C function np_word_.form (defined else- 
where in the generator) is called to determine 
the word form corresponding to the input: the 
function deletes the inflection type and PoS la- 
bel specifications and the delimiters, removes 
the last character of the lemma, and finally at- 
taches the characters es; the word form gener- 
ated is thus addresses. 
Of course not all plural noun inflections are 
correctly generated by the rule in (1) since 
there are many irregularities and subregular- 
ities. These are dealt with using additional, 
more specific, rules. The order in which these 
rules are applied to the input follows exactly 
the order in which the rules appear in the Flex 
description. This makes for a very simple and 
perspicuous way to express generalizations and 
exceptions. For instance, the rule in (2) gener- 
ates the plural form of many English nouns that 
originate from Latin, such as stimulus. 
(2) 
{return(np_word_form(2, "i") ) ; } 
With the input stimulus+s_N, the output is 
stimuli rather than the incorrect *stimuluses 
that would follow from the application of the 
more general rule in (1). By ensuring that this 
rule precedes the rule in (1) in the description, 
nouns such as stimulus get the correct plural 
form inflection. Some other words in this class, 
though, do not have the Latinate plural form 
(e.g. *boni as a plural form of bonus); in these 
cases the generator contains rules specifying the 
correct forms as exceptions. 
202 
2.2 Inflectional Preferences 
The rules constitutingthe geiaerator do not nec- 
essarily have to be mutually exclusive, so they 
can be used to capture the inflectional morphol- 
ogy of lemmata that have more than one pos- 
sible inflected form given a specific PoS label 
and inflectional type. An example of this is the 
multiple inflections of the noun cactus, which 
has not only the Latinate plural form cacti but 
also the English~ptura4.form.cactuses: , In addi- 
tion, inflections of some words differ according 
to dialect. For example, the past participle form 
of the verb to bear is borne in British English, 
whereas in American English the preferred word 
form is born. 
In cases where there is more than one possi- 
ble inflection for a particular input lemma, the 
order of the rules in the Flex description de- 
termines the inflectional preference. For exam- 
ple, with the noun cactus, the fact that the rule 
in (2) precedes the one in (1) causes the gener- 
ator to output the word form cacti rather than 
cactuses even though both rules are applicable. 2 
It is important to note, though, that the gen- 
erator will always choose between multiple in- 
flections: there is no way for it to output all 
possible word forms for a particular input. 3 
2.3 Consonant Doubling 
An important issue concerning morphological 
generation that is closely related to that of 
inflectional preference is consonant doubling. 
This phenomenon, occurring mainly in British 
English, involves the doubling of a consonant 
at the end of a lemma when the lemma is in- 
flected. For example, the past tense/participle 
inflection of the verb to travel is travelled in 
British English, where the final consonant of the 
lemma is doubled before the suffix is attached. 
In American English the past tense/participle 
inflection of the verb to travel is usually spelt 
traveled. Consonant doubling is triggered on 
the basis of both orthographic and phonologi- 
cal information: when a word ends in one vowel 
-"Rule choice based on ordering in the description can 
in fact be overridden by arranging for the second or sub- 
sequent match to cover a larger part of the input so that 
the longest match heuristic applies (Levine et al., 1992). 
But note that the rules in (t) and (2) will always match 
the same input span. 
3Flex does not allow the use of rules that have iden- 
tical left-hand side regular expressions. 
followed by one consonant and the last part of 
..-: =the, word is stressedyin-general:.the ~eonsona, nt 
is doubled (Procter, 1995). However there are 
exceptions to this, and in any case the input to 
the morphological generator does not contain 
information about stress. 
Consider the Flex rule in (3), where the sym- 
bols C and V abbreviate the character sets con- 
sisting of (upper and lower case) consonants and 
.vowels,. respectively. 
(3) {A}*{C}{V}"t+ed_V °' 
{return(cb_wordf orm(O, '%", "ed" ) ) ; } 
Given the input submit+ed_ V this rule correctly 
generates submitted. However, the verb to ex- 
hibit does not undergo consonant doubling so 
this rule will generate, incorrectly, the word 
form exhibitted. 
In order to ensure that the correct inflection 
of a verb is generated, the morphological gener- 
ator uses a list of (around 1,100) lemmata that 
allow consonant doubling, extracted automati- 
cally from the British National Corpus (BNC; 
Burnard, 1995). The list is checked before in- 
flecting verbs. Given the fact that there are 
many more verbs that do not allow consonant 
doubling, listing the verbs that do is the most 
economic solution. An added benefit is that if a 
lemma does allow consonant doubling but is not 
included in the list then the word form gener- 
ated will still be correct with respect to Ameri- 
can English. 
2.4 Deriving the Generator 
The morphological generator comprises a set of 
of approximately 1,650 rules expressing mor- 
phological regularities, subregularities, and ex- 
ceptions for specific words; also around 350 lines 
of C/Flex code for program initialisation and 
defining the functions called by the rule actions. 
The rule set is in fact obtained by automati- 
cally reversing a morphological analyser. This 
is a much enhanced version of the analyser orig- 
inally developed for tile GATE system (Cun- 
ningham et al., 1996). Minnen and Carroll (Un- 
der review) describe in detail how the reversal is 
performed. The generator executable occupies 
around 700Kb on disc. 
The analyser--and therefore the generator-- 
includes exception lists derived from WordNet 
(version 1.5: Miller et al., 1993). In addi- 
tion. we have incorporated data acquired semi- 
203 
automatically from the following corpora and 
machine readable,dictionaries: the..LOB. cor- 
pus (Garside et al., 1987), the Penn Tree- 
bank (Marcus et al., 1993), the SUSANNE cor- 
pus (Sampson, 1995), the Spoken English Cor- 
pus (Taylor and Knowles, 1988), the Oxford 
Psycholinguistic Database (Quinlan, 1992), and 
the "Computer-Usable" version of the Oxford 
Advanced Learner's Dictionary of Current En- 
glish (OALDCE; Mitton, 1.9.92). 
2.5 Evaluation 
Minnen and Carroll (Under review) report an 
evaluation of the accuracy of the morphologi- 
cal generator with respect to the CELEX lexi- 
cal database (version 2.5; Baayen et al., 1993). 
This threw up a small number of errors which 
we have now fixed. We have rerun the CELEX- 
based evaluation: against the past tense, past 
and present participle, and third person singu- 
lar present tense inflections of verbs, and all plu- 
ral nouns. After excluding multi-word entries 
(phrasal verbs, etc.) we were left with 38,882 
out of the original 160,595 word forms. For each 
of these word forms we fed the corresponding 
input (derived automatically from the lemma- 
tisation and inflection specification provided by 
CELEX) to the generator. 
We compared the generator output with the 
original CELEX word forms, producing a list 
of mistakes apparently made by the generator, 
which we then checked by hand. In a number 
of cases either the CELEX lemmatisation was 
wrong in that it disagreed with the relevant en- 
try in the Cambridge International Dictionary 
of English (Procter, 1995), or the output of the 
generator was correct even though it was not 
identical to the word form given in CELEX. 
We did not count these cases as mistakes. We 
also found that CELEX is inconsistent with re- 
spect to consonant doubling. For example, it 
includes the word form pettifogged, 4 whereas 
it omits many consonant doubled words that 
are much more common (according to counts 
from the BNC). For example, the BNC con- 
tains around 850 occurrences of the word form 
programming tagged as a verb, but this form 
is not present in CELEX. The form programing 
does occur in CELEX, but does not in the BNC. 
4A rare word, meaning to be overly concerned with 
small, unimportant details. 
We did not count these cases as mistakes either. 
:Of~he :r~m~i.ning: 359'.mist~kes(:346:~c0neern6d 
word forms that do not occur at all in the 100M 
words of the BNC. We categorised these as irrel- 
evant for practical applications and so discarded 
them. Thus the type accuracy of the morpho- 
logical analyser with respect to the CELEX lex- 
ical database is 99.97%. The token accuracy is 
99.98% with respect to the 14,825,661 relevant 
.tokens .i.mthe BNC .(i.e. ,at.rate ,of two errors per 
ten thousand words). 
We tested the processing speed of the gener- 
ator on a Sun Ultra 10 workstation. In order 
to discount program startup times (which are 
anyway only of the order of 0.05 seconds) we 
used input files of 400K and 800K tokens and 
recorded the difference in timings; we took the 
averages of 10 runs. Despite its wide coverage 
the morphological generator is very fast: it gen- 
erates at a rate of more than 80,000 words per 
second. 5 
3 The Generator in an Applied 
System 
3.1 Text Simplification 
The morphological generator forms part of a 
prototype system for automatic simplification 
of English newspaper text (Carroll et al., 1999). 
The goal is to help people with aphasia (a lan- 
guage impairment typically occurring as a re- 
sult of a stroke or head injury) to better un- 
derstand English newspaper text. The system 
comprises two main components: an analysis 
module which downloads the source newspaper 
texts from the web and computes syntactic anal- 
yses for the sentences in them, and a simpli- 
fication module which operates on the output 
of the analyser to improve the comprehensit)il- 
ity of the text. Syntactic simplification (Can- 
ning and Tait, 1999) operates on the syntax 
trees produced in the analysis phase, for exam- 
ple converting sentences in the passive vdice to 
active, and splitting long sentences at appropri- 
ate points. A subsequent lexical simplification 
stage (Devlin and Tait, 1998) replaces difficult 
or rare content words with simpler synonyms. 
The analysis component contains a morpho- 
logical analyser, and it is the base forms of 
sit is likely that a modest increase in speed could be 
obtained by specifying optimisation levels in Flex and 
gcc that are higher than the defaults. 
204 
words that are passed through the system; this with a list of exceptions (e.g. heir, unanimous) 
• eases the task of.the texic~l.simplification toodo ,: =,eollecCed:using.the:pronunciation information in 
ule. The final processing stage in the system 
is therefore morphological generation, using the 
generator described in the previous section. 
3.2 Applied Morphological Generation 
We are currently testing the components of the 
simplification system on a corpus of 1000 news 
the OALDCE, supplemented by-further cases 
(e.g. unidimensional) found in the BNC. In the 
case of abbreviations or acronyms (recognised 
by the occurrence of non-word-initial capital let- 
ters and trailing full-stops) we key off the pro- 
nunciation of the first letter considered in isola- 
tion. 
stories downloaded from .the :Sunde!T!and Echo ....... Simi!arlyi .the orthography .of .the .genit.ive 
(a local newspaper in North-East England). In marker cannot be determined without taking 
our testing we have found that newly encoun- 
tered vocabulary only rarely necessitates any 
modification to the generator (or rather the 
analyser) source; if the word has regular mor- 
phology then it is handled by the rules express- 
ing generalisations. Also, a side-effect of the fact 
that the generator is derived from the analyser 
is that the two modules have exactly the same 
coverage and are guaranteed to stay in step with 
each other. This is important in the context of 
an applied system. The accuracy of the gener- 
ator is quite sufficient for this application; our 
experience is that typographical mistakes in the 
original newspaper text are much more common 
than errors in morphological processing. 
3.3 Orthographic Postprocessing 
Some orthographic phenomena span more than 
one word. These cannot be dealt with in mor- 
phological generation since this works strictly a 
word at a time. We have therefore implemented 
a final orthographic postpmcessing stage. Con- 
sider the sentence: 6 
(4) *Brian Cookman is the attraction at 
the King's Arms on Saturday night 
and he will be back on Sunday night 
for a acoustic jam session. 
This is incorrect orthographically because the 
determiner in the final noun phrase should be 
an, as in an acoustic jam session. In fact an 
nmst be used if the following word starts with 
a vowel sound, and a otherwise. We achieve 
this, again using a filter implemented in Flex, 
with a set of general rules keying off the next 
word's first letter (having skipped any inter- 
vening sentence-internal punctuation), together 
6This sentence is taken from the story "The demise 
of Sunderland's Vaux Breweries is giving local musicians 
a case of the blues" published in the Sunderland Ech, o 
on 26 August 1999. 
context into account, since it depends on the 
identity of the last letter of the preceding word. 
In the sentence in (4) we need only eliminate 
the space before the genitive marking, obtain- 
ing King's Arms. But, following the newspaper 
style guide, if the preceding word ends in s or z 
we have to 'reduce' the marker as in, for exam- 
ple, Stacey Edwards' skilful fingers. 
The generation of contractions presents more 
of a problem. For example, changing he will 
to he'll would make (4) more idiomatic. But 
there are cases where this type of contraction is 
not permissible. Since these cases seem to be 
dependent on syntactic context (see Section 4 
below), and we have syntactic structure from 
the analysis phase, we are in a good position 
to make the correct choice. However, we have 
not yet tackled this issue and currently take the 
conservative approach of not contracting in any 
circumstances. 
4 Related Work 
We are following a well-established line of re- 
search into the use of finite-state techniques for 
lexical and shallow syntactic NLP tasks (e.g. 
Karttunen et al. (1996)). Lexical transduc- 
ers have been used extensively for morphological 
analysis, and in theory a finite-state transducer 
implementing an analyser can be reversed to 
produce a generator. However, we are not aware 
of published research on finite-state morpho- 
logical generators (1) establishing whether in 
practice they perform with similar efficiency to 
morphological analysers, (2) quantifying their 
type/token accuracy with respect to an inde- 
pendent, extensive 'gold standard', and (3) in- 
dicating how easily they can be integrated 
into larger systems. Furthermore, although a 
number of finite-state compilation toolkits (e.g. 
t(arttunen (1994)) are publicly available or can 
205 
be licensed for research use, associated large- length trailing strings and concatenating suf- 
.scale l.inguis tic .,descriptions=-~ar,,,exa,mple=~n,-:.:~...~.. fixes ........ All ~mo~phologicaUy,..subreguta,r-~ :forms. 
glish morphological lexicons--are usually com- 
mercial products and are therefore not freely 
available to the NLG research community. 
The work reported here is-also related to 
work on lexicon representation and morpho- 
logical processing using the DATR representa- 
tion  (Cahill, 1993; Evans and Gazdar, 
must be entered explicitly in the lexicon, as well 
as irregular ones. The situation is similar in 
FUF/SURGE, morphological generation in the 
SURGE grammar (Elhadad and Robin, 1996) 
being performed by procedures which inspect 
lemma endings, strip off trailing strings when 
appropriate, and concatenate suffixes. 
.1996). 
cal and more of an engineering perspective, fo- 
cusing on morphological generation in the con- 
text of wide-coverage practical NLG applica- 
tions. There are also parallels to research in 
the two-level morphology framework (Kosken- 
niemi, 1983), although in contrast to our ap- 
proach this framework has required exhaustive 
lexica and hand-crafted morphological (unifi- 
cation) grammars in addition to orthographic 
descriptions (van Noord, 1991; Ritchie et al., 
1992). The SRI Core Language Engine (A1- 
shawi, 1992) uses a set of declarative segmen- 
tation rules which are similar in content to our 
rules and are used in reverse to generate word 
forms. The system, however, is not freely avail- 
able, again requires an exhaustive stem lexicon, 
and the rules are not compiled into an efficiently 
executable finite-state machine but are only in- 
terpreted. 
The work that is perhaps the most similar 
in spirit to ours is that of the LADL group, in 
their compilation of large lexicons of inflected 
word forms into finite-state transducers (Mohri, 
1996). The resulting analysers run at a com- 
parable speed to our generator and the (com- 
pacted) executables are of similar size. How- 
ever, a full form lexicon is unwieldy and incon- 
venient to update: and a system derived from it 
cannot cope gracefully with unknown words be- 
cause it does not contain generalisations about 
regular or subregular morphological behaviour. 
The morphological components of current 
widely-used NLG systems tend to consist of 
hard-wired procedural code that is tightly 
bound to the workings of the rest of the system. 
For instance, the Nigel grammar (Matthiessen, 
1984) contains Lisp code that classifies verb, 
noun and adjective endings, and these classes 
are picked up by further code inside the t<PML 
system (Bateman, 2000) itself which performs 
inflectional generation by stripping off variable 
However,.. we,.~adopt .less ..of .a~.theoreti~ .... -..,.,.Jn~ eLtr~ent~.,NI,G~-.systerns,~or.#hographic 4nfor- 
mation is distributed throughout the lexicon 
and is applied via the grammar or by hard-wired 
code. This makes orthographic processing dif- 
ficult to decouple from the rest of the system, 
compromising maintainability and ease of reuse. 
For example, in SURGE, markers for alan us- 
age can be added to lexical entries for nouns to 
indicate that their initial sound is consonant- 
or vowel-like, and is contrary to what their or- 
thography would suggest. (This is only a partial 
solution since adjectives, adverbs--and more 
rarely other parts of speech--can follow the in- 
definite article and thus need the same treat- 
ment). The appropriate indefinite article is in- 
serted by procedures associated with the gram- 
mar. In DRAFTER-2 (Power et al., 1998), an 
alan feature can be associated with any lex- 
ical entry, and its value is propagated up to 
the NP level through leftmost rule daughters in 
the grammar (Power, personal communication). 
Both of these systems interleave orthographic 
processing with other processes in realisation. 
In addition, neither has a mechanism for stat- 
ing exceptions for whole subclasses of words, for 
example those starting us followed by a vowel-- 
such as use and usual--which must be preceded 
by a. KPML appears not to perform this type 
of processing at all. 
We are not aware of any literature describing 
(practical) NLG systems that generate contrac- 
tions. However, interesting linguistic research in 
this direction is reported by Pullmn and Zwicky 
(In preparation),. This work investigates tile un- 
derlying syntactic structure of sentences that 
block auxiliary reductions, for example those 
with VP ellipsis as in (5). 
(5) *She's usually home wh, en he's. 
206 
5 Conclusions provided to us by the University of Sheffield 
We have described a generatorf0r English in:: ' G.A~E ~projoet-,...:(3hris -Brew,,.Dale- Gerdem:an.n~.. 
flectional morphology. The main features of the Adam Kilgarriff and Ehud Reiter have sug- 
generator are: 
wide coverage and high accuracy It in- 
corporates data from several large corpora 
and machine readable dictionaries. An 
evaluation has shown the error rate to be 
very low. 
robustness The generator does not contain 
an explicit lexicon or word-list, but instead 
comprises a set of morphological generali- 
sations together with a list of exceptions for 
specific (irregular) words. Unknown words 
are very often handled correctly by the gen- 
eralisations. 
maintainability and ease of use The or- 
ganisation into generalisations and excep- 
tions can save development time since ad- 
dition of new vocabulary that has regular 
morphology does not require any changes 
to be made. The generator is packaged up 
as a Unix filter, making it easy to integrate 
into applications. 
speed and portability The generator is 
based on efficient finite-state techniques, 
and implemented using the widely available 
Unix Flex utility. 
freely available The morphological gener- 
ator and the orthographic postproces- 
sor are fi'eely available to the NLG re- 
search community. See <http://www.cogs. 
susx.ac.uk/lab/nlp/carroll/morph.html>. 
In future work we intend to investigate the 
use of phonological information in machine 
readable dictionaries for a more principled so- 
lution to the consonant doubling problem. We 
also plan to further increase the flexibility of 
the generator by including an option that al- 
lows the user to choose whether it has a prei~r- 
ence for generating British or American English 
spelling. 
Acknowledgements 
This work was fimded by UK EPSRC project 
GR/L53175 'PSET: Pra(:tical Simplification of 
English Text', and by all EPSRC Advanced Fel- 
lowship to tim second author. The original ver- 
sion of t.lw morl)hol~gi('al analyscr was kindly 
gested improvements to the analyser/generator. 
Thanks also to the anonymous reviewers for in: 
sightful comments. 

References 
Alfred Aho, Ravi Sethi, and Jeffrey Ullman. 
..... ...~ 1986..=-£:ompilers,: ~Principles,~Techniques and 
Tools. Addison-Wesley. 
Hiyan Alshawi, editor. 1992. The Core Lan- 
guage Engine. MIT Press, Cambridge, MA. 
Harald Baayen, Richard Piepenbrock, and Hed- 
derik van Rijn. 1993. The CELEX Lexi- 
cal Database (CD-ROM). Linguistic Data 
Consortium, University of Pennsylvania, 
Philadelphia, PA, USA. 
John Bateman. 2000. KPML (Version 3.1) 
March 2000. University of Bremen, Germany, 
< http://www.fbl0.uni-bremen.de/anglistik/ 
langpro/kpml/README.html>. 
Lou Burnard. 1995. Users reference guide for 
the British National Corpus. Technical re- 
port, Oxford University Computing Services. 
Lynne Cahill. 1993. Morphonology in the lex- 
icon. In Proceedings of the 6th Conference 
of the European Chapter of the Association 
for Computational Linguistics, pages 87-96, 
Utrecht, The Netherlands. 
Yvonne Canning and John Tait. 1999. Syntac- 
tic simplification of newspaper text for apha- 
sic readers. In Proceedings of the ACM SIGIR 
Workshop on Customised Information Deliv- 
ery, Berkeley, CA, USA. 
John Carroll, Guido Minnen, Darren Pearce, 
Yvonne Canning, Siobhan Devlin, and John 
Tait. 1999. Simplifying English text for lan- 
guage impaired readers. In Pwceedings of the 
9th Conference of th, e European Chapter of 
the Association for Computational Ling.uis- 
tics (EACL), Bergen, Norway. 
Hamish Cunningham, Yorick Wilks, and Robert 
Gaizauskas. 1996. GATE--a GenerM Archi- 
tecture for Text Engineering. In Proceed- 
ings of the 16th Conference on Computational 
Linguistics, Copenhagen, Denmark. 
Siobhan Devlin and John Tait. 1998. The use 
of a psychotinguistic database in the simpli- 
fication of text for aphasic readers. In (Ner- 
bonne. 1998). 
Michael Elhadad and Jacques Robin. 1996. An 
overview of SU-KGE:..A ~eusable~,.eomprehen- 
sive syntactic realization component. Tech- 
nical Report 96-03, Dept of Mathematics and 
Computer Science, Ben Gurion University, Is- 
rael. 
Roger Evans and Gerald Gazdar. 1996. DATR: 
a  for lexical knowledge representa- 
tion. Computational Linguistics, 22. 
Roger Garside, Ge.off;ey.. _Leech, and...Geoffrey 
Sampson. 1987. The computational analysis 
of English: a corpus-based approach. Long- 
man, London. 
Lauri Karttunen, Jean-Pierre Chanod, Gregory 
Grefenstette, and Anne Schiller. 1996. Regu- 
lar expressions for  engineering. Nat- 
ural Language Engineering, 2(4):305-329. 
Lauri Karttunen. 1994. Constructing lexical 
transducers. In Proceedings of the 14th Con- 
ference on Computational Linguistics, pages 
406-411, Kyoto, Japan. 
Kimmo Koskenniemi. 1983. Two-level model 
for morphological analysis. In 8th Interna- 
tional Joint Conference on Artificial Intelli- 
gence, pages 683-685, Karlsruhe, Germany. 
John Levine, Tony Mason, and Doug Brown. 
1992. Lex ~4 Yacc. O'Reilly and Associates, 
second edition. 
Mitch Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Build- 
ing a large annotated corpus of English: the 
Penn Treebank. Computational Linguistics, 
19(2):313-330. 
Christian Matthiessen. 1984. Systemic Gram- 
mar in computation: The Nigel case. In Pro- 
ceedings of the 1st Conference of the European 
Chapter of the Association for Computational 
Linguistics, pages 155-164, Pisa, Italy. 
George Miller, Richard Beckwith, Christiane 
Fellbaum, Derek Gross, Katherine Miller, and 
Randee Tengi. 1993. Five Papers on Word- 
Net. Princeton University, Princeton, N.J. 
Guido Minnen and John Carroll. Under review. 
Past and robust morphological processing tools 
for practical NLP applications. 
Roger Mitton. 1992. A description of a 
computer-usable dictionary file based on 
the Oxford Advanced Learner's Dictio- 
nary of Current English. Availat)le at 
< ftp: / / ota.ox.ac.uk / pub / ota/ pub lic / d icts / 710 / 
text710.doe: >. 
Mehryar Mohri. 1996. On some applications of 
....... ~ :fmittee-:sta, t e -automata,.-.t heeory.,.~tox.:natu.ea, l~lam.- 
guage processing. Natural Language Engi- 
neering, 2(1):61-80. 
John Nerbonne, editor. 1998. Linguistic 
Databases. Lecture Notes. CSLI Publica- 
tions, Stanford, USA. 
Richard Power, Donia Scott, and Roger Evans. 
1998. What You See Is What You Meant: di- 
rect knowledge editing~with natu~aL 
feedback. In Proceedings of the 13th Bien- 
nial European Conference on Artificial Intel- 
ligence (ECAI 98), Brighton, UK. 
Paul Procter. 1995. Cambridge International 
Dictionary of English. Cambridge University 
Press. 
Geoffrey Pullum and Arnold Zwicky. In prepa- 
ration. Licensing of prosodic features by 
syntactic rules: the key to auxiliary reduc- 
tion. First version presented to the Annual 
Meeting of the Linguistic Society of America, 
Chicago, Illinois, January 1997. Available at 
< http://www.lsadc.org/web2/99modabform.ht 
Philip Quinlan. 1992. The Oxford Psycholin- 
guistic Database. Oxford University Press. 
Graeme Ritchie, Graham Russell, Alan Black, 
and Stephen Pulman. 1992. Computational 
morphology: practical mechanisms for the 
English lexicon. MIT Press. 
Geoffrey Sampson. 1995. English for the com- 
puter. Oxford University Press. 
Stuart Shieber, Gertjan van Noord, Robert 
Moore, and Fernando Pereira. 1990. Seman- 
tic head-driven generation. Computational 
Linguistics, 16(1):7-17. 
Lita Taylor and Gerry Knowles. 1988. Man- 
ual of information to accompany the SEC 
Corpus: the machine-readable corpus of spo- 
ken English. Manuscript, Urfiversity of Lan- 
caster. UK. 
Gertjan van Noord. 1991. Morphology in 
MiMo2. Manuscript, University of Utrecht, 
The Netherlands. 
