A Dictionary and Morphological Analyser for English 
G.J. Russell 
S.G. Pulman
Computer Laboratory, 
University of Cambridge 
1. Introduction and Overview 
This paper describes the current state of a three-year 
project aimed at the development of software for use in 
handling large quantities of dictionary information within 
natural language processing systems.¹ The project was
accepted for funding by SERC/Alvey commencing in
June 1984, and is being carried out by Graeme Ritchie 
and Alan Black at the University of Edinburgh and 
Steve Pulman and Graham Russell at the University of
Cambridge. It is one of three closely related projects 
funded under the Alvey IKBS Programme (Natural 
Language Theme); a parser is under development at
Edinburgh by Henry Thompson and John Phillips, and a 
sentence grammar is being devised by Ted Briscoe and 
Clare Grover at Lancaster and Bran Boguraev and John 
Carroll at Cambridge. It is intended that the software 
and rules produced by all three projects will be directly 
compatible and capable of functioning in an integrated 
system. 
Realistic and useful natural language processing systems
such as database front-ends require large numbers
of words, together with associated syntactic and semantic
information, to be efficiently stored in machine-readable
form. Our system is intended to provide the necessary
facilities, being designed to store a large number (at least
10,000) of words and to perform morphological analysis
on them, covering both inflectional and derivational
morphology. In pursuit of these objectives, the dictionary
associates with each word information concerning its
morphosyntactic properties. Users are free to modify the
system in a number of ways; they may add to the lexical
entries Lisp functions that perform semantic manipulations,
and tailor the dictionary to the particular subject
matter they are interested in (different databases, for
example). It is also hoped that the system is general
enough to be of use to linguists wishing to investigate
the morphology of English and other languages. Contents
of the basic data files may be altered or replaced:
1. A 'Word Grammar' file contains rules assigning internal
structure to complex words.
2. A 'Lexicon' file holds the morpheme entries, which
include syntactic and other information associated
with stems and affixes.
3. A 'Spelling Rules' file contains rules governing permissible
correspondences between the form of morphemes
listed in the lexicon and complex words consisting of
sequences of these morphemes.
Once these data files have been prepared, they are compiled
using a number of pre-processing functions that
produce a set of output files. These
constitute a fully expanded and cross-indexed dictionary
which can then be accessed from within Lisp.
The process of morphological analysis consists of parsing
a sequence of input morphemes with respect to the
word grammar. It is implemented as an active chart
parser (Thompson & Ritchie (1984)), and builds a structure
in the form of a tree in which each node has two
associated values, a morphosyntactic category and a rule
identifier.
¹ This work is supported by SERC/Alvey grant number
GR/C/79114.
G.D. Ritchie
A.W. Black
Department of
Artificial Intelligence,
University of Edinburgh
The system is written in FRANZ LISP (opus 42.15) 
running under Berkeley 4.2 Unix. Future developments 
will concentrate on improving its efficiency, in particular 
by restructuring the code. We also hope to produce an 
implementation in C, which should offer a faster 
response time. 
2. Linguistic Assumptions 
The grammatical framework underlying the linguistic 
aspects of the system is that of Generalized Phrase 
Structure Grammar, as set out in Gazdar et al. (1985). 
Morphological categories employed here correspond to the 
syntactic categories in that work, and the type of syn- 
tactic information present in dictionary entries is 
intended to facilitate the use of the system as part of a 
more general GPSG-based program. In developing our 
prototype, we have adopted many of the proposals made 
in that work. To that extent, certain assumptions about
a correct analysis of English sentence syntax are built
into the lexical entries, but this should not preclude
adaptation by users to suit different analyses.
Following what has become a general assumption in
syntactic theory, we take the major lexical categories to
be partitioned into four classes by the two binary-valued
features [±N] and [±V]. The major lexical categories
have phrasal projections; these are distinguished from
their lexical counterparts by their value for the feature
BAR. Lexical categories have the value 0, and phrasal
categories (including sentences) have the value 1 or 2.
Thus, a Noun Phrase is of the category:
((V -) (N +) (BAR 2)) 
In our analysis, 'bound morphemes', that is to say
prefixes and suffixes, are distinguished from others by
their BAR specification; the suffix ing is the sole member
of the category:
((V +) (N -) (VFORM ING) (BAR -1))
As in other GPSG-based work, our analysis encodes the
subcategorizational properties of lexical items in the value
of a feature SUBCAT. Transitive verbs such as devour
are specified as (SUBCAT NP), and intransitives such as
elapse as (SUBCAT NULL).
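The category notation above can be pictured as a simple mapping from feature names to values. The Python sketch below is an illustration only and is not the system's Lisp code; the names `major_class` and `is_bound_morpheme` are hypothetical, and the labelling of the (N -)(V -) class as 'preposition' is the usual assumption in the syntactic literature rather than something stated in this paper.

```python
# Hypothetical representation: a category is a dict of feature-value pairs.
NOUN_PHRASE = {"V": "-", "N": "+", "BAR": 2}
ING_SUFFIX = {"V": "+", "N": "-", "VFORM": "ING", "BAR": -1}
DEVOUR = {"V": "+", "N": "-", "BAR": 0, "SUBCAT": "NP"}
ELAPSE = {"V": "+", "N": "-", "BAR": 0, "SUBCAT": "NULL"}


def major_class(category):
    """Name the major class picked out by the N and V features."""
    names = {
        ("+", "-"): "noun",
        ("-", "+"): "verb",
        ("+", "+"): "adjective",
        ("-", "-"): "preposition",  # assumption: standard label, not from the paper
    }
    return names[(category["N"], category["V"])]


def is_bound_morpheme(category):
    """Prefixes and suffixes are the categories specified as (BAR -1)."""
    return category["BAR"] == -1
```

Under this encoding the distinction between a lexical item, its phrasal projections and a bound morpheme is carried entirely by the value of BAR.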
As an example from the current analysis of how the
system can operate to produce well-formed words, consider
the familiar fact of English morphology that no
word may contain more than one inflection. The word
grammar must permit both walked and walking, but not
walkinged. This is achieved by restricting the distribution
of inflectional suffixes so that they attach to
non-inflected stems only. A general statement of this type
of restriction is made in terms of a feature INFL: stems
specified as (INFL +) may take an inflectional suffix,
while those specified as (INFL -) may not. The STEM
feature described in section 4 provides one means of
enforcing correct stem-affix combinations; if the suffixes
ed and ing are specified with (STEM ((INFL +))), they
will attach only to categories which include the
specification (INFL +). Walk, as a regular verb, is so
specified; walked and walking are therefore accepted. Ed,
ing, other inflectional suffixes, and irregular (i.e.
uninflectable) words, however, are specified as (INFL -).
Our grammar assigns a binary structure to the words in
question. In order for this method to prevent e.g.
walkinged, the stem walking must also bear the (INFL -)
specification. This it does, since we regard suffixes as
being the head of a word, and as contributing to the
categorial content of the word as a whole. If the INFL
specification of the suffix is copied into the mother
category, the STEM specification of a further suffix will
not be satisfied. See section 4 for more discussion of
these matters.
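The interaction of INFL and STEM described above can be traced in a few lines of Python. This is a minimal sketch under stated assumptions, not the system's implementation: categories are dicts, `attach` is a hypothetical helper that stands in for the word grammar and simply copies the suffix's features (other than STEM and BAR) onto the mother, and the entry for ed carries only the features mentioned in the text.

```python
def is_extension(category, spec):
    """True if category contains every feature-value pair in spec."""
    return all(category.get(f) == v for f, v in spec.items())


def attach(stem, suffix):
    """Combine stem and suffix; return the mother category, or None if
    the suffix's STEM requirement is not met by the stem."""
    if not is_extension(stem, suffix["STEM"]):
        return None
    mother = dict(stem)
    # The suffix is treated as head: its features, including INFL,
    # are copied into the mother (simplification of the conventions
    # in section 4).
    mother.update({f: v for f, v in suffix.items() if f not in ("STEM", "BAR")})
    return mother


WALK = {"V": "+", "N": "-", "BAR": 0, "INFL": "+"}
ED = {"V": "+", "N": "-", "BAR": -1, "INFL": "-", "STEM": {"INFL": "+"}}
ING = {"V": "+", "N": "-", "BAR": -1, "INFL": "-", "VFORM": "ING",
       "STEM": {"INFL": "+"}}

walking = attach(WALK, ING)      # accepted; mother is (INFL -)
walkinged = attach(walking, ED)  # rejected: walking is not (INFL +)
```

Because walking inherits (INFL -) from ing, the STEM requirement of ed fails on the second attachment and `walkinged` is None.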
3. The Lexicon 
The lexicon itself consists of a sequence of entries, each 
in the form of a Lisp s-expression. An entry has five 
elements: (i) and (ii) the head word, in its written form
and in a phonological transcription, (iii) a 'syntactic
field', (iv) a 'semantic field', and (v) a 'user field'. The 
semantic field has been provided as a facility for users, 
and any Lisp s-expression can be inserted here. No 
significant semantic information is present in our entries, 
beyond the fact that e.g. better and best are related in 
meaning to good. 
Similarly, the user field is unexploited, being occupied
in all cases by the atom 'nil'. It serves primarily as a
place-holder, in that, while it is desirable to maintain
the possibility for users to include in an entry whatever
additional information they desire, the form which that
information might take in practice is clearly not
predictable.
The syntax field consists of a syntactic category, as 
defined by Gazdar et al. (1985), i.e. a set of feature- 
value pairs. Some of these are relevant only to the 
workings of the word grammar, and may thus be 
ignored by other components in an integrated natural
language processing system. Their purpose is to control 
the distribution of morphemes in complex words, as 
described in the following section. 
The content of a syntax field is often at least partially
predictable. This fact allows us to employ, as an
aid to users wishing to write their own dictionary, rules
which add information to the lexicon during the compilation
process. Recall that, in our analysis of English,
the inflectability of a word is governed by the value in
that word's category for INFL. Completion Rules (CRs)
can be written that will add the specification (INFL -)
to any entry already including (PLU +) (for e.g. men),
(AFORM ER) (for e.g. worse), (VFORM ING), etc., thus
removing the need to state individually that a given
word cannot be inflected.
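The effect of such a Completion Rule can be sketched as follows. The code is illustrative Python, not the rule notation the system accepts; `TRIGGERS` and `apply_completion_rules` are hypothetical names, and the trigger list contains only the three specifications named in the text.

```python
# Specifications which, when already present, imply (INFL -).
TRIGGERS = [{"PLU": "+"}, {"AFORM": "ER"}, {"VFORM": "ING"}]


def apply_completion_rules(syntax):
    """Return a copy of a syntax field, adding (INFL -) when the field
    matches a trigger and does not already specify INFL."""
    completed = dict(syntax)
    matches = any(
        all(completed.get(f) == v for f, v in trigger.items())
        for trigger in TRIGGERS
    )
    if matches and "INFL" not in completed:
        completed["INFL"] = "-"
    return completed


men = {"V": "-", "N": "+", "BAR": 0, "PLU": "+"}
walk = {"V": "+", "N": "-", "BAR": 0}
```

Here `apply_completion_rules(men)` gains (INFL -), while the entry for walk is returned unchanged, so the lexicon writer states uninflectability once, in the rule, rather than in every entry.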
A second means of reducing the amount of preparatory
work is provided in the form of Multiplication
Rules (MRs). Whereas CRs add further specifications to
a single entry, MRs have the effect of increasing the
number of entries in some principled way. One application
of MRs is to express the fact that nouns and adjectives
do not subcategorize for obligatory complements.
An MR can be written which, for each entry containing
the specification (N +) and some non-NULL value for
SUBCAT, produces a copy of that entry where the SUBCAT
specification is replaced by (SUBCAT NULL).
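The Multiplication Rule just described might be pictured as below. This is a hedged Python sketch rather than the system's rule format; entries are reduced to (word, syntax field) pairs, the function name is hypothetical, and the SUBCAT value PP on the example adjective is an invented placeholder.

```python
def multiply_optional_complements(entries):
    """For each (N +) entry with a non-NULL SUBCAT, keep the entry and
    add a copy whose SUBCAT specification is replaced by NULL."""
    result = []
    for word, syntax in entries:
        result.append((word, syntax))
        subcat = syntax.get("SUBCAT")
        if syntax.get("N") == "+" and subcat is not None and subcat != "NULL":
            result.append((word, {**syntax, "SUBCAT": "NULL"}))
    return result


lexicon = [
    ("fond", {"V": "+", "N": "+", "BAR": 0, "SUBCAT": "PP"}),    # PP is hypothetical
    ("devour", {"V": "+", "N": "-", "BAR": 0, "SUBCAT": "NP"}),  # verb: not copied
]
expanded = multiply_optional_complements(lexicon)
```

The adjective yields two entries and the verb one, which is the sense in which an MR increases the number of entries rather than adding to a single entry.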
The lexicon compiles into two files, one holding morphemes
stored in a tree-shaped structure (cf. Thorne et
al. (1968)), and the other holding the expanded entries 
relating to them. The compilation of a lexicon can take
a considerable amount of time; our prototype incorporates
a lexicon with approximately 3500 entries, which compiles
in approximately ninety minutes.
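A tree-shaped morpheme store of the general kind referred to above is commonly realised as a letter tree (trie). The sketch below is an assumption about that general technique and says nothing about the actual file format produced by the compiler; `build_trie`, `morpheme_prefixes` and the `END` marker are hypothetical names.

```python
END = ""  # marker key; cannot clash with a single-character key


def build_trie(morphemes):
    """Store morphemes in a nested-dict letter tree."""
    root = {}
    for morpheme in morphemes:
        node = root
        for ch in morpheme:
            node = node.setdefault(ch, {})
        node[END] = morpheme
    return root


def morpheme_prefixes(trie, text):
    """Return every stored morpheme that is a prefix of text,
    shortest first, in a single left-to-right pass."""
    found = []
    node = trie
    for ch in text:
        if ch not in node:
            break
        node = node[ch]
        if END in node:
            found.append(node[END])
    return found


trie = build_trie(["a", "an", "anti", "walk", "ing", "ed"])
```

With this structure `morpheme_prefixes(trie, "antibody")` returns `["a", "an", "anti"]`, so all candidate segmentations at a given position are found without scanning the whole morpheme list.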
4. The Word Grammar 
The internal structure of words is handled by a 
unification feature grammar with rules of the form: 
mother => daughter1 daughter2 ...
where 'mother', 'daughter1', etc. are categories. A rule
which adds the plural morpheme to a noun might be 
given as shown below: 
((BAR 0) (V -) (N +) (PLU +) (INFL -)) => 
((BAR 0) (V -) (N +) (INFL +)) 
((BAR -1) (V -) (N +) (PLU +) (INFL -))
The system provides two methods of writing rules in a
more general form: variables and feature-passing conventions.
In our grammar, the category and inflectability of a
suffixed word are determined by the category and
inflectability of the suffix; in the rule below, ALPHA,
BETA, and GAMMA are variables ranging over the set
of values {+, -}:
((V ALPHA)(N BETA)(INFL GAMMA)(BAR 0)) => 
((BAR 0)) 
((V ALPHA)(N BETA)(INFL GAMMA)(BAR -1)) 
Since variables are interpreted consistently throughout a 
rule, the mother category and suffix will be identical in
their specifications for N, V and INFL. 
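The consistent interpretation of variables can be illustrated with a small matcher. This Python sketch is not the system's chart parser; `match` and `apply_rule` are hypothetical helpers, the restriction of variables to the values {+, -} is not checked, and the (INFL +) value given to ity is an assumption made for the example.

```python
VARIABLES = {"ALPHA", "BETA", "GAMMA"}

# The rule from the text: mother => stem suffix.
MOTHER = {"V": "ALPHA", "N": "BETA", "INFL": "GAMMA", "BAR": 0}
STEM = {"BAR": 0}
SUFFIX = {"V": "ALPHA", "N": "BETA", "INFL": "GAMMA", "BAR": -1}


def match(pattern, category, bindings):
    """Match a rule category against an actual one, returning extended
    variable bindings, or None if the match fails."""
    bindings = dict(bindings)
    for feature, value in pattern.items():
        if feature not in category:
            return None
        actual = category[feature]
        if value in VARIABLES:
            if bindings.setdefault(value, actual) != actual:
                return None
        elif value != actual:
            return None
    return bindings


def apply_rule(stem, suffix):
    """Return the instantiated mother category, or None."""
    bindings = match(STEM, stem, {})
    if bindings is not None:
        bindings = match(SUFFIX, suffix, bindings)
    if bindings is None:
        return None
    return {f: bindings[v] if v in VARIABLES else v for f, v in MOTHER.items()}


possible = {"V": "+", "N": "+", "INFL": "+", "BAR": 0}
ity = {"V": "-", "N": "+", "INFL": "+", "BAR": -1}  # INFL value assumed
```

Applying the rule to possible and ity gives a mother with the suffix's values for V, N and INFL and with (BAR 0), i.e. a noun.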
As an alternative to variables, feature-passing conventions
are also available. These relate categories in what
Gazdar et al. (1985) term 'local trees', i.e. sections of
morphological structure consisting of a mother category
and all of its immediate daughters. The conventions
refer to 'pre-instantiation' features; these are features
present in the categories mentioned in the relevant rule.
'Extension' and 'unification' are meant in the sense of
Gazdar et al. (1985), q.v.
The Word-Head Convention: 
After instantiation, the set of WHead features in the
mother is the unification of the pre-instantiation
WHead features of the Mother with the pre-instantiation
WHead features of the Right-daughter.
This convention is analogous to the simplest case of the
Head Feature Convention in Gazdar et al. (1985).
Although there is no formal notion of 'head' in the system,
this convention embodies the implicit claim that the
head in a local tree is always the right daughter. If the
daughters are a prefix and a stem (as in e.g. re-apply),
the WHead features of the stem are passed up to the
mother. Features encoding morphosyntactic category can
be declared as members of the WHead set, and re-apply
is then of the same category as, and shares various
sentence-level syntactic properties with, apply. If the
daughters are a stem and a suffix, the category of the
mother is determined not by the stem, but rather by the
suffix. For example, possible and ity may be combined to
form possibility, whose 'nouniness' is due to the category
of the suffix.
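A rough picture of the Word-Head Convention, for flat feature-value sets only, is given below. It is a sketch under assumptions and not the system's code: the membership of `WHEAD` is chosen by the user in the real system and is simply assumed here to be V, N and INFL, and `unify` and `word_head` are hypothetical names.

```python
WHEAD = {"V", "N", "INFL"}  # assumed membership, for illustration


def unify(a, b):
    """Unify two flat feature sets; None if they conflict."""
    if any(a[f] != b[f] for f in a.keys() & b.keys()):
        return None
    return {**a, **b}


def word_head(mother, right_daughter):
    """Instantiate the mother's WHead features from the right daughter."""
    mother_head = {f: v for f, v in mother.items() if f in WHEAD}
    right_head = {f: v for f, v in right_daughter.items() if f in WHEAD}
    unified = unify(mother_head, right_head)
    if unified is None:
        return None
    return {**mother, **unified}


apply_stem = {"V": "+", "N": "-", "BAR": 0, "INFL": "+"}
ity = {"V": "-", "N": "+", "BAR": -1}

re_apply = word_head({"BAR": 0}, apply_stem)  # prefix + stem
possibility = word_head({"BAR": 0}, ity)      # stem + suffix
```

In both cases the mother's category comes from the right daughter: re-apply is a verb like apply, and possibility is a noun like ity, while BAR, which is not a WHead feature here, keeps the value given by the rule.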
The Word-Daughter Convention: 
(a) If any WDaughter features exist on the Right-daughter
then the WDaughter features on the
Mother are the unification of the pre-instantiation
WDaughter features on the Mother with the
pre-instantiation WDaughter features on the
Right-daughter.
(b) If no WDaughter features exist on the Right-daughter
then the WDaughter features on the
Mother are the unification of the pre-instantiation
WDaughter features on the Mother with the
pre-instantiation WDaughter features on the
Left-daughter.
The subcategorization class of a word remains constant
under inflection, but is likely to be changed by the
attachment of a derivational suffix. Moreover, the
subcategorization of a prefixed word is the same as that of
its stem. The WDaughter convention is designed to
reflect these facts by enforcing a feature correspondence
between one of the daughters and the mother. When
the feature set WDaughter is defined as including the
subcategorization feature SUBCAT, the convention results
in configurations such as:
          ((SUBCAT NP))                      ((SUBCAT NP))
          /           \                      /           \
((V +)(N +))   ((SUBCAT NP))     ((SUBCAT NP))   ((VFORM ING))
which show the relevant feature specifications in local
trees arising from suffixation of an adjective with +ize to
produce a transitive verb and suffixation of a transitive
verb with +ing to produce a present participle.
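Clauses (a) and (b) of the Word-Daughter Convention amount to a choice of which daughter supplies the features. The Python sketch below illustrates this for flat feature sets; it is not the system's implementation, `WDAUGHTER` is assumed to contain only SUBCAT, and the helper names are hypothetical.

```python
WDAUGHTER = {"SUBCAT"}  # assumed membership, for illustration


def restrict(category):
    """Keep only the WDaughter features of a category."""
    return {f: v for f, v in category.items() if f in WDAUGHTER}


def word_daughter(mother, left, right):
    """Clause (a): use the right daughter if it has WDaughter features;
    clause (b): otherwise use the left daughter."""
    source = restrict(right) or restrict(left)
    own = restrict(mother)
    if any(own[f] != source[f] for f in own.keys() & source.keys()):
        return None  # unification fails
    return {**mother, **source}


adjective = {"V": "+", "N": "+"}
ize = {"SUBCAT": "NP"}
transitive_verb = {"SUBCAT": "NP"}
ing = {"VFORM": "ING"}

ize_mother = word_daughter({}, adjective, ize)        # clause (a)
ing_mother = word_daughter({}, transitive_verb, ing)  # clause (b)
```

Both mothers come out as ((SUBCAT NP)): in the first case the value is contributed by the derivational suffix, in the second it is inherited from the stem because the inflectional suffix has no SUBCAT of its own.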
The Word-Sister Convention: 
When one daughter is specified for STEM, the 
category of the other daughter must be an extension 
of the value of STEM. 
The purpose of this third convention is to allow the
subcategorization of affixes with respect to the type of
stem they may attach to. The behaviour of affixes that
attach to more than one category can be handled naturally
by giving them a suitable specification for STEM.
If it is desired to have anti- attached to both nouns and
adjectives, for example, the specification (STEM ((N +)))
will have that effect, since both adjectives and nouns are
extensions of the category ((N +)).
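The notion of 'extension' on which the Word-Sister Convention relies can be checked mechanically. The following Python sketch assumes flat categories represented as dicts; `is_extension` and `sisters_compatible` are hypothetical names, and the entry for anti- contains only the features relevant to the example.

```python
def is_extension(category, spec):
    """A category extends spec if it contains all of spec's pairs."""
    return all(category.get(f) == v for f, v in spec.items())


def sisters_compatible(left, right):
    """If either daughter is specified for STEM, the other daughter
    must be an extension of that STEM value."""
    for affix, other in ((left, right), (right, left)):
        if "STEM" in affix and not is_extension(other, affix["STEM"]):
            return False
    return True


ANTI = {"BAR": -1, "STEM": {"N": "+"}}
noun = {"V": "-", "N": "+", "BAR": 0}
adjective = {"V": "+", "N": "+", "BAR": 0}
verb = {"V": "+", "N": "-", "BAR": 0}
```

With these definitions anti- combines with the noun and the adjective but not with the verb, since only the first two extend ((N +)).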
The user can define the sets WHead and WDaughter 
as he wishes, or, by leaving them undefined, avoid their 
effects altogether. The feature STEM is built in, and 
need not be defined. The effects of the Word-Sister 
Convention can be modified by changing the STEM 
specifications in the lexical entries, and avoided by
omitting them. 
5. The Spelling Rules 
The rules are based on the work of Koskenniemi (1983a,
1983b) and Karttunen (1983), though their application here is
solely to the question of 'morphographemics'; the more
general morphological effects of Koskenniemi's rules are
produced differently. The current version of the system
contains a compiler allowing the rules to be written in a
high level notation based on Koskenniemi (1985). Any
number of spelling rules can be employed, though our
system has fifteen. They are compiled during the general
dictionary pre-processing stage into deterministic
finite state transducers, of which one tape represents the
lexical form and the other the surface form.
The following rule describes the process by which an
additional e is inserted when some nouns are suffixed
with the plural morpheme +s:
Epenthesis
+:e <=> { < s:s h:h > s:s x:x z:z } --- s:s
or < c:c h:h > --- s:s
The epenthesis rule states that e must be inserted at a
morpheme boundary if and only if the boundary has to
its left sh, s, x, z or ch and to its right s. The
interpretation of the rule is simple; the character pair
('lexical character:surface character') to the left of the
arrow specifies the change that takes place between the
contexts (again stated in character pairs) given to the
right of the arrow. Braces ('{','}') indicate disjunction
and angled brackets indicate a sequence. Alternative
contexts may be specified using the word 'or'. Lexical
and surface strings of unequal length can be matched by
using the null character '0', and special characters may
be defined and used in rules, for example to cover the
set of alphabetic characters representing vowels.
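The content of the epenthesis rule can be checked against examples with a direct string procedure. Note that this Python sketch deliberately does not build the finite state transducers that the system compiles; it only mirrors the 'if and only if' condition of this one rule, and it assumes that a morpheme boundary not realised as e is realised as the null character. The function name is hypothetical.

```python
LEFT_CONTEXTS = ("sh", "ch", "s", "x", "z")


def surface_form(lexical):
    """Map a lexical string such as 'fox+s' to its surface string,
    applying only the epenthesis rule."""
    out = []
    for i, ch in enumerate(lexical):
        if ch != "+":
            out.append(ch)  # ordinary characters pair with themselves
        elif lexical[:i].endswith(LEFT_CONTEXTS) and lexical[i + 1:].startswith("s"):
            out.append("e")  # +:e in the stated contexts
        # otherwise the boundary is realised as null (assumption)
    return "".join(out)
```

For instance, `surface_form("fox+s")` gives `"foxes"` and `surface_form("cat+s")` gives `"cats"`. Alternations handled by the system's other spelling rules are outside the scope of this sketch.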
The spelling rules are able to match any pair of character
strings. It would for example be possible to
analyse the suppletive went as a surface form
corresponding to the lexical form go+ed. In this case,
four rules would be needed to effect the change, and a
better solution is to list went separately in the lexicon.
In practice, the choice between treating this type of
alternation dynamically, with morphological and spelling
rules, and statically, by exploiting the lexicon directly,
depends on the user's idea of which is the more elegant
solution. While elegance may be in the eye of the
beholder, computational efficiency is unfortunately not.
It will generally be more efficient to list a word in the
lexicon than to add spelling or morphological rules
specific to a small number of cases.
References

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985)
Generalized Phrase Structure Grammar. Oxford: 
Blackwells. 

Karttunen, L. (1983) "KIMMO - A General Morphological
Processor", in Texas Linguistic Forum 22, 165-186.
Department of Linguistics, University of Texas,
Austin, Texas.

Koskenniemi, K. (1983a) "Two-level model for morphological
analysis", in Proceedings of the Eighth International
Joint Conference on Artificial Intelligence,
Karlsruhe, 683-685.

Koskenniemi, K. (1983b) Two-level Morphology: a general
computational model for word-form recognition and
production, Publication No. 11, University of Helsinki,
Finland.

Koskenniemi, K. (1985) "Compilation of Automata from
Two-level Rules", talk given at the Workshop on
Finite-State Morphology, CSLI, Stanford, July, 1985.

Thompson, H. and G.D. Ritchie (1984) "Implementing
Natural Language Parsers", in T. O'Shea and M. Eisenstadt
(eds.) Artificial Intelligence: Tools, Techniques
and Applications. New York: Harper and Row.

Thorne, J.P., P. Bratley, and H. Dewar (1968) "The syntactic
analysis of English by machine", in D. Michie
(ed.) Machine Intelligence 3. Edinburgh: Edinburgh
University Press.
