The Derivation of a Grammatically Indexed Lexicon
from the Longman Dictionary of Contemporary English 
Bran Boguraev†, Ted Briscoe§, John Carroll†, David Carter† and Claire Grover§
† Computer Laboratory, University of Cambridge
Corn Exchange Street, Cambridge CB2 3QG, England 
§ Department of Linguistics, University of Lancaster 
Bailrigg, Lancaster LA1 4YT, England 
Abstract 
We describe a methodology and associated software 
system for the construction of a large lexicon from 
an existing machine-readable (published) dictionary. 
The lexicon serves as a component of an English morphological
and syntactic analyser and contains entries
with grammatical definitions compatible with the word 
and sentence grammar employed by the analyser. We 
describe a software system with two integrated com- 
ponents. One of these is capable of extracting syn- 
tactically rich, theory-neutral lexical templates from 
a suitable machine-readable source. The second supports
interactive and semi-automatic generation and
testing of target lexical entries in order to derive a size- 
able, accurate and consistent lexicon from the source 
dictionary which contains partial (and occasionally in- 
accurate) information. Finally, we evaluate the utility 
of the Longman Dictionary of Contemporary English as
a suitable source dictionary for the target lexicon. 
1 Introduction 
Within the larger framework of the Alvey Programme 
of advanced information technology -- a research and 
development initiative set up in the UK to promote 
collaborative research projects aimed at several enabling
key technologies -- a coordinated effort to build
a natural language toolkit for use by the wider academic
and industrial community is being carried out
jointly by groups at the Universities of Cambridge, 
Lancaster and Edinburgh. 
The goal of these three closely related projects is to 
produce directly compatible rule systems and associ- 
ated software, capable of functioning together as an in- 
tegrated system for morphological and syntactic pars- 
ing of texts. The projects aim to deliver, respectively, 
a sentence grammar of English together with a word list
indexed to the grammar, a combined inflectional and
derivational morphological analyser and dictionary system,
and a parser for the grammatical formalism used.
The work is being carried out within the theoretical 
framework of Generalized Phrase Structure Grammar 
(Gazdar et al., 1985), but many of the mechanisms
would be usable without a theoretical commitment to 
GPSG. It is envisaged that the complete integrated 
toolkit will be used by a number of research and de- 
velopment groups, as a base component for a range of 
applications. The potential requirements of a diverse 
user community motivate, in particular, the need for 
a morphological and syntactic analyser with wide coverage
of English grammar and vocabulary. Briscoe et
al. (1987) describe the sentence grammar formalism
and current coverage of the English grammar in detail.
Russell et al. (1986) describe the morphological analyser
and dictionary system. Further relevant details
of both projects are provided in section 2. 
As part of the grammar project, in tandem with 
the development of the grammar proper, work is un- 
derway to develop a sizeable word list which will be in- 
tegrated with an existing lexicon of about 4000 words, 
hand crafted by the morphology project. The cover- 
age of this word list and its compatibility with the 
sentence grammar, word grammar and existing lexicon
is critical for the complete analysis system. The
word list need only contain base and irregular entries, 
as productive inflectional and derivational variants are 
analysed at run-time on the basis of the word gram- 
mar. Therefore, when the word list is integrated with 
the existing lexicon and dictionary system it will form 
a dynamic system for word analysis, and not just a 
repository of word forms used for simple lookup. 
An additional constraint on the content of the target
word list comes from the fact that even though
there is no provision for the analysis system to handle
semantics, there is still the need to provide a minimal, 
theoretically neutral extension to the grammar rules 
and lexical entry format to allow subsequent integration
of a semantic component: thus information concerning,
e.g., the predicate-argument structure of verbs
and their logical types must be made available in the 
lexical entries. 
The question then arises of how to develop such a
detailed and substantial word list. Our approach has 
been to make use of the machine-readable source of a 
published dictionary, namely the Longman Dictionary
of Contemporary English (henceforth LDOCE) (Procter,
1978). Apart from the obvious motivation of attempting
to derive a large list of words from a computerised
source, LDOCE is particularly relevant to this
project since it offers, among other things, through a
highly elaborate and semi-formal system of grammar
codes, detailed information about the grammatical behaviour
of individual words. We have mounted the
dictionary on-line and, following its conversion into a 
flexible lexical knowledge base (as described in Boguraev
et al., 1987), a range of experiments has since
been carried out with the aim of establishing LDOCE's 
appropriateness to the task of deriving a word list 
with associated grammatical definitions indexed to the 
analyser grammar. Section 3 below describes the syn- 
tactic level information available in, and extractable 
from, LDOCE and summarises the description of an 
operational program used to derive such information. 
The attempt to use semi-formalised, and occasionally
inaccurate, information for constructing a large
computerised lexicon raises a number of practical problems.
In order to make maximal use of the rich syntactic
data in the source machine-readable dictionary
(MRD), we have designed a lexicon development system
which embodies a methodology for a semi-automatic
interactive cycle of lexical entry generation and
testing. This is described in section 4.
2 The target lexicon 
Given the goal of the toolkit projects to provide a lexicon
capable of supporting morphological and syntactic
analysis of English, there is a precise definition of the
information required in lexical entries. Both the grammar
and morphology projects have adopted a feature
system based largely on that described in Gazdar et al.
(1985). A lexical entry will contain features relevant
either to the word grammar or sentence grammar, or
both, represented as a list of feature name / feature
value pairs. In Figure 1 we show a fragment from the
hand crafted lexicon developed as part of the morphology
project (Russell et al., 1986). Here we concentrate
on the feature-value sets carrying the syntactic information;
the complete entries also have semantic and
user fields, which are of no relevance to this paper.
believe 
[V +, N -, BAR 0, AGR [BAR 2, V -, N +,
 NFORM NORM], PRD -, NEG -, WORD +, AUX -,
 INFL +, FIN -, VFORM BSE, LAT -, SUBCAT OR]
[V +, N -, BAR 0, AGR [BAR 2, V -, N +,
 NFORM NORM], PRD -, NEG -, WORD +, AUX -,
 INFL +, FIN -, VFORM BSE, LAT -, SUBCAT NP_NP]
[V +, N -, BAR 0, AGR [BAR 2, V -, N +,
 NFORM NORM], PRD -, NEG -, WORD +, AUX -,
 INFL +, FIN -, VFORM BSE, LAT -, SUBCAT NP_AP]
[V +, N -, BAR 0, AGR [BAR 2, V -, N +,
 NFORM NORM], PRD -, NEG -, WORD +, AUX -,
 INFL +, FIN -, VFORM BSE, LAT -, SUBCAT SFIN]
Figure 1: Sample lexical entries 
An almost complete list of the feature names and 
potential values which may occur as part of the lex- 
ical entry for a given morpheme is given in Figure 
2 overleaf. Grover et al. (1987) contains a complete 
description of the features used in the sentence grammar;
Ritchie et al. (1987) offers an equally complete
description of the morphological and syntactic features 
relevant to the operations of the word grammar. For 
the purposes of this paper, we present a brief overview 
of the sentence grammar feature system. 
With the exception of the features N, V and BAR,
used to define the major categories of the grammar,
most features can be classified in terms of the categories
they apply to. For each major category type
there is a set of head features which must appear on
all instances of that category type, regardless of their
BAR feature value. Further features must (or may)
be associated only with some instances of a category
type, depending on the value of their BAR feature
(or, on occasions, some other feature). The sets of
head features for the four major categories are:
VERBALHEAD {PRD FIN AUX VFORM PAST AGR}
NOMINALHEAD {PLU POSS CASE PN COUNT 
PRD PRO PART NFORM PER} 
PREPHEAD {PFORM LOC PRD} 
ADJHEAD {AFORM PRD QUA ADV NUM NEG 
PART AGR DEF}.
The features appearing on certain categories in addition
to the sets defined above are COMP, INV, NEG
and SUBCAT which are relevant to verbal categories; 
SPEC, DEF and SUBCAT, applicable to nominal cat- 
egories; GERUND, POSS and SUBCAT for preposi- 
tional categories; and SUBCAT alone for adjectival 
categories. With the exception of SUBCAT, which must
be specified for all lexical entries, and the respective 
head features sets, the only other features required 
by the lexical nodes in the grammar are NEG, 
and DEF. Features like SLASH, WH, UB and EVER, 
which are required by the grammar to implement the 
GPSG treatment of certain linguistic phenomena, are 
of no relevance to this paper. 
The feature set in Figure 2 overleaf defines the information
about lexical items which will be required
to construct a lexicon compatible both in form and
content with the rest of the analysis system. Some of
these features (such as FIX) are specific to bound
morphemes (these include, for example, entries for
"ative", "ing" or "ness"). Other features (for instance
WH, REFL) are specific to closed class vocabulary
items, such as interrogative, relative and reflexive pronouns.
Bound morphemes and closed class vocabulary
are exhaustively defined in the hand crafted lexicon.
However, this lexicon inevitably only contains a few
examples of the much larger open class vocabulary. In
order for the word and sentence grammars to function
correctly, open class vocabulary must be defined
in terms of the feature set illustrated overleaf (Figure 2a).
The features relevant to the open class vocabulary 
can be divided into those which are predictable on 
the basis of the part of speech of the item involved, 
those which follow from the inflectional or derivational
morphological rules incorporated into the system, and
those which rely on more specific information than 
part of speech, but nevertheless must be specified for 
each individual entry. For example the values for the 
features N, V and BAR in the sample entries above 
follow from the part of speech of "believe". The values
of PLU and PER are predictable on the basis of the 
word grammar rules and need not be independently 
specified for each entry. On the other hand, the values 
of SUBCAT and LAT are not predictable from either 
part of speech or general morphological information. 
We concentrate on this last class of features which
must be specified on an entry-by-entry basis in any
lexicon which is going to be adequate for supporting
the analysis system. Within this class of features some
(e.g. LAT, AT or BARE_ADJ) are only relevant to the
word grammar. It is clear that those features that are
derivable from the part of speech information are recoverable
from virtually any MRD. However, most (if
not all) of the features in the third class above are not
recoverable from the majority of MRDs. As indicated
above, LDOCE appears to be an exception to this
generalisation, because it employs a system of grammatical
tagging of major syntactic classes, offering detailed
information about subcategorisation, morphological
irregularity and broad syntactico-semantic information.
a. open class vocabulary
BAR      {-1 0 1 2}
V        {- +}
N        {- +}
PRD      {- +}
QUA      {- +}
ADV      {- +}
FIN      {- +}
PAST     {- +}
PLU      {- +}
AT       {- +}
LAT      {- +}
AGR      a category
STEM     a category
SUBCAT   { ... PRED INF NP AP NOPASS
           SFIN VPINF SINF OR IT_SUBJ
           PPFROM PPTO TWONP FOR_S
           LOC S_SUBJ NP_NP NP_AP
           OE SR1 DETH AND ... }
INFL     {- +}
COUNT    {- +}
PN       {- +}
PER      {1 2 3}
CASE     {NOM ACC}
BARE_ADJ {- +}
AFORM    {ER EST NONE}
NFORM    {IT THERE NORM}
VFORM    {BSE EN ING TO}

b. closed class vocabulary and affixes
FIX      {PRE SUF}
INV      {- +}
AUX      {- +}
NEG      {- +}
DEF      {- +}
SLASH    a category
COMPOUND {NOUN VERB ADJ NOT}
TITLE    {- +}
POSS     {- +}
PFORM    {WITH OF FROM AT ABOUT
          TO ON IN FOR AGAINST BY}
REFL     a category
WH       {- +}
UB       {Q R}
EVER     {- +}
PRO      {- +}
PRT      {AS IN OFF ON UP}
Figure 2: Features and feature values 
3 The source data 
It turns out that even though the grammar coding 
system of LDOCE is not GPSG specific, it encodes 
much of the information which GPSG requires relat- 
ing to the subcategorisation classes in the lexicon. The 
Longman lexicographers have developed a representa- 
tional system which is capable of describing compactly 
a variety of data relevant to the task of building a lex- 
icon with grammatical definitions; in particular, they 
are capable of denoting distinctions between count and 
mass nouns ("dog" vs. "desire"), predicative, postpositive
and attributive adjectives ("asleep" vs. "elect"
vs. "jocular"), noun and adjective complementation
("fondness", "fact") and, most importantly, verb complementation
and valency.
3.1 The Longman grammar coding system
Grammar codes typically contain a capital letter, followed
by a number and, occasionally, a small letter,
for example [T5a] or [V3]. The capital letters encode
information "about the way a word works in a sentence
or about the position it can fill" (Procter, 1978:
xxviii); the numbers "give information about the way
the rest of a phrase or clause is made up in relation to
the word described" (ibid.). For example, "T" denotes
a transitive verb with one object, while "5" specifies
that what follows the verb must be a that clause. (The
small letters, e.g. "a" in the case above, provide information
related to the status of various complementisers,
adverbs and prepositions in compound verb constructions:
here it indicates that the complementiser
is optional.) As another example, "V3" introduces a
verb followed by one object and a verb form (V) which
must be an infinitive with to (3).
In addition, codes can be qualified with words or
phrases which provide further information concerning
the linguistic context in which the described item is
likely, and able, to occur; for example [D1(to)] or [L(to
be)1]. Sets of codes, separated by semicolons, are associated
with individual word senses in the lexical entry
for a particular item, as the entry for "feel", with
extracts from its printed form shown in Figure 3, illustrates.
These sets are elided and abbreviated in
the code field associated with the word sense to save
space in the dictionary. Partial codes sharing an initial
letter can be separated by commas, for example
[T1,5a]. Word qualifiers relating to a complete sequence
of codes can occur at the end of a code field,
delimited by a colon, for example [T1;I0: (DOWN)].
feel¹ v 1 [T1,6] to get the knowledge of by
touching with the fingers: ... 2 [Wv6;T1] to
experience (the touch or movement of something):
... 3 [L7] to experience (a condition
of the mind or body); be consciously: ... 4
[L1] to seem to oneself to be: ... 5 [T1,5;V3]
to believe, esp. for the moment 6 [L7] to
give (a sensation): ... 7 [Wv6;I0] to (be able
to) experience sensations: ... 8 [Wv6;T1] to
suffer because of (a state or event): ... 9 [L9
(after, for)] to search with the fingers rather
than with the eyes: ...
Figure 3: Fragment of an LDOCE entry 
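The elision conventions of the code fields can be made concrete with a small sketch. The Python below is an illustration only and not the program used in this work: the function name expand_code_field is hypothetical, and the sketch assumes a field in which no qualifier is embedded inside an individual code (fields such as [X (to be) 1,7] are outside its scope and raise an error).

```python
import re

LOWERCASE = "abcdefghijklmnopqrstuvwxyz"


def expand_code_field(field):
    """Expand an abbreviated code field into full codes.

    Returns (codes, qualifier), where qualifier is a trailing
    ': (...)' word qualifier applying to the whole field, or None.
    Hypothetical helper; embedded qualifiers are not supported.
    """
    text = field.strip().strip("[]")
    qualifier = None
    # A colon-delimited qualifier at the very end covers all codes.
    match = re.search(r":\s*(\([^)]*\))\s*$", text)
    if match:
        qualifier = match.group(1)
        text = text[:match.start()]
    codes = []
    for group in text.split(";"):          # semicolons separate code sets
        letter = None
        for part in group.split(","):      # commas separate partial codes
            part = part.strip()
            if not part:
                continue
            if part[0].isupper():
                # A full code: remember its initial letter(s), e.g. "T" or "Wv".
                letter = re.match(r"[A-Z][a-z]*", part).group(0)
                codes.append(part)
            elif part[0].isdigit() and letter is not None:
                # A partial code sharing the previous initial letter: "5a" -> "T5a".
                codes.append(letter + part)
            elif part.isalpha() and part.islower() and codes:
                # A lone small letter varies the previous code: "T5a", "b" -> "T5b".
                codes.append(codes[-1].rstrip(LOWERCASE) + part)
            else:
                raise ValueError("unsupported code fragment: %r" % part)
    return codes, qualifier


print(expand_code_field("[T1,5a]"))
print(expand_code_field("[T1;I0: (DOWN)]"))
```

Raising an error on anything unrecognised, rather than guessing, is in keeping with the observation that the code fields contain occasional errors which need human attention.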
This apparent formal syntax for describing gram- 
matical information in a compact form occasionally 
breaks down: different classes of error occur in the 
tagging of word senses. These include, for example, 
misplaced commas or colon delimiters and occasional
migration of other lexical information (e.g. usage labels)
into the grammar code fields.
This type of error and inconsistency arises because 
grammar codes are constructed by hand and no automatic
checking procedure is attempted (Michiels,
1982). Such errors provide much of the motivation for our
interactive approach to lexicon development, since any
attempt at batch processing without extensive user
intervention would inevitably result in an incomplete
and inaccurate lexicon.
3.2 Making use of the grammar codes
The program which transforms the LDOCE grammar
codes into lexical entries utilisable by the analyser first
produces a relatively theory-neutral representation of
the lexical entry for a particular word. As an illustration
of the process of transforming a dictionary entry
into a lexical template we show below the mapping
of the third verb sense of "believe" into a lexical
entry incorporating information about the grammatical
category, syntactic subcategorisation frames
and semantic type of verb -- for example a label like
(Type 2 ORaising) indicates that under the given
sense the verb is a two-place predicate and that if it
occurs with a syntactic direct object, this will function
as the logical subject of the predicate complement.
be·lieve ... v 1 [I0] to have a firm religious
faith 2 [T1] to consider to be true or honest:
to believe someone | to believe someone's
reports 3 [T5a,b;V3;X (to be) 1, (to be) 7]
to hold as an opinion; suppose: I believe he
has come. | He has come, I believe. | "Has
he come?" "I believe so." | I believe him to
have done it. | I believe him (to be) honest
(believe verb (Sense 3)
  ((Takes NP SBar) (Type 2))
  ((Takes NP NP Inf) (Type 2 ORaising))
  ((or ((Takes NP NP NP) (Type 2 ORaising))
       ((Takes NP NP AuxInf) (Type 2 ORaising))))
  ((or ((Takes NP NP AP) (Type 2 ORaising))
       ((Takes NP NP AuxInf) (Type 2 ORaising)))))
Figure 4: A lexical template derived from LDOCE 
This resulting structure is a lexical template, designed
as a formal representation for the kind of syntactico-semantic
information which can be extracted from
the dictionary and which is relevant to a system for
automatic morphological and syntactic analysis of English
texts.
The overall transformation strategy employed by 
our system attempts to derive both subcategorisation 
frames relevant to a particular word sense and infor- 
mation about the semantic nature (i.e. the predicate- 
argument structure and the logical type) of, especially, 
verbs. In the main, the code numbers determine a 
unique subcategorisation. However, such semantic in- 
formation is not explicitly encoded in the LDOCE 
grammar codes, so we have adopted an approach at- 
tempting to deduce a semantic classification of the 
particular sense of the verb under consideration on 
the basis of the complete set of codes assigned to that 
sense. In any subcategorisation frame which involves a
predicate complement there will be a non-transparent 
relationship between the superficial syntactic form and 
the underlying logical relations in the sentence. In 
these situations the parser can use the semantic type 
of the verb to compute this relationship. Expanding 
on a suggestion of Michiels (1982), we classify verbs
as subject equi (SEqui), object equi (OEqui), subject
raising (SRaising) or object raising (ORaising)
for each sense which has a predicate complement code 
associated with it. These terms, which derive from 
Transformational Grammar, are used as convenient 
labels for what we regard as a semantic distinction. 
The five rules which are applied to the grammar
codes associated with a verb sense are ordered in a way
which reflects the filtering of the verb sense through
a series of syntactic tests. Verb senses with an [It+I5]
code are classified as SRaising. Next, verb senses
which contain a [V] or [X] code and one of [D5], [D5a],
[D6] or [D6a] codes are classified as OEqui. Then,
verb senses which contain a [V] or [X] code and a [T5]
or [T5a] code in the associated grammar code field
(but none of the D codes mentioned above) are classified
as ORaising. Verb senses with a [V] or [X(to
be)] code (but no [T5] or [T5a] codes) are classified
as OEqui. Finally, verb senses containing a [T2], [T3]
or [T4] code, or an [I2], [I3] or [I4] code are classified
as SEqui. Below we give examples of each type; for a
detailed description see Boguraev and Briscoe (1987).
happen(3) [Wv6;It+I5]
(Type 1 SRaising)
warn(1) [Wv4;I0;T1:(of, against),5a;D5a;V3]
(Type 3 OEqui)
assume(1) [Wv4;T1,5a,b;X(to be)1,7]
(Type 2 ORaising)
decline(5) [T1,3;I0]
(Type 2 SEqui)
Figure 5: The four semantic types of verb 
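The ordered rules can be read as a simple decision cascade. The following Python is a hedged sketch rather than the actual transformation program: the function name classify_verb_sense and the string representation of codes (already expanded, with qualifiers written as in "X(to be)1") are assumptions made for illustration, and the morphological [Wv] codes are left out of the examples because they play no part in the rules.

```python
D_CODES = {"D5", "D5a", "D6", "D6a"}
T5_CODES = {"T5", "T5a"}
SEQUI_CODES = {"T2", "T3", "T4", "I2", "I3", "I4"}


def classify_verb_sense(codes):
    """Apply the five ordered rules to the expanded codes of one sense.

    Returns "SRaising", "OEqui", "ORaising", "SEqui", or None when
    no rule applies (no predicate complement code). Hypothetical helper.
    """
    codes = set(codes)
    has_v_or_x = any(code[0] in "VX" for code in codes)
    has_v_or_x_to_be = any(
        code.startswith("V") or code.startswith("X(to be)") for code in codes
    )
    if "It+I5" in codes:                    # rule 1
        return "SRaising"
    if has_v_or_x and codes & D_CODES:      # rule 2
        return "OEqui"
    if has_v_or_x and codes & T5_CODES:     # rule 3 (no D codes, by ordering)
        return "ORaising"
    if has_v_or_x_to_be:                    # rule 4 (no T5 codes, by ordering)
        return "OEqui"
    if codes & SEQUI_CODES:                 # rule 5
        return "SEqui"
    return None


print(classify_verb_sense(["It+I5"]))                           # happen
print(classify_verb_sense(["I0", "T1", "T5a", "D5a", "V3"]))    # warn
print(classify_verb_sense(["T1", "T5a", "T5b", "X(to be)1", "X(to be)7"]))  # assume
print(classify_verb_sense(["T1", "T3", "I0"]))                  # decline
```

Because each rule is tried only when the earlier ones have failed, the exclusions stated in the text ("but none of the D codes", "but no [T5] or [T5a] codes") do not need to be tested explicitly.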
A generic lexical template of the form illustrated in
Figure 4 can clearly be directly mapped into a feature
cluster within the features and feature set declarations
used by the dictionary and grammar projects. A comparison
of the existing entries for "believe" in the hand
crafted lexicon (Figure 1) and the third word sense for
"believe" extracted from LDOCE demonstrates that
much of the information available from LDOCE is of
direct utility -- for example the SUBCAT values can
be derived by an analysis of the Takes values and
the ORaising logical type specification above. Indeed,
we have demonstrated the feasibility (Alshawi
et al., 1985) of driving a parsing system directly from
the information available in LDOCE by constructing
dictionary entries for the PATR-II system (Shieber, 1984).
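One way to picture the mapping from a lexical template to feature clusters is as a table lookup on the Takes frame and the logical type. The sketch below is illustrative only: the table SUBCAT_TABLE, the function template_to_clusters and the particular pairings of frames with SUBCAT values are assumptions inferred from the figures, not a specification of the actual program, and the AuxInf alternatives of the template are ignored for brevity.

```python
# Features assumed to follow from the part of speech "verb".
VERB_BASE = {"V": "+", "N": "-", "BAR": "0"}

# Hypothetical mapping from (Takes frame, logical type) to a SUBCAT value.
SUBCAT_TABLE = {
    (("NP", "SBar"), None): "SFIN",
    (("NP", "NP", "Inf"), "ORaising"): "OR",
    (("NP", "NP", "Inf"), "OEqui"): "OE",
    (("NP", "NP", "NP"), "ORaising"): "NP_NP",
    (("NP", "NP", "AP"), "ORaising"): "NP_AP",
}


def template_to_clusters(frames):
    """Map (takes, logical_type) pairs to feature clusters.

    Returns (clusters, unmapped); unmapped frames are left for the
    lexicographer rather than guessed at.
    """
    clusters, unmapped = [], []
    for takes, logical_type in frames:
        subcat = SUBCAT_TABLE.get((tuple(takes), logical_type))
        if subcat is None:
            unmapped.append((takes, logical_type))
            continue
        cluster = dict(VERB_BASE)
        cluster["SUBCAT"] = subcat
        clusters.append(cluster)
    return clusters, unmapped


believe_sense_3 = [
    (("NP", "SBar"), None),
    (("NP", "NP", "Inf"), "ORaising"),
    (("NP", "NP", "NP"), "ORaising"),
    (("NP", "NP", "AP"), "ORaising"),
]
clusters, unmapped = template_to_clusters(believe_sense_3)
print([cluster["SUBCAT"] for cluster in clusters], unmapped)
```

Features such as LAT, which are not recoverable from the dictionary, are deliberately absent from the generated clusters.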
It is also clear, however, that it is unrealistic to 
expect that on the basis of only the information avail- 
able in the machine-readable source we will be able 
to derive a fully fleshed out lexical entry, capable of 
fulfilling all the run-time requirements of the analy- 
sis system that the lexicon under construction here is 
intended for. 
3.3 Utility of LDOCE 
for automatic lexicon generation 
Firstly, the information recoverable from LDOCE which 
is of direct utility is not totally reliable. Errors of 
omission and assignment occur in the dictionary;
for example, the entry for "consider" (Figure 6) lacks
a code allowing it to function in frames with sentential
complement (e.g. I consider that it is a great honour to
be here). The entry for "expect", on the other hand,
spuriously separates two very similar word senses (1 
and 5), assigning them different grammar codes. 
consider ... 2 [Wv6; X (to be) 1,7; V3]
to regard as; think of in a stated way:
I consider you a fool (= I regard you
as a fool). | I consider it a great honour to
be here with you today. | He said he considered
me (to be) too lazy to be a good
worker. | The Shetland Islands are usually
considered as part of Scotland ...
expect ... 1 [T3,5a,b] to think (that
something will happen): I expect (that)
he'll pass the examination. | He expects
to fail the examination. | "Will she come
soon?" "I expect so." ... 5 [V3] to
believe, hope and think (that someone
will do something): The officer expected
his men to do their duty in the coming battle ...
acknowledge ... 1 [T1,4,5 (to)] to agree to
the truth of; recognise the fact or existence
(of): I acknowledge the truth of
your statement. | They acknowledged (to
us) that they were defeated | They acknowledged
having been defeated 2 [T1
(as); X (to be) 1,7] to recognise, accept,
or admit (as): He was acknowledged to
be the best player. | They acknowledged
themselves (to be) defeated ...
Figure 6: Errors of omission and assignment in LDOCE
Errors like these ultimately cause the transformation
program to fail in the mapping of grammar codes
to feature clusters. We have limited our use of LDOCE 
to verb entries because these appear to be coded most 
carefully. However, the techniques outlined here are
equally applicable to other open class items. 
Furthermore, since some of the information required
is only recoverable on the basis of a comparison
of codes within a word sense specified in the source
dictionary, additional errors can be introduced. For
example, we assign ORaising to verbs which contain
subcategorisation frames for sentential complement,
a noun phrase object and an infinitive complement
within the same sense. However, this rule breaks
down in the case of an entry such as "acknowledge",
where the two codes corresponding to different subcategorisation
frames are split between two (spuriously
separated) word senses (Figure 6), and consequently
incorrectly assigns OEqui to this verb. Similarly, because
the entry for "consider" lacks a sentential complement
code, the rule breaks down and "consider" is incorrectly
assigned the logical type of an Equi verb.
We have tested the classification of verbs into semantic
types using a verb list of 139 pre-classified
items available in various published sources (e.g. Stockwell
et al., 1973). The overall error rate in the process
of grammar code analysis and transformation was
14%; however, the rules discussed above classify verbs
into SRaising, SEqui and OEqui very successfully.
The main source of error comes from the misclassification
of ORaising verbs as OEqui verbs. This was confirmed
by another test, involving applying the rules for
determining the semantic types of verbs over the 7,965
verb entries in LDOCE. The resulting lists, assigning
the 719 verb senses which have the potential for
predicate complementation into appropriate semantic
classes, confirm that errors in our procedure are
mostly localised to the (mis)application of the ORaising
rule. Arguably, these errors do derive mostly
from errors in the dictionary, rather than a defect of
the rule; see Boguraev and Briscoe (1987) for further
discussion.
Secondly, the analysis system requires information 
which is simply not encoded in the LDOCE entries; 
for example, the morphological features AT, LAT and 
BARE_ADJ are not there. This type of feature is critical
to the analysis of derivational variants, and such
information is necessary for the correct application of
the word grammar. Otherwise many morphologically
productive, but non-existent, lexical forms will be defined
and be potentially analysable by the lexicon system.
Therefore, lexical templates are not converted
directly to target lexical entries, but form the input to
a second phase in which errors and inadequacies in the
source MRD are corrected.
4 A methodology and a system
for lexicon development 
In order to provide for fast, simple, but accurate devel- 
opment of a lexicon for the analysis system we have im- 
plemented a software environment which is integrated 
with the transformation program described above and 
which offers an integrated morphological generation
package and editing facilities for the semi-automatic
production of the target lexicon. The system is designed
on the assumption that no machine-readable
dictionary can provide a complete, consistent, and to- 
tally accurate source of lexical information. Therefore, 
rather than batch process the MRD source, the lexicon 
development software is based around the concept of 
semi-automatic and rapid construction of entries, in- 
volving the continuous intervention of the end user, 
typically a linguist / lexicographer. 
In the course of an interactive cycle of development,
a number of entries are hypothesised and automatically
generated from a single base form. The family
of related surface forms is output by the morphological
generator, which employs the same word grammar
used for inflectional and derivational morphology by
the analysis system and creates new entries by adding
affixes to the base form in legitimate ways. The generation
and refinement of new entries is based on repeated
application of the morphological generator to
suitable base forms, followed by user intervention involving
either rejecting, or minimally editing, the surface
forms proposed by the system. Below we sketch
a typical pattern of use.
If the user asks the system to create an entry for 
"believe", the transformation program described in
section 3.2 (see Figure 4) will create an entry which 
contains all the syntactic information specified in Fig- 
ure 1. In addition, many surface forms with associated 
grammatical definitions will be generated automati- 
cally: 
cobelieve     overbelieve  subbelieve    believed
disbelieve    postbelieve  unbelieve     believee
interbelieve  prebelieve   underbelieve  believer
misbelieve    rebelieve    believable    believing
outbelieve    semibelieve  believal      believes
Figure 7: Derivational variants of "believe"
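The overgenerating behaviour of such a generator can be imitated in a few lines. This Python sketch is not the word grammar used by the analyser: the affix lists are a small hand-picked subset, the single spelling rule (drop a final "e" before a vowel-initial suffix) is a simplifying assumption, and the names attach_suffix, candidate_forms and review are hypothetical.

```python
VOWELS = "aeiou"
PREFIXES = ("co", "dis", "mis", "re", "un")      # illustrative subset
SUFFIXES = ("ed", "er", "ing", "s", "able")      # illustrative subset


def attach_suffix(base, suffix):
    """Join base and suffix, dropping a final 'e' before a vowel."""
    if base.endswith("e") and suffix[0] in VOWELS:
        return base[:-1] + suffix
    return base + suffix


def candidate_forms(base):
    """Hypothesise surface forms for a base; many will not exist."""
    prefixed = [prefix + base for prefix in PREFIXES]
    suffixed = [attach_suffix(base, suffix) for suffix in SUFFIXES]
    return prefixed + suffixed


def review(candidates, rejected):
    """Keep only the candidates the user has not rejected outright."""
    return [form for form in candidates if form not in rejected]


forms = candidate_forms("believe")
print(forms)
print(review(forms, rejected={"cobelieve", "unbelieve", "rebelieve"}))
```

The point of the sketch is the division of labour: generation is cheap and deliberately liberal, while the judgement about which forms to keep is left to the user.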
The system generates these forms from the base
entry in batches and displays the results in syntactic
frames associated with subcategorisation possibilities.
These frames, which are used to tap the user's grammaticality
judgements, are as semantically 'bleached'
as possible, so that they will be as compatible as possible
with the semantic restrictions that verbs place
on their arguments. Each possible SUBCAT feature
value in the grammar is associated with such frames,
for example:
SFIN:   They [ ... ] that someone is something
OR:     They [ ... ] there to be a problem
OE:     They [ ... ] someone to be something
      * They [ ... ] there to be a problem
Figure 8: Syntactic subcategorisation frames 
Internally, frames are more complex than illus- 
trated above. Surface phrasal forms with marked slots 
in them are associated with more detailed feature spec- 
ifications of lexical categories which are compatible 
with the fully instantiated lexical items allowed by the
grammar to fill the slots. Such detailed frame speci- 
fications are automatically generated on the basis of 
syntactic analysis of sentences made up from the frame 
phrase skeleton with valid lexical items substituted for 
the blank slot filler. Figure 9 below shows a fragment 
of the system's inventory of frames. 
They [ ... ] that someone is something.
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT SFIN]
They [ ... ] someone to be something.
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT OE]
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT OR]
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT SE2]
They [ ... ] there to be a problem.
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT SE2]
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT OR]
* They [ ... ] there to be a problem.
  [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM
   NORM, PER 3, PLU +, COUNT +, CASE NOM],
   SUBCAT OE]
Figure 9: Complete syntactic frames 
The system ensures that slots in syntactic frames 
are filled by surface forms which have the syntactic 
features the sentence grammar requires. Displaying 
such instantiated frames provides a double check both 
on the outright correctness of the surface form and on 
the correctness of a surface form paired with a particular
definition. For example, the user can reject They
overbelieve that someone is something completely, but
They believes that someone is something is indicative of
an incorrect definition, rather than surface form. Syntactic
frames encoding other 'transformational' possibilities
are often associated with particular SUBCAT
values since these provide the user with more helpful
data to accept or reject a particular assignment. Thus
for example selecting between ORaising and OEqui
verbs is made easier if the frames for [SUBCAT OR]
are instantiated simultaneously:
They believe someone to be something /
They persuade someone to be something
They believe there to be a problem /
They persuade there to be a problem
Figure 10: SUBCAT value selection 
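The choice between the two values can be sketched as a decision on the user's judgements. This is a hedged illustration, not the system's code: the function name `choose_subcat` and its boolean arguments are assumptions, and the logic simply encodes the pattern in the frame inventory, where the "there to be a problem" frame is acceptable with OR but starred with OE.

```python
def choose_subcat(accepts_someone_frame, accepts_there_frame):
    """Pick between OR (Raising) and OE (OEqui) from two judgements.

    Both arguments are the user's accept/reject decisions on the
    instantiated frames; None means neither value is supported.
    """
    if not accepts_someone_frame:
        # The verb takes no object-plus-infinitive complement at all.
        return None
    # Only Raising verbs tolerate the expletive 'there' frame.
    return "OR" if accepts_there_frame else "OE"


print(choose_subcat(True, True))
print(choose_subcat(True, False))
```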
The user has two broad options: to reject a set of 
frames and associated surface form outright or to edit 
either the surface form or definition associated with 
a set of frames. Exercising the first option causes all 
instances of the surface form and associated syntactic 
frames to be removed from the screen and from fur- 
ther consideration by the user. However, this action 
has no effect on the eventual output of the system, 
so these morphologically productive but non-existent 
forms and definitions will still be implicit in the lex- 
icon and morphology component of the English anal- 
yser. It is assumed that this overgeneration is harm- 
less though, because such forms will not occur in ac- 
tual input. 
Editing a surface form or associated definition results
in a new (non-productive) entry which will form
part of the system's output to be included as an in- 
dependent irregular entry in the target lexicon. If the 
user edits a surface form, the edited version is substi- 
tuted in all the relevant syntactic frames. Provided 
the user is satisfied with the modified frames, a new 
entry is created with the new surface form, treated as 
an indivisible morpheme, and paired with the existing 
definition. Similarly, if the user edits a definition as- 
sociated with a set of syntactic frames, a new set of 
frames will be constructed and if he or she is happy 
with these, a new entry will be created with existing 
surface form and modified definition. (The English 
analyser can be run in a mode where non-productive
separate entries are 'preferred' to productive ones.) 
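A minimal sketch of such a preference is given below. The paper does not describe how the analyser implements this mode, so the `Entry` class, the `lookup` function and the toy generator are all hypothetical names introduced only to illustrate the idea that an explicit, edited entry shadows a productively generated one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    surface: str      # surface form, e.g. "belief"
    definition: str   # stand-in for the feature-based definition
    productive: bool  # True if derived by morphological rule


def lookup(surface, irregular, generate):
    """Prefer explicit non-productive entries; otherwise fall back to
    whatever the (hypothetical) productive generator proposes."""
    explicit = [e for e in irregular if e.surface == surface]
    return explicit if explicit else generate(surface)


def toy_generator(surface):
    # Stand-in for the derivational generator: always proposes one entry.
    return [Entry(surface, "generated", productive=True)]


irregular = [Entry("belief", "edited by user", productive=False)]
print(lookup("belief", irregular, toy_generator))
print(lookup("believe", irregular, toy_generator))
```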
The user can modify both the surface form and 
the associated definition during one interaction with a 
particular potential entry; for example, the definition 
for "believal" contains both an incorrect surface form
and definition for a nominal form of the base form
"believe". After the associated syntactic frames are
displayed to the user, instead of rejecting the entire
entry at this point, he or she can modify the surface
form to create a new entry for "belief" -- a process
which results in the revised syntactic frames: 
[Figure content illegible in source: instantiated frames for "belief", including a frame of the form "... someone to be something"]
Figure 11: Frame-based refinement of "belief"
The user now has three options: rejecting the third
syntactic frame, or alternatively deleting the associated
sub-entry with a [SUBCAT OR] feature definition,
followed by confirmation, will result in the construction
of a new entry for the lexicon. The third
option, should the user decide that nominal forms 
never take OR complements, is to edit the morpho- 
logical rules themselves. This option is more radical 
and would presumably only be exercised when the user 
was certain about the linguistic data. 
The system described so far allows the semi-automatic,
computer-aided production of base entries and
irregular, non-productive derived entries on the basis
of selection and editing of candidate surface forms
and definitions thrown up by the derivational generator.
However, this approach is only as good as the
initial base entry constructed from LDOCE. If the 
base entry is inadequate, the predictions produced by 
the generator are likely to be inadequate too. This 
will result in too much editing for the system to be 
much help in the rapid production of a sizeable lexi- 
con. Fortunately, the system of syntactic frames and 
editing facilities outlined above can also be used to re- 
fine base entries and make up for inadequacies in the 
LDOCE grammar code system (from the perspective 
of the target grammar). For example, LDOCE encodes
transitivity adequately but does not represent
systematically whether a particular transitive has a 
passive form. In the target grammar, there are two 
SUBCAT values NP and NOPASS which distinguish 
these types of verb. Therefore, all verbs with a transitive
LDOCE code are inserted into the two sets of
syntactic frames shown below. When these frames are
instantiated with particular verbs, rejection of one or
other is enough to refine the LDOCE code to the appropriate
SUBCAT value. For example, the instantiated
frames for "cost" are:
NP:
    They [ ... ] that
    Those are [ ... ] by them
    Those are cost by them
NOPASS:
    They [ ... ] that
  * Those are [ ... ] by them
  * Those are cost by them
Figure 12: The SUBCAT / NOPASS distinction 
The fact that "cost" does not fit into the NP passive
(second) frame, behaving in a way compatible
with the NOPASS predictions, means it acquires a 
NOPASS SUBCAT value. Since these frames will be 
displayed first and the operation changes the base en- 
try, subsequent forms and definitions generated by the 
system will be based on the new edited base entry. 
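The refinement step for transitive verbs amounts to a single decision on the passive frame. The sketch below illustrates this under stated assumptions: the dictionary layout of an entry and the name `refine_transitive` are hypothetical, and only the two SUBCAT values NP and NOPASS named in the text are used.

```python
def refine_transitive(entry, accepts_passive):
    """Return a copy of a transitive base entry with its SUBCAT value
    set from the user's judgement on the passive frame."""
    refined = dict(entry)  # leave the original base entry untouched
    refined["SUBCAT"] = "NP" if accepts_passive else "NOPASS"
    return refined


# Assumed starting point: a verb with a transitive LDOCE code.
base = {"surface": "cost", "SUBCAT": "NP"}

# The user rejects "Those are cost by them", so the passive is not accepted.
refined = refine_transitive(base, accepts_passive=False)
print(refined["SUBCAT"])
```

Any later derived forms would then be generated from `refined` rather than from `base`, mirroring the way the edited base entry feeds subsequent predictions.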
This example also highlights one of the inherent
problems in our approach to lexicon development.
Syntactic frames are used in preference to direct pe- 
rusal of definitions in terms of feature lists to speed up 
lexicon development by tapping the user's grammaticality
judgements directly and to reduce the amount
of editing and keyboard input. They also provide the 
user with a degree of insulation from the technical 
details of the morphological and syntactic formalism. 
However, semantically 'bleached' frames can lead to
confusion when they interact with word sense ambiguity.
For example, "weigh" has two senses, one of
which allows passive and one of which does not (compare
The baby was weighed by the doctor with * Ten
pounds was weighed by the baby).
Unfortunately, the syntactic frames given for NP /
NOPASS are not 'bleached' enough because they tend
to select the sense of "weigh" which does allow passive.
The example raises wider issues about the integration 
of some treatment of word meaning with the produc- 
tion of such a lexicon. These issues go beyond this 
paper, but the problem illustrated demonstrates that
the techniques we have described are heuristic
aids rather than failsafe procedures for the rapid
construction of a sizeable and accurate lexicon from a
machine-readable dictionary of variable accuracy and
consistency.
5 Conclusion 
Practical natural language applications require vocab- 
ularies substantially larger than those typically devel- 
oped for theoretical or demonstration purposes and 
hand crafting these is often not feasible, and certainly
never desirable. The evaluation of the LDOCE grammar
coding system suggests that it is sufficiently detailed
and accurate (for verbs) to make the on-line production
of the syntactic component of lexical entries
both viable and labour saving. However, the less than 
100% accuracy of the code assignments in the source 
dictionary suggests that a system using the machine- 
readable version for lexicon development must embody 
a methodology allowing rapid, interactive and semi-automatic
generation and testing of lexical entries on
a large scale.
We have outlined a lexicon development environ- 
ment, which embodies a practical approach to using 
an existing MRD for the construction of a substantial 
computerised lexicon. The system splits the derivation
of target lexical entries into two phases: an automatic
transformation of the source data into a formalised
lexical template containing as much relevant
information as can be derived (directly or indirectly), 
followed by semi-automatic correction and refinement 
of this template into a set of base and irregular target 
entries. 
6 Acknowledgements 
This work was supported by research grants (Numbers
GR/D/4217.7 and GR/D/05554) from the UK
Science and Engineering Research Council under the
Alvey Programme. We are grateful to the Longman
Group Limited for kindly allowing us access to the 
typesetting tape of the Longman Dictionary of Con- 
temporary English for research purposes. 
