The ACQUILEX LKB: representation issues in semi-automatic 
acquisition of large lexicons 
Ann Copestake 
University of Cambridge Computer Laboratory 
New Museums Site, Pembroke Street, Cambridge, CB2 3QG, UK 
Ann.Copestake@cl.cam.ac.uk 
Abstract 
We describe the lexical knowledge base sys- 
tem (LKB) which has been designed and im- 
plemented as part of the ACQUILEX project 1 
to allow the representation of multilingual syntactic
and semantic information extracted from
machine readable dictionaries (MRDs), in such 
a way that it is usable by natural language 
processing (NLP) systems. The LKB's lex- 
ical representation language (LRL) augments 
typed graph-based unification with default in- 
heritance, formalised in terms of default unifi- 
cation of feature structures. We evaluate how 
well the LRL meets the practical requirements 
arising from the semi-automatic construction of 
a large scale, multilingual lexicon. The system 
as described is fully implemented and is being 
used to represent substantial amounts of infor- 
mation automatically extracted from MRDs. 
1 Introduction 
The ACQUILEX LKB is designed to support repre- 
sentation of multilingual lexical information extracted 
from machine readable dictionaries (MRDs) in such a 
way that it can be utilised by NLP systems. In con- 
trast to lexical database systems (LDBs) or thesaurus- 
like representations (e.g. Alshawi et al., 1989; Calzolari, 
1988) which represent extracted data in such a way as 
to support browsing and querying, our goal is to build 
a knowledge base which can be used as a highly struc- 
tured reusable lexicon, albeit one much richer in lexi- 
cal semantic information than those commonly used in 
NLP. Thus, although we are using information which 
has been derived from MRDs (possibly after consider- 
able processing involving some human intervention), our 
aim is not to represent the dictionary entries themselves. 
Our methodology is to store the dictionary entries and 
raw extracted data in our LDB (Carroll, 1990) and to 
use this information to build LKB entries which could 
be directly utilised by an NLP system. Briscoe (1991) 
discusses the LDB/LKB distinction in more detail and 
describes the ACQUILEX project as a whole. 
1 'The Acquisition of lexical knowledge for Natural Language
Processing systems' (Esprit BRA-3030)
Practical NLP systems need large lexicons. Even in 
cases such as database front ends, where the domain of 
the application is highly restricted, a practical natural 
language interface must be able to cope with an exten- 
sive vocabulary, in order to respond helpfully to a user 
who lacks domain knowledge, for example. For applica- 
tions such as text-to-speech synthesis, interfaces to large- 
scale knowledge based systems, summarising and so on, 
large lexicons are clearly needed; for machine translation
the requirement is for a large scale, multilingual
lexical resource. Acquisition of such information is a 
serious bottleneck in building NLP systems, and MRD 
sources currently seem the most promising source for 
semi-automatically acquiring the syntactic and semantic 
information needed. 
Previous work on extracting and representing syntac- 
tic information includes the work done on the Alvey 
Tools lexicon project (Carroll and Grover 1989) in which 
a large scale lexicon was produced semi-automatically 
from LDOCE (Longman Dictionary of Contemporary 
English, Procter, 1978) using a feature and unification 
based representation. There has been considerable discussion
and some implementation of LKBs for the representation
of semantic information extracted from MRDs
(e.g. Boguraev and Levin, 1990; Wilks et al., 1989). However
the knowledge representation languages assumed
are rarely described formally; typically a semantic net- 
work or a frame representation has been suggested, but 
the interpretation and functionality of the links has been 
left vague. Several networks based on taxonomies have 
been built, and these are useful for tasks such as
sense-disambiguation, but are not directly utilisable as NLP
lexicons. For a reusable lexicon, a declarative, formally
specified, representation language is essential.
In the ACQUILEX project we are concerned with the 
extraction and representation of both syntactic and lexical
semantic information. A common representation language
is needed, to allow the interaction of lexical semantic
and syntactic properties to be described. There
is currently a considerable amount of work in lexical semantics
where unification based formalisms are used to
represent this interaction (e.g. Briscoe et al.'s (1990)
account of logical metonymy (Pustejovsky, 1989, 1991);
Sanfilippo's (1990) representation of thematic and aspectual
information). However we also wish to structure the
lexicon, in order to link lexical entries. This is essential
since we are ultimately considering lexicons with maybe 
100,000 entries for each language. Although the aim of 
the ACQUILEX project is to determine the feasibility of 
using MRD sources, rather than attempting to build a 
lexicon of such size, we nevertheless need an LKB which 
can cope with tens of thousands of entries. 
There are currently several approaches to develop- 
ing representation languages which allow the lexicon 
to be structured, in particular by inheritance. These 
include object-oriented approaches (Daelemans, 1990), 
and DATR (Evans and Gazdar, 1990). We chose to 
use a graph unification based representation language 
for the LKB, because this offered the flexibility to rep- 
resent both syntactic and semantic information in a way 
which could be easily integrated with much current work 
on unification grammar, parsing and generation. In con- 
trast to DATR for example, the LKB's representation 
language (LRL) is not specific to lexical representation. 
This made it much easier to incorporate a parser in the 
LKB (for testing lexical entries) and to experiment with 
notions such as lexical rules and interlingual links be- 
tween lexical entries. Although this means that the LRL 
is in a sense too general for its main application, the typ- 
ing system provides a way of constraining the represen- 
tations, and the implementation can then be made more 
efficient by taking advantage of such constraints. 
Our typed feature structure mechanism is based on 
Carpenter's work on the HPSG formalism (Carpenter 
1990) although there are some significant differences. 
We augment the formalism with the more flexible psort 
inheritance mechanism, which allows for default inheri- 
tance. Much of the motivation behind this comes from 
consideration of the sense-disambiguated taxonomies 
semi-automatically derived from MRDs, which we are 
using to structure the LKB (see Copestake 1990). The 
notion of types, and features appropriate for a given 
type, gives some of the properties of frame representa- 
tion languages, and allows us to provide a well-defined, 
declarative representation, which integrates relatively 
straightforwardly with much current work on natural 
language processing and lexical semantics. 
Thus the operations that the LKB supports are (de- 
fault) inheritance, (default) unification, and lexical rule 
application. It does not support any more general forms 
of inference and is thus designed specifically to support 
processes which concern lexical rather than general rea- 
soning. In the rest of this paper we first informally intro- 
duce the way in which lexical entries are represented in 
the LKB. We then describe the LRL, and discuss how the 
design of the default inheritance system was influenced 
by the application. (A fuller and more formal account 
of the LRL appears in papers in Briscoe et al., forth- 
coming.) We conclude with an overview of the actual 
implementation and a discussion of the utility of typed 
feature structures and the psort mechanism in practice.
2 Lexical entries in the LKB 
Consider Figure 1, which is a screen dump of the 
LKB system showing part of a file containing a semi- 
automatically generated lexical entry for the Dutch noun 
kippevlees (chicken meat) (top right of figure), the fea- 
ture structure (FS) representation of that description 
(bottom right) and the fully typed feature structure 
into which it is expanded in the LKB (left of figure). 
See Vossen (1991) for the details of the generation of 
this entry. Features are shown uppercased, types are 
in lowercase bold, reentrancy is indicated by numbers 
in angle brackets. The identifier for the lexical entry, 
kippevlees_V_0_1, indicates that it corresponds to the
sense kippevlees 1 in the Van Dale dictionary. The un- 
expanded lexical entry is relatively compact, but a large 
amount of information is inherited via the type and psort 
systems. The expanded lexical entry is not shown com- 
pletely; the entry's syntactic type is noun-cat, and the 
box round this indicates that its internal structure is 
not displayed. The same applies to the sense-id infor- 
mation (which enables the corresponding LDB entry to 
be accessed) and the argument structure. Figure 2 (left) 
shows the type lex-uncount-noun, which determines 
the basic skeleton of the entry. Feature structures (called 
constraints) are associated with types and inherited by 
all FSs of a particular type. Thus the form of this lexi- 
cal entry is due to the constraint on lex-uncount-noun 
shown in the figure. 
Default inheritance from the lexical semantic structure
for the lexical entry for vlees_V_0_1 augments the
type information for the entry for kippevlees. We encode 
a relatively rich lexical semantic structure for nouns (re- 
ferred to as the 'relativised qualia structure', RQS) based 
on the notion of qualia structure, described by Puste- 
jovsky (1989, 1991). Noun lexical entries are parsed to 
yield a genus term, vlees in this case, and differentia. 
The genus term is normally interpreted in LKB terms 
as specifying the lexical entry from which information is 
inherited by default; as explained in Section 4 this also 
partially defines the lexical semantic type (RQS type), 
which is c_nat_subst in this example (for comestible, 
natural, substance). A fragment of the RQS type hierarchy
is also shown in Figure 2 (see footnote 2). The differentia can be
partially interpreted relative to the RQS type; in this ex- 
ample < rqs : origin > = "kip" is an indication that 
kippevlees comes from kip; eventually this will allow their 
lexical entries to be linked automatically by the appro- 
priate lexical rule (Copestake and Briscoe, 1991; Copes- 
take et al., 1992). The feature ORIGIN is introduced at 
type natural (the FS definition of natural is shown in 
Figure 2, top right, before expansion by inheritance from 
nomrqs). Since natural is a parent of c_nat_subst, 
ORIGIN is an appropriate feature for c_nat_subst.
The feature TELIC is used to provide a slot for the se- 
mantics of the verb sense which is associated with the 
purpose of an entity (eating in this case). The way in 
which such a representation may be used in the treatment
of logical metonymy was described in Briscoe et al.
(1990). Other features (such as PHYSICAL-STATE) are
used to encode information which is useful for applica- 
tions such as sense-disambiguation. This attempt to rep- 
resent detailed lexical semantic information illustrates a 
general principle of the ACQUILEX project; such lexical 
2 Unlike Carpenter (1990) we adopt a notation with the
most general type at the top of any diagram, because this
seems more natural to the main users of the system.
[Figure 1: A lexical entry. LKB screen dump with three panels: the source file entry for kippevlees_V_0_1, "kippevlees - definition", and "kippevlees - expanded" (the expanded psort).]
entries are usable by a wide range of NLP systems be- 
cause they are relatively rich and detailed; applications 
which do not make use of detailed lexical semantic information
can simply discard the information. Clearly the
converse is not true, and a more impoverished represen- 
tation would be less generally useful. We thus aim for 
representations which are as rich as possible in informa- 
tion which we can extract automatically, and represent 
formally, but which are also well motivated linguistically 
and/or useful for practical NLP applications. This also 
applies to our use of thematic roles in the semantics; see 
the examples of LKB entries for verbs given in Sanfilippo 
and Poznanski (1992, this volume). 
3 The type system 
In the definition of a type hierarchy we follow Carpenter
(1990) very closely. The type hierarchy defines a partial
order (notated $\sqsubseteq$, "is more specific than") on the
types and specifies which types are consistent. Only
FSs with mutually consistent types can be unified --
two types which are unordered in the hierarchy are assumed
to be inconsistent unless the user explicitly specifies
a common subtype. Every consistent set of types
$S \subseteq \mathrm{TYPE}$ must have a unique greatest lower bound or
meet (notation $\sqcap S$; see footnote 3). This condition allows FSs to be
typed deterministically -- if two FSs of types $a$ and $b$
are unified the type of the result will be $a \sqcap b$, which
must be unique if it exists. If $a \sqcap b$ does not exist unification
fails. In the fragment of a type hierarchy shown in
Figure 2 c_natural and natural_substance are consistent;
c_natural $\sqcap$ natural_substance = c_nat_subst.
Because the type hierarchy is a partial order it has properties
of reflexivity, transitivity and anti-symmetry (from
which it follows that the type hierarchy cannot contain
cycles).
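The meet operation described above can be illustrated with a minimal Python sketch. The hierarchy fragment below is hypothetical: it is only loosely modelled on Figure 2, and the type natural_physical and the exact parent links are assumptions made so that the greatest lower bound condition holds. The function names are likewise illustrative and are not taken from the LKB implementation.

```python
# Hypothetical fragment of a type hierarchy (an assumption, not the LKB's own).
# Each type maps to its immediate parents; the most general type has none.
PARENTS = {
    "nomrqs": [],
    "natural": ["nomrqs"],
    "physical": ["nomrqs"],
    "natural_physical": ["natural", "physical"],
    "c_natural": ["natural"],
    "natural_substance": ["natural_physical"],
    "creature": ["natural_physical"],
    "c_nat_subst": ["c_natural", "natural_substance"],
}


def ancestors(t):
    """Return t plus every type above it (reflexive, transitive closure)."""
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_more_specific(a, b):
    """True if a is at least as specific as b in the hierarchy."""
    return b in ancestors(a)


def meet(a, b):
    """Greatest lower bound of a and b, or None if they are inconsistent."""
    lower = [t for t in PARENTS
             if is_more_specific(t, a) and is_more_specific(t, b)]
    if not lower:
        return None  # no common subtype, so unification would fail
    # A greatest lower bound is a lower bound that every lower bound is below.
    greatest = [t for t in lower
                if all(is_more_specific(u, t) for u in lower)]
    if len(greatest) != 1:
        raise ValueError(f"no unique greatest lower bound for {a} and {b}")
    return greatest[0]
```

Under these assumptions, `meet("c_natural", "natural_substance")` yields `"c_nat_subst"`, while `meet("creature", "c_natural")` yields `None` because the two types have no common subtype.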
We define a typed feature structure as a tuple $F =
\langle Q, q_0, \delta, \theta \rangle$, where the only difference from the untyped
case is that every node of a typed FS has a type, $\theta(q)$.
The type of a FS is the type of its initial node, $\theta(q_0)$.
The definition of subsumption of typed FSs is very similar
to that for untyped FSs, with the additional proviso
that the ordering must be consistent with the ordering of
their types. We thus overload the symbol $\sqsubseteq$
("is-more-specific-than", "is-subsumed-by") to express subsumption
of FSs as well as the ordering on the type hierarchy.
Thus if $F_1$ and $F_2$ are FSs of types $t_1$ and $t_2$ respectively,
then $F_1 \sqsubseteq F_2$ only if $t_1 \sqsubseteq t_2$.
3 In order to check the type hierarchy for uniqueness of
greatest lower bounds we carry out a pairwise comparison of
types with multiple parents to see if they have a unique lowest
greater bound. Since the number of types with multiple
parents is typically much less than the total number of types
this is considerably more efficient than carrying out pairwise
comparisons on all the types in the hierarchy.
[Figure 2: Types. LKB screen dump with three panels: "lex-uncount-noun - expanded", "natural - definition", and "Type hierarchy".]
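As a rough illustration of how the type ordering constrains subsumption of FSs, the following Python sketch checks subsumption for tree-shaped typed FSs. It is a simplification under stated assumptions: reentrancy is not modelled, the small hierarchy is hypothetical, and an FS is represented as a pair of a type name and a dictionary of feature values, which is not the LKB's own data structure.

```python
# Hypothetical hierarchy: each type maps to its immediate parents.
PARENTS = {
    "top": [],
    "nomrqs": ["top"],
    "natural": ["nomrqs"],
    "string": ["top"],
    "kip": ["string"],  # a string value treated as a subtype of string
}


def type_leq(a, b):
    """True if type a is at least as specific as type b."""
    return a == b or any(type_leq(p, b) for p in PARENTS[a])


def fs_leq(f1, f2):
    """True if FS f1 is subsumed by (is at least as specific as) FS f2.

    An FS is a pair (type, {feature: FS}); reentrancy is not modelled.
    """
    t1, feats1 = f1
    t2, feats2 = f2
    if not type_leq(t1, t2):
        return False  # the ordering on FSs must respect the type ordering
    # Every feature of the more general FS must be present, with a
    # subsumed value, in the more specific one.
    return all(f in feats1 and fs_leq(feats1[f], v)
               for f, v in feats2.items())


general = ("nomrqs", {})
specific = ("natural", {"ORIGIN": ("kip", {})})
```

With these definitions `fs_leq(specific, general)` holds but `fs_leq(general, specific)` does not, since nomrqs is not more specific than natural.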
3.1 Constraints 
Our system differs somewhat from that described by 
Carpenter in that we adopt a different notion of well- 
formedness of typed feature structures. In our system 
every type must have exactly one associated FS which 
acts as a constraint on all FSs of that type, by subsuming
all well-formed FSs of that type. The constraint also de- 
fines which features are appropriate for a particular type; 
a well-formed FS may only contain appropriate features. 
Constraints are inherited by all subtypes of a type, but 
a subtype may introduce new features (which will be in- 
herited as appropriate features by all its subtypes). A 
constraint on a type is a well-formed FS of that type; all 
constraints must therefore be mutually consistent. 
Features may only be introduced at one point in the 
type hierarchy (cf Carpenter's minimal introduction). 
Because of the condition that any consistent set of types 
must have a unique greatest lower bound, it is also the 
case that sets of features will become valid at unique 
greatest points in the type hierarchy. This allows under- 
typed feature structures to be introduced into the system 
by the user which are then given the most general possi- 
ble type. The importance of this form of type inference 
for our application is discussed in Section 5.2, below. 
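The type inference step described above can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical hierarchy and hypothetical introduction points for TELIC and PHYSICAL-STATE; only the introduction of ORIGIN at natural is taken from the text. The sketch finds the most general type for which a given set of features is appropriate.

```python
# Hypothetical hierarchy: each type maps to its immediate parents.
PARENTS = {
    "nomrqs": [],
    "natural": ["nomrqs"],
    "physical": ["nomrqs"],
    "natural_physical": ["natural", "physical"],
}

# Features introduced at each type. Only ORIGIN at natural comes from the
# paper; the other two introduction points are assumptions.
INTRODUCED = {
    "nomrqs": {"TELIC"},
    "natural": {"ORIGIN"},
    "physical": {"PHYSICAL-STATE"},
}


def ancestors(t):
    """Return t plus every type above it."""
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def appfeat(t):
    """Appropriate features of t: those introduced at t or inherited."""
    feats = set()
    for a in ancestors(t):
        feats |= INTRODUCED.get(a, set())
    return feats


def infer_type(features):
    """Most general type for which all the given features are appropriate."""
    candidates = [t for t in PARENTS if set(features) <= appfeat(t)]
    most_general = [t for t in candidates
                    if not any(u != t and u in ancestors(t) for u in candidates)]
    if len(most_general) != 1:
        raise ValueError(f"cannot infer a unique type for {sorted(features)}")
    return most_general[0]
```

For example, under these assumptions an FS mentioning only ORIGIN is typed as natural, whereas one mentioning both ORIGIN and PHYSICAL-STATE is typed as natural_physical.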
Constraints are given by the function
$C : \langle \mathrm{TYPE}, \sqsubseteq \rangle \to \mathcal{F}$
where $\mathcal{F}$ is the set of FSs. $C(t)$ denotes the constraint FS
associated with type $t$. We define the notion of appropriate
features as follows:
Definition 1 If $C(t) = \langle Q, q_0, \delta, \theta \rangle$ we define
$\mathit{Appfeat}(t) = \mathit{Feat}(q_0)$ where we define $\mathit{Feat}(q)$ to be
the set of features labelling transitions from the node $q$
such that $f \in \mathit{Feat}(q)$ if $\delta(f, q)$ is defined.
The conditions on the constraint function are as follows:
Monotonicity Given types $t_1$ and $t_2$, if $t_1 \sqsubseteq t_2$ then
$C(t_1) \sqsubseteq C(t_2)$
Type For a given type $t$, if $C(t)$ is the FS $\langle Q, q_0, \delta, \theta \rangle$
then $\theta(q_0) = t$.
Consistency of constraints For all $q \in Q$, we have
that $F' = \langle Q', q, \delta, \theta \rangle \sqsubseteq C(\theta(q))$ and $\mathit{Feat}(q) =
\mathit{Appfeat}(\theta(q))$.
We therefore disallow any occurrence of $t$ in a substructure
of $C(t)$, thus if $C(t) = \langle Q, q_0, \delta, \theta \rangle$ then for
all $q \in Q$, $q \neq q_0$ implies that $\theta(q) \neq t$. Since we disallow
cycles in FSs such a constraint could only be
satisfied by an infinite FS, which is also disallowed.
Maximal introduction of features For every feature
$f \in \mathrm{FEAT}$ there is a unique type $t =
\mathit{Maxtype}(f)$ such that $f \in \mathit{Appfeat}(t)$ and
there is no type $s$ such that $t \sqsubset s$ and $f \in
\mathit{Appfeat}(s)$. The maximal appropriate value of a
feature $\mathit{Maxappval}(f)$ is the type $t$ such that if
$C(\mathit{Maxtype}(f)) = \langle Q, q_0, \delta, \theta \rangle$ then $t = \theta(\delta(f, q_0))$.
Definition 2 We say that a given FS $F = \langle Q, q_0, \delta, \theta \rangle$
is a well-formed FS iff for all $q \in Q$, we have that $F' =
\langle Q', q, \delta, \theta \rangle \sqsubseteq C(\theta(q))$ and $\mathit{Feat}(q) = \mathit{Appfeat}(\theta(q))$.
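A simplified well-formedness check in the spirit of Definition 2 might look like the Python sketch below. It rests on several assumptions: FSs are tree-shaped (no reentrancy), each constraint is reduced to a table giving the most general value type allowed for each introduced feature, and the hierarchy and the value type of ORIGIN are hypothetical. It is therefore an approximation of the definition rather than an implementation of it.

```python
# Hypothetical hierarchy: each type maps to its immediate parents.
PARENTS = {
    "top": [],
    "string": ["top"],
    "nomrqs": ["top"],
    "natural": ["nomrqs"],
}

# Simplified constraints (an assumption): for each type, the features it
# introduces and the most general type allowed as their value.
INTRODUCED = {"natural": {"ORIGIN": "string"}}


def ancestors(t):
    """Return t plus every type above it."""
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def type_leq(a, b):
    """True if type a is at least as specific as type b."""
    return b in ancestors(a)


def appropriate(t):
    """Appropriate features of t with their value types, including inherited ones."""
    spec = {}
    for a in ancestors(t):
        spec.update(INTRODUCED.get(a, {}))
    return spec


def well_formed(fs):
    """Check Feat(q) = Appfeat(type of q) and value types at every node."""
    t, feats = fs
    spec = appropriate(t)
    if set(feats) != set(spec):
        return False  # a feature is missing or is not appropriate for t
    return all(type_leq(value[0], spec[f]) and well_formed(value)
               for f, value in feats.items())
```

Under these assumptions an FS of type natural with a string-valued ORIGIN is accepted, while ORIGIN on an FS of type nomrqs is rejected as inappropriate.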
Carpenter separates the notion of typing and con- 
straints. This allows a more powerful constraint lan- 
guage, but complicates the system. Since the users of 
the LKB were initially not familiar with feature struc- 
ture representations it was important to keep the sys- 
tem as simple as possible, and in practice we have not 
yet found the additional power of Carpenter's constraint 
language necessary. 
Some relatively minor extensions to the formalism al- 
low the implementation of some cooccurrence restric- 
tions and the disjunction of atomic types. It is necessary 
to allow types with string values, representing orthogra- 
phy for example, to be introduced as needed rather than 
predefined; we therefore define an atomic type string 
which is allowed to have any string as a subtype without 
these being explicitly specified. All subtypes of string 
are taken to be disjoint. 
4 Default inheritance and taxonomies 
We extend the typed FS system with default inheritance. 
FSs may be specified as inheriting by default from one 
or more other (well-formed) FSs which we refer to in 
this context as psorts. Psorts may correspond to (parts 
of) lexical entries or be specially defined. Since psorts 
may themselves inherit information, default inheritance 
(notated by <, "inherits from") in effect operates over a 
hierarchy of psorts. We prohibit cycles in the inheritance 
ordering. Inheritance order must correspond to the type 
hierarchy order. 
$p_1 < p_2 \Rightarrow \mathit{Typeof}(p_1) \sqsubseteq \mathit{Typeof}(p_2)$
where $p_1$ and $p_2$ are psorts
The typing system thus restricts default inheritance es- 
sentially to the filling in of values for features which are 
defined by the type system. 
Default inheritance is implemented by a version of de- 
fault unification, for a detailed discussion of which see 
Carpenter (1991, forthcoming). In default unification, 
unlike ordinary unification, inconsistent information is 
ignored rather than causing failure; however the defini- 
tion is complicated by the need to consider the interac- 
tions between reentrant FSs. The way we deal with this 
is discussed in detail in Copestake (1991, forthcoming),
but since the problematic cases seem to arise relatively 
rarely in our particular application, we will not discuss 
the full definition here. We use $\sqcap_<$ to signify default unification,
where $A \sqcap_< B$ means that $A$ is the non-default
and $B$ the default FS. When no reentrancy interactions
are involved the definition is:
$A \sqcap_< B = A \sqcap \bigsqcap \{\phi \in \Phi \mid A \sqcap \phi \neq \bot\}$
where $\Phi$ is the set of all component FSs of $B$.
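A small worked example may help to show how the definition behaves when no reentrancy is involved. The FSs, features and atomic types below are invented for illustration, and we assume that the component FSs of $B$ are its individual path-value pieces; the fuller definition cited above should be consulted for the general case.

```latex
% Assumed inputs: atomic types a and b are inconsistent, so a \sqcap b = \bot.
A = [\,F = a\,], \qquad B = [\,F = b,\; G = c\,], \qquad
\Phi = \{\, [\,F = b\,],\; [\,G = c\,] \,\}

% Test each component of B against the non-default FS A.
A \sqcap [\,F = b\,] = \bot
\qquad
A \sqcap [\,G = c\,] = [\,F = a,\; G = c\,] \neq \bot

% Only the consistent component survives and is unified with A.
A \sqcap_< B = A \sqcap \bigsqcap \{\, [\,G = c\,] \,\} = [\,F = a,\; G = c\,]
```

The conflicting default value for F is ignored, while the default value for G is filled in, which is the behaviour required for default inheritance.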
The ordering on the psort hierarchy gives us an ordering
on defaults. So for example, assume that the
following is the lexical entry for BOOK_L_1_1:

  [lex-noun-sign
   RQS = [artifact_physical
          TELIC = [verb-sem
                   PRED = read_L_1_1]
          PHYSICAL-STATE = solid_a]]

The following path specifications make the lexical entries
defined inherit from BOOK_L_1_1:

  autobiography <rqs> < book_L_1_1 <rqs>
  dictionary <rqs> < book_L_1_1 <rqs>
      <rqs : telic : pred> = refer_to_L_0_2
  lexicon <rqs> < dictionary <rqs>

AUTOBIOGRAPHY would thus have the same values as
BOOK_L_1_1 for both telic and physical-state. DICTIONARY
will inherit the value solid_a for the feature
PHYSICAL-STATE but the value of TELIC overrides that
inherited from BOOK_L_1_1. LEXICON inherits its value
for the telic role from DICTIONARY rather than from
BOOK_L_1_1:

  [lex-noun-sign
   RQS = [artifact_physical
          TELIC = [verb-sem
                   PRED = refer_to_L_0_2]
          PHYSICAL-STATE = solid_a]]
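The inheritance chain in this example can be mimicked with a short Python sketch. It is a deliberate simplification: FSs are nested dictionaries with atomic string values, types and reentrancy are ignored, and two atomic values are treated as conflicting whenever they differ. The function and table names are illustrative only.

```python
import copy


def default_merge(non_default, default):
    """Fill in values from `default` wherever `non_default` says nothing.

    Conflicting default values are ignored rather than causing failure.
    FSs are nested dicts with atomic leaves; reentrancy is not modelled.
    """
    result = copy.deepcopy(non_default)
    for feature, value in default.items():
        if feature not in result:
            result[feature] = copy.deepcopy(value)
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            result[feature] = default_merge(result[feature], value)
        # otherwise the non-default value wins and the default is dropped
    return result


# Unexpanded RQS values of each entry, following the example in the text.
LOCAL = {
    "book_L_1_1": {"TELIC": {"PRED": "read_L_1_1"},
                   "PHYSICAL-STATE": "solid_a"},
    "autobiography": {},
    "dictionary": {"TELIC": {"PRED": "refer_to_L_0_2"}},
    "lexicon": {},
}

# The psort from which each entry inherits by default.
INHERITS_FROM = {
    "autobiography": "book_L_1_1",
    "dictionary": "book_L_1_1",
    "lexicon": "dictionary",
}


def expand(name):
    """Expand the parent first (top-down), then default-merge it in."""
    parent = INHERITS_FROM.get(name)
    if parent is None:
        return copy.deepcopy(LOCAL[name])
    return default_merge(LOCAL[name], expand(parent))
```

Calling `expand("lexicon")` gives the same RQS values as `expand("dictionary")`, with the telic predicate refer_to_L_0_2 and the physical state solid_a inherited through the chain.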
Multiple default inheritance is allowed but is restricted
to the case where the information from the parent psorts
does not conflict. This is enforced by unifying all (fully
expanded) immediate parent psorts before default unifying
the result with the daughter psort. The type restriction
on default inheritance means that all the psorts
must have compatible types and the type of the daughter
must be the meet of those types. We define inheritance
to operate top-down; that is a psort will be fully expanded
with inherited information before it is used for
default inheritance. We adopted this approach as we
are primarily interested in default inheritance between
fully formed lexical entries; since we disallow conflicts
arising from multiple inheritance, distinctions between
top-down and bottom-up inheritance only arise with the
problematic cases of default unification alluded to above.
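The restriction on multiple default inheritance can be sketched in Python as ordinary unification over the parent psorts, which fails on any conflict. As before this is a simplification: FSs are nested dictionaries with atomic values, types and reentrancy are ignored, and the feature values used in the usage note are invented. In the scheme described above, the unified parents would then be default unified with the daughter psort.

```python
class UnificationFailure(Exception):
    """Raised when two FSs carry conflicting information."""


def unify(a, b):
    """Ordinary unification of nested-dict FSs with atomic leaves."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in a:
                result[feature] = unify(a[feature], value)
            else:
                result[feature] = value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


def unify_parents(parents):
    """Combine fully expanded parent psorts, failing on any conflict."""
    combined = {}
    for parent in parents:
        combined = unify(combined, parent)
    return combined
```

For instance, parents contributing `{"TELIC": {"PRED": "read_L_1_1"}}` and `{"PHYSICAL-STATE": "solid_a"}` combine without difficulty, whereas two parents giving different values for PHYSICAL-STATE raise `UnificationFailure`, so the inheritance specification would be rejected.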
We also allow non-default inheritance from psorts, implemented
by ordinary unification. This is a relatively
recent addition to the LKB, prompted partly by issues in 
the representation of the multilingual translation links. 
It also seemed to be desirable in the representation of 
qualia structure, in order to allow the telic role of a noun 
to be specified directly in terms of a verb sense, without 
allowing other information in that lexical entry to con- 
flict. Thus the entry for dictionary above would actually 
specify: 
<rqs : telic> == refer_to_L_0_2 <sem>
where == indicates non-default inheritance. 
Although introducing psorts as well as types may seem 
unnecessarily complex there seem to be compelling rea- 
sons for doing so for this application, where we wish 
to use taxonomic information extracted from MRDs to 
structure the lexicon. The type hierarchy is not a suit- 
able way for representing taxonomic inheritance for sev- 
eral reasons. Perhaps the most important is that taxo- 
nomically inherited information is defeasible, but typing 
and defaults are incompatible notions. Types are needed 
to enforce an organisation on the lexicon -- if this can 
be overridden it is useless. Furthermore the type system 
is taken to be complete, and various conditions are im- 
posed on it, such as the greatest lower bound condition, 
which ensure that deterministic classification is possible. 
Taxonomies extracted from dictionaries will not be com- 
plete in this sense, and will not meet these conditions. 
Intuitively we would expect to be able to classify lexical 
entries into categories such as human, artifact and so 
on, and to be able to state that all creatures are either 
humans or animals, since in effect this is how we are 
defining those types. But we would not expect to be 
able to use the finer-grained, automatically acquired in- 
formation in this way; we will never extract all possible 
categories of horse for example. 
In implementational terms, using the type hierar- 
chy to provide the fine-grain of inheritance possible 
with taxonomic information would be very difficult. 
A type scheme should be relatively static; any alter- 
ations may affect a large amount of data and check- 
ing that the scheme as a whole is still consistent is a 
non-trivial process. Because the inheritance hierarchies 
are derived from taxonomies and thus are derived semi- 
automatically from MRDs, they will contain errors and it 
is important that these can be corrected easily. In practice,
deciding whether to make use of the type mechanism
or the psort mechanism has been relatively straightforward.
If we wish to use a feature which is particular
to some group of lexical entries we have to introduce a
type; otherwise, especially if the information might be
defeasible, we use a psort.
Several of the decisions involved in designing the de- 
fault inheritance system were thus influenced by the ap- 
plication. The condition that the default inheritance or- 
dering reflects the type ordering was partly motivated 
by the desire to be able to provide an rqs type for lexical 
entries on the basis of taxonomic data alone. However it 
also seems intuitively reasonable as a way of restricting 
default inheritance; without some such restriction it is 
difficult to make any substantive claims when default in- 
heritance is used to model some linguistic phenomenon. 
5 Using the LKB 
5.1 Interface and implementation 
The LKB as described here is fully implemented in Pro- 
cyon Common Lisp running on Apple Macintoshes. It 
is in use by all the academic groups involved in the AC- 
QUILEX project. In total there are currently about 20 
users on five sites in different countries. Interaction with 
the LKB is entirely menu-driven. Besides the obvious 
functions to load and view types, lexical entries, psorts, 
lexical rules and so on, there are various other facilities 
which are necessary for the application. A very simple 
(and inefficient) parser is included, to aid development 
of types and lexical entries. There are tools for support- 
ing multilingual linked lexicons, described in Copestake 
et al. (1992). The LKB is integrated with our LDB 
system so that information extracted from dictionary 
entries stored in the LDB can be used to build LKB 
lexicons. 
The type system which has been developed for use on 
the ACQUILEX project is fairly large (about 450 types 
and 80 features). Currently nearly 15,000 lexical entries 
containing syntactic and semantic information have been 
stored in the LKB. The bulk of these entries are currently 
made up of nouns for which the main semantic informa- 
tion is inherited down semi-automatically derived tax- 
onomies. Sanfilippo and Poznanski (1992) describe the 
semi-automatic derivation of entries for English psycho- 
logical predicates by augmenting LDOCE with thesaurus 
information derived from the Longman Lexicon. Work 
has begun on deriving multi-lingual linked lexicons. 
Given the complexity of the FSs for lexical entries, and 
the size of the lexicons to be supported by the LKB, it 
is clearly not possible to store lexicons in main memory. 
Lexical entries are thus stored on disk, to be expanded 
as required. Entries may be indexed by type of FS at 
the end of user-defined paths, and also by the psort(s) 
from which they are defined to inherit, although pro- 
ducing such indices for large lexicons is time consuming. 
Checking lexical entries (for well-formedness, default in- 
heritance conflicts and presence of cycles) can be carried 
out at the same time as indexing or acquisition. 
Efficiency gains arising directly from the use of types 
were not a major factor in our decision to use a typed sys- 
tem. Although parsing with typed FSs is more efficient 
than with untyped ones, since unification will fail when 
type conflict occurs, this is not particularly important in 
the LKB, since most unifications will be performed while 
expanding lexical entries, when the vast majority of uni- 
fications would be expected to succeed. Since there is 
some overhead in typing the FSs, the use of types proba- 
bly decreases efficiency slightly, although the unifications 
involved will be comparable to those needed if the same 
information were conveyed by templates. Since the LKB 
has to cope with large lexicons, with thousands of com- 
plex lexical entries, space efficiency rather than speed is 
the major consideration. The most important factor in 
space efficiency is the use of inheritance, both in the type 
system and the psort system, which allows unexpanded 
lexical entries to be very compact.
5.2 Typing and automatic acquisition of large 
lexicons 
Our notion of typing of FSs can be regarded as a way 
of getting the functionality of templates in untyped FS 
formalisms, with the added advantages of type checking 
and type inference. As a method of lexical organisation, 
types have significant advantages over templates, espe- 
cially for a large scale collaborative project. Once an 
agreed type system is adopted, the compatibility of the
data collected by each site is guaranteed. There may of 
course be problems of differing interpretation of types 
and features, but this applies to any representation; to 
ameliorate them we associate short documentation in- 
formation with each type, accessible via the menu in- 
terface from any point where the type is displayed. In 
an untyped feature system, typographical errors and so 
on may go undetected, and debugging a large template 
system can be extremely difficult; a type system makes 
error detection much simpler. Since a given FS has a 
type permanently associated with it, it is also much more
obvious how information has come to be inherited than 
if templates are used. 
Essentially the same advantages of safety and clarity 
apply to strict typing of FSs as to strict typing in pro- 
gramming languages. Of course a reduction in flexibility 
of representation has to be accepted, once a particular 
type system is adopted. It is possible to achieve a very 
considerable degree of modularisation; we have found 
that we could develop the noun RQS type system almost 
completely independently of the verb type system, once a 
small number of common types were agreed on, and that 
name clashes were the only problem found when reinte- 
grating the two. After approximately eight months of 
use we are now on the third version of both the verb and 
the noun type systems; individual users have been ex- 
perimenting with various representations which are then 
integrated into the general system as appropriate. En- 
coding the agreed representation in terms of a type sys- 
tem, rather than by means of templates, makes global 
alterations relatively easy because of the localisation of 
the information (for example, since a feature can only be 
introduced at one point in the hierarchy, it is easy to find 
all types which will be affected by a change in feature 
name) and the error checking. It is important that re- 
processing of raw dictionary data is avoided when a type 
system is changed, particularly if user interaction is in- 
volved, but storing intermediate results in the LDB as 
a derived dictionary helps achieve this. Even within the 
project it has proved useful to have local type systems 
and lexicons, and to derive entries for these automati- 
cally from the general LKB. Currently this is achieved 
by ad-hoc methods; we intend to investigate the devel- 
opment of tools to make transfer of information easier 
and more declarative. 
Ageno et al. (1992) describe one way in which the type 
system can be integrated with tools for semi-automatic 
analysis of dictionary definitions. Types are correlated 
with the templates used in a robust pattern matching 
parser, and user interaction can be controlled by the type 
system. The user is only allowed to introduce informa- 
tion appropriate for a particular type, and a menu-based 
interface can both inform the user of the possible values 
and preclude errors. 
The utility of typing for error checking when repre- 
senting automatically acquired data can be seen in the 
following simple example. The machine readable version 
of LDOCE associates semantic codes with senses. Ex- 
amples of such codes are P for plant, H for human, M for 
male human, K for male human or animal, and so on. 
When automatically acquiring information about nouns 
from LDOCE, we specify a value for the feature SEX, 
where this is possible according to the semantic codes. 
Thus the automatically created lexical entry for bull 1 1 
contains the line: 
< rqs : sex > = male 
In the current type system the feature SEX is introduced 
at type creature. A few LDOCE entries have incor- 
rect semantic codes; Irish stew for example has code 
K. Since Irish stew has RQS type c_artifact, which is 
not consistent with creature, SEX was detected as an 
inappropriate feature. Attempts at expansion of the au- 
tomatically generated lexical entry caused an error mes- 
sage to be output, and the user had the opportunity to 
correct the mistake. If the LKB were not a typed system, 
errors such as this would not be detected automatically 
in this way. 
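The check that catches the Irish stew entry can be sketched as follows. This is a minimal illustration in Python, not the LKB implementation: the top type entity, the shape of the hierarchy and the function names are assumptions made for the example, and only the types creature and c_artifact and the feature SEX come from the text above.

```python
# Hypothetical fragment of a type hierarchy, mapping each type to its parent.
# "entity" is an invented top type; creature and c_artifact are from the text.
PARENT = {
    "entity": None,
    "creature": "entity",
    "c_artifact": "entity",
}

# Each feature is introduced at exactly one type in the hierarchy.
INTRODUCED_AT = {"sex": "creature"}


def supertypes(t):
    """Return t together with all of its supertypes."""
    chain = []
    while t is not None:
        chain.append(t)
        t = PARENT[t]
    return chain


def check_entry(word, rqs_type, features):
    """Return one error message per feature that is inappropriate for rqs_type."""
    errors = []
    for feat in features:
        home = INTRODUCED_AT.get(feat)
        # A feature is appropriate only for its introducing type and subtypes.
        if home is None or home not in supertypes(rqs_type):
            errors.append(
                f"{word}: feature {feat.upper()} is not appropriate "
                f"for type {rqs_type}"
            )
    return errors


print(check_entry("bull", "creature", {"sex": "male"}))
print(check_entry("Irish stew", "c_artifact", {"sex": "male"}))
```

The first call reports no errors, while the second reports SEX as inappropriate, which is the point at which a user would be asked to correct the entry.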
In contrast, automatic classification of lexical entries 
by type, according to feature information, can be used to 
force specification of appropriate information. A lexical 
entry which has not been located in a taxonomy will be 
given the most general possible type for its RQS. However 
if a value for the feature SEX has been specified this forces 
an RQS type of creature. This would also force the value 
of ANIMATE to be true, for example. 
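The inference step just described might look like the sketch below. It is illustrative only: the top type entity, the assumption that ANIMATE is introduced at that top type, and the function names are invented for the example and are not taken from the LKB type system.

```python
# Hypothetical hierarchy fragment: each type maps to its parent.
PARENT = {"entity": None, "creature": "entity", "c_artifact": "entity"}

# Assumed introduction points; only SEX at creature is stated in the text.
INTRODUCED_AT = {"animate": "entity", "sex": "creature"}

# Assumed constraint: every creature has ANIMATE = true.
CONSTRAINTS = {"creature": {"animate": True}}


def is_subtype(t, ancestor):
    """True if t is ancestor or lies below it in the hierarchy."""
    while t is not None:
        if t == ancestor:
            return True
        t = PARENT[t]
    return False


def infer_type(features, start="entity"):
    """Find the most general type for which every given feature is appropriate."""
    current = start
    for feat in features:
        home = INTRODUCED_AT[feat]
        if is_subtype(home, current):
            current = home  # the feature forces a more specific type
        elif not is_subtype(current, home):
            raise ValueError(f"feature {feat} is incompatible with {current}")
    return current


def expand(features):
    """Infer the type, then add the values that the type's constraints force."""
    t = infer_type(features)
    result = dict(features)
    node = t
    while node is not None:
        for feat, value in CONSTRAINTS.get(node, {}).items():
            if result.setdefault(feat, value) != value:
                raise ValueError(f"{feat} conflicts with constraint on {node}")
        node = PARENT[node]
    return t, result


print(expand({}))               # entry not located in any taxonomy
print(expand({"sex": "male"}))  # SEX forces creature, and with it ANIMATE
```

An entry with no features keeps the most general type, whereas specifying SEX moves it to creature and fills in the value of ANIMATE.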
5.3 The psort inheritance mechanism 
Manual association of information with psorts has 
proved to be a highly efficient method of acquiring in- 
formation, since many psorts have hundreds of daughter 
entries. Creating 'artificial' psorts, which can be used 
where there is no simple lexicalisation of a concept, is 
also a powerful technique. Disjunctions such as person 
or animal, for example, can be represented as the gener- 
alisation of the two psorts involved. This and other cases 
of more complex taxonomic inheritance are discussed by 
Vossen and Copestake (1991, forthcoming). 
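As a rough illustration of generalisation, the sketch below treats feature structures as flat Python dictionaries and keeps only the information on which both agree. This simplifies the typed, nested structures of the LRL, and the feature names and values given for the two psorts are invented for the example.

```python
def generalise(fs1, fs2):
    """Keep only the feature-value pairs shared by both feature structures."""
    return {
        feat: value
        for feat, value in fs1.items()
        if feat in fs2 and fs2[feat] == value
    }


# Hypothetical psort contents, for illustration only.
person = {"animate": True, "human": True}
animal = {"animate": True, "human": False}

# An 'artificial' psort standing for the disjunction "person or animal".
person_or_animal = generalise(person, animal)
print(person_or_animal)
```

Entries defined as person or animal could then inherit from the artificial psort, which carries only the information common to both alternatives.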
We adopted the most conservative approach to multi- 
ple default inheritance (i.e. information inherited from 
multiple parents has to be consistent) because we knew 
we would have to cope with errors in extraction of in- 
formation from MRDs, and with the lexicographers' 
original mistakes. We expected this to be overrestric- 
tive, but in fact our consistency condition seems to be 
met fairly naturally by the data. Taxonomies extracted 
from MRDs are in general tree-structured (once sense- 
disambiguation has been performed); there do not tend 
to be many examples of genuine conjunction, for exam- 
ple. Multiple inheritance is mainly needed for cross- 
classification; artifacts for example may be defined prin- 
cipally in terms of their form or in terms of their func- 
tion, but here different sets of features are typically spec- 
ified, so the information is consistent. Furthermore it 
frequently turns out to be difficult to identify a second 
psort parent from the dictionary definition differentia. 
However type inference resulting from feature instantia- 
tion may still force a type to be assigned which represents 
the cross-classification. 
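The consistency condition on multiple parents can be sketched as below. The sketch uses flat dictionaries in place of feature structures and assumes that the check applies to the parents' information before the entry's own values override the inherited defaults. The psort names and features are hypothetical.

```python
def inherit(own, parents):
    """Combine information from several parent psorts, then apply the entry's own values.

    parents maps a psort name to its (flat) feature structure. Information
    from different parents must agree; the entry's own values override
    whatever is inherited by default.
    """
    inherited = {}
    source = {}
    for name, fs in parents.items():
        for feat, value in fs.items():
            if feat in inherited and inherited[feat] != value:
                raise ValueError(
                    f"parents {source[feat]} and {name} conflict on {feat}"
                )
            inherited[feat] = value
            source[feat] = name
    expanded = dict(inherited)
    expanded.update(own)  # more specific information wins over defaults
    return expanded


# Cross-classification by form and by function: the two parents specify
# different features, so the consistency condition is met.
entry = inherit(
    {"material": "glass"},
    {"container_1": {"purpose": "contain"}, "cylinder_1": {"shape": "cylindrical"}},
)
print(entry)
```

Two parents that assign different values to the same feature would raise an error instead, which is how extraction errors and lexicographers' mistakes would surface under this approach.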
Acknowledgements 
Several people contributed in various ways to the design, 
implementation and development of the LKB, especially 
Valeria de Paiva, Antonio Sanfilippo, Ted Briscoe, John 
Carroll, John Bowler and Horacio Rodriguez. We are 
very grateful to Bob Carpenter for his detailed comments 
on our use of types and default unification. We are grate- 
ful to the publishers Longman, Biblograf, Van Dale and 
Garzanti for allowing groups involved in ACQUILEX to 
use their dictionaries. 

References 

A. Ageno et al. SEISD: An Environment for Extraction 
of Semantic Information from On-Line Dictionaries. In 
Proceedings of the 3rd Conference on Applied Natural 
Language Processing, Trento, Italy, 1992. 

H. Alshawi, B. Boguraev and D. Carter. Placing the dic- 
tionary on-line. In B. Boguraev and T. Briscoe (eds.), 
Computational lexicography for natural language process- 
ing, pages 41-63, Longman, London, 1989. 

B. Boguraev and B. Levin. Models for lexical knowledge 
bases. In Proceedings of the 6th Annual Conference of 
the UW Center for the New OED, pages 65-78, Water- 
loo, 1990. 

T. Briscoe. Lexical Issues in Natural Language Process- 
ing. In E. Klein and F. Veltman (eds.), Natural Language 
and Speech, pages 39-68, Springer-Verlag, 1991. 

T. Briscoe, A. Copestake and B. Boguraev. Enjoy the 
paper: Lexical semantics via lexicology. In Proceedings 
of the 13th Coling, pages 42-47, Helsinki, 1990. 

T. Briscoe, A. Copestake and V. de Paiva (eds.). De- 
fault Inheritance in Unification based approaches to the 
Lexicon. Cambridge University Press, New York, forth- 
coming. 

N. Calzolari. The dictionary and the thesaurus can be 
combined. In M. W. Evens (ed.), Relational models of 
the lexicon, pages 75-96, Cambridge University Press, 
1988. 

B. Carpenter. Typed feature structures: Inheritance, 
(In)equality and Extensionality. In Proceedings of the 
First International Workshop on Inheritance in Natural 
Language Processing, pages 9-18, Tilburg, The Nether- 
lands, 1990. 

B. Carpenter. Skeptical and Credulous Default Unifica- 
tion with Applications to Templates and Inheritance. In 
T. Briscoe, A. Copestake and V. de Paiva (eds.), De- 
fault Inheritance in Unification based approaches to the 
Lexicon, CUP, New York, 1991, forthcoming. 

J. Carroll and C. Grover. The derivation of a large com- 
putational lexicon for English from LDOCE. In B. Bogu- 
raev and T. Briscoe (eds.), Computational lexicography 
for natural language processing, pages 117-134, Long- 
man, London, 1989. 

J. Carroll. Lexical Database System: User Manual. 
Esprit BRA-3030 ACQUILEX deliverable no. 2.3.3(c), 
April 1990. 

A. Copestake. An approach to building the hierarchical 
element of a lexical knowledge base from a machine read- 
able dictionary. In Proceedings of the First International 
Workshop on Inheritance in Natural Language Process- 
ing, pages 19-29, Tilburg, The Netherlands, 1990. 

A. Copestake. Default Unification in the LKB. In T. 
Briscoe, A. Copestake and V. de Paiva (eds.), Default 
Inheritance in Unification based approaches to the Lexi- 
con, CUP, New York, 1991, forthcoming. 

A. Copestake and T. Briscoe. Lexical Operations in 
a Unification Based Framework. In Proceedings of 
the ACL SIGLEX Workshop on Lexical Semantics and 
Knowledge Representation, pages 88-101, Berkeley, Cal- 
ifornia, 1991. 

A. Copestake, B. Jones, A. Sanfilippo, H. Rodriguez and 
P. Vossen. Multilingual lexical representation, ms Uni- 
versity of Cambridge, Computer Laboratory, 1992. 

W. Daelemans. Inheritance in Object-Oriented Natu- 
ral Language Processing. In Proceedings of the First 
International Workshop on Inheritance in Natural Lan- 
guage Processing, pages 30-39, Tilburg, The Nether- 
lands, 1990. 

R. Evans and G. Gazdar (editors). The DATR papers. 
Cognitive Science Research Paper CSRP 139, School of 
Cognitive and Computing Sciences, University of Sussex, 
1990. 

P. Procter (editor). Longman Dictionary of Contempo- 
rary English. Longman, England, 1978. 

J. Pustejovsky. Current issues in computational lexical 
semantics. In Proceedings of the 4th European ACL, 
pages xvii-xxv, Manchester, 1989. 

J. Pustejovsky. The Generative Lexicon. Computational 
Linguistics, 17(4) 1991. 

A. Sanfilippo. Grammatical Relations, Thematic Roles 
and Verb Semantics. PhD thesis, Centre for Cognitive 
Science, University of Edinburgh, 1990. 

A. Sanfilippo and V. Poznanski. The Acquisition of Lex- 
ical Knowledge from Combined Machine-Readable Dic- 
tionary Sources. In Proceedings of the 3rd Conference 
on Applied Natural Language Processing, Trento, Italy, 
1992. 

P. Vossen. Converting Data from a Lexical Database to 
a Knowledge Base. ACQUILEX Working paper, No 27, 
1991. 

P. Vossen and A. Copestake. Untangling definition 
structure into knowledge representation. In T. Briscoe, 
A. Copestake and V. de Paiva (eds.), Default Inheritance 
in Unification based approaches to the Lexicon, CUP, 
New York, 1991, forthcoming. 

Y. Wilks, D. Fass, C-M. Guo, J. McDonald, T. Plate 
and B. Slator. A tractable machine dictionary as a 
resource for computational semantics. In B. Boguraev 
and T. Briscoe (eds.), Computational lexicography for 
natural language processing, pages 193-231, Longman, 
London, 1989. 
