Word Knowledge Acquisition, Lexicon Construction and 
Dictionary Compilation 
Antonio Sanfilippo* 
SIIARP l,M)oratorics of l';urol)c l,td. 
Oxford Science Park, Oxford ()X4 4(\]A, UK 
ant onio~sharp, co. uk 
Abstract 
We describe an approach to semiautomatic lex- 
icon development from |nachine readal)le dictio- 
naries with specific reference to verbal diatlleses, 
envisaging ways in which tile results obtained can 
be used to guide word classification in the con- 
strnction of dictionary datal)ases. 
1 Introduction 
The acquisition and representation of lexical knowl- 
edge from machine-readable dictionaries and text cor- 
pora have increasingly become major concerns in Com- 
putational Lexicography/Lexicology. While this trend 
was essentially set by the need to mmximize cost- 
effectiveness in building large scale Lexical Knowledge 
Bases for NLP (LKBs), there is a clear sense in which 
the construction of such knowledge I)ascs also caters to 
the demand for better dictionaries. Currently available 
dictionaries and thesauri provide an undoubtedly rich 
source of lexical information, but often omit or neglect 
to make explicit salient syntactic and semantic proper- 
ties of word entries. For exa|nplc, it is well known that 
the same verb sense can appear in a wtriety of snl)cat- 
egorization frames which can be related to one at|other 
through valency alternations (diatheses). Some dictio- 
naries provide subcategorization information by means 
of grammar codes, as shown below for the "sail" sense. 
of the verb dock in LI)OCE -- Longman's Dictionary 
of Contemporary English (Procter, 1978). 
(1) a,,,:k "| , \[Tl;m: ("0\] .... 
The codes \[T1;10:(at)\] indicate that the vcrl) can bc 
either transitive or intransitive with the possible a(I- 
dition of all oblique colnpienlent introduced by l.he 
preposition at: 
(2) a. \[T1 (at)\]: Kim docked his ship (at Clasgow) 
b. \[IO (at)l: The ship docked (at Glasgow) 
Unfortunately, an indication of diatheses which relate 
the various occurrences of tt,e verb to one another is 
rarely provided. Consequently, if we were to use the 
grammar code information found in M)OCE to cre- 
ate verb entries in an I,I(B by automatic conversion 
we would construct four seemingly vnrelated entries 
for the verb dock (see §3). Inadequacies of this kind 
may be redressed through semiantomatie techniques 
*The researcl, relmrted in this paper was carried out 
within the ACQUILFX project. Iatn indebted to Ted 
Briscoe, Ann Col)estake and Pete Whitek)ck for helpful 
comments. 
wl|ich make it possil)le to suplfly infornmtion concern- 
ing amenability to diathesis alternations so ~tq to avoid 
expanding distinct entries for related uses of the same 
verb. This practice woldd allow us to develop an I,KB 
from dictionary databases which offers a more co|n- 
plate and linguistically relined repository of lexical in- 
formation l, hall the source databases. Such an \],Kll 
wouhl be used to generate lexical components for NI,P 
systems, and couhl also be integrated into a lexicogra- 
pher's workstation to guide word classification. 
2 The ACQUILEX Lexicon Development 
Environnmnt 
Our points of departure are tile tools for lexical acqui- 
sition and knowledge representation (lew~loped iL~ part 
of the ACQUII,I'3X project ('The Acquisition of Lcxieal 
Knowledge for NLP Systems'). 
The ACQUILI'~X l,exicon l)evelopment Environ- 
men| uses typed graph unilication with inheritance 
as its lexical representation htnguage (for details, see 
Copestake (1992), Sanfiliplm & l'oznafiski (1992), and 
pal)ers by Copestake, de Paiva and Sanfilippo in 
Briscoe el al. (1993)). It; allows the user to define 
an inheritance hierarchy of types with associated re- 
strictions expressed in terms of attril)ute-wdue \[)airs 
as shown in Fig 1, and to create lexicons where such 
types are used to create lexical templates which encode 
word-se,|se specific information ex{.racte.d from MRI)s 
st, ch as the one in Fig 2. (Bold lowerc~me is used for 
types, caps for attributes, and boxes enclosing types in- 
dicate total omission of attribute-vahm pairs, l)etails 
COIICel'llillg, |lie OllcOdillg o\[ vet'\]) Sylll, aX an(I seulantics 
can be found in Sanlilil)po (1993).) 
Feature Structure (I"S) descriptions of word senses 
such as tilt.' one in Fig 2 are created semiautomati- 
cally through a program which converts syntactie an(I 
T 
sign rule . 
lex-sig, .-. h!xllrld-l'uh.. ... 
verb-slgn • . . 
strict-intraua-Mgn . . . 
° verl,-Mgn \] 1- lexical-rule \] 
CAT = ~lex-etd\] 
S~:M = ~ L 'Npu'r= ~J J 
\[;'igure I: q'ype IIierarchy & Constraints (fragment). 
223 
• 8trict-intrans-nign Olq.Tn ~ ttwiln 
" RESULT~trlct'intralla'cat= ~ J 1 
CAT = \[ Ilp-sllgn \] 
ACTIVE = \[ SEM = \[\] 
str|ct-intrana-~am 
IND = 1I\] PRED = and 
/ \[ INnverb'f°rlnula= \[\] \] 
AROl = \[ PRED = \[~l~whnl~l_l 
SEM = tAROt = \[\] 
agt-formula IND = \[~proce~a 
ARG2 : \[\] PRED = llgt-cettt~-lnov-s||antt ARG1 = \[\] 
ARO2 e-anhnate 
Figure 2: LKB Entry for swim (simplified). 
semantic specifications encoded in MRDs into LKB 
types. For example, the choice of LKB types used 
in the characterization of the verb swim above was in- 
duced from the syntactic and semantic codes found 
in LDOCE and tim Longman Lexicon of Contempo- 
rary English (LLOCE, McArtllur 1980). In LI)OCE, 
the first sense of the verb swim is marked ,'us a 
strict intransitive verb (\[I0\]) whose subject is animate 
((box .... 0)); in LLOCE, the same verb sense is se- 
mantically classified as a movement verb with manner 
of motion specified (M19): 
(3) swim 1 (1) 
LDOCE 
LLOCE 
\[I0\] (box .... 0)... 
M19 - Particular ways of moving 
The MRD-to-LKB equivalences induced by the conver- 
sion algorithm are as shown in (4) where agt-eause- 
move-manner indicates that the subject participant 
relation implies self-induced movement with manner 
specified. 
(4) \[10\] --4" str|ct-|ntranu-slgn 
(box .... 0) ~ teaT: ACTIVE:SZM: ^aa2= 
~-an\[lllate\] 
M19 -* \[cxr:ACTWr.:SEU:Vm)= 
\[tgt-eHtlSe-lnOVe-l||H II II(~l" l 
\[I0\], M19 --~ tS~M:tun = l" ...... l 
3 Verbal Diatheses and Lexieal 
Acquisition 
In the example discussed above, MI~D-to-LKB conver- 
sion is relatively straightforward: a single LKB entry is 
created for swim since a single grammar code is found 
in the MRD sources used. Where a verb-sense entry 
gives more than one grammar code, however, the ques- 
tion arises whether or not each grammar code should 
be mapped into a distinct LKB entry. For example, 
the codes given in LDOCE for the verb dock (see (1)) 
could potentially be used to derive four LKB verb en- 
tries: 
(s) LKB TYPF, EXAMP LI,; 
a. striet-trans-sign Kim docked the boat 
b. obl-trans-slgn Kim docked the boat 
at Southampton 
c. striet-intrans-sign The boat docked 
d. obl-intrmxs-sign The boat docked at 
Southampton 
Notice, however, that in tllis case the creation of four 
distinct LKB entries is unnecessary insofar ,as the use of 
the verb exemplified in (5b) contains enongh informa- 
tion to derive the remaining uses of the verb through 
lexical rules which progressively reduce the verb's va- 
lency by dropping the subject and/or prepositional ar- 
gument(s). Such a step would be linguistically moti- 
vated in that it establishes a clear link between alter- 
native uses of the same verb sense. Moreover, compact 
representation of verb use extensions is desirable from 
an engineering perspective its it reduces the size of ttm 
lexicon, allowing verb use expansion to be delayed till 
parsing time. This practice can be made to facilitate 
the resolution of lexical ambiguity by enforcing selec- 
tive application of lexical rules (Copestake & Briscoe, 
1994). 
Compact representation of verb use extensions due 
to valency alternations requires that a note of all ap- 
plicable lexical rules be made in each kernel entry. In 
choosing ol)l-trans-slgn as the LKB type for dock, 
for example, specifications would be added saying that 
the verb is amenabh~" to the causative-inchoative al- 
ternation relating agentive and agentless uses ((5a,b) 
vs. (5c,d)), and the path alternation pertaining to 
the omission of the prepositional argument ((5a,c) vs. 
(5b,d)). In addition, the path alternation would ha~(e 
to be specified as to whether it preserves amenability 
to a telic interpretation (accomplishment or achieve- 
men|) of the event described by the verb or not. For 
example, tile omission of the goal argnment for a verb 
such as drive, push or carmj induces an atelic (process) 
interpretation as indicated by incompatibility with a 
terminative adverbial: 
(6) a. ,\]ohn drove his car to London in one hour 
b. John drove his ear (*in one hour) 
Within a (partial) deeomposltional approach to verb 
semantics (Tahny, 1985; Jackendolr, 1990; Sanfilippo, 
1993; Sanlilil)po el al., 1992)), this contrast can be 
explained with reference to the rneaniug component 
path. In (6a), the goal argument (1o London) fixes a 
final bound for the path along which the driving event 
takes place. Assuming that, the compositional meaning 
of tile sentence involves establishing a homomorphism 
between tile event described by the verb and the path 
along which such an event takes place (l)owty, 1991; 
Sanlilippo, 1991), it follows that with an unbounded 
path (e.g. (6b)) only a process interpretation is pos- 
sible, whereas with a bounded path (e.g. (6a)) a relic 
interpretation is more likely. P,y contr~t, the omission 
of the goal argument with verbs such ,an deliver, bring, 
dock and send does not inhibit amenability to a relic 
interpretation, e.g. 
(7) We can deliver the goods (to your door) in one 
hour 
274 
Our aim, then, wtLs to capture regularities across dis- 
tinct nses of the same verb sense by relating the sub- 
categorization frames relative to these uses via regular 
syntactic and semantic changes. 'lb iLssess the feasibil- 
ity of this approach, we attgmented the MtH)-to-LKB 
conversion code with facilities which make it possible 
to infer amenability to specific diathesis alternations 
from occurrence of multiple grammar codes and their 
~ssociated semantic codes in the MR, Ds. To improve on 
the informational content of LDOCE grammar codes, 
we used an intermediate dictionary semiantonmtically 
derived from LDOCE (LI)OCEAnter) where the sub- 
categorization information inferrablc from grammar 
codes and other orthographic conventions wi~s made 
more explicit (Boguraev & Briscoe, 1989; Carroll 
& Grovel 1989). Semantic inlbrmation about verb 
classes was obtained by mapping across LI)OCF, and 
II, LOCE so as to augment LI)OCE queries with the- 
saurus information, i.e. semantic codes (Santilippo & 
Poznafiski, 1992). 
Syntactic and semantic intbrmation relative to verb 
senses was extracted through special functio,~s which 
operate on pointers to dictionary entries. The ex- 
tracted info was used to generate FS representations 
of word senses. The collversion process was carried 
out in such a way that whenever multiple subcatego- 
rization frames were found in association with a verb 
sense, only those which could not be derived via diathe- 
sis alternation were expanded into LKII entries, l;br 
example, the LDOCEAnter entry for dock gives four 
subeategorization frames: 
(dock) 
(((Cat V) (Takes NP) (Type 1)) 
((Cat 7) (Takes NP PP) (Type 2) (PFtlRH at)) 
(((;at 7) (Takes NP NP) (Type 2 Transitive)) 
((Cat 7) (Takes NP NP PP) (Type 3) (PF\[\]RH at))) 
In this case, the four uses of the verb can all 
be derived from the last one through application of 
the causative-inchoative and bounded-path alterna- 
tions mentioned above; all that ,ceds doing is to mark 
what diatheses are possible in the LKB entry derived, 
e.g. 
o|)i-t ig,t ....... \] 
OIUI'II = do<zk 
= \[ TI(.ANS-ALT = catm-lnch \] L L 
OIJIrAI/I' ~ I)-IHLth J 
'l?he algorithm which guides this process checks 
whether information regardiug diathesis alternations 
can be inferred from dictionary entries iu the MRI) 
sources or must be manually supplied. In perform- 
lug this check, snbcategorization options relative to a 
given verb sense which can be inferred from a more 
informative subcategorization frame are ignored. This 
technique was successfidly employed in semiautomatic 
derivation of lexicons for 360 n~ow.'n~ent verbs yieldiug 
over 500 additional possible expansions by application 
of lexical rules. 
4 Verbal Diatheses and Knowledge 
Representation 
To encode amenability to verbal diathescs, the fea- 
ture D1ATHESES wiLs introduced ;Ls an extension of 
the morphological features ~Lssoeiated with verbs (see 
(8)). This feature takes as value the type altem,a- 
lions which is in turn (teiine(l ,'ks having a wtriety of 
specialized types according to which diathesis alterna- 
tions are admissible for each choice of verb type (e.g. 
intransitive, transit.ive, ditransitive), ~u shown in Fig- 
ure 3 (see next page). The following table provides 
diatheses refi~rred to in l"ig 3. 
I.~X AM 1) I,E 
l(im broke lhe glass vs. 
the glass broke 
1(ira scares Sally vs. 
Sally scares easily 
examples of the 
I) IATI1ESIS 
caus-inch 
middle 
indef-obj 
defobj 
reel I) 
Itass 
b-lmth 
U-lmth 
to/tbre 
John ale a sandwich vs. 
John ale 
John did nol notice the sign vs. 
John did nol notice 
Kim met Bill vs. 
Kim and Bill met 
Hill read the Guardian vs. 
The Guardian was read by Bill 
Kim relurned the book to Sue vs. 
Kim returned the book 
Kim came away vs. 
Kim came (particle alternation) 
Kim swam across lhc Fiver vs. 
Kim swam 
Kim walked away vs. 
Kim walked (particle alternation) 
John broughl a book to/for Sue vs. 
John brought Sue a book 
l)iathesis alternations are enforced by means of lexi- 
cal rules whM,, on par with all other information struc- 
tures in the LKP,, are hierarchically arranged, ~s shown 
in Fig 4 with reference t,o the bound and unl)ound 
path alternations for intransitive verbs. LexicM rules 
h!xleal-I'tde 
t~hl-ittt rann-.lt tl-l)ltt h-all I~-path-alt 
U_l)ath_ol)l_intrnns_al t b-l)ath-ol)l-|s~tratt a-all 
Figure 4: Lexical Rule llierarchy (fragment). 
enforcing diathesis alternations may involw~ a variety 
of syntactic, semantic and orthographic changes. For 
example, the u-path-ohl-intrans-alt rule shown in 
Fig 5 I)elow takes as input an I"S of type obl-intrans- 
sign which represents a verb describing a non-stative 
eventuality (dyn-eve) whose subject participant (with 
semantics KI) is implied as moving along a directed 
path (th-move-dir) the endpoint of which is speci- 
tled by the oblique a,'gument (pp-sign), e.g. the use 
of swim in Kim swam acTvss the river. The output is 
an FS representing a strict intransitive verb (striet- 
intrans-sign) which describes a process and whose 
subject participant is like that of the inlutt with the 
directed path speciIication removed (th-move instead 
of th-mow~-dir), e.g. swim in Kim swam). 
5 Using the I,KB to Guide Dictionary 
Compilation 
There are at least two ways in which an LKB such as 
the one developed in ACQUILEX offers the means to 
275 
\[alternations \] I)IVI'-AI,T = prt-or~obl-a|t \] 
PRT-ALT = prt-or-obl-alt l PRT-ALT = prt-or-obl=alt \[ I'IUI'-AIA' = prt=or-obl-alt \] OBb-ALT = prt-or-obl-alt L TItANS-ALT = trans-alt 
PRT-ALT = prt-or-obl-alt / \[ PRT-ALT = l)rt-or-ol)l-alt \] PItT-ALT = prt-or-obl-alt OBL.ALT = prt-or-obl-alt J l TItANS-AI,T = traits-air j L TItANS. ALT : trana-alt J 
L OBL-AIII" = l)rt-ttr-obl-alt 
dltrana-dlatl ..... \[t bl-diatl ........ ..... \] PPJt%ALT = prt-or-obl-alt | PRT-ALT = prt-or-obl-alt 
"PttANS-AL'r = trana-alt OBL-ALT = prt-or=obl-alt l TRANS-AIIr = italia-air 
L ODL-AIIY = prt-or-obl-alt L I)AT-MOVT = (|itt=nlovt 
trana-alt _E caua-incl h middle) htdef-obj~ def-obj, recip, pass 
prt-or-obl-alt ~ b-path) u-path 
dat-niovt ~ to) for 
Figure 3: Verbal Dhttheses Ilierarchy 
" u-pat h-obl-lnt rana-alt 
• mtrlet-|ntrans-slgn ORTtl = \[fflorth 
" atrlct-|ntrans-cat \] 
I~ESULT = 
CAT = CT \[ nl~-a|gll 
A IVE = \[SEM = ~ \] 
OUTPUT = st r|ct=intrans-sem 
IND = process 
FRED = and 
SEM = ARQ1 = \[\] r 
tll-formula \] ARG2 = \[\] \[ PItED = th-ntove 
L A~o~ = \[\] 
" ol)l-lnt rana-algn 
ORTII ~ \[\] 
obl-|ntrans-cat 
\[ strlct=|ntrana-cat \] 
CAT = | A IVE -- \[ Itl)'a|gll "l L OT ._ \[~EM=mJ 
= \[ pp-slgn 1 ACTIVE \[ SEM = \[\] \] 
INPUT = lntrana-obl-sem 
IND = dylt-ev0 PRI,?,D = aitd 
AltGl ~ \[~\]Iv~r|)=fornlu|a i 
SEM = 
1" binary-formula \[ PP~ED = and 
l \[ th-fi)rmula "l ARG2 ~ \[ ARGI = \[\] \] PILED = th-ltlOve-d|r J 
L Alia2 = l~lol, j 
L Ar~a~ = ~~q 
Figure 5: Tlle "unbounded p~tth" lexical rule for intransitives 
276 
CoRPus 
the ship docked safely 
NP ~ ADV 
LKll: 
1 \[ ,,-p.th-obl-~,~~'\] 
OII'rPUT = ~t-illgran~_~igl,J 1 
INPUT = ~ .j 
T 
CORPUS 
the strip (IOCKe(I t_t(, tal~Sgow 
Figure 6: TBA 
facilitate word classificatiol, in the compilation of new 
lexieal databases. 
First, the links between LKB types and dictionary 
entries established in the conversion stage can be used 
to run consistency cheeks on tile MRI) sources and 
to supply missing information or correct errors, This 
offers an efficient and cost-effective way of generating 
improved versions of the same (lietionary. 
Second, the types associated with specific word 
classes can he made to guide lexical aquisition from 
corpora when creating new dictionaries. It is now 
widely recognized that corpora are indispensable in 
the acquisition of lexicaI information relating to is- 
sues of usage such as the range and frequency of dif- 
ferent patterns of syntactic realization. 'FILe availabil- 
ity of software tools for partial analysis of texts {e.g. 
morphological and semantic tagging, phrasal parsing 
etc.) has increased significantly the utility of corpora 
in lexical acquisition by providing ways to structure 
the information contained in them (see Briscoe {199l) 
and references therein), l'h~rther advances yet can be 
made by using LKB types to chLssify words in text eof 
pora. Suplmse , for example, we lit,ked the input and 
ontl)ut of lexical rules to semantically tagged subcat- 
egorization frames extracted from bracketed corpora| 
(Poznafiski & Sautillppo, 1993). As indicated in Fig 6, 
this would allow us to assess which alternations might 
be of interest in establishing regular verb sense/usage 
shifts. Such an assessment wonhl provide an effective 
way to drive verb categorization from corpora in the 
domain of valency alternations. 
6 Final Remarks 
A key element in our approach to \]exica\[ acquisition 
and representation of verbal diathesis concerns tim use 
of semantics constraints in formulating MI{I) queries 
and characterizing FS descriptions. This practice en- 
sttres that the results achieved in this work for ,n(> 
tiou verbs can be suitably extended to other semantic 
verb classes. For example, the c\[tuss of verbs which un- 
dergo "extraposition" --e.g. That Kim left early both- 
ers Sue vs. It bothers 5'ue that Kim left early-- can 
be identified by using semantic constraints on MK1) 
queries which identify psychological verbs with stimu- 
Ills subject such as bother, please, etc. (Sanfilippo & 
Poznatiski, 1992). This approach provides an effective 
way of employing semiautomatic extraction of infor- 
mation from MIt,1)s for lexicon construction, and it 
facilitates word classification from text corpora when 
co,npiling new dictionary datab~es. 
References 
P, riseoe, T. (1991) I,exical Issues in Natural I,anguage 
Processing. In Klein, E. ,~ F. Veltman ((:(Is.). Nat- 
uml Language and 5'peecl h Springer-Verlag, 39-68. 
llriscoe, T., A. Copestake and V. de Paiva (1993) 
Defaull Inheritance within Unifiealion-Based Ap- 
p~vaches to the Lea:icon, CUP. 
P, oguraev, B. & T. Briseoe (1989) Utilising the 
I,I)OCE (Jrammar Codes. In Boguraev, 1L & 
Brlscoc, '\['. (eds.) Computational Leedcography for 
Natural Langvage Processing, Longman, London. 
Carroll, J. & C. (Irover (1989) The Derivation of 
a Large Computational l,exicon for English from 
LI)OCE. In lloguraw, B. (~ llriscoe, T. (eds.) Com- 
pvlalional Le:eieography for Natural Language Pro- 
cessing. Longman, I,ondon. 
Col)estake, A. (1992) The ACQUILEX LI(B: Repre- 
sentation Issues in Semi-Antomatic Acquisition or 
Large I,exlcons. In l'roceedings of the 3rd Confer- 
ence on Applied Natural Langvage Processing. 
Copestake, A. and T. Briscoe (1994) Semi-productive 
|)olysemy and Sense Extensions. Ms. Computer 
Laboratory University of Cambridge, Xerox Park 
(Menlo l)ark) and Rank Xerox Research l,aboratory 
(Qrenohle). 
l)owty, D. (1991) Thematic Protc~Roles aml Arga- 
mcnt Selection. Language (}7, pp. 547-619. 
Jackendoff, ll.. (1990) ,5'emanlic Strvctures. MIT 
Press, Cambridge, Mass. 
McArthur, T, (1981) Longman Lexicon of Contempo- 
rary g'nglish. Lougman, l,oudon. 
Poznafiskl &, Saufilipl)o (1993) l)eteeting I)ependen- 
eies between Semantic Verb Subclasses and Subc;~t- 
cgorization Frames in q'cxt Corpora. In IL Boguracv 
&. J. Pustcjovsky (eds) Acquisition of Lez'ical Knowl- 
edge from Text, Proceedings of a ,gIG I, EX workshop, 
ACL-93, Ohio. 
Procter, I ). (1978) Longman Dictionary of Contem- 
porary English. l,ongnmn, London. 
San/ilippo, A. (1991) Thcnmtie aud Aspectua\] Infor- 
nmtion in Verb Se,nanl.ics. Belgian Journal of Lin- 
ffuislics, 6. 
Sanfilippo, A. (1993) LKII Eneodil,g of Lexical 
Knowledge. In llriseoe, '1'., A. Cot)e.stake and V. 
de l'aiva (eds.). 
Santiliplm, A & V. Poznafiski (1992) The Acquisi- 
tion of Lexical Knowledge from Combined Machine- 
Readable l)ictionary Sources. In t'roceedings of the 
3rd Conference on Applied Natural Language lbv- 
cessing, 'lYento. 
Sanfilil)po, A., T. Briscoe, A. Copestake, M. Mar|l, 
M. Tauld and A. Alongc (1992) ~lYanslation I"quiv- 
alence and Lexiealization in the ACQUILI';X I,l(P,. 
Proceedings of TM1-92, Montreal, Canada. 
Tahny, 1,. (1985) Lexiealizatio,~ Patterns: Semantic 
Structure in Lexical Form. In Shopen, T. (cd) Lan- 
guage "\]~pology and Syntactic Description 3. (h'am- 
marital Categoriesv and the Lexicon, CUlL 
277 
