Comlex Syntax: Building a Computational Lexicon 
Ralph Grishm:m, Catherine Macleod, and Adam Mcyers 
Computer Science Department, New York University 
715 Broadw,~y, 7th Floor, New York, NY 10003, U.S.A. 
{grislnnan,macleod,me.yers } (@cs.nyu.e(ht 
Abstract 
We des((tile tile design of Comlex Syntax, a co,nputa- 
tional lexicon providing detailed syntactic iuformation 
ff)r approximately 38,000 English headwords. We con- 
sider the types of errors which arise in creating such 
a lexicon, and how such errors can be measured and 
controlled. 
1 Goal 
The goal of the (:omlex Syntax project is to create a 
moderately-broad-coverage lexicon recording the syn- 
tactic features of gnglist; words for purposes of cou> 
putational language analysis. This dictionary is be- 
ing developed at New York University and is to he 
distributed by the Linguistic Data Consortimn, to be 
freely usable for both research and commercial pur- 
poses by members of the Consortium. 
In order to ineet the needs of a wide range of an~> 
lyzers, we have inchlded a rich set of syntactic features 
and haw~ aimed to characterize these Datures in a rela- 
tively theory-neutral way. In l)articnlar, the feature set 
is more detailed than those of the major commercial 
dictionaries, such ;us the Oxford Adwmced Learner's 
Dictionary (OALI)) \[d\] and the Longnum Dictionary of 
Contemporary English (LDOCE) \[8\], which haw~ I)een 
widely used as a source o\[' lexical i,,for,,lal, ioil ill \];lll- 
guage analyzers. 1 In addil.ion, we have ahned to be 
irio,'e cOrrlpreheiisive ill capturhig featt, res (hi partic.u- 
\]ar, stibcategorization \['eatures) than co,iI,llercial dic 
tlonaries. 
2 Structure 
Tile word list was derived fi'on, the file prepared 
by Prof. Roger Mitten from the Oxford Adwn,ced 
Learner's Dictionary, and contains about 38,000 head 
forms, although some purely British terms have been 
omitted, loach entry is organized as a nested set of 
typed feature-vahle lists. We currently use a Lisp-like 
parenthesized list notation, although the lexicon couhl 
ITo facilii~ate the transition to COMLEX by currenl, users of 
these dictionaries, we have i)reparcd mappings froln COMI,EX 
classes to those of several other dictionaries. 
be readily mapped into other hwn,s, such as SC, MI,- 
marked text, if desired. 
SOllie sauil)le dicticll,ary entries are shown ilt Figure 
1. The first syml/ol gives the part of speech; a word 
with several parts of speech will have several dictionary 
entries, one for each part of speech. Each e,itry has all 
:orth foatilre, giving the base fO,'lfl of tile word, No,ins, 
verbs, and adjectiw~s with irregular Inorphology will 
liave featt,res for the irregular fo,.iris :plural, :past, :past- 
part, etc. Words which take con-,i)leirients will have 
a subcatego,'ization (:sube) \['eat,ire. For exaniple> the 
verb "ai)andon" eali occur with a IlOllri phrase followed 
by a prepositional phrase with tim preposition "to" 
(e.g., "1 abandoned hii,i to the linguists.") or with just 
a ,lOll,, phrase compleifient ("\[ aballdone(l the shill."). 
Other syntactic features are recorded under :features. 
For example, the noun "abandon" is marked as (count- 
able :pval ("wlth")), indicating that it must appear in 
the singular with a deter,niner unless it is preceded by 
the preposZion "with". 
2.1 Subcategorization 
We have paid p~u'ticular attention to providing 
detailed subcategorization information (information 
about complement structure), both for verbs and for 
tllose nouns and adjectives which do take cmnl)lements. 
In order to insure the COml)leteness of our codes, we 
studied the codiug e)ul)loyed by s(weral other u,ajor 
texicous, includh,g (,he Ih'andeis Verh Lexlcolt 2, the 
A(JQIJII,EX Prc, ject \[10\], the NYU Linguistic String 
l'roject \[9\], the OALI), and IA)OCI'\], a, nd, whenever 
feasiMe, haw~ sought to incorporate distinctions made 
in any of these all(tie,tortes. ()ur resulting feature sys- 
ten, includes 92 subcategorization features Ibr w~rbs, 14 
for adjectives, and 9 for llO,,ns. These features record 
dilforences in grammatical functional structure as well 
as constituent structure. In particular, tl,ey Calfl.ure 
four different types of control: subject control, object 
control, variable control, and arbitrary control. Fur- 
thermore, the notation allows us to indicate that 
verl) Irlay haw~ dill>rent control features for different 
comlflement structm'es~ or ewm for dilrerent preposi- 
tions within the complement. We record, for example, 
that "blame ... on" involves arbitrary control ("lie 
2 l)ewdoped by J. (ih'in;sha.w and I{..lackendoff. 
268 
(verb 
noun 
(prep 
(adverb 
(adjective 
(verb 
verb 
noun 
:orth "abandon" :subc ((np-pp :pval ("to"))(np))) 
:orth "abandon" :features ((countable :pval ("with")))) 
:orth "above" ) 
:orth "above") 
:orth "above" :features ((ainrn)(apreq))) 
:orth "abstain" :subc ((intrans) 
(pp :pva\[ ("from")) 
(p-ing-sc :pua\] ("fro,'l\]")))) 
:orth "accept" :subc ((np)(that-s)(np-as-np))) 
:orth "acceptance") 
Figure 1: .qai,lph~ (X)M I,I,',X .qyntax diction:u'y en~.ri(-s. 
IAarned the country's health i~roblems (.m eating tc, o 
much chocolate."), whereas "blanle for" involw,s ol)-. 
ject control ("lie blamed John for going too fast."). 
The names fl)r the ditferent complmnent types are 
b~sed on the conventions used ill the Ih-ancleis wwb 
lexicon, where each COml)Mneut is designated by tl,, 
names of its constituents, together with a few tags to 
indicate things such as control phenonleua. Earh corn 
plement type is formally defined by n fr;uue (see Fig-. 
ure 2). Tile frame includes the constituellt structure, 
:cs, tile grammatical structure, :gs, one cu, nlm'e :fea- 
tures, and one or more ex~unples, :ex. Tile constit.uent 
structure lists the constituents in sequence; the gram- 
marital structure indicates the functional role played 
by e,~ch c(mstituent. The elemenl.s of the constitueut 
structure are indexed, and these indices are referenced 
in the grammatical structure field (in up-.frames, I.he 
index "1" in the grammatical structures always refers 
to tile surface subject of tile verb). 
Three verb frames are shown ill Figure 2. The fh'st, 
s, is for flail sententiM complements with ;m optional 
"that" eo,nplementizer. Tim second aim third frames 
I)oth represent infinitiwd conq~lemel,ts, aim dill're' only 
in their filnctiona\[ structure. The to-ingsc frame iv f(~r 
subject-cm~trol verbs, verbs for which the surface 
subject is the flmctional subject of both the nlatrix 
;tad embedded chmses. The notation :subject 1 in the 
:cs tleld indicates that the surface subject is the sub- 
ject of tile enlbedded clause, while the :subject 1 ill the 
:gs Iield indicates that it is the subject of the matrix 
clause. The indication :features (:control subject) pro- 
vides this \[nforlnation redundantly; we include I)oth 
indications in case one is more collvelliellt for i);trticu - 
ltu" dictionary users. The to-ingrs fl'atne is for raising- 
to-subject verbs - - verbs for which the surface subject 
is tile functional subject only of the embedded c\];tuso. 
The functional subject position in the matrix clause is 
unlilled, as indicated by the notation :gs (:subject () 
:corap 2). 
3 Methods 
Our basic aplm)acll has been to create an initial lexicon 
lll&llIUtl\]y a, lld \[,h~ll \[,() list! ;t vtH'i~ty of resolll'ces) both 
commercial aml corpus-deriw'd, to reline rids lexicon. 
Alth-ugh methods haw~ been dew%ped .ww tile last 
few years for autovual,ically ideutifyi,g sore,: subcat- 
i,gorizati(~ll consl,r:tillts I, llrotlgh corpus ;tllulysis \[2,5\[, 
these methods are sl,ill lhuited iu the range cf disthlc 
l, ions they can identify and their Mfility to deal with 
\](~w-frequency words. (hmsequently \ve have chosen \[,o 
use manual entry for creaticm of our initial dictio,mry. 
The entry of lexical information is being performed 
by flmr gll;tdllllte liuguistics studcllts, rel'erled I.o as 
elves ("elf" = euterer ,,f lexical features). Tile elw:s are 
provided with a memMmsed interl'~ce c-ded in C-lu- 
mort 1,isp using the Garnet GI/I package, aim runuiug 
on Sun workst.atimls. Tiffs iuterfa.ce also p,'c.vides ac 
tess t,o a hu'ge text corpus; as ~ wcwd is being', eutered, 
instances .f t, he word e;m be viewed in one of tim win- 
dows. I:,lves rely on cited, ions from the corpus, dellni- 
ti¢ms and citations from any of several printed dictio 
naries and their own linguistic intuitions in assigninp; 
features I,o words. 
I)ictiouary entry began ill April 19!)3. Au initial 
dicti<mary contahlhut ewtries for all the u(u.us, verbs 
and adjvci,ives ill tile ()AI,I) was coluldetml iu M.y, 
1!)9'1.3 
We expect t. checl¢ tiffs dicti,mary ;tg;tillSt sevel'a{ 
SOIIrC(!S, VVe hltelld to C¢)lill)al'e the IilaAlll;t\] sllbcate 
gorizations for verbs aF.ainsl, I, hose in the ()A\[,I), and 
would be pleased to make COllI\])a, risous ;I.l.,;a.illst other 
broad-c~werage dictiouarios if those Cttll be m!tde avail-- 
able tbr this purpose. We also hltend to mMw COml)ar- 
is~ms against sewn'al corpus deriw~d lists: at the very 
least, with w!rb/l~reptMthm and w~rb/partMe pairs 
wit.h high mutual inf, rmation \[3\] mid, if possible, wil.h 
the results of recently-developed procedures for ex 
tractinF, subcai,egorlzal, iou tYames from corpor;t \[2,.ti\]. 
While tiffs corpus-derived information may not be de- 
tailed or accurate e|lough for fu~ly-autonl~tted lexicon 
3No fl!gtlllres ;ire being assigned to adwM~s in the initial 
|eXi(:OII 
269 
(vp-frame s 
(vp-frame to-ingsc 
(vp-frame to-inf-rs 
:cs ((s 2 :that-comp optional)) 
:gs (:subject 1 :comp 2) 
:ex "they thought (that) he was always late") 
:cs ((vp :2 :mood to-infinitive :subject 1)) 
:features (:control subject) 
:gs (:subject 1 :comp 2) 
:ex "1 wanted to come.") 
:cs ((vp 2 :mood to-infinitlve :subject 1)) 
:features (:raising subject) 
:gs (:subject () :comp 2) 
:ex "they seemed to wilt.") 
Figure 2: Sample O*OMI,I'\]X .Syntax subcategorization Dames. 
creation, it sliould be most wduable as a basis for com- 
parisons. 
4 Types and Sources of Error 
As \])art of the process of refining the dictionary and as- 
suring its quality, we have spent considerable resources 
on reviewing dictionary entries and on occasion have 
had sections coded by two or even four of the elves. 
This process has allowed us to make some analysis 
of the sources and types of error in the lexicon, and 
how these errors might be reduced. We. can divide the 
sources of error and inconsistency into four classes: 
1. errors of classification: where an instance of 
a word is improperly analyzed, and in particular 
where the words following a verb are not properly 
identified with regard to complement type. SI)e- 
eific types of problems include misclassifying ad- 
juncts as arguments (or vice versa) and identifying 
the wrong control features. Our primary defenses 
against such errors have been a steady refinement 
of tile feature deseril)tions in ollr nlanlla\] and rel';- 
ular grou I) review sessions with all the elves. Ill 
particular, we have developed detailed criteria for 
making adjunct/argument distinctiolis \[O\]. 
A 1)reliminary study, conducted on examples 
(drawn at random from a corpus not used for 
our concordance) of verbs beginning with "j", in- 
dicated that elves were consistent 93% to 94% 
of the time in labeling argument/adjunct distinc- 
tions following our criteria and, in these eases, 
rarely disagreed on the subcategorization. In more 
than half of the cases where there was disagree- 
then(, the elves separately flagged these as drill- 
cult, ambiguous, or figurative uses of the verbs 
(and therefore would probably not use them its 
the basis for assigning lexical features). The agree- 
ment rate for examl)les whicti were not flagged was 
96% to 98%. 
2. oniitted features: where an ell' omits a Dature 
because it is not suggested by an example in the 
concordance, a citation ill the dictionary, or the 
elf's introspection. In order to get an est.ilnate of 
the niag,itude of this problem we decided to es- 
tablish a measure of coverage or "recall" for the 
subcategorization Dal.ures assigned by our elves. 
"lb do this, we tagged the first 150 "j" verbs from 
a randomly selected corpus from a part of the 
San Diego Mercury which was not inchlded in 
our concordance and then compared the dictio- 
nary entries created by our lexicographers against 
the tagged eorptis. The restllts of this colnparison 
are sliown in Figure 3. 
~Phe "(~omplements only" is the percentage of in- 
stances in the corpus covered by the subcatego- 
rization tags assigned by the elves and does not 
include the identification of i~rly l)rel)ositions or 
adverbs. 'l'lie "(~oinl~lements only" would corre- 
spond rougllly to the type of inforinal, ion provided 
by OALI) and l,I.)()(\]Jl'\] 4. The "COlllpielnc:nl,s q- 
l>relmsitions/l)articles" colliirin inehides eli the 
fl,al.ures> tllal, is it, eonsidel'S the correct idenl,ill- 
cation of the conip\]einent l)\]ilS the sp,~cilie prepo- 
sil.ions aiid adverbs i't!(lllil'e(\] by eert~thi comple- 
illonl.s. Ttie two COlllliiliS of (igiii'es iUlder "Ci)ni- 
l)lenients-t-I>rel>ositions/l'ari.icles ', show tim re- 
suits with and without the enumeration of dh'oe- 
tional l)reposltlons. 
We have recently changed ollr approach to i, he 
classification of verbs (like "riin '>' "send >>, "jog '>, 
"wall:', "juml;') wliieh take a long list of direc- 
tional l)rel)ositlons, by l)roviding our entering pro- 
grain with a P-D/I{ option on the preposition llst. 
'l'his option will automatically assign a list of di- 
rectional prepositions to the verb and thus will 
saw.' tirne and eliminate errors of rriissing prepo- 
sitions. In some eases this apl)roaeli will provide 
4 I~I)OCI~ does provide some preposltloiis and particles. 
270 
elf # (~JOml)lenlenl.s ()lily Conq)lemeuts + I'repositions/lhtrtich~s 
without I)-I)IIL using ILl)IlL 
l 96% 89% 90% 
2 82% 63% 79% 
3 95% 83% 92% 
4 87% 69% 81% 
elf av 90% 76% 8,1% 
elf union 10(1% 93% 9,1% 
Figure 3: Numl)er of subcategorization features assigned 1.o "j" verbs by (lifferenl, elves. 
elf # Coml)len,mts only ( ~ mpl ~menls -F I'repositions/I)artMes 
tvithoul. I)-l)ll{. using ILl)IlL 
__m 1 + 2 100% 
1 + 3 97% 
1 + 4 96% 
2 + 3 99 % 
2 + 4 95% 
3 + 4 97% 
2-elf av I 
!)1% 
91% 
!) 1% 
89% 
7!)% 
85% 
9:1% 
92% 
91% 
9O% 
86% 
92% 
.97% ss'7,, 91% 
Figure 4: Numl)er of subcategorization features assigned t~) ''j'' glories by pairs of elw!s. 
a prel)osition list that is a little rich for a given 
verb I)ut we have decided to err on the side of a 
slight overgeneration rather thall risk missing ally 
prel)ositions which actually occur. As you can see, 
the removal of the ILl)IlLs from consideration im- 
proves the in(lividual elf scores. 
The elf union score is the union, of the lexical en- 
tries for all fcmr elves. These are certainly nuln- 
bets to be proud of, but realistically, having the 
verbs clone four sel)arate times is not I)ractical. 
llowew~r, in our original proposal we stated that 
because of the complexity of the verb entries we 
wouhl like to have them done twice. As can be 
seeil in l:igure 5, with two passes we su('ce,,d hi 
raising individual percentages in all cases. 
We would like to make clear that evell in tim 
two cases where our individuM lexicographers miss 
18% and 13% of the complements, there was only 
one instance in which this might have resulted in 
the inability to parse a sentence. This was a miss- 
ing intransitNe. Otherwise, the missed cOnll)le- 
rnents wouhl have been analyzed as adjuncts since 
they were a combination of prepositional phrases 
and adverbials with one case of a suhordinal.e ccm- 
j line(ion ~as". 
We endeavored to make a comparison with 
LDOCE on the measurement. This was a bit dif- 
ficult since LDOCE lacks some con,plements we 
have and combines others, not always consistently. 
For instance, our PP roughly corresponds to either 
1,9 (our Pl'/al)Vl') or l)rep/adv + T1 (e.g. "on" 
3. 
+ TI) (our I'I'/I'AI{;F-NP) but in some cases a 
I)relmsition is mentioned but the verb is classified 
as iIltr~tllSii, iVe. 'Hie stra, ightl(Fw;u'd colnparisor~ 
has I,I)O(~E (illdhlg 7;~(~t) of f,he Lagged COml)le- 
menl.s hut a softer measure eliminating comple- 
ments that I,I)OC, E seems to 1)e lacking (I'AILT- 
NP-I'P, ILPOSSIN(I, PP-Pl ~) aM Mlowing for a 
1)P coral)lenient for "joke", although it is no(. spec- 
itied, results ill a l'ml'celfl.ag(; of 79. 
We haw~ adOld.ed tw. lines of defense against the 
prohh!m of omitted features, l"irsg, critical en- 
tries (particularly high fre(luency wM)s) have been 
done independently by two or more elves. Second, 
we are dewq.pinp; a IIIO1'(~ balanced C(H'plIH for t\]lo 
dv,,s to ,,,~,ilt. i~)~c,.,,,, st,,di~,s (e4';., \[1\]) co,lh.,,, 
our {)\])serv;ttions I,h:d, I'(':d, ures SllCh as sub('atego- 
rizati()n patterlls luay di\[i'l!r sui)stantiatly betweell 
corpora. We began with a corpus f,'ol,~ a single 
newspaper (S:m .lose Mercury News), but lutve 
since added the Ih'own corpus, several literary 
works from the l,ibrary of America, scientific ab- 
stracts fl'oul the U.S. I)epartment of Energy, aml 
;ill additional newspal~er (the Wall ~treet Jnur- 
Iiztl). In ,~xl, ending the corpus, we h&ve limited 
mlrselves to texts which would lie readily awdlable 
I,o nlenlliers of the l,inguistic I)ata Consortium. 
exl:e.ss ft~atul'es: when ;m elf assigns a spurious 
feature throu.gh incorrect extrapolation or analo.gy 
from available examples or introspection. Because 
of our desire Io obtain relatively complete foatllre 
sets, even figr infrequent verl)s, we have pernfit- 
271 
ted elves to extrapolate from the citations fotmd. 
Such a process is bound to be less certain than 
the assignment of features from extant examples. 
Ilowever, this problem does not appear to be very 
severe. A review of tile "j" verb entries produced 
by all four elves indicates that the fraction of spu- 
rious entries ranges from 2% to 6%. 
d. fl~zzy features: feature assignment is defined in 
terms of the acceptability of words in particular 
syntactic frames. Acceptability, however, is often 
not absolute but a matter of degree. A verb may 
occur primarily with particular complements, but 
will be "acceptable" with others. 
This problem is eompmmded by words which take 
on particular features only in special contexts. 
Thus, we don't ordinarily think of "dead" as be- 
ing gradable (*"Fred is more dead than Mary."), 
but we do say "deader than a door nail". It is 
also compounded by our decision not to make 
sense distinctions initially. For examl)le, many 
words which are countable (require a determiner 
before the singular form) also have a generic sense 
in which the determiner is not required (*"Fred 
bought apple." but "Apple is a wonderflfl fla- 
vor."). For each such problematic feature we have 
prepared gnidelines for the elves, but these still 
require considerable discretion on their part. 
'Fhese problems have emphasized for ns tbe impof 
tanee of developing a tagged corpus in conjunction 
with the dictionary, so that frequency of occurrence 
of a feature (and frequency by text type) will be avail- 
able. We have done stone preliminary tagging in par- 
aim with the completion of our initial dictionary. We 
expect to start tagging in earnest in early summmer. 
Our plan is to begin by tagging verbs in the Brown 
corpus, in order to be able to correlate our tagging 
with the word sense tagging being done by tim Word- 
Net group on the same corpus \[7\]. We expect to tag 
at least 25 instances of each verb. If there are not 
enough occurrences in tim Brown Corlms , we will use 
examples from the same sources as our extended cor- 
pus (s~e above). 
5 Acknowledgements 
Design and preparation of COMLEX Syntax has been 
supported by tile Advanced Research Projects Agency 
through the Office of Naval Research under Awards 
No. MDA972-92-J-1016 and N00014-90-J-1851, and 
The 'IYustees of the University of Pennsylwmia. 
\[41 
\[5\] 
\[2\] Michael Brent. From grammar to lexicon: Unsu- 
pervised learning oflexical syntax. Computalional 
Linguisties, 19(2):243--262, 1993. 
\[3\] l)onald ltindle and Mats l{.ooth. Structural ambi- 
guity and lexieal relations. In Proceeding.s" of the 
29th Annual MeetiT~g of the Ass'n. flrr Computa- 
lional Ling.uislics, pages 229-25;6, Berkeley, CA, 
June 1991. 
A. S. Ilornby, edito,'. O, ford Advanced Learner's 
Diclionary of Current English. 1980. 
Christol)ller Manning. Automatie acquisition of a 
large subeategorization dictionary fi'om eo,'pora. 
In Proceedi'ngs of the ','\]lst Annual Meeting of the 
Assn. fl)r Compulalio'nal Linguistics, pages 22/5 
242, Columbus, OI1, June 1993. 
\[6\] Adanl Meyers, Catherine Macleod, and l(.alph 
Grisl,man. Standardization of tim complement- 
ad.iunct distinction. Proteus Project Memoran- 
dum 64, Computer Science l)epartment, New 
York University, 1994. 
\[7\] George Miller, Clm,dia Leaeoek, l{andee Tengl, 
and Ross Ihmker. A semantic concordance. In 
Proceedings of the Human Language Technology 
Workshop, pages 303-308, Princeton, NJ, March 
1993. Morgan l(aufmaml. 
\[8\] P. Proctor, editor. Longman Dictionary of Con- 
lemporary English. Long,nan, 1978. 
\[9\] Naomi Sager. Natural Language Information Pro- 
cessing. Addison-Wesley, I{.eading, MA, 1981. 
\[10\] Antonio Sanlilippo. LI(B encoding of lexi- 
eal knowledge. In q'. Briscoe, A. Copestake, 
and V. de I'avia, editors, l)cfaull Inheritance 
in Unification-Based Approaches to Ihe Le'xicom 
Cambridge University Press, 1!)92. 
References 
\[1\] Douglas Biber. Using reglster-diversified corpora 
for general langnage studies. Computational Lin- 
guistics, 19(2):219-242, 1993. 
272 
