DICTIONARIES, DICTIONARY GRAMMARS AND DICTIONARY ENTRY PARSING 
Mary S. Neff IBM T. J. Watson Research Center, P. O. Box 704, Yorktown Heights, New York 10598 
Branimir K. Boguraev IBM T. J. Watson Research Center, P. O. Box 704, Yorktown Heights, New York 10598; 
Computer Laboratory, University of Cambridge, New Museums Site, Cambridge CB2 3QG 
Computerist: ... But, great Scott, what about structure? You can't just bang that lot into a machine without structure. Half a gigabyte of sequential file ... 
Lexicographer: Oh, we know all about structure. Take this entry for example. You see here italics as the typical ambiguous structural element marker, being apparently used as an undefined 
phrase-entry lemma, but in fact being the subordinate entry headword address preceding the small-cap cross-reference headword address which is nested within the gloss to a defined phrase 
entry, itself nested within a subordinate (bold lower-case letter) sense section in the second branch of a forked multiple part of speech main entry. Now that's typical of the kind of structural re- 
lationship that must be made crystal-clear in the eventual database. 
from "Taking the Words out of His Mouth" -- Edmund Weiner on computerising the Oxford English Dictionary 
(The Guardian, London, March, 1985) 
ABSTRACT 
We identify two complementary processes in the
conversion of machine-readable dictionaries into
lexical databases: recovery of the dictionary 
structure from the typographical markings which 
persist on the dictionary distribution tapes and 
embody the publishers' notational conventions; 
followed by making explicit all of the codified and 
elided information packed into individual entries. 
We discuss notational conventions and tape for- 
mats, outline structural properties of dictionaries, 
observe a range of representational phenomena 
particularly relevant to dictionary parsing, and 
derive a set of minimal requirements for a dic- 
tionary grammar formalism. We present a gen- 
eral purpose dictionary entry parser which uses a 
formal notation designed to describe the structure 
of entries and performs a mapping from the flat 
character stream on the tape to a highly struc- 
tured and fully instantiated representation of the 
dictionary. We demonstrate the power of the 
formalism by drawing examples from a range of 
dictionary sources which have been processed and 
converted into lexical databases. 
1. INTRODUCTION 
Machine-readable dictionaries (MRD's) are typically
available in the form of publishers'
typesetting tapes, and consequently are represented
by a flat character stream where lexical data 
proper is heavily interspersed with special (con- 
trol) characters. These map to the font changes 
and other notational conventions used in the 
printed form of the dictionary and designed to 
pack, and present in a codified compact visual 
format, as much lexical data as possible. 
To make maximal use of MRD's, it is necessary 
to make their data, as well as structure, fully explicit,
in a data base format that lends itself to flexible querying. However, since none of the 
lexical data base (LDB) creation efforts to date 
fully addresses both of these issues, they fail to 
offer a general framework for processing the wide 
range of dictionary resources available in 
machine-readable form. As one extreme, the 
conversion of an MRD into an LDB may be 
carried out by a 'one-off' program -- such as, for 
example, used for the Longman Dictionary of Contemporary English 
(LDOCE) and described 
in Boguraev and Briscoe, 1989. While the resulting
LDB is quite explicit and complete with 
respect to the data in the source, all knowledge 
of the dictionary structure is embodied in the 
conversion program. On the other hand, more 
modular architectures consisting of a parser and 
a grammar -- best exemplified by Kazman's 
(1986) analysis of the Oxford English Dictionary 
(OED) -- do not deliver the structurally rich and 
explicit LDB ideally required for easy and un- 
constrained access to the source data. 
The majority of computational lexicography 
projects, in fact, fall in the first of the categories 
above, in that they typically concentrate on the 
conversion of a single dictionary into an LDB: 
examples here include the work by e.g. Ahlswede et al., 
1986, on The Webster's Seventh New Collegiate Dictionary; 
Fox et al., 1988, on The Collins English Dictionary; 
Calzolari and Picchi, 
1988, on Il Nuovo Dizionario Italiano Garzanti; 
van der Steen, 1982, and Nakamura, 1988, on 
LDOCE. Even work based on multiple diction- 
aries (e.g. in bilingual context: see Calzolari and 
Picchi, 1986) appears to have used specialized 
programs for each dictionary source. In addition, 
not an uncommon property of the LDB's cited 
above is their incompleteness with respect to the 
original source: there is a tendency to extract, in 
a pre-processing phase, only some fragments (e.g. 
part of speech information or definition fields) 
while ignoring others (e.g. etymology, pronun- 
ciation or usage notes). 
We have built a Dictionary Entry Parser (DEP) 
together with grammars for several different dic- 
tionaries. Our goal has been to create a general 
mechanism for converting to a common LDB 
format a wide range of MRD's demonstrating a 
wide range of phenomena. In contrast to the 
OED project, where the data in the dictionary is 
only tagged to indicate its structural characteristics,
we identify two processes which are crucial 
for 'unfolding', or making explicit, the structure
of an MRD: identification of the structural 
markers, followed by their interpretation in con- 
text resulting in detailed parse trees for individual 
entries. Furthermore, unlike the tagging of the 
OED, carried out in several passes over the data 
and using different grammars (in order to cope 
with the highly complex, idiosyncratic and am- 
biguous nature of dictionary entries), we employ 
a parsing engine exploiting unification and back- 
tracking, and using a single grammar consisting 
of three different sets of rules. The advantages 
of handling the structural complexities of MRD 
sources and deriving corresponding LDB's in one 
operation become clear below. 
While DEP has been described in general terms 
before (Byrd et al., 1987; Neff et al., 1988), this 
paper draws on our experience in parsing the 
Collins German-English / Collins English-German 
(CGE/CEG) and LDOCE dictionaries, which 
represent two very different types of machine-readable
sources vis-à-vis format of the 
typesetting tapes and notational conventions ex- 
ploited by the lexicographers. We examine more 
closely some of the phenomena encountered in 
these dictionaries, trace their implications for 
MRD-to-LDB parsing, show how they motivate 
the design of the DEP grammar formalism, and 
discuss treatment of typical entry configurations. 
2. STRUCTURAL PROPERTIES OF MRD'S 
The structure of dictionary entries is mostly im- 
plicit in the font codes and other special charac- 
ters controlling the layout of an entry on the 
printed page; furthermore, data is typically com- 
pacted to save space in print, and it is common 
for different fields within an entry to employ rad- 
ically different compaction schemes and 
abbreviatory devices. For example, the notation 
T5a, b,3 stands for the LDOCE grammar codes 
T5a;T5b;T3 (Boguraev and Briscoe, 1989, pres- 
ent a detailed description of the grammar coding 
system in this dictionary), and many adverbs are 
stored as run-ons of the adjectives, using the 
abbreviatory convention ~ly (the same convention
applies to certain types of affixation in general:
-er, -less, -ness, etc.). In CGE, German 
compounds with a common first element appear 
grouped together under it: 
Kinder-: ~chor m children's choir; ~dorf nt children's village;
~ehe f child marriage.
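The grammar-code compaction mentioned above (T5a, b,3 standing for T5a;T5b;T3) can be unfolded mechanically. The following is a minimal sketch in Python, not the authors' code; the function name `expand_grammar_codes` and the inheritance rule it applies (a bare lowercase letter inherits the preceding letter and number, a bare number inherits only the preceding capital letter) are assumptions inferred from that single example.

```python
import re

def expand_grammar_codes(compact):
    """Expand a compacted code list such as 'T5a, b,3' into full codes.

    Assumed convention (inferred from one example only): a fragment may omit
    the capital letter, or the letter and the number, and inherits the missing
    parts from the most recent fuller fragment.
    """
    full = []
    letter, number = "", ""              # most recent capital letter / digits
    for item in (part.strip() for part in compact.split(",")):
        match = re.fullmatch(r"([A-Z]?)(\d*)([a-z]?)", item)
        if not item or match is None:
            raise ValueError(f"unrecognised code fragment: {item!r}")
        cap, digits, suffix = match.groups()
        if cap:                          # a complete code resets the context
            letter, number = cap, digits
        elif digits:                     # bare number: inherit the letter only
            number = digits
        # a bare lowercase suffix inherits both letter and number
        full.append(f"{letter}{number}{suffix}")
    return full

codes = expand_grammar_codes("T5a, b,3")
```

Here `codes` holds the three fully spelled-out codes, which is the form an LDB user would want to query.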
Dictionaries often factor out common substrings 
in data fields as in the following LDOCE and 
CEG entries: 
in.cu.ba.tor ... a machine for a keeping eggs warm until they HATCH b keeping alive babies that are too small 
to live and breathe in ordinary air 
Figure 1. Definition-initial common fragment 
Bankrott m -(e)s, -e bankruptcy; (fig) breakdown, collapse; (moralisch) bankruptcy. ~ machen to 
become or go bankrupt; den ~ anmelden or ansagen or erklären to declare oneself bankrupt. 
Figure 2. Definition-final common fragment 
Furthermore, a variety of conventions exists for 
making text fragments perform more than one 
function (the capitalization of HATCH above, 
for instance, signals a close conceptual link with 
the word being defined). Data of this sort is not 
very useful to an LDB user without explicit ex- 
pansion and recovery of compacted headwords 
and fragments of entries. Parsing a dictionary to 
create an LDB that can be easily queried by a 
user or a program therefore implies not only tagging
the data in the entry, but also recovering 
elided information, both in form and content. 
There are two broad types of machine-readable 
source, each requiring a different strategy for re- 
covery of implicit structure and content of dic- 
tionary entries. On the one hand tapes may 
consist of a character stream with no explicit 
structure markings (as OED and the Collins bilinguals
exemplify); all of their structure is implied 
in the font changes and the overall syntax of the 
entry. On the other hand, sources may employ 
mixed representation, incorporating both global 
record delimiters and local structure encoded in 
font change codes and/or special character sequences
(LDOCE and Webster's Seventh). 
Ideally, all MRD's should be mapped onto LDB 
structures of the same type, accessible with a single
query language that preserves the user's intuition
about the structure of lexical data (Neff et 
al., 1988; Tompa, 1986). Dictionary entries can 
be naturally represented as shallow hierarchies 
with a variable number of instances of certain 
items at each level, e.g. multiple homographs 
within an entry or multiple senses within a 
homograph. The usual inheritance mechanisms 
associated with a hierarchical organisation of data 
not only ensure compactness of representation, 
but also fit lexical intuitions. The figures overleaf 
show sample entries from CGE and LDOCE and 
their LDB forms with explicitly unfolded structure.
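To make the hierarchical view concrete, the sketch below holds a fragment of an entry as nested Python dictionaries and lists and retrieves every value stored under a given label, however deep it sits. It is an illustration only: the labels are taken from the "title" entry discussed in this paper, while the nested-dict encoding and the helper `find_all` are assumptions, not the LDB format or the query language used by the authors.

```python
def find_all(node, label):
    """Yield every value stored under `label` anywhere below `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == label:
                yield value
            yield from find_all(value, label)
    elif isinstance(node, list):
        # a list models "a variable number of instances" at one level
        for item in node:
            yield from find_all(item, label)

# Fragment of the "title" entry; repeated items are modelled as lists.
entry = {
    "hdw": "title",
    "superhom": [{
        "pos": "n",
        "sens": [{
            "sensnum": "a",
            "tran_group": [
                {"tran": [{"word": "Titel", "gender": "m"}]},
                {"usage_note": "of chapter",
                 "tran": [{"word": "Überschrift", "gender": "f"}]},
            ],
        }],
    }],
}

words = list(find_all(entry, "word"))
```

A query for `word` returns both translations without the caller needing to know how many senses or translation groups the entry has.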
Within the taxonomy of normal forms (NF) defined
by relational data base theory, dictionary 
entries are 'unnormalized' relations in which attributes
can contain other relations, rather than 
simple scalar values; LDB's, therefore, cannot be 
correctly viewed as relational data bases (see Neff 
et al., 1988). Other kinds of hierarchically struc- 
tured data similarly fall outside of the relational 
title [...] n (a) Titel m (also Sport); (of chapter) 
Überschrift f; (Film) Untertitel m; (form of address) 
Anrede f. what ~ do you give a bishop? wie redet or 
spricht man einen Bischof an? (b) (Jur) (right) 
(Rechts)anspruch (to auf + acc), Titel (spec) m; 
(document) Eigentumsurkunde f. 
entry
+-hdw: title
+-superhom
  ...
  +-pos: n
  +-sens
  | +-sensnum: a
  | +-tran_group
  | | +-tran
  | |   +-word: Titel
  | |   +-gender: m
  | |   +-domain: also Sport
  | +-tran_group
  | | +-usage_note: of chapter
  | | +-tran
  | |   +-word: Überschrift
  | |   +-gender: f
  | +-tran_group
  | | +-domain: Film
  | | +-tran
  | |   +-word: Untertitel
  | |   +-gender: m
  | +-tran_group
  | | +-usage_note: form of address
  | | +-tran
  | |   +-word: Anrede
  | |   +-gender: f
  | +-collocat
  |   +-source: what ~ do you give a bishop?
  |   +-target
  |     +-phrase: wie redet /or/ spricht man einen Bischof an?
  +-sens
    +-sensnum: b
    +-domain: Jur
    +-tran_group
    | +-usage_note: right
    | +-tran
    | | +-word: Rechtsanspruch
    | | +-word: Anspruch
    | | +-complement
    | | | +-prepcomp: to
    | | | +-prepcomp: auf + acc
    | | +-gender: m
    | +-tran
    |   +-word: Titel
    |   +-style: spec
    |   +-gender: m
    +-tran_group
      +-usage_note: document
      +-tran
        +-word: Eigentumsurkunde
        +-gender: f
Figure 3. LDB for a CEG entry 
NF mould; indeed recently there have been efforts
to design a generalized data model which 
treats flat relations, lists, and hierarchical structures
uniformly (Dadam et al., 1986). Our LDB 
format and Lexical Query Language (LQL) support
the hierarchical model for dictionary data; 
the output of the parser, similar to the examples 
in Figure 3 and Figure 4, is compacted, encoded, and loaded into an LDB. 
nui.sance /'nju:səns || 'nu:-/ n 1 a person or animal that 
annoys or causes trouble; PEST: Don't make a 
nuisance of yourself: sit down and be quiet! 2 an action 
or state of affairs which causes trouble, offence, or 
unpleasantness: What a nuisance! I've forgotten my 
ticket 3 Commit no nuisance (as a notice in a public 
place) Do not use this place as a a lavatory b a TIP⁴ 
entry
+-hdw: nuisance
+-superhom
  +-print_form: nui.sance
  +-primary
  | +-pron_string: "nju:sFns || "nu:-
  +-syncat: n
  +-sense_def
  | +-sense_no: 1
  | +-defn
  | | +-implicit_xrf
  | | | +-to: pest
  | | +-def_string: a person or animal that
  | |               annoys or causes trouble:
  | |               pest
  | +-example
  |   +-ex_string: Don't make a nuisance of
  |                yourself: sit down and
  |                be quiet!
  +-sense_def
  | +-sense_no: 2
  | +-defn
  | | +-def_string: an action or state of affairs
  | |               which causes trouble, offence,
  | |               or unpleasantness
  | +-example
  |   +-ex_string: What a nuisance!
  |                I've forgotten my ticket
  +-sense_def
    +-sense_no: 3
    +-defn
      +-hdw_phrase: Commit no nuisance
      +-qualifier: as a notice in a public place
      +-sub_defn
      | +-seq_no: a
      | +-defn
      |   +-def_string: Do not use this place
      |                 as a lavatory
      +-sub_defn
        +-seq_no: b
        +-defn
          +-implicit_xrf
          | +-to: tip
          | +-hom_no: 4
          +-def_string: Do not use this place
                        as a tip
Figure 4. LDB for an LDOCE entry 
3. DEP GRAMMAR FORMALISM 
The choice of the hierarchical model for the rep- 
resentation of the LDB entries (and thus the 
output of DEP) has consequences for the parsing 
mechanism. For us, parsing involves determining 
the structure of all the data, retrieving implicit 
information to make it explicit, reconstructing 
elided information, and filling a (recursive) template,
without any data loss. This contrasts with 
a strategy that fills slots in predefined (and finite) 
sets of records for a relational system, often dis- 
carding information that does not fit. 
In order to meet these needs, the formalism for 
dictionary entry grammars must meet at least 
three criteria, in addition to being simply a nota- 
tional device capable of describing any particular 
dictionary format. Below we outline the basic 
requirements for such a formalism. 
3.1 Effects of context 
The grammar formalism should be capable of 
handling 'mildly context sensitive' input streams, 
as structurally identical items may have widely 
differing functions depending on both local and 
global contexts. For example, parts of speech, 
field labels, paraphrases of cultural items, and 
many other dictionary fragments all appear in the 
CEG in italics, but their context defines their 
identity and, consequently, their interpretation. 
Thus, in the example entry in Figure 3 above, n, (also Sport), (of chapter), and (spec) 
acquire 
the very different labels of pos, domain, 
usage_note, and style. In addition, to distinguish
between domain labels, style labels, dialect 
labels, and usage notes, the rules must be able to 
test candidate elements against a closed set of 
items. Situations like this, involving subsidiary 
application of auxiliary procedures (e.g. string 
matching, or dictionary lookup required for an 
example below), require that the rules be allowed 
to selectively invoke external functions. 
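The interplay of context and closed-set tests can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than DEP's mechanism: the function `label_italic`, the flag `expecting_pos` (standing in for the parser's current expectations), and the tiny label sets are all hypothetical; the sets contain only labels that occur in the examples of this paper.

```python
# Closed sets consulted by the rule; illustrative subsets only.
DOMAIN_LABELS = {"Jur", "Film", "Sport", "Archit"}
STYLE_LABELS = {"spec"}

def label_italic(text, expecting_pos):
    """Assign a label to an italic fragment from its context and content.

    `expecting_pos` models global context: directly after the headword the
    grammar expects a part of speech, so the fragment is labelled `pos`
    without inspecting it. Elsewhere the fragment is tested against the
    closed sets, with `usage_note` as the fallback.
    """
    item = text.strip("()")
    if expecting_pos:
        return ("pos", item)
    # "(also Sport)" is treated as a domain label carrying a qualifier
    core = item[len("also "):] if item.startswith("also ") else item
    if core in DOMAIN_LABELS:
        return ("domain", item)
    if core in STYLE_LABELS:
        return ("style", item)
    return ("usage_note", item)

labels = [label_italic("n", True),
          label_italic("(also Sport)", False),
          label_italic("(of chapter)", False),
          label_italic("(spec)", False)]
```

The same typographic signal (italics) thus yields four different labels, depending on where it occurs and on whether the text belongs to a known class.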
The assignment of labels discussed above is based 
on what we will refer to in the rest of this paper as global 
context. In procedural terms, this is 
defined as the expectations of a particular grammar
fragment, reflected in the names of the associated
rules, which will be activated on a given 
path through the grammar. Global context is a 
dynamic notion, best thought of as a 'snapshot' 
of the state of the parser at any point of processing
an entry. In contrast, local context is defined 
by finite-length patterns of input tokens, and has 
the effect of identifying typographic 'clues' to the 
structure of an entry. Finally, immediate context 
reflects very local character patterns which tend 
to drive the initial segmentation of the 'raw' tape 
character stream and its fragmentation into 
structure- and information-carrying tokens. 
These three notions underlie our approach to 
structural analysis of dictionaries and are fundamental
to the grammar formalism design. 
3.2 Structure manipulation 
The formalism should allow operations on the 
(partial) structures delivered during parsing, and 
not as separate tree transformations once processing
is complete. This is needed, for instance, 
in order to handle a variety of scoping phenom- 
ena (discussed in section 5 below), factor out 
items common to more than one fragment within 
the same entry, and duplicate (sub-)trees as complete
LDB representations are being fleshed out. 
Consider the CEG entry for "abutment": 
abutment [...] n (Archit) Flügel- or Wangenmauer f. 
Here, as well as in "title" (Figure 3), a copy of 
the gender marker common to both translations 
needs to migrate back to the first tran. In addition,
a copy of the common second compound 
element -mauer also needs to migrate (note that 
identifying this needs a separate noun compound 
parser augmented with dictionary lookup). 
entry
+-hdw: abutment
+-superhom
  +-sens
    +-tran_group
      +-tran
      | +-word: Flügelmauer
      | +-gender: f
      +-tran
        +-word: Wangenmauer
        +-gender: f
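The migration just described can be imitated outside the grammar formalism. The sketch below is in Python and rests on loud assumptions: `KNOWN_NOUNS` is a three-word stand-in for the dictionary lookup, `split_compound` is a deliberately naive stand-in for the noun compound parser, and the alternatives are assumed to be joined by " or " as in the "abutment" entry.

```python
KNOWN_NOUNS = {"Mauer", "Chor", "Dorf"}      # stand-in for dictionary lookup

def split_compound(word, lexicon):
    """Return (first, second) if a known noun forms the tail of `word`."""
    for i in range(1, len(word)):
        tail = word[i:]
        if tail.capitalize() in lexicon:
            return word[:i], tail
    return None

def expand_shared_tail(text, gender, lexicon):
    """Unfold 'Flügel- or Wangenmauer' into complete translations.

    The second compound element of the last alternative is copied onto each
    truncated alternative, and the shared gender is copied onto every one.
    """
    parts = [p.strip() for p in text.split(" or ")]
    last = parts[-1]
    split = split_compound(last, lexicon)
    trans = []
    for part in parts[:-1]:
        if part.endswith("-") and split is not None:
            part = part[:-1] + split[1]      # Flügel- + mauer
        trans.append({"word": part, "gender": gender})
    trans.append({"word": last, "gender": gender})
    return trans

tran_group = expand_shared_tail("Flügel- or Wangenmauer", "f", KNOWN_NOUNS)
```

Each resulting `tran` carries its own full word and gender, matching the unfolded structure for "abutment".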
An example of structure duplication is illustrated 
by our treatment of (implicit) cross-references in 
LDOCE, where a link between two closely related
words is indicated by having one of them 
typeset in small capitals embedded in a definition 
of the other (e.g. "PEST" and "TIP" in the definitions
of "nuisance" in Figure 4). The dual 
purpose such words serve requires them to appear 
on at least two different nodes in the final LDB 
structure: def_string and implicit_xrf. In order
to perform the required transformations, the 
formalism must provide an explicit 
handle on partial structures, as they are being 
built by the parser, together with operations 
which can manipulate them -- both in terms of 
structure decomposition and node migration. 
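A small sketch may help to picture the duplication of small-capital words. It is written in Python and is not the DEP rule set: the token shapes `("text", ...)` and `("small_caps", ...)` and the function `build_defn` are assumptions made for the illustration; only the node names `def_string`, `implicit_xrf` and `to` come from Figure 4.

```python
def build_defn(tokens):
    """Build a defn node from (kind, text) tokens.

    A small-caps word is used twice: once as the target of an implicit
    cross-reference, and once as ordinary text inside the definition string.
    """
    xrefs, pieces = [], []
    for kind, text in tokens:
        if kind == "small_caps":
            xrefs.append({"to": text.lower()})   # first use: cross-reference
            pieces.append(text.lower())          # second use: definition text
        else:
            pieces.append(text)
    return {"implicit_xrf": xrefs, "def_string": "".join(pieces)}

defn = build_defn([("text", "Do not use this place as a "),
                   ("small_caps", "TIP")])
```

The word appears on two nodes of the result, which is what a later query for either the definition text or the cross-references requires.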
In general, the formalism must be able to deal 
with discontinuous constituents, a problem not 
dissimilar to the problems of discontinuous con- 
stituents in natural language parsing; however in 
dictionaries like the ones we discuss the phe- 
nomena seem less regular (if discontinuous con- 
stituents can be regarded as regular at all). 
3.3 Graceful failure 
The nature of the information contained in dictionaries
is such that certain fields within entries 
do not use any conventions or formal systems to 
present their data. For instance, the "USAGE" 
notes in LDOCE can be arbitrarily complex and 
unstructured fragments, combining straight text 
with a variety of notational devices (e.g. font 
changes, item highlighting and notes segmenta- 
tion) in such a way that no principled structure 
may be imposed on them. Consider, for example, 
the annotation of "loan": 
loan 2 v ........ esp. AmE to give (someone) the use of, 
lend ........ USAGE It is perfectly good AmE to use 
loan in the meaning of lend: He loaned me ten dollars. 
The word is often used in BrE, esp. in the meaning 'to 
lend formally for a long period': He loaned his 
collection of pictures to the public GALLERY but many 
people do not like it to be used simply in the meaning 
of lend in BrE... 
Notwithstanding its complexity, we would still 
like to be able to process the complete entry, re- 
covering as much as we can from the regularly 
encoded information and only 'skipping' over its 
truly unparseable fragment(s). Consequently, the 
formalism and the underlying processing framework
should incorporate a suitable mechanism 
for explicitly handling such data, systematically 
occurring in dictionaries. 
The notion of graceful failure is, in fact, best regarded
as 'selective parsing'. Such a mechanism 
has the additional benefit of allowing the incre- 
mental development of dictionary grammars with 
(eventually) complete coverage, and arbitrary 
depth of analysis, of the source data: a particular 
grammar might choose, for instance, to treat ev- 
erything but the headword, part of speech, and 
pronunciation as 'junk', and concentrate on 
elaborate parsing of the pronunciation fields, 
while still being able to accept all input without 
having to assign any structure to most of it. 
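Selective parsing can be pictured with the following Python sketch. It assumes, purely for illustration, that an entry has already been cut into named raw fields; `parse_entry`, `parse_pos`, `POS_TAGS` and the `junk` key are hypothetical names, and the real mechanism operates on token lists inside the grammar rather than on pre-labelled fields.

```python
POS_TAGS = {"n", "v", "adj", "adv"}          # illustrative closed set

def parse_pos(raw):
    tag = raw.strip()
    if tag not in POS_TAGS:
        raise ValueError(f"unknown part of speech: {tag!r}")
    return tag

def parse_entry(fields, parsers):
    """Parse the fields the grammar covers and keep everything else as junk.

    Nothing is discarded: input without a rule, or input a rule rejects,
    is accepted unanalysed, so the entry as a whole never fails.
    """
    result, junk = {}, []
    for name, raw in fields:
        parser = parsers.get(name)
        try:
            if parser is None:
                raise ValueError("no rule for this field")
            result[name] = parser(raw)
        except ValueError:
            junk.append(raw)
    if junk:
        result["junk"] = junk
    return result

usage = "It is perfectly good AmE to use loan in the meaning of lend"
parsed = parse_entry([("hdw", " loan "), ("pos", "v"), ("usage", usage)],
                     {"hdw": str.strip, "pos": parse_pos})
```

Adding a parser for another field later changes only the `parsers` table, which mirrors the incremental grammar development described above.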
4. OVERVIEW OF DEP 
DEP uses as input a collection of 'raw' 
typesetting images of entries from a dictionary 
(i.e. a typesetting tape with 'begin-end' boundaries
of entries explicitly marked) and, by consulting
an externally supplied grammar specific for 
that particular dictionary, produces explicit structural
representations for the individual entries, 
which are either displayed or loaded into an LDB. 
The system consists of a rule compiler, a parsing 
engine, a dictionary entry template generator, an LDB loader, and various development facilities, 
all in a PROLOG shell. User-written PROLOG 
functions and primitives are easily added to the 
system. The formalism and rule compiler use the 
Modular Logic Grammars of McCord (1987) as 
a point of departure, but they have been substantially
modified and extended to reflect the requirements
of parsing dictionary entries. 
The compiler accepts three different kinds of rules 
corresponding to the three phases of dictionary 
entry analysis: tokenization, retokenization, and 
parsing proper. Below we present informally 
some highlights of the grammar formalism. 
4.1 Tokenization 
Unlike in sentence parsing, where tokenization 
(or lexical analysis) is driven entirely by blanks 
and punctuation, the DEP grammar writer explicitly
defines token delimiters and token substitutions.
Tokenization rules specify a one-to-one 
mapping from a character substring to a rewrite 
token; the mapping is applied whenever the 
specified substring is encountered in the original 
typesetting tape character stream, and is only 
sensitive to immediate context. Delimiters are 
usually font change codes and other special characters
or symbols; substitutions are atoms (e.g. 
ital_correction, field_m) or structured terms 
(e.g. font(italic), sup("1")). Tokenization breaks the source character stream into a mixture 
of tokens and strings; the former embody the 
notational conventions employed by the printed 
dictionary, and are used by the parser to assign 
structure to an entry; the latter carry the textual 
(lexical) content of the dictionary. Some sample 
rules for the LDOCE machine-readable source, 
marking the beginning and end of font changes, 
or making explicit special print symbols, are 
shown below (to facilitate readability, (*AS) re- 
presents the hexadecimal symbol x'AS'). 
delim("(~i)", font(italic)).
delim("(*CA)", font(begin(small_caps))).
delim("(*CB)", font(end(small_caps))).
delim("(~)", ital_correction).
delim("(*80)", hyphen_mark).
Immediate context, as well as local string rewrite, 
can be specified by more elaborate tokenization 
rules, in which two additional arguments specify 
strings to be 'glued' to the strings on the left and 
right of the token delimiter, respectively. For 
CEG, for instance, we have 
dotiml". >u4<", f~t;~l;)>~).<°'). delim( ":>u~<", 
delim( ">uS<", font( roman ) ). 
Tokenization operates recursively on the string 
fragments formed by an active rule; thus, application
of the first two rules above to the string 
,,mo~. :~a,: ~r~" results in the following token 
list: "xxx" . lad . font(bold) . "yyy". 
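The recursive splitting performed by tokenization can be sketched in a few lines of Python. This is an illustration, not the DEP tokenizer: the delimiter codes in `DELIMS` and the tuple encoding of tokens are assumptions chosen for the example, and the additional 'glue' arguments of the more elaborate rules are not modelled.

```python
# Hypothetical delimiter table: tape code -> rewrite token.
BEGIN_SC = ("font", "begin", "small_caps")
END_SC = ("font", "end", "small_caps")
HYPHEN = ("hyphen_mark",)
DELIMS = [("(*CA)", BEGIN_SC), ("(*CB)", END_SC), ("(*80)", HYPHEN)]

def tokenize(text, delims):
    """Split `text` on the first applicable delimiter, then recurse.

    The result mixes rewrite tokens (tuples) with the string fragments left
    between them; empty fragments are dropped.
    """
    for code, token in delims:
        at = text.find(code)
        if at != -1:
            left, right = text[:at], text[at + len(code):]
            return tokenize(left, delims) + [token] + tokenize(right, delims)
    return [text] if text else []

tokens = tokenize("suffering from (*CA)autism(*CB)", DELIMS)
```

Because each rule is reapplied to the fragments on both sides of a match, the order of rules in the table does not affect which delimiters are found, only the order in which they are discovered.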
4.2 Retokenization 
Longer-range (but still local) context sensitivity 
is implemented via retokenization, the effect of 
which is the 'normalization' of the token list. 
Retokenization rules conform to a general rewrite 
format -- a pattern on the left-hand side defines 
a context as a sequence of (explicit or variable 
place holder) tokens, in which the token list 
should be adjusted as indicated by the right-hand 
side -- and can be used to perform a range of 
cleaning up tasks before parsing proper. 
Streamlining the token list. Tokens without information-
or structure-bearing content, such as 
those associated with the codes for italic correction or 
thin space, are removed: 
ital_correction : +Seg <=> +Seg. 
Superfluous font control characters can be simply 
deleted, when they follow or precede certain 
data-carrying tokens which also incorporate 
typesetting information (such as a homograph 
superscript symbol or a pronunciation marker 
indicating the beginning of the scope of a phonetic
font): 
rk font! phonetic ) < • rk. supl N) < • R 
(Re)adjusting the token list. New tokens can be 
introduced in place of certain token sequences: 
bra : font(italic) <=> begin(restriction).
font(roman) : ket <=> end(restriction).
Reconstruction of string segments. Where the 
initial (blind) tokenization has produced spurious 
fragmentation, string segments can be suitably 
reconstructed. For instance, a hyphen-delimited 
sequence of syllables in place of the print form 
of a headword, created by tokenization on 
(*80), can be 'glued' back as follows: 
+Syl_1 : hyphen_mark : +Syl_2 : $stringp(Syl_1) : $stringp(Syl_2)
   <=> $join(Seg, Syl_1."-".Syl_2.nil) : +Seg.
This rule demonstrates a characteristic property 
of the DEP formalism, discussed in more detail 
later: arbitrary Prolog predicates can be invoked 
to e.g. constrain rule application or manipulate 
strings. Thus, the rule only applies to string tokens
surrounding a hyphen character; it manufactures,
by string concatenation, a new segment 
which replaces the triggering pattern. 
Further segmentation. Often strings need to be 
split, with new tokens inserted between the 
pieces, to correct infelicities in the tapes, or to 
insert markers between recognizably distinct con- 
tiguous segments that appear in the same font. 
The rule below implements the CGE/CEG con- 
vention that a swung dash is an implicit switch 
to bold if the current font is not bold already. 
font(X) : $(-X=bold) : +E : $stringp(E) : $concat(A,B,E) : $concat("~",R,B)
   <=> font(X) : +A : font(bold) : +B.
Dealing with irregular input. Rules that rearrange
tokens are often needed to correct errors in 
the tapes. In CEG/CGE, parentheses surrounding
italic items often appear (erroneously) in a 
roman font. A suite of rules detaches the stray 
parentheses from the surrounding tokens, moves 
them around the font marker, and glues them to 
the item to which they belong. 
+E : $stringp(E) : $concat(")",E1,E) <=> ")" : +E1.     /* detach */
font(F) : ")" <=> ")" : retoken(font(F)).               /* move */
+E : $stringp(E) : ")" : $concat(E,")",E1) <=> +E1.     /* glue */
retoken invokes retokenization recursively on the 
sublist beginning with font(F) and including all 
tokens to its right. In principle, the three rules 
can be subsumed by a single one; in practice, 
separate rules also 'catch' other types of erroneous
or noisy input. 
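Two of the retokenization tasks described above, removing content-free tokens and gluing hyphen-separated syllables back together, can be imitated as a single pass over a token list. The Python sketch below is only an analogy to the rewrite rules: the tuple encoding of marker tokens and the function `retokenize` are assumptions, and pattern variables, recursion and font handling are not modelled.

```python
ITAL = ("ital_correction",)
HYPHEN = ("hyphen_mark",)

def retokenize(tokens):
    """Normalize a token list: strings are str, marker tokens are tuples."""
    out = []
    for tok in tokens:
        if tok == ITAL:
            continue                         # streamlining: no content, drop
        glue = (isinstance(tok, str) and len(out) >= 2
                and out[-1] == HYPHEN and isinstance(out[-2], str))
        if glue:
            out.pop()                        # remove the hyphen marker
            out[-1] = out[-1] + "-" + tok    # string . hyphen . string
        else:
            out.append(tok)
    return out

cleaned = retokenize(["au", HYPHEN, "tis", HYPHEN, "tic",
                      ITAL, ("font", "roman")])
```

The syllables are rebuilt into one segment only when strings stand on both sides of the hyphen marker, which corresponds to the string tests that constrain the rewrite rule.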
Although retokenization is conceptually a separate
process, it is interleaved in practice with 
tokenization, bringing improvements in performance.
Upon completion, the tape stream corresponding,
for instance, to the LDOCE entry 
au.tis.tic /ɔ:'tistik/ adj suffering from AUTISM: 
autistic child/behaviour -- ~ally adv [Wa4] 
F<autistic<F<>au(*80)tis(*80)ticP<C:"tIstIk 
Z kH<adj<S<0000<0<suffering from(*CA)autis 
m(*CB)(*8A): (*46)autistic child/behavi 
our(~) R<01<R<~ally<R<><adv<Wa4< 
is converted into the following token list: 
maHtar fld ~ . p@ maHter . 
pro~_wmrker - ~sd_--rker 
do~ marker font(begin(small_caps)). 
~t ..1-1 . bagin~e~m) . 
"autistic" "au-tis-tic" 
"C:"tIstIk" "adj" 
"0000" "suffering from" 
"autism" "autistic 
child/behaviour" "01" 
~l~ e ltlqc~r 
begin(deriv) . "autistically" 
end(deriv) . fld_sep . "adv" 
fld_sep . "Wa4" . fld_sep . 
4.3 Parsing 
Parsing proper makes use of unification and 
backtracking to handle identification of segments 
by context, and is heavily augmented with some 
non-trivial manipulation of (partial) trees, as implicit
and/or elided information packed in the 
entries is being recovered and reorganized. Parsing
is a top-down depth-first operation, and only 
the first successful parse is used. This strategy, 
augmented by a 'junk collection' mechanism 
(discussed below) to recover from parsing failures, 
turns out to be adequate for handling all of the 
phenomena encountered while assigning structural
descriptions to dictionary entries. 
Dictionary grammars follow the basic notational 
conventions of logic grammars; however, we use 
additional operators tailored to the structure manipulation
requirements of dictionary parsing. In 
particular, the right-hand side of grammar rules 
admits the use of four different types of operators, 
designed to deal with token list consumption, token
list manipulation, structure assignment, and 
(local) tree transformations. These operators 
suitably modify the expansions of grammar rules; 
ultimately, all rules are compiled into Prolog. 
Token consumption. Tokens are removed from 
the token list by the + and - operators; + also assigns
them as terminal nodes under the head of 
the invoking rule. Typically, delimiters introduced
by tokenization (and retokenization) are 
removed once they serve their primary function 
of identifying local context; string segments of the 
token list are assigned labels and migrate to appropriate
places in the final structural representation
of an entry. A simple rule for the part of 
speech fields in CEG (Figure 3) would be: 
pos ==> -font(italic) : +Seg. 
A structured term s(pos, "n".nil) is built as a 
result of the rule consuming, for instance, the token
"n". Rule names are associated with attributes
in the LDB representation for a dictionary 
entry; structures built by rules are pairs of the 
form s(name, value), where value is a list of one 
or more elements (strings or further structures 
'returned' by recursively invoked rules). 
Token list manipulation. Adjustment of the token
list may be required in, for instance, simple 
cases of recovering elided information or reordering
tokens in the input stream. This is 
achieved by the ins and insx operators, which 
respectively insert single, or sequences of, tokens 
into the token list at the current position; and the 
++ operator, which inserts tokens (or arbitrary 
tree fragments) directly into the structure under 
construction. Assuming a global variable, hdw, bound to the headword of the current entry, and 
the ability to invoke a Prolog string concatenation
function from within a rule (via the $ 
operator; see below), abbreviated morphological 
derivations stored as run-ons might be recovered by: 
run_on ==> -runon_mark : -font(bold) : -Seg : $concat("~", X, Seg) :
           $isa(X, suffix) : $concat(hdw, X, Deriv) : ++Deriv.
(isa is separately defined to test for membership 
of a closed class of suffixes.) 
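The recovery of abbreviated run-on derivations amounts to a string test followed by a concatenation, which the following Python sketch spells out. It is not the grammar rule itself: `recover_run_on` and the contents of `SUFFIXES` are assumptions, with the set standing in for the separately defined closed class of suffixes.

```python
SUFFIXES = {"ly", "ally", "ness", "less", "er"}   # stand-in closed class

def recover_run_on(headword, run_on):
    """Expand an abbreviated run-on such as '~ally' against the headword.

    Returns the full derived form, or None when the segment does not start
    with the swung dash or the remainder is not a known suffix.
    """
    if not run_on.startswith("~"):
        return None
    suffix = run_on[1:]
    if suffix not in SUFFIXES:
        return None
    return headword + suffix

derived = recover_run_on("autistic", "~ally")
```

Returning None rather than raising mirrors a rule that simply fails to apply, leaving the segment to be handled by other rules.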
Structure assignment. The ++ operator can only 
assign arbitrary structures directly to the node in 
the tree which is currently under construction. A 
more general mechanism for retaining structures 
for future use is provided by allowing variables to 
be (optionally) associated with grammar rules: in 
this way the grammar writer can obtain an explicit
handle on tree fragments, in contrast to the 
default situation where each rule implicitly 
'returns' the structure it constructs to its caller. 
The following rule, for example, provides a skel- 
eton treatment to the situation exemplified in 
Figure 4, where a definition-initial substring is 
common to more than one sub-definition: 
defs ==> -Seg : $stringp(Seg) : subdefs(Seg).
subdefs(X) ==> subdef(X) : opt(subdefs(X)).
subdef(X) ==> -font(bold) : sd_letter : -font(roman) :
    -Seg : $concat(X, Seg, DefString) :
    ins(DefString) : def_string.
sd_letter ==> +Seg : $verify(Seg, "abc").
def_string ==> +Seg : $stringp(Seg).
The defs rule removes the definition-initial string
segment and passes it on to the repeatedly in-
voked subdef. This manufactures the complete
definition string by concatenating the common 
initial segment, available as an argument 
instantiated two levels higher, with the continua- 
tion string specific to any given sub-definition. 
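The effect of passing the common initial segment down to each sub-definition can be illustrated with a short Python sketch. It is only an analogue of the rules above: the function name `expand_subdefs`, the node layout, and the sample strings are assumptions made for illustration.

```python
def expand_subdefs(common, continuations):
    """Rebuild full definition strings for lettered sub-definitions.

    `common` is the definition-initial segment shared by all
    sub-definitions; `continuations` is a list of (letter, text) pairs.
    """
    return [
        {"sd_letter": letter, "def_string": common + text}
        for letter, text in continuations
    ]

# Hypothetical entry fragment: one shared opening, two continuations.
subdefs = expand_subdefs("a person who ", [("a", "buys goods"), ("b", "sells goods")])
for node in subdefs:
    print(node["sd_letter"], node["def_string"])
```

The list comprehension plays the role of the recursive subdefs rule, and the string concatenation corresponds to the $concat call inside subdef.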
Tree transformations. The ability to refer, by 
name, to fragments of the tree being constructed 
by an active grammar rule, allows arbitrary tree 
transformations using the complementary opera-
tors -% and +%. They can only be applied to
non-terminal grammar rules, and require the ex- 
plicit specification of a place-holder variable as a 
rule argument; this is bound to the structure 
constructed by the rule. The effect of these op- 
erators on the tree fragments constructed by the 
rules they modify is to prevent their incorporation 
into the local tree (in the case of -%), to explicitly
splice it in (in the case of +%), or simply to capture
it (%). The use of this mechanism in conjunction
with the structure naming facility allows both 
permanent deletion of nodes, as well as their 
practically unconstrained migration between, and 
within, different levels of grammar (thus imple- 
menting node raising and reordering). It is also 
possible to write a rule which builds no structure 
(the utility of such rules, in particular for con- 
trolling token consumption and junk collection, 
is discussed in section 5). 
Node-raising is illustrated by the grammar frag- 
ment below, which might be used to deal with 
certain collocation phenomena. Sometimes dic-
tionaries choose to explain a word in the course
of defining another related word by arbitrarily in-
serting mini-entries in their definitions:
lach.ry.mal /'lækrɪməl/ adj [Wa5] of or concerning tears
or the organ (lachrymal gland /'_ ./) of the body that
produces them
The potentially complex structure associated with 
the embedded entry specification does not belong 
to the definition string, and should be factored 
out as a separate node moved to a higher level of 
the tree, or even used to create a new tree entirely. 
The rule for parsing the definition fields of an
entry makes a provision for embedded entries; the
structure built as an embedded_entry is bound to
the Struc argument in the defn rule. The -% op-
erator prevents the embedded_entry node from
being incorporated as a daughter to defn: how-
ever, by unification, it begins its migration
'upwards' through the tree, till it is 'caught' by the
entry rule several levels higher and inserted (via
+%) in its logically appropriate place.
entry ==> head : pron : pos : code : defn(Embedded) :
    +%embedded_entry(Embedded).
defn(Struc) ==> -Seg1 : $stringp(Seg1) :
    -%embedded_entry(Struc) :
    -Seg2 : $stringp(Seg2) :
    $concat(Seg1, Seg2, DefString) : ++DefString.
embedded_entry ==> -bra : ... : -ket.
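The net effect of this node raising can be approximated in Python. The sketch below is not the DEP mechanism (which relies on unification); it assumes, purely for illustration, that an embedded entry is a single parenthesised span, and the names `parse_defn` and the dictionary keys are hypothetical.

```python
import re

def parse_defn(text):
    """Split a definition into its string and an optional embedded entry.

    The bracketed span is kept out of the definition string (the role
    of -%) and returned separately so that a caller can attach it as a
    sister of the definition (the role of +% in the entry rule).
    """
    match = re.search(r"\s*\(([^()]*)\)\s*", text)
    if match is None:
        return {"defn": text}
    # Join the segments before and after the bracket, as $concat does.
    defn = (text[:match.start()] + " " + text[match.end():]).strip()
    return {"defn": defn, "embedded_entry": match.group(1).strip()}

entry = parse_defn("of or concerning tears or the organ "
                   "(lachrymal gland) of the body that produces them")
print(entry["defn"])
print(entry["embedded_entry"])
```

A real embedded entry carries its own internal structure (pronunciation, codes), which is why the grammar builds it with a dedicated rule rather than a pattern match.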
Capturing generalizations / execution control. 
The expressive power of the system is further en-
hanced by allowing optionality (via the opt oper-
ator), alternations (|) and conditional constructs
in the grammar rules; the latter are useful both for
more compact rule specification and to control
backtracking while parsing. Rule application
may be constrained by arbitrary tests (invoked,
as Prolog predicates, via a $ operator), and a
string operator is available for sampling local 
context. The mechanism of escaping to Prolog, 
the motivation for which we discuss below, can 
also be invoked when arbitrary manipulation of 
lexical data -- ranging from e.g. simple string 
processing to complex morphological analysis -- 
is required during parsing.
Tree structures. Additional control over the 
shape of dictionary entry trees is provided by
having two types of non-terminal nodes: weak 
and strong ones. The difference is in the explicit 
presence or absence of nodes, corresponding to 
the rule names, in the final tree: a structure frag-
ment manufactured by a weak non-terminal is
effectively spliced into the higher level structure, 
without an intermediate level of naming. One 
common use of such a device is the 'flattening' 
of branching constructions, typically built by re- 
cursive rules: the declaration 
strong_nonterminals(defs . subdef . nil).
when applied to the sub-definitions fragment
above, would lead to the creation of a group of
sister subdef nodes, immediately dominated by a
defs node. Another use of the distinction be-
tween weak and strong non-terminals is the effective mapping from typographically identical
entry segments to appropriately named structure 
fragments, with global context driving the name 
assignment. Thus, assuming a weak label rule 
which captures the label string for further testing, 
analysis of the example labels discussed in 3.1 could be achieved as follows (also see Figure 3): 
label(X) ==> -begin(restriction) : +X : $stringp(X) : -end(restriction).
tran ==> opt(domain | style | dial | usage_note) : word.
domain ==> label(X) : $isa(X, dom_lab).
style ==> label(X) : $isa(X, style_lab).
dial ==> label(X) : $isa(X, dial_lab).
usage_note ==> label(X).
Such a mechanism captures generalities in
typographic conventions employed across any
given dictionary, and yet preserves the distinct
name spaces required for a meaningful unfolding
of a dictionary entry structure. 
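The flattening effect of weak non-terminals described in this section can be sketched in Python. Trees are modelled here as `(name, children)` tuples with strings as leaves; this representation and the function name `splice` are assumptions made for the example, not part of DEP.

```python
# Mirrors the strong_nonterminals declaration: only these names
# survive as explicit nodes in the final tree.
STRONG = {"defs", "subdef"}

def splice(node, strong=STRONG):
    """Return the list of nodes that `node` contributes to its parent."""
    if isinstance(node, str):
        return [node]
    name, children = node
    kids = [k for child in children for k in splice(child, strong)]
    if name in strong:
        return [(name, kids)]
    return kids  # weak node: its daughters are spliced into the parent

# Right-branching structure as a recursive subdefs rule would build it.
raw = ("defs", [("subdefs", [("subdef", ["a ..."]),
                             ("subdefs", [("subdef", ["b ..."])])])])
print(splice(raw))
```

The recursive subdefs layers disappear and the subdef nodes end up as sisters immediately under defs.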
5. RANGE OF PHENOMENA TO HANDLE 
Below we describe some typical phenomena en- 
countered in the dictionaries we have parsed and 
discuss their treatment. 
5.1 Messy token lists: controlling token 
consumption 
The unsystematic encoding of font changes be- 
fore, as well as after, punctuation marks (com- 
mas, semicolons, parentheses) causes blind 
tokenization to remove punctuation marks from 
the data to which they are visually and concep- 
tually attached. As already discussed (see 4.2), 
most errors of this nature can be corrected by 
retokenization. Similarly, the confusing effects 
of another pervasive error, namely the occurrence 
of consecutive font changes, can be avoided by
having a retokenization rule simply remove all 
but the last one. In general, context sensitivity is 
handled by (re)adjusting the token list; 
retokenization, however, is only sensitive to local 
context. Since global context cannot be deter-
mined unequivocally till parsing, the grammar
writer is given complete control over the con- 
sumption and addition of tokens as parsing pro- 
ceeds from left to right -- this allows for 
motivated recovery of elisions, as well as dis-
carding of tokens in local transformations.
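The retokenization step that keeps only the last of a run of consecutive font changes is easy to state procedurally. The following Python sketch assumes a token list in which font changes are `("font", name)` tuples and text segments are plain strings; that encoding and the function names are illustrative only.

```python
def is_font(tok):
    """A font-change token is modelled here as a ('font', name) tuple."""
    return isinstance(tok, tuple) and len(tok) == 2 and tok[0] == "font"

def collapse_font_changes(tokens):
    """Keep only the last of any run of consecutive font-change tokens."""
    out = []
    for tok in tokens:
        if is_font(tok) and out and is_font(out[-1]):
            out[-1] = tok  # a later font change overrides the earlier one
        else:
            out.append(tok)
    return out

tokens = [("font", "bold"), ("font", "italic"), ("font", "roman"),
          "typist", ("font", "italic"), "pej"]
print(collapse_font_changes(tokens))
```

This transformation needs only local context, which is why it can be done before parsing; the context-dependent cases discussed next cannot.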
For instance, spurious occurrences of a font 
marker before a print symbol such as an opening 
parenthesis, which is not affected by a font dec-
laration, clearly cannot be removed by a
retokenization rule
font(roman) : bra <=> bra.
(The marker may be genuinely closing a font 
segment prior to a different entry fragment which 
commences with, e.g., a left parenthesis). Instead, 
a grammar rule anticipating a bra token within its
scope can readjust the token list using either of:
... ==> ... : -font(roman) : -bra : ins(bra).
... ==> ... : -font(roman) : string(bra.*).
(The string operator tests for a token list with bra as its first element.)
5.2 The Peter-1 principle: scoping phenomena 
Consider the entry for "Bankrott" in Figure 2. 
Translations sharing the label (fig) ("breakdown,
collapse") are grouped together with commas and
separated from other lists with semicolons. The
restriction (context or label) precedes the list and
can be said to scope 'right' to the next semicolon.
We place the right-scoping labels or context un-
der the (semicolon-delimited) tran_group as sister
nodes to the multiple (comma-delimited) tran
nodes (see also the representation of "title" in
Figure 3). Two principles are at work here:
maintaining implicit evidence of synonymy
among terms in the target language responds to
the "do not discard anything" philosophy; placing
common data items as high as possible in the tree
(the 'Peter-minus-1 principle') is in the spirit of
Flickinger et al. (1985), and implements the
notion of placing a terminal node at the highest
position in the tree where its value is valid in
combination with the values at or below its sister
nodes. The latter principle also motivates sets of 
rules like 
entry ==> ... pron ... : homograph ... .
homograph ==> ... pron ... .
used to account for entries in English where the 
pronunciation differs for different homographs. 
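The right-scoping treatment of labels can be pictured with a small Python sketch that groups translations by semicolon and hangs a leading label next to its comma-delimited translations. It works on an already isolated translation string, which is a simplification; the function name `group_translations`, the dictionary layout, and the sample word "bankruptcy" are assumptions for illustration.

```python
def group_translations(text):
    """Build one group per semicolon-delimited list of translations.

    A parenthesised label at the start of a list becomes a sister of
    the comma-delimited translations in that group, so it applies to
    all of them and to nothing beyond the next semicolon.
    """
    groups = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        group = {}
        if chunk.startswith("("):
            label, _, chunk = chunk[1:].partition(")")
            group["label"] = label.strip()
        group["tran"] = [w.strip() for w in chunk.split(",") if w.strip()]
        groups.append(group)
    return groups

print(group_translations("bankruptcy; (fig) breakdown, collapse"))
```

Storing the label once per group, rather than once per translation, is the "as high as possible" placement the text describes.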
5.3 Tribal memory: rule variables 
Some compaction or notational conventions in 
dictionaries require a mechanism for a rule to re-
member (part of) its ancestry or know its sister's
descendants. Consider the problem of determin-
ing the scope of gender or labels immediately
following variants of the headword: 
Advokaturbüro nt (Sw), Advokaturskanzlei f (Aus) lawyer's office.
Tippfräulein nt (inf), Tippse f -, -n (pej) typist.
Alchemie (esp Aus), Alchimie f alchemy.
The first two entries show forms differing, re- 
spectively, in dialect and gender, and register and 
gender. The third illustrates other combinations. 
The rule accounting for labels after a variant must 
know whether items of like type have already 
been found after the headword, since items before
the variant belong to the headword, different
items of identical type following both belong in-
dividually, and all the rest are common to both.
This 'tribal' memory is implemented using rule 
variables: 
entry ==> ... ( ( dial : $(N=dial) ) |
    $(N=nodial) ) : ... : opt(subhwd(N)) : ... .
subhwd(N) ==> opt( $(N=nodial) :
    opt(dial) ) : ... .
In addition to enforcing rule constraints via 
unification, rule arguments also act as 'channels' 
for node raising and as a mechanism for control-
ling rule behaviour depending on invocation
context. 
This latter need stems from a pervasive phenom- 
enon in dictionaries: the notational conventions 
for a logical unit within an entry persist across 
different contexts, and the sub-grammar for such 
a unit should be aware of the environment it is 
activated in. Implicit cross-references in LDOCE 
are consistently introduced by font(small_caps),
independent of whether the running text is a de-
finition (roman font), example (italic), or an em-
bedded phrase or idiom (bold); by enforcing the
return to the font active before the invocation of
implicit_xrf, we allow the analysis of cross-
references to be shared:
implicit_xrf(X) ==> -font(begin(small_caps)) : ... : -font(X).
df_txt ==> ... implicit_xrf(roman) : ... .
ex_txt ==> ... implicit_xrf(italic) ... .
id_txt ==> ... implicit_xrf(bold) ... .
5.4 Unpacking, duplication and movement of 
structures: node migration 
The whole range of phenomena requiring explicit 
manipulation of entry fragment trees is handled 
by the mechanisms for node raising, reordering, 
and deletion. Our analysis of implicit cross- 
references in LDOCE factors them out as sepa-
rate structural units participating in the make-up
of a word sense definition, as well as reconstructs 
a 'text image' of the definition text, with just the 
orthography of the cross-reference item 'spliced 
in' (see Figure 4). 
defn ==> def_segs(D_String) : def_string(D_String).
def_segs(Str_1) ==> def_nugget(Seg) :
    ( def_segs(Str_0) | Str_0 = "" ) :
    $concat(Seg, Str_0, Str_1).
def_nugget(Ptr) ==> %implicit_xrf(s(implicit_xrf,
    s(to, Ptr.nil).Rest)).
def_nugget(Seg) ==> -Seg : $stringp(Seg).
def_string(Def) ==> ++Def.
The rules build a definition string from any se- 
quence of substrings or lexical items used as 
cross-references: by invoking the appropriate
def_nugget rule, the simple segments are retained
only for splicing the complete definition text;
cross-reference pointers are extracted from the
structural representation of an implicit cross-
reference; and implicit_xrf nodes are propagated
up to a sister position to the def_string. The
string image is built incrementally (by string con-
catenation, as the individual def_nuggets are
parsed); ultimately the def_string rule simply
incorporates it into the structure for defn. De-
claring defn, def_string and implicit_xrf to be
strong non-terminals ultimately results in a clean
structure similar to the one illustrated in 
Figure 4. 
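The two outputs of this analysis, a reconstructed text image and a set of factored-out cross-references, can be sketched in Python. The token encoding (plain strings for text, `("implicit_xrf", target)` tuples for cross-references), the function name `build_defn` and the sample input are hypothetical.

```python
def build_defn(nuggets):
    """Build the definition text image and collect implicit cross-references.

    Plain segments contribute only to the text; cross-reference items
    contribute their orthography to the text and are also kept as
    separate nodes alongside the definition string.
    """
    text = ""
    xrefs = []
    for nugget in nuggets:
        if isinstance(nugget, tuple):      # ("implicit_xrf", target)
            xrefs.append(nugget[1])
            text += nugget[1]              # splice the orthography in
        else:
            text += nugget
    return {"def_string": text, "implicit_xrf": xrefs}

# Illustrative input: a definition with one small-caps cross-reference.
print(build_defn(["suffering from ", ("implicit_xrf", "autism")]))
```

The grammar builds the string right to left through recursion on def_segs; the loop above accumulates left to right but yields the same concatenation.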
Copying and lateral migration of common gender 
labels in CEG translations, exemplified by "title"
(Figure 3) and "abutment" (section 3.2), makes a different
use of the % operator. To capture the
leftward scope of gender labels, in contrast to
common (right-scoping) context labels, we create,
for each noun translation (tran), a gender node
with an empty value. The comma-delimited tran
nodes are collected by a recursive weak non-terminal trans rule.
trans ==> tran(G) : opt( -ca : trans(G) ).
tran(G) ==> ... word ... :
    opt( -%gender(G) ) : +%gender(G).
The (conditional) removal of gender in the sec-
ond rule followed by (obligatory) insertion of a
gender node captures the gender if present and
'digs a hole' for it if absent. Unification on the
last iteration of trans fills the holes.
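The hole-filling behaviour can be imitated without unification. In this Python sketch each translation is a `(word, gender)` pair with `None` marking a hole; the function name `fill_genders` and the pair representation are assumptions, and raising an error on conflicting genders stands in for a unification failure.

```python
def fill_genders(trans):
    """Give every translation in a comma-delimited list the shared gender.

    `trans` is a list of (word, gender) pairs where gender is None
    when no label followed the word.
    """
    genders = {g for _, g in trans if g is not None}
    if len(genders) > 1:
        raise ValueError("conflicting genders: unification would fail")
    shared = genders.pop() if genders else None
    return [(word, shared) for word, _ in trans]

# "Vampir, Blutsauger (old) m": the label on the last item scopes leftward.
print(fill_genders([("Vampir", None), ("Blutsauger", "m")]))
```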
Noun compound fragments, as in "abutment" 
can be copied and migrated forward or backward 
using the same mechanism. Since we have not
implemented the noun compound parsing mech-
anism required for identification of segments to
be copied, we have temporized by naming the
fragments needing partners alt_pfx or alt_sfx.
5.5 Conflated lexical entries: homograph 
unpacking 
We have implemented a mechanism to allow 
creation of additional entries out of a single one, 
for example from orthographic, dialect, or 
morphological variants of the original headword. 
Some CGE examples were given in sections 2 and 
5.3 above. To handle these, the rules build the 
second entry inside the main one and manufac- 
ture cross reference information for both main 
form and variant, in anticipation of the imple- 
mentation of a splitting mechanism. Examples 
of other types appear in both CGE and CEG: 
vampire [...] n (lit) Vampir, Blutsauger (old) m; (fig)
Vampir m. ~ bat Vampir, Blutsauger (old) m.
wader [...] n (a) (Orn) Watvogel m. (b) ~s pl (boots)
Watstiefel pl.
house in cpds Haus-; ~ arrest n Hausarrest m; ~boat
n Hausboot nt; ~bound adj ans Haus gefesselt; ....
house: ~-hunt vi auf Haussuche sein; they have started
~-hunting sie haben angefangen, nach einem Haus zu
suchen; ~-hunting n Haussuche f; ....
The conventions for morphological variants, used
heavily in e.g. LDOCE and Webster's Seventh,
are different and would require a different mech-
anism. We have not yet developed a generalized
rule mechanism for ordering any kind of split;
indeed we do not know if it is possible, given the
wide variation in seemingly ad hoc conventions
for 'sneaking in' logically separate entries into re-
lated headword definitions: the case of "lachrymal
gland" in 4.3 is just one instance of this phe-
nomenon; below we list some more conceptually
similar, but notationally different, examples, 
demonstrating the embedding of homographs in 
the variant, run-on, word-sense and example 
fields of LDOCE. 
daddy long.legs /,dædi 'lɒŋlegz/ also (fml) crane fly -- n
... a type of flying insect with long legs
ac.ri.mo.ny ... n bitterness, as of manner or language
-- -nious /,ækrɪ'məʊnɪəs/ adj: an acrimonious quarrel --
-niously adv
crash1 ... v ... 6 infml also gatecrash -- to join (a party)
without having been invited ...
folk et.y.mol.o.gy /,. .'.../ n the changing of strange or
foreign words so that they become like quite common
ones: some people say sparrowgrass instead of
ASPARAGUS: that is an example of folk etymology
5.6 Notational promiscuity: selective 
tokenization 
Often distinctly different data items appear con- 
tiguous in the same font: the grammar codes of 
LDOCE (section 2) are just one example. Such 
run-together segments clearly need their own 
tokenization rules, which can only be applied 
when they are located during parsing. Thus, 
commas and parentheses take on special meaning 
in the string "X(to be)1,7", indicating, respec-
tively, elision of data and optionality of phrase.
This is a different interpretation from e.g. alter-
nation (consider the meaning of "adj, noun") or
the enclosing of italic labels in parentheses (Fig-
ure 3). Submission of a string token to further
tokenization is best done by invoking a special
purpose pattern matching module; thus we avoid 
global (and blind) tokenization on common (and 
ambiguous) characters such as punctuation 
marks. The functionality required for selective 
tokenization is provided by a parse primitive;
below we demonstrate the construction of a list
of sister syncat nodes from a segment like "n,
v, adj", repetitively invoking $parse to break a
string into two substrings separated by a comma: 
syncats ==> -Seg : $stringp(Seg) :
    $parse(Hd.",".Rest.nil, Seg) :
    insl(Hd.Rest.nil) :
    syncat : opt(syncats).
syncat ==> +Seg : $isa(Seg, partofspeech).
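Selective tokenization of a run-together segment can be sketched in Python as repeated splitting at the first comma, applied only once the parser knows it is looking at a part-of-speech field. The function name `syncats`, the tuple node format and the `PARTS_OF_SPEECH` set are hypothetical stand-ins for the grammar's own primitives.

```python
# Hypothetical closed class used to validate each piece.
PARTS_OF_SPEECH = {"n", "v", "adj", "adv"}

def syncats(seg):
    """Break a segment like "n, v, adj" into a list of syncat nodes.

    Each call splits off the head before the first comma and recurses
    on the remainder, so the comma is only given meaning here.
    """
    head, comma, rest = seg.partition(",")
    head = head.strip()
    if head not in PARTS_OF_SPEECH:
        raise ValueError("not a part of speech: %r" % head)
    nodes = [("syncat", head)]
    if comma:
        nodes += syncats(rest)
    return nodes

print(syncats("n, v, adj"))
```

Because the split happens inside this one function, a comma elsewhere in the entry (for example between translations) is left untouched.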
5.7 Parsing failures: junk collection 
The systematic irregularity of dictionary data (see 
section 3.3) is only one problem when parsing 
dictionary entries. Parsing failures in general are 
common during grammar development; more
specifically, they might arise due to the format of
an entry segment being beyond (easy) capturing 
within the grammar formalism, or requiring non- 
trivial external functionality (such as compound 
word parsing or noun/verb phrase analysis). 
Typically, external procedures operate on a newly
constructed string token which represents a
'packed' unruly token list. Alternatively, if no
format need be assigned to the input, the gram-
mar should be able to 'skip over' the tokens in the list,
collecting them under a 'junk' node. 
If data loss is not an issue for a specific applica- 
tion, there is no need even to collect tokens from 
irregular token lists; a simple rule to skip over 
USAGE fields might be written as
usage ==> -usage_mark : use_field.
use_field ==> -U_Token : $not(end_ufield) :
    opt(use_field).
(Rules like these, building no structure, are espe-
cially convenient when extensive reorganization
of the token list is required -- typically in cases
of grammar-driven token reordering or token de- 
letion without token consumption.) 
In order to achieve skipping over unparseable in- 
put without data loss, we have implemented a 
collective rule class. The structure built by such
rules is the (transitive) concatenation of all the
character strings in daughter segments. Coping
with gross irregularities is achieved by picking up
any number of tokens and 'packing' them to-
gether. This strategy is illustrated by a grammar
for phrases conjoined with italic 'or' in example
sentences and/or their translations (see Figure 3).
The italic conjunction is surrounded by slashes in
the resulting collected string as an audit trail. The
extra argument to conj enforces, following the
strategy outlined in section 5.3, rule application
only in the correct font context.
strong_nonterminals(source . targ . nil).
collectives(conj . nil).
source ==> conj(bold).
targ ==> conj(roman).
conj(X) ==> -font(X) : +Seg : -font(italic) :
    ++" /" : +"or" : ++"/ " :
    -font(X) : +Seg.
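The packing performed by a collective rule can be illustrated in Python. The sketch assumes a six-token window with font changes encoded as `("font", name)` tuples; the function name `collect_conj`, the token encoding and the sample phrases are hypothetical, and the exact spacing around the slashes is a guess.

```python
def collect_conj(tokens, font):
    """Pack 'X or Y' (with an italic 'or') into one audited string.

    Expects: font(F), segment, font(italic), "or", font(F), segment.
    Returns None when the tokens do not fit that pattern or the
    surrounding font is not the one required by the caller.
    """
    if len(tokens) != 6:
        return None
    f1, seg1, f2, conj, f3, seg2 = tokens
    expected = [("font", font), ("font", "italic"), ("font", font)]
    if [f1, f2, f3] != expected or conj != "or":
        return None
    # Slashes around the conjunction record that it was italic in print.
    return seg1 + " /" + conj + "/ " + seg2

tokens = [("font", "bold"), "to take", ("font", "italic"), "or",
          ("font", "bold"), "to have a seat"]
print(collect_conj(tokens, "bold"))
```

Passing the font as an argument mirrors the way source and targ invoke conj with bold and roman respectively.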
Finally, for the most complex cases of truly ir- 
regular input, a mechanism exists for constraining 
junk collection to operate only as a last resort and
only at the point at which parsing can go no fur- 
ther. 
5.8 Augmenting the power of the formalism: 
escape to Prolog 
Several of the mechanisms described above, such 
as contextual control of token consumption (sec-
tion 5.1), explicit structure handling (5.4), or se-
lective tokenization (5.6), are implemented as
separate Prolog modules. Invoking such external
functionality from the grammar rules allows the
natural integration of the form- and content- 
recovery procedures into the top-down process 
of dictionary entry analysis. The utility of this 
device should be clear from the examples so far. 
Such escape to the underlying implementation 
language goes against the grain of recent devel-
opments of declarative grammar formalisms (the
procedural ramifications of, for instance, being 
able to call arbitrary LISP functions from the arcs 
of an ATN grammar have been discussed at 
length: see, for instance, the opening chapters in 
Whitelock et al., 1987). However, we feel justi-
fied in augmenting the formalism in such a way,
as we are dealing with input which is different in
nature from, and on occasions possibly more 
complex than, straight natural language. Unho- 
mogeneous mixtures of heavily formal notations 
and annotations in totally free format, inter- 
spersed with (occasionally incomplete) fragments 
of natural language phrases, can easily defeat any 
attempts at 'clean' parsing. Since the DEP sys-
tem is designed to deal with an open-ended set
of dictionaries, it must be able to confront a sim-
ilarly open-ended set of notational conventions
and abbreviatory devices. Furthermore, dealing
in full with some of these notations requires ac- 
cess to mechanisms and theories well beyond the 
power of any grammar formalism: consider, for
instance, what is involved in analyzing pronun-
ciation fields in a dictionary, where alternative
pronunciation patterns are marked only for
syllable(s) which differ from the primary pronun-
ciation (as in arch.bish.op: /,ɑːtʃ'bɪʃəp || ,ɑr-/);
where the pronunciation string itself is not
marked for syllable structure; and where the as-
signment of syllable boundaries is far from trivial
(as in fas.cist: /'fæʃɪst/)!
6. CURRENT STATUS 
The run-time environment of DEP includes grammar debugging utilities, and a number of
options. All facilities have been implemented, except where noted. We have very detailed
grammars for CGE (parsing 98% of the entries),
CEG (95%), and LDOCE (93%); less detailed grammars for Webster's Seventh (98%), and both
halves of the Collins French Dictionary (approxi-
mately 90%). 
The Dictionary Entry Parser is an integral part
of a larger system designed to recover dictionary structure to an arbitrary depth of detail, convert
the resulting trees into LDB records, and make
the data available to end users via a flexible and
powerful lexical query language (LQL). Indeed, 
we have built LDB's for all dictionaries we have 
parsed; further development of LQL and the ex- 
ploitation of the LDB's via query for a number 
of lexical studies are separate projects. 
Finally, we note that, in the light of recent efforts 
to develop an interchange standard for (English 
mono-lingual) dictionaries (Amsler and Tompa, 
1988), DEP acquires additional relevance, since 
it can be used, given a suitable annotation of the 
grammar rules for the machine-readable source, 
to transduce a typesetting tape into an inter- 
changeable dictionary source, available to a larger 
user community.
ACKNOWLEDGEMENTS
We would like to thank Roy Byrd, Judith Klavans and Beth Levin for many discussions 
concerning the Dictionary Entry Parser system in general, and this paper in particular. Any re- 
maining errors are ours, and ours only. 
REFERENCES 
Ahlswede, T, M Evens, K Rossi and J Markowitz (1986) "Building a Lexical Database by Parsing Webster's Seventh New Collegiate Dictionary", Advances in Lexicology, Second Annual Conference of the UW Centre for the New Oxford English Dictionary, 65-78.
Amsler, R and F Tompa (1988) "An SGML-Based Standard for English Monolingual Dictionaries", Information in Text, Fourth Annual Conference of the UW Centre for the New Oxford English Dictionary, 61-79.
Boguraev, B, and E Briscoe (Eds) (1989) Computational Lexicography for Natural Language Processing, Longman, Harlow.
Byrd, R, N Calzolari, M Chodorow, J Klavans, M Neff and O Rizk (1987) "Tools and Methods for Computational Lexicology", Computational Linguistics, vol. 13(3-4), 219-240.
Calzolari, N and E Picchi (1986) "A Project for a Bilingual Lexical Database System", Advances in Lexicology, Second Annual Conference of the UW Centre for the New Oxford English Dictionary, 79-92.
Calzolari, N and E Picchi (1988) "Acquisition of Semantic Information from an On-Line Dictionary", Proceedings of the 12th International Conference on Computational Linguistics, 87-92.
Collins (1980) Collins German Dictionary: German-English, English-German, Collins Publishers, Glasgow.
Dadam, P, K Kuespert, F Andersen, H Blanken, R Erbe, J Guenauer, V Lum, P Pistor and G Walsh (1986) "A DBMS Prototype to Support Extended NF2 Relations: An Integrated View on Flat Tables and Hierarchies", Proceedings of ACM SIGMOD'86: International Conference on Management of Data, 356-367.
Flickinger, D, C Pollard, T Wasow (1985) "Structure Sharing in Lexical Representation", Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, 262-267.
Fox, E, T Nutter, T Ahlswede, M Evens and J Markowitz (1988) "Building a Large Thesaurus for Information Retrieval", Proceedings of the Second Conference on Applied Natural Language Processing, 101-108.
Garzanti (1984) Il Nuovo Dizionario Italiano Garzanti, Garzanti, Milano.
Kazman, R (1986) "Structuring the Text of the Oxford English Dictionary through Finite State Transduction", University of Waterloo Technical Report No. TR-86-20.
Longman (1978) Longman Dictionary of Contemporary English, Longman Group, London.
McCord, M (1987) "Natural Language Processing and Prolog", in A Walker, M McCord, J Sowa and W Wilson (Eds) Knowledge Systems and Prolog, Addison-Wesley, Waltham, Massachusetts, 291-402.
Nakamura, J and M Nagao (1988) "Extraction of Semantic Information from an Ordinary English Dictionary and Its Evaluation", Proceedings of the 12th International Conference on Computational Linguistics, 459-464.
Neff, M, R Byrd and O Rizk (1988) "Creating and Querying Hierarchical Lexical Data Bases", Proceedings of the Second Conference on Applied Natural Language Processing, 84-93.
van der Steen, G J (1982) "A Treatment of Queries in Large Text Corpora", in S Johansson (Ed) Computer Corpora in English Language Research, Norwegian Computing Centre for the Humanities, Bergen, 49-63.
Tompa, F (1986) "Database Design for a Dictionary of the Future", University of Waterloo, unpublished.
W7 (1967) Webster's Seventh New Collegiate Dictionary, G.&C. Merriam Company, Springfield, Massachusetts.
Whitelock, P, M Wood, H Somers, R Johnson and P Bennett (Eds) (1987) Linguistic Theory and Computer Applications, Academic Press, New York.
