Towards Developing Reusable NLP Dictionaries 
Pim van der Eijk and Laura Bloksma and Mark van der Kraan 
Research Institute for Language and Speech
Foundation for Language Technology
State University of Utrecht 
The Netherlands 
vandereijk@let.ruu.nl
Abstract 
Development of reusable dictionaries for NLP
applications requires a carefully designed lexicological
framework, a lexical acquisition strategy, an integrated
development toolbox, and facilities to generate
dictionaries for client applications. This paper presents
results of the LEXIC project¹, which was set up to prepare
the development of large multilingual lexical resources.
Keywords: lexicons, tools, large-scale resources, typed
feature structures.
1 Introduction 
1.1 Common Linguistic Resources 
A large share of the investment in the development
of any NLP application is spent on the construction of
what one might call "large databases of lexical and
grammatical resources". These resources could in principle be
useful for many applications, although they hardly ever
are: due to the lack of agreement on the definition of
basic notions and of consensus on the analysis of linguistic
phenomena they are often linked too closely to specific
applications. Moreover, given the generally limited size
and duration of NLP projects, both quantity and quality
of such project-specific databases are disappointing.
In this paper we will discuss results from the LEXIC
project, a feasibility study preparing large-scale
development of a reusable lexical database, started by a
consortium of industrial and university partners. The lexical
database is designed to consist of an integrated package
of two monolingual dictionaries for Dutch and Spanish
and the bilingual dictionaries relating these languages.
The consortium comprised a dictionary publisher as well
as NLP application developers, giving it the unique
opportunity of confronting the large body of experience,
infrastructure and existing data of publishers with the
requirements of a new class of professional users.
¹ The Lexic project was financed and supported by the
three project partners: Philips Research, developing the
Rosetta machine translation system, the Foundation for
Language Technology, participating in the Eurotra project, and
Van Dale, one of the main dictionary publishers in the
Netherlands, as well as by the European Commission,
and the Dutch ministries of Education and Economic
Affairs. Details of the project are discussed in [van der Eijk et
al., 1991].
The authors want to thank Anne van Bolhuis, Joy Herklotz,
Jeroen Fokker and Tim Dumas for their contribution to the
activities discussed in this paper.
Another interesting aspect of the project was that
it addressed the whole spectrum of issues in lexical
database development, from lexical acquisition to serving
heterogeneous client applications. In the current absence
of any standard for the (grammatical) content of
the dictionary (e.g. standardized sets of grammatical
features) the reusability of a dictionary can only be
evaluated in terms of usability for some target applications.
1.2 Structure of the paper 
Section 2 discusses the issue of acquisition of lexical data. 
Section 3 introduces the implementation formalism and 
tools. The lexicon architecture is discussed in section 4. 
Conversion of data to client applications of the database 
is discussed in section 5. 
2 Acquisition 
2.1 Strategies 
There are three potentially useful strategies to develop
large lexical resources, which are not in principle
mutually exclusive.
MRDs The extraction of data from machine-readable
dictionaries has received much attention in the past
decade. In our view the usefulness of existing material
for NLP applications has been somewhat overestimated.
Traditional dictionaries are oriented towards a
market of human consumers, who consult the dictionary
for entirely different reasons than NLP applications.
For instance, most of the information in NLP dictionaries
is concerned with the grammatical description of
words, which in many dictionaries is only rudimentarily
available².
Actes de COLING-92, Nantes, 23-28 août 1992    53    Proc. of COLING-92, Nantes, Aug. 23-28, 1992
Furthermore, given that humans can use their
intelligence and knowledge of the language(s), much
information is only present in unformalized definitions and
examples. As discussed in e.g. [McNaught, 1988], it is
often feasible to extract (relatively) formalized
information, but the cost-effectiveness of automatic extraction
of information from less formalized data is highly
questionable.
From this discussion it follows that MRDs alone cannot
be the source for NLP dictionaries. In section 2.2 we
will discuss in more detail the evaluation of the potential
sources of data for our specific purposes.
Corpora Automatic extraction of lexical features by
applying various pattern recognition techniques to large
bodies of text has received some attention recently (cf.
e.g. [Zernik and Jacobs, 1990]). However, the
information needed for our applications cannot be extracted
from corpora yet, although important improvements can
be expected in the following years.
Lexicography Given the present inadequacy of
MRDs and corpus-related tools, manual labour is
indispensable for lexicon development. The tools described
in section 3 have been developed as a 'workbench' to
support these lexicographical activities. We will show
that this workbench allows for easy integration of information
extracted from MRDs with lexicographic editing.
2.2 Sources 
Evaluation Measure It is difficult to assess the
"reusability" of existing data without an evaluation
measure, i.e. without knowing for what purpose the data
should be usable. This is especially difficult in the case
of grammatical features. We developed a lexicon fragment
(implemented as a TFS type hierarchy, cf. section
3) defining the classification scheme for the monolingual
dictionaries. This fragment is inspired by HPSG and
GB, and incorporates many of the (innovative)
distinctions developed by the client applications EUROTRA and
ROSETTA. It is, however, much more lexicalist than these
systems.
Eventually, all lexical entries in the two languages 
should be described using this scheme, so that they can 
be readily converted to client applications. The data 
that can be extracted from a potential source has been 
interpreted with respect to this classification scheme to 
assess the amount of information contained in it. 
Data Analysis The machine-readable sources we
considered are the existing Van Dale Dutch monolingual
and bilingual Dutch-Spanish machine-readable
dictionaries and the CELEX lexical database. From our
evaluation it followed that existing MRDs for Dutch (as for
almost all other languages) contain only a small part of
the information needed by NLP applications.
² Well-structured dictionaries like [Longman, 1987] are an
important exception to this, cf. [Boguraev and Briscoe,
1989].
Fortunately, the CELEX lexical database has enriched 
a selection of 30000 entries of the "Van Dale Dictionary 
of Contemporary Dutch" with grammatical information, 
taking into account the requirements of a number of 
(prototype) NLP applications under development in the 
Netherlands. A large amount of information needed for 
our target applications can be converted automatically 
from this database. The entries, stored in a relational 
database, can be imported into the Dutch lexicon using 
the TFS constraint solver similarly to the conversion to 
client applications (see section 5). The CELEX dictionary
has historic links to the Van Dale dictionaries (especially
with respect to reading distinction), which greatly
simplifies integration of these sources.
With respect to translation information we found that
the "raw" translational data could be extracted easily
from the Van Dale bilingual dictionaries. The original
Van Dale concept is especially interesting for
multilingual applications, as the Dutch part is the same (at least
in principle) in all bilingual dictionaries with Dutch as
source language (cf. [van Sterkenburg et al., 1982]).
Information about phrasal translation, such as the
choice of the support verb of a noun in the target
language, is unfortunately hidden in unrestricted
text (example sentences etc.), from which it is difficult
to extract. Phrasal information also suffers greatly from
incompleteness.
3 The TFS Formalism 
Before discussing the proposed lexicon architecture we
will introduce the computational framework in which it
has been formalized and implemented, the formalism of
typed feature structures.
Currently the family of unification-based formalisms is
an emerging standard as the implementation formalism
of natural language processing systems. A variant called
typed feature structures, discussed a.o. in [Carpenter,
1990], [Emele and Zajac, 1990] and [Zajac, 1990], has
been adopted in a number of European lexicon projects,
including ACQUILEX, EUROTRA-7 and MULTILEX. In
the course of our project, a TFS database, user interface
and a constraint solver have been implemented.
TFS is an excellent formalism for computational
lexicons, as it enables a definition of types, or classes, of
linguistic objects, arranged in a multiple inheritance
hierarchy. Types are associated with an appropriateness
specification defining their features and the types of
those features, and with (possibly disjunctive and
complex) constraints. The object-oriented character of the
system allows for minimization of redundancy, whereas
the type system maximizes integrity of data.
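As an illustration only, the following minimal Python sketch shows how an inheritance hierarchy with an appropriateness specification can rule out ill-typed entries. It is not the project's TFS implementation (which was built on Sicstus Prolog); the type names, feature names and value types below are assumptions chosen to resemble those used elsewhere in this paper.

```python
# Hypothetical toy hierarchy: each type lists its direct supertypes.
PARENTS = {
    "ENTRY": [],
    "VERB": ["ENTRY"],
    "INTRANSITIVE_VERB": ["VERB"],
    "DATIVE_VERB": ["VERB"],
}

# Appropriateness: the features each type introduces and the type of their value.
# All feature names and value types here are assumptions for illustration.
FEATURES = {
    "ENTRY": {"canonical_form": "STRING", "semantics": "DESIGNATION"},
    "VERB": {"arg1": "ARGUMENT"},
    "DATIVE_VERB": {"arg2": "ARGUMENT", "arg3": "ARGUMENT"},
}


def supertypes(type_name):
    """Return the type itself plus all its ancestors (several parents allowed)."""
    seen, stack = [], [type_name]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(PARENTS[current])
    return seen


def appropriate_features(type_name):
    """Collect the features a type inherits from all of its supertypes."""
    features = {}
    for ancestor in supertypes(type_name):
        features.update(FEATURES.get(ancestor, {}))
    return features


def is_well_typed(type_name, feature_structure):
    """An entry is well typed if it only uses features appropriate for its type."""
    return set(feature_structure) <= set(appropriate_features(type_name))
```

Features are stated once, at the most general type, and inherited downwards, which is the sense in which redundancy is minimized; the check in `is_well_typed` is a much simplified stand-in for the integrity guarantee of the type system.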
Three TFS-based tools have been developed:
• a tool for interactive definition³, entry and
modification of data (cf. section 3.1).
• a TFS database which can be accessed from the user
interface and the constraint solver.
• a TFS-compiler for data manipulation, e.g.
selections and conversion.
³ The TFS-editor can be used to interactively define a type
hierarchy, as such a hierarchy can itself be viewed as a typed
feature structure, cf. [Fokker, 1992].
The TFS-compiler is similar to the systems described
by [Carpenter, 1990], [Emele and Zajac, 1990], and
[Franz, 1990], and like these it constitutes a
general-purpose constraint-based formalism which can be used
for a wide variety of tasks, including parsing,
translation and generation. Our prototype is implemented on
top of Sicstus Prolog, and is used primarily for selection
and conversion of data. It offers a number of tracing
and debugging facilities to assist in the design of
type hierarchies and during query-evaluation.
These three tools can import and export data in a
special-purpose text format, which is useful for
interchange and further processing. The acquisition tools for
the Van Dale dictionaries and CELEX can also generate
their output in this format.
3.1 User Interface 
The hierarchical definition of the grammatical types in
TFS corresponds closely to a "decision tree" which the
lexicographer traverses while editing a lemma. A
graphical user interface has been developed by the computer
science department of the State University of Utrecht
([Fokker, 1992]) which allows the user to narrow down
the main type of the lemma (s)he is editing to a specific
subtype and to subsequently edit the associated feature
structure. For example, a lemma is refined from ENTRY
to VERB to DATIVE_VERB, then constraints for this type
are retrieved and the features and their substructures
can be edited recursively.
Of course, only appropriate features are presented and
can be edited, e.g. it is impossible to edit a feature arg3
of an intransitive verb. While editing the value of a
feature the editor creates a subwindow already positioned
at the minimal type of this feature. E.g. while editing
a verb, the feature semantics will already be positioned
at the type EVENT, as this is the minimal type of this
feature for verbs.
The editor includes a useful help facility which can be
viewed as an on-line instruction manual: a help function
exists for each choice point which describes a number of
criteria and examples to help make the decision.
It will now be clear how lexicographic work using the
decision tree model relates to importation of lexical data
from existing sources, such as MRDs. These can be
converted to partially edited lexical entries, so that the
lexicographer doesn't have to start at the 'root' level (e.g.
the choice point ENTRY in the example), but at an
intermediate level (e.g. VERB). Further choices lead to more
refined descriptions of the word. Like all errors, errors in
the source dictionary can be corrected by moving back
to a higher-level choice point in the hierarchy.
Completed entries, and also arbitrary substructures,
can be named and stored in a database for future use as
shared (sub)structures in other entries. Useful
applications of this cross-reference mechanism are in
morphology and for the implementation of synonymy (see 4.2).
Compounds can be assigned a feature tree with features
left_daughter and right_daughter, whose values are
pointers in the database to their constituting parts.
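The cross-reference mechanism can be pictured with a small Python sketch. It assumes, purely for illustration, that the database is a dictionary from names to feature structures and that a pointer is simply the name of a stored structure; the example compound fietspad ("bicycle path") and the type labels are our own choices, not data from the project.

```python
# Hypothetical database of named feature structures (names act as pointers).
database = {
    "fiets": {"type": "NOUN", "canonical_form": "fiets"},
    "pad": {"type": "NOUN", "canonical_form": "pad"},
    "fietspad": {
        "type": "COMPOUND",
        "canonical_form": "fietspad",
        "left_daughter": "fiets",   # pointer to a stored entry
        "right_daughter": "pad",    # pointer to a stored entry
    },
}

POINTER_FEATURES = ("left_daughter", "right_daughter")


def resolve(name, db):
    """Return a copy of the named entry with its pointers replaced by
    the (recursively resolved) structures they refer to."""
    entry = dict(db[name])
    for feature in POINTER_FEATURES:
        if feature in entry:
            entry[feature] = resolve(entry[feature], db)
    return entry
```

Because the compound stores only names, a correction to the entry for fiets is visible in every compound that points to it; `resolve` expands the pointers on demand without modifying the stored entries.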
The editor has been implemented in C using the
Microsoft Windows 3.0 graphical interface. The program is
designed to be easily portable, e.g. to X windows. The
underlying database can be shared via a LAN. Like the
other tools, the database allows for import and export
of feature structures in the interchange format.
The editor is designed specifically for the TFS
formalism. However, it can be used for any specific type
hierarchy, as the type hierarchy is simply
defined in a separate text file which is read by the
program during start-up. Hence, it is potentially interesting
for the development of many other (NLP) dictionaries.
An interesting elaboration of the editor would be to
add extra functionality for the lexicographer besides
editing and viewing feature structures, such as facilities
to consult various on-line dictionaries or text corpora.
4 Dictionary organization 
Having introduced the computational framework we will
proceed with the discussion of the organization of the
dictionary⁴. The emphasis has been on two types of
modularity:
1. Modularity of dictionaries and thesaurus.
The general approach is to define clearly a number
of abstraction levels (cf. section 4.1) in order
to achieve easy connectability of the monolingual
dictionaries via bilingual dictionaries. By generalizing
bilingual translation to bilingual synonymy (or
equivalence, cf. section 4.2) we can even separate
semantic descriptions ("concepts") from the elements
in which they are realized in languages. We will
show how such conceptual dictionaries can be
generated from bilingual dictionaries (4.3).
2. Modularity of grammatical description (cf. section
5).
With respect to the linguistic content of the
monolingual dictionaries (i.e. the grammatical
description) we will discuss the use of typed feature
structure constraints expressing relations between
grammatical descriptions in various linguistic theories.
This allows for a very flexible relation between
various grammatical descriptions.
4.1 The monolingual dictionary
Word forms in a language, as found in text corpora,
are associated with canonical forms according to
lexicological conventions. In particular contexts they are
associated with exactly one of a fixed finite number of
designations⁵. In [Zgusta, 1971], two other
"components" of meaning are distinguished besides designation,
viz. connotation and range of application. Our
(somewhat poor) working definition of synonymy is a relation
between readings sharing designation only, both within a
language and across languages (where it is traditionally
called equivalence).
⁴ This is a condensed summary of [van der Eijk, 1992a].
⁵ Note that we adopt the approach of discrete readings, cf.
[ten Hacken, 1990].
The relation between word forms and canonical forms
is many-to-many: orthographic variants are mapped
onto a single canonical form, and a single word form
can be related to several lexical entries via inflectional
rules⁶. The monolingual dictionary is a set of lexical
entries, which are pairings of canonical word forms of a
language and their designations, and in addition describe
their grammatical properties.
As a result, a lexical entry should minimally have the
two features canonical_form and semantics. The former
feature has the simple type STRING; the latter, the
description of the designation, has a complex value,
possibly including semantic features, but minimally
containing an identifying feature⁷, as we want to make sure it
will always be possible to interconnect the monolingual
dictionaries via bilingual dictionaries. Apart from these
two features, there will be other features for the
description of the grammatical properties of the word.
The combination of canonical_form and grammatical
description should allow for the complete and correct
generation of all word forms and their associated feature
structures. As our intended client applications have front
ends for this purpose, the database was not designed to
be a full form dictionary; this could change, depending
on the needs of future client applications.
The set of designations can be viewed as a thesaurus or
"knowledge base"; the lexical entries are "pointers" from
words into this knowledge base, and can be implemented
as such in TFS.
The relation between canonical word forms and
designations is also many-to-many, due to synonymy (several
word forms related to the same designation) and
lexical ambiguity (one word form related to several
designations). In addition to this there will be alternations in
the description because of alternative grammatical
patterns. These alternations are implemented as TFS
disjunctions.
4.2 The bilingual dictionary 
Bilingual dictionaries can be viewed as a relation
between words in two languages. The levels "word form",
"lexical entry" and "reading" correspond to various
degrees of granularity in bilingual dictionaries. Ideally, the
bilingual dictionary relates lexical items between
languages at the level of readings, though in practice most
existing dictionaries refer to canonical forms or even
to word forms in the target language. Furthermore,
the source language side in bilingual dictionaries
usually refers to readings different from the monolingually
motivated ones, because they are tuned to the target
language: two readings are not distinguished if they
translate to the same word, or an additional reading is created
for an additional translation. An exception is the
original concept of the bilingual Van Dale dictionaries, where
the source language reading structure of the bilingual
dictionaries is based directly on the monolingual reading
structure (cf. [van Sterkenburg et al., 1982]).
⁶ E.g. the Dutch word form bekend is associated with the
adjective bekend (meaning well-known) and (by participle
formation) with the verb bekennen (to confess).
⁷ The name of stored semantic substructures in the TFS
database serves this purpose.
An interesting approach to the bilingual dictionary
would be to view it as describing pairings of bilingual
synonyms. The advantages of this would be that
1. the dictionary supports preservation of meaning in
translation.
2. formal properties of equivalence relations (e.g.
transitive closure) can be exploited to automatically
expand the dictionary.
3. coding efforts can be reduced: the definition of the
designation can be shared between monolingual and
bilingual synonyms.
The main difference between traditional dictionaries
and our approach is therefore that the indirect
translational description of bilingual synonymy is replaced by a
direct relation between lexical entries in the monolingual
dictionaries to an independent "knowledge base" of
synonym clusters. This approach is common in e.g.
multilingual terminology (cf. [Picht and Draskau, 1985]), but
less common in lexicology.
We will show that the two representations can be 
translated into each other. Section 4.3 describes how a 
knowledge base is generated from monolingual and bilin- 
gual dictionaries. A bilingual dictionary can be gener- 
ated automatically from a set of monolingual dictionaries 
and a knowledge base by enumerating the pairs of lexical
entries in two monolingual dictionaries pointing to the 
same synonym cluster. 
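The enumeration just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each monolingual dictionary is reduced to a mapping from a reading to the identifier of its synonym cluster, and the cluster identifiers and the small word lists are invented for the example.

```python
# Hypothetical monolingual dictionaries: reading -> identifier of its synonym cluster.
dutch = {"eerbetoon_0.1": "c1", "eerbewijs_0.1": "c1", "fiets_0.1": "c2"}
spanish = {"homenaje": "c1", "tributo": "c1", "bicicleta": "c2"}


def bilingual(source, target):
    """Enumerate all (source reading, target reading) pairs that point
    to the same synonym cluster."""
    by_cluster = {}
    for reading, cluster in target.items():
        by_cluster.setdefault(cluster, []).append(reading)
    return sorted(
        (s, t)
        for s, cluster in source.items()
        for t in by_cluster.get(cluster, [])
    )
```

Calling `bilingual(dutch, spanish)` yields a Dutch-Spanish dictionary and `bilingual(spanish, dutch)` the reverse one, since nothing in the cluster representation privileges either language.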
4.3 Generating Synonym Clusters
Existing machine-readable bilingual dictionaries⁸ can be
converted to a representation based on bilingual
synonymy, by "extracting" the underlying concepts. The
process consists of the following steps:
First, the dictionaries are parsed and transformed to
a table synonym of the relation between a reading R1 in
a language L1 and a reading R2 in L2. Two versions of
this program have been developed and tested: one for the
Van Dale Dutch-Spanish dictionary and one for bilingual
entries in the EUROTRA transfer rule format. A version
for dictionaries in a standard interchange format would
be a possible future extension.
Second, reflexive, symmetric, and transitive closure is
applied to the synonym/4 relation⁹. For each reading the
generated synonym cluster can be viewed. E.g.
according to the Van Dale Dutch-Spanish dictionary, reading
0.1 of Dutch eerbetoon (English (mark of) honour) has
one synonymous reading in Dutch and three synonyms
in Spanish.
eerbetoon_0.1:
es: { homenaje honores tributo }.
nl: { eerbetoon_0.1 eerbewijs_0.1 }.
⁸ Actually, there is no restriction to a bilingual dictionary:
several bi- or multilingual dictionaries, and even monolingual
dictionaries of synonyms, can be processed similarly,
resulting in a multilingual dictionary. This has been checked using
several Eurotra transfer dictionaries.
⁹ This program was first implemented in Prolog for the
Ndict system ([Bloksma et al., 1990]) and modified for a
Eurotra research group on "Reversible Transfer".
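The closure step can be sketched with a union-find structure in Python; the original program was written in Prolog, so what follows is a reconstruction of the idea and not the authors' code. Readings are represented as (language, reading) tuples, and the individual translation pairs listed are assumptions chosen so that the result resembles the eerbetoon cluster.

```python
def clusters(pairs):
    """Group readings into equivalence classes: the reflexive, symmetric
    and transitive closure of the given synonym pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)          # reflexivity: every reading is its own class
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)        # merging is symmetric and transitive

    groups = {}
    for reading in list(parent):
        groups.setdefault(find(reading), set()).add(reading)
    return list(groups.values())


# Hypothetical translation pairs extracted from a bilingual dictionary.
pairs = [
    (("nl", "eerbetoon_0.1"), ("es", "homenaje")),
    (("nl", "eerbetoon_0.1"), ("es", "honores")),
    (("nl", "eerbewijs_0.1"), ("es", "homenaje")),
    (("nl", "eerbewijs_0.1"), ("es", "tributo")),
]
result = clusters(pairs)
```

Here eerbewijs ends up in the same cluster as honores although no pair relates them directly; this is exactly the automatic expansion that the closure provides, and also the mechanism by which unrelated readings get merged when the target side lacks reading distinctions.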
The current implementation is not yet fully satisfying.
Because there is no reading distinction on the Spanish
side in the Van Dale N-S (only the Dutch words in the
example are marked with a reading number, e.g. 0.1),
some clusters will get mixed up¹⁰. E.g. Spanish fresco
as adjective means fresh and as noun fresco, though the
program will currently not make this distinction.
fresco_0.1:
es: { fresco limpio refresco }.
nl: { fresco_0.1 fris_0.1 }.
The program could of course be modified to use
the grammatical information about the target word in
the dictionary as reading distinguisher; the noun fresco
would then never be confused with the adjective. This
is undesirable in principle, however, as we do not want
syntactic criteria to guide reading distinction. For
instance, many adjectives in Romance languages have
homophonous nominal counterparts, with identical
morphology and semantics. We don't want to be forced a
priori to distinguish separate readings for these two cases.
Furthermore, well-known examples of category shift in
translation (e.g. adverbs translating to verbs etc.) show
it is impossible to attach a unique syntactic category to
an equivalence class.
These presentations of synonym clusters can be very
helpful to interactively improve transfer dictionaries:
errors of this type can easily be detected by native speakers
of the languages (who need not know the other language)
and corrected by creating appropriate reading
distinctions in Spanish.
We checked the quality of the synonym clusters
generated from both Van Dale and a EUROTRA
Spanish-Dutch dictionary. The Eurotra dictionary, where both
source and target language items are referred to at the
reading level, was converted to over 2187 clusters, 315 of
which contained more than one Spanish reading. Native
speakers agreed with more than 95% of these synonym
sets generated via the bilingual closure step. The
interpretation of bilingual translation as synonymy is
therefore correct in the vast majority of cases.
However, exceptions exist, such as the translation of
the Spanish reloj, which, even though a true (and
infrequent) Dutch synonym exists (viz. uurwerk (cf. English
timepiece)), more commonly translates to one of its
hyponyms horloge (Eng. watch) or klok (Eng. clock).
An interesting elaboration of our approach would be
to extend the knowledge base by ordering the synonym
clusters themselves via hyponymy¹¹ (cf. [Cruse, 1986],
[Lyons, 1977]). Client applications could then extract
translational data based not only on synonymy but also
on hyp(er)onymy. However, this is a difficult area, where
no obvious solutions exist. It is not clear at all which
translation solution automatic translators should select
in cases like this anyway.
¹⁰ The problem of connecting word forms to their readings
has been called the mapping problem. Cf. [Byrd et al., 1987]
for discussion of a method to map word forms to readings by
comparing a.o. semantic features like human of the source
reading and potential target readings.
¹¹ This idea is similar to Wordnet, a collection of synonym
sets linked via a variety of lexical relations ([Beckwith et al.,
1989]). Our approach extends this idea by adding a
multilingual dimension. Wordnet's synonym sets are also related by
relations with less clear translational consequences.
After this correction process the synonym clusters can
be converted to TFS format and stored in the database.
The associated monolingual dictionaries are then
modified automatically by adding cross-reference information
(via the feature semantics) from the lexical entries to the
synonym clusters they are associated with.
4.4 Creating a knowledge base¹²
Synonym clusters really become descriptions of
designations once semantic information is added to the
synonym clusters, which is then, in a truly interlingual way,
shared between synonyms. Much semantic information
from the CELEX Dutch dictionary can be moved to the
synonym clusters, as well as Van Dale definitions of
concepts in natural language. The latter are useful for
semi-automatic interactive applications¹³.
The current approach can be said to implement the
approach of possible bilingual lexical translation. This
approach should be developed in a number of ways. Apart
from the problem of translation to non-synonyms we
mentioned, it is desirable to include information in the
dictionary to guide the choice among possible
translations, in cases where there are several synonyms in the
target language. Stylistic, collocational and frequency
information can be of use for this purpose. This
information is partly available from existing sources (such as
CELEX and Van Dale), and large text corpora are also
obviously relevant sources of this information.
5 A model for conversion 
Conversion or exchange of lexical data presupposes a
detailed comparison of the various dictionaries, which in
turn requires a careful description of the various
dictionaries. Given the purpose of comparison, the
descriptions should be cast in a uniform, preferably high-level
data description language. Several such languages exist,
such as the Entity-Relationship model, a tool in database
design. We will use the TFS formalism introduced in
section 3 for this purpose.
A first step in this comparison is to convert the various
dictionaries to the uniform TFS format. In most NLP
formalisms lexical entries are records or feature structures,
so this syntactic transformation is generally
unproblematic. In passing, implicit semantic structure in the
various dictionaries (e.g. feature cooccurrence restrictions)
can be rendered explicit by constructing a type hierarchy
for these systems.
On the basis of these descriptions, constraints on the
relation between lexical entries in the different
dictionaries can be defined. These constraints can be called
semantic, as they relate the content of the various
dictionaries, and neutral as they merely pinpoint
correspondences between dictionaries; they define the way
dictionaries (which may be unrelated in other respects) are
similar.
¹² Also see [Calzolari, 1990] for a proposal similar to ours to
integrate the dictionary and the thesaurus.
¹³ For example, Rosetta incorporates an interactive reading
selection facility.
Constraints can be viewed as implicational and
biconditional constraints (as in [van der Eijk, 1992b]), and it
is possible to implement them as a complex TFS type.
This type serves both as documentation of the dictionary
and as conversion specification.
A conversion specification is a TFS type CONVERT
having features for each of the dictionaries (e.g. lexic,
eurotra and rosetta), and establishes the basic conversion
relation between entries in the LEXIC dictionary (as
derived from the sources and augmented by lexicographers)
and entries in the EUROTRA and ROSETTA dictionaries.
This conversion type is structured hierarchically as well:
the high-level type CONVERT has many subtypes
specifying how specific subtypes (and hence subsets of the
respective lexicons) of the various dictionaries are related.
Disjuncts in the constraints of these types enumerate
corresponding patterns described as feature structures.
An advantage is that these conversion constraints can 
be defined at the appropriate level of abstraction. It is in 
principle possible to establish relations holding for all en- 
tries as well as for an individual entry. As the conversion 
types are also ordered in an inheritance hierarchy, sub- 
types will inherit the constraints of their supertype(s). 
Note the inherent declarative character of the
conversion constraints: there is no notion of 'input' and
'output'. One advantage of this is that a single formalism
can be used for importation, generation as well as
integration of lexicons. A second advantage is that the
conversion constraints can also be used to test whether
two existing dictionaries are related as postulated in the
conversion constraints.
Full derivability of a particular dictionary can be
viewed as a special case of the general (in principle
relational) scheme, where the substructure of a feature like
rosetta is fully (and functionally) derivable from the
substructure of another (lexic). Informally, all primitive
distinctions in the target dictionary can be computed given
the information in the source dictionary, i.e. the
constraints define a homomorphism from the serving
dictionary to the client application.
It is an empirical issue whether this derivability
relation can actually be defined between two dictionaries.
For newly to be created "generic" lexicons, this deriv- 
ability is a design requirement. For the client dictionar- 
ies we have had to look at, creation of a generic source 
appeared to be a complex, but feasible, task. 
Operationally, conversion proceeds 
as query evaluation. Given an appropriate definition of 
the CONVERT type, the solutions to the following query 
will find all lexical entries whose canonical form is fiets in 
the LEXIC database and return all corresponding further 
instantiations of the ROSETTA type. 
These instantiations correspond to the ROSETTA 
descriptions for this lexical entry. 

CONVERT 
[ lexic   : ENTRY [ canonical_form : fiets ] 
  rosetta : ROSETTA ] 

6 Illustration 
We will illustrate conversion using the example in 
\[van der Eijk, 1992a\] relating two familiar linguistic 
theories, GPSG and a unification variant of Categorial 
Grammar, rather than the LEXIC fragment and 
ROSETTA, which we actually implemented. 
The categorial lexical entries have a feature subcat 
whose value is either a CATEGORY or a FUNCTION. The 
type FUNCTION has appropriate features argument (with 
two features direction and category) and result, where 
the result can be either a function again or a CATEGORY. 
Individual lexical entries are simply instances of this 
highly general recursive scheme. E.g. the subcat feature 
of a transitive verb (i.e. (NP\S)/NP) has type FUNCTION, 
with an NP argument to the right and, recursively, a 
FUNCTION from a subject NP to an S as result. 
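The recursive scheme can be made concrete with a small Python sketch. The class and field names below are our own rendering of the types CATEGORY and FUNCTION and of the features argument, direction, category and result; they are assumptions for illustration, and the argument category is restricted to an atomic CATEGORY for simplicity.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Category:
    name: str


@dataclass(frozen=True)
class Argument:
    direction: str          # "LEFT" or "RIGHT"
    category: Category


@dataclass(frozen=True)
class Function:
    argument: Argument
    result: Union["Function", Category]   # a function again, or a category


NP, S = Category("NP"), Category("S")

# Transitive verb: an NP argument to the right and, recursively,
# a function from a subject NP (to the left) to an S as result.
transitive = Function(Argument("RIGHT", NP),
                      Function(Argument("LEFT", NP), S))


def show(cat):
    # Slash notation: X/Y seeks Y to the right, Y\X seeks Y to the left.
    if isinstance(cat, Category):
        return cat.name
    res = show(cat.result)
    if isinstance(cat.result, Function):
        res = f"({res})"
    arg = cat.argument.category.name
    if cat.argument.direction == "RIGHT":
        return f"{res}/{arg}"
    return f"{arg}\\{res}"
```

Calling `show(transitive)` renders the entry back into the slash notation used above for the transitive verb.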
In GPSG individual lexical entries also have a feature 
subcat, but its value, an integer, is used to select the 
corresponding context-free grammar rule for this 
complementation pattern. 
One of the disjuncts of the constraints for the CONVERT 
type will then be the following. Unifying specific 
categorial entries into the cg substructure will cause the 
corresponding gpsg feature to become instantiated. 

CONVERT 
[ cg   : subcat : FUNCTION 
                  [ arg : [ dir : RIGHT 
                            cat : NP ] 
                    res : [ arg : [ dir : LEFT 
                                    cat : NP ] 
                            res : S ] ] 
  gpsg : [ n      : - 
           v      : + 
           bar    : 0 
           subcat : 2 ] ] 

Due to the declarative character of TFS constraint 
evaluation, the above constraint will yield the same 
result whether the cg, the gpsg or both features are 
instantiated. 
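This direction-independence can be imitated with a few lines of Python. The sketch is a strong simplification and not the TFS evaluator: feature structures are plain nested dictionaries, type labels and reentrancies are left out, and the function `unify` is our own hypothetical helper.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None when the two are incompatible.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feat, val in b.items():
            if feat in merged:
                sub = unify(merged[feat], val)
                if sub is None:
                    return None
                merged[feat] = sub
            else:
                merged[feat] = val
        return merged
    return a if a == b else None


# The disjunct of CONVERT for transitive verbs, transcribed by hand.
TRANSITIVE = {
    "cg": {"subcat": {
        "arg": {"dir": "RIGHT", "cat": "NP"},
        "res": {"arg": {"dir": "LEFT", "cat": "NP"}, "res": "S"},
    }},
    "gpsg": {"n": "-", "v": "+", "bar": 0, "subcat": 2},
}

# Only the cg side is instantiated: the gpsg side is filled in.
from_cg = unify({"cg": TRANSITIVE["cg"]}, TRANSITIVE)

# Only (part of) the gpsg side is instantiated: the cg side is filled in.
from_gpsg = unify({"gpsg": {"subcat": 2}}, TRANSITIVE)

# Both sides instantiated, but inconsistently: unification fails.
clash = unify({"cg": TRANSITIVE["cg"], "gpsg": {"subcat": 3}}, TRANSITIVE)
```

Both `from_cg` and `from_gpsg` come out as the same fully instantiated structure, while `clash` is `None`, which corresponds to using the constraint as a test on two existing descriptions.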
Evidently, the example is very simplistic. The 
prototype conversion module we developed in our project 
to translate LEXIC feature structures to ROSETTA 
feature structures contained over 500 disjuncts, and this 
module only covered conversion of a subset of the verbs. 
(The figure of 500 results from expansion to disjunctive 
normal form. The actual notation for conversion rules allows for 
embedded disjunctions and is, hence, much more concise.) 
The number is this high because conversion rules 
tend to become very idiosyncratic once the underlying 
theories of two dictionaries diverge. 
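The effect of expansion to disjunctive normal form on the number of disjuncts can be shown with a toy rule. The features and values below are invented for the illustration and are not taken from the LEXIC or ROSETTA descriptions; the function `expand` is a hypothetical helper.

```python
from itertools import product


def expand(rule):
    """Expand a rule with embedded alternatives into flat disjuncts.

    `rule` maps each feature to a list of alternative values.
    """
    features = list(rule)
    return [dict(zip(features, combo))
            for combo in product(*(rule[f] for f in features))]


# Hypothetical compact rule: three embedded disjunctions.
compact = {
    "tense": ["past", "present"],
    "voice": ["active", "passive"],
    "frame": ["intrans", "trans", "ditrans"],
}

flat = expand(compact)
alternatives_written = sum(len(values) for values in compact.values())
```

The compact rule lists seven alternatives, while its expansion contains the product of the three choices, i.e. twelve flat disjuncts. Growth of this multiplicative kind explains why a concise rule notation can still expand to several hundred disjuncts.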
7 Conclusion 
We discussed how a multilingual lexical database can be 
constructed using a number of existing lexical resources 
and lexicography. The TFS formalism is very appropriate 
for the design and implementation of NLP lexicons. 
We showed that its hierarchical structure can be used 
profitably in a data entry tool which allows the 
lexicographer to manipulate feature structures graphically. 
Lexical acquisition from existing lexical resources can be 
combined seamlessly with lexicographic work. 
The lexicon architecture we designed is an important 
improvement over earlier approaches: various abstrac- 
tion levels and the mappings between them are defined 
more precisely, and the modularity is increased significantly 
by the separation of the knowledge base from 
language-specific dictionaries. 
With respect to the issue of reusability, we outlined a 
framework for the specification of comparative descrip- 
tion of linguistic encoding schemes. This specification 
can be used operationally as translation rules to convert 
lexical data. 

References 

\[Beckwirth et al., 1989\] Richard Beckwirth, Christiane 
Fellbaum, Derek Gross, and George Miller. Wordnet: 
A lexical database organized on psycholinguistic 
principles. Paper presented at the First Lexical 
Acquisition Workshop, IJCAI89, 1989. 

\[Bloksma et al., 1990\] Laura Bloksma, Anne van Bolhuis, 
Pim van der Eijk, Pius ten Hacken, Joy Herklots, 
Dirk Heylen, Hans Pijnenburg, Frank Sesiuk, Anne-Marie 
Teeuw, Louis des Tombe, and Ton van der 
Wouden. Ndict: Final report. Technical report, 
Eurotra-NL, University of Utrecht, 1990. 

\[Boguraev and Briscoe, 1989\] Bran Boguraev and Ted 
Briscoe, editors. Computational Lexicography for Natural 
Language Processing, London and New York, 
1989. Longman. 

\[Byrd et al., 1987\] Roy Byrd, Nicoletta Calzolari, Martin 
Chodorow, Judith Klavans, Mary Neff, and Omneya 
Rizk. Tools and methods for computational lexicology. 
Computational Linguistics, 13(3-4), 1987. 

\[Calzolari, 1990\] Nicoletta Calzolari. The dictionary and 
the thesaurus can be combined. In Relational Models 
of the Lexicon. Martha Evens, 1990. 

\[Carpenter, 1990\] Bob Carpenter. The logic of typed 
feature structures. Draft, 1990. 

\[Cruse, 1986\] D.A. Cruse. Lexical Semantics. Cam- 
bridge University Press, 1986. 

\[Emele and Zajac, 1990\] Martin Emele and Rémi Zajac. 
Typed unification grammars. In Proceedings of the 
13th International Conference on Computational 
Linguistics (COLING), 1990. 

\[Fokker, 1992\] Jeroen Fokker. Lemming user manual. 
Technical Report INF/DOCL92-04, Department of 
Computer Science, State University of Utrecht, 1992. 

\[Franz, 1990\] Alex Franz. A parser for HPSG. Techni- 
cal report, Laboratory for Computational Linguistics, 
Carnegie Mellon University, 1990. No. CMU-LCL-90-3. 

\[Longman, 1987\] Longman. Longman Dictionary of 
Contemporary English. Longman House, Burnt Mill, 
Harlow, Essex, England, 1987. Second Edition. 

\[Lyons, 1977\] John Lyons. Semantics. Cambridge Uni- 
versity Press, 1977. 

\[McNaught, 1988\] John McNaught. Computational 
lexicography and computational linguistics. 
Lexicographica, (4), 1988. 

\[Picht and Draskau, 1985\] Heribert Picht and Jennifer 
Draskau. Terminology: An Introduction. University 
of Surrey, 1985. 

\[ten Hacken, 1990\] Pius ten Hacken. Reading distinction 
in MT. In Proceedings of the 13th International 
Conference on Computational Linguistics (COLING), 
1990. 

\[van der Eijk et al., 1991\] Pim van der Eijk, Laura 
Bloksma, Anne van Bolhuis, Joy Herklots, Lily van 
Munster, Jeroen Fokker, Mark van der Kraan, and 
Angelique Geilen. Final report of the Lexic Project 
Phase 1. Technical report, Foundation for Language 
Technology, 1991. 

\[van der Eijk, 1992a\] Pim van der Eijk. Multilingual 
lexicon architecture. Working Papers in Natural 
Language Processing, Katholieke Universiteit Leuven, 
Stichting Taaltechnologie Utrecht, 1992. forthcoming. 

\[van der Eijk, 1992b\] Pim van der Eijk. Neutral 
dictionaries. In Cheng-Ming Guo, editor, Machine Tractable 
Dictionaries: Design and Construction, chapter 6. 
Ablex, 1992. forthcoming. 

\[van Sterkenburg et al., 1982\] Piet van 
Sterkenburg, Willy Martin, and Bernard Al. A new 
Van Dale project: Bilingual dictionaries on one and 
the same monolingual basis. In J. Goetschalckx and 
L. Rolling, editors, Lexicography in the electronic age, 
pages 221-237. North-Holland, Amsterdam, 1982. 

\[Zajac, 1990\] Rémi Zajac. A relational approach to 
translation. In Proc. 3rd Int. Conf. on Theoretical 
and Methodological Issues in Machine Translation of 
Natural Language, 1990. 

\[Zernik and Jacobs, 1990\] Uri Zernik and Paul Jacobs. 
Tagging for learning: Collecting thematic relations 
from corpus. In Proceedings of the 13th International 
Conference on Computational Linguistics (COLING), 
Helsinki, 1990. 

\[Zgusta, 1971\] Ladislav Zgusta. Manual of Lexicography. 
Mouton, 1971. 
