Language Resources for a Network-based Dictionary
Veit Reuer
Institute of Cognitive Science
University of Osnabr¨uck
Germany
vreuer@uos.de
Abstract
In order to facilitate the use of a dictionary for lan-
guage production and language learning we propose
the construction of a new network-based electronic
dictionary along the lines of Zock (2002). How-
ever, contrary to Zock who proposes just a paradig-
matic network with information about the various
ways in which words are similar we would like to
present several existing language resources (LRs)
which will be integrated in such a network result-
ing in more linguistic levels than one with paradig-
matically associated words. We argue that just as
the mental lexicon exhibits various, possibly inter-
woven layers of networks, electronic LRs contain-
ing syntagmatic, morphological and phonological
information need to be integrated into an associa-
tive electronic dictionary.
1 Introduction
Traditional dictionaries are mainly used for lan-
guage reception, even though they were also de-
veloped to be used for language production. How-
ever the form-based structure following ortho-
graphic conventions which could also be called
“one-dimensional”, makes it difficult to access the
information by meaning. Therefore the usage of
a traditional dictionary for text production is quite
limited as opposed to, for example, a thesaurus.
The main advantage of a thesaurus is the structuring
based on the semantic relation between words in an
entry. This allows for the availability of a different
type of information.
Therefore our proposal is to construct an elec-
tronic dictionary which has a network-like structure
and whose content is drawn from various existing
lexical resources. The dictionary will represent both
paradigmatic information - information about the
various ways in which words are similar - as well as
syntagmatic information - information about the re-
lationships among words that appear together. Ad-
ditionally information from other types of resources
such as morphology and phonology will be inte-
grated as they are also relevant in models of the
mental lexicon. In these models “associations” be-
tween words are based not only on meaning but also
on phonological or morphological properties of the
connected words. Following Brown and McNeill
(1966) and subsequent research people in the so-
called “tip-of-the-tongue”-state (TOT-state) are able
to clearly recall the properties of the missing word
such as the number of syllables or the meaning, and
can easily identify the target word when it is pre-
sented to them.
berry
hyper
TTTTT TT
herb
hyper III
birthmark
hyper
%%
chocolate
wwww
wwww shortcake
gggggggggggg
garden strwb. hypobb
strawberry cream
wild strwb.
hyposss
jam
RRRRR
RR
RRRRR
RR fruit
ZZZZZZZZZZZZZZZZZZZZZZZZZZ
beach strwb.
hypo¶
¶¶¶
¶¶¶
Virginia Strwb.
hypo????
????
Figure 1: Example network with data from Word-
Net (–) and Deutscher Wortschatz (=)
Figure 1 exemplifies the visualization of a single
node with related information from two LRs1. Here
a user would be able to find the term “shortcake”
even if s/he only knows only one part, namely straw-
berries.2 A click on a neighbouring node should e.g.
re-center the structure and hence allow the user to
“explore” the network.
As mentioned above the most obvious usage
seems to be in language production where infor-
mation can be provided not only for words already
activated in the mind of the language producer but
also for alternatives, specifications or for words not
directly accessible because of a TOT-state. This
seems reasonable in light of the fact that speaker’s
passively vocabularies are known to be larger than
1http://www.cogsci.princeton.edu/˜wn
http://www.wortschatz.uni-leipzig.de
2LDOCE (1987) mentions “cream” and “jam” but not
“shortcake” as part of the entry for “strawberry”. The entry for
“shortcake” however lists specifically “strawberry shortcake”.
their active vocabularies. The range of information
available of course depends on the material inte-
grated into the dictionary from the various resources
which are explored more closely below.
A second area of application of such a dictio-
nary is language learning. Apart from specify-
ing paradigmatic information which is usually also
part of the definition of a lemma, syntagmatic in-
formation representing collocations and cooccur-
rances is an important resource for language learn-
ers. Knowledge about collocations is a kind of lin-
guistic knowledge which is language-specific and
not systematically derivable making collocations es-
pecially difficult to learn.
Even though there are some studies that compare
the results from statistically computed association
measures with word association norms from psy-
cholinguistic experiments (Landauer et al., 1998;
Rapp, 2002) there has not been any research on
the usage of a digital, network-based dictionary re-
flecting the organisation of the mental lexicon to
our knowledge. Apart from studies using so called
Mind Maps or Concept Maps to visualize “world
knowledge”3 (Novak, 1998) nothing is known about
the psycholinguistic aspects which need to be con-
sidered for the construction of a network-based dic-
tionary.
In the following section we will summarize the
information made available by the various LRs we
plan to integrate into our system. The ideas pre-
sented here were developed in preparation of a
project at the University of Osnabr¨uck.
2 Language Resources
Zock (2002) proposes the use of only one type of in-
formation structure in his network, namely a type of
semantic information. There are, however, a num-
ber of other types of information structures that may
also be relevant for a user. Psychological experi-
ments show that almost all levels of linguistic de-
scription reveal priming effects. Strong mental be-
tween words are based not only on a semantic rela-
tionship but also on morphological an phonological
relationships. These types of relationships should
also be included in a network based dictionary as
well.
A number of LRs that are suitable in this scenario
already provide some sort of network-like structure
possibly closely related to networks meaningful to
a human user. All areas are large research fields of
their own and we will therefore only touch upon a
few aspects.
3The maps constitute a representation of the world rather
than reflecting the mental lexicon.
2.1 Manually Constructed Networks
Manually constructed networks usually consist of
paradigmatic information since words of the same
part-of-speech are related to each other. In ontolo-
gies usually only nouns are considered and are inte-
grated into these in order to structure the knowledge
to be covered.
The main advantage of such networks, since they
are hand-built, is the presumable correctness (if not
completeness) of the content. Additionally, these
semantic nets usually include typed relations be-
tween nodes, such as e.g. “hyperonymy” and “is a”
and therefore provides additional information for a
user. It is safe to rely on the structure of a network
coded by humans to a certain extend even if it has
certain disadvantages, too. For example networks
tend to be selective on the amount of data included,
i.e. sometimes only one restricted area of know-
ledge is covered. Furthermore they include basi-
cally only paradigmatic information with some ex-
ceptions. This however is only part of the greater
structure of lexical networks.
The most famous example is WordNet (Fellbaum,
1998) for English – which has been visualized al-
ready at http://www.visualthesaurus.com – and its
various sisters for other languages. It reflects a cer-
tain cognitive claim and was designed to be used
in computational tasks such as word sense disam-
biguation. Furthermore ontologies may be used as a
resource, because in ontologies usually single words
or NPs are used to label the nodes in the network.
An example is the “Universal Decimal Classifica-
tion”4 which was originally designed to classify all
printed and electronic publications in libraries with
the help of some 60,000 classes. However one can
also think of it as a knowledge representation sys-
tem as the information is coded in order to reflect
the knowledge about the topics covered.
2.2 Automatically Generated Paradigmatic
Networks
A common approach to the automatic generation of
semantic networks is to use some form of the so
called vector-space-model in order to map semanti-
cally similar words closely together in vector space
if they occur in similar contexts in a corpus (Man-
ning and Sch¨utze, 1999). One example, Latent Se-
mantic Analysis (Landauer et al., 1998, LSA) has
been accepted as a model of the mental lexicon and
is even used by psycholinguists as a basis for the
categorization and evaluation of test-items. The re-
sults from this line of research seem to describe not
only relations between words but seem to provide
4http://www.udcc.org
the basis for a network which could be integrated
into a network-based dictionary. A disadvantage of
LSA is the positioning of polysemous words at a
position between the two extremes, i.e. between the
two senses which makes the approach worthless for
polysemous words in the data.
There are several other approaches such as Ji
and Ploux (2003) and the already mentioned Rapp
(2002). Ji and Ploux also develop a statistics-based
method in order to determine so called “contex-
onyms”. This method allows one to determine dif-
ferent senses of a word as it connects to different
clusters for the various senses, which can be seen
as automatically derived SynSets as they are known
from WordNet. Furthermore her group developed a
visualization tool, that presents the results in a way
unseen before. Even though they claim to have de-
veloped an “organisation model” of the mental lex-
icon only the restricted class of paradigmatic rela-
tions shows up in their calculations.
Common to almost all the automatically derived
semantic networks is the problem of the unknown
relation between items as opposed to manually con-
structed networks. On the one hand a typed rela-
tion provides additional information for a user about
two connected nodes but on the other hand it seems
questionable if a known relation would really help
to actually infer the meaning of a connected node
(contrary to Zock (2002)).
2.3 Automatically Generated Syntagmatic
Networks
Substantial parts of the mental lexicon probably
also consist of syntagmatic relations between words
which are even more important for the interpre-
tation of collocations.5 The automatic extraction
of collocations, i.e. syntagmatic relations between
words, from large corpora has been an area of inter-
est in recent years as it provides a basis for the au-
tomatic enrichment of electronic lexicons and also
dictionaries. Usually attempts have been made at
extracting verb-noun-, verb-PP- or adjective-noun-
combinations. Noteworthy are the works by Krenn
and Evert (2001) who have tried to compare the
different lexical association measures used for the
extraction of collocations. Even though most ap-
proaches are purely statistics-based and use little
linguistic information, there are a few cases where
a parser was applied in order to enhance the recog-
nition of collocations with the relevant words not
5We define collocations as a syntactically more or less fixed
combination of words where the meaning of one word is usu-
ally altered so that a compositional construction of the meaning
is prevented.
being next to each other (Seretan et al., 2003).
The data available from the collocation extrac-
tion research of course cannot be put together to
give a complete and comprehensive network. How-
ever certain examples such as the German project
“Deutscher Wortschatz”6 and the visualization tech-
nique used there suggest a network like structure
also in this area useful for example in the language
learning scenario as mentioned above.
2.4 Phonological/Morphological Networks
Electronic lexica and rule systems for the phonolog-
ical representation of words can be used for spell-
checking as has been done e.g. in the Soundex ap-
proach (Mitton, 1996). In this approach a word not
contained in the lexicon is mapped onto a simpli-
fied and reduced phonological representation and
compared with the representations of words in the
lexicon. The correct words coming close to the
misspelled word on the basis of the comparison
are then chosen as possible correction candidates.
However this approach makes some drastic assump-
tions about the phonology of a language in order to
keep the system simple. With a more elaborate set
of rules describing the phonology of a language a
more complex analysis is possible which even al-
lows the determination of words that rhyme.7 Set-
ting a suitable threshold for some measure of simi-
larity a network should evolve with phonologically
similar words being connected with each other. A
related approach to spelling correction is the use of
so called “tries” for the efficient storage of lexical
data (Oflazer, 1996). The calculation of a minimal
editing distance between an unknown word and a
word in a trie determines a possible correct candi-
date.
Contrary to Zock (2002) who suggests this as an
analysis step on its own we think that the phonolo-
gical and morphological similarity can be exploited
to form yet another layer in a network-based dictio-
nary. Zock’s example of the looked-for “relegate”
may than be connected to “renegade” and “dele-
gate” via a single link and thus found easily. Here
again, probably only partial nets are created but they
may nevertheless help a user looking for a certain
word whose spelling s/he is not sure of.
Finally there are even more types of LRs contain-
ing network-like structures which may contribute
to a network-based dictionary. One example to be
mentioned here is the content of machine-readable
6Note however that the use of the term “Kollokation” in this
project is strictly based on statistics and has no relation to a
collocation in a linguistic sense (see figure 1).
7Dissertation project of Tobias Thelen: personal communi-
cation.
dictionaries. The words in definitions contained in
the dictionary entries – especially for nouns – are
usually on the one hand semantically connected to
the lemma and on the other hand are mostly entries
themselves which again may provide data for a net-
work. In research in computational linguistics the
relation between the lemma and the definition has
been utilized especially for word sense disambigua-
tion tasks and for the automatic enrichment of lan-
guage processing systems (Ide and V´eronis, 1994).
3 Open Issues and Conclusion
So far we have said nothing about two further im-
portant parts of such a dictionary: the representation
and the visualization of the data. There are a num-
ber of questions which still need to be answered in
order to build a comprehensive dictionary suitable
for an evaluation. With respect to the representation
two major questions seem to be the following.
• As statistical methods for the analysis of cor-
pora and for the extraction of frequent cooccur-
rance phenomena tend to use non-lemmatized
data, the question is whether it makes sense
to provide the user with the more specific data
based on inflected material.
• Secondly the question arises how to integrate
different senses of a word into the representa-
tion, if the data provides for this information
(as WordNet does).
With regard to visualization especially the dynamic
aspects of the presentation need to be considered.
There are various techniques that can be used to fo-
cus on parts of the network and suppress others in
order to make the network-based dictionary man-
ageable for a user which need to be evaluated in us-
ability studies. Among these are hyperbolic views
and so-called cone trees
As we have shown a number of LRs, espe-
cially those including syntagmatic, morphological
and phonological information, provide suitable data
to be included into a network-based dictionary. The
data in these LRs either correspond to the presumed
content of the mental lexicon or seem especially
suited for the intended usage. One major prop-
erty of the new type of dictionary proposed here
is the disintegration of the macro- and the micro-
structure of a traditional dictionary because parts
of the micro-structure (the definition of the entries)
become part of the macro-structure (primary links
to related nodes) of the new dictionary. Reflect-
ing the structure of the mental lexicon this dictio-
nary should allow new ways to access the lexical
data and support language production and language
learning.

References
R. Brown and D. McNeill. 1966. The ‘tip-of-the-
tongue’ phenomenon. Journal of Verbal Learn-
ing and Verbal Behavior, 5:325–337.
C. Fellbaum, editor. 1998. WordNet: an electronic
lexical database. MIT Press, Cambridge, MA.
N. Ide and J. V´eronis. 1994. Machine readable dic-
tionaries: What have we learned, where do we go.
In Proceedings of the post-COLING94 interna-
tional workshop on directions of lexical research,
Beijing.
H. Ji and S. Ploux. 2003. A mental lexicon organi-
zation model. In Proceedings of the Joint Inter-
national Conference on Cognitive Science, pages
240–245, Sydney.
B. Krenn and S. Evert. 2001. Can we do bet-
ter than frequency? A case study on extract-
ing PP-verb collocations. In Proceedings of the
ACL-Workshop on Collocations, pages 39–46,
Toulouse.
T. K. Landauer, P. W. Foltz, and D. Laham. 1998.
An introduction to latent semantic analysis. Dis-
course Processes, 25:259–284.
LDOCE. 1987. Longman Dictionary of Contempo-
rary English. Langenscheidt, Berlin, 2. edition.
C. D. Manning and H. Sch¨utze. 1999. Foundations
of Statistical Natural Language Processing. MIT
Press, Cambridge, MA.
R. Mitton. 1996. English Spelling and the Com-
puter. Longman, London.
J. D. Novak. 1998. Learning, Creating and Using
Knowledge: Concept Maps as Facilitative Tools
in Schools and Corporations. Erlbaum, Mahwah,
NJ.
K. Oflazer. 1996. Error-tolerant finite-state recog-
nition with applications to morphological analy-
sis and spelling correction. Computational Lin-
guistics, 22(1):73–89.
R. Rapp. 2002. The computation of word associa-
tions: Comparing syntagmatic and paradigmatic
approaches. In Proc. 19th Int. Conference on
Computational Linguistics (COLING), Taipeh.
V. Seretan, L. Nerima, and E. Wehrli. 2003. Extrac-
tion of multi-word collocations using syntactic
bigram composition. In Proceedings of the Inter-
national Conference on Recent Advances in NLP
(RANLP-2003), Borovets, Bulgaria.
M. Zock. 2002. Sorry, but what was your name
again, or, how to overcome the tip-of-the tongue
problem with the help of a computer? In Pro-
ceedings of the COLING-Workshop on Building
and Using Semantic Networks, Taipeh.
